-
Notifications
You must be signed in to change notification settings - Fork 885
Conversation
BTW, this tries to enable a "sane" seccomp whitelist by default. It looks like most of the testsuite is ok with that, except for some buggy interactions with non-root app (which so far I failed to trace down, thus the WIP). |
seccompIsolators++ | ||
// By appc spec, only one seccomp isolator per app is allowed | ||
if seccompIsolators > 1 { | ||
return nil, fmt.Errorf("too many seccomp isolators specified (%d)", seccompIsolators) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
%d will always be 2 because we return from the function before incrementing to 3 or more.
Probably reached the end of the tunnel: since we drop |
637dbd9
to
839d356
Compare
"arch_prctl", | ||
"modify_ldt", | ||
"chroot", | ||
"clone", // this is args-filtered by docker |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what does args-filtered mean?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seccomp filters (in kernel) allows to match on syscall arguments too, but seccomp filters (in systemd) do not. The note is there as we are dropping the args-filtering part originally present in docker.
839d356
to
b7633d2
Compare
Still blocked on appc, but the non-root case is fine now and whitelisting by default doesn't trip anything in the testsuite. |
Would this apply the seccomp whitelist on a per-container basis? If so perhaps functionality to load a new default seccomp configuration at runtime bootup, instead of having to load the same one per container would be more useful. |
@grantseltzer Yes, each container can have its own seccomp whitelist. Are you concerned about the time taken to setup the seccomp, or the memory used by the seccomp program? I assume that both should be very small. Also, rkt does not have a daemon where the containers are started from, so there is noruntime bootup. |
@alban I misspoke when I said runtime, I certainly am more familiar with docker's architecture than rkt's. My concern was more about convenience. If someone wanted to start 15 containers, all with the same seccomp whitelist, it would be easier to set a default in a central location instead of loading the whitelist each time you start a container. I still need to familiarize myself more with Rkt's architecture but could this be accomplished at the stage 1 level? |
@grantseltzer with this PR, the whitelist cannot be given on the command line but is specified in the image manifest (in the image tarball). Users don't have to type the whitelist 15 times when they run 15 containers. So I think it is fine for convenience. |
@alban Ah, gotcha. Thanks for clearing that up. I'll do more research before I start making suggestions next time :) |
I landed the appc/spec change; what's next with this one? |
@jonboulle a tagged bump on appc/spec, to be serialized (perhaps after) with conflicting #2735. I'm reworking this with appc changes as we speak. |
@@ -273,6 +273,36 @@ func generatePodManifest(cfg PrepareConfig, dir string) ([]byte, error) { | |||
ra.App.Isolators = append(ra.App.Isolators, *isolator) | |||
} | |||
|
|||
// Override seccomp isolators via --seccomp CLI switch | |||
if seccomp := app.SeccompFilter; seccomp != "" { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not app.SeccompFilter != ""
?
I was doing some tests and found that trying to run Running rkt with
@lucab guessed that libseccomp is too old in coreos-stage1 and it seems to be the cause, it's 2.1.1 which was released in October 2013 (!). |
libseccomp 2.1.1 doesn't have getrandom support, thus this failure. The situation with systemd output is even more unfortunate though, as systemd doesn't distinguish between invalid and pseudo syscalls (both returning negative values): https://github.com/systemd/systemd/blob/master/src/core/load-fragment.c#L2432 |
EDIT: already there in latest alpha |
A fix for systemd seccomp behavior also landed in v231: systemd/systemd#3701 |
Any chances of getting this for v1.12.0 or do we want to wait for v231? |
1825365
to
7619f9f
Compare
@iaguis rebased to address your review, PTAL. |
This commit introduces support for appc seccomp isolators. Two isolatore are currently specified and implemented: a "retain-set" mode (whitelist) and a "remove-set" one (blacklist). If no seccomp isolator is specified, rkt will apply its own default whitelist.
7619f9f
to
b641f38
Compare
I wonder if we should not pass the (pseudo)syscalls not recognized by the libseccomp we use in the coreos flavor so users don't get warning messages with
|
One question, LGTM otherwise. Do we wanna include the docs here or in a new PR? |
The debug noise should go away with the systemd patch above (ie. once we get v231), so I'd prefer not to touch the default set. Regarding doc, I'll queue up a separate PR. |
Fair enough |
This commit introduces support for appc seccomp isolators.
Two isolators are currently specified and implemented, a "retain-set"
mode (whitelist) and a "remove-set" one (blacklist).
If no seccomp isolators are specified, rkt will apply its own default
whitelist.