
cgroups: add pids controller support #58

Merged: 4 commits merged into opencontainers:master from cyphar:18-add-pids-controller on Dec 19, 2015
Conversation

@cyphar (Member) commented Jun 27, 2015

Add support for the pids cgroup controller, a recent feature that is
(see: will be) available in Linux 4.3.

Closes #382

Signed-off-by: Aleksa Sarai cyphar@cyphar.com

@cyphar (Member Author) commented Jun 28, 2015

@@ -49,6 +49,11 @@ type MemoryStats struct {
	Stats map[string]uint64 `json:"stats,omitempty"`
}

type PidsStats struct {
	// current counter usage
	Current uint64 `json:"current,omitempty"`
Contributor:
NumPids? We should also document that it is the number of pids in the cgroup

@cyphar (Member Author):

I prefer that it match the name of the sysfs file. But it doesn't really matter, I guess.
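For reference, a documented version of this field, keeping the cgroupfs file name as preferred above, might look roughly like the sketch below (not necessarily the final patch):

```go
// PidsStats holds statistics reported by the pids cgroup controller.
type PidsStats struct {
	// Current is the number of pids in the cgroup (and its descendants),
	// mirroring the pids.current file exposed by the kernel.
	Current uint64 `json:"current,omitempty"`
}
```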

@vmarmol (Contributor) commented Jun 29, 2015

I think we also need to make some changes for the systemd portion.

@cyphar (Member Author) commented Jun 30, 2015

@vmarmol Yes, quite. I'll update it in a few hours.

@cyphar (Member Author) commented Jun 30, 2015

Fixed up the issues (it also wouldn't build properly because I didn't fix the imports from the old ones).

/cc @crosbymichael @vmarmol @mrunalp @LK4D4

@lizf-os (Contributor) commented Jul 1, 2015

The change to the systemd part doesn't look sufficient. You also need to add some changes to Apply().

@mrunalp (Contributor) commented Jul 1, 2015

@cyphar You will have to modify the Apply function in apply_systemd.go, as @lizf-os pointed out. I suspect that systemd doesn't yet support this, so you might want to take the approach that the joinCpu function takes in that file.
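For illustration, a joinPids helper modelled on that approach might look roughly like the sketch below. It assumes the file's existing imports plus strconv, that getSubsystemPath behaves like the other helpers in apply_systemd.go (resolving the per-container cgroupfs path for a subsystem), and that the config field is named PidsLimit; none of these are confirmed by this thread.

```go
// Sketch only: join the pids cgroup via cgroupfs directly, since systemd has
// no pids-controller property it can set for us at this point.
func joinPids(c *configs.Cgroup, pid int) error {
	path, err := getSubsystemPath(c, "pids") // assumed helper, as used for other subsystems
	if err != nil {
		return err
	}
	if err := os.MkdirAll(path, 0755); err != nil {
		return err
	}
	if c.PidsLimit > 0 {
		limit := strconv.FormatInt(c.PidsLimit, 10)
		if err := ioutil.WriteFile(filepath.Join(path, "pids.max"), []byte(limit), 0700); err != nil {
			return err
		}
	}
	// Finally, move the process itself into the cgroup.
	return ioutil.WriteFile(filepath.Join(path, "cgroup.procs"), []byte(strconv.Itoa(pid)), 0700)
}
```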

@cyphar (Member Author) commented Jul 8, 2015

We also need to update the spec if we want support in config.json. I've proposed opencontainers/runtime-spec#64 to deal with that. Ignore the build errors until it's merged.

@cyphar (Member Author) commented Sep 2, 2015

The PIDs changes have been merged into Linus' tree as of torvalds/linux@8bdc69b. I'll update this patch in a bit (if it needs updating).

@cyphar (Member Author) commented Sep 21, 2015

Since #242 has been merged, it's probably time to get around to reviewing this.

/ping @crosbymichael @vmarmol @mrunalp @LK4D4

@mrunalp (Contributor) commented Sep 21, 2015

I will take a look tomorrow


@crosbymichael (Member):
code, LGTM

@mrunalp (Contributor) commented Sep 24, 2015

I am just testing this and noticed that we need to set pids.max to 4 for runc to be able to spawn the container.

@mrunalp (Contributor) commented Sep 24, 2015

@cyphar Can we delay applying the setting so that we can set it to lower values like 1 or 2 for the exec'd process?

@LK4D4 (Contributor) commented Oct 12, 2015

ping @cyphar
need rebase

@@ -57,6 +57,11 @@ type Cgroup struct {
	// MEM to use
	CpusetMems string `json:"cpuset_mems"`

	// Process limit; set to `0' to disable limit.
	// While technically `0' is a valid PID limit, it does not make sense in the
	// context of a container -- it is identical to a limit of `1'.

Reviewer:

But the spec says:

// Maximum number of PIDs. A value < 0 implies "no limit".
Limit int64 `json:"limit"`

Please consider basing this on PR #369.

@cyphar (Member Author):

Gah, that should read <= 0. Whoops. I'll send a PR to fix the comment when I get a chance. The problem with 0 being a valid limit is that it makes the spec non-backwards compatible (and you have to specify it in config.json) ... unless the change in #369 changes the default?

-- not to mention the fact that a limit of 0 in a pids cgroup doesn't mean anything different to a limit of 1. Since attaches aren't blocked by cgroup core, you need to have at least one process in the cgroup in order for the limits to affect anything.

@yangdongsheng:

@cyphar I think a limit of 0 means: don't allow the process in the container to fork() or clone(). That's really a reasonable scenario, I think. Yes, there is at least 1 process in a container, which means that pids.current is always >= 1. But that does not prevent us from setting pids.max to 0 when I want to prevent fork() and clone() in the container.

@cyphar (Member Author):

@yangdongsheng But ... it doesn't actually have a practical meaning. Having a limit of 0 is precisely identical to having a limit of 1. Both prevent fork() and clone() in a container with a single process in it. There's no special code path that the pids controller goes through if the limit is specifically 0 (ref: I wrote it). I do understand that it might seem nice to have a limit of zero to specify "this container should never have any processes in it", but I'm not sure if the resource limiting part of the spec is the right way to go (I'd rather have container types as first-class citizens).

Contributor:

I don't think making 0 a valid limit is necessary, since a cgroup without any processes is meaningless, and we agree that there would always be at least 1 process in a container, so setting pids.max=1 would be enough to prevent any fork() or clone() in the container.

And taking the default value of the type (such as 0 for int, null for a mapping, etc.) as the invalid value is what we always do, except for some special types such as bool (like cgroup.OomKillDisable). So I prefer to take 0 as the invalid value and change the spec instead. WDYT?

@yangdongsheng:

@cyphar So the reasoning is essentially to follow the existing convention in runc, right? But opencontainers/runtime-spec#233 is trying to set the default value to -1.

And I think the pids cgroup is a supporting case for opencontainers/runtime-spec#233.

@cyphar (Member Author):

And I think the pids cgroup is a supporting case for opencontainers/runtime-spec#233.

@yangdongsheng I don't understand why you would think that. You can't set a value of -1 to mean max in the PIDs controller ...?

I also don't agree with opencontainers/runtime-spec#233 (I just commented with my reservations). I'd be okay with it if we change all of the limits to be pointers (so you can set null to mean default, or omit the option to also mean default) and then get consumers to deal with the default values explicitly (by checking for null).

@yangdongsheng:

@cyphar Okay, on second thought, I think my idea of "making Limit = 0 valid" would also not reduce confusion for users. Let's wait for the conclusion on the mailing list about cgroup removal and opencontainers/runtime-spec#233. For now, I would say Limit <= 0 meaning max is good enough. :)

But I would like to read the cgroup discussion to see why attaching to a pids cgroup is allowed to exceed pids.max, if I get some time.

@cyphar (Member Author):

@yangdongsheng ... pids.max isn't broken. In what way do you think it's broken? It's discussed here: http://thread.gmane.org/gmane.linux.kernel.cgroups/13292.

@yangdongsheng:

@cyphar Yes, I see that's intentional. But I just want to read more of the discussion about it to see what TJ said. Thanks for the reference, I will open and read it later. :)

@LK4D4 (Contributor) commented Nov 16, 2015

@cyphar So, 0 just does nothing (inherit from the parent cgroup?), negative writes "max", and positive writes the number? Sounds reasonable to me. Ping @mrunalp @crosbymichael

@cyphar (Member Author) commented Nov 17, 2015

@LK4D4 I'd prefer that we wait until opencontainers/runtime-spec#233 is decided on. The current state is that 0 means "do nothing" (which effectively means max, since that's the default) and negatives mean an explicit max. But I prefer the solution discussed in opencontainers/runtime-spec#233 (use null for the system default; all other values are written).
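To make those semantics concrete, the fs driver's Set step for the pids subsystem might look roughly like the sketch below (the PidsGroup/PidsLimit names and the use of ioutil.WriteFile instead of the package's own write helper are assumptions, not confirmed by this thread):

```go
// Sketch of Set for the pids subsystem:
//   PidsLimit == 0 -> leave pids.max untouched (kernel default is "max")
//   PidsLimit  < 0 -> write "max" explicitly
//   PidsLimit  > 0 -> write the numeric limit
func (s *PidsGroup) Set(path string, cgroup *configs.Cgroup) error {
	if cgroup.PidsLimit == 0 {
		return nil
	}
	limit := "max"
	if cgroup.PidsLimit > 0 {
		limit = strconv.FormatInt(cgroup.PidsLimit, 10)
	}
	return ioutil.WriteFile(filepath.Join(path, "pids.max"), []byte(limit), 0700)
}
```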

@cyphar (Member Author) commented Dec 8, 2015

@mrunalp Sorry for the late response. Do you know where runc is creating new processes that we shouldn't limit? Also, I'd really like opencontainers/runtime-spec#233 to be fixed up and merged so we can use the semantics <= 0 means max, > 0 means "use this limit" and null means don't touch the cgroup.

@dqminh (Contributor) commented Dec 8, 2015

Do you know where runc is creating new processes that we shouldn't limit?

@cyphar I think it's here: https://github.com/opencontainers/runc/blob/master/libcontainer/process_linux.go#L196-L200

So basically we run a bootstrap process, put the bootstrap process into the correct cgroups, and then start the actual process. So it's likely that limit=1 will not work?

@cyphar (Member Author) commented Dec 8, 2015

@dqminh Oh, I see. I can see two solutions to this problem:

  • We tell people not to use PidsLimit = 1 because it doesn't work (bad).
  • We increment the PidsLimit by 1 (or whatever the correct number is) such that the number refers to the number of processes inside the container. This is better, but it does mean that you'll have an off-by-one error in the number of pids in the container in the long-term. Unfortunately, we can't execve the bootstrapped process because it needs to fork to apply the namespace configs.

I've implemented the second, but I'm not sure if it's the best solution. Full disclosure: I'm not currently on a system where I can test that it works (I'm currently compiling the kernel).
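Purely as an illustration of the second option (not necessarily the code in this branch), the adjustment amounts to something like:

```go
// bootstrapProcesses is an illustrative constant: the number of extra
// short-lived processes runc needs while setting up the container.
const bootstrapProcesses = 1

// effectivePidsLimit bumps a positive limit so that the configured value keeps
// meaning "processes inside the container", accepting a small off-by-one window
// while the bootstrap process is still alive.
func effectivePidsLimit(configured int64) int64 {
	if configured <= 0 {
		return configured // unlimited ("max"); nothing to adjust
	}
	return configured + bootstrapProcesses
}
```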

@dqminh (Contributor) commented Dec 8, 2015

@cyphar yep, both options are bad, unfortunately :(

Transparently incrementing the pids limit by 1 is actually worse, IMO, because the extra slot is only used for bootstrap and it doesn't correspond to what the user asked for. Temporarily increasing and then decreasing the value may not work either, because then we have a race condition.

Not allowing PidsLimit=1 seems just a bit more reasonable, because I don't think it's possible to really allow it given how the container is bootstrapped.

WDYT ? @crosbymichael @LK4D4 @mrunalp @vishh @avagin

@cyphar (Member Author) commented Dec 8, 2015

@dqminh I'd like to point out that an off-by-one isn't that bad, because in the context of PID resource exhaustion one PID is not going to make a significant difference. But I do agree that it is not the best solution (and that incrementing then decrementing will have a race). We can't get the bootstrapping process to change the limit, and we don't have control over the final process running in the container. Is there any "callback" when the bootstrapping is complete?

@mrunalp (Contributor) commented Dec 8, 2015

I am okay with having the setting be some minimum other than 1.

for _, sys := range subsystems {
	// We can't set this here, because after being applied, memcg doesn't
	// allow a non-empty cgroup from having its limits changed.
	if sys.Name() == "memory" {
@crosbymichael (Member):

Are we sure this is true? Did it change in a kernel version?

@cyphar (Member Author):

@crosbymichael No, this hasn't changed (see 39279b1); it's a specific issue with kernel memory limiting, so it only applies to the kmemcg limits. Since the kmemcg limits are integrated into the memory cgroup, I couldn't think of a "nice" way of only setting half of the options.

@crosbymichael (Member):

I don't think we can make this change now or in this PR, because updating memory limits (not kernel limits) is a big use case and this just stops it. Maybe we can keep the changes small, only do the pid change in this PR, and update the other things later?

Contributor:

+1


@cyphar (Member Author):

@crosbymichael

This is how this code has always worked. The only change I've made is separating .Set() and .Apply(). The previous code explicitly did it this way (I actually haven't modified the Apply() method for MemoryGroup). In other words, I'm not making any change to how the MemoryGroup works.

Memory limits are still being set, they're just being set in MemoryGroup.Apply() rather than MemoryGroup.Set().

And we can't "only do the pid change in this PR", because we need to Set() the cgroup values as late as possible in order to support all legal values for the PIDs cgroup. Sure, we could do late setting for only the PIDs cgroup, but that's just ugly, and late setting solves problems for other cgroups as well (it is a common requirement).

Contributor:

Thanks for the info @cyphar. LGTM

Contributor:

Essentially the workflow is as follows:

  1. Create cgroups
  2. Set the appropriate limits
  3. Apply cgroups to the init process.

As of now 1 and 3 are essentially together. We should clean that up soon.
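In rough pseudo-Go, the intended ordering would be something like the sketch below (assuming a cgroups.Manager with separate Apply and Set steps as in this PR; exact signatures may differ):

```go
// Sketch of the intended cgroup lifecycle for a container's init process.
func setupCgroups(m cgroups.Manager, cfg *configs.Config, initPid int) error {
	// Steps 1 and 3: create the cgroups and place the init process in them.
	// (As noted above, these currently happen together in Apply.)
	if err := m.Apply(initPid); err != nil {
		return err
	}
	// Step 2: write the configured limits as late as possible, so the Go
	// runtime's extra bootstrap threads don't trip small limits like pids.max.
	return m.Set(cfg)
}
```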

@crosbymichael (Member):
LGTM

		continue
	}

	// Get the subsystem path, but don't fial out for not found cgroups.
Contributor:

Typo: fial

Add support for the pids cgroup controller to libcontainer, a recent
feature that is available in Linux 4.3+.

Unfortunately, due to the init process being written in Go, it can spawn
an unknown number of threads due to blocked syscalls. This results in
the init process being unable to run properly, and thus small pids.max
configs won't work.

Signed-off-by: Aleksa Sarai <asarai@suse.com>
Apply and Set are two separate operations, and it doesn't make sense to
group the two together (especially considering that the bootstrap
process is added to the cgroup as well). The only exception to this is
the memory cgroup, which requires the configuration to be set before
processes can join.

Signed-off-by: Aleksa Sarai <asarai@suse.com>
It is vital to loudly fail when a user attempts to set a cgroup limit
(rather than using the system default). Otherwise the user will assume
they have security they do not actually have. This mirrors the original
Apply() (that would set cgroup configs) semantics.

Signed-off-by: Aleksa Sarai <asarai@suse.com>
Due to the fact that the init is implemented in Go (which seemingly
randomly spawns new processes and loves eating memory), most cgroup
configurations are required to have an arbitrary minimum dictated by the
init. This confuses users and makes configuration more annoying than it
should be. An example of this is pids.max, where Go spawns multiple
processes that then cause init to violate the pids cgroup constraint
before the container can even start.

Solve this problem by setting the cgroup configurations as late as
possible, to avoid hitting as many of the resources hogged by the Go
init as possible. This has to be done before seccomp rules are applied,
as the parent and child must synchronise in order for the parent to
correctly set the configurations (and writes might be blocked by seccomp).

Signed-off-by: Aleksa Sarai <asarai@suse.com>
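As an ordering illustration only of the handshake described in that last commit message (channels standing in for runc's real pipe-based synchronisation, and print statements for the real work):

```go
package main

import "fmt"

func main() {
	ready := make(chan struct{})     // child signals: namespaces are set up
	limitsSet := make(chan struct{}) // parent signals: cgroup configs written
	done := make(chan struct{})

	go func() { // conceptually the bootstrap/init child
		close(ready)
		<-limitsSet // wait before installing seccomp, which could otherwise block the parent's writes
		fmt.Println("child: apply seccomp, then exec the user process")
		close(done)
	}()

	<-ready
	fmt.Println("parent: write cgroup configs (e.g. pids.max) as late as possible")
	close(limitsSet)
	<-done
}
```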
@vishh (Contributor) commented Dec 19, 2015

LGTM. Nice work @cyphar 👍

@cyphar (Member Author) commented Dec 19, 2015

Please merge this before we vendor specs to include opencontainers/runtime-spec#233, as it would cause quite a few merge conflicts.

I've fixed up your comment nit, and explained the test nit. PTAL. 🐳 /cc @mrunalp

@mrunalp (Contributor) commented Dec 19, 2015

@cyphar you mentioned an issue with systemd. Is that resolved?

@cyphar (Member Author) commented Dec 19, 2015

@mrunalp I just ran the tests on my local machine (which has systemd). All of the tests that pass on master pass on my branch. I think we're ready to go. I was mistaken about the systemd issue.

@mrunalp (Contributor) commented Dec 19, 2015

@cyphar Thanks! 👍 LGTM

mrunalp pushed a commit that referenced this pull request on Dec 19, 2015: "cgroups: add pids controller support"

@mrunalp merged commit bc46574 into opencontainers:master on Dec 19, 2015
@jessfraz (Contributor):
\o/


@cyphar (Member Author) commented Dec 19, 2015

w00t w00t. 🐳

@cyphar cyphar deleted the 18-add-pids-controller branch December 19, 2015 04:09
@mrunalp (Contributor) commented Dec 19, 2015

@cyphar Good work.. now carry it on to Docker :)

@cyphar (Member Author) commented Dec 19, 2015

@mrunalp I'm sorry. The tests don't actually check that the limits work, which misled me into believing they worked as expected. I can't seem to test how well systemd runs, because it looks like my build of runc won't use the systemd code (even after setting cgroupsPath to the systemd slice).

However, from what I can see, the limits that systemd doesn't support are broken (we never join them). I'm making a patch to try to fix this, but I can't test it (and the automated tests don't actually test that limits fail).

@mrunalp (Contributor) commented Dec 19, 2015

Looks like there will be a difference in behavior between the fs and systemd implementations around which cgroup values get set when. I am okay with the differences as long as the final outcome is the same, but others may disagree. I am reverting this PR till we get the systemd support to work correctly, and then we can merge it again. Sorry @cyphar, my bad as well. @crosbymichael @LK4D4 @hqhq Let me know what you think.

@cyphar (Member Author) commented Dec 20, 2015

@mrunalp I'm very confused. I've tried to run using systemd on both openSUSE 13.2 and Arch Linux, and (even though both systems run systemd and the kernels have full cgroup support) runc panics with a nil pointer dereference at libcontainer/cgroups/systemd/apply_systemd.go:218. This implies that systemd isn't supported?

@mrunalp (Contributor) commented Dec 20, 2015

@cyphar I will have limited time to debug that before Monday. Meanwhile, could you try your patch directly using docker?

docker daemon --exec-opt native.cgroupdriver=systemd

@cyphar (Member Author) commented Dec 20, 2015

@mrunalp @crosbymichael @hqhq @dqminh @lizf-os I just found out that, actually, I didn't break the memory cgroup (it was broken before this PR). If you try to run with the memory, blkio, or cpu cgroups, runc crashes with a "no such file or directory" error (meaning the path never had MkdirAll run on it). A lot of these issues appear to stem from deeper problems with getSubsystemPath not working as it should.

The kernel memory cgroup is the most brazen example: the code doesn't crash, but the limit is never set (this is the case for master on Docker right now). I want to know why we even support systemd as a driver for cgroups, since we bypass it for most of the cgroups anyway. And from what I can see, it's barely tested.

I've fixed the problem as best I can in my new PR (#446). But I really have doubts about how stable our systemd support really is.
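As a minimal illustration of the MkdirAll failure mode mentioned above (not the actual fix; writeCgroupFile is a hypothetical helper):

```go
import (
	"io/ioutil"
	"os"
	"path/filepath"
)

// writeCgroupFile ensures the subsystem directory exists before writing the
// control file -- the MkdirAll step that appears to be missing here.
func writeCgroupFile(dir, file, value string) error {
	if err := os.MkdirAll(dir, 0755); err != nil {
		return err
	}
	return ioutil.WriteFile(filepath.Join(dir, file), []byte(value), 0700)
}
```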
