
cgroups: add pids controller support #58

Merged: 4 commits merged into opencontainers:master from cyphar:18-add-pids-controller on Dec 19, 2015
Conversation

@cyphar (Member) commented Jun 27, 2015

Add support for the pids cgroup controller, a recent feature that is
(see: will be) available in Linux 4.3.

Closes #382

Signed-off-by: Aleksa Sarai cyphar@cyphar.com

@cyphar (Member Author) commented Jun 28, 2015

@@ -49,6 +49,11 @@ type MemoryStats struct {
	Stats map[string]uint64 `json:"stats,omitempty"`
}

type PidsStats struct {
	// current counter usage
	Current uint64 `json:"current,omitempty"`
Contributor:
NumPids? We should also document that it is the number of pids in the cgroup

@cyphar (Member Author):

I prefer that it match the name of the sysfs file. But it doesn't really matter, I guess.
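For reference, a documented version of this field, keeping the cgroupfs file name as preferred above, might look roughly like the sketch below (not necessarily the final patch):

```go
// PidsStats holds statistics reported by the pids cgroup controller.
type PidsStats struct {
	// Current is the number of pids in the cgroup (and its descendants),
	// mirroring the pids.current file exposed by the kernel.
	Current uint64 `json:"current,omitempty"`
}
```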

@vmarmol (Contributor) commented Jun 29, 2015

I think we also need to make some changes for the systemd portion.

@cyphar (Member Author) commented Jun 30, 2015

@vmarmol Yes, quite. I'll update it in a few hours.

@cyphar (Member Author) commented Jun 30, 2015

Fixed up the issues (it also wouldn't build properly because I didn't fix the imports from the old ones).

/cc @crosbymichael @vmarmol @mrunalp @LK4D4

@lizf-os (Contributor) commented Jul 1, 2015

The change to the systemd part doesn't look sufficient. You also need to add some changes to Apply().

@mrunalp (Contributor) commented Jul 1, 2015

@cyphar You will have to modify the Apply function in apply_systemd.go, as @lizf-os pointed out. I suspect that systemd doesn't yet support this, so you might want to take the approach that the joinCpu function takes in that file.
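For illustration, a joinPids helper modelled on that approach might look roughly like the sketch below. It assumes the file's existing imports plus strconv, that getSubsystemPath behaves like the other helpers in apply_systemd.go (resolving the per-container cgroupfs path for a subsystem), and that the config field is named PidsLimit; none of these are confirmed by this thread.

```go
// Sketch only: join the pids cgroup via cgroupfs directly, since systemd has
// no pids-controller property it can set for us at this point.
func joinPids(c *configs.Cgroup, pid int) error {
	path, err := getSubsystemPath(c, "pids") // assumed helper, as used for other subsystems
	if err != nil {
		return err
	}
	if err := os.MkdirAll(path, 0755); err != nil {
		return err
	}
	if c.PidsLimit > 0 {
		limit := strconv.FormatInt(c.PidsLimit, 10)
		if err := ioutil.WriteFile(filepath.Join(path, "pids.max"), []byte(limit), 0700); err != nil {
			return err
		}
	}
	// Finally, move the process itself into the cgroup.
	return ioutil.WriteFile(filepath.Join(path, "cgroup.procs"), []byte(strconv.Itoa(pid)), 0700)
}
```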

@cyphar (Member Author) commented Jul 8, 2015

We also need to update the spec if we want support in config.json. I've proposed opencontainers/runtime-spec#64 to deal with that. Ignore the build errors until it's merged.

@cyphar (Member Author) commented Sep 2, 2015

The PIDs changes have been merged into Linus' tree as of torvalds/linux@8bdc69b. I'll update this patch in a bit (if it needs updating).

@cyphar (Member Author) commented Sep 21, 2015

Since #242 has been merged, it's probably time to get around to reviewing this.

/ping @crosbymichael @vmarmol @mrunalp @LK4D4

@mrunalp (Contributor) commented Sep 21, 2015

I will take a look tomorrow


@crosbymichael (Member):
code, LGTM

@mrunalp (Contributor) commented Sep 24, 2015

I am just testing this and noticed that we need to set pids.max to 4 for runc to be able to spawn the container.

@mrunalp (Contributor) commented Sep 24, 2015

@cyphar Can we delay applying the setting so that we can set it to lower values like 1 or 2 for the exec'd process?

@LK4D4 (Contributor) commented Oct 12, 2015

ping @cyphar
need rebase

@@ -57,6 +57,11 @@ type Cgroup struct {
	// MEM to use
	CpusetMems string `json:"cpuset_mems"`

	// Process limit; set to `0' to disable limit.
	// While technically `0' is a valid PID limit, it does not make sense in the
	// context of a container -- it is identical to a limit of `1'.

Reviewer:

But the spec says:

// Maximum number of PIDs. A value < 0 implies "no limit".
Limit int64 `json:"limit"`

Please consider basing this on PR #369.

@cyphar (Member Author):

Gah, that should read <= 0. Whoops. I'll send a PR to fix the comment when I get a chance. The problem with 0 being a valid limit is that it makes the spec non-backwards compatible (and you have to specify it in config.json) ... unless the change in #369 changes the default?

-- not to mention the fact that a limit of 0 in a pids cgroup doesn't mean anything different to a limit of 1. Since attaches aren't blocked by cgroup core, you need to have at least one process in the cgroup in order for the limits to affect anything.

@yangdongsheng:

@cyphar I think a limit of 0 means: don't allow the process in the container to fork() or clone(). That's really a reasonable scenario, I think. Yes, there is at least 1 process in a container, which means that pids.current is always >= 1. But that does not prevent us from setting pids.max to 0 when I want to prevent fork() and clone() in the container.

@cyphar (Member Author):

@yangdongsheng But ... it doesn't actually have a practical meaning. Having a limit of 0 is precisely identical to having a limit of 1. Both prevent fork() and clone() in a container with a single process in it. There's no special code path that the pids controller goes through if the limit is specifically 0 (ref: I wrote it). I do understand that it might seem nice to have a limit of zero to specify "this container should never have any processes in it", but I'm not sure if the resource limiting part of the spec is the right way to go (I'd rather have container types as first-class citizens).

Contributor:

I don't think making 0 a valid limit is necessary, since a cgroup without any processes is meaningless, and we agree that there would always be at least 1 process in a container, so setting pids.max=1 would be enough to prevent any fork() or clone() in the container.

And taking the default value of the type (such as 0 for int, null for a mapping, etc.) as the invalid value is what we always do, except for some special types such as bool (like cgroup.OomKillDisable). So I prefer to take 0 as the invalid value and change the spec instead. WDYT?

@yangdongsheng:

@cyphar So the reasoning is essentially to follow the existing convention in runc, right? But opencontainers/runtime-spec#233 is trying to set the default value to -1.

And I think the pids cgroup is a supporting case for opencontainers/runtime-spec#233.

@cyphar (Member Author):

And I think the pids cgroup is a supporting case for opencontainers/runtime-spec#233.

@yangdongsheng I don't understand why you would think that. You can't set a value of -1 to mean max in the PIDs controller ...?

I also don't agree with opencontainers/runtime-spec#233 (I just commented with my reservations). I'd be okay with it if we change all of the limits to be pointers (so you can set null to mean default, or omit the option to also mean default) and then get consumers to deal with the default values explicitly (by checking for null).

@yangdongsheng:

@cyphar Okay, on second thought, I think my idea of "making Limit = 0 valid" would also not reduce confusion for users. Let's wait for the conclusion on the mailing list about cgroup removal and opencontainers/runtime-spec#233. For now, I would say Limit <= 0 meaning max is good enough. :)

But I would like to read the cgroup discussion to see why attaching to a pids cgroup is allowed to exceed pids.max, if I get some time.

@cyphar (Member Author):

@yangdongsheng ... pids.max isn't broken. In what way do you think it's broken? It's discussed here: http://thread.gmane.org/gmane.linux.kernel.cgroups/13292.

@yangdongsheng:

@cyphar Yes, I see that's intentional. But I just want to read more of the discussion about it to see what TJ said. Thanks for the reference, I will open and read it later. :)

@LK4D4 (Contributor) commented Nov 16, 2015

@cyphar So, 0 just does nothing (inherit from the parent cgroup?), negative writes "max", and positive writes the number? Sounds reasonable to me. Ping @mrunalp @crosbymichael

@cyphar (Member Author) commented Nov 17, 2015

@LK4D4 I'd prefer that we wait until opencontainers/runtime-spec#233 is decided on. The current state is that 0 means "do nothing" (which effectively means max, since that's the default) and negatives mean an explicit max. But I prefer the solution discussed in opencontainers/runtime-spec#233 (use null for the system default; all other values are written).
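To make those semantics concrete, the fs driver's Set step for the pids subsystem might look roughly like the sketch below (the PidsGroup/PidsLimit names and the use of ioutil.WriteFile instead of the package's own write helper are assumptions, not confirmed by this thread):

```go
// Sketch of Set for the pids subsystem:
//   PidsLimit == 0 -> leave pids.max untouched (kernel default is "max")
//   PidsLimit  < 0 -> write "max" explicitly
//   PidsLimit  > 0 -> write the numeric limit
func (s *PidsGroup) Set(path string, cgroup *configs.Cgroup) error {
	if cgroup.PidsLimit == 0 {
		return nil
	}
	limit := "max"
	if cgroup.PidsLimit > 0 {
		limit = strconv.FormatInt(cgroup.PidsLimit, 10)
	}
	return ioutil.WriteFile(filepath.Join(path, "pids.max"), []byte(limit), 0700)
}
```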

@cyphar (Member Author) commented Dec 8, 2015

@mrunalp Sorry for the late response. Do you know where runc is creating new processes that we shouldn't limit? Also, I'd really like opencontainers/runtime-spec#233 to be fixed up and merged so we can use the semantics <= 0 means max, > 0 means "use this limit" and null means don't touch the cgroup.

@dqminh (Contributor) commented Dec 8, 2015

Do you know where runc is creating new processes that we shouldn't limit?

@cyphar I think it's here: https://github.com/opencontainers/runc/blob/master/libcontainer/process_linux.go#L196-L200

So basically we run a bootstrap process, put the bootstrap process into the correct cgroups, and then start the actual process. So it's likely that limit=1 will not work?

@cyphar (Member Author) commented Dec 8, 2015

@dqminh Oh, I see. I can see two solutions to this problem:

  • We tell people not to use PidsLimit = 1 because it doesn't work (bad).
  • We increment the PidsLimit by 1 (or whatever the correct number is) such that the number refers to the number of processes inside the container. This is better, but it does mean that you'll have an off-by-one error in the number of pids in the container in the long-term. Unfortunately, we can't execve the bootstrapped process because it needs to fork to apply the namespace configs.

I've implemented the second, but I'm not sure if it's the best solution. Full disclosure: I'm not currently on a system where I can test that it works (I'm currently compiling the kernel).
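Purely as an illustration of the second option (not necessarily the code in this branch), the adjustment amounts to something like:

```go
// bootstrapProcesses is an illustrative constant: the number of extra
// short-lived processes runc needs while setting up the container.
const bootstrapProcesses = 1

// effectivePidsLimit bumps a positive limit so that the configured value keeps
// meaning "processes inside the container", accepting a small off-by-one window
// while the bootstrap process is still alive.
func effectivePidsLimit(configured int64) int64 {
	if configured <= 0 {
		return configured // unlimited ("max"); nothing to adjust
	}
	return configured + bootstrapProcesses
}
```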

@dqminh (Contributor) commented Dec 8, 2015

@cyphar yep, both options are bad, unfortunately :(

Transparently incrementing the pids limit by 1 is actually worse, IMO, because the extra slot is only used for bootstrap and it doesn't correspond to what the user asked for. Temporarily increasing and then decreasing the value may not work either, because then we have a race condition.

Not allowing PidsLimit=1 seems just a bit more reasonable, because I don't think it's possible to really allow it given how the container is bootstrapped.

WDYT ? @crosbymichael @LK4D4 @mrunalp @vishh @avagin

@cyphar (Member Author) commented Dec 8, 2015

@dqminh I'd like to point out that an off-by-one isn't that bad, because in the context of PID resource exhaustion one PID is not going to make a significant difference. But I do agree that it is not the best solution (and that incrementing then decrementing will have a race). We can't get the bootstrapping process to change the limit, and we don't have control over the final process running in the container. Is there any "callback" when the bootstrapping is complete?

@mrunalp (Contributor) commented Dec 8, 2015

I am okay with having the setting be some minimum other than 1.

for _, sys := range subsystems {
	// We can't set this here, because after being applied, memcg doesn't
	// allow a non-empty cgroup from having its limits changed.
	if sys.Name() == "memory" {
@crosbymichael (Member):

Are we sure this is true? Did it change in a kernel version?

@cyphar (Member Author):

@crosbymichael No, this hasn't changed (see 39279b1); it's a specific issue with kernel memory limiting, so it only applies to the kmemcg limits. Since the kmemcg limits are integrated into the memory cgroup, I couldn't think of a "nice" way of only setting half of the options.

@crosbymichael (Member):

I don't think we can make this change now or in this PR, because updating memory limits (not kernel limits) is a big use case and this just stops it. Maybe we can keep the changes small, only do the pid change in this PR, and update the other things later?

Contributor:

+1


@cyphar (Member Author):

@crosbymichael

This is how this code has always worked. The only change I've made is separating .Set() and .Apply(). The previous code explicitly did it this way (I actually haven't modified the Apply() method for MemoryGroup). In other words, I'm not making any change to how the MemoryGroup works.

Memory limits are still being set, they're just being set in MemoryGroup.Apply() rather than MemoryGroup.Set().

And we can't "only do the pid change in this PR", because we need to Set() the cgroup values as late as possible in order to support all legal values for the PIDs cgroup. Sure, we could do late setting for only the PIDs cgroup, but that's just ugly, and late setting solves problems for other cgroups as well (it is a common requirement).

Contributor:

Thanks for the info @cyphar. LGTM

Contributor:

Essentially the workflow is as follows:

  1. Create cgroups
  2. Set the appropriate limits
  3. Apply cgroups to the init process.

As of now 1 and 3 are essentially together. We should clean that up soon.
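In rough pseudo-Go, the intended ordering would be something like the sketch below (assuming a cgroups.Manager with separate Apply and Set steps as in this PR; exact signatures may differ):

```go
// Sketch of the intended cgroup lifecycle for a container's init process.
func setupCgroups(m cgroups.Manager, cfg *configs.Config, initPid int) error {
	// Steps 1 and 3: create the cgroups and place the init process in them.
	// (As noted above, these currently happen together in Apply.)
	if err := m.Apply(initPid); err != nil {
		return err
	}
	// Step 2: write the configured limits as late as possible, so the Go
	// runtime's extra bootstrap threads don't trip small limits like pids.max.
	return m.Set(cfg)
}
```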

@crosbymichael (Member):
LGTM

		continue
	}

	// Get the subsystem path, but don't fial out for not found cgroups.
Contributor:

Typo: fial

Add support for the pids cgroup controller to libcontainer, a recent
feature that is available in Linux 4.3+.

Unfortunately, due to the init process being written in Go, it can spawn
an unknown number of threads due to blocked syscalls. This results in
the init process being unable to run properly, and thus small pids.max
configs won't work.

Signed-off-by: Aleksa Sarai <asarai@suse.com>
Apply and Set are two separate operations, and it doesn't make sense to
group the two together (especially considering that the bootstrap
process is added to the cgroup as well). The only exception to this is
the memory cgroup, which requires the configuration to be set before
processes can join.

Signed-off-by: Aleksa Sarai <asarai@suse.com>
It is vital to loudly fail when a user attempts to set a cgroup limit
(rather than using the system default). Otherwise the user will assume
they have security they do not actually have. This mirrors the original
Apply() (that would set cgroup configs) semantics.

Signed-off-by: Aleksa Sarai <asarai@suse.com>
Due to the fact that the init is implemented in Go (which seemingly
randomly spawns new processes and loves eating memory), most cgroup
configurations are required to have an arbitrary minimum dictated by the
init. This confuses users and makes configuration more annoying than it
should be. An example of this is pids.max, where Go spawns multiple
processes that then cause init to violate the pids cgroup constraint
before the container can even start.

Solve this problem by setting the cgroup configurations as late as
possible, to avoid hitting as many of the resources hogged by the Go
init as possible. This has to be done before seccomp rules are applied,
as the parent and child must synchronise in order for the parent to
correctly set the configurations (and writes might be blocked by seccomp).

Signed-off-by: Aleksa Sarai <asarai@suse.com>
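As an ordering illustration only of the handshake described in that last commit message (channels standing in for runc's real pipe-based synchronisation, and print statements for the real work):

```go
package main

import "fmt"

func main() {
	ready := make(chan struct{})     // child signals: namespaces are set up
	limitsSet := make(chan struct{}) // parent signals: cgroup configs written
	done := make(chan struct{})

	go func() { // conceptually the bootstrap/init child
		close(ready)
		<-limitsSet // wait before installing seccomp, which could otherwise block the parent's writes
		fmt.Println("child: apply seccomp, then exec the user process")
		close(done)
	}()

	<-ready
	fmt.Println("parent: write cgroup configs (e.g. pids.max) as late as possible")
	close(limitsSet)
	<-done
}
```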
@vishh (Contributor) commented Dec 19, 2015

LGTM. Nice work @cyphar 👍

@cyphar (Member Author) commented Dec 19, 2015

Please merge this before we vendor specs to include opencontainers/runtime-spec#233, as it would cause quite a few merge conflicts.

I've fixed up your comment nit, and explained the test nit. PTAL. 🐳 /cc @mrunalp

@mrunalp (Contributor) commented Dec 19, 2015

@cyphar you mentioned an issue with systemd. Is that resolved?

@cyphar (Member Author) commented Dec 19, 2015

@mrunalp I just ran the tests on my local machine (which has systemd). All of the tests that pass on master pass on my branch. I think we're ready to go. I was mistaken about the systemd issue.

@mrunalp (Contributor) commented Dec 19, 2015

@cyphar Thanks! 👍 LGTM

mrunalp pushed a commit that referenced this pull request on Dec 19, 2015: "cgroups: add pids controller support"

@mrunalp merged commit bc46574 into opencontainers:master on Dec 19, 2015
@jessfraz (Contributor):
\o/


@cyphar (Member Author) commented Dec 19, 2015

w00t w00t. 🐳

@cyphar cyphar deleted the 18-add-pids-controller branch December 19, 2015 04:09
@mrunalp (Contributor) commented Dec 19, 2015

@cyphar Good work.. now carry it on to Docker :)

@cyphar (Member Author) commented Dec 19, 2015

@mrunalp I'm sorry. The tests don't actually check that the limits work, which misled me into believing they worked as expected. I can't seem to test how well systemd runs, because it looks like my build of runc won't use the systemd code (even after setting cgroupsPath to the systemd slice).

However, from what I can see, the limits that systemd doesn't support are broken (we never join them). I'm making a patch to try to fix this, but I can't test it (and the automated tests don't actually test that limits fail).

@mrunalp (Contributor) commented Dec 19, 2015

Looks like there will be a difference in behavior between the fs and systemd implementations around which cgroup values get set when. I am okay with the differences as long as the final outcome is the same, but others may disagree. I am reverting this PR till we get the systemd support to work correctly, and then we can merge it again. Sorry @cyphar, my bad as well. @crosbymichael @LK4D4 @hqhq Let me know what you think.

@cyphar (Member Author) commented Dec 20, 2015

@mrunalp I'm very confused. I've tried to run using systemd on both openSUSE 13.2 and Arch Linux, and (even though both systems run systemd and the kernels have full cgroup support) runc panics with a nil pointer dereference at libcontainer/cgroups/systemd/apply_systemd.go:218. This implies that systemd isn't supported?

@mrunalp (Contributor) commented Dec 20, 2015

@cyphar I will have limited time to debug that before Monday. Meanwhile, could you try your patch directly using docker?

docker daemon --exec-opt native.cgroupdriver=systemd

@cyphar (Member Author) commented Dec 20, 2015

@mrunalp @crosbymichael @hqhq @dqminh @lizf-os I just found out that, actually, I didn't break the memory cgroup (it was broken before this PR). If you try to run with the memory, blkio, or cpu cgroups, runc crashes with a "no such file or directory" error (meaning the path never had MkdirAll run on it). A lot of these issues appear to stem from deeper problems with getSubsystemPath not working as it should.

The kernel memory cgroup is the most brazen example: the code doesn't crash, but the limit is never set (this is the case for master on Docker right now). I want to know why we even support systemd as a driver for cgroups, since we bypass it for most of the cgroups anyway. And from what I can see, it's barely tested.

I've fixed the problem as best I can in my new PR (#446). But I really have doubts about how stable our systemd support really is.
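As a minimal illustration of the MkdirAll failure mode mentioned above (not the actual fix; writeCgroupFile is a hypothetical helper):

```go
import (
	"io/ioutil"
	"os"
	"path/filepath"
)

// writeCgroupFile ensures the subsystem directory exists before writing the
// control file -- the MkdirAll step that appears to be missing here.
func writeCgroupFile(dir, file, value string) error {
	if err := os.MkdirAll(dir, 0755); err != nil {
		return err
	}
	return ioutil.WriteFile(filepath.Join(dir, file), []byte(value), 0700)
}
```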
