Investigating user namespaces #986

tixxdz · 2015-06-03T14:49:04Z

In this issue we try to investigate user namespace support in rkt.

For the moment unprivileged user namespaces are out of scope.

Summary of tasks:

- kernel uidshifts at mount time: "rkt: kernel: uidshifts at mount time" #1057
  - proof of concept for tmpfs
  - the rest of the implementation
- uid shift with chown without kernel support: ~~"rkt: initial user namespace support test by doing the uidshift at extract time" #1027~~ "rkt: user namespace support through chown at extract time" #1250
  - functional test: files belong to the correct user/group inside and outside the container
  - functional test: root directory belongs to the correct user/group (with RW access): "Tests for user namespaces (--private-users)" #1531
- dynamic uid-locking scheme: "implement a uid locking scheme for user namespaces" #1090

Problems

1) CAP_SYS_ADMIN and cgroup filesystems:

rkt pods have CAP_SYS_ADMIN, which may reduce rkt's security: containers could remount cgroups in read-write mode and, since cgroups are not namespaced, change cgroup settings for services on the host.
Getting rid of CAP_SYS_ADMIN is difficult with the current architecture. Files in the cgroup filesystems are writable only by root (in the system slice) or by a specific user (in user slices). If the root user and other users from the host are not mapped in the user namespace of the container, this becomes a non-issue.
Please see the issue "stage1: rkt pods should not be given CAP_SYS_ADMIN" #576

TODO: check that the cgroup filesystem access rights behave as described above when running in a user namespace.

2) Separation and isolation

2.1) Capabilities:

User-namespace-aware kernel interfaces are guarded by an ns_capable() check against the current userns. We may take advantage of this isolation to give some capabilities to container X and other capabilities to container Y, while these capabilities are not effective on the host, in other words in init_userns. This makes it possible to allow some file system capabilities in a specific container for its internal operations without affecting other containers or even the host.

Please note that some kernel interfaces still use the plain old capable() check; if that check succeeds, the caps are effective globally.

2.2) Per-user limits:

Each user on the system has its own "struct user_struct" to count resources (processes, signals, etc.). When several pods run as the same user, they share that accounting, so any operation on one pod may affect the others. To improve separation and add an extra layer of resource isolation, we may use user namespaces and assign one range of global kuid_t to pod X and another range to pod Y. This may prevent some pods from DoS'ing each other.

Please note that kuid_t is not uid_t. kuid_t is the global kernel UID used to identify a process's credentials.

What we are trying to do:

1) User namespace mapping:

rkt will use only 1 level of user namespace. The schema above is just to illustrate the general concept of user namespaces.

Each running pod will have a range of uids assigned. For example, pod1 will have uids 200000 to 200999 (mapped to 0-999) and pod2 will have 201000 to 201999 (mapped to 0-999).
rkt should not reuse uids assigned to other system users (e.g. don’t reuse www-data, geoclue or sshd users!). rkt will be assigned a big range of uids it is allowed to use in /etc/subuid (see manpage subuid(5) and useradd(8)). Example: rkt:200000:65536.
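
To make the arithmetic above concrete, here is a minimal sketch in Python. It is not rkt code: the helper names (UidRange, parse_subuid_line, pod_range, to_host_uid) and the pod size of 1000 uids are assumptions taken from the example numbers in this comment.

```python
from typing import NamedTuple, Optional, Tuple


class UidRange(NamedTuple):
    base: int   # first host uid of the range
    count: int  # number of uids in the range


def parse_subuid_line(line: str) -> Tuple[str, UidRange]:
    """Parse one 'name:start:count' entry in the subuid(5) format."""
    name, start, count = line.strip().split(":")
    return name, UidRange(int(start), int(count))


def pod_range(pool: UidRange, index: int, pod_size: int = 1000) -> UidRange:
    """Carve the index-th pod-sized slice out of the pool (hypothetical layout)."""
    base = pool.base + index * pod_size
    if index < 0 or base + pod_size > pool.base + pool.count:
        raise ValueError("no room left in the pool for this pod")
    return UidRange(base, pod_size)


def to_host_uid(r: UidRange, container_uid: int) -> Optional[int]:
    """Map a uid seen inside the pod to the host uid; None if it is unmapped."""
    if 0 <= container_uid < r.count:
        return r.base + container_uid
    return None
```

With the example entry rkt:200000:65536, pod index 0 gets 200000-200999 and pod index 1 gets 201000-201999; uid 0 inside pod2 is host uid 201000.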

2) User namespace locking:

Since rkt does not have a central daemon to assign uids within the global range allowed by /etc/subuid, we will need some rkt-specific locking to prevent multiple running pods from reusing the same uids. Example: pod1 will lock on /run/rkt/uid-locks/uid-200000; pod2 will lock on /run/rkt/uid-locks/uid-201000. Or maybe something smarter to express the range.
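
One possible shape for such a lock, sketched in Python with flock(2) on one file per range. This is only an illustration: the function name lock_uid_range, the fixed pod size and the choice of flock are assumptions, not a decided design.

```python
import fcntl
import os
from typing import Optional, Tuple


def lock_uid_range(lock_dir: str, pool_base: int, pool_count: int,
                   pod_size: int = 1000) -> Optional[Tuple[int, int]]:
    """Return (base, fd) for the first free pod-sized range, or None if all are taken.

    The caller must keep fd open for the lifetime of the pod: closing it
    (or the process exiting) releases the lock.
    """
    os.makedirs(lock_dir, exist_ok=True)
    last_base = pool_base + pool_count - pod_size
    for base in range(pool_base, last_base + 1, pod_size):
        path = os.path.join(lock_dir, "uid-%d" % base)
        fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
        try:
            # Non-blocking exclusive lock: fails at once if another pod holds it.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            os.close(fd)
            continue
        return base, fd
    return None
```

Because the kernel drops a flock when its file description is closed, a crashed pod does not leave a stale lock behind, which is one reason to prefer this over a plain "file exists" check.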

Challenges / things to check:

1) uid-shifts for rootfs:

1.1) At extract time:

The same ACI could be used in several running pods. The rootfs trees are cached in the CAS (Content Addressable Storage) in /var/lib/rkt/cas/ with specific uid/gid owners, and changing them all afterwards (recursive chown) is too costly. We could instead shift uids on the fly at extract time, which would not be so costly.
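
A rough sketch of what shifting on the fly could mean, using Python's tarfile module rather than rkt's own extraction code. The function name shifted_members is hypothetical, and using the same base for uids and gids is an assumption.

```python
import tarfile
from typing import Iterator, Tuple


def shifted_members(tar: tarfile.TarFile, uid_base: int,
                    uid_count: int) -> Iterator[Tuple[tarfile.TarInfo, int, int]]:
    """Yield (member, host_uid, host_gid) with ids shifted into the pod's range."""
    for member in tar:
        # Ids at or beyond uid_count have no mapping inside the pod.
        if member.uid >= uid_count or member.gid >= uid_count:
            raise ValueError("%s: id outside the mapped range" % member.name)
        yield member, uid_base + member.uid, uid_base + member.gid
```

An extractor running as root would write each member to disk and then call os.lchown(path, host_uid, host_gid), so ownership is set once per file while it is being created, with no second pass over the tree.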

1.2) At mount time:

When a rootfs tree is used by overlayfs, we would need some vfs_uid= shifting option.
This was already mentioned in this lwn article "UID/GID identity and filesystems" http://lwn.net/Articles/637431/
https://lists.linux-foundation.org/pipermail/containers/2014-June/034630.html

volumes bind-mounting: similar issue

2) Network namespace:

On Linux, a network namespace belongs to the user namespace under which it was created, see field user_ns in struct net: http://lxr.free-electrons.com/source/include/net/net_namespace.h#L44.

rkt currently creates the network namespace in network plugins before systemd-nspawn is launched. If the user namespace is to be created by systemd-nspawn, it means that the network namespace will belong to the host user namespace rather than the pod user namespace. AFAIU, this has unwanted consequences for access to /sys/class/net/. See also https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=87a8ebd637dafc255070f503909a053cf0d98d3f

3) Bind mounting:

Bind mounting a socket file from the host to the container? E.g. sdnotify socket (https://bugs.freedesktop.org/show_bug.cgi?id=89844).

4) Passing file descriptors:

Passing fds in the other direction from the container to the host? E.g. the journal dirfd that @krnowak is working on #947.

Model

Our model may touch several layers, from rkt and nspawn down to the kernel.

1) kernel: uidshift on the rootfs

As noted in the Challenges section, we may add a uidshift mount option, both for containers and to implement dynamic uids for services. This may allow running unprivileged daemons with dynamically assigned uids without those uids leaking into the persistent file system.

Implementation:

1.1) A generic vfs mount option ?

mount(source, target, "bind", MS_BIND|MS_REMOUNT, vfs_uidshifts)

1.2) An overlayfs mount option or a completely new overlayfs-like fs

Some file systems like NFS are already doing some mapping.

TODO: continue the investigation.

Thanks to @alban for his help on the subject.

iaguis · 2015-06-04T13:59:05Z

I tested adding the argument --private-users to systemd-nspawn in rkt and running it with --no-overlay.

It then failed in prepare-app when bind-mounting /sys from stage1 to stage2 with EINVAL. I looked a bit on the internet and found http://stackoverflow.com/questions/23417521/mounting-proc-in-non-privileged-namespace-sandbox. Adding MS_REC to the bind-mount did the trick.

Then I had to chown -R user:group $CONTAINER_ROOT/stage1/rootfs/rkt/env and chown -R user:group $CONTAINER_ROOT/stage1/rootfs/opt/stage2 because they were owned by root, and then I could successfully start a container with user namespaces!

iaguis    1305  0.0  0.0  19052  7052 pts/2    Ss   14:40   0:00  \_ /bin/bash
root      9883  0.0  0.0  69992  5060 pts/2    S+   15:30   0:00  |   \_ sudo rkt --debug --insecure-skip-verify run -no-overlay -interactive coreos.com/etcd:v2.0.0
root      9884  5.1  0.0  17288  2372 pts/2    S+   15:30   0:02  |       \_ stage1/rootfs/usr/bin/systemd-nspawn --boot --register=true --link-journal=host --private-users=1000:65534 --uuid=f629d909-d811-472e-b254-b427cdafc8db --machine=rkt-f629d909-d811-472e-b254-b427cdafc8db --directory=stage1/rootfs -- --default-standard-output=tty
iaguis    9907  0.0  0.0  25180  3828 ?        Ss   15:30   0:00  |           \_ /usr/lib/systemd/systemd --default-standard-output=tty
iaguis    9915  0.0  0.0   1024     4 ?        Ss   15:30   0:00  |               \_ /waiter
iaguis    9916  0.0  0.0  19064  4084 ?        Ss   15:30   0:00  |               \_ /usr/lib/systemd/systemd-journald
iaguis    9929  0.2  0.0   9512  6324 pts/8    Ssl+ 15:30   0:00  |               \_ /etcd

Notice that systemd and children are owned by my user.

You can test it with: https://github.com/endocode/rkt/tree/iaguis/test-userns

To do the chown I just put in a sleep and do it manually, but when we extract the image we already do a chown, so shifting the uid there should work.
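
For anyone who wants to reproduce the manual step without the sleep, it could be scripted roughly as below. This is a sketch only: shift_tree is a hypothetical helper, it assumes uids and gids are shifted by the same base, and the chown parameter exists just so the walk can be tried without root.

```python
import os
from typing import Callable


def shift_tree(root: str, uid_base: int,
               chown: Callable[[str, int, int], None] = os.lchown) -> int:
    """Add uid_base to the owner and group of root and of everything below it.

    Returns the number of paths visited. Symlinks are not followed; the
    default os.lchown changes the link itself.
    """
    paths = [root]
    for dirpath, dirnames, filenames in os.walk(root):
        paths.extend(os.path.join(dirpath, n) for n in dirnames + filenames)
    for path in paths:
        st = os.lstat(path)
        chown(path, st.st_uid + uid_base, st.st_gid + uid_base)
    return len(paths)
```

Run as root with the default chown, for example shift_tree(os.path.join(container_root, "stage1/rootfs/opt/stage2"), 1000), where container_root stands for $CONTAINER_ROOT above. Running it twice shifts twice, so it is only safe on a freshly extracted tree.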

tixxdz · 2015-06-04T18:34:41Z

@iaguis thx for the test, actually we will have to investigate the recursive mount flag, where did you add the MS_REC flag ?

Taking a look at nspawn it seems that currently /sys is not mounted when userns is set.

iaguis · 2015-06-04T18:39:15Z

endocode@e4880ff

We tested with: sudo bin/rkt --debug --insecure-skip-verify run -no-overlay -interactive -private-users=10000:65536 ./busybox-latest.aci The 10000:65536 is hardcoded in a couple of places for now...

tixxdz · 2015-06-10T15:45:56Z

More updates on the model and how things should work from rkt's perspective.

Model

2) rkt and uid-shifts for rootfs

It seems we want rkt to automatically detect whether user namespaces are supported and, if so, just use them. If the user sets some (in)compatible flags, however, we have to follow what the user requested. This gives us the following scenarios:

2.1) Automatic support and uid-shifts at mount time

The long-term plan is to automatically pass the user namespace option "--private-users" through the runtime manifest to systemd-nspawn. To do this we have to check three things: CONFIG_USER_NS in the kernel, the not-yet-implemented uid shifting at mount time in the kernel, and whether rkt has its own global range of UIDs in /etc/subuid. If all three hold, rkt can automatically set "--private-users" for nspawn.

2.2) Automatic support and uid-shifts at extract time.

If the -no-overlay flag is set, the kernel has CONFIG_USER_NS, and rkt has its own global range of UIDs in /etc/subuid, then fall back to shifting UIDs at extract time and automatically set "--private-users" for nspawn.

We are already experimenting with this, please see this PR: #1027

2.3) Add "--private-users=[UIDBASE[:NUIDS]]" flag to rkt.

In this case rkt will take another flag to specify the UID base and UID range to pass to nspawn in case we don't have /etc/subuid. We will fall back to this method only if -no-overlay is set and the kernel supports CONFIG_USER_NS. The UID shifting will be done at extract time, as in 2.2).

We are already experimenting with this, please see the previously noted PR: #1027
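
The three scenarios can be read as a small decision table. The Python sketch below is one way to write it down; the function names, the returned labels and especially the order in which the scenarios are tried are assumptions for illustration, not settled behaviour.

```python
from typing import Optional, Tuple


def parse_private_users(value: str) -> Tuple[int, Optional[int]]:
    """Parse 'UIDBASE[:NUIDS]' into (base, count); count is None when omitted."""
    base, sep, count = value.partition(":")
    return int(base), (int(count) if sep else None)


def choose_uid_shift(kernel_userns: bool, kernel_mount_shift: bool,
                     has_subuid_range: bool, no_overlay: bool,
                     private_users: Optional[str]) -> str:
    """Pick a uid-shift strategy from the checks listed in 2.1) to 2.3)."""
    if not kernel_userns:
        return "disabled"            # no CONFIG_USER_NS: nothing to do
    if has_subuid_range and kernel_mount_shift:
        return "mount-time"          # 2.1) kernel shifts uids at mount time
    if has_subuid_range and no_overlay:
        return "extract-time"        # 2.2) range from /etc/subuid, chown at extract
    if private_users is not None and no_overlay:
        return "extract-time-flag"   # 2.3) range from the --private-users flag
    return "disabled"
```

For example, choose_uid_shift(True, False, False, True, "10000:65536") selects scenario 2.3), and parse_private_users("10000:65536") gives the base and count to hand to nspawn.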

marineam · 2015-06-16T16:50:57Z

I find it incredibly odd that a recursive mount magically makes it work, unless it is a security measure to prevent unprivileged containers from exposing something that is covered by a nested mount. I'd have to read the kernel code to see what is going on, I suppose.

Alternatively, why bind mount instead of mounting /sys fresh? Recursive is going to pull in debugfs which probably shouldn't be in the container and the cgroup hierarchy which nspawn will set up different mounts for.

vcaputo · 2015-06-16T22:09:19Z

I suppose for the overlayfs uid-shift case we'd need overlayfs to selectively shift the [ug]ids on inodes from the lower layer and simply pass-through the upper layer inode [ug]ids? Presumably we don't want containers to be able to create files owned by the host's uid 0 via the mapping applying to the upper layer.

alban · 2015-06-16T22:44:50Z

> I find it incredibly odd that a recursive mount magically makes it work, unless it is a security measure to prevent unprivileged containers from exposing something that is covered by a nested mount. I'd have to read the kernel code to see what is going on, I suppose.

It seems to be exactly that, see do_mount/do_loopback

        if (!recurse && has_locked_children(old, old_path.dentry))
                goto out2;

> Alternatively, why bind mount instead of mounting /sys fresh? Recursive is going to pull in debugfs which probably shouldn't be in the container and the cgroup hierarchy which nspawn will set up different mounts for.

The bind mount of /sys we mentioned here is the bind mount from stage1 to stage2, so both the source and the target of the bind mount are in the container. So this should not pull debugfs from the host.

The /sys in stage1 is a fresh mount and debugfs is not requested in the list of fresh mounts.

Recently, rkt started mounting cgroups in stage1 with our own choice of read-only/read-write options. nspawn would normally mount them but it explicitly skips that step when it notices they are already mounted. Similarly, systemd in stage1 would mount them if not already mounted.

alban · 2015-06-16T23:16:53Z

> I suppose for the overlayfs uid-shift case we'd need overlayfs to selectively shift the [ug]ids on inodes from the lower layer and simply pass-through the upper layer inode [ug]ids? Presumably we don't want containers to be able to create files owned by the host's uid 0 via the mapping applying to the upper layer.

Interesting question... If the mapping also applies to the upper directory, the container will not be able to create files owned by the host's uid 0 in there (that's a good thing, I guess?) because the host's uid 0 will not be mapped in the container. But then we'll need to find a way to persist and restore the mapping if container restarts (#551) get implemented, because the mapping will be dynamic.

If the mapping does not apply to the upper directory, we cannot prevent files owned by the host's uid 0 from being created (unless the upper directory is itself bind mounted over itself with another [ug]idshift option), but we can remount the overlayfs with a different mapping after a container restart.

/cc @dvdhrm

tixxdz · 2015-06-17T08:42:53Z

The followup issue to track the kernel uidshift at mount time is here: #1057

tixxdz · 2015-06-22T16:17:48Z

Just an update on this issue.

It seems that we will not use the subordinate uid file /etc/subuid to handle UIDs and their ranges. The new plan is to let systemd-machined do the work for us. I have opened this new systemd issue: "User namespace: allow machined to handle UIDs and ranges" systemd/systemd#321

If we go this way, we will not have to worry at all about UID range mapping or locking in rkt; systemd will do the work for us.

If you have any comments on this, please do not hesitate. This partially solves the "What we are trying to do:" section of this investigation in a clean way.

Thank you!

Do the userns chown at extract time. With the help of: Alban Crequy <alban@endocode.com> Simone Gotti <simone.gotti@gmail.com> Krzesimir Nowak <krzesimir@endocode.com> Iago López Galeiras <iago@endocode.com> Signed-off-by: Djalal Harouni <djalal@endocode.com>

tixxdz · 2015-08-17T14:09:16Z

An update, the "uid shift with chown without kernel support" task was completed and merged. For the record it was PR #1250

For this same task we will add some functional tests later and perhaps do some cleaning. The user namespace support is currently marked experimental.

Thanks!

lenucksi · 2016-03-12T15:00:32Z

👍

jonboulle added the kind/question label Jun 3, 2015

jonboulle added this to the v1.0.0 milestone Jun 3, 2015

tixxdz mentioned this issue Jun 10, 2015

rkt: initial user namespace support test by doing the uidshift at extract time #1027

Closed

tixxdz mentioned this issue Jun 17, 2015

rkt: kernel: uidshifts at mount time #1057

Open

tixxdz self-assigned this Jun 22, 2015

tixxdz mentioned this issue Jun 22, 2015

machined should allow dynamic allocation of transient UID ranges for userns support in container managers such as nspawn systemd/systemd#321

Closed

alban mentioned this issue Jun 29, 2015

implement a uid locking scheme for user namespaces #1090

Open

alban added a commit to endocode/rkt that referenced this issue Jun 30, 2015

userns: chown as suggested on rkt#986

14ea774

Tested both with "rkt run" and "rkt prepare && rkt run-prepared". The uid range is not hardcoded anymore.

alban mentioned this issue Jul 1, 2015

stage1:prepare-app: add support for mount points options #1095

Closed

tixxdz added the depends-on/external label Jul 15, 2015

philips added the technology/userns label Jul 22, 2015

tixxdz pushed a commit to endocode/rkt that referenced this issue Aug 4, 2015

rkt:userns: chown as suggested on rkt#986

132cf3d

Signed-off-by: Djalal Harouni <djalal@endocode.com>

tixxdz mentioned this issue Aug 6, 2015

rkt: user namespace support through chown at extract time #1250

Merged

glerchundi mentioned this issue Aug 26, 2015

Support USER in Dockerfile - when container starts up non-root just-containers/s6-overlay#19

Closed

steveej mentioned this issue Sep 19, 2015

*: shared namespace execution modes #1433

Open

jonboulle unassigned tixxdz Sep 19, 2015

jonboulle added the help wanted label Sep 19, 2015

jonboulle assigned vcaputo Oct 1, 2015

jonboulle modified the milestones: v1+, v1.0.0 Jan 22, 2016

lucab unassigned vcaputo Apr 5, 2017
