This repository has been archived by the owner on Feb 24, 2020. It is now read-only.

Investigating user namespaces #986

Open
1 of 7 tasks
tixxdz opened this issue Jun 3, 2015 · 12 comments

Comments

@tixxdz
Contributor

tixxdz commented Jun 3, 2015

In this issue we try to investigate user namespaces support in rkt.

For the moment unprivileged user namespaces are out of scope.

Summary of tasks:

Problems

1) CAP_SYS_ADMIN and cgroup filesystems:

rkt pods have CAP_SYS_ADMIN, which may reduce rkt's security: containers could remount cgroups in read-write mode and, since cgroups are not namespaced, change cgroup settings for services on the host.
Getting rid of CAP_SYS_ADMIN is difficult with the current architecture. Files in the cgroup filesystems are writable only by root (in the system slice) or by a specific user (in user slices). If the root user and the other host users are not mapped in the user namespace of the container, this becomes a non-issue.
Please see the issue "stage1: rkt pods should not be given CAP_SYS_ADMIN" #576.

TODO: check that the cgroup filesystem access rights behave as described above when running in a user namespace.

2) Separation and isolation

2.1) Capabilities:

User namespace aware kernel interfaces are handled with the ns_capable() check against the current userns. We may take advantage of this isolation to give some capabilities to container X and other capabilities to container Y, while at the same time these capabilities are not effective on the host, in other words in the init_userns. This makes it possible to allow some file system capabilities in a specific container for its internal operations without affecting other containers or the host.

Please note that some kernel interfaces still use the plain old capable() check; if that check succeeds, the caps are effective globally.

2.2) Per-user limits:

Each user on the system has its own "struct user_struct" to count resources (processes, signals, etc.). When several pods are used they share the same user, so any operation on a pod may affect other pods. To improve separation and add an extra layer of resource isolation, we may use user namespaces and assign one range of global kuid_t to pod X and another range to pod Y. This may prevent some pods from DoS'ing each other.

Please note that kuid_t is not uid_t. kuid_t is the global kernel UID used to identify a process's credentials.
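
To make the difference between the uid seen inside a namespace and the global one more concrete, here is a minimal sketch (plain Python, not kernel code) of the translation described by the three-column format of /proc/PID/uid_map: first uid inside the namespace, first uid outside, length. The function names and the sample mapping are assumptions made for illustration only.

```python
from typing import List, Optional, Tuple

# One uid_map entry: (first uid inside the namespace, first uid outside, length).
UidMapEntry = Tuple[int, int, int]


def parse_uid_map(text: str) -> List[UidMapEntry]:
    """Parse the content of a uid_map file into a list of entries."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        inside, outside, length = (int(f) for f in fields)
        entries.append((inside, outside, length))
    return entries


def to_global_uid(ns_uid: int, uid_map: List[UidMapEntry]) -> Optional[int]:
    """Translate a uid seen inside the namespace to the uid outside of it.

    Returns None when the uid is not mapped at all.
    """
    for inside, outside, length in uid_map:
        if inside <= ns_uid < inside + length:
            return outside + (ns_uid - inside)
    return None


# Hypothetical mapping: uids 0-999 in the pod are uids 200000-200999 outside.
sample_map = parse_uid_map("0 200000 1000\n")
print(to_global_uid(0, sample_map))     # 200000
print(to_global_uid(1000, sample_map))  # None (not mapped)
```

An unmapped uid has no counterpart on the other side, which is why host users that are left out of the mapping cannot be reached from the container.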

[Image: userns_heirarchy]

What we are trying to do:

1) User namespace mapping:

rkt will use only 1 level of user namespace. The schema above is just to illustrate the general concept of user namespaces.

Each running pod will have a range of uid assigned. For example, pod1 will have uids 200000 to 200999 (mapped to 0-999) and pod2 will have 201000 to 201999 (mapped to 0-999).
rkt should not reuse uids assigned to other system users (e.g. don’t reuse www-data, geoclue or sshd users!). rkt will be assigned a big range of uids it is allowed to use in /etc/subuid (see manpage subuid(5) and useradd(8)). Example: rkt:200000:65536.
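
As a rough sketch of the arithmetic above: given the example /etc/subuid entry, each pod gets a fixed-size slice of the pool. The slice size of 1000 comes from the example in this issue; the helper names and the idea of indexing pods by a number are assumptions for illustration, not how rkt does it.

```python
from typing import NamedTuple, Tuple


class SubuidRange(NamedTuple):
    name: str
    start: int
    count: int


def parse_subuid_line(line: str) -> SubuidRange:
    """Parse one 'name:start:count' entry as documented in subuid(5)."""
    name, start, count = line.strip().split(":")
    return SubuidRange(name, int(start), int(count))


def pod_uid_range(pool: SubuidRange, pod_index: int,
                  uids_per_pod: int = 1000) -> Tuple[int, int]:
    """Return (first, last) host uid for the pod_index-th slice of the pool."""
    first = pool.start + pod_index * uids_per_pod
    last = first + uids_per_pod - 1
    if pod_index < 0 or last >= pool.start + pool.count:
        raise ValueError("pod index outside the range granted in /etc/subuid")
    return first, last


pool = parse_subuid_line("rkt:200000:65536")
print(pod_uid_range(pool, 0))  # (200000, 200999)
print(pod_uid_range(pool, 1))  # (201000, 201999)
```

With 65536 uids and 1000 per pod, this pool has room for 65 pods; the remainder of the range stays unused.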

2) User namespace locking:

Since rkt does not have a central daemon to assign uids within the global range allowed by /etc/subuid, we will need some rkt-specific locking to prevent multiple running pods from reusing the same uids. Example: pod1 will lock on /run/rkt/uid-locks/uid-200000; pod2 will lock on /run/rkt/uid-locks/uid-201000. Or maybe something smarter to express the range.
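
One possible shape for this locking, sketched with flock(2) on one file per range. This is only an assumption about how it could work: the function name, the fixed slice size and the choice of flock are illustrative, and the lock directory is a parameter so the sketch can be tried outside /run/rkt.

```python
import fcntl
import os
from typing import Optional, Tuple


def lock_uid_range(lock_dir: str, pool_start: int, pool_count: int,
                   uids_per_pod: int = 1000) -> Optional[Tuple[int, int]]:
    """Try to lock the first free uid range in the pool.

    Returns (first_uid, lock_fd), or None if every range is taken.
    The caller keeps lock_fd open for the lifetime of the pod.
    """
    os.makedirs(lock_dir, exist_ok=True)
    end = pool_start + pool_count - uids_per_pod + 1
    for first in range(pool_start, end, uids_per_pod):
        path = os.path.join(lock_dir, "uid-%d" % first)
        fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)
        try:
            # Non-blocking exclusive lock: fails if another pod holds it.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            os.close(fd)
            continue
        return first, fd
    return None
```

A lock taken this way is dropped by the kernel when the holding process exits, so a crashed pod would not leave its range locked forever, which seems to fit the no-daemon design.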

Challenges / things to check:

1) uid-shifts for rootfs:

1.1) At extract time:

The same ACI could be used in several running pods. The rootfs trees are cached in the CAS (Content Addressable Storage) in /var/lib/rkt/cas/ and have specific uid/gid owners; changing them all (recursive chown) is too costly. We could instead shift uids on the fly at extract time, which would not be so costly.
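
A minimal sketch of what shifting on the fly could look like, using Python's tarfile module on an in-memory archive instead of a real ACI. The helper names, the base of 200000 and the range of 1000 are assumptions for illustration; rkt's actual extraction code is not shown here.

```python
import io
import tarfile
from typing import Iterator


def shift_owner(member: tarfile.TarInfo, uid_base: int,
                uid_count: int) -> tarfile.TarInfo:
    """Shift the owner of one archive member into the pod's uid range."""
    if not (0 <= member.uid < uid_count and 0 <= member.gid < uid_count):
        raise ValueError("%s: owner outside the mapped range" % member.name)
    member.uid += uid_base
    member.gid += uid_base
    # Drop symbolic names so only the numeric ids describe the owner.
    member.uname = ""
    member.gname = ""
    return member


def shifted_members(archive: tarfile.TarFile, uid_base: int,
                    uid_count: int) -> Iterator[tarfile.TarInfo]:
    """Yield every member of the archive with its owner already shifted."""
    for member in archive:
        yield shift_owner(member, uid_base, uid_count)


# Build a tiny archive in memory: one file owned by root, one by uid/gid 33.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    for name, owner in (("rootfs/etc/passwd", 0), ("rootfs/var/www", 33)):
        info = tarfile.TarInfo(name)
        info.uid = info.gid = owner
        tar.addfile(info)
buf.seek(0)

with tarfile.open(fileobj=buf, mode="r") as tar:
    shifted = [(m.name, m.uid, m.gid)
               for m in shifted_members(tar, 200000, 1000)]
print(shifted)
```

The shift happens once per member while the archive is being read, so no second pass over the extracted tree is needed.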

1.2) At mount time:

When a rootfs tree is used by overlayfs, we would need some vfs_uid= shifting option.
This was already mentioned in the LWN article "UID/GID identity and filesystems" http://lwn.net/Articles/637431/
https://lists.linux-foundation.org/pipermail/containers/2014-June/034630.html

volumes bind-mounting: similar issue

2) Network namespace:

On Linux, a network namespace belongs to the user namespace under which it was created, see field user_ns in struct net: http://lxr.free-electrons.com/source/include/net/net_namespace.h#L44.

rkt currently creates the network namespace in network plugins before systemd-nspawn is launched. If the user namespace is to be created by systemd-nspawn, the network namespace will belong to the host user namespace rather than the pod user namespace. AFAIU, this has unwanted consequences for access to /sys/class/net/. See also https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=87a8ebd637dafc255070f503909a053cf0d98d3f

3) Bind mounting:

Bind mounting a socket file from the host to the container? E.g. sdnotify socket (https://bugs.freedesktop.org/show_bug.cgi?id=89844).

4) Passing file descriptors:

Passing fds in the other direction from the container to the host? E.g. the journal dirfd that @krnowak is working on #947.

Model

Our model may touch several layers, from rkt and nspawn down to the kernel.

1) kernel: uidshift on the rootfs

As noted in the Challenges section, we may add a uidshift mount option, not only for containers but also to implement dynamic uids for services. This may allow running unprivileged daemons with dynamically assigned uids without leaking them into the persistent file system.

Implementation:

1.1) A generic vfs mount option?

mount(source, target, "bind", MS_BIND|MS_REMOUNT, vfs_uidshifts)

1.2) An overlayfs mount option or a completely new overlayfs-like fs

Some file systems like NFS are already doing some mapping.

TODO: continue the investigation.

Thanks to @alban for his help on the subject.

@jonboulle jonboulle added this to the v1.0.0 milestone Jun 3, 2015
@iaguis
Member

iaguis commented Jun 4, 2015

I tested adding the argument --private-users to systemd-nspawn in rkt and running it with --no-overlay.

It then failed in prepare-app when bind-mounting /sys from stage1 to stage2 with EINVAL. I looked a bit on the internet and found http://stackoverflow.com/questions/23417521/mounting-proc-in-non-privileged-namespace-sandbox. Adding MS_REC to the bind-mount did the trick.

Then I had to chown -R user:group $CONTAINER_ROOT/stage1/rootfs/rkt/env and chown -R user:group $CONTAINER_ROOT/stage1/rootfs/opt/stage2 because they were owned by root, and then I could successfully start a container with user namespaces!

iaguis    1305  0.0  0.0  19052  7052 pts/2    Ss   14:40   0:00  \_ /bin/bash
root      9883  0.0  0.0  69992  5060 pts/2    S+   15:30   0:00  |   \_ sudo rkt --debug --insecure-skip-verify run -no-overlay -interactive coreos.com/etcd:v2.0.0
root      9884  5.1  0.0  17288  2372 pts/2    S+   15:30   0:02  |       \_ stage1/rootfs/usr/bin/systemd-nspawn --boot --register=true --link-journal=host --private-users=1000:65534 --uuid=f629d909-d811-472e-b254-b427cdafc8db --machine=rkt-f629d909-d811-472e-b254-b427cdafc8db --directory=stage1/rootfs -- --default-standard-output=tty
iaguis    9907  0.0  0.0  25180  3828 ?        Ss   15:30   0:00  |           \_ /usr/lib/systemd/systemd --default-standard-output=tty
iaguis    9915  0.0  0.0   1024     4 ?        Ss   15:30   0:00  |               \_ /waiter
iaguis    9916  0.0  0.0  19064  4084 ?        Ss   15:30   0:00  |               \_ /usr/lib/systemd/systemd-journald
iaguis    9929  0.2  0.0   9512  6324 pts/8    Ssl+ 15:30   0:00  |               \_ /etcd

Notice that systemd and children are owned by my user.

You can test it with: https://github.com/endocode/rkt/tree/iaguis/test-userns

To do the chown I just put in a sleep and did it manually, but when we extract the image we already do a chown, so shifting the uid there should work.

@tixxdz
Contributor Author

tixxdz commented Jun 4, 2015

@iaguis thanks for the test. We will have to investigate the recursive mount flag; where did you add the MS_REC flag?

Taking a look at nspawn it seems that currently /sys is not mounted when userns is set.

@iaguis
Member

iaguis commented Jun 4, 2015

endocode@e4880ff

alban added a commit to endocode/rkt that referenced this issue Jun 9, 2015
We tested with:
sudo bin/rkt --debug  --insecure-skip-verify run -no-overlay -interactive -private-users=10000:65536 ./busybox-latest.aci

The 10000:65536 is hardcoded in a couple of place for now...
alban added a commit to endocode/rkt that referenced this issue Jun 10, 2015
We tested with:
sudo bin/rkt --debug  --insecure-skip-verify run -no-overlay -interactive -private-users=10000:65536 ./busybox-latest.aci

The 10000:65536 is hardcoded in a couple of place for now...
@tixxdz
Contributor Author

tixxdz commented Jun 10, 2015

More updates on the model and how things should work from rkt perspective.

Model

2) rkt and uid-shifts for rootfs

It seems we want rkt to automatically detect whether user namespaces are supported and, if so, just use them. If the user sets some (in)compatible flags, then we have to follow what the user requested. This gives us the following scenarios:

2.1) Automatic support and uid-shifts at mount time

The long term plan is to automatically pass the user namespace option "--private-users" through the runtime manifest to systemd-nspawn. To do this we have to check: CONFIG_USER_NS in the kernel + the not yet implemented uid-shifting at mount time in the kernel + whether rkt has its own global range of UIDs in /etc/subuid. This allows "--private-users" to be set automatically in nspawn.

2.2) Automatic support and uid-shifts at extract time.

If the -no-overlay flag was set + CONFIG_USER_NS in the kernel + rkt has its own global range of UIDs in /etc/subuid, then fall back to shifting UIDs at extract time and automatically set "--private-users" in nspawn.

We are already experimenting with this, please see this PR: #1027

2.3) Add "--private-users=[UIDBASE[:NUIDS]]" flag to rkt.

In this case rkt will take another flag to specify the UID base and UID range to pass to nspawn in case we don't have /etc/subuid. We will fall back to this method only if -no-overlay was set and the kernel supports CONFIG_USER_NS. The UID shifting will be done at extract time, like in 2.2).

We are already experimenting with this, please see the previously noted PR: #1027
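
The three scenarios can be summarized as a small decision function. This is only a sketch: the order in which the cases are tried, the function and enum names, and the /proc/self/ns/user check as a stand-in for CONFIG_USER_NS are all assumptions, not something decided in this issue.

```python
import os
from enum import Enum


class Strategy(Enum):
    MOUNT_TIME = "2.1: shift uids at mount time"
    EXTRACT_TIME_SUBUID = "2.2: shift at extract time, range from /etc/subuid"
    EXTRACT_TIME_FLAG = "2.3: shift at extract time, range from --private-users"
    NO_USERNS = "run without a user namespace"


def kernel_has_userns() -> bool:
    # Heuristic (assumption): this file exists on kernels with CONFIG_USER_NS.
    return os.path.exists("/proc/self/ns/user")


def choose_strategy(has_userns: bool, has_mount_shift: bool,
                    has_subuid_range: bool, no_overlay: bool,
                    private_users_flag: bool) -> Strategy:
    """Pick a uid-shifting strategy from the detected features and flags."""
    if not has_userns:
        return Strategy.NO_USERNS
    if has_mount_shift and has_subuid_range:
        return Strategy.MOUNT_TIME
    if no_overlay and has_subuid_range:
        return Strategy.EXTRACT_TIME_SUBUID
    if no_overlay and private_users_flag:
        return Strategy.EXTRACT_TIME_FLAG
    return Strategy.NO_USERNS


# Today's situation: no kernel uid-shift at mount time, -no-overlay set,
# no /etc/subuid entry, --private-users given on the command line.
print(choose_strategy(True, False, False, True, True).value)
```

Keeping the detection (kernel_has_userns) separate from the decision makes the decision easy to test without touching the host.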

@marineam
Contributor

I find it incredibly odd that a recursive mount magically makes it work, unless it is a security measure to prevent unprivileged containers from exposing something that is covered by a nested mount. I'd have to read the kernel code to see what is going on, I suppose.

Alternatively, why bind mount instead of mounting /sys fresh? Recursive is going to pull in debugfs which probably shouldn't be in the container and the cgroup hierarchy which nspawn will set up different mounts for.

@vcaputo
Contributor

vcaputo commented Jun 16, 2015

I suppose for the overlayfs uid-shift case we'd need overlayfs to selectively shift the [ug]ids on inodes from the lower layer and simply pass-through the upper layer inode [ug]ids? Presumably we don't want containers to be able to create files owned by the host's uid 0 via the mapping applying to the upper layer.

@alban
Member

alban commented Jun 16, 2015

> I find it incredibly odd that a recursive mount magically makes it work, unless it is a security measure to prevent unprivileged containers from exposing something that is covered by a nested mount. I'd have to read the kernel code to see what is going on, I suppose.

It seems to be exactly that, see do_mount/do_loopback

        if (!recurse && has_locked_children(old, old_path.dentry))
                goto out2;

> Alternatively, why bind mount instead of mounting /sys fresh? Recursive is going to pull in debugfs which probably shouldn't be in the container and the cgroup hierarchy which nspawn will set up different mounts for.

The bind mount of /sys we mentioned here is the bind mount from stage1 to stage2, so both the source and the target of the bind mount are in the container. So this should not pull debugfs from the host.

The /sys in stage1 is a fresh mount and debugfs is not requested in the list of fresh mounts.

Recently, cgroups have been mounted in stage1 by rkt with our own choice of read-only/read-write options. nspawn would normally mount them, but it explicitly skips that step when it notices they are already mounted. Similarly, systemd in stage1 would mount them if not already mounted.

@alban
Member

alban commented Jun 16, 2015

> I suppose for the overlayfs uid-shift case we'd need overlayfs to selectively shift the [ug]ids on inodes from the lower layer and simply pass-through the upper layer inode [ug]ids? Presumably we don't want containers to be able to create files owned by the host's uid 0 via the mapping applying to the upper layer.

Interesting question... If the mapping also applies to the upper directory, the container will not be able to create files owned by the host's uid 0 in there (that's a good thing, I guess?) because the host's uid 0 will not be mapped in the container. But then we'll need to find a way to persist and restore the mapping if container restarts (#551) get implemented, because the mapping will be dynamic.

If the mapping does not apply to the upper directory, we cannot prevent files owned by the host's uid 0 from being created (unless the upper directory is itself bind mounted over itself with another [ug]idshift option), but we can remount the overlayfs with a different mapping after a container restart.

/cc @dvdhrm

@tixxdz
Contributor Author

tixxdz commented Jun 17, 2015

The followup issue to track the kernel uidshift at mount time is here: #1057

@tixxdz
Contributor Author

tixxdz commented Jun 22, 2015

Just an update on this issue.

It seems that we will not use the subordinate uid file /etc/subuid to handle UIDs and their range. The new plan is to let systemd machined do the work for us. I have opened this new systemd issue "User namespace: allow machined to handle UIDs and ranges" systemd/systemd#321

If we go this way, this will allow us to not worry at all about UIDs and ranges mapping or locking in rkt. systemd will do the work for us.

If you have any comments on this, please do not hesitate. This allows us to partially solve the "What we are trying to do:" section of this investigation in a clean way.

Thank you!

alban added a commit to endocode/rkt that referenced this issue Jun 29, 2015
We tested with:
sudo bin/rkt --debug  --insecure-skip-verify run -no-overlay -interactive -private-users=10000:65536 ./busybox-latest.aci

The 10000:65536 is hardcoded in a couple of place for now...
alban added a commit to endocode/rkt that referenced this issue Jun 30, 2015
Tested both with "rkt run" and "rkt prepare && rkt run-prepared".

The uid range is not hardcoded anymore.
alban added a commit to endocode/rkt that referenced this issue Jul 29, 2015
We tested with:
sudo bin/rkt --debug  --insecure-skip-verify run -no-overlay -interactive -private-users=10000:65536 ./busybox-latest.aci

The 10000:65536 is hardcoded in a couple of place for now...
tixxdz pushed a commit to endocode/rkt that referenced this issue Aug 4, 2015
Signed-off-by: Djalal Harouni <djalal@endocode.com>
tixxdz pushed a commit to endocode/rkt that referenced this issue Aug 13, 2015
Do the userns chown at extract time.

With the help of:
Alban Crequy <alban@endocode.com>
Simone Gotti <simone.gotti@gmail.com>
Krzesimir Nowak <krzesimir@endocode.com>
Iago López Galeiras <iago@endocode.com>

Signed-off-by: Djalal Harouni <djalal@endocode.com>
tixxdz pushed a commit to endocode/rkt that referenced this issue Aug 14, 2015
Do the userns chown at extract time.

With the help of:
Alban Crequy <alban@endocode.com>
Simone Gotti <simone.gotti@gmail.com>
Krzesimir Nowak <krzesimir@endocode.com>
Iago López Galeiras <iago@endocode.com>

Signed-off-by: Djalal Harouni <djalal@endocode.com>
@tixxdz
Contributor Author

tixxdz commented Aug 17, 2015

An update: the "uid shift with chown without kernel support" task was completed and merged. For the record, it was PR #1250.

For this same task we will add some functional tests later and perhaps do some cleanup. The user namespace support is currently marked experimental.

Thanks!

@lenucksi

👍
