Skip to content
This repository has been archived by the owner on Feb 24, 2020. It is now read-only.

rkt: kernel: uidshifts at mount time #1057

Open
tixxdz opened this issue Jun 17, 2015 · 5 comments
Open

rkt: kernel: uidshifts at mount time #1057

tixxdz opened this issue Jun 17, 2015 · 5 comments

Comments

@tixxdz
Copy link
Contributor

tixxdz commented Jun 17, 2015

This issue is a followup of the UID shifts for the rootfs from the "Investigation user namespaces" issue #986.

In order to take full advantage of user namespaces in rkt we are planning to have uidshifts during mounts. Doing the uidshift on the rootfs as a mount option will allow to run daemons as unprivileged with the dynamically assigned uids and without leaking into the persistent file system.

Model

Background

When a rootfs tree is used by overlayfs, we would need some vfs_uid= shifting option.
This was already mention in this lwn article "UID/GID identity and filesystems" http://lwn.net/Articles/637431/
https://lists.linux-foundation.org/pipermail/containers/2014-June/034630.html

Implementation:

1.1) A generic vfs mount option ?

mount(source, target, “bind”, MS_BIND|MS_REMOUNT, vfs_uidshifts_data)

1.2) An overlayfs mount option or a completely new overlayfs-like fs

Some file systems like NFS are already doing some mapping.

@dvdhrm
Copy link

dvdhrm commented Jun 18, 2015

Just dumping some thoughts while working on this previously:

The idea behind this feature is to allow mounting the same volume multiple times but apply different UID-mappings to each mount. For example, this allows mounting /usr in a user-namespaced container with the same UID-mappings applied as the user-namespace itself. Hence, any access of the container to /usr will happen with the UIDs of inside the container, not the UIDs it has outside the container. For the rest of this article, uid refers to the numerical value a process sees via getuid(); kuid refers to the value the kernel sees in the init_user_ns.

To support such mappings, we must make them a property of a bind-mount (more precisely, a property of struct mount). This is very similar to how the ro (read-only) flag works. You can mount a single volume multiple times and only make some of them read-only.

In the kernel, all metadata that is transferred between file-systems and the kernel is done via struct inode. This object represents the actual data on disk, it does not represent a specific view. Therefore, inode->i_uid must always contain the ID that is written on the disk, not the ID that a process sees. This means, our mapping must be applied afterwards. Whenever inode->i_uid is accessed, it must be mapped according to your access rules. However, the access rules require a mount-point, as each mount-point might have different mappings applied. Therefore, the main work of this feature is to make sure every place that reads or writes inode->i_uid must have a struct vfsmount (alternatively struct mount) context at hand. If that's given, we can create 2 functions:

kuid_t i_read_kuid(const struct inode *inode, const struct vfsmount *mnt);
void i_write_kuid(struct inode *inode, const struct vfsmount *mnt, kuid_t uid);

Those functions would map the kuid_t according to the rules in the vfsmount. To make sure no-one accesses inode->i_uid directly, I'd turn it back into a uid_t instead of kuid_t. Then you can also drop the FS helpers i_read_uid and i_write_uid (notice the uid, not kuid) again.

If this infrastructure is in place, you can be sure that all UIDs are always correctly mapped. No need to fiddle around with kstat and friends. All you need to do is make sure everyone has a struct vfsmount context. Unfortunately, this is where it gets nasty:

Right now, a lot of code in the kernel does not have a vfsmount context. This is especially true for a lot of VFS callbacks (like ->mkdir, ->symlink, ...). All those functions leave it up to the FS to decide which i_uid to assign to any new inode. Therefore, those callbacks must pass down a vfsmount. But this requires to change _all_ those callbacks in all file systems.

What I would do (and which I partly did so far), is to change struct inode->i_uid to an uid_t and disable all CONFIG_* options for file-systems I don't care about. Then I hacked on these two helper functions I described above. I added a hack to allow passing NULL as vfsmount (which just skips the mapping / identity mapping) so I can at least make the kernel compile (after fixing the 100 accesses that were left). Then I changed the core syscalls (like kstat) to pass the vfsmount to those helpers. Most calls still bypass the mapping, but at least i can start testing fstat in user-space to see whether the mapping worked.
Regarding the mapping itself, I added a boolean option (just like 'ro' for read-only) to the bind-mount command. all this option does is to pin the user-namespace you had at time of bind-mount and remember it in struct mount. I then use this user-namespace mapping inside of my two helper functions. In user-space, you now simply create a user-namespace, setup your mapping, bind-mount your file-system and then abort(). The bind-mount survives, but your user-namespace is destroyed when your process exits. In the parent, you now got a bind-mount with your mapping applied.
This way, you avoid writing your own mapping-parser...

I think that's it so far. Feel free to setup a public github repo and I'll help out!

@alban
Copy link
Member

alban commented Jun 18, 2015

@dvdhrm thanks for the write-up!

disable all CONFIG_* options for file-systems I don't care about.

In order to avoid implementing it in every possible filesystem on Linux, but still have something correct, would it help to add a flag on (struct file_system_type)->fs_flags named FS_ALLOW_UID_SHIFT to specify whether using the uid-shifting mount options is allowed on that specific filesystem?

In this way, an initial patch could be accepted upstream without modifying most filesystems.

It would be similar to how FS_USERNS_MOUNT works for disallowing new mounts from most filesystems in a user namespace.

Regarding the mapping itself, I added a boolean option (just like 'ro' for read-only) to the bind-mount command [...] This way, you avoid writing your own mapping-parser...

Is your boolean option temporary or do you want to keep that API? What's the name of the option? pin_userns?

How should it work with stackable uid mapping (with stackable user namespaces)?

What happens if you bind mount a source directory which already had a mapping? Do you stack the two mappings?

What if you bind mount recursively /a to /b with MS_REC & pin_userns? Would /a/subdir be bind mounted to /b/subdir with the userns mapping from pin_userns stacked on top of the initial /a/subdir mapping, if any?

If the uid mappings are not stackable, how to prevent a process in a user namespace to revert the effect of the uid mapping by replacing the mapping by another one (or by an empty mapping)?

@dvdhrm
Copy link

dvdhrm commented Jun 18, 2015

In order to avoid implementing it in every possible filesystem on Linux, but still have
something correct, would it help to add a flag on (struct file_system_type)->fs_flags named
FS_ALLOW_UID_SHIFT to specify whether using the uid-shifting mount options is allowed on
that specific filesystem?

In this way, an initial patch could be accepted upstream without modifying most filesystems.

You cannot do that. You really have to change inode->i_uid back to the uid_t type, and this will break a lot of code. If you don't do this switch, you use incorrect types in that structure and you might even end up forgetting to convert some code (if you change it, you get proper compiler errors).

However, what you can do is to make all those filesystems pass NULL as vfsmount and as such don't support the mapping. This would already reduce the amount of work significantly. This can then be supported with such a flag FS_ALLOW_UID_MAP.

Is your boolean option temporary or do you want to keep that API?

Temporary.

I really think the API should re-use the struct uid_gid_map from user-namespaces, though. Otherwise, we're not compatible to uid-mappings which I think we really should be.

Maybe the logic I described will work out in the end. But I haven't thought it through. This is really just for simple 1-layer testing. No stacking involved, etc as it is, imho, an orthogonal problem.

@tixxdz tixxdz self-assigned this Jun 22, 2015
@alban
Copy link
Member

alban commented Jun 29, 2015

@tixxdz added some kernel selftests (still not complete) in this branch:
https://github.com/systemd/linux/commits/djalal-uidshift-test
More implementation patches will be added after.

@tixxdz
Copy link
Contributor Author

tixxdz commented Jul 22, 2015

Some updates, currently we have a blocker:

In commit systemd/linux@a628126 of the test branch we pass the vfsmount context to generic_fillattr() so we can do the uidshift mapping inside that function and let all the vfs functions and other filesystems take advantage of it, however later in all the stat64(), lstat64(), fstat64() syscalls and others, the uid mapping is overwritten in cp_new_stat64() http://lxr.free-electrons.com/source/fs/stat.c#L363.

cp_new_stat64() is one of the last functions that is called to copy the stat struct to user space, inside that function the uid and gid fields are mapped again to the current_user_ns(). What we want is that the uid and gid fields should stay mapped into the userns that was pinned during bind mount.

Solution 1):
In order to solve this we may want to update all the vfs_stat() helpers family to pass the userspace statbuf address this way we can call cp_new_stat64() and others inside vfs_stat() function where we have a 'path->mnt' context to access the pinned userns and use it for the translation instead of current_user_ns().

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

6 participants