Exploring unprivileged containers

This series of articles will explore creating unprivileged containers on Linux, using Python and shell commands for ease of experimenting.

A container is something you construct using various system facilities, such as Linux namespaces, to isolate a process or a group of processes from the rest of the system. This isolation can be partial, or almost complete, giving the illusion of a separate machine.

A fairly recent distribution will be required to do everything that will be demonstrated here. This series of articles was prepared on a Fedora 24 distribution which has a 4.7 kernel at the time of writing.

The user namespace

A user namespace isolates user and group IDs, the root directory and other security-related attributes such as capabilities. There is a top level user namespace, and new child namespaces can be created. The user IDs and privileges can be different inside and outside the new namespace. In particular, a process can have an ordinary unprivileged user ID outside of the namespace (in the top level root namespace), but have a user ID of 0 and full privileges for operations within the namespace.

Now the question is: what does it mean to operate within the namespace? Just because you are root within the namespace does not mean that you can suddenly edit /etc/passwd. You can only affect resources that are governed by the new namespace and any other namespace associated with it. The particular details and restrictions that are imposed are what most of this series is about.

Unless granted further permissions by a privileged user or process, it is not possible to access or affect a resource owned by another user that you could not access or affect outside the namespace.

By 'not possible' of course I mean in the absence of any bugs.

This is the only namespace that can be created by an unprivileged user. Within that user namespace, however, you have the privilege required to create the other namespaces. You can also create them all in the same call to unshare or clone; in effect the kernel creates the user namespace first and creates the other namespaces within that context.

There is a lot of information in man user_namespaces.

Some code

We will start with just the user namespace by itself. Alone this is not particularly useful, but there is plenty to discuss that will be used in later parts of this series.

Since we need access to the 'unshare' system call and this is not available in the standard Python libraries, we will use cffi to access it. Create a virtualenv, activate it and install the 'cffi' package. Alternatively you could install the system package 'python2-cffi' with dnf/yum, or the equivalent package on other distributions.

So let's start with the simplest piece of code to access the unshare call.

from __future__ import print_function, unicode_literals

import os
from cffi import FFI

CLONE_NEWUSER = 0x10000000


ffi = FFI()

ffi.cdef('''
int unshare(int flags);
''')

libc = ffi.dlopen(None)

Now add the following short snippet of code to actually create the namespace and run a shell inside it.

# Create the user namespace; unshare returns -1 on failure
if libc.unshare(CLONE_NEWUSER) != 0:
    raise OSError(ffi.errno, os.strerror(ffi.errno))

# Print out user id and process id to use later
print("user id = %d, process id = %d" % (os.getuid(), os.getpid()))

# Run a shell in this namespace
os.execlp('/bin/bash', 'bash')

Open two terminals: run this program in one, and stay logged in as the same user in the other. I will use the terms 'inside' for the shell running in the namespace and 'outside' for the regular terminal shell.

Results

After running the program you will be in a shell so you can explore.

$ id
uid=65534(nfsnobody) gid=65534(nfsnobody) groups=65534(nfsnobody) context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023

$ ls -l
total 16
-rw-rw-r--. 1 nfsnobody nfsnobody 419 Sep 14 15:25 part1.py
-rw-rw----. 1 nfsnobody nfsnobody   1 Sep 13 21:51 group_write
-rwx------. 1 nfsnobody nfsnobody   1 Sep 13 21:51 only_owner_write
-rw-rw-r--. 1 nfsnobody nfsnobody   1 Sep 13 22:35 owned_by_root

$ ls -l /
total 68
lrwxrwxrwx.   1 nfsnobody nfsnobody     7 Feb  3  2016 bin -> usr/bin
dr-xr-xr-x.   6 nfsnobody nfsnobody  4096 Sep  3 08:33 boot
drwxr-xr-x.  22 nfsnobody nfsnobody  4800 Sep  7 12:08 dev
drwxr-xr-x. 167 nfsnobody nfsnobody 12288 Sep 14 09:38 etc
drwxr-xr-x.   8 nfsnobody nfsnobody  4096 Feb  3  2016 home
lrwxrwxrwx.   1 nfsnobody nfsnobody     7 Feb  3  2016 lib -> usr/lib
lrwxrwxrwx.   1 nfsnobody nfsnobody     9 Feb  3  2016 lib64 -> usr/lib64
drwx------.   2 nfsnobody nfsnobody 16384 Jun 29  2015 lost+found
drwxr-xr-x.   2 nfsnobody nfsnobody  4096 Feb  3  2016 media
[...]

$ getpcaps $$
Capabilities for `21318': =

What we have here is a user namespace without any mapping to the outside user IDs. By default all unmapped IDs appear as uid 65534, the kernel's overflow ID. This is displayed as nobody or nfsnobody depending on what is in /etc/passwd on your system.

Other things to note:

  • If you create a new file, its owner is also shown as nobody.
  • Outside the container, however, the newly created file is owned by yourself.
  • Everyone appears to be nobody.
  • The process has no special privileges.
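
The value 65534 is not fixed by the tools: it is the kernel's overflow user ID, which on Linux can be read from /proc/sys/kernel/overflowuid. The following is a minimal sketch of reading it from Python; the helper name read_overflow_id is made up for illustration and is not part of the original program.

```python
def read_overflow_id(path="/proc/sys/kernel/overflowuid"):
    """Return the ID that the kernel reports for unmapped user IDs.

    The default path is the Linux sysctl file; pass a different path
    (for example /proc/sys/kernel/overflowgid) to read the group value.
    """
    with open(path) as f:
        return int(f.read().strip())
```

Calling read_overflow_id() on a typical system should give the same number that id printed above.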

Mapping user ids

To do something more useful, user IDs have to be mapped from the new inside namespace to the outside host system.

There is a file in /proc that controls this. It is located at /proc/self/uid_map. Replace 'self' with a process id to view the uid_map of another process.

You can just cat this file. Outside, in the root user namespace, there is a no-op mapping of every possible user ID to itself.

$ cat /proc/self/uid_map
         0          0 4294967295

Inside a new namespace the file is empty indicating that there is no mapping present.

So it is time to create one. Each line holds three numbers: the first is the starting user ID inside the namespace, the second is the starting user ID outside in the parent namespace, and the third is how many user IDs there are in the range. You can have more than one line, but they have to be written all at once, because it is only possible to write the file once.
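
To make the three-number format concrete, here is a small sketch that parses the text of a uid_map file and translates an inside ID to the outside one. The function names parse_id_map and to_outside are invented for this illustration; the kernel does the real translation.

```python
from __future__ import print_function


def parse_id_map(text):
    """Parse the contents of a uid_map or gid_map file.

    Returns a list of (inside_start, outside_start, count) tuples.
    """
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        inside, outside, count = (int(f) for f in fields)
        entries.append((inside, outside, count))
    return entries


def to_outside(entries, inside_id):
    """Translate an ID inside the namespace to the parent namespace.

    Returns None when the ID is not covered by any range.
    """
    for inside, outside, count in entries:
        if inside <= inside_id < inside + count:
            return outside + (inside_id - inside)
    return None


# The no-op mapping seen in the top level namespace
identity = parse_id_map("         0          0 4294967295\n")
print(to_outside(identity, 1001))   # 1001

# A single-ID mapping: 0 inside is 1001 outside
single = parse_id_map("0 1001 1\n")
print(to_outside(single, 0))        # 1001
print(to_outside(single, 1))        # None
```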

The first thing to note is that you cannot write to the uid_map file from the shell inside the namespace.

# Try inside the container:

# Use the user id and process id that was printed at startup
# instead of 1001 and 23235 respectively
$ echo 0 1001 1 > /proc/23235/uid_map
bash: echo: write error: Operation not permitted

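OK, so this shell cannot write the file: it holds no capabilities in the new namespace (see the empty getpcaps output above), and writing uid_map requires CAP_SETUID there. For now, write it from a process outside instead, using the same values at another terminal. This time there should be no error.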
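
If you would rather do the outside write from Python than with echo, something like the following sketch should work. The names format_id_map and write_uid_map, and the proc_root parameter (there only so the function can be tried against a scratch directory), are assumptions of this example. The whole map is sent in one write() call because the file can only be written once.

```python
import os


def format_id_map(entries):
    """Build uid_map text from (inside, outside, count) tuples."""
    return "".join("%d %d %d\n" % entry for entry in entries)


def write_uid_map(pid, entries, proc_root="/proc"):
    """Write the whole map for process `pid` with a single write() call."""
    path = os.path.join(proc_root, str(pid), "uid_map")
    data = format_id_map(entries).encode("ascii")
    fd = os.open(path, os.O_WRONLY)
    try:
        os.write(fd, data)
    finally:
        os.close(fd)
```

With the example values used above, write_uid_map(23235, [(0, 1001, 1)]) run from the outside terminal is the equivalent of the echo command.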

Go back to the inside terminal and we see the following:

$ id
uid=0(root) gid=65534(nfsnobody) groups=65534(nfsnobody) context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023

$ ls -l
total 16
-rw-rw-r--. 1 root      nfsnobody 419 Sep 14 15:25 part1.py
-rw-rw----. 1 root      nfsnobody   1 Sep 13 21:51 group_write
-rwx------. 1 root      nfsnobody   1 Sep 13 21:51 only_owner_write
-rw-rw-r--. 1 nfsnobody nfsnobody   1 Sep 13 22:35 owned_by_root

$ echo > fred
$ ls -l fred
-rw-rw-r--. 1 root nfsnobody 1 Sep 15 11:05 fred

$ getpcaps $$
Capabilities for `24651': =

$ exec bash
$ getpcaps $$
Capabilities for `24651': = cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_nice,cap_sys_resource,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,cap_audit_control,cap_setfcap,cap_mac_override,cap_mac_admin,cap_syslog,cap_wake_alarm,cap_block_suspend,37+ep

$ chroot /
$ exit

$ mount -t proc proc /proc
mount: permission denied

Now your own user ID outside is mapped to user ID 0 (root) inside. So any file that was owned by you now shows as being owned by root, and any file owned by any other user is still shown as owned by nobody.

  • Any file you create will be owned by root inside, and by yourself outside.
  • You cannot access any file that you could not access as yourself.
  • You still have no privileged capabilities in the shell.
  • After exec'ing another shell you do have privileges in the new shell. This looks as if it goes against the documentation, since the privileges gained on creating the namespace were lost on exec'ing the first shell. The likely explanation is that exec recalculates capabilities, and the process now has user ID 0 inside the namespace. You can't rely on this: as will be seen later, it does not happen except in this case.
  • It is possible to call chroot.
  • You still can't perform root-only operations such as mounting filesystems.

Since the uid_map file can only be written to once, to find out what happens when writing different numbers, you have to exit and restart each time. You will find:

  • The first number can be anything, and you will have that uid inside the namespace.
  • The second number must be your own uid; any other value fails with operation not permitted.
  • The third number must be 1. Anything larger fails with operation not permitted, and a value of zero is simply invalid.
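
These three observations can be restated as a tiny predicate. This is only a model of the rules listed above for an unprivileged writer, not the kernel's actual check, and the function name is made up for the example.

```python
from __future__ import print_function


def unprivileged_map_ok(inside, outside, count, own_uid):
    """Model of the observed rules for a single uid_map line.

    inside  - any ID is accepted; it becomes your uid in the namespace
    outside - must be the uid of the user doing the writing
    count   - must be exactly 1
    """
    return outside == own_uid and count == 1


# Assuming your own uid outside is 1001, as in the earlier example
print(unprivileged_map_ok(0, 1001, 1, own_uid=1001))    # True
print(unprivileged_map_ok(42, 1001, 1, own_uid=1001))   # True
print(unprivileged_map_ok(0, 1002, 1, own_uid=1001))    # False
print(unprivileged_map_ok(0, 1001, 2, own_uid=1001))    # False
```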

Writing uid_map as root

The previous section applies to an unprivileged user, and those restrictions have to apply since the system cannot allow an unprivileged user to have a different user id outside the container.

For a root user (or a process with the CAP_SETUID capability), things are very different: you can set up a mapping to any outside user, including root. Just because you can do this doesn't mean that you should; for most normal uses you will map to a completely unused range of user IDs on the outside of the container.

If you just try this out, you'll probably find that it doesn't quite work. This is because the uid map has to be set before the shell is executed.

Just for now add a sleep and a call to setuid to 0 before executing the shell.

libc.unshare(CLONE_NEWUSER)

import time
time.sleep(20)

# The uid must be set to 0 to avoid losing capabilities when
# executing the shell.
os.setuid(0)
os.execlp('/bin/bash', 'bash')

Now write '0 100000 2000' into the uid_map as root while the process is sleeping. As long as you are quick enough, when the shell runs it will be as root inside the container. Outside the container you will be user id 100000 and 2000 user ids will be available. By this I mean that 0 maps to 100000, 1 maps to 100001... all the way to 1999 inside mapping to 101999 outside.

You can show this latter point using the Python shell.

$ python
Python 2.7.12 (default, Aug  9 2016, 15:48:18) 
[GCC 6.1.1 20160621 (Red Hat 6.1.1-3)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.seteuid(0)
>>> os.seteuid(1999)
>>> os.seteuid(0)
>>> os.seteuid(2000)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OSError: [Errno 22] Invalid argument

As you can see you are able to set your effective uid to 1999, but 2000 is just not available. It is not a case of permission denied; it just doesn't exist for the process.

Using newuidmap

There is a setuid program called newuidmap that is part of the shadow-utils package. This can be run by unprivileged users to set the uid map for a process. Of course it cannot be allowed to set any random user ID as the outside user ID, so there is a file /etc/subuid that lists the user IDs each user is allowed to use.

The file looks like this:

tinka:100000:65536
tashy:200000:65536

Here the user tinka is allowed to set the 65536 user IDs starting at 100000 (100000 to 165535), and tashy is allowed to set the same number of user IDs starting at 200000. Unless you want to do something special, you normally allocate a range of uids at some high value that does not overlap any system uids or the sub IDs of any other unrelated user.
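
The name:start:count layout is easy to read from Python. The sketch below parses text in that format and works out the last ID in each range; parse_subid and last_id are names invented for the illustration, and the sample text is the two lines shown above.

```python
from __future__ import print_function


def parse_subid(text):
    """Parse text in the /etc/subuid (or /etc/subgid) format.

    Returns a dict mapping each name to a list of (start, count) tuples,
    since a user may have more than one range.
    """
    ranges = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        name, start, count = line.split(":")
        ranges.setdefault(name, []).append((int(start), int(count)))
    return ranges


def last_id(start, count):
    """Highest ID contained in a range."""
    return start + count - 1


SAMPLE = "tinka:100000:65536\ntashy:200000:65536\n"
subuids = parse_subid(SAMPLE)
print(subuids["tinka"])                 # [(100000, 65536)]
print(last_id(*subuids["tinka"][0]))    # 165535
```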

You can use the usermod program with the -v option (-w for group IDs) to add to this file. It's possible to add more than one allowed range.

Now newuidmap can be used to set the uid map.
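
newuidmap takes the target process ID followed by one or more inside/outside/count triples, the same three numbers that go into uid_map. As a hedged sketch, the helper below (newuidmap_argv is a made-up name) only builds that command line; the process ID and range are the example values used earlier in this article.

```python
from __future__ import print_function


def newuidmap_argv(pid, entries):
    """Build the argument list: newuidmap pid inside outside count [...]"""
    argv = ["newuidmap", str(pid)]
    for inside, outside, count in entries:
        argv.extend([str(inside), str(outside), str(count)])
    return argv


print(" ".join(newuidmap_argv(23235, [(0, 100000, 65536)])))
# newuidmap 23235 0 100000 65536
```

The resulting list can be handed to subprocess.check_call from a process outside the namespace, run as the user who owns the range in /etc/subuid.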

Same thing for group ids... almost

Everything that has been discussed about user IDs applies to group IDs. Just replace uid with gid: gid_map, newgidmap etc.

The exception is that you are not able to write the /proc/<pid>/gid_map file without first disabling setgroups(). You do this by writing 'deny' to /proc/<pid>/setgroups.

This restriction exists because a group can be used to deny permission to a file that others outside the group are allowed to see (e.g. with a permission like -rw----r--). If you were able to drop a group membership you would be able to access that file.
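
The reason dropping a group can grant access is that only the first matching permission class is consulted. The following is a simplified model of that check, ignoring capabilities and ACLs; the function name and the IDs are hypothetical.

```python
from __future__ import print_function

import stat


def may_read(mode, file_uid, file_gid, uid, groups):
    """Simplified model of the classic read permission check.

    Only the first matching class is used: owner, then group, then other.
    """
    if uid == file_uid:
        return bool(mode & stat.S_IRUSR)
    if file_gid in groups:
        return bool(mode & stat.S_IRGRP)
    return bool(mode & stat.S_IROTH)


MODE = 0o604  # -rw----r--

# Hypothetical file owned by uid 0 and group 500
print(may_read(MODE, 0, 500, uid=1001, groups={1001, 500}))  # False
print(may_read(MODE, 0, 500, uid=1001, groups={1001}))       # True
```

A member of group 500 is refused, but the same user without that group falls through to the 'other' bits and is allowed.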

This restriction does not apply if set by a privileged user.
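
For the unprivileged case the order of the two writes matters. Here is a minimal sketch of that sequence; the function names and the proc_root parameter (present only so the code can be tried against a scratch directory) are assumptions of this example.

```python
import os


def _write_once(path, text):
    """Send `text` to an existing file with a single write() call."""
    fd = os.open(path, os.O_WRONLY)
    try:
        os.write(fd, text.encode("ascii"))
    finally:
        os.close(fd)


def write_gid_map(pid, entries, proc_root="/proc"):
    """Disable setgroups() for `pid`, then write its gid_map.

    `entries` is a list of (inside, outside, count) tuples.
    """
    base = os.path.join(proc_root, str(pid))
    # Step 1: give up the ability to call setgroups() in the namespace
    _write_once(os.path.join(base, "setgroups"), "deny")
    # Step 2: only now may an unprivileged user write gid_map
    lines = "".join("%d %d %d\n" % entry for entry in entries)
    _write_once(os.path.join(base, "gid_map"), lines)
```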

Conclusions

This was quite a long article, and yet there are only a few interesting things we can construct with just the user namespace. For example, you can create a complete root file system and chroot into it, which is something that an unprivileged user cannot normally do. If you want to build and test application software in a known environment without affecting the host system, this could well be enough.

For more isolation, you will need to create other namespaces associated with the user namespace, which will be the subject of later articles.