Unprivileged containers: some code

The previous article was messy from a code point of view, so before going on to the other namespaces let's get something a little more useful. There is nothing new here about containers, so you can skip this, but future episodes of this series will be based on this code.

Previously we used the unshare call, which detaches the current process from its existing namespace and attaches it to a new one. For the rest of the series, however, we are going to use the clone call. This creates a new process and places it in the new namespace(s) all at once. The parent process is then able to set up the user and group mappings and tell the child process when this has been completed. As we found out previously, it is necessary to wait until the user/group mappings have been set up before executing a new process in the new namespaces, or indeed before attempting to use the new permissions.
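The hand-off between parent and child can be tried out without any namespaces at all. The following is a minimal sketch using a plain os.fork() and a pipe; the names run_after_parent_setup and demo are made up for this illustration, and a marker file stands in for the user and group mappings.

```python
import os
import tempfile


def run_after_parent_setup(setup, child_func):
    """Fork, let the parent run setup(pid), then release the child.

    Returns the exit code of the child.
    """
    rfd, wfd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: close our copy of the write end, otherwise the read
        # below would never see end-of-file.
        os.close(wfd)
        os.read(rfd, 1)  # blocks until the parent closes its write end
        os.close(rfd)
        try:
            code = child_func()
        except BaseException:
            code = 1
        os._exit(code)

    # Parent: the child is parked on the pipe while we do the setup.
    os.close(rfd)
    setup(pid)      # in the real code this is where the maps get written
    os.close(wfd)   # end-of-file on the pipe wakes the child
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status)


def demo():
    """Return 0 if the child only ran after the parent's setup step."""
    marker = os.path.join(tempfile.mkdtemp(), 'ready')

    def setup(pid):
        open(marker, 'w').close()

    def child():
        return 0 if os.path.exists(marker) else 1

    return run_after_parent_setup(setup, child)
```

Nothing is ever written to the pipe. The child wakes up simply because the last write end has been closed, which is why it must close its own copy first.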

Wrapping the clone call

This is in the file system.py in the example code. First we use cffi to expose the clone() call from the system library.

# coding=utf-8
from __future__ import print_function
from __future__ import unicode_literals

import signal

import six
from cffi import FFI

# For clone and related process orientated system calls.
ffi = FFI()
ffi.cdef("""
#define CLONE_NEWCGROUP         0x02000000      /* New cgroup namespace */
#define CLONE_NEWUTS            0x04000000      /* New utsname namespace */
#define CLONE_NEWIPC            0x08000000      /* New ipc namespace */
#define CLONE_NEWUSER           0x10000000      /* New user namespace */
#define CLONE_NEWPID            0x20000000      /* New pid namespace */
#define CLONE_NEWNET            0x40000000      /* New network namespace */
#define CLONE_NEWNS             0x00020000      /* New mount namespace group */

#define CLONE_VM                0x00000100      /* set if VM shared between processes */

#define SIGCHLD     17

int unshare(int flags);

int clone(int (*fn)(void *), void *child_stack,
                 int flags, void *arg, ...
                 /* pid_t *ptid, struct user_desc *tls, pid_t *ctid */ );
""")

libc = ffi.dlopen(None)

The second part of this file is a normal Python function that hides the cffi and low-level details. There are a few things to note.

  • The function passed to the clone system call must take a single argument. Here, however, we allow multiple positional and keyword arguments: the user supplied function is wrapped by a single-argument function, and it is that wrapper which is actually passed to clone. The wrapper forms a closure over the arguments supplied to the Python clone routine.
  • It must also return an integer. The wrapper ensures this by examining the return value of the user supplied function. If it is None, it is converted to zero; any other non-integer value is converted to 1.
  • The low byte of the flags passed to clone is set by this routine to the value of SIGCHLD, so that normal signalling will occur when the process exits.

Here is the code.

STACK_SIZE = 2 * 1024 * 1024

def clone(func, flags=0, *args, **kwargs):
    """
    Wrap the clone system call with a standard python function.

    :param func: The python function to run in the cloned process.  It is
                 called as func(*args, **kwargs).
    :param flags: Flags that get passed into clone().  The SIGCHLD value will be added.
    :return: The process id of the newly created process.
    """

    # clone requires a chunk of memory to use as a stack.  We need a pointer
    # to the end of the memory, since stacks usually grow down.  If you happen
    # to be on an architecture where this is not true, you will have to modify.
    stack = ffi.new('char[]', STACK_SIZE)
    stack_top = stack + STACK_SIZE

    # Have to wrap as a ffi callback function.
    @ffi.callback('int (void *)')
    def _run(_):
        r = func(*args, **kwargs)
        return int_value(r)

    # Note that we don't pass 'arg' via the system call.  We could, but there
    # is no need to, as it is available in the closure formed by _run().
    return libc.clone(_run, stack_top, flags | signal.SIGCHLD, ffi.NULL)

def int_value(r):
    # type: (any) -> int
    """
    The return value has to be an integer.

    If the return already is one, then just return it.

    If it is None, this is the normal return when nothing is returned from a python
    function, so return 0 in this case.

    For anything else, return 1 since it is an error.

    :param r: The original return value from the user defined function.
    :return: Our return value which is always an integer.
    """
    if isinstance(r, int):
        return r
    elif r is None:
        return 0
    else:
        return 1
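
To make the point about the low byte of the flags concrete, here is a small sketch. CSIGNAL is the mask that the Linux headers define for the exit-signal byte; with_exit_signal is a made-up helper for this illustration. Note that it masks the low byte before setting it, which is a slight variation on simply OR-ing SIGCHLD into the flags.

```python
import signal

CLONE_NEWUSER = 0x10000000  # New user namespace
CSIGNAL = 0x000000FF        # mask for the exit-signal byte of the flags


def with_exit_signal(flags, sig=signal.SIGCHLD):
    """Return flags with the low byte replaced by the given signal number."""
    return (flags & ~CSIGNAL) | int(sig)


flags = with_exit_signal(CLONE_NEWUSER)
```

The namespace bits all live above the low byte, so the signal number and the CLONE_* flags never collide.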

A namespace class

So here is a simple class that we will build on in future articles in the series.

It creates a user namespace and sets up the user and group mappings before running the user supplied function.

It does assume that the available subordinate user ids start at 300000, so if you have a different range, you will need to modify the code to match. It's simple enough to read the /etc/subuid file to discover the available range for a user, but this is left to the reader to implement.
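
For anyone who wants a starting point, reading the range might look something like the sketch below. It assumes the usual name:start:count layout of /etc/subuid and /etc/subgid; the function names and the user 'alice' are made up for the example, and lines that do not have exactly three fields are simply skipped.

```python
def subid_ranges(text, user):
    """Return a list of (start, count) pairs for user.

    text is the content of a subuid or subgid style file, one
    name:start:count entry per line.
    """
    ranges = []
    for line in text.splitlines():
        parts = line.strip().split(':')
        if len(parts) != 3 or parts[0] != user:
            continue
        ranges.append((int(parts[1]), int(parts[2])))
    return ranges


def read_subid_ranges(path, user):
    """Convenience wrapper that reads the file at path."""
    with open(path) as f:
        return subid_ranges(f.read(), user)


# Hypothetical file content, for illustration only.
sample = 'alice:300000:65536\nbob:400000:65536\n'
alice_ranges = subid_ranges(sample, 'alice')
```

The first start value returned by read_subid_ranges('/etc/subuid', name) could then replace the hard coded 300000, and likewise for /etc/subgid.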

Note how a pipe is used to prevent the child progressing until the parent has set up the user and group mappings.

from __future__ import print_function
from __future__ import unicode_literals

import subprocess

import os
import six

from system import libc, clone

# Change this to match the outside uid in /etc/subuid
OUTSIDE_MIN_UID = 300000
OUTSIDE_MIN_GID = 300000  # Same but for group id

class ContainerBase(object):
    """
    A base class to construct a container.

    By itself it creates a new user namespace and can run a user supplied
    function in that environment.

    Subclasses should set the namespace_flags to include the desired
    namespaces, and override setup() to perform required setup.
    """

    # The namespace flags.  We will set this to a different set of flags
    # in subclasses later in the series.
    namespace_flags = libc.CLONE_NEWUSER  # type: int

    def __init__(self):
        self.pid = None

    def run(self, func, *args, **kwargs):
        """
        Run the supplied function in a new namespaced child process.

        :param func: User supplied function to run in the child namespaced process.
        :param args, kwargs: Any arguments for func
        :return: The process id of the new process.
        """

        (rfd, wfd) = os.pipe()

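        # We wrap the supplied function so that we can wait until the
        # parent process has set up our environment before calling the
        # user supplied function.
        def _run(*args, **kwargs):
            # Close our copy of the write end, otherwise the read below
            # never sees end-of-file when the parent closes its copy.
            os.close(wfd)
            os.read(rfd, 1)
            os.close(rfd)

            # Now we should be able to set uid=0, gid=0
            os.setuid(0)
            os.setgid(0)

            # Call function for subclasses
            self.setup()

            # Now the function will be run as root inside the namespace
            return func(*args, **kwargs)

        pid = clone(_run, self.namespace_flags, *args, **kwargs)
        self.pid = pid
        os.close(rfd)

        # Set up the user and group mappings for the child.
        self.setup_user_maps()

        # Signal to the child that we have finished setting up the namespace
        # by closing the pipe, no need to write anything!
        os.close(wfd)
        return pid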

    def setup(self):
        """
        Override in subclasses to set up the environment prior to the
        user supplied function being called.  The user and group ids
        have already been set to 0.
        """

    def wait(self):
        if self.pid:
            return os.waitpid(self.pid, 0)

        raise ValueError('No process to wait for')

    def setup_user_maps(self):
        """
        Set up the user and group mappings.

        We are using the newuidmap and newgidmap programs.
        """
        inside_low = 0
        outside_low = OUTSIDE_MIN_UID
        count = 2000

        for cmd in ('uid', 'gid'):
            if cmd == 'gid':
                outside_low = OUTSIDE_MIN_GID
            cmdlist = ['new%smap' % cmd, six.text_type(self.pid)]
            cmdlist.extend([six.text_type(s) for s in (inside_low, outside_low, count)])
            # Run the helper.  check_call is an assumption here, any
            # subprocess call that runs cmdlist would do.
            subprocess.check_call(cmdlist)

To use this

Make sure that you have subuid and subgid ranges from 300000 specified in /etc/subuid and /etc/subgid, or modify the code to match your actual values.

Now a few lines of code will give you a 'root' shell inside a user namespace.

# coding=utf-8
import os

from base import ContainerBase
from system import clone, libc

cb = ContainerBase()
cb.run(os.system, 'bash')
# Wait for the shell to exit before the parent goes away.
cb.wait()

See how you are root inside the container, but if you manage to create any file it will be owned by user 300000 outside the container.
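
If you want to check the mapping from inside the shell, the kernel exposes it in /proc/self/uid_map and /proc/self/gid_map as lines of three numbers: the first id inside the namespace, the first id outside, and the length of the range. A small sketch for translating ids follows; parse_id_map and outside_id are names made up for this example, and the sample text mirrors the mapping used in this article.

```python
def parse_id_map(text):
    """Parse uid_map or gid_map text into (inside, outside, count) tuples."""
    mappings = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 3:
            mappings.append(tuple(int(f) for f in fields))
    return mappings


def outside_id(mappings, inside):
    """Translate an id inside the namespace to the id seen outside.

    Returns None when the id is not covered by any mapping.
    """
    for inside_low, outside_low, count in mappings:
        if inside_low <= inside < inside_low + count:
            return outside_low + (inside - inside_low)
    return None


# What the map is expected to look like with the values used above.
sample = '         0     300000       2000\n'
mappings = parse_id_map(sample)
root_outside = outside_id(mappings, 0)
```

Inside the container you could feed it open('/proc/self/uid_map').read() instead of the sample text.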