Introduction

One of the key properties of Nestybox system containers is that they support running system-level software (such as systemd and Docker) without resorting to insecure privileged containers.

This is made possible by Nestybox’s container runtime, Sysbox, which enables Docker to deploy system containers and sets up the system container abstraction.

This article describes some important security features and benefits of Nestybox system containers. These are all specific to Linux as we don’t currently support system containers on other platforms.

Privileged Container Risks

Since system containers provide an alternative to Docker privileged containers for running system-level workloads, let’s recap some of the risks of using privileged Docker containers (i.e., those running with the Docker --privileged flag) and why it’s not a good idea to use them in general.

Privileged Docker containers are typically used to run workloads that require deep interaction with the underlying kernel. For example, Docker’s official Docker-in-Docker (DinD) image requires a privileged container.

The main problem with Docker privileged containers is that they are very insecure.

When you launch a Docker container with the --privileged flag, you get a container whose root user is actually root on the host, has all process capabilities enabled, has access to all host devices, and can read or write system-wide kernel controls via procfs (/proc) and sysfs (/sys).

In other words, a process within the container can easily gain control of the host. For example, from within the privileged container you can reboot the host by simply running:

$ echo 1 > /proc/sys/kernel/sysrq && echo b > /proc/sysrq-trigger

This lack of isolation means that, at a minimum, workloads that run within the privileged container must be fully trusted. But even then, any unintended action or bug in the container’s programs can mess up your host configuration.

Using privileged Docker containers is risky at best, and should be avoided when possible.

System Container Isolation Features

Nestybox system containers provide a much more secure alternative to Docker privileged containers.

They are designed to run the same workloads as privileged containers, but with stronger isolation from the underlying host.

Below we briefly describe some of the key isolation features currently present in Nestybox system containers.

Linux User Namespace and Exclusive User-ID mappings

Nestybox system containers always use all Linux namespaces for enhanced isolation from the host and from other containers.

Of particular importance is the Linux user namespace, which works by mapping privileged user-IDs (e.g., root) inside the namespace to fully unprivileged user-IDs on the host.

This ensures that the root user inside the system container is only privileged with respect to resources assigned to the container, but has no privileges otherwise.

The Nestybox container runtime, Sysbox, creates a user-namespace for each system container and configures each container with an exclusive mapping of user-IDs (and group-IDs). This is done to isolate system containers from the host as well as from each other.

For example, let’s launch a system container with Docker and the Sysbox container runtime:

$ docker run --runtime=sysbox-runc -it alpine:latest

And let’s check the user namespace user-ID mapping for it:

/ # cat /proc/self/uid_map
         0     296608      65536

The way to read this is that the system container’s users in the range [0:65535] are mapped to the host user-IDs in the range [296608 : 296608+65535]. This mapping is configured by Sysbox.
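The arithmetic behind this reading can be sketched in a few lines of shell. The helper names below (uid_map_host_range and container_to_host_uid) are made up for this illustration and are not part of Sysbox; the input is simply the uid_map line shown above.

```shell
# Illustrative helpers (hypothetical names, not part of Sysbox).
# A uid_map line has three fields:
#   <first ID inside the namespace> <first ID on the host> <number of IDs>

# Print the first and last host user-ID covered by a uid_map line.
uid_map_host_range() {
    set -- $1                      # split the line into its three fields
    echo "$2 $(( $2 + $3 - 1 ))"
}

# Translate a container user-ID ($2) to its host user-ID for a uid_map line ($1).
container_to_host_uid() {
    uid=$2
    set -- $1                      # now $1, $2, $3 are the three map fields
    echo "$(( $2 + uid - $1 ))"
}

uid_map_host_range "0 296608 65536"        # prints: 296608 362143
container_to_host_uid "0 296608 65536" 0   # prints: 296608
```

In this example, root (user-ID 0) in the container corresponds to user-ID 296608 on the host, which is just an ordinary unprivileged ID there.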

Now let’s deploy another system container and check its user-ID map:

$ docker run --runtime=sysbox-runc -it alpine:latest
/ # cat /proc/self/uid_map
         0     362144      65536

Notice how Sysbox used different user-ID mappings for this new system container. The same applies to the group-ID mappings (not shown above).

In other words, system containers deployed with Sysbox get an exclusive user-ID range of 65536 unprivileged user-IDs on the host. We use 65536 IDs per container for POSIX compliance.
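To make “exclusive” concrete, here is a small sketch that checks whether two host ID ranges share any ID. The function name ranges_overlap is our own for this illustration; the numbers are the two mappings shown above.

```shell
# Succeed (exit status 0) if two ID ranges share at least one ID.
# Arguments: <start1> <length1> <start2> <length2>
ranges_overlap() {
    [ "$1" -lt "$(( $3 + $4 ))" ] && [ "$3" -lt "$(( $1 + $2 ))" ]
}

# Host ranges of the two system containers shown above.
if ranges_overlap 296608 65536 362144 65536; then
    echo "ranges overlap"
else
    echo "ranges are disjoint"
fi
```

For these two mappings the check reports that the ranges are disjoint: the first container ends at host user-ID 362143 and the second one starts at 362144.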

Why does this matter? Because if a process inside a system container somehow escapes the container’s root file system, it will find itself running as an unprivileged user-ID that owns no files on the host or in other containers. This sharply limits what the process can access and thereby improves system security.

Linux Capabilities

By virtue of using the Linux user namespace, a root process in the system container may be given all capabilities and the Linux kernel ensures those capabilities only apply to resources assigned to the system container (or more accurately, resources associated with the Linux namespaces that combine to make up the system container).

In fact, the init process for the root user in a Nestybox system container starts with all capabilities enabled:

$ docker run --runtime=sysbox-runc -it alpine:latest

/ # whoami
root

/ # cat /proc/self/status | grep -i cap
CapInh: 0000003fffffffff
CapPrm: 0000003fffffffff
CapEff: 0000003fffffffff
CapBnd: 0000003fffffffff
CapAmb: 0000003fffffffff

These capabilities only apply to resources associated with the system container. In fact, processes in the system container have no capabilities with respect to system-wide resources or resources associated with other containers.
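The hexadecimal masks in that output are bit sets, with one bit per capability number. As a rough sketch, the two helpers below (count_caps and has_cap, names of our own choosing) count the bits set in a mask and test a single capability; capability number 21 is CAP_SYS_ADMIN in the Linux headers.

```shell
# Count the capabilities set in a hexadecimal mask from /proc/<pid>/status.
count_caps() {
    mask=$(( 0x$1 ))
    count=0
    while [ "$mask" -gt 0 ]; do
        count=$(( count + (mask & 1) ))
        mask=$(( mask >> 1 ))
    done
    echo "$count"
}

# Succeed if capability number $2 is set in hexadecimal mask $1.
has_cap() {
    [ "$(( (0x$1 >> $2) & 1 ))" -eq 1 ]
}

count_caps 0000003fffffffff                                  # prints: 38
has_cap 0000003fffffffff 21 && echo "CAP_SYS_ADMIN is set"   # 21 = CAP_SYS_ADMIN
```

Where the libcap tools are installed, capsh --decode=0000003fffffffff prints the capability names instead of a count.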

For example, below we repeat the same command shown earlier that allows a privileged container to reboot the host (!), but this time from within a system container:

/ # echo 1 > /proc/sys/kernel/sysrq && echo b > /proc/sysrq-trigger
/bin/sh: can't create /proc/sys/kernel/sysrq: Permission denied

The Linux kernel prevents the access as it understands that sysrq is a privileged system-wide resource and the root process in the system container has no privileges to access it (even though it has full capabilities within the container).

This ensures the system container processes are only allowed to act on resources assigned to the system container, and can’t modify system-wide settings.
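If you want to probe this yourself without risking a reboot, a non-destructive check is to ask the shell whether the file is writable rather than writing to it. The helper name check_writable is ours; it relies only on the standard -w file test.

```shell
# Report whether the current process may write to a file,
# without actually writing to it.
check_writable() {
    if [ -w "$1" ]; then
        echo "$1: writable"
    else
        echo "$1: not writable"
    fi
}

check_writable /proc/sys/kernel/sysrq
```

Inside a system container we would expect this to report the file as not writable, in line with the “Permission denied” error above; in a privileged container the same check would be expected to succeed.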

Restricted Device Exposure

Earlier we mentioned that privileged Docker containers expose all host devices inside the container, in essence giving the container full control of the host’s physical and software devices.

In contrast, system containers expose far fewer devices.

For example, when deploying system containers with Docker, you typically see only the following devices inside the system container:

  • /dev/null
  • /dev/zero
  • /dev/full
  • /dev/random
  • /dev/urandom
  • /dev/tty
  • /dev/console
  • /dev/pts
  • /dev/mqueue
  • /dev/shm

This reduced set of devices further helps isolate the system container from the underlying host.
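One way to compare a container against the list above is to print every entry under /dev that is not on it. This is only a sketch: the function name unexpected_devices is hypothetical, and the allowlist is copied from this article rather than from Sysbox itself.

```shell
# Entries this article lists as typically visible under /dev in a system container.
EXPECTED_DEVICES="null zero full random urandom tty console pts mqueue shm"

# Print the entries of a directory (default: /dev) that are not in the list above.
unexpected_devices() {
    dir=${1:-/dev}
    for path in "$dir"/*; do
        # Skip the unexpanded pattern if the directory is empty.
        [ -e "$path" ] || [ -L "$path" ] || continue
        name=${path##*/}
        case " $EXPECTED_DEVICES " in
            *" $name "*) ;;            # listed above, nothing to report
            *) echo "$name" ;;
        esac
    done
}

unexpected_devices /dev
```

Expect a few extra names even in a well-isolated container, since Docker usually also sets up convenience entries such as /dev/fd, /dev/stdin, /dev/stdout, /dev/stderr and /dev/ptmx. What matters is that host disks and other hardware devices do not show up.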

System Container Security Benefits

The prior section described several features used by Nestybox system containers to increase their isolation from the rest of the system.

This section describes other security benefits made possible by these system containers.

Giving Unprivileged Users Access To A Docker Daemon

One of the security precautions used by the Docker daemon is to prevent unprivileged users on a host from creating containers.

That is, in order to create containers on a host, the user must either be root or belong to the docker group (and adding a user to that group itself requires root privileges).

The reason unprivileged users are not allowed to create containers is that the Docker daemon on the host runs as root (due to its deep interactions with the Linux kernel). Allowing an unprivileged user to create Docker containers would allow that user to easily gain root access on the machine (e.g., by creating a privileged container).
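The access rule described above can be checked from a shell on the host. The sketch below uses only the standard id command; the helper name has_group is our own.

```shell
# Succeed if group $1 appears in the space-separated group list $2.
has_group() {
    case " $2 " in
        *" $1 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Root, or any member of the docker group, may create containers on the host.
if [ "$(id -u)" -eq 0 ] || has_group docker "$(id -nG)"; then
    echo "this user can create containers on the host"
else
    echo "this user cannot create containers on the host"
fi
```

Keep in mind that the first answer is effectively the same as “this user has root access on the host”, for the reason given above.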

While this restriction makes sense from a security perspective, it’s burdensome on hosts shared by multiple users who want to use Docker. It forces the system administrator to either trust all users and give them access to create Docker containers (which is equivalent to giving them root access on the host), or create the containers on the users’ behalf.

Nestybox system containers offer an easy-to-use, efficient solution to this problem: a system administrator can create “Docker sandboxes” using system containers and assign them to unprivileged users. Each sandbox can be configured with systemd, Docker, and sshd, as shown below:

[Figure: syscont-docker-sbox-iso]

Unprivileged users can then ssh into their sandbox and deploy Docker containers within it in total isolation from the rest of the system and without requiring root privileges on the host.
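As a rough idea of what creating such a sandbox might look like, the sketch below prints (but does not run) a docker command for one user. Only --runtime=sysbox-runc comes from this article; the function name, the container naming scheme, the port choice, and the image name example/systemd-docker-sshd are all placeholders that we assume for illustration. You would need your own image with systemd, Docker, and sshd installed.

```shell
# Print (rather than run) a docker command that would start a sandbox for one user.
#   $1 = user name
#   $2 = host port to forward to sshd (port 22) inside the sandbox
#   $3 = image containing systemd, Docker and sshd (placeholder, supply your own)
sandbox_command() {
    echo "docker run --runtime=sysbox-runc -d --name sandbox-$1 --hostname sandbox-$1 -p $2:22 $3"
}

sandbox_command alice 2222 example/systemd-docker-sshd
```

Printing the command first lets the administrator review it before running it, and giving each user a distinct host port keeps the sandboxes reachable independently over ssh.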

This approach solves the problem quickly and easily, and without resorting to a heavy-weight solution such as deploying a VM.

This Nestybox blog post has more info on how to deploy Docker sandboxes using system containers.

Inner Containers Have Two Layers Of Isolation

Another effect of running Docker inside a Nestybox system container is that containers deployed inside the system container are under two layers of isolation from the rest of the system (as is evident from the figure shown above).

That is, when deploying a Docker container inside a system container, processes inside the “inner container” are restricted by a combination of the inner and outer container isolation mechanisms (e.g., namespaces, cgroups, system call whitelist, exposed devices, etc.).

This strengthens “defense-in-depth” on the host (i.e., escaping to the host requires bypassing isolation mechanisms of the inner container and the system container).

More Work Remains

While Nestybox system containers offer important isolation features, it’s early days for us and more work remains to be done to better secure system containers.

For example, we are working on limiting the amount of non-relevant host information exposed inside a system container, integrating with Linux security modules (e.g., AppArmor), improving system call restriction controls, and adding other security-related features.

Conclusion

Nestybox enables Docker containers to run system-level workloads (such as systemd and Docker itself) without using insecure privileged containers.

In fact, Nestybox system containers come with features that provide strong isolation from the underlying host, such as using all Linux namespaces (and in particular the user namespace), exclusive user-ID mappings for strong container-to-host and container-to-container isolation, limited procfs and sysfs exposure, and limited device exposure.

While more work remains to be done to better secure system containers, they already offer a much better alternative to privileged Docker containers for running system-level workloads inside the container.

Try it for Free!

You can try Nestybox system containers for free! Check our website for info on how to get Sysbox, our container runtime.

We are looking for early adopters and your feedback would be much appreciated!