Secure Docker-in-Kubernetes

January 03, 2022

Intro

This post shows you how to run Docker inside a secure (rootless) Kubernetes pod. That is, you create one or more Kubernetes pods and inside of each you run Docker.

While running Docker inside pods is not new, what’s different here is that the pod will not be an insecure “privileged” pod. Instead, it will be a fully unprivileged (rootless) pod launched with Kubernetes and the Sysbox runtime, which means you can use this setup in enterprise settings where security is very important.

We will show you how to set this up quickly and easily with examples, and afterwards you can adjust these per your needs.

Motivation

There are several use cases for running Docker inside a Kubernetes pod; two useful ones are:

  1. Creating a pool of Docker engines on the cloud. Each user is assigned one such engine and connects remotely to it via the Docker CLI. Each Docker engine runs inside a Kubernetes pod (instead of a VM), so operators can leverage the power of Kubernetes to manage the pool’s resources.

  2. Running Docker inside Kubernetes-native CI jobs. Each job is deployed inside a pod and the job uses the Docker engine running inside the pod to build container images (e.g., Buildkit), push them to some repo, run them, etc.

In this blog post we focus on the first use case. A future blog post will focus on the second use case.

Setup

The diagram below shows the setup we will create:

As shown, each pod runs its own Docker engine (plus sshd and a process manager), and remote users connect to their assigned engine over SSH through a per-pod service. The cluster nodes run the Sysbox runtime, so the pods are fully unprivileged (rootless).

Why is Sysbox Useful Here?

Prior to Sysbox, the setup shown above required insecure “privileged” containers or VM-based alternatives such as KubeVirt.

But privileged containers are too insecure, and VMs are slower, heavier, and harder to set up (e.g., KubeVirt requires nested virtualization on the cloud).

With Sysbox, you can do this more easily and efficiently, using secure (rootless) containers without resorting to VMs.

Kubernetes Cluster Creation

Ok, let’s get to it.

First, you need a Kubernetes cluster with Sysbox installed in it. It’s pretty easy to set this up as Sysbox works on EKS, GKE, AKS, on-prem Kubernetes, etc.

See these instructions to install Sysbox on your cluster.

For this example, I am using a 3-node Kubernetes cluster on GKE, and I’ve installed Sysbox on it with this single command:

kubectl apply -f https://raw.githubusercontent.com/nestybox/sysbox/master/sysbox-k8s-manifests/sysbox-install.yaml
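Before moving on, you can confirm that the Sysbox runtime class (referenced later in the pod spec) is registered on the cluster. This is just a quick sanity check once the installer finishes:

$ kubectl get runtimeclass sysbox-runc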

Defining the Pods (with Docker inside)

Once Sysbox is installed on your cluster, the next step is to define the pods that carry the Docker engine in them.

We need a container image that carries the Docker engine. In this example, we use an image called nestybox/alpine-supervisord-docker:latest that carries Alpine + Supervisord + sshd + Docker. The Dockerfile is here.
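For reference, below is a minimal sketch of what such an image could look like. This is an illustration, not the actual Dockerfile linked above; the package names and supervisord config path are assumptions:

$ cat Dockerfile
FROM alpine:3.15
# Install the Docker engine, an SSH server, and Supervisord as the process manager
RUN apk add --no-cache docker openssh supervisor && ssh-keygen -A
# supervisord.conf (not shown) defines the dockerd and sshd programs to launch
COPY supervisord.conf /etc/supervisord.conf
EXPOSE 22
# Run Supervisord in the foreground as PID 1
CMD ["/usr/bin/supervisord", "-n", "-c", "/etc/supervisord.conf"]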

Next, let’s create a Kubernetes StatefulSet that will provision 6 pod instances (i.e., 2 per node on our 3-node cluster). Each pod will allow remote access to the Docker engine via SSH. Here is the associated yaml file:

$ cat dockerd-statefulset.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: dockerd-statefulset
spec:
  selector:
    matchLabels:
      app: dockerd
  serviceName: "dockerd"
  replicas: 6
  template:
    metadata:
      labels:
        app: dockerd
      annotations:
        io.kubernetes.cri-o.userns-mode: "auto:size=65536"
    spec:
      runtimeClassName: sysbox-runc
      terminationGracePeriodSeconds: 20
      containers:
      - name: alpine-docker
        image: nestybox/alpine-supervisord-docker:latest
        ports:
        - containerPort: 22
          name: ssh
        volumeMounts:
        - name: docker-cache
          mountPath: /var/lib/docker
  volumeClaimTemplates:
  - metadata:
      name: docker-cache
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: "gce-pd"
      resources:
        requests:
          storage: 2Gi
  podManagementPolicy: Parallel

Before we apply this yaml, let’s analyze a few things about it.

First, we chose a StatefulSet (instead of a Deployment) because we want each pod to have unique and persistent network and storage resources across its life cycle. This way, if a pod goes down, we can recreate it and it will have the same IP address and the same persistent storage assigned to it.

Second, note the following about the StatefulSet spec:

  - The runtimeClassName: sysbox-runc directive tells Kubernetes to launch the pods with the Sysbox runtime; this is what makes them rootless.

  - The io.kubernetes.cri-o.userns-mode annotation asks the CRI-O runtime to auto-assign a dedicated range of 65536 user-namespace IDs to each pod.

  - Each pod runs the nestybox/alpine-supervisord-docker:latest image and exposes port 22, so users can later reach the Docker engine over SSH.

  - The volumeClaimTemplates directive gives each pod its own persistent volume, mounted at /var/lib/docker (more on this in the next section).

  - podManagementPolicy: Parallel tells Kubernetes to create the pods in parallel rather than one at a time.

Persistent Docker Cache

In the StatefulSet yaml shown above, we mounted a persistent volume on each pod’s /var/lib/docker directory.

Doing this is optional, but enables us to preserve the state of the Docker engine (aka “the Docker cache”) across the pod’s life cycle. This state includes pulled images, Docker volumes and networks, and more. Without this, the Docker state will be lost when the pod stops.

Note that each pod must have a dedicated volume for this. Multiple pods can’t share the same volume because each Docker engine must have a dedicated cache (it’s a Docker requirement).

Also, note that the persistent storage is provisioned dynamically (at pod creation time, one volume per pod). This is done via the volumeClaimTemplates directive, which claims a 2Gi volume of a storage class named “gce-pd”.

What is “gce-pd”? It’s a storage class that uses the Google Compute Engine (GCE) storage provisioner. The resource definition is below:

$ cat gce-pd.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gce-pd
provisioner: kubernetes.io/gce-pd
parameters:
  type: pd-ssd
  fstype: ext4
  replication-type: none
volumeBindingMode: WaitForFirstConsumer

Since my cluster is on GKE, using the GCE storage provisioner makes sense. Depending on your scenario, you can use any other provisioner supported by Kubernetes (e.g., AWS EBS, Azure Disk, etc).
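For example, a roughly equivalent StorageClass on AWS using the in-tree EBS provisioner might look like the one below (a sketch; the class name and parameters are illustrative):

$ cat aws-ebs-pd.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: aws-ebs-pd
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
  fsType: ext4
volumeBindingMode: WaitForFirstConsumer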

In addition, whenever we use volumeClaimTemplates, we must also define a dummy local-storage class (otherwise Kubernetes will fail to deploy the pods):

$ cat local-storage.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-storage
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer

Deploying the Pods

With this in place, we can now apply the yamls shown in the prior section.

$ kubectl apply -f gce-pd.yaml
$ kubectl apply -f local-storage.yaml
$ kubectl apply -f dockerd-statefulset.yaml
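If you want to follow the rollout while it happens, you can optionally watch it with:

$ kubectl rollout status statefulset dockerd-statefulset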

If all goes well, you should see the StatefulSet pods deployed within 10 to 20 seconds, as shown below:

$  kubectl get pods

NAME                    READY   STATUS    RESTARTS   AGE
dockerd-statefulset-0   1/1     Running   0          9m51s
dockerd-statefulset-1   1/1     Running   0          9m51s
dockerd-statefulset-2   1/1     Running   0          9m51s
dockerd-statefulset-3   1/1     Running   0          9m51s
dockerd-statefulset-4   1/1     Running   0          9m51s
dockerd-statefulset-5   1/1     Running   0          9m51s

You should also see the persistent volumes that Kubernetes dynamically allocated to the pods:

$ kubectl get pv

NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                                        STORAGECLASS   REASON   AGE
pvc-377c35d8-4075-4d40-9d26-7e4acd42cbea   2Gi        RWO            Delete           Bound    default/docker-cache-dockerd-statefulset-1   gce-pd                  14m
pvc-5937a358-5111-4b91-9cce-87a8efabbb62   2Gi        RWO            Delete           Bound    default/docker-cache-dockerd-statefulset-3   gce-pd                  14m
pvc-5ca2f6ba-627c-4b19-8cf0-775395868821   2Gi        RWO            Delete           Bound    default/docker-cache-dockerd-statefulset-4   gce-pd                  14m
pvc-9812e3df-6d7e-439a-9702-03925af098a5   2Gi        RWO            Delete           Bound    default/docker-cache-dockerd-statefulset-0   gce-pd                  14m
pvc-afd183ab-1621-44a1-aaf0-da0ccf9f96a8   2Gi        RWO            Delete           Bound    default/docker-cache-dockerd-statefulset-5   gce-pd                  14m
pvc-e3f65dea-4f97-4c4b-a902-97bf67ed698b   2Gi        RWO            Delete           Bound    default/docker-cache-dockerd-statefulset-2   gce-pd                  14m

Verify the Pods are Working

Let’s exec into one of the pods to verify all is good:

$ kubectl exec dockerd-statefulset-0 -- ps
PID   USER     TIME  COMMAND
    1 root      0:00 {supervisord} /usr/bin/python3 /usr/bin/supervisord -n
   14 root      0:00 /usr/bin/dockerd
   15 root      0:00 sshd: /usr/sbin/sshd -D [listener] 0 of 10-100 startups
   45 root      0:02 containerd --config /var/run/docker/containerd/containerd.toml --log-level info
  638 root      0:00 ps

Perfect: supervisord (our process manager in the pod) is running as PID 1, and it has started Dockerd and sshd.

Let’s check that Docker is working well:

$ kubectl exec dockerd-statefulset-0 -- docker ps
CONTAINER ID   IMAGE     COMMAND   CREATED   STATUS    PORTS     NAMES

Great, Docker is responding normally.
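As an optional smoke test, you can also run an inner container end to end (this assumes the pod has outbound network access to pull the image from Docker Hub):

$ kubectl exec dockerd-statefulset-0 -- docker run --rm alpine echo "hello from an inner container"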

Finally, check that the pod is rootless:

$ kubectl exec -it dockerd-statefulset-0 -- cat /proc/self/uid_map
    0     362144      65536

This means user-ID 0 in the pod (root) is mapped to user-ID 362144 on the host, and the mapping extends for a range of 65536 user-IDs.

In other words, you can work as root inside the pod without fear, as it has no privileges on the host.
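If you repeat the check on the other pods of the StatefulSet, each should show its own mapping, since the userns-mode annotation asks CRI-O to auto-assign a 65536-ID range per pod:

$ kubectl exec dockerd-statefulset-1 -- cat /proc/self/uid_map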

Exposing the Pod’s IP Outside the Cluster

Now that the pods are running, we want to access the Docker engine inside each pod. In this example, we want to access the pods from outside the cluster, and do it securely.

For example, we want to give a developer sitting at home with her laptop access to a Docker engine inside one of the pods we’ve deployed.

To do this, we are going to create a Kubernetes “Load Balancer” service that exposes the pod’s SSH port externally.

Note that we need one such service per pod (rather than a single service that load balances across several pods). The reason is that the pods we’ve created are not fungible: each one carries a stateful Docker engine.

The simplest (but least automated) way to do this is to manually create a LoadBalancer service for each pod. For example, for pod dockerd-statefulset-0:

apiVersion: v1
kind: Service
metadata:
  name: dockerd0-service
spec:
  type: LoadBalancer
  selector:
    statefulset.kubernetes.io/pod-name: dockerd-statefulset-0
  ports:
  - protocol: TCP
    port: 22
    targetPort: 22

Applying this yaml causes Kubernetes to expose port 22 (SSH) of pod dockerd-statefulset-0 via an external IP:

$ kubectl get svc dockerd0-service
NAME               TYPE           CLUSTER-IP    EXTERNAL-IP    PORT(S)        AGE
dockerd0-service   LoadBalancer   10.72.8.250   35.194.9.153   22:32547/TCP   3m57s

We need to repeat this for each of the pods of the stateful set.
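For example, a small shell loop can stamp out one such service per pod (a sketch; the dockerd<N>-service naming follows the convention used above):

for i in $(seq 0 5); do
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
  name: dockerd${i}-service
spec:
  type: LoadBalancer
  selector:
    statefulset.kubernetes.io/pod-name: dockerd-statefulset-${i}
  ports:
  - protocol: TCP
    port: 22
    targetPort: 22
EOF
done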

Note that there are more automated ways to do this, but they are beyond the scope of this blog.

Connecting Remotely to the Pods

Now that you have the pods running on the cluster (each pod running an instance of Docker engine) and a service that exposes each externally, let’s connect to them remotely.

There are two parts to accomplish this:

  1. Configure ssh access to the pod.

  2. Use the Docker CLI to connect to the pod remotely via ssh.

SSH config

First, set a password for the root user inside the pod:

$ kubectl exec -it dockerd-statefulset-0 -- passwd
Changing password for root
New password: <some-password>

Then, copy your public SSH key into the pod with ssh-copy-id. For example, if the pod’s external IP is 35.194.9.153:

ssh-copy-id root@35.194.9.153

root@35.194.9.153's password: <some-secret-password>

Number of key(s) added: 1

Now try logging into the machine, with:   "ssh 'root@35.194.9.153'"
and check to make sure that only the key(s) you wanted were added.
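At this point you can verify that key-based SSH access works and that the Docker engine inside the pod answers over it:

$ ssh root@35.194.9.153 docker version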

Docker CLI Access

After SSH is configured, the last step is to set up the Docker client to connect to the remote Docker engine. For example:

$ docker context create --docker host=ssh://root@35.194.9.153 remote-docker
remote-docker
Successfully created context "remote-docker"

$ docker context use remote-docker
remote-docker
Current context is now "remote-docker"

And now we can access the remote Docker engine:

$ docker ps
CONTAINER ID   IMAGE     COMMAND   CREATED   STATUS    PORTS     NAMES

There it is! The remote user can now use her dedicated Docker engine to pull and run images as usual.

At this point you have a working setup. The remaining sections describe topics you should keep in mind as you work with the remote cluster.

Shared Docker Images across Docker Engines

In the current setup, each Docker engine was configured with a dedicated persistent Docker cache (to cache container images, Docker volumes, networks, etc.).

But what if you want multiple Docker engines to share an image cache?

You may be tempted to do this by having multiple Docker pods share the same Docker cache. For example, create a persistent volume for a Docker cache and mount the same volume into the “/var/lib/docker” directory of multiple pods. But this won’t work, because each Docker engine must have a dedicated cache.

A better way to do this is to set up a local image registry using the open-source Docker registry. For example, this local registry could run in a pod within your cluster, and you can then direct the Docker engine instances to pull/push images from it.
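As a rough sketch (not production-ready: no TLS, authentication, or persistent storage is configured, and the names are illustrative), such a registry could be deployed like this:

$ cat registry.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: local-registry
spec:
  replicas: 1
  selector:
    matchLabels:
      app: local-registry
  template:
    metadata:
      labels:
        app: local-registry
    spec:
      containers:
      - name: registry
        image: registry:2
        ports:
        - containerPort: 5000
---
apiVersion: v1
kind: Service
metadata:
  name: local-registry
spec:
  selector:
    app: local-registry
  ports:
  - port: 5000
    targetPort: 5000

Keep in mind that if the registry is served over plain HTTP, each Docker engine would need it configured as an insecure registry (or you would front it with TLS).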

A complete registry setup is beyond the scope of this article.

Scaling Pod Instances

To scale the pods (i.e., scale up or down), simply modify the replicas: clause in the StatefulSet yaml and apply it again.
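Alternatively, you can scale imperatively; for example, to go from 6 to 8 replicas:

$ kubectl scale statefulset dockerd-statefulset --replicas=8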

You will also need to create the Load Balancer service for any newly added pods.

Note however that when you scale down, the Load Balancer services and persistent volumes mounted on the pod’s /var/lib/docker are not removed automatically (you must explicitly remove them as shown next).

Persistent Volume Removal

In the StatefulSet we created above, we asked Kubernetes to dynamically create a persistent volume for each pod and mount it on the pod’s /var/lib/docker directory when the pod is created (see section Persistent Docker Cache above).

When the pod is removed however, Kubernetes will not remove the persistent volume automatically. This is by design, because you may want to keep the contents of the volume in case you recreate the pod in the future.

To remove the persistent volume do the following:

  1. Stop the pod using the persistent volume.

  2. List the persistent volume claims (pvc):

$ kubectl get pvc
NAME                                 STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
docker-cache-dockerd-statefulset-0   Bound    pvc-9812e3df-6d7e-439a-9702-03925af098a5   2Gi        RWO            gce-pd         25h
docker-cache-dockerd-statefulset-1   Bound    pvc-377c35d8-4075-4d40-9d26-7e4acd42cbea   2Gi        RWO            gce-pd         25h
docker-cache-dockerd-statefulset-2   Bound    pvc-e3f65dea-4f97-4c4b-a902-97bf67ed698b   2Gi        RWO            gce-pd         25h
docker-cache-dockerd-statefulset-3   Bound    pvc-5937a358-5111-4b91-9cce-87a8efabbb62   2Gi        RWO            gce-pd         25h
docker-cache-dockerd-statefulset-4   Bound    pvc-5ca2f6ba-627c-4b19-8cf0-775395868821   2Gi        RWO            gce-pd         25h
docker-cache-dockerd-statefulset-5   Bound    pvc-afd183ab-1621-44a1-aaf0-da0ccf9f96a8   2Gi        RWO            gce-pd         25h

  3. Remove the desired PVC; this will also remove the associated persistent volume:

$ kubectl delete pvc docker-cache-dockerd-statefulset-5

Docker Build Context

When running the Docker engine remotely, be careful with Docker builds. The reason: the Docker CLI will transfer the “build context” (i.e., the directory tree where the Dockerfile is located) over the network to the remote Docker engine. This can take a long time for large images.

Docker BuildKit can help here, since it tracks changes and only transfers the portions of the build context that have changed since a prior build.
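If your Docker client does not enable BuildKit by default, you can turn it on per build (the image tag below is just an example). Keeping a .dockerignore file in the build directory also helps trim what gets sent over the network:

$ DOCKER_BUILDKIT=1 docker build -t myimage .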

Conclusion

Running Docker inside Kubernetes pods has several use cases, such as offloading the Docker engine away from local development machines (e.g., for efficiency or security reasons).

However, until recently doing this required very insecure privileged pods or VMs.

In this blog, we showed how to do this easily & securely with pure containers, using Kubernetes + Sysbox. We hope this is helpful.

Feel free to add your comments below, and/or join the Sysbox Slack channel for any questions.
