Container isolation with Kata and gVisor in Docker

Overview

Containers are an efficient way to build and distribute workloads free of (most) host and OS dependencies, but they come at the cost of reduced security and isolation compared to virtual machines. This is why public cloud services spin up virtual machines per customer to deploy container-based workloads in them. OK, maybe there is also a commercial reason behind it, letting providers charge for a virtual machine with a fixed CPU core and memory allocation, but that's beside the point of this blog post.

What if you could give each container its own virtual machine? Well, that's the purpose of Kata Containers, an Apache 2 licensed project of the OpenStack Foundation that emerged from Intel's Clear Containers project back in 2015:

"Kata Containers is an open source community working to build a secure container runtime with lightweight virtual machines that feel and perform like containers, but provide stronger workload isolation using hardware virtualization technology as a second layer of defense."

Google looked at the problem of container isolation differently and came up with gVisor, an open-source, OCI-compatible sandbox runtime that provides a virtualized container environment:

"gVisor integrates with Docker, containerd and Kubernetes, making it easier to improve the security isolation of your containers while still using familiar tooling. Additionally, gVisor supports a variety of underlying mechanisms for intercepting application calls, allowing it to run in diverse host environments, including cloud-hosted virtual machines."

I wanted to explore the Kata Containers and gVisor runtimes alongside Docker's default runc-based runtime, ideally all on a single server. Turns out, this is actually pretty simple to achieve. In this blog post I go through the installation steps, then launch various containers with all three runtimes via docker run and try out a few things: basic connectivity (works), what each reports for CPU and memory, reading and writing from volumes (works too), attaching networks at runtime, and finally loading a simple XDP program (didn't work).

Overall Kata and gVisor are very easy to use and will be sufficient for many basic container workloads. More exploration and performance testing will be needed before taking the plunge, but if container isolation is important, e.g. because multiple tenants must share the same host, then both offer great and simple solutions.

Installation

I'm using a baremetal server running Ubuntu 18.04.4 on kernel 4.15.18-custom:

$ uname -a
Linux jcrpd 4.15.18-custom #2 SMP Thu Feb 13 22:49:55 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
mwiget@jcrpd:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 18.04.4 LTS
Release:        18.04
Codename:       bionic

$ lscpu |grep CPU
CPU op-mode(s):      32-bit, 64-bit
CPU(s):              16
On-line CPU(s) list: 0-15
CPU family:          6
Model name:          Intel(R) Xeon(R) D-2146NT CPU @ 2.30GHz
CPU MHz:             1000.021
CPU max MHz:         2301.0000
CPU min MHz:         1000.0000
NUMA node0 CPU(s):   0-15

Docker-ce

I followed the recommended installation path for docker-ce on Ubuntu:

sudo apt-get update
sudo apt-get install -y \
    apt-transport-https \
    ca-certificates \
    curl \
    gnupg-agent \
    software-properties-common
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
sudo add-apt-repository \
   "deb [arch=amd64] https://download.docker.com/linux/ubuntu \
   $(lsb_release -cs) \
   stable"
sudo apt-get update
sudo apt-get install -y docker-ce docker-ce-cli containerd.io

This installs `/usr/bin/runc` as the Docker runtime:

$ which runc
/usr/bin/runc

$ runc --version
runc version 1.0.0-rc10
commit: dc9208a3303feef5b3839f4323d9beb36df0a9dd
spec: 1.0.1-dev

$ dpkg -S /usr/bin/runc
containerd.io: /usr/bin/runc

Kata Container

Installation instructions can be found in the kata-containers documentation repo.

I followed the ubuntu-installation-guide.md for Ubuntu 18.04:

ARCH=$(arch)
BRANCH="${BRANCH:-master}"
sudo sh -c "echo 'deb http://download.opensuse.org/repositories/home:/katacontainers:/releases:/${ARCH}:/${BRANCH}/xUbuntu_$(lsb_release -rs)/ /' > /etc/apt/sources.list.d/kata-containers.list"
curl -sL  http://download.opensuse.org/repositories/home:/katacontainers:/releases:/${ARCH}:/${BRANCH}/xUbuntu_$(lsb_release -rs)/Release.key | sudo apt-key add -
sudo -E apt-get update
sudo -E apt-get -y install kata-runtime kata-proxy kata-shim
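Before wiring Kata into Docker, it's worth checking that the host can actually run it. The 1.x runtime ships a kata-check subcommand that verifies KVM and hardware virtualization support. A sketch (the stub fallback is only so it runs on machines without kata-runtime installed):

```shell
# kata-check verifies KVM and hardware virtualization support on the host.
# Stub fallback (assumption for this sketch) so it runs where kata-runtime
# isn't installed.
command -v kata-runtime >/dev/null 2>&1 || \
  kata-runtime() { echo "System is capable of running Kata Containers (stub)"; }
kata-runtime kata-check
```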

Then I followed the instructions in "Install Docker for Kata Containers on Ubuntu" to register the runtime via /etc/docker/daemon.json. This lets me define all three Docker runtimes later in a single file. The file doesn't exist yet, so I create it, register kata-runtime under the name kata and, for now, make it the default runtime. Docker's built-in runtime runc doesn't need to be declared in this file:

cat | sudo tee /etc/docker/daemon.json
{
    "default-runtime": "kata",
    "runtimes": {
        "kata": {
            "path": "/usr/bin/kata-runtime"
        }
    }
}
^D
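A syntax error in daemon.json prevents Docker from starting, so it's cheap insurance to validate the JSON before restarting the service. A sketch using python3's bundled json.tool, demonstrated on a temp copy of the config above (/tmp/daemon-check.json is just a scratch path for the demo):

```shell
# Validate daemon.json before restarting Docker; a malformed file keeps the
# daemon from starting. Demonstrated on a temp copy of the config above.
cat > /tmp/daemon-check.json <<'EOF'
{
    "default-runtime": "kata",
    "runtimes": {
        "kata": {
            "path": "/usr/bin/kata-runtime"
        }
    }
}
EOF
python3 -m json.tool /tmp/daemon-check.json >/dev/null && echo "daemon.json OK"
```

Against the live file, the same check is `python3 -m json.tool /etc/docker/daemon.json`.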

Activate the new settings by restarting the Docker systemd service:

sudo systemctl daemon-reload
sudo systemctl restart docker
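To confirm Docker picked up the new default, recent Docker versions expose it through a Go template on docker info. A sketch (the stub fallback is an assumption for this demo, so it runs where no Docker daemon is available):

```shell
# Query the active default runtime; should print "kata" after the restart.
# Stub fallback so the sketch runs where no Docker daemon is available.
command -v docker >/dev/null 2>&1 || docker() { echo "kata"; }
docker info --format '{{.DefaultRuntime}}'
```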

Now let's run an alpine container executing uname -a to display the kernel version, first with kata and then with runc. But first, verify the kernel on the host itself:

~$ uname -a
Linux jcrpd 4.15.18-custom #2 SMP Thu Feb 13 22:49:55 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

Now using kata, first by specifying the runtime kata directly, then using the default runtime (which is also kata, as defined in /etc/docker/daemon.json):

$ docker run -ti --runtime kata alpine uname -a
Linux 6ac4d4664e64 5.4.15-45.container #1 SMP Mon Feb 24 19:34:15 UTC 2020 x86_64 Linux

$ docker run -ti alpine uname -a
Linux 5393c2ec9b64 5.4.15-45.container #1 SMP Mon Feb 24 19:34:15 UTC 2020 x86_64 Linux

Clearly, the alpine container reports a much more recent kernel version than what is running on the host itself!

Running now the same container using runc:

$ docker run -ti --runtime runc alpine uname -a
Linux 85d13cf9bc8b 4.15.18-custom #2 SMP Thu Feb 13 22:49:55 UTC 2020 x86_64 Linux

As expected, runc reports the host kernel. More exploration of Kata follows further down, but first, let's add the gVisor runtime.

gVisor

Following the gVisor Installation Guide, I picked the install from an apt repository steps, starting with the dependencies:

sudo apt-get update && \
sudo apt-get install -y \
    apt-transport-https \
    ca-certificates \
    curl \
    gnupg-agent \
    software-properties-common

Next, the key used to sign archives should be added to your apt keychain:

curl -fsSL https://gvisor.dev/archive.key | sudo apt-key add -

Now adding the repository for the latest release:

sudo add-apt-repository "deb https://storage.googleapis.com/gvisor/releases release main"

Now the runsc package can be installed:

sudo apt-get update && sudo apt-get install -y runsc

This added the binary /usr/bin/runsc and automatically updated our /etc/docker/daemon.json file:

$ cat /etc/docker/daemon.json
{
    "default-runtime": "kata",
    "runtimes": {
        "kata": {
            "path": "/usr/bin/kata-runtime"
        },
        "runsc": {
            "path": "/usr/bin/runsc"
        }
    }
}

Quick check that the gVisor runtime is working: run the alpine container via runsc:

$ docker run -ti --runtime runsc alpine uname -a
Linux d75634707851 4.4.0 #1 SMP Sun Jan 10 15:06:54 PST 2016 x86_64 Linux

Yes. Working. And uname -a reports yet another kernel from within the alpine container running on the same baremetal host, cool!

At this point, I have runc, kata and runsc (gVisor) runtimes at my disposal. Now let's explore them …

Exploration

Launch one alpine container per runtime

Time to have fun! First, let's start 3 alpine containers, each with a different container runtime:

docker run -ti --runtime runc -d --rm --name r-runc --hostname r-runc alpine
docker run -ti --runtime runsc -d --rm --name r-gvisor --hostname r-gvisor alpine
docker run -ti --runtime kata -d --rm --name r-kata --hostname r-kata alpine

Now show the running containers:

kata-gvisor-docker$ docker ps
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES
020c88d6f2ae        alpine              "/bin/sh"           About an hour ago   Up About an hour                        r-kata
7fe37b9d46cb        alpine              "/bin/sh"           About an hour ago   Up About an hour                        r-gvisor
605e747f595c        alpine              "/bin/sh"           About an hour ago   Up About an hour                        r-runc

To simplify executing the same command on all containers, I use the following helper script:

$ cat exec.sh
#!/bin/sh
for runtime in kata gvisor runc; do
  echo docker exec r-$runtime $@ ...
  docker exec r-$runtime $@
  echo ""
done
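One caveat: the unquoted `$@` in the loop splits multi-word arguments a second time. A slightly more robust variant quotes it as `"$@"`; sketched here with a stub docker() function and a hypothetical run_all helper (both assumptions for this demo) so the loop can run without a Docker daemon:

```shell
#!/bin/sh
# exec.sh variant with "$@" quoted so multi-word arguments survive intact.
# The stub docker() is only here so the loop runs without a Docker daemon.
docker() { echo "[stub] docker $*"; }

run_all() {
  for runtime in kata gvisor runc; do
    echo "docker exec r-$runtime $* ..."
    docker exec "r-$runtime" "$@"
    echo ""
  done
}

run_all uname -a
```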

Let's try it out by showing the assigned IP address in each container:

$ ./exec.sh ip addr show eth0 
docker exec r-kata ip addr show eth0 ...
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq state UP qlen 1000
    link/ether 02:42:ac:11:00:04 brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.4/16 brd 172.17.255.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::42:acff:fe11:4/64 scope link 
       valid_lft forever preferred_lft forever

docker exec r-gvisor ip addr show eth0 ...
2: eth0: <UP,LOWER_UP> mtu 1500 
    link/generic 02:42:ac:11:00:03 brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.3/32 scope global dynamic 

docker exec r-runc ip addr show eth0 ...
31: eth0@if32: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1500 qdisc noqueue state UP 
    link/ether 02:42:ac:11:00:02 brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.2/16 brd 172.17.255.255 scope global eth0
       valid_lft forever preferred_lft forever

All containers share the same Docker network bridge, 172.17.0.0/16 in my case. You can use docker network inspect bridge to explore it, but that produces a lot of output, including the IPs of all containers. What's interesting here is that the gVisor runtime assigns a /32 address, whereas kata and the default runc stick to the /16.
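The full docker network inspect output is verbose; a Go template can pull out just the per-container name and address. A sketch (the stub fallback with illustrative addresses is an assumption for this demo, so it runs without a Docker daemon):

```shell
# Pull just container name and IP from the bridge network via a Go template,
# instead of wading through the full `docker network inspect` output.
# Stub fallback with illustrative addresses for running without a daemon.
command -v docker >/dev/null 2>&1 || docker() {
  printf 'r-kata 172.17.0.4/16\nr-gvisor 172.17.0.3/16\nr-runc 172.17.0.2/16\n'
}
docker network inspect bridge \
  --format '{{range .Containers}}{{.Name}} {{.IPv4Address}}{{"\n"}}{{end}}'
```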

Connectivity

Pinging a public DNS server from each container works perfectly:

$ ./exec.sh ping -c3 1.1.1.1
docker exec r-kata ping -c3 1.1.1.1 ...
PING 1.1.1.1 (1.1.1.1): 56 data bytes
64 bytes from 1.1.1.1: seq=0 ttl=55 time=2.279 ms
64 bytes from 1.1.1.1: seq=1 ttl=55 time=2.261 ms
64 bytes from 1.1.1.1: seq=2 ttl=55 time=1.682 ms

--- 1.1.1.1 ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max = 1.682/2.074/2.279 ms

docker exec r-gvisor ping -c3 1.1.1.1 ...
PING 1.1.1.1 (1.1.1.1): 56 data bytes
64 bytes from 1.1.1.1: seq=0 ttl=42 time=4.357 ms
64 bytes from 1.1.1.1: seq=1 ttl=42 time=1.775 ms
64 bytes from 1.1.1.1: seq=2 ttl=42 time=1.683 ms

--- 1.1.1.1 ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max = 1.683/2.605/4.357 ms

docker exec r-runc ping -c3 1.1.1.1 ...
PING 1.1.1.1 (1.1.1.1): 56 data bytes
64 bytes from 1.1.1.1: seq=0 ttl=55 time=1.824 ms
64 bytes from 1.1.1.1: seq=1 ttl=55 time=1.401 ms
64 bytes from 1.1.1.1: seq=2 ttl=55 time=1.451 ms

--- 1.1.1.1 ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max = 1.401/1.558/1.824 ms

What about connectivity between the containers? That works too. I check this by pinging one container's IP (172.17.0.3, the gVisor container) from each of the others and, OK, from itself as well:

$ ./exec.sh ping -c 3 172.17.0.3
docker exec r-kata ping -c 3 172.17.0.3 ...
PING 172.17.0.3 (172.17.0.3): 56 data bytes
64 bytes from 172.17.0.3: seq=0 ttl=64 time=0.352 ms
64 bytes from 172.17.0.3: seq=1 ttl=64 time=0.405 ms
64 bytes from 172.17.0.3: seq=2 ttl=64 time=0.348 ms

--- 172.17.0.3 ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max = 0.348/0.368/0.405 ms

docker exec r-gvisor ping -c 3 172.17.0.3 ...
PING 172.17.0.3 (172.17.0.3): 56 data bytes
64 bytes from 172.17.0.3: seq=0 ttl=42 time=0.400 ms
64 bytes from 172.17.0.3: seq=1 ttl=42 time=0.414 ms
64 bytes from 172.17.0.3: seq=2 ttl=42 time=0.408 ms

--- 172.17.0.3 ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max = 0.400/0.407/0.414 ms

docker exec r-runc ping -c 3 172.17.0.3 ...
PING 172.17.0.3 (172.17.0.3): 56 data bytes
64 bytes from 172.17.0.3: seq=0 ttl=64 time=0.302 ms
64 bytes from 172.17.0.3: seq=1 ttl=64 time=0.221 ms
64 bytes from 172.17.0.3: seq=2 ttl=64 time=0.216 ms

--- 172.17.0.3 ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max = 0.216/0.246/0.302 ms

Cool! Connectivity between containers and the Internet works despite using different runtimes.

CPU

More interesting is what CPU type, how many cores and how much memory each container reports. Starting with lscpu, which is part of the Alpine package util-linux that we first have to install:

$ ./exec.sh apk add util-linux

Now run lscpu. This is interesting: Kata reports a single CPU (I'm sure this can be tuned; after all, it runs within qemu-kvm). gVisor reports 16 CPUs but doesn't distinguish between actual cores and threads, and runc reports the actual host CPU capabilities.

mwiget@jcrpd:~/kata-gvisor-docker$ ./exec.sh lscpu
docker exec r-kata lscpu ...
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   40 bits physical, 48 bits virtual
CPU(s):                          1
On-line CPU(s) list:             0
Thread(s) per core:              1
Core(s) per socket:              1
Socket(s):                       1
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           85
Model name:                      Intel(R) Xeon(R) D-2146NT CPU @ 2.30GHz
Stepping:                        4
CPU MHz:                         2294.604
BogoMIPS:                        4589.20
Hypervisor vendor:               KVM
Virtualization type:             full
L1d cache:                       32 KiB
L1i cache:                       32 KiB
L2 cache:                        4 MiB
L3 cache:                        16 MiB
Vulnerability Itlb multihit:     Processor vulnerable
Vulnerability L1tf:              Mitigation; PTE Inversion
Vulnerability Mds:               Mitigation; Clear CPU buffers; SMT Host state unknown
Vulnerability Meltdown:          Mitigation; PTI
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Full generic retpoline, IBPB conditional, IBRS_FW, STIBP disabled, RSB filling
Vulnerability Tsx async abort:   Mitigation; Clear CPU buffers; SMT Host state unknown
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat pku md_clear

docker exec r-gvisor lscpu ...
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
Address sizes:       46 bits physical, 48 bits virtual
CPU(s):              16
On-line CPU(s) list: 0-15
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          unknown
Stepping:            unknown
CPU MHz:             1248.188
BogoMIPS:            1248.19
Virtualization:      VT-x
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves pku ospke

docker exec r-runc lscpu ...
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   46 bits physical, 48 bits virtual
CPU(s):                          16
On-line CPU(s) list:             0-15
Thread(s) per core:              2
Core(s) per socket:              8
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           85
Model name:                      Intel(R) Xeon(R) D-2146NT CPU @ 2.30GHz
Stepping:                        4
Frequency boost:                 enabled
CPU MHz:                         1096.745
CPU max MHz:                     2301.0000
CPU min MHz:                     1000.0000
BogoMIPS:                        4600.00
Virtualization:                  VT-x
L1d cache:                       256 KiB
L1i cache:                       256 KiB
L2 cache:                        8 MiB
L3 cache:                        11 MiB
NUMA node0 CPU(s):               0-15
Vulnerability Itlb multihit:     KVM: Mitigation: Split huge pages
Vulnerability L1tf:              Mitigation; PTE Inversion; VMX conditional cache flushes, SMT vulnerable
Vulnerability Mds:               Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Meltdown:          Mitigation; PTI
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Full generic retpoline, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling
Vulnerability Tsx async abort:   Mitigation; Clear CPU buffers; SMT vulnerable
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti intel_ppin ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku ospke md_clear flush_l1d

Looking at the qemu process backing the Kata container, we see a single assigned vCPU with up to 16 hot-pluggable (-smp 1,cores=1,threads=1,sockets=16,maxcpus=16) and 2048M of memory:

$ ps ax|grep qemu
17233 ?        Sl     0:09 /usr/bin/qemu-vanilla-system-x86_64 -name sandbox-020c88d6f2ae38b1b33788bf11d570d58931e7588a1b2412bf510c5ce2cd5650 -uuid b61941a4-1faf-4b9e-be8a-2c0b440730ca -machine pc,accel=kvm,kernel_irqchip,nvdimm -cpu host -qmp unix:/run/vc/vm/020c88d6f2ae38b1b33788bf11d570d58931e7588a1b2412bf510c5ce2cd5650/qmp.sock,server,nowait -m 2048M,slots=10,maxmem=65110M -device pci-bridge,bus=pci.0,id=pci-bridge-0,chassis_nr=1,shpc=on,addr=2,romfile= -device virtio-serial-pci,disable-modern=false,id=serial0,romfile= -device virtconsole,chardev=charconsole0,id=console0 -chardev socket,id=charconsole0,path=/run/vc/vm/020c88d6f2ae38b1b33788bf11d570d58931e7588a1b2412bf510c5ce2cd5650/console.sock,server,nowait -device nvdimm,id=nv0,memdev=mem0 -object memory-backend-file,id=mem0,mem-path=/usr/share/kata-containers/kata-containers-image_clearlinux_1.11.0-alpha0_agent_d26a505efd.img,size=134217728 -device virtio-scsi-pci,id=scsi0,disable-modern=false,romfile= -object rng-random,id=rng0,filename=/dev/urandom -device virtio-rng,rng=rng0,romfile= -device virtserialport,chardev=charch0,id=channel0,name=agent.channel.0 -chardev socket,id=charch0,path=/run/vc/vm/020c88d6f2ae38b1b33788bf11d570d58931e7588a1b2412bf510c5ce2cd5650/kata.sock,server,nowait -device virtio-9p-pci,disable-modern=false,fsdev=extra-9p-kataShared,mount_tag=kataShared,romfile= -fsdev local,id=extra-9p-kataShared,path=/run/kata-containers/shared/sandboxes/020c88d6f2ae38b1b33788bf11d570d58931e7588a1b2412bf510c5ce2cd5650,security_model=none -netdev tap,id=network-0,vhost=on,vhostfds=3,fds=4 -device driver=virtio-net-pci,netdev=network-0,mac=02:42:ac:11:00:04,disable-modern=false,mq=on,vectors=4,romfile= -global kvm-pit.lost_tick_policy=discard -vga none -no-user-config -nodefaults -nographic -daemonize -object memory-backend-ram,id=dimm1,size=2048M -numa node,memdev=dimm1 -kernel /usr/share/kata-containers/vmlinuz-5.4.15.66-45.container -append tsc=reliable no_timer_check rcupdate.rcu_expedited=1 
i8042.direct=1 i8042.dumbkbd=1 i8042.nopnp=1 i8042.noaux=1 noreplace-smp reboot=k console=hvc0 console=hvc1 iommu=off cryptomgr.notests net.ifnames=0 pci=lastbus=0 root=/dev/pmem0p1 rootflags=dax,data=ordered,errors=remount-ro ro rootfstype=ext4 quiet systemd.show_status=false panic=1 nr_cpus=16 agent.use_vsock=false systemd.unit=kata-containers.target systemd.mask=systemd-networkd.service systemd.mask=systemd-networkd.socket -pidfile /run/vc/vm/020c88d6f2ae38b1b33788bf11d570d58931e7588a1b2412bf510c5ce2cd5650/pid -smp 1,cores=1,threads=1,sockets=16,maxcpus=16

Memory

$ ./exec.sh grep Mem /proc/meminfo
docker exec r-kata grep Mem /proc/meminfo ...
MemTotal:        2043288 kB
MemFree:         2008312 kB
MemAvailable:    1992656 kB

docker exec r-gvisor grep Mem /proc/meminfo ...
MemTotal:        2097152 kB
MemFree:         2095112 kB
MemAvailable:    2095112 kB

docker exec r-runc grep Mem /proc/meminfo ...
MemTotal:       65625052 kB
MemFree:        50505940 kB
MemAvailable:   51660856 kB

Only runc reports the actual available physical memory; gVisor and Kata each report roughly 2 GB. Again, I'm sure this can be configured per container.
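The roughly 2 GB and the single vCPU line up with Kata's shipped defaults, which in the 1.x packages live in /usr/share/defaults/kata-containers/configuration.toml as default_vcpus and default_memory; raising them resizes every Kata VM. Since that file may not exist on every machine, this sketch demonstrates on a local copy of the two relevant lines (/tmp path is just scratch space for the demo):

```shell
# Kata 1.x reads its VM sizing from configuration.toml; these two keys explain
# the 1 CPU / 2048 MB the container reported. Demonstrated on a local copy.
cat > /tmp/kata-config-demo.toml <<'EOF'
default_vcpus = 1
default_memory = 2048
EOF
grep -E '^default_(vcpus|memory)' /tmp/kata-config-demo.toml
```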

docker stats however still reports the actual memory usage of the container:

$ docker stats --no-stream
CONTAINER ID        NAME                CPU %               MEM USAGE / LIMIT     MEM %               NET I/O             BLOCK I/O           PIDS
f8de0ffbee4b        r-kata              0.00%               6.656MiB / 62.58GiB   0.01%               4.75kB / 2.27kB     0B / 0B             9
0a3024704e82        r-gvisor            0.03%               15.53MiB / 62.58GiB   0.02%               7.08kB / 0B         0B / 0B             158
db036e69cf1a        r-runc              0.00%               4.348MiB / 62.58GiB   0.01%               7.3kB / 0B          0B / 0B             1

Attaching a network at runtime

Create a new virtual network:

$ docker network create my-net
4b77bec0a4e4198841dde0823d370ead5e0cbaaf06afad62cfe471cd222bcdf7

And attaching it to all 3 containers:

docker network connect my-net r-kata
docker network connect my-net r-runc
docker network connect my-net r-gvisor

Cool, no errors are reported. Let's check whether the new network is visible within each container. It turns out this only worked with runc, not with Kata or gVisor. Looks like a trade-off made in the name of isolation: additional networks must be specified at launch.

$ ./exec.sh ip addr show 
docker exec r-kata ip addr show ...
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq state UP qlen 1000
    link/ether 02:42:ac:11:00:04 brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.4/16 brd 172.17.255.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::42:acff:fe11:4/64 scope link 
       valid_lft forever preferred_lft forever

docker exec r-gvisor ip addr show ...
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/32 scope global dynamic 
2: eth0: <UP,LOWER_UP> mtu 1500 
    link/generic 02:42:ac:11:00:03 brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.3/32 scope global dynamic 

docker exec r-runc ip addr show ...
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
31: eth0@if32: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1500 qdisc noqueue state UP 
    link/ether 02:42:ac:11:00:02 brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.2/16 brd 172.17.255.255 scope global eth0
       valid_lft forever preferred_lft forever
40: eth1@if41: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1500 qdisc noqueue state UP 
    link/ether 02:42:ac:12:00:03 brd ff:ff:ff:ff:ff:ff
    inet 172.18.0.3/16 brd 172.18.255.255 scope global eth1
       valid_lft forever preferred_lft forever
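So if a Kata or gVisor container needs the extra network, it has to be there when the container is created. A sketch of that workaround (container name r-kata-mynet is hypothetical, and the stub fallback is an assumption so the sketch runs without a Docker daemon):

```shell
# Workaround sketch: pass the network at launch for Kata/gVisor, since
# docker network connect only took effect for the runc container here.
# Stub fallback so the sketch runs without a Docker daemon; the container
# name r-kata-mynet is purely illustrative.
command -v docker >/dev/null 2>&1 || docker() { echo "[stub] docker $*"; }
docker network create my-net
docker run -ti --runtime kata --network my-net -d --rm \
  --name r-kata-mynet --hostname r-kata-mynet alpine
```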

Volumes

To test volume mounts, I destroyed the containers and relaunched them with a host folder as a volume mount:

$ cat run-vol.sh 
#!/bin/bash
docker run -ti --runtime runc -v $PWD:/u -d --rm --name r-runc --hostname r-runc alpine
docker run -ti --runtime runsc -v $PWD:/u -d --rm --name r-gvisor --hostname r-gvisor alpine
docker run -ti --runtime kata -v $PWD:/u -d --rm --name r-kata --hostname r-kata alpine

All containers successfully mounted the current folder:

$ ./exec.sh ls /u
docker exec r-kata ls /u ...
exec.sh
install-kata.sh
install.sh
run-vol.sh
run.sh

docker exec r-gvisor ls /u ...
exec.sh
install-kata.sh
install.sh
run-vol.sh
run.sh

docker exec r-runc ls /u ...
exec.sh
install-kata.sh
install.sh
run-vol.sh
run.sh

What about writing to the folder? For this test to work, I had to modify the exec.sh script slightly, so I can use redirects within the container:

$ cat exec.sh 
#!/bin/sh
for runtime in kata gvisor runc; do
  echo docker exec r-$runtime sh -c "$@" ...
  docker exec r-$runtime sh -c "$@"
  echo ""
done
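The reason for the sh -c wrapper: without it, the host shell performs the >> redirect itself before docker exec ever runs, so the target path is resolved on the host (where /u may not even exist) rather than inside the container. The principle, demonstrated with plain sh in place of docker exec (/tmp/redirect-demo.txt is just a scratch path for the demo):

```shell
# Wrapping the command string in sh -c means the INNER shell opens the
# redirect target, so the path is resolved in its environment -- which is
# exactly what we want when that shell runs inside a container.
sh -c 'echo r-demo >> /tmp/redirect-demo.txt'
cat /tmp/redirect-demo.txt
```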

Now for the test itself:

$ ./exec.sh 'cat /etc/hostname >> /u/hostnames.txt'
docker exec r-kata sh -c cat /etc/hostname >> /u/hostnames.txt ...

docker exec r-gvisor sh -c cat /etc/hostname >> /u/hostnames.txt ...

docker exec r-runc sh -c cat /etc/hostname >> /u/hostnames.txt ...

mwiget@jcrpd:~/kata-gvisor-docker$ cat hostnames.txt 
r-kata
r-gvisor
r-runc

Bingo. Writing works too. Every container appended its hostname to the file hostnames.txt on the host via the read-write mount point.

Don't trust my shell script ;-)? Let's try this one-liner:

kata-gvisor-docker$ ./exec.sh 'touch /u/`hostname`'
docker exec r-kata sh -c touch /u/`hostname` ...

docker exec r-gvisor sh -c touch /u/`hostname` ...

docker exec r-runc sh -c touch /u/`hostname` ...

$ ls
exec.sh  hostnames.txt  install-kata.sh  install.sh  jcrpd  r-gvisor  r-kata  r-runc  run.sh  run-vol.sh

As expected, we see files named after each container's hostname.

XDP

I gave my XDP drop test container a go using gVisor and Kata, but both failed to load the XDP program via ip link set dev eth0 xdp obj /xdp-drop.o sec drop_icmp:

Kata reported this:

Installing xdp-drop.o app on eth0 ...
mkdir /sys/fs/bpf failed: Operation not permitted
Continuing without mounted eBPF fs. Too old kernel?

Prog section 'drop_icmp' rejected: Function not implemented (38)!
 - Type:         6
 - Instructions: 11 (0 over limit)
 - License:      GPL

While gVisor runsc reported this:

Installing xdp-drop.o app on eth0 ...
Error: either "dev" is duplicate, or "xdp" is a garbage.
Makefile:16: recipe for target 'run-gvisor' failed
