44.7 Security Boundaries: Why Containers Are Not VMs
Right, let’s get this out of the way immediately: a container is not a virtual machine. If you walk away from this chapter remembering one thing, let it be that. The marketing departments of various companies have done a fantastic job of blurring the lines, but you and I are technical people, and we deal in truths, not brochures. A VM is a full-blown guest operating system, virtualizing hardware, sitting on top of a hypervisor. A container is just a process. A fancy, wrapped-up, slightly narcissistic process that thinks it’s the center of the universe, but a process nonetheless. Its isolation comes from two kernel features: cgroups (which limit resources) and namespaces (which limit visibility). This is a security boundary, but it’s a fence, not a fortress wall.
The Shared Kernel: Your Single Point of Failure
The most critical distinction is the kernel. Every container on a host shares the same kernel as the host itself. This isn’t a theoretical point; it’s a gaping architectural reality. If your containerized application finds a kernel vulnerability it can exploit to escalate privileges (and oh, they do), it breaks out into the host operating system. Game over. With a VM, a kernel exploit in the guest compromises that guest’s kernel; reaching the host or the other VMs takes a second, separate exploit against the hypervisor.
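If you want to see the shared kernel for yourself, here is a minimal sketch. It assumes a Linux host with a reachable Docker daemon, and the alpine image is my own pick for illustration (any image that ships uname would do). The same_kernel helper is hypothetical, not part of any tool.

```shell
# Report whether two kernel release strings are identical.
same_kernel() {
    if [ "$1" = "$2" ]; then
        echo "shared kernel: $1"
    else
        echo "different kernels: $1 vs $2"
    fi
}

host_kernel=$(uname -r)
echo "host kernel: $host_kernel"

# Only try the container side when a working Docker daemon is reachable.
# "alpine" is an assumed image name, chosen for illustration.
if command -v docker >/dev/null 2>&1 && docker info >/dev/null 2>&1; then
    container_kernel=$(docker run --rm alpine uname -r 2>/dev/null) || container_kernel=""
    if [ -n "$container_kernel" ]; then
        same_kernel "$host_kernel" "$container_kernel"
    fi
fi
```

On a Linux host the two strings should match, because there is only one kernel to ask. Run the same comparison against a VM and the guest reports whatever kernel it booted.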
Think of it like an apartment building. VMs are individual, self-contained units with their own plumbing and electrical (their own OS). A fire (a kernel panic) in one unit might smoke-damage others, but it’s somewhat contained. Containers are just rooms in one big unit (the host OS). If someone kicks down the drywall between rooms (a kernel exploit), they have the run of the entire place. This is why you never, ever run untrusted code in a container on a valuable host.
The Illusion of Isolation: Namespaces
Namespaces are the magic trick that makes a process feel alone. They wrap a set of system resources and present the illusion that the process has its own isolated instance. The most important ones for this discussion are the pid, net, mnt, and user namespaces.
For example, a process in a PID namespace might think it’s PID 1 (the init process) and the glorious parent of all it surveys. From the host’s perspective, it’s just some random process with a much higher, less important PID.
# On the host, let's find a container's "init" process
sudo docker run -d --name my_nginx nginx
container_pid=$(sudo docker inspect my_nginx --format '{{.State.Pid}}')
echo "From the host, the Nginx 'init' process is actually PID: $container_pid"
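# Now let's see what it looks like FROM INSIDE the container.
# ps reads /proc, so enter the mount namespace too (-m); with -p alone it
# would still read the host's /proc and list the host's processes.
# This assumes the image ships ps; slim images often do not.
sudo nsenter -t "$container_pid" -p -m ps aux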
The nsenter command runs a program inside the namespaces of another process. Inside, you’ll see a clean, minimal process list headed by your nginx master process as PID 1. From the outside, it’s just one tree in a vast forest. This isolation is good, but it’s not comprehensive. The kernel is still shared, and some resources, notably the kernel keyring and the wall-clock time, aren’t as neatly namespaced as we’d like.
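Namespaces are also visible without entering them. The kernel exposes each process's namespace memberships as symlinks under /proc/<pid>/ns, and two processes share a namespace exactly when the link targets match. The sketch below compares a process against the current shell; ns_id and same_ns are hypothetical helper names of my own.

```shell
# Print the identifier of one namespace of one process, in the form "pid:[<inode>]".
ns_id() {
    # $1: PID (or "self"), $2: namespace type such as pid, net, mnt, user
    readlink "/proc/$1/ns/$2"
}

# Succeed if both processes are in the same namespace of the given type.
same_ns() {
    a=$(ns_id "$1" "$3") || return 1
    b=$(ns_id "$2" "$3") || return 1
    [ -n "$a" ] && [ "$a" = "$b" ]
}

# Compare this shell with a child of itself: everything should be shared.
for ns in pid net mnt user; do
    if same_ns self "$$" "$ns"; then
        echo "$ns: shared ($(ns_id self "$ns"))"
    else
        echo "$ns: different or unreadable"
    fi
done
```

Run as root, same_ns 1 "$container_pid" pid should fail for a container started as above, since the container has its own PID namespace. With Docker's defaults I would expect the user namespace check to succeed, which is a quiet reminder that root in the container is the same root as on the host.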
Resource Limits: Control Groups (cgroups)
While namespaces provide isolation, cgroups do the equally important job of accounting and limiting resources. They are the reason one greedy container can’t bring the entire host to its knees by eating all the RAM or CPU, provided you actually set limits: Docker applies no memory or CPU limit unless you ask for one. Once set, the limit is a hard boundary, and it’s crucial for multi-tenant environments.
Here’s how you might see them on a modern system using cgroups v2 (which is what you should be using now, by the way):
# Find the cgroup for our running container
cat "/proc/$container_pid/cgroup"
# On cgroups v2 you'll get a single line like: 0::/docker/abcdef12345...
# Extract the path from that unified-hierarchy entry (hierarchy ID 0)
cg_path=$(awk -F: '$1 == "0" {print $3}' "/proc/$container_pid/cgroup")
# Now let's look at its memory limit. cgroups v2 is usually mounted at /sys/fs/cgroup
echo "Memory max for this container:"
# Prints "max" if the container was started without a memory limit
cat "/sys/fs/cgroup${cg_path}/memory.max"
This file, memory.max, contains the hard limit. If the container tries to exceed it, the OOM (Out-Of-Memory) killer will step in and mercilessly kill processes inside the container to enforce the limit. This protects the host, but it’s a blunt instrument for the container itself.
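To find out whether the OOM killer has already paid a visit, cgroups v2 keeps a counter in memory.events, next to memory.max. The sketch below reads the oom_kill field; oom_kills is a hypothetical helper name, and the snippet inspects the cgroup of the current shell rather than a container, so that it runs anywhere cgroups v2 is mounted.

```shell
# Print the oom_kill counter from a cgroups v2 memory.events file (0 if absent).
oom_kills() {
    # $1: path to a memory.events file
    awk '$1 == "oom_kill" {n = $2} END {print n + 0}' "$1"
}

# Look at the cgroup this shell lives in, if the file is readable here.
cg_path=$(awk -F: '$1 == "0" {print $3}' /proc/self/cgroup 2>/dev/null) || cg_path=""
events="/sys/fs/cgroup${cg_path}/memory.events"
if [ -r "$events" ]; then
    echo "OOM kills in this cgroup so far: $(oom_kills "$events")"
else
    echo "no readable memory.events at $events (not cgroups v2, or the root cgroup)"
fi
```

To watch it move, start a container with a limit, for example --memory=256m, which should show up in memory.max as 268435456 (256 × 1024 × 1024 bytes), and point oom_kills at that container's memory.events.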
The Gap: Where “Secure Enough” Isn’t Secure
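So, where does this model fall apart? Everywhere the kernel interface is broad and poorly namespaced. The classic example is the SYS_ADMIN capability itself (often granted to things that foolishly think they need to mount filesystems). It is a grab bag of privileged operations, mount among them, and much of what it unlocks was never designed with containers in mind. On hosts that still expose cgroups v1, and where no AppArmor or seccomp profile blocks the mount, a container holding SYS_ADMIN has been able to mount a cgroup hierarchy, set its release_agent, and have the kernel run a command of its choosing as root on the host. It’s a fantastic way to hand over your own host.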
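Before trusting a container, it is worth checking whether it actually holds SYS_ADMIN. The kernel reports a process's effective capabilities as a hex bitmask on the CapEff line of /proc/<pid>/status, and CAP_SYS_ADMIN is bit 21 of that mask. A minimal sketch, with has_cap as a hypothetical helper name:

```shell
# Succeed if the given capability bit is set in a hex capability mask.
has_cap() {
    # $1: hex mask as printed in /proc/<pid>/status, without a 0x prefix
    # $2: capability bit number, e.g. 21 for CAP_SYS_ADMIN
    [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]
}

# Inspect the current shell; substitute a container's host PID for $$ to inspect it.
cap_eff=$(awk '$1 == "CapEff:" {print $2}' "/proc/$$/status" 2>/dev/null) || cap_eff=""
if [ -n "$cap_eff" ]; then
    if has_cap "$cap_eff" 21; then
        echo "CapEff=$cap_eff: holds CAP_SYS_ADMIN"
    else
        echo "CapEff=$cap_eff: does not hold CAP_SYS_ADMIN"
    fi
fi
```

For reference, a root process in a container started with Docker's default capability set typically shows 00000000a80425fb, which does not include bit 21. If you see it set, someone asked for it.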
This is why the principle of least privilege isn’t just a best practice; it’s the only way to run containers with a semi-straight face. You must drop all capabilities (--cap-drop=ALL) and then add back only the specific ones you need (for example, --cap-add=NET_BIND_SERVICE to bind a port below 1024). You should also run as a non-root user inside the container (-u 1000:1000) and use a user namespace to map that user to a high UID on the host, adding another much-needed layer of separation. Docker, in its infinite wisdom, doesn’t enable user namespaces by default because it “breaks things.” A telling choice.
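You can check whether a user namespace mapping is really in effect by reading /proc/<pid>/uid_map. Each line holds three numbers: the first ID inside the namespace, the first ID outside it, and the length of the range. A single line reading 0 0 4294967295 is the identity map, meaning no remapping at all. The sketch below tests for that; userns_remapped is a hypothetical helper name.

```shell
# Succeed if a uid_map file describes anything other than the identity map.
userns_remapped() {
    # $1: path to a uid_map file (inside-ID, outside-ID, range length per line)
    awk '!($1 == 0 && $2 == 0 && $3 == 4294967295) {remapped = 1}
         END {exit remapped ? 0 : 1}' "$1"
}

# Inspect the current shell; substitute a container's host PID for $$ to inspect it.
if [ -r "/proc/$$/uid_map" ]; then
    if userns_remapped "/proc/$$/uid_map"; then
        echo "UIDs are remapped: $(cat "/proc/$$/uid_map")"
    else
        echo "identity map: UID 0 here is UID 0 on the host"
    fi
fi
```

For a container, read the map of its host PID, for instance /proc/$container_pid/uid_map. With remapping enabled you would expect something like 0 100000 65536, where the exact numbers depend on how subordinate UIDs are configured on your host and are shown here only as an illustration.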
The bottom line? Containers are an incredible tool for packaging software, managing dependencies, and achieving density. But they are an isolation mechanism built on a foundation of shared trust in a single, massive kernel codebase. Use them to isolate your own stuff from itself, not to isolate yourself from a malicious actor. For that, you still need the thick walls of a VM or, better yet, a physical machine.