44.7 Security Boundaries: Why Containers Are Not VMs

Right, let’s get this out of the way immediately: a container is not a virtual machine. If you walk away from this chapter remembering one thing, let it be that. The marketing departments of various companies have done a fantastic job of blurring the lines, but you and I are technical people, and we deal in truths, not brochures. A VM runs a full-blown guest operating system, with its own kernel, on virtualized hardware provided by a hypervisor. A container is just a process running on the host’s kernel. A fancy, wrapped-up, slightly narcissistic process that thinks it’s the center of the universe, but a process nonetheless. Its isolation comes from two kernel features: cgroups (which limit resources) and namespaces (which limit visibility). This is a security boundary, but because every container shares that one host kernel, a kernel bug or a misconfiguration can let a process climb out. It’s a fence, not a fortress wall.

44.6 OCI: The Open Container Initiative Standard

Right, so you’ve got your head around cgroups and namespaces, the raw, kernel-level primitives that let us box processes up. Powerful stuff, but also a bit like being handed a pile of lumber, a box of nails, and a saw. You could build a house with it, but you’d probably rather have a blueprint and some pre-fab walls. That’s where the Open Container Initiative, or OCI, comes in. It’s the blueprint: a set of open specifications for what a container image looks like, how a runtime should run it, and how registries hand images around.
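To make the blueprint idea a little more concrete, here is a minimal sketch that builds the kind of config.json the OCI runtime specification describes. The top-level field names (ociVersion, process, root, hostname, linux.namespaces) follow the spec as I understand it. The function name minimal_oci_config, the rootfs path, and the chosen values are illustrative assumptions, and a real runtime will usually want more than this (mounts, capabilities, and so on).

```python
import json


def minimal_oci_config(args, rootfs="rootfs", hostname="sketch"):
    """Return a dict shaped like a heavily trimmed OCI runtime config.json.

    The helper name and default values are hypothetical; only the field
    layout is meant to mirror the runtime specification.
    """
    return {
        "ociVersion": "1.0.2",
        "process": {
            "terminal": False,
            "user": {"uid": 0, "gid": 0},
            "args": list(args),          # the command the container runs
            "env": ["PATH=/usr/sbin:/usr/bin:/sbin:/bin"],
            "cwd": "/",
        },
        # Directory (relative to the bundle) holding the unpacked filesystem.
        "root": {"path": rootfs, "readonly": True},
        "hostname": hostname,
        "linux": {
            # Each entry asks the runtime for a fresh namespace of that type.
            "namespaces": [
                {"type": t} for t in ("pid", "network", "mount", "uts", "ipc")
            ],
        },
    }


if __name__ == "__main__":
    print(json.dumps(minimal_oci_config(["/bin/sh"]), indent=2))
```

Notice that the namespaces from earlier show up as plain data. The spec does not invent new isolation; it just writes down which kernel features a runtime should switch on.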

44.5 Container Runtimes: runc, crun, containerd, CRI-O

Right, so you’ve got this container image. It’s a neat little tarball with some metadata, all wrapped up according to the OCI spec. Wonderful. But a container image is not a container. It’s a blueprint. Something has to actually unpack that blueprint, wire up the kernel isolation features we talked about (the namespaces and cgroups), and run the process. That “something” is the container runtime. And this is where the landscape gets… interesting. Let’s untangle the wonderful hierarchy of tools that actually make docker run happen.
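The "unpack that blueprint" step can be sketched in a few lines. What follows is a toy, not what runc or containerd actually do: it treats an image as an ordered list of tar layers and applies them on top of each other, so that files in later layers replace files from earlier ones. The helper names make_layer and apply_layers are made up for this example, and symlinks, device nodes, and whiteout (deletion) entries are deliberately skipped.

```python
import io
import os
import tarfile


def make_layer(files):
    """Build an in-memory tar archive from a {path: bytes} mapping (demo helper)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def apply_layers(layers, rootfs):
    """Unpack layers in order into rootfs; later layers overwrite earlier ones."""
    root = os.path.realpath(rootfs)
    for blob in layers:
        with tarfile.open(fileobj=io.BytesIO(blob), mode="r") as tar:
            for member in tar:
                target = os.path.realpath(os.path.join(root, member.name))
                # Refuse entries that would land outside the root directory.
                if os.path.commonpath([root, target]) != root:
                    continue
                if member.isdir():
                    os.makedirs(target, exist_ok=True)
                elif member.isfile():
                    os.makedirs(os.path.dirname(target), exist_ok=True)
                    with tar.extractfile(member) as src, open(target, "wb") as dst:
                        dst.write(src.read())
                # Anything else (symlinks, devices, whiteouts) is ignored here.
```

Once the layers are flattened into a directory, that directory is the root filesystem the runtime points the new process at.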

44.4 Creating a Container Manually with unshare and chroot

Right, so you want to build a container. Not pull one from a registry, not write a Dockerfile and let a daemon do the heavy lifting. You want to get your hands dirty and build one from scratch. Excellent. This is where you stop waving at the ship and start learning how the engine room works. It’s messy, it’s manual, and it’s the single best way to understand what the hell is actually happening when you run docker run.
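As a preview of the shape of the thing, here is a small sketch that only assembles the command line; it does not run it, since doing so generally needs root (or a user namespace). The flags are the long options of the util-linux unshare tool. The function name unshare_chroot_argv and the /srv/rootfs path are placeholders for this example, and the root filesystem is assumed to already contain a shell.

```python
import shlex


def unshare_chroot_argv(rootfs, command=("/bin/sh",)):
    """Build an argv that starts `command` in new namespaces, chrooted into rootfs."""
    return [
        "unshare",
        "--fork",    # fork first, so the child is PID 1 in the new PID namespace
        "--pid",     # new PID namespace
        "--mount",   # new mount namespace
        "--uts",     # new hostname/domainname namespace
        "--ipc",     # new System V IPC / POSIX message queue namespace
        "--net",     # new, empty network namespace
        "chroot", rootfs,   # switch the root directory, then exec the command
        *command,
    ]


if __name__ == "__main__":
    # Prints: unshare --fork --pid --mount --uts --ipc --net chroot /srv/rootfs /bin/sh
    print(shlex.join(unshare_chroot_argv("/srv/rootfs")))
```

Two programs, a handful of flags, and a directory: that is most of the skeleton. The sketch leaves out things a usable container needs, such as mounting /proc inside the new root.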

44.3 How Docker and Podman Use Namespaces and cgroups

Alright, let’s pull back the curtain. When you type docker run or podman run, you’re not just asking for a container. You’re asking these tools to be your personal stage manager for a one-act play starring your application. Their job is to use Linux’s core features—namespaces and cgroups—to build the set, cast the actors, and enforce the rules of the performance. The magic isn’t in the tools themselves; it’s in how they wield these underlying kernel facilities. They’re just particularly good stage managers.
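One way to picture the stage manager's job: limits you pass on the command line (docker run has --memory, --cpus, and --pids-limit, for instance) eventually have to become values in cgroup control files. The sketch below shows roughly what that translation could look like on a cgroup v2 host. It is an illustration under assumptions, not Docker's or Podman's actual code path; the function name cgroup_v2_limits is invented, and the 100000 microsecond period is assumed as the default.

```python
def cgroup_v2_limits(memory_bytes=None, cpus=None, pids_limit=None, period_us=100_000):
    """Translate run-style resource limits into cgroup v2 file contents.

    Returns a {filename: value} mapping; nothing is written to disk.
    """
    files = {}
    if memory_bytes is not None:
        # Hard memory ceiling for the whole group, in bytes.
        files["memory.max"] = str(memory_bytes)
    if cpus is not None:
        # cpu.max is "<quota> <period>": run for quota microseconds per period.
        files["cpu.max"] = f"{round(cpus * period_us)} {period_us}"
    if pids_limit is not None:
        # Maximum number of processes/threads in the group.
        files["pids.max"] = str(pids_limit)
    return files


if __name__ == "__main__":
    for name, value in cgroup_v2_limits(256 * 1024 * 1024, 1.5, 100).items():
        print(f"{name}: {value}")
```

So "one and a half CPUs" is really "150000 microseconds of CPU time out of every 100000", which is the kernel's way of saying the same thing.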

44.2 cgroups: Resource Accounting and Limiting (v1 vs v2)

Alright, let’s get our hands dirty with cgroups. If namespaces are about providing isolation (making you think you’re alone), cgroups are the prison guards enforcing the rules. They’re about resource accounting and limiting. They answer the crucial question: “How much CPU, memory, and I/O can this process, and all its future children, actually use?” You’ll run into two flavors: the old, slightly dysfunctional v1 and the newer, more coherent v2. The Linux kernel maintainers looked at the glorious mess of v1 and said, “We can do better.” And they did. v2 is a significant redesign, not just an incremental update. The key difference is philosophical: v1 let you control different resources (CPU, memory, I/O) with multiple, independent hierarchies. v2 enforces a single, unified hierarchy. This sounds boring, but it’s the difference between herding a dozen cats and commanding a well-drilled squad of soldiers. v1 was the cats.
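You can see the cats-versus-soldiers difference in /proc/self/cgroup. Each line has the form hierarchy-ID:controller-list:path; under v1 there is one line per hierarchy, while the unified v2 hierarchy appears as a single entry with ID 0 and an empty controller list. The sketch below parses that text and guesses which flavor a process lives in. The function names and the "hybrid" label for mixed setups are my own choices for the example.

```python
def parse_proc_cgroup(text):
    """Parse the contents of /proc/<pid>/cgroup into a list of dicts."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split at most twice: the path itself may contain colons.
        hierarchy_id, controllers, path = line.split(":", 2)
        entries.append({
            "hierarchy": int(hierarchy_id),
            "controllers": [c for c in controllers.split(",") if c],
            "path": path,
        })
    return entries


def cgroup_mode(entries):
    """Classify parsed entries as 'v1', 'v2', 'hybrid', or 'unknown'."""
    has_v2 = any(e["hierarchy"] == 0 and not e["controllers"] for e in entries)
    has_v1 = any(e["controllers"] for e in entries)
    if has_v1 and has_v2:
        return "hybrid"
    if has_v2:
        return "v2"
    if has_v1:
        return "v1"
    return "unknown"


if __name__ == "__main__":
    try:
        with open("/proc/self/cgroup") as fh:
            print(cgroup_mode(parse_proc_cgroup(fh.read())))
    except OSError:
        print("no /proc/self/cgroup on this system")
```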

44.1 Linux Namespaces: The Isolation Primitive (pid, net, mnt, uts, ipc, user)

Right, let’s talk about namespaces. If cgroups are the resource accountants, namespaces are the office architects who build the walls, install the soundproofing, and give everyone a separate phone line. They are the fundamental isolation primitive in Linux. Without them, a container is just a fancy, jailed process. With them, a process can be given the utterly unshakable illusion that it is the only process on a machine, with its own network, its own hostname, and its own file system. It’s a brilliant magic trick, and like all good magic, it relies on a healthy dose of misdirection.
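The walls are easy to inspect. On Linux, every process has a /proc/<pid>/ns directory whose entries are symbolic links such as pid:[4026531836]; two processes are in the same namespace exactly when the link for that type reads the same. Here is a small sketch that reads those links and compares two processes. The function names and the proc_root parameter (handy for pointing the code at a fake directory) are assumptions made for this example.

```python
import os


def read_namespaces(pid="self", proc_root="/proc"):
    """Return {namespace_name: identifier} for one process, e.g. {'pid': 'pid:[...]'}."""
    ns_dir = os.path.join(proc_root, str(pid), "ns")
    result = {}
    for name in sorted(os.listdir(ns_dir)):
        try:
            result[name] = os.readlink(os.path.join(ns_dir, name))
        except OSError:
            # Not a symlink or not readable with our privileges; skip it.
            continue
    return result


def shared_namespaces(a, b):
    """List the namespace types for which two processes hold the same identifier."""
    return sorted(name for name in a if name in b and a[name] == b[name])


if __name__ == "__main__" and os.path.isdir("/proc/self/ns"):
    for name, ident in read_namespaces().items():
        print(f"{name:20} {ident}")
```

Run it on the host and again inside a container, and the identifiers that differ are the walls the architects put up.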

...