21.2 Capabilities: Adding and Dropping Linux Capabilities
Right, let’s talk about capabilities. This is where we stop treating a container like a dumb binary and start treating it like a process running on a Linux kernel, which is exactly what it is. The all-or-nothing model of “run as root” or “run as non-root” is brutally simplistic. Sometimes a process just needs to do one privileged thing, like bind to a port below 1024. Giving it full root access to do that is like giving someone the master key to the entire building because they need to get into the supply closet. It’s absurd.
Linux capabilities solve this by breaking down the monolithic power of root into a set of distinct, individual privileges. Your container’s process can hold just the specific capabilities it needs and nothing more. This is the cornerstone of writing least-privilege Pod specs. You’ll see two main fields in the security context for this: capabilities.drop and capabilities.add. Both take capability names without the kernel’s CAP_ prefix, so CAP_NET_BIND_SERVICE is written as NET_BIND_SERVICE. The naming is a bit of a clue to the philosophy: first, we take everything away, then we add back only what’s strictly necessary.
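The Default Capability Set and Why You Must Drop It
Here’s the first thing you need to know, and it’s a doozy: if you don’t specify anything, your container doesn’t start from zero. The container runtime hands it a default set of capabilities. With Docker and containerd that set includes NET_BIND_SERVICE… and also things like NET_RAW, which lets it craft raw packets, along with CHOWN, DAC_OVERRIDE, SETUID, and SETGID. The truly frightening ones, such as SYS_ADMIN and SYS_MODULE, are not in the default set, but what remains is still far more than a typical application needs. It’s a historical artifact: the defaults were chosen so that software expecting to run as root would keep working unmodified. Kubernetes doesn’t pick this set; it inherits whatever the runtime grants, and for security that is a poor starting point.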
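If you want to check what a container actually holds rather than take anyone’s word for it, the kernel exposes each capability set as a hexadecimal bitmask in /proc/<pid>/status (the CapEff line is the effective set). The sketch below decodes such a mask in Python. The name table follows the bit numbering in the kernel’s capability header; the sample mask is an assumption, namely the value commonly reported inside a container started with default Docker or containerd settings, so yours may differ.

```python
# Capability names indexed by bit number, as defined in <linux/capability.h>.
CAP_NAMES = [
    "CHOWN", "DAC_OVERRIDE", "DAC_READ_SEARCH", "FOWNER", "FSETID",
    "KILL", "SETGID", "SETUID", "SETPCAP", "LINUX_IMMUTABLE",
    "NET_BIND_SERVICE", "NET_BROADCAST", "NET_ADMIN", "NET_RAW", "IPC_LOCK",
    "IPC_OWNER", "SYS_MODULE", "SYS_RAWIO", "SYS_CHROOT", "SYS_PTRACE",
    "SYS_PACCT", "SYS_ADMIN", "SYS_BOOT", "SYS_NICE", "SYS_RESOURCE",
    "SYS_TIME", "SYS_TTY_CONFIG", "MKNOD", "LEASE", "AUDIT_WRITE",
    "AUDIT_CONTROL", "SETFCAP", "MAC_OVERRIDE", "MAC_ADMIN", "SYSLOG",
    "WAKE_ALARM", "BLOCK_SUSPEND", "AUDIT_READ", "PERFMON", "BPF",
    "CHECKPOINT_RESTORE",
]


def decode_cap_mask(hex_mask: str) -> list[str]:
    """Return the capability names whose bits are set in a hex mask."""
    mask = int(hex_mask, 16)
    return [name for bit, name in enumerate(CAP_NAMES) if (mask >> bit) & 1]


# Assumed sample value: CapEff as commonly seen with default runtime settings.
sample = "00000000a80425fb"
print(decode_cap_mask(sample))
```

For that sample mask the decoded list contains NET_BIND_SERVICE, NET_RAW, CHOWN, and DAC_OVERRIDE, among others, and does not contain SYS_ADMIN or SYS_MODULE.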
So, your first move, always, is to drop ALL capabilities and start from a blank slate. You do this in the securityContext of your container.
apiVersion: v1
kind: Pod
metadata:
  name: example-pod
spec:
  containers:
  - name: example-container
    image: nginx
    securityContext:
      capabilities:
        drop:
        - ALL
    # ... other container specs
This drop: [ALL] is non-negotiable. It’s the equivalent of saying, “No, I do not trust you, random container image. You get nothing until you prove you need it.” This alone takes away the privileges that many container breakout techniques depend on.
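A rule like this is easy to state and easy to forget, so it helps to check it mechanically. Here is a minimal sketch, assuming the manifest has already been parsed into a plain Python dict (for example by a YAML library). The function name and the sample Pod are hypothetical, and a real policy check would normally live in an admission controller or linter rather than a script like this.

```python
def containers_missing_drop_all(pod: dict) -> list[str]:
    """Return names of containers whose securityContext does not drop ALL."""
    spec = pod.get("spec") or {}
    containers = (spec.get("containers") or []) + (spec.get("initContainers") or [])
    missing = []
    for container in containers:
        security_context = container.get("securityContext") or {}
        capabilities = security_context.get("capabilities") or {}
        if "ALL" not in (capabilities.get("drop") or []):
            missing.append(container.get("name", "<unnamed>"))
    return missing


# Hypothetical manifest: one container drops ALL, the other says nothing.
pod = {
    "spec": {
        "containers": [
            {"name": "web", "securityContext": {"capabilities": {"drop": ["ALL"]}}},
            {"name": "sidecar"},
        ]
    }
}
print(containers_missing_drop_all(pod))
```

An empty result means every container in the Pod starts from a blank slate; anything else is a list of containers still running with the runtime’s defaults.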
Adding Back Only What is Necessary
Now that your container is effectively running with the privilege level of a stunned hamster, you can grant it the specific capabilities it needs to function. This is where you need to actually understand what your process does.
Let’s say you have a container that needs to bind to port 443 (a privileged port). It needs NET_BIND_SERVICE. Here’s how you’d add just that one capability back.
apiVersion: v1
kind: Pod
metadata:
  name: nginx-pod
spec:
  containers:
  - name: nginx
    image: nginx
    securityContext:
      capabilities:
        drop:
        - ALL
        add:
        - NET_BIND_SERVICE
    ports:
    - containerPort: 443
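See what we did there? We dropped the entire set and then added back exactly one capability. This is the gold standard. The process can now bind to ports below 1024, but it can’t change the system time, open raw sockets, or mess with the kernel. Perfect. One caveat: an image whose entrypoint starts as root and then switches to another user may need a few more capabilities (typically CHOWN, SETGID, and SETUID), so test your image against this spec before relying on it.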
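If it helps to see the arithmetic, the drop-then-add logic can be modelled as plain set operations. This is a simplified mental model, not the runtime’s actual code: the default set below is an assumption based on what Docker and containerd commonly grant, the function name is made up for illustration, and how a runtime resolves a capability listed in both add and drop may differ from what is shown here.

```python
# Assumed default set, as commonly granted by Docker and containerd when a
# container specifies nothing; check your own runtime's documentation.
RUNTIME_DEFAULTS = frozenset({
    "AUDIT_WRITE", "CHOWN", "DAC_OVERRIDE", "FOWNER", "FSETID", "KILL",
    "MKNOD", "NET_BIND_SERVICE", "NET_RAW", "SETFCAP", "SETGID", "SETPCAP",
    "SETUID", "SYS_CHROOT",
})


def resolve_capabilities(drop=(), add=(), defaults=RUNTIME_DEFAULTS) -> set[str]:
    """Simplified model of the capability set a container ends up with."""
    # Dropping ALL clears the slate; otherwise start from the defaults.
    caps = set() if "ALL" in drop else set(defaults)
    # Add back the named capabilities (adding ALL is not modelled here).
    caps |= {c for c in add if c != "ALL"}
    # Finally remove any individually named drops.
    caps -= {c for c in drop if c != "ALL"}
    return caps


print(sorted(resolve_capabilities()))
print(sorted(resolve_capabilities(drop=["ALL"], add=["NET_BIND_SERVICE"])))
```

The first call shows what you get by saying nothing; the second shows the single capability left after the spec above.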
The Most Dangerous Capabilities (The “Big Four”)
Not all capabilities are created equal. Some are relatively harmless, like NET_BIND_SERVICE. Others are essentially a backstage pass to the kernel. You should treat these with extreme prejudice. If you see a container that needs one of these, you should be deeply suspicious and probably look for a way to architect around it.
CAP_SYS_ADMIN: This is the big one. It’s a grab-bag of highly privileged operations. Granting this is almost as bad as giving full root. Just don’t.
CAP_CHOWN: Allows changing the ownership of any file. It can be abused to take over files the process should never control.
CAP_DAC_OVERRIDE: Bypasses file read, write, and execute permission checks. Basically, “I can read or write any file I want.” Huge red flag.
CAP_SYS_MODULE: Allows loading and unloading kernel modules. If a container needs this, something is architecturally very wrong.
You’ll almost never have a legitimate reason to add these to a standard application container. If you’re considering it, you’re probably wrong. I’m not saying never, but I am saying you’d better have a brilliantly good reason and a team of senior engineers reviewing it.
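Since a review is only as good as the reviewer’s attention, it is worth flagging these four automatically. The sketch below assumes a Pod manifest already parsed into a Python dict; the function names and the sample Pod are hypothetical. It also treats add: [ALL] as a finding, which is my own addition on the grounds that ALL includes every one of the four.

```python
# The four capabilities this section singles out, without the CAP_ prefix.
HIGH_RISK = {"SYS_ADMIN", "CHOWN", "DAC_OVERRIDE", "SYS_MODULE"}


def normalize(cap: str) -> str:
    """Upper-case a capability name and strip an optional CAP_ prefix."""
    cap = cap.upper()
    return cap[4:] if cap.startswith("CAP_") else cap


def risky_additions(pod: dict) -> dict[str, list[str]]:
    """Map container name to the high-risk capabilities it adds."""
    findings = {}
    for container in (pod.get("spec") or {}).get("containers") or []:
        security_context = container.get("securityContext") or {}
        capabilities = security_context.get("capabilities") or {}
        added = {normalize(c) for c in capabilities.get("add") or []}
        flagged = sorted(added & (HIGH_RISK | {"ALL"}))
        if flagged:
            findings[container.get("name", "<unnamed>")] = flagged
    return findings


# Hypothetical manifest: one acceptable container, one that needs a hard look.
pod = {
    "spec": {
        "containers": [
            {"name": "web", "securityContext": {
                "capabilities": {"drop": ["ALL"], "add": ["NET_BIND_SERVICE"]}}},
            {"name": "agent", "securityContext": {
                "capabilities": {"add": ["CAP_SYS_ADMIN", "NET_BIND_SERVICE"]}}},
        ]
    }
}
print(risky_additions(pod))
```

Anything this reports is a prompt for the conversation described above, not an automatic rejection.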
Best Practices and The Auditing Shortcut
The best practice is simple: drop: [ALL] and then add: [] (an empty list). If you can get your container to run with zero added capabilities, you’ve achieved nirvana. Test for this. Try it. You’d be surprised how many things work perfectly well with no privileges at all.
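If you’re staring at a legacy container image and have no idea what it actually needs, here’s a dirty but effective trick: run it in a monitored development environment with all capabilities, use a tool like capsh or getpcaps to inspect the capability sets of its process, and then build your least-privilege policy from that. Be aware that these tools report what the process holds, not what it exercises. To find out what it really uses, remove capabilities one at a time while exercising the application, or watch the kernel’s capability checks with a tracer such as the capable tool from bcc.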
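# Read the capability bitmasks of the container's main process (PID 1 inside the container)
kubectl exec <pod-name> -- grep Cap /proc/1/status
# Decode a bitmask such as CapEff on any machine with the libcap tools installed
capsh --decode=<hex-mask>
# Alternatively, on the node, pass the host PID (not the PID seen inside the container) to getpcaps
getpcaps <PID>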
It’s not perfect, but it’s a start. The real answer is to know your application. But in the real world, we often inherit applications we don’t fully understand, and this auditing method is a pragmatic first step towards locking them down. Remember, the goal isn’t to get it perfect on the first try; it’s to be significantly better than the default, which is tragically easy.
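Once the audit has produced a list of capability names, turning it into a policy is mechanical. Here is a minimal sketch, assuming the names were collected by hand from tools like those above; the function name and the sample list are hypothetical. It strips the cap_ prefix that libcap tools print, since Pod specs expect names without it, and leaves out the add field entirely when nothing needs to be added back.

```python
def build_security_context(observed: list[str]) -> dict:
    """Build a securityContext that drops ALL and adds back only observed capabilities."""
    add = sorted({name.upper().removeprefix("CAP_") for name in observed})
    capabilities = {"drop": ["ALL"]}
    if add:
        capabilities["add"] = add
    return {"capabilities": capabilities}


# Hypothetical audit result: names collected while exercising the application.
observed = ["cap_net_bind_service", "CAP_SETUID", "cap_setgid"]
print(build_security_context(observed))
```

Treat the output as a first draft: apply it in a test environment, run the application through its paces, and trim anything that turns out not to be needed.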