43.7 Minimizing Attack Surface: Removing Unnecessary Capabilities

Alright, let’s talk about tightening the screws. You’ve got a pod running. Great. But right now, by default, it’s probably a digital hoarder’s dream, packed with capabilities it has no business having. Our goal is to turn it into a minimalist’s paradise. We’re going to strip it down to only what it absolutely needs to function. This isn’t about being mean to your application; it’s about being ruthless on its behalf. If an attacker breaks in, we want them to find a beautifully empty room with no tools to escalate their privileges or move laterally.

The principle is simple: if your app doesn’t explicitly need a capability, revoke it. One of the most common ways an attacker goes from “I have code execution in your container” to “I own your entire cluster” is by leveraging these unnecessary privileges. We’re going to slam that door shut.

The Low-Hanging Fruit: Dropping All Capabilities

On a standard Linux system, root (UID 0) is god. In a container, that’s… overkill. The Linux kernel breaks down root’s power into distinct units called capabilities, like CAP_NET_RAW (for raw socket access) or CAP_SYS_ADMIN (which is basically a skeleton key). Your puny web server doesn’t need to be god. It probably just needs to bind to a port below 1024 (CAP_NET_BIND_SERVICE).

The single most impactful thing you can do is start from a baseline of nothing and only add back what you need. Kubernetes and Docker, in their infinite wisdom, don’t do this by default. They grant a container a default set of capabilities (Docker’s includes CHOWN, DAC_OVERRIDE, SETUID, SETGID, and NET_RAW, among others) that, while well short of full root, is still far more than most applications need. Let’s fix that.

apiVersion: v1
kind: Pod
metadata:
  name: secure-pod
spec:
  containers:
  - name: my-app
    image: nginx:alpine
    securityContext:
      capabilities:
        drop:
          - ALL
        add: [] # Explicitly add nothing. This is the key.

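This drop: [ALL] is your new best friend. It revokes every single capability. Try deploying this. The stock Nginx image will most likely crash on startup. Why? Because its master process starts as root and expects root’s usual powers: it chowns its cache directories (CAP_CHOWN), switches its workers to the unprivileged nginx user (CAP_SETUID and CAP_SETGID), and binds to port 80, a privileged port below 1024 (CAP_NET_BIND_SERVICE, unless your runtime has lowered the unprivileged port threshold). This is a good crash. The errors in the container logs tell us what we need to add back.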

The Surgical Approach: Adding Only What’s Necessary

Now, instead of giving up and restoring the defaults, we practice some security surgery. We figure out the few capabilities the app genuinely requires and add only those back.

apiVersion: v1
kind: Pod
metadata:
  name: slightly-less-secure-pod
spec:
  containers:
  - name: my-app
    image: nginx:alpine
    securityContext:
      capabilities:
        drop:
          - ALL
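        add:
          - CHOWN            # chown its cache directories at startup
          - SETGID           # switch worker processes to the nginx group
          - SETUID           # switch worker processes to the nginx user
          - NET_BIND_SERVICE # bind to port 80

This pod will run. The stock image, which starts as root, gets what it typically needs to set up its cache directories, hand its workers over to the unprivileged nginx user, and bind to port 80. Confirm the list against your own image’s startup logs, since it varies between images and versions. But it can’t do much else. An attacker who compromises this container won’t be able to open raw sockets, create device nodes, or override file permissions, all of which the default set allows and all of which are useful steps in an escalation. You’ve just neutered a huge class of exploits with one YAML stanza.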
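If the application can listen on an unprivileged port, the add list can stay empty altogether. The following is a minimal sketch rather than a drop-in manifest. It assumes the nginxinc/nginx-unprivileged image, which is built to run as a non-root user and listen on 8080 instead of 80; the pod name is made up for illustration.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: no-capabilities-pod # hypothetical name
spec:
  containers:
  - name: my-app
    # Assumption: an image built to run as non-root on a high port.
    image: nginxinc/nginx-unprivileged:alpine
    ports:
    - containerPort: 8080 # above 1023, so NET_BIND_SERVICE is not needed
    securityContext:
      runAsNonRoot: true # refuse to start if the image would run as UID 0
      capabilities:
        drop:
          - ALL
```

The trade-off is that whatever sits in front of the pod (a Service or Ingress) has to map the public port 80 to 8080.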

The Usual Suspects: Capabilities to Be Wary Of

While you should always start from drop: [ALL], here are a few capabilities that are particularly dangerous and should set off alarm bells if you think you “need” them:

CAP_SYS_ADMIN: This is the big one. It’s a grab-bag of administrative privileges, including mounting filesystems. Needing this usually means you’re doing something that is deeply antagonistic to the concept of containerization. If you add this, you might as well run the container privileged. You’ve already lost.
CAP_NET_RAW: This allows raw sockets, which is the backbone of tools like ping and traceroute. It also lets a process craft arbitrary packets, which is what spoofing attacks (ARP and DNS spoofing, for example) and many network-scanning techniques rely on. If your app isn’t a network diagnostics tool, it doesn’t need this.
CAP_DAC_OVERRIDE: This allows ignoring file permissions. It’s a brute-force capability that bypasses read/write/execute checks. It’s a massive red flag for privilege escalation.
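
When a legacy workload can’t move to drop: [ALL] in one step, a stopgap is to drop at least the dangerous capabilities that runtimes commonly hand out by default. This is a sketch under assumptions, not a recommendation to stop here: the pod and image names are placeholders. CAP_SYS_ADMIN is left out because it is not part of the default set in Docker or containerd, so it only appears if someone adds it explicitly.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: legacy-app-pod # hypothetical name
spec:
  containers:
  - name: legacy-app
    image: registry.example.com/legacy-app:1.0 # placeholder image
    securityContext:
      capabilities:
        drop:
          - NET_RAW      # no raw sockets, so no packet spoofing
          - DAC_OVERRIDE # file permission checks apply again, even to root
        # No add list: nothing beyond the runtime's remaining defaults.
```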

Beyond Capabilities: The readOnlyRootFilesystem

Capabilities are a huge win, but let’s not stop there. Another brilliant, simple, and brutally effective hardening technique is making the root filesystem read-only. Think about it: your container’s base image is immutable. Your application shouldn’t be writing to /usr/bin or /lib. It should only write to specific volumes, like /tmp or /var/log. Making the root filesystem read-only prevents an attacker from tampering with your application binaries, installing tools, or writing malicious scripts to common locations.

apiVersion: v1
kind: Pod
metadata:
  name: locked-down-pod
spec:
  containers:
  - name: my-app
    image: nginx:alpine
    securityContext:
      capabilities:
        drop:
          - ALL
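        add:
          - CHOWN
          - SETGID
          - SETUID
          - NET_BIND_SERVICE
      readOnlyRootFilesystem: true # This is the magic line
    volumeMounts:
    - name: tmp-volume
      mountPath: /tmp
    - name: cache-volume
      mountPath: /var/cache/nginx
    - name: run-volume
      mountPath: /var/run # Nginx writes its PID file here
  volumes:
  - name: tmp-volume
    emptyDir: {}
  - name: cache-volume
    emptyDir: {}
  - name: run-volume
    emptyDir: {}

See what we did? We made / read-only, but the stock Nginx image still needs to write its temp files under /var/cache/nginx and its PID file under /var/run, and plenty of software expects a writable /tmp, so we provided writable emptyDir volumes for those specific paths. (The capability list also carries CHOWN, SETGID, and SETUID, which the stock image uses at startup to hand its workers over to the unprivileged nginx user.) This is the essence of hardening: understanding your app’s actual needs and meeting them precisely, without giving it the keys to the kingdom. An attacker who gets in now can’t replace your binaries or install curl into a system path to download more malware; whatever they write has to fit in those few scratch volumes. It’s beautifully frustrating, for them.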
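
The writable scratch volumes can be tightened further by capping their size. This fragment is a sketch of the volumes section only; it reuses the tmp-volume and cache-volume names, and the size figures are assumptions to be replaced with your app’s real scratch usage.

```yaml
  volumes:
  - name: tmp-volume
    emptyDir:
      medium: Memory   # tmpfs-backed; contents count against the container's memory limit
      sizeLimit: 64Mi  # assumed figure
  - name: cache-volume
    emptyDir:
      sizeLimit: 256Mi # assumed figure
```

With limits in place, the kubelet can evict a pod that writes past them, so an attacker can’t use the scratch space to fill the node’s disk.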