8.3 VolumeClaimTemplates: Per-Pod Persistent Volumes
Right, so you’ve got your StatefulSet humming along, giving you those lovely stable network identities and ordered pod management. But let’s be honest, the real reason you’re here, the thing that makes StatefulSets truly sing, is volumeClaimTemplates. This is where we move from ephemeral, flaky pods to having state that actually sticks around. Without this, you might as well just use a Deployment and call it a day.
Think of each entry in volumeClaimTemplates as a cookie cutter. You define it once in your StatefulSet spec, and then for every Pod the StatefulSet creates (web-0, web-1, web-2, etc.), the controller uses that cookie cutter to stamp out a brand new PersistentVolumeClaim (PVC) specifically for that pod. This is the magic that gives each pod in your stateful application its own unique, persistent storage. No more musical chairs where a newly scheduled pod hopes it lands on the right node with the right data.
The Anatomy of a VolumeClaimTemplate
You don’t just wave a magic wand; you have to define the cookie cutter. This happens in the .spec.volumeClaimTemplates array of your StatefulSet manifest. It looks suspiciously like a standalone PersistentVolumeClaim, and that’s because it is, essentially. The key fields you’ll be wrestling with are accessModes and resources.requests.storage.
Here’s a template for a database pod that expects its own dedicated block of storage:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
spec:
  serviceName: "postgres"
  replicas: 3
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
      - name: postgres
        image: postgres:13
        volumeMounts:
        - name: data
          mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: "fast-ssd" # You defined this with your admin, right?
      resources:
        requests:
          storage: 10Gi
When you apply this, Kubernetes doesn’t just create the StatefulSet and its Pods. It also creates PVCs named data-postgres-0, data-postgres-1, and data-postgres-2, following the pattern <template name>-<StatefulSet name>-<ordinal>. Each PVC then binds to a PersistentVolume (PV), either pre-provisioned or dynamically provisioned through the StorageClass. The beauty is that, by default, a PVC’s lifecycle is tied to neither the Pod nor the StatefulSet. Delete a Pod? Its PVC stays. Delete the StatefulSet? The PVCs (and your precious data) stick around. This is a feature, not a bug: it prevents catastrophic accidental data loss.
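To make the stamping concrete, here is a rough sketch of what the first generated claim would look like if you ran kubectl get pvc data-postgres-0 -o yaml against the manifest above. It is abridged and reconstructed by hand from the template, not captured from a real cluster; the controller also adds metadata (labels, UIDs, status) that is left out here.

```yaml
# Sketch of the PVC generated for Pod postgres-0 (abridged, hand-written).
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  # <template name>-<StatefulSet name>-<ordinal>
  name: data-postgres-0
spec:
  # Everything below is copied from the volumeClaimTemplates entry.
  accessModes:
  - ReadWriteOnce
  storageClassName: fast-ssd
  resources:
    requests:
      storage: 10Gi
```

The only thing that differs between the three claims is the ordinal at the end of the name, which is exactly what lets the controller match each Pod back to its own storage.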
The Sticky Truth About Pod Identity and Storage
This is the core concept you must internalize: in a StatefulSet named web with a claim template named data, Pod web-2 will always use PVC data-web-2. Always. If Pod web-2 dies and gets rescheduled to a different node, the StatefulSet controller recreates it with the same name and the same claim, so the exact same PVC (data-web-2) is reattached. This provides that stable, predictable identity your stateful app craves. It’s why you can run distributed systems like Cassandra or Kafka on Kubernetes without losing your mind.
The flip side is that scaling down and back up does not give you fresh storage. Scale a StatefulSet from 3 replicas to 1 and Pods web-2 and web-1 are terminated (highest ordinal first), but PVCs data-web-1 and data-web-2 are left behind, unattached but intact. Scaling back up to 3 recreates web-1 and web-2 and reattaches their original PVCs, data and all. That’s great if the data was healthy. If you scaled down because web-2’s data was corrupted, scaling back up will just reattach the corrupted storage! The repair procedure is to manually delete the faulty Pod’s PVC (data-web-2) before scaling back up, so the controller creates a new claim and a fresh, empty volume is provisioned for it. It’s a bit manual, but it’s the price of predictability.
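The keep-the-PVC behavior described above is the default, and newer Kubernetes releases let you spell it out (or change it) with the persistentVolumeClaimRetentionPolicy field on the StatefulSet spec. The fragment below is a minimal sketch that simply restates the defaults; on older clusters the field sits behind the StatefulSetAutoDeletePVC feature gate or is missing entirely, so treat its availability as an assumption and check your version first.

```yaml
# Fragment of a StatefulSet spec; the remaining fields are omitted.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  persistentVolumeClaimRetentionPolicy:
    # What happens to the PVCs when the StatefulSet itself is deleted.
    whenDeleted: Retain
    # What happens to the PVCs of Pods removed by a scale-down.
    whenScaled: Retain
```

Setting either value to Delete hands cleanup to the controller, which is convenient for scratch environments and a sharp knife for production data.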
Common Pitfalls and How to Avoid Them
The StorageClass Black Hole: The single biggest mistake is referencing a storageClassName that doesn’t exist, or omitting it on a cluster that has no default StorageClass. Either way, your PVCs will sit forever in Pending hell. Always, always check that kubectl get storageclass shows the one you’re referencing, or, if you’re leaving storageClassName out, that one class is marked as default. This is the number one support question, and I will judge you for it.
ReadWriteOnce May Not Mean What You Think: ReadWriteOnce means the volume can be mounted as read-write by a single node. It does not mean “by a single pod”. If multiple pods that reference the same PVC land on the same node, they can all mount that RWO volume. This is usually a very bad idea for databases. Each pod stamped out by volumeClaimTemplates gets its own PVC, so the risk is mainly with claims you share by hand; plan your node affinities accordingly. If you need a hard single-pod guarantee, newer Kubernetes versions offer the ReadWriteOncePod access mode for CSI volumes.
The Bill Comes Due: Remember, every PVC this template creates costs money. A 100-replica StatefulSet with a 10Gi template is 1,000 GiB of provisioned storage, just shy of 1 TiB. If you’re on the cloud, that’s a line item on your monthly bill. Don’t get cute with huge replica counts unless you mean it.
Deletion Woes: As mentioned, deleting a StatefulSet leaves its PVCs behind. To fully nuke everything, scale the StatefulSet to 0 first so the pods terminate in order, delete the StatefulSet, and then delete the PVCs manually, or write a clever script. Don’t reach for --cascade=orphan here: that flag deletes the StatefulSet object but leaves its pods running. Whether the underlying disk disappears along with the PVC depends on the PV’s reclaim policy. This is a bit of a rough edge, but it’s there to protect you from yourself. Embrace the manual control for production data.
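To ground the StorageClass pitfall, here is a minimal sketch of what a fast-ssd class might look like. The provisioner name is a made-up placeholder (real values depend on your cloud or CSI driver), and the other settings are illustrative choices rather than recommendations from this chapter.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  # Must match storageClassName in the volumeClaimTemplates entry.
  name: fast-ssd
  # Uncomment to make this the class used when storageClassName is omitted.
  # annotations:
  #   storageclass.kubernetes.io/is-default-class: "true"
# Hypothetical placeholder: substitute the provisioner your cluster actually runs.
provisioner: example.com/hypothetical-ssd-driver
# Delete removes the backing disk when its PVC is deleted; Retain keeps it.
reclaimPolicy: Delete
# Delay binding until a Pod is scheduled, so the disk is created near that Pod.
volumeBindingMode: WaitForFirstConsumer
```

If a class by that name is not in the output of kubectl get storageclass, the claims generated from the template have nothing to provision against, and that is where the Pending state comes from.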