44.3 etcd Performance: SSD Requirements and Compaction

Right, let’s talk about the brain of your Kubernetes cluster: etcd. If the API server is the charismatic frontman of the band, etcd is the meticulous, hyper-organized manager in the back without whom the whole tour collapses into chaos. It’s a distributed key-value store, and its sole job is to remember the state of absolutely everything in your cluster. And because we’re asking it to do this consistently and quickly, it gets… particular. Performance-wise, if your etcd is unhappy, your entire cluster is unhappy. Pods won’t schedule, deployments will hang, and you’ll be left staring at a kubectl get pods that hasn’t updated in minutes.

The single most important thing you can do for etcd performance, bar none, is to give it the right hardware. This isn’t a suggestion; it’s a requirement.

The Non-Negotiable SSD Rule

Let’s be direct: you must run etcd on an SSD. Not a “high-performance” cloud disk that might be an SSD, not a RAID array of spinning rust—a proper local SSD or provisioned IOPS SSD block store with low, consistent latency.

Why? Because etcd’s consensus protocol (Raft) is a relentless log writer. Every change, whether a pod creation, a configmap update, or a secret deletion, is appended to etcd’s write-ahead log (WAL), flushed to disk with an fsync, and replicated to a majority of members before the API server gets its acknowledgment. If that fsync is slow, every single Kubernetes write queues up behind it. A spinning disk (HDD) pays milliseconds of seek and rotational latency per write; a good SSD completes the same write in tens to hundreds of microseconds. That difference is the difference between a snappy cluster and one that feels like it’s running in molasses. The etcd docs call out disk write latency explicitly, and they are not joking: their guidance is to keep the 99th percentile of etcd_disk_wal_fsync_duration_seconds under 10 ms. I’ve seen teams try to cut corners here. They always, without fail, end up in a world of pain that costs far more in engineering time than the SSD ever would have.
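
If you want a number rather than a vibe, you can benchmark the disk before etcd ever touches it. The sketch below is not from the etcd project: fsync_within_budget is a hypothetical helper, its 10 ms default mirrors the WAL fsync guidance in the etcd docs, and the fio invocation in the comments assumes fio is installed and that /var/lib/etcd-bench is a scratch directory on the disk you plan to use.

```shell
# Hypothetical helper: succeeds when a measured p99 fsync latency (in ms)
# is below a budget (default 10 ms).
fsync_within_budget() {
  p99_ms="$1"
  budget_ms="${2:-10}"
  # awk does the comparison because POSIX shell arithmetic is integer-only.
  awk -v p="$p99_ms" -v b="$budget_ms" 'BEGIN { exit !((p + 0) < (b + 0)) }'
}

# Example measurement (assumes fio is installed; the directory is a placeholder):
#   fio --rw=write --ioengine=sync --fdatasync=1 \
#       --directory=/var/lib/etcd-bench --size=22m --bs=2300 --name=wal-probe
# Read the 99.00th percentile from fio's fsync/fdatasync section, convert it
# to milliseconds, then:
#   fsync_within_budget 2.3 && echo "disk looks fine for etcd"
```

The small block size and the fdatasync after every write are there to imitate etcd’s WAL pattern of many small, synchronous appends; a big sequential throughput test would tell you very little here.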

Taming the Datastore: Compaction and Defragmentation

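Here’s where things get interesting. etcd doesn’t just write; it writes versions. It keeps a history of key changes, with every write creating a new revision. This is brilliant for things like letting a watch resume from an older revision or reading a key as it was a few revisions ago, but left unchecked, it will bloat the database and slowly strangle performance. Worse, once the database hits its storage quota (2 GB by default), etcd raises a NOSPACE alarm and stops accepting ordinary writes. This is the etcd equivalent of never throwing out old receipts.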

You have two primary tools to manage this growth: compaction and defragmentation.

Compaction is the process of literally throwing away those old receipts. It discards superseded versions of keys from before a specific revision. Once you compact at revision N, you can no longer read the state of the world at any revision before N; revision N itself, and everything after it, stays readable. This is crucial for controlling disk usage. The key is to do it automatically, but it helps to see the manual version first.

# Check the current revision. You'll need this.
etcdctl endpoint status --write-out=json --cluster | grep -o '"revision":[^,]*'

# Compact up to revision 1510000
etcdctl compact 1510000
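
The two commands above can be chained so you never copy a revision by hand. This is a minimal sketch, not an official recipe: extract_revision is a hypothetical helper, and it assumes the JSON printed by etcdctl endpoint status carries a numeric "revision" field in its header, as recent etcd v3 releases do.

```shell
# Hypothetical helper: read `etcdctl endpoint status --write-out=json`
# on stdin and print the first member's current revision.
extract_revision() {
  grep -o '"revision":[0-9]*' | head -n 1 | grep -o '[0-9][0-9]*$'
}

# Intended use (assumes etcdctl is already configured for your cluster):
#   rev=$(etcdctl endpoint status --write-out=json | extract_revision)
#   etcdctl compact "$rev"
```

Compacting at the current revision throws away all history, which is usually fine for a Kubernetes cluster; pick an older revision if something still needs to read the past.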

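Now, doing this manually is a chore. This is why you should make sure compaction happens automatically. In a Kubernetes cluster the API server already asks etcd to compact on a timer (kube-apiserver’s --etcd-compaction-interval flag, 5 minutes by default), but etcd’s own auto-compaction is a sensible safety net, and it is essential for any etcd that isn’t fronted by an API server. The classic way is to run it based on retention time. A good starting point is to keep, say, two hours of history.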

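# These are flags for your etcd server process, not commands you run.
# They tell etcd to auto-compact history older than 2 hours.
# (Duration strings like "2h" need etcd 3.3 or newer; older releases take whole hours.)
--auto-compaction-mode=periodic
--auto-compaction-retention=2h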

But wait! There’s a catch. Compaction only logically removes the data; it doesn’t actually give the space back to the filesystem. The database file remains the same size. This is where defragmentation comes in. Defragmentation is the process of reclaiming this physical disk space. It rewrites the database file into a new, smaller file, freeing up the unused blocks.

Warning: Defragmentation is a resource-intensive operation. While it runs, the member being defragmented blocks reads and writes, and on a large database that can easily take several seconds or more. Never, ever run it on all your etcd members at once unless you enjoy causing cluster outages.

# Defragment a single etcd member; the target is chosen with --endpoints
etcdctl --endpoints=http://etcd-server-1:2379 defrag

# Check the database size before and after to see the magic
du -sh /var/lib/etcd/member/snap/db
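
To make the one-member-at-a-time rule hard to violate, you can wrap it in a loop. The following is a sketch under assumptions rather than a blessed procedure: defrag_rolling is a hypothetical function, the endpoint URLs you pass in are your own, and it relies only on etcdctl’s defrag and endpoint health subcommands.

```shell
# Hypothetical helper: defragment members strictly one at a time,
# stopping at the first failure so a sick member never cascades.
defrag_rolling() {
  for ep in "$@"; do
    echo "defragmenting $ep"
    etcdctl --endpoints="$ep" defrag || return 1
    # Confirm the member answers again before touching the next one.
    etcdctl --endpoints="$ep" endpoint health || return 1
  done
}

# Intended use (endpoint names are placeholders):
#   defrag_rolling http://etcd-server-1:2379 http://etcd-server-2:2379 http://etcd-server-3:2379
```

Bailing out on the first error is deliberate: if one member fails to come back healthy, the last thing you want is to pause a second one and lose quorum.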

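Best practice? Defragment periodically, during off-peak hours, and one cluster member at a time. The honest measure of success is the database size: compare etcd_mvcc_db_total_size_in_bytes (the file on disk) with etcd_mvcc_db_total_size_in_use_in_bytes (what etcd actually needs) before and after, using the metric names from etcd 3.4 and later; the gap between them should all but vanish. Keep an eye on etcd_disk_backend_commit_duration_seconds as well, to confirm commit latency is no worse once the member is back. Some operators even automate this, defragmenting a member if the database size grows beyond a certain threshold post-compaction.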
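
A threshold rule like that can be as simple as comparing the file size with the space actually in use. Here is one way it might look; should_defrag and its 50 percent default are hypothetical choices for illustration, and the two inputs are assumed to come from the dbSize and dbSizeInUse fields of etcdctl endpoint status --write-out=json (present in etcd 3.4 and later, as far as I know) or from the equivalent metrics.

```shell
# Hypothetical policy helper: succeed when at least min_pct percent of the
# database file is free space that a defrag could hand back.
#   $1 = total file size in bytes
#   $2 = bytes actually in use
#   $3 = min_pct threshold (default 50)
should_defrag() {
  total="$1"
  in_use="$2"
  min_pct="${3:-50}"
  # An empty or unknown size never justifies a defrag.
  [ "$total" -gt 0 ] || return 1
  reclaimable_pct=$(( (total - in_use) * 100 / total ))
  [ "$reclaimable_pct" -ge "$min_pct" ]
}

# Intended use (sizes are made-up numbers):
#   should_defrag 4000000000 1200000000 && echo "worth defragmenting"
```

A percentage threshold keeps you from defragmenting a member just to win back a few megabytes, which is rarely worth the pause.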

The Memory Factor

Don’t forget about RAM. etcd keeps an in-memory index of every key and memory-maps its database file, so the operating system’s page cache ends up holding the hot parts of the dataset for fast reads. This is a good thing! A general rule of thumb is to not skimp. If your etcd dataset is 8 GB, giving it only 4 GB of RAM is a terrible idea: reads will keep faulting pages back in from disk, defeating the entire purpose of that beautiful SSD you just bought. Monitor process_resident_memory_bytes (exposed on etcd’s /metrics endpoint) alongside the host’s free memory, and ensure there is plenty of headroom.
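
For a quick look at memory without a full monitoring stack, you can scrape the number straight from etcd. This is only a sketch: rss_bytes is a hypothetical helper, it assumes etcd exposes the standard process_resident_memory_bytes gauge in Prometheus text format, and the URL in the comment is a placeholder for wherever your member serves /metrics.

```shell
# Hypothetical helper: read Prometheus text from etcd's /metrics endpoint
# on stdin and print the process's resident memory in whole bytes.
rss_bytes() {
  # The value is often in scientific notation (e.g. 1.23456e+08),
  # so let awk convert it to a plain integer.
  awk '$1 == "process_resident_memory_bytes" { printf "%.0f\n", $2 }'
}

# Intended use (URL is a placeholder; add TLS flags if your cluster needs them):
#   curl -s http://127.0.0.1:2379/metrics | rss_bytes
```

Compare the result against the machine’s total RAM and the size of the database file; if resident memory plus the database no longer fit comfortably, it is time for a bigger box.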

The takeaway? Treat etcd like the mission-critical, high-performance database it is. Give it the fast disk it demands, automatically clean up after it with compaction, and carefully defragment its home. Do this, and your cluster’s brain will remain sharp, focused, and ready for whatever you throw at it.