40.5 etcd Performance Tuning: Defragmentation and Compaction
Alright, let’s talk about keeping your etcd cluster from grinding to a halt under the weight of its own history. Think of etcd as the meticulous, slightly obsessive librarian of your Kubernetes cluster. It keeps a perfect, immutable ledger of every single change (put, delete) you’ve ever made. This is brilliant for reliability and disaster recovery, but if you never throw out the old newspapers, eventually the library becomes a fire hazard and the librarian starts having a panic attack. That’s where compaction and defragmentation come in—they’re our janitorial service for the key-value store of truth.
The Why: Compaction vs. Defragmentation
First, let’s untangle these two terms because they solve very different, albeit related, problems.
Compaction deals with the logical size of your database: the history. Every time you change a key, etcd appends a new revision. Left alone, this history of revisions grows without bound. Compaction is the process of telling etcd, “Right, we don’t need a record of what the nginx-deployment replica count was three months ago. Please just keep the current value of each key and whatever changed after revision X.” It discards the superseded revisions before that point, freeing up internal resources and preventing the cluster from slowing down when searching through a million ancient revisions. It does not, however, give disk space back to the OS.
Defragmentation deals with the physical size on disk. etcd uses bbolt (a fork of BoltDB) as its backend, which stores data in pages. When you update or delete a key and later compact away the old revisions, the space they occupied becomes free within the etcd database file but fragmented, and the file itself never shrinks on its own. Imagine a bookshelf where you remove books from the middle; you have free space, but it’s in useless, small chunks. When you need to add a new large book (a big value), it can’t fit in the small gaps. Defragmentation is the process of rewriting the entire database file to consolidate this free space, making it contiguous and usable again, and crucially, it can shrink the file and return disk space to the OS. It only reclaims what compaction has already freed, so compact first. It’s a costly, I/O-intensive operation, but it’s essential.
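One way to see the gap between logical and physical size on a live member is to compare two fields of the endpoint status output. The sketch below assumes the JSON from etcdctl endpoint status --write-out=json carries dbSize and dbSizeInUse, as recent etcd 3.x releases report them; json_number and reclaimable_bytes are made-up helper names, not etcdctl features.

```shell
#!/bin/sh
# Sketch only: estimate how many bytes a defrag could hand back.
# Assumes the status JSON contains "dbSize" and "dbSizeInUse".

# Print the first numeric value of the JSON field named $1, reading JSON on stdin.
json_number() {
    grep -o "\"$1\":[0-9]*" | head -n 1 | cut -d: -f2
}

# Read endpoint status JSON on stdin and print dbSize - dbSizeInUse.
# Returns non-zero if either field is missing.
reclaimable_bytes() {
    status=$(cat)
    total=$(printf '%s' "$status" | json_number dbSize)
    in_use=$(printf '%s' "$status" | json_number dbSizeInUse)
    [ -n "$total" ] && [ -n "$in_use" ] || return 1
    echo $((total - in_use))
}

# Example (not run here; needs a reachable member and TLS flags):
#   etcdctl endpoint status --write-out=json | reclaimable_bytes
```

A large result right after a compaction is the bookshelf full of gaps: space etcd can reuse internally, but that the OS will not see again until a defrag.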
How to Run a Compaction
You compact by specifying a revision number. Superseded revisions older than that point are purged from the history; the latest value of every key survives. The key question is: which revision do you choose? A best practice is to set the --auto-compaction-retention flag (optionally with --auto-compaction-mode) when starting etcd to handle this automatically. But since you’re reading a tuning guide, you probably want to understand the manual process.
First, find your current revision. The response header of any request includes it, but asking a member for its endpoint status is a clear way.
# Get the current revision and other metrics
ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key endpoint status --write-out=json
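Look for the revision field inside the "header" object of the output (in the JSON form it sits under each endpoint’s "Status"). Now, to compact, you pick a revision in the past. A common strategy is to compact to a revision that’s N revisions behind the current one. Revisions are a counter, not a timestamp, so “the revision from one hour ago” only exists if you recorded it an hour ago; for most clusters, keeping about an hour of history that way is a safe bet.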
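The “N revisions behind” arithmetic can be sketched as a pair of small shell helpers. The grep pattern mirrors the one used in etcd’s own maintenance guide; extract_revision and compaction_target are illustrative names, and the number of revisions to keep is an arbitrary choice for your cluster, not a recommendation.

```shell
#!/bin/sh
# Sketch only: derive a compaction target that trails the current revision.

# Read `etcdctl endpoint status --write-out=json` output on stdin and print
# the first "revision" value found in it.
extract_revision() {
    grep -o '"revision":[0-9]*' | head -n 1 | cut -d: -f2
}

# Print current ($1) minus the number of revisions to keep ($2).
# Returns non-zero, printing nothing, when there is not enough history yet.
compaction_target() {
    current=$1
    keep=$2
    target=$((current - keep))
    if [ "$target" -lt 1 ]; then
        return 1
    fi
    echo "$target"
}

# Example (not run here; add your endpoint and TLS flags):
#   rev=$(etcdctl endpoint status --write-out=json | extract_revision)
#   target=$(compaction_target "$rev" 10000) && etcdctl compact "$target"
```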
# Let's say the current revision is 1234567. We'll compact to 1200000.
ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key compact 1200000
Crucial Pitfall: Any API consumer that tries to read or watch from a revision older than your compaction point will break. Spectacularly. etcd rejects the request with a “required revision has been compacted” error, and a client that doesn’t handle that by re-reading the current state and watching from there is stuck. This is the number one reason people get nervous about this operation. If you have any such clients (e.g., custom operators with buggy watch logic), you need to fix them first. Kubernetes itself handles this correctly.
How to Run a Defragmentation
Defragmentation is a more serious operation. It rewrites the entire database file, which requires a lot of I/O, and the member you’re defragging blocks reads and writes while the rebuild runs. You must defragment each member in the cluster one at a time. Defragging all members simultaneously will very likely cause a cluster outage, because a stalled majority means no quorum and leader election goes haywire under the load.
Here’s how you defrag a single member. Note the --command-timeout flag. This is important. etcdctl gives up on a command after 5 seconds by default, the defrag operation can take a while on a large database, and you don’t want your client giving up halfway through. (--dial-timeout only covers establishing the connection, so raising it does not help here.)
# Defragment a single member, allowing up to 60s for the operation to finish
ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  --command-timeout=60s defrag
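After defragging one member, move on to the next. How do you know if you even need to defrag? Compare etcd_server_quota_backend_bytes (the total backend quota) with etcd_mvcc_db_total_size_in_bytes (the physical size of the database file). If the latter is approaching the former, you’re running out of room, and a compaction followed by a defragmentation is urgently required: once the quota is exceeded, etcd raises a NOSPACE alarm and rejects further writes until space has been reclaimed and the alarm cleared with etcdctl alarm disarm. The gap between etcd_mvcc_db_total_size_in_bytes and etcd_mvcc_db_total_size_in_use_in_bytes tells you roughly how much a defrag would give back.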
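As a rough illustration of that check, the helper below reads a member’s Prometheus-format metrics and prints how full the backend is as a whole percentage. quota_used_percent is a hypothetical name, and the curl invocation in the comment assumes the metrics are served on the client URL with the same kubeadm certificate paths used above; your cluster may expose them elsewhere.

```shell
#!/bin/sh
# Sketch only: percentage of the backend quota taken up by the database file.

# Read Prometheus text-format metrics on stdin; print floor(size * 100 / quota).
# Returns non-zero if the quota metric is missing or zero.
quota_used_percent() {
    awk '
        $1 == "etcd_mvcc_db_total_size_in_bytes" { size = $2 + 0 }
        $1 == "etcd_server_quota_backend_bytes"  { quota = $2 + 0 }
        END {
            if (quota <= 0) exit 1
            printf "%d\n", (size * 100) / quota
        }
    '
}

# Example (not run here):
#   curl -s --cacert /etc/kubernetes/pki/etcd/ca.crt \
#        --cert /etc/kubernetes/pki/etcd/server.crt \
#        --key /etc/kubernetes/pki/etcd/server.key \
#        https://127.0.0.1:2379/metrics | quota_used_percent
```

The `+ 0` forces a numeric conversion, which matters because the exporter often prints these gauges in scientific notation (for example 2.147483648e+09).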
Putting It All Together: Automation and Best Practices
Doing this manually is for chumps and people who enjoy 3 AM pages. The goal is automation.
- Enable Auto-Compaction: This is non-negotiable. Start your etcd servers with --auto-compaction-retention=1h (or a value suitable for your needs). In the default periodic mode, this tells etcd to compact automatically and keep roughly the last hour of revision history, keeping things tidy.
- Monitor DB Size: Use the metrics mentioned above (etcd_mvcc_db_total_size_in_bytes against etcd_server_quota_backend_bytes) to set an alert that fires when the database reaches, say, 80% of the backend quota. This is your trigger for a defragmentation cycle.
- Defrag During Maintenance Windows: Since defrag is disruptive, script it to run during known quiet periods. Write a script that iterates through each etcd member endpoint (found in the etcd Pod manifest for a kubeadm setup) and runs the defrag command against it one by one, with generous timeouts.
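The one-by-one defrag loop from the last bullet might look something like this. It is a minimal sketch, not a hardened tool: rolling_defrag and the ETCDCTL variable are invented for the example, the 120s timeout is an arbitrary “generous” value, and TLS settings are assumed to be supplied through etcdctl’s ETCDCTL_CACERT, ETCDCTL_CERT and ETCDCTL_KEY environment variables.

```shell
#!/bin/sh
# Sketch only: defragment etcd members strictly one at a time.
# ETCDCTL lets you point at a specific etcdctl binary or wrapper.
ETCDCTL=${ETCDCTL:-etcdctl}

# Defragment every endpoint passed as an argument, in order.
# Stops at the first defrag or health-check failure so that a struggling
# cluster is never pushed further.
rolling_defrag() {
    for ep in "$@"; do
        echo "defragmenting $ep"
        if ! $ETCDCTL --endpoints="$ep" --command-timeout=120s defrag; then
            echo "defrag failed on $ep, stopping" >&2
            return 1
        fi
        # Confirm the member is serving again before touching the next one.
        if ! $ETCDCTL --endpoints="$ep" endpoint health; then
            echo "$ep is unhealthy after defrag, stopping" >&2
            return 1
        fi
    done
}

# Example (not run here):
#   rolling_defrag https://10.0.0.1:2379 https://10.0.0.2:2379 https://10.0.0.3:2379
```

Stopping on the first failure is deliberate: a half-finished cycle is easier to recover from than a cluster with two stalled members. A common refinement is to order the endpoints so that the current leader goes last.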
The designers made a questionable choice here by not building a fully automated, cluster-aware defragmentation process. It’s arguably because the I/O cost is so high they didn’t want it happening by surprise. But it does mean the onus is on you, the operator, to build the automation around it. Consider it a rite of passage. Now go clean your library.