40.3 etcd Cluster Sizing and Quorum Requirements
Right, let’s talk about what size etcd cluster you actually need. This isn’t a question of “bigger is better.” It’s a question of physics, failure domains, and the cold, hard math of consensus. Get it wrong, and your entire Kubernetes control plane grinds to a halt. No pressure.
The first and only rule you need to burn into your brain is: An etcd cluster must maintain quorum to function. This isn’t a suggestion; it’s the law of the land. Quorum is a strict majority of members. For a cluster of N members, quorum is (N/2) + 1, where the division rounds down (floor division). Let’s do the math because your entire production environment depends on it:
- 1 node: Quorum is (1/2) + 1 = 1. This is a terrible idea. You have a single point of failure. It’s not a cluster; it’s a liability. Don’t do this in production. I see you thinking about it. Stop.
- 2 nodes: Quorum is (2/2) + 1 = 2. This is a worse idea. If either node fails, you lose quorum (1 node is not a majority of 2), so you’ve doubled the number of machines that can take you down and gained zero fault tolerance. The survivor can’t commit writes or serve consistent reads. The majority rule is exactly what prevents split-brain, and the price is that a 2-node cluster simply stops. This is so bad it’s almost impressive.
- 3 nodes: Quorum is (3/2) + 1 = 2 (floor division: 3/2 rounds down to 1). This is the gold standard for most clusters. It can tolerate the failure of 1 member. You lose two, and you’re down.
- 5 nodes: Quorum is (5/2) + 1 = 3. It can tolerate the failure of 2 members. This is for larger, higher-availability setups.
- 7 nodes: Quorum is (7/2) + 1 = 4. Tolerates 3 failures. You’re either running a massive global cluster or you’re over-engineering. There are diminishing returns here, as more nodes mean more coordination overhead on writes.
The pattern is clear: always use an odd number of nodes. An even-numbered cluster (e.g., 4) provides no extra fault tolerance over the next lower odd number (e.g., 3) but is more expensive and complex. For a 4-node cluster, quorum is 3. You can only tolerate 1 failure, same as a 3-node cluster. So why would you pay for a fourth node that gives you nothing? You wouldn’t. I trust you’ve learned your lesson from the 2-node disaster scenario.
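If you’d rather not trust anyone’s mental arithmetic (including mine), here is a minimal shell sketch of the quorum math above. The function names `quorum` and `tolerance` are made up for this example; they are not part of etcd or etcdctl.

```shell
# quorum N: smallest majority of N members.
# Shell integer arithmetic truncates, which gives us floor division for free.
quorum() {
  echo $(( $1 / 2 + 1 ))
}

# tolerance N: how many members can fail while a majority still survives.
tolerance() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

# Print one row per cluster size so the odd/even pattern is visible.
printf '%-8s %-7s %s\n' members quorum tolerated_failures
for n in 1 2 3 4 5 6 7; do
  printf '%-8s %-7s %s\n' "$n" "$(quorum "$n")" "$(tolerance "$n")"
done
```

Run it and compare the rows for 3 and 4, or 5 and 6: the tolerated-failures column doesn’t move when you add the even-numbered member.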
Sizing for Performance, Not Just Failure
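Quorum is about survival, but you also need to think about performance. A 3-node cluster is highly available, but is it fast enough for your 500-node Kubernetes cluster pounding it with requests? Adding members won’t help here: every write still has to be replicated to a majority, so performance comes from the hardware under each member, not from the member count.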
You need to look at the key metrics: request throughput and backend commit latency. etcd is brutally sensitive to disk I/O latency. If your writes are slow, everything slows down. etcdctl is your best friend here for checking the cluster’s vitals.
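# Check overall cluster health
etcdctl endpoint health --cluster
# Check alarm status (you want a clean output here)
etcdctl alarm list
# Run a short built-in load test and report whether throughput and latency pass.
# This generates real write load, so think twice before pointing it at a busy production cluster.
etcdctl check perf
# Scrape the Prometheus-style metrics endpoint for the two disk latency histograms
# (switch to https and add client certificate flags if your client URL uses TLS)
curl -s http://localhost:2379/metrics | grep -E "etcd_disk_(wal_fsync|backend_commit)_duration_seconds"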
If your wal_fsync duration is consistently high (the usual guidance is a 99th percentile under 10ms, and under 25ms for backend commits), your cluster is gasping for air. This is almost always an I/O problem. You need faster disks. SSDs are non-negotiable for any serious production workload, and NVMe is better still. Spinning disks are a joke etcd doesn’t get. Also, ensure you’ve given it enough memory. etcd persists everything to disk (a write-ahead log plus a database file), but it keeps its key index in memory and leans on the OS page cache for the database, so memory use grows with the size of your keyspace and the number of watchers.
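To turn that wall of histogram lines into a single number, here is a rough sketch that reads the metrics text on stdin and prints the smallest bucket bound covering at least 99% of WAL fsyncs. The function name `fsync_p99_bucket` is hypothetical, and the result is coarse: it can only land on a bucket boundary, and the counters are cumulative since the process started, so it won’t show a recent regression the way a rate-based Prometheus query would.

```shell
# Reads Prometheus text-format metrics on stdin and prints the upper bound
# (in seconds) of the first histogram bucket that contains at least 99% of
# all WAL fsync observations. Prints "no-data" if the histogram is missing.
fsync_p99_bucket() {
  awk '
    /^etcd_disk_wal_fsync_duration_seconds_bucket/ {
      # Pull the value of the le="..." label out of the line.
      le = $0
      sub(/.*le="/, "", le)
      sub(/".*/, "", le)
      n++
      bound[n] = le
      count[n] = $NF            # cumulative count for this bucket
      if (le == "+Inf") total = $NF
    }
    END {
      if (total == 0) { print "no-data"; exit }
      # Buckets arrive in ascending order, so the first match is the smallest.
      for (i = 1; i <= n; i++) {
        if (count[i] >= 0.99 * total) { print bound[i]; exit }
      }
    }
  '
}
```

Assuming the same plain-HTTP endpoint as above, usage would look like `curl -s http://localhost:2379/metrics | fsync_p99_bucket`. If the bound it prints is well above 0.01, go look at your disks.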
The Nightmare Scenario: Loss of Quorum
So what actually happens when you lose quorum? Let’s say you have a 3-node cluster and nodes A and B get hit by a meteor (or a misconfigured deployment). Node C is now alone. It cannot form a majority, so it cannot elect or remain a leader. It becomes a zombie. It has all the data, but it can’t commit any new writes or serve consistent (linearizable) reads; at best it can answer serializable reads, which may be stale.
Your Kubernetes API servers, which are clients of etcd, can’t write anymore. This means no new Pods, no new Deployments, no scaling events. The cluster is functionally dead for any changes. Your existing workloads might keep running, but you’re now on a ticking clock.
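If the failed members come back with their data directories intact, quorum returns on its own and you can breathe again. If they’re gone for good, restarting isn’t an option. You are now in the terrifying world of disaster recovery, which likely involves taking the remaining node down, restoring the same snapshot into a fresh data directory on all three members, and bringing them back up as a new cluster. Unrehearsed, that’s a multi-hour outage. Practice this in a lab before you need to do it at 3 AM. I’m not joking.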
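For a feel of what the restore step looks like, here is a sketch that only prints the commands so you can review them before running anything. It assumes etcd 3.5 or later, where restores are done with `etcdutl` (older releases used `etcdctl snapshot restore`). The member names, IP addresses, snapshot file, cluster token, and data directory are all hypothetical placeholders, and `restore_cmd` is a helper invented for this example.

```shell
# Hypothetical values for illustration only; substitute your own.
SNAPSHOT=snapshot.db
DATA_DIR=/var/lib/etcd-restored
INITIAL_CLUSTER="etcd-a=https://10.0.0.1:2380,etcd-b=https://10.0.0.2:2380,etcd-c=https://10.0.0.3:2380"

# restore_cmd NAME PEER_URL: print (do not run) the restore command for one member.
restore_cmd() {
  printf 'etcdutl snapshot restore %s --name %s --initial-cluster %s --initial-cluster-token etcd-restored --initial-advertise-peer-urls %s --data-dir %s\n' \
    "$SNAPSHOT" "$1" "$INITIAL_CLUSTER" "$2" "$DATA_DIR"
}

# Every member restores from the SAME snapshot file,
# each with its own name and peer URL.
restore_cmd etcd-a https://10.0.0.1:2380
restore_cmd etcd-b https://10.0.0.2:2380
restore_cmd etcd-c https://10.0.0.3:2380
```

Each printed command is meant to be run on its own member, after which you start etcd pointing at the new data directory. Treat this as a lab exercise outline, not a runbook; your flags for TLS and your service manager setup will differ.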
The Best Practice You’ll Ignore (But Shouldn’t)
Place your etcd members on dedicated, isolated hardware (or equivalent VM/machine specs). Do not colocate them with your Kubernetes masters if you can avoid it; the stacked layout that kubeadm sets up by default is supported and common, but it makes etcd share disk and CPU with the rest of the control plane. And never run two etcd members on the same physical host, because one hardware failure then takes out both. The etcd docs are weirdly casual about this, but it’s critical. The I/O and network load from your control plane components can and will affect etcd’s delicate timing, leading to leader elections and timeouts. Give etcd its own machines, its own fast disks, and a low-latency, reliable network. Treat it like the precious, fragile queen it is. Your cluster’s stability depends on it.