40.1 What etcd Is: Distributed Key-Value Store with Raft Consensus

Right, let’s talk about the thing that makes your entire Kubernetes cluster tick. If the Kubernetes API server is the brain, etcd is the heart. It’s the single source of truth, the sacred ledger where every last detail about your cluster’s desired and observed state is meticulously recorded. And if it stops, your cluster flatlines. No pressure.

At its core, etcd is a distributed, consistent key-value store. I know, “distributed consistent” sounds like corporate mission-statement jargon, but it’s the most important part. It means you can have multiple etcd servers (together, a cluster), and they will all agree on what the data is, even if a minority of them fail or get disconnected. They present a single, logical view of the data to clients like the API server. This isn’t some sloppy eventually-consistent NoSQL database; this is the real deal: by default, a read reflects every write that completed before it. It achieves this magic trick through a consensus algorithm called Raft. We’ll get to that in a second.

The Key-Value Model is Deceptively Simple

Don’t let the simplicity fool you. This isn’t a filesystem or a document database. It’s a key-value store, and it’s fast because of it. Every piece of data is stored under a key, which is just a byte sequence, often structured like a path (e.g., /registry/pods/default/my-app-pod). The slashes are only a naming convention; the keyspace is flat, with no real directories. The value is also just a byte sequence. That’s it. This minimalism is its superpower. The Kubernetes API server is responsible for taking its rich object model (Pods, Deployments, etc.) and serializing each object into the bytes that get shoved into these etcd values (protobuf for built-in types, JSON for custom resources). It’s like a brilliant librarian who uses a ruthlessly efficient Dewey Decimal system; you just ask for a book by its number, and they hand you the exact, un-opinionated box of pages.

You interact with it primarily through its gRPC API (though there’s an HTTP/JSON gateway for quick and dirty stuff). Let’s use etcdctl, the command-line tool, to see it in action. It’s the Swiss Army knife for poking at your cluster’s ledger.

# Let's put a key. This is like writing a single fact into the ledger.
etcdctl put /guide/hello "world"

# Now let's get it back. The universe is in order.
etcdctl get /guide/hello
# Output (key on one line, value on the next):
# /guide/hello
# world

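# Let's make it more Kubernetes-y. This is the shape of key the API server writes.
# (Illustration only: the real API server stores built-in types as protobuf, not JSON,
# and you should never hand-write under /registry on a live cluster.)
etcdctl put /registry/configmaps/default/my-config '{"kind":"ConfigMap","apiVersion":"v1","data":{"app.properties":"debug=true\n"}}'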

# You can also get a range of keys. This is how the API server 'lists' all resources of a type.
etcdctl get /registry/configmaps/default/ --prefix
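
Under the hood, a prefix query is just a range query: fetch every key from the prefix up to (but not including) the prefix with its last byte bumped by one. Here is a minimal sketch of that arithmetic in plain shell. The helper name prefix_end is made up for illustration, and it assumes the prefix ends in an ordinary ASCII character (the real client also handles a trailing 0xff byte, which this does not).

```shell
# prefix_end is a hypothetical helper, not part of etcdctl.
# It prints the exclusive end of the key range that covers a prefix:
# the same string with its final byte incremented by one.
# Assumption: the prefix ends in a plain ASCII character.
prefix_end() {
  key=$1
  head=${key%?}                  # everything except the last character
  last=${key#"$head"}            # the last character on its own
  code=$(printf '%d' "'$last")   # its numeric byte value
  next=$(printf "\\$(printf '%03o' $((code + 1)))")
  printf '%s%s\n' "$head" "$next"
}

prefix_end /registry/configmaps/default/
# prints /registry/configmaps/default0   ('/' is byte 47, '0' is byte 48)
```

With a live cluster, passing that value as the optional second argument (etcdctl get /registry/configmaps/default/ /registry/configmaps/default0) should return the same keys as the --prefix form.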

Raft: The Consensus Algorithm That Actually Works

This is where the real genius lies. etcd uses the Raft consensus algorithm because the other big one, Paxos, is famously difficult to understand and implement correctly. The etcd authors are smart; they chose the algorithm designed for understandability. Thank them for it later.

Raft’s core idea is that a cluster elects a single leader. Every write (like put or delete) goes through the leader; if a client happens to send one to a follower, the follower forwards it. The leader then says, “Hey, fellow etcd members, I propose we all write this new data.” This is called appending an entry to the log. Once a majority of the members (a quorum, counting the leader itself) have durably stored this log entry, the leader considers it committed. Only then does it apply the entry to its own key-value store and send a success response back to a client like your etcdctl command.
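
The commit rule described above boils down to one comparison. This is a toy model in plain shell, not etcd’s implementation: is_committed is a made-up name, and the two arguments (cluster size and number of members that have persisted the entry, leader included) are assumptions chosen for illustration.

```shell
# Toy model of Raft's commit rule (hypothetical helper, not etcd code).
# An entry is committed once a majority of members have stored it.
is_committed() {
  members=$1   # voting members in the cluster
  acks=$2      # members, leader included, that have persisted the entry
  [ "$acks" -ge $(( members / 2 + 1 )) ]
}

if is_committed 3 2; then
  echo "3 members, 2 acks: committed"
fi
if ! is_committed 5 2; then
  echo "5 members, 2 acks: still pending"
fi
```

Note that the leader does not wait for everyone. In a 3-member cluster, the leader plus one follower is enough, which is exactly why one slow or dead member does not stall writes.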

This is why your etcd cluster should have an odd number of members. It’s all about that majority: a cluster of n members needs floor(n/2) + 1 of them to agree, so it can lose whatever is left over. A 3-node cluster can tolerate 1 failure. A 5-node cluster can tolerate 2. An even number, like 4, will run, but it only gives you a failure tolerance of 1 (because the majority of 4 is 3), so you’ve added cost and complexity for no extra fault tolerance. Just use 3 or 5. Seriously.
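
If you want to see the odd-versus-even argument as numbers, here is a small sketch that tabulates it. The function name fault_tolerance is invented for this example; the only thing it relies on is the majority arithmetic from the paragraph above.

```shell
# fault_tolerance is a hypothetical helper for illustration.
# quorum = floor(n/2) + 1, tolerated failures = n - quorum
fault_tolerance() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

printf '%-8s %-7s %s\n' members quorum tolerated
for n in 1 2 3 4 5 6 7; do
  printf '%-8s %-7s %s\n' "$n" "$(( n / 2 + 1 ))" "$(fault_tolerance "$n")"
done
```

Read down the last column and every even size tolerates exactly as many failures as the odd size just below it.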

The Rough Edges and Pitfalls You Will Hit

This elegant system has its sharp corners. The first is size. etcd is not a blob store. The value under each key should be small. By default etcd rejects any request larger than 1.5 MiB (the --max-request-bytes setting, 1572864 bytes). If you or a controller ever tries to shove a massive configuration or a hefty secret into a resource, the write gets bounced, either by the API server’s own validation (ConfigMaps and Secrets are capped at 1 MiB) or by etcd itself with an “etcdserver: request is too large” error. Either way, it tends to surface as an obscure failure far from the actual cause. Keep your ConfigMap and Secret data lean.
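
A cheap guard is to check the size of a payload before it ever reaches the cluster. The sketch below is only an approximation: fits_in_etcd is a made-up helper, it assumes the default 1.5 MiB limit (1572864 bytes) has not been changed on your servers, and it measures the raw file rather than the full request, which also carries the key and some encoding overhead.

```shell
# Assumed default for etcd's --max-request-bytes (1.5 MiB).
MAX_REQUEST_BYTES=1572864

# fits_in_etcd is a hypothetical helper, not an etcd or kubectl command.
# $1 is the path of a file holding the value you plan to store.
# Succeeds only if the file is smaller than the assumed request limit.
fits_in_etcd() {
  size=$(( $(wc -c < "$1") ))   # arithmetic expansion trims wc's padding
  [ "$size" -lt "$MAX_REQUEST_BYTES" ]
}

# Example use (the file name is illustrative):
#   fits_in_etcd app.properties || echo "too big for a single etcd value"
```

Treat a pass as “probably fine” rather than a guarantee, and leave yourself generous headroom below the limit.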

The second, more insidious pitfall is quorum. If you lose a majority of your etcd members (e.g., 2 out of 3), the cluster is hosed. It loses quorum and cannot accept any writes. It’s a chicken-and-egg problem: to recover a node and add it to the cluster, you need to perform a write… but you can’t because you have no quorum. This is why your backups are absolutely non-negotiable. You will need to restore from a snapshot someday.

# This is not a joke. Do this religiously on a production cluster.
# Take a snapshot (snapshot save wants exactly one endpoint, not a list)
ETCDCTL_API=3 etcdctl --endpoints=$ENDPOINT snapshot save snapshot.db

# Check the snapshot status to see if it's valid
ETCDCTL_API=3 etcdctl --write-out=table snapshot status snapshot.db

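# Restore a cluster from a snapshot. This writes a brand-new data dir and refuses
# to overwrite an existing one, so stop etcd and move the old dir aside first.
# --name must match the member name used in --initial-cluster.
# On etcd 3.5+, snapshot status/restore are deprecated in etcdctl in favor of etcdutl.
ETCDCTL_API=3 etcdctl snapshot restore snapshot.db --name my-etcd-1 --initial-cluster my-etcd-1=http://... --initial-advertise-peer-urls http://...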

The final, often overlooked, issue is API version confusion. etcd has two client APIs, v2 and v3, with completely separate keyspaces, and Kubernetes uses v3. An etcdctl older than 3.4 speaks the v2 API unless you set ETCDCTL_API=3, so the env var isn’t just a formality; it’s a crucial switch. Ask through the v2 API for data that was written through v3 and you get confusing, empty results. It’s the most common “why can’t I see my data?!” problem. From etcdctl 3.4 onward v3 is the default, but being explicit costs nothing. Always be explicit. The designers made a questionable choice here by not making v3 the default sooner, but we live with it.