40.4 Backing Up and Restoring etcd

Right, let’s talk about the crown jewels. Your entire Kubernetes cluster—every pod, every service, every secret, every existential thought your cluster has ever had—is stored in one place: etcd. It’s the single source of truth. This makes it both the most critical component and your biggest single point of failure. So, if you’re not backing it up, you’re basically flying a million-dollar jet with no parachute and praying the engines don’t so much as cough. Let’s fix that.

The core concept is simple: we take a point-in-time snapshot of etcd’s key-value store. But the how depends on whether you’re dealing with a managed Kubernetes service (where they often handle this for you, but you should verify) or a cluster you’ve set up yourself, typically with kubeadm. We’ll focus on the latter, because that’s where you have to get your hands dirty.

Taking a Snapshot: The `etcdctl` Way

You’ll use the etcdctl tool. The first gotcha: the API version. Kubernetes uses the etcd v3 API, but etcdctl releases before 3.4 default to the v2 API, where the snapshot command doesn’t exist at all. Setting ETCDCTL_API=3 explicitly costs nothing on newer versions and saves you on older ones. You also have to spell out the connection details. Here’s how you do it properly.

First, you need the right flags. You can’t just run this from anywhere; you need to run it on the actual etcd server node, authenticating with its client certificates. These are usually hanging out in /etc/kubernetes/pki/etcd/.

# This is the full, no-nonsense command. Run it on a control-plane node that hosts etcd.
sudo ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  snapshot save /path/to/your/snapshot.db

Why all the certificates? etcd is fiercely paranoid, as it should be. It communicates over TLS with mutual authentication (--cert and --key) to ensure only authorized clients (like the API server) can talk to it. The --cacert tells etcdctl to trust the etcd CA. Without this trio, you’re getting a door slammed in your face.
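A snapshot is only useful if you can find it later and it isn’t corrupt. Here is a minimal sketch of a wrapper that gives each snapshot a sortable, timestamped name and then asks etcdctl to report on the file. The function names and the backup directory are assumptions for illustration, not kubeadm defaults. On recent etcd releases the status subcommand has moved to etcdutl, so swap that in if etcdctl complains.

```shell
#!/usr/bin/env bash
# Sketch only: function names and the backup directory are hypothetical.

# Print a timestamped snapshot path inside the given directory (UTC, sortable).
snapshot_path() {
  local dir="$1"
  printf '%s/etcd-%s.db\n' "${dir%/}" "$(date -u +%Y%m%dT%H%M%SZ)"
}

# Take a snapshot with the same flags as above, then print its hash, revision and size.
take_snapshot() {
  local dir="$1" pki=/etc/kubernetes/pki/etcd file
  file="$(snapshot_path "$dir")"
  mkdir -p "$dir"
  ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
    --cacert="$pki/ca.crt" --cert="$pki/server.crt" --key="$pki/server.key" \
    snapshot save "$file" || return 1
  # A truncated or corrupt file makes this step fail.
  ETCDCTL_API=3 etcdctl snapshot status "$file" --write-out=table
}
```

Run as root, something like take_snapshot /var/backups/etcd would do the job, where the directory is whatever you choose. Copy the resulting file off the node afterwards; a backup that lives only on the machine it protects isn’t much of a backup.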

What About a Static Pod? Getting the Endpoint Right

If you used kubeadm, your etcd is probably running as a static pod, with its manifest at /etc/kubernetes/manifests/etcd.yaml. Peek at that file (sudo cat /etc/kubernetes/manifests/etcd.yaml) and find the --listen-client-urls flag. You’ll see a loopback listener, https://127.0.0.1:2379, typically alongside the node’s own IP. This is why our --endpoints is https://127.0.0.1:2379 and not some fancy cluster IP: it works on every control-plane node without you having to know the node’s address. Pointing etcdctl at an address that isn’t in that list, or that isn’t among the server certificate’s SANs (Subject Alternative Names), will fail spectacularly, with either a refused connection or a TLS verification error. Another one of those “questionable choices” that makes sense for security but trips everyone up.
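If you’d rather not eyeball the whole manifest, a small helper can pull a single flag out of it. This is a sketch that assumes kubeadm’s usual layout of one "- --flag=value" entry per line; etcd_flag is a made-up name, not part of any tool.

```shell
# Print the value of one etcd flag (e.g. listen-client-urls) from a static pod manifest.
# Assumes one "- --flag=value" entry per line, as kubeadm writes it.
etcd_flag() {
  local manifest="$1" flag="$2"
  # Match the exact flag name, then keep everything after the first "=".
  grep -E -- "^[[:space:]]*- --${flag}=" "$manifest" | head -n 1 | cut -d= -f2-
}
```

For example, etcd_flag /etc/kubernetes/manifests/etcd.yaml listen-client-urls prints the comma-separated list of addresses etcd accepts client connections on. Run it as root, since the manifest is usually not world-readable. Anchoring the match on the trailing "=" matters: it keeps a lookup for initial-cluster from also matching initial-cluster-state.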

Actually Restoring From Backup (The “Break Glass” Procedure)

Here’s the part everyone hopes they never need. Restoring isn’t a gentle process; it’s a hard reset. You’re telling etcd to obliterate its current state and start fresh from your snapshot. This means you must do this on a fresh cluster or be prepared for a complete outage on the current one.

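The process is destructive, so we do it offline. Stop the kube-apiserver first, or it will freak out as its database suddenly vanishes from underneath it. Stop etcd too. On a kubeadm cluster both run as static pods, so “stopping” them means temporarily moving their manifests out of /etc/kubernetes/manifests and letting the kubelet tear the pods down.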

# 0. Grab the member settings from the manifest, then stop the control plane.
#    Moving the static pod manifests away makes the kubelet stop those pods.
MANIFEST=/etc/kubernetes/manifests/etcd.yaml
NAME=$(sudo grep -- '--name=' "$MANIFEST" | awk -F= '{print $2}')   # usually the node's hostname
CLUSTER=$(sudo grep -- '--initial-cluster=' "$MANIFEST" | cut -d= -f2-)
PEER_URL=$(sudo grep -- '--initial-advertise-peer-urls=' "$MANIFEST" | awk -F= '{print $2}')
sudo mkdir -p /etc/kubernetes/manifests.stopped
sudo mv /etc/kubernetes/manifests/kube-apiserver.yaml /etc/kubernetes/manifests/etcd.yaml \
  /etc/kubernetes/manifests.stopped/

# 1. Move the existing data dir out of the way. DON'T SKIP THIS.
sudo mv /var/lib/etcd /var/lib/etcd.old

# 2. Restore the snapshot into the data directory the manifest already points at.
#    (Restore refuses to write into a non-empty directory, hence step 1.)
sudo ETCDCTL_API=3 etcdctl snapshot restore /path/to/your/snapshot.db \
  --data-dir /var/lib/etcd \
  --name "$NAME" \
  --initial-cluster "$CLUSTER" \
  --initial-cluster-token etcd-cluster-1 \
  --initial-advertise-peer-urls "$PEER_URL"

# 3. No manifest edit is needed, because the data landed back in /var/lib/etcd.
#    If you restore somewhere else instead, change the '--data-dir' flag and the
#    hostPath volume in etcd.yaml to match the new location.

# 4. Put the manifests back so the kubelet starts etcd and the API server again.
sudo mv /etc/kubernetes/manifests.stopped/etcd.yaml /etc/kubernetes/manifests.stopped/kube-apiserver.yaml \
  /etc/kubernetes/manifests/

# 5. Wait for the kubelet to restart the pods. Then, pray.

See all those grep and awk commands? That’s because you need to pull the exact same parameters (--name, --initial-cluster, --initial-advertise-peer-urls) that were used when the cluster was first set up. Getting any of these wrong means your restored etcd member won’t be able to talk to itself, let alone any other members. And if you do have other members, each one has to be restored from the same snapshot file, with its own name and peer URL. It’s finicky, which is why you test this procedure before your production cluster is on fire.
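Since a mismatch between these values is the classic way a restore goes sideways, it’s worth checking them before running anything destructive. The sketch below tests that the member’s name and peer URL appear together as one entry of the --initial-cluster list. It assumes one peer URL per member, and member_in_cluster is a hypothetical helper name.

```shell
# Succeed only if "name=peer_url" is one of the comma-separated
# entries in the --initial-cluster value.
member_in_cluster() {
  local name="$1" peer_url="$2" initial_cluster="$3" entry
  local IFS=','
  for entry in $initial_cluster; do
    [ "$entry" = "${name}=${peer_url}" ] && return 0
  done
  return 1
}
```

Call it with the three values you extracted from the manifest, for example member_in_cluster "$NAME" "$PEER_URL" "$CLUSTER" || echo "mismatch, do not restore yet". The variable names are placeholders for whatever you stored the values in.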

The Golden Rule: Test the Restore

A backup you haven’t tested is just a hopeful piece of digital clutter. Once a quarter, spin up a blank cluster, practice this restore procedure, and verify that kubectl get nodes works afterwards. It’s the only way to sleep soundly at night. This isn’t just a best practice; it’s the entire point of the exercise.
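Between those quarterly drills, a cheaper smoke test can at least catch snapshots that won’t restore at all. The sketch below restores a snapshot into a throwaway directory and checks that etcd’s member directory appears. It is no substitute for the full drill, since it never starts etcd or the API server. restore_drill is a hypothetical name, and on recent etcd releases you may need etcdutl in place of etcdctl for the restore subcommand.

```shell
# Restore a snapshot into a scratch directory and confirm that etcd's
# data layout (the "member" directory) was created. Cleans up either way.
restore_drill() {
  local snapshot="$1" scratch
  scratch="$(mktemp -d)" || return 1
  if ETCDCTL_API=3 etcdctl snapshot restore "$snapshot" --data-dir "$scratch/etcd" \
       && [ -d "$scratch/etcd/member" ]; then
    echo "restore OK: $snapshot"
    rm -rf "$scratch"
    return 0
  fi
  echo "restore FAILED: $snapshot" >&2
  rm -rf "$scratch"
  return 1
}
```

Because it only writes under a temporary directory, this can run on any machine that has the etcd tooling installed and a copy of the snapshot, not just on a control-plane node.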
