40.6 Monitoring etcd: Key Metrics and Alerts

Alright, let’s get our hands dirty with etcd monitoring. Think of etcd as the meticulous, slightly neurotic librarian for your entire Kubernetes cluster. It doesn’t just store where the books are; it is the library’s card catalog. If it gets slow, starts dropping index cards, or just decides to take a long lunch, your entire cluster grinds to a halt. We’re not just checking if the lights are on; we’re checking its pulse, its reflexes, and its stress levels.

The Golden Signals: What to Watch and Why

You can’t fix what you can’t measure, so let’s talk about the four key metrics that tell you most of what’s happening inside etcd.

First, leader changes. etcd is a key-value store built on the Raft consensus protocol, which means it elects a leader to handle all writes. This is great for consistency, but it’s a bit like herding cats. If the leader keeps changing (Raft instability), it’s a massive red flag. Every time a leader is lost, writes stall until the election timeout expires and a new leader is elected. Too many of these, and your API server requests will start timing out. You want this number to be as close to zero as possible.

Second, commit latency. This is the time between a request (like kubectl apply) being proposed and being committed to etcd’s disk. This is your primary performance metric. High latency here means your API server clients are sitting around waiting, and users start complaining that kubectl is “slow.” You’ll see this measured in milliseconds, and you should care about both the average and, more importantly, the 99th percentile (p99). The p99 latency tells you about the worst-case scenarios that are actually ruining your users’ day.

Third, WAL fsync latency, exposed as etcd_disk_wal_fsync_duration_seconds. This is how long it takes to sync the Write-Ahead Log (WAL) to disk. etcd is a stickler for data safety: it must fsync your data to disk before it tells the client the write was successful. If your disk is slow (looking at you, spinning rust, or an under-provisioned cloud disk), this metric will skyrocket and take your commit latency with it. This is almost always an I/O subsystem problem.

Fourth, etcd_server_leader_changes_seen_total. This is the counter behind the first signal, and it earns its own slot because it is the one you actually alert on. Scrape it and alert when rate() shows it increasing. A steady, non-zero rate here means your cluster is unstable. The usual culprits are network latency between members or a system under too much load to respond to Raft heartbeats in time.

Setting Up a Basic Monitoring Stack

You’re probably using Prometheus, so here’s a practical example. First, make sure you’re scraping etcd’s metrics endpoint. etcd exposes its metrics on its client port (usually 2379) at /metrics. Your scrape config might look something like this:

# prometheus.yml snippet
scrape_configs:
  - job_name: 'etcd'
    static_configs:
    - targets: ['10.0.1.10:2379', '10.0.1.11:2379', '10.0.1.12:2379']
    scheme: https
    tls_config:
      ca_file: /path/to/etcd-ca.crt
      cert_file: /path/to/etcd-client.crt
      key_file: /path/to/etcd-client.key
      insecure_skip_verify: false # Belongs under tls_config. Never set this to true in production, you maniac.

Now, let’s translate those golden signals into actual PromQL queries you can throw into a dashboard or alert.

# Alert on frequent leader changes
rate(etcd_server_leader_changes_seen_total[15m]) > 0

# High commit latency: p99 above 100ms (the "for 10m" part goes in the alert rule's for: field)
histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket[5m])) > 0.1

# High WAL fsync latency: p99 above 1s (the "for 10m" part goes in the alert rule's for: field)
histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) > 1
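
These queries only become alerts once they sit in a Prometheus rule file, which is also where a hold time such as 10m is expressed (the for: field). Here is a minimal sketch of such a file, reusing the expressions and thresholds above. The group name, alert names, severity labels, and annotation text are placeholders, not anything etcd or Prometheus prescribes.

```yaml
# etcd-alerts.yml (hypothetical file name), loaded via rule_files in prometheus.yml
groups:
  - name: etcd-golden-signals        # placeholder group name
    rules:
      # No "for:" here: a single election already keeps rate() above zero
      # for the whole 15m window. Raise the threshold if this is too noisy.
      - alert: EtcdLeaderChanges
        expr: rate(etcd_server_leader_changes_seen_total[15m]) > 0
        labels:
          severity: warning
        annotations:
          summary: "etcd member {{ $labels.instance }} is seeing leader changes"

      # p99 backend commit latency above 100ms, sustained for 10 minutes.
      - alert: EtcdHighCommitLatency
        expr: histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket[5m])) > 0.1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "etcd member {{ $labels.instance }} has slow backend commits"

      # p99 WAL fsync latency above 1s, sustained for 10 minutes.
      - alert: EtcdHighWalFsyncLatency
        expr: histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) > 1
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "etcd member {{ $labels.instance }} has slow WAL fsyncs"
```

Keeping the hold time in for: rather than widening the rate() window means a brief spike still shows on the dashboard but does not page anyone.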

The Sneaky Pitfalls: What They Don’t Tell You in the Manual

Now for the real-world stuff. The designers made some… interesting choices.

Pitfall #1: The Memory Blowup. etcd isn’t just a simple key-value store; it’s a multiversion concurrency control (MVCC) store. It keeps a history of key changes. This is fantastic for watches, but it means the database size, and the memory needed to hold it, grows with the number of changes, not just the amount of live data. If you have pods churning constantly, you’re generating a huge amount of revision history. The database keeps ballooning until old revisions are compacted, and the freed space is only handed back when you defragment, which etcd does not do on its own and which blocks the member while it runs (another performance hit). You must monitor etcd_mvcc_db_total_size_in_bytes and etcd_server_quota_backend_bytes. If the database size exceeds the quota, etcd raises a NOSPACE alarm and rejects writes, accepting only reads and deletes, until you compact, defragment, and disarm the alarm. Not fun.
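
One way to watch the two metrics named above is to alert on their ratio. The following is a sketch under assumptions: the 80% threshold, the 10m hold time, and the group and alert names are illustrative choices, so tune them to how fast your database actually grows.

```yaml
groups:
  - name: etcd-capacity              # placeholder group name
    rules:
      # Fraction of the backend quota currently used by the database file.
      # Both metrics are exposed per member, so the division matches on
      # the instance label without extra aggregation.
      - alert: EtcdDatabaseNearQuota
        expr: etcd_mvcc_db_total_size_in_bytes / etcd_server_quota_backend_bytes > 0.8
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "etcd member {{ $labels.instance }} is using over 80% of its backend quota"
```

Firing well before the quota leaves time to compact and defragment on your own schedule rather than during an outage.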

Pitfall #2: The Quiet TCP Slaughter. Clients, including the API server, talk to etcd over gRPC, which rides on HTTP/2; members talk to each other over long-lived HTTP streams on the peer port (usually 2380). Under heavy load, either path can run short of healthy TCP connections. You might see mysterious “broken pipe” or “context deadline exceeded” errors in your API servers even though etcd metrics look fine. This is a systems-level problem. Monitor the number of open connections to your etcd members (netstat -an | grep :2379 | wc -l is a start) and your OS’s TCP retransmission rates.
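
If you would rather have Prometheus watch this than run netstat by hand, a rough equivalent can be sketched as alert rules. This leans on several assumptions: node_exporter runs on the etcd hosts and is scraped under a job called 'node' (a hypothetical name), open file descriptors are used as a proxy for open connections rather than an exact count, and both thresholds are illustrative.

```yaml
groups:
  - name: etcd-connections           # placeholder group name
    rules:
      # File descriptors in use by the etcd process, as a share of its limit.
      # Sockets count against this limit, so it rises with connection count.
      - alert: EtcdFileDescriptorsHigh
        expr: process_open_fds{job="etcd"} / process_max_fds{job="etcd"} > 0.8
        for: 10m
        labels:
          severity: warning

      # Share of sent TCP segments that were retransmissions, per host.
      # Assumes node_exporter's netstat collector, scraped as job="node".
      - alert: EtcdHostTcpRetransmitsHigh
        expr: |
          rate(node_netstat_Tcp_RetransSegs{job="node"}[5m])
            / rate(node_netstat_Tcp_OutSegs{job="node"}[5m]) > 0.01
        for: 10m
        labels:
          severity: warning
```

The retransmission rule is host-wide, not etcd-specific, so treat it as a hint to go look at the network rather than as proof that etcd is the victim.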

The bottom line? Monitor the metrics, but also monitor the system etcd runs on. No amount of brilliant software can outrun the physical limitations of a bad disk or a congested network.