42.7 Control Plane Failures: API Server, etcd, Scheduler

Right, so your cluster has gone sideways. The apps are down, kubectl commands are timing out, and that little voice in your head is whispering, “Was it something I did?” Probably. But more likely, it’s the control plane throwing a tantrum. This isn’t your application code; this is the brain of your entire operation having a stroke. We need to triage the patient.

The control plane’s job is to maintain state. Its entire existence is a constant loop of “observe reality, compare to desired state, reconcile.” When it fails, that loop breaks. Your first clue is almost always the kubectl command hanging or spitting out a beautiful, utterly useless error: “The connection to the server <server-name:port> was refused - did you specify the right host or port?” Don’t panic. This just means the API server, the front door to everything, is closed for business.
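Before you SSH anywhere, it can help to pin down which kind of failure you are looking at. Here is a minimal sketch, assuming kubectl is on your PATH with a working kubeconfig. The function name probe_apiserver is made up for illustration, and the substrings it matches are the ones kubectl commonly prints, so treat them as assumptions rather than a stable contract.

```shell
# Probe the API server with a short timeout and classify the failure.
# probe_apiserver is a hypothetical helper name, not part of kubectl.
probe_apiserver() {
  # /livez is answered by the API server itself; 2>&1 captures the error text.
  if out=$(kubectl --request-timeout=5s get --raw='/livez' 2>&1); then
    echo "api server reachable: $out"
    return 0
  fi
  case "$out" in
    *"was refused"*)
      echo "refused: nothing is listening, check the kube-apiserver process" ;;
    *"timeout"*|*"deadline exceeded"*)
      echo "timeout: check the network path, load balancer, or an overloaded server" ;;
    *"x509"*)
      echo "tls: check the certificates and the CA in your kubeconfig" ;;
    *)
      echo "other: $out" ;;
  esac
  return 1
}

# Usage:
#   probe_apiserver
```

A “refused” result sends you to the API server process itself, which is where the rest of this section starts.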

Is the API Server Even Alive?

Your first move is to SSH into one of your control plane nodes and start kicking the tires locally. The API server is just a process, typically run by the kubelet as a static Pod or managed directly by systemd. Let’s see if it’s still running.

# On a control plane node, if the API server runs as a systemd unit
sudo systemctl status kube-apiserver
# or, if the kubelet runs it as a static Pod (-a also lists exited containers)
sudo crictl ps -a | grep kube-apiserver

If it’s inactive (dead), exited, or just not in the list, well, there’s your problem. On a systemd setup, start it with sudo systemctl start kube-apiserver and then immediately check its logs to see why it died in the first place; journalctl -u kube-apiserver -f is your best friend here. On a static Pod setup, the kubelet restarts the container for you, so go straight to sudo crictl logs <container-id>. The most common culprits? It can’t talk to its etcd backend, or the TLS certificates have expired or been swapped for invalid ones. The logs will tell you. They always tell you.
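If the logs point at certificates, you can confirm an expiry without guessing. The sketch below wraps openssl x509 -checkend, which exits non-zero when a certificate expires within the given number of seconds. The helper name check_cert and the seven-day default window are assumptions for illustration, and the example path is the kubeadm default, so adjust it to wherever your certificates live.

```shell
# Report whether a certificate is expired or about to expire.
# Usage: check_cert <path-to-cert> [seconds-ahead, default 7 days]
# check_cert is a hypothetical helper name.
check_cert() {
  cert=$1
  window=${2:-604800}
  # -checkend 0 fails if the cert is already expired (or cannot be read).
  if ! openssl x509 -noout -checkend 0 -in "$cert" >/dev/null 2>&1; then
    echo "EXPIRED (or unreadable): $cert"
    return 2
  fi
  # -checkend N fails if the cert expires within the next N seconds.
  if ! openssl x509 -noout -checkend "$window" -in "$cert" >/dev/null 2>&1; then
    echo "EXPIRING SOON: $cert"
    return 1
  fi
  echo "OK: $cert"
}

# Usage (run as root; kubeadm default path, an assumption):
#   check_cert /etc/kubernetes/pki/apiserver.crt
#   openssl x509 -noout -enddate -in /etc/kubernetes/pki/apiserver.crt
```

On kubeadm clusters, kubeadm certs check-expiration gives a similar overview for all the control plane certificates at once.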

The Heart of the Matter: etcd

The API server is just a talkative middleman; etcd is the single source of truth where all your cluster’s secrets (and Deployments, and Services, and…) actually live. If etcd is down, the API server has nothing to talk to and will just give up. It’s like a librarian in an empty library—they can’t help you find a book that isn’t there.

Checking on etcd is similar. Is its process running?

sudo systemctl status etcd
sudo crictl ps -a | grep etcd   # if etcd runs as a static Pod instead

But the real test is seeing if it’s healthy. If you can get to the API server, you can ask it about etcd. If you can’t, you’ll need to use the etcdctl tool directly on the node.
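For the first route, where the API server still answers, a rough sketch is to read its readiness checks, which include one for etcd. The helper name failing_checks is made up here. The exact shape of the output can differ between versions (a failing /readyz may come back wrapped in an error message), so the function only pulls out the “[-]name” markers instead of relying on line layout.

```shell
# List the API server readiness checks that are currently failing.
# failing_checks is a hypothetical helper name. Each check shows up in the
# verbose output as "[+]name ok" or "[-]name failed: ...".
failing_checks() {
  kubectl --request-timeout=5s get --raw='/readyz?verbose' 2>&1 \
    | grep -o '\[-][^ ]*'
}

# Usage:
#   failing_checks                      # prints e.g. "[-]etcd" when etcd is the problem
#   kubectl get --raw='/readyz/etcd'    # ask about the etcd check alone
```

If that call itself times out, the API server can’t help you, and it is down to etcdctl on the node.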

# Using etcdctl directly. Your cert paths and endpoints will vary;
# the ones below are the kubeadm defaults.
sudo ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  endpoint health

Each endpoint you queried should come back as healthy; add --cluster to endpoint health if you want every member of a multi-node etcd cluster checked in one go. If you see unhealthy, you’ve found the smoking gun. The most common reason for a single etcd member to fall over is disk trouble. etcd syncs every write to disk before acknowledging it, so it is very sensitive to disk latency. If your disk is overwhelmed, etcd will miss its heartbeats and seize up. Check your disk I/O with iostat and make sure you haven’t filled the disk with logs (df -h). A full disk will absolutely wreck an etcd node.
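The disk-space half of that check is easy to script. This is a minimal sketch: the helper names, the 85% threshold, and the /var/lib/etcd default (kubeadm’s usual data directory) are all assumptions, so point it at wherever your etcd actually keeps its data.

```shell
# Print the usage percentage of the filesystem that holds a path.
# disk_used_pct and check_etcd_disk are hypothetical helper names.
disk_used_pct() {
  # -P forces one POSIX-format line per filesystem; field 5 is the capacity.
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Warn when the filesystem under the etcd data directory is nearly full.
check_etcd_disk() {
  dir=${1:-/var/lib/etcd}   # assumed data dir (kubeadm default)
  limit=${2:-85}            # assumed warning threshold, in percent
  used=$(disk_used_pct "$dir")
  if [ -z "$used" ]; then
    echo "cannot read disk usage for $dir"
    return 2
  fi
  if [ "$used" -ge "$limit" ]; then
    echo "WARN: $dir filesystem at ${used}% (limit ${limit}%)"
    return 1
  fi
  echo "ok: $dir filesystem at ${used}%"
}

# Usage:
#   check_etcd_disk
#   iostat -x 1 5    # from the sysstat package; watch the etcd disk's await and %util
```

Space is only half the story, so pair this with a few seconds of iostat output before you declare the disk innocent.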

The Silent Saboteur: Scheduler

What if your API server is up, your etcd is healthy, but no new Pods are being assigned to nodes? Your kubectl get pods shows a bunch stuck in Pending. Congratulations, you’ve likely found a scheduler issue. This is the quiet failure. The system seems responsive, but it’s catatonic.

First, is it running? kubectl get pods -n kube-system | grep scheduler. If it’s not, that’s a problem. But the more insidious issue is when it’s running but not functioning. Maybe it has a misconfiguration or can’t see the nodes properly. You can check its logs to see what it’s thinking.

# Get the scheduler pod name first, then its logs
kubectl logs -n kube-system <kube-scheduler-pod-name>

You’re looking for lines where it’s actually making scheduling decisions. At the default log verbosity a healthy scheduler can be fairly quiet, so treat silence as a hint, not proof. If it’s spitting out errors about not being able to bind pods, you need to dig deeper. A common pitfall is that the scheduler’s configuration has been accidentally altered, whether that’s the file passed with --config or a ConfigMap it is mounted from, telling it to ignore perfectly good nodes based on some broken filter rule.
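It also helps to separate “the scheduler is broken” from “the scheduler is fine but no node fits.” A working scheduler that can’t place a Pod records FailedScheduling events against it. The sketch below counts those events for one Pod. The helper names are made up, events age out after a while (an hour by default), and a Pod that names a different scheduler in spec.schedulerName won’t get events from the default one, so read the result as a hint.

```shell
# Count FailedScheduling events recorded for one pod.
# Usage: scheduling_failures <namespace> <pod>
# scheduling_failures and explain_pending are hypothetical helper names.
scheduling_failures() {
  kubectl get events -n "$1" \
    --field-selector "involvedObject.name=$2,reason=FailedScheduling" \
    -o name 2>/dev/null | wc -l | tr -d ' '
}

# Turn the count into a rough diagnosis for a Pending pod.
explain_pending() {
  n=$(scheduling_failures "$1" "$2")
  if [ "$n" -gt 0 ]; then
    echo "scheduler is working but found no fitting node ($n FailedScheduling events)"
  else
    echo "no scheduling events: the scheduler may not be processing this pod"
  fi
}

# Usage:
#   explain_pending default my-pending-pod
```

In the first case, kubectl describe pod shows the scheduler’s reason in the event message; in the second, go back to the scheduler’s logs and configuration.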

The best practice here is to run a highly available, multi-node control plane for all these components. But let’s be honest, sometimes you’re just playing around with a single-node cluster on an old laptop. I get it. The principles are the same: check the process, check the logs, check the dependencies (network, disk). It’s not magic; it’s just a bunch of software that sometimes breaks. Your job is to be the mechanic who knows which wrench to use.
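The “check the process” step for all three components fits in one small helper. This is a sketch that assumes a static Pod control plane and that crictl is configured for the node’s container runtime; missing_components is a made-up name, and on a systemd-managed setup you would swap the crictl call for systemctl is-active.

```shell
# Print each control plane component with no running container on this node.
# missing_components is a hypothetical helper name; run it as root.
missing_components() {
  # crictl ps lists running containers only (no -a).
  running=$(crictl ps 2>/dev/null)
  for c in kube-apiserver etcd kube-scheduler; do
    case "$running" in
      *"$c"*) ;;          # found a running container with this name
      *) echo "$c" ;;     # nothing running: report it
    esac
  done
}

# Usage:
#   missing_components
#   crictl ps -a --name kube-scheduler   # then: crictl logs <container-id>
```

Anything it prints is your next stop for logs; if it prints nothing, move on to the dependencies.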