42.6 Node NotReady: Common Causes and Remediation

Alright, let’s talk about a Node going into NotReady state. It’s Kubernetes’ way of telling you, “Hey, I’ve got a problem over here and I can’t schedule any more work on this server.” It’s not being lazy; it’s being honest. Your job is to figure out why.

Think of the Kubelet on each node as a harried middle manager. One of its main jobs is to keep reporting back to the Control Plane (Head Office) that its node (retail store) is open for business and has shelf space. The Node object is that status report. When the Kubelet sends a bad report, the node’s Ready condition goes False. When it stops reporting altogether, the Control Plane waits out a grace period (under a minute by default; the roughly five-minute wait people remember is how long pods are tolerated on the node before eviction) and flips the condition to Unknown. Either way, kubectl shows the node as NotReady. It’s a safety mechanism. It’d rather stop sending you customers than send them to a store that might be on fire.

First, always confirm what you’re seeing. Don’t just trust the dashboard.

kubectl get nodes

You’ll see a list, and one of them will have NotReady staring back at you in the STATUS column. Now, let’s get the gossip. The describe command is your best friend here. It’s the node’s full medical chart.

kubectl describe node <node-name>

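Skip past the labels, annotations, and taints to the Conditions section. It’s pure gold. It lists the health checks the Kubelet reports and which of them are failing. You’re looking for Ready to be False or Unknown and, crucially, the Reason and Message beside it. False means the Kubelet itself says it isn’t ready; Unknown means the Control Plane has stopped hearing from it. This is your first major clue. The Events at the very bottom are worth a glance too.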
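
If you’d rather not scroll at all, a small wrapper can pull out just the condition rows. Treat this as a sketch rather than a standard tool: the function names node_conditions and ready_condition are made up for this example, and it assumes kubectl is already pointed at the right cluster.

```shell
# node_conditions NODE
# Prints one tab-separated line per condition: type, status, reason, message.
# The function name is hypothetical; the fields come from the Node object's
# .status.conditions list.
node_conditions() {
  kubectl get node "$1" -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.reason}{"\t"}{.message}{"\n"}{end}'
}

# ready_condition NODE
# Narrows the output to the Ready row, the one behind the STATUS column.
ready_condition() {
  node_conditions "$1" | awk -F '\t' '$1 == "Ready"'
}
```

Run ready_condition <node-name> and read the last two fields first. The reason and message usually tell you which of the sections below to jump to.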

Kubelet: The Usual Suspect

More often than not, this is a Kubelet problem. It’s the process responsible for talking to the API server and managing containers on the node through a container runtime (containerd, CRI-O, or, on Kubernetes 1.24 and later, Docker Engine via cri-dockerd). If it’s crashed, misconfigured, or just deeply unhappy, the node goes NotReady.

SSH into the node (because yes, you’ll often need to do that) and check the Kubelet’s vitals. The commands differ slightly based on your OS, but the dance is the same.

# On a systemd-based node (most of them)
systemctl status kubelet -l

Is it running? If not, start it: sudo systemctl start kubelet. Now, is it still running? Check the logs. If it’s crashing in a loop, the logs will tell you why.

journalctl -u kubelet -n 100 --no-pager
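
A hundred lines of Kubelet log can still be a wall of text. One way to get your bearings is to count which source locations are emitting the errors. The sketch below assumes the usual klog line format, where error entries start with E plus a four-digit month and day; the function name kubelet_error_hotspots is invented for this example.

```shell
# kubelet_error_hotspots
# Reads kubelet log lines on stdin and counts error-level entries by the
# source location that emitted them. Assumes klog formatting, where an error
# entry starts with "E" plus month/day (e.g. E0612) and carries a
# "file.go:123]" marker before the message.
kubelet_error_hotspots() {
  grep -E '(^|[[:space:]])E[0-9]{4} ' \
    | grep -oE '[A-Za-z0-9_]+\.go:[0-9]+\]' \
    | tr -d ']' \
    | sort | uniq -c | sort -rn | head -n 10
}

# Usage on the node:
#   journalctl -u kubelet -n 500 --no-pager | kubelet_error_hotspots
```

If one location accounts for nearly every error, grep the raw log for that file name and read the messages it’s attached to.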

Common Kubelet issues include:

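Disk Pressure: The node’s disk is full, or close to it. The Kubelet starts evicting pods, and once the disk is completely full neither it nor the container runtime can pull images or write logs. Go find and delete some old images or logs.
Incorrect credentials: It can’t authenticate with the API server, often because a client certificate has expired. Check its kubeconfig file, commonly /etc/kubernetes/kubelet.conf on kubeadm clusters or /var/lib/kubelet/kubeconfig on some distributions, or wherever the --kubeconfig flag points.
Network Plugin Woes: If you’re using a CNI like Calico or Flannel, and its pod isn’t running on that node, the Kubelet reports the node as not ready, typically with a message about the network plugin or CNI not being initialized. This is a classic “chicken and egg” problem the designers gifted us. Check that your CNI DaemonSet has a pod running on this node.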
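
For the CNI check in particular, it helps to list everything scheduled to the sick node that isn’t in a healthy state. CNI DaemonSets don’t always live in kube-system, so this sketch looks across all namespaces. The function name is hypothetical, and it relies on the column order kubectl prints with -A and --no-headers.

```shell
# unhealthy_pods_on_node NODE
# Lists pods scheduled to NODE, across all namespaces, whose STATUS column is
# anything other than Running or Completed. With -A and --no-headers the
# columns are NAMESPACE NAME READY STATUS ..., so STATUS is field 4.
unhealthy_pods_on_node() {
  kubectl get pods -A -o wide --no-headers --field-selector "spec.nodeName=$1" \
    | awk '$4 != "Running" && $4 != "Completed" { print $1 "/" $2 "\t" $4 }'
}
```

An empty result doesn’t prove the CNI is fine: a pod can be Running with 0/1 containers ready, and statuses go stale once the Kubelet stops reporting. If your CNI pod shows up here in CrashLoopBackOff or Init, though, start there.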

Resource Pressure: Memory, Disk, PID

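The Kubelet isn’t just being dramatic. If the node is running low on memory (MemoryPressure), disk space (DiskPressure), or process IDs (PIDPressure), the Kubelet sets the matching condition to True, the node gets tainted so the scheduler stops making the problem worse, and pods start getting evicted. Pressure on its own usually leaves the node Ready; the node tips into NotReady when things get bad enough that the Kubelet or the container runtime can no longer do its job. The early warning is a good thing, even if it ruins your day.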

kubectl describe node will show these conditions clearly. If you see True next to MemoryPressure, DiskPressure, or PIDPressure, you’ve found your culprit, or at least its accomplice.
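
When several nodes are misbehaving at once, describing them one by one gets old fast. The following is a rough sketch that puts the Ready condition and the three pressure conditions side by side for every node. The column labels and function names are arbitrary choices for this example.

```shell
# pressure_table
# One row per node: the Ready condition plus the three pressure conditions.
# The labels left of each colon are arbitrary column headers.
pressure_table() {
  kubectl get nodes -o custom-columns='NODE:.metadata.name,READY:.status.conditions[?(@.type=="Ready")].status,MEMORY:.status.conditions[?(@.type=="MemoryPressure")].status,DISK:.status.conditions[?(@.type=="DiskPressure")].status,PID:.status.conditions[?(@.type=="PIDPressure")].status'
}

# nodes_under_pressure
# Keeps only the rows where at least one pressure column reads True.
nodes_under_pressure() {
  pressure_table | awk 'NR > 1 && ($3 == "True" || $4 == "True" || $5 == "True")'
}
```

A node that shows READY as Unknown will usually show Unknown in the pressure columns too, because nobody is reporting for it. That points to the Kubelet or the network rather than to resources.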

Disk Pressure: This is the most common. The container runtime and the Kubelet are famously chatty, writing logs and layers all over the place. SSH in and run df -h. You’ll likely see the filesystem that holds /var/lib/docker or /var/lib/containerd sitting at or near 100%. You need to clean house.
```
# Get a sorted list of the layer directories eating your disk
# (on containerd nodes, point this at /var/lib/containerd instead)
sudo du -h --max-depth=1 /var/lib/docker/overlay2 | sort -rh | head -n 20
```
You might need to prune old images and containers. It feels brutal, but it works.
```
docker system prune -a -f
```
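Warning: This removes all unused images, not just the ones from a week ago, along with stopped containers, unused networks, and build cache. It’s a blunt instrument, but effective. On a containerd node there’s no docker CLI to run; crictl rmi --prune is the rough equivalent for unused images.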
Memory Pressure: The OS is starving and starts killing processes. The Kubelet might be one of them. Use free -h and top to see what’s going on.
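
Both checks above are easy to script once you’re on the node. This is a minimal sketch, not a monitoring setup: the function names are invented, and the 85 percent default is an arbitrary cut-off picked for this example.

```shell
# full_filesystems [THRESHOLD]
# Prints mount points whose usage is at or above THRESHOLD percent
# (default 85, an arbitrary choice). Uses df -P so every filesystem
# stays on a single line.
full_filesystems() {
  df -P | awk -v limit="${1:-85}" 'NR > 1 { use = $5; sub(/%/, "", use); if (use + 0 >= limit + 0) print $6 "\t" $5 }'
}

# oom_kills
# Reads kernel log lines on stdin and keeps the ones where the OOM killer
# names a victim.
oom_kills() {
  grep -i 'killed process'
}

# Usage on the node:
#   full_filesystems 90
#   sudo dmesg | oom_kills
```

If oom_kills shows the kubelet or your container runtime among the victims, you’ve connected the memory problem directly to the NotReady status.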

The Network Is Lying

Sometimes, the Kubelet is perfectly healthy and is shouting “I’m ready!” into the void. The node has resources, but the Control Plane never gets the message. This is a network partition.

From the problematic node, can you ping your control plane endpoints? Can you curl the API server?

# Find your API server's internal IP or hostname
kubectl get endpoints kubernetes
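# Then from the node, try to reach it (use the port from the endpoints output; 6443 is the usual default)
curl -k --max-time 5 https://<api-server-ip>:6443/version

Any HTTP response, even a 401 or 403, means the network path is fine and you should go back to looking at the Kubelet. If it times out or the connection is refused, you’ve got a routing or firewall issue between your node and your control plane. Time to talk to your network admin or cloud provider. It’s not a Kubernetes problem anymore; it’s an infrastructure problem.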
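
If you find yourself running that curl on node after node, it’s worth wrapping. The sketch below treats any HTTP status as proof of reachability, because the question here is the network path, not authorization. The function name, the /version path, and the five-second timeout are choices made for this example, not anything Kubernetes prescribes.

```shell
# api_reachable HOST [PORT]
# Returns 0 if anything at HOST:PORT answers HTTPS within five seconds, and
# 1 if not. Any HTTP status counts as reachable, including 401 or 403.
# Port 6443 is only this sketch's default; use the port your endpoints show.
api_reachable() {
  code="$(curl -k -s -o /dev/null --max-time 5 -w '%{http_code}' "https://$1:${2:-6443}/version")" || code=000
  if [ "$code" = "000" ]; then
    echo "unreachable: no HTTP response from $1:${2:-6443}"
    return 1
  fi
  echo "reachable: HTTP $code from $1:${2:-6443}"
}
```

Run it from the problem node and from a healthy node in the same subnet. If only the problem node fails, look at that host’s routes and firewall rules before blaming the whole network.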

The Nuclear Option: Reboot It

Look, I know. It’s the tech equivalent of “have you tried turning it off and on again?” But sometimes, after you’ve checked the Kubelet and resources, a node is just in a weird state that’s faster to fix by rebooting than by deep-diving into its existential crisis.

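Drain the node first to politely evict all your pods. This is non-negotiable. It cordons the node and evicts the pods so their controllers recreate them elsewhere. On a node whose Kubelet is truly dead the evictions can’t finish, so give drain a --timeout rather than waiting forever.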

kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

The --ignore-daemonsets flag is needed because DaemonSet pods like your CNI pod are tied to the node; evicting them is pointless, since the DaemonSet controller would put them right back. --delete-emptydir-data (called --delete-local-data before kubectl 1.20) is your acknowledgement that pods using emptyDir volumes will lose that data. After the drain, reboot the node. Once it’s back up, uncordon it to tell Kubernetes it’s open for business again.

kubectl uncordon <node-name>

It’s not elegant, but it’s often the fastest path to getting your cluster back to a fully Ready state.
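
The one step people forget after a reboot is checking that the node actually came back before uncordoning it. A small sketch of that last step follows, assuming the reboot has already been kicked off by whatever means you normally use. The function name and the 600-second default are arbitrary.

```shell
# wait_and_uncordon NODE [TIMEOUT]
# Blocks until NODE reports condition Ready=True, then makes it schedulable
# again. If the wait times out, the node is left cordoned on purpose.
wait_and_uncordon() {
  if kubectl wait --for=condition=Ready "node/$1" --timeout="${2:-600s}"; then
    kubectl uncordon "$1"
  else
    echo "node $1 did not become Ready within ${2:-600s}; leaving it cordoned" >&2
    return 1
  fi
}
```

Leaving the node cordoned on failure is deliberate. A node that still can’t reach Ready after a reboot needs the earlier sections of this chapter, not fresh pods.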