42.1 Systematic Troubleshooting Methodology
Right, let’s get this sorted. You’re staring at a CrashLoopBackOff or some other Kubernetes-induced hieroglyphic, and the panic is starting to set in. Don’t. The single biggest mistake you can make is just frantically running kubectl describe on random things, hoping for a clue. That’s like trying to fix a car engine by randomly tapping components with a hammer. You might get lucky, but you’ll probably just make it worse.
We’re going to be systematic. The goal isn’t just to fix this one problem; it’s to build a mental framework so you can fix the next ten problems without breaking a sweat. The entire system is a set of nested Russian dolls: your app runs in a container, which runs in a pod, which runs on a node, which is managed by the control plane. We start from the innermost doll and work our way out.
Start with the Pod, Not the Cluster
Your first instinct might be to check the cluster nodes or look at control plane logs. Ignore that instinct. Your problem is almost certainly in your pod. The pod is the unit of deployment; it’s where your application actually runs. This is ground zero.
Your best friend here is kubectl describe pod. This command vomits a glorious, often overwhelming, amount of information about the pod’s entire lifecycle. The key is knowing where to look. Let’s break down the output.
kubectl describe pod my-failing-app-7d98fcfc56-zxqjr -n my-namespace
Look for these sections in the output:
- Events: This is at the very bottom. It’s the chronological log of what the system has tried to do with your pod. Did it schedule it? Did it pull the image? This is your first clue. If you see Failed to pull image "my-app:latest", well, there’s your problem. It’s not a deep control plane issue; you most likely typo’d the image name or tag.
- Status: Is it Pending, Running, CrashLoopBackOff, or ImagePullBackOff? Pending often means it can’t be scheduled (look at resource requests or node affinities). CrashLoopBackOff means the container starts and then keeps dying; ImagePullBackOff means the image can’t be pulled, so the container never starts at all.
- Containers: This section shows the state and readiness of each container. The Last State and Exit Code are critical. An exit code of 0 means it shut down cleanly (weird for a crash loop). An exit code of 137 means it was killed with SIGKILL (128 + 9), often because it blew past its memory limit (look for the reason OOMKilled). Anything else is likely your application crashing.
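To make the exit-code reading above concrete, here is a minimal sketch of a helper that maps an exit code to a likely cause. The function name explain_exit_code is hypothetical, and the mapping leans on the common convention that codes above 128 mean 128 plus a signal number. Treat its output as a hint, not a diagnosis.

```shell
#!/bin/sh
# Hypothetical helper: map a container exit code to a likely cause.
# Assumes the usual convention that codes above 128 mean 128 + signal number.
explain_exit_code() {
  code="$1"
  case "$code" in
    0)   echo "clean exit" ;;
    137) echo "SIGKILL (possibly OOMKilled)" ;;
    143) echo "SIGTERM" ;;
    *)
      # Non-numeric input fails the test quietly and falls through to else.
      if [ "$code" -gt 128 ] 2>/dev/null; then
        echo "killed by signal $((code - 128))"
      else
        echo "application error"
      fi
      ;;
  esac
}

# Example (needs cluster access; pod and namespace names are placeholders):
#   code=$(kubectl get pod my-pod -n my-namespace \
#     -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}')
#   explain_exit_code "$code"
```

The jsonpath in the example reads the first container only; a multi-container pod would need the index adjusted.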
Interrogate the Container Directly
kubectl describe gives you the external view. Now you need the internal view. Use kubectl logs. The -f flag follows the logs, and the --previous flag is an absolute lifesaver if your container has already crashed and restarted.
# Get logs from the currently running container (or its attempt)
kubectl logs my-failing-app-7d98fcfc56-zxqjr -n my-namespace
# Get logs from the previous instance that crashed
kubectl logs my-failing-app-7d98fcfc56-zxqjr -n my-namespace --previous
# Follow the logs in real-time
kubectl logs -f my-failing-app-7d98fcfc56-zxqjr -n my-namespace
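In a crash loop you usually want the previous instance’s logs first and the current ones only as a fallback. One way to sketch that, assuming a hypothetical helper named crash_logs and an arbitrary 100-line tail:

```shell
#!/bin/sh
# Hypothetical helper: show the last lines from the crashed instance,
# falling back to the current instance if there is no previous one.
# Usage: crash_logs <pod> <namespace>
crash_logs() {
  # The error from the first attempt is hidden so the fallback output stays clean.
  kubectl logs "$1" -n "$2" --previous --tail=100 2>/dev/null \
    || kubectl logs "$1" -n "$2" --tail=100
}
```

Something like crash_logs my-failing-app-7d98fcfc56-zxqjr my-namespace would then cover both cases in one call.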
If your pod has multiple containers (a sidecar pattern, for instance), specify which container you want logs from with -c. Otherwise kubectl falls back to a default container, which may not be the one that’s failing. It isn’t psychic.
kubectl logs my-pod -c my-awesome-sidecar -n my-namespace
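If you can’t remember what the containers are called, the names can be read from the pod spec. A minimal sketch, where list_containers is a hypothetical helper name:

```shell
#!/bin/sh
# Hypothetical helper: print the names of a pod's regular containers,
# one per line. Init containers live under .spec.initContainers and are
# not included here.
# Usage: list_containers <pod> <namespace>
list_containers() {
  kubectl get pod "$1" -n "$2" \
    -o jsonpath='{range .spec.containers[*]}{.name}{"\n"}{end}'
}
```

If you would rather not choose at all, kubectl logs also accepts --all-containers=true.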
If logs aren’t enough (maybe your app is tight-lipped), it’s time to get interactive. kubectl exec is your ssh into the container world. This is perfect for checking whether config files were mounted correctly, whether dependencies are reachable, or whether the disk is full.
# Open an interactive shell inside the container
kubectl exec -it my-pod -n my-namespace -- /bin/sh
# Or just run a one-off command
kubectl exec my-pod -n my-namespace -- ls -la /etc/my-app/config
Pro Tip: Many slim container images don’t have /bin/bash. Get used to using /bin/sh. And if even that’s not present (looking at you, distroless images), you’re officially in “enjoy the logs” territory.
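That territory has one possible exit on clusters where ephemeral containers are available: kubectl debug can attach a throwaway container that brings its own shell. The sketch below assumes a hypothetical wrapper named debug_pod and picks busybox:1.36 purely as an example image.

```shell
#!/bin/sh
# Hypothetical wrapper around `kubectl debug` for pods whose image has no shell.
# Usage: debug_pod <pod> <namespace> <target-container>
debug_pod() {
  # --target asks for the debug container to share the process namespace of
  # the named container; this depends on container runtime support.
  kubectl debug -it "$1" -n "$2" --image=busybox:1.36 --target="$3"
}
```

The debug container sees the target’s processes, but not automatically its volume mounts, so it suits process and network checks better than config-file checks.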
Escalate to the Node
If the pod description shows it’s stuck Pending, the scheduler couldn’t find a node to put it on. Why? The pod’s Events usually name the reason. Then run kubectl describe node on a node you expected it to land on. The output is a monster, but you care about two things:
- Conditions: Is the node Ready? Or is it NotReady with a KubeletNotReady message? Maybe the node is under disk pressure (DiskPressure) or memory pressure (MemoryPressure).
- Allocatable Resources: How much CPU and RAM can the node actually hand out to pods? Compare this to the total Capacity, and to what is already requested under Allocated resources. Maybe your pod’s resource requests are too high and no node has enough room for it.
kubectl describe node ip-10-0-0-101.ec2.internal
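The request-versus-allocatable comparison is just arithmetic once both values are in the same unit. Here is a rough sketch with two hypothetical helpers, cpu_to_millicores and cpu_fits. It handles only whole-core ("2") and millicore ("500m") quantities, and it ignores whatever other pods on the node have already requested.

```shell
#!/bin/sh
# Hypothetical helper: convert a Kubernetes CPU quantity to millicores.
# Handles only the two common forms: whole cores ("2") and millicores ("500m").
cpu_to_millicores() {
  case "$1" in
    *m) echo "${1%m}" ;;
    *)  echo "$(( $1 * 1000 ))" ;;
  esac
}

# Hypothetical helper: succeed if a CPU request fits within an allocatable value.
# Usage: cpu_fits <request> <allocatable>
cpu_fits() {
  [ "$(cpu_to_millicores "$1")" -le "$(cpu_to_millicores "$2")" ]
}

# Example (needs cluster access; the node name is a placeholder):
#   alloc=$(kubectl get node ip-10-0-0-101.ec2.internal \
#     -o jsonpath='{.status.allocatable.cpu}')
#   cpu_fits 500m "$alloc" && echo "request fits" || echo "request too large"
```

A request that fails this check against every node can never be scheduled, no matter how idle the cluster looks.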
Sometimes, you need to see what the kubelet (the node agent) is seeing. You can’t ssh to production nodes? You can use kubectl debug to start a temporary troubleshooting pod on the node. This is a bit of black magic, but incredibly powerful.
# Starts a debug pod on the node that shares the host's PID and network namespaces
kubectl debug node/ip-10-0-0-101.ec2.internal -it --image=nicolaka/netshoot
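This drops you into a shell in a pod that sees the node’s processes and network, with the node’s root filesystem mounted at /host. From there you can run top or ss, and chroot /host to reach node-installed tools such as crictl (or docker on older nodes). Some commands, iptables among them, may need more privileges than the debug pod has by default. Delete the debug pod when you’re done; it doesn’t clean itself up.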
Finally, Consider the Control Plane
If you’ve made it this far and everything on the node looks fine, but your pods still aren’t scheduling or API calls are failing, it’s time to look at the control plane components. This is rare for application devs, but crucial for platform folks.
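First, check if the control plane pods are even running. On self-managed clusters (kubeadm, for example), they’re just pods in the kube-system namespace. On managed services such as EKS or GKE the provider hides them, and you’ll have to lean on the provider’s own control plane logging instead.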
kubectl get pods -n kube-system
Are the API server, scheduler, and controller manager all Running? If not, you’ve found your problem. If they are, check their logs. The patterns are the same as before, just for system components.
# Get logs for the kube-scheduler pod
kubectl logs -l component=kube-scheduler -n kube-system --tail=50
# Get logs for the kube-apiserver pod (usually multiple, so pick one)
kubectl logs kube-apiserver-ip-10-0-0-100 -n kube-system -c kube-apiserver
The control plane’s job is to maintain state. If something is deeply wrong, you might need to check the state of the cluster itself with kubectl get events --all-namespaces --watch. This gives you a global event stream, which can reveal wider issues like persistent volume provisioners failing or certificate expiries.
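The global event stream is noisy, so it often helps to narrow it to warnings. A minimal sketch, assuming a hypothetical helper named warning_events:

```shell
#!/bin/sh
# Hypothetical helper: list Warning events across all namespaces,
# sorted by last-seen time so the most recent ones end up at the bottom.
warning_events() {
  kubectl get events --all-namespaces \
    --field-selector type=Warning \
    --sort-by=.lastTimestamp
}
```

Keep in mind that events are retained only for a limited time (an hour by default on many clusters), so an empty list doesn’t prove nothing went wrong earlier.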
The golden rule is this: move from the specific to the general. Your app is specific. The control plane is general. The vast majority of your problems will be solved by the first two steps. This methodology isn’t just a checklist; it’s a way of thinking. It forces you to gather evidence at each layer before moving to the next, eliminating guesswork and saving you a truly absurd amount of time. Now go put out that fire.