42.2 Pod Not Starting: Pending, CrashLoopBackOff, ImagePullBackOff

Alright, let’s get our hands dirty. Your pod isn’t starting. It’s just sitting there, mocking you with a status like Pending, CrashLoopBackOff, or ImagePullBackOff. This isn’t a failure; it’s the cluster’s way of sending you a strongly worded letter explaining exactly what you did wrong. Your job is to learn how to read it.

First, the golden rule: always start with kubectl describe. Your kubectl get pods output is the headline; kubectl describe is the full investigative report. If you don’t do this first, I can’t help you. It’s like calling a mechanic and saying “my car is broken” but refusing to pop the hood.

kubectl describe pod <your-pod-name> -n <namespace>

Scan the Events: section at the bottom. It almost always points to the culprit. Now, let’s break down what those specific statuses actually mean.
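If the describe output is long, the events can also be pulled on their own. What follows is a minimal sketch: the function name pod_events is made up for illustration, and it assumes kubectl is already pointed at the right cluster.

```shell
# pod_events POD [NAMESPACE]
# Hypothetical helper: list only the events that refer to one pod, oldest first.
pod_events() {
  pe_pod="$1"
  pe_ns="${2:-default}"   # assumption: fall back to the "default" namespace
  kubectl get events \
    --namespace "$pe_ns" \
    --field-selector "involvedObject.name=$pe_pod" \
    --sort-by=.metadata.creationTimestamp
}

# Usage (needs a live cluster):
#   pod_events my-app-pod my-namespace
```

Events are only kept for a limited time, so an empty list on an old pod does not prove that nothing went wrong.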

ImagePullBackOff: “I Can’t Find Your App, Man”

This one is brutally straightforward. Kubernetes is telling you, “I tried to fetch the container image you specified and failed spectacularly.” The describe command will tell you exactly why. Common reasons:

The image doesn’t exist. You typo’d the name (my-app:v1 vs. myapp:v1), the tag doesn’t exist (v1 vs. latest), or you forgot to push it to the registry. It happens to the best of us.
You need credentials. The image is in a private registry (a private repository on Docker Hub, GCR, ECR, Quay) and you haven’t told Kubernetes how to authenticate. This is where you need an image pull secret, created in the same namespace as the pod.

# First, create the secret in the same namespace as the pod
# (do this once, or via a CI/CD pipeline).
# The server below is Docker Hub; use your own registry's address otherwise.
kubectl create secret docker-registry my-registry-key \
  --namespace=<namespace> \
  --docker-server=https://index.docker.io/v1/ \
  --docker-username=your-username \
  --docker-password=your-password-or-token \
  --docker-email=your-email

# Then, reference it in your Pod spec (or, better, in your Deployment's pod template)
apiVersion: v1
kind: Pod
metadata:
  name: my-app-pod
spec:
  containers:
  - name: my-app
    image: my-private-registry/your-image:tag
  imagePullSecrets:
  - name: my-registry-key
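A pull secret that sits in the wrong namespace, or was created as the wrong kind of secret, fails the same way as having no secret at all. The sketch below checks both. The function name check_pull_secret is hypothetical, and it only recognizes the secret type that kubectl create secret docker-registry produces, not the older kubernetes.io/dockercfg type.

```shell
# check_pull_secret SECRET [NAMESPACE]
# Hypothetical helper: succeed only if the secret exists in that namespace
# and has the type that "kubectl create secret docker-registry" produces.
check_pull_secret() {
  cps_name="$1"
  cps_ns="${2:-default}"   # assumption: fall back to the "default" namespace
  cps_type=$(kubectl get secret "$cps_name" --namespace "$cps_ns" \
    -o jsonpath='{.type}' 2>/dev/null) || cps_type=""
  if [ "$cps_type" = "kubernetes.io/dockerconfigjson" ]; then
    echo "ok: $cps_name is a registry secret in namespace $cps_ns"
  else
    echo "problem: $cps_name is missing from $cps_ns or has type '${cps_type:-none}'"
    return 1
  fi
}

# Usage (needs a live cluster):
#   check_pull_secret my-registry-key my-namespace
```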

Best practice: Never use :latest in production. It’s a non-deterministic nightmare. Use specific tags that you never reuse (e.g., a git commit SHA or v1.2.3-abc1234), or pin the image by digest if you want a hard guarantee, since a tag can be re-pushed. This avoids the “it worked on my machine!” paradox because you know you’re running the same artifact everywhere.
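To make the rule concrete, here is a small sketch that sorts an image reference into "digest", "tag", or "floating" (meaning :latest or no tag at all, which implies :latest). The function name image_pinning and the three labels are invented for this example. It does plain string matching and does not contact any registry.

```shell
# image_pinning IMAGE_REF
# Hypothetical helper: classify how firmly an image reference is pinned.
image_pinning() {
  ip_ref="$1"
  case "$ip_ref" in
    *@sha256:*) echo "digest"; return 0 ;;   # pinned to exact content
  esac
  # Drop everything up to the last "/", so a registry port such as
  # localhost:5000 is not mistaken for a tag.
  ip_name="${ip_ref##*/}"
  case "$ip_name" in
    *:latest) echo "floating" ;;
    *:*)      echo "tag" ;;
    *)        echo "floating" ;;   # no tag means :latest is implied
  esac
}

image_pinning my-app:latest            # prints: floating
image_pinning my-app:v1.2.3-abc1234    # prints: tag
```

A check like this could run in CI against the image field of a manifest before it is applied.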

Pending: “The Cluster is All Out of Real Estate”

A pod stuck in Pending usually means the scheduler can’t find a suitable node to run it on. It’s not a pod problem yet; it’s a cluster capacity or policy problem. The describe output is crucial here: look for a FailedScheduling event, which spells out why each node was rejected. Key things to check:

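Insufficient resources: You’re asking for more CPU or memory than any node has left to give. The scheduler only counts requests, not limits, and it compares them against what each node still has unreserved, not against actual usage. Check your resource requests.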
Taints and Tolerations: A node might have a taint (e.g., dedicated=special-user:NoSchedule) that repels all pods unless they have a matching toleration. Your pod needs the right “key” to get in.
Node Selectors / Node Affinity: You might have constrained your pod to only run on nodes with specific labels (disktype=ssd), but no node has that label.
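The three checks above boil down to comparing what the pod asks for with what the nodes offer. The sketch below prints both sides next to each other. The function name pending_report is hypothetical. The CPU and memory columns show each node's total allocatable capacity, not what is still free; for the amount already reserved, kubectl describe node has an "Allocated resources" section.

```shell
# pending_report POD [NAMESPACE]
# Hypothetical helper: show the pod's scheduling constraints, then the nodes.
pending_report() {
  pr_pod="$1"
  pr_ns="${2:-default}"   # assumption: fall back to the "default" namespace

  echo "== What the pod asks for (requests, node selector, tolerations) =="
  kubectl get pod "$pr_pod" --namespace "$pr_ns" \
    -o jsonpath='{.spec.containers[*].resources.requests}{"\n"}{.spec.nodeSelector}{"\n"}{.spec.tolerations}{"\n"}'

  echo "== What each node offers (allocatable CPU and memory, taint keys) =="
  kubectl get nodes \
    -o custom-columns='NAME:.metadata.name,CPU:.status.allocatable.cpu,MEMORY:.status.allocatable.memory,TAINTS:.spec.taints[*].key'

  echo "== Node labels, to compare against the node selector =="
  kubectl get nodes --show-labels
}

# Usage (needs a live cluster):
#   pending_report my-app-pod my-namespace
```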

CrashLoopBackOff: “It Starts… And Then It Dies. A Lot.”

This is the most entertaining one because the problem is usually inside the container itself. Kubernetes successfully started your process, but it exited almost immediately. Kubernetes, being a patient and optimistic system, keeps restarting it, backing off exponentially each time (10s, 20s, 40s, and so on, capped at five minutes by default) to avoid hammering the node.

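The describe command shows the exit code under Last State. Exit code 0 means the process ended cleanly, which is odd for a service that’s supposed to keep running (usually it means the container’s command finished instead of staying in the foreground). Anything else (1, 137, and so on) is a failure. Your next move is to look at the logs of the previous run.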

# Get logs from the previous (crashed) instance of the container
kubectl logs <pod-name> --previous

# Or, if you want to tail the logs and watch it crash in real-time
kubectl logs <pod-name> -f
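When the logs come back empty, the recorded termination state is the next place to look. This sketch pulls the reason and exit code of the previous run straight from the pod status. The function name last_exit is hypothetical, and it only looks at the first container, so it assumes a single-container pod.

```shell
# last_exit POD [NAMESPACE]
# Hypothetical helper: print "REASON EXITCODE" for the previous run of the
# first container in the pod (assumption: the pod has a single container).
last_exit() {
  le_pod="$1"
  le_ns="${2:-default}"   # assumption: fall back to the "default" namespace
  kubectl get pod "$le_pod" --namespace "$le_ns" \
    -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}{" "}{.status.containerStatuses[0].lastState.terminated.exitCode}{"\n"}'
}

# Usage (needs a live cluster):
#   last_exit my-app-pod my-namespace
```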

Why does it crash?

The application crashed. Check your app’s configuration, its connection to databases, or its own logs. A missing config file or environment variable is a classic.
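A failed liveness probe. Your probe is so strict that it fails on a healthy app. When a liveness probe fails, the kubelet kills and restarts the container. (A failing readiness probe only takes the pod out of Service endpoints; it never restarts anything.) If your app takes 30 seconds to start up but your liveness probe starts checking after 5 seconds, you’ve built a self-destruct machine. Raise initialDelaySeconds or add a startup probe.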
Out of Memory (OOM) Kill. Your container went over its memory limit and the kernel’s OOM killer terminated it. This usually shows up as exit code 137 (128 + 9, i.e. SIGKILL), and kubectl describe pod reports Reason: OOMKilled under Last State. Check your memory limit; it might be too low.
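Exit codes above 128 conventionally mean "128 plus the number of the signal that ended the process", which is how 137 maps to SIGKILL. The sketch below applies that convention. The function name explain_exit_code and its wording are invented for illustration, and since applications are free to pick their own exit codes, treat the output as a hint rather than a verdict.

```shell
# explain_exit_code CODE
# Hypothetical helper: turn a container exit code into a first guess.
explain_exit_code() {
  eec_code="$1"
  if [ "$eec_code" -eq 0 ]; then
    echo "exit 0: the process ended on its own without reporting an error"
  elif [ "$eec_code" -gt 128 ]; then
    eec_sig=$((eec_code - 128))   # convention: 128 + signal number
    case "$eec_sig" in
      9)  echo "exit $eec_code: killed by signal 9 (SIGKILL), typical of an OOM kill" ;;
      15) echo "exit $eec_code: stopped by signal 15 (SIGTERM)" ;;
      *)  echo "exit $eec_code: terminated by signal $eec_sig" ;;
    esac
  else
    echo "exit $eec_code: the application itself reported a failure"
  fi
}

explain_exit_code 137
# prints: exit 137: killed by signal 9 (SIGKILL), typical of an OOM kill
```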

The Pro-Tip: For really tricky CrashLoopBackOff scenarios where the logs aren’t enough, start a throwaway pod from the same image with a shell in place of the normal startup command, so you can get inside and debug interactively.

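# --command makes the shell replace the image's entrypoint; swap sh for bash if the image ships it
kubectl run -it --rm debug-pod --image=your-broken-image:tag --restart=Never --command -- sh
# Now you’re inside! Try running your startup command manually, check files, etc.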
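A throwaway pod built from the bare image lacks the environment variables, volumes, and config that the real pod has, and those are often where the crash comes from. As an alternative, kubectl debug can copy the crashing pod and swap its command for a shell. The sketch below wraps that call. The function name debug_copy and the "-debug" suffix are arbitrary choices, and it assumes a kubectl version recent enough to include kubectl debug and an image that contains sh.

```shell
# debug_copy POD CONTAINER [NAMESPACE]
# Hypothetical helper: copy a crashing pod, replace the named container's
# command with a shell, and attach to it.
debug_copy() {
  dc_pod="$1"
  dc_container="$2"
  dc_ns="${3:-default}"   # assumption: fall back to the "default" namespace
  kubectl debug "$dc_pod" --namespace "$dc_ns" -it \
    --copy-to="${dc_pod}-debug" \
    --container="$dc_container" \
    -- sh
}

# Usage (needs a live cluster):
#   debug_copy my-app-pod my-app my-namespace
# The copy is a regular pod, so remove it when you are done:
#   kubectl delete pod my-app-pod-debug --namespace my-namespace
```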

Remember, Kubernetes is brutally logical. These errors aren’t mysteries; they are precise diagnostics. Your job is to listen.