19.7 Pod Priority and Preemption

Right, let’s talk about Pod Priority and Preemption. This is where Kubernetes stops being polite and starts getting real. Up until now, we’ve mostly talked about resource requests and limits as a way for the scheduler to make informed decisions. But with priority, we’re giving it a direct command: “This pod is more important than that one. Act accordingly.”

Think of it like this: your cluster is a lifeboat. There’s only so much room (CPU and Memory). If a new, critically important person needs to get on (a high-priority pod), and there’s no space, someone else might have to… unceremoniously take a swim (get preempted). It’s brutal, but for many workloads (like system-critical services, or a production web server that must not be blocked by a CI/CD build job), it’s absolutely essential.

How Priority and Preemption Actually Work

It’s not magic; it’s two pieces working in tandem: the PriorityClass resource and the priorityClassName field in your Pod spec.

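First, you define a PriorityClass. This isn’t namespaced; it’s a cluster-wide resource that sets the ground rules. The value is the key: it’s a 32-bit integer, and higher numbers win. Your own classes can go up to 1,000,000,000; anything above that is reserved for the built-in system-cluster-critical and system-node-critical classes. The globalDefault is a sneaky one. Set it to true on one class, and every pod created afterwards without a priorityClassName will get this priority (pods that already exist keep what they have). Try to do this on more than one? You won’t get the chance to create a configuration paradox: the API server rejects a second class with globalDefault set to true. If no class is marked as the default, pods without a priorityClassName get priority 0. Best practice: leave globalDefault set to false (the default) unless you have a very specific, cluster-wide reason to change it.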

Here’s how you define a couple of classes. Naming them something clear like "high-priority" is infinitely better than "priority-1000000".

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
globalDefault: false
description: "This priority class should be used for important service pods."
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority
value: 1000
globalDefault: false
description: "Use this for batch jobs or dev workloads that can be preempted."

Now, to use it, you simply slap the class name into your Pod spec. This is where the rubber meets the road.

apiVersion: v1
kind: Pod
metadata:
  name: important-api-server
spec:
  containers:
  - name: api-server
    image: my-company/important-api:latest
    resources:
      requests:
        memory: "512Mi"
        cpu: "500m"
  priorityClassName: high-priority # This is the crucial line

When a pod is created with a priorityClassName, the API server’s admission step looks up the class and writes its integer value into the pod’s spec.priority field, so the scheduler knows exactly where the pod stands. If no node has room, the scheduler doesn’t just give up. It checks whether removing pods with a lower priority from some node would free up enough space for this new, more important pod. If it finds candidates, it kicks them out. This is preemption.
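To make the lookup concrete, here is roughly what the relevant part of the Pod from the example above looks like once it has been admitted, for instance in the output of kubectl get pod important-api-server -o yaml. This is a trimmed sketch, not full output: the priority and preemptionPolicy fields are filled in by the control plane from the PriorityClass, and PreemptLowerPriority is the default policy when the class does not set one.

```yaml
# Trimmed view of the stored Pod spec (all other fields omitted).
spec:
  priorityClassName: high-priority
  priority: 1000000                       # copied from the PriorityClass value at admission
  preemptionPolicy: PreemptLowerPriority  # default when the class does not specify one
```

The scheduler compares the integer in priority, not the class name, so renaming or deleting a class later does not change pods that already exist.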

The Gory Details of Preemption

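Preemption isn’t a polite request, but it isn’t an instant kill either. The scheduler picks victims (lower-priority pods) on one node and deletes them through the API; the kubelet then terminates them, and each victim still gets its graceful termination period. In the meantime, the pending pod’s status.nominatedNodeName is set to the node being cleared. Crucially, the higher-priority pod does not start until enough victims are gone and their resources are freed, and the nomination is not a guarantee: if another node frees up first, or an even higher-priority pod shows up, the scheduler may place things differently. This isn’t a zero-sum game; it’s a negative-sum game for a moment because you’re killing running workloads.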
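While the victims are shutting down, the waiting pod carries a hint about where the scheduler intends to put it. A trimmed sketch of what that can look like in the pod’s status is shown below; the node name worker-2 is a made-up value for illustration.

```yaml
# Trimmed status of a pending pod that has triggered preemption.
status:
  phase: Pending
  nominatedNodeName: worker-2   # hypothetical node; set by the scheduler, not a binding guarantee
```

If a pod sits in Pending with nominatedNodeName set, it is usually waiting for victims with long termination grace periods to finish exiting.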

The designers, in a rare moment of foresight, added some safeguards so you don’t completely nuke your cluster:

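The scheduler tries to respect PodDisruptionBudgets when picking victims, but only on a best-effort basis. It prefers victims whose PDB would not be violated; if it can’t find any, it preempts anyway. A PDB lowers the odds of being preempted. It is not a shield.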
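Namespaces offer no protection at all. Victims are chosen by priority, not by who owns them, so one team’s high-priority pod can happily evict another team’s workload. If that worries you, restrict which namespaces may use a given PriorityClass with a ResourceQuota scoped to that class.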
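It only preempts when doing so actually helps. Victims must have a strictly lower priority than the pending pod, removing them must make the pending pod schedulable on that node, and the scheduler tries to evict no more pods than it needs to. It’s a calculated move.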
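For reference, a PodDisruptionBudget of the kind the scheduler consults when choosing victims might look like the sketch below. The name, the app label, and the minAvailable value are assumptions for illustration, not values from any real cluster. Keep in mind that during preemption the scheduler treats a PDB as a preference rather than a hard guarantee.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: batch-workers-pdb      # hypothetical name
spec:
  minAvailable: 2              # try to keep at least two matching pods running
  selector:
    matchLabels:
      app: batch-workers       # hypothetical label on the low-priority pods
```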

Common Pitfalls and How to Avoid Them

This power comes with immense responsibility, and it’s easy to shoot yourself in the foot.

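The Silent Default Problem: As mentioned, be hyper-careful with globalDefault. Pods without a class already get priority 0, so setting it on a class with a value of 0 changes nothing. Setting it on a class with a high value is the dangerous case: every pod created without a priorityClassName is silently promoted, your “important” pods are no longer special, and preemption between them stops because everything is now equal. I recommend never using it. Be explicit.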
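The Resource Request Black Hole: Scheduling and preemption reason about resource requests, not actual usage. A pod with no resource requests is treated as requesting 0 CPU/Memory when the scheduler checks whether it fits, so it squeezes onto almost any node and rarely triggers preemption at all. That sounds convenient until the node runs hot: the pod has no reserved capacity, it lands in the BestEffort QoS class, and its high priority reserves nothing for it. The reverse also bites: preempting victims with tiny requests frees very little in the scheduler’s accounting, however much they were really using. Your high-priority pod must have sensible requests, or it will end up starved, or languishing in Pending hell with no obvious error.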
The Zombie Pod Problem: When a pod is preempted, the story doesn’t end with its deletion. If it’s managed by a controller (like a Deployment), that controller sees a missing replica and immediately creates a replacement pod. This new pod has the same low priority. So now you have your high-priority pod running, and the system is also trying to restart the low-priority one you just killed. If there still aren’t enough resources, the new low-priority pod will get stuck Pending. You haven’t solved a resource shortage; you’ve just reshuffled it and created a noisy neighbor. You need mechanisms like quiescing your low-priority workloads, or a preemptionPolicy of Never on classes that should jump the queue without evicting anyone, to handle this gracefully.
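The “Why Did My Pod Get Killed?!” Mystery: Without proper visibility, a developer will see their pod mysteriously die and have no idea why. The event log is your friend. A kubectl describe pod <preempted-pod> will show a Preempted event with a message like Preempted by pod <your-high-priority-pod> on node <node-name> (the exact wording varies by Kubernetes version). The catch is that the victim is deleted soon afterwards, so kubectl get events --field-selector reason=Preempted in the affected namespace is often the more reliable place to look. Tag your priority classes clearly and educate your users. Transparency is key when you’re playing musical chairs with people’s workloads.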
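One way to soften the Zombie Pod Problem above is a non-preempting class. The sketch below is an illustration, not a recommendation for specific values: the class name and the value 900000 are assumptions. Pods using it are placed ahead of lower-priority pods in the scheduling queue, but they wait for capacity instead of evicting anything that is already running.

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-nonpreempting   # hypothetical name
value: 900000                         # illustrative value, below high-priority
preemptionPolicy: Never               # jump the queue, but never evict running pods
globalDefault: false
description: "Scheduled ahead of lower-priority pods without preempting them."
```

Pods in a non-preempting class can still be preempted themselves by pods with a higher priority, so this controls what a pod does to others, not what others can do to it.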