35.3 Priority and Preemption: Evicting Lower-Priority Pods
Right, so you’ve told your Pods where they can’t run with Taints and Tolerations. Now let’s talk about how you tell the scheduler which Pods should run first, and more importantly, which ones are so important they can kick others out of the way. This is Priority and Preemption, and it’s Kubernetes’ way of saying, “This request is more important than yours, and I’m not sorry about it.”
Think of it like airport security. Most of us wait in the general queue (the standard scheduler flow). But if a pilot or a high-status frequent flyer rocks up, they get to jump the line (higher priority). And if the priority lane is absolutely full? Well, security might just ask a few people from the general queue to step aside to make room (preemption). It’s efficient, but it’s also brutal and can be deeply disruptive if you’re the one getting evicted.
How Priority Classes Work: It’s Not Just a Number
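You don’t just slap a priority: 100 label on your Pod and call it a day. Oh no, that would be too simple. Instead, you have to define a PriorityClass object first. This is a small but crucial layer of indirection. It lets you name your priorities (Kubernetes itself ships system-cluster-critical and system-node-critical) and manage the name-to-value mapping in one place instead of hard-coding numbers into every Pod definition. One caveat: the value of an existing PriorityClass is immutable, so changing it means deleting and recreating the class, and Pods that already exist keep the priority they were admitted with. It also allows you to set a global default, which is smarter than you might think.
Here’s what a PriorityClass looks like. The value is a 32-bit integer, but for classes you define yourself it must be no larger than 1,000,000,000 (one billion); anything above that is reserved for the built-in system classes. Higher numbers mean, well, higher priority. The globalDefault field is a neat trick: if set to true on a single PriorityClass, any Pod that doesn’t specify a priorityClassName will get this value. It only applies to Pods created after the class exists; Pods that are already there are left alone. If no class is marked as the global default, Pods without a priorityClassName get a priority of zero. You can only have one of these set to true at a time, for hopefully obvious reasons.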
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-nonpreempting
value: 1000000
globalDefault: false
preemptionPolicy: Never # This is a key twist, more on this later.
description: "This pod is high priority but will not preempt others."
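For contrast, here’s a sketch of what a cluster-wide default class might look like. The name default-priority and the value 1000 are made up for illustration; pick whatever fits your own tiers.

```yaml
# Hypothetical default class: the name and value are illustrative assumptions.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: default-priority
value: 1000
globalDefault: true   # Only one PriorityClass in the cluster may set this to true.
description: "Fallback priority for Pods that don't set priorityClassName."
```

With something like this in place, every new Pod that omits priorityClassName lands at 1000 instead of zero, which leaves you room to define classes below the default as well as above it.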
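Once you have a PriorityClass, you reference it by name in your Pod spec. When the Pod is created, the priority admission controller resolves that name into an integer priority on the Pod, and rejects the Pod outright if the class doesn’t exist. The scheduler then uses that integer to order its queue of pending Pods, so higher-priority Pods get considered for a node first.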
apiVersion: v1
kind: Pod
metadata:
  name: important-pod
spec:
  containers:
  - name: nginx
    image: nginx
  priorityClassName: high-priority-nonpreempting
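If you read the Pod back from the API server after creating it (for example with kubectl get pod important-pod -o yaml), you should see the resolved fields sitting next to the class name, because admission fills them in from the PriorityClass. The excerpt below is a trimmed sketch of what to expect, not verbatim output:

```yaml
# Trimmed excerpt of the admitted Pod; unrelated fields are omitted.
spec:
  priorityClassName: high-priority-nonpreempting
  priority: 1000000         # Filled in from the PriorityClass value at admission.
  preemptionPolicy: Never   # Filled in from the PriorityClass as well.
```

This is why editing or deleting a PriorityClass later doesn’t ripple through to existing Pods: they carry their own copy of the number.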
The Dark Art of Preemption: Controlled Chaos
Here’s where things get spicy. Let’s say a new Pod with a priority of 1,000,000 comes along, but all your nodes are full. The scheduler doesn’t just say, “Welp, sorry, champ.” It enters the preemption flow.
It scans the nodes, looking for ones where it could run the new high-priority Pod if it removed some lower-priority Pods. It’s not a mindless bulldozer; it has rules. It will never evict Pods with a higher or equal priority, it prefers the node whose victims have the lowest priority, it tries to evict as few Pods as possible, and it tries to respect PodDisruptionBudgets, though only on a best-effort basis. Namespaces play no part in the decision. It’s a calculated sacrifice.
The moment it identifies a candidate node and a set of victims, it records that node in the pending Pod’s status.nominatedNodeName and deletes the victims. This is an ordinary graceful deletion, so each victim gets its terminationGracePeriodSeconds to shut down. This is crucial: it’s not a kill -9 on the process. This is the scheduler politely but firmly showing them the door. The flip side is that the high-priority Pod has to wait for the victims to finish leaving, and the nomination is a reservation, not a guarantee that it ends up on that node.
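One lever you have on the victim side is a PodDisruptionBudget. The scheduler tries to choose victims without violating PDBs, but treat that as best effort: if there’s no other way to make room, it can still preempt Pods that a PDB covers. A minimal sketch, assuming a hypothetical workload labeled app: report-worker:

```yaml
# Hypothetical PDB: the name, label, and minAvailable value are illustrative.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: report-worker-pdb
spec:
  minAvailable: 2           # Ask to keep at least two matching Pods running.
  selector:
    matchLabels:
      app: report-worker
```

Think of it as a strong hint to the scheduler rather than a shield.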
The Gotchas: Because Of Course There Are
This system is powerful, but it’s not magic. The preemption process is complex and can come up empty. If the scheduler can’t find a node where removing lower-priority Pods would make room, your high-priority Pod will be stuck in Pending hell. It isn’t silent about it, and it isn’t a one-time check: the Pod goes back into the scheduling queue and gets retried as the cluster changes, and the reason shows up in the FailedScheduling events you see with kubectl describe pod. But if nothing changes, nothing gets scheduled.
Also, think about the victims. Getting preempted isn’t a graceful shutdown initiated by your application. The scheduler deletes the Pod, the containers get a SIGTERM, and the clock starts on the grace period. If your app has state or is in the middle of a critical transaction, it might not handle this gracefully. You’re trading the stability of lower-priority workloads for the scheduling of higher-priority ones. This is a business decision, not just a technical one.
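If a workload is a likely victim, give it a fighting chance to shut down cleanly. The sketch below assumes a hypothetical class named low-priority-batch, a hypothetical image, and a hypothetical checkpoint script; the parts that matter are terminationGracePeriodSeconds and the preStop hook.

```yaml
# Hypothetical low-priority Pod; class name, image, and script are assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: batch-worker
spec:
  priorityClassName: low-priority-batch
  terminationGracePeriodSeconds: 60   # Total budget for preStop plus SIGTERM handling.
  containers:
  - name: worker
    image: registry.example.com/batch-worker:1.0
    lifecycle:
      preStop:
        exec:
          # Runs before the container receives SIGTERM.
          command: ["/bin/sh", "-c", "/app/checkpoint.sh"]
```

The grace period covers the hook and the SIGTERM handling together, so size it for the slowest realistic checkpoint, not the average one.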
PreemptionPolicy: A Much-Needed Safety Valve
The preemptionPolicy field, which defaults to PreemptLowerPriority, is your best friend for controlling this chaos. Sometimes you have a Pod that is important but not urgent. You want it to be scheduled before other new Pods, but you absolutely do not want it to disrupt running workloads.
This is where you set preemptionPolicy: Never. A Pod with this setting will jump the line for any new scheduling decisions but will never, ever evict an already-running Pod. It’s the “I’ll wait for the next available spot” option. Use this for important but non-critical batch jobs or services that can afford to wait a minute for a node to free up naturally. It gives you the benefits of priority without the destructive side effects of preemption. One thing it doesn’t do is protect the Pod itself: a non-preempting Pod can still be preempted by something with a higher priority. It’s one of the most thoughtful design choices in this entire system, and you should use it liberally once you understand the implications.
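As a rough illustration, here’s how a batch Job might use the non-preempting class defined earlier. The Job name and image are hypothetical; the only line doing real work for this discussion is priorityClassName in the Pod template.

```yaml
# Hypothetical Job; the name and image are illustrative assumptions.
apiVersion: batch/v1
kind: Job
metadata:
  name: nightly-report
spec:
  template:
    spec:
      priorityClassName: high-priority-nonpreempting   # Queue-jumps, never evicts.
      restartPolicy: Never
      containers:
      - name: report
        image: registry.example.com/nightly-report:1.0
```

The Job’s Pods get scheduled ahead of lower-priority pending Pods as soon as capacity appears, but they won’t push anything off a node to get it.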