35.4 Descheduler: Rebalancing Running Pods
Right, so you’ve got your cluster humming along. Pods are scheduled, your nodes are looking busy, and everything seems… fine. But fine isn’t perfect. Over time, your pristine cluster can start to look like my garage after a long weekend project: stuff ends up in weird places for reasons that made sense at the time but are utterly baffling in the cold light of day. A node might be running at 90% memory while its neighbor is practically napping. You might have evicted a pod from a spotty node, but its replacement got scheduled right back onto the same faulty machine. This is where the Descheduler comes in. Think of it not as a failure of the main scheduler, but as its janitorial crew, working the night shift to clean up the messes that inevitably accumulate during the day.
The core scheduler’s job is to find a valid node for a pod right now. It makes a one-time decision and does not keep optimizing for balance afterward. The Descheduler, on the other hand, runs periodically, looks at the current state of the cluster, and proactively evicts pods that violate certain policies. The key word here is evicts. It doesn’t just delete them. It goes through the Eviction API, so Pod Disruption Budgets and graceful termination are honored. The pod’s owning controller (e.g., a Deployment or StatefulSet) then notices the missing replica and creates a replacement. The Descheduler itself never places anything; the main scheduler gets another crack at placing that new pod, hopefully onto a better node. It’s a forced do-over.
Why You’d Actually Use This Thing
You don’t run the Descheduler because it sounds cool. You run it to solve specific, tangible problems. The main scenarios are:
- After a node drain: You cordon and drain a node for maintenance. The pods get rescheduled elsewhere, which is great. But when you add the node back to the cluster, it’s empty. The scheduler only places new pods, so the ones already running elsewhere stay put and the returning node sits idle until enough new pods happen to arrive. The Descheduler’s LowNodeUtilization policy can spot this underutilized node and evict pods from overutilized nodes so they can be rescheduled onto the free capacity.
- Fixing sticky mistakes: You might have pods running on nodes that no longer fit their constraints. Maybe a node’s label changed, or a pod’s affinity/anti-affinity rules were updated. The main scheduler won’t retroactively fix this. The Descheduler’s RemovePodsViolatingInterPodAntiAffinity or RemovePodsViolatingNodeAffinity policies will find these violators and give them the boot.
- Spreading out risk: You have a critical Deployment with five replicas, and through a series of unfortunate events, the scheduler has plonked all five onto the same node. If that node goes down, your service is toast. The Descheduler’s RemoveDuplicates policy can evict duplicate pods from the same node, forcing them to be rescheduled onto different ones, thus respecting the spirit of high availability.
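The node-affinity cleanup in the second scenario needs one extra parameter that tells the Descheduler which kind of affinity to check. Here is a minimal sketch of that strategy, assuming the same descheduler/v1alpha1 policy format used elsewhere in this section. Treat it as a fragment to merge into your own policy rather than a complete file:

```yaml
# Fragment of a DeschedulerPolicy (v1alpha1 format); merge under your existing "strategies:" key.
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RemovePodsViolatingNodeAffinity":
    enabled: true
    params:
      # Only "required" affinity is checked here: a pod is a violator if its
      # current node no longer satisfies the rule it was scheduled under.
      nodeAffinityType:
        - "requiredDuringSchedulingIgnoredDuringExecution"
```

A pod is only worth evicting under this strategy if some other node can actually satisfy its affinity; otherwise the replacement would have nowhere better to go.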
How to Run It (Without Blowing Your Foot Off)
You typically run the Descheduler not as a standard Deployment, but as a Job or a CronJob (the example here puts it in kube-system). This is profoundly important. Running it as a never-ending Deployment with a short interval is a fantastic way to create a continuous loop of pod evictions and rescheduling, which is the kind of “optimization” that gets you paged at 3 AM. You run it on a schedule, like once an hour, to clean up intermittently.
Here’s a sensible example of a CronJob manifest. Note the --v verbosity flag and the policy ConfigMap volume:
# descheduler-cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: descheduler
  namespace: kube-system
spec:
  schedule: "0 * * * *" # Run at the top of every hour
  jobTemplate:
    spec:
      template:
        spec:
          # Assumed name: a ServiceAccount bound to RBAC rules that allow
          # listing nodes/pods and creating evictions must already exist.
          serviceAccountName: descheduler-sa
          restartPolicy: Never
          containers:
            - name: descheduler
              image: registry.k8s.io/descheduler/descheduler:v0.26.1
              volumeMounts:
                - name: policy-volume
                  mountPath: /policy-dir
              command:
                - /bin/descheduler
                - --v
                - "3" # Info level logging, be verbose but not insane
                - --policy-config-file
                - /policy-dir/policy.yaml
                # No --descheduling-interval here: without it the process
                # makes one pass and exits, so the Job can complete.
          volumes:
            - name: policy-volume
              configMap:
                name: descheduler-policy
And here’s a corresponding policy.yaml that you’d mount via the ConfigMap. This is where you tell it what to actually do.
# descheduler-policy.yaml
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "LowNodeUtilization":
    enabled: true
    params:
      nodeResourceUtilizationThresholds:
        thresholds:
          "cpu": 20
          "memory": 20
          "pods": 20
        targetThresholds:
          "cpu": 80
          "memory": 80
          "pods": 80
  "RemovePodsViolatingInterPodAntiAffinity":
    enabled: true
  "RemoveDuplicates":
    enabled: true
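To make the “mount via the ConfigMap” step concrete, here is a rough sketch of how the policy could be wrapped. The ConfigMap name and namespace are taken from the CronJob above; the embedded policy is trimmed to a single strategy for brevity, so substitute your full policy.yaml in its place:

```yaml
# descheduler-configmap.yaml (illustrative file name)
apiVersion: v1
kind: ConfigMap
metadata:
  name: descheduler-policy   # must match the configMap name in the CronJob volume
  namespace: kube-system     # must be the same namespace as the CronJob
data:
  # The key becomes the file name under the mountPath,
  # i.e. /policy-dir/policy.yaml inside the container.
  policy.yaml: |
    apiVersion: "descheduler/v1alpha1"
    kind: "DeschedulerPolicy"
    strategies:
      "RemoveDuplicates":
        enabled: true
```

The one thing that has to line up is the data key: the --policy-config-file flag points at /policy-dir/policy.yaml, so the key must be named policy.yaml.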
The Pitfalls and The “Oh Crap” Moments
This is power, and with power comes the ability to accidentally disrupt every pod in your cluster. Here’s what to watch for:
- Pod Disruption Budgets (PDBs) are your safety net: The Descheduler respects PDBs. If evicting a pod would violate a PDB, it won’t do it. This is your primary mechanism for preventing it from evicting all pods for a critical service at once. Never, ever run the Descheduler without defining appropriate PDBs for your important workloads. This isn’t a best practice; it’s a requirement.
- Static Pods are a blind spot: The Descheduler cannot evict static pods (the ones the kubelet runs directly from manifest files on the node). The API server only sees read-only mirror pods for them, so there is nothing for it to evict. Don’t expect it to balance those.
- The descheduling-interval is subtle: This flag doesn’t control the CronJob schedule. It controls how long the Descheduler process, once started, waits between its internal passes. Left unset, the process makes one pass and exits, which is what you want under a CronJob. Set to a non-zero value such as 1h, it keeps looping inside the same process, so the Job never completes and the CronJob’s schedule stops meaning anything.
- Tune your thresholds carefully: The LowNodeUtilization strategy is particularly sensitive. Setting thresholds too high or targetThresholds too low will make it far too aggressive, evicting pods constantly for minimal gain. Keep in mind that in this version it measures utilization from pod resource requests, not live usage, so a node that looks hot in your dashboards may not count as overutilized. Start conservative and monitor the eviction logs closely.
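Since PDBs carry most of the safety burden here, a minimal sketch may help. It assumes a hypothetical five-replica Deployment whose pods are labeled app: checkout in a namespace called shop; none of these names come from a real cluster, so adjust them to your workload:

```yaml
# checkout-pdb.yaml (hypothetical workload, names are placeholders)
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: checkout-pdb
  namespace: shop
spec:
  # With 5 replicas, this allows at most one voluntary eviction at a time.
  minAvailable: 4
  selector:
    matchLabels:
      app: checkout   # must match the pod labels of the Deployment
```

With this in place, an eviction request that would drop the service below four ready pods is refused, and the Descheduler moves on to other candidates.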
The Descheduler isn’t a magic bullet. It’s a specialized tool for a specific set of cleanup tasks. Used wisely, with appropriate safeguards (PDBs!) and a careful policy, it’s what keeps your cluster from becoming a digital hoarder’s paradise. Used recklessly, it’s a cluster-wide pod-murdering spree. I know which one I’d prefer.