35.2 Built-in Scheduler Plugins
Right, let’s talk about how your Pods actually get a home. The kube-scheduler isn’t some mystical oracle; it’s a highly configurable, slightly pedantic librarian who follows a very specific set of rules to find the right shelf for your book (the Pod). We call these rules its scheduling plugins.
Think of the scheduling process as a two-phase filter-and-score system. First, the librarian eliminates all the shelves that are obviously wrong. Is the node out of disk? Filtered out. Does the Pod need a GPU and this node doesn’t have one? Gone. This is the Filtering phase, run by plugins like NodeResourcesFit. Then, for all the remaining, perfectly valid shelves, the librarian ranks them. “This shelf would end up with evenly balanced CPU and memory usage, let’s give it a high score. This one has a label the Pod prefers, add a few points.” This is the Scoring phase, run by plugins like NodeResourcesBalancedAllocation. Each plugin’s score is multiplied by its weight, the results are summed, and the node with the highest total wins (ties are broken at random). It’s brutally efficient.
The Usual Suspects: Default Plugins
Out of the box, you get a sensible squad of plugins doing the heavy lifting. You don’t need to configure these 99% of the time, but you absolutely need to know what they’re doing for you.
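For the 1% of the time you do need to touch them, the knobs live in a KubeSchedulerConfiguration file that you pass to kube-scheduler with --config. What follows is a minimal sketch, assuming a cluster recent enough to serve the kubescheduler.config.k8s.io/v1 API. Switching NodeResourcesFit from its default LeastAllocated scoring to MostAllocated (bin-packing) is purely an illustrative choice, and so are the weights.

```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
  pluginConfig:
  - name: NodeResourcesFit
    args:
      scoringStrategy:
        type: MostAllocated    # illustrative: prefer fuller nodes (default is LeastAllocated)
        resources:
        - name: cpu
          weight: 1            # illustrative weights; both resources count equally
        - name: memory
          weight: 1
```

Only the scoring behavior changes here. The filtering side of NodeResourcesFit keeps rejecting nodes that can’t fit the Pod’s requests.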
NodeResourcesFit is your bouncer. It checks if a node has enough CPU, memory, and ephemeral storage to host the Pod. It’s not smart about it: it compares the Pod’s requests against the node’s allocatable capacity minus the requests of the Pods already scheduled there, and it never looks at actual usage. This is why you can’t just kubectl apply a Pod requesting 999999 CPUs; it’ll sit Pending forever because no bouncer will let it into the club.
NodeAffinity is the next big one, and this single plugin handles two Pod fields. nodeSelector is the simple, blunt instrument: “Put this Pod only on a node with this label.” nodeAffinity is its more sophisticated cousin, allowing for softer preferences (preferredDuringSchedulingIgnoredDuringExecution, what a name) and more complex label matching rules. I use these to pin workloads to GPU nodes or specific availability zones.
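Here’s roughly what a hard rule plus a soft preference looks like on a Pod. Treat it as a sketch: the Pod name, the accelerator label, and the zone value are hypothetical, so substitute whatever labels your nodes actually carry.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-job                       # hypothetical name
spec:
  affinity:
    nodeAffinity:
      # Hard rule: nodes that don't match are filtered out.
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: accelerator          # hypothetical node label
            operator: In
            values: ["nvidia-gpu"]
      # Soft rule: matching nodes get extra points during scoring.
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 50                    # valid range is 1-100
        preference:
          matchExpressions:
          - key: topology.kubernetes.io/zone
            operator: In
            values: ["us-east-1a"]    # hypothetical zone
  containers:
  - name: nginx
    image: nginx
```

The IgnoredDuringExecution half of those names means what it says: if the node’s labels change after the Pod lands, the Pod stays put.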
PodTopologySpread is where things get clever. This is your best weapon for achieving high availability. It reads your cluster’s topology from node labels (zones, regions, hostnames) and spreads your Pods across those domains to minimize the blast radius of a single failure. You’d be a fool not to use it for anything production-worthy.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: spread-me
spec:
  replicas: 3
  selector:                # Required; must match the Pod template's labels
    matchLabels:
      app: spread-me
  template:
    metadata:
      labels:
        app: spread-me     # The label the spread constraint counts
    spec:
      containers:
      - name: nginx
        image: nginx
      topologySpreadConstraints:
      - maxSkew: 1         # Max difference of Pods between any two zones
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule   # Hard requirement
        labelSelector:     # Spread Pods based on their own labels
          matchLabels:
            app: spread-me
This tells the scheduler: “Keep the number of my Pods roughly even across zones, and never, ever put more than one Pod extra in any single zone compared to another.” It’s a thing of beauty.
Tainting My Nodes: Tolerations
This is the inverse of affinity. Instead of the Pod saying “I want this,” you tell the node to repel Pods that don’t have a specific antidote. You taint the node, and Pods must have a matching toleration to be scheduled there; the TaintToleration plugin enforces this during filtering. This is perfect for reserving nodes for specific, noisy workloads.
# Add a taint to a node. This is the node's problem.
kubectl taint nodes node01 special-workload=true:NoSchedule

# And this is the Pod's solution (its toleration)
apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
  - name: nginx
    image: nginx
  tolerations:
  - key: "special-workload"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
Without that toleration, the scheduler’s filtering phase would see the taint on node01 and nope right out of there. It’s a fantastic way to cordon off nodes without actually cordoning them off.
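One caveat: a toleration only lets a Pod onto the tainted node; it doesn’t send it there. If you want node01 to be truly dedicated, pair the toleration with a nodeSelector (or node affinity). A minimal sketch, assuming you also put a dedicated=special-workload label on the node; that label and the Pod name are made up for illustration.

```yaml
# Assumes the node was also labeled, e.g.:
#   kubectl label nodes node01 dedicated=special-workload
apiVersion: v1
kind: Pod
metadata:
  name: my-dedicated-pod         # hypothetical name
spec:
  nodeSelector:
    dedicated: special-workload  # hypothetical label: attracts the Pod to the node
  tolerations:                   # lets the Pod past the taint
  - key: "special-workload"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
  containers:
  - name: nginx
    image: nginx
```

The taint keeps everyone else out, and the selector keeps this Pod from wandering off to an untainted node.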
The Gotchas and Rough Edges
Now, the scheduler is brilliant, but it’s not clairvoyant. Here’s where you can step on a rake.
The biggest pitfall is resource requests. If you set neither requests nor limits, your Pod’s resources.requests is {}, which the scheduler’s filtering phase interprets as “I need 0 CPU and 0 memory.” (If you set only limits, Kubernetes copies them into the requests.) It will happily pack 100 of these Pods onto a single node until the kubelet, who actually has to run the things, hits memory pressure and starts evicting Pods. Always set your requests. The API won’t force you to, but the scheduler can’t do its job properly without them.
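For reference, this is all it takes. The numbers below are placeholders, not recommendations; size them from what your workload actually uses.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: sized-pod          # hypothetical name
spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      requests:            # what the scheduler reserves on the node
        cpu: 250m          # placeholder value
        memory: 128Mi      # placeholder value
      limits:              # enforced on the node at runtime, not a scheduling input
        memory: 256Mi      # placeholder value
```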
Another edge case is preemption, handled by the DefaultPreemption plugin. If a high-priority Pod can’t be scheduled on any node, the scheduler might evict lower-priority Pods to make room. It’s as brutal as it sounds. If you’ve ever had a perfectly healthy Pod suddenly get killed for no apparent reason, check whether a higher-priority Pod moved in. Use priorityClassName judiciously.
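If you want a Pod to jump the scheduling queue without evicting anyone, a PriorityClass can opt out of preemption. A sketch, with hypothetical names and an arbitrary priority value:

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: important-but-polite     # hypothetical name
value: 100000                    # arbitrary; higher means more important
preemptionPolicy: Never          # default is PreemptLowerPriority
globalDefault: false
description: "Schedules ahead of lower-priority Pods but never evicts running ones."
---
apiVersion: v1
kind: Pod
metadata:
  name: polite-pod               # hypothetical name
spec:
  priorityClassName: important-but-polite
  containers:
  - name: nginx
    image: nginx
```

Leave preemptionPolicy at its default and the same Pod becomes the preemptor described above.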
Finally, remember the scheduler only cares about scheduling time. A node can have plenty of CPU when your Pod lands, but then another Pod on the same node might go haywire and consume all the CPU five minutes later. The scheduler’s job is done. It doesn’t do continuous optimization. For that, you need add-ons like the descheduler, which is a story for another time. The built-in scheduler finds a home; it doesn’t guarantee the neighbors won’t be terrible.