35.6 Bin Packing vs Spreading: Resource Efficiency Trade-offs
Right, let’s talk about how the Scheduler decides where to put your Pods. You’ve probably never stared at a rack of servers and thought, “You know what this needs? A really, really good game of Tetris.” But that’s essentially the Scheduler’s full-time job: fitting Pods of different shapes onto Nodes of limited size. Play it as pure bin packing and you cram as much useful work onto as few machines as possible. This is fantastic for your cloud bill but, as with most things in engineering, it’s a trade-off. The counter-force to this ruthless efficiency is the desire to spread your workloads out for high availability. This tension between packing and spreading is the core strategic dilemma you, as the cluster operator, get to manage.
The Scheduler’s Two Key Questions
For every unscheduled Pod, the Scheduler runs through two distinct phases. First, it asks, "Can this Pod run on this Node?" This is Filtering (older releases called these checks Predicates). It checks hard requirements: does the Node have enough unrequested CPU and memory to cover the Pod’s requests? Does it have the requested GPU? Does the Pod tolerate the Node’s taints? This phase just creates a list of viable candidates. If there are none, your Pod sulks in Pending hell until you fix something.
Then, for all the Nodes that passed the first round, it asks, "How well would this Pod run on this Node?" This is Scoring (formerly Priorities). Each viable Node is given a score, and the one with the highest score wins. This is where our main event, the fight between Bin Packing and Spreading, takes place. The scoring functions are pluggable, but you need to understand the two heavy hitters: the NodeResourcesFit plugin, whose scoring strategy can be LeastAllocated or MostAllocated, and the PodTopologySpread plugin.
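The two phases can be sketched in a few lines of Python. This is a toy model, not the real scheduler: the Node class, its fields, and the precomputed score are all hypothetical stand-ins, and the real Scheduler checks far more than CPU and memory and breaks ties between equal scores randomly.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    free_cpu_m: int   # unrequested CPU in millicores (hypothetical field)
    free_mem_mi: int  # unrequested memory in MiB (hypothetical field)
    score: int        # stand-in for the combined plugin score


def schedule(nodes, cpu_m, mem_mi):
    """Return the chosen node name, or None if the Pod would stay Pending."""
    # Phase 1: filtering. Drop every node that cannot cover the requests.
    feasible = [n for n in nodes
                if n.free_cpu_m >= cpu_m and n.free_mem_mi >= mem_mi]
    if not feasible:
        return None
    # Phase 2: scoring. Among the survivors, the highest score wins.
    return max(feasible, key=lambda n: n.score).name


nodes = [
    Node("node-a", free_cpu_m=200, free_mem_mi=4096, score=90),
    Node("node-b", free_cpu_m=1500, free_mem_mi=2048, score=40),
    Node("node-c", free_cpu_m=3000, free_mem_mi=8192, score=70),
]

print(schedule(nodes, cpu_m=500, mem_mi=256))   # node-c
print(schedule(nodes, cpu_m=8000, mem_mi=256))  # None
```

Note that node-a has the best score but never gets to compete: it fails filtering, and scoring only ever sees the Nodes that survive.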
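The MostAllocated Strategy (The Packer)
This scorer is your frugal, efficiency-obsessed friend. MostAllocated is one of the scoring strategies of the NodeResourcesFit plugin, and it favors Nodes that already have the largest share of their resources requested. The logic is simple: by placing a new Pod on the fullest Node that can still fit it, we leave other Nodes wide open for larger Pods later (or empty enough for an autoscaler to remove), leading to better overall cluster utilization. It’s the “fill one suitcase completely before starting another” approach. The catch is that it is not the default. Out of the box, NodeResourcesFit uses LeastAllocated, which does the opposite: it favors the Node with the most free resources, evening out load across Nodes rather than packing them. To get packing, you have to set the plugin’s scoringStrategy type to MostAllocated in the scheduler configuration.
You can see the inputs to this math if you describe a node and compare the Allocatable section with the “Allocated resources” section. The Scheduler works from requests, not live usage: for each Node it computes how much of the allocatable capacity would be requested after adding the Pod, then prefers the biggest remaining gap (LeastAllocated) or the smallest (MostAllocated).
    # With the default LeastAllocated strategy, this Pod tends to land on the node
    # with the most unrequested CPU and memory. With MostAllocated configured, it
    # tends to land on the fullest node that can still fit these requests.
    apiVersion: v1
    kind: Pod
    metadata:
      name: greedy-pod
    spec:
      containers:
        - name: app
          image: my-app:latest
          resources:
            requests:
              memory: "256Mi"
              cpu: "500m"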
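To make the arithmetic concrete, here is a simplified sketch of how the two NodeResourcesFit scoring strategies, LeastAllocated and MostAllocated, rank the same pair of Nodes. It assumes equal weights for CPU and memory and uses made-up node figures; the real plugin supports per-resource weights and more resource types, so treat the numbers as illustrative only.

```python
def least_allocated(requested, allocatable):
    """Score 0-100: higher when more of the node would remain free."""
    return (allocatable - requested) * 100 // allocatable


def most_allocated(requested, allocatable):
    """Score 0-100: higher when more of the node would be in use."""
    return requested * 100 // allocatable


def node_score(strategy, pod, node):
    """Average the per-resource scores (equal weights are an assumption)."""
    per_resource = [
        # Requests already on the node plus the incoming Pod's requests.
        strategy(node["requested"][r] + pod[r], node["allocatable"][r])
        for r in ("cpu_m", "mem_mi")
    ]
    return sum(per_resource) // len(per_resource)


# Requests of 500m CPU and 256Mi memory.
pod = {"cpu_m": 500, "mem_mi": 256}

# Two hypothetical nodes of the same size, one nearly full and one nearly empty.
busy = {"allocatable": {"cpu_m": 4000, "mem_mi": 8192},
        "requested": {"cpu_m": 3000, "mem_mi": 6144}}
idle = {"allocatable": {"cpu_m": 4000, "mem_mi": 8192},
        "requested": {"cpu_m": 500, "mem_mi": 1024}}

for strategy in (least_allocated, most_allocated):
    scores = {"busy": node_score(strategy, pod, busy),
              "idle": node_score(strategy, pod, idle)}
    print(strategy.__name__, scores, "->", max(scores, key=scores.get))
```

Under LeastAllocated the idle node wins; under MostAllocated the busy node wins, and that second outcome is the packing behavior.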
Packing is brilliant for saving money, but it has a dark side. If all your Pods get packed onto a handful of nodes, a single node failure can wipe out a huge percentage of your capacity. This is where the other strategy comes in.
The Spread Constraints (The Worrier)
Spread constraints are the high-availability countermeasure. Their entire job is to avoid putting all your eggs in one basket. The main tool is the Pod-level topologySpreadConstraints field, enforced by the PodTopologySpread plugin, which tells the Scheduler to spread matching Pods across failure domains like zones, regions, or even individual nodes.
This is where you tell Kubernetes, “I’d rather pay for a few more nodes than get paged at 3 AM.” It’s an explicit trade: resource efficiency for resilience.
    # This Deployment must spread its Pods evenly across zones (hard rule) and
    # should spread them across nodes (soft preference). Both constraints apply together.
    # maxSkew: 1 is the strictest setting: matching Pod counts in any two
    # eligible domains can differ by at most 1.
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: resilient-app
    spec:
      replicas: 4
      selector:                  # required by apps/v1; must match the template labels
        matchLabels:
          app: resilient-app
      template:
        metadata:
          labels:
            app: resilient-app   # the spread constraints count Pods by this label
        spec:
          topologySpreadConstraints:
            - maxSkew: 1
              topologyKey: topology.kubernetes.io/zone
              whenUnsatisfiable: DoNotSchedule
              labelSelector:
                matchLabels:
                  app: resilient-app
            - maxSkew: 1
              topologyKey: kubernetes.io/hostname
              whenUnsatisfiable: ScheduleAnyway
              labelSelector:
                matchLabels:
                  app: resilient-app
          containers:
            - name: app
              image: my-app:latest
The whenUnsatisfiable field is critical here. DoNotSchedule is a hard rule: if no feasible Node would keep the skew within maxSkew, the Pod stays Pending. ScheduleAnyway tells the Scheduler to try its best but to still schedule the Pod even if maxSkew can’t be met, which is useful for avoiding stuck rollouts when one zone or node runs out of room.
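The skew check itself is small enough to sketch. For a candidate domain, the skew is the number of matching Pods that domain would hold after placement, minus the smallest count across all eligible domains. The function names and zone counts below are hypothetical, and the sketch ignores refinements such as minDomains and node affinity handling.

```python
def skew_after_placement(counts, domain):
    """Skew for `domain` if one more matching Pod lands there.

    `counts` maps each eligible topology domain (for example, a zone)
    to the number of matching Pods it currently holds.
    """
    global_min = min(counts.values())
    return counts[domain] + 1 - global_min


def allowed_domains(counts, max_skew):
    """Domains where a new Pod keeps the skew within max_skew."""
    return sorted(d for d in counts
                  if skew_after_placement(counts, d) <= max_skew)


zones = {"zone-a": 2, "zone-b": 1, "zone-c": 1}
print(allowed_domains(zones, max_skew=1))  # ['zone-b', 'zone-c']

lopsided = {"zone-a": 2, "zone-b": 2, "zone-c": 1}
print(allowed_domains(lopsided, max_skew=1))  # ['zone-c']
```

In the second case only zone-c qualifies. If zone-c has no Node with room, a DoNotSchedule constraint leaves the Pod Pending, while ScheduleAnyway lets it land in zone-a or zone-b at the cost of a worse skew.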
How They Actually Work Together
Here’s the thing the documentation often glosses over: these strategies aren’t mutually exclusive. The Scheduler doesn’t just pick one. It runs all the enabled scoring plugins. Each one gives a Node a score that is normalized to the range 0 to 100. These scores are then multiplied by each plugin’s weight and added together to form a total score.
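The default plugins and their weights are the key. In the default profile of recent releases, NodeResourcesFit scores with LeastAllocated at a weight of 1 and PodTopologySpread carries a weight of 2, so the defaults lean toward evening out load rather than packing; check the scheduler configuration reference for your version, since defaults have changed over time. PodTopologySpread scores nodes against your ScheduleAnyway constraints, favoring those that would reduce the skew, while DoNotSchedule constraints are enforced earlier, during filtering.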
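A small sketch shows how the weighted sum decides between two Nodes, and how changing a weight flips the outcome. The per-plugin scores and node names are invented for illustration; the weights of 1 and 2 in the first call are what I understand recent default profiles to use, and the second set is a hypothetical custom profile.

```python
def total_score(plugin_scores, weights):
    """Weighted sum of per-plugin scores (each already normalized to 0-100)."""
    return sum(weights[plugin] * score
               for plugin, score in plugin_scores.items())


def pick(nodes, weights):
    """Return the name of the node with the highest weighted total."""
    return max(nodes, key=lambda name: total_score(nodes[name], weights))


# Hypothetical per-plugin scores for two feasible nodes.
nodes = {
    "empty-node-in-crowded-zone": {"NodeResourcesFit": 90, "PodTopologySpread": 20},
    "busy-node-in-quiet-zone": {"NodeResourcesFit": 40, "PodTopologySpread": 80},
}

# Spreading weighted higher: 90 + 40 = 130 versus 40 + 160 = 200.
print(pick(nodes, {"NodeResourcesFit": 1, "PodTopologySpread": 2}))
# busy-node-in-quiet-zone

# Resource fit weighted higher: 270 + 20 = 290 versus 120 + 80 = 200.
print(pick(nodes, {"NodeResourcesFit": 3, "PodTopologySpread": 1}))
# empty-node-in-crowded-zone
```

Neither plugin was overruled outright in either case; the weights simply decided whose opinion counted for more.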
This is why you can’t just think in absolutes. You’re not choosing between packing and spreading; you’re tuning the dial between them by manipulating Pod constraints and understanding the scheduler’s weighting. The default profile is a specific opinion, one that evens out resource usage across Nodes but makes no hard promise about where the replicas of any one workload end up. If you want packing, you opt in with MostAllocated. And for production workloads, you will almost always want explicit topologySpreadConstraints to force the per-workload spreading you need for reliability.
The rough edge? This composite scoring can lead to surprising placements if you’re not careful. A Node might score highly on LeastAllocated (it’s very empty) but poorly on your topologySpreadConstraint (its zone is already packed). The final decision depends on the sum of the weighted scores. The best practice is to never rely on defaults for critical workloads. Define your spread constraints explicitly. Your cloud bill might be slightly higher, but your phone will be a lot quieter.