35.1 Scheduling Pipeline: Filtering and Scoring
Alright, let’s pull back the curtain on the main event: the scheduling pipeline. This is where the rubber meets the road. The scheduler doesn’t just pick a node out of a hat; for each Pod, it runs every candidate node through a rigorous, two-phase gauntlet: Filtering (historically called Predicates) and Scoring (historically called Priorities). Think of it like a reality TV show. First, we eliminate all the contestants who don’t meet the basic requirements (Filtering). Then, we judge the remaining contestants on their talents to pick a winner (Scoring).
The Filtering Phase: “Can You Even Run Here?”
This phase is brutally binary. For each node, the scheduler runs all the filter plugins. If any single plugin fails on a node, that node gets kicked out of the running. It’s an instant, irrevocable veto. The goal is to produce a list of nodes that are, technically speaking, capable of running the Pod. We call this list the feasible nodes.
Common reasons for a node to get filtered out:
- Insufficient CPU/Memory: The Pod’s requests exceed what the node has left, meaning its allocatable capacity minus the requests of the Pods already placed there.
- NodeSelector/NodeAffinity Mismatch: The node doesn’t have the labels the Pod is demanding.
- Taints and Tolerations: The node has a taint that the Pod doesn’t tolerate. This is the primary mechanism for keeping random Pods off specialized nodes like those with GPUs.
- Volume Zone Conflicts: The Pod requires a PersistentVolume that only exists in another availability zone.
- HostPort Conflict: Another Pod is already using the requested hostPort on that node.
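To make the veto logic concrete, here is a minimal sketch of the filtering phase in Python. It is not the scheduler’s actual code: the Node and Pod classes, the four filter functions, and the sample nodes are all hypothetical, and taints are reduced to bare keys (real taints also carry values and effects).

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    free_cpu_m: int   # allocatable CPU minus existing requests, in millicores
    free_mem_mi: int  # allocatable memory minus existing requests, in MiB
    labels: dict = field(default_factory=dict)
    taints: set = field(default_factory=set)           # simplified: taint keys only
    used_host_ports: set = field(default_factory=set)


@dataclass
class Pod:
    cpu_m: int
    mem_mi: int
    node_selector: dict = field(default_factory=dict)
    tolerations: set = field(default_factory=set)      # simplified: tolerated taint keys
    host_ports: set = field(default_factory=set)


def fits_resources(pod, node):
    return pod.cpu_m <= node.free_cpu_m and pod.mem_mi <= node.free_mem_mi


def matches_selector(pod, node):
    return all(node.labels.get(k) == v for k, v in pod.node_selector.items())


def tolerates_taints(pod, node):
    # Every taint on the node must be tolerated by the Pod.
    return node.taints <= pod.tolerations


def host_ports_free(pod, node):
    return pod.host_ports.isdisjoint(node.used_host_ports)


FILTERS = [fits_resources, matches_selector, tolerates_taints, host_ports_free]


def feasible_nodes(pod, nodes):
    # A node survives only if every filter passes; a single failure is a veto.
    return [n for n in nodes if all(f(pod, n) for f in FILTERS)]


nodes = [
    Node("small", free_cpu_m=50, free_mem_mi=4096),
    Node("gpu", free_cpu_m=4000, free_mem_mi=8192, taints={"gpu"}),
    Node("busy-port", free_cpu_m=2000, free_mem_mi=2048, used_host_ports={8080}),
    Node("general", free_cpu_m=2000, free_mem_mi=2048),
]
pod = Pod(cpu_m=100, mem_mi=100, host_ports={8080})
print([n.name for n in feasible_nodes(pod, nodes)])  # ['general']
```

Each of the first three nodes trips exactly one filter (CPU, taint, hostPort), which is all it takes to drop it from the feasible list.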
Here’s the thing: if all nodes get filtered out, your Pod isn’t doomed. It just stays in Pending until a node does become feasible (e.g., another Pod terminates, freeing up resources, or a node with the right label is added). It’s a queue. A sometimes-frustratingly-long queue.
The Scoring Phase: “Which of You is the Best?”
Once we have our shortlist of feasible nodes, the scheduler needs to pick the best one. This is where it gets nuanced. Each scoring plugin runs against each feasible node and gives it a grade, typically an integer between 0 and 100. The scheduler then combines these grades, weighted by each plugin’s importance, to produce a total score for each node.
The node with the highest score wins.
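The weighted sum is easy to sketch. The snippet below is an illustration only: pick_node is a hypothetical helper, and the per-plugin scores and weights are made-up numbers, not values taken from a real cluster.

```python
def pick_node(plugin_scores, weights):
    """Combine per-plugin scores into a weighted total per node.

    plugin_scores: {plugin_name: {node_name: score in 0..100}}
    weights:       {plugin_name: integer weight}
    """
    totals = {}
    for plugin, per_node in plugin_scores.items():
        for node, score in per_node.items():
            totals[node] = totals.get(node, 0) + weights[plugin] * score
    # In this sketch, ties go to whichever node was seen first.
    best = max(totals, key=totals.get)
    return best, totals


plugin_scores = {
    "LeastRequested": {"node-a": 90, "node-b": 40},
    "ImageLocality": {"node-a": 0, "node-b": 100},
}

# Equal weights: the cached image on node-b wins (90 vs. 140).
print(pick_node(plugin_scores, {"LeastRequested": 1, "ImageLocality": 1}))

# Triple the weight of free resources and node-a wins instead (270 vs. 220).
print(pick_node(plugin_scores, {"LeastRequested": 3, "ImageLocality": 1}))
```

Same nodes, same raw grades, different winner: the weights are where your priorities get encoded.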
Let’s look at some of the most important scoring strategies:
- LeastRequested: Favors nodes with the most free resources. It’s the default “spread things out” strategy. The score is based on the free fraction of each resource, (allocatable - requested) / allocatable, averaged across CPU and memory and scaled to 0-100.
- BalancedResourceAllocation: This is the clever cousin of LeastRequested. It doesn’t just favor nodes with the most free resources; it favors nodes whose free CPU and free memory are balanced. You don’t want a node that’s 95% CPU-free but 99% memory-allocated. This plugin tries to prevent that resource skew, which makes future scheduling easier.
- ImageLocality: Favors nodes that already have the container images the Pod needs cached on them. This is a huge win for startup time, as it avoids the network pull. It’s the scheduler being politely efficient.
- InterPodAffinity: This is a complex one. It scores nodes highly if they satisfy the Pod’s preferred affinity/anti-affinity rules (e.g., “run me near this other service” or “for the love of all that is holy, do NOT run me on the same node as that other Pod”). Required (hard) affinity rules are not scored at all; they are enforced earlier, in the filtering phase.
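Here is a rough sketch of the first two strategies as plain arithmetic. Treat it as a simplified model, not the plugins’ real implementation: the function names are hypothetical, only CPU and memory are considered, and “requested” means everything already requested on the node plus the incoming Pod.

```python
def least_requested_score(requested, allocatable):
    """Average free fraction across resources, scaled to 0..100."""
    fractions = [
        (allocatable[r] - requested[r]) / allocatable[r] for r in allocatable
    ]
    return round(100 * sum(fractions) / len(fractions))


def balanced_score(requested, allocatable):
    """100 when CPU and memory utilization match; lower as they diverge."""
    cpu = requested["cpu"] / allocatable["cpu"]
    mem = requested["memory"] / allocatable["memory"]
    return round(100 * (1 - abs(cpu - mem)))


# Hypothetical node: 4000 millicores, 8000 MiB allocatable.
allocatable = {"cpu": 4000, "memory": 8000}

even = {"cpu": 2000, "memory": 4000}    # 50% CPU, 50% memory requested
skewed = {"cpu": 400, "memory": 7200}   # 10% CPU, 90% memory requested

print(least_requested_score(even, allocatable), balanced_score(even, allocatable))      # 50 100
print(least_requested_score(skewed, allocatable), balanced_score(skewed, allocatable))  # 50 20
```

Both nodes look identical to the free-resources score, but the balance score exposes the lopsided one. That’s exactly the skew BalancedResourceAllocation exists to catch.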
You can see the final scores in the scheduler’s logs if you turn up the verbosity. It’s incredibly revealing. Let’s simulate a simple scenario. Imagine a Pod requesting 100m CPU and 100Mi memory.
```yaml
# pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-pod
spec:
  containers:
  - name: example
    image: nginx
    resources:
      requests:
        memory: "100Mi"
        cpu: "100m"
```
Now, let’s see which nodes are feasible and how they might score. You can’t directly query the scheduler’s decision, but you can often infer it by describing a scheduled Pod or by using kubectl describe node to see allocatable resources.
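If you want to do that inference by hand, the arithmetic looks roughly like this. The sketch below takes the Pod’s requests from the manifest above and compares them with a node’s allocatable and already-allocated resources. The node figures are hypothetical stand-ins for what you might read off kubectl describe node, and the parsers are deliberately minimal: they handle only the m suffix for CPU and the Ki/Mi/Gi suffixes for memory.

```python
def parse_cpu(q):
    """Return millicores: "100m" -> 100, "2" -> 2000."""
    return int(q[:-1]) if q.endswith("m") else int(float(q) * 1000)


_BINARY = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}


def parse_memory(q):
    """Return bytes for the binary suffixes Ki, Mi, Gi, or a plain byte count."""
    for suffix, factor in _BINARY.items():
        if q.endswith(suffix):
            return int(q[: -len(suffix)]) * factor
    return int(q)


def headroom(allocatable, allocated, request):
    """Capacity left per resource after placing the request (negative = no fit)."""
    return {
        "cpu": parse_cpu(allocatable["cpu"])
        - parse_cpu(allocated["cpu"])
        - parse_cpu(request["cpu"]),
        "memory": parse_memory(allocatable["memory"])
        - parse_memory(allocated["memory"])
        - parse_memory(request["memory"]),
    }


def fits(allocatable, allocated, request):
    return all(v >= 0 for v in headroom(allocatable, allocated, request).values())


request = {"cpu": "100m", "memory": "100Mi"}       # from pod.yaml

# Hypothetical node: plenty of memory, but only 50m of CPU left.
allocatable = {"cpu": "2", "memory": "4Gi"}
allocated = {"cpu": "1950m", "memory": "1Gi"}

print(headroom(allocatable, allocated, request)["cpu"])  # -50
print(fits(allocatable, allocated, request))             # False
```

Notice that the node has gigabytes of memory to spare and still gets vetoed: being 50 millicores short on one resource is enough.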
The Pitfalls: Where This All Goes Pear-Shaped
This system is elegant but not clairvoyant. Here’s where the cracks show:
- The Perfect is the Enemy of the Good: The scheduler makes a decision at a single point in time. A node might be the perfect 100/100 score now, but if a giant Pod lands on it 500ms later, it’s suddenly a congested mess. The scheduler doesn’t continuously re-balance the cluster; it’s a reactive system.
- The Black Box Veto: If your Pod is stuck Pending, it’s often because it was filtered from all nodes. Debugging this means checking all the usual suspects: taints, node selectors and affinity rules, resource requests. kubectl describe pod <pod-name> is your best friend here; the FailedScheduling event in its Events section usually tells you which filters kicked you out, and on how many nodes.
- Scoring Myopia: The default scoring weights might not match your priorities. Maybe you care far more about ImageLocality than BalancedResourceAllocation. This is why you can write and configure your own scheduler plugins. That’s a topic for another chapter, but know that the escape hatch exists.
The designers chose this filter/score pattern for a solid reason: performance. Filtering first drastically reduces the problem space for the more computationally expensive scoring phase. It’s a pragmatic choice, even if it sometimes feels a bit rigid. Your job is to work within its constraints, label your nodes meaningfully, and set sensible resource requests. Do that, and this brilliant, slightly-pedantic system will work beautifully for you.