34.2 Node Affinity: requiredDuringScheduling and preferredDuringScheduling

Right, so you’ve used taints to tell Pods which nodes are off-limits. Now let’s talk about the more polite, proactive side of the equation: node affinity. This is how you tell your Pod where it should go, or at least, where it would prefer to go. It’s the difference between “Get off my lawn!” (taints) and “Hey, you’d love it here, we have a pool!” (affinity).

The designers, in their infinite wisdom, gave us two main flavors of node affinity: requiredDuringScheduling and preferredDuringScheduling. The names are a mouthful, but they’re brutally honest about what they do. The first one is a hard requirement. If Kubernetes can’t meet it, your Pod sulks in a Pending state until a node that satisfies it shows up, which may be never. The second is a soft preference, a suggestion. Kubernetes will try its best, but if it can’t find a node that matches, it will just shrug and schedule your Pod somewhere else. It’s the difference between “I will only eat pizza from this one specific joint in Naples” and “I’d prefer pizza, but I guess this salad will do.”

The requiredDuringScheduling Lifeguard

Use this when you absolutely, positively must have your Pod on a specific type of node. Maybe it needs a GPU, or it must be in a specific availability zone for low latency. Breaking this rule is not an option.

You define these hard requirements under nodeAffinity using nodeSelectorTerms and matchExpressions. It’s a bit more verbose than a simple nodeSelector, but it gives you the power of operators like In, NotIn, Exists, and Gt/Lt.

Let’s say you have a node pool with the label node-type=high-mem. Your memory-hungry application is not allowed to run anywhere else.

apiVersion: v1
kind: Pod
metadata:
  name: memory-guzzler
spec:
  containers:
  - name: main
    image: my-memory-hog:latest
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: node-type
            operator: In
            values:
            - high-mem

See that IgnoredDuringExecution part? It’s Kubernetes being pedantically precise. It means this rule is only enforced when scheduling the Pod. If someone goes and changes the label on the node after the Pod is running, Kubernetes won’t evict it. It just ignores the change. For 99.9% of cases, this is what you want.
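
Back to those operators for a moment. The sketch below shows a few of them working together. The gpu-count label and the spot value are made up for illustration, so swap in whatever labels your nodes actually carry. Two rules are worth knowing: expressions inside a single matchExpressions list are ANDed together, while separate entries under nodeSelectorTerms are ORed.

apiVersion: v1
kind: Pod
metadata:
  name: operator-demo            # hypothetical name
spec:
  containers:
  - name: main
    image: my-app:latest
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        # Term 1: every expression in this list must match (AND).
        - matchExpressions:
          - key: accelerator     # label must be present, any value
            operator: Exists
          - key: node-type
            operator: NotIn
            values:
            - spot               # hypothetical value
          - key: gpu-count       # hypothetical label holding an integer
            operator: Gt
            values:
            - "3"                # Gt/Lt take exactly one value, quoted as a string
        # Term 2: an alternative; a node matching either term qualifies (OR).
        - matchExpressions:
          - key: node-type
            operator: In
            values:
            - high-mem

A node passes if it satisfies all of Term 1 or all of Term 2. Nothing gets mixed and matched across terms.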

The preferredDuringScheduling Suggestion Box

This is for your “nice-to-haves.” You want to nudge the scheduler in a certain direction, but you’re not going to throw a fit if it doesn’t work out. A classic use case is steering Pods toward a favored availability zone or rack. You’d prefer the Pod to land in zone us-west-2a, but if that zone is full, us-west-2b is perfectly fine.

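Here’s the catch: each preferredDuringScheduling entry carries a weight field between 1 and 100. This is how you prioritize multiple preferences. For every node that passes the hard filters, the scheduler adds up the weights of the preferences that node satisfies, folds that sum into the node’s overall score alongside its other scoring factors, and favors the node with the highest total. With one preference weighted 90 and another weighted 10, for example, a node matching both earns 100 from affinity, a node matching only the first earns 90, and a node matching neither earns 0.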

Let’s craft a Pod that strongly prefers GPU nodes but will also somewhat prefer nodes with an SSD, and will run anywhere if neither exists.

apiVersion: v1
kind: Pod
metadata:
  name: fancy-pod
spec:
  containers:
  - name: main
    image: my-app:latest
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 90
        preference:
          matchExpressions:
          - key: accelerator
            operator: In
            values:
            - nvidia-tesla-k80
      - weight: 10
        preference:
          matchExpressions:
          - key: disk-type
            operator: In
            values:
            - ssd
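
The two flavors can also live in the same Pod spec. Here is a minimal sketch of the zone scenario from earlier, assuming your nodes carry the well-known topology.kubernetes.io/zone label and the node-type=high-mem label from the first example. The Pod name and the weight of 50 are arbitrary choices for illustration.

apiVersion: v1
kind: Pod
metadata:
  name: zone-picky-pod           # hypothetical name
spec:
  containers:
  - name: main
    image: my-app:latest
  affinity:
    nodeAffinity:
      # Hard rule: only high-mem nodes are candidates at all.
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: node-type
            operator: In
            values:
            - high-mem
      # Soft rule: among those candidates, lean toward us-west-2a.
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 50               # arbitrary weight for illustration
        preference:
          matchExpressions:
          - key: topology.kubernetes.io/zone
            operator: In
            values:
            - us-west-2a

The required block filters the candidate nodes first. The preferred block only ranks whatever survives that filter, so a high-mem node in us-west-2b is still fair game.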

Why You Might Still Get Burned

Here’s the rough edge: the scheduler’s decision is a snapshot in time, and node affinity looks only at node labels. It knows nothing about which Pods are already running on a node. So if a whole fleet of identical Pods shares the same preferredDuringScheduling rule, every one of them makes the same calculation, and nothing in the affinity rule itself stops them from crowding onto the same “preferred” nodes. And because the rule is ignored during execution, a Pod that settled for a second-choice node stays there even after a better one appears. This herding is a known issue, and the fix usually involves pairing node affinity with pod anti-affinity or topology spread constraints, or even a custom scheduler for truly complex scenarios.
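
One built-in way to push back against herding is a topology spread constraint sitting next to the affinity rule. What follows is a sketch, not a prescription: the app: fancy label and the Pod name are invented for illustration, and it assumes your nodes carry the standard kubernetes.io/hostname label.

apiVersion: v1
kind: Pod
metadata:
  name: fancy-pod-spread         # hypothetical name
  labels:
    app: fancy                   # hypothetical label the constraint counts
spec:
  containers:
  - name: main
    image: my-app:latest
  topologySpreadConstraints:
  - maxSkew: 1                          # target difference of at most one matching Pod between nodes
    topologyKey: kubernetes.io/hostname # treat each node as its own bucket
    whenUnsatisfiable: ScheduleAnyway   # soft: rank nodes, never block scheduling
    labelSelector:
      matchLabels:
        app: fancy
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 90
        preference:
          matchExpressions:
          - key: accelerator
            operator: In
            values:
            - nvidia-tesla-k80

With ScheduleAnyway the spread is itself only a preference, so it competes with the affinity weight rather than overriding it. Switch to DoNotSchedule if an uneven spread should leave the Pod Pending instead.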

The best practice? Use requiredDuringScheduling sparingly, like a potent spice. Overuse it, and you’ll end up with a bunch of unschedulable Pods and a frustrated ops team. Use preferredDuringScheduling for most of your workload-shaping needs, but be aware of its limitations. Always check the labels on your nodes with kubectl get nodes --show-labels because the most common pitfall is simply a typo in a label key or value. The scheduler isn’t psychic; it can only work with what you give it.