34.6 Topology Spread Constraints: Balanced Pod Distribution

Right, so you’ve got your Pods running, but you’ve looked at your cluster and noticed something absurd: all your web-server Pods have huddled onto the same two nodes like they’re sharing a single brain cell. The nodes hosting your stateful database? Completely empty. This isn’t just inefficient; it’s a ticking time bomb. If one of those crowded nodes goes down, your entire service might follow. This is where the scheduler’s smarter, more meticulous cousin comes in: Topology Spread Constraints.

Think of this as your fine-grained control for telling the Kubernetes scheduler, “I don’t just care if a Pod runs; I care where it runs relative to other Pods like it.” We’re moving beyond simple node selection into the realm of high availability and balanced resource consumption.

The Core Concept: Domains, Skew, and MaxSkew

At its heart, a topology spread constraint does three things:

Defines a Topology Domain: This is the “where.” The most common domain is topologyKey: kubernetes.io/hostname, which, as you guessed, means per node. But it could be per zone (topology.kubernetes.io/zone), region, or even a custom label you’ve slapped on your nodes.
Selects Pods to Compare: You tell the constraint how to identify the group of Pods it should care about. This is done via labelSelector. You’re essentially saying, “Spread these Pods across the domain.”
Sets a Tolerance for Imbalance (MaxSkew): This is the magic number. maxSkew defines the maximum difference in the number of matching Pods you’ll allow between any two domains; in practice, each domain is compared against the emptiest one. It must be an integer greater than 0. A maxSkew of 1 is the strictest possible; it means the counts in any two domains can differ by at most one. If you have 3 nodes and 3 Pods, the scheduler will fight you to the death to get one Pod on each node.
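
To make the skew arithmetic concrete, here is a minimal sketch in Python. It is a simplified model for building intuition, not the scheduler’s actual code: the function name allowed_domains and the node names are made up for illustration, and it assumes skew means “matching Pods in the target domain after placement, minus the count in the emptiest domain.”

```python
def allowed_domains(counts, max_skew):
    """Return the domains where one more matching Pod keeps skew <= max_skew.

    counts maps a topology domain (here, a hypothetical node name) to the
    number of matching Pods already running there.
    """
    floor = min(counts.values())  # count in the emptiest domain
    return sorted(
        domain
        for domain, current in counts.items()
        if current + 1 - floor <= max_skew
    )


# node-a already has 2 matching Pods, the other two nodes have 1 each.
counts = {"node-a": 2, "node-b": 1, "node-c": 1}
print(allowed_domains(counts, max_skew=1))  # ['node-b', 'node-c']
```

Placing the next Pod on node-a would give it 3 against a minimum of 1, a skew of 2, so only node-b and node-c remain candidates.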

Here’s the basic syntax. Let’s say you want your nginx Pods spread across nodes as evenly as possible.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: spread-nginx
spec:
  replicas: 4
  selector:
    matchLabels:
      app: spread-nginx
  template:
    metadata:
      labels:
        app: spread-nginx
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: spread-nginx
      containers:
      - name: nginx
        image: nginx:latest

In this example, we’re saying: “Spread the Pods with the label app: spread-nginx across nodes (kubernetes.io/hostname). The difference in the number of these Pods on any two nodes must never exceed 1. If you can’t schedule a Pod without violating this rule, just don’t schedule it (whenUnsatisfiable: DoNotSchedule).”

When to Schedule Anyway: `whenUnsatisfiable`

This field is crucial and has two settings:

DoNotSchedule (the hard rule): This is the default and the stricter of the two. The scheduler won’t place the Pod on any node where it would break your maxSkew, and if no node qualifies, the Pod stays Pending. Use this for critical workloads where balance is non-negotiable.
ScheduleAnyway (the soft rule): The constraint never blocks scheduling; the scheduler simply prefers domains that minimize the skew. The Pod still needs a node that can otherwise run it, of course. This is your “try your best” option. It’s less strict and good for best-effort balancing where running is more important than perfect distribution.

Why You Almost Always Want `whenUnsatisfiable: ScheduleAnyway`

Here’s a dirty little secret the manuals often gloss over: DoNotSchedule can box you into a corner, though not where you might first expect. Plain scaling is not the problem. Imagine you have 3 nodes and set maxSkew: 1 with DoNotSchedule. You start with 3 replicas; perfect, one per node. Now you scale to 4. Wherever the fourth Pod lands, the counts become (2,1,1) and the skew is 2 - 1 = 1, which is allowed. The same holds on 2 nodes: going from (1,1) to (2,1) is a skew of 1, so the third Pod schedules without complaint.

The real pitfall appears when a domain still counts toward the constraint but cannot accept another Pod. Say you have 2 nodes with counts (2,1), and node B is out of CPU. You scale up by one. Node B is the only place the new Pod may go, because putting it on node A would give (3,1) and a skew of 2. But node B has no room, so the Pod sits in Pending even though node A has capacity to spare. The same thing happens at zone level when one zone’s nodes have no spare capacity: the under-filled domain drags the minimum down, and every other domain is capped at that minimum plus maxSkew.

This is why ScheduleAnyway is often more practical for many use cases: it allows for graceful degradation and recovery, placing the Pod in the least-bad domain instead of refusing outright. The lesson is to think carefully about what happens during failures and capacity crunches, and to reserve DoNotSchedule for cases where an unbalanced Pod really is worse than a missing one. Stacking several hard constraints deserves the same scrutiny, since a Pod must satisfy all of them at once.
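
To see how a hard constraint can leave a Pod Pending where a soft one still places it, here is a small Python model. It is a sketch under stated assumptions, not how the scheduler is implemented: choose_domain and has_room are hypothetical names, “has room” stands in for every other scheduling check, and the soft rule is reduced to “lowest resulting skew wins,” whereas the real scheduler blends it with other scoring.

```python
def choose_domain(counts, has_room, max_skew, when_unsatisfiable):
    """Pick a domain for one new matching Pod; None means it stays Pending.

    counts:   matching Pods per domain, e.g. {"node-a": 2, "node-b": 1}
    has_room: domains that could run the Pod if spreading were ignored
    """
    floor = min(counts.values())  # emptiest domain, whether or not it has room
    # Skew that would result from landing in each domain that has room.
    skew = {d: counts[d] + 1 - floor for d in has_room}
    if when_unsatisfiable == "DoNotSchedule":
        # Hard rule: discard every domain that would exceed max_skew.
        skew = {d: s for d, s in skew.items() if s <= max_skew}
    if not skew:
        return None
    # Lowest resulting skew wins; ties go to the alphabetically first domain.
    return min(sorted(skew), key=skew.get)


counts = {"node-a": 2, "node-b": 1}
has_room = ["node-a"]  # assumption: node-b is out of CPU in this scenario

print(choose_domain(counts, has_room, 1, "DoNotSchedule"))   # None
print(choose_domain(counts, has_room, 1, "ScheduleAnyway"))  # node-a
```

With the hard rule the only node with capacity is ruled out by skew, so nothing is chosen; with the soft rule the same node is accepted as the best available option.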

The Power of Multiple Constraints

This is where it gets really powerful. You can define multiple constraints to achieve complex scheduling goals. The classic example: spread evenly across zones as a hard rule, and across individual nodes as a best effort.

topologySpreadConstraints:
- maxSkew: 1
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: DoNotSchedule
  labelSelector:
    matchLabels:
      app: my-critical-app
- maxSkew: 1
  topologyKey: kubernetes.io/hostname
  whenUnsatisfiable: ScheduleAnyway
  labelSelector:
    matchLabels:
      app: my-critical-app

This configuration reads as:

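Hard rule: The number of my-critical-app Pods in one zone cannot exceed the number in the emptiest zone by more than 1. Never break this rule (DoNotSchedule).
Soft rule: Among the nodes that pass the zone check, prefer the ones that keep the per-node counts as even as possible, but treat that as a best-effort policy (ScheduleAnyway).

The order of the list entries doesn’t set a priority. A Pod has to satisfy every hard constraint at once, and soft constraints only influence which of the remaining nodes is preferred. The result gets you a long way toward high availability: you’re protected from a whole zone outage, and your Pods don’t pile up on a single node inside a zone. It’s brilliant.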
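
The way the zone rule and the node rule combine can be sketched in a few lines of Python. This is an illustrative model only: pick_node, the node names, and the zone names are invented for the example, and the soft rule is simplified to “choose the node with the fewest matching Pods.”

```python
def pick_node(zone_of, pods_on, max_skew=1):
    """Choose a node for one new matching Pod.

    zone_of: {node name: zone name}
    pods_on: {node name: matching Pods already on that node}
    """
    # Count matching Pods per zone.
    zone_counts = {}
    for node, zone in zone_of.items():
        zone_counts[zone] = zone_counts.get(zone, 0) + pods_on.get(node, 0)
    floor = min(zone_counts.values())

    # Hard zone rule: keep only nodes whose zone stays within max_skew.
    # The emptiest zone always qualifies, so this list is never empty.
    candidates = [
        node for node, zone in zone_of.items()
        if zone_counts[zone] + 1 - floor <= max_skew
    ]

    # Soft hostname rule: prefer the node with the fewest matching Pods.
    return min(sorted(candidates), key=lambda node: pods_on.get(node, 0))


zone_of = {"a1": "zone-a", "a2": "zone-a", "b1": "zone-b", "b2": "zone-b"}
pods_on = {"a1": 1, "a2": 1, "b1": 1, "b2": 0}
print(pick_node(zone_of, pods_on))  # b2
```

Here zone-a already holds 2 Pods against zone-b’s 1, so the hard rule removes both zone-a nodes; the soft rule then picks the empty node b2 over b1.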

The Quiet Killer: Matching Your Own Labels

The most common “why isn’t this working?!” moment is a misconfigured labelSelector. The Pods you’re trying to spread must match the selector in the constraint. In our first example, the Pod template has app: spread-nginx, and the constraint selects Pods with app: spread-nginx. If you forget to add that label to the Pod, or you typo it on either side, the selector matches none of your Pods, every domain reports a count of zero, and the constraint is trivially satisfied. The scheduler acts like the constraint doesn’t exist. It’s a silent failure. Always double-check your labels. It feels dumb when you find the bug, but we’ve all done it.
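
A check like this is easy to automate before you apply a manifest. The following Python sketch assumes the Deployment has already been loaded into a plain dict (for example from YAML or JSON); unmatched_constraints is a hypothetical helper, and it only understands matchLabels, not matchExpressions.

```python
def unmatched_constraints(deployment):
    """Return (index, mismatched_labels) for each spread constraint whose
    matchLabels are not satisfied by the Pod template's own labels."""
    template = deployment["spec"]["template"]
    labels = template["metadata"].get("labels", {})
    constraints = template["spec"].get("topologySpreadConstraints", [])
    problems = []
    for index, constraint in enumerate(constraints):
        wanted = constraint.get("labelSelector", {}).get("matchLabels", {})
        mismatched = {k: v for k, v in wanted.items() if labels.get(k) != v}
        if mismatched:
            problems.append((index, mismatched))
    return problems


deployment = {
    "spec": {"template": {
        "metadata": {"labels": {"app": "spread-nginx"}},
        "spec": {"topologySpreadConstraints": [{
            "maxSkew": 1,
            "topologyKey": "kubernetes.io/hostname",
            "whenUnsatisfiable": "DoNotSchedule",
            # Deliberate typo: "ngnix" instead of "nginx".
            "labelSelector": {"matchLabels": {"app": "spread-ngnix"}},
        }]},
    }},
}

print(unmatched_constraints(deployment))  # [(0, {'app': 'spread-ngnix'})]
```

An empty list means every constraint selects the Pods it is attached to; anything else points at the constraint index and the label that doesn’t line up.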

Topology spreads are one of those features that separate a beginner from an expert. They move you from just running software to running resilient software. Use them. Your future self, dealing with a node failure at 3 AM, will thank you.
