34.3 Pod Affinity and Anti-Affinity: Co-Locating and Spreading Pods
Right, so you’ve told your pods where they can run with node selectors and affinities. But what about telling them who they should run with? Or, more importantly, who they should avoid? That’s where pod affinity and anti-affinity come in, and they’re the divas of the scheduling world. They don’t just care about the node itself; they care about the other pods already throwing a party on it.
Think of it like this: node affinity is about the hardware (“I need a GPU”). Pod affinity is about the neighbors (“I need to be next to my database for low latency,” or, conversely, “For the love of all that is holy, do not put me on the same node as that memory-hogging cache service”).
The Two Flavors: Affinity and Anti-Affinity
This isn’t complicated. You have two levers to pull:
- Pod Affinity: “Schedule me near these other pods.” (Attraction)
- Pod Anti-Affinity: “Schedule me away from these other pods.” (Repulsion)
And for each lever, you have two distinct types that change the game entirely:
- requiredDuringSchedulingIgnoredDuringExecution: This is the hard rule. The scheduler must obey it. If it can’t find a node that satisfies the rule, your pod just sits there, lonely and un-scheduled. Use this for absolute requirements.
- preferredDuringSchedulingIgnoredDuringExecution: This is a soft request, a “nice-to-have.” The scheduler will try its best to fulfill it, but if it can’t, it will just schedule the pod somewhere else and call it a day. Use this for optimizations.
Yes, the names are a mouthful. The “IgnoredDuringExecution” part means that if the pod landscape changes after the pod is scheduled (e.g., someone deletes a pod it was supposed to be near), Kubernetes won’t evict your already-running pod. It’s a scheduling-time rule only.
Crafting the Rule: Topology Keys
This is the most important, most confusing, and most often messed-up part. You don’t just say “be near this pod.” You say “be near this pod, where ‘near’ means a node whose topologyKey label has the same value as the node that pod runs on.”
The topologyKey is a node label key that defines the domain you’re operating in. The most common one is kubernetes.io/hostname, which means “on the exact same node” or “on a different node.” Other examples are topology.kubernetes.io/zone (same cloud availability zone) or topology.kubernetes.io/region.
apiVersion: v1
kind: Pod
metadata:
  name: web-app
  labels:
    app: web-app
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
              - key: app
                operator: In
                values:
                  - redis-cache
          topologyKey: kubernetes.io/hostname
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchExpressions:
                - key: app
                  operator: In
                  values:
                    - web-app
            topologyKey: kubernetes.io/hostname
  containers:
    - name: web-app
      image: nginx:latest
Let’s read this horror show of YAML. The podAffinity rule says: “This pod MUST (required) be scheduled on a node that already has a pod with the label app=redis-cache running on it, and that ‘on it’ is defined by the topologyKey: kubernetes.io/hostname.”
The podAntiAffinity rule says: “Please try (preferred) to not schedule this pod on a node that already has another pod labeled app=web-app on it.” This is a great way to spread your replicas across different nodes for high availability. The weight is a number between 1-100 that tells the scheduler how important this preference is compared to other preferences.
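To show how weights and topology keys interact, here is a minimal sketch that uses only soft rules. The pod name web-app-zoned and the weight values are assumptions chosen for illustration, and the zone rule assumes your nodes carry the standard topology.kubernetes.io/zone label (most managed clusters set it, but check yours).

```yaml
# Sketch only: the pod name and the weights are illustrative assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: web-app-zoned   # hypothetical name
  labels:
    app: web-app
spec:
  affinity:
    podAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        # Mild preference: land in the same zone as a redis-cache pod.
        - weight: 50
          podAffinityTerm:
            labelSelector:
              matchLabels:
                app: redis-cache
            topologyKey: topology.kubernetes.io/zone
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        # Strong preference: avoid nodes that already run a web-app pod.
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchLabels:
                app: web-app
            topologyKey: kubernetes.io/hostname
  containers:
    - name: web-app
      image: nginx:latest
```

Roughly speaking, a node in the cache’s zone earns credit for the first term, and a node already hosting a web-app pod is penalized by the second. Because both rules are preferences, the pod still schedules even if neither can be honored.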
Why This Is a Scheduling Nightmare (and How to Avoid It)
Here’s the brutal truth: requiredDuringScheduling pod (anti-)affinity rules can make your scheduler’s life hell and lead to deadlock. Imagine you have two services that each have a rule saying “I must run near the other one.” If neither is running, which one gets scheduled first? Neither. They both sit there waiting for the other to exist.
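The deadlock is easier to see written down. The two pods below are hypothetical (svc-a and svc-b are made-up names), and the sketch assumes no other pods carrying these labels exist in the namespace.

```yaml
# Hypothetical pods; names, labels, and images are illustrative only.
apiVersion: v1
kind: Pod
metadata:
  name: svc-a
  labels:
    app: svc-a
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app: svc-b   # hard requirement: a svc-b pod must already be on the node
          topologyKey: kubernetes.io/hostname
  containers:
    - name: svc-a
      image: nginx:latest
---
apiVersion: v1
kind: Pod
metadata:
  name: svc-b
  labels:
    app: svc-b
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app: svc-a   # hard requirement: a svc-a pod must already be on the node
          topologyKey: kubernetes.io/hostname
  containers:
    - name: svc-b
      image: nginx:latest
```

Apply both to a cluster with no matching pods and each one waits on the other, so both stay Pending. Switching either side to preferredDuringSchedulingIgnoredDuringExecution breaks the cycle: that pod schedules anywhere, and the other can then follow it.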
Best practices, straight from the trenches:
- Prefer preferred over required. Use hard requirements only for truly mission-critical co-location. Your ability to deploy and scale will thank you.
- Keep your labelSelector broad. Being too specific can lead to the deadlock problem above. Maybe use app=redis instead of app=redis-cache and role=primary.
- Use Anti-Affinity for Spreading Replicas. This is the killer use case. To prevent all replicas of your stateful application from going down in a single node failure, use an anti-affinity rule to tell them to spread out.
# Example: Spread our API replicas across nodes
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api-server
  template:
    metadata:
      labels:
        app: api-server
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: api-server
              topologyKey: kubernetes.io/hostname
      containers:
        - name: api-server
          image: my-api:latest
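This deployment will flat-out refuse to schedule two api-server pods on the same node. It’s a hard rule for maximum resilience. The flip side: with three replicas you need at least three schedulable nodes, and any replica beyond the node count stays Pending.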
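Where a cluster is too small to guarantee one node per replica, a softer variant may be a better fit. The fragment below is a sketch of what the affinity block in the api-server pod template could look like with a preference instead of a hard rule; the weight of 100 is an assumption, not a recommendation.

```yaml
# Fragment for spec.template.spec of the api-server Deployment (sketch only).
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100   # illustrative value; any weight from 1 to 100 is valid
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: api-server
          topologyKey: kubernetes.io/hostname
```

With this version the scheduler still tries to spread replicas across nodes, but it will double up on a node rather than leave a pod Pending. You trade a guarantee for the ability to keep scaling.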
The Subtle Pitfall: Namespaces
By default, your affinity rules only look within the pod’s own namespace. This is usually what you want: you don’t want your pod demanding to be next to a database pod in some other team’s namespace. But if you do need that, you have to explicitly set a namespaces field (or a namespaceSelector) in the podAffinityTerm. Tread carefully here; it couples your deployments across namespace boundaries, which is often an architectural smell.
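For completeness, here is a sketch of a cross-namespace term. The namespace data-platform and the label app=postgres are hypothetical stand-ins; substitute whatever your cluster actually uses.

```yaml
# Fragment of a pod spec (sketch only; namespace and labels are hypothetical).
affinity:
  podAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: postgres        # hypothetical label on the database pods
          namespaces:
            - data-platform        # hypothetical namespace to search for matches
          topologyKey: topology.kubernetes.io/zone
```

The rule is kept soft on purpose here: if the other team renames a label or moves the database, your pod loses an optimization instead of getting stuck Pending.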
Pod affinity and anti-affinity are incredibly powerful tools for shaping your cluster’s topology. They move you from “I hope the scheduler figures it out” to “I am deliberately architecting my application for performance and resilience.” Just wield that power responsibly, or you’ll be the one debugging why half your pods are Pending.