34.4 Taints: Marking Nodes as Unsuitable for Certain Pods

Right, so you’ve got your Pods happily landing on any old node that has free space. Cute. But in the real world, some nodes are special. Maybe they’re expensive GPU machines, or they’re reserved for a critical database, or they’re a bit flaky and you only want test workloads on them. You don’t want just any Pod scheduling on them. This is where taints come in.

Think of a taint as a big, angry “KEEP OFF MY LAWN” sign posted on a node. It has three parts: a key, an optional value, and an effect. The effect is the most important part: it tells the scheduler what to do when a Pod shows up without an invitation. A Pod gets an invitation in the form of a toleration, which is basically a note that says, “Yeah, I see your ‘LAWN’ sign, but it’s cool, I’m with the band.”

The Three Taint Effects: NoSchedule, PreferNoSchedule, and NoExecute

Not all “keep out” signs are created equal. Kubernetes gives you three levels of severity:

NoSchedule: This is the strict bouncer. If a Pod doesn’t explicitly tolerate the taint, it will not be scheduled onto this node. At all. Existing Pods already running on the node before the taint was added? They get grandfathered in and keep running.
PreferNoSchedule: This is the polite request. The scheduler will try to avoid placing a Pod that doesn’t tolerate the taint on this node, but if it has no other choice (like if it’s the only node with resources), it’ll break the rules and schedule it anyway. It’s a soft constraint, useful for best-effort segregation.
NoExecute: This is the bouncer with an eviction notice. Not only will it prevent new non-tolerating Pods from being scheduled, it will also evict any existing Pods already running on the node that do not tolerate the taint. This is the nuclear option for getting something off a node now.
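
To make the three effects concrete, here is a rough sketch of how taints look once they are stored on a Node object, under spec.taints (roughly what kubectl get node <name> -o yaml would show). The node name and every key and value below are made up for illustration.

```yaml
# Sketch only: node name, keys and values are hypothetical.
apiVersion: v1
kind: Node
metadata:
  name: example-node
spec:
  taints:
  - key: "hardware"
    value: "gpu"
    effect: "NoSchedule"        # hard rule: new non-tolerating Pods are blocked
  - key: "reliability"
    value: "flaky"
    effect: "PreferNoSchedule"  # soft rule: avoided while other nodes still fit
  - key: "maintenance"
    effect: "NoExecute"         # blocks new Pods and evicts running ones; no value set
```

A node can carry several taints at once, and a Pod has to tolerate every hard one (NoSchedule or NoExecute) to be scheduled there.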

Tainting a Node: The “Keep Out” Sign

Let’s say you have a node, gpu-node-1, that you want to reserve only for Pods that need a GPU. You’d slap a NoSchedule taint on it.

kubectl taint nodes gpu-node-1 hardware=gpu:NoSchedule

Breaking that down: key=hardware, value=gpu, effect=NoSchedule. Now, try to deploy a regular nginx Pod. It won’t land on gpu-node-1. Go ahead, try it. I’ll wait.

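See? If gpu-node-1 was the only node with room, your Pod is now stuck in Pending because that node told it to get lost. (If other nodes have capacity, the Pod simply lands on one of them instead.) The scheduler event will literally tell you this if you run kubectl describe pod on it, with a FailedScheduling message along the lines of: 0/1 nodes are available: 1 node(s) had taint {hardware: gpu}, that the pod didn't tolerate. The exact wording varies a little between Kubernetes versions.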

Tolerating the Taint: The Golden Ticket

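To let your GPU-hungry Pod onto that node, you need to give it a matching toleration. Keep in mind that a toleration only unlocks the node; it doesn’t steer the Pod there. In this example, the nvidia.com/gpu limit is what restricts the Pod to nodes that actually offer a GPU. Here’s a Pod manifest that can handle our specific taint: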

apiVersion: v1
kind: Pod
metadata:
  name: cuda-app
spec:
  containers:
  - name: cuda-container
    image: nvidia/cuda:11.0-base
    resources:
      limits:
        nvidia.com/gpu: 1
  tolerations:
  - key: "hardware"
    operator: "Equal"
    value: "gpu"
    effect: "NoSchedule"

The operator field is crucial. Here we used Equal (which is also the default if you leave operator out), meaning the toleration’s value must exactly match the taint’s value. You can also use Exists, which is a bit of a sledgehammer: it means “tolerate any taint with this key, regardless of its value,” and you leave the value field out entirely. Use Exists carefully; it’s a great way to accidentally schedule your production web server on a node tainted for environment=test.
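
For comparison with the Equal toleration in the manifest, here is a sketch of the Exists form for the same hardware key. It is a fragment of a Pod spec, not a complete manifest.

```yaml
# Fragment of a Pod spec: tolerate the "hardware" taint whatever its value is.
tolerations:
- key: "hardware"
  operator: "Exists"     # matches hardware=gpu, hardware=fpga, or hardware with no value
  effect: "NoSchedule"   # still limited to NoSchedule taints
  # no `value` field: the API rejects a value when the operator is Exists
```

The key and effect still narrow the match, so this is a smaller sledgehammer than it first looks, as long as both are filled in.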

The Built-in Taints: When the System Itself is the Bouncer

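Kubernetes doesn’t just let you play this game; it uses taints itself. The most important ones are the node.kubernetes.io/not-ready and node.kubernetes.io/unreachable taints, which the node controller automatically applies with a NoExecute effect when a node becomes unhealthy. A Pod with no toleration for these would be evicted right away, but in practice almost no Pod is in that position: the DefaultTolerationSeconds admission plugin, enabled by default, adds a toleration for both taints with tolerationSeconds: 300 to any Pod that doesn’t already declare one.

This is why you’ll see this toleration on almost every Pod, whether it came from a Helm chart, a Deployment, or a plain kubectl apply. The five-minute grace period lets your Pods ride out a temporary network partition instead of being evicted at the first hiccup, and you can declare the toleration yourself if the default doesn’t suit you.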

tolerations:
- key: "node.kubernetes.io/unreachable"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 300 # Wait 5 minutes before evicting

The tolerationSeconds field is specific to NoExecute and tells Kubernetes, “I’ll tolerate this bad situation, but only for this long. After that, evict me so I can hopefully restart on a healthy node.” Leave it out of a NoExecute toleration and the Pod tolerates the taint indefinitely.
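
If five minutes is too long for a workload that should fail over quickly, the same kind of toleration can be declared with a shorter window for both node-health taints. This is a sketch; the 30-second figure is purely illustrative, not a recommendation.

```yaml
# Fragment of a Pod spec (or of a Deployment's Pod template).
tolerations:
- key: "node.kubernetes.io/not-ready"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 30   # illustrative: evict 30s after the taint appears
- key: "node.kubernetes.io/unreachable"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 30   # illustrative: same window for unreachable nodes
```

Very short windows trade stability for speed: a brief network blip can then be enough to trigger evictions.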

Common Pitfalls and How to Avoid Them

The Exists Operator Landmine: Using operator: Exists with an empty key and no effect tolerates every single taint on every node. This is almost never what you want. You’ve effectively disabled the entire tainting mechanism for that Pod. Be specific.
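Tainting and Forgetting: You tainted a node weeks ago to keep new Pods off it during maintenance. You’ve long forgotten. Now your new team member is pulling their hair out because their Pods won’t schedule. kubectl describe node is your friend: it shows all active taints. To take the sign back down, run the original taint command with a trailing hyphen, for example kubectl taint nodes gpu-node-1 hardware=gpu:NoSchedule-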
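Mismatched Effects: Your node is tainted with effect: NoExecute, but your Pod only tolerates effect: NoSchedule. Your Pod will get evicted. The effect is part of the match, so the two must be identical. The one exception is a toleration that leaves effect out altogether, which matches every effect for that key.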
Overusing NoExecute: Evicting Pods is a disruptive operation. Use NoExecute taints deliberately, usually for temporary maintenance or when a node is known to be fatally broken. For permanent segregation, NoSchedule is often gentler.
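
The landmine and the mismatched-effects pitfall are easiest to see side by side. The sketch below shows three separate toleration fragments, written as alternatives and not meant to be combined in one Pod; the hardware=gpu key and value are reused from the earlier example.

```yaml
# 1. The landmine: no key, no effect. Tolerates every taint on every node.
tolerations:
- operator: "Exists"
---
# 2. The mismatch: tolerates hardware=gpu:NoSchedule only. If the node carries
#    hardware=gpu:NoExecute instead, the Pod is kept off and, if running, evicted.
tolerations:
- key: "hardware"
  operator: "Equal"
  value: "gpu"
  effect: "NoSchedule"
---
# 3. No effect given: tolerates hardware=gpu taints with any of the three effects.
tolerations:
- key: "hardware"
  operator: "Equal"
  value: "gpu"
```

The third form is handy when one key is used with more than one effect, but it is also broader than it looks, so reach for it on purpose rather than by omission.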

The beauty of this system is its simplicity and power. You can create complex scheduling rules without writing a single line of custom scheduler code. You just post the signs and hand out the tickets. It’s one of those Kubernetes features that feels a bit arcane at first but quickly becomes indispensable.