11.8 Cluster Autoscaler: Adding and Removing Nodes

Right, so you’ve got your pods scaling horizontally like a well-rehearsed flash mob. But what happens when the entire party runs out of room? That’s where the Cluster Autoscaler (CA) comes in. Think of it as the pragmatic bouncer for your Kubernetes nightclub. HPA and VPA handle the guest list (pods), but when the club is at capacity, the CA is the one who calls the building manager to add a new floor or, when things quiet down, tells the unused floors they can go home. It doesn’t care how much CPU or memory your pods are actually burning; it cares about whether their resource requests fit on any node at all.

Its entire job is brutally simple: it watches for pods that are stuck in a Pending state because scheduling failed, specifically because no existing node has enough free CPU, memory, GPU, or other requested resources. When it sees that, it signals your cloud provider’s API to add a node to the cluster. Conversely, it also keeps checking whether any node is underutilized and whether all of its pods could be rescheduled onto other nodes. If so, it cordons and drains the node (evicting its pods) and then tells the cloud provider to terminate it.

The Gory Details of How It Works

It’s not magic; it’s a loop. The CA checks the state of the world every 10 seconds by default (configurable with --scan-interval, but don’t touch it). It looks for any unschedulable pods. The key here is that it only acts if adding a node would actually let the pod schedule, which in practice means the failure is due to insufficient resources. If your pod is Pending because you tried to mount a non-existent PersistentVolumeClaim, the CA wisely stays out of it. That’s your problem to fix.
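To make "unschedulable" concrete, here is roughly what the relevant part of a stuck pod looks like in the output of kubectl get pod <name> -o yaml. Treat it as an illustrative sketch: the node counts and the exact message wording are assumptions and vary by Kubernetes version, but the PodScheduled condition with reason Unschedulable is the signal to look for.

```yaml
# Excerpt of a Pending pod's status (illustrative values).
status:
  phase: Pending
  conditions:
  - type: PodScheduled
    status: "False"
    reason: Unschedulable
    # Message wording differs between Kubernetes versions; this is an example.
    message: "0/3 nodes are available: 3 Insufficient cpu."
```

An "Insufficient cpu" or "Insufficient memory" message is the kind of failure a new node can fix. A message about a missing PersistentVolumeClaim is not.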

Once it identifies a need, it doesn’t just blindly add one node. It evaluates the entire pending set, simulates which of your pre-defined node groups could fit those pods, and then picks a group to grow (and by how many nodes) using the configured expander strategy. It then scales up the node group, and the Kubernetes scheduler, which is also constantly running, will see the new capacity and schedule the pods onto it. The whole process, from pod going Pending to it running on a new node, can take a few minutes. This isn’t an SSD; it’s spinning up entire virtual machines. Patience.

Here’s the thing everyone gets wrong: the CA doesn’t schedule pods. It provides raw capacity. The kube-scheduler is the one who actually places pods on nodes. The CA just makes sure the scheduler has options.

Configuring This Beast Correctly

You don’t just turn it on; you give it rules of engagement. This is done by defining node groups, and how you do that is cloud-specific. For example, on AWS using EKS, you’d manage this through an Auto Scaling group (ASG) or, preferably, a Managed Node Group. The CA needs to know the minimum and maximum size of each node group it’s allowed to manipulate.

Let’s look at a common, albeit simplified, scenario. You have a node group for general-purpose workloads and one for memory-optimized ones. You’d tag them appropriately so the CA knows they exist.
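As a sketch of that scenario, here is how the two node groups might be declared with eksctl. The use of eksctl is an assumption (the same thing can be done with Terraform, CloudFormation, or the console), and the cluster name, region, instance types, and sizes are all hypothetical placeholders.

```yaml
# Hypothetical eksctl ClusterConfig; every name and size here is a placeholder.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: my-cluster
  region: us-west-2
nodeGroups:
  - name: general-purpose
    instanceType: m5.xlarge
    minSize: 2          # CA will never shrink the group below this
    maxSize: 10         # CA will never grow the group beyond this
    tags:
      # Tag keys that ASG auto-discovery can match on.
      k8s.io/cluster-autoscaler/enabled: "true"
      k8s.io/cluster-autoscaler/my-cluster: "owned"
  - name: memory-optimized
    instanceType: r5.xlarge
    minSize: 0
    maxSize: 5
    tags:
      k8s.io/cluster-autoscaler/enabled: "true"
      k8s.io/cluster-autoscaler/my-cluster: "owned"
```

Auto-discovery matches on the tag keys, so the second key has to contain your real cluster name. The min and max sizes are the hard limits the CA works within.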

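The most critical part of your configuration isn’t even in Kubernetes; it’s the IAM permissions for the CA’s pod. On AWS the CA never launches instances directly; it changes the desired capacity of the ASG and lets the ASG do the launching. So if it can’t call autoscaling:SetDesiredCapacity or autoscaling:TerminateInstanceInAutoScalingGroup, it’s just a very expensive watchdog that barks but can’t bite. This is the most common “it’s not working!” issue I see.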
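For reference, a minimal policy for the CA on AWS looks something like the following, written here as a CloudFormation fragment. This is a sketch, not a definitive list: the resource name is hypothetical, the action set is based on what the upstream AWS provider documentation has listed, and newer CA releases may need a few more read-only actions, so check the README for your version.

```yaml
# Hypothetical CloudFormation fragment; the resource name is a placeholder.
Resources:
  ClusterAutoscalerPolicy:
    Type: AWS::IAM::ManagedPolicy
    Properties:
      PolicyDocument:
        Version: "2012-10-17"
        Statement:
          # Read-only calls used to discover node groups and their shapes.
          - Effect: Allow
            Action:
              - "autoscaling:DescribeAutoScalingGroups"
              - "autoscaling:DescribeAutoScalingInstances"
              - "autoscaling:DescribeLaunchConfigurations"
              - "autoscaling:DescribeTags"
              - "ec2:DescribeLaunchTemplateVersions"
              - "ec2:DescribeInstanceTypes"
            Resource: "*"
          # The two calls that actually scale up and scale down.
          - Effect: Allow
            Action:
              - "autoscaling:SetDesiredCapacity"
              - "autoscaling:TerminateInstanceInAutoScalingGroup"
            Resource: "*"
```

In a real cluster you would likely narrow the second statement to ASGs carrying your cluster's tags, and attach the policy to the role the CA pod assumes.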

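# This is a snippet for the Cluster Autoscaler deployment itself.
# Note the critical command-line arguments (image, resources, etc. omitted).
spec:
  template:
    spec:
      containers:
      - name: cluster-autoscaler
        command:
        - ./cluster-autoscaler
        - --cloud-provider=aws
        - --namespace=kube-system
        # Discover ASGs by tag instead of listing each node group by hand.
        - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/<YOUR_CLUSTER_NAME>
        # Keep similar node groups (e.g., one per AZ) at roughly equal sizes.
        - --balance-similar-node-groups
        # Strategy for choosing among node groups that could all fit the pending pods.
        - --expander=random
        # Log verbosity.
        - --v=4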

Let’s talk about those flags. --balance-similar-node-groups is brilliant: it tells the CA to keep the sizes of similar node groups (e.g., two identical worker node groups in different AZs) roughly equal, which is great for high availability. --expander=random tells it how to choose if multiple node groups could solve the problem. ‘random’ is fine; other options include ‘most-pods’ (pick the group that schedules the most pending pods) and ‘least-waste’ (pick the group that leaves the least idle CPU and memory after scale-up). And --v=4 gives you useful debug logs without spamming you into oblivion.

The Pitfalls and “Oh Crap” Moments

Pod Disruption Budgets (PDBs) are its kryptonite. The CA will not remove a node if doing so would violate a PDB. This is correct but can leave you with underutilized nodes forever. Check your PDBs if scale-down isn’t happening.
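“How do I stop it from scaling down my test cluster at 5 PM on a Friday?” Put the cluster-autoscaler.kubernetes.io/safe-to-evict: "false" annotation on the pods you want left alone; the CA will not remove a node that is running such a pod. Setting it to "true" does the opposite: it tells the CA it may evict pods it would otherwise refuse to touch, such as pods that aren’t part of a controller (like a bare Pod resource) or that use local storage (like hostPath or emptyDir). To protect a whole node, annotate the node itself with cluster-autoscaler.kubernetes.io/scale-down-disabled=true. Cordoning is not a reliable substitute; it only stops new pods from landing on the node.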
The DaemonSet Dilemma: Every node has to run your DaemonSet pods (e.g., log collectors, monitoring agents). The CA knows this, so DaemonSet pods on their own do not block the removal of a node. They still cost you, though. By default their requests count toward the node’s utilization (tunable with --ignore-daemonsets-utilization), so a heavy DaemonSet can keep an otherwise idle node above the scale-down threshold. And if your DaemonSet requests a gigabyte of memory, that’s a gigabyte on every node that your real workloads can never use. Make your DaemonSets lean.
The Slow Provider API: Cloud APIs have rate limits and occasional latency. If the CA can’t check the state of your node groups or call for a scale-up fast enough, your pods will be pending longer. This is a reality of multi-tenancy in the cloud.
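Two of these pitfalls are easier to see in manifest form. The sketch below shows a PDB that will quietly block scale-down and a Deployment whose pods are pinned with the safe-to-evict annotation. All names, labels, the image, and the replica counts are hypothetical; only the API fields and the annotation key are real.

```yaml
# A PDB that demands 2 available pods. If the matching workload only runs
# 2 replicas, no voluntary eviction is ever allowed, so the CA cannot drain
# the nodes those pods sit on.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: web
---
# A Deployment whose pods tell the CA "do not evict me". Any node running
# one of these pods is excluded from scale-down.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: batch-worker
spec:
  replicas: 1
  selector:
    matchLabels:
      app: batch-worker
  template:
    metadata:
      labels:
        app: batch-worker
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
    spec:
      containers:
        - name: worker
          image: registry.example.com/batch-worker:1.0
```

If scale-down is mysteriously stuck, compare each PDB's minAvailable against the replica count of the workload it selects, and grep your pod templates for this annotation.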

The Cluster Autoscaler is the silent workhorse that makes your elastic infrastructure actually elastic. It’s not glamorous, but when it works, you forget it’s there. And when it doesn’t, you’ll be neck-deep in its logs faster than you can say “why are my nodes not terminating?” Trust me, I’ve been there. Check the IAM roles first. It’s always the IAM roles.