39.2 Node Pools and Virtual Machine Scale Sets

Right, let’s talk about the actual workers in your AKS cluster. You’ve got your control plane managed by Azure (which is a blessing, trust me), but the nodes—the VMs where your pods actually run—are your responsibility. And in AKS, you don’t manage individual nodes; you manage node pools, which are backed by the real hero (or sometimes villain) of Azure compute: Virtual Machine Scale Sets (VMSS).

Think of a node pool as a group of identical worker bees. They all have the same CPU, memory, OS, and often, the same labels and taints. You define the hive, and Azure scales the number of bees for you. Under the hood, this hive is a VMSS. It’s the Azure infrastructure service that allows you to create and manage a group of identical, load-balanced VMs. The AKS team chose VMSS because it’s the native way to get fast, reliable scaling and automated repairs. Trying to do this with individual VMs would be a nightmare they wisely decided to spare you.

The Default System Node Pool and Why You Shouldn’t Trust It

When you run az aks create, it helpfully creates a “system” node pool for you. It’s usually a modest 3 nodes. This is where system pods like CoreDNS and metrics-server live. Here’s the first questionable choice you’re likely to make: leaving critical system pods on the same nodes as your application workloads. That default pool carries no taint, so anything you deploy lands right next to them. It’s a recipe for noisy-neighbor problems. If your memory-hogging app gets the DNS pod evicted, your entire cluster starts having an identity crisis. The best practice? Create a dedicated system node pool first thing.

# Create a new node pool specifically for system-critical pods.
# Burstable B-series sizes are not recommended for system pools, so this uses a small D-series SKU.
az aks nodepool add \
    --resource-group myResourceGroup \
    --cluster-name myAKSCluster \
    --name systempool \
    --node-count 2 \
    --node-vm-size Standard_D2s_v3 \
    --labels critical-addons=true \
    --node-taints CriticalAddonsOnly=true:NoSchedule \
    --mode system

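Notice the --mode system? That marks the pool as a home for those core system pods, which AKS steers toward system-mode pools. The taint (CriticalAddonsOnly=true:NoSchedule) is crucial: it prevents your application pods from being scheduled here unless they explicitly tolerate it. Your application node pools should be created with --mode user. Once the dedicated pool is healthy, switch the original default pool to user mode (or delete it), otherwise system pods can still end up there.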

User Node Pools: Choosing Your Weapon

This is where you run your actual applications. You’ll likely have multiple user node pools for different workloads. The beauty here is specialization. Need a pool for GPU-intensive AI workloads? Make one with a GPU SKU such as Standard_NC6s_v3 (older sizes like Standard_NC6 have been retired, so check what your region offers). Have a bunch of background processing that needs a ton of RAM but not much CPU? Standard_E4s_v3 is your friend. This separation is how you control cost and performance.

# Create a user node pool for memory-intensive applications
az aks nodepool add \
    --resource-group myResourceGroup \
    --cluster-name myAKSCluster \
    --name mempool \
    --node-count 3 \
    --node-vm-size Standard_E4s_v3 \
    --labels workload=memory-intensive \
    --tags Environment=Production

The --labels are your friend. Use them liberally to guide your pod deployments with nodeSelector. The --tags apply to the underlying Scale Set resources, which is great for cost tracking.
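
As a rough sketch of how that label gets used, the snippet below writes a Deployment manifest that pins its pods to the mempool nodes through nodeSelector. The name report-worker, the image, and the resource requests are hypothetical placeholders; only the workload=memory-intensive label comes from the pool created above.

```shell
# Write a Deployment that only schedules onto nodes labeled workload=memory-intensive.
# "report-worker", the image, and the requests are made-up values for illustration.
cat > report-worker.yaml <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: report-worker
spec:
  replicas: 2
  selector:
    matchLabels:
      app: report-worker
  template:
    metadata:
      labels:
        app: report-worker
    spec:
      nodeSelector:
        workload: memory-intensive   # matches --labels on the mempool node pool
      containers:
        - name: worker
          image: myregistry.azurecr.io/report-worker:1.0
          resources:
            requests:
              memory: "8Gi"
              cpu: "500m"
EOF
# Apply it with: kubectl apply -f report-worker.yaml
```

If no node carries the label, the pods simply stay Pending, which is usually the first hint that a label was mistyped on either side.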

The VMSS Integration: The Good, The Bad, and The “Oh, Azure”

The VMSS integration is mostly brilliant. Need to scale? You change the node count on the pool, and the VMSS handles the provisioning or deprovisioning of instances. A node goes unhealthy? AKS node auto-repair notices when a node stays NotReady for several minutes and tries a reboot, then a reimage, then a redeploy of the underlying instance. This is the managed part of AKS working for you.
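
Changing the node count can be sketched as a tiny wrapper around az aks nodepool scale. The function name scale_pool and its argument order are assumptions made for this example; the az flags themselves are the standard ones.

```shell
# Hypothetical helper: scale_pool <resource-group> <cluster> <pool> <node-count>
# Thin wrapper around "az aks nodepool scale"; the scale set adds or removes instances to match.
scale_pool() {
    az aks nodepool scale \
        --resource-group "$1" \
        --cluster-name "$2" \
        --name "$3" \
        --node-count "$4"
}

# Example (not executed here): scale_pool myResourceGroup myAKSCluster mempool 5
```

Scaling down removes nodes, so anything running on them gets rescheduled; make sure the remaining nodes have room.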

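But here’s the rough edge: you can’t just SSH into a node from your laptop, because by default the nodes have no public IPs. The designers made a choice for security and network simplicity that can be frustrating when you’re debugging. The nodes live in a private VNet. To get in, you have to jump through a bastion host, start a debug pod with kubectl debug node/<node-name>, or use az vmss run-command invoke against the underlying scale set, which is Azure’s way of saying, “Fine, but do it my way.”

# Execute a command directly on a node without needing SSH.
# The scale set lives in the node resource group (MC_...), so look that up first.
NODE_RG=$(az aks show --resource-group myResourceGroup --name myAKSCluster --query nodeResourceGroup -o tsv)
# Takes the first scale set in that group; pick the one for your pool if you have several.
VMSS_NAME=$(az vmss list --resource-group "$NODE_RG" --query "[0].name" -o tsv)
# Instance IDs are not always 0; list them with az vmss list-instances.
az vmss run-command invoke \
    --resource-group "$NODE_RG" \
    --name "$VMSS_NAME" \
    --instance-id 0 \
    --command-id RunShellScript \
    --scripts "sudo cat /var/log/cloud-init-output.log" # Often gold for debugging provisioning issues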

Another pitfall: VMSS OS disk sizing. The default OS disk size depends on the VM size and is 128 GB for the smaller SKUs. If you’re running container images that layer like a wedding cake or apps that write massive logs to the root filesystem, you will fill this up. The kubelet starts evicting pods under disk pressure, and eventually the node becomes unusable. Always specify --node-osdisk-size with a reasonable value (e.g., 256; the flag takes a plain integer in GiB) when creating pools for production workloads.
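
To see how close a node is to the edge, a quick check along these lines can be run on the node itself (for example from a debug session). It is only a sketch: the 80 percent threshold is an arbitrary assumption, and it looks at whatever filesystem is mounted at / in the shell where it runs.

```shell
# Report root filesystem usage and warn above a threshold.
THRESHOLD=80   # percent; arbitrary value chosen for illustration

# df -P prints one line per filesystem; column 5 is the "Capacity" percentage.
usage=$(df -P / | awk 'NR==2 { sub("%", "", $5); print $5 }')

if [ "$usage" -ge "$THRESHOLD" ]; then
    echo "WARNING: root filesystem is ${usage}% full"
else
    echo "OK: root filesystem is ${usage}% full"
fi
```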

Spot Node Pools: The Ultimate Cheap Thrill

This is where we embrace the absurdity of cloud economics. Azure Spot VMs are unused capacity you can rent for up to 90% off. The catch? Azure can evict them with a 30-second warning whenever they need the capacity back. It’s perfect for stateless, fault-tolerant batch jobs, CI/CD runners, or any workload that can be interrupted. Using them in AKS is a no-brainer for cost savings, but you must design your applications to handle sudden termination gracefully.
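
What “gracefully” can look like, in miniature: a worker that checks a shutdown flag between units of work and stops at a clean boundary. This is a sketch under the assumption that the container receives SIGTERM before the node disappears, which is best effort at most inside a 30-second window. The function name process_items and the job names are hypothetical.

```shell
# Hypothetical batch worker that stops at a clean boundary when asked to terminate.
shutdown_requested=0
trap 'shutdown_requested=1' TERM   # Kubernetes sends SIGTERM when a pod is being stopped

process_items() {
    for item in "$@"; do
        if [ "$shutdown_requested" -eq 1 ]; then
            # Report where we stopped so a replacement pod can resume from here.
            echo "stopping before $item"
            return 0
        fi
        echo "processed $item"   # stand-in for the real unit of work
    done
}

process_items job-1 job-2 job-3
```

The point of the design is that each unit of work is small and restartable, so losing the node costs you one item rather than the whole batch.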

# Create a spot node pool for interruptible workloads.
# --eviction-policy Delete is what you want: Deallocate would just stop the VM and still charge you for the disks.
# --spot-max-price -1 means never evict on price: you pay the current spot price, capped at the on-demand price.
az aks nodepool add \
    --resource-group myResourceGroup \
    --cluster-name myAKSCluster \
    --name spotpool \
    --priority Spot \
    --eviction-policy Delete \
    --spot-max-price -1 \
    --node-vm-size Standard_DS2_v2 \
    --labels workload=interruptible

The key here is --priority Spot and --eviction-policy Delete. Your pods won’t land on these nodes at all unless they carry a toleration for the kubernetes.azure.com/scalesetpriority=spot:NoSchedule taint that AKS automatically adds. A PodDisruptionBudget helps during ordinary drains but cannot stop an eviction, so run enough replicas to lose a node without noticing. Never put anything critical here, but for the right workload, the savings are frankly ridiculous. It’s like cloud arbitrage.
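
A minimal sketch of a workload that opts in to the spot pool: the Job below tolerates the spot taint (note the lowercase spot value) and selects the workload=interruptible label from the pool above. The name nightly-batch, the image, and the backoffLimit are hypothetical placeholders.

```shell
# Write a Job that tolerates the spot taint and targets the spot pool's label.
# "nightly-batch", the image, and backoffLimit are made-up values for illustration.
cat > nightly-batch.yaml <<'EOF'
apiVersion: batch/v1
kind: Job
metadata:
  name: nightly-batch
spec:
  backoffLimit: 6            # allow retries if an eviction kills the pod mid-run
  template:
    spec:
      restartPolicy: OnFailure
      nodeSelector:
        workload: interruptible
      tolerations:
        - key: kubernetes.azure.com/scalesetpriority
          operator: Equal
          value: spot
          effect: NoSchedule
      containers:
        - name: batch
          image: myregistry.azurecr.io/nightly-batch:1.0
EOF
# Apply it with: kubectl apply -f nightly-batch.yaml
```

The toleration only permits scheduling onto spot nodes; the nodeSelector is what actually steers the pods there.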