11.1 HPA: Scaling Based on CPU, Memory, and Custom Metrics

Alright, let’s talk about making your applications bend instead of break under pressure. We’re moving past the stone age of static replica counts. You don’t want to pay your cloud provider for a fleet of sleeping Pods, and manually scaling with kubectl scale is a party trick, not a strategy. Enter the Horizontal Pod Autoscaler (HPA), your automated, albeit occasionally dim, bartender who tops up your drinks (Pods) based on how thirsty (busy) your patrons are.

The core concept is brilliantly simple: you tell the HPA what metric to watch (e.g., CPU usage) and what target value to aim for (e.g., 50% utilization). The HPA’s job is to keep doing math so that the current metric value converges on your target, adjusting the number of Pod replicas as it goes. It’s a control loop, re-evaluated every 15 seconds by default. If CPU averages 75% against a 50% target, it needs 1.5x as many Pods (75 / 50 = 1.5), rounded up to a whole Pod, so it scales up. The same ratio, run in reverse, drives scaling down.
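
Written out, the math looks roughly like this. The formula follows the algorithm described in the Kubernetes documentation; the starting count of 4 replicas is an assumption picked purely for illustration.

```latex
% General form of the replica calculation
\[
\text{desiredReplicas} =
  \left\lceil \text{currentReplicas} \times
  \frac{\text{currentMetricValue}}{\text{targetMetricValue}} \right\rceil
\]

% Worked example: 4 replicas (assumed) averaging 75% CPU against a 50% target
\[
\left\lceil 4 \times \frac{75}{50} \right\rceil = \lceil 6 \rceil = 6
\]
```

The result is then clamped between minReplicas and maxReplicas. By default the HPA also skips the change entirely when the ratio is within 10% of 1.0, which keeps it from twitching over tiny fluctuations.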

The Basic CPU-Based Scaling Recipe

Let’s start with the classic: CPU scaling. This is the hello-world of HPA, but it’s fraught with tiny landmines. Here’s a minimal but functional example. First, you need something that actually measures your Pods’ CPU usage. This doesn’t come from magic; it comes from the metrics-server, which you absolutely must have installed. If kubectl top pods doesn’t work, stop here and go install it. I’ll wait.

Now, assume you have a Deployment named my-stressed-api. The key is to define resource requests in your Pod spec. The HPA calculates CPU utilization as a percentage of the requested amount, not the limit or the node’s capacity. If you don’t set a request, there is no denominator: the HPA can’t compute a utilization, shows the target as <unknown>, and refuses to scale on that metric. Don’t do that.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-stressed-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-stressed-api
  template:
    metadata:
      labels:
        app: my-stressed-api
    spec:
      containers:
      - name: api
        image: my-api:latest
        resources:
          requests:
            cpu: "250m"  # <- This right here is CRITICAL for HPA
            memory: "512Mi"
          limits:
            cpu: "500m"
            memory: "1Gi"

With that running, you create the HPA itself. This YAML tells Kubernetes: “Hey, watch the Pods for this Deployment and keep the average CPU utilization across all of them at 50%.”

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-stressed-api
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50

Apply that, and then watch the magic (and occasional horror) unfold with kubectl get hpa my-api-hpa -w. The TARGETS column shows current versus target CPU utilization, and the REPLICAS column shows how many Pods the HPA is currently running.
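
As a sanity check on what you should expect to see, here is the arithmetic for the Deployment and HPA above. The 200m average usage per Pod is a made-up number for illustration; the 250m request, 3 replicas, and 50% target come from the manifests.

```latex
% Utilization is measured against the request, not the limit
\[
\text{utilization} = \frac{\text{average usage}}{\text{request}} \times 100
  = \frac{200\text{m}}{250\text{m}} \times 100 = 80\%
\]

% Feeding that into the replica calculation with 3 current replicas
\[
\text{desiredReplicas} = \left\lceil 3 \times \frac{80}{50} \right\rceil
  = \lceil 4.8 \rceil = 5
\]
```

Note that the same 200m measured against the 500m limit would read as a sleepy 40%, which is exactly why the request is the number that matters here.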

Why Memory Scaling is a Sneaky Trap

You can absolutely scale on memory, and the YAML looks almost identical. Just swap cpu for memory. But be warned: memory-based scaling is inherently dangerous and often a bad idea.

Why? Think about how an application uses memory. It often allocates a chunk and holds onto it (e.g., a cache). The usage goes up and… stays there. It doesn’t necessarily free memory under lower load. So if your app spikes in memory usage, the HPA will scale you out. But adding Pods doesn’t make the existing ones hand memory back, and when the load drops, the memory usage per Pod might remain high. The HPA can’t scale you back down because the average is still over the target, leaving you stuck with a huge, expensive bill for Pods that aren’t doing any real work. It’s a one-way ticket to Cloud Bankruptcy. Use memory scaling only if your application’s memory usage is highly volatile and closely correlated with request volume, which is rare.
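
If you have read all that and still want it, here is a sketch of what the memory variant might look like, added as a second entry in the metrics list of the earlier HPA rather than as a separate HPA fighting over the same Deployment. The 70% target is an assumed value, not a recommendation.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-stressed-api
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50
  - type: Resource
    resource:
      name: memory            # measured against the memory request (512Mi above)
      target:
        type: Utilization
        averageUtilization: 70  # assumed target, for illustration only
```

With more than one metric listed, the HPA computes a desired replica count for each and uses the largest, so a sticky memory reading can hold you at a high replica count even when CPU is idle.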

The Big Leagues: Custom Metrics

This is where HPA gets its real power. CPU and memory are crude instruments. What you really care about is your application’s business metrics: requests per second, queue depth, error rate, or the number of angry customer tweets. This is done via custom metrics.

The setup is more complex because you need an entire metrics pipeline, usually involving Prometheus and the Prometheus Adapter for Kubernetes. The adapter’s job is to run Prometheus queries and serve the results through the Kubernetes custom metrics API (custom.metrics.k8s.io), which is where the HPA looks up Pods and Object metrics.

Let’s say you have a Prometheus metric called http_requests_per_second. You configure the adapter to expose that. Then, your HPA manifest can target that value directly. This YAML says: “Scale so that each Pod is handling roughly 100 RPS.”

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-api-rps-hpa  # hypothetical name; pick your own
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-api
  minReplicas: 1
  maxReplicas: 20
  metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second # This is the magic sauce
      target:
        type: AverageValue
        averageValue: 100

This is infinitely more powerful. You’re scaling on what actually matters, not a proxy like CPU. The downside? You now have to manage and understand a metrics pipeline. Welcome to the trenches.
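
For a taste of those trenches, here is a sketch of a Prometheus Adapter rule that could expose http_requests_per_second. It assumes your app exports a counter named http_requests_total with namespace and pod labels; that counter name, the label names, and the 2-minute rate window are all assumptions, so adjust them to whatever your pipeline really emits.

```yaml
# Goes in the adapter's configuration file (typically mounted from a ConfigMap)
rules:
- seriesQuery: 'http_requests_total{namespace!="",pod!=""}'  # assumed counter name
  resources:
    overrides:
      namespace: {resource: "namespace"}  # map the "namespace" label to Namespaces
      pod: {resource: "pod"}              # map the "pod" label to Pods
  name:
    matches: "^(.*)_total$"
    as: "${1}_per_second"                 # http_requests_total -> http_requests_per_second
  metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'
```

Once the adapter reloads, kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 should list the metric. If it doesn’t show up there, the HPA won’t find it either.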

The Devilish Details: Pitfalls and Best Practices

The HPA is not a precision instrument. It’s more of a blunt club. Here’s what they don’t always tell you:

Stabilization Window: The HPA is paranoid about scaling down too quickly and causing flapping. By default it looks back over a 5-minute stabilization window and only scales down to the highest replica count it recommended during that window. If your traffic has sharp, short-lived spikes followed by calm, you’ll be over-provisioned for at least those 5 minutes. You can tune this with the behavior field, but know what you’re doing.
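Thundering Herd: When a new Pod starts, it often has cold caches and no warm-up. If you scale out 5 new Pods and they immediately get slammed with traffic, they might all crash, causing the HPA to panic and scale out even more. It’s a feedback loop of death. Use readiness probes (and startup probes for slow starters) so new Pods only receive traffic once they can handle it, and cap how fast the HPA adds Pods with a scale-up policy in the behavior field. Pod Disruption Budgets won’t save you here; they only limit voluntary evictions.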
Cron is your friend: The HPA is reactive. Sometimes you need to be proactive. If you know a big batch job runs at 2 AM every night, use a simple CronJob to raise the HPA’s minReplicas beforehand and lower it again after. Scaling the Deployment directly won’t stick, because the HPA will overwrite your replica count with its own. Don’t wait for the HPA to figure it out while your users are screaming.
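
To make the behavior field less abstract, here is a sketch of the earlier CPU-based HPA with both directions tuned. Every number under behavior is an assumed value chosen for illustration, not a recommendation.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-stressed-api
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 120  # assumed; the default is 300
      policies:
      - type: Percent
        value: 50          # remove at most 50% of current replicas...
        periodSeconds: 60  # ...per 60-second period
    scaleUp:
      stabilizationWindowSeconds: 0    # react to load immediately
      policies:
      - type: Pods
        value: 2           # add at most 2 Pods...
        periodSeconds: 60  # ...per 60-second period
```

Shortening the scale-down window saves money after spikes but invites flapping. Capping scale-up gives new Pods time to warm up but slows your response to a real surge. Pick your poison deliberately.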

The HPA is a fantastic tool, but it’s not a “set it and forget it” solution. You must understand its quirks, monitor its decisions (kubectl describe hpa is your best friend), and complement it with other strategies. Now go make your infrastructure elastic. And for the love of all that is holy, set those CPU requests.