11.4 HPA Behavior: Scale-Up and Scale-Down Stabilization

Alright, let’s talk about what happens after the HPA calculates it needs to scale. The raw metric says “we need 10 pods, NOW!” If we just blindly obeyed that command every polling interval, we’d be creating a chaotic mess. Pods would be frantically scaling up and down like a hyperactive yo-yo, your cluster’s control plane would weep, and your application’s performance would be a jagged nightmare of cold starts and sudden load drops. This is where behavior comes in—it’s the built-in shock absorber and common sense that prevents your cluster from having a panic attack.

Think of it this way: the raw metric calculation is your lizard brain, screaming “DANGER!” or “FOOD!”. The behavior configuration is your prefrontal cortex, saying “Whoa, hold on, let’s think about this for a second and not just punch the wall / eat the entire cake.”

The Stabilization Window: Your Panic Button Cooldown

The core mechanism for this sanity is the stabilization window. In both directions it works as a proving period: the HPA only acts on a change that every recommendation inside the window supports.

When the HPA recommends a new desired replica count, it doesn’t just look at the very last metric value. It looks at all the recommended values from inside this window and chooses the most conservative one for the direction it wants to move. For scale-ups, it takes the minimum recommendation from the window. If one sample screams “18 pods!” but the ones around it say 12, we go to 12. A single spiky reading can’t trigger a big jump; the demand has to persist. (By default the scale-up window is 0 seconds, so scale-ups react immediately unless you set one.)

For scale-downs, it takes the maximum recommendation. This is profoundly important, and it’s the “better safe than sorry” half of the design. If we had a massive traffic spike two minutes ago and a quieter reading now, we stay sized for the spike until it ages out of the window. The HPA has to be convinced, for the entire duration of the window, that scaling down is safe. A single metric sample suggesting a scale-down isn’t enough; it has to be a sustained trend. This saves you from scaling down too aggressively just because of a momentary dip in traffic.
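
To make the selection rule concrete: for each direction the HPA compares its recent recommendations and keeps the most conservative one, which is the highest value when scaling down and the lowest when scaling up. The fragment below traces two separate, made-up scenarios in comments. The recommendation values and the starting replica counts are assumptions for illustration only.

```yaml
behavior:
  scaleDown:
    stabilizationWindowSeconds: 300
    # Scenario A (hypothetical): 10 replicas running, and the
    # recommendations computed over the last 300 seconds were:
    #   240s ago: 10
    #   120s ago: 7
    #   now:      6
    # Highest value in the window is 10, so the HPA stays at 10.
    # It can drop to 7 only after the "10" sample ages out.
  scaleUp:
    stabilizationWindowSeconds: 60
    # Scenario B (hypothetical): 10 replicas running, and the
    # recommendations computed over the last 60 seconds were:
    #   45s ago: 12
    #   30s ago: 18
    #   now:     15
    # Lowest value in the window is 12, so the HPA targets 12, not 18.
```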

Here’s what this looks like in the YAML. The scaleUp and scaleDown stanzas are where the magic happens.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 3
  maxReplicas: 100
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Pods
          value: 4
          periodSeconds: 15
    scaleDown:
      stabilizationWindowSeconds: 300  # A much longer, more cautious window
      policies:
        - type: Percent
          value: 10
          periodSeconds: 60

Policies: The Rate Limiter

Policies are the other half of the equation. While the stabilization window decides which recommendation to trust, policies dictate how quickly we can act on it. They are a rate-of-change limiter.

In the scaleUp policy above, I’ve set it to allow adding a maximum of 4 pods every 15 seconds (type: Pods). Even if the stabilization window calculation says we need to jump from 10 to 50 pods right now, the policy will break that into smaller, less jarring steps. We’ll go 10 -> 14 -> 18… and so on. This prevents a “thundering herd” of new pods all starting at once and overwhelming your cluster’s resources (and your application’s initialization routines).
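
Here is a rough trace of that 10-to-50 example. It assumes the HPA re-evaluates about every 15 seconds (the controller's default sync period), that the recommendation stays at 50 the whole time, and it ignores the stabilization window for simplicity.

```yaml
scaleUp:
  policies:
    - type: Pods
      value: 4            # at most 4 pods added ...
      periodSeconds: 15   # ... within any 15-second period
# Approximate trace toward a recommendation of 50:
#   t=0s     10 -> 14
#   t=15s    14 -> 18
#   t=30s    18 -> 22
#   ...
#   t=135s   46 -> 50
# Ten steps of 4 pods; the gap closes a little over two minutes
# after the first step.
```

If two minutes is too slow for your traffic pattern, raise value or shorten periodSeconds rather than removing the policy outright.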

The scaleDown policy uses type: Percent, which is often the better fit for reducing capacity because the step shrinks along with the fleet. It will only remove a maximum of 10% of the current replicas every 60 seconds. If you have 100 pods, it will remove 10 at a time, not 50. This is a fantastic safety net. If your metrics are wrong or there’s a brief fluctuation, you only lose a small percentage of your capacity, not half your fleet. It’s the difference between a controlled bleed and an amputation.
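
Policies can also be combined. As a sketch with illustrative numbers that are not part of the manifest above: listing two scale-down policies and setting selectPolicy: Min tells the HPA to apply whichever one permits the smaller change. The default, Max, picks the larger.

```yaml
scaleDown:
  stabilizationWindowSeconds: 300
  selectPolicy: Min          # apply the policy that removes the fewest pods
  policies:
    - type: Percent
      value: 10              # at most 10% of replicas per minute ...
      periodSeconds: 60
    - type: Pods
      value: 5               # ... and never more than 5 pods per minute
      periodSeconds: 60
```

With 100 replicas, the Percent policy alone would allow 10 removals per minute, and the Pods policy caps that at 5. With 20 replicas, the Percent policy becomes the tighter limit at 2.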

Why The Defaults Are The Way They Are (And When to Change Them)

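The Kubernetes designers actually got this pretty right. The default scaleDown stabilization window is 5 minutes (300s), while the default scaleUp window is 0 seconds. The default scale-down policy is percentage-based, but it allows removing up to 100% of replicas every 15 seconds, so out of the box the window is the only real brake on scale-downs. The asymmetry is deliberate: the cost of a bad scale-down decision is usually much higher than a bad scale-up decision. A bad scale-up costs you a bit of extra money for a few minutes. A bad scale-down can cause immediate latency spikes and dropped requests.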
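
For reference, this is the behavior the Kubernetes documentation describes when the behavior field is omitted, written out explicitly. The 300-second figure comes from the controller manager's downscale stabilization setting, so a cluster operator may have changed it on your cluster.

```yaml
behavior:
  scaleDown:
    stabilizationWindowSeconds: 300
    policies:
      - type: Percent
        value: 100           # may remove every surplus replica at once
        periodSeconds: 15
  scaleUp:
    stabilizationWindowSeconds: 0
    selectPolicy: Max        # use whichever policy allows the bigger step
    policies:
      - type: Percent
        value: 100           # double the replica count ...
        periodSeconds: 15
      - type: Pods
        value: 4             # ... or add 4 pods, whichever is more
        periodSeconds: 15
```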

You should make these settings more aggressive (shorter windows, higher pod/percent values) if your application is truly, predictably stateless and starts up in milliseconds. Think of a simple HTTP proxy. You can afford to react faster in both directions.
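
As a starting point for that kind of workload, something like the following could work. Every number here is an assumption to tune against your own traffic, not a recommendation from the Kubernetes project.

```yaml
behavior:
  scaleUp:
    stabilizationWindowSeconds: 0    # react to spikes immediately
    policies:
      - type: Percent
        value: 100                   # allow doubling every 15 seconds
        periodSeconds: 15
  scaleDown:
    stabilizationWindowSeconds: 60   # one quiet minute is enough proof
    policies:
      - type: Percent
        value: 50                    # shed up to half the fleet per step
        periodSeconds: 30
```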

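You should make them more conservative (longer windows, lower percent values) for applications with slow startup times, warm-up caches, or long-lived connections. If your pod takes 90 seconds to be ready, scaling down 50% of them in one go is a fantastic way to have a very bad time. The default 5-minute window is a great starting point for these stateful-ish workloads. It keeps warmed-up pods around through short lulls, so you aren’t paying the startup cost again when traffic returns a minute later. Keep in mind that the window only delays the decision to remove pods; draining connections on the pods that do get removed is up to each pod’s termination grace period and shutdown handling.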
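
A cautious variant for a slow-starting service might look like this. The values are hypothetical, chosen only to show the shape: scale-up stays quick because new pods take a while to become useful, while scale-down is slow and small.

```yaml
behavior:
  scaleUp:
    stabilizationWindowSeconds: 0     # start new pods early; they warm up slowly
    policies:
      - type: Pods
        value: 4
        periodSeconds: 15
  scaleDown:
    stabilizationWindowSeconds: 600   # require ten quiet minutes
    policies:
      - type: Pods
        value: 2                      # remove at most 2 pods ...
        periodSeconds: 120            # ... every two minutes
```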

The key takeaway? Don’t just rely on the defaults. Think about your app’s personality—is it a jittery, stateless sprinter or a heavy, stateful marathoner?—and configure the HPA’s behavior accordingly. It’s the difference between a smooth ride and a series of uncontrollable jerks.