7.7 Scaling: Manual and with HPA

Right, so you’ve got your Pods running. They’re beautiful, they’re perfect, and they’re currently a single point of failure. You and I both know that’s not going to fly. This is where we graduate from just keeping things alive to actually managing how many of them are alive. We’re going to talk about scaling, and we’ll do it in two ways: the way you tell it what to do (manual), and the way it figures things out for itself (with the Horizontal Pod Autoscaler, or HPA). This is where your deployment starts to feel like a real, robust system instead of a fancy demo.

Manual Scaling: Because You’re the Boss

This is the simplest form of scaling, and you’ve already seen it if you’ve been paying attention. You just change the number of replicas in your ReplicaSet or Deployment manifest and apply it. Kubernetes gets the memo and does the work.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: meows-a-lot
spec:
  replicas: 3 # <-- This is the magic number
  selector:
    matchLabels:
      app: meows-a-lot
  template:
    metadata:
      labels:
        app: meows-a-lot
    spec:
      containers:
      - name: server
        image: meows-a-lot:1.0
        ports:
        - containerPort: 8080

To change it on the fly without editing the YAML file, you can use kubectl scale. This is incredibly useful for quickly reacting to an event you know is coming, like a product launch or a scheduled marketing email blast.

# Scale the "meows-a-lot" deployment to 5 replicas
kubectl scale deployment/meows-a-lot --replicas=5

# Scale a ReplicaSet named "meows-a-lot-xyz123" to 3 replicas
kubectl scale replicaset/meows-a-lot-xyz123 --replicas=3

Here’s the crucial bit a lot of beginners miss: scaling a Deployment is just a convenient way to scale the underlying ReplicaSet it manages. When you run kubectl scale deployment/whatever, kubectl updates the replicas field on the Deployment, and the Deployment controller turns around and updates the replicas field on the ReplicaSet it owns. The ReplicaSet is still the poor sod actually responsible for counting the Pods. This is why you can also scale the ReplicaSet directly, as shown above, but when a Deployment owns that ReplicaSet it’s pointless. The Deployment’s replica count and the ReplicaSet’s are now out of sync, and fixing that mismatch is exactly what the Deployment controller is for: it notices and resets the ReplicaSet to the Deployment’s replica count, typically within seconds. Don’t fight the automation; use kubectl scale on the Deployment. And remember that kubectl scale is itself a live edit: if your manifest still says replicas: 3, a later kubectl apply of that manifest will generally put you back at 3.

Horizontal Pod Autoscaler: Letting the Robots Drive

Manual scaling is great, but you don’t want to sit there with a dashboard open all night, ready to hammer the kubectl scale command during a surprise traffic spike. You want automation. Enter the Horizontal Pod Autoscaler (HPA). Its job is brutally simple: it watches a metric (like CPU utilization) and scales the number of Pods up or down to keep that metric at a target value you specify.

The HPA is a separate API resource that points at your Deployment, ReplicaSet, or StatefulSet. Here’s what a basic one looks like that scales based on CPU:

apiVersion: autoscaling/v2
# Use autoscaling/v2 (stable since Kubernetes 1.23). v2beta2 was removed in 1.26, and v1 is ancient and only supports CPU.
kind: HorizontalPodAutoscaler
metadata:
  name: meows-a-lot-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: meows-a-lot
  minReplicas: 1 # Please don't set this to 0 unless you hate being able to serve traffic.
  maxReplicas: 10 # Required. The brakes: the HPA never scales past this, so a runaway metric can't bankrupt you.
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50 # Aim for 50% average CPU use across all Pods.

You create this just like any other resource, and it goes to work. But before you do, there’s a massive, facepalmingly common gotcha.

The CPU Metric Trap (And How to Avoid It)

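The HPA doesn’t just magically know how much CPU your Pods are using. It asks the Metrics API for that data, which means something has to be serving that API (usually metrics-server, and not every cluster ships with it installed). On top of that, for a utilization target to work, your Pods must have CPU resource requests defined. This is the part everyone forgets and then wonders why their HPA is permanently stuck showing <unknown>.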

The HPA doesn’t calculate utilization as a raw percentage of the node’s CPU. It calculates it as a percentage of your Pod’s requested CPU, averaged across the Pods it manages. If your Pod requests 500m CPU and is using 250m, that’s 50% utilization. If you didn’t set a request, there’s nothing to divide by, and the HPA gives up on the calculation in a state of existential confusion. Always, always define your requests.
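
Once the HPA has a utilization number, turning it into a replica count is simple arithmetic. The formula below is the one described in the Kubernetes documentation; the worked numbers are made up for illustration (4 Pods averaging 80% of their CPU request against a 50% target).

```latex
\[
\text{desiredReplicas} =
  \left\lceil \text{currentReplicas} \times
  \frac{\text{currentUtilization}}{\text{targetUtilization}} \right\rceil
\]

\[
\left\lceil 4 \times \frac{80\%}{50\%} \right\rceil
  = \lceil 6.4 \rceil = 7
\]
```

So the HPA would ask for 7 replicas, capped by maxReplicas. When the ratio is already close to 1 (within 10% by default), the HPA leaves the replica count alone.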

# The Pod template inside your Deployment MUST have this
spec:
  containers:
  - name: server
    image: meows-a-lot:1.0
    resources:
      requests:
        cpu: 250m # <- This right here is what makes the HPA work.
        memory: 64Mi
      limits:
        cpu: 500m
        memory: 128Mi

Beyond CPU: Scaling on Real Metrics

CPU is the classic example, but it’s often a terrible one. Most web applications are I/O-bound, not CPU-bound. A service stuck waiting on a slow database can be drowning in requests while its CPU sits nearly idle, so a CPU-based HPA never notices. The good news is that HPA (autoscaling/v2) can scale on pretty much any metric you can dream of. Memory works out of the box as a resource metric, just like CPU. Anything else (requests per second, the length of a message queue, or even the price of Bitcoin) needs an adapter, such as the Prometheus Adapter, to expose that data through the Kubernetes custom or external metrics APIs. That last one is a terrible idea, but it’s possible, and that’s what matters.
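
Here’s a rough sketch of what scaling on requests per second could look like. It assumes something in your cluster (the Prometheus Adapter, for example) already exposes a per-Pod metric through the custom metrics API. The metric name http_requests_per_second and the target of 100 are placeholders, not anything Kubernetes provides for you.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: meows-a-lot-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: meows-a-lot
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Pods # A per-Pod custom metric, averaged across all Pods.
    pods:
      metric:
        name: http_requests_per_second # Hypothetical; must match what your adapter exposes.
      target:
        type: AverageValue
        averageValue: "100" # Illustrative: aim for about 100 requests/second per Pod.
```

If the adapter isn’t serving that metric, expect the same <unknown> you get from missing CPU requests.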

The Cooldown: Stabilization Windows

Imagine your service gets a sudden, massive burst of traffic. The HPA sees CPU spike to 90% and scales up from 3 to 10 replicas. The traffic burst lasts 30 seconds and then ends. The HPA now sees CPU at 5% and immediately scales back down to 3. Two minutes later, the same thing happens. This frantic flapping is bad for the system and your sanity.

To prevent this, the HPA has a cool-down mechanism built in: a stabilization window for scaling down. Instead of acting on its latest calculation alone, it remembers the replica counts it recommended over the last five minutes (the default) and only scales down to the highest of them. In other words, load has to stay low for the whole window before Pods start disappearing, which gives the HPA time to be sure the traffic lull isn’t just a temporary dip. Scaling up has no such delay by default. You can tune this behavior in the spec.behavior section of the HPA manifest, but the defaults are usually a sane starting point.
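
If you do want to tune it, spec.behavior might look something like this. The two stabilization windows are just the defaults written out explicitly. The scale-down policy (at most 2 Pods per minute) is an illustrative value, not a recommendation.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: meows-a-lot-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: meows-a-lot
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300 # The default five-minute window.
      policies:
      - type: Pods
        value: 2          # Illustrative: remove at most 2 Pods...
        periodSeconds: 60 # ...in any 60-second period.
    scaleUp:
      stabilizationWindowSeconds: 0 # Also the default: react to spikes right away.
```

A longer scale-down window trades a slightly bigger bill for fewer flapping Pods, which is usually a trade worth making.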

The bottom line? Manual scaling is your quick-reaction tool. The HPA is your long-term, hands-off autopilot. But just like any autopilot, you have to set it up correctly and trust—but verify—its instruments. Now go define those resource requests. I’m not asking.