11.5 VPA: Right-Sizing Container Resource Requests

Right, so you’ve got HPA scaling the number of your pods based on traffic. That’s great. But what if the pods themselves are the problem? You’ve got a container running with a paltry 100m CPU request, but it’s constantly spiking to 800m, so it gets starved of CPU the moment the node is busy (and throttled into next Tuesday by the kernel if its limit is just as stingy). Or worse, you’ve got a memory leak slowly filling up a node because some container requested a laughably small 128Mi and is now trying to swallow 2Gi. This is where Vertical Pod Autoscaler (VPA) comes in: it’s the friend that tells you you’ve been wearing the wrong-sized clothes all along and helps you get a better fit.

VPA’s job is to watch the actual resource usage of your containers and automatically adjust their requests in the pod spec. By default it also scales limits to keep the original limit-to-request ratio; you can tell it to leave limits alone. It stops you from under-requesting (which leads to CPU starvation, OOM kills, and evictions) and over-requesting (which wastes money and clogs your cluster). It’s like having a continuous cost-performance optimizer running in the background.

How VPA Operates: Recommender, Updater, and Admission Controller

VPA isn’t a single monolithic thing; it’s a collection of three components that play a very specific game of telephone. You don’t have to run all three; you can just use the recommender for advice and handle updates manually.

First, the Recommender. This is the brains. It reads container usage from the metrics API (Metrics Server by default, optionally with history loaded from Prometheus) and calculates what it thinks the requests should be, staying within the bounds of your resource policy. It writes these recommendations into the VPA object’s status whether or not anything acts on them, shouting into the void.
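For a feel of what the Recommender produces, here is roughly what the status section of a VPA object looks like when you dump it with kubectl get vpa my-app-vpa -o yaml. The field names are the real ones; the container name and every number are made up for illustration.

```yaml
# Illustrative excerpt of a VPA object's status (values are invented).
status:
  recommendation:
    containerRecommendations:
    - containerName: my-app        # hypothetical container name
      lowerBound:                  # below this, the pod is considered under-provisioned
        cpu: 120m
        memory: 256Mi
      target:                      # the values that would be written into the pod spec
        cpu: 250m
        memory: 300Mi
      uncappedTarget:              # what the target would be without minAllowed/maxAllowed
        cpu: 250m
        memory: 300Mi
      upperBound:                  # above this, the pod is considered over-provisioned
        cpu: "1"
        memory: 1Gi
```

The target is the number to watch. If uncappedTarget differs from target, your minAllowed or maxAllowed bounds are clipping the recommendation.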

Second, the Updater. This is the brawn. It compares running pods against the Recommender’s output. When it sees a pod whose requests have drifted far enough from the recommendation, it will, if the update mode allows it, actually evict that pod so it can be recreated with the new resource values. This is the “or else” part of the operation.

Finally, the Admission Controller. This is the bouncer. It is a mutating admission webhook: whenever a pod covered by a VPA is created (for example, when a Deployment replaces a pod the Updater just evicted), the API server calls it before the pod is persisted. It rewrites the pod’s spec, injecting the VPA’s recommended resource values on the fly. This is how the new request values actually get applied.

Installation and a Basic Example

Let’s get this thing running. The easiest way is to use the standard installation. Note: VPA needs access to metrics, so you’ll need something like the Metrics Server installed first.

git clone https://github.com/kubernetes/autoscaler.git
cd autoscaler/vertical-pod-autoscaler/
./hack/vpa-up.sh

Now, let’s create a VPA resource for a sample deployment. This is where you define the scope and the policy.

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment
    name: my-app-deployment
  updatePolicy:
    updateMode: "Auto" # Other options: "Off", "Initial", "Recreate"
  resourcePolicy:
    containerPolicies:
    - containerName: "*"
      minAllowed:
        cpu: "100m"
        memory: "128Mi"
      maxAllowed:
        cpu: "1"
        memory: "1Gi"
      controlledResources: ["cpu", "memory"]

This VPA will manage all containers (containerName: "*") in the my-app-deployment Deployment. In Auto mode, it will automatically evict and update pods. We’ve set sane boundaries (minAllowed/maxAllowed) because we’re not insane—we don’t want it recommending a 16-core CPU request for a tiny Go service.
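The wildcard policy is not the only option. As a sketch, the resourcePolicy below replaces the one above with per-container rules: one container has only its requests managed (limits are left as you wrote them), and a sidecar is excluded entirely. The container names app and log-shipper are hypothetical; controlledValues and mode are standard fields of a container policy.

```yaml
  # Drop-in alternative for the resourcePolicy section of my-app-vpa.
  resourcePolicy:
    containerPolicies:
    - containerName: "app"              # hypothetical main container
      controlledValues: RequestsOnly    # default is RequestsAndLimits
      minAllowed:
        cpu: "100m"
        memory: "128Mi"
      maxAllowed:
        cpu: "1"
        memory: "1Gi"
    - containerName: "log-shipper"      # hypothetical sidecar
      mode: "Off"                       # VPA leaves this container's resources untouched
```

RequestsOnly is worth considering when your limits are deliberate ceilings rather than a multiple of the request.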

The Crucial Update Modes: Auto, Off, Initial, and Recreate

This is the most important choice you’ll make. Each mode serves a different purpose:

Off: The VPA will only provide recommendations in its status (kubectl describe vpa <vpa-name>), but it won’t take any action. This is perfect for a dry-run to see what it would do before you let it loose.
Initial: The VPA will only apply resource changes when the Pod is first created. It won’t touch already running pods. Useful for getting initial configurations right but leaving running apps alone.
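Recreate: The VPA sets requests when a pod is created and also evicts running pods whenever their requests drift far enough from the recommendation. Reach for it only when you specifically need pods restarted every time the recommendation changes.
Auto: The full monty. Today it behaves the same as Recreate: it applies recommendations at creation and evicts pods to have them recreated with the new values. The difference is intent, since Auto is free to use a less disruptive update mechanism if one becomes available. Use this with extreme caution, especially for stateful applications. In either mode, VPA works through the pod’s controller (Deployment, StatefulSet, and so on); a bare pod has nothing to recreate it after an eviction.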
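As a minimal sketch of the dry-run setup, here is the earlier VPA switched to recommendation-only mode. The object name my-app-vpa-dryrun is an assumption; the target is the same my-app-deployment used above.

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa-dryrun        # assumed name, pick your own
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment
    name: my-app-deployment
  updatePolicy:
    updateMode: "Off"            # recommend only: no evictions, no mutation at admission
```

Let it collect data for a while, then read the recommendation with kubectl describe vpa my-app-vpa-dryrun before deciding whether to move to Initial or Auto.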

The Dark Arts: Pitfalls and Best Practices

VPA is powerful, but it has some famously sharp edges. Let’s call them out.

It Shouldn’t Be Combined With HPA on CPU or Memory. This is the big one. VPA and HPA are like two kids fighting over the same toy. HPA’s resource metrics are utilization figures, meaning usage as a percentage of the request. If you have an HPA that scales on CPU or memory utilization, and a VPA that’s changing the request values for those same resources, they will fight in an endless feedback loop. HPA sees CPU utilization drop because VPA increased the request, so it scales down pods. Then VPA sees usage go up per pod because there are fewer of them, so it increases requests again. It’s a mess. You can use them together only if your HPA scales on a custom or external metric that isn’t affected by the resource request changes.
Pod Disruption is Inevitable. In Auto or Recreate mode, VPA works by evicting pods. Your application needs to be able to handle graceful termination and restart. This is a non-issue for well-designed cloud-native apps, but it can be a nasty surprise for others.
StatefulSets Require Care. Evicting a pod managed by a StatefulSet has implications for identity and stable storage. You need to be absolutely sure your application can handle being restarted with a new resource footprint without corrupting its state. Tread lightly here.
It’s a Recommendation Engine, Not a Crystal Ball. VPA is great for adjusting to typical traffic patterns. It is not good at predicting sudden, massive, unprecedented spikes. If your app gets slammed by a new event, VPA will react, but it will be too late to prevent throttling or OOM kills. You still need to base your minAllowed values on reasonable expectations for worst-case scenarios.
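Two of these pitfalls have fairly standard mitigations, sketched below under some loud assumptions. The first object is an HPA that scales on a per-pod custom metric instead of CPU or memory, so it does not react to VPA changing requests. The metric name http_requests_per_second is hypothetical, and serving it requires a custom metrics adapter in your cluster. The second is a PodDisruptionBudget, which VPA honors because its evictions go through the eviction API. The app: my-app label and both object names are assumptions too.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa                       # assumed name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app-deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second   # hypothetical custom metric
      target:
        type: AverageValue
        averageValue: "100"              # illustrative threshold
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb                       # assumed name
spec:
  maxUnavailable: 1                      # at most one pod evicted at a time
  selector:
    matchLabels:
      app: my-app                        # assumed pod label
```

Neither object makes evictions free. They only keep the two autoscalers out of each other’s way and cap how many pods are down at once.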

So, should you use it? Absolutely, but start in Off mode. See what it recommends. Then maybe move to Initial for new deployments. Only use Auto/Recreate for stateless, resilient workloads where you’ve tested the disruption process and have hard bounds (minAllowed/maxAllowed) in place. Used correctly, it’s one of the most effective tools in your arsenal for cutting cloud bills and improving cluster stability. Used incorrectly, it’s an automated wrecking ball. You’ve been warned.