26.5 PromQL: Querying Kubernetes Metrics
Right, let’s talk PromQL. You’ve got Prometheus scraping all sorts of juicy data from your Kubernetes cluster. That’s step one. But staring at a list of metrics is like staring at a parts bin for a race car—impressive, but useless unless you know how to assemble them into something that tells you how fast you’re going or when you’re about to blow a gasket. That’s where PromQL comes in. It’s the language you use to ask pointed, intelligent questions of your metric data. It’s deceptively simple-looking, but it has a few quirks that will drive you absolutely mad until you understand its internal logic.
The Absolute Core: Selectors and Matchers
At its heart, every PromQL query starts by selecting a set of time series. You’re not querying a table of values; you’re querying a set of individual streams of data, each uniquely identified by its metric name and its labels. This is the most important mental model to get. You don’t just ask for container_cpu_usage_seconds_total; you ask for it and then use labels to filter down to exactly the series you care about.
Let’s say you want to find the CPU usage for a specific pod. You might start with:
container_cpu_usage_seconds_total{pod="my-app-pod-abc123"}
Those curly braces {} contain your label matchers. The = is for an exact match. You can also use != (not equal), =~ (regular expression match), and !~ (negated regular expression match). Need to find all pods in a specific namespace whose names start with “my-app”? Easy.
container_cpu_usage_seconds_total{namespace="production", pod=~"my-app-.*"}
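The matchers combine freely inside one selector. The queries below are a small sketch of that; the namespace names and the "canary" naming pattern are made-up values for illustration, and each expression is meant to be run on its own.

```promql
# Exact match on namespace, regex match on pod name.
# Regex matchers are fully anchored: the pattern must match the whole label value.
container_cpu_usage_seconds_total{namespace="production", pod=~"my-app-.*"}

# Everything in production except pods with "-canary-" in the name (hypothetical naming scheme).
container_cpu_usage_seconds_total{namespace="production", pod!~".*-canary-.*"}

# Regex alternation to cover two namespaces, keeping only series
# whose container label is present and non-empty.
container_cpu_usage_seconds_total{namespace=~"production|staging", container!=""}
```

Because of the anchoring, pod=~"my-app" would only match a pod named exactly my-app, which is why the trailing .* matters.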
This is your bread and butter. Get good at crafting precise selectors. The more specific you are, the less work Prometheus has to do and the faster your queries will be.
It’s All About the Rate (Almost Always)
Here’s the first thing that trips everyone up. Counters, which are metrics that only ever increase (like HTTP requests or CPU time used), are borderline useless in their raw form. You care about how fast they’re increasing, not their absolute value. Looking at http_requests_total by itself is meaningless—it could be 1000000 because your app is popular or because it’s been running for a decade.
You almost always want the rate() function. It calculates the per-second average rate of increase over a given time window. This is the workhorse of PromQL.
rate(http_requests_total[5m])
This query means: “For each time series matching http_requests_total, look at the last 5 minutes of data, calculate how much it increased per second, and give me that value.” The [5m] is a range vector selector. Crucially, rate() only accepts a range vector; hand it a bare instant vector and Prometheus rejects the query. The time window (5m) is a trade-off: too short and it’s spiky and noisy (and if the window doesn’t contain at least two samples you get nothing back at all); too long and it smooths out real problems. I usually start with 2m or 5m.
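To make the window trade-off concrete, here is the same counter queried a few ways. This is only a sketch: http_requests_total is the example metric from the text, and the window sizes are illustrative rather than recommendations.

```promql
# Per-second request rate averaged over the last 5 minutes, one result per series.
rate(http_requests_total[5m])

# A shorter window reacts faster but is noisier. It still has to be
# comfortably longer than the scrape interval so it holds at least two samples.
rate(http_requests_total[2m])

# Total increase over the window instead of a per-second value.
# For a 5m window this equals rate(http_requests_total[5m]) * 300.
increase(http_requests_total[5m])
```

Both rate() and increase() account for counter resets, so a restarted pod does not show up as a huge negative spike.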
For gauges (metrics that can go up and down, like memory usage), rate() is the wrong tool. Use the _over_time family instead (avg_over_time, max_over_time, and so on), which takes a range vector in the same way.
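For a gauge, the equivalent queries might look like the following. The namespace value and the window lengths are assumptions chosen for illustration.

```promql
# Average working-set memory per series over the last 10 minutes.
avg_over_time(container_memory_working_set_bytes{namespace="production"}[10m])

# Peak working-set memory per series over the last hour.
max_over_time(container_memory_working_set_bytes{namespace="production"}[1h])
```

The peak is usually the more interesting number when you are deciding on memory limits, since an average hides the spike that gets a container killed.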
The Joins Are Implicit, and That’s a Trap
This is the part that feels like magic until it backfires spectacularly. When you perform operations between two sets of metrics, PromQL doesn’t do a database-style join on a common key. Instead, for each series on the left-hand side it looks for a series on the right-hand side with an exactly identical label set. If the labels don’t match perfectly, those series are silently dropped from the result. No error, no warning, just missing data.
Let’s say you want to calculate the memory usage as a percentage of the container’s defined limit. You have two metrics:
container_memory_working_set_bytes (what it’s using)
container_spec_memory_limit_bytes (its limit)
A naive query would be:
(container_memory_working_set_bytes / container_spec_memory_limit_bytes) * 100
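This can return “no data,” or quietly return fewer series than you expected. Why? Because the labels might not match. The container_spec_memory_limit_bytes metric might have a container label, while container_memory_working_set_bytes uses a container_name label, which is the kind of thing that happens when the two sides come from different exporters or different versions. Prometheus sees these as completely different labels and won’t match the series.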
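The solution is to tell Prometheus explicitly how to match them. The on() and ignoring() clauses restrict the comparison to labels both sides actually share, and group_left or group_right allows many-to-one matches. When the label names themselves differ, label_replace() can copy one label into the other first. This is an advanced topic, but just know that if your math operations return nothing, mismatched labels are your first suspect.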
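As a rough sketch of what explicit matching looks like, the first query below repairs the hypothetical container / container_name mismatch described above. It assumes both metrics carry namespace and pod labels, and that only the container label name differs; check your own series before relying on it. The second query assumes kube-state-metrics is installed, since that is where kube_pod_info comes from.

```promql
# Copy container_name into a container label on the left side,
# then match only on the three labels both sides now share.
  label_replace(container_memory_working_set_bytes, "container", "$1", "container_name", "(.*)")
/ on (namespace, pod, container)
  container_spec_memory_limit_bytes
* 100

# Many-to-one: attach the node label from kube_pod_info (value is always 1)
# to per-pod CPU usage. group_left(node) copies node from the right-hand side.
  sum by (namespace, pod) (rate(container_cpu_usage_seconds_total[5m]))
* on (namespace, pod) group_left (node)
  kube_pod_info
```

If on() leaves more than one series on either side with the same values for the listed labels, Prometheus returns a matching error instead of a result, which is at least louder than an empty graph.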
Aggregation: Turning Detail into Insight
You’ve got CPU usage for every single container in your cluster. Great. Now your manager wants to know the total CPU usage for the entire “billing” namespace. You don’t want to show them a thousand lines. You want to aggregate.
This is where sum(), avg(), max(), and by () come in.
sum(rate(container_cpu_usage_seconds_total{namespace="billing"}[5m]))
That sums the CPU usage rate across every container in the billing namespace into a single number. But what if you want to see the total CPU usage per pod within that namespace? You use by () to preserve the labels you care about.
sum by (pod) (rate(container_cpu_usage_seconds_total{namespace="billing"}[5m]))
This will return one time series per pod, with its respective summed CPU rate. The sum() aggregates away all other labels (like container), but the by (pod) clause tells it to keep the pod label so you can tell them apart. It’s a fantastically powerful way to roll up data at different levels of your hierarchy.
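A few variations on the same roll-up, as a sketch. The metric and the billing namespace come from the text above; the choice of five for topk is arbitrary.

```promql
# CPU usage rate per namespace across the whole cluster.
sum by (namespace) (rate(container_cpu_usage_seconds_total[5m]))

# The inverse form: drop only the labels you name and keep all the others.
sum without (container) (rate(container_cpu_usage_seconds_total{namespace="billing"}[5m]))

# The five busiest pods in the billing namespace.
topk(5, sum by (pod) (rate(container_cpu_usage_seconds_total{namespace="billing"}[5m])))
```

Note the order of operations in every case: rate() first, then the aggregation. Summing raw counters and taking the rate afterwards hides counter resets from rate() and produces garbage.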
The up Metric: Your Best Friend and Worst Nightmare
This is the simplest and most important metric. up{job="kubernetes-pods"} tells you whether the most recent scrape of each target succeeded (1) or failed (0). If this is 0, you’re getting no data from that target, and your other queries are lying to you by omission. Make a dashboard widget for this. Alert on it. Love it. But also remember: up only tells you if the scrape worked. The application could be returning HTTP 500 errors to every real user while its metrics endpoint answers happily, and up would still be 1. (If the metrics endpoint itself returns a 500, the scrape fails and up drops to 0.) It’s not a health check; it’s a metrics availability check.
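Some starting points for dashboards and alerts built on up, offered as a sketch. The job name is the one used above; the 10 minute window and the 0.9 threshold are illustrative values, not recommendations.

```promql
# Targets whose most recent scrape failed.
up{job="kubernetes-pods"} == 0

# Number of failing targets per job.
count by (job) (up == 0)

# Flapping targets: fewer than 90% of scrapes succeeded over the last 10 minutes.
avg_over_time(up[10m]) < 0.9
```

The count query returns nothing at all when every target is healthy, rather than a zero, so an alert on it fires only when there is something to report.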