26.2 kube-state-metrics: Cluster-Level Metrics from API Objects

Right, so you’ve got Prometheus scraping your nodes and pods. That’s a great start, but it’s like knowing the engine RPM and fuel levels of every single car in a massive parking lot without knowing which ones are actually driving, who’s driving them, or if any of them are about to run out of gas and stall in the middle of the highway. For that, you need to understand the state of your Kubernetes API objects—the Deployments, DaemonSets, StatefulSets, and so on. This is where kube-state-metrics (KSM) comes in. It’s the translator that sits between the abstract world of the Kubernetes API and the concrete, number-crunching world of Prometheus.

Think of it this way: the Kubernetes API server knows that your web-frontend Deployment is configured to have 10 replicas but that only 8 Pods are currently ready. The metrics endpoints on your nodes and kubelets know about the individual Pods and containers, but not about their desired state. KSM bridges this gap. It lists and watches objects like Deployments, Nodes, and Pods through the API server, and generates metrics that describe their state. It doesn’t measure performance like CPU usage; it tells you about the health and status of your cluster’s orchestration.

What kube-state-metrics actually gives you

KSM exposes a plethora of metrics, all prefixed with kube_. The most immediately useful ones are often:

kube_deployment_status_replicas_available: The number of available replicas per Deployment (narrow it down with a selector such as {deployment="web-frontend"}). You’ll alert on this.
kube_deployment_spec_replicas: The desired number of replicas. Comparing this to kube_deployment_status_replicas_available tells you whether a Deployment is actually running everything it asked for.
kube_pod_status_phase: The phase (Pending, Running, Succeeded, Failed, Unknown) of each pod, exposed as one series per phase with a value of 1 for the current phase. Pods sitting in phase="Pending" for too long are a classic sign of resource starvation.
kube_job_status_failed: Did a Job’s Pods fail? A value above 0 is how you know.
kube_node_status_condition: Is the node Ready? Under disk pressure? Under memory pressure? Gold.

Without these, you’re flying blind to the actual state of your deployments and the scheduler’s decisions.
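
To make the list above concrete, here is a minimal sketch of a Prometheus rule file built on those metrics. The group name, alert names, severities, and "for" durations are illustrative assumptions, not recommendations from the KSM project, and the annotation assumes the scrape keeps KSM’s own namespace labels. If you run the Prometheus Operator, the same groups block can sit under spec in a PrometheusRule object.

```yaml
# Illustrative alerting rules on kube-state-metrics series.
# Names and thresholds are assumptions; tune them for your cluster.
groups:
- name: ksm-examples
  rules:
  - alert: DeploymentReplicasUnavailable
    # Desired replicas (spec) exceed available replicas (status).
    expr: kube_deployment_spec_replicas > kube_deployment_status_replicas_available
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "{{ $labels.namespace }}/{{ $labels.deployment }} has fewer available replicas than desired"
  - alert: PodStuckPending
    # The series for the pod's current phase has the value 1.
    expr: kube_pod_status_phase{phase="Pending"} == 1
    for: 15m
    labels:
      severity: warning
  - alert: JobFailed
    # Counts failed Pods belonging to a Job.
    expr: kube_job_status_failed > 0
    labels:
      severity: warning
  - alert: NodeNotReady
    # The Ready condition's status="true" series drops to 0 when the node is not Ready.
    expr: kube_node_status_condition{condition="Ready", status="true"} == 0
    for: 5m
    labels:
      severity: critical
```

The "for" clauses matter here: rollouts and node restarts produce short, harmless mismatches, and you only want to be paged when the gap persists.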

Deploying it correctly (it’s not a DaemonSet)

A common misconception is that KSM needs to be on every node. Nope. It’s a simple Deployment that talks to the Kubernetes API. It has no need to be colocated with your workloads. You’ll find its manifests in the official GitHub repo, but let’s be honest, the default setup is a bit… anemic. The standard manifests install it into kube-system, which is fine, but I prefer a dedicated monitoring namespace; if you move it, change the namespace consistently in every manifest, including the ClusterRoleBinding subject. Either way, give it its own dedicated ServiceAccount with the correct RBAC permissions, crucially. The project provides these RBAC manifests, and you must use them.
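
To give a feel for what that RBAC amounts to, here is an abbreviated sketch of a ClusterRole in the same shape as the one the project ships. It is not the upstream file: it covers only five resource types as an illustration, while the real manifest lists many more. The point it shows is that KSM only ever needs read access (list and watch).

```yaml
# Abbreviated, illustrative ClusterRole; the upstream cluster-role.yaml is much longer.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kube-state-metrics
rules:
# Core API group: Nodes and Pods.
- apiGroups: [""]
  resources: ["nodes", "pods"]
  verbs: ["list", "watch"]
# apps API group: the common workload controllers.
- apiGroups: ["apps"]
  resources: ["daemonsets", "deployments", "statefulsets"]
  verbs: ["list", "watch"]
```

If KSM is missing permission for a resource it has been told to export, expect "forbidden" errors in its logs and no metrics for that resource, so keep the role and the enabled resources in sync.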

Here’s a condensed example of getting it running. First, grab the manifests from the official repo (check which release is current and compatible with your cluster version):

# Clone the repo and use the standard manifests
git clone https://github.com/kubernetes/kube-state-metrics.git
cd kube-state-metrics/examples/standard

Now, apply them. Note the order: RBAC first, then the deployment and service.

kubectl apply -f service-account.yaml
kubectl apply -f cluster-role.yaml
kubectl apply -f cluster-role-binding.yaml
kubectl apply -f deployment.yaml
kubectl apply -f service.yaml

Verify it’s running and has endpoints:

kubectl get pods -n kube-system -l app.kubernetes.io/name=kube-state-metrics
kubectl get endpoints kube-state-metrics -n kube-system

The crucial step: Getting Prometheus to scrape it

Deploying KSM is only half the battle. It’s sitting there, happily exposing metrics on port 8080 (and telemetry about KSM itself on 8081, but you almost certainly don’t need that). But if Prometheus doesn’t know to scrape it, it’s all for nothing. This is where a ServiceMonitor (if you’re using the Prometheus Operator) or a scrape config in your prometheus.yml comes in.

Here’s a basic ServiceMonitor that will do the trick:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kube-state-metrics
  namespace: monitoring # assuming Prometheus is installed here
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: kube-state-metrics
  namespaceSelector:
    matchNames:
    - kube-system # or wherever you installed KSM
  endpoints:
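  - port: http-metrics # This must match the port name in the KSM Service (http-metrics in the standard manifests)
    interval: 30s # Scraping every 30s is usually plenty for these metrics.
    honorLabels: true # Keep KSM's own namespace/pod labels instead of renaming them to exported_*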
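
If you are not running the Prometheus Operator, the equivalent is a plain scrape job in prometheus.yml. The sketch below assumes Prometheus runs inside the cluster and that KSM was installed from the standard manifests, so a Service named kube-state-metrics exists in kube-system; the job name is an arbitrary choice.

```yaml
# Minimal scrape job for kube-state-metrics (assumed Service name and namespace).
scrape_configs:
- job_name: kube-state-metrics
  scrape_interval: 30s
  static_configs:
  - targets:
    # In-cluster DNS name of the KSM Service, main metrics port.
    - kube-state-metrics.kube-system.svc:8080
```

A static target is enough here because there is normally a single KSM instance; if you shard KSM or move it around, Kubernetes service discovery is the more robust option.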

Performance and cardinality: Don’t shoot yourself in the foot

Here’s the part the manual often glosses over: KSM can be a cardinality monster if you’re not careful. It exports series for every object of a given kind. If you have 1000 pods, you’ll have 1000 kube_pod_info series, 100 times as many as with 10 pods, and kube_pod_info is only one of many per-pod metrics. This can absolutely hammer your Prometheus storage.

The designers clearly knew this was a risk, which is why they added the --resources flag. You can use this to tell KSM to only export metrics for the API objects you actually care about. Do you really need metrics for every single HorizontalPodAutoscaler or PersistentVolumeClaim? Probably not. Tailor it to your needs.

# Example addition to the KSM deployment args to reduce cardinality
args:
- '--resources=pods,deployments,nodes,daemonsets,statefulsets'
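
For finer control than whole resource types, recent KSM releases (v2.x) also accept --metric-allowlist and --metric-denylist, which take comma-separated metric names or regular expressions and cannot be combined with each other. The patterns below are purely illustrative assumptions about what you might not need; confirm the flags against kube-state-metrics --help for the version you run.

```yaml
# Sketch: restrict resource types, then drop individual metric families.
args:
- '--resources=pods,deployments,nodes,daemonsets,statefulsets'
# Illustrative denylist: skip annotation and creation-timestamp metrics.
- '--metric-denylist=kube_.+_annotations,kube_.+_created'
```

Dropping metrics at the source like this is cheaper than discarding them in Prometheus, because KSM never generates or serves the series in the first place.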

Always check the cardinality of the kube_ metrics in your Prometheus server after you install it (prometheus_tsdb_head_series is a good metric to watch for the overall series count). It’s a brilliant tool, but like any powerful machinery, it requires a mindful operator.
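
One way to keep an eye on this over time is a pair of recording rules that count the KSM series. This is a sketch under assumptions: the rule names and the 5m interval are made up for illustration, and the selector simply matches every metric starting with kube_, which presumes nothing else in your setup uses that prefix.

```yaml
# Illustrative recording rules that track kube-state-metrics cardinality.
groups:
- name: ksm-cardinality
  interval: 5m # these queries touch many series, so evaluate them sparingly
  rules:
  # Total number of series whose name starts with kube_.
  - record: ksm:series:count
    expr: count({__name__=~"kube_.+"})
  # Series count per metric name. The name is copied into a "metric" label
  # because a recording rule overwrites __name__ with its own name.
  - record: ksm:series_by_metric:count
    expr: count by (metric) (label_replace({__name__=~"kube_.+"}, "metric", "$1", "__name__", "(.+)"))
```

Sorting the second rule’s output with topk quickly shows which metric families are worth trimming.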