44.1 Kubernetes at Scale: Tested Limits and Real-World Numbers
Right, let’s talk about scale. You’ve probably seen the eye-watering, “look-at-me” conference talk numbers from Google or Netflix about running eleventy-billion pods. That’s great for them. We live in the real world, where your cluster isn’t running on a planet-sized data center and your CFO has questions about the cloud bill. So let’s get practical. What actually breaks first when you push a Kubernetes cluster, and what can you do about it? Forget the theory; these are the pressure points I’ve seen burst in production.
The Control Plane: Your First Bottleneck
The API Server is the brain, and it’s a chatty one. Every single thing that happens in your cluster—every kubectl command, every pod creation, every background list/watch—is an HTTP request to it. Its performance is almost entirely a function of two things: etcd’s health and the sheer number of concurrent requests.
The first sign of trouble is usually increased API server latency. You’ll see kubectl commands hanging, or your GitOps operator (looking at you, ArgoCD) complaining it can’t reconcile. The culprit is often a poorly designed controller or a misconfigured application hammering the API with list requests.
Here’s a classic. Someone deploys an app that does this every 30 seconds:
# Don't do this. This is how you get put on a watchlist.
kubectl get pods --all-namespaces
This is a full dump of the entire pod manifest for every pod, every 30 seconds. It’s brutally inefficient. The right way is to use a watch operation, which opens a single, long-lived connection and gets incremental updates. But for those times you need to poll, at least be polite and filter aggressively.
# Be specific. Your API server will thank you.
kubectl get pods --namespace my-app --selector=app=frontend
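To put rough numbers on polling versus watching, here is a back-of-envelope sketch. The pod count, the average object size, and the change rate are all assumptions picked for illustration, and the model ignores compression, pagination, and watch bookmarks. Treat it as a way to reason about orders of magnitude, not as a measurement.

```python
def polling_bytes_per_hour(num_pods: int, avg_pod_bytes: int, interval_s: int) -> int:
    """Bytes returned per hour by a client that re-lists every pod on a fixed interval."""
    polls_per_hour = 3600 // interval_s
    return num_pods * avg_pod_bytes * polls_per_hour


def watch_bytes_per_hour(num_pods: int, avg_pod_bytes: int, changes_per_hour: int) -> int:
    """Bytes for one initial list plus one event per pod change delivered over a watch."""
    initial_list = num_pods * avg_pod_bytes
    return initial_list + changes_per_hour * avg_pod_bytes


if __name__ == "__main__":
    # Assumed, illustrative numbers: 5,000 pods, ~4 KiB each, 500 pod changes per hour.
    pods, size, changes = 5_000, 4096, 500
    poll = polling_bytes_per_hour(pods, size, interval_s=30)
    watch = watch_bytes_per_hour(pods, size, changes)
    print(f"polling every 30s: {poll / 2**30:.2f} GiB/hour")
    print(f"single watch:      {watch / 2**20:.2f} MiB/hour")
```

Under these assumptions the poller moves gigabytes per hour while the watcher moves megabytes, and every one of those list responses has to be serialized by the API server.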
The real pro move is configuring API Priority and Fairness (FlowSchema and PriorityLevelConfiguration objects) on your API server so that a rogue tenant can’t DoS the entire control plane. It’s a bit advanced, but it’s a lifesaver in multi-tenant environments.
etcd: The Beating Heart You Forgot About
The API Server is just the talkative frontman; etcd is the band doing all the work. It’s a consistent, distributed key-value store, and its performance is the absolute foundation of your cluster. If etcd gets slow, the entire cluster gets slow. And then it dies.
The biggest killers of etcd performance are:
- Disk I/O Latency: etcd is log-based. It must fsync writes to disk to be consistent. If you put its storage on some crummy, network-backed storage with high I/O latency, you’ve already lost. This is non-negotiable: etcd needs fast local SSDs.
- Large Kubernetes Objects: I once saw a 5MB ConfigMap break a cluster. No, really. The etcd request size is limited (the default is about 1.5 MiB), and current Kubernetes versions reject ConfigMaps and Secrets over 1 MiB outright. Large objects eat into etcd’s database quota, but more importantly, they make every serialization/deserialization operation slower. Keep your Secrets, ConfigMaps, and especially your custom resources lean. Don’t stick a 2MB XML dump in a ConfigMap; use a volume or an object store.
- Too Many Watchers: Every active watch consumes memory, mostly in the API server’s watch cache and, for the watches that reach it, in etcd. Thousands of watches from various controllers can add up. You can’t avoid this, but you can be mindful of it.
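For the large-object problem in particular, a cheap guard is to measure an object before you apply it. The sketch below is a minimal example under stated assumptions: it uses compact JSON as a stand-in for the stored size (the API server typically stores built-in types as protobuf, so real sizes differ somewhat), and the function names and the 50% warning threshold are hypothetical choices, not anything Kubernetes defines.

```python
import json

# etcd's default request size limit (--max-request-bytes), about 1.5 MiB.
ETCD_DEFAULT_MAX_REQUEST_BYTES = 1_572_864


def serialized_size(obj: dict) -> int:
    """Size in bytes of the object encoded as compact JSON (UTF-8)."""
    return len(json.dumps(obj, separators=(",", ":")).encode("utf-8"))


def check_object(obj: dict, warn_fraction: float = 0.5) -> str:
    """Return 'ok', 'warn', or 'too-large' relative to the etcd default limit."""
    size = serialized_size(obj)
    if size > ETCD_DEFAULT_MAX_REQUEST_BYTES:
        return "too-large"
    if size > ETCD_DEFAULT_MAX_REQUEST_BYTES * warn_fraction:
        return "warn"
    return "ok"


if __name__ == "__main__":
    # Hypothetical ConfigMap carrying a 1 MiB blob.
    config_map = {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {"name": "demo"},
        "data": {"blob": "x" * (1024 * 1024)},
    }
    print(serialized_size(config_map), check_object(config_map))
```

A check like this could run in CI against rendered manifests, so oversized objects get caught before they ever reach the API server.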
You can check the health of your etcd cluster from the outside. The etcdctl tool is your best friend here.
# Check endpoint health
ETCDCTL_API=3 etcdctl --endpoints=<your-endpoint> --cacert=<ca.crt> --cert=<cert.crt> --key=<key.key> endpoint health
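# The commands below need the same --endpoints and TLS flags as the one above
# Check alarm status (you want this to be empty)
etcdctl alarm list
# Run a short built-in benchmark. It writes test keys and generates load,
# so avoid it on a cluster that is already struggling.
etcdctl check perf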
The Kubelet: When 100 Pods on a Node Actually Cry
The theoretical pod limit per node is high (250-ish on some platforms). The practical limit is often much, much lower. Why? Because the Kubelet on each node has to reconcile every pod on that node in its sync loop, and it also has to keep reporting node status to the control plane. Two settings drive the cadence: syncFrequency (default 1m) for pod reconciliation and nodeStatusUpdateFrequency (default 10s) for status reports. More pods means more work per loop, which means the Kubelet can fall behind.
When it falls behind, status updates arrive late and the node flips to NotReady because the Kubelet is too busy to, well, get ready. The main resource hit isn’t the CPU or memory the pods themselves use; it’s the CPU that the Kubelet and the container runtime burn on the node.
You’ll see this in your metrics as the kubelet_runtime_operations_duration_seconds or kubelet_pod_worker_duration_seconds histograms creeping upward. The fix? You can try lowering the Kubelet’s pod cap (the --max-pods flag, or maxPods in the Kubelet config file) from its default of 110 to a more reasonable number for your workload density. Sometimes, 50 well-packed pods are more stable than 110 sparse ones.
The Real Number: It’s the Network, Always
You can have the beefiest control plane and nodes, but if the networking layer (your CNI plugin plus kube-proxy) can’t keep up, everything grinds to a halt. In iptables mode, the number of rules kube-proxy creates scales with the number of Services and endpoints. Past a certain point, packets have to walk gigantic chains rule by rule, and syncing a change can take seconds on a very large ruleset.
This is the prime reason many large clusters moved kube-proxy to IPVS mode. IPVS uses hash tables for load balancing, so lookups stay roughly constant time, O(1), regardless of the number of Services.
# You can often set this in your cluster provisioning config
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: "ipvs"
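If the linear-versus-hash distinction feels abstract, this toy model makes it concrete. It is not how the kernel implements either mode. It only counts how many entries get examined to find one Service in a list (iptables-style chain) versus a dictionary (IPVS-style hash table). The Service names are made up.

```python
def linear_lookup(chain: list[str], target: str) -> int:
    """Return how many rules are examined before the target matches (chain walk)."""
    for examined, rule in enumerate(chain, start=1):
        if rule == target:
            return examined
    return len(chain)


def hashed_lookup(table: dict[str, str], target: str) -> int:
    """Model a hash table hit as a single probe, independent of table size."""
    return 1 if target in table else 0


if __name__ == "__main__":
    for n in (100, 10_000):
        # Hypothetical Service names; the worst case is the last rule in the chain.
        services = [f"svc-{i}" for i in range(n)]
        table = {name: "backend" for name in services}
        worst = services[-1]
        print(f"{n} services: chain examines {linear_lookup(services, worst)} rules, "
              f"hash table probes {hashed_lookup(table, worst)}")
```

The chain cost grows with the number of Services while the hash probe count does not, which is the whole argument for moving off iptables mode at scale.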
If you’re on a cloud provider, their CNI plugin (like AWS VPC CNI or Azure CNI) has its own limits on pods per node based on available IPs. This is often the hardest limit you’ll hit. Always, always check your CNI plugin’s documentation for its scaling parameters. It’s the least glamorous but most common cause of “but my nodes have plenty of capacity!” failures.
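As one concrete case, the AWS VPC CNI in its default secondary-IP mode derives the pod ceiling from the instance’s network interfaces. The sketch below uses the commonly documented formula for that mode. The per-instance ENI and IP counts are quoted from memory as illustrations, so check them against the current AWS documentation, and note that prefix delegation changes the math entirely.

```python
def vpc_cni_max_pods(enis: int, ips_per_eni: int) -> int:
    """Pod ceiling for AWS VPC CNI in secondary-IP mode.

    Each ENI keeps one IP for itself, and 2 is added for pods that use
    host networking and so do not need their own IP.
    """
    return enis * (ips_per_eni - 1) + 2


if __name__ == "__main__":
    # Assumed instance limits as (ENIs, IPv4 addresses per ENI); verify before relying on them.
    instance_limits = {
        "t3.medium": (3, 6),
        "m5.large": (3, 10),
    }
    for name, (enis, ips) in instance_limits.items():
        print(f"{name}: {vpc_cni_max_pods(enis, ips)} pods")
```

With those inputs the ceiling is 17 pods for the smaller instance and 29 for the larger one, both well under the Kubelet default of 110. That gap is exactly why nodes with spare CPU and memory still refuse to schedule pods.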
The bottom line? Scale testing isn’t a luxury; it’s a necessity. Before you promise the moon, take a test cluster and hammer it with a load-testing tool such as clusterloader2 or kube-burner, or with a simple pod-creator script. Find your limits, not the ones in the shiny spec sheet. Because in the end, the most real-world number is the one that keeps your pager from going off at 3 AM.