Alright, let’s talk about the brain of your Kubernetes cluster: the API Server. It’s the grand central station for every single request, from kubectl get pods to the kubelet checking in on what it should be running. And like any good central station, it can get completely overwhelmed if you let everyone stampede through at once. That’s where rate limiting and caching come in. They’re the bouncers and the express lanes that keep this whole operation from collapsing into a fireball of 429 Too Many Requests errors.

The Almighty --max-requests-inflight

This is your primary knob. The API Server caps the number of non-mutating requests (read: GET and LIST) it will process concurrently. Long-running requests such as WATCH are exempt from this cap. Mutating requests (POST, PUT, PATCH, DELETE) get their own separate, lower cap, set with --max-mutating-requests-inflight. Think of it this way: you’d rather have ten people reading a sign at the same time than ten people all trying to rewrite the sign at the same time. The latter ends in a fistfight.

The default values (400 for non-mutating, 200 for mutating) are actually pretty reasonable for most medium-sized clusters. But if you’ve got a lot of controllers, operators, or a particularly chatty set of services doing service discovery, you’ll hit this limit. The symptoms are obvious: your requests start getting rejected with HTTP 429 errors.
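To make the mechanism concrete, here is a minimal sketch of an in-flight cap built on a buffered channel used as a semaphore. It is an illustration of the idea only, not the kube-apiserver’s actual implementation; the names inflightLimiter and statusWithInflight are invented for this example.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// inflightLimiter caps concurrent requests with a buffered channel used as a
// semaphore. Hypothetical type for illustration, not kube-apiserver code.
type inflightLimiter struct {
	slots chan struct{}
}

func newInflightLimiter(max int) *inflightLimiter {
	return &inflightLimiter{slots: make(chan struct{}, max)}
}

// wrap serves the request if a slot is free and rejects it with 429 otherwise.
func (l *inflightLimiter) wrap(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		select {
		case l.slots <- struct{}{}:
			defer func() { <-l.slots }() // free the slot when the request finishes
			next.ServeHTTP(w, r)
		default:
			w.Header().Set("Retry-After", "1")
			http.Error(w, "too many requests", http.StatusTooManyRequests)
		}
	})
}

// statusWithInflight returns the status code one new request receives while
// `inflight` other requests hold slots. Call it with inflight <= max.
func statusWithInflight(max, inflight int) int {
	l := newInflightLimiter(max)
	for i := 0; i < inflight; i++ {
		l.slots <- struct{}{} // simulate a request that is still being served
	}
	ok := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, "/api/v1/pods", nil)
	l.wrap(ok).ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	fmt.Println(statusWithInflight(2, 1)) // one slot still free
	fmt.Println(statusWithInflight(2, 2)) // both slots taken
}
```

The point of the sketch is that the limit is on concurrency, not on requests per second: a few slow requests can fill every slot just as easily as a flood of fast ones.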

To see your current limits, you need to peek at the API Server’s command-line arguments. If you’re running a managed cluster, you might not have access to this, but for self-hosted setups:

# Find the API Server pod(s)
kubectl get pods -n kube-system -l component=kube-apiserver

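# Check its running arguments (replace kube-apiserver-master-node with a pod name from above).
# No output means neither flag is set explicitly, so the defaults apply.
kubectl describe pod kube-apiserver-master-node -n kube-system | grep -e --max-requests-inflight -e --max-mutating-requests-inflight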

If you need to change it, you’re going to be editing your static pod manifest or your systemd unit file on the control plane node. It’s not a dynamic setting. A word of caution: cranking this up too high is a fantastic way to overwhelm your API Server and the etcd behind it with too much concurrent work, leading to a slow, painful death for everyone. Increase it gradually and monitor the API Server’s latency and CPU.

Client-Side Rate Limiting: Because You’re the Problem

Here’s the fun part: sometimes the problem isn’t the API Server’s limits, it’s you. Well, your client code. The kubectl tool and the official Go client have built-in client-side rate limiting to prevent any single client from being a jerk and spamming the server.

The default in the Go client is a paltry 5 QPS with a burst of 10. If you’re writing a controller that needs to list hundreds of pods, this will feel like trying to drink a milkshake through a single, tiny straw. You’ll spend all your time waiting.

When you create your Kubernetes client in Go, you absolutely must configure this for anything that does non-trivial work. It’s not optional.

package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1" // needed for metav1.ListOptions below
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load config from file or in-cluster
	config, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		config, err = rest.InClusterConfig()
		if err != nil {
			panic(err.Error())
		}
	}

	// THIS IS THE CRITICAL PART
	// Bump those limits to something sane for a background controller
	config.QPS = 50.0          // Queries per second. 50 is a good starting point.
	config.Burst = 100         // Burst up to 100 requests. Allows for, well, bursts.
	config.Timeout = 30 * time.Second // Don't hang forever.

	// Create the clientset
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err.Error())
	}

	// Now use your clientset as normal...
	pods, err := clientset.CoreV1().Pods("default").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err.Error())
	}
	fmt.Printf("Found %d pods in default namespace\n", len(pods.Items))
}

The key is to set QPS and Burst to values that are appropriate for your use case. A human using kubectl? Defaults are fine. A CI/CD pipeline? Maybe bump it a bit. A central cluster operator that watches everything? Crank it up, but be mindful of the server-side limits we just discussed.

The Watch Cache: The API Server’s Secret Weapon

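Not every LIST or GET request has to go all the way down to etcd and back. The API Server maintains an in-memory cache of the current state of each resource type, kept up to date by its own watch on etcd. This is the Watch Cache. A LIST can often be served directly from this cache, which is much faster than a round trip to etcd. Whether it actually is depends on the request and your Kubernetes version: a LIST with resourceVersion=0 is answered from the cache, while a LIST with no resourceVersion has traditionally been a consistent read that goes to etcd.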

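Per-resource cache settings are passed with the --watch-cache-sizes flag, in the form resource[.group]#size. Be aware that the flag’s meaning has changed over time. On older releases the size set how many recent change events the cache retained for a resource, so a resource with heavy churn (say, 100,000 ConfigMaps that change often) could need a larger value to let lagging watchers resume instead of re-listing. Recent releases size the cache dynamically and, per the flag’s own help text, treat all non-zero values the same, so the only meaningful setting is zero, which disables the cache for that resource. Check kube-apiserver --help for your version before tuning.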

# This is how you'd configure it on the API server command line.
# The group is omitted for core (v1) resources and appended for everything else.
--watch-cache-sizes=configmaps#1000,secrets#1000,replicasets.apps#500

This is a tuning art, not a science. You need to look at the metrics for cache misses and etcd request latencies to know if you have a problem. The main pitfall here is that increasing the cache size for everything will use more memory on your API Server node. You’re trading memory for latency, a classic computing dilemma.
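One way to picture what a bounded cache costs you is as a sliding window of recent change events. The sketch below is a simplified model, not the API Server’s implementation: it assumes consecutive integer resource versions, and the names eventWindow and canResume are invented for this example. When a client asks to resume from a version that has already slid out of the window, it has to fall back to a full re-list.

```go
package main

import "fmt"

// event is one recorded change, tagged with a resource version.
type event struct {
	resourceVersion int
	object          string
}

// eventWindow keeps only the most recent `capacity` events.
type eventWindow struct {
	capacity int
	events   []event
}

func (w *eventWindow) add(e event) {
	w.events = append(w.events, e)
	if len(w.events) > w.capacity {
		w.events = w.events[1:] // evict the oldest event
	}
}

// since returns the events newer than rv. ok is false when rv is older than
// what the window still covers, meaning the caller must re-list from scratch.
func (w *eventWindow) since(rv int) (out []event, ok bool) {
	if len(w.events) > 0 && rv < w.events[0].resourceVersion-1 {
		return nil, false
	}
	for _, e := range w.events {
		if e.resourceVersion > rv {
			out = append(out, e)
		}
	}
	return out, true
}

// canResume records totalEvents changes in a window of the given capacity and
// reports whether a client that last saw version lastSeen can still catch up.
func canResume(capacity, totalEvents, lastSeen int) bool {
	w := &eventWindow{capacity: capacity}
	for rv := 1; rv <= totalEvents; rv++ {
		w.add(event{resourceVersion: rv, object: fmt.Sprintf("configmap-%d", rv)})
	}
	_, ok := w.since(lastSeen)
	return ok
}

func main() {
	fmt.Println(canResume(3, 10, 8))   // client is only slightly behind
	fmt.Println(canResume(3, 10, 5))   // client fell out of a small window
	fmt.Println(canResume(100, 10, 5)) // a larger window still covers it
}
```

A bigger window lets slow clients catch up cheaply, at the price of holding more events in memory.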

The best practice? Use the cache aggressively on the client side too. Don’t constantly re-list everything. Use the WATCH API wherever possible. It’s more efficient for you and for the server, as it pushes only the changes to you instead of you constantly polling for the entire dataset. It’s the difference between getting a live feed versus walking back to the bulletin board every five seconds to see if the notice has changed.
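The watch-instead-of-poll pattern boils down to this: list once, then fold each change event into a local copy. The sketch below shows only that folding step with plain Go types. The event type names ADDED, MODIFIED, and DELETED match what the Kubernetes watch API sends; everything else (watchEvent, applyEvents, the pod names) is invented for illustration. In a real controller you would normally let client-go’s informers maintain this cache for you rather than writing it by hand.

```go
package main

import (
	"fmt"
	"sort"
)

type eventType string

const (
	added    eventType = "ADDED"
	modified eventType = "MODIFIED"
	deleted  eventType = "DELETED"
)

// watchEvent is a simplified stand-in for one event from a watch stream.
type watchEvent struct {
	kind  eventType
	name  string
	phase string
}

// applyEvents folds watch events into a local cache keyed by object name, so
// the client never has to re-list the whole collection to stay current.
func applyEvents(cache map[string]string, events []watchEvent) {
	for _, e := range events {
		switch e.kind {
		case added, modified:
			cache[e.name] = e.phase
		case deleted:
			delete(cache, e.name)
		}
	}
}

// names returns the cached object names in sorted order.
func names(cache map[string]string) []string {
	out := make([]string, 0, len(cache))
	for n := range cache {
		out = append(out, n)
	}
	sort.Strings(out)
	return out
}

func main() {
	// State after the initial LIST.
	cache := map[string]string{"web-1": "Running", "web-2": "Running"}

	// Changes delivered afterwards by the watch.
	applyEvents(cache, []watchEvent{
		{modified, "web-1", "Failed"},
		{deleted, "web-2", ""},
		{added, "web-3", "Pending"},
	})
	fmt.Println(names(cache), cache["web-1"])
}
```

Three small events kept the local view current here; a polling client would have fetched the full list again to learn the same thing.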