44.7 Controller Manager and Scheduler Tuning Flags
Right, so you’ve got your cluster up, your pods are running, but something just feels… sluggish. Deployments take a geological age to roll out, or your nodes are sitting there half-asleep while pods languish in “Pending” purgatory. Before you start yelling at the autoscaler, let’s talk about the two brainstems of your control plane: the Controller Manager and the Scheduler. They’re the anxious, overworked organizers of your cluster, and sometimes you need to adjust their caffeine intake.
The key thing to remember is that these components are just processes, running in pods (usually kube-controller-manager and kube-scheduler). They’re configured mostly by command-line flags; the scheduler can additionally read a config file passed with --config. To change the flags, you modify the manifests for these pods, which are typically static pods managed directly by the kubelet on the control plane nodes. On a kubeadm cluster, you’ll find the manifests in /etc/kubernetes/manifests, and the kubelet restarts the pod as soon as you save the file. Messing with these is a “hold my beer” moment, so please, for the love of all that is holy, back them up first. Keep the backups outside that directory, because the kubelet will try to run whatever manifests it finds there.
Tuning the Controller Manager’s Heartbeat
The Controller Manager is a bundle of nerves. It runs a dozen different control loops that constantly check the state of the world against the state you desired. Its primary performance flags revolve around how frantically it does this checking.
The --node-monitor-period and --node-monitor-grace-period flags are a classic duo. The first is how often the node controller checks in on node health (default: 5s). The second is how long a node can go without a heartbeat before it is marked unhealthy (default: 40s in most releases; confirm against kube-controller-manager --help for yours). That 40s is an eternity if you have a self-healing application that needs to reschedule pods quickly. In a tightly tuned cluster, you might lower both, but keep the grace period at several multiples of the kubelet’s heartbeat interval (10s by default) so that one lost heartbeat doesn’t count as a death. And be warned: if your network is even slightly flaky, a lower grace period will cause the controller to declare nodes dead left and right, creating a cascading rescheduling nightmare.
    # A snippet from /etc/kubernetes/manifests/kube-controller-manager.yaml
    spec:
      containers:
      - command:
        - kube-controller-manager
        - --node-monitor-period=2s
        - --node-monitor-grace-period=20s
        # Legacy flag: ignored under taint-based eviction, rejected by releases that removed it.
        - --pod-eviction-timeout=30s
        # ... other flags
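See that --pod-eviction-timeout? Historically, that was the final step: once the Controller Manager declared a node dead, this flag controlled how long it waited before evicting the node’s pods. The default was a generous five minutes, which is basically a siesta. On current clusters, though, eviction is taint-based: the Controller Manager taints the dead node with node.kubernetes.io/not-ready or node.kubernetes.io/unreachable, and each pod is evicted when its toleration for that taint runs out (300 seconds by default, so still a siesta). The flag is ignored in that mode and was removed in Kubernetes 1.27, where passing it keeps the Controller Manager from starting at all. For a performance-sensitive cluster, you still want the eviction delay much lower and in step with your grace period; you just set it through tolerationSeconds on the pods instead.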
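On clusters that use taint-based eviction, the eviction delay is a property of each pod rather than of the Controller Manager. Here is a minimal sketch of a Deployment that asks to be evicted from a dead node after 30 seconds. The name, label, image, and the 30-second figure are all placeholders for illustration, not recommendations:

```yaml
# Hypothetical Deployment; only the tolerations block is the point here.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: fast-failover-demo            # placeholder name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: fast-failover-demo
  template:
    metadata:
      labels:
        app: fast-failover-demo
    spec:
      tolerations:
      # Tolerate a NotReady node for 30s, then get evicted.
      - key: node.kubernetes.io/not-ready
        operator: Exists
        effect: NoExecute
        tolerationSeconds: 30
      # Same for a node the control plane cannot reach.
      - key: node.kubernetes.io/unreachable
        operator: Exists
        effect: NoExecute
        tolerationSeconds: 30
      containers:
      - name: app
        image: registry.example.com/app:1.0   # placeholder image
```

If you would rather change the default for every pod, the API server accepts --default-not-ready-toleration-seconds and --default-unreachable-toleration-seconds, which set the values injected into pods that don’t declare their own.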
Scheduler Throughput: More Concurrency, Less Delay
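The Scheduler’s job is to find a home for your pods. Its main performance issue is that it pulls pods off its queue one… pod… at… a… time. Seriously. What it can parallelize is the work inside each decision: filtering and scoring candidate nodes is spread across a pool of workers, sized by the parallelism setting (default: 16, regardless of how many cores you have). Note that parallelism is not a command-line flag; it is a field in the KubeSchedulerConfiguration file that you hand to the scheduler with --config. The one-pod-at-a-time design is one of those “questionable choices” that makes sense for stability (two pods can’t be promised the same slice of a node) but is infuriating for anyone with a pulse.
If you’re deploying a large application with dozens of pods, they get scheduled sequentially, and it can feel like watching paint dry. The fix is to make each individual decision faster by giving the scheduler more workers to evaluate nodes with, which matters most on clusters with many nodes.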
    # A snippet from /etc/kubernetes/manifests/kube-scheduler.yaml
    spec:
      containers:
      - command:
        - kube-scheduler
        # parallelism is set inside this file, not on the command line.
        # The path is an example; the file must also be mounted into the pod.
        - --config=/etc/kubernetes/scheduler-config.yaml
        # ... other flags
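Cranking up parallelism can noticeably improve scheduling throughput for large batches, especially when each pod has many candidate nodes to evaluate. But this is a classic trade-off: higher CPU usage on your control plane for faster scheduling. Don’t just set this to 1000; workers beyond the cores actually available to the scheduler mostly add contention, so monitor your control plane node’s CPU and increase it gradually.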
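For reference, a minimal sketch of a scheduler config file carrying this setting, assuming a release that serves the kubescheduler.config.k8s.io/v1 API, where parallelism is a field of KubeSchedulerConfiguration rather than a command-line flag. The value 32 is an illustrative assumption; the kubeconfig path is where kubeadm normally puts it:

```yaml
# Hypothetical scheduler config file, passed to kube-scheduler via --config.
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
clientConnection:
  # Must be set here: the --kubeconfig flag is ignored once --config is used.
  kubeconfig: /etc/kubernetes/scheduler.conf
# Workers used to filter and score nodes for each pod (default: 16).
parallelism: 32
```

The scheduler only picks this up if its static pod manifest both passes --config with the file’s path and mounts the file into the container, so a typo in either place leaves you with a scheduler that won’t start.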
The Binding Rate Limit Landmine
Here’s the pitfall that gets everyone. Even with high scheduler parallelism, you might hit a client-side rate limit on API requests. The Scheduler doesn’t just think; it has to act by creating “Binding” objects in the API server. The --kube-api-qps and --kube-api-burst flags control the sustained and burst request rates each component allows itself.
The defaults differ by component: the Controller Manager ships with a conservative QPS=20 and Burst=30, while the Scheduler gets QPS=50 and Burst=100. If a few hundred pods land in the queue at once, every one of them needs a binding request, and anything beyond the burst allowance drains at the QPS rate; you’ve just created a bottleneck. For a cluster that needs to schedule pods quickly (e.g., running CI/CD jobs or autoscaling frequently), you should raise these in tandem with parallelism.
    # For the Scheduler (the Controller Manager accepts the same two flags)
    spec:
      containers:
      - command:
        - kube-scheduler
        # Ignored when --config is set; use clientConnection.qps/burst in that file instead.
        - --kube-api-qps=100
        - --kube-api-burst=200
        # ... other flags
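If your scheduler is started with --config, the two flags are ignored and the same limits go in the config file’s clientConnection block instead. A sketch, assuming the kubescheduler.config.k8s.io/v1 API and the kubeadm kubeconfig path; the numbers are illustrative, not recommendations:

```yaml
# Hypothetical scheduler config file with raised client-side rate limits.
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
clientConnection:
  kubeconfig: /etc/kubernetes/scheduler.conf
  qps: 100     # sustained requests per second to the API server (default: 50)
  burst: 200   # short-term ceiling above qps (default: 100)
```

Keeping burst at roughly twice qps mirrors the ratio of the defaults; whatever you choose, these limits only cap what the scheduler sends, so the API server still has to be able to absorb the extra traffic.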
The best practice? You can’t just set these values and forget them. You must monitor the CPU and memory of your control plane nodes after making these changes. You’re fundamentally telling these components to work harder, and they will consume more resources. The goal is to find the sweet spot where scheduling and controller latency are low enough for your use case without drowning your API server in traffic. It’s a balancing act, not a magic bullet.