16.6 kube-proxy: iptables, IPVS, and nftables Modes
Right, let’s talk about kube-proxy. You’ve probably heard it’s the thing that makes Kubernetes Services work, and that’s mostly true. But if you think it’s a traditional proxy—some daemon sitting in the middle of the data path, inspecting packets and making decisions—I’ve got news for you. That would be horrifically inefficient. Instead, kube-proxy is a gloriously clever (and sometimes gloriously convoluted) system-level programmer. Its job isn’t to proxy traffic, but to manipulate the networking rules on the host (the iptables, IPVS, or nftables we’re about to dive into) so that traffic heading for a virtual Service IP gets magically redirected to an actual healthy Pod.
Think of it less as a bouncer checking IDs and more as a city worker who goes out in the dead of night and changes all the road signs so your GPS seamlessly takes you to a new destination without you ever knowing.
The Three Modes: A Quick Primer
kube-proxy can operate in three modes, and the choice is anything but trivial. It’s a trade-off between flexibility, performance, and complexity.
- iptables: The old reliable. Ubiquitous, rock-solid, and understood by every sysadmin who’s ever cursed at a firewall rule. This has been the default for ages.
- IPVS: The performance king. Built for load balancing massive numbers of services, like inside cloud providers. It uses kernel hash tables instead of linear chains of rules.
- nftables: The new kid on the block. It’s meant to eventually replace iptables, offering a cleaner syntax and better management. Support is still maturing in kube-proxy.
You set this with the --proxy-mode flag on the kube-proxy daemonset, or with the mode field in its KubeProxyConfiguration file. Let’s crack open each one.
iptables Mode: The Devil You Know
This is where you should start, because even if you move to IPVS, understanding the iptables madness is a rite of passage. When kube-proxy runs in iptables mode, it watches the API server for Services and EndpointSlices. For every Service, it writes a small novel of rules across the nat table.
Let’s say you have a Service named my-app with ClusterIP 10.96.12.37 and two pods. The flow for a packet destined to that IP looks something like this (a simplified version of the horror show):
- Packet hits the PREROUTING chain.
- Jumps to the KUBE-SERVICES chain (the main directory).
- Matches the destination IP 10.96.12.37, jumps to a service-specific chain KUBE-SVC-XXXXXXXX.
- The KUBE-SVC-XXXXXXXX chain has a bunch of rules, each with a probability-based jump to a pod-specific chain KUBE-SEP-YYYYYYYY. This is how it does random load balancing.
- The KUBE-SEP-YYYYYYYY chain finally does the DNAT (Destination Network Address Translation), rewriting the destination IP from the Service’s virtual IP to the actual Pod IP, like 192.168.1.5.
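The probability-based jumps are easier to trust once you work through the arithmetic. Here is a minimal Python sketch, assuming the usual cascade where rule i of n matches with probability 1/(n-i) and the last rule always matches. The function names are made up for illustration, and this models the math only, not kube-proxy’s actual code.

```python
from fractions import Fraction


def cascade_probabilities(n):
    """Per-rule match probability for n endpoints: 1/n, 1/(n-1), ..., 1."""
    return [Fraction(1, n - i) for i in range(n)]


def overall_share(n):
    """Probability that each endpoint is ultimately chosen by the cascade."""
    shares = []
    remaining = Fraction(1)  # chance that no earlier rule has matched yet
    for p in cascade_probabilities(n):
        shares.append(remaining * p)
        remaining *= 1 - p
    return shares


if __name__ == "__main__":
    print([str(p) for p in cascade_probabilities(3)])
    print([str(s) for s in overall_share(3)])
```

The per-rule probabilities look lopsided, but because each rule only sees the traffic that earlier rules let through, every endpoint ends up with an equal share.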
You can see this magnificent, terrifying beast for yourself. SSH into a node and run:
sudo iptables -t nat -L KUBE-SERVICES -n | head -20
sudo iptables -t nat -L KUBE-SVC-XXXXXXXX -n # Replace with a real chain name from the output above
The Pitfall: The problem with iptables is that it evaluates rules in linear chains. With n Services, KUBE-SERVICES holds on the order of n rules, so matching a packet costs O(n) in the worst case. Connection tracking means only the first packet of each connection pays that price, but for a cluster with tens of thousands of Services, the kernel can still spend a non-trivial amount of time walking these chains. It works, but it’s not optimal.
IPVS Mode: The Scalability Play
IPVS (IP Virtual Server) is part of the LVS (Linux Virtual Server) project. It’s a kernel feature designed for one job: load balancing at scale. Unlike iptables, which uses linear match-and-act chains, IPVS uses hash tables for its lookups. This means the time to find the right backend is roughly constant, O(1), regardless of whether you have 10 services or 10,000.
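To make the O(n) versus O(1) contrast concrete, here is a toy Python comparison. It is not how the kernel implements either path: the rule list stands in for a linear chain, a dict stands in for the IPVS hash table, and the Service IPs and chain names are generated placeholders.

```python
def build(n):
    """Build n hypothetical Service IPs mapped to hypothetical chain names."""
    rules = [(f"10.96.{i // 256}.{i % 256}", f"KUBE-SVC-{i:04d}") for i in range(n)]
    return rules, dict(rules)


def linear_lookup(rules, dest):
    """Walk the rules in order, like a chain; return (target, rules_examined)."""
    for examined, (match_ip, target) in enumerate(rules, start=1):
        if match_ip == dest:
            return target, examined
    return None, len(rules)


def hashed_lookup(table, dest):
    """One keyed lookup, independent of how many entries the table holds."""
    return table.get(dest)


if __name__ == "__main__":
    rules, table = build(1000)
    print(linear_lookup(rules, "10.96.3.231"))  # last rule in the list
    print(hashed_lookup(table, "10.96.3.231"))
```

The unlucky Service at the end of the list pays for every rule in front of it, while the hashed lookup costs the same no matter where the entry lives.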
In this mode, kube-proxy still uses iptables, but only for the initial packet filtering and masquerading. Its main job is to program IPVS with virtual services (what ipvsadm -A does by hand) and real servers (ipvsadm -a). It’s a much cleaner model.
You can see it in action. On a node with kube-proxy in IPVS mode:
sudo ipvsadm -Ln
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
-> RemoteAddress:Port Forward Weight ActiveConn InActConn
TCP 10.96.0.1:443 rr
-> 192.168.1.20:6443 Masq 1 0 0
TCP 10.96.12.37:80 rr
-> 192.168.1.5:8080 Masq 1 0 0
-> 192.168.1.6:8080 Masq 1 0 0
Look at that clarity! It’s beautiful. You can instantly see the virtual service IP:port, the scheduling algorithm (rr for round-robin), and the healthy backend pods.
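That output is regular enough to script against when you are debugging. Here is a rough Python sketch that turns ipvsadm -Ln text into a dictionary, assuming the layout shown above. The parse_ipvsadm helper and the SAMPLE constant are invented for this example, and real output may carry extra flags or columns that this does not handle.

```python
SAMPLE = """\
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port Forward Weight ActiveConn InActConn
TCP 10.96.0.1:443 rr
  -> 192.168.1.20:6443 Masq 1 0 0
TCP 10.96.12.37:80 rr
  -> 192.168.1.5:8080 Masq 1 0 0
  -> 192.168.1.6:8080 Masq 1 0 0
"""


def parse_ipvsadm(text):
    """Map 'PROTO VIP:PORT' to {'scheduler': ..., 'backends': [...]}."""
    services = {}
    current = None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] in ("TCP", "UDP", "SCTP") and len(fields) >= 3:
            # A virtual service line: protocol, VIP:port, scheduler.
            current = f"{fields[0]} {fields[1]}"
            services[current] = {"scheduler": fields[2], "backends": []}
        elif fields[0] == "->" and current is not None and len(fields) >= 2:
            # A real server line belonging to the most recent virtual service.
            services[current]["backends"].append(fields[1])
    return services


if __name__ == "__main__":
    for vip, info in parse_ipvsadm(SAMPLE).items():
        print(vip, info["scheduler"], info["backends"])
```

From there it is a short step to comparing the backends a node actually has against the Pod IPs you expected.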
The Catch: IPVS operates at a lower level in the networking stack. It’s less aware of the nuances of connection states than iptables. This can lead to funky edge cases, especially with externalTrafficPolicy: Local and preserving client source IPs. You need to understand these trade-offs. It’s also pickier about kernel modules. You can’t just flip the mode; you need the right ip_vs, ip_vs_rr, ip_vs_wrr modules loaded on your nodes first.
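A quick pre-flight check for those modules might look like the following Python sketch. It assumes the modules are built as loadable modules and therefore show up in /proc/modules; anything compiled directly into the kernel will not appear there, so treat a "missing" result as a prompt to investigate rather than a verdict. The function names are hypothetical.

```python
REQUIRED = ("ip_vs", "ip_vs_rr", "ip_vs_wrr")  # the modules named above


def missing_modules(proc_modules_text, required=REQUIRED):
    """Return the required modules not listed in /proc/modules-style text."""
    loaded = {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }
    return [name for name in required if name not in loaded]


def check_node(path="/proc/modules"):
    """Read the node's module list and report what is missing."""
    with open(path) as handle:
        return missing_modules(handle.read())


if __name__ == "__main__":
    gaps = check_node()
    print("missing:", gaps if gaps else "none")
```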
nftables Mode: The Future, Probably
nftables is the long-term successor to iptables. It has a cleaner, more consistent syntax and solves a lot of the architectural problems of iptables. kube-proxy’s nftables mode first shipped as an alpha feature in Kubernetes 1.29, so check how far it has graduated in the release you’re running; frankly, you don’t see it in the wild much yet. It’s good to know it’s there, waiting in the wings for when the ecosystem is ready to fully embrace it. For now, unless you’re a contributor or have a very specific need, stick with IPVS or iptables.
So, Which Mode Should You Use?
Here’s the direct, trench-forged advice:
- For most clusters (under 1000 services): iptables is perfectly fine. It’s the default for a reason. The performance difference is negligible, and every Linux engineer can debug it.
- For large, high-scale clusters: Use IPVS. If you’re running a platform as a service, a large mesh, or just have a ton of Services, the constant-time lookup is worth the added complexity. The debugging tools (ipvsadm) are actually superior once you learn them.
- For now, ignore nftables mode for production workloads. Keep an eye on the Kubernetes release notes; it’ll become a big deal one day.
Regardless of your choice, remember this: kube-proxy is a controller. It reflects the desired state (Services and EndpointSlices) into the host’s networking data plane. When you can’t reach a Service, your first instinct shouldn’t be to restart kube-proxy; it should be to shell onto a node and see what rules it actually programmed. Because sometimes, that city worker putting up the signs makes a mistake.
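The "controller" idea boils down to a diff between what should exist and what does. Here is a deliberately tiny Python sketch of that comparison. The data shapes and function names are invented for illustration and bear no relation to kube-proxy’s real internals; the point is only the desired-versus-actual mindset you want when debugging.

```python
def desired_backends(services, endpoints):
    """Flatten desired state into a set of (vip, backend) pairs."""
    return {
        (vip, backend)
        for name, vip in services.items()
        for backend in endpoints.get(name, [])
    }


def plan(desired, actual):
    """Return (to_add, to_remove) that would make actual match desired."""
    return sorted(desired - actual), sorted(actual - desired)


if __name__ == "__main__":
    services = {"my-app": "10.96.12.37:80"}
    endpoints = {"my-app": ["192.168.1.5:8080", "192.168.1.6:8080"]}
    # What the node currently has programmed (one stale backend, one missing).
    actual = {
        ("10.96.12.37:80", "192.168.1.5:8080"),
        ("10.96.12.37:80", "192.168.1.9:8080"),
    }
    to_add, to_remove = plan(desired_backends(services, endpoints), actual)
    print("add:", to_add)
    print("remove:", to_remove)
```

When a Service misbehaves, you are effectively running this diff by hand: list what the API server says should exist, list what the node has, and look at the gap.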