16.6 kube-proxy: iptables, IPVS, and nftables Modes

Right, let’s talk about kube-proxy. You’ve probably heard it’s the thing that makes Kubernetes Services work, and that’s mostly true. But if you think it’s a traditional proxy—some daemon sitting in the middle of the data path, inspecting packets and making decisions—I’ve got news for you. That would be horrifically inefficient. Instead, kube-proxy is a gloriously clever (and sometimes gloriously convoluted) system-level programmer. Its job isn’t to proxy traffic, but to manipulate the networking rules on the host (the iptables, IPVS, or nftables we’re about to dive into) so that traffic heading for a virtual Service IP gets magically redirected to an actual healthy Pod.

Think of it less as a bouncer checking IDs and more as a city worker who goes out in the dead of night and changes all the road signs so your GPS seamlessly takes you to a new destination without you ever knowing.

The Three Modes: A Quick Primer

kube-proxy can operate in three modes, and the choice is anything but trivial. It’s a trade-off between flexibility, performance, and complexity.

iptables: The old reliable. Ubiquitous, rock-solid, and understood by every sysadmin who’s ever cursed at a firewall rule. This has been the default for ages.
IPVS: The performance king. Built for load balancing across massive numbers of Services. It looks up virtual services in kernel hash tables instead of walking linear chains of rules.
nftables: The new kid on the block. It’s the kernel’s intended replacement for iptables, offering a cleaner syntax and better rule management. kube-proxy’s support for it is the newest of the three (declared stable in Kubernetes 1.33), so it has the least production mileage.

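You set this with the --proxy-mode flag on kube-proxy, or with the mode field of its configuration file (on kubeadm-style clusters that file lives in the kube-proxy ConfigMap that the daemonset mounts). Let’s crack open each one.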

iptables Mode: The Devil You Know

This is where you should start, because even if you move to IPVS, understanding the iptables madness is a rite of passage. When kube-proxy runs in iptables mode, it watches the API server for Services and EndpointSlices. For every Service, it writes a small novel of rules across the nat table.

Let’s say you have a Service named my-app with ClusterIP 10.96.12.37 and two pods. The flow for a packet destined to that IP looks something like this (a simplified version of the horror show):

Packet hits the PREROUTING chain of the nat table (or OUTPUT, if it came from a process on the node itself).
Jumps to KUBE-SERVICES chain (the main directory).
Matches the destination IP 10.96.12.37 along with the Service’s protocol and port, jumps to a service-specific chain KUBE-SVC-XXXXXXXX.
The KUBE-SVC-XXXXXXXX chain has one rule per endpoint, each with a probability-based jump (the iptables statistic match in random mode) to a pod-specific chain KUBE-SEP-YYYYYYYY. This is how it does random load balancing.
The KUBE-SEP-YYYYYYYY chain finally does the DNAT (Destination Network Address Translation), rewriting the destination from the Service’s virtual IP and port to the actual Pod IP and port, like 192.168.1.5:8080.
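
The probability-based jumps in the KUBE-SVC chain are worth a second look, because the numbers on the rules are not all the same. A rule is only reached if every rule before it declined, so to end up with an even split the first of n rules has to fire with probability 1/n, the next with 1/(n-1), and the last one unconditionally. Here is a minimal Python sketch of that arithmetic. It is not kube-proxy’s code, and the function names are made up for illustration.

```python
from fractions import Fraction


def rule_probabilities(n_endpoints):
    """Per-rule match probability for a chain with n endpoints.

    Rule i (0-based) is only reached if every earlier rule declined, so it
    chooses among the n - i endpoints that are left. The last rule gets
    probability 1, i.e. it is unconditional.
    """
    return [Fraction(1, n_endpoints - i) for i in range(n_endpoints)]


def overall_share(n_endpoints):
    """Chance that a new connection ends up at each endpoint."""
    shares = []
    still_falling_through = Fraction(1)  # P(no earlier rule matched)
    for p in rule_probabilities(n_endpoints):
        shares.append(still_falling_through * p)
        still_falling_through *= 1 - p
    return shares


if __name__ == "__main__":
    for n in (2, 3, 5):
        per_rule = ", ".join(str(p) for p in rule_probabilities(n))
        overall = ", ".join(str(s) for s in overall_share(n))
        print(f"{n} endpoints: per-rule [{per_rule}] -> overall [{overall}]")
```

So if you list a KUBE-SVC chain for a three-endpoint Service and see probabilities near 0.33 and 0.5 followed by a rule with no probability at all, nothing is broken. That is the even split.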

You can see this magnificent, terrifying beast for yourself. SSH into a node and run:

sudo iptables -t nat -L KUBE-SERVICES -n | head -20

sudo iptables -t nat -L KUBE-SVC-XXXXXXXX -n  # Replace with a real chain name from the output above

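The Pitfall: The problem with iptables is that it evaluates chains linearly. KUBE-SERVICES holds roughly one rule per Service port, so as the number of Services grows (n), the number of rules grows (O(n)), and so does the worst-case time to match the first packet of a connection (later packets hit the conntrack entry and skip the walk). For a cluster with tens of thousands of Services, the kernel can be spending a non-trivial amount of time walking these chains, and rule updates get more expensive too. It works, but it scales poorly.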

IPVS Mode: The Scalability Play

IPVS (IP Virtual Server) is part of the LVS (Linux Virtual Server) project. It’s a kernel feature designed for one job: load balancing at scale. Unlike iptables, which uses linear match-and-act chains, IPVS uses hash tables for its lookups. This means the time to find the right backend is roughly constant, O(1), regardless of whether you have 10 services or 10,000.
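
To make the O(n) versus O(1) difference concrete, here is a toy Python model that counts how many entries get examined before a match. It is a sketch of the two lookup styles only: the rule list, the chain names, and the addresses are invented, and neither function resembles what the kernel actually executes.

```python
def build_rules(n_services):
    """Hypothetical rule list: (cluster_ip, port) -> per-Service chain name."""
    return [((f"10.96.{i // 256}.{i % 256}", 80), f"KUBE-SVC-{i:08d}")
            for i in range(n_services)]


def linear_match(rules, dest):
    """Walk the rules in order, as a chain does; return (target, rules_checked)."""
    for checked, (key, target) in enumerate(rules, start=1):
        if key == dest:
            return target, checked
    return None, len(rules)


def hashed_match(table, dest):
    """One hash-table probe, whatever the size of the table."""
    return table.get(dest), 1


if __name__ == "__main__":
    for n in (10, 1000, 10000):
        rules = build_rules(n)
        table = dict(rules)
        worst_case = rules[-1][0]  # the Service whose rule sits last in the chain
        _, linear_cost = linear_match(rules, worst_case)
        _, hashed_cost = hashed_match(table, worst_case)
        print(f"{n} services: linear checked {linear_cost}, hashed checked {hashed_cost}")
```

The linear count grows with the number of Services while the hashed count stays at one, which is the whole argument for IPVS in a single loop.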

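In this mode, kube-proxy still uses iptables (helped by ipset, so the rule count stays small), but only for the initial packet filtering and masquerading. Its main job is to program IPVS with virtual services and real servers, the same objects you would create by hand with ipvsadm -A and ipvsadm -a, although kube-proxy talks to the kernel directly rather than shelling out to ipvsadm. It’s a much cleaner model.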

You can see it in action. On a node with kube-proxy in IPVS mode:

sudo ipvsadm -Ln

IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  10.96.0.1:443 rr
  -> 192.168.1.20:6443            Masq    1      0          0
TCP  10.96.12.37:80 rr
  -> 192.168.1.5:8080             Masq    1      0          0
  -> 192.168.1.6:8080             Masq    1      0          0

Look at that clarity! It’s beautiful. You can instantly see the virtual service IP:port, the scheduling algorithm (rr for round-robin), and the healthy backend pods.
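
If the two-level shape of that output is new to you, it may help to see it as a data structure: a table of virtual services, each owning a list of real servers and a scheduler. The Python below is a toy model of just that shape with a plain round-robin rotation. The class and method names are invented, weights are ignored, and it says nothing about how the kernel stores any of this.

```python
class VirtualService:
    """Toy stand-in for one IPVS virtual service using the rr scheduler."""

    def __init__(self, address):
        self.address = address
        self.real_servers = []
        self._cursor = 0  # position of the next backend in the rotation

    def add_real_server(self, address):
        self.real_servers.append(address)

    def schedule(self):
        """Pick the backend for a new connection, rotating through the list."""
        if not self.real_servers:
            raise RuntimeError(f"{self.address} has no real servers")
        backend = self.real_servers[self._cursor % len(self.real_servers)]
        self._cursor += 1
        return backend


if __name__ == "__main__":
    # Mirrors the second entry in the listing above.
    table = {}
    svc = VirtualService("10.96.12.37:80")
    svc.add_real_server("192.168.1.5:8080")
    svc.add_real_server("192.168.1.6:8080")
    table[svc.address] = svc

    for _ in range(4):
        print(svc.address, "->", table["10.96.12.37:80"].schedule())
```

Finding the virtual service is a dictionary lookup and picking a backend is one step of a rotation, which is roughly why the ipvsadm listing reads so much more cleanly than a pile of chains.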

The Catch: IPVS is its own subsystem with its own connection table, separate from the conntrack-driven NAT path that iptables mode relies on, and the two don’t always agree on the details. This can lead to funky edge cases, especially with externalTrafficPolicy: Local and preserving client source IPs. You need to understand these trade-offs. It’s also pickier about kernel modules. You can’t just flip the mode; you need the right modules (ip_vs, ip_vs_rr, ip_vs_wrr, plus ip_vs_sh and nf_conntrack) loaded on your nodes first.
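
A quick pre-flight check for those modules can save you a confusing afternoon. Here is a small Python sketch that parses text in the /proc/modules format and reports which of the required names are absent. The function names are hypothetical and the default list is just the three modules named above, so pass your own list if your setup needs more. One loud caveat: modules compiled directly into the kernel do not appear in /proc/modules, so a "missing" result on such a kernel is a false alarm.

```python
REQUIRED = ("ip_vs", "ip_vs_rr", "ip_vs_wrr")  # the modules named in the text


def loaded_modules(proc_modules_text):
    """Module names from /proc/modules-style text (the name is the first field)."""
    return {line.split()[0]
            for line in proc_modules_text.splitlines()
            if line.strip()}


def missing_modules(proc_modules_text, required=REQUIRED):
    """Required modules that do not show up as loaded, in the order given."""
    loaded = loaded_modules(proc_modules_text)
    return [name for name in required if name not in loaded]


if __name__ == "__main__":
    try:
        with open("/proc/modules") as f:
            text = f.read()
    except OSError:
        text = ""  # not Linux, or /proc is not readable here
    missing = missing_modules(text)
    print("missing:", ", ".join(missing) if missing else "none")
```

Run it on a node before you switch modes. If it reports something missing and your kernel ships these as loadable modules, load them with modprobe and make that persistent in whatever way your distribution expects.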

nftables Mode: The Future, Probably

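nftables is the long-term successor to iptables. It has a cleaner, more consistent syntax and solves a lot of the architectural problems of iptables: a Service can be matched with a single map lookup instead of a linear walk, and rules can be updated incrementally. kube-proxy’s nftables mode arrived as alpha in Kubernetes 1.29 and was declared stable in 1.33, but it needs a fairly recent kernel (5.13 or later), iptables is still the default, and frankly, you don’t see it in the wild much yet. It’s good to know it’s there, waiting in the wings for when the ecosystem is ready to fully embrace it. For now, unless you’re on recent versions and willing to test carefully, or have a very specific need, stick with IPVS or iptables.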

So, Which Mode Should You Use?

Here’s the direct, trench-forged advice:

For most clusters (under 1000 services): iptables is perfectly fine. It’s the default for a reason. The performance difference is negligible, and every Linux engineer can debug it.
For large, high-scale clusters: Use IPVS. If you’re running a platform as a service, a large mesh, or just have a ton of Services, the constant-time lookup is worth the added complexity. The debugging tools (ipvsadm) are actually superior once you learn them.
For now, hold off on nftables mode for production workloads unless you’ve tested it against your own kernels, CNI plugin, and tooling. Keep an eye on the Kubernetes release notes; it’s on track to become a big deal.

Regardless of your choice, remember this: kube-proxy is a controller. It reflects the desired state (Services and EndpointSlices) into the host’s networking data plane. When you can’t reach a Service, your first instinct shouldn’t be to restart kube-proxy. It should be to shell onto a node and see what rules it actually programmed. Because sometimes, that city worker putting up the signs makes a mistake.
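
The controller idea boils down to a diff: work out what the API objects say should exist, compare it with what the node actually has, and apply the difference. Here is a minimal Python sketch of that comparison. The function names and the dictionary shapes are invented for illustration, and real kube-proxy tracks far more than a set of backends per Service.

```python
def desired_backends(services, endpoints):
    """Desired state: Service address -> set of ready endpoint addresses."""
    return {addr: set(endpoints.get(name, ()))
            for name, addr in services.items()}


def plan(desired, actual):
    """Diff desired against what the node has; return (to_add, to_remove)."""
    to_add, to_remove = [], []
    for vip in desired.keys() | actual.keys():
        want = desired.get(vip, set())
        have = actual.get(vip, set())
        to_add += [(vip, backend) for backend in want - have]
        to_remove += [(vip, backend) for backend in have - want]
    return sorted(to_add), sorted(to_remove)


if __name__ == "__main__":
    services = {"my-app": "10.96.12.37:80"}
    endpoints = {"my-app": ["192.168.1.5:8080", "192.168.1.6:8080"]}
    # What the node currently has programmed, including one stale backend.
    actual = {"10.96.12.37:80": {"192.168.1.5:8080", "192.168.1.9:8080"}}

    to_add, to_remove = plan(desired_backends(services, endpoints), actual)
    print("add:", to_add)
    print("remove:", to_remove)
```

When you debug on a node you are doing this diff by hand: the API objects give you the desired side, and the iptables or ipvsadm output gives you the actual side. Anything left in either list is your bug.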