40.4 Network Tuning: net.core, net.ipv4, TCP Buffer Sizes
Right, let’s talk about tuning the network stack. This is where we stop politely asking the kernel to move data and start telling it. The /proc/sys/net/ directory is our control panel, and sysctl is the button-laden, slightly confusing remote. We’re going to focus on the big ones: net.core, net.ipv4, and the glorious, often-misunderstood world of TCP buffers.
First, a reality check. The kernel’s default settings are designed for a hypothetical, perfectly average machine from roughly a decade ago. They are comically conservative for a modern server with 10GbE or 40GbE NICs. If you just plug in a fast network card and do nothing, it’s like putting a Formula 1 engine in a golf cart—you’re not going to see any benefit. The cart’s frame (your kernel parameters) can’t handle the power.
net.core: The Foundation
This namespace controls settings for the core networking stack, regardless of the protocol. Think of it as the plumbing in your house.
net.core.rmem_max / net.core.wmem_max: These are the hard limits on what an application can request for a socket’s receive or send buffer with setsockopt() (SO_RCVBUF / SO_SNDBUF). TCP’s automatic buffer sizing is bounded separately, by the max field of tcp_rmem / tcp_wmem. Set these too low, and you’ve put a ceiling on any application that sizes its own buffers. For a modern server, I start at 16MB (16777216) and go up from there for memory-rich boxes. The usual default of 212992 bytes is a bad joke.
net.core.netdev_max_backlog: When packets arrive faster than a CPU can process them, the kernel queues them on that CPU’s input queue. This setting is the length of that queue. If you’re seeing backlog drops on a busy interface (the second column of /proc/net/softnet_stat keeps climbing), bump this up from its paltry default of 1000 to something like 5000 or 10000.
net.core.somaxconn: This is the maximum number of connections that can be waiting to be accept()-ed by a listening socket. The classic mistake is setting your application’s listen backlog (e.g., in Python’s socket.listen(128)) higher than this kernel parameter. The kernel will silently clamp it to somaxconn. If you expect a huge flood of incoming connections (hello, everyone refreshing the product page at 9 AM), you need to raise this from 128 (the default on kernels before 5.4; newer kernels default to 4096) to something like 2048 or 4096. Otherwise, your shiny app is waiting in a line the kernel won’t let it form.
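One way to see whether the backlog queue is actually overflowing is the kernel’s per-CPU counters in /proc/net/softnet_stat, where the second column (in hexadecimal) counts packets dropped because the queue was full. The following is a minimal sketch, not a polished tool; softnet_drops is a hypothetical helper name, and the row number is only an approximation of the CPU number.

```shell
# Print the backlog-drop counter for each row of softnet_stat.
# Column 2 is hexadecimal; rows normally correspond to CPUs.
# softnet_drops is a hypothetical helper name.
softnet_drops() {
    row=0
    while read -r _processed dropped _rest; do
        echo "row$row dropped=$((0x$dropped))"
        row=$((row + 1))
    done < "${1:-/proc/net/softnet_stat}"
}
```

Run softnet_drops with no argument to read the live file. If the numbers grow between two runs taken a few seconds apart, the queue is too short for your traffic.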
Let’s set a sane foundation. This isn’t the final tune, it’s just removing the artificial handcuffs.
# Make these changes immediately (as root)
sudo sysctl -w net.core.rmem_max=16777216
sudo sysctl -w net.core.wmem_max=16777216
sudo sysctl -w net.core.netdev_max_backlog=5000
sudo sysctl -w net.core.somaxconn=2048
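Before persisting anything, it is worth confirming that the live values are what you expect. A small sketch, assuming single-valued keys like the four above; check_sysctl and the SYSCTL_ROOT override are hypothetical names introduced only for this example.

```shell
# Compare a live sysctl value against an expected one.
# SYSCTL_ROOT is a hypothetical override (defaults to /proc/sys).
# Only suitable for single-valued keys; multi-valued keys such as
# tcp_rmem are tab-separated in /proc and need different handling.
check_sysctl() {
    key=$1
    want=$2
    path="${SYSCTL_ROOT:-/proc/sys}/$(echo "$key" | tr . /)"
    have=$(cat "$path" 2>/dev/null) || { echo "$key: not readable"; return 2; }
    if [ "$have" = "$want" ]; then
        echo "$key: ok ($have)"
    else
        echo "$key: have $have, want $want"
        return 1
    fi
}
```

For example, check_sysctl net.core.somaxconn 2048 reports a mismatch if some other config file has overridden your change.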
# To make them persist across reboots, append them to /etc/sysctl.conf
echo "
net.core.rmem_max=16777216
net.core.wmem_max=16777216
net.core.netdev_max_backlog=5000
net.core.somaxconn=2048
" | sudo tee -a /etc/sysctl.conf
net.ipv4.tcp_*: The TCP-Specific Magic (and Madness)
Here’s where we get into the real guts of it. TCP is a… complex protocol, and the Linux kernel’s implementation has more knobs than a cathedral organ.
net.ipv4.tcp_rmem / net.ipv4.tcp_wmem: These are not hard limits. They define three values for each socket: min default max. The kernel dynamically adjusts the buffer size between the min and max based on memory pressure and what the connection needs. The max value is the ceiling for this automatic tuning; net.core.rmem_max separately caps buffers that an application sets explicitly with SO_RCVBUF, and setting SO_RCVBUF turns autotuning off for that socket. The genius here is that an idle connection uses little memory, but a fast connection can automatically scale up. A good starting point is 4096 87380 16777216.
net.ipv4.tcp_window_scaling: This is non-negotiable. It must be 1. This enables TCP window scaling, which allows for windows larger than 64KB. Without it, your maximum achievable throughput on a high-latency link (satellite, cross-continent) is crippled, because only one 64KB window can be in flight per round trip. It’s 2024. If this is off, turn it on. Now.
net.ipv4.tcp_sack: Selective Acknowledgment. Another one that should be on (1) unless you have a very good reason to turn it off. It allows recovering from multiple packet losses in a window more efficiently.
net.ipv4.tcp_congestion_control: This is the algorithm that decides how fast to send data. The default is often cubic, which is fine. But for long-fat networks (high bandwidth, high latency), bbr (Bottleneck Bandwidth and Round-trip propagation time) is often vastly superior. It’s a more modern algorithm that tries to actually find the optimal sending rate instead of just reacting to loss. It’s worth testing.
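To put numbers on the buffer and window-scaling advice, the usual rule of thumb is the bandwidth-delay product: link speed times round-trip time is roughly how many bytes must be in flight to keep the pipe full. A rough sketch using integer arithmetic; bdp_bytes and max_kbit_unscaled are hypothetical helper names, and the link speeds and RTT are illustrative.

```shell
# Bandwidth-delay product in bytes.
# 1 Mbit/s for 1 ms is 125 bytes, hence the factor of 125.
bdp_bytes() {            # args: link speed in Mbit/s, round-trip time in ms
    echo $(( $1 * 125 * $2 ))
}

# Throughput ceiling in kbit/s with an unscaled 65535-byte window:
# one window per round trip, and bits per millisecond equals kbit/s.
max_kbit_unscaled() {    # arg: round-trip time in ms
    echo $(( 65535 * 8 / $1 ))
}

bdp_bytes 1000 100       # 1 Gbit/s at 100 ms RTT: prints 12500000
bdp_bytes 10000 100      # 10 Gbit/s at 100 ms RTT: prints 125000000
max_kbit_unscaled 100    # no window scaling at 100 ms RTT: prints 5242
```

So a 16MB max roughly covers 1 Gbit/s across a 100 ms path but falls well short for 10 Gbit/s over the same path, and without window scaling that path tops out near 5 Mbit/s per connection regardless of the NIC.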
# Tune the TCP buffers and enable essential features
sudo sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
sudo sysctl -w net.ipv4.tcp_wmem="4096 87380 16777216"
sudo sysctl -w net.ipv4.tcp_window_scaling=1
sudo sysctl -w net.ipv4.tcp_sack=1
sudo sysctl -w net.ipv4.tcp_congestion_control=bbr
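The bbr line can fail on kernels where BBR is not built or cannot be loaded, so it may be worth checking first. The kernel lists the algorithms it currently has available in /proc/sys/net/ipv4/tcp_available_congestion_control. A minimal sketch; cc_available and the AVAIL_FILE override are hypothetical names for this example.

```shell
# Succeed if the named congestion control algorithm is currently available.
# AVAIL_FILE is a hypothetical override for the kernel's list file.
cc_available() {
    file="${AVAIL_FILE:-/proc/sys/net/ipv4/tcp_available_congestion_control}"
    for algo in $(cat "$file" 2>/dev/null); do
        [ "$algo" = "$1" ] && return 0
    done
    return 1
}

# Example (not run here): try loading the module if bbr is not listed yet.
# cc_available bbr || sudo modprobe tcp_bbr
```

If bbr still does not show up after the modprobe, stay on cubic rather than fighting the kernel build.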
The Big Pitfall: Memory Pressure
Here’s the thing nobody tells you: if you set the max field of tcp_rmem to 16MB, you are giving the kernel permission to use up to 16MB of receive buffer per socket. On a server handling 10,000 connections, that’s a theoretical worst case of 160 GB of RAM. The kernel is smart and won’t actually do that; it only scales up the buffers for connections that need it. But you must be aware of the trade-off. This is why you don’t just blindly set everything to gigabyte-sized buffers. You tune for your expected workload and available memory. The min and default values in tcp_rmem are your safety net for idle connections.
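The worst-case arithmetic is easy to rerun for your own connection counts. A small sketch; worst_case_mib is a hypothetical helper name, and it counts receive buffers only.

```shell
# Worst-case buffer memory in MiB if every socket grew to the max.
worst_case_mib() {       # args: connection count, max buffer size in bytes
    echo $(( $1 * $2 / 1048576 ))
}

worst_case_mib 10000 16777216   # prints 160000, i.e. roughly 160 GB
worst_case_mib 10000 4194304    # a 4MB max instead: prints 40000
```

In practice the system-wide ceiling on TCP memory is net.ipv4.tcp_mem (three values, measured in pages), so the kernel starts pushing back long before this theoretical figure is reached.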
The best practice? Change one thing at a time and test. Use iperf3 to benchmark. Monitor for packet drops (ip -s link, or ifconfig on older systems, plus ethtool -S eth0 | grep drop) and memory usage. The settings I’ve given are a starting point for a modern application server, not a divine revelation. Your mileage will vary, because your traffic, your hardware, and your problems are uniquely yours. Now go make that kernel work for its keep.
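To make "change one thing and test" concrete, you can wrap a benchmark run so it reports how many receive drops the interface picked up while it ran. This is a sketch under assumptions: rx_dropped_delta and the SYSFS_NET override are hypothetical names, eth0 is a placeholder interface, and 192.0.2.10 is a placeholder iperf3 server address.

```shell
# Report how many receive drops an interface accumulated while a command ran.
# Reads the rx_dropped counter under /sys/class/net before and after.
# SYSFS_NET is a hypothetical override for the sysfs directory.
rx_dropped_delta() {     # args: interface name, then the command to run
    counter="${SYSFS_NET:-/sys/class/net}/$1/statistics/rx_dropped"
    shift
    before=$(cat "$counter") || return 2
    "$@" >/dev/null
    after=$(cat "$counter")
    echo $(( after - before ))
}

# Example (not run here): -R makes the server send, so this host receives.
# rx_dropped_delta eth0 iperf3 -c 192.0.2.10 -t 30 -R
```

Run it once before a change and once after, with the same command, so the two numbers are comparable.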