25.7 Bonding and Bridging: Link Aggregation and VM Networking

Right, so you’ve got a bunch of physical network links and a pile of virtual machines. You could just plug things in and hope for the best, but that’s like using a single, rickety plank to cross a chasm when you’ve got a whole stack of them right next to you. Let’s talk about combining those links for more throughput and reliability (bonding), and creating the virtual switches that your VMs will plug into (bridging). This is where your server stops being a passive endpoint and starts being the network.

First, let’s get our terms straight because the networking world loves synonyms just to keep you on your toes. Bonding and teaming are, for all practical purposes in the Linux world, the same damn idea: aggregating multiple physical network interfaces (NICs) into a single logical one. They are two separate implementations, though. Bonding is the long-standing kernel driver; teaming is a newer alternative that pairs a small kernel team driver with the teamd userspace daemon. They solve the same problem. We’ll use bonding because that’s what the kernel module is called and it’s what you’ll meet most often. Bridging is essentially building a software-based switch inside your Linux box. It’s what allows you to attach multiple virtual or physical interfaces to the same network segment, so your VMs can talk to each other and the outside world as if they were plugged into a physical switch on your desk.

The Why and How of Bonding Modes

You don’t just slap a bond together willy-nilly. The mode you choose (you’ll see it reported in /proc/net/bonding/bond0 later, I promise) dictates its personality, and picking the wrong one is a classic way to create a spectacularly subtle failure.

mode=0 (balance-rr): Round-robin. It sends packet 1 out interface A, packet 2 out interface B, and so on. This is the only mode that can theoretically increase throughput for a single TCP stream. Sounds great, right? Well, it’s mostly a trap. The remote end gets packets out of order, TCP reads the resulting duplicate ACKs as loss, and it retransmits and shrinks its congestion window until the extra link buys you very little. It’s genuinely absurd. Use this for a demo of what not to do.
mode=1 (active-backup): One interface is active, the others are on standby. It’s dead simple and provides fault tolerance only. No throughput gain. It’s the reliable, boring pickup truck of bonding modes. You use it for your hypervisor’s management interface because you just need it to always work.
mode=4 (802.3ad): This is the one you actually want for throughput. It’s the Link Aggregation Control Protocol (LACP) mode. It requires support from your switch (you must configure the port group as an LACP dynamic channel). LACP does the magic of negotiating the bundle. The bonding driver then hashes each frame’s addresses (layer 2 by default, or layer 2+3 or layer 3+4 if you change the transmit hash policy) so that every frame of a single conversation goes down the same physical link, which prevents packet reordering. The flip side is that one flow never goes faster than one link; the throughput gain shows up across many flows. This is how you get both fault tolerance and increased throughput.
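
To see why round-robin reorders a stream while a hash-based mode doesn't, here is a toy simulation. It is a sketch under assumptions, not the kernel's code: the hash is a simplified stand-in in the spirit of a layer 3+4 policy, and the function names and the sample flow are made up for illustration.

```python
import ipaddress


def round_robin_link(packet_index: int, n_links: int) -> int:
    """balance-rr style: consecutive packets rotate through the links."""
    return packet_index % n_links


def flow_hash_link(src_ip: str, dst_ip: str,
                   src_port: int, dst_port: int, n_links: int) -> int:
    """Toy layer3+4-style hash (illustrative, not the kernel's exact formula).

    The result depends only on the flow's addresses and ports, so every
    packet of one flow lands on the same link.
    """
    ip_bits = int(ipaddress.ip_address(src_ip)) ^ int(ipaddress.ip_address(dst_ip))
    return ((src_port ^ dst_port) ^ (ip_bits & 0xFFFF)) % n_links


# One hypothetical TCP flow, six packets, two links in the bond.
flow = ("10.0.0.5", "10.0.0.9", 40000, 443)
rr_links = [round_robin_link(i, 2) for i in range(6)]
hashed_links = [flow_hash_link(*flow, 2) for _ in range(6)]

print("balance-rr link per packet:", rr_links)
print("hashed link per packet:    ", hashed_links)
```

The round-robin list alternates between the two links, which is exactly where the reordering comes from. The hashed list repeats a single link index, which is why one flow in mode 4 is capped at one link's speed.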

Here’s how you set up a basic mode-4 bond using netplan on Ubuntu. This is the modern way, and frankly, it’s less of a headache than the old ifupdown scripts.

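# /etc/netplan/01-netcfg.yaml
network:
  version: 2
  renderer: networkd
  # Physical NICs: no addresses here, they are only bond members.
  ethernets:
    enp1s0f0:
      dhcp4: no
    enp1s0f1:
      dhcp4: no
  # The bond aggregates both NICs with LACP.
  bonds:
    bond0:
      interfaces: [enp1s0f0, enp1s0f1]
      parameters:
        mode: 802.3ad
        lacp-rate: fast
  # The bridge sits on top of the bond and carries the host's IP.
  # br0 is defined only here; it is not an ethernet device.
  bridges:
    br0:
      interfaces: [bond0]
      dhcp4: yes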

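Note the lacp-rate: fast. The default is slow (an LACPDU every 30 seconds), but fast (one every second) means your bond detects failures and recovers much quicker. A peer that goes silent is timed out after three missed intervals, so that’s about 3 seconds instead of 90. Your switch needs to support it, but any semi-modern switch does.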

Building the Virtual Switch (Bridging)

The bond is your big, fat pipe to the physical world. Now we need to plug our VMs into it. That’s what the bridge (br0) is for. In the config above, you see we created the bond bond0 and then added the bond as a port to the bridge br0. This is the critical part. The bridge is the virtual switch; its interface (br0) gets the IP address and is the host’s connection to the network. The physical interfaces (and the bond aggregating them) are just dumb ports on that switch.
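
The layering can be checked mechanically. The sketch below is an assumption-laden illustration: it takes a netplan-shaped dictionary (as you might get from loading the YAML with a parser of your choice) and reports any bond or bridge member that carries address configuration. The function name and the dictionary are hypothetical; only the keys mirror the config shown earlier.

```python
def members_with_addresses(config: dict) -> list:
    """Return names of bond/bridge members that have address config.

    In the layout described here only the top-level bridge should carry
    an address, so a non-empty result points at a misconfiguration.
    """
    net = config["network"]

    # Collect every interface that is enslaved to a bond or a bridge.
    members = set()
    for section in ("bonds", "bridges"):
        for dev in net.get(section, {}).values():
            members.update(dev.get("interfaces", []))

    # Flag members that ask for DHCP or carry static addresses.
    offenders = []
    for section in ("ethernets", "bonds", "bridges"):
        for name, dev in net.get(section, {}).items():
            has_address = bool(
                dev.get("dhcp4") or dev.get("dhcp6") or dev.get("addresses")
            )
            if name in members and has_address:
                offenders.append(name)
    return sorted(offenders)


# Hypothetical dictionary mirroring the netplan file's structure.
layout = {
    "network": {
        "ethernets": {
            "enp1s0f0": {"dhcp4": False},
            "enp1s0f1": {"dhcp4": False},
        },
        "bonds": {"bond0": {"interfaces": ["enp1s0f0", "enp1s0f1"]}},
        "bridges": {"br0": {"interfaces": ["bond0"], "dhcp4": True}},
    }
}

print(members_with_addresses(layout))
```

For this layout the list comes back empty, because br0 is the only device with address configuration and it is not a member of anything.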

For VMs (e.g., with libvirt), you’d tell them to connect to the bridge br0. The VM’s traffic flows into the bridge, which then forwards it out via the bond to the physical network. It’s beautifully straightforward once you visualize it.
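
For libvirt, "connect to the bridge" comes down to an interface element of type bridge in the domain XML. Here is a minimal sketch that generates that fragment. The helper name is made up, and the virtio model is an assumption about what your guest supports; adjust it to taste.

```python
import xml.etree.ElementTree as ET


def bridge_interface_xml(bridge: str = "br0", model: str = "virtio") -> str:
    """Build a libvirt <interface> fragment that plugs a VM into a host bridge."""
    iface = ET.Element("interface", {"type": "bridge"})
    # The host bridge the VM's virtual NIC attaches to.
    ET.SubElement(iface, "source", {"bridge": bridge})
    # Emulated NIC model; virtio is an assumption, not a requirement.
    ET.SubElement(iface, "model", {"type": model})
    return ET.tostring(iface, encoding="unicode")


print(bridge_interface_xml())
```

The resulting fragment belongs inside the devices section of the VM's domain definition. From the VM's point of view it is simply a NIC; everything about the bond underneath stays invisible to it.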

Common Pitfalls and The Gotchas

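Switch Configuration: This is the number one reason bonds fail. For mode=4 (802.3ad), your switch ports must be configured as an active LACP channel. If they aren’t, the bond can’t negotiate an aggregate and limps along on a single link at best. Statically aggregated modes like balance-rr are nastier: against unconfigured switch ports, the same MAC address shows up on several ports at once, the switch’s forwarding table flaps, and you get to watch your network melt down. I’ve seen it happen. It’s not pretty. Always configure the switch first.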
The “No IP on the Physical Interface” Rule: This is a rookie mistake. Notice in the netplan config that the physical interfaces (enp1s0f0, enp1s0f1) have dhcp4: no, and the bond (bond0) has no address configuration at all. Only the bridge (br0) gets the IP. Once an interface is enslaved to a bond or bridge, its incoming frames are handed to the master before the IP stack sees them, so an address left on the member just sits there attracting routes for traffic it can’t reliably answer. Don’t do it.
Checking Your Work: Once it’s up, prove it’s working. cat /proc/net/bonding/bond0 is your best friend. It shows the bonding mode, the MII status and link failure count of each slave, and, in 802.3ad mode, the aggregator and LACP partner details. If a link goes down, this file will tell you immediately. For the bridge, use bridge link show from iproute2; brctl show from the deprecated bridge-utils package still works if it’s installed.
MTU Mismatch: If you’re using jumbo frames (please tell me you thought about this), the MTU must match everywhere: physical NICs, bond, bridge, and the switch ports. A mismatch will cause mysterious packet loss that will drive you insane.
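
If you'd rather have a script do the squinting, the two checks above (bond status and MTU) can be sketched as follows. Treat this as a rough illustration under assumptions: the sample text is a hand-written excerpt in the style of /proc/net/bonding/bond0, not captured output, and the function names are hypothetical. The /sys/class/net/<name>/mtu path is the standard sysfs location for an interface's MTU.

```python
from pathlib import Path


def parse_bond_status(text: str) -> dict:
    """Extract mode, bond MII status, and per-slave state from bonding status text."""
    bond = {"mode": None, "mii_status": None, "slaves": {}}
    current = None  # name of the slave section being read, if any
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        if key == "Bonding Mode":
            bond["mode"] = value
        elif key == "Slave Interface":
            current = value
            bond["slaves"][current] = {}
        elif key == "MII Status":
            # Before the first slave section this line describes the bond itself.
            if current is None:
                bond["mii_status"] = value
            else:
                bond["slaves"][current]["mii_status"] = value
        elif key == "Link Failure Count" and current is not None:
            bond["slaves"][current]["link_failures"] = int(value)
    return bond


def down_slaves(bond: dict) -> list:
    """Names of slaves whose MII status is anything other than 'up'."""
    return sorted(n for n, s in bond["slaves"].items() if s.get("mii_status") != "up")


def read_mtu(ifname: str) -> int:
    """Read an interface's MTU from sysfs (Linux only)."""
    return int(Path(f"/sys/class/net/{ifname}/mtu").read_text())


def mtus_consistent(mtus: dict) -> bool:
    """True when every interface in the mapping reports the same MTU."""
    return len(set(mtus.values())) <= 1


# Hand-written sample in the style of the status file (illustrative only).
SAMPLE = """\
Bonding Mode: IEEE 802.3ad Dynamic link aggregation
MII Status: up

Slave Interface: enp1s0f0
MII Status: up
Link Failure Count: 0

Slave Interface: enp1s0f1
MII Status: down
Link Failure Count: 2
"""

status = parse_bond_status(SAMPLE)
print(status["mode"], "| down:", down_slaves(status))
print(mtus_consistent({"enp1s0f0": 9000, "bond0": 9000, "br0": 1500}))
```

On a real host you would feed parse_bond_status the contents of /proc/net/bonding/bond0 and build the MTU mapping with read_mtu for each NIC, the bond, and the bridge. The switch ports still have to be checked on the switch itself.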

The designers actually got this one mostly right. The Linux kernel’s bonding and bridging drivers are rock solid. The questionable choice was the historical mess of configuration tools (ifconfig, ifupdown, brctl), but netplan and NetworkManager are finally cleaning that up. It’s a bit of a pain to learn a new syntax, but the consistency is worth it. Now go build yourself a redundant, high-throughput virtual network. You’ve got the blueprint.