34.5 Replication Slots: Ensuring WAL Retention
Right, let’s talk about replication slots. You’re probably here because you’ve seen the dreaded WARNING: oldest xmin is far in the past or, worse, a standby has fallen off the wagon because it couldn’t get the WAL files it needed. Replication slots are the solution to that second, more insidious problem. They’re a way to tell the primary server, “Hey, don’t you dare delete that WAL file until you are absolutely, positively sure my standby has consumed it.”
Think of it this way: without a slot, wal_keep_size (or the older wal_keep_segments) is your only safeguard. It’s a blunt instrument. You set it to, say, 10GB, and hope that’s enough to cover any network outage or standby server reboot. It’s a guess. And if your guess is wrong, the primary merrily deletes old WAL, your standby can’t catch up, and you’re left with a replica that’s as useful as a chocolate teapot. A replication slot, on the other hand, is a direct promise from the primary to a specific standby. It’s a contract that says, “I will keep what you need, for as long as you need it.”
The Nuts and Bolts of a Physical Slot
For physical streaming replication (the most common kind), you create a physical replication slot. This slot exists solely to track the WAL position (restart_lsn) of the standby it’s attached to. Here’s how you create one. You can’t do it with a simple INSERT; you have to use a dedicated function.
SELECT * FROM pg_create_physical_replication_slot('my_standby_slot_1');
slot_name | xmin | lsn
-----------------+------+-----
my_standby_slot_1 | |
Notice the lsn is null? That’s because the slot is created but not yet being used by any consumer. It’s inert. To use it, you point your standby’s recovery.conf (PostgreSQL 12 and earlier) or postgresql.conf (PostgreSQL 13+) to it.
For PostgreSQL 13 and newer, in your standby’s postgresql.conf:
primary_slot_name = 'my_standby_slot_1'
Once the standby starts and begins streaming, it will “reserve” the slot, and you’ll see its LSN populate in the pg_replication_slots view.
The View From the Bridge: Monitoring Slots
This is the most important part. You must monitor your slots. A slot that isn’t being actively consumed is a disaster waiting to happen. It will prevent the cleanup of WAL files, and if the disconnected standby never comes back, your primary’s disk will fill up. I’ve seen it happen. It’s not pretty.
The system view pg_replication_slots is your best friend here.
SELECT slot_name, slot_type, active, restart_lsn, confirmed_flush_lsn,
pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS lag_bytes
FROM pg_replication_slots;
active: This is the big one.truemeans a standby is currently connected and streaming from this slot. If this isfalsefor too long, sound the alarm.restart_lsn: The oldest WAL location (LSN) that the standby might still need. This is what the primary uses to know what it can’t delete.lag_bytes: How far behind the standby is (in bytes). A large, growing number here indicates the standby is falling behind.
The Danger Zone: Orphaned and Lagging Slots
This is the critical “gotcha.” A replication slot is a promise to keep data. If the standby using it goes away permanently—maybe you decommissioned it and forgot—the primary will just keep every WAL file since that standby’s last known position. Forever. Or until the disk is full and the primary crashes.
You must drop unused slots. If you retire a standby, its slot must be retired too.
SELECT pg_drop_replication_slot('my_old_standby_slot');
Similarly, a slot that is active but has a huge lag_bytes value is a sign of a struggling standby. The primary is still retaining all that WAL, putting pressure on its storage. You need to fix the standby’s performance issues (is it I/O bound? CPU bound?) or else risk the primary’s stability.
Logical Replication Slots: A Quick Word
We’ve been talking about physical slots for streaming replication. There’s also a beast called a logical replication slot. These are used for logical decoding (e.g., the pgoutput plugin for publications/subscriptions). Their job isn’t just to track a WAL position (restart_lsn), but also a “commit” position (confirmed_flush_lsn) within that stream of changes.
They’re even more dangerous than physical slots if left unattended because they can also prevent vacuum from cleaning up old rows, leading to table bloat and transaction ID wraparound. The same rule applies: monitor them like a hawk and drop them the instant they are no longer needed.
In short, replication slots are the only sane way to ensure your standbys don’t fall behind due to WAL recycling. They turn a guessing game into a guarantee. But that guarantee comes with a responsibility: you are now the manager of these promises. You must monitor them, maintain them, and know when to break them by dropping a slot that’s become a liability. It’s a small price to pay for rock-solid replication.