35.5 archive_command and restore_command Configuration

Right, so you’ve decided you don’t want to lose your data. Good for you. This isn’t just about making copies; it’s about building a fire escape for your database. The archive_command and restore_command are the two most critical, and most frequently botched, parts of that escape plan. They are the workhorses of Point-in-Time Recovery (PITR), and if you set them up wrong, your beautiful, redundant WAL archive is just a bunch of useless bits sitting on a disk somewhere. Let’s get it right.

The core concept is simple, but its implications are everything: PostgreSQL continuously writes all changes to Write-Ahead Log (WAL) files. A base backup from pg_basebackup is a snapshot of your data directory at a specific moment, but it’s useless by itself. To recover to any point in time after that snapshot, you need the sequence of WAL files generated since the backup was taken. The archive_command is how you ship each finished WAL file to a safe, durable location. The restore_command is how the server, when in recovery mode, fetches them back.

The `archive_command`: Your Data Shipping Clerk

This is a shell command that PostgreSQL executes every time it finishes a WAL segment. Its job is to take the file it’s given and copy it to your archive location. The magic is that you can make this command do anything. The most common choice is simply using rsync or scp, but let’s be a bit more robust and use cp to a mounted network filesystem, which is dead simple and reliable.

You configure this in postgresql.conf. Here’s a typical, if naive, example:

# In postgresql.conf
archive_mode = on
archive_command = 'test ! -f /mnt/wal_archive/%f && cp %p /mnt/wal_archive/%f'

Let’s autopsy that command, because getting it wrong means silent failure and a non-recoverable database.

%p is replaced by the full path to the WAL file PostgreSQL wants to archive.
%f is replaced by just the filename of the WAL file (e.g., 0000000100000001000000A2).
The test ! -f ... part is a safety check. It ensures we don’t overwrite a file that already exists in the archive. This shouldn’t happen, but if it does, you want the command to fail (non-zero exit code) so PostgreSQL will try to archive it again later. A command that exits with 0 tells PostgreSQL “all good, this WAL is safely archived,” even if you just clobbered a file or copied it to /dev/null. This is the most common pitfall. Your command MUST return a non-zero exit code on failure.

A more paranoid and better practice is to copy the file to a temporary name first, then atomically rename it. This prevents another process from seeing a partially written WAL file.

# A more robust archive_command
archive_command = 'cp %p /mnt/wal_archive/%f.tmp && mv /mnt/wal_archive/%f.tmp /mnt/wal_archive/%f'

The `restore_command`: Your Data Retrieval Hound

This command is only used during recovery. Its job is the inverse of archive_command: it takes a WAL filename (%f) and must find a way to get that file from your archive and place it where the recovering PostgreSQL server expects it (%p).

A typical restore_command in your recovery.conf (PostgreSQL 12 and earlier) or in postgresql.conf with recovery_target settings (PostgreSQL 13+) might look like this:

# In recovery.conf (<=12) or postgresql.conf (>=13)
restore_command = 'cp /mnt/wal_archive/%f %p'

Again, the exit code is crucial. This command must return a non-zero exit code if the requested file is not found. This is how PostgreSQL knows it has reached the end of your available WAL archive and should stop recovering. If it returns 0 but provides an empty or corrupt file, recovery will fail spectacularly. If it always returns 0 (e.g., by outputting a blank file), the server will hang forever waiting for the next file.

Testing: Your Only Hope in a Disaster

You think you’ve set this up? Wonderful. Now prove it. The number of people who skip this step is astronomically high, and they are always the ones weeping during a real outage.

Force a WAL switch and check the logs:
```
psql -c "SELECT pg_switch_wal();" # PostgreSQL 11+
# or
psql -c "SELECT pg_switch_xlog();" # PostgreSQL 10 and earlier
```
Immediately check your PostgreSQL log. It should show the execution of your archive_command and, critically, whether it succeeded or failed. Then, physically go to your archive location (/mnt/wal_archive/ in our example) and verify the file is there.
Perform a dry-run recovery. This is non-negotiable. Take a pg_basebackup, shut down the server, and deliberately corrupt a table. Then, try to use your base backup and WAL archive to recover to a point just before the corruption. If you can’t do it calmly on a Tuesday afternoon, you certainly won’t pull it off at 3 a.m. during a production crisis.

The designers actually got this one right. The system is brilliantly simple and incredibly powerful because it gives you all the rope. You can archive to anywhere—S3, Azure Blob Storage, another planet—as long as you can write a shell command to put a file there and another to get it back. Just remember: the shell command is your responsibility. Make it robust, test it relentlessly, and for heaven’s sake, mind the exit codes.

The archive_command: Your Data Shipping Clerk

The restore_command: Your Data Retrieval Hound

Testing: Your Only Hope in a Disaster

The `archive_command`: Your Data Shipping Clerk

The `restore_command`: Your Data Retrieval Hound