35.6 Testing Your Backups: Restore Drills
Right, let’s get this out of the way: if you haven’t actually restored from your backup, you don’t have a backup. You have a hopeful ritual. You’re performing a rain dance and praying for precipitation. A restore drill is the only way to turn that prayer into a verified, working fact. It’s the difference between “I think this will work” and “I know this will work because I did it last month and it was a pain, but it worked.” We’re going to make that pain predictable.
The single biggest pitfall here is doing this for the first time during an actual emergency. Panic, stress, and a screaming manager hovering over your shoulder are not ingredients for a successful first attempt. Schedule this. Put it in the calendar. Call it “Disaster Recovery Game Day.” It sounds more official and management loves games.
The Anatomy of a Restore Drill
Your drill needs a target. Never, ever test a restore on your primary production database. The goal is to create a completely isolated clone. My preferred method is to spin up a throwaway VM or use a dedicated staging server. The process is a brutal, step-by-step logic puzzle.
First, you need to get your backup files onto the target system. Let’s assume you have a logical backup from pg_dump sitting on a backup server.
# On your backup server
scp /backups/my_database_20231027.sql dba@recovery-vm:/tmp/
# On your recovery VM, first ensure PostgreSQL is stopped and the old data dir is nuked.
sudo systemctl stop postgresql
rm -rf /var/lib/postgresql/16/main/*
Now, we initialize a new, empty database cluster. This is crucial because we need a clean slate; restoring over a partially corrupted or existing cluster is a recipe for bizarre errors.
sudo -u postgres /usr/lib/postgresql/16/bin/initdb -D /var/lib/postgresql/16/main/
sudo systemctl start postgresql
And now, the moment of truth. We feed our SQL file to the psql client. The -v ON_ERROR_STOP=ON flag is non-negotiable. It turns a trickle of errors into a hard stop, forcing you to pay attention to what’s broken.
sudo -u postgres psql -v ON_ERROR_STOP=ON -f /tmp/my_database_20231027.sql postgres
Watch the output like a hawk. Any ERROR-level message means your backup is not clean. It might still be usable, but it’s not clean. Common culprits are missing extensions or attempts to create pre-existing objects. You need to understand each one.
Testing Physical Backups and PITR
Testing a physical base backup with pg_basebackup is a different beast. You’re copying the entire cluster’s data files, so the restore is more of a file operation than a SQL one.
Here’s how you test a pg_basebackup tar archive. Extract it to your clean data directory, but you MUST provide the postgres user with ownership first.
sudo systemctl stop postgresql
rm -rf /var/lib/postgresql/16/main/*
sudo chown postgres:postgres /var/lib/postgresql/16/main
# Extract the backup as the postgres user. The -k flag is magic; it restores symlinks correctly.
sudo -u postgres tar -xvf /backups/base_backup.tar.gz -C /var/lib/postgresql/16/main/ -k
Now, here’s the designer’s questionable choice you must remember: after a physical restore, PostgreSQL will look for a recovery.signal file in the data directory. If it finds it, it will attempt to replay WAL files. For a simple base backup test, you don’t want this. You want to see if the base backup itself is consistent.
# To test JUST the base backup, create a simple recovery.conf to stop it immediately.
sudo -u postgres touch /var/lib/postgresql/16/main/recovery.signal
echo "recovery_target = 'immediate'" | sudo -u postgres tee -a /var/lib/postgresql/16/main/postgresql.auto.conf > /dev/null
sudo systemctl start postgresql
The server will start, enter recovery mode, hit the ‘immediate’ target, and then promote itself to a primary. Check the logs to ensure it worked and then connect to see if your data is there.
Verifying It Actually Worked
Don’t just check if the server starts. That’s a depressingly low bar. Your verification script should connect to the restored database and run sanity checks.
-- Check that tables exist and have the right number of rows.
SELECT schemaname, relname, n_live_tup
FROM pg_stat_user_tables
WHERE schemaname NOT IN ('pg_catalog', 'information_schema');
-- Check a critical table's most recent entry.
SELECT MAX(created_at) FROM important_audit_table;
-- Verify a specific value is present.
SELECT * FROM users WHERE email = 'ceo@company.com';
Automate this. Put these queries in a script. The output should give you a clear, binary “YES” or “NO” about the backup’s validity.
The ultimate, gold-standard test? Point a copy of your application at this restored database and let it run for a few minutes. See if it throws errors. This tests not just the data’s existence, but its integrity. It’s the only way to be truly sure. It’s a bit more work, but so is explaining to everyone why the “working” backup you had didn’t actually work with the app. Do the work.