1.5 The Write-Ahead Log (WAL): Durability Without Flush-Per-Write
Right, let’s talk about the single most important reason you don’t lose data when your database server suddenly loses power, gets kicked by the datacenter janitor, or just decides to have a bad day. It’s not magic, it’s the Write-Ahead Log, or WAL. This is the unsung hero of your database’s durability, and understanding it is non-negotiable if you want to call yourself a Postgres professional.
The core problem is simple: writing data to your main table and index files (the “heap”) is slow. These files are large, scattered across the disk, and updating them involves a lot of random I/O. If we had to wait for a full fsync on these files to confirm every single INSERT or UPDATE, your database’s throughput would be measured in transactions per minute, not per second. It would be a disaster.
Postgres solves this with a brilliantly simple, “why didn’t I think of that?” idea: don’t write the data first, write a description of the change first. That’s the WAL. It’s a sequential, append-only log of every change made to the database. Because it’s sequential, writing to it is incredibly fast. The rule is ironclad: no change to the heap files is ever written until the change that describes it has been flushed to durable storage in the WAL.
How the WAL Saves Your Bacon
Here’s the crash scenario. You commit a transaction. Postgres writes the relevant WAL records to its in-memory buffer. The operating system might tell us it’s written to disk, but that’s often a lie—it’s probably just in the OS’s cache. If the power fails now, that data is gone. Poof.
To make the transaction truly durable, we need to get those WAL records from the OS’s cache onto the physical disk. This is the fsync() system call. This is the moment of truth. Once fsync returns successfully, you get your “commit successful” message. The actual corresponding heap data might still be sitting in Postgres’s shared buffers, unwritten to disk. And that’s fine.
If the server crashes five seconds later, upon restart, Postgres goes into recovery mode. It reads the WAL from the last known good checkpoint and says, “Ah, I see we had a transaction #12345 that was committed but never made it to the heap files. Let me replay that change from the WAL and fix it.” Your data is recovered. You lose nothing. This is durability.
Tuning the Safety Dial: synchronous_commit
Now, this fsync is a massive performance bottleneck. It’s a blocking operation that forces the disk to actually do its job. For some applications, you might be willing to trade a tiny amount of durability risk for a big performance gain. This is where synchronous_commit comes in. It’s your safety dial.
The default is on, which is full, by-the-book durability. But let’s look at your options.
-- The safe, default choice. Wait for WAL flush. Your data is safe.
SET synchronous_commit = ON;
-- A bit faster. Wait for the WAL write to the OS cache, but not the flush to disk.
-- You're safe from a Postgres crash, but not a full OS crash. Risky.
SET synchronous_commit = OFF;
-- My favorite for many read-heavy workloads. It's "ON" unless you're in a hot standby.
-- Perfect for single-node setups where you want full durability but less communication overhead.
SET synchronous_commit = LOCAL;
-- For when you're absolutely, positively sure you don't care about the last few transactions.
-- Only use this for bulk loading or on a temporary table. You will lose data on a crash.
SET synchronous_commit = OFF;
ALTER TABLE my_table SET (autovacuum_enabled = false); -- Often done together for bulk loads
The Pitfall: Never set synchronous_commit = off globally in postgresql.conf unless you truly understand the consequences. A power loss could mean losing the last few seconds of committed transactions. The designers gave you this power, but it’s your responsibility not to shoot yourself in the foot with it.
Checkpoints: The WAL’s Garbage Collection
If the WAL is an endless tape of every change, we’d eventually run out of disk space. This is where checkpoints come in. A checkpoint is a moment where Postgres forces all dirty buffers to be written to the heap files and then marks a point in the WAL as being “done.” Once a change has been safely written to the heap files, the WAL records for that change are no longer needed for crash recovery and can be recycled or removed.
You control how often this happens with checkpoint_timeout and max_wal_size. Tuning this is a balance: frequent checkpoints minimize recovery time after a crash but can hurt performance due to constant write bursts. Infrequent checkpoints are gentler on I/O but lead to larger recovery times.
-- Check the current configuration (output is example)
SHOW checkpoint_timeout;
-- checkpoint_timeout
-- ----------------------
-- 5min
SHOW max_wal_size;
-- max_wal_size
-- ---------------
-- 1GB
A common best practice on modern hardware with sufficient I/O capacity is to increase checkpoint_timeout to 15 or even 30 minutes to smooth out the write load, letting max_wal_size act as the safety valve. If the WAL hits the max_wal_size before the timeout, a checkpoint will be triggered early.
The wal_level and Replication
The WAL’s job isn’t just crash recovery. It’s also the beating heart of replication. To use logical replication or to have a hot standby replica, you need to set wal_level to logical or replica, respectively. This adds more information to the WAL records so that replicas can understand and apply them. This is a fantastic design—the same mechanism that guarantees durability for a single node also enables high availability and scalability across multiple nodes. It’s elegant as hell.
-- To set this, you need to restart Postgres. Plan for it.
-- Check the current level
SHOW wal_level;
-- wal_level
-- ------------
-- replica
-- To change it, edit postgresql.conf:
-- wal_level = logical
The trade-off is that higher wal_level settings can increase WAL volume slightly. It’s almost always worth it. The designers got this one right.