34.3 Synchronous vs Asynchronous Replication

Right, let’s settle the great debate: should your replica be a dutiful, “Yes, sir, right away, sir!” subordinate, or more of an “I’ll get to it when I get to it” kind of background process? This is the core of synchronous versus asynchronous replication, and the choice is far more profound than a simple checkbox. It’s a trade-off between absolute data safety and raw performance, and getting it wrong can lead to some spectacularly unpleasant outcomes.

The Core Difference: Acknowledgment and Blocking

The difference is brutally simple. With asynchronous replication, the primary server commits a transaction the moment it’s done locally. It then says, “Okay, I’ll send that write-ahead log (WAL) data to the replicas… eventually.” Your commit returns immediately, and your application gets on with its life. The replica applies the changes as fast as it can, but it’s always playing catch-up. This is the default, and for good reason: it’s fast.

Synchronous replication, on the other hand, makes the primary server a control freak. It writes the commit to its own WAL, but it won’t report success back to your client until at least one synchronous replica has acknowledged that it received the WAL data and flushed it to disk. Your client application sits and waits for that confirmation. This guarantees that for every transaction the primary has acknowledged, at least one other server holds a durable copy of the data. The cost? Latency. Every commit now has a network round trip added to it.

Here’s the mental model: Async is like sending a text message. You hit “send” and your phone says “Delivered!” but you have no idea if the recipient has actually read it. Sync is like a read receipt. Your message doesn’t say “Delivered!” until their phone explicitly confirms “Hey, I’ve seen this.”
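
If you want to see which camp each of your replicas is currently in, one option is to ask the primary. This is a minimal sketch, assuming PostgreSQL 10 or later (the `*_lsn` column names changed in that release) and at least one connected standby:

```sql
-- Run on the primary: one row per connected standby.
SELECT application_name,
       state,        -- e.g. 'streaming' once the standby has caught up
       sync_state,   -- 'async', 'sync', 'quorum', or 'potential'
       sent_lsn,     -- last WAL position sent to this standby
       flush_lsn     -- last WAL position the standby reports as flushed to disk
FROM pg_stat_replication;
```

A standby showing `async` never holds up a commit; one showing `sync` or `quorum` is a server the primary may wait on.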

Configuring the `synchronous_standby_names` Knob

This is where you define your sync squad. The primary uses this parameter to know who it needs to wait for. The syntax is… well, it’s its own little DSL, honestly. Let’s say you have two replicas named node2 and node3.

To make node2 your one and only synchronous replica:

synchronous_standby_names = 'node2'

But what if node2 crashes? The primary will get stuck waiting for an acknowledgment from a server that’s offline, effectively freezing all commits. That’s… bad. To avoid this, you define a quorum. You tell the primary: “I need at least 1 acknowledgment from any of these servers.”

synchronous_standby_names = 'ANY 1 (node2, node3)'

Now, if node2 dies, the primary will just wait for node3 instead. No freezing. You can also get fancy with priorities:

synchronous_standby_names = 'FIRST 1 (node2, node3)'

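This means the primary prefers node2 and keeps waiting for it as long as it stays connected, even if it is lagging. Only when node2 disconnects does node3 step up from potential to synchronous standby; the primary never requires both. The ANY syntax is more democratic, while FIRST is hierarchical.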
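
The setting can be changed without a restart. The following is a sketch rather than a prescribed procedure, assuming superuser access on the primary, PostgreSQL 10 or later for the ANY syntax, and standbys whose `application_name` is literally node2 and node3 (the names in the list are matched against that value):

```sql
-- Run on the primary as a superuser.
ALTER SYSTEM SET synchronous_standby_names = 'ANY 1 (node2, node3)';

-- The parameter only needs a configuration reload, not a restart.
SELECT pg_reload_conf();

-- Confirm what the primary now considers synchronous.
-- With ANY, listed standbys show sync_state = 'quorum';
-- with FIRST, expect 'sync' for the active one and 'potential' for the spare.
SELECT application_name, sync_priority, sync_state
FROM pg_stat_replication
ORDER BY application_name;
```

If a standby you listed shows up as `async`, the usual suspect is a mismatch between the name in the list and the `application_name` that standby connects with.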

The Performance Hit: It’s All About Latency

Let’s be direct: synchronous replication will make your writes slower. There’s no way around it. The commit now has to wait for a network round-trip to your replica(s) plus the time it takes for that replica to fsync the WAL to its disk. The impact is almost entirely defined by your network latency.

If your primary and replica are in the same data center with a sub-1ms latency, the hit might be barely noticeable for most applications. If you’re trying to synchronously replicate from New York to Tokyo, you’re looking at hundreds of milliseconds of added latency per commit. Your application will feel like it’s running in molasses. This is why synchronous replication across vast geographical distances is almost always a non-starter.
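
Before flipping the switch, it can help to measure what the wait would actually cost. A rough way to do that, assuming PostgreSQL 10 or later, is to read the lag columns on the primary while the system is under a normal write load:

```sql
-- Run on the primary. Each value is an interval.
SELECT application_name,
       write_lag,    -- time until the standby had written the WAL
       flush_lag,    -- time until the standby had flushed it to disk
       replay_lag    -- time until the standby had replayed it
FROM pg_stat_replication;
```

Here `flush_lag` is the closest estimate of the delay each commit would pick up with `synchronous_commit = on`. The values can be NULL when there has been no recent write activity, so sample them while the system is busy.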

The Pitfall: Cascading Timeouts and Freezes

Here’s the nightmare scenario you must design to avoid. Your application has a statement timeout set. The primary is waiting for its sync replica to acknowledge. But the sync replica is having a bad day—maybe its disk I/O is saturated, or there’s a network blip. The primary keeps waiting.

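Eventually, your application’s timeout fires. It tells the database to cancel the query. But here’s the cruel part: the primary is waiting, not executing, and the commit has already been written to its own WAL, so there is nothing left to roll back. If the cancel request reaches the backend, the server stops waiting and issues a warning that the transaction has committed locally but might not have been replicated, so your application sees what looks like a failure for a write that actually happened. If the application simply gives up on its side, the backend stays stuck waiting, holding open the database connection and its locks and potentially blocking other transactions.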

This isn’t a true deadlock, but new commits keep piling up behind the same wait until the replica comes back or you intervene manually, either by restarting the replica or by removing the problematic server from synchronous_standby_names on the primary and reloading the configuration. You need robust monitoring for replica lag and a clear playbook for when a sync replica falls behind or dies.
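
A playbook for this situation might start with something like the following. Treat it as a sketch: it assumes superuser access on the primary, and the second half assumes you have decided that temporarily losing the synchronous guarantee is acceptable.

```sql
-- Step 1: find backends whose commit is waiting on a synchronous standby.
SELECT pid, usename, state, wait_event_type, wait_event
FROM pg_stat_activity
WHERE wait_event = 'SyncRep';

-- Step 2 (emergency only): stop waiting for any standby.
-- An empty list disables synchronous replication entirely; to drop just
-- one bad server, set the list to the remaining names instead.
ALTER SYSTEM SET synchronous_standby_names = '';
SELECT pg_reload_conf();
```

Once the replica is healthy again, restore the original value and reload, or every commit stays asynchronous without anyone noticing.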

Best Practice: A Practical, Hybrid Approach

Very few people make every single commit wait for a replica. The smart money is on a hybrid approach. Keep synchronous_standby_names configured, but set the default synchronous_commit to local so that 99.9% of your traffic is replicated asynchronously, with all of its blistering speed.

Then, for the transactions where data loss is absolutely unacceptable (think finalizing a financial transaction or updating a user’s primary email address), force synchronization for just that one transaction. You can do this at the transaction level with SET LOCAL. Have your application layer issue this command inside the transaction, right before the critical operation:

BEGIN;
SET LOCAL synchronous_commit TO ON;
-- Your super important UPDATE or INSERT here
COMMIT;

Once the transaction is done, the setting reverts to whatever the session was using before. This gives you surgical control. You pay the latency penalty only when the data’s worth it. For everything else, the async default gives you speed. It’s the best of both worlds: the performance of async with the safety of sync when you need it most.
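
For the per-transaction override to mean anything, two things have to be true on the primary: synchronous_standby_names must be non-empty, and the default synchronous_commit must be a level that does not wait for standbys. Here is a sketch of that setup, where the `accounts` table and its values are purely hypothetical:

```sql
-- One-time setup on the primary: by default, commits wait only for
-- the local WAL flush, even though synchronous standbys are configured.
ALTER SYSTEM SET synchronous_commit = 'local';
SELECT pg_reload_conf();

-- Critical transaction: COMMIT returns only after a synchronous
-- standby has flushed this transaction's WAL.
BEGIN;
SET LOCAL synchronous_commit TO on;
UPDATE accounts SET balance = balance - 100 WHERE id = 42;  -- hypothetical table
COMMIT;

-- SET LOCAL has expired; this shows the session-level value again.
SHOW synchronous_commit;
```

If synchronous_standby_names is empty, `synchronous_commit = on` only waits for the local flush, and the override quietly buys you nothing.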
