20.7 Backup and Restore for Redis: Snapshots and AOF

Right, let’s talk about backing up your Redis data. This isn’t a “nice to have.” It’s your get-out-of-jail-free card for the day someone fat-fingers a FLUSHDB command or an entire Availability Zone decides to take a nap. Redis has two primary mechanisms for this: snapshots (RDB) and Append Only File (AOF), and ElastiCache supports them to different degrees. They’re fundamentally different, and understanding why you’d pick one over the other is more important than just knowing the AWS console buttons to click.

How Snapshots (RDB) Actually Work

Think of a snapshot as a point-in-time photograph of all your data. Redis forks. Yes, forks. The main process creates a child process (the little worker elf) whose sole job is to write the entire dataset to a durable .rdb file on disk. The beauty here is that the parent process keeps handling your requests while the kid does the hard work of saving. This is great for performance. The downside? If your Redis instance has 50 GB of data, that fork operation isn’t free. The child shares memory with the parent via copy-on-write, so every page the parent modifies during the snapshot gets duplicated, and under heavy writes the footprint can approach double for a brief, terrifying stretch. On a memory-constrained machine, this can cause latency spikes or even out-of-memory kills. It’s the price of admission.
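To put a rough number on that risk, here is a back-of-the-envelope sketch. It assumes copy-on-write behavior, where only the pages written while the child runs get duplicated. The function name and the "dirty percent" input are illustrative assumptions, not figures Redis or ElastiCache reports directly.

```shell
#!/usr/bin/env bash
# Rough peak-memory estimate during a forked snapshot (hypothetical helper).
# Assumes copy-on-write: only pages modified while the child runs are copied.
#   $1 = dataset size in MB
#   $2 = percent of the dataset you expect to be rewritten during the snapshot
estimate_snapshot_peak_mb() {
    local dataset_mb=$1 dirty_percent=$2
    echo $(( dataset_mb + dataset_mb * dirty_percent / 100 ))
}

# Example: 50 GB dataset, 20% of it rewritten while the snapshot runs
estimate_snapshot_peak_mb 51200 20    # prints 61440
```

At 100 percent you get the full doubling; a read-heavy cache stays much closer to the dataset size. On ElastiCache, the reserved-memory-percent parameter exists to hold back this kind of headroom.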

You can trigger snapshots manually or automatically based on rules (e.g., save if 100 keys change in 5 minutes). In ElastiCache, you’re mostly managing this through their automated backup system. Here’s how you force a manual snapshot via the CLI. It’s stupidly simple.

aws elasticache create-snapshot \
    --cache-cluster-id my-redis-cluster \
    --snapshot-name my-manual-backup-20231001

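Now, go to the AWS console and watch it spin. The key thing to remember: a backup isn’t free for a live node. Depending on how much memory is available, ElastiCache either forks or falls back to a slower forkless method, and either way you can see elevated latency while the snapshot runs. If you have read replicas, take the backup from one of them rather than from the primary. Don’t panic. Plan for it.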
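If you would rather script it than watch the console, a minimal sketch might look like the following. The function names, the naming scheme, and the AWS_CMD override (set it to echo for a dry run) are assumptions of this example; the underlying create-snapshot and describe-snapshots calls are the standard CLI ones.

```shell
#!/usr/bin/env bash
# AWS_CMD is an assumption of this sketch: set it to "echo" for a dry run.
AWS_CMD=${AWS_CMD:-aws}

# take_snapshot CLUSTER_ID [DATE_STAMP]
# Creates a manual snapshot named <cluster>-manual-<YYYYMMDD> (UTC date by default).
take_snapshot() {
    local cluster_id=$1
    local stamp=${2:-$(date -u +%Y%m%d)}
    "$AWS_CMD" elasticache create-snapshot \
        --cache-cluster-id "$cluster_id" \
        --snapshot-name "${cluster_id}-manual-${stamp}"
}

# snapshot_status SNAPSHOT_NAME
# Prints the snapshot's current status, for example "creating" or "available".
snapshot_status() {
    "$AWS_CMD" elasticache describe-snapshots \
        --snapshot-name "$1" \
        --query 'Snapshots[0].SnapshotStatus' \
        --output text
}

# Usage (commented out so that sourcing this file has no side effects):
#   take_snapshot my-redis-cluster
#   snapshot_status my-redis-cluster-manual-20231001
```

Polling snapshot_status until it prints available is usually enough for a cron job or a pre-deploy hook.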

The Append Only File (AOF) for Paranoid Engineers

If RDB is a photograph, AOF is a tape recorder of every write command that ever happened. It appends every mutation (SET, SADD, DEL) to a file. On restart, Redis just replays the entire tape to rebuild its state. The durability is fantastic; you can lose, at most, one second of data (if you configure appendfsync everysec, which you should). The obvious problem? The file gets huge. And replaying a million commands on startup is slow.
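On a Redis server you run yourself, the settings described above can be switched on at runtime. This is a sketch under that assumption only: ElastiCache does not expose the CONFIG command, so it does not apply to a managed node. The function name and the REDIS_CLI override are hypothetical.

```shell
#!/usr/bin/env bash
# REDIS_CLI is an assumption of this sketch: set it to "echo" for a dry run.
REDIS_CLI=${REDIS_CLI:-redis-cli}

# enable_aof
# Turns on AOF with a once-per-second fsync on a running, self-managed server.
enable_aof() {
    "$REDIS_CLI" CONFIG SET appendonly yes
    "$REDIS_CLI" CONFIG SET appendfsync everysec
}

# Usage (commented out so that sourcing this file has no side effects):
#   enable_aof
```

CONFIG SET changes only the running server. To survive a restart, the same two settings need to be in redis.conf (or written back with CONFIG REWRITE).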

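Redis solves the huge-file problem with rewriting. It will fork a process (sound familiar?) to create a compacted AOF that contains the minimal set of commands needed to reconstruct the current dataset. Where ElastiCache supports AOF, it manages the file rotation for you in the background. Here’s the catch: AOF in ElastiCache is not a flag on the create command. It’s the appendonly parameter in a custom cache parameter group, and AWS documents it as unsupported on Redis 2.8.22 and later, on Multi-AZ replication groups, and on cache.t1.micro and cache.t2 nodes. On a current engine version, what ElastiCache gives you is snapshots plus replicas, so that’s what the cluster below sets up.

# There is no --aof-enabled flag. On a current engine version, durability
# comes from automatic snapshots (and replicas), configured at creation.
aws elasticache create-cache-cluster \
    --cache-cluster-id my-robust-cluster \
    --engine redis \
    --engine-version "7.1" \
    --cache-node-type cache.m6g.large \
    --num-cache-nodes 1 \
    --snapshot-retention-limit 7   # keep automatic snapshots for 7 days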
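As far as the ElastiCache documentation describes it, AOF is controlled through the appendonly and appendfsync parameters in a custom cache parameter group, and only on legacy engine versions (before Redis 2.8.22). A hedged sketch of that path follows; the function name, the group name in the usage line, and the AWS_CMD override are assumptions of this example, and the parameter group must already exist for an engine family that accepts these parameters.

```shell
#!/usr/bin/env bash
# AWS_CMD is an assumption of this sketch: set it to "echo" for a dry run.
AWS_CMD=${AWS_CMD:-aws}

# enable_aof_in_group GROUP_NAME
# Sets appendonly=yes and appendfsync=everysec in an existing custom parameter group.
enable_aof_in_group() {
    "$AWS_CMD" elasticache modify-cache-parameter-group \
        --cache-parameter-group-name "$1" \
        --parameter-name-values \
            "ParameterName=appendonly,ParameterValue=yes" \
            "ParameterName=appendfsync,ParameterValue=everysec"
}

# Usage (commented out so that sourcing this file has no side effects):
#   enable_aof_in_group my-aof-params
```

A cluster picks the group up through --cache-parameter-group-name at creation or modification time. On a current engine the call is expected to be rejected, which is your cue to lean on snapshots and replicas instead.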

Why You’d Use One, the Other, or Both

Use RDB snapshots if:

You need point-in-time backups for disaster recovery and can tolerate losing a few minutes of data.
You need a way to clone or duplicate your dataset for development.
You’re constrained on disk space. RDB files are compact.

Use AOF if:

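You’re a total pessimist (I prefer “realist”) and your application cannot afford to lose more than about a second of writes. Financial transactions, for example. If even one lost write is unacceptable, appendfsync always syncs on every write, at a real cost in throughput.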
You’re willing to trade off slightly higher disk IOPS and storage cost for that peace of mind.

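The truly bulletproof setup? Use both, where you can. On self-managed Redis, or on a legacy ElastiCache engine that still accepts the appendonly parameter, run daily RDB snapshots with AOF enabled. This gives you the point-in-time recovery and the granular replay. Your disk will hate you, but you’ll sleep like a baby. On current ElastiCache engine versions, the closest equivalent is daily snapshots plus Multi-AZ replicas.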

The Gotchas They Don’t Tell You About

The Restore Isn’t Instantaneous: Restoring a snapshot creates a brand new cluster seeded from the .rdb file. (AOF is different: it is replayed in place when a node restarts.) This isn’t a quick process. It can take tens of minutes. Your application needs to be able to handle being pointed at a new endpoint. This is a failover event, not a toggle.
AOF Can Get Corrupted: It’s an append-only log, and a crash mid-write can leave a truncated or unreadable tail. On a server you run yourself, redis-check-aof --fix can repair it by truncating to the last valid command, but you will lose whatever was after that. Test your restore procedures before you need them.
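Cross-Region is Manual: ElastiCache automated backups are stored in S3, but in the same region as your cluster. For a real disaster scenario, you need to get a copy into another region yourself. The copy-snapshot command has no --target-region flag, so the route is: export the snapshot to an S3 bucket you own in the source region (ElastiCache needs write access to it), then copy the exported file to a bucket in the DR region, where it can seed a new cluster. Automate this. Now. I’ll wait.

# Export to your own bucket in the snapshot's region (bucket names are placeholders)
aws elasticache copy-snapshot \
    --source-snapshot-name my-snapshot \
    --target-snapshot-name my-snapshot-dr-copy \
    --target-bucket my-backups-us-east-1

# Copy the exported .rdb file(s) to us-west-2 for DR
aws s3 cp s3://my-backups-us-east-1/ s3://my-backups-us-west-2/ \
    --recursive --exclude "*" --include "my-snapshot-dr-copy*" \
    --source-region us-east-1 --region us-west-2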
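For the restore gotcha in the list above, here is a minimal sketch of what "restore into a brand new cluster" looks like from the CLI, which also doubles as a practice-restore drill. The function names, the cluster and snapshot names in the usage lines, and the AWS_CMD override are assumptions of this example; it targets a single-node, cluster-mode-disabled setup.

```shell
#!/usr/bin/env bash
# AWS_CMD is an assumption of this sketch: set it to "echo" for a dry run.
AWS_CMD=${AWS_CMD:-aws}

# restore_snapshot NEW_CLUSTER_ID SNAPSHOT_NAME NODE_TYPE
# Creates a new single-node cluster seeded from an existing snapshot.
restore_snapshot() {
    "$AWS_CMD" elasticache create-cache-cluster \
        --cache-cluster-id "$1" \
        --snapshot-name "$2" \
        --engine redis \
        --cache-node-type "$3" \
        --num-cache-nodes 1
}

# wait_and_print_endpoint CLUSTER_ID
# Blocks until the cluster is available, then prints its node endpoint address.
wait_and_print_endpoint() {
    "$AWS_CMD" elasticache wait cache-cluster-available --cache-cluster-id "$1" &&
    "$AWS_CMD" elasticache describe-cache-clusters \
        --cache-cluster-id "$1" \
        --show-cache-node-info \
        --query 'CacheClusters[0].CacheNodes[0].Endpoint.Address' \
        --output text
}

# Usage (commented out so that sourcing this file has no side effects):
#   restore_snapshot my-restored-cluster my-manual-backup-20231001 cache.m6g.large
#   wait_and_print_endpoint my-restored-cluster
```

The printed address is the new endpoint your application has to be repointed at, which is why the restore behaves like a failover rather than a toggle.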

The bottom line? Your backup strategy is a direct reflection of how much pain you’re willing to endure later. Choose wisely. And for goodness’ sake, don’t just set it and forget it. Do a practice restore every few months to make sure the plumbing actually works.