15.4 Fast Snapshot Restore and Snapshot Lifecycle Manager

Right, let’s talk about making your snapshots actually useful. You’ve dutifully taken them, they’re sitting there in S3, and you’re patting yourself on the back for being a responsible cloud citizen. But here’s the cold, hard truth: a standard snapshot is like a can of soup in your pantry. It’s there, but it’s not dinner until you heat it up. You can create a new volume from a snapshot in moments, but that volume is lazily loaded: each block is pulled from S3 the first time something reads it. The first pass over the data is slow, and getting a large volume fully warmed up can take hours. That’s a non-starter for any application that needs to be back at full speed now. That’s where our first hero, Fast Snapshot Restore, comes in.

Fast Snapshot Restore: Your Get-Out-of-Jail-Free Card

Fast Snapshot Restore (FSR) is AWS’s solution to the “lazy restore” problem. When you enable FSR for a snapshot in a specific Availability Zone, any volume you create from that snapshot in that AZ is fully initialized at creation. There’s no first-read penalty; you get the volume’s full provisioned performance from the start. It’s like they pre-made your dinner and put it in the microwave; all you have to do is hit ‘start’. Enabling it isn’t instant, though. The snapshot passes through the enabling and optimizing states before FSR reports enabled, so switch it on before disaster strikes, not during.

But of course, there’s a catch. Actually, there are several, because this is AWS and nothing fantastic is ever free or simple.

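It’s stupidly expensive. FSR is billed for every hour it’s enabled, per snapshot and per AZ, whether or not you ever create a volume from it. At list prices that adds up to several hundred dollars per snapshot per AZ per month. This is not for your daily dev snapshots. This is for your mission-critical, “if this goes down we lose $50k a minute” volumes. Use it judiciously.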
You have to enable it per snapshot per AZ. It’s not a global setting. You want your production database snapshot to be fast in us-east-1a and us-east-1b? That’s two separate FSR enablements and two separate charges.
There are limits. A per-Region quota caps how many snapshots can have FSR enabled, and the default is low; check Service Quotas for your account’s current value. You can beg AWS for more, but they will rightly ask you why you’re using this sledgehammer on so many nails.
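To put a number on “stupidly expensive”, here is a back-of-the-envelope estimate. It’s a minimal sketch: fsr_monthly_cost is a hypothetical helper, and it assumes a rate of $0.75 per hour per snapshot per AZ (roughly the us-east-1 list price at the time of writing; check the EBS pricing page for your Region) and a 730-hour month.

```shell
#!/usr/bin/env bash
# Back-of-the-envelope FSR cost estimate (hypothetical helper, not an AWS tool).
# FSR is billed per snapshot, per AZ, for every hour it is enabled.
# Usage: fsr_monthly_cost RATE_PER_HOUR SNAPSHOTS AZS [HOURS_PER_MONTH]
fsr_monthly_cost() {
  local rate=$1 snapshots=$2 azs=$3 hours=${4:-730}
  # awk does the floating-point math that bash's $(( )) cannot.
  awk -v r="$rate" -v s="$snapshots" -v a="$azs" -v h="$hours" \
    'BEGIN { printf "%.2f\n", r * s * a * h }'
}

# ASSUMED rate of $0.75/hour; verify against current pricing.
# One golden snapshot enabled in us-east-1a and us-east-1b:
fsr_monthly_cost 0.75 1 2   # prints 1095.00
```

Two AZs for a single snapshot already lands north of a thousand dollars a month at that assumed rate, which is why FSR belongs on a very short list.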

Here’s how you enable it for that one golden snapshot. Notice how you have to specify the AZ. This is a billing operation as much as a technical one.

# Enable FSR for snapshot 'snap-1234567890abcdef0' in us-east-1a
aws ec2 enable-fast-snapshot-restores \
  --availability-zones us-east-1a \
  --source-snapshot-ids snap-1234567890abcdef0

# Later, when the inevitable happens, create a volume from it in the same AZ. Once FSR
# reports "enabled" there, the new volume is fully initialized at creation.
aws ec2 create-volume \
  --availability-zone us-east-1a \
  --snapshot-id snap-1234567890abcdef0 \
  --volume-type gp3
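Because only volumes created after FSR reports enabled get the full benefit, it helps to wait for that state before relying on it. A minimal polling sketch, assuming AWS CLI v2 with credentials configured; fsr_state and wait_for_fsr_enabled are hypothetical helper names, while describe-fast-snapshot-restores and its snapshot-id and availability-zone filters are part of the EC2 CLI.

```shell
#!/usr/bin/env bash
# Sketch: wait until FSR for one snapshot in one AZ reports "enabled".
# Assumes AWS CLI v2 with credentials configured; function names are hypothetical.

# Print the FSR state (enabling, optimizing, enabled, disabling, disabled) for
# a snapshot in an AZ. Prints "None" if FSR was never enabled there.
fsr_state() {
  aws ec2 describe-fast-snapshot-restores \
    --filters "Name=snapshot-id,Values=$1" "Name=availability-zone,Values=$2" \
    --query 'FastSnapshotRestores[0].State' \
    --output text
}

# Poll every INTERVAL seconds (default 60), at most MAX_ATTEMPTS times (default 120).
# Returns 0 once the state is "enabled", 1 if it never gets there.
wait_for_fsr_enabled() {
  local snapshot_id=$1 az=$2 interval=${3:-60} max_attempts=${4:-120}
  local attempt state
  for ((attempt = 1; attempt <= max_attempts; attempt++)); do
    state=$(fsr_state "$snapshot_id" "$az")
    echo "attempt $attempt: FSR state for $snapshot_id in $az is $state" >&2
    if [[ $state == "enabled" ]]; then
      return 0
    fi
    sleep "$interval"
  done
  return 1
}

# Example: only create the volume once FSR is fully enabled in that AZ.
# wait_for_fsr_enabled snap-1234567890abcdef0 us-east-1a && \
#   aws ec2 create-volume --availability-zone us-east-1a \
#     --snapshot-id snap-1234567890abcdef0 --volume-type gp3
```

When the emergency is over, stop the meter with aws ec2 disable-fast-snapshot-restores, which takes the same --availability-zones and --source-snapshot-ids arguments as the enable command.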

The key insight here is that FSR is for disaster recovery scenarios where recovery time objective (RTO) is measured in seconds. For everything else, you need a way to manage the lifecycle of your snapshots so you’re not drowning in storage costs or manual work. Enter the Snapshot Lifecycle Manager.

Snapshot Lifecycle Manager: Automating the Boring (But Critical) Stuff

The Snapshot Lifecycle Manager (SLM), which AWS officially calls Amazon Data Lifecycle Manager (DLM, with its CLI commands under aws dlm), is what you wish you had set up before you got that alarming bill from AWS. It’s a policy-based tool that automatically creates, retains, and, crucially, deletes your EBS snapshots on a schedule. It stops your account from becoming a digital hoarder’s paradise.

You define a policy with a schedule (e.g., “every day at 23:00 UTC”), a retention policy (e.g., “keep the last 14 daily snapshots”), and which volumes to target using tags. This is why tagging your resources isn’t just good hygiene; it’s absolutely mandatory for any semblance of automation. If you aren’t tagging, you’re doing it wrong.
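A count-based rule like “keep the last 14” is simple enough to model in a few lines, which makes it easy to dry-run before you trust it with real data. This is a hypothetical model of the rule, not DLM’s actual implementation: feed snapshots_past_retention (a made-up helper) the snapshot start times as ISO 8601 UTC timestamps, and it prints the ones that a count of N would push out.

```shell
#!/usr/bin/env bash
# Hypothetical model of a count-based retention rule ("keep the newest N").
# Reads snapshot start times (ISO 8601, UTC, one consistent format) one per line
# on stdin and prints those outside the newest KEEP, newest first.
# Timestamps in a single ISO 8601 format sort chronologically as plain strings.
snapshots_past_retention() {
  local keep=$1
  sort -r | tail -n +"$((keep + 1))"
}

# Five daily snapshots with KEEP=3: the two oldest fall out of the window.
printf '%s\n' \
  2024-01-01T23:00:00Z 2024-01-02T23:00:00Z 2024-01-03T23:00:00Z \
  2024-01-04T23:00:00Z 2024-01-05T23:00:00Z |
  snapshots_past_retention 3
# prints 2024-01-02T23:00:00Z, then 2024-01-01T23:00:00Z
```

One difference from the toy model: a real policy only counts, and only deletes, the snapshots it created itself. Snapshots you take by hand aren’t part of its retention window.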

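Let’s create a policy that backs up our production volumes every day and keeps two weeks’ worth. Notice the TargetTags section; that’s how the policy finds its volumes. Be careful with more than one target tag: DLM doesn’t require a volume to carry all of them, and a match on any one is enough. As written, this policy would also pick up a dev volume tagged only Backup=true. If you really need both conditions, fold them into a single dedicated tag.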

# Create the IAM role DLM runs as first (skip this if it already exists).
aws dlm create-default-role --resource-type snapshot

# Then write the policy to policy.json. JSON has no comment syntax, so notes live here:
# - The schedule runs in UTC; CronExpression uses AWS's six-field cron(...) format.
# - CopyTags copies the volume's tags onto each snapshot (a best practice).
{
  "Description": "Daily backup for production volumes",
  "State": "ENABLED",
  "ExecutionRoleArn": "arn:aws:iam::123456789012:role/AWSDataLifecycleManagerDefaultRole",
  "PolicyDetails": {
    "PolicyType": "EBS_SNAPSHOT_MANAGEMENT",
    "ResourceTypes": ["VOLUME"],
    "TargetTags": [
      {
        "Key": "Environment",
        "Value": "Production"
      },
      {
        "Key": "Backup",
        "Value": "true"
      }
    ],
    "Schedules": [
      {
        "Name": "DailyBackups",
        "CreateRule": {
          "CronExpression": "cron(0 23 * * ? *)"
        },
        "RetainRule": {
          "Count": 14
        },
        "CopyTags": true
      }
    ]
  }
}

# Create the policy
aws dlm create-lifecycle-policy \
  --cli-input-json file://policy.json
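JSON has no comment syntax, so a single stray # comment or trailing comma in the policy file is enough for the CLI to reject it. Here is a cheap pre-flight check, sketched under the assumption that python3 is on your PATH (its standard-library json.tool module works as a validator). validate_policy_json is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check: refuse to submit a policy file that isn't valid JSON.
# Uses Python's standard-library json.tool module as the parser (assumes python3 on PATH).
validate_policy_json() {
  if python3 -m json.tool "$1" > /dev/null 2>&1; then
    return 0
  fi
  echo "error: $1 is not valid JSON (look for comments or trailing commas)" >&2
  return 1
}

# Usage:
# validate_policy_json policy.json && \
#   aws dlm create-lifecycle-policy --cli-input-json file://policy.json
```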

The Devil’s in the Details: Common SLM Pitfalls

SLM is great, but it has its quirks, and if you don’t understand them, you’ll get burned.

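Asynchronous Execution: When the policy runs, it doesn’t snap every volume at the exact same moment. Snapshot creation starts within an hour of the scheduled time, and each volume is snapshotted independently, so don’t expect the snapshots of different volumes to line up to the same instant. If you need all the volumes of an instance captured together, target INSTANCE rather than VOLUME in ResourceTypes; DLM then takes crash-consistent multi-volume snapshots.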
Tagging is Everything: I cannot stress this enough. A volume without the right tags will be completely ignored by SLM. It will not snap it. It will not warn you. It will just skip it. Your backup strategy is only as good as your tagging strategy.
Deletion is Final: When SLM deletes an old snapshot because it’s fallen out of the retention window, it’s gone, unless you created a Recycle Bin retention rule for EBS snapshots beforehand. Make damn sure your retention rules are correct before you attach a policy to hundreds of volumes.
API Limits: If you have thousands of volumes, SLM might hit API rate limits when creating all the snapshots at once. If you’re at that scale, you might need to stagger your policies or get a limit increase.
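Since a missing tag fails silently, it’s worth auditing for. Here is a minimal sketch, assuming your policy keys off Backup=true as in the example above. The helper volumes_missing_backup_tag is hypothetical. The commented usage line relies on the AWS CLI’s JMESPath --query support to print one volume-ID and tag-value pair per line, with a missing tag showing up as None.

```shell
#!/usr/bin/env bash
# Hypothetical audit helper: print IDs of volumes whose Backup tag is missing
# or not "true". Input: one "volume-id<TAB>backup-tag-value" pair per line.
volumes_missing_backup_tag() {
  awk -F'\t' '$2 != "true" { print $1 }'
}

# Usage (requires AWS CLI credentials):
# aws ec2 describe-volumes \
#   --query "Volumes[].[VolumeId, Tags[?Key=='Backup'] | [0].Value]" \
#   --output text | volumes_missing_backup_tag
```

Run it on a schedule of its own and you find out about untagged volumes before the day you need their snapshots.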

The one-two punch of SLM for automated, cost-controlled backups and FSR for your most critical recovery scenarios is how you build a robust storage strategy. One handles the boring, daily grind. The other is your panic button for when things go truly sideways. Use both wisely.