8.8 Instance Refresh: Rolling AMI Updates in an ASG

Right, so you’ve got your Auto Scaling Group humming along, serving traffic, feeling good about itself. But here’s the problem: the Amazon Machine Image (AMI) it’s using is starting to feel a little… vintage. Maybe there’s a critical security patch, a new kernel version, or you’ve just perfected your application’s baked-in dependencies. You need to deploy a new AMI. Your first thought might be to just update the launch template and let the ASG work its magic, but if you do that, you’ll quickly learn that nothing happens. The new template only applies to instances launched from that point on; your existing instances keep running the old AMI until something terminates them. The blunt-instrument fix is to terminate the old instances yourself and let the ASG replace them, which works right up until you kill too many at once and cause a delightful little service outage.

This is where the Instance Refresh feature comes in. Think of it as your ASG’s built-in rolling deployment mechanism. It’s how you tell your ASG, “Hey, upgrade these instances for me, but for the love of all that is holy, do it one (or a few) at a time and don’t move on until the replacements are healthy.” It’s the difference between a controlled demolition and just pushing the building over.

How It Actually Works (The Polite Queue Theory)

An Instance Refresh doesn’t replace everything in one go. It works with the ASG’s existing lifecycle hooks and health checks to be a good citizen. You kick it off, and it essentially creates a polite queue, working through the group in batches whose size is set by your preferences. Which instances land in which batch is up to the service, so don’t build anything that depends on a particular order (oldest first, say); treat the order as an implementation detail.

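By default, each batch goes like this: the refresh takes a set of instances out of service and terminates them, launches replacements from your new AMI, and then waits. It waits for the replacements to pass the ASG’s health checks (either the EC2 status checks or, more importantly, any ELB health checks you have configured) and to finish warming up. Only after the replacements are deemed healthy and are presumably serving traffic does it move on to the next batch. Your minimum healthy percentage is what keeps enough capacity serving traffic in the meantime. (If you would rather launch replacements before terminating anything, the MaxHealthyPercentage preference lets the group temporarily run above its desired capacity.) This process repeats until every instance in the group has been replaced. If the refresh fails, for example because new instances never become healthy, it can be configured to roll back automatically; otherwise it just stops and lets you figure out the mess you’ve made.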

Starting a Refresh: The Code

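You can start a refresh via the AWS CLI, SDK, or Console. Here’s the CLI command. Notice the --preferences section: this is where you dictate the terms of the rollout. The --desired-configuration argument is there because, per the AWS documentation, AutoRollback is only available when the refresh is started with a desired configuration that pins a numbered launch template version. The template name and version below are placeholders; substitute your own.

aws autoscaling start-instance-refresh \
  --auto-scaling-group-name my-awesome-asg \
  --desired-configuration '{
    "LaunchTemplate": {
      "LaunchTemplateName": "my-awesome-template",
      "Version": "2"
    }
  }' \
  --preferences '{
    "InstanceWarmup": 300,
    "MinHealthyPercentage": 90,
    "CheckpointPercentages": [25, 50, 75, 100],
    "CheckpointDelay": 120,
    "SkipMatching": false,
    "AutoRollback": true
  }'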

Let’s unpack this. MinHealthyPercentage: 90 means it will never let the number of healthy instances dip below 90% of the desired capacity. If you have 10 instances, it will only ever work on 1 at a time (because 9/10 = 90%). InstanceWarmup: 300 tells the refresh to wait 300 seconds (5 minutes) after a new instance goes into service before counting it as warmed up and moving on to the next batch. This is crucial for apps with long startup times.
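The arithmetic behind that one-at-a-time claim can be sketched in a few lines of shell. This is only a back-of-the-envelope helper: max_batch is a hypothetical name, and the rounding (minimum healthy count rounded up to a whole instance) is an assumption, so treat the output as an estimate rather than a statement of what the service will do.

```shell
# max_batch DESIRED MIN_HEALTHY_PCT
# Prints how many instances can be out of service at once while still
# keeping MIN_HEALTHY_PCT percent of DESIRED healthy.
# Assumption: the minimum healthy count is rounded up to a whole instance.
max_batch() {
  desired=$1
  pct=$2
  min_healthy=$(( (desired * pct + 99) / 100 ))   # integer ceiling
  echo $(( desired - min_healthy ))
}

max_batch 10 90   # 10 instances at 90%: prints 1
max_batch 20 90   # 20 instances at 90%: prints 2
max_batch 10 0    # 0% keeps nothing healthy: prints 10
```

A result of 0 (for example, 4 instances at 90%) means the percentage leaves no room for a batch under this sketch’s rounding; how the service handles small groups is not modelled here.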

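The CheckpointPercentages and CheckpointDelay are your safety brakes. Each time the refresh reaches one of the listed percentages it pauses for 120 seconds, giving you a chance to manually verify everything is working before it continues. One catch: per the API reference, the last checkpoint has to be 100 if you want the whole group replaced, so a list that stops at 75 leaves the last quarter of the group on the old AMI. And 120 seconds is not long for a human to verify anything; if you mean to run real smoke tests at each pause, give yourself a longer delay.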
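To get a feel for where those pauses land in a group of a given size, here is a small sketch that converts checkpoint percentages into instance counts. The helper name checkpoint_counts is hypothetical, and rounding up to a whole instance is an assumption, so read the numbers as approximate.

```shell
# checkpoint_counts DESIRED PCT [PCT...]
# Prints, for each checkpoint percentage, roughly how many instances
# will have been replaced when the refresh pauses there.
# Assumption: counts are rounded up to a whole instance.
checkpoint_counts() {
  desired=$1
  shift
  for pct in "$@"; do
    count=$(( (desired * pct + 99) / 100 ))   # integer ceiling
    printf '%s%% -> %s of %s instances\n' "$pct" "$count" "$desired"
  done
}

# Example: a 20-instance group with checkpoints at 25, 50, 75 and 100.
checkpoint_counts 20 25 50 75 100
```

For a 20-instance group this puts the pauses at 5, 10, 15, and 20 replaced instances, which is a useful sanity check that your first checkpoint is small enough to act as a canary.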

The Pitfalls and “Questionable Choices”

This is powerful, but it’s not magic. The designers made some choices you need to be aware of.

First, the default values may not be what you expect. If you don’t specify preferences, MinHealthyPercentage defaults to 90%, and InstanceWarmup falls back to the group’s default instance warmup (or, failing that, its health check grace period). That may or may not suit your application. The range of legal values also hides a trap: a MinHealthyPercentage of 0 (!) means it will cheerfully terminate all your instances at once, load balancer or no load balancer. Always, always define your preferences.

Second, health checks are everything. If your application takes 400 seconds to start passing its ELB health check but the group only gives new instances 300 seconds (between the health check grace period and InstanceWarmup), the ASG will decide the new instance is unhealthy and replace it, and the replacement will meet the same fate. The refresh will fail and, if AutoRollback is true, it will roll the group back to the previous configuration. You’ll be staring at the refresh’s status reason and the group’s scaling activity history wondering what went wrong. Your grace period and InstanceWarmup must be longer than your application’s startup time.

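Third, SkipMatching is a weird one. If set to true, it will skip replacing instances that already match the desired configuration (that is, the new launch template and version). This sounds good, and it is handy for picking up where a refresh that failed halfway left off. But “matches” only means the configuration matches: an instance that looks right on paper is left alone even if it has picked up drift since it booted. With false, every instance is replaced regardless. Sometimes you want the shortcut, often you don’t. Know what you’re choosing.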

Best Practices: Don’t Be a Hero

Test in a Staging Environment First: Your application and its health checks can behave differently during a rolling deploy than they do in steady state. Test it.
Use Meaningful Checkpoints: For a large production deployment, use those checkpoints. Pause at 25% and 50%, run some smoke tests, then let it continue.

Monitor the Refresh: The CLI command returns an InstanceRefreshId. Use describe-instance-refreshes to track its status.

aws autoscaling describe-instance-refreshes \
  --auto-scaling-group-name my-awesome-asg \
  --instance-refresh-ids <your-refresh-id>
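If you would rather not re-run that command by hand, a polling loop along these lines can do the watching for you. Treat it as a sketch: the function names is_terminal_status and wait_for_refresh are made up for this example, the 30-second default interval is arbitrary, and the list of terminal statuses reflects the status values documented at the time of writing.

```shell
# Succeeds (exit 0) if the given status is one the refresh will not
# leave on its own; fails (exit 1) for anything still in flight.
is_terminal_status() {
  case "$1" in
    Successful|Failed|Cancelled|RollbackSuccessful|RollbackFailed) return 0 ;;
    *) return 1 ;;
  esac
}

# wait_for_refresh GROUP REFRESH_ID [INTERVAL_SECONDS]
# Polls until the refresh reaches a terminal status.
# Exits 0 on Successful, 1 on any other terminal status, 2 if the CLI call fails.
wait_for_refresh() {
  group=$1
  refresh_id=$2
  interval=${3:-30}
  while :; do
    status=$(aws autoscaling describe-instance-refreshes \
      --auto-scaling-group-name "$group" \
      --instance-refresh-ids "$refresh_id" \
      --query 'InstanceRefreshes[0].Status' \
      --output text) || return 2
    echo "refresh $refresh_id: $status"
    if is_terminal_status "$status"; then
      [ "$status" = "Successful" ] && return 0
      return 1
    fi
    sleep "$interval"
  done
}

# Usage (not run here, since it needs AWS credentials):
#   wait_for_refresh my-awesome-asg <your-refresh-id> 30
```

Because the function’s exit code reflects the outcome, it can gate the next step in a deployment script, for example by only tagging the release as deployed when wait_for_refresh succeeds.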

Keep the Group Balanced: Instance Refresh works best when your ASG is operating normally. Don’t try to run a refresh while you’re also manually messing with the desired capacity or executing other scaling policies. Let the process do its job.

Used correctly, Instance Refresh is the definitive way to deploy a new AMI without causing an incident. It removes the need for clunky custom scripts and leverages the built-in logic of the ASG. Just remember: it trusts your health checks implicitly, so you’d better make sure your health checks are telling it the truth.