Alright, let’s talk about the cloud’s best-kept secret and my personal favorite way to save a fortune: Spot Instances. Think of them as the stock market for AWS’s leftover compute capacity. AWS has servers sitting around, not making money, and would rather sell you time on them at up to 90% off the On-Demand price than have them idle. The catch? AWS can take them back with a two-minute warning whenever it needs the capacity for someone paying full price. It’s a steal, but you have to be ready for your stuff to get evicted.

This isn’t some niche tool for weirdos; it’s a core strategy for anyone doing batch processing, CI/CD, data analysis, or anything else that’s fault-tolerant. If your application can handle being abruptly shut down, you are literally leaving money on the table by not using Spot.

How the Spot Market Actually Works (It’s Not Magic)

Forget the idea of a constant price. The Spot price for each instance type in each Availability Zone is set by AWS based on long-term supply and demand. When spare capacity is high and demand is low, the price plummets. When capacity gets tight, the price rises. You can set a maximum price you’re willing to pay per hour; if you don’t, it defaults to the On-Demand price. Your instance runs as long as the current Spot price is at or below your max price and AWS still has spare capacity. The moment the Spot price exceeds your max price (or, more commonly, when AWS simply needs the capacity back) your instance gets a termination notice. This is the infamous Spot Interruption.

The key thing to understand is that you aren’t bidding against other users. You’re setting a price ceiling. If the market price is $0.05 and you set your max to $10, you’ll still only pay $0.05. You’re just saying “I’m willing to pay up to $10 to keep this thing running.” The market price is the same for everyone.
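
To make the ceiling rule concrete, here’s a tiny sketch. It’s a toy model, not anything AWS ships: spot_hourly_charge is a made-up helper, the prices are the illustrative ones from the paragraph above, and it assumes awk is installed.

```shell
#!/usr/bin/env bash
# Toy model of the price-ceiling rule: you pay the market price, not your max.
# spot_hourly_charge is a hypothetical helper for illustration, not an AWS tool.
spot_hourly_charge() {
    local market="$1" ceiling="$2"
    awk -v market="$market" -v ceiling="$ceiling" 'BEGIN {
        if (market + 0 <= ceiling + 0)
            printf "running, billed at $%.2f/hour\n", market
        else
            print "interrupted: market price is above your max"
    }'
}

spot_hourly_charge 0.05 10     # prints: running, billed at $0.05/hour
spot_hourly_charge 0.12 0.10   # prints: interrupted: market price is above your max
```

Notice that the $10 ceiling never shows up in the bill. It only decides whether the instance keeps running.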

The Two-Minute Warning: Your Graceful Exit

When AWS decides to reclaim your instance, they don’t just pull the plug. They send a termination notice to the instance metadata service. This is your application’s chance to panic gracefully. You’ve got two minutes to wrap up what you’re doing, save state, commit results, or shut down cleanly.

You can see this notice by querying a specific URL from within the instance itself. Here’s how you might check for it in a shell script that’s part of your application’s startup routine.

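#!/bin/bash

# Check the metadata endpoint (IMDSv1 style) for a Spot termination notice
check_termination() {
    local notice_url="http://169.254.169.254/latest/meta-data/spot/termination-time"
    # -f makes curl fail on the 404 returned while no notice is pending,
    # so only a real timestamp (like 2024-01-01T12:02:00Z) reaches grep
    if curl -sf --max-time 1 "$notice_url" | grep -q "T.*Z"; then
        echo "Spot termination notice received. Shutting down gracefully..."
        # Call your cleanup script here
        /usr/local/bin/my_cleanup_task.sh
        # The loop below runs in a background subshell, so this exit stops
        # the watcher only, not the script that launched it
        exit 0
    fi
}

# Check every 5 seconds in the background
while true; do
    check_termination
    sleep 5
done &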

The real pros use the Instance Metadata Service v2 (IMDSv2), which is more secure. The code is a bit more verbose but worth it.

# Using IMDSv2: get a session token first, then check for termination
TOKEN=$(curl -sf -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
# -f turns the 404 returned while no notice is pending into a curl failure
if curl -sf -H "X-aws-ec2-metadata-token: $TOKEN" "http://169.254.169.254/latest/meta-data/spot/termination-time" | grep -q "T.*Z"; then
    echo "We're going down! Start cleanup."
fi
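
Once you have that timestamp, it helps to know how much of the two minutes is actually left, so your cleanup can skip the optional stuff when time is short. Here’s a minimal sketch. seconds_until is a hypothetical helper name, and it assumes GNU date (which is what you get on Amazon Linux and most Linux AMIs).

```shell
#!/usr/bin/env bash
# Sketch: turn a termination-time value into a countdown in seconds.
# Assumes GNU date. seconds_until is a hypothetical helper name.
seconds_until() {
    local deadline="$1"                   # e.g. 2024-01-01T12:02:00Z
    local now="${2:-$(date -u +%s)}"      # optional epoch override, handy for testing
    local deadline_epoch
    deadline_epoch=$(date -u -d "$deadline" +%s) || return 1
    echo $(( deadline_epoch - now ))
}

# Example with a fixed "now" so the result is reproducible:
now=$(date -u -d "2024-01-01T12:00:00Z" +%s)
seconds_until "2024-01-01T12:02:00Z" "$now"    # prints: 120
```

In a real watcher you’d capture the curl output into a variable instead of piping it straight to grep, then pass that variable to seconds_until.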

Picking Your Poison: Allocation Strategies

This is where you get smart. When you request Spot Instances, usually as part of an Auto Scaling Group or Spot Fleet, you tell AWS how to choose which instances to launch. Your main options are:

  • lowest-price: The default, and often the worst choice. It launches your instances from the cheapest pools. If one of those pools gets drained, a big chunk of your capacity can disappear at once because you put all your eggs in the cheapest basket.
  • capacity-optimized: This is the one you probably want. It looks at the pools with the most available capacity and the lowest chance of interruption. It’s brilliant because it prioritizes stability over saving the absolute last penny. The price difference is usually negligible, and the reduced interruption rate is worth its weight in gold.
  • price-capacity-optimized: A blend of the two above. It picks pools that are both low in price and high in available capacity. It’s a great balanced choice.

The lowest-price strategy is a classic example of a questionable default. It looks good on a marketing slide but can bite you in production. Use capacity-optimized.
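
If the difference feels abstract, here’s a toy comparison. Big caveat: every number below is invented, AWS doesn’t publish per-pool capacity figures, and the real strategies are more sophisticated than a single sort. The function names are hypothetical too. Treat it as intuition only.

```shell
#!/usr/bin/env bash
# Toy comparison of two allocation strategies over made-up pools.
# Columns: pool name, hourly price, spare capacity units.
# All numbers are invented for illustration only.
pools() {
    cat <<'EOF'
m5.large/us-east-1a 0.031 4
m5a.large/us-east-1b 0.034 40
m6i.large/us-east-1c 0.036 55
EOF
}

# lowest-price (simplified): take the single cheapest pool.
pick_lowest_price() {
    pools | LC_ALL=C sort -k2,2n | head -n 1 | cut -d' ' -f1
}

# capacity-optimized (simplified): take the pool with the most spare capacity.
pick_capacity_optimized() {
    pools | LC_ALL=C sort -k3,3nr | head -n 1 | cut -d' ' -f1
}

pick_lowest_price          # prints: m5.large/us-east-1a
pick_capacity_optimized    # prints: m6i.large/us-east-1c
```

In this made-up table the cheapest pool is also the one with the least headroom, which is exactly the situation where chasing the lowest price hurts.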

The Gotchas: Where They Really Get You

  1. Stopping vs. Terminating: By default, when a Spot instance is interrupted, it’s terminated. Not stopped. This means your local ephemeral storage is nuked. If you want to preserve the root volume, you must change the interruption behavior to “Stop” instead of “Terminate”. But be warned: a stopped instance still costs you for the EBS volume, and it can’t be started again until its Spot pool becomes available, which might be never.
  2. The Rebalance Recommendation: This is a newer, sneakier type of signal. It’s not a termination notice; it’s a “hey, your instance is at high risk of interruption, you might want to proactively move your workload.” It’s a fantastic early warning system if you know how to listen for it.
  3. Not All Workloads Are Equal: Trying to run a monolithic database or a stateful game server on Spot? Don’t. You will have a bad time. Spot is for cattle, not pets. Design your applications to be stateless and fault-tolerant from the ground up.
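
Listening for the rebalance recommendation from gotcha 2 looks a lot like the termination check. Here’s a hedged sketch using IMDSv2. The function names are hypothetical, the sample body in the demo is just an example of the JSON shape, and check_rebalance only does anything useful when it runs on an EC2 instance.

```shell
#!/usr/bin/env bash
# Sketch: listen for a rebalance recommendation via IMDSv2.
# Function names are hypothetical; the path is the EC2 metadata path for
# rebalance recommendations.
IMDS="http://169.254.169.254/latest"

# Pull noticeTime out of a body like {"noticeTime": "2020-10-27T08:22:00Z"}.
# Prints nothing when the field is absent.
parse_notice_time() {
    printf '%s\n' "$1" |
        sed -n 's/.*"noticeTime"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Only meaningful on an EC2 instance. Returns non-zero when no recommendation
# has been issued (the endpoint answers 404, which curl -f treats as failure).
check_rebalance() {
    local token body
    token=$(curl -sf --max-time 2 -X PUT "$IMDS/api/token" \
        -H "X-aws-ec2-metadata-token-ttl-seconds: 60") || return 1
    body=$(curl -sf --max-time 2 -H "X-aws-ec2-metadata-token: $token" \
        "$IMDS/meta-data/events/recommendations/rebalance") || return 1
    parse_notice_time "$body"
}

# Offline demo of the parser with a sample body:
parse_notice_time '{"noticeTime": "2020-10-27T08:22:00Z"}'   # prints: 2020-10-27T08:22:00Z
```

A reasonable pattern is to poll check_rebalance alongside the termination check and, when it prints a timestamp, start draining work off the instance before the two-minute clock ever starts.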

Your Best Practice Cheat Sheet

  1. Use Diverse Instance Types: Don’t just request one instance type. Request a dozen. The more pools you pull from, the more likely you are to get and maintain capacity. An Auto Scaling Group with a dozen instance types across three Availability Zones is incredibly resilient.
  2. Embrace Spot Fleets or ASGs: Never manually manage Spot Instances. Always use a service that automatically handles the requests and replacements for you. Spot Fleet is the dedicated service for this; Auto Scaling Groups now have native Spot support and are often simpler.
  3. Combine with On-Demand: A hybrid strategy is often the best. Use On-Demand instances for the baseline, guaranteed capacity of your application (e.g., the minimum number of servers you always need) and use Spot instances for the scalable, bursty part (e.g., handling a traffic queue). This gives you both stability and massive savings.

Here’s a quick CloudFormation snippet showing a mixed instances policy for an ASG, which is the real-world way you’d do this. It uses capacity-optimized and a diverse list of instance types.

MyAutoScalingGroup:
  Type: AWS::AutoScaling::AutoScalingGroup
  Properties:
    # ... your other properties like VPC, Subnets, etc ...
    MixedInstancesPolicy:
      InstancesDistribution:
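        # The first 2 instances of capacity are On-Demand
        OnDemandBaseCapacity: 2
        # Everything above the base is 0% On-Demand, i.e. 100% Spot
        OnDemandPercentageAboveBaseCapacity: 0
        SpotAllocationStrategy: capacity-optimized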
      LaunchTemplate:
        LaunchTemplateSpecification:
          LaunchTemplateId: !Ref MyLaunchTemplate
          Version: !GetAtt MyLaunchTemplate.LatestVersionNumber
        Overrides:
          - InstanceType: m5.large
          - InstanceType: m5d.large
          - InstanceType: m4.large
          - InstanceType: m6i.large
          - InstanceType: m5a.large

With this setup, the first 2 instances in the group are always On-Demand (keep the group’s minimum size at 2 or more if you want that baseline guaranteed), and every instance above that is Spot, with the capacity-optimized strategy picking from that diverse list of instance types.
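
Here’s a back-of-the-envelope sketch of how those two On-Demand settings carve up a group. capacity_split is a hypothetical helper, and rounding the On-Demand share up is my assumption; check the Auto Scaling docs if you use a percentage that doesn’t divide evenly.

```shell
#!/usr/bin/env bash
# Sketch of how the On-Demand/Spot split falls out of the two settings.
# capacity_split is a hypothetical helper. Rounding the On-Demand share up
# is an assumption, not confirmed AWS behavior.
capacity_split() {
    local desired="$1" base="$2" pct="$3"
    local on_demand above
    if [ "$desired" -le "$base" ]; then
        # The whole group fits inside the On-Demand base.
        on_demand="$desired"
    else
        above=$(( desired - base ))
        on_demand=$(( base + (above * pct + 99) / 100 ))
    fi
    echo "on_demand=$on_demand spot=$(( desired - on_demand ))"
}

capacity_split 10 2 0    # prints: on_demand=2 spot=8
capacity_split 1 2 0     # prints: on_demand=1 spot=0
capacity_split 10 2 50   # prints: on_demand=6 spot=4
```

The first call matches the template above: a group of 10 ends up with 2 On-Demand instances and 8 on Spot.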

Mastering Spot is less about memorizing commands and more about adopting a mindset: embrace ephemerality, build for failure, and let the cloud’s fluctuations fund your next project. Now go save some money. You’ve earned it.