40.5 Real-Time Inference Endpoints: Auto Scaling and Multi-Model Endpoints

Right, so you’ve trained a model that’s a veritable genius at identifying pictures of cats wearing tiny hats. Fantastic. But now what? You can’t just email the model file to your production team and call it a day. You need to serve predictions, and you need to do it at scale, without melting your credit card into a puddle. Welcome to the world of SageMaker real-time endpoints. This is where your model meets the real world, and the real world is a demanding, fickle jerk.

Think of a real-time endpoint as a highly specialized, always-on API. You send it some data (an image of a cat, a row of tabular data, etc.), and it sends back a prediction (e.g., “Hat: 98.7% confidence”). The magic—and the complexity—lies in how SageMaker manages the underlying compute resources to keep this API responsive, cost-effective, and available, even when traffic decides to do its best impression of a hockey stick graph.
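To make the "always-on API" idea concrete, here is a minimal sketch of one request/response round trip through the low-level runtime client. The helper name classify, the endpoint name, and the CSV payload are illustrative assumptions; the content type your container expects may well differ.

```python
def classify(runtime_client, endpoint_name, payload, content_type="text/csv"):
    """Send one request to a real-time endpoint and return the response as text.

    runtime_client is expected to be boto3.client("sagemaker-runtime").
    """
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType=content_type,
        Body=payload,
    )
    # The response body is a stream: read it fully, then decode it
    return response["Body"].read().decode("utf-8")


# Usage (requires AWS credentials and a deployed endpoint):
#   import boto3
#   runtime = boto3.client("sagemaker-runtime")
#   print(classify(runtime, "my-cat-hat-endpoint", "0.12,0.87,0.33"))
```

Passing the client in as an argument keeps the helper easy to exercise with a stub before any real endpoint exists.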

The Anatomy of an Endpoint Config

Before you even think about scaling, you have to define what is being scaled. This happens in the EndpointConfig, and it’s where you make your most crucial decisions. You’ll specify the model, the instance type (like ml.m5.large), and crucially, the initial instance count.

from sagemaker.model import Model
from sagemaker.predictor import Predictor
from sagemaker import get_execution_role

# Assuming you already have a model.tar.gz in S3 from a training job
image_uri = '123456789012.dkr.ecr.us-west-2.amazonaws.com/my-custom-algo:latest'

model = Model(
    model_data='s3://my-bucket/path/to/model.tar.gz',
    image_uri=image_uri,
    role=get_execution_role(),
    predictor_cls=Predictor,  # Without this, deploy() returns None instead of a predictor
)

# deploy() creates the SageMaker Model, the Endpoint Configuration, and the Endpoint in one call
predictor = model.deploy(
    initial_instance_count=2,  # Start with two instances for redundancy
    instance_type='ml.m5.xlarge',
    endpoint_name='my-cat-hat-endpoint',
    wait=True  # This blocks until the endpoint is fully deployed
)

The initial_instance_count is your baseline. It’s the number of instances that will be running 24/7, waiting for traffic. Set it to 1 for dev, but always set it to at least 2 for production. Why? Because if that one lonely instance decides to take a nap (or AWS performs a mandatory hardware update), your endpoint goes down. With two or more instances, SageMaker tries to spread them across Availability Zones, which gives you high availability out of the gate.
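If you want to see what the SDK call hides, the sketch below builds the request for the low-level CreateEndpointConfig API. The helper name and the config and model names are hypothetical; the dictionary keys are the ones boto3's create_endpoint_config accepts. "AllTraffic" is the variant name the Python SDK uses by default, which is why it shows up again in auto scaling resource IDs.

```python
def build_endpoint_config(config_name, model_name, instance_type, instance_count):
    """Build keyword arguments for boto3's sagemaker create_endpoint_config call."""
    if instance_count < 1:
        raise ValueError("instance_count must be at least 1")
    return {
        "EndpointConfigName": config_name,
        "ProductionVariants": [
            {
                "VariantName": "AllTraffic",  # default variant name used by the SDK
                "ModelName": model_name,  # must name an existing SageMaker Model
                "InstanceType": instance_type,
                "InitialInstanceCount": instance_count,
                "InitialVariantWeight": 1.0,  # a single variant gets all traffic
            }
        ],
    }


request = build_endpoint_config(
    "my-cat-hat-config", "my-cat-hat-model", "ml.m5.xlarge", 2
)

# Usage (requires AWS credentials and an existing Model with that name):
#   import boto3
#   boto3.client("sagemaker").create_endpoint_config(**request)
```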

Configuring Auto Scaling: Because Psychic Infrastructure Isn’t a Thing

Your baseline instances are like the bouncers always on duty at a club. But what happens when a sudden celebrity tweet causes a line around the block? You don’t manually call more bouncers; you have a plan. That’s Auto Scaling.

SageMaker uses Application Auto Scaling, which lets you scale based on a target metric. The most sensible one is InvocationsPerInstance (exposed to scaling policies as the predefined metric SageMakerVariantInvocationsPerInstance), which is the average number of requests each instance receives per minute. You’re basically telling SageMaker, “Hey, keep the average number of requests per instance around this value.” It’s far more intuitive than trying to guess CPU usage for a machine learning inference workload.

Here’s how you set it up programmatically. Notice we’re not using the SageMaker SDK here; we’re using the boto3 library because, frankly, this is lower-level infrastructure stuff.

import boto3

client = boto3.client('application-autoscaling')

# Register your endpoint as a scalable target
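# Required format: endpoint/<endpoint-name>/variant/<variant-name>
# 'AllTraffic' is the variant name the SageMaker Python SDK assigns by default
resource_id = 'endpoint/my-cat-hat-endpoint/variant/AllTraffic'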

client.register_scalable_target(
    ServiceNamespace='sagemaker',
    ResourceId=resource_id,
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    MinCapacity=2,  # Can't scale below your HA baseline
    MaxCapacity=10  # Your wallet's panic button
)

# Now, put the scaling policy
client.put_scaling_policy(
    PolicyName='MyScalingPolicy',
    ServiceNamespace='sagemaker',
    ResourceId=resource_id,
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    PolicyType='TargetTrackingScaling',
    TargetTrackingScalingPolicyConfiguration={
        'TargetValue': 1000.0,  # Aim for 1000 invocations per instance, per minute
        'PredefinedMetricSpecification': {
            'PredefinedMetricType': 'SageMakerVariantInvocationsPerInstance'
        },
        'ScaleOutCooldown': 60,  # Wait 60 seconds before scaling out again
        'ScaleInCooldown': 300   # Wait a conservative 300 seconds before scaling in
    }
)

Why those cooldowns? Scale-out cooldown is short (e.g., 60 seconds) because when traffic is spiking, you need capacity now. Scale-in cooldown is longer (e.g., 300 seconds) because you don’t want to rapidly scale in, then immediately scale out again for a temporary traffic dip. It’s expensive and stressful for the system. A longer scale-in cooldown adds stability.
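The other number worth justifying is TargetValue. A common rule of thumb (AWS's load-testing guidance uses a similar formula) is to load test a single instance, find the peak requests per second it sustains at acceptable latency, and then back off by a safety factor. The sketch below encodes that arithmetic; the function name and the 0.5 default are assumptions, and the load-test figure is something you have to measure yourself.

```python
def target_invocations_per_minute(max_rps_per_instance, safety_factor=0.5):
    """Estimate a TargetValue for a per-instance, per-minute invocation metric.

    max_rps_per_instance: peak requests/second one instance handled in a load test.
    safety_factor: fraction of that peak to aim for, leaving headroom for spikes.
    """
    if max_rps_per_instance <= 0:
        raise ValueError("max_rps_per_instance must be positive")
    if not 0 < safety_factor <= 1:
        raise ValueError("safety_factor must be in (0, 1]")
    # The metric is per minute, so convert from per-second to per-minute
    return max_rps_per_instance * safety_factor * 60


# Example: one instance sustained 40 requests/second in a (hypothetical) load test
print(target_invocations_per_minute(40))  # 1200.0
```

A lower safety factor makes the policy scale out earlier, which costs more but buys time while new instances are still starting up.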

The Multi-Model Endpoint (MME) Mind-Bender

Now for the real party trick. What if you have hundreds of cat-hat models, each for a different user or region? Deploying each as a separate endpoint would be financially suicidal. Enter the Multi-Model Endpoint (MME).

An MME is a single endpoint that can load and serve many models, all from a shared underlying infrastructure pool. The models are stored in S3. When the first request for a specific model (e.g., model-user-123.tar.gz) comes in, SageMaker downloads it from S3 to the instance handling the request and loads it into the container’s memory. The next request for that same model on that instance is blazingly fast. When an instance runs low on memory, models that haven’t been used recently are unloaded to make room (LRU cache style).
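The load-on-demand, evict-when-full behavior is easier to reason about with a toy model. The class below is purely illustrative and is not SageMaker code: it evicts by model count, whereas a real MME makes room based on memory pressure, and the names ModelCache and invoke are made up for this sketch.

```python
from collections import OrderedDict


class ModelCache:
    """Toy LRU cache that keeps at most `capacity` models loaded."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._loaded = OrderedDict()  # model name -> stand-in for a loaded model

    def invoke(self, model_name):
        """Return 'warm' if the model was already loaded, otherwise 'cold'."""
        if model_name in self._loaded:
            self._loaded.move_to_end(model_name)  # mark as most recently used
            return "warm"
        if len(self._loaded) >= self.capacity:
            self._loaded.popitem(last=False)  # evict the least recently used model
        self._loaded[model_name] = object()  # stand-in for download + load
        return "cold"

    def loaded_models(self):
        """Return loaded model names, least recently used first."""
        return list(self._loaded)


cache = ModelCache(capacity=2)
for name in ["a.tar.gz", "b.tar.gz", "a.tar.gz", "c.tar.gz", "b.tar.gz"]:
    print(name, cache.invoke(name))
```

In this run the second request for a.tar.gz is warm, while b.tar.gz goes cold again at the end because loading c.tar.gz pushed it out. That is the pattern to watch for when many models compete for the same instance.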

The code to deploy is almost identical, but you point to a directory of models in S3, not a single file.

import sagemaker
from sagemaker.multidatamodel import MultiDataModel

# Define the MME. The 'model' definition here is mostly about the container.
mme = MultiDataModel(
    name='my-multi-model-endpoint',
    model_data_prefix='s3://my-bucket/models/',  # The folder where ALL models live
    model=model,  # The model object from the first code block (defines the container)
    sagemaker_session=sagemaker.Session()
)

# Deploy the endpoint itself
predictor = mme.deploy(
    initial_instance_count=2,
    instance_type='ml.m5.xlarge',
    endpoint_name='my-multi-model-endpoint'
)

# To add a model, copy its archive under the S3 prefix (mme.add_model() can do this for you)
# A request then targets a specific archive by its path relative to the prefix:
# predictor.predict(data, target_model='model-user-123.tar.gz')

The Gotcha: MMEs are brilliant for cost savings when you have many infrequently-used models. But be warned: the first request for a model that isn’t currently loaded pays for downloading it from S3 and loading it into memory (the “cold start” problem). Depending on model size, that can add anywhere from a few hundred milliseconds to several seconds. It’s a trade-off. Use MMEs for workloads where the occasional slow request is acceptable.
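One way to find out how bad the cold start is for your own models is to time the first and second request to the same archive. The sketch below assumes a boto3 sagemaker-runtime client is passed in; the helper name timed_invoke and the payload are illustrative. TargetModel is the invoke_endpoint parameter that selects the archive relative to the S3 prefix.

```python
import time


def timed_invoke(runtime_client, endpoint_name, target_model, payload,
                 content_type="text/csv"):
    """Invoke one model on a multi-model endpoint and return elapsed seconds."""
    start = time.perf_counter()
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        TargetModel=target_model,  # e.g. 'model-user-123.tar.gz'
        ContentType=content_type,
        Body=payload,
    )
    response["Body"].read()  # include time spent receiving the full response
    return time.perf_counter() - start


# Usage (requires AWS credentials and a deployed multi-model endpoint):
#   import boto3
#   runtime = boto3.client("sagemaker-runtime")
#   args = (runtime, "my-multi-model-endpoint", "model-user-123.tar.gz", "0.1,0.2")
#   cold = timed_invoke(*args)  # likely includes download and load time
#   warm = timed_invoke(*args)  # model is probably already in memory
#   print(f"cold={cold:.3f}s warm={warm:.3f}s")
```

With more than one instance behind the endpoint, the second request may land on an instance that has not loaded the model yet, so repeat the measurement a few times before drawing conclusions.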

The Pitfalls They Don’t Tell You About

Cold Starts Aren’t Just for MMEs: Even a standard endpoint has a cold start when it scales out. A new instance takes minutes to provision, initialize the container, and load your model. Your scaling policy needs to be proactive enough to handle rising traffic before your existing instances are overwhelmed.
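The Default CloudWatch Metrics Are Lame: The built-in invocation metrics are reported at 1-minute granularity, and a target-tracking policy waits for several breaching data points before it acts. For rapid scaling, this is an eternity. Note that SageMakerVariantInvocationsPerInstance is the name of the predefined metric your policy already tracks, not something you enable or publish yourself. If you need a faster or more representative signal, publish your own custom CloudWatch metric (custom metrics can be high-resolution, down to 1 second) and point the policy at it with a CustomizedMetricSpecification, or simply lower TargetValue to leave more headroom. How quickly the policy reacts still depends on how its alarms evaluate the metric, so test it under load. It’s more work, but it’s what the pros do.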
You Pay for Everything, Even Idle: An endpoint’s baseline instances cost money every second they’re running, even if they’re handling zero requests. If you have a dev endpoint, delete it when you’re done. For production, if you have predictable periods of zero traffic (e.g., overnight), use scheduled scaling to reduce the baseline count and save a small fortune.
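The scheduled scaling mentioned in the last pitfall can be sketched as a pair of scheduled actions on the same scalable target: one that shrinks the capacity bounds at night and one that restores them in the morning. The helper name, the action names, and the times are hypothetical; the dictionary keys match Application Auto Scaling's put_scheduled_action call, and the cron expressions are evaluated in UTC unless you specify otherwise.

```python
def scheduled_action_request(action_name, resource_id, cron_fields,
                             min_capacity, max_capacity):
    """Build keyword arguments for application-autoscaling put_scheduled_action."""
    if min_capacity > max_capacity:
        raise ValueError("min_capacity cannot exceed max_capacity")
    return {
        "ServiceNamespace": "sagemaker",
        "ScheduledActionName": action_name,
        "ResourceId": resource_id,
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        # Six fields: minutes hours day-of-month month day-of-week year
        "Schedule": f"cron({cron_fields})",
        "ScalableTargetAction": {
            "MinCapacity": min_capacity,
            "MaxCapacity": max_capacity,
        },
    }


resource_id = "endpoint/my-cat-hat-endpoint/variant/AllTraffic"
night = scheduled_action_request("scale-down-night", resource_id, "0 22 * * ? *", 1, 2)
morning = scheduled_action_request("scale-up-morning", resource_id, "0 6 * * ? *", 2, 10)

# Usage (requires AWS credentials and a registered scalable target):
#   import boto3
#   client = boto3.client("application-autoscaling")
#   client.put_scheduled_action(**night)
#   client.put_scheduled_action(**morning)
```

Changing the bounds rather than the instance count leaves the target-tracking policy in charge inside the new range, so an unexpected late-night spike can still scale out up to the reduced maximum.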

The goal is to build a system that feels infinitely scalable to your users but isn’t infinitely expensive for you. It requires tuning, monitoring, and accepting that AWS, for all its power, won’t do all the thinking for you. Now go deploy that masterpiece. The internet’s cats in hats are counting on you.