40.6 SageMaker Batch Transform: Offline Inference at Scale

Alright, let’s talk about Batch Transform. You’ve trained a beautiful model, it’s sitting there in its model.tar.gz file like a prized possession. Now you need to run predictions—not on one image or one row of data, but on a terabyte of the stuff. You don’t need a live, always-on endpoint slurping power and money; you have a big pile of data, you want a big pile of predictions. This is Batch Transform’s raison d’être. It’s the workhorse, the quiet, efficient factory that takes in a pallet of raw materials and spits out a pallet of finished goods. No frills, no web server, just pure, unadulterated, offline inference.

Think of it as the antithesis of a persistent endpoint. You’re not paying for an instance to sit around waiting. You’re paying for the precise number of seconds of compute time it takes to churn through your dataset. It spins up the instances, does the job, and then, crucially, spins them down. It’s the embodiment of “turn the lights off when you leave.” For large, periodic inference jobs—overnight processing, end-of-month reports, generating training data for another model—it’s almost always the right and most cost-effective tool.

The Core Components of a Batch Transform Job

You’ll need to provide SageMaker with a few key ingredients. Let’s break them down.

The Model: Batch Transform doesn’t take an S3 path directly; it takes the name of a SageMaker Model entity. That entity points to two things: your model.tar.gz in S3 (the trained model, bundled with your inference script, inference.py, and any dependencies) and the container image that knows how to load and run it.

The Data: Your input data, sitting in S3. SageMaker reads the objects and feeds them to your container over HTTP, so your inference code never has to fetch anything from S3 itself.

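The Output Path: An S3 location where you want the results written. SageMaker does the writing, one output object per input object with .out appended to the name, but it does so under the IAM role attached to your Model, so that role needs write access to the bucket.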

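The Instance Type and Count: Here’s where you can get clever. Specify InstanceCount: 5 and SageMaker will spread the S3 objects under your input prefix across five instances and process them in parallel. The unit of distribution is the object, not the line: a single huge file keeps one instance busy while the other four sit idle, so split your input into many files first. This is the key to scaling horizontally. Choosing the right instance (e.g., GPU vs. CPU, memory size) is crucial for cost and speed.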
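Since parallelism happens per S3 object, it can help to shard one big JSON Lines file into several smaller ones before uploading. Here’s a minimal sketch of that idea using only the standard library. The function name shard_jsonl and the part-00000.jsonl naming scheme are made up for illustration; they aren’t anything SageMaker requires.

```python
import os


def shard_jsonl(src_path, out_dir, num_shards):
    """Split a JSON Lines file into num_shards files, round-robin by record.

    Hypothetical helper: the output file names are an arbitrary choice.
    """
    if num_shards < 1:
        raise ValueError("num_shards must be at least 1")
    os.makedirs(out_dir, exist_ok=True)
    paths = [os.path.join(out_dir, f"part-{i:05d}.jsonl") for i in range(num_shards)]
    outputs = [open(p, "w", encoding="utf-8") for p in paths]
    try:
        record_no = 0
        with open(src_path, encoding="utf-8") as src:
            for line in src:
                if not line.strip():
                    continue  # skip blank lines rather than emit empty records
                if not line.endswith("\n"):
                    line += "\n"  # the last line of a file may lack a newline
                outputs[record_no % num_shards].write(line)
                record_no += 1
    finally:
        for f in outputs:
            f.close()
    return paths


# Example usage (paths are placeholders):
# shard_jsonl("input.jsonl", "shards", num_shards=10)
# then: aws s3 cp shards/ s3://my-bucket/input-data/ --recursive
```

A reasonable starting point is several shards per instance, so that no instance finishes early and sits idle while another grinds through a single oversized file.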

A Realistic Code Example

Let’s say we have a Scikit-Learn model we’ve trained and saved as model.joblib. Here’s how we’d package it up and run a batch job. First, the packaging. You need an inference.py script. This is non-negotiable.

# inference.py
import joblib
import json
import pandas as pd

def model_fn(model_dir):
    """Load the model from the model_dir"""
    model = joblib.load(f"{model_dir}/model.joblib")
    return model

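def input_fn(request_body, request_content_type):
    """Parse a JSON Lines request body (one JSON record per line)"""
    if request_content_type == 'application/json':
        if isinstance(request_body, (bytes, bytearray)):
            request_body = request_body.decode('utf-8')
        # With BatchStrategy='MultiRecord' the body holds several lines,
        # so parse each line separately instead of the body as a whole
        records = [json.loads(line) for line in request_body.splitlines() if line.strip()]
        return pd.DataFrame(records)
    else:
        raise ValueError(f"Unsupported content type: {request_content_type}")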

def predict_fn(input_data, model):
    """Make predictions"""
    return model.predict(input_data)

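def output_fn(prediction, response_content_type):
    """Format the predictions as one JSON value per line, in input order"""
    if response_content_type == 'application/json':
        # One line per input record keeps the output aligned with the input
        return "\n".join(json.dumps(p) for p in prediction.tolist())
    else:
        raise ValueError(f"Unsupported content type: {response_content_type}")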

Now, let’s package this script and our model file into the required .tar.gz format.

# Create a directory for the model artifacts
mkdir model-artifacts
cp inference.py model.joblib model-artifacts/

# Create the model.tar.gz
tar -czvf model.tar.gz -C model-artifacts/ .
aws s3 cp model.tar.gz s3://my-bucket/my-model-path/model.tar.gz
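With the archive in S3, the transform job still needs a SageMaker Model entity that ties the artifact to a container image and an IAM role. Here’s a hedged sketch of the arguments for Boto3’s create_model call. The image URI and role ARN are placeholders you’d replace with your own, and the SAGEMAKER_PROGRAM environment variable is an assumption: many AWS framework containers use it to locate the inference script, but check the documentation for the image you actually use.

```python
def build_create_model_kwargs(model_name, image_uri, model_data_url, role_arn,
                              environment=None):
    """Assemble keyword arguments for sagemaker.create_model().

    Hypothetical helper: it only builds the request, it does not call AWS.
    """
    container = {
        "Image": image_uri,              # inference container image in ECR
        "ModelDataUrl": model_data_url,  # S3 URI of model.tar.gz
    }
    if environment:
        container["Environment"] = dict(environment)
    return {
        "ModelName": model_name,
        "PrimaryContainer": container,
        "ExecutionRoleArn": role_arn,    # role SageMaker assumes to read and write S3
    }


kwargs = build_create_model_kwargs(
    model_name="my-previously-created-sagemaker-model",
    image_uri="<account-id>.dkr.ecr.<region>.amazonaws.com/<inference-image>:<tag>",
    model_data_url="s3://my-bucket/my-model-path/model.tar.gz",
    role_arn="arn:aws:iam::<account-id>:role/<sagemaker-execution-role>",
    # Assumption: the container reads this variable to find the script.
    environment={"SAGEMAKER_PROGRAM": "inference.py"},
)

# To create the Model for real (requires AWS credentials):
# import boto3
# boto3.client("sagemaker").create_model(**kwargs)
```

Keeping the request in a plain dictionary makes it easy to inspect or log before anything is sent to AWS.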

Finally, here’s the Python code using the Boto3 SDK to kick off the transform job.

import boto3

sm_client = boto3.client('sagemaker')

transform_job = sm_client.create_transform_job(
    TransformJobName='my-batch-transform-job-2023-10-27',
    ModelName='my-previously-created-sagemaker-model', # Required: a Model created earlier with create_model
    MaxPayloadInMB=100, # Upper bound on each request payload (default is 6)
    BatchStrategy='MultiRecord', # Pack many lines into each request, up to MaxPayloadInMB
    TransformInput={
        'DataSource': {
            'S3DataSource': {
                'S3DataType': 'S3Prefix',
                'S3Uri': 's3://my-bucket/input-data/'
            }
        },
        'ContentType': 'application/json', # Must match your input_fn!
        'SplitType': 'Line' # Crucial! One JSON object per line.
    },
    TransformOutput={
        'S3OutputPath': 's3://my-bucket/output/',
        'Accept': 'application/json', # Must match your output_fn!
        'AssembleWith': 'Line' # Joins the responses with newlines in each output file
    },
    TransformResources={
        'InstanceType': 'ml.m5.4xlarge',
        'InstanceCount': 2 # Input files are spread across instances
    }
)

print(f"Job started: {transform_job['TransformJobArn']}")
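The create_transform_job call returns as soon as the job is accepted, not when it finishes. Here’s a minimal polling sketch built on describe_transform_job, which reports TransformJobStatus and, on failure, FailureReason. The function name wait_for_transform_job and the 30-second interval are arbitrary choices for illustration; the client is passed in so the function can be exercised without AWS credentials.

```python
import time

# Statuses after which the job will not change any further.
TERMINAL_STATES = {"Completed", "Failed", "Stopped"}


def wait_for_transform_job(sm_client, job_name, poll_seconds=30, sleep=time.sleep):
    """Poll a transform job until it reaches a terminal state.

    Returns the final describe_transform_job response on success and
    raises RuntimeError if the job ends as Failed or Stopped.
    """
    while True:
        desc = sm_client.describe_transform_job(TransformJobName=job_name)
        status = desc["TransformJobStatus"]
        if status in TERMINAL_STATES:
            if status != "Completed":
                reason = desc.get("FailureReason", "no reason given")
                raise RuntimeError(f"Transform job {job_name} ended as {status}: {reason}")
            return desc
        sleep(poll_seconds)


# Example usage (requires AWS credentials):
# import boto3
# sm_client = boto3.client("sagemaker")
# wait_for_transform_job(sm_client, "my-batch-transform-job-2023-10-27")
```

Boto3 also ships a built-in waiter for this, sm_client.get_waiter('transform_job_completed_or_stopped'), which is fine when you don’t need the failure reason surfaced in your own exception.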

The Devil’s in the Details: Pitfalls and Best Practices

SplitType: 'Line' is Your Best Friend: Your input data in S3 should be a file (or files) where each line is a self-contained JSON object (JSON Lines format). This lets SageMaker cut each file on line boundaries into records and pack those records into requests. Leave SplitType at its default of None and each whole file is sent to your container as a single request, which either blows past MaxPayloadInMB or hands your input_fn something it can’t parse. It’s the most common “why is this broken?” issue.
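MaxPayloadInMB: This is a sneaky one. It’s the maximum size of the payload SageMaker sends to your container in a single request, not the size of the response. The default is 6 MB, and MaxPayloadInMB multiplied by MaxConcurrentTransforms can’t exceed 100 MB. If a single record (one large image, one massive row) is bigger than the limit, your job will fail, so raise the value until your largest record fits. Keep in mind that with MultiRecord SageMaker packs as many records as fit under this limit into each request, so a bigger limit also means bigger batches landing in your container’s memory.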
BatchStrategy - MultiRecord vs. SingleRecord: This controls how many records go into each request that reaches your input_fn. SingleRecord sends one line at a time. MultiRecord sends a batch of lines, as many as fit under MaxPayloadInMB. MultiRecord is usually what you want for efficiency, as it reduces per-request overhead and lets the model predict on many rows at once. The catch: your input_fn has to handle a string containing multiple JSON lines, so a bare json.loads on the whole body won’t cut it.
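Monitoring and Debugging: Don’t just fire and forget. Use CloudWatch. The instance metrics under the /aws/sagemaker/TransformJobs namespace (CPUUtilization, MemoryUtilization, and GPUUtilization on GPU instances) will tell you if your instances are under- or over-provisioned. If your job fails, start with the FailureReason returned by describe_transform_job, then check the logs in CloudWatch Logs under /aws/sagemaker/TransformJobs. The error messages are often… enlightening.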
The Model vs. The Model Artifacts: Notice in the code we used ModelName. This refers to a SageMaker Model entity you created previously (with create_model), which points to your S3 model.tar.gz and specifies the container to use. create_transform_job has no parameter for a raw S3 artifact path or a container image; those live on the Model. The payoff is that the same named Model can back a batch job today and an endpoint or a pipeline step tomorrow. It’s a bit of SageMaker indirection that ultimately makes sense.
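Most of the pitfalls above can be caught before you pay for a single instance-second. Here’s a rough pre-flight check for a local JSON Lines file: it confirms every line parses on its own and reports the largest record, so you can see whether it fits under your MaxPayloadInMB. The function name check_jsonl is made up for illustration, and treating 1 MB as 1024 * 1024 bytes is an assumption on my part.

```python
import json


def check_jsonl(path, max_payload_mb=6):
    """Validate a JSON Lines file before using it as Batch Transform input.

    Returns (problems, largest_record_bytes), where problems is a list of
    (line_number, message) tuples. Hypothetical helper for local checks.
    """
    limit = max_payload_mb * 1024 * 1024  # assumption: 1 MB = 1024 * 1024 bytes
    problems = []
    largest = 0
    with open(path, "rb") as f:
        for line_no, raw in enumerate(f, start=1):
            record = raw.rstrip(b"\r\n")
            if not record.strip():
                continue  # ignore blank lines
            largest = max(largest, len(record))
            if len(record) > limit:
                problems.append((line_no, f"record exceeds {max_payload_mb} MB"))
            try:
                json.loads(record)
            except ValueError as exc:  # covers JSON and UTF-8 decoding errors
                problems.append((line_no, f"not valid JSON: {exc}"))
    return problems, largest


# Example usage (the path is a placeholder):
# problems, largest = check_jsonl("input.jsonl", max_payload_mb=6)
# for line_no, message in problems:
#     print(f"line {line_no}: {message}")
```

A pretty-printed JSON file, where one object spans several lines, shows up here as a run of parse errors, which is exactly the failure you would otherwise discover in the job logs.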