29.7 Step Functions Distributed Map: Processing Millions of Items in S3
Alright, let’s talk about the Step Functions Distributed Map. You’ve got a mountain of data sitting in S3: millions of JSON files, CSV blobs, you name it. Your job is to process all of it. Your first thought might be to fire up a massive Lambda function that lists all the objects and then processes them in a loop. Don’t. You’ll hit Lambda’s 15-minute execution timeout faster than I hit the snooze button on Monday morning. Even if you could dodge the timeout, you’d be processing one file at a time. That’s like using a toothpick to empty a swimming pool.
The Distributed Map state is your industrial-grade excavator for that swimming pool. It’s designed to fan out and process a massive number of items in parallel, and I mean massive: up to 10,000 child workflow executions running at once, a scale that would make most other services weep. It does this by dynamically launching and managing an entire fleet of child workflows for you. You don’t provision a thing; you just point it at your data and tell it what to do with each item.
How It Actually Works: The Magic of Fan-Out
You start by giving it a CSV or JSON manifest file in S3, or you can just point it at an entire S3 prefix. I prefer the manifest file—it gives you more control and avoids the potential weirdness of listing operations on giant buckets.
Here’s the basic structure of a state machine that uses it. Notice the ItemReader and the ItemProcessor (the field that replaces the older Iterator); this is where the magic is configured.
{
  "Comment": "Process a gazillion S3 items",
  "StartAt": "ProcessAllTheData",
  "States": {
    "ProcessAllTheData": {
      "Type": "Map",
      "ItemReader": {
        "Resource": "arn:aws:states:::s3:listObjectsV2",
        "Parameters": {
          "Bucket": "my-massive-data-bucket",
          "Prefix": "raw-data/2023-10-05/"
        }
      },
      "ItemProcessor": {
        "ProcessorConfig": {
          "Mode": "DISTRIBUTED",
          "ExecutionType": "STANDARD"
        },
        "StartAt": "ProcessOneFile",
        "States": {
          "ProcessOneFile": {
            "Type": "Task",
            "Resource": "arn:aws:states:::lambda:invoke",
            "Parameters": {
              "FunctionName": "arn:aws:lambda:us-east-1:123456789012:function:MyProcessorFunction",
              "Payload": {
                "s3Bucket": "my-massive-data-bucket",
                "s3Key.$": "$.Key"
              }
            },
            "End": true
          }
        }
      },
      "End": true
    }
  }
}
When this state runs, Step Functions doesn’t process the items itself. Instead, it becomes a glorified (and incredibly efficient) orchestrator. It uses the ItemReader to get the list of S3 objects. Then, for each object, it launches a completely separate, short-lived child execution of the workflow you defined under ItemProcessor. Each child execution gets one single item from your list. With an S3 listing, that item is the object’s metadata (Key, Size, LastModified, and so on), so the bucket name has to come from the state definition rather than from the item. This is the core of the distributed model: one execution per item, running in parallel up to the concurrency limit.
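On the receiving end, each child execution invokes the Lambda function with the Payload from the state definition as its event. Here is a minimal sketch of what that handler might look like, assuming the event carries the s3Bucket and s3Key fields used in the example. The "processing" step and the injectable fetch parameter are hypothetical, added only so the handler can be exercised without touching AWS.

```python
import json


def fetch_from_s3(bucket, key):
    """Read one object from S3 (needs boto3 and AWS credentials at call time)."""
    import boto3  # imported lazily so this module loads without boto3 installed

    response = boto3.client("s3").get_object(Bucket=bucket, Key=key)
    return response["Body"].read()


def handler(event, context, fetch=fetch_from_s3):
    """Process exactly one S3 object, as named by the child execution's input."""
    bucket = event["s3Bucket"]
    key = event["s3Key"]
    body = fetch(bucket, key)
    record = json.loads(body)  # assumption: each object is a single JSON document
    # Hypothetical processing step: report how many top-level entries were found.
    return {"key": key, "fieldCount": len(record)}
```

The handler never lists or loops over objects. All of the fan-out lives in the state machine, which keeps the function small and easy to test.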
Taming the Beast: Concurrency and Throttling
This raw power demands respect. If you point this thing at a bucket with 10 million files, its default behavior is to run up to 10,000 child executions at the same time. This will not end well. Ten thousand simultaneous Lambda invocations will slam into service quotas (Lambda’s default account concurrency is commonly 1,000 per Region), and your bank account might spontaneously combust.
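This is where MaxConcurrency is your best friend. It’s a hard limit on the number of child executions running at any one time. It is a field of the Map state itself, sitting next to ItemProcessor rather than inside ProcessorConfig. Always, always set this to a sane number. The fragment below omits the rest of the ItemProcessor, which is unchanged.
"MaxConcurrency": 1000,
"ItemProcessor": {
  "ProcessorConfig": {
    "Mode": "DISTRIBUTED",
    "ExecutionType": "STANDARD"
  }
}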
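To pick a "sane number", it helps to see what a given limit does to total run time. The sketch below is a back-of-the-envelope estimate only: it assumes every item takes the same time and that items run in full waves of MaxConcurrency, and it ignores launch overhead, throttling, and cold starts. The function name and parameters are hypothetical.

```python
import math


def estimate_wall_clock_seconds(item_count, max_concurrency, seconds_per_item):
    """Idealised run time when items are processed in waves of max_concurrency."""
    if max_concurrency <= 0:
        raise ValueError("max_concurrency must be a positive integer")
    waves = math.ceil(item_count / max_concurrency)
    return waves * seconds_per_item


if __name__ == "__main__":
    seconds = estimate_wall_clock_seconds(10_000_000, 1000, 2)
    print(f"{seconds} seconds, about {seconds / 3600:.1f} hours")
```

For 10 million items at a MaxConcurrency of 1000 and 2 seconds per item, this gives 20,000 seconds, or roughly five and a half hours. Treat it as a floor, not a forecast.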
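You can also set ToleratedFailurePercentage or ToleratedFailureCount on the Map state to prevent a few bad apples (failing items) from stopping the entire massive job. By default the tolerance is zero, so a single failed child execution fails the whole Map Run. Setting a threshold is crucial for robustness.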
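To show where these fields live, here is a small sketch that builds the Map state as a Python dict and prints it as JSON. It places MaxConcurrency and ToleratedFailurePercentage at the top level of the Map state, which is where I understand Step Functions expects them. The helper name, the 5 percent threshold, and the default concurrency are illustrative assumptions, not recommendations.

```python
import json


def build_distributed_map_state(bucket, prefix, function_arn,
                                max_concurrency=1000,
                                tolerated_failure_percentage=5):
    """Return a Distributed Map state as a plain dict, ready for json.dumps."""
    return {
        "Type": "Map",
        # Both limits are fields of the Map state, not of ProcessorConfig.
        "MaxConcurrency": max_concurrency,
        "ToleratedFailurePercentage": tolerated_failure_percentage,
        "ItemReader": {
            "Resource": "arn:aws:states:::s3:listObjectsV2",
            "Parameters": {"Bucket": bucket, "Prefix": prefix},
        },
        "ItemProcessor": {
            "ProcessorConfig": {"Mode": "DISTRIBUTED", "ExecutionType": "STANDARD"},
            "StartAt": "ProcessOneFile",
            "States": {
                "ProcessOneFile": {
                    "Type": "Task",
                    "Resource": "arn:aws:states:::lambda:invoke",
                    "Parameters": {
                        "FunctionName": function_arn,
                        "Payload": {"s3Bucket": bucket, "s3Key.$": "$.Key"},
                    },
                    "End": True,
                }
            },
        },
        "End": True,
    }


if __name__ == "__main__":
    state = build_distributed_map_state(
        "my-massive-data-bucket",
        "raw-data/2023-10-05/",
        "arn:aws:lambda:us-east-1:123456789012:function:MyProcessorFunction",
    )
    print(json.dumps(state, indent=2))
```

Generating the definition from code like this also makes it easy to use a low concurrency in a test environment and a higher one in production.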
The Gotchas: Where They Get You
- The Payload Size Limit: This is the big one. The payload passed to each child execution, which includes your S3 item details and any constant values, cannot exceed 256 KiB. If your manifest items are huge (they usually aren’t), you’re hosed. The error is cryptic, too. Your child executions will fail with a States.Runtime error and you’ll be left scratching your head.
- Cold Start Symphony: Each child execution is a separate state machine execution. If your ItemProcessor workflow uses Lambda, every one of those thousands of concurrent executions could face a cold start. For short tasks, your processing latency will be dominated by this. Short of paying for provisioned concurrency, there’s no way around it; it’s the nature of the serverless beast.
- Cost: You pay for every child execution. A map processing 1 million items will generate 1 million child state machine executions, plus the parent. STANDARD children are billed per state transition, so the bill scales with both the item count and the number of states in your child workflow. The cost isn’t exorbitant, but it’s not zero. Do the math before you run it on your production bucket. Use ExecutionType: STANDARD for most things, but if you’re doing a truly astronomical run and need to save a few bucks, ExecutionType: EXPRESS for the child workflows is an option. EXPRESS is billed by request count and duration instead. Just know that EXPRESS executions have a five-minute limit and lack some of the visibility of STANDARD.
- Debugging: How do you find one failed execution among 100,000? You use CloudWatch Logs Insights and a lot of patience. It’s not pretty. The parent execution’s graph inspector becomes a useless sea of green with a few random red dots. The Map Run page in the console at least lets you filter child executions by status, but you still have to drill into each failed child to see what happened.
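"Do the math" can be taken literally. Standard workflows are billed per state transition, so a rough estimate is the item count times the transitions per child times the unit price. The sketch below assumes a price of $0.025 per 1,000 state transitions. That figure is a placeholder assumption, so check the current pricing page for your Region. The transitions-per-child value is likewise something you need to work out for your own workflow. The estimate ignores the parent execution, Lambda, and S3 costs.

```python
def estimate_standard_cost_usd(item_count, transitions_per_child,
                               usd_per_1000_transitions=0.025):
    """Rough Step Functions cost for STANDARD child executions only."""
    total_transitions = item_count * transitions_per_child
    return total_transitions / 1000 * usd_per_1000_transitions


if __name__ == "__main__":
    # Hypothetical job: 1 million items, 2 billable transitions per child.
    cost = estimate_standard_cost_usd(1_000_000, 2)
    print(f"Estimated Step Functions cost: ${cost:,.2f}")
```

With those hypothetical inputs the estimate comes to about $50, which is small next to the Lambda bill for most jobs but worth knowing before you press the button.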
A More Robust Manifest-Based Approach
Because of the payload limit and for more control, I almost always use a manifest. I have a separate Lambda function that generates a manifest file of all the items that need processing and saves it to S3. Then my Distributed Map reads from that.
"ItemReader": {
  "Resource": "arn:aws:states:::s3:getObject",
  "Parameters": {
    "Bucket": "my-manifest-bucket",
    "Key": "job-1234-manifest.json"
  },
  "ReaderConfig": {
    "InputType": "JSON"
  }
}
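And the manifest.json would look like a simple list: a top-level JSON array with one object per item. Only the first two entries are shown here; the real file would hold a few million more. Each array element becomes the input of one child execution, so a Payload can reference $.Bucket and $.Key directly.
[
  {"Bucket": "my-massive-data-bucket", "Key": "raw-data/2023-10-05/file1.json"},
  {"Bucket": "my-massive-data-bucket", "Key": "raw-data/2023-10-05/file2.json"}
]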
This approach is cleaner, avoids S3 listing oddities, and gives you explicit control over exactly what gets processed.
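The manifest-generating step can double as the payload-size check. Below is a minimal sketch, assuming the reader expects a top-level JSON array and that the 256 KiB limit applies to each serialized item. The function name and the choice to reject oversized items outright are hypothetical, and uploading the result to S3 is left out.

```python
import json

MAX_ITEM_BYTES = 256 * 1024  # per-item payload ceiling discussed above


def build_manifest(items):
    """Serialize items as a top-level JSON array, rejecting oversized entries."""
    for index, item in enumerate(items):
        size = len(json.dumps(item).encode("utf-8"))
        if size > MAX_ITEM_BYTES:
            raise ValueError(
                f"item {index} is {size} bytes, over the {MAX_ITEM_BYTES}-byte limit"
            )
    return json.dumps(items)


if __name__ == "__main__":
    keys = ["raw-data/2023-10-05/file1.json", "raw-data/2023-10-05/file2.json"]
    manifest = build_manifest(
        [{"Bucket": "my-massive-data-bucket", "Key": key} for key in keys]
    )
    print(manifest)
```

Failing fast here is much cheaper than discovering an oversized item through a failed child execution halfway through the run.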
The Distributed Map is an incredibly powerful tool. It takes a problem that was previously in “build a custom distributed system” territory and makes it a simple configuration choice. Just respect its power, mind the concurrency, and always, always check your payload sizes. Now go process those millions of files. You’ve got this.