30.8 SQS Lambda Triggers: Batch Size, Parallelization, and Error Handling

Alright, let’s talk about one of the most powerful yet misunderstood features in the AWS event-driven toolkit: triggering a Lambda function from an SQS queue. This isn’t your granddad’s HTTP endpoint; it’s a workhorse designed for high-throughput, asynchronous processing. But to use it effectively, you need to understand the knobs and levers. AWS gives you a few, and they matter. A lot.

The Almighty Batch Size and How It Controls Your Wallet

When you hook a Lambda function to an SQS queue, the Lambda service doesn’t just grab one message at a time. That would be pathologically inefficient and, frankly, a bit silly. Instead, it performs a ReceiveMessage call on your behalf, asking for up to a certain number of messages. That “up to” number is your batch size.

Think of it as your grocery bag. You can set the batch size to 10, meaning you’re willing to carry a bag with 10 items (messages). But if the store (SQS) only has 3 items ready, you’re getting a bag with 3. The batch size is a ceiling, not a guarantee: a batch can contain as few as 1 message.

Why would you ever set this to 1? Usually you wouldn’t, unless each message is slow or expensive enough that you want it isolated. The whole point of batching is to amortize per-invocation overhead across multiple messages: processing 10 messages in one Lambda invocation is almost always cheaper and faster than processing them in 10 separate invocations. The sweet spot is usually between 5 and 10 for most workloads (10 is the default). If your messages are tiny and processing is quick, standard queues let you go as high as 10,000 (FIFO queues cap at 10), but anything above 10 also requires a batching window (MaximumBatchingWindowInSeconds) of at least 1 second, and the whole batch still has to fit in Lambda’s 6 MB invocation payload limit.
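As a rough back-of-the-envelope sketch of that amortization argument, here is how the invocation count for a backlog shrinks as batch size grows. The helper name is made up for illustration, and it assumes every batch arrives full, which real polling does not guarantee, so treat the result as a lower bound.

```python
import math


def invocations_needed(message_count: int, batch_size: int) -> int:
    """Lower bound on invocations, assuming every batch arrives full."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return math.ceil(message_count / batch_size)


# Compare a few batch sizes against a one-million-message backlog.
for size in (1, 10, 100):
    print(size, invocations_needed(1_000_000, size))
# 1 1000000
# 10 100000
# 100 10000
```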

Here’s the CloudFormation for a trigger. Notice the BatchSize property.

MyLambdaEventSourceMapping:
  Type: AWS::Lambda::EventSourceMapping
  Properties:
    BatchSize: 10
    EventSourceArn: !GetAtt MySQSQueue.Arn
    FunctionName: !Ref MyLambdaFunction

Your Function’s New (Batch) Shape

This changes the contract of your Lambda function. It’s no longer called with a single message; it’s called with an event whose Records list holds the whole batch. Your handler needs to expect that list of SQS message objects and loop over it.
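To make that shape concrete, here is a trimmed-down sketch of what such an event looks like as a Python dict. The field names follow the SQS event format, but every value (IDs, ARN, body) is invented for illustration, and the `message_bodies` helper is hypothetical.

```python
import json

# Trimmed-down shape of the event Lambda passes to an SQS-triggered function.
# Field names follow the SQS event format; every value here is made up.
sample_event = {
    "Records": [
        {
            "messageId": "11111111-1111-1111-1111-111111111111",
            "receiptHandle": "example-receipt-handle",
            "body": json.dumps({"order_id": 42}),  # the body is always a string
            "attributes": {"ApproximateReceiveCount": "1"},
            "messageAttributes": {},
            "eventSource": "aws:sqs",
            "eventSourceARN": "arn:aws:sqs:us-east-1:123456789012:my-queue",
            "awsRegion": "us-east-1",
        }
    ]
}


def message_bodies(event: dict) -> list:
    """Return the raw body string of every record in the batch."""
    return [record["body"] for record in event["Records"]]


print(message_bodies(sample_event))  # ['{"order_id": 42}']
```

A dict like this is also handy for exercising a handler locally before wiring it to a real queue.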

Here’s what your Python handler should look like. Notice how it loops through event['Records'].

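import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)

def lambda_handler(event, context):
    for record in event['Records']:
        # Get the message body (always a string; parse it yourself if it's JSON)
        message_body = record['body']
        # Do your actual processing work here
        try:
            process_message(message_body)  # your own processing function
        except Exception:
            # We'll talk about what to do here in a second...
            logger.exception("Failed to process message %s", record['messageId'])
            # Re-raise so Lambda marks the invocation (and the whole batch) as failed
            raise

    # If we get here, we successfully processed all messages in the batch!
    # The return value is ignored unless partial batch responses are enabled.
    return {'batchItemFailures': []}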

The Brutal Reality of Batch Failure

This is where most people get it wrong, so pay attention. By default, if any single message in your batch causes your function to throw an error, the entire batch is considered a failure. Lambda doesn’t delete any of the messages, so every message in the batch becomes visible in the queue again once the visibility timeout expires.

This is catastrophically stupid for most use cases. Let’s say you have a batch of 10 messages. Message #7 is malformed and throws an exception. Messages #1-6 were perfectly fine and already processed, and #8-10 never even got a turn, but all 10 are going to be redelivered. Your function will process messages #1-6 again, likely causing duplicates in your system unless your processing is idempotent. This is a fantastic way to waste money and create data integrity nightmares.

AWS finally, mercifully, added a solution: Report Batch Item Failures. This feature lets your function tell Lambda exactly which messages failed, so only those specific messages are retried. It’s a game-changer.

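To use it, you need two things. First, enable it on the event source mapping by adding ReportBatchItemFailures to its FunctionResponseTypes (in CloudFormation, FunctionResponseTypes: [ReportBatchItemFailures]); without that, Lambda ignores whatever your function returns. Second, your function needs to return a specific JSON structure listing the message IDs that failed. Crucially, your function must still exit successfully (i.e., not throw an exception), because an unhandled exception still fails the entire batch. You have to catch your errors and handle them gracefully.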
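Partial batch responses are opted into on the event source mapping through the ReportBatchItemFailures function response type. Here is a minimal sketch of flipping that switch on an existing mapping with boto3’s update_event_source_mapping. The helper name and the UUID are placeholders, and the client is passed in as an argument so the function can be exercised with a stub instead of real AWS credentials.

```python
def enable_partial_batch_response(lambda_client, mapping_uuid: str) -> dict:
    """Opt an existing event source mapping into partial batch responses.

    lambda_client is expected to be boto3.client("lambda"); it is injected
    so this helper can be tested without touching AWS.
    """
    return lambda_client.update_event_source_mapping(
        UUID=mapping_uuid,
        FunctionResponseTypes=["ReportBatchItemFailures"],
    )


# Example usage (assumes boto3 is installed and credentials are configured):
#   import boto3
#   enable_partial_batch_response(boto3.client("lambda"), "your-mapping-uuid")
```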

Here’s the updated, much smarter handler:

import logging

logger = logging.getLogger()

def lambda_handler(event, context):
    batch_failures = []
    for record in event['Records']:
        try:
            process_message(record['body'])
        except Exception:
            # Log it, then mark this specific message as failed
            logger.exception("Failed to process message %s", record['messageId'])
            batch_failures.append({"itemIdentifier": record['messageId']})

    # If we have any failures, we return them.
    # If the list is empty, Lambda knows the whole batch was a success.
    return {
        'batchItemFailures': batch_failures
    }
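To see the partial-failure contract in action without deploying anything, here is a self-contained sketch you can run locally. The process_message implementation (rejecting anything that isn’t a JSON object) and the message IDs are assumptions made up for the demo; swap in your own logic.

```python
import json


def process_message(body: str) -> None:
    """Hypothetical processor: rejects anything that is not a JSON object."""
    payload = json.loads(body)  # raises ValueError (JSONDecodeError) on bad input
    if not isinstance(payload, dict):
        raise ValueError("expected a JSON object")


def lambda_handler(event, context):
    failures = []
    for record in event["Records"]:
        try:
            process_message(record["body"])
        except Exception:
            # Report only this message; the rest of the batch still succeeds.
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}


# A fake three-message batch with one malformed body in the middle.
fake_event = {
    "Records": [
        {"messageId": "msg-1", "body": '{"ok": true}'},
        {"messageId": "msg-2", "body": "not json"},
        {"messageId": "msg-3", "body": '{"ok": true}'},
    ]
}

result = lambda_handler(fake_event, None)
print(result)  # {'batchItemFailures': [{'itemIdentifier': 'msg-2'}]}
```

Only msg-2 is reported, so only msg-2 would come back for a retry; msg-1 and msg-3 would be deleted from the queue.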

Concurrency and the Beautiful Mess of Parallelism

You’ve got a queue with a million messages. How fast will they be processed? That mostly comes down to three things: your batch size, how long each batch takes to process, and how many concurrent invocations Lambda is allowed to run for that function.

The Lambda service will automatically poll your SQS queue and invoke your function with batches of messages. It will keep doing this, increasing the number of concurrent invocations, until one of two things happens:

The queue is empty.
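It hits a concurrency ceiling: the function’s reserved concurrency, a maximum concurrency set on the event source mapping, or your account’s concurrency limit.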

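This is fantastic. It means your system can automatically scale to chew through a backlog. Need to process faster? Raise the concurrency ceiling (and your wallet’s tolerance). Need to throttle processing to not overwhelm a downstream API? Lower it. One caveat: when reserved concurrency is the limiter, Lambda’s pollers keep pulling batches, and the throttled ones go back to the queue with their receive count bumped, so a long throttle can push perfectly healthy messages into your dead-letter queue. Setting a maximum concurrency on the event source mapping avoids that, because the pollers themselves slow down. Either way, you have a simple, powerful dial to control throughput.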
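Besides reserved concurrency, SQS event source mappings expose their own cap, MaximumConcurrency, which limits how many concurrent invocations this one queue can drive. Here is a hedged sketch of setting it with boto3’s update_event_source_mapping. The helper name and UUID are placeholders, and the 2 to 1,000 range check reflects the documented limits at the time of writing, so verify it against the current docs.

```python
def cap_sqs_concurrency(lambda_client, mapping_uuid: str, max_concurrency: int) -> dict:
    """Limit how many concurrent invocations one SQS mapping may drive.

    lambda_client is expected to be boto3.client("lambda"); it is injected
    so this helper can be tested with a stub.
    """
    # Assumed valid range for MaximumConcurrency; check current AWS docs.
    if not 2 <= max_concurrency <= 1000:
        raise ValueError("max_concurrency must be between 2 and 1000")
    return lambda_client.update_event_source_mapping(
        UUID=mapping_uuid,
        ScalingConfig={"MaximumConcurrency": max_concurrency},
    )


# Example usage (assumes boto3 is installed and credentials are configured):
#   import boto3
#   cap_sqs_concurrency(boto3.client("lambda"), "your-mapping-uuid", 20)
```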

The Pitfalls They Don’t Tell You About

Poison Pills: A message that always fails. Without a dead-letter queue (DLQ) configured on your SQS queue, that message will be retried over and over until the queue’s message retention period runs out (up to 14 days), wasting cycles and, on a FIFO queue, holding up every message behind it in the same message group. Always set up a DLQ and a sensible maxReceiveCount (like 5) in the redrive policy on your source queue.
Visibility Timeout: This one bites people. The queue’s visibility timeout must be at least as long as your Lambda function’s timeout. If it’s not, a message could become visible again in the queue while your function is still trying to process it, leading to a duplicate delivery. AWS recommends making the visibility timeout at least 6 times your function timeout, plus the batching window if you use one. It looks like a weird rule, but the extra headroom gives Lambda time to retry a batch whose invocation was throttled before the messages reappear.
Partial Success is Your Job: The Report Batch Item Failures feature is powerful, but you have to implement it. The naive “throw an error on failure” behavior is a trap. Don’t fall into it.
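The first two pitfalls come down to a couple of queue attributes. Here is a minimal sketch that builds them in the shape SQS expects (attribute values are strings, and the redrive policy is a JSON string). The helper name, the example ARN, and the defaults are assumptions for illustration; the 6x multiplier is the rule of thumb from the list above.

```python
import json


def queue_attributes(function_timeout_s: int, dlq_arn: str,
                     max_receive_count: int = 5,
                     batching_window_s: int = 0) -> dict:
    """Build SQS attributes for a source queue consumed by Lambda."""
    # Rule of thumb: 6x the function timeout, plus any batching window.
    visibility_timeout = 6 * function_timeout_s + batching_window_s
    return {
        "VisibilityTimeout": str(visibility_timeout),
        "RedrivePolicy": json.dumps({
            "deadLetterTargetArn": dlq_arn,
            "maxReceiveCount": max_receive_count,
        }),
    }


attrs = queue_attributes(30, "arn:aws:sqs:us-east-1:123456789012:my-dlq")
print(attrs["VisibilityTimeout"])  # 180

# Applying them (assumes boto3, credentials, and a real queue URL):
#   import boto3
#   boto3.client("sqs").set_queue_attributes(QueueUrl=queue_url, Attributes=attrs)
```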