30.2 SQS Visibility Timeout, Dead-Letter Queues, and Long Polling

Right, let’s talk about the part of SQS where the rubber meets the road and, occasionally, catches fire. You’ve got messages flowing into your queue. Great. But consuming them reliably is where most of the “fun” begins. It’s not just about grabbing a message; it’s about the delicate dance of acknowledging you’ve handled it, and what happens when you, frankly, screw it up.

The Visibility Timeout: Your “Do Not Disturb” Sign

When your consumer pulls a message from an SQS queue, that message doesn’t just vanish into the ether. Why? Because SQS assumes you might fail. Your EC2 instance might get terminated mid-processing, your Lambda might time out, your code might throw a NullPointerException because of that one guy on your team who refuses to use optional chaining.

To prevent other consumers from grabbing the same message and processing it again while the first one is still working on it, SQS slaps an invisibility cloak on it. This is the Visibility Timeout. It’s a period (from 0 seconds to 12 hours) during which SQS makes that message invisible to all other consumers.

Think of it as checking a book out of the library. Other people can’t read it until you return it (delete it) or your loan period expires (the visibility timeout).

You set this at the queue level, but the real power is that you can change it on the fly for a specific message. This is your escape hatch.

import boto3

sqs = boto3.client('sqs')
queue_url = 'https://queue-url'  # placeholder; use your real queue URL


def process_monolithic_message(message):
    """Placeholder for your real (slow) processing logic."""
    print(message['Body'])


# Receive a message
response = sqs.receive_message(
    QueueUrl=queue_url,
    MaxNumberOfMessages=1,
    VisibilityTimeout=30  # Start with a 30-second timeout
)

# 'Messages' is absent when nothing was received, so default to an empty list.
for message in response.get('Messages', []):
    receipt_handle = message['ReceiptHandle']

    # Uh oh, this is a big job. Need more time!
    # Instead of letting it time out and cause a duplicate, extend the timeout
    # before the original 30 seconds run out.
    # Tell SQS, "Hey, I'm still working, keep this hidden for 60 seconds from now!"
    sqs.change_message_visibility(
        QueueUrl=queue_url,
        ReceiptHandle=receipt_handle,
        VisibilityTimeout=60
    )

    process_monolithic_message(message)

    # Delete only after processing succeeded; otherwise the message comes back.
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=receipt_handle)
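If you can’t predict up front how long a job will take, one common pattern is a heartbeat: a background thread keeps pushing the timeout out while the work is still running. Here is a minimal sketch of that idea, assuming `sqs` is a boto3 SQS client (or anything with the same `change_message_visibility` and `delete_message` methods). The names `process_with_heartbeat` and `handler`, and the default intervals, are made up for illustration.

```python
import threading


def process_with_heartbeat(sqs, queue_url, message, handler,
                           extend_to_s=60, interval_s=20):
    """Run handler(message), re-extending the visibility timeout while it runs.

    Hypothetical helper: `sqs` is assumed to be boto3.client('sqs').
    interval_s must be shorter than both the initial timeout and extend_to_s.
    """
    receipt_handle = message['ReceiptHandle']
    stop = threading.Event()

    def heartbeat():
        # Event.wait() returns False on timeout and True once stop is set,
        # so this loop fires every interval_s seconds until processing ends.
        while not stop.wait(interval_s):
            sqs.change_message_visibility(
                QueueUrl=queue_url,
                ReceiptHandle=receipt_handle,
                VisibilityTimeout=extend_to_s,  # counted from now, not added on
            )

    worker = threading.Thread(target=heartbeat, daemon=True)
    worker.start()
    try:
        handler(message)
        # Delete only after the handler returns without raising.
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=receipt_handle)
    finally:
        stop.set()
        worker.join()
```

If the handler raises, the message is not deleted and the heartbeat stops, so the message becomes visible again once the last extension runs out.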

The biggest pitfall? Setting your visibility timeout too short. If your consumer takes 45 seconds to process a message but your timeout is 30 seconds, SQS will proudly re-release that message to another consumer, who will start working on it. Now you’ve got two workers doing the same job. At best, you waste money. At worst, you corrupt data. Measure your 99th percentile processing time and add a healthy buffer.
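To make “p99 plus a healthy buffer” concrete, here is a rough sketch that turns a list of observed processing times into a timeout value. The function name, the 1.5x buffer, and the nearest-rank percentile method are all assumptions chosen for illustration; the only hard number is the 12-hour (43,200-second) ceiling mentioned above.

```python
import math


def recommended_visibility_timeout(durations_s, buffer_factor=1.5,
                                   max_timeout_s=43200):
    """Suggest a visibility timeout in whole seconds from observed durations.

    Hypothetical helper: buffer_factor is a judgment call, not an SQS setting.
    """
    if not durations_s:
        raise ValueError("need at least one observed duration")
    ordered = sorted(durations_s)
    # Nearest-rank p99: ceil(0.99 * n), done in integer math to avoid
    # floating-point surprises.
    rank = (99 * len(ordered) + 99) // 100
    p99 = ordered[rank - 1]
    # Round up, then respect the SQS maximum of 12 hours.
    return min(math.ceil(p99 * buffer_factor), max_timeout_s)


# Example: 100 jobs that took 1, 2, ..., 100 seconds.
print(recommended_visibility_timeout(list(range(1, 101))))  # p99 is 99 s -> 149
```

Feed it real measurements from your own metrics, and rerun it whenever your workload changes shape.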

Dead-Letter Queues: The Message Hospice

Sometimes, a message is just cursed. Maybe it’s malformed, or it references a database record that no longer exists. No matter how many times you try, processing it fails. Without intervention, this message would keep bouncing between your consumers until the queue’s retention period runs out (4 days by default, up to 14), wasting resources and cluttering your metrics.

Enter the Dead-Letter Queue (DLQ). A DLQ is just a totally normal SQS queue that you designate to act as a hospice for these terminally ill messages. SQS has no idea whether your processing failed; all it can count is how many times a message was received without being deleted. So you define a redrive policy on your main queue (the source queue) that says: “If a message has been received more than maxReceiveCount times and still hasn’t been deleted, please move it to this other queue.”

This is arguably one of the best sanity-preserving features in AWS. It isolates the bad messages so your main queue can keep humming along. You can then have a separate, low-bandwidth process to inspect the DLQ, figure out why the messages are failing, and fix the issue at the source.

# This is infrastructure code (a CloudFormation template fragment that lives
# under the Resources: section), not runtime code.
# You're defining the queue properties.

SourceQueue:
  Type: AWS::SQS::Queue
  Properties:
    QueueName: MySourceQueue
    RedrivePolicy:
      deadLetterTargetArn: !GetAtt MyDLQ.Arn
      maxReceiveCount: 3 # After 3 tries, send it to the hospice.

MyDLQ:
  Type: AWS::SQS::Queue
  Properties:
    QueueName: MyDLQ

The best practice here is simple: ALWAYS USE A DLQ. No excuses. Set the maxReceiveCount to 3 or 5. It’s your canary that tells you when something is deeply wrong with your message producer or your consumer logic.
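The separate inspection process can start out very small. Below is a hedged sketch of a DLQ “peek” that reads messages without deleting them and reports how often each one has been received. It assumes `sqs` is a boto3 SQS client and `dlq_url` is the URL of your DLQ; the function name `summarize_dlq` and the summary fields are invented for this example.

```python
def summarize_dlq(sqs, dlq_url, max_batches=5):
    """Peek at DLQ messages (no deletes) and return {message_id: summary}.

    Hypothetical helper: `sqs` is assumed to be boto3.client('sqs').
    """
    summaries = {}
    for _ in range(max_batches):
        response = sqs.receive_message(
            QueueUrl=dlq_url,
            MaxNumberOfMessages=10,
            WaitTimeSeconds=1,        # brief long poll; we are only peeking
            VisibilityTimeout=60,     # hide what we have seen while we loop
            AttributeNames=['ApproximateReceiveCount'],
        )
        messages = response.get('Messages', [])
        if not messages:
            break  # nothing (more) to look at
        for msg in messages:
            attrs = msg.get('Attributes', {})
            summaries[msg['MessageId']] = {
                # SQS returns attribute values as strings.
                'receive_count': int(attrs.get('ApproximateReceiveCount', 0)),
                'body_preview': msg['Body'][:80],
            }
    return summaries
```

Because nothing is deleted, the messages reappear in the DLQ once the 60-second visibility timeout passes, ready for a proper fix or a redrive.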

Long Polling: Stop Burning Money and CPU

By default, when you ask SQS for messages (ReceiveMessage call), it performs a short poll. It checks a subset of its servers (not all of them) and returns immediately, even if the queue is empty. This is why you sometimes get empty responses. If you’re constantly polling an empty queue, you’re making countless API calls for no reason, burning CPU cycles, and driving up your AWS bill for the privilege of getting nothing done.

Long polling fixes this idiocy. You set a WaitTimeSeconds parameter of up to 20 seconds. When you do this, you’re telling SQS: “Hey, I’m willing to wait. If there are no messages right now, don’t immediately come back empty-handed. Hold the connection open for up to 20 seconds and return as soon as a message arrives, or come back empty only once the wait is up.” Long polling also queries all of the servers rather than a subset, which is why it cuts down on false empty responses.

This does two brilliant things:

It drastically reduces the number of empty ReceiveMessage calls you make, saving you money.
It reduces latency. The message is delivered to your consumer almost immediately after it’s sent, instead of having to wait for your next short poll.

There is almost no downside, so make long polling your default. It’s one of the easiest performance optimizations you’ll ever make.

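# The right way to ask for messages. Be patient.
# (Reuses the sqs client and queue_url from the earlier example.)

response = sqs.receive_message(
    QueueUrl=queue_url,
    MaxNumberOfMessages=10, # Get up to 10 at a time for efficiency
    WaitTimeSeconds=20,     # IMPORTANT: Enable long polling
    VisibilityTimeout=30    # Set your sensible visibility timeout
)

# The call above blocks for up to 20 sec and returns as soon as messages arrive.
# 'Messages' is missing from the response when the wait ends with nothing to deliver.
for msg in response.get('Messages', []):
    print(msg['Body'])  # do your real work here
    # Delete only after the work succeeds, or the message will come back.
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg['ReceiptHandle'])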

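The only “gotcha” is that your application needs to be built to hold a connection open for this period. For most modern applications, this is a non-issue. If you call ReceiveMessage yourself from inside a Lambda function, make sure the function timeout is longer than your WaitTimeSeconds. If you wire the queue to Lambda through an event source mapping instead, Lambda does the polling for you and this doesn’t come up.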
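Putting the three ideas together, a consumer loop might look roughly like the sketch below: long-poll for a batch, delete what succeeds, and leave failures alone so the visibility timeout and the DLQ can do their jobs. It assumes `sqs` is a boto3 SQS client; `drain_queue`, `handler`, and `max_empty_polls` are hypothetical names, and a real worker would usually loop forever rather than stop after an empty poll.

```python
def drain_queue(sqs, queue_url, handler, max_empty_polls=1):
    """Long-poll a queue, process messages, and return how many succeeded.

    Hypothetical helper: `sqs` is assumed to be boto3.client('sqs').
    Stops after max_empty_polls consecutive empty responses.
    """
    processed = 0
    empty_polls = 0
    while empty_polls < max_empty_polls:
        response = sqs.receive_message(
            QueueUrl=queue_url,
            MaxNumberOfMessages=10,
            WaitTimeSeconds=20,  # long polling
        )
        messages = response.get('Messages', [])
        if not messages:
            empty_polls += 1
            continue
        empty_polls = 0
        for msg in messages:
            try:
                handler(msg)
            except Exception:
                # Leave the message alone: it reappears after the visibility
                # timeout and lands in the DLQ once maxReceiveCount is exceeded.
                continue
            sqs.delete_message(QueueUrl=queue_url,
                               ReceiptHandle=msg['ReceiptHandle'])
            processed += 1
    return processed
```

The key design choice is that a failed message is never deleted and never retried inline; the queue itself handles the retry schedule.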