29.5 Error Handling: Retry and Catch

Right, so you’ve built this beautiful, elegant state machine. It’s a masterpiece of logic, a symphony of Task states. And then you deploy it. The real world hits. An API times out. A Lambda throttles. A third-party service returns {"status": "¯\_(ツ)_/¯"}. Your perfect workflow grinds to a halt. This is where we move from drawing pretty graphs to engineering resilient systems. Error handling isn’t an add-on; it’s the feature.

Step Functions gives you two primary, brilliantly straightforward tools for this: Retry and Catch. They are the yin and yang of not having your workflow explode.

The Art of the Retry

A Retry policy is your first line of defense. It’s the state machine saying, “I’m sure it’s just a hiccup.” You attach a Retry policy directly to a state (like a Task state) to tell the workflow “if this specific type of error happens, try again, and here’s how.”

Let’s look at a policy. This one is for a Lambda function that might occasionally be throttled.

"CheckInventory": {
  "Type": "Task",
  "Resource": "arn:aws:lambda:us-east-1:123456789012:function:check-inventory",
  "Retry": [
    {
      "ErrorEquals": [
        "Lambda.ServiceException",
        "Lambda.AWSLambdaException",
        "Lambda.TooManyRequestsException"
      ],
      "IntervalSeconds": 1,
      "BackoffRate": 2,
      "MaxAttempts": 3
    }
  ],
  "Next": "ProcessResults"
}

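Let’s break this down, because leaning on the defaults without knowing them will burn you. (Omit the optional fields and you get IntervalSeconds 1, BackoffRate 2.0, and MaxAttempts 3.)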

ErrorEquals: This is crucial. You’re listing the exact error names you want to retry. States.ALL is a thing, but using it in a Retry is like catching a falling anvil with a butterfly net: it’s the wrong tool and will make a mess. Be specific. Lambda.ServiceException and Lambda.AWSLambdaException cover transient failures on the Lambda service side; throttling shows up as Lambda.TooManyRequestsException, so list that as well if you want throttled invocations retried. You can also match your own custom errors thrown by your Lambda (e.g., "InventoryNotFound").
IntervalSeconds: The first delay before the next retry. 1 second is a sane starting point.
BackoffRate: This is the multiplier. This is where the magic (and common pitfalls) happen. With a rate of 2, your retries will wait 1s, then 2s, then 4s. This is called exponential backoff, and it’s vital for not overwhelming a struggling service. The pitfall? People set a high MaxAttempts with a high BackoffRate and don’t realize their workflow will be stuck retrying for days. Do the math: 1 + 2 + 4 + 8 + 16 + 32 + … it adds up fast.
MaxAttempts: This counts retries, not total attempts. So "MaxAttempts": 3 means one original try and up to three retries, four invocations in all. This trips everyone up. Set it to 0 if you want no retries for the matched errors, which is sometimes the right call.
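To actually do the math, here is a minimal Python sketch of the backoff schedule. It assumes MaxAttempts counts retries and that each wait is the previous one multiplied by BackoffRate; the function names retry_delays and total_wait are made up for illustration. It ignores the time each attempt itself takes, and any delay cap or jitter you may have configured.

```python
def retry_delays(interval_seconds=1, backoff_rate=2.0, max_attempts=3):
    """Return the wait in seconds before each retry, in order.

    Assumption: max_attempts is the number of retries, and each wait is
    the previous wait multiplied by backoff_rate.
    """
    delays = []
    delay = interval_seconds
    for _ in range(max_attempts):
        delays.append(delay)
        delay = delay * backoff_rate
    return delays


def total_wait(interval_seconds=1, backoff_rate=2.0, max_attempts=3):
    """Total time spent waiting between attempts, in seconds."""
    return sum(retry_delays(interval_seconds, backoff_rate, max_attempts))


print(retry_delays(1, 2, 3))           # [1, 2, 4]
print(total_wait(1, 2, 3))             # 7
print(total_wait(1, 2, 18) / 86400)    # waiting time in days for 18 retries
```

With a 1 second interval and a rate of 2, eighteen retries already add up to 262,143 seconds of waiting, which is a little over three days.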

The most common mistake I see? Developers set a Retry on States.ALL with a huge MaxAttempts. Congratulations, you’ve just built a workflow that will spin its wheels for hours on a fundamental logic error (like a bad input payload) that will never succeed. Always, always be specific in the ErrorEquals of a Retry.

The Grace of the Catch

When retries are exhausted (or immediately for errors you don’t want to retry), you use a Catch. A Catch is a non-linear jump. It’s the state machine saying, “Okay, that didn’t work. Let’s change the plan entirely.” You route the error to a different state to handle the failure gracefully.

"ProcessPayment": {
  "Type": "Task",
  "Resource": "arn:aws:lambda:us-east-1:123456789012:function:process-payment",
  "Retry": [
    {
      "ErrorEquals": ["Lambda.ServiceException"],
      "MaxAttempts": 2
    }
  ],
  "Catch": [
    {
      "ErrorEquals": ["PaymentFailed"],
      "Next": "NotifyCustomerOfPaymentFailure"
    },
    {
      "ErrorEquals": ["States.ALL"],
      "Next": "EscalateToHuman"
    }
  ],
  "Next": "ShipProduct"
}

Here, we retry service errors up to twice. If the call still fails after those retries, or if our Lambda throws a custom PaymentFailed error (like a declined credit card), the failure is caught. PaymentFailed has no matching Retry, so it goes straight to the first Catch block, which handles the specific business logic failure by routing to a state that notifies the user. The second Catch is our safety net: it catches everything else, including service errors that outlasted their retries, and routes it to a human for investigation.
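Where does a custom error name like PaymentFailed come from? As a rough sketch, assuming the process-payment function is written in Python: the Lambda Python runtime reports the exception class name as the errorType, and that name is what ErrorEquals is matched against. The charge_card helper and the event fields (orderId, amount, available_funds) are hypothetical stand-ins, not a real payment API.

```python
class PaymentFailed(Exception):
    """Business failure; the class name is what ErrorEquals matches on."""


def charge_card(order):
    # Hypothetical stand-in for a call to a real payment provider.
    if order.get("amount", 0) > order.get("available_funds", 0):
        return {"approved": False, "reason": "Card declined: insufficient funds."}
    return {"approved": True, "reason": None}


def lambda_handler(event, context):
    result = charge_card(event)
    if not result["approved"]:
        # Raising (rather than returning an error payload) is what lets
        # the state machine's Catch see this as a PaymentFailed error.
        raise PaymentFailed(result["reason"])
    return {"orderId": event.get("orderId"), "status": "PAID"}
```

The design point is to raise for failures you want the workflow to route on. A handler that returns {"status": "failed"} looks like success to the Task state and skips Retry and Catch entirely.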

Critical Insight: The Next field in a Catch block is mandatory and must point to a valid state. This is your recovery path.

The Devil in the Details: Error Input

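Here’s the best part, the thing that makes this genuinely powerful. When a Catch block is triggered, the error information becomes the input to the state you jump to. It’s not a silent jump. By default that payload replaces the original input; add a ResultPath to the catcher (for example "ResultPath": "$.error") if you want the error tucked under a key with the original input kept alongside it. Without a ResultPath, the fallback state (like NotifyCustomerOfPaymentFailure) receives a payload that looks like this: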

{
  "Error": "PaymentFailed",
  "Cause": "{\"errorMessage\": \"Card declined: insufficient funds.\", \"errorType\": \"PaymentFailed\", \"stackTrace\": [...]}"
}

Your recovery state can use this information! The NotifyCustomerOfPaymentFailure state could be another Lambda that parses the Cause field to get the specific error message and send a tailored notification to the user. This turns a simple failure into an intelligent, context-aware recovery process.
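A minimal sketch of such a recovery Lambda in Python follows. It assumes the catcher has no ResultPath, so the event is exactly the Error/Cause payload, and it treats Cause as a JSON-encoded string that may or may not parse. The parse_cause helper and the shape of the return value are illustrative choices, not anything Step Functions requires.

```python
import json


def parse_cause(cause):
    """Best-effort decode of the Cause string; fall back to the raw text."""
    if not cause:
        return {}
    try:
        decoded = json.loads(cause)
    except (TypeError, ValueError):
        # Cause is not always JSON (timeouts, for instance), so keep the text.
        return {"errorMessage": str(cause)}
    return decoded if isinstance(decoded, dict) else {"errorMessage": str(cause)}


def lambda_handler(event, context):
    details = parse_cause(event.get("Cause"))
    message = details.get("errorMessage", "Unknown error")
    return {
        "errorCode": event.get("Error", "Unknown"),
        "customerMessage": f"We could not process your payment. {message}",
    }
```

Note the double decoding: the event arrives as a parsed object, but Cause inside it is still a string and needs its own json.loads.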

The Golden Rule: Always Catch States.ALL at the End

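My non-negotiable best practice: On any state that has Retry policies, your final Catch block should always have an "ErrorEquals": ["States.ALL"] to act as a final safety net. It has to be the last catcher in the list, and States.ALL has to appear alone in its ErrorEquals array. Without it, an unanticipated error will cause the entire execution to fail. (It is not quite all-encompassing: States.Runtime and States.DataLimitExceeded slip past States.ALL and fail the execution regardless.) A failed execution is a dead end. A caught error is a detour. Your goal is never to fail; it’s to handle.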
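If you want to enforce this rule rather than remember it, a small check over the state machine definition can do it. The sketch below is a hypothetical lint helper (missing_catch_all is not part of any AWS tooling) that takes the "States" object of a definition, already loaded as a Python dict, and only inspects top-level states; states nested inside Parallel or Map are not walked.

```python
def missing_catch_all(states):
    """Return names of states that have Retry but no trailing States.ALL catcher."""
    offenders = []
    for name, state in states.items():
        if "Retry" not in state:
            continue
        catchers = state.get("Catch", [])
        # The rule: the last catcher must be exactly ["States.ALL"].
        if not catchers or catchers[-1].get("ErrorEquals") != ["States.ALL"]:
            offenders.append(name)
    return offenders
```

Run against the two examples in this section, it would flag CheckInventory (Retry with no Catch at all) and pass ProcessPayment.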

Think of Retry as your optimism—“maybe it’ll work this time.” Think of Catch as your pragmatism—“okay, plan B.” Together, they transform your workflow from a fragile script into a robust, self-healing system that can handle the chaos of the real world.