29.2 Standard vs Express Workflows: Durability and Cost Trade-offs
Right, so you’ve decided to build a workflow, and AWS has handed you two different tools for the job: Standard and Express. This isn’t just a “pick one” scenario; it’s a fundamental architectural choice between durability and speed (and cost). Getting it wrong can either light your money on fire or leave you with a workflow that’s about as reliable as a chocolate teapot. Let’s break it down so you can make the right call.
The Core of the Matter: Exactly-Once vs. At-Least-Once
This is the big one, the philosophical divide. It dictates everything else.
Standard Workflows are the pedantic, meticulous librarians of the world. They are exactly-once: each state (a step in your workflow) runs once unless you explicitly configure a retry, and its completion is durably recorded before the next one starts. If your execution fails halfway through, you can go back, inspect the exact state it failed in, and redrive the execution from the point of failure instead of starting over. It’s like a save point in a video game. This is fantastic for things you absolutely cannot mess up, like processing a financial transaction or fulfilling an order. The trade-off? All that journaling takes time.
Express Workflows are the caffeinated, optimistic interns. Asynchronous Express executions are at-least-once; synchronous ones are at-most-once. They blaze through tasks at breakneck speed, but the service keeps no per-step checkpoint: an asynchronous execution can end up running more than once, from the beginning, and there’s no built-in execution history to inspect unless you turn on CloudWatch Logs. You’d use this for things where speed is paramount and idempotency is your responsibility, like processing a high-volume data stream where processing the same piece of data twice is acceptable, or kicking off a fan-out of parallel tasks.
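Since an asynchronous Express execution can run a task more than once, the task itself has to shrug off duplicates. Here is a minimal sketch of one common approach, deduplicating on a caller-supplied key. The names `IdempotentProcessor`, `event_id`, and `resize_image` are hypothetical, and the in-memory dict is only a stand-in: a real handler would need a durable store with a conditional write.

```python
from typing import Any, Callable, Dict


class IdempotentProcessor:
    """Runs `work` at most once per key and replays the stored result on duplicates."""

    def __init__(self, work: Callable[[Dict[str, Any]], Any]) -> None:
        self._work = work
        # Hypothetical in-memory store; not shared across processes or restarts.
        self._results: Dict[str, Any] = {}
        self.calls = 0  # how many times the real work actually ran

    def handle(self, event: Dict[str, Any]) -> Any:
        key = event["event_id"]  # assumed unique per logical unit of work
        if key in self._results:
            # Duplicate delivery: return the earlier result, skip the side effect.
            return self._results[key]
        self.calls += 1
        result = self._work(event)
        self._results[key] = result
        return result


def resize_image(event: Dict[str, Any]) -> str:
    # Placeholder for the real side-effecting work.
    return f"resized:{event['object_key']}"


processor = IdempotentProcessor(resize_image)
event = {"event_id": "evt-001", "object_key": "cat.png"}
first = processor.handle(event)
second = processor.handle(event)  # simulated redelivery of the same event
```

The second call returns the stored result without running `resize_image` again, which is the property you need before pointing an at-least-once workflow at anything with side effects.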
The Cost Equation: Per Transition vs. Per Duration
The pricing models are completely different, and this is where people get blindsided.
Standard charges you per state transition. You pay for each little step (“Task,” “Choice,” “Succeed,” etc.) that your workflow executes. The execution duration itself is free. This is great for long-running, human-approved workflows that might sit waiting for days. You only pay for the “clicks.”
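Express, on the other hand, charges you per request (each execution you start) plus the total duration of the execution, billed by memory consumed in GB-seconds (much like Lambda). It’s cheap and fast for short-lived, high-volume workflows, but the duration charge grows with every second an execution spends running or waiting, so idle time that Standard would give you for free costs real money here, right up to the five-minute cap. It’s optimized for “get in, do the work, get out.”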
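To see how the two models diverge, it helps to run the arithmetic. The sketch below is a back-of-the-envelope estimator, not a billing tool: the three rates are assumed, illustrative figures (check the current pricing page for your region), the free tier is ignored, and the rounding rules (100 ms duration increments, 64 MB memory chunks) reflect my understanding of how Express is metered.

```python
import math

# Illustrative rates only (assumed); always check current pricing for your region.
STANDARD_PER_TRANSITION = 0.000025   # USD per state transition
EXPRESS_PER_REQUEST = 0.000001       # USD per execution started
EXPRESS_PER_GB_SECOND = 0.00001667   # USD per GB-second of duration


def standard_cost(executions: int, transitions_per_execution: int) -> float:
    """Standard: pay per state transition; duration is not billed."""
    return executions * transitions_per_execution * STANDARD_PER_TRANSITION


def express_cost(executions: int, duration_ms: int, memory_mb: int = 64) -> float:
    """Express: pay per request plus duration weighted by memory."""
    # Assumed metering: duration rounds up to 100 ms, memory to 64 MB chunks.
    billed_seconds = math.ceil(duration_ms / 100) * 100 / 1000
    billed_gb = math.ceil(memory_mb / 64) * 64 / 1024
    duration_charge = executions * billed_seconds * billed_gb * EXPRESS_PER_GB_SECOND
    request_charge = executions * EXPRESS_PER_REQUEST
    return request_charge + duration_charge


# Scenario A: one million short executions, 5 transitions, 500 ms each.
short_standard = standard_cost(1_000_000, 5)
short_express = express_cost(1_000_000, 500)

# Scenario B: one thousand executions that mostly wait, 3 transitions, 4 minutes each.
waiting_standard = standard_cost(1_000, 3)
waiting_express = express_cost(1_000, 240_000)
```

Under these assumed rates, scenario A comes to roughly $125 on Standard versus about $1.52 on Express, while scenario B flips: about $0.08 on Standard versus about $0.25 on Express. The absolute numbers matter less than the shape: transitions drive one bill, wall-clock time drives the other.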
Let’s make this concrete. Imagine a simple workflow: a Lambda function that processes an image.
# Standard Workflow Definition (CDK Python snippet)
from aws_cdk import (
    Duration,
    aws_stepfunctions as sfn,
    aws_stepfunctions_tasks as tasks,
    aws_lambda as _lambda,
)

# This runs inside a Stack or Construct, so `self` is the construct scope.
process_image_lambda = _lambda.Function(self, "ProcessImage", ...)

definition = tasks.LambdaInvoke(self, "Process Image Task",
    lambda_function=process_image_lambda,
    # Return only the Lambda's payload as the state output,
    # without the invocation metadata wrapper.
    payload_response_only=True,
)

sfn.StateMachine(self, "MyStandardWorkflow",
    # `definition=` still works but is deprecated in recent CDK v2 releases.
    definition_body=sfn.DefinitionBody.from_chainable(definition),
    state_machine_type=sfn.StateMachineType.STANDARD,  # Explicitly set to STANDARD
    timeout=Duration.hours(24),  # Can run for up to a year!
)
# Express Workflow Definition (CDK Python snippet)
from aws_cdk import (
    Duration,
    aws_stepfunctions as sfn,
    aws_stepfunctions_tasks as tasks,
)

# Reuses process_image_lambda from the Standard snippet.
definition = tasks.LambdaInvoke(self, "Process Image Task",
    lambda_function=process_image_lambda,
    # payload_response_only behaves the same for both workflow types.
    # It is omitted here, so the state output is the full invocation
    # result, with the function's return value under "Payload".
)

sfn.StateMachine(self, "MyExpressWorkflow",
    # `definition=` still works but is deprecated in recent CDK v2 releases.
    definition_body=sfn.DefinitionBody.from_chainable(definition),
    state_machine_type=sfn.StateMachineType.EXPRESS,  # Explicitly set to EXPRESS
    timeout=Duration.minutes(5),  # Max timeout is 5 minutes!
)
Notice the huge difference in timeout. Standard can run for a year. Express gets five minutes. That should tell you everything about their intended use cases.
Execution History: Built-in vs. Bring-Your-Own
With Standard, you get a gorgeous, detailed, and immutable execution history out of the box in the AWS Console and through the API, kept for 90 days after the execution ends. It’s a godsend for debugging. Express… well, Express records nothing per execution by default beyond aggregate CloudWatch metrics. If you want to know what actually happened inside an Express workflow, you have to enable logging to CloudWatch Logs on the state machine yourself, and pay for those logs. It’s a significant operational difference.
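For a sense of what “bring-your-own” means in practice, here is a small sketch that builds the `loggingConfiguration` structure accepted by the Step Functions `CreateStateMachine` API. The helper name `express_logging_config` and the ARN are hypothetical; the dictionary keys follow the API shape as I understand it, including the requirement that the log group ARN end in `:*`.

```python
from typing import Any, Dict


def express_logging_config(
    log_group_arn: str,
    level: str = "ALL",
    include_execution_data: bool = True,
) -> Dict[str, Any]:
    """Build the `loggingConfiguration` argument for CreateStateMachine."""
    if level not in {"ALL", "ERROR", "FATAL", "OFF"}:
        raise ValueError(f"unsupported log level: {level}")
    # The API expects the log group ARN to end with ":*".
    if not log_group_arn.endswith(":*"):
        log_group_arn += ":*"
    return {
        "level": level,
        # When True, state input and output are written to the logs as well.
        "includeExecutionData": include_execution_data,
        "destinations": [
            {"cloudWatchLogsLogGroup": {"logGroupArn": log_group_arn}}
        ],
    }


# Hypothetical ARN, for illustration only.
config = express_logging_config(
    "arn:aws:logs:us-east-1:123456789012:log-group:/example/express-workflow"
)
```

With boto3 you would pass this as `loggingConfiguration=config` alongside `type="EXPRESS"` when calling `create_state_machine`, and the state machine’s execution role needs permission to deliver logs. Think twice about `include_execution_data=True` if your payloads carry anything sensitive.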
When to Use Which (The Real-World Guide)
Use Standard for:
- Mission-critical business processes: Order fulfillment, data pipeline orchestration, anything where you need an audit trail.
- Long-running workflows: Anything that needs more than 5 minutes or involves human interaction (waiting for an approval email).
- When you need to debug complex failures: The built-in execution history is a non-negotiable feature for complex systems.
Use Express for:
- High-volume, event-processing workloads: Processing thousands of files uploaded to S3, handling IoT data streams.
- Massively parallel “fan-out” tasks: Kicking off thousands of short-lived, idempotent jobs in parallel. It’s brutally efficient at this.
- Short-lived orchestration inside a larger application: Coordinating a few Lambda functions that must complete in under a minute.
The Biggest Pitfall: Using Express for something that isn’t idempotent. If your “Process Payment” task gets run twice because an asynchronous Express execution is only at-least-once, you’re going to have a very bad day and a very angry customer. The responsibility for ensuring safety is on you, not the service. Always ask: “Is it okay if this runs more than once?” If the answer is anything but a resounding “yes,” you probably want Standard.
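If it helps to see the checklist above as a decision rule, here is one way to write it down. This is a deliberately crude sketch: `WorkflowNeeds` and `choose_workflow_type` are hypothetical names, and the function only encodes the questions from this section (duration, human waits, idempotency, built-in history), not cost.

```python
from dataclasses import dataclass

EXPRESS_MAX_SECONDS = 5 * 60  # Express executions are capped at five minutes


@dataclass
class WorkflowNeeds:
    max_duration_seconds: int
    idempotent: bool                     # "Is it okay if this runs more than once?"
    needs_builtin_history: bool = False  # audit trail or step-level debugging
    waits_for_human: bool = False        # approvals, callbacks, long pauses


def choose_workflow_type(needs: WorkflowNeeds) -> str:
    """Return "STANDARD" or "EXPRESS" using the rules of thumb from this section."""
    if needs.max_duration_seconds > EXPRESS_MAX_SECONDS or needs.waits_for_human:
        return "STANDARD"  # Express cannot run long enough
    if not needs.idempotent:
        return "STANDARD"  # at-least-once execution would be unsafe
    if needs.needs_builtin_history:
        return "STANDARD"  # Express has no built-in execution history
    return "EXPRESS"


payment = choose_workflow_type(
    WorkflowNeeds(max_duration_seconds=30, idempotent=False)
)
thumbnails = choose_workflow_type(
    WorkflowNeeds(max_duration_seconds=20, idempotent=True)
)
```

Here `payment` comes out as `"STANDARD"` and `thumbnails` as `"EXPRESS"`. Notice that the order of the checks doesn’t change the answer: any single “no” sends you to Standard.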
So, choose wisely. Want a robust, auditable, “slow and steady” process? Standard. Need raw speed and throughput for idempotent tasks? Express. It’s that simple, and getting it right is what separates a clean architecture from a future headache.