41.8 Bedrock Pricing: On-Demand vs Provisioned Throughput

Right, let’s talk money. Because as much as I love playing with billion-parameter AI models, I’m not the one paying the AWS bill, and I’m guessing you are. Bedrock’s pricing model is actually one of its better features: it’s designed to be flexible, but that flexibility means you have a choice to make. Pay as you go, or commit like you’re in a serious relationship. Let’s break down the two modes so you don’t end up with a bill that makes you gasp.

On-Demand: The First Date

This is the default, no-strings-attached mode. You send a request, Bedrock processes it, and you pay a fixed rate per thousand input tokens and a separate (usually higher) rate per thousand output tokens. It’s perfect for exploration, prototyping, and low-volume or unpredictable workloads. There’s no commitment, no upfront cost. You’re just dating the model.
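To make the per-token math concrete, here is a minimal sketch of a cost estimator. The function name and both rates are made up for illustration; they are placeholders, not real Bedrock prices, so swap in the numbers from the current pricing page.

```python
def estimate_on_demand_cost(input_tokens, output_tokens,
                            input_price_per_1k, output_price_per_1k):
    """Return the estimated cost in dollars of one on-demand request."""
    input_cost = (input_tokens / 1000) * input_price_per_1k
    output_cost = (output_tokens / 1000) * output_price_per_1k
    return input_cost + output_cost


# Placeholder rates for illustration only (NOT real Bedrock prices).
EXAMPLE_INPUT_PRICE_PER_1K = 0.001
EXAMPLE_OUTPUT_PRICE_PER_1K = 0.003

# A request with a 500-token prompt and a 300-token answer.
per_request = estimate_on_demand_cost(
    500, 300, EXAMPLE_INPUT_PRICE_PER_1K, EXAMPLE_OUTPUT_PRICE_PER_1K
)
print(f"One request:        ${per_request:.6f}")
print(f"A million requests: ${per_request * 1_000_000:,.2f}")
```

A single request looks like pocket change. Multiply by a million and it starts to look like rent, which is the whole point of running the numbers early.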

The pricing varies wildly by model. For example, as of this writing, using Anthropic’s Claude Instant is a cheap date, while sending a request to Jurassic-2 Ultra feels like taking the model to a Michelin-star restaurant. Always check the latest pricing page before you build, because nothing is more heartbreaking than building a brilliant prototype only to find out it’s financially unsustainable.

Here’s how you’d typically use it in code. You don’t have to do anything special; just call the model.

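import boto3
import json

# Runtime client for invoking models (distinct from the 'bedrock' management client)
client = boto3.client('bedrock-runtime', region_name='us-east-1')

# Claude's text-completion request format: the prompt uses Human/Assistant turns
body = json.dumps({
    "prompt": "\n\nHuman: Explain the theory of relativity like I'm a witty, curious golden retriever.\n\nAssistant:",
    "max_tokens_to_sample": 300,  # caps output tokens, and therefore output cost
    "temperature": 0.5,
})

# On-Demand: pass the plain model ID; there is no capacity to set up first
response = client.invoke_model(
    modelId='anthropic.claude-instant-v1',
    body=body
)

# The response body is a stream; read it, then parse the JSON payload
response_body = json.loads(response.get('body').read())
print(response_body.get('completion'))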
# And somewhere, Amazon's billing system quietly ticks over by $0.0001.

The beauty here is the sheer lack of friction. The pitfall is the sheer lack of cost control at high volumes. If your dog-explains-physics app goes viral, your on-demand bill will go viral right along with it, in the worst way possible.

Provisioned Throughput: The Marriage

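Okay, maybe it’s a business partnership, not a marriage, but the level of commitment is similar. Provisioned Throughput is for when you know you’re going to be making a lot of calls, consistently. You commit to a certain number of Model Units (a slightly abstract metric Amazon invented that represents a fixed throughput of input and output tokens per minute) for a one-month or six-month term; some models also offer a no-commitment hourly option. In return, you get dedicated capacity at a fixed hourly rate, and the longer the commitment, the lower that rate. At high, steady utilization the effective cost per token can come out well below On-Demand, but the exact savings depend on the model and on how much of that capacity you actually use.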

You’re buying a dedicated slice of the model’s capacity. This is what you use for production workloads where you have predictable, steady-state traffic. The catch? You pay for those Model Units every hour, whether you use them or not. It’s like leasing a dedicated server instead of using serverless functions. If your traffic has a massive spike, requests beyond your provisioned capacity get throttled (there is no auto-scaling to rescue you, so you either buy more Model Units or route the overflow somewhere else), and if your traffic dips, you’re still paying for the empty chair at the table.
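Since the hourly meter runs whether or not you send a single token, it helps to put both modes on the same monthly footing. The sketch below does that under loud assumptions: the function names are mine, traffic is treated as a steady average, and every price (including the hourly Model Unit rate) is a placeholder rather than a real quote.

```python
HOURS_PER_MONTH = 730  # rough average; real months vary


def monthly_on_demand_cost(avg_input_tpm, avg_output_tpm,
                           input_price_per_1k, output_price_per_1k):
    """Monthly on-demand cost for a steady average tokens-per-minute load."""
    minutes = HOURS_PER_MONTH * 60
    input_cost = avg_input_tpm * minutes / 1000 * input_price_per_1k
    output_cost = avg_output_tpm * minutes / 1000 * output_price_per_1k
    return input_cost + output_cost


def monthly_provisioned_cost(model_units, hourly_price_per_unit):
    """Monthly provisioned cost: billed every hour, used or not."""
    return model_units * hourly_price_per_unit * HOURS_PER_MONTH


def cheaper_mode(on_demand, provisioned):
    """Name the cheaper option for the two monthly totals."""
    return "provisioned" if provisioned < on_demand else "on-demand"


# Placeholder numbers for illustration only (NOT real Bedrock prices).
on_demand = monthly_on_demand_cost(20_000, 8_000, 0.001, 0.003)
provisioned = monthly_provisioned_cost(model_units=1, hourly_price_per_unit=40.0)

print(f"On-Demand:   ${on_demand:,.2f} per month")
print(f"Provisioned: ${provisioned:,.2f} per month")
print(f"Cheaper:     {cheaper_mode(on_demand, provisioned)}")
```

With these invented numbers, the commitment loses badly because the traffic is too light to fill the capacity. Crank the tokens-per-minute up far enough and the answer flips, which is exactly the break-even point you are hunting for.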

Here’s the kicker: using it is almost identical to on-demand. You just specify your provisioned model ARN when you invoke it. The complexity is all in the setup and purchase within the AWS console.

# Reuses 'client' and 'body' from the On-Demand example.
# The only change: 'modelId' is the ARN of your provisioned capacity (placeholder shown).
provisioned_response = client.invoke_model(
    modelId='arn:aws:bedrock:us-east-1:123456789012:provisioned-model/abc123def456',
    body=body
)

# This call now draws from your pre-paid, committed pool of Model Units.
# The financial anxiety is lower, but the commitment anxiety is higher.

The Critical Choice: Modeling Your Usage

The biggest mistake I see is teams either over-provisioning (“We’ll definitely need 100 model units!”) and wasting money, or under-provisioning and throttling their own application. You need to model your expected usage.

1. Prototype with On-Demand: Use On-Demand to build your application and establish a baseline. Monitor your usage in CloudWatch.
2. Calculate Your TPM: Figure out your average and peak Tokens-Per-Minute requirement, for input and output tokens separately. This is the key metric.
3. Run the Numbers: Use the AWS Pricing Calculator. Plug in your On-Demand usage and compare it to the cost of a Provisioned Throughput commitment. The savings can be substantial, but only if you actually use what you pay for.
4. Consider a Blended Approach: For many real-world applications, the right choice is both. Use Provisioned Throughput for your baseline, predictable traffic, and have a fallback mechanism to On-Demand for handling unexpected traffic spikes. This requires more architectural thought but can be the most cost-effective and resilient setup.
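The blended approach is easier to reason about with code in front of you. Here is a minimal sketch, assuming the provisioned endpoint signals overload with a botocore-style error whose code is ThrottlingException. The helper names are hypothetical, and the function takes the client as a parameter so it works with a real bedrock-runtime client or a test double.

```python
import json

# Error codes treated as "capacity exhausted"; an assumption, adjust as needed.
THROTTLE_CODES = {"ThrottlingException"}


def is_throttled(exc):
    """True if the exception looks like a botocore ClientError for throttling."""
    response = getattr(exc, "response", None)
    if not isinstance(response, dict):
        return False
    return response.get("Error", {}).get("Code") in THROTTLE_CODES


def invoke_with_fallback(client, body, provisioned_model_arn, on_demand_model_id):
    """Try provisioned capacity first; fall back to on-demand if throttled.

    Returns (parsed_response_body, mode), where mode names the path used.
    """
    try:
        response = client.invoke_model(modelId=provisioned_model_arn, body=body)
        mode = "provisioned"
    except Exception as exc:
        if not is_throttled(exc):
            raise  # anything other than throttling is a real error
        response = client.invoke_model(modelId=on_demand_model_id, body=body)
        mode = "on-demand"
    return json.loads(response["body"].read()), mode
```

You would call it as invoke_with_fallback(client, body, your_provisioned_arn, 'anthropic.claude-instant-v1') and log the returned mode, because a fallback path that fires all day is a sign you under-provisioned. Note that boto3 also retries throttled calls on its own before raising, so the fallback only kicks in once those retries are exhausted.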

The designers got this mostly right. The pay-as-you-go model lowers the barrier to entry, which is fantastic. The commitment model makes serious production work economically feasible. The questionable choice? Burying the actual process of purchasing Provisioned Throughput in the AWS console where it feels like you’re configuring a VPC rather than buying a product. It’s clunky. But hey, it’s AWS. You weren’t expecting a beautifully designed UI, were you? You’re here for the power, not the polish. Now go use it wisely.