30.5 Prompt Caching: Reducing Latency and Cost for Long Contexts
Right, let’s talk about one of the most powerful features you probably didn’t know you needed: prompt caching. Think of it like this. You’re sending a massive 100k context prompt to Claude, maybe a huge document for analysis followed by a set of instructions. You do this repeatedly in a loop, or across different user sessions. Every single time, you’re paying to process that entire document and you’re waiting for every single token to be processed. It’s like photocopying a book every time you want to ask a question about a specific page. It’s absurdly wasteful. Prompt caching is the three-hole punch and binder that fixes this.
The genius of it is brutally simple. You send your long, static, rarely-changing ‘document’ prompt once. The API returns a cache_creation_id—a UUID that is essentially a receipt for your cached content. On subsequent requests, you send that UUID instead of the massive text block. Claude’s systems then seamlessly inject the pre-processed, cached content into your new, much shorter prompt. The result? Drastically lower latency and, crucially, a significantly smaller bill, as you’re only charged for the tokens in your new, dynamic prompt and the generated output.
How to Actually Use It: The Code
It’s a two-step dance: creation and usage. Here’s how you do it in Python. First, you create the cache. Note the cache_control parameter; it’s your lever.
from anthropic import Anthropic
client = Anthropic(api_key="your_api_key")
# Step 1: Create the cache with your long, static content
create_response = client.cache.create(
prompt=f"\n\nHuman: Here is the entire, unabridged text of 'War and Peace'...\n\nAssistant:",
cache_control={"type": "ephemeral", "ttl": 300} # Cache lives for 5 minutes
)
cache_creation_id = create_response.id
print(f"Cache created. ID: {cache_creation_id}")
Now, with that cache_creation_id in hand, you can use it in a normal message request. You reference the cache and then provide your new, dynamic instruction.
# Step 2: Use the cache for a much faster, cheaper request
message_response = client.messages.create(
model="claude-3-opus-20240229",
max_tokens=1000,
system="You are a literature professor with a sharp wit.",
messages=[
{
"role": "user",
"content": [
{
"type": "cache_provider",
"cache_creation_id": cache_creation_id
},
{
"type": "text",
"text": "Based on the text I provided, summarize the character arc of Pierre Bezukhov in the style of a overly dramatic movie trailer voice."
}
]
}
]
)
print(message_response.content[0].text)
See what we did there? The messages array contains a cache_provider block pointing to our pre-processed Tolstoy, followed by a tiny text block with our actual question. We went from sending hundreds of thousands of tokens to sending a UUID and a single sentence. The latency drop is palpable.
Cache Control: Ephemeral vs. Persistent
This is a critical design choice. The API offers two types, and you must specify one.
Ephemeral (
"type": "ephemeral"): This is the default and, frankly, the one you’ll use 90% of the time. It’s a short-lived cache (you set the Time-To-Live withttlin seconds) that’s perfect for the duration of a single session or a short-lived conversation. It’s cheaper (or free, depending on your plan) but gets purged quickly. Use this for caches that are specific to one user’s current interaction.Persistent (
"type": "persistent"): This is for content that changes… never. Think your company’s internal documentation, a fixed knowledge base, or the complete works of Shakespeare. This cache sticks around until you delete it. This is where the real savings compound, but be warned: you are charged for persistent cache storage. It’s a fraction of a cent per million tokens per day, but you need to be mindful. Don’t cache yesterday’s news article persistently.
The Rough Edges and Pitfalls (Because Of Course There Are Some)
The designers didn’t make it perfectly obvious, so I will: caching is not a free-for-all. The cache is immutable. If you need to change a single comma in your cached document, you must create a brand new cache and get a new ID. The old one remains, useless and billing you if it’s persistent, until its TTL expires or you manually delete it. This is the biggest mental shift: your cached content is now a versioned artifact.
Another pitfall is cache misses. You can’t just blindly use a cache_creation_id forever. Your code must handle the error if the cache has expired (for ephemeral) or been deleted (for persistent). The API will return a 400 InvalidCacheCreationIDError and your request will fail. Robust code will catch this, recreate the cache, and proceed.
Best practice? Tag your caches. When you create a persistent cache, you can assign it a user_id or a document_version tag. This allows you to later list, manage, and clean up caches you’re no longer using. Without tags, you’re just staring at a list of UUIDs with no idea what they contain. It’s a nightmare.
# Create a cache with tags for later management
create_response = client.cache.create(
prompt="Your massive document here...",
cache_control={"type": "persistent"},
metadata={"document_id": "company_handbook_v2.1", "user_id": "api_backend"}
)
Finally, know what can’t be cached. The cache is for the prompt field in the cache.create call. Your system prompt and the new messages you send in the follow-up request are not cached and are processed fresh every time. This is by design; it keeps the dynamic part dynamic.
In short, if you’re dealing with long contexts and repeat requests, prompt caching isn’t just an optimization; it’s a financial and operational necessity. It turns a clunky, expensive process into something elegantly efficient. Just remember to manage your cache lifecycles like a gardener—planting, pruning, and weeding—or you’ll end up with a digital jungle of forgotten UUIDs.