Prompt caching with Claude API: cut costs by 90% on repeated contexts
By ThePromptEra Editorial
The Problem: Context Tokens Are Killing Your Budget
You're building an application where Claude processes the same reference documents, code repositories, or system prompts repeatedly. A PDF guide. A codebase. A detailed specification. Every single request sends these tokens to Anthropic's servers again. And again. And again.
If you're processing the same 100KB document across 1,000 requests, you're paying for those tokens 1,000 times.
This is where most teams hemorrhage money on Claude API usage.
Enter prompt caching: a feature that lets you store frequently-used context on Anthropic's servers and reuse it for a fraction of the cost. We're talking 90% savings on cached tokens.
How Prompt Caching Works
Prompt caching is deceptively simple conceptually:
- You send a request with a cache_control parameter on the content you want cached
- Claude processes those tokens normally, but also caches them server-side
- Subsequent requests that include identical cached content pay 10% of the normal token cost
- The cache persists for 5 minutes by default, with a longer TTL available if you need it
The math is straightforward. A cached token costs 10% of an input token. So if you're reusing 10,000 tokens across multiple requests, you save 9,000 token charges on each request after the first.
For teams running hundreds of daily requests against the same documents or system prompts, this translates to real money.
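To make that concrete, here's a rough back-of-the-envelope calculation. The per-million-token rates below are illustrative (they reflect published Claude 3.5 Sonnet pricing at the time of writing, but check the current price list before budgeting), and the sketch assumes every request lands within the cache window.

# Rough cost comparison for reusing a 10,000-token context across 1,000 requests.
# Rates are USD per million tokens and illustrative - substitute your model's pricing.
INPUT_RATE = 3.00        # regular input tokens
CACHE_WRITE_RATE = 3.75  # the first request, which writes the cache (small premium)
CACHE_READ_RATE = 0.30   # every later request that reads the cache (10% of input rate)

context_tokens = 10_000
requests = 1_000

without_cache = requests * context_tokens * INPUT_RATE / 1_000_000
with_cache = (context_tokens * CACHE_WRITE_RATE
              + (requests - 1) * context_tokens * CACHE_READ_RATE) / 1_000_000

print(f"Without caching: ${without_cache:.2f}")                  # $30.00
print(f"With caching:    ${with_cache:.2f}")                     # ~$3.03
print(f"Savings:         {1 - with_cache / without_cache:.0%}")  # ~90%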
Setting Up Caching: The Code
Here's the minimal implementation using the Python SDK:
import anthropic

client = anthropic.Anthropic()

# Your expensive context - system prompt, docs, etc.
expensive_context = """
## Product Specification v2.1
[... 10,000 tokens of detailed spec ...]
## API Documentation
[... another 5,000 tokens ...]
## Style Guide
[... 2,000 more tokens ...]
"""

# First request - caches the context
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": expensive_context,
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[
        {
            "role": "user",
            "content": "Summarize the API rate limits"
        }
    ]
)

print(response.usage.cache_creation_input_tokens)  # High number
print(response.usage.cache_read_input_tokens)      # 0
On the second request with identical cached content:
# Second request - uses cache
response = client.messages.create(
model="claude-3-5-sonnet-20241022",
max_tokens=1024,
system=[
{
"type": "text",
"text": expensive_context, # Identical - gets cached
"cache_control": {"type": "ephemeral"}
}
],
messages=[
{
"role": "user",
"content": "What are the authentication methods?"
}
]
)
print(response.usage.cache_creation_input_tokens) # 0
print(response.usage.cache_read_input_tokens) # High number
That's it. The cache_control parameter does the heavy lifting. The ephemeral type keeps the cache alive for 5 minutes. For longer retention, the API accepts an explicit ttl value alongside {"type": "ephemeral"} (sketched below), or you can plan your own refresh strategy so requests keep landing within the window.
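For reference, an extended-TTL request looks roughly like this. Treat it as a sketch: the "ttl": "1h" value and any beta header it requires should be confirmed against Anthropic's current documentation before you rely on it.

# Illustrative only: requesting a longer cache TTL.
# Confirm the supported ttl values and any required beta header in the current docs.
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": expensive_context,
            # asks for the 1-hour cache instead of the 5-minute default
            "cache_control": {"type": "ephemeral", "ttl": "1h"},
        }
    ],
    messages=[{"role": "user", "content": "Summarize the API rate limits"}],
)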
Real-World Scenarios Where Caching Pays Off
Document Processing at Scale: You're running customer support where each query references the same 20KB knowledge base. Across 500 daily queries, every request after the first reads that context from the cache instead of paying full price; in effect, context that would have billed as 10M tokens bills closer to 1M.
Code Review Bot: Your GitHub integration sends the same repository context (README, architecture docs, style guide) with each pull request review. Cache that context once, and each review within the cache window pays minimal tokens for it: the first PR costs you 50K tokens, the next one closer to 5K.
RAG with Fixed Contexts: You're building a retrieval-augmented generation system. Your embeddings retrieve the same top-5 documents across 80% of queries. Cache those documents. The variation in user questions doesn't matter—the expensive retrieval results stay cheap.
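One way to wire that up (a minimal sketch; top_5_documents and user_question are hypothetical stand-ins for your own retrieval pipeline) is to put the retrieved documents in their own content block, mark that block for caching, and keep the user's question in a separate, uncached block after it:

# Sketch: cache the retrieved documents, keep the question uncached.
# top_5_documents and user_question are placeholders for your pipeline.
retrieved_docs = "\n\n".join(doc.text for doc in top_5_documents)

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": retrieved_docs,
                    "cache_control": {"type": "ephemeral"},  # repeated across queries - cache it
                },
                {
                    "type": "text",
                    "text": user_question,  # varies per request - left uncached
                },
            ],
        }
    ],
)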
Batch Processing: Processing a queue of 10,000 emails against the same classification ruleset. Cache the ruleset. Only the email content varies per request.
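A minimal sketch of that pattern (classification_rules and emails are stand-ins for your own data): the ruleset is written to the cache on the first iteration and read from it on every later one, as long as the loop keeps the cache warm.

# Sketch: classify a queue of emails against one cached ruleset.
results = []
for email_body in emails:
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=256,
        system=[
            {
                "type": "text",
                "text": classification_rules,            # identical on every iteration
                "cache_control": {"type": "ephemeral"},  # written once, then read from cache
            }
        ],
        messages=[{"role": "user", "content": email_body}],  # only this part varies
    )
    results.append(response.content[0].text)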
The Limits (and How to Work Around Them)
Cache Invalidation: The cache is keyed on the exact content of your prompt prefix. When your specification updates, the new text is simply a cache miss: the first request that sends it pays the cache-write price again, and the old entry expires on its own after the TTL. Any change, even a single character, creates a new cache entry, so keep edits to cached content deliberate and batched.
Minimum Efficient Size: There's a floor and a breakeven point. Prompts shorter than the model's minimum cacheable length (1,024 tokens on Sonnet-class models) aren't cached at all, and cache writes cost slightly more than regular input tokens, so context you use only once gains nothing. For context you'll reuse a few times within the cache window, caching wins.
Model and Message Changes: Caching works on the exact prefix of your prompt, so any change to the cached portion means a fresh cache miss: a different model, reordered content blocks, or a tweaked system prompt all invalidate it. Keep cached content stable and put variable user input after the cached prefix.
Cost Tracking: Your usage dashboard will show cache tokens separately. Make sure your finance team understands that cache_read_input_tokens aren't "missing" charges—they're savings.
Advanced Pattern: Structured Caching
Don't just cache your entire system prompt. Structure your context thoughtfully:
from datetime import date

today = date.today().isoformat()

system_messages = [
    {
        "type": "text",
        "text": "You are a product specialist.",  # Small, rarely changes
    },
    {
        "type": "text",
        "text": stable_product_docs,  # Large, stable - CACHE THIS
        "cache_control": {"type": "ephemeral"},
    },
    {
        "type": "text",
        "text": "Today's date is " + today,  # Variable - don't cache
    },
]
This way, the large expensive context gets cached while small variable content stays fresh.
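Passing it in is then a one-liner per request: everything up to and including the block marked with cache_control is cached as one prefix, while the date block after it is processed fresh each time (the question below is just an example).

# Sketch: reuse the structured system blocks on each request.
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system=system_messages,  # prefix up to the cache_control block gets cached
    messages=[{"role": "user", "content": "Which plans include SSO?"}],
)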
Measuring Your Savings
Track these metrics:
- Cache creation tokens: tokens written to the cache by the first request (billed at a small premium over the regular input rate)
- Cache read tokens: tokens served from the cache on subsequent requests (billed at roughly 10% of the normal input rate)
- Cache efficiency ratio: (total cache read tokens) / (initial cache creation + all subsequent non-cached input tokens)
An efficiency ratio above 3 means caching is worth the implementation effort.
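If you want these numbers straight from the API responses, a small helper along these lines works (a sketch: the per-million-token rates are illustrative, not authoritative pricing):

# Sketch: tally cache usage and effective cost across a batch of responses.
INPUT_RATE, CACHE_WRITE_RATE, CACHE_READ_RATE = 3.00, 3.75, 0.30  # illustrative USD / MTok

def summarize_cache_usage(responses):
    created = sum(r.usage.cache_creation_input_tokens or 0 for r in responses)
    read = sum(r.usage.cache_read_input_tokens or 0 for r in responses)
    uncached = sum(r.usage.input_tokens for r in responses)

    actual = (created * CACHE_WRITE_RATE + read * CACHE_READ_RATE
              + uncached * INPUT_RATE) / 1_000_000
    baseline = (created + read + uncached) * INPUT_RATE / 1_000_000  # cost with no caching
    ratio = read / (created + uncached) if (created + uncached) else 0.0
    return {"actual_cost": actual, "baseline_cost": baseline, "efficiency_ratio": ratio}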
The Bottom Line
Prompt caching isn't revolutionary—it's pragmatic. If you're sending the same context repeatedly, you're overpaying. Adding cache_control to your existing code is a 10-minute change that pays for itself immediately.
For teams running serious Claude workflows, it's low-hanging fruit. Implement it, measure the savings, and reinvest in better prompts or more requests with your new budget.