Prompt Caching & Token Economics in 2025: How to Cut Cost Without Losing Quality

Prompt caching hit the mainstream in 2025. OpenAI, Anthropic, and Google all offer it now. If you're not using it, you're probably paying 10x more than you need to for repeated context. This guide covers how caching works across providers, when it actually saves money, and when it's not worth the effort. (New to prompt engineering? Start with our complete guide first.)
What Is Prompt Caching?
Prompt caching is simple: the model provider stores parts of your input (the "prefix") and reuses them across requests without reprocessing.
Why does this matter?
- Cost: Cached tokens cost 90% less than fresh tokens on OpenAI and Anthropic.
- Latency: Cached prefixes skip reprocessing, cutting time-to-first-token by 50-80%.
Here's how it works:
- You send a request with a long, stable prefix (like a system prompt plus a knowledge base).
- The provider caches that prefix using a hash.
- On later requests with the same prefix, only the new tokens get processed.
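If it helps to picture the mechanism, here is a rough conceptual sketch in Python. It is not how any provider actually implements caching (the real caches store the model's internal attention state server-side), but it shows the bookkeeping: hash the stable prefix, and skip the expensive work whenever the same hash shows up again.
import hashlib

prefix_cache = {}  # conceptual stand-in for the provider's server-side cache

def process_request(prefix: str, suffix: str) -> str:
    key = hashlib.sha256(prefix.encode()).hexdigest()
    if key in prefix_cache:
        prefix_state = prefix_cache[key]        # cache hit: prefix work is skipped
    else:
        prefix_state = f"<processed {len(prefix.split())} prefix words>"  # expensive step
        prefix_cache[key] = prefix_state        # cache miss: pay full price once
    return f"{prefix_state} + freshly processed: {suffix}"

process_request("You are an analyst. ...", "Analyze document 1")  # miss
process_request("You are an analyst. ...", "Analyze document 2")  # hit: only the suffix is new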
Provider Comparison: OpenAI vs. Anthropic vs. Google
| Feature | OpenAI (GPT-5.1) | Anthropic (Claude 3) | Google (Gemini 3) |
|---|---|---|---|
| Automatic caching | Yes (default) | Opt-in via API flag | Opt-in via API flag |
| Cache lifetime | 5-10 minutes | 5 minutes | 15 minutes |
| Min prefix size | 1024 tokens | 1024 tokens | 2048 tokens |
| Cost reduction | 90% on cached input | 90% on cached input | 75% on cached input |
| Latency improvement | 50-80% TTFT reduction | 60-85% TTFT reduction | 40-70% TTFT reduction |
| Extended caching | Yes (GPT-5.1 Pro) | No | Yes (Gemini 3 Pro) |
The bottom line: OpenAI's automatic caching is the easiest to work with. Anthropic and Google both require explicit API parameters (cache_control markers for Claude, a cachedContent object for Gemini). For provider-specific prompting techniques, check out our guides on Claude and Gemini.
OpenAI Prompt Caching (GPT-5.1)
How It Works
OpenAI automatically caches stable prefixes (system prompt plus user message prefix) when two conditions are met:
- The prefix is at least 1024 tokens
- You reuse the same prefix within 5-10 minutes
No API changes needed. Just structure your prompt with a reusable prefix and OpenAI handles the rest.
Example: Without Caching
from openai import OpenAI

client = OpenAI()

# Request 1
response = client.chat.completions.create(
    model="gpt-5.1-turbo",
    messages=[
        {"role": "system", "content": long_system_prompt},  # 2000 tokens
        {"role": "user", "content": "Analyze document 1"},   # 500 tokens
    ],
)
# Cost: 2500 input tokens @ $0.01/1k = $0.025

# Request 2 (new document)
response = client.chat.completions.create(
    model="gpt-5.1-turbo",
    messages=[
        {"role": "system", "content": long_system_prompt},  # 2000 tokens (reprocessed!)
        {"role": "user", "content": "Analyze document 2"},   # 500 tokens
    ],
)
# Cost: 2500 input tokens @ $0.01/1k = $0.025
# TOTAL: $0.050
After Caching (Automatic)
# Request 1: Same as above ($0.025)
# Request 2 (within 10 min)
# OpenAI detects prefix match and caches system prompt
# Cost: 2000 cached tokens @ $0.001/1k + 500 fresh tokens @ $0.01/1k
# = $0.002 + $0.005 = $0.007
# TOTAL: $0.032 (36% savings)
With 10 requests, the savings climb to roughly 65%.
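If you want to sanity-check these numbers for your own workload, the arithmetic is simple enough to script. The rates below are the example prices used in this post, not an official price list:
def caching_savings(prefix_tokens, fresh_tokens, n_requests,
                    price_per_1k=0.01, cached_price_per_1k=0.001):
    """Total input cost without caching vs. with a cached prefix."""
    without = n_requests * (prefix_tokens + fresh_tokens) * price_per_1k / 1000
    first = (prefix_tokens + fresh_tokens) * price_per_1k / 1000          # cache miss
    rest = (n_requests - 1) * (prefix_tokens * cached_price_per_1k +
                               fresh_tokens * price_per_1k) / 1000        # cache hits
    with_cache = first + rest
    return round(without, 4), round(with_cache, 4), round(1 - with_cache / without, 2)

print(caching_savings(2000, 500, 2))   # (0.05, 0.032, 0.36)  -> 36% savings
print(caching_savings(2000, 500, 10))  # (0.25, 0.088, 0.65)  -> ~65% savings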
GPT-5.1 Extended Caching (Pro Tier)
For longer interactions like chatbots or code assistants, GPT-5.1 Pro gives you extended cache lifetime:
- Standard: 5-10 minutes
- Extended: 60 minutes
This is useful for multi-turn conversations where the system prompt and memory layer stay the same throughout.
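In practice that just means keeping the system prompt and memory block byte-identical at the front of every turn. Here's a minimal sketch of such a loop, assuming long_system_prompt and memory_layer are strings you've already defined:
from openai import OpenAI

client = OpenAI()
history = []  # grows every turn; the stable blocks in front stay cacheable

def chat_turn(user_message: str) -> str:
    history.append({"role": "user", "content": user_message})
    response = client.chat.completions.create(
        model="gpt-5.1-turbo",
        messages=[
            {"role": "system", "content": long_system_prompt},  # stable -> cached
            {"role": "system", "content": memory_layer},         # stable -> cached
            *history,                                             # fresh suffix
        ],
    )
    reply = response.choices[0].message.content
    history.append({"role": "assistant", "content": reply})
    return reply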
Anthropic Prompt Caching (Claude 3)
How It Works
Unlike OpenAI, Anthropic requires you to explicitly mark what should be cached using cache_control in your API request.
Example
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-3-opus-20240229",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": long_system_prompt,              # 2000 tokens
            "cache_control": {"type": "ephemeral"},  # mark for caching
        }
    ],
    messages=[
        {"role": "user", "content": "Analyze document 1"}
    ],
)
On the first request, the 2000-token prefix is written to cache at 1.25x the normal input rate (the write overhead), and the 500 fresh tokens are billed at the normal rate.
On the second request (within 5 minutes), the 2000-token prefix is read from cache at 0.1x the normal rate, and only the 500 fresh tokens are processed in full.
Here's the cost breakdown at Opus's $0.015/1k input rate:
Request 1: 2000 cache-write tokens @ $0.01875/1k + 500 @ $0.015/1k = $0.0375 + $0.0075 = $0.045
Request 2: 2000 cached @ $0.0015/1k + 500 @ $0.015/1k = $0.003 + $0.0075 = $0.0105
TOTAL: $0.0555 (vs $0.075 without caching = roughly 26% savings after just two requests; the write premium is why Anthropic caching pays off more with every additional reuse)
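A follow-up request inside the five-minute window looks the same as the first one; the savings show up in the response's usage block. The cache_read_input_tokens and cache_creation_input_tokens fields come from Anthropic's usage object, while the surrounding setup reuses the example above:
follow_up = client.messages.create(
    model="claude-3-opus-20240229",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": long_system_prompt,              # identical 2000-token prefix
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[
        {"role": "user", "content": "Analyze document 2"}
    ],
)
# On a cache hit, expect ~2000 read tokens and 0 newly written tokens.
print(follow_up.usage.cache_read_input_tokens)
print(follow_up.usage.cache_creation_input_tokens)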
Google Context Caching (Gemini 3)
How It Works
Google takes a different approach. Their cachedContent API lets you pre-upload stable context and get back a cache ID that you reuse.
Example
import datetime

import google.generativeai as genai

# Step 1: Create cached content
cache = genai.caching.CachedContent.create(
    model="gemini-3-pro",
    system_instruction=long_system_prompt,  # 2000 tokens
    ttl=datetime.timedelta(minutes=15),
)

# Step 2: Use the cached content in requests
model = genai.GenerativeModel.from_cached_content(cached_content=cache)
response = model.generate_content("Analyze document 1")
Here's how the pricing breaks down:
- Cache creation: 2000 tokens at $0.012/1k = $0.024 (one-time)
- Subsequent reads: 2000 tokens at $0.003/1k = $0.006
- Fresh tokens: 500 at $0.012/1k = $0.006
For 10 requests, that's:
Cache creation: $0.024
10 requests: 10 × (0.006 + 0.006) = $0.120
TOTAL: $0.144 (vs $0.300 without caching = 52% savings)
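Once the cache exists, the per-document loop is where the savings add up. A short sketch, assuming the cache object from the example above and a documents list you supply:
model = genai.GenerativeModel.from_cached_content(cached_content=cache)

for i, doc in enumerate(documents, start=1):
    response = model.generate_content(f"Analyze document {i}:\n{doc}")
    # cached_content_token_count confirms the 2000-token prefix was served from cache
    print(response.usage_metadata.cached_content_token_count, response.text[:80])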
When Caching Saves You Money
Use Caching If:
- You have a large, stable prefix. This includes system prompts, knowledge bases, and few-shot examples that add up to 1024+ tokens.
- You're reusing the same prefix. You need at least 5 requests within the cache lifetime to see real savings.
- You're doing batch processing. Analyzing 100 documents with the same instructions is a perfect use case.
- You're building multi-turn conversations. Chatbots where the system prompt stays constant benefit a lot. (See also: prompt chaining for complex workflows.)
Caching Won't Help If:
- Every request has a different prefix. No reuse means no savings.
- Your prompts are small. If your total input is below the ~1024-token minimum prefix size, the provider won't cache it at all, so there's nothing to gain.
- Requests are spread out. If calls arrive further apart than the cache lifetime (roughly 5-15 minutes depending on provider), the cache expires before you can reuse it.
- Your context changes every time. If the system prompt is personalized per user, there's nothing to cache.
How to Structure Prompts for Caching
If you're looking for a quick reference while restructuring your prompts, our prompt engineering checklist can help.
1. Put Stable Content First
Your reusable instructions should go at the start of the context.
This won't cache well:
User message: "Analyze this report: {{document}}"
System prompt: "You are an analyst. Be concise."
This will:
System prompt: "You are an analyst. Be concise."
Knowledge base: [1500 tokens of reference material]
User message: "Analyze this report: {{document}}"
2. Separate Variable Content
Keep the dynamic part (user input, current document) after the cached prefix.
Think of it like this:
[CACHED]
- System prompt
- Long-term memory
- Knowledge base
- Few-shot examples
[/CACHED]
[FRESH]
- User's current request
- Document to analyze
[/FRESH]
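One way to enforce this split is a small assembly helper that always emits the stable layers first and appends the per-request pieces last. A sketch with placeholder layer names:
def build_messages(system_prompt, knowledge_base, few_shot_examples,
                   user_request, document):
    """Stable layers first (cacheable prefix), per-request content last."""
    stable_prefix = "\n\n".join([system_prompt, knowledge_base, few_shot_examples])
    return [
        {"role": "system", "content": stable_prefix},                   # cached
        {"role": "user", "content": f"{user_request}\n\n{document}"},   # fresh
    ]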
3. Avoid Micro-Changes in Prefix
Even small changes invalidate the cache. This is a common mistake.
This causes a cache miss:
Request 1: "You are a helpful assistant. Today is Nov 29."
Request 2: "You are a helpful assistant. Today is Nov 30."
The date change breaks the cache. Instead, do this:
System prompt: "You are a helpful assistant."
User message: "Today is Nov 30. Analyze..."
Now the system prompt stays cached and the date goes in the fresh part.
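In code, that means keeping the system string a constant and formatting the date into the user turn, for example:
from datetime import date

SYSTEM_PROMPT = "You are a helpful assistant."  # constant, so it stays cached

def build_request(task: str) -> list:
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        # The date changes daily, so it lives in the fresh user turn.
        {"role": "user", "content": f"Today is {date.today():%b %d}. {task}"},
    ]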
4. Use Context Engineering Layers
Organize your input into modular blocks (more on this in our Context Engineering Guide):
- Layers 1-4 (stable): Cache these
- Layers 5-6 (dynamic): Keep these fresh
Cost Modeling: Cached vs. Non-Cached
Scenario: Analyzing 100 Documents
Let's say you have:
- A system prompt plus knowledge base totaling 3000 tokens
- 100 documents, each around 1000 tokens
- Using GPT-5.1 Turbo
Without caching:
100 requests × 4000 tokens @ $0.01/1k = $4.00
With caching:
Request 1: 4000 tokens @ $0.01/1k = $0.04
Requests 2-100: 99 × (3000 cached @ $0.001/1k + 1000 fresh @ $0.01/1k)
= 99 × ($0.003 + $0.01) = 99 × $0.013 = $1.29
TOTAL: $1.33 (67% savings)
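Plugging this scenario into the caching_savings helper from earlier reproduces the same totals:
print(caching_savings(prefix_tokens=3000, fresh_tokens=1000, n_requests=100))
# (4.0, 1.327, 0.67)  -> $4.00 uncached, ~$1.33 cached, ~67% savings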
The latency improvement is just as good:
- Without caching: around 2.5s per request, so 250s total
- With caching: around 0.8s per request, so 80s total
Advanced: Multi-Tenant Caching
If you're serving multiple users who share some resources (like a company knowledge base), you can use a hybrid caching strategy:
- Global prefix: Your company knowledge base, cached and shared across all users.
- User prefix: Individual user memory, cached per user.
- Task suffix: The current request, always fresh.
Here's what that looks like in practice:
global_prefix = load_kb() # 2000 tokens, cached globally
user_memory = load_user(user_id) # 500 tokens, cached per user
task = get_user_input() # 300 tokens, fresh
context = [global_prefix, user_memory, task]
response = model.generate(context)
With this setup, roughly 89% of the context (2,500 of 2,800 tokens) is cached, which cuts input costs by around 80% across all users (assuming a 90% cache discount).
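With Anthropic, for example, you can express this layering with multiple cache_control breakpoints in one request, so the shared knowledge base is cached once for everyone while the user memory is cached per user. A sketch using the variables from the snippet above and an Anthropic client:
response = client.messages.create(
    model="claude-3-opus-20240229",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": global_prefix,                   # shared KB, cached for all users
            "cache_control": {"type": "ephemeral"},
        },
        {
            "type": "text",
            "text": user_memory,                     # per-user layer, cached per user
            "cache_control": {"type": "ephemeral"},
        },
    ],
    messages=[{"role": "user", "content": task}],    # always fresh
)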
Provider-Specific Tips
OpenAI
- Caching is automatic but only kicks in at 1024+ tokens.
- Use structured system prompts (see our GPT-5.1 guide).
- You can monitor cache usage via the cached_tokens field in the response's usage details.
Anthropic
- You need to explicitly mark cache boundaries with cache_control.
- Try to batch requests within 5-minute windows to maximize cache hits.
- Check if caching is working by looking at usage.cache_read_input_tokens in the response.
- Need help writing Claude prompts? Try our Claude prompt generator.
Google
- Create your CachedContent objects during setup, not on the fly.
- Use 15 minutes for short bursts, 60 minutes for longer sessions (Pro tier).
- Keep in mind that cache creation has an upfront cost. It pays off after 3-5 uses.
- Need help writing Gemini prompts? Try our Gemini prompt generator.
FAQ
Does caching affect output quality? No. The model processes cached tokens exactly the same as fresh tokens.
Can I cache across different models? No. Caches are model-specific. A GPT-5.1 Turbo cache won't work with GPT-5.1 Pro.
What happens if the cache expires mid-session? Your next request pays full cost to rebuild the cache.
How do I know if caching is working?
Check the API response metadata. Look for cached_tokens (OpenAI), cache_read_input_tokens (Anthropic), or cached_content_token_count in usage_metadata (Google); the snippet below shows where each one lives.
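The attribute paths below reflect the current Python SDKs; the response variable names are placeholders for whatever you call your responses:
# OpenAI Chat Completions: cached prompt tokens appear in the usage details
print(openai_response.usage.prompt_tokens_details.cached_tokens)

# Anthropic: tokens written to and read from the cache on this request
print(anthropic_response.usage.cache_creation_input_tokens,
      anthropic_response.usage.cache_read_input_tokens)

# Google Gemini: tokens served from the cached content
print(gemini_response.usage_metadata.cached_content_token_count)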
Should I cache everything? No. Only cache stable, reusable content that's at least 1024 tokens. Caching small prompts adds more overhead than it saves.
Key Takeaways
- Caching saves 50-90% on costs and 50-80% on latency for repeated prefixes.
- Structure matters. Put stable content first, dynamic content last.
- OpenAI auto-caches. Anthropic and Google require explicit API calls.
- You need 3-5 requests within the cache lifetime to break even.
- Don't bother caching unique prompts, small prompts, or infrequent requests.
Try It Now
- Find your most-used prompt that has a stable prefix of 1024+ tokens.
- Restructure it so the stable content comes first.
- Track your costs before and after over 10 requests.
- If you're using Anthropic or Google, add the cache_control or CachedContent parameters.
You can use Prompt Builder's cost calculator to model your potential savings before deploying.
Up next: Learn how to test and version prompts in CI/CD so your cached prompts don't degrade over time.
Summary
Prompt caching is probably the easiest way to cut your AI costs right now. OpenAI, Anthropic, and Google all have solid implementations that can save you 50-90% and make your app noticeably faster. The key is structuring your input so stable content like system prompts and knowledge bases forms a reusable prefix. Start with batch processing or multi-turn conversations, measure what you're saving, and expand from there.
Still deciding which provider to use? Check out our Claude vs ChatGPT vs Gemini comparison.


