Prompt Caching & Token Economics in 2025: How to Cut Cost Without Losing Quality
Prompt caching became mainstream in 2025, with OpenAI, Anthropic, and Google all offering variants. If you're not using it, you may be paying up to 10x more than necessary for repeated context. This guide explains how caching works across providers, when it saves you money, and when it doesn't.
What Is Prompt Caching?
Prompt caching means the model provider stores parts of your input (the "prefix") and reuses them across multiple requests without reprocessing.
Benefits:
- Cost: Cached tokens cost 90% less than fresh tokens (OpenAI, Anthropic).
- Latency: Cached prefixes skip reprocessing, reducing time-to-first-token by 50-80%.
How it works:
- You send a request with a long, stable prefix (e.g., system prompt + knowledge base).
- The provider caches that prefix (keyed by hash).
- On subsequent requests with the same prefix, only the tokens after the cached prefix are processed at the full rate (a conceptual sketch follows below).
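To make the mechanism concrete, here is a minimal, provider-agnostic sketch of prefix-keyed caching. It is purely illustrative: real providers cache the model's internal attention (KV) state keyed on the token prefix, not raw text in a Python dict, and the word-count "tokens" here are a stand-in.
import hashlib

# Illustrative in-memory cache keyed by a hash of the stable prefix.
prefix_cache = {}

def billable_tokens(prefix: str, suffix: str) -> int:
    """Return how many 'tokens' (words, as a stand-in) are billed at the full rate."""
    key = hashlib.sha256(prefix.encode()).hexdigest()
    if key not in prefix_cache:
        prefix_cache[key] = True   # first request: prefix is processed and stored
        return len(prefix.split()) + len(suffix.split())
    return len(suffix.split())     # later requests: only the suffix is new

# Same prefix twice: the second call only pays full price for the suffix.
print(billable_tokens("You are an analyst. " * 400, "Analyze document 1"))
print(billable_tokens("You are an analyst. " * 400, "Analyze document 2"))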
Provider Comparison: OpenAI vs. Anthropic vs. Google
| Feature | OpenAI (GPT-5.1) | Anthropic (Claude 3) | Google (Gemini 3) |
|---|---|---|---|
| Automatic caching | Yes (default) | Opt-in via API flag | Opt-in via API flag |
| Cache lifetime | 5-10 minutes | 5 minutes | 15 minutes |
| Min prefix size | 1024 tokens | 1024 tokens | 2048 tokens |
| Cost reduction | 90% on cached input | 90% on cached input | 75% on cached input |
| Latency improvement | 50-80% TTFT reduction | 60-85% TTFT reduction | 40-70% TTFT reduction |
| Extended caching | Yes (GPT-5.1 Pro) | No | Yes (Gemini 3 Pro) |
Key insight: OpenAI's automatic caching is the easiest to use. Anthropic requires explicit cache_control markers; Google requires a pre-created cachedContent object.
OpenAI Prompt Caching (GPT-5.1)
How It Works
OpenAI automatically caches stable prefixes (system prompt + user message prefix) if:
- The prefix is ≥1024 tokens
- The same prefix is used within 5-10 minutes
No API changes required. Just structure your prompt with a reusable prefix.
Example: Before Caching
from openai import OpenAI

client = OpenAI()

# Request 1
response = client.chat.completions.create(
    model="gpt-5.1-turbo",
    messages=[
        {"role": "system", "content": long_system_prompt},  # 2000 tokens
        {"role": "user", "content": "Analyze document 1"}    # 500 tokens
    ]
)
# Cost: 2500 input tokens @ $0.01/1k = $0.025

# Request 2 (new document)
response = client.chat.completions.create(
    model="gpt-5.1-turbo",
    messages=[
        {"role": "system", "content": long_system_prompt},  # 2000 tokens (reprocessed!)
        {"role": "user", "content": "Analyze document 2"}    # 500 tokens
    ]
)
# Cost: 2500 input tokens @ $0.01/1k = $0.025
# TOTAL: $0.050
After Caching (Automatic)
# Request 1: Same as above ($0.025)
# Request 2 (within 10 min)
# OpenAI detects prefix match and caches system prompt
# Cost: 2000 cached tokens @ $0.001/1k + 500 fresh tokens @ $0.01/1k
# = $0.002 + $0.005 = $0.007
# TOTAL: $0.032 (36% savings)
With 10 requests inside the cache window, savings climb to roughly 65% ($0.088 vs. $0.25); the sketch below generalizes this to any request count.
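As a rough sanity check on those numbers, here is a small cost model using this article's illustrative GPT-5.1 Turbo rates ($0.01/1k fresh, $0.001/1k cached). The rates and the assumption that every request lands inside the cache window are assumptions for the sketch, not published pricing.
def caching_savings(prefix_tokens, fresh_tokens, n_requests,
                    fresh_rate=0.01, cached_rate=0.001):
    """Fraction of input cost saved, assuming every request after the first
    hits the cache. Rates are dollars per 1k tokens."""
    without = n_requests * (prefix_tokens + fresh_tokens) * fresh_rate / 1000
    first = (prefix_tokens + fresh_tokens) * fresh_rate / 1000
    rest = (n_requests - 1) * (prefix_tokens * cached_rate + fresh_tokens * fresh_rate) / 1000
    return 1 - (first + rest) / without

print(f"{caching_savings(2000, 500, 2):.0%}")    # ~36%
print(f"{caching_savings(2000, 500, 10):.0%}")   # ~65%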
GPT-5.1 Extended Caching (Pro Tier)
For long-running interactions (e.g., chatbots, code assistants), GPT-5.1 Pro extends cache lifetime:
- Standard: 5-10 minutes
- Extended: 60 minutes
When to use: Multi-turn conversations where the system prompt + memory layer stay constant.
Anthropic Prompt Caching (Claude 3)
How It Works
Anthropic requires explicit cache_control markers in your API request.
Example
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-3-opus-20240229",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": long_system_prompt,              # 2000 tokens
            "cache_control": {"type": "ephemeral"}   # Mark for caching
        }
    ],
    messages=[
        {"role": "user", "content": "Analyze document 1"}
    ]
)
First request:
- Writes 2000 tokens to cache (costs 1.25x normal rate for write overhead).
- Processing: 2000 + 500 = 2500 tokens.
Second request (within 5 min):
- Reads 2000 tokens from cache (costs 0.1x normal rate).
- Processing: 500 fresh tokens.
Cost breakdown:
Request 1: 2000 cache-write tokens @ $0.01875/1k (1.25x) + 500 @ $0.015/1k = $0.0375 + $0.0075 = $0.045
Request 2: 2000 cached @ $0.0015/1k + 500 @ $0.015/1k = $0.003 + $0.0075 = $0.0105
TOTAL: $0.0555 (vs $0.075 without caching ≈ 26% savings; each additional request within the window saves a further ~$0.027)
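To verify that a follow-up request actually hit the cache, resend the same system block and inspect the usage fields Anthropic returns (cache_creation_input_tokens and cache_read_input_tokens, which this article also references in the tips below); exact values will vary.
# Second request within the 5-minute window: identical system block, new document.
response = client.messages.create(
    model="claude-3-opus-20240229",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": long_system_prompt,               # must match the first request exactly
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[{"role": "user", "content": "Analyze document 2"}]
)

usage = response.usage
print("written to cache:", getattr(usage, "cache_creation_input_tokens", 0))
print("read from cache: ", getattr(usage, "cache_read_input_tokens", 0))   # ~2000 on a hit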
Google Context Caching (Gemini 3)
How It Works
Google's cachedContent API pre-uploads stable context and returns a cache ID.
Example
import datetime

import google.generativeai as genai

# Step 1: Create cached content (the stable prefix lives server-side)
cache = genai.caching.CachedContent.create(
    model="gemini-3-pro",
    system_instruction=long_system_prompt,  # 2000 tokens
    ttl=datetime.timedelta(minutes=15)
)

# Step 2: Bind a model to the cache and send only the fresh part
model = genai.GenerativeModel.from_cached_content(cached_content=cache)
response = model.generate_content("Analyze document 1")
Pricing:
- Cache creation: 2000 tokens @ $0.012/1k = $0.024 (one-time)
- Subsequent reads: 2000 tokens @ $0.003/1k = $0.006
- Fresh tokens: 500 @ $0.012/1k = $0.006
Total for 10 requests:
Cache creation: $0.024
10 requests: 10 × (0.006 + 0.006) = $0.120
TOTAL: $0.144 (vs $0.300 without caching = 52% savings)
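Because the cache is created once and then referenced by every call, batch workloads are the natural fit. A minimal sketch under the same setup (the documents list is a placeholder):
# Reuse one server-side cache across a batch of documents.
model = genai.GenerativeModel.from_cached_content(cached_content=cache)

documents = ["<document 1 text>", "<document 2 text>", "<document 3 text>"]  # placeholders
for doc in documents:
    response = model.generate_content(f"Analyze this report: {doc}")
    print(response.text[:200])

# Delete the cache when you're done rather than waiting for the TTL to expire.
cache.delete()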
When Caching Saves You Money
✅ Use Caching If:
- Large, stable prefix: System prompt + knowledge base + few-shot examples (≥1024 tokens).
- Repeated use: Same prefix across ≥5 requests within cache lifetime.
- Batch processing: Analyzing 100 documents with the same instructions.
- Multi-turn conversations: Chatbots where system prompt doesn't change.
❌ Caching Won't Help If:
- Unique prompts: Every request has a different prefix.
- Small prompts: Prefix <1024 tokens (below the provider minimum, so nothing is cached).
- Rare use: Requests are >10 minutes apart (cache expires).
- Dynamic context: System prompt changes on every call (e.g., personalized per user).
Checklist: Structuring Prompts for Caching
1. Put Stable Content First
Place reusable instructions at the start of the context.
Bad (not cacheable):
User message: "Analyze this report: {{document}}"
System prompt: "You are an analyst. Be concise."
Good (cacheable):
System prompt: "You are an analyst. Be concise."
Knowledge base: [1500 tokens of reference material]
User message: "Analyze this report: {{document}}"
2. Separate Variable Content
Keep the dynamic part (user input, current document) after the cached prefix, as in the structure and code sketch below.
Structure:
[CACHED]
- System prompt
- Long-term memory
- Knowledge base
- Few-shot examples
[/CACHED]
[FRESH]
- User's current request
- Document to analyze
[/FRESH]
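A minimal sketch of this assembly, reusing the OpenAI client from the earlier example and placeholder variables (long_system_prompt, knowledge_base, few_shot_examples, user_request, document) defined elsewhere:
# Stable content first (cacheable), dynamic content last (fresh).
stable_prefix = "\n\n".join([long_system_prompt, knowledge_base, few_shot_examples])

messages = [
    {"role": "system", "content": stable_prefix},                  # cached after request 1
    {"role": "user", "content": f"{user_request}\n\n{document}"},  # fresh every time
]

response = client.chat.completions.create(model="gpt-5.1-turbo", messages=messages)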
3. Avoid Micro-Changes in Prefix
Even small changes invalidate the cache.
Bad:
Request 1: "You are a helpful assistant. Today is Nov 29."
Request 2: "You are a helpful assistant. Today is Nov 30."
(Cache miss due to date change)
Good:
System prompt: "You are a helpful assistant."
User message: "Today is Nov 30. Analyze..."
(System prompt cached; date is fresh each time)
4. Use Context Engineering Layers
Organize input into modular blocks (see Context Engineering Guide):
- Layers 1-4 (stable): Cache these
- Layers 5-6 (dynamic): Keep fresh
Cost Modeling: Cached vs. Non-Cached
Scenario: Analyzing 100 Documents
Setup:
- System prompt + knowledge base: 3000 tokens
- Each document: 1000 tokens
- Model: GPT-5.1 Turbo
Without Caching:
100 requests × 4000 tokens @ $0.01/1k = $4.00
With Caching:
Request 1: 4000 tokens @ $0.01/1k = $0.04
Requests 2-100: 99 × (3000 cached @ $0.001/1k + 1000 fresh @ $0.01/1k)
= 99 × ($0.003 + $0.01) = 99 × $0.013 = $1.29
TOTAL: $1.33 (67% savings)
Latency:
- Without caching: ~2.5s/request → 250s total
- With caching: ~0.8s/request → 80s total
Advanced: Multi-Tenant Caching
If you serve multiple users with shared resources (e.g., a knowledge base), use a hybrid caching strategy:
- Global prefix: Company knowledge base (cached, shared).
- User prefix: Individual user memory (cached per user).
- Task suffix: Current request (fresh).
Example (pseudo-code):
global_prefix = load_kb() # 2000 tokens, cached globally
user_memory = load_user(user_id) # 500 tokens, cached per user
task = get_user_input() # 300 tokens, fresh
context = [global_prefix, user_memory, task]
response = model.generate(context)
Result: roughly 89% of the context (2,500 of 2,800 tokens) is cached, reducing input cost by ~75% across all users; a concrete sketch follows.
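One concrete way to realize this split is Anthropic's support for multiple cache_control breakpoints: the shared knowledge base and the per-user memory become separately cached blocks. This sketch reuses the client from the Anthropic example above; the helper functions and token counts are placeholders carried over from the pseudo-code, not a real API surface.
# Hybrid caching sketch: two cache breakpoints, one shared, one per-user.
response = client.messages.create(
    model="claude-3-opus-20240229",
    max_tokens=1024,
    system=[
        {   # Global prefix: identical for every user, so all users share this cache entry.
            "type": "text",
            "text": load_kb(),                        # ~2000 tokens
            "cache_control": {"type": "ephemeral"},
        },
        {   # User prefix: stable per user, cached separately for each user.
            "type": "text",
            "text": load_user(user_id),               # ~500 tokens
            "cache_control": {"type": "ephemeral"},
        },
    ],
    messages=[
        {"role": "user", "content": get_user_input()}  # ~300 tokens, always fresh
    ],
)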
Provider-Specific Tips
OpenAI
- Caching is automatic but only kicks in at 1024+ tokens.
- Use structured system prompts (see GPT-5.1 guide).
- Monitor via the usage field in API responses (prompt_tokens_details.cached_tokens).
Anthropic
- Explicitly mark cache boundaries with cache_control.
- Batch requests within 5-minute windows.
- Check cache hits via usage.cache_read_input_tokens in the response.
Google
- Pre-create CachedContent objects during setup.
- Set ttl to 15 minutes for short bursts; 60 minutes for long sessions (Pro).
- Cache creation has an upfront cost; it pays off after 3-5 uses.
FAQ
Does caching affect output quality? No. The model processes cached tokens identically to fresh tokens.
Can I cache across different models? No. Caches are model-specific (e.g., GPT-5.1 Turbo cache ≠ GPT-5.1 Pro cache).
What happens if the cache expires mid-session? Next request pays full cost to rebuild the cache.
How do I know if caching is working?
Check API response metadata: usage.prompt_tokens_details.cached_tokens (OpenAI), usage.cache_read_input_tokens (Anthropic), or usage_metadata.cached_content_token_count (Google).
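For example, with the OpenAI Python SDK the cached-token count appears under the response's usage object; this sketch assumes the prompt_tokens_details field exposed by current SDK versions and reuses the client and messages from earlier examples.
response = client.chat.completions.create(model="gpt-5.1-turbo", messages=messages)

details = response.usage.prompt_tokens_details
print("prompt tokens:", response.usage.prompt_tokens)
print("served from cache:", details.cached_tokens if details else 0)   # 0 on a cold cache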
Should I cache everything? Only stable, reusable content ≥1024 tokens. Caching micro-prompts adds overhead.
Key Takeaways
- Caching saves 50-90% on costs and 50-80% on latency for repeated prefixes.
- Structure matters: Put stable content first; keep dynamic content last.
- Provider differences: OpenAI auto-caches; Anthropic/Google require explicit API calls.
- Break-even point: 3-5 requests within cache lifetime.
- Don't cache: Unique prompts, small prompts, or infrequent requests.
Try It Now
- Identify your most-used prompt with a stable prefix (≥1024 tokens).
- Restructure to place stable content at the start.
- Track cost before/after over 10 requests.
- If using Anthropic or Google, implement cache_control or CachedContent.
Tool: Use PromptBuilder's cost calculator to model savings before deploying.
Next: Learn how to test and version prompts in CI/CD to ensure cached prompts don't degrade over time.
Summary
Prompt caching is the easiest way to cut AI costs in 2025. OpenAI, Anthropic, and Google all offer robust implementations with 50-90% savings and major latency reductions. The trick is structuring your input so stable content (system prompts, knowledge bases) forms a reusable prefix. Start with batch processing or multi-turn conversations, measure the ROI, and expand from there.


