PromptBuilder Team
November 29, 2025
9 min read

Prompt Caching & Token Economics in 2025: How to Cut Cost Without Losing Quality

Prompt caching became mainstream in 2025, with OpenAI, Anthropic, and Google all offering variants. If you're not using it, you're paying 10x more than you need to for repeated context. This guide explains how caching works across providers, when it saves you money, and when it doesn't.


What Is Prompt Caching?

Prompt caching means the model provider stores parts of your input (the "prefix") and reuses them across multiple requests without reprocessing.

Benefits:

  • Cost: Cached tokens cost 90% less than fresh tokens (OpenAI, Anthropic).
  • Latency: Cached prefixes skip reprocessing, reducing time-to-first-token by 50-80%.

How it works:

  1. You send a request with a long, stable prefix (e.g., system prompt + knowledge base).
  2. The provider caches that prefix (keyed by hash).
  3. On subsequent requests with the same prefix, only new tokens are processed.
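
Conceptually, the provider keeps a lookup keyed by a hash of the prefix. Here is a toy sketch of that idea in Python (real systems cache processed model state, not a boolean, but the keying works the same way):

import hashlib

prefix_cache = {}  # toy stand-in for the provider's cache

def cache_key(prefix: str) -> str:
    # The cache is keyed on the exact prefix bytes, so any change is a miss.
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()

def handle_request(prefix: str, suffix: str) -> None:
    key = cache_key(prefix)
    if key in prefix_cache:
        print(f"Cache hit: only the suffix ({len(suffix)} chars) is reprocessed")
    else:
        prefix_cache[key] = True  # provider stores the processed prefix
        print("Cache miss: full prefix processed and stored")

handle_request("SYSTEM PROMPT + KNOWLEDGE BASE", "Analyze document 1")  # miss
handle_request("SYSTEM PROMPT + KNOWLEDGE BASE", "Analyze document 2")  # hit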

Provider Comparison: OpenAI vs. Anthropic vs. Google

Feature             | OpenAI (GPT-5.1)      | Anthropic (Claude 3)  | Google (Gemini 3)
Automatic caching   | Yes (default)         | Opt-in via API flag   | Opt-in via API flag
Cache lifetime      | 5-10 minutes          | 5 minutes             | 15 minutes
Min prefix size     | 1024 tokens           | 1024 tokens           | 2048 tokens
Cost reduction      | 90% on cached input   | 90% on cached input   | 75% on cached input
Latency improvement | 50-80% TTFT reduction | 60-85% TTFT reduction | 40-70% TTFT reduction
Extended caching    | Yes (GPT-5.1 Pro)     | No                    | Yes (Gemini 3 Pro)

Key insight: OpenAI's automatic caching is the easiest to use. Anthropic and Google require explicit cache parameters (cache_control and cachedContent, respectively).


OpenAI Prompt Caching (GPT-5.1)

How It Works

OpenAI automatically caches stable prefixes (system prompt + user message prefix) if:

  • The prefix is ≥1024 tokens
  • The same prefix is used within 5-10 minutes

No API changes required. Just structure your prompt with a reusable prefix.

Example: Before Caching

import openai  # assumes OPENAI_API_KEY is set in the environment

long_system_prompt = "..."  # placeholder for your ~2000-token system prompt + knowledge base

# Request 1
response = openai.chat.completions.create(
    model="gpt-5.1-turbo",
    messages=[
        {"role": "system", "content": long_system_prompt},  # 2000 tokens
        {"role": "user", "content": "Analyze document 1"}    # 500 tokens
    ]
)
# Cost: 2500 input tokens @ $0.01/1k = $0.025

# Request 2 (new document)
response = openai.chat.completions.create(
    model="gpt-5.1-turbo",
    messages=[
        {"role": "system", "content": long_system_prompt},  # 2000 tokens (reprocessed!)
        {"role": "user", "content": "Analyze document 2"}    # 500 tokens
    ]
)
# Cost: 2500 input tokens @ $0.01/1k = $0.025
# TOTAL: $0.050

After Caching (Automatic)

# Request 1: Same as above ($0.025)

# Request 2 (within 10 min)
# OpenAI detects prefix match and caches system prompt
# Cost: 2000 cached tokens @ $0.001/1k + 500 fresh tokens @ $0.01/1k
# = $0.002 + $0.005 = $0.007
# TOTAL: $0.032 (36% savings)

With 10 requests, the savings grow to roughly 65%, since only the first call pays for the full prefix.
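
To confirm requests are actually hitting the cache, inspect the usage details on the response object returned by the second call (field names per OpenAI's prompt-caching docs):

# On request 2, cached prefix tokens show up in the usage details.
details = response.usage.prompt_tokens_details
print(response.usage.prompt_tokens)  # total input tokens (~2500)
print(details.cached_tokens)         # tokens served from cache (~2000 on a hit)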


GPT-5.1 Extended Caching (Pro Tier)

For long-running interactions (e.g., chatbots, code assistants), GPT-5.1 Pro extends cache lifetime:

  • Standard: 5-10 minutes
  • Extended: 60 minutes

When to use: Multi-turn conversations where the system prompt + memory layer stay constant.


Anthropic Prompt Caching (Claude 3)

How It Works

Anthropic requires explicit cache_control markers in your API request.

Example

import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set in the environment

response = client.messages.create(
    model="claude-3-opus-20240229",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": long_system_prompt,  # 2000 tokens
            "cache_control": {"type": "ephemeral"}  # Mark this block for caching
        }
    ],
    messages=[
        {"role": "user", "content": "Analyze document 1"}
    ]
)

First request:

  • Writes 2000 tokens to cache (costs 1.25x normal rate for write overhead).
  • Processing: 2000 + 500 = 2500 tokens.

Second request (within 5 min):

  • Reads 2000 tokens from cache (costs 0.1x normal rate).
  • Processing: 500 fresh tokens.

Cost breakdown:

Request 1: 2000 @ $0.01875/1k (cache write, 1.25x) + 500 @ $0.015/1k = $0.0375 + $0.0075 = $0.045
Request 2: 2000 cached @ $0.0015/1k + 500 @ $0.015/1k = $0.003 + $0.0075 = $0.0105
TOTAL: $0.0555 (vs $0.075 without caching = 26% savings; the margin widens with every additional cached request)
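
To verify the cache reads on Anthropic's side, inspect the usage block returned with the second (cache-hitting) call; field names are from Anthropic's prompt-caching docs:

# On the second call, most of the input should be reported as cache reads.
print(response.usage.input_tokens)                 # fresh tokens processed (~500)
print(response.usage.cache_creation_input_tokens)  # tokens written to cache (0 on a hit)
print(response.usage.cache_read_input_tokens)      # tokens served from cache (~2000)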

Google Context Caching (Gemini 3)

How It Works

Google's cachedContent API pre-uploads stable context and returns a cache ID.

Example

import datetime

import google.generativeai as genai

# Step 1: Create cached content with a TTL
cache = genai.caching.CachedContent.create(
    model="gemini-3-pro",
    system_instruction=long_system_prompt,  # 2000 tokens
    ttl=datetime.timedelta(minutes=15)
)

# Step 2: Build a model from the cached content and use it for requests
model = genai.GenerativeModel.from_cached_content(cached_content=cache)
response = model.generate_content("Analyze document 1")

Pricing:

  • Cache creation: 2000 tokens @ $0.012/1k = $0.024 (one-time)
  • Subsequent reads: 2000 tokens @ $0.003/1k = $0.006
  • Fresh tokens: 500 @ $0.012/1k = $0.006

Total for 10 requests:

Cache creation: $0.024
10 requests: 10 × (0.006 + 0.006) = $0.120
TOTAL: $0.144 (vs $0.300 without caching = 52% savings)
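
Because cache creation is an upfront cost, the break-even point matters. A quick sketch with the illustrative rates above (ignoring Google's per-hour cache storage fee):

def breakeven_requests(prefix_tokens=2000, fresh_tokens=500,
                       rate=0.012, cached_rate=0.003):
    """Smallest request count at which caching beats resending the full prompt."""
    n = 1
    while True:
        uncached = n * (prefix_tokens + fresh_tokens) * rate / 1000
        cached = (prefix_tokens * rate / 1000                 # one-time cache creation
                  + n * (prefix_tokens * cached_rate
                         + fresh_tokens * rate) / 1000)       # per-request reads + fresh tokens
        if cached < uncached:
            return n
        n += 1

print(breakeven_requests())  # 2: with these rates, caching pays off from the second request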

When Caching Saves You Money

✅ Use Caching If:

  1. Large, stable prefix: System prompt + knowledge base + few-shot examples (≥1024 tokens).
  2. Repeated use: Same prefix across ≥5 requests within cache lifetime.
  3. Batch processing: Analyzing 100 documents with the same instructions.
  4. Multi-turn conversations: Chatbots where system prompt doesn't change.

❌ Caching Won't Help If:

  1. Unique prompts: Every request has a different prefix.
  2. Small prompts: Total input <1024 tokens (cache overhead exceeds savings).
  3. Rare use: Requests are >10 minutes apart (cache expires).
  4. Dynamic context: System prompt changes on every call (e.g., personalized per user).
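
If you want to codify these rules into a quick pre-flight check, a minimal sketch (thresholds are the ones used in this guide, not provider guarantees) might look like:

def worth_caching(prefix_tokens: int, reuses_within_ttl: int,
                  min_prefix: int = 1024, min_reuses: int = 5) -> bool:
    """Rough go/no-go check based on the lists above."""
    return prefix_tokens >= min_prefix and reuses_within_ttl >= min_reuses

print(worth_caching(prefix_tokens=3000, reuses_within_ttl=100))  # True: batch job, big stable prefix
print(worth_caching(prefix_tokens=600, reuses_within_ttl=2))     # False: small prefix, rarely reused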

Checklist: Structuring Prompts for Caching

1. Put Stable Content First

Place reusable instructions at the start of the context.

Bad (not cacheable):

User message: "Analyze this report: {{document}}"
System prompt: "You are an analyst. Be concise."

Good (cacheable):

System prompt: "You are an analyst. Be concise."
Knowledge base: [1500 tokens of reference material]
User message: "Analyze this report: {{document}}"

2. Separate Variable Content

Keep the dynamic part (user input, current document) after the cached prefix.

Structure:

[CACHED]
- System prompt
- Long-term memory
- Knowledge base
- Few-shot examples
[/CACHED]

[FRESH]
- User's current request
- Document to analyze
[/FRESH]
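
In chat-message terms, that structure might look like the following minimal sketch (OpenAI-style message format; the constants are placeholders for your own stable content):

SYSTEM_PROMPT = "You are an analyst. Be concise."   # stable
KNOWLEDGE_BASE = "..."                              # stable, e.g. 1500 tokens of reference material
FEW_SHOT_EXAMPLES = "..."                           # stable worked examples

# Stable prefix: byte-identical on every request, so providers can cache it.
STABLE_PREFIX = [
    {"role": "system",
     "content": f"{SYSTEM_PROMPT}\n\n{KNOWLEDGE_BASE}\n\n{FEW_SHOT_EXAMPLES}"},
]

def build_messages(current_request: str, document: str) -> list:
    """Append the fresh, per-request content after the cached prefix."""
    return STABLE_PREFIX + [
        {"role": "user", "content": f"{current_request}\n\n{document}"}
    ]

messages = build_messages("Analyze this report:", "<current document text>")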

3. Avoid Micro-Changes in Prefix

Even small changes invalidate the cache.

Bad:

Request 1: "You are a helpful assistant. Today is Nov 29."
Request 2: "You are a helpful assistant. Today is Nov 30."

(Cache miss due to date change)

Good:

System prompt: "You are a helpful assistant."
User message: "Today is Nov 30. Analyze..."

(System prompt cached; date is fresh each time)


4. Use Context Engineering Layers

Organize input into modular blocks (see Context Engineering Guide):

  • Layer 1-4 (stable): Cache these
  • Layer 5-6 (dynamic): Keep fresh

Cost Modeling: Cached vs. Non-Cached

Scenario: Analyzing 100 Documents

Setup:

  • System prompt + knowledge base: 3000 tokens
  • Each document: 1000 tokens
  • Model: GPT-5.1 Turbo

Without Caching:

100 requests × 4000 tokens @ $0.01/1k = $4.00

With Caching:

Request 1: 4000 tokens @ $0.01/1k = $0.04
Requests 2-100: 99 × (3000 cached @ $0.001/1k + 1000 fresh @ $0.01/1k)
               = 99 × ($0.003 + $0.01) = 99 × $0.013 = $1.29
TOTAL: $1.33 (67% savings)

Latency:

  • Without caching: ~2.5s/request → 250s total
  • With caching: ~0.8s/request → 80s total
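
Here is the cost comparison above as a small script, using the illustrative rates and token counts from this scenario:

def batch_cost(n_docs=100, prefix_tokens=3000, doc_tokens=1000,
               rate=0.01, cached_rate=0.001):
    """Cost (in dollars) of analyzing n_docs documents with and without caching."""
    without = n_docs * (prefix_tokens + doc_tokens) * rate / 1000
    with_cache = ((prefix_tokens + doc_tokens) * rate / 1000          # request 1: cache miss
                  + (n_docs - 1) * (prefix_tokens * cached_rate
                                    + doc_tokens * rate) / 1000)      # requests 2..n: cache hits
    return without, with_cache

without, cached = batch_cost()
print(f"Without caching: ${without:.2f}")                 # $4.00
print(f"With caching:    ${cached:.2f}")                  # ~$1.33
print(f"Savings:         {(1 - cached / without):.0%}")   # ~67%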

Advanced: Multi-Tenant Caching

If you serve multiple users with shared resources (e.g., a knowledge base), use a hybrid caching strategy:

  1. Global prefix: Company knowledge base (cached, shared).
  2. User prefix: Individual user memory (cached per user).
  3. Task suffix: Current request (fresh).

Example (pseudo-code):

global_prefix = load_kb()  # 2000 tokens, cached globally
user_memory = load_user(user_id)  # 500 tokens, cached per user
task = get_user_input()  # 300 tokens, fresh

context = [global_prefix, user_memory, task]
response = model.generate(context)

Result: roughly 89% of the context (2,500 of 2,800 tokens) is cached, reducing cost by ~75% across all users.
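
On Anthropic's API, this hybrid layout maps naturally onto separate cache_control breakpoints for the shared and per-user blocks. A hedged sketch, reusing the hypothetical helpers from the pseudo-code above:

import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-3-opus-20240229",
    max_tokens=1024,
    system=[
        # Global prefix: shared knowledge base, cached once for all users.
        {"type": "text", "text": load_kb(),
         "cache_control": {"type": "ephemeral"}},
        # User prefix: per-user memory, cached separately per user.
        {"type": "text", "text": load_user(user_id),
         "cache_control": {"type": "ephemeral"}},
    ],
    messages=[
        # Task suffix: the current request, always fresh.
        {"role": "user", "content": get_user_input()},
    ],
)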


Provider-Specific Tips

OpenAI

  • Caching is automatic but only kicks in at 1024+ tokens.
  • Use structured system prompts (see GPT-5.1 guide).
  • Monitor cache hits via the response's usage details (usage.prompt_tokens_details.cached_tokens).

Anthropic

  • Explicitly mark cache boundaries with cache_control.
  • Batch requests within 5-minute windows.
  • Check cache hits via usage.cache_read_input_tokens in response.

Google

  • Pre-create CachedContent objects during setup.
  • Set ttl to 15 minutes for short bursts; 60 minutes for long sessions (Pro).
  • Cache creation has upfront cost; pays off after 3-5 uses.

FAQ

Does caching affect output quality? No. The model processes cached tokens identically to fresh tokens.

Can I cache across different models? No. Caches are model-specific (e.g., GPT-5.1 Turbo cache ≠ GPT-5.1 Pro cache).

What happens if the cache expires mid-session? Next request pays full cost to rebuild the cache.

How do I know if caching is working? Check API response metadata: cached_tokens (OpenAI), cache_read_input_tokens (Anthropic), or logs (Google).

Should I cache everything? Only stable, reusable content ≥1024 tokens. Caching micro-prompts adds overhead.


Key Takeaways

  • Caching saves 50-90% on costs and 50-80% on latency for repeated prefixes.
  • Structure matters: Put stable content first; keep dynamic content last.
  • Provider differences: OpenAI auto-caches; Anthropic/Google require explicit API calls.
  • Break-even point: 3-5 requests within cache lifetime.
  • Don't cache: Unique prompts, small prompts, or infrequent requests.

Try It Now

  1. Identify your most-used prompt with a stable prefix (≥1024 tokens).
  2. Restructure to place stable content at the start.
  3. Track cost before/after over 10 requests.
  4. If using Anthropic or Google, implement cache_control or CachedContent.

Tool: Use PromptBuilder's cost calculator to model savings before deploying.

Next: Learn how to test and version prompts in CI/CD to ensure cached prompts don't degrade over time.


Summary

Prompt caching is the easiest way to cut AI costs in 2025. OpenAI, Anthropic, and Google all offer robust implementations with 50-90% savings and major latency reductions. The trick is structuring your input so stable content (system prompts, knowledge bases) forms a reusable prefix. Start with batch processing or multi-turn conversations, measure the ROI, and expand from there.