Prompt Caching Guide to Cut AI Costs

Technical guide · April 2026 pricing

Prompt Caching Guide: cut AI costs up to 95% (2026).

The complete technical guide to prompt caching across Anthropic, OpenAI, and Gemini. Real pricing, working code, interactive savings calculator, and honest analysis of when caching actually pays off.

One engineer cut his Anthropic bill from $720 to $72 per month with a single line of code. That is not a marketing headline. That is the actual ROI when prompt caching fits your workload.

Here is what caching actually is: a server-side optimization that stores the processed tokens of your repeated prompt prefixes, so the model does not recompute them on every call. When the same system prompt, tool schema, or RAG document shows up in a second request, the provider serves those tokens from cache at a steep discount. Anthropic gives 90% off. OpenAI gives 50% off automatically. Google Gemini gives 75-90% off depending on model generation. Stack caching with batch processing and you can hit 95% total input cost reduction.

This guide does what most caching articles do not: real per-provider implementation code, an interactive calculator for your specific workload, and honest boundary conditions for when caching actually pays off versus when the overhead kills the savings.

Infographic 01: How it works. The first call writes the cache. Every subsequent call hits it for 10-50% of the price.

Call 1 (cache write): the cacheable prefix (5K-token system prompt, tool schemas, RAG docs) is written to cache at 1.25x, and the dynamic suffix (100-token user message) is billed as regular input at 1.00x. The first call carries a small premium: $0.019.
Calls 2+ (cache hit): the same prefix is read from cache at 0.10x, and the new user message is billed as regular input at 1.00x. Cache hit, 90% off: $0.0018.
Break-even at call 2. Every call after is pure savings. Example uses Anthropic Claude Sonnet 4.6 pricing, 5-min cache TTL.

Section 01: How It Works. The mechanics in plain English.

The core mechanism.

When you send a prompt to a language model, the provider has to tokenize your input, run it through the model's attention layers, and compute intermediate states (called KV cache states). This work happens for every request, even if you just sent the same 10,000-token system prompt two seconds ago. Prompt caching eliminates that waste. The provider stores the computed states of your prefix server-side. When another request comes in with the same prefix, the provider loads the cached state and only processes the new tokens.

The economics are simple. Processing 10,000 tokens costs compute. Serving them from a warm cache costs effectively nothing. Providers pass most of that savings back to you. Anthropic charges 10% of the standard input rate for cache reads. OpenAI charges 50%. Gemini ranges from 10-25% depending on model generation.

There is a small first-call tax. On Anthropic, the initial cache write costs 25% more than standard input (1.25x multiplier on Sonnet and Opus). On OpenAI, cache writes are free. On Gemini explicit caching, writes are free but you pay storage fees per hour. Break-even for Anthropic is the second call. After that, every subsequent call is pure savings.

What should go in the cache.

Caching works on the prefix of your prompt, meaning anything that appears at the start of your message and stays identical across requests. The order of content matters enormously. Put static content first. Put dynamic content last.

Things worth caching include system prompts long enough to clear the provider minimum (1,024 tokens on most models), tool and function schema definitions, RAG document context that multiple queries reference, few-shot example blocks, long persona or brand guideline documents, and full codebases when using coding assistants. The common pattern: anything substantial that does not change based on the current user turn.

Things not worth caching include user messages (they change every call), short system prompts under 1,024 tokens (below the minimum threshold), and content that varies even slightly per request. Even a single different token in the prefix breaks the cache match.
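
To make the ordering rule concrete, here is a minimal sketch that measures how much of two prompts forms a shared prefix. It is an illustration only: whitespace-split words stand in for real tokens (providers use their own tokenizers), and shared_prefix_len is a hypothetical helper, not part of any provider SDK.

```python
def shared_prefix_len(a: list[str], b: list[str]) -> int:
    """Count the leading tokens that are identical in both sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


static = "You are a sales coach . Follow the brand guidelines .".split()

# Dynamic content last: the whole static block is a shared prefix.
good_1 = static + ["time=09:00", "Draft", "an", "email"]
good_2 = static + ["time=09:05", "Draft", "a", "reply"]

# Dynamic content first: the very first token differs, so nothing matches.
bad_1 = ["time=09:00"] + static + ["Draft", "an", "email"]
bad_2 = ["time=09:05"] + static + ["Draft", "a", "reply"]

print(shared_prefix_len(good_1, good_2))  # equals len(static)
print(shared_prefix_len(bad_1, bad_2))    # 0
```

The two "bad" prompts contain exactly the same static text, yet share no prefix at all, which is why a timestamp at the top of a system prompt can wipe out every cache hit.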

Infographic 02: Provider comparison. All three providers, side by side. How caching differs across Anthropic, OpenAI, and Google Gemini.

Cache read discount: Anthropic 90% off. OpenAI 50% off. Google Gemini 75-90% off.
Minimum tokens: Anthropic 1,024 (2,048 on Haiku). OpenAI 1,024 (128-token increments). Google Gemini 32,768 (explicit mode).
Cache duration: Anthropic 5 min or 1 hour (explicit). OpenAI 5-10 min (automatic). Google Gemini custom (paid storage).
Implementation: Anthropic explicit (cache_control). OpenAI automatic (zero code). Google Gemini implicit + explicit.
Storage cost: Anthropic free. OpenAI free. Google Gemini $1/M/hr (explicit).
Best for: Anthropic, max savings on small-to-mid context and precise cache control. OpenAI, zero-effort setup, chatbots, production code simplicity. Google Gemini, very large docs, videos, multimodal content.

Section 02: Implementation. The actual code per provider.

Anthropic (explicit, deepest discount).

Anthropic caching is explicit but generous. You add a cache_control marker to each block you want cached, with up to four breakpoints per request. Each marker caches the entire prompt prefix up to and including the marked block (tools first, then system, then messages).

ANTHROPIC / PYTHON
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are an expert sales coach. Use the guidelines below."
        },
        {
            "type": "text",
            "text": long_brand_guidelines,  # 8,000 tokens that never change
            "cache_control": {"type": "ephemeral"}  # ← caches everything above
        }
    ],
    messages=[
        {"role": "user", "content": "Draft a cold email to a fintech CTO."}
    ]
)

# Response usage tells you what hit the cache
print(response.usage.cache_read_input_tokens)   # 8000 (on second call)
print(response.usage.cache_creation_input_tokens)  # 8000 (on first call only)

Anthropic caches default to a 5-minute TTL that extends every time the cache is accessed. For workloads where requests are spaced more than 5 minutes apart, switch to the 1-hour TTL by adding "cache_control": {"type": "ephemeral", "ttl": "1h"}. The 1-hour cache costs 2x base input instead of 1.25x on write, but hits far more often for moderate-frequency workloads.
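
The TTL choice described above can be sketched as a small helper. This is a hedged illustration, not Anthropic's API: pick_cache_control and its hard cutoffs are assumptions made for the example, and a real workload with irregular gaps would want some safety margin below each limit.

```python
from typing import Optional

FIVE_MINUTES = 5 * 60
ONE_HOUR = 60 * 60


def pick_cache_control(seconds_between_requests: float) -> Optional[dict]:
    """Return a cache_control value for the expected gap between requests."""
    if seconds_between_requests < FIVE_MINUTES:
        return {"type": "ephemeral"}  # default 5-minute TTL, 1.25x write
    if seconds_between_requests < ONE_HOUR:
        return {"type": "ephemeral", "ttl": "1h"}  # 1-hour TTL, 2x write
    return None  # the cache would expire between calls, so skip the marker


print(pick_cache_control(30))           # {'type': 'ephemeral'}
print(pick_cache_control(20 * 60))      # {'type': 'ephemeral', 'ttl': '1h'}
print(pick_cache_control(2 * 60 * 60))  # None
```

When the helper returns a dict, attach it as the "cache_control" value on the last static block. When it returns None, leave the marker off and pay standard input rates.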

A critical debugging detail: cache is scoped to your API key and region. Requests routed through different regions (for example us-east-1 vs eu-west-1 on Bedrock) maintain separate caches. If you are using a multi-region deployment and cache hits look low, check your routing.

OpenAI (automatic, zero code).

OpenAI caching is the simplest of the three. There is no flag, no config, no API changes. Once your prompt exceeds 1,024 tokens, caching activates automatically. Cache hits happen in 128-token increments (so 1,024, 1,152, 1,280, and so on). Discount is 50% on cached tokens.

OPENAI / PYTHON
from openai import OpenAI

client = OpenAI()

# No cache markers needed. Just send a prompt over 1,024 tokens.
response = client.chat.completions.create(
    model="gpt-5.4",
    messages=[
        {"role": "system", "content": long_system_prompt},  # 5,000 tokens
        {"role": "user", "content": "Analyze this deal..."}
    ]
)

# Response shows cache hits in the usage object
print(response.usage.prompt_tokens_details.cached_tokens)  # 4992 on hit

OpenAI cache TTL is 5-10 minutes and extends during off-peak hours up to around one hour. You cannot control TTL directly. The tradeoff is simplicity: zero code changes versus less control.

One sharp edge: OpenAI cache is automatic but not guaranteed. If the system is under load, caching can be skipped silently. Monitor cached_tokens in your response usage to confirm hits. Target 80%+ hit rate on production workloads. If you see lower rates, your prefix may be changing between requests in ways you did not intend (timestamps, dynamic IDs, user-specific context at the top of the prompt).
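
The 128-token increment rule explains the 4992 in the sample above. A minimal sketch of that arithmetic follows, assuming the 1,024-token minimum and 128-token step quoted in this guide. expected_cached_tokens is a hypothetical helper for sanity-checking the cached_tokens value, not an OpenAI function.

```python
def expected_cached_tokens(prefix_tokens: int,
                           minimum: int = 1024,
                           increment: int = 128) -> int:
    """Largest cacheable length not exceeding the stable prefix."""
    if prefix_tokens < minimum:
        return 0  # below the minimum, nothing is cached
    steps = (prefix_tokens - minimum) // increment
    return minimum + steps * increment


print(expected_cached_tokens(5000))  # 4992
print(expected_cached_tokens(1023))  # 0
print(expected_cached_tokens(1200))  # 1152
```

If the reported cached_tokens is consistently far below this figure, the stable part of the prompt is probably shorter than you think.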

Google Gemini (hybrid, longest TTL).

Gemini supports both implicit caching (automatic, on Gemini 2.5 models) and explicit caching (manual, guaranteed discount). The explicit mode requires creating a cache object via the API, referencing it in subsequent requests, and deleting it when done. Storage is billed per hour, not free like Anthropic and OpenAI.

GEMINI / PYTHON (EXPLICIT CACHE)
from google import genai
from google.genai.types import CreateCachedContentConfig

client = genai.Client()

# Step 1: create the cache (minimum 32,768 tokens)
cache = client.caches.create(
    model="gemini-2.5-pro",
    config=CreateCachedContentConfig(
        contents=large_document,  # 50K+ tokens
        ttl="3600s"  # 1 hour
    )
)

# Step 2: reference the cache in requests
response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents="Summarize the key findings.",
    config={"cached_content": cache.name}
)

# Step 3: delete when done to stop storage billing
client.caches.delete(name=cache.name)

The Gemini storage fee is $1 per million tokens per hour on 2.5 Pro ($4.50/M/hour on 1.5). For a 100K token cache held for 1 hour, that is $0.10. Trivial if you make 100 requests against it. Wasteful if you make 2.

Gemini 2.5 implicit caching activates automatically when your prompt matches a previously-seen prefix, with no API changes required. On 2.5 models the discount is the same 90% as explicit caching, but hits are not guaranteed and you do not control the lifetime. For unpredictable workloads, implicit is safer. For predictable high-volume workloads, explicit gives you control over TTL and confirmed hit rates.

Section 03: When It Pays Off. Honest boundary conditions.

Infographic 03: When caching actually saves you money. Three questions. Honest answers.

Prefix over 1,024 tokens (system prompt + tools + RAG docs)? If no, skip caching: below minimum.
Same prefix reused (2+ calls within the TTL window)? If no, skip caching: write cost wasted.
Calls within TTL (5 min Anthropic, 5-10 min OpenAI)? If no, use the 1-hour TTL: costs more, hits more.
If yes to all three, cache it. Expected savings: 40-85%. Stack with batch for up to 95%.

Best fit use cases:
Chatbots. Shared system prompt across every user conversation. 95%+ hit rate typical.
RAG / document Q&A. Same document, multiple follow-up questions. 60-80% hit rate.
Batch pipelines. Same prompt template, many inputs in sequence. 99%+ hit rate.

The break-even math.

On Anthropic, the first-call cache write costs 1.25x base input and cache reads cost 0.10x. Break-even is the second call: two calls cost 1.35x the prefix instead of 2x. If your prefix gets reused at least once within the TTL window, you are net positive. The more reuse, the better the economics. With the write premium included and every later call hitting the cache, 3 calls save roughly 52% of the prefix cost, 10 calls roughly 78%, and 100 calls roughly 89%.
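
The break-even arithmetic is easy to check yourself. The sketch below assumes one cache write at 1.25x followed by reads at 0.10x, with every call landing inside the TTL. prefix_savings is a hypothetical helper, and it covers the cached prefix only, not the dynamic suffix or output tokens.

```python
def prefix_savings(calls: int,
                   write_multiplier: float = 1.25,
                   read_multiplier: float = 0.10) -> float:
    """Fraction of prefix cost saved over `calls` requests within the TTL."""
    if calls < 1:
        raise ValueError("calls must be at least 1")
    cached = write_multiplier + read_multiplier * (calls - 1)
    uncached = float(calls)  # every call pays 1.00x without caching
    return 1 - cached / uncached


for n in (1, 2, 3, 10, 100):
    print(n, f"{prefix_savings(n):.1%}")
```

Under these assumptions a single call loses 25%, two calls already save about a third, and the savings approach but never reach 90% as the write premium is amortized.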

On OpenAI, there is no write premium. Break-even is call one. Every cache hit immediately saves 50%. The downside is less control: 5-10 minute TTL, no way to extend intentionally, no guarantee the system will cache during peak load.

On Gemini explicit caching, break-even depends on amortizing the storage cost. A 100K-token cache at $1/M/hr storage, held for 1 hour, costs $0.10. If each cached request saves $0.20 of base input cost, a single hit covers the storage, so break-even is call 2. But if you only make one request per hour, you are paying storage for no benefit. Implicit caching avoids this problem entirely at the cost of less control.
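
A rough way to reason about Gemini storage amortization is to compute the storage bill and divide it by the saving per hit. The sketch below uses the $1/M/hr rate and the $0.20 per-request saving quoted above. Both function names are hypothetical, and the per-hit saving is an input you would need to derive from your own model pricing.

```python
import math


def storage_cost(cache_tokens: int, hours: float,
                 usd_per_million_token_hour: float = 1.00) -> float:
    """Storage bill in USD for holding an explicit cache."""
    return cache_tokens / 1_000_000 * usd_per_million_token_hour * hours


def hits_to_break_even(cache_tokens: int, hours: float,
                       saving_per_hit_usd: float) -> int:
    """Smallest number of cache hits whose savings cover the storage bill."""
    ratio = storage_cost(cache_tokens, hours) / saving_per_hit_usd
    # Round first so floating-point noise cannot push the ceiling up by one.
    return math.ceil(round(ratio, 9))


print(round(storage_cost(100_000, 1.0), 2))    # 0.1
print(hits_to_break_even(100_000, 1.0, 0.20))  # 1
```

Holding the same cache for a full day multiplies the storage bill by 24, so the number of hits needed scales with how long the cache sits idle.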

The workloads where caching fails.

Caching does not save money in several common scenarios that teams routinely get wrong.

Spread-out requests. If your workload makes one request every 30 minutes, the 5-minute Anthropic cache expires between calls. Every request becomes a cache miss with write premium. Switch to 1-hour TTL or accept that caching does not fit.

Output-heavy workloads. Caching only reduces input costs. If you send 1,000 input tokens and generate 10,000 output tokens per call, input is a tiny share of the bill: assuming output tokens cost about 5x input tokens, caching saves under 2% of total cost. Not worth the setup complexity.

Dynamic prefixes. Putting the current timestamp, user ID, or session info at the start of your system prompt breaks cache matching even if the rest of the prompt is identical. Move dynamic content to the user message, keep the system prompt pure.

Sub-minimum prefixes. On Anthropic and OpenAI, prompts under 1,024 tokens cannot be cached. On Gemini explicit caching, under 32,768 tokens. Check your prefix length before investing in caching architecture.
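
For the output-heavy case, a quick upper bound shows how little caching can move the total bill. This is a sketch under stated assumptions: the $3/M input price is the Sonnet figure used elsewhere in this guide, the $15/M output price is an assumed value for illustration, and max_total_saving is a hypothetical helper that treats every input token as a cache hit.

```python
def max_total_saving(input_tokens: int, output_tokens: int,
                     input_usd_per_m: float, output_usd_per_m: float,
                     read_multiplier: float = 0.10) -> float:
    """Upper bound on total bill reduction if all input is served from cache."""
    input_cost = input_tokens * input_usd_per_m / 1_000_000
    output_cost = output_tokens * output_usd_per_m / 1_000_000
    saved = input_cost * (1 - read_multiplier)
    return saved / (input_cost + output_cost)


# Output-heavy: 1,000 tokens in, 10,000 tokens out (output price is assumed).
print(f"{max_total_saving(1_000, 10_000, 3.00, 15.00):.1%}")

# Input-heavy: 10,000 tokens in, 200 tokens out.
print(f"{max_total_saving(10_000, 200, 3.00, 15.00):.1%}")
```

The same discount that barely registers on the first workload removes most of the bill on the second, which is the whole argument for checking the input share before building anything.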

Stacking caching with batch processing.

The biggest savings come from stacking. Anthropic batch and OpenAI batch both offer 50% off standard token prices in exchange for 24-hour async processing. Batch discounts apply on top of cache discounts.

Worked example on Sonnet 4.6 with a 10M input token workload:

Standard pricing: 10M input tokens at $3/M = $30.00.
Batch only (50% off): $15.00.
Cache only (90% hit rate): 9M cached at $0.30/M + 1M standard at $3/M = $2.70 + $3.00 = $5.70.
Batch + Cache stacked: 9M cached at $0.15/M (batch halves cache price too) + 1M batch standard at $1.50/M = $1.35 + $1.50 = $2.85.

Total savings versus standard: 90.5%. For any offline processing pipeline (data enrichment, classification, document summarization, evaluation runs), the batch+cache stack is the single highest-ROI change you can make to your API bill.
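
The worked example above can be reproduced in a few lines. The sketch assumes the same simplifications as the example: $3/M base input, cache reads at 0.10x, batch halving both rates, and no accounting for the one-off cache write premium. input_cost_usd is a hypothetical helper.

```python
def input_cost_usd(total_tokens_m: float, hit_rate: float,
                   base_usd_per_m: float = 3.00,
                   read_multiplier: float = 0.10,
                   batch: bool = False) -> float:
    """Input cost for a workload, ignoring the cache write premium."""
    batch_multiplier = 0.5 if batch else 1.0
    cached = total_tokens_m * hit_rate * base_usd_per_m * read_multiplier
    uncached = total_tokens_m * (1 - hit_rate) * base_usd_per_m
    return (cached + uncached) * batch_multiplier


standard = input_cost_usd(10, hit_rate=0.0)
batch_only = input_cost_usd(10, hit_rate=0.0, batch=True)
cache_only = input_cost_usd(10, hit_rate=0.9)
stacked = input_cost_usd(10, hit_rate=0.9, batch=True)

print(f"{standard:.2f} {batch_only:.2f} {cache_only:.2f} {stacked:.2f}")
print(f"savings: {1 - stacked / standard:.1%}")
```

Raising hit_rate to 1.0 brings the stacked cost down to $1.50, which is where the 95% ceiling comes from. A 90% hit rate gives the 90.5% shown in the example.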

Monitoring cache performance.

Caching is one of those optimizations that looks like it is working until it is not. Hit rates silently degrade for dozens of reasons: a teammate added a timestamp to the system prompt, a model version upgrade invalidated caches, a region change broke key scoping. Monitor every response.

On Anthropic, check response.usage.cache_read_input_tokens and response.usage.cache_creation_input_tokens. Read count should dominate after the first few calls. On OpenAI, check response.usage.prompt_tokens_details.cached_tokens. It should equal your prefix length rounded down to the nearest 128-token increment (4,992 for a 5,000-token prefix).

Target cache hit rates by use case: chatbot with shared system prompt 95%+, RAG with document rotation 60-80%, batch processing with same prompt 99%+. If you are below these thresholds, investigate. The issue is almost always an inadvertently dynamic prefix, not a caching bug.
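
One way to track this is a small accumulator fed from each response's usage fields. The sketch below is an assumption-laden illustration: CacheStats is a hypothetical class, and it defines hit rate as the token-weighted share of input served from cache, which is one reasonable definition among several. The three arguments correspond to Anthropic's cache_read_input_tokens, cache_creation_input_tokens, and input_tokens.

```python
from dataclasses import dataclass


@dataclass
class CacheStats:
    read: int = 0      # tokens served from cache
    written: int = 0   # tokens written to cache
    uncached: int = 0  # regular input tokens

    def add(self, cache_read: int, cache_creation: int,
            input_tokens: int) -> None:
        """Accumulate the usage counters from one response."""
        self.read += cache_read
        self.written += cache_creation
        self.uncached += input_tokens

    def hit_rate(self) -> float:
        """Share of all input tokens that were read from cache."""
        total = self.read + self.written + self.uncached
        return self.read / total if total else 0.0


stats = CacheStats()
stats.add(cache_read=0, cache_creation=8000, input_tokens=100)  # first call
for _ in range(9):  # nine follow-up calls that hit the cache
    stats.add(cache_read=8000, cache_creation=0, input_tokens=100)

print(f"{stats.hit_rate():.1%}")
```

Logging this figure per deployment makes a sudden drop visible the day someone adds a dynamic value to the prefix.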

Questions people ask.

What is prompt caching?

Prompt caching is a feature that stores the processed state of repeated input tokens so the model does not recompute them on subsequent requests. When a new prompt matches a previously cached prefix, the provider serves those tokens from cache at a reduced price: 50% off at OpenAI, 90% off at Anthropic, and 75-90% off at Google Gemini.

How much can prompt caching save?

Real-world savings range from 40-85% of input token costs depending on cache hit rate and proportion of cacheable tokens. Stacked with batch processing (50% off), combined savings reach up to 95%. A team processing 500K documents per month can save $750-$2,250 monthly with caching alone.

Does prompt caching require code changes?

OpenAI caching is fully automatic once prompts exceed 1,024 tokens, no code needed. Anthropic requires explicit cache_control markers on each cacheable block. Google Gemini offers both implicit (automatic, for 2.5 models) and explicit (manual, required for 1.5 models) modes.

What is the minimum token count for caching?

OpenAI requires 1,024 tokens minimum. Anthropic requires 1,024 tokens for Sonnet and Opus models, 2,048 for Haiku. Google Gemini explicit caching requires 32,768 tokens minimum. Implicit caching on Gemini 2.5 activates at lower thresholds automatically.

How long does a cache last?

Anthropic offers 5-minute (default) or 1-hour TTL options. OpenAI caches automatically expire after 5-10 minutes of inactivity, with extended durations during off-peak times. Google Gemini allows custom TTL with per-hour storage fees of $1 per million tokens per hour on explicit caches.

Is prompt caching worth it for small workloads?

Break-even is usually the second call. The first call has a small cache-write premium (25% extra on Anthropic, free on OpenAI). From call 2 onward, every hit saves 50-90%. If your workload reuses the same prefix at least once within the TTL, caching pays off, and the savings grow with every further hit. If each prefix is sent only once, the overhead is not worth it.

Can I use prompt caching with batch processing?

Yes on Anthropic and OpenAI. Batch processing gives 50% off base input costs, which stacks with cache discounts. At a 100% cache hit rate, a 10M input token workload on Sonnet 4.6 costs $30 standard, $15 batch, $3 with caching, or $1.50 with batch plus caching. That is 95% total savings. At a 90% hit rate the same workload costs $5.70 with caching and $2.85 stacked, or 90.5% savings.

When should I NOT use prompt caching?

Skip caching when your prompt prefix changes on every request, when requests are spaced wider than the TTL (so cache always expires), when your workload is output-heavy rather than input-heavy (caching only reduces input costs), or when your prompt is below the minimum token threshold (1,024 for Anthropic and OpenAI, 32,768 for Gemini explicit).

Does cache persist across API keys or regions?

No. Cache is scoped to your API key and the provider's infrastructure region. Requests routed through different regions (AWS Bedrock us-east-1 vs eu-west-1, for example) maintain separate caches. Multi-region deployments need to account for this in hit rate monitoring.
