Prompt Caching: Save Up to 90% on LLM API Costs
Everything you need to know about prompt caching across Anthropic, OpenAI, and Google — how it works, when to use it, and how much you save.
Prompt caching is the single highest-ROI optimization for most LLM workloads. By reusing computed context across requests, you can reduce input token costs by 50–90%. Anthropic, OpenAI, Google, and DeepSeek all offer some form of caching, but they work differently.
This guide covers how caching works on each platform, when to use it, and how to measure your actual savings.
How Prompt Caching Works
LLMs compute a key-value cache (KV cache) as they process your prompt. Normally, this is discarded after each request. With prompt caching, providers store this computed cache and reuse it for subsequent requests that share the same prefix.
The key insight: if your requests share a long common prefix (system prompt + static documents), you only pay for full processing once. Subsequent requests with the same prefix pay a fraction of the normal rate.
Pricing by Provider
Platform-Specific Details
Anthropic (Claude)
Anthropic requires explicit cache control markers in your API request. You must add cache_control: { type: 'ephemeral' } to the content blocks you want cached. The cache TTL is 5 minutes by default (extendable). Minimum cacheable block size: 1,024 tokens.
Anthropic's cached pricing is among the best: Claude Haiku drops to $0.10/1M (90% off), Sonnet to $0.30/1M (90% off), and Opus to $0.50/1M (90% off) — the highest caching discount in the industry.
OpenAI (GPT-4o, GPT-4.1)
OpenAI uses automatic caching — no code changes required. Prompts that are ≥1,024 tokens and repeat frequently are cached automatically. Cached tokens are charged at 50% of the standard rate. The cache key is the exact prefix, so even a single character change invalidates caching for everything after that point.
Put stable content (system prompt, tool definitions, static documents) at the beginning of your prompt. Dynamic content (user messages, timestamps) should come last.
Google (Gemini)
Gemini offers context caching as an explicit API feature. You create a cache object from your content, then reference it by ID in your requests. Cache TTL is configurable (minimum 1 hour). Cached tokens cost approximately 75% less than standard rates.
Gemini's caching is especially valuable for its 1M context window — you can cache an entire book or codebase and query it repeatedly at reduced cost.
When to Use Caching
Measuring Cache Hit Rate
Most provider APIs return cache hit information in the response usage object. For Anthropic, look at usage.cache_read_input_tokens vs usage.input_tokens. For OpenAI, check usage.prompt_tokens_details.cached_tokens.
A healthy cache hit rate for a RAG pipeline is 60–80%. For a chatbot with a fixed system prompt, you should see 85–95% of system prompt tokens served from cache after the first request.
Real-World Impact
Consider a RAG pipeline running on Claude Sonnet 4.6 with:
- 3,500 input tokens per request (500 system prompt + 3,000 retrieved chunks)
- 500 output tokens
- 10,000 requests/day
- 75% of input tokens served from cache
Use our token cost calculator with the caching toggle to model your own workload, or see the full model comparison →