prompt cachingcost optimizationtutorial

Prompt Caching: Save Up to 90% on LLM API Costs

Everything you need to know about prompt caching across Anthropic, OpenAI, and Google — how it works, when to use it, and how much you save.

TTokenCost Editorial·LLM Cost Research·Updated 2026-04-225 min read

Prompt caching is the single highest-ROI optimization for most LLM workloads. By reusing computed context across requests, you can reduce input token costs by 50–90%. Anthropic, OpenAI, Google, and DeepSeek all offer some form of caching, but they work differently.

This guide covers how caching works on each platform, when to use it, and how to measure your actual savings.

How Prompt Caching Works

LLMs compute a key-value cache (KV cache) as they process your prompt. Normally, this is discarded after each request. With prompt caching, providers store this computed cache and reuse it for subsequent requests that share the same prefix.

The key insight: if your requests share a long common prefix (system prompt + static documents), you only pay for full processing once. Subsequent requests with the same prefix pay a fraction of the normal rate.

Pricing by Provider

Model	Standard /1M	Cached /1M	Savings
GPT-5	$8	$4	50%
GPT-5 Mini	$0.6	$0.3	50%
GPT-4.1	$2	$1	50%
o1	$15	$7.5	50%
GPT-4o	$2.5	$1.25	50%
Claude Fable 5	$10	$1	90%
Claude Opus 4.8	$5	$0.5	90%
Claude Opus 4.7	$5	$0.5	90%
Claude Sonnet 4.6	$3	$0.3	90%
Claude Opus 4.5	$15	$1.5	90%
Claude Haiku 4.5	$1	$0.1	90%
Claude 3.5 Sonnet	$3	$0.3	90%
Claude 3.5 Haiku	$0.8	$0.08	90%
Claude 3 Opus	$15	$1.5	90%
Gemini 3 Ultra	$10	$2.5	75%
Gemini 3 Pro	$3.5	$0.875	75%
Gemini 3 Flash	$0.5	$0.125	75%
Gemini 2.5 Pro	$1.25	$0.31	75%
Gemini 2.5 Flash	$0.3	$0.075	75%
Gemini 2.0 Flash	$0.1	$0.025	75%
Gemini 1.5 Pro	$1.25	$0.31	75%
DeepSeek R2	$0.8	$0.2	75%
DeepSeek R1	$0.55	$0.14	75%
DeepSeek Chat	$0.27	$0.07	74%

Platform-Specific Details

Anthropic (Claude)

Anthropic requires explicit cache control markers in your API request. You must add cache_control: { type: 'ephemeral' } to the content blocks you want cached. The cache TTL is 5 minutes by default (extendable). Minimum cacheable block size: 1,024 tokens.

Anthropic's cached pricing is among the best: Claude Haiku drops to $0.10/1M (90% off), Sonnet to $0.30/1M (90% off), and Opus to $0.50/1M (90% off) — the highest caching discount in the industry.

OpenAI (GPT-4o, GPT-4.1)

OpenAI uses automatic caching — no code changes required. Prompts that are ≥1,024 tokens and repeat frequently are cached automatically. Cached tokens are charged at 50% of the standard rate. The cache key is the exact prefix, so even a single character change invalidates caching for everything after that point.

Put stable content (system prompt, tool definitions, static documents) at the beginning of your prompt. Dynamic content (user messages, timestamps) should come last.

Google (Gemini)

Gemini offers context caching as an explicit API feature. You create a cache object from your content, then reference it by ID in your requests. Cache TTL is configurable (minimum 1 hour). Cached tokens cost approximately 75% less than standard rates.

Gemini's caching is especially valuable for its 1M context window — you can cache an entire book or codebase and query it repeatedly at reduced cost.

When to Use Caching

✅

System prompts > 1,000 tokens

Long instructions, persona definitions, and behavioral guidelines

✅

RAG retrieved documents

Static document chunks that appear in many queries

✅

Tool definitions

Large function schemas repeated on every agent step

✅

Few-shot examples

Multiple examples in the prompt that don't change

❌

Short system prompts (< 500 tokens)

Not enough tokens to justify caching overhead

❌

Highly variable prompts

If the prefix changes every request, cache hit rate approaches zero

Measuring Cache Hit Rate

Most provider APIs return cache hit information in the response usage object. For Anthropic, look at usage.cache_read_input_tokens vs usage.input_tokens. For OpenAI, check usage.prompt_tokens_details.cached_tokens.

A healthy cache hit rate for a RAG pipeline is 60–80%. For a chatbot with a fixed system prompt, you should see 85–95% of system prompt tokens served from cache after the first request.