prompt cachingcost optimizationtutorial

Prompt Caching: Save Up to 90% on LLM API Costs

Everything you need to know about prompt caching across Anthropic, OpenAI, and Google — how it works, when to use it, and how much you save.

TTokenCost Editorial·LLM Cost Research·Updated 2026-04-225 min read

Prompt caching is the single highest-ROI optimization for most LLM workloads. By reusing computed context across requests, you can reduce input token costs by 50–90%. Anthropic, OpenAI, Google, and DeepSeek all offer some form of caching, but they work differently.

This guide covers how caching works on each platform, when to use it, and how to measure your actual savings.

How Prompt Caching Works

LLMs compute a key-value cache (KV cache) as they process your prompt. Normally, this is discarded after each request. With prompt caching, providers store this computed cache and reuse it for subsequent requests that share the same prefix.

The key insight: if your requests share a long common prefix (system prompt + static documents), you only pay for full processing once. Subsequent requests with the same prefix pay a fraction of the normal rate.

Pricing by Provider

ModelStandard /1MCached /1MSavings
GPT-5$8$450%
GPT-5 Mini$0.6$0.350%
GPT-4.1$2$150%
o1$15$7.550%
GPT-4o$2.5$1.2550%
Claude Fable 5$10$190%
Claude Opus 4.8$5$0.590%
Claude Opus 4.7$5$0.590%
Claude Sonnet 4.6$3$0.390%
Claude Opus 4.5$15$1.590%
Claude Haiku 4.5$1$0.190%
Claude 3.5 Sonnet$3$0.390%
Claude 3.5 Haiku$0.8$0.0890%
Claude 3 Opus$15$1.590%
Gemini 3 Ultra$10$2.575%
Gemini 3 Pro$3.5$0.87575%
Gemini 3 Flash$0.5$0.12575%
Gemini 2.5 Pro$1.25$0.3175%
Gemini 2.5 Flash$0.3$0.07575%
Gemini 2.0 Flash$0.1$0.02575%
Gemini 1.5 Pro$1.25$0.3175%
DeepSeek R2$0.8$0.275%
DeepSeek R1$0.55$0.1475%
DeepSeek Chat$0.27$0.0774%

Platform-Specific Details

Anthropic (Claude)

Anthropic requires explicit cache control markers in your API request. You must add cache_control: { type: 'ephemeral' } to the content blocks you want cached. The cache TTL is 5 minutes by default (extendable). Minimum cacheable block size: 1,024 tokens.

Anthropic's cached pricing is among the best: Claude Haiku drops to $0.10/1M (90% off), Sonnet to $0.30/1M (90% off), and Opus to $0.50/1M (90% off) — the highest caching discount in the industry.

OpenAI (GPT-4o, GPT-4.1)

OpenAI uses automatic caching — no code changes required. Prompts that are ≥1,024 tokens and repeat frequently are cached automatically. Cached tokens are charged at 50% of the standard rate. The cache key is the exact prefix, so even a single character change invalidates caching for everything after that point.

Put stable content (system prompt, tool definitions, static documents) at the beginning of your prompt. Dynamic content (user messages, timestamps) should come last.

Google (Gemini)

Gemini offers context caching as an explicit API feature. You create a cache object from your content, then reference it by ID in your requests. Cache TTL is configurable (minimum 1 hour). Cached tokens cost approximately 75% less than standard rates.

Gemini's caching is especially valuable for its 1M context window — you can cache an entire book or codebase and query it repeatedly at reduced cost.

When to Use Caching

System prompts > 1,000 tokens
Long instructions, persona definitions, and behavioral guidelines
RAG retrieved documents
Static document chunks that appear in many queries
Tool definitions
Large function schemas repeated on every agent step
Few-shot examples
Multiple examples in the prompt that don't change
Short system prompts (< 500 tokens)
Not enough tokens to justify caching overhead
Highly variable prompts
If the prefix changes every request, cache hit rate approaches zero

Measuring Cache Hit Rate

Most provider APIs return cache hit information in the response usage object. For Anthropic, look at usage.cache_read_input_tokens vs usage.input_tokens. For OpenAI, check usage.prompt_tokens_details.cached_tokens.

A healthy cache hit rate for a RAG pipeline is 60–80%. For a chatbot with a fixed system prompt, you should see 85–95% of system prompt tokens served from cache after the first request.

Real-World Impact

Consider a RAG pipeline running on Claude Sonnet 4.6 with:

  • 3,500 input tokens per request (500 system prompt + 3,000 retrieved chunks)
  • 500 output tokens
  • 10,000 requests/day
  • 75% of input tokens served from cache
Without caching$3,375/month
With 75% cache hit rate$1,069/month
Savings: $2,306/month (68% reduction)

Use our token cost calculator with the caching toggle to model your own workload, or see the full model comparison →

Related Articles

Cheapest LLM API in 2026: Full Price Comparison
We compared 26 LLM models across 8 providers to find the cheapest API for every use case — from bulk processing to complex reasoning.
8 min read
7 Ways to Reduce Your OpenAI API Cost by 80%
Practical techniques to dramatically cut your OpenAI API bill: prompt caching, model routing, batch API, and token optimization strategies.
6 min read
Llama 4 API Cost Guide: Maverick vs Scout vs Self-Hosting
Meta Llama 4 pricing explained — Maverick vs Scout, hosted API vs self-hosting economics, and when Llama 3.1 8B is still the cheapest capable option.
5 min read