If you are running Claude in production in 2026 and you have not turned on prompt caching, you are almost certainly overpaying — often by an order of magnitude. Anthropic's caching feature is the single biggest lever you have on the Claude API bill, and unlike most "optimization tricks" it is transparent, deterministic, and ships with the official SDK. This post walks through what prompt caching actually is, the pricing math that makes it work, the patterns where it pays off, and the pitfalls that bite teams the first month after rollout.
What prompt caching actually is
Prompt caching lets you mark prefix segments of a request as cacheable. On the first call, Anthropic stores the model's internal representation of that prefix on their infrastructure. On any subsequent call within a short TTL (5 minutes by default, refreshed on every hit), the same prefix is served from the cache instead of being re-encoded from scratch.
You opt in by adding a cache_control breakpoint to a content block. Every
content block at or before that breakpoint becomes part of the cache key.
You can place up to four breakpoints per request, which gives you
hierarchical caching — for example, a stable system prompt, then stable
tool definitions, then a moving conversation history.
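As a concrete sketch of hierarchical breakpoints against Anthropic's native Python SDK (the tool schema, prompt file, and model id here are illustrative, not prescriptive):

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

STABLE_SYSTEM_PROMPT = open("system_prompt.txt").read()

response = client.messages.create(
    model="claude-sonnet-4-5",  # substitute whichever Claude model you deploy
    max_tokens=1024,
    tools=[
        {
            "name": "search_docs",
            "description": "Search the internal documentation corpus.",
            "input_schema": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
            # Breakpoint 1: marking the last tool caches every tool
            # definition, since tools sit first in the prompt order.
            "cache_control": {"type": "ephemeral"},
        },
    ],
    system=[
        {
            "type": "text",
            "text": STABLE_SYSTEM_PROMPT,
            # Breakpoint 2: extends the cached prefix to tools + system.
            "cache_control": {"type": "ephemeral"},
        },
    ],
    messages=[{"role": "user", "content": "Where do I rotate an API key?"}],
)

Because the prompt is assembled tools-first, the system-prompt breakpoint covers the tool definitions as well.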
The cache is scoped per-organization and per-exact-prefix. Change a single token in the cached prefix and the entry is invalidated, which is the behaviour you want — you never get stale context — but it is also the single most common reason teams see lower hit rates than expected.
The pricing math
Caching is not free. Anthropic charges three rates against the same input token:
- Standard input — 1.0× the model's listed input price.
- Cache write — 1.25× input price (you pay a 25% premium to populate the cache).
- Cache read — 0.1× input price (you pay 10% of the listed rate when the cache hits).
Run the break-even on any cached prefix: one write at 1.25× plus one read at 0.1× costs 1.35×, versus 2.0× for two cold calls, so caching pays for itself on the very first reuse. From the second hit onward, every call on that prefix costs roughly a tenth of the non-cached baseline.
For a concrete example, take a 20,000-token system prompt on Sonnet 4.6 at roughly $3 per 1M input tokens:
- Cold call: 20,000 × $3 / 1M = $0.060 per call.
- Cache write: 20,000 × $3.75 / 1M = $0.075 (one-time per TTL window).
- Cache read: 20,000 × $0.30 / 1M = $0.006 per call.
Ten calls inside a 5-minute window cost one write plus nine reads: $0.075 + 9 × $0.006 = $0.129, versus 10 × $0.060 = $0.600 uncached. That is a 78% reduction at ten calls, climbing toward the theoretical 90% ceiling as reuse density grows.
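The same arithmetic as a throwaway script, if you want to plug in your own prefix size and model price (the defaults are the illustrative Sonnet numbers above):

def cached_cost(prefix_tokens: int, calls: int, price_per_mtok: float = 3.00) -> float:
    """Cost of a cached prefix across `calls` requests: one 1.25x write,
    then 0.1x reads, assuming every call lands inside the TTL window."""
    write = prefix_tokens * price_per_mtok * 1.25 / 1_000_000
    reads = prefix_tokens * price_per_mtok * 0.10 / 1_000_000 * (calls - 1)
    return write + reads

def cold_cost(prefix_tokens: int, calls: int, price_per_mtok: float = 3.00) -> float:
    """Cost of the same prefix with no caching."""
    return prefix_tokens * price_per_mtok / 1_000_000 * calls

print(cached_cost(20_000, 10))  # ~0.129
print(cold_cost(20_000, 10))    # ~0.600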
Where caching actually pays off
In production we see four patterns that consistently clear the break-even bar:
- Long system prompts. Anything north of a few thousand tokens — detailed personas, formatting rules, brand voice, style guides — should be cached. The system prompt is, by definition, identical across every request.
- RAG with a stable knowledge base. If you stuff a fixed document corpus into the prompt (product docs, policies, schema), cache it. Variable user queries go after the breakpoint.
- Agent tool definitions. Function/tool JSON schemas can balloon past 10K tokens for a non-trivial agent. They almost never change between calls. Cache them.
- Multi-turn assistants with long history. Cache up through the last fully-stable turn. Each new user message extends the prefix; the prior history is reused (see the sketch after this list).
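For the multi-turn case, one workable pattern is to move a single breakpoint forward each turn so the growing history keeps hitting the cache. A minimal sketch, assuming message content is stored as lists of blocks (the helper name is mine):

def add_turn(history: list[dict], user_text: str) -> list[dict]:
    """Append a user turn and move the cache breakpoint onto it, so the
    next call reuses everything that came before."""
    # Drop the previous breakpoint to stay under the four-breakpoint limit.
    for msg in history:
        for block in msg["content"]:
            block.pop("cache_control", None)
    history.append({
        "role": "user",
        "content": [{
            "type": "text",
            "text": user_text,
            "cache_control": {"type": "ephemeral"},
        }],
    })
    return history

Each call then pays a cheap read for the prior conversation and a small write for the newest turn.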
Code: Python and TypeScript via Claudexia
Claudexia exposes Claude through an OpenAI-compatible endpoint, and prompt
caching passes through transparently — there is no extra config, no
proprietary header, and no rewrite. You set cache_control exactly as you
would against Anthropic's native API.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_CLAUDEXIA_KEY",
    base_url="https://api.claudexia.tech/v1",
)

SYSTEM_PROMPT = open("system_prompt.txt").read()  # ~20K tokens

response = client.chat.completions.create(
    model="claude-sonnet-4.6",
    messages=[
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": SYSTEM_PROMPT,
                    "cache_control": {"type": "ephemeral"},
                }
            ],
        },
        {"role": "user", "content": "Summarize the latest sales report."},
    ],
)

usage = response.usage
print(usage.cache_creation_input_tokens, usage.cache_read_input_tokens)
The same call in TypeScript:

import fs from "node:fs";
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.CLAUDEXIA_KEY,
  baseURL: "https://api.claudexia.tech/v1",
});

const SYSTEM_PROMPT = fs.readFileSync("system_prompt.txt", "utf8"); // ~20K tokens

const response = await client.chat.completions.create({
  model: "claude-sonnet-4.6",
  messages: [
    {
      role: "system",
      content: [
        {
          type: "text",
          text: SYSTEM_PROMPT,
          cache_control: { type: "ephemeral" },
        },
      ],
    },
    { role: "user", content: "Summarize the latest sales report." },
  ],
});

console.log(response.usage);
The usage object reports cache_creation_input_tokens and
cache_read_input_tokens separately from regular input tokens, which is
how you verify caching is actually firing — log these in production from
day one.
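A minimal version of that logging; defining hit rate as reads over all cache-touched tokens is one reasonable choice, not a standard:

import logging

logger = logging.getLogger("claude.cache")

def log_cache_usage(usage) -> None:
    """Emit per-call cache traffic so hit-rate regressions show up early."""
    written = getattr(usage, "cache_creation_input_tokens", 0) or 0
    read = getattr(usage, "cache_read_input_tokens", 0) or 0
    touched = written + read
    hit_rate = 100.0 * read / touched if touched else 0.0
    logger.info("cache_write_tokens=%d cache_read_tokens=%d hit_rate=%.0f%%",
                written, read, hit_rate)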
Real production numbers
A customer running an internal support agent on Claudexia's gateway moved
from a 22,000-token system prompt with no caching to the same prompt
behind a single cache_control breakpoint. Pre-caching, the system
segment alone cost roughly $1,980/month at ~30 calls/hour. Post-caching,
with the prefix hitting on average 14 times per 5-minute window, the same
line item dropped to $215/month — an 89% reduction, with
zero changes to model output or quality. Tool definitions cached behind a
second breakpoint shaved another ~$140/month. End-to-end the agent's
Claude bill fell from roughly $2,400/month to under $400.
That ratio is not unusual. Any workload with a stable prefix and at least a handful of calls per 5-minute window will see 70–90% input savings once caching is wired up correctly.
Pitfalls
A few things bite people on the first rollout:
- The 5-minute TTL is a sliding window, not absolute. Every hit refreshes it, but if traffic dies for 5 full minutes, the next call pays the 1.25× write again. For low-frequency workloads this can silently erase the savings — measure your inter-arrival distribution (a quick way to do that follows this list).
- Four breakpoints is the hard limit. Use them hierarchically: system → tools → static context → recent history. Don't waste breakpoints on sub-segments that always change together.
- Cache invalidation on any prefix change. A single edit to the system prompt — even whitespace — invalidates the entry. Treat your cached prefixes like immutable build artifacts and version them explicitly.
- Per-block minimum sizes. Very small cached blocks (roughly 1K tokens on most models, larger on smaller ones) won't be cached at all; the request is processed normally but the cache entry is silently skipped. Don't try to cache trivial content.
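For the TTL pitfall in particular, a quick way to estimate how often you will re-pay the write, given request timestamps (in seconds) pulled from your logs:

def cold_write_fraction(arrivals: list[float], ttl_seconds: float = 300.0) -> float:
    """Fraction of calls arriving after the sliding TTL has lapsed,
    each of which pays the 1.25x write again (the first call always does)."""
    if not arrivals:
        return 0.0
    gaps = (b - a for a, b in zip(arrivals, arrivals[1:]))
    cold = 1 + sum(1 for g in gaps if g > ttl_seconds)
    return cold / len(arrivals)

If this comes back close to 1.0, your traffic is too sparse for the default TTL and caching may cost more than it saves.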
When not to cache
Caching is a poor fit for a few cases:
- Single-shot calls. If the prompt will never be reused within 5 minutes, you are paying the 1.25× write premium for nothing.
- Short prompts. Below the per-block minimum, caching is a no-op.
- Highly variable system text. If you template per-user data into what looks like a system prompt, every call is a cold write. Move the variable bits below the breakpoint, or cache the truly stable shell and accept that the personalized layer pays full price (sketched below).
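A sketch of that split, with the stable shell cached and the personalized layer placed after the breakpoint (the shell file and per-user fields are illustrative):

STABLE_SHELL = open("system_prompt.txt").read()  # identical for every user
user_name, plan = "Ada", "enterprise"            # per-request values

system_blocks = [
    {
        "type": "text",
        "text": STABLE_SHELL,
        # Breakpoint: everything up to here is cached once and reused.
        "cache_control": {"type": "ephemeral"},
    },
    {
        # After the breakpoint: pays the full input rate every call,
        # but no longer invalidates the cached shell.
        "type": "text",
        "text": f"The user's name is {user_name}. Their plan: {plan}.",
    },
]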
Claudexia and caching
Claudexia routes prompt caching through to Anthropic transparently — same
breakpoints, same TTL, same usage accounting in the response. There is
no proprietary "caching mode" to enable, and our pricing matches the
Anthropic write/read multipliers exactly. If you are already caching
against Anthropic directly, switching the base_url to
https://api.claudexia.tech/v1 keeps every saving you have today.
If you have not enabled caching yet, do it this week. It is the highest-ROI change you will make to a Claude workload all year.