COST OPTIMIZATION

Cut Claude API Costs 90% with Prompt Caching: 2026 Production Guide

Anthropic's prompt caching reduces input cost up to 90% for repeated context. Real numbers, code samples, and patterns that actually save money in production.

If you are running Claude in production in 2026 and you have not turned on prompt caching, you are almost certainly overpaying — often by an order of magnitude. Anthropic's caching feature is the single biggest lever you have on the Claude API bill, and unlike most "optimization tricks" it is transparent, deterministic, and ships with the official SDK. This post walks through what prompt caching actually is, the pricing math that makes it work, the patterns where it pays off, and the pitfalls that bite teams the first month after rollout.

What prompt caching actually is

Prompt caching lets you mark prefix segments of a request as cacheable. On the first call, Anthropic stores the model's internal representation of that prefix on their infrastructure. On any subsequent call within a short TTL (5 minutes by default, refreshed on every hit), the same prefix is served from the cache instead of being re-encoded from scratch.

You opt in by adding a cache_control breakpoint to a content block. Every content block at or before that breakpoint becomes part of the cache key. You can place up to four breakpoints per request, which gives you hierarchical caching — for example, a stable system prompt, then stable tool definitions, then a moving conversation history.
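The hierarchical layout above can be sketched as a small request builder. The payload shape follows Anthropic's Messages API conventions (`system` blocks, `tools`, `cache_control`); treat it as an illustration of breakpoint placement, not a drop-in client:

```python
def build_request(system_prompt: str, tools: list[dict], user_msg: str) -> dict:
    """Two hierarchical breakpoints: one after the system prompt, one after the
    last tool definition. Everything up to each breakpoint is cacheable."""
    tools = [dict(t) for t in tools]
    # Breakpoint on the final tool caches the whole tool-definition block.
    tools[-1]["cache_control"] = {"type": "ephemeral"}
    return {
        "model": "claude-sonnet-4.6",
        "system": [
            {
                "type": "text",
                "text": system_prompt,
                # Breakpoint on the stable system prompt.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "tools": tools,
        # The variable user message sits after every breakpoint, so it never
        # invalidates the cached prefix.
        "messages": [{"role": "user", "content": user_msg}],
    }
```

The key design point: variable content always goes after the last breakpoint, so per-request changes never touch the cache key.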

The cache is scoped per-organization and per-exact-prefix. Change a single token in the cached prefix and the entry is invalidated, which is the behaviour you want — you never get stale context — but it is also the single most common reason teams see lower hit rates than expected.
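Conceptually, the cache key behaves like a digest over the organization plus the exact prefix bytes. The real key is internal to Anthropic's infrastructure; this hash is only an illustration of why a one-character edit misses the cache:

```python
import hashlib

def cache_key(org_id: str, prefix: str) -> str:
    # Illustrative only: caching is scoped per organization and per exact prefix.
    return hashlib.sha256(f"{org_id}:{prefix}".encode()).hexdigest()

k1 = cache_key("org_123", "You are a support agent.")
k2 = cache_key("org_123", "You are a support agent. ")  # one trailing space: cache miss
```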

The pricing math

Caching is not free. Anthropic charges three rates against the same input token:

  • Standard input — 1.0× the model's listed input price.
  • Cache write — 1.25× input price (you pay a 25% premium to populate the cache).
  • Cache read — 0.1× input price (you pay 10% of the listed rate when the cache hits).

Run the break-even on any cached prefix: one write at 1.25× plus one read at 0.1× costs 1.35×, versus 2.0× for two cold calls, so the cache pays for itself on the very first reuse. From then on, every additional call on that prefix costs a tenth of the non-cached baseline.

For a concrete example, take a 20,000-token system prompt on Sonnet 4.6 at roughly $3 per 1M input tokens:

  • Cold call: 20,000 × $3 / 1M = $0.060 per call.
  • Cache write: 20,000 × $3.75 / 1M = $0.075 (one-time per TTL window).
  • Cache read: 20,000 × $0.30 / 1M = $0.006 per call.

Ten calls in a 5-minute window (one cache write, then nine reads) cost $0.075 + 9 × $0.006 = $0.129, versus 10 × $0.060 = $0.600 uncached. That is a 78% reduction at ten calls, climbing toward the theoretical 90% ceiling as reuse density grows.
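The arithmetic generalizes to a small cost model you can run against your own traffic. The 1.25×/0.1× multipliers are Anthropic's published rates; the $3/1M default is the Sonnet-class figure used in this example, so substitute your model's price:

```python
def caching_cost(tokens: int, reads: int, input_price_per_m: float = 3.00) -> dict:
    """Cost of one cache write plus `reads` cached reads, vs. making every
    call cold. Prices are dollars per million input tokens."""
    write = tokens * input_price_per_m * 1.25 / 1_000_000   # 25% premium to populate
    read = tokens * input_price_per_m * 0.10 / 1_000_000    # 10% of list on each hit
    cached = write + reads * read
    uncached = (reads + 1) * tokens * input_price_per_m / 1_000_000
    return {"cached": cached, "uncached": uncached, "savings": 1 - cached / uncached}
```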

Where caching actually pays off

In production we see four patterns that consistently clear the break-even bar:

  1. Long system prompts. Anything north of a few thousand tokens — detailed personas, formatting rules, brand voice, style guides — should be cached. The system prompt is, by definition, identical across every request.
  2. RAG with a stable knowledge base. If you stuff a fixed document corpus into the prompt (product docs, policies, schema), cache it. Variable user queries go after the breakpoint.
  3. Agent tool definitions. Function/tool JSON schemas can balloon past 10K tokens for a non-trivial agent. They almost never change between calls. Cache them.
  4. Multi-turn assistants with long history. Cache up through the last fully-stable turn. Each new user message extends the prefix; the prior history is reused.
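For pattern 4, one common approach (a sketch, not the only way to do it) is to re-mark the final message of the stable history with cache_control on every call, so the cached prefix grows with the conversation:

```python
def mark_history_breakpoint(messages: list[dict]) -> list[dict]:
    """Return a copy of `messages` with a cache breakpoint on the last content
    block, so the entire history so far is served from cache on the next call."""
    marked = [dict(m) for m in messages]
    last = marked[-1]
    content = last["content"]
    if isinstance(content, str):
        # Normalize a plain string into a content-block list.
        content = [{"type": "text", "text": content}]
    else:
        content = [dict(b) for b in content]
    content[-1]["cache_control"] = {"type": "ephemeral"}
    last["content"] = content
    return marked
```

Each new user message extends the prefix past the breakpoint; the next call pays a (cheap) incremental write for the new turn and reads everything before it from cache.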

Code: Python and TypeScript via Claudexia

Claudexia exposes Claude through an OpenAI-compatible endpoint, and prompt caching passes through transparently — there is no extra config, no proprietary header, and no rewrite. You set cache_control exactly as you would against Anthropic's native API.

from openai import OpenAI

client = OpenAI(
    api_key="YOUR_CLAUDEXIA_KEY",
    base_url="https://api.claudexia.tech/v1",
)

SYSTEM_PROMPT = open("system_prompt.txt").read()  # ~20K tokens

response = client.chat.completions.create(
    model="claude-sonnet-4.6",
    messages=[
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": SYSTEM_PROMPT,
                    "cache_control": {"type": "ephemeral"},
                }
            ],
        },
        {"role": "user", "content": "Summarize the latest sales report."},
    ],
)

usage = response.usage
print(usage.cache_creation_input_tokens, usage.cache_read_input_tokens)
The same call in TypeScript:

import fs from "node:fs";
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.CLAUDEXIA_KEY,
  baseURL: "https://api.claudexia.tech/v1",
});

const SYSTEM_PROMPT = fs.readFileSync("system_prompt.txt", "utf8"); // ~20K tokens

const response = await client.chat.completions.create({
  model: "claude-sonnet-4.6",
  messages: [
    {
      role: "system",
      content: [
        {
          type: "text",
          text: SYSTEM_PROMPT,
          cache_control: { type: "ephemeral" },
        },
      ],
    },
    { role: "user", content: "Summarize the latest sales report." },
  ],
});

console.log(response.usage);

The usage object reports cache_creation_input_tokens and cache_read_input_tokens separately from regular input tokens, which is how you verify caching is actually firing — log these in production from day one.
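A minimal sketch of that day-one logging. It assumes you dump the SDK's usage object to a dict (e.g. via its model_dump() method); the two cache field names are the ones reported in the response, while the hit-rate metric itself is our own definition:

```python
def cache_hit_ratio(usage: dict) -> float:
    """Fraction of cached-prefix tokens that were read from cache rather than
    written to it. Near 1.0 means caching is firing; near 0.0 means every
    call is paying the write premium."""
    read = usage.get("cache_read_input_tokens", 0) or 0
    wrote = usage.get("cache_creation_input_tokens", 0) or 0
    total = read + wrote
    return read / total if total else 0.0
```

Alert on a sustained drop in this ratio: it usually means a prefix edit quietly invalidated the cache.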

Real production numbers

A customer running an internal support agent on Claudexia's gateway moved from a 22,000-token system prompt with no caching to the same prompt behind a single cache_control breakpoint. Pre-caching, the system segment alone cost roughly $1,980/month at ~30 calls/hour. Post-caching, with the prefix hitting on average 14 times per 5-minute window, the same line item dropped to $215/month, an 89% reduction, with zero changes to model output or quality. Tool definitions cached behind a second breakpoint shaved another ~$140/month. End-to-end, the agent's Claude bill fell from roughly $2,400/month to under $400.

That ratio is not unusual. Any workload with a stable prefix and at least a handful of calls per 5-minute window will see 70–90% input savings once caching is wired up correctly.

Pitfalls

A few things bite people on the first rollout:

  • The 5-minute TTL is a sliding window, not absolute. Every hit refreshes it. But if traffic dies for 5 full minutes, the next call pays the 1.25× write again. For low-frequency workloads this can silently erase the savings — measure your inter-arrival distribution.
  • Four breakpoints is the hard limit. Use them hierarchically: system → tools → static context → recent history. Don't waste breakpoints on sub-segments that always change together.
  • Cache invalidation on any prefix change. A single edit to the system prompt — even whitespace — invalidates the entry. Treat your cached prefixes like immutable build artifacts and version them explicitly.
  • Per-block minimum sizes. Very small cached blocks (under roughly 1K tokens for most models, larger for the smallest ones) won't be cached at all; the request still succeeds, the cache_control marker is just silently ignored. Don't spend a breakpoint on trivial content.
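The first pitfall is worth quantifying before rollout. Replaying your call timestamps against the sliding window shows how often a cold write recurs (a toy simulation, assuming the default 300-second ephemeral TTL):

```python
TTL_SECONDS = 300  # default ephemeral TTL; every cache hit refreshes it

def count_cache_writes(call_times: list[float]) -> int:
    """Number of calls that pay the 1.25x write premium because the cache
    was cold or had expired since the previous call."""
    writes = 0
    last = None
    for t in sorted(call_times):
        if last is None or t - last > TTL_SECONDS:
            writes += 1  # gap exceeded the TTL: the prefix must be re-written
        last = t
    return writes
```

If this count is close to the total call count for your real traffic, caching will cost you more than it saves.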

When not to cache

Caching is a poor fit for a few cases:

  • Single-shot calls. If the prompt will never be reused within 5 minutes, you are paying the 1.25× write premium for nothing.
  • Short prompts. Below the per-block minimum, caching is a no-op.
  • Highly variable system text. If you template per-user data into what looks like a system prompt, every call is a cold write. Move the variable bits below the breakpoint, or cache the truly stable shell and accept that the personalized layer pays full price.

Claudexia and caching

Claudexia routes prompt caching through to Anthropic transparently — same breakpoints, same TTL, same usage accounting in the response. There is no proprietary "caching mode" to enable, and our pricing matches the Anthropic write/read multipliers exactly. If you are already caching against Anthropic directly, switching the base_url to https://api.claudexia.tech/v1 keeps every saving you have today.

If you have not enabled caching yet, do it this week. It is the highest-ROI change you will make to a Claude workload all year.