Pricing note: the cost figures below assume legacy Anthropic list prices for illustrative purposes. On Claudexia, all Claude Opus and GPT-5 models bill at $0.50 / $0.50 per 1M and all Sonnet & Haiku models at $0.33 / $0.33 per 1M. Use the techniques here, but expect real bills 5–30× lower.
Why Claude API costs spiral
Every dollar you spend on the Claude API is a function of two numbers: input tokens and output tokens. Most teams obsess over prompt length while ignoring the real cost driver — output.
Here is the legacy Anthropic list pricing for the three main models that the cost figures in this guide are based on (see the pricing note above for current Claudexia rates):
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Output / Input ratio |
|---|---|---|---|
| Claude Opus 4.5/4.7 | $15.00 | $75.00 | 5× |
| Claude Sonnet 4.5/4.6 | $3.00 | $15.00 | 5× |
| Claude Haiku 4.5 | $0.80 | $4.00 | 5× |
The 5× multiplier on output is consistent across the lineup. That means a 500-token response costs as much as a 2,500-token input. Teams that generate long-form content, multi-step reasoning chains, or verbose JSON without controlling output length see their bills explode.
A typical startup running 10 million Sonnet requests per month with an average of 1,000 input tokens and 500 output tokens pays:
- Input: 10M × 1,000 / 1M × $3.00 = $30,000
- Output: 10M × 500 / 1M × $15.00 = $75,000
- Total: $105,000/month
Output is 71% of the bill. Every strategy below either reduces token count, shifts tokens to a cheaper tier, or both.
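To make that arithmetic easy to rerun with your own traffic numbers, here is the same calculation in a few lines of Python (prices are the legacy Sonnet list prices used throughout this guide):
REQUESTS = 10_000_000
INPUT_TOKENS, OUTPUT_TOKENS = 1_000, 500
INPUT_PRICE, OUTPUT_PRICE = 3.00, 15.00  # legacy Sonnet list prices, USD per 1M tokens

input_cost = REQUESTS * INPUT_TOKENS / 1e6 * INPUT_PRICE       # $30,000
output_cost = REQUESTS * OUTPUT_TOKENS / 1e6 * OUTPUT_PRICE    # $75,000
print(f"total: ${input_cost + output_cost:,.0f}")              # $105,000
print(f"output share: {output_cost / (input_cost + output_cost):.0%}")  # 71%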
Strategy 1: Prompt caching — 90% off repeated context
If your system prompt, few-shot examples, or RAG context stays the same across multiple requests, you are paying full price for the same tokens over and over. Anthropic's prompt caching stores those tokens server-side and charges only 10% of the normal input price on subsequent hits.
How it works
You mark one or more blocks in your messages with cache_control: {'type': 'ephemeral'}. On the first request, you pay a small write premium (25% extra). On every subsequent request within the cache TTL (currently 5 minutes, extended each time the cache is hit), those tokens are read at a 90% discount.
Python example
import anthropic
client = anthropic.Anthropic()
# A 4,000-token system prompt — cached after first call
system_blocks = [
{
"type": "text",
"text": LARGE_SYSTEM_PROMPT, # your 4k-token instructions
"cache_control": {"type": "ephemeral"},
}
]
def ask(question: str) -> str:
response = client.messages.create(
model="claude-sonnet-4.5",
max_tokens=1024,
system=system_blocks,
messages=[{"role": "user", "content": question}],
)
# Check cache performance
usage = response.usage
print(f"Cache read tokens: {usage.cache_read_input_tokens}")
print(f"Cache creation tokens: {usage.cache_creation_input_tokens}")
return response.content[0].text
TypeScript example
import Anthropic from "@anthropic-ai/sdk";
const client = new Anthropic();
const systemBlocks = [
{
type: "text" as const,
text: LARGE_SYSTEM_PROMPT,
cache_control: { type: "ephemeral" as const },
},
];
async function ask(question: string): Promise<string> {
const response = await client.messages.create({
model: "claude-sonnet-4.5",
max_tokens: 1024,
system: systemBlocks,
messages: [{ role: "user", content: question }],
});
console.log("Cache read:", response.usage.cache_read_input_tokens);
return response.content[0].type === "text" ? response.content[0].text : "";
}
Savings math
If your system prompt is 4,000 tokens and you make 1,000 requests per hour on Sonnet:
- Without caching: 4,000 × 1,000 × $3 / 1M = $12.00/hour
- With caching: first request at $3.75/M (write premium), remaining 999 at $0.30/M = ~$1.21/hour
- Savings: ~90%
Caching is the single highest-ROI optimization. If you do nothing else, do this.
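If you want to verify the hourly numbers above, the calculation is short. A sketch at the legacy Sonnet list prices:
SONNET_INPUT = 3.00                      # legacy list price, USD per 1M input tokens
CACHE_WRITE = SONNET_INPUT * 1.25        # 25% write premium
CACHE_READ = SONNET_INPUT * 0.10         # 90% read discount

PROMPT_TOKENS, REQUESTS_PER_HOUR = 4_000, 1_000

uncached = PROMPT_TOKENS * REQUESTS_PER_HOUR / 1e6 * SONNET_INPUT
cached = PROMPT_TOKENS / 1e6 * (CACHE_WRITE + (REQUESTS_PER_HOUR - 1) * CACHE_READ)
print(f"without caching: ${uncached:.2f}/hour")           # $12.00
print(f"with caching:    ${cached:.2f}/hour")             # ~$1.21
print(f"savings:         {1 - cached / uncached:.0%}")    # ~90%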
Strategy 2: Model routing — right model for the right task
Not every request needs Opus. Not every request can survive on Haiku. The idea is simple: classify the incoming task and route it to the cheapest model that can handle it.
A practical split looks like this:
| Task type | Model | Why |
|---|---|---|
| Classification, tagging, extraction | Haiku | Fast, cheap, >95% accuracy on structured tasks |
| Summarization, generation, RAG answers | Sonnet | Best cost/quality ratio for open-ended generation |
| Complex reasoning, code generation, agentic loops | Opus | Worth the premium only when quality directly impacts revenue |
Python router example
import anthropic
client = anthropic.Anthropic()
MODEL_MAP = {
"classify": "claude-haiku-4.5",
"generate": "claude-sonnet-4.5",
"reason": "claude-opus-4.5",
}
def route_and_call(task_type: str, prompt: str, max_tokens: int = 1024) -> str:
model = MODEL_MAP.get(task_type, "claude-sonnet-4.5")
response = client.messages.create(
model=model,
max_tokens=max_tokens,
messages=[{"role": "user", "content": prompt}],
)
return response.content[0].text
# Classification — runs on Haiku at $0.80/M input
label = route_and_call("classify", "Classify this ticket: 'My order is late'")
# Generation — runs on Sonnet at $3/M input
summary = route_and_call("generate", f"Summarize this document: {doc}")
# Hard reasoning — runs on Opus at $15/M input
analysis = route_and_call("reason", f"Analyze this contract for risks: {contract}")
Savings math
Suppose 60% of your traffic is classification, 30% is generation, and 10% is reasoning. Compared with running everything on Sonnet (the exact figure depends heavily on the token profile of each task type and on how much output lands on Opus):
| Scenario | Monthly cost (10M requests) |
|---|---|
| All Sonnet | $105,000 |
| Routed (60% Haiku, 30% Sonnet, 10% Opus) | $55,200 |
| Savings | ~47% |
If you combine routing with caching, the savings compound.
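To estimate routed costs for your own traffic rather than the illustrative split above, a small helper like the sketch below works. The traffic mix, token counts, and model names here are assumptions to replace with your own measurements, and the result is very sensitive to how many requests and output tokens go to Opus.
PRICES = {  # legacy list prices, USD per 1M tokens
    "claude-haiku-4.5":  {"input": 0.80,  "output": 4.00},
    "claude-sonnet-4.5": {"input": 3.00,  "output": 15.00},
    "claude-opus-4.5":   {"input": 15.00, "output": 75.00},
}

def monthly_cost(traffic: list[dict]) -> float:
    """Sum cost over task types; each entry has a model, request count, and average token counts."""
    total = 0.0
    for t in traffic:
        p = PRICES[t["model"]]
        total += t["requests"] * (
            t["input_tokens"] / 1e6 * p["input"] + t["output_tokens"] / 1e6 * p["output"]
        )
    return total

# Hypothetical mix: classification replies are short, generation and reasoning replies are longer.
mix = [
    {"model": "claude-haiku-4.5",  "requests": 6_000_000, "input_tokens": 1_000, "output_tokens": 50},
    {"model": "claude-sonnet-4.5", "requests": 3_000_000, "input_tokens": 1_000, "output_tokens": 500},
    {"model": "claude-opus-4.5",   "requests": 1_000_000, "input_tokens": 1_000, "output_tokens": 500},
]
print(f"routed monthly cost: ${monthly_cost(mix):,.0f}")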
Strategy 3: Batch API — 50% off for non-real-time workloads
Any workload that does not need a synchronous response — nightly classification jobs, eval sweeps, dataset labeling, content moderation — should use the Batch API. You submit a batch of requests, Anthropic processes them within 24 hours (most batches finish much sooner), and you pay exactly half the normal price on both input and output tokens; results come back as a JSONL stream.
Python example
import anthropic
import time
client = anthropic.Anthropic()
# Build requests
requests = []
for i, item in enumerate(dataset):
requests.append({
"custom_id": f"item-{i}",
"params": {
"model": "claude-sonnet-4.5",
"max_tokens": 256,
"messages": [
{"role": "user", "content": f"Classify: {item['text']}"}
],
},
})
# Submit the batch (the Message Batches API lives under client.messages.batches)
batch = client.messages.batches.create(requests=requests)
print(f"Batch ID: {batch.id}, status: {batch.processing_status}")
# Poll for completion
while True:
    status = client.messages.batches.retrieve(batch.id)
    if status.processing_status == "ended":
        break
    time.sleep(60)
# Download results (skip any requests that errored)
for result in client.messages.batches.results(batch.id):
    if result.result.type == "succeeded":
        print(result.custom_id, result.result.message.content[0].text)
When to use Batch API
- Eval runs across hundreds or thousands of test cases
- Overnight classification or labeling of large datasets
- Monthly content moderation sweeps
- Synthetic data generation for training
When NOT to use it
- Anything user-facing that requires <5 s latency
- Streaming chat interfaces
- Agentic tool-use loops where the next step depends on the previous response
Batch API stacks with prompt caching. You can get a 90% cache discount on input tokens plus the 50% batch discount on the remaining tokens. This is the absolute cheapest way to run Claude at scale.
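Here is a sketch of what that stacking looks like: the same cache_control block from Strategy 1 goes into each batch request's params. SHARED_INSTRUCTIONS is a placeholder for your own large shared prefix, and client and dataset are the variables from the example above. Note that cache hits inside a batch are typically best-effort, since batch requests can be processed concurrently.
cached_system = [
    {
        "type": "text",
        "text": SHARED_INSTRUCTIONS,  # placeholder: the large prefix shared by every request
        "cache_control": {"type": "ephemeral"},
    }
]

batch_requests = [
    {
        "custom_id": f"item-{i}",
        "params": {
            "model": "claude-sonnet-4.5",
            "max_tokens": 256,
            "system": cached_system,
            "messages": [{"role": "user", "content": f"Classify: {item['text']}"}],
        },
    }
    for i, item in enumerate(dataset)
]

batch = client.messages.batches.create(requests=batch_requests)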
Strategy 4: Token budgeting — stop paying for tokens you don't need
The simplest cost optimization is generating fewer tokens. Three techniques:
1. Set max_tokens aggressively
If you know the answer is a single word or a short JSON object, do not leave max_tokens at 4096. Set it to the minimum plausible length. Claude stops generating when it hits the limit, and you stop paying.
# Bad: paying for up to 4096 output tokens
response = client.messages.create(
model="claude-sonnet-4.5",
max_tokens=4096,
messages=[{"role": "user", "content": "Is this email spam? Answer yes or no."}],
)
# Good: paying for at most 16 output tokens
response = client.messages.create(
model="claude-sonnet-4.5",
max_tokens=16,
messages=[{"role": "user", "content": "Is this email spam? Answer yes or no."}],
)
2. Use stop sequences
Tell Claude to stop when it emits a delimiter. This is especially useful when generating structured data:
response = client.messages.create(
model="claude-sonnet-4.5",
max_tokens=512,
stop_sequences=["</answer>"],
messages=[{"role": "user", "content": "Answer in <answer> tags."}],
)
3. Stream and terminate early
With streaming, you can read tokens as they arrive and cancel the request the moment you have what you need:
import Anthropic from "@anthropic-ai/sdk";
const client = new Anthropic();
async function classifyWithEarlyStop(text: string): Promise<string> {
let result = "";
const stream = client.messages.stream({
model: "claude-sonnet-4.5",
max_tokens: 64,
messages: [{ role: "user", content: `Classify: ${text}` }],
});
for await (const event of stream) {
if (
event.type === "content_block_delta" &&
event.delta.type === "text_delta"
) {
result += event.delta.text;
if (result.includes("\n")) {
await stream.abort(); // stop paying
break;
}
}
}
return result.trim();
}
These three techniques together can cut output token spend by 30-60% on classification and extraction workloads.
Strategy 5: Response prefilling — reduce output tokens by pre-filling structure
Claude's API lets you add an assistant message at the end of the message array to "pre-fill" the beginning of the response. The model continues from where you left off. The prefilled tokens count as input (cheap), not output (expensive).
Example: JSON extraction
response = client.messages.create(
model="claude-sonnet-4.5",
max_tokens=256,
messages=[
{"role": "user", "content": f"Extract name, email, company from: {text}"},
{"role": "assistant", "content": '{"name": "'},
],
)
# Claude continues: John Doe", "email": "john@example.com", "company": "Acme"}
# The '{"name": "' prefix was charged at input rates, not output rates
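One detail worth remembering: the API returns only the continuation, not your prefilled prefix, so reassemble the two before parsing. A minimal sketch:
import json

PREFILL = '{"name": "'
completion = response.content[0].text        # continuation only; the prefix is not echoed back
record = json.loads(PREFILL + completion)    # reassemble the full JSON before parsing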
Why this matters
For a JSON response that is 200 tokens total, prefilling 50 tokens of structure (keys, braces, formatting) shifts 25% of the output tokens to the input tier, a 5× price reduction on those tokens at the list prices above.
Prefilling also improves reliability: Claude is far less likely to wrap the response in markdown fences or add preamble text when it is already mid-JSON.
Strategy 6: Structured outputs with tool use
Instead of asking Claude to generate JSON as free text and then parsing it (which is fragile and generates extra tokens for explanations), use tool use (function calling) to force structured output.
Python example
import anthropic
client = anthropic.Anthropic()
tools = [
{
"name": "classify_ticket",
"description": "Classify a support ticket",
"input_schema": {
"type": "object",
"properties": {
"category": {
"type": "string",
"enum": ["billing", "technical", "general", "urgent"],
},
"priority": {
"type": "integer",
"minimum": 1,
"maximum": 5,
},
"summary": {
"type": "string",
"maxLength": 100,
},
},
"required": ["category", "priority", "summary"],
},
}
]
response = client.messages.create(
model="claude-haiku-4.5",
max_tokens=256,
tools=tools,
tool_choice={"type": "tool", "name": "classify_ticket"},
messages=[
{"role": "user", "content": f"Classify this ticket: {ticket_text}"}
],
)
# Response is guaranteed valid JSON matching the schema
tool_block = next(b for b in response.content if b.type == "tool_use")
result = tool_block.input # {'category': 'billing', 'priority': 3, 'summary': '...'}
TypeScript example
const response = await client.messages.create({
model: "claude-haiku-4.5",
max_tokens: 256,
tools: [
{
name: "classify_ticket",
description: "Classify a support ticket",
input_schema: {
type: "object",
properties: {
category: {
type: "string",
enum: ["billing", "technical", "general", "urgent"],
},
priority: { type: "integer", minimum: 1, maximum: 5 },
summary: { type: "string", maxLength: 100 },
},
required: ["category", "priority", "summary"],
},
},
],
tool_choice: { type: "tool", name: "classify_ticket" },
messages: [{ role: "user", content: `Classify this ticket: ${ticketText}` }],
});
const toolBlock = response.content.find((b) => b.type === "tool_use");
const result = toolBlock?.type === "tool_use" ? toolBlock.input : undefined;
Benefits over free-text JSON:
- Guaranteed valid JSON — no parsing errors, no retry loops
- Fewer output tokens — Claude does not add explanations, caveats, or markdown wrappers
- Schema enforcement — enums, required fields, and types are validated by the API
Teams that switch from "generate JSON in text" to tool use typically see a 20-40% reduction in output tokens on extraction tasks.
Strategy 7: Use a cost-effective gateway like Claudexia
All six strategies above cut your bill by reducing token counts or shifting tokens to cheaper tiers. The seventh strategy reduces the price you pay per token and removes unnecessary surcharges.
When you access the Claude API directly through Anthropic, you need:
- A non-Russian credit card or billing address
- To deal with USD invoicing and currency conversion fees
- To meet any minimum spend or prepayment requirements on certain tiers
Claudexia is an OpenAI-compatible gateway that proxies to Anthropic's official API and bills at the flat per-token rates listed in the pricing note at the top of this article. It adds:
- Local currency payments — pay in RUB via SBP, cards, or crypto
- No minimum spend — pay as you go, even for small projects
- OpenAI SDK compatibility — switch your base URL and keep your existing code
- All models available — Haiku, Sonnet, Opus, including latest releases
from openai import OpenAI
# Point your existing OpenAI SDK at Claudexia
client = OpenAI(
base_url="https://claudexia.tech/v1",
api_key="your-claudexia-key",
)
response = client.chat.completions.create(
model="claude-sonnet-4.5",
max_tokens=1024,
messages=[{"role": "user", "content": "Hello from Claudexia!"}],
)
print(response.choices[0].message.content)
import OpenAI from "openai";
const client = new OpenAI({
baseURL: "https://claudexia.tech/v1",
apiKey: "your-claudexia-key",
});
const response = await client.chat.completions.create({
model: "claude-sonnet-4.5",
max_tokens: 1024,
messages: [{ role: "user", content: "Hello from Claudexia!" }],
});
console.log(response.choices[0].message.content);
This is not only about the per-token price (see the pricing note at the top for current Claudexia rates). It is about removing friction: currency conversion overhead, billing complexity, and geographic restrictions that add invisible cost to your workflow.
Real-world cost calculation: before vs after
Let us take the startup from the introduction — 10 million Sonnet requests/month, 1,000 input tokens and 500 output tokens average — and apply every strategy.
| Optimization | Input cost | Output cost | Total | Savings vs baseline |
|---|---|---|---|---|
| Baseline (no optimization) | $30,000 | $75,000 | $105,000 | — |
| + Prompt caching (70% of input cached) | $11,700 | $75,000 | $86,700 | 17% |
| + Model routing (60% to Haiku) | $5,148 | $33,000 | $38,148 | 64% |
| + Token budgeting (avg output → 300 tokens) | $5,148 | $19,800 | $24,948 | 76% |
| + Batch API on 30% of traffic | $4,378 | $16,830 | $21,208 | 80% |
| + Prefilling & structured outputs (−20% output) | $4,378 | $13,464 | $17,842 | 83% |
From $105,000 to under $18,000. That is an 83% reduction — and we have not even touched prompt engineering to shorten inputs.
The exact numbers depend on your workload mix, but the pattern holds: stack multiple strategies and the savings compound multiplicatively.
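If you want to model the compounding yourself, each row in the table is roughly a multiplier applied to the input or output side of the bill. The sketch below reproduces the table's totals to within a couple of dollars of rounding; the multipliers are the illustrative ones implied by the table, not guarantees.
def stack(base_input: float, base_output: float, steps: list[tuple[str, float, float]]) -> None:
    """Apply (name, input_multiplier, output_multiplier) steps and print the running total."""
    inp, out = base_input, base_output
    print(f"baseline: ${inp + out:,.0f}")
    for name, in_mult, out_mult in steps:
        inp *= in_mult
        out *= out_mult
        print(f"+ {name}: ${inp + out:,.0f}")

stack(30_000, 75_000, [
    ("prompt caching",          0.39, 1.00),  # factor implied by the caching row (70% of input cached)
    ("model routing",           0.44, 0.44),  # factor implied by the routing row (60% of traffic to Haiku)
    ("token budgeting",         1.00, 0.60),  # average output 500 -> 300 tokens
    ("batch on 30% of traffic", 0.85, 0.85),  # 50% off on 30% of requests
    ("prefill + tool use",      1.00, 0.80),  # ~20% fewer output tokens
])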
FAQ
How does prompt caching pricing work exactly?
Cache writes cost 25% more than normal input tokens ($3.75/M on Sonnet instead of $3.00/M at the legacy list prices used here). Cache reads cost 90% less ($0.30/M). The cache lives for 5 minutes and the TTL resets on every hit. If your traffic is steady, meaning more than one request per 5 minutes reuses the same cached prefix, the write premium pays for itself on the very first cache hit: two requests cost 1.25 + 0.10 = 1.35 input-price units instead of 2.
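A quick check of how fast the write premium is recovered, at the same prices:
base, write, read = 3.00, 3.75, 0.30  # legacy Sonnet prices, USD per 1M prompt tokens
for n in (1, 2, 3):
    print(f"{n} request(s): cached ${write + (n - 1) * read:.2f} vs uncached ${n * base:.2f}")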
Can I combine prompt caching with the Batch API?
Yes. Cache discounts apply to the cached portion of input tokens, and the 50% batch discount applies to everything else. They stack multiplicatively. A cached token in a batch request costs roughly 95% less than a regular input token.
What is the minimum cache size?
Anthropic requires a minimum cacheable prefix of 1,024 tokens on Sonnet and Opus; Haiku models have a higher minimum (2,048 tokens or more, depending on the version). If your system prompt is shorter than that, caching will not activate. Pad with few-shot examples or detailed instructions to cross the threshold.
How do I decide which model to route a request to?
Start with a simple heuristic: if the task has a fixed output schema (classification, extraction, yes/no), use Haiku. If it requires open-ended generation with moderate quality, use Sonnet. Reserve Opus for tasks where errors have high cost — legal analysis, complex code generation, multi-step agentic reasoning. Measure accuracy on your eval set and adjust thresholds.
Does response prefilling work with tool use?
No. Prefilling (adding a partial assistant message) and tool_choice are separate mechanisms. Use prefilling for free-text JSON outputs and tool use for schema-enforced structured outputs. Do not combine them in the same request.
What happens if my batch job fails partway through?
The Batch API processes each request independently. If 5 out of 10,000 requests fail (e.g., due to content policy), the other 9,995 succeed normally. Failed requests include error details in the response JSONL. You only pay for successful requests.
Is there a rate limit difference between sync and batch?
Batch requests do not count against your real-time rate limits. They run in a separate queue with higher aggregate throughput. This makes batch ideal for burst workloads that would otherwise require rate limit increases.
How does Claudexia compare to using Anthropic directly?
Claudexia proxies to Anthropic's official API, so you get identical model behavior, latency, and features — including caching, batching, and tool use. The difference is operational: Claudexia accepts local payment methods (SBP, Russian cards, crypto), has no minimum spend, and provides an OpenAI-compatible endpoint so you can switch with a one-line base URL change.
Every strategy in this guide is production-ready and can be implemented in an afternoon. Start with prompt caching (highest ROI, lowest effort), add model routing, and layer in batching and token budgeting as your usage grows.
Ready to start saving? Get your Claudexia API key and apply these optimizations today, with local currency payments and the flat per-token pricing described in the note at the top.