Pricing note: the cost figures below assume legacy Anthropic list prices for illustrative purposes. On Claudexia, all Claude Opus and GPT-5 models bill at $0.50 / $0.50 per 1M and all Sonnet & Haiku models at $0.33 / $0.33 per 1M. Use the techniques here, but expect real bills 5–30× lower.
Why Claude API costs spiral
Every dollar you spend on the Claude API is a function of two numbers: input tokens and output tokens. Most teams obsess over prompt length while ignoring the real cost driver — output.
Here is the legacy Anthropic list pricing for the three main models that the cost figures in this guide are based on (see the pricing note above for current Claudexia rates):
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Output / Input ratio |
|---|---|---|---|
| Claude Opus 4.5/4.7 | $15.00 | $75.00 | 5× |
| Claude Sonnet 4.5/4.6 | $3.00 | $15.00 | 5× |
| Claude Haiku 4.5 | $0.80 | $4.00 | 5× |
The 5× multiplier on output is consistent across the lineup. That means a 500-token response costs as much as a 2,500-token input. Teams that generate long-form content, multi-step reasoning chains, or verbose JSON without controlling output length see their bills explode.
A typical startup running 10 million Sonnet requests per month with an average of 1,000 input tokens and 500 output tokens pays:
- Input: 10M × 1,000 / 1M × $3.00 = $30,000
- Output: 10M × 500 / 1M × $15.00 = $75,000
- Total: $105,000/month
Output is 71% of the bill. Every strategy below either reduces token count, shifts tokens to a cheaper tier, or both.
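To make that arithmetic easy to rerun with your own traffic numbers, here is the same calculation in a few lines of Python (prices are the legacy Sonnet list prices used throughout this guide):
REQUESTS = 10_000_000
INPUT_TOKENS, OUTPUT_TOKENS = 1_000, 500
INPUT_PRICE, OUTPUT_PRICE = 3.00, 15.00  # legacy Sonnet list prices, USD per 1M tokens

input_cost = REQUESTS * INPUT_TOKENS / 1e6 * INPUT_PRICE       # $30,000
output_cost = REQUESTS * OUTPUT_TOKENS / 1e6 * OUTPUT_PRICE    # $75,000
print(f"total: ${input_cost + output_cost:,.0f}")              # $105,000
print(f"output share: {output_cost / (input_cost + output_cost):.0%}")  # 71%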
Strategy 1: Prompt caching — 90% off repeated context
If your system prompt, few-shot examples, or RAG context stays the same across multiple requests, you are paying full price for the same tokens over and over. Anthropic's prompt caching stores those tokens server-side and charges only 10% of the normal input price on subsequent hits.
How it works
You mark one or more blocks in your messages with cache_control: {'type': 'ephemeral'}. On the first request, you pay a small write premium (25% extra). On every subsequent request within the cache TTL (currently 5 minutes, extended each time the cache is hit), those tokens are read at a 90% discount.
Python example
import anthropic
client = anthropic.Anthropic()
# A 4,000-token system prompt — cached after first call
system_blocks = [
{
"type": "text",
"text": LARGE_SYSTEM_PROMPT, # your 4k-token instructions
"cache_control": {"type": "ephemeral"},
}
]
def ask(question: str) -> str:
response = client.messages.create(
model="claude-sonnet-4.5",
max_tokens=1024,
system=system_blocks,
messages=[{"role": "user", "content": question}],
)
# Check cache performance
usage = response.usage
print(f"Cache read tokens: {usage.cache_read_input_tokens}")
print(f"Cache creation tokens: {usage.cache_creation_input_tokens}")
return response.content[0].text
TypeScript example
import Anthropic from "@anthropic-ai/sdk";
const client = new Anthropic();
const systemBlocks = [
{
type: "text" as const,
text: LARGE_SYSTEM_PROMPT,
cache_control: { type: "ephemeral" as const },
},
];
async function ask(question: string): Promise<string> {
const response = await client.messages.create({
model: "claude-sonnet-4.5",
max_tokens: 1024,
system: systemBlocks,
messages: [{ role: "user", content: question }],
});
console.log("Cache read:", response.usage.cache_read_input_tokens);
return response.content[0].type === "text" ? response.content[0].text : "";
}
Savings math
If your system prompt is 4,000 tokens and you make 1,000 requests per hour on Sonnet:
- Without caching: 4,000 × 1,000 × $3 / 1M = $12.00/hour
- With caching: first request at $3.75/M (write premium), remaining 999 at $0.30/M = ~$1.21/hour
- Savings: ~90%
Caching is the single highest-ROI optimization. If you do nothing else, do this.
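If you want to verify the hourly numbers above, the calculation is short. A sketch at the legacy Sonnet list prices:
SONNET_INPUT = 3.00                      # legacy list price, USD per 1M input tokens
CACHE_WRITE = SONNET_INPUT * 1.25        # 25% write premium
CACHE_READ = SONNET_INPUT * 0.10         # 90% read discount

PROMPT_TOKENS, REQUESTS_PER_HOUR = 4_000, 1_000

uncached = PROMPT_TOKENS * REQUESTS_PER_HOUR / 1e6 * SONNET_INPUT
cached = PROMPT_TOKENS / 1e6 * (CACHE_WRITE + (REQUESTS_PER_HOUR - 1) * CACHE_READ)
print(f"without caching: ${uncached:.2f}/hour")           # $12.00
print(f"with caching:    ${cached:.2f}/hour")             # ~$1.21
print(f"savings:         {1 - cached / uncached:.0%}")    # ~90%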
Strategy 2: Model routing — right model for the right task
Not every request needs Opus. Not every request can survive on Haiku. The idea is simple: classify the incoming task and route it to the cheapest model that can handle it.
A practical split looks like this:
| Task type | Model | Why |
|---|---|---|
| Classification, tagging, extraction | Haiku | Fast, cheap, >95% accuracy on structured tasks |
| Summarization, generation, RAG answers | Sonnet | Best cost/quality ratio for open-ended generation |
| Complex reasoning, code generation, agentic loops | Opus | Worth the premium only when quality directly impacts revenue |
Python router example
import anthropic
client = anthropic.Anthropic()
MODEL_MAP = {
"classify": "claude-haiku-4.5",
"generate": "claude-sonnet-4.5",
"reason": "claude-opus-4.5",
}
def route_and_call(task_type: str, prompt: str, max_tokens: int = 1024) -> str:
model = MODEL_MAP.get(task_type, "claude-sonnet-4.5")
response = client.messages.create(
model=model,
max_tokens=max_tokens,
messages=[{"role": "user", "content": prompt}],
)
return response.content[0].text
# Classification — runs on Haiku at $0.80/M input
label = route_and_call("classify", "Classify this ticket: 'My order is late'")
# Generation — runs on Sonnet at $3/M input
summary = route_and_call("generate", f"Summarize this document: {doc}")
# Hard reasoning — runs on Opus at $15/M input
analysis = route_and_call("reason", f"Analyze this contract for risks: {contract}")
Savings math
Suppose 60% of your traffic is classification, 30% is generation, and 10% is reasoning. Compared with running everything on Sonnet (the exact figure depends heavily on the token profile of each task type and on how much output lands on Opus):
| Scenario | Monthly cost (10M requests) |
|---|---|
| All Sonnet | $105,000 |
| Routed (60% Haiku, 30% Sonnet, 10% Opus) | $55,200 |
| Savings | ~47% |
If you combine routing with caching, the savings compound.
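To estimate routed costs for your own traffic rather than the illustrative split above, a small helper like the sketch below works. The traffic mix, token counts, and model names here are assumptions to replace with your own measurements, and the result is very sensitive to how many requests and output tokens go to Opus.
PRICES = {  # legacy list prices, USD per 1M tokens
    "claude-haiku-4.5":  {"input": 0.80,  "output": 4.00},
    "claude-sonnet-4.5": {"input": 3.00,  "output": 15.00},
    "claude-opus-4.5":   {"input": 15.00, "output": 75.00},
}

def monthly_cost(traffic: list[dict]) -> float:
    """Sum cost over task types; each entry has a model, request count, and average token counts."""
    total = 0.0
    for t in traffic:
        p = PRICES[t["model"]]
        total += t["requests"] * (
            t["input_tokens"] / 1e6 * p["input"] + t["output_tokens"] / 1e6 * p["output"]
        )
    return total

# Hypothetical mix: classification replies are short, generation and reasoning replies are longer.
mix = [
    {"model": "claude-haiku-4.5",  "requests": 6_000_000, "input_tokens": 1_000, "output_tokens": 50},
    {"model": "claude-sonnet-4.5", "requests": 3_000_000, "input_tokens": 1_000, "output_tokens": 500},
    {"model": "claude-opus-4.5",   "requests": 1_000_000, "input_tokens": 1_000, "output_tokens": 500},
]
print(f"routed monthly cost: ${monthly_cost(mix):,.0f}")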
Strategy 3: Batch API — 50% off for non-real-time workloads
Any workload that does not need a synchronous response — nightly classification jobs, eval sweeps, dataset labeling, content moderation — should use the Batch API. You submit a batch of requests, Anthropic processes them within 24 hours (most batches finish much sooner), and you pay exactly half the normal price on both input and output tokens; results come back as a JSONL stream.
Python example
import anthropic
import time
client = anthropic.Anthropic()
# Build requests
requests = []
for i, item in enumerate(dataset):
requests.append({
"custom_id": f"item-{i}",
"params": {
"model": "claude-sonnet-4.5",
"max_tokens": 256,
"messages": [
{"role": "user", "content": f"Classify: {item['text']}"}
],
},
})
# Submit the batch (the Message Batches API lives under client.messages.batches)
batch = client.messages.batches.create(requests=requests)
print(f"Batch ID: {batch.id}, status: {batch.processing_status}")
# Poll for completion
while True:
    status = client.messages.batches.retrieve(batch.id)
    if status.processing_status == "ended":
        break
    time.sleep(60)
# Download results (skip any requests that errored)
for result in client.messages.batches.results(batch.id):
    if result.result.type == "succeeded":
        print(result.custom_id, result.result.message.content[0].text)
When to use Batch API
- Eval runs across hundreds or thousands of test cases
- Overnight classification or labeling of large datasets
- Monthly content moderation sweeps
- Synthetic data generation for training
When NOT to use it
- Anything user-facing that requires <5 s latency
- Streaming chat interfaces
- Agentic tool-use loops where the next step depends on the previous response
Batch API stacks with prompt caching. You can get a 90% cache discount on input tokens plus the 50% batch discount on the remaining tokens. This is the absolute cheapest way to run Claude at scale.
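Here is a sketch of what that stacking looks like: the same cache_control block from Strategy 1 goes into each batch request's params. SHARED_INSTRUCTIONS is a placeholder for your own large shared prefix, and client and dataset are the variables from the example above. Note that cache hits inside a batch are typically best-effort, since batch requests can be processed concurrently.
cached_system = [
    {
        "type": "text",
        "text": SHARED_INSTRUCTIONS,  # placeholder: the large prefix shared by every request
        "cache_control": {"type": "ephemeral"},
    }
]

batch_requests = [
    {
        "custom_id": f"item-{i}",
        "params": {
            "model": "claude-sonnet-4.5",
            "max_tokens": 256,
            "system": cached_system,
            "messages": [{"role": "user", "content": f"Classify: {item['text']}"}],
        },
    }
    for i, item in enumerate(dataset)
]

batch = client.messages.batches.create(requests=batch_requests)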
Strategy 4: Token budgeting — stop paying for tokens you don't need
The simplest cost optimization is generating fewer tokens. Three techniques:
1. Set max_tokens aggressively
If you know the answer is a single word or a short JSON object, do not leave max_tokens at 4096. Set it to the minimum plausible length. Claude stops generating when it hits the limit, and you stop paying.
# Bad: paying for up to 4096 output tokens
response = client.messages.create(
model="claude-sonnet-4.5",
max_tokens=4096,
messages=[{"role": "user", "content": "Is this email spam? Answer yes or no."}],
)
# Good: paying for at most 16 output tokens
response = client.messages.create(
model="claude-sonnet-4.5",
max_tokens=16,
messages=[{"role": "user", "content": "Is this email spam? Answer yes or no."}],
)
2. Use stop sequences
Tell Claude to stop when it emits a delimiter. This is especially useful when generating structured data:
response = client.messages.create(
model="claude-sonnet-4.5",
max_tokens=512,
stop_sequences=["</answer>"],
messages=[{"role": "user", "content": "Answer in <answer> tags."}],
)
3. Stream and terminate early
With streaming, you can read tokens as they arrive and cancel the request the moment you have what you need:
import Anthropic from "@anthropic-ai/sdk";
const client = new Anthropic();
async function classifyWithEarlyStop(text: string): Promise<string> {
let result = "";
const stream = client.messages.stream({
model: "claude-sonnet-4.5",
max_tokens: 64,
messages: [{ role: "user", content: `Classify: ${text}` }],
});
for await (const event of stream) {
if (
event.type === "content_block_delta" &&
event.delta.type === "text_delta"
) {
result += event.delta.text;
if (result.includes("\n")) {
await stream.abort(); // stop paying
break;
}
}
}
return result.trim();
}
These three techniques together can cut output token spend by 30-60% on classification and extraction workloads.
Strategy 5: Response prefilling — reduce output tokens by pre-filling structure
Claude's API lets you add an assistant message at the end of the message array to "pre-fill" the beginning of the response. The model continues from where you left off. The prefilled tokens count as input (cheap), not output (expensive).
Example: JSON extraction
response = client.messages.create(
model="claude-sonnet-4.5",
max_tokens=256,
messages=[
{"role": "user", "content": f"Extract name, email, company from: {text}"},
{"role": "assistant", "content": '{"name": "'},
],
)
# Claude continues: John Doe", "email": "john@example.com", "company": "Acme"}
# The '{"name": "' prefix was charged at input rates, not output rates
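One detail worth remembering: the API returns only the continuation, not your prefilled prefix, so reassemble the two before parsing. A minimal sketch:
import json

PREFILL = '{"name": "'
completion = response.content[0].text        # continuation only; the prefix is not echoed back
record = json.loads(PREFILL + completion)    # reassemble the full JSON before parsing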
Why this matters
For a JSON response that is 200 tokens total, prefilling 50 tokens of structure (keys, braces, formatting) shifts 25% of the output tokens to the input tier, a 5× price reduction on those tokens at the list prices above.
Prefilling also improves reliability: Claude is far less likely to wrap the response in markdown fences or add preamble text when it is already mid-JSON.
Strategy 6: Structured outputs with tool use
Instead of asking Claude to generate JSON as free text and then parsing it (which is fragile and generates extra tokens for explanations), use tool use (function calling) to force structured output.
Python example
import anthropic
client = anthropic.Anthropic()
tools = [
{
"name": "classify_ticket",
"description": "Classify a support ticket",
"input_schema": {
"type": "object",
"properties": {
"category": {
"type": "string",
"enum": ["billing", "technical", "general", "urgent"],
},
"priority": {
"type": "integer",
"minimum": 1,
"maximum": 5,
},
"summary": {
"type": "string",
"maxLength": 100,
},
},
"required": ["category", "priority", "summary"],
},
}
]
response = client.messages.create(
model="claude-haiku-4.5",
max_tokens=256,
tools=tools,
tool_choice={"type": "tool", "name": "classify_ticket"},
messages=[
{"role": "user", "content": f"Classify this ticket: {ticket_text}"}
],
)
# Response is guaranteed valid JSON matching the schema
tool_block = next(b for b in response.content if b.type == "tool_use")
result = tool_block.input # {'category': 'billing', 'priority': 3, 'summary': '...'}
TypeScript example
const response = await client.messages.create({
model: "claude-haiku-4.5",
max_tokens: 256,
tools: [
{
name: "classify_ticket",
description: "Classify a support ticket",
input_schema: {
type: "object",
properties: {
category: {
type: "string",
enum: ["billing", "technical", "general", "urgent"],
},
priority: { type: "integer", minimum: 1, maximum: 5 },
summary: { type: "string", maxLength: 100 },
},
required: ["category", "priority", "summary"],
},
},
],
tool_choice: { type: "tool", name: "classify_ticket" },
messages: [{ role: "user", content: `Classify this ticket: ${ticketText}` }],
});
const toolBlock = response.content.find((b) => b.type === "tool_use");
const result = toolBlock?.type === "tool_use" ? toolBlock.input : undefined;
Benefits over free-text JSON:
- Guaranteed valid JSON — no parsing errors, no retry loops
- Fewer output tokens — Claude does not add explanations, caveats, or markdown wrappers
- Schema enforcement — enums, required fields, and types are validated by the API
Teams that switch from "generate JSON in text" to tool use typically see a 20-40% reduction in output tokens on extraction tasks.
Strategy 7: Use a cost-effective gateway like Claudexia
All six strategies above cut your bill by reducing token counts or shifting tokens to cheaper tiers. The seventh strategy reduces the price you pay per token and removes unnecessary surcharges.
When you access the Claude API directly through Anthropic, you need:
- A non-Russian credit card or billing address
- To deal with USD invoicing and currency conversion fees
- To meet any minimum spend or prepayment requirements on certain tiers
Claudexia is an OpenAI-compatible gateway that proxies to Anthropic's official API and bills at the flat per-token rates listed in the pricing note at the top of this article. It adds:
- Local currency payments — pay in RUB via SBP, cards, or crypto
- No minimum spend — pay as you go, even for small projects
- OpenAI SDK compatibility — switch your base URL and keep your existing code
- All models available — Haiku, Sonnet, Opus, including latest releases
from openai import OpenAI
# Point your existing OpenAI SDK at Claudexia
client = OpenAI(
base_url="https://claudexia.tech/v1",
api_key="your-claudexia-key",
)
response = client.chat.completions.create(
model="claude-sonnet-4.5",
max_tokens=1024,
messages=[{"role": "user", "content": "Hello from Claudexia!"}],
)
print(response.choices[0].message.content)
import OpenAI from "openai";
const client = new OpenAI({
baseURL: "https://claudexia.tech/v1",
apiKey: "your-claudexia-key",
});
const response = await client.chat.completions.create({
model: "claude-sonnet-4.5",
max_tokens: 1024,
messages: [{ role: "user", content: "Hello from Claudexia!" }],
});
console.log(response.choices[0].message.content);
This is not only about the per-token price (see the pricing note at the top for current Claudexia rates). It is about removing friction: currency conversion overhead, billing complexity, and geographic restrictions that add invisible cost to your workflow.
Real-world cost calculation: before vs after
Let us take the startup from the introduction — 10 million Sonnet requests/month, 1,000 input tokens and 500 output tokens average — and apply every strategy.
| Optimization | Input cost | Output cost | Total | Savings vs baseline |
|---|---|---|---|---|
| Baseline (no optimization) | $30,000 | $75,000 | $105,000 | — |
| + Prompt caching (70% of input cached) | $11,700 | $75,000 | $86,700 | 17% |
| + Model routing (60% to Haiku) | $5,148 | $33,000 | $38,148 | 64% |
| + Token budgeting (avg output → 300 tokens) | $5,148 | $19,800 | $24,948 | 76% |
| + Batch API on 30% of traffic | $4,378 | $16,830 | $21,208 | 80% |
| + Prefilling & structured outputs (−20% output) | $4,378 | $13,464 | $17,842 | 83% |
From $105,000 to under $18,000. That is an 83% reduction — and we have not even touched prompt engineering to shorten inputs.
The exact numbers depend on your workload mix, but the pattern holds: stack multiple strategies and the savings compound multiplicatively.
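If you want to model the compounding yourself, each row in the table is roughly a multiplier applied to the input or output side of the bill. The sketch below reproduces the table's totals to within a couple of dollars of rounding; the multipliers are the illustrative ones implied by the table, not guarantees.
def stack(base_input: float, base_output: float, steps: list[tuple[str, float, float]]) -> None:
    """Apply (name, input_multiplier, output_multiplier) steps and print the running total."""
    inp, out = base_input, base_output
    print(f"baseline: ${inp + out:,.0f}")
    for name, in_mult, out_mult in steps:
        inp *= in_mult
        out *= out_mult
        print(f"+ {name}: ${inp + out:,.0f}")

stack(30_000, 75_000, [
    ("prompt caching",          0.39, 1.00),  # factor implied by the caching row (70% of input cached)
    ("model routing",           0.44, 0.44),  # factor implied by the routing row (60% of traffic to Haiku)
    ("token budgeting",         1.00, 0.60),  # average output 500 -> 300 tokens
    ("batch on 30% of traffic", 0.85, 0.85),  # 50% off on 30% of requests
    ("prefill + tool use",      1.00, 0.80),  # ~20% fewer output tokens
])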
FAQ
How does prompt caching pricing work exactly?
Cache writes cost 25% more than normal input tokens ($3.75/M on Sonnet instead of $3.00/M at the legacy list prices used here). Cache reads cost 90% less ($0.30/M). The cache lives for 5 minutes and the TTL resets on every hit. If your traffic is steady, meaning more than one request per 5 minutes reuses the same cached prefix, the write premium pays for itself on the very first cache hit: two requests cost 1.25 + 0.10 = 1.35 input-price units instead of 2.
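A quick check of how fast the write premium is recovered, at the same prices:
base, write, read = 3.00, 3.75, 0.30  # legacy Sonnet prices, USD per 1M prompt tokens
for n in (1, 2, 3):
    print(f"{n} request(s): cached ${write + (n - 1) * read:.2f} vs uncached ${n * base:.2f}")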
Can I combine prompt caching with the Batch API?
Yes. Cache discounts apply to the cached portion of input tokens, and the 50% batch discount applies to everything else. They stack multiplicatively. A cached token in a batch request costs roughly 95% less than a regular input token.
What is the minimum cache size?
Anthropic requires a minimum cacheable prefix of 1,024 tokens on Sonnet and Opus; Haiku models have a higher minimum (2,048 tokens or more, depending on the version). If your system prompt is shorter than that, caching will not activate. Pad with few-shot examples or detailed instructions to cross the threshold.
How do I decide which model to route a request to?
Start with a simple heuristic: if the task has a fixed output schema (classification, extraction, yes/no), use Haiku. If it requires open-ended generation with moderate quality, use Sonnet. Reserve Opus for tasks where errors have high cost — legal analysis, complex code generation, multi-step agentic reasoning. Measure accuracy on your eval set and adjust thresholds.
Does response prefilling work with tool use?
No. Prefilling (adding a partial assistant message) and tool_choice are separate mechanisms. Use prefilling for free-text JSON outputs and tool use for schema-enforced structured outputs. Do not combine them in the same request.
What happens if my batch job fails partway through?
The Batch API processes each request independently. If 5 out of 10,000 requests fail (e.g., due to content policy), the other 9,995 succeed normally. Failed requests include error details in the response JSONL. You only pay for successful requests.
Is there a rate limit difference between sync and batch?
Batch requests do not count against your real-time rate limits. They run in a separate queue with higher aggregate throughput. This makes batch ideal for burst workloads that would otherwise require rate limit increases.
How does Claudexia compare to using Anthropic directly?
Claudexia proxies to Anthropic's official API, so you get identical model behavior, latency, and features — including caching, batching, and tool use. The difference is operational: Claudexia accepts local payment methods (SBP, Russian cards, crypto), has no minimum spend, and provides an OpenAI-compatible endpoint so you can switch with a one-line base URL change.
Every strategy in this guide is production-ready and can be implemented in an afternoon. Start with prompt caching (highest ROI, lowest effort), add model routing, and layer in batching and token budgeting as your usage grows.
Ready to start saving? Get your Claudexia API key and apply these optimizations today, with local currency payments and the flat per-token pricing described in the note at the top.