If your Claude bill has a long tail of non-interactive workloads — overnight classification, dataset labeling, monthly content moderation sweeps, eval runs — you are leaving roughly half your money on the table by sending those calls through the synchronous Messages API. The Batch API exists for exactly this case: submit a job, wait up to 24 hours, pay 50% less on both input and output tokens.
What the Batch API actually is
The Batch API is an asynchronous job queue. Instead of sending one request at a time and waiting for each response, you submit a batch of requests, each one a complete Messages API request tagged with a custom_id you choose. Anthropic processes the whole batch within a 24-hour SLA, and you download a JSONL file of responses keyed back to each custom_id.
The economics are simple:
- 50% discount on input tokens.
- 50% discount on output tokens.
- Stacks with prompt caching — you can layer cache discounts on top of the batch discount on the cached portion of your input.
- Batches up to 100,000 requests per submission.
- 24-hour SLA — most batches finish in minutes to a few hours, but you should not depend on anything faster.
This is the cheapest way to run Claude at scale, period.
Ideal workloads
Batch is the right primitive when none of your individual requests are time-sensitive. Concretely:
- Overnight classification of millions of records. Tagging support tickets, routing leads, scoring documents for relevance. Submit at 18:00, results ready by morning.
- Dataset labeling and synthetic data generation. Building eval sets, generating training pairs, augmenting RAG corpora with summaries, titles, or extracted entities.
- Monthly content moderation sweeps. Re-scanning historical user-generated content against an updated policy. The job runs once a month, latency does not matter, volume is enormous.
- Eval runs. Sweeping a new prompt or model across your full regression set of thousands of test cases.
- RAG corpus enrichment. Pre-computing summaries, FAQ extractions, or chunk-level classifications for your knowledge base. Build once, serve cheaply forever.
- Embeddings-style bulk semantic processing. When you would otherwise reach for embeddings + a classifier but actually need a language-model judgment per item, batch makes the LLM-per-item path affordable.
When NOT to batch
Just as important — do not use Batch for:
- Interactive UX. Anything a user is waiting on. Chat, code completion, in-product agents. The 24-hour SLA is fatal here.
- Time-sensitive workflows. Fraud decisions, real-time moderation of live posts, alerting pipelines.
- Tiny jobs. If you have 50 requests, the operational overhead of uploading, polling, and joining results back is not worth the saved pennies. Batch starts paying off in the thousands.
- Workloads with tight inter-request dependencies. If request N+1 needs the output of request N, you need a synchronous loop, not a batch.
Pricing math: a worked example
Take a realistic bulk classification job: 100,000 Sonnet 4.6 calls, 1,000 input tokens and 500 output tokens each. That is 100M input tokens and 50M output tokens total.
At synchronous rates this is a known number from your monthly invoice. Through the Batch API the same workload costs exactly half:
- Synchronous: input cost + output cost = X.
- Batch: X / 2.
For a job at the scale above, that 50% line item is typically the difference between "we'll do it once a quarter" and "we run this nightly." If you also cache a stable system prompt across all 100,000 requests, the cache discount stacks on the cached portion — pushing effective input cost down further still.
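To make the halving concrete, here is the invoice math as a sketch, assuming Sonnet-class list prices of $3 per million input tokens and $15 per million output tokens (check the current price sheet for the model you actually run):

# Worked example from above; per-MTok rates are assumptions.
INPUT_PER_MTOK = 3.00    # assumed synchronous input price, $/MTok
OUTPUT_PER_MTOK = 15.00  # assumed synchronous output price, $/MTok

n_requests = 100_000
input_tokens = n_requests * 1_000    # 100M input tokens
output_tokens = n_requests * 500     # 50M output tokens

sync_cost = (input_tokens * INPUT_PER_MTOK + output_tokens * OUTPUT_PER_MTOK) / 1e6
batch_cost = sync_cost * 0.5         # flat 50% batch discount on both sides

print(f"synchronous: ${sync_cost:,.0f}")   # $1,050 at the assumed rates
print(f"batch:       ${batch_cost:,.0f}")  # $525 at the assumed rates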
Code: building a batch JSONL
Each line of the input file is a self-contained request with a
custom_id you control. Through Claudexia, point at
https://api.claudexia.tech/v1 — the Batch API is passed through
transparently.
import json
from anthropic import Anthropic

client = Anthropic(
    base_url="https://api.claudexia.tech/v1",
    api_key="cxa_your_key_here",
)

# 1. Build the JSONL request file
records = load_records_to_classify()  # your 100k items

with open("batch.jsonl", "w") as f:
    for r in records:
        line = {
            "custom_id": f"record-{r['id']}",
            "params": {
                "model": "claude-sonnet-4.6",
                "max_tokens": 256,
                "system": [
                    {
                        "type": "text",
                        "text": CLASSIFIER_SYSTEM_PROMPT,
                        "cache_control": {"type": "ephemeral"},
                    }
                ],
                "messages": [
                    {"role": "user", "content": r["text"]},
                ],
            },
        }
        f.write(json.dumps(line) + "\n")

# 2. Submit the batch
batch = client.messages.batches.create(
    requests=[json.loads(line) for line in open("batch.jsonl")]
)
print("submitted:", batch.id)
Polling and result handling
Polling is dead simple. Check processing_status on a sane interval —
every 60 seconds is plenty, every 5 minutes is more than enough for
most jobs. Do not poll in a hot loop.
import time

while True:
    b = client.messages.batches.retrieve(batch.id)
    if b.processing_status == "ended":
        break
    time.sleep(60)

# Stream results line-by-line
for result in client.messages.batches.results(batch.id):
    cid = result.custom_id
    if result.result.type == "succeeded":
        message = result.result.message
        save_classification(cid, message.content[0].text)
    elif result.result.type == "errored":
        log_error(cid, result.result.error)
    elif result.result.type == "expired":
        # request didn't complete within 24h; requeue it
        requeue(cid)
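While you wait, the batch object also reports per-request counts, which makes progress logging cheap to add inside the polling loop. A short sketch, reusing the b from the loop above:

# Optional: log progress on each poll using the batch's request counts.
counts = b.request_counts
print(f"processing={counts.processing} succeeded={counts.succeeded} "
      f"errored={counts.errored} expired={counts.expired}")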
Per-line error handling
Crucially, the Batch API reports errors per request, not per batch.
A handful of failed lines do not poison the rest. Your processing code
must handle three terminal states for each custom_id:
- succeeded: normal response; store it and move on.
- errored: a model or validation error; log it, optionally retry as a one-off synchronous call (a sketch follows below), and continue.
- expired: the 24-hour SLA elapsed before this specific request was served; requeue it in the next batch.
Treat the result file as a stream. Do not load 100k responses into memory.
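For the errored path, a one-off synchronous retry can reuse the exact params that went into the failed batch line. A minimal sketch, assuming you kept a lookup of request params keyed by custom_id, and reusing the client and save_classification from the earlier snippets:

def retry_synchronously(cid, params_by_custom_id):
    # Re-run one failed batch request through the regular Messages API
    # at full (non-batch) price; acceptable for a handful of stragglers.
    params = params_by_custom_id[cid]   # the "params" dict from that JSONL line
    message = client.messages.create(**params)
    save_classification(cid, message.content[0].text)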
Stack with prompt caching for compounding savings
Batch and prompt caching are independent discounts. If your 100,000
requests share a long system prompt or document context, mark it with
cache_control and the cached portion gets the cache discount on top
of the 50% batch discount. For agent-style classifiers with a 4k
token system prompt, this is the single biggest lever after switching
to batch in the first place.
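As a rough sketch of how the two discounts compound on the input side, take the 100,000-request job with a 4k-token shared system prompt. The cache-read rate below is an assumption (roughly one tenth of the base input price on current price sheets; verify for your model), cache-write premiums are ignored as negligible at this scale, and cache hit rates inside a batch can vary, so treat the result as an upper bound:

# Sketch: batch discount x cache discount on the input side (assumed rates).
BASE_INPUT_PER_MTOK = 3.00     # assumed synchronous input price, $/MTok
CACHE_READ_MULTIPLIER = 0.1    # assumed cache-read price vs. base input
BATCH_MULTIPLIER = 0.5         # the flat 50% batch discount

n_requests = 100_000
cached_mtok = n_requests * 4_000 / 1e6   # 400 MTok of shared system prompt
fresh_mtok = n_requests * 1_000 / 1e6    # 100 MTok of per-request content

batch_only = (cached_mtok + fresh_mtok) * BASE_INPUT_PER_MTOK * BATCH_MULTIPLIER
batch_plus_cache = (cached_mtok * CACHE_READ_MULTIPLIER + fresh_mtok) \
    * BASE_INPUT_PER_MTOK * BATCH_MULTIPLIER

print(f"input cost, batch only:      ${batch_only:,.0f}")        # $750
print(f"input cost, batch + caching: ${batch_plus_cache:,.0f}")  # $210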
Operational tip on Claudexia
Claudexia passes the Batch API through to Anthropic transparently — same
endpoints, same JSONL format, same SDK. One operational note specific
to a pay-per-token gateway: a batch is committed at submission time, so
top up enough balance to cover the whole job before you call
batches.create. The dashboard shows estimated cost per batch based
on the input file; budget against that plus a safety margin for output
tokens.
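A back-of-the-envelope check before calling batches.create might look like the sketch below. The per-MTok rates and the four-characters-per-token heuristic are assumptions for sizing the top-up only; use the dashboard's per-batch estimate as the source of truth.

import json

# Rough pre-submission budget check (sketch; rates and heuristic are assumptions).
BATCH_INPUT_PER_MTOK = 1.50    # assumed batch input price, $/MTok
BATCH_OUTPUT_PER_MTOK = 7.50   # assumed batch output price, $/MTok
MAX_OUTPUT_TOKENS = 256        # matches max_tokens in the request params

est_input_tokens = 0
n_requests = 0
with open("batch.jsonl") as f:
    for raw in f:
        params = json.loads(raw)["params"]
        blob = json.dumps(params["messages"]) + json.dumps(params["system"])
        est_input_tokens += len(blob) // 4   # crude ~4 chars/token heuristic
        n_requests += 1

worst_case_output = n_requests * MAX_OUTPUT_TOKENS   # ceiling; real usage is lower
estimate = (est_input_tokens * BATCH_INPUT_PER_MTOK
            + worst_case_output * BATCH_OUTPUT_PER_MTOK) / 1e6
print(f"top up at least ~${estimate:,.2f}, plus a safety margin")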
Bottom line
If a workload does not need to answer in seconds, it should not be running on the synchronous API. Move overnight jobs, dataset processing, monthly sweeps, and eval runs to Batch, layer prompt caching on the stable parts, and reserve the synchronous API for anything a user is actually waiting for.