COST OPTIMIZATION

Claude Batch API: 50% Off Bulk Inference in 2026

Anthropic's Batch API runs jobs within 24h at half price. When to use it for embeddings replacement, classification at scale, content moderation, and dataset processing.

If your Claude bill has a long tail of non-interactive workloads — overnight classification, dataset labeling, monthly content moderation sweeps, eval runs — you are leaving roughly half your money on the table by sending those calls through the synchronous Messages API. The Batch API exists for exactly this case: submit a job, wait up to 24 hours, pay 50% less on both input and output tokens.

What the Batch API actually is

The Batch API is an asynchronous job queue. Instead of sending one request at a time and waiting for a response, you submit a batch in which each entry is a complete Messages API request tagged with a custom_id you choose (most teams stage these as a JSONL file, one request per line). Anthropic processes the whole batch within a 24-hour SLA, and you download a JSONL file of responses keyed back to your custom_id.
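A single request line looks like this (values are illustrative; the params object is exactly what you would send to the Messages API):

{"custom_id": "record-001", "params": {"model": "claude-sonnet-4.6", "max_tokens": 256, "messages": [{"role": "user", "content": "Classify this support ticket: ..."}]}}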

The economics are simple:

  • 50% discount on input tokens.
  • 50% discount on output tokens.
  • Stacks with prompt caching — you can layer cache discounts on top of the batch discount on the cached portion of your input.
  • Batches up to 100,000 requests per submission.
  • 24-hour SLA — most batches finish in minutes to a few hours, but do not design around anything faster.

This is the cheapest way to run Claude at scale, period.

Ideal workloads

Batch is the right primitive when none of your individual requests are time-sensitive. Concretely:

  • Overnight classification of millions of records. Tagging support tickets, routing leads, scoring documents for relevance. Submit at 18:00, results ready by morning.
  • Dataset labeling and synthetic data generation. Building eval sets, generating training pairs, augmenting RAG corpora with summaries, titles, or extracted entities.
  • Monthly content moderation sweeps. Re-scanning historical user-generated content against an updated policy. The job runs once a month, latency does not matter, volume is enormous.
  • Eval runs. Sweeping a new prompt or model across your full regression set of thousands of test cases.
  • RAG corpus enrichment. Pre-computing summaries, FAQ extractions, or chunk-level classifications for your knowledge base. Build once, serve cheaply forever.
  • Embeddings-style bulk semantic processing. When you would otherwise reach for embeddings + a classifier but actually need a language-model judgment per item, batch makes the LLM-per-item path affordable.

When NOT to batch

Just as important — do not use Batch for:

  • Interactive UX. Anything a user is waiting on. Chat, code completion, in-product agents. The 24-hour SLA is fatal here.
  • Time-sensitive workflows. Fraud decisions, real-time moderation of live posts, alerting pipelines.
  • Tiny jobs. If you have 50 requests, the operational overhead of uploading, polling, and joining results back is not worth the saved pennies. Batch starts paying off in the thousands.
  • Workloads with tight inter-request dependencies. If request N+1 needs the output of request N, you need a synchronous loop, not a batch.

Pricing math: a worked example

Take a realistic bulk classification job: 100,000 Sonnet 4.6 calls, 1,000 input tokens and 500 output tokens each. That is 100M input tokens and 50M output tokens total.

At synchronous rates this is a known number from your monthly invoice. Through the Batch API the same workload costs exactly half:

  • Synchronous: input cost + output cost = X.
  • Batch: X / 2.

For a job at the scale above, that 50% line item is typically the difference between "we'll do it once a quarter" and "we run this nightly." If you also cache a stable system prompt across all 100,000 requests, the cache discount stacks on the cached portion — pushing effective input cost down further still.
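To make the arithmetic concrete, here is a minimal sketch with placeholder rates (not current list prices; substitute the per-million-token numbers from Anthropic's pricing page):

# Placeholder rates for illustration only; use the real per-MTok prices.
INPUT_RATE = 3.00    # assumed USD per million input tokens
OUTPUT_RATE = 15.00  # assumed USD per million output tokens

input_mtok, output_mtok = 100, 50  # the 100M / 50M totals from the example above

sync_cost = input_mtok * INPUT_RATE + output_mtok * OUTPUT_RATE
batch_cost = sync_cost * 0.5  # 50% batch discount on both input and output
print(f"sync ${sync_cost:,.2f} vs batch ${batch_cost:,.2f}")
# With these placeholder rates: sync $1,050.00 vs batch $525.00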

Code: building a batch JSONL

Each line of the input file is a self-contained request with a custom_id you control. Through Claudexia, point at https://api.claudexia.tech/v1 — the Batch API is passed through transparently.

import json
from anthropic import Anthropic

client = Anthropic(
    base_url="https://api.claudexia.tech/v1",
    api_key="cxa_your_key_here",
)

# 1. Build the JSONL request file
CLASSIFIER_SYSTEM_PROMPT = (
    "You are a classifier. Label the record as billing, bug, "
    "feature_request, or other. Reply with the label only."
)  # stand-in for your real system prompt
records = load_records_to_classify()  # your 100k items

with open("batch.jsonl", "w") as f:
    for r in records:
        line = {
            "custom_id": f"record-{r['id']}",
            "params": {
                "model": "claude-sonnet-4.6",
                "max_tokens": 256,
                "system": [
                    {
                        "type": "text",
                        "text": CLASSIFIER_SYSTEM_PROMPT,
                        "cache_control": {"type": "ephemeral"},
                    }
                ],
                "messages": [
                    {"role": "user", "content": r["text"]},
                ],
            },
        }
        f.write(json.dumps(line) + "\n")

# 2. Submit the batch
with open("batch.jsonl") as f:
    batch = client.messages.batches.create(
        requests=[json.loads(line) for line in f]
    )
print("submitted:", batch.id)

Polling and result handling

Polling is dead simple. Check processing_status on a sane interval — every 60 seconds is plenty, every 5 minutes is more than enough for most jobs. Do not poll in a hot loop.

import time

while True:
    b = client.messages.batches.retrieve(batch.id)
    if b.processing_status == "ended":
        break
    time.sleep(60)

# Stream results line-by-line
for result in client.messages.batches.results(batch.id):
    cid = result.custom_id
    if result.result.type == "succeeded":
        message = result.result.message
        save_classification(cid, message.content[0].text)
    elif result.result.type == "errored":
        log_error(cid, result.result.error)
    elif result.result.type == "expired":
        # request didn't complete within 24h — requeue
        requeue(cid)

Per-line error handling

Crucially, the Batch API reports errors per request, not per batch. A handful of failed lines do not poison the rest. Your processing code must handle three terminal states for each custom_id (a fourth, canceled, only appears if you cancel the batch yourself):

  • succeeded — normal response, store and move on.
  • errored — model or validation error; log it, optionally retry the item as a one-off synchronous call (a sketch follows below), and continue.
  • expired — the SLA elapsed before this specific request was served; requeue it in the next batch.

Treat the result file as a stream. Do not load 100k responses into memory.
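For the errored lines, a minimal retry-synchronously fallback could look like the following; errored_records is a hypothetical helper standing in for whatever bookkeeping your log_error call feeds, and the Messages call itself is the ordinary synchronous API:

# Hypothetical fallback: replay errored custom_ids as ordinary synchronous calls.
for cid, record in errored_records():  # hypothetical: yields (custom_id, original record)
    msg = client.messages.create(
        model="claude-sonnet-4.6",
        max_tokens=256,
        system=CLASSIFIER_SYSTEM_PROMPT,
        messages=[{"role": "user", "content": record["text"]}],
    )
    save_classification(cid, msg.content[0].text)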

Stack with prompt caching for compounding savings

Batch and prompt caching are independent discounts. If your 100,000 requests share a long system prompt or document context, mark it with cache_control and the cached portion gets the cache discount on top of the 50% batch discount. For agent-style classifiers with a 4k token system prompt, this is the single biggest lever after switching to batch in the first place.
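For intuition on how the two discounts compound, here is a back-of-envelope sketch; the rate and the cache-read multiplier are illustrative placeholders, and the exact way the discounts combine should be checked against Anthropic's current pricing page:

# Illustrative arithmetic only; substitute real rates and multipliers.
BASE_INPUT_RATE = 3.00    # assumed USD per million input tokens
CACHE_READ_MULT = 0.10    # assumed cache-read multiplier vs. base input
BATCH_MULT = 0.50         # batch discount

cached_tokens = 4_000     # stable system prompt, read from cache on most requests
fresh_tokens = 1_000      # per-record text

per_request_input_cost = (
    (cached_tokens * CACHE_READ_MULT + fresh_tokens) * BASE_INPUT_RATE / 1_000_000
) * BATCH_MULT
print(f"~${per_request_input_cost:.6f} input cost per request")
# vs. ~$0.015 per request with no caching and no batch at the same illustrative rate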

Operational tip on Claudexia

Claudexia passes the Batch API through to Anthropic transparently — same endpoints, same JSONL format, same SDK. One operational note specific to a pay-per-token gateway: a batch is committed at submission time, so top up enough balance to cover the whole job before you call batches.create. The dashboard shows estimated cost per batch based on the input file; budget against that plus a safety margin for output tokens.
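One way to build that budget estimate yourself, assuming your gateway passes the token-counting endpoint through and accepts the same request shape, is to count input tokens for a sample of lines and extrapolate; output can only be bounded by the max_tokens you set:

# Rough pre-submit budget estimate (a sketch, not an exact bill).
with open("batch.jsonl") as f:
    lines = f.readlines()

sample = [json.loads(line) for line in lines[:100]]
sample_tokens = sum(
    client.messages.count_tokens(
        model=req["params"]["model"],
        system=req["params"]["system"],
        messages=req["params"]["messages"],
    ).input_tokens
    for req in sample
)

est_input_tokens = sample_tokens / len(sample) * len(lines)
max_output_tokens = len(lines) * 256  # ceiling set by max_tokens per request
print(f"~{est_input_tokens:,.0f} input tokens, <= {max_output_tokens:,} output tokens")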

Bottom line

If a workload does not need to answer in seconds, it should not be running on the synchronous API. Move overnight jobs, dataset processing, monthly sweeps, and eval runs to Batch, layer prompt caching on the stable parts, and reserve the synchronous API for anything a user is actually waiting for.