PERFORMANCE

Claude API Latency Tuning in 2026: Cut TTFT and Total Time by 60%

Streaming, prompt caching, model choice, and concurrency together can cut Claude API latency by 60%. Here is the production playbook with measurements.

Latency is the silent tax on every Claude-powered product. You can ship the smartest agent in the world, but if the first token takes three seconds to appear, users will think it is broken. In 2026, with Sonnet 4.6 and Opus 4.6 in production and agent fan-out becoming the default architecture, latency tuning has stopped being an optimisation and started being a requirement. This post is the playbook we use internally and recommend to every team building on the Claude API through Claudexia.

The anatomy of a Claude API request

Before tuning anything, you need a shared vocabulary for what "latency" even means. There are three numbers worth tracking:

  • TTFT (time to first token) — the wall-clock time from the moment your client sends the request to the moment the first streamed token arrives. This is what the user perceives as "did it start?".
  • Total time — TTFT plus the time spent generating the rest of the response. As a rough model: total = TTFT + output_tokens / tokens_per_second.
  • Tail latency (p99) — the slowest 1% of your requests. Tail is what determines whether your timeouts fire and your queues back up. A median of 800ms with a p99 of 12s is a worse system than a median of 1.2s with a p99 of 2.5s.

If you are only watching averages, you are flying blind. Track all three per model, per region, per prompt template.
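
The rough model for total time is worth encoding once so you can compare the estimate against what you actually measure per prompt template. A minimal sketch, using the Sonnet figures measured in the next section as example inputs:

// Rough latency model: total ≈ TTFT + output_tokens / tokens_per_second
function estimateTotalMs(ttftMs: number, outputTokens: number, tokensPerSec: number): number {
  return ttftMs + (outputTokens / tokensPerSec) * 1000
}

// A 500-token answer at 95 tokens/sec with a 380ms TTFT:
estimateTotalMs(380, 500, 95) // ≈ 5643ms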

Per-model TTFT, measured

We run continuous probes from an EU PoP against the Claudexia gateway, which sits in the same region. The numbers below are 30-day medians for small prompts (under 1K input tokens) with no caching:

  • Haiku — TTFT around 250ms, generation around 180 tokens/sec.
  • Sonnet 4.6 — TTFT around 380ms, generation around 95 tokens/sec.
  • Opus 4.6 — TTFT around 700ms, generation around 55 tokens/sec.

The pattern is consistent: bigger model, longer TTFT, slower stream. Opus is roughly 3× slower per token than Haiku. For interactive UIs this is the single biggest knob you have. A classification step that runs on Opus "because the team standardised on Opus" is probably costing you a second of perceived latency for no quality benefit.
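
One cheap way to act on this is an explicit task-to-model routing table, so a standardised default never silently applies to trivial steps. A minimal sketch; the Haiku and Sonnet IDs match the code later in the post, while the Opus ID and the task names are illustrative assumptions:

type Task = "classify" | "summarise" | "deep_reasoning"

// Route cheap steps to cheap models; reserve the big model for work that needs it
const MODEL_FOR_TASK: Record<Task, string> = {
  classify: "claude-haiku-4",        // ~250ms TTFT, plenty for a label
  summarise: "claude-sonnet-4.6",    // ~380ms TTFT
  deep_reasoning: "claude-opus-4.6", // ~700ms TTFT, only where the quality shows
}

const model = MODEL_FOR_TASK["classify"]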

Why streaming halves perceived latency

Non-streaming requests block until the full response is generated. For a 500-token answer on Sonnet, that is roughly 380ms + 500/95 ≈ 5.6s before your UI shows anything. Streaming flips this: the UI can start rendering at TTFT (380ms) and the user reads the response as it is generated. The total wall time is identical, but the perceived latency drops from 5.6s to 380ms. Users consistently rate streamed responses as "fast" even when the total time is longer than a non-streaming alternative.

There is essentially no reason to call the Messages API without stream: true in a user-facing path in 2026. The only exceptions are batch jobs and tool-call orchestration where you need the full message before acting.
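
A minimal streaming sketch: the client setup is the same as in the full example later in the post, and the process.stdout.write call stands in for appending to a message element in a real UI.

const stream = await client.messages.stream({
  model: "claude-sonnet-4.6",
  max_tokens: 500,
  messages: [{ role: "user", content: "Explain TTFT in two sentences." }],
})

for await (const event of stream) {
  // Render each text delta as it arrives; TTFT is the only wait the user perceives
  if (event.type === "content_block_delta" && event.delta.type === "text_delta") {
    process.stdout.write(event.delta.text)
  }
}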

Prompt caching cuts TTFT by ~50% on cache hits

Anthropic's prompt caching is usually pitched as a cost optimisation — cached input tokens are billed at roughly 10% of the normal rate — but its latency effect is just as important. For requests that hit a cached prefix, TTFT drops by roughly 50% because the model does not need to re-process the cached tokens. On Sonnet, a 20K-token system prompt that normally adds ~600ms to TTFT collapses to ~300ms once cached.

If you have a long system prompt, a tool schema, or a stable few-shot block, mark it with cache_control: { type: "ephemeral" } and watch your TTFT drop on the second and subsequent requests. The five-minute TTL sounds short, but it refreshes every time the cached prefix is read, so a busy production system keeps the prefix warm and hits the cache on almost every request.
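
A sketch of what that looks like for a long system prompt plus a cached tool definition; bigSystemPrompt, the lookup_order tool, and userQuestion are placeholders standing in for your own stable prefix:

await client.messages.stream({
  model: "claude-sonnet-4.6",
  max_tokens: 500,
  system: [
    // Stable prefix: written to the cache on the first request, read on later ones
    { type: "text", text: bigSystemPrompt, cache_control: { type: "ephemeral" } },
  ],
  tools: [
    {
      name: "lookup_order",
      description: "Look up an order by ID",
      input_schema: { type: "object", properties: { id: { type: "string" } } },
      cache_control: { type: "ephemeral" },
    },
  ],
  messages: [{ role: "user", content: userQuestion }],
})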

Output token count is the biggest lever on total time

The single most underused optimisation is just generating fewer tokens. Every output token on Sonnet costs ~10ms of wall time. A response that drops from 800 tokens to 300 tokens saves five seconds. Three concrete techniques:

  1. Set max_tokens tightly. Do not pass max_tokens: 4096 "just in case". Pass the actual ceiling for the task — 200 for a classification reason, 500 for a summary, 1500 for a code edit.
  2. Instruct conciseness. Add "Respond in under 100 words" or "Output only the JSON object, no explanation" to the system prompt. Models in 2026 are very good at following length constraints.
  3. Use stop sequences. If you know the response ends with </answer> or }, pass it as a stop sequence. The model halts immediately instead of generating trailing whitespace or a closing pleasantry.

Concurrency: agent fan-out with Promise.all

If your agent needs to call the model three times for independent sub-tasks (extract entities, score sentiment, draft summary), do not serialise them. Sequential calls add their TTFTs together; parallel calls hide them behind the slowest one.

const [entities, sentiment, summary] = await Promise.all([
  client.messages.create({ model: "claude-haiku-4", ... }),
  client.messages.create({ model: "claude-haiku-4", ... }),
  client.messages.create({ model: "claude-sonnet-4.6", ... }),
])

The total time is now the latency of the slowest call, max(TTFT_a + generation_a, TTFT_b + generation_b, TTFT_c + generation_c), instead of the sum of all three. For a three-step agent this typically halves total latency.

Geography matters more than people admit

Anthropic's primary inference is US-hosted. A round-trip from Frankfurt to us-east-1 adds roughly 90ms per request before any model work begins. For a streaming UI making 10 sequential calls during a session, that is nearly a second of pure network latency.

The Claudexia EU PoP terminates the TLS handshake in-region and proxies to the model with a warm, pooled connection. For European clients this typically removes 60–120ms from TTFT compared to calling Anthropic directly across the Atlantic. It is the cheapest latency win available — no code changes required, just a different base URL.

Code: baseline vs streaming vs cached vs concise

Here is the same task — summarise a 5K-token document — at four optimisation levels, against https://api.claudexia.tech/v1:

import Anthropic from "@anthropic-ai/sdk"

// The 5K-token input text from the example; assume it is loaded elsewhere in your app
declare const longDocument: string

const client = new Anthropic({
  apiKey: process.env.CLAUDEXIA_API_KEY,
  baseURL: "https://api.claudexia.tech/v1",
})

// 1. Baseline: no streaming, no cache, no max_tokens discipline
// Measured: TTFT 4.2s (because non-streaming TTFT == total time)
await client.messages.create({
  model: "claude-sonnet-4.6",
  max_tokens: 4096,
  messages: [{ role: "user", content: longDocument + "\n\nSummarise." }],
})

// 2. Add streaming. Total time unchanged, perceived TTFT drops to 0.5s
const stream = await client.messages.stream({
  model: "claude-sonnet-4.6",
  max_tokens: 4096,
  messages: [{ role: "user", content: longDocument + "\n\nSummarise." }],
})

// 3. Add prompt caching on the document
await client.messages.stream({
  model: "claude-sonnet-4.6",
  max_tokens: 4096,
  system: [
    { type: "text", text: longDocument, cache_control: { type: "ephemeral" } },
  ],
  messages: [{ role: "user", content: "Summarise the document above." }],
})

// 4. Add concise instruction + tight max_tokens
await client.messages.stream({
  model: "claude-sonnet-4.6",
  max_tokens: 300,
  stop_sequences: ["</summary>"],
  system: [
    { type: "text", text: longDocument, cache_control: { type: "ephemeral" } },
  ],
  messages: [{
    role: "user",
    content: "Summarise in under 150 words, wrap in <summary>...</summary>.",
  }],
})

Measurement methodology

Tune what you measure. The numbers worth instrumenting on every request are TTFT, total time, and output token count:

const t0 = performance.now()
let ttft: number | null = null
let outputTokens = 0

const stream = await client.messages.stream({ ... })
for await (const event of stream) {
  if (event.type === "content_block_delta" && ttft === null) {
    ttft = performance.now() - t0
  }
  if (event.type === "message_delta") {
    outputTokens = event.usage.output_tokens
  }
}
const total = performance.now() - t0

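// metrics stands in for whatever stats client you already use (StatsD, Datadog, etc.);
// model and cached are tags set earlier in the request handler.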
if (ttft !== null) metrics.histogram("claude.ttft_ms", ttft, { model, cached })
metrics.histogram("claude.total_ms", total, { model, cached })
metrics.histogram("claude.output_tokens", outputTokens, { model })

Capture TTFT and total separately. Tag by model, by whether the cache hit, and by prompt template. Without this you cannot tell whether your optimisations actually worked.

Bottom line: 4.2s → 1.6s on a real workload

The example workload above — summarise a 5K-token document with Sonnet — goes from a baseline of 4.2 seconds total wall time to 1.6 seconds after applying streaming, caching, a tight max_tokens, and serving through the Claudexia EU PoP. That is a 62% reduction in total time and a 91% reduction in perceived TTFT (4.2s to 380ms).

None of these techniques are exotic. They are all available in the official SDK, documented in Anthropic's API reference, and require no infrastructure changes beyond pointing your baseURL at https://api.claudexia.tech/v1. The teams that ship the fastest Claude-powered products in 2026 are not using a secret model — they are just disciplined about TTFT, output length, and cache hit rate.

If your p99 latency is still measured in seconds, start with streaming, then caching, then output discipline, then geography. In that order. You will get most of the win from the first two.