If your Claude-powered product feels sluggish, the problem is almost never total throughput — it is time to first token (TTFT). A non-streaming completion that takes 8 seconds to return 600 tokens feels broken. The exact same 600 tokens, streamed, feels instant because the first words land in under 400 ms and the rest scrolls in like a person typing. Streaming does not make the model faster; it collapses perceived latency from total latency down to TTFT, which is the metric your users actually feel.
This guide covers how Claude streams responses over Server-Sent Events
(SSE), the Anthropic event taxonomy you must handle, and production
patterns in TypeScript (Next.js Edge runtime) and Python (httpx async).
All examples target Claudexia's Anthropic-compatible gateway at
https://api.claudexia.tech/v1, which is a drop-in for api.anthropic.com/v1.
## TTFT vs total latency
When you benchmark a Claude call, record two numbers:
- TTFT — milliseconds until the first content delta arrives.
- Total — milliseconds until `message_stop`.
For a Sonnet call producing ~800 output tokens, TTFT is typically 300–600 ms while total is 4–8 seconds. If you are not streaming, your users wait the full 8 seconds staring at a spinner. If you are streaming, they see text in 400 ms and read along as it generates. The total tokens-per-second (TPS) is identical either way; you are buying perception, not throughput.
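To see the split yourself, time both numbers in one pass. A minimal sketch using the official `anthropic` Python SDK pointed at the gateway above (the key and prompt are placeholders):

```python
import time

import anthropic

client = anthropic.Anthropic(
    api_key="YOUR_KEY",  # placeholder
    base_url="https://api.claudexia.tech/v1",
)

start = time.monotonic()
ttft = None
# The SDK's stream() helper is a context manager; text_stream yields
# only the text deltas, which is all we need for timing.
with client.messages.stream(
    model="claude-sonnet-4.6",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Explain SSE in one paragraph."}],
) as stream:
    for _text in stream.text_stream:
        if ttft is None:
            ttft = time.monotonic() - start  # first content delta
total = time.monotonic() - start  # context manager exits after message_stop
print(f"TTFT {(ttft or total) * 1000:.0f} ms / total {total * 1000:.0f} ms")
```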
## The SSE wire format
Server-Sent Events is a one-way HTTP streaming protocol. The response
has `Content-Type: text/event-stream` and the body is a sequence of
records separated by blank lines. Each record looks like:
```
event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"Hello"}}
```
Two rules trip up everyone:
- Records are terminated by two newlines (`\n\n`), not one.
- A single logical event may contain multiple `data:` lines, which are joined with `\n` by the consumer before JSON-parsing.
Most SDKs hide this for you, but the moment you parse SSE by hand — inside an Edge Worker, in Go, or for debugging — you must respect it.
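For illustration, a minimal record grouper in Python that respects both rules (a sketch: it consumes any iterable of already-decoded lines and yields `(event, data)` pairs ready for `json.loads`):

```python
def iter_sse_records(lines):
    """Group decoded SSE lines into (event_name, data) records."""
    event, data_lines = None, []
    for line in lines:
        if line == "":
            # A blank line terminates the record (the \n\n rule).
            if data_lines:
                yield event or "message", "\n".join(data_lines)
            event, data_lines = None, []
        elif line.startswith("event:"):
            event = line.split(":", 1)[1].strip()
        elif line.startswith("data:"):
            value = line.split(":", 1)[1]
            # The spec strips exactly one leading space after the colon.
            data_lines.append(value[1:] if value.startswith(" ") else value)
```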
## Anthropic event types
Claude's stream emits a small, well-defined set of events. Handle each one explicitly; do not assume order beyond what the spec guarantees.
- `message_start` — initial message envelope with `id`, `model`, and a zeroed `usage` block. Capture the message id here for logging.
- `content_block_start` — a new content block begins. Blocks can be `text`, `tool_use`, `thinking`, or `redacted_thinking`. Index matters if the model emits multiple blocks.
- `content_block_delta` — incremental payload. For text it carries `text_delta`; for tool calls it carries `input_json_delta` (partial JSON fragments you must concatenate before parsing).
- `content_block_stop` — the block is finished.
- `message_delta` — top-level message updates, most importantly `stop_reason` (`end_turn`, `max_tokens`, `tool_use`, `stop_sequence`) and final `usage` counters.
- `message_stop` — terminal event. Close your reader.
- `ping` — keep-alive sent every ~15 seconds. Ignore the payload but do not close the stream; pings exist precisely so reverse proxies do not kill an idle connection.
- `error` — yes, errors can arrive mid-stream as a regular SSE event (overloaded model, content policy stop, upstream timeout). Your handler must treat error-as-event the same as a thrown error.
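One explicit branch per event keeps this honest. A Python sketch of such a dispatcher (the `state` dict is illustrative; field accesses follow the event shapes described above):

```python
def handle_event(name: str, data: dict, state: dict) -> None:
    if name == "message_start":
        state["message_id"] = data["message"]["id"]  # capture for logging
    elif name == "content_block_start":
        state.setdefault("blocks", {})[data["index"]] = []
    elif name == "content_block_delta":
        # text_delta or input_json_delta; accumulate now, parse later
        state["blocks"][data["index"]].append(data["delta"])
    elif name == "content_block_stop":
        pass  # block data["index"] is now complete
    elif name == "message_delta":
        state["stop_reason"] = data["delta"].get("stop_reason")
        state["usage"] = data.get("usage")  # final counters
    elif name == "message_stop":
        state["done"] = True  # close your reader
    elif name == "ping":
        pass  # keep-alive: ignore the payload, keep the stream open
    elif name == "error":
        # Mid-stream errors arrive as regular events, not transport failures.
        raise RuntimeError(data["error"].get("message", "stream error"))
```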
## TypeScript: Next.js Edge Route Handler
The cleanest pattern for a Next.js app is an Edge Route Handler that
proxies the Claude stream straight to the browser. Edge runtime gives
you native ReadableStream plumbing and no cold-start tax.
```ts
// app/api/chat/route.ts
import Anthropic from "@anthropic-ai/sdk";

export const runtime = "edge";

const client = new Anthropic({
  apiKey: process.env.CLAUDEXIA_API_KEY!,
  baseURL: "https://api.claudexia.tech/v1",
});

export async function POST(req: Request) {
  const { messages } = await req.json();

  const controller = new AbortController();
  req.signal.addEventListener("abort", () => controller.abort());

  const stream = client.messages.stream(
    {
      model: "claude-sonnet-4.6",
      max_tokens: 1024,
      messages,
    },
    { signal: controller.signal },
  );

  const encoder = new TextEncoder();
  const body = new ReadableStream({
    async start(ctrl) {
      try {
        for await (const event of stream) {
          if (event.type === "content_block_delta" && event.delta.type === "text_delta") {
            ctrl.enqueue(encoder.encode(event.delta.text));
          }
        }
      } catch (err) {
        ctrl.enqueue(encoder.encode(`\n[error] ${(err as Error).message}`));
      } finally {
        ctrl.close();
      }
    },
    cancel() {
      controller.abort();
    },
  });

  return new Response(body, {
    headers: {
      "Content-Type": "text/plain; charset=utf-8",
      "Cache-Control": "no-cache, no-transform",
      "X-Accel-Buffering": "no",
    },
  });
}
```
Three details that bite teams in production:
- `X-Accel-Buffering: no` disables nginx response buffering. Without it, your stream gets buffered into a single chunk and TTFT collapses back to the non-streaming case.
- `Cache-Control: no-transform` prevents intermediaries from gzipping and re-chunking the response.
- Wiring `req.signal` into an `AbortController` is what lets you cancel the upstream Claude call when the browser tab closes. Without it you keep paying for tokens nobody will read.
## Python: httpx async streaming
For backend-only services, the official anthropic Python SDK already
streams. When you need lower-level control — a custom proxy, a
broadcast fan-out, or instrumentation — drop to httpx directly.
```python
import json

import httpx

URL = "https://api.claudexia.tech/v1/messages"

async def stream_claude(prompt: str, api_key: str):
    headers = {
        "x-api-key": api_key,
        "anthropic-version": "2023-06-01",
        "content-type": "application/json",
        "accept": "text/event-stream",
    }
    payload = {
        "model": "claude-sonnet-4.6",
        "max_tokens": 1024,
        "stream": True,
        "messages": [{"role": "user", "content": prompt}],
    }
    async with httpx.AsyncClient(timeout=None) as client:
        async with client.stream("POST", URL, headers=headers, json=payload) as resp:
            resp.raise_for_status()
            event_name = None
            async for line in resp.aiter_lines():
                if not line:
                    event_name = None
                    continue
                if line.startswith("event: "):
                    event_name = line[7:].strip()
                elif line.startswith("data: "):
                    data = json.loads(line[6:])
                    if event_name == "content_block_delta":
                        delta = data.get("delta", {})
                        if delta.get("type") == "text_delta":
                            yield delta["text"]
                    elif event_name == "error":
                        raise RuntimeError(data.get("error", {}).get("message", "stream error"))
                    elif event_name == "message_stop":
                        return
```
Notes:
- `timeout=None` on the client is mandatory. The default 5-second read timeout will kill any stream longer than five seconds.
- `aiter_lines()` handles the `\r\n` vs `\n` normalisation for you.
- The blank-line check resets `event_name`, which matters because SSE records are separated by an empty line and the next record may omit the `event:` field (defaulting to `message`).
## Backpressure and cancellation
Claude can produce tokens faster than your downstream consumer can write them — to the browser, to a database, to another service. If you do not respect backpressure, the runtime queues bytes in memory and you get OOMs at scale.
In Node, `ReadableStream` with its default queuing strategy gives you
backpressure for free. In Python, prefer `async for` over collecting
into a list. For broadcast fan-out (one Claude stream → many WebSocket
clients), use a bounded `asyncio.Queue` per client and drop the slow
client, never the upstream.
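A sketch of that fan-out shape (assumptions: `upstream` is any async iterator of text chunks, such as the `stream_claude` generator above, and the queue bound is illustrative):

```python
import asyncio

async def fan_out(upstream, clients: dict[str, asyncio.Queue]) -> None:
    """Broadcast one Claude stream to many consumers."""
    async for chunk in upstream:
        for client_id, queue in list(clients.items()):
            try:
                queue.put_nowait(chunk)
            except asyncio.QueueFull:
                # Bounded queue is full: this consumer is too slow.
                clients.pop(client_id)  # drop the client, never the upstream
    for queue in clients.values():
        try:
            queue.put_nowait(None)  # sentinel: stream finished
        except asyncio.QueueFull:
            pass
```

Each consumer then reads from its own bounded queue — `asyncio.Queue(maxsize=256)`, say — until it receives the `None` sentinel.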
Cancellation is the other half. When the user closes the tab, you must abort the upstream request:
- TypeScript: forward `request.signal` into the SDK's `signal` option.
- Python: wrap the body loop in `try`/`finally` and rely on httpx's context manager to close the connection — or call `await resp.aclose()` explicitly when an upstream client disconnects (see the sketch after this list).
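Here is what the Python side can look like end to end, assuming a FastAPI/Starlette server in front of the `stream_claude` generator above (`is_disconnected()` is Starlette's client-gone check; the route and prompt are placeholders):

```python
from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse

app = FastAPI()

@app.post("/chat")
async def chat(request: Request):
    agen = stream_claude("hello", api_key="...")  # generator from the httpx example

    async def body():
        try:
            async for text in agen:
                if await request.is_disconnected():
                    break  # stop paying for tokens nobody reads
                yield text
        finally:
            # Closing the generator unwinds its `async with` blocks,
            # which closes the upstream httpx connection.
            await agen.aclose()

    return StreamingResponse(body(), media_type="text/plain")
```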
You pay for every token the model generates, even ones nobody reads. Cancellation is a cost-control feature, not just a UX nicety.
## Common bugs to avoid
- **Forgetting to flush.** Express, Fastify, and any custom Node framework will buffer writes by default. Set `Content-Type: text/event-stream`, `Cache-Control: no-cache`, `X-Accel-Buffering: no`, and call `res.flushHeaders()` immediately.
- **Missing pings → 60-second proxy timeout.** Cloudflare, nginx, and most ALBs will close idle HTTP connections after 60 seconds. Claude's `ping` events keep the connection warm; if you filter them out and your model is thinking for a while before producing text, the proxy cuts you off. Either pass pings through, or emit your own keep-alive comments (`: keepalive\n\n`) every 15 seconds.
- **Concatenating tool-use JSON wrong.** `input_json_delta` chunks are partial JSON fragments. You must accumulate the string and only `JSON.parse` once `content_block_stop` fires for that block (see the sketch after this list).
- **Treating mid-stream errors as transport errors.** An `error` event is a normal SSE record, not an HTTP failure. Your reader will not throw — you have to check the event type and raise yourself.
- **Logging deltas synchronously.** `console.log` on every `text_delta` tanks throughput. Buffer logs and emit on `message_stop`.
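The tool-use rule from the list above, as a small Python sketch (`buffers` is an illustrative module-level dict; `partial_json` is the field that carries each fragment):

```python
import json

buffers: dict[int, list[str]] = {}  # block index -> partial JSON fragments

def on_tool_event(name: str, data: dict) -> None:
    delta = data.get("delta", {})
    if name == "content_block_delta" and delta.get("type") == "input_json_delta":
        buffers.setdefault(data["index"], []).append(delta["partial_json"])
    elif name == "content_block_stop" and data.get("index") in buffers:
        # Only now is the accumulated string guaranteed to be complete JSON.
        tool_input = json.loads("".join(buffers.pop(data["index"])) or "{}")
        print("tool input:", tool_input)
```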
## Bottom line
Streaming is the single highest-leverage UX change you can make to a
Claude integration. Use the SDK where you can, drop to raw SSE when
you must, and remember: handle pings, abort on disconnect, and never
trust the default proxy timeouts. Point your `base_url` at
https://api.claudexia.tech/v1, keep your existing Anthropic SDK code,
and ship.