
Voice Agents with Claude in 2026: Best STT + Claude + TTS Stack

Claude has no native realtime audio mode — but pairing Deepgram or Whisper STT with Claude Sonnet and ElevenLabs TTS gives a sub-second voice agent. Architecture and code.

If you tried to build a voice agent this year, you ran into the same wall everyone else did: Claude does not have a native realtime audio API. There is no claude-realtime model that takes microphone bytes in and emits speech bytes out. OpenAI has one — gpt-4o-realtime — and Google has Gemini Live. Anthropic, as of this writing, does not.

That sounds like a problem. It isn't. The pipeline approach — STT → LLM → TTS — has gotten so fast in 2026 that a well-built Claude voice stack hits sub-700ms response latency, which is below the threshold humans perceive as "a pause." And you get Claude's response quality, which for most agentic and reasoning-heavy voice use cases (support, sales qualification, technical assistants) is materially better than GPT-4o-realtime's audio-native output.

This is the architecture we ship in production, the latency budget that makes it work, and the code to wire it up.

State of voice in 2026

Three architectures dominate:

1. Audio-native realtime (one socket). OpenAI Realtime API and Gemini Live. You open a WebSocket, stream PCM in, get PCM out. The model handles VAD, interruption, and turn-taking internally. Latency is excellent (~300–500ms end-to-end). The catch: you are locked into the audio-native model. With OpenAI that means GPT-4o. You don't get Claude Sonnet, you don't get o3, you don't get any non-OpenAI tool calling.

2. Cascaded pipeline (STT → LLM → TTS). Three providers, three streams, glued together. More moving parts. More latency in theory. But you pick best-of-breed at each layer, and in practice modern streaming makes it competitive.

3. Hybrid: realtime STT + LLM-as-text + realtime TTS. What this post is about. Use Deepgram Nova-3 or Whisper streaming for ears, Claude Sonnet 4.5 streaming for brain, ElevenLabs Flash for voice. Total round-trip in production: 600–800ms.

The reason pipeline wins for Claude users is simple: you cannot do option 1 with Claude. So the question is whether option 2/3 closes the latency gap enough that you don't pay for using Claude. It does.

Why Claude pipeline beats OpenAI Realtime for most agents

Audio-native models compress the LLM into the same forward pass as the speech encoder/decoder. That's how they get fast — but it also means the "reasoning" portion is shallower than text-mode equivalents. GPT-4o-realtime is noticeably weaker at multi-step reasoning, tool use, and instruction following than GPT-4o-text, and weaker still than Claude Sonnet 4.5.

For voice use cases that are mostly chitchat (companion bots, language tutoring), audio-native is fine. For voice use cases that require:

  • Reading and summarizing a customer's account in real time
  • Multi-tool agentic flows (look up order, check inventory, schedule callback)
  • Following a complex script or compliance rules
  • Long-context grounding on docs

— Claude Sonnet 4.5 in a pipeline beats GPT-4o-realtime on quality. The 200–400ms latency penalty vs. one-socket realtime is worth it.

The recommended stack

Mic → VAD (Silero) → STT (Deepgram Nova-3 streaming) →
  Claude Sonnet 4.5 (streaming) →
  TTS (ElevenLabs Flash v2.5) → Speaker

Why these picks:

Deepgram Nova-3 for STT. Streaming partial transcripts arrive at a ~150ms cadence, and the final transcript lands ~200ms after end of utterance. WER on conversational English is ~5%. Whisper Large v3 streaming via Groq is a credible alternative (cheaper, slightly higher latency, slightly higher WER). For non-English I lean Whisper.

Claude Sonnet 4.5 for the brain. Streamed via the Anthropic Messages API, time-to-first-token is typically 250–400ms on short prompts. We route through https://api.claudexia.tech/v1, which adds ~30ms of proxy overhead and gives us caching + observability.

ElevenLabs Flash v2.5 for TTS. ~200ms time-to-first-audio when fed token-by-token. The newer Turbo v3 is faster but voices are less natural; Flash is the sweet spot in 2026.

Silero VAD locally. Tiny ONNX model, 10ms frames, runs on CPU. Tells you when the user starts and stops talking so you don't ship silence to the STT and don't keep the LLM waiting on a non-utterance.
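
The bridge code later in this post leaves VAD out to stay short, so here is a minimal sketch of the endpointing loop it would sit behind. speechProbability is a stand-in for whatever Silero binding you run (an onnxruntime session, for example), not a real import; the 0.5 threshold and 500ms hangover mirror the defaults discussed in the latency section below.

// Sketch of VAD-driven endpointing. `speechProbability` is assumed, not a real package.
type SpeechProb = (frame: Float32Array) => Promise<number>;

class Endpointer {
  private speaking = false;
  private silentMs = 0;

  constructor(
    private speechProbability: SpeechProb,
    private opts = { threshold: 0.5, silenceMs: 500, frameMs: 10 },
  ) {}

  // Feed one 10ms frame; returns "start", "end", or null.
  async push(frame: Float32Array): Promise<"start" | "end" | null> {
    const prob = await this.speechProbability(frame);
    if (prob >= this.opts.threshold) {
      this.silentMs = 0;
      if (!this.speaking) {
        this.speaking = true;
        return "start";          // open the Deepgram stream / start buffering audio
      }
    } else if (this.speaking) {
      this.silentMs += this.opts.frameMs;
      if (this.silentMs >= this.opts.silenceMs) {
        this.speaking = false;
        return "end";            // end of utterance: finalize STT, call the LLM
      }
    }
    return null;
  }
}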

Latency budget

Where the ~700ms goes, end-of-user-speech to first-audio-out:

Stage                             Budget
VAD end-of-speech detection         80ms
Deepgram final transcript          150ms
Network: app → Claude               40ms
Claude time-to-first-token         300ms
Network: Claude → TTS               30ms
ElevenLabs time-to-first-audio     200ms
Audio buffer warmup                 50ms
Total perceived latency           ~700ms

(VAD endpointing and Deepgram's finalization run on the same audio in parallel, so the perceived wall-clock total comes in under the straight column sum.)
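
Those budgets are averages; it is worth timestamping each stage transition in your own deployment before tuning anything. A minimal sketch (the class and stage names are ours, not from any SDK):

import { performance } from "node:perf_hooks";

// Per-turn stage timer: call mark() at each transition, report() gives per-stage ms.
class TurnTimer {
  private marks: Array<[string, number]> = [];

  mark(stage: string) {
    this.marks.push([stage, performance.now()]);
  }

  report(): Record<string, number> {
    const out: Record<string, number> = {};
    for (let i = 1; i < this.marks.length; i++) {
      out[this.marks[i][0]] = Math.round(this.marks[i][1] - this.marks[i - 1][1]);
    }
    return out;
  }
}

// Usage in the bridge below: mark("speech_end"), mark("stt_final"),
// mark("llm_first_token"), mark("tts_first_audio"), then log report() per turn.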

You can shave this further:

  • Speculative TTS: start sending tokens to TTS as soon as Claude emits its first sentence-end, even before the LLM finishes. Saves ~300ms on long responses.
  • Predictive STT finalization: trigger the LLM call on a high-confidence partial instead of waiting for the final. Risky but cuts ~120ms.
  • Endpointing tuning: drop VAD silence threshold from 500ms to 300ms. Cuts 200ms of dead air, but increases false-trigger rate.

Aggressive tuning gets you to ~450ms. Below that you're competing with OpenAI Realtime on its home turf.
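
Of those three, predictive finalization is the least obvious to wire up. The sketch below shows the shape of the change to the Deepgram Results handler in the bridge further down: fire Claude on a high-confidence partial, then cancel and redo if the final transcript disagrees. startClaudeTurn is a hypothetical helper wrapping the Claude call and TTS loop, and the 0.9 confidence threshold is an assumption you would tune against your own traffic.

// Predictive finalization (sketch): act on a confident interim result instead of
// waiting for is_final. stt, startClaudeTurn, and abortLLM refer to the bridge code below.
declare const stt: { on(event: string, cb: (data: any) => void): void };
declare function startClaudeTurn(transcript: string): void;
declare let abortLLM: AbortController | null;

let speculative: string | null = null;

stt.on("Results", (data: any) => {
  const alt = data.channel.alternatives[0];
  if (!alt.transcript) return;

  // Fire early on a confident partial
  if (!data.is_final && alt.confidence > 0.9 && !speculative) {
    speculative = alt.transcript;
    startClaudeTurn(alt.transcript);
    return;
  }

  if (data.is_final) {
    // Speculation was wrong: cancel the in-flight turn and redo with the real transcript
    if (speculative && speculative !== alt.transcript) {
      abortLLM?.abort();
      startClaudeTurn(alt.transcript);
    }
    speculative = null;
  }
});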

Interruption and barge-in

The hardest part of a voice agent isn't latency — it's interruption. The user starts talking while the bot is talking. The bot has to:

  1. Detect speech onset within ~100ms (VAD on the input mic, not gated on output).
  2. Stop TTS playback immediately (kill the audio buffer, not just stop generating).
  3. Cancel the in-flight Claude request (AbortController on the SDK call).
  4. Discard the partial response or fold it into context as "[interrupted]".

The state machine looks like:

IDLE → LISTENING → THINKING → SPEAKING → (back to LISTENING)
          ↑                       │
          └────── BARGE-IN ───────┘

A common bug: the bot's own audio leaking into the mic and triggering a self-interrupt. Use echo cancellation (WebRTC AEC if browser, RNNoise if server-side) and a confidence threshold on the VAD.
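
In code, the state machine itself is small; the work is in the teardown each transition triggers. A sketch, where stopPlayback and cancelClaude are placeholders for the steps listed above rather than real functions:

// Turn-taking state machine (sketch). stopPlayback / cancelClaude stand in for the
// concrete teardown: flush the client audio buffer, abort the in-flight Claude call.
type TurnState = "IDLE" | "LISTENING" | "THINKING" | "SPEAKING";

declare function stopPlayback(): void;
declare function cancelClaude(): void;

let state: TurnState = "IDLE";

function onSpeechStart() {            // VAD on the raw mic, never gated on output
  if (state === "THINKING" || state === "SPEAKING") {
    stopPlayback();                   // kill the audio buffer, not just the TTS stream
    cancelClaude();                   // AbortController on messages.stream()
  }
  state = "LISTENING";
}

function onFinalTranscript() { state = "THINKING"; }
function onFirstAudioOut()   { state = "SPEAKING"; }
function onPlaybackDone()    { state = "LISTENING"; }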

Code: WebSocket bridge in TypeScript

Minimal end-to-end loop. Browser sends mic PCM to server, server orchestrates STT/LLM/TTS, streams audio back.

import Anthropic from "@anthropic-ai/sdk";
import { createClient as deepgram } from "@deepgram/sdk";
import { ElevenLabsClient } from "elevenlabs";
import WebSocket from "ws";

const claude = new Anthropic({
  baseURL: "https://api.claudexia.tech/v1",
  apiKey: process.env.CLAUDEXIA_API_KEY,
});
const dg = deepgram(process.env.DEEPGRAM_KEY);
const tts = new ElevenLabsClient({ apiKey: process.env.ELEVEN_KEY });

const wss = new WebSocket.Server({ port: 8080 });

wss.on("connection", (client) => {
  const history: Array<{ role: "user" | "assistant"; content: string }> = [];
  let abortLLM: AbortController | null = null;

  const stt = dg.listen.live({
    model: "nova-3",
    language: "en-US",
    smart_format: true,
    interim_results: true,
    endpointing: 300,
    encoding: "linear16", // browser sends raw 16-bit PCM, so declare the format
    sample_rate: 16000,
  });

  stt.on("Results", async (data: any) => {
    const transcript = data.channel.alternatives[0].transcript;
    if (!transcript) return;

    // Barge-in: user spoke while we were speaking
    if (abortLLM) {
      abortLLM.abort();
      client.send(JSON.stringify({ type: "stop_audio" }));
    }

    if (!data.is_final) return;

    history.push({ role: "user", content: transcript });
    abortLLM = new AbortController();

    const stream = await claude.messages.stream(
      {
        model: "claude-sonnet-4-5",
        max_tokens: 1024,
        system: "You are a helpful voice assistant. Keep replies under 2 sentences.",
        messages: history,
      },
      { signal: abortLLM.signal },
    );

    let buffer = "";
    let assistantText = "";
    try {
      for await (const chunk of stream) {
        if (chunk.type !== "content_block_delta") continue;
        const delta = (chunk.delta as any).text ?? "";
        buffer += delta;
        assistantText += delta;

        // Flush to TTS on sentence boundary; await so audio reaches the client in order
        const sentenceEnd = buffer.match(/[.!?]\s/);
        if (sentenceEnd) {
          const sentence = buffer.slice(0, sentenceEnd.index! + 1);
          buffer = buffer.slice(sentenceEnd.index! + 2);
          await streamTTS(sentence, client);
        }
      }
      if (buffer.trim()) await streamTTS(buffer, client);
      history.push({ role: "assistant", content: assistantText });
    } catch {
      // Barge-in aborted the stream mid-flight; fold the partial turn into context
      history.push({
        role: "assistant",
        content: assistantText ? `${assistantText} [interrupted]` : "[interrupted]",
      });
    } finally {
      abortLLM = null;
    }
  });

  client.on("message", (raw) => {
    // Browser sends raw 16-bit PCM @ 16kHz
    stt.send(raw);
  });
});

async function streamTTS(text: string, client: WebSocket) {
  const audio = await tts.textToSpeech.convertAsStream(
    "21m00Tcm4TlvDq8ikWAM", // voice id
    {
      text,
      model_id: "eleven_flash_v2_5",
      output_format: "pcm_16000",
    },
  );
  for await (const chunk of audio) {
    client.send(chunk);
  }
}

This is roughly 100 lines and gets you a working voice loop with barge-in. Production code adds reconnect logic, audio resampling, prompt caching on Claude, and metrics.
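
The server half assumes the browser is shipping raw 16-bit PCM at 16kHz. For completeness, a minimal browser-side capture sketch: ScriptProcessorNode is deprecated (an AudioWorklet is the modern path), playback of the returned audio is omitted, and forcing the AudioContext to 16kHz works in Chromium but may need a resample step elsewhere.

// Browser side (sketch): capture mic audio, convert Float32 → 16-bit PCM, ship it over the socket.
const ws = new WebSocket("ws://localhost:8080");
ws.binaryType = "arraybuffer";

async function startMic() {
  const media = await navigator.mediaDevices.getUserMedia({
    audio: { echoCancellation: true, noiseSuppression: true }, // AEC guards against self-barge-in
  });
  const ctx = new AudioContext({ sampleRate: 16000 }); // Chromium honors this; elsewhere, resample
  const source = ctx.createMediaStreamSource(media);
  const proc = ctx.createScriptProcessor(2048, 1, 1);

  proc.onaudioprocess = (e) => {
    const f32 = e.inputBuffer.getChannelData(0);
    const pcm = new Int16Array(f32.length);
    for (let i = 0; i < f32.length; i++) {
      const s = Math.max(-1, Math.min(1, f32[i]));
      pcm[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
    }
    if (ws.readyState === WebSocket.OPEN) ws.send(pcm.buffer);
  };

  source.connect(proc);
  proc.connect(ctx.destination); // keeps the processor alive in some browsers
}

startMic();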

When to use OpenAI Realtime instead

Don't use Claude pipeline if:

  • GPT-4o quality is sufficient for your use case (most consumer chitchat is fine).
  • You want one socket and one vendor for operational simplicity.
  • You need < 400ms end-to-end and can't speculative-stream.
  • You don't need Anthropic-specific features (Claude's tool use, long context, MCP).

Use Claude pipeline if:

  • The agent is doing real reasoning, multi-step tool calls, or compliance-bound flows.
  • You need 300K+ context for grounding.
  • You already have a Claude text agent and want to add voice without re-implementing the brain.
  • You want to swap STT/TTS providers independently (e.g. Whisper for non-English, regional TTS for compliance).

Bottom line

In 2026 the choice is not "Claude or realtime" — it's "Claude pipeline or GPT-4o realtime." For agentic voice work where the response actually has to be correct, Claude pipeline wins on quality and the 200–400ms latency tax is invisible to users. For pure conversational latency-at-all-costs, OpenAI Realtime is still the play.

When Anthropic ships a native realtime audio mode — and they will — this whole post becomes obsolete. Until then, Deepgram + Claude Sonnet 4.5 + ElevenLabs Flash is the stack to beat.

For Claude pricing details see our Claude API pricing breakdown for 2026.