Voice AI in 2026: The Complete Stack From Whisper to Speaker

ASR, LLM, TTS — How to Wire Them Into a Single Low-Latency Pipeline

Every week someone asks me the same question: “I want to build a voice AI agent. Where do I start?”

The answer used to be complicated. You needed to stitch together six different services, manage WebSocket connections, handle audio encoding, deal with silence detection — and somehow make it all feel real-time.

In 2026, the stack has matured. The pieces fit together better. But the documentation hasn’t caught up. Most tutorials show you how to call Whisper or ElevenLabs in isolation. Nobody shows you how to wire the full pipeline end to end.

This article fixes that. One complete stack. Every layer explained. Working code.

The Five Layers

Every voice AI system has the same five layers, whether it’s a phone bot, a smart speaker, or an in-app assistant:

Layer 1: Audio Capture  → Getting clean audio from the user
Layer 2: Speech-to-Text → Converting audio to text (ASR)
Layer 3: Understanding  → Processing text and deciding what to do (LLM)
Layer 4: Text-to-Speech → Converting the response back to audio (TTS)
Layer 5: Audio Playback → Delivering audio to the user

Simple in theory. The devil is in the connections between layers.

Layer 1: Audio Capture (The Part Everyone Skips)

Most tutorials start at ASR. That’s a mistake. Bad audio in means bad transcription out.

What matters:

  • Sample rate: 16kHz minimum for speech. Most ASR models expect 16kHz mono. Sending 44.1kHz stereo wastes bandwidth and adds latency from resampling.
  • Chunk size: Stream audio in small chunks — 20 to 100ms per chunk. Larger chunks add latency. Smaller chunks add overhead.
  • Noise handling: Users talk in cars, cafes, and open offices. Without preprocessing, your ASR will transcribe background conversations, music, and keyboard clicks.
import asyncio
import sounddevice as sd

SAMPLE_RATE = 16000
CHUNK_DURATION_MS = 80
CHUNK_SIZE = int(SAMPLE_RATE * CHUNK_DURATION_MS / 1000)  # 1280 samples

loop = asyncio.get_event_loop()  # the loop that runs the ASR coroutines

def audio_callback(indata, frames, time, status):
    # sounddevice invokes this on its own thread; hand the chunk
    # over to the asyncio loop that owns the ASR connection
    audio_chunk = indata[:, 0].tobytes()  # Mono channel, raw 16-bit PCM
    asyncio.run_coroutine_threadsafe(send_to_asr(audio_chunk), loop)

stream = sd.InputStream(
    samplerate=SAMPLE_RATE,
    channels=1,
    dtype='int16',
    blocksize=CHUNK_SIZE,
    callback=audio_callback
)

Production tip: Add a noise gate. If the audio energy is below a threshold, don’t send it to ASR. This reduces unnecessary API calls and prevents the ASR from transcribing silence as “um” or random words.

Layer 2: Speech-to-Text (ASR)

This is where audio becomes text. The two models that matter in 2026:

Deepgram Nova-3 — The production choice. Streaming API with 200 to 300ms latency. Supports interim results (partial transcriptions while the user is still speaking). Best price-to-performance ratio.

OpenAI Whisper — The open-source choice. Run it locally for zero API cost. But real-time streaming requires extra work — Whisper is designed for batch transcription, not streaming.

from deepgram import DeepgramClient, LiveOptions

deepgram = DeepgramClient(api_key)
connection = deepgram.listen.live.v("1")

options = LiveOptions(
    model="nova-3",
    language="en",
    smart_format=True,
    interim_results=True,   # Get partial results as user speaks
    utterance_end_ms=1000,  # Detect end of utterance
    vad_events=True         # Voice activity detection built in
)

connection.on("transcript", handle_transcript)
connection.on("utterance_end", handle_utterance_end)
await connection.start(options)

# Feed audio chunks as they come in
async def send_to_asr(audio_chunk):
    connection.send(audio_chunk)

The critical detail: interim results. Without these, you wait for the user to finish their entire sentence before processing starts. With interim results, your LLM can start “thinking” while the user is still talking. This alone saves 300 to 500ms.

Layer 3: The Brain (LLM)

The LLM receives transcribed text and decides what to do — respond directly, call a tool, ask a clarifying question, or escalate to a human.

What matters for voice:

  • Time to first token (TTFT): In a text chat, 500ms TTFT is fine. In voice, it’s noticeable dead air. Target under 300ms.
  • Streaming: You must stream tokens. Waiting for the complete response before starting TTS adds one to three seconds.
  • Short responses: Train your LLM (via system prompt) to keep responses concise. A 200-word response takes 10 seconds to speak. Nobody wants that.
async def get_llm_response(transcript, conversation_state):
    current_sentence = ""

    async for token in llm.stream(
        model="gpt-4o",
        system="""You are a voice assistant. Keep responses under
        2-3 sentences. Be conversational, not formal. Never use
        bullet points or numbered lists — speak naturally.""",
        messages=build_messages(transcript, conversation_state)
    ):
        current_sentence += token

        # Send complete sentences to TTS immediately. Tokens can be
        # multi-character, so check the last character, not the token.
        if current_sentence and current_sentence[-1] in ".!?":
            await send_to_tts(current_sentence)
            current_sentence = ""

    # Flush any trailing text that didn't end with punctuation
    if current_sentence.strip():
        await send_to_tts(current_sentence)
The sentence-boundary trick: Don’t wait for the full LLM response. As soon as a complete sentence is generated, send it to TTS immediately. The user hears the first sentence while the LLM is still generating the second. This is the single biggest latency win in the entire pipeline.

Layer 4: Text-to-Speech (TTS)

Text becomes audio. The landscape changed dramatically in 2025 and 2026.

Cartesia Sonic 3 — My current pick. Sub-200ms latency with streaming support. The voice quality is close to human. Supports input streaming — you can feed partial text and get audio back before the sentence is complete.

ElevenLabs Turbo v3 — The most natural-sounding option. Slightly higher latency (300 to 400ms) but the voice quality is unmatched for certain use cases.

from cartesia import CartesiaClient

cartesia = CartesiaClient(api_key)

async def send_to_tts(text):
    audio_stream = cartesia.tts.stream(
        text=text,
        voice_id="your-voice-id",
        model="sonic-3",
        output_format="pcm_16000"  # Match your audio output
    )

    async for audio_chunk in audio_stream:
        await play_audio(audio_chunk)

Voice cloning warning: Both Cartesia and ElevenLabs support voice cloning. If you’re building a product, use their stock voices or get explicit consent for cloned voices. The legal landscape around synthetic voice is evolving fast.

Layer 5: Audio Playback (The Last Mile)

Getting audio to the user’s ears. This varies by platform:

  • Phone calls: Use Twilio or Vonage media streams. Audio goes over WebSocket as raw PCM or mulaw.
  • Web apps: Use the Web Audio API. Buffer incoming audio chunks and play them sequentially.
  • Mobile apps: Platform-specific audio APIs (AVAudioEngine on iOS, AudioTrack on Android).
import audioop  # removed in Python 3.13; the audioop-lts package restores it
import base64
import json

# For Twilio phone calls — send audio over WebSocket
async def play_audio(audio_chunk):
    # Convert 16-bit linear PCM to mulaw for Twilio.
    # Note: Twilio media streams expect 8kHz mulaw — resample first
    # if your TTS output is 16kHz.
    mulaw_audio = audioop.lin2ulaw(audio_chunk, 2)
    base64_audio = base64.b64encode(mulaw_audio).decode()

    await websocket.send(json.dumps({
        "event": "media",
        "streamSid": stream_sid,
        "media": {
            "payload": base64_audio
        }
    }))

The Glue: Voice Activity Detection

VAD is the invisible component that makes everything work. Without it, your agent doesn’t know when the user started or stopped speaking.

Silero VAD is the industry standard. It’s a tiny neural network (under 1MB) that runs in real-time on CPU.

import torch

# Tiny (<1MB) pretrained VAD model, loaded from torch.hub
model, utils = torch.hub.load(
    'snakers4/silero-vad', 'silero_vad'
)

def is_speech(audio_chunk) -> bool:
    # Silero expects float32 samples normalized to [-1, 1];
    # scale int16 PCM before calling the model
    tensor = torch.FloatTensor(audio_chunk) / 32768.0
    confidence = model(tensor, SAMPLE_RATE).item()
    return confidence > 0.5

What VAD gives you:

  • Start of speech: Stop playing agent audio (barge-in support).
  • End of speech: Trigger the LLM to start generating a response.
  • Silence filtering: Don’t send empty audio to ASR.
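Turn-taking logic built on those three signals can be sketched as a small state machine. The callback names here are illustrative placeholders for your own pipeline hooks, and the silence window is an assumed tuning value:

```python
class TurnTaker:
    """Tracks speech state from per-chunk VAD decisions and fires
    barge-in / end-of-turn callbacks (hypothetical hook names)."""

    def __init__(self, on_barge_in, on_utterance_end, silence_chunks=12):
        self.on_barge_in = on_barge_in            # stop agent playback
        self.on_utterance_end = on_utterance_end  # trigger the LLM
        self.silence_chunks = silence_chunks      # ~1s at 80ms chunks
        self.user_speaking = False
        self.silent = 0

    def feed(self, speech_detected: bool, agent_talking: bool):
        """Call once per audio chunk with the VAD result."""
        if speech_detected:
            if not self.user_speaking:
                self.user_speaking = True
                if agent_talking:
                    self.on_barge_in()  # user interrupted the agent
            self.silent = 0
        elif self.user_speaking:
            self.silent += 1
            if self.silent >= self.silence_chunks:
                self.user_speaking = False
                self.on_utterance_end()  # user finished their turn
```

In practice you would also debounce the start-of-speech edge (require two or three consecutive speech chunks) so a cough doesn't cut the agent off.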

Putting It All Together

Here is the complete flow, happening in real-time:

User speaks
→ Audio captured at 16kHz, 80ms chunks
→ VAD filters silence
→ Deepgram streams partial transcripts
→ Utterance end detected
→ LLM starts generating (streaming)
→ First complete sentence sent to Cartesia
→ Audio starts playing to user

Total time from user stops speaking to audio starts: 600-900ms
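That 600-900ms figure is roughly the sum of the per-stage latencies quoted above. Here is an illustrative budget — the endpoint-detection range is my assumption, and none of these are measurements:

```python
# Illustrative per-stage latency budget in milliseconds,
# using ranges quoted earlier in the article (not measured values)
budget = {
    "VAD / endpoint detection":   (150, 200),
    "ASR final transcript":       (200, 300),
    "LLM time to first sentence": (150, 250),
    "TTS first audio chunk":      (100, 150),
}

low = sum(lo for lo, hi in budget.values())
high = sum(hi for lo, hi in budget.values())
print(f"end-to-end: {low}-{high}ms")
```

Budgeting per stage like this tells you where to optimize first: the two biggest line items (ASR and LLM) are where streaming and interim results pay off most.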

The frameworks that handle this wiring for you:

  • Pipecat — Open-source, Python-based. Best for custom pipelines. I contribute to this one.
  • LiveKit Agents — Managed infrastructure. Best if you don’t want to handle WebRTC yourself.
  • Retell AI — Fully managed. Best if you want zero infrastructure work but less customization.

The Three Mistakes That Will Cost You Weeks

Mistake one: Not streaming end to end. If any layer in your pipeline waits for the previous layer to complete, you’ve added seconds of latency. Every layer must stream.

Mistake two: Ignoring audio encoding. Sending 44.1kHz WAV to an ASR that expects 16kHz PCM means your audio gets resampled on the server. That’s 100 to 200ms wasted per request. Match your formats.
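Matching formats client-side is cheap. A minimal downmix-and-resample sketch using linear interpolation (fine for speech; use a proper polyphase filter for quality-critical paths):

```python
import numpy as np

def to_16k_mono(pcm: bytes, src_rate: int = 44100, channels: int = 2) -> bytes:
    """Convert 16-bit PCM to the 16kHz mono most ASR APIs expect."""
    samples = np.frombuffer(pcm, dtype=np.int16).reshape(-1, channels)
    mono = samples.mean(axis=1)  # downmix to one channel
    n_out = int(len(mono) * 16000 / src_rate)
    x_old = np.linspace(0, 1, num=len(mono), endpoint=False)
    x_new = np.linspace(0, 1, num=n_out, endpoint=False)
    resampled = np.interp(x_new, x_old, mono)  # linear-interp resample
    return resampled.astype(np.int16).tobytes()
```

Better still: capture at 16kHz mono in the first place (as in the Layer 1 code) so no conversion is needed at all.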

Mistake three: Building without VAD. Without VAD, your agent either interrupts the user constantly or waits too long to respond. Get VAD working before you touch anything else.

Start Here

If I were starting from zero today:

  1. Get Silero VAD running locally. Feed it your microphone. Make sure it detects speech starts and stops accurately.
  2. Connect Deepgram’s streaming API. Send audio chunks, receive transcripts.
  3. Wire up Claude or GPT-4o with streaming. Feed transcripts in, stream tokens out.
  4. Connect Cartesia’s streaming TTS. Feed sentences in, play audio out.
  5. Add barge-in support — when VAD detects user speech while the agent is talking, stop playback.

Five steps. Each one is a standalone afternoon project. Together, they’re a production voice agent.

Follow me for more practical voice AI content. I write about the messy reality of building real-time AI systems.



Voice AI in 2026: The Complete Stack From Whisper to Speaker was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.
