How to Build a Real-Time Voice Agent with Pipecat
Two years ago, building a production-grade voice bot meant hand-rolling WebSocket state machines, managing audio buffers yourself, and debugging race conditions between your speech-to-text and your LLM. Today you can wire it all together in about 100 lines of Python.
This tutorial uses Pipecat — Daily.co’s open-source Voice AI framework — to build a real-time voice agent that listens, thinks, and speaks. We’ll use AssemblyAI’s Universal-3 Pro Streaming model for speech-to-text, GPT-4o for the language model, and Cartesia Sonic for text-to-speech.
By the end you’ll have a working voice agent running in a WebRTC room with proper turn detection, interruption handling, and live transcript logging.
Why Pipecat?
Before Pipecat, the “standard” approach to voice agents involved something like:
- Open a WebSocket to a streaming STT provider
- Stream audio chunks, receive partial transcripts
- Detect end-of-turn with a VAD or timer hack
- Fire off an LLM request
- Stream TTS audio back while buffering it into a playable format
- Hope none of that races with an interruption
Every new voice agent project started with the same 300-line infrastructure file that nobody wanted to maintain.
Pipecat solves this by giving you a typed, event-driven pipeline abstraction. You define a sequence of processors — transport, STT, LLM, TTS, transport output — and Pipecat handles frame routing, backpressure, interruptions, and lifecycle. Swap one processor for another without touching anything else.
The framework is designed specifically for voice AI (not a general pub/sub system bolted onto audio), which means it gets the hard parts right: partial transcripts don’t prematurely trigger LLM calls, TTS can be interrupted mid-sentence, and audio is buffered correctly for WebRTC.
Architecture Overview
Our agent has four moving parts:
Daily.co WebRTC room
        │ audio
        ▼
Pipecat pipeline
┌───────────────────────────────────────────────┐
│  transport.input()                            │
│        │                                      │
│  AssemblyAI Universal-3 Pro Streaming (STT)   │
│        │ transcript + turn signal             │
│  TranscriptProcessor                          │
│        │                                      │
│  OpenAI GPT-4o (streaming)                    │
│        │ text chunks                          │
│  Cartesia Sonic (TTS)                         │
│        │ audio                                │
│  transport.output()                           │
└───────────────────────────────────────────────┘
- Daily.co handles WebRTC — browser-compatible, no STUN/TURN configuration required
- AssemblyAI Universal-3 Pro Streaming handles STT with neural turn detection
- GPT-4o handles the conversation
- Cartesia Sonic handles TTS
The key architectural detail is that AssemblyAI emits a turn-end signal — not just raw transcript text. The pipeline only forwards a completed user turn to the LLM when AssemblyAI’s neural model is confident the speaker has finished. This is meaningfully different from VAD (voice activity detection), which triggers on silence, not on conversational intent.
Prerequisites
- Python 3.11+
- AssemblyAI API key — app.assemblyai.com
- Daily.co API key — dashboard.daily.co
- OpenAI API key
- Cartesia API key — play.cartesia.ai
Step 1: Install Dependencies
python -m venv .venv && source .venv/bin/activate
pip install "pipecat-ai[assemblyai,cartesia,openai,daily,silero]>=0.0.47" python-dotenv loguru
The pipecat-ai extras pull in the official service integrations. The silero extra installs the Silero VAD model, which we use for pre-filtering silence before sending audio to AssemblyAI.
Create a .env file:
ASSEMBLYAI_API_KEY=your_key_here
DAILY_API_KEY=your_key_here
OPENAI_API_KEY=your_key_here
CARTESIA_API_KEY=your_key_here
Step 2: Create a Daily.co Room
Daily.co rooms are ephemeral — you create them via API, get a URL, and use that URL to join. Here’s a helper script:
# create_room.py
import os
import time

import requests
from dotenv import load_dotenv

load_dotenv()


def create_room() -> str:
    resp = requests.post(
        "https://api.daily.co/v1/rooms",
        headers={"Authorization": f"Bearer {os.environ['DAILY_API_KEY']}"},
        # exp is an absolute Unix timestamp, so expire the room one hour from now
        json={"properties": {"enable_prejoin_ui": False, "exp": int(time.time()) + 3600}},
    )
    resp.raise_for_status()
    url = resp.json()["url"]
    print(f"Room created: {url}")
    return url


if __name__ == "__main__":
    create_room()
Run it:
python create_room.py
# Room created: https://your-subdomain.daily.co/some-room-name
Keep that URL — you’ll pass it to the bot in the next step.
Step 3: Build the Bot
Here’s the complete agent. We’ll walk through each section after the full listing.
# bot.py
"""
Voice agent using Pipecat + AssemblyAI Universal-3 Pro Streaming.

Stack:
    Transport — Daily.co WebRTC
    STT       — AssemblyAI Universal-3 Pro Streaming (u3-rt-pro)
    LLM       — OpenAI GPT-4o with streaming
    TTS       — Cartesia Sonic
"""
import asyncio
import os
import sys

from dotenv import load_dotenv
from loguru import logger

from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.frames.frames import EndFrame
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineParams, PipelineTask
from pipecat.processors.aggregators.openai_llm_context import OpenAILLMContext
from pipecat.processors.transcript_processor import TranscriptProcessor
from pipecat.services.assemblyai.stt import AssemblyAISTTService, AssemblyAIConnectionParams
from pipecat.services.cartesia import CartesiaTTSService
from pipecat.services.openai import OpenAILLMService
from pipecat.transports.services.daily import DailyParams, DailyTransport

load_dotenv()

logger.remove(0)
logger.add(sys.stderr, level="DEBUG")

SYSTEM_PROMPT = """
You are a friendly, helpful voice assistant.
Keep responses under 2–3 sentences. Speak naturally — no markdown, no lists, no bullet points.
""".strip()


async def main(room_url: str, token: str | None = None):
    # Transport
    transport = DailyTransport(
        room_url,
        token,
        "Voice Assistant",
        DailyParams(
            audio_out_enabled=True,
            transcription_enabled=False,  # We use AssemblyAI, not Daily transcription
            vad_enabled=True,
            vad_analyzer=SileroVADAnalyzer(),
            vad_audio_passthrough=True,
        ),
    )

    # STT: AssemblyAI Universal-3 Pro Streaming
    stt = AssemblyAISTTService(
        connection_params=AssemblyAIConnectionParams(
            api_key=os.environ["ASSEMBLYAI_API_KEY"],
            speech_model="u3-rt-pro",
            end_of_turn_confidence_threshold=0.7,
            min_end_of_turn_silence_when_confident=300,
            max_turn_silence=1000,
        )
    )

    # LLM
    llm = OpenAILLMService(api_key=os.environ["OPENAI_API_KEY"], model="gpt-4o")
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    context = OpenAILLMContext(messages)
    context_aggregator = llm.create_context_aggregator(context)

    # TTS
    tts = CartesiaTTSService(
        api_key=os.environ["CARTESIA_API_KEY"],
        voice_id="79a125e8-cd45-4c13-8a67-188112f4dd22",
    )

    # Transcript logging
    transcript = TranscriptProcessor()

    @transcript.event_handler("on_transcript_update")
    async def on_transcript_update(processor, frame):
        for msg in frame.messages:
            logger.info(f"[{msg.role}] {msg.content}")

    # Pipeline
    pipeline = Pipeline(
        [
            transport.input(),
            stt,
            transcript.user(),
            context_aggregator.user(),
            llm,
            tts,
            transport.output(),
            context_aggregator.assistant(),
            transcript.assistant(),
        ]
    )

    task = PipelineTask(
        pipeline,
        PipelineParams(allow_interruptions=True),
    )

    @transport.event_handler("on_first_participant_joined")
    async def on_first_participant_joined(transport, participant):
        await transport.capture_participant_transcription(participant["id"])
        await task.queue_frames([context_aggregator.user().get_context_frame()])
        logger.info(f"Participant joined: {participant['id']}")

    @transport.event_handler("on_participant_left")
    async def on_participant_left(transport, participant, reason):
        await task.queue_frame(EndFrame())

    runner = PipelineRunner()
    await runner.run(task)


if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser(description="Pipecat voice agent")
    parser.add_argument("--url", required=True, help="Daily.co room URL")
    parser.add_argument("--token", default=None, help="Daily.co meeting token (optional)")
    args = parser.parse_args()

    asyncio.run(main(args.url, args.token))
Step 4: Run It
python bot.py --url https://your-subdomain.daily.co/some-room-name
Open that URL in your browser, grant microphone access, and start talking. The bot will respond in real time.
Code Walkthrough
The Transport Layer
transport = DailyTransport(
    room_url,
    token,
    "Voice Assistant",
    DailyParams(
        audio_out_enabled=True,
        transcription_enabled=False,
        vad_enabled=True,
        vad_analyzer=SileroVADAnalyzer(),
        vad_audio_passthrough=True,
    ),
)
We disable Daily’s built-in transcription (transcription_enabled=False) because we’re routing audio to AssemblyAI instead. Silero VAD runs locally to filter out silence before it ever leaves the machine — this reduces latency and API cost.
The vad_audio_passthrough=True flag is important: even though VAD is gating what gets counted as speech, the raw audio still flows through to AssemblyAI. The STT model does its own analysis on that audio; VAD just keeps us from processing dead silence.
Turn Detection
stt = AssemblyAISTTService(
    connection_params=AssemblyAIConnectionParams(
        api_key=os.environ["ASSEMBLYAI_API_KEY"],
        speech_model="u3-rt-pro",
        end_of_turn_confidence_threshold=0.7,
        min_end_of_turn_silence_when_confident=300,
        max_turn_silence=1000,
    )
)
The three turn detection parameters work together:
- end_of_turn_confidence_threshold=0.7 — the model needs 70% confidence that the user has finished their turn before it signals turn end. Lower this if you want faster (but potentially premature) responses.
- min_end_of_turn_silence_when_confident=300 — even when confidence is high, wait at least 300ms of silence before firing. Prevents cutting off speakers who pause for emphasis.
- max_turn_silence=1000 — after 1 second of silence, force a turn end regardless of confidence. The safety valve.
This is the part that actually matters for voice agent quality. Most latency problems in voice agents aren’t the STT or LLM — they’re bad turn detection that either interrupts the user too early or waits too long after they’ve finished.
Neural turn detection (which AssemblyAI calls “Universal Turn Detection”) differs from VAD-based approaches in that it’s trained to understand conversational intent, not just audio energy. It handles cases like trailing off mid-sentence but clearly completing a thought, or pausing to think before continuing.
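To make the interplay of these three parameters concrete, here is a toy, pure-Python model of the decision. This is not AssemblyAI's implementation (the real confidence score comes from their neural model); `should_end_turn` and its inputs are purely illustrative:

```python
def should_end_turn(confidence: float, silence_ms: int,
                    threshold: float = 0.7,
                    min_silence_when_confident: int = 300,
                    max_silence: int = 1000) -> bool:
    """Toy model of how the three turn-end parameters interact."""
    # Safety valve: after max_silence ms of silence, end the turn no matter what.
    if silence_ms >= max_silence:
        return True
    # Confident path: high confidence still waits out a short silence window.
    if confidence >= threshold and silence_ms >= min_silence_when_confident:
        return True
    return False

# Confident speaker who paused 400ms: turn ends.
print(should_end_turn(confidence=0.9, silence_ms=400))   # True
# Mid-thought pause (low confidence, short silence): keep listening.
print(should_end_turn(confidence=0.3, silence_ms=500))   # False
# Long silence forces a turn end regardless of confidence.
print(should_end_turn(confidence=0.3, silence_ms=1200))  # True
```

The useful intuition: confidence decides *whether* to end the turn quickly, silence decides *when*, and the max-silence cap guarantees the agent never stalls forever.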
The Pipeline
pipeline = Pipeline(
    [
        transport.input(),
        stt,
        transcript.user(),
        context_aggregator.user(),
        llm,
        tts,
        transport.output(),
        context_aggregator.assistant(),
        transcript.assistant(),
    ]
)
Pipecat pipelines are lists of processors. Frames flow left to right (or down the list). Each processor handles the frame types it understands and passes everything else through.
The ordering after the LLM matters: TTS comes before transport.output() because the LLM emits text chunks that need to be synthesized before they can be played. The context aggregators bookend the LLM so conversation history is maintained correctly.
allow_interruptions=True in PipelineTask lets the user cut off the agent mid-response — the TTS stream is cancelled, the audio buffer is flushed, and the pipeline is ready for new input.
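That pass-through behavior is worth internalizing. The toy sketch below (plain Python, not Pipecat's actual frame or processor classes) shows the core idea: a processor acts on the frame types it knows and forwards everything else unchanged:

```python
from dataclasses import dataclass

@dataclass
class AudioFrame:
    samples: bytes

@dataclass
class TextFrame:
    text: str

class UppercaseTextProcessor:
    """Acts only on TextFrame; passes all other frame types through untouched."""
    def process(self, frame):
        if isinstance(frame, TextFrame):
            return TextFrame(frame.text.upper())
        return frame  # pass-through: the default for frames a processor doesn't handle

proc = UppercaseTextProcessor()
frames = [TextFrame("hello"), AudioFrame(b"\x00\x01"), TextFrame("world")]
out = [proc.process(f) for f in frames]
print([f.text if isinstance(f, TextFrame) else "audio" for f in out])
# ['HELLO', 'audio', 'WORLD']
```

This is why reordering or swapping processors is safe: a TTS service never sees raw audio frames as something it must handle, and transcript loggers can sit anywhere downstream of the frames they care about.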
Tuning for Your Use Case
Domain-Specific Vocabulary
AssemblyAI’s keyterm prompting lets you boost recognition accuracy for specialized terms without retraining anything:
stt = AssemblyAISTTService(
    connection_params=AssemblyAIConnectionParams(
        api_key=os.environ["ASSEMBLYAI_API_KEY"],
        speech_model="u3-rt-pro",
        keyterms_prompt=["Kubernetes", "kubectl", "etcd", "kube-proxy"],
    )
)
Up to 1,000 terms per session. Useful for medical, legal, financial, and developer tooling applications where the default model might mishear jargon.
Multi-Speaker Sessions
connection_params=AssemblyAIConnectionParams(
    api_key=os.environ["ASSEMBLYAI_API_KEY"],
    speech_model="u3-rt-pro",
    speaker_labels=True,
    max_speakers=2,
)
Each transcript frame will include a speaker ID, which you can use to route different speakers to different LLM contexts or handle them differently in your logic.
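One way to act on those speaker IDs is to maintain a separate message history per speaker. The sketch below uses plain dicts rather than Pipecat frame objects, so the field names are illustrative:

```python
from collections import defaultdict

# One conversation history per speaker ID (structure is illustrative).
histories: dict[str, list[dict]] = defaultdict(list)

def route_transcript(speaker: str, text: str) -> list[dict]:
    """Append a finished turn to that speaker's history and return the history."""
    histories[speaker].append({"role": "user", "content": text})
    return histories[speaker]

route_transcript("A", "What's our deploy status?")
route_transcript("B", "Ignore that, check the logs first.")
route_transcript("A", "And the rollback plan?")

print(len(histories["A"]), len(histories["B"]))  # 2 1
```

From here you could feed each history to its own LLM context, or tag turns with the speaker label in a single shared context.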
Auto-Detect Language
connection_params=AssemblyAIConnectionParams(
    api_key=os.environ["ASSEMBLYAI_API_KEY"],
    speech_model="u3-rt-pro",
    language_detection=True,
)
AssemblyAI will detect and switch languages per turn. Supported: English, Spanish, French, German, Italian, Portuguese.
Deploying to Production
For development, running bot.py locally and joining from the browser is fine. For production you need the bot process running somewhere reachable.
PipecatCloud is the managed option — it handles scaling, session management, and routing:
pip install pipecatcloud
pcc auth login
pcc init
pcc secrets set my-agent-secrets --file .env
pcc deploy
If you’re self-hosting, the bot is a standard Python process that can run in a container. The only requirement is network egress to the Daily.co, AssemblyAI, OpenAI, and Cartesia APIs.
What to Build Next
This agent is intentionally minimal. Here are the obvious extensions:
Memory across sessions — persist the conversation context to a database keyed by user ID and reload it on reconnect. The OpenAILLMContext object is just a list of messages; serialize and deserialize it however you want.
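A minimal sketch of that round trip, assuming the context can be reduced to a plain list of message dicts (the exact accessor on OpenAILLMContext may differ between Pipecat versions):

```python
import json
from pathlib import Path

def save_context(user_id: str, messages: list[dict], directory: Path = Path("sessions")) -> Path:
    """Persist a conversation's message list, keyed by user ID."""
    directory.mkdir(exist_ok=True)
    path = directory / f"{user_id}.json"
    path.write_text(json.dumps(messages))
    return path

def load_context(user_id: str, directory: Path = Path("sessions")) -> list[dict]:
    """Reload a prior conversation, or start fresh if none exists."""
    path = directory / f"{user_id}.json"
    if path.exists():
        return json.loads(path.read_text())
    return []

messages = [
    {"role": "system", "content": "You are a friendly, helpful voice assistant."},
    {"role": "user", "content": "Remember that my name is Sam."},
]
save_context("user-123", messages)
print(load_context("user-123")[1]["content"])  # Remember that my name is Sam.
```

In production you would swap the JSON files for a database and call `load_context` when the participant joins, before building the OpenAILLMContext.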
Function calling — OpenAILLMService supports tool use. Define tools to let the agent look up information, trigger actions, or call external APIs mid-conversation.
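The tool definitions themselves are standard OpenAI function-calling schemas. A hypothetical weather lookup might be declared like this; the Pipecat-side registration call is omitted because its exact signature varies by version:

```python
# OpenAI-style tool schema; "get_weather" is a hypothetical example tool.
weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name, e.g. 'Austin'"},
            },
            "required": ["city"],
        },
    },
}

print(weather_tool["function"]["name"])  # get_weather
```

For voice, keep tool results short and speakable: the LLM's follow-up response goes straight to TTS, so a ten-row table read aloud is a bad experience.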
Custom TTS voice — Cartesia lets you clone voices. If you’re building a branded assistant, you can swap the voice_id for a custom cloned voice.
Swap the LLM — Pipecat has built-in integrations for Anthropic, Google Gemini, and others. Swap out OpenAILLMService for a different provider without touching the rest of the pipeline.
Analytics — the TranscriptProcessor gives you a clean stream of conversation turns. Feed it to a logging system, a database, or an analytics pipeline.
Full Code
The complete working example is on GitHub: https://github.com/kelsey-aai/voice-agent-pipecat-universal-3-pro
It includes a .env.example, the create_room.py helper, and the full bot.py with comments.
Resources
- Pipecat documentation
- Pipecat GitHub
- AssemblyAI Pipecat integration guide
- AssemblyAI Universal-3 Pro Streaming
- Daily.co documentation
- Cartesia documentation