How to Build a Real-Time Voice Agent with Pipecat
Two years ago, building a production-grade voice bot meant hand-rolling WebSocket state machines, managing audio buffers yourself, and debugging race conditions between your speech-to-text and your LLM. Today you can wire it all together in about 100 lines of Python.
This tutorial uses Pipecat — Daily.co’s open-source Voice AI framework — to build a real-time voice agent that listens, thinks, and speaks. We’ll use AssemblyAI’s Universal-3 Pro Streaming model for speech-to-text, GPT-4o for the language model, and Cartesia Sonic for text-to-speech.
By the end you’ll have a working voice agent running in a WebRTC room with proper turn detection, interruption handling, and live transcript logging.
Why Pipecat?
Before Pipecat, the “standard” approach to voice agents involved something like:
- Open a WebSocket to a streaming STT provider
- Stream audio chunks, receive partial transcripts
- Detect end-of-turn with a VAD or timer hack
- Fire off an LLM request
- Stream TTS audio back while buffering it into a playable format
- Hope none of that races with an interruption
Every new voice agent project started with the same 300-line infrastructure file that nobody wanted to maintain.
Pipecat solves this by giving you a typed, event-driven pipeline abstraction. You define a sequence of processors — transport, STT, LLM, TTS, transport output — and Pipecat handles frame routing, backpressure, interruptions, and lifecycle. Swap one processor for another without touching anything else.
The framework is designed specifically for voice AI (not a general pub/sub system bolted onto audio), which means it gets the hard parts right: partial transcripts don’t prematurely trigger LLM calls, TTS can be interrupted mid-sentence, and audio is buffered correctly for WebRTC.
Architecture Overview
Our agent has four moving parts:
Daily.co WebRTC room
        │ audio
        ▼
Pipecat pipeline
┌───────────────────────────────────────────────┐
│  transport.input()                            │
│        │                                      │
│  AssemblyAI Universal-3 Pro Streaming (STT)   │
│        │ transcript + turn signal             │
│  TranscriptProcessor                          │
│        │                                      │
│  OpenAI GPT-4o (streaming)                    │
│        │ text chunks                          │
│  Cartesia Sonic (TTS)                         │
│        │ audio                                │
│  transport.output()                           │
└───────────────────────────────────────────────┘
- Daily.co handles WebRTC — browser-compatible, no STUN/TURN configuration required
- AssemblyAI Universal-3 Pro Streaming handles STT with neural turn detection
- GPT-4o handles the conversation
- Cartesia Sonic handles TTS
The key architectural detail is that AssemblyAI emits a turn-end signal — not just raw transcript text. The pipeline only forwards a completed user turn to the LLM when AssemblyAI’s neural model is confident the speaker has finished. This is meaningfully different from VAD (voice activity detection), which triggers on silence, not on conversational intent.
Prerequisites
- Python 3.11+
- AssemblyAI API key — app.assemblyai.com
- Daily.co API key — dashboard.daily.co
- OpenAI API key
- Cartesia API key — play.cartesia.ai
Step 1: Install Dependencies
python -m venv .venv && source .venv/bin/activate
pip install "pipecat-ai[assemblyai,cartesia,openai,daily,silero]>=0.0.47" python-dotenv loguru
The pipecat-ai extras pull in the official service integrations. The silero extra installs the Silero VAD model, which we use for pre-filtering silence before sending audio to AssemblyAI.
Create a .env file:
ASSEMBLYAI_API_KEY=your_key_here
DAILY_API_KEY=your_key_here
OPENAI_API_KEY=your_key_here
CARTESIA_API_KEY=your_key_here
Step 2: Create a Daily.co Room
Daily.co rooms are ephemeral — you create them via API, get a URL, and use that URL to join. Here’s a helper script:
# create_room.py
import os
import time

import requests
from dotenv import load_dotenv

load_dotenv()


def create_room() -> str:
    resp = requests.post(
        "https://api.daily.co/v1/rooms",
        headers={"Authorization": f"Bearer {os.environ['DAILY_API_KEY']}"},
        # exp is an absolute Unix timestamp, so expire the room one hour from now
        json={"properties": {"enable_prejoin_ui": False, "exp": int(time.time()) + 3600}},
    )
    resp.raise_for_status()
    url = resp.json()["url"]
    print(f"Room created: {url}")
    return url


if __name__ == "__main__":
    create_room()
Run it:
python create_room.py
# Room created: https://your-subdomain.daily.co/some-room-name
Keep that URL — you’ll pass it to the bot in the next step.
Step 3: Build the Bot
Here’s the complete agent. We’ll walk through each section after the full listing.
# bot.py
"""
Voice agent using Pipecat + AssemblyAI Universal-3 Pro Streaming.

Stack:
    Transport — Daily.co WebRTC
    STT       — AssemblyAI Universal-3 Pro Streaming (u3-rt-pro)
    LLM       — OpenAI GPT-4o with streaming
    TTS       — Cartesia Sonic
"""
import asyncio
import os
import sys

from dotenv import load_dotenv
from loguru import logger

from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.frames.frames import EndFrame
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineParams, PipelineTask
from pipecat.processors.aggregators.openai_llm_context import OpenAILLMContext
from pipecat.processors.transcript_processor import TranscriptProcessor
from pipecat.services.assemblyai.stt import AssemblyAISTTService, AssemblyAIConnectionParams
from pipecat.services.cartesia import CartesiaTTSService
from pipecat.services.openai import OpenAILLMService
from pipecat.transports.services.daily import DailyParams, DailyTransport

load_dotenv()

logger.remove(0)
logger.add(sys.stderr, level="DEBUG")

SYSTEM_PROMPT = """
You are a friendly, helpful voice assistant.
Keep responses under 2–3 sentences. Speak naturally — no markdown, no lists, no bullet points.
""".strip()


async def main(room_url: str, token: str | None = None):
    # Transport
    transport = DailyTransport(
        room_url,
        token,
        "Voice Assistant",
        DailyParams(
            audio_out_enabled=True,
            transcription_enabled=False,  # We use AssemblyAI, not Daily transcription
            vad_enabled=True,
            vad_analyzer=SileroVADAnalyzer(),
            vad_audio_passthrough=True,
        ),
    )

    # STT: AssemblyAI Universal-3 Pro Streaming
    stt = AssemblyAISTTService(
        connection_params=AssemblyAIConnectionParams(
            api_key=os.environ["ASSEMBLYAI_API_KEY"],
            speech_model="u3-rt-pro",
            end_of_turn_confidence_threshold=0.7,
            min_end_of_turn_silence_when_confident=300,
            max_turn_silence=1000,
        )
    )

    # LLM
    llm = OpenAILLMService(api_key=os.environ["OPENAI_API_KEY"], model="gpt-4o")
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    context = OpenAILLMContext(messages)
    context_aggregator = llm.create_context_aggregator(context)

    # TTS
    tts = CartesiaTTSService(
        api_key=os.environ["CARTESIA_API_KEY"],
        voice_id="79a125e8-cd45-4c13-8a67-188112f4dd22",
    )

    # Transcript logging
    transcript = TranscriptProcessor()

    @transcript.event_handler("on_transcript_update")
    async def on_transcript_update(processor, frame):
        for msg in frame.messages:
            logger.info(f"[{msg.role}] {msg.content}")

    # Pipeline
    pipeline = Pipeline(
        [
            transport.input(),
            stt,
            transcript.user(),
            context_aggregator.user(),
            llm,
            tts,
            transport.output(),
            context_aggregator.assistant(),
            transcript.assistant(),
        ]
    )

    task = PipelineTask(
        pipeline,
        PipelineParams(allow_interruptions=True),
    )

    @transport.event_handler("on_first_participant_joined")
    async def on_first_participant_joined(transport, participant):
        await transport.capture_participant_transcription(participant["id"])
        await task.queue_frames([context_aggregator.user().get_context_frame()])
        logger.info(f"Participant joined: {participant['id']}")

    @transport.event_handler("on_participant_left")
    async def on_participant_left(transport, participant, reason):
        await task.queue_frame(EndFrame())

    runner = PipelineRunner()
    await runner.run(task)


if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser(description="Pipecat voice agent")
    parser.add_argument("--url", required=True, help="Daily.co room URL")
    parser.add_argument("--token", default=None, help="Daily.co meeting token (optional)")
    args = parser.parse_args()

    asyncio.run(main(args.url, args.token))
Step 4: Run It
python bot.py --url https://your-subdomain.daily.co/some-room-name
Open that URL in your browser, grant microphone access, and start talking. The bot will respond in real time.
Code Walkthrough
The Transport Layer
transport = DailyTransport(
    room_url,
    token,
    "Voice Assistant",
    DailyParams(
        audio_out_enabled=True,
        transcription_enabled=False,
        vad_enabled=True,
        vad_analyzer=SileroVADAnalyzer(),
        vad_audio_passthrough=True,
    ),
)
We disable Daily’s built-in transcription (transcription_enabled=False) because we’re routing audio to AssemblyAI instead. Silero VAD runs locally to filter out silence before it ever leaves the machine — this reduces latency and API cost.
The vad_audio_passthrough=True flag is important: even though VAD is gating what gets counted as speech, the raw audio still flows through to AssemblyAI. The STT model does its own analysis on that audio; VAD just keeps us from processing dead silence.
Turn Detection
stt = AssemblyAISTTService(
    connection_params=AssemblyAIConnectionParams(
        api_key=os.environ["ASSEMBLYAI_API_KEY"],
        speech_model="u3-rt-pro",
        end_of_turn_confidence_threshold=0.7,
        min_end_of_turn_silence_when_confident=300,
        max_turn_silence=1000,
    )
)
The three turn detection parameters work together:
- end_of_turn_confidence_threshold=0.7 — the model needs 70% confidence that the user has finished their turn before it signals turn end. Lower this if you want faster (but potentially premature) responses.
- min_end_of_turn_silence_when_confident=300 — even when confidence is high, wait at least 300ms of silence before firing. Prevents cutting off speakers who pause for emphasis.
- max_turn_silence=1000 — after 1 second of silence, force a turn end regardless of confidence. The safety valve.
This is the part that actually matters for voice agent quality. Most latency problems in voice agents aren’t the STT or LLM — they’re bad turn detection that either interrupts the user too early or waits too long after they’ve finished.
Neural turn detection (which AssemblyAI calls “Universal Turn Detection”) differs from VAD-based approaches in that it’s trained to understand conversational intent, not just audio energy. It handles cases like trailing off mid-sentence but clearly completing a thought, or pausing to think before continuing.
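To make the interplay of these three parameters concrete, here is a toy, pure-Python model of the decision. This is not AssemblyAI's implementation (the real confidence score comes from their neural model); `should_end_turn` and its inputs are purely illustrative:

```python
def should_end_turn(confidence: float, silence_ms: int,
                    threshold: float = 0.7,
                    min_silence_when_confident: int = 300,
                    max_silence: int = 1000) -> bool:
    """Toy model of how the three turn-end parameters interact."""
    # Safety valve: after max_silence ms of silence, end the turn no matter what.
    if silence_ms >= max_silence:
        return True
    # Confident path: high confidence still waits out a short silence window.
    if confidence >= threshold and silence_ms >= min_silence_when_confident:
        return True
    return False

# Confident speaker who paused 400ms: turn ends.
print(should_end_turn(confidence=0.9, silence_ms=400))   # True
# Mid-thought pause (low confidence, short silence): keep listening.
print(should_end_turn(confidence=0.3, silence_ms=500))   # False
# Long silence forces a turn end regardless of confidence.
print(should_end_turn(confidence=0.3, silence_ms=1200))  # True
```

The useful intuition: confidence decides *whether* to end the turn quickly, silence decides *when*, and the max-silence cap guarantees the agent never stalls forever.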
The Pipeline
pipeline = Pipeline(
    [
        transport.input(),
        stt,
        transcript.user(),
        context_aggregator.user(),
        llm,
        tts,
        transport.output(),
        context_aggregator.assistant(),
        transcript.assistant(),
    ]
)
Pipecat pipelines are lists of processors. Frames flow left to right (or down the list). Each processor handles the frame types it understands and passes everything else through.
The ordering after the LLM matters: TTS comes before transport.output() because the LLM emits text chunks that need to be synthesized before they can be played. The context aggregators bookend the LLM so conversation history is maintained correctly.
allow_interruptions=True in PipelineTask lets the user cut off the agent mid-response — the TTS stream is cancelled, the audio buffer is flushed, and the pipeline is ready for new input.
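That pass-through behavior is worth internalizing. The toy sketch below (plain Python, not Pipecat's actual frame or processor classes) shows the core idea: a processor acts on the frame types it knows and forwards everything else unchanged:

```python
from dataclasses import dataclass

@dataclass
class AudioFrame:
    samples: bytes

@dataclass
class TextFrame:
    text: str

class UppercaseTextProcessor:
    """Acts only on TextFrame; passes all other frame types through untouched."""
    def process(self, frame):
        if isinstance(frame, TextFrame):
            return TextFrame(frame.text.upper())
        return frame  # pass-through: the default for frames a processor doesn't handle

proc = UppercaseTextProcessor()
frames = [TextFrame("hello"), AudioFrame(b"\x00\x01"), TextFrame("world")]
out = [proc.process(f) for f in frames]
print([f.text if isinstance(f, TextFrame) else "audio" for f in out])
# ['HELLO', 'audio', 'WORLD']
```

This is why reordering or swapping processors is safe: a TTS service never sees raw audio frames as something it must handle, and transcript loggers can sit anywhere downstream of the frames they care about.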
Tuning for Your Use Case
Domain-Specific Vocabulary
AssemblyAI’s keyterm prompting lets you boost recognition accuracy for specialized terms without retraining anything:
stt = AssemblyAISTTService(
    connection_params=AssemblyAIConnectionParams(
        api_key=os.environ["ASSEMBLYAI_API_KEY"],
        speech_model="u3-rt-pro",
        keyterms_prompt=["Kubernetes", "kubectl", "etcd", "kube-proxy"],
    )
)
Up to 1,000 terms per session. Useful for medical, legal, financial, and developer tooling applications where the default model might mishear jargon.
Multi-Speaker Sessions
connection_params=AssemblyAIConnectionParams(
    api_key=os.environ["ASSEMBLYAI_API_KEY"],
    speech_model="u3-rt-pro",
    speaker_labels=True,
    max_speakers=2,
)
Each transcript frame will include a speaker ID, which you can use to route different speakers to different LLM contexts or handle them differently in your logic.
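One way to act on those speaker IDs is to maintain a separate message history per speaker. The sketch below uses plain dicts rather than Pipecat frame objects, so the field names are illustrative:

```python
from collections import defaultdict

# One conversation history per speaker ID (structure is illustrative).
histories: dict[str, list[dict]] = defaultdict(list)

def route_transcript(speaker: str, text: str) -> list[dict]:
    """Append a finished turn to that speaker's history and return the history."""
    histories[speaker].append({"role": "user", "content": text})
    return histories[speaker]

route_transcript("A", "What's our deploy status?")
route_transcript("B", "Ignore that, check the logs first.")
route_transcript("A", "And the rollback plan?")

print(len(histories["A"]), len(histories["B"]))  # 2 1
```

From here you could feed each history to its own LLM context, or tag turns with the speaker label in a single shared context.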
Auto-Detect Language
connection_params=AssemblyAIConnectionParams(
    api_key=os.environ["ASSEMBLYAI_API_KEY"],
    speech_model="u3-rt-pro",
    language_detection=True,
)
AssemblyAI will detect and switch languages per turn. Supported: English, Spanish, French, German, Italian, Portuguese.
Deploying to Production
For development, running bot.py locally and joining from the browser is fine. For production you need the bot process running somewhere reachable.
PipecatCloud is the managed option — it handles scaling, session management, and routing:
pip install pipecatcloud
pcc auth login
pcc init
pcc secrets set my-agent-secrets --file .env
pcc deploy
If you’re self-hosting, the bot is a standard Python process that can run in a container. The only requirement is network egress to the Daily.co, AssemblyAI, OpenAI, and Cartesia APIs.
What to Build Next
This agent is intentionally minimal. Here are the obvious extensions:
Memory across sessions — persist the conversation context to a database keyed by user ID and reload it on reconnect. The OpenAILLMContext object is just a list of messages; serialize and deserialize it however you want.
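A minimal sketch of that round trip, assuming the context can be reduced to a plain list of message dicts (the exact accessor on OpenAILLMContext may differ between Pipecat versions):

```python
import json
from pathlib import Path

def save_context(user_id: str, messages: list[dict], directory: Path = Path("sessions")) -> Path:
    """Persist a conversation's message list, keyed by user ID."""
    directory.mkdir(exist_ok=True)
    path = directory / f"{user_id}.json"
    path.write_text(json.dumps(messages))
    return path

def load_context(user_id: str, directory: Path = Path("sessions")) -> list[dict]:
    """Reload a prior conversation, or start fresh if none exists."""
    path = directory / f"{user_id}.json"
    if path.exists():
        return json.loads(path.read_text())
    return []

messages = [
    {"role": "system", "content": "You are a friendly, helpful voice assistant."},
    {"role": "user", "content": "Remember that my name is Sam."},
]
save_context("user-123", messages)
print(load_context("user-123")[1]["content"])  # Remember that my name is Sam.
```

In production you would swap the JSON files for a database and call `load_context` when the participant joins, before building the OpenAILLMContext.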
Function calling — OpenAILLMService supports tool use. Define tools to let the agent look up information, trigger actions, or call external APIs mid-conversation.
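The tool definitions themselves are standard OpenAI function-calling schemas. A hypothetical weather lookup might be declared like this; the Pipecat-side registration call is omitted because its exact signature varies by version:

```python
# OpenAI-style tool schema; "get_weather" is a hypothetical example tool.
weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name, e.g. 'Austin'"},
            },
            "required": ["city"],
        },
    },
}

print(weather_tool["function"]["name"])  # get_weather
```

For voice, keep tool results short and speakable: the LLM's follow-up response goes straight to TTS, so a ten-row table read aloud is a bad experience.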
Custom TTS voice — Cartesia lets you clone voices. If you’re building a branded assistant, you can swap the voice_id for a custom cloned voice.
Swap the LLM — Pipecat has built-in integrations for Anthropic, Google Gemini, and others. Swap out OpenAILLMService for a different provider without touching the rest of the pipeline.
Analytics — the TranscriptProcessor gives you a clean stream of conversation turns. Feed it to a logging system, a database, or an analytics pipeline.
Full Code
The complete working example is on GitHub: https://github.com/kelsey-aai/voice-agent-pipecat-universal-3-pro
It includes a .env.example, the create_room.py helper, and the full bot.py with comments.
Resources
- Pipecat documentation
- Pipecat GitHub
- AssemblyAI Pipecat integration guide
- AssemblyAI Universal-3 Pro Streaming
- Daily.co documentation
- Cartesia documentation