How to Evaluate STT for Voice Agents in Production

When you’re building a voice agent, the benchmarks you reach for matter. Get them wrong, and you optimise for the wrong thing. You ship a system that feels broken in ways you can’t immediately diagnose, and end up chasing ghosts in your latency numbers.

There are only a handful of public STT benchmarks. Artificial Analysis is one of the few independent ones (but they currently don’t do real-time). After that, you’re mostly left with metrics from the providers themselves, which tend to look… favorable, shall we say.

And most of them are measuring the wrong thing anyway.

The most important thing to remember: any latency metric needs the context of accuracy. I would rather talk to a bot that takes 100ms longer to respond than have to repeat myself because it transcribes something wrong.

And not just accuracy. Reliability and cost have to be factored in when you’re actually making a provider decision.

Speed you can trust. That’s the target. Not just a fast TTFB.


What I’ll cover in this blog post:

  • The metric that actually matters for voice agents: TTFS
  • Why TTFB is a poor metric to bank on
  • Pipecat evals: the best voice agent benchmark so far
  • Try it yourself
  • Conclusion
  • Appendix: STT latency terms, explained


The metric that actually matters for voice agents: TTFS

For cascade voice agents, voice agent latency (sometimes called speech-to-speech latency) is the time from a user finishing speaking to the agent starting to speak back. In that window, a lot of things have to happen. Turn detection, final transcript delivery, RAG if you’re using it, LLM call, TTS streaming.

The transcription provider’s share of that is TTFS: time to final segment. The time from the user finishing speech to the final, stable transcript arriving. The thing that actually gets passed to your LLM.

You’ll also hear it called EoT latency (end of turn), though strictly that’s better reserved for when turn completion is detected. Pipecat calls it TTFS publicly and TTFB in their code. Helpful.

In our current system, a framework like Pipecat or LiveKit detects the end of turn in around 200ms, sends a force_end_of_utterance message, and finals come back shortly after. Our internal tooling measures end-of-speech to finals latency at 0.451 ± 0.022s for our Voice SDK with built-in turn detection.
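If you want to sanity-check that number in your own stack, TTFS per turn is just the gap between two timestamps you already have access to. A minimal sketch; the callback names are illustrative, not any particular SDK's API:

```python
import time
from dataclasses import dataclass, field
from statistics import mean, stdev


@dataclass
class TTFSTimer:
    """Measures TTFS per turn: end of user speech -> final transcript received."""
    end_of_speech_at: float | None = None
    samples: list[float] = field(default_factory=list)

    def on_end_of_speech(self) -> None:
        # Call when turn detection decides the user has finished speaking.
        self.end_of_speech_at = time.monotonic()

    def on_final_transcript(self, text: str) -> None:
        # Call when the STT provider delivers the stable, committed transcript.
        if self.end_of_speech_at is not None:
            self.samples.append(time.monotonic() - self.end_of_speech_at)
            self.end_of_speech_at = None

    def report(self) -> str:
        # Needs at least two turns for a standard deviation.
        return f"TTFS {mean(self.samples):.3f} ± {stdev(self.samples):.3f}s over {len(self.samples)} turns"
```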

Everything else (RTF, TTFB, partial latency, and final latency) matters less than this number for voice agents. Some of those metrics are actively misleading. More on that below.


Why TTFB is a poor metric to bank on

In my opinion, anyway. But I see it cited over and over again, so it’s worth explaining why it’s so bad.

TTFB (time to first byte, sometimes TTFT for time to first token) measures the delay from when you start streaming audio to when you get the very first piece of transcript back. The problem: how long is a word? If the engine fires back “Super” almost immediately when someone starts saying “Supercalifragilisticexpialidocious,” that’s technically fast. It’s also useless. Your agent can’t act on “Super.”

The extreme version of this is Deepgram Flux, as benchmarked by Coval.ai. Their data shows “Two” coming back first regardless of what’s actually being said. Likely a statistical bias toward common first words in their training data. Very low TTFB. Completely non-actionable. Your stack still has to wait for a correction before it can do anything.

Firing back a volatile guess doesn’t make a service fast. It just means the rest of your AI stack has to wait anyway.

TTFB is popular because it maps naturally from LLM and TTS benchmarking, where the first token genuinely signals useful output. In transcription that analogy breaks down. The first byte is often just a guess that might change a millisecond later. It’s not a reliable indicator of when your agent can actually start working.
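A toy event trace makes the gap concrete. The timings and text below are invented for illustration: the first byte arrives quickly, but the agent can only act once the transcript stabilises.

```python
# One utterance ("Turn off the lights"), as a stream of (ms since audio start, is_final, text).
events = [
    (80,  False, "Two"),                   # first byte: fast, but a volatile wrong guess
    (340, False, "Turn off"),
    (620, False, "Turn off the lights"),
    (930, True,  "Turn off the lights."),  # the transcript the agent can actually act on
]

ttfb_ms = events[0][0]                                      # 80ms: looks great on a chart
actionable_ms = next(t for t, final, _ in events if final)  # 930ms: what the agent really waits for

print(f"TTFB: {ttfb_ms}ms | first actionable transcript: {actionable_ms}ms")
```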

Coval.ai has independent TTFB measurements worth looking at. Just pay attention to what’s actually in that first byte, not only how fast it arrives.


Pipecat evals: the best voice agent benchmark so far

| Service | Transcripts received | Perfect turns | Mean WER | Pooled WER | Median TTFS | TTFS P95 | TTFS P99 |
|----|----|----|----|----|----|----|----|
| AssemblyAI | 99.8% | 66.8% | 3.49% | 3.02% | 256ms | 362ms | 417ms |
| AWS | 100.0% | 77.4% | 1.68% | 1.75% | 1136ms | 1527ms | 1897ms |
| Azure | 100.0% | 82.9% | 1.21% | 1.18% | 1016ms | 1345ms | 1791ms |
| Cartesia | 99.9% | 60.5% | 3.92% | 4.36% | 266ms | 364ms | 898ms |
| Deepgram | 99.8% | 76.5% | 1.71% | 1.62% | 247ms | 298ms | 326ms |
| ElevenLabs | 99.7% | 81.3% | 3.16% | 3.12% | 281ms | 348ms | 407ms |
| Google | 100.0% | 69.0% | 2.84% | 2.85% | 878ms | 1155ms | 1570ms |
| OpenAI | 99.3% | 75.9% | 3.24% | 3.06% | 637ms | 965ms | 1655ms |
| Speechmatics | 99.7% | 83.2% | 1.40% | 1.07% | 495ms | 676ms | 736ms |

Full Pipecat benchmarks →

Two axes: semantic WER and median TTFS. Both matter. Neither works without the other.

The dataset

Pipecat built this eval set originally to train their Smart Turn V3 turn detection model. It’s a set of short utterances, some cut off and some finishing naturally as an end of turn. That’s a closer approximation to real production audio than most STT test sets, which tend toward clean studio recordings.

Semantic WER

Standard WER counts wrong words. Semantic WER asks a different question: can the LLM understand what was meant based on what was transcribed? That’s the right question for voice agents. Your agent doesn’t fail because a word was slightly off. It fails because the LLM got the wrong meaning and did the wrong thing.

To calculate it: audio goes through Google’s transcription to establish a ground truth (debatable, why them specifically? But the outputs can be human-corrected). Then a large custom prompt asks Claude to compare the meaning of a transcription against that ground truth and compute error per word based on meaning, not exact wording.

That’s how you measure accuracy for voice agents.
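If you want a feel for the mechanics, here is a stripped-down version of the idea, assuming the anthropic Python SDK and a judge prompt of my own. It is not Pipecat’s actual prompt or scoring, and the model name is a placeholder for whichever Claude model you have access to.

```python
# Minimal sketch of an LLM-judged "semantic WER". Illustrative prompt and scoring,
# not Pipecat's eval. Assumes ANTHROPIC_API_KEY is set in the environment.
import anthropic

client = anthropic.Anthropic()

JUDGE_PROMPT = """Reference transcript: {reference}
Candidate transcript: {candidate}

Count how many reference words have their meaning lost or changed in the candidate
(ignore punctuation, casing, and harmless spelling or formatting variants).
Reply with a single integer and nothing else."""


def semantic_wer(reference: str, candidate: str, model: str = "claude-sonnet-4-5") -> float:
    response = client.messages.create(
        model=model,  # placeholder: use a Claude model you have access to
        max_tokens=10,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(reference=reference, candidate=candidate),
        }],
    )
    errors = int(response.content[0].text.strip())
    return errors / max(len(reference.split()), 1)


# "2" and "8" differ textually from the reference but not in meaning, so ideally ~0.0.
print(semantic_wer("book a table for two at eight", "book a table for 2 at 8"))
```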

How TTFS is measured

Pipecat’s turn detection pipeline uses two models. Silero VAD (is someone speaking right now?) and Smart Turn V3 (is the turn actually complete?). Once VAD drops for stop_secs, Smart Turn runs on the audio and decides if the user is done. If it decides yes, a message goes to the STT provider to finalise.

TTFS is measured from when VAD initially goes low to when finals come back. That captures real end-of-speech to final transcript latency, including network time, which is exactly what contributes to your voice agent’s response delay.
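A rough sketch of that gating logic, with the VAD and end-of-turn models stubbed out as callables. Pipecat wires the real Silero and Smart Turn models up for you; this is just the shape of it.

```python
import time
from typing import Callable, Iterable, Optional


def wait_for_end_of_turn(
    audio_frames: Iterable[bytes],               # short real-time audio chunks
    vad_is_speech: Callable[[bytes], bool],      # stub for Silero VAD
    turn_is_complete: Callable[[bytes], bool],   # stub for Smart Turn V3
    send_finalize: Callable[[], None],           # tells the STT service to flush finals
    stop_secs: float = 0.2,
) -> Optional[float]:
    """Waits for VAD to stay low for stop_secs, confirms end of turn, requests finals.
    Returns the timestamp the TTFS clock starts from (when VAD first went low)."""
    buffered = b""
    silence_started = None
    for frame in audio_frames:
        buffered += frame
        if vad_is_speech(frame):
            silence_started = None           # still talking: reset the silence window
            continue
        silence_started = silence_started or time.monotonic()
        if time.monotonic() - silence_started >= stop_secs and turn_is_complete(buffered):
            send_finalize()                  # STT flushes its final transcript
            return silence_started           # TTFS is measured from VAD first going low
    return None
```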

Our internal approach uses forced alignment to pinpoint exact word timing, which is more precise. But the Pipecat approach is reproducible by anyone, which matters more for a benchmark.


On where Speechmatics sits

The obvious question you may be asking: Speechmatics’ median TTFS is 495ms. Deepgram is 247ms. AssemblyAI is 256ms. Why is that a win?

Look at the accuracy column. 83.2% of turns came back completely perfect. That’s the highest on the board. Pooled WER of 1.07%, also the lowest.

The Pareto curve is what matters here. If users have to repeat themselves because the transcript was wrong, you’ve added more perceived latency than any 250ms difference in TTFS would have saved. A faster wrong answer is still a wrong answer. I would take the extra milliseconds any day.
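One way to read the table above is to ask which services are Pareto-optimal on (pooled WER, median TTFS), i.e. not beaten on both axes by anyone else. A quick check over those numbers:

```python
# (service, pooled WER %, median TTFS ms) taken from the Pipecat results above.
results = [
    ("AssemblyAI", 3.02, 256), ("AWS", 1.75, 1136), ("Azure", 1.18, 1016),
    ("Cartesia", 4.36, 266), ("Deepgram", 1.62, 247), ("ElevenLabs", 3.12, 281),
    ("Google", 2.85, 878), ("OpenAI", 3.06, 637), ("Speechmatics", 1.07, 495),
]


def pareto_front(rows):
    """Keep rows that no other row beats on both accuracy and latency."""
    return [
        (name, wer, ttfs)
        for name, wer, ttfs in rows
        if not any(
            w <= wer and t <= ttfs and (w < wer or t < ttfs)
            for _, w, t in rows
        )
    ]


for name, wer, ttfs in pareto_front(results):
    print(f"{name}: {wer}% pooled WER at {ttfs}ms median TTFS")
# Only Deepgram and Speechmatics survive: every other service is both slower and
# less accurate than one of them.
```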

It’s also worth noting that accuracy isn’t only an English problem. Domain accuracy across medical transcription, accents, and non-native speakers matters enormously in production, and barely surfaces in most benchmarks. Our latency is consistent across languages and domains, which the Pipecat eval doesn’t fully capture.

One more thing for anyone optimising on cost: our Standard model is cheaper than Deepgram, more accurate, and has the same improved latency as our Enhanced model.


Try it yourself

The Pipecat example in the Speechmatics Academy has been updated to Pipecat 0.101, which now includes live TTFS measurement for transcription services.

To test it out:

  1. Clone the repo: https://github.com/speechmatics/speechmatics-academy/tree/main/integrations/pipecat/02-simple-voice-bot-web
  2. Grab a free API key from Speechmatics
  3. Spin up the bot

Use the endpoint closest to your location for lowest latency. EU is the default. As you talk, the metrics tab updates with live TTFS per turn.


Conclusion

Pipecat has built the most useful public benchmark for voice agents so far. Two axes, real-world turn data, semantic accuracy rather than raw WER. If you’re evaluating STT providers, start here.

But as much as I think about latency, I keep coming back to the same conclusion: accuracy matters more.

Latency has reached a point where you can have a comfortable conversation with a voice agent. People who say “latency is UX” aren’t wrong. But repeating yourself because the system mishears you is far more annoying than the gap.

Right now, the transcription latency that matters for voice agents is end-of-speech to finals. That might not be the case in a year. Speculative generation is getting more capable. Turn detection is going to get more layered: split across transcription, orchestration, and the LLM, each adding its own backstop, with more of it absorbed into the transcription layer itself.

Fast matters. Fast, reliable, and accurate across languages, accents, and domains matters more.


Appendix: STT latency terms, explained

A full breakdown for reference, ordered from least to most relevant for voice agents.

Partials (interim transcripts)

The current best guess. Volatile, subject to change. Emitted sometimes before you’ve even finished a word. Useful for UI visualisation and for speculative generation, sending to the LLM before the turn ends. Not the ground truth you act on.

Finals

Stable, committed transcripts. Won’t change once emitted. Take longer to arrive because the engine needs confidence that the word is complete and surrounding context won’t shift its interpretation. This is what you pass to your LLM.
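In code, the split is simple: partials drive the UI (and maybe speculative generation), finals drive the agent. A generic sketch; the message shape is illustrative and field names vary by provider.

```python
import queue


def handle_transcript_message(msg: dict, ui, llm_queue: queue.Queue) -> None:
    """Route streaming STT messages: partials are display-only, finals drive the agent.
    The msg fields here are illustrative; real providers name them differently."""
    text = msg["text"]
    if msg["is_final"]:
        llm_queue.put(text)        # stable transcript: safe to hand to the LLM
    else:
        ui.show_interim(text)      # volatile guess: show it, never act on it
```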

Turns

A single speaker’s contribution before the other side responds. Detecting the boundary between a mid-sentence pause and a completed thought is a hard real-time engineering problem. Frameworks like Pipecat and LiveKit use dedicated turn detection models to decide when to close the turn and trigger the LLM.

RTF (real-time factor)

Not a latency, a speed ratio: processing time divided by audio duration. RTF of 0.05 means a 100-second file came back in 5 seconds. For real-time applications, as long as RTF stays below 1 you’re fine. Mostly relevant for batch benchmarks. Ignore for voice agent work.
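The arithmetic, for completeness:

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent transcribing / duration of the audio."""
    return processing_seconds / audio_seconds


print(real_time_factor(5.0, 100.0))  # 0.05: a 100-second file transcribed in 5 seconds
```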

TTFB / TTFT (time to first byte / token)

Time from audio stream start to first transcript fragment. Misleading for voice agents as explained above. The first byte is often a volatile guess. Check Coval.ai for independent measurements, but weight them accordingly.

Partial latency

Time from finishing a word to receiving the first partial relating to it. What most people feel when watching a transcription service in action. Useful for perceived responsiveness and speculative generation. Not the number that drives agent response time.

Finals latency

Time from word completion to the relevant final arriving. At Speechmatics this is controlled with the max_delay API parameter. Minimum 700ms to preserve accuracy, recommended around 1.5s for voice agents. Relevant for live captioning and running transcripts.
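If you’re using the Speechmatics Python SDK, this lives on the transcription config. A minimal sketch, assuming the speechmatics-python package; verify the field names against the current docs before relying on it.

```python
# Minimal sketch, assuming the speechmatics-python SDK; check the current
# documentation for the exact fields your version supports.
from speechmatics.models import TranscriptionConfig

config = TranscriptionConfig(
    language="en",
    enable_partials=True,  # volatile interim results for UI or speculative generation
    max_delay=1.5,         # seconds before finals are forced out; ~1.5s recommended for voice agents
)
```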

TTFS (time to final segment)

Time from the user finishing speech to the final transcript arriving. The transcription provider’s direct contribution to voice agent response latency. The number that matters.


:::info
A note on algorithmic vs network latency: the latency figures above represent algorithmic latency, what you’d measure on a direct connection. The Pipecat evaluations include real-world network latency, which is the honest way to run these tests.

:::
