AI Engineer · May 31, 2026 · 24m

Engineering voice agents: Latency, quality, and scale — Rishabh Bhargava, Together AI

TL;DR

Voice agents live or die by sub-second timing: Humans react to conversational cues in about 300 milliseconds, and Bhargava says once an AI takes more than 500 milliseconds to respond, users notice, while 1 to 2 seconds is enough to make people hang up.
The production default is still a pipeline, not one magic model: Most deployed systems stream audio into speech-to-text, pass text to an LLM for reasoning and tool calls, then send output through text-to-speech, often orchestrated by frameworks like Pipecat or LiveKit.
Model size is constrained by latency budgets: For the LLM tier, Bhargava says a time-to-first-token (TTFT) target of 200 to 300 milliseconds usually pushes teams toward models in the 8B to 30B range, because larger models miss latency targets and smaller ones often fall short on intelligence and tool calling.
Co-location matters more than people think: Dropping network latency from 75 milliseconds to 5 milliseconds by putting the orchestrator and models in the same data center can cut total voice-agent latency by about 30%, even when the models themselves are already fast.
Speech quality failures cascade through the whole system: State-of-the-art speech-to-text may hit roughly 6% word error rate on benchmarks, but if the model gets a customer's name or a drug name wrong, the LLM and TTS will faithfully carry that mistake forward.
Speech-to-speech is promising but not production-ready for many workflows: Bhargava points to OpenAI's realtime API and Nvidia's Voice Chat as examples, but says these single-model systems still struggle with instruction following and tool calling, which is why teams often fall back to pipeline architectures.
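
To see how a benchmark-level word error rate can still hide a critical mistake, here is a minimal sketch of a word-level WER calculation. The function name and the sample transcript are hypothetical and are not taken from the talk; the sentence is chosen only so that a single wrong word lands near the 6% figure mentioned above.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix seen
    # so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical 16-word utterance; only the drug name is transcribed wrongly.
spoken = ("please refill the prescription for metoprolol for my mother "
          "and send it to the pharmacy today")
transcribed = spoken.replace("metoprolol", "metformin")

wer = word_error_rate(spoken, transcribed)
print(f"WER: {wer:.2%}")  # WER: 6.25%
```

One substituted word out of sixteen scores about as well as the benchmark number, yet the only word that changed is the one the downstream LLM and TTS most needed to get right.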

The Breakdown

A 75-millisecond network hop between voice components can eat 30% of an already optimized agent's latency budget, which is why Rishabh Bhargava argues voice AI is now mostly an engineering problem, not a research one. He walks through the production stack for real-time voice agents, from speech-to-text (STT) and LLM sizing to text-to-speech (TTS), autoscaling, and co-location, and explains why pure speech-to-speech systems still struggle with tool calling.
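
The co-location claim is easy to sanity-check with back-of-the-envelope arithmetic. The sketch below assumes one network hop per pipeline stage and uses made-up per-stage compute times; only the 75 ms and 5 ms hop figures come from the summary above, so treat the result as an illustration rather than a measurement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    compute_ms: float  # assumed time to first output; not from the talk


# Hypothetical per-stage numbers, chosen for illustration only.
STAGES = [
    Stage("stt", 150.0),
    Stage("llm_ttft", 250.0),
    Stage("tts_first_audio", 100.0),
]


def total_latency_ms(stages: list[Stage], network_ms_per_hop: float) -> float:
    """Sum each stage's compute time plus one network hop to reach it."""
    return sum(stage.compute_ms + network_ms_per_hop for stage in stages)


remote = total_latency_ms(STAGES, 75.0)    # orchestrator far from the models
colocated = total_latency_ms(STAGES, 5.0)  # everything in one data center
reduction = (remote - colocated) / remote

print(f"remote: {remote:.0f} ms, co-located: {colocated:.0f} ms, "
      f"saved: {reduction:.0%}")
# remote: 725 ms, co-located: 515 ms, saved: 29%
```

Under these assumptions the model times do not change at all, yet the total drops by roughly the 30% cited, because three hops at 75 ms add up to almost half as much time as the models themselves.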