AI Engineer · 19 min

Voice AI: when is the "Her" moment? — Neil Zeghidour, Gradium AI

TL;DR

  • We are still not at the 'Her' moment, even with impressive demos from ElevenLabs, Gemini, and Gradium — Neil Zeghidour says today’s voice agents sound more natural than before, but latency, interruption handling, and real conversational flow still fall short of the Samantha fantasy from the 2013 film.

  • The real bottleneck is no longer just TTS latency but tool calls that take 500 ms to 4 seconds. He argues the industry is obsessing over shaving 10–20 ms off speech generation while agents still stall waiting on APIs, routers, and external systems.

  • Most 'speech-to-speech' systems are still half duplex, which means they cannot behave like humans in overlapping conversation — Zeghidour highlights that humans constantly backchannel with 'mhm' and interruptions, with overlap reaching roughly 20% of conversation time in some languages like Japanese.

  • Moshi is pitched as the missing interaction layer: full duplex, robust to interruption, and able to speak while listening. In his demo, the model starts answering before a sentence ends and still processes new user input, which he says is what makes it feel genuinely conversational.

  • Natural turn-taking alone is not enough: full-duplex models still lose to cascaded systems on intelligence, tool use, observability, and production reliability. Zeghidour says Moshi had unmatched conversational flow but was 'very stupid' and could not be deployed as an agent because it lacked the control and visibility that teams need.

  • Cost and privacy may decide who wins consumer voice, which is why Gradium built a sub-100M-parameter TTS model that runs on a smartphone CPU — he claims many voice startups burn fundraising on TTS bills, while their new on-device model 'Phonon' avoids API fees and keeps sensitive voice data local.

The Breakdown

Opening shot: Gradium wants to be the voice model layer

Neil Zeghidour opens by positioning Gradium as the infrastructure company for voice AI: speech-to-text, text-to-speech, speech-to-speech, translation, and dialogue — not orchestration or vertical apps. He kicks things off with a voice clone of podcaster Joel, made from about 10 seconds of audio, to make the point that the raw model quality is already getting pretty wild.

The 'Her' benchmark is overused — and still annoyingly relevant

He jokes that the 'Her moment' has become the most overused analogy in AI, then immediately admits it remains the right benchmark because the movie still feels ahead of us 13 years later. He plays the Samantha naming scene, then contrasts that with recent voice demos from ElevenLabs and his own Gemini-based gym-bro persona to show the current state: better, more usable, but still obviously not human.

Cascaded systems are fast now, but nowhere near human conversational timing

Neil runs through the classic stack — streaming speech-to-text, LLM, streaming TTS, voice cloning, VAD — and says even strong systems are boxed in by physics and architecture. If TTS alone still takes more than 200 ms, and a human conversation expects the entire loop of understanding, thinking, and speaking to happen in about 200 ms, then a human-like exchange is simply not possible with today’s cascade.
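
The arithmetic behind that claim can be sketched as a simple latency budget. Only two figures below come from the talk summary: the roughly 200 ms human turn-taking target and TTS taking at least 200 ms. The speech-to-text and LLM numbers, and all the names, are placeholder assumptions for illustration, not measurements of any real system.

```python
# Rough latency budget for one turn of a cascaded voice agent.
# From the summary: ~200 ms human target, TTS alone takes more than 200 ms.
# The STT and LLM figures are assumed placeholders, not measurements.

HUMAN_TURN_TARGET_MS = 200

stage_latency_ms = {
    "stt_final_transcript": 150,  # assumed
    "llm_first_token": 300,       # assumed
    "tts_first_audio": 200,       # lower bound taken from the summary
}


def total_latency_ms(stages):
    """Sum the per-stage latencies of a strictly sequential cascade."""
    return sum(stages.values())


def over_budget_ms(stages, target_ms=HUMAN_TURN_TARGET_MS):
    """Positive result: the cascade misses the target by this many ms."""
    return total_latency_ms(stages) - target_ms


if __name__ == "__main__":
    print("cascade total (ms):", total_latency_ms(stage_latency_ms))
    print("over human target by (ms):", over_budget_ms(stage_latency_ms))
    tts_only = {"tts_first_audio": stage_latency_ms["tts_first_audio"]}
    print("budget left after TTS alone (ms):", -over_budget_ms(tts_only))
```

Even with optimistic values for the assumed stages, TTS at its lower bound already consumes the whole budget, which is the point of the argument.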

Tool calls are the latency monster nobody can ignore

His sharpest point is that the industry may be optimizing the wrong thing: while teams fight for tiny TTS gains, one tool call through something like OpenRouter can add 500 ms to 4 seconds. His proposed workaround is 'fillers' — keep the conversation alive while the tool runs, then smoothly insert the result — which he demonstrates with a rough live-coded travel agent that chats about Tokyo while fetching options.
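
The filler idea can be sketched with plain asyncio: start the tool call in the background, speak short filler lines while it is pending, then deliver the result. This is a minimal sketch under stated assumptions, not the code from the demo. `slow_tool_call` and `speak` are hypothetical stand-ins for a real tool and a real streaming TTS call, and the timings are made up within the 500 ms to 4 s range mentioned in the talk.

```python
import asyncio


async def slow_tool_call(query: str, delay_s: float = 1.5) -> str:
    """Hypothetical tool call; the delay imitates a slow external API."""
    await asyncio.sleep(delay_s)
    return f"3 flight options found for {query}"


async def speak(text: str, spoken: list) -> None:
    """Hypothetical TTS stand-in: records the text instead of playing audio."""
    spoken.append(text)


async def answer_with_fillers(query, fillers, filler_gap_s=0.5, tool_delay_s=1.5):
    spoken = []
    # Start the tool call in the background so talking is not blocked on it.
    tool_task = asyncio.create_task(slow_tool_call(query, tool_delay_s))
    for filler in fillers:
        if tool_task.done():
            break
        await speak(filler, spoken)
        # Pause until the gap elapses or the tool finishes, whichever is first.
        await asyncio.wait([tool_task], timeout=filler_gap_s)
    # If the fillers run out first, simply wait for the tool result.
    result = await tool_task
    await speak(result, spoken)
    return spoken


if __name__ == "__main__":
    lines = asyncio.run(
        answer_with_fillers(
            "Tokyo",
            ["Tokyo is a great choice.", "Let me pull up some options."],
        )
    )
    for line in lines:
        print(line)
```

A real agent would generate fillers with the LLM and handle the user speaking during the wait; the sketch only shows the scheduling pattern.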

Why 'speech-to-speech' still doesn’t mean real conversation

He says a lot of people treat speech-to-speech as the answer, but most systems are still half duplex: they either listen or speak, never both. That breaks on basic human behavior like overlap, coughing, or backchanneling, and he hilariously demonstrates this by talking to a model that keeps misreading his 'yeah, exactly' interjections as interruptions and derailing the exchange.
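
The failure mode can be illustrated with a toy turn-taking policy. The first function mimics a half-duplex agent that treats any user speech as an interruption. The second is a naive patch that ignores a short allowlist of acknowledgements. Both functions and the allowlist are assumptions made up for this sketch. The talk does not describe such a patch, and a full-duplex model handles overlap in the model itself rather than with a rule like this.

```python
# Assumed allowlist of backchannel phrases, for illustration only.
BACKCHANNELS = {"mhm", "yeah", "yeah, exactly", "right", "uh-huh"}


def half_duplex_policy(agent_speaking: bool, user_text: str) -> str:
    """Any user speech while the agent talks counts as an interruption."""
    if agent_speaking and user_text:
        return "stop_and_listen"
    return "keep_speaking" if agent_speaking else "listen"


def backchannel_aware_policy(agent_speaking: bool, user_text: str) -> str:
    """Naive patch: keep talking through short acknowledgements."""
    normalized = user_text.strip().lower().rstrip(".!?")
    if agent_speaking and normalized in BACKCHANNELS:
        return "keep_speaking"
    return half_duplex_policy(agent_speaking, user_text)


if __name__ == "__main__":
    for text in ["Yeah, exactly.", "Wait, not Tokyo."]:
        print(
            repr(text),
            half_duplex_policy(True, text),
            backchannel_aware_policy(True, text),
        )
```

The allowlist breaks as soon as the user coughs, laughs, or uses a phrase that is not on it, which is why a rule-based fix falls short of true overlapping conversation.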

Moshi’s full-duplex demo shows what actually feels human

In contrast, he shows his co-founder Alex talking over Moshi in a two-year-old demo that he says has 'aged well': both sides interrupt, overlap, and recover naturally. That, he argues, is the real interaction breakthrough: not just speech output, but robustness to noise, multiple speakers, and the ambiguity of live conversation.

But full duplex alone is not enough to ship a product

Then he pulls back and makes a more sober point: Moshi’s flow was incredible, but the model itself was 'very stupid' and basically useless after a few minutes because it had no tools, no observability, and no practical production controls. He also points to paralinguistic understanding (tone, discomfort, emotional cues) as information that speech models technically capture but won’t act on unless they’re explicitly trained to make it matter.

The last fight is cost, privacy, and on-device voice

He ends on economics and deployment reality: voice is still expensive, often run at a loss by hyperscalers, and TTS is the line item that can torch a startup’s runway. Gradium’s answer is Phonon, an on-device TTS model under 100 million parameters that runs on a smartphone CPU, with local inference, voice cloning, no API fees, and, in his telling, better quality than existing on-device options like Kokoro. The closing message is simple: voice is not a commodity, the 'last mile' is the hardest part, and getting to 'Her' will take real science and engineering.
