
We are still not at the 'Her' moment, even with impressive demos from ElevenLabs, Gemini, and Gradium. Neil Zeghidour says today’s voice agents sound more natural than before, but latency, interruption handling, and real conversational flow still fall short of the Samantha fantasy from the 2013 film.
The real bottleneck is no longer just TTS latency; it’s tool calls that take 500 milliseconds to 4 seconds. He argues the industry is obsessing over shaving 10–20 ms off speech generation while agents still stall waiting on APIs, routers, and external systems.
Most 'speech-to-speech' systems are still half duplex, which means they cannot behave like humans in overlapping conversation. Zeghidour highlights that humans constantly backchannel with 'mhm' and interruptions, and that overlap reaches roughly 20% of conversation time in some languages, such as Japanese.
Gradium’s Moshi is pitched as the missing interaction layer: full duplex, robust to interruption, and able to speak while listening. In his demo, the model starts answering before a sentence ends and still processes new user input, which he says is what makes it feel genuinely conversational.
Natural turn-taking alone is not enough: full-duplex models still lose to cascaded systems on intelligence, tool use, observability, and production reliability. Zeghidour says Moshi had unmatched conversational flow but was 'very stupid' and not deployable as an agent, because it lacked the control and visibility teams need.
Cost and privacy may decide who wins consumer voice, which is why Gradium built a sub-100M-parameter TTS model that runs on a smartphone CPU. He claims many voice startups burn their funding on TTS bills, while Gradium’s new on-device model 'Phonon' avoids API fees and keeps sensitive voice data local.
Neil Zeghidour opens by positioning Gradium as the infrastructure company for voice AI: speech-to-text, text-to-speech, speech-to-speech, translation, and dialogue — not orchestration or vertical apps. He kicks things off with a voice clone of podcaster Joel, made from about 10 seconds of audio, to make the point that the raw model quality is already getting pretty wild.
He jokes that the 'Her moment' has become the most overused analogy in AI, then immediately admits it remains the right benchmark because the movie still feels ahead of us 13 years later. He plays the Samantha naming scene, then contrasts that with recent voice demos from ElevenLabs and his own Gemini-based gym-bro persona to show the current state: better, more usable, but still obviously not human.
Neil runs through the classic stack — streaming speech-to-text, LLM, streaming TTS, voice cloning, VAD — and says even strong systems are boxed in by physics and architecture. If TTS alone still takes more than 200 ms, and a human conversation expects the entire loop of understanding, thinking, and speaking to happen in about 200 ms, then a human-like exchange is simply not possible with today’s cascade.
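The budget argument above can be made concrete with a little arithmetic. The sketch below is illustrative only: the 200 ms conversational target and the "TTS alone takes about 200 ms" figure come from the talk summary, while the speech-to-text and LLM numbers are hypothetical placeholders, not measurements.

```python
# Rough latency budget for a cascaded voice agent (STT -> LLM -> TTS).
# Only the 200 ms figures come from the talk summary; the other
# per-stage numbers are hypothetical placeholders.

HUMAN_TURN_BUDGET_MS = 200.0  # approximate gap a human conversation tolerates


def cascade_latency_ms(stages: dict[str, float]) -> float:
    """Total time to first audio: the stages run one after another."""
    return sum(stages.values())


def over_budget_ms(stages: dict[str, float],
                   budget_ms: float = HUMAN_TURN_BUDGET_MS) -> float:
    """How far the cascade overshoots the conversational budget."""
    return cascade_latency_ms(stages) - budget_ms


# Hypothetical stage latencies in milliseconds.
stages = {
    "stt_final_transcript": 150.0,  # assumption
    "llm_first_token": 300.0,       # assumption
    "tts_first_audio": 200.0,       # the talk's lower bound for TTS alone
}

total = cascade_latency_ms(stages)
overshoot = over_budget_ms(stages)

# Even with instantaneous STT and LLM, TTS alone uses up the whole budget.
tts_only_overshoot = over_budget_ms({"tts_first_audio": 200.0})
```

With these placeholder numbers the cascade needs 650 ms, which is 450 ms over budget. Setting the other two stages to zero still leaves no headroom at all, which is the structural point being made.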
His sharpest point is that the industry may be optimizing the wrong thing: while teams fight for tiny TTS gains, one tool call through something like OpenRouter can add 500 ms to 4 seconds. His proposed workaround is 'fillers' — keep the conversation alive while the tool runs, then smoothly insert the result — which he demonstrates with a rough live-coded travel agent that chats about Tokyo while fetching options.
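One way to picture the filler idea is as a concurrency pattern: start the slow tool call in the background, keep speaking short lines until it returns, then speak the result. The following is a minimal sketch under stated assumptions, not the code from the demo. `slow_tool_call`, `respond_with_fillers`, the filler lines, and the delays are all hypothetical, and `speak` stands in for whatever sends text to a TTS engine.

```python
import asyncio
from typing import Callable


async def slow_tool_call(query: str, delay_s: float) -> str:
    """Hypothetical stand-in for an external API that takes a while."""
    await asyncio.sleep(delay_s)
    return f"3 flights found for {query}"


async def respond_with_fillers(
    query: str,
    fillers: list[str],
    tool_delay_s: float,
    filler_gap_s: float,
    speak: Callable[[str], None],
) -> str:
    # Start the tool call in the background so the agent can keep talking.
    task = asyncio.create_task(slow_tool_call(query, tool_delay_s))
    for line in fillers:
        if task.done():
            break  # result is ready; stop filling
        speak(line)
        # Wait until the tool finishes or the gap elapses, whichever is first.
        # asyncio.wait does not cancel the task when the timeout expires.
        await asyncio.wait({task}, timeout=filler_gap_s)
    result = await task  # blocks only if the fillers ran out first
    speak(result)
    return result


def run_demo() -> list[str]:
    spoken: list[str] = []
    asyncio.run(
        respond_with_fillers(
            query="Tokyo",
            fillers=["Tokyo is lovely in spring.",
                     "Let me check what's available."],
            tool_delay_s=0.2,   # simulated tool latency (assumption)
            filler_gap_s=0.05,  # pause between filler lines (assumption)
            speak=spoken.append,
        )
    )
    return spoken


spoken = run_demo()
```

The design choice worth noting is that the filler loop checks the task between lines, so a fast tool call cuts the small talk short instead of forcing the user to sit through it.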
He says a lot of people treat speech-to-speech as the answer, but most systems are still half duplex: they either listen or speak, never both. That breaks on basic human behavior like overlap, coughing, or backchanneling, and he hilariously demonstrates this by talking to a model that keeps misreading his 'yeah, exactly' interjections as interruptions and derailing the exchange.
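The failure mode described here can be illustrated with a toy turn-taking policy. This is a text-level sketch for illustration only, and not how any of the systems in the talk work. The function names, the action labels, and the `BACKCHANNELS` list are hypothetical, and real systems operate on audio rather than transcripts.

```python
# Toy comparison: a naive half-duplex policy versus one that tolerates
# backchannels. Everything here is a hypothetical illustration.

BACKCHANNELS = {"mhm", "yeah", "yeah, exactly", "right", "uh-huh"}


def naive_half_duplex(agent_speaking: bool, user_utterance: str) -> str:
    """Treat any user sound during agent speech as an interruption."""
    if not agent_speaking:
        return "listen"
    if user_utterance.strip():
        return "yield_turn"
    return "keep_talking"


def backchannel_aware(agent_speaking: bool, user_utterance: str) -> str:
    """Ignore short acknowledgements; yield only to substantive speech."""
    if not agent_speaking:
        return "listen"
    text = user_utterance.strip().lower().rstrip(".!?")
    if not text or text in BACKCHANNELS:
        return "keep_talking"
    return "yield_turn"


naive_action = naive_half_duplex(True, "yeah, exactly")
aware_action = backchannel_aware(True, "Yeah, exactly.")
real_interrupt = backchannel_aware(True, "wait, what about Osaka?")
```

A fixed word list like this is brittle, which supports the argument in the text: handling overlap well needs a model that listens while it speaks, not a patch on top of a system that can only do one at a time.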
In contrast, he shows his co-founder Alex talking over Moshi in a two-year-old demo that still 'aged well': both sides interrupt, overlap, and recover naturally. That, he argues, is the real interaction breakthrough — not just speech output, but robustness to noise, multiple speakers, and the ambiguity of live conversation.
Then he pulls back and makes a more sober point: Moshi’s flow was incredible, but the model itself was 'very stupid' and basically useless after a few minutes because it had no tools, no observability, and no practical production controls. He also points to paralinguistic understanding — tone, discomfort, emotional cues — as something speech models technically contain, but won’t use unless they’re explicitly trained to make that information matter.
He ends on economics and deployment reality: voice is still expensive, often run at a loss by hyperscalers, and TTS is the line item that can torch a startup’s runway. Gradium’s answer is Phonon, an on-device TTS model under 100 million parameters that runs on a smartphone CPU, with local inference, voice cloning, no API fees, and, in his telling, better quality than existing on-device options like Kokoro. The closing message is simple: voice is not a commodity, the 'last mile' is the hardest part, and getting to 'Her' will take real science and engineering.