AI Engineer · June 5, 2026 · 25m

Beyond Transcription: Building Voice AI That Understands Conversations — Hervé Bredin, pyannoteAI

TL;DR

"Who said what" is only the start: Bredin argues that useful voice AI needs "who said what and when and how," because timing, pauses, interruptions, laughter, coughing, and stress can change the meaning of a conversation.
Speaker diarization is still hard in 2026-era tooling: The task has to infer an unknown number of speakers, handle overlapping speech, and catch very short turns, which is why state-of-the-art systems reach a diarization error rate (DER) of about 8% on telephone speech but around 41% in noisy restaurant settings.
Whisper created a huge opening for diarization tools: After OpenAI released Whisper without speaker labels, users started pairing it with pyannote, helping drive the open-source toolkit to nearly 10,000 GitHub stars.
Benchmarks can hide the real problem: On the AMI meeting dataset, Nvidia Parakeet reports 11.4% word error rate on headset audio, but Bredin says the same model hits 26% on the far-field table microphone, where multiple speakers and overlap make transcription much harder.
Reconciling ASR and diarization is not a simple merge step: Even with word timestamps from Parakeet and speaker turns from Precision-2, words can fall between turns or inside overlaps, making speaker assignment ambiguous without additional logic.
pyannoteAI's product bet is orchestration, not just another model: Their cloud pipeline combines Precision-2 diarization with transcription and a proprietary reconciliation layer, while an open-source piece of the idea appears as "exclusive diarization" that picks the most likely speaker in overlaps.
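
To make the reconciliation problem concrete, here is a minimal sketch of the naive merge step: give each timestamped word to the speaker whose turns overlap it the most, and flag the two ambiguous cases the talk describes (a word that falls between turns, and a word inside overlapping speech). This is not pyannoteAI's reconciliation layer, which is proprietary, and it does not use the Parakeet or pyannote APIs. The `Word` and `Turn` types, the `assign_speakers` function, and the sample timings are all hypothetical, invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float    # seconds


@dataclass
class Turn:
    speaker: str
    start: float  # seconds
    end: float    # seconds


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(words: list[Word], turns: list[Turn]) -> list[tuple[str, Optional[str], str]]:
    """Naive merge: pick the speaker with the largest time overlap per word.

    Returns (word, speaker, status) where status is one of:
      "ok"      - exactly one speaker overlaps the word
      "overlap" - several speakers overlap the word (best guess returned)
      "no_turn" - the word falls between turns (speaker is None)
    """
    results = []
    for w in words:
        per_speaker: dict[str, float] = {}
        for t in turns:
            shared = overlap(w.start, w.end, t.start, t.end)
            if shared > 0:
                per_speaker[t.speaker] = per_speaker.get(t.speaker, 0.0) + shared
        if not per_speaker:
            results.append((w.text, None, "no_turn"))
        elif len(per_speaker) > 1:
            best = max(per_speaker, key=per_speaker.get)
            results.append((w.text, best, "overlap"))
        else:
            (speaker,) = per_speaker
            results.append((w.text, speaker, "ok"))
    return results


# Hypothetical sample data: speaker B starts talking before A has finished.
turns = [Turn("A", 0.0, 2.0), Turn("B", 1.5, 3.0)]
words = [
    Word("hello", 0.2, 0.6),  # clearly inside A's turn
    Word("okay", 1.4, 1.9),   # straddles the A/B overlap
    Word("right", 3.2, 3.5),  # after every turn has ended
]
results = assign_speakers(words, turns)
for row in results:
    print(row)
```

Even in this toy case only one of the three words gets an unambiguous label. The "overlap" and "no_turn" statuses mark exactly the places where a simple merge has to guess, which is where additional logic is needed.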

The Breakdown

A speaker diarization demo that looks trivial on paper ends up exposing the real mess of voice AI: even when transcription is good, assigning each word to the right person breaks on overlaps, interruptions, and tiny backchannels like "hm" or "okay." Hervé Bredin shows why pyannoteAI treats "who said what and when and how" as the real target, with diarization error rates ranging from about 8% on clean phone calls to 41% in noisy restaurant conversations.
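
For readers unfamiliar with the metric, diarization error rate is conventionally defined as the sum of false-alarm speech, missed speech, and speaker-confusion time, divided by the total duration of reference speech. The sketch below shows that arithmetic only. The function name and the durations are hypothetical, chosen so the result lands on the 8% figure quoted for telephone speech; they are not measurements from the talk.

```python
def diarization_error_rate(
    false_alarm: float,
    missed: float,
    confusion: float,
    total_speech: float,
) -> float:
    """DER = (false alarm + missed detection + speaker confusion) / reference speech.

    All arguments are durations in seconds.
    """
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return (false_alarm + missed + confusion) / total_speech


# Hypothetical durations for a 10-minute (600 s) stretch of reference speech.
der = diarization_error_rate(
    false_alarm=12.0,   # non-speech labeled as speech
    missed=20.0,        # speech the system did not detect
    confusion=16.0,     # speech attributed to the wrong speaker
    total_speech=600.0,
)
print(f"DER = {der:.1%}")
```

Because false alarms are counted in the numerator but not the denominator, DER can exceed 100% on very hard audio, so a 41% score means errors amounting to roughly two fifths of the reference speech time.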