AI Engineer · 40m

Judge the Judge: Building LLM Evaluators That Actually Work with GEPA — Mahmoud Mabrouk, Agenta AI

TL;DR

  • Generic 'hallucination' judges are mostly theater — Mahmoud Mabrouk opens with the classic failure mode: the dashboard says your agent is fine while customers are angry, because the judge prompt is basically 'make no mistakes' without any business-specific grounding.

  • A useful LLM judge has to be calibrated to human labels, not vibes — His core argument is that offline eval speed only matters if the signal correlates with human annotation; otherwise you just run the experiment loop faster in the wrong direction.

  • The hard part is not GEPA — it's metrics and data — On Sierra's TauBench airline support benchmark, he says the real work is defining domain-specific binary metrics like policy adherence, curating traces, and collecting human reasoning for why a conversation passed or failed.

  • He decomposes evaluation into narrow judges instead of one giant 'success' score — From error analysis he extracted four failure types—policy adherence, response style, information delivery, and tool calling—and argues separate binary judges are far easier to optimize than a single 1-to-5 omnibus score.

  • GEPA improved a weak seed judge, but it did not magically solve evaluation — Starting from a naive prompt that labeled 98% of cases compliant, optimization learned parts of the airline policy and raised validation accuracy from roughly 61% to 74%, while cutting the judge's rate of predicting 'compliant' from 98% to 64%.

  • Prompt optimization here is expensive, model-sensitive, and finicky — Smaller models like GPT-4o, the mini/nano variants, Gemini, and DeepSeek often failed in different roles; his better setup used Gemini for reflection and Grok for judging; and even modest experiments burned through $200-$300 in tokens over 200-300 iterations.

The Breakdown

The dashboard says 'healthy,' but users say the agent is broken

Mahmoud starts with a painfully familiar story: someone slaps a hallucination judge into production, observability looks clean, and meanwhile customers are telling you the agent is failing. His point is blunt and funny—if the judge could actually tell whether the answer was hallucinated from a vague prompt alone, your app probably would've worked on day one.

Why calibrated judges matter for the whole AI engineering loop

He frames good LLM judges as the bottleneck-breaker for both offline evals and online monitoring. Human review is high quality but slow; LLM judges are fast but useless if they don't correlate with human labels, so the dream is a calibrated judge that lets you ship faster without fooling yourself. He ties that to the bigger 'holy grail' data flywheel: observe traces, add new evals, optimize again, repeat until improvement becomes almost automatic.
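The calibration check he's describing can be sketched as a quick comparison of judge verdicts against human labels. A minimal, illustrative version (the data and function names are hypothetical, not from the talk) uses Cohen's kappa rather than raw accuracy, since kappa penalizes a judge that games agreement by always predicting the majority class:

```python
# Hypothetical calibration check: does the LLM judge agree with human labels
# beyond chance? Labels are binary (1 = pass, 0 = fail); data is made up.

def agreement(human: list[int], judge: list[int]) -> dict:
    """Raw accuracy plus Cohen's kappa, which corrects for a judge that
    inflates agreement by always predicting the majority class."""
    assert len(human) == len(judge) and human
    n = len(human)
    acc = sum(h == j for h, j in zip(human, judge)) / n
    # Expected agreement by chance, from each rater's marginal rates.
    p_h = sum(human) / n
    p_j = sum(judge) / n
    p_e = p_h * p_j + (1 - p_h) * (1 - p_j)
    kappa = (acc - p_e) / (1 - p_e) if p_e < 1 else 0.0
    return {"accuracy": acc, "kappa": kappa}

human = [1, 1, 0, 1, 0, 0, 1, 0]
judge = [1, 1, 1, 1, 0, 1, 1, 0]
print(agreement(human, judge))  # accuracy 0.75, kappa 0.5
```

A judge with high accuracy but near-zero kappa is exactly the 'fast signal in the wrong direction' he warns about.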

The test case: TauBench's airline support agent

The workshop uses Sierra's TauBench airline benchmark: a customer support agent with tools for reservations, flight info, and user data, plus a pretty gnarly airline policy. Mahmoud works with 599 conversation traces, post-processed into pass/fail annotations with explanations like 'approved the cancellation without verifying the airline cancellation rule,' and notes the dataset is realistic but messy rather than clean and academic.

Start with error analysis, not a generic metric

He insists metrics have to come from the use case, and says subject-matter experts—not benchmark defaults—should define them. Using an annotation workflow inspired by Hamel Husain (hamel.dev), he clusters failures into four error types: policy adherence, response style, information delivery, and incorrect tool use; that becomes four separate binary judges instead of one overloaded 'success' scorer.
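The decomposition he describes can be sketched as four narrow judges, each with its own prompt and a strict pass/fail output. This is an illustrative shape only—the prompts and the `call_llm` hook below are assumptions, not Agenta's actual API:

```python
# Illustrative sketch of decomposed binary judges (prompts and call_llm
# are hypothetical). Each judge answers one narrow question with PASS/FAIL
# instead of one omnibus 1-to-5 score.

JUDGE_PROMPTS = {
    "policy_adherence": "Did the agent follow the airline policy? Answer PASS or FAIL.",
    "response_style": "Was the tone appropriate for customer support? Answer PASS or FAIL.",
    "information_delivery": "Did the agent convey all required information? Answer PASS or FAIL.",
    "tool_calling": "Were the tool calls correct and necessary? Answer PASS or FAIL.",
}

def judge_trace(trace: str, call_llm) -> dict[str, bool]:
    """Run each narrow judge independently; call_llm is any text-in,
    text-out completion function."""
    results = {}
    for name, prompt in JUDGE_PROMPTS.items():
        verdict = call_llm(f"{prompt}\n\nConversation:\n{trace}")
        results[name] = verdict.strip().upper().startswith("PASS")
    return results
```

Each binary judge can then be calibrated and optimized independently, which is the point: a judge that only has to answer 'did it follow policy?' is much easier to align with human labels than one scoring overall 'success.'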

Why annotation reasoning matters more than people think

A key practical point: labels alone are not enough. Mahmoud says the optimizer needs human reasoning about why a trace is compliant or non-compliant, otherwise it's asking the model to reverse-engineer hidden policy from outcomes alone; in a complex domain like airline rules, that's just too much.
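The record shape this implies is simple but worth making explicit—the human's reasoning travels with the label. A minimal sketch (hypothetical schema, not TauBench's actual format):

```python
# Hypothetical annotated-trace record: keep the human's reasoning next to
# the label so the optimizer can learn *why* a trace failed, not just that
# it did.
from dataclasses import dataclass

@dataclass
class AnnotatedTrace:
    trace_id: str
    conversation: str
    label: bool       # True = compliant, False = non-compliant
    reasoning: str    # the human explanation for the label

example = AnnotatedTrace(
    trace_id="t-042",
    conversation="...",
    label=False,
    reasoning="Approved the cancellation without verifying the airline cancellation rule.",
)
```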

How GEPA actually works in this setup

He gives a compact walkthrough of GEPA as a genetic-style prompt optimizer: start with a seed prompt, mutate or merge candidates, evaluate them on batches, and keep the promising ones. The interesting twist is the Pareto frontier idea—don't just keep the highest average performer, keep diverse prompts that each solve different traces, then try to merge that coverage into one stronger judge.
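The loop he sketches can be reduced to a toy version—illustrative only, not the real GEPA library's API—where per-trace scores drive a Pareto selection: keep every prompt that is best on at least one trace, not just the best average performer.

```python
# Toy sketch of a GEPA-style loop (hypothetical; the real GEPA package has
# its own API). score(prompt) returns a list of per-trace results so we
# can keep a Pareto set of prompts that each win somewhere.
import random

def pareto_frontier(candidates, scores):
    """Keep every candidate that ties for the best score on some trace."""
    n_traces = len(next(iter(scores.values())))
    keep = set()
    for t in range(n_traces):
        best = max(scores[c][t] for c in candidates)
        keep |= {c for c in candidates if scores[c][t] == best}
    return list(keep)

def optimize(seed_prompt, mutate, score, iterations=20):
    candidates = [seed_prompt]
    scores = {seed_prompt: score(seed_prompt)}
    for _ in range(iterations):
        parent = random.choice(candidates)
        child = mutate(parent)          # e.g. LLM reflection on failures
        if child not in scores:
            scores[child] = score(child)
            candidates.append(child)
        candidates = pareto_frontier(candidates, scores)
    # Hand back the survivor with the best average across traces.
    return max(candidates, key=lambda c: sum(scores[c]) / len(scores[c]))
```

The Pareto step is what preserves diversity: a prompt that nails the handful of tricky traces survives even if its average is mediocre, and later mutations or merges can fold that coverage into a stronger overall judge.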

The notebook results: better, but still not 'solved'

In the demo, the seed judge is intentionally conservative: assume the agent is compliant unless there's a specific reason otherwise. That naive judge lands around 61% accuracy and predicts 'compliant' 98% of the time; after optimization, the prompt learns concrete rules around cancellation, flight modification, and communication, pushing validation accuracy to 74% and dramatically improving detection of non-compliance.
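The two numbers being tracked here—accuracy and the judge's 'compliant' rate—are worth computing side by side, since a 98% compliant rate is a red flag regardless of accuracy. A small illustrative helper (made-up data, hypothetical names):

```python
# Illustrative report of the metrics the demo tracks: overall accuracy,
# how often the judge says "compliant" (the bias the talk describes), and
# how many truly non-compliant traces it actually catches. Data is made up.

def judge_report(human: list[int], judge: list[int]) -> dict:
    n = len(human)
    truly_bad = sum(1 for h in human if h == 0)
    caught = sum(1 for h, j in zip(human, judge) if h == 0 and j == 0)
    return {
        "accuracy": sum(h == j for h, j in zip(human, judge)) / n,
        "compliant_rate": sum(judge) / n,
        "noncompliance_recall": caught / max(1, truly_bad),
    }

human = [0, 0, 1, 1, 1]
judge = [1, 0, 1, 1, 1]
print(judge_report(human, judge))
```

A seed judge like the one in the demo would show a compliant_rate near 1.0 and a noncompliance_recall near 0; the optimized prompt moves both toward honest numbers.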

The real lessons came from what failed

Mahmoud is candid that this took debugging, not magic: smaller or older models were bad at both judging and refining, the default GEPA reflection prompt underperformed, and adding the full original agent policy to the seed prompt actually made results worse by trapping optimization in a local minimum. His advice sounds very ML-native: overfit first, inspect traces and generated candidates, tune the reflection prompt, and watch cost—because even this demo ran for hours and cost a few hundred dollars.