AI Engineer · June 25, 2026 · 8m

Production Evals For Agentic AI Systems - Nishant Gupta, Meta Superintelligence Labs

TL;DR

Agentic AI shifts evaluation from answers to workflows: Instead of asking 'did the model generate the correct answer?', teams must ask 'did the system behave correctly?' across planning, tool use, recovery, and multi-agent coordination.
Failure modes are hierarchical, not just hallucinations: Beneath reasoning errors lie memory, retrieval, and safety failures, and at the top sit multi-agent coordination breakdowns. Evaluating only model output misses most production risk.
Production telemetry is the highest-value evaluation signal: Every real-user interaction becomes evaluation data. Execution traces, user outcomes, escalations, and feedback signals are far more representative than any offline benchmark.
Think like an SRE, not a researcher: The goal is not maximizing benchmark accuracy but maximizing dependable outcomes. Reliability, availability, latency, cost, and recovery become North Star metrics; accuracy is just one input.
Agent systems drift silently: Model updates, prompt changes, tool changes, and shifting user behavior cause reliability to degrade slowly. Without continuous monitoring, teams don't discover drift until users complain.
Evaluation is becoming part of the control plane, not a separate tool: The industry is moving toward an architecture where evaluation runs as an always-on service, collecting telemetry, running simulations, coordinating human review, and governing behavior in production.
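
To make the workflow-level and SRE-style points above concrete, here is a minimal sketch of how execution traces might be rolled up into dependability metrics. The `Step` and `Trace` records, their field names, and the step labels are hypothetical illustrations, not a schema described in the talk; a real system would map its own telemetry onto something similar.

```python
from __future__ import annotations

from dataclasses import dataclass
from statistics import mean, median


@dataclass
class Step:
    kind: str            # hypothetical labels: "plan", "tool_call", "recovery"
    ok: bool             # did this step succeed?
    latency_ms: float
    cost_usd: float = 0.0


@dataclass
class Trace:
    steps: list[Step]
    task_completed: bool      # user-level outcome, not model-output correctness
    escalated: bool = False   # interaction was handed off to a human


def summarize(traces: list[Trace]) -> dict:
    """Aggregate per-interaction traces into system-level reliability metrics."""
    if not traces:
        return {}
    # Traces in which at least one step failed.
    with_failure = [t for t in traces if any(not s.ok for s in t.steps)]
    # A failure counts as recovered if the task still completed.
    recovered = [t for t in with_failure if t.task_completed]
    latencies = [sum(s.latency_ms for s in t.steps) for t in traces]
    costs = [sum(s.cost_usd for s in t.steps) for t in traces]
    return {
        "success_rate": sum(t.task_completed for t in traces) / len(traces),
        "recovery_rate": len(recovered) / len(with_failure) if with_failure else None,
        "escalation_rate": sum(t.escalated for t in traces) / len(traces),
        "p50_latency_ms": median(latencies),
        "mean_cost_usd": mean(costs),
    }
```

The design choice worth noting is that nothing here inspects the model's text: every metric is computed from what the system did and how the interaction ended, which is the shift from answers to workflows that the summary describes.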

The Breakdown

Benchmarks measure model capability, but production measures system behavior. For agentic AI, the dominant failure modes aren't only hallucinations but tool failures, coordination breakdowns, and silent drift, so evaluation must become continuous infrastructure rather than a pre-deployment testing phase.
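
As one illustration of what catching silent drift could look like, the sketch below keeps a rolling window of production outcomes and flags when the success rate falls too far below a baseline. The `DriftMonitor` class, the window size, and the fixed `max_drop` threshold are assumptions made for this example, not details from the talk; a production setup might well prefer a statistical test over a fixed threshold.

```python
from collections import deque


class DriftMonitor:
    """Flags drift when the rolling success rate drops below a baseline."""

    def __init__(self, baseline_rate: float, window: int = 100, max_drop: float = 0.05):
        self.baseline_rate = baseline_rate    # e.g. measured at the last known-good release
        self.max_drop = max_drop              # tolerated absolute drop (assumed value)
        self.outcomes = deque(maxlen=window)  # oldest outcomes fall off automatically

    def current_rate(self):
        """Success rate over the current window, or None if no data yet."""
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def drifted(self) -> bool:
        # Wait for a full window so a few early failures do not trigger alerts.
        if len(self.outcomes) < self.outcomes.maxlen:
            return False
        return self.baseline_rate - self.current_rate() > self.max_drop

    def record(self, success: bool) -> bool:
        """Record one production outcome and report whether drift is detected."""
        self.outcomes.append(success)
        return self.drifted()
```

Calling `record` on every completed interaction turns each real-user outcome into a monitoring signal, so degradation from a model update, prompt change, or tool change can surface before users complain.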