LLM Observability, Evaluation, Experimentation Platform — Dat Ngo, Arize
TL;DR
Observability is the audit trail for agents: Dat argues that code no longer tells you what an agent did; telemetry does. That is why Arize is built around OpenTelemetry traces, spans, sessions, and distribution views of agent paths.
A good fix can create two or three new failures: In non-deterministic systems, changing a prompt, model, or orchestration may solve one issue while causing regressions elsewhere, so you need evals tied to actual system behavior.
There are five flavors of signal, not just LLM-as-a-judge: Dat highlights human feedback, golden datasets, deterministic checks like JSON schema validation, model-based judges, and business metrics such as saving money, making money, or saving time.
Evals work at multiple scopes: He distinguishes span evals for one input-output step, multi-span evals across components, trajectory evals for the full path an agent took, and session evals for the whole conversation state.
Arize thinks the entire optimization loop should be automated: Dat says users do not want to live in dashboards, so Arize exposes everything through CLI tools and an AI assistant called Alex that can inspect traces, spot latency or errors, and propose what to evaluate next.
Arize splits its products by audience: Phoenix is the open source, single-container option for engineering teams, while Arize AX is aimed at large enterprises like Uber, Booking, and Reddit.
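To make the eval scopes above concrete, here is a minimal sketch in plain Python. It is not Arize's or Phoenix's API: the `Span` record, its field names, and the three check functions are hypothetical, chosen only to show how a deterministic span check (valid JSON with expected keys), a trajectory check (the path a trace took), and a session check (a conversation-level property) differ in what they look at.

```python
import json
from dataclasses import dataclass


@dataclass
class Span:
    """One recorded step of an agent run (hypothetical shape, not Arize's schema)."""
    session_id: str
    trace_id: str
    name: str    # step name, e.g. "plan", "search_tool", "respond"
    output: str  # raw text the step produced


def span_eval_valid_json(span: Span, required_keys: set) -> bool:
    """Span eval: deterministic check that one step's output is a JSON object with the required keys."""
    try:
        payload = json.loads(span.output)
    except json.JSONDecodeError:
        return False
    return isinstance(payload, dict) and required_keys.issubset(payload)


def trajectory_eval(spans: list, trace_id: str, expected_path: list) -> bool:
    """Trajectory eval: did this trace take the expected sequence of steps, in order?"""
    path = [s.name for s in spans if s.trace_id == trace_id]
    return path == expected_path


def session_eval(spans: list, session_id: str, max_traces: int) -> bool:
    """Session eval: toy conversation-level check that the session used at most max_traces turns."""
    traces = {s.trace_id for s in spans if s.session_id == session_id}
    return 0 < len(traces) <= max_traces


# Illustrative data only; the step names and payloads are made up.
spans = [
    Span("s1", "t1", "plan", '{"tool": "search", "query": "refund policy"}'),
    Span("s1", "t1", "search_tool", '{"results": []}'),
    Span("s1", "t1", "respond", "Sorry, I could not find that."),
]

print(span_eval_valid_json(spans[0], {"tool", "query"}))
print(trajectory_eval(spans, "t1", ["plan", "search_tool", "respond"]))
print(session_eval(spans, "s1", max_traces=3))
```

The same recorded spans feed all three checks; only the grouping key changes (one span, one trace, one session). A multi-span eval would sit between the first two, comparing a chosen subset of steps rather than the full path.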
The Breakdown
Dat Ngo says the real problem in enterprise AI is not building agents; it is seeing what they did, deciding what counts as good, and catching the regressions your "fix" quietly introduced elsewhere. He lays out Arize's stack for observability, evals, and experimentation, then makes the bigger claim that the whole loop should eventually run itself.
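The regression problem described here can be illustrated with a small sketch, assuming each experiment run yields a pass/fail result per example on the same dataset. The `compare_runs` function and the example names are hypothetical, not part of any Arize tooling; the point is only that a change should be judged per example, not by an aggregate score alone.

```python
def compare_runs(baseline: dict, candidate: dict) -> dict:
    """Compare per-example pass/fail results from two runs over the same dataset.

    Returns the examples the candidate fixed and the ones it regressed.
    Only examples present in both runs are compared.
    """
    shared = sorted(baseline.keys() & candidate.keys())
    return {
        "fixed": [k for k in shared if not baseline[k] and candidate[k]],
        "regressed": [k for k in shared if baseline[k] and not candidate[k]],
    }


# Illustrative results only: True means the example passed its eval.
baseline = {"refund": False, "shipping": True, "billing": True, "login": True}
candidate = {"refund": True, "shipping": False, "billing": False, "login": True}

report = compare_runs(baseline, candidate)
print("fixed:", report["fixed"])
print("regressed:", report["regressed"])
```

In this made-up data the prompt change fixes one example and breaks two others, so the overall pass count falls from three to two even though the targeted issue is resolved.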