AI EngineerMay 14, 20261h 20m

Mind the Gap (In your Agent Observability) — Amy Boyd & Nitya Narasimhan, Microsoft

TL;DR

Agent observability starts before production, not after — Amy Boyd frames the whole talk around the London Tube’s “mind the gap” sign: the platform is your requirements, the train is your agent, and evals plus monitoring are how you keep reality aligned as models, users, and environments change.
Trace-linked evaluations are the key move — Nitya Narasimhan’s core point is that observability is not just noticing failure, but shortening the time from detection to diagnosis by tying eval results directly to traces, tool calls, and workflow steps.
Microsoft’s stack leans on open standards and mixed environments — Foundry tracing is built on OpenTelemetry, so teams can instrument agents built elsewhere and still pull them into the Foundry control plane, then connect that data into Azure Monitor so developers and IT admins share the same telemetry surface.
The workshop walks from zero to multi-agent observability fast — they build a Contoso travel agent with GPT-4.1, web search, App Insights, function tools for flights/hotels/cars, then a workflow agent with specialist sub-agents, showing traces, quality/safety/agentic evals, and red teaming in one end-to-end flow.
Red teaming is treated as a separate muscle from normal evals — Nitya uses a house analogy: quality evals are the building inspector checking code compliance, while safeguarding is asking someone to break into the house, with attacks like leetspeak and crescendo probes used to find vulnerabilities guardrails miss.
The most forward-looking part is the new 'observe' skill — in an early preview released “literally two weeks ago,” a coding agent can generate eval datasets, run baseline batch evals, optimize prompts, compare versions, and even roll back to the best-performing version, all with a human in the loop.

Summary

“Mind the gap” becomes a real observability framework

Amy Boyd opens with a London-perfect metaphor that actually holds up: the gap is the distance between what your agent is supposed to do and what it really does in the wild. She stretches it across quality, safety, and monitoring — guardrails warn users, platforms change over time, and your job is to keep checking whether the “train” still fits the “platform.”

The practical setup: repo, Discord, and why this is bigger than one session

They immediately ground the talk in assets: a GitHub repo they’ve been evolving over multiple workshops and a Microsoft Foundry Discord with a dedicated AI Engineer channel. Nitya is blunt about expectations — this isn’t a “learn everything in 30 minutes” session, it’s a “cooking show,” with a compressed 4-hour workshop meant to get people started and then experimenting on their own.

Non-deterministic agents need eval, monitoring, and optimization together

Amy lays out the three-part reliability loop: evaluate for performance, quality, and safety; monitor over time as requirements and users shift; then optimize based on the data instead of staring at scores with no next step. She also positions Microsoft Foundry as flexible rather than all-or-nothing: build there, host there, or just bring in agents from elsewhere and observe them centrally.

From simple travel agent to traceable system in Foundry

The live build starts with a fictional Contoso travel assistant inside ai.azure.com, using a quick-start project, GPT-4.1, Bing web search, and App Insights for tracing. Amy shows how little setup it takes to get from a blank project to a grounded agent whose traces expose tokens, estimated cost, and built-in AI quality and safety metrics — and she points out a low task-adherence score as the kind of early warning you want before shipping.

The SDK path: from one agent to specialist agents with workflows

Nitya then shifts into notebooks and code spaces, walking through a staged build: connect to Foundry, create a basic agent, add function tools for flights/hotels/cars, and then split the monolith into specialist agents orchestrated through Foundry workflows. Her point isn’t just architecture for architecture’s sake — once each step has its own traceable path, you can finally see which sub-agent is failing, over-spending, or making weak tool calls.

Custom tracing and evaluations turn debugging into diagnosis

Here the talk gets more concrete about observability itself: enable OpenTelemetry-style tracing, add custom trace attributes, and push telemetry to Azure Monitor so AI behavior sits alongside the rest of your cloud signals. Then come built-in evaluators across quality, safety, and agentic behavior like intent resolution, tool calling, and task adherence — including a memorable example where groundedness fails because the system answers with August 2024 instead of 2025.

Red teaming: not “does it work,” but “can someone break it?”

Nitya makes a clean distinction between normal usage and hostile usage. Safety checks handle expected behavior, but red teaming means unleashing a second AI to attack your first one with strategies like leetspeak, indirect prompt attacks, or slow-burn “crescendo” attacks — the “frog in boiling water” analogy she uses to describe an attack that escalates before the model realizes what’s happening.

The big finish: coding agents that run the observability loop for you

The most energizing section is the demo of the early-preview Foundry “observe” skill inside GitHub Copilot chat. Instead of wiring evals by hand, the agent inspects the project, generates an eval dataset, runs baseline batch evals, spots weak task adherence, rewrites prompts, compares versions, and suggests rolling back to version 5 when later prompt tweaks regress quality — a very concrete vision of human-in-the-loop observability instead of dashboard tourism.

Was This Useful?

LinkedIn X Email

Keep Reading

Tune your feedFive quick questions, and the feed ranks what matters to you first.

Or just get notified

The weekly Echo. Signal worth keeping in your inbox.

Every new piece, announced on X.

Follow @alcreon on X

Mind the Gap (In your Agent Observability) — Amy Boyd & Nitya Narasimhan, Microsoft

Summary

“Mind the gap” becomes a real observability framework

The practical setup: repo, Discord, and why this is bigger than one session

Non-deterministic agents need eval, monitoring, and optimization together

From simple travel agent to traceable system in Foundry

The SDK path: from one agent to specialist agents with workflows

Custom tracing and evaluations turn debugging into diagnosis

Red teaming: not “does it work,” but “can someone break it?”

The big finish: coding agents that run the observability loop for you

Was This Useful?

Or just get notified

Read Next

The Retirement Email Isn't a Warning

The Cheapest Model That Passes

Cheap Models, Hard Tasks

Summary

“Mind the gap” becomes a real observability framework

The practical setup: repo, Discord, and why this is bigger than one session

Non-deterministic agents need eval, monitoring, and optimization together

From simple travel agent to traceable system in Foundry

The SDK path: from one agent to specialist agents with workflows

Custom tracing and evaluations turn debugging into diagnosis

Red teaming: not “does it work,” but “can someone break it?”

The big finish: coding agents that run the observability loop for you

Was This Useful?

Make Alcreon Yours

Or just get notified

Read Next

The Retirement Email Isn't a Warning

The Cheapest Model That Passes

Cheap Models, Hard Tasks