
Playbook
Tasteful Skills
“Tasteful Skills” argues that the best agent skills are not documentation or best-practice lists.
Agent observability starts before production, not after — Amy Boyd frames the whole talk around the London Tube’s “mind the gap” sign: the platform is your requirements, the train is your agent, and evals plus monitoring are how you keep reality aligned as models, users, and environments change.
Trace-linked evaluations are the key move — Nitya Narasimhan’s core point is that observability is not just noticing failure, but shortening the time from detection to diagnosis by tying eval results directly to traces, tool calls, and workflow steps.
Microsoft’s stack leans on open standards and mixed environments — Foundry tracing is built on OpenTelemetry, so teams can instrument agents built elsewhere and still pull them into the Foundry control plane, then connect that data into Azure Monitor so developers and IT admins share the same telemetry surface.
The workshop walks from zero to multi-agent observability fast — they build a Contoso travel agent with GPT-4.1, web search, App Insights, function tools for flights/hotels/cars, then a workflow agent with specialist sub-agents, showing traces, quality/safety/agentic evals, and red teaming in one end-to-end flow.
Red teaming is treated as a separate muscle from normal evals — Nitya uses a house analogy: quality evals are the building inspector checking code compliance, while safeguarding is asking someone to break into the house, with attacks like leetspeak and crescendo probes used to find vulnerabilities guardrails miss.
The most forward-looking part is the new 'observe' skill — in an early preview released “literally two weeks ago,” a coding agent can generate eval datasets, run baseline batch evals, optimize prompts, compare versions, and even roll back to the best-performing version, all with a human in the loop.
Amy Boyd opens with a London-perfect metaphor that actually holds up: the gap is the distance between what your agent is supposed to do and what it really does in the wild. She stretches it across quality, safety, and monitoring — guardrails warn users, platforms change over time, and your job is to keep checking whether the “train” still fits the “platform.”
They immediately ground the talk in assets: a GitHub repo they’ve been evolving over multiple workshops and a Microsoft Foundry Discord with a dedicated AI Engineer channel. Nitya is blunt about expectations — this isn’t a “learn everything in 30 minutes” session, it’s a “cooking show,” with a compressed 4-hour workshop meant to get people started and then experimenting on their own.
Amy lays out the three-part reliability loop: evaluate for performance, quality, and safety; monitor over time as requirements and users shift; then optimize based on the data instead of staring at scores with no next step. She also positions Microsoft Foundry as flexible rather than all-or-nothing: build there, host there, or just bring in agents from elsewhere and observe them centrally.
The live build starts with a fictional Contoso travel assistant inside ai.azure.com, using a quick-start project, GPT-4.1, Bing web search, and App Insights for tracing. Amy shows how little setup it takes to get from a blank project to a grounded agent whose traces expose tokens, estimated cost, and built-in AI quality and safety metrics — and she points out a low task-adherence score as the kind of early warning you want before shipping.
Nitya then shifts into notebooks and GitHub Codespaces, walking through a staged build: connect to Foundry, create a basic agent, add function tools for flights/hotels/cars, and then split the monolith into specialist agents orchestrated through Foundry workflows. Her point isn’t architecture for its own sake: once each step has its own traceable path, you can finally see which sub-agent is failing, over-spending, or making weak tool calls.
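To make the "split the monolith" step concrete, here is a minimal sketch in plain Python. It does not use the Foundry SDK or Foundry workflows; the tool functions, the `Specialist` class, and the `route` helper are all hypothetical names invented for illustration. The only idea it borrows from the talk is that each specialist owns its tools and leaves behind its own record of calls.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical stand-ins for the workshop's flight/hotel/car function tools.
def search_flights(origin: str, destination: str) -> dict:
    return {"tool": "search_flights", "origin": origin, "destination": destination}

def search_hotels(city: str) -> dict:
    return {"tool": "search_hotels", "city": city}

def search_cars(city: str) -> dict:
    return {"tool": "search_cars", "city": city}

@dataclass
class Specialist:
    """A sub-agent that owns a set of tools and logs every call it makes."""
    name: str
    tools: dict[str, Callable[..., dict]]
    calls: list[dict] = field(default_factory=list)

    def call(self, tool_name: str, **kwargs) -> dict:
        result = self.tools[tool_name](**kwargs)
        # The per-agent log is what lets you ask "which sub-agent did this?"
        self.calls.append({"agent": self.name, "tool": tool_name, "args": kwargs})
        return result

# One specialist per domain instead of one agent holding every tool.
SPECIALISTS = {
    "flights": Specialist("flight_agent", {"search_flights": search_flights}),
    "hotels": Specialist("hotel_agent", {"search_hotels": search_hotels}),
    "cars": Specialist("car_agent", {"search_cars": search_cars}),
}

def route(topic: str, tool_name: str, **kwargs) -> dict:
    """Hypothetical orchestrator step: hand the request to the right specialist."""
    return SPECIALISTS[topic].call(tool_name, **kwargs)

route("flights", "search_flights", origin="LHR", destination="SEA")
route("hotels", "search_hotels", city="Seattle")
for agent in SPECIALISTS.values():
    print(agent.name, len(agent.calls))
```

In a real workflow the routing decision would come from a model rather than a dictionary lookup, but the per-agent call log is the part that matters for observability.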
Here the talk gets more concrete about observability itself: enable OpenTelemetry-style tracing, add custom trace attributes, and push telemetry to Azure Monitor so AI behavior sits alongside the rest of your cloud signals. Then come built-in evaluators across quality, safety, and agentic behavior like intent resolution, tool calling, and task adherence — including a memorable example where groundedness fails because the system answers with August 2024 instead of 2025.
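The idea of tying an eval result to the trace that produced it can be sketched with the standard library alone. This is not the OpenTelemetry SDK and not Foundry's groundedness evaluator: `Span`, `EvalResult`, `run_turn`, and the year-matching check are hypothetical, and the check is a deliberately crude stand-in inspired by the August 2024 versus 2025 example.

```python
import re
import uuid
from dataclasses import dataclass, field

YEAR = re.compile(r"\b(?:19|20)\d{2}\b")

@dataclass
class Span:
    trace_id: str
    name: str
    attributes: dict = field(default_factory=dict)  # custom trace attributes

@dataclass
class EvalResult:
    trace_id: str  # the link back to the trace
    evaluator: str
    passed: bool

def years_grounded(answer: str, context: str) -> bool:
    """Toy check: every year mentioned in the answer must appear in the context."""
    return set(YEAR.findall(answer)) <= set(YEAR.findall(context))

def run_turn(spans, results, agent, context, answer):
    """Record one span and one eval result that share a trace id."""
    trace_id = uuid.uuid4().hex
    spans.append(Span(trace_id, "agent.turn", {"agent.name": agent, "answer": answer}))
    results.append(EvalResult(trace_id, "toy_groundedness", years_grounded(answer, context)))
    return trace_id

def failing_spans(spans, results):
    """Jump from failed evals straight to the spans that produced them."""
    failed = {r.trace_id for r in results if not r.passed}
    return [s for s in spans if s.trace_id in failed]

spans, results = [], []
context = "Festival dates: August 2025"
run_turn(spans, results, "hotel_agent", context, "It runs in August 2025.")
bad_id = run_turn(spans, results, "flight_agent", context, "It runs in August 2024.")
for span in failing_spans(spans, results):
    print(span.trace_id, span.attributes["agent.name"])
```

Because the eval result carries the trace id, a failed score points directly at the agent and answer responsible, which is the detection-to-diagnosis shortcut the talk describes.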
Nitya makes a clean distinction between normal usage and hostile usage. Safety checks handle expected behavior, but red teaming means unleashing a second AI to attack your first one with strategies like leetspeak, indirect prompt attacks, or slow-burn “crescendo” attacks — the “frog in boiling water” analogy she uses to describe an attack that escalates before the model realizes what’s happening.
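A small sketch shows why a strategy like leetspeak can slip past a simple guardrail. The character map, the blocked phrase, and the keyword filter below are assumptions made up for this example; they are not the attack strategies or guardrails shipped with any red-teaming tool.

```python
# Hypothetical character substitutions; real probes vary the mapping.
LEET_MAP = str.maketrans({"a": "4", "e": "3", "i": "1", "o": "0", "s": "5", "t": "7"})

def to_leetspeak(text: str) -> str:
    """Rewrite a prompt with look-alike digits in place of common letters."""
    return text.lower().translate(LEET_MAP)

def naive_filter_blocks(prompt: str, blocked_terms: list[str]) -> bool:
    """A deliberately weak guardrail: exact substring matching only."""
    lowered = prompt.lower()
    return any(term in lowered for term in blocked_terms)

blocked = ["ignore previous instructions"]
plain = "Ignore previous instructions"
obfuscated = to_leetspeak(plain)

print(obfuscated)                                # 1gn0r3 pr3v10u5 1n57ruc710n5
print(naive_filter_blocks(plain, blocked))       # True
print(naive_filter_blocks(obfuscated, blocked))  # False
```

The filter catches the plain prompt and misses the obfuscated one, which is the kind of gap an automated attacker is meant to surface before a real one does.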
The most energizing section is the demo of the early-preview Foundry “observe” skill inside GitHub Copilot chat. Instead of wiring evals by hand, the agent inspects the project, generates an eval dataset, runs baseline batch evals, spots weak task adherence, rewrites prompts, compares versions, and suggests rolling back to version 5 when later prompt tweaks regress quality — a very concrete vision of human-in-the-loop observability instead of dashboard tourism.
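The compare-and-roll-back step reduces to a simple comparison once batch eval scores exist per prompt version. The sketch below is not the "observe" skill's implementation; the function names and the scores are hypothetical, chosen only to mirror the "roll back to version 5" moment in the demo.

```python
from statistics import mean
from typing import Optional

def best_version(scores_by_version: dict[int, list[float]]) -> int:
    """Return the version with the highest mean score (first one wins ties)."""
    return max(scores_by_version, key=lambda v: mean(scores_by_version[v]))

def rollback_target(current: int, scores_by_version: dict[int, list[float]]) -> Optional[int]:
    """Suggest a version to roll back to, or None if the current one is best."""
    best = best_version(scores_by_version)
    return best if best != current else None

# Hypothetical task-adherence scores from batch evals of three prompt versions.
task_adherence = {
    5: [4.5, 4.0, 5.0],
    6: [4.0, 3.5, 4.0],
    7: [3.0, 4.0, 3.5],
}

suggestion = rollback_target(current=7, scores_by_version=task_adherence)
print(suggestion)  # 5
```

The function only suggests a target; keeping the actual rollback as a separate, human-approved step matches the human-in-the-loop framing of the demo.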