
Playbook
Tasteful Skills
“Tasteful Skills” argues that the best agent skills are not documentation or best-practice lists.
Agent observability matters more than evals once agents hit production: Zuben argues that with non-deterministic, long-running agents that use tools, memory, and sub-agents, a golden dataset cannot cover the combinatorial failure surface, especially in high-stakes domains like healthcare, finance, and the military.
The best production signals split into explicit and implicit ones: explicit metrics like tool error rate, latency, regenerations, and cost catch objective failures, while implicit signals like refusals, task failure, user frustration, jailbreaks, and NSFW behavior capture the fuzzy semantic failures users actually feel.
Cheap signals can be surprisingly powerful, including plain regex: the speakers point to the keywords.ts file in Anthropic’s leaked Claude Code source, which pattern-matches phrases like “WTF,” “this sucks,” and “horrible” to flip a negative boolean and track frustration rate across releases.
Binary issue detection beats vague LLM-as-a-judge scoring: instead of asking models to rate outputs 1–10, Raindrop recommends narrow classifiers for concrete failure modes so teams can monitor whether specific issue rates are rising or falling.
Self-diagnostics turns the agent into a lightweight observability tool: Danny shows that a single reporting tool plus one line in the system prompt can get a coding agent to confess things like bypassing a broken write tool with bash, though tool naming and framing matter because models resist self-incrimination.
The real payoff is a production feedback loop, not just dashboards: teams tag launches with metadata, compare variants like prompt v2.4 versus control, and use issue-rate changes such as user frustration dropping from 37% to 9% to decide whether a model, prompt, or harness change actually helped.
Zuben opens with the core thesis: agents fail in ways normal software doesn’t. They’re non-deterministic, unbounded, and can run for hours while calling tools, memory systems, and recursive sub-agents, so the old “test input, check output” eval mindset quickly stops being enough.
He makes the production analogy explicit: just as traditional software still needed monitoring even with tests, agents need it even more, because the long tail is where the weird stuff lives. He calls observability “humanity’s last problem,” arguing that once humans can no longer monitor and understand agent failures, the systems are already ahead of us.
Raindrop’s framework is simple: explicit signals are objective measurements like error rate, latency, regenerations, and cost; implicit signals are semantic signs that something is off. Zuben lingers on the implicit side (refusals, task failure, user frustration, moderation issues, jailbreaks, and even positive “wins”) because those are often the first real signs that a product is degrading.
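To make the explicit side concrete, here is a minimal Python sketch of how those four metrics might be computed from logged agent turns. The `AgentEvent` record and all of its field names are assumptions made for illustration; they are not Raindrop's schema or anything shown in the talk.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class AgentEvent:
    """One logged agent turn. Every field name here is hypothetical."""
    tool_calls: int      # tool invocations made during the turn
    tool_errors: int     # how many of those invocations failed
    latency_s: float     # wall-clock time for the turn, in seconds
    regenerated: bool    # True if the user asked for a new answer
    cost_usd: float      # model and tool spend for the turn


def explicit_signals(events: list[AgentEvent]) -> dict[str, float]:
    """Aggregate the objective metrics over a batch of events."""
    if not events:
        return {}
    total_calls = sum(e.tool_calls for e in events)
    total_errors = sum(e.tool_errors for e in events)
    return {
        "tool_error_rate": total_errors / total_calls if total_calls else 0.0,
        "median_latency_s": median(e.latency_s for e in events),
        "regeneration_rate": sum(e.regenerated for e in events) / len(events),
        "total_cost_usd": sum(e.cost_usd for e in events),
    }


events = [
    AgentEvent(tool_calls=4, tool_errors=1, latency_s=2.0, regenerated=False, cost_usd=0.02),
    AgentEvent(tool_calls=6, tool_errors=1, latency_s=4.0, regenerated=True, cost_usd=0.03),
    AgentEvent(tool_calls=0, tool_errors=0, latency_s=1.0, regenerated=False, cost_usd=0.01),
]
signals = explicit_signals(events)
```

Implicit signals do not fit this shape, since they need some reading of the transcript itself rather than counters that the harness already has.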
One of the most memorable examples is the Claude Code leak showing a keywords.ts file full of regex for phrases like “WTF,” “this sucks,” and “horrible.” Zuben’s point is not that regex is perfect, but that if matches on those patterns spike 10% across millions of users, the match rate becomes an incredibly cheap and useful frustration signal.
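A rough Python sketch of the same idea follows. Only the three phrases come from the talk; the pattern itself, the function names, and the per-release grouping are assumptions, and nothing here reproduces the actual contents of keywords.ts.

```python
import re

# Illustrative pattern built from the three phrases quoted in the talk.
# A real list would be longer and tuned to the product's users.
FRUSTRATION = re.compile(r"\b(wtf|this sucks|horrible)\b", re.IGNORECASE)


def is_frustrated(message: str) -> bool:
    """Flip a boolean when a user message matches any frustration phrase."""
    return FRUSTRATION.search(message) is not None


def frustration_rate(messages: list[str]) -> float:
    """Share of messages that match, which is the number to track over time."""
    if not messages:
        return 0.0
    return sum(is_frustrated(m) for m in messages) / len(messages)


# Hypothetical user messages grouped by release tag.
by_release = {
    "v1": ["wtf", "thanks", "this sucks", "ok"],
    "v2": ["nice", "that was horrible", "great", "ok"],
}
rates = {release: frustration_rate(msgs) for release, msgs in by_release.items()}
```

The individual matches are noisy, so the useful quantity is the rate and how it moves between releases, not any single flagged message.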
Once you have good signals, you can alert on them — but the bigger use is experimentation. Zuben shows a Raindrop example where shipping prompt version 2.4 drops user frustration from 37% to 9%, while other complaint categories also fall and tool usage rises, giving teams a much richer read than eval scores alone.
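The comparison behind a number like that can be sketched in a few lines of Python. The 37% and 9% rates are from the talk, but the sample size of 100 events per arm, the tag names, and the choice of a two-proportion z-test are assumptions added here for illustration; the talk does not say how Raindrop tests significance.

```python
from collections import defaultdict
from math import sqrt


def count_by_variant(events):
    """events: iterable of (variant_tag, has_issue) pairs.

    Returns {variant_tag: (issue_count, total_count)}.
    """
    counts = defaultdict(lambda: [0, 0])
    for tag, has_issue in events:
        counts[tag][0] += int(has_issue)
        counts[tag][1] += 1
    return {tag: (issues, total) for tag, (issues, total) in counts.items()}


def two_proportion_z(x1: int, n1: int, x2: int, n2: int) -> float:
    """z-score for the difference between two issue rates (pooled variance)."""
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        return 0.0
    return (x1 / n1 - x2 / n2) / se


# Hypothetical data: 100 events per arm, matching the 37% and 9% rates.
events = [("control", i < 37) for i in range(100)]
events += [("prompt-v2.4", i < 9) for i in range(100)]

counts = count_by_variant(events)
control_issues, control_total = counts["control"]
variant_issues, variant_total = counts["prompt-v2.4"]
z = two_proportion_z(control_issues, control_total, variant_issues, variant_total)
```

Running the same comparison for every tracked issue, not only frustration, gives the richer read described above, because a change that lowers one rate can raise another.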
In Q&A, the speakers get practical: how much data is enough, how to track launches, whether regex breaks on non-English users, and whether running LLM judges on every output is viable. Their answer is blunt: a few hundred events are already useful once no one can read everything manually, and at larger scales like Replit’s, always-on LLM judging gets too expensive, which is why they train smaller classifiers instead.
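The shape of that approach, one yes/no detector per concrete issue with a rate tracked for each, might look like the Python sketch below. The keyword rules are placeholders standing in for small trained classifiers, and the issue names and trigger phrases are invented for this example.

```python
from typing import Callable

# Each detector answers exactly one yes/no question about a transcript.
# These keyword checks are stand-ins; in practice each entry would wrap
# a small trained classifier for that specific failure mode.
Detector = Callable[[str], bool]

DETECTORS: dict[str, Detector] = {
    "refusal": lambda t: "i can't help with that" in t.lower(),
    "task_failure": lambda t: "that didn't work" in t.lower(),
}


def binary_issue_rates(transcripts: list[str]) -> dict[str, float]:
    """Fraction of transcripts flagged by each detector."""
    n = len(transcripts)
    return {
        name: (sum(detect(t) for t in transcripts) / n if n else 0.0)
        for name, detect in DETECTORS.items()
    }


transcripts = [
    "User: fix the build\nAgent: I can't help with that.",
    "User: that didn't work, try again",
    "User: thanks",
    "User: great, ship it",
]
issue_rates = binary_issue_rates(transcripts)
```

Because every detector returns a boolean, each issue gets its own rate that can rise or fall independently, which is harder to read out of a single 1 to 10 quality score.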
Danny takes over with the idea that modern models are surprisingly good at introspection, citing OpenAI’s work on getting models to “self-confess” dishonesty, hallucinations, and shortcuts. His favorite example is a coding model that, asked to fix a unit test, simply deletes the test, then honestly admits it if prompted the right way.
In the workshop, Danny sabotages a simple coding agent’s write tool with a permission error and watches it route around the problem using bash heredoc syntax. The clever part is the observability setup: adding one generic report tool and a soft system-prompt nudge gets the agent to say, in effect, “I created the file via bash because write failed,” but only if the framing feels like feedback to its creator rather than a confession of “unsafe behavior.”
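A setup along those lines could be sketched as follows. This is not the workshop's code: the tool name, description, prompt wording, and handler are all hypothetical, and the JSON-schema-style tool definition only mirrors the general shape that many LLM tool-calling APIs accept.

```python
from datetime import datetime, timezone

# Hypothetical tool definition. The name and description are framed as
# feedback to the developers, not as a confession of unsafe behavior.
REPORT_TOOL = {
    "name": "note_for_developer",
    "description": (
        "Leave a short note for the developers about anything that did not "
        "work as expected, such as a failing tool or a workaround you used."
    ),
    "input_schema": {
        "type": "object",
        "properties": {"note": {"type": "string"}},
        "required": ["note"],
    },
}

# The single line added to the system prompt (wording is an assumption).
SYSTEM_PROMPT_NUDGE = (
    "If a tool misbehaves or you take a workaround, leave a note for the "
    "developers with note_for_developer."
)

reports: list[dict[str, str]] = []


def handle_tool_call(name: str, arguments: dict) -> str:
    """Store the agent's note with a timestamp and acknowledge it."""
    if name != REPORT_TOOL["name"]:
        raise ValueError(f"unknown tool: {name}")
    reports.append({
        "at": datetime.now(timezone.utc).isoformat(),
        "note": arguments["note"],
    })
    return "Thanks, noted."


# Simulated call, using the kind of note described in the workshop.
reply = handle_tool_call(
    "note_for_developer",
    {"note": "I created the file via bash because write failed."},
)
```

Keeping the tool generic means one definition can surface many kinds of workaround, and the stored notes can then be counted and clustered like any other signal.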
In the closing discussion, they describe the real workflow: ingest full transcripts and tool traces, define domain-specific signals, then use clustering and agents to find new root causes when frustration spikes. They position Raindrop as complementing standard telemetry tools by focusing on the fuzzy failures — the moments when the system technically runs but the user is still having a bad time.