
Static evals are breaking because AI systems no longer stay still — Vincent Koc argues that agentic apps, adaptive harnesses like OpenClaw, and fast-shipping AI software make fixed benchmarks feel like testing a moving target with yesterday’s assumptions.
AI needs its version of chaos engineering, not just benchmark worship — drawing on his Comet work with customers like Uber, Netflix, and UK banks, he says the industry over-indexes on handcrafted offline tests and under-invests in probing where agents actually fail in the wild.
The shift is from prompt engineering to intent engineering — after the era of “doom scroll, wordsmith instructions,” then context engineering with RAG and tool calling, Koc says 2025 is about systems that infer user intent and self-optimize toward outcomes.
Evaluations should optimize for end states, not canned answers — instead of checking “1 + 1 = 2” style outputs, he wants rubrics for ambiguity, personality, and business goals, where traces and telemetry continuously regenerate what gets tested.
The dangerous part is the changing 20%, not the stable 80% — most agent behavior may look repetitive, but it’s the weird new customer query or strange usage pattern that can wreck the business, so evals need to adapt as those edge cases emerge.
Telemetry-aware agents can start healing themselves — Koc points to harnesses that notice errors, cost spikes, or failures and self-correct, framing evals as living software or even agents themselves rather than frozen datasets.
Vincent Koc opens with a story that tells you exactly who he is: he wore 2013-era VR goggles for three hours even though the warning label said five minutes, then spent three hours vomiting afterward. His point is that life on the edge of technology is always a little broken and weird — and measurement has to account for that, not pretend systems are clean and stable.
At Comet, Koc works on eval research and benchmarking with universities and companies ranging from Uber to Netflix to UK banks, so he’s not anti-evals. But he says the industry’s fixation on static benchmarks has left a huge gap: unlike software engineering, where unit tests sit alongside observability and chaos engineering, AI still mostly relies on handcrafted question sets and offline checks.
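One way to picture the chaos-engineering idea is fault injection around an agent's tools. The sketch below is illustrative only and is not taken from the talk or from Comet's tooling: `lookup_order`, `agent_step`, and `run_chaos_probe` are hypothetical names, and the "agent" is a toy retry loop standing in for a real one.

```python
import random
from typing import Callable


class ToolError(RuntimeError):
    """Raised when the (simulated) tool call fails."""


def flaky(tool: Callable[[str], str], failure_rate: float,
          rng: random.Random) -> Callable[[str], str]:
    """Wrap a tool so that a fraction of its calls fail on purpose."""
    def wrapped(query: str) -> str:
        if rng.random() < failure_rate:
            raise ToolError("injected fault")
        return tool(query)
    return wrapped


def lookup_order(order_id: str) -> str:
    # Hypothetical tool standing in for a real backend call.
    return f"order {order_id}: shipped"


def agent_step(tool: Callable[[str], str], order_id: str,
               max_retries: int = 2) -> str:
    """Toy agent: retry the tool, then degrade gracefully instead of crashing."""
    for _ in range(max_retries + 1):
        try:
            return tool(order_id)
        except ToolError:
            continue
    return "Sorry, I can't check that order right now."


def run_chaos_probe(failure_rate: float, trials: int, seed: int = 0) -> dict:
    """Count how often the agent answers versus falls back under injected faults."""
    rng = random.Random(seed)
    tool = flaky(lookup_order, failure_rate, rng)
    outcomes = {"answered": 0, "degraded": 0}
    for i in range(trials):
        reply = agent_step(tool, str(i))
        outcomes["answered" if "shipped" in reply else "degraded"] += 1
    return outcomes


if __name__ == "__main__":
    for rate in (0.0, 0.3, 0.9):
        print(rate, run_chaos_probe(rate, trials=200))
```

The probe does not check a canned answer. It checks that the agent still ends in an acceptable state (an answer or a polite fallback) as the failure rate rises, which is the kind of question a fixed question set rarely asks.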
He skewers the conference pattern everyone recognizes: endless benchmark papers that prove a model can do some narrow thing, without helping anyone understand a production agent. The result is giant datasets that feel reassuring until something goes wrong in the real world — and because AI apps are malleable, not static, failure is less an exception than an inevitability.
Koc references OpenClaw, which he contributes to, as proof that even the testing harness is now changing itself. If the harness adapts as skills and capabilities evolve, then traditional benchmarks can’t keep pace; the test itself has to become adaptive too, which is why he highlights emerging work on adaptive testing for LLM evals.
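As a rough illustration of what "adaptive" can mean for a test, the sketch below picks each next item based on the previous result instead of running a fixed list. It is a plain binary-search staircase and assumes the system passes every item up to some difficulty and fails everything harder; published adaptive-testing methods are more sophisticated, and nothing here reflects how OpenClaw works. The function name and the integer difficulty scale are made up for the example.

```python
from typing import Callable, Optional, Sequence


def adaptive_probe(
    difficulties: Sequence[int],
    passes: Callable[[int], bool],
    budget: int,
) -> tuple[Optional[int], list[tuple[int, bool]]]:
    """Ask at most `budget` items, moving harder after a pass and easier after a fail.

    Returns the hardest difficulty that was passed (or None) and the items asked.
    """
    ordered = sorted(difficulties)
    low, high = 0, len(ordered) - 1
    asked: list[tuple[int, bool]] = []
    while low <= high and len(asked) < budget:
        mid = (low + high) // 2
        difficulty = ordered[mid]
        ok = passes(difficulty)
        asked.append((difficulty, ok))
        if ok:
            low = mid + 1   # passed: try something harder next
        else:
            high = mid - 1  # failed: step back to something easier
    passed = [d for d, ok in asked if ok]
    return (max(passed) if passed else None), asked


if __name__ == "__main__":
    # Hypothetical system under test that copes with difficulty 6 and below.
    estimate, asked = adaptive_probe(range(1, 10), lambda d: d <= 6, budget=5)
    print(estimate, asked)
```

With nine items, the probe needs only three questions to locate the boundary, and re-running it after the harness or model changes will ask a different set of items if the boundary has moved.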
He describes prompt engineering with real contempt and humor: “doom scroll, wordsmith instructions,” smashing words into a model and hoping the output improves, like accidentally discovering a painkiller while trying to treat liver disease. Context engineering made things more tractable, because RAG, tool calling, and MCP-based decomposition let teams test the pieces of a larger agent separately. In 2025, he thinks the real shift is toward intent engineering, where machines adapt to what the user is actually trying to achieve.
Part of the confusion, he says, is that many people still haven’t internalized how capable models have become. He points to optimization work and ARC-AGI-style puzzles, where models can pattern-match on problems that are genuinely difficult for humans. With that level of capability, personalized, adaptive behavior is becoming normal, and evals now have to answer how your experience differs from mine and whether both are still “correct.”
His alternative is to move from static answer keys to intent-based outcomes: evaluate ambiguity, personality, and business goals with rubrics, and let traces from real usage generate new suites automatically. He wants online, always-on evals fed by telemetry, where agents notice what’s changing in customer behavior, what’s breaking, and what it’s costing — then update the tests and even self-correct.
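A minimal sketch of that loop might look like the following, assuming traces are logged as a user query plus a final state. The `Trace` fields, the rubric criteria, the weights, and the threshold are all invented for illustration; they are not Comet's schema or Koc's rubric.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Trace:
    """One logged interaction; the field names are assumptions for this sketch."""
    user_query: str
    final_state: dict
    error: bool = False


@dataclass
class Criterion:
    """One weighted rubric line that inspects the end state, not the wording."""
    name: str
    weight: int
    check: Callable[[dict], bool]


# Hypothetical rubric for a refund agent: business goal, personality, cost.
RUBRIC = [
    Criterion("goal_met", 3, lambda s: s.get("refund_issued") is True),
    Criterion("tone_ok", 1, lambda s: s.get("tone") == "polite"),
    Criterion("within_budget", 1, lambda s: s.get("cost_usd", 0.0) <= 0.05),
]


def score(final_state: dict, rubric: list[Criterion]) -> float:
    """Weighted share of rubric criteria that the end state satisfies."""
    total = sum(c.weight for c in rubric)
    earned = sum(c.weight for c in rubric if c.check(final_state))
    return earned / total


def regenerate_suite(traces: list[Trace], rubric: list[Criterion],
                     threshold: float) -> list[str]:
    """Turn errored or low-scoring production traces into new test queries."""
    suite: list[str] = []
    for trace in traces:
        failed = trace.error or score(trace.final_state, rubric) < threshold
        if failed and trace.user_query not in suite:
            suite.append(trace.user_query)
    return suite


if __name__ == "__main__":
    good = {"refund_issued": True, "tone": "polite", "cost_usd": 0.01}
    bad = {"refund_issued": False, "tone": "polite", "cost_usd": 0.01}
    traces = [
        Trace("refund my order", good),
        Trace("refund half of a gift card split three ways", bad),
        Trace("status of a refund from last year", {}, error=True),
    ]
    print(regenerate_suite(traces, RUBRIC, threshold=0.8))
```

The suite here is an output of production traffic, not a hand-written file, so it changes whenever customer behavior does. A real system would also need deduplication by meaning and a way to retire cases that no longer matter.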
Koc closes on a memorable framing: maybe 80% of agent behavior is the stable, known part, but the 20% that keeps changing is what can blow up your business. His thesis is that evals should stop being treated as frozen datasets and start being treated as code, software, or even living agents — self-optimizing systems defined by the end state you want, not a static set of examples from the past.