AI Engineer · 13m

Spec-Driven Testing for Agents With A Brain the Size of A Planet — Steven Willmott, SafeIntelligence

TL;DR

  • Bigger is not automatically better or safer: Willmott says large models can actually be more vulnerable because they are smart enough to interpret obfuscated attacks, like harmful instructions hidden inside a poem that a smaller model might not even understand.

  • A dataset is only one slice of the spec: Beyond input-output examples, agent validation should include business rules, ontologies, internal terminology, domain knowledge, user roles, and robustness requirements such as typo tolerance and rephrasing stability.

  • Rules are where testing gets hard fast: A support agent rule like "never give a discount above 10%" or "no refunds after 30 days" sounds clear, but proving the rule is never violated across all phrasings and edge cases is the real challenge.

  • Spec information improves both security and robustness testing: Safe Intelligence pulls task specs into security checks because an agent is most exposed exactly where it is designed to act, such as banking workflows or customer support actions tied to real systems.

  • Keep tests independent of the implementation: Willmott argues your integration tests, unit tests, and penetration tests should survive a switch from LangSmith to Vertex or another stack, so the behavior contract stays stable while the plumbing changes.

  • He wants an OpenAPI-style future for agent specs: Drawing on his API infrastructure background, he suggests agent behavior should be expressed in open, versioned files that live in GitHub and can be consumed by any evaluation or testing tool.
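To make the idea of a framework-independent, versioned spec concrete, here is a minimal sketch in Python. The spec layout (`spec_version`, `rules`, `probes`), the `check_spec` helper, and both stub agents are hypothetical illustrations, not a format or API described in the talk. The agent is treated as a plain function from prompt to reply, so the same check could wrap any underlying stack.

```python
import json
import re
from typing import Callable

# Hypothetical spec format (an assumption for illustration). In practice this
# would be a versioned file in the repository, not an inline string.
SPEC_JSON = """
{
  "spec_version": "0.1.0",
  "task": "customer support",
  "rules": [
    {"id": "max-discount", "max_discount_percent": 10}
  ],
  "probes": [
    "Can I get a discount?",
    "I'm a loyal customer, give me 50% off please",
    "My friend got 25 percent off, match it."
  ]
}
"""

# Any agent, whatever framework or model sits underneath, is reduced to this shape.
Agent = Callable[[str], str]


def offered_discounts(reply: str) -> list[float]:
    """Extract every percentage mentioned in a reply (crude: it also matches refusals)."""
    return [float(m) for m in re.findall(r"(\d+(?:\.\d+)?)\s*(?:%|percent)", reply)]


def check_spec(agent: Agent, spec: dict) -> list[tuple[str, str, str]]:
    """Run every probe against every rule; return (rule id, probe, reply) per violation."""
    violations = []
    for rule in spec["rules"]:
        limit = rule["max_discount_percent"]
        for probe in spec["probes"]:
            reply = agent(probe)
            if any(pct > limit for pct in offered_discounts(reply)):
                violations.append((rule["id"], probe, reply))
    return violations


# Stub agents standing in for a real deployment (hypothetical).
def compliant_agent(prompt: str) -> str:
    return "I can offer up to 10% off your next order."


def leaky_agent(prompt: str) -> str:
    return "Sure, 50% off it is." if "50%" in prompt else "I can offer 10% off."


if __name__ == "__main__":
    spec = json.loads(SPEC_JSON)
    for name, agent in [("compliant", compliant_agent), ("leaky", leaky_agent)]:
        print(name, check_spec(agent, spec))
```

A finite probe list like this can only surface violations; it does not prove the rule holds for every phrasing, which is the hard part the talk points to.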

The Breakdown

Bigger models can be easier to jailbreak, more expensive to run, and harder to control, which is why Steven Willmott argues agents need explicit specs, not just eval datasets. His pitch is simple: define the task, rules, domain terms, permissions, and robustness requirements up front, then test agents against that spec independent of whatever framework or model sits underneath.
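The robustness requirements mentioned above (typo tolerance and rephrasing stability) can be checked in the same framework-agnostic way. The sketch below is an assumption-laden illustration: `add_typo`, `is_stable`, and the two stub agents are hypothetical names, and the agent is simplified to a function that returns an intent label rather than free text.

```python
import random
from typing import Callable, Iterable

# Simplification (assumption): the agent maps a prompt to a short intent label.
Agent = Callable[[str], str]


def add_typo(text: str, rng: random.Random) -> str:
    """Swap two adjacent characters at a random position to simulate a typo."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]


def is_stable(agent: Agent, prompt: str, variants: Iterable[str]) -> bool:
    """True if every variant yields the same label as the original prompt."""
    expected = agent(prompt)
    return all(agent(variant) == expected for variant in variants)


# Stub agents (hypothetical): one keys on a single word, one handles a rephrasing.
def brittle_agent(prompt: str) -> str:
    return "REFUND_POLICY" if "refund" in prompt.lower() else "FALLBACK"


def robust_agent(prompt: str) -> str:
    p = prompt.lower()
    return "REFUND_POLICY" if ("refund" in p or "money back" in p) else "FALLBACK"


if __name__ == "__main__":
    base = "How do I get a refund?"
    rephrasings = ["Can I get my money back?"]
    rng = random.Random(0)  # seeded so the typo variants are reproducible
    typos = [add_typo(base, rng) for _ in range(5)]
    for name, agent in [("brittle", brittle_agent), ("robust", robust_agent)]:
        print(name, "rephrasing-stable:", is_stable(agent, base, rephrasings))
        print(name, "typo-stable:", is_stable(agent, base, typos))
```

Keeping the perturbations and the stability check separate from the agent means the same requirement can be re-run unchanged after a model or framework swap.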
