Spec-Driven Testing for Agents With A Brain the Size of A Planet — Steven Willmott, SafeIntelligence
TL;DR
Bigger is not automatically better or safer: Willmott says large models can actually be more vulnerable because they are smart enough to interpret obfuscated attacks, like harmful instructions hidden inside a poem that a smaller model might not even understand.
A dataset is only one slice of the spec: Beyond input-output examples, agent validation should include business rules, ontologies, internal terminology, domain knowledge, user roles, and robustness requirements such as typo tolerance and rephrasing stability.
Rules are where testing gets hard fast: A support agent rule like "never give a discount above 10%" or "no refunds after 30 days" sounds clear, but proving the rule is never violated across all phrasing and edge cases is the real challenge.
Spec information improves both security and robustness testing: Safe Intelligence pulls task specs into security checks because an agent is most exposed exactly where it is designed to act, such as banking workflows or customer support actions tied to real systems.
Keep tests independent of the implementation: Willmott argues your integration tests, unit tests, and penetration tests should survive a switch from LangSmith to Vertex or another stack, so the behavior contract stays stable while the plumbing changes.
He wants an OpenAPI-style future for agent specs: Drawing on his API infrastructure background, he suggests agent behavior should be expressed in open, versioned files that live in GitHub and can be consumed by any evaluation or testing tool.
The Breakdown
Bigger models can be easier to jailbreak, more expensive to run, and harder to control, which is why Steven Willmott argues agents need explicit specs, not just eval datasets. His pitch is simple: define the task, rules, domain terms, permissions, and robustness requirements up front, then test agents against that spec independent of whatever framework or model sits underneath.
Was This Useful?
Share
Keep Reading
Make Alcreon Yours
Tune your feedFive quick questions, and the feed ranks what matters to you first.Or just get notified
The weekly Echo. Signal worth keeping in your inbox.
Every new piece, announced on X.
Read Next
See all
Playbook
Tasteful Skills
“Tasteful Skills” argues that the best agent skills are not documentation or best-practice lists.

Playbook
The Art of Tasteful Prompting
Learn how tasteful prompting helps you move beyond generic AI output by shaping context, style, and judgment from the start.

Playbook
The Codex /goal Playbook
OpenAI shipped /goal for the Codex CLI. It turns a prompt into a persisted, self-continuing contract.