Tasteful Skills

Most agent skills fail because they read like documentation. A great agent skill, however, is a behavioral contract.
A good skill can still read like documentation. It explains the domain, lists best practices, and gives the agent a tidy page of advice. Then the agent gets a messy task, thin context, competing goals, and a time budget. Under pressure, it does what agents often do: it skips the step that feels obvious, cheap, or slow.
A great skill fixes that.
A great skill is a behavioral interrupt. It makes the agent do the one thing it would otherwise skip. It loads at the right moment, forces the right loop, produces evidence, and stops before it turns into polish theater.
Good skills help an agent remember what to do. Great skills change what the agent does when the task gets awkward.
Agents don’t fail only from ignorance. They fail from bad defaults. They rush. They infer too much. They optimize before measuring. They refactor before proving behavior. They call subjective work “done” because it looks plausible. They treat a spec as inspiration instead of a contract.
A skill earns its place when it changes those defaults.
Skills Are Behavioral Contracts
A weak skill says: “Here are best practices.” A strong skill says: “When this situation appears, do this loop, produce this evidence, and stop only when this condition is met.” That’s the test. A skill should change behavior, not just improve wording. Ask one question before writing anything: What would the agent skip if it were rushed?
For performance work, it skips profiling. For refactoring, it skips proving behavior stayed the same. For UI polish, it skips preserving screenshots. For conformance work, it skips mapping every requirement to evidence.
The skipped step is the center of the skill.
A performance skill should not begin with “optimize carefully.” It should begin with the rule that blocks guesswork: “Profile first. Prove behavior unchanged. One change at a time.”
A refactoring skill should not say “clean up the code.” It should block the classic failure mode: “Prove behavior identical, then remove lines. No proof, no delete.”
The Trigger And The Loop
The description is not decoration. It’s the selector. That means trigger logic belongs in the description, not buried in the body.
Bad trigger: “Helps with code quality.”
Better trigger: “Use when the user asks to refactor, simplify, deduplicate, modularize, or reduce code while preserving behavior.”
Best trigger: “Use when the user asks to refactor, simplify, deduplicate, modularize, or reduce code while preserving behavior. Do not use it for feature changes, bug fixes, or exploratory architecture advice unless the user explicitly asks to preserve current behavior.”
That last version does 3 jobs. It names the task. It mirrors the user’s likely language. It adds exclusions. Together, these keep agents from missing the skill when it matters most, and from loading it for nearby tasks where it only adds friction.
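One way to keep descriptions honest is a small lint that runs before a skill ships. The sketch below is a rough heuristic, not how any agent actually selects skills: the function name `lint_description`, the phrases it searches for, and the word-count cutoff are all assumptions made for illustration.

```python
from typing import List


def lint_description(description: str) -> List[str]:
    """Return a list of problems found in a skill description.

    Heuristic only: the phrases and the length cutoff are illustrative
    assumptions, not rules from any skill format.
    """
    text = description.lower()
    problems = []
    # A selector should say when to load the skill.
    if "use when" not in text:
        problems.append("no explicit trigger")
    # Exclusions keep the skill from loading on nearby tasks.
    if "do not use" not in text:
        problems.append("no exclusions")
    # Very short descriptions rarely mirror how users phrase requests.
    if len(text.split()) < 8:
        problems.append("too short to mirror user phrasing")
    return problems


if __name__ == "__main__":
    print(lint_description("Helps with code quality."))
```

Run against the three triggers above, a lint like this would flag the bad one on every check, flag the better one only for missing exclusions, and pass the best one.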
A great skill needs one memorable rule. That rule is the kernel. It should be short enough for the agent to carry through the task, and specific enough to block the dominant failure mode.
“Profile first. Prove behavior unchanged. One change at a time.”
“Prove behavior identical, then remove lines. No proof, no delete.”
“One surface, one approved aspiration, one visual lever per pass.”
The kernel protects the work. The leverage comes from running it in a loop.
A performance optimization skill should force this loop:
- Baseline current performance.
- Profile the bottleneck.
- Prove current behavior.
- Choose 1 optimization lever.
- Apply the change.
- Verify behavior stayed the same.
- Measure again.
- Repeat only if the next opportunity clears the threshold.
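The measurable parts of that loop can be sketched in a few lines. This is a minimal illustration under stated assumptions: `optimization_pass`, `wall_clock`, and the 10% `min_speedup` threshold are hypothetical names and values, equality of outputs on a fixed set of cases stands in for "prove behavior," and the profiling step (for example with Python's `cProfile`) is left out.

```python
import time
from typing import Any, Callable, Dict, Sequence


def wall_clock(fn: Callable[..., Any], cases: Sequence[tuple], repeats: int = 5) -> float:
    """Best-of-`repeats` total seconds for running fn over every case."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for args in cases:
            fn(*args)
        best = min(best, time.perf_counter() - start)
    return best


def optimization_pass(
    current: Callable[..., Any],
    candidate: Callable[..., Any],
    cases: Sequence[tuple],
    measure: Callable[[Callable[..., Any], Sequence[tuple]], float] = wall_clock,
    min_speedup: float = 1.10,  # assumed threshold: at least 10% faster
) -> Dict[str, Any]:
    """Evaluate one candidate change and return an inspectable verdict."""
    # Baseline current performance.
    baseline = measure(current, cases)

    # Verify behavior stayed the same before looking at speed.
    for args in cases:
        if candidate(*args) != current(*args):
            return {"accepted": False, "reason": "behavior changed", "failing_case": args}

    # Measure again, then accept only if the gain clears the threshold.
    after = measure(candidate, cases)
    speedup = baseline / after if after > 0 else float("inf")
    if speedup < min_speedup:
        return {"accepted": False, "reason": "below threshold", "speedup": speedup}
    return {"accepted": True, "reason": "verified and faster", "speedup": speedup}
```

The returned dictionary doubles as the evidence: it records whether behavior held and by how much the measurement moved. Passing `measure` as a parameter keeps the sketch testable with a fake timer.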
A refactoring skill should force a different loop: prove current behavior, remove or simplify one thing, verify, record the change, repeat only while the next change has clear value.
A UI skill needs another loop: capture the current surface, name the aspiration, score the gap, change one visual lever, capture the new surface, compare, stop at plateau.
Proof Or It Didn’t Happen
Agents are good at sounding done. They’ll say the code is cleaner, the UI is better, the output matches the spec, and the optimization worked. Sometimes they’re right. Either way, the claim needs proof.
Great skills force proof artifacts. The artifact is not paperwork. It makes success inspectable.
Use a LOC ledger when the skill removes code. Use an isomorphism card when behavior must stay unchanged. Use golden diffs when outputs must not regress. Use screenshot pairs when visual quality matters. Use conformance matrices when a spec contains MUST and SHOULD clauses. Use JSON verdicts when the skill makes a narrow judgment. Use provenance records when fixtures, assets, or generated files could become mystery debris later.
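To make one of these artifacts concrete, here is a rough sketch of a conformance matrix. The `Requirement` fields, the helper names, and the plain-text row format are assumptions for illustration; a real skill would pick whatever shape its spec and reviewers need.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Requirement:
    req_id: str         # hypothetical identifier, e.g. "R1"
    level: str          # "MUST" or "SHOULD", as written in the spec
    text: str           # the clause itself
    evidence: str = ""  # test name, diff, or log; empty means unproven


def unproven(matrix: List[Requirement], level: str = "MUST") -> List[str]:
    """IDs of requirements at `level` that have no evidence attached."""
    return [r.req_id for r in matrix if r.level == level and not r.evidence.strip()]


def render(matrix: List[Requirement]) -> str:
    """One row per requirement so a reviewer can scan for gaps."""
    rows = []
    for r in matrix:
        status = "proven" if r.evidence.strip() else "UNPROVEN"
        rows.append(f"{r.req_id} | {r.level} | {status} | {r.evidence or '-'}")
    return "\n".join(rows)
```

A skill could then refuse to report success while `unproven(matrix)` is non-empty, which turns "matches the spec" from a claim into a checkable list.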
The artifact should show 4 things: what changed, why it changed, what evidence supports the claim, and what remains uncertain. This is where oracle choice matters. Pick the right source of truth.
A golden oracle compares against known-good output. A reference oracle compares against a trusted implementation or spec. A metamorphic oracle checks relationships between outputs when the exact answer is unknowable. A human oracle handles taste, judgment, or product intent.
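The metamorphic oracle is the least familiar of the four, so a small sketch may help. It assumes one particular relation, that the result should not depend on input order, and checks it with a reversal and every rotation of the input. The function name `holds_under_reordering` is made up for this example, and other tasks would need other relations.

```python
from typing import Any, Callable, Sequence


def holds_under_reordering(fn: Callable[[list], Any], items: Sequence[Any]) -> bool:
    """Check that fn gives the same result when its input is reordered.

    No expected answer is needed: the oracle compares outputs to each other.
    """
    items = list(items)
    expected = fn(list(items))
    # Deterministic reorderings: the reversal plus every rotation.
    variants = [items[::-1]] + [items[i:] + items[:i] for i in range(1, len(items))]
    return all(fn(list(variant)) == expected for variant in variants)


if __name__ == "__main__":
    print(holds_under_reordering(sorted, [3, 1, 2]))
    print(holds_under_reordering(lambda xs: xs[0], [3, 1, 2]))
```

A passing check does not prove the function is correct. It only shows the function respects one relation, which is often the best available evidence when no golden output exists.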
The 10-Minute Upgrade
- Sharpen The Trigger: Add exact user phrases, task contexts, and exclusions.
- Distill The Kernel: Compress the method into one rule that blocks the main failure.
- Gate The Loop: Make steps mandatory only where skipping them creates risk.
- Name The Proof: Require the ledger, matrix, screenshot, diff, verdict, or report.
- Add A Stop Rule: Define how the work ends.
An “Excellent” stop means the output clears the defined bar. A “Plateau” stop means repeated passes stop improving the score. A “Cap” stop means the task hits a pass, time, cost, or scope limit. A “Halt” stop means the work is blocked, unsafe, or underspecified. A “Handover” stop means the next decision needs human judgment.
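The five stop conditions above can be sketched as a single check that runs after each pass. Everything specific here is an assumption: the function name `stop_reason`, the order in which conditions are tested, the `patience` window used to detect a plateau, and the choice to model only a pass cap rather than time, cost, or scope limits.

```python
from typing import Optional, Sequence


def stop_reason(
    scores: Sequence[float],   # one score per completed pass, higher is better
    bar: float,                # score that counts as "Excellent"
    max_passes: int,           # pass cap
    patience: int = 2,         # assumed: passes without improvement before "Plateau"
    blocked: bool = False,     # work is blocked, unsafe, or underspecified
    needs_human: bool = False, # next decision needs human judgment
) -> Optional[str]:
    """Return the name of the stop that applies, or None to keep going."""
    if blocked:
        return "Halt"
    if needs_human:
        return "Handover"
    if scores and scores[-1] >= bar:
        return "Excellent"
    # Plateau: none of the last `patience` passes beat the best score before them.
    if len(scores) > patience:
        best_before = max(scores[:-patience])
        if all(score <= best_before for score in scores[-patience:]):
            return "Plateau"
    if len(scores) >= max_passes:
        return "Cap"
    return None
```

Returning a named reason instead of a bare boolean means the final report can say why the work ended, not only that it did.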