AI Engineer | June 2, 2026 | 20m

Task Fidelity Scaling Laws — Kobie Crawford, Snorkel

TL;DR

High-quality tasks delivered a 5x bigger training gain: Using the same model, compute budget, and number of tasks, Snorkel saw about 6% improvement from accepted tasks versus about 1% from rejected tasks.
Task quality was defined with four concrete checks: Snorkel accepted tasks only if they were achievable, non-trivial, functionally correct, and run in a reliable environment.
Accepted tasks looked harder in the right way: Compared with rejected tasks, they averaged twice as many tool calls, had lower pass rates, and required more output tokens, suggesting more genuine reasoning and multi-step work.
The key signal was failure quality, not just failure quantity: Crawford says accepted tasks generated cleaner failures like logic errors or incomplete solutions, while rejected tasks were more likely to fail for environmental or degenerate reasons that teach the model little.
Underspecified tasks create fake difficulty: A common rejection pattern was a prompt that never clearly asked for something the backend tests still expected, or a task that depended on hidden dependencies missing from the model's context.
Snorkel's broader bet is expert-guided data plus rubrics: The company combines human experts, rubric-based review, and LLM judges, aiming for high inter-annotator agreement between humans and models even in fuzzier domains beyond coding and math.
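
The four acceptance checks in the TL;DR can be read as a simple gate: a task is accepted only if every check passes. The following is a minimal Python sketch of that idea, not Snorkel's actual pipeline. The `TaskReview` record, its field names, and both helper functions are hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass


@dataclass
class TaskReview:
    """Hypothetical review record; the field names are illustrative assumptions."""
    task_id: str
    achievable: bool            # a correct solution exists and can be reached
    non_trivial: bool           # the task is not solved by a degenerate shortcut
    functionally_correct: bool  # the tests check what the prompt actually asks for
    reliable_environment: bool  # the environment does not fail for unrelated reasons


def failed_checks(review: TaskReview) -> list[str]:
    """Return the names of the checks this task did not pass."""
    checks = {
        "achievable": review.achievable,
        "non_trivial": review.non_trivial,
        "functionally_correct": review.functionally_correct,
        "reliable_environment": review.reliable_environment,
    }
    return [name for name, passed in checks.items() if not passed]


def is_accepted(review: TaskReview) -> bool:
    """Accept a task only when all four checks pass."""
    return not failed_checks(review)
```

Returning the list of failed checks, rather than a bare boolean, keeps the rejection reason visible, which matters when a reviewer wants to see how often tasks fail for underspecification versus an unreliable environment.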

The Breakdown

Snorkel says task quality is not a vague ideal but a measurable training variable: in its agentic terminal-task experiments, high-quality tasks produced about a 6% RL improvement versus roughly 1% from low-quality tasks, a 5x gap under the same compute and task count. Kobie Crawford argues the real tell is not just pass rate, but whether failures are "clean" and informative instead of noisy artifacts from broken, underspecified, or impossible tasks.
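
One way to make the "clean failure" idea concrete is to label each failed rollout with a cause and measure what share of failures are informative. The sketch below assumes such labels already exist. The label strings and the `clean_failure_share` function are hypothetical, chosen to mirror the categories named in this summary (logic errors and incomplete solutions as clean; environmental and degenerate failures as noisy), and do not describe Snorkel's tooling.

```python
from collections import Counter

# Hypothetical label sets, mirroring the failure categories named in the summary.
CLEAN_FAILURES = {"logic_error", "incomplete_solution"}
NOISY_FAILURES = {"environment_error", "degenerate_output"}


def clean_failure_share(failure_labels):
    """Fraction of labeled failures that are clean (informative).

    Labels outside both sets are ignored. Returns None when there are
    no recognized failures, so callers do not divide by zero.
    """
    counts = Counter(failure_labels)
    clean = sum(counts[label] for label in CLEAN_FAILURES)
    noisy = sum(counts[label] for label in NOISY_FAILURES)
    total = clean + noisy
    if total == 0:
        return None
    return clean / total
```

Under this framing, two task sets with the same pass rate can still differ sharply: one where most failures are logic errors gives the model something to learn from, while one dominated by environment errors mostly adds noise.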