Stop Making Models Bigger, Make Them Behave — Kobie Crawford, Snorkel
TL;DR
A tiny model beat a giant one on the target task: Snorkel and UC Berkeley's RLLM team trained a 4B model to outperform a 235B Qwen3 model on financial analysis tool use, with pass@1 roughly doubling after RL.
The failure mode was not reasoning, it was behavior: The 235B model guessed SQL against a non-existent table, failed twice, then hallucinated an answer, while the fine-tuned 4B model first called get_table_names, inspected schema, and corrected its own query error.
The training run was cheap enough to matter: The RL job used GRPO, ran in about 21 hours, and cost under $500 per run, which Crawford framed as proof that behavior tuning can be practical for teams trying to ship smaller on-prem models.
Single-table training generalized better than expected: Training only on single-table questions produced the best uplift, yet still improved the harder multi-table FinQA Reasoning benchmark from 13.9 to 26.6.
High-quality expert data was the core bet: Snorkel built the dataset with domain experts in the loop, verified tasks and answers, and argued that carefully chosen data is what lets RL target the exact behavior a production system is missing.
Rubrics help find the real bug before RL starts: Crawford said richer eval rubrics can break a model's failures into specific behaviors, so teams can identify whether they need more knowledge, better tool use, or another targeted fix instead of just swapping in a bigger model.
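The rubric idea in the last point can be sketched as a handful of yes/no checks run over each agent trace, then tallied to show which behavior fails most often. This is a minimal illustration, not Snorkel's rubric: the `Trace` fields, the three criteria, and the tool names `run_sql` and `get_schema` are assumptions made for the example. Only `get_table_names` comes from the talk.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Trace:
    """Hypothetical summary of one agent run (field names are assumptions)."""
    tool_calls: list[str]   # tool names in the order they were called
    query_errors: int       # how many SQL attempts failed
    final_query_ok: bool    # whether the last SQL attempt succeeded
    answered: bool          # whether the model produced a final answer


def listed_tables_first(t: Trace) -> bool:
    # Pass if the model looked up table names before its first SQL call.
    if "run_sql" not in t.tool_calls:
        return True
    first_sql = t.tool_calls.index("run_sql")
    return "get_table_names" in t.tool_calls[:first_sql]


def recovered_from_errors(t: Trace) -> bool:
    # Pass if there were no errors, or the model ended on a working query.
    return t.query_errors == 0 or t.final_query_ok


def grounded_answer(t: Trace) -> bool:
    # Pass unless the model answered without a successful query behind it.
    return (not t.answered) or t.final_query_ok


RUBRIC = {
    "listed_tables_first": listed_tables_first,
    "recovered_from_errors": recovered_from_errors,
    "grounded_answer": grounded_answer,
}


def score(trace: Trace) -> dict[str, bool]:
    """Apply every rubric check to one trace."""
    return {name: check(trace) for name, check in RUBRIC.items()}


def failure_counts(traces: list[Trace]) -> Counter:
    """Count how often each criterion fails across a set of traces."""
    return Counter(
        name for t in traces for name, ok in score(t).items() if not ok
    )
```

A trace shaped like the 235B failure described above (two failed `run_sql` calls, then an answer) fails all three checks, which points at tool use rather than missing knowledge as the thing to train.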
The Breakdown
A 4 billion parameter model beat a 235 billion parameter model on financial tool use after a 21-hour RL run that cost under $500, because the real problem was not reasoning depth. It was tool discipline: learning to inspect tables, read schemas, recover from errors, and stop hallucinating.
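The tool discipline described here can be illustrated with a small loop over an in-memory SQLite database: list the tables, read the schema, run a query, and feed any SQL error back for a corrected attempt instead of inventing an answer. This is a sketch under stated assumptions, not the setup used in the talk. The `income_statement` table, `get_columns`, `query_with_discipline`, and `toy_policy` (a stand-in for the model) are all hypothetical; only the `get_table_names` tool name appears in the source.

```python
import sqlite3


def build_demo_db():
    """Create a tiny in-memory database (made-up table and figures)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE income_statement (year INTEGER, revenue REAL)")
    conn.executemany(
        "INSERT INTO income_statement VALUES (?, ?)",
        [(2022, 100.0), (2023, 125.0)],
    )
    return conn


def get_table_names(conn):
    """List the tables that actually exist."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    return [row[0] for row in rows]


def get_columns(conn, table):
    """Return column names for a table that is known to exist."""
    # PRAGMA takes no bound parameters, so only accept listed table names.
    if table not in get_table_names(conn):
        raise ValueError(f"unknown table: {table}")
    return [row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')]


def query_with_discipline(conn, make_sql, max_attempts=2):
    """Inspect the schema first, then retry with the error message on failure."""
    schema = {t: get_columns(conn, t) for t in get_table_names(conn)}
    error = None
    for _ in range(max_attempts):
        sql = make_sql(schema, error)
        try:
            return conn.execute(sql).fetchall()
        except sqlite3.Error as exc:
            error = str(exc)  # hand the error back for a corrected query
    return None  # explicit "no answer" rather than a fabricated value


def toy_policy(schema, error):
    """Stand-in for the model: guesses a wrong column, then uses the schema."""
    table = sorted(schema)[0]
    column = "sales" if error is None else schema[table][1]
    return f"SELECT {column} FROM {table} WHERE year = 2023"
```

Calling `query_with_discipline(build_demo_db(), toy_policy)` first hits a "no such column" error on `sales`, then succeeds using the `revenue` column read from the schema. Returning `None` after repeated failures makes the give-up case visible to the caller, which is the opposite of the hallucinated answer described above.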