Frontier results, on device - RL Nabors, Arize
TL;DR
Small language models (SLMs) consume about 25% of the energy that LLMs require for equivalent tasks, and task-specific models use half of that again, making local inference dramatically more efficient.
The prototype big, deploy small framework means proving feasibility with the most capable model first, then systematically testing smaller models until you find one that meets your criteria.
Llama 3.2 3B beat Gemma 4 and Qwen in the speaker's evaluation for social thread summarization, reaching 90% accuracy when scored against a Claude Sonnet baseline.
Few-shot prompting closed the accuracy gap between Llama and Claude from 10% to near-parity, while explicit negative constraints actually made performance worse.
Post-processing can fix structural issues like JSON validity and reference accuracy, meaning you don't need perfect model outputs to ship production-ready features.
Regression evals prevent CTO-induced disasters, catching when prompt or model changes break your agentic workflows before users notice.
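The post-processing point above can be sketched in a few lines of Python. This is a minimal illustration, not the speaker's code: the output shape (a JSON object with `summary` and `references` keys), the function names `extract_json` and `clean_summary`, and the idea of checking references against a set of known post IDs are all assumptions made for the example.

```python
from __future__ import annotations

import json
import re


def extract_json(raw: str) -> dict | None:
    """Return the first JSON object found in raw model output, or None."""
    # Small models often wrap JSON in prose or code fences, so take
    # everything from the first "{" to the last "}" and try to parse it.
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        return None
    try:
        parsed = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None


def clean_summary(raw: str, valid_post_ids: set[str]) -> dict | None:
    """Repair structure and drop references that are not in the thread.

    The "summary"/"references" schema is hypothetical, chosen for this sketch.
    """
    data = extract_json(raw)
    if data is None or not isinstance(data.get("summary"), str):
        return None  # Caller can retry or fall back to another model.
    refs = data.get("references", [])
    if not isinstance(refs, list):
        refs = []
    # Keep only references that point at posts that actually exist.
    data["references"] = [
        r for r in refs if isinstance(r, str) and r in valid_post_ids
    ]
    return data


raw_output = (
    "Sure! Here is the summary:\n"
    '{"summary": "Two users debate caching.", "references": ["p1", "p9"]}'
)
cleaned = clean_summary(raw_output, valid_post_ids={"p1", "p2"})
print(cleaned)
```

Returning `None` on unrecoverable output, rather than raising, leaves the retry or fallback decision to the caller, which is one reasonable design among several.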
The Breakdown
RL Nabors from Arize demonstrates how to replace expensive frontier model API calls with local small language models through a systematic evaluation process, showing that Llama 3.2 3B can match Claude Sonnet's performance on summarization tasks while eliminating API costs entirely. The talk introduces the SAGE model approach, selecting the smallest model that delivers acceptable results, and walks through a real case study using Phoenix to evaluate and compare models.
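The idea of selecting the smallest model that delivers acceptable results can be sketched as a simple loop. This is a hedged illustration under stated assumptions: the `Candidate` type, the `pick_smallest_passing` function, the 0.9 acceptance ratio, and the placeholder model names and scores are all hypothetical, and the `evaluate` callback stands in for whatever eval harness you use (the talk uses Phoenix; its API is not shown here).

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    """A model under test; both fields are illustrative."""
    name: str
    params_billions: float


def pick_smallest_passing(
    candidates: Sequence[Candidate],
    evaluate: Callable[[Candidate], float],
    baseline_score: float,
    min_ratio: float = 0.9,  # Assumed acceptance bar, relative to the baseline.
) -> Optional[Candidate]:
    """Return the smallest candidate scoring at least min_ratio * baseline."""
    threshold = min_ratio * baseline_score
    # Try the smallest models first and stop at the first one that passes.
    for candidate in sorted(candidates, key=lambda c: c.params_billions):
        if evaluate(candidate) >= threshold:
            return candidate
    return None  # Nothing passed: keep the large model or revise the prompt.


# Placeholder scores, invented for this example only.
fake_scores = {"tiny-1b": 0.70, "small-3b": 0.91, "mid-8b": 0.95}
models = [
    Candidate("mid-8b", 8.0),
    Candidate("tiny-1b", 1.0),
    Candidate("small-3b", 3.0),
]

choice = pick_smallest_passing(
    models,
    evaluate=lambda c: fake_scores[c.name],
    baseline_score=1.0,  # Score of the large "prototype" model on the same eval.
)
print(choice.name if choice else "no candidate passed")
```

Re-running the same function after any prompt or model change doubles as a basic regression check, since a candidate that previously passed will stop being returned if its score drops below the threshold.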