Everything I Learned Training Frontier Small Models — Maxime Labonne, Liquid AI
TL;DR
Small models aren’t just shrunken big models — Maxime Labonne’s main point is that edge models face different constraints entirely: memory limits, lower knowledge capacity, and brutal latency requirements on phones, cars, and other on-device hardware.
Architecture choices matter more when every parameter counts — He highlights how Gemma 3 270M spends 63% of its parameters on embeddings and Qwen 3.5 0.8B spends 29%, while Liquid’s LFM2 keeps embeddings far smaller so more of the memory budget goes to “effective” reasoning parameters.
Liquid optimized LFM2 by profiling real hardware, not just theory — On-device testing on an AMD Ryzen AI Max+ 395 and a Samsung Galaxy S25 Ultra pushed them toward gated short convolutions, which he says beat sliding-window attention, GLA, and GQA on speed, memory, and throughput.
Overtraining tiny models can still pay off — Liquid pretrained a 350M-parameter model on 28 trillion tokens, far beyond classic Chinchilla expectations, and Labonne says newer scaling-law work from Roberts suggests they still haven’t used enough data to hit the true optimum.
Post-training for small models should be narrow and deliberate — Instead of chasing average performance everywhere, Liquid targeted data extraction and tool use, and Labonne says small models improve most when SFT, preference alignment, and RL are tightly matched to specific tasks and environments.
‘Doom looping’ is a real small-model failure mode, but it’s tractable — Their 1.2B reasoning model started around a 15–16% doom-loop ratio after mid-training, barely improved with SFT, dropped sharply with DPO, and became almost clean after RL with verifiable rewards and repetition penalties; he contrasts that with Qwen 3.5 0.8B reasoning mode, which he says can exceed 50% doom loops.
The Breakdown
Why edge models are a different species
Maxime Labonne opens with Liquid AI’s core focus: edge models from 350M to 24B parameters, including a freshly released 450M VLM and an updated 350M text model. His thesis is blunt: small models are memory-bound, task-specific, and latency-sensitive, so treating them like mini ChatGPTs misses the point.
The embedding-layer tax in tiny models
He compares Gemma 3 270M and Qwen 3.5 0.8B and zeroes in on something easy to overlook: embeddings eat a shocking amount of the parameter budget. In Gemma 3 270M, embeddings are 63% of total parameters; in Qwen 3.5 0.8B, they’re still 29%, which he argues is wasteful because those parameters aren’t doing the real reasoning work.
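The embedding tax is easy to verify with back-of-envelope arithmetic. The figures below are approximate public specs assumed for illustration (Gemma 3 270M pairs a ~262K-token vocabulary with a 640-dim hidden size), not numbers from the talk:

```python
# Back-of-envelope: what fraction of a tiny model's parameters sit in the
# token embedding table? Vocab size and hidden dim are approximate public
# specs assumed here for illustration.
def embedding_fraction(vocab_size: int, d_model: int, total_params: float) -> float:
    """Share of the parameter budget spent on the (tied) embedding matrix."""
    return vocab_size * d_model / total_params

gemma = embedding_fraction(262_144, 640, 270e6)
print(f"Gemma 3 270M embeddings: ~{gemma:.0%} of all parameters")  # ~62%
```

A big vocabulary is nearly free in a 70B model but dominates at 270M, which is why Labonne frames non-embedding parameters as the "effective" budget.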
How Liquid designed LFM2 around actual devices
Liquid also uses a hybrid architecture, but the interesting bit is how they got there: profiling directly on target hardware instead of relying on paper logic. That process led them to gated short convolutions, which Labonne says are dramatically faster than alternatives like sliding-window attention, gated linear attention, and GQA, especially for the edge setting where latency is everything.
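To make the operator concrete, here is a minimal NumPy sketch of the general family — a gated short causal depthwise convolution. This is an illustrative toy, not Liquid's exact LFM2 block; projection names and the kernel size are assumptions:

```python
import numpy as np

def gated_short_conv(x, w_in, w_gate, w_conv, w_out):
    """Toy gated short causal convolution (illustrative, not LFM2's exact operator).

    x: (seq_len, d_model) activations.
    w_conv: (kernel, d_model) depthwise filter with a short kernel (here 3).
    """
    seq_len, d = x.shape
    k = w_conv.shape[0]
    u = x @ w_in      # input projection
    g = x @ w_gate    # gate projection
    # Causal depthwise conv: each position mixes only the last k positions,
    # so the recurrent "state" is tiny and fixed-size -- the property that
    # keeps prefill fast and memory flat on edge hardware, unlike attention's
    # growing KV cache.
    padded = np.vstack([np.zeros((k - 1, d)), u])
    conv = np.zeros_like(u)
    for i in range(k):
        conv += padded[i : i + seq_len] * w_conv[i]
    # Sigmoid gate modulates the conv output, then project back out.
    return (conv * (1 / (1 + np.exp(-g)))) @ w_out

rng = np.random.default_rng(0)
d = 8
x = rng.standard_normal((16, d))
y = gated_short_conv(x, rng.standard_normal((d, d)), rng.standard_normal((d, d)),
                     rng.standard_normal((3, d)), rng.standard_normal((d, d)))
print(y.shape)  # (16, 8)
```

The design trade-off: a 3-wide kernel can't model long-range dependencies on its own, which is why such blocks are interleaved with attention layers in hybrid stacks.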
Real-world speed wins on CPU and GPU
He backs the architecture story with profiling on an AMD Ryzen AI Max+ 395 and a Samsung Galaxy S25 Ultra. The rough picture, he says, is that LFM2’s short-conv design is both faster and lighter on memory, and the gains carry over to GPU too, where throughput stays strong even at high concurrency.
Training tiny models way past old scaling-law intuition
Then he gets to the training recipe: pretraining on 28 trillion tokens, followed by SFT, preference alignment, and RL. For a 350M model, that sounds wild if you grew up on Chinchilla, but Labonne says performance keeps climbing anyway, and cites a recent Roberts paper on test-time scaling laws to argue they probably still undertrained relative to newer optima.
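The scale of that overtraining is easy to quantify with the classic rule of thumb — roughly 20 tokens per parameter for a Chinchilla-optimal run (the exact coefficient varies by fit):

```python
# Rough Chinchilla-style comparison using the ~20-tokens-per-parameter
# rule of thumb; the exact coefficient depends on the scaling-law fit.
params = 350e6
chinchilla_tokens = 20 * params  # ~7e9 tokens: the "compute-optimal" budget
actual_tokens = 28e12            # what Liquid reportedly trained on

print(f"Chinchilla-optimal budget: ~{chinchilla_tokens:.1e} tokens")
print(f"Overtraining factor: {actual_tokens / chinchilla_tokens:,.0f}x")  # 4,000x
```

A 4,000x overshoot looks absurd under compute-optimal logic, but that logic minimizes training compute, not inference cost — and for an edge model that will run billions of times on-device, spending extra training compute to squeeze more quality into 350M parameters is exactly the right trade.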
Narrow beats average in post-training
The benchmark story reflects that philosophy. Liquid’s newer 350M model improves meaningfully on GPQA Diamond, IFEval/IFBench-style instruction following, case-report extraction, and tool-use benchmarks like BFCL and Tau2Bench, because they intentionally optimized for data extraction and tool use instead of trying to be decent at everything from coding to math.
Doom looping: the tiny reasoning-model nightmare
One of the most vivid parts of the talk is “doom looping,” where the model repeats the same phrase forever. Labonne calls it especially common when three ingredients combine: a small model, reasoning mode, and a task that’s too hard — the perfect setup for the generation to spiral.
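A doom-loop ratio like the one quoted in the talk can be estimated with a simple heuristic. This is an illustrative detector, not Liquid's actual metric — flagging any generation where some word n-gram repeats more than a threshold number of times:

```python
from collections import Counter

def doom_loop_ratio(texts, ngram=8, threshold=4):
    """Crude doom-loop metric (illustrative heuristic, not Liquid's actual one):
    flag a generation if any word n-gram repeats more than `threshold` times,
    then report the flagged fraction."""
    def is_looping(text):
        words = text.split()
        grams = Counter(tuple(words[i : i + ngram])
                        for i in range(len(words) - ngram + 1))
        return bool(grams) and max(grams.values()) > threshold

    return sum(is_looping(t) for t in texts) / len(texts)

outputs = ["the answer is 4 because 2 plus 2 equals 4",
           "wait let me check " * 20]  # the classic repeating spiral
print(doom_loop_ratio(outputs, ngram=4, threshold=3))  # → 0.5
```

Running a detector like this over a batch of rollouts is also how you'd track whether the ratio actually drops across SFT, DPO, and RL stages.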
How they trained doom loops out of the model
Liquid attacks the problem first in preference alignment: they generate multiple rollouts per prompt and use an LLM jury to pick the best answer and reject the worst — often the looping one. RL with verifiable rewards plus a light n-gram repetition penalty then finishes the job. Their 1.2B reasoning model went from roughly 15–16% doom loops after mid-training to almost none after RL, which he contrasts with Qwen 3.5 0.8B in reasoning mode, where he says people report doom-loop rates above 50%.
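The RL side of that recipe can be sketched as reward shaping: a verifiable binary reward minus a small penalty per repeated n-gram, so looping generations score strictly below clean ones. The exact shaping used by Liquid isn't specified in the talk; this is a minimal sketch under assumed constants:

```python
from collections import Counter

def shaped_reward(answer_correct: bool, text: str, ngram=6, penalty=0.1):
    """Sketch of an RLVR-style reward with an n-gram repetition penalty
    (illustrative; the talk doesn't give Liquid's exact shaping or constants).
    Verifiable reward (did the checker pass?) minus a per-repetition penalty,
    so a looping rollout is ranked below a clean one even when both are correct."""
    words = text.split()
    grams = Counter(tuple(words[i : i + ngram])
                    for i in range(len(words) - ngram + 1))
    repeats = sum(count - 1 for count in grams.values() if count > 1)
    return (1.0 if answer_correct else 0.0) - penalty * repeats

print(shaped_reward(True, "the proof follows directly from the definition"))  # 1.0
print(shaped_reward(True, "so the answer is " * 10))  # heavily penalized, well below 1.0
```

The penalty term is what lets the policy gradient actively push probability mass away from loops, rather than merely preferring non-looping samples the way DPO does.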
The bigger opportunity: tiny models with tools
Labonne closes on what small models are actually good for: agency. Since they’re knowledge-limited and weak on long context, the answer isn’t pretending otherwise — it’s giving them web search, Python, and recursive environments so they can compensate with tools, which he thinks is still underexplored compared with all the hype around giant agentic models.