AI Engineer · May 24, 2026 · 15m

Scaling the Next Paradigm of Heterogeneous Intelligence — Adrian Bertagnoli, Callosum

Summary

{ "tldr": [ "Adrian Bertagnoli says the industry is leaving “homogeneous intelligence” behind: instead of one giant model on identical chips, the next wave is mixtures of experts, multi-agent workflows, and hardware stacks that route different tasks to different models and silicon.", "Callosum’s core claim is that real-world problems are naturally heterogeneous: open-ended tasks break into subproblems requiring different kinds of reasoning, so forcing one generalist model to do everything is, in his framing, inefficient and structurally the wrong fit.", "He says they formalized this mathematically and found heterogeneous systems outperform homogeneous ones under realistic constraints. Drawing parallels across neuroscience, economics, and ecology, he argues diversity of agents plus communication beats a uniform fleet of generalists.", "In long-context reasoning, Callosum extended MIT’s recursive language model idea into a cross-model, cross-chip system. On the ULong benchmark, Bertagnoli says their setup on Cerebras was 7x cheaper and 5x faster than GPT-5.2, while SambaNova was 12x cheaper and 3x faster.", "For visual web navigation, a mixed stack of open and closed video action language models beat frontier single-model baselines: he claims their system outperformed GPT-5.2 and Gemini 2.5 on VideoWebArena by 18% and 25%, respectively, while also shifting the speed-cost frontier.", "A lot of the gain came from not wasting top-tier models on trivial subtasks. Bertagnoli’s memorable example is zooming: “you don’t need GPT to zoom for you,” and offloading those operations made those subtasks 11x faster and 43x cheaper than using ChatGPT." ], "breakdown": "### From giant uniform models to a more mixed future\n\nAdrian Bertagnoli, a founding engineer at Callosum, opens by contrasting today’s dominant paradigm, scaling single models on fleets of identical chips, with what he calls the next phase: heterogeneous intelligence. 
He frames the current moment as “mild heterogeneity” already in motion, with mixture-of-experts replacing dense models, multi-agent systems replacing single LLM calls, and prefill/decode disaggregation replacing single-chip execution.\n\n### What “full” heterogeneity looks like\n\nHe sketches a three-stage progression: first, mostly homogeneous clusters with prompt-level variation; then different models on different chips interacting; and finally a deeper co-evolution of hardware and software. The endgame, in his telling, is vertical integration where intelligence and silicon are designed together instead of being awkwardly layered on top of each other.\n\n### Why one kind of intelligence is the wrong shape for real problems\n\nBertagnoli’s central argument is simple: real-world tasks are multi-step, messy, and open-ended, so they naturally decompose into subproblems that need different kinds of intelligence. His phrase for the solution is “multi-agent heterogeneous intelligence” — models of different architectures and sizes working together over long horizons, rather than a single giant model trying to be everything at once.\n\n### The mathematical case for diversity over uniformity\n\nHe says this isn’t just a philosophical preference: Callosum formalized the benefit using a “principle of maximum heterogeneity.” Using a production-function analogy, he argues that homogeneous systems can maybe scale one peak or become broad but shallow generalists, while heterogeneous agents with different skill distributions can better match the actual demand curve; he says the same pattern shows up in neuroscience, economics, and ecology.\n\n### Recursive language models, but routed across chips and models\n\nThe first concrete example is long-context reasoning. 
Building on MIT’s recursive language model paper from last October, Bertagnoli explains “context rot” with a nice distinction: needle-in-a-haystack retrieval is an O(1) task in the context length, but row-and-column aggregation can become O(N) or worse, so performance falls apart even before the context window is full.\n\n### Their benchmark result: cheaper and faster without giving up capability\n\nThe MIT-style trick is to treat the context like an environment and let an agent explore it programmatically via a Python REPL, keyword search, and regex, recursively spawning more agents as needed. Callosum extends that by routing subcontexts to different models on different chips. On ULong, against a GPT-5.2 baseline of roughly $3.75 and 2,000 seconds per task, he says a Cerebras-backed setup was 7x cheaper and 5x faster, while SambaNova pushed cost down further, to 12x cheaper, at 3x faster.\n\n### Beating GPT-5.2 and Gemini 2.5 on visual web navigation\n\nThe second case study is visual web navigation using a mixture of open and closed video action language models. Bertagnoli says they beat the VideoWebArena state of the art, outperforming GPT-5.2 and Gemini 2.5 by 18% and 25%, because they stopped treating visual and textual reasoning as one homogeneous problem and instead assigned the subcomponents to the models best suited for them.\n\n### “You don’t need GPT to zoom for you”\n\nThat line gets at the practical philosophy of the whole talk. 
By offloading simple subtasks like zooming and visual manipulation to less capable but cheaper models, he says those subtasks alone became 11x faster and 43x cheaper than using ChatGPT, contributing to systems that were up to 3x faster and 3.7x cheaper overall — with an automation layer now deciding the best model and hardware for each task instead of relying on hardcoded routing.\n\n### The compute story ends with a hiring pitch\n\nBertagnoli closes with a broader industry frame: first CPUs made compute faster, then GPUs made it massively parallel, and the third era will make compute heterogeneous and agent-native. He mentions a £3 million ARIA grant to build what he calls the first heterogeneous colocated cluster in the UK, then lands on his final line: “This is the worst our infrastructure will ever be.”", "oneLiner": "Adrian Bertagnoli argues that the next scaling paradigm in AI is not bigger single models, but heterogeneous systems that route different subtasks to the right models and chips to get better performance at lower cost.", "tags": ["research", "industry", "commentary"] }
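
As a rough arithmetic check on the long-context figures quoted in the summary, the sketch below turns the stated multipliers into implied per-task numbers. It assumes that "Nx cheaper" and "Nx faster" mean the baseline value divided by N, and that the roughly $3.75 and 2,000 seconds describe the GPT-5.2 baseline. The summary gives no absolute figures for the Cerebras or SambaNova setups, so the outputs are inferences rather than reported results, and all names in the code are illustrative.

```python
# Baseline figures as quoted in the summary (assumed to describe GPT-5.2).
BASELINE_COST_USD = 3.75
BASELINE_SECONDS = 2000.0

# (cost multiplier, speed multiplier) as quoted for each backend.
QUOTED_MULTIPLIERS = {
    "Cerebras": (7, 5),
    "SambaNova": (12, 3),
}


def implied_per_task(cost_factor, speed_factor):
    """Return (cost in USD, latency in seconds) implied by the multipliers.

    Assumption: "Nx cheaper" means baseline / N, and likewise for "Nx faster".
    """
    return BASELINE_COST_USD / cost_factor, BASELINE_SECONDS / speed_factor


for backend, (cost_factor, speed_factor) in QUOTED_MULTIPLIERS.items():
    cost, seconds = implied_per_task(cost_factor, speed_factor)
    print(f"{backend}: about ${cost:.2f} and {seconds:.0f} s per task")
```

Under those assumptions the two setups trade off differently: the Cerebras figures imply the lower latency, while the SambaNova figures imply the lower cost per task.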