🔬 The Bitter Lesson is Coming for Proteins - Alex Rives, BioHub
TL;DR
Metagenomics changed the scaling curve — Rives says ESM2 hit diminishing returns on UniRef, but ESMC regained a “beautiful scaling law” after adding billions of metagenomic sequences from diverse environments like hydrothermal vents, soil, oceans, and the human gut.
ESMC is positioned as a protein world model, not just a structure model — Biohub combined ESMC with structure prediction, mechanistic interpretability, and an atlas spanning 6.8 billion non-redundant proteins and 1.1 billion predicted structure clusters to map sequence, structure, and function across evolution.
Sparse autoencoders surfaced biology’s hierarchy inside the model — By probing a 6B-parameter ESMC with SAEs, the team found features ranging from basic biochemistry to abstract functional motifs, including a single feature for the “nucleophilic elbow” that activates across evolutionarily unrelated protein families.
The model is already doing useful design work — Rives says searching ESMC’s representation space can produce mini-protein binders and single-chain antibodies (scFvs), with affinities reaching the level needed for therapeutic activity, and he argues the approach may outperform MSA-based methods on antibodies.
Biohub’s next bottleneck is cell-scale data, not just better models — Rives frames virtual cells as a data problem first: Biohub’s Virtual Biology Initiative commits $400M internally plus $100M externally to generate fast, multimodal, perturbational, and spatial biology datasets that can support true generalization.
Open source is part of the strategy, not an afterthought — ESMC is being released fully open under an MIT license, and Rives repeatedly says Biohub is a philanthropy meant to build foundational tools for the scientific community rather than a closed drug-development company.
The Breakdown
Adding noisy metagenomics data erased the diminishing returns in protein language-model scaling, and Alex Rives says that shift turned ESMC into a true “world model” for proteins, one that now designs mini-protein binders and single-chain antibodies with affinities in the range needed for therapeutic activity. His bigger claim is more ambitious still: protein models are just the first rung toward open, feedback-driven virtual biology systems that can reason over millions of experiments and accelerate disease science.