Latent Space · 1h 10m

🔬 The Bitter Lesson is Coming for Proteins - Alex Rives, BioHub

TL;DR

  • Metagenomics changed the scaling curve — Rives says ESM2 hit diminishing returns on UniRef, but ESMC regained a “beautiful scaling law” after adding billions of metagenomic sequences from diverse environments like hydrothermal vents, soil, oceans, and the human gut.

  • ESMC is positioned as a protein world model, not just a structure model — Biohub combined ESMC with structure prediction, mechanistic interpretability, and an atlas spanning 6.8 billion non-redundant proteins and 1.1 billion predicted structure clusters to map sequence, structure, and function across evolution.

  • Sparse autoencoders surfaced biology’s hierarchy inside the model — By probing a 6B-parameter ESMC with SAEs, the team found features ranging from basic biochemistry to abstract functional motifs, including a single feature for the “nucleophilic elbow” that activates across evolutionarily unrelated protein families.

  • The model is already doing useful design work — Rives says searching ESMC’s representation space can produce mini-protein binders and single-chain antibodies (scFvs), with affinities reaching the level needed for therapeutic activity, and he argues the approach may outperform MSA-based methods on antibodies.

  • Biohub’s next bottleneck is cell-scale data, not just better models — Rives frames virtual cells as a data problem first: Biohub’s Virtual Biology Initiative commits $400M internally plus $100M externally to generate fast, multimodal, perturbational, and spatial biology datasets that can support true generalization.

  • Open source is part of the strategy, not an afterthought — ESMC is being released fully open under an MIT license, and Rives repeatedly says Biohub is a philanthropy meant to build foundational tools for the scientific community rather than a closed drug-development company.

The Breakdown

Adding noisy metagenomic data erased the diminishing returns in protein language-model scaling, and Alex Rives says that shift turned ESMC into a true “world model” for proteins, one that now designs mini-protein binders and single-chain antibodies with affinities at the level needed for therapeutic activity. His bigger claim is more ambitious still: protein models are just the first rung toward open, feedback-driven virtual biology systems that can reason over millions of experiments and accelerate disease science.
