Training Transformers to Solve the 95% Failure Rate of Cancer Trials, with Ron Alfa & Daniel Bear, Noetik
TL;DR
Noetik's core thesis is that cancer drugs mostly fail because trials enroll the wrong patients, not because the drugs are inherently bad. Ron Alfa argues that 90-95% of oncology drugs fail largely due to poor patient selection, and says models trained on patient biology can both discover targets and identify which subpopulations should get which drug.
They bet on generating their own multimodal human tumor dataset from scratch, even before they knew a model would train. Noetik spent roughly 18 months building lab infrastructure, processing pipelines, and spatial transcriptomics workflows that take weeks per run, with no clear prior that it would work, before OctoVC emerged around 2024.
Their data strategy is the moat: hundreds of randomized patient samples per array, patients spread across multiple slides, and more than 100 million spatially resolved cells. The goal is to beat batch effects and train models on paired H&E, protein imaging, spatial transcriptomics, and genotyping at a scale Dan Bear says is at least an order of magnitude beyond comparable public datasets.
The practical unlock is inference from standard H&E pathology images, without requiring every patient to get expensive multimodal assays. After training on rich lab data, Noetik says its models can take routine pathology slides from old or new trials, place responders and non-responders in latent space, and even predict spatial gene expression for interpretability.
They reject the fashionable "virtual cell" framing if it means simulating every biochemical reaction, and instead define it as a useful patient-level world model for drug decisions. Bear says the point is not a full mechanistic cell simulator but a model that can answer actionable questions, like whether a target matters in ovarian versus lung cancer or which tumor microenvironments will support T-cell activity.
Pharma is finally buying models as products, not just projects, and Noetik's $50 million GSK deal is their proof point. The agreement includes an upfront payment, milestones, and an annual license fee for OctoVC, with GSK able to use and fine-tune the models on its own translational data across multiple programs.
The Breakdown
The contrarian thesis: cancer trials fail because we don't know the patients
Ron Alfa opens with a sharp inversion of the usual drug-development story: oncology isn't mostly failing because pharma is bad at chemistry or target selection, but because it's bad at matching drugs to the right patients. His claim is that what we call one cancer is often multiple hidden functional subtypes, and Noetik wants models to discover those subtypes directly from richer patient data rather than from simplistic biomarkers like a single mutation or a single stained protein.
Why cell lines and mouse studies leave clinical teams basically guessing
He gives a vivid takedown of standard preclinical workflows: immortalized cancer cell lines are "Frankensteinian," often carrying weird genomes, abnormal chromosomes, and gene expression patterns that don't resemble any real human cell. By the time a molecule reaches the clinic, teams often have so little patient-relevant signal that they run broad early-phase trials almost as fishing expeditions, hoping a few responders appear among 50 patients spread across many unknown subtypes.
Building the dataset first, before there was any proof it would pay off
Noetik's answer was to build its own lab and data engine from first principles: source human tumors, process them into tissue arrays, and collect multimodal data designed for machine learning. Alfa says they spent about a year and a half just generating data and building pipelines before they even had enough to train a model, with spatial transcriptomics runs taking roughly two weeks for only a couple of slides and no off-the-shelf blueprint to follow.
The stack: H&E, immune-cell protein stains, spatial RNA, and DNA all aligned in tissue
The team explains the logic of the modalities in almost "central dogma as computer vision" terms. They collect standard H&E pathology for clinical relevance, multiplex protein imaging to identify cell types like T cells and B cells, spatial transcriptomics to measure roughly 1,000 to nearly 20,000 genes in place, and genotyping for DNA alterations, all in the same tissue context, turning a tumor into a layered image with far more than RGB channels.
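To make the "layered image" idea concrete, here is a minimal sketch of modalities registered to one spatial grid and stacked as channels. All shapes, channel counts, and the use of random placeholders are illustrative assumptions, not Noetik's actual data formats:

```python
import numpy as np

# Hypothetical per-sample shapes: every modality registered to one H x W grid.
H, W = 256, 256
rng = np.random.default_rng(0)

he_rgb = rng.random((H, W, 3))          # standard H&E pathology: 3 color channels
protein = rng.random((H, W, 8))         # multiplex protein stains (e.g. T-cell, B-cell markers)
spatial_rna = rng.random((H, W, 1000))  # spatial transcriptomics: ~1,000 genes per location

# A tumor sample becomes a single image with far more than RGB channels.
tumor_tensor = np.concatenate([he_rgb, protein, spatial_rna], axis=-1)
print(tumor_tensor.shape)  # (256, 256, 1011)
```

Once the modalities share a coordinate system, any vision-style architecture can treat the tumor as one wide-channel image rather than four separate datasets.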
The real moat is not just data volume, but how the data is designed
A big chunk of the conversation is about batch effects, controls, and why "just gather a dataset" is naive in biology. Noetik randomizes hundreds of patient samples onto arrays, distributes each patient across multiple slides and batches, and uses that structure to test whether embeddings capture therapy-response biology rather than artifacts of staining day or instrument conditions; Bear says dropping to 40% or 10% of their training data sharply hurts generalization.
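The randomization logic can be sketched as a toy layout generator. The array sizes, cores-per-patient count, and function name below are illustrative assumptions, not Noetik's actual protocol; the point is that each patient lands on several slides, so patient identity is decorrelated from staining batch:

```python
import random
from collections import defaultdict

def assign_cores_to_slides(patients, slides_per_patient, n_slides, seed=0):
    """Place each patient's tissue cores on several randomly chosen slides,
    so no patient is confounded with a single staining day or instrument run."""
    rng = random.Random(seed)
    layout = defaultdict(list)  # slide id -> list of patient ids on that slide
    for p in patients:
        for s in rng.sample(range(n_slides), slides_per_patient):
            layout[s].append(p)
    return layout

layout = assign_cores_to_slides(patients=[f"P{i:03d}" for i in range(300)],
                                slides_per_patient=3, n_slides=12)

# Sanity check: every patient appears on exactly 3 distinct slides.
counts = defaultdict(int)
for slide, pts in layout.items():
    for p in pts:
        counts[p] += 1
print(all(c == 3 for c in counts.values()))  # True
```

With this structure, a model that merely memorizes slide-level staining artifacts cannot also predict patient-level biology, which is what makes the embedding tests described above meaningful.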
What "virtual cell" means when you actually care about drugs working in humans
Alfa and Bear push back on the idea that a useful virtual cell has to simulate every intracellular chemical reaction. Their more grounded version is a patient-centric world model: simulate how a cell behaves in its tissue context, infer whether a target matters in a given tumor microenvironment, and use those outputs to pick indications, rescue failed trials, or identify responder clusters from self-supervised embeddings.
The H&E-only inference story is what makes this clinically usable
One of the stickiest moments is the claim that after multimodal training, inference can happen from ordinary H&E slides alone, the "lingua franca" of pathology. That means old phase 2 or phase 3 biopsies can be reanalyzed to see whether responders cluster together in latent space, and Noetik says it can also predict which genes are spatially expressed in responder versus non-responder groups, giving pharma a more interpretable story than a black-box score.
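The responder-clustering idea can be sketched end to end with simulated embeddings. Everything here is an assumption for illustration: the Gaussian stand-ins for slide embeddings, the tiny k-means, and the purity check are not Noetik's pipeline, just the generic shape of "embed H&E slides, then ask whether responders cluster":

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Tiny k-means, enough to check whether responders cluster together."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        centers = np.stack([X[labels == j].mean(0) if (labels == j).any() else centers[j]
                            for j in range(k)])
    return labels

# Simulated stand-in for H&E slide embeddings from a multimodally trained encoder:
# responders and non-responders drawn from two shifted Gaussians in latent space.
rng = np.random.default_rng(1)
responders = rng.normal(loc=2.0, size=(40, 16))
non_responders = rng.normal(loc=-2.0, size=(40, 16))
X = np.vstack([responders, non_responders])
labels = kmeans(X, k=2)

# If the latent space is informative, each cluster is dominated by one outcome.
purity = max(labels[:40].mean(), 1 - labels[:40].mean())
print(purity >= 0.9)  # True
```

In a real retrospective analysis, the cluster assignments would come from archived trial biopsies, and the interesting question is whether known responders concentrate in one region of latent space.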
From transformers to business model: Tario, in-silico humanized mice, and the GSK deal
Later they get more technical: Tario is Noetikās newer autoregressive transformer for spatial transcriptomics, and Bear says it shows better scaling than masked autoencoding, especially at longer context lengths where the model sees more tissue at once. They also describe a perturbation platform in mice with barcoded CRISPR knockouts and āin-silico humanizingā the mouse readout through human-trained models, then cap the episode with the business proof point: GSK licensed OctoVC in a $50 million deal with upfronts, milestones, and annual model fees ā notable because the thing being licensed is the model itself, not a single molecule or narrow collaboration.