0xSero · May 29, 2026 · 39m

What exactly is a REAPed model | compression guide

TL;DR

REAP targets MoE models by removing low-value experts. Router-weighted Expert Activation Pruning (REAP) watches which experts fire on a calibration dataset, then cuts the ones that matter least for your use case, such as coding or digital-assistant tasks.
The payoff can be huge: around 12.5% of original size with near-identical targeted performance — Sharif shows a 60 GB example compressed down to roughly 7.5 GB when combining recommended pruning and quantization, while preserving benchmark performance on the chosen task.
Big pruned models can beat native small models — His argument is that a chopped-down frontier model like GLM 5.1 still carries better data, engineering, and pretraining than a smaller model trained from scratch, citing Chinchilla-style scaling logic and higher token exposure.
This is as much about cost control as model quality — He says today’s AI pricing is heavily subsidized, compares it to early Uber, and claims companies may soon choose between six-figure annual token bills and buying their own GPU boxes to serve internal users.
Quantization and pruning are different levers, and both matter — Quantization shrinks 16-bit weights into smaller formats like 4-bit NVFP4, while pruning actually removes parts of the model; together, they took GLM 5.1 from a 1.5 TB weight footprint toward about 370 GB.
You don’t have to retrain the router after pruning, but compute is still the bottleneck — In the Q&A, Sharif says the router can repoint to surviving experts automatically, though better compute, more calibration samples, and eval-driven repair loops all improve the final result.
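
To make the pruning idea concrete, here is a minimal sketch of scoring experts on calibration tokens, dropping the weakest ones, and routing only to the survivors. It is an illustration under stated assumptions, not Sharif's tooling: the saliency formula (average of router gate times expert output norm over the tokens an expert was routed), the toy calibration data, and the helper names `top_k_route`, `expert_saliency`, and `prune_experts` are all hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(logits, k, alive=None):
    """Pick the k highest-logit experts among `alive` and renormalize their gates.

    Assumption: after pruning, the router simply ignores removed experts,
    so no router retraining is modeled here.
    """
    candidates = range(len(logits)) if alive is None else sorted(alive)
    chosen = sorted(candidates, key=lambda j: logits[j], reverse=True)[:k]
    gates = softmax([logits[j] for j in chosen])
    return dict(zip(chosen, gates))


def expert_saliency(calibration, num_experts, k):
    """Assumed score: mean of gate * output norm over tokens routed to each expert."""
    totals = [0.0] * num_experts
    counts = [0] * num_experts
    for logits, output_norms in calibration:
        for j, gate in top_k_route(logits, k).items():
            totals[j] += gate * output_norms[j]
            counts[j] += 1
    # Experts that never fired on the calibration set score zero.
    return [t / c if c else 0.0 for t, c in zip(totals, counts)]


def prune_experts(saliency, keep):
    """Return the indices of the `keep` highest-saliency experts."""
    ranked = sorted(range(len(saliency)), key=lambda j: saliency[j], reverse=True)
    return set(ranked[:keep])


# Toy calibration set: (router logits, per-expert output norms) for each token.
calibration = [
    ([2.0, 1.0, 0.1, -1.0], [1.0, 0.8, 0.5, 0.2]),
    ([1.5, 2.5, 0.0, -0.5], [0.9, 1.2, 0.4, 0.1]),
    ([0.2, 1.8, 2.2, -2.0], [0.7, 1.0, 0.3, 0.2]),
]

saliency = expert_saliency(calibration, num_experts=4, k=2)
survivors = prune_experts(saliency, keep=2)
print(sorted(survivors))  # [0, 1]

# A token that originally preferred pruned expert 2 now lands on the survivors.
print(top_k_route([0.2, 1.8, 2.2, -2.0], k=2, alive=survivors))
```

The calibration set is what makes the result task-specific: swapping in coding prompts versus assistant prompts changes which experts fire, and therefore which ones survive.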

The Breakdown

A 1.5 TB mixture-of-experts model can be cut to roughly 12.5% of its original size and still perform almost the same on the tasks you care about — if you know which experts to keep. Sharif’s case for REAP isn’t just technical; it’s economic: AI token costs are headed up, GPU prices are rising, and pruning big models may be the only practical way to keep serious assistants under your own control.
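
The 12.5% figure can be sanity-checked with simple arithmetic. The sketch below assumes the weight footprint scales linearly with bits per weight and with the fraction of weights kept, and it ignores quantization metadata and layers that are never pruned. The 50% keep fraction is an assumption chosen because it reproduces the 60 GB to 7.5 GB example; the source does not state the exact split between pruning and quantization. The function name `compressed_size_gb` is hypothetical.

```python
def compressed_size_gb(original_gb, original_bits, target_bits, kept_fraction):
    """Rough weight footprint after quantization and pruning (linear scaling assumed)."""
    return original_gb * (target_bits / original_bits) * kept_fraction


# 16-bit -> 4-bit (4x) plus keeping half the weights (2x) gives 1/8 = 12.5%.
print(compressed_size_gb(60, 16, 4, 0.5))    # 7.5

# For comparison, 4-bit quantization alone on a 1.5 TB model.
print(compressed_size_gb(1500, 16, 4, 1.0))  # 375.0
```

Under these assumptions the two levers multiply, so doubling the pruning ratio has the same effect on footprint as halving the bits per weight.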