AI Engineer · 10m

How We Built Zeta2: Training an Edit Prediction Model in Production — Ben Kunkle, Zed

TL;DR

  • Zeta2 is a small, specialized model built for speed: Ben Kunkle says edit prediction has to run on every keystroke, so Zed fine-tuned a narrow model for one job instead of relying on a larger general-purpose system.

  • The training pipeline starts with real editor snapshots, not synthetic prompts — Zed uses opt-in production data including cursor position, recent edits, nearby types and definitions, and diagnostics, then asks a frontier model what edit it would make in that exact context.

  • Frontier-model outputs were unreliable enough that Zed added a repair pass — if a teacher prediction crosses the editable-region boundary or just undoes what the user typed, Zed sends it to another model with a “you failed in this way, fix it” prompt before using it as training data.

  • Most of the pipeline is reusable and cached as giant JSONL files — each stage just adds or moves fields in a single-line JSON object, which lets the team reuse the expensive teacher/distillation work across many prompt-format experiments.

  • “Settled data” is promising but noisy, so Zed looks for near-matches instead of exact final code — after a user stops editing a region for 10 seconds, Zed compares the final state to multiple model samples with Levenshtein-style distance and keeps the middle band: not obvious, not random, but actually learnable.

  • Offline metrics don’t decide the winner by themselves: alongside delta chrF, reversal ratio, and diagnostic error counts, Zed ships candidate models to 15%+ of production traffic and watches acceptance rate and latency, because eval scores don’t always match what users want in the editor.
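
The "each stage adds fields to a one-line JSON object" pattern from the pipeline bullets can be sketched roughly as follows. This is a minimal illustration, not Zed's code: the field names (`text_before_last_edit`, `teacher_prediction`, `is_reversal`) and the exact-match reversal check are assumptions made up for the example.

```python
import json
from typing import Callable, Dict, Iterable, Iterator

Record = Dict[str, object]
Stage = Callable[[Record], Record]


def read_jsonl(lines: Iterable[str]) -> Iterator[Record]:
    """Parse one JSON object per non-empty line."""
    for line in lines:
        if line.strip():
            yield json.loads(line)


def apply_stage(records: Iterable[Record], stage: Stage) -> Iterator[Record]:
    """Run one stage over every record, copying so cached input stays untouched."""
    for record in records:
        yield stage(dict(record))


def flag_reversal(record: Record) -> Record:
    """Hypothetical stage: mark teacher predictions that just undo the user's last edit.

    Assumption: a reversal means the prediction equals the text as it was
    before the most recent edit. The real check is not described in the talk summary.
    """
    record["is_reversal"] = (
        record["teacher_prediction"] == record["text_before_last_edit"]
    )
    return record


def write_jsonl(records: Iterable[Record]) -> str:
    """Serialize records back to one JSON object per line."""
    return "\n".join(json.dumps(record) for record in records)


# Two cached teacher outputs, one of which simply restores the old text.
cached_lines = [
    '{"text_before_last_edit": "x = 1", "teacher_prediction": "x = 1"}',
    '{"text_before_last_edit": "x = 1", "teacher_prediction": "x = 12"}',
]
flagged = list(apply_stage(read_jsonl(cached_lines), flag_reversal))
print(write_jsonl(flagged))
```

Because a stage only adds a field, the expensive upstream output can stay cached on disk while cheap downstream stages, such as a new prompt format, are rerun against it.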

The Breakdown

Zed trained its edit-prediction model on opt-in production snapshots, then used frontier models to generate and repair predictions before distilling them into a tiny model fast enough to run on every keystroke. The most interesting twist is “settled data”: instead of trusting what the user eventually typed, Zed keeps only the examples where many model samples land near the final code, and it draws those samples from the student model itself to avoid a million expensive teacher calls.
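
A rough sketch of the settled-data filter follows, assuming the "middle band" is taken over a normalized Levenshtein distance between each model sample and the settled code. The thresholds `low` and `high` and the use of the median are illustrative assumptions; the summary does not say how Zed aggregates samples or where it puts the cutoffs.

```python
from statistics import median
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Classic edit distance (insert, delete, substitute), one row at a time."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete ch_a
                    current[j - 1] + 1,      # insert ch_b
                    previous[j - 1] + cost,  # substitute or match
                )
            )
        previous = current
    return previous[-1]


def normalized_distance(a: str, b: str) -> float:
    """Edit distance scaled to [0, 1] by the longer string's length."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def keep_settled_example(
    settled: str,
    samples: Sequence[str],
    low: float = 0.05,   # assumed cutoff: below this the example is "obvious"
    high: float = 0.5,   # assumed cutoff: above this it looks "random"
) -> bool:
    """Keep an example only if the typical sample is near, but not at, the settled code."""
    if not samples:
        return False
    typical = median(normalized_distance(sample, settled) for sample in samples)
    return low <= typical <= high


settled_code = "let total = a + b;"
near_samples = ["let sum = a + b;", "let total = a + c;", "let sum = a + b;"]
print(keep_settled_example(settled_code, near_samples))
```

In this sketch an example whose samples all reproduce the settled code exactly is dropped as already learned, and one whose samples share almost nothing with it is dropped as noise.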
