Building Generative Image & Video models at Scale - Sander Dieleman (Veo and Nano Banana)
TL;DR
Data curation is still the underrated lever — Sander Dieleman says time spent inspecting and improving training data can beat model or optimizer tweaks, even though academia historically incentivized standardized datasets over actually looking at the data.
Modern image and video models don’t generate pixels directly; they generate learned latents. With an autoencoder like the one in Stable Diffusion, a 256×256×3 image can be compressed to roughly a 32×32 grid with extra channels, and for video this can cut memory by up to two orders of magnitude.
Dieleman’s core intuition for diffusion is “spectral autoregression” — because natural images have a power-law frequency spectrum while Gaussian noise is flat, adding noise wipes out high frequencies first, so diffusion effectively generates visuals coarse-to-fine instead of token-by-token.
Guidance is the trick that makes diffusion models look stronger than their size suggests — classifier-free guidance does two predictions per step, compares conditioned vs. unconditioned outputs, and amplifies the difference, sharply improving prompt fidelity at the cost of diversity.
Sampling quality depends on following a nonlinear path through latent space, not just running more steps forever — more denoising steps help up to a point, and newer methods like consistency models and rectified flow try to make that path straighter so you can sample in 1–3 steps instead of ~50.
Text prompting is no longer enough for useful generative video — for products like Veo and Nano Banana, Dieleman says the real frontier is richer control signals such as reference images, camera motion, event timing, and post-training methods like preference tuning.
The Breakdown
A behind-the-scenes tour of Veo, Nano Banana, and diffusion at scale
Sander Dieleman, a research scientist at Google DeepMind for more than a decade, frames the talk as a “whirlwind tour” of what it actually takes to train generative media systems like Veo and Nano Banana. He stays focused on diffusion, arguing that while language modeling leans autoregressive, audiovisual generation has landed on a different sweet spot.
The unglamorous secret sauce: obsess over the data
His first hard point is that data curation matters more than most people admit. Coming from academia, he says researchers were trained not to touch the dataset so comparisons stayed clean — but in frontier generative modeling, actually looking at the data is often a better use of time than chasing optimizer tweaks, even if the details remain part of the “secret sauce.”
Why nobody wants to diffuse over raw 30-second 1080p video tensors
Dieleman explains that early diffusion models worked directly on pixels, but that breaks once you scale to serious image resolutions or video durations. A single 30-second 1080p, 30 fps clip can be several gigabytes in memory, so modern systems learn their own compressed representations with autoencoders rather than relying on codecs like JPEG or H.265, which make files small but distort the structure neural nets need.
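The "several gigabytes" claim is easy to sanity-check with back-of-envelope arithmetic (the byte-per-element assumption here is ours, not from the talk):

```python
# Why raw pixels don't scale: element count for one 30 s, 1080p, 30 fps clip.
frames = 30 * 30                      # duration (s) x frame rate (fps)
elems = frames * 1080 * 1920 * 3      # H x W x RGB channels per frame
gb_uint8 = elems / 1e9                # 1 byte per element (uint8 pixels)
print(f"{elems:,} elements ~ {gb_uint8:.1f} GB at uint8")
```

At uint8 that is already ~5.6 GB for a single training example; in float16 activations it doubles, before any attention or gradient overhead.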
Latent space keeps the grid, throws away just enough detail
The latent representation still looks like an image tensor — just at a much coarser scale — which preserves the topology architectures depend on. Using the classic Stable Diffusion-style example, a 256×256 RGB image becomes something like a 32×32 latent grid with extra channels, and Dieleman notes you can still often tell what animal is in the latent visualization because semantics are preserved while texture and fine detail get compressed.
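The compression ratio in that example is easy to work out; assuming 4 latent channels (the Stable Diffusion v1 VAE choice, not specified in the talk):

```python
# Back-of-envelope compression for the Stable Diffusion-style example above.
# Assumption: 4 latent channels, as in the SD v1 autoencoder.
pixel_elems = 256 * 256 * 3    # raw RGB tensor
latent_elems = 32 * 32 * 4     # 8x spatial downsampling, 4 channels
ratio = pixel_elems / latent_elems
print(pixel_elems, latent_elems, ratio)  # 196608 4096 48.0
```

A 48× reduction in elements is what makes diffusion over the latent grid tractable while keeping the image-like topology.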
Diffusion as denoising — and why the one-shot answer is always blurry
He walks through diffusion with a 2D mental model: start from a clean sample, corrupt it with Gaussian noise, then train a denoiser to guess where it came from. The catch is that the inverse problem is ambiguous, so a one-step prediction averages many possibilities and comes out blurry; the magic is taking only a small step toward that prediction, then asking again, sometimes adding a little fresh noise back in so the model doesn’t spiral on its own mistakes.
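That 2D mental model can be sketched end to end, because for a finite point cloud the ideal denoiser (the posterior mean under Gaussian noise) has a closed form, so no training is needed. This is a toy illustration of the small-step idea, not the talk's actual sampler; the dataset, schedule, and DDIM-style update are our assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy 2D dataset: two tight clusters standing in for "clean samples".
data = np.concatenate([rng.normal([-2, 0], 0.1, (200, 2)),
                       rng.normal([2, 0], 0.1, (200, 2))])

def ideal_denoiser(x, sigma):
    # Exact posterior mean E[x0 | x] under Gaussian noise of std sigma:
    # a softmax-weighted average of the dataset (no learned model).
    d2 = ((data - x) ** 2).sum(axis=1)
    w = np.exp(-(d2 - d2.min()) / (2 * sigma ** 2))
    return (w[:, None] * data).sum(axis=0) / w.sum()

# Start far out in noise and take many small steps toward the denoiser's
# prediction, re-asking at each noise level (DDIM-like; this sketch omits
# the optional fresh noise mentioned in the talk).
sigmas = np.linspace(3.0, 0.01, 50)
x = np.array([0.7, -1.3]) * sigmas[0]   # a fixed "pure noise" start
for s_hi, s_lo in zip(sigmas[:-1], sigmas[1:]):
    x0_hat = ideal_denoiser(x, s_hi)          # blurry one-shot guess
    x = x0_hat + (s_lo / s_hi) * (x - x0_hat)  # small step, keep some noise
print(x)  # lands near one of the two clusters, not their blurry average
```

The one-shot prediction at high noise sits between the clusters (the blurry average); the iterated small steps are what commit it to a single sharp sample.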
The best line in the talk: diffusion is basically spectral autoregression
Dieleman then brings in Fourier analysis and shows that natural images have a power-law spectrum, while Gaussian noise is flat across frequencies. That means adding a little noise destroys high-frequency detail first and only later wipes out low-frequency structure, so diffusion naturally generates from coarse semantics to fine detail — his summary is that diffusion is “basically spectral autoregression.”
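The frequency argument can be demonstrated numerically with a synthetic 1/f signal standing in for a natural image row (the construction and noise level are our assumptions, chosen to mirror the power-law-vs-flat contrast he describes):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1024
# Synthesize a 1/f ("pink") signal: power-law spectrum like natural images.
freqs = np.fft.rfftfreq(n)
amp = np.zeros_like(freqs)
amp[1:] = 1.0 / freqs[1:]                       # amplitude ~ 1/f
phase = rng.uniform(0, 2 * np.pi, freqs.size)
signal = np.fft.irfft(amp * np.exp(1j * phase), n)

noise = rng.normal(0, signal.std(), n)          # flat (white) spectrum

def power(x):
    return np.abs(np.fft.rfft(x)) ** 2

# Per-frequency signal-to-noise ratio after adding noise of matched variance:
snr = power(signal)[1:] / power(noise)[1:]
print(snr[:3].mean() > 1, snr[-100:].mean() < 1)
```

Low frequencies survive the added noise (SNR above 1) while high frequencies are swamped (SNR below 1), which is exactly why denoising proceeds coarse-to-fine.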
From U-Nets to transformers, and the hybrid logic of video generation
On architecture, he says U-Nets were the original denoisers and still make intuitive sense for restoration-like tasks, but transformers now work well too, especially because the field can borrow scaling lessons from LLMs. For video, he rejects a false binary: you can jointly diffuse over the full space-time volume, or do a hybrid where time is autoregressive but each frame is generated with diffusion — a setup he says is especially useful for real-time systems like Genie.
Guidance, distillation, and the shift from prompting to real control
The most product-relevant section is on sampling. Classifier-free guidance, he says, became universal after early OpenAI work like GLIDE because amplifying the gap between unconditioned and conditioned predictions massively boosts quality, even if it reduces diversity; cranked too high, though, it can cause telltale saturation and “style” artifacts. He closes on distillation and control: consistency models try to collapse a 50-step denoising path into 1–3 steps, while the next frontier for video is richer conditioning — not just text, but reference identity, camera motion, event timing, and post-training methods like RL or DPO to make models actually do what users want.
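The "two predictions per step, amplify the difference" recipe is a one-line formula. A minimal sketch with stand-in predictions (no real model; the scale value 7.5 is just a common default, not from the talk):

```python
import numpy as np

def cfg(eps_uncond, eps_cond, guidance_scale):
    # Classifier-free guidance: extrapolate from the unconditioned
    # prediction toward, and past, the conditioned one.
    # scale = 1 recovers the plain conditional prediction; larger values
    # amplify the conditioning signal at the cost of diversity.
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# Toy check with stand-in denoiser outputs:
eps_u = np.array([0.0, 0.0])
eps_c = np.array([1.0, -1.0])
print(cfg(eps_u, eps_c, 7.5))  # [ 7.5 -7.5]
```

Pushing the scale far past 1 is what over-amplifies the conditioning direction, which is where the saturation and "style" artifacts he mentions come from.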