Inside xAI: Building Grok Imagine in 3 Months, Videogen vs World Models, and Video Agents - Ethan He
TL;DR
Most video progress is really language progress: Ethan He says modern visual intelligence gains come mostly from prompt rewriting, planning, and language-model reasoning, not from the diffusion backbone alone.
Grok Imagine was built fast because iteration speed beat novelty: xAI shipped Grok Imagine 0.9 in about 3 months with a small team, strong infra, and rapid end-to-end cycles, where fixing data and training pipeline bugs often mattered more than inventing new algorithms.
Training frontier video models is already LLM-scale expensive: Ethan estimates costs are comparable to medium-scale language models, with tens of trillions of visual tokens, 20B-class models, and storage alone reaching tens of petabytes and millions of dollars per month.
World models need three things, not just pretty video: His definition is real-time, interactive, long-horizon video, meaning the system must respond through pixels to mouse, keyboard, or voice, stay coherent over minutes or hours, and do it fast enough to feel live.
Reference-to-video and video extension are stepping stones to full world models: Instead of brute-forcing massive context windows, xAI built features like full-history video extension and up to seven-image reference conditioning to carry characters, objects, and scenes across longer generations.
Video agents are the next commercial inflection: Ethan predicts that by the end of the year, production-grade video agents that iteratively call video generation, editing tools, and utilities like ffmpeg will be good enough for ads and enterprise workflows, which is when spending will really ramp.
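To make the storage claim in the cost point above concrete, here is a rough back-of-the-envelope sketch. The price of $0.02 per GB per month is an illustrative assumption, not a figure from Ethan or a quoted vendor price, and the function name is hypothetical.

```python
def monthly_storage_cost_usd(petabytes: float, usd_per_gb_month: float) -> float:
    """Estimate monthly storage spend for a dataset of the given size."""
    gigabytes = petabytes * 1_000_000  # decimal units: 1 PB = 1,000,000 GB
    return gigabytes * usd_per_gb_month


if __name__ == "__main__":
    ASSUMED_PRICE = 0.02  # assumption: USD per GB per month, for illustration only
    for pb in (10, 50, 100):
        cost = monthly_storage_cost_usd(pb, ASSUMED_PRICE)
        print(f"{pb} PB -> ${cost:,.0f} per month")
```

Under that assumed price, 50 PB works out to about $1 million per month, which is consistent with "tens of petabytes and millions of dollars per month" as an order of magnitude.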
The Breakdown
xAI built Grok Imagine 0.9 from scratch in three months with just a few engineers, but Ethan He’s bigger claim is that the real gains in video generation now come more from language models than from diffusion itself. He argues the next wave is video agents and real-time world models, where AI plans, edits, and generates interfaces and long-form video interactively instead of just spitting out short clips.
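One way to picture the video agent loop described here is the minimal sketch below. It is not xAI's implementation: `run_agent`, `generate`, and `accept` are hypothetical names, the generator and reviewer are passed in as plain callables so the example runs without any model, and the only real tool referenced is ffmpeg's concat demuxer, whose command line is built but not executed.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Shot:
    prompt: str
    path: str


def run_agent(
    shot_prompts: List[str],
    generate: Callable[[str, int, int], str],  # hypothetical: (prompt, shot index, attempt) -> clip path
    accept: Callable[[str], bool],             # hypothetical: reviewer that approves or rejects a clip
    max_attempts: int = 3,
) -> List[Shot]:
    """Generate each planned shot, retrying with a revised prompt until it is accepted."""
    shots = []
    for index, prompt in enumerate(shot_prompts):
        path = ""
        for attempt in range(max_attempts):
            path = generate(prompt, index, attempt)
            if accept(path):
                break
            # A real agent would ask a language model to rewrite the prompt here.
            prompt = prompt + " (revised)"
        # If every attempt is rejected, the last clip is kept anyway.
        shots.append(Shot(prompt=prompt, path=path))
    return shots


def concat_list_text(shots: List[Shot]) -> str:
    """Contents of the list file read by ffmpeg's concat demuxer."""
    return "".join(f"file '{shot.path}'\n" for shot in shots)


def ffmpeg_concat_command(list_file: str, output: str) -> List[str]:
    """Command that stitches the listed clips together without re-encoding."""
    return ["ffmpeg", "-f", "concat", "-safe", "0", "-i", list_file, "-c", "copy", output]


if __name__ == "__main__":
    # Stub generator and reviewer: the reviewer rejects every first attempt.
    clips = run_agent(
        ["wide shot of a city at dawn", "close-up of the product"],
        generate=lambda prompt, index, attempt: f"shot_{index}_try{attempt}.mp4",
        accept=lambda path: not path.endswith("try0.mp4"),
    )
    print(concat_list_text(clips), end="")
    print(" ".join(ffmpeg_concat_command("shots.txt", "final.mp4")))
```

The point of the structure is that planning, review, and prompt revision are language tasks wrapped around the video model, which matches the claim that much of the gain now comes from the language side.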