AI Engineer · 15m

Road to 5 Million Tokens: Breaking Barriers in Long Context Training — Max Ryabinin, Together AI

TL;DR

  • 5 million-token training is possible with standard transformers: Max Ryabinin says Together AI pushed Llama-style training to multi-million-token contexts by combining existing systems tricks with a new context parallelism optimization.

  • Memory, not just compute, is the hidden wall: He highlights two bottlenecks in long-context training: attention compute that grows quadratically with sequence length, and memory that grows linearly. He argues the second is often the more practical blocker.

  • No single trick solves it: Fully Sharded Data Parallel (FSDP), DeepSpeed Ulysses context parallelism, activation checkpointing, CPU offloading, and sequence tiling each cut memory, but only the full stack made 3 million tokens fit on an 8x H100 setup.

  • Untitled Ulysses saves memory by reusing smaller attention buffers: Together AI found one set of heads already saturates a GPU, so instead of allocating big buffers for multiple heads at once, it processes head chunks over time and reuses memory with little throughput loss at smaller scales.

  • The trade-off is clear and tunable: Larger attention chunks run faster but use more memory, while smaller chunks conserve memory and extend context length, giving teams a knob to balance throughput against sequence length.

  • Profiling matters because bottlenecks show up in weird places: Ryabinin closes by urging people to inspect training with tools like the PyTorch profiler, since long-context scaling depends on finding the unexpected memory hogs, not just knowing the theory.
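
The head-chunking idea and its throughput-versus-memory knob can be illustrated with a toy, CPU-only sketch. This is not Together AI's implementation: the function names, the `heads_per_chunk` parameter, and the `peak_heads` counter are hypothetical, and a real system would operate on GPU tensors with all-to-all communication between devices. The sketch only shows that processing heads a few at a time gives the same result as processing them all at once, while fewer heads need working buffers at any moment.

```python
import math
import random


def attention_one_head(q, k, v):
    """Plain softmax attention for one head; q, k, v are lists of vectors."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * vj[d] for w, vj in zip(weights, v)) / total
            for d in range(len(v[0]))
        ])
    return out


def attention_in_head_chunks(q_heads, k_heads, v_heads, heads_per_chunk):
    """Run attention over heads in chunks (hypothetical helper).

    Returns the per-head outputs and the largest number of heads that
    were being processed together, a stand-in for peak buffer size.
    """
    outputs = []
    peak_heads = 0
    for start in range(0, len(q_heads), heads_per_chunk):
        stop = min(start + heads_per_chunk, len(q_heads))
        # Only this slice of heads is live; a real implementation would
        # reuse one preallocated buffer across iterations.
        chunk = [
            attention_one_head(q_heads[h], k_heads[h], v_heads[h])
            for h in range(start, stop)
        ]
        peak_heads = max(peak_heads, stop - start)
        outputs.extend(chunk)
    return outputs, peak_heads


def random_heads(num_heads, seq_len, head_dim, rng):
    """Build small random inputs for the demonstration."""
    return [
        [[rng.uniform(-1, 1) for _ in range(head_dim)] for _ in range(seq_len)]
        for _ in range(num_heads)
    ]


rng = random.Random(0)
q = random_heads(4, 6, 8, rng)
k = random_heads(4, 6, 8, rng)
v = random_heads(4, 6, 8, rng)

all_at_once, peak_full = attention_in_head_chunks(q, k, v, heads_per_chunk=4)
one_by_one, peak_small = attention_in_head_chunks(q, k, v, heads_per_chunk=1)
```

Here `heads_per_chunk=4` handles every head in one pass, while `heads_per_chunk=1` takes four passes but keeps only one head's buffers live at a time. Both calls return identical outputs, which is why the chunk size can be treated as a tuning knob.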

The Breakdown

Together AI says it can train standard transformer models at up to 5 million tokens of context by stacking a series of memory-saving tricks, then adding its own "Untitled Ulysses" tweak to squeeze attention activations even further. The punchline is that the real barrier is often memory, not just quadratic compute, and careful profiling plus the right combination of sharding, checkpointing, offloading, and chunking can make seemingly impossible context lengths fit on hardware like an 8x H100 node.
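
To make the compute-versus-memory distinction concrete, here is a rough back-of-the-envelope sketch. The model shape (32 layers, hidden size 4096, 32 heads of dimension 128) is an illustrative Llama-style assumption, not a configuration stated in the talk, and both formulas are simplifications: the FLOP count covers only the two attention matrix multiplications and ignores causal masking, and the memory figure counts a single bf16 hidden-state tensor per layer.

```python
def attention_matmul_flops(seq_len, num_layers, num_heads, head_dim):
    """Approximate forward FLOPs for QK^T and the weighted sum over V.

    Each of the two matmuls costs about 2 * seq_len^2 * head_dim FLOPs
    per head, so the total grows with the square of seq_len.
    """
    return num_layers * num_heads * 4 * seq_len ** 2 * head_dim


def hidden_state_bytes(seq_len, hidden_size, num_layers, bytes_per_elem=2):
    """Bytes for one saved hidden-state tensor per layer (bf16 by default).

    This grows only linearly with seq_len.
    """
    return seq_len * hidden_size * num_layers * bytes_per_elem


# Assumed, illustrative model shape.
LAYERS, HIDDEN, HEADS, HEAD_DIM = 32, 4096, 32, 128

for tokens in (1_000_000, 2_000_000):
    flops = attention_matmul_flops(tokens, LAYERS, HEADS, HEAD_DIM)
    gigabytes = hidden_state_bytes(tokens, HIDDEN, LAYERS) / 1e9
    print(f"{tokens:>9,} tokens: {flops:.3e} attention FLOPs, "
          f"{gigabytes:.0f} GB of saved hidden states")
```

Doubling the sequence length quadruples the attention FLOPs but only doubles the saved activations. Even so, under these assumptions the saved hidden states at 1 million tokens come to roughly 262 GB, well beyond the 80 GB of a single H100, which is consistent with the talk's point that memory is often the more practical blocker.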
