Let LLMs Wander: Engineering RL Environments — Stefano Fiorucci
TL;DR
RL environments are becoming the new scaling surface for LLMs — Stefano Fiorucci argues that after pretraining hit diminishing returns, systems like OpenAI’s o1 and DeepSeek-R1 showed how reinforcement learning with verifiable rewards can improve reasoning by letting models explore trajectories instead of just imitating supervised examples.
The key shift is from datasets to dynamic environments — in this framing, the LLM is the agent, the environment includes task logic and scoring rules, and rewards can come from automatically checkable outcomes like correct math answers, valid tool calls, or winning a tic-tac-toe game.
Verifiers is built to make RL environments reusable software, not one-off hacks — the open-source library from Prime Intellect supports single-turn, multi-turn, and tool-using environments, plugs into OpenAI-compatible endpoints and local models via vLLM, and handles async rollouts so builders can focus on task logic and rewards.
A weak small model can become shockingly strong with the right environment design — starting from Liquid AI’s LFM2, which often failed formatting and invalid moves, Fiorucci used 200 synthetic SFT examples from GPT-5 Mini plus RL training to produce a model that dominates random players and draws 85% of games against an optimal tic-tac-toe opponent.
Environment design details mattered as much as the RL algorithm — he had to soften invalid-move penalties, control opponent difficulty with minimax plus randomness, seed turns to reduce GRPO noise, and use stratified sampling, because batch sizes below 256 caused instability and even model collapse.
The final trained model beat its own teacher on the target task — after a second RL phase with stronger opponents and higher exploration temperature, Fiorucci’s tic-tac-toe specialist matched GPT-5 Mini against random players and outperformed it against optimal opponents, a concrete example of small-task specialists surpassing larger closed models.
The Breakdown
Why RL environments matter right now
Stefano Fiorucci opens by framing RL environments as one of the most exciting shifts in LLM post-training: places where models can interact, explore, use tools, and get feedback instead of just copying examples. He ties that to the bigger industry moment too — startups in the space are getting serious funding, while reports from DeepSeek and MiniMax point to thousands of environments being used to scale model performance.
From classic RL to LLMs that can reason
He gives the quick RL refresher — agent, environment, state, action, reward, trajectory — then maps it cleanly onto language models. The contrast with standard pretrain → SFT → RLHF is the key move: in supervised fine-tuning, the model imitates curated prompt-response pairs, while in reinforcement learning with verifiable rewards it explores many possible reasoning paths and gets rewarded for outcomes that can be automatically checked.
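That agent/environment mapping can be sketched as a minimal rollout loop. Everything here is an illustrative stand-in, not code from the talk: the "state" is the prompt so far, the "action" is a generated reply, and the reward arrives from the environment at the end of the episode.

```python
# Toy illustration of the RL loop mapped onto an LLM.
# All classes and names are illustrative stand-ins, not a real API.

class EchoTask:
    """One-step environment: reward 1.0 if the reply repeats the prompt."""
    def reset(self):
        self.prompt = "say: hello"
        return self.prompt

    def step(self, action):
        reward = 1.0 if action == "hello" else 0.0
        return None, reward, True        # next_state, reward, done

def policy(state):
    # Stand-in for sampling a completion from an LLM given the state.
    return state.split("say: ")[-1]

def rollout(env, policy):
    """Collect one trajectory of (state, action, reward) steps."""
    state, done, trajectory = env.reset(), False, []
    while not done:
        action = policy(state)
        next_state, reward, done = env.step(action)
        trajectory.append((state, action, reward))
        state = next_state
    return trajectory
```

In RL training, many such trajectories are collected and the policy is updated toward the higher-reward ones, rather than toward a fixed supervised target.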
The DeepSeek-style idea: reward what you can verify
This is where OpenAI’s o1 and DeepSeek-R1 come into the story. Fiorucci highlights the core idea: ask the model for both a reasoning trace and an answer, verify the answer automatically, and use that reward to train the policy — a setup that works not just for math, but for any task with machine-checkable success, like tool use or game outcomes.
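The "reward what you can verify" idea is easy to sketch: parse a completion that carries both a reasoning trace and a tagged answer, then reward only the automatically checkable part. The tag names and scoring below are assumptions for illustration, not the exact setup used by o1 or DeepSeek-R1.

```python
import re

def verifiable_reward(completion: str, ground_truth: str) -> float:
    """Reward 1.0 only when a parseable <answer> matches the ground truth.

    The <answer> tag convention is an illustrative assumption.
    """
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if match is None:
        return 0.0                       # unparseable output earns nothing
    answer = match.group(1).strip()
    return 1.0 if answer == ground_truth else 0.0
```

Because the check is mechanical, it scales to anything with machine-checkable success: unit tests for code, schema validation for tool calls, or the outcome of a game.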
Verifiers: turning environments into actual software artifacts
He introduces Verifiers, an open-source library by Prime Intellect that packages RL environments as installable Python software. The pitch is practical: it supports single-turn, multi-turn, and tool environments, parses outputs, computes rewards, abstracts model serving behind OpenAI-compatible APIs, and plugs into trainers like PrimeRL so you spend time on environment logic instead of infrastructure.
The building blocks: reverse-text, double-check, and tool use
Stefano walks through examples that make the abstractions feel concrete. A simple reverse-text environment loads a dataset, parses XML-tagged answers, and scores outputs with longest-common-subsequence ratio; a “double check” math environment shows how multi-turn state works by asking “Are you sure?”; and tool environments let the model call Python-defined tools until it’s ready to answer.
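The reverse-text scoring he describes can be reconstructed as a small function: extract the XML-tagged answer and score it by longest-common-subsequence ratio against the reversed input. The tag name and helper functions are assumptions, not the environment's actual code.

```python
import re

def lcs_length(a: str, b: str) -> int:
    """Classic dynamic-programming longest common subsequence."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if ca == cb else max(dp[i-1][j], dp[i][j-1])
    return dp[len(a)][len(b)]

def reverse_text_reward(completion: str, original: str) -> float:
    """Partial credit: LCS ratio between the tagged answer and the reversed text."""
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if match is None:
        return 0.0
    answer = match.group(1).strip()
    target = original[::-1]
    if not target:
        return 1.0 if not answer else 0.0
    return lcs_length(answer, target) / max(len(answer), len(target))
```

The ratio gives a dense signal — a nearly reversed string earns most of the reward — which is friendlier to RL than an all-or-nothing exact-match check.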
Tic-tac-toe becomes a serious RL testbed
Then the talk gets fun. Tic-tac-toe sounds toy-like, but he uses it to show why static datasets fall short for interactive tasks: the model has to track board state, produce moves in XML tags, react over multiple turns, and cope with different opponents. His environment handles board state, win conditions, random or minimax opponents, and a reward mix of win signal plus formatting bonus.
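The core environment pieces he lists — board state, win conditions, and a reward mixing game outcome with a formatting bonus — can be sketched in a few lines. The exact reward weights here are illustrative assumptions, not the values from the talk.

```python
# 3x3 board as a flat list of 9 cells, each 'X', 'O', or ' '.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
         (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
         (0, 4, 8), (2, 4, 6)]              # diagonals

def winner(board):
    """Return 'X' or 'O' if a line is complete, else None."""
    for a, b, c in LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None

def episode_reward(board, agent_mark, well_formatted_moves, total_moves):
    """Outcome signal plus a small formatting bonus (weights are illustrative)."""
    win = winner(board)
    if win == agent_mark:
        outcome = 1.0
    elif win is not None:
        outcome = -1.0
    else:
        outcome = 0.0                       # draw or unfinished game
    fmt_bonus = 0.2 * (well_formatted_moves / max(total_moves, 1))
    return outcome + fmt_bonus
```

Blending the sparse win/loss signal with a per-move formatting bonus gives a small model something to learn from even in games it loses.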
The messy engineering that made training work
A lot of the real lesson is in the details he learned the hard way. He made the game less punishing for small models by replacing immediate loss on invalid moves with a -0.1 penalty, added think-tag checks, seeded examples and turns so identical board states produce identical opponent moves, and used stratified sampling so batches had balanced opponent difficulty — all to reduce noise in GRPO/CISPO training.
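Of those tricks, stratified sampling is the easiest to sketch: draw each batch with roughly equal counts per opponent difficulty so no batch is dominated by one stratum. Field names and batch layout below are assumptions, not the talk's code.

```python
import random

def stratified_batch(examples, batch_size, key=lambda ex: ex["opponent"]):
    """Draw batch_size examples with roughly equal counts per stratum."""
    strata = {}
    for ex in examples:
        strata.setdefault(key(ex), []).append(ex)
    per_stratum = batch_size // len(strata)
    batch = []
    for group in strata.values():
        batch.extend(random.sample(group, min(per_stratum, len(group))))
    return batch
```

Balanced batches keep the per-batch reward distribution stable, which matters for group-relative methods like GRPO where advantages are computed against batch statistics.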
From weak model to tic-tac-toe master
He evaluates GPT-5 Mini and Liquid AI’s LFM2, finds a huge gap, and uses GPT-5 Mini to generate 200 synthetic SFT examples as a warm start. After RL training with batch sizes large enough to avoid collapse, the model becomes strong enough to dominate random opponents and draw 85% against an optimal player; after a second phase with tougher opponents and more exploration, it becomes a true specialist that even beats GPT-5 Mini against optimal play. His closing message is simple and memorable: don’t just show a model how to do a task — give it a space to play, a reward signal, and then “go for a walk.”