LIVE VIBE CHECK: GPT-5.5 Has It All
TL;DR
GPT-5.5 became the team’s default model because it’s both fast and unusually dependable — Dan said it’s his “daily driver,” Mike compared it to a safe Waymo versus Opus as a riskier Tesla, and Austin switched from Claude desktop to Codex for day-to-day growth work because it’s dramatically faster and easier to trust.
On Every’s senior engineer benchmark, GPT-5.5 crushed Opus 4.7: 62 vs. 33 out of 100 — the benchmark tests whether a model can rewrite a messy real codebase the way two human senior engineers would, and GPT-5.5’s edge showed up especially on large rewrites over many hours and many tokens.
The weirdly powerful combo is Opus for planning, GPT-5.5 for execution — Dan’s best benchmark result came when GPT-5.5 followed an Opus 4.7 plan, because Opus writes terse, contract-heavy specs but often hesitates to do the full rewrite, while 5.5 will actually go delete code and push the plan through.
Heavy users said GPT-5.5 finally crossed the line into real vibe coding — Naveen said he used 900 million tokens in a few weeks and built apps like Dayline, a Raycast-style notes/to-do app, in single gigantic threads, including one around 200 million tokens, without looking at the code once.
It’s not a unanimous ‘best model’ verdict: the tradeoff is execution strength vs. generalist taste — Kieran kept Claude/Opus 4.7 as his daily driver for broad product work and design taste, saying GPT-5.5 is more of a specialist that excels in details and execution but can lose coherence at the big-picture level.
OpenAI’s own pitch was that GPT-5.5 matters most inside Codex, not just as a standalone model — Romain Huet and Dominic said internal teams now use Codex for much more than coding, from Slack/Notion-driven tasks to browser use and asset generation, while API access is delayed by “days maximum” for additional safety checks.
The Breakdown
The headline: a new OpenAI workhorse, not just another model drop
Dan opens with basically no suspense: GPT-5.5 is out, Every has been testing it for three weeks, and he thinks it’s “really [__] great.” The early thesis is clear: GPT-5.5 is rare in being both a serious senior-engineering model and a fast, collaborative day-to-day workhorse, a combination that usually doesn’t come in one package.
The benchmark result that got everyone’s attention
The big number in the stream is Every’s senior engineer benchmark, where GPT-5.5 scores 62/100 versus Opus 4.7 at 33. That benchmark asks a model to rewrite a messy real codebase the way a human senior engineer would, and Dan says the gap comes from 5.5’s willingness to actually carry out a big rewrite instead of chickening out and patching around the edges.
Why Opus still matters: it writes the plan GPT-5.5 wants
The twist is that GPT-5.5’s best result came using an Opus 4.7 plan. Dan’s read is that Opus is better at producing terse, spec-like plans with hard constraints — things like “this giant file should end up under 500 lines” — but GPT-5.5 is the one with the nerve to execute that plan over many turns, many hours, and many tokens.
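The plan-then-execute split described above can be sketched as plain function composition: one model drafts a terse, constraint-heavy spec, and a second model is told to carry it out fully. The model calls below are stubbed as callables so the wiring is clear; the prompts, stub outputs, and function names are illustrative assumptions, not anything shown on the stream. Swapping real API clients in for the stubs would reproduce the Opus-plans/GPT-5.5-executes workflow.

```python
def plan_then_execute(planner, executor, task):
    """Ask the planner for a terse spec with hard constraints,
    then hand that spec (plus the original task) to the executor."""
    plan = planner(f"Write a terse rewrite plan with hard constraints for: {task}")
    prompt = (
        "Carry out this plan fully; do not patch around the edges.\n"
        f"Task: {task}\n"
        f"Plan: {plan}"
    )
    return executor(prompt)

# Stub "models" for illustration only:
fake_planner = lambda prompt: "Split app.py into modules; each file under 500 lines."

def fake_executor(prompt):
    # Echo the last line (the plan) to show what the executor received.
    return "Executed: " + prompt.splitlines()[-1]

print(plan_then_execute(fake_planner, fake_executor, "refactor app.py"))
# prints "Executed: Plan: Split app.py into modules; each file under 500 lines."
```

The design choice mirrors the stream’s observation: the planner’s only job is a short spec with hard constraints (like the “under 500 lines” target), while the executor’s prompt explicitly forbids the half-measure patching that Dan says weaker runs fall into.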
The team splits: safe delegator, product generalist, or all-in believer
Mike says GPT-5.5 feels like “getting into a Waymo” — safe, reliable, and great for tasks you need to delegate without babysitting, like building a corporate training curriculum from piles of call notes. Kieran is more hesitant: he says Claude/Opus 4.7 still feels like the better generalist for product work, while GPT-5.5 is stronger in execution and detail but less coherent from a zoomed-out perspective.
Naveen’s 900-million-token stress test
Naveen is the stream’s strongest GPT-5.5 convert, saying older GPTs pushed him back to Claude for writing, support, and vibe coding, but 5.5 changed that. He demoed Dayline, a Raycast-like Mac and iOS to-do app he built from a screenshot in one giant thread — around 200 million tokens — while joking that pink eye gave him the perfect excuse to lie down and vibe code.
What GPT-5.5 is actually better at in long projects
Dan and Naveen keep circling the same character trait: GPT-5.5 is dogged. Where Opus can feel brilliant but restless — “okay, are you ready to wrap this up?” — GPT-5.5 keeps grinding through underspecified, multi-repo, multi-platform tasks like Monologue’s remote MCP support or massive codebase rewrites without losing the thread.
Design is still messy, but the harness story got more interesting
Kieran’s tests show a nuanced picture: GPT-5.5 improved in typography, structure, and restraint, but can still produce visually weird UI decisions, especially compared with Opus 4.7. Then OpenAI’s team added a new angle: use the new image generation model to create the design first, then let GPT-5.5 in Codex implement it, effectively side-stepping some of the model’s weaker taste instincts.
OpenAI joins the stream and explains the bigger picture
Romain Huet and Dominic from OpenAI say GPT-5.5 plus Codex is already changing how teams work internally, including non-engineers using Slack, Notion, browser control, plugins, and artifacts to offload real tasks. They also confirm the API isn’t live yet because OpenAI is taking a cautious approach to the rollout, but say it should arrive “extremely soon,” hopefully within days.