LIVE VIBE CHECK: GPT-5.5 Has It All
TL;DR
GPT-5.5 became the team’s default model because it’s both fast and unusually dependable — Dan said it’s his “daily driver,” Mike compared it to a safe Waymo versus Opus as a riskier Tesla, and Austin switched from Claude desktop to Codex for day-to-day growth work because it’s dramatically faster and easier to trust.
On Every’s senior engineer benchmark, GPT-5.5 crushed Opus 4.7: 62 vs. 33 out of 100 — the benchmark tests whether a model can rewrite a messy real codebase the way two human senior engineers would, and GPT-5.5’s edge showed up especially on large rewrites over many hours and many tokens.
The weirdly powerful combo is Opus for planning, GPT-5.5 for execution — Dan’s best benchmark result came when GPT-5.5 followed an Opus 4.7 plan, because Opus writes terse, contract-heavy specs but often hesitates to do the full rewrite, while 5.5 will actually go delete code and push the plan through.
Heavy users said GPT-5.5 finally crossed the line into real vibe coding — Naveen said he used 900 million tokens in a few weeks and built apps like Dayline, a Raycast-style notes/to-do app, in single gigantic threads, including one around 200 million tokens, without looking at the code once.
It’s not a unanimous ‘best model’ verdict: the tradeoff is execution strength vs. generalist taste — Kieran kept Claude/Opus 4.7 as his daily driver for broad product work and design taste, saying GPT-5.5 is more of a specialist that excels in details and execution but can lose coherence at the big-picture level.
OpenAI’s own pitch was that GPT-5.5 matters most inside Codex, not just as a standalone model — Romain Huet and Dominic said internal teams now use Codex for much more than coding, from Slack/Notion-driven tasks to browser use and asset generation, while API access is delayed by “days maximum” for additional safety checks.
The Breakdown
The headline: a new OpenAI workhorse, not just another model drop
Dan opens with basically no suspense: GPT-5.5 is out, Every has been testing it for three weeks, and he thinks it’s “really [__] great.” The early thesis is clear: this model is rare because it’s both a serious senior-engineering model and a fast, collaborative day-to-day workhorse, two qualities that usually don’t come in the same package.
The benchmark result that got everyone’s attention
The big number in the stream is Every’s senior engineer benchmark, where GPT-5.5 scores 62/100 versus Opus 4.7 at 33. That benchmark asks a model to rewrite a messy real codebase the way a human senior engineer would, and Dan says the gap comes from 5.5’s willingness to actually carry out a big rewrite instead of chickening out and patching around the edges.
Why Opus still matters: it writes the plan GPT-5.5 wants
The twist is that GPT-5.5’s best result came using an Opus 4.7 plan. Dan’s read is that Opus is better at producing terse, spec-like plans with hard constraints — things like “this giant file should end up under 500 lines” — but GPT-5.5 is the one with the nerve to execute that plan over many turns, many hours, and many tokens.
The team splits: safe delegator, product generalist, or all-in believer
Mike says GPT-5.5 feels like “getting into a Waymo”: safe, reliable, and great for tasks you need to delegate without babysitting, like building a corporate training curriculum from piles of call notes. Kieran is more hesitant: he says Claude/Opus 4.7 still feels like the better generalist for product work, while GPT-5.5 is stronger in execution and detail but less coherent from a zoomed-out perspective.
Naveen’s 900-million-token stress test
Naveen is the stream’s strongest GPT-5.5 convert, saying older GPTs pushed him back to Claude for writing, support, and vibe coding, but 5.5 changed that. He demoed Dayline, a Raycast-like Mac and iOS to-do app he built from a screenshot in one giant thread — around 200 million tokens — while joking that pink eye gave him the perfect excuse to lie down and vibe code.
What GPT-5.5 is actually better at in long projects
Dan and Naveen keep circling the same character trait: GPT-5.5 is dogged. Where Opus can feel brilliant but restless — “okay, are you ready to wrap this up?” — GPT-5.5 keeps grinding through underspecified, multi-repo, multi-platform tasks like Monologue’s remote MCP support or massive codebase rewrites without losing the thread.
Design is still messy, but the harness story got more interesting
Kieran’s tests show a nuanced picture: GPT-5.5 improved in typography, structure, and restraint, but can still produce visually weird UI decisions, especially compared with Opus 4.7. Then OpenAI’s team added a new angle: use the new image generation model to create the design first, then let GPT-5.5 in Codex implement it, effectively side-stepping some of the model’s weaker taste instincts.
OpenAI joins the stream and explains the bigger picture
Romain Huet and Dominic from OpenAI say GPT-5.5 plus Codex is already changing how teams work internally, including non-engineers using Slack, Notion, browser control, plugins, and artifacts to offload real tasks. They also confirm the API isn’t live yet because OpenAI is taking a cautious approach to the rollout, but say it should arrive “extremely soon,” hopefully within days.