
Most existing language benchmarks are measuring the wrong thing for modern coding agents — Jo Van Eyck argues that pass@1 tests like McEval (June 2024, 40 languages) and AutoCodeBench (August 2025, ~20 languages) ignore the compile-test-fix loop that real agents use.
Older multilingual results would have pushed you toward odd winners like Haskell or Elixir — McEval showed GPT-4 Turbo hitting 90% on Haskell and 83% on Rust, while AutoCodeBench made Elixir look especially strong, but both were still one-shot benchmarks.
Terminal-Bench gets closer to reality but still doesn’t answer the language question — its Docker-based agent setup evaluates model-plus-harness systems like Claude Code or Codex iterating on real tasks, yet it doesn’t break performance down by programming language.
In Jo’s own small benchmark, TypeScript was the most agent-friendly language overall. Across two Exercism-style tasks, with five runs per task and language, and on metrics like duration, token usage, and number of build/test iterations, TypeScript consistently came out on top.
Strong typing didn’t deliver the expected advantage — Jo was rooting for F# and Rust, but in his setup they didn’t dominate; F# in particular suffered from high variance and a slow compiler, while Python and Rust were not especially efficient on time or tokens.
The practical takeaway is to benchmark your own stack, not trust internet folklore — Jo spent about $50 on inference using GitHub Copilot, and says even a cheap, imperfect benchmark can tell you more about your context than generic claims about Python or Rust.
Jo opens with a very current AI-dev anxiety: does your programming language actually matter when you’re working with coding agents? He’s hearing the usual contradictory advice (use whatever the frontier labs are dogfooding, follow the RL training stack, trust the rumors) and decides to stop guessing and benchmark it himself.
The first stop is McEval, a June 2024 benchmark covering 40 languages. Jo likes that it’s multilingual, but points out that its headline metric, pass@1, just means the model gets one shot to generate code and then the benchmark checks whether it compiles and passes tests — which is nothing like how today’s agents actually work. Still, the old results are funny in hindsight: GPT-4 Turbo looked amazing, Haskell hit 90%, Rust 83%, and Jo jokes that if you trusted this alone, “Haskell is the bee’s knees.”
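To make the gap between pass@1 and an agentic loop concrete, here is a minimal sketch. It assumes you have recorded, for each problem, how many build/test iterations the agent needed (or `None` if it never succeeded). The function names and the sample numbers are hypothetical and are not taken from McEval or from Jo's data.

```python
from typing import Optional, Sequence


def pass_at_1(first_attempt_passed: Sequence[bool]) -> float:
    """Fraction of problems whose single generated solution passed all tests."""
    if not first_attempt_passed:
        raise ValueError("need at least one problem")
    return sum(first_attempt_passed) / len(first_attempt_passed)


def solved_within(iterations_needed: Sequence[Optional[int]], budget: int) -> float:
    """Fraction of problems solved in at most `budget` build/test iterations.

    An entry of None means the agent never reached a passing solution.
    """
    if not iterations_needed:
        raise ValueError("need at least one problem")
    solved = sum(1 for n in iterations_needed if n is not None and n <= budget)
    return solved / len(iterations_needed)


# Hypothetical placeholder data: iterations needed for five problems.
iterations = [1, 2, 1, 3, None]

# With one shot per problem, pass@1 is simply "solved on iteration 1".
one_shot = pass_at_1([n == 1 for n in iterations])
with_feedback = solved_within(iterations, budget=3)
print(f"pass@1: {one_shot:.0%}, solved within 3 iterations: {with_feedback:.0%}")
```

With these made-up numbers the one-shot score is 40% while the score with a three-iteration budget is 80%, which is the kind of difference a pass@1 leaderboard cannot show.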
Then he looks at Tencent Research’s AutoCodeBench from August 2025, which feels more relevant because it includes newer models in the Opus 4 era and around 20 languages. The results again produce some spicy language takes — C# around 75%, Elixir around 82% — enough that Jo says he’d be writing Elixir right now if he took it at face value. But once more, the benchmark is still based on one-shot generation, so it misses the whole agentic feedback loop.
Jo gets excited when he finds Terminal-Bench from January 2026 because it finally moves beyond pass@1. It runs agents in Docker containers with small codebases and tasks, and evaluates the full model-plus-harness setup — things like Claude with Claude Code or GPT with Codex — iterating until the job is done. The catch: it ranks harness combinations, not which programming languages are easiest for agents, so it still doesn’t answer the question he actually cares about.
At that point he says, basically, fine — “we have to write a damn bench ourselves.” His benchmark, LangComp, uses two Exercism-style problems — Game of Life and Gilded Rose — across several languages he either likes or expected to do well: C#, F#, Elixir, Rust, Python, JavaScript, and TypeScript. He ran five attempts per problem-language combo using GitHub Copilot because he had “infinite credits,” then logged one big CSV of duration, token usage, and number of failed build/test cycles.
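If you want to try something similar on your own stack, a per-language summary of such a CSV might look like the sketch below. The column names (`language`, `problem`, `duration_s`, `tokens`, `failed_cycles`) are assumptions, since the actual layout of Jo's file is not described, and the rows are placeholder values with anonymized language names rather than his results.

```python
import csv
import io
import statistics
from collections import defaultdict

# Placeholder rows, NOT real benchmark results. Column names are assumed.
SAMPLE_CSV = """language,problem,duration_s,tokens,failed_cycles
lang_a,game_of_life,40,1000,1
lang_a,gilded_rose,60,1400,2
lang_b,game_of_life,90,2000,3
lang_b,gilded_rose,110,2600,2
"""


def summarize(csv_text: str) -> dict:
    """Group runs by language and compute simple per-language statistics."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        groups[row["language"]].append(row)

    summary = {}
    for language, rows in groups.items():
        durations = [float(r["duration_s"]) for r in rows]
        tokens = [int(r["tokens"]) for r in rows]
        cycles = [int(r["failed_cycles"]) for r in rows]
        summary[language] = {
            "runs": len(rows),
            "median_duration_s": statistics.median(durations),
            # Spread matters as much as the average: high variance means
            # unpredictable agent runs. stdev needs at least two samples.
            "duration_stdev_s": statistics.stdev(durations) if len(durations) > 1 else 0.0,
            "mean_tokens": statistics.mean(tokens),
            "mean_failed_cycles": statistics.mean(cycles),
        }
    return summary


if __name__ == "__main__":
    for language, stats in sorted(summarize(SAMPLE_CSV).items()):
        print(language, stats)
```

Reporting a spread measure next to the median is a deliberate choice here: with only a handful of runs per combination, a single slow run can dominate a mean.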
Jo admits he was hoping strongly typed languages like F# and Rust would crush this. Instead, TypeScript came out looking best on speed, consistency, and token efficiency, with JavaScript also performing well but without the extra safety net of types. F# didn’t win at all, which he blames in part on the compiler being “pretty darn slow,” and Python and Rust, despite all the hype, were not especially impressive on duration or token consumption.
One of the more practical observations is that most languages converged in roughly two to three build/test iterations, which lines up with prior benchmark findings that a single round of feedback gives agents a huge boost over pass@1. Jo is very explicit that this is a tiny experiment (two problems, five runs per problem-language combination, about $50 of inference), not some definitive industry benchmark. But for his own backend-heavy work, the result is strong enough that he plans to steer coding agents toward TypeScript, while telling viewers to spend a few bucks and run their own benchmarks instead of trusting the grapevine.