Jo Van Eyck · 16m

What programming language do AI coding agents prefer?

TL;DR

  • Most existing language benchmarks are measuring the wrong thing for modern coding agents — Jo Van Eyck argues that pass@1 tests like McEval (June 2024, 40 languages) and AutoCodeBench (August 2025, ~20 languages) ignore the compile-test-fix loop that real agents use.

  • Older multilingual results would have pushed you toward odd winners like Haskell or Elixir — McEval showed GPT-4 Turbo hitting 90% on Haskell and 83% on Rust, while AutoCodeBench made Elixir look especially strong, but both were still one-shot benchmarks.

  • Terminal-Bench gets closer to reality but still doesn’t answer the language question — its Docker-based agent setup evaluates model-plus-harness systems like Claude Code or Codex iterating on real tasks, yet it doesn’t break performance down by programming language.

  • In Jo’s own small benchmark, TypeScript was the most agent-friendly language overall — across two Exercism-style tasks, five runs per language, and metrics like duration, token usage, and number of build/test iterations, TypeScript consistently came out on top.

  • Strong typing didn’t deliver the expected advantage — Jo was rooting for F# and Rust, but in his setup they didn’t dominate; F# in particular suffered from high variance and a slow compiler, while Python and Rust were not especially efficient on time or tokens.

  • The practical takeaway is to benchmark your own stack, not trust internet folklore — Jo spent about $50 on inference using GitHub Copilot, and says even a cheap, imperfect benchmark can tell you more about your context than generic claims about Python or Rust.

The Breakdown

The question sounds simple, but the evidence is a mess

Jo opens with a very current AI-dev anxiety: does your programming language actually matter when you’re working with coding agents? He’s hearing the usual contradictory advice (use whatever the frontier labs are dogfooding, follow the RL training stack, trust the rumors) and decides to stop guessing and benchmark it himself.

McEval: multilingual, interesting, and already outdated

The first stop is McEval, a June 2024 benchmark covering 40 languages. Jo likes that it’s multilingual, but points out that its headline metric, pass@1, just means the model gets one shot to generate code and then the benchmark checks whether it compiles and passes tests — which is nothing like how today’s agents actually work. Still, the old results are funny in hindsight: GPT-4 Turbo looked amazing, Haskell hit 90%, Rust 83%, and Jo jokes that if you trusted this alone, “Haskell is the bee’s knees.”
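To make the metric concrete, here is a minimal sketch of a pass@1 score when every problem gets exactly one generated sample. The `Attempt` record and the four data points are made up for illustration; they are not McEval's data or its actual harness.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    """One single-shot generation for one problem (hypothetical record)."""
    problem_id: str
    compiled: bool
    tests_passed: bool


def pass_at_1(attempts: list[Attempt]) -> float:
    """Fraction of problems whose only attempt compiled and passed its tests."""
    if not attempts:
        raise ValueError("need at least one attempt")
    solved = sum(1 for a in attempts if a.compiled and a.tests_passed)
    return solved / len(attempts)


# Illustrative data only: four problems, one shot each, no retries.
attempts = [
    Attempt("p1", compiled=True, tests_passed=True),
    Attempt("p2", compiled=True, tests_passed=False),
    Attempt("p3", compiled=False, tests_passed=False),
    Attempt("p4", compiled=True, tests_passed=True),
]
score = pass_at_1(attempts)
print(f"pass@1 = {score:.0%}")  # pass@1 = 50%
```

Note that the model never sees the compiler error for `p3` or the failing test for `p2`, which is exactly the feedback an agent would use to try again.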

AutoCodeBench: more recent, same basic flaw

Then he looks at Tencent Research’s AutoCodeBench from August 2025, which feels more relevant because it includes newer models in the Opus 4 era and around 20 languages. The results again produce some spicy language takes — C# around 75%, Elixir around 82% — enough that Jo says he’d be writing Elixir right now if he took it at face value. But once more, the benchmark is still based on one-shot generation, so it misses the whole agentic feedback loop.

Terminal-Bench gets the workflow right, but not the language split

Jo gets excited when he finds Terminal-Bench from January 2026 because it finally moves beyond pass@1. It runs agents in Docker containers with small codebases and tasks, and evaluates the full model-plus-harness setup — things like Claude with Claude Code or GPT with Codex — iterating until the job is done. The catch: it ranks harness combinations, not which programming languages are easiest for agents, so it still doesn’t answer the question he actually cares about.
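The iterate-until-done workflow described here can be sketched as a small loop. This is an assumption-laden illustration, not how Terminal-Bench or any particular harness is implemented: `generate`, `build_and_test`, and `max_iterations` are hypothetical names, and the two `fake_*` functions are toy stand-ins for a model and a toolchain.

```python
from typing import Callable, Optional

# Hypothetical signatures: generate(task, feedback) returns source code;
# build_and_test(code) returns (ok, log_output).
Generate = Callable[[str, Optional[str]], str]
BuildAndTest = Callable[[str], tuple[bool, str]]


def run_agent_loop(task: str, generate: Generate, build_and_test: BuildAndTest,
                   max_iterations: int = 5) -> tuple[bool, int]:
    """Generate, build/test, feed errors back; stop on success or when the budget runs out.

    Returns (succeeded, iterations_used).
    """
    feedback: Optional[str] = None
    for iteration in range(1, max_iterations + 1):
        code = generate(task, feedback)
        ok, log = build_and_test(code)
        if ok:
            return True, iteration
        feedback = log  # the next attempt sees the compiler/test output
    return False, max_iterations


def fake_generate(task: str, feedback: Optional[str]) -> str:
    # Toy model: the first attempt is wrong, any attempt with feedback is right.
    return "fixed" if feedback else "buggy"


def fake_build_and_test(code: str) -> tuple[bool, str]:
    # Toy toolchain: only the "fixed" program passes.
    return (True, "") if code == "fixed" else (False, "test_x failed")


result = run_agent_loop("implement Game of Life", fake_generate, fake_build_and_test)
print(result)  # (True, 2)
```

With `max_iterations=1` the same loop collapses into a one-shot evaluation, which is the difference between this style of benchmark and pass@1.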

So he built LangComp himself on a small YouTuber budget

At that point he says, basically, fine — “we have to write a damn bench ourselves.” His benchmark, LangComp, uses two Exercism-style problems — Game of Life and Gilded Rose — across several languages he either likes or expected to do well: C#, F#, Elixir, Rust, Python, JavaScript, and TypeScript. He ran five attempts per problem-language combo using GitHub Copilot because he had “infinite credits,” then logged one big CSV of duration, token usage, and number of failed build/test cycles.
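A per-language summary of a results CSV like the one described could look roughly like this. The column names (`duration_s`, `tokens`, `failed_cycles`) and every row of sample data are assumptions made up for the sketch; they are not LangComp's real schema or results.

```python
import csv
import io
import statistics
from collections import defaultdict

# Made-up sample rows; the column names are assumptions, not LangComp's real schema.
SAMPLE_CSV = """language,problem,duration_s,tokens,failed_cycles
TypeScript,game-of-life,60,10000,1
TypeScript,game-of-life,80,12000,2
F#,game-of-life,100,15000,2
F#,game-of-life,300,25000,4
"""


def summarize(csv_text: str) -> dict[str, dict[str, float]]:
    """Group runs by language and report the mean (and duration spread) per metric."""
    runs = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        runs[row["language"]].append(row)

    summary = {}
    for language, rows in runs.items():
        durations = [float(r["duration_s"]) for r in rows]
        tokens = [float(r["tokens"]) for r in rows]
        cycles = [float(r["failed_cycles"]) for r in rows]
        summary[language] = {
            "runs": len(rows),
            "mean_duration_s": statistics.mean(durations),
            # Sample standard deviation needs at least two runs.
            "stdev_duration_s": statistics.stdev(durations) if len(durations) > 1 else 0.0,
            "mean_tokens": statistics.mean(tokens),
            "mean_failed_cycles": statistics.mean(cycles),
        }
    return summary


summary = summarize(SAMPLE_CSV)
for language, stats in summary.items():
    print(language, stats)
```

Reporting the spread next to the mean matters with only five runs per combination: a language can look fine on average while individual runs vary a lot.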

The surprise winner was TypeScript, not the type-system darlings

Jo admits he was hoping strongly typed languages like F# and Rust would crush this. Instead, TypeScript came out looking best on speed, consistency, and token efficiency, with JavaScript also performing well but without the extra safety net of types. F# didn’t win at all, which he blames partly on the compiler being “pretty darn slow,” and Python and Rust, despite all the hype, were not especially impressive on duration or token consumption.

One feedback loop matters a lot, and the real lesson is: test your own context

One of the more practical observations is that most languages converged in roughly two to three build/test iterations, which lines up with prior benchmark findings that a single round of feedback gives agents a huge boost over pass@1. Jo is very explicit that this is a tiny experiment (two problems, five runs per problem-language combination, about $50 of inference), not some definitive industry benchmark. But for his own backend-heavy work, the result is strong enough that he plans to steer coding agents toward TypeScript, while telling viewers to spend a few bucks and run their own benchmarks instead of trusting the grapevine.
