Jo Van Eyck · May 10, 2026 · 16m

What programming language do AI coding agents prefer?

TL;DR

Most existing language benchmarks are measuring the wrong thing for modern coding agents — Jo Van Eyck argues that pass@1 tests like McEval (June 2024, 40 languages) and AutoCodeBench (August 2025, ~20 languages) ignore the compile-test-fix loop that real agents use.
Older multilingual results would have pushed you toward odd winners like Haskell or Elixir — McEval showed GPT-4 Turbo hitting 90% on Haskell and 83% on Rust, while AutoCodeBench made Elixir look especially strong, but both were still one-shot benchmarks.
Terminal-Bench gets closer to reality but still doesn’t answer the language question — its Docker-based agent setup evaluates model-plus-harness systems like Claude Code or Codex iterating on real tasks, yet it doesn’t break performance down by programming language.
In Jo’s own small benchmark, TypeScript was the most agent-friendly language overall — across two Exercism-style tasks, five runs per language, and metrics like duration, token usage, and number of build/test iterations, TypeScript consistently came out on top.
Strong typing didn’t deliver the expected advantage — Jo was rooting for F# and Rust, but in his setup they didn’t dominate; F# in particular suffered from high variance and a slow compiler, while Python and Rust were not especially efficient on time or tokens.
The practical takeaway is to benchmark your own stack, not trust internet folklore — Jo spent about $50 on inference using GitHub Copilot, and says even a cheap, imperfect benchmark can tell you more about your context than generic claims about Python or Rust.

Summary

The question sounds simple, but the evidence is a mess

Jo opens with a very current AI-dev anxiety: does your programming language actually matter when you’re working with coding agents? He keeps hearing the usual contradictory advice (use whatever the frontier labs are dogfooding, follow the RL training stack, trust the rumors) and decides to stop guessing and benchmark it himself.

McEval: multilingual, interesting, and already outdated

The first stop is McEval, a June 2024 benchmark covering 40 languages. Jo likes that it’s multilingual, but points out that its headline metric, pass@1, just means the model gets one shot to generate code and then the benchmark checks whether it compiles and passes tests, which is nothing like how today’s agents actually work. Still, the old results are funny in hindsight: GPT-4 Turbo looked amazing, hitting 90% on Haskell and 83% on Rust, and Jo jokes that if you trusted this alone, “Haskell is the bee’s knees.”
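
To make the metric concrete, here is a minimal sketch of how a pass@1 score could be computed for one language. The function name and the ten outcomes are made up for illustration; they are not McEval's code or data.

```python
def pass_at_1(results: list[bool]) -> float:
    """Fraction of problems whose single generated solution passed its tests."""
    if not results:
        raise ValueError("need at least one problem result")
    return sum(results) / len(results)


# Hypothetical outcomes for ten problems in one language (illustrative only):
# True means the one-shot solution compiled and passed, False means it did not.
outcomes = [True, True, False, True, True, True, False, True, True, True]

print(f"pass@1 = {pass_at_1(outcomes):.0%}")  # prints "pass@1 = 80%"
```

Note what the score leaves out: a failed problem counts as a failure even if one look at the compiler error would have let the model fix it.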

AutoCodeBench: more recent, same basic flaw

Then he looks at Tencent Research’s AutoCodeBench from August 2025, which feels more relevant because it includes newer models in the Opus 4 era and around 20 languages. The results again produce some spicy language takes — C# around 75%, Elixir around 82% — enough that Jo says he’d be writing Elixir right now if he took it at face value. But once more, the benchmark is still based on one-shot generation, so it misses the whole agentic feedback loop.

Terminal-Bench gets the workflow right, but not the language split

Jo gets excited when he finds Terminal-Bench from January 2026 because it finally moves beyond pass@1. It runs agents in Docker containers with small codebases and tasks, and evaluates the full model-plus-harness setup — things like Claude with Claude Code or GPT with Codex — iterating until the job is done. The catch: it ranks harness combinations, not which programming languages are easiest for agents, so it still doesn’t answer the question he actually cares about.
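
The difference between one-shot generation and an agentic loop can be sketched in a few lines. This is a toy illustration under stated assumptions, not how Terminal-Bench or any real harness is implemented: `agent_loop`, `fake_generate`, and `check` are hypothetical names, and the "model" is a stub that fixes its bug once it sees feedback.

```python
from typing import Callable, Optional


def agent_loop(
    generate: Callable[[Optional[str]], str],
    check: Callable[[str], Optional[str]],
    max_iterations: int = 5,
) -> tuple[bool, int]:
    """Generate code, run the checks, and feed failures back until they pass."""
    feedback: Optional[str] = None
    for iteration in range(1, max_iterations + 1):
        code = generate(feedback)
        feedback = check(code)  # None means the build and tests passed
        if feedback is None:
            return True, iteration
    return False, max_iterations


def fake_generate(feedback: Optional[str]) -> str:
    # Stand-in for a model call: the first attempt has a bug,
    # and any later attempt (made after seeing feedback) fixes it.
    if feedback is None:
        return "def add(a, b):\n    return a - b\n"
    return "def add(a, b):\n    return a + b\n"


def check(code: str) -> Optional[str]:
    # Stand-in for "build and run the tests": returns an error message or None.
    namespace: dict = {}
    try:
        exec(code, namespace)
        assert namespace["add"](2, 3) == 5, "add(2, 3) should be 5"
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"
    return None


one_shot = agent_loop(fake_generate, check, max_iterations=1)
with_feedback = agent_loop(fake_generate, check)
print(one_shot)       # prints (False, 1)
print(with_feedback)  # prints (True, 2)
```

With `max_iterations=1` the loop collapses to a pass@1-style evaluation, so the same stub "model" fails one-shot and succeeds with a single round of feedback.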

So he built LangComp himself on a small YouTuber budget

At that point he says, basically, fine — “we have to write a damn bench ourselves.” His benchmark, LangComp, uses two Exercism-style problems — Game of Life and Gilded Rose — across several languages he either likes or expected to do well: C#, F#, Elixir, Rust, Python, JavaScript, and TypeScript. He ran five attempts per problem-language combo using GitHub Copilot because he had “infinite credits,” then logged one big CSV of duration, token usage, and number of failed build/test cycles.
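
A log like that is easy to summarize per language. The sketch below assumes a CSV layout that the source does not specify: the column names (`language`, `problem`, `run`, `duration_s`, `tokens`, `failed_cycles`) and the sample rows are placeholders, not LangComp's actual schema or results.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Placeholder rows with made-up numbers; "lang_a" and "lang_b" are not real results.
SAMPLE_CSV = """language,problem,run,duration_s,tokens,failed_cycles
lang_a,game_of_life,1,60,12000,1
lang_a,game_of_life,2,80,14000,2
lang_b,game_of_life,1,120,20000,3
lang_b,game_of_life,2,100,22000,2
"""


def summarize(csv_text: str) -> dict[str, dict[str, float]]:
    """Group runs by language and average each logged metric."""
    runs_by_language: dict[str, list[dict[str, str]]] = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        runs_by_language[row["language"]].append(row)

    summary = {}
    for language, runs in runs_by_language.items():
        summary[language] = {
            "runs": len(runs),
            "mean_duration_s": mean(float(r["duration_s"]) for r in runs),
            "mean_tokens": mean(int(r["tokens"]) for r in runs),
            "mean_failed_cycles": mean(int(r["failed_cycles"]) for r in runs),
        }
    return summary


summary = summarize(SAMPLE_CSV)
for language, stats in summary.items():
    print(language, stats)
```

To run it on a real log, read the file's text and pass it to `summarize`, adjusting the column names to match whatever the benchmark script actually writes.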

The surprise winner was TypeScript, not the type-system darlings

Jo admits he was hoping strongly typed languages like F# and Rust would crush this. Instead, TypeScript came out looking best on speed, consistency, and token efficiency, with JavaScript also performing well but without the extra safety net of types. F# didn’t win at all, which he partly blames on the compiler being “pretty darn slow,” and Python and Rust, despite all the hype, were not especially impressive on duration or token consumption.
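
With only five runs per combination, consistency is worth measuring alongside the average. One simple option, which is an assumption here and not necessarily the measure Jo used, is the coefficient of variation (standard deviation divided by mean). The durations below are invented to show the contrast.

```python
from statistics import mean, stdev


def coefficient_of_variation(samples: list[float]) -> float:
    """Relative spread of the runs: sample standard deviation over the mean."""
    if len(samples) < 2:
        raise ValueError("need at least two runs to estimate spread")
    return stdev(samples) / mean(samples)


# Made-up durations in seconds for five runs each; not measured results.
durations_s = {
    "steady_lang": [50, 52, 48, 51, 49],
    "erratic_lang": [30, 90, 45, 120, 40],
}

for language, runs in durations_s.items():
    cv = coefficient_of_variation(runs)
    print(f"{language}: mean={mean(runs):.0f}s cv={cv:.2f}")
```

A language with a decent average but a large spread is harder to plan around than one that is slightly slower and predictable, which is why high variance counts against a language in a comparison like this one.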

One feedback loop matters a lot, and the real lesson is: test your own context

One of the more practical observations is that most languages converged in roughly two to three build/test iterations, which lines up with prior benchmark findings that a single round of feedback gives agents a huge boost over pass@1. Jo is very explicit that this is a tiny experiment (two problems, five runs per problem-language combination, about $50 of inference), not some definitive industry benchmark. But for his own backend-heavy work, the result is strong enough that he plans to steer coding agents toward TypeScript, while telling viewers to spend a few bucks and run their own benchmarks instead of trusting the grapevine.
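
For a sense of scale, the experiment's size and rough cost per run can be worked out from the figures above. This back-of-the-envelope sketch assumes the seven languages listed earlier are the full set and that the whole of the roughly $50 went to these runs; neither is confirmed by the source.

```python
# Figures from the write-up; the "all spend went to these runs" part is an assumption.
languages = ["C#", "F#", "Elixir", "Rust", "Python", "JavaScript", "TypeScript"]
problems = ["Game of Life", "Gilded Rose"]
runs_per_combination = 5
approx_spend_usd = 50.0

total_runs = len(languages) * len(problems) * runs_per_combination
cost_per_run = approx_spend_usd / total_runs

print(f"{total_runs} runs, about ${cost_per_run:.2f} per run")
# prints "70 runs, about $0.71 per run"
```

Under those assumptions, repeating the same exercise for your own stack is a matter of tens of dollars, which is the point of the closing advice.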
