AI Engineer · May 31, 2026 · 15m

Can LLMs generate Enterprise Quality Code? — Prasenjit Sarkar, Sonar

TL;DR

Benchmark wins do not equal enterprise quality: Sarkar says pass rates from HumanEval, MBPP, and SWE-bench measure functional correctness, but miss maintainability, security, architectural fit, and tech debt.
Sonar's dataset is big enough to expose real tradeoffs: The team evaluated 53-plus model variants on 4,444 distinct Java assignments using SonarQube Enterprise to track bugs, vulnerabilities, cyclomatic complexity, cognitive complexity, and lines of code.
High-scoring models can still be wildly verbose: GPT-5.4 and GPT-5.4 Pro High generated about 1.2 million lines of code for the 4,444 assignments, while older models like GPT-4.0 stayed under 250,000.
Security risk varies sharply across models: Gemini 3.1 Pro High led on SWE-bench pass rate at 84.17%, while Claude Sonnet 4.6 showed the highest security issue density in Sarkar's example, at about 300 issues per million lines of code.
LLM code quality problems come from both training data and model behavior: Sarkar points to mixed-quality open source training corpora, built-in security flaws, hidden logic bugs, limited organizational context, and the probabilistic nature of generation.
Sonar's answer is an agent-centric loop called ACDC: The Guide, Verify, Solve workflow uses context augmentation, pre-commit analysis in 1 to 5 seconds, and a remediation agent that fixes issues, recompiles, re-analyzes, and discards changes that would cause regressions.
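
The remediation step in the last item (fix, recompile, re-analyze, discard on regression) can be sketched roughly as follows. This is a minimal illustration, not Sonar's implementation: `Analysis`, `remediate`, `toy_analyze`, and `toy_fix` are hypothetical names, and the "analyzer" is a toy string matcher standing in for a real compile-and-analyze step.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class Analysis:
    compiles: bool
    issues: frozenset  # identifiers of the issues the analyzer reported


def remediate(
    code: str,
    analyze: Callable[[str], Analysis],
    propose_fix: Callable[[str, str], str],
    max_rounds: int = 5,
) -> Tuple[str, Analysis]:
    """Apply one fix per round; keep a candidate only if it does not regress."""
    current = analyze(code)
    for _ in range(max_rounds):
        if not current.issues:
            break
        progressed = False
        for issue in sorted(current.issues):
            candidate = propose_fix(code, issue)
            result = analyze(candidate)  # stands in for recompile + re-analyze
            introduced = result.issues - current.issues
            if not result.compiles or introduced or issue in result.issues:
                continue  # discard: broken build, new issue, or nothing fixed
            code, current = candidate, result
            progressed = True
            break
        if not progressed:
            break  # no candidate was safe to keep
    return code, current


# Toy stand-ins (assumptions for illustration only).
def toy_analyze(code: str) -> Analysis:
    markers = ("eval(", "md5(", "password=")
    found = frozenset(m for m in markers if m in code)
    return Analysis(compiles="<<broken>>" not in code, issues=found)


def toy_fix(code: str, issue: str) -> str:
    replacements = {
        "eval(": "parse(",             # clean fix
        "md5(": "sha256(password=",    # "fix" that introduces a new issue
    }
    return code.replace(issue, replacements.get(issue, issue))


fixed, remaining = remediate("x = eval(s); h = md5(data)", toy_analyze, toy_fix)
print(fixed)                     # x = parse(s); h = md5(data)
print(sorted(remaining.issues))  # ['md5(']
```

In this toy run the first fix is kept, while the second is discarded because it would trade one issue for another, which is the regression guard the summary describes.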

The Breakdown

Sonar ran 53-plus models across 4,444 Java assignments and found a gap the usual coding benchmarks miss: top models can pass tests while still producing huge amounts of verbose, bug-prone, and security-risky code. Prasenjit Sarkar argues enterprise-ready AI coding needs a second layer of evaluation and a workflow that guides, verifies, and fixes agent-generated code before it lands in production.
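
To make the volume and density figures from the TL;DR concrete, the arithmetic behind them can be sketched as below. The line totals and the assignment count come from the summary above. The issue count of 360 is a made-up input, chosen only to show how a density of about 300 issues per million lines would arise, and the function names are illustrative.

```python
ASSIGNMENTS = 4_444  # Java assignments in the evaluation, per the summary


def lines_per_assignment(total_lines: int, assignments: int = ASSIGNMENTS) -> float:
    """Average generated lines of code per assignment."""
    return total_lines / assignments


def issues_per_million_lines(issue_count: int, total_lines: int) -> float:
    """Normalize an issue count by code volume so models can be compared."""
    if total_lines <= 0:
        raise ValueError("total_lines must be positive")
    return issue_count * 1_000_000 / total_lines


# About 1.2 million lines for the most verbose models in the summary.
verbose = lines_per_assignment(1_200_000)
# "Under 250,000" lines for older models, so this is an upper bound.
compact = lines_per_assignment(250_000)
# Hypothetical issue count, for illustration only.
density = issues_per_million_lines(360, 1_200_000)

print(round(verbose))     # 270
print(round(compact, 1))  # 56.3
print(density)            # 300.0
```

Normalizing by lines of code keeps a verbose model from looking worse simply because it wrote more, but it also means a low density can still hide a large absolute number of issues when the output is very long.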