AI Engineer · 15m

Can LLMs generate Enterprise Quality Code? — Prasenjit Sarkar, Sonar

TL;DR

  • Benchmark wins do not equal enterprise quality: Sarkar says pass rates from HumanEval, MBPP, and SWE-bench measure functional correctness, but miss maintainability, security, architectural fit, and tech debt.

  • Sonar's dataset is big enough to expose real tradeoffs: The team evaluated 53-plus model variants on 4,444 distinct Java assignments using SonarQube Enterprise to track bugs, vulnerabilities, cyclomatic complexity, cognitive complexity, and lines of code.

  • High-scoring models can still be wildly verbose: GPT-5.4 and GPT-5.4 Pro High generated about 1.2 million lines of code for the 4,444 assignments, while older models like GPT-4.0 stayed under 250,000.

  • Security risk varies sharply across models, and a high pass rate does not rule it out: Gemini 3.1 Pro High led on SWE-bench pass rate at 84.17%, while Claude Sonnet 4.6 showed the highest security issue density in Sarkar's example, at about 300 issues per million lines of code.

  • LLM code quality problems come from both training data and model behavior: Sarkar points to mixed-quality open source training corpora, built-in security flaws, hidden logic bugs, limited organizational context, and the probabilistic nature of generation.

  • Sonar's answer is an agent-centric loop called ACDC: The Guide, Verify, Solve workflow uses context augmentation, pre-commit analysis in 1 to 5 seconds, and a remediation agent that fixes issues, recompiles, re-analyzes, and discards changes that would cause regressions.
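
The volume and density figures above are easier to compare once they are normalized. The following is a minimal sketch, not Sonar's tooling: the function names are hypothetical, the line totals and assignment count come from the digest, and the issue count in the last call is invented purely to show how a per-million-lines density is computed.

```python
ASSIGNMENTS = 4_444  # number of Java assignments cited in the digest


def lines_per_assignment(total_lines: int, assignments: int = ASSIGNMENTS) -> float:
    """Average generated lines of code per assignment."""
    if assignments <= 0:
        raise ValueError("assignments must be positive")
    return total_lines / assignments


def issues_per_million_loc(issue_count: int, total_lines: int) -> float:
    """Issue density normalized to one million lines of code."""
    if total_lines <= 0:
        raise ValueError("total_lines must be positive")
    return issue_count * 1_000_000 / total_lines


if __name__ == "__main__":
    # Line totals quoted in the digest.
    print(round(lines_per_assignment(1_200_000)))  # roughly 270 lines per assignment
    print(round(lines_per_assignment(250_000)))    # roughly 56 lines per assignment

    # Hypothetical counts (not from the talk): 150 issues in 500,000 lines.
    print(issues_per_million_loc(150, 500_000))
```

Normalizing by lines of code matters because a verbose model can rack up more total issues simply by writing more code, so raw counts and densities can rank models differently.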

The Breakdown

Sonar ran 53-plus models across 4,444 Java assignments and found a gap the usual coding benchmarks miss: top models can pass tests while still producing huge amounts of verbose, bug-prone, and security-risky code. Prasenjit Sarkar argues enterprise-ready AI coding needs a second layer of evaluation and a workflow that guides, verifies, and fixes agent-generated code before it lands in production.
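
The verify-and-fix part of that workflow can be illustrated with a toy loop. This is a sketch under loud assumptions, not Sonar's implementation: `analyze`, `compiles`, `remediate`, and the three fixers are hypothetical names, the "analyzer" is two hand-written pattern checks standing in for real static analysis, Python's built-in `compile()` stands in for the recompile step, and the Guide step (context augmentation) is left out entirely.

```python
import re
from typing import Callable, List, Set, Tuple

Fixer = Callable[[str], str]


def analyze(source: str) -> Set[str]:
    """Toy stand-in for static analysis: returns the set of rule names violated."""
    issues: Set[str] = set()
    for line in source.splitlines():
        if line.strip() == "except:":
            issues.add("bare-except")
        if re.search(r"(?<![\w.])eval\(", line):
            issues.add("eval-call")
    return issues


def compiles(source: str) -> bool:
    """Stand-in for the recompile step: checks syntax without running the code."""
    try:
        compile(source, "<candidate>", "exec")
        return True
    except SyntaxError:
        return False


def remediate(source: str, fixers: List[Fixer]) -> Tuple[str, List[str]]:
    """Try each candidate fix; keep it only if it compiles, adds no new issue,
    and leaves fewer issues than before."""
    log: List[str] = []
    current = source
    for fixer in fixers:
        before = analyze(current)
        candidate = fixer(current)
        if not compiles(candidate):
            log.append("discarded: does not compile")
            continue
        after = analyze(candidate)
        if after - before:  # a rule fires now that did not fire before
            log.append("discarded: regression")
            continue
        if len(after) < len(before):
            current = candidate
            log.append("accepted")
        else:
            log.append("discarded: no improvement")
    return current, log


# Hypothetical candidate fixes, two of them deliberately bad.
def break_syntax(src: str) -> str:
    return src.replace("except:", "except")


def add_regression(src: str) -> str:
    return src.replace("except:", "except Exception:") + "result = eval('1 + 1')\n"


def fix_bare_except(src: str) -> str:
    return src.replace("except:", "except Exception:")


SAMPLE = "try:\n    value = int('42')\nexcept:\n    value = 0\n"

if __name__ == "__main__":
    fixed, log = remediate(SAMPLE, [break_syntax, add_regression, fix_bare_except])
    print(log)
    print(fixed)
```

The design point is the acceptance rule: a candidate fix is judged by re-running the same checks that flagged the original code, so a change that removes one issue while introducing another is thrown away rather than merged.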
