Fools & AI Scores

April 28, 2026

Dashboards were never supposed to become the job; Goodhart’s Law gets nastier as AI spreads. Goodhart’s Law says: “When a measure becomes a target, it ceases to be a good measure.” The moment you tie rewards, status, or punishment to a specific metric, people start optimizing for the metric itself rather than the underlying reality it was supposed to represent. Over time, the metric gets gamed, reality gets distorted, and the number stops telling you the truth.

AI is dangerously susceptible to this trap. Goodhart’s Law becomes more vicious when AI is involved because the same tool that generates the work can also grade it and defend it. It can produce numbers that feel right but aren’t telling the truth.

The modern manager’s fantasy is that messy work can be cleaned up with a score. Score the teacher. Score the banker. Score the model. Score the developer. Tie money and status to the number. Then call it discipline.

On the surface, it makes sense. It always has. The CFPB (Consumer Financial Protection Bureau) reports that Wells Fargo’s sales targets and compensation incentives helped drive employees to open more than 2 million accounts that may not have been authorized by customers. Education researchers found the same pattern in schools: attach high stakes to test scores and both educators and the scores themselves get corrupted.

Economists have modeled the same failure in more formal terms. In jobs with several tasks, strong incentives on the easy-to-measure task pull effort away from the hard-to-measure ones. Reward the visible task hard enough and people rationally neglect quality, maintenance, mentoring, judgment, or restraint.

AI makes matters worse. Software developers have watched this happen since AI tooling started generating leaderboards and commit dashboards, as if more code indicated more high-quality work. Most developers recognize the trap, because they know good work takes time. Trouble starts when managers read these metrics as direct measures of output. GitHub’s own docs quietly advise teams to combine dashboard trends with surveys or retrospectives to get a full picture of productivity.

Take the dumb version first: rewarding developers for AI token spend. The moment you do that, you are paying for longer prompts, bigger context windows, more agent calls, and a higher bill.

Even OpenAI pointed out in 2025 that standard evals reward guessing over admitting uncertainty. On many scoreboards, a model that takes a wild guess can beat a model that honestly says “I don’t know,” because abstention gets 0 while lucky guesses sometimes score as correct. That is Goodhart’s Law inside the eval loop. We optimize for the score, then act surprised when the model optimizes for looking right instead of being right.
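The incentive is easy to see in miniature. Here is a small simulation, under assumed numbers of my own (a model with 20% accuracy, 10,000 binary-graded questions), of why a guesser beats an honest abstainer on a scoreboard where “I don’t know” earns nothing:

```python
import random

random.seed(0)


def eval_score(p_correct: float, abstain: bool, n_questions: int = 10_000) -> float:
    """Simulate a binary-graded eval: 1 point for a correct answer, 0 otherwise.

    Abstaining always scores 0, so any nonzero guessing accuracy
    beats honesty under this grading scheme.
    """
    score = 0
    for _ in range(n_questions):
        if abstain:
            continue  # honest "I don't know" earns nothing
        if random.random() < p_correct:
            score += 1  # a lucky guess is graded as correct
    return score / n_questions


# A model guessing at only 20% accuracy still outscores one that abstains.
guesser = eval_score(p_correct=0.2, abstain=False)
honest = eval_score(p_correct=0.2, abstain=True)
```

The grading rule, not the model, creates the behavior: as long as wrong answers and abstentions cost the same, guessing strictly dominates.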

A 2025 NAACL paper found that LLM evaluations often ignore non-determinism, that rankings can change across decoding settings, and that best-of-N sampling can radically alter benchmark results. The paper points to a case where selecting the best answer from 256 random generations let Llama-2-7B hit 97.7% on GSM8K, ahead of GPT-4 on that setup.
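The best-of-N effect is just probability arithmetic. A sketch, using a hypothetical single-sample accuracy of 15% (not the paper’s measured figure), shows how an oracle that picks the best of N independent samples nearly saturates the benchmark:

```python
def best_of_n_ceiling(p_single: float, n: int) -> float:
    """Probability that at least one of n independent samples is correct:
    1 - (1 - p)^n. This is the oracle best-of-N score, assuming a perfect
    verifier selects the correct sample whenever one exists."""
    return 1 - (1 - p_single) ** n


single = best_of_n_ceiling(0.15, 1)     # one sample: 15%
boosted = best_of_n_ceiling(0.15, 256)  # best of 256: effectively 100%
```

Same model, same weights, radically different leaderboard position, which is why a benchmark number without its decoding setup is close to meaningless.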

A separate 2026 paper on LLM-as-a-judge says the central question is still unresolved: can we trust the judgments of LLM judges, given prompt sensitivity and limited alignment with human raters? Clean scores can hide very dirty variance.

And AI has one more advantage over old-fashioned bad metrics: it can argue. Even an older model like GPT-4, given access to participant data, was more persuasive than human opponents in multi-round debates, winning 64.4% of the time in cases where AI and humans were not equally persuasive. Another study found that deceptive AI-generated explanations were more persuasive than honest ones and could amplify belief in false headlines.

This is why spreadsheet culture becomes dangerous in AI-heavy companies. Numbers do useful work. They compress reality. They surface anomalies. They let you compare periods. Trouble starts when they replace contact with the underlying processes they summarize. A spreadsheet is a map. Management gets lost when it starts treating the map as the territory.

The question is never “Should we measure?” Of course we should. The question is “What disappears when we reward this number?” Reward the appearance of certainty and the honest answer, “I’m not sure,” begins to die.

The real danger is not that people love numbers. The real danger is that numbers end arguments, and AI can now manufacture numbers, explanations, and confidence all at once.