Build an Ungameable Eval

A reference dossier for the AI buyer, head of platform, or applied-AI lead about to sign a contract with an AI vendor (model API, agent platform, copilot, support bot, coding assistant): the eight components of an eval set that decides procurement, the tooling that supports each one, and the line between a private eval and the AI vendor's marketing.

TL;DR

The decision isn't which AI vendor scored highest on a benchmark or which one demoed best on the team's sample prompts. That decision sits in the AI vendor's sales motion. The decision here is whether the team owns an ungameable eval, meaning one the AI vendor can't see, can't train against, can't score, and can't wait out. Ungameable isn't a claim about vendor honesty; it's a structural property of the eval itself. A vendor that wants to play straight is still bound by what its model trained on, what its judge family knows, and what's changed since the team last refreshed the set.

Evaluation is procurement. An AI vendor that wrote the eval sold the team the answer; a public benchmark is theater the moment it's famous enough to win on; a private eval built once and left unchanged is a museum piece by the second model release. AI vendor benchmarks are adversarial evidence the same way a pitch deck is: useful for forming hypotheses, never enough for a decision.

Eight components make an eval ungameable. They cover dataset, scoring, taxonomy, cadence, baselines, reporting, adversarial coverage, and the procurement lock that ties eval results to contract clauses. Skip any of the eight and the AI vendor gets to grade itself on the one that's missing.

A starting tool stack pairs one eval-of-record platform with a specialist judge layer and an adversarial layer the buyer controls:

  • Braintrust or LangSmith for dataset management, scoring orchestration, and regression tracking
  • Inspect AI for high-sensitivity adversarial and agentic evaluation in sandboxed environments
  • Ragas for retrieval-grounded answer evaluation when the system reads from a corpus
  • Label Studio or Argilla for human labeling and inter-annotator agreement
  • Garak or Promptfoo red-team for prompt-injection and jailbreak coverage

Tooling cost runs from free (Inspect AI, Promptfoo CLI, Helicone Hobby) to $39 per seat per month (LangSmith Plus) and $249 per month (Braintrust Pro), with enterprise plans for hosting on the buyer's own infrastructure when sensitivity requires it. The number worth tracking is the cost of a procurement decision made against a gameable eval.

Buying the eval is the team's job, not the AI vendor's. A platform can store the dataset and run the scorers; it can't decide which failures break the deal.

Three AI Vendors That Sold the Same Eval

A 40-person product team ran a side-by-side bake-off across three AI vendors using a public benchmark suite the team had quietly assembled from open datasets. Every AI vendor scored within two points of the others on the headline number, and the team picked the cheapest. Six months later the chosen AI vendor's tone, refusal posture, and citation quality were visibly different from a competitor the team had ruled out, and the rollback cost a quarter of platform work. The benchmark didn't lie; it just answered a different question. OpenAI's GPT-4 Technical Report describes about 25 percent overlap between GPT-4 pretraining data and HumanEval, with a roughly 2.12 percentage-point degradation after removing contaminated examples. Once a public benchmark is famous enough to win on, it's also famous enough to leak into training data and prompt-engineering folklore. The failure moment wasn't a hallucination; it was the line item that said "evaluated against industry-standard benchmarks." Call it the public-benchmark plateau.

A 200-person SaaS company built an internal eval and ran it across three AI vendors competing for a support-bot replacement. Vendor A scored 87, Vendor B scored 82, Vendor C scored 76. Procurement picked A and started rollout. The Spanish-speaking accounts started complaining inside the first month: tone violations, missed escalations on refund requests, three reported cases of the bot inventing a policy clause that didn't exist. The team rebuilt the eval with a stratified slice: 30 percent English, 25 percent Spanish, 20 percent code-mixed, 15 percent transcripts with audio-to-text errors, 10 percent prompt-injection cases. On the stratified slice, A scored 71, B scored 79, and C scored 81. The team had signed for the wrong AI vendor because the original eval averaged the only failures that mattered into a number that looked fine. Notion's eval program, built on Braintrust with about 70 engineers aligned on evaluation workflows, names the same pattern: the team catches multilingual behavior issues only after building targeted failure datasets for APAC use cases. Call it the missing slice.

A 30-engineer infrastructure team locked in an AI vendor in late 2025 after a clean pilot, scored against an internal eval set the team was proud of. The eval ran twice during procurement and then went into the wiki. Over the next six months the AI vendor shipped two model updates, the team rewrote the system prompt, and the retrieval corpus grew by 40 percent. By spring 2026 the support team's escalation rate was climbing and product was getting customer-success tickets the bot used to handle. Nobody re-ran the eval until a board-prep review. The numbers had dropped 14 points across every category, and the regression had built over months without a single alert. Call it the stale snapshot.

Three teams, three failures, one category mistake. They ran an eval, declared it done, and let the AI vendor own the next move. None of those failures needed a cheating AI vendor; the benchmark leaked, the slice was wrong, and time did the rest.

The lifecycle anchor is the Promptfoo acquisition. OpenAI announced in 2026 that it would acquire Promptfoo, one of the most-used eval and red-team platforms in the buyer market. The technology may be better off inside OpenAI; the buyers who used Promptfoo as their independent governance layer against OpenAI now own a conflict to write down. Choosing an eval platform without thinking about the AI vendor map is its own failure mode.

Figure 1: Which anti-gaming axis each of the eight components blocks. No single component covers all four; the eight together close the surface.

Action Plan

Days 1 to 7: Write the failure taxonomy before you write the eval. Pull a week of production interactions and tag each one against a draft taxonomy of failure classes for the task. For a support bot, that's wrong policy, invented refund, missed escalation, privacy breach, tone violation, over-refusal, format breakage. For RAG, that's unsupported answer, missing citation, wrong citation, retrieval miss, hallucinated source. Assign severity weights with the function owner, and name the zero-tolerance classes before any AI vendor sees the test. Resist building the dataset this week. The week-one deliverable is a one-page taxonomy with severity weights and named zero-tolerance failures, signed off by the team that owns the budget for failure cleanup.

Days 8 to 14: Build the dataset and the splits. Pull 200 to 500 examples from production, stratified by task class, customer segment, language, data source, and known edge case. Tag each example with task class, risk category, expected answer, severity weight, and whether it's evergreen or drift-sensitive. Split three ways: 40 percent shareable development examples the vendor may see, 25 percent internal validation for team iteration, 35 percent held-out acceptance the AI vendor never sees. Add a smoke set of 25 cases for change-by-change checks. The Friday deliverable is a dataset version 1 in the eval platform, tagged and split, with the held-out acceptance set stored in a project the AI vendor's account can't touch.
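
A minimal Python sketch of the three-way split, assuming each example is a dict and that one field (here the hypothetical `language`) carries the stratum. The 40/25/35 ratios come from the plan above; the function name, field names, and seed handling are illustrative and not any platform's API.

```python
import random
from collections import defaultdict

# Ratios from the plan above; the acceptance share is whatever remains.
DEV_SHARE = 0.40
VAL_SHARE = 0.25


def stratified_split(examples, stratum_key, seed=0):
    """Split examples into development / validation / acceptance per stratum.

    `examples` is a list of dicts; `stratum_key` is a hypothetical field name
    such as "language" or "task_class".
    """
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for ex in examples:
        by_stratum[ex[stratum_key]].append(ex)

    out = {"development": [], "validation": [], "acceptance": []}
    for key in sorted(by_stratum):
        items = by_stratum[key]
        rng.shuffle(items)
        dev_end = round(len(items) * DEV_SHARE)
        val_end = dev_end + round(len(items) * VAL_SHARE)
        out["development"].extend(items[:dev_end])
        out["validation"].extend(items[dev_end:val_end])
        # Whatever is left in the stratum stays held out.
        out["acceptance"].extend(items[val_end:])
    return out
```

Splitting inside each stratum keeps a small segment from landing entirely in the slice the vendor can see; very small strata fall through to the acceptance set by construction. Move the `acceptance` list into restricted storage before anything is shared.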

Days 15 to 30: Run the bake-off and write the contract. Score the current system, the candidate vendor, and one credible alternative against the held-out set. Calibrate every LLM judge against a human-labeled sample before allowing it into the acceptance run. Report per-category, per-severity, per-example. Map the results into contract schedules: pass criteria, regression definition, cure period, rollback rights, exit terms. Don't sign the contract until the eval results and the contract clauses are in the same review meeting. If the AI vendor refuses private evaluation or refuses to tie regression to remediation, the AI vendor is failing procurement.

The Eight Components and the Rules That Make Each Ungameable

Each sub-case below follows the same template: the component, what ships clean, the ceiling, the rule that makes it ungameable, and the Friday action line. A short table of production examples closes each one.

Dataset Construction and Privacy Isolation

The work is choosing the examples that decide procurement: where they come from, how they're stratified, how PII is handled, and how the held-out set stays out of the AI vendor's training and optimization loop.

The slot belongs to a buyer-controlled platform with self-hosting available for sensitive workloads. Braintrust's self-hosted setup keeps logs, datasets, prompts, model outputs, human review scores, and judge keys in the buyer's own cloud account (AWS, GCP, or Azure). LangSmith Enterprise offers hybrid and self-host options so data stays inside the buyer's VPC. Weights & Biases Weave runs on Dedicated Cloud or Self-Managed for residency and isolation. Inspect AI is open-source and runs locally, which makes it the default for the highest-sensitivity slices.

What ships clean:

  • Examples drawn from real production inputs, stratified by task class, failure risk, segment, language, and known edge case
  • Three splits with hard boundaries: development the vendor sees, validation the team uses, acceptance the AI vendor never sees
  • Metadata per example: source date, product surface, expected answer, allowed tools, risk category, severity weight, drift-sensitivity flag
  • A continuous pipeline that converts production failures into new held-out cases on a weekly cadence

The ceiling appears at synthetic-only or public-only datasets. Synthetic examples cover the edges the production set won't reach but can't replace it as procurement evidence, and public benchmarks are scouting tools that have leaked into too many training corpora to decide on. Once a public set is famous enough to win on, it's already famous enough to have been trained against. The named failure mode is the public-benchmark plateau.

The eval is ungameable when the AI vendor sees only the bounded development slice, the acceptance set stays in storage the AI vendor's account can't reach, and weekly production failures continuously refresh the hidden pool.

If you start this week, pull 200 production examples, tag each with task class and risk, store the held-out 100 in a project the AI vendor's seat can't access, and write a one-paragraph privacy posture into the procurement file by Friday.

Examples of what this looks like in production:

Use case | Dataset shape | Stack
B2B SaaS support bot | 200 production tickets, 100 held-out, weekly drift refresh | Braintrust self-hosted
Regulated healthcare workflow | 500 redacted production examples, air-gapped acceptance set | Inspect AI on buyer infrastructure
Multilingual consumer support | 1,000 stratified tickets across five languages with native-speaker review | LangSmith Enterprise

Scoring Methodology and Judge Architecture

The work is deciding how each output is judged: rule-based checks, code checks, human review, LLM-as-judge, pairwise comparison, mixed scoring.

The slot belongs to a layered architecture. Deterministic rules check anything a machine can verify: schema, refusal policy, citation count, latency, and cost. LLM judges, drawn from a different vendor family than the candidate, handle the dimensions that scale beyond human review, once the team has calibrated them against human-labeled examples. Human reviewers handle the high-stakes calls: tone, legal correctness, and ambiguous factuality. Braintrust, LangSmith, Promptfoo, Weave, and Inspect AI all support layered scoring. The structural choice isn't the platform; it's the judge map.

What ships clean:

  • Deterministic checks for every machine-checkable property (valid JSON, required fields, citation presence, latency, cost)
  • LLM judges drawn from a vendor family that isn't the model under test, with documented calibration against human-labeled samples
  • Human review on a 25 percent double-labeled slice for inter-annotator agreement
  • Pairwise comparison where the question is relative preference, not absolute correctness
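
The deterministic layer in the list above can be as plain as a function that returns one boolean per check. The sketch below assumes, hypothetically, that the system returns a JSON object with a `citations` list; the field names and the latency budget are placeholders to adapt to the real output schema.

```python
import json


def deterministic_checks(raw_output, required_fields, min_citations=1,
                         latency_ms=None, max_latency_ms=None):
    """Return {check_name: bool} for one model output.

    Assumes a JSON object with a "citations" list (hypothetical schema).
    """
    results = {"valid_json_object": False, "required_fields": False, "citations": False}
    try:
        payload = json.loads(raw_output)
    except (TypeError, ValueError):
        return results  # unparseable output fails every structural check
    if not isinstance(payload, dict):
        return results
    results["valid_json_object"] = True
    results["required_fields"] = all(f in payload for f in required_fields)
    citations = payload.get("citations", [])
    results["citations"] = isinstance(citations, list) and len(citations) >= min_citations
    # The latency check is only reported when both numbers are supplied.
    if latency_ms is not None and max_latency_ms is not None:
        results["latency"] = latency_ms <= max_latency_ms
    return results
```

Keeping each check as a separate boolean means the report can show which property broke, not only that something did.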

The ceiling appears when the model under test grades itself, when the judge comes from the same vendor family as the candidate, or when the rubric is vague enough that every output passes. MT-Bench and Chatbot Arena research identifies position bias, verbosity bias, and self-enhancement bias in LLM judges. OpenAI's own eval guidance says model grading has an error rate, should be validated with human evaluation, and ideally uses a different grading model than the completion model. Promptfoo's default judge picks the same model family as the API key you give it: GPT if you give it an OpenAI key, Claude if you give it an Anthropic key. That's fine when engineers are iterating. It's dangerous in procurement, because the AI vendor you're evaluating ends up grading itself. The named failure mode is judge contamination.
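
One common mitigation for position bias in pairwise judging is to ask twice with the answers swapped and keep only verdicts that survive the swap. The sketch below assumes a hypothetical `judge` callable that returns "first", "second", or "tie"; it is not the interface of any tool named here.

```python
def debiased_preference(judge, prompt, answer_a, answer_b):
    """Run a pairwise judge in both orders and keep only consistent verdicts.

    `judge(prompt, first, second)` is a hypothetical callable returning
    "first", "second", or "tie".
    """
    forward = judge(prompt, answer_a, answer_b)
    backward = judge(prompt, answer_b, answer_a)
    if forward == "first" and backward == "second":
        return "a"
    if forward == "second" and backward == "first":
        return "b"
    # Disagreement across orderings is treated as no preference.
    return "tie"
```

A judge that always prefers whichever answer it reads first produces only ties under this scheme, which is the point: the bias shows up as lost signal, not as a false win.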

The eval is ungameable when three things hold: the AI vendor can't see the judge setup, every automated judge has been calibrated against human-labeled examples, and the judge model comes from a different vendor family than the candidate.

If you start this week, pick one task class, draft the rule-based checks, write the rubric for the LLM judge, label a 50-example calibration set by hand, and reject any judge that doesn't agree with the human labels above the threshold the team agreed to.
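
Raw agreement flatters a judge when one label dominates, so a chance-corrected statistic such as Cohen's kappa is one reasonable way to express "agrees with the human labels." A sketch in plain Python; the function names are illustrative, and `min_kappa` stands in for whatever threshold the team agreed to.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two raters over the same examples."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected if both raters labeled at random with their own label rates.
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


def judge_passes_calibration(human_labels, judge_labels, min_kappa):
    """True when the judge meets the team-agreed kappa threshold."""
    return cohen_kappa(human_labels, judge_labels) >= min_kappa
```

The same function covers inter-annotator agreement on the double-labeled human slice: pass two reviewers' labels instead of a human's and a judge's.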

Examples of what this looks like in production:

Task | Scoring layer | Stack
Support intent classification | Deterministic schema check plus LLM judge calibrated against human labels | Braintrust plus Label Studio
Refund eligibility reasoning | Human review on every Severity-1, LLM judge on the bulk | LangSmith plus Argilla
Multilingual tone scoring | Pairwise comparison and native-speaker review | Ragas plus Surge AI

Failure-Mode Taxonomy and Severity Policy

The work is naming what "wrong" means for the task. A generic pass rate hides the only failures that matter to the business.

The slot belongs to the function owner who pays for failure cleanup, not the eval engineer. The taxonomy lives next to the dataset in the eval platform, with severity weights assigned before any candidate sees the test. Braintrust, LangSmith, Promptfoo, Weave, and Inspect AI all support per-example metadata. Label Studio and Argilla help when the taxonomy needs human consensus before it locks.

What ships clean:

  • Every example tagged with one or more failure categories before scoring
  • Severity weights agreed in writing with the function owner before the bake-off starts
  • Zero-tolerance failure classes named explicitly, with the rule that one Severity-1 failure blocks acceptance regardless of aggregate score
  • A misroute review that adds new categories when production surfaces a failure mode the taxonomy didn't anticipate

The ceiling appears when the team reports one aggregate quality number. Aggregates make weak AI vendors look acceptable because catastrophic failures get averaged into the mean. A 92 percent pass rate is useless if the 8 percent includes data leakage, unauthorized refunds, privilege escalation, or invented legal advice. The named failure mode is dashboard theater.

The eval is ungameable when failure categories and severity weights are buyer-owned, hidden from the AI vendor during acceptance, and tied to hard gates the AI vendor can't argue with after the fact.
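
A hard gate is easy to state in code: the severity-weighted score matters only if no zero-tolerance category has failed. The sketch below assumes a hypothetical result schema (`category`, `passed`) and buyer-supplied weights; the names and the threshold are illustrative.

```python
def acceptance_decision(results, severity_weights, zero_tolerance, min_weighted_score):
    """Apply zero-tolerance gates first, then the severity-weighted threshold.

    `results` is a list of dicts with "category" and "passed" keys
    (hypothetical schema); `zero_tolerance` is a set of category names.
    """
    blocking = sorted({r["category"] for r in results
                       if not r["passed"] and r["category"] in zero_tolerance})
    total = sum(severity_weights[r["category"]] for r in results)
    earned = sum(severity_weights[r["category"]] for r in results if r["passed"])
    weighted = earned / total if total else 0.0
    # One zero-tolerance failure blocks acceptance regardless of the score.
    accepted = not blocking and weighted >= min_weighted_score
    return {"accepted": accepted, "weighted_score": weighted, "blocking": blocking}
```

Returning the blocking categories alongside the score keeps the conversation on the failure class, not on the aggregate.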

If you start this week, run a tagging workshop with the function owner and engineering, assign severity weights to the top 12 failure categories, name the zero-tolerance set, and write the policy into the procurement file before any AI vendor runs the eval.

Examples of what this looks like in production:

Task | Failure categories tagged | Stack
Support bot | Wrong policy, invented refund, missed escalation, tone violation | Internal taxonomy doc plus Label Studio
Coding agent | Syntax error, security bug, wrong file edited, dependency breakage | Internal taxonomy doc plus Inspect AI
RAG application | Unsupported answer, wrong citation, retrieval miss, hallucinated source | Internal taxonomy doc plus Argilla

Cadence, Freshness, and Drift Capture

The work is keeping the eval current with the system it scores. Production data drifts as customers and the product evolve, the AI vendor ships new model versions between renewals, and engineering keeps rewriting the system prompt against whatever the current model does best. A frozen eval becomes a launch snapshot that goes blind to all of it.

The slot belongs to platforms that pull production traces back into the eval dataset. Braintrust feeds production traces into datasets for refresh. LangSmith converts online issues into offline test cases and supports dataset versioning with tagged versions for specific runs. Weave connects evaluation with tracing, scorers, CI automations, and alerts. Helicone is useful upstream as observability and trace export, less so as eval-of-record after the Experiments feature was removed on September 1, 2025.

What ships clean:

  • A smoke set of 25 cases on every change to model, prompt, retrieval, tool, or policy
  • A full regression nightly or before merge for high-risk systems
  • A production-sampled drift set weekly, with new failures auto-promoted to candidate eval cases
  • A procurement acceptance set rerun before renewal, AI vendor expansion, or any material model upgrade

The ceiling appears when the eval freezes. A frozen eval is easy to pass, easy to forget, and overfit to the data the team had when the eval was built. AI vendor releases ship faster than annual reviews. The named failure mode is the stale snapshot.

The eval is ungameable when refresh cadence is shorter than the AI vendor's release cycle, the AI vendor can't predict the refresh slice, and production failures automatically create candidate cases on a schedule the AI vendor doesn't see.

If you start this week, set up a weekly job that samples 50 production interactions, runs them through the current scoring layer, surfaces failures to a human reviewer, and pushes any confirmed failure into the held-out pool.
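
The weekly job reduces to four steps: sample, score, confirm, promote. A sketch under stated assumptions: `score_fn` stands in for the current scoring layer, `confirm_fn` for the human reviewer's decision, and each interaction is a dict with an `id`; none of these names come from a specific platform.

```python
import random


def weekly_drift_sample(production_log, score_fn, confirm_fn, held_out_pool,
                        sample_size=50, seed=None):
    """Sample production interactions and promote confirmed failures.

    `score_fn(ex)` returns True when the example passes the scoring layer;
    `confirm_fn(ex)` returns True when a reviewer confirms the failure.
    Both are hypothetical callables supplied by the team.
    """
    rng = random.Random(seed)
    k = min(sample_size, len(production_log))
    sample = rng.sample(production_log, k)
    flagged = [ex for ex in sample if not score_fn(ex)]
    confirmed = [ex for ex in flagged if confirm_fn(ex)]
    # Skip anything already in the held-out pool.
    known = {ex["id"] for ex in held_out_pool}
    promoted = [ex for ex in confirmed if ex["id"] not in known]
    held_out_pool.extend(promoted)
    return promoted
```

Leaving `seed` unset in production keeps the sampled slice unpredictable; fix it only when reproducing a past run.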

Examples of what this looks like in production:

Cadence | Trigger | Stack
Smoke set (25 cases) | Every prompt, model, retrieval, or tool change | Braintrust in CI
Nightly regression | High-risk system pre-merge | LangSmith
Weekly drift sample (50 cases) | Auto-promote confirmed failures into the held-out pool | Braintrust plus LangSmith

Comparative Baseline Design

The work is choosing the comparisons that answer the right questions. A single AI vendor scored in isolation tells the team almost nothing about whether to buy.

The slot belongs to platforms that keep permanent records of each run and support side-by-side comparison. Braintrust experiments are locked once they run, which is the right shape for comparing model, prompt, retrieval, and tool variants without anyone editing the history. LangSmith regression tests highlight regressions and improvements relative to a baseline. Weave supports comparing model objects against the same dataset. Promptfoo handles matrix comparisons across prompts, providers, and assertions. Inspect AI supports benchmark-shaped comparisons across models and agents.

What ships clean:

  • A current-system baseline that answers "does this beat what we already have"
  • A competing-AI-vendor baseline that answers "is this the best buyable option"
  • A human baseline that answers "where does automation become unsafe or uneconomic"
  • A prior-AI-vendor-version baseline that answers "did the upgrade regress"

The ceiling appears when the buyer compares only against the AI vendor's chosen baseline. A vendor will pick a weak baseline, a public set, a stale model, or a metric where it shines, because that's what selling looks like when the buyer hasn't done the choosing. The named failure mode is the seller-picked baseline.

The eval is ungameable when baselines are buyer-selected, run blind where possible, frozen before any candidate runs, and rerun on the same hidden acceptance set after a candidate change.

If you start this week, pick the three baselines that matter for the decision in front of the team, freeze them on the current eval set, and reject any AI vendor pitch that proposes a different baseline mid-pilot.
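
Once the baselines are frozen, the comparison itself is a per-category subtraction on the same hidden set. A minimal sketch, assuming pass rates are already computed per category; the `max_drop` tolerance of two points is a placeholder, not a recommendation.

```python
def compare_to_baseline(baseline, candidate, max_drop=0.02):
    """Compare per-category pass rates measured on the same frozen set.

    `baseline` and `candidate` map category name -> pass rate in [0, 1].
    Returns (deltas, regressions), where regressions lists categories whose
    pass rate fell by more than `max_drop`.
    """
    missing = sorted(set(baseline) - set(candidate))
    if missing:
        # A candidate that skips a category has not been compared at all.
        raise ValueError(f"candidate was not scored on: {missing}")
    deltas = {c: candidate[c] - baseline[c] for c in baseline}
    regressions = sorted(c for c, d in deltas.items() if d < -max_drop)
    return deltas, regressions
```

Raising on a missing category closes one quiet route to a seller-picked baseline: dropping the category where the candidate is weak.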

Examples of what this looks like in production:

Baseline | Question it answers | Stack
Current production system | Does the candidate beat what we already ship? | Braintrust experiments
Competing AI vendor | Is this the best buyable option? | Promptfoo matrix comparison
Prior version of same AI vendor | Did the upgrade regress against the same held-out set? | LangSmith regression test

Granularity and Reporting

The work is structuring the output of the eval so the team can act on it. A single summary number tells the steering committee something while telling the team responsible for failures almost nothing.

The slot belongs to platforms that combine experiments, traces, comments, human review, and example-level inspection. Braintrust and LangSmith are strongest for cross-functional review. Inspect AI is strongest for engineering-grade reproducibility, weaker for business-user review unless the team builds reporting on top. Weave fits teams already in the W&B stack. Promptfoo is strong for CLI and CI reports.

What ships clean:

  • Overall pass rate, category pass rates, severity-weighted score, p95 latency, cost per call, refusal rate, escalation rate, tool-call success, citation correctness, and critical-failure count, all on the same page
  • Example-level failure inspection so the team can read what actually broke
  • A procurement-shaped summary that opens with the decision, not the score
  • A diagnostic layer for the team that owns remediation, separate from the executive view
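
A few of the numbers in the list above, computed side by side from per-example results. The sketch assumes a hypothetical record shape (`category`, `passed`, `severity`, `latency_ms`) and uses a nearest-rank p95; an eval platform may define the percentile differently.

```python
from collections import defaultdict


def p95(values):
    """Nearest-rank 95th percentile, using integer arithmetic for the rank."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n)
    return ordered[rank - 1]


def build_report(results):
    """Summarize per-example results (hypothetical schema) into one report dict."""
    per_category = defaultdict(lambda: [0, 0])  # category -> [passed, total]
    for r in results:
        per_category[r["category"]][0] += int(r["passed"])
        per_category[r["category"]][1] += 1
    return {
        "overall_pass_rate": sum(r["passed"] for r in results) / len(results),
        "category_pass_rates": {c: p / t for c, (p, t) in sorted(per_category.items())},
        "p95_latency_ms": p95([r["latency_ms"] for r in results]),
        # Severity 1 is assumed to be the most severe class.
        "critical_failures": sum(1 for r in results
                                 if r["severity"] == 1 and not r["passed"]),
    }
```

The critical-failure count sits next to the pass rate on purpose: a high aggregate with a nonzero count is a different decision from the same aggregate with zero.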

The ceiling appears at single-number reporting. A green aggregate makes the buyer feel informed while hiding the failure class that triggers a board escalation later. The named failure mode is dashboard theater.

The eval is ungameable when the AI vendor can't choose the summary number, hide rare failures, or substitute a metric that looks better than the buyer's decision rule.

If you start this week, draft the procurement summary template for the next eval run, name the categories and severities that appear on page one, and reject any AI vendor report that arrives as a single number with no per-category breakdown.

Examples of what this looks like in production:

Audience | Report shape | Stack
Function owner | Procurement summary with category passes, severities, costs, latencies | Braintrust
Remediation team | Per-example failures with input, output, judge rationale | LangSmith
Steering committee | Decision-first one-pager backed by per-category drilldown | Custom report on top of Braintrust API

Adversarial and Safety Coverage

The work is testing the surfaces AI vendors don't show in demos: prompt injection, jailbreaks, unsafe requests, data leakage, policy bypass, harmful content, tool misuse, and adversarial edge cases that hit production but skip pitch decks.

The slot belongs to specialist red-team tooling stacked on a sandboxed runner. Garak is a vulnerability scanner that probes hallucination, prompt injection, data leakage, misinformation, toxicity, and jailbreaks. Promptmap sends attack prompts and uses a controller model to evaluate success. Promptfoo's red-team plugins cover prompt injection, jailbreak, data leakage, and tool misuse, with the caveat that OpenAI's acquisition introduces a governance footnote when the buyer is evaluating OpenAI. Inspect AI supports tool calling, agent evals, sandboxes, tool approval, and custom scorers, which makes it the default for controlled red-team work. HELM Safety and MLCommons AILuminate are useful public references, not substitutes for buyer-owned cases.

What ships clean:

  • Adversarial cases drawn from the buyer's actual attack surface (real ticket text, real documents, real tool descriptions)
  • Hidden canary cases that test prompt-injection compliance and data leakage without telling the AI vendor what's being tested
  • A refresh process that adds new incidents and exploit patterns to the adversarial pool as they appear
  • A separate severity track for safety failures, with zero-tolerance defaults

The ceiling appears when adversarial coverage is the AI vendor's safety card or a generic jailbreak leaderboard. Public safety benchmarks are reference points, not proof that the AI vendor handles the team's tool access, documents, and escalation rules. The named failure mode is theater coverage.

The eval is ungameable when adversarial cases come from the buyer's real attack surface, the AI vendor doesn't see the canary set, and the refresh process feeds real incidents into the pool.

If you start this week, pick the top three attack surfaces for the system being procured, write 20 adversarial cases per surface from real or realistic inputs, and add a five-case canary subset the AI vendor will never see.
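
One simple way to score a canary case is to plant a unique marker string in the injected instruction or in data the system must never reveal, then check whether it surfaces in the output. The sketch below is illustrative: the marker format and the `outputs` mapping are assumptions, and a substring match will miss paraphrased leaks.

```python
def canary_leaks(outputs, canaries):
    """Return (case_id, canary) pairs where a planted marker appears in an output.

    `outputs` maps a case id to the system's response text; `canaries` is a
    list of unique marker strings (hypothetical format) planted by the buyer.
    """
    leaks = []
    for case_id, text in outputs.items():
        lowered = text.lower()
        for canary in canaries:
            # Case-insensitive substring match; paraphrased leaks are not caught.
            if canary.lower() in lowered:
                leaks.append((case_id, canary))
    return leaks
```

Any hit is a candidate for the zero-tolerance safety track, and the marker strings themselves belong with the hidden set.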

Examples of what this looks like in production:

Surface | Attack pool | Stack
Support bot | Prompt injection in ticket text, attached docs, and email threads | Garak plus Inspect AI
Browser agent | Credential, payment, destructive-action, and instruction-hierarchy cases | Promptfoo red-team plus Inspect AI
RAG application | Malicious documents that try to override the system prompt | Garak plus Promptfoo

Procurement Lock: Decision Rules, Contracts, and Exit Rights

The work is converting eval results into business consequences. A beautiful eval report with no contract leverage produces leverage for the AI vendor, not the buyer.

The slot belongs to procurement and legal, working from the eval results before signature. Public guidance supports the building blocks. The Society for Computers and Law's AI clauses project says AI requirements are often better expressed as benchmarked measurable outcomes than as generic standards. The Forum for Cooperation on AI's procurement framework lists clauses covering purpose, data rights and retention, system performance, monitoring, incident-management SLAs, KPIs, upgrade approvals, lock-in management, and periodic audits. Shumaker's benchmark-clause article makes the procurement point directly: benchmark requirements convert AI promises into leverage for remediation, service credits, or exit rights.

What ships clean:

  • Acceptance thresholds in a contract schedule: overall pass rate, category pass rates, zero-tolerance failures, p95 latency, cost ceiling, data-handling controls
  • A regression definition with named thresholds and cure periods
  • Data non-use clauses that prevent the AI vendor from training on buyer inputs, outputs, feedback, labels, or evaluation materials
  • Audit and isolation rights including subprocessor disclosure, annotation workflows, and model-change notice
  • Termination and exit clauses with prorated refunds, data export, and transition assistance when the eval regresses past the cure threshold

The ceiling appears when the eval produces a memo and the contract ignores it, leaving the buyer with the analysis and no leverage. The named failure mode is the eval that didn't make it to the contract.

The eval is ungameable when the AI vendor knows the decision rule but not the hidden eval set, and the contract gives the buyer remediation rights tied to private-eval regression.
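
A regression definition with a cure period can be written down as data plus one function, which makes it easy to check that the contract schedule and the eval pipeline agree. The clause fields, status labels, and example numbers below are hypothetical; the actual terms are whatever procurement and legal negotiate.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class RegressionClause:
    """Hypothetical contract terms: a pass-rate floor and a cure window."""
    min_pass_rate: float
    cure_days: int


def contract_status(clause, pass_rate, breach_detected_on, today):
    """Map a private-eval result onto the clause's three possible states."""
    if pass_rate >= clause.min_pass_rate:
        return "compliant"
    deadline = breach_detected_on + timedelta(days=clause.cure_days)
    # Inside the cure window the vendor may still remediate; after it, exit rights apply.
    return "in_cure_period" if today <= deadline else "exit_rights_triggered"
```

For example, a clause of `RegressionClause(min_pass_rate=0.85, cure_days=30)` with a breach detected on `date(2026, 1, 1)` stays in the cure period through January 31.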

If you start this week, pull the team's draft procurement clauses and add a Buyer Private Evaluation Set provision, a Model Change and Regression provision, and a Data Non-Use provision before any AI vendor sees the next round of negotiation.

Examples of what this looks like in production:

Clause | What it does | Owner
Buyer Private Evaluation Set | AI vendor cannot see, train against, or score the held-out set | Procurement plus Legal
Model Change and Regression | Material regression triggers notice, cure period, and rollback rights | Legal
Data Non-Use | Inputs, outputs, evaluation materials, and labels excluded from training | Legal plus Security

What the Human Owns Regardless of Tool

The function owner owns authorial intent. A platform can store examples; it can't know which failures the business can't tolerate, and it can't grade the ambiguity sitting in a customer's complaint.

Engineering owns dataset construction. A platform can store rows; it can't pull the right slice from production, redact sensitive fields, or build the stratification the team's actual customers require.

The applied-AI lead owns the taxonomy and the rubric. A model can judge against a rubric only after the rubric is written down, and writing the rubric is the work of the team that owns the budget for failure cleanup.

The applied-AI lead also owns the refresh cadence. A platform can schedule jobs; it can't decide when the market, the product, the policy, or the AI vendor have changed enough to invalidate the previous test.

Procurement and legal own the decision rule. A dashboard can show a score; the buyer has to decide whether that score means buy, reject, cure, renew, downgrade, or exit, and what the contract has to say to make any of those moves real.

Security owns the adversarial surface. A red-team plugin runs the attacks the security lead chose; the lead names which attacks the system actually faces in production, which canaries stay hidden, and which incidents flow back into the pool.

A team that hands taxonomy, rubric, cadence, and decision rule to the AI vendor stops owning procurement. The vendor sells the eval it can grade itself on, and the buyer is the only entity left that can decide what it actually buys.

Cost Calculus and Coexistence

Cheap on the platform line isn't cheap in procurement. Whether the team picks a free CLI tool with no human review and no procurement lock or a premium eval-of-record platform with no buyer-owned hidden set, the eval produces clean reports and an exposed buyer. The math below assumes a buyer running between 200 and 5,000 acceptance examples per procurement decision; enterprise programs push every number up.

Pricing sampled June 1, 2026:

Platform | Entry tier | Mid tier | Enterprise
LangSmith | $0 (Developer, 1 seat, 5,000 traces) | $39 per seat per month (Plus) | Custom, self-host available
Braintrust | $0 (Starter, $10 credits, 1 GB) | $249 per month (Pro) | Custom, self-host data plane
Weave (W&B) | $0 (Free, evals + tracing) | Custom | Custom, Dedicated Cloud or Self-Managed
Promptfoo | $0 (Community CLI, 10,000 probes per month) | Custom (Enterprise SaaS) | Custom On-Prem
Inspect AI | $0 (open-source) | N/A | N/A (buyer-hosted)
Helicone | $0 (Hobby, 10,000 requests) | Usage-based (Pro, Team) | Custom
OpenAI Evals | Usage-based (API plus judge inference) | N/A | Platform-tied

Then count what the platform line hides:

  • Judge inference, which dominates the bill when LLM-as-judge runs at procurement scale
  • Human review time, especially during calibration and double-labeling
  • Self-hosting operations cost when the buyer runs a data plane internally
  • Annotation vendor cost when outsourced labeling is required, with the leakage risk priced in
  • Adversarial tooling time (Garak, Promptmap, Promptfoo red-team plugins) plus the security review they require
  • Legal time for procurement-clause drafting and AI vendor negotiation
  • The cost of being wrong, which is what the eval is supposed to prevent and rarely appears in the budget review until it has already happened
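
Judge inference is the one hidden line that can be estimated on a napkin before the bake-off. The sketch below multiplies call volume by a per-call token cost; every input is an assumption the team supplies, and no real vendor price is implied.

```python
def judge_inference_cost(n_examples, n_candidates, judges_per_output,
                         tokens_in, tokens_out, usd_per_1k_in, usd_per_1k_out):
    """Estimate LLM-as-judge spend for one full acceptance run.

    Token counts are per judge call; prices are per 1,000 tokens. All values
    are placeholders supplied by the caller.
    Returns (number_of_judge_calls, estimated_cost_in_usd).
    """
    calls = n_examples * n_candidates * judges_per_output
    per_call = tokens_in / 1000 * usd_per_1k_in + tokens_out / 1000 * usd_per_1k_out
    return calls, calls * per_call
```

Multiply the result by the number of reruns the cadence implies (nightly, weekly, pre-renewal) to see why this line tends to outgrow the platform fee.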

Five coexistence patterns capture most procurement setups:

  • Single-platform engineering-only: best for small teams running one product surface. Pick Braintrust or LangSmith, run dataset, scoring, regression in one place. Spend lands at $0 to $300 a month; the human-review discipline lives in process, not tooling.
  • Eval-of-record plus specialist red-team: best for most production teams with security exposure. Braintrust or LangSmith plus Promptfoo CLI or Garak in CI. Spend lands at $250 to $1,000 a month; adversarial coverage gets a separate review track.
  • Eval-of-record plus human-labeling platform: best for teams with serious regulated workloads. Braintrust or LangSmith plus Label Studio or Argilla for inter-annotator agreement. Spend lands at $300 to $2,000 a month plus annotation labor.
  • Self-hosted full stack: best for security-sensitive buyers and air-gapped environments. Inspect AI plus Label Studio plus Garak plus a self-hosted Braintrust or LangSmith data plane. Spend lands in operations time more than license fees; the payoff is data residency, not dollars.
  • Procurement-only acceptance program: best for buyers running annual AI vendor reviews rather than continuous evaluation. Lightweight Promptfoo or Inspect AI for the bake-off, Label Studio for human review, and a contract template that ties acceptance to the eval results. Spend lands under $500 a month with the procurement load front-loaded.

Two platforms earn their seats when one carries the engineering eval workflow and the other carries the adversarial or human-labeling layer the first can't handle cleanly. They don't earn their seats when the second platform exists because nobody's owned the eval and "more tooling" feels safer than picking the discipline. A second platform can't decide which failures the business can't tolerate; it can only run the wrong eval with a different AI vendor logo.

Pitfalls and Anti-Patterns

Using a Public Benchmark as Procurement Proof

Treating MMLU, HumanEval, GSM8K, or a leaked AI vendor leaderboard as decision-grade evidence. Public benchmarks are scouting tools. Once they're famous enough to win on, they're famous enough to leak into training data and prompt-engineering folklore. The fix is the held-out acceptance set built on the buyer's production data.

Using the AI Vendor's Eval Set

Accepting the AI vendor's "internal benchmark" as a fair comparison. Even when the AI vendor is honest about the methodology, the set is vendor-authored evidence. A serious buyer treats it like the demo: useful for forming a hypothesis, never enough for signature.

Using an LLM Judge From the Same Vendor Family as the Model Under Test

Running an OpenAI model as the judge while evaluating OpenAI's GPT models, or a Claude model as the judge while evaluating Claude. MT-Bench research and OpenAI's own eval guidance both warn against the same-family judge pattern. The fix is to draw the judge from a different vendor family and calibrate against human-labeled samples.
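The calibration step can be made concrete with a chance-corrected agreement score. The sketch below computes Cohen's kappa between human labels and judge labels, then admits the judge only if it comes from a different vendor family and clears a threshold. The function names, the family strings, and the 0.5 threshold in the usage lines are illustrative assumptions, not values from any of the cited guidance.

```python
from collections import Counter


def cohens_kappa(human: list[str], judge: list[str]) -> float:
    """Chance-corrected agreement between two label sequences."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    human_counts, judge_counts = Counter(human), Counter(judge)
    labels = set(human) | set(judge)
    # Agreement expected if both raters labeled at random with their own rates.
    expected = sum(human_counts[l] * judge_counts[l] for l in labels) / (n * n)
    if expected == 1.0:  # both raters used a single identical label
        return 1.0
    return (observed - expected) / (1 - expected)


def judge_admissible(judge_family: str, candidate_family: str,
                     kappa: float, threshold: float) -> bool:
    """Hypothetical gate: different vendor family AND calibrated agreement."""
    return judge_family != candidate_family and kappa >= threshold


# Illustrative labels only; a real calibration sample would be far larger.
human = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "pass", "fail", "pass"]
judge = ["pass", "pass", "fail", "pass", "pass", "fail", "pass", "fail", "fail", "pass"]
kappa = cohens_kappa(human, judge)
print(round(kappa, 3))
print(judge_admissible("vendor_b", "vendor_a", kappa, threshold=0.5))
```

Raw percent agreement would read 80 percent on this sample; kappa is lower because some of that agreement is expected by chance. The threshold belongs to the buyer and should be fixed before the acceptance run, not after seeing the scores.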

Eval Set Too Small to Support the Decision

Running 30 examples and treating the result as procurement evidence. Eugene Yan's guidance works through the math: with 200 samples and a 3 percent defect rate, the 95 percent confidence interval is roughly 3 percent plus or minus 2.4 percentage points. Thirty samples give a far wider interval and won't tell the team what it needs to know about the rare, catastrophic failures. The fix is sample sizing that matches the decision's blast radius.
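The interval quoted above can be reproduced with the normal-approximation (Wald) formula, sketched below. The second helper is an addition that the source does not use: the exact 95 percent upper bound on the true failure rate when a run observes zero failures, which shows how little a clean 30-example run proves. Function names are illustrative.

```python
import math


def wald_margin(defects: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Return (observed rate, half-width of the normal-approximation interval)."""
    rate = defects / n
    margin = z * math.sqrt(rate * (1 - rate) / n)
    return rate, margin


def zero_failure_upper_bound(n: int, confidence: float = 0.95) -> float:
    """Largest true failure rate still consistent with 0 failures in n trials."""
    return 1 - (1 - confidence) ** (1 / n)


rate, margin = wald_margin(defects=6, n=200)  # 6 of 200 is a 3 percent defect rate
print(f"{rate:.3f} +/- {margin:.3f}")
print(f"{zero_failure_upper_bound(30):.3f}")
print(f"{zero_failure_upper_bound(200):.3f}")
```

The Wald interval is known to behave poorly at very low defect rates and small samples, so it is best read as a rough sizing aid. The zero-failure bound makes the point for rare failures: a clean run of 30 examples still leaves a true failure rate of roughly 10 percent on the table.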

Aggregate-Only Reporting

Reporting a single quality number and burying the per-category and per-severity detail. Aggregates make weak AI vendors look acceptable because catastrophic failures get averaged into the mean. The fix is procurement-shaped reporting that opens with the decision and preserves per-example failures underneath.
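A small sketch shows how an aggregate hides a catastrophic class. It assumes each scored example carries a category and a severity, with severity 1 as the most severe; the `Result` shape, the category names, and the sample data are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    example_id: str
    category: str
    severity: int  # assumed convention: 1 is the most severe class
    passed: bool


def procurement_breakdown(results: list[Result]):
    """Return the aggregate pass rate plus per-(category, severity) detail."""
    if not results:
        raise ValueError("no results to summarize")
    buckets = defaultdict(lambda: {"passed": 0, "total": 0, "failed_ids": []})
    for r in results:
        bucket = buckets[(r.category, r.severity)]
        bucket["total"] += 1
        if r.passed:
            bucket["passed"] += 1
        else:
            bucket["failed_ids"].append(r.example_id)  # keep per-example failures
    aggregate = sum(r.passed for r in results) / len(results)
    return aggregate, dict(buckets)


# Hypothetical run: 98 routine passes and 2 failures in a severity-1 class.
results = [Result(f"faq-{i}", "faq", 3, True) for i in range(98)]
results += [Result(f"pii-{i}", "pii_disclosure", 1, False) for i in range(2)]

aggregate, buckets = procurement_breakdown(results)
print(aggregate)
for (category, severity), b in sorted(buckets.items(), key=lambda kv: kv[0][1]):
    print(severity, category, f'{b["passed"]}/{b["total"]}', b["failed_ids"])
```

The aggregate reads 98 percent while the severity-1 bucket passes zero of two. Sorting by severity puts the failures that decide the purchase at the top of the report.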

Leaking the Eval Through Shared Tools

Sharing the held-out set with the AI vendor's support team during a bug investigation. Sending the rubric to an outsourced annotation contractor without isolation. Uploading the prompt library to a shared cloud project. Sending the red-team report to the AI vendor's security desk. Each leakage channel turns the next eval into a less reliable test. The fix is named isolation boundaries and a written non-use clause in the vendor contract.

Letting the AI Vendor Tune Against the Acceptance Set

Treating "internal eval" and "private acceptance set" as the same thing. A development split the vendor sees is fine. An acceptance split the AI vendor never sees is structural. The fix is the three-way dataset split with hard storage boundaries.
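One way to keep split membership stable across refreshes is to assign each example by a salted hash of its ID, as in the sketch below. The split names, the 50/20/30 proportions, and the salt are assumptions; in particular, the name of the middle split is a placeholder for whatever the team's third split is called.

```python
import hashlib

# Hypothetical split names and shares; only "development" and "acceptance"
# are named in the text above.
SPLITS = (("development", 0.5), ("validation", 0.2), ("acceptance", 0.3))


def assign_split(example_id: str, salt: str) -> str:
    """Deterministically map an example ID to a split using a private salt."""
    digest = hashlib.sha256(f"{salt}:{example_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    cumulative = 0.0
    for name, share in SPLITS:
        cumulative += share
        if bucket < cumulative:
            return name
    return SPLITS[-1][0]  # guard against floating-point rounding at the top edge


salt = "buyer-private-salt"  # placeholder; keep the real value out of vendor reach
assignments = {f"ticket-{i}": assign_split(f"ticket-{i}", salt) for i in range(1000)}
counts = {name: sum(v == name for v in assignments.values()) for name, _ in SPLITS}
print(counts)
```

Hashing only decides which examples belong where. The hard boundary still has to come from storage: the acceptance rows live in a location no AI vendor account can read, and the salt stays private so membership can't be predicted from example IDs.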

Treating Eval as Post-Signing QA

Running the eval the first time after the contract is signed. By then the leverage is gone. The AI vendor's incentive shifts from winning the deal to managing the relationship. The fix is to make the eval the procurement document, not the post-procurement audit.

What to Validate Before Paying for the Stack

The pilot below tests the procurement program against a real AI vendor decision, not against an AI vendor demo. It produces measurable pass-fail gates and a defensible decision.

Before day one. Write the failure taxonomy with severity weights and signed-off zero-tolerance classes. Build the dataset and the three-way split. Decide which eval-of-record platform the team is willing to depend on, and confirm the held-out set lives in storage no AI vendor account can reach. Draft the procurement clauses in advance.

Week one: bake-off, scored honestly. Run the current system, the candidate vendor, and one credible alternative against the held-out set. Calibrate every LLM judge against a human-labeled sample before allowing it into the acceptance run. Score per category, per severity, per example. Produce a procurement summary the function owner can read and decide on.

Week two: contract and decide. Map the eval results into the procurement clauses. Negotiate the acceptance threshold, the regression definition, the cure period, the data non-use language, and the exit rights. Don't sign until the contract reflects the eval. If the AI vendor refuses private evaluation, the AI vendor is failing procurement.

Buy only if the pilot wins. The pilot passes only when these gates hold:

  • The candidate's scores on the held-out acceptance set meet the acceptance threshold, with no Severity-1 failures
  • The judge calibration cleared the agreement threshold against human labels
  • The refresh cadence is set up and the production drift pipeline is running
  • The contract reflects the acceptance threshold, the regression definition, the cure period, the data non-use clause, and the exit rights
  • The team has a documented rollback path that returns to the prior system within the cure period
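The gates above can be encoded as a single check that returns every failed gate rather than a bare yes or no. This is a minimal sketch: the field names, clause identifiers, and sample values are hypothetical, and the thresholds are whatever the team wrote down before the bake-off.

```python
from dataclasses import dataclass, replace

REQUIRED_CLAUSES = frozenset({
    "acceptance_threshold", "regression_definition",
    "cure_period", "data_non_use", "exit_rights",
})


@dataclass(frozen=True)
class PilotEvidence:
    acceptance_score: float
    acceptance_threshold: float
    severity1_failures: int
    judge_agreement: float
    judge_agreement_threshold: float
    refresh_cadence_set: bool
    drift_pipeline_running: bool
    contract_clauses: frozenset[str]
    rollback_documented: bool
    rollback_days: int
    cure_period_days: int


def failed_gates(e: PilotEvidence) -> list[str]:
    """Return a description of every gate that does not hold; empty means pass."""
    failures = []
    if e.acceptance_score < e.acceptance_threshold:
        failures.append("acceptance score below threshold")
    if e.severity1_failures > 0:
        failures.append("Severity-1 failures present")
    if e.judge_agreement < e.judge_agreement_threshold:
        failures.append("judge calibration below agreement threshold")
    if not (e.refresh_cadence_set and e.drift_pipeline_running):
        failures.append("refresh cadence or drift pipeline not in place")
    missing = REQUIRED_CLAUSES - e.contract_clauses
    if missing:
        failures.append("contract missing: " + ", ".join(sorted(missing)))
    if not e.rollback_documented or e.rollback_days > e.cure_period_days:
        failures.append("rollback path missing or slower than the cure period")
    return failures


# Hypothetical evidence for one candidate.
passing = PilotEvidence(0.94, 0.90, 0, 0.81, 0.70, True, True,
                        REQUIRED_CLAUSES, True, 10, 30)
blocked = replace(passing, severity1_failures=1,
                  contract_clauses=REQUIRED_CLAUSES - {"exit_rights"})
print(failed_gates(passing))
print(failed_gates(blocked))
```

Returning the full list of failed gates gives the function owner something to negotiate against, since each entry maps to a specific fix or a specific contract clause.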

Fail the pilot if the eval platform can't:

  • Show per-example inputs, outputs, judge rationale, and human labels for any failure
  • Store the held-out set in isolation from the AI vendor's account
  • Run the same eval against multiple candidates with identical scoring
  • Surface regressions against a frozen baseline
  • Export results into a procurement summary the function owner can act on without an AI vendor present
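For the frozen-baseline requirement, a buyer-side check might look like the sketch below: compare per-category pass rates for a candidate against the stored baseline and flag any drop larger than the contract's regression definition. The category names, rates, and the two-point tolerance are illustrative assumptions.

```python
def find_regressions(baseline: dict[str, float],
                     candidate: dict[str, float],
                     tolerance: float) -> dict[str, float]:
    """Return {category: drop} for categories that fell by more than tolerance."""
    missing = set(baseline) - set(candidate)
    if missing:
        # A category that was not scored must not pass silently.
        raise ValueError(f"candidate run is missing categories: {sorted(missing)}")
    regressions = {}
    for category, base_rate in baseline.items():
        drop = base_rate - candidate[category]
        if drop > tolerance:
            regressions[category] = drop
    return regressions


# Hypothetical per-category pass rates on the same held-out set.
frozen_baseline = {"billing": 0.95, "refunds": 0.92, "account_access": 0.97}
new_release = {"billing": 0.96, "refunds": 0.85, "account_access": 0.96}

print(find_regressions(frozen_baseline, new_release, tolerance=0.02))
```

Raising on a missing category is deliberate: a release that quietly stops being scored on one class should block the comparison, not pass it.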

Methodology

Declared frame: evaluation is procurement, and the eval set is the buyer's evidence file. The dossier maps eight components of an ungameable eval against the tooling that supports each one, layers in pricing and isolation posture sampled June 1, 2026, and treats AI vendor benchmarks as adversarial evidence rather than friendly evidence.

Sources consulted: vendor documentation and pricing pages for Braintrust, LangSmith, Weave, Promptfoo, Inspect AI, OpenAI Evals, and Helicone; LLM-as-judge research (MT-Bench, G-Eval, GPTScore, Prometheus); RAG-specific eval framework documentation (Ragas, DeepEval); human-annotation tooling documentation (Label Studio, Argilla, Snorkel); adversarial and safety frameworks (Garak, Promptmap, Promptfoo red-team, HELM Safety, MLCommons AILuminate); customer case studies (Dropbox on Braintrust, Notion on Braintrust, Surge AI on Anthropic); academic literature on benchmark contamination (OpenAI's GPT-4 Technical Report, HumanEval analyses); public procurement-clause guidance (Society for Computers and Law, Forum for Cooperation on AI, Shumaker benchmark-clause article); ARC Prize evaluation design as the public reference for semi-private and private acceptance scoring.

In scope: procurement-grade evaluation programs for AI vendors covering models, agent platforms, copilots, support bots, and coding assistants where the buyer runs between 200 and 5,000 acceptance examples per decision and refreshes continuously from production.

Out of scope: model-development evaluation for research labs, alignment evaluations against catastrophic risk thresholds (covered separately in lab responsible-scaling frameworks), and consumer-facing crowd-sourced rating systems.

Sources

  1. Braintrust — How Dropbox built an evaluation pipeline for AI search
  2. Braintrust — How Notion uses Braintrust to ship AI features faster
  3. Braintrust — Experiments and evals
  4. Braintrust — AI evaluations guide
  5. Braintrust — Self-hosting architecture
  6. Braintrust — Human review
  7. Braintrust — Pricing
  8. LangSmith — How to evaluate agentic applications
  9. LangSmith — Manage datasets in the LangSmith UI
  10. LangSmith — Dataset versioning
  11. LangSmith — Evaluation concepts
  12. LangSmith — Regression testing
  13. LangSmith — Pricing
  14. Promptfoo — Assertions and metrics
  15. Promptfoo — LLM rubric assertion
  16. Promptfoo — Red team configuration
  17. Promptfoo — Sharing eval results
  18. Promptfoo — Deployment options
  19. Promptfoo — Pricing
  20. OpenAI — OpenAI to acquire Promptfoo
  21. Inspect AI — Inspect AI documentation
  22. OpenAI — Evals
  23. OpenAI Cookbook — Evaluation guide
  24. OpenAI Help Center — Sharing feedback, evaluation and fine-tuning data, and API inputs and outputs with OpenAI
  25. OpenAI — GPT-4 Technical Report
  26. Weights & Biases — Weave evaluations
  27. Weights & Biases — Self-Managed deployments
  28. Weights & Biases — Pricing
  29. Helicone — Experiments
  30. Helicone — Open-source observability and integrations
  31. Helicone — Pricing
  32. Ragas — Available metrics
  33. Ragas — Align LLM as judge with expert judgments
  34. DeepEval — Introduction and metrics
  35. NVIDIA — Garak LLM vulnerability scanner
  36. Promptmap — Prompt injection and jailbreak testing
  37. MLCommons — AILuminate benchmark
  38. Stanford CRFM — HELM Safety
  39. Label Studio — Human consensus and inter-annotator agreement workflows
  40. Argilla — Argilla for human and AI feedback
  41. Snorkel AI — Programmatic labeling and labeling functions
  42. Surge AI — Anthropic RLHF case study
  43. Reuters — Google, Scale AI's largest customer, plans split after Meta deal, sources say
  44. Zheng et al. — Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
  45. Liu et al. — G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment
  46. Fu et al. — GPTScore: Evaluate as You Desire
  47. Kim et al. — Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models
  48. Zhang et al. — Examining Coding Performance Mismatch on HumanEval
  49. Society for Computers and Law — AI Group launches Artificial Intelligence Contractual Clauses
  50. Forum for Cooperation on AI — Risk Management Framework for Procuring AI Systems
  51. Shumaker — The Artificial Intelligence Benchmark: The Most Important Clause You've Never Used, Part 1
  52. Colin S. Levy — Contracting with AI Vendors: A Practical Guide for Lawyers
  53. Eugene Yan — Product evals for LLM applications
  54. ARC Prize — ARC-AGI private and semi-private evaluation design
