[ATLAS] June 2, 2026 · 27 min read

Classify the Work, Not the Model

A reference dossier for the engineering manager, applied-AI lead, or founder already paying for two or three model vendors: the ten task classes a team runs every week, the model tier that earns each one, and the routing decisions that decide whether the monthly bill grows linearly with usage or grows like a horror story.

TL;DR

The decision isn't Claude versus GPT versus Gemini. That decision sits in vendor procurement. The decision here is which class the work belongs to, and which tier of model the class can ride on without producing a quality regression a human will catch a month later in a customer-facing channel.

The right model is the cheapest one that doesn't fail the task. Frontier models are the easy default and the most common overspend, while cheap models are the easy savings and the most common silent regression. The fix is to classify the work, not the model.

Ten task classes sit underneath every production AI stack. Structured extraction, classification, constrained summarization, and shallow code assist belong to the efficiency tier. Multi-source synthesis, open-ended generation, and tool-using workflow agents belong to the mid tier. Autonomous code, hard reasoning with explicit deliberation, and high-risk agentic loops belong to the frontier. Vision splits by difficulty: layout-aware interpretation sits on the mid tier, and simple image classification sits on the efficiency tier.

A starting stack pairs one frontier model the team trusts for high-stakes work with one mid-tier default and one efficiency-tier workhorse for the long tail.

Frontier tier:

Claude Opus 4.8 for autonomous engineering and high-stakes synthesis
GPT-5.5 for professional analysis and structured agentic work
Gemini 2.5 Pro for long-context reasoning when frontier price matters

Mid tier:

Claude Sonnet 4.6 as the balanced default for synthesis, writing, and agents
GPT-5 for reasoning-heavy work below the GPT-5.5 ceiling
Gemini 2.5 Flash for high-volume mid calls and multimodal work

Efficiency tier:

Claude Haiku 4.5 for fast Claude-compatible extraction and sub-agents
Gemini 2.5 Flash-Lite for extraction, classification, and simple vision
GPT-5.4 nano for OpenAI-native cheap defaults
DeepSeek V4 Flash for extreme-cost workloads with vendor-risk acceptance

Monthly model spend on a small team runs $200 to $3,000 when the routing is honest; the same workload run frontier-by-default lands at $10,000 to $50,000. The gap is the routing.

Classifying the calls is the team's job, not the platform's. A router can encode the decision; it can't make it.

Three Teams That Paid for the Same Mistake

A 30-seat company put GPT-5.5 behind its support triage at a steady three-million-call month. The work was classification: read a ticket, return one of nine labels and a confidence score, hand off below the threshold. The bill came in at $4,200, and the founder asked engineering why a classification task cost more than the customer-success seat. The answer was that nobody had ever picked a model for that endpoint; the first engineer used GPT-5.5 because the docs had a copy-paste example, and the routing never got revisited. Running the same workload on Gemini 2.5 Flash-Lite costs about $200 a month at the published $0.10 input and $0.40 output per million, and the eval set showed no meaningful regression on the existing nine-label taxonomy. The failure moment wasn't a hallucination; it was the line item nobody owned. Call it frontier-by-default.

A content team running an automated draft pipeline on Claude Haiku 4.5 noticed editorial scoring drifting downward across the quarter. Haiku is fast and cheap, and on summary work it's competent. The drafts looked clean on first read; the editors were catching subtle factuality issues two reviews deep, and the editor-rework time per draft crept from 18 minutes to 42 minutes. The savings on the model line ($1 input and $5 output per million versus Sonnet's $3 and $15) were under $4,000 a month. The lost editor time, measured against fully-loaded hourly cost, was $11,000. The team had downgraded a synthesis-and-voice job into a class that only earns the slot when the output is templated. The fix was to escalate authored work back to Sonnet 4.6 and leave Haiku on the constrained-summary work where it still ships clean. Call it the cheap-model regression.

A 200-engineer organization ran Claude Code at scale, on a flat policy that routed every loop through Opus 4.8 because "coding is high stakes." Anthropic's own published Claude Code data shows enterprise deployments average around $13 per developer per active day with 90 percent below $30. Run the math on 200 developers and 20 active days: about $52,000 a month, with the curve climbing as autonomy scales. The team's actual bill cleared $70,000 because they hadn't separated inline completion (which doesn't need Opus), function-level edits (which mostly don't need Opus), and autonomous multi-file engineering (which is where Opus earns its slot). Once they split shallow code assist down to GPT-5 mini and held Opus for the planning-and-multi-file loops, the bill came back inside the $25,000 to $30,000 range without a measurable regression on the work that mattered. Call it one-tier-fits-all.
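The arithmetic in that estimate is short enough to check by hand. A minimal sketch using only the figures quoted above; the variable names are illustrative, and it assumes the $70,000 bill covers the same 200 developers and 20 active days.

```python
# Figures quoted in the paragraph above; variable names are illustrative.
developers = 200
active_days_per_month = 20
avg_cost_per_dev_day = 13  # USD, the published average cited above

developer_days = developers * active_days_per_month
baseline_monthly = developer_days * avg_cost_per_dev_day
print(f"Baseline at the published average: ${baseline_monthly:,}")

# The same division applied to a $70,000 bill gives the implied daily cost.
actual_monthly = 70_000
implied_per_dev_day = actual_monthly / developer_days
print(f"Implied cost per developer-day: ${implied_per_dev_day:.2f}")
```

Under that assumption, a bill that clears $70,000 implies at least $17.50 per developer-day, which is above the cited average and still under the $30 mark.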

Three teams, three different mistakes, one category. They picked the model before they classified the work, and then they let the choice ossify because nobody owned the routing as a recurring decision rather than a one-time setup.

The lifecycle anchor is the model-deprecation cycle. Mistral Large 2 is retired, DeepSeek's older deepseek-chat and deepseek-reasoner aliases are being deprecated in favor of V4 Flash and V4 Pro, and OpenAI's GPT-5 nano now sits behind GPT-5.4 nano for most new speed-sensitive workloads. Hard-coding a model ID without a deprecation plan is its own failure mode; the routing layer needs to abstract the model from the call site, or the call site has to be renamed every quarter.
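One way to keep model IDs out of call sites is a routing table keyed by task class. The sketch below is an assumption about shape, not a description of any gateway's API: the `Route` type, the `resolve` helper, and every model identifier in it are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    model_id: str      # placeholder identifier, not a confirmed vendor ID
    tier: str          # "efficiency", "mid", or "frontier"
    fallback_id: str   # where traffic goes if the primary is retired


# Hypothetical routing table: call sites name a task class, never a model.
ROUTES = {
    "classification": Route("flash-lite-placeholder", "efficiency", "nano-placeholder"),
    "multi_source_synthesis": Route("sonnet-placeholder", "mid", "gpt5-placeholder"),
    "autonomous_code": Route("opus-placeholder", "frontier", "gpt55-placeholder"),
}

RETIRED = {"flash-lite-placeholder"}  # pretend the vendor deprecated this ID


def resolve(task_class: str) -> str:
    """Return the model ID a call site should use for this task class."""
    route = ROUTES.get(task_class)
    if route is None:
        raise KeyError(f"no route for task class {task_class!r}; classify it first")
    return route.fallback_id if route.model_id in RETIRED else route.model_id


print(resolve("classification"))   # falls back because the primary is retired
print(resolve("autonomous_code"))
```

With this shape, a deprecation becomes a one-line change to the table rather than a rename at every call site.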

Figure 1 — The default tier earning each task class today, with the cost band per million calls and the routing rule that holds the slot.

If You Read Nothing Else

Ten task classes sit underneath every production AI stack, and each one has a default tier. Structured extraction, classification, constrained summarization, and shallow code assist run cleanly on the efficiency tier (Gemini Flash-Lite, Claude Haiku, GPT-5.4 nano, DeepSeek V4 Flash) at $200 to $2,500 per million calls. Multi-source synthesis, open-ended generation, tool-using agents, and vision interpretation run on the mid tier (Claude Sonnet 4.6, Gemini 2.5 Flash, GPT-5) at $2,000 to $30,000 per million calls. Autonomous code, high-risk agentic loops, and hard reasoning with explicit deliberation run on the frontier (Claude Opus 4.8, GPT-5.5, Gemini 2.5 Pro, o-series) at $30,000 to $600,000 per million calls. The number worth tracking is monthly spend by task class, not by model.

The 30-Day Routing Cycle

Days 1 to 7: Classify the calls, not the models. Pull a representative week of production AI calls and tag each one by task class against the ten in this dossier. Record the current model, the actual token profile (input and output), the latency the consumer felt, and the failure rate the team accepts. Resist routing changes this week. The week-one output is a one-page table showing where the team's current model spend lands by class, with the current model in one column and the natural target tier in the next. The first surprise is usually that one or two endpoints are eating most of the bill, and that roughly a third of the endpoints are running on frontier when their target tier was always efficiency.
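The week-one table is a group-by over the call log. A minimal sketch, assuming the log can be exported as rows of task class, model, and token counts; the log rows below are invented for illustration, and the prices are the per-million-token figures from this dossier's pricing table.

```python
from collections import defaultdict

# USD per million tokens (input, output), copied from this dossier's pricing table.
PRICES = {
    "GPT-5.5": (5.00, 30.00),
    "Claude Sonnet 4.6": (3.00, 15.00),
    "Gemini 2.5 Flash-Lite": (0.10, 0.40),
}

# Hypothetical log rows: (task_class, model, input_tokens, output_tokens).
CALL_LOG = [
    ("classification", "GPT-5.5", 400, 20),
    ("classification", "GPT-5.5", 350, 20),
    ("structured_extraction", "Gemini 2.5 Flash-Lite", 1200, 300),
    ("multi_source_synthesis", "Claude Sonnet 4.6", 6000, 800),
]


def call_cost(model, input_tokens, output_tokens):
    """Cost of one call in USD at the listed per-million-token prices."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


def spend_by_class(log):
    """Sum call cost per task class, which is the number the dossier says to track."""
    totals = defaultdict(float)
    for task_class, model, tokens_in, tokens_out in log:
        totals[task_class] += call_cost(model, tokens_in, tokens_out)
    return dict(totals)


# Largest spend first, so the one or two expensive classes surface at the top.
for task_class, usd in sorted(spend_by_class(CALL_LOG).items(), key=lambda kv: -kv[1]):
    print(f"{task_class:<24} ${usd:.6f}")
```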

Days 8 to 14: Build the eval harness for the top five. Pick the three classes with the highest call volume and the two classes with the highest failure cost. Build a private eval set of 50 to 200 real examples per class, scored by a human or by a judge model the team doesn't share with the vendor under test. Score the current model and the candidate downgrade or upgrade against the same set. The Friday deliverable is a routing-decision spreadsheet with the per-class quality delta and the per-million cost delta, plus a recommendation for each of the five.
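Each row of that spreadsheet is two subtractions over a shared eval set. A rough sketch under simplifying assumptions: exact-match scoring stands in for the human or judge-model scoring described above, the two models are stub functions, and the per-million-call costs are invented.

```python
def accuracy(model_fn, examples):
    """Fraction of examples where the model output equals the expected label."""
    hits = sum(1 for text, expected in examples if model_fn(text) == expected)
    return hits / len(examples)


def compare(current, candidate, examples):
    """Each model is a dict: {'fn': callable, 'usd_per_million_calls': float}."""
    return {
        "quality_delta": accuracy(candidate["fn"], examples) - accuracy(current["fn"], examples),
        "cost_delta_per_million": candidate["usd_per_million_calls"]
        - current["usd_per_million_calls"],
    }


# Stub models and a four-example set stand in for real calls and real data.
EXAMPLES = [("refund please", "refund"), ("app crashes", "bug"),
            ("how do I export", "how-to"), ("buy 50 seats", "sales")]
ANSWERS = dict(EXAMPLES)

current = {"fn": lambda text: ANSWERS[text], "usd_per_million_calls": 1400.0}
# The candidate mislabels the export question, so it scores 3 of 4.
candidate = {"fn": lambda text: "bug" if "export" in text else ANSWERS[text],
             "usd_per_million_calls": 70.0}

row = compare(current, candidate, EXAMPLES)
print(row)
```

Both models must be scored against the same examples; a delta computed across different sets says nothing about the swap.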

Days 15 to 30: Ship the routing and watch the bill. Configure routing through a gateway (LiteLLM, OpenRouter, Portkey, Helicone) or through vendor SDKs with a routing module the team owns. Send each class to its default. Hold the eval on a daily cadence for the first week, then weekly. Watch the actual monthly bill against the prior month after two complete billing cycles, and only recognize the saving after the regression test has held for both. If the saving is real, write it down by class as part of the routing config so the next engineer who touches the endpoint sees the decision in the same place as the code.

The Ten Classes and the Tiers That Earn Them

Each sub-case below follows the same template: the slot names the default tier, the "what ships clean" bullets describe the deliverable, the ceiling names the failure mode, and the action line is the Friday deliverable.

Structured Extraction and Schema Filling

The work is pulling named fields from documents, emails, transcripts, or chat logs against a known schema. Inputs and outputs are deterministic; the model is doing parsing, not judgment.

The slot belongs to the efficiency tier. Gemini 2.5 Flash-Lite is the price-leader default at $0.10 input and $0.40 output per million; DeepSeek V4 Flash is close behind when vendor risk is acceptable. GPT-5.4 nano holds the slot for teams already running OpenAI-native. Mistral OCR sits in front of the LLM call when the document is scanned and layout matters more than language.

What ships clean:

A JSON object that parses on the first attempt, with every required field present and every optional field either filled or explicitly null
A confidence or abstention signal the team can route on, not a silent guess
A validation step that rejects malformed output and routes failures to a retry on the same tier before escalating
A per-million cost the team can defend against the alternative (manual extraction, OCR-only, regex-only)
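The validate, retry, then escalate step can be sketched in a few lines. This is an illustration under assumptions: the three-field schema is hypothetical, and `cheap` and `stronger` are stub callables standing in for real model calls.

```python
import json
from typing import Callable, Optional, Tuple

REQUIRED_FIELDS = ("vendor", "amount", "due_date")  # hypothetical schema


def validate(raw: str) -> Optional[dict]:
    """Return the parsed record, or None if it is malformed or incomplete."""
    try:
        record = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(record, dict):
        return None
    if any(field not in record for field in REQUIRED_FIELDS):
        return None
    return record


def extract(document: str, cheap: Callable[[str], str], stronger: Callable[[str], str],
            retries: int = 1) -> Tuple[Optional[dict], str]:
    # First attempt plus `retries` retries on the same tier.
    for _ in range(1 + retries):
        record = validate(cheap(document))
        if record is not None:
            return record, "efficiency"
    # Only then escalate one tier; anything still invalid goes to a human.
    record = validate(stronger(document))
    return (record, "escalated") if record is not None else (None, "human_review")


# Stub models stand in for real API calls.
bad = lambda doc: "not json"
good = lambda doc: '{"vendor": "Acme", "amount": 120.5, "due_date": null}'
print(extract("invoice text", bad, good))
```

Validation of this kind catches malformed output only. A parseable record with wrong values passes it, which is why the eval set still matters.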

The ceiling appears at extraction where the schema is genuinely ambiguous. When the model has to decide whether "Acme, Inc." and "Acme Inc" are the same legal entity, or which of two phone numbers in the document is the billing contact, the work has graduated to multi-source synthesis. The named failure mode is the silent-coercion error: the cheap model produces a parseable record with the wrong values.

If you start this week, pick one extraction endpoint with at least 100,000 monthly calls. Build a 100-example eval against your current model, run the same eval on Gemini Flash-Lite, and ship the swap by Friday if the regression is under your tolerance.

Examples of what this looks like in production:

Source	Fields	Tier
Inbound invoice email	Vendor, amount, due date, line items, PO reference	Gemini 2.5 Flash-Lite
Customer signup form	Name, email, role, company size, intent	GPT-5.4 nano
Scanned contract pages	Parties, effective date, term, renewal clause	Mistral OCR plus Haiku 4.5

Classification and Routing

The work is putting an inbound item into one of a small set of labels: support intent, lead temperature, ticket category, expense category, content type. Judgment is real but bounded; the taxonomy is small enough to enumerate.

The slot belongs to the efficiency tier. Gemini 2.5 Flash-Lite is the default at $0.10 input and $0.40 output per million; GPT-5.4 nano and Llama 3.3 70B via OpenRouter compete on cost when the label set is stable and the eval clears them.

What ships clean:

A label, a confidence score, and a one-sentence rationale, returned as structured JSON
A taxonomy with fewer than 15 labels, with three to five labeled examples per class plus an "unknown / needs human" class
A confidence threshold the team can tune, with everything below the threshold queued for human review
A weekly misroute review where production examples that the model labeled wrong get added to the eval set
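The threshold and the abstention class reduce to one small function over the model's structured output. A minimal sketch: the label set mirrors the support-ticket example in this section, the 0.85 threshold is a starting point to tune rather than a recommendation, and the `route` helper and the `human_review` queue name are hypothetical.

```python
LABELS = {"refund", "bug", "how-to", "sales", "spam", "unknown"}
CONFIDENCE_THRESHOLD = 0.85  # illustrative starting point; tune against the eval set


def route(prediction: dict) -> str:
    """Return the queue an item goes to, given the model's structured output."""
    label = prediction.get("label")
    confidence = prediction.get("confidence", 0.0)
    # Off-taxonomy labels and explicit abstentions always go to a person.
    if label not in LABELS or label == "unknown":
        return "human_review"
    # A confident-looking label below the threshold is still not routed.
    if confidence < CONFIDENCE_THRESHOLD:
        return "human_review"
    return label


print(route({"label": "refund", "confidence": 0.93, "rationale": "asks for money back"}))
print(route({"label": "refund", "confidence": 0.61, "rationale": "mentions a charge"}))
print(route({"label": "billing", "confidence": 0.99, "rationale": "not in taxonomy"}))
```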

The ceiling appears when the team expands the taxonomy to cover every edge case. Past 15 labels, the work is investigation, not classification, and the slot has moved up a tier. The named failure mode is the overconfident misroute on the ambiguous intent: the cheap model picks a confident wrong label because the prompt didn't give it a way to abstain.

If you start this week, pull 200 real labeled items, run them through the cheapest model that supports structured output, publish precision and recall by label, and route only above 0.85 confidence by Friday.

Examples of what this looks like in production:

Inbound	Labels	Tier
Support tickets	refund / bug / how-to / sales / spam / unknown	Gemini 2.5 Flash-Lite
Inbound leads	hot / warm / cold / out-of-ICP / spam	GPT-5.4 nano
Expense receipts	meals / travel / software / contractor / other	Haiku 4.5

Constrained Summarization and Rewrite

The work is producing a summary at a known length, tone, and structure. Customer-visible ticket summaries, email drafts of fixed shape, daily stand-up notes from raw transcripts, meeting recaps in a template the team has settled on.

The slot belongs to the mid tier when the output is customer-visible, and the efficiency tier when the output is internal-only. Gemini 2.5 Flash earns customer-facing summaries at $0.30 input and $2.50 output per million; Haiku 4.5 holds the slot when the tone has to feel Claude-shaped. GPT-5 mini works when the team is OpenAI-native and the editor doesn't need a particular voice.

What ships clean:

A summary at the requested length, tone, and structure, with no invented quotes or fabricated names
Factual coverage the team can verify against the source in under two minutes
An editor-rework time on first review under the bar the team agreed to in week one
A consistent format that downstream systems (CRM fields, email templates, ticket panels) can consume without manual cleanup

The ceiling appears when the summary has to take a position the source doesn't take. Compressing a customer call into a sales rec, condensing three meetings into a roadmap update, or distilling a thread into an executive view requires synthesis, not summarization. The named failure mode is the cheap-model regression: the model produces competent mush that passes a glance and fails the second read.

If you start this week, pick one summary endpoint with at least 1,000 monthly calls, build a 50-example eval with editor-rework time as the metric, and ship the swap by Friday if the time is within tolerance.

Examples of what this looks like in production:

Input	Output	Tier
Sales call transcript	Two-paragraph CRM activity note	Gemini 2.5 Flash
Inbound support ticket thread	Three-line summary for the agent panel	Haiku 4.5
Weekly engineering Slack threads	Bulleted stand-up note for the PM	GPT-5 mini

Multi-Source Synthesis and Reconciliation

The work is reading across multiple sources, identifying agreement and conflict, and producing a judgment. Reconcile an invoice against contract terms. Compare a vendor pitch against the team's evaluation rubric. Cross-check a customer's stated issue against their account history and the product's release notes.

The slot belongs to the mid tier as the default and the frontier when the answer drives a high-stakes decision. Claude Sonnet 4.6 at $3 input and $15 output per million is the workhorse. Gemini 2.5 Pro and GPT-5 sit at similar mid-frontier pricing for cost-sensitive variants, and Opus 4.8 holds the slot when the team needs the deeper synthesis Sonnet doesn't quite reach.

What ships clean:

A reconciliation that names each source, identifies points of agreement and disagreement, and states uncertainty where the sources don't decide
A bounded output schema that downstream systems can consume (a discrepancy list with evidence links, a structured comparison table, a recommendation with rationale)
A human-readable rationale that a reviewer can audit without re-reading every source
An eval that penalizes false reconciliation (the model makes inconsistent sources look consistent) more heavily than other failure modes, because that failure is the one the human will miss
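Weighting one failure mode above the others can be made explicit in the scoring function. The sketch below is one possible scheme, not a standard metric: the failure tags and penalty weights are hypothetical, and each example's penalty is capped at the heaviest single weight.

```python
# Hypothetical penalty weights; false reconciliation is deliberately the heaviest.
PENALTIES = {
    "false_reconciliation": 5.0,
    "missing_attribution": 2.0,
    "schema_error": 1.0,
}


def eval_score(results):
    """results: one list of failure tags per example (empty list means clean).

    Returns 1.0 for a clean run and 0.0 when every example hits the cap.
    """
    cap = max(PENALTIES.values())
    total = sum(min(cap, sum(PENALTIES[tag] for tag in failures)) for failures in results)
    return 1.0 - total / (cap * len(results))


# Four reviewed examples: two clean, one schema slip, one false reconciliation.
reviewed = [[], ["schema_error"], ["false_reconciliation"], []]
print(round(eval_score(reviewed), 3))
```

Under these weights, a single false reconciliation costs as much as five schema errors, which is the ordering the eval is meant to enforce.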

The ceiling appears when the synthesis requires multi-step reasoning the model can't show its work for. When the answer depends on a derivation, an explicit plan, or a high-stakes judgment, the work has moved into hard reasoning with explicit deliberation. The named failure mode is the false reconciliation: the model paves over a real conflict because the prompt asked for a single answer.

If you start this week, pick one synthesis endpoint with high decision-cost-per-output, build a 30-example eval scored by a human on conflict-detection and source-attribution accuracy, and validate Sonnet 4.6 as the default before considering frontier escalation.

Examples of what this looks like in production:

Sources	Output	Tier
Invoice, contract, prior payments	Discrepancy report with evidence links	Claude Sonnet 4.6
Vendor questionnaire, internal policy, prior audit	Risk score plus reviewer packet	GPT-5 or Gemini 2.5 Pro
Customer ticket, account history, release notes	Recommended response with cited rules	Sonnet 4.6

Shallow Code Assist

The work is inline completion, function-level edits, small test additions, localized diffs. The model is helping the engineer move within a file or a small set of files; the engineer owns the loop.

The slot belongs to the efficiency tier, where small general-purpose and code-specialized models compete. GPT-5 mini at $0.25 input and $2 output per million is strong; Codestral and Devstral hold their seats when the team prefers an open-weight option; DeepSeek V4 Flash is the price floor when vendor risk is acceptable.

What ships clean:

A diff that compiles, passes the existing tests, and follows the repo's style without manual cleanup
A response time under the bar that keeps the engineer in flow (under two seconds for inline completion, under ten seconds for function-level edits)
A failure mode where the engineer can see the suggestion is wrong before applying it, not after
A cost per accepted suggestion that the team can defend against engineering time saved

The ceiling appears when the edit needs to understand more than one file. The moment the suggestion has to reason about an import chain, a database schema, or a service boundary, the work has moved into autonomous code and multi-file engineering. The named failure mode is the looks-right-locally edit: the model produces a change that compiles in isolation and breaks an adjacent call site.

If you start this week, pick one IDE assistant endpoint, instrument suggestion acceptance rate by file size and edit scope, and route inline completion to GPT-5 mini while reserving the frontier for the multi-file loop.

Examples of what this looks like in production:

Edit shape	Target	Tier
Inline completion in a function body	Repo-aware completion under 200 ms	GPT-5 mini
Function-level refactor in one file	Working diff with passing tests	Codestral or GPT-5 mini
Test scaffolding for an existing function	Spec file with realistic assertions	DeepSeek V4 Flash

Autonomous Code and Multi-File Engineering

The work is the loop. The model plans, navigates the repo, edits multiple files, runs tests, debugs, and decides when to stop. The engineer reviews the result; the model owns the path.

The slot belongs to the frontier. GPT-5.5 and Claude Opus 4.8 are the only models that consistently hold the loop without thrashing on the harder cases. Sonnet 4.6 holds it with stricter bounds, smaller surface area, and tighter step caps, and saves roughly 40 percent of the cost when the task fits.

What ships clean:

A plan the engineer can read before any edits land
An edit sequence that respects existing patterns in the repo (style, structure, test conventions)
A test run after the edits with explicit failure handling, not silent skip
A clean stop, with no looping retries past the agreed step budget
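A clean stop means the loop checks both budgets on every step. A minimal sketch, assuming each agent step can report whether it finished and what it cost; `run_with_budget` and the stub steps are hypothetical, and a real loop would call the model and the test runner inside `step_fn`.

```python
def run_with_budget(step_fn, max_steps, cost_ceiling_usd):
    """step_fn(step_number) -> (done, cost_usd). Stops on whichever limit hits first."""
    spent = 0.0
    for step in range(1, max_steps + 1):
        done, cost = step_fn(step)
        spent += cost
        if done:
            return {"status": "done", "steps": step, "cost_usd": spent}
        if spent >= cost_ceiling_usd:
            return {"status": "cost_ceiling_hit", "steps": step, "cost_usd": spent}
    return {"status": "step_budget_hit", "steps": max_steps, "cost_usd": spent}


# Stub step that never finishes and costs $0.50 per step, imitating a thrashing loop.
thrashing = lambda step: (False, 0.50)
print(run_with_budget(thrashing, max_steps=30, cost_ceiling_usd=5.0))

# Stub step that finishes on step 4 at $0.25 per step.
converging = lambda step: (step == 4, 0.25)
print(run_with_budget(converging, max_steps=30, cost_ceiling_usd=5.0))
```

The thrashing stub stops at step 10, when the $5 ceiling is reached, well before its 30-step budget runs out.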

The ceiling appears at long-horizon autonomy without supervision. The model can plan and edit; it can't yet own the architectural choice that decides whether the work was worth doing. The named failure mode is loop decay: the model keeps trying variants past the point where a human would stop, racks up retry tokens, and either lands a worse answer or no answer at all. Anthropic's Claude Code data shows the average enterprise developer-day cost at $13 with 90 percent under $30; the team that hits $50 a day is paying for loop decay, not loop depth.

If you start this week, pick one autonomous-code endpoint, set an explicit step budget and a token-cost ceiling per task, route to Opus 4.8 or GPT-5.5, and review the first 20 completed tasks for loop discipline before scaling.

Examples of what this looks like in production:

Task	Step budget	Tier
Add a feature across 3 to 6 files with tests	30 steps, $5 ceiling	Claude Opus 4.8
Track down a flaky test across the test suite	50 steps, $10 ceiling	GPT-5.5
Convert a legacy module to a new pattern	40 steps, $7 ceiling	Sonnet 4.6 with strict bounds

Open-Ended Generation

The work is producing prose with voice, structure, and audience fit. Long-form drafts, marketing pages, creative briefs, internal memos, newsletter copy, blog posts.

The slot belongs to the mid tier as the default, with the frontier reserved for thought leadership where the editor's time is the binding cost. Claude Sonnet 4.6 is the strongest mid for voice-aware drafting; GPT-5 mini and Gemini 2.5 Flash hold the slot for templated copy. Opus 4.8 and GPT-5.5 earn their slots only when the editor would otherwise spend more than the model-cost differential rewriting the draft.

What ships clean:

A draft with a coherent voice, an intentional structure, and audience-appropriate language
A factual baseline the editor can fact-check in under twice the writing time
A length within 10 percent of the requested target
An editor-rework time the team is willing to defend against alternative production paths (human-authored, partly automated, fully outsourced)

The ceiling appears when the work needs taste, point of view, or original argumentation. The model can compose; it can't yet write something a reader will remember. The named failure mode is competent mush: the model produces a draft that scans clean and says nothing.

If you start this week, pick one drafting endpoint, score 30 outputs against editor-rework time and an audience-fit rubric, and choose the cheapest tier that holds the editor's time within tolerance.

Examples of what this looks like in production:

Output	Audience	Tier
Templated lifecycle email	Existing customer	Gemini 2.5 Flash or GPT-5 mini
Product launch blog post	Existing audience plus prospects	Claude Sonnet 4.6
Editorial argument with a position	Practitioner audience	Sonnet 4.6 or Opus 4.8

Tool-Using Workflow Agents

The work is the agent picking which tool to call, in what order, with what inputs, and when to stop. The agent has a constrained toolbelt, a step budget, and a stopping rule.

The slot belongs to the mid tier for most production agent loops, and the frontier for loops where side-effect cost is high. Claude Sonnet 4.6 holds the default at the lowest credible mid-tier price; Opus 4.8 or GPT-5.5 earn the slot when one wrong tool call is expensive (a CRM write, a payment action, a customer-facing email).

What ships clean:

Correct tool selection on the first try in over 90 percent of cases, measured against an eval set the team owns
Tool-call arguments that pass schema validation without a retry
A bounded retry policy that escalates to a human rather than looping on the same failure
A deduplication or idempotency guarantee on side-effectful tool calls, so a duplicate call doesn't post a duplicate refund or send a duplicate email
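The idempotency guarantee is usually a key derived from the task, the tool, and the arguments. A sketch under assumptions: `IdempotentToolRunner` and `issue_refund` are hypothetical, and the in-memory dict stands in for the durable, shared store a production system would need.

```python
import hashlib
import json


class IdempotentToolRunner:
    """Runs each (task, tool, arguments) combination at most once.

    The in-memory dict is an assumption for the sketch; a production system
    would need a durable, shared store.
    """

    def __init__(self):
        self._results = {}

    @staticmethod
    def _key(task_id, tool_name, args):
        # sort_keys makes the key independent of argument order.
        payload = json.dumps({"task": task_id, "tool": tool_name, "args": args},
                             sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def call(self, task_id, tool_name, tool_fn, args):
        key = self._key(task_id, tool_name, args)
        if key not in self._results:
            self._results[key] = tool_fn(**args)
        return self._results[key]  # repeated calls get the stored result


refunds_issued = []


def issue_refund(order_id, amount):
    """Stub side-effectful tool: records that a refund went out."""
    refunds_issued.append((order_id, amount))
    return {"status": "refunded", "order_id": order_id}


runner = IdempotentToolRunner()
for _ in range(3):  # the agent retries the same call three times
    runner.call("task-42", "issue_refund", issue_refund, {"order_id": "A1", "amount": 20})
print(len(refunds_issued))
```

Including the task ID in the key matters: without it, two legitimate refunds of the same amount on different tasks would collapse into one.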

The ceiling appears when the loop length grows past the model's reliable tool-discipline horizon. Past 10 to 15 turns, even frontier models lose the thread on side-effect tracking. The named failure mode is the compounding side effect: one wrong tool call poisons every step that follows, and the loop drives a cleanup operation that costs more than the agent saved.

If you start this week, pick one agent endpoint, run a 50-case eval focused on tool-selection accuracy and argument validity, and set the agent's step budget plus token ceiling before any production traffic.

Examples of what this looks like in production:

Loop	Tools	Tier
Customer-refund eligibility agent	Get account, get policy, search exceptions, draft response	Claude Sonnet 4.6
Vendor-onboarding agent	Search vendor registry, draft questionnaire, send for review	Opus 4.8 or GPT-5.5
Internal-research utility agent	Search, summarize, post to Slack	Gemini 2.5 Flash

Vision, Document Layout, and Screenshot Understanding

The work is interpreting images, documents, or screenshots. Reading invoices, extracting tables, understanding UI state, classifying product images, interpreting receipts.

The slot belongs to the mid tier for layout-aware work and the efficiency tier for simple image classification. Gemini 2.5 Flash holds the default for layout, tables, and UI interpretation at $0.30 input and $2.50 output per million. Flash-Lite earns the slot for low-risk image classification. Sonnet 4.6 holds the slot for complex document reasoning where the model has to argue across the layout.

What ships clean:

A correct region or layout interpretation, with bounding-box accuracy the downstream system can use
Table extraction with the right cells in the right rows and columns
A confidence signal or refusal when the image is ambiguous
A cost-per-page or cost-per-image the team can defend against OCR-only or fully-manual alternatives

The ceiling appears on small text, tight layout, or charts where the spatial relationship is the meaning. Cheap vision can label "this is a chart"; it can miss what the chart actually shows. The named failure mode is the spatial miss: the model gets the gist and loses the data point.

If you start this week, pick one vision endpoint, build a 50-image eval against ground-truth labels and bounding boxes, and route layout-aware work to Gemini 2.5 Flash while keeping Flash-Lite for the simple image classification.

Examples of what this looks like in production:

Input	Output	Tier
Scanned invoice	Line-item table with vendor, amount, GL code	Gemini 2.5 Flash
Product photo from a marketplace listing	Category label plus condition score	Gemini 2.5 Flash-Lite
Customer-support screenshot of a broken UI	Diagnosis with referenced element and action	Claude Sonnet 4.6

Hard Reasoning With Explicit Deliberation

The work is math, planning, root-cause analysis, multi-step derivation. The model has to think before it answers, and the answer depends on the thinking holding up under paraphrase.

The slot belongs to the frontier, starting from its cheaper reasoning models. Gemini 2.5 Pro is the price-to-frontier leader when the answer fits its context handling. o4-mini holds the slot for OpenAI-native teams at $1.10 input and $4.40 output per million. GPT-5.5 and Opus 4.8 earn their slots only when error cost dominates the cost differential.

What ships clean:

A correct answer that holds when the prompt is paraphrased, and an explicit chain of reasoning the team can audit
A stable answer under input variation (the model doesn't pick a different conclusion when irrelevant details change)
An abstention or "I'm not sure" when the inputs don't justify a confident answer
A latency the consumer can tolerate (these models are slow by design)
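Paraphrase stability can be measured as the share of paraphrases that land on the same answer. A minimal sketch, assuming answers can be compared after simple normalization; `paraphrase_stability` is a hypothetical helper and the model is a stub that flips on one wording.

```python
from collections import Counter


def paraphrase_stability(model_fn, paraphrases):
    """Return the most common answer and the fraction of paraphrases that gave it."""
    answers = [model_fn(prompt).strip().lower() for prompt in paraphrases]
    top_answer, votes = Counter(answers).most_common(1)[0]
    return top_answer, votes / len(answers)


# Stub model: changes its conclusion when the prompt mentions "jurisdiction".
stub_model = lambda prompt: "Category B" if "jurisdiction" in prompt else "category a"

PARAPHRASES = [
    "Which tax category applies to this service?",
    "What category should this service be taxed under?",
    "Across each jurisdiction, which tax category applies to this service?",
]

answer, agreement = paraphrase_stability(stub_model, PARAPHRASES)
print(answer, round(agreement, 2))
```

Agreement below 1.0 does not say which answer is right. It only flags that the conclusion depends on wording, which is the case to send to a reviewer with domain knowledge.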

The ceiling appears when the prompt sounds hard but the answer is actually shallow, or when the answer requires real-world data the model doesn't have. The named failure mode is confident wrong reasoning: the model produces a clean derivation that's internally consistent and externally wrong, and the audit trail can't catch it without domain knowledge.

If you start this week, pick one reasoning endpoint where the wrong answer has measurable cost, build a 20-example eval with paraphrase variants, and validate Gemini 2.5 Pro or o4-mini against the frontier before paying the frontier premium.

Examples of what this looks like in production:

Question	Output	Tier
Reconcile a tax classification across jurisdictions	Recommendation with cited rules	Gemini 2.5 Pro
Root-cause an incident across logs and metrics	Hypothesis ranking with evidence	o4-mini or GPT-5.5
Plan a complex multi-party deal structure	Step plan with risk callouts	Claude Opus 4.8

What the Human Owns Regardless of Vendor

The engineering manager owns the task classification. A router can encode the routing decision; it can't tell the team which class a call belongs to. The classification gets written down per endpoint, lives in version control next to the prompt, and gets reviewed when the prompt or the consumer changes.

The function owner sets the failure tolerance per class. A 1 percent misroute might be fine on content tagging and catastrophic on refunds, compliance flags, or account deletion. The tolerance gets agreed before the eval is built, and the eval scores against the tolerance, not against an abstract "quality" number.

Finance owns the cost ceiling per task class. "$30 per million output tokens" doesn't mean anything until it's translated into the team's monthly bill at the team's actual call volume. The ceiling lives as a number the team can defend in a budget review, not as a vendor's pricing page.

The applied-AI lead owns the eval harness. Downgrades are only safe when regressions are measured, not felt. The eval set is private (not the vendor's benchmark), refreshes on a cadence shorter than the vendor's release cycle, and scores against the team's actual data with the team's actual rubric.

Engineering owns the vendor-exit plan, because deprecations and repricing are normal. Mistral Large 2 is retired, DeepSeek's older deepseek-chat and deepseek-reasoner aliases are being deprecated in favor of V4 Flash and V4 Pro, and OpenAI's GPT-5 nano now sits behind GPT-5.4 nano. The team's call sites either abstract the model behind a router or pay rewrite costs every quarter, and the choice should be deliberate.

The on-call owns the retry and loop budget. Agent costs explode through retries, tool calls, and repeated context stuffing, not through the first model call. Every agent endpoint has a step budget, a token ceiling, and an escalation path, and the on-call has the authority to lower any of them when the budget breaks.

The team that hands off classification to the vendor stops owning the routing. The vendor sells the tier it can charge for, and the team is the only entity that can decide which tier each call actually deserves.

Cost Calculus and Coexistence

Cheap on the model line isn't cheap in production. A free or near-free model wired to an unbounded agent can produce a four-figure monthly bill before anyone notices. A frontier model on the wrong endpoint can compound the bill until a finance review forces a conversation that should have happened in week one. The math below assumes a small-to-mid team running between 100,000 and 10,000,000 model calls a month; enterprise estates push every number up.

Start with the model input and output token line. Pricing sampled June 1, 2026, per million tokens:

Model	Input	Output
Claude Opus 4.8	$5.00	$25.00
Claude Sonnet 4.6	$3.00	$15.00
Claude Haiku 4.5	$1.00	$5.00
GPT-5.5	$5.00	$30.00
GPT-5	$1.25	$10.00
GPT-5 mini	$0.25	$2.00
GPT-5.4 nano	$0.20	$1.25
Gemini 2.5 Pro (≤ 200K context)	$1.25	$10.00
Gemini 2.5 Pro (> 200K context)	$2.50	$15.00
Gemini 2.5 Flash	$0.30	$2.50
Gemini 2.5 Flash-Lite	$0.10	$0.40
DeepSeek V4 Flash	$0.14	$0.28
Mistral Large 3	$0.50	$1.50
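Per-token prices translate into a per-million-calls figure once a token profile is fixed. A small sketch using a subset of the table above; the 1,000-input, 200-output profile is a hypothetical example, not a measured workload.

```python
# (input, output) USD per million tokens, from the table above.
PRICES = {
    "Claude Opus 4.8": (5.00, 25.00),
    "Claude Sonnet 4.6": (3.00, 15.00),
    "Claude Haiku 4.5": (1.00, 5.00),
    "GPT-5.5": (5.00, 30.00),
    "GPT-5 mini": (0.25, 2.00),
    "Gemini 2.5 Flash": (0.30, 2.50),
    "Gemini 2.5 Flash-Lite": (0.10, 0.40),
}


def usd_per_million_calls(model, input_tokens, output_tokens):
    """Cost of one million calls with the given per-call token profile.

    (tokens per call x 1M calls) / 1M tokens cancels, leaving tokens x price.
    """
    price_in, price_out = PRICES[model]
    return input_tokens * price_in + output_tokens * price_out


INPUT_TOKENS, OUTPUT_TOKENS = 1_000, 200  # hypothetical per-call profile

for model in PRICES:
    cost = usd_per_million_calls(model, INPUT_TOKENS, OUTPUT_TOKENS)
    print(f"{model:<24} ${cost:>10,.2f} per million calls")
```

On this profile Opus 4.8 comes to $10,000 per million calls, Sonnet 4.6 to $6,000, and Flash-Lite to $180. The profile drives the result as much as the price does, so measure it before comparing tiers.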

Then count the costs the per-token math hides:

Cached input tokens, which can cut input cost by 50 to 90 percent on workloads that share prompt prefixes, when the cache hit rate is actually measured
Tool fees, which can dominate the bill on agent loops (OpenAI web search at $10 to $25 per 1,000 calls, Google Search grounding at $35 per 1,000 grounded prompts after the free tier)
Retry storms, where a tool failure or schema mismatch drives 5 to 20 redundant calls per task
Routing-gateway costs (LiteLLM self-hosted ops time, OpenRouter or Portkey or Helicone or Vellum platform fees) when the gateway becomes a production dependency
Eval and observability cost (LangSmith, Braintrust, Promptfoo, Vellum subscriptions plus judge-model inference)
Migration cost when the team picks the wrong tier and has to rebuild the prompt and eval against a different model
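
Two of the hidden costs above, cached input and retry storms, can be folded into a per-task figure. This is a rough sketch under simplifying assumptions: every retry repeats the full call, the cache discount applies uniformly to the cached share of input tokens, and the function name and parameters are hypothetical.

```python
def effective_task_cost(
    input_tokens,
    output_tokens,
    in_price,
    out_price,
    cache_hit_rate=0.0,
    cached_discount=0.0,
    calls_per_task=1,
):
    """USD cost of one task.

    in_price / out_price: USD per million tokens.
    cache_hit_rate: measured fraction of input tokens served from cache.
    cached_discount: fraction taken off the input price for cached tokens.
    calls_per_task: the first call plus any redundant retries.
    """
    cached = input_tokens * cache_hit_rate
    fresh = input_tokens - cached
    input_cost = (fresh * in_price + cached * in_price * (1 - cached_discount)) / 1_000_000
    output_cost = output_tokens * out_price / 1_000_000
    return calls_per_task * (input_cost + output_cost)
```

With Sonnet 4.6 list prices ($3.00 in, $15.00 out) and an assumed 10,000-token prompt with a 500-token reply, the baseline is about $0.0375 per task. An 80 percent cache hit rate at a 90 percent discount brings it to about $0.0159, while a ten-call retry storm with no caching pushes it to about $0.375.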

Five coexistence patterns capture most production setups:

Single-vendor stack: best for very small teams or very high-stakes single-purpose products. Pick Anthropic, OpenAI, or Google, run everything on that vendor's mid-tier default (Sonnet, GPT-5, or Flash), and reserve its frontier model for the calls that matter. Spend lands at $500 to $5,000 a month at low volume; the routing question gets deferred until volume forces it.
Two-vendor balanced stack: best for most production teams. Anthropic (Opus, Sonnet, Haiku) plus Google (Gemini Pro, Flash, Flash-Lite) covers the full tier ladder with two billing relationships and two API shapes. Spend lands at $1,000 to $15,000 a month; the routing decision lives in the team's call sites or in a thin gateway.
Three-vendor full stack with a router: best for engineering-led teams with five or more task classes and meaningful monthly spend. OpenAI plus Anthropic plus Google plus a router (LiteLLM, OpenRouter, Portkey) covers every slot, adds fallback, and centralizes billing visibility. Spend lands at $5,000 to $50,000 a month plus the gateway dependency.
Open-weight plus frontier hybrid: best for teams with cost pressure on the long tail. DeepSeek V4 Flash, Llama 3.3 70B, or Qwen 2.5 via OpenRouter on the efficiency tier; Anthropic or OpenAI on the frontier for the work that matters. Spend can drop 40 to 70 percent on volume work, with the offset being vendor-risk and tool-maturity tradeoffs.
Specialist replacements: best for teams with a clear workload that a specialist beats. Mistral OCR for scanned documents, Codestral or Devstral for narrow code, Gemini image-output models for image generation. Specialists earn their slot when they replace a general call cleanly; they don't earn their slot as a hedge.

Two vendors earn their seats when one handles the frontier work and the other carries the efficiency-tier volume at a price the frontier vendor can't match. They don't earn their seats when the second vendor exists because nobody's classified the work and "having options" feels safer than picking the right tier.

Pitfalls and Anti-Patterns

Defaulting to Frontier Because Quality Matters

Pinning every endpoint to GPT-5.5, Opus 4.8, or Gemini 2.5 Pro because the team doesn't have time to test downgrades. The bill compounds without the quality bonus the team thinks it's buying, because most production work is extraction, classification, or summarization, where the frontier model and the efficiency model produce indistinguishable output. The fix is the eval harness: build the test set, run both tiers, route by the result.

Defaulting to Cheapest Because Tokens Are Tokens

Routing everything through Haiku 4.5, Flash-Lite, or GPT-5.4 nano because the per-million numbers are small. The savings on the model line get eaten by editor rework, customer escalations, and the silent quality regression that takes a quarter to surface. The fix is the same eval harness, applied in the other direction: confirm the cheap model holds the bar on the work that matters, and escalate the work where it doesn't.
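
The eval harness behind both fixes can be sketched in a few lines: score each candidate tier on the same test set, then take the cheapest one that clears the declared bar. Everything here is illustrative. The two "models" are stubs standing in for real API calls, exact-match scoring is an assumption that only suits classification-style work, and the per-call costs are placeholders.

```python
def pass_rate(model_fn, eval_set):
    """Fraction of eval examples where the model output equals the expected label."""
    hits = sum(1 for prompt, expected in eval_set if model_fn(prompt) == expected)
    return hits / len(eval_set)


def cheapest_passing(candidates, eval_set, min_pass_rate):
    """candidates: list of (name, cost_per_call, model_fn).

    Returns the name of the cheapest candidate that clears the bar, or None.
    """
    for name, _cost, model_fn in sorted(candidates, key=lambda c: c[1]):
        if pass_rate(model_fn, eval_set) >= min_pass_rate:
            return name
    return None


# Stub "models" (hypothetical); a real harness would call the vendor APIs here.
def cheap_model(prompt):
    return "billing"  # always guesses the majority class


def strong_model(prompt):
    return "billing" if ("refund" in prompt or "invoice" in prompt) else "account"


EVAL_SET = [
    ("refund request", "billing"),
    ("password reset", "account"),
    ("invoice copy", "billing"),
    ("login loop", "account"),
]
CANDIDATES = [
    ("frontier", 0.0175, strong_model),
    ("efficiency", 0.00032, cheap_model),
]
```

With a 95 percent bar, `cheapest_passing(CANDIDATES, EVAL_SET, 0.95)` picks the frontier stub because the cheap stub only passes half the set. Drop the bar to 50 percent and the efficiency stub wins. The bar is the function owner's decision; the harness only enforces it.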

Routing on Token Count Instead of Task Class

Sending long-context calls to the frontier and short-context calls to the efficiency tier because token count looks like a proxy for complexity. A 100K-token document extraction may be easier than a 2K-token root-cause analysis. Context length is a cost driver, not a difficulty signal. The fix is to classify the work, then size the context against the class, not the other way around.

Counting Cost at the Model Call and Ignoring Tool Loops

Building an agent on Sonnet 4.6 and budgeting the model spend without budgeting OpenAI web search at up to $25 per 1,000 calls or Google Search grounding at $35 per 1,000 grounded prompts. Tool fees dominate the bill on heavy-loop agents, and the model token cost can be the smaller line item. The fix is per-task cost ceilings that include tool fees and retry counts, not just model tokens.
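
A per-task ceiling that counts tool fees might look like the sketch below. The usage numbers are an assumed agent run (token counts summed across every step and retry), the $25 per 1,000 search fee is the upper figure quoted above, and the names `TaskUsage`, `task_cost`, and `within_ceiling` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TaskUsage:
    input_tokens: int   # summed across every step and retry in the task
    output_tokens: int
    search_calls: int   # billable tool calls made during the task


def task_cost(usage, in_price, out_price, search_fee_per_1k):
    """Return (model_cost, tool_cost) in USD for one task."""
    model = (usage.input_tokens * in_price + usage.output_tokens * out_price) / 1_000_000
    tools = usage.search_calls * search_fee_per_1k / 1_000
    return model, tools


def within_ceiling(usage, in_price, out_price, search_fee_per_1k, ceiling_usd):
    """True when model tokens plus tool fees stay at or under the per-task ceiling."""
    model, tools = task_cost(usage, in_price, out_price, search_fee_per_1k)
    return model + tools <= ceiling_usd
```

For an assumed task on Sonnet 4.6 that burns 20,000 input tokens, 2,000 output tokens, and 8 searches, the model line is about $0.09 and the search line is about $0.20. A ceiling of $0.25 that looked generous against tokens alone is already broken.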

Hard-Coding Model IDs at the Call Site

Writing the literal claude-opus-4-8 string directly in the application code. Mistral Large 2 is retired; DeepSeek's older aliases are being deprecated; OpenAI's GPT-5 nano now defers to GPT-5.4 nano. The model ID is going to change, repeatedly, and the change shouldn't be a code-change event every time. The fix is to abstract the model behind a routing alias the team owns: "extraction-default" maps to the current efficiency-tier choice, and the call site doesn't change when the underlying model does.
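
One way to own the alias is a small lookup loaded from config. The sketch below is an assumption about shape, not a gateway's API: the JSON layout, the `resolve` helper, and the model IDs are placeholders invented for illustration, and only the "extraction-default" alias name comes from the text above.

```python
import json

# Hypothetical routing config; in practice this would live in a file or a
# config service so a model swap ships without a code deploy.
ROUTES_JSON = """
{
  "extraction-default": "vendor-a/efficiency-model-v1",
  "synthesis-default": "vendor-b/mid-model-v2"
}
"""


def load_routes(raw):
    """Parse the alias-to-model mapping from a JSON string."""
    return json.loads(raw)


def resolve(alias, routes):
    """Map a team-owned alias to the current model ID, failing loudly on unknown aliases."""
    try:
        return routes[alias]
    except KeyError:
        raise KeyError(f"no route configured for alias {alias!r}") from None


routes = load_routes(ROUTES_JSON)
model_id = resolve("extraction-default", routes)  # the call site only knows the alias
```

When the vendor retires a model, the edit lands in the config and every call site that asks for "extraction-default" picks up the new ID.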

Treating Structured-Output Support as Structured-Output Reliability

Picking a model because the vendor docs say it supports function calling and JSON mode, and shipping without testing the actual rate at which the output validates on the team's schema. Vendor support means the feature exists; it doesn't mean the model produces valid output under adversarial inputs, optional fields, deeply nested objects, or malformed user prompts. The fix is a schema-validation eval that runs as part of the routing decision.
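
A schema-validation eval can start as small as the sketch below, which uses only the standard library. The invoice schema, the sample outputs, and the function names are hypothetical; a production harness would more likely lean on a schema library and run against the team's real prompts and model outputs.

```python
import json

# Hypothetical schema: field name -> (accepted types, required?)
INVOICE_SCHEMA = {
    "invoice_id": (str, True),
    "total": ((int, float), True),
    "po_number": (str, False),
}


def validates(raw_output, schema):
    """True when raw_output is a JSON object whose fields match the schema."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    for field, (accepted, required) in schema.items():
        if field not in data:
            if required:
                return False
            continue
        if not isinstance(data[field], accepted):
            return False
    return True


def validation_rate(outputs, schema):
    """Fraction of raw model outputs that validate against the schema."""
    return sum(validates(o, schema) for o in outputs) / len(outputs)


# Illustrative outputs, not real model responses.
SAMPLE_OUTPUTS = [
    '{"invoice_id": "A-1", "total": 120.5}',
    '{"invoice_id": "A-2", "total": "120.50"}',              # number returned as a string
    '{"invoice_id": "A-3", "total": 80, "po_number": "PO-9"}',
    'Sure! Here is the JSON: {"invoice_id": "A-4"}',         # prose wrapped around the JSON
]
```

On these four samples the rate is 0.5. The number to route on is the same rate measured on the team's own schema and traffic, per model.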

Using Reasoning Models as Fast Models

Routing chat-style interactions to o3 because "reasoning is better." o3 is labeled slow by OpenAI's own docs and is succeeded by GPT-5 for most general use. Reasoning models are the right tier when the answer depends on a derivation; they're the wrong tier when the answer needs to ship in under a second. The fix is to reserve the reasoning tier for the work where the explicit deliberation is the deliverable.

What to Validate Before Paying for the Stack

The pilot below tests the routing decisions against real production traffic, not against a vendor demo. It produces measurable pass-fail gates and a defensible monthly-spend number.

Before day one. Build the eval sets: 50 to 200 examples per task class on the top three by volume and top two by failure cost. Decide the failure tolerance per class, in writing, with the function owner. Pick one routing surface (gateway or vendor SDK with a routing module) the team is willing to depend on.

Week one: classify, eval, decide. Tag a representative week of production calls by class. Run the eval against the current model and against the candidate downgrade or upgrade per class. Produce a routing-decision spreadsheet with the per-class quality delta, the per-million cost delta, and the recommendation. The Friday output is a routing config the team is ready to ship, not a memo about routing in the abstract.

Week two: ship and watch. Route each class to the chosen default, and rerun the eval daily. Monitor latency, validation failure rate, retry count, tool-fee accumulation, and operator-visible quality on a dashboard the team checks every morning. Treat the first 48 hours as the rollback window, and revert immediately if any class breaks its tolerance.

Buy only if the loop wins. The pilot passes only when these gates hold:

Every class meets its declared failure tolerance and stays under its declared monthly ceiling
Validation failure rate, retry count, and tool-fee accumulation match the pre-pilot baseline or improve
Operators can diagnose a wrong answer in under five minutes by tracing the call to its class, model, and prompt
The monthly bill, projected from the first two weeks, is at least 20 percent below the pre-pilot trajectory on the volume classes without quality regression on the judgment classes
The team has a rollback path that returns any class to the prior model within an hour
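
The first and fourth gates reduce to arithmetic, sketched below. The linear projection from 14 days to a 30-day month is a simplifying assumption, and the names (`ClassResult`, `class_gate`, `savings_gate`) and dollar figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ClassResult:
    name: str
    failure_rate: float     # measured during the pilot
    tolerance: float        # declared before day one
    projected_spend: float  # USD per month
    ceiling: float          # declared monthly ceiling, USD


def class_gate(result):
    """A class passes when it meets its failure tolerance and stays under its ceiling."""
    return result.failure_rate <= result.tolerance and result.projected_spend <= result.ceiling


def project_monthly(spend_first_14_days, days_in_month=30):
    """Linear projection of a month's spend from the first two weeks (an assumption)."""
    return spend_first_14_days / 14 * days_in_month


def savings_gate(pre_pilot_monthly, spend_first_14_days, min_savings=0.20):
    """Return (passed, projected_monthly, savings_fraction) against the pre-pilot trajectory."""
    projected = project_monthly(spend_first_14_days)
    savings = 1 - projected / pre_pilot_monthly
    return savings >= min_savings, projected, savings
```

Against an assumed $10,000 pre-pilot trajectory, $3,500 spent in the first two weeks projects to $7,500, a 25 percent saving that clears the gate. $4,200 projects to $9,000, a 10 percent saving that doesn't.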

Fail the pilot if the routing gateway can't:

Show which model handled a specific call, with the input, output, latency, and cost
Cap spend at the class level, not just the vendor level
Document fallback behavior when the primary model returns an error or hits a rate limit
Stream logs to the team's existing observability layer
Produce a clear answer on data residency, model versioning, and rollback for the gateway's own service

Key Takeaways

The decision is the task class, not the model. Frontier overspend and cheap-model regression are the same mistake: not classifying the work before picking the tier.
Ten task classes cover most production AI: structured extraction, classification, constrained summarization, multi-source synthesis, shallow code assist, autonomous code, open-ended generation, tool-using agents, vision and document layout, hard reasoning with deliberation.
The efficiency tier (Gemini Flash-Lite, Haiku 4.5, GPT-5.4 nano, DeepSeek V4 Flash) earns extraction, classification, internal summarization, and shallow code. The mid tier (Sonnet 4.6, GPT-5, Gemini 2.5 Flash) earns customer-facing summary, synthesis, generation, agents, and vision. The frontier (Opus 4.8, GPT-5.5, Gemini 2.5 Pro, o-series) earns autonomous code, high-risk agents, and hard reasoning.
Cost per task class is the only meter worth tracking. Per-million-token numbers are vendor artifacts; per-task-class spend is the operator's budget.
Two vendors earn their seats when one handles the frontier and the other carries the efficiency volume. They don't earn their seats when the second vendor exists because nobody's classified the work.
Hard-coding model IDs at the call site is the deprecation tax waiting to happen. Route through an alias the team owns, and the model change becomes a config change, not a code change.
The work the platform can't do for the team is classifying the task class, declaring the failure tolerance, setting the cost ceiling, owning the eval harness, holding the vendor-exit plan, and bounding the retry and tool budget.
The named failure modes worth memorizing: frontier-by-default, the cheap-model regression, one-tier-fits-all, the silent-coercion error, the overconfident misroute, the false reconciliation, the looks-right-locally edit, loop decay, competent mush, the compounding side effect, the spatial miss, confident wrong reasoning.

Methodology

Declared frame: the right model is the cheapest one that doesn't fail the task, and the team's job is to classify the work so the routing decision can be made and defended. The dossier maps ten task classes against the model tier that earns each one, layers in pricing math sampled June 1, 2026, and treats vendor selection as a downstream decision rather than the central one.

Sources consulted: vendor pricing pages and model documentation for Anthropic, OpenAI, Google, Mistral, DeepSeek, Meta, and Alibaba; routing-gateway documentation for LiteLLM, OpenRouter, Portkey, Helicone, and Vellum; cross-provider benchmarks from Artificial Analysis; published customer cost data from Anthropic (Claude Code) and OpenAI (GPT-5.5 finance workflow), plus published Intercom Fin outcome-pricing data for product comparison framing. Pricing and feature claims reflect a June 1, 2026 snapshot and shift as vendors revise plans, models, and tiers.

In scope: per-task-class model selection for production AI stacks running between 100,000 and 10,000,000 model calls a month across the standard chat, agent, code, vision, and reasoning workloads. Above 10,000,000 calls a month, the routing decisions still apply, but enterprise commit pricing, custom rate-limit tiers, and self-hosted open-weight options change the cost math non-linearly.

Out of scope: deep prompt-engineering technique comparison, fine-tuning economics, dedicated inference deployments, and the agent-platform-versus-workflow-tool decision (covered in the prior dossier on work shape).
