Classify the Work, Not the Model
A reference dossier for the engineering manager, applied-AI lead, or founder already paying for two or three model vendors: the ten task classes a team runs every week, the model tier that earns each one, and the routing decisions that decide whether the monthly bill grows linearly with usage or grows like a horror story.

TL;DR
The decision isn't Claude versus GPT versus Gemini. That decision sits in vendor procurement. The decision here is which class the work belongs to, and which tier of model the class can ride on without producing a quality regression a human will catch a month later in a customer-facing channel.
The right model is the cheapest one that doesn't fail the task. Frontier models are the easy default and the most common overspend, while cheap models are the easy savings and the most common silent regression. The fix is to classify the work, not the model.
Ten task classes sit underneath every production AI stack. Structured extraction, classification, constrained summarization, and shallow code assist belong to the efficiency tier. Multi-source synthesis, open-ended generation, and tool-using workflow agents belong to the mid tier. Autonomous code, hard reasoning with explicit deliberation, and high-risk agentic loops belong to the frontier. Vision sits where the multimodal pricing wins by a band.
A starting stack pairs one frontier model the team trusts for high-stakes work with one mid-tier default and one efficiency-tier workhorse for the long tail.
Frontier tier:
- Claude Opus 4.8 for autonomous engineering and high-stakes synthesis
- GPT-5.5 for professional analysis and structured agentic work
- Gemini 2.5 Pro for long-context reasoning when frontier price matters
Mid tier:
- Claude Sonnet 4.6 as the balanced default for synthesis, writing, and agents
- GPT-5 for reasoning-heavy work below the GPT-5.5 ceiling
- Gemini 2.5 Flash for high-volume mid calls and multimodal work
Efficiency tier:
- Claude Haiku 4.5 for fast Claude-compatible extraction and sub-agents
- Gemini 2.5 Flash-Lite for extraction, classification, and simple vision
- GPT-5.4 nano for OpenAI-native cheap defaults
- DeepSeek V4 Flash for extreme-cost workloads with vendor-risk acceptance
Monthly model spend on a small team runs $200 to $3,000 when the routing is honest; the same workload run frontier-by-default lands at $10,000 to $50,000. The gap is the routing.
Classifying the calls is the team's job, not the platform's. A router can encode the decision; it can't make it.
Three Teams That Paid for the Same Mistake
A 30-seat company put GPT-5.5 behind its support triage at a steady three-million-call month. The work was classification: read a ticket, return one of nine labels and a confidence score, hand off below the threshold. The bill came in at $4,200, and the founder asked engineering why a classification task cost more than the customer-success seat. The answer was that nobody had ever picked a model for that endpoint; the first engineer used GPT-5.5 because the docs had a copy-paste example, and the routing never got revisited. Running the same workload on Gemini 2.5 Flash-Lite costs about $200 a month at the published $0.10 input and $0.40 output per million, and the eval set showed no meaningful regression on the existing nine-label taxonomy. The failure moment wasn't a hallucination; it was the line item nobody owned. Call it frontier-by-default.
A content team running an automated draft pipeline on Claude Haiku 4.5 noticed editorial scoring drifting downward across the quarter. Haiku is fast and cheap, and on summary work it's competent. The drafts looked clean on first read; the editors were catching subtle factuality issues two reviews deep, and the editor-rework time per draft crept from 18 minutes to 42 minutes. The savings on the model line ($1 input and $5 output per million versus Sonnet's $3 and $15) were under $4,000 a month. The lost editor time, measured against fully-loaded hourly cost, was $11,000. The team had downgraded a synthesis-and-voice job into a class that only earns the slot when the output is templated. The fix was to escalate authored work back to Sonnet 4.6 and leave Haiku on the constrained-summary work where it still ships clean. Call it the cheap-model regression.
A 200-engineer organization ran Claude Code at scale, on a flat policy that routed every loop through Opus 4.8 because "coding is high stakes." Anthropic's own published Claude Code data shows enterprise deployments average around $13 per developer per active day with 90 percent below $30. Run the math on 200 developers and 20 active days: about $52,000 a month, with the curve climbing as autonomy scales. The team's actual bill cleared $70,000 because they hadn't separated inline completion (which doesn't need Opus), function-level edits (which mostly don't need Opus), and autonomous multi-file engineering (which is where Opus earns its slot). Once they split shallow code assist down to GPT-5 mini and held Opus for the planning-and-multi-file loops, the bill came back inside the $25,000 to $30,000 range without a measurable regression on the work that mattered. Call it one-tier-fits-all.
Three teams, three different mistakes, one category. They picked the model before they classified the work, and then they let the choice ossify because nobody owned the routing as a recurring decision rather than a one-time setup.
The lifecycle anchor is the model-deprecation cycle. Mistral Large 2 is retired, DeepSeek's older deepseek-chat and deepseek-reasoner aliases are being deprecated in favor of V4 Flash and V4 Pro, and OpenAI's GPT-5 nano now sits behind GPT-5.4 nano for most new speed-sensitive workloads. Hard-coding a model ID without a deprecation plan is its own failure mode; the routing layer needs to abstract the model from the call site, or the call site has to be renamed every quarter.
The Ten Classes and the Tiers That Earn Them
Each sub-case below follows the same template: the slot names the default tier, the "what ships clean" bullets describe the deliverable, the ceiling names the failure mode, and the action line is the Friday deliverable.
Structured Extraction and Schema Filling
The work is pulling named fields from documents, emails, transcripts, or chat logs against a known schema. Inputs and outputs are deterministic; the model is doing parsing, not judgment.
The slot belongs to the efficiency tier. Gemini 2.5 Flash-Lite is the price-leader default at $0.10 input and $0.40 output per million; DeepSeek V4 Flash is close behind when vendor risk is acceptable. GPT-5.4 nano holds the slot for teams already running OpenAI-native. Mistral OCR sits in front of the LLM call when the document is scanned and layout matters more than language.
What ships clean:
- A JSON object that parses on the first attempt, with every required field present and every optional field either filled or explicitly null
- A confidence or abstention signal the team can route on, not a silent guess
- A validation step that rejects malformed output and routes failures to a retry on the same tier before escalating
- A per-million cost the team can defend against the alternative (manual extraction, OCR-only, regex-only)
The ceiling appears at extraction where the schema is genuinely ambiguous. When the model has to decide whether "Acme, Inc." and "Acme Inc" are the same legal entity, or which of two phone numbers in the document is the billing contact, the work has graduated to multi-source synthesis. The named failure mode is the silent-coercion error: the cheap model produces a parseable record with the wrong values.
If you start this week, pick one extraction endpoint with at least 100,000 monthly calls. Build a 100-example eval against your current model, run the same eval on Gemini Flash-Lite, and ship the swap by Friday if the regression is under your tolerance.
Examples of what this looks like in production:
| Source | Fields | Tier |
|---|---|---|
| Inbound invoice email | Vendor, amount, due date, line items, PO reference | Gemini 2.5 Flash-Lite |
| Customer signup form | Name, email, role, company size, intent | GPT-5.4 nano |
| Scanned contract pages | Parties, effective date, term, renewal clause | Mistral OCR plus Haiku 4.5 |
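
The validate, retry, then escalate pattern described above can be sketched as follows. This is a minimal illustration, not a reference implementation: the field names, `validate_record`, `extract`, and the idea of passing each tier as a plain callable are all assumptions made for the example, and a real endpoint would wrap its vendor SDK calls behind those callables.

```python
import json
from typing import Callable, Optional

# Hypothetical invoice schema, for illustration only.
REQUIRED_FIELDS = {"vendor", "amount", "due_date"}
OPTIONAL_FIELDS = {"po_reference"}


def validate_record(raw: str) -> Optional[dict]:
    """Return the parsed record if it satisfies the schema, else None."""
    try:
        record = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(record, dict):
        return None
    # Required fields must be present and non-null.
    if any(record.get(field) is None for field in REQUIRED_FIELDS):
        return None
    # Optional fields must be present: either filled or explicitly null.
    if any(field not in record for field in OPTIONAL_FIELDS):
        return None
    return record


def extract(
    document: str,
    tiers: list[Callable[[str], str]],
    retries_per_tier: int = 2,
) -> Optional[dict]:
    """Try each tier in order, retrying on the same tier before escalating."""
    for call_model in tiers:
        for _ in range(retries_per_tier):
            record = validate_record(call_model(document))
            if record is not None:
                return record
    return None  # every tier failed: hand the document to a human queue
```

Schema validation of this kind only catches malformed output. It does nothing about the silent-coercion error, where the record parses and the values are wrong, so the eval set still has to compare extracted values against ground truth.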
Classification and Routing
The work is putting an inbound item into one of a small set of labels: support intent, lead temperature, ticket category, expense category, content type. Judgment is real but bounded; the taxonomy is small enough to enumerate.
The slot belongs to the efficiency tier. Gemini 2.5 Flash-Lite is the default at $0.10 input and $0.40 output per million; GPT-5.4 nano and Llama 3.3 70B via OpenRouter compete on cost when the label set is stable and the eval clears them.
What ships clean:
- A label, a confidence score, and a one-sentence rationale, returned as structured JSON
- A taxonomy with fewer than 15 labels, with three to five labeled examples per class plus an "unknown / needs human" class
- A confidence threshold the team can tune, with everything below the threshold queued for human review
- A weekly misroute review where production examples that the model labeled wrong get added to the eval set
The ceiling appears when the team expands the taxonomy to cover every edge case. Past 15 labels, the work is investigation, not classification, and the slot has moved up a tier. The named failure mode is the overconfident misroute on the ambiguous intent: the cheap model picks a confident wrong label because the prompt didn't give it a way to abstain.
If you start this week, pull 200 real labeled items, run them through the cheapest model that supports structured output, publish precision and recall by label, and route only above 0.85 confidence by Friday.
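Publishing precision and recall by label takes only a few lines once the labeled items and the model's predictions sit side by side. The sketch below assumes two parallel lists of label strings; the function name and the example labels are illustrative.

```python
from collections import Counter


def precision_recall_by_label(
    gold: list[str], predicted: list[str]
) -> dict[str, tuple[float, float]]:
    """Return {label: (precision, recall)} for every label seen in either list."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    true_positives = Counter(g for g, p in zip(gold, predicted) if g == p)
    predicted_counts = Counter(predicted)
    gold_counts = Counter(gold)
    report = {}
    for label in sorted(set(gold) | set(predicted)):
        tp = true_positives[label]
        # A label never predicted (or never present in gold) scores 0.0.
        precision = tp / predicted_counts[label] if predicted_counts[label] else 0.0
        recall = tp / gold_counts[label] if gold_counts[label] else 0.0
        report[label] = (precision, recall)
    return report


gold = ["refund", "refund", "bug", "spam"]
predicted = ["refund", "bug", "bug", "spam"]
for label, (precision, recall) in precision_recall_by_label(gold, predicted).items():
    print(f"{label}: precision={precision:.2f} recall={recall:.2f}")
```

Reporting per label rather than as one aggregate number is the point: a model can look fine overall while misrouting the one label that carries the cost.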
Examples of what this looks like in production:
| Inbound | Labels | Tier |
|---|---|---|
| Support tickets | refund / bug / how-to / sales / spam / unknown | Gemini 2.5 Flash-Lite |
| Inbound leads | hot / warm / cold / out-of-ICP / spam | GPT-5.4 nano |
| Expense receipts | meals / travel / software / contractor / other | Haiku 4.5 |
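
The confidence threshold and the "unknown / needs human" class can be combined into one small routing function. This is a sketch under assumptions: the `Classification` shape, the label set, and the `human_review` queue name are hypothetical, and the 0.85 default comes from the action line above.

```python
from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 0.85  # tunable; starting value from the text above
ROUTABLE_LABELS = {"refund", "bug", "how-to", "sales", "spam"}


@dataclass
class Classification:
    label: str
    confidence: float
    rationale: str


def route(result: Classification, threshold: float = CONFIDENCE_THRESHOLD) -> str:
    """Return the destination queue for a classified item."""
    # "unknown" and any label outside the taxonomy go to a person,
    # as does anything the model is not confident about.
    if result.label not in ROUTABLE_LABELS or result.confidence < threshold:
        return "human_review"
    return result.label
```

Treating an out-of-taxonomy label the same as a low-confidence one gives the model a way to abstain without the call site needing a special case.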
Constrained Summarization and Rewrite
The work is producing a summary at a known length, tone, and structure. Customer-visible ticket summaries, email drafts of fixed shape, daily stand-up notes from raw transcripts, meeting recaps in a template the team has settled on.
The slot belongs to the mid tier when the output is customer-visible, and the efficiency tier when the output is internal-only. Gemini 2.5 Flash earns customer-facing summaries at $0.30 input and $2.50 output per million; Haiku 4.5 holds the slot when the tone has to feel Claude-shaped. GPT-5 mini works when the team is OpenAI-native and the editor doesn't need a particular voice.
What ships clean:
- A summary at the requested length, tone, and structure, with no invented quotes or fabricated names
- Factual coverage the team can verify against the source in under two minutes
- An editor-rework time on first review under the bar the team agreed to in week one
- A consistent format that downstream systems (CRM fields, email templates, ticket panels) can consume without manual cleanup
The ceiling appears when the summary has to take a position the source doesn't take. Compressing a customer call into a sales rec, condensing three meetings into a roadmap update, or distilling a thread into an executive view requires synthesis, not summarization. The named failure mode is the cheap-model regression: the model produces competent mush that passes a glance and fails the second read.
If you start this week, pick one summary endpoint with at least 1,000 monthly calls, build a 50-example eval with editor-rework time as the metric, and ship the swap by Friday if the time is within tolerance.
Examples of what this looks like in production:
| Input | Output | Tier |
|---|---|---|
| Sales call transcript | Two-paragraph CRM activity note | Gemini 2.5 Flash |
| Inbound support ticket thread | Three-line summary for the agent panel | Haiku 4.5 |
| Weekly engineering Slack threads | Bulleted stand-up note for the PM | GPT-5 mini |
Multi-Source Synthesis and Reconciliation
The work is reading across multiple sources, identifying agreement and conflict, and producing a judgment. Reconcile an invoice against contract terms. Compare a vendor pitch against the team's evaluation rubric. Cross-check a customer's stated issue against their account history and the product's release notes.
The slot belongs to the mid tier as the default and the frontier when the answer drives a high-stakes decision. Claude Sonnet 4.6 at $3 input and $15 output per million is the workhorse. Gemini 2.5 Pro and GPT-5 sit at similar mid-frontier pricing for cost-sensitive variants, and Opus 4.8 holds the slot when the team needs the deeper synthesis Sonnet doesn't quite reach.
What ships clean:
- A reconciliation that names each source, identifies points of agreement and disagreement, and states uncertainty where the sources don't decide
- A bounded output schema that downstream systems can consume (a discrepancy list with evidence links, a structured comparison table, a recommendation with rationale)
- A human-readable rationale that a reviewer can audit without re-reading every source
- An eval that penalizes false reconciliation (the model makes inconsistent sources look consistent) more heavily than other failure modes, because that failure is the one the human will miss
The ceiling appears when the synthesis requires multi-step reasoning the model can't show its work for. When the answer depends on a derivation, an explicit plan, or a high-stakes judgment, the work has moved into hard reasoning with explicit deliberation. The named failure mode is the false reconciliation: the model paves over a real conflict because the prompt asked for a single answer.
If you start this week, pick one synthesis endpoint with high decision-cost-per-output, build a 30-example eval scored by a human on conflict-detection and source-attribution accuracy, and validate Sonnet 4.6 as the default before considering frontier escalation.
Examples of what this looks like in production:
| Sources | Output | Tier |
|---|---|---|
| Invoice, contract, prior payments | Discrepancy report with evidence links | Claude Sonnet 4.6 |
| Vendor questionnaire, internal policy, prior audit | Risk score plus reviewer packet | GPT-5 or Gemini 2.5 Pro |
| Customer ticket, account history, release notes | Recommended response with cited rules | Sonnet 4.6 |
Shallow Code Assist
The work is inline completion, function-level edits, small test additions, localized diffs. The model is helping the engineer move within a file or a small set of files; the engineer owns the loop.
The slot belongs to the mid tier of code-specialized models. GPT-5 mini at $0.25 input and $2 output per million is strong; Codestral and Devstral hold their seats when the team prefers an open-weight option; DeepSeek V4 Flash is the price floor when vendor risk is acceptable.
What ships clean:
- A diff that compiles, passes the existing tests, and follows the repo's style without manual cleanup
- A response time under the bar that keeps the engineer in flow (under two seconds for inline completion, under ten seconds for function-level edits)
- A failure mode where the engineer can see the suggestion is wrong before applying it, not after
- A cost per accepted suggestion that the team can defend against engineering time saved
The ceiling appears when the edit needs to understand more than one file. The moment the suggestion has to reason about an import chain, a database schema, or a service boundary, the work has moved into autonomous code and multi-file engineering. The named failure mode is the looks-right-locally edit: the model produces a change that compiles in isolation and breaks an adjacent call site.
If you start this week, pick one IDE assistant endpoint, instrument suggestion acceptance rate by file size and edit scope, and route inline completion to GPT-5 mini while reserving the frontier for the multi-file loop.
Examples of what this looks like in production:
| Edit shape | Target | Tier |
|---|---|---|
| Inline completion in a function body | Repo-aware completion under 200 ms | GPT-5 mini |
| Function-level refactor in one file | Working diff with passing tests | Codestral or GPT-5 mini |
| Test scaffolding for an existing function | Spec file with realistic assertions | DeepSeek V4 Flash |
Autonomous Code and Multi-File Engineering
The work is the loop. The model plans, navigates the repo, edits multiple files, runs tests, debugs, and decides when to stop. The engineer reviews the result; the model owns the path.
The slot belongs to the frontier. GPT-5.5 and Claude Opus 4.8 are the only models that consistently hold the loop without thrashing on the harder cases. Sonnet 4.6 holds it with stricter bounds, smaller surface area, and tighter step caps, and saves roughly 40 percent of the cost when the task fits.
What ships clean:
- A plan the engineer can read before any edits land
- An edit sequence that respects existing patterns in the repo (style, structure, test conventions)
- A test run after the edits with explicit failure handling, not silent skip
- A clean stop, with no looping retries past the agreed step budget
The ceiling appears at long-horizon autonomy without supervision. The model can plan and edit; it can't yet own the architectural choice that decides whether the work was worth doing. The named failure mode is loop decay: the model keeps trying variants past the point where a human would stop, racks up retry tokens, and either lands a worse answer or no answer at all. Anthropic's Claude Code data shows the average enterprise developer-day cost at $13 with 90 percent under $30; the team that hits $50 a day is paying for loop decay, not loop depth.
If you start this week, pick one autonomous-code endpoint, set an explicit step budget and a token-cost ceiling per task, route to Opus 4.8 or GPT-5.5, and review the first 20 completed tasks for loop discipline before scaling.
Examples of what this looks like in production:
| Task | Step budget | Tier |
|---|---|---|
| Add a feature across 3 to 6 files with tests | 30 steps, $5 ceiling | Claude Opus 4.8 |
| Track down a flaky test across the test suite | 50 steps, $10 ceiling | GPT-5.5 |
| Convert a legacy module to a new pattern | 40 steps, $7 ceiling | Sonnet 4.6 with strict bounds |
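
A step budget and a cost ceiling are easiest to enforce when they live in one object that the loop has to consult before every step. The sketch below is illustrative: `Budget`, `run_task`, and the `step` callable (assumed to return a done flag and the dollar cost of that step) are hypothetical names, not part of any agent framework.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Budget:
    max_steps: int
    max_cost_usd: float
    steps_used: int = 0
    cost_used_usd: float = 0.0

    def charge(self, step_cost_usd: float) -> None:
        """Record one completed step and its cost."""
        self.steps_used += 1
        self.cost_used_usd += step_cost_usd

    def exhausted(self) -> bool:
        """True once either the step budget or the cost ceiling is reached."""
        return (
            self.steps_used >= self.max_steps
            or self.cost_used_usd >= self.max_cost_usd
        )


def run_task(step: Callable[[], tuple[bool, float]], budget: Budget) -> str:
    """Run agent steps until the task finishes or the budget runs out."""
    while not budget.exhausted():
        done, cost_usd = step()
        budget.charge(cost_usd)
        if done:
            return "completed"
    return "stopped_for_review"  # clean stop: escalate instead of retrying


# Example: the first row of the table above (30 steps, $5 ceiling).
budget = Budget(max_steps=30, max_cost_usd=5.0)
```

Whichever limit trips first ends the loop, which is what keeps loop decay from turning into an open-ended bill.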
Open-Ended Generation
The work is producing prose with voice, structure, and audience fit. Long-form drafts, marketing pages, creative briefs, internal memos, newsletter copy, blog posts.
The slot belongs to the mid tier as the default, with the frontier reserved for thought leadership where the editor's time is the binding cost. Claude Sonnet 4.6 is the strongest mid for voice-aware drafting; GPT-5 mini and Gemini 2.5 Flash hold the slot for templated copy. Opus 4.8 and GPT-5.5 earn their slots only when the editor would otherwise spend more than the model-cost differential rewriting the draft.
What ships clean:
- A draft with a coherent voice, an intentional structure, and audience-appropriate language
- A factual baseline the editor can fact-check in under twice the writing time
- A length within 10 percent of the requested target
- An editor-rework time the team is willing to defend against alternative production paths (human-authored, partly automated, fully outsourced)
The ceiling appears when the work needs taste, point of view, or original argumentation. The model can compose; it can't yet write something a reader will remember. The named failure mode is competent mush: the model produces a draft that scans clean and says nothing.
If you start this week, pick one drafting endpoint, score 30 outputs against editor-rework time and an audience-fit rubric, and choose the cheapest tier that holds the editor's time within tolerance.
Examples of what this looks like in production:
| Output | Audience | Tier |
|---|---|---|
| Templated lifecycle email | Existing customer | Gemini 2.5 Flash or GPT-5 mini |
| Product launch blog post | Existing audience plus prospects | Claude Sonnet 4.6 |
| Editorial argument with a position | Practitioner audience | Sonnet 4.6 or Opus 4.8 |
Tool-Using Workflow Agents
The work is the agent picking which tool to call, in what order, with what inputs, and when to stop. The agent has a constrained toolbelt, a step budget, and a stopping rule.
The slot belongs to the mid tier for most production agent loops, and the frontier for loops where side-effect cost is high. Claude Sonnet 4.6 holds the default at the lowest credible mid-tier price; Opus 4.8 or GPT-5.5 earn the slot when one wrong tool call is expensive (a CRM write, a payment action, a customer-facing email).
What ships clean:
- Correct tool selection on the first try in over 90 percent of cases, measured against an eval set the team owns
- Tool-call arguments that pass schema validation without a retry
- A bounded retry policy that escalates to a human rather than looping on the same failure
- A deduplication or idempotency guarantee on side-effectful tool calls, so a duplicate call doesn't post a duplicate refund or send a duplicate email
The ceiling appears when the loop length grows past the model's reliable tool-discipline horizon. Past 10 to 15 turns, even frontier models lose the thread on side-effect tracking. The named failure mode is the compounding side effect: one wrong tool call poisons every step that follows, and the loop drives a cleanup operation that costs more than the agent saved.
If you start this week, pick one agent endpoint, run a 50-case eval focused on tool-selection accuracy and argument validity, and set the agent's step budget plus token ceiling before any production traffic.
Examples of what this looks like in production:
| Loop | Tools | Tier |
|---|---|---|
| Customer-refund eligibility agent | Get account, get policy, search exceptions, draft response | Claude Sonnet 4.6 |
| Vendor-onboarding agent | Search vendor registry, draft questionnaire, send for review | Opus 4.8 or GPT-5.5 |
| Internal-research utility agent | Search, summarize, post to Slack | Gemini 2.5 Flash |
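
One way to get the idempotency guarantee on side-effectful tool calls is to derive a key from the task, the tool, and its arguments, and refuse to run the same key twice. The sketch below is a minimal in-memory version with hypothetical names (`IdempotentToolRunner`, `issue_refund`); a production setup would need a durable store shared across workers.

```python
import hashlib
import json
from typing import Any, Callable


class IdempotentToolRunner:
    """Runs each (task, tool, arguments) combination at most once."""

    def __init__(self) -> None:
        # In-memory for the sketch; this must be durable storage in production.
        self._results: dict[str, Any] = {}

    @staticmethod
    def key(task_id: str, tool_name: str, args: dict) -> str:
        # sort_keys makes the key independent of argument order.
        payload = json.dumps(
            {"task": task_id, "tool": tool_name, "args": args}, sort_keys=True
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def call(
        self, task_id: str, tool_name: str, tool: Callable[..., Any], args: dict
    ) -> Any:
        k = self.key(task_id, tool_name, args)
        if k in self._results:
            return self._results[k]  # duplicate call: reuse the first result
        result = tool(**args)
        self._results[k] = result
        return result
```

Including the task ID in the key is a design choice: it lets two different tasks issue the same refund amount to the same order legitimately, while a retry inside one task cannot.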
Vision, Document Layout, and Screenshot Understanding
The work is interpreting images, documents, or screenshots. Reading invoices, extracting tables, understanding UI state, classifying product images, interpreting receipts.
The slot belongs to the mid tier for layout-aware work and the efficiency tier for simple image classification. Gemini 2.5 Flash holds the default for layout, tables, and UI interpretation at $0.30 input and $2.50 output per million. Flash-Lite earns the slot for low-risk image classification. Sonnet 4.6 holds the slot for complex document reasoning where the model has to argue across the layout.
What ships clean:
- A correct region or layout interpretation, with bounding-box accuracy the downstream system can use
- Table extraction with the right cells in the right rows and columns
- A confidence signal or refusal when the image is ambiguous
- A cost-per-page or cost-per-image the team can defend against OCR-only or fully-manual alternatives
The ceiling appears on small text, tight layout, or charts where the spatial relationship is the meaning. Cheap vision can label "this is a chart"; it can miss what the chart actually shows. The named failure mode is the spatial miss: the model gets the gist and loses the data point.
If you start this week, pick one vision endpoint, build a 50-image eval against ground-truth labels and bounding boxes, and route layout-aware work to Gemini 2.5 Flash while keeping Flash-Lite for the simple image classification.
Examples of what this looks like in production:
| Input | Output | Tier |
|---|---|---|
| Scanned invoice | Line-item table with vendor, amount, GL code | Gemini 2.5 Flash |
| Product photo from a marketplace listing | Category label plus condition score | Gemini 2.5 Flash-Lite |
| Customer-support screenshot of a broken UI | Diagnosis with referenced element and action | Claude Sonnet 4.6 |
Hard Reasoning With Explicit Deliberation
The work is math, planning, root-cause analysis, multi-step derivation. The model has to think before it answers, and the answer depends on the thinking holding up under paraphrase.
The slot belongs to reasoning-tuned models, with the frontier held in reserve. Gemini 2.5 Pro is the price-to-frontier leader when the answer fits its context handling. o4-mini holds the slot for OpenAI-native teams at $1.10 input and $4.40 output per million. GPT-5.5 and Opus 4.8 earn their slots only when error cost dominates the cost differential.
What ships clean:
- A correct answer that holds when the prompt is paraphrased, and an explicit chain of reasoning the team can audit
- A stable answer under input variation (the model doesn't pick a different conclusion when irrelevant details change)
- An abstention or "I'm not sure" when the inputs don't justify a confident answer
- A latency the consumer can tolerate (these models are slow by design)
The ceiling appears when the prompt sounds hard but the answer is actually shallow, or when the answer requires real-world data the model doesn't have. The named failure mode is confident wrong reasoning: the model produces a clean derivation that's internally consistent and externally wrong, and the audit trail can't catch it without domain knowledge.
If you start this week, pick one reasoning endpoint where the wrong answer has measurable cost, build a 20-example eval with paraphrase variants, and validate Gemini 2.5 Pro or o4-mini against the frontier before paying the frontier premium.
Examples of what this looks like in production:
| Question | Output | Tier |
|---|---|---|
| Reconcile a tax classification across jurisdictions | Recommendation with cited rules | Gemini 2.5 Pro |
| Root-cause an incident across logs and metrics | Hypothesis ranking with evidence | o4-mini or GPT-5.5 |
| Plan a complex multi-party deal structure | Step plan with risk callouts | Claude Opus 4.8 |
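
Stability under paraphrase can be measured by asking the same question several ways and counting how often the model lands on the same answer. The sketch below assumes a hypothetical `answer_fn` that wraps the model call and returns a short final answer; comparing answers by lowercased string equality is a deliberate simplification that only suits short, canonical outputs.

```python
from collections import Counter
from typing import Callable


def paraphrase_stability(
    answer_fn: Callable[[str], str], variants: list[str]
) -> tuple[str, float]:
    """Return the most common answer and the share of variants that produced it."""
    if not variants:
        raise ValueError("at least one prompt variant is required")
    answers = [answer_fn(prompt).strip().lower() for prompt in variants]
    top_answer, count = Counter(answers).most_common(1)[0]
    return top_answer, count / len(answers)
```

A stability share well below 1.0 on an eval item is a signal to inspect that item by hand; a high share shows consistency, not correctness, so the answers still need scoring against ground truth.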
What the Human Owns Regardless of Vendor
The engineering manager owns the task classification. A router can encode the routing decision; it can't tell the team which class a call belongs to. The classification gets written down per endpoint, lives in version control next to the prompt, and gets reviewed when the prompt or the consumer changes.
The function owner sets the failure tolerance per class. A 1 percent misroute might be fine on content tagging and catastrophic on refunds, compliance flags, or account deletion. The tolerance gets agreed before the eval is built, and the eval scores against the tolerance, not against an abstract "quality" number.
Finance owns the cost ceiling per task class. "$30 per million output tokens" doesn't mean anything until it's translated into the team's monthly bill at the team's actual call volume. The ceiling lives as a number the team can defend in a budget review, not as a vendor's pricing page.
The applied-AI lead owns the eval harness. Downgrades are only safe when regressions are measured, not felt. The eval set is private (not the vendor's benchmark), refreshes on a cadence shorter than the vendor's release cycle, and scores against the team's actual data with the team's actual rubric.
Engineering owns the vendor-exit plan, because deprecations and repricing are normal. Mistral Large 2 is retired, DeepSeek's older deepseek-chat and deepseek-reasoner aliases are being deprecated, and OpenAI's GPT-5 nano now sits behind GPT-5.4 nano. The team's call sites either abstract the model behind a router or pay rewrite costs every quarter, and the choice should be deliberate.
The on-call owns the retry and loop budget. Agent costs explode through retries, tool calls, and repeated context stuffing, not through the first model call. Every agent endpoint has a step budget, a token ceiling, and an escalation path, and the on-call has the authority to lower any of them when the budget breaks.
The team that hands off classification to the vendor stops owning the routing. The vendor sells the tier it can charge for, and the team is the only entity that can decide which tier each call actually deserves.
Cost Calculus and Coexistence
Cheap on the model line isn't cheap in production. A free or near-free model wired to an unbounded agent can produce a four-figure monthly bill before anyone notices. A frontier model on the wrong endpoint can compound the bill until a finance review forces a conversation that should have happened in week one. The math below assumes a small-to-mid team running between 100,000 and 10,000,000 model calls a month; enterprise estates push every number up.
Start with the model input and output token line. Pricing sampled June 1, 2026, per million tokens:
| Model | Input | Output |
|---|---|---|
| Claude Opus 4.8 | $5.00 | $25.00 |
| Claude Sonnet 4.6 | $3.00 | $15.00 |
| Claude Haiku 4.5 | $1.00 | $5.00 |
| GPT-5.5 | $5.00 | $30.00 |
| GPT-5 | $1.25 | $10.00 |
| GPT-5 mini | $0.25 | $2.00 |
| GPT-5.4 nano | $0.20 | $1.25 |
| Gemini 2.5 Pro (≤ 200K context) | $1.25 | $10.00 |
| Gemini 2.5 Pro (> 200K context) | $2.50 | $15.00 |
| Gemini 2.5 Flash | $0.30 | $2.50 |
| Gemini 2.5 Flash-Lite | $0.10 | $0.40 |
| DeepSeek V4 Flash | $0.14 | $0.28 |
| Mistral Large 3 | $0.50 | $1.50 |
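
Translating the table into a monthly bill is one multiplication per endpoint. The sketch below copies a few rows from the table; the dictionary keys are labels for this example rather than API model IDs, and the call volume and token counts are assumed values to be replaced with measured ones.

```python
# (input, output) price in USD per million tokens, copied from the table above.
PRICES = {
    "claude-sonnet-4.6": (3.00, 15.00),
    "claude-haiku-4.5": (1.00, 5.00),
    "gemini-2.5-flash-lite": (0.10, 0.40),
}


def monthly_cost(
    model: str, calls: int, input_tokens: int, output_tokens: int
) -> float:
    """Monthly token cost in USD for one endpoint, before caching and tool fees."""
    input_price, output_price = PRICES[model]
    tokens_cost = input_tokens * input_price + output_tokens * output_price
    return calls * tokens_cost / 1_000_000


# Assumed workload: 1,000,000 calls a month, 1,000 input and 200 output tokens each.
for model in PRICES:
    print(f"{model}: ${monthly_cost(model, 1_000_000, 1_000, 200):,.2f}")
```

Under those assumed token counts the same endpoint costs $6,000 on Sonnet 4.6, $2,000 on Haiku 4.5, and $180 on Flash-Lite, which is why the tier decision matters more than any per-token negotiation.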
Then count the costs the per-token math hides:
- Cached input tokens, which can cut input cost by 50 to 90 percent on workloads that share prompt prefixes, when the cache hit rate is actually measured
- Tool fees, which can dominate the bill on agent loops (OpenAI web search at $10 to $25 per 1,000 calls, Google Search grounding at $35 per 1,000 grounded prompts after the free tier)
- Retry storms, where a tool failure or schema mismatch drives 5 to 20 redundant calls per task
- Routing-gateway costs (LiteLLM self-hosted ops time, OpenRouter or Portkey or Helicone or Vellum platform fees) when the gateway becomes a production dependency
- Eval and observability cost (LangSmith, Braintrust, Promptfoo, Vellum subscriptions plus judge-model inference)
- Migration cost when the team picks the wrong tier and has to rebuild the prompt and eval against a different model
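The first three hidden costs in the list can be folded into a per-task estimate. The sketch below is a simplification under stated assumptions: `task_cost` and its parameters are hypothetical, every retry is assumed to repeat the full call, and the cache discount is modeled as a flat percentage off cached input tokens.

```python
def task_cost(
    input_tokens: int,
    output_tokens: int,
    input_price: float,   # USD per million input tokens
    output_price: float,  # USD per million output tokens
    *,
    cache_hit_rate: float = 0.0,   # share of input tokens served from cache
    cached_discount: float = 0.0,  # 0.9 means cached tokens cost 90% less
    tool_calls: int = 0,
    tool_fee_per_1000: float = 0.0,
    attempts: int = 1,             # 1 means no retries
) -> float:
    """Estimated USD cost of one task, including cache, tool fees, and retries."""
    cached = input_tokens * cache_hit_rate
    fresh = input_tokens - cached
    billable_input = fresh + cached * (1.0 - cached_discount)
    input_cost = billable_input * input_price / 1_000_000
    output_cost = output_tokens * output_price / 1_000_000
    tool_cost = tool_calls * tool_fee_per_1000 / 1_000
    return attempts * (input_cost + output_cost + tool_cost)


# Assumed agent task at Sonnet 4.6 prices: 20,000 input tokens, 2,000 output
# tokens, and four tool calls billed at $25 per 1,000.
single = task_cost(20_000, 2_000, 3.00, 15.00, tool_calls=4, tool_fee_per_1000=25.0)
with_retries = task_cost(
    20_000, 2_000, 3.00, 15.00, tool_calls=4, tool_fee_per_1000=25.0, attempts=5
)
```

In that assumed task the tokens cost about $0.09 and the tool calls about $0.10, so the tool fee is already the larger line before any retry; five attempts take the task from roughly $0.19 to roughly $0.95.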
Five coexistence patterns capture most production setups:
- Single-vendor stack: best for very small teams or very high-stakes single-purpose products. Pick Anthropic, OpenAI, or Google, and run everything on that vendor's mid-tier default (Sonnet, GPT-5, or Flash), with the frontier model when it matters. Spend lands at $500 to $5,000 a month at low volume; the routing question gets deferred until volume forces it.
- Two-vendor balanced stack: best for most production teams. Anthropic (Opus, Sonnet, Haiku) plus Google (Gemini Pro, Flash, Flash-Lite) covers the full tier ladder with two billing relationships and two API shapes. Spend lands at $1,000 to $15,000 a month; the routing decision lives in the team's call sites or in a thin gateway.
- Three-vendor full stack with a router: best for engineering-led teams with five or more task classes and meaningful monthly spend. OpenAI plus Anthropic plus Google plus a router (LiteLLM, OpenRouter, Portkey) covers every slot, adds fallback, and centralizes billing visibility. Spend lands at $5,000 to $50,000 a month plus the gateway dependency.
- Open-weight plus frontier hybrid: best for teams with cost pressure on the long tail. DeepSeek V4 Flash, Llama 3.3 70B, or Qwen 2.5 via OpenRouter on the efficiency tier; Anthropic or OpenAI on the frontier for the work that matters. Spend can drop 40 to 70 percent on volume work, with the offset being vendor-risk and tool-maturity tradeoffs.
- Specialist replacements: best for teams with a clear workload that a specialist beats. Mistral OCR for scanned documents, Codestral or Devstral for narrow code, Gemini image-output models for image generation. Specialists earn their slot when they replace a general call cleanly; they don't earn their slot as a hedge.
Two vendors earn their seats when one handles the frontier work and the other carries the efficiency-tier volume at a price the frontier vendor can't match. They don't earn their seats when the second vendor exists because nobody's classified the work and "having options" feels safer than picking the right tier.
Pitfalls and Anti-Patterns
Defaulting to Frontier Because Quality Matters
Pinning every endpoint to GPT-5.5, Opus 4.8, or Gemini 2.5 Pro because the team doesn't have time to test downgrades. The bill compounds without the quality bonus the team thinks it's buying, because most production work is extraction, classification, or summarization, where the frontier model and the efficiency model produce indistinguishable output. The fix is the eval harness: build the test set, run both tiers, route by the result.
Defaulting to Cheapest Because Tokens Are Tokens
Routing everything through Haiku 4.5, Flash-Lite, or GPT-5 nano because the per-million numbers are small. The savings on the model line get eaten by editor rework, customer escalations, and the silent quality regression that takes a quarter to surface. The fix is the same eval harness, applied in the other direction: confirm the cheap model holds the bar on the work that matters, and escalate the work where it doesn't.
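A minimal sketch of the harness both fixes point to, assuming each candidate model is wrapped in a callable and each eval example carries an expected answer that can be graded. The names Candidate, pass_rate, and cheapest_passing are illustrative, the stub lambdas stand in for real vendor calls, and the cost figures are placeholders, not quoted prices.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple

Example = Tuple[str, str]  # (prompt, expected answer)


@dataclass(frozen=True)
class Candidate:
    name: str                  # routing label (hypothetical)
    cost_per_million: float    # blended $ per million tokens (placeholder)
    run: Callable[[str], str]  # wraps the model call; stubbed below


def pass_rate(candidate: Candidate, examples: Sequence[Example],
              grade: Callable[[str, str], bool]) -> float:
    """Fraction of eval examples the candidate answers acceptably."""
    if not examples:
        raise ValueError("eval set is empty")
    passed = sum(1 for prompt, expected in examples
                 if grade(candidate.run(prompt), expected))
    return passed / len(examples)


def cheapest_passing(candidates: Sequence[Candidate],
                     examples: Sequence[Example],
                     grade: Callable[[str, str], bool],
                     min_pass_rate: float) -> Optional[Candidate]:
    """Walk candidates from cheapest to priciest; return the first that holds the bar."""
    for candidate in sorted(candidates, key=lambda c: c.cost_per_million):
        if pass_rate(candidate, examples, grade) >= min_pass_rate:
            return candidate
    return None  # nothing meets the tolerance: escalate or rework the prompt


# Stub models: the cheap one misses one example in four.
EXAMPLES = [("a", "A"), ("b", "B"), ("c", "C"), ("d", "D")]
CANDIDATES = [
    Candidate("frontier", 10.0, lambda p: p.upper()),
    Candidate("efficiency", 0.5, lambda p: p.upper() if p != "d" else "?"),
]


def exact_match(output: str, expected: str) -> bool:
    return output == expected


lenient = cheapest_passing(CANDIDATES, EXAMPLES, exact_match, min_pass_rate=0.70)
strict = cheapest_passing(CANDIDATES, EXAMPLES, exact_match, min_pass_rate=0.90)
```

With the stubs above, the lenient tolerance routes to the efficiency candidate and the strict one escalates to the frontier candidate. The same function covers both anti-patterns, because the tolerance decides the tier and the default does not.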
Routing on Token Count Instead of Task Class
Sending long-context calls to the frontier and short-context calls to the efficiency tier because token count looks like a proxy for complexity. A 100K-token document extraction may be easier than a 2K-token root-cause analysis. Context length is a cost driver, not a difficulty signal. The fix is to classify the work, then size the context against the class, not the other way around.
Counting Cost at the Model Call and Ignoring Tool Loops
Building an agent on Sonnet 4.6 and budgeting the model spend without budgeting tool fees such as OpenAI web search at $25 per 1,000 calls or Google Search grounding at $35 per 1,000 calls. Tool fees dominate the bill on heavy-loop agents, and the model token cost can be the smaller line item. The fix is per-task cost ceilings that include tool fees and retry counts, not just model tokens.
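A rough worked example of a per-task ceiling that counts tool fees and retries, using this dossier's own figures for Sonnet 4.6 ($3 / $15 per million) and web search ($25 per 1,000 calls). The token counts, call counts, and ceiling are invented for illustration, and the sketch assumes every retry repeats the full loop, which a real agent may not do.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskRun:
    input_tokens: int
    output_tokens: int
    tool_calls: int
    attempts: int = 1  # 1 + retries; assumes each attempt repeats the whole loop


def task_cost(run: TaskRun, input_per_m: float, output_per_m: float,
              tool_fee_per_1k: float) -> dict:
    """Split one task's cost into model tokens and tool fees, in dollars."""
    model = (run.input_tokens * input_per_m
             + run.output_tokens * output_per_m) / 1_000_000
    tools = run.tool_calls * tool_fee_per_1k / 1_000
    return {
        "model": model * run.attempts,
        "tools": tools * run.attempts,
        "total": (model + tools) * run.attempts,
    }


def within_ceiling(cost: dict, ceiling: float) -> bool:
    """True when the all-in task cost stays at or under the declared ceiling."""
    return cost["total"] <= ceiling


# Illustrative agent task: 20K tokens in, 2K out, 8 search calls, no retries.
run = TaskRun(input_tokens=20_000, output_tokens=2_000, tool_calls=8)
cost = task_cost(run, input_per_m=3.0, output_per_m=15.0, tool_fee_per_1k=25.0)
ok = within_ceiling(cost, ceiling=0.25)
```

Under these assumptions the model tokens come to about $0.09 and the eight searches to about $0.20, so the tool line is more than twice the model line and the task breaks a $0.25 ceiling that a token-only budget would have passed.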
Hard-Coding Model IDs at the Call Site
Writing the literal claude-opus-4-7 string directly in the application code. Mistral Large 2 is retired; DeepSeek's older aliases are being deprecated; OpenAI's GPT-5 nano now defers to GPT-5.4 nano. The model ID is going to change, repeatedly, and the change shouldn't be a code-change event every time. The fix is to abstract the model behind a routing alias the team owns: "extraction-default" maps to the current efficiency-tier choice, and the call site doesn't change when the underlying model does.
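One way to sketch the routing alias, assuming a single config table the team owns. Only "extraction-default" comes from the text above; the other alias and the model ID strings are hypothetical placeholders, not real vendor identifiers.

```python
from typing import Mapping

# Team-owned routing table. The right-hand values are placeholder IDs;
# in practice they would be whatever the vendor currently publishes.
ROUTES: Mapping[str, str] = {
    "extraction-default": "vendor-a/efficiency-model",
    "synthesis-default": "vendor-a/mid-model",
}


def resolve(alias: str, routes: Mapping[str, str] = ROUTES) -> str:
    """Translate a routing alias into the model ID currently assigned to it."""
    try:
        return routes[alias]
    except KeyError:
        raise KeyError(f"unknown routing alias: {alias!r}") from None


def call_site(document: str, routes: Mapping[str, str] = ROUTES) -> dict:
    """Call sites name the class of work, never the model itself."""
    return {"model": resolve("extraction-default", routes), "input": document}


# A deprecation becomes a config edit; call_site() is untouched.
UPDATED_ROUTES = {**ROUTES, "extraction-default": "vendor-b/efficiency-model"}

before = call_site("invoice text")
after = call_site("invoice text", UPDATED_ROUTES)
```

Raising on an unknown alias, instead of falling back silently, keeps a typo from routing traffic to an unintended tier.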
Treating Structured-Output Support as Structured-Output Reliability
Picking a model because the vendor docs say it supports function calling and JSON mode, and shipping without testing the actual rate at which the output validates on the team's schema. Vendor support means the feature exists; it doesn't mean the model produces valid output under adversarial inputs, optional fields, deeply nested objects, or malformed user prompts. The fix is a schema-validation eval that runs as part of the routing decision.
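A minimal sketch of a schema-validation eval, using only the standard library. The invoice schema, field names, and sample outputs are hypothetical; a real run would feed in raw model outputs collected against the team's own schema, probably through a full JSON Schema validator.

```python
import json
from typing import Sequence

# Hypothetical schema: field name -> accepted Python type(s).
REQUIRED = {"invoice_id": str, "total": (int, float)}
OPTIONAL = {"currency": str}


def _is_type(value, expected) -> bool:
    # bool is a subclass of int in Python, so reject it explicitly.
    return isinstance(value, expected) and not isinstance(value, bool)


def validates(raw: str) -> bool:
    """True when the raw output parses as JSON and matches the schema."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(obj, dict):
        return False
    for key, expected in REQUIRED.items():
        if key not in obj or not _is_type(obj[key], expected):
            return False
    for key, expected in OPTIONAL.items():
        # Optional fields may be absent, but when present they must be well-typed.
        if key in obj and not _is_type(obj[key], expected):
            return False
    return True


def validation_rate(outputs: Sequence[str]) -> float:
    """Share of raw outputs that validate; the number the routing decision needs."""
    if not outputs:
        raise ValueError("no outputs to evaluate")
    return sum(validates(o) for o in outputs) / len(outputs)


SAMPLE_OUTPUTS = [
    '{"invoice_id": "A-1", "total": 12.5}',                      # valid
    '{"invoice_id": "A-2", "total": "12.5"}',                    # number sent as string
    '{"invoice_id": "A-3", "total": 7, "currency": null}',       # optional field set to null
    'Sure! Here is the JSON: {"invoice_id": "A-4", "total": 3}', # prose around the JSON
    '{"invoice_id": "A-5", "total": 3, "currency": "EUR"}',      # valid
]
rate = validation_rate(SAMPLE_OUTPUTS)
```

Two of the five samples validate here, which is the kind of gap between "supports JSON mode" and "validates on this schema" that the eval is meant to surface before routing.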
Using Reasoning Models as Fast Models
Routing chat-style interactions to o3 because "reasoning is better." o3 is labeled slow by OpenAI's own docs and is succeeded by GPT-5 for most general use. Reasoning models are the right tier when the answer depends on a derivation; they're the wrong tier when the answer needs to ship in under a second. The fix is to reserve the reasoning tier for the work where the explicit deliberation is the deliverable.
What to Validate Before Paying for the Stack
The pilot below tests the routing decisions against real production traffic, not against a vendor demo. It produces measurable pass-fail gates and a defensible monthly-spend number.
Before day one. Build the eval sets: 50 to 200 examples per task class, covering the top three classes by volume and the top two by failure cost. Decide the failure tolerance per class, in writing, with the function owner. Pick one routing surface (gateway or vendor SDK with a routing module) the team is willing to depend on.
Week one: classify, eval, decide. Tag a representative week of production calls by class. Run the eval against the current model and against the candidate downgrade or upgrade per class. Produce a routing-decision spreadsheet with the per-class quality delta, the per-million cost delta, and the recommendation. The Friday output is a routing config the team is ready to ship, not a memo about routing in the abstract.
Week two: ship and watch. Route each class to the chosen default, and rerun the eval daily. Monitor latency, validation failure rate, retry count, tool-fee accumulation, and operator-visible quality on a dashboard the team checks every morning. Treat the first 48 hours as the rollback window, and revert immediately if any class breaks its tolerance.
Buy only if the loop wins. The pilot passes only when these gates hold:
- Every class meets its declared failure tolerance and stays under its declared monthly ceiling
- Validation failure rate, retry count, and tool-fee accumulation match the pre-pilot baseline or improve
- Operators can diagnose a wrong answer in under five minutes by tracing the call to its class, model, and prompt
- The monthly bill, projected from the first two weeks, is at least 20 percent below the pre-pilot trajectory on the volume classes without quality regression on the judgment classes
- The team has a rollback path that returns any class to the prior model within an hour
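The gates above can be written down as a checklist the team evaluates at the end of week two. This is a sketch under stated assumptions: the field names, the example classes, and every number are hypothetical, and the operations gate is passed in as a single boolean that the team computes from its own baseline comparison.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass(frozen=True)
class ClassResult:
    name: str
    failure_rate: float             # observed during the pilot
    tolerance: float                # declared in writing before day one
    projected_monthly_spend: float  # dollars, projected from the pilot
    monthly_ceiling: float          # declared ceiling, dollars


def pilot_gates(classes: Sequence[ClassResult],
                volume_spend_projected: float,
                volume_spend_baseline: float,
                ops_metrics_hold: bool,
                diagnose_minutes: float,
                rollback_minutes: float) -> Dict[str, bool]:
    """Evaluate each buy gate separately so a failure names its cause."""
    return {
        # Covers the judgment classes too: a regression there breaks tolerance.
        "tolerance_and_ceiling": all(
            c.failure_rate <= c.tolerance
            and c.projected_monthly_spend <= c.monthly_ceiling
            for c in classes
        ),
        "ops_metrics": ops_metrics_hold,
        "diagnosable": diagnose_minutes < 5,
        # At least 20 percent below the pre-pilot trajectory on volume classes.
        "savings": volume_spend_projected <= 0.8 * volume_spend_baseline,
        "rollback": rollback_minutes <= 60,
    }


def pilot_passes(gates: Dict[str, bool]) -> bool:
    return all(gates.values())


# Hypothetical pilot results.
RESULTS = [
    ClassResult("extraction", 0.010, 0.020, 1_200.0, 1_500.0),
    ClassResult("synthesis", 0.030, 0.050, 4_000.0, 4_500.0),
]
gates = pilot_gates(RESULTS, volume_spend_projected=7_500.0,
                    volume_spend_baseline=10_000.0, ops_metrics_hold=True,
                    diagnose_minutes=3, rollback_minutes=45)
thin_savings = pilot_gates(RESULTS, volume_spend_projected=8_500.0,
                           volume_spend_baseline=10_000.0, ops_metrics_hold=True,
                           diagnose_minutes=3, rollback_minutes=45)
```

Keeping the gates as named booleans means a failed pilot reports which gate failed. In the second call, a 15 percent saving fails the savings gate even though every quality gate holds.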
Fail the pilot if the routing gateway can't:
- Show which model handled a specific call, with the input, output, latency, and cost
- Cap spend at the class level, not just the vendor level
- Document fallback behavior when the primary model returns an error or hits a rate limit
- Stream logs to the team's existing observability layer
- Produce a clear answer on data residency, model versioning, and rollback for the gateway's own service
Methodology
Declared frame: the right model is the cheapest one that doesn't fail the task, and the team's job is to classify the work so the routing decision can be made and defended. The dossier maps ten task classes against the model tier that earns each one, layers in pricing math sampled June 1, 2026, and treats vendor selection as a downstream decision rather than the central one.
Sources consulted: vendor pricing pages and model documentation for Anthropic, OpenAI, Google, Mistral, DeepSeek, Meta, and Alibaba; routing-gateway documentation for LiteLLM, OpenRouter, Portkey, Helicone, and Vellum; cross-provider benchmarks from Artificial Analysis; published customer cost data from Anthropic (Claude Code) and OpenAI (GPT-5.5 finance workflow), plus published Intercom Fin outcome-pricing data for product comparison framing. Pricing and feature claims reflect a June 1, 2026 snapshot and shift as vendors revise plans, models, and tiers.
In scope: per-task-class model selection for production AI stacks running between 100,000 and 10,000,000 model calls a month across the standard chat, agent, code, vision, and reasoning workloads. Above 10,000,000 calls a month, the routing decisions still apply, but enterprise commit pricing, custom rate-limit tiers, and self-hosted open-weight options change the cost math non-linearly.
Out of scope: deep prompt-engineering technique comparison, fine-tuning economics, dedicated inference deployments, and the agent-platform-versus-workflow-tool decision (covered in the prior dossier on work shape).
Sources
- Anthropic — Pricing
- Anthropic — Claude models
- Anthropic — What's new in Claude 4.8
- Anthropic — Choosing a Claude model
- Anthropic — Vision
- Anthropic — Rate limits
- Anthropic — Claude Code costs
- OpenAI — Pricing
- OpenAI — GPT-5.5
- OpenAI — GPT-5
- OpenAI — GPT-5 mini
- OpenAI — GPT-5.4 nano
- OpenAI — GPT-5 nano
- OpenAI — o3
- OpenAI — o4-mini
- OpenAI — Introducing GPT-5.5
- Google — Gemini API pricing
- Google — Gemini 2.5 Pro
- Google — Gemini 2.5 Flash
- Google — Gemini 2.5 Flash-Lite
- Google — Gemini rate limits
- Mistral — Pricing
- Mistral — Model overview
- Mistral — Rate limits
- DeepSeek — Pricing
- DeepSeek — Tool calls
- Meta — Llama 3.3 model card
- Alibaba — Qwen2.5 72B Instruct
- Alibaba — Qwen2.5 blog
- Artificial Analysis — Models
- Artificial Analysis — GPT-5.5 providers
- Artificial Analysis — Claude Opus 4.8 providers
- Artificial Analysis — Claude Sonnet 4.6 providers
- Artificial Analysis — Gemini 2.5 Flash providers
- Artificial Analysis — Gemini 2.5 Flash-Lite providers
- Artificial Analysis — Anthropic provider
- OpenRouter — Llama 3.3 70B Instruct
- OpenRouter — Qwen2.5 72B Instruct
- OpenRouter — Gemini 2.5 Pro
- OpenRouter — Quickstart
- OpenRouter — Provider routing
- LiteLLM — Simple proxy
- Portkey — AI gateway
- Helicone — Gateway overview
- Helicone — Caching
- Vellum — Observability
- Vellum — Evaluations
- Fin — Pricing
- Fin — Lightspeed customer story
Tools Mentioned
- Claude Opus 4.8: Anthropic's frontier model for autonomous engineering, deep synthesis, and high-stakes agentic loops, priced $5 input and $25 output per million with 1M context
- Claude Sonnet 4.6: Anthropic's balanced mid-tier model for synthesis, generation, agents, and most production work, priced $3 / $15 per million with 1M context
- Claude Haiku 4.5: Anthropic's efficiency-tier model for fast extraction, classification, and sub-agents, priced $1 / $5 per million with 200K context
- GPT-5.5: OpenAI's frontier model for autonomous coding and professional analysis, priced $5 / $30 per million with 1.05M context
- GPT-5: OpenAI's mid-tier model for reasoning-heavy work below GPT-5.5, priced $1.25 / $10 per million with 400K context
- GPT-5 mini: OpenAI's value-tier model for shallow code, summarization, and sub-agent work, priced $0.25 / $2 per million
- GPT-5.4 nano: OpenAI's current small default for classification, extraction, and ranking, priced $0.20 / $1.25 per million
- o3 and o4-mini: OpenAI's reasoning models for derivation-heavy work, priced $2 / $8 (o3) and $1.10 / $4.40 (o4-mini) per million
- Gemini 2.5 Pro: Google's frontier reasoning model with long-context strength, priced $1.25 / $10 per million below 200K and $2.50 / $15 above
- Gemini 2.5 Flash: Google's mid-tier multimodal workhorse, priced $0.30 / $2.50 per million with 1M context
- Gemini 2.5 Flash-Lite: Google's efficiency-tier model for extraction, classification, and simple vision, priced $0.10 / $0.40 per million
- Mistral Large 3 / Medium 3.5: Mistral's open-weight-friendly flagship and coding-and-reasoning model, priced $0.50 / $1.50 and $1.50 / $7.50 per million
- Codestral / Devstral: Mistral's code-specialized models for shallow code assist and narrow coding agents
- Mistral OCR: Mistral's document-understanding model for scanned PDFs and layout extraction, priced $2 per 1,000 pages
- DeepSeek V4 Flash / V4 Pro: DeepSeek's current open-weight-style models with OpenAI and Anthropic-compatible endpoints, priced from $0.14 / $0.28 per million on V4 Flash
- Llama 3.3 70B: Meta's open-weight text model, available through hosted providers like OpenRouter and Together with varying pricing
- Qwen 2.5 72B: Alibaba's open-weight multilingual model with strong code and math, available through OpenRouter and similar hosts
- LiteLLM: OpenAI-compatible proxy and gateway for 100+ LLMs with spend tracking, budgets, fallbacks, and caching
- OpenRouter: Unified endpoint for many models with provider routing, fallbacks, and BYOK support
- Portkey: AI gateway with observability, retries, fallbacks, caching, and cost controls across 1,600+ models
- Helicone: Gateway and observability platform with usage and cost dashboards, intelligent routing, and caching
- Vellum: Prompt and workflow platform with execution traces, online evaluations, and regression monitoring
- Artificial Analysis: Cross-provider benchmark site for latency, output speed, price, and feature comparison