Agentic Means Four Things
By 2026 'agentic' covers four categorically different system classes, and conflating them is already causing procurement, engineering, and regulatory errors.

In December 2024, Anthropic published Building Effective Agents, an engineering post that defined workflows as "systems where LLMs and tools are orchestrated through predefined code paths" and agents as "systems where LLMs dynamically direct their own processes and tool usage." The distinction held the field together for about a year. By mid-2026 it no longer does. The word "agentic" now covers four categorically different system classes that neither Anthropic's binary nor the academic taxonomies fully resolve, and the conflation is starting to produce real procurement and engineering errors.
METR's Time Horizons benchmark, latest revision published May 8, 2026, makes the spread visible from the autonomy side. The benchmark measures the task length, expressed as the time a human expert needs, at which a frontier model succeeds 50% of the time. Two years ago the top of the curve was a few minutes. Today it sits in the multiple-hour range, with Claude Mythos Preview leading. Production "agents" in the wild span the full curve: some are single-turn tool callers wrapping a model in a thin code loop, some operate for hours without supervision. The word doesn't distinguish them.
The Four Classes
No published source bundles these four together as a single taxonomy. The class names below come from established field vocabulary; the four-way packaging is a synthesis that doesn't appear elsewhere as a unit.
Single-turn tool user. A model that emits tool calls within a single response turn. The model picks the tool, the runtime executes it, and the result feeds back into the same response. No state persists past the turn. The model never sets its own next move. Example: Claude or GPT-5.5 answering a coding question by emitting one structured function call to fetch documentation, then writing the answer. Anthropic's published patterns of "prompt chaining" and "routing" land here.
Code-orchestrated loop. A workflow where application code wraps the model in a loop and calls tools between turns. The code decides when the loop ends. The model contributes one response per iteration but doesn't control the iteration. Example: a customer-support ticket pipeline that classifies, retrieves history, drafts a reply, runs a policy check, and ships. Anthropic's "orchestrator-workers" and "evaluator-optimizer" patterns land here. So do most production retrieval-and-generation setups, the kind where the system fetches relevant context before the model writes.
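A minimal sketch of this class, using the support-ticket example. Everything in it is hypothetical: `fake_model` stands in for a real model call, and the step names and the policy rule are illustrative only. What it is meant to show is that application code fixes the sequence and decides when the loop ends, while the model only fills in each step.

```python
from typing import Callable

# Hypothetical stand-in for a model call; a real system would call an LLM API here.
def fake_model(prompt: str) -> str:
    if prompt.startswith("classify:"):
        return "billing"
    return "Draft reply: we have refunded the duplicate charge."

def passes_policy(reply: str) -> bool:
    # Illustrative policy rule: block replies that mention legal action.
    return "lawsuit" not in reply.lower()

def handle_ticket(ticket: str, model: Callable[[str], str], max_drafts: int = 3) -> dict:
    """Application code owns the control flow; the model contributes one response per step."""
    category = model(f"classify: {ticket}")
    history = f"(history for {category} tickets)"  # retrieval stub
    for attempt in range(1, max_drafts + 1):
        reply = model(f"draft: {ticket}\ncontext: {history}")
        if passes_policy(reply):
            return {"category": category, "reply": reply,
                    "attempts": attempt, "shipped": True}
    # The code, not the model, decides that the loop is over.
    return {"category": category, "reply": None,
            "attempts": max_drafts, "shipped": False}

result = handle_ticket("I was charged twice.", fake_model)
```

Note that the exit condition (`max_drafts`) lives in the code. Nothing the model returns can extend the loop.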
Plan-execution agent. A system where the model produces a plan, the runtime executes plan steps, and the model re-enters when steps fail or finish, updating the plan. The model now controls the loop, but the plan itself acts as a contract between model and runtime. Examples: Codex /goal blocks that have demonstrably run for hours on autopilot, and Claude Code dynamic workflows shipped on May 28, 2026 alongside Opus 4.8. The plan is the contract; the runtime enforces it; the model fills in the steps.
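The plan-as-contract idea can be sketched in a few lines. This is an illustration under stated assumptions, not how Codex or Claude Code implement it: `fake_planner` is a hypothetical stand-in for the model, `STEPS` is a made-up tool registry, and one step is rigged to fail so the re-entry path is exercised.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical planner standing in for a model call; it returns an ordered list of step names.
def fake_planner(goal: str, failed_step: Optional[str] = None) -> List[str]:
    if failed_step is None:
        return ["fetch_data", "analyze", "write_report"]
    # On re-entry the model revises the plan around the failed step.
    return ["fetch_cached_data", "analyze", "write_report"]

# Hypothetical step implementations; each returns True on success.
STEPS: Dict[str, Callable[[], bool]] = {
    "fetch_data": lambda: False,        # simulated failure
    "fetch_cached_data": lambda: True,
    "analyze": lambda: True,
    "write_report": lambda: True,
}

def run_goal(goal: str, max_replans: int = 2) -> dict:
    plan = fake_planner(goal)
    completed: List[str] = []
    replans = 0
    while plan:
        step = plan.pop(0)
        if STEPS[step]():
            completed.append(step)
            continue
        # The runtime detects the failure against the plan and hands control back to the model.
        if replans == max_replans:
            return {"done": False, "completed": completed, "replans": replans}
        replans += 1
        plan = fake_planner(goal, failed_step=step)  # revised plan replaces the remaining steps
    return {"done": True, "completed": completed, "replans": replans}

outcome = run_goal("quarterly report")
```

The model decides what the steps are, but every failure surfaces as a concrete step the runtime can name, which is what makes this class debuggable.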
Long-running autonomous system. A system that operates for hours to days with the model itself maintaining state, deciding when to stop, and recovering from its own mistakes. No surrounding code controlling the loop, and no fixed plan to follow. The model is the orchestrator. Examples: ChatGPT Agent (launched July 17, 2025, merging Operator and Deep Research) and Claude Cowork (January 2026 desktop app for knowledge work). These are the systems the METR benchmark is built to measure.
The same word, "agentic," gets applied to all four. The differences between them aren't cosmetic; they change what the system can do, what it can fail at, and what governance it needs.
Why the Binary Fails
Anthropic's two-category framing, code-driven workflows on one side and model-directed agents on the other, is real, useful, and worth keeping. The post is one of the clearest pieces of structural writing the AI engineering field has produced. The limitation is that the binary still maps two of the four classes onto each side. Single-turn tool users and code-orchestrated loops both sit on the workflow side. Plan-execution agents and long-running autonomous systems both sit on the agent side.
Inside each side, the differences matter. A single-turn tool user fails by returning a wrong tool call; a code-orchestrated loop fails when an error in the wrapping code goes uncaught. Both qualify as workflows in the Anthropic framing, but they fail in categorically different ways and need different monitoring to spot the failures.
The agent side splits worse. A plan-execution agent fails by producing an unworkable plan, and the runtime can detect the failure when steps return errors against the plan's stated completion criteria. A long-running autonomous system fails by drifting from the goal without any fixed plan or checkpoint the runtime can verify against. The first failure is debuggable. The second often only becomes visible when the human comes back and reads the trace.
So the two-category framing is correct as far as it goes. The four-class taxonomy is what production reality requires.
The Five Structural Axes
The classes differ along five axes. None is novel on its own; the contribution is that together they let a buyer or builder place a system precisely.
Autonomy duration. How long the system runs between human checks. The four classes spread across orders of magnitude on this axis: single-turn tool users in seconds, code-orchestrated loops in minutes to a few hours, plan-execution agents in hours, long-running autonomous systems in hours to days. METR's Time Horizons benchmark formalizes the axis with a measurement protocol, and frontier models have doubled their time horizon roughly every seven months from 2019 through 2025, with the rate accelerating since. Where a system sits on the autonomy-duration axis is the closest single proxy for which class it falls into.
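To get a feel for what a seven-month doubling time compounds to, here is a small worked calculation. It assumes the doubling is perfectly steady, which is a simplification of the trend described above, and it treats the window as six years (72 months); the function name is illustrative.

```python
def horizon_growth(months: float, doubling_months: float = 7.0) -> float:
    """Multiplicative growth in time horizon after `months`, assuming steady doubling."""
    return 2 ** (months / doubling_months)

# Six years (72 months) of doubling every seven months is about ten doublings.
six_year_factor = horizon_growth(72)  # on the order of 1,000x
```

Ten doublings is the difference between a task measured in seconds and one measured in hours, which is why the four classes end up orders of magnitude apart on this axis.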
Tool surface. How many distinct actions the model can invoke, and whether those actions are revocable. The tool count climbs across the classes: a single-turn tool user typically has one to three, a code-orchestrated loop a fixed set of five to fifteen chosen by the code author, a plan-execution agent whatever the plan calls for. Long-running autonomous systems, particularly those with computer-use access like Claude Cowork or ChatGPT Agent, have effectively unlimited tool surface, which is part of what makes governance so hard.
Recovery behavior. What happens when a step fails. The first three classes hand recovery off to something other than the model: a single-turn tool user to the user's next message, a code-orchestrated loop to its error-handling and retry logic, a plan-execution agent to the plan re-entry. The fourth class is structurally different. The long-running autonomous system has the model decide what to do about its own failure, which is what makes the recovery axis the central distinction at the top of the autonomy curve.
State persistence. Whether state carries between runs. Single-turn tool users keep none; the other three classes each persist something specific. Code-orchestrated loops keep whatever the application chooses to hold, usually the conversation history and any retrieved context. Plan-execution agents persist the plan itself across runs. Long-running autonomous systems carry their own working memory, often informally inside the context the model reads at each step and increasingly through dedicated memory storage outside the model.
Decision authority. Whether the system can act, or only propose actions for a human to confirm. The first three classes act with progressively narrower constraints: single-turn tool users act directly, code-orchestrated loops within the code's constraints, plan-execution agents on plan steps but often pausing on irreversibles. Long-running autonomous systems vary by deployment: ChatGPT Agent requires human confirmation on financial transactions but acts freely on read-only work; Claude Cowork prompts on actions and waits for the user's approval. This is the axis the Cloud Security Alliance's L0 to L5 autonomy scale tries to formalize, by analogy with SAE's autonomous-driving levels.
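A confirmation gate on irreversible actions is the simplest mechanical form of this axis. The sketch below is a generic illustration, not a description of how ChatGPT Agent or Claude Cowork implement their prompts; `Action`, `execute`, and the `confirm` callback are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Action:
    name: str
    irreversible: bool

def execute(action: Action, confirm: Callable[[Action], bool]) -> str:
    """Run reversible actions directly; ask a human before irreversible ones."""
    if action.irreversible and not confirm(action):
        return "blocked"
    return "executed"

def deny_all(action: Action) -> bool:
    # Hypothetical reviewer who declines every request.
    return False

read_result = execute(Action("read_inbox", irreversible=False), deny_all)
pay_result = execute(Action("send_payment", irreversible=True), deny_all)
```

Read-only work goes through without a prompt, while the payment waits on the human. Where a deployment draws the `irreversible` line is the decision-authority setting in practice.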
A compact comparison across all five axes:
| Class | Autonomy duration | Tool surface | Recovery | State | Decision authority |
|---|---|---|---|---|---|
| Single-turn tool user | Seconds | 1-3 tools | Human's next message | None | Acts |
| Code-orchestrated loop | Minutes to hours | 5-15 fixed tools | Wrapping code catches | Application-held | Acts within constraints |
| Plan-execution agent | Hours | Plan-focused subset | Plan re-entry | Plan persists | Acts, confirms on irreversibles |
| Long-running autonomous | Hours to days | Effectively unlimited | Model self-correction | Working memory | Varies; confirms on high-stakes |
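
For a first-pass placement, the class definitions reduce to three yes/no questions about who controls the loop. The sketch below assumes those three questions are enough for screening; the field names are hypothetical, and the five axes in the table still need to be examined individually once the class is known.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SystemProfile:
    loops: bool                # does the system run more than one turn?
    model_controls_loop: bool  # does the model, not the code, decide the next step?
    follows_plan: bool         # is there an explicit plan the runtime checks against?

def classify(p: SystemProfile) -> str:
    """Map a profile onto one of the four classes named in the table."""
    if not p.loops:
        return "Single-turn tool user"
    if not p.model_controls_loop:
        return "Code-orchestrated loop"
    if p.follows_plan:
        return "Plan-execution agent"
    return "Long-running autonomous"

support_pipeline = classify(
    SystemProfile(loops=True, model_controls_loop=False, follows_plan=False)
)
```

The ordering of the checks mirrors the progression in the table: each question only matters once the previous one has been answered "yes."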
Where the Conflation Hurts
Three concrete failure modes come out of treating the four classes as one thing.
The first is procurement. A team writes an RFP for "an agentic customer-support system." The vendors that respond span all four classes. The cheapest pitch is a code-orchestrated loop that wraps a model in thin application code. The most expensive is a long-running autonomous system with computer-use access. Both are "agentic" by the team's wording. The team doesn't have language to ask the right disambiguating questions, so the comparison collapses to feature checklists and reference customers, neither of which reveals the class difference. The team often ends up paying long-running-autonomous prices for what their workflow actually needs from a code-orchestrated loop.
The second is engineering under-investment in recovery. Builders who think of their system as "an agent" expect the model to handle failures internally, because that's how long-running autonomous systems work. When the system is actually a code-orchestrated loop, the wrapping code is what has to catch and recover, but the code rarely does because the team's mental model said the model would. Production support tickets pile up, the team adds prompt instructions to handle each new failure mode, and the system slowly accretes prompt complexity that should have been recovery code.
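What "recovery code" means for a code-orchestrated loop can be shown concretely. The following is a minimal sketch, assuming a bounded retry with exponential backoff and a human-escalation fallback; `ToolError`, `call_with_recovery`, and the flaky tool are all hypothetical and exist only for illustration.

```python
import time
from typing import Callable

class ToolError(Exception):
    """Hypothetical error raised when a tool call fails."""

def call_with_recovery(tool: Callable[[], str], retries: int = 3,
                       base_delay: float = 0.0,
                       fallback: str = "escalate_to_human") -> str:
    """Retry a failing tool in the wrapping code, then fall back explicitly."""
    for attempt in range(retries):
        try:
            return tool()
        except ToolError:
            if attempt < retries - 1:
                time.sleep(base_delay * (2 ** attempt))  # exponential backoff
    return fallback

def make_flaky_tool(failures: int) -> Callable[[], str]:
    # Test helper: a tool that fails `failures` times, then succeeds.
    state = {"remaining": failures}
    def tool() -> str:
        if state["remaining"] > 0:
            state["remaining"] -= 1
            raise ToolError("temporary failure")
        return "ok"
    return tool

recovered = call_with_recovery(make_flaky_tool(2))  # succeeds on the third attempt
escalated = call_with_recovery(make_flaky_tool(5))  # exhausts retries and falls back
```

None of this logic belongs in a prompt. When it lives in code, each new failure mode becomes a handled exception type instead of another paragraph of instructions.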
The third is regulatory misclassification. The EU AI Act's systemic-risk classification, applicable from August 2, 2026, names capability thresholds and behavior characteristics without distinguishing between the four classes. A single-turn tool user with access to a powerful underlying model can score the same on a capability benchmark as a long-running autonomous system, but the deployment risk profiles are radically different. Regulators classifying on the model alone, without naming which class deploys it, are pricing risk on the wrong axis.
The Strongest Rebuttal
The strongest counter-reading is that the field is converging on a working vocabulary, and the four-class taxonomy is an over-engineered response to a transient definitional gap. Anthropic distinguishes workflow from agent. The Cloud Security Alliance proposes L0 to L5 autonomy levels. arXiv preprints from 2025 and 2026 propose Goal-Complexity, Environmental-Complexity, and Adaptability dimensions. The field knows what it means by "agentic"; the editorial just needs to wait.
The convergence is real, but it is partial. Anthropic's binary settles two of the four classes onto one label and the other two onto the other label, which is precisely what produces the conflation in the first place. The CSA L0-L5 scale formalizes the decision-authority axis but doesn't name the other four. The arXiv taxonomies are academically rigorous and unreadable for practitioners; in 2026 procurement, an arXiv ranking is not what's organizing the RFP.
A field can converge on a word while the structural concept underneath that word is still drifting. That's the situation in mid-2026. The four-class taxonomy lasts until the field formalizes something better. When it does, upgrade the taxonomy and keep the discipline of naming the axes underneath whatever word the field settles on.
What to Carry
Replace the noun with the axis. When a vendor pitch, an RFP, or an internal doc says "agentic," the right next question is which class, and the better next question is which axis matters for the call being made.
- Buying or building a customer-support workflow that runs at volume? The axes that matter are recovery behavior and tool surface. Don't buy a long-running autonomous system when a code-orchestrated loop fits.
- Building a research assistant that has to last for hours and produce a deliverable? Plan-execution is the floor. Long-running autonomous if the plan can't be written ahead of time.
- Reviewing a vendor's risk disclosure? Read for decision authority and state persistence. Confirmation gates on irreversibles are the procurement-relevant signal.
The word "agentic" will keep being used because the market wants a single label for an exciting category. The label is fine. What it conceals is what the next two years of agent governance, procurement, and engineering will turn on. Naming the axes is the discipline that keeps the conversation honest.