Most AI agents stop after one task. They run the first step, return a confident-sounding output, and then either silently halt, hand back to the human, or hallucinate a "task complete" status that does not match reality. This is not a marketing failure; it is a measurable engineering pattern documented in the public benchmarks. The pattern has four causes, a clear compounding-probability shape, and a small set of tests that surface it before procurement.
The numbers from GAIA are unambiguous: human pass rates above 90 percent on the benchmark, top agent systems below 50 percent on Level 3 multi-step questions (Mialon et al., 2023). SWE-bench shows the same shape: pass rates drop with each additional file the agent must touch (SWE-bench leaderboard, retrieved 2026-05-07). The drop is not about model capability; it is about how systems compound errors across steps.
The pattern, in numbers
GAIA, the General AI Assistants benchmark, reports human pass rates above 90 percent. The strongest evaluated agent systems score below 50 percent on Level 3, which is the multi-step, multi-tool tier. The gap shrinks dramatically on Level 1, where one tool and one step are enough. The pattern is not "AI is bad at agents"; it is "AI is much worse at the second step than the first" (Mialon et al., GAIA, 2023).
SWE-bench, which tests AI on real GitHub issue resolution, shows the same shape from a different angle. Issues that require modifying one file pass at substantially higher rates than issues requiring four or more files for the same agent system. AgentBench, a separate cross-environment benchmark, reports analogous patterns across web tasks, code tasks, and game tasks (Liu et al., AgentBench, 2023). The shape is consistent enough that it should be the default expectation, not a surprise.
Four causes of multi-step decay
The four causes are not exhaustive but cover the majority of stop-after-one-task incidents. They compound. A system can have all four operating simultaneously, which is why the symptom looks worse than any single cause predicts.
- Context drift. The agent loses track of the original goal across steps. Long contexts dilute the goal; tool outputs introduce competing instructions; the agent ends up answering the most recent input rather than the original task. Memory architecture (covered in memory explained) is the structural fix.
- Brittle tool schemas. The agent calls a tool that returns a slightly different shape than expected. The agent either hallucinates the missing field or fails the parse and stops. Schema drift is one of the eight categories in the 80-test methodology for that reason; a minimal validation sketch follows this list.
- Missing error recovery. Tool returns 5xx, times out, hits a rate limit, or returns a partial result. The agent has no recovery loop. It surfaces the error to the user and halts. Real production environments produce these errors constantly; agents that lack recovery cannot survive them.
- Absent completion checks. The agent finishes step one and reports success without verifying. The completion criterion is implicit, not explicit. A workflow with five steps becomes a workflow with one step plus four hopeful guesses.
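The schema-drift cause is the easiest of the four to illustrate. The sketch below assumes a hypothetical tool response shape; the field names are placeholders, and the point is that a response with a missing or mistyped field is rejected loudly rather than guessed at.

```python
# Minimal illustration of explicit schema validation. The expected fields are
# hypothetical; a drifted tool response is rejected instead of being guessed at.

EXPECTED_FIELDS = {"order_id": str, "status": str, "total": float}

class SchemaDriftError(ValueError):
    """Raised when a tool response does not match the expected shape."""

def validate_tool_response(response: dict) -> dict:
    for field, expected_type in EXPECTED_FIELDS.items():
        if field not in response:
            raise SchemaDriftError(f"missing field: {field}")
        if not isinstance(response[field], expected_type):
            raise SchemaDriftError(
                f"{field} has type {type(response[field]).__name__}, "
                f"expected {expected_type.__name__}"
            )
    return response
```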
The compounding math
The math is the headline. If each step has a 90 percent success rate, five independent steps have a joint success rate of 0.9 to the fifth power, which is 59 percent. Ten steps drop to 35 percent. This is the joint-probability ceiling for a system without error recovery; in practice the rate is worse, because steps are correlated (a failure in step two often biases steps three through five).
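The arithmetic is short enough to check yourself. The sketch below assumes fully independent steps, which is the optimistic case noted above; any correlation between steps pushes the real number lower.

```python
# Joint success probability for a chain of steps, assuming independence.
# Independence is the optimistic case; correlated failures push the number lower.

def chain_success(per_step: float, steps: int) -> float:
    return per_step ** steps

for n in (1, 5, 10):
    print(f"{n:>2}-step chain at 90% per step -> {chain_success(0.9, n):.0%}")
# 1-step chain -> 90%, 5-step -> 59%, 10-step -> 35%
```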
The implication for buyers is direct. A demo of one impressive step proves nothing about a five-step task. Insist on the run that goes the full distance, ten times in a row. The number of completed runs out of ten is the only honest reliability claim available without internal benchmarks.
How to test for stop-after-one-task
The fastest buyer-side test is structural. Pick a task with at least five dependent steps. Run it ten times with slight input paraphrases. Count how many runs reach step five with the correct output. If fewer than nine of ten reach step five, the system has multi-step decay. Then inject a controlled failure at step three (a tool error or a malformed response). If the agent does not recover, it lacks error recovery. Both are common; both are disqualifying for production use.
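A minimal harness for that test might look like the sketch below. run_agent is a placeholder for whatever interface the system under test exposes; it should return the last step a run completed correctly. The ten-run counting logic is the part that matters.

```python
# Sketch of the buyer-side structural test. run_agent is a hypothetical hook
# standing in for whatever interface the system under test actually exposes.

from typing import Callable, Sequence

def multi_step_decay_test(
    run_agent: Callable[[str], int],  # returns the last step completed correctly
    paraphrases: Sequence[str],       # ten slight rewordings of the same five-step task
    required_step: int = 5,
    pass_threshold: int = 9,
) -> bool:
    reached = sum(1 for prompt in paraphrases if run_agent(prompt) >= required_step)
    print(f"{reached}/{len(paraphrases)} runs reached step {required_step}")
    return reached >= pass_threshold

# Phase two, error recovery: enable a controlled tool failure at step three
# (however the harness exposes that) and rerun. An agent without a recovery
# loop stops reaching step five at all.
```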
The structural test mirrors three of the eight categories in the 80-test methodology: input variation (the paraphrases), tool failure (the injected error), and partial results (the runs that complete some steps and not others). The methodology runs ten tests per category for the same reason buyers should: ten is the smallest number that produces stable estimates for non-deterministic systems.
What fixing it looks like
Fixing stop-after-one-task is not one change. It is four changes, applied together. Memory architecture stops context drift. Schema validation with retry stops brittle tool failures. Explicit error-recovery loops with bounded retry counts stop tool errors from halting the system. Explicit completion checks (the agent must verify the end-state matches the goal before reporting success) stop the optimistic-finish failure mode.
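Two of those fixes, the bounded retry loop and the explicit completion check, fit in a short sketch. call_tool and goal_is_met below are hypothetical hooks standing in for whatever the real system provides; this is a shape, not a definitive implementation.

```python
# Sketch of a bounded retry loop around tool calls and an explicit completion
# check before reporting success. call_tool and goal_is_met are placeholders.

import time

MAX_RETRIES = 3

def call_with_recovery(call_tool, *args):
    """Retry transient tool failures a bounded number of times, then fail loudly."""
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            return call_tool(*args)
        except (TimeoutError, ConnectionError):
            if attempt == MAX_RETRIES:
                raise
            time.sleep(2 ** attempt)  # simple exponential backoff between attempts

def report_done(goal_is_met, result):
    """Verify the end state matches the goal before claiming the task is complete."""
    if not goal_is_met(result):
        raise RuntimeError("completion check failed: end state does not match the goal")
    return result
```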
Each of these is documented separately in the cluster: memory, tool use, orchestration, reasoning. The four together are the structural answer to multi-step decay. The 80-test methodology is the operational answer that confirms the structure works under realistic stress.
The pragmatic answer for buyers in 2026 is to assume multi-step decay until proven otherwise. The product framing in "describe outcome, not workflow" exists because outcome-described tasks force the system to plan past step one. If a vendor cannot describe how their system handles step five, that is the answer.
Frequently asked questions
Why do AI agents stop after one task?
Four causes dominate: context drift across steps, brittle tool-call schemas, missing error recovery, and absent task-completion checks. Each compounds with the next. At a 90 percent per-step success rate, only 59 percent of five-step runs complete, and the joint probability of failure overtakes success by roughly the seventh step.
What does GAIA say about multi-step agent reliability?
GAIA reports human pass rates above 90 percent and the strongest evaluated agent systems below 50 percent on harder multi-step tasks. The gap is largest on Level 3 questions that require multi-tool, multi-step reasoning. Single-tool, single-step questions show much smaller human-to-agent gaps.
How does SWE-bench show multi-step decay?
SWE-bench measures real GitHub issue resolution, which often requires reading multiple files, modifying code, and running tests. Pass rates drop sharply with each additional file the agent must touch. Issues requiring four or more files show pass rates roughly half of single-file issues for the same agent system.
How can a buyer test for stop-after-one-task in an AI agent?
Run the same task with at least five dependent steps, ten times. Track how many runs reach step five. If fewer than nine of ten reach step five, the agent has multi-step decay. Then introduce a tool error at step three; if the agent does not recover, it lacks error recovery. Both failure modes are common.
How does Gravity handle multi-step reliability?
Through the 80-test methodology run before any capability ships. The methodology weights partial-results failures and idempotency failures more than paraphrase failures, because those are the failures that produce stop-after-one-task behaviour. A capability does not ship until the weighted pass rate exceeds 95 percent across all eight categories.
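As an illustration only, a weighted pass rate of this shape might be computed as below. The category names and weights are placeholders, not the methodology's actual values; the point is that the failure modes that produce stop-after-one-task behaviour count for more.

```python
# Illustrative weighted pass rate. Category names and weights are placeholders.

WEIGHTS = {"partial_results": 2.0, "idempotency": 2.0, "input_variation": 1.0}

def weighted_pass_rate(results: dict[str, list[bool]]) -> float:
    """results maps category name -> per-test pass/fail outcomes."""
    total = sum(WEIGHTS[cat] * len(tests) for cat, tests in results.items())
    passed = sum(WEIGHTS[cat] * sum(tests) for cat, tests in results.items())
    return passed / total

# Ship only if the weighted pass rate exceeds 0.95 across all categories.
```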
Three takeaways before you close this tab
- The benchmarks confirm the gap. Step-one is fine; step-five is where most systems fail.
- The math is unforgiving. 90 percent per step is 59 percent at five steps.
- The buyer-side test is fast. Ten runs of a five-step task; count how many reach step five.
Sources
- Mialon et al., "GAIA: A Benchmark for General AI Assistants", arXiv:2311.12983, 2023, retrieved 2026-05-07, arxiv.org/abs/2311.12983
- SWE-bench, "Leaderboard for software engineering benchmark", retrieved 2026-05-07, swebench.com
- Liu et al., "AgentBench: Evaluating LLMs as Agents", arXiv:2308.03688, 2023, retrieved 2026-05-07, arxiv.org/abs/2308.03688
- Anthropic, "Building Effective Agents", retrieved 2026-05-07, anthropic.com/engineering/building-effective-agents
- OWASP, "Top 10 for LLM Applications", retrieved 2026-05-07, owasp.org