The single-agent vs multi-agent debate is one of the more confused conversations in AI agent design, partly because both sides frame the question as architectural when it is really economic. The technical question is "what can multiple agents do that one cannot?" The practical question is "is the coordination cost worth what you gain?" The honest answer for most operator tasks is no.

Multi-agent research has real momentum, yet Anthropic's published guidance on building agents (Anthropic engineering, "Building Effective Agents") explicitly recommends starting with a single agent and graduating to multi-agent only when a single agent cannot meet the requirement. That framing matches what shows up in production: multi-agent systems are powerful but expensive to operate reliably, and their failure modes are harder to debug than single-agent failures.

The definitions, sharpened

A single-agent system uses one AI agent loop. The agent sees the goal, picks tools, executes, observes results, and either continues or stops. It owns the full task end-to-end. The internal complexity is in the agent's reasoning loop, the tool list, and the memory layer, but the architecture is one process.
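A minimal sketch of that loop; call_model, the tool registry, and the step budget are illustrative assumptions rather than any particular framework's API:

```python
# Minimal single-agent loop: one process owns the goal, the tools, and the stop decision.
# call_model and TOOLS are hypothetical stand-ins, not a specific framework's interface.

def call_model(goal: str, history: list) -> dict:
    """Placeholder LLM call: returns either a tool request or a final answer."""
    raise NotImplementedError

TOOLS = {
    "search": lambda query: f"results for {query}",
    "send_email": lambda to, body: f"sent to {to}",
}

def run_single_agent(goal: str, max_steps: int = 20) -> str:
    history = []
    for _ in range(max_steps):
        decision = call_model(goal, history)        # agent reasons over the full context
        if decision["type"] == "finish":
            return decision["answer"]               # visible success
        tool = TOOLS[decision["tool"]]
        observation = tool(**decision["args"])      # execute, observe
        history.append({"action": decision, "observation": observation})
    return "stopped: step budget exhausted"         # visible failure
```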

A multi-agent system uses two or more agents that exchange messages. The shapes vary: a planner-worker split (one agent decomposes, another executes), a critic loop (one agent generates, another reviews), a specialist team (researcher, writer, fact-checker), or a hierarchy (orchestrator at the top, sub-agents below). What unites them is that no single agent owns the task; coordination is part of the architecture.
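For contrast, a minimal planner-worker sketch under the same caveat: plan, execute_subtask, and the message shape are hypothetical, but they show that the handoff itself becomes code you own and can get wrong:

```python
# Minimal planner-worker split: the planner decomposes, workers execute,
# and the handoff message is an explicit artifact that can be lossy or wrong.

def plan(goal: str) -> list[dict]:
    """Placeholder planner call: returns subtask messages for workers."""
    raise NotImplementedError

def execute_subtask(message: dict) -> dict:
    """Placeholder worker call: consumes a handoff message, returns a partial result."""
    raise NotImplementedError

def run_planner_worker(goal: str) -> list[dict]:
    subtasks = plan(goal)                 # agent 1: decompose
    results = []
    for message in subtasks:              # agents 2..n: execute each slice
        # Every handoff re-sends context, intent, and constraints as tokens.
        results.append(execute_subtask(message))
    return results                        # no single agent ever saw the whole task
```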

The framing matters because it changes what failure looks like. In a single-agent system, the agent either succeeds or fails visibly. In a multi-agent system, the failure can be a silent disagreement: two agents continue to act on incompatible assumptions, and the failure surfaces only when the result lands.

When multi-agent helps

Three conditions make multi-agent worth the operational cost.

Context exceeds one window

If the task involves more context than fits in a single agent's window, splitting becomes necessary. Code reasoning across a large repository, document review across hundreds of pages, or research synthesis across many sources can overflow even modern context limits. A planner agent that decides what each sub-agent reads, plus sub-agents that work on their slice, can solve problems that one agent cannot fit into a single window.
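A sketch of that split, assuming hypothetical chunk_document, summarize_chunk, and synthesize calls and an illustrative chunk size; the structural point is that no single model call ever holds the full corpus:

```python
# Sketch: fan a large corpus out to sub-agents that each read only their slice,
# then synthesize over the compressed partial results.
# Chunk size and the three helper calls are illustrative assumptions.

def chunk_document(text: str, max_chars: int = 40_000) -> list[str]:
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

def summarize_chunk(chunk: str, question: str) -> str:
    """Placeholder sub-agent call: extract only what is relevant to the question."""
    raise NotImplementedError

def synthesize(question: str, partials: list[str]) -> str:
    """Placeholder planner call: reason over compressed partials, not the raw corpus."""
    raise NotImplementedError

def answer_over_large_corpus(question: str, documents: list[str]) -> str:
    partials = []
    for doc in documents:
        for chunk in chunk_document(doc):
            partials.append(summarize_chunk(chunk, question))
    return synthesize(question, partials)
```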

Role specialisation produces measurably better outputs

Some tasks benefit from a critic in the loop. A writer agent that drafts and a critic agent that scores against an explicit rubric can outperform a single agent told to "draft and self-review", because the critic operates without the writer's context and catches what the writer cannot. The same pattern applies to code generation with a separate test-runner agent. The key word is measurably: if the critic loop does not produce higher pass rates on a real test set, it is not worth the coordination cost.
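A sketch of what that looks like, with hypothetical draft, critique, and revise calls and an illustrative rubric; nothing here is a specific product's API:

```python
# Sketch: writer-critic loop with an explicit rubric and a bounded round budget.
# draft, critique, and revise are hypothetical model calls; the rubric is illustrative.

RUBRIC = {
    "answers_the_question": "Does the draft address the stated ask directly?",
    "no_unsupported_claims": "Is every claim backed by the provided sources?",
    "within_length": "Is the draft under the requested length?",
}

def draft(task: str) -> str: raise NotImplementedError
def critique(text: str, rubric: dict) -> dict: raise NotImplementedError  # {criterion: bool}
def revise(text: str, failed: list[str]) -> str: raise NotImplementedError

def writer_critic_loop(task: str, max_rounds: int = 3) -> str:
    text = draft(task)
    for _ in range(max_rounds):
        scores = critique(text, RUBRIC)      # critic sees the rubric, not the writer's context
        failed = [k for k, ok in scores.items() if not ok]
        if not failed:
            return text
        text = revise(text, failed)
    return text  # best effort after the round budget; log that the rubric was not met
```

Keeping max_rounds small is what stops the pair from looping indefinitely, one of the handoff failures discussed below.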

Parallel decomposition is clean

Some tasks split naturally. Enriching 500 leads in parallel does not require coordination beyond a queue; that is "many instances of one agent", not multi-agent in the meaningful sense. But a research task where one agent finds candidates and three sub-agents in parallel investigate each candidate can finish faster than one agent serially. The decomposition has to be clean for parallelism to help; if the sub-agents need to know about each other's findings, you are back to coordination.
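A sketch of the clean case, assuming hypothetical find_candidates and investigate calls; the only coordination point is the final gather:

```python
# Sketch: clean parallel decomposition with asyncio. The key property is that
# sub-agents never need each other's findings, so the fan-out stays independent.
import asyncio

async def find_candidates(query: str) -> list[str]:
    raise NotImplementedError  # planner agent: produce independent work items

async def investigate(candidate: str) -> dict:
    raise NotImplementedError  # sub-agent: investigate one candidate in isolation

async def parallel_research(query: str) -> list[dict]:
    candidates = await find_candidates(query)
    # If investigate() needed the other results, this gather would collapse into
    # sequential coordination and the speedup would evaporate.
    return await asyncio.gather(*(investigate(c) for c in candidates))
```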

When single-agent wins

For most operator tasks, single-agent is the right architecture. Sending follow-ups to leads, enriching contact data from public sources, scheduling, extracting structured data from documents, monitoring an inbox and routing items: all of these decompose poorly across multiple agents and benefit from one agent that holds the full context.

Single-agent wins on three properties that matter in production. First, debuggability: when something goes wrong, the trace is one sequence, not a graph of messages. Second, reliability under the 80-test methodology described in how we test AI agents: the failure surface is bounded by the agent's tool list, not by an unbounded coordination protocol. Third, cost: single-agent runs cost less per task because there are no inter-agent message tokens, and the cost model (covered in AI agent cost models explained) is more predictable.

The honest assessment from running three startups, captured in three startups, three shutdowns, is that buyers reward "this works reliably for what I asked" much more than "this has an impressive multi-agent architecture". Multi-agent is a means, not a feature.

The coordination cost, quantified

The coordination cost has four components, and each grows faster than the agent count.

Coordination cost components, single-agent vs 3-agent (relative to a single-agent baseline of 1x):
Token cost per task: 3.2x
Latency (p50): 2.7x
Failure modes: 4.0x
Test surface: 5.5x
Source: Gravity internal benchmarks across single-agent and 3-agent variants of the same task, May 2026. Indicative orders of magnitude.
Token, latency, failure-mode, and test-surface costs all grow faster than linearly with agent count.

Token cost

Inter-agent messages are model tokens. Every handoff carries context, intent, and partial results. A 3-agent pipeline routinely uses 3-5x the tokens of a single agent on the same task, even when the work itself is no bigger.
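A back-of-the-envelope sketch of where that multiplier comes from; every token count below is an illustrative assumption, not a measurement:

```python
# Illustrative arithmetic only: why handoffs inflate token cost.
# All token counts are assumptions for the sake of the estimate.

WORK_TOKENS = 6_000        # tokens the task itself needs (prompt + output), single-agent
HANDOFF_TOKENS = 2_500     # context + intent + partial results re-sent per handoff
AGENTS = 3
HANDOFFS = 3               # two forward handoffs plus one result return to the planner

single_agent = WORK_TOKENS
multi_agent = (
    WORK_TOKENS
    + HANDOFFS * HANDOFF_TOKENS
    + AGENTS * 1_000       # each agent also re-reads its own system prompt / tool schema
)

print(multi_agent / single_agent)  # ~2.75x before any retries or critic rounds
```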

Latency

Sequential agents add their latencies. Even when the agents run in parallel, the synchronisation point waits for the slowest. The end-user experience is dominated by the worst path through the agent graph.
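A small worked example with illustrative, assumed latencies:

```python
# Illustrative only: end-to-end latency is the longest path through the agent graph.
# The per-agent latencies below are assumptions.

planner_s = 4.0
workers_s = [6.0, 9.0, 5.5]   # intended to run in parallel
synthesis_s = 3.0

sequential = planner_s + sum(workers_s) + synthesis_s   # 27.5s if nothing is parallel
parallel = planner_s + max(workers_s) + synthesis_s     # 16.0s: still gated by the slowest worker
single_agent = 10.0                                     # same work in one loop (assumed)

print(sequential, parallel, single_agent)
```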

Failure-mode multiplication

Each agent has its own failure modes (the eight categories in the 80-test methodology). A multi-agent system has those failures plus handoff failures: lost context, duplicated work, infinite loops between critic and writer, agents that disagree silently. Failure modes do not add; they multiply.

Test surface growth

The test surface grows combinatorially. Two agents with eight failure categories each is not 16 categories; it is 8 + 8 plus the cross-product of handoff failures. Reliability targets that are achievable for a single agent become much harder for a multi-agent pipeline.
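A rough way to count it, treating each pair of agents' handoff surface as the cross-product of their failure categories; the model is an assumption, but it shows the shape of the growth:

```python
# Illustrative count of test-surface growth. Modeling each handoff's failure surface
# as the cross-product of the two agents' failure categories is an assumption,
# but it captures why the surface grows combinatorially rather than linearly.
from math import comb

CATEGORIES_PER_AGENT = 8

def test_surface(agents: int) -> int:
    own = agents * CATEGORIES_PER_AGENT
    handoff_pairs = comb(agents, 2)            # every pair of agents can mis-hand-off
    return own + handoff_pairs * CATEGORIES_PER_AGENT ** 2

for n in (1, 2, 3):
    print(n, test_surface(n))   # 1 -> 8, 2 -> 80, 3 -> 216
```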

A buyer-side rule of thumb

The rule that holds up in practice: start single-agent. Promote to multi-agent only when you can name the specific reason. "It feels more powerful" is not a reason. "The context exceeds one window" is. "The critic loop produces measurably higher pass rates on our test set" is. "The decomposition is clean and parallelism is the bottleneck" is.

This is the same discipline as the 10x check in the three checks I missed: complexity needs to clear a bar, not just exist. Multi-agent architectures that look impressive on a slide often underperform single-agent versions in production because the coordination cost was not in the slide.

For Gravity, the architecture is single-agent for the operator tasks the platform serves. The agent has access to a substantial tool set; the loop is one process; the test methodology is the eight-category gate. When a task class genuinely needs multi-agent (large-codebase reasoning, multi-document research synthesis), Gravity will graduate that capability rather than retrofit multi-agent into the default path. The principle is the same as in describe outcome, not workflow: keep the buyer's mental model simple; absorb complexity inside the agent only when it earns its keep.

Frequently asked questions

What is the difference between a single-agent and a multi-agent system?

A single-agent system uses one AI agent that owns the full task end-to-end, calling whatever tools it needs along the way. A multi-agent system uses two or more agents that pass work between each other, often with role specialisation. Multi-agent adds capability ceilings but multiplies coordination cost and failure modes.

When do you actually need a multi-agent system?

Multi-agent makes sense when one agent's context window cannot hold the full task, when role specialisation produces materially better outcomes, or when parallel work decomposes cleanly. For most operator tasks (sales follow-ups, lead enrichment, data extraction, scheduling) a single agent with the right tools is enough and more reliable.

What is the coordination cost in multi-agent systems?

Coordination cost includes the model tokens spent on inter-agent messages, the latency added by sequential handoffs, the failure modes that emerge at handoff boundaries (lost context, duplicated work, infinite loops), and the test surface that grows combinatorially with agent count. The cost compounds; two agents are not twice as expensive but five times as expensive to run reliably.

Do multi-agent systems perform better on benchmarks?

Mixed. On benchmarks like GAIA where tasks decompose into clear sub-skills, multi-agent systems can outperform single agents at the top of the leaderboard. On benchmarks like SWE-bench where context coherence matters more than parallelism, single-agent solutions often beat multi-agent. The benchmark-to-production gap is real; coordination failures show up in production long before any benchmark surfaces them.

How do multi-agent failures differ from single-agent failures?

Single-agent failures are usually visible: the agent stopped, took a wrong action, or refused. Multi-agent failures are often invisible until late: agents disagree silently, one drops context the other expected, or two agents both think the other is handling the next step. Debugging requires tracing the message flow, not just the final output.

Three takeaways before you close this tab

Sources