Whether AI agents "reason" is a debate that often misses the practical point: different reasoning patterns produce different reliability characteristics on different tasks. Chain-of-thought helps with multi-step arithmetic and logic. ReAct helps when tool calls interleave with reasoning. Tree-of-thought helps when the right next step is not obvious. Pattern matching, the model's default, handles tasks well-represented in training data. Knowing which pattern is in use matters more than knowing whether it counts as reasoning.

This post walks through the three patterns that dominate 2026 agent design, with citations to the original papers, and then lays out a per-task selection rubric. The vocabulary is settled: chain-of-thought from Wei et al. 2022, ReAct from Yao et al. 2022, tree-of-thought from Yao et al. 2023. The pragmatic question is which pattern survives the buyer's test.

The question, sharpened

"Reasoning" in the strict philosophical sense is contested. "Reasoning" in the practical AI sense is operational: producing intermediate steps that lead from input to a defensible answer. By the operational definition, modern LLMs do reason on tasks where chain-of-thought or ReAct prompts are used. They do not always reason in ways that survive scrutiny on novel inputs; that is where pattern matching takes over and where benchmarks like GAIA show the gap (Mialon et al., 2023).

The buyer-side question is not "does the agent reason?" but "what reasoning pattern is in use, on which steps, and how was reliability measured per pattern?" An agent that uses ReAct for tool-use steps and chain-of-thought for analytical steps has a different failure profile than one that uses pure pattern matching for everything. Both can be correct; both can be wrong; the discipline is knowing which is in play.

Chain-of-thought

Chain-of-thought (CoT) is the simplest reasoning pattern. The model produces intermediate reasoning steps before the final answer. Wei et al. introduced the technique in 2022 (arXiv:2201.11903); the original paper reported substantial accuracy gains on math word problems and multi-step reasoning benchmarks. The pattern is now standard in most agent prompts, often invoked via a "think step by step" instruction or built into the model behaviour itself.
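
A minimal sketch of the pattern in code, assuming a generic `complete(prompt) -> str` client rather than any particular SDK; the prompt shape, not the API, is the point:

```python
# Minimal chain-of-thought prompting. `complete` is a stand-in for
# whatever LLM client is in use; swap in your own.

COT_TEMPLATE = """Answer the question. Show your reasoning step by step,
then give the final answer on a line starting with "Answer:".

Question: {question}
"""

def chain_of_thought(complete, question: str) -> str:
    """Run one CoT pass; returns the raw completion (steps plus answer)."""
    return complete(COT_TEMPLATE.format(question=question))
```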

CoT works well on closed-form reasoning: math, logic puzzles, multi-step extraction from a single document. It works less well when reasoning needs to interact with the external world; on those tasks the chain becomes hypothetical and untethered. The classic CoT failure mode is the chain that "looks like reasoning" but is actually narrative-shaped pattern matching: the steps follow a familiar template, the final answer matches the template, the underlying logic is wrong.

The mitigation is verification: after the chain, the agent runs a separate check (rerun the calculation, query a tool, compare against a ground truth). The partial-results category in the 80-test methodology catches CoT-only failures where the chain looks complete but the answer is wrong; ten tests per capability run paraphrased inputs and check whether the conclusions stay consistent.
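
A sketch of what that verification can look like, continuing the hypothetical `chain_of_thought` helper above, with a task-specific `check` callback standing in for the rerun, tool query, or ground-truth comparison:

```python
import re

def extract_answer(completion: str) -> str | None:
    """Pull the final 'Answer:' line off a CoT completion."""
    match = re.search(r"^Answer:\s*(.+)$", completion, re.MULTILINE)
    return match.group(1).strip() if match else None

def verified_cot(complete, question: str, check) -> str | None:
    """Accept a CoT answer only if an independent check agrees.

    `check` is task-specific: rerun the calculation, query a tool, or
    compare against ground truth. Returning None forces a retry or
    escalation upstream instead of shipping an unverified chain.
    """
    answer = extract_answer(chain_of_thought(complete, question))
    if answer is not None and check(question, answer):
        return answer
    return None
```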

ReAct: reasoning plus acting

ReAct stands for Reasoning + Acting. Yao et al. introduced it in 2022 (arXiv:2210.03629). The pattern interleaves reasoning steps with tool calls: think, act, observe, think, act. The model writes a "thought" describing what it plans, then emits a tool call, then reads the tool result, then writes the next thought. The pattern is the default execution mode for most agent frameworks in 2026.
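
A stripped-down sketch of the loop, again assuming a generic `complete` client, plus a dict of callable tools and a hypothetical `Action: tool_name[input]` convention; real frameworks add schemas, retries, and structured tool-call parsing, but the think-act-observe shape is the same:

```python
import re

def parse_action(step: str) -> tuple[str, str]:
    """Parse the hypothetical 'Action: tool_name[input]' convention."""
    m = re.search(r"Action:\s*(\w+)\[(.*)\]", step)
    if m is None:
        raise ValueError(f"unparseable action: {step!r}")
    return m.group(1), m.group(2)

def react_loop(complete, tools: dict, task: str, max_steps: int = 10):
    """Think, act, observe, repeat until a final answer or step budget."""
    transcript = f"Task: {task}"
    for _ in range(max_steps):
        step = complete(transcript + "\nThought:")          # think
        transcript += f"\nThought: {step}"
        if "Final Answer:" in step:
            return step.split("Final Answer:", 1)[1].strip()
        if "Action:" in step:
            name, arg = parse_action(step)
            observation = tools[name](arg)                  # act
            transcript += f"\nObservation: {observation}"   # observe
    return None  # budget exhausted without an answer
```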

ReAct dominates production agent design because most useful agent work is tool-mediated. Pure CoT cannot send an email or read a CRM record; ReAct can. The thought-action-observation loop also produces useful logs: every step has a written rationale, which makes debugging tractable. The Anthropic engineering blog notes that ReAct-style logging is one of the cheapest reliability investments available (retrieved 2026-05-07).

ReAct's failure modes are inherited from its components. The reasoning side fails when the thought is hypothetical and the agent acts on the imagined state instead of the real one. The action side fails when tool selection or schema is wrong. The discipline is to log the thought, the action, and the observation separately and check that each matches reality. Tool use explained covers the action-side failure modes in depth.
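
One way to keep the three parts separately checkable is to log each iteration as a structured record rather than a flat transcript; a minimal shape, with field names that are illustrative rather than canonical:

```python
from dataclasses import dataclass

@dataclass
class ReActStep:
    """One think-act-observe iteration, with the parts kept separate so
    each can be audited on its own: did the thought match real state,
    was the tool call and schema right, was the observation read back
    correctly."""
    thought: str
    action: str | None       # tool name plus arguments; None on final step
    observation: str | None  # raw tool result; None on final step
```

Persisting one record per iteration is what makes the per-side failure analysis above cheap to run after the fact.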

Reasoning pattern fit by task type (qualitative)

Pattern         Task type          Fit
CoT             math/logic         strong
CoT             tool-mediated      weak
ReAct           tool-mediated      very strong
ReAct           pure reasoning     strong
ToT             hard novel steps   strong
ToT             routine steps      expensive
Pattern match   novel tasks        unreliable

Source: Aryan Agarwal, qualitative synthesis of CoT (Wei 2022), ReAct (Yao 2022), ToT (Yao 2023), 2026.
Each pattern has a sweet spot. Production agents typically combine ReAct with selective tree-of-thought on hard steps.

Tree-of-thought

Tree-of-thought (ToT) generalises chain-of-thought by exploring multiple reasoning paths and selecting the best one. Yao et al. introduced the pattern in 2023 (arXiv:2305.10601). At each step the model produces several candidate continuations, evaluates them against a heuristic, and explores the most promising. The pattern is essentially beam search applied to reasoning instead of token generation.
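
The beam-search framing translates almost directly to code. A sketch, where `propose` asks the model for candidate next steps and `score` is the heuristic evaluator (often another model call); both are assumptions about the surrounding stack, not a fixed API:

```python
def tree_of_thought(propose, score, problem: str,
                    beam_width: int = 3, branch: int = 3,
                    depth: int = 4) -> str:
    """Beam search over partial reasoning chains instead of tokens."""
    beam = [("", 0.0)]  # (partial chain, cumulative heuristic score)
    for _ in range(depth):
        candidates = []
        for chain, total in beam:
            # `propose` returns `branch` candidate next steps (assumed API).
            for step in propose(problem, chain, k=branch):
                new_chain = f"{chain}\n{step}"
                candidates.append(
                    (new_chain, total + score(problem, new_chain)))
        # Keep only the most promising paths; early pruning is the point.
        beam = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beam[0][0]  # best complete chain found
```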

ToT works on problems where the right next step is not obvious and pruning bad paths early is important: planning under constraints, creative tasks with quality criteria, hard logic puzzles. The cost is multiple inferences per step, which adds up fast. For routine work where ReAct produces correct results on the first try, ToT is wasteful. The 2026 production pattern: ReAct as default, ToT invoked selectively on steps flagged as hard by a router.

The router itself is a design choice. Common implementations: a small model classifies each step as easy or hard and dispatches accordingly; a confidence threshold on the ReAct output triggers a ToT fallback; a fixed list of step types always uses ToT. Each has trade-offs in cost and accuracy. The economic implications are documented in economics of bootstrapped AI agents; reasoning pattern choice is one of the largest cost levers.
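
The second option, a confidence threshold with a ToT fallback, is small enough to sketch; `react_solve`, `tot_solve`, and `confidence` are stand-ins for implementations like those above:

```python
def routed_solve(step: str, react_solve, tot_solve, confidence,
                 threshold: float = 0.8):
    """Default to cheap ReAct; escalate to tree-of-thought only when
    the confidence estimate on the ReAct answer falls below threshold."""
    answer = react_solve(step)
    if confidence(step, answer) >= threshold:
        return answer
    return tot_solve(step)  # pay the ToT cost only on hard steps
```

The threshold is the cost lever: raising it trades inference spend for reliability on borderline steps.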

Choosing a pattern per task

The selection rubric:

- ReAct as default for tool-mediated tasks (most production work).
- Chain-of-thought when the task is closed-form reasoning with no tool calls.
- Tree-of-thought when a step is genuinely hard and the cost of a wrong answer is high.
- Pattern matching alone for tasks well-represented in training data where speed matters more than reliability (auto-complete, formatting, simple classification).
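
Encoded as a dispatch table, with task-type labels that are illustrative rather than canonical (production routers usually classify individual steps, not whole tasks):

```python
PATTERN_BY_TASK = {
    "tool_mediated": "react",                # default for production work
    "closed_form_reasoning": "cot",          # math, logic, single-doc extraction
    "hard_high_stakes_step": "tot",          # wrong answers are expensive
    "familiar_low_stakes": "pattern_match",  # speed over reliability
}

def choose_pattern(task_type: str) -> str:
    """Map a task type to a reasoning pattern; ReAct is the fallback."""
    return PATTERN_BY_TASK.get(task_type, "react")
```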

The buyer-side question is direct: which pattern is in use on the steps that matter, and how was reliability measured? A vendor that says "the model just figures it out" is using pattern matching with no fallback. A vendor that can describe ReAct logging, the router for ToT escalation, and the per-pattern reliability numbers has done the work. The 80-test methodology is the operational expression of this discipline at Gravity.

The framework also informs build-vs-buy: building a competitive reasoning stack from scratch is expensive (the patterns are public, the prompt engineering and per-step routing are not). For most buyers, this is one of the strongest arguments to buy rather than build, covered in build vs buy.

Frequently asked questions

Do AI agents actually reason or just pattern match?

Both, depending on the task. Modern agents combine pattern recognition (the LLM's strength on familiar inputs) with explicit reasoning patterns (chain-of-thought, ReAct, tree-of-thought). On tasks well-represented in training data, pattern matching dominates and works well. On novel multi-step tasks, explicit reasoning patterns improve outcomes; benchmarks like GAIA show the gap clearly.

What is chain-of-thought reasoning?

Chain-of-thought is a pattern where the model produces intermediate reasoning steps before the final answer. Wei et al. introduced the technique in 2022; the original paper showed substantial accuracy gains on math word problems (arXiv:2201.11903). The pattern is now standard in most agent prompts, often invoked via a "think step by step" instruction or built into the model.

What is ReAct?

ReAct stands for Reasoning + Acting. Yao et al. introduced it in 2022 (arXiv:2210.03629). The pattern interleaves reasoning steps with tool calls: the model thinks, then acts, then observes the result, then thinks again. ReAct is the default execution pattern for most agent frameworks in 2026 because it handles tool use more reliably than pure chain-of-thought.

What is tree-of-thought?

Tree-of-thought, introduced by Yao et al. in 2023 (arXiv:2305.10601), generalises chain-of-thought by exploring multiple reasoning paths and selecting the best one. The pattern works for problems where the right next step is not obvious; it costs more inference because multiple branches are evaluated. Useful for hard reasoning steps; expensive for routine ones.

Why does the reasoning vs pattern matching distinction matter?

It determines where the agent will fail. Pattern matching fails on novel inputs that look familiar but require different reasoning. Reasoning-heavy patterns fail when the chain-of-thought wanders or hallucinates intermediate steps. Buyers should ask vendors which reasoning pattern is in use and how it was tested; the 80-test methodology covers both modes.

Three takeaways before you close this tab

1. ReAct is the production default because most useful agent work is tool-mediated; chain-of-thought covers closed-form reasoning; tree-of-thought is a selective escalation for hard steps, not a default.
2. "Does it reason?" is the wrong question. Ask which pattern is in use on which steps, and how reliability was measured per pattern.
3. The patterns are public; the prompt engineering and per-step routing are not. That gap is the core of the build-vs-buy argument.

Sources

Wei, J., et al. (2022). "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models." arXiv:2201.11903.
Yao, S., et al. (2022). "ReAct: Synergizing Reasoning and Acting in Language Models." arXiv:2210.03629.
Yao, S., et al. (2023). "Tree of Thoughts: Deliberate Problem Solving with Large Language Models." arXiv:2305.10601.
Mialon, G., et al. (2023). "GAIA: A Benchmark for General AI Assistants." arXiv:2311.12983.
Anthropic engineering blog, retrieved 2026-05-07.