Tool use is what separates a chatbot from an agent. A chatbot talks about sending the email; an agent calls the email-send tool and watches for the result. The mechanism under tool use is function calling, standardised by OpenAI in 2023 and adopted by Anthropic, Google, and most open-weight providers. The protocol is settled; the failure modes are not. Most agent reliability work in 2026 lives in tool-use error handling.
This post walks through the four-stage lifecycle of a tool call (selection, schema, execution, parsing), the failure modes at each stage, and the catalogue-design choices that make a tool-using agent reliable. The vocabulary draws from OpenAI's function calling docs (retrieved 2026-05-07) and Anthropic's tool use guidance (retrieved 2026-05-07).
What tool use actually is
Tool use is the agent's ability to call external functions or APIs to act on the world. The agent receives a tool catalogue at the start of each step (names, descriptions, argument schemas), conditions on the current goal and state, and emits a structured payload: tool name plus arguments. The runtime executes the call, returns the result, and the agent reads the result on the next inference. The loop repeats until the task is done or the agent escalates.
The capability is implemented through function calling on most modern platforms. OpenAI's function calling, Anthropic's tool use, Google's function declarations, and open-weight equivalents all expose the same shape: declare the available tools as schemas, the model emits one or more calls per turn, the runtime handles execution. The protocol details vary; the pattern is identical (OpenAI, Anthropic).
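To make the shape concrete, here is a minimal, provider-agnostic sketch of one round trip. `call_model`, `send_email`, and the reply fields are illustrative assumptions, not any vendor's exact wire format:

```python
# Hypothetical sketch of one function-calling round trip.
# call_model() and send_email() are stand-ins, not a real SDK.

TOOLS = [{
    "name": "send_email",
    "description": "Send a follow-up email to a single recipient.",
    "parameters": {  # JSON Schema for the arguments
        "type": "object",
        "properties": {
            "to": {"type": "string"},
            "subject": {"type": "string"},
            "body": {"type": "string"},
        },
        "required": ["to", "subject", "body"],
    },
}]

def step(messages):
    reply = call_model(messages, tools=TOOLS)    # model conditions on goal + catalogue
    if reply.tool_name == "send_email":          # model emitted a structured call
        result = send_email(**reply.arguments)   # the runtime executes, not the model
        messages.append({"role": "tool", "name": reply.tool_name, "content": result})
    return messages                              # result feeds the next inference
```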
The four-stage lifecycle of a tool call
- Selection. Given the goal, current state, and a catalogue of available tools, the model picks one. Selection failures: choosing the wrong tool, choosing none when one is needed, choosing two when one is required.
- Schema construction. The model formats the arguments according to the tool's schema. Failures: missing required fields, wrong types, hallucinated fields.
- Execution. The runtime calls the tool. Failures: 5xx errors, timeouts, rate limits, network errors, downstream API outages.
- Parsing. The model reads the response and extracts the relevant facts for the next step. Failures: schema drift (response changed shape), partial parses, hallucinated extractions.
Each stage has its own failure modes. The reliability rule: instrument all four stages independently. A single "tool use failed" log line is useless for debugging; "tool selection chose the wrong tool", "tool execution timed out", and "parsing missed a field" point to entirely different fixes.
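A minimal sketch of per-stage instrumentation, assuming a plain `logging` setup; the stage names mirror the lifecycle above:

```python
import json
import logging
from enum import Enum

log = logging.getLogger("tool_calls")

class Stage(str, Enum):
    SELECTION = "selection"
    SCHEMA = "schema"
    EXECUTION = "execution"
    PARSING = "parsing"

def record_failure(stage: Stage, tool: str, detail: str) -> None:
    # One structured line per stage, so "selection chose the wrong tool"
    # and "execution timed out" stay distinguishable at debug time.
    log.error(json.dumps({"stage": stage.value, "tool": tool, "detail": detail}))

# record_failure(Stage.EXECUTION, "send_email", "timeout after 30s")
```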
How agents select the right tool
Selection works by conditioning the model on a tool catalogue. The catalogue is a list: each entry has a name, a natural-language description, and a JSON schema for arguments. The model reads the catalogue alongside the goal and current state, then emits the chosen tool name plus arguments. Selection accuracy depends on the descriptions: vague descriptions ("does email things") produce wrong selections; specific descriptions ("send a follow-up email to a single recipient given email address, subject, body") produce correct ones.
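A hypothetical pair of catalogue entries makes the contrast concrete; only the description differs:

```python
# Two versions of the same hypothetical entry; selection accuracy
# tends to follow the description quality.

vague = {
    "name": "email_tool",
    "description": "does email things",  # which things? when? the model must guess
}

specific = {
    "name": "send_followup_email",
    "description": (
        "Send a follow-up email to a single recipient. "
        "Requires email address, subject, and body. "
        "Returns a message id on success. Use after a lead has replied."
    ),
}
```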
Selection accuracy degrades as the catalogue grows. Empirically, around 20 to 30 tools is the breakpoint where selection starts to suffer noticeably; past 50, selection accuracy drops sharply unless the catalogue is filtered before being shown to the model. The two common filtering patterns: hierarchical (tools grouped by category, the agent picks a category first then a tool); retrieval-based (the agent retrieves relevant tools by similarity to the current step).
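A sketch of the retrieval-based pattern, assuming a hypothetical `embed` helper that wraps whatever embedding model is available; the top-k cutoff is the tunable part:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def filter_catalogue(step_text: str, catalogue: list[dict], k: int = 20) -> list[dict]:
    # embed() is an assumed helper that maps text to a vector.
    query = embed(step_text)
    scored = [(cosine(query, embed(t["description"])), t) for t in catalogue]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [tool for _, tool in scored[:k]]  # only the top-k tools reach the prompt
```

In practice the description embeddings would be computed once and cached, not re-embedded per step; the sketch inlines them for brevity.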
The Anthropic engineering blog notes that catalogue size is one of the strongest determinants of agent reliability past the prototype stage (Building Effective Agents, retrieved 2026-05-07). The pragmatic rule: keep the per-step catalogue under 25 tools; use retrieval or hierarchy for anything larger.
Error handling: retry, replan, escalate
The default agent error-handling pattern is three-tier: retry with exponential backoff for transient errors (5xx, timeouts, rate limits); replan when the error indicates a schema mismatch or wrong-tool selection; escalate to a human when the agent has exhausted retries and replans without progress. Each tier has its own threshold; a typical configuration is three retries, two replans, then escalation.
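A hedged sketch of the three tiers with those thresholds; `execute`, `replan`, and `escalate` are assumed callables supplied by the runtime, and the result object's `ok`/`error` fields are illustrative:

```python
import time

TRANSIENT = {"timeout", "rate_limited", "server_error"}  # 5xx-style errors

def run_with_recovery(plan, execute, replan, escalate,
                      max_retries: int = 3, max_replans: int = 2):
    for replan_round in range(max_replans + 1):
        for attempt in range(max_retries):
            result = execute(plan)
            if result.ok:
                return result
            if result.error not in TRANSIENT:
                break                        # schema mismatch / wrong tool: replan
            time.sleep(2 ** attempt)         # tier one: backoff on transient errors
        if replan_round < max_replans:
            plan = replan(plan, result.error)  # tier two: build a new plan
    return escalate(plan)                      # tier three: hand off to a human
```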
The most common failure is not having any of these tiers. The agent calls the tool, gets a 500, and stops. OWASP Top 10 for LLM Applications lists insecure output handling and excessive agency as related categories: an agent without error handling is also an agent more likely to follow hostile instructions or take unsafe actions on partial information.
Idempotency is the second-order concern. If the agent retries a tool call and the first call actually succeeded but the response was lost, the retry can double-execute. For tools with real-world side effects (payments, emails, writes), idempotency keys are non-negotiable. The 80-test methodology weights idempotency at 20 percent (the highest of the eight categories) precisely because the cost of failure is large.
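One way to mint such keys, as a sketch: derive the key from the logical action rather than the attempt, so a retry after a lost response reuses the same key and the downstream system can deduplicate. The `step_id` parameter is an assumed plan-step identifier:

```python
import hashlib
import json

def idempotency_key(step_id: str, tool: str, arguments: dict) -> str:
    # Key the logical action (plan step + tool + args), not the attempt,
    # so a retry after a lost response carries the same key and the
    # downstream system can detect the double-execution.
    canonical = json.dumps({"step": step_id, "tool": tool, "args": arguments},
                           sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()

# Pass the key on every attempt; the tool treats repeated keys as one call.
```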
Tool catalogue design
The catalogue is a product. Each tool is a small contract between the agent and a downstream system. Good catalogues share four properties: descriptions are operational (what the tool does, what it returns, when to use it); schemas are strict (no optional fields where required ones would do); side effects are documented (which tools mutate state); idempotency keys are first-class (every mutating tool accepts one).
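A hypothetical entry that exercises all four properties at once:

```python
# Illustrative catalogue entry; field names beyond JSON Schema are assumptions.
send_email_tool = {
    "name": "send_followup_email",
    # Operational description: what it does, what it returns, when to use it.
    "description": ("Send a follow-up email to one recipient. "
                    "Returns a message id. Use after a lead replies."),
    "side_effects": True,                    # documented mutation
    "parameters": {
        "type": "object",
        "additionalProperties": False,       # strict: reject hallucinated fields
        "properties": {
            "to": {"type": "string", "format": "email"},
            "subject": {"type": "string"},
            "body": {"type": "string"},
            "idempotency_key": {"type": "string"},  # first-class on mutating tools
        },
        "required": ["to", "subject", "body", "idempotency_key"],
    },
}
```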
The catalogue also defines the blast radius. An agent with a tool that can send any email to any recipient has a larger blast radius than one with a tool that can only email leads in the CRM. Constraining tool scope is the cheapest way to limit damage from selection or hostile-input errors. NIST AI RMF treats this as a core risk dimension; the practical implementation is per-tool permission boundaries (NIST AI RMF, retrieved 2026-05-07).
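A sketch of a per-tool permission boundary under that framing; the CRM check and the `execute_tool` dispatcher are assumed helpers:

```python
# Hypothetical scope table: which recipient source each tool may touch.
ALLOWED_RECIPIENT_SOURCES = {"send_followup_email": "crm_leads"}

def guarded_execute(tool: str, arguments: dict, crm):
    source = ALLOWED_RECIPIENT_SOURCES.get(tool)
    if source == "crm_leads" and not crm.is_lead(arguments["to"]):
        # Constrain the blast radius: the tool can only email known leads,
        # even when selection errs or a hostile input asks for more.
        raise PermissionError(f"{tool} may only email CRM leads")
    return execute_tool(tool, arguments)  # execute_tool is the runtime's dispatcher
```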
For Gravity specifically, the tool catalogue is treated the same way an API surface is treated in any other product: versioned, tested, deprecated when superseded, monitored in production. The reliability discipline in the 80-test methodology stresses the catalogue the same way the runtime is stressed; a tool that passes 80 tests in isolation can still fail when called from a complex multi-step plan.
Frequently asked questions
What is tool use in AI agents?
Tool use is the agent's ability to call external functions or APIs to act on the world. The agent selects the right tool from a catalogue, formats arguments according to a schema, sends the call, and parses the response. Tool use is the difference between a model that talks and an agent that acts; it is implemented through function calling on most modern platforms.
What is function calling?
Function calling is the mechanism by which an LLM emits a structured payload describing the tool to call and the arguments. OpenAI standardised the pattern in 2023, and Anthropic, Google, and most open-weight providers now support equivalent interfaces. Function calling is the plumbing under tool use; the LLM emits, the runtime executes, the response goes back as input.
How does an AI agent select the right tool?
The agent gets a tool catalogue at the start of each turn (names, descriptions, argument schemas). The model conditions on the goal, the current state, and the catalogue, and emits a tool name and arguments. Selection accuracy degrades as the catalogue grows past 20 to 30 tools; large catalogues need retrieval-based filtering or hierarchical organisation.
What happens when a tool call fails?
A robust agent has explicit error-recovery loops: retry with backoff for transient errors, replan on schema mismatches, escalate to a human on irrecoverable errors. A weak agent surfaces the error to the user and stops. Tool failure is one of eight categories in the Gravity 80-test methodology; ten tests per capability stress this exact path.
What is the difference between tool use and function calling?
Tool use is the capability; function calling is the mechanism. An agent uses tools by emitting function calls, the runtime executes them, results come back as inputs to the next inference. Function calling is the protocol; tool use is the design pattern. Vendor lock-in lives in function-calling specifics; tool-use design transfers across providers.
Three takeaways before you close this tab
- Four stages, four failure modes. Selection, schema, execution, parsing.
- Catalogue design is the leverage point. Past 25 tools, filter or fail.
- Idempotency is non-negotiable for mutating tools. The cost of double-execution is the cost of trust.
Sources
- OpenAI, "Function Calling Guide", retrieved 2026-05-07, platform.openai.com/docs/guides/function-calling
- Anthropic, "Tool Use Documentation", retrieved 2026-05-07, docs.anthropic.com/en/docs/build-with-claude/tool-use
- Anthropic, "Building Effective Agents", retrieved 2026-05-07, anthropic.com/engineering/building-effective-agents
- OWASP, "Top 10 for LLM Applications", retrieved 2026-05-07, owasp.org
- NIST, "AI Risk Management Framework", retrieved 2026-05-07, nist.gov/itl/ai-risk-management-framework