An agent that handles errors well looks identical to one that does not, right up until the day a tool returns a 500 in the middle of a refund chain and the customer ends up with access revoked but no money back. The difference between the two agents shows up as a customer-visible failure mode, not as a metric on a dashboard. This piece is the playbook for not getting the call.

Error handling for agents borrows heavily from distributed-systems practice (sagas, idempotency, circuit breakers) and adds two agent-native concerns: bounded retry budgets that account for token cost, and re-planning as a recovery mechanism rather than retry. Both extensions are required because agents are non-deterministic and cost real money per call.

Error classes you must distinguish

Treating all errors with the same retry policy is the failure mode I have most often watched in production agents. Classify first, then apply the class policy.

Transient network errors

Connection reset, DNS hiccup, gateway timeout. Retry with exponential backoff and jitter; expect resolution within 1-3 attempts. Standard distributed-systems practice (AWS Builders' Library, Exponential Backoff and Jitter).

Model refusals

The model returned a structured refusal or violated the output schema. Retrying the same prompt likely produces the same refusal. Re-plan: change the approach, simplify the task, or escalate to a larger model.

Tool 4xx errors

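Bad request, unauthorised, not found. These are usually logic errors, and retrying the same request does not fix them. Halt and escalate, or have the agent re-plan with corrected arguments. The exceptions are 408 (request timeout) and 429 (too many requests), which are worth retrying with backoff.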

Tool 5xx errors

Server error from the tool. Retry once with backoff; if still failing, treat as a transient outage and either pause the agent or fall back to a degraded path.

Schema validation failures

The tool returned 200 but the response shape did not match expectations. Common when an upstream API changes silently. Log the drift, alert, and either abort or treat as if the call failed (do not pass malformed data downstream).
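The classification above can be sketched as a small lookup. This is a minimal illustration, not a definitive implementation: the `classify` function, its parameters, and the policy labels are all hypothetical names, and treating 408, 429 and 504 as transient is an assumption made for this sketch.

```python
from enum import Enum


class ErrorClass(Enum):
    TRANSIENT = "transient"
    MODEL_REFUSAL = "model_refusal"
    TOOL_4XX = "tool_4xx"
    TOOL_5XX = "tool_5xx"
    SCHEMA_DRIFT = "schema_drift"


# One policy per class. The policy names are hypothetical labels for
# whatever the agent runtime does next; they are not a real framework API.
POLICY = {
    ErrorClass.TRANSIENT: "retry_with_backoff",
    ErrorClass.MODEL_REFUSAL: "replan",
    ErrorClass.TOOL_4XX: "halt_or_replan",
    ErrorClass.TOOL_5XX: "retry_once_then_degrade",
    ErrorClass.SCHEMA_DRIFT: "alert_and_abort",
}

# Assumption: these statuses usually clear on a later attempt, so they are
# grouped with transient network errors rather than with their 4xx/5xx class.
TRANSIENT_STATUSES = {408, 429, 504}


def classify(status=None, network_error=False, refused=False, schema_ok=True):
    """Return the ErrorClass for one step outcome, or None if the step succeeded."""
    if network_error or status in TRANSIENT_STATUSES:
        return ErrorClass.TRANSIENT
    if refused:
        return ErrorClass.MODEL_REFUSAL
    if status is not None and 400 <= status < 500:
        return ErrorClass.TOOL_4XX
    if status is not None and 500 <= status < 600:
        return ErrorClass.TOOL_5XX
    if not schema_ok:
        return ErrorClass.SCHEMA_DRIFT
    return None
```

Keeping the policy in a table rather than in scattered `if` branches makes it obvious when a new error class has no policy yet.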

Idempotency keys

Every non-idempotent tool call requires an idempotency key. The key is generated once per logical operation and sent with the request. If the request fails or times out, retrying with the same key is safe: if the first request succeeded, the server recognises the key and returns the original result instead of performing the operation again.

Stripe formalised idempotency keys for payment APIs and the pattern has become standard practice for tool design (Stripe idempotency documentation). For agent design: every tool that writes (charge, send, create, update, delete) must accept an idempotency key. If a third-party tool does not, wrap it in a client-side dedup layer.
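A client-side dedup layer for a tool without native key support might look like the sketch below. `DedupWrapper` and `send_invoice` are hypothetical names, and the in-memory dictionary stands in for what would need to be a durable store shared across agent processes.

```python
import uuid


class DedupWrapper:
    """Client-side dedup for a tool that has no native idempotency-key support."""

    def __init__(self, tool_fn):
        self._tool_fn = tool_fn
        self._results = {}  # idempotency key -> result of the first successful call

    def call(self, idempotency_key, *args, **kwargs):
        if idempotency_key in self._results:
            # Already succeeded once: return the stored result, no second side effect.
            return self._results[idempotency_key]
        result = self._tool_fn(*args, **kwargs)
        # Only successes are recorded; an exception leaves the key free for a retry.
        self._results[idempotency_key] = result
        return result


sent = []


def send_invoice(customer_id):
    # Hypothetical non-idempotent tool: every call has a side effect.
    sent.append(customer_id)
    return {"invoice_for": customer_id, "sequence": len(sent)}


tool = DedupWrapper(send_invoice)
key = str(uuid.uuid4())            # generated once per logical operation
first = tool.call(key, "cus_123")
retry = tool.call(key, "cus_123")  # same key, so no second invoice is sent
```

A wrapper like this only protects retries made after a response was recorded. When a call times out and the outcome is unknown, the agent still has to query the tool for the operation's state before retrying.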

Retry policy and budget

Retries cost money. Each LLM call to re-plan after a failure adds tokens. Each tool retry adds an API call. The retry budget is therefore part of the run budget; uncapped retries are how you turn one failure into a thousand-dollar bill.
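One way to make the budget explicit is a small counter that every retry must draw from. This is a sketch under assumptions: `RetryBudget`, `BudgetExceeded`, and the dollar figures are illustrative, and the per-retry cost estimate would come from your own token and API pricing.

```python
class BudgetExceeded(Exception):
    """Raised when a retry would push the run past its cap."""


class RetryBudget:
    """Caps both the number of retries and their estimated cost for one run."""

    def __init__(self, max_retries, max_cost_usd):
        self.max_retries = max_retries
        self.max_cost_usd = max_cost_usd
        self.retries_used = 0
        self.cost_used_usd = 0.0

    def spend(self, estimated_cost_usd):
        """Reserve one retry; raise instead of letting the run exceed a cap."""
        if self.retries_used + 1 > self.max_retries:
            raise BudgetExceeded("retry count cap reached")
        if self.cost_used_usd + estimated_cost_usd > self.max_cost_usd:
            raise BudgetExceeded("retry cost cap reached")
        self.retries_used += 1
        self.cost_used_usd += estimated_cost_usd


budget = RetryBudget(max_retries=3, max_cost_usd=0.50)
budget.spend(0.20)  # first re-plan
budget.spend(0.20)  # second re-plan; a third at 0.20 would raise BudgetExceeded
```

Calling `spend` before the retry, not after, means the run stops before the money is spent rather than reporting the overrun afterwards.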

Retry budget at three layers

Exponential backoff with jitter

Between retries, sleep for base · 2^attempt + jitter. Without jitter, simultaneous failures across many runs synchronise into a thundering herd that DDoSes the recovering tool. AWS Builders' Library has the canonical analysis (AWS, Timeouts, Retries, and Backoff).
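The formula translates into a few lines of Python. The sketch below is one possible reading of it: the `cap` on the exponential term, the choice of `ConnectionError` as the retryable exception, and the default values are assumptions, not part of the formula above.

```python
import random
import time


def backoff_delay(attempt, base=0.5, cap=30.0, rng=random):
    """Delay before retry number `attempt` (0-based): base * 2**attempt plus jitter."""
    exponential = min(cap, base * (2 ** attempt))
    # Random offset in [0, base] so parallel runs do not retry in lockstep.
    jitter = rng.uniform(0, base)
    return exponential + jitter


def retry_with_backoff(fn, max_attempts=3, sleep=time.sleep, **delay_kwargs):
    """Call fn until it succeeds or max_attempts is reached; re-raise the last error."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise
            sleep(backoff_delay(attempt, **delay_kwargs))
```

The `sleep` parameter is injectable so the retry loop can be tested without waiting in real time.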

Saga pattern for tool chains

The saga pattern (Garcia-Molina and Salem, 1987) is the right primitive for agent rollback. A saga is a sequence of local steps; each step has a compensating action; if step N fails, the orchestrator runs compensations for steps N-1 down to 1 in reverse order (Garcia-Molina and Salem, Sagas, ACM SIGMOD 1987).

Worked example

An agent processes a new customer: (1) create CRM record, (2) provision Stripe customer, (3) send welcome email, (4) grant access. Compensating actions: (1) archive CRM record, (2) delete Stripe customer, (3) send retraction email, (4) revoke access. If step 4 fails, the agent runs compensations 3, 2, 1, in that order. The customer is left as close as possible to the state before the run.
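The onboarding example can be sketched as a tiny orchestrator. The `Saga` class and the step functions are hypothetical; each step only appends to a log so the order of actions and compensations is visible, and step 4 is forced to fail.

```python
class Saga:
    """Runs steps in order; on failure, compensates completed steps in reverse."""

    def __init__(self):
        self._steps = []  # list of (action, compensation) pairs

    def add_step(self, action, compensation):
        self._steps.append((action, compensation))

    def run(self):
        completed = []  # compensations for the steps that have succeeded so far
        for action, compensation in self._steps:
            try:
                action()
            except Exception:
                # Step N failed: undo steps N-1 down to 1.
                for undo in reversed(completed):
                    undo()
                return False
            completed.append(compensation)
        return True


log = []


def grant_access():
    # Simulated failure of step 4.
    raise RuntimeError("access service returned 500")


saga = Saga()
saga.add_step(lambda: log.append("create CRM record"),
              lambda: log.append("archive CRM record"))
saga.add_step(lambda: log.append("provision Stripe customer"),
              lambda: log.append("delete Stripe customer"))
saga.add_step(lambda: log.append("send welcome email"),
              lambda: log.append("send retraction email"))
saga.add_step(grant_access, lambda: log.append("revoke access"))

ok = saga.run()
```

The sketch does not handle a compensation that itself fails. A production orchestrator would need to persist progress and retry or escalate in that case.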

Where sagas fail

Some side effects cannot be reversed. The welcome email already left the building. The compensating action is then an apology or a follow-up correction, not a true undo. Document which steps are reversible and which require apologetic compensation.

Circuit breakers and bulkheads

If a downstream tool is failing, the worst thing the agent can do is keep calling it. The circuit breaker pattern (Nygard, Release It!, 2007) wraps the call: a failure rate above threshold opens the breaker; subsequent calls fail fast; after a cool-down the breaker half-opens and probes; success closes it again (Martin Fowler, CircuitBreaker).
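A minimal breaker with the three states described above could look like this. It is a simplification: it trips on a count of consecutive failures rather than a failure rate, and `CircuitBreaker`, `CircuitOpenError`, and the default thresholds are names and values assumed for illustration.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected without running."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, cooldown_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self._clock = clock
        self._failures = 0       # consecutive failures since the last success
        self._opened_at = None   # None means the breaker is closed

    @property
    def state(self):
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.cooldown_s:
            return "half-open"
        return "open"

    def call(self, fn):
        if self.state == "open":
            raise CircuitOpenError("failing fast; tool is cooling down")
        try:
            result = fn()
        except Exception:
            self._failures += 1
            # A failed half-open probe, or too many failures while closed, opens it.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the breaker and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

The clock is injectable so the cool-down can be exercised in tests without sleeping. One breaker instance per downstream tool keeps a failing tool from tripping calls to healthy ones.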

Bulkheads isolate resource pools so one failing tool does not exhaust the agent runtime. A pool of N workers per tool prevents a slow tool from blocking faster ones. Useful at scale; less critical for single-agent deployments.
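A per-tool pool can be approximated with one semaphore per tool. In this sketch, `Bulkhead`, `BulkheadFull`, and the tool names are hypothetical, and the choice to reject immediately when a pool is full (rather than queue) is an assumption.

```python
import threading


class BulkheadFull(Exception):
    """Raised when a tool's pool has no free slot."""


class Bulkhead:
    """Caps concurrent calls per tool so one slow tool cannot take every worker."""

    def __init__(self, limits):
        # limits maps a tool name to its maximum number of concurrent calls.
        self._slots = {tool: threading.BoundedSemaphore(n) for tool, n in limits.items()}

    def call(self, tool, fn):
        slot = self._slots[tool]
        # Non-blocking acquire: reject immediately instead of queueing behind a slow tool.
        if not slot.acquire(blocking=False):
            raise BulkheadFull(f"no free slot for {tool}")
        try:
            return fn()
        finally:
            slot.release()


bulkhead = Bulkhead({"slow_tool": 1, "fast_tool": 4})
```

With these limits, a single hung `slow_tool` call exhausts only its own pool; calls to `fast_tool` keep their four slots.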

Human-in-the-loop fallback

For irreversible or high-stakes actions, the fallback is not "retry" or "compensate". It is "ask a human". The agent stages the action, the human reviews, the human approves the commit. For the security view see AI agent security best practices; for the guardrail design see AI agent safety and guardrails.

The escalation path must be specific: which Slack channel, which on-call, which time-to-acknowledge SLA. Vague escalation is worse than no escalation.

Concrete worked examples

Stripe webhook double-fire

Stripe occasionally retries webhooks if it does not see a 2xx response in time. Without an idempotency key, an agent processing a payment webhook may credit the customer twice. Solution: dedup on the Stripe event id at the agent's edge.
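Dedup at the edge can be as small as a set of seen event ids. The sketch below assumes the event is a dictionary carrying its id under the `"id"` key, as Stripe event objects do; `WebhookDeduper`, the sample payload, and the in-memory set are illustrative stand-ins for a durable store with a uniqueness constraint.

```python
class WebhookDeduper:
    """Processes each webhook event id at most once."""

    def __init__(self):
        self._seen = set()  # event ids that have been processed successfully

    def handle(self, event, process):
        event_id = event["id"]
        if event_id in self._seen:
            return "duplicate_ignored"
        process(event)
        # Record the id only after processing succeeds, so a failed attempt can be redelivered.
        self._seen.add(event_id)
        return "processed"


credits = []
deduper = WebhookDeduper()
event = {"id": "evt_example_1", "amount": 2000}  # hypothetical payload shape

first = deduper.handle(event, lambda e: credits.append(e["amount"]))
second = deduper.handle(event, lambda e: credits.append(e["amount"]))  # redelivery
```

Two deliveries of the same event arriving concurrently could still both pass the check in this version. A database insert with a unique key on the event id closes that gap.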

Email send failure mid-batch

The agent sends 200 welcome emails; SendGrid returns 500 on email 150. The agent retries email 150, which succeeds, and continues with 151-200. It does not resend 1-149 because they already succeeded. Idempotency keys per recipient make this safe.
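A resumable batch along these lines is sketched below. The key derivation from a batch id plus recipient, the `send` callable, and the `delivered` set are assumptions made for illustration; no real SendGrid API is called. Unlike the scenario above, this version finishes the batch first and retries failures on a second pass.

```python
import hashlib


def idempotency_key(batch_id, recipient):
    """Deterministic key: the same batch and recipient always map to the same key."""
    return hashlib.sha256(f"{batch_id}:{recipient}".encode()).hexdigest()


def send_batch(batch_id, recipients, send, delivered):
    """Send to each recipient not yet delivered; return the recipients that failed."""
    failed = []
    for recipient in recipients:
        key = idempotency_key(batch_id, recipient)
        if key in delivered:
            continue  # succeeded in an earlier pass, so skip it
        try:
            send(recipient, key)
        except RuntimeError:
            failed.append(recipient)
            continue
        delivered.add(key)
    return failed
```

Calling `send_batch` again with the same `delivered` set resends only the recipients that failed, which is what makes a blind re-run of the whole batch safe.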

Slack approver on PTO

The agent waits for human approval; the approver is on PTO; the run hangs. Solution: timeout on the wait, escalate to a secondary approver, log the original approver's miss. For the broader workflow-break view see why I bet against workflow platforms in 2026.
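The timeout-and-escalate flow can be sketched as a loop over an ordered approver list. Everything here is hypothetical: `wait_for_approval`, the `poll` callable (which is assumed to block for up to `timeout_s` and return `None` on timeout), and the approver names.

```python
def wait_for_approval(approvers, poll, timeout_s, log):
    """Ask approvers in order; move to the next one when the wait times out."""
    for approver in approvers:
        # poll returns "approved", "rejected", or None if the approver did not answer in time.
        decision = poll(approver, timeout_s)
        if decision is not None:
            return approver, decision
        log.append(f"{approver} did not respond within {timeout_s}s")
    # Nobody answered: the caller should treat the action as not approved.
    return None, "unanswered"


misses = []
answers = {"primary": None, "secondary": "approved"}  # primary is on PTO

who, decision = wait_for_approval(
    ["primary", "secondary"],
    lambda approver, timeout_s: answers[approver],
    timeout_s=900,
    log=misses,
)
```

Returning `"unanswered"` rather than defaulting to approval keeps the failure mode safe: a run with no reachable approver stops instead of committing.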

Frequently asked questions

How do you handle errors in an AI agent?

Classify the error class, apply the policy for the class, bound retries with a budget. Transient errors get backoff; refusals get re-plan; 4xx errors escalate; 5xx errors retry once.

Can an AI agent roll back its actions?

Yes, using the saga pattern. Each tool call registers a compensating action at commit time. Failure triggers reverse-order compensation. Some effects are irreversible; the compensating action is then an apology.

Should AI agents retry failed actions?

Selectively. Idempotent operations are safe to retry. Non-idempotent operations require idempotency keys. Total retries are capped by a budget.

What is the saga pattern for AI agents?

A long-running transaction made up of local steps, each with a compensating action. Failure of step N runs compensations for N-1 through 1 in reverse.

What is a circuit breaker for an AI agent?

A guard that stops calling a downstream tool when its failure rate crosses a threshold. It opens, cools down, half-opens, probes, and closes on success.

Three things to ship this week

Sources