How many retries should an AI agent attempt?

Three is the practical default for transport-level retries with exponential backoff. Higher counts amplify outages without adding success probability. Tool-level retries should also cap at three with an idempotency key. Model-level fallback (Opus to Sonnet to Haiku) is a single cascade, not a loop.

What is exponential backoff with jitter?

A retry delay that doubles after each attempt with a randomized component to prevent synchronized retry storms. AWS recommends full jitter where each delay is a uniformly random value between 0 and the current exponential ceiling. Base 100ms, cap 30s is a safe default.

When should I NOT retry an AI agent call?

Never retry irreversible side effects without an idempotency key. Never retry content-policy refusals (the model means it). Never retry validation errors (4xx other than 429 and 408). Never retry past a budget; the next call has the same odds the last call did.

What is an idempotency key for an AI agent?

A unique identifier the agent passes to a tool so the tool can detect duplicates. Stripe popularized the pattern; the agent sends the same key on the original call and any retry. The tool de-duplicates server-side, so retries are safe even for money-moving operations.

How does a model fallback cascade work?

Try the preferred model first (typically the highest-quality tier). On failure or rate limit, fall to the next-tier model (often a cheaper sibling). If all model providers fail, fall to a cached or templated response. The cascade is a single sequence per request, not a loop.

AI Agent Fallback and Retry: A 2026 Playbook for Idempotency, Backoff, and Model Cascades

Naive retries amplify outages; smart retries absorb them. The Google SRE book defines a retry budget so retries can't exceed a fixed fraction of normal load (Google SRE, ch. 22). For AI agents the same logic applies in three layers: transport retry, tool retry, and model fallback. Wrap each with a circuit breaker. Never retry irreversible side effects without idempotency. Detect poisoned pills with a max-attempt-per-input cap.

Defaults to copy and adjust later: three retries, 100ms base delay, 30s cap, full jitter, retry budget at 10 percent of normal QPS. The backoff and jitter choices follow AWS's guidance for client retries (AWS Architecture, 2015), and Stripe's idempotency pattern still defines the bar for safe write retries (Stripe API, 2024).

The three layers of retry

Most agent code retries at one layer and calls it done. Production needs all three.

Transport retry: covers HTTP-level failures (5xx, timeouts, network errors). Exponential backoff with jitter, three attempts, cap 30 seconds.
Tool retry: covers tool-call failures (rate limits, transient business errors, "try again"). Idempotency key required for any write.
Model fallback: covers model-provider failures (rate limit, content policy refusal, hard timeout). Cascade to a different model tier, then to a cached or templated response.

Exponential backoff and jitter, the math

The math: ceiling = min(cap, base × 2^attempt), then draw the actual delay at random from a jitter range under that ceiling. AWS's "Exponential Backoff and Jitter" recommends full jitter, where each delay is a uniformly random value between 0 and the current exponential ceiling. A common mistake is to apply a fixed plus-or-minus percentage to the deterministic delay, which still leaves a synchronized wavefront. Full jitter de-synchronizes the herd.

Base 100ms, cap 30s, full jitter is the safe default for AI provider clients.
Max three attempts on transport-level retry. Higher counts add cost, not success.
Different jitter window per region if you fan out across regions. Otherwise regions sync up.
Honor Retry-After headers when the provider returns one. Anthropic and OpenAI both send them on 429.
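
A minimal Python sketch of the full-jitter calculation and the Retry-After rule above. The function names `full_jitter_delay` and `next_delay` are illustrative, not from any SDK, and the sketch assumes Retry-After arrives as a plain number of seconds (the header can also be an HTTP date, which is not parsed here).

```python
import random
from typing import Optional


def full_jitter_delay(attempt: int, base: float = 0.1, cap: float = 30.0,
                      rng: Optional[random.Random] = None) -> float:
    """Seconds to sleep before retry number `attempt` (0-based), using full jitter."""
    source = rng if rng is not None else random
    ceiling = min(cap, base * (2 ** attempt))  # exponential ceiling, capped
    return source.uniform(0.0, ceiling)        # anywhere between 0 and the ceiling


def next_delay(attempt: int, retry_after: Optional[str] = None,
               rng: Optional[random.Random] = None) -> float:
    """Prefer the provider's Retry-After value (seconds form only) when present."""
    if retry_after is not None:
        try:
            return max(0.0, float(retry_after))
        except ValueError:
            pass  # not a plain number of seconds; fall back to computed backoff
    return full_jitter_delay(attempt, rng=rng)
```

Passing an explicit `rng` keeps the function testable; in production the module-level generator is usually enough.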

Idempotency keys, the single best lever for safe tool retries

Stripe popularized the pattern: a unique key the client sends with the original request and any retry. The server de-duplicates on the key, so retries are safe even for money-moving operations. The agent should generate the key once per tool invocation (UUIDv4 or hashed inputs) and pass it on every retry of that invocation, never a fresh key per attempt.

Key per logical tool invocation, not per HTTP attempt. Same key across all retries of the same call.
Key lifetime ≥ retry window. If you may retry an hour later, the server must remember the key an hour later.
Key in the request body, not just headers. Some proxies strip headers; bodies survive.
Idempotent-by-default tools refuse without a key. Better to fail loudly than silently double-charge.
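
The rules above can be sketched in a few lines of Python. `InMemoryDedupingTool` is a hypothetical stand-in for the server side, and `invoke_with_retries` is an illustrative client helper; neither is a real library API. A real server would also expire stored keys after the retry window rather than keep them forever.

```python
import uuid
from typing import Any, Callable, Dict


class InMemoryDedupingTool:
    """Hypothetical stand-in for a server that de-duplicates on the key."""

    def __init__(self, handler: Callable[[Dict[str, Any]], Any]) -> None:
        self._handler = handler
        self._results: Dict[str, Any] = {}
        self.executions = 0

    def call(self, body: Dict[str, Any]) -> Any:
        key = body.get("idempotency_key")
        if not key:
            # Fail loudly instead of risking a silent double write.
            raise ValueError("write refused: idempotency_key missing")
        if key in self._results:          # duplicate: replay the stored result
            return self._results[key]
        self.executions += 1
        result = self._handler(body)
        self._results[key] = result
        return result


def invoke_with_retries(tool: Any, payload: Dict[str, Any], max_attempts: int = 3) -> Any:
    # One key per logical invocation, carried in the body and reused on every attempt.
    body = {**payload, "idempotency_key": str(uuid.uuid4())}
    last_error: Exception = RuntimeError("no attempts made")
    for _ in range(max_attempts):
        try:
            return tool.call(body)
        except ConnectionError as exc:    # transient transport failure
            last_error = exc
    raise last_error
```

The case this protects against is the lost response: the write succeeded, the client never heard back, and the retry arrives with the same key.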

Circuit breakers: when to stop retrying entirely

A circuit breaker tracks recent failures per downstream. When failure rate crosses a threshold within a window, the breaker opens: subsequent calls fail fast without hitting the downstream. After a cooldown, the breaker enters half-open and admits a probe call to test recovery. Hystrix-style three-state breakers (closed, open, half-open) are the canonical pattern; resilience4j and Polly carry the design forward in JVM and .NET respectively (resilience4j, 2024).

Per-downstream breaker, not a global one. Anthropic failing should not stop OpenAI calls.
Window-based, not absolute count: 30 percent failure in 60 seconds, not "10 failures".
Cooldown 10 to 30 seconds before half-open probe.
Open breaker triggers the fallback, not just a 503 to the caller.
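
A rough Python sketch of a three-state, window-based breaker with the thresholds listed above. The class and its parameters are illustrative, not the resilience4j or Polly API. The `min_calls` floor is an added assumption so a single early failure does not trip the breaker.

```python
import time
from collections import deque
from typing import Callable, Deque, Tuple


class CircuitBreaker:
    def __init__(self, failure_rate: float = 0.30, window_s: float = 60.0,
                 cooldown_s: float = 20.0, min_calls: int = 10,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_rate = failure_rate
        self.window_s = window_s
        self.cooldown_s = cooldown_s
        self.min_calls = min_calls          # assumption: not from the article
        self.clock = clock
        self.state = "closed"
        self._events: Deque[Tuple[float, bool]] = deque()  # (timestamp, succeeded)
        self._opened_at = 0.0

    def allow(self) -> bool:
        """True if the caller may hit the downstream right now."""
        if self.state == "open":
            if self.clock() - self._opened_at >= self.cooldown_s:
                self.state = "half_open"    # admit exactly one probe
                return True
            return False
        if self.state == "half_open":
            return False                    # a probe is already in flight
        return True

    def record(self, succeeded: bool) -> None:
        """Report the outcome of a call that allow() admitted."""
        now = self.clock()
        if self.state == "open":
            return                          # late results from before the trip
        if self.state == "half_open":
            if succeeded:
                self.state = "closed"
                self._events.clear()
            else:
                self.state = "open"
                self._opened_at = now
            return
        self._events.append((now, succeeded))
        while self._events and now - self._events[0][0] > self.window_s:
            self._events.popleft()
        failures = sum(1 for _, ok in self._events if not ok)
        if (len(self._events) >= self.min_calls
                and failures / len(self._events) >= self.failure_rate):
            self.state = "open"
            self._opened_at = now
```

Keep one instance per downstream, for example in a dict keyed by provider name, and route to the fallback whenever `allow()` returns False.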

Model fallback cascades: Opus to Sonnet to Haiku to cached

The model cascade is the most agent-specific retry pattern. Try the preferred model. On failure (rate limit, content policy, hard timeout, breaker open), fall to the next tier. Last resort: a cached prior response, a templated response, or "I cannot help right now, here is the human escalation path". The cascade is a single sequence per request, not a loop. Each tier records its own outcome in the trace so post-mortems can attribute incidents correctly.

Primary tier: the model you would pick if everything worked.
Secondary tier: a cheaper sibling from the same vendor (Sonnet, GPT-4o-mini).
Tertiary tier: a different vendor (provider redundancy beats family redundancy on real outages).
Quaternary tier: cached / templated / safe-fail response.
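
As a sketch, the cascade is one pass over an ordered list of tiers, with a trace entry per tier. The names `TierFailed` and `run_cascade` are hypothetical; each tier is assumed to be a callable that wraps a provider client and raises `TierFailed` on the failure modes listed above.

```python
from typing import Callable, List, Sequence, Tuple


class TierFailed(Exception):
    """Raised by a tier on rate limit, content policy, hard timeout, or open breaker."""


def run_cascade(prompt: str,
                tiers: Sequence[Tuple[str, Callable[[str], str]]],
                last_resort: Callable[[str], str]) -> Tuple[str, List[Tuple[str, str]]]:
    """Try each tier once, in order. Returns (answer, trace)."""
    trace: List[Tuple[str, str]] = []
    for name, call in tiers:
        try:
            answer = call(prompt)
        except TierFailed as exc:
            trace.append((name, f"failed: {exc}"))   # record, then move down a tier
            continue
        trace.append((name, "ok"))
        return answer, trace
    # Every model tier failed: cached, templated, or safe-fail response.
    trace.append(("last_resort", "ok"))
    return last_resort(prompt), trace
```

There is no loop back to the top: a request walks the list once and ends at the last resort.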

Retry budgets: Google SRE's lever against retry storms

A retry budget caps the system-wide retry rate as a fraction of normal request volume. 10 percent is the SRE-book default. If retries exceed the budget, the client refuses to retry until normal rate resumes. The budget is the difference between a blip that recovers and a retry storm that DDoSes your own provider.
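
One way to sketch a client-side budget in Python is a sliding window that compares retries to first-attempt requests. `RetryBudget` is an illustrative class, and the 60-second window is an assumption; the article only fixes the 10 percent ratio.

```python
import time
from collections import deque
from typing import Callable, Deque


class RetryBudget:
    def __init__(self, ratio: float = 0.10, window_s: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ratio = ratio
        self.window_s = window_s            # assumption: window length is not specified
        self.clock = clock
        self._requests: Deque[float] = deque()
        self._retries: Deque[float] = deque()

    def _trim(self, events: Deque[float], now: float) -> None:
        while events and now - events[0] > self.window_s:
            events.popleft()

    def record_request(self) -> None:
        """Call once per first attempt (not per retry)."""
        self._requests.append(self.clock())

    def try_acquire_retry(self) -> bool:
        """True if one more retry keeps retries within ratio * requests."""
        now = self.clock()
        self._trim(self._requests, now)
        self._trim(self._retries, now)
        if len(self._retries) + 1 > self.ratio * len(self._requests):
            return False                    # budget exhausted: fail fast or fall back
        self._retries.append(now)
        return True
```

A low-traffic client gets almost no retries under a pure ratio, so a real implementation would likely add a small fixed floor.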

When NOT to retry

Irreversible side effects without idempotency. Money sent, email sent, public post.
Content-policy refusals. The model means it. Retrying the same prompt against the same model does not change the answer; rewrite, fall to the next cascade tier, or escalate.
Validation errors (4xx other than 429 and 408). The request is malformed; retrying does not unmalform it.
Past a retry budget. The next call has the same odds the last one did.
"User is the bug" cases. The input is malformed at the user layer; surface the error, do not loop.

Poisoned-pill detection: when one bad input keeps trying forever

A "poisoned pill" is an input that the agent will keep retrying because the failure looks transient but is permanent. The defense is a per-input attempt counter that survives across runs. After three lifetime attempts on the same input ID, the input is quarantined and a human is notified. Without this, a single broken input can dominate the retry budget for the whole fleet.
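
A small sketch of a counter that survives across runs, using SQLite from the Python standard library. `AttemptLedger`, the table layout, and the file name are all illustrative assumptions; any durable store works. Notifying a human is left to the caller.

```python
import sqlite3


class AttemptLedger:
    """Lifetime attempt counter per input ID, persisted on disk."""

    def __init__(self, path: str = "attempts.db", max_attempts: int = 3) -> None:
        self.max_attempts = max_attempts
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS attempts ("
            "input_id TEXT PRIMARY KEY, used INTEGER NOT NULL)"
        )
        self._db.commit()

    def begin_attempt(self, input_id: str) -> bool:
        """Record one attempt. Returns False if the input is quarantined."""
        row = self._db.execute(
            "SELECT used FROM attempts WHERE input_id = ?", (input_id,)
        ).fetchone()
        used = row[0] if row else 0
        if used >= self.max_attempts:
            return False                    # quarantine: stop and notify a human
        self._db.execute(
            "INSERT OR REPLACE INTO attempts (input_id, used) VALUES (?, ?)",
            (input_id, used + 1),
        )
        self._db.commit()
        return True
```

Call `begin_attempt` before processing an input, including on restarts, so a crash loop counts against the same budget.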

Retries-mask vs retries-cure

The question every quarter: is the retry policy hiding a real reliability problem? Track retry rate per downstream. If the rate is rising and success rate is stable, the retries are working as a Band-Aid over a degraded downstream. Fix the downstream, do not crank the retry count. The metric to watch is "fraction of successful responses that required ≥1 retry". If that climbs, something underneath is sick.
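
The metric itself is a one-liner. A sketch, assuming each logical request is logged as a `(succeeded, retries)` pair; the function name and record shape are illustrative.

```python
from typing import Iterable, Tuple


def retried_success_fraction(outcomes: Iterable[Tuple[bool, int]]) -> float:
    """Fraction of successful requests that needed at least one retry."""
    retries_on_success = [retries for succeeded, retries in outcomes if succeeded]
    if not retries_on_success:
        return 0.0
    needed_retry = sum(1 for retries in retries_on_success if retries >= 1)
    return needed_retry / len(retries_on_success)
```

Compute it per downstream and per time bucket; the trend matters more than any single value.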

Durable execution and queued retries

When retry windows exceed the request lifetime, move retries out of the request path entirely. Durable execution frameworks (Temporal, AWS Step Functions, Cloudflare Durable Objects + queues) take the work, persist it, and replay it across hours or days with retry policies attached. For agents handling long-running workflows like onboarding sequences and multi-step approvals, this is the difference between resilient and brittle.

Common production patterns we see at scale

Four patterns that recur across teams that get retries right. First, retries are configured per capability, not per agent. A "send email" capability has aggressive idempotency and zero transport retries (the SMTP provider already retries); a "search the web" capability has loose retries with backoff. Burying both behind the same policy is a common cause of duplicate emails plus dropped web searches.
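
A per-capability policy table might look like the following sketch. The `RetryPolicy` fields and the capability names are assumptions chosen to mirror the two examples above, not a real configuration schema.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class RetryPolicy:
    transport_retries: int
    base_delay_s: float = 0.1
    max_delay_s: float = 30.0
    require_idempotency_key: bool = False


POLICIES: Dict[str, RetryPolicy] = {
    # The mail provider already retries delivery, so the agent does not.
    "send_email": RetryPolicy(transport_retries=0, require_idempotency_key=True),
    # Read-only and cheap to repeat: loose retries with backoff.
    "web_search": RetryPolicy(transport_retries=3),
}


def policy_for(capability: str) -> RetryPolicy:
    """Look up the policy; an unknown capability is an error, not a default."""
    try:
        return POLICIES[capability]
    except KeyError:
        raise LookupError(f"no retry policy for capability {capability!r}") from None
```

Refusing to fall back to a shared default forces every new capability to state its own retry economics.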

Second, retry telemetry is its own dashboard. Track four metrics: retry rate per downstream, retry success rate, time-to-first-success after retry, and percent of successful responses that required at least one retry. Without these four metrics, retries silently mask a degrading dependency until it falls off the cliff during a peak hour. The cliff event is then attributed to "the load" when the slope was visible weeks earlier in the retry telemetry.

Third, the model fallback cascade is exercised, not just configured. Once a quarter, a planned drill takes the primary model offline and the team watches the cascade route through secondary and tertiary tiers. The first drill always finds something broken: a missing API key for the secondary, a stale prompt schema for the tertiary, a templated fallback that no longer matches the new product copy. The drill is cheaper than the real outage.

Fourth, retries always log a unique retry_reason. "Rate limited" is different from "5xx" is different from "content policy" is different from "tool returned 422". Aggregating retries by reason makes the dominant cause obvious in 30 seconds; without the label, an engineer spends 30 minutes guessing.
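
A sketch of what the label and the aggregation could look like in Python. The enum values mirror the four reasons named above; the event shape and function names are illustrative.

```python
from collections import Counter
from enum import Enum
from typing import Any, Dict, Iterable, Optional


class RetryReason(str, Enum):
    RATE_LIMITED = "rate_limited"
    SERVER_ERROR = "5xx"
    CONTENT_POLICY = "content_policy"
    TOOL_VALIDATION = "tool_422"


def retry_event(downstream: str, reason: RetryReason, attempt: int) -> Dict[str, Any]:
    """Build one structured log record per retry."""
    return {
        "event": "retry",
        "downstream": downstream,
        "retry_reason": reason.value,
        "attempt": attempt,
    }


def dominant_reason(events: Iterable[Dict[str, Any]]) -> Optional[str]:
    """Most frequent retry_reason across the given events, or None if empty."""
    counts = Counter(event["retry_reason"] for event in events)
    if not counts:
        return None
    return counts.most_common(1)[0][0]
```

Using an enum rather than free text keeps the set of reasons small enough to group by.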

Anti-patterns to avoid

Infinite retry loops with no budget. The classic "we will just try again" approach that turns a brief degradation into a billing event.
Same backoff defaults for all tools. A search API and a payment API have different retry economics. Treat them differently.
Retrying on 4xx. 4xx errors are not transient. The next attempt fails the same way. Exception: 429 (rate limit) and 408 (timeout) are retry-safe with backoff.
Idempotency key per HTTP attempt. The whole point of the key is that it survives retries. Generating a fresh key per attempt defeats the de-duplication.
Retry policies inside the LLM prompt. "If this fails, try again" in natural language is not enforcement. The platform enforces; the prompt asks nicely.
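
Enforcement in code can be as small as one gate that every retry path calls. A sketch in Python, encoding the rules from the when-not-to-retry list and the 4xx exception above; the function name and parameters are illustrative, and `status=None` is assumed to mean a network error or timeout with no HTTP response.

```python
from typing import Optional


def should_retry(status: Optional[int] = None, *,
                 is_write: bool = False,
                 has_idempotency_key: bool = False,
                 content_policy_refusal: bool = False,
                 budget_available: bool = True) -> bool:
    """Return True only when a retry is both safe and plausibly useful."""
    if content_policy_refusal:
        return False          # same prompt, same answer: rewrite or escalate
    if not budget_available:
        return False          # past the retry budget
    if is_write and not has_idempotency_key:
        return False          # irreversible side effect without de-duplication
    if status is None:
        return True           # no HTTP response: network error or timeout
    if status in (408, 429):
        return True           # the two retry-safe 4xx codes, with backoff
    if 400 <= status < 500:
        return False          # validation error: the request itself is wrong
    return status >= 500      # server errors are treated as transient
```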

FAQ

How many retries should an AI agent attempt?: Three at transport, three at tool with an idempotency key, one cascade at model. Higher counts add cost, not success.
What is exponential backoff with jitter?: Doubling delays with a randomized component. AWS recommends full jitter: each delay uniformly random between 0 and the current ceiling. Base 100ms, cap 30s.
When should I NOT retry?: Irreversible writes without idempotency, content-policy refusals, validation errors, past a budget, or "user is the bug" cases.
What is an idempotency key?: A unique key the agent passes on the original call and every retry; the server de-duplicates server-side. Same key across retries, not a fresh one per attempt.
How does a model fallback cascade work?: Try the preferred model; on failure, fall to a cheaper or different-vendor sibling; last resort is a cached or templated response.

Closing the loop

Retries are infrastructure. They belong in a library and a policy file, not scattered across agent code. Pick defaults, write the policy down, instrument the retry rate, and review the policy quarterly. Related: error handling and rollback, blast radius control, and agent monitoring.

Sources

Google SRE Book, "Addressing Cascading Failures" (Ch. 22), sre.google
AWS Architecture, "Exponential Backoff and Jitter", aws.amazon.com
Stripe, "Idempotent requests", stripe.com/docs
resilience4j, "Circuit breaker", resilience4j.readme.io
Anthropic, "Error handling", 2025, docs.anthropic.com
OpenAI, "Rate limits and retries", 2025, platform.openai.com