A fallback chain is an ordered list of backup steps an AI agent runs when its first approach fails. The pattern is primary, then secondary, then tertiary: try the preferred method, and if a defined failure condition is met, move to an alternate tool, then to an alternate model or a human handoff. It differs from a simple retry because each level is a genuinely different approach, not the same call repeated. To set one up you define the ordered steps, the conditions that trigger a move to the next level, a limit so the chain cannot loop forever, and a log that records which branch actually ran.
This guide covers what a fallback chain is, how it differs from retrying, how to define each level, how to detect the failures that should trigger a fallback, how to order and bound the chain, and how to log the branch so you can tell whether your primary path is healthy.
What a fallback chain is
An agent takes actions to reach a goal: it calls a tool, reads a result, reasons about it, and decides what to do next. Any of those actions can fail. The tool can be down, the input can be missing, the model can refuse or return something unusable. A fallback chain is the agent's answer to "what do I do when the way I wanted to do this does not work?"
The chain is ordered by preference. The primary step is the cheapest, fastest, or most accurate method, the one you want to run most of the time. Each lower level is a backup you would rather not use, but which still produces an acceptable outcome. The last level is almost always a graceful exit: hand the task to a person, or stop and report the failure clearly, rather than guess and return something wrong.
This is a form of graceful degradation. The agent does not collapse the moment its preferred path breaks; it steps down to a less ideal but still functional path, and only stops when every defined option is exhausted. Done well, the user often does not notice that anything went wrong, because the backup quietly delivered a usable result. For the broader picture of how agents recover from problems, the post on agent fallback and retry covers the family of recovery patterns this fits into.
Fallback versus a simple retry
A retry and a fallback look similar from the outside, since both run after something fails, but they solve different problems and confusing them is a common source of fragile agents.
A retry repeats the exact same step, with the same tool and the same input, on the assumption that the failure was transient. A network blip, a rate limit, a momentary timeout: these often clear on a second or third attempt. Retries usually pair with a short backoff, waiting a little longer between attempts so a struggling service has room to recover.
A fallback changes the approach. It assumes that doing the same thing again will fail the same way, so it switches to a different tool, a different model, or a different strategy entirely. You reach for a fallback when the failure is structural rather than transient: a missing permission, an empty result, a refusal, a consistently low-confidence answer.
The two compose. A robust step usually looks like a few retries first, and only if those retries are spent does the agent move down the fallback chain. Retry handles flaky infrastructure; fallback handles an approach that does not work. Mixing them up leads to one of two failure modes: retrying a permission error forever, which wastes time and money, or falling back instantly on a transient blip, which abandons the good primary path far too early.
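As a minimal sketch of how retry and fallback compose, the Python below retries each approach a few times with exponential backoff on a transient failure and moves to the next approach on a structural one. The exception classes, the function name, and the default values are hypothetical, chosen for illustration rather than taken from any specific framework.

```python
import time
from typing import Callable, Sequence


class TransientError(Exception):
    """Hypothetical: a failure that may clear on its own (rate limit, brief timeout)."""


class StructuralError(Exception):
    """Hypothetical: a failure that repeating the same call will not fix."""


def run_with_retry_then_fallback(
    approaches: Sequence[Callable[[], str]],
    max_retries: int = 2,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Try each approach in order; retry transient failures, fall back on the rest."""
    for approach in approaches:
        for attempt in range(max_retries + 1):
            try:
                return approach()
            except TransientError:
                # Same call again, waiting longer each time, until retries are spent.
                if attempt < max_retries:
                    sleep(base_delay * 2 ** attempt)
            except StructuralError:
                # Repeating would fail the same way: leave this approach immediately.
                break
    raise RuntimeError("all approaches exhausted")
```

The `sleep` parameter is injectable only so the backoff can be skipped in tests; in this sketch a permission-style error never consumes a retry, and a transient blip never abandons the primary on the first failure.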
Defining the primary, secondary, and tertiary steps
A practical fallback chain has three levels. More than that rarely helps, for reasons covered in the ordering section below. Define each level so it is meaningfully different from the one above it.
- Primary, the preferred path: your best method. This is the tool or model you trust most for the job: the specialized API, the fastest provider, the most accurate retrieval source. Most requests should finish here.
- Secondary, an alternate tool or source: a different way to get the same outcome. If the primary is a paid enrichment API, the secondary might be a free public lookup. If the primary is a specific document store, the secondary might be a web search. The point is a different dependency, so it does not fail for the same reason the primary did.
- Tertiary, an alternate model or a human handoff: if a different tool still does not produce an acceptable result, change who or what is reasoning over the problem. That can mean a more capable model on a hard reasoning task, or, when the task genuinely needs judgment the agent cannot supply, a handoff to a person.
The human handoff deserves its own note. It is not a failure of the design; it is the design working. An agent that knows when to stop and ask a human is safer than one that pushes through on a guess. The mechanics of pausing for a person, surfacing the right context, and resuming after their decision are covered in how to add human in the loop to an agent. The principle that each step should be scoped to what the agent is actually allowed to do connects to agent tool use, and if your agent can reach several tools, deciding which is primary and which is the alternate is part of giving an agent multiple tool access.
Detecting a failure that should trigger fallback
A fallback chain is only as good as its ability to notice failure. If the agent cannot tell that a step failed, it never moves to the backup, and a wrong answer sails straight through. There are four signals worth detecting explicitly.
- Errors: the most obvious signal. A tool returns a 4xx or 5xx status, throws an exception, or returns a structured error object. Distinguish them: a 429 or 503 suggests a retry, while a 401, 403, or 404 is structural and should fall back. Do not treat every error the same way.
- Empty results: a quieter failure. The call succeeds with a 200 status but returns nothing useful: an empty list, a null field, a record with no match. If a result was required for the next step, an empty result is a failure even though no error was thrown, and it should trigger a fallback to a different source.
- Low confidence: the model produces an answer but signals it is unsure, or your own check scores the answer below a threshold. A retrieval step that returns documents with low relevance scores, or a classification with a low probability, should drop to a stronger method rather than pass a shaky answer downstream. Grounding the output against a source helps catch this class, as the post on agent safety and guardrails describes.
- Timeouts: the step does not finish within a bound you set. A hard per-step timeout protects the whole chain from hanging on one slow dependency. After retries within that budget are spent, a timeout becomes a fallback trigger.
Make these checks explicit in the workflow, not implicit in the model's judgment. An agent that only "decides" something went wrong by reasoning over the output will miss failures that look superficially fine. A coded check on the status, the result shape, and the confidence score catches them reliably. For tracing exactly where a tool step breaks so you can write the right detection, the guide on debugging agent tool errors is the practical companion to this section.
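One way to make the four signals explicit is a small coded check that runs on every step result. The sketch below is illustrative only: the `classify` function, the `Action` enum, and the 0.7 confidence threshold are assumptions, and treating every error status other than 429 and 503 as structural is a simplification you would adjust per tool.

```python
from enum import Enum
from typing import Any, Optional


class Action(Enum):
    ACCEPT = "accept"
    RETRY = "retry"
    FALLBACK = "fallback"


RETRYABLE_STATUS = {429, 503}  # transient signals named in the text above


def is_empty(result: Any) -> bool:
    """A call can succeed and still return nothing useful."""
    return result is None or result == [] or result == {} or result == ""


def classify(
    status: Optional[int],
    result: Any,
    confidence: Optional[float],
    timed_out: bool = False,
    retries_left: int = 0,
    min_confidence: float = 0.7,  # assumed threshold, tune for your task
) -> Action:
    # Timeouts and transient statuses retry while retries remain, then fall back.
    if timed_out or status in RETRYABLE_STATUS:
        return Action.RETRY if retries_left > 0 else Action.FALLBACK
    # Assumption: any other error status (401, 403, 404, ...) is structural.
    if status is not None and status >= 400:
        return Action.FALLBACK
    # A 200 with nothing in it is still a failure when a result was required.
    if is_empty(result):
        return Action.FALLBACK
    # A shaky answer should not pass downstream.
    if confidence is not None and confidence < min_confidence:
        return Action.FALLBACK
    return Action.ACCEPT
```

Because the decision is a plain function of status, result shape, and score, it does not depend on the model noticing that something looks wrong.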
Ordering the chain so each level is different
Order matters, and the single most important rule is that each fallback level should fail for a different reason than the one above it. If your secondary depends on the same API, the same credentials, or the same data source as the primary, it will go down at the same moment and the fallback buys you nothing.
Rank the levels by cost and quality, best first, but choose them for independence. A strong chain looks like this: a fast specialized tool, then a different provider or a general tool, then a different model or a human. Each rung shares as little as possible with the rung above. A weak chain is the same provider with a slightly different endpoint, which collapses together under the very outage you built the chain to survive.
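The independence rule can be checked mechanically if you write down what each level depends on. The sketch below assumes you maintain that list by hand; the function name and the dependency labels are hypothetical.

```python
from itertools import combinations
from typing import Dict, Set, Tuple


def shared_dependencies(
    levels: Dict[str, Set[str]],
) -> Dict[Tuple[str, str], Set[str]]:
    """Return every pair of levels that shares at least one dependency."""
    overlaps = {}
    for (name_a, deps_a), (name_b, deps_b) in combinations(levels.items(), 2):
        common = deps_a & deps_b
        if common:
            overlaps[(name_a, name_b)] = common
    return overlaps


# A weak chain: same provider and credentials behind a different endpoint.
weak_chain = {
    "primary": {"provider_a", "api_key_a"},
    "secondary": {"provider_a", "api_key_a"},
}

# A stronger chain: each rung relies on something different.
strong_chain = {
    "primary": {"provider_a", "api_key_a"},
    "secondary": {"web_search"},
    "tertiary": {"human_queue"},
}
```

Here `shared_dependencies(weak_chain)` reports an overlap between the two levels, while `shared_dependencies(strong_chain)` returns an empty dict. Any reported overlap is a shared point of failure worth removing before you rely on the chain.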
Order also reflects priority. Put the method you most want to run at the top, even if it is the one most likely to fail occasionally, because when it works it gives the best outcome. The lower levels are deliberately more conservative: cheaper, more available, or more cautious, accepting a lower-quality result in exchange for getting any result at all. The judgment is the trade between an ideal answer that sometimes fails and a serviceable answer that almost always succeeds.
Setting limits so the chain ends
A fallback chain must end. Without limits, an agent can cycle through levels, loop back, retry, and burn time and credits without ever finishing. Set three kinds of limits.
- A depth limit: the chain has a fixed number of levels and a defined last step. When the final level is reached, the agent stops or hands off. It never invents a new approach beyond the ones you defined.
- A retry cap per level: each level retries a small, fixed number of times before the agent moves on. Two or three attempts with backoff is typical. An uncapped retry is how an agent gets stuck forever on a single failing step.
- A total budget: a ceiling on wall-clock time, total tool calls, or credits spent across the whole chain. When the budget is hit, the agent stops at wherever it is and reports, rather than chasing a result past the point where it is worth the cost.
The last level should always be a safe terminal state: a human handoff, or a clean stop with a clear message about what failed and what was tried. An agent that exhausts its chain and then guesses is more dangerous than one that stops and says it could not complete the task. When a partial action has already happened earlier in a multi-step workflow, the agent may also need to undo it before stopping, which is the territory of agent error handling and rollback. Keeping each level inside its rate budget so retries do not hammer a recovering service connects to rate limiting your agent.
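The three limits can be sketched as one bounded loop. This is a simplified illustration, not a production runner: it assumes each step is a callable that returns `None` on failure, it omits backoff between attempts, and the names `run_bounded_chain` and `ChainResult` are hypothetical.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Sequence, Tuple


@dataclass
class ChainResult:
    status: str                 # "ok", "exhausted", or "budget_exceeded"
    level: Optional[str]        # level that produced the value, if any
    value: Optional[str]
    tried: List[str] = field(default_factory=list)  # one entry per attempt


def run_bounded_chain(
    levels: Sequence[Tuple[str, Callable[[], Optional[str]]]],
    max_attempts_per_level: int = 3,
    budget_seconds: float = 60.0,
    clock: Callable[[], float] = time.monotonic,
) -> ChainResult:
    start = clock()
    tried: List[str] = []
    # Depth limit: only the levels in this fixed list ever run.
    for name, step in levels:
        # Retry cap: a small, fixed number of attempts per level.
        for _ in range(max_attempts_per_level):
            # Total budget: stop wherever we are once the time ceiling is hit.
            if clock() - start > budget_seconds:
                return ChainResult("budget_exceeded", None, None, tried)
            tried.append(name)
            value = step()
            if value is not None:
                return ChainResult("ok", name, value, tried)
    # Safe terminal state: report what was tried instead of guessing.
    return ChainResult("exhausted", None, None, tried)
```

Whatever the outcome, the caller gets a status and the list of attempts, which is what a handoff message or a clean failure report needs.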
Logging which branch ran
If you take one operational habit from this guide, take this: log which branch of the chain actually ran. The fallback chain is a safety net, and a safety net is only safe if you know how often you are landing in it.
For each request, record the level that produced the final result, what triggered any fallback, how many retries were spent, and the final outcome. That record answers the question that matters: is the primary path healthy, or is the agent quietly limping along on a backup for a large share of requests? A fallback that runs on five percent of traffic is doing its job. A fallback that runs on forty percent of traffic means the primary is effectively broken and the chain is hiding the problem instead of surfacing it.
This is where fallback design meets observability. Without a per-branch log, a degraded primary looks identical to a healthy one from the outside, because the user still gets a result. The metrics to watch (fallback rate, the distribution across levels, and the trend over time) are part of agent monitoring and observability. Treat a rising fallback rate as an alert to fix the root cause, not as a sign the safety net is working well.
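A per-request record and two small aggregations are enough to answer the health question. The sketch below is one possible shape; the `BranchRecord` fields mirror the items listed above, but the field and function names are assumptions.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class BranchRecord:
    request_id: str
    level: str              # level that produced the final result, e.g. "primary"
    trigger: Optional[str]  # what caused the fallback; None if the primary succeeded
    retries: int            # retries spent across the run
    outcome: str            # e.g. "ok" or "failed"


def fallback_rate(records: Iterable[BranchRecord], primary: str = "primary") -> float:
    """Share of requests whose result did not come from the primary level."""
    records = list(records)
    if not records:
        return 0.0
    fell_back = sum(1 for r in records if r.level != primary)
    return fell_back / len(records)


def level_distribution(records: Iterable[BranchRecord]) -> Counter:
    """How many requests finished at each level."""
    return Counter(r.level for r in records)
```

Computed over a rolling window, `fallback_rate` gives the trend to alert on, and `level_distribution` shows which backup is absorbing the load.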
A worked example
Consider an agent that enriches a new lead with a company description. The chain might be defined like this.
- Primary: call a paid enrichment API keyed on the company domain. Fast and accurate when it has the record. Retry twice on a 429 or 503; fall back on a 404, an empty result, or two timeouts.
- Secondary: run a web search on the company name and domain, then have the agent summarize the top results. Different dependency, different failure mode. Fall back if the search returns nothing relevant or the summary scores below the confidence threshold.
- Tertiary: hand off to a person with the lead, the domain, and a note that automated enrichment found nothing, so a human can decide whether to research manually or mark the lead incomplete.
Bound it: each level retries at most twice, the whole chain has a sixty-second budget, and the last level is a handoff rather than a guess. Log it: every run records whether the result came from the API, the web search, or the human, plus what triggered each step down. After a week of logs you know exactly how often the paid API is enough, and whether the secondary is carrying more load than it should. That tells you whether to fix the API integration or accept the chain as it stands.
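The enrichment chain above can be written down as data plus one decision function. This is a sketch under assumptions: the level names, the event strings such as `"http_429"`, and the `next_step` helper are hypothetical, and the actual API call, search, and handoff are left out.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, Sequence, Tuple


@dataclass
class Level:
    name: str
    # Event -> retries allowed before that event becomes a fallback trigger.
    retry_limits: Dict[str, int] = field(default_factory=dict)
    # Events that trigger a fallback immediately.
    fall_back_on: FrozenSet[str] = frozenset()


ENRICHMENT_CHAIN = (
    Level(
        "enrichment_api",
        # Retry twice on 429/503; a second timeout (one retry spent) falls back.
        retry_limits={"http_429": 2, "http_503": 2, "timeout": 1},
        fall_back_on=frozenset({"http_404", "empty_result"}),
    ),
    Level(
        "web_search_summary",
        fall_back_on=frozenset({"no_relevant_results", "low_confidence"}),
    ),
    Level("human_handoff"),  # terminal level: a handoff, never a guess
)
CHAIN_BUDGET_SECONDS = 60  # whole-chain ceiling from the text above


def next_step(
    chain: Sequence[Level], level_index: int, event: str, retries_used: int
) -> Tuple[str, int]:
    """Decide what to do after a failure event at the given level."""
    level = chain[level_index]
    if event not in level.retry_limits and event not in level.fall_back_on:
        raise ValueError(f"unhandled event {event!r} at level {level.name!r}")
    if retries_used < level.retry_limits.get(event, 0):
        return ("retry", level_index)
    if level_index + 1 < len(chain):
        return ("fall_back", level_index + 1)
    return ("stop", level_index)
```

Keeping the chain as data makes the triggers reviewable in one place, and an unlisted event raises an error instead of being silently handled.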
This same shape (primary tool, alternate tool, then a model or human step) transfers to almost any agent task: a research lookup, a data write, a classification, a content draft. The categories change; the structure does not. If the underlying idea of an agent choosing and recovering across steps is still fuzzy, what is an AI agent and the glossary lay the groundwork this guide builds on.
How Gravity handles fallback chains
Gravity is an AI agent platform. The expert-built agents that run on Gravity ship with fallback chains already designed in, so you do not assemble retry caps, alternate tools, and handoff rules yourself. When a primary tool is unavailable, returns nothing, or produces a low-confidence result, the agent steps down to a defined backup, and only stops on a clear failure rather than a wrong guess.
The experts who build and maintain agents for Gravity choose which method is primary, which is the alternate, and where a human handoff belongs, then keep those chains tuned as tools and providers change. You describe what you need in plain words; the agent handles the recovery logic behind the scenes. Pricing is pay per use: $1 equals 1,000 credits, and you only pay when the agent runs, including the runs where a fallback quietly saved a result.
The practical effect is that a tool outage or an empty lookup does not become a failed task on your end. The agent degrades gracefully, finishes the job by another route when one exists, and tells you plainly when none does, which is exactly the behavior a fallback chain is meant to produce.
FAQ
What is the difference between a retry and a fallback chain?
A retry repeats the same step with the same tool, hoping a transient problem clears. A fallback chain switches to a different approach when the first one fails: a different tool, a different model, or a human. Retry handles flaky infrastructure; fallback handles a strategy that does not work. Most production agents use both: a few retries on a step, and a fallback if those retries are exhausted.
How many fallback levels should an agent have?
Three is a common shape: a primary step, one alternate that is genuinely different from the primary, and a human handoff as the last resort. Adding more levels rarely helps, because a second alternate that is similar to the first tends to fail for the same reason. Keep the chain short, make each level meaningfully different, and end on a handoff or a safe stop so the agent never loops indefinitely.
What should trigger a fallback instead of a retry?
Trigger a fallback when the failure is unlikely to resolve on its own: a structural 4xx error such as a 401, 403, or 404 (a permission or not-found response), an empty result when a result was required, a low-confidence answer below your threshold, or a hard timeout after retries are spent. Trigger a retry for transient signals such as a 429 rate limit, a 503, or a network reset. The distinction is whether trying the same thing again could plausibly succeed.
Does an alternate model always improve the result?
No. An alternate model helps when the first model failed for a model-specific reason: it refused, hit a context limit, returned low confidence, or was unavailable. It does not help when the failure is in the data or the tool, because a second model fed the same bad input usually produces the same bad output. Use a model fallback for reasoning and availability problems, and a tool or data fallback for input and access problems.
Why should an agent log which fallback branch ran?
Logging the branch tells you how often the primary path works versus how often the agent is quietly limping along on a backup. If the secondary branch runs on a large share of requests, the primary is effectively broken and the fallback is hiding it. The log should record which level ran, what triggered the fallback, and the final outcome, so you can fix the root cause instead of leaning on the safety net forever.