Most agent platform outages I have seen were not catastrophic. A model provider had an incident; a region's vector store throttled; a deploy clobbered a prompt store; a tenant's run history was deleted by a buggy retention job. Each was recoverable in minutes if a plan existed, and in hours if it did not. The plan does not have to be elaborate, but it does have to be written down, owned by someone, and tested. This piece is a companion to the ones on incident response, uptime and reliability, and data residency.

This piece is the disaster recovery (DR) plan template I use for agent platforms. It covers scope, RTO and RPO targets, the failure classes worth planning for, failover mechanics, backups, the runbook structure, and the drill cadence auditors expect.

What a DR plan covers for agent platforms

Six component classes need DR coverage on an agent platform. Each has different recovery characteristics.

Each gets a row in the DR matrix: component, owner, RTO, RPO, backup mechanism, failover mechanism, drill cadence.
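One way to keep the matrix honest is to hold each row as a typed record and reject rows with blank fields. The sketch below is a minimal illustration, not a prescribed schema: the class name, the minute-based units, and the sample values are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DRMatrixRow:
    """One row of the DR matrix; field names mirror the columns listed above."""
    component: str
    owner: str
    rto_minutes: int
    rpo_minutes: int
    backup_mechanism: str
    failover_mechanism: str
    drill_cadence: str


def missing_fields(row: DRMatrixRow) -> list[str]:
    """Return the names of fields left blank, so incomplete rows fail review."""
    return [name for name, value in vars(row).items() if value in ("", None)]


# Illustrative values only; the owner and targets are hypothetical.
example_row = DRMatrixRow(
    component="vector index",
    owner="search-platform team",
    rto_minutes=120,
    rpo_minutes=1440,
    backup_mechanism="daily snapshot to object storage",
    failover_mechanism="restore from snapshot in secondary region",
    drill_cadence="monthly",
)
```

A row with an empty owner is the most common gap in practice, and `missing_fields` makes it show up in review rather than during an incident.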

RTO and RPO targets

RTO (Recovery Time Objective) is the maximum acceptable downtime. RPO (Recovery Point Objective) is the maximum acceptable data loss measured in time. The NIST contingency planning guide formalizes both as the boundary between "acceptable" and "business-impacting" (NIST SP 800-34 Rev 1, 2010).
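Both numbers can be measured after an incident or a drill from three timestamps. The sketch below assumes you can identify when the outage started, when service was restored, and the time of the last write that survived recovery; the function names are made up for the example.

```python
from datetime import datetime, timedelta


def achieved_rto(outage_start: datetime, service_restored: datetime) -> timedelta:
    """Downtime actually experienced: restore time minus outage start."""
    return service_restored - outage_start


def achieved_rpo(last_recoverable_write: datetime, outage_start: datetime) -> timedelta:
    """Data loss window: everything written after the last recoverable write is gone."""
    return outage_start - last_recoverable_write


def within_target(achieved: timedelta, target: timedelta) -> bool:
    """True when the measured value is at or under the objective."""
    return achieved <= target


# Hypothetical incident timeline.
start = datetime(2025, 3, 1, 10, 0)
restored = datetime(2025, 3, 1, 10, 40)
last_write = datetime(2025, 3, 1, 9, 52)

rto = achieved_rto(start, restored)    # 40 minutes of downtime
rpo = achieved_rpo(last_write, start)  # 8 minutes of lost writes
```

Against a 60 minute RTO and a 15 minute RPO, this hypothetical incident is within target on both.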

Defaults that fit most agent platforms: an RTO of 15 to 60 minutes for end-user runs and an RPO of 5 to 15 minutes for run history.

If your business says "no, the run history RTO is 5 minutes, not 60", the architecture has to follow: active-active replication instead of restore-from-backup. Pick targets, then pick architecture, not the other way around.

Failure classes worth planning for

Six failure classes cover 95 percent of real incidents on agent platforms.

  1. Model provider outage. OpenAI, Anthropic, and Google have all had multi-hour incidents in the past 12 months. The OpenAI November 2024 incident took the API down for several hours; the Anthropic December 2024 incident degraded Claude responses (OpenAI status page, Anthropic status page).
  2. Region failure. A cloud region loses power, network, or AZ-wide capacity. Rare but recoverable only if you planned cross-region.
  3. Data corruption. A buggy job rewrote a prompt store, deleted an index, truncated a log. The recovery requires versioned snapshots.
  4. Accidental deletion. Human or scripted. The hardest to defend against because it bypasses normal access controls.
  5. Tenant-blast incident. A noisy or compromised tenant's traffic takes down a shared component. The DR mechanism is isolation rebuild, not full restore.
  6. Vendor termination. The vector DB provider or the model provider winds down a service. Recovery is migration, not failover, but the planning happens before the announcement.
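The six classes map to different recovery modes, which is worth encoding wherever incidents get triaged. The sketch below is one possible lookup; the enum and the wording of each recovery mode are my own shorthand for the list above, and treating accidental deletion as a snapshot restore is an assumption.

```python
from enum import Enum


class FailureClass(Enum):
    MODEL_PROVIDER_OUTAGE = "model provider outage"
    REGION_FAILURE = "region failure"
    DATA_CORRUPTION = "data corruption"
    ACCIDENTAL_DELETION = "accidental deletion"
    TENANT_BLAST = "tenant-blast incident"
    VENDOR_TERMINATION = "vendor termination"


# Shorthand recovery mode per class, paraphrased from the list above.
RECOVERY_MODE = {
    FailureClass.MODEL_PROVIDER_OUTAGE: "fail over to a second model provider",
    FailureClass.REGION_FAILURE: "fail over to the secondary region",
    FailureClass.DATA_CORRUPTION: "restore from a versioned snapshot",
    FailureClass.ACCIDENTAL_DELETION: "restore from a versioned snapshot",  # assumption
    FailureClass.TENANT_BLAST: "rebuild isolation for the affected tenant",
    FailureClass.VENDOR_TERMINATION: "migrate to a replacement vendor",
}


def recovery_mode(failure: FailureClass) -> str:
    """Look up the first-response recovery mode for a failure class."""
    return RECOVERY_MODE[failure]
```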

Multi-provider model failover

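Multi-provider model routing is the cheapest insurance in the stack. The mechanism: keep a fallback registry with pinned model snapshots from at least two providers, detect failure via an error-rate or latency threshold, route to the fallback, and flag those runs as degraded for later quality comparison.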
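A minimal sketch of such a router follows. It is an illustration under stated assumptions, not a production design: `Provider`, `RunResult`, and `FallbackRouter` are hypothetical names, the provider call is any function from prompt to text, and health is judged only by error rate over a small sliding window.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable


@dataclass
class Provider:
    name: str
    model: str  # pinned model snapshot identifier (hypothetical value)
    call: Callable[[str], str]


@dataclass
class RunResult:
    text: str
    provider: str
    degraded: bool  # True when a fallback provider served the run


class FallbackRouter:
    """Try providers in priority order, skipping ones over the error threshold."""

    def __init__(self, providers: list[Provider],
                 error_threshold: float = 0.5, window: int = 10):
        self.providers = providers
        self.error_threshold = error_threshold
        # Sliding window of recent outcomes per provider (True = success).
        self._outcomes = {p.name: deque(maxlen=window) for p in providers}

    def _healthy(self, name: str) -> bool:
        outcomes = self._outcomes[name]
        if not outcomes:
            return True  # no data yet, assume healthy
        error_rate = outcomes.count(False) / len(outcomes)
        return error_rate < self.error_threshold

    def complete(self, prompt: str) -> RunResult:
        last_error = None
        for index, provider in enumerate(self.providers):
            if not self._healthy(provider.name):
                continue
            try:
                text = provider.call(prompt)
            except Exception as exc:  # any provider error counts as a failure
                self._outcomes[provider.name].append(False)
                last_error = exc
                continue
            self._outcomes[provider.name].append(True)
            return RunResult(text, provider.name, degraded=index > 0)
        raise RuntimeError("all providers unavailable") from last_error
```

This sketch never retries a provider once it is marked unhealthy. A real router would add a cooldown or a periodic probe so the primary can recover, and would likely track latency as well as errors.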

Real-world: in October 2024, Bedrock's Anthropic offering experienced regional issues; teams with a provider-agnostic router stayed up by failing over to the direct Anthropic API or to Claude on Vertex AI. Teams without one had a multi-hour outage they could not control.

Backups: prompts, indexes, run history

The backup matrix.

Prompt and bundle store. Source-of-truth in git. The deployed bundles in object storage are immutable; loss is recoverable via redeploy. Snapshot the bundle storage daily; cross-region replicate. RPO 0 because source control is the ground truth.

Vector indexes. Two tiers: the hot index in your vector DB, and a daily snapshot in object storage with cross-region replication. Restore time scales in minutes per GB, so size the plan for the largest tenant. For a 100 GB index, plan for a 60 to 120 minute restore. The cold tier is what saves you when the live index is corrupted or deleted.
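The 100 GB figure implies a restore rate of roughly 0.6 to 1.2 minutes per GB. The sketch below turns that into a sizing check; the rate is derived from the numbers above rather than measured, so substitute the rate from your own restore drills.

```python
def restore_minutes(index_gb: float, minutes_per_gb: float) -> float:
    """Estimated restore time, assuming restore time scales linearly with size."""
    if index_gb < 0 or minutes_per_gb <= 0:
        raise ValueError("index_gb must be >= 0 and minutes_per_gb must be > 0")
    return index_gb * minutes_per_gb


def fits_rto(index_gb: float, minutes_per_gb: float, rto_minutes: float) -> bool:
    """True when the estimated restore finishes inside the RTO budget."""
    return restore_minutes(index_gb, minutes_per_gb) <= rto_minutes


# Rates implied by "100 GB restores in 60 to 120 minutes".
best_case = restore_minutes(100, 0.6)   # about 60 minutes
worst_case = restore_minutes(100, 1.2)  # about 120 minutes
```

Run the check against the worst-case rate for the largest tenant. If it does not fit the RTO, the options are a faster restore path or a standby index, not a more optimistic estimate.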

Run history and audit logs. Streamed to a write-optimized store with at-least-once delivery guarantees. Cross-region replication if RPO requires it. AWS recommends combining storage-side replication with point-in-time recovery for stateful data stores like DynamoDB used in agent platforms (AWS DR whitepaper, 2025).
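At-least-once delivery means the store will occasionally see the same event twice, so the write path has to be idempotent. A minimal sketch, assuming every event carries a unique `event_id` field (a hypothetical name) and using an in-memory dict as a stand-in for the real store:

```python
class RunHistoryStore:
    """In-memory stand-in for a run history store with idempotent ingest."""

    def __init__(self) -> None:
        self._events: dict[str, dict] = {}

    def ingest(self, event: dict) -> bool:
        """Store the event once; return False if it was a duplicate delivery."""
        event_id = event["event_id"]
        if event_id in self._events:
            return False  # redelivered event, safe to ignore
        self._events[event_id] = event
        return True

    def __len__(self) -> int:
        return len(self._events)
```

The same property matters during recovery: replaying a stream from the last checkpoint is only safe if duplicates are dropped on write.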

Identity and tenant config. Small data, high value. Point-in-time backups daily; encrypted; restore-tested monthly. The "ten minutes to rebuild the tenant routing table" drill is one every platform team should run.

Regional failover

Three architectures, ascending in cost and complexity.

  1. Backup-and-restore. Single primary region. Backups replicated to a secondary. On disaster, spin up the secondary from backups. RTO measured in hours. Cheap.
  2. Warm standby. Reduced-capacity environment running in the secondary region. On failover, scale up. RTO measured in tens of minutes. Moderate cost.
  3. Active-active. Both regions serve traffic. On failover, the surviving region takes 100 percent. RTO near-zero. Highest cost and complexity.

For most agent platforms in years 1 to 2, warm standby is the right tradeoff: the cost is bounded, the RTO is measured in minutes, and the operational complexity is manageable. Active-active is correct only once you have the traffic and the team to run it.
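The targets-first rule can be written down as a lookup from RTO to architecture. The thresholds below are illustrative assumptions drawn loosely from the ranges above (near-zero, tens of minutes, hours), not fixed cutoffs.

```python
def pick_architecture(rto_minutes: float) -> str:
    """Map an RTO target to the cheapest architecture that can plausibly meet it."""
    if rto_minutes <= 0:
        raise ValueError("rto_minutes must be positive")
    if rto_minutes <= 5:      # assumption: only active-active gets near zero
        return "active-active"
    if rto_minutes <= 60:     # assumption: warm standby covers tens of minutes
        return "warm standby"
    return "backup-and-restore"
```

For example, a 30 minute RTO lands on warm standby, while a 5 minute RTO forces active-active regardless of cost.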

The DR runbook

The runbook is the actionable artifact. Sections:

  1. Activation criteria. Who declares a DR event, and what conditions trigger it. Usually a sustained outage past the RTO budget.
  2. Roles. Incident commander, communications lead, technical leads per component class.
  3. Component playbooks. For each component class, a step-by-step recovery procedure with the exact commands.
  4. Communication tree. Who gets notified internally, what status page text gets posted, how customers are updated.
  5. Decision points. "After 30 minutes without progress, escalate to vendor support and consider X". Time-boxed.
  6. Recovery validation. Smoke tests that confirm the platform is back; quality evals that confirm the platform is back at the right quality.
  7. Postmortem template. Triggers within 5 business days of the event.
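Time-boxed decision points are easy to encode so the incident commander gets prompted instead of having to remember. A small sketch: the 30 minute escalation comes from the example above, while the 60 minute entry and the names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecisionPoint:
    after_minutes: int
    action: str


def due_actions(points: list[DecisionPoint], elapsed_minutes: int,
                progress_made: bool) -> list[str]:
    """Return the actions whose time box has expired without progress."""
    if progress_made:
        return []
    ordered = sorted(points, key=lambda p: p.after_minutes)
    return [p.action for p in ordered if elapsed_minutes >= p.after_minutes]


# Example decision points for one component playbook.
playbook_points = [
    DecisionPoint(60, "declare regional failover"),  # hypothetical
    DecisionPoint(30, "escalate to vendor support"),
]
```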

Drills and audit evidence

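A plan you have never tested is a wish. The drill cadence: monthly for single-component restores, quarterly for full failover drills, annually for unannounced adversarial drills.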

Auditors want evidence of testing, not just the written plan. SOC 2 Common Criteria CC7.5 requires evaluation of recovery testing; ISO 27001 Annex A.17 has equivalent language (ISO 27001). Keep the drill records: date, scenario, runbook used, RTO achieved, RPO achieved, issues found, follow-up tickets.
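Drill records are simple enough to keep as structured data, which also makes overdue drills easy to detect. The sketch below uses the record fields listed above and a monthly, quarterly, and annual cadence; the day counts, type labels, and names are assumptions for the example.

```python
from dataclasses import dataclass, field
from datetime import date

# Maximum allowed age of the latest drill, in days, per drill type (approximate).
CADENCE_DAYS = {
    "single-component restore": 31,
    "full failover": 92,
    "unannounced adversarial": 366,
}


@dataclass
class DrillRecord:
    drill_date: date
    scenario: str
    drill_type: str
    runbook_used: str
    rto_achieved_minutes: int
    rpo_achieved_minutes: int
    issues_found: list[str] = field(default_factory=list)
    follow_up_tickets: list[str] = field(default_factory=list)


def overdue_drills(records: list[DrillRecord], today: date) -> list[str]:
    """Return drill types that were never run or whose last run is too old."""
    overdue = []
    for drill_type, max_age_days in CADENCE_DAYS.items():
        dates = [r.drill_date for r in records if r.drill_type == drill_type]
        if not dates or (today - max(dates)).days > max_age_days:
            overdue.append(drill_type)
    return overdue
```

Running `overdue_drills` on a schedule and opening a ticket for each result gives the auditor a trail without anyone assembling it by hand.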

FAQ

What is a disaster recovery plan for an AI agent platform?
A documented set of procedures and supporting infrastructure that lets the platform resume serving agent runs after a major failure such as a provider outage, region failure, data corruption, or accidental deletion.
What are RTO and RPO for agent platforms?
RTO is the maximum acceptable downtime; RPO is the maximum acceptable data loss measured in time. Typical agent platform targets: RTO 15 to 60 minutes for end-user runs, RPO 5 to 15 minutes for run history.
How do you fail over when the model provider is down?
Multi-provider routing with a fallback registry. Pin model snapshots from at least two providers. Detect failure via error rate or latency threshold; route to the fallback; flag runs as degraded for later quality comparison.
Do I need to back up prompts and indexes?
Yes. Prompts are IP and the main behavior driver; indexes are expensive to rebuild. Both need versioned snapshots, cross-region replication, and periodic restore drills.
How often should I test the DR plan?
Monthly for single-component restores. Quarterly for full failover drills. Annually for unannounced adversarial drills. Auditors expect documented evidence of testing, not just a written plan.
What does the auditor want to see in a DR plan?
The written plan, the RTO and RPO targets, the named owner, the last drill date and result, and the follow-up tickets opened from issues found.

Sources