Most agent platform outages I have seen were not catastrophic. A model provider had an incident; a region's vector store throttled; a deploy clobbered a prompt store; a tenant's run history was deleted by a buggy retention job. Each was recoverable in minutes if a plan existed, and in hours if it did not. The plan does not have to be elaborate, but it does have to be written down, owned by someone, and tested. This piece is a companion to the ones on incident response, uptime and reliability, and data residency.
This piece is the disaster recovery (DR) plan template I use for agent platforms. It covers scope, RTO and RPO targets, the failure classes worth planning for, failover mechanics, backups, the runbook structure, and the drill cadence auditors expect.
What a DR plan covers for agent platforms
Six component classes need DR coverage on an agent platform. Each has different recovery characteristics.
- Model providers. External services. Failure is usually a provider-side incident or a regional dependency.
- Prompt and bundle store. Internal. Source of truth for agent behavior. Loss is recoverable from source control if discipline holds.
- Vector and retrieval indexes. Internal. Expensive to rebuild from source. This is the component teams forget.
- Run history and audit logs. Internal. Required for compliance and customer support. Sensitive to RPO.
- Identity and tenant config. Internal. Small but high-value. Loss is catastrophic for multi-tenant platforms.
- Orchestrator code and infrastructure. Internal. Recoverable from source control and IaC.
Each gets a row in the DR matrix: component, owner, RTO, RPO, backup mechanism, failover mechanism, drill cadence.
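A minimal sketch of the matrix as data, so that a blank owner or a missing backup mechanism shows up in review rather than during an incident. The class name, field names, team names, and example values are illustrative assumptions, not a standard schema.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass(frozen=True)
class DRMatrixRow:
    """One row of the DR matrix. Field names are illustrative."""
    component: str
    owner: str
    rto_minutes: int
    rpo_minutes: Optional[int]  # None when RPO is not applicable
    backup_mechanism: str
    failover_mechanism: str
    drill_cadence: str


def missing_fields(row: DRMatrixRow) -> List[str]:
    """Return the names of text fields that were left blank."""
    blank = []
    for f in fields(row):
        value = getattr(row, f.name)
        if isinstance(value, str) and not value.strip():
            blank.append(f.name)
    return blank


# Hypothetical rows; the second one has no owner yet.
matrix = [
    DRMatrixRow("vector-indexes", "search-platform-team", 240, 1440,
                "daily snapshot to object storage", "restore from snapshot", "monthly"),
    DRMatrixRow("identity-and-tenant-config", "", 30, 5,
                "point-in-time backups", "restore to standby", "monthly"),
]

for row in matrix:
    gaps = missing_fields(row)
    if gaps:
        print(f"{row.component}: missing {', '.join(gaps)}")
```

Running a check like this in CI is one way to keep the "owned by someone" requirement from eroding as components are added.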
RTO and RPO targets
RTO (Recovery Time Objective) is the maximum acceptable downtime. RPO (Recovery Point Objective) is the maximum acceptable data loss measured in time. The NIST contingency planning guide formalizes both as the boundary between "acceptable" and "business-impacting" (NIST SP 800-34 Rev 1, 2010).
Defaults that fit most agent platforms:
- End-user agent runs: RTO 15 minutes. RPO not applicable (runs are stateless; on failover, queued runs replay).
- Run history and audit logs: RTO 60 minutes. RPO 5 to 15 minutes (acceptable to lose the last N minutes of writes).
- Prompt and bundle store: RTO 30 minutes. RPO 0 (everything reproducible from source control).
- Vector indexes: RTO 2 to 4 hours (depending on size). RPO 24 hours typical (daily snapshot is normal).
- Identity and tenant config: RTO 30 minutes. RPO 5 minutes.
If your business says "no, the run history RTO is 5 minutes, not 60", the architecture has to follow: active-active replication instead of restore-from-backup. Pick targets, then pick architecture, not the other way around.
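One way to sanity-check an RPO target against a backup design is simple arithmetic. The worst case is a failure that lands just before the next backup would have reached the secondary region, so the data lost is roughly the backup interval plus the replication lag. The sketch below assumes that model; the function names and the example lag values are mine, not from any standard.

```python
def worst_case_loss_minutes(backup_interval_min: float,
                            replication_lag_min: float = 0.0) -> float:
    """Worst-case data loss: one full backup interval plus the time the
    newest backup needs to reach the secondary region."""
    return backup_interval_min + replication_lag_min


def meets_rpo(rpo_target_min: float,
              backup_interval_min: float,
              replication_lag_min: float = 0.0) -> bool:
    """True when the worst-case loss fits inside the RPO target."""
    return worst_case_loss_minutes(backup_interval_min, replication_lag_min) <= rpo_target_min


# Run history with a 15-minute RPO target.
print(meets_rpo(15, backup_interval_min=60))                        # hourly backups
print(meets_rpo(15, backup_interval_min=0, replication_lag_min=2))  # continuous replication

# Vector index with a 24-hour RPO target and a daily snapshot.
print(meets_rpo(1440, backup_interval_min=1440))
print(meets_rpo(1440, backup_interval_min=1440, replication_lag_min=30))
```

The last case is worth noticing: a daily snapshot only meets a 24-hour RPO if the replication lag is treated as zero, which is an argument for either a slightly looser target or a slightly tighter snapshot schedule.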
Failure classes worth planning for
Six failure classes cover 95 percent of real incidents on agent platforms.
- Model provider outage. OpenAI, Anthropic, and Google have all had multi-hour incidents in the past 12 months. The OpenAI November 2024 incident took the API down for several hours; the Anthropic December 2024 incident degraded Claude responses (OpenAI status page, Anthropic status page).
- Region failure. A cloud region loses power, network, or AZ-wide capacity. Rare but recoverable only if you planned cross-region.
- Data corruption. A buggy job rewrote a prompt store, deleted an index, truncated a log. The recovery requires versioned snapshots.
- Accidental deletion. Human or scripted. The hardest to defend against because it is carried out with legitimate credentials, so normal access controls do not stop it.
- Tenant-blast incident. A noisy or compromised tenant's traffic takes down a shared component. The DR mechanism is isolation rebuild, not full restore.
- Vendor termination. The vector DB provider or the model provider winds down a service. Recovery is migration, not failover, but the planning happens before the announcement.
Multi-provider model failover
Multi-provider model routing is the cheapest insurance in the stack. The mechanism.
- Pin at least two providers per capability. Reasoning capability: Claude Sonnet + GPT-4 class. Embeddings: OpenAI + a self-hosted Sentence-Transformer fallback. Image: Gemini + DALL-E.
- Health-check at the routing layer. Detect provider failure via error rate above threshold or p95 latency above threshold over a 60-second window. Auto-route to the fallback when crossed.
- Mark runs as degraded. A run executed on the fallback model gets a "degraded:true" flag in the trace. The eval suite picks this up and compares output quality after recovery.
- Provider-agnostic prompts. Prompts written for "the model can call a tool that does X" rather than "the assistant function should..." survive the swap. Tool-call format conversion (OpenAI-style to Anthropic-style) happens in the routing layer.
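The health check and degraded flag described above can be sketched as a small sliding-window tracker. This is a minimal sketch under stated assumptions: the class and function names, the threshold values, and the minimum sample count are all hypothetical, and a production router would also need concurrency control and active probing of the failed provider.

```python
import math
import time
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Dict, Optional, Tuple


@dataclass
class ProviderHealth:
    window_seconds: float = 60.0
    max_error_rate: float = 0.2       # assumed threshold
    max_p95_latency_s: float = 10.0   # assumed threshold
    min_samples: int = 5              # below this, do not fail over
    samples: Deque[Tuple[float, float, bool]] = field(default_factory=deque)

    def record(self, latency_s: float, ok: bool, now: Optional[float] = None) -> None:
        now = time.monotonic() if now is None else now
        self.samples.append((now, latency_s, ok))
        self._trim(now)

    def _trim(self, now: float) -> None:
        # Drop samples that have aged out of the window.
        while self.samples and now - self.samples[0][0] > self.window_seconds:
            self.samples.popleft()

    def healthy(self, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        self._trim(now)
        n = len(self.samples)
        if n < self.min_samples:
            return True
        error_rate = sum(1 for _, _, ok in self.samples if not ok) / n
        latencies = sorted(lat for _, lat, _ in self.samples)
        p95 = latencies[math.ceil(0.95 * n) - 1]  # nearest-rank percentile
        return error_rate <= self.max_error_rate and p95 <= self.max_p95_latency_s


def choose_provider(primary: str, fallback: str,
                    health: Dict[str, ProviderHealth],
                    now: Optional[float] = None) -> Tuple[str, bool]:
    """Return (provider, degraded). degraded is True when the fallback is used."""
    if health[primary].healthy(now):
        return primary, False
    return fallback, True


# Simulate ten calls to the primary, half of them failing.
health = {"primary": ProviderHealth(), "fallback": ProviderHealth()}
for i in range(10):
    health["primary"].record(latency_s=1.0, ok=(i < 5), now=float(i))

provider, degraded = choose_provider("primary", "fallback", health, now=10.0)
trace = {"provider": provider, "degraded": degraded}
print(trace)
```

One design consequence to be aware of: once the window empties, this sketch treats the primary as healthy again and sends it live traffic. That works as a crude probe, but a separate synthetic health check is usually the safer way to decide when to fail back.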
Real-world: in October 2024, Bedrock's Anthropic offering experienced regional issues; teams with a provider-agnostic router stayed up by failing over to direct Anthropic API or Vertex AI's Claude. Teams without one had a multi-hour outage they could not control.
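The tool-call format conversion that the routing layer performs can be small. The sketch below converts one tool definition from the OpenAI-style shape (a "function" object with "parameters") to the Anthropic-style shape (a flat object with "input_schema"), as I understand both public formats; verify against the current provider documentation before relying on it. The function name and the example tool are hypothetical.

```python
from typing import Any, Dict


def to_anthropic_tool(openai_tool: Dict[str, Any]) -> Dict[str, Any]:
    """Convert an OpenAI-style tool definition to an Anthropic-style one.

    Only the definition is converted here; translating tool-call requests
    and results between providers is a separate step.
    """
    if openai_tool.get("type") != "function":
        raise ValueError("only function tools are supported in this sketch")
    fn = openai_tool["function"]
    return {
        "name": fn["name"],
        "description": fn.get("description", ""),
        # Both sides carry a JSON Schema object; only the key name differs.
        "input_schema": fn.get("parameters", {"type": "object", "properties": {}}),
    }


# Hypothetical tool definition.
openai_tool = {
    "type": "function",
    "function": {
        "name": "lookup_order",
        "description": "Look up an order by its ID.",
        "parameters": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    },
}

anthropic_tool = to_anthropic_tool(openai_tool)
print(anthropic_tool["name"], sorted(anthropic_tool))
```

Keeping tool definitions in one canonical shape and converting at the router is what lets the prompts themselves stay provider-agnostic.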
Backups: prompts, indexes, run history
The backup matrix.
Prompt and bundle store. Source-of-truth in git. The deployed bundles in object storage are immutable; loss is recoverable via redeploy. Snapshot the bundle storage daily; cross-region replicate. RPO 0 because source control is the ground truth.
Vector indexes. Two tiers: the hot index in your vector DB, and a daily snapshot in object storage that is replicated cross-region. Restore time is measured in minutes per GB, so size the plan for the largest tenant. For a 100 GB index, plan for a 60 to 120 minute restore. The cold storage is what saves you when the live index is corrupted or deleted.
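The restore estimate is worth writing down as arithmetic. The 100 GB figure above implies a restore rate of roughly 0.6 to 1.2 minutes per GB; the sketch below uses those rates as assumptions, and the function names are mine. Measure your own rate in a drill rather than trusting these numbers.

```python
def restore_minutes(index_gb: float, minutes_per_gb: float) -> float:
    """Estimated restore time for an index of the given size."""
    return index_gb * minutes_per_gb


def max_index_gb_within_rto(rto_minutes: float, minutes_per_gb: float) -> float:
    """Largest index that can be restored inside the RTO at a given rate."""
    return rto_minutes / minutes_per_gb


FAST_RATE = 0.6  # minutes per GB, assumed best case
SLOW_RATE = 1.2  # minutes per GB, assumed worst case

low = restore_minutes(100, FAST_RATE)
high = restore_minutes(100, SLOW_RATE)
print(f"100 GB restore: about {low:.0f} to {high:.0f} minutes")

# With a 4-hour RTO and the slow rate, how large can one index get?
limit_gb = max_index_gb_within_rto(240, SLOW_RATE)
print(f"Largest index within a 240-minute RTO: about {limit_gb:.0f} GB")
```

If the largest tenant's index is already past that limit, the options are a faster restore path, sharding the index, or a longer RTO for that tenant.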
Run history and audit logs. Streamed to a write-optimized store with at-least-once delivery guarantees. Cross-region replication if RPO requires it. AWS recommends combining storage-side replication with point-in-time recovery for stateful data stores like DynamoDB used in agent platforms (AWS DR whitepaper, 2025).
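At-least-once delivery means the same event can arrive more than once, especially during a failover replay, so the store has to tolerate duplicates. A minimal sketch, assuming each event carries a unique "event_id" (a hypothetical field name) and using an in-memory dict as a stand-in for the real write-optimized store:

```python
from typing import Any, Dict


class RunHistoryStore:
    """In-memory stand-in for a run history store with idempotent writes."""

    def __init__(self) -> None:
        self._events: Dict[str, Dict[str, Any]] = {}

    def write(self, event: Dict[str, Any]) -> bool:
        """Store the event once. Return False if it was a duplicate delivery."""
        event_id = event["event_id"]
        if event_id in self._events:
            return False
        self._events[event_id] = event
        return True

    def count(self) -> int:
        return len(self._events)


store = RunHistoryStore()
deliveries = [
    {"event_id": "run-1:step-1", "status": "started"},
    {"event_id": "run-1:step-2", "status": "tool_call"},
    {"event_id": "run-1:step-1", "status": "started"},  # redelivered after a retry
]
accepted = [store.write(e) for e in deliveries]
print(accepted, store.count())
```

In a real store the same effect usually comes from a unique key or a conditional write rather than an application-level lookup.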
Identity and tenant config. Small data, high value. Daily backups alone cannot meet a 5-minute RPO, so pair continuous point-in-time recovery with daily encrypted snapshots, and restore-test monthly. The "ten minutes to rebuild the tenant routing table" drill is one every platform team should run.
Regional failover
Three architectures, ascending in cost and complexity.
- Backup-and-restore. Single primary region. Backups replicated to a secondary. On disaster, spin up the secondary from backups. RTO measured in hours. Cheap.
- Warm standby. Reduced-capacity environment running in the secondary region. On failover, scale up. RTO measured in tens of minutes. Moderate cost.
- Active-active. Both regions serve traffic. On failover, the surviving region takes 100 percent. RTO near-zero. Highest cost and complexity.
For most agent platforms in years 1 to 2, warm standby is the right tradeoff: the cost is bounded, the RTO is measured in minutes, and the operational complexity is manageable. Active-active is correct only once you have the traffic and the team to run it.
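The "targets first, architecture second" rule can be written as a lookup from the tightest RTO target to the cheapest architecture that can meet it. The cutoffs below (5 minutes and 2 hours) are my assumptions, chosen to match the rough RTO ranges given for each architecture; adjust them to what your own drills measure.

```python
def choose_architecture(rto_minutes: float) -> str:
    """Pick the cheapest regional architecture that plausibly meets the RTO.

    Thresholds are illustrative assumptions, not industry constants.
    """
    if rto_minutes <= 5:
        return "active-active"        # near-zero RTO
    if rto_minutes < 120:
        return "warm standby"         # RTO in tens of minutes
    return "backup-and-restore"       # RTO in hours


# The platform's architecture is driven by its tightest component target.
component_rto = {
    "end-user agent runs": 15,
    "run history": 60,
    "vector indexes": 240,
}
tightest = min(component_rto.values())
print(choose_architecture(tightest))
```

With the default targets from earlier in this piece, the tightest RTO is 15 minutes, which lands on warm standby.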
The DR runbook
The runbook is the actionable artifact. Sections:
- Activation criteria. Who declares a DR event, and what conditions trigger it. Usually a sustained outage past the RTO budget.
- Roles. Incident commander, communications lead, technical leads per component class.
- Component playbooks. For each component class, a step-by-step recovery procedure with the exact commands.
- Communication tree. Who gets notified internally, what status page text gets posted, how customers are updated.
- Decision points. "After 30 minutes without progress, escalate to vendor support and consider X". Time-boxed.
- Recovery validation. Smoke tests that confirm the platform is back; quality evals that confirm the platform is back at the right quality.
- Postmortem template. The postmortem is held within 5 business days of the event.
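Time-boxed decision points are easier to follow under pressure when they are data rather than prose. A minimal sketch, assuming each decision point is an elapsed-time threshold plus an action; the class name, function name, and example actions are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class DecisionPoint:
    after_minutes: int
    action: str


def due_actions(points: Iterable[DecisionPoint],
                elapsed_minutes: float,
                progress_made: bool) -> List[str]:
    """Return the actions whose time box has expired without progress."""
    if progress_made:
        return []
    ordered = sorted(points, key=lambda p: p.after_minutes)
    return [p.action for p in ordered if elapsed_minutes >= p.after_minutes]


# Hypothetical decision points for one component playbook.
playbook = [
    DecisionPoint(60, "begin regional failover"),
    DecisionPoint(30, "escalate to vendor support"),
]

print(due_actions(playbook, elapsed_minutes=45, progress_made=False))
print(due_actions(playbook, elapsed_minutes=75, progress_made=False))
```

The incident commander still makes the call; the list only removes the need to remember the thresholds mid-incident.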
Drills and audit evidence
A plan you have never tested is a wish. The drill cadence.
- Monthly: Single-component restore drill. Restore last week's vector index from cold storage to a non-production environment. Time it; verify checksums.
- Quarterly: Full failover drill. Simulate a regional outage in a staging environment; run through the runbook end-to-end; measure RTO and RPO.
- Annually: Adversarial drill. The DR scenario is not pre-announced; the on-call team runs the playbook cold.
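The "time it; verify checksums" step of the monthly drill can be sketched with the standard library. This assumes the snapshot is a single file with a SHA-256 recorded at backup time, and that "restore_fn" (a hypothetical callable) performs the actual restore; real vector DB snapshots are often multi-file, in which case the same check applies per file or per manifest.

```python
import hashlib
import time
from typing import Callable, Dict, Union


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so large snapshots do not fill memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def timed_restore_check(restore_fn: Callable[[], None],
                        restored_path: str,
                        expected_sha256: str) -> Dict[str, Union[float, bool]]:
    """Run the restore, time it, and compare the result to the recorded checksum."""
    start = time.monotonic()
    restore_fn()
    elapsed = time.monotonic() - start
    return {
        "restore_seconds": elapsed,
        "checksum_ok": sha256_of(restored_path) == expected_sha256,
    }
```

The returned dict is the raw material for the drill record: the elapsed time feeds the RTO measurement, and a failed checksum is an issue to ticket.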
Auditors want evidence of testing, not just the written plan. SOC 2 Common Criteria CC7.5 requires evaluation of recovery testing; ISO 27001 has equivalent language in Annex A.17 of the 2013 edition, renumbered as controls A.5.29 and A.5.30 in the 2022 edition (ISO 27001). Keep the drill records: date, scenario, runbook used, RTO achieved, RPO achieved, issues found, follow-up tickets.
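A drill record with those fields is small enough to keep as structured data next to the runbook. A minimal sketch; the class name, field names, and every value in the example (date, runbook path, ticket ID) are made up for illustration.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List


@dataclass
class DrillRecord:
    drill_date: date
    scenario: str
    runbook: str
    rto_target_minutes: float
    rto_achieved_minutes: float
    rpo_target_minutes: float
    rpo_achieved_minutes: float
    issues_found: List[str] = field(default_factory=list)
    follow_up_tickets: List[str] = field(default_factory=list)

    def met_targets(self) -> bool:
        """True when both achieved values are within their targets."""
        return (self.rto_achieved_minutes <= self.rto_target_minutes
                and self.rpo_achieved_minutes <= self.rpo_target_minutes)


# Hypothetical record from a quarterly failover drill.
record = DrillRecord(
    drill_date=date(2025, 1, 15),
    scenario="simulated regional outage in staging",
    runbook="runbooks/regional-failover.md",
    rto_target_minutes=30,
    rto_achieved_minutes=42,
    rpo_target_minutes=15,
    rpo_achieved_minutes=6,
    issues_found=["standby scale-up took longer than planned"],
    follow_up_tickets=["DR-101"],
)
print(record.met_targets())
```

A missed target with a follow-up ticket attached is still good audit evidence; a missing record is not.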
FAQ
- What is a disaster recovery plan for an AI agent platform?
- A documented set of procedures and supporting infrastructure that lets the platform resume serving agent runs after a major failure such as a provider outage, region failure, data corruption, or accidental deletion.
- What are RTO and RPO for agent platforms?
- RTO is the maximum acceptable downtime; RPO is the maximum acceptable data loss measured in time. Typical agent platform targets: RTO 15 minutes for end-user runs and 60 minutes for run history, with RPO 5 to 15 minutes for run history.
- How do you fail over when the model provider is down?
- Multi-provider routing with a fallback registry. Pin model snapshots from at least two providers. Detect failure via error rate or latency threshold; route to the fallback; flag runs as degraded for later quality comparison.
- Do I need to back up prompts and indexes?
- Yes. Prompts are IP and the main behavior driver; indexes are expensive to rebuild. Both need versioned snapshots, cross-region replication, and periodic restore drills.
- How often should I test the DR plan?
- Monthly for single-component restores. Quarterly for full failover drills. Annually for unannounced adversarial drills. Auditors expect documented evidence of testing, not just a written plan.
- What does the auditor want to see in a DR plan?
- The written plan, the RTO and RPO targets, the named owner, the last drill date and result, and the follow-up tickets opened from issues found.
Sources
- NIST, "SP 800-34 Rev 1: Contingency Planning Guide for Federal Information Systems", 2010, csrc.nist.gov
- AWS, "Disaster Recovery of Workloads on AWS: Recovery in the Cloud", 2025, docs.aws.amazon.com
- Google Cloud, "Disaster recovery planning guide", 2025, cloud.google.com
- ISO, "ISO/IEC 27001:2022 Information security management", iso.org
- OpenAI, "Status page", status.openai.com
- Anthropic, "Status page", status.anthropic.com
