What is an on-call runbook for an AI agent platform?

A written playbook the on-call engineer follows when an alert fires. It contains the alert taxonomy, triage steps for each, escalation paths, mitigation commands, and the postmortem trigger. A good runbook turns a 2 AM page into a 10-minute fix without paging anyone else.

What alerts should fire for an agent platform?

Five core alerts: error rate above threshold, latency p99 above threshold, model provider unhealthy, queue depth above SLO, and per-tenant anomalies (outsized rate-limit hits, run failures, or cost). Each maps to a specific runbook section with named mitigations.

How long should an agent platform on-call rotation be?

One week is standard. Shorter than a week and handoffs cost too much. Longer than a week and pager fatigue dominates. A 7-day primary plus 7-day secondary rotation covers most platforms.

What are the most common agent platform incidents?

Model provider outage, rate-limit exhaustion, prompt regression after deploy, vector index degradation, and tool integration failures. Each has a distinct signature and a documented mitigation in the runbook.

When should the on-call escalate?

When the issue is outside the runbook, when the mitigation has not worked within the time budget (15 minutes in this runbook), or when scope exceeds a single component. The escalation tree should be one page, named, and current.

What goes in the runbook versus the postmortem?

The runbook is for the next on-call engineer who hits the same alert. The postmortem is for the team learning from a specific incident. Findings from postmortems get folded back into the runbook as new mitigations or new alert tuning.

AI Agent On-Call Runbook: Incident Playbook for Agent Operators

The on-call runbook is the operational artifact that makes the platform survivable. A well-written runbook turns a 2 AM page into a 10-minute fix the on-call engineer can execute alone. A bad runbook forces a multi-person war room every time an alert fires. This piece covers the structure, the alert taxonomy, the mitigation patterns, and the escalation tree that an agent platform on-call rotation needs. It is a companion to the pieces on incident response, observability dashboards, and monitoring.

Rotation structure

A workable rotation for an agent platform team of 4 to 12 engineers.

Primary on-call. One week, follow-the-sun if the team is split across regions. Primary takes the page first.
Secondary on-call. One week, paged if primary does not acknowledge within 5 minutes, or escalated to by the primary.
Incident commander rotation. Separate from the technical on-call, optional for small teams. The IC owns the response coordination during major incidents.
Specialist escalation. Domain experts (vector DB owner, integration owner, security) reachable but not paged by default.

The Google SRE book recommends sizing rotations so that each on-call engineer handles no more than two incidents per 12-hour shift on average; sustained higher rates indicate that the alert thresholds are wrong or the underlying system needs work (Google SRE Book, Chapter 11, 2016).
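The primary and secondary rotation above can be sketched as a small schedule generator. This is a minimal illustration, not a scheduling tool: the function name `build_rotation` and the choice to make each week's secondary the following week's primary are assumptions, as are the engineer names in the usage note.

```python
from datetime import date, timedelta


def build_rotation(engineers, start, weeks):
    """Return one (week_start, primary, secondary) tuple per week.

    Assumption: the secondary for a week is the next engineer in the list,
    so this week's secondary becomes next week's primary.
    """
    if len(engineers) < 2:
        raise ValueError("need at least two engineers for primary plus secondary")
    schedule = []
    for i in range(weeks):
        primary = engineers[i % len(engineers)]
        secondary = engineers[(i + 1) % len(engineers)]
        schedule.append((start + timedelta(weeks=i), primary, secondary))
    return schedule
```

For example, `build_rotation(["ana", "ben", "chen", "dev"], date(2025, 1, 6), 5)` starts with ana as primary and ben as secondary, and wraps back to ana in the fifth week.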

Alert taxonomy

Five alert classes are sufficient for a starting agent platform.

Error rate above threshold. 5xx responses or terminal agent run failures above 2 percent of traffic over a 5-minute window. Trigger: page.
Latency p99 above threshold. Run latency p99 above the SLO (e.g., 30 seconds) sustained for 5 minutes. Trigger: page if customer SLO at risk, otherwise ticket.
Model provider unhealthy. Per-provider error rate above 10 percent or sustained 429s. Trigger: page if primary provider, ticket if fallback.
Queue depth above SLO. Queue wait time at p99 above the SLO. Trigger: page if SLO at risk.
Per-tenant anomaly. Single tenant generating outsized rate-limit hits, run failures, or cost. Trigger: ticket plus tenant comms.

Lower-priority alerts (drift in eval scores, cost overruns, capacity utilization) get tickets, not pages. Pages are reserved for "customers are seeing this right now".
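The page-versus-ticket decisions for the first three alert classes can be written down as plain functions, which makes the thresholds reviewable. This is a sketch under assumptions: the function names and the `"none"` result are illustrative, and the thresholds are the ones quoted above (2 percent errors, 5 minutes sustained, 10 percent provider errors).

```python
PAGE, TICKET, NO_ALERT = "page", "ticket", "none"


def route_error_rate(failure_fraction):
    """failure_fraction: failed runs / total runs over a 5-minute window."""
    return PAGE if failure_fraction > 0.02 else NO_ALERT


def route_latency(p99_s, slo_s, sustained_min, customer_slo_at_risk):
    """Alert only when p99 has exceeded the SLO for at least 5 minutes."""
    if p99_s <= slo_s or sustained_min < 5:
        return NO_ALERT
    return PAGE if customer_slo_at_risk else TICKET


def route_provider(error_fraction, sustained_429s, is_primary):
    """Page for the primary provider, ticket for a fallback provider."""
    unhealthy = error_fraction > 0.10 or sustained_429s
    if not unhealthy:
        return NO_ALERT
    return PAGE if is_primary else TICKET
```

In practice these rules usually live in the alerting system's own configuration; the point of the sketch is that every alert resolves to exactly one of page, ticket, or nothing.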

Triage flow

The standard flow once the page lands.

Acknowledge within 5 minutes. Even from a phone. Acknowledgment stops further pages and signals to the secondary that primary is engaged.
Open the dashboard linked in the alert. Every alert should link to the relevant dashboard panel. If it does not, fix the alert after the incident.
Match symptom to runbook section. The runbook section names map 1:1 to alert names.
Execute the documented mitigation. The first command is usually "check status pages of upstream providers" - the cheapest exclusion.
Verify the mitigation worked. Watch metrics for at least one full window (5 minutes typical) before declaring resolved.
If 15 minutes pass without resolution, escalate. Time-boxed. No exceptions.
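The two time rules in this flow, the 15-minute escalation budget and the one-window verification, are simple enough to encode. The sketch below assumes metric samples arrive as `(timestamp, value)` pairs and that "worked" means every sample since the mitigation is at or below a threshold; both function names are hypothetical.

```python
from datetime import datetime, timedelta

ESCALATE_AFTER = timedelta(minutes=15)
VERIFY_WINDOW = timedelta(minutes=5)


def should_escalate(paged_at, now, resolved):
    """True once the time box has elapsed without a resolution."""
    return not resolved and now - paged_at >= ESCALATE_AFTER


def mitigation_verified(samples, mitigated_at, now, threshold):
    """True only after a full window of healthy samples since the mitigation.

    samples: iterable of (timestamp, value) pairs for the alerting metric.
    """
    if now - mitigated_at < VERIFY_WINDOW:
        return False  # too early to declare resolved
    recent = [v for t, v in samples if mitigated_at <= t <= now]
    return bool(recent) and all(v <= threshold for v in recent)
```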

Common failure classes and mitigations

Each section below maps to one runbook page. The structure: symptom, likely cause, mitigation commands, verification.

1. Model provider outage. Symptom: error rate spike, latency spike, provider status page lit. Cause: provider-side. Mitigation: confirm via provider status page; engage multi-provider failover (route to fallback); post a status page entry. Verification: error rate drops within 2 minutes of failover; latency returns to baseline. See DR plan for the failover mechanics.
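One way to picture the failover step is a router that holds providers in priority order and skips any marked unhealthy. This is a minimal sketch, not the DR plan's actual mechanism; the class name `ProviderRouter` and the provider names in the usage note are placeholders.

```python
class ProviderRouter:
    """Routes to the highest-priority provider not marked unhealthy."""

    def __init__(self, providers):
        self.providers = list(providers)  # priority order, primary first
        self.unhealthy = set()

    def mark_unhealthy(self, name):
        self.unhealthy.add(name)

    def mark_healthy(self, name):
        self.unhealthy.discard(name)

    def active(self):
        for name in self.providers:
            if name not in self.unhealthy:
                return name
        raise RuntimeError("no healthy provider available")
```

With `router = ProviderRouter(["primary", "fallback"])`, calling `router.mark_unhealthy("primary")` makes `router.active()` return `"fallback"` until the primary is marked healthy again.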

2. Rate-limit exhaustion. Symptom: spike in 429 responses, queue depth growing. Cause: traffic exceeds provisioned TPM. Mitigation: enable spillover to secondary provider; throttle non-critical traffic (background jobs, batch evals); page capacity team for an emergency quota request. Verification: 429 rate returns to baseline; queue drains. Capacity planning covers the prevention side.
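Throttling non-critical traffic amounts to an admission gate that the on-call can switch on. The sketch below assumes requests carry a traffic class label; the labels `"background_job"`, `"batch_eval"`, and `"interactive"` and the class name `LoadShedder` are illustrative only.

```python
class LoadShedder:
    """Rejects non-critical traffic classes while shedding is enabled."""

    def __init__(self, non_critical=("background_job", "batch_eval")):
        self.non_critical = set(non_critical)
        self.shedding = False

    def start_shedding(self):
        self.shedding = True

    def stop_shedding(self):
        self.shedding = False

    def admit(self, traffic_class):
        """Return True if a request of this class should be processed now."""
        return not (self.shedding and traffic_class in self.non_critical)
```

Customer-facing traffic is always admitted; only the classes named as non-critical are held back until the 429 rate recovers.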

3. Prompt regression after deploy. Symptom: error rate spike right after a deploy. Cause: bad prompt or bundle. Mitigation: roll back via the blue-green pointer flip; if warm bundle is still loaded, rollback is sub-second. Verification: error rate returns to pre-deploy baseline.
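The pointer flip can be illustrated with a two-slot pointer that remembers the previous bundle. This is a sketch assuming bundles are identified by a version string and that exactly one prior bundle is kept warm; `BundlePointer` and the version labels are hypothetical.

```python
class BundlePointer:
    """Tracks the live bundle and the previous one kept warm for rollback."""

    def __init__(self, live):
        self.live = live
        self.previous = None

    def deploy(self, new_bundle):
        # The outgoing bundle stays loaded so rollback is only a pointer swap.
        self.previous, self.live = self.live, new_bundle

    def rollback(self):
        if self.previous is None:
            raise RuntimeError("no previous bundle to roll back to")
        self.live, self.previous = self.previous, self.live
        return self.live
```

After `pointer.deploy("v42")` on a pointer that was serving `"v41"`, a single `pointer.rollback()` makes `"v41"` live again without reloading anything.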

4. Vector index degradation. Symptom: retrieval relevance drops; user complaints about wrong citations. Cause: index corruption, partial ingest failure, or staleness past SLO. Mitigation: switch index pointer to the prior known-good snapshot; trigger re-index. Verification: retrieval relevance metric recovers; manual sample queries return expected results.

5. Tool integration failure. Symptom: tool-call error rate spike for a specific tool. Cause: upstream API change, credential expiry, network issue. Mitigation: disable the affected tool in the registry (returns "tool unavailable, try later" instead of cascading failures); page integration owner. Verification: tool-call error rate drops; non-tool flows unaffected.
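Disabling a tool in the registry can be sketched as a kill switch that short-circuits calls with a clear message instead of letting them fail downstream. The `ToolRegistry` class, its method names, and the result dictionary shape are assumptions for illustration; only the "tool unavailable, try later" message comes from the text above.

```python
class ToolRegistry:
    """Holds callable tools and lets on-call disable one without a deploy."""

    def __init__(self):
        self._tools = {}
        self._disabled = set()

    def register(self, name, fn):
        self._tools[name] = fn

    def disable(self, name):
        self._disabled.add(name)

    def enable(self, name):
        self._disabled.discard(name)

    def call(self, name, *args, **kwargs):
        if name in self._disabled:
            # Fail fast with a message the agent can relay to the user.
            return {"ok": False, "error": "tool unavailable, try later"}
        return {"ok": True, "result": self._tools[name](*args, **kwargs)}
```

Other tools in the registry keep working while the affected one is disabled, which is what keeps non-tool flows unaffected.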

6. Noisy tenant. Symptom: single tenant's traffic dominates rate-limit metrics. Cause: tenant agent looped, tenant traffic spike, or compromised credentials. Mitigation: apply per-tenant rate-limit override; contact tenant; if security-related, revoke credentials. Verification: shared-pool metrics return to normal.
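A per-tenant override is easiest to reason about next to the limiter it modifies. The sketch below uses a fixed one-minute window counter, which is an assumption chosen for brevity (production limiters often use token buckets or sliding windows); `TenantRateLimiter` and its method names are hypothetical.

```python
class TenantRateLimiter:
    """Fixed-window per-minute limiter with per-tenant overrides."""

    def __init__(self, default_per_minute):
        self.default = default_per_minute
        self.overrides = {}
        self.windows = {}  # tenant -> (minute_index, request_count)

    def set_override(self, tenant, per_minute):
        self.overrides[tenant] = per_minute

    def clear_override(self, tenant):
        self.overrides.pop(tenant, None)

    def allow(self, tenant, now_s):
        """now_s: current time in seconds, passed in so the logic is testable."""
        limit = self.overrides.get(tenant, self.default)
        minute = int(now_s // 60)
        start, count = self.windows.get(tenant, (minute, 0))
        if start != minute:
            start, count = minute, 0  # a new window begins
        if count >= limit:
            self.windows[tenant] = (start, count)
            return False
        self.windows[tenant] = (start, count + 1)
        return True
```

Setting a low override for the noisy tenant caps only that tenant; every other tenant continues at the default limit.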

Escalation tree

The escalation tree should be one page, named, and current. The format:

Primary on-call → Secondary on-call. If there is no ack within 5 minutes, or if the primary requests help.
Secondary → Domain owner. When the issue maps to a known component (vector DB, integrations, security).
Domain owner → Director of engineering. When scope is multi-component, customer-impacting, or extending past 2 hours.
Director → CEO. Customer SLO breaches, data incidents, regulatory implications.
Out-of-hours specialist contacts. Named in the runbook, with current phone numbers, validated quarterly.

The escalation contacts get re-verified every quarter. A runbook that references "Alex's old phone number" is not a runbook.
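The quarterly re-verification is easier to enforce when each contact records the date it was last checked. The sketch below is one possible shape: the `Contact` fields, the 92-day cutoff standing in for "a quarter", and the sample names and numbers in the usage note are all assumptions.

```python
from dataclasses import dataclass
from datetime import date, timedelta

MAX_AGE = timedelta(days=92)  # assumed stand-in for "one quarter"


@dataclass
class Contact:
    role: str
    name: str
    phone: str
    verified_on: date


def stale_contacts(contacts, today):
    """Return contacts whose details were last verified over a quarter ago."""
    return [c for c in contacts if today - c.verified_on > MAX_AGE]
```

For example, a contact such as `Contact("Domain owner", "Alex", "555-0100", date(2024, 9, 1))` would be flagged on `date(2025, 1, 15)`, while one verified on `date(2024, 12, 1)` would not.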

Communication during incidents

The on-call's job during a customer-visible incident is half technical and half communication.

Internal channel update within 5 minutes. Even if all you have is "engaged, investigating".
Status page entry within 10 minutes. If customer-visible. PagerDuty's incident response guide treats this as a non-negotiable for any incident affecting external users (PagerDuty Incident Response, 2025).
15-minute update cadence. Even if there is no new info, post that there is no new info. Silence breeds escalation.
Resolved entry within 10 minutes of fix. Mark resolved on the status page; thank customers; preview the postmortem timeline.
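The communication deadlines above reduce to a few offsets from the incident start, the last update, and the fix time. A small helper, sketched here with hypothetical function names and dictionary keys, can compute them for a reminder bot or checklist.

```python
from datetime import datetime, timedelta


def comms_deadlines(started_at, customer_visible):
    """Deadlines for the first internal update and, if needed, the status page."""
    deadlines = {"internal_update": started_at + timedelta(minutes=5)}
    if customer_visible:
        deadlines["status_page_entry"] = started_at + timedelta(minutes=10)
    return deadlines


def next_update_due(last_update_at):
    """Updates follow a 15-minute cadence, even when there is no new info."""
    return last_update_at + timedelta(minutes=15)


def resolved_entry_due(fixed_at):
    """The resolved entry goes up within 10 minutes of the fix."""
    return fixed_at + timedelta(minutes=10)
```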

Postmortem trigger

A postmortem is required when any of the following holds.

Customer-visible incident longer than 15 minutes.
Data loss or corruption of any extent.
Security incident or potential one.
SLO breach.
Repeat of a prior incident class.

The postmortem is blameless (focus on systems, not people), due within 5 business days, and produces at least one runbook update or alert tuning. The Google SRE postmortem template is a sane starting point (Google SRE Book, Chapter 15, 2016).
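The trigger list and the 5-business-day deadline can be checked mechanically at incident close. This is a sketch under assumptions: the `Incident` field names are illustrative, and the business-day helper skips weekends only, not public holidays.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Incident:
    customer_visible_minutes: float = 0
    data_loss: bool = False
    security_related: bool = False  # confirmed or potential
    slo_breached: bool = False
    repeat_of_prior_class: bool = False


def postmortem_reasons(incident):
    """Return the triggers that apply; an empty list means no postmortem needed."""
    checks = [
        (incident.customer_visible_minutes > 15, "customer-visible over 15 minutes"),
        (incident.data_loss, "data loss or corruption"),
        (incident.security_related, "security incident or potential one"),
        (incident.slo_breached, "SLO breach"),
        (incident.repeat_of_prior_class, "repeat of a prior incident class"),
    ]
    return [reason for hit, reason in checks if hit]


def postmortem_due(resolved_on, business_days=5):
    """Count forward the given number of weekdays (holidays are not handled)."""
    current = resolved_on
    remaining = business_days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday=0 .. Friday=4
            remaining -= 1
    return current
```

An incident resolved on Friday 3 January 2025 would have its postmortem due on Friday 10 January 2025.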

Keeping the runbook healthy

Three rules keep the runbook from rotting.

Every postmortem produces at least one runbook update. A new mitigation, a tightened threshold, a clarified escalation path. If a postmortem produces zero runbook changes, either the incident was already covered (great) or the team is not learning (a problem).

Quarterly dry-run. During one on-call rotation each quarter, the team runs through the runbook cold against a synthetic scenario. The exercise catches stale phone numbers, broken dashboard links, and mitigations whose commands no longer work.

Onboarding test. A new engineer takes the runbook and a fake page; can they resolve a class-3 incident (prompt regression after deploy) without help? If not, the runbook is not yet good enough for them to take primary.

FAQ

What is an on-call runbook for an AI agent platform?: A written playbook the on-call engineer follows when an alert fires. Contains alert taxonomy, triage steps, escalation paths, mitigation commands, and postmortem triggers.
What alerts should fire for an agent platform?: Error rate above threshold, latency p99 above threshold, model provider unhealthy, queue depth above SLO, and per-tenant anomalies (outsized rate-limit hits, run failures, or cost). Each maps to a runbook section.
How long should an agent platform on-call rotation be?: One week is standard. Shorter loses handoff context; longer drives pager fatigue.
What are the most common agent platform incidents?: Model provider outage, rate-limit exhaustion, prompt regression after deploy, vector index degradation, and tool integration failures.
When should the on-call escalate?: When the issue is outside the runbook, when the documented mitigation has not worked within the 15-minute budget, or when scope exceeds a single component.
What goes in the runbook versus the postmortem?: The runbook is for the next on-call engineer who hits the same alert. The postmortem is the team learning from a specific incident, and produces runbook updates as artifacts.

Sources

Google SRE, "Being On-Call", SRE Book Chapter 11, 2016, sre.google
Google SRE, "Postmortem Culture: Learning from Failure", SRE Book Chapter 15, 2016, sre.google
PagerDuty, "Incident Response Documentation", 2025, response.pagerduty.com
Atlassian, "Incident management handbook", 2025, atlassian.com
NIST, "SP 800-61 Rev 2: Computer Security Incident Handling Guide", 2012, csrc.nist.gov