The on-call runbook is the operational artifact that makes the platform survivable. A well-written runbook turns a 2 AM page into a 10-minute fix the on-call engineer can execute alone. A bad runbook forces a multi-person war room every time an alert fires. This piece covers the rotation structure, the alert taxonomy, the triage flow, the mitigation patterns, and the escalation tree that an agent platform on-call rotation needs. It is a companion to the pieces on incident response, observability dashboards, and monitoring.

Rotation structure

A workable rotation for an agent platform team of 4 to 12 engineers.

The Google SRE book recommends sizing rotations so that each on-call engineer handles no more than two incidents per 12-hour shift; sustained higher rates indicate that the alert thresholds are wrong or that the underlying system needs work (Google SRE Book, Chapter 11, 2016).
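A load budget like this is easy to check against paging history. The sketch below is illustrative only: the function names and the input format (a list of incident counts, one per shift) are assumptions, not part of any paging tool's API.

```python
from statistics import mean

# Budget taken from the recommendation above; tune per team.
MAX_AVG_INCIDENTS_PER_SHIFT = 2.0


def average_load(incidents_per_shift):
    """Mean number of incidents per shift; 0.0 when there is no history."""
    if not incidents_per_shift:
        return 0.0
    return mean(incidents_per_shift)


def over_budget(incidents_per_shift, budget=MAX_AVG_INCIDENTS_PER_SHIFT):
    """True when the sustained average exceeds the budget."""
    return average_load(incidents_per_shift) > budget
```

A rotation that stays over budget for several weeks is a signal to retune thresholds or fix the underlying system, not to add more people to the rotation.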

Alert taxonomy

Five alert classes are sufficient for a starting agent platform.

  1. Error rate above threshold. 5xx responses or terminal agent run failures above 2 percent of traffic over a 5-minute window. Trigger: page.
  2. Latency p99 above threshold. Run latency p99 above the SLO (e.g., 30 seconds) sustained for 5 minutes. Trigger: page if customer SLO at risk, otherwise ticket.
  3. Model provider unhealthy. Per-provider error rate above 10 percent or sustained 429s. Trigger: page if primary provider, ticket if fallback.
  4. Queue depth above SLO. Queue wait time at p99 above the SLO. Trigger: page if SLO at risk.
  5. Per-tenant anomaly. Single tenant generating outsized rate-limit hits, run failures, or cost. Trigger: ticket plus tenant comms.

Tertiary alerts (drift in eval scores, cost overruns, capacity utilization) get tickets, not pages. Pages are reserved for "customers are seeing this right now".
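The page-versus-ticket rules above can be written down as a small routing function. This is a minimal sketch: the alert kind strings and the `Alert` fields are hypothetical names chosen for illustration, and treating a queue-depth alert with no SLO risk as a ticket is an assumption the list does not state.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    PAGE = "page"
    TICKET = "ticket"
    TICKET_AND_TENANT_COMMS = "ticket plus tenant comms"


@dataclass
class Alert:
    kind: str                      # hypothetical kind names, see route()
    slo_at_risk: bool = False      # used by latency and queue alerts
    primary_provider: bool = False  # used by provider health alerts


def route(alert: Alert) -> Action:
    """Map an alert to page or ticket following the five classes above."""
    if alert.kind == "error_rate":
        return Action.PAGE
    if alert.kind in ("latency_p99", "queue_depth"):
        # Assumption: queue depth without SLO risk becomes a ticket.
        return Action.PAGE if alert.slo_at_risk else Action.TICKET
    if alert.kind == "provider_unhealthy":
        return Action.PAGE if alert.primary_provider else Action.TICKET
    if alert.kind == "tenant_anomaly":
        return Action.TICKET_AND_TENANT_COMMS
    # Eval drift, cost overruns, capacity utilization: tickets, not pages.
    return Action.TICKET
```

Keeping the routing in one function makes it reviewable in a postmortem: a changed threshold or trigger shows up as a one-line diff.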

Triage flow

The standard flow once the page lands.

  1. Acknowledge within 5 minutes. Even from a phone. Acknowledgment stops further pages and signals to the secondary that primary is engaged.
  2. Open the dashboard linked in the alert. Every alert should link to the relevant dashboard panel. If it does not, fix the alert after the incident.
  3. Match symptom to runbook section. The runbook section names map 1:1 to alert names.
  4. Execute the documented mitigation. The first step is usually "check status pages of upstream providers", because it is the cheapest cause to rule out.
  5. Verify the mitigation worked. Watch metrics for at least one full window (5 minutes typical) before declaring resolved.
  6. If 15 minutes pass without resolution, escalate. Time-boxed. No exceptions.
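The time boxes in this flow (5 minutes to acknowledge, 15 minutes to escalate, one full window to verify) can be sketched as two small checks. The function names and return strings are hypothetical. Two assumptions are made: the 15-minute budget is measured from the moment the page fired, and an unacknowledged page rolls to the secondary after the acknowledgment deadline.

```python
from datetime import datetime, timedelta

ACK_DEADLINE = timedelta(minutes=5)
ESCALATE_AFTER = timedelta(minutes=15)
VERIFY_WINDOW = timedelta(minutes=5)


def next_step(paged_at, now, acked, resolved):
    """Return what the flow calls for at time `now`."""
    if resolved:
        return "done"
    elapsed = now - paged_at
    if elapsed >= ESCALATE_AFTER:
        return "escalate"  # time-boxed, no exceptions
    if not acked and elapsed >= ACK_DEADLINE:
        return "page_secondary"  # assumption: unacked pages roll over
    return "continue"


def mitigation_verified(samples, threshold, window=VERIFY_WINDOW):
    """samples: (timestamp, value) pairs taken after mitigation, oldest first.

    True only if the samples span at least one full window and every
    value stayed at or below the threshold.
    """
    if not samples:
        return False
    span = samples[-1][0] - samples[0][0]
    return span >= window and all(value <= threshold for _, value in samples)
```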

Common failure classes and mitigations

Each section below maps to one runbook page. The structure: symptom, likely cause, mitigation commands, verification.
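One way to keep that four-part structure honest is to store runbook pages as data and check that every alert name has a page. The sketch below is an illustration under assumptions: the `RunbookPage` type, the `RUNBOOK` dictionary, and the key names are hypothetical, and the single entry reuses the wording of the prompt regression class.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunbookPage:
    symptom: str
    likely_cause: str
    mitigation: tuple  # ordered steps, executed top to bottom
    verification: str


# Hypothetical registry keyed by alert name.
RUNBOOK = {
    "prompt_regression_after_deploy": RunbookPage(
        symptom="error rate spike right after a deploy",
        likely_cause="bad prompt or bundle",
        mitigation=("roll back via the blue-green pointer flip",),
        verification="error rate returns to pre-deploy baseline",
    ),
}


def missing_pages(alert_names, runbook):
    """Alert names that have no runbook page yet, sorted for stable output."""
    return sorted(name for name in alert_names if name not in runbook)
```

Running `missing_pages` in CI against the alert definitions is one way to catch an alert that was added without a matching page.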

1. Model provider outage. Symptom: error rate spike, latency spike, provider status page lit. Cause: provider-side. Mitigation: confirm via provider status page; engage multi-provider failover (route to fallback); post a status page entry. Verification: error rate drops within 2 minutes of failover; latency returns to baseline. See DR plan for the failover mechanics.
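The failover decision itself can be very small. The following sketch assumes per-provider error rates are already available as fractions and reuses the 10 percent unhealthy threshold from the alert taxonomy; the provider names and the `pick_provider` function are hypothetical, and the real mechanics belong to the DR plan.

```python
def pick_provider(error_rates, preference_order, unhealthy_above=0.10):
    """Return the first provider in preference order that looks healthy.

    error_rates: dict of provider name -> error rate as a fraction.
    Assumption: a provider with no data is treated as healthy (rate 0.0).
    Returns None when every provider is above the threshold.
    """
    for name in preference_order:
        if error_rates.get(name, 0.0) <= unhealthy_above:
            return name
    return None
```

Returning `None` rather than the least-bad provider is a deliberate choice in this sketch: with no healthy provider, the next step is escalation, not more routing.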

2. Rate-limit exhaustion. Symptom: spike in 429 responses, queue depth growing. Cause: traffic exceeds provisioned TPM. Mitigation: enable spillover to secondary provider; throttle non-critical traffic (background jobs, batch evals); page capacity team for an emergency quota request. Verification: 429 rate returns to baseline; queue drains. Capacity planning covers the prevention side.
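The spillover and throttling steps amount to a per-request decision. This is a minimal sketch, assuming the 429 rate on the primary provider is available as a fraction; the traffic class names, the 5 percent trigger, and the `plan_request` function are all hypothetical.

```python
# Traffic that can wait while the primary provider is rate limited.
NON_CRITICAL = {"background_job", "batch_eval"}


def plan_request(traffic_class, primary_429_rate, trigger=0.05):
    """Decide where a request goes during rate-limit exhaustion.

    Returns "primary" under normal conditions, "defer" for non-critical
    traffic while throttling is active, and "secondary" for everything
    else (spillover).
    """
    if primary_429_rate <= trigger:
        return "primary"
    if traffic_class in NON_CRITICAL:
        return "defer"
    return "secondary"
```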

3. Prompt regression after deploy. Symptom: error rate spike right after a deploy. Cause: bad prompt or bundle. Mitigation: roll back via the blue-green pointer flip; if warm bundle is still loaded, rollback is sub-second. Verification: error rate returns to pre-deploy baseline.
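A pointer flip is fast because nothing is rebuilt: the previous bundle stays loaded and only a reference changes. The class below is a sketch of that idea, not the platform's deploy tooling; `BundlePointer` and the version strings in the usage note are hypothetical.

```python
import threading


class BundlePointer:
    """Tracks the live bundle and the previous one kept warm for rollback."""

    def __init__(self, live, previous):
        self._lock = threading.Lock()
        self.live = live
        self.previous = previous

    def deploy(self, new_bundle):
        """Make new_bundle live and keep the old live bundle as previous."""
        with self._lock:
            self.previous, self.live = self.live, new_bundle

    def rollback(self):
        """Swap live and previous; returns the bundle that is now live."""
        with self._lock:
            self.live, self.previous = self.previous, self.live
            return self.live
```

For example, after `BundlePointer("v41", "v40").deploy("v42")`, a call to `rollback()` makes `"v41"` live again and keeps `"v42"` around for inspection.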

4. Vector index degradation. Symptom: retrieval relevance drops; user complaints about wrong citations. Cause: index corruption, partial ingest failure, or staleness past SLO. Mitigation: switch index pointer to the prior known-good snapshot; trigger re-index. Verification: retrieval relevance metric recovers; manual sample queries return expected results.
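Picking the rollback target is the step most likely to go wrong under pressure, so it helps to make it mechanical. The sketch below assumes each snapshot carries a creation time and a known-good flag set by earlier validation; the `Snapshot` type and `rollback_target` function are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Snapshot:
    id: str
    created_at: datetime
    known_good: bool


def rollback_target(snapshots, current_id):
    """Newest known-good snapshot older than the current one, or None."""
    current = next((s for s in snapshots if s.id == current_id), None)
    if current is None:
        return None
    candidates = [
        s for s in snapshots
        if s.known_good and s.created_at < current.created_at
    ]
    return max(candidates, key=lambda s: s.created_at, default=None)
```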

5. Tool integration failure. Symptom: tool-call error rate spike for a specific tool. Cause: upstream API change, credential expiry, network issue. Mitigation: disable the affected tool in the registry (returns "tool unavailable, try later" instead of cascading failures); page integration owner. Verification: tool-call error rate drops; non-tool flows unaffected.
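Disabling a tool in the registry works as a kill switch: calls return a clean "unavailable" result instead of reaching the broken upstream. The registry below is a sketch under assumptions; `ToolRegistry`, its methods, and the result dictionary shape are hypothetical, and only the message text comes from the mitigation above.

```python
class ToolRegistry:
    """Maps tool names to callables and lets on-call disable one by name."""

    UNAVAILABLE = "tool unavailable, try later"

    def __init__(self):
        self._tools = {}
        self._disabled = set()

    def register(self, name, fn):
        self._tools[name] = fn

    def disable(self, name):
        self._disabled.add(name)

    def enable(self, name):
        self._disabled.discard(name)

    def call(self, name, *args, **kwargs):
        """Run the tool, or return a clean error if it is disabled."""
        if name in self._disabled:
            return {"ok": False, "error": self.UNAVAILABLE}
        return {"ok": True, "result": self._tools[name](*args, **kwargs)}
```

Because the disabled path never touches the tool, other tools and non-tool flows keep working while the integration owner investigates.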

6. Noisy tenant. Symptom: single tenant's traffic dominates rate-limit metrics. Cause: tenant agent looped, tenant traffic spike, or compromised credentials. Mitigation: apply per-tenant rate-limit override; contact tenant; if security-related, revoke credentials. Verification: shared-pool metrics return to normal.
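A per-tenant override is a lookup that falls back to the shared default. The sketch below assumes limits are expressed as requests per minute and that a per-minute counter already exists elsewhere; `TenantLimits`, its methods, and the tenant IDs in the tests are hypothetical.

```python
class TenantLimits:
    """Default requests-per-minute limit with per-tenant overrides."""

    def __init__(self, default_rpm):
        self.default_rpm = default_rpm
        self._overrides = {}

    def set_override(self, tenant_id, rpm):
        """Clamp one tenant without touching anyone else's limit."""
        self._overrides[tenant_id] = rpm

    def clear_override(self, tenant_id):
        self._overrides.pop(tenant_id, None)

    def limit_for(self, tenant_id):
        return self._overrides.get(tenant_id, self.default_rpm)

    def allow(self, tenant_id, requests_this_minute):
        """True if the tenant is still under its limit for this minute."""
        return requests_this_minute < self.limit_for(tenant_id)
```

Clearing the override after the tenant confirms the fix is part of the mitigation; a forgotten override becomes the next support ticket.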

Escalation tree

The escalation tree should fit on one page, list contacts by name, and be kept current. The format.

The escalation contacts get re-verified every quarter. A runbook that references "Alex's old phone number" is not a runbook.

Communication during incidents

The on-call's job during a customer-visible incident is half technical and half communication.

Postmortem trigger

A postmortem is required when any of the following holds.

The postmortem is blameless (focus on systems, not people), due within 5 business days, and produces at least one runbook update or alert tuning. The Google SRE postmortem template is a sane starting point (Google SRE Book, Chapter 15, 2016).
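The 5-business-day deadline can be computed when the incident is closed so it lands on a calendar rather than in someone's memory. This is a minimal sketch: it counts Monday to Friday only and ignores public holidays, which is an assumption; `postmortem_due` is a hypothetical helper.

```python
from datetime import date, timedelta


def postmortem_due(incident_date, business_days=5):
    """Date that falls `business_days` working days after the incident.

    Weekends are skipped; holidays are not handled in this sketch.
    """
    due = incident_date
    remaining = business_days
    while remaining > 0:
        due += timedelta(days=1)
        if due.weekday() < 5:  # Monday is 0, Friday is 4
            remaining -= 1
    return due
```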

Keeping the runbook healthy

Three rules keep the runbook from rotting.

Every postmortem produces at least one runbook update. A new mitigation, a tightened threshold, a clarified escalation path. If a postmortem produces zero runbook changes, either the incident was already covered (great) or the team is not learning (a problem).

Quarterly dry-run. During one on-call rotation per quarter, the team runs through the runbook cold against a synthetic scenario. The exercise catches stale phone numbers, broken dashboard links, and mitigations whose commands no longer work.

Onboarding test. A new engineer takes the runbook and a fake page; can they resolve a class-3 incident without help? If not, the runbook is not yet good enough for them to take primary.

FAQ

What is an on-call runbook for an AI agent platform?
A written playbook the on-call engineer follows when an alert fires. It contains the alert taxonomy, triage steps, escalation paths, mitigation commands, and postmortem triggers.

What alerts should fire for an agent platform?
Error rate above threshold, latency p99 above threshold, model provider unhealthy, queue depth above SLO, and per-tenant anomalies. Each maps to a runbook section.

How long should an agent platform on-call rotation be?
One week is standard. Shorter rotations lose handoff context; longer ones drive pager fatigue.

What are the most common agent platform incidents?
Model provider outage, rate-limit exhaustion, prompt regression after deploy, vector index degradation, and tool integration failures.

When should the on-call escalate?
When the issue is outside the runbook, when the documented mitigation has not worked within the 15-minute budget, or when scope exceeds a single component.

What goes in the runbook versus the postmortem?
The runbook is for the next on-call engineer who hits the same alert. The postmortem is the team learning from a specific incident, and it produces runbook updates as artifacts.

Sources