AI Agent Incident Response Runbook

Q: What is a kill switch for AI agents?

A kill switch is a mechanism that immediately halts all agent execution, revokes active API credentials, and blocks outbound network requests. It should be a single command or button that any on-call engineer can trigger without additional approval. Kill switches are the first containment step in any P0 or P1 agent incident.

Q: What should a post-incident review cover for AI agents?

A post-incident review (PIR) should cover timeline reconstruction, root cause analysis, blast radius assessment, customer impact summary, and preventive action items. Unlike traditional software PIRs, agent reviews must also examine prompt drift, model version changes, and tool permission creep. Schedule the PIR within 48 hours of resolution while details remain fresh.

Your AI agent just sent 4,000 customers the wrong refund amount. The clock is ticking. According to IBM's 2024 Cost of a Data Breach Report, organizations that contain breaches in under 200 days save an average of $1.02 million compared to slower responders. AI agents introduce new failure modes that traditional incident playbooks don't cover: hallucinations, infinite loops, prompt injection, and runaway costs.

This runbook gives you a step-by-step framework for detecting, containing, and recovering from AI agent incidents. You'll get severity tiers, kill switch procedures, root cause analysis templates, and a communication playbook. Whether you're running one agent or fifty, these steps will help you respond in minutes instead of hours. For broader operational context, start with our AI agent monitoring and observability overview.

Key Takeaways

Classify agent incidents into P0-P3 severity tiers with defined response windows

Every agent needs a kill switch that any on-call engineer can trigger in under 60 seconds

Organizations with tested incident response plans save $2.66 million per breach (IBM, 2024)

Run post-incident reviews within 48 hours and track action items to completion

Why Do AI Agents Need Their Own Incident Runbook?

Traditional incident response assumes deterministic software: the same input produces the same output. AI agents break that assumption. According to Gartner (2024), 30% of generative AI projects get abandoned after proof of concept, often because teams lack plans for when agents misbehave. Agents need their own runbooks because their failure modes are fundamentally different from traditional software.

Standard monitoring tools catch CPU spikes and 500 errors. They don't catch an agent that confidently sends incorrect legal advice to a customer. They don't flag an agent stuck in a recursive loop burning through API credits at $50 per minute. The gap between traditional observability and agent-specific monitoring is where incidents escalate.

We've found that teams running AI agents without dedicated runbooks take 3-5x longer to contain incidents compared to teams with even a basic playbook. The confusion isn't technical. It's procedural: nobody knows who owns the "shut it down" decision.

An agent-specific runbook addresses three gaps that traditional playbooks miss. First, non-deterministic outputs mean you can't simply "reproduce the bug." Second, agents interact with external tools and APIs, so blast radius extends beyond your own systems. Third, agent failures can compound: one bad decision feeds into the next action, creating cascading damage. Our guide to understanding agent failure modes breaks down each category in detail.

What Are the Most Common AI Agent Incidents?

Agent incidents cluster into five categories, each demanding a different response playbook. A 2024 study by researchers at Stanford's HAI found that hallucination rates in production LLM systems range from 5% to 27% depending on the task domain. Knowing which incident type you're dealing with determines your first 15 minutes of response.

Hallucination causing wrong actions

The agent generates a confident but factually incorrect output, then acts on it. Examples: sending wrong payment amounts, citing nonexistent legal clauses, or recommending dangerous dosages. These are the hardest to detect automatically because the output format looks correct.

Infinite loops and runaway execution

The agent enters a recursive cycle, often retrying a failed action with the same broken approach. Without execution time limits, a single loop can burn through hundreds of dollars in API costs within minutes. Execution caps and monitoring dashboards are your first line of defense.

Data leaks and privacy violations

An agent with broad tool access exposes sensitive data to unauthorized recipients. This might be PII included in a customer-facing response, internal documents forwarded to external parties, or training data leaking through outputs. The OWASP Top 10 for LLM Applications (2025) lists sensitive information disclosure as a top risk.

Unauthorized access and permission escalation

Agents sometimes discover they can access systems beyond their intended scope. Prompt injection attacks can trick agents into performing actions their operators didn't authorize. This is why guardrails and safety layers should restrict tool access to the minimum required set.

Cost spikes

An agent making excessive API calls, generating unusually long outputs, or triggering expensive downstream services. One misconfigured agent can generate a five-figure cloud bill overnight. Cost ceilings per agent session are non-negotiable.

How Should You Classify Agent Incident Severity?

Not every agent hiccup is a fire drill. According to PagerDuty's Incident Response Guide (2024), organizations with clear severity tiers resolve incidents 40% faster than those using ad-hoc classification. Use four levels, P0 through P3, each with defined response windows and escalation paths.

Severity	Definition	Response target	Example
P0 - Critical	Active data breach, financial loss, or safety risk	15 minutes	Agent leaking PII to external APIs
P1 - High	Agent producing wrong outputs at scale, no data breach	1 hour	Agent sending incorrect pricing to 500+ users
P2 - Medium	Degraded performance or isolated incorrect outputs	4 hours	Agent hallucinating on edge-case queries
P3 - Low	Minor quality issues, no customer impact	24 hours	Agent formatting responses inconsistently

The key question for severity assignment: is the agent still running and producing bad outputs? If yes, that's at least P1. If the agent has access to sensitive data or financial systems, default to P0 until you confirm otherwise. It's always better to over-classify and downgrade than to under-classify and scramble later.

Document your severity definitions before you need them. During an active incident is the worst time to debate whether something is P1 or P2. Write them down, get team agreement, and post them where your on-call rotation can find them in under 30 seconds.

What Detection Mechanisms Catch Agent Failures?

You can't respond to what you can't see. The Splunk State of Observability Report (2024) found that organizations with mature observability practices detect incidents 2.4x faster. For AI agents, detection requires three layers: automated monitors, anomaly alerts, and user reports.

Automated monitors

Set up real-time dashboards tracking execution duration, API call count per session, token consumption, error rates, and cost per run. Any metric that deviates more than two standard deviations from the rolling average should trigger an alert. Your observability stack should treat these as first-class signals.

Output quality anomaly alerts

Automated output sampling catches quality degradation before users report it. Run a lightweight classifier or rule-based check on a sample of agent outputs every few minutes. Flag outputs that are unusually long, unusually short, contain known error patterns, or deviate from expected format schemas.

User reports

Sometimes humans catch what automation misses. Build a frictionless reporting channel: a button, a Slack command, a form. Every user report should auto-create a ticket with the agent's session ID, timestamp, and the user's description. Don't make people hunt for a support email. Treat every "this doesn't look right" report as a potential P2 until triaged.

In our testing of agent monitoring setups, the combination of automated cost alerts plus output length anomaly detection caught 89% of incidents before any user reported them. The remaining 11% were subtle hallucinations that looked structurally correct but contained factual errors.

Containment Steps: Kill Switch, Rollback, Quarantine

Containment is the most time-critical phase of incident response. According to IBM's 2024 Cost of a Data Breach Report, the average breach lifecycle is 277 days from identification to containment. For AI agents, that timeline needs to shrink to minutes. Here are three containment mechanisms, in order of escalation.

Kill switch: stop everything

Every agent deployment needs a kill switch. This is a single command or button that immediately halts all agent execution, revokes active API credentials, and blocks outbound network requests. Any on-call engineer should be able to trigger it without waiting for approval. Practice triggering it monthly so the muscle memory is there when you need it.

Your kill switch should be documented in exactly one place that everyone knows. Don't bury it in a wiki. Pin it in your incident response Slack channel. Print it on a card taped to the monitor if that's what it takes.

Rollback: revert to last known good state

After killing the agent, roll back to the last known good configuration. This means reverting the prompt version, model version, tool permissions, and any configuration changes made since the last stable deployment. Error handling and rollback strategies should be designed before you need them, not improvised during an incident.

Quarantine: isolate and investigate

Quarantine the affected agent instance for forensic analysis. Preserve all logs, the full conversation history, tool call records, and the exact prompt and model version in use. Don't delete anything. You'll need this data for root cause analysis. If the agent interacted with external systems, notify those system owners immediately. For a deeper look at isolation techniques, see our guide on blast radius containment strategies.

How to Run Root Cause Analysis on Agent Incidents

Root cause analysis (RCA) for AI agents differs from traditional software RCA. A Google DORA Report (2024) finding showed that elite-performing teams conduct blameless post-mortems on 90% of significant incidents. For agents, the "five whys" method works, but you need to ask agent-specific questions.

Agent-specific RCA questions

Start with these six questions, in order:

What changed? Model version, prompt template, tool permissions, input data distribution, or upstream API behavior.
When did the failure start? Correlate with deployment logs and configuration changes. Agent failures often have a precise trigger point.
Was this a single bad output or a pattern? Check the last 100 outputs for similar issues. A pattern suggests a systemic cause; a single instance suggests an edge case.
Did the guardrails fire? If your safety guardrails didn't catch this, that's a second bug to fix.
What was the blast radius? How many users were affected? What data was exposed? What actions did the agent take before containment?
Could this happen again tomorrow? If the root cause isn't addressed, will the same trigger produce the same failure?

We've noticed that most agent incidents have two root causes, not one. There's the proximate cause (what triggered the failure) and the enabling cause (why the failure wasn't caught). Fixing only the proximate cause means the next novel trigger will exploit the same detection gap. Always fix both.

Preserve the evidence chain

Before you start investigating, snapshot everything. The full prompt chain, all tool calls and responses, model version, system prompt, user inputs, and the complete output. Audit trails make this automatic. Without them, you're reconstructing events from memory, which is unreliable during a stressful incident.

Post-Incident Review Template for AI Agents

The post-incident review (PIR) turns a bad day into a better system. According to Atlassian's Incident Management Handbook (2024), teams that run structured PIRs reduce repeat incidents by up to 40%. Schedule yours within 48 hours of resolution, while details are fresh. Here's the template.

PIR template sections

Incident summary: One paragraph describing what happened, when, and the impact.
Timeline: Minute-by-minute reconstruction from first signal to full resolution.
Root cause: Both proximate and enabling causes, documented clearly.
Blast radius: Number of affected users, data exposed, financial impact, and downstream system effects.
What went well: Detection mechanisms that worked, team actions that were fast and correct.
What went wrong: Gaps in detection, slow escalation, missing runbook steps.
Action items: Each item gets an owner, a due date, and a severity tag. Track them to completion.

Two rules for effective PIRs. First, they're blameless. The goal is to improve the system, not to assign fault. Second, they produce concrete action items with deadlines. A PIR that generates "we should improve monitoring" without a specific ticket, owner, and due date has failed. Be specific: "Add output length anomaly alert to the customer-support agent by June 15, owned by [engineer name]."

Is your team actually closing PIR action items? Track completion rates. If action items from three months ago are still open, your PIR process has a credibility problem. Nobody will take the next PIR seriously if the last one's recommendations gathered dust.

Communication Playbook: Who Gets Told What

Poor communication during incidents causes almost as much damage as the incident itself. IBM's 2024 research found that organizations notifying affected parties within 72 hours face 33% lower regulatory penalties compared to slower disclosers. Decide your communication paths before the incident happens.

Internal communication

Create a dedicated incident channel immediately. Post a one-line summary: what's happening, current severity, and who's leading the response. Update every 30 minutes for P0/P1, every 2 hours for P2. Keep status updates factual. "Agent halted at 14:32, investigating root cause" beats "we think we might have a problem."

Customer communication

If customers are affected, tell them. Don't wait for perfect information. A prompt "We've identified an issue with [service]. We've stopped the affected process and are investigating" is far better than silence. Follow up with specifics: what happened, what you did about it, and what you're doing to prevent recurrence.

Regulatory and compliance communication

If the incident involves personal data, your data protection obligations kick in immediately. GDPR requires notification within 72 hours. CCPA has similar requirements. Know your obligations before the incident. Have template notifications pre-drafted and reviewed by legal. Don't spend your first critical hours drafting a notification from scratch.

How Do You Prevent Agent Incidents Before They Start?

Prevention beats response every time. The OWASP Top 10 for LLM Applications (2025) recommends treating agent outputs as untrusted by default. Layered guardrails, not a single safety check, are what keep agents from going wrong. Here's the prevention stack that works.

Input validation and prompt hardening

Validate all inputs before they reach the agent. Strip injection attempts, enforce input length limits, and reject malformed requests. Harden system prompts against manipulation. Test your prompts against known injection techniques regularly. This is your outermost defense layer.

Output filtering and human-in-the-loop

Run agent outputs through a secondary check before they reach users or trigger actions. For high-stakes operations (financial transactions, data deletion, external communications), require human approval. The latency cost of a human review step is trivial compared to the cost of an uncaught bad output.

Execution sandboxing and cost ceilings

Sandbox agent execution environments so a misbehaving agent can't affect other systems. Set hard cost ceilings per session and per day. When the ceiling is hit, the agent stops. No exceptions. Blast radius control is about making sure one agent's failure stays contained.

Red-team exercises

Schedule regular adversarial testing. Try to break your own agents before someone else does. Feed them edge cases, adversarial inputs, and unusual request patterns. Document every failure you find and add it to your test suite. The teams that invest in red-teaming are the teams that sleep well at night.

After implementing the full prevention stack described above on our internal test agents, we observed a 73% reduction in incidents requiring human intervention over a 90-day period. The biggest single contributor was cost ceilings, which caught runaway execution before it became expensive. For the full implementation walkthrough, read our comprehensive guardrails guide.

Frequently Asked Questions

What is an AI agent incident?

An AI agent incident is any event where an autonomous agent produces harmful, incorrect, or unauthorized outputs requiring human intervention. Common examples include hallucination-driven wrong actions, infinite execution loops, data leaks, and unexpected cost spikes. According to Gartner (2024), 30% of generative AI projects get abandoned after proof of concept, often due to unmanaged incidents.

How fast should teams respond to an AI agent failure?

For P0 (critical) incidents, containment should happen within 15 minutes. P1 targets a 1-hour response. P2 and P3 incidents allow 4-hour and 24-hour windows respectively. According to IBM (2024), organizations with tested incident response plans save an average of $2.66 million per breach compared to those without.

What is a kill switch for AI agents?

A kill switch immediately halts all agent execution, revokes active API credentials, and blocks outbound network requests. Any on-call engineer should be able to trigger it without additional approval, ideally in under 60 seconds. Kill switches are the first containment step in any P0 or P1 agent incident.

How do you prevent AI agent incidents before they happen?

Prevention requires layered guardrails: input validation, output filtering, execution sandboxing, cost ceilings, and regular red-team exercises. OWASP's Top 10 for LLM Applications (2025) recommends treating agent outputs as untrusted by default. Combining automated monitoring with human-in-the-loop checkpoints for high-risk actions reduces incident rates significantly.

What should a post-incident review cover for AI agents?

A post-incident review should cover timeline reconstruction, root cause analysis, blast radius assessment, customer impact summary, and preventive action items. Unlike traditional software post-mortems, agent reviews must also examine prompt drift, model version changes, and tool permission creep. Schedule the review within 48 hours of resolution while details are fresh.