The model is not the safety system. The model is the part you are deploying. Every control that protects users, data, and the bill from the model goes around it, not inside it. This guide is the runtime controls playbook for production AI agents: four layers, the patterns that work, and the failures each layer catches.
For the conceptual frame on what guardrails are, see AI agent safety and guardrails. This piece is the operations counterpart, with concrete patterns and references.
The four layers of agent guardrails
Effective agent safety is layered. No single guardrail catches everything; the combination is what works.
| Layer | What it does | Catches |
|---|---|---|
| Input filters | Inspect user and tool inputs before the model sees them | Prompt injection, PII exfiltration attempts, jailbreaks |
| Output classifiers | Inspect model outputs before they reach the user or downstream system | Policy violations, data leaks, hallucinated structured output |
| Action allowlists | Enforce per-tool limits in code outside the model | Unauthorized actions, scope creep, runaway loops |
| Blast-radius caps | Cap dollar spend, action counts, and lateral reach per run | Cost bombs, mass-action mistakes, breach amplification |
Input filters
Input filters inspect every message before the model sees it. They block obvious injection patterns, redact PII when policy requires it, and reject malformed or oversized payloads. The OWASP LLM Top 10 (2024) lists prompt injection as the number-one risk for LLM applications.
What input filters actually catch. Regex for known prompt-injection markers ("ignore previous instructions", a closing delimiter followed by a fake reopening, repeated Unicode trickery). Length checks (payloads above N kilobytes are likely document-paste attacks). Content-type checks (a tool result that should be JSON but is plain text triggers an alert).
What input filters cannot catch. Sophisticated injections embedded in legitimate content. This is why input filtering is one layer, not the only layer. Pair it with output classifiers and action allowlists; injection that makes it through the filter still has to bypass the downstream controls to do damage.
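The three checks described above (pattern, length, content type) can be sketched in a few lines. This is a minimal illustration, not a complete filter: the names `filter_input` and `FilterResult`, the 32 KB limit, and the two patterns are assumptions chosen for the example.

```python
import json
import re
from dataclasses import dataclass

# Hypothetical limit and patterns; tune them against your own traffic.
MAX_PAYLOAD_BYTES = 32 * 1024
INJECTION_PATTERNS = [
    re.compile(r"ignore\s+(all\s+)?(previous|prior)\s+instructions", re.IGNORECASE),
    re.compile(r"^\s*system\s*:", re.IGNORECASE | re.MULTILINE),
]


@dataclass
class FilterResult:
    allowed: bool
    reason: str = ""


def filter_input(text: str, expect_json: bool = False) -> FilterResult:
    """Run cheap checks on a user message or tool result before the model sees it."""
    # Length check: oversized payloads are rejected outright.
    if len(text.encode("utf-8")) > MAX_PAYLOAD_BYTES:
        return FilterResult(False, "payload_too_large")

    # Content-type check: a tool result declared as JSON must parse as JSON.
    if expect_json:
        try:
            json.loads(text)
        except json.JSONDecodeError:
            return FilterResult(False, "expected_json")

    # Pattern check: known injection markers.
    for pattern in INJECTION_PATTERNS:
        if pattern.search(text):
            return FilterResult(False, "injection_marker")

    return FilterResult(True)
```

A filter like this only catches the obvious cases, which is why the result feeds the downstream layers rather than replacing them.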
Output classifiers
Output classifiers inspect what the model produces before it leaves the agent. Two kinds matter in practice.
Policy classifiers check for prohibited content categories: PII leaks, hate speech, code execution requests embedded in normal responses. Most are themselves ML models. OpenAI's Moderation API (free tier) and Anthropic's Claude with system-prompt guardrails are both used in production for this layer (OpenAI Moderation).
Format validators check that structured outputs match the declared schema. For example, the schema declares amount as a string such as '100 USD', but the model returns "{name: 'x', amount: 100}" instead of "{name: 'x', amount: '100 USD'}". The validator rejects the output and either retries with the schema error included or escalates. JSON Schema with a fast validator (Ajv for JS, Pydantic for Python) takes about an hour to wire in and prevents an entire class of downstream parse errors.
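The validate, retry, escalate loop can be sketched with the standard library alone. This is an illustration under stated assumptions: `REFUND_SCHEMA`, `validate`, and `generate_validated` are hypothetical names, the hand-written type check stands in for a real validator such as Pydantic or a JSON Schema library, and `call_model` is whatever function you use to invoke the model.

```python
import json
from typing import Callable

# Hypothetical schema: field name -> required Python type after JSON parsing.
# Here amount must be a string such as "100 USD".
REFUND_SCHEMA = {"name": str, "amount": str}


def validate(raw: str, schema: dict) -> list[str]:
    """Return a list of schema errors; an empty list means the output is valid."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value must be an object"]
    errors = []
    for field, expected in schema.items():
        if field not in data:
            errors.append(f"missing field '{field}'")
        elif not isinstance(data[field], expected):
            errors.append(f"field '{field}' has wrong type")
    return errors


def generate_validated(
    call_model: Callable[[str], str], prompt: str, max_attempts: int = 2
) -> dict:
    """Call the model, validate, retry with the errors included, then escalate."""
    errors: list[str] = []
    for _ in range(max_attempts):
        # On a retry, include the schema errors so the model can correct itself.
        full_prompt = prompt if not errors else f"{prompt}\n\nFix these schema errors: {errors}"
        raw = call_model(full_prompt)
        errors = validate(raw, REFUND_SCHEMA)
        if not errors:
            return json.loads(raw)
    # Out of attempts: hand off to a human or an error path.
    raise ValueError(f"escalate: output failed validation: {errors}")
```

Capping the retries matters: an unbounded retry loop on a schema the model cannot satisfy becomes its own cost problem.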
Action allowlists
The highest-leverage runtime control. An action allowlist is a declared list of every tool the agent can call, with explicit per-tool limits, enforced in code outside the model.
What an allowlist entry looks like.
- Tool name.
- Allowed scopes: which tenants, which users can trigger this tool.
- Quantitative limits: max recipients, max dollar amount, max records affected per call.
- Approval requirement: whether human approval is needed (always, or above a threshold).
- Rate limit: max calls per minute, per hour, per day.
The control is enforced regardless of what the model says. The model can call send_email with recipients=2000; the runtime rejects the call because the allowlist caps recipients at 100. The model can call refund with amount=10000; the runtime blocks the call until a human approves it, because the allowlist requires approval above $500.
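A minimal sketch of the two examples above, enforced in code outside the model. `AllowlistEntry`, `authorize`, `DisallowedAction`, and `ApprovalRequired` are hypothetical names for this illustration, recipients are assumed to arrive as a list, and scopes and rate limits are left out to keep it short.

```python
from dataclasses import dataclass
from typing import Optional


class DisallowedAction(Exception):
    """The call violates a hard limit or names a tool that is not allowlisted."""


class ApprovalRequired(Exception):
    """The call is allowed only after explicit human approval."""


@dataclass(frozen=True)
class AllowlistEntry:
    tool: str
    max_recipients: Optional[int] = None        # quantitative limit
    approval_above_usd: Optional[float] = None  # approval threshold


# Hypothetical allowlist mirroring the examples in the text.
ALLOWLIST = {
    "send_email": AllowlistEntry("send_email", max_recipients=100),
    "refund": AllowlistEntry("refund", approval_above_usd=500),
}


def authorize(tool: str, args: dict, approved: bool = False) -> None:
    """Raise unless the requested tool call is within its declared limits."""
    entry = ALLOWLIST.get(tool)
    if entry is None:
        raise DisallowedAction(f"{tool} is not on the allowlist")

    recipients = args.get("recipients", [])
    if entry.max_recipients is not None and len(recipients) > entry.max_recipients:
        raise DisallowedAction(f"{tool}: {len(recipients)} recipients exceeds cap")

    amount = args.get("amount", 0)
    if (
        entry.approval_above_usd is not None
        and amount > entry.approval_above_usd
        and not approved
    ):
        raise ApprovalRequired(f"{tool}: amount {amount} needs human approval")
```

The runtime would call `authorize` before dispatching any tool call, so the decision never depends on what the model claims about its own permissions.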
For deeper patterns on limiting agent actions, see how to limit agent actions.
Blast-radius caps
Blast-radius caps limit how much damage a single agent run can do. They sit one layer above the action allowlist.
Three caps in production. Per-run cap: max actions per run, max spend per run, max records modified per run. Per-day cap: max runs per agent per day, max spend per agent per day. Per-tenant cap: max spend per tenant per month, max lateral reach (how many users one agent can affect).
The per-run cap is the most important. It catches runaway loops, prompt-injection-induced floods, and the "I meant to delete one row but the agent deleted ten thousand" class of incident. Set it conservatively at first; loosen only when production data shows the cap blocking real, legitimate usage.
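The per-run cap can be sketched as a small budget object that every action is charged against before it executes. `RunBudget`, `CapExceeded`, and the default limits are assumptions for illustration; per-day and per-tenant caps would follow the same shape with counters kept in a shared store.

```python
from dataclasses import dataclass


class CapExceeded(Exception):
    """Raised when an action would push the run past one of its caps."""


@dataclass
class RunBudget:
    # Hypothetical conservative defaults; loosen only with production data.
    max_actions: int = 20
    max_spend_usd: float = 5.0
    max_records: int = 100
    actions: int = 0
    spend_usd: float = 0.0
    records: int = 0

    def charge(self, spend_usd: float = 0.0, records: int = 0) -> None:
        """Check an action against the caps before it runs, then record it."""
        if self.actions + 1 > self.max_actions:
            raise CapExceeded("per-run action cap reached")
        if self.spend_usd + spend_usd > self.max_spend_usd:
            raise CapExceeded("per-run spend cap reached")
        if self.records + records > self.max_records:
            raise CapExceeded("per-run record cap reached")
        # Only count the action once all checks have passed.
        self.actions += 1
        self.spend_usd += spend_usd
        self.records += records
```

Checking before the action runs, rather than after, is the design choice that matters: a delete that would touch ten thousand rows is refused instead of being noticed afterwards.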
Prompt injection defenses
Prompt injection exploits the inability of language models to reliably distinguish trusted instructions (your system prompt) from untrusted data (tool results, user input). It is the number-one LLM-application risk per OWASP (2024) and the subject of active research at Anthropic and Google (Anthropic Constitutional AI, 2023; Google DeepMind prompt injection work, 2024).
Defenses in layers.
Treat tool results as untrusted. Render them as data, never let them carry instructions.
Approval boundaries on side-effects. Any tool with a permanent effect (sending money, sending email outside the org, modifying records) requires explicit user approval, not "the agent decided to."
Separation of trust. The system prompt and user instructions get one trust level; tool outputs get another. Prompts that try to elevate themselves through markup or "system:" prefixes are stripped.
Monitoring. Log every tool call with the input source. A spike in tool calls that traces back to a single tool result is the canonical injection signature.
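Two of these defenses lend themselves to a short sketch: packaging tool results as untrusted data with role prefixes stripped, and logging each tool call against the input that triggered it. `wrap_tool_result`, `ToolCallLog`, the message shape, and the spike threshold are all hypothetical; a real system would adapt them to its own message format and alerting.

```python
import re
from collections import Counter

# Strips line-leading role prefixes such as "system:" from untrusted content.
ROLE_PREFIX = re.compile(
    r"^\s*(system|assistant|developer)\s*:\s*", re.IGNORECASE | re.MULTILINE
)


def wrap_tool_result(tool: str, content: str) -> dict:
    """Package a tool result as untrusted data rather than as an instruction."""
    cleaned = ROLE_PREFIX.sub("", content)
    return {"role": "tool", "trust": "untrusted", "tool": tool, "data": cleaned}


class ToolCallLog:
    """Counts tool calls per originating input to surface injection spikes."""

    def __init__(self, spike_threshold: int = 5):
        self.spike_threshold = spike_threshold
        self.calls_by_source = Counter()

    def record(self, tool: str, source_id: str) -> bool:
        """Log a call with its input source; return True once it looks like a spike."""
        self.calls_by_source[source_id] += 1
        return self.calls_by_source[source_id] > self.spike_threshold
```

Stripping prefixes does not make the content safe; it only removes one cheap elevation trick. The approval boundary on side-effects remains the control that limits damage when an injection gets through.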
Kill switch and human override
Every production agent needs three operator controls.
Global halt. One toggle that stops every agent run in flight. Used during incidents.
Tenant suspend. Pause runs for one tenant without affecting others. Used when a tenant exhibits anomalous behavior or asks for it.
Tool revoke. Disable a specific tool across all agents without redeploying. Used when an integration goes bad on the provider side, or when a vulnerability is discovered.
These controls are dormant most of the time. They are also the difference between a five-minute incident and a five-hour incident the first time you need them. Test the kill switch in a drill before you need it for real.
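The three controls can share one check that the runtime calls before every tool call. This is a sketch under assumptions: `OperatorControls` and `RunBlocked` are hypothetical names, and the flags live in memory here, whereas a production system would read them from a shared store so that a toggle reaches every worker.

```python
from dataclasses import dataclass, field


class RunBlocked(Exception):
    """Raised when an operator control forbids the next step of a run."""


@dataclass
class OperatorControls:
    global_halt: bool = False
    suspended_tenants: set = field(default_factory=set)
    revoked_tools: set = field(default_factory=set)

    def check(self, tenant_id: str, tool: str) -> None:
        """Call before every tool call so in-flight runs stop at their next step."""
        if self.global_halt:
            raise RunBlocked("global halt is active")
        if tenant_id in self.suspended_tenants:
            raise RunBlocked(f"tenant {tenant_id} is suspended")
        if tool in self.revoked_tools:
            raise RunBlocked(f"tool {tool} is revoked")
```

Checking per tool call rather than per run is what lets global halt affect runs that are already in flight.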
For broader coverage of security in production, see AI agent security best practices.
Testing guardrails
Guardrails that have never been tested are guardrails you do not have. Three test categories belong in CI for every production agent.
Allowlist negative tests. For every tool with a quantitative limit, write a test that attempts to exceed the limit and verifies the runtime rejects it. The model can be told to call send_email with recipients=2000; the test asserts the runtime returns the disallowed-action error.
Prompt-injection regression suite. Maintain a labeled set of injection attempts that previously broke the agent. Each attempt is a test case that should not produce the injected behavior. New injections discovered in production get added to the suite. The suite runs on every prompt change.
Kill-switch drill. Once per quarter, exercise the kill switch in a staging environment under realistic load. Verify global halt actually stops in-flight runs, tenant suspend isolates correctly, and tool revoke takes effect within the SLA. The drill catches the configuration drift that makes the kill switch slower than expected when you finally need it.
Coverage of these three categories distinguishes agents that pass an audit from agents that hope to pass an audit.
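An allowlist negative test can be as small as the sketch below, written with the standard `unittest` module. The guard function `send_email_guard`, the `DisallowedAction` error, and the limit of 100 are stand-ins for whatever your runtime exposes; the point is that the test exercises the runtime check directly and does not depend on the model.

```python
import unittest

MAX_RECIPIENTS = 100  # hypothetical limit taken from the allowlist


class DisallowedAction(Exception):
    """Stand-in for the runtime's disallowed-action error."""


def send_email_guard(recipients: list) -> None:
    """Stand-in for the runtime check that runs before send_email executes."""
    if len(recipients) > MAX_RECIPIENTS:
        raise DisallowedAction("recipient cap exceeded")


class AllowlistNegativeTests(unittest.TestCase):
    def test_rejects_over_limit(self):
        # The model asked for 2000 recipients; the runtime must refuse.
        with self.assertRaises(DisallowedAction):
            send_email_guard(["user@example.com"] * 2000)

    def test_allows_at_limit(self):
        # Boundary case: exactly at the cap is still allowed.
        send_email_guard(["user@example.com"] * MAX_RECIPIENTS)
```

Pairing each negative test with a boundary test at the limit guards against the opposite failure, where a cap is tightened by accident and blocks legitimate usage.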
Frequently asked questions
What are AI agent guardrails?
Runtime controls applied outside the model: input filters, output classifiers, action allowlists, and blast-radius caps.
Are model-level safety filters enough for an AI agent?
No. Model filters block obvious harms; application-layer guardrails enforce app-specific rules the model cannot.
What is the most important AI agent guardrail?
An action allowlist with explicit per-tool limits enforced in code outside the model.
How do I prevent prompt injection in an AI agent?
Treat tool results as untrusted, require explicit approval for side-effects, separate trust levels, and monitor for injection signatures.
Should AI agents have a kill switch?
Yes. Every production agent needs global halt, tenant suspend, and tool revoke. Test them in a drill before you need them.
Three things to ship this week
- Write the allowlist. Every tool, every quantitative limit, enforced in code.
- Set blast-radius caps: per-run actions and spend, per-tenant daily limit.
- Add the kill switch with global halt, tenant suspend, and tool revoke.
Sources
- OWASP Foundation, "OWASP Top 10 for Large Language Model Applications", 2024, owasp.org
- Anthropic, "Responsible Scaling Policy", 2024, anthropic.com
- Anthropic, "Building Effective Agents", 2024, anthropic.com
- OpenAI, "Moderation API guide", platform.openai.com
- NIST, "AI Risk Management Framework", 2023, nist.gov