Most security guides written for large language models stop at the prompt boundary. They assume a single completion, no tools, no state, no autonomy. That model has not described production deployments for at least eighteen months. An AI agent reads from untrusted retrieval surfaces, invokes write-capable tools, accumulates memory across runs, and chains many calls into a single committed action. Each of those properties changes the threat model.

This playbook is the security baseline I would hand to an engineering lead shipping their first agent to production in 2026. It maps the OWASP LLM Top 10 (2025) onto the agent surface, names the controls that earn their keep, and points out where the cheap defenses are. The reader should leave able to draft a threat model, scope permissions per task, and drill a stop in staging by next Friday.

Why agent security is different from LLM security

An LLM call has a closed input-output boundary: you send a prompt, you get a completion, and the surrounding application decides what to do with it. The attack surface is the prompt and the output. With an agent, the model itself decides what to call next. Anthropic's engineering team describes this loop directly: "Agents are systems where LLMs dynamically direct their own processes and tool usage, maintaining control over how they accomplish tasks" (Anthropic, 2024). That dynamic direction is the threat surface that single-prompt guides do not cover.

Three properties drive the difference. First, agents read from untrusted surfaces. A retrieved document, a Slack message, a webhook body, or a search result can carry instructions that the model will read as part of its working prompt. Second, agents invoke side effects. A tool call to send an email, post a refund, or grant access has a real-world consequence that is hard to reverse. Third, agents persist state. Memory across runs means an attack can plant a payload today and detonate it next week.

The implication is that you cannot defend an agent purely at the prompt layer. You must defend it at the identity, tool, memory, and audit layers as well.

Identity and least privilege

Identity is the most consequential lever. Every agent run should hold a credential set scoped exactly to what that run needs, no broader.

Per-run scoped tokens

Issue short-lived tokens (15-60 minutes) per run rather than long-lived service accounts. If the run is compromised, the blow-out time is bounded. NIST's AI Risk Management Framework calls this out as a baseline control for managed AI systems (NIST AI RMF, 2023).

Tool grant orthogonal to capability grant

Do not let buying a capability auto-grant a tool. An agent that can summarise emails should not, by default, hold a Gmail send token. Grant write tools explicitly, per-agent, with a documented business justification logged at grant time.

Sandboxed credentials for high-risk actions

Tools that move money, send communication, or grant access run from a sandbox identity, never from the agent's primary identity. The sandbox identity has rate limits, allow-listed recipients (for email and webhook tools), and a forced verification step.

Blast radius controls

Even with strict identity scoping, agents will sometimes do the wrong thing. Blast radius controls cap how much damage one wrong action can cause.

Spend caps per run and per agent

Every run carries a maximum spend, beyond which the agent halts and escalates. Budgets apply to LLM tokens, tool invocation counts, and external API charges. The cap is half a backstop against runaway loops and half a backstop against intentional cost-bomb attacks via prompt injection.

Irreversible operations require a human

Refunds, deletes, sends to large audiences, account changes. These are the actions that cannot be rolled back cheaply. The agent should be allowed to draft them and stage them; a human approves the commit. The cost of one extra click is far less than the cost of one wrong refund.

Allow-lists for high-impact targets

Outbound email goes only to addresses on the allow-list. Outbound webhooks go only to registered URLs. Database deletes apply only to tables tagged as agent-writable. Each allow-list is a kill switch against payloads that try to redirect output.

Prompt injection defense

Prompt injection is the LLM01 risk in the OWASP LLM Top 10 and the single most reliable attack vector on agents in 2026 (OWASP, 2025). The attacker plants instructions inside content the agent will read.

Separate instructions from data

The system prompt holds instructions. Retrieved content, user input, and tool output go into a separate, labelled section of the prompt. Use structural delimiters (e.g., XML-style tags) so the model can be trained or prompted to treat the data block as untrusted by default.

Output classifier before tool calls

Before any tool call, run the proposed action through a deterministic classifier (regex plus small model) that flags suspicious patterns: instructions to a different audience, unexpected recipients, base64 payloads, attempts to extract credentials. The classifier blocks at the orchestration layer, not the model layer.

Read-back verification on irreversible actions

For any irreversible action, the agent should re-read the action back in plain language and require a deterministic confirmation step (human or signed token). This breaks the chain on injection attempts that try to fly an action under the model's reasoning.

Secrets and memory hygiene

Agent memory is a credential surface that did not exist in the LLM era. A persistent vector store that holds notes across runs can also hold leaked secrets if you let it.

Never log raw prompts

Raw prompts often contain API responses, tokens, customer data. Hash or redact secrets at log time. The audit trail wants the structure of the call, not the payload.

Rotate secrets on schedule

Tool credentials rotate at a defined cadence (30 days for most, 7 days for high-risk targets). Rotation is automated; manual rotation is a known failure mode. CISA's guidance on identity hygiene calls out automated rotation as the baseline (CISA, 2024).

Encryption at rest with per-tenant key isolation

Memory stores hold per-tenant data. Keys are per-tenant, managed by a KMS. A breach of one tenant's key does not expose the others. This also makes data-residency claims (EU-resident data stays EU-resident) enforceable rather than aspirational.

Audit trails and forensics

When (not if) something goes wrong, the audit trail decides how quickly you can answer "what happened, what did it touch, can we undo it". Treat it as a first-class system, not a side effect.

Hash-chained, tamper-evident logs

Each audit entry includes a hash of the previous entry. An attacker who edits an entry breaks the chain, which a daily verifier catches. The chain itself is replicated to write-once storage.

Structured spans per tool call

Every tool call emits a structured event: who, what, when, input hash, output hash, latency, outcome. The OpenTelemetry GenAI semantic conventions are the right schema to standardise on (OpenTelemetry GenAI semantic conventions, retrieved 2026). For the full observability picture see AI agent monitoring and observability.

Retention long enough for incident response

The 2024 IBM Cost of a Data Breach report puts mean time to identify a breach at 194 days and mean time to contain at 64 days (IBM Cost of a Data Breach, 2024). Audit retention shorter than that leaves you blind. Twelve months is the practical floor.

Kill switch and incident response

A documented stop-time SLA separates a real production agent from a demo. The kill switch halts in-flight tool calls, revokes active tokens, freezes the agent from new runs, and writes a sealing audit entry.

Drill the stop quarterly

Game-day the kill switch. Trigger a stop while an agent is in flight; measure how long it actually takes to halt all running calls and confirm tokens are revoked. The drill catches the failure modes you cannot find by reading the spec.

Runbook for the first 60 minutes

Who declares the incident, who pulls the kill switch, who notifies customers, who preserves logs. Pre-write the runbook so the first hour does not depend on someone improvising.

Post-incident review

Every incident produces a written postmortem with timeline, root cause, and concrete remediation. The remediation gets a ticket. The ticket gets shipped. This is the discipline that compounds.

OWASP LLM Top 10 mapped to agents

The OWASP LLM Top 10 (2025) is the consensus baseline for LLM application risks. For agents, the mapping is direct but each item gains additional considerations.

OWASP itemAgent-specific extension
LLM01 Prompt InjectionRetrieved content and tool output are the dominant vectors, not user input.
LLM02 Sensitive Information DisclosureMemory store leakage; audit trail leakage; per-tenant isolation matters.
LLM03 Supply ChainTool and integration registry must be vetted. Third-party MCP servers are a vector.
LLM04 Data and Model PoisoningMemory poisoning across runs is the agent-native form.
LLM05 Improper Output HandlingTool argument validation; structured output schema enforcement.
LLM06 Excessive AgencyHighest-impact for agents. Least privilege and tool grant orthogonality.
LLM07 System Prompt LeakageMulti-tool agents have multiple prompts; each is a leak surface.
LLM08 Vector and Embedding WeaknessesMemory store is a primary embedding surface. Index hygiene matters.
LLM09 MisinformationAgent-generated reports and customer-visible content require verification.
LLM10 Unbounded ConsumptionSpend caps, rate limits, loop detection.

For the deeper view on safe-by-default guardrail patterns see AI agent safety and guardrails and AI agent failure modes.

Frequently asked questions

What are the top AI agent security risks in 2026?

Prompt injection through retrieved content, over-permissioned tool calls, secret leakage in agent memory, and silent action drift when an upstream API changes shape. OWASP LLM01 prompt injection and LLM06 excessive agency are the two with the highest blast radius in production agent deployments.

Does OWASP LLM Top 10 cover agents?

Partially. The 2025 list names excessive agency explicitly as LLM06. Agent-specific risks such as multi-step tool chaining and persistent memory are extensions of LLM01 and LLM06. Use OWASP as the baseline and layer agent-specific controls on top.

How do I defend against prompt injection in an agent?

Treat every retrieved or tool-returned token as untrusted. Separate instructions from data in the prompt template. Strip control sequences. Run an output classifier before tool calls. For high-blast-radius actions, require a deterministic verification step before commit.

What is blast radius for an AI agent?

Blast radius is the maximum damage a single agent action can cause if it executes incorrectly. Reducing blast radius means scoping permissions per task, capping spend per run, requiring human approval for irreversible operations, and using sandboxed credentials with short-lived tokens.

Should every AI agent have a kill switch?

Yes. A documented stop-time SLA is non-negotiable for production agents. The kill switch should be a single command or button that halts all in-flight tool calls, revokes active tokens, and writes a final audit entry. Verify the SLA in staging by drilling the stop quarterly.

Three controls to ship this week

Sources