The first time an agent does the wrong thing in production is the day a trust model becomes a budget line. Every team eventually writes one. The question is whether you write it before the incident or after. This post lays out the four standard levels, the variables that decide which level a given action belongs at, the audit-trail bar, and the recovery pattern after the first time something goes wrong.

The framing draws on NIST's AI Risk Management Framework, which makes the trust-vs-blast-radius trade-off explicit (NIST AI RMF, retrieved 2026-05-09). The same shape appears in the EU AI Act's risk-tier framing for high-risk systems and in OWASP's LLM Top 10 (OWASP LLM Top 10, retrieved 2026-05-09). Different vocabularies, same operational structure.

What a trust model is

A trust model is the policy that decides what an agent is allowed to do unilaterally, what it must propose for approval, and what is forbidden. It is not the agent's prompt and it is not the tool catalogue. It is a layer above both. The prompt tells the agent how to think; the catalogue tells it what tools exist; the trust model tells the orchestrator whether the agent is allowed to call a tool right now without a human in the loop.
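To make the layer concrete, here is a minimal Python sketch of the idea. The TrustModel class, its decide method, and the tool names are illustrative assumptions, not a reference implementation; the point is that the orchestrator, not the prompt, consults this table before every call.

```python
from enum import IntEnum


class TrustLevel(IntEnum):
    READ_ONLY = 1          # observe only: the tool has no side effects
    SUGGEST = 2            # the agent drafts, a human applies
    APPROVE_THEN_ACT = 3   # the agent executes, but only after explicit confirmation
    AUTONOMOUS = 4         # the agent executes inside a guardrail, no per-step approval


class TrustModel:
    """The layer above the prompt and the tool catalogue: the orchestrator asks it
    whether a proposed tool call may run right now without a human in the loop."""

    def __init__(self, policy: dict[str, TrustLevel]) -> None:
        self._policy = policy

    def decide(self, tool: str, human_approved: bool = False) -> str:
        if tool not in self._policy:
            return "refuse"  # anything outside the model is forbidden by default
        level = self._policy[tool]
        if level == TrustLevel.AUTONOMOUS:
            return "execute"
        if level == TrustLevel.APPROVE_THEN_ACT:
            return "execute" if human_approved else "await_approval"
        if level == TrustLevel.SUGGEST:
            return "return_draft_to_human"
        return "execute_read_only"  # side-effect-free reads are always allowed


# Hypothetical policy table; the prompt never sees or overrides it.
model = TrustModel({
    "crm.read_contact":  TrustLevel.READ_ONLY,
    "email.draft_reply": TrustLevel.SUGGEST,
    "crm.update_field":  TrustLevel.APPROVE_THEN_ACT,
    "email.apply_label": TrustLevel.AUTONOMOUS,
})
assert model.decide("crm.update_field") == "await_approval"
assert model.decide("payments.issue_refund") == "refuse"
```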

Most teams skip the explicit model and assume "the agent will be careful." The first incident teaches them that careful is a property of the policy, not the agent. The cluster post on agent safety and guardrails covers the technical guardrails; this post is about the governance layer that decides when the guardrails are enough to act unilaterally.

The four levels in detail

Level 1: read-only

The agent observes but takes no action. It can read a CRM, summarise an inbox, query a dashboard. It cannot send messages, change records, or move money. Useful for tasks where the value is the synthesis (a Monday morning status digest, a competitor-tracking report) and the consequences of acting are not worth the approval cost.

Level 2: suggest

The agent drafts an action and presents it to a human, who decides whether to apply it. The agent does not push the button itself. Most production agent products start here: customer support draft replies, sales follow-up email drafts, code review comments. The level is honest: the agent is a producer, the human is a publisher.

Level 3: approve-then-act

The agent proposes a specific action and waits for explicit confirmation, then executes. Different from level 2 because the agent is the executor; the human is the gate. Used when execution itself is non-trivial (touching multiple systems, dealing with auth) and the agent doing it once-approved is materially better than the human doing it manually after seeing the suggestion.

Level 4: autonomous

The agent acts without per-step approval, inside a guardrail. Reserved for low-blast-radius, reversible, high-frequency actions: tagging email, categorising expenses, updating a watch list. The guardrail is essential: hard limits on volume, allow-listed domains, monetary caps. Without the guardrail, level 4 is gambling.
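One way to make "inside a guardrail" concrete is a check that runs before every autonomous action. A minimal sketch follows; the Guardrail class, its limits, and the domain names are illustrative assumptions, not recommended values.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone


@dataclass
class Guardrail:
    """Hard limits checked before every autonomous (level 4) action."""
    max_actions_per_hour: int = 200
    allowed_domains: frozenset = frozenset({"internal.example.com"})
    monetary_cap: float = 0.0                 # autonomous actions move no money by default
    _recent: list = field(default_factory=list)

    def permits(self, domain: str, amount: float = 0.0) -> bool:
        now = datetime.now(timezone.utc)
        self._recent = [t for t in self._recent if now - t < timedelta(hours=1)]
        if len(self._recent) >= self.max_actions_per_hour:
            return False                      # volume cap: fail closed and escalate
        if domain not in self.allowed_domains:
            return False                      # allow-list, never a block-list
        if amount > self.monetary_cap:
            return False
        self._recent.append(now)
        return True


rail = Guardrail(allowed_domains=frozenset({"crm.example.com"}))
assert rail.permits("crm.example.com") is True
assert rail.permits("unknown.example.net") is False
```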

The autonomy spectrum is a continuous gradient, not a step function. The framework in autonomous vs assistive AI scores agents on five autonomy axes; the four-level shorthand here is a coarser version of the same idea designed for fast policy decisions.

How to decide which actions need approval

Three variables drive the decision.

Blast radius. If the action goes wrong, how many people, accounts, or records does it touch? Sending one email is small radius. Updating every customer's billing email is large radius. Posting to a shared Slack channel is mid radius. Map each tool in the agent's catalogue to a blast-radius score before deciding the trust level.

Reversibility. Can the action be undone? Sending an email is irreversible (the recipient saw it). Updating a CRM field is reversible (write the old value back) but only if the audit trail captured the old value. Initiating a wire transfer is irreversible past a deadline that varies by jurisdiction. When reversibility is low, cut the action's trust level by one.

Frequency. How often does the action happen? A one-off action (annual contract renewal) deserves human approval because the cost of approval per occurrence is low. A thousand-per-day action (categorising expenses) cannot be human-approved at scale and must be autonomous-with-guardrail or it is not worth automating.

Combine the three: high blast radius and irreversible and low frequency means level 1 or 2. Low blast radius and reversible and high frequency means level 4. Most actions sit somewhere in between; the decision matrix for that middle ground is your team's to write down.
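As a rough sketch of how the combination might be encoded, assuming Python and illustrative thresholds (the recommended_level function and its cut-offs are assumptions, not a standard):

```python
def recommended_level(blast_radius: str, reversible: bool, calls_per_day: float) -> int:
    """Combine the three variables into a starting trust level (1-4).
    blast_radius is "low", "medium" or "high"; thresholds are illustrative."""
    if blast_radius == "high" and not reversible:
        return 1                              # observe or draft; never unilateral
    if blast_radius == "low" and reversible and calls_per_day >= 100:
        return 4                              # autonomous-with-guardrail, or don't automate
    if not reversible:
        return 2                              # irreversible: the human publishes
    return 3                                  # reversible but consequential: approve-then-act


# Wire transfer: high blast radius, irreversible, rare -> level 1.
assert recommended_level("high", reversible=False, calls_per_day=0.01) == 1
# Expense categorisation: low radius, reversible, thousands per day -> level 4.
assert recommended_level("low", reversible=True, calls_per_day=1000) == 4
```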

Audit-trail requirements

Audit trails are not optional and they are not log files. The audit trail is the system of record for what the agent did, in a form that survives system migrations, lawsuits, and SOC2 audits. The minimum field set is six items:

  1. Timestamp in ISO 8601 with timezone, monotonically ordered so the sequence of actions can be reconstructed.
  2. Agent identifier linking to the prompt version, model name and version, and tool catalogue version.
  3. Goal context capturing the user request and any state that informed the agent's decision.
  4. Tool called with namespace and version.
  5. Full input arguments as the agent passed them to the tool.
  6. Result returned as the tool returned to the agent, including error states.

The trail must be append-only and queryable. Append-only because mutability invalidates the trail's value as evidence. Queryable by user, action type, and time window because those are the three lenses incident review uses. Retention is industry-specific: financial actions typically seven years, healthcare six years, general contracts three to seven. Check the regulatory requirements for the data the agent touches.
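A minimal sketch of one record and an append-only write, assuming Python and a JSONL file as the simplest stand-in for an append-only store; the field names, the agent identifier, and the example values are hypothetical.

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


@dataclass(frozen=True)
class AuditRecord:
    """The six minimum fields, one record per tool call."""
    timestamp: str      # ISO 8601 with timezone
    agent_id: str       # links to prompt version, model name/version, tool-catalogue version
    goal_context: str   # the user request and the state that informed the decision
    tool: str           # namespaced, versioned tool name
    arguments: dict     # full inputs exactly as the agent passed them
    result: dict        # exactly what the tool returned, including error states


def append(record: AuditRecord, path: str = "audit.jsonl") -> None:
    # Append-only JSONL is the simplest store that preserves evidentiary value;
    # production systems typically use an append-only table or object store instead.
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(record)) + "\n")


append(AuditRecord(
    timestamp=datetime.now(timezone.utc).isoformat(),
    agent_id="support-agent/prompt-v12/model-2025-01/tools-v4",
    goal_context="user asked for the status of order o_123",
    tool="crm.read_order#v2",
    arguments={"order_id": "o_123"},
    result={"status": "shipped"},
))
```

Whatever the store, the append-only guarantee and the three query lenses are the requirements; the file format is incidental.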

Storage cost is negligible compared to incident cost. A team that argues against full audit trails on storage grounds has not yet had an incident; teams that have one always retrofit comprehensive audit trails immediately after.

Trust after the first incident

The first incident is when an agent does something it should not have done: sent an email to the wrong recipient, refunded the wrong customer, deleted a calendar event that was load-bearing. The pattern of recovery, across the post-mortems we have read and the ones we have written ourselves, is consistent.

  1. Contain. Disable the specific action immediately. Not the entire agent: the specific action. Most teams over-correct and disable the agent, which is theatre because the agent will be re-enabled on a tight schedule and the deeper problem has not been fixed.
  2. Downgrade. Drop the trust level for that action by one or two notches. If the action was autonomous and went wrong, drop to approve-then-act. If it was approve-then-act and the approval was rubber-stamped, drop to suggest.
  3. Diagnose and fix. Find the root cause. The cluster post on failure modes covers the typical suspects: hallucinated arguments, stale context, tool-output misparsed, prompt drift after a model update.
  4. Add a regression test. The 80-test methodology described in how we test AI agents exists exactly so that incidents map to specific test cases. The first thing post-incident is the new test that would have caught it.
  5. Run at the lower trust level until you match the incident-free count that produced the original trust grant. If the original level required 1,000 incident-free runs to graduate, you need 1,000 again at the new lower level. There is no shortcut; a sketch of tracking this count follows the list.
  6. Graduate. Restore the level only after the count is met and the regression test stays green.
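A minimal sketch of tracking the re-graduation count, assuming Python; the GraduationTracker class and its fields are illustrative assumptions, not a prescribed mechanism.

```python
from dataclasses import dataclass


@dataclass
class GraduationTracker:
    """Tracks whether a downgraded action has earned its original level back."""
    action: str
    required_clean_runs: int        # the same count that produced the original grant
    clean_runs: int = 0
    regression_test_green: bool = False

    def record_run(self, incident: bool) -> None:
        # Any new incident resets the count; there is no shortcut.
        self.clean_runs = 0 if incident else self.clean_runs + 1

    def may_graduate(self) -> bool:
        return self.regression_test_green and self.clean_runs >= self.required_clean_runs


tracker = GraduationTracker(action="email.apply_label", required_clean_runs=1000)
```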

Skipping any step usually produces a second incident inside 60 days. Skipping the regression test is the most common skip and the most reliable predictor of repeat incidents.

Questions to ask an agent vendor

Procurement evaluations typically watch the demo at maximum trust level and stop. The harder question is: how does the system behave when trust must be downgraded, audit trails inspected, or actions reversed? Five questions to ask:

  1. What is the default trust level per action type and how is it configured per customer?
  2. What audit-trail fields are captured? Can I export them to my SIEM?
  3. How fast can I disable a specific action without disabling the agent?
  4. How long are audit records retained, and is the retention period configurable per regulation?
  5. What is the documented recovery process after an incident? Have you executed it on a real customer incident, and may I see a redacted post-mortem?

Vendors who answer the first four and stumble on the fifth typically have not had a serious incident yet. That is not a guarantee but it is a useful signal. The buyer's checklist in agent glossary for buyers covers the broader vocabulary that helps these conversations move fast.

Frequently asked questions

What is an AI agent trust model?

An AI agent trust model is the policy that decides which actions an agent can take without human approval, which require approval, and which are forbidden outright. It maps action types to permission levels and to the audit-trail and reversibility requirements that match those levels.

What are the four levels of agent trust?

Read-only (the agent observes but takes no action), suggest (the agent drafts but a human applies), approve-then-act (the agent proposes and waits for confirmation), and autonomous (the agent acts without per-step approval but inside a guardrail). Most production agents start at level two and graduate to level three after incident-free operation; level four is reserved for low-blast-radius actions.

How do I decide which actions need approval?

The decision is a function of three variables: blast radius (how far the action propagates if wrong), reversibility (can the action be undone), and frequency (one-off or thousand-per-day). Actions that have a high blast radius, are irreversible, or happen rarely should require approval. Actions that have a low blast radius, are reversible, and happen frequently can be autonomous.

What does an agent audit trail need to contain?

Six fields per action: timestamp, agent identifier, goal context, tool called, full input arguments, and the result returned. The trail must be immutable, retained per regulatory requirement (typically seven years for financial actions), and queryable by user, action type, and time window. Without the trail there is no recovery and no incident review.

How does trust recover after an agent incident?

Slowly and only after the root cause is fixed. The pattern is: contain the incident, downgrade the agent to a lower trust level, deploy the fix with a regression test, run at the lower trust level until the incident-free run count matches what produced the original trust grant, then graduate. Skipping any step typically produces a second incident within 60 days.

Three takeaways before you close this tab

  1. Write the trust model before the first incident: map every tool in the catalogue to one of the four levels using blast radius, reversibility, and frequency.
  2. Treat the audit trail as the system of record, not a log file: six fields per action, append-only, queryable, retained per regulation.
  3. Trust recovers through process, not time: contain, downgrade, fix, add the regression test, and re-earn the original incident-free run count before graduating.

Sources