Code fails loudly. An exception is thrown, the process crashes, the error reaches the dashboard, the operator gets an alert. AI agents fail differently. The agent picks a wrong interpretation, takes a wrong action, produces a plausible-looking output. Nothing crashes. The dashboard shows green. The first you hear about it is when a customer notices, or when revenue dips a week later, or when a downstream report contradicts itself.

Monitoring is the discipline that closes this gap. The four layers below answer four different questions and catch four different failure modes. None substitutes for the others; teams that monitor only one layer catch only the failures that layer is designed to find.

Why agents fail silently

Three properties make agent failures hard to see. First, agents produce coherent-looking output even when wrong; a misclassification reads like a correct classification because the model is fluent. Second, agents take many small actions; a single wrong label among 100 right labels is statistical, not categorical. Third, agents operate unattended; nobody is watching the moment the wrong action fires.

The combination is dangerous. The 8 categories of AI agent failure modes include "silent degradation" as a dedicated category for this reason. Monitoring is what makes silent failures audible.

Layer 1: Action log

The action log is the lowest-level layer. Every tool call the agent makes is logged: the tool name, the parameters passed, the timestamp, the result returned by the tool, the agent's next decision based on the result. The log is structured (one row per action), durable (written to a destination the agent cannot delete), and queryable.
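
A minimal sketch of what one row might look like, assuming a JSON-lines file as the durable destination; the field names and the `log_action` helper are illustrative, not a required schema:

```python
import json
import time
from pathlib import Path

# Append-only JSON-lines file. In practice the destination is an external store
# the agent cannot delete; a local file here is purely illustrative.
ACTION_LOG = Path("agent_actions.jsonl")

def log_action(tool: str, params: dict, result: str, next_decision: str) -> None:
    """Write one structured row per tool call: tool, params, timestamp, result, decision."""
    row = {
        "ts": time.time(),
        "tool": tool,
        "params": params,
        "result": result,
        "next_decision": next_decision,
    }
    with ACTION_LOG.open("a") as f:
        f.write(json.dumps(row) + "\n")

# Example: the agent classifies an email, then decides to archive it.
log_action(
    tool="classify_email",
    params={"message_id": "msg_123", "subject": "Invoice overdue"},
    result="category=billing",
    next_decision="archive",
)
```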

The action log answers "what did the agent do?" It does not answer whether what the agent did was right. That is the outcome log's job. The action log is where you go when you need to reproduce a specific failure (covered in how to debug an AI agent).

Layer 2: Outcome log

The outcome log records whether the agent's actions achieved the desired outcome. For most workloads this is sampled, not exhaustive: review 5-10% of actions weekly against a ground truth. The ground truth is whatever you would have done if the agent had not run.

Outcome logging is harder than action logging because it requires comparison. Two patterns work. First, dual-runs: the agent runs and a human (or a higher-quality but slower process) runs in parallel; the outcomes are compared. Second, downstream signals: did the customer reply, did the action get reverted, did the workflow proceed. Downstream signals are cheaper than dual-runs but lag the action by hours or days.
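
A sketch of the sampled-review pattern, assuming the JSON-lines action log from the previous section and a hand-built ground-truth mapping; the 5% rate and the field names are illustrative:

```python
import json
import random

def sample_for_review(log_path: str, rate: float = 0.05) -> list[dict]:
    """Draw a weekly 5-10% sample of actions for manual or dual-run review."""
    with open(log_path) as f:
        actions = [json.loads(line) for line in f]
    k = min(len(actions), max(1, round(len(actions) * rate))) if actions else 0
    return random.sample(actions, k)

def outcome_accuracy(sampled: list[dict], ground_truth: dict[str, str]) -> float:
    """Compare the agent's result against what a human (or a slower, higher-quality
    process) produced for the same item. Returns the fraction that matched."""
    scored = [a for a in sampled if a["params"]["message_id"] in ground_truth]
    if not scored:
        return float("nan")
    correct = sum(
        1 for a in scored if a["result"] == ground_truth[a["params"]["message_id"]]
    )
    return correct / len(scored)
```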

The 80-tests methodology in how we test AI agents uses the outcome layer extensively: each test asserts a specific outcome, not just that the agent ran without error.

Layer 3: Cost and latency metrics

Operational metrics surface the agent's resource usage: per-task cost (covered in how to estimate AI agent cost), p50 and p95 latency, retry rate, and timeout rate. Track day-over-day deltas; meaningful shifts often correlate with prompt drift, model changes, or input distribution shifts.

The most actionable metric is per-task cost trending over the last 14 days. If cost is rising without action volume rising, something is using more tokens per task: longer prompts, more retries, or a model upgrade. The cause matters less than the visibility.
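
A sketch of the daily roll-up behind these numbers, assuming each action row also carries illustrative `latency_ms`, `cost_usd`, and `retries` fields:

```python
import statistics
from collections import defaultdict
from datetime import date

def daily_metrics(actions: list[dict]) -> dict[date, dict]:
    """Per-day summary: task count, cost per task, p50/p95 latency, retry rate."""
    by_day: dict[date, list[dict]] = defaultdict(list)
    for a in actions:
        by_day[date.fromtimestamp(a["ts"])].append(a)

    summary = {}
    for day, rows in sorted(by_day.items()):
        latencies = sorted(r["latency_ms"] for r in rows)
        # statistics.quantiles needs at least two points; fall back to the single value.
        cuts = statistics.quantiles(latencies, n=100) if len(latencies) > 1 else latencies * 99
        summary[day] = {
            "tasks": len(rows),
            "cost_per_task": sum(r["cost_usd"] for r in rows) / len(rows),
            "p50_ms": cuts[49],
            "p95_ms": cuts[94],
            "retry_rate": sum(r.get("retries", 0) > 0 for r in rows) / len(rows),
        }
    return summary

# Day-over-day deltas on cost_per_task over the last 14 days are the signal to watch:
# rising cost without rising task counts means more tokens per task.
```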

Four layers, each answering a different question:

1. Action log. Q: What did the agent do? Tools called, params, results, decisions.
2. Outcome log. Q: Did the action work? Sampled vs ground truth; downstream signals.
3. Cost and latency. Q: Is the agent within budget? Per-task cost, p50/p95, retry rate, day-over-day.
4. Drift detection. Q: Is the distribution changing? Input/output distribution vs rolling baseline.

Source: Aryan Agarwal, Gravity observability spec, May 2026.
Each layer answers a question the others cannot. None substitutes for the others.

Layer 4: Drift detection

Drift detection compares the current week's input and output distributions against a rolling 4-week baseline. Two distributions matter. The input distribution captures what the agent reads: senders, subjects, document lengths, languages, time of day. The output distribution captures what the agent produces: classifications by category, output lengths, refusal rates, retry rates.

Meaningful shifts in either distribution are worth investigating. A new sender domain appearing in 30% of inputs is a distribution change the prompt did not anticipate. A category that used to be 5% of outputs and is now 20% might reflect a real shift in the workload, or it might be misclassification. Drift detection does not tell you which; it tells you to look.
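
A sketch of the output-distribution comparison, assuming action rows shaped like the earlier ones; the 10-percentage-point flag threshold is an illustrative choice, not a number from the spec:

```python
from collections import Counter

def category_share(actions: list[dict]) -> dict[str, float]:
    """Fraction of outputs per category, e.g. {'billing': 0.05, 'support': 0.60, ...}."""
    counts = Counter(a["result"] for a in actions)
    total = sum(counts.values())
    return {cat: n / total for cat, n in counts.items()} if total else {}

def drift_flags(current_week: list[dict], baseline_4_weeks: list[dict],
                threshold: float = 0.10) -> list[str]:
    """Flag categories whose share moved by more than `threshold` versus the baseline.
    The flag only says 'look here'; it cannot tell a benign workload shift from decay."""
    current = category_share(current_week)
    baseline = category_share(baseline_4_weeks)
    flags = []
    for cat in sorted(set(current) | set(baseline)):
        delta = current.get(cat, 0.0) - baseline.get(cat, 0.0)
        if abs(delta) >= threshold:
            flags.append(f"{cat}: {baseline.get(cat, 0.0):.0%} -> {current.get(cat, 0.0):.0%}")
    return flags
```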

Most drift in recurring agents is benign (the world genuinely changed). Some is symptomatic of decay. The discipline of reviewing flagged drifts weekly is the cheapest insurance against silent degradation.

The dashboard checklist

A working agent monitoring dashboard contains:

- Daily action counts by tool (layer 1).
- Weekly outcome accuracy on the sampled subset (layer 2).
- Per-task cost and p50/p95 latency, with day-over-day deltas (layer 3).
- Drift indicators on the input and output distributions (layer 4).
- An anomaly feed: every action that fired an alert.

Most platforms expose the action log natively. Layers 2-4 typically need to be built on top, often by exporting the action log to a spreadsheet, BigQuery, or a metrics store and running scheduled queries. The build cost is real; the alternative is silent failure, which costs more.
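
A minimal sketch of what the scheduled job behind such a dashboard might compute from the exported log; the field names follow the earlier illustrative schema, and the latency alert threshold is an assumption:

```python
import json
from collections import Counter
from datetime import date

def dashboard_rollup(log_path: str, latency_alert_ms: int = 30_000):
    """Daily action counts by tool, plus an anomaly feed of actions worth a human look."""
    with open(log_path) as f:
        actions = [json.loads(line) for line in f]

    counts_by_day_and_tool = Counter(
        (date.fromtimestamp(a["ts"]).isoformat(), a["tool"]) for a in actions
    )
    anomalies = [
        a for a in actions
        if a.get("error") or a.get("latency_ms", 0) > latency_alert_ms
    ]
    return counts_by_day_and_tool, anomalies
```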

Frequently asked questions

Why do AI agents fail silently?

Agent failures often produce output that looks plausible but is wrong. A wrong classification, a wrong recipient, a wrong summary. Without explicit comparison against a ground truth, the failure is invisible. Code that crashes throws an exception; agents that decide wrong produce a coherent-looking output that you would not notice without a monitoring layer.

What are the four monitoring layers for an AI agent?

Action log (what did the agent do?), outcome log (did the action achieve the desired outcome?), cost and latency metrics (is the agent operating within budget?), drift detection (is the input distribution or output distribution changing in ways that suggest the prompt is decaying?). Each layer answers a different question and surfaces different failure modes.

How do I know if my AI agent is drifting?

Drift shows up in two distributions. The input distribution shifts when new senders, new product lines, or new edge cases appear. The output distribution shifts when the agent's classifications change frequency or its outputs become longer or shorter on average. Compare the current week against a rolling 4-week baseline; meaningful shifts are worth investigating.

Should I review every AI agent action manually?

No. Manual review does not scale past the first ten supervised runs. Use sampling: review every action during the supervised window, then sample 5-10% of actions weekly, plus 100% of actions flagged by anomaly detection. The combination catches systematic drift via sampling and one-off issues via flagging.

What goes on an AI agent monitoring dashboard?

Daily action counts by tool, weekly outcome accuracy on a sampled subset, per-task cost and latency at the 50th and 95th percentiles, and drift indicators on the input and output distributions. Plus an anomaly feed: any action that fired an alert. The dashboard is for the operator, not for the agent.

Three takeaways before you close this tab

1. Agent failures are silent by default: nothing crashes, so nothing alerts. Monitoring is what makes them audible.
2. The four layers answer four different questions; none substitutes for the others, so build all four.
3. Sample 5-10% of outcomes weekly and review flagged drift; it is the cheapest insurance against silent degradation.

Sources