Code fails loudly. An exception is thrown, the process crashes, the error reaches the dashboard, the operator gets an alert. AI agents fail differently. The agent picks a wrong interpretation, takes a wrong action, produces a plausible-looking output. Nothing crashes. The dashboard shows green. The first you hear about it is when a customer notices, or when revenue dips a week later, or when a downstream report contradicts itself.

Monitoring is the discipline that closes this gap. The four layers below answer four different questions and catch four different failure modes. None substitutes for the others; teams that monitor only one layer catch only the failures that layer is designed to find.

Why agents fail silently

Three properties make agent failures hard to see. First, agents produce coherent-looking output even when wrong; a misclassification reads like a correct classification because the model is fluent. Second, agents take many small actions; a single wrong label among 100 right labels is statistical, not categorical. Third, agents operate unattended; nobody is watching the moment the wrong action fires.

The combination is dangerous. The 8 categories of AI agent failure modes include "silent degradation" as a dedicated category for this reason. Monitoring is what makes silent failures audible.

Layer 1: Action log

The action log is the lowest-level layer. Every tool call the agent makes is logged: the tool name, the parameters passed, the timestamp, the result returned by the tool, the agent's next decision based on the result. The log is structured (one row per action), durable (written to a destination the agent cannot delete), and queryable.
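
A minimal sketch of what one row might look like, assuming a JSON-lines file as the durable destination; the field names and the `log_action` helper are illustrative, not a required schema:

```python
import json
import time
from pathlib import Path

# Append-only JSON-lines file. In practice the destination is an external store
# the agent cannot delete; a local file here is purely illustrative.
ACTION_LOG = Path("agent_actions.jsonl")

def log_action(tool: str, params: dict, result: str, next_decision: str) -> None:
    """Write one structured row per tool call: tool, params, timestamp, result, decision."""
    row = {
        "ts": time.time(),
        "tool": tool,
        "params": params,
        "result": result,
        "next_decision": next_decision,
    }
    with ACTION_LOG.open("a") as f:
        f.write(json.dumps(row) + "\n")

# Example: the agent classifies an email, then decides to archive it.
log_action(
    tool="classify_email",
    params={"message_id": "msg_123", "subject": "Invoice overdue"},
    result="category=billing",
    next_decision="archive",
)
```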

The action log answers "what did the agent do?" It does not answer whether what the agent did was right. That is the outcome log's job. The action log is where you go when you need to reproduce a specific failure (covered in how to debug an AI agent).

Layer 2: Outcome log

The outcome log records whether the agent's actions achieved the desired outcome. For most workloads this is sampled, not exhaustive: review 5-10% of actions weekly against a ground truth. The ground truth is whatever you would have done if the agent had not run.

Outcome logging is harder than action logging because it requires comparison. Two patterns work. First, dual-runs: the agent runs and a human (or a higher-quality but slower process) runs in parallel; the outcomes are compared. Second, downstream signals: did the customer reply, did the action get reverted, did the workflow proceed. Downstream signals are cheaper than dual-runs but lag the action by hours or days.
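
A sketch of the sampled-review pattern, assuming the JSON-lines action log from the previous section and a hand-built ground-truth mapping; the 5% rate and the field names are illustrative:

```python
import json
import random

def sample_for_review(log_path: str, rate: float = 0.05) -> list[dict]:
    """Draw a weekly 5-10% sample of actions for manual or dual-run review."""
    with open(log_path) as f:
        actions = [json.loads(line) for line in f]
    k = min(len(actions), max(1, round(len(actions) * rate))) if actions else 0
    return random.sample(actions, k)

def outcome_accuracy(sampled: list[dict], ground_truth: dict[str, str]) -> float:
    """Compare the agent's result against what a human (or a slower, higher-quality
    process) produced for the same item. Returns the fraction that matched."""
    scored = [a for a in sampled if a["params"]["message_id"] in ground_truth]
    if not scored:
        return float("nan")
    correct = sum(
        1 for a in scored if a["result"] == ground_truth[a["params"]["message_id"]]
    )
    return correct / len(scored)
```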

The 80-tests methodology in how we test AI agents uses the outcome layer extensively: each test asserts a specific outcome, not just that the agent ran without error.

Layer 3: Cost and latency metrics

Operational metrics surface the agent's resource usage: per-task cost (covered in how to estimate AI agent cost), p50 and p95 latency, retry rate, and timeout rate. Track day-over-day deltas; meaningful shifts often correlate with prompt drift, model changes, or input distribution shifts.

The most actionable metric is per-task cost trending over the last 14 days. If cost is rising without action volume rising, something is using more tokens per task: longer prompts, more retries, or a model upgrade. The cause matters less than the visibility.
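
A sketch of the daily roll-up behind these numbers, assuming each action row also carries illustrative `latency_ms`, `cost_usd`, and `retries` fields:

```python
import statistics
from collections import defaultdict
from datetime import date

def daily_metrics(actions: list[dict]) -> dict[date, dict]:
    """Per-day summary: task count, cost per task, p50/p95 latency, retry rate."""
    by_day: dict[date, list[dict]] = defaultdict(list)
    for a in actions:
        by_day[date.fromtimestamp(a["ts"])].append(a)

    summary = {}
    for day, rows in sorted(by_day.items()):
        latencies = sorted(r["latency_ms"] for r in rows)
        # statistics.quantiles needs at least two points; fall back to the single value.
        cuts = statistics.quantiles(latencies, n=100) if len(latencies) > 1 else latencies * 99
        summary[day] = {
            "tasks": len(rows),
            "cost_per_task": sum(r["cost_usd"] for r in rows) / len(rows),
            "p50_ms": cuts[49],
            "p95_ms": cuts[94],
            "retry_rate": sum(r.get("retries", 0) > 0 for r in rows) / len(rows),
        }
    return summary

# Day-over-day deltas on cost_per_task over the last 14 days are the signal to watch:
# rising cost without rising task counts means more tokens per task.
```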

Four layers, each answering a different question:

1. Action log. Q: What did the agent do? Tools called, params, results, decisions.
2. Outcome log. Q: Did the action work? Sampled vs ground truth; downstream signals.
3. Cost and latency. Q: Is the agent within budget? Per-task cost, p50/p95, retry rate, day-over-day.
4. Drift detection. Q: Is the distribution changing? Input/output distribution vs rolling baseline.

Source: Aryan Agarwal, Gravity observability spec, May 2026.
Each layer answers a question the others cannot. None substitutes for the others.

Layer 4: Drift detection

Drift detection compares the current week's input and output distributions against a rolling 4-week baseline. Two distributions matter. The input distribution captures what the agent reads: senders, subjects, document lengths, languages, time of day. The output distribution captures what the agent produces: classifications by category, output lengths, refusal rates, retry rates.

Meaningful shifts in either distribution are worth investigating. A new sender domain appearing in 30% of inputs is a distribution change the prompt did not anticipate. A category that used to be 5% of outputs and is now 20% might reflect a real shift in the workload, or it might be misclassification. Drift detection does not tell you which; it tells you to look.
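
A sketch of the output-distribution comparison, assuming action rows shaped like the earlier ones; the 10-percentage-point flag threshold is an illustrative choice, not a number from the spec:

```python
from collections import Counter

def category_share(actions: list[dict]) -> dict[str, float]:
    """Fraction of outputs per category, e.g. {'billing': 0.05, 'support': 0.60, ...}."""
    counts = Counter(a["result"] for a in actions)
    total = sum(counts.values())
    return {cat: n / total for cat, n in counts.items()} if total else {}

def drift_flags(current_week: list[dict], baseline_4_weeks: list[dict],
                threshold: float = 0.10) -> list[str]:
    """Flag categories whose share moved by more than `threshold` versus the baseline.
    The flag only says 'look here'; it cannot tell a benign workload shift from decay."""
    current = category_share(current_week)
    baseline = category_share(baseline_4_weeks)
    flags = []
    for cat in sorted(set(current) | set(baseline)):
        delta = current.get(cat, 0.0) - baseline.get(cat, 0.0)
        if abs(delta) >= threshold:
            flags.append(f"{cat}: {baseline.get(cat, 0.0):.0%} -> {current.get(cat, 0.0):.0%}")
    return flags
```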

Most drift in recurring agents is benign (the world genuinely changed). Some is symptomatic of decay. The discipline of reviewing flagged drifts weekly is the cheapest insurance against silent degradation.

The dashboard checklist

A working agent monitoring dashboard contains:

- Daily action counts by tool (layer 1).
- Weekly outcome accuracy on the sampled subset (layer 2).
- Per-task cost and p50/p95 latency, with day-over-day deltas (layer 3).
- Drift indicators on the input and output distributions (layer 4).
- An anomaly feed: every action that fired an alert.

Most platforms expose the action log natively. Layers 2-4 typically need to be built on top, often by exporting the action log to a spreadsheet, BigQuery, or a metrics store and running scheduled queries. The build cost is real; the alternative is silent failure, which costs more.
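
A minimal sketch of what the scheduled job behind such a dashboard might compute from the exported log; the field names follow the earlier illustrative schema, and the latency alert threshold is an assumption:

```python
import json
from collections import Counter
from datetime import date

def dashboard_rollup(log_path: str, latency_alert_ms: int = 30_000):
    """Daily action counts by tool, plus an anomaly feed of actions worth a human look."""
    with open(log_path) as f:
        actions = [json.loads(line) for line in f]

    counts_by_day_and_tool = Counter(
        (date.fromtimestamp(a["ts"]).isoformat(), a["tool"]) for a in actions
    )
    anomalies = [
        a for a in actions
        if a.get("error") or a.get("latency_ms", 0) > latency_alert_ms
    ]
    return counts_by_day_and_tool, anomalies
```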

Frequently asked questions

Why do AI agents fail silently?

Agent failures often produce output that looks plausible but is wrong. A wrong classification, a wrong recipient, a wrong summary. Without explicit comparison against a ground truth, the failure is invisible. Code that crashes throws an exception; agents that decide wrong produce a coherent-looking output that you would not notice without a monitoring layer.

What are the four monitoring layers for an AI agent?

Action log (what did the agent do?), outcome log (did the action achieve the desired outcome?), cost and latency metrics (is the agent operating within budget?), drift detection (is the input distribution or output distribution changing in ways that suggest the prompt is decaying?). Each layer answers a different question and surfaces different failure modes.

How do I know if my AI agent is drifting?

Drift shows up in two distributions. The input distribution shifts when new senders, new product lines, or new edge cases appear. The output distribution shifts when the agent's classifications change frequency or its outputs become longer or shorter on average. Compare the current week against a rolling 4-week baseline; meaningful shifts are worth investigating.

Should I review every AI agent action manually?

No. Manual review does not scale past the first ten supervised runs. Use sampling: review every action during the supervised window, then sample 5-10% of actions weekly, plus 100% of actions flagged by anomaly detection. The combination catches systematic drift via sampling and one-off issues via flagging.

What goes on an AI agent monitoring dashboard?

Daily action counts by tool, weekly outcome accuracy on a sampled subset, per-task cost and latency at the 50th and 95th percentiles, and drift indicators on the input and output distributions. Plus an anomaly feed: any action that fired an alert. The dashboard is for the operator, not for the agent.

Three takeaways before you close this tab

1. Agent failures are silent by default: nothing crashes, so nothing alerts. Monitoring is what makes them audible.
2. The four layers answer four different questions; none substitutes for the others, so build all four.
3. Sample 5-10% of outcomes weekly and review flagged drift; it is the cheapest insurance against silent degradation.

Sources