Monitoring an AI agent is monitoring a non-deterministic distributed system. The four classic golden signals (latency, traffic, errors, saturation) carry over. You need three more: token cost, tool success, and quality drift. OpenTelemetry's GenAI semantic conventions, stable as of late 2025, give the trace schema (OpenTelemetry GenAI, 2025). Langfuse, LangSmith, Arize, Helicone, and Datadog LLM Observability give the platforms. This playbook covers what to log, how to structure traces for multi-step agents, latency budgeting, drift detection, alerting thresholds, debugging patterns, integration health, and compliance logging.

It also covers the cost-vs-detail tradeoff for retention, so retention bills do not outgrow inference bills. The single biggest mistake is to retain everything at full fidelity forever; the second is to retain too little to debug last week's incident.

What should you log for an AI agent?

Three layers. Per-run: trace ID, agent ID, tenant ID, user ID, capability ID, total tokens (in, out, cached), total cost, total tool calls, outcome (success / partial / failure / escalated). Per-step: step type, model, tokens in/out, prompt hash, tool name if any, tool latency, tool result code, judge score where applicable. Per-tool-call: idempotency key, attempt count, response code, payload size. Sample full prompts and outputs at 1 to 5 percent of traffic with full PII redaction. PHI must be redacted before storage on healthcare workloads (HHS HIPAA Security Rule).
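The three layers map naturally onto nested records. The sketch below is one possible shape, not a standard schema: the class names, field names, and the 16-character hash truncation are all assumptions made for illustration, and `tokens_cached` is placed on the step record so the per-run total can be derived.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Optional


def prompt_hash(prompt: str) -> str:
    """Stable short hash so prompts can be grouped without storing their text."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:16]


@dataclass
class ToolCallRecord:           # per-tool-call layer
    idempotency_key: str
    attempt_count: int
    response_code: int
    payload_bytes: int


@dataclass
class StepRecord:               # per-step layer
    step_type: str              # e.g. "model_call" or "tool_call" (hypothetical values)
    model: Optional[str]
    tokens_in: int
    tokens_out: int
    prompt_hash: str
    tokens_cached: int = 0
    tool_name: Optional[str] = None
    tool_latency_ms: Optional[float] = None
    tool_result_code: Optional[int] = None
    judge_score: Optional[float] = None
    tool_calls: list[ToolCallRecord] = field(default_factory=list)


@dataclass
class RunRecord:                # per-run layer
    trace_id: str
    agent_id: str
    tenant_id: str
    user_id: str
    capability_id: str
    outcome: str                # success / partial / failure / escalated
    cost_usd: float
    steps: list[StepRecord] = field(default_factory=list)

    def totals(self) -> dict:
        """Derive per-run totals from the steps instead of storing them twice."""
        return {
            "tokens_in": sum(s.tokens_in for s in self.steps),
            "tokens_out": sum(s.tokens_out for s in self.steps),
            "tokens_cached": sum(s.tokens_cached for s in self.steps),
            "tool_calls": sum(len(s.tool_calls) for s in self.steps),
        }
```

Storing only the prompt hash on every step, and the full prompt only on sampled runs, keeps the always-on record free of payload text.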

How to structure traces for multi-step agents

One trace per run. Nested spans for each step, model call, and tool invocation. The parent span carries the run outcome. Child spans carry per-step detail. Walking the tree reconstructs the agent's reasoning path, which is what you need at 2am when a customer asks why the agent did the thing. OTel GenAI defines the attribute names so the trace is portable across vendors such as Langfuse, LangSmith, Arize, Helicone, and Datadog (OpenTelemetry GenAI, 2025).
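A minimal stdlib-only sketch of the one-trace, nested-spans shape follows. It is not the OpenTelemetry SDK: `RunTrace`, `Span`, and `render` are hypothetical names, and the attribute keys are modeled on the OTel GenAI conventions but should be checked against the current spec. `run.outcome` is an invented key.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    name: str
    trace_id: str
    parent: Optional["Span"] = None
    attributes: dict = field(default_factory=dict)
    children: list["Span"] = field(default_factory=list)
    duration_ms: float = 0.0


class RunTrace:
    """One trace per run; every span opened inside it is parented automatically."""

    def __init__(self, run_name: str):
        self.trace_id = uuid.uuid4().hex
        self.root = Span(run_name, self.trace_id)
        self._current = self.root

    @contextmanager
    def span(self, name: str, attributes: Optional[dict] = None):
        child = Span(name, self.trace_id, parent=self._current,
                     attributes=dict(attributes or {}))
        self._current.children.append(child)
        previous, self._current = self._current, child
        start = time.perf_counter()
        try:
            yield child
        finally:
            child.duration_ms = (time.perf_counter() - start) * 1000
            self._current = previous


def render(span: Span, depth: int = 0) -> list[str]:
    """Walk the tree depth-first and return one indented line per span."""
    lines = ["  " * depth + span.name]
    for child in span.children:
        lines.extend(render(child, depth + 1))
    return lines


trace = RunTrace("agent_run")
with trace.span("step:plan"):
    with trace.span("model_call", {"gen_ai.request.model": "example-model"}):
        pass
with trace.span("step:act"):
    with trace.span("tool_call", {"gen_ai.tool.name": "search"}):
        pass
trace.root.attributes["run.outcome"] = "success"  # parent span carries the outcome
```

Because the context manager tracks the current span, a step can never be recorded without a parent, which is the failure mode that makes a run impossible to reconstruct.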

Token and cost metrics

Three metrics. Tokens per run (in, out, cached separately). Dollars per run, computed from a price card kept in version control. Cost per outcome (success vs failure vs escalated). The last one is the metric finance asks about. Aggregate by tenant, user, capability, and model. If cached tokens are not a separate axis, you cannot optimize the prompt structure for cache hit rate.
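As a rough sketch, the three metrics reduce to a few lines once the price card is data. The model name and prices below are placeholders, not real vendor prices, and the function names are assumptions for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical price card, USD per million tokens. Keep the real one in version control.
PRICE_CARD = {
    "example-model": {"input": 3.00, "cached_input": 0.30, "output": 15.00},
}


@dataclass
class Usage:
    model: str
    tokens_in: int        # uncached input tokens
    tokens_cached: int    # input tokens served from the prompt cache
    tokens_out: int


def run_cost_usd(usages: list[Usage]) -> float:
    """Dollars per run: sum each model call priced from the card."""
    total = 0.0
    for u in usages:
        price = PRICE_CARD[u.model]
        total += (
            u.tokens_in * price["input"]
            + u.tokens_cached * price["cached_input"]
            + u.tokens_out * price["output"]
        ) / 1_000_000
    return total


def cache_hit_rate(u: Usage) -> float:
    """Share of input tokens served from cache; needs cached tokens as a separate axis."""
    total_in = u.tokens_in + u.tokens_cached
    return u.tokens_cached / total_in if total_in else 0.0


def cost_per_outcome(runs: list[tuple[str, float]]) -> dict[str, float]:
    """Mean cost per run, grouped by outcome (success / failure / escalated)."""
    grouped = defaultdict(list)
    for outcome, cost in runs:
        grouped[outcome].append(cost)
    return {outcome: sum(costs) / len(costs) for outcome, costs in grouped.items()}
```

The same grouping works for tenant, user, capability, or model by swapping the key in `cost_per_outcome`.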

Latency budgets

Set a budget per agent class. Chat-style: p50 under 3s, p95 under 8s. Background workflow: p95 under 30s. Async multi-step: per-step budget, not whole-run. Alert on p95 breach, not on a single slow run. Tail latency on agents is often dominated by one slow tool, not by the model; instrument both separately.
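A budget check over a window of latencies might look like the sketch below. The budget values come from the text; the nearest-rank percentile method, the dictionary layout, and the function names are assumptions.

```python
import math

# Budgets in seconds, per agent class, taken from the figures above.
BUDGETS_S = {
    "chat": {"p50": 3.0, "p95": 8.0},
    "background": {"p95": 30.0},
}


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct percent at or below it."""
    if not samples:
        raise ValueError("no samples in window")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def budget_breaches(agent_class: str, latencies_s: list[float]) -> dict[str, float]:
    """Return the percentiles that exceed their budget, with the observed value."""
    breaches = {}
    for name, limit in BUDGETS_S[agent_class].items():
        observed = percentile(latencies_s, float(name[1:]))  # "p95" -> 95.0
        if observed > limit:
            breaches[name] = observed
    return breaches
```

With 100 samples, a single 60-second run leaves p95 untouched, so it does not page anyone; ten slow runs do. Running the same check on tool spans and model spans separately shows which one owns the tail.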

How to detect AI agent quality drift

Three signals run continuously. First, a labeled gold-set eval that runs nightly on production prompts; pass rate is the headline. Second, a judge model evaluating a 1 to 5 percent sample of live traffic; comparing the judge's verdicts with the gold-set labels on the same inputs catches judge drift. Third, distribution shifts: input length, tool selection mix, refusal rate, output length. Alert when any signal moves beyond its band. Anthropic, OpenAI, and Google all update models silently from time to time; drift detection is your protection.
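The text does not define the band, so the sketch below assumes one common choice: the baseline mean plus or minus three standard deviations. The function names and the choice of k are assumptions; a real deployment would tune the band per signal.

```python
from statistics import mean, stdev


def band(baseline: list[float], k: float = 3.0) -> tuple[float, float]:
    """Assumed band: baseline mean +/- k sample standard deviations."""
    m, s = mean(baseline), stdev(baseline)
    return (m - k * s, m + k * s)


def drifted_signals(baselines: dict[str, list[float]],
                    current: dict[str, float],
                    k: float = 3.0) -> list[str]:
    """Names of signals whose current value falls outside the baseline band."""
    drifted = []
    for name, history in baselines.items():
        low, high = band(history, k)
        if not low <= current[name] <= high:
            drifted.append(name)
    return drifted


def judge_agreement(judge_labels: list[bool], gold_labels: list[bool]) -> float:
    """Fraction of gold-set items where the judge matches the human label."""
    matches = sum(j == g for j, g in zip(judge_labels, gold_labels))
    return matches / len(gold_labels)


baselines = {
    "refusal_rate": [0.020, 0.021, 0.019, 0.020, 0.022],
    "output_len": [400, 410, 395, 405, 400],
}
today = {"refusal_rate": 0.05, "output_len": 402}
alerts = drifted_signals(baselines, today)  # only refusal_rate is out of band
```

A falling `judge_agreement` means the judge itself is drifting, so its scores on live traffic should not be trusted until it is recalibrated.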

The right alerting thresholds

  1. Error rate > 2 percent over 10 minutes, page on-call. Above 5 percent, escalate.
  2. p95 latency > 1.5x baseline for 15 minutes, page.
  3. Token cost per run > 2x baseline for 1 hour, alert (likely a prompt change or context bloat).
  4. Tool success rate < 95 percent on any tool for 10 minutes, alert.
  5. Quality drift: gold-set pass rate down > 3 points week-over-week, alert.
  6. Refusal rate change > 50 percent week-over-week, alert (likely a model update).
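The thresholds above can be kept as data instead of scattered dashboard settings. This is a sketch under assumptions: the metric key names are invented, and `evaluate` expects each metric to be already aggregated over its rule's window by whatever metrics backend is in use.

```python
from dataclasses import dataclass
from typing import Callable

WEEK_MIN = 7 * 24 * 60


@dataclass(frozen=True)
class Rule:
    name: str
    window_minutes: int
    breached: Callable[[dict], bool]
    action: str


RULES = [
    Rule("error_rate_escalate", 10, lambda m: m["error_rate"] > 0.05, "escalate"),
    Rule("error_rate", 10, lambda m: m["error_rate"] > 0.02, "page"),
    Rule("p95_latency", 15,
         lambda m: m["p95_latency_s"] > 1.5 * m["baseline_p95_latency_s"], "page"),
    Rule("token_cost", 60,
         lambda m: m["cost_per_run_usd"] > 2 * m["baseline_cost_per_run_usd"], "alert"),
    Rule("tool_success", 10, lambda m: m["min_tool_success_rate"] < 0.95, "alert"),
    Rule("gold_set", WEEK_MIN, lambda m: m["gold_pass_rate_delta_pts"] < -3, "alert"),
    Rule("refusal_rate", WEEK_MIN, lambda m: abs(m["refusal_rate_change"]) > 0.5, "alert"),
]


def evaluate(metrics: dict) -> list[tuple[str, str]]:
    """Return (rule name, action) for every rule whose windowed metric is in breach."""
    return [(rule.name, rule.action) for rule in RULES if rule.breached(metrics)]


window = {
    "error_rate": 0.03,
    "p95_latency_s": 5.0, "baseline_p95_latency_s": 4.0,
    "cost_per_run_usd": 0.05, "baseline_cost_per_run_usd": 0.02,
    "min_tool_success_rate": 0.99,
    "gold_pass_rate_delta_pts": -1.0,
    "refusal_rate_change": 0.1,
}
fired = evaluate(window)
```

Keeping the rules in one reviewed list makes a threshold change a code review, not a silent dashboard edit.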

How to debug a failed agent run

The trace is the debugger. Pull the run by trace ID, walk the span tree, find the step where the outcome diverged from intent. Common patterns: (1) tool returned wrong arg shape and the model retried into a loop; (2) context exceeded a downstream tool's input limit; (3) model misclassified the intent and routed to the wrong specialist; (4) idempotency key collision caused a phantom failure on retry. The trace tells you which one in under five minutes if the schema is right.
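Walking the span tree can be partly automated. The sketch below assumes spans exported as nested dicts with `name`, `status`, and `children` keys, which is a hypothetical export shape, and it covers only two of the patterns: locating the deepest failing span and spotting a retry loop.

```python
from itertools import groupby
from typing import Optional


def first_failure(span: dict, path: tuple = ()) -> Optional[tuple]:
    """Depth-first search for the deepest failing span; returns the path of names to it."""
    here = path + (span["name"],)
    for child in span.get("children", []):
        found = first_failure(child, here)
        if found:
            return found
    if span.get("status") == "error":
        return here
    return None


def retry_loops(span: dict, min_repeats: int = 3) -> list[str]:
    """Names of sibling spans that repeat back to back at least min_repeats times."""
    loops = []
    children = span.get("children", [])
    for name, group in groupby(children, key=lambda c: c["name"]):
        if len(list(group)) >= min_repeats:
            loops.append(name)
    for child in children:
        loops.extend(retry_loops(child, min_repeats))
    return loops


run = {
    "name": "run", "status": "error", "children": [
        {"name": "plan", "status": "ok"},
        {"name": "act", "status": "error", "children": [
            {"name": "tool:search", "status": "error"},
            {"name": "tool:search", "status": "error"},
            {"name": "tool:search", "status": "error"},
        ]},
    ],
}
culprit = first_failure(run)
loops = retry_loops(run)
```

Here the failure path ends at `tool:search` and the same span shows up as a loop, which points at pattern (1) before anyone reads a payload.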

Tool and integration health

Each external tool has its own SLO. Track success rate, p95 latency, and error code distribution per tool. A tool with degraded SLO should trip its circuit breaker before it eats your retry budget. The agent's own success rate is a lagging indicator of tool health; the per-tool metrics are leading.
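A per-tool circuit breaker can be very small. This is a minimal sketch, assuming a consecutive-failure trigger and a fixed cool-down; the class name, thresholds, and injectable clock are illustrative choices, and a production breaker would more likely trip on the tool's windowed success rate.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Opens after N consecutive failures, then lets a probe through after a cool-down."""

    def __init__(self, failure_threshold: int = 5, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow(self) -> bool:
        """True if a call to the tool may proceed."""
        if self.opened_at is None:
            return True
        # Half-open: after the cool-down, allow a probe call through.
        return self.clock() - self.opened_at >= self.reset_after_s

    def record(self, success: bool) -> None:
        """Report the result of a tool call."""
        if success:
            self.failures = 0
            self.opened_at = None
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()  # open, or re-open after a failed probe
```

Checking `allow()` before each attempt means a degraded tool stops consuming retries as soon as the breaker opens, and the agent can fall back or escalate.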

The cost-vs-detail tradeoff for retention

Retention is where observability budgets blow up. Three tiers work for most teams. Tier 1 (full fidelity, 30 days): every trace, every prompt, every output, redacted PII, sampled tools. Tier 2 (summary, 90 days): metadata, totals, error codes, no payloads. Tier 3 (aggregates, 12+ months): per-day rollups by tenant, capability, model. The 30-day full-fidelity window covers the "what happened last week" debugging case; the 90-day summary covers "is this drift real"; the 12-month aggregates cover business reviews and compliance. Skipping any tier creates a debugging hole or a compliance one.

Pricing reality on the major platforms in 2025-2026: Langfuse and Helicone charge by event count and retention; Arize and Datadog charge by ingestion volume and retention separately. A team retaining 100 percent of traffic at full fidelity for 90 days usually finds the observability bill exceeds the inference bill within a quarter. Sampling to 1 to 5 percent of full payloads while keeping 100 percent of metadata is the standard tradeoff.
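The sampling tradeoff and the tier boundaries can both be expressed as small pure functions. In this sketch, hashing the trace ID for a deterministic sampling decision is an assumed technique (the text does not prescribe one), and the function and tier names are hypothetical.

```python
import hashlib


def keep_full_payload(trace_id: str, sample_rate: float = 0.05) -> bool:
    """Deterministic sampling: the same trace ID always gets the same decision."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < sample_rate


def retention_tier(age_days: int) -> str:
    """Map a record's age onto the three tiers described above."""
    if age_days <= 30:
        return "full"        # traces with payloads, PII redacted
    if age_days <= 90:
        return "summary"     # metadata, totals, error codes, no payloads
    return "aggregate"       # per-day rollups by tenant, capability, model


def record_for_storage(trace_id: str, metadata: dict, payload: dict) -> dict:
    """Always keep metadata; attach the payload only for sampled traces."""
    record = {"trace_id": trace_id, **metadata}
    if keep_full_payload(trace_id):
        record["payload"] = payload
    return record
```

Hashing the trace ID, not calling a random number generator, means every service that touches a run makes the same keep-or-drop decision, so a sampled trace is never missing half its spans.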

Anti-patterns we keep seeing

  1. Logging the full prompt and output without redaction. A compliance incident waiting to happen. PII redaction at write, not at read.
  2. Alerting on every error. Pager fatigue inside two weeks. Alert on rate-and-window, not on point events.
  3. Metrics in three places. One tool for cost, one for latency, one for quality. Pick one platform or accept that nobody owns the dashboard.
  4. Per-step traces without a parent. Cannot reconstruct a run. Always parent the spans.
  5. No judge sampling. You see latency but not quality. Latency stays stable while quality silently drifts.
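For the first anti-pattern, redaction at write means the raw text never reaches storage. The sketch below is illustrative only: two regexes (email addresses and US-style SSNs) are nowhere near complete PII detection, and `write_log` with an in-memory list stands in for a real log sink.

```python
import re

# Illustrative patterns only; real PII detection needs far broader coverage.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
US_SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

PAYLOAD_FIELDS = ("prompt", "output")


def redact(text: str) -> str:
    """Replace matched PII with typed placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return US_SSN.sub("[SSN]", text)


def write_log(store: list, record: dict) -> None:
    """Redact payload fields before the record is appended to storage."""
    clean = dict(record)
    for key in PAYLOAD_FIELDS:
        if key in clean:
            clean[key] = redact(clean[key])
    store.append(clean)


store: list = []
write_log(store, {"trace_id": "t1", "prompt": "mail jane@example.com, ssn 123-45-6789"})
```

Because `write_log` is the only path into the store, there is no unredacted copy for a later read path to leak.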

Compliance and audit logging

Separate audit logs from operational logs. Each audit log entry records who, when, what action, what data scope, and the outcome, and is signed and tamper-evident. Hash-chain entries so an attacker cannot rewrite history undetected. NIST AI RMF emphasizes documented impact assessment (NIST AI RMF, 2023); SOC 2 and ISO 27001 audits both expect audit trails to be protected against tampering for any system that touches customer data.
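A hash chain is simple to sketch: each entry stores the hash of the previous one, so editing any earlier entry breaks every later link. The code below shows the chaining only. Signing (for example with an HMAC or an asymmetric key) is deliberately left out, and the field and function names are assumptions.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder for the hash before the first entry


def _digest(body: dict) -> str:
    """Hash a canonical JSON encoding of the entry body."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


def append_entry(chain: list, who: str, when: str, action: str,
                 data_scope: str, outcome: str) -> dict:
    """Append an audit entry linked to the previous entry's hash."""
    body = {
        "who": who, "when": when, "action": action,
        "data_scope": data_scope, "outcome": outcome,
        "prev_hash": chain[-1]["hash"] if chain else GENESIS,
    }
    entry = {**body, "hash": _digest(body)}
    chain.append(entry)
    return entry


def verify(chain: list) -> bool:
    """Recompute every hash and link; False if any entry was altered or reordered."""
    prev = GENESIS
    for entry in chain:
        body = {k: v for k, v in entry.items() if k != "hash"}
        if body["prev_hash"] != prev or _digest(body) != entry["hash"]:
            return False
        prev = entry["hash"]
    return True
```

An attacker who can rewrite the whole chain can still recompute every hash, which is why the chain head should also be signed or anchored somewhere the attacker cannot write.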

FAQ

What should I log for an AI agent?
Per-run identity and totals; per-step model, tokens, prompt hash, tool details, judge scores; per-tool-call idempotency key, attempt count, response code. Sample full prompts and outputs at 1 to 5 percent, PII redacted.

What are the golden signals?
Latency, traffic, errors, saturation, token cost, tool success, quality drift. Seven, not four.

How do I detect quality drift?
Nightly gold-set eval, judge sampling on live traffic, distribution shift tracking. Alert when any moves beyond band.

What is OpenTelemetry GenAI?
Semantic conventions for generative AI traces. Standardizes attribute names across vendors. Stable late 2025.

How do I monitor multi-step runs?
One trace per run, nested spans per step and per tool call. Walk the tree to reconstruct reasoning.

Choosing an observability platform in 2026

Five platforms cover most teams. Langfuse is the open-core option with strong eval features; good for teams that want to self-host. LangSmith ties tightly to LangChain and LangGraph; good if you are committed to that stack. Arize is the most data-science-leaning, with the strongest drift detection. Helicone is the lightest-weight, proxy-based; cheapest to add but shallow on multi-step traces. Datadog LLM Observability folds into existing Datadog accounts; good for teams already paying Datadog and wanting one bill.

Selection criteria, ranked. One: does it understand multi-step agent traces or just LLM calls? Many "LLM observability" tools only see individual model calls and miss the agent loop. Two: does it implement OpenTelemetry GenAI semantic conventions? Portability matters when vendors change pricing. Three: how does the cost scale at your run volume? Some tools price by trace event, some by ingestion volume, some by user seat. The right answer at 1,000 runs per day is rarely the right answer at 100,000 per day.

Closing the loop

Observability for agents is not optional. It is the difference between "we shipped it and prayed" and "we shipped it and know". Related: cost control tactics, retry policies, reliability testing, and the broader security playbook.

Sources