What goes on an AI agent observability dashboard?

Five panels: run health (success, latency, errors), token economics, tool-call success per tool, quality signal (eval pass rate or LLM-judge), and cost per completed run. Everything else is detail you drill into from these five.

Which tracing standard should I use?

OpenTelemetry GenAI semantic conventions are the emerging standard. They define span attributes for model, tokens, tool calls, and errors so the trace works across vendors. Use them even if your current tool does not enforce them yet.

What is the difference between traces and metrics for agents?

A trace is one run end-to-end: every model call, tool call, retrieval query, with timings and inputs. Metrics are aggregates over many runs: rates, percentiles, counts. You need both. Traces explain why; metrics show how often.

Do I need a special LLM observability vendor?

Not strictly. Many teams pipe OpenTelemetry traces into existing observability stacks (Datadog, Honeycomb, Grafana). Specialized vendors (Langfuse, Helicone, Phoenix, LangSmith) add prompt-aware features but you can start with what you already run.

How much of the prompt and completion should be logged?

Enough to debug, not enough to be a liability. Most teams log inputs and outputs by default for non-PII workloads and redact for PII. Retention is bounded; access is role-gated; the policy is documented for SOC 2.

What alerts should fire from the dashboard?

p95 latency over budget for 10 minutes. Error rate doubling over the prior hour. Tool-call success dropping below threshold per tool. Cost per run jumping above headroom. Quality proxy regressing on the canary cohort.

AI Agent Observability Dashboards: The Five Panels Every Team Needs

An AI agent dashboard does two jobs. It tells the on-call within a minute whether the platform is healthy. It tells a debugger within five minutes why a specific run went wrong. Most dashboards are good at one or the other; the good ones are good at both. This piece is a companion to the broader monitoring overview and the post on log aggregation patterns.

This piece names the five panels that earn their place on a primary dashboard, the tracing standard underneath, the alerts that should fire, and what to leave for a secondary view.

The bones: traces, metrics, logs

Three telemetry types underpin every agent dashboard.

Traces. One trace per agent run, spanning the orchestrator, every model call, every tool call, every retrieval query. OpenTelemetry's GenAI semantic conventions define standard span attributes (gen_ai.system, gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, tool span events) so traces work across model providers and frameworks (OpenTelemetry GenAI, 2025).
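As a rough illustration of what those attributes look like on a single model-call span, here is a minimal sketch that builds the attribute set as a plain dictionary. The helper name genai_span_attributes and the example values are hypothetical; the gen_ai.* keys are the ones named above, and error.type is the general OpenTelemetry error attribute. The conventions are still evolving, so check the current spec before relying on exact key names.

```python
from typing import Optional


def genai_span_attributes(system: str, model: str,
                          input_tokens: int, output_tokens: int,
                          error_type: Optional[str] = None) -> dict:
    """Build a flat attribute dict for one model-call span (hypothetical helper)."""
    attrs = {
        "gen_ai.system": system,
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
    }
    if error_type is not None:
        # Only set on failed calls, so healthy spans stay small.
        attrs["error.type"] = error_type
    return attrs


# Example values are made up for illustration.
attrs = genai_span_attributes("anthropic", "example-model", 1200, 340)
failed = genai_span_attributes("anthropic", "example-model", 1200, 0,
                               error_type="timeout")
```

In a real service these key-value pairs would be set on the span object of whatever tracing SDK you use rather than returned as a dict.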

Metrics. Aggregates derived from traces or emitted directly. Rates, percentiles, counts. Lower cardinality than traces; queryable in seconds.

Logs. Structured events for things that do not fit a trace or a metric: deploy events, config changes, quality-eval results, support tickets. Logs are the breadcrumbs that turn a "p95 just spiked" question into a "we deployed at 14:02, p95 spiked at 14:03" answer.

Panel 1: Run health

The first thing the on-call sees.

Runs per minute (volume).
Success rate (completed without error).
End-to-end latency at p50, p95, p99.
Error rate broken down by class: 5xx from provider, 429 rate-limited, tool-call failure, parse failure, timeout.

The cardinality stays low because errors are bucketed. The trick is the latency view: show distribution, not just averages. A heatmap or a histogram says "the p99 is climbing" in a glance; a single line chart hides it.
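A small sketch of both ideas: bucketing raw errors into the handful of classes listed above, and reading percentiles instead of a mean. The function names, the "other" fallback bucket, and the sample latencies are assumptions for illustration; the percentile uses the simple nearest-rank method, which may differ slightly from what your metrics backend computes.

```python
import math
from collections import Counter


def percentile(values, p):
    """Nearest-rank percentile for p in (0, 100]."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def classify_error(http_status=None, kind=None):
    """Map a raw failure onto a small, fixed set of classes."""
    if http_status == 429:
        return "rate_limited"
    if http_status is not None and 500 <= http_status <= 599:
        return "provider_5xx"
    if kind in {"tool_failure", "parse_failure", "timeout"}:
        return kind
    return "other"  # assumed catch-all so cardinality stays bounded


# Made-up latencies: most runs are fast, two are very slow.
latencies_ms = [120, 135, 150, 160, 180, 210, 240, 300, 900, 4200]
mean_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p95_ms = percentile(latencies_ms, 95)

raw_errors = [{"http_status": 503}, {"http_status": 429},
              {"kind": "timeout"}, {"http_status": 500}]
error_counts = Counter(classify_error(**e) for e in raw_errors)
```

With these samples the mean sits well above the median and well below the p95, which is the gap a single average line hides.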

Panel 2: Token economics

Tokens are the fuel; this panel shows how much you are burning and where.

Input tokens per minute, output tokens per minute.
Cache hit ratio (cached input tokens divided by total input tokens). Anthropic, OpenAI, and Google all report cached-token counts in their usage data, and the ratio directly drives cost.
Average tokens per run, broken down by agent or capability.
Top 5 agents by token spend in the last hour. Surfaces the runaway agent before the bill does.
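The two derived numbers on this panel can be sketched as follows. The record shape (agent, input_tokens, output_tokens) and the function names are hypothetical. The ratio assumes that the total already includes cached tokens; providers differ on whether their input-token count includes or excludes cached reads, so normalize before dividing.

```python
from collections import defaultdict


def cache_hit_ratio(cached_input_tokens, total_input_tokens):
    """Cached input tokens divided by total input tokens (total includes cached)."""
    if total_input_tokens == 0:
        return 0.0
    return cached_input_tokens / total_input_tokens


def top_agents_by_tokens(runs, n=5):
    """Sum input and output tokens per agent and return the n largest."""
    totals = defaultdict(int)
    for run in runs:
        totals[run["agent"]] += run["input_tokens"] + run["output_tokens"]
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]


# Made-up run records for the last hour.
runs = [
    {"agent": "researcher", "input_tokens": 9000, "output_tokens": 1000},
    {"agent": "summarizer", "input_tokens": 2000, "output_tokens": 500},
    {"agent": "researcher", "input_tokens": 30000, "output_tokens": 4000},
]
ratio = cache_hit_ratio(cached_input_tokens=750, total_input_tokens=1000)
top = top_agents_by_tokens(runs, n=1)
```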

Panel 3: Tool-call success

Tools are the part of the agent that interacts with the rest of the world. Most failures that look like "agent broke" are actually "tool returned 5xx".

Tool call rate per tool.
Success rate per tool.
Latency per tool, p95.
Top 5 failing tools in the last hour.

Per-tool breakdown is the point. An aggregate success rate of 98 percent might hide a specific tool dropping from 99 percent to 80 percent. The aggregate looks fine; the affected workflow does not. See fallback and retry for what to do when a tool's success rate drops.
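The masking effect is easy to reproduce with made-up numbers. In this sketch the tool names, call counts, and the 0.95 floor are all illustrative assumptions; the point is only that the aggregate and the per-tool view are computed from the same call records and tell different stories.

```python
from collections import defaultdict


def success_rates(calls):
    """calls: iterable of (tool_name, succeeded). Returns (aggregate, per_tool)."""
    ok = defaultdict(int)
    total = defaultdict(int)
    for tool, succeeded in calls:
        total[tool] += 1
        ok[tool] += int(succeeded)
    if not total:
        return None, {}
    per_tool = {tool: ok[tool] / total[tool] for tool in total}
    aggregate = sum(ok.values()) / sum(total.values())
    return aggregate, per_tool


# 900 healthy search calls hide 20 failures out of 100 CRM lookups.
calls = ([("search", True)] * 900
         + [("crm_lookup", True)] * 80
         + [("crm_lookup", False)] * 20)
aggregate, per_tool = success_rates(calls)

FLOOR = 0.95  # assumed per-tool floor
below_floor = sorted(t for t, rate in per_tool.items() if rate < FLOOR)
```

Here the aggregate comes out at 98 percent while crm_lookup sits at 80 percent, which is exactly the case an aggregate-only panel would miss.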

Panel 4: Quality signal

Latency and errors do not tell you whether the output is good. Quality needs its own signal.

Eval pass rate on a recurring sample (held-out set re-run hourly or daily).
LLM-as-judge score on a live sample, if used.
Human-feedback signal: thumbs, downstream task completion, support ticket rate.
Hallucination flag rate if you run a separate detector.

Quality is the metric that lags the most. A regression here often shows up on canary or shadow traffic before the live dashboard catches it; build both views.
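One simple way to compare the two views is to compute the eval pass rate on each cohort and flag a drop larger than an agreed band. This is a minimal sketch: the function names, the 3-point band, and the sample results are assumptions, and a production check would also want a minimum sample size before trusting the comparison.

```python
def pass_rate(results):
    """Fraction of eval cases that passed; results is a list of booleans."""
    results = list(results)
    if not results:
        raise ValueError("no eval results")
    return sum(1 for passed in results if passed) / len(results)


def regressed(baseline_rate, canary_rate, band=0.03):
    """True when the canary is worse than baseline by more than the band."""
    return (baseline_rate - canary_rate) > band


# Made-up results: 46 of 50 pass on live, 42 of 50 pass on canary.
baseline = pass_rate([True] * 46 + [False] * 4)
canary = pass_rate([True] * 42 + [False] * 8)
canary_regressed = regressed(baseline, canary)
```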

Panel 5: Cost per completed run

The CFO's panel and the engineer's panel.

Cost per completed run (rolling 1-hour and 24-hour).
Cost per tenant or per capability, top N.
Budget headroom against the monthly commit.
Forecast based on current run rate.

Live cost is the cheapest control over a runaway agent. A pattern of USD 0.50 per run that jumps to USD 8 per run for an hour is visible on this panel before it shows on next month's invoice. See agent cost attribution for how the per-tenant view is built.
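A minimal sketch of the headline number, under one stated assumption: spend on failed runs stays in the numerator, because those tokens were still billed, while only completed runs count in the denominator. The record fields (cost_usd, completed), the function names, and the 3x spike factor are hypothetical.

```python
def cost_per_completed_run(runs):
    """Total spend in the window divided by the number of completed runs."""
    total_cost = sum(run["cost_usd"] for run in runs)
    completed = sum(1 for run in runs if run["completed"])
    if completed == 0:
        return None  # nothing completed; show a gap rather than a fake number
    return total_cost / completed


def is_cost_spike(current, baseline, factor=3.0):
    """Flag when the current window costs `factor` times the baseline or more."""
    return baseline > 0 and current >= factor * baseline


# Made-up window: two completed runs and one failed run that still cost money.
window = [
    {"cost_usd": 0.50, "completed": True},
    {"cost_usd": 0.40, "completed": True},
    {"cost_usd": 0.30, "completed": False},
]
current = cost_per_completed_run(window)
spike = is_cost_spike(current=8.0, baseline=0.50)
```

Counting failed-run spend this way means a burst of expensive failures raises the number instead of disappearing from it.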

Tooling: build, buy, or both

Three viable stacks.

General observability platform. Datadog, Honeycomb, Grafana plus Tempo/Loki, New Relic. Existing pipeline, OpenTelemetry-native, the team already knows the tool.
LLM-specialized. Langfuse, Helicone, Phoenix (Arize), LangSmith, Traceloop. Prompt and completion as first-class citizens; faster to get started on agent-specific views.
Both. LLM-specialized for prompt-level debugging and quality, general platform for infrastructure metrics. Pipe with OpenTelemetry so traces flow to both.

The decision is usually shaped by what the team already runs. Adding a parallel stack only for LLMs is rarely necessary in the first six months; pipe to your existing platform and add a specialized tool when prompt-aware features start to matter.

Alerts that fire from the dashboard

p95 end-to-end latency over budget for 10 minutes.
Error rate doubling over the prior hour, broken down by class.
Per-tool success rate dropping below the configured floor.
Cost per run exceeding the budget headroom over a 15-minute window.
Quality proxy regressing more than the agreed band on canary or live sample.

Five alerts is enough. Every page should mean "investigate, do not just acknowledge". Alerts beyond this list belong on a secondary dashboard or an on-call queue; paging on more than five conditions trains the team to acknowledge without looking.
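Two of these conditions are sketched below as plain predicates: "over budget for 10 minutes" as a sustained breach over per-minute samples, and "doubling over the prior hour" as a ratio check. The function names, the one-sample-per-minute layout, and the min_rate guard (added so a near-zero baseline does not page on a single error) are assumptions, not part of any particular alerting product.

```python
def sustained_breach(samples, budget, minutes=10):
    """True if the most recent `minutes` per-minute samples all exceed budget."""
    if len(samples) < minutes:
        return False
    return all(value > budget for value in samples[-minutes:])


def error_rate_doubled(current_rate, prior_hour_rate, min_rate=0.001):
    """True if the current error rate is at least twice the prior hour's rate."""
    if prior_hour_rate < min_rate:
        # Baseline is effectively zero; compare against the floor instead.
        return current_rate >= 2 * min_rate
    return current_rate >= 2 * prior_hour_rate


# Made-up p95 series in milliseconds, one sample per minute, budget 4000 ms.
p95_ms = [3800, 3900] + [4100] * 10
page_latency = sustained_breach(p95_ms, budget=4000)
page_errors = error_rate_doubled(current_rate=0.04, prior_hour_rate=0.02)
```

Requiring every sample in the window to breach keeps a single slow minute from paging anyone.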

What goes on the secondary view

The primary dashboard is for "is the platform healthy". Everything else lives on secondary views the on-call drills into when the primary says no.

Per-tenant breakdown. The same five panels, but pivoted by tenant. Surfaces the noisy neighbor: one tenant whose runs cost 5x the average is visible here before it shows in the aggregate.

Per-agent breakdown. Same five panels, pivoted by agent. The agent whose tool-call success dropped is identified by name, not by inference.

Provider health. Side-by-side comparison of your service's view of provider latency versus the provider's status page. Catches issues where the provider's status is green but your latency to them is not.

Retrieval index health. Index size, write rate, p95 query latency, recall@k from a tiny held-out set. Index drift is hard to see from the run dashboard alone.
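For reference, recall@k on a held-out set can be computed as below. This is a sketch: the document IDs and the two-query held-out set are invented, and the function names are hypothetical. Recall@k here is the fraction of known-relevant documents that appear in the top k retrieved results, averaged over queries.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of relevant documents found in the top k retrieved results."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("query has no relevant documents")
    hits = relevant.intersection(retrieved_ids[:k])
    return len(hits) / len(relevant)


def mean_recall_at_k(queries, k):
    """Average recall@k over (retrieved_ids, relevant_ids) pairs."""
    scores = [recall_at_k(retrieved, relevant, k)
              for retrieved, relevant in queries]
    return sum(scores) / len(scores)


# Made-up held-out set: ranked retrieval results and labelled relevant docs.
held_out = [
    (["d3", "d7", "d1", "d9"], ["d1", "d2"]),  # finds 1 of 2 in the top 3
    (["d4", "d5", "d6", "d8"], ["d4", "d5"]),  # finds 2 of 2 in the top 3
]
index_recall = mean_recall_at_k(held_out, k=3)
```

Re-running the same held-out queries on a schedule turns this into a time series, so a drop after a re-index or an embedding change becomes visible.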

Cost forecast. Burn rate projected to month-end, broken down by tenant and capability. Useful for finance reviews and capacity planning.
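The simplest version of that projection is linear: average daily burn so far, multiplied by the days in the month. The sketch below assumes exactly that, treats the current day as fully elapsed, and uses invented figures (spend, date, and a USD 10,000 monthly commit). Real traffic is rarely flat, so treat the output as a first approximation.

```python
import calendar
from datetime import date


def forecast_month_end(spend_to_date, today):
    """Linear projection: average daily burn so far times days in the month."""
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    daily_burn = spend_to_date / today.day
    return daily_burn * days_in_month


def headroom(monthly_commit, projected_spend):
    """Positive means under the commit; negative means projected overrun."""
    return monthly_commit - projected_spend


# Made-up figures: USD 4,500 spent by 15 June against a USD 10,000 commit.
projected = forecast_month_end(4500.0, date(2025, 6, 15))
remaining = headroom(10000.0, projected)
```

The same two functions can run per tenant or per capability by feeding them the filtered spend.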

Common dashboard pitfalls

Four patterns that turn a useful dashboard into theatre.

Too many panels. A dashboard with 30 panels is read by no one. Five primary, ten secondary; everything else is a saved query.

Averages instead of distributions. A mean latency line hides every interesting failure. Use percentiles and histograms.

No deploy markers. A latency spike at 14:02 is much easier to diagnose if the dashboard shows "deploy at 14:01". Annotate deploys, feature flags, and canary expansions.
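The lookup behind such an annotation can be sketched as a filter over change events: given a spike time, return the markers that landed shortly before it. The event shape (kind, at), the five-minute window, and the timestamps are assumptions for illustration.

```python
from datetime import datetime, timedelta


def markers_before(spike_at, markers, window=timedelta(minutes=5)):
    """Return change events that happened within `window` before the spike."""
    return [m for m in markers
            if timedelta(0) <= spike_at - m["at"] <= window]


# Made-up events on an arbitrary day.
markers = [
    {"kind": "feature_flag", "at": datetime(2025, 1, 1, 13, 30)},
    {"kind": "deploy", "at": datetime(2025, 1, 1, 14, 1)},
    {"kind": "canary_expansion", "at": datetime(2025, 1, 1, 14, 10)},
]
suspects = markers_before(datetime(2025, 1, 1, 14, 2), markers)
```

Only the 14:01 deploy falls inside the window; the earlier flag change and the later canary expansion are excluded.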

Stale alerts. Alerts that fire weekly but get acknowledged without action are noise. Review the alert backlog quarterly; either tighten the threshold or delete the alert.

FAQ

What goes on an AI agent observability dashboard? Five panels: run health, token economics, tool-call success per tool, quality signal, and cost per completed run.
Which tracing standard should I use? OpenTelemetry GenAI semantic conventions. They define span attributes for model, tokens, tool calls, and errors so traces work across vendors.
What is the difference between traces and metrics for agents? A trace is one run end-to-end. Metrics are aggregates over many runs. You need both. Traces explain why; metrics show how often.
Do I need a special LLM observability vendor? Not strictly. Many teams pipe OpenTelemetry traces into existing platforms. Specialized vendors add prompt-aware features; start with what you already run.
How much of the prompt and completion should be logged? Enough to debug, not enough to be a liability. Log by default for non-PII; redact for PII. Retention bounded, access role-gated, policy documented.
What alerts should fire from the dashboard? p95 over budget for 10 minutes. Error rate doubling. Tool-call success below floor. Cost per run above headroom. Quality regressing on canary.

Sources

OpenTelemetry, "GenAI semantic conventions", 2025, opentelemetry.io
Anthropic, "Prompt caching", 2025, docs.anthropic.com
Langfuse, "Open-source LLM observability", 2025, langfuse.com
Arize Phoenix, "LLM observability", 2025, arize.com
Google, "SRE: Monitoring distributed systems", sre.google