"Is the agent any good?" is the question every buyer asks and almost no buyer can answer with a number. The shortage of good answers is not because the metrics are unknown; it is because most vendors publish one or two metrics, the favourable ones, and the rest of the picture is not in the marketing. This piece is the metric stack a buyer actually needs to evaluate an agent honestly: what to look for, what each measures, and what each one alone hides.
The metric stack matters because agents are non-deterministic systems and single-shot tests do not characterise them. The 80-test methodology covered in "how we test AI agents" exists for exactly this reason. Here, we widen the lens from the testing methodology to the evaluation framework that wraps it.
The five core metrics
Every honest agent evaluation rests on five numbers. Each tells only part of the story on its own; together they describe whether an agent is ready for autonomous use.
Task completion rate
The most-cited metric, and the easiest to over-cite. Task completion rate is the percentage of tasks the agent finishes successfully against a defined success criterion. Used well, it is a reasonable headline number. Used alone, it hides the things that break trust in production.
Three failure modes hide inside a high task completion rate. First, reliability variance: an agent that finishes 95% of tasks but with a 30% spread between runs of the same task is unreliable in a way the headline number does not show. Second, refusal failures: an agent that finishes inappropriate tasks (sending emails based on injected instructions, transferring funds without authorisation) inflates the completion rate while creating risk. Third, partial completion masquerading as success: an agent that completes 80% of the steps and reports success has not actually completed the task.
The honest version of task completion rate splits success into "fully completed and verified" versus "claimed completion that did not survive verification". The verification step is the part most evaluations skip.
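As a sketch of what that split looks like in an evaluation harness (the `run_agent` and `verify_outcome` names are placeholders for whatever the harness actually provides, not a real API):

```python
# Hypothetical harness sketch: count "verified" completions separately from
# completions the agent merely claimed. verify_outcome should inspect the real
# side effect (the record created, the email sent), not the agent's own report.

def completion_rates(tasks, run_agent, verify_outcome):
    claimed = verified = 0
    for task in tasks:
        result = run_agent(task)
        if result.reported_success:
            claimed += 1
            if verify_outcome(task, result):  # check the world, not the claim
                verified += 1
    n = len(tasks)
    return {
        "claimed_completion_rate": claimed / n,
        "verified_completion_rate": verified / n,
        # The gap between the two is the number most evaluations never publish.
        "claim_verification_gap": (claimed - verified) / n,
    }
```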
Reliability
Reliability is consistency. The same task, run repeatedly with the same inputs, should produce the same outcome to within a small variance. Reliability is what makes an agent usable for autonomous tasks: a 95% completion rate with 5% variance is workable; a 95% completion rate with 30% variance is not, because the buyer cannot predict whether the next run will succeed.
Reliability is measured through repeated testing across the eight failure categories described in the 80-test methodology: input variation, tool failure, partial results, hostile input, rate limits, schema drift, refusal correctness, and idempotency. The reliability number that matters is the weighted pass rate across all eight categories, with the weights set by failure cost. Idempotency failures cost more than input-variation failures, so they count more.
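One way to compute that weighted pass rate, as a minimal sketch; the weights below are illustrative only and should be replaced with the buyer's own failure-cost estimates:

```python
# Illustrative failure-cost weights; not the weights from the 80-test methodology.
CATEGORY_WEIGHTS = {
    "input_variation": 1.0,
    "tool_failure": 1.5,
    "partial_results": 1.5,
    "hostile_input": 2.0,
    "rate_limits": 1.0,
    "schema_drift": 1.5,
    "refusal_correctness": 2.0,
    "idempotency": 2.5,  # double-execution is usually the costliest failure
}

def weighted_reliability(results):
    """results: {category: (passed, total)} aggregated across repeated runs."""
    numerator = sum(
        CATEGORY_WEIGHTS[cat] * passed / total
        for cat, (passed, total) in results.items()
    )
    return numerator / sum(CATEGORY_WEIGHTS.values())
```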
Public benchmarks give some signal here. GAIA reports human pass rates above 90% with the strongest evaluated AI agent systems below 50% on the harder levels (Mialon et al., GAIA, 2023). SWE-bench shows multi-step task pass rates dropping with each additional tool call (SWE-bench leaderboard). Both confirm that reliability under variation is the binding constraint, not raw single-shot capability.
Cost per task and latency distribution
Cost per task is the dollars (or tokens, or both) per agent execution. The number alone is meaningless; it only matters in the context of task economics. A sales-follow-up agent earning $5 expected revenue per send can absorb $0.10 per task and still be highly profitable. A high-volume data-extraction agent at 10,000 tasks per day needs sub-cent cost per task to make sense. The economics framework in "economics of bootstrapped AI agents" covers this in more detail; for evaluation, the rule is simple: cost per task must be a small fraction of value per task.
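The arithmetic is simple enough to sanity-check on the back of an envelope. A sketch, with token prices and the "small fraction" threshold as illustrative assumptions rather than vendor figures:

```python
# Back-of-the-envelope task economics. Per-million-token prices and the 10%
# cost-of-value threshold are illustrative assumptions, not real pricing.
def cost_per_task(input_tokens, output_tokens,
                  price_in_per_mtok=3.00, price_out_per_mtok=15.00,
                  tool_and_infra_cost=0.0):
    llm_cost = (input_tokens / 1e6) * price_in_per_mtok \
             + (output_tokens / 1e6) * price_out_per_mtok
    return llm_cost + tool_and_infra_cost

def economics_ok(cost, value_per_task, max_cost_fraction=0.10):
    # Rule of thumb from the text: cost must be a small fraction of value.
    return cost <= max_cost_fraction * value_per_task

# A sales-follow-up agent at ~$0.10 per task against $5 expected value clears
# the bar; a 10,000-task/day extraction agent needs sub-cent cost to clear it.
```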
Latency distribution matters more than average latency. Read p50 (median) and p95 (the 95th percentile, the long tail). For an interactive agent, p95 latency is often what users actually feel; the average is misleading because a small number of slow tasks pull it disproportionately. For a batch agent, p95 might not matter much, but throughput (tasks per minute) does. The right metric depends on the agent's use; the wrong metric is "average latency reported once".
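Computing p50 and p95 from raw per-task latencies is a few lines with numpy; a sketch:

```python
import numpy as np

def latency_profile(latencies_seconds):
    """Summarise per-task latencies; p95 is what interactive users actually feel."""
    lat = np.asarray(latencies_seconds, dtype=float)
    return {
        "p50": float(np.percentile(lat, 50)),
        "p95": float(np.percentile(lat, 95)),
        "mean": float(lat.mean()),  # reported far too often on its own
        # Throughput proxy, assuming tasks run sequentially.
        "throughput_per_min": 60.0 / lat.mean() if lat.mean() > 0 else float("inf"),
    }
```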
Refusal correctness
Refusal correctness is the metric most evaluations underweight. It measures whether the agent refuses what it should refuse and does not refuse what it should not refuse. Both halves matter equally.
The "refuse what it should" half is the safety side. An agent that follows instructions embedded in untrusted text (an email body, a web page, a document attachment) is a security incident vector; OWASP's LLM Top 10 lists prompt injection as the #1 risk for LLM-powered systems for this reason (OWASP Top 10 for LLM Applications). Measuring refusal correctness means feeding the agent a battery of prompt-injection attempts and counting how often it ignores the injection.
The "do not refuse what it should not refuse" half is the utility side. An agent that refuses too aggressively is unusable; over-cautious agents become a tax on the user, who has to either rephrase or give up. Measuring this means feeding the agent legitimate instructions phrased in unusual ways (the input-variation category) and counting how often it acts correctly versus refuses defensively.
The combined refusal-correctness rate captures both. Most public benchmarks measure neither half well, which is why agent vendors who publish a refusal-correctness number stand out. The 95% threshold described in the 80-test methodology is the binding constraint for shipping.
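A sketch of how the two halves combine into a single rate; the equal weighting is an assumption, and the right weighting depends on which failure costs the buyer more:

```python
def refusal_correctness(injection_results, legit_results):
    """
    injection_results: bools, True if the agent ignored or refused an injected instruction.
    legit_results: bools, True if the agent acted on an unusually phrased but legitimate request.
    """
    should_refuse_rate = sum(injection_results) / len(injection_results)
    should_act_rate = sum(legit_results) / len(legit_results)
    # Equal weighting of the safety half and the utility half is an assumption;
    # weight by whichever failure is more expensive for the buyer.
    return 0.5 * should_refuse_rate + 0.5 * should_act_rate
```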
Reading the stack together
The five metrics tell different parts of the same story; reading them together is the evaluation skill. A few patterns worth recognising (the sketch after this list turns them into explicit checks):
- High completion, low reliability: demo-ready, not ship-ready. The agent works in the demo because the demo is run-once.
- High completion, high reliability, weak refusal: dangerous. The agent does what it is told, including what it should not be told.
- High everything except cost: works but uneconomic. Either the agent runs too few tools at high cost or the buyer's task economics do not support the price.
- High everything except latency p95: works on average; the long tail breaks user trust over time.
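A minimal sketch of those reads as explicit checks; every threshold here is illustrative, not a recommendation:

```python
# Illustrative thresholds only; a buyer should set these from their own
# failure-cost analysis rather than copy them from this sketch.
def read_the_stack(m):
    """m: dict with completion, reliability_spread, refusal_correctness,
    cost_fraction_of_value, latency_p50_s, latency_p95_s."""
    flags = []
    if m["completion"] >= 0.95 and m["reliability_spread"] > 0.10:
        flags.append("demo-ready, not ship-ready: high completion, low reliability")
    if m["completion"] >= 0.95 and m["refusal_correctness"] < 0.95:
        flags.append("dangerous: completes tasks it should refuse")
    if m["cost_fraction_of_value"] > 0.10:
        flags.append("uneconomic: cost per task is not a small fraction of value")
    if m["latency_p95_s"] > 5 * m["latency_p50_s"]:
        flags.append("long tail: p95 latency will erode user trust")
    return flags or ["no red flags against these illustrative thresholds"]
```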
The five-metric framework is also how vendors should publish honestly. A vendor that publishes task completion rate without reliability variance, refusal correctness, or latency p95 is publishing the favourable subset. Buyers who learn to ask "and what about p95? and what is your reliability spread? and what is your refusal-correctness rate against prompt injection?" quickly separate the marketing from the methodologies.
Across three startups, the pattern that held was the same: the products that earned trust did so by publishing the unflattering numbers alongside the flattering ones. Buyers reward honest measurement; they punish gaps they discover themselves. The evaluation framework is the artefact that makes that honesty operationally tractable.
Frequently asked questions
What metrics are used to evaluate AI agents?
Five core metrics: task completion rate (does the agent finish?), reliability (consistency across repeated runs of the same task), cost per task (dollars or tokens per execution), latency distribution (p50 and p95, not just the average), and refusal correctness (the agent refuses when it should and acts when it should).
Why is task completion rate not enough to evaluate an AI agent?
Task completion rate alone hides reliability variance and refusal failures. An agent that completes 95 percent of tasks but with 30 percent variance run-to-run is unreliable. An agent that completes 95 percent of tasks but executes inappropriate ones is dangerous. The metric stack has to answer all three questions: did it finish, did it finish consistently, and did it finish only when it should.
How is AI agent reliability measured?
Reliability is the consistency of behaviour across repeated runs of the same task with the same inputs. The standard measurement is pass rate across 80 or more tests covering input variation, tool failure, partial results, hostile input, rate limits, schema drift, refusal correctness, and idempotency. The aggregate is weighted by failure cost, not weighted equally.
What is a good cost per task for an AI agent?
Depends on the task economics. A sales-follow-up agent earning $5 expected revenue per task can absorb $0.10 per task and still be net positive. A high-volume data-extraction agent at thousands of tasks per day needs sub-cent cost per task to make sense. The right benchmark is task economics, not absolute cost.
Why does refusal correctness matter as a metric?
An agent that completes everything you ask, including the things you should not have asked, is a security incident waiting to happen. Refusal correctness measures both directions: does the agent refuse instructions that come from untrusted text, and does the agent NOT refuse legitimate instructions out of over-caution. Both halves are equally important to ship-readiness.
Three takeaways before you close this tab
- Read all five metrics together. Single metrics mislead.
- Reliability variance and refusal correctness are the often-skipped numbers. Both separate ship-ready from demo-ready.
- Cost per task only makes sense in the context of value per task. The framework is task economics, not absolute cost.
Sources
- Mialon et al., "GAIA: A Benchmark for General AI Assistants", arXiv:2311.12983, 2023, retrieved 2026-05-07, arxiv.org/abs/2311.12983
- SWE-bench, "Leaderboard", retrieved 2026-05-07, swebench.com
- OWASP, "Top 10 for LLM Applications", retrieved 2026-05-07, owasp.org
- NIST, "AI Risk Management Framework", retrieved 2026-05-07, nist.gov/itl/ai-risk-management-framework