AI Agent Reliability Testing Explained: A 2026 Engineering Guide

Q: How do you test an AI agent for reliability?

Build a fixed eval set per capability (50-200 examples), score each model and prompt version on it, and gate promotion to production on the pass rate. Layer scenario tests for multi-step flows, chaos drills for failure modes, and a small canary rollout (1-5 percent of traffic) before full release.

Q: How many tests should an AI agent have?

There is no single right number; the rule is to have enough to cover the failure modes you care about. A common discipline is 50-100 cases per capability, with a target pass rate above 90 percent. At Gravity we run 80+ tests per capability across baseline accuracy, edge cases, adversarial inputs, and end-to-end flows.

Q: What is chaos engineering for AI agents?

Deliberately injecting failures (tool timeouts, model rate limits, malformed responses, mid-run cancellation) into staging and measuring whether the agent degrades gracefully. The pattern is borrowed from Netflix's Chaos Monkey but adapted to agent-specific failure modes.

Q: How do you do regression testing on an LLM-based agent?

Pin a fixed eval set per capability. On every prompt change, model upgrade, or tool update, re-run the eval and compare pass rate against the previous version. Block promotion if pass rate drops more than a defined threshold (commonly 2 percentage points). The eval set is the regression contract.

Most engineers who move from testing deterministic services to testing agents discover that their old reflexes do not work. Unit tests assume identical inputs produce identical outputs. Agents do not. Integration tests assume a known happy path. Agents pick the path. End-to-end tests assume a closed surface. Agents reach for tools that change shape weekly.

The fix is not to discard the testing discipline. It is to extend it: eval sets replace assertion-based unit tests; scenario suites replace integration tests; chaos drills replace load tests. This is the working playbook I would hand to a team standing up reliability testing for their first agent in production in 2026.

Why testing agents differs from testing services

Three properties of agents force the change. First, non-determinism: identical inputs may produce different outputs. The test must measure the distribution of outcomes, not equality. Second, multi-step structure: the agent decides which tools to call. The test must verify behaviour, not specific calls. Third, external coupling: every tool call reaches out to a system whose shape the agent does not control. The test must distinguish "the agent did the wrong thing" from "the tool returned something unexpected".

The implication is that testing is statistical, not categorical. Pass rate on a held-out set is the unit of measurement, and the bar is the threshold the team commits to.
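Because the measurement is a pass rate on a finite sample, it helps to see how much uncertainty a given eval-set size carries. The sketch below uses the Wilson score interval, which is one common choice and not something this guide prescribes; the function name and the 95 percent level are assumptions for illustration. With 90 passes out of 100 cases, it puts the plausible range at roughly 83 to 94 percent.

```python
import math


def wilson_interval(passes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial pass rate.

    z = 1.96 corresponds to roughly 95 percent confidence (an assumed default).
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= passes <= total:
        raise ValueError("passes must be between 0 and total")
    p = passes / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


low, high = wilson_interval(passes=90, total=100)
print(f"observed 90.0%, plausible range {low:.1%} to {high:.1%}")
```

A wide interval is a hint that the eval set is too small to support a tight promotion threshold for that capability.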

Eval sets and the regression contract

An eval set is a labelled collection of representative inputs with expected outputs. It is the agent's regression contract.

How to build one

Sample 50-200 examples from production traffic for each agent capability
Label expected outputs: exact match, regex, semantic match against gold, or rubric judgement
Version the set; treat changes as code (PR, review, changelog)
Stratify by difficulty: easy, typical, edge case, adversarial
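The steps above can be sketched as a small data structure plus a scorer. This is a minimal illustration, not a reference implementation: `EvalCase`, `case_passes`, and `run_eval` are hypothetical names, only exact and regex matching are covered (semantic and rubric matching need a judge), and the agent is assumed to be a plain function from prompt to text.

```python
import re
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    prompt: str
    expected: str
    match: str = "exact"      # "exact" or "regex" in this sketch
    stratum: str = "typical"  # easy | typical | edge | adversarial


def case_passes(case: EvalCase, actual: str) -> bool:
    """Score one output against its label."""
    if case.match == "exact":
        return actual.strip() == case.expected.strip()
    if case.match == "regex":
        return re.search(case.expected, actual) is not None
    raise ValueError(f"unsupported match type: {case.match}")


def run_eval(cases: Iterable[EvalCase], agent: Callable[[str], str]) -> dict[str, float]:
    """Return the pass rate per stratum, plus an 'overall' rate."""
    by_stratum: dict[str, list[bool]] = {}
    for case in cases:
        by_stratum.setdefault(case.stratum, []).append(case_passes(case, agent(case.prompt)))
    if not by_stratum:
        raise ValueError("eval set is empty")
    rates = {stratum: sum(r) / len(r) for stratum, r in by_stratum.items()}
    flat = [ok for results in by_stratum.values() for ok in results]
    rates["overall"] = sum(flat) / len(flat)
    return rates
```

Reporting per stratum keeps a regression on adversarial cases from hiding behind a healthy overall number.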

The promotion gate

Every prompt change, model upgrade, or tool update triggers a re-run of the eval set. If pass rate drops more than a defined threshold (commonly 2 percentage points), promotion is blocked. The team either accepts the regression with explicit sign-off or fixes the issue.
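As a rough sketch, the gate reduces to one comparison. The function name, the `signed_off` flag, and the small floating-point tolerance are assumptions made for this example; the 2-point default is the common threshold mentioned above.

```python
def gate_promotion(
    baseline_rate: float,
    candidate_rate: float,
    max_drop_points: float = 2.0,
    signed_off: bool = False,
) -> bool:
    """Return True if the candidate version may be promoted.

    Rates are fractions in [0, 1]; the threshold is in percentage points.
    """
    drop_points = (baseline_rate - candidate_rate) * 100
    # Tolerance so a drop of exactly the threshold is not blocked by float rounding.
    if drop_points <= max_drop_points + 1e-9:
        return True
    # A larger regression only ships with explicit sign-off.
    return signed_off


print(gate_promotion(0.92, 0.91))                   # 1-point drop
print(gate_promotion(0.92, 0.88))                   # 4-point drop
print(gate_promotion(0.92, 0.88, signed_off=True))  # accepted regression
```

In a CI pipeline the boolean would typically decide the exit code of the eval job, so a blocked promotion fails the build.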

Frameworks worth knowing

Stanford HELM provides a multi-metric evaluation framework for foundation models (Stanford HELM). OpenAI evals is an open-source framework for building eval sets (OpenAI evals). Anthropic published guidance on evaluating Claude in production (Anthropic eval guide). For agent-specific evals, Langsmith, Langfuse, and Arize Phoenix all include eval primitives.

Scenario tests for multi-step flows

An eval set tests one input at a time. A scenario test simulates a multi-step flow: tool call, response, follow-up call, decision, recovery. The scenario runs in a sandbox with mocked tools that respond on a script.

What a scenario looks like

"Customer requests a refund. Agent looks up order; finds order with status delivered. Agent applies refund policy; refund is within window; agent files refund through Stripe. Stripe returns 503. Agent retries once. Stripe returns success. Agent confirms to customer. Audit log entries present for all steps."

Each scenario asserts not just the final output but the intermediate behaviour: which tools were called, in what order, with which arguments.
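The refund scenario above might be expressed as follows. Everything here is a stand-in: `ScriptedTool` is a hypothetical mock that replays a script and records calls, `refund_agent` is a deterministic placeholder for the real agent, and the tool names and payload fields are invented for the example rather than taken from Stripe's API.

```python
class ServiceUnavailable(Exception):
    """Stands in for a 503 from the payment provider."""


class ScriptedTool:
    """Mock tool that replays scripted responses and logs every call."""

    def __init__(self, name, responses, log):
        self.name = name
        self._responses = list(responses)
        self._log = log

    def __call__(self, **kwargs):
        self._log.append((self.name, kwargs))
        response = self._responses.pop(0)
        if isinstance(response, Exception):
            raise response
        return response


def refund_agent(order_id, lookup_order, create_refund, max_retries=1):
    """Placeholder for the agent under test."""
    order = lookup_order(order_id=order_id)
    if order["status"] != "delivered" or not order["within_refund_window"]:
        return "refund_denied"
    for attempt in range(max_retries + 1):
        try:
            create_refund(order_id=order_id, amount=order["amount"])
            return "refund_confirmed"
        except ServiceUnavailable:
            if attempt == max_retries:
                break
    return "escalated_to_human"


def test_refund_with_one_retry():
    calls = []
    lookup = ScriptedTool(
        "lookup_order",
        [{"status": "delivered", "within_refund_window": True, "amount": 4200}],
        calls,
    )
    refund = ScriptedTool(
        "create_refund", [ServiceUnavailable("503"), {"status": "succeeded"}], calls
    )
    outcome = refund_agent("ord_1", lookup, refund)
    assert outcome == "refund_confirmed"
    # Intermediate behaviour: which tools, in what order, with which arguments.
    assert [name for name, _ in calls] == ["lookup_order", "create_refund", "create_refund"]
    assert calls[1][1] == {"order_id": "ord_1", "amount": 4200}


test_refund_with_one_retry()
```

With a real, non-deterministic agent, the order assertion is usually loosened to required and forbidden calls, since the agent may legitimately take more than one valid path.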

Chaos drills for failure modes

Chaos engineering for agents borrows from Netflix's Chaos Monkey practice (Netflix Chaos Monkey) and adapts it to agent-specific failure modes.

Failure modes to inject

Tool timeout (the call hangs past the configured timeout)
Tool returns malformed JSON
Model rate-limited at the provider
Model returns a refusal mid-chain
Network partition during a tool call
Mid-run cancellation (kill switch invoked)
Memory store unavailable

What graceful degradation looks like

The agent detects the failure, classifies it, applies the appropriate policy (see AI agent error handling and rollback), and either recovers within the bounded outcome, escalates to a human, or halts cleanly with a complete audit trail. The outcome to rule out is the agent calling the same broken tool until the budget runs out.
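A drill for one failure mode, tool timeout, could look like the sketch below. The wrapper, the exception type, and the audit-log shape are all hypothetical; a seeded random generator is assumed so that a failing drill can be replayed.

```python
import random


class ToolTimeout(Exception):
    """Injected failure standing in for a call that hangs past its timeout."""


def with_chaos(tool, failure_rate: float, rng: random.Random):
    """Wrap a tool so a fraction of calls raise ToolTimeout instead of running."""
    def wrapped(*args, **kwargs):
        if rng.random() < failure_rate:
            raise ToolTimeout("injected timeout")
        return tool(*args, **kwargs)
    return wrapped


def call_with_budget(tool, max_attempts: int, audit_log: list):
    """Bounded retry: recover if possible, otherwise halt cleanly and record it."""
    for attempt in range(1, max_attempts + 1):
        try:
            result = tool()
            audit_log.append(("ok", attempt))
            return result
        except ToolTimeout:
            audit_log.append(("timeout", attempt))
    audit_log.append(("halted", max_attempts))
    return None


# Drill: the tool always times out; the caller must stop after its budget.
log = []
always_down = with_chaos(lambda: "order found", failure_rate=1.0, rng=random.Random(7))
assert call_with_budget(always_down, max_attempts=3, audit_log=log) is None
assert log[-1] == ("halted", 3)
```

The assertion on the audit log is the point of the drill: the test passes only if the failure was both bounded and recorded.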

Judge models and rubric scoring

For outputs where exact or regex matching fails (free-form responses, summaries, drafts), use a judge model. The judge model receives the input, the expected output, the actual output, and a rubric, and scores the actual output on the rubric.

The pattern was formalised in MT-Bench (Zheng et al., Judging LLM-as-a-Judge with MT-Bench, NeurIPS 2023) and is now common in production eval pipelines. Caveats: judge models have their own biases (length bias, position bias) and should be validated against a small human-labelled subset.
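Validating a judge against human labels can be as simple as measuring how often the two agree. In the sketch below the judge is any callable that returns a rubric score from 1 to 5; the `toy_judge` is a keyword-matching placeholder for a real model call, and the pass cut-off of 4 is an assumption chosen for illustration.

```python
from typing import Callable, NamedTuple, Sequence


class JudgedExample(NamedTuple):
    prompt: str
    expected: str
    actual: str
    human_pass: bool  # label from a human reviewer


# (prompt, expected, actual, rubric) -> score from 1 to 5
Judge = Callable[[str, str, str, str], int]


def judge_agreement(
    examples: Sequence[JudgedExample], judge: Judge, rubric: str, pass_score: int = 4
) -> float:
    """Fraction of examples where the judge's pass/fail matches the human label."""
    if not examples:
        raise ValueError("need at least one human-labelled example")
    agree = 0
    for ex in examples:
        judge_pass = judge(ex.prompt, ex.expected, ex.actual, rubric) >= pass_score
        agree += judge_pass == ex.human_pass
    return agree / len(examples)


def toy_judge(prompt: str, expected: str, actual: str, rubric: str) -> int:
    """Placeholder only: a real judge would call a model with the rubric."""
    return 5 if expected.lower() in actual.lower() else 1


labelled = [
    JudgedExample("Capital of France?", "Paris", "The capital is Paris.", True),
    JudgedExample("Capital of France?", "Paris", "London", False),
    JudgedExample("Capital of France?", "Paris", "Paris, or maybe Lyon", False),
]
print(judge_agreement(labelled, toy_judge, rubric="Answer must be correct and unhedged"))
```

Low agreement on the labelled subset is a signal to revise the rubric or the judge prompt before trusting its scores in the promotion gate.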

Canary rollout and shadow traffic

Before any change reaches 100 percent of traffic, it goes through a canary: the new version serves 1-5 percent of production traffic while the rest stays on the previous version. Compare error rate, pass rate, latency, and cost. Promote only if there is no regression on the headline metrics.

Shadow traffic is the safer variant: run both versions in parallel on the same inputs and compare outputs, while serving only the previous version's output. It is useful when you want to see how the new version behaves on production traffic without risking customer-visible regressions.
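Both mechanisms fit in a few lines. The sketch below assumes requests carry a stable identifier to hash on, so the same request always lands in the same bucket; `in_canary` and `handle_with_shadow` are illustrative names, and a production shadow path would normally run asynchronously with side-effecting tools disabled.

```python
import hashlib


def in_canary(request_id: str, canary_percent: float) -> bool:
    """Deterministically assign a request to the canary by hashing its id."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < canary_percent * 100


def handle_with_shadow(request, current, candidate, diffs: list):
    """Serve the current version; run the candidate only to record differences."""
    served = current(request)
    try:
        shadow = candidate(request)
        if shadow != served:
            diffs.append((request, served, shadow))
    except Exception as exc:  # a candidate failure must never reach the customer
        diffs.append((request, served, f"candidate error: {exc!r}"))
    return served


diffs = []
print(handle_with_shadow("hello", str.upper, str.title, diffs))  # serves current output
print(diffs)
```

Exact inequality is a blunt comparison for free-form outputs; in practice the diff step would reuse the same scorers as the eval set.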

Test cadence by event

Event	What to run
Prompt change	Eval set for the affected capability; block on regression
Model upgrade	Full eval set across all capabilities; canary at 5 percent for 24 hours
Tool added or updated	Scenario suite for flows that touch the tool
Production drift detected	Targeted scenario test; root-cause investigation; augment the eval set
Quarterly	Full chaos drill; kill switch SLA verification

How we run 80+ tests per capability at Gravity

Gravity ships every agent capability with at least 80 tests across four categories: baseline accuracy (the eval set), edge cases (rare but realistic inputs), adversarial inputs (prompt injection attempts, malformed tool responses), and end-to-end scenarios (multi-step flows with mocked tools). The pass-rate target is above 92 percent on baseline accuracy and above 85 percent on edge cases.
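Per-category targets like these can be checked mechanically. The snippet is a hedged sketch: the two thresholds come from the paragraph above, while the category keys, the function name, and the rule that a missing category counts as unmet are assumptions for the example.

```python
from typing import Mapping

# Targets stated above; adversarial and end-to-end have no stated target here.
TARGETS: Mapping[str, float] = {"baseline": 0.92, "edge": 0.85}


def unmet_targets(
    pass_rates: Mapping[str, float], targets: Mapping[str, float] = TARGETS
) -> list[str]:
    """Return the categories whose pass rate is not strictly above its target."""
    return sorted(
        category
        for category, floor in targets.items()
        if pass_rates.get(category, 0.0) <= floor
    )


print(unmet_targets({"baseline": 0.95, "edge": 0.80}))
```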

Why 80? In our experience it is the number that catches the bulk of practical failure modes without becoming an end in itself. Below that we missed regressions; above that, marginal cases dominated and the discipline turned into a chore. For the deeper view see how we run 80+ tests per agent capability.

Frequently asked questions

How do you test an AI agent for reliability?

Eval set per capability for regression. Scenario tests for multi-step flows. Chaos drills for failure modes. Canary before full rollout.

What is an eval set for an AI agent?

A labelled set of representative inputs with expected outputs (exact, regex, semantic, or rubric). Fixed, versioned, used as the promotion gate.

How many tests should an AI agent have?

Enough to cover the failure modes you care about. Common discipline is 50-100 cases per capability with pass rate above 90 percent. We run 80+ at Gravity.

What is chaos engineering for AI agents?

Deliberately injecting failures (timeouts, rate limits, malformed responses, cancellation) and measuring graceful degradation.

How do you do regression testing on an LLM-based agent?

Pin a fixed eval set. Re-run on every prompt, model, or tool change. Block promotion if pass rate drops more than the threshold (commonly 2 points).

Three things to ship this week

An eval set for your top agent capability. Start with 50 examples.
One scenario test for the end-to-end flow.
The promotion gate: block on regression above a threshold.

Sources

Stanford HELM, "Holistic Evaluation of Language Models", crfm.stanford.edu
OpenAI evals, GitHub, github.com/openai/evals
Anthropic, "Evaluate Claude in production", docs.anthropic.com
Zheng et al., "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena", NeurIPS 2023, arxiv.org
Netflix, "Chaos Monkey", netflix.github.io