Most engineers transition from testing deterministic services to testing agents and discover their old reflexes do not work. Unit tests assume identical inputs produce identical outputs. Agents do not. Integration tests assume a known happy path. Agents pick the path. End-to-end tests assume a closed surface. Agents reach for tools that change shape weekly.
The fix is not to discard the testing discipline. It is to extend it: eval sets replace assertion-based unit tests; scenario suites replace integration tests; chaos drills replace load tests. This is the working playbook I would hand to a team standing up reliability testing for their first agent in production in 2026.
Why testing agents differs from testing services
Three properties of agents force the change. First, non-determinism: two identical inputs may produce different outputs. The test must measure distribution of outcomes, not equality. Second, multi-step structure: the agent decides which tools to call. The test must verify behaviour, not specific calls. Third, external coupling: every tool call reaches out to a system whose shape the agent does not control. The test must distinguish "the agent did the wrong thing" from "the tool returned something unexpected".
The implication is that testing is statistical, not categorical. Pass rate on a held-out set is the unit of measurement, and the bar is the threshold the team commits to.
Eval sets and the regression contract
An eval set is a labelled collection of representative inputs with expected outputs. It is the agent's regression contract.
How to build one
- Sample 50-200 examples from production traffic for each agent capability
- Label expected outputs: exact match, regex, semantic match against gold, or rubric judgement
- Version the set; treat changes as code (PR, review, changelog)
- Stratify by difficulty: easy, typical, edge case, adversarial
The promotion gate
Every prompt change, model upgrade, or tool update triggers a re-run of the eval set. If pass rate drops more than a defined threshold (commonly 2 percentage points), promotion is blocked. The team either accepts the regression with explicit sign-off or fixes the issue.
Frameworks worth knowing
Stanford HELM provides a multi-metric evaluation framework for foundation models (Stanford HELM). OpenAI evals is an open-source framework for building eval sets (OpenAI evals). Anthropic published guidance on evaluating Claude in production (Anthropic eval guide). For agent-specific evals, Langsmith, Langfuse, and Arize Phoenix all include eval primitives.
Scenario tests for multi-step flows
An eval set tests one input at a time. A scenario test simulates a multi-step flow: tool call, response, follow-up call, decision, recovery. The scenario runs in a sandbox with mocked tools that respond on a script.
What a scenario looks like
"Customer requests a refund. Agent looks up order; finds order with status delivered. Agent applies refund policy; refund is within window; agent files refund through Stripe. Stripe returns 503. Agent retries once. Stripe returns success. Agent confirms to customer. Audit log entries present for all steps."
Each scenario asserts not just the final output but the intermediate behaviour: which tools were called, in what order, with which arguments.
Chaos drills for failure modes
Chaos engineering for agents borrows from Netflix's Chaos Monkey practice (Netflix Chaos Monkey) and adapts to agent-specific failure modes.
Failure modes to inject
- Tool timeout (the call hangs past the configured timeout)
- Tool returns malformed JSON
- Model rate-limited at the provider
- Model returns a refusal mid-chain
- Network partition during a tool call
- Mid-run cancellation (kill switch invoked)
- Memory store unavailable
What graceful degradation looks like
The agent detects the failure, classifies it, applies the appropriate policy (see AI agent error handling and rollback), and either recovers within the bounded outcome, escalates to a human, or halts cleanly with a complete audit trail. None of these outcomes are "the agent kept calling the same broken tool until the budget ran out".
Judge models and rubric scoring
For outputs where exact or regex matching fails (free-form responses, summaries, drafts), use a judge model. The judge model receives the input, the expected output, the actual output, and a rubric, and scores the actual output on the rubric.
The pattern was formalised in MT-Bench (Zheng et al., 2023) and is now standard in production eval pipelines (Zheng et al., Judging LLM-as-a-Judge with MT-Bench, NeurIPS 2023). Caveats: judge models have their own biases (length bias, position bias) and should be validated against a small human-labelled subset.
Canary rollout and shadow traffic
Before any change reaches 100 percent of traffic, it goes through canary. 1-5 percent of production traffic uses the new version; the rest stays on the previous. Compare error rate, pass rate, latency, cost. Promote only if no regression on the headline metrics.
Shadow traffic is the stronger version: run both versions in parallel on the same inputs; compare outputs; only the previous version's output is served. Useful when you want to see how the new version behaves on production traffic without risking customer-visible regressions.
Test cadence by event
| Event | What to run |
|---|---|
| Prompt change | Eval set for the affected capability; block on regression |
| Model upgrade | Full eval set across all capabilities; canary at 5 percent for 24 hours |
| Tool added or updated | Scenario suite for flows that touch the tool |
| Production drift detected | Targeted scenario test; root-cause investigation; eval set augment |
| Quarterly | Full chaos drill; kill switch SLA verification |
How we run 80+ tests per capability at Gravity
Gravity ships every agent capability with at least 80 tests across four categories: baseline accuracy (the eval set), edge cases (rare but realistic inputs), adversarial inputs (prompt injection attempts, malformed tool responses), and end-to-end scenarios (multi-step flows with mocked tools). The pass-rate target is above 92 percent on baseline accuracy and above 85 percent on edge cases.
Why 80. It is the number that catches the bulk of practical failure modes without becoming an end in itself. Below that we missed regressions; above that, marginal cases dominated and the discipline turned into a chore. For the deeper view see how we run 80+ tests per agent capability.
Frequently asked questions
How do you test an AI agent for reliability?
Eval set per capability for regression. Scenario tests for multi-step flows. Chaos drills for failure modes. Canary before full rollout.
What is an eval set for an AI agent?
A labelled set of representative inputs with expected outputs (exact, regex, semantic, or rubric). Fixed, versioned, used as the promotion gate.
How many tests should an AI agent have?
Enough to cover the failure modes you care about. Common discipline is 50-100 cases per capability with pass rate above 90 percent. We run 80+ at Gravity.
What is chaos engineering for AI agents?
Deliberately injecting failures (timeouts, rate limits, malformed responses, cancellation) and measuring graceful degradation.
How do you do regression testing on an LLM-based agent?
Pin a fixed eval set. Re-run on every prompt, model, or tool change. Block promotion if pass rate drops more than the threshold (commonly 2 points).
Three things to ship this week
- An eval set for your top agent capability. Start with 50 examples.
- One scenario test for the end-to-end flow.
- The promotion gate: block on regression above a threshold.
Sources
- Stanford HELM, "Holistic Evaluation of Language Models", crfm.stanford.edu
- OpenAI evals, GitHub, github.com/openai/evals
- Anthropic, "Evaluate Claude in production", docs.anthropic.com
- Zheng et al., "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena", NeurIPS 2023, arxiv.org
- Netflix, "Chaos Monkey", netflix.github.io