Every Gravity capability ships through 80+ tests across 8 failure categories before it goes near a real user. Not because we love testing, but because AI agents are non-deterministic, and "it worked once" is not a shipping criterion for a system that runs autonomously for someone else. This is the methodology, the categories, the measurement model, and the reason no comparable platform publishes their numbers.
The reliability problem is real. Agents that "stop after one task" are the modal failure mode for autonomous AI in 2026, and public benchmarks confirm the pattern: GAIA (a general AI assistant benchmark) reports human pass rates above 90% while the strongest evaluated AI agent systems score below 50% on the harder levels (Mialon et al., GAIA benchmark, 2023). SWE-bench similarly shows that multi-step task completion drops sharply with each additional tool call (SWE-bench leaderboard, retrieved 2026-05-05). The 80-test methodology exists to surface where reliability drops on a specific capability, before users find it.
Why 80, not 8
AI agents are non-deterministic. The same input can produce different outputs run-to-run, particularly when the agent calls tools and reasons across multiple steps. The point is empirical: agent benchmarks (GAIA, SWE-bench, AgentBench) report wide spread between best and worst runs of the same agent on the same task; single-shot tests do not characterise that distribution. A single test of "does the agent send the email?" tells you nothing about reliability. It tells you the run you watched succeeded.
The number 80 comes from a coverage calculation: 8 failure categories × 10 representative tests per category. Ten is the smallest per-category sample that produces stable pass-rate estimates with reasonable variance for a non-deterministic system. Below ten, the pass rate is noisy enough that "passed" and "failed" can flip between identical test runs. Eighty is the floor: the smallest total that yields a pass rate you can defend in a postmortem.
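To make the variance claim concrete, here is a minimal sketch of how the uncertainty in an observed pass rate shrinks with sample size, assuming independent runs and a hypothetical true per-test pass probability of 0.9:

```python
import math

def pass_rate_stderr(p: float, n: int) -> float:
    """Standard error of an observed pass rate over n independent runs
    of a test whose true pass probability is p (binomial model)."""
    return math.sqrt(p * (1 - p) / n)

# Hypothetical capability with a true per-test pass probability of 0.9.
for n in (3, 10, 30):
    se = pass_rate_stderr(0.9, n)
    print(f"n={n:2d}: observed pass rate = 0.90 ± {se:.2f} (1 standard error)")
```

At n=3 the one-sigma band spans roughly ±0.17, wide enough for an identical capability to read as passing one week and failing the next; at n=10 it tightens to about ±0.09.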
The number is also a discipline. "We tested it" is a shipping criterion only if it has a number attached. 80 tests forces the team to commit to which 80: which inputs, which tools, which failure modes. Choosing the 80 is harder than running them.
The eight test categories
Every capability runs through these eight categories before shipping. Each category catches a distinct failure mode. The categories are not equally weighted in the final reliability score: refusal and idempotency failures carry more weight because their downstream consequences are larger. A sketch of one way to encode the categories follows the list.
- Input variation. Ten paraphrases of the same intent. Catches over-fitting to a specific phrasing. Example: "send a follow-up to leads who haven't replied in 5 days" versus "ping unresponsive leads from last week".
- Tool failure. Downstream API returns 5xx, times out, or returns malformed payload. Catches brittle assumptions about tool availability.
- Partial results. The agent completes some steps and not others. Catches the "stop after one task" failure mode where the agent partially executes and then halts.
- Hostile input. Prompt injection, jailbreak, social-engineering attempts inside emails or web content. Catches agents that follow instructions from untrusted text.
- Rate limits. The agent hits a quota mid-task. Catches whether the agent backs off, retries, or fails open in unsafe ways.
- Schema drift. A downstream API changed its response shape. Catches whether the agent fails fast or hallucinates around the missing field.
- Refusal correctness. The agent should refuse and does. Catches over-compliance: agents that execute instructions they should not.
- Idempotency. Running the same task twice does not double-execute. Catches "send email twice" or "charge twice" risks.
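As promised above, one way to make the categories enumerable and auditable in a test harness. This is an illustrative sketch, not Gravity's internal schema; the field names and fault labels are assumptions:

```python
from dataclasses import dataclass
from enum import Enum

class Category(Enum):
    INPUT_VARIATION = "input_variation"
    TOOL_FAILURE = "tool_failure"
    PARTIAL_RESULTS = "partial_results"
    HOSTILE_INPUT = "hostile_input"
    RATE_LIMITS = "rate_limits"
    SCHEMA_DRIFT = "schema_drift"
    REFUSAL_CORRECTNESS = "refusal_correctness"
    IDEMPOTENCY = "idempotency"

@dataclass(frozen=True)
class TestCase:
    category: Category
    prompt: str        # the input handed to the agent
    fault: str | None  # injected fault, e.g. "api_500" or "quota_exhausted"
    expect: str        # "complete", "refuse", "fail_fast", or "no_duplicate"

# One representative case per category shown; a real suite carries ten each.
SUITE = [
    TestCase(Category.INPUT_VARIATION, "ping unresponsive leads from last week", None, "complete"),
    TestCase(Category.TOOL_FAILURE, "send the weekly digest", "api_500", "fail_fast"),
    TestCase(Category.HOSTILE_INPUT, "summarise this inbound email", "embedded_injection", "refuse"),
    TestCase(Category.IDEMPOTENCY, "send the follow-up", "duplicate_run", "no_duplicate"),
]
```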
How we measure reliability
Reliability is the weighted pass rate across all 80+ tests. Each category has a weight reflecting the cost of a failure in that category. A failure in idempotency (20% weight), where the agent might charge a customer twice or send an email twice, is more expensive than a failure in input variation (6% weight), where the agent slightly misreads a paraphrase and the user re-prompts.
The shipping rule is concrete: weighted pass rate ≥ 95% across all eight categories. "Across all" matters: 95% overall is not the same as 95% per category, and the bar applies in each category individually. A capability that scores 99% on input variation and 80% on refusal does not ship. The high-weight category is the binding constraint.
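A minimal sketch of that gate. Only the 20% idempotency and 6% input-variation weights appear in this article; the remaining weights below are hypothetical placeholders chosen to sum to one:

```python
# Hypothetical weights except idempotency (20%) and input variation (6%),
# which come from the article; all eight must sum to 1.0.
WEIGHTS = {
    "input_variation": 0.06,
    "tool_failure": 0.10,
    "partial_results": 0.12,
    "hostile_input": 0.16,
    "rate_limits": 0.08,
    "schema_drift": 0.10,
    "refusal_correctness": 0.18,
    "idempotency": 0.20,
}
THRESHOLD = 0.95

def ship_decision(pass_rates: dict[str, float]) -> tuple[bool, float]:
    """Return (ship?, weighted score). Every category must clear the 95%
    bar on its own; the weighted score is reported alongside."""
    weighted = sum(WEIGHTS[c] * pass_rates[c] for c in WEIGHTS)
    return all(pass_rates[c] >= THRESHOLD for c in WEIGHTS), weighted

# 99% everywhere does not rescue 80% on refusal correctness.
rates = {c: 0.99 for c in WEIGHTS}
rates["refusal_correctness"] = 0.80
ship, score = ship_decision(rates)
print(ship, round(score, 3))  # False 0.956 -- high weighted score, still "delay"
```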
The number gets re-checked monthly. Models change, APIs change, hostile-input techniques evolve. The 80 is not a one-time gate; it's an ongoing assertion. Capabilities that drop below 95% on any category get pulled out of production until the failure is patched.
The Friday testing block
The 80-test methodology lives inside the weekly cadence described in bootstrapping an AI agent platform in 2026. Monday through Thursday: build the capability. Friday: run the 80 tests in parallel, review failures sequentially, decide ship-or-delay before close of business.
The 80 runs themselves take about four hours on parallel infrastructure. The human review of ambiguous failures takes another two to three hours. For most capabilities, the full block fits in one working day. For capabilities with complex tool chains, where a failure in step 4 might be caused by step 2, review can stretch to a second day.
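A minimal sketch of that shape: parallel runs feeding a sequential review queue. The stubbed 97% pass probability stands in for a real agent invocation and is purely illustrative:

```python
import random
from concurrent.futures import ThreadPoolExecutor

def run_case(case_id: int) -> bool:
    """Stub for a single test run; a real harness invokes the agent here."""
    return random.random() < 0.97  # hypothetical per-test pass probability

def friday_block(n_cases: int = 80) -> list[int]:
    # Runs execute in parallel; failures queue up for sequential human review.
    with ThreadPoolExecutor(max_workers=16) as pool:
        results = list(pool.map(run_case, range(n_cases)))
    return [i for i, passed in enumerate(results) if not passed]

failures = friday_block()
print(f"{len(failures)} failures queued for sequential review")
```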
The discipline that matters most is what happens when a category fails. The default assumption is that the failure is real and the capability does not ship. The exception is when the failure traces to a downstream provider issue we cannot patch from our side; in that case the capability ships with a documented circuit-breaker that disables it when the upstream issue recurs.
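A sketch of the circuit-breaker shape, with hypothetical thresholds (three upstream failures inside five minutes disables the capability for an hour):

```python
import time

class CircuitBreaker:
    """Disable a capability after repeated upstream failures rather than
    keep executing against a broken provider. Thresholds are illustrative."""

    def __init__(self, max_failures: int = 3, window: float = 300.0,
                 cooldown: float = 3600.0):
        self.max_failures = max_failures
        self.window = window        # seconds over which failures are counted
        self.cooldown = cooldown    # seconds the capability stays disabled
        self.failures: list[float] = []
        self.disabled_until = 0.0

    def allow(self) -> bool:
        return time.monotonic() >= self.disabled_until

    def record_failure(self) -> None:
        now = time.monotonic()
        self.failures = [t for t in self.failures if now - t < self.window]
        self.failures.append(now)
        if len(self.failures) >= self.max_failures:
            self.disabled_until = now + self.cooldown  # pull the capability
            self.failures.clear()
```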
Why publish the methodology
Most competitors do not publish their reliability methodology. Two reasons. First, most have not formalised what "reliability" means for non-deterministic systems. Second, the numbers that do exist are often worse than the marketing, and publishing them creates a benchmark competitors can beat.
Publishing the methodology is the safer move long-term. It sets a standard rather than chases one. It gives users a way to compare beyond brand. It also forces internal discipline: if the methodology is public, you cannot quietly lower the bar when a release is late. The 80-test number is part of the Gravity public commitment for that reason.
If you're building agent reliability tooling and want to compare notes, or if you have a failure mode I haven't listed in the eight categories, my email is at the top of /contact. The framework that produced this discipline is in the three checks I missed; the reliability discipline is the operational expression of the 10x check.
Frequently asked questions
Why test AI agents 80+ times per capability?
AI agents are non-deterministic. The same prompt can produce different outputs, especially in tool-use and multi-step reasoning. Single-shot tests do not characterise reliability. 80+ tests across eight fixed failure categories (input variation, tool failure, partial results, hostile input, rate limits, schema drift, refusal correctness, idempotency) give you a real reliability number, not a vibes-based one.
What are the eight test categories for AI agents?
Input variation (paraphrases of the same intent), tool failure (a downstream API returns an error or times out), partial results (the agent completes some steps and not others), hostile input (jailbreak or injection attempts), rate limits (the agent hits a quota), schema drift (the API changed shape), refusal correctness (the agent should refuse and does), and idempotency (running the same task twice does not double-execute).
How do you measure AI agent reliability?
Reliability is the pass rate across the 80+ tests, weighted by the cost of each failure mode. Input variation failures are cheap; idempotency and refusal failures are expensive. The capability ships when the weighted pass rate exceeds 95% across all eight categories, not when the simple pass rate exceeds 95% overall.
How long does the 80-test methodology take?
About four hours of automated runs per capability, with another two to three hours of human review for ambiguous failures. The runs are parallel; the review is sequential. The full Friday testing block fits in one working day for most capabilities, two days for capabilities with complex tool chains.
Why don't competitors publish their reliability numbers?
Two reasons. First, most agent platforms have not formalised what "reliability" means for non-deterministic systems, so they have nothing to publish. Second, the numbers that exist are often worse than the marketing, and publishing them creates a benchmark competitors can beat. Publishing your own numbers is how you set the standard rather than chase it.
Three takeaways before you close this tab
- 80 is the smallest number that gives a reliability rate you can defend. Ten tests per category × eight categories.
- Weight categories by cost of failure, not failure rate. Idempotency and refusal carry more weight than paraphrase variation.
- Ship-or-delay is per-category. 95% overall and 80% on refusal is still "delay".
Sources
- Aryan Agarwal, "Gravity AI reliability methodology", internal spec v1, May 2026
- Mialon et al., "GAIA: A Benchmark for General AI Assistants", arXiv:2311.12983, 2023, retrieved 2026-05-05, arxiv.org/abs/2311.12983
- SWE-bench, "Leaderboard for software engineering benchmark", retrieved 2026-05-05, swebench.com
- NIST, "AI Risk Management Framework", retrieved 2026-05-05, nist.gov/itl/ai-risk-management-framework
- OWASP, "Top 10 for LLM Applications", retrieved 2026-05-05, owasp.org