Most agent demos work. That is the problem. A demo is one run, on a clean input, with the builder watching. The moment an agent goes live it meets empty fields, rate limits, contradictory instructions, and the one customer record that breaks every assumption. Gravity's answer is a quality bar: more than 80 tests per capability that an agent has to pass before it can publish and earn a single credit. This post explains that bar from the side that actually has to clear it, the builder's.
If you want the engineering detail of how the suite is built, read how we run 80+ tests per agent capability. This piece is about why the bar exists, what it means for the people publishing agents, and how it connects to the builder economy inside Gravity.
Why a marketplace needs a bar at all
A chatbot that gives a wrong answer wastes a few seconds. An agent that takes a wrong action sends the wrong invoice, deletes the wrong file, or messages the wrong customer. The cost of being wrong is higher because the output is a side effect in the real world, not a paragraph on a screen. That single difference is why I will not run a marketplace where anyone can publish anything and let users find the failures.
The industry data backs the caution. In McKinsey's 2024 global survey on AI, the share of organizations regularly using generative AI jumped to roughly two-thirds, yet the same respondents report that inaccuracy and unintended outputs remain among the most cited risks they have experienced. Adoption is racing ahead of reliability. A marketplace that wants agents to be trusted with real actions has to close that gap deliberately, because the market will not close it on its own.
So Gravity puts the gate where it belongs: before publishing, not after a complaint. The bar is the same for a first-time builder and for an experienced one. It is not a reputation system that lets trusted accounts skip checks. The capability passes the suite or it waits.
What the 80 tests actually cover
The number 80 is a floor for a serious capability, and it is not arbitrary. It is what it takes to cover the categories of failure that actually show up in production. The tests fall into five buckets.
- The happy path, many ways. The same task with different valid inputs, phrasings, and data shapes. An agent that only works when the instruction is worded exactly one way is not finished.
- Edge inputs. Empty fields, enormous fields, unusual characters, the wrong currency, a date in the past, a duplicate record. The boring inputs that real systems are full of.
- Ambiguity. Instructions that could mean two things. A good agent asks or picks the safe interpretation. A bad one guesses and acts.
- Permission and blast radius. Tests that confirm the agent stops at the edge of what it is allowed to touch, and never widens its own access. This connects directly to agent security practices.
- Failure handling. A downstream API times out, returns an error, or rate-limits. The agent should retry sensibly, roll back cleanly, or surface the problem, never silently leave a half-finished action behind. See error handling and rollback for the patterns.
Those five buckets map onto the standard ways teams measure agent quality: the task success, tool-call accuracy, and safe-failure metrics covered in agent evaluation metrics. The 80 is simply what honest coverage of those buckets costs for a capability that is allowed to act.
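One way to keep the buckets honest while you build is to tag each case with its bucket and check coverage before you submit. The sketch below is an illustration, not Gravity's actual suite format: the `Case` shape, the bucket labels, and the `coverage_gaps` helper are assumed names invented for this example. It only counts cases; it does not run an agent.

```python
from collections import Counter
from dataclasses import dataclass

# Bucket labels are assumptions that mirror the five categories above.
BUCKETS = ("happy_path", "edge_input", "ambiguity", "permission", "failure_handling")
MIN_CASES = 80  # the floor this post describes


@dataclass
class Case:
    name: str
    bucket: str
    expected: str  # outcome label the agent should produce


def coverage_gaps(cases: list[Case]) -> list[str]:
    """Return reasons the suite is not ready to submit; an empty list means ready."""
    counts = Counter(case.bucket for case in cases)
    gaps = [f"no cases in bucket '{b}'" for b in BUCKETS if counts[b] == 0]
    gaps += [f"unknown bucket '{b}'" for b in sorted(set(counts) - set(BUCKETS))]
    if len(cases) < MIN_CASES:
        gaps.append(f"{len(cases)} cases, below the floor of {MIN_CASES}")
    return gaps


# A deliberately thin draft suite: three buckets covered, two missing.
draft = [
    Case("valid invoice, USD", "happy_path", "sent"),
    Case("empty amount field", "edge_input", "rejected_input"),
    Case("upstream API times out", "failure_handling", "rolled_back"),
]

for gap in coverage_gaps(draft):
    print(gap)
```

For the draft above, the helper reports the two uncovered buckets (ambiguity and permission) and the shortfall against the floor. A count like this says nothing about whether the cases are any good, only that no category was skipped.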
What it looks like from the builder's side
From inside the builder tools, the bar is not a black box that says no. As you build a capability, you write and run its tests alongside it. When you submit for publishing, Gravity runs an independent evaluation suite, and you get a result that names every case that failed and what the agent did instead of the expected outcome. The feedback is specific enough to act on, so fixing is targeted work, not a guessing game against a hidden rubric.
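To make "names every case that failed and what the agent did instead" concrete, here is a minimal sketch of that kind of report. It is not Gravity's evaluator or its output format. `Case`, `Failure`, `evaluate`, and the toy agent are hypothetical, and the agent returns an outcome label rather than performing a real action.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Case:
    name: str
    payload: dict
    expected: str  # outcome label the agent should produce


@dataclass
class Failure:
    case: str
    expected: str
    actual: str


def evaluate(agent: Callable[[dict], str], cases: list[Case]) -> list[Failure]:
    """Run every case and record expected vs. actual for each mismatch."""
    failures = []
    for case in cases:
        try:
            actual = agent(case.payload)
        except Exception as exc:  # a crash is reported as a named failure too
            actual = f"raised {type(exc).__name__}"
        if actual != case.expected:
            failures.append(Failure(case.name, case.expected, actual))
    return failures


def toy_agent(payload: dict) -> str:
    # Hypothetical agent with a deliberate gap: it never checks for an empty amount.
    if payload.get("scope", "own") != "own":
        return "refused"
    return "sent"


cases = [
    Case("valid invoice", {"amount": "120.00"}, "sent"),
    Case("empty amount", {"amount": ""}, "rejected_input"),
    Case("other tenant's record", {"amount": "50", "scope": "other"}, "refused"),
]

for f in evaluate(toy_agent, cases):
    print(f"FAIL {f.case}: expected {f.expected!r}, agent did {f.actual!r}")
```

The only failure reported is the empty-amount case, where the toy agent sent the invoice instead of rejecting the input. That one line tells the builder which check is missing, which is the difference between targeted work and guessing.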
This matters because the alternative is the experience most builders have had on other platforms: you publish, it looks fine, and three weeks later a support ticket tells you the agent has been quietly doing the wrong thing for a class of inputs you never tried. The bar moves that discovery to before launch, where it is cheap to fix and nobody got hurt.
It also keeps the standard consistent across the marketplace. A user picking an agent should not have to research which builder is careful. The bar means the floor is the same everywhere, which is the whole point of letting someone describe an outcome and trust the result.
The bar does not stop at launch
A test suite proves an agent works against known cases on the day it ships. Production keeps inventing new ones. So the evaluation baseline becomes a live reference: Gravity watches real runs and compares their outcomes to what the suite expected. If the failure rate on a capability climbs past a threshold, the agent is flagged, the builder is notified with the failing pattern, and the capability can be paused.
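A rolling failure-rate check is one simple way to picture that threshold. The sketch below is an assumption-heavy illustration, not how Gravity's monitoring is implemented: the class name, the window size, the minimum run count, and the threshold value are all invented for the example, and the decision to pause is left to the caller.

```python
from collections import deque


class FailureRateMonitor:
    """Tracks recent live runs and flags when failures exceed a threshold."""

    def __init__(self, threshold: float, window: int = 100, min_runs: int = 20):
        self.threshold = threshold            # e.g. 0.05 means 5% of recent runs
        self.min_runs = min_runs              # avoid flagging on a handful of runs
        self.outcomes = deque(maxlen=window)  # True = run matched the baseline

    def failure_rate(self) -> float:
        if not self.outcomes:
            return 0.0
        return self.outcomes.count(False) / len(self.outcomes)

    def record(self, matched_baseline: bool) -> str:
        """Record one run; return 'flagged' when the rate climbs past the threshold."""
        self.outcomes.append(matched_baseline)
        if len(self.outcomes) >= self.min_runs and self.failure_rate() > self.threshold:
            return "flagged"
        return "ok"


monitor = FailureRateMonitor(threshold=0.2, window=10, min_runs=5)
statuses = [monitor.record(ok) for ok in (True, True, True, True, False, False)]
print(statuses[-1], round(monitor.failure_rate(), 2))
```

In the usage lines, the first failure leaves the rate at exactly the threshold, so nothing happens. The second failure makes it two in six, which is past 0.2, and the status comes back as `flagged`. The minimum run count is there so a single early failure on a new capability does not trip the alarm.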
Pausing a charging agent sounds drastic. It is the right default. A broken agent that keeps running is worse than one that is temporarily off, because every run during the broken window costs a user money and trust. This is the operational side of the bar, and it leans on the same discipline described in agent monitoring and observability and reliability testing. A bar you only check once is a marketing claim. A bar you keep checking is a quality system.
Why the bar protects builder earnings
Here is the part builders sometimes miss when they see a gate and read it as friction. On Gravity, a builder earns a share of every run, which means income is a function of how often an agent keeps getting used. A flashy agent that wins the first run and loses the user's trust on the second is worth almost nothing. A dependable agent that someone runs every week for a year is worth a great deal. The bar pushes builders toward the second outcome, which is the profitable one.
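The arithmetic is worth spelling out once. The numbers below are invented for illustration, since this post does not state a price per run or a builder share, and `builder_earnings` is a hypothetical helper rather than anything in Gravity's tooling.

```python
def builder_earnings(runs: int, price_per_run: float, builder_share: float) -> float:
    """Credits earned by the builder: runs x price x share (all inputs hypothetical)."""
    return runs * price_per_run * builder_share


PRICE = 10.0  # credits per run, made up for illustration
SHARE = 0.5   # builder's share of each run, made up for illustration

# An agent used once and abandoned vs. one run every week for a year.
flashy = builder_earnings(runs=1, price_per_run=PRICE, builder_share=SHARE)
dependable = builder_earnings(runs=52, price_per_run=PRICE, builder_share=SHARE)

print(flashy, dependable, dependable / flashy)
```

Whatever price and share you plug in, the ratio between the two agents is the ratio of their run counts. Retention is the only variable the builder fully controls.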
That alignment is the quiet reason the quality bar is not in tension with the marketplace: it is the marketplace. Pricing that charges per run rather than per seat only works if runs keep happening, and runs only keep happening if agents stay reliable. The bar, the pay model, and the monitoring all point in the same direction. A builder who clears the bar is not satisfying a gatekeeper; they are protecting the asset that pays them.
None of this makes an agent perfect. Things still break, and when they do the system is built to catch them fast and fix them in the open. What the bar guarantees is not that nothing ever fails, but that failure is the exception the system hunts, not the surprise the user discovers.
FAQ
- What is Gravity's agent quality bar?
- It is a suite of more than 80 tests that every agent capability must pass before it can publish and earn credits. The tests cover the happy path, edge inputs, ambiguous instructions, permission limits, and failure handling. An agent that does not clear the bar does not go live, no matter who built it.
- Why does Gravity test 80 cases per capability?
- Because agents act, and a wrong action costs real money or trust. A demo proves an agent can succeed once. Eighty tests prove it does not fail in the predictable ways: empty inputs, rate limits, ambiguous instructions, and partial failures. The number is a floor for serious capabilities.
- Who runs the tests, the builder or Gravity?
- Both. The builder writes and runs the capability's tests while building, and Gravity runs an independent evaluation suite before publishing and re-runs it on every update. Builders see exactly which cases failed so they can fix them rather than guess.
- What happens when a published agent starts failing?
- Gravity monitors live runs against the evaluation baseline. If failure rates rise past a threshold, the agent is flagged, the builder is notified, and the capability can be paused so a broken agent stops charging users while it is fixed.
- Does the quality bar slow builders down?
- It adds work up front and saves far more later. A builder earns on every run, so a reliable agent that keeps getting used is worth more than a flashy one that gets abandoned. The bar protects the builder's earnings as much as the user's trust.
- How is this different from how other platforms ship agents?
- Many platforms ship whatever a builder publishes and let users discover the failures. Gravity gates publishing on an evaluation suite, monitors live reliability, and ties builder pay to repeat runs. The incentive and the gate both point at quality rather than volume.
Sources
- McKinsey & Company, "The state of AI: Global survey", 2024, mckinsey.com
- Gravity, "How we run 80+ tests per agent capability", 2026, gravity.fast
- Gravity, "AI agent evaluation metrics", 2026, gravity.fast
