How We Curate Launch Agents at Gravity

When you open the Gravity catalog, you see a list of agents that each do a specific job, and you can run any of them in about 60 seconds by describing what you need. What you do not see is the work that decided which agents are on that list. Curation is that work: the selecting, testing, and vouching that turn a candidate agent into something we are willing to put a user's task through. This post is about how that decision gets made.

We are writing it because curation is the part of an agent platform that is easy to take for granted and expensive to get wrong. A long catalog of impressive-looking agents is worthless if half of them fail quietly on real inputs. A short catalog of agents that reliably finish the job is the product. So we treat curation as a first-class engineering and editorial function, not a marketing exercise, and the rest of this piece explains the bar, the people, and the gates behind it.

Key takeaways

Curation is a quality gate, not a long list. An agent is in the launch catalog because it cleared a reliability bar, not because it looked impressive in a demo.
Expert builders build and maintain agents for Gravity. Gravity pays them for that work, runs the agents, carries the compute cost, and owns the service the user receives.
Every agent clears a structured test suite. Common path, edge cases, adversarial inputs, and failure handling all have to hold before an agent is available to users.
You run a tested agent, not raw infrastructure. The reliability risk sits with the platform, so you describe an outcome and trust the result.
Curation is continuous. Agents are monitored, refreshed, and re-tested after launch; ones that drift below the bar are fixed or pulled.

What curation means here

Curation, in the sense we use it, is the decision about which agents are good enough to carry a user's real task. It is closer to what an engineering team does when it decides whether code is ready to ship than to what a directory does when it accepts a listing. The default answer for any candidate agent is no, and the work of curation is producing the evidence that turns a no into a yes for a specific, scoped job.

That framing matters because Gravity is a platform, not a directory you browse and self-serve from. When you run an agent, you are not picking up a tool and assuming the risk of operating it; you are running a service that Gravity stands behind. The catalog is therefore a set of commitments, and each entry exists because we are prepared to be responsible for the result it produces. If you want the broader vocabulary, our explainer defining AI agents sets the baseline; curation is the layer that decides which of those agents are fit to run.

Who builds the agents

The agents are built by expert builders who know a particular job well: a domain specialist, a practitioner, someone who has done the underlying work by hand enough times to know where it goes wrong. They build and maintain the agent for Gravity, and Gravity pays them for that work. This is a service-provider relationship, not a self-publishing one. A builder is not listing a product and collecting a share of each sale; they are doing the work of making an agent reliable, and Gravity compensates that work.

The distinction is load-bearing for curation. Because Gravity runs the agents, carries the compute cost, and is responsible for the service, the builder's incentive is aligned with the user's: an agent that fails on real inputs is a problem for everyone, not a one-time sale that has already cleared. So a builder's job does not end at a working prototype. It includes hardening the agent against the messy inputs real users send, handling failure gracefully, and keeping the agent current as models and tools change. We have written about what that collaboration looks like in practice in our notes from the Gravity Builder Program.

This also shapes scope. An expert builder is encouraged to make an agent that does one job well rather than a vague everything-agent, because a tight scope is what makes reliability testable. An agent that claims to handle anything cannot be verified; an agent that turns a specific input into a specific, checkable output can be. Curation rewards the second kind.

The reliability bar

The single most important idea in curation is that an agent has to be reliable across many runs, not correct in one demo. Anyone can show an agent succeeding once. The question that matters for a user about to put a real task through it is whether it succeeds the next hundred times, on inputs the builder did not hand-pick. That is the bar, and it is the same standard we describe in our deeper guide to AI agent reliability testing.

We think about this as a reliability bar expressed through a suite of tests, on the order of dozens of distinct cases per agent, covering the situations a real task throws at it. The point of running an agent against many cases rather than one is to catch the failures that only show up at the edges: the input that is slightly malformed, the request that is technically in scope but unusual, the case where the right answer is to stop and say it cannot proceed. An agent that passes its happy path but breaks on those does not clear the bar, because real users do not confine themselves to happy-path inputs.
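To make the difference between one demo and many runs concrete, here is a minimal Python sketch. It is an illustration only, not Gravity's actual test harness: `Case`, `pass_rate`, the toy `shout` agent, and the two sample cases are all hypothetical names invented for this example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Case:
    """One test case: a task plus a check on the agent's output (hypothetical shape)."""
    name: str
    task: str
    check: Callable[[str], bool]


def pass_rate(agent: Callable[[str], str], cases: List[Case], runs_per_case: int = 5) -> float:
    """Return the fraction of (case, run) pairs whose output passes its check."""
    total = 0
    passed = 0
    for case in cases:
        for _ in range(runs_per_case):
            total += 1
            try:
                ok = case.check(agent(case.task))
            except Exception:
                ok = False  # a crash counts as a failed run, not a skipped one
            passed += int(ok)
    return passed / total if total else 0.0


def shout(task: str) -> str:
    """Toy stand-in for an agent: uppercases its input and ignores edge cases."""
    return task.upper()


CASES = [
    Case("plain text", "hello", lambda out: out == "HELLO"),
    # The right answer for empty input is to stop, which the toy agent never does.
    Case("empty input", "", lambda out: out.startswith("cannot proceed")),
]

demo_rate = pass_rate(shout, CASES[:1])  # the hand-picked demo case only
suite_rate = pass_rate(shout, CASES)     # the demo case plus one edge case
```

The toy agent scores 1.0 on the hand-picked case and 0.5 once a single edge case joins the suite, which is the gap between a demo and a reliability measurement.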

Crucially, the bar is about the result the user gets, not the cleverness of the internals. Two agents can use very different models and tooling; what curation cares about is whether each one reliably produces an output we can stand behind. This is the same lens a careful buyer would apply, and it is exactly what we lay out in our framework for how to evaluate AI agent platforms. The difference is that on Gravity, the platform does that evaluation so the user does not have to.

How the testing works

Before an agent is available to users, it runs against a structured test suite. The suite is organized around the kinds of failure that actually happen, rather than around the inputs that make the agent look good.

The common path. First, the agent has to reliably handle the central job it exists to do, across varied phrasings and realistic inputs, not a single canned example. If the everyday case is shaky, nothing else matters.

Edge cases. Then the harder cases: unusual but legitimate inputs, boundary conditions, ambiguous requests, and the long tail of variations that a real user base produces. These are where most agents quietly fail, so they carry weight in the decision.

Adversarial and malformed input. The agent is tested against inputs that are broken, incomplete, or deliberately misleading. The goal is not only correctness but safe behavior: the agent should not be derailed by a confusing prompt or produce a confidently wrong result on garbage input.

Failure handling. When the agent cannot complete a task, it has to fail in a way the user can understand and act on, rather than returning a plausible-looking but wrong answer. An agent that knows when to stop is more trustworthy than one that always returns something.
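One way to picture safe failure is an agent that returns an explicit failure the user can act on instead of guessing. The sketch below is a toy under stated assumptions: `AgentResult` and `total_amounts` are hypothetical names, and the job (summing amounts given as text) is chosen only because it is easy to check.

```python
from dataclasses import dataclass
from decimal import Decimal, InvalidOperation
from typing import List, Optional


@dataclass(frozen=True)
class AgentResult:
    """Hypothetical result shape: either an output, or a reason plus a next step."""
    ok: bool
    output: Optional[str] = None
    reason: Optional[str] = None
    next_step: Optional[str] = None


def total_amounts(lines: List[str]) -> AgentResult:
    """Sum the amounts, stopping with an explanation instead of guessing."""
    if not lines:
        return AgentResult(ok=False, reason="no amounts were provided",
                           next_step="supply at least one amount and rerun")
    total = Decimal("0")
    for number, line in enumerate(lines, start=1):
        try:
            amount = Decimal(line.strip())
        except InvalidOperation:
            amount = None
        # Reject unparseable text and non-finite values such as "NaN".
        if amount is None or not amount.is_finite():
            return AgentResult(ok=False,
                               reason=f"line {number} is not a valid amount: {line!r}",
                               next_step="correct that line and rerun")
        total += amount
    return AgentResult(ok=True, output=str(total))
```

The design choice is that a bad line never yields a partial total: the caller gets either a complete answer or a specific reason and a next step.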

An agent that stays reliable across all four categories clears the bar. One that passes some categories and fails others goes back to the builder with the specific cases that broke, and the cycle repeats until it holds or until we decide the job is not yet reliable enough to ship. Pricing follows the same per-use logic on the other side of that gate, which we walk through in our explainer on the Gravity credits model.
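The gate over the four test categories can be sketched as a small function that either clears an agent or returns the cases that broke. This is a hedged illustration, not Gravity's real pipeline: the category keys, the `Result` type, the `gate` function, and the default `threshold` of 1.0 are assumptions made for the example, and the source does not state a numeric threshold.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical keys for the four categories described above.
CATEGORIES = ("common_path", "edge_case", "adversarial", "failure_handling")


@dataclass(frozen=True)
class Result:
    category: str
    case_name: str
    passed: bool


def gate(results: List[Result], threshold: float = 1.0) -> Tuple[bool, Dict[str, List[str]]]:
    """Return (cleared, failing), where failing maps category -> broken case names.

    Every category must meet the threshold; a category with no cases at all
    counts as failing, because an untested category is not evidence.
    """
    by_category: Dict[str, List[Result]] = defaultdict(list)
    for result in results:
        by_category[result.category].append(result)

    failing: Dict[str, List[str]] = {}
    for category in CATEGORIES:
        group = by_category.get(category, [])
        rate = sum(r.passed for r in group) / len(group) if group else 0.0
        if rate < threshold:
            failing[category] = [r.case_name for r in group if not r.passed]
    return (not failing, failing)
```

Returning the failing case names, rather than a bare pass or fail, mirrors the feedback loop in which the builder receives the specific cases to fix.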

How curation protects users

The payoff of all this is simple to state: you run a tested agent, not raw infrastructure. Building a reliable agent yourself means choosing a model, writing and tuning prompts, wiring up tools, handling errors, and then judging whether the output can be trusted, every time, for every job. Curation does that work once, verifies it, and hands you an agent you can run by describing an outcome.

That shifts where the risk lives. On a do-it-yourself stack, the reliability risk is yours: if the agent fails on an odd input, you find out when your task is already wrong. On Gravity, the platform has already pushed that input through the test suite, and the platform carries the responsibility for the result. You are running a service with a known floor of quality, which is a categorically different experience from operating infrastructure and hoping it holds. For a sense of how this fits the wider field, our best AI agents roundup for 2026 sets the context Gravity is built for.

What gets rejected

It is easier to trust a curation process when you know what it turns away. Several patterns reliably get an agent rejected, or sent back before it can ship.

Cannot hold reliability. The most common reason. The agent works in a demo but breaks across the test suite, especially on edge cases and malformed input. A strong first impression does not survive contact with the cases real users produce.
Scope too vague to test. An agent that promises to handle anything cannot be verified, because there is no defined output to check against. If we cannot write tests that say pass or fail, we cannot stand behind it.
Redundant without being better. An agent that duplicates one already in the catalog without doing the job more reliably adds choice without adding value, and choice for its own sake makes the catalog harder to trust, not easier.
Fails unsafely. An agent that returns a confident, wrong answer on bad input rather than stopping is more dangerous than one that does less. Safe failure is a requirement, not a nice-to-have.

None of these are judgments about effort or ambition. They are judgments about whether we can be responsible for the result a user would get. Because Gravity owns the service, the question we ask of every candidate is the same: can we stand behind what this agent hands back? If the answer is not yet yes, it does not ship, and the builder gets the specific reasons so the gap can be closed.

How curation evolves after launch

Curation does not end when an agent reaches the catalog; that is where the ongoing part begins. Agents are monitored in production, where the inputs are more varied and less predictable than any test suite, and that real-world behavior feeds back into the bar. When a model improves, a tool changes, or a new failure pattern appears, the builder refreshes the agent and it is re-tested against the same standard before the update goes live.

The catalog itself changes too. New agents are added as genuine demand appears, rather than to pad a number, and each new entry clears the same gate as the launch set. An agent that drifts below the reliability bar over time is fixed or pulled, because a stale agent that used to work is still a broken promise to the user running it today. The launch catalog is the starting point of a maintained system, not a frozen list, and that maintenance is part of what Gravity pays builders to do and takes responsibility for.
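Detecting drift below the bar can be pictured as a rolling success rate over recent production runs. The sketch below is one possible shape, not a description of Gravity's monitoring: `DriftMonitor`, its default `bar`, `window`, and `min_runs` values, and the notion of a per-run success flag are all assumptions for illustration.

```python
from collections import deque
from typing import Deque, Optional


class DriftMonitor:
    """Track the success rate over the most recent runs (hypothetical helper)."""

    def __init__(self, bar: float = 0.95, window: int = 200, min_runs: int = 50) -> None:
        self.bar = bar
        self.min_runs = min_runs
        # Only the newest `window` outcomes are kept; older ones fall off.
        self._outcomes: Deque[bool] = deque(maxlen=window)

    def record(self, success: bool) -> None:
        self._outcomes.append(success)

    def rate(self) -> Optional[float]:
        """Return the rolling success rate, or None until enough runs exist."""
        if len(self._outcomes) < self.min_runs:
            return None
        return sum(self._outcomes) / len(self._outcomes)

    def needs_review(self) -> bool:
        """True when there is enough data and the rate is below the bar."""
        current = self.rate()
        return current is not None and current < self.bar
```

A flagged agent would then go back through the same fix and re-test cycle as any update; how a production run is judged a success is left open here on purpose.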

The short version

Curation is a quality gate. Agents are in the catalog because they cleared a reliability bar, not because they looked good in a demo.
Builders build for Gravity; Gravity owns the service. Expert builders are paid to make and maintain agents, and the platform runs them and carries the cost and responsibility.
You run a tested service. The testing, hardening, and judgment happen before the agent reaches you, so the reliability risk sits with the platform.
It keeps going. Agents are monitored, refreshed, and re-tested; the catalog grows and prunes against the same bar over time.

Frequently asked questions

What is a launch agent on Gravity?

A launch agent is an expert-built agent that has cleared Gravity's review and is available in the catalog when the platform opens. Each one is scoped to a specific job, validated against a reliability bar, and maintained over time. As a user you describe an outcome and run the agent; you are running a tested service, not assembling raw infrastructure yourself.

Who builds the agents in Gravity's catalog?

Expert builders build and maintain the agents for Gravity, and Gravity pays them for that work. They are service providers who bring domain knowledge to a specific job, not sellers listing a product on their own. Gravity runs the agents, carries the compute cost, and is responsible for the service the user receives, so the builder's job is to make the agent reliable and keep it that way.

What testing does an agent have to pass before launch?

Every candidate agent runs against a structured test suite before it is available to users. The suite covers the common path, harder edge cases, malformed or adversarial inputs, and failure handling, so the question is not whether an agent works once in a demo but whether it holds up across many varied runs. An agent that passes its happy path but breaks on edge cases does not clear the bar.

How does curation protect the user?

Curation means you run a tested agent rather than raw infrastructure. You do not have to evaluate a model, write prompts, wire up tools, or judge whether the result can be trusted; that work is done and verified before the agent reaches the catalog. The reliability risk sits with the platform, not with you, which is the difference between running a service and operating a stack.

What gets an agent rejected?

An agent is rejected when it cannot hold reliability across the test suite, when its scope is too vague to evaluate, when it duplicates an existing agent without doing the job better, or when it cannot fail safely on bad input. A flashy demo is not enough. If we cannot stand behind the result a user would get, the agent does not ship, because Gravity owns the service.

Does the catalog stay fixed after launch?

No. Curation is continuous. Agents are monitored in production, refreshed by their builders as models and tools change, and re-tested against the same bar before any update goes live. New agents are added as real demand appears, and an agent that drifts below the reliability bar is fixed or pulled. The launch catalog is the starting point, not the final list.