Gravity Builder Program: Early Learnings

I want to be honest about where we are. Gravity is pre-launch, and the way we work with builders is early and still moving. So this is not a victory lap. It is a founder reflection on what we have learned designing how Gravity works with the experts who build our agents. The short version: deciding to have experts build agents, rather than asking every user to build their own, was the most important early call we made, and almost everything we have learned since flows from getting that one decision right.

Below I walk through why we made that call, what a strong builder actually looks like, how our quality bar works, how we brief and review, and the things that genuinely surprised us. If you want the wider context on how we think about quality, start with "our agent quality bar explained."

Why we have experts build the agents

We decided early that users should never have to build an agent to get one. The thing I keep seeing is that people will happily try an AI tool, watch it do something impressive once, and then quietly give up when it cannot do that same thing dependably the next day. The gap between a good demo and dependable daily output is where most adoption dies. That gap is exactly what an expert builder closes before a user ever arrives.

Most people do not want to assemble a workflow. They want the outcome. Asking a busy founder to wire up tools, write prompts, and handle edge cases is asking them to become an engineer for an afternoon. So we flipped it. On Gravity, the person who has done the underlying work for years builds the agent, and the user simply describes what they need. This is the same principle we wrote about in "describe the outcome, not the workflow."

This is a service-provider relationship, not a listing arrangement. Gravity invites an expert to build a specific agent, agrees the scope, and pays them for building and maintaining it. We then run that agent on our own infrastructure, carry the cost of every run, and are responsible for the service the user receives. The builder brings the judgment; we carry the operation.

The case for curation over self-service

Self-service agent tools push the hard part onto the person least equipped to handle it. In my own experience watching agent projects, the model is rarely where things stall. Projects stall on the unglamorous operational work around the model: handling bad inputs, knowing when to stop, keeping the agent healthy as tools change underneath it. Curation moves that work to people who do it well. We would rather maintain a smaller catalog we genuinely stand behind than a large one we cannot vouch for.

What a strong builder looks like

The strongest builders we have worked with are practitioners first and tinkerers second. Plenty of teams now use AI tools day to day, but the ones who get dependable results out of them are almost always the ones with real domain expertise shaping the work. That matches what we see on our side: domain judgment, not prompt cleverness, predicts a good agent. The person who knows the work cold knows what "right" looks like, and that is what an agent has to reproduce.

What does that look like concretely? A strong builder for a hiring agent has screened hundreds of candidates and knows where résumés mislead. A strong builder for a reconciliation agent has closed real books and knows which mismatches matter and which are noise. They bring the edge cases with them, because they have already been burned by every one of them. That hard-won knowledge is the thing a generic tool cannot fake.

What we have found is that enthusiasm for AI is not the signal. Plenty of people are excited about the technology and want to build something with it. The builders who produce reliable agents are usually the ones who are a little skeptical, who keep asking what happens when the input is malformed or the customer is angry or the data is half-missing. That instinct to distrust the happy path is gold.

The trait we now screen for first

If I had to name one trait, it is caring more about reliability than novelty. Early on we were charmed by clever agents that did impressive things in a demo. We learned to be charmed instead by builders who could describe, without prompting, the ten ways their agent could fail and what it would do in each case. That is also the mindset behind how I pick what to build next: dull and dependable beats flashy and fragile.

How the quality bar works

No agent reaches a user on judgment alone. Every agent is checked against a structured suite of roughly 80 tests before it joins the catalog, a method we documented in full in "How We Test AI Agents With 80 Tests." Anything we have shipped that looked strong on typical inputs has, sooner or later, fallen over on a malformed or hostile one. The suite exists to surface exactly that kind of failure before a user does. Anthropic's own guidance on building effective agents (Anthropic, 2024) pushes the same instinct: keep agents simple, test them against real conditions, and add complexity only when it earns its place.

The tests are not all about whether the agent gives a good answer on a good day. They cover four bands: normal inputs, awkward edge cases, deliberately bad inputs, and safety. An agent has to behave correctly across all four, including refusing or escalating when it should, before it earns a place. Passing the easy band is table stakes; the bar lives in the other three.
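To make the four-band idea concrete, here is a minimal sketch in Python. It is not Gravity's actual test harness. The names `Band`, `CaseResult`, `failures_by_band`, and `meets_bar` are hypothetical, the sample cases are invented, and the rule that every case in every band must pass is an assumption made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Band(Enum):
    # The four bands named in the text; the identifiers are illustrative.
    NORMAL = "normal inputs"
    EDGE = "awkward edge cases"
    BAD_INPUT = "deliberately bad inputs"
    SAFETY = "safety"


@dataclass(frozen=True)
class CaseResult:
    case_id: str
    band: Band
    passed: bool


def failures_by_band(results):
    """Return {band: [failed case ids]} with an entry for every band."""
    failed = {band: [] for band in Band}
    for result in results:
        if not result.passed:
            failed[result.band].append(result.case_id)
    return failed


def meets_bar(results):
    """Assumed rule: all four bands are covered and no case fails."""
    covered = {result.band for result in results}
    no_failures = all(result.passed for result in results)
    return covered == set(Band) and no_failures


# Invented sample results for a hypothetical reconciliation agent.
results = [
    CaseResult("clean-invoice", Band.NORMAL, True),
    CaseResult("duplicate-line-items", Band.EDGE, True),
    CaseResult("empty-file", Band.BAD_INPUT, False),
    CaseResult("prompt-injection", Band.SAFETY, True),
]

print(meets_bar(results))                         # False
print(failures_by_band(results)[Band.BAD_INPUT])  # ['empty-file']
```

Grouping failures by band keeps the review conversation specific: an agent that passes every normal case but fails one bad-input case is visibly not ready, and the builder can see which case to fix.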

Crucially, the bar is not a one-time gate. We rerun the suite over time so that an agent which drifts, because a tool changed or an underlying model updated, gets caught before a user is affected. Gravity runs the agent, so this monitoring is ours to do. The builder keeps the agent healthy with us, and we hold the catalog to the standard we promised.
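One simple way to picture the rerun is as a comparison between a stored baseline and the latest run. The sketch below assumes each run is recorded as a mapping from case id to pass or fail. The function name `find_regressions` and the sample data are hypothetical, and real monitoring would need scheduling and alerting that this leaves out.

```python
def find_regressions(baseline, latest):
    """Return case ids that passed in the baseline run but do not pass now.

    Both arguments map case id -> bool (True means the case passed).
    A case missing from the latest run is treated as a regression.
    """
    return sorted(
        case_id
        for case_id, passed_before in baseline.items()
        if passed_before and not latest.get(case_id, False)
    )


# Invented data: the same suite run before and after a tool or model change.
baseline = {"clean-invoice": True, "empty-file": True, "prompt-injection": True}
latest = {"clean-invoice": True, "empty-file": False, "prompt-injection": True}

regressions = find_regressions(baseline, latest)
if regressions:
    print("drift detected:", regressions)  # drift detected: ['empty-file']
```

Treating a missing case as a regression is a deliberately cautious choice in this sketch: a test that silently stops running should be investigated rather than assumed to pass.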

Why a number, not a vibe

Putting a structured count on the bar forces discipline. A vague "it seems reliable" lets weak agents slip through on a good demo. A concrete suite makes the conversation with a builder objective: here are the cases, here is where it passed, here is where it did not. We have found that builders respond well to this, because it turns review from an opinion into a checklist they can actually work against.

How we brief and review builders

A good agent starts with a good brief. We have learned to write briefs around the outcome and the failure modes, not the implementation. Every time we have been burned, the root cause traced back to something we left vague at the start rather than something the builder got wrong later. Ambiguity at brief time is the most expensive ambiguity there is, because it compounds through every round of review after it.

So our brief names three things plainly: what a finished, correct run looks like for the user; the inputs the agent must handle gracefully, including the ugly ones; and the situations where the agent must stop and hand off rather than guess. That last part matters most. An agent that knows its own limits is worth more than one that confidently does the wrong thing.
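The three parts of the brief can be pictured as a small record with a completeness check. This is an illustrative sketch only, not Gravity's brief template. The class `Brief`, its field names, and the example content are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Brief:
    # What a finished, correct run looks like for the user.
    finished_run: str
    # Inputs the agent must handle gracefully, including the ugly ones.
    inputs_to_handle: list = field(default_factory=list)
    # Situations where the agent must stop and hand off rather than guess.
    handoff_situations: list = field(default_factory=list)

    def missing_parts(self):
        """Name any of the three parts that were left empty."""
        missing = []
        if not self.finished_run.strip():
            missing.append("finished_run")
        if not self.inputs_to_handle:
            missing.append("inputs_to_handle")
        if not self.handoff_situations:
            missing.append("handoff_situations")
        return missing


# Invented draft brief that forgets to say when the agent should hand off.
draft = Brief(
    finished_run="Every overdue invoice has one reminder drafted for review.",
    inputs_to_handle=["export with missing due dates", "duplicate customer rows"],
)
print(draft.missing_parts())  # ['handoff_situations']
```

A check like this only catches an empty section, not a vague one, so it complements a careful human read of the brief and does not replace it.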

Review then mirrors the brief. We run the agent against the test suite, walk the failures with the builder, and iterate. We are not looking for perfection on the first pass. We are looking for a builder who treats each failed case as information rather than an insult, and who closes the gaps methodically. The agents that ship are the ones that survive that loop without anyone losing patience.

What surprised us early

The biggest surprise was how often the limiting factor was the brief, not the builder. I went in assuming the hard part would be finding people who could build a good agent. It turned out the harder part was saying clearly enough what a good agent even was. When an agent underperformed, the root cause was usually something we had left vague, not something the builder got wrong. The brief set the ceiling.

The second surprise was how much builders wanted the constraints. I expected experienced practitioners to chafe at a strict quality bar and a prescriptive brief. The opposite happened. Good builders found the structure freeing, because it told them exactly what "done" meant and let them stop guessing at our standards. The bar became a shared language rather than a hurdle.

The third surprise was quieter and more humbling. The agents that excited us in early conversations were often not the ones users would value most. Some of the most useful agents are unglamorous: tidy a messy export, chase a list of overdue payments, turn a transcript into a clean summary. We had to retrain our own taste away from the impressive and toward the genuinely useful, which is a lesson I keep relearning as a founder.

What we are refining next

We are still early, so honesty matters more than polish here. The cost that keeps surprising us is not the initial build but the ongoing maintenance: keeping an agent healthy run after run as tools shift and models update underneath it. That is the part we underestimated, and it is shaping where we invest next, because an agent is only as good as its ten-thousandth run, not its first.

Three things are on our list. We are tightening how briefs capture failure modes, so less is left implicit and review starts closer to done. We are making the rerun-the-suite monitoring more automatic, so drift is caught faster and the builder and we hear about it together. And we are getting clearer about which domains to open next, weighing real demand against whether we can find a builder with genuine depth in that area.

What will not change is the shape of the relationship. Gravity commissions experts, pays them for building and maintaining agents, and runs those agents itself while carrying the cost and the responsibility. Users describe an outcome and get a finished result. If you are weighing whether this model holds up, the harder lessons that shaped it are in "three startups, three shutdowns," and the financial reality is in "bootstrapping an AI agent platform in 2026."

Frequently asked questions

How does Gravity work with builders?

Gravity invites domain experts to build agents to a defined quality bar. We brief them on the outcome a user needs, they build and maintain the agent, and Gravity reviews it, runs it on our own infrastructure, and stands behind the result. It is a service relationship, not a listing arrangement.

Who builds the agents on Gravity?

People who have done the underlying work for years build the agents. A recruiter builds the screening agent, an accountant builds the reconciliation agent. We look for hard-won domain judgment, because the edge cases an expert already knows are exactly what separates a reliable agent from a fragile demo.

Does Gravity pay its builders?

Yes. Builders are paid for the work of building and maintaining agents for Gravity. It is a service-provider relationship: we commission the agent, agree the scope, and compensate the builder for delivering and keeping it healthy. Gravity then runs the agent itself and carries the cost of every run.

What is Gravity's quality bar for agents?

Every agent is checked against a structured test suite of roughly 80 cases covering normal inputs, edge cases, bad inputs, and safety. An agent joins the catalog only after it passes that bar. The same suite reruns over time, so an agent that drifts is caught before a user ever sees it.

How can I become a Gravity builder?

Start from a task you have done expertly for years, not from a tool you want to try. The builders we work best with bring deep domain judgment and care about reliability over novelty. You can register interest through our builders page, and we reach out as we open capacity for new domains.

Three things I would tell a prospective builder

Lead with the domain. Your years of judgment matter more than your comfort with any tool.
Respect the bar. The suite of roughly 80 tests is a shared definition of done, not a gate to resent.
Build for the ten-thousandth run. Reliability over a long tail beats brilliance in a single demo.

Sources

Anthropic, "Building Effective Agents", 2024, anthropic.com/engineering/building-effective-agents
Gravity build-in-public notes, 2026 (founder's own observations from running the builder program).
Further reading: How We Test AI Agents With 80 Tests, the methodology behind our quality bar.