Processes change. New CRM field, new approver, new review step, new tool. The agent that was right last quarter is silently wrong this quarter, and the gap shows up as runs that look fine on the surface but produce slightly off output. Updating the agent is not hard; updating it without breaking the runs that depend on it is the part most teams miss.

This guide covers four decisions: when to edit versus rebuild, how to stage a change, how to verify the new version against last week's inputs, and what to update when only the tools changed. The rules below assume an agent that is in production and has people downstream of its output.

Update or rebuild

The first decision is binary. The agent's outcome is either the same or it has changed. If sales added a new field to lead records and the agent should now extract that field too, the outcome is unchanged: still "qualify and route leads". Update.

If the team decided that the agent should now also write the first follow-up email and book a meeting, the outcome has changed. Rebuild. Trying to evolve a "qualify and route" agent into a "qualify, route, follow up, and book" agent by adding paragraphs to the prompt produces a brittle Frankenstein that fails on its first edge case.

The cheap test: read the existing prompt out loud. Does it still describe what you want? If yes, edit. If you find yourself rewriting half of it, the outcome moved and you should start over. The deeper framing is in describe outcome, not workflow.

Stage new versions

The blast radius of a prompt change is the entire agent. A one-sentence edit can change which tools the agent calls and which side effects ship. Treat prompt changes like software deploys: version, stage, monitor, promote.

Concretely:

  1. Duplicate the agent. Edit the prompt on the duplicate.
  2. Route a small fraction of traffic (10% on day one, 25% by day three) to the new version.
  3. Watch run outcomes for a week. If the new version matches the old version's success on the same input class, promote.
  4. Keep the old version available as a one-click revert for at least thirty days.

Most agent platforms surface this as a "version" or "preview" pattern. If yours does not, the duplicate-agent pattern works manually: two agents, one routing rule, one set of metrics.
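
The routing rule itself can stay small. Below is a minimal sketch of the manual pattern, assuming the caller can choose which agent to invoke per run; the agent IDs and the pick_version helper are made up for illustration, not a platform API.

```python
import hashlib

# Hypothetical agent IDs: the live version and its staged duplicate.
AGENT_STABLE = "lead-router-v7"
AGENT_CANDIDATE = "lead-router-v8-draft"

def pick_version(run_key: str, candidate_share: float = 0.10) -> str:
    """Deterministically route a share of runs to the candidate agent.

    Hashing the run key (e.g. the lead record ID) keeps routing sticky:
    the same input always hits the same version, which keeps the
    old-vs-new comparison fair.
    """
    bucket = int(hashlib.sha256(run_key.encode()).hexdigest(), 16) % 100
    return AGENT_CANDIDATE if bucket < candidate_share * 100 else AGENT_STABLE

# Day one: 10% to the candidate. Day three: raise candidate_share to 0.25.
print(pick_version("lead-48213", candidate_share=0.10))
```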

Replay against last week

Before staging, replay the new prompt in dry-run mode against the inputs the live agent processed last week. Same inputs, new prompt, no destructive writes. Compare outputs side by side.

What you are looking for: did the new version make different decisions on the same data? If yes, are those differences improvements (the goal) or regressions (the cost)? Replay surfaces both. Most accidents come from a prompt edit that sounded better and silently broke a case that mattered last week.

This is the same discipline as pre-deploy testing, applied to changes after the agent is live. The dry-run replay is a five-minute insurance policy.
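
A minimal sketch of a manual replay harness, assuming last week's run inputs can be exported as JSON lines and the agent can be invoked in a no-write mode; run_agent_dry is a hypothetical stand-in for whatever dry-run or preview call your platform exposes.

```python
import json

def run_agent_dry(prompt: str, run_input: dict) -> dict:
    """Hypothetical wrapper: execute the agent with all writes disabled.
    Swap in your platform's dry-run or preview call here."""
    raise NotImplementedError

def replay(old_prompt: str, new_prompt: str, inputs_path: str) -> list[dict]:
    """Re-run last week's exported inputs through both prompts and collect diffs."""
    diffs = []
    with open(inputs_path) as f:  # one JSON run input per line, exported from run logs
        for line in f:
            run_input = json.loads(line)
            old_out = run_agent_dry(old_prompt, run_input)
            new_out = run_agent_dry(new_prompt, run_input)
            if old_out != new_out:
                diffs.append({"input": run_input, "old": old_out, "new": new_out})
    return diffs

# Every diff is either the improvement you wanted or a regression you
# just caught before it shipped; review them by hand before staging.
```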

[Diagram: Replay (last week, dry run) → Stage 10% (days 1-3) → Stage 25% (days 4-7) → Promote (100%)]
Update flow: replay first, then stage by traffic share, then promote.

Tool changes vs prompt changes

If a tool has changed (new endpoint, renamed parameter, new return field), update the tool description, not the agent prompt. Most agent runtimes pass tool descriptions to the model on every call, so a tool definition update propagates automatically.

The agent prompt only needs editing if the tool's purpose changed. A renamed argument is a tool-description edit. A tool that used to read and now also writes is a prompt edit, because the agent's policy about destructive actions has to be updated too.

This separation matters because tool descriptions are cheap to evolve and hard to break: you can ship a tool change in seconds, and the agent will pick up the new shape on the next run. Prompt changes are expensive to evolve and easy to break, which is why staging exists.
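
To make the split concrete, here is an illustrative tool definition in the generic JSON-schema shape most agent runtimes accept; the tool name and field names are invented. The rename lives entirely in this definition, and the prompt that describes the tool's purpose never changes.

```python
# Generic JSON-schema tool definition, the shape most agent runtimes accept.
# The CRM renamed "lead_id" to "record_id": only this definition changes;
# the agent prompt stays untouched and picks up the new shape on the next run.
crm_lookup_tool = {
    "name": "crm_lookup",
    "description": "Fetch a lead record from the CRM by its record ID.",
    "parameters": {
        "type": "object",
        "properties": {
            "record_id": {  # was "lead_id" before the rename
                "type": "string",
                "description": "CRM record ID of the lead to fetch",
            },
        },
        "required": ["record_id"],
    },
}
```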

For the broader concept of how agents bind to tools, see AI agent tool use explained.

Quarterly review

Every quarter, take ten minutes per agent to ask:

  1. Does the prompt still describe what you actually want today?
  2. Which systems and tools does the agent touch, and have any of them changed since launch?
  3. Do five recent runs, sampled at random, still produce the output you expect?

The quarterly review catches drift early. Without it, an agent that worked at launch slowly becomes wrong as your business evolves around it. The fix is rarely a major rebuild; it is usually one or two small edits that the review surfaces before any user notices.
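
If your run logs export as JSON lines, the sampling step of the review can be a few lines of script; the file layout and helper below are assumptions, not a platform API.

```python
import json
import random

def sample_runs_for_review(log_path: str, k: int = 5) -> list[dict]:
    """Pull a handful of recent runs to eyeball during the quarterly review."""
    with open(log_path) as f:
        runs = [json.loads(line) for line in f]
    recent = runs[-200:]  # limit the pool to roughly the most recent runs
    return random.sample(recent, min(k, len(recent)))
```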

For the related discipline of monitoring rather than reviewing, see how to monitor agent activity. Reviews are the planned audit; monitoring is the live signal.

What changes most often in practice

Across the agents Gravity runs internally and the patterns customers describe in early conversations, four kinds of process change account for most updates:

  1. A new field on the records the agent reads, such as the new CRM field.
  2. A changed tool: a renamed parameter, a new endpoint, a new return field.
  3. A new approver or review step downstream of the agent's output.
  4. A changed outcome: the agent is now expected to do something it was never scoped for.

The first three are fast updates. The fourth is the one teams under-budget; rebuilding rather than evolving saves three weeks of debugging on the next cycle.

Common mistakes

Frequently asked questions

When should I update an AI agent versus rebuild it?

Update when the outcome is the same and only the inputs, tools, or formatting changed. Rebuild when the outcome itself has changed. A new field on a form is an update. A new business goal is a rebuild. The cheap test: does the existing prompt still describe what you want? If yes, edit. If no, start over.

How do I update an agent without breaking the running version?

Version the prompt and stage the new version on a small subset of runs first. Most platforms allow a duplicate-and-edit pattern: clone the agent, change the prompt, route 10% of traffic to the new version for a week, then promote. The old version stays available as a one-click rollback for the first thirty days.

What is the safest way to ship a prompt change?

Run the new prompt in dry-run mode against the same inputs the live agent processed last week. Compare outputs side by side. Promote only after the dry runs match expectations on a representative sample. Most platforms expose this as a 'replay' or 'eval' view; if yours doesn't, copy ten run inputs into a sandbox manually.

How do I tell my AI agent that a tool has changed?

Update the tool description, not the agent prompt. Most agent runtimes pass tool descriptions to the model on every call, so a renamed argument or new endpoint propagates the moment you save the tool definition. The agent prompt only needs editing if the tool's purpose or output shape has changed.

How often should I review an AI agent for outdated logic?

Every quarter at minimum, and any time the upstream process visibly changed. The review takes ten minutes: read the prompt, list the systems it touches, sample five recent runs, and ask whether the agent still describes what you actually want today. Most updates are caught here before they become bugs.

Three takeaways before you close this tab

  1. Update when the outcome is unchanged; rebuild when the outcome itself moved.
  2. Stage prompt changes like deploys: replay last week's inputs in dry-run mode, route a small traffic share first, then promote.
  3. When only a tool changed, update the tool description and leave the prompt alone.
