Production AI agents need updates. Models improve. Prompts get tighter. Tools get added. The team finds a way to make the stopping rule clearer. The question is not "should we update" but "how do we update without breaking production trust." This guide is the change-management playbook for AI agents.
This guide builds on the piece on AI agent prompt versioning, which covers how prompts are stored and rolled back. This one covers the deployment loop around every kind of agent change.
Change classes
Three classes of change cover almost every update. Each gets a different rigour level.
| Class | Examples | Required gates |
|---|---|---|
| Trivial | Typo fix, comment, formatting | Code review only |
| Tactical | Prompt clarification, new few-shot example, increased rate limit on an internal tool | Eval gate + canary 5 to 10 percent |
| Structural | Model swap, tool added/removed, output schema change, stopping rule change | Eval gate + extended canary + consumer notice + rollback rehearsal |
Structural changes are where most production incidents originate. Give them their own, heavier flow instead of routing them through the same path as a typo fix.
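The table above can be encoded so that the deploy pipeline, not memory, decides which gates a change still owes. This is a minimal sketch: the gate labels and the `missing_gates` / `can_promote` helpers are hypothetical names chosen for illustration, and the mapping simply mirrors the table.

```python
from enum import Enum


class ChangeClass(Enum):
    TRIVIAL = "trivial"
    TACTICAL = "tactical"
    STRUCTURAL = "structural"


# Gate labels are illustrative; they mirror the "Required gates" column.
REQUIRED_GATES = {
    ChangeClass.TRIVIAL: ["code_review"],
    ChangeClass.TACTICAL: ["eval_gate", "canary"],
    ChangeClass.STRUCTURAL: [
        "eval_gate",
        "extended_canary",
        "consumer_notice",
        "rollback_rehearsal",
    ],
}


def missing_gates(change_class, passed):
    """Return the gates a change has not yet passed, in table order."""
    return [g for g in REQUIRED_GATES[change_class] if g not in passed]


def can_promote(change_class, passed):
    """A change may be promoted only when no required gate is outstanding."""
    return not missing_gates(change_class, passed)
```

For example, `missing_gates(ChangeClass.STRUCTURAL, {"eval_gate"})` lists the three gates a model swap would still need before promotion.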
Cadence by class
Healthy production agents ship trivial changes daily, tactical changes 1 to 5 times per week, and structural changes on a slower cadence (monthly or per quarter). The cadence is set by the eval discipline, not by an absolute calendar.
The internal metric. Track the time from a change author opening a PR to the change reaching 100 percent of traffic. For tactical changes this should be hours, not days. For structural changes, days to weeks. If tactical changes routinely take days, the eval set is too slow or the canary windows are too long; tighten the loop.
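One way to compute the lead-time metric described above, assuming you can export a record of (change class, PR opened, reached 100 percent) per change. The record shape and the sample timestamps are made up for illustration.

```python
from collections import defaultdict
from datetime import datetime
from statistics import median


def lead_time_hours(pr_opened, full_rollout):
    """Hours from PR opened to the change serving 100 percent of traffic."""
    return (full_rollout - pr_opened).total_seconds() / 3600


def median_lead_time_by_class(records):
    """records: iterable of (change_class, pr_opened, full_rollout) tuples."""
    by_class = defaultdict(list)
    for change_class, opened, rolled_out in records:
        by_class[change_class].append(lead_time_hours(opened, rolled_out))
    return {cls: median(hours) for cls, hours in by_class.items()}


# Illustrative data only.
sample_records = [
    ("tactical", datetime(2025, 3, 3, 9), datetime(2025, 3, 3, 15)),
    ("tactical", datetime(2025, 3, 4, 9), datetime(2025, 3, 4, 19)),
    ("tactical", datetime(2025, 3, 5, 9), datetime(2025, 3, 7, 11)),
    ("structural", datetime(2025, 3, 1, 9), datetime(2025, 3, 11, 9)),
]
lead_times = median_lead_time_by_class(sample_records)
```

The median is used here rather than the mean so that one slow change does not hide an otherwise healthy loop; that choice is a suggestion, not something the guide prescribes.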
Canary deploys
The canary is the deployment pattern that lets you ship faster without giving up safety. The mechanic is straightforward; the discipline is in what you measure.
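Traffic split. 5 to 10 percent of qualifying runs to the new version, 90 to 95 percent to the baseline. The split is determined by a hash of a stable key. Hash user_id when the same user should get a consistent experience across runs; hash run_id only when runs are independent and no user-level stickiness is needed.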
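A minimal sketch of a hash-based split, assuming a string `user_id` is available at routing time. The function names, the salt value, and the 10,000-bucket resolution are illustrative choices, not part of any particular platform.

```python
import hashlib

BUCKETS = 10_000  # resolution of the split: 0.01 percent per bucket


def bucket(key: str, salt: str = "agent-canary") -> int:
    """Map a key to a stable bucket in [0, BUCKETS) using SHA-256."""
    digest = hashlib.sha256(f"{salt}:{key}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % BUCKETS


def assign_variant(user_id: str, canary_fraction: float = 0.05,
                   salt: str = "agent-canary") -> str:
    """Return 'canary' for roughly canary_fraction of keys, else 'baseline'."""
    threshold = round(canary_fraction * BUCKETS)
    return "canary" if bucket(user_id, salt) < threshold else "baseline"
```

Because the bucket for a given key never changes, raising `canary_fraction` only adds users to the canary; nobody who already saw the new version is moved back to the baseline. Changing the salt reshuffles the assignment, which is useful when starting an unrelated experiment.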
Measurement window. Long enough to reach statistical significance on the task-success metric. For a high-volume agent, hours. For a low-volume one, days. Set the window from the volume math, not from intuition.
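The "volume math" can be sketched with the standard normal-approximation sample size for comparing two proportions with unequal arm sizes. The default significance level and power, and the example numbers in the usage note, are assumptions for illustration; pick values that match your own risk tolerance.

```python
import math
from statistics import NormalDist


def canary_runs_needed(p_baseline, min_detectable_drop, canary_fraction,
                       alpha=0.05, power=0.80):
    """Canary-arm runs needed to detect a drop in task success rate.

    Uses the two-proportion normal approximation, where the baseline arm
    receives k times as many runs as the canary arm.
    """
    p_canary = p_baseline - min_detectable_drop
    k = (1 - canary_fraction) / canary_fraction  # baseline runs per canary run
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_beta = NormalDist().inv_cdf(power)
    variance = p_canary * (1 - p_canary) + p_baseline * (1 - p_baseline) / k
    return math.ceil((z_alpha + z_beta) ** 2 * variance / min_detectable_drop ** 2)


def window_hours(runs_needed, runs_per_hour, canary_fraction):
    """Hours of traffic required for the canary arm to collect runs_needed."""
    return runs_needed / (runs_per_hour * canary_fraction)
```

As a rough illustration, detecting a 3-point drop from a 90 percent success rate with a 5 percent canary needs on the order of a thousand canary runs, which at 2,000 total runs per hour is about ten hours. Halving the detectable drop roughly quadruples the window.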
Comparison metrics. Task success rate (headline). p50 and p95 latency (secondary). Token cost per run (secondary). Error class distribution (tail). Any metric that regresses outside its confidence interval triggers automatic rollback.
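One possible reading of "regresses outside its confidence interval" for the headline metric: build a confidence interval on the difference in success rates and trigger rollback when the whole interval sits below zero. The function name and the 95 percent default are assumptions, and the normal approximation is only reasonable once both arms have a few hundred runs.

```python
import math
from statistics import NormalDist


def success_rate_regressed(base_ok, base_n, canary_ok, canary_n,
                           confidence=0.95):
    """True if the canary success rate is significantly below the baseline.

    Computes a normal-approximation interval for (canary - baseline) and
    reports a regression when its upper bound is below zero.
    """
    p_base = base_ok / base_n
    p_canary = canary_ok / canary_n
    std_err = math.sqrt(p_base * (1 - p_base) / base_n
                        + p_canary * (1 - p_canary) / canary_n)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    upper_bound = (p_canary - p_base) + z * std_err
    return upper_bound < 0
```

The same shape of check can be applied to the secondary metrics with an appropriate test for each (for example, comparing latency percentiles), but those are not shown here.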
Ramp. If the canary holds, ramp to 25 percent, 50 percent, 100 percent at intervals matched to the measurement window. Skipping the ramp is the most common cause of "the canary looked good but production broke."
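The ramp can be expressed as a small step function that a scheduler calls once per measurement window. The step list uses the percentages above with a 5 percent starting canary; `next_fraction` is a hypothetical helper name.

```python
RAMP_STEPS = [0.05, 0.25, 0.50, 1.00]


def next_fraction(current, canary_healthy):
    """Return the traffic fraction for the next measurement window.

    An unhealthy canary goes to 0.0 (rollback); a healthy one advances
    exactly one step and never skips ahead.
    """
    if not canary_healthy:
        return 0.0
    for step in RAMP_STEPS:
        if step > current:
            return step
    return RAMP_STEPS[-1]  # already at full traffic
```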
Model version updates
Model updates are higher risk than they look. A new minor version of the same model family can interpret ambiguous prompts differently, produce different tool-selection probabilities, and change latency and cost.
Pin explicitly. OpenAI exposes pinned model identifiers like gpt-4o-2024-08-06 (OpenAI models page). Anthropic exposes pinned identifiers like claude-3-7-sonnet-20250219 (Anthropic models page). Use the pinned identifier; never use the alias unless you have a tested rollback path for the underlying change.
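A simple lint for unpinned identifiers can run in CI. This is a heuristic sketch: it only recognizes the two date-suffix formats quoted above (`-YYYY-MM-DD` and `-YYYYMMDD`), so identifiers from providers that use a different scheme would need their own pattern.

```python
import re

# Matches a trailing date stamp in either of the two formats shown above.
_DATE_SUFFIX = re.compile(r"-(\d{4}-\d{2}-\d{2}|\d{8})$")


def is_pinned(model_id: str) -> bool:
    """Heuristic: treat an identifier as pinned if it ends in a date stamp."""
    return bool(_DATE_SUFFIX.search(model_id))


def unpinned_models(model_ids):
    """Return the identifiers that look like aliases rather than pins."""
    return [m for m in model_ids if not is_pinned(m)]
```

Running `unpinned_models` over every model identifier in the agent's configuration gives the list of aliases to replace.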
Update process. Treat a model bump as a structural change. Run the eval set against the new pin. Canary at 5 percent. Measure for at least 48 hours of representative traffic. Ramp if it holds.
Deprecation calendar. Track the deprecation dates of every pinned model you depend on. The provider gives months of notice; the team that does not track them gets surprised at the cutover.
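The calendar can live in the repository as data and be checked on a schedule. In the sketch below the model names and dates are placeholders, not real provider deprecation dates, and the 90-day warning threshold is an arbitrary default.

```python
from datetime import date

# Placeholder entries: fill these in from the provider's deprecation notices.
DEPRECATION_CALENDAR = {
    "example-model-a-20240101": date(2025, 6, 30),
    "example-model-b-20240601": date(2026, 1, 15),
}


def pins_expiring(calendar, today, warn_days=90):
    """Return (model, days_left) for pins within warn_days of end of life,
    soonest first. Already-expired pins appear with a negative days_left."""
    expiring = [
        (model, (eol - today).days)
        for model, eol in calendar.items()
        if (eol - today).days <= warn_days
    ]
    return sorted(expiring, key=lambda item: item[1])
```

Wiring the result into a weekly alert turns the cutover into a planned structural change instead of a surprise.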
Tool updates
Adding or removing a tool changes the action space the agent considers on every call. Even a tool whose description closely resembles an existing one can shift tool-selection accuracy in subtle ways.
Adding a tool. Update the eval set first with cases that should and should not use the new tool. Promote the prompt and tool together with a canary. Watch for tool-selection drift: the new tool getting picked for tasks it was not designed for, or existing tools being displaced.
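Tool-selection drift can be watched by comparing how often each tool is chosen in the baseline and in the canary. The sketch below assumes you log one tool name per tool call; the tool names in the usage note and the 5-point threshold are hypothetical.

```python
from collections import Counter


def selection_shares(tool_calls):
    """Fraction of calls that went to each tool."""
    counts = Counter(tool_calls)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tool: n / total for tool, n in counts.items()}


def selection_drift(baseline_calls, canary_calls, threshold=0.05):
    """Tools whose share of calls moved by more than threshold.

    Positive values mean the tool is picked more often in the canary;
    negative values mean it is being displaced.
    """
    base = selection_shares(baseline_calls)
    canary = selection_shares(canary_calls)
    drift = {}
    for tool in set(base) | set(canary):
        delta = canary.get(tool, 0.0) - base.get(tool, 0.0)
        if abs(delta) > threshold:
            drift[tool] = round(delta, 3)
    return drift
```

A new tool will always show a positive shift, so the signal to look at is which existing tools lost share and whether the eval cases say they should have. Share shifts alone do not tell you whether the new choices are correct; pair this with the task-success metric.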
Removing a tool. Mark the tool deprecated for one full release cycle before removal. Existing eval cases that referenced the tool need substitution before the tool is gone. Sudden removal causes the agent to flounder on tasks that used to work.
Changing a tool's signature. Treat as a removal of the old tool plus addition of the new one. Update descriptions, examples, and evals together.
Rollback discipline
The fastest path to safe rapid iteration is being able to revert quickly.
Rollback by flag. The active prompt, model pin, tool list, and stopping rules are all behind a flag that points to a versioned artifact. Rolling back is a flag flip, not a redeploy. The prior version stays addressable; no history is lost.
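A minimal in-memory sketch of the flag-to-artifact pattern. The class and field names are hypothetical; a real system would back this with a feature-flag service or a config store, and only the shape is the point here: versions are immutable, the flag is a pointer, and every flip is logged.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class AgentVersion:
    """One immutable, addressable bundle of everything the flag controls."""
    version: str
    prompt_ref: str
    model_pin: str
    tools: tuple
    stopping_rule: str


@dataclass
class AgentFlag:
    versions: dict = field(default_factory=dict)
    history: list = field(default_factory=list)    # activated version ids, in order
    audit_log: list = field(default_factory=list)  # (action, version, reason)

    def register(self, agent_version):
        self.versions[agent_version.version] = agent_version

    def activate(self, version, reason):
        if version not in self.versions:
            raise KeyError(f"unknown version: {version}")
        self.history.append(version)
        self.audit_log.append(("activate", version, reason))

    @property
    def active(self):
        return self.versions[self.history[-1]]

    def rollback(self, reason):
        """Point the flag at the previously active version.

        Nothing is deleted; calling rollback twice flips back again.
        """
        if len(self.history) < 2:
            raise RuntimeError("no prior version to roll back to")
        previous = self.history[-2]
        self.history.append(previous)
        self.audit_log.append(("rollback", previous, reason))
        return self.versions[previous]
```

Because the rolled-back version stays registered, it can be re-activated after a fix without rebuilding anything.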
Rehearse it. The first time you flip the production flag should not be during an incident. Once per quarter, run a controlled rollback drill: deploy a no-op change, roll it back, then verify that the audit log shows the rollback and that new traffic is served by the prior version.
For more on rolling back individual agent actions (vs deployments), see how to roll back an agent action.
Communicating updates
Structural changes need consumer notice. For internal consumers (downstream services or other agents that depend on output shape), an email plus a deprecation window suffices. For external customers (enterprise self-service agents, embedded agents), publish a changelog with version, change class, expected impact, and the rollback plan.
The changelog discipline pays back. Customers who can see the change history develop a calibrated trust in the team that ships changes. Customers who see only outages develop the opposite.
Update freezes
Two situations call for an explicit update freeze.
Active incidents. While an investigation is open, no agent updates ship to production. The freeze prevents new changes from confusing the investigation.
Customer-critical windows. Quarter-end for finance customers; holiday peak for retail customers; tax season for accounting customers. Communicate the freeze in advance; deploy only urgent fixes during it.
The freeze must be tooled, not policy alone. A feature flag at the deploy stage blocks promotion during freeze windows. Without the tool, a well-meaning engineer ships "just a small change" and discovers the freeze the hard way.
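A sketch of what the tooled freeze might look like as a check at the promotion step. The window dates are placeholders and `may_promote` is a hypothetical function; the rules it encodes are the two from this section: nothing ships during an open incident, and only urgent fixes ship during a customer-critical window.

```python
from datetime import date

# Placeholder windows: (start, end inclusive, reason).
FREEZE_WINDOWS = [
    (date(2025, 3, 25), date(2025, 4, 2), "quarter-end for finance customers"),
    (date(2025, 11, 24), date(2025, 12, 2), "holiday peak for retail customers"),
]


def may_promote(today, open_incident=False, urgent_fix=False,
                windows=FREEZE_WINDOWS):
    """Return (allowed, message) for a promotion attempted on `today`."""
    if open_incident:
        # Incident freezes have no override, urgent or otherwise.
        return False, "blocked: open incident"
    for start, end, reason in windows:
        if start <= today <= end:
            if urgent_fix:
                return True, f"allowed: urgent fix during {reason}"
            return False, f"blocked: {reason}"
    return True, "allowed"
```

Returning the reason alongside the decision lets the pipeline show the engineer why the promotion was blocked, which is what makes the freeze discoverable before the change ships.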
Tracking upstream changes
The model providers, tool vendors, and platform layers your agent depends on also ship updates. Three categories matter.
Deprecations. Pinned model versions have end-of-life dates. The deprecation calendar is the single most important upstream artifact to track. Subscribe to provider notification lists and check the calendar at least monthly.
Behavior changes within a pin. Provider safety filter updates, rate-limit changes, and latency-tier shifts happen within a pinned model. Subscribe to provider changelogs and budget time monthly to triage their impact.
Tool-provider API versions. Slack, Salesforce, Stripe, and other SaaS providers version their APIs. A new version may add fields you want; an old version may stop being supported. Track the dependency the same way you track library versions in package.json: pin, review, and update on a known cadence.
Frequently asked questions
How often should I update an AI agent in production?
As often as changes pass the eval gate. In practice, tactical changes ship 1 to 5 times per week and structural changes monthly or quarterly.
What is a canary deploy for an AI agent?
A traffic split that routes a small fraction to the new version; metrics are compared against the baseline before ramping.
What changes to an AI agent are most risky?
Structural changes: model swaps, tool additions or removals, output schema changes, and stopping-rule changes. All need eval coverage, canary traffic, and explicit rollback plans.
How do I know if an AI agent update worked?
Task success rate, latency, token cost, and error distribution should hold or improve in the canary window.
Should AI agent updates be coupled to model provider updates?
No. Pin the model version explicitly. Move to a new pin only after running the eval gate.
Three things to ship this week
- Classify your last 10 changes into trivial, tactical, or structural. Adjust gates per class.
- Pin every model identifier. Replace any aliases with date-stamped pins.
- Rehearse a rollback. No-op change, flag flip, audit verification.
Sources
- OpenAI, "Models", platform.openai.com
- Anthropic, "All models overview", docs.anthropic.com
- Anthropic, "Building Effective Agents", 2024, anthropic.com
- OpenAI, "A Practical Guide to Building Agents", 2024, openai.com
- Google SRE Book, "Release Engineering", sre.google