Agent updates fail badly when they are in-place. A prompt edit lands mid-run and the second half of a conversation no longer matches the first. A new model lands and a tool-call signature shifts. An index rebuild swaps documents under a retrieval call and the citation is gone. The discipline is to never mutate a running version, and to switch versions only at safe boundaries. Companion to blue-green deployment, canary releases, and the broader update cadence playbook.

This piece covers the bundle, the routing, the drain mechanism, and the rollback path. The pattern works for any update class: code, prompt, model, index, tool definition, policy.

Bundle every change as one immutable version

The unit of update is not "the prompt" or "the model" but a bundle: the prompt text, the model id, the temperature and other sampling settings, the tool registry (which tools are callable and which schemas), the retrieval index pointer, the policy and guardrail config, and the orchestrator code commit. Anything that affects agent behavior. The bundle gets a content hash; that hash is the version id.

This matters because behavior is a function of the whole bundle, not any single component. Edit one piece in place and the behavior drift is invisible until a user complains. The bundle approach also makes evals reproducible: every eval result is tagged with the bundle id; comparing eval runs across versions becomes one join.

Practically, a bundle manifest looks like a JSON document with fields for each component, the hash of each, and the hash of the manifest itself. Store bundles in object storage, addressable by hash. The orchestrator loads a bundle on demand and caches it for the request lifetime.

Pinned-version routing at the request boundary

A request enters the platform and immediately gets a version stamp. The router decides which bundle the request will run against. Once stamped, the request executes against that bundle for its entire lifetime, even if a newer bundle is published mid-run.

The stamping rule is what enables overlapping deployments. Two common rules.

The mechanism mirrors the rolling-update and Service abstractions described in the Kubernetes deployment docs, where a stable ClusterIP routes new connections to updated pods while existing connections drain on the old ones (Kubernetes Deployments, 2025). The version pointer in an agent platform plays the role of the Service selector.

Drain-and-replace for stateful runs

Most agent runs are short and stateless: stamp on entry, finish, release. A few are long-running: human-in-the-loop approvals, multi-step orchestrations that pause for external events, scheduled agents that run for minutes. For these, drain matters.

Drain rules.

  1. Old-bundle traffic stops at the router. Once the new bundle is "current", no new request stamps with the old version. Existing in-flight runs keep going.
  2. Both bundles stay loaded. The orchestrator keeps the old bundle warm for the drain window: typically the p99 run duration plus a buffer. For most agent platforms this is 5 to 30 minutes.
  3. Long-runners get a deadline. Any run that exceeds the drain window is migrated (if the state shape is compatible) or marked for completion on next event boundary (for human-in-the-loop pauses).
  4. Unload after drain completes. The orchestrator marks the old bundle for unload only after the in-flight run count hits zero, or the deadline expires.

Hot-swapping prompts

Prompt updates are the most frequent agent change and the one teams break the most. The temptation is "the prompt is just a string, edit it in place". This breaks any run that was halfway through the original prompt; the model now sees inconsistent context across turns.

The fix is the same as any bundle change: a new prompt produces a new bundle. The bundle hash changes; existing runs keep using the old hash; new runs pick up the new hash. The prompt file in source control is the authoritative version; the prompt is loaded into the bundle at build time, not at request time.

For platforms that allow runtime prompt edits (a builder UI with a "save" button), the save operation should mint a new bundle, not mutate a stored prompt. The cost is more bundle objects in storage; the benefit is the same correctness guarantee as code deploys.

Model upgrades without downtime

Model upgrades are the riskiest agent update class. Output behavior can shift in ways the eval suite does not catch. The recipe.

  1. Pin the model id in the bundle. Never reference "latest". The bundle says "anthropic/claude-sonnet-4-5-20250929" or similar, exact.
  2. Shadow the new model first. Build a candidate bundle with the new model id. Send a copy of production traffic to the candidate; do not return the candidate response to the user; compare results against the production response. This is the same shadowing pattern used in canary releases for inference systems (Fowler, Parallel Change, 2014).
  3. Promote via canary. Once shadow results match (eval-defined), route 1 to 5 percent of real traffic to the candidate. Monitor for output quality, latency, cost, error rate.
  4. Cut over. Ramp to 100 percent when the canary holds. Keep the prior bundle warm for the rollback window.

Vendor providers (OpenAI, Anthropic, Google) document model-version pinning and deprecation timelines so callers can plan upgrades; Anthropic's model deprecation policy gives at least six months' notice before retiring a snapshot (Anthropic model deprecations, 2025). Pin to the snapshot; upgrade on your schedule, not the vendor's.

Vector-index swaps

Retrieval indexes are state that the agent reads. Mutating an index under a running query produces inconsistent results: chunks that existed at the start of the query may be gone by the time a follow-up retrieval runs.

The pattern is the same pointer flip.

For platforms that need continuous updates (a CRM index that gets writes throughout the day), use a delta pattern: the read path queries both the indexed corpus and a recent-writes overlay; the overlay is small and fast; periodic compactions merge the overlay into the main index in a swap operation.

Rollback path

The rollback story is the most undervalued part of a zero-downtime design. If you can roll forward without downtime, you can roll back the same way. The rollback time depends on whether the prior bundle is still warm.

The Google SRE workbook recommends keeping at least N prior versions warm enough for fast rollback, where N is determined by how many bundles ship per day and how long incidents take to detect (Google SRE workbook, Canarying releases, 2018). For agent platforms shipping multiple bundles a day, N=3 to 5 is a sane default.

Common pitfalls

Four patterns that show up in agent platform incident reviews.

"Just edit the prompt". Someone with builder UI access edits a prompt in place, no new bundle, no version stamp. Runs already in flight start using the new prompt halfway through. The fix is making in-place edits structurally impossible: every save mints a new bundle.

The orchestrator caches a bundle by name, not hash. A worker process loaded "bundle_current" at startup and cached it. The pointer flipped; the worker never noticed. The fix: cache by hash, refresh the hash from the pointer on every new request.

Drain window too short. The platform unloaded the old bundle after 30 seconds. A long-running agent run that paused for human approval failed when it tried to resume. The fix: drain windows sized by p99 run duration, not p50.

Retention purges the rollback target. A bundle older than 30 days got swept by storage retention; that bundle was the last known good version after a bad deploy. The fix: tag the rollback target as "do not purge" until a newer "known good" exists.

FAQ

What counts as a zero-downtime update for an AI agent?
An update where in-flight runs finish on the old version, new runs start on the new version, and users see no interruption. The mechanism is atomic version switch at the request boundary.
Can you hot-swap a prompt without restarting the agent?
Yes, by versioning prompts as immutable bundles and routing each request to a pinned version. New requests get the new version once the route flips; in-flight requests finish on the version they started with.
How is zero-downtime different from blue-green deployment?
Blue-green is one zero-downtime mechanism. Zero-downtime is the goal, achievable via blue-green, rolling updates, canary releases, or pinned-version routing. The choice depends on cost and state.
What about updates to the vector index?
Treat the index as a versioned artifact. Build the new index alongside the old, atomic-rename the pointer when ready, keep the old index for at least one release cycle for rollback.
How do I roll back a zero-downtime update?
Flip the version pointer back to the prior bundle. If the prior version is still warm, rollback is sub-second. If unloaded, it is a fresh load and pointer flip.
Do model upgrades need any special handling?
Yes. Pin the model snapshot in the bundle. Shadow the new model against production traffic first. Promote via canary. Keep the prior model accessible for the rollback window.

Sources