What counts as a zero-downtime update for an AI agent?

Any update where in-flight runs continue to completion on the old version while new runs start on the new version, with no observable service interruption to users. The mechanism is atomic version switch at the request boundary, not in-place mutation of running processes.

Can you hot-swap a prompt without restarting the agent?

Yes, by versioning prompts as immutable bundles and routing each request to a pinned version. New requests pick up the new version once the route is flipped. In-flight requests finish on the version they started with.

How is zero-downtime different from blue-green deployment?

Blue-green is one zero-downtime mechanism: two full environments, flip the router. Zero-downtime is the goal, achievable also through rolling updates, canary releases, or pinned-version routing within one environment. The choice depends on cost and state.

What about updates to the vector index?

Treat the index as a versioned artifact. Build the new index alongside the old one. Atomic-rename the pointer when ready. In-flight queries finish against the old index; new queries hit the new. Keep the old index for rollback for at least one release cycle.

How do I roll back a zero-downtime update?

Flip the version pointer back to the prior bundle. If you kept the prior version warm (the previous bundle still loaded), rollback is sub-second. If you discarded it, rollback is a fresh deploy of the prior bundle, typically minutes.

Do model upgrades need any special handling?

Model upgrades are the trickiest because output shape can shift subtly. Pin the model in the bundle, run shadow traffic on the new model first, compare evals, then flip. Keep the prior model accessible for at least one cycle so you can roll back without a vendor re-onboarding.

AI Agent Zero-Downtime Updates: Hot-Swap Configs Without Stopping Runs

Agent updates fail badly when they are in-place. A prompt edit lands mid-run and the second half of a conversation no longer matches the first. A new model lands and a tool-call signature shifts. An index rebuild swaps documents under a retrieval call and the citation is gone. The discipline is to never mutate a running version, and to switch versions only at safe boundaries. Companion to blue-green deployment, canary releases, and the broader update cadence playbook.

This piece covers the bundle, the routing, the drain mechanism, and the rollback path. The pattern works for any update class: code, prompt, model, index, tool definition, policy.

Bundle every change as one immutable version

The unit of update is not "the prompt" or "the model" but a bundle: the prompt text, the model id, the temperature and other sampling settings, the tool registry (which tools are callable and which schemas), the retrieval index pointer, the policy and guardrail config, and the orchestrator code commit. Anything that affects agent behavior. The bundle gets a content hash; that hash is the version id.

This matters because behavior is a function of the whole bundle, not any single component. Edit one piece in place and the behavior drift is invisible until a user complains. The bundle approach also makes evals reproducible: every eval result is tagged with the bundle id; comparing eval runs across versions becomes one join.

Practically, a bundle manifest looks like a JSON document with fields for each component, the hash of each, and the hash of the manifest itself. Store bundles in object storage, addressable by hash. The orchestrator loads a bundle on demand and caches it for the request lifetime.

Pinned-version routing at the request boundary

A request enters the platform and immediately gets a version stamp. The router decides which bundle the request will run against. Once stamped, the request executes against that bundle for its entire lifetime, even if a newer bundle is published mid-run.

The stamping rule is what enables overlapping deployments. Two common rules.

Default to current. The router has a "current" pointer; new requests get stamped with the current bundle. To deploy a new version, atomic-update the pointer.
Weighted by traffic share. The router holds two or more pointers and routes a percentage to each. This is the canary mechanism: 1 percent to the new bundle, 99 percent to the old, ratchet up as the new bundle proves itself.

The mechanism mirrors the rolling-update and Service abstractions described in the Kubernetes deployment docs, where a stable ClusterIP routes new connections to updated pods while existing connections drain on the old ones (Kubernetes Deployments, 2025). The version pointer in an agent platform plays the role of the Service selector.

Drain-and-replace for stateful runs

Most agent runs are short and stateless: stamp on entry, finish, release. A few are long-running: human-in-the-loop approvals, multi-step orchestrations that pause for external events, scheduled agents that run for minutes. For these, drain matters.

Drain rules.

Old-bundle traffic stops at the router. Once the new bundle is "current", no new request stamps with the old version. Existing in-flight runs keep going.
Both bundles stay loaded. The orchestrator keeps the old bundle warm for the drain window: typically the p99 run duration plus a buffer. For most agent platforms this is 5 to 30 minutes.
Long-runners get a deadline. Any run that exceeds the drain window is migrated (if the state shape is compatible) or marked for completion on next event boundary (for human-in-the-loop pauses).
Unload after drain completes. The orchestrator marks the old bundle for unload only after the in-flight run count hits zero, or the deadline expires.

Hot-swapping prompts

Prompt updates are the most frequent agent change and the one teams break the most. The temptation is "the prompt is just a string, edit it in place". This breaks any run that was halfway through the original prompt; the model now sees inconsistent context across turns.

The fix is the same as any bundle change: a new prompt produces a new bundle. The bundle hash changes; existing runs keep using the old hash; new runs pick up the new hash. The prompt file in source control is the authoritative version; the prompt is loaded into the bundle at build time, not at request time.

For platforms that allow runtime prompt edits (a builder UI with a "save" button), the save operation should mint a new bundle, not mutate a stored prompt. The cost is more bundle objects in storage; the benefit is the same correctness guarantee as code deploys.

Model upgrades without downtime

Model upgrades are the riskiest agent update class. Output behavior can shift in ways the eval suite does not catch. The recipe.

Pin the model id in the bundle. Never reference "latest". The bundle says "anthropic/claude-sonnet-4-5-20250929" or similar, exact.
Shadow the new model first. Build a candidate bundle with the new model id. Send a copy of production traffic to the candidate; do not return the candidate response to the user; compare results against the production response. This is the same shadowing pattern used in canary releases for inference systems (Fowler, Parallel Change, 2014).
Promote via canary. Once shadow results match (eval-defined), route 1 to 5 percent of real traffic to the candidate. Monitor for output quality, latency, cost, error rate.
Cut over. Ramp to 100 percent when the canary holds. Keep the prior bundle warm for the rollback window.

Vendor providers (OpenAI, Anthropic, Google) document model-version pinning and deprecation timelines so callers can plan upgrades; Anthropic's model deprecation policy gives at least six months' notice before retiring a snapshot (Anthropic model deprecations, 2025). Pin to the snapshot; upgrade on your schedule, not the vendor's.

Vector-index swaps

Retrieval indexes are state that the agent reads. Mutating an index under a running query produces inconsistent results: chunks that existed at the start of the query may be gone by the time a follow-up retrieval runs.

The pattern is the same pointer flip.

Build the new index alongside the old. Different namespace, different collection name, same vector dimension.
Backfill in the background. Index the corpus into the new collection. Verify counts, sample queries.
Atomic-rename the pointer. The bundle references "index_v17" not "production_index". New bundles point at "index_v18".
Keep the old index for a window. One bundle cycle minimum; longer if you have audits to satisfy.

For platforms that need continuous updates (a CRM index that gets writes throughout the day), use a delta pattern: the read path queries both the indexed corpus and a recent-writes overlay; the overlay is small and fast; periodic compactions merge the overlay into the main index in a swap operation.

Rollback path

The rollback story is the most undervalued part of a zero-downtime design. If you can roll forward without downtime, you can roll back the same way. The rollback time depends on whether the prior bundle is still warm.

Warm rollback. The prior bundle is still loaded (within the drain window). Rollback is a pointer flip, completing in under a second.
Cold rollback. The prior bundle has been unloaded. Rollback is a fresh load and a pointer flip, completing in 1 to 5 minutes depending on bundle size and model warm-up.
Disaster rollback. The prior bundle is no longer in storage (a retention policy purged it). Rollback requires a full rebuild. The lesson: never let "the last known good version" disappear from storage.

The Google SRE workbook recommends keeping at least N prior versions warm enough for fast rollback, where N is determined by how many bundles ship per day and how long incidents take to detect (Google SRE workbook, Canarying releases, 2018). For agent platforms shipping multiple bundles a day, N=3 to 5 is a sane default.

Common pitfalls

Four patterns that show up in agent platform incident reviews.

"Just edit the prompt". Someone with builder UI access edits a prompt in place, no new bundle, no version stamp. Runs already in flight start using the new prompt halfway through. The fix is making in-place edits structurally impossible: every save mints a new bundle.

The orchestrator caches a bundle by name, not hash. A worker process loaded "bundle_current" at startup and cached it. The pointer flipped; the worker never noticed. The fix: cache by hash, refresh the hash from the pointer on every new request.

Drain window too short. The platform unloaded the old bundle after 30 seconds. A long-running agent run that paused for human approval failed when it tried to resume. The fix: drain windows sized by p99 run duration, not p50.

Retention purges the rollback target. A bundle older than 30 days got swept by storage retention; that bundle was the last known good version after a bad deploy. The fix: tag the rollback target as "do not purge" until a newer "known good" exists.

FAQ

What counts as a zero-downtime update for an AI agent?: An update where in-flight runs finish on the old version, new runs start on the new version, and users see no interruption. The mechanism is atomic version switch at the request boundary.
Can you hot-swap a prompt without restarting the agent?: Yes, by versioning prompts as immutable bundles and routing each request to a pinned version. New requests get the new version once the route flips; in-flight requests finish on the version they started with.
How is zero-downtime different from blue-green deployment?: Blue-green is one zero-downtime mechanism. Zero-downtime is the goal, achievable via blue-green, rolling updates, canary releases, or pinned-version routing. The choice depends on cost and state.
What about updates to the vector index?: Treat the index as a versioned artifact. Build the new index alongside the old, atomic-rename the pointer when ready, keep the old index for at least one release cycle for rollback.
How do I roll back a zero-downtime update?: Flip the version pointer back to the prior bundle. If the prior version is still warm, rollback is sub-second. If unloaded, it is a fresh load and pointer flip.
Do model upgrades need any special handling?: Yes. Pin the model snapshot in the bundle. Shadow the new model against production traffic first. Promote via canary. Keep the prior model accessible for the rollback window.

Sources

Kubernetes, "Deployments", 2025, kubernetes.io
Google SRE, "Canarying releases" (Workbook), 2018, sre.google
Martin Fowler, "Parallel Change", 2014, martinfowler.com
Anthropic, "Model deprecations", 2025, docs.anthropic.com
AWS, "Deployment strategies for AWS Lambda", 2025, docs.aws.amazon.com