What is blue-green deployment for AI agents?

Two complete environments run in parallel. Blue serves production; green carries the new prompt, model, or tool versions. Once evals on green pass, traffic switches. If something regresses, traffic switches back without a redeploy.

What about in-flight runs during the switch?

Long-running agent runs need a stateful pin so they finish on the version they started on. New runs go to the new version. Drain time is the time it takes the last pinned run to complete.

Is blue-green better than canary for agents?

Different jobs. Blue-green is binary: everyone moves at once. Canary moves a percentage and watches. Blue-green is simpler and faster to roll back; canary catches regressions that only show on a subset of traffic. Most teams use both.

Do I need duplicate infrastructure?

Only from the time green is built until blue is recycled at the end of the quarantine window. Modern container platforms make the second environment cheap to stand up and tear down. Storage is usually shared (with tenant-safe schemas); compute is doubled briefly.

How long does the old (blue) environment stay live after the switch?

Long enough to drain blue plus a quarantine window where rollback is fast. Typically 24 to 72 hours after the cutover. Then blue is recycled and ready to become the next green.

AI Agent Blue-Green Deployment: Safe Prompt and Model Swaps

Blue-green deployment is older than the cloud, and it still works. The twist for AI agents is that the unit of deploy is not "the binary"; it is the bundle of code, prompts, model version, retrieval index, and tool versions. A change to any of those is a release, and each can regress independently of the others. The blue-green discipline gives you a clean fallback when one of them does. This piece is a companion to the ones on canary releases and zero-downtime updates.

This piece covers the bundle, the in-flight-run pin, the switch procedure, and the rollback drill that makes the whole pattern safe.

What blue-green means for agents

The classic shape: two identical production environments. One is live; the other is idle or carrying a release candidate. Traffic switches from live to candidate via a router (load balancer, DNS, service mesh) once the candidate is ready. The old live becomes the next candidate.

For agents, the environments are not just service replicas; they are full pipelines. Each environment pins a specific code version, a specific set of prompt templates, a specific underlying model (e.g., claude-3.7-sonnet vs claude-3.5-sonnet), a specific retrieval index version, and specific versions of any custom tools. The switch flips all of them together so that production at any moment is a coherent bundle.

The release bundle

Concretely, an agent release bundle has five pieces.

Code. The orchestrator, tool definitions, guardrails. Versioned in git.
Prompts. System prompts, few-shot examples, instruction templates. Stored alongside code or in a prompt store with versioning.
Model version. The exact model name and snapshot id, pinned in config. Auto-upgrades from the provider become opt-in.
Retrieval index. The vector index version or snapshot. Embedding model changes are an index version bump.
Tool versions. External tool APIs the agent calls. Either versioned per call or pinned via the tool's own versioning.

The bundle gets a single version identifier (e.g., a release tag). Logs, traces, and bills carry the bundle id alongside the run id. When something regresses, you can find the bundle change that caused it in one query.
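One way to picture the bundle is as a frozen record whose identifier is derived from its five pins. The sketch below is illustrative only: the field names, the `ReleaseBundle` class, and the choice to hash the pins (rather than use a release tag, as the text suggests) are assumptions, not a prescribed format.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ReleaseBundle:
    """The five pinned pieces of an agent release (field names are hypothetical)."""
    code_sha: str          # git commit of orchestrator, tool definitions, guardrails
    prompts_version: str   # version of the prompt templates
    model: str             # exact model name and snapshot id
    index_version: str     # retrieval index snapshot
    tool_versions: tuple   # sorted (tool_name, version) pairs

    @property
    def bundle_id(self) -> str:
        # Hash a canonical JSON form so identical pins always yield the same id.
        canonical = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(canonical.encode()).hexdigest()[:12]


def log_record(run_id: str, bundle: ReleaseBundle, event: str) -> dict:
    # Every log line carries the bundle id next to the run id.
    return {"run_id": run_id, "bundle_id": bundle.bundle_id, "event": event}
```

Deriving the id from the pins means a change to any one of the five pieces produces a new bundle id, so a query grouped by `bundle_id` separates runs on either side of the change.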

Stateful pins for in-flight runs

An agent run can last seconds (single tool call) or hours (deep research, multi-step automation). During a deploy, you have to decide what happens to runs already in progress. Two viable strategies.

Drain. When the switch flips, new runs go to green. In-flight runs stay on blue until they complete. Blue stays alive until all in-flight runs drain or hit a max-age cap. Drain time is bounded by the longest run plus a safety margin.

Pin. Every run is tagged with the bundle id it started on. Blue and green both stay routable, and the bundle id stored in the run state decides which environment executes each step. Runs are stable across the switch; the deploy is faster.

Drain is simpler; pin is faster. Most platforms with sub-minute runs use drain. Platforms with multi-hour runs use pin.
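The pin strategy can be sketched as a router that records, per run, the bundle that was live when the run started. This is a toy in-memory model; the `Router` class and its method names are hypothetical, and a real system would keep the pin in durable run state.

```python
from dataclasses import dataclass, field


@dataclass
class Router:
    """Toy router: `live` names the bundle that receives new runs."""
    live: str
    pins: dict = field(default_factory=dict)  # run_id -> bundle id it started on

    def start_run(self, run_id: str) -> str:
        # New runs are pinned to whatever bundle is live right now.
        self.pins[run_id] = self.live
        return self.live

    def route_step(self, run_id: str) -> str:
        # Later steps follow the pin, not the current live bundle.
        return self.pins[run_id]

    def finish_run(self, run_id: str) -> None:
        self.pins.pop(run_id, None)

    def switch(self, new_live: str) -> None:
        # The cutover (and the rollback) is a single assignment.
        self.live = new_live

    def drained(self, bundle_id: str) -> bool:
        # A bundle is safe to recycle once no in-flight run is pinned to it.
        return bundle_id not in self.pins.values()
```

The same structure covers drain: `drained("blue")` turns true only after the last run that started on blue has finished.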

The switch procedure

Build green. Stand up the second environment with the new bundle.
Smoke test. Run a small set of synthetic prompts through green. Confirm the surface is responsive.
Run evals. Use a held-out eval set with agreed pass thresholds. If any agreed metric regresses, halt. See agent evaluation metrics for the metric set.
Shadow traffic. Mirror a copy of production traffic to green without exposing responses to users. Compare green vs blue completions on the same prompts. Look for unexpected divergence.
Switch. Flip the router. New runs go to green.
Drain or pin. Handle in-flight runs per the chosen strategy.
Quarantine. Keep blue available for fast rollback for 24 to 72 hours.
Recycle. When the quarantine ends and metrics are clean, blue is recycled. Blue becomes the next green.
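The pre-switch part of this procedure behaves like an ordered list of gates where the first failure halts the release. Here is a minimal sketch of that idea; `run_cutover`, `evals_pass`, and the metric names in the usage note are hypothetical, and the "no metric may drop" rule is an assumed threshold policy.

```python
from typing import Callable, List, Optional, Tuple

Gate = Tuple[str, Callable[[], bool]]


def run_cutover(gates: List[Gate], flip_router: Callable[[], None]) -> Optional[str]:
    """Run pre-switch gates in order; flip the router only if all of them pass.

    Returns the name of the gate that failed, or None if the switch happened.
    """
    for name, check in gates:
        if not check():
            return name  # halt: green never receives live traffic
    flip_router()
    return None


def evals_pass(blue: dict, green: dict, max_drop: float = 0.0) -> bool:
    # Green must not fall below blue by more than max_drop on any agreed metric.
    return all(green[m] >= blue[m] - max_drop for m in blue)
```

A caller would pass gates such as `("smoke", ...)`, `("evals", lambda: evals_pass(blue_metrics, green_metrics))`, and `("shadow", ...)`, in that order, plus the function that flips the router.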

Rollback in under five minutes

The point of blue-green is that rollback is a switch, not a redeploy. Two preconditions.

First, the router flip must be reversible at the click of a button. If reverting requires a redeploy of code or a manual config change applied across 50 nodes, the value of the pattern is gone. A single-line config or a feature-flag toggle that the on-call can flip is the right shape.

Second, the regression signal has to fire fast. Don't wait for a customer email. Three signals should automatically prompt a rollback decision: the eval pass rate drops below threshold on shadow or live samples, p95 latency on green exceeds blue by 25 percent over a 10-minute window, or cost per completed run on green exceeds blue by 50 percent.
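The three signals reduce to simple comparisons between blue and green over the same window. The sketch below uses the 25 percent and 50 percent margins from the text; the `WindowStats` type, the function name, and the 0.90 pass-rate floor are assumptions to be replaced with your own agreed threshold.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class WindowStats:
    """Metrics for one environment over the same 10-minute window."""
    eval_pass_rate: float           # fraction in [0, 1]
    p95_latency_s: float
    cost_per_completed_run: float


def rollback_signals(blue: WindowStats, green: WindowStats,
                     min_pass_rate: float = 0.90) -> List[str]:
    """Return the names of the signals that fired (empty list means healthy)."""
    fired = []
    if green.eval_pass_rate < min_pass_rate:          # assumed threshold
        fired.append("eval_pass_rate")
    if green.p95_latency_s > blue.p95_latency_s * 1.25:   # 25 percent over blue
        fired.append("p95_latency")
    if green.cost_per_completed_run > blue.cost_per_completed_run * 1.50:  # 50 percent over blue
        fired.append("cost_per_run")
    return fired
```

Returning the list of fired signals, rather than a bare boolean, gives the on-call the reason for the alert along with the decision to make.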

The drill: practice a rollback on a random Wednesday afternoon when nothing is wrong. Time it. If it took longer than 5 minutes, fix the friction before the next deploy.

Blue-green vs canary

Blue-green flips all traffic at once. Canary flips a percentage. The choice is about confidence and traffic shape.

Blue-green when the change is small and evals are decisive. A new prompt that passed evals and shadow comparisons. Faster to deploy and to roll back.
Canary when the change might regress only on a subset. A model swap, a new tool, a major prompt restructure. Five percent of traffic for an hour will surface what evals missed.
Both for high-risk changes. Run a canary first; once it passes, flip blue-green for the rest.
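The difference between the two patterns shows up in the routing rule: canary assigns a stable fraction of runs to green, and blue-green is the limiting case where that fraction is 0 or 100. A minimal sketch, assuming runs are bucketed by a hash of the run id (bucketing by user or tenant id is an equally plausible choice); the function names are hypothetical.

```python
import hashlib


def in_canary(run_id: str, percent: float) -> bool:
    """Deterministically place a run in the canary slice.

    Hashing the run id keeps the decision stable across retries and nodes.
    """
    digest = hashlib.sha256(run_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000   # 0..9999
    return bucket < percent * 100                         # 5.0 -> buckets 0..499


def choose_bundle(run_id: str, blue: str, green: str, percent: float) -> str:
    # percent=0 sends everything to blue; percent=100 is the blue-green flip.
    return green if in_canary(run_id, percent) else blue
```

Because both patterns reduce to one `percent` setting on the same router, running a canary first and then flipping the rest needs no extra infrastructure.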

See canary releases for the percentage-based companion. The two patterns are not exclusive; they are complementary tools that share infrastructure.

Evals are the gate

The cleanest blue-green still depends on the evals attached to the gate. Three properties matter.

Coverage. The eval set spans the task distribution the agent sees in production. If 30 percent of production runs are tool-heavy workflows and the eval set is 90 percent question-answering, the gate under-samples a path that carries nearly a third of production traffic and can miss regressions on it.

Stability. The eval set is versioned. Adding examples for a new capability is fine; silently changing existing examples invalidates comparisons across bundles.

Sensitivity. The eval surfaces the regressions you care about. If a model swap makes outputs subtly less helpful but still technically correct, an exact-match eval will not catch it. Pair structured pass-rate evals with an LLM-as-judge or a human-feedback signal so the gate has a quality axis the structured eval misses.
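The coverage property can be checked mechanically by comparing the task mix of the eval set with the task mix of production. The sketch below is one hypothetical way to do that; the function name, the task labels, and the 0.5 ratio cutoff are assumptions.

```python
from typing import Dict


def coverage_gaps(prod_mix: Dict[str, float], eval_mix: Dict[str, float],
                  min_ratio: float = 0.5) -> Dict[str, float]:
    """Flag task types whose eval share is below min_ratio of their production share.

    Both arguments map task type -> fraction of runs (each mix sums to about 1).
    Returns {task_type: eval_share / prod_share} for the under-covered types.
    """
    gaps = {}
    for task, prod_share in prod_mix.items():
        if prod_share == 0:
            continue  # nothing to cover
        ratio = eval_mix.get(task, 0.0) / prod_share
        if ratio < min_ratio:
            gaps[task] = round(ratio, 2)
    return gaps
```

Taking the example above and assuming the remaining shares (70 percent question-answering in production, 10 percent tool workflows in the eval set), `coverage_gaps({"tool_workflow": 0.30, "question_answering": 0.70}, {"tool_workflow": 0.10, "question_answering": 0.90})` flags only `tool_workflow`.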

Common blue-green pitfalls

Four patterns that recur in agent platform postmortems.

Shared state between blue and green. If both environments write to the same agent-memory store, a partial deploy can corrupt records. Either segment the writable state or version it under the bundle id so a rollback also rolls back the state.

Tool-version drift. The bundle promises a tool version, but the tool API itself changes server-side. Pin where you can; monitor where you cannot; alert on tool-response schema deltas.
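Alerting on schema deltas can start with a shallow comparison between the response shape recorded at release time and the shape seen in production. This sketch looks only at top-level fields and Python type names, which is an assumed simplification; the function names are hypothetical.

```python
from typing import Any, Dict, List


def schema_of(payload: Dict[str, Any]) -> Dict[str, str]:
    # Reduce a JSON-like response to {field: type name}; top-level fields only.
    return {key: type(value).__name__ for key, value in payload.items()}


def schema_deltas(pinned: Dict[str, str], response: Dict[str, Any]) -> List[str]:
    """Describe how a tool response differs from the schema recorded at release."""
    seen = schema_of(response)
    deltas = []
    for key in sorted(pinned.keys() - seen.keys()):
        deltas.append(f"missing field: {key}")
    for key in sorted(seen.keys() - pinned.keys()):
        deltas.append(f"new field: {key}")
    for key in sorted(pinned.keys() & seen.keys()):
        if pinned[key] != seen[key]:
            deltas.append(f"type changed: {key} {pinned[key]} -> {seen[key]}")
    return deltas
```

A non-empty result is the alert condition; storing the pinned schema with the bundle ties the alert to a specific release.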

Long-running runs that never end. A run pinned to blue but stuck in a loop holds blue alive forever. Pair every pin with a hard max-age so the deploy can complete.
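A hard max-age is a small check over the runs still pinned to blue. The sketch below assumes start times are tracked as seconds since the epoch; the function names and the idea that over-age runs are stopped by the caller are illustrative assumptions.

```python
from typing import Dict, List


def runs_over_max_age(started_at: Dict[str, float], now: float,
                      max_age_s: float) -> List[str]:
    """Return ids of pinned runs that have outlived the cap and should be stopped.

    started_at maps run_id -> start time in seconds (e.g. from time.time()).
    """
    return sorted(run_id for run_id, t0 in started_at.items() if now - t0 > max_age_s)


def blue_can_recycle(started_at: Dict[str, float], now: float, max_age_s: float) -> bool:
    # Blue is free once every run still pinned to it has exceeded the cap
    # (or no pinned runs remain at all).
    return len(runs_over_max_age(started_at, now, max_age_s)) == len(started_at)
```

With the cap in place, the time blue stays alive for in-flight runs is bounded by `max_age_s` regardless of how any single run behaves.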

Skipping shadow traffic. "Evals passed" is necessary, not sufficient. Shadow traffic catches the cases evals do not represent. Skipping it is a common cause of "the evals were clean but production regressed".

FAQ

What is blue-green deployment for AI agents? Two complete environments run in parallel. Blue serves production; green carries the new bundle. Traffic switches once evals pass. If something regresses, traffic switches back without a redeploy.
How is it different from blue-green for web apps? The artifact is a bundle of code, prompts, model version, retrieval index, and tool versions. Any of those changing is a release; each has its own rollback semantics.
What about in-flight runs during the switch? Long runs need a stateful pin so they finish on the bundle they started on. New runs go to the new bundle. Drain time is the time it takes the last pinned run to complete.
Is blue-green better than canary? Different jobs. Blue-green is binary; canary moves a percentage. Blue-green is simpler and faster to roll back; canary catches regressions on a subset of traffic.
Do I need duplicate infrastructure? Only during the cutover and the quarantine window that follows. Storage is usually shared; compute is doubled briefly. Most teams keep the second environment scaled to zero between deploys.
How long does the old (blue) environment stay live after the switch? Long enough to drain plus a quarantine window for fast rollback. Typically 24 to 72 hours after the cutover.
