The point of a canary is to learn things evals cannot. Evals run on a held-out set; production runs on whatever showed up today. Some regressions are visible only at production scale, on production traffic shapes, with production users in the loop. A canary lets you see those regressions on 1 to 10 percent of traffic before the rest of your traffic finds them. It is a companion to the material on blue-green deployment and on agent evaluation metrics.
This piece covers the routing, the window, the promotion gate, the expansion schedule, and the abort criteria that make canaries safe rather than theatrical.
What canary release means for agents
The canary pattern is borrowed from web services: route a small percentage of traffic to the new version while keeping the bulk on the known-good version. For an agent, the new version can be a code change, a prompt change, a new model, a new retrieval index, or any combination. The pattern is the same; the metrics shift.
The thing that makes canaries valuable for agents is that LLM behavior is sensitive to inputs in ways evals do not always capture. A new prompt that passes 100 eval examples can still produce systematically worse completions on a particular user segment, on prompts that exceed a certain length, or on the rare tool-call sequence the eval set did not include. The canary surface lets you see those before the change is universal.
How to route 1 to 10 percent
Three viable mechanisms.
- Hash-based. Hash the user id or run id; the lowest N percent of hash space goes to canary. Stable: the same user gets the same version repeatedly. Required for any change a user would notice.
- Random per request. Roll the dice on each request. Simpler; useful when the change is invisible to the user and the run is short.
- Feature-flag. A managed flagging service (LaunchDarkly, Statsig, Unleash) handles routing with targeting rules (LaunchDarkly docs, 2025). Adds a network hop but gives observability and quick toggles.
For agents, hash-based routing on tenant or user id is the safest default. It avoids the "same user gets two different behaviors on consecutive requests" problem, which makes debugging painful and erodes trust.
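A minimal sketch of the hash-based mechanism, assuming a string user or tenant id is available at routing time. The names `canary_bucket`, `route`, `BUCKETS`, and the salt value are illustrative, not taken from any particular router.

```python
import hashlib

BUCKETS = 10_000  # bucket resolution of 0.01 percent (assumed granularity)


def canary_bucket(key: str, salt: str) -> int:
    """Map a stable key (user id, tenant id) to a bucket in [0, BUCKETS)."""
    digest = hashlib.sha256(f"{salt}:{key}".encode("utf-8")).digest()
    # Interpret the first 8 bytes as an unsigned integer, then fold into buckets.
    return int.from_bytes(digest[:8], "big") % BUCKETS


def route(key: str, canary_percent: float, salt: str = "rollout-example") -> str:
    """Send the lowest canary_percent of hash space to canary, the rest to control."""
    threshold = round(canary_percent * BUCKETS / 100)
    return "canary" if canary_bucket(key, salt) < threshold else "control"
```

Because the threshold only grows as the percentage grows, a user who is on canary at 5 percent stays on canary at 25 percent, so expansion does not flip anyone back and forth. Using a different salt per rollout is one way to avoid the same users being first in line for every change.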
Window size and statistical power
The temptation is to run a 5-minute canary, see no errors, and promote. Resist. A 5-minute window catches infrastructure regressions and misses everything subtle. The right window is "long enough to accumulate enough samples for the metrics that matter".
A rough sizing rule: to detect a 5 percent regression in task success rate at 80 percent statistical power, you need on the order of 2,000 samples per arm at typical base rates. A canary that receives 200 runs per hour therefore needs at least 10 hours at that canary percentage before the comparison means anything. Lower-volume agents may need overnight canaries; very-high-volume agents can compress the window.
One pragmatic compromise: run the canary at a low percentage for long enough to detect medium effects (say, a 10 percent regression), then promote to a larger percentage to confirm. The cost of a 1-day canary on a non-urgent change is usually small compared to the cost of a bad promotion.
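The sizing arithmetic can be sketched with the standard normal-approximation formula for comparing two proportions. The assumptions here are mine, not the article's: an 80 percent baseline success rate, the regression read as an absolute drop in percentage points, and a two-sided test at the 5 percent significance level.

```python
import math
from statistics import NormalDist


def samples_per_arm(p_control: float, p_canary: float,
                    alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate samples per arm for a two-sided two-proportion z-test."""
    if p_control == p_canary:
        raise ValueError("the two rates must differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # significance threshold
    z_beta = NormalDist().inv_cdf(power)           # power threshold
    variance = p_control * (1 - p_control) + p_canary * (1 - p_canary)
    delta = p_control - p_canary
    return math.ceil((z_alpha + z_beta) ** 2 * variance / delta ** 2)


def hours_needed(samples: int, canary_runs_per_hour: float) -> float:
    """Hours of canary traffic required to collect the given sample count."""
    return samples / canary_runs_per_hour


n_small = samples_per_arm(0.80, 0.75)   # 5-point drop from an assumed 80% base
n_medium = samples_per_arm(0.80, 0.70)  # 10-point drop from the same base
print(n_small, n_medium, hours_needed(n_small, 200))
```

Under these assumptions the 5-point case comes out at roughly 1,100 samples per arm, and the 10-point case needs only a fraction of that, which is why the low-percentage stage can realistically target medium effects. A base rate nearer 50 percent, or reading the 5 percent as a relative change, pushes the figure higher.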
The promotion gate
Promotion requires every gate metric within an agreed band of control. Six gates that work across most agent platforms.
- Task success rate. The output is correct (eval judge, explicit label, downstream signal). Within 1 to 3 percentage points of control.
- p95 end-to-end latency. Within 10 percent of control. Tail matters more than mean.
- Error rate by class. 5xx, parse failures, tool-call errors. Within 20 percent relative.
- Tool-call success per tool. Per-tool, not aggregate. One tool's failure rate can double and still be invisible in the aggregate.
- Cost per completed run. Within 15 percent. A model swap that improves quality but doubles cost may still be the wrong call.
- Quality proxy. LLM-as-judge or a held-out labeled eval set run on canary completions. Within agreed band.
Bands are negotiated up front. "Better than control" is rarely the bar; "not worse than control by more than X" usually is.
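As a sketch of what "within an agreed band of control" can look like in code, the gates can be expressed as data and checked uniformly. The `Gate` class, the metric names, and the choice of 2 percentage points for task success are illustrative assumptions; the other bands reuse the figures listed above. Per-tool success and the quality proxy would be further `Gate` entries with whatever bands the team agrees on.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Gate:
    name: str
    band: float             # allowed degradation versus control
    relative: bool          # True: fraction of control; False: absolute difference
    higher_is_better: bool

    def passes(self, control: float, canary: float) -> bool:
        # Degradation is positive when the canary is worse than control.
        degradation = (control - canary) if self.higher_is_better else (canary - control)
        if self.relative:
            if control == 0:
                return degradation <= 0
            degradation /= control
        return degradation <= self.band


GATES: List[Gate] = [
    Gate("task_success_rate", band=0.02, relative=False, higher_is_better=True),
    Gate("p95_latency_s", band=0.10, relative=True, higher_is_better=False),
    Gate("error_rate", band=0.20, relative=True, higher_is_better=False),
    Gate("cost_per_run_usd", band=0.15, relative=True, higher_is_better=False),
]


def evaluate(gates: List[Gate], control: Dict[str, float],
             canary: Dict[str, float]) -> Dict[str, bool]:
    """Return a pass/fail verdict per gate; promotion requires all True."""
    return {g.name: g.passes(control[g.name], canary[g.name]) for g in gates}
```

Note that a canary that is better than control always passes: the check is one-sided, matching the "not worse than control by more than X" bar.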
Expansion schedule
The schedule that works for most teams.
- 1 percent for 1 to 24 hours. Smoke test on real traffic. Watch.
- 5 percent for 12 to 24 hours. Increase signal. Confirm the patterns from 1 percent hold.
- 25 percent for 12 to 24 hours. First serious load test on the new bundle.
- 50 percent for 12 to 24 hours. Half and half. Strong statistical comparison; harder to roll back without users noticing.
- 100 percent. Promoted. Old version stays available for quick rollback for 24 to 72 hours.
Each step has its own gate evaluation. Either advance, hold, or roll back. Holding is a legitimate decision; not every change needs to ship in a day.
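The advance, hold, or roll back decision at each step can be sketched as a small state machine. The minimum hours below are the lower end of each range in the schedule; the `Step`, `Decision`, `decide`, and `next_percent` names are hypothetical, and the rule that a failed gate always rolls back is a simplifying assumption.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Decision(Enum):
    ADVANCE = "advance"
    HOLD = "hold"
    ROLL_BACK = "roll_back"


@dataclass(frozen=True)
class Step:
    percent: int
    min_hours: float


SCHEDULE: List[Step] = [Step(1, 1), Step(5, 12), Step(25, 12), Step(50, 12)]


def decide(step: Step, hours_elapsed: float,
           gates_pass: bool, enough_samples: bool) -> Decision:
    """Evaluate one step: a failed gate rolls back, thin evidence holds."""
    if not gates_pass:
        return Decision.ROLL_BACK
    if hours_elapsed < step.min_hours or not enough_samples:
        return Decision.HOLD
    return Decision.ADVANCE


def next_percent(step_index: int, decision: Decision) -> int:
    """Traffic percentage for the canary after applying the decision."""
    if decision is Decision.ROLL_BACK:
        return 0
    if decision is Decision.HOLD:
        return SCHEDULE[step_index].percent
    if step_index + 1 < len(SCHEDULE):
        return SCHEDULE[step_index + 1].percent
    return 100  # past the last staged step: fully promoted
```

Keeping the schedule as data makes it easy to record in the deploy log which step a change reached and why it stopped there.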
Abort criteria
The canary should abort automatically on any of:
- Task success rate drops more than the agreed band relative to control.
- p95 latency exceeds the agreed band relative to control.
- Error rate by any class spikes (e.g., 5xx doubles for more than 10 minutes).
- Cost per completed run exceeds the budget headroom you set.
- Any single customer in the canary sample raises a support ticket about behavior change. (Manual override, but tracked.)
Abort means revert routing. The on-call gets a notification, the deploy record is updated, the change goes back to the planning stage. Aborts are normal; treat them as cheap learning, not as a failure. See agent incident response for how to handle the post-abort follow-up.
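A sustained-spike check such as "5xx doubles for more than 10 minutes" can be sketched as a streak counter over per-minute samples. This assumes one sample per minute and that "doubles" means relative to control in the same minute; both are assumptions, and `sustained_spike` is an illustrative name.

```python
from typing import Iterable, Tuple


def sustained_spike(samples: Iterable[Tuple[float, float]],
                    ratio: float = 2.0, minutes: int = 10) -> bool:
    """True if canary rate >= ratio * control rate for more than `minutes`
    consecutive one-minute samples of (control_rate, canary_rate)."""
    streak = 0
    for control_rate, canary_rate in samples:
        if canary_rate > 0 and canary_rate >= ratio * control_rate:
            streak += 1
            if streak > minutes:
                return True
        else:
            streak = 0  # any healthy minute resets the window
    return False
```

Requiring a consecutive streak keeps a single noisy minute from reverting the rollout, at the price of reacting a few minutes later to a real spike.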
Statistical comparison, not eyeballing
The "the canary looks fine" judgment is the riskiest decision point in the whole pattern. Three habits that replace it with something falsifiable.
Pre-register the metrics and bands. Before the canary starts, write down which metrics are gates, what the agreed bands are, and what the decision is at each step. Decisions made after seeing the data are easier to rationalize.
Use proportions, not means, where applicable. Success rate is a proportion; comparing canary and control needs a proportion test (Fisher's exact, chi-square). For latencies, compare percentiles, not averages.
Accept that some canaries are inconclusive. A 3-hour window may not produce enough samples to detect a 5 percent regression. Either extend or accept the limit and document the residual risk.
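For the proportion comparison, a chi-square test on the 2x2 table of successes and failures needs only the standard library. This sketch omits the continuity correction and assumes counts large enough for the approximation to hold; for small samples Fisher's exact test (for example `scipy.stats.fisher_exact`) is the safer choice. The function name is illustrative.

```python
import math
from typing import Tuple


def chi_square_2x2(success_a: int, total_a: int,
                   success_b: int, total_b: int) -> Tuple[float, float]:
    """Chi-square test (1 degree of freedom) for equality of two success rates.

    Returns (statistic, p_value).
    """
    fail_a = total_a - success_a
    fail_b = total_b - success_b
    all_success = success_a + success_b
    all_fail = fail_a + fail_b
    if 0 in (total_a, total_b, all_success, all_fail):
        raise ValueError("every row and column of the table must be non-empty")
    n = total_a + total_b
    statistic = (n * (success_a * fail_b - success_b * fail_a) ** 2
                 / (total_a * total_b * all_success * all_fail))
    # For 1 degree of freedom the upper tail has the closed form erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value
```

A large p-value on a short window is not evidence that the canary is safe; it may only mean the window was too small, which is the inconclusive case described above.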
Common canary pitfalls
Four patterns that show up in agent canary postmortems.
Cohort skew. The hash function bins certain users into canary at higher rates than others. A heavy enterprise customer lands entirely on canary; metrics swing on one customer's traffic. Use a properly distributed hash on a stable key.
Cache pollution. The canary writes to a shared cache that the control reads from. Canary content reaches control users. Either tag cache entries with bundle id or partition the cache.
Comparing on aggregate when you should compare on segment. Aggregate numbers hide segment-level regressions. Break out by tenant size, by capability, by region; the regression often hides in one segment.
Forgetting to clean up the routing config. A "temporary" 5 percent canary lingers in the config for months. Give every routing rule an expiry date, and review and delete stale rules on a schedule.
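The aggregate-versus-segment pitfall is easy to see with a small breakdown. This sketch assumes each run is logged as a `(segment, arm, succeeded)` tuple, where the arm is `"canary"` or `"control"`; the record shape and function names are illustrative.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Run = Tuple[str, str, bool]  # (segment, arm, succeeded)


def success_rates(runs: Iterable[Run]) -> Dict[Tuple[str, str], float]:
    """Success rate per (segment, arm) pair."""
    counts = defaultdict(lambda: [0, 0])  # [successes, total]
    for segment, arm, succeeded in runs:
        entry = counts[(segment, arm)]
        entry[0] += int(succeeded)
        entry[1] += 1
    return {key: ok / total for key, (ok, total) in counts.items()}


def segment_deltas(runs: Iterable[Run]) -> Dict[str, float]:
    """Canary minus control success rate, per segment present in both arms."""
    rates = success_rates(runs)
    segments = {segment for segment, _ in rates}
    return {
        segment: rates[(segment, "canary")] - rates[(segment, "control")]
        for segment in segments
        if (segment, "canary") in rates and (segment, "control") in rates
    }
```

With 100 runs per arm from small tenants and 10 per arm from one enterprise tenant, the enterprise segment can lose 40 points of success rate while the aggregate moves by under 2, which would slip through an aggregate-only gate.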
FAQ
- What is a canary release for an AI agent?
- A small percentage of production traffic is routed to the new prompt, model, or bundle. Metrics are compared against the control population. If metrics meet thresholds, the canary expands; if they regress, it is rolled back.
- What percentage should I start at?
- 1 percent for high-volume agents. 5 to 10 percent for lower-volume agents where 1 percent is too small for statistical significance.
- How long does a canary window run?
- Long enough to accumulate enough samples for statistical comparison: typically 1 to 24 hours depending on traffic.
- What metrics matter for promotion?
- Task success rate, p95 latency, error rate, tool-call success, cost per completed run, and a quality proxy. All within an agreed band of control.
- How is this different from a feature flag rollout?
- A feature flag is on or off per user. A canary is a controlled percentage routing with a comparison harness measuring metrics. Feature flags can be a canary mechanism; the comparison and auto-promote logic distinguish a canary.
- What if the regression appears only at 50 percent?
- Watch metrics at each step, not just at the start. Auto-pause if any metric drifts outside the band at any expansion. Some regressions only show under load, which is why the expansion is staged.
Sources
- Martin Fowler, "Canary release", martinfowler.com
- Google, "Site Reliability Engineering: Release Engineering", sre.google
- LaunchDarkly, "Feature flags documentation", 2025, docs.launchdarkly.com
- Anthropic, "Develop tests and evaluations", 2025, docs.anthropic.com
- OpenTelemetry, "GenAI semantic conventions", 2025, opentelemetry.io
