Capacity planning for agent platforms looks like web-app capacity planning but with two big differences. The throughput unit is tokens-per-minute, not requests-per-second. The cost curve is steeper because each request is variable-sized and meaningfully more expensive than a typical HTTP handler. Get the sizing wrong and you either rate-limit your users or pay for headroom you do not use. This piece is a companion to the ones on load testing, rate limiting, and cost optimization.
This piece walks through the formula, the workload profile, the headroom rules, and what to watch in production. The numbers are taken from the published rate-limit tables of the major providers as of mid-2026.
The units that actually constrain capacity
Four units matter on agent platforms, listed here in order of how often they are the binding constraint.
- Tokens per minute (TPM). The dominant constraint. Every major provider rate-limits by token throughput. OpenAI's Tier 5 caps at 30 million TPM for GPT-4 class models; Anthropic's higher tiers reach similar magnitudes; Bedrock provides per-region quotas that vary by model (OpenAI rate limits, 2025; Anthropic rate limits, 2025).
- Requests per minute (RPM). The secondary constraint. Often hit before TPM for chatty agents that make many small calls (tool use, multi-step orchestration).
- Concurrent requests. Some providers cap simultaneous open connections. For streaming workloads this is often the binding constraint.
- Compute (your own). Your orchestrator's CPU and memory. Usually a distant fourth because the model call is where the latency lives.
A capacity formula
The simplest formula that survives contact with production.
concurrent_runs = TPM_limit / (avg_input_tokens + avg_output_tokens) * avg_run_seconds / 60
Break it down. The TPM limit is the provider's, per model, per region. Divide by the tokens per run to get runs per minute. Multiply by the run's duration in minutes to get concurrent runs: by Little's law, the number of runs in flight equals the rate at which runs start times how long each one lasts.
A worked example. Provider TPM 2,000,000. Average input tokens per run 4,000 (includes system prompt, retrieved context, conversation history). Average output 500. Average run duration 8 seconds.
Concurrent runs = 2,000,000 / 4,500 × (8/60) = 444 × 0.133 = ~59. That is the ceiling: about 444 run starts per minute, with about 59 in flight at any moment. In practice, plan for 60 percent of the ceiling as steady-state, leaving headroom for bursts and retries. So ~35 concurrent runs, or roughly 267 run starts per minute.
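The arithmetic can be sketched in a few lines of Python. This is a minimal sketch under the assumption that runs arrive at a steady rate, so Little's law applies (runs in flight = runs started per minute × run duration in minutes). The function names are illustrative, not from any library.

```python
def runs_per_minute(tpm_limit: float, tokens_per_run: float) -> float:
    """Run starts per minute the TPM limit can sustain."""
    return tpm_limit / tokens_per_run


def max_concurrent_runs(tpm_limit: float, tokens_per_run: float,
                        run_seconds: float) -> float:
    """Ceiling on runs in flight before the TPM limit binds (Little's law)."""
    return runs_per_minute(tpm_limit, tokens_per_run) * (run_seconds / 60.0)


def steady_state_target(ceiling: float, utilization: float = 0.60) -> int:
    """Planned steady-state concurrency, rounded down."""
    return int(ceiling * utilization)


# Worked-example inputs: 2,000,000 TPM, 4,000 + 500 tokens per run, 8 s per run.
ceiling = max_concurrent_runs(2_000_000, 4_000 + 500, 8)
print(f"run starts per minute: {runs_per_minute(2_000_000, 4_500):.0f}")
print(f"ceiling on runs in flight: {ceiling:.1f}")
print(f"steady-state target: {steady_state_target(ceiling)}")
```

A quick sanity check on any result: concurrent runs × tokens per run ÷ run duration in minutes should land back at or below the TPM limit.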
Profiling your workload
The formula above only works if the inputs are real. Most teams overestimate output tokens (because LLM responses feel longer than they are) and underestimate input tokens (because retrieval context and conversation history quietly inflate). Three things to measure.
Tokens per run, p50 and p99. Plan for p99 of input tokens when sizing input-heavy workloads (RAG-heavy agents, long-context summarizers). Use p50 for output tokens; the long tail is small enough not to matter at the aggregate.
Run duration distribution. Latency p50, p95, p99. The agent platform "concurrent runs in flight" number is arrival rate times duration, not RPS alone. A 30-second p99 dominates capacity if you do not account for it.
Tool-call frequency. A reasoning agent that calls 5 tools per run produces at least 5 model invocations per run. RPM consumption is at least 5x the human-facing request count. This is the most common surprise.
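A minimal sketch of pulling those three measurements out of trace data. It assumes each run is logged as a dict with the hypothetical keys input_tokens, output_tokens, seconds, and model_calls; adapt the field names to whatever your tracing emits. The percentile helper uses the nearest-rank method, which is one of several reasonable definitions.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty sample list; pct in (0, 100]."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def profile_runs(runs):
    """Summarize the sizing inputs from a list of per-run records."""
    durations = [r["seconds"] for r in runs]
    return {
        # p99 for input tokens, p50 for output tokens, as suggested above.
        "input_tokens_p99": percentile([r["input_tokens"] for r in runs], 99),
        "output_tokens_p50": percentile([r["output_tokens"] for r in runs], 50),
        "duration_p50": percentile(durations, 50),
        "duration_p95": percentile(durations, 95),
        "duration_p99": percentile(durations, 99),
        # RPM amplification: model invocations per human-facing request.
        "model_calls_per_run": sum(r["model_calls"] for r in runs) / len(runs),
    }
```

Multiply the human-facing requests per minute by model_calls_per_run to get the RPM figure to compare against the provider limit.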
Headroom and retry budget
Headroom protects you from three sources of demand variance.
- Normal traffic bursts. Most platforms see 2x peak-to-mean traffic on a daily cycle. Some see 5x.
- Retry traffic. A 5 percent error rate with 1 retry per error means actual traffic is 5 percent higher than user-visible traffic. Higher error rates compound.
- Multi-step amplification. A 7-step agent run that retries one failed step retries the step, not the whole run. The retry budget is per-step, not per-run.
A 30 percent headroom (run at 70 percent of capacity at peak) handles most bursts. A 50 percent headroom is appropriate for new platforms with unstable error rates or unknown traffic profiles.
The Google SRE book defines a related concept, the error budget, which makes the trade-off explicit: you operate as if some percentage of capacity is reserved for variance, and you spend it deliberately (Google SRE Book, Chapter 3, 2016).
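One way to turn the headroom rules into a number to provision. This is a sketch, not a standard formula: it assumes the burst factor and the retry overhead multiply, and that headroom is expressed as the fraction of the limit left unused at peak. The function and parameter names are made up for illustration.

```python
def required_tpm(mean_tpm: float,
                 peak_to_mean: float = 2.0,
                 error_rate: float = 0.05,
                 retries_per_error: float = 1.0,
                 headroom: float = 0.30) -> float:
    """TPM limit needed so that peak traffic plus retries sits at
    (1 - headroom) of the limit."""
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in [0, 1)")
    peak = mean_tpm * peak_to_mean
    # Each failed call is retried, so retries add error_rate * retries traffic.
    with_retries = peak * (1 + error_rate * retries_per_error)
    return with_retries / (1 - headroom)


# 500k mean TPM, 2x daily peak, 5% errors with one retry, 30% headroom.
print(f"{required_tpm(500_000):,.0f} TPM")
```

With those defaults the mean of 500,000 TPM becomes a provisioning target of about 1.5 million TPM. For a new platform, rerun it with headroom=0.50.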
Handling bursts
Three controls for bursts above capacity.
- Bounded queue. Incoming runs queue if all concurrency slots are used. Bound the queue depth (e.g., 30 seconds of work); reject beyond it with a "try again" response. An unbounded queue is a memory leak in disguise.
- Exponential backoff with jitter on retries. If a model call hits a 429, retry with exponential delay plus random jitter. Without jitter, retries synchronize and spike the next-minute traffic.
- Fallback model with separate quota. A second provider with its own TPM quota absorbs spillover. The cost is quality drift; mark these runs as degraded in the trace.
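The first two controls can be sketched as follows. Assumptions: RateLimited is a hypothetical exception your client wrapper raises on a 429, the queue bound is a plain item count rather than "seconds of work", and the backoff uses the "full jitter" variant (delay drawn uniformly between zero and the exponential cap), which is one of several ways to add jitter.

```python
import random
import time
from collections import deque


class RateLimited(Exception):
    """Hypothetical: raised by the caller's client wrapper on HTTP 429."""


def backoff_delay(attempt, base=0.5, cap=30.0, rng=random.random):
    """Full-jitter delay in seconds: uniform in [0, min(cap, base * 2**attempt))."""
    return rng() * min(cap, base * (2 ** attempt))


def call_with_retries(call, max_attempts=5, sleep=time.sleep):
    """Invoke call(); on RateLimited, wait with jittered backoff and retry."""
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # retry budget exhausted; surface the 429
            sleep(backoff_delay(attempt))


class BoundedQueue:
    """FIFO queue that rejects new work once max_depth items are waiting."""

    def __init__(self, max_depth):
        self.max_depth = max_depth
        self._items = deque()

    def offer(self, item):
        """Return True if queued, False if the caller should answer 'try again'."""
        if len(self._items) >= self.max_depth:
            return False
        self._items.append(item)
        return True

    def take(self):
        """Pop the oldest waiting item, or None if the queue is empty."""
        return self._items.popleft() if self._items else None
```

To approximate "30 seconds of work", set max_depth to the sustainable run starts per second times 30.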
Multi-model capacity
The fastest way to add capacity is to add a second model provider. The quotas are independent; a 2 million TPM tier on provider A plus a 1 million TPM tier on provider B gives you 3 million TPM combined.
The routing rule depends on what you optimize for.
- Cost-optimized: Default to the cheaper provider. Spill to the more expensive one when the cheap one rate-limits.
- Quality-optimized: Default to the preferred model. Fall back to the second only on failure.
- Latency-optimized: Default to the lower-latency region or provider. Fall back when the primary rate-limits or its latency degrades.
The price of multi-model is engineering effort and prompt portability. Prompts that depend on a specific model's tool-call format need translation in the routing layer; prompts that depend on a model's tone or specific behaviors need re-eval against the fallback. Vendor evaluation covers the broader trade-offs.
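A minimal sketch of the cost-optimized rule: try providers from cheapest to most expensive and spill when the cheaper one has no TPM left. Everything here is hypothetical (the Provider class, the prices, the in-memory counter). A real router would reset used_tpm every minute, react to actual 429 responses, and handle the prompt translation described above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Provider:
    name: str
    tpm_limit: int
    cost_per_1k_tokens: float  # illustrative price, not a real quote
    used_tpm: int = 0          # tokens consumed in the current minute

    def can_take(self, tokens: int) -> bool:
        return self.used_tpm + tokens <= self.tpm_limit


def route_cost_optimized(providers: List[Provider], tokens: int) -> Optional[Provider]:
    """Pick the cheapest provider with quota left; None means queue or reject."""
    for provider in sorted(providers, key=lambda p: p.cost_per_1k_tokens):
        if provider.can_take(tokens):
            provider.used_tpm += tokens
            return provider
    return None


providers = [
    Provider("provider_a", tpm_limit=2_000_000, cost_per_1k_tokens=1.0),
    Provider("provider_b", tpm_limit=1_000_000, cost_per_1k_tokens=3.0),
]
chosen = route_cost_optimized(providers, tokens=4_500)
print(chosen.name if chosen else "no capacity")
```

The quality-optimized rule is the same loop with a fixed preference order instead of a price sort.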
Provider quota planning
Quota increases on every major provider go through a request process. Lead times vary.
- OpenAI tier progression: Automatic based on spend thresholds. The highest tiers may require sales engagement and take a week to confirm.
- Anthropic: Quota increases via the customer console; standard requests typically resolve within a business week.
- AWS Bedrock: Per-region quotas in Service Quotas. Increase requests can take 1 to 5 business days, longer for the largest tiers.
- Google Vertex AI: Per-region quotas in IAM & Admin. Increases via the quota request UI; lead time 1 to 7 business days.
The planning rule: when projected peak hits 70 percent of the current tier limit, file the upgrade. When it hits 85 percent, follow up. By 90 percent you should already be on the new tier.
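The planning rule is simple enough to encode as a check in a weekly capacity report. A sketch, with a made-up function name and action labels:

```python
def quota_action(projected_peak_tpm: float, tier_limit_tpm: float) -> str:
    """Map projected peak utilization of the current tier to a planning action."""
    utilization = projected_peak_tpm / tier_limit_tpm
    if utilization >= 0.90:
        return "escalate: should already be on the new tier"
    if utilization >= 0.85:
        return "follow up on the upgrade request"
    if utilization >= 0.70:
        return "file the upgrade request"
    return "no action"


print(quota_action(1_500_000, 2_000_000))  # 75 percent of the tier
```

Feed it the projected peak, not the current one, so the provider's lead time is already accounted for.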
What to watch in production
Five metrics that catch capacity issues before users do.
- TPM utilization. Actual TPM / TPM limit, per provider, per model. Alert at 70, 85, 95 percent.
- 429 rate. Percentage of model calls that hit a rate limit. Alert above 1 percent.
- Queue depth. If you have a queue, the depth and the wait time at p99. Alert above your design SLO.
- Retry rate. Percentage of calls that retry at least once. Sustained above 5 percent indicates a real degradation.
- p99 run duration. Increases here precede capacity exhaustion; the queue is forming before the rate-limiter sees it.
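The threshold checks for the first, second, and fourth metrics can be sketched as below. The function name, the alert labels, and the idea of evaluating one window of counters at a time are all assumptions. In particular, the retry-rate rule above says "sustained", so a real alert would require several consecutive windows over the line.

```python
def capacity_alerts(tpm_used, tpm_limit, calls, rate_limited, retried):
    """Return alert labels for one measurement window of model-call counters."""
    alerts = []

    # TPM utilization: report only the highest threshold crossed.
    utilization_pct = 100 * tpm_used / tpm_limit
    for threshold in (95, 85, 70):
        if utilization_pct >= threshold:
            alerts.append(f"tpm_utilization>={threshold}%")
            break

    if calls:
        if rate_limited / calls > 0.01:
            alerts.append("429_rate>1%")
        if retried / calls > 0.05:
            alerts.append("retry_rate>5%")
    return alerts


print(capacity_alerts(tpm_used=1_800_000, tpm_limit=2_000_000,
                      calls=1000, rate_limited=20, retried=30))
```

Run it per provider and per model, since each has its own limit.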
FAQ
- How do you size capacity for an AI agent workload?
- Multiply expected runs per minute by tokens per run and by a peak-traffic factor. Compare the result against the provider's TPM and RPM limits. Headroom of 30 to 50 percent above peak handles bursts and retries.
- What is TPM and why does it matter?
- TPM is tokens-per-minute, the rate limit most providers use. It bounds throughput more than dollars do. A 2 million TPM tier caps how many concurrent runs can be in flight at once.
- How do you handle traffic bursts above your provisioned capacity?
- Bounded queue, retry with exponential backoff plus jitter, fallback model with separate quotas. The combination smooths bursts up to about 2x normal traffic.
- How many agent runs can a single TPM tier support?
- A 2 million TPM tier supports roughly 1,000 concurrent runs at 2,000 tokens per minute each. The math: TPM divided by per-run TPM. Adjust for input-heavy workloads.
- Do retries count against the TPM limit?
- Yes on most providers. Build retry budget into the headroom calculation. A 20 percent retry rate means 1.2x the traffic, which reduces useful capacity by roughly 17 percent.
- When should I request a higher rate-limit tier?
- When projected peak will exceed 70 percent of the current tier within the next quarter. Provider quota increases typically take 1 to 7 business days, longer for the largest tiers.
Sources
- OpenAI, "Rate limits", 2025, platform.openai.com
- Anthropic, "Rate limits", 2025, docs.anthropic.com
- AWS, "Amazon Bedrock quotas", 2025, docs.aws.amazon.com
- Google Cloud, "Vertex AI generative AI quotas", 2025, cloud.google.com
- Google SRE, "Embracing Risk", SRE Book, 2016, sre.google
