Capacity planning for agent platforms looks like web-app capacity planning, with two big differences. The throughput unit is tokens per minute, not requests per second. And the cost curve is steeper, because each request is variable-sized and meaningfully more expensive than a typical HTTP handler. Get the sizing wrong and you either rate-limit your users or pay for headroom you do not use. This is a companion to load testing, rate limiting, and cost optimization.

This piece walks through the formula, the workload profile, the headroom rules, and what to watch in production. The numbers are taken from the published rate-limit tables of the major providers as of mid-2026.

The units that actually constrain capacity

Four units matter on agent platforms, listed here in order of how often they are the binding constraint.

  1. Tokens per minute (TPM). The dominant constraint. Every major provider rate-limits by token throughput. OpenAI's Tier 5 caps at 30 million TPM for GPT-4 class models; Anthropic's higher tiers reach similar magnitudes; Bedrock provides per-region quotas that vary by model (OpenAI rate limits, 2025, Anthropic rate limits, 2025).
  2. Requests per minute (RPM). The secondary constraint. Often hit before TPM for chatty agents that make many small calls (tool use, multi-step orchestration).
  3. Concurrent requests. Some providers cap simultaneous open connections. For streaming workloads this is usually the constraint that binds first.
  4. Compute (your own). Your orchestrator's CPU and memory. Usually a distant fourth because the model call is where the latency lives.

A capacity formula

The simplest formula that survives contact with production.

concurrent_runs = TPM_limit / (avg_input_tokens + avg_output_tokens) * avg_run_seconds / 60

Break it down. The TPM limit is the provider's, per model, per region. Divide by the tokens per run to get runs per minute. Multiply by the run's duration in minutes to get concurrent runs (Little's law: runs in flight equal throughput times time in flight).

A worked example. Provider TPM 2,000,000. Average input tokens per run 4,000 (includes system prompt, retrieved context, conversation history). Average output 500. Average run duration 8 seconds.

Concurrent runs = 2,000,000 / 4,500 × (8/60) = 444 × 0.133 = ~59. That is the ceiling: about 444 runs per minute, with roughly 59 in flight at any moment. In practice, plan for 60 percent of the ceiling as steady-state, leaving headroom for bursts and retries. So ~36 concurrent runs, or about 267 runs per minute.
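
The sizing arithmetic can be sketched in a few lines of Python. This is a minimal sketch, not a reference implementation: the function name capacity_ceiling and its parameters are illustrative, and it converts runs per minute into runs in flight using Little's law (runs in flight = throughput × average run duration).

```python
def capacity_ceiling(tpm_limit, avg_input_tokens, avg_output_tokens, avg_run_seconds):
    """Return (runs_per_minute, runs_in_flight) at the provider's TPM ceiling."""
    tokens_per_run = avg_input_tokens + avg_output_tokens
    runs_per_minute = tpm_limit / tokens_per_run
    # Little's law: runs in flight = throughput x average time in flight.
    runs_in_flight = runs_per_minute * (avg_run_seconds / 60)
    return runs_per_minute, runs_in_flight


# Inputs from the worked example: 2M TPM, 4,000 in, 500 out, 8-second runs.
runs_per_minute, runs_in_flight = capacity_ceiling(2_000_000, 4_000, 500, 8)
steady_state = 0.60 * runs_in_flight  # plan for 60 percent of the ceiling
print(f"{runs_per_minute:.0f} runs/min at the ceiling, "
      f"{runs_in_flight:.0f} in flight, {steady_state:.0f} at steady state")
```

Keeping runs per minute and runs in flight as separate outputs makes it harder to confuse the two: the first is what the TPM limit bounds directly, the second is what your concurrency slots need to hold.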

Profiling your workload

The formula above only works if the inputs are real. Most teams overestimate output tokens (because LLM responses feel longer than they are) and underestimate input tokens (because retrieval context and conversation history quietly inflate). Three things to measure.

Tokens per run, p50 and p99. Plan for p99 of input tokens when sizing input-heavy workloads (RAG-heavy agents, long-context summarizers). Use p50 for output tokens; the long tail is small enough not to matter at the aggregate.

Run duration distribution. Measure latency at p50, p95, and p99. The number of concurrent runs in flight depends on run duration as much as on request rate. A 30-second p99 dominates capacity if you do not account for it.

Tool-call frequency. A reasoning agent that calls 5 tools per run produces 5 model invocations per run. RPM consumption is 5x the human-facing request count. This is the most common surprise.
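
The three measurements can be pulled from run logs with the standard library. The sketch below assumes each run is recorded as a dict with hypothetical keys (input_tokens, output_tokens, seconds, model_calls); adapt the field names to whatever your tracing layer actually emits.

```python
import statistics


def percentile(values, pct):
    """Return the pct-th percentile (1 to 99) with inclusive interpolation."""
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    return cuts[pct - 1]


def profile(runs, user_requests_per_minute):
    """Summarize a sample of runs into the inputs the capacity formula needs."""
    input_tokens = [r["input_tokens"] for r in runs]
    output_tokens = [r["output_tokens"] for r in runs]
    durations = [r["seconds"] for r in runs]
    calls_per_run = [r["model_calls"] for r in runs]
    return {
        "input_p99": percentile(input_tokens, 99),
        "output_p50": percentile(output_tokens, 50),
        "duration_p99": percentile(durations, 99),
        # Each tool call is another model invocation, so RPM scales with it.
        "model_rpm": user_requests_per_minute * statistics.mean(calls_per_run),
    }


# Illustrative sample data, not measurements.
sample = [
    {"input_tokens": 1000 * i, "output_tokens": 500, "seconds": 8, "model_calls": 5}
    for i in range(1, 6)
]
summary = profile(sample, user_requests_per_minute=100)
print(summary)
```

With 5 model calls per run, 100 human-facing requests per minute become 500 model requests per minute, which is the number to compare against the provider's RPM limit.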

Headroom and retry budget

Headroom protects you from three sources of demand variance.

A 30 percent headroom (run at 70 percent of capacity at peak) handles most bursts. A 50 percent headroom is appropriate for new platforms with unstable error rates or unknown traffic profiles.
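
As a rough illustration of the headroom arithmetic, the sketch below converts between a provider limit and the traffic you can plan for. The function names are illustrative, and headroom is expressed as a fraction of the limit left unused at peak.

```python
def usable_capacity(tpm_limit, headroom):
    """TPM you can plan to consume at peak while leaving `headroom` unused."""
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be a fraction in [0, 1)")
    return tpm_limit * (1 - headroom)


def required_limit(peak_tpm, headroom):
    """TPM limit needed so that `peak_tpm` still leaves `headroom` unused."""
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be a fraction in [0, 1)")
    return peak_tpm / (1 - headroom)


print(usable_capacity(2_000_000, 0.30))  # peak budget on a 2M TPM tier
print(required_limit(1_400_000, 0.50))   # limit a new platform would need
```

The second call shows why the 50 percent rule is expensive: the same projected peak needs a noticeably larger tier than it would at 30 percent headroom.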

The Google SRE book defines a related concept, the error budget, which makes the trade-off explicit: you operate as if some percentage of capacity is reserved for variance, and you spend it deliberately (Google SRE Book, Chapter 3, 2016).

Handling bursts

Three controls for bursts above capacity.

  1. Bounded queue. Incoming runs queue if all concurrency slots are used. Bound the queue depth (e.g., 30 seconds of work); reject beyond it with a "try again" response. An unbounded queue is a memory leak in disguise.
  2. Exponential backoff with jitter on retries. If a model call hits a 429, retry with exponential delay plus random jitter. Without jitter, retries synchronize and spike the next-minute traffic.
  3. Fallback model with separate quota. A second provider with its own TPM quota absorbs spillover. The cost is quality drift; mark these runs as degraded in the trace.
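
The first two controls can be sketched with the standard library. This is a minimal sketch under stated assumptions: BoundedRunQueue and backoff_delay are illustrative names, the base and cap values are placeholders, and the jitter shown is the "full jitter" variant (a delay drawn uniformly between zero and the exponential ceiling) rather than a fixed delay plus a small random offset.

```python
import queue
import random


def backoff_delay(attempt, base=0.5, cap=30.0, rng=random):
    """Full-jitter delay in seconds for a 0-indexed retry attempt."""
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0, ceiling)


class BoundedRunQueue:
    """Admit runs until the queue is full, then reject instead of growing."""

    def __init__(self, max_depth):
        self._q = queue.Queue(maxsize=max_depth)

    def submit(self, run):
        try:
            self._q.put_nowait(run)
            return True
        except queue.Full:
            return False  # caller should answer with a "try again" response

    def next_run(self):
        return self._q.get_nowait()


q = BoundedRunQueue(max_depth=2)
print([q.submit(name) for name in ("a", "b", "c")])  # third run is rejected
print([round(backoff_delay(n), 2) for n in range(4)])  # random, growing ceilings
```

To bound the queue at roughly 30 seconds of work, one option is to set max_depth to the steady-state runs per second multiplied by 30.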

Multi-model capacity

The fastest way to add capacity is to add a second model provider. The quotas are independent; a 2 million TPM tier on provider A plus a 1 million TPM tier on provider B gives you 3 million TPM combined.

The routing rule depends on what you optimize for.

The price of multi-model is engineering effort and prompt portability. Prompts that depend on a specific model's tool-call format need translation in the routing layer; prompts that depend on a model's tone or specific behaviors need re-eval against the fallback. Vendor evaluation covers the broader trade-offs.
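
One possible routing rule is simple spillover: use the primary provider until its per-minute token budget is spent, then send runs to the fallback and mark them as degraded. The sketch below assumes a fixed one-minute window and hypothetical provider names (provider_a, provider_b); real provider limiters may count tokens differently, so treat the budgets as approximations.

```python
class SpilloverRouter:
    """Route runs to the first provider with token budget left this minute."""

    def __init__(self, budgets):
        # budgets: ordered mapping of provider name -> TPM budget; first is primary.
        self.budgets = dict(budgets)
        self.used = {name: 0 for name in self.budgets}
        self.window = None

    def route(self, estimated_tokens, now_seconds):
        window = int(now_seconds // 60)
        if window != self.window:  # new minute: reset the fixed-window counters
            self.window = window
            self.used = {name: 0 for name in self.budgets}
        primary = next(iter(self.budgets))
        for name, budget in self.budgets.items():
            if self.used[name] + estimated_tokens <= budget:
                self.used[name] += estimated_tokens
                return {"provider": name, "degraded": name != primary}
        return None  # every quota is exhausted: queue or reject the run


router = SpilloverRouter({"provider_a": 2_000_000, "provider_b": 1_000_000})
print(router.route(1_500_000, now_seconds=0))   # fits on the primary
print(router.route(600_000, now_seconds=10))    # spills to the fallback
print(router.route(600_000, now_seconds=20))    # nothing left this minute
```

The degraded flag is what the trace should carry, so that quality drift on the fallback can be separated from the primary's results during evaluation.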

Provider quota planning

Quota increases on every major provider go through a request process. Lead times vary.

The planning rule: when projected peak hits 70 percent of the current tier limit, file the upgrade. When it hits 85 percent, follow up. By 90 percent you should already be on the new tier.
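
The planning rule translates directly into a threshold check that can run against a traffic forecast. The function name and the returned strings below are illustrative; the 70, 85, and 90 percent thresholds are the ones stated above.

```python
def quota_action(projected_peak_tpm, tier_limit_tpm):
    """Map projected peak utilization of the current tier to a planning action."""
    utilization = projected_peak_tpm / tier_limit_tpm
    if utilization >= 0.90:
        return "escalate: should already be on the new tier"
    if utilization >= 0.85:
        return "follow up on the upgrade request"
    if utilization >= 0.70:
        return "file the upgrade request"
    return "no action"


for peak in (1_000_000, 1_500_000, 1_750_000, 1_900_000):
    print(peak, quota_action(peak, tier_limit_tpm=2_000_000))
```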

What to watch in production

Five metrics that catch capacity issues before users do.

FAQ

How do you size capacity for an AI agent workload?
Multiply expected runs per minute by tokens per run and by a peak-traffic factor. Compare the result against the provider's TPM and RPM limits. Headroom of 30 to 50 percent above peak handles bursts and retries.
What is TPM and why does it matter?
TPM is tokens-per-minute, the rate limit most providers use. It bounds throughput more than dollars do. A 2 million TPM tier caps how many concurrent runs can be in flight at once.
How do you handle traffic bursts above your provisioned capacity?
Bounded queue, retry with exponential backoff plus jitter, fallback model with separate quotas. The combination smooths bursts up to about 2x normal traffic.
How many agent runs can a single TPM tier support?
A 2 million TPM tier supports roughly 1,000 concurrent runs at 2,000 tokens per minute each. The math: TPM divided by per-run TPM. Adjust for input-heavy workloads.
Do retries count against the TPM limit?
Yes on most providers. Build retry budget into the headroom calculation. A 20 percent retry rate effectively reduces useful capacity by 20 percent.
When should I request a higher rate-limit tier?
When projected peak will exceed 70 percent of the current tier within the next quarter. Provider quota increases typically take 1 to 5 business days.

Sources