Most AI agent platforms first discover their real ceiling during a launch. The dashboard says everything is fine, the model provider's rate limiter starts throwing 429s, the retry loop multiplies the rate of incoming work, and the queue runs hot for an hour. The fix is unglamorous: a load test before the launch tells you what you would otherwise have learned during it. This piece is a companion to agent uptime and reliability and agent rate limiting.

This piece is the practical playbook: how to shape representative agent traffic, what to measure, how to find the rate-limit cliff without melting the budget, and how to isolate model-provider variance from your own service's bugs.

Why agent load testing is its own discipline

An agent request is not a single HTTP call. It is a fan-out of model calls, tool calls, retrieval queries, and (sometimes) a wait on a human in the loop. The end-to-end latency distribution is the convolution of the latency distributions of all those steps, each with its own variance. Three failure modes appear at scale that do not show in single-request testing.

  1. Provider rate-limit interaction with retries. A 429 triggers a retry, which counts against the same limit, which fails, which retries. Without jittered exponential backoff, the retry storm extends the outage.
  2. Tail latency from long completions. The p99 prompt is much longer than the p50. Streaming hides this in the user experience and exposes it in tool-call orchestration.
  3. State contention. Vector store writes, memory updates, run-state persistence. Single-request testing rarely hits the row, the index, or the document being contended for.
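
To make the first failure mode concrete, here is a minimal sketch of jittered exponential backoff using the "full jitter" variant, where each delay is drawn uniformly between zero and an exponentially growing cap. The RateLimited exception, the base and cap values, and the injectable sleep and rng parameters are assumptions made for illustration; a real client would raise its own error type on a 429.

```python
import random
import time


class RateLimited(Exception):
    """Hypothetical error raised by the caller's client on an HTTP 429."""

    def __init__(self, retry_after=None):
        super().__init__("rate limited")
        self.retry_after = retry_after  # seconds, if the server supplied a hint


def backoff_delay(attempt, base=0.5, cap=30.0, rng=random):
    """Full jitter: uniform in [0, min(cap, base * 2**attempt)]."""
    return rng.uniform(0.0, min(cap, base * (2 ** attempt)))


def call_with_backoff(fn, max_attempts=5, sleep=time.sleep, rng=random):
    """Call fn(), retrying on RateLimited with jittered exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimited as err:
            if attempt == max_attempts - 1:
                raise  # retry budget exhausted; surface the 429 to the caller
            delay = backoff_delay(attempt, rng=rng)
            if err.retry_after is not None:
                # Never retry earlier than the server's hint.
                delay = max(delay, err.retry_after)
            sleep(delay)
```

The bounded max_attempts matters as much as the jitter: an unbounded retry loop is what turns a brief 429 burst into the retry storm described above.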

Shaping realistic traffic

Synthetic prompts are useful for smoke tests; they do not reveal real bottlenecks. Three sources for realistic traffic.

  1. Sampled production traces. Replay anonymized prompts from production at the desired QPS. Captures the long-tail prompt lengths, the tool-call sequences that the model actually picks, and the rare error paths.
  2. Synthetic mixes that match production distributions. If you cannot replay raw traffic for privacy reasons, generate prompts that match the production distribution of length, tool-set, and difficulty.
  3. Worst-case scenarios as a separate test. Maximum context, maximum tool-call depth, maximum retry. Run sparingly; these are the most expensive runs.
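
For the second source, one simple way to match production distributions is to resample observed (prompt length, tool set) pairs with replacement, which keeps the long tail and the correlation between length and tool choice. The sketch below assumes you can export those pairs from your trace store; sample_scenarios and make_prompt are hypothetical helper names, and whitespace-separated words are only a rough stand-in for model tokens.

```python
import random


def sample_scenarios(observed, n, rng=random):
    """Draw n scenarios by resampling observed (prompt_tokens, tool_set) pairs.

    Resampling with replacement preserves the empirical distribution,
    including rare long prompts that median-centred synthetic data misses.
    """
    if not observed:
        raise ValueError("need at least one observed scenario")
    return [rng.choice(observed) for _ in range(n)]


def make_prompt(target_tokens, filler="lorem"):
    """Build a placeholder prompt of target_tokens whitespace-separated words."""
    return " ".join([filler] * max(1, target_tokens))


# Illustrative data only, not measurements from any real system.
observed = [(200, "search"), (300, "search"), (250, "none"), (12000, "search+code")]
scenarios = sample_scenarios(observed, 1000, rng=random.Random(42))
prompts = [make_prompt(tokens) for tokens, _tools in scenarios[:3]]
```

Difficulty is harder to resample this way; if your traces carry a difficulty label, it can ride along as a third element of each pair.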

Many teams use k6, Locust, or Artillery as the driver, with a custom harness that knows how to sample from the trace store (k6 docs, 2025; Locust docs, 2025). Provider-side tools like the OpenAI evals harness or Anthropic's evaluation utilities cover end-to-end task runs at lower QPS for quality verification (OpenAI evals, 2025).

Metrics that matter

Latency is the easy one. The full set:

  1. End-to-end run latency at p50, p95, p99, p99.9. The user-facing number; the only one that matters to a customer.
  2. Per-step latency. Plan, tool call, retrieval, completion. Identifies which step owns the tail.
  3. Token throughput. Tokens per second sustained. Compares against provider quoted limits.
  4. Error rate by class. 429s, 5xx from provider, parse errors, tool-call failures, time-outs. A single "errors" number hides the real story.
  5. Tool-call success rate per tool. Some tools fail under load before others. Per-tool view surfaces the order.
  6. Cost per completed run. Live, not a post-hoc calculation. A cost dashboard during the test is the cheapest way to catch a runaway loop.
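
As a rough illustration of how these numbers fall out of raw run records, the sketch below computes nearest-rank percentiles, error rate by class, and cost per completed run. The record shape (latency_s, error, cost_usd) is an assumption; adapt it to whatever your harness actually emits. Per-step latency can reuse the same percentile function on per-step records.

```python
import math
from collections import Counter


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(runs):
    """Summarize records shaped like {'latency_s': float, 'error': str or None, 'cost_usd': float}."""
    latencies = [r["latency_s"] for r in runs]  # failed runs included on purpose
    completed = [r for r in runs if r["error"] is None]
    errors = Counter(r["error"] for r in runs if r["error"] is not None)
    total_cost = sum(r["cost_usd"] for r in runs)  # failed runs still cost money
    return {
        "latency": {f"p{p}": percentile(latencies, p) for p in (50, 95, 99, 99.9)},
        "error_rate_by_class": {cls: n / len(runs) for cls, n in errors.items()},
        "cost_per_completed_run": total_cost / len(completed) if completed else None,
    }
```

Dividing total cost by completed runs rather than all runs is deliberate: retries and failures inflate the cost of each run that actually finishes, which is the figure a runaway loop distorts first.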

For dashboard structure, see agent monitoring and observability; the same panels you build for production work double as the load-test view.

Finding the rate-limit cliff

Provider rate limits are characterized by RPM (requests per minute) and TPM (tokens per minute), with different limits per model, sometimes per region, and with bursting allowances (OpenAI rate limits, 2025; Anthropic rate limits, 2025). The published number is a starting point; the empirical cliff is what your harness measures.

The procedure:

  1. Start at 25 percent of the published RPM. Confirm steady state, zero 429s, latency baseline established.
  2. Ramp by 25 percent every 5 minutes until you see the first 429. Record the QPS at which it appeared and the headers (Retry-After).
  3. Drop to 80 percent of the cliff QPS. Run a 30-minute steady state. Confirm zero 429s and stable latency.
  4. Push past the cliff briefly to confirm backoff and retry behavior. The time it takes to recover once load drops back is your "blast" number.
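
The first three steps can be sketched as a small search loop. This assumes the ramp increment means 25 percent of the published RPM per stage, and that run_stage is a hypothetical callable supplied by your harness that drives load at a given RPM for one hold period and returns the number of 429s it saw. The max_fraction guard is also an assumption, added so the ramp cannot climb forever if the provider never pushes back.

```python
def find_cliff(published_rpm, run_stage, step_fraction=0.25, max_fraction=2.0):
    """Ramp in equal steps until a stage reports a 429.

    Returns (cliff_rpm, steady_rpm), where steady_rpm is 80 percent of the
    cliff, or (None, None) if no 429 appeared within the tested range.
    """
    step = 1
    while step * step_fraction <= max_fraction:
        rpm = published_rpm * step * step_fraction
        if run_stage(rpm) > 0:
            return rpm, 0.8 * rpm
        step += 1
    return None, None


# Usage with a fake stage runner that starts returning 429s above 110 RPM.
cliff, steady = find_cliff(100, lambda rpm: 1 if rpm > 110 else 0)
```

With 25 percent steps the measured cliff is only accurate to within one step; a second pass with a smaller step_fraction around the first hit narrows it down.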

Document the cliff number in the capacity-planning doc. When traffic projection approaches 80 percent of the documented cliff, file a capacity ticket with the provider.

Isolating provider variance

A jump in tail latency might be your service, the model provider, the tool API, or the network. The way to know is to run the same harness against a no-op replacement at each layer. If the latency is identical with a mock provider, the issue is yours. If it shifts when you swap to a smaller model, the issue is at the provider's queue. If it stays high until you reduce concurrency, the issue is your queue.

Keep two versions of the test harness: one with the real provider, one with a fast deterministic mock. Run both on the same traffic shape. The difference between the two runs is provider variance plus network; if the two agree, the bottleneck is in your own stack rather than at the model provider.
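
A minimal sketch of that comparison, assuming each harness run yields a list of end-to-end latencies in seconds for the same traffic shape. The function names are hypothetical; statistics.quantiles is the standard-library call.

```python
import statistics


def tail(latencies, pct=99):
    """pct-th percentile (integer pct from 1 to 99) using the inclusive method."""
    return statistics.quantiles(latencies, n=100, method="inclusive")[pct - 1]


def compare_runs(real_latencies, mock_latencies, pct=99):
    """Compare tail latency between the real-provider run and the mock run."""
    real = tail(real_latencies, pct)
    mock = tail(mock_latencies, pct)
    return {
        "real": real,
        "mock": mock,
        # What remains after subtracting the mock run is attributable to
        # the provider plus the network path to it.
        "provider_plus_network": real - mock,
    }
```

If provider_plus_network is close to zero while the tail is still high, the tail belongs to your own service or queue.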

Keeping the bill bounded

A single sustained load test on a frontier model at 100 RPM with average 3K tokens in and 1K tokens out can spend several hundred dollars per hour. Three controls.

  1. Hard token budget per test. Enforced in the harness. When budget hits 90 percent, the harness ramps down. When it hits 100 percent, it stops.
  2. Smaller representative model for shape tests. Use Haiku, gpt-4o-mini, or equivalent to characterize rate limits and queue behavior. Use the production model only for the final validation runs.
  3. Cached scenarios. If the test uses identical system prompts, prompt caching makes repeat runs cheap. Build the harness to be cache-friendly.
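
The arithmetic behind the hourly figure, and the first control, can be sketched as follows. The prices in the usage lines are placeholders chosen for illustration, not quotes from any provider, and TokenBudget is a hypothetical helper; the 90 and 100 percent thresholds come from the list above.

```python
def hourly_cost(rpm, tokens_in, tokens_out, price_in_per_m, price_out_per_m):
    """Dollar cost of one hour of sustained load, with prices per million tokens."""
    requests = rpm * 60
    per_request = tokens_in * price_in_per_m + tokens_out * price_out_per_m
    return requests * per_request / 1_000_000


class TokenBudget:
    """Hard token budget: ramp down at 90 percent used, stop at 100 percent."""

    def __init__(self, limit_tokens, ramp_down_pct=90):
        self.limit = limit_tokens
        self.ramp_down_pct = ramp_down_pct
        self.used = 0

    def record(self, tokens):
        """Add the input plus output tokens of one finished request."""
        self.used += tokens

    def action(self):
        """Return 'stop', 'ramp_down', or 'continue' for the harness to act on."""
        if self.used >= self.limit:
            return "stop"
        if self.used * 100 >= self.limit * self.ramp_down_pct:
            return "ramp_down"
        return "continue"


# 100 RPM, 3K tokens in, 1K tokens out, at placeholder prices of
# $15 (input) and $75 (output) per million tokens.
cost = hourly_cost(100, 3000, 1000, 15.0, 75.0)  # 720.0 dollars per hour
```

At 6,000 requests per hour the example moves 18 million input tokens and 6 million output tokens, so the budget is best expressed in tokens and checked after every response rather than on a timer.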

For the broader cost discipline that applies to production agents too, see AI agent cost control.

Test runbook

The template a team can copy.

  1. Scope. Which agent, which scenario, which provider, which region.
  2. Hypothesis. The number we expect to be true (e.g., "agent sustains 80 RPM at p95 less than 8 seconds"). The test confirms or refutes.
  3. Setup. Tenant id, marked traffic header, dashboards open, cost budget set, runbook for stopping.
  4. Ramp. 25 percent, 50, 75, 100 percent of target. Hold each level for 5 minutes. Capture metrics.
  5. Steady state. 30 minutes at target. Capture distributions, not averages.
  6. Stress. Push to 150 percent of target for 2 minutes. Recover. Document the failure mode.
  7. Report. p50/p95/p99 per step, error rates by class, rate-limit cliff if hit, cost spent, hypothesis verdict.
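
Steps 4 through 6 translate directly into a phase plan a harness can iterate over, and step 2 into a pass/fail check. The sketch below is one possible encoding; Phase, build_plan, and verdict are hypothetical names, and the defaults in verdict mirror the example hypothesis of 80 RPM at p95 under 8 seconds. The recovery window after the stress phase is left out because the runbook does not fix its duration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phase:
    name: str
    rpm: float
    minutes: int


def build_plan(target_rpm):
    """Ramp at 25/50/75/100 percent for 5 minutes each, then steady state, then stress."""
    plan = [Phase(f"ramp-{pct}", target_rpm * pct / 100, 5) for pct in (25, 50, 75, 100)]
    plan.append(Phase("steady", target_rpm, 30))
    plan.append(Phase("stress", target_rpm * 1.5, 2))
    return plan


def verdict(sustained_rpm, measured_p95_s, target_rpm=80, max_p95_s=8.0):
    """True when the hypothesis holds: target rate sustained with p95 under the bound."""
    return sustained_rpm >= target_rpm and measured_p95_s < max_p95_s


plan = build_plan(80)
total_minutes = sum(phase.minutes for phase in plan)  # 52
```

Knowing the total duration up front, 52 minutes for this plan, also makes it straightforward to size the token budget before the test starts.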

Common pitfalls in agent load testing

Five patterns that recur across load-test postmortems.

Synthetic prompts that do not match the production length distribution. Most synthetic prompts cluster around the median length and miss the long tail. Tail latency only shows on long prompts. Sample real traffic or model the length distribution explicitly.

Mock providers that hide queue behavior. A perfect-latency mock removes the queueing contention that real providers exhibit at the limit. Use the real provider for the final validation runs, not just the mocked harness.

Not testing recovery. Hitting a rate limit, tripping a circuit breaker, and running into queue back-pressure are all valuable signals. Ramping past the limit briefly is part of the test; recovery time is part of the report.

Single-region testing. Provider rate limits and capacity behave differently in different regions. If you serve users in multiple regions in production, run the load test in each region you care about.

Forgetting to clean up. Test runs leave artifacts in vector stores, memory stores, and trace systems. Without a teardown step, the next test runs against polluted state. Make the teardown part of the runbook.

FAQ

Why is load testing AI agents different from load testing a normal API?
The hot path includes a third-party LLM call with its own rate limits, queue latency, and variance. Tail latency is dominated by the model, not your service. Token cost makes brute-force load runs expensive.

What tools do people use for agent load testing?
k6, Locust, Gatling, Artillery, and JMeter for the request shape. Provider-side tools like the OpenAI evals harness for end-to-end runs. Some teams build custom harnesses to replay sampled production traces.

How do I avoid running up a huge model bill during load tests?
Use the smallest representative model that still exhibits the same rate-limit and latency surface. Sample the full payload distribution. Cap the test with a hard token budget.

What metrics matter beyond p50 latency?
p95, p99, and p99.9 latency for each step. Rate-limit error rate. Token throughput. Tool-call success rate. Cost per completed run. Failures by class.

How do I handle rate limits in a load test?
Discover the limit empirically with a ramp. Stay just below it for steady state. Then push past it briefly to test backoff, retry, and queue behavior.

Can I load test in production?
Yes, with care: a dedicated test tenant, marked traffic, a hard cost ceiling, and a stop runbook. Most teams start in staging and graduate to production load tests once the harness is trusted.

Sources