Rate limits are the boring failure mode that takes out demos when they go to production. The agent runs fine on five test requests, then 500 requests arrive in a minute and a third of them silently disappear. This guide is the production pattern: where the limits hit, how to back off, how to queue, what to persist, and which metrics warn you 30 minutes before everything melts.

Specific limits cited come from provider documentation as of May 2026: Anthropic (docs.anthropic.com/en/api/rate-limits), OpenAI (platform.openai.com/docs/guides/rate-limits), Slack (api.slack.com/docs/rate-limits), Gmail (developers.google.com/gmail/api/reference/quota), Stripe (docs.stripe.com/rate-limits). All retrieved 2026-05-09.

The five rate-limit surfaces

Production agents hit limits at all five surfaces. Missing any one produces silent failures during bursts.

  1. LLM provider. Anthropic, OpenAI, Google, Cohere, your inference provider. Both request-per-minute and tokens-per-minute apply.
  2. Embedding provider. Often the same vendor as the LLM but a separate quota and tier. Embeddings during retrieval can rate-limit independently of the agent's main loop.
  3. Destination tools. Slack, Gmail, Salesforce, HubSpot, Stripe, your CRM, your email-send service. Each has its own scheme. The cluster post on connecting an agent to Slack covers Slack's specifics; analogous limits exist on every API the agent touches.
  4. Database or vector store. Postgres connection pool, Pinecone QPS, Redis throughput. Often a function of the plan you are on, not just the API behaviour.
  5. Your own concurrency caps. Worker count, semaphore size, async pool size. These exist intentionally and must be respected by your queueing layer or you will OOM the runtime.
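Surface 5 is the one you control directly. A minimal sketch of an intentional concurrency cap, assuming an async runtime and a placeholder call_llm coroutine (both names are illustrative, not a real provider SDK):

```python
import asyncio

# Intentional cap: at most 8 in-flight LLM calls, regardless of arrival rate.
LLM_CONCURRENCY = asyncio.Semaphore(8)

async def call_llm(prompt: str) -> str:
    # Stand-in for the real provider call; assumed for illustration.
    await asyncio.sleep(0.1)
    return f"response to {prompt!r}"

async def bounded_llm_call(prompt: str) -> str:
    # Every LLM call must route through this wrapper; bypassing the
    # semaphore during a burst is how the runtime OOMs.
    async with LLM_CONCURRENCY:
        return await call_llm(prompt)
```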

Read the headers

Every well-behaved API returns rate-limit information in headers. The agent's HTTP layer should parse and use them.

  Retry-After: seconds (or an HTTP-date) to wait before retrying. Respecting it on 429 and 503 is mandatory.
  X-RateLimit-Limit: total quota in the current window.
  X-RateLimit-Remaining: calls left in the current window.
  X-RateLimit-Reset: Unix timestamp of, or seconds until, the window reset.
  anthropic-ratelimit-requests-remaining: Anthropic-specific request budget.
  anthropic-ratelimit-tokens-remaining: Anthropic-specific token budget.

Use the headers to back off proactively, not just reactively. If X-RateLimit-Remaining drops below 10 percent of X-RateLimit-Limit, slow your dispatch rate before the 429 arrives. Reactive-only backoff is twice as expensive in latency for the same throughput.
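A minimal sketch of header-driven proactive slowdown, assuming the generic X-RateLimit-* names and that X-RateLimit-Reset holds seconds until reset (check your provider; some send a Unix timestamp instead):

```python
def proactive_delay(headers: dict[str, str]) -> float:
    """Seconds to pause before the next dispatch, based on quota headers."""
    try:
        limit = int(headers["X-RateLimit-Limit"])
        remaining = int(headers["X-RateLimit-Remaining"])
        reset_in = float(headers["X-RateLimit-Reset"])  # assumed: seconds until reset
    except (KeyError, ValueError):
        return 0.0  # headers absent or malformed: fall back to reactive backoff
    if remaining > limit * 0.10:
        return 0.0  # more than 10 percent of the window left: full dispatch rate
    # Budget nearly gone: spread the remaining calls across the rest of the window.
    return reset_in / max(remaining, 1)
```

Call it after every response and sleep for the returned duration before the next dispatch; the 429 never has to arrive.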

Exponential backoff with jitter

The retry pattern that survives:

  1. Initial delay: 1 second.
  2. Double on each failure: 1, 2, 4, 8, 16, 32.
  3. Jitter: ±30 percent on each delay (so 100 retrying clients do not retry at the same instant and re-trigger the limit).
  4. Cap at 60 seconds.
  5. Maximum 5 retries; after that, give up and surface the failure.

If the response includes Retry-After, use that value (with jitter) instead of the exponential schedule. The provider knows when capacity will be available; trust them.
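A minimal sketch of the full schedule, where `call` is any hypothetical function returning a status code, headers, and body:

```python
import random
import time

MAX_RETRIES = 5     # after this, give up and surface the failure
BASE_DELAY = 1.0    # seconds
MAX_DELAY = 60.0    # cap on the exponential schedule
JITTER = 0.30       # +/- 30 percent, so retrying clients de-synchronise

def with_backoff(call):
    for attempt in range(MAX_RETRIES + 1):
        status, headers, body = call()
        if status not in (429, 503):
            return status, headers, body
        if attempt == MAX_RETRIES:
            break  # retry budget exhausted
        retry_after = headers.get("Retry-After")
        if retry_after is not None:
            delay = float(retry_after)  # assumes the seconds form; HTTP-date parsing omitted
        else:
            delay = min(BASE_DELAY * 2 ** attempt, MAX_DELAY)
        delay *= random.uniform(1 - JITTER, 1 + JITTER)
        time.sleep(delay)
    raise RuntimeError(f"still rate limited after {MAX_RETRIES} retries; aborting")
```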

Do not retry forever. Persistent rate limits indicate capacity exhaustion, not a transient hiccup. The agent should abort the action, log the failure with full context, and surface the issue to the orchestrator. The cluster post on agent failure modes covers when to give up versus when to continue.

Queue with bounded depth

A token-bucket rate limiter sized to the provider's documented limit smooths bursts into a steady flow. Queue requests that arrive faster than the bucket allows; dispatch when tokens are available.

Bound the queue depth. An unbounded queue accumulates work indefinitely; eventually you OOM and lose everything. Pick a depth based on your maximum acceptable latency: if the queue is at depth N and the processing rate is R per second, the worst-case latency is N/R seconds. Reject incoming requests when the queue is full and surface the rejection with a clear error code: HTTP 429 if the caller has exceeded its budget, 503 if your service is shedding load.
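A minimal single-process sketch of both pieces (no locking; a production version needs thread-safety or an async equivalent, and the numbers are illustrative):

```python
import time
from collections import deque

class TokenBucket:
    """Sized to the provider's documented limit: `rate` tokens per second."""
    def __init__(self, rate: float, burst: int):
        self.rate, self.capacity = rate, float(burst)
        self.tokens, self.last = float(burst), time.monotonic()

    def try_acquire(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

class BoundedQueue:
    """Depth follows from the latency budget: depth = max_latency * rate.
    A 30-second chat budget at 10 requests/second gives a depth of 300."""
    def __init__(self, max_latency_s: float, rate: float):
        self.max_depth = int(max_latency_s * rate)
        self.items: deque = deque()

    def submit(self, item) -> bool:
        if len(self.items) >= self.max_depth:
            return False  # caller translates this into a 429 or 503
        self.items.append(item)
        return True
```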

For agent contexts where the user is waiting for a response, the timeout should match the user's tolerance: 30 seconds for chat, 5 minutes for batch. For background contexts (scheduled runs, async tasks), longer queues are acceptable but never unbounded.

Persist the queue

An in-memory queue dies on process restart. For anything more important than a chat reply, persist the queue:

  1. Write each queued action to durable storage before dispatch, and mark it in-flight while it is being sent.
  2. On restart, recover in-flight items back to pending so nothing is lost with the process.
  3. Give each item a retry budget; when the budget is exhausted, move the item to a dead-letter queue for review instead of dropping it.
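A minimal sketch of that pattern, using SQLite for illustration (the table name and columns are assumptions; any database or managed queue with the same semantics works):

```python
import sqlite3

conn = sqlite3.connect("agent_queue.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS outbox (
        id      INTEGER PRIMARY KEY,
        surface TEXT NOT NULL,                    -- which rate-limit surface
        payload TEXT NOT NULL,                    -- the serialised action
        status  TEXT NOT NULL DEFAULT 'pending',  -- pending | in_flight | done | dead
        retries INTEGER NOT NULL DEFAULT 0
    )
""")

def enqueue(surface: str, payload: str) -> None:
    with conn:
        conn.execute("INSERT INTO outbox (surface, payload) VALUES (?, ?)",
                     (surface, payload))

def recover_on_restart() -> None:
    # Items that were in flight when the process died go back to pending.
    with conn:
        conn.execute("UPDATE outbox SET status = 'pending' WHERE status = 'in_flight'")

def dead_letter(item_id: int, max_retries: int = 5) -> None:
    # Exhausted retry budget: park the item for review instead of dropping it.
    with conn:
        conn.execute("UPDATE outbox SET status = 'dead' WHERE id = ? AND retries >= ?",
                     (item_id, max_retries))
```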

The cluster post on monitoring agent activity covers the dashboards that should watch the queue and dead-letter rates. Persistent dead-letter growth is a signal that the agent is producing actions the destination cannot accept; the agent prompt or the action allow-list typically needs adjustment.

Metrics that warn early

Five metrics worth alerting on:

  1. Queue depth (per surface). Sustained growth means the dispatch rate cannot keep up with arrival rate. Action: scale the worker pool, raise the limit with the provider, or shed load.
  2. 429 / 503 rate (per surface). A baseline non-zero rate is normal; a doubling is the leading indicator.
  3. Retry rate. Should be low single-digit percent in steady state. If 30 percent of requests retry, the dispatch rate is too aggressive.
  4. Dead-letter queue size. Should be near zero. Sustained growth means the agent is producing rejected actions; investigate the agent.
  5. End-to-end latency p95. The user-facing measure. If the queue absorbs bursts at the cost of latency, the user notices p95 first.
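A minimal sketch of the instrumentation, assuming the prometheus_client library (any metrics backend works; the metric names are illustrative):

```python
from prometheus_client import Counter, Gauge, Histogram

# One label per rate-limit surface so dashboards can split llm vs slack vs db.
QUEUE_DEPTH  = Gauge("agent_queue_depth", "Items waiting for dispatch", ["surface"])
LIMIT_HITS   = Counter("agent_rate_limited_total", "429/503 responses", ["surface", "status"])
RETRIES      = Counter("agent_retries_total", "Retried requests", ["surface"])
DEAD_LETTERS = Gauge("agent_dead_letter_size", "Items parked after retry exhaustion", ["surface"])
E2E_LATENCY  = Histogram("agent_request_seconds", "End-to-end request latency", ["surface"])

# Example instrumentation points:
QUEUE_DEPTH.labels(surface="slack").set(42)
LIMIT_HITS.labels(surface="slack", status="429").inc()
RETRIES.labels(surface="llm").inc()
with E2E_LATENCY.labels(surface="llm").time():
    pass  # the dispatch + retry + response cycle goes here
```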

Production checklist

Before shipping an agent that talks to external services, walk this checklist methodically:

  1. Map every external call to its rate-limit surface and document the limit per surface.
  2. Implement Retry-After respect in your HTTP client at the lowest layer.
  3. Add exponential backoff with jitter as the default retry policy across all calls.
  4. Wrap each surface in a token-bucket rate limiter sized to its documented limit.
  5. Bound queue depth and reject overflow with a clear error code, never enqueue without bound.
  6. Persist queues to durable storage and recover in-flight items on restart.
  7. Set per-item retry budgets and route exhausted items to a dead-letter queue for review.
  8. Instrument the five metrics: queue depth, 429 rate, retry rate, dead-letter size, p95 latency.
  9. Configure alerts on sustained queue growth and dead-letter accumulation (trends), rather than on fixed absolute thresholds.
  10. Run a load test that exceeds your worst-case burst and verify graceful degradation, not silent loss.

Skipping any step is a future incident. The cluster post on how we test AI agents covers the broader test suite that exercises rate-limit handling under load. Treat the checklist as a hard gate before production deploy, not a nice-to-have.

Frequently asked questions

Where do AI agents hit rate limits?

Five surfaces: the LLM provider (Anthropic, OpenAI, Google), the embedding provider, the destination tools (Slack, Gmail, Salesforce), the database or vector store under the agent, and the agent's own concurrency caps. Production agents must handle limits at all five surfaces; missing any one produces silent failures during bursts.

What is the right retry pattern for rate limit errors?

Exponential backoff with jitter, respecting the Retry-After header when provided. Start at 1 second, double on each retry, jitter +/- 30 percent, cap at 60 seconds, and abort after 5 retries. Persistent rate limits indicate capacity exhaustion, not a transient hiccup; abort and surface to the orchestrator rather than retrying forever.

How should an agent queue requests during a burst?

Use a token-bucket rate limiter sized to the provider's documented limit, with a queue depth bounded by your maximum acceptable latency. Reject the request immediately if the queue is full rather than enqueuing without bound. Track queue depth as a metric; sustained queue growth is the leading indicator of a capacity problem you have 30 minutes to address.

Why do agents lose messages during rate-limit events?

Three reasons. First, retry without persistence: the agent retries in memory and a process restart loses the queue. Second, infinite retry without timeout: the agent burns retry budget on a permanently failed call. Third, silent error swallowing: the LLM produces a tool call, the tool returns 429, the agent moves on without logging the failure. Fix all three: persist the queue, set timeouts, log every 4xx and 5xx with full context.

What rate limits should I plan for on production agents?

Anthropic's tier-3 limit is 4,000 requests per minute and 80 million input tokens per minute on Claude Sonnet 4 as of May 2026. OpenAI's tier-5 limit is 30,000 requests per minute. Slack's chat.postMessage is roughly 1 message per second per channel. Gmail send is 250 quota units per user per second. Stripe is 100 requests per second per account. Map your agent's worst-case burst to all of these before launch.

Three takeaways before you close this tab

  1. Rate limits hit at five surfaces, not one: LLM provider, embedding provider, destination tools, data stores, and your own concurrency caps. Handle all five or bursts will produce silent failures.
  2. Back off exponentially with jitter, respect Retry-After, and stop after five retries; persistent limits mean capacity exhaustion, not a transient hiccup.
  3. Queue with bounded depth, persist the queue to durable storage, and alert on queue growth and dead-letter accumulation before users notice p95.

Sources

  Anthropic rate limits: docs.anthropic.com/en/api/rate-limits (retrieved 2026-05-09)
  OpenAI rate limits: platform.openai.com/docs/guides/rate-limits (retrieved 2026-05-09)
  Slack rate limits: api.slack.com/docs/rate-limits (retrieved 2026-05-09)
  Gmail API quota: developers.google.com/gmail/api/reference/quota (retrieved 2026-05-09)
  Stripe rate limits: docs.stripe.com/rate-limits (retrieved 2026-05-09)