Per-task token cost is the cheapest, easiest, and most misleading number you can quote when estimating an agent's deployed cost. The number is right for a single successful invocation in development. It is wrong, often by 1.5-2x, for the cost the agent will actually incur in production. The four-factor model below produces a number you can budget against.
The factors are mean tokens per successful invocation, retry rate, observability overhead, and tail-latency timeouts. Each is measurable, each is small individually, and together they explain why early agent deployments routinely exceed budget. Anthropic and OpenAI both publish their per-token pricing transparently (Anthropic, "Model pricing"; OpenAI, "API pricing"); the rest of the math is workload-specific.
Why per-token estimates lie
The number you get from a one-off test run captures a successful invocation in development. The agent reads the input, calls the right tools, produces the output, finishes cleanly. You sum the prompt tokens and completion tokens, multiply by the published per-token rates, and you have an estimate.
The estimate is right for that one run. It is wrong for production for three reasons. First, production has retries, which the test run did not. Second, production has logging, which the test run minimised. Third, production has long-tail invocations that take 5-10x the mean and sometimes time out before completing. None of those costs appear in the per-task baseline.
The cost framework in economics of bootstrapped AI agents covers the broader unit-economics picture. This post is the operational version: how to produce a number that matches the deployed reality.
Factor 1: Mean tokens per successful invocation
Run the agent ten times in development on representative inputs. Sum the prompt tokens and completion tokens for each run. Take the mean. Multiply by the model's published per-token rates. This is the baseline.
Why ten runs and not one? Token usage varies across inputs. The mean across ten runs is a more reliable baseline than any single run. Use representative inputs, not edge cases; the baseline is for typical operation. The cost-model categories in AI agent cost models determine which pricing applies (per-token, per-task, per-capability).
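A minimal sketch of that calculation in Python, assuming the ten calibration runs have been logged as (prompt, completion) token pairs. The token counts and per-token rates here are placeholders, not measurements; substitute your own runs and the published pricing for the model you actually use.

```python
# Baseline: mean cost of a successful invocation across ten calibration runs.
# Token pairs and per-token rates are placeholders; substitute your own logged
# runs and the published pricing for the model you actually use.

calibration_runs = [           # (prompt_tokens, completion_tokens) per run
    (1_780, 340), (1_910, 360), (1_850, 330), (1_820, 355), (1_760, 345),
    (1_900, 350), (1_840, 335), (1_880, 365), (1_800, 340), (1_860, 350),
]
INPUT_RATE = 3.00 / 1_000_000    # assumed $ per prompt token
OUTPUT_RATE = 15.00 / 1_000_000  # assumed $ per completion token

mean_prompt = sum(p for p, _ in calibration_runs) / len(calibration_runs)
mean_completion = sum(c for _, c in calibration_runs) / len(calibration_runs)
baseline = mean_prompt * INPUT_RATE + mean_completion * OUTPUT_RATE

print(f"mean tokens: {mean_prompt:.0f} prompt, {mean_completion:.0f} completion")
print(f"per-task baseline: ${baseline:.4f}")
```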
Factor 2: Retry rate and amplification
Retries happen when the agent's first attempt fails or produces output that fails validation. The agent runs again with the same input or with a refined prompt. Each retry costs roughly the same as a fresh invocation.
Retry rates vary widely by task. On a well-understood task with a stable prompt, retry rate is 2-5%. On a new task with an ambiguous prompt, retry rate is 20-30%. The retry rate compounds onto the baseline: a 20% retry rate means amortised cost is 1.2x baseline.
Measure retry rate during the supervised first ten runs (the calibration window covered in how to set up your first AI agent). Use the measured rate. Guessing low produces an estimate that the deployment will exceed; guessing high produces an estimate that looks worse than necessary.
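The amplification itself is a single multiplication. A short sketch, with the 20% rate standing in for whatever you measured during the calibration window:

```python
# Retry amplification: each retry costs roughly a full invocation, so the
# amortised cost is baseline * (1 + retry_rate). The 20% rate is an example;
# use the rate measured during the supervised first ten runs.

baseline = 0.04     # per-task baseline in dollars (Factor 1)
retry_rate = 0.20   # measured fraction of invocations that need a retry

amortised = baseline * (1 + retry_rate)
print(f"amortised cost with retries: ${amortised:.3f}")   # 1.2x baseline, $0.048
```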
Factor 3: Observability and audit-log overhead
Logging the agent's reasoning trace, tool calls, and outputs is required for debugging (covered in how to debug an AI agent) and for audit (covered in how to give an agent email access safely). The logging is not free.
Three components contribute. First, the structured trace itself is additional tokens that the agent or the runtime emits. Second, the storage destination charges for ingest and retention. Third, the indexing layer (if you query traces) adds storage and query cost.
Across realistic workloads, observability adds 10-30% to per-task cost. The lower bound is minimal logging (action and result only). The upper bound is full reasoning trace plus structured tool I/O. Pick your level and budget the percentage. Skipping observability is not a real option for production agents; the cost of debugging without traces is worse than the 30% overhead.
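A sketch of how the overhead might break down per task. The three component figures are illustrative assumptions, not vendor prices, chosen to land at the 30% upper bound; measure your own trace size, ingest pricing, and index cost.

```python
# Observability overhead per task, split into its three components.
# All three figures are illustrative assumptions, not vendor prices.

baseline = 0.04              # per-task baseline in dollars

trace_token_cost = 0.006     # extra tokens emitted for the structured trace
storage_ingest_cost = 0.004  # log-store ingest and retention, amortised per task
index_cost = 0.002           # trace indexing / query layer, amortised per task

observability = trace_token_cost + storage_ingest_cost + index_cost
print(f"observability overhead: ${observability:.3f} "
      f"({observability / baseline:.0%} of baseline)")    # $0.012, 30%
```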
Factor 4: Tail-latency timeouts
Some agent runs take 5-10x the mean. Some hit the configured timeout before completing. Timeouts affect only 1-3% of invocations on most workloads, but each one is expensive because the cap typically fires late in a long-running task: you have already paid for most of the tokens before the run is cut off.
Estimate by measuring the 95th and 99th percentile latency in the supervised window, comparing against the configured timeout, and counting how many runs would have hit the cap. Each capped run incurs the partial cost up to the cap. Sum across the expected daily volume to get the per-day timeout cost.
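The same arithmetic as a sketch, with placeholder figures for the measured percentiles, the cap, and the partial cost at the cap; substitute what you observe in the supervised window.

```python
# Tail-latency timeout cost per day. All figures are placeholder assumptions;
# substitute the percentiles, cap, and partial cost you measure in the
# supervised window.

P95_LATENCY_S = 90      # measured 95th percentile run latency
P99_LATENCY_S = 260     # measured 99th percentile run latency
TIMEOUT_S = 120         # configured cap, sits between p95 and p99
CAPPED_FRACTION = 0.03  # share of runs expected to hit the cap
COST_AT_CAP = 0.06      # dollars already spent when the cap fires (assumed)
DAILY_VOLUME = 50       # invocations per day

# The cap cuts off a few percent of runs, but each capped run still incurs
# the partial cost accumulated before the timeout fired.
assert P95_LATENCY_S < TIMEOUT_S < P99_LATENCY_S
daily_timeout_cost = CAPPED_FRACTION * DAILY_VOLUME * COST_AT_CAP
print(f"expected timeout cost: ${daily_timeout_cost:.2f} per day")   # $0.09
```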
Worked example: daily Slack triage agent
A daily Slack triage agent processes 50 messages per morning, classifying each by urgency and producing a digest. Baseline measurement: mean 3,200 prompt tokens and 200 completion tokens per invocation, with one invocation per message. At Anthropic's mid-tier model pricing, the per-task baseline is approximately $0.04.
Layer the four factors. Retry rate measured at 20%: adds $0.008 per task. Observability at 30% overhead: adds $0.012. Tail-latency timeouts at 3% of invocations, each capped run having already burned roughly 1.5x the baseline by the time the cap fires: adds roughly $0.002. Amortised deployed cost: $0.062 per task, or roughly $3.10 per day for the 50-message workload.
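The compounding in a few lines of Python. The baseline, retry rate, observability overhead, and timeout rate are the figures above; the assumption that a capped run has burned roughly 1.5x the baseline is what makes the tail line come out near $0.002.

```python
# Compound the four factors for the Slack triage example. Baseline and rates
# are the figures from the text; the 1.5x-baseline cost of a capped run is an
# assumption consistent with the ~$0.002 tail line item.

baseline = 0.04           # Factor 1: per-task baseline
retry_rate = 0.20         # Factor 2: measured retry rate
observability_pct = 0.30  # Factor 3: trace + storage + indexing overhead
timeout_rate = 0.03       # Factor 4: share of runs that hit the cap
cost_at_cap = 1.5 * baseline   # partial spend on a capped long-tail run
daily_volume = 50

per_task = (
    baseline
    + baseline * retry_rate          # retries:       +$0.008
    + baseline * observability_pct   # observability: +$0.012
    + timeout_rate * cost_at_cap     # timeouts:      +~$0.002
)
print(f"amortised per-task cost: ${per_task:.3f}")     # ~$0.062
print(f"daily cost: ${per_task * daily_volume:.2f}")   # ~$3.09, roughly $3.10
```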
The per-task baseline ($0.04) misses roughly a third of the real spend. The compounded model produces $0.062, 55% above the baseline, and matches what the deployment actually spends. Budget against the compounded model, not the baseline. Pricing for both Anthropic and OpenAI is published and reliable enough to rebuild this estimate for any model on any workload.
Frequently asked questions
Why do per-task AI agent cost estimates fail?
Per-task estimates capture only the successful-invocation cost. They miss retries (which add 20-30% on new or ambiguous tasks), observability and audit-log overhead (which adds 10-30%), and tail-latency timeouts (where the agent times out partway through and the partial cost is still incurred). The realistic cost is the mean-case baseline plus a multiplier for these factors.
What are the four cost factors for an AI agent?
Mean tokens per successful invocation (the baseline), retry rate and amplification (how often the agent retries and how much each retry costs), observability and audit-log overhead (the cost of logging each tool call), and tail-latency timeouts (the partial cost incurred when the agent times out before completing). Add all four to get a deployable estimate.
How much does the retry rate add to AI agent cost?
It depends on the task. A reliable agent on a well-understood task retries 2-5% of invocations. A new agent on an ambiguous task can retry 20-30%. Each retry costs roughly the same as a fresh invocation. A 20% retry rate adds 20% to amortised cost. Measure retry rate during the supervised first ten runs and use the measured rate, not a guess.
Should I include observability cost in my AI agent estimate?
Yes. Logging the agent's reasoning trace, tool calls, and outputs adds 10-30% to per-task cost depending on the verbosity of logging and the storage destination. Skipping observability cost in estimates is the most common reason real-world agent budgets exceed projections.
What is a tail-latency timeout in agent cost?
When an agent run takes longer than the configured timeout, the agent stops partway through and you have already paid for the tokens consumed up to that point. Tail-latency timeouts are 1-3% of invocations on most workloads but their per-event cost can be high because they stop near the end of long-running tasks. Budget the partial cost as a separate line.
Three takeaways before you close this tab
- Baseline plus retries plus observability plus tail. All four factors. Skipping any one of them underestimates deployed cost by 5-30%.
- Measure during the supervised first ten runs. Use measured rates, not guesses.
- Budget against the compounded estimate, not the baseline. The compound is what production spends.
Sources
- Anthropic, "Model pricing", retrieved 2026-05-07, anthropic.com/pricing
- OpenAI, "API pricing", retrieved 2026-05-07, openai.com/api/pricing
- OpenAI, "Function calling and tool use", retrieved 2026-05-07, platform.openai.com/docs/guides/function-calling
- Aryan Agarwal, "Gravity cost methodology", internal v1, May 2026