Most AI agent ROI math undercounts cost and overcounts value. The honest version puts everything on the table: token cost, tool calls, infra, observability, retry overhead, and human-in-loop time on the cost side; hours saved, error reduction, cycle-time gain, and opportunity cost on the value side. Deloitte's State of Generative AI work and MIT Sloan's enterprise AI tracking both flag a recurring trap where per-task cost rises above the loaded human cost of the same task (Deloitte, 2024; MIT Sloan, 2024).

This post gives the cost stack line by line, the value stack line by line, a payback-period calculator with a worked example, and the three honest cases where an AI agent is the wrong answer. Fork the math into your spreadsheet. No vendor will hand it to you with the inconvenient lines on it.

The cost stack, line by line

Six line items. Most teams budget for two and discover the other four after launch.

  1. Token cost. Input plus output, base rate, before any caching or batching discounts. Pull the public price card; multiply by tokens-per-run; multiply by run volume.
  2. Tool-call cost. Anything the agent pays a third party for: paid search, premium data feeds, SMS, Stripe transfers, vendor-side reranking, third-party agent fees.
  3. Infrastructure. Hosting the orchestrator, queues, storage of traces and memory, egress. Usually 5 to 15 percent of inference spend at moderate scale.
  4. Observability. Trace storage and the bills from Langfuse, LangSmith, Arize, Helicone, or Datadog LLM Observability. Grows linearly with run volume.
  5. Retry overhead. The most undercounted line. A 20 percent retry rate adds 20 percent to inference cost, plus retried tool calls, plus the human review the retry triggers. A 1.2x to 1.6x multiplier is typical.
  6. Human-in-loop time. Loaded review cost per gated action. Often the biggest single cost on safety-critical workflows. Bills as time, not API spend, but shows up in the same P&L.
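
The six lines roll up into a single per-task figure. A minimal Python sketch, under stated assumptions: retries are charged against token plus tool spend only (the extra human review a retry triggers is left out), infrastructure is modeled as a fraction of token cost, and every field name and sample number is illustrative rather than taken from any price card.

```python
from dataclasses import dataclass


@dataclass
class CostStack:
    token: float          # line 1: token cost per run, dollars
    tools: float          # line 2: paid tool calls per run, dollars
    infra_share: float    # line 3: infrastructure as a fraction of token cost
    observability: float  # line 4: trace storage and tooling per run, dollars
    retry_rate: float     # line 5: fraction of runs that are retried
    review_rate: float    # line 6: fraction of runs gated for human review
    review_cost: float    # line 6: loaded cost of one human review, dollars

    def per_task(self) -> float:
        inference = self.token + self.tools
        retries = inference * self.retry_rate      # retried tokens and tool calls
        infra = self.token * self.infra_share
        review = self.review_rate * self.review_cost
        return inference + retries + infra + self.observability + review


# Hypothetical inputs for a support-style workload.
stack = CostStack(token=0.08, tools=0.03, infra_share=0.10,
                  observability=0.01, retry_rate=0.30,
                  review_rate=0.08, review_cost=4.27)
print(f"${stack.per_task():.2f} per task")
```

Keeping each line as its own field makes it obvious which ones a budget left at zero.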

The value stack, line by line

Four line items, ordered roughly by how defensibly they can be measured.

  1. Hours saved. Loaded hourly cost of the human whose work the agent absorbs. Use fully loaded compensation (salary plus benefits plus overhead), not headline salary.
  2. Error reduction. Cost of errors the human made that the agent does not, minus cost of errors the agent makes that the human did not. Net is sometimes negative early.
  3. Cycle-time gain. Revenue or margin associated with faster turnaround. Most defensible on sales-cycle and support-resolution use cases.
  4. Opportunity cost. Work that would not have happened at all without the agent. Hardest to defend, but real. Treat it as upside, not base case.
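
The value side can be sketched the same way. A rough Python version, assuming all four lines are expressed in dollars per task; the function name, parameters, and sample figures are hypothetical. Opportunity value is returned separately so it stays out of the base case.

```python
def value_per_task(minutes_saved: float, loaded_hourly: float,
                   human_error_cost: float, agent_error_cost: float,
                   cycle_time_gain: float = 0.0,
                   opportunity: float = 0.0) -> tuple[float, float]:
    """Return (base_case, upside_case) value per task in dollars."""
    hours_saved = minutes_saved / 60 * loaded_hourly
    # Net error reduction: can be negative while the agent is still immature.
    error_reduction = human_error_cost - agent_error_cost
    base = hours_saved + error_reduction + cycle_time_gain
    return base, base + opportunity


# Hypothetical early-launch numbers: the agent still makes more costly
# errors than the human did, so the error line subtracts from value.
base, upside = value_per_task(minutes_saved=10, loaded_hourly=32.0,
                              human_error_cost=0.20, agent_error_cost=0.35,
                              opportunity=0.50)
print(f"base ${base:.2f}, upside ${upside:.2f}")
```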

Per-task vs per-seat vs outcome-based pricing

The three pricing models map to three different ROI shapes. Per-task is easiest to compare against a per-task human cost; per-seat is cheaper at scale but locks in fixed cost; outcome-based aligns vendor incentives but is hardest to forecast. For a deeper treatment of the pricing axes themselves see AI agent cost models explained.

Model         | Best for                            | Risk
Per task      | Variable volume, easy benchmarking  | Spikes blow the budget
Per seat      | Steady internal use, multi-feature  | Underused seats pay anyway
Outcome-based | Aligned vendor incentives           | Definition disputes when outcomes are fuzzy
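
To see the three ROI shapes side by side, it helps to run the same volumes through each model. The sketch below is a toy comparison in Python; the prices, seat count, and success rate are invented for illustration and do not reflect any vendor's pricing.

```python
def per_task_cost(volume: int, price_per_task: float) -> float:
    return volume * price_per_task


def per_seat_cost(seats: int, price_per_seat: float) -> float:
    # Fixed: the bill is the same whether the seats are used or not.
    return seats * price_per_seat


def outcome_cost(volume: int, success_rate: float,
                 price_per_outcome: float) -> float:
    # Only tasks that meet the agreed outcome definition are billed.
    return volume * success_rate * price_per_outcome


# Hypothetical price points.
for volume in (2_000, 4_000, 8_000):
    print(volume,
          round(per_task_cost(volume, 0.50)),
          round(per_seat_cost(5, 400.0)),
          round(outcome_cost(volume, 0.6, 0.90)))
```

With these placeholder prices, per-task and outcome-based bills double when volume doubles, while the per-seat bill does not move in either direction.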

The "agent that costs more than the human" trap

The single worst pattern. A team builds an agent for a task where the loaded human cost is $1.40 per task. The agent ends up costing $1.85 per task once retries and human review are counted. The team ships anyway because the build is sunk cost. Twelve months later they have spent more on the agent than they saved, plus the build cost, plus the cost of explaining it.

The defense is to compute fully-loaded per-task cost on both sides before the build. If the agent's per-task cost (token plus tool plus retry plus review) does not beat the loaded human cost by at least 30 percent, do not build. The 30 percent buffer absorbs the prompt drift, price drift, and data drift you will experience in production.
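
The 30 percent rule is a one-line check. A minimal sketch in Python; the function name is hypothetical and the buffer is a parameter so it can be tightened or relaxed.

```python
def clears_build_gate(agent_cost: float, human_cost: float,
                      buffer: float = 0.30) -> bool:
    """True if the agent beats the loaded human cost by at least `buffer`."""
    return agent_cost <= human_cost * (1 - buffer)


# The trap from the text: $1.85 agent vs $1.40 human.
print(clears_build_gate(1.85, 1.40))   # False
# The same task would need the agent at roughly $0.98 or below to pass.
print(clears_build_gate(0.90, 1.40))   # True
```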

How retry overhead, observability, and human-in-loop silently eat margin

Three silent eaters. Retries: a 20 percent retry rate means 20 percent more inference, 20 percent more tool calls, and a higher rate of triggered human review. Observability: trace storage costs scale with run volume, and the temptation to retain 90 days of full-fidelity traces dies the first time finance sees the bill. Human-in-loop: a 5-minute review per gated action at $40 per loaded hour is $3.33 per action; ten gated actions per day per reviewer is about $33 per day, or roughly $660 per month over 20 working days, before the agent runs a single new task.
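
The human-in-loop arithmetic is worth keeping as a function so the review minutes and gate count can be varied. A small Python sketch, assuming 20 working days per month, which is what the $660 figure implies; carried unrounded, the monthly number lands slightly higher.

```python
def review_cost_per_action(minutes: float, loaded_hourly: float) -> float:
    return minutes / 60 * loaded_hourly


def monthly_review_cost(minutes: float, loaded_hourly: float,
                        actions_per_day: int, working_days: int = 20) -> float:
    per_action = review_cost_per_action(minutes, loaded_hourly)
    return per_action * actions_per_day * working_days


per_action = review_cost_per_action(5, 40.0)
monthly = monthly_review_cost(5, 40.0, actions_per_day=10)
print(f"${per_action:.2f} per action, ${monthly:.2f} per reviewer per month")
```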

The payback-period calculator: a worked example

Take an inbound support triage agent. Volume: 4,000 tickets per month. Replaces 60 percent of L1 triage work. L1 loaded cost: $32 per hour. Tickets per hour at L1: 6. So loaded human cost per ticket is roughly $5.33. Agent cost per ticket: token $0.08, tools $0.03, retry overhead of 30 percent on token plus tool spend so add $0.03, observability $0.01, human review on the 8 percent of tickets that are gated at $4.27 each so $0.34. Total per ticket: $0.49. Net saving per ticket: $4.84. Monthly: 4,000 × 0.6 × $4.84 = $11,616.

Build cost: 200 engineering hours at $120 loaded = $24,000. Payback: $24,000 / $11,616 = 2.07 months. Sensitivity check: if retry overhead doubles to 60 percent and review rate doubles to 16 percent, retry adds $0.07 and review adds $0.68, so per-ticket cost rises to $0.87 and payback extends to 2.24 months. Still acceptable. If volume drops 50 percent, payback doubles to 4.13 months. Document the sensitivity, not just the base case.
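
The same calculator fits in a few lines of Python, which makes the sensitivity runs cheap to repeat. This is a sketch using the inputs from the worked example; the function names are hypothetical, and intermediate values are carried unrounded, so cents can differ slightly from hand arithmetic. With these inputs the stressed case works out to roughly $0.87 per ticket and 2.24 months.

```python
def agent_cost_per_ticket(token: float, tools: float, retry_rate: float,
                          observability: float, review_rate: float,
                          review_cost: float) -> float:
    # Retry overhead is applied to token plus tool spend.
    return ((token + tools) * (1 + retry_rate)
            + observability + review_rate * review_cost)


def payback_months(build_cost: float, volume: int, share_replaced: float,
                   human_cost: float, agent_cost: float) -> float:
    monthly_saving = volume * share_replaced * (human_cost - agent_cost)
    if monthly_saving <= 0:
        return float("inf")   # cost inversion: the build never pays back
    return build_cost / monthly_saving


human = 32.0 / 6            # $32 per hour, 6 tickets per hour
build = 200 * 120.0         # 200 engineering hours at $120 loaded

base_cost = agent_cost_per_ticket(0.08, 0.03, 0.30, 0.01, 0.08, 4.27)
stress_cost = agent_cost_per_ticket(0.08, 0.03, 0.60, 0.01, 0.16, 4.27)

base = payback_months(build, 4_000, 0.6, human, base_cost)
stressed = payback_months(build, 4_000, 0.6, human, stress_cost)
half_volume = payback_months(build, 2_000, 0.6, human, base_cost)

print(f"base {base:.2f}, stressed {stressed:.2f}, half volume {half_volume:.2f}")
```

Returning infinity on a non-positive saving keeps a cost-inverted case from showing up as a negative payback period.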

When an AI agent is NOT worth building

Three honest cases. First, cost inversion: per-task cost exceeds loaded human cost. Use the human. Second, asymmetric failure cost: medical advice, legal filings, regulated trades, irreversible financial transactions where a single bad output costs more than the entire annual saving. Use the licensed professional and the existing process. Third, rare-task non-amortization: the task runs 50 times a year. The build cost does not amortize. Keep the SOP.
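
The three cases can be written as a screen that runs before any build is scoped. A hedged Python sketch: the function and parameter names are hypothetical, the asymmetric-failure test compares one bad output against a full year of savings, and the 12-month cutoff is an assumption borrowed from the FAQ's rule of thumb.

```python
def no_go_reasons(agent_cost: float, human_cost: float, runs_per_year: int,
                  single_failure_cost: float, build_cost: float,
                  max_payback_months: float = 12.0) -> list[str]:
    """Return the no-go cases that apply; an empty list means proceed."""
    if agent_cost >= human_cost:
        return ["cost inversion"]          # nothing else matters

    reasons = []
    annual_saving = runs_per_year * (human_cost - agent_cost)
    if single_failure_cost > annual_saving:
        reasons.append("asymmetric failure cost")
    if build_cost / annual_saving * 12 > max_payback_months:
        reasons.append("rare-task non-amortization")
    return reasons


# Hypothetical rare task: 50 runs a year, large per-task saving,
# but the build cost never amortizes.
print(no_go_reasons(agent_cost=2.0, human_cost=40.0, runs_per_year=50,
                    single_failure_cost=100.0, build_cost=24_000.0))
```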

The CFO conversation template

Five slides, no more. (1) The task today: volume, cost, error rate. (2) The agent's design and the cost stack. (3) The value stack with the assumptions named. (4) Payback period with sensitivity on the two scariest assumptions. (5) Kill criteria: at what numbers do we stop. The kill slide is what earns the budget. Vendors that cannot answer slide 5 do not get the contract.

How to measure ROI continuously

ROI is not a launch-day artifact. Tag every run with the same cost and value attributions used in the business case. Aggregate in the warehouse. Show monthly: actual cost per task vs forecast, actual hours saved vs forecast, retry rate vs forecast, review rate vs forecast. The first month after launch will be 30 to 50 percent worse than the business case. The third month is where the truth lives. Re-baseline every quarter.
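
The monthly view reduces to a signed deviation per metric. A minimal Python sketch, assuming the forecast and actual figures have already been aggregated into two dictionaries with matching keys; the metric names and the "actual" numbers are made up for illustration.

```python
def variance_report(forecast: dict[str, float],
                    actual: dict[str, float]) -> dict[str, float]:
    """Relative deviation of actual from forecast, per metric."""
    return {name: (actual[name] - forecast[name]) / forecast[name]
            for name in forecast}


forecast = {"cost_per_task": 0.49, "hours_saved": 400.0,
            "retry_rate": 0.30, "review_rate": 0.08}
# Hypothetical first-month actuals.
actual = {"cost_per_task": 0.68, "hours_saved": 300.0,
          "retry_rate": 0.42, "review_rate": 0.11}

report = variance_report(forecast, actual)
for name, deviation in report.items():
    print(f"{name}: {deviation:+.0%}")
```

The sign means different things per metric: a positive deviation is bad for cost, retry rate, and review rate, and good for hours saved.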

Frequently overlooked: the AI failure cost

The line item nobody puts on the cost stack. A wrong refund, a wrong escalation, a wrong customer-facing message, all cost real dollars and real CSAT. Estimate the wrong-answer rate, multiply by the cost of a wrong answer (support ticket, refund, churn risk), and add it to the cost stack. If you cannot estimate it, run a shadow mode for a month and measure it. Then put it on the slide.
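
As arithmetic, the failure line is a rate times a cost. A short Python sketch, assuming the wrong-answer rate comes from a reviewed shadow-mode sample; the sample counts and the $25 cost per wrong answer are hypothetical.

```python
def wrong_answer_rate(wrong: int, reviewed: int) -> float:
    """Rate observed in a reviewed shadow-mode sample."""
    if reviewed == 0:
        raise ValueError("need at least one reviewed run")
    return wrong / reviewed


def failure_cost_per_task(rate: float, cost_per_wrong_answer: float) -> float:
    return rate * cost_per_wrong_answer


rate = wrong_answer_rate(wrong=12, reviewed=600)
failure = failure_cost_per_task(rate, cost_per_wrong_answer=25.0)
print(f"{rate:.1%} wrong, ${failure:.2f} expected failure cost per task")
```

At these made-up numbers the failure line alone is about as large as a $0.49 per-task agent cost, which is why it belongs on the slide.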

FAQ

What is a reasonable payback period for an AI agent?
Six to nine months for operations replacement, three for revenue-side use cases. Anything past 12 months should be re-scoped or rejected.
How do I estimate cost before building?
Add up tokens per run, tool calls per run, observability and storage (5 to 15 percent of inference), and human review time. Apply the retry multiplier (1.2x to 1.6x) to the inference and tool lines. Multiply by expected volume. Add 30 percent contingency.
When is an AI agent not worth building?
Cost inversion, asymmetric failure cost, or rare-task non-amortization. Three honest no-go cases.
What costs do people forget?
Retry overhead, observability storage, human-in-loop review, wrong-answer cost, and engineering iteration hours.
How do I defend the business case to a CFO?
Five slides: task today, design and cost stack, value stack, payback with sensitivity, kill criteria. The kill slide earns the budget.

Closing the loop

The honest math is the friend of agent adoption, not its enemy. Teams that ship the numbers including the inconvenient ones build trust faster than teams that ship cherry-picked case studies. Related: cost control tactics (the operational levers), cost models (the pricing axes), and agent economics from first principles.

Sources