An agent that runs ten thousand times a day is a different business from one that runs ten times. Pricing pages do not reflect this and most founders learn it after they ship. This post walks through the actual per-token costs in May 2026, what a single agent run costs end-to-end, where margins go to die, why prompt caching matters more than people think, and the three pricing patterns that survive contact with real usage.

The numbers come from public provider pricing pages: Anthropic's Claude Sonnet 4 list price (Anthropic Pricing, retrieved 2026-05-09), OpenAI's GPT-4.1 and GPT-4.5 pricing (OpenAI API Pricing, retrieved 2026-05-09), and Google's Gemini 2.0 pricing (Google AI Pricing, retrieved 2026-05-09). Token costs have fallen roughly 4x year-over-year through 2024 and 2025; the slope has flattened in 2026 but the trend continues.

The per-token reality in 2026

Frontier model pricing in May 2026 (per million tokens):

| Model | Input | Cached input | Output |
| --- | --- | --- | --- |
| Claude Sonnet 4 | $3.00 | $0.30 | $15.00 |
| Claude Opus 4 | $15.00 | $1.50 | $75.00 |
| GPT-4.1 | $2.00 | $0.50 | $8.00 |
| Gemini 2.0 Pro | $1.25 | $0.31 | $5.00 |
| Llama 3.3 70B (self-host estimate) | $0.50 | n/a | $0.75 |

The output token rate is what dominates total cost on most agent tasks. Input tokens accumulate faster (the system prompt, context, and tool definitions all live in input) but output tokens are 4x to 5x more expensive per token on the models above. A back-of-envelope rule that holds in 2026: cost per run is roughly the output token count multiplied by the output rate, plus 20 to 60 percent for input, with the low end for agents that cache aggressively.
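
To make the rule concrete, here is a minimal sketch in Python. The function name and the uplift range are illustrative assumptions, not from any library; the rates come from the table above.

```python
def quick_estimate(output_tokens: int, output_rate_per_m: float,
                   input_uplift: float = 1.2) -> float:
    """Back-of-envelope per-run cost: output cost plus an input uplift.

    input_uplift is an assumption: ~1.2 with a warm prompt cache,
    closer to 1.6 when most input is billed at the uncached rate.
    """
    return output_tokens * output_rate_per_m / 1_000_000 * input_uplift

# 600 output tokens on Claude Sonnet 4 ($15.00 per million output):
print(quick_estimate(600, 15.00))       # ~$0.011 with cached input
print(quick_estimate(600, 15.00, 1.6))  # ~$0.014, matching the worked run below
```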

Unit economics of a single run

Take a customer-support draft-reply agent on Claude Sonnet 4. Assume a per-run profile of 2,800 cached input tokens (system prompt and tool definitions), 1,500 uncached input tokens (the ticket and retrieved context), and 600 output tokens (the draft reply).

Cost per run: (2,800 × $0.30 + 1,500 × $3.00 + 600 × $15.00) per million tokens = $0.00084 + $0.0045 + $0.009 ≈ $0.014 per run before any retries.

That is the floor. Real agents loop, call tools, and retry. A more realistic profile for the same task with two tool calls and one revision pass lands around $0.06 to $0.09 per run. Multi-agent variants of the same task hit $0.30 to $1.50 per run. The variance is the difference between profitable and not.
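
As a sketch of where those numbers come from, here is the same task modelled step by step. The per-step token counts are assumptions for illustration; the rates are the Claude Sonnet 4 list prices from the table.

```python
# Claude Sonnet 4 list prices, $ per million tokens (see table above).
CACHED_RATE, INPUT_RATE, OUTPUT_RATE = 0.30, 3.00, 15.00

def step_cost(cached: int, uncached: int, output: int) -> float:
    """Cost of one model call given its token profile."""
    return (cached * CACHED_RATE + uncached * INPUT_RATE
            + output * OUTPUT_RATE) / 1_000_000

# Assumed profile: initial draft, two tool calls (context grows each
# time), then a revision pass over the full history.
run = [
    step_cost(2_800, 1_500, 600),  # initial draft
    step_cost(2_800, 2_400, 300),  # tool call 1
    step_cost(2_800, 3_100, 300),  # tool call 2
    step_cost(2_800, 4_000, 700),  # revision pass
]
print(f"${sum(run):.3f} per run")  # ~$0.065, inside the $0.06-$0.09 band
```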

The companion post on agent cost models walks through cost projection per task type; the economics of bootstrapped agents piece covers company-level economics for an indie founder.

Where margins compress

Three margin pressures recur across agent companies:

Long-context tasks. Agents that need 50,000-token contexts pay 25x more on input than agents that need 2,000-token contexts, and the margin compression is direct because those tokens buy context rather than customer-visible output. Document-analysis agents, research agents, and code-refactor agents all sit in this bucket.

Multi-agent architectures. As covered in multi-agent systems explained, multi-agent designs inflate token usage by 5x to 15x. Anthropic's research team reported 15x for their research agent (built-multi-agent-research-system, retrieved 2026-05-09). Multi-agent should be reserved for tasks where the additional cost is genuinely justified.

Retries and verification. Reliability work (the 80-test discipline covered in how we test AI agents) requires the agent to verify its own output before exiting the loop. Verification adds 30 to 60 percent to per-run cost. Worth it, but it is still margin.

The fourth, less obvious pressure is provider price-tier shifts. When a model is reclassified into a higher tier or a new tier appears above it (Opus sits at 5x Sonnet's rates; GPT-4.5 launched between 4 and 5), prices on the previous tier do not always drop. Agents pinned to a deprecated model can see costs rise even as new models get cheaper.

Prompt caching changes the math

Prompt caching lets the provider serve repeated input prefixes at a fraction of the standard rate. Anthropic's prompt caching is $0.30 per million versus $3.00 per million standard for Sonnet 4: a 90 percent discount on the cached portion. OpenAI and Google have similar mechanisms with different cache TTL and miss policies.

For an agent with a 5,000 token system prompt that runs 1,000 times a day, naive cost is 5,000 × 1,000 × $3.00 / 1,000,000 = $15 per day on system prompt alone. With caching, that drops to $1.50 per day, freeing 90 percent of the budget for output tokens that produce actual customer value.
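
On Anthropic's API the cached prefix is marked explicitly. A minimal sketch, assuming the official Python SDK and a placeholder model id; the usage fields at the end show whether a call actually hit the cache.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

SYSTEM_PROMPT = "..."  # the 5,000-token system prompt from the example above

response = client.messages.create(
    model="claude-sonnet-4",  # placeholder id for this sketch
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": SYSTEM_PROMPT,
        # Everything up to this marker becomes a cacheable prefix;
        # identical prefixes on later calls bill at the cached rate.
        "cache_control": {"type": "ephemeral"},
    }],
    messages=[{"role": "user", "content": "Draft a reply to this ticket: ..."}],
)

# The usage block reports the split: cache writes carry a premium over
# the base input rate, cache reads get the 90 percent discount.
print(response.usage.cache_creation_input_tokens,
      response.usage.cache_read_input_tokens)
```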

The catch: cache TTLs are short (5 minutes default for Anthropic). Agents that run sporadically may pay the full rate often enough to make the discount nominal. The fix is keep-alive logic that maintains the cache between bursts, which adds complexity but pays for itself within hours.
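
What that keep-alive logic can look like, as a minimal sketch reusing the client and system prompt from the snippet above; the interval constant is an assumption tuned to Anthropic's 5-minute default TTL.

```python
import threading

KEEPALIVE_INTERVAL = 4 * 60  # seconds; refresh comfortably inside the 5-minute TTL

def keep_cache_warm(stop: threading.Event) -> None:
    """Re-send the cached prefix on a timer so the provider-side cache
    entry never expires between bursts of real traffic."""
    while not stop.wait(KEEPALIVE_INTERVAL):
        client.messages.create(
            model="claude-sonnet-4",   # same placeholder id as above
            max_tokens=1,              # minimal output spend per ping
            system=[{
                "type": "text",
                "text": SYSTEM_PROMPT,  # byte-identical prefix -> cache hit
                "cache_control": {"type": "ephemeral"},
            }],
            messages=[{"role": "user", "content": "ping"}],
        )

stop = threading.Event()
threading.Thread(target=keep_cache_warm, args=(stop,), daemon=True).start()
# ... agent traffic ...
stop.set()  # stop pinging once the burst window closes
```

At the cached rate, each ping costs about $0.0015 for the 5,000-token prefix, so keeping the cache warm around the clock runs well under a dollar a day against the $13.50 saved.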

The three pricing patterns that survive

Per-seat with usage cap. $20 to $100 per user per month, includes N runs, hard cap above. Predictable for both sides. Caps usage and creates a buyer-side pressure to optimise prompts down. Lindy and several mid-market agents use this. The risk is that power users hit the cap and churn or self-throttle.

Pay-per-action. $0.10 to $1.00 per completed action. Aligned with the underlying cost structure, but it sacrifices predictability and creates buyer anxiety; many buyers refuse to deploy agents on metered pricing because they cannot forecast their bill. Used in API-first products and developer tools.

Hybrid. $30 to $50 per seat, includes 500 runs, $0.20 per run above the threshold. Combines predictability with protection against margin compression on heavy users. The dominant emerging pattern in 2026 across agent SaaS. Gravity uses a variant of this for the post-waitlist tier.
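
The billing arithmetic for the hybrid pattern is simple enough to sketch. The function below is illustrative, and it assumes the run allowance pools across seats:

```python
def hybrid_monthly_bill(seats: int, runs: int, base: float = 40.0,
                        included_per_seat: int = 500,
                        overage_rate: float = 0.20) -> float:
    """Hybrid pricing: per-seat base fee, a pooled run allowance,
    then a metered rate on every run above the threshold."""
    overage = max(0, runs - seats * included_per_seat)
    return seats * base + overage * overage_rate

# A 5-seat team running 4,100 tasks in a month:
# 5 x $40 + (4,100 - 2,500) x $0.20 = $200 + $320
print(hybrid_monthly_bill(seats=5, runs=4_100))  # 520.0
```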

Whatever pattern you pick, transparency on costs is now a competitive feature. Buyers in 2026 expect to see input rates, output rates, and per-run cost visible in the dashboard. Vendors who hide the meter lose enterprise deals on procurement review.

Why most agent companies are not yet profitable

The honest list:

  1. Pricing inertia. SaaS pricing trained buyers on flat per-seat fees. Agent costs are per-task. The mismatch produces under-priced contracts that look great on revenue charts and bleed margin.
  2. Multi-agent overuse. Marketing rewards "multi-agent" as a feature; engineers build it because product asked; CFO discovers the bill three quarters later.
  3. Retry tax. Reliability work is right but expensive. Companies that under-invest in reliability win short-term margin and lose long-term trust; companies that over-invest win trust and lose short-term margin.
  4. Customer success cost on net new categories. Buyers do not yet know how to operate agents. Heavy CS load through year one consumes the margin that should fund R&D.
  5. Token costs are dropping but not fast enough. 4x year-over-year through 2025 was helpful but the pricing pages did not drop 4x. The benefit accrued mostly to providers and to vendors with the leverage to renegotiate.

Companies that have hit profitability share a profile: single-agent designs, aggressive prompt caching, hybrid pricing, customer-managed setup (low CS load), and a tight SKU tier that limits usage at the lowest plan. The shape is more SaaS than usage-based startup.

Frequently asked questions

What does an AI agent actually cost to run?

A typical task-completing agent on Claude Sonnet 4 costs $0.05 to $0.50 per run as of May 2026, depending on context size and tool calls. The 2025 to 2026 trend has been a 4x cost decrease per token at constant capability. The variable that dominates is context size, not model choice.

What are typical gross margins on an agent SaaS product?

Gross margins for agent products in 2026 range from 40 percent at the low end (heavy multi-agent or research tasks) to 80 percent at the high end (single-agent operational tasks with cached prompts). The industry median sits around 60 percent; compare traditional SaaS at 75 to 85 percent. Margins compress fastest on long-context tasks because input cost scales roughly linearly with context length.

Why are most agent companies not profitable yet?

Three reasons: customers expect SaaS-style flat pricing while costs are per-task, provider prices on older tiers do not drop when capability tiers shift, and multi-agent architectures inflate token usage by 5x to 15x. Companies that hit profitability early use single-agent designs, prompt caching, and pricing that includes a per-action component.

What is prompt caching and why does it matter for economics?

Prompt caching lets the LLM provider charge a reduced rate for input tokens that match a previously seen prefix. Anthropic's prompt caching cuts the cost of cached input by 90 percent: $0.30 per million versus $3.00 per million standard. For an agent that reuses a long system prompt across calls, prompt caching shifts gross margin by 10 to 20 percentage points.

How should I price an agent product?

Three patterns work: per-seat with a usage cap (predictable but caps usage), pay-per-action (aligned with cost but penalises power users), and hybrid (a base seat plus usage above a threshold). The hybrid is the dominant emerging pattern in 2026 because it preserves predictability while protecting margins on heavy users.

Three takeaways before you close this tab

  1. Output tokens, tool calls, and retries dominate per-run cost. Price against the realistic looped profile ($0.06 to $0.09 for a single-agent task), not the single-call floor.
  2. Prompt caching is the highest-leverage cost optimisation available: 90 percent off the repeated prefix, provided keep-alive logic keeps the cache warm between bursts.
  3. Hybrid pricing, a base seat plus metered usage above a threshold, is the pattern that preserves predictability for buyers and margin for you.

Sources

  Anthropic Pricing, retrieved 2026-05-09.
  OpenAI API Pricing, retrieved 2026-05-09.
  Google AI Pricing, retrieved 2026-05-09.
  Anthropic, built-multi-agent-research-system, retrieved 2026-05-09.