How much can prompt caching actually save?

Cached input tokens are typically priced at roughly one-tenth the rate of fresh input tokens on Anthropic and OpenAI, and Google's context caching prices vary similarly. For agents with a long static system prompt and tool catalog, 60 to 90 percent reduction in input-token cost is common after caching is fitted to the prompt structure.

When should I downgrade to a smaller model?

When evaluations on your actual task distribution show the smaller model passes within a tolerance band on the metrics you care about. Downgrade gated by evals, never by gut. Re-run the evals on every prompt change.

What is a runaway agent and how do I stop it?

A runaway agent is one that loops or fans out beyond intended bounds. Stop it with a platform-side cap on total tool calls per run, a hard token budget, and a circuit breaker that halts on suspicious cost-per-step velocity.

How do I attribute cost per user or tenant?

Tag every model and tool call with run-id, agent-id, tenant-id, user-id, and capability-id. Aggregate on the warehouse. The OpenTelemetry GenAI semantic conventions standardize the trace shape so attribution survives a vendor swap.

Is batching worth it for agents?

Yes when latency budgets permit. Batched API tiers on OpenAI and Anthropic price input at roughly half the synchronous rate, with 24-hour service-level expectations. Best for background evaluation runs, embedding rebuilds, and offline analytics.

How do I keep context windows from bloating cost?

Trim retrieved context to top-k results with a relevance floor. Summarize prior turns on a sliding window. Move long-lived state into structured memory, not the prompt. Verify with a token-per-turn metric in the trace.

AI Agent Cost Control: 9 Production Tactics That Cut Spend 40-90%

This is the operational companion to AI agent cost models explained. That post covers the pricing axes vendors charge along. This one covers the tactics that reduce the bill in production. Most teams discover the bill before the discipline; the tactics below are what teams reach for when the monthly invoice surprises the CFO.

Nine tactics, ordered roughly by leverage. Prompt caching is usually first because the payoff is high and the engineering cost is low. Model routing and eval-gated downgrades come next, and together they account for most of the headline 40 to 90 percent reductions you will see in vendor case studies. Batching, context discipline, runaway stops, and cost attribution close the list. None of these require swapping vendors. All of them require an eval harness you trust.

Where the money actually goes in a production agent

Token cost is the obvious meter, but in any agent running for more than a week it is rarely the dominant one. Stripe transfers, paid search APIs, premium data providers, SMS sends, vendor-side reranking, and the storage and egress of long traces add up faster than people expect. The line item that ate the most budget at most teams in 2024 to 2025 was retry overhead: failed calls that consumed tokens, the retry of the same call, and the human-in-loop review that the retry triggered. That is three line items charged on one transaction.

Three rules before the tactics. First, instrument before you optimize. You cannot cut a number you cannot see. Second, optimize cost-per-task, not cost-per-call. A 50 percent cheaper call that needs to run three times is a 50 percent more expensive task. Third, the cheapest token is the one you never send. Most of the savings here come from not sending tokens, not from buying cheaper ones.

Tactic 1: Prompt caching

Anthropic prompt caching prices cached input tokens at roughly 10 percent of the standard rate, with a five-minute default TTL and a one-hour extended TTL on supported tiers (Anthropic, 2025). OpenAI similarly discounts cached input tokens, and Google's context caching on Gemini follows the same shape (OpenAI, 2025; Google AI for Developers, 2025). For an agent with a long static system prompt, tool catalog, and few-shot examples, fitting the cache boundary to those static blocks routinely cuts input-token cost 60 to 90 percent.

Place the static system prompt and tool definitions at the top of the message list, before any per-request content.
Mark the cache breakpoint at the end of the static block.
Keep per-request content below the breakpoint, short, and deterministic where possible.
Measure cache-read versus cache-write token counts in the trace. If write ratio is > 20 percent on hot agents, the boundary is wrong.

Tactic 2: Model routing

Most agents do not need the largest model for every step. A planner step might genuinely need Opus-class reasoning; a tool-arg extraction step does not. Routing means choosing the smallest acceptable model per step. The classifier is small, the savings are large, and the implementation lives at the agent framework level, not the model API.

Classify each agent step by required reasoning depth, instruction-following demand, and tolerance for hallucination.
Map step classes to model tiers in a routing table that lives in code, not in a wiki.
Default route is the cheaper model; the more expensive model requires an explicit reason in the trace.
Re-evaluate routing on every prompt change. A new instruction can shift a step's class.

Tactic 3: Eval-gated downgrades

Downgrading from Sonnet to Haiku, or GPT-4o to GPT-4o-mini, can cut step cost by 5x to 12x. The risk is silent quality regression. The defense is a held-out eval suite that runs on every model swap and on every prompt change. Anthropic, OpenAI, and Google all publish evaluation guidance that says the same thing: do not change models without re-running task-specific evals (Anthropic Evaluations, 2025).

Build a labeled eval set from real (or replayed) production traffic. 100 to 500 examples per task class is usually enough.
Define pass thresholds per metric (exact match, F1, judge-model agreement) before swapping.
Run evals on a candidate model. If pass rate is within the agreed band, downgrade. If not, keep the larger model.
Re-run evals on every prompt change. Drift can re-open a regression that was closed last week.

Tactic 4: Batching and async

Batched API tiers on Anthropic and OpenAI price input at roughly half the synchronous rate, in exchange for a 24-hour service-level expectation. For background work like nightly evaluation runs, embedding rebuilds, document re-classification, and offline analytics, batching is a near-free 50 percent cut on the tokens it covers. Latency-sensitive paths stay on the synchronous tier.

Identify workloads that can tolerate a 24-hour SLA. These are almost always background, not user-facing.
Move them to the batch endpoint. Track success rate; batch tiers do occasionally return partial failures.
Keep an idempotency key per batch item so re-runs are safe.
Surface batch progress in the same dashboard as synchronous runs. The team will forget about batches otherwise.

Tactic 5: Context window discipline

Long context windows are a billing trap. A 200K-token window does not cost you 200K tokens; it costs you whatever you put in it on every call. The tactic is to keep the prompt as short as the task requires. Long-lived state moves into structured memory, not the prompt. Retrieved chunks are top-k with a relevance floor, not "everything that matched".

Set a per-turn token budget in code. Block calls that exceed it without an explicit override.
Trim retrieved context to top-k with a similarity floor. Below the floor, the chunk does not help.
Summarize prior turns on a sliding window. Old turns become a short summary in a system message.
Move durable state (user preferences, prior decisions) into a memory store; reference IDs in the prompt, not full content.

Tactic 6: Stopping runaway agents

The most expensive bug in agent code is a loop that should have stopped. A misclassified retry-able error, an over-eager planner, a tool that returns "try again" instead of "give up", and the budget evaporates over a weekend. The same controls described in the blast radius post cover this; the cost-specific cuts are below.

Hard cap on total tool calls per run, enforced at the platform layer.
Hard cap on total token spend per run, enforced before the next call fires.
Circuit breaker on cost-per-step velocity. If the last 10 steps each cost more than the median, halt.
Per-tenant daily ceiling with alerts at 50, 75, and 90 percent. Not just at 100 percent.

Tactic 7-9: Cost attribution, structured outputs, and tool-call hygiene

Three smaller-but-compounding tactics. First, cost attribution. Every model and tool call carries run-id, agent-id, tenant-id, user-id, capability-id, and step-id. OpenTelemetry's GenAI semantic conventions standardize the trace shape so attribution survives a vendor swap (OpenTelemetry GenAI, 2025). The warehouse aggregates; finance gets a per-tenant view.

Second, structured outputs. Force the model to return JSON conforming to a schema. Saves tokens (no rambling), saves retry rounds (fewer parse failures), saves a follow-up call to "clarify what you meant".

Third, tool-call hygiene. Idempotency keys on every write tool prevent paid retries from charging twice. Pre-validation of tool arguments fails cheap rather than charging the model to discover its own typo. And: never let a tool return text the model has to re-summarize; return the structured result and the model can quote it directly.

FAQ

How much can prompt caching actually save?: Cached input tokens are priced at roughly one-tenth the fresh rate across Anthropic, OpenAI, and Google. For agents with a long static system prompt and tool catalog, 60 to 90 percent input-token reduction is common after fitting the cache boundary.
When should I downgrade to a smaller model?: When evals on your task distribution show the smaller model passes within the agreed tolerance band. Eval-gated. Re-run on every prompt change.
What is a runaway agent and how do I stop it?: A loop or fan-out beyond intended bounds. Stop with platform-side caps on total tool calls per run, hard token budgets, and a circuit breaker on cost velocity.
How do I attribute cost per user or tenant?: Tag every call with run-id, agent-id, tenant-id, user-id, capability-id. Aggregate on the warehouse. OpenTelemetry GenAI conventions standardize the trace shape.
Is batching worth it for agents?: Yes when latency permits. Batched tiers price input at roughly half synchronous, with 24-hour SLA. Background work only.
How do I keep context windows from bloating cost?: Trim retrieved context to top-k with a relevance floor. Summarize prior turns. Move durable state into a memory store referenced by ID.

Closing the loop

The order of impact for a typical production agent: caching first, routing and eval-gated downgrades second, runaway stops and attribution alongside, batching and context discipline rounding out. The order of difficulty is roughly the opposite. Start where the leverage is highest and the eval pain is lowest. Related reading: how vendors price agent runs, agent economics from first principles, and the cost vs ROI framing for the CFO conversation.

Sources

Anthropic, "Prompt caching", 2025, docs.anthropic.com
OpenAI, "Prompt caching", 2025, platform.openai.com
Google AI for Developers, "Context caching", 2025, ai.google.dev
Anthropic, "Develop tests and evaluations", 2025, docs.anthropic.com
OpenTelemetry, "GenAI semantic conventions", 2025, opentelemetry.io