Prompt engineering for AI agents is not the same craft as prompt engineering for chatbots. A chatbot prompt shapes one response. An agent prompt shapes a loop: the model picks a tool, reads the result, decides whether to call another tool, and eventually returns to the user. Every step is conditional on the last. The prompt has to survive every branch.
This guide is built from the patterns that hold up across Gravity production agents and the published guidance from Anthropic, OpenAI, and DeepMind. It covers the anatomy of a working system prompt, how to write tool descriptions, how to set stopping conditions, when to use few-shot examples, and how to evaluate a prompt change before it ships. The goal is a prompt you can put on a roadmap and not be embarrassed by in six months.
Why agent prompts are different
In a chatbot, the model sees a user message and produces an answer. The prompt sets persona, tone, and a few rules. The chain is short and the surface area is small. In an agent, the model sees a goal, a list of tools, and a running scratchpad of tool calls and results. It must decide which tool to call, when to call it, when to stop, and how to summarize. A single agent run can include 5 to 50 model invocations.
Three properties of this loop change how you write the prompt. First, the model rereads parts of the system prompt on every turn. Verbosity now multiplies cost. Second, the model picks tools using only their names and descriptions. A tool with a vague description is invisible. Third, errors compound. A small ambiguity in turn one becomes a wrong tool call in turn two and a hallucinated result in turn three. The Anthropic Building Effective Agents writeup (2024) and the OpenAI Practical Guide to Building Agents (2024) both emphasize these properties as the source of most production failures.
Anatomy of a production system prompt
A working system prompt for a production agent has five sections, in order. Reorder these and the model's tool-selection accuracy drops measurably in our internal evals.
1. Identity and scope
Two to four sentences. Who the agent is, who it serves, and the explicit boundary of what it can do. The boundary is the most important part; it lets the model refuse cleanly when a request falls outside.
2. Tool catalog with usage rules
Every tool gets a name, a one-line purpose, a parameter list with types, and a "when to use this" sentence. Many agent builders skip the "when to use" line. Without it, the model has to infer tool selection from the name alone, which fails on ambiguous tools.
3. Step plan or reasoning frame
For multi-step tasks, include a short numbered plan: "1) read the email, 2) check calendar, 3) draft response." The plan gives the model a stable structure to fall back on when intermediate results are surprising. For one-shot tasks, skip this section.
4. Stopping conditions
Explicit rules for when to stop calling tools and return to the user. Without stopping conditions, agents loop until they hit a turn limit. The most common stopping rule: "stop when the user's question is fully answered or when you need information only the user can provide."
5. Output format
What the final response to the user looks like. Markdown vs. JSON, structure, length. If the agent reports back to a downstream system, define the JSON schema and reject unknown fields in code.
Writing tool descriptions
Tool descriptions are where most agent prompt engineering effort actually pays off. The Anthropic tool-use documentation (2024) and OpenAI function-calling guide (2024) both recommend treating descriptions as miniature system prompts: dense, specific, with examples of correct calls.
A working tool description has four parts. Purpose: one sentence that ends with what the tool returns. When to use: a clause with the trigger condition. When NOT to use: at least one clause covering a common misfire. Example call: one or two literal parameter sets.
Example. A poor description: "Searches the user's email." A working description: "Searches the user's email for messages matching a query, returning subject, sender, and date for up to 20 results. Use when the user asks about email content. Do NOT use to send or modify emails; use send_email or modify_email for those. Example: search_email(query='invoice from Acme', max_results=10)."
The second version is 70 tokens vs. 5 for the first. With prompt caching that 70-token cost is paid once per cache cycle, not per call. The accuracy gain on tool selection in our internal evals is large; the marginal cost is small.
Stopping conditions
Agents that do not know when to stop are the second-most-common failure mode after wrong tool selection. The Anthropic agent design guidance (2024) and Google DeepMind multi-turn evaluation work (2024) both recommend explicit stopping rules in the system prompt plus a hard turn limit in code.
Three stopping rules cover most production agents. Information complete: stop when the user's question can be answered with the data gathered. Information missing: stop and ask the user when a required input is unknown after one search attempt. Error budget exceeded: stop and escalate when two tool calls in a row return errors. The error-budget rule is the unsung hero; without it, agents retry failing tools indefinitely.
Always pair prompt-level stopping rules with a hard turn cap in code. We use 10 turns as the default for personal-productivity agents and 20 for ops agents. Above the cap, the run halts and a notification fires.
Few-shot examples
Few-shot examples are tokens you pay for on every call. They are worth the cost when the desired behavior is hard to specify in rules alone, which is roughly: chained tool sequences, ambiguous user intent, and structured-output tasks with non-obvious formatting.
Three rules for using examples in agent prompts. Use 2 to 5. More than 5 rarely improves accuracy and adds latency. Cover the edge cases, not the common case. The model handles the common case fine; show it what to do when the input is malformed. Cache the prefix. Put the examples at the top of the system prompt with the static tool descriptions so prompt caching applies. Anthropic prompt caching cuts input cost on cached portions by up to 90 percent and latency by up to 85 percent; OpenAI prompt caching applies a 50 percent discount to cached input tokens (Anthropic, 2024; OpenAI, 2024).
Eval-gated prompt changes
The single discipline that separates production-grade agent prompts from hobby ones is the eval set. An eval set is a list of 50 to 200 traces with expected outcomes per agent capability. Every prompt change runs against the eval set; any drop of more than one or two points blocks the change.
What to put in the eval set. Happy path: 20 to 40 typical user requests with the expected tool sequence and final answer. Edge cases: 10 to 20 cases that exposed bugs in production. Adversarial: 5 to 10 prompt-injection or jailbreak attempts. Regression suite: every bug ever fixed has a permanent eval case.
For more on this discipline, see how we run 80+ tests per agent capability and how to test agents before deploy. The cost-economics overlap with AI agent cost optimization. Every avoided regression is an avoided runtime cost.
Common mistakes
Five mistakes show up in almost every first-version agent prompt.
Softness over specificity. "Use good judgment" tells the model nothing. Replace with a rule and a condition: "if the request mentions a date older than 30 days, confirm the date with the user first."
Hidden tool selection. The model is told to be "helpful" without a rule for which tool to pick when two could apply. Add an explicit decision tree in the system prompt.
No output schema. The final response varies in shape between runs. Specify the schema and validate in code.
Per-user system prompts. Inserting user-specific data at the top of the system prompt breaks prompt caching. Put user-specific data in the user message, not the system prompt.
No version control. Prompts get edited in a hosted UI, then nobody can roll back. Store the prompt in git, deploy with a hash, and log the hash on every run.
| Mistake | Fix | Effort |
|---|---|---|
| Vague tool descriptions | Add "when to use", "when NOT to use", example call | Low |
| No stopping rule | Add 3 stopping conditions + hard turn cap in code | Low |
| No eval set | Capture 50 traces from production, label expected outcomes | Medium |
| Per-user system prompt | Move user data to user message | Low |
| No version control | Store prompts in git, log hash per run | Medium |
Frequently asked questions
How is prompt engineering for AI agents different from chatbot prompting?
Chatbot prompts shape one response. Agent prompts shape a loop where the model picks tools, reads results, and decides when to stop. The system prompt must include tools, stopping rules, and recovery behavior.
What should be in a production agent system prompt?
Identity and scope, tool catalog with usage rules, step plan, stopping conditions, output format. In that order.
Should I use few-shot examples in agent prompts?
Use 2 to 5 examples when the desired tool sequence is not obvious from tool descriptions. Pair with prompt caching to amortize the token cost.
How do I evaluate an agent prompt change?
Build a 50 to 200 case eval set per capability. Run the new prompt against it. Block changes that drop pass rate by more than one or two points.
What is the biggest prompt mistake new agent builders make?
Soft instructions instead of specific rules. Replace "be careful" with concrete conditions and actions.
Three things to ship this week
- Rewrite every tool description with purpose, when-to-use, when-NOT-to-use, and an example call.
- Add three stopping conditions to your system prompt and a hard turn cap in code.
- Capture 50 traces from production and label expected outcomes. That is your starter eval set.
Sources
- Anthropic, "Building Effective Agents", 2024, anthropic.com
- Anthropic, "Tool use with Claude", 2024, docs.anthropic.com
- OpenAI, "A Practical Guide to Building Agents", 2024, openai.com
- OpenAI, "Function calling guide", 2024, platform.openai.com
- Anthropic, "Prompt caching with Claude", 2024, anthropic.com
- OpenAI, "API prompt caching", 2024, openai.com