AI Agent for Help Scout Conversation Tagging

Key takeaways

Three tags max. Per conversation. From a fixed taxonomy.
Twenty to forty canonical tags. Fewer is too coarse, more breeds duplicates.
Tag, never reply. Two different problems. Two different agents.
Add-only. The agent never removes a tag. Removal is human work.
Weekly report. What customers actually ask about, ranked by tag.

What this agent does

Help Scout supports tags but does not enforce them. Over time a support inbox accumulates hundreds of tags, most of them near-synonyms applied by different support reps in different weeks. Searching the tag cloud becomes impossible. Reporting on themes becomes unreliable. New reps do not know which tag to apply, so they invent one.

The agent fixes the discipline problem in two steps. First, at setup, it deduplicates the existing tag soup into a canonical taxonomy of twenty to forty tags. You approve the taxonomy. Second, going forward, every new conversation gets between one and three tags from the taxonomy, applied within two seconds of the conversation landing in Help Scout. Existing tags applied by humans stay. The agent never overrides a person.
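The setup-time deduplication can be pictured as a grouping pass over the existing tags. This is a minimal sketch, not the agent's actual method: it assumes plain string similarity from Python's difflib is good enough for a first draft, and the function names and the 0.85 threshold are illustrative. A human still reviews and renames the groups before approving the taxonomy.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def normalize(tag: str) -> str:
    """Lowercase and collapse separators so 'Password-Reset' matches 'password reset'."""
    return " ".join(tag.lower().replace("-", " ").replace("_", " ").split())


def group_near_synonyms(tags: list[str], threshold: float = 0.85) -> dict[str, list[str]]:
    """Greedily group tags whose normalized forms are at least `threshold` similar.

    The first tag seen in a group supplies its provisional canonical name.
    The threshold is an assumption; tune it against your own tag cloud.
    """
    groups: dict[str, list[str]] = defaultdict(list)
    for tag in tags:
        norm = normalize(tag)
        for canonical in groups:
            if SequenceMatcher(None, norm, canonical).ratio() >= threshold:
                groups[canonical].append(tag)
                break
        else:
            # No existing group is close enough, so this tag starts a new one.
            groups[norm].append(tag)
    return dict(groups)
```

String similarity only catches spelling variants. True synonyms such as "refund" and "money back" still need a human or a classifier pass.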

What the agent does not do: reply, change the conversation status, assign, close, or notify customers. The single write action is "add tag". Anything else is out of scope. The same scoping logic shows up in how an agent triages tickets in Zendesk. Picking one verb and sticking with it is what keeps the agent boring and useful.

Sources of truth

Help Scout, plus a flat YAML file with the taxonomy. Nothing else.

The first message of the customer thread. Used for classification. Subsequent messages do not change the tags.
The mailbox the conversation arrived in. Used as a routing hint when the taxonomy has mailbox-specific tags (Sales, Support, Billing).
The customer's metadata in Help Scout. Plan tier, lifetime value, segment. Used only if the taxonomy includes tier-based tags.
The taxonomy YAML. Stored in your Git repo or a Gravity workspace. Loaded at the start of each run. Edits are version-controlled.

The agent does not read attachments. Most attachments in a support context are screenshots and the classifier cannot reliably tag from images. Files that matter for classification are rare enough that human review handles them faster than the agent could.
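The inputs listed above reduce to a small record. The sketch below assumes a hypothetical conversation dictionary, not Help Scout's actual payload, with threads ordered oldest first. The field names (threads, type, body, mailbox, plan) are illustrative only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ClassifierInput:
    first_message: str
    mailbox: str
    plan_tier: Optional[str]  # only populated when the taxonomy has tier-based tags


def build_classifier_input(conversation: dict, use_tier_tags: bool) -> Optional[ClassifierInput]:
    """Reduce a conversation to the signals the agent reads.

    `conversation` uses a hypothetical shape, not Help Scout's real payload:
    {"mailbox": str, "threads": [{"type": str, "body": str}, ...], "customer": {"plan": str}}
    Attachments are ignored on purpose.
    """
    threads = conversation.get("threads", [])
    first = next((t for t in threads if t.get("type") == "customer"), None)
    if first is None:
        return None  # nothing from the customer yet, so nothing to classify
    tier = conversation.get("customer", {}).get("plan") if use_tier_tags else None
    return ClassifierInput(
        first_message=first["body"],
        mailbox=conversation.get("mailbox", ""),
        plan_tier=tier,
    )
```

Later customer messages are present in the input but never reach the classifier, which is what keeps the tags stable across replies.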

The taxonomy is the product

The single decision that determines whether this agent is useful is the taxonomy. Fewer than twenty tags is too coarse; the report at the end of the week says half the inbox is Other. Sixty tags is too fine; the same conversation could plausibly take three of them and consistency collapses. The sweet spot is between twenty and forty tags grouped into four to six themes.
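The size rule is easy to check mechanically before a taxonomy is approved. A minimal sketch, assuming the taxonomy has been loaded into a dictionary that maps each theme to its list of tags; the function name and warning wording are illustrative.

```python
def check_taxonomy_size(taxonomy: dict[str, list[str]]) -> list[str]:
    """Return human-readable warnings; an empty list means the size looks reasonable."""
    warnings = []
    tag_count = sum(len(tags) for tags in taxonomy.values())
    if not 20 <= tag_count <= 40:
        warnings.append(f"{tag_count} tags is outside the 20-40 range")
    if not 4 <= len(taxonomy) <= 6:
        warnings.append(f"{len(taxonomy)} themes is outside the 4-6 range")
    return warnings
```

Running this in the same commit hook that validates the YAML keeps an oversized taxonomy from reaching the agent.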

A taxonomy that works for a B2B SaaS support inbox looks like:

Theme: Account. Tags: signup, password reset, sso, mfa, profile.
Theme: Billing. Tags: pricing, plan change, refund, invoice, payment failure.
Theme: Product. Tags: feature request, bug report, integration setup, api question, performance.
Theme: Sales. Tags: demo request, pricing inquiry, partnership, enterprise.
Theme: Escalation. Tags: angry, churn risk, executive, legal, security incident.

Five themes, twenty-four tags. Every theme also has a single tag for unclassifiable conversations within that theme (Other-Account, Other-Billing, etc.), which brings the total to twenty-nine. The agent applies an Other tag only when no theme tag fits. Other-tag rate is the most useful health metric for the taxonomy. Above ten percent means the taxonomy is missing tags or themes. Below two percent usually means the tags are so broad that every conversation fits somewhere, which makes the taxonomy too coarse to be useful. The agent includes this number in the weekly report so you tune the taxonomy from data, not vibes. The same observation appears in how to evaluate agent metrics.
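The Other-tag rate and its two thresholds fit in a few lines. A minimal sketch, assuming every catch-all tag starts with the prefix "Other" (as in Other-Account) and that each conversation is represented by its list of tags; the function names and status strings are illustrative.

```python
def other_tag_rate(conversation_tags: list[list[str]]) -> float:
    """Share of conversations carrying at least one tag that starts with 'Other'."""
    if not conversation_tags:
        return 0.0
    other = sum(
        1 for tags in conversation_tags if any(t.startswith("Other") for t in tags)
    )
    return other / len(conversation_tags)


def taxonomy_health(rate: float) -> str:
    """Map the rate onto the ten percent and two percent thresholds."""
    if rate > 0.10:
        return "missing tags or themes"
    if rate < 0.02:
        return "too coarse"
    return "healthy"
```

For example, one Other-tagged conversation out of four gives a rate of 0.25, which lands in "missing tags or themes".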

Output: tags on the ticket, report at the end of the week

Two outputs. The first is invisible until you open Help Scout: every new conversation comes pre-tagged. The second is a weekly report posted to a Slack channel you nominate.

The report has four sections.

Volume by tag. A ranked list. Top ten tags this week and their week-over-week change.
Volume by theme. A smaller ranked list. Five or six rows.
Other-tag rate. One number. Plus the three most common unclassified phrases, pulled from the conversations the agent left in Other.
Drift. Tags whose volume changed by more than fifty percent week-over-week. Used as an early-warning signal that something shipped or broke.
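The volume and drift sections are plain counting. A minimal sketch, assuming each week is available as a list of per-conversation tag lists; the function names are illustrative, and skipping tags with no volume last week is a choice made here, not something the report is known to do.

```python
from collections import Counter


def volume_by_tag(conversation_tags: list[list[str]]) -> Counter:
    """Count how many conversations carry each tag."""
    return Counter(tag for tags in conversation_tags for tag in set(tags))


def drift(this_week: Counter, last_week: Counter, threshold: float = 0.5) -> dict[str, float]:
    """Tags whose volume moved by more than `threshold` relative to last week.

    Tags with zero volume last week are skipped because their relative
    change is undefined; how to report brand-new tags is left open.
    """
    flagged = {}
    for tag in this_week.keys() | last_week.keys():
        before = last_week[tag]
        if before == 0:
            continue
        change = (this_week[tag] - before) / before
        if abs(change) > threshold:
            flagged[tag] = change
    return flagged
```

The top-ten list is then `volume_by_tag(week).most_common(10)`, and the theme table is the same count after mapping each tag to its theme.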

The report is not interactive. It is a digest. If a manager wants to investigate a drift signal, they click into Help Scout and filter by tag. The agent does not provide a chat interface for asking questions about the report.

Guardrails

Two guardrails are non-negotiable.

Add-only on tags. The agent's API access is configured to call only the add-tag endpoint. The remove-tag endpoint is blocked at the OAuth scope level. This means that even if a future version of the agent had a bug that tried to remove tags, it could not.
Fixed taxonomy. The agent loads the taxonomy YAML once at the start of each run and refuses to apply tags outside it. The setup process to add a new tag is to edit the YAML, commit, and let the agent reload on the next conversation. There is no in-product "let me invent a tag" path.
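Both guardrails can also be enforced in the agent's own code, as a second line of defense behind the API permissions. A minimal sketch, assuming the taxonomy is loaded into a set of tag names; the function name is illustrative and the actual Help Scout write is deliberately left out.

```python
def plan_tags(existing: list[str], proposed: list[str], taxonomy: set[str]) -> list[str]:
    """Return the tags the agent may add: inside the taxonomy and not already present.

    `existing` is only read, never filtered, so human-applied tags stay
    untouched even when they fall outside the taxonomy.
    """
    rejected = [t for t in proposed if t not in taxonomy]
    if rejected:
        # Refuse the whole write rather than silently inventing or dropping tags.
        raise ValueError(f"tags outside the taxonomy: {rejected}")
    already = set(existing)
    to_add = []
    for tag in proposed:
        if tag not in already:
            to_add.append(tag)
            already.add(tag)
    return to_add
```

The function returns only additions. There is no code path that produces a removal, which mirrors the add-only rule.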

The agent rate-limits its writes to under one tag per second across all conversations to stay well below Help Scout's API quota. Per conversation, it caps at three tags. The cap exists because conversations with four or more tags turn the inbox into noise.
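The write throttle and the per-conversation cap might look like the sketch below. It is an illustration under assumptions, not the agent's implementation: the class and function names are hypothetical, and the clock and sleep functions are injectable only so the behavior can be tested without waiting. Pass a `min_interval` slightly above 1.0 to stay strictly under one write per second.

```python
import time


class WriteThrottle:
    """Enforce a minimum gap between tag writes, shared across all conversations."""

    def __init__(self, min_interval: float = 1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_write = None  # time of the most recent permitted write

    def wait(self) -> float:
        """Block until the next write is allowed; return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last_write is not None:
            delay = max(0.0, self._last_write + self.min_interval - now)
        if delay:
            self._sleep(delay)
        self._last_write = now + delay
        return delay


def cap_tags(ranked_tags: list[str], limit: int = 3) -> list[str]:
    """Keep the highest-ranked tags up to the per-conversation cap."""
    return ranked_tags[:limit]
```

Call `wait()` immediately before each write. The cap assumes the classifier returns tags ranked from most to least confident.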

Common mistakes

Letting the agent invent tags. Every team that allows this ends up with a thousand-tag cloud inside six months. The agent must work from a fixed list. New tags happen in a quarterly taxonomy review. For the reasoning behind a fixed list, see how to limit an agent's actions.

Tagging on every reply. The agent looks at the first customer message. Replies change context but rarely change the right classification, and re-tagging on every reply produces tag thrash. If the conversation genuinely shifts topic, a human re-classifies.

Confusing tags with macros. A tag is a label. A macro is a response. The agent applies tags. It does not select macros. If you want canned responses sent through Help Scout, they belong in a separate workflow run by a different agent.

Treating the Other-tag rate as failure. A small Other rate is healthy. It means the agent is not over-fitting. Pressuring the agent to zero out Other produces overconfident tags that misclassify and degrade the report.

Skipping the weekly report. A tagging agent without a report is invisible. The weekly digest is the proof that tagging is improving over time and the trigger for the taxonomy reviews. Without it, the agent's value is hard to defend. The pattern is the same one argued in how to monitor agent activity.

Frequently asked questions

Can an AI agent auto-tag Help Scout conversations?

Yes. The agent subscribes to new conversations through the Help Scout webhook, reads the customer's first message, and applies up to three tags from a fixed taxonomy that you define. It does not invent new tags. It does not re-tag when replies come in. New conversations are tagged within two seconds of arrival.

Why not let the agent reply to conversations as well?

Tagging and replying are different problems. Tagging is cheap, reversible, and benefits from machine consistency. Replying is high-stakes, harder to reverse, and benefits from human nuance. Separating the two keeps the tagging agent simple and keeps reply quality in human hands. Other agents handle reply drafts if you want them.

How does the agent learn our tag vocabulary?

Setup imports your current Help Scout tags, deduplicates near-synonyms, and produces a proposed taxonomy of twenty to forty canonical tags. You approve the taxonomy. The agent applies tags from that list only. Adding a new tag requires editing the taxonomy file; the agent will not create tags on its own.

What about conversations in languages other than English?

The agent tags conversations in any language Help Scout accepts. The taxonomy itself stays in English. The agent reads the original text, classifies it against the taxonomy, and applies the English tag. Customer-facing text is never translated by the agent.

Can the agent untag conversations that humans tagged wrong?

No. The agent only ever adds tags. It never removes tags applied by a human or by a previous run. If two runs produce a duplicate tag, Help Scout deduplicates server-side. Removal of misapplied tags is left to humans for the same reason replies are.

Three takeaways before you close this tab

The taxonomy is the product. Twenty to forty canonical tags grouped into four to six themes.
Add-only on writes. Removal is a human decision and stays a human decision.
The weekly report is how the agent earns its keep. Drift signals come from the report, not from the inbox view.

Sources

Help Scout. Mailbox API, conversation events and tags. Tier 1.
Help Scout. Webhooks reference, convo.created event. Tier 1.
Help Scout. OAuth2 scopes documentation. Tier 1.
Help Scout. API rate limits. Tier 1.