AI Agent for Expense Categorisation: How It Works

What this agent does

An expense-categorisation agent does the same job a bookkeeper does in QuickBooks or Xero, but for every transaction the company posts, not just the ones that get attention at month-end. It reads the corporate card feed, reads the OCR'd receipts attached to expense reports, and assigns each transaction to a category in the chart of accounts.

It does not write to the general ledger on day one. It posts to a staging table that the bookkeeper reviews. Once accuracy is calibrated, transactions matching a high-confidence rule auto-post with an audit-trail entry, and the rest still go to the reviewer queue. This is the same pattern used by Brex, Ramp, and the Xero machine-learning categoriser, but tuned for a specific company's chart of accounts.

For a related general pattern, see AI agent for inbox triage. For the cluster context, see what an AI agent can actually do.

Sources of expense data

Three streams feed the agent.

Corporate card transactions. Pulled via the issuer's API (Brex, Ramp, Mercury, Stripe Issuing). Each transaction carries merchant name, MCC (Merchant Category Code, the four-digit ISO 18245 code the card network assigns), amount, currency, and posting date. The MCC is the single highest-signal feature for first-pass classification.

Receipt images. Photographed or emailed receipts, OCR'd by whichever service the platform integrates (Google Cloud Document AI's expense parser, Amazon Textract's AnalyzeExpense, or the receipt-OCR mode in the agent's underlying model). The OCR output gives line items, taxes, and merchant address.

Submitted expense reports. A user submits a report with a category they have picked from a UI dropdown. This category is a label, not a fact. The agent uses it as a hint, not as ground truth, because users habitually pick "office supplies" for anything they cannot otherwise classify.

The agent reconciles the three streams. A submitted report should match a card transaction within a 7-day window if it is a card-based expense. If it does not, the agent flags the report and asks for the supporting card transaction or marks it as reimbursable cash.
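
The matching step can be sketched roughly as follows. This is a minimal illustration, not the agent's actual code: the `CardTxn` and `ReportLine` types, the `match_card_txn` helper, and the rule that amount and currency must match exactly are all assumptions made for the example. Only the 7-day window comes from the text above.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from decimal import Decimal
from typing import Optional

MATCH_WINDOW = timedelta(days=7)  # the 7-day window described above


@dataclass(frozen=True)
class CardTxn:  # hypothetical shape of a card-feed row
    txn_id: str
    amount: Decimal
    currency: str
    posted: date


@dataclass(frozen=True)
class ReportLine:  # hypothetical shape of a submitted report line
    report_id: str
    amount: Decimal
    currency: str
    expense_date: date


def match_card_txn(line: ReportLine, txns: list[CardTxn]) -> Optional[CardTxn]:
    """Return the closest-dated card transaction with the same amount and
    currency inside the window, or None if the report needs follow-up."""
    candidates = [
        t for t in txns
        if t.amount == line.amount
        and t.currency == line.currency
        and abs(t.posted - line.expense_date) <= MATCH_WINDOW
    ]
    return min(
        candidates,
        key=lambda t: abs(t.posted - line.expense_date),
        default=None,
    )
```

A `None` result is the case where the agent would flag the report and either ask for the supporting card transaction or mark it as reimbursable cash.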

Fixed chart of accounts

The chart of accounts is the closed label set the classifier uses. It is the same chart the company's accountant uses to file taxes and produce financial statements.

For a US-incorporated SaaS company filing under US GAAP, the chart typically contains 40 to 80 leaf accounts spread across five top-level groups: Assets, Liabilities, Equity, Revenue, and Expenses. Expense categorisation only touches the Expenses subtree, but it has to be exact: classifying a software subscription as "Office Supplies" instead of "Software & SaaS" makes the SaaS line on the income statement wrong.

For an Indian private limited company filing under Ind-AS, the chart aligns to Schedule III of the Companies Act and is materially different from US GAAP. The agent has to be configured per jurisdiction, not generically.

Free-text categorisation (letting the user or the classifier invent a category) breaks at month-end. A bookkeeper closing the books cannot reconcile "office misc," "office supplies," and "office stationery" without manual merge. Use the chart that exists. Do not invent a parallel taxonomy.

How the classifier works

The classifier is a chain of three steps. The first two are deterministic. The third is the language-model fallback.

Step 1: MCC rule lookup. A merchant-category-code-to-account map handles the bulk of unambiguous merchants. MCC 5812 (eating places and restaurants) maps to "Meals & Entertainment." MCC 5732 (electronics stores) maps to "Office Equipment" or "Computer Hardware" depending on the chart. Roughly 60 to 70% of corporate card transactions land here in our testing.

Step 2: Merchant memory. A per-organisation cache of merchant string to account, learned from confirmed categorisations. The first time someone marks a "Datadog" charge as "Software & SaaS," the agent remembers it. The next "Datadog" charge gets the same category without asking. This step resolves another 15 to 20% of transactions, leaving roughly 10 to 25% for step 3.

Step 3: Language-model classifier. For the remainder, the agent calls a constrained-output language model with the chart of accounts as the allowed label set, the merchant, MCC, OCR line items, and the user's hint. The output is a category plus a confidence score. Anything below the configured threshold (we default to 0.85) gets flagged to the bookkeeper.
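
The three steps above can be sketched as one function. Treat this as an illustration under stated assumptions: the `Result` type, the `classify` signature, the lower-cased merchant key, and the confidence of 1.0 assigned to the deterministic steps are all invented for the example. The model is passed in as a plain callable so the sketch does not depend on any particular vendor's API.

```python
from dataclasses import dataclass
from typing import Callable, Optional

CONFIDENCE_THRESHOLD = 0.85  # default threshold from the text

# Illustrative entries only; a real map is built against the company's chart.
MCC_RULES = {"5812": "Meals & Entertainment", "5732": "Computer Hardware"}


@dataclass(frozen=True)
class Result:
    account: str
    confidence: float
    rule: str           # which step fired: "mcc", "memory", or "model"
    needs_review: bool


def classify(
    merchant: str,
    mcc: Optional[str],
    merchant_memory: dict[str, str],
    model: Callable[[str, Optional[str]], tuple[str, float]],
) -> Result:
    # Step 1: deterministic MCC lookup.
    if mcc in MCC_RULES:
        return Result(MCC_RULES[mcc], 1.0, "mcc", False)
    # Step 2: per-organisation merchant memory, keyed on a normalised name.
    key = merchant.strip().lower()
    if key in merchant_memory:
        return Result(merchant_memory[key], 1.0, "memory", False)
    # Step 3: model fallback; anything under the threshold goes to review.
    account, confidence = model(merchant, mcc)
    return Result(account, confidence, "model", confidence < CONFIDENCE_THRESHOLD)
```

Returning the name of the step that fired alongside the category is a deliberate choice in this sketch: it is the "rule fired" value the audit trail needs.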

Constrained output here is critical. If the model is allowed to emit free-text categories, the classifier sometimes returns "Misc Office," which does not exist in the chart. JSON-schema-constrained generation (or function-call mode with an enum-typed argument) restricts the output to the defined accounts.
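
One way to express the closed label set is a JSON Schema whose `account` field is an enum over the chart, plus a defensive check on whatever the model returns. The sketch below uses only the standard library; the `CHART` sample, `build_schema`, and `parse_model_output` are hypothetical names, and how the schema is handed to a model depends on the provider.

```python
import json

# Sample accounts for illustration; the real list is the company's chart.
CHART = ["Meals & Entertainment", "Software & SaaS", "Office Supplies", "Travel"]


def build_schema(chart: list[str]) -> dict:
    """JSON Schema whose `account` field is restricted to the chart."""
    return {
        "type": "object",
        "properties": {
            "account": {"type": "string", "enum": chart},
            "confidence": {"type": "number", "minimum": 0, "maximum": 1},
        },
        "required": ["account", "confidence"],
        "additionalProperties": False,
    }


def parse_model_output(raw: str, chart: list[str]) -> tuple[str, float]:
    """Re-check the reply; raise ValueError if the account is off-chart
    or the confidence is outside [0, 1]."""
    data = json.loads(raw)
    account, confidence = data["account"], float(data["confidence"])
    if account not in chart:
        raise ValueError(f"account not in chart: {account!r}")
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence out of range: {confidence}")
    return account, confidence
```

Re-validating after generation is cheap insurance in case the constrained mode is misconfigured or the chart changed between the prompt and the reply.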

Duplicate and fraud checks

Two failure modes account for most expense-system pain: duplicate submissions and policy violations the bookkeeper has to chase down later. The agent catches both.

Duplicate fingerprint. A composite key over merchant, absolute amount, posting date, and last four of the card. Two submissions that match on all four fields and are submitted within a rolling 30-day window of each other are flagged. The honest case is a card transaction the user also submitted as a reimbursable. The dishonest case is a duplicate reimbursement claim. Both need a human glance before posting.
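
A minimal sketch of the fingerprint, assuming a hypothetical `Submission` record. The SHA-256 hash, the lower-cased merchant name, the two-decimal rounding, and the reading of the 30-day window as the gap between submission dates are all choices made for this example, not details from the source.

```python
import hashlib
from dataclasses import dataclass
from datetime import date, timedelta
from decimal import Decimal

DUPLICATE_WINDOW = timedelta(days=30)


@dataclass(frozen=True)
class Submission:  # hypothetical record shape
    submission_id: str
    merchant: str
    amount: Decimal
    posted: date
    card_last4: str
    submitted: date


def fingerprint(s: Submission) -> str:
    """Composite key over merchant, absolute amount, posting date, last four."""
    parts = [
        s.merchant.strip().lower(),
        str(abs(s.amount).quantize(Decimal("0.01"))),  # assumes 2-decimal currency
        s.posted.isoformat(),
        s.card_last4,
    ]
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()


def find_duplicates(subs: list[Submission]) -> list[tuple[str, str]]:
    """Return (earlier_id, later_id) pairs that share a fingerprint and were
    submitted within the window of each other."""
    last_seen: dict[str, Submission] = {}
    flagged: list[tuple[str, str]] = []
    for s in sorted(subs, key=lambda x: x.submitted):
        fp = fingerprint(s)
        prev = last_seen.get(fp)
        if prev is not None and s.submitted - prev.submitted <= DUPLICATE_WINDOW:
            flagged.append((prev.submission_id, s.submission_id))
        last_seen[fp] = s
    return flagged
```

Using the absolute amount means a refund-signed row and the original charge produce the same key, which matches the "absolute amount" wording above.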

Policy fingerprint. The company's expense policy is a list of rules: per-meal cap, no alcohol, no first-class, no personal items. The agent encodes the rules as line-level checks against OCR output. A meal receipt with a $90 entree and a bottle of wine that pushes the total over a $250 cap gets flagged with the specific lines that broke policy. The user can dispute, the bookkeeper can override, and the audit trail records who did what.
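
Two of those rules, the per-meal cap and the alcohol ban, might look like this as line-level checks. The `LineItem` type and `check_meal_receipt` are hypothetical, and the keyword list is a deliberately naive stand-in for whatever item classification a real deployment would use.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class LineItem:  # hypothetical shape of one OCR'd receipt line
    description: str
    amount: Decimal


MEAL_CAP = Decimal("250")  # per-meal cap from the example above
# Naive keyword match, for illustration only.
ALCOHOL_TERMS = ("wine", "beer", "cocktail", "whisky", "whiskey")


def check_meal_receipt(items: list[LineItem]) -> list[str]:
    """Return human-readable violations, naming the offending line."""
    violations = []
    for i, item in enumerate(items, start=1):
        if any(term in item.description.lower() for term in ALCOHOL_TERMS):
            violations.append(f"line {i}: alcohol ({item.description})")
    total = sum((item.amount for item in items), Decimal("0"))
    if total > MEAL_CAP:
        violations.append(f"total {total} exceeds meal cap {MEAL_CAP}")
    return violations
```

An empty list means the receipt passed these two checks. Each string is specific enough to show the submitter which line to dispute.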

Audit trail matters. SOC 2 Type II auditors, our own among them, ask for the per-transaction trail of who categorised, who approved, and who posted. The agent records all three plus the classifier confidence and the rule fired.
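
As a rough sketch, the per-transaction record could be a small immutable structure serialised as one JSON line per event. The field names and the `to_log_line` helper are assumptions for illustration; the source only specifies what has to be captured (who categorised, approved, and posted, plus classifier confidence and the rule fired).

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class AuditEntry:  # hypothetical field names
    txn_id: str
    account: str
    rule_fired: str             # e.g. "mcc", "memory", or "model"
    confidence: float
    categorised_by: str
    approved_by: Optional[str]  # None until a reviewer approves
    posted_by: Optional[str]    # None until the entry is posted
    recorded_at: str            # ISO 8601 timestamp


def to_log_line(entry: AuditEntry) -> str:
    """Serialise one entry as a JSON line for an append-only log."""
    return json.dumps(asdict(entry), sort_keys=True)
```

Keeping the record frozen and the log append-only means a later override shows up as a new entry, not as an edit to an old one.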

Guardrails

Five guardrails keep the agent out of trouble.

No auto-post above the materiality threshold. Any transaction over a configurable amount (we default to 0.5% of monthly revenue) requires a bookkeeper review before posting, regardless of classifier confidence.
Quarterly drift audit. A bookkeeper samples 50 auto-posted transactions per quarter, recategorises them blind, and compares. If agreement drops below 95%, auto-post pauses until the model is retrained on the corrected sample.
No tax-line guesses. Sales tax, GST, and VAT recovery have specific source-document requirements (the tax must appear on the receipt). The agent reads the OCR'd tax line, never infers it from the total.
Frozen chart at month-end. The chart of accounts is read-only between the 28th and the 5th of the following month, the bookkeeper's close window. The agent batches new categories proposed during this window and surfaces them for review on the 6th.
Per-vendor allow-list for auto-post. Even if the classifier is confident, transactions from vendors not on the allow-list go to review for the first 30 days. The allow-list grows as the bookkeeper approves recurring vendors.
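
Three of these guardrails (materiality threshold, vendor allow-list, classifier confidence) reduce to a single gate in front of auto-post. The sketch below is a simplification under stated assumptions: `Candidate` and `auto_post_decision` are hypothetical names, the 30-day probation for new vendors is collapsed into a plain allow-list membership test, and the defaults are the 0.5% and 0.85 figures quoted in this article.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Candidate:  # hypothetical: one classified transaction
    vendor: str
    amount: Decimal
    confidence: float


def auto_post_decision(
    c: Candidate,
    monthly_revenue: Decimal,
    allow_list: set[str],
    confidence_threshold: float = 0.85,
    materiality_pct: Decimal = Decimal("0.005"),  # 0.5% of monthly revenue
) -> tuple[bool, str]:
    """Return (may_auto_post, reason). Checks run strictest first."""
    if abs(c.amount) > monthly_revenue * materiality_pct:
        return False, "above materiality threshold"
    if c.vendor not in allow_list:
        return False, "vendor not on allow-list"
    if c.confidence < confidence_threshold:
        return False, "low classifier confidence"
    return True, "auto-post"
```

With monthly revenue of 200,000, the materiality limit works out to 1,000, so a 1,500 charge goes to review even at a confidence of 0.99.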

For the broader principle, see AI agent safety and guardrails.

Common mistakes

Auto-posting from day one. The temptation is to write straight to the general ledger because the classifier "feels accurate." It is not on day one. Run in staging-only mode for 30 days. Compare against the bookkeeper's manual categorisations on the same data.

Treating MCC as ground truth. Merchant Category Codes are assigned by the card network and are wrong often enough to notice. Typical examples: a coworking space billed under a real-estate MCC, or a consulting fee billed under a generic "professional services" MCC. The MCC is a hint, not a label.

Ignoring multi-currency. A USD-denominated company that pays a EUR invoice with a corporate card sees both currencies on the same statement. The agent has to record both and pick the FX rate from a defined source. Inventing a rate at classification time creates a reconciliation mismatch at month-end.
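
A sketch of what "record both and pick the rate from a defined source" could mean in code. The `FxLine` type, the `convert` helper, the shape of the rate table, and the half-up rounding to two decimals are all assumptions for the example. The point is that a missing rate raises instead of being guessed.

```python
from dataclasses import dataclass
from decimal import ROUND_HALF_UP, Decimal


@dataclass(frozen=True)
class FxLine:  # hypothetical journal-line fields
    original_amount: Decimal
    original_currency: str
    rate: Decimal
    rate_source: str
    booked_amount: Decimal
    booked_currency: str


def convert(
    amount: Decimal,
    currency: str,
    rates: dict[str, Decimal],  # units of book currency per 1 unit of foreign
    rate_source: str,
    book_currency: str = "USD",
) -> FxLine:
    """Convert using only a rate from the configured source; never invent one."""
    if currency == book_currency:
        rate = Decimal("1")
    elif currency in rates:
        rate = rates[currency]
    else:
        raise LookupError(f"no {currency} rate from {rate_source}; flag for review")
    booked = (amount * rate).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
    return FxLine(amount, currency, rate, rate_source, booked, book_currency)
```

`Decimal` is used throughout because binary floats introduce rounding noise that shows up as cent-level mismatches at reconciliation.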

Free-text categories sneaking in. If the dropdown the agent or the user picks from is editable, free-text categories will appear. Lock the dropdown to the chart. Make new categories a request that goes to the bookkeeper, not a self-service action.

Skipping the audit trail. The agent's value to a finance team is not just the time saved; it is the per-transaction record of who did what. Record classifier confidence, rule fired, reviewer, and timestamp. Without these, the agent is a black box at audit time.

Frequently asked questions

What does an expense-categorisation agent actually do?

It ingests three sources (corporate card transactions, OCR'd receipts, and submitted expense reports), then maps each line to a category in the company's chart of accounts. Low-confidence matches are flagged, not guessed. Duplicates and out-of-policy items are routed to the finance reviewer. It does not auto-post journal entries to the general ledger in the first 30 days.

Why use a fixed chart of accounts instead of free-text categories?

Free-text categories drift across months and across employees, which makes reconciliation slow and tax filings error-prone. A fixed chart of accounts, derived from the accounting standard the company files under (US GAAP, IFRS, or Indian Ind-AS), gives the classifier a closed label set and the bookkeeper a single source of truth at month-end.

How accurate does the classifier need to be before the agent posts entries?

Agreement with the bookkeeper has to be above 95% across 200 recategorised transactions before the agent moves from suggest-only to auto-post with reviewer approval. Auto-post without approval is not appropriate at any accuracy below 99%, since a single mis-classified transaction can cascade into wrong sales tax recovery.
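
The calibration check is simple arithmetic: agreement is the share of sampled transactions where the agent's category equals the bookkeeper's blind recategorisation. The sketch below encodes the thresholds from this answer; the function names and mode labels are invented, and the top tier only means that unattended auto-post is no longer ruled out on accuracy grounds, not that it is automatically appropriate.

```python
MIN_SAMPLE = 200          # recategorised transactions required
APPROVAL_FLOOR = 0.95     # must be exceeded for auto-post with approval
UNATTENDED_FLOOR = 0.99   # below this, unattended auto-post is ruled out


def agreement_rate(agent: list[str], bookkeeper: list[str]) -> float:
    """Fraction of transactions where both assigned the same account."""
    if not agent or len(agent) != len(bookkeeper):
        raise ValueError("need two non-empty label lists of equal length")
    matches = sum(a == b for a, b in zip(agent, bookkeeper))
    return matches / len(agent)


def posting_mode(rate: float, sample_size: int) -> str:
    if sample_size < MIN_SAMPLE or rate <= APPROVAL_FLOOR:
        return "suggest-only"
    if rate < UNATTENDED_FLOOR:
        return "auto-post with reviewer approval"
    return "unattended auto-post not ruled out"
```

For example, 192 matches out of 200 is 96% agreement, which clears the first bar but not the second.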

Does the agent handle multi-currency and foreign-exchange?

Yes, but it does not invent FX rates. It pulls rates from the source-of-truth the company already uses (the card issuer's posted rate, Xero's daily rate, or a configured FX feed) and records both the original currency and the converted amount on the journal line. Where the rate is ambiguous, the transaction is flagged.

How does the agent prevent duplicate expense submissions?

It computes a fingerprint over merchant, amount, date, and last four of the card. If two submissions match across all four fields within a 30-day window, the second is automatically held and the submitter is asked to confirm. Genuine duplicates (corporate card swipe plus an employee-submitted reimbursement for the same meal) are caught here.

Three takeaways before you close this tab

Closed label set or nothing. Free-text categories are the enemy of close.
Chain three classifiers. Rule, memory, model. In that order.
Audit trail is the product. Confidence, rule, reviewer, timestamp on every line.

Sources

ISO 18245, "Merchant Category Codes", retrieved 2026-05-11, iso.org/standard/79450
Google Cloud Document AI, "Expense parser overview", retrieved 2026-05-11, cloud.google.com/document-ai/processors
AICPA, "SOC 2 Type II Trust Services Criteria, CC8.1 Change Management", retrieved 2026-05-11, aicpa-cima.com/soc-2
Ministry of Corporate Affairs (India), "Schedule III of the Companies Act, 2013", retrieved 2026-05-11, mca.gov.in/companiesact2013
Gravity team, "Gravity expense-agent guardrails", internal v1, May 2026

The same shape, applied to other tools and surfaces:

AI agent for invoice chasing, the AR side of the same finance workflow.
AI agent for Stripe failed-payment recovery, dunning with guardrails.
AI agent for Salesforce data hygiene, the CRM analogue of chart-of-accounts discipline.
AI agent safety and guardrails, the principles every finance-touching agent needs.
AI agent tool use explained, how an agent gets connected to a bookkeeping or card-issuer API.
How we test AI agents with 80 tests per capability, the calibration methodology.
AI agent failure modes, the cases an expense agent has to defend against.