What this agent does

A CRM full of duplicates, missing fields, and stale records is the most expensive piece of software a revenue team owns, and the cheapest to fix if anyone has the time. Nobody has the time. A data-hygiene agent runs continuously and produces the queue RevOps would otherwise build manually once a quarter.

The agent does not merge records. It does not edit canonical fields directly. It does not delete anything. Its output is a prioritised review queue with rationale. RevOps acts on the queue.

For broader context, see what an AI agent can actually do. For the related CRM-side scoring use case, see AI agent for HubSpot lead scoring.

Salesforce permissions

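Salesforce's permission model is generous, so the agent should sit on the minimum profile that still works: a connected app with OAuth scopes for API access and refresh tokens, plus a custom permission set that grants read access on accounts, leads, contacts, and opportunities and write access only on a hygiene-flag custom field. Merge rights are never granted to the agent.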

API limits matter. Salesforce orgs have a 24-hour rolling limit on API calls based on edition and user count. For most production orgs this is in the high tens or low hundreds of thousands per day; per Salesforce's documentation, Enterprise edition adds 1,000 calls per licensed user on top of the org's base allocation. A hygiene agent uses batched queries (SOQL with relationship traversal) and Bulk API 2.0 for the read pass, keeping the daily call count well under the limit.
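One way to stay under the limit is to check the remaining allowance before each pass. The sketch below assumes the agent has already fetched the org's limits from the REST API limits resource, which reports a `DailyApiRequests` entry with `Max` and `Remaining`. The function names and the 20 percent reserve are illustrative assumptions, not part of the agent described here.

```python
def calls_available(limits: dict, reserve_percent: int = 20) -> int:
    """Calls the agent may spend while leaving a reserve for other integrations.

    `limits` is the parsed JSON from the limits resource; `reserve_percent`
    is a hypothetical safety margin chosen for this example.
    """
    daily = limits["DailyApiRequests"]
    reserve = daily["Max"] * reserve_percent // 100
    return max(0, daily["Remaining"] - reserve)


def can_run_pass(limits: dict, estimated_calls: int, reserve_percent: int = 20) -> bool:
    """True when the estimated cost of a read pass fits inside the budget."""
    return estimated_calls <= calls_available(limits, reserve_percent)
```

With 100,000 calls per day and 30,000 remaining, a 20 percent reserve leaves 10,000 calls for the agent, so a pass estimated at 12,000 calls would be deferred.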

Duplicate detection in three layers

One layer of dedup catches the easy cases and misses the interesting ones. The agent uses three.

  1. Domain match. Normalise the email domain (lowercase it, strip subdomains, and ignore plus-addressing such as name+tag@ in the local part). Two accounts sharing a normalised domain are a candidate pair. Highest confidence; rarely false-positive.
  2. Name match. Levenshtein distance on normalised company names (strip "Inc", "LLC", "Pvt Ltd", "GmbH"), combined with Soundex to catch transliteration variants. Threshold tuned per market because abbreviation conventions differ.
  3. Embedding similarity. For pairs where domain and name diverge but the records describe the same entity (acquisition, rename, parent-subsidiary), embed each record's textual description and compute cosine similarity. Records above a threshold and inside a configurable industry filter become candidate pairs.
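The first two layers can be sketched in plain Python. This is a minimal illustration under stated assumptions: the suffix list is only the examples named above, the two-label domain cut is naive (it mishandles suffixes such as co.uk, where a real implementation would consult the Public Suffix List), and the Soundex step is left out. All function names are hypothetical.

```python
import re

# Only the suffixes named in the text; a real list would be longer and per market.
LEGAL_SUFFIXES = {"inc", "llc", "pvt", "ltd", "gmbh"}


def normalise_domain(email: str) -> str:
    """Lowercase the domain and keep its last two labels.

    Taking everything after the final "@" also discards any +tag in the
    local part, so plus-addressed variants collapse to the same domain.
    """
    domain = email.rsplit("@", 1)[-1].strip().lower()
    labels = [label for label in domain.split(".") if label]
    return ".".join(labels[-2:])


def normalise_name(name: str) -> str:
    """Lowercase, drop punctuation, and remove legal-form suffixes."""
    words = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    return " ".join(word for word in words if word not in LEGAL_SUFFIXES)


def levenshtein(a: str, b: str) -> int:
    """Edit distance computed row by row with the standard dynamic programme."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]
```

A pair would become a name-layer candidate when `levenshtein(normalise_name(x), normalise_name(y))` falls under the market-specific threshold.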

Every candidate pair carries the layer that flagged it and the confidence. RevOps reviewers handle high-confidence first; embedding-only candidates require more scrutiny and the queue surfaces this.
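A rough sketch of the third layer and of the record each candidate pair carries follows. It assumes embeddings arrive as plain lists of floats from whatever model the operator uses; the `CandidatePair` shape, the layer names, and the tie-break order are assumptions made for illustration.

```python
import math
from dataclasses import dataclass


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


# Lower number means the layer is trusted more when confidences tie.
LAYER_PRIORITY = {"domain": 0, "name": 1, "embedding": 2}


@dataclass(frozen=True)
class CandidatePair:
    left_id: str
    right_id: str
    layer: str         # which layer flagged the pair
    confidence: float  # 0.0 to 1.0, as reported by that layer


def needs_extra_scrutiny(pair: CandidatePair) -> bool:
    """Embedding-only candidates are marked so the queue can surface them."""
    return pair.layer == "embedding"


def review_order(pairs: list[CandidatePair]) -> list[CandidatePair]:
    """Highest confidence first; ties go to the more trusted layer."""
    return sorted(pairs, key=lambda p: (-p.confidence, LAYER_PRIORITY[p.layer]))
```

Keeping the layer on the pair, rather than folding everything into one score, is what lets a reviewer see why a pair was flagged.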

Missing-field policy

"Missing field" is a per-object configuration, not a universal rule. An Account without a stated industry is a hygiene issue if your reporting depends on industry; if it doesn't, it isn't. The agent reads the operator's hygiene policy from a Salesforce custom object (Hygiene_Policy__c) where RevOps lists which fields are required per object and the rationale.
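As an illustration, the policy check reduces to a lookup once the rows of Hygiene_Policy__c have been read. The sketch assumes those rows have been loaded into a dictionary keyed by object name; the field layout of the custom object is not specified here, so that shape and the function names are hypothetical.

```python
def is_blank(value) -> bool:
    """None, empty strings, and whitespace-only strings all count as missing."""
    return value is None or (isinstance(value, str) and not value.strip())


def missing_required_fields(object_name: str, record: dict, policy: dict) -> list[str]:
    """Required fields from the policy that this record leaves empty.

    `policy` maps an object name to its required field names, standing in
    for rows read from Hygiene_Policy__c (assumed shape). Objects with no
    policy entry have no required fields, so nothing is flagged.
    """
    return [field for field in policy.get(object_name, []) if is_blank(record.get(field))]
```

An Account whose Industry is blank is flagged only when the policy lists Industry for Account; the same record under a policy that omits it produces no flag.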

For required fields with values, the agent does nothing. For required fields without values, the agent proposes an enrichment value (if a configured enrichment source returns one), writes it to Proposed_Industry__c or the equivalent field, and queues the record for review. The reviewer compares the proposed value to the source and either accepts or rejects.

The agent never overwrites a non-null canonical field. A reviewer changing a canonical value sees the proposed value as a suggestion in the side panel, not as a fait accompli.
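The write rule can be expressed as a small guard. In this sketch the function name and the returned update dictionary are assumptions; the point is that the canonical field never appears in what the agent writes.

```python
from typing import Optional


def plan_enrichment_write(record: dict, canonical_field: str,
                          proposed_field: str, proposed_value) -> Optional[dict]:
    """Return the field update to queue for review, or None to do nothing.

    The update only ever targets `proposed_field` (for example
    Proposed_Industry__c), never `canonical_field`.
    """
    current = record.get(canonical_field)
    has_value = current is not None and str(current).strip() != ""
    if has_value:
        return None  # canonical field already filled: leave the record alone
    if proposed_value is None:
        return None  # enrichment source returned nothing to propose
    return {proposed_field: proposed_value}
```

Because the guard returns a plan rather than performing the write, the same function can run with writes disabled and its output compared against a manual sample.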

Staleness rules

Stale records are the silent killer. An open opportunity with no activity in 60 days is almost always closed-lost in reality. A lead in "Working" status with no contact in two weeks is almost always cold.

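The default thresholds the agent ships with are 30 days without activity for an open opportunity, 14 days without activity for a lead in "Working" status, and 90 days without contact for an account.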

All thresholds are configurable per object and per record type. RevOps owns the config.
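A configuration lookup along these lines would cover the per-object and per-record-type cases. The defaults are the ones quoted in the FAQ further down; the override dictionary keyed by (object, record type) and the example record type names are hypothetical.

```python
from datetime import date
from typing import Optional

# Defaults as quoted in the FAQ: opportunity 30, lead 14, account 90 days.
DEFAULT_THRESHOLD_DAYS = {"Opportunity": 30, "Lead": 14, "Account": 90}


def threshold_days(object_name: str, record_type: Optional[str] = None,
                   overrides: Optional[dict] = None) -> int:
    """Most specific setting wins: record type, then object, then default."""
    overrides = overrides or {}
    if (object_name, record_type) in overrides:
        return overrides[(object_name, record_type)]
    if (object_name, None) in overrides:
        return overrides[(object_name, None)]
    return DEFAULT_THRESHOLD_DAYS[object_name]


def days_idle(last_activity: date, today: date) -> int:
    """Whole days since the last recorded activity."""
    return (today - last_activity).days
```

For example, a RevOps team with long renewal cycles might set `{("Opportunity", "Renewal"): 45}` and leave every other opportunity on the 30-day default.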

Guardrails

Five guardrails make a hygiene agent safe to run continuously.

Common mistakes

Letting the agent merge. The pull is real. Merging takes time and the agent could do it quickly. It will be wrong, and Salesforce merges are very hard to undo. Stick to the queue.

Treating fuzzy name match as gospel. "Acme" and "Acme Holdings" are different legal entities in most jurisdictions, yet a fuzzy name match pairs them, and domain match will also catch them as one when they share an email domain. Neither signal alone proves they are the same record. Layer the rules.

Enriching from one source only. A single enrichment vendor will be down, will be stale, or will be wrong for your geography. Configure two and only propose when both agree, or escalate to a human review.
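The two-source rule is simple enough to sketch directly. The source names, the case-insensitive comparison, and the return shape are assumptions for illustration.

```python
from typing import Optional


def reconcile(values_by_source: dict) -> tuple[str, Optional[str]]:
    """Propose only when at least two sources answer and all answers agree.

    Returns ("propose", value) or ("escalate", None). Comparison ignores
    case and surrounding whitespace, which is an assumption of this sketch.
    """
    answers = [value for value in values_by_source.values() if value is not None]
    if len(answers) < 2:
        return ("escalate", None)  # one source alone is not enough
    distinct = {str(value).strip().lower() for value in answers}
    if len(distinct) == 1:
        return ("propose", answers[0])
    return ("escalate", None)  # sources disagree: a human decides
```

A vendor outage then degrades to more human review rather than to unverified proposals.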

Closing opportunities automatically. A pipeline number that goes down because the agent closed dormant deals looks like something went wrong, because it usually has. Surface, do not close.

Treating staleness as binary. 31 days is not meaningfully different from 29 days. The agent should rank rather than threshold for the review queue, and the threshold becomes a sort key, not a gate.
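One way to rank instead of gate is to sort by idle time relative to the threshold. In this sketch the record dictionaries and their key names are hypothetical.

```python
def staleness_score(idle_days: int, threshold: int) -> float:
    """Idle time as a multiple of the threshold; 1.0 means exactly at it."""
    return idle_days / threshold


def rank_for_review(records: list[dict]) -> list[dict]:
    """Most overdue first. Nothing is dropped for being under the threshold."""
    return sorted(
        records,
        key=lambda r: staleness_score(r["idle_days"], r["threshold"]),
        reverse=True,
    )
```

A lead idle for 45 days against a 14-day threshold sorts above an opportunity idle for 31 days against 30, and the 29-day opportunity still appears in the queue, just below it.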

Running the agent against Sandbox once, then forgetting. A clean Sandbox run is not evidence the agent is safe in production. The data shape is different, the volume is different, the trigger logic is different. Validate in Sandbox, then run for a full week in production with writes disabled, comparing the agent's proposed changes against a manual sample of 50 records. Only then enable hygiene-field writes.

Mixing hygiene with enrichment. Hygiene (finding duplicates, missing fields, stale records) and enrichment (filling fields from external sources) have different risk profiles. Hygiene is mostly safe because it surfaces. Enrichment writes potentially-wrong external data. Keep them as two separate agents with separate review queues, even if they share infrastructure. A single combined queue means a reviewer skipping ten hygiene flags also skips an enrichment proposal that needed real attention.

Frequently asked questions

What does a Salesforce data hygiene agent actually do?

It runs against Salesforce on a schedule, finds duplicate accounts and leads, surfaces records with missing required fields, and identifies records that have not been touched within the staleness window. It does not merge records, edit owners, or delete anything. Output is a prioritised review queue handed to RevOps with the rationale for every flag.

Which Salesforce permissions does the agent need?

A Salesforce connected app with OAuth scopes for API access and refresh tokens, plus a custom permission set that grants read access on accounts, leads, contacts, and opportunities and write access only on a hygiene-flag custom field. Master-record merging is never granted to the agent; that requires a human.

How does the agent detect duplicate accounts?

Three layers in order: exact match on normalised domain, fuzzy match on company name with Levenshtein and Soundex, and an embedding-similarity pass on the textual description for accounts where domains and names diverge but the entity is the same. Each layer's confidence is recorded so reviewers see why a pair was flagged.

Does the agent enrich missing fields automatically?

Only with sources the operator has explicitly configured (Clearbit, ZoomInfo, internal billing system, public web). Enrichment writes go to a separate 'proposed enrichment' field, not to the canonical field directly. RevOps approves or rejects the proposal before any value reaches the canonical field, and the agent never overwrites a manually entered value.

How does the agent define 'stale'?

Staleness rules are configured per object by RevOps. Default thresholds: an Open opportunity with no activity in 30 days, a Working lead with no activity in 14 days, an account with no contact in 90 days. Stale records are flagged for the record owner; the agent never closes opportunities or downgrades lead status itself.

Three takeaways before you close this tab
