AI Agent for Salesforce Data Hygiene: How It Works

What this agent does

A CRM full of duplicates, missing fields, and stale records is the most expensive piece of software a revenue team owns, and the cheapest to fix if anyone has the time. Nobody has the time. A data-hygiene agent runs continuously and produces the queue RevOps would otherwise build manually once a quarter.

The agent does not merge records. It does not edit canonical fields directly. It does not delete anything. Its output is a prioritised review queue with rationale. RevOps acts on the queue.

For broader context, see what an AI agent can actually do. For the related CRM-side scoring use case, see AI agent for HubSpot lead scoring.

Salesforce permissions

Salesforce's permission model is generous and the agent should sit on the minimum profile that still works.

Connected App with OAuth. Refresh-token grant, with the app installed by an admin. The agent never uses a username-password flow.
Custom permission set. Read on Account, Lead, Contact, Opportunity, Task, Event. Write only on a single custom field named AI_Hygiene_Flag__c (and its sibling AI_Hygiene_Reason__c) on each object.
No master-record merge. The Merge permission on Account and Lead is human-only.
No delete. Deleted records are recoverable from the Recycle Bin for only 15 days before they are purged, and the agent should never hold the permission.

API limits matter. Salesforce orgs have a rolling 24-hour limit on API calls that depends on edition and licence count. For most production orgs this lands in the high tens or low hundreds of thousands of calls per day; per Salesforce's documentation, Enterprise edition adds 1,000 calls per Salesforce user licence on top of a base allocation. A hygiene agent uses batched queries (SOQL with relationship traversal) and Bulk API 2.0 for the read pass, which keeps its daily call count well under the limit.
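As a rough illustration of the budget arithmetic, the sketch below estimates a daily allocation and the number of calls a paged read pass would consume. Every number in it (base allocation, licence count, record count, page size) is a placeholder assumption, not a value read from any org; check your own org's limits before relying on it.

```python
import math

def daily_api_budget(base_calls: int, calls_per_licence: int, licences: int) -> int:
    """Daily allocation = base allocation + per-licence allowance (all inputs assumed)."""
    return base_calls + calls_per_licence * licences

def read_pass_calls(record_count: int, records_per_call: int) -> int:
    """Calls needed to page through record_count records at a given page size."""
    return math.ceil(record_count / records_per_call)

# Placeholder inputs for illustration only.
budget = daily_api_budget(base_calls=100_000, calls_per_licence=1_000, licences=50)
calls = read_pass_calls(record_count=400_000, records_per_call=2_000)
share = calls / budget
print(f"budget={budget}, read pass={calls} calls, {share:.2%} of the daily budget")
```

The point of the exercise is the ratio: if the read pass is a small fraction of the budget, the agent leaves room for every other integration sharing the same limit.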

Duplicate detection in three layers

One layer of dedup catches the easy cases and misses the interesting ones. The agent uses three.

Domain match. Normalise the email domain (lowercase it, strip subdomains, and ignore plus-addressing tags such as user+tag@example.com). Two accounts sharing a normalised domain are a candidate pair. Highest confidence; rarely false-positive.
Name match. Levenshtein distance on normalised company names (strip "Inc," "LLC," "Pvt Ltd," "GmbH"), combined with Soundex to catch transliteration variants. The threshold is tuned per market because abbreviation conventions differ.
Embedding similarity. For pairs where domain and name diverge but the records describe the same entity (acquisition, rename, parent-subsidiary), embed each record's textual description and compute cosine similarity. Records above a threshold and inside a configurable industry filter become candidate pairs.
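The three layers above rest on a few small primitives, sketched here under stated assumptions. The function names and the suffix list are illustrative, the subdomain stripping is deliberately naive, Soundex is left out for brevity, and the embedding vectors are assumed to come from whatever model the operator has configured.

```python
import math
import re

# Illustrative suffix list; a real deployment would tune this per market.
LEGAL_SUFFIXES = {"inc", "llc", "pvt", "ltd", "gmbh"}

def normalise_domain(email: str) -> str:
    """Lowercase, drop the local part (plus tag included), keep the last two labels."""
    domain = email.strip().lower().rsplit("@", 1)[-1]
    labels = domain.split(".")
    # Naive subdomain stripping: wrong for suffixes like co.uk. A production
    # version would consult the Public Suffix List instead.
    return ".".join(labels[-2:])

def normalise_name(name: str) -> str:
    """Lowercase, strip punctuation and common legal suffixes."""
    words = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    return " ".join(w for w in words if w not in LEGAL_SUFFIXES)

def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance, keeping one row at a time."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two embedding vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0
```

For example, normalise_domain("Jane+crm@Mail.Acme.com") yields "acme.com", and normalise_name("Acme, Inc.") yields "acme", so both records would land in the same candidate bucket before any fuzzy comparison runs.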

Every candidate pair carries the layer that flagged it and the confidence. RevOps reviewers handle high-confidence first; embedding-only candidates require more scrutiny and the queue surfaces this.
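One way to carry the layer and confidence through to the queue is a small record type, sketched below. The class name, field names, tie-break order, and the record Ids are all hypothetical; the only things taken from the text are that each pair records its layer and confidence, that high-confidence pairs come first, and that embedding-only pairs are marked for extra scrutiny.

```python
from dataclasses import dataclass

# Assumed tie-break when confidences are equal: domain, then name, then embedding.
LAYER_PRIORITY = {"domain": 0, "name": 1, "embedding": 2}

@dataclass(frozen=True)
class CandidatePair:
    record_a: str      # Salesforce record Id (placeholder values below)
    record_b: str
    layer: str         # which layer flagged the pair
    confidence: float  # 0.0 to 1.0, as scored by that layer

    @property
    def needs_extra_scrutiny(self) -> bool:
        """Embedding-only candidates are surfaced as needing a closer look."""
        return self.layer == "embedding"

def review_order(pairs: list[CandidatePair]) -> list[CandidatePair]:
    """Highest confidence first; layer priority breaks ties."""
    return sorted(pairs, key=lambda p: (-p.confidence, LAYER_PRIORITY[p.layer]))

pairs = [
    CandidatePair("001A", "001B", "embedding", 0.81),
    CandidatePair("001C", "001D", "domain", 0.97),
    CandidatePair("001E", "001F", "name", 0.88),
]
queue = review_order(pairs)
for pair in queue:
    note = " (extra scrutiny)" if pair.needs_extra_scrutiny else ""
    print(f"{pair.record_a} ~ {pair.record_b}: {pair.layer} {pair.confidence:.2f}{note}")
```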

Missing-field policy

"Missing field" is a per-object configuration, not a universal rule. An Account without a stated industry is a hygiene issue if your reporting depends on industry; if it doesn't, it isn't. The agent reads the operator's hygiene policy from a Salesforce custom object (Hygiene_Policy__c) where RevOps lists which fields are required per object and the rationale.

For required fields with values, the agent does nothing. For required fields without values, the agent proposes an enrichment value (if a configured enrichment source returns one), writes it to Proposed_Industry__c or the equivalent field, and queues the record for review. The reviewer compares the proposed value to the source and either accepts or rejects.

The agent never overwrites a non-null canonical field. A reviewer changing a canonical value sees the proposed value as a suggestion in the side panel, not as a fait accompli.
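The policy above can be sketched as a pure function that never touches the canonical record. The function name, the plain-dict record shape, and the Proposed_<Field>__c naming rule (generalised from Proposed_Industry__c) are assumptions for illustration; how the policy is actually read from Hygiene_Policy__c is not shown.

```python
def propose_missing_fields(
    record: dict,
    required_fields: list[str],
    enrichment: dict,
) -> tuple[list[str], dict]:
    """Return (missing required fields, proposed-field writes).

    Canonical fields are only read. Proposals go to a parallel
    Proposed_<Field>__c key (hypothetical naming convention).
    """
    missing = []
    proposals = {}
    for field in required_fields:
        if record.get(field) not in (None, ""):
            continue  # required field already has a value: do nothing
        missing.append(field)
        value = enrichment.get(field)
        if value is not None:
            base = field.removesuffix("__c")  # avoid Proposed_X__c__c for custom fields
            proposals[f"Proposed_{base}__c"] = value
    return missing, proposals

record = {"Id": "001X", "Name": "Acme", "Industry": None}
missing, proposals = propose_missing_fields(
    record,
    required_fields=["Name", "Industry"],
    enrichment={"Name": "Acme Corporation", "Industry": "Manufacturing"},
)
print(missing, proposals)
```

Note that the enrichment source also returned a Name, but because the canonical Name is already populated no proposal is generated for it.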

Staleness rules

Stale records are the silent killer. An open opportunity with no activity in 60 days is almost always closed-lost in reality. A lead in "Working" status with no contact in two weeks is almost always cold.

The default thresholds the agent ships with:

Open Opportunity, no activity 30 days. Owner is reminded. Activity here includes any task, event, or stage change.
Working Lead, no activity 14 days. Owner is reminded. After 21 days the agent recommends status change but does not make it.
Account, no Contact created or modified in 90 days. Surfaced to the account owner.
Task overdue 7 days. Reminds the assignee daily until completion or reschedule.

All thresholds are configurable per object and per record type. RevOps owns the config.
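A minimal sketch of how these defaults and their overrides might be represented. The dictionary layout, the (object, record type) override key, and the choice to treat "no activity N days" as idle time greater than or equal to N are assumptions; for Tasks the date passed in would be the due date rather than the last activity.

```python
from datetime import date
from typing import Optional

# Default thresholds in days, taken from the list above.
DEFAULT_THRESHOLD_DAYS = {"Opportunity": 30, "Lead": 14, "Account": 90, "Task": 7}

def threshold_for(object_type: str, record_type: Optional[str] = None,
                  overrides: Optional[dict] = None) -> int:
    """Per-record-type override if one exists, else the object default."""
    return (overrides or {}).get((object_type, record_type),
                                 DEFAULT_THRESHOLD_DAYS[object_type])

def is_stale(object_type: str, last_activity: date, today: date,
             record_type: Optional[str] = None,
             overrides: Optional[dict] = None) -> bool:
    """True when idle days meet or exceed the applicable threshold (assumed >=)."""
    idle_days = (today - last_activity).days
    return idle_days >= threshold_for(object_type, record_type, overrides)

today = date(2026, 5, 11)
print(is_stale("Opportunity", date(2026, 4, 1), today))   # 40 days idle
print(is_stale("Lead", date(2026, 5, 1), today))          # 10 days idle
# Hypothetical override: inbound leads go stale after 7 days.
print(is_stale("Lead", date(2026, 5, 1), today,
               record_type="Inbound", overrides={("Lead", "Inbound"): 7}))
```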

Guardrails

Five guardrails make a hygiene agent safe to run continuously.

No merge. Merging records destructively combines history. Always human.
No delete. Salesforce delete is recoverable for 15 days, but recovery is operations toil and the agent does not need delete.
Write only on hygiene fields. The agent reads broadly but writes nowhere else; the custom permission set is the enforcement.
Tier-1 audit log. Every proposal records timestamp, source object, target field, proposed value, and reason. SOC 2 reviewers ask for this trail.
Bulk-write throttle. Even on hygiene fields, the agent caps writes at 1,000 per hour so that it neither eats into Salesforce's API limits nor sets off trigger storms in custom Apex.
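Two of these guardrails, the write throttle and the audit entry, are easy to sketch. The class and function names are hypothetical, the sliding-window approach is one of several reasonable ways to enforce an hourly cap, and where the audit entries are stored is left out entirely.

```python
import time
from collections import deque
from datetime import datetime, timezone

class HourlyWriteThrottle:
    """Sliding-window cap: at most `limit` writes in any `window_seconds` span."""

    def __init__(self, limit: int = 1_000, window_seconds: float = 3600.0,
                 clock=time.monotonic):
        self.limit = limit
        self.window_seconds = window_seconds
        self.clock = clock        # injectable so the throttle can be tested
        self._stamps = deque()    # timestamps of writes still inside the window

    def try_acquire(self) -> bool:
        now = self.clock()
        # Drop timestamps that have aged out of the window.
        while self._stamps and now - self._stamps[0] >= self.window_seconds:
            self._stamps.popleft()
        if len(self._stamps) >= self.limit:
            return False  # caller should defer the write, not drop it
        self._stamps.append(now)
        return True

def audit_entry(source_object: str, target_field: str,
                proposed_value: str, reason: str) -> dict:
    """One audit record per proposal, with the fields the guardrail lists."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "source_object": source_object,
        "target_field": target_field,
        "proposed_value": proposed_value,
        "reason": reason,
    }
```

A caller would check throttle.try_acquire() before each hygiene-field write and requeue the write when it returns False, logging an audit_entry(...) for every proposal regardless of whether the write went through immediately.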

Common mistakes

Letting the agent merge. The pull is real. Merging takes time and the agent could do it quickly. It will be wrong, and Salesforce merges are very hard to undo. Stick to the queue.

Treating fuzzy name match as gospel. "Acme" and "Acme Holdings" are different legal entities in most jurisdictions. A fuzzy name match pairs them, and a domain match will also treat them as one because they share an email domain; both are wrong. Layer the rules and let a reviewer decide.

Enriching from one source only. A single enrichment vendor will be down, will be stale, or will be wrong for your geography. Configure two and propose only when both agree; otherwise escalate to human review.
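The two-source rule can be sketched as a small reconciliation function. The function name, the decision labels, and the case-insensitive comparison are assumptions; real vendors would also need value mapping (for example, differing industry taxonomies) that this sketch ignores.

```python
from typing import Optional

def _norm(value: Optional[str]) -> Optional[str]:
    """Trim and casefold; treat empty or whitespace-only strings as no data."""
    if value is None:
        return None
    cleaned = value.strip().casefold()
    return cleaned or None

def reconcile(value_a: Optional[str], value_b: Optional[str]) -> tuple[str, Optional[str]]:
    """Return (decision, value). Propose only when both sources agree."""
    a, b = _norm(value_a), _norm(value_b)
    if a is None and b is None:
        return ("no_data", None)
    if a is not None and a == b:
        return ("propose", value_a.strip())
    # One source is silent or the two disagree: a human decides.
    return ("escalate", None)

print(reconcile("Manufacturing", " manufacturing "))
print(reconcile("Manufacturing", "Retail"))
print(reconcile("Manufacturing", None))
```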

Closing opportunities automatically. A pipeline number that goes down because the agent closed dormant deals looks like the wrong thing happened, because it usually did. Surface, do not close.

Treating staleness as binary. 31 days is not meaningfully different from 29 days. The agent should rank records for the review queue rather than cut them off at a threshold; the threshold becomes an input to the sort key, not a gate.
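One possible ranking, offered as a sketch rather than the scoring the agent actually uses: express idle time as a multiple of the object's threshold and sort on that. The score definition and the sample records are assumptions for illustration.

```python
def staleness_score(days_idle: int, threshold_days: int) -> float:
    """Idle time as a multiple of the threshold; 1.0 means exactly at threshold."""
    return days_idle / threshold_days

# (label, days idle, threshold in days) -- made-up sample records.
records = [
    ("Opp A", 29, 30),
    ("Opp B", 31, 30),
    ("Lead C", 28, 14),
    ("Acct D", 45, 90),
]
ranked = sorted(records, key=lambda r: staleness_score(r[1], r[2]), reverse=True)
for label, idle, threshold in ranked:
    print(f"{label}: {staleness_score(idle, threshold):.2f}x threshold")
```

Under this scoring the 29-day and 31-day opportunities sit next to each other in the queue instead of on opposite sides of a cutoff, and a lead at twice its threshold outranks both.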

Running the agent against Sandbox once, then forgetting. A clean Sandbox run is not evidence the agent is safe in production. The data shape is different, the volume is different, the trigger logic is different. Validate in Sandbox, then run for a full week in production with writes disabled, comparing the agent's proposed changes against a manual sample of 50 records. Only then enable hygiene-field writes.
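The comparison against a manual sample can be as simple as an agreement rate plus a list of the records where the agent and the reviewer differ. This is a sketch: the function name, the boolean "flagged or not" representation, and the record Ids are hypothetical, and no pass/fail cutoff is implied.

```python
def compare_with_manual_sample(agent: dict[str, bool],
                               manual: dict[str, bool]) -> tuple[float, list[str]]:
    """Return (agreement rate, sorted Ids where the two disagree) over shared Ids."""
    shared = sorted(agent.keys() & manual.keys())
    if not shared:
        raise ValueError("no overlapping record Ids to compare")
    disagreements = [rid for rid in shared if agent[rid] != manual[rid]]
    return 1 - len(disagreements) / len(shared), disagreements

# Placeholder data: True means the record was flagged as a hygiene issue.
agent_flags = {"001A": True, "001B": False, "001C": True, "001D": True}
manual_flags = {"001A": True, "001B": True, "001C": True, "001D": True}
rate, to_review = compare_with_manual_sample(agent_flags, manual_flags)
print(f"agreement {rate:.0%}; review by hand: {to_review}")
```

The disagreement list matters more than the rate: each entry is a concrete record a reviewer can open to see whether the agent or the manual pass was wrong.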

Mixing hygiene with enrichment. Hygiene (finding duplicates, missing fields, stale records) and enrichment (filling fields from external sources) have different risk profiles. Hygiene is mostly safe because it surfaces. Enrichment writes potentially-wrong external data. Keep them as two separate agents with separate review queues, even if they share infrastructure. A single combined queue means a reviewer skipping ten hygiene flags also skips an enrichment proposal that needed real attention.

Frequently asked questions

What does a Salesforce data hygiene agent actually do?

It runs against Salesforce on a schedule, finds duplicate accounts and leads, surfaces records with missing required fields, and identifies records that have not been touched within the staleness window. It does not merge records, edit owners, or delete anything. Output is a prioritised review queue handed to RevOps with the rationale for every flag.

Which Salesforce permissions does the agent need?

A Salesforce connected app with OAuth scopes for API access and refresh tokens, plus a custom permission set that grants read access on accounts, leads, contacts, opportunities, tasks, and events and write access only on the two hygiene custom fields (AI_Hygiene_Flag__c and AI_Hygiene_Reason__c). Master-record merging is never granted to the agent; that requires a human.

How does the agent detect duplicate accounts?

Three layers in order: exact match on normalised domain, fuzzy match on company name with Levenshtein and Soundex, and an embedding-similarity pass on the textual description for accounts where domains and names diverge but the entity is the same. Each layer's confidence is recorded so reviewers see why a pair was flagged.

Does the agent enrich missing fields automatically?

Only with sources the operator has explicitly configured (Clearbit, ZoomInfo, internal billing system, public web). Enrichment writes go to a separate 'proposed enrichment' field, not to the canonical field directly. RevOps approves or rejects each proposal before anything reaches the canonical field, and the agent never overwrites a value that is already there.

How does the agent define 'stale'?

Staleness rules are owner-configured per object. Default thresholds: an Open opportunity with no activity in 30 days, a Working lead with no activity in 14 days, an account with no Contact created or modified in 90 days. Stale records are flagged for the owner; the agent never closes opportunities or downgrades lead status itself.

Three takeaways before you close this tab

Surface, never merge. The agent builds the queue. RevOps decides.
Three-layer dedup. Domain first, then name, then embedding.
Proposed enrichment field. Canonical fields are sacred.

Sources

Salesforce Developer, "API request limits and allocations", retrieved 2026-05-11, developer.salesforce.com/api-limits
Salesforce Developer, "Bulk API 2.0 reference", retrieved 2026-05-11, developer.salesforce.com/api_asynch
AICPA, "SOC 2 Type II Trust Services Criteria, CC8.1 Change Management", retrieved 2026-05-11, aicpa-cima.com/soc-2
NIST, "SP 800-53 AC-6 Least Privilege", retrieved 2026-05-11, csrc.nist.gov/sp800-53/AC-6
Aryan Agarwal, "Gravity CRM-agent guardrails", internal v1, May 2026

The same shape, applied to other tools and surfaces:

AI agent for HubSpot lead scoring, the scoring-side CRM pattern.
AI agent for expense categorisation, the finance-side analogue of chart-discipline.
AI agent for cold lead follow-up, the outbound-sequencing companion.
AI agent for Mailchimp segmentation, audience taxonomy on the marketing side.
AI agent safety and guardrails, the principles every CRM-touching agent respects.
AI agent tool use explained, how an agent gets connected to Salesforce.
How we test AI agents with 80 tests per capability, the calibration methodology.