A PoC tells you the technology works. A pilot tells you the deployment works. Most teams skip the pilot because the PoC succeeded; then production hits real volume, real users, and real operational concerns, and the rollout becomes a fire drill. The 90-day pilot is the bridge. This guide is a companion to the PoC checklist, stakeholder buy-in, and migration planning guides.
PoC versus pilot
The distinction matters because the questions are different.
- PoC asks: can the technology solve this problem at acceptable quality? 4 to 6 weeks. Synthetic or sampled data acceptable. Engineering team, vendor, business owner involved.
- Pilot asks: does the solution work in our environment, with our users, at our volume, in a way we can operate? 60 to 90 days. Real volume. Real users. End-to-end including ops and support.
Skip the pilot and the production rollout becomes the pilot, with all the risk that implies. Run a pilot and the rollout becomes an expansion.
Three pilot phases
The 30-30-30 structure works for almost every agent pilot.
- Days 1 to 30: Controlled rollout. Small user group (5 to 25 users), tight monitoring, fast feedback loop. Goal: prove the operational pattern works.
- Days 31 to 60: Expansion. Broader pilot population (25 to 200 users). Goal: measure adoption, quality, and business impact at meaningful scale.
- Days 61 to 90: Measure and transition. Hold scale steady; measure rigorously; build the production transition plan.
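The three phases can be written down as data so dashboards and review reminders agree on which phase a given pilot day belongs to. This is a minimal sketch: the `Phase` class and `phase_for_day` function are hypothetical names, and the Phase 3 user range simply repeats the Phase 2 range because scale is held steady.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phase:
    name: str
    first_day: int
    last_day: int
    min_users: int
    max_users: int


# Day ranges and user counts come from the 30-30-30 structure above.
# Phase 3 holds the Phase 2 population steady, so it reuses that range.
PHASES = (
    Phase("controlled rollout", 1, 30, 5, 25),
    Phase("expansion", 31, 60, 25, 200),
    Phase("measure and transition", 61, 90, 25, 200),
)


def phase_for_day(day: int) -> Phase:
    """Return the pilot phase that contains the given pilot day (1 to 90)."""
    for phase in PHASES:
        if phase.first_day <= day <= phase.last_day:
            return phase
    raise ValueError(f"day {day} is outside the 90-day pilot")
```

Calling `phase_for_day(45)` returns the expansion phase; any day outside 1 to 90 raises an error, which is a deliberate reminder that a pilot running past day 90 needs an explicit decision.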
Days 1 to 30: controlled rollout
The goal is to validate the operational pattern, not the technology (that was the PoC). Key activities:
- Recruit pilot users. 5 to 25 users who represent the target population. Mix of enthusiasts and skeptics. Skeptics catch what enthusiasts miss.
- Train pilot users. 30 to 60 minutes per user, with documented FAQs and a known escalation path.
- Daily monitoring. The team watches the dashboard daily. Issues surface and get fixed within 24 to 48 hours.
- Weekly review. Adoption, quality, and any surfaced concerns. Iterate prompts, tools, and integrations.
- Day-30 review with kill criteria. Quality, adoption, safety. Go or no-go to Phase 2.
Kill criteria for the day-30 review:
- Capability metric below 50 percent of target.
- Any safety breach not resolved by process change.
- Pilot users not returning week-over-week (more than 50 percent have stopped using the agent).
- Operational issues exceeding the team's capacity to fix in real time.
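Writing the kill criteria as a check before the pilot starts keeps the day-30 review from turning into a negotiation. The sketch below is one possible encoding, not a prescribed implementation: the `Day30Snapshot` fields are hypothetical, and the operational criterion is reduced to a yes/no flag that the team sets by judgment.

```python
from dataclasses import dataclass


@dataclass
class Day30Snapshot:
    capability_score: float          # measured value of the capability metric
    capability_target: float         # target agreed before the pilot
    unresolved_safety_breaches: int  # breaches not resolved by a process change
    users_started: int               # pilot users who used the agent at least once
    users_still_active: int          # of those, users active in the latest week
    ops_over_capacity: bool          # team judgment: issues outpace fixes


def fired_kill_criteria(s: Day30Snapshot) -> list[str]:
    """Return the day-30 kill criteria that fire; an empty list means go."""
    fired = []
    if s.capability_score < 0.5 * s.capability_target:
        fired.append("capability below 50 percent of target")
    if s.unresolved_safety_breaches > 0:
        fired.append("unresolved safety breach")
    if s.users_started > 0:
        stopped_share = (s.users_started - s.users_still_active) / s.users_started
        if stopped_share > 0.5:
            fired.append("more than 50 percent of users stopped")
    if s.ops_over_capacity:
        fired.append("operational issues exceed team capacity")
    return fired
```

For example, a snapshot with a capability score of 0.4 against a target of 0.9 and only 8 of 20 users still active fires two criteria, and the review is a no-go for Phase 2.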
Days 31 to 60: expansion
Scale to 25 to 200 users, depending on the use case. The new questions:
- Does quality hold at scale? Edge cases that did not appear in the small group show up.
- Does adoption follow a curve? Week 4 to week 8 is when "tried it once" either converts to "uses regularly" or "stopped using". Track active users weekly.
- Does the support load scale linearly? If support tickets grow faster than users, the agent is generating problems.
- Does the cost scale linearly? Surprises here are common; some use cases scale super-linearly, for example because more users mean more retrievals on the same data.
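The two linearity questions above reduce to one comparison: did the metric grow faster than the user base between two measurement points? A minimal sketch follows; the function names and the 10 percent tolerance are assumptions for illustration, not values from any standard.

```python
def growth_ratio(metric_before: float, metric_after: float,
                 users_before: int, users_after: int) -> float:
    """Ratio of metric growth to user growth between two measurement points.

    About 1.0 means the metric scales linearly with users; above 1.0 means
    it is growing faster than the user base (super-linear).
    """
    if metric_before <= 0 or users_before <= 0 or users_after <= 0:
        raise ValueError("metric baseline and user counts must be positive")
    return (metric_after / metric_before) / (users_after / users_before)


def scales_super_linearly(ratio: float, tolerance: float = 0.10) -> bool:
    """Flag growth that exceeds linear scaling by more than the tolerance."""
    return ratio > 1.0 + tolerance


# Support tickets went from 10 to 60 per week while users went from 25 to 100.
tickets = growth_ratio(10, 60, 25, 100)
# Weekly cost went from 200 to 800 over the same period.
cost = growth_ratio(200.0, 800.0, 25, 100)
```

In this example the ticket ratio is 1.5 (tickets grew 6x while users grew 4x), which flags a support problem, while the cost ratio is 1.0 and scales linearly. The same function works for any weekly metric: tickets, spend, or on-call pages.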
Expansion is also when stakeholder communication ramps up: weekly updates to the steering group and a monthly executive snapshot. Share bad news early; people forgive surprises that arrive small.
Days 61 to 90: measure and transition
The last 30 days are for measurement and the production transition plan. Activities:
- Hold scale steady. No further user additions. Stable inputs for measurement.
- Final measurement. All four metric classes (adoption, quality, business value, operational), compared to baseline.
- User survey. Quantitative satisfaction and qualitative themes.
- Production transition plan. User onboarding flow, training material, support model, rollback plan, success metrics for first 90 days post-launch.
- Final go-no-go with executive sign-off. Day 90.
Pilot metrics
Four classes, tracked weekly throughout the pilot.
Adoption. Active users (used in last 7 days), runs per active user, week-over-week retention, time-to-first-value (how long from signup to first useful agent run).
Quality. Output quality scored against the rubric, error escalation rate (runs needing human review), satisfaction (NPS or a simpler 5-star rating).
Business value. Measured time saved, errors prevented, attributable revenue impact. All compared to the baseline measured pre-PoC.
Operational. Uptime, p99 latency, cost per run, support ticket volume, on-call pages.
The Stanford AI Index report on enterprise AI adoption identifies failure to track adoption metrics as one of the top reasons pilots stall (Stanford AI Index, 2025). Adoption is the leading indicator; everything else is lagging.
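The adoption metrics above can be computed from a plain log of agent runs. The sketch below assumes the log is a list of `(user, date)` pairs, one per run; that shape and the function names are hypothetical, and a real deployment would read from whatever telemetry store it already has.

```python
from datetime import date, timedelta


def active_users(runs, as_of, window_days=7):
    """Users with at least one run in the window_days ending on as_of (inclusive)."""
    start = as_of - timedelta(days=window_days - 1)
    return {user for user, day in runs if start <= day <= as_of}


def weekly_adoption(runs, as_of):
    """Active users, runs per active user, and week-over-week retention."""
    week_start = as_of - timedelta(days=6)
    this_week = active_users(runs, as_of)
    last_week = active_users(runs, as_of - timedelta(days=7))
    runs_this_week = sum(1 for _, day in runs if week_start <= day <= as_of)
    retained = this_week & last_week  # active last week and again this week
    return {
        "active_users": len(this_week),
        "runs_per_active_user": runs_this_week / len(this_week) if this_week else 0.0,
        "wow_retention": len(retained) / len(last_week) if last_week else 0.0,
    }


# Illustrative log: three pilot users over two weeks.
runs = [
    ("ana", date(2025, 3, 3)),
    ("ben", date(2025, 3, 4)),
    ("ana", date(2025, 3, 10)),
    ("ana", date(2025, 3, 11)),
    ("cy", date(2025, 3, 12)),
]
report = weekly_adoption(runs, as_of=date(2025, 3, 14))
```

Here two users are active in the latest week with 1.5 runs each on average, and retention is 0.5 because only one of last week's two active users came back. Retention is defined against the previous week's active set, which is one common convention; other definitions (for example, against the original cohort) are equally valid as long as the definition stays fixed for the whole pilot.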
Driving adoption
Five tactics that work.
- Time-to-first-value under 5 minutes. First successful use within 5 minutes of signup. After that the user has decided whether to come back.
- Embedded in existing workflow. The agent appears where users already work (Slack, email, the CRM), not as a separate tool to remember.
- Internal champion per cohort. A peer the cohort respects who uses the agent visibly. Champions drive 2 to 5 times the adoption of broadcast email.
- Office hours and feedback loop. Weekly 30-minute open Q&A. Users feel heard; you find issues fast.
- Visible improvements. Ship a small fix every week. Users see momentum; trust builds.
Pilot-to-production transition
The plan is locked in week 11. It covers:
- Rollout sequence. Order of teams or user cohorts being onboarded. Aligned with executive sponsors per cohort.
- Training and onboarding material. Scalable to a much larger population: documentation, video, in-app guidance.
- Support model. Who owns user questions in production? Tier 1, tier 2, escalation path.
- Operations handoff. The pilot team continues to own operations for the first 30 days post-launch; production ops takes over by day 60 post-launch.
- Success metrics for the first 90 days. Same metric classes as the pilot, with realistic targets.
- Rollback plan. If a class-1 issue surfaces, how is the rollout paused? Named decision-maker; documented criteria.
When to cancel a pilot
Cancellation is the right call when:
- The day-30 kill criteria fire.
- The day-60 measurement shows quality declining as scale grows.
- Users have voluntarily stopped using the agent.
- A material safety or compliance issue surfaces that cannot be mitigated within the pilot timeline.
- The TCO at scale, projected from pilot data, exceeds the business case.
Cancellation is not failure; it is the right outcome when the data says so. The cost of canceling a pilot is the pilot cost. The cost of converting a doomed pilot to production is the pilot cost plus the production rebuild plus the trust loss with users and stakeholders.
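The TCO criterion in the list above needs a projection from pilot data. The sketch below is a deliberately simple linear model with hypothetical parameter names: variable cost per run times expected usage, plus a fixed annual cost for licences, operations, and support. If the pilot showed super-linear cost growth, a linear projection understates the true figure and should be treated as a lower bound.

```python
def projected_annual_tco(cost_per_run: float,
                         runs_per_user_per_week: float,
                         target_users: int,
                         fixed_annual_cost: float,
                         weeks_per_year: int = 52) -> float:
    """Linear projection of annual total cost of ownership at target scale."""
    variable = cost_per_run * runs_per_user_per_week * weeks_per_year * target_users
    return variable + fixed_annual_cost


def exceeds_business_case(tco: float, annual_value: float) -> bool:
    """True when projected cost is higher than the value the business case claims."""
    return tco > annual_value


# Illustrative numbers only: 0.25 per run, 20 runs per user per week,
# 1,000 production users, 100,000 in fixed annual costs.
tco = projected_annual_tco(0.25, 20, 1000, 100_000)
```

With these illustrative inputs the projection is 360,000 per year; against a business case worth 300,000 per year, the cancellation criterion fires.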
FAQ
- What is the difference between a PoC and a pilot for AI agents?
- A PoC validates, in 4 to 6 weeks, that the platform can technically solve the problem. A pilot validates that the solution works in production conditions over 60 to 90 days.
- How long should an AI agent pilot last?
- Sixty to ninety days. Less and you miss the second-month plateau; more and it becomes production by accident.
- What does a 90-day pilot timeline look like?
- Days 1-30 controlled rollout. Days 31-60 expansion. Days 61-90 measure and transition.
- What metrics matter in an agent pilot?
- Adoption, quality, business value, operational. Track weekly; adoption is the leading indicator.
- How do you handle pilot users who are frustrated?
- Listen, fix, communicate fixes. Pilot users surface real issues; treat them as collaborators.
- When should a pilot be canceled mid-way?
- Cancel if, at day 30, capability is below 50 percent of target, a safety breach remains unresolved, or more than half of pilot users have stopped using the agent. Cancellation is not failure.
Sources
- Stanford HAI, "AI Index Report", 2025, aiindex.stanford.edu
- MIT Sloan / BCG, "Expanding AI's Frontiers", 2024, sloanreview.mit.edu
- McKinsey, "The state of AI in 2024", 2024, mckinsey.com
- Forrester, "Total Economic Impact methodology", 2024, forrester.com
- Gartner, "Pilot to Production for AI Initiatives", 2024, gartner.com
