"What can AI agents actually do?" is the question every non-developer buyer asks before the discovery call ends. The honest answer is more concrete than the marketing material and less impressive than the demo videos. Modern agents do a specific category of work well, a different category badly, and a third category not at all. Knowing which is which is the difference between a deployment that pays for itself and a procurement decision the buyer regrets six months in.
This post draws the boundary. It builds on the structural definitions at AI agent vs chatbot vs assistant and the loop mechanics at how AI agents work. The future hub at what is an autonomous AI agent sits above all of these.
The capability surface
The capability surface for agents in 2026 has three layers. The inner core is what works reliably enough to deploy: structured-data extraction, scheduled summarisation, ticket and email triage, multi-step API calls, light research with verification. The middle ring is what works with supervision: cross-system reconciliation, customer-facing replies that escalate to humans, integration glue between systems with messy schemas. The outer ring is what does not yet work outside research demos: open-ended planning across many domains, irreversible high-stakes decisions, tasks where common-sense reasoning matters more than pattern matching.
The boundary between the rings moves outward over time as models improve and tooling matures. The boundary moves slowly, though, and it does not move uniformly. Some categories that looked hard in 2024 are now solved (structured extraction, ticket routing). Some that looked solvable still are not (true open-ended research, multi-domain planning). Buyers should evaluate against the current boundary, not the projected one.
Concrete examples that work today
Inbox triage. An agent reads incoming email, classifies into reply-now / reply-later / archive / flag, drafts replies for the routine cases, and surfaces the rest. This is one of the most reliable production agent categories.
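For the technically curious, here is a minimal sketch of the triage decision. The `classify_email` callable is a hypothetical stand-in for whatever model call your stack provides; the four categories mirror the ones above.

```python
# Minimal inbox-triage sketch. `classify_email` is a hypothetical
# stand-in for the model call; the four buckets mirror the prose above.
from dataclasses import dataclass

CATEGORIES = {"reply-now", "reply-later", "archive", "flag"}

@dataclass
class TriageResult:
    category: str       # one of CATEGORIES
    draft: str | None   # drafted reply for routine cases, else None

def triage(email_body: str, classify_email) -> TriageResult:
    category, draft = classify_email(email_body)
    if category not in CATEGORIES:
        # Anything the model cannot place confidently goes to a human.
        return TriageResult(category="flag", draft=None)
    if category == "reply-now" and draft:
        return TriageResult(category="reply-now", draft=draft)
    return TriageResult(category=category, draft=None)
```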
Lead enrichment. An agent receives a new lead, looks them up across LinkedIn, the company website, and a CRM, and returns a structured profile with the relevant fields filled in. The category is reliable because the schema is fixed and the data sources are well known.
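A sketch of what "fixed schema" means in practice, with illustrative field names only; the agent's job is to fill these slots, not to invent structure as it goes.

```python
# Hypothetical enrichment schema. The fields are decided up front, so the
# agent fills slots rather than inventing structure.
from dataclasses import dataclass, field

@dataclass
class EnrichedLead:
    email: str
    full_name: str | None = None
    company: str | None = None
    job_title: str | None = None
    company_size: str | None = None
    linkedin_url: str | None = None
    sources: list[str] = field(default_factory=list)  # where each field came from
```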
Scheduled reports. An agent pulls data from one or more sources at a fixed cadence, runs a defined transformation, and produces a report in a defined format. Boring, valuable, reliable.
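In rough code, a scheduled report is little more than a config entry. The names below are illustrative, not a real product's API.

```python
# Illustrative definition of one scheduled report: fixed cadence, defined
# transformation, defined output format.
WEEKLY_PIPELINE_REPORT = {
    "schedule": "0 8 * * MON",                     # cron: Mondays at 08:00
    "sources": ["crm.deals", "billing.invoices"],  # where the data comes from
    "transform": "sum deal value by stage, compare to last week",
    "output": {"format": "markdown", "deliver_to": "#revenue-ops"},
}
```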
Customer-service tier-one deflection. An agent reads the user's message, looks up the relevant policy or knowledge base, drafts a response, and either answers directly for routine cases or escalates to a human. Reliable when the knowledge base is good; unreliable when the knowledge base is sparse.
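The answer-or-escalate split fits in a few lines. `search_kb` and `draft_reply` are hypothetical hooks into your knowledge base and model; the point is that a sparse lookup triggers escalation, not a guess.

```python
# Sketch of the tier-one answer-or-escalate decision. `search_kb` and
# `draft_reply` are hypothetical hooks; article records are assumed to be
# dicts with an "id" field.
def handle_ticket(message: str, search_kb, draft_reply, min_matches: int = 1) -> dict:
    articles = search_kb(message)
    if len(articles) < min_matches:
        # Sparse knowledge base: escalate rather than guess.
        return {"action": "escalate", "reason": "no relevant article found"}
    reply = draft_reply(message, articles)
    return {"action": "send", "reply": reply, "sources": [a["id"] for a in articles]}
```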
Monitoring and remediation. An agent watches a system for a defined set of conditions, runs a defined remediation when those conditions fire, and escalates if the remediation does not work. Works well for narrow, well-understood incidents.
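A sketch of that loop, assuming `check`, `remediate`, and `escalate` are your own hooks; the shape is a defined condition, bounded retries, then a human.

```python
# Monitor -> remediate -> escalate loop for one narrow, well-understood
# condition. `check`, `remediate`, and `escalate` are hypothetical hooks.
import time

def watch(check, remediate, escalate, interval_s: int = 60, retries: int = 2):
    while True:
        if check():                        # the defined condition fired
            for _ in range(retries):
                remediate()
                if not check():            # remediation worked
                    break
            else:
                escalate("remediation failed after retries")
        time.sleep(interval_s)
```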
Light research. An agent receives a question, searches the web, reads the top results, synthesises an answer with citations. Works for well-bounded questions; less reliable for questions that require deep evaluation of source quality.
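A bounded-research sketch, with `search` and `summarise` as hypothetical stand-ins; the property that matters is that the answer ships with the URLs it came from.

```python
# Light research sketch: search, read the top results, synthesise with
# citations. `search` and `summarise` are hypothetical stand-ins.
def research(question: str, search, summarise, top_k: int = 5) -> dict:
    results = search(question)[:top_k]     # e.g. [{"url": ..., "text": ...}, ...]
    answer = summarise(question, results)
    return {"answer": answer, "citations": [r["url"] for r in results]}
```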
Where the boundary sits
The boundary is at the intersection of four properties: novelty, stakes, common sense, and tolerance for slowness. Novelty: agents work better on tasks where examples exist. Truly novel situations, where the agent has never seen a pattern resembling the input, expose the limits of pattern matching. Stakes: agents fail at irreversible high-stakes decisions because the failure cost dominates the productivity gain. Common sense: agents struggle when the task requires reasoning across multiple domains in the way a human would.
The fourth property is more subtle: tolerance for slowness. When the cost of a wrong action exceeds the cost of slowing down to ask a human, agents lose. Surgical procedures, regulatory submissions, irreversible financial transactions: those should slow down. Agents in those categories are dangerous when used as agents and useful when used as assistants.
What the benchmarks say
Two benchmarks anchor the empirical answer. GAIA, the benchmark for general AI assistants, tests multi-step real-world reasoning. Top systems score in the high eighties on the easier tier and substantially lower on harder tiers as of 2025 results, while human performance sits above ninety-two percent. SWE-bench tests software engineering tasks; top systems solve roughly fifty to sixty percent of bugs as of late 2025, with the resolved rate continuing to climb but still well short of the human ceiling.
The interpretation is honest: the headlines about agent capability are roughly correct on narrow benchmarks and roughly wrong on real-world breadth. A buyer who reads "agents solve 60% of SWE-bench" should read it as "agents solve roughly half the bugs in a curated benchmark", not "agents replace developers". The gap between benchmark and deployment is where most procurement disappointment lives.
Tasks vs workers
The most important framing for a non-developer buyer is task-level, not worker-level. The question is not "can an agent replace this person?" The question is "which of the recurring digital tasks this person does could an agent take over with appropriate supervision?" The answer is usually a meaningful slice, not the whole job.
This framing is honest, calibrated, and easier to deploy. Pick a task, deploy an agent for it, supervise the output, expand if it works. McKinsey's 2024 work on AI productivity and the Volt Equity reports on agent adoption both reach the same conclusion: task-level adoption succeeds; worker-level replacement narratives over-promise. The economics post at economics of bootstrapped AI agents makes the per-task math explicit.
The three-check feasibility test
Three checks tell a buyer whether an agent will work for a specific use case. One: is the task recurring with a stable enough pattern that example inputs and outputs are easy to write? If yes, the model can pattern-match. If no, the task is research-level and not yet reliably automatable. Two: are the systems involved API-accessible or scriptable? Agents act through tools; if the tools do not exist, the agent cannot act. Three: is the cost of a wrong action low enough that human review on a sample is sufficient supervision? If yes, deploy. If no, do not deploy without per-action human approval, which usually means the agent abstraction is wrong for the job.
Three yeses: deploy. Two yeses: deploy with heavier supervision. One or zero yeses: this is not an agent job. Use a human, an assistant, or a workflow tool depending on the structure of the work. The workflow comparison at describe outcome, not workflow covers when workflow tools beat agents.
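For teams that want the gate made explicit, the same rule fits in a short function; the thresholds simply restate the prose above.

```python
# The three-check feasibility test as an explicit gate. The thresholds
# restate the prose: three yeses deploy, two deploy with heavier
# supervision, fewer is not an agent job.
def feasibility(stable_pattern: bool, api_accessible: bool, tolerable_error_cost: bool) -> str:
    yeses = sum([stable_pattern, api_accessible, tolerable_error_cost])
    if yeses == 3:
        return "deploy"
    if yeses == 2:
        return "deploy with heavier supervision"
    return "not an agent job: use a human, an assistant, or a workflow tool"
```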
Gravity is built around tasks that pass all three checks. The 80-test methodology in how we test AI agents is the reliability layer for the deployment.
Frequently asked questions
What can an AI agent actually do in 2026?
Modern AI agents can run multi-step tasks across well-instrumented APIs: inbox triage, lead enrichment, scheduled reports, ticket routing, monitoring and remediation, light research, structured data extraction. The category that works best is recurring digital tasks where the rules are mostly clear and the systems are mostly well-behaved. Anything outside that envelope is a research project, not a deployment.
What can AI agents not do?
Agents cannot reliably handle truly novel situations, irreversible high-stakes decisions without human review, tasks requiring genuine common-sense reasoning across many domains, or workflows where the cost of a wrong action exceeds the cost of slowing down. The GAIA benchmark and SWE-bench results both show clear ceilings on real-world generalisation.
What is the GAIA benchmark for agents?
GAIA is a benchmark designed to test general AI assistants on real-world tasks that require reasoning, multi-step planning, and tool use. Top systems score around eighty-eight percent on the easier tier and lower on harder tiers as of 2025 results. Humans score above ninety-two percent. The gap is what tells buyers what is and is not yet automatable.
Can an AI agent replace a human worker?
An agent can replace tasks, not workers. The right framing is task-level: which of this person's recurring digital tasks could an agent take over with appropriate supervision? The answer is usually a meaningful slice, not the whole job. Treating agents as worker replacements rather than task automators leads to over-promising and under-delivering.
How do I tell if my use case is feasible?
Three checks. Is the task recurring with a stable enough pattern that examples are easy to write? Are the systems involved API-accessible or scriptable? Is the cost of a wrong action low enough that human review on a sample is enough? If all three are yes, an agent is feasible. If any is no, the agent will be unreliable, expensive, or both.
Three takeaways before you close this tab
- Agents are good at boring, recurring, well-instrumented work. That is the deployment-friendly zone.
- Benchmark scores run ahead of deployments. A 60% SWE-bench agent does not replace a developer.
- Three checks before you commit. Stable pattern, API access, tolerable cost-of-error.
Sources
- GAIA benchmark, "GAIA: a benchmark for General AI Assistants", 2023, arxiv.org/abs/2311.12983
- SWE-bench, "Verified leaderboard", accessed 2026-05-05, swebench.com
- McKinsey, "The state of AI in 2024", mckinsey.com
- Volt Equity, "AI Agent State of the Market 2025", voltequity.com
- Anthropic, "Building effective agents", 2024, anthropic.com/engineering