The benchmark landscape for AI agents in 2026 is busier than most buyers can absorb. Five benchmarks dominate the conversation: GAIA, SWE-bench, AgentBench, BFCL, and ToolBench. Each measures something different. Vendors cite the one they look best on. Buyers reading vendor pages without context come away thinking the field is more uniform than it is. This piece is the buyer-side guide: what each benchmark actually measures, where each is useful, where each is misleading, and how to read leaderboards honestly.

The honest summary upfront: no single benchmark captures whether an agent will work for your task. Benchmarks are necessary signal, not sufficient signal. Read them in combination with the evaluation framework covered in AI agent evaluation metrics.

Five benchmarks, mapped to what they measure

Benchmark coverage map: what each primarily measures.
GAIA: general assistant tasks (tool use, reasoning, multi-step work).
SWE-bench: code-writing on real GitHub issues.
AgentBench: multi-domain agent capability across eight environments.
BFCL: function-calling accuracy.
ToolBench: tool use at scale across many APIs.
Source: Aryan Agarwal, Gravity benchmark coverage analysis, May 2026. In the source chart, bar widths approximate breadth of coverage relative to a generalist agent task.
Each benchmark covers a different slice. GAIA is the broadest; BFCL is the narrowest. None covers everything.

GAIA: general assistant tasks

GAIA evaluates whether an AI agent can perform tasks a competent human can, across web browsing, file reading, multi-step reasoning, and tool use. It has three difficulty levels: Level 1 is doable in a few steps, Level 2 requires more reasoning chains, Level 3 is multi-step research with multiple tools (Mialon et al., GAIA, arXiv:2311.12983, 2023).

GAIA's value is its honesty. Humans pass at 90%+ across all levels; the strongest evaluated AI agent systems sit below 50% on the harder levels. The gap is the most useful single number in agent evaluation: it captures the reality that even capable models on capable agent loops are not yet at human performance for general assistant tasks. Vendor claims that imply otherwise run into GAIA fairly quickly.

What GAIA does not measure: reliability under repeated runs of the same task, refusal correctness, cost per task. The benchmark is single-shot capability, not production reliability. The 80-test methodology in how we test AI agents picks up where GAIA leaves off.
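A minimal sketch of the repeated-run check that GAIA leaves out, assuming a hypothetical run_agent(task) callable that returns True on success; the single-shot figure is roughly what a benchmark score sees, the all-runs figure is closer to what production pays for.

```python
# Sketch: single-shot capability vs reliability under repeated runs.
# run_agent is a hypothetical harness callable, not part of GAIA itself.
from statistics import mean

def repeated_run_reliability(run_agent, tasks, runs_per_task=5):
    """Return (single_shot_rate, all_runs_pass_rate) over a task list."""
    single_shot, all_runs = [], []
    for task in tasks:
        results = [run_agent(task) for _ in range(runs_per_task)]
        single_shot.append(results[0])   # what a benchmark-style score captures
        all_runs.append(all(results))    # what production reliability needs
    return mean(single_shot), mean(all_runs)
```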

SWE-bench: code-writing on real issues

SWE-bench evaluates AI agents on real software engineering tasks pulled from GitHub issues. The agent gets a repository and an issue; it must produce a patch that resolves the issue and passes the repository's test suite. The benchmark has variants (SWE-bench Lite, SWE-bench Verified) that filter for tractable cases (SWE-bench leaderboard).

SWE-bench is the gold-standard benchmark for code-writing agents. It tests multi-step reasoning, codebase understanding, tool use (running tests, navigating files), and persistence across tool calls. Strong SWE-bench performance is a real signal that the agent loop, the tool layer, and the model integration are working together.

Caveat: SWE-bench task pass rate drops sharply with each additional tool call required, which means agents that score well on average can fail on the harder long-horizon issues. Read the breakdown by issue complexity, not just the headline number.
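What reading that breakdown looks like in practice, as a hedged sketch: the result records and field names below (resolved, tool_calls) are illustrative, not the official SWE-bench schema.

```python
# Sketch: pass rate bucketed by how many tool calls the issue required.
# The `results` records and field names are illustrative, not SWE-bench's schema.
from collections import defaultdict

def pass_rate_by_complexity(results, buckets=((0, 5), (6, 15), (16, 10**9))):
    """results: iterable of dicts like {"resolved": bool, "tool_calls": int}."""
    tallies = defaultdict(lambda: [0, 0])  # bucket -> [passed, total]
    for r in results:
        for lo, hi in buckets:
            if lo <= r["tool_calls"] <= hi:
                tallies[(lo, hi)][0] += int(r["resolved"])
                tallies[(lo, hi)][1] += 1
                break
    return {b: passed / total for b, (passed, total) in tallies.items() if total}
```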

AgentBench: broad multi-domain

AgentBench is a multi-environment benchmark that evaluates agents across eight environments: operating system, database, knowledge graph, card game, lateral thinking puzzles, house-holding, web shopping, and web browsing. The breadth is the value; an agent that scores well across AgentBench is unlikely to be over-fit to one task class.

The trade-off is depth. AgentBench tests broad capability but does not push hard on any single domain the way SWE-bench pushes on code or BFCL pushes on tool selection. Read AgentBench as a generalist signal; do not read it as the last word on any specific use case.

BFCL: function-calling accuracy

The Berkeley Function Calling Leaderboard (BFCL) measures how accurately a model selects and invokes the right function for a given query. It includes scenarios for simple function calling, parallel calls, multiple tools to choose from, and adversarial cases where the right answer is "no function should be called". BFCL is published by UC Berkeley's Gorilla project as part of its function-calling evaluation work.

BFCL is narrow but important. The function-calling layer is where most agent loops fail in production: the model picks the wrong tool, calls the right tool with malformed arguments, or hallucinates parameters. High BFCL scores correlate with strong tool-use accuracy, which is necessary (though not sufficient) for agent reliability. The metric ties directly to the input-variation and tool-failure categories in the 80-test methodology.
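A minimal sketch of what a BFCL-style check amounts to, assuming simple dict representations of expected and predicted calls; this illustrates the scoring idea, not BFCL's actual harness.

```python
# Sketch: score a predicted function call against an expected one.
# Covers the failure modes above plus the "no call" adversarial case.
def score_call(expected, predicted):
    """expected/predicted: None when no function should be called,
    otherwise {"name": str, "args": dict}."""
    if expected is None or predicted is None:
        return expected is None and predicted is None   # correct refusal to call
    if predicted["name"] != expected["name"]:
        return False                                     # wrong tool selected
    if set(predicted["args"]) != set(expected["args"]):
        return False                                     # missing or hallucinated parameters
    return all(predicted["args"][k] == v for k, v in expected["args"].items())
```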

ToolBench: tool use at scale

ToolBench evaluates agents on tool use at scale, with thousands of APIs in the available tool space. Where BFCL tests narrow function-calling accuracy, ToolBench tests whether the agent can navigate a much larger tool space and pick the right one for a given task. Strong ToolBench performance suggests the agent's tool-selection layer scales beyond a small handful of tools.

Both BFCL and ToolBench leave a gap that production agents have to fill: real tools have failure modes (rate limits, schema drift, partial responses) that the benchmarks do not fully capture. Strong scores are necessary but not sufficient for production tool reliability.
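A hedged sketch of the wrapper production agents end up writing to fill that gap; call_api and REQUIRED_FIELDS are illustrative placeholders rather than any real client, and the retry, backoff, and validation steps are exactly the parts the benchmarks do not exercise.

```python
# Sketch: a production tool wrapper covering failure modes benchmarks skip.
# call_api and REQUIRED_FIELDS are illustrative placeholders, not a real client.
import time

REQUIRED_FIELDS = {"id", "status", "data"}

def call_tool(call_api, payload, max_retries=3):
    """Retry on rate limits, validate the schema, reject partial responses."""
    for attempt in range(max_retries):
        response = call_api(payload)
        if response.get("error") == "rate_limited":
            time.sleep(2 ** attempt)                 # back off and retry
            continue
        missing = REQUIRED_FIELDS - response.keys()
        if missing:
            raise ValueError(f"schema drift or partial response, missing: {missing}")
        return response
    raise RuntimeError("rate limited on every attempt")
```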

How to read leaderboards honestly

A few habits that separate buyers who get value from benchmarks from buyers who get fooled by them.

Read multiple benchmarks together

Strong scores on GAIA, SWE-bench, and BFCL together are a much stronger signal than a strong score on any one of them. Strong performance on a single benchmark is consistent with optimisation against that benchmark; strong performance across three benchmarks covering different competencies is much harder to fake.
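One way to operationalise that habit, sketched under the assumption that each vendor's scores have already been normalised to a 0-1 scale; taking the weakest benchmark as the headline is a deliberately conservative read, not a standard metric.

```python
# Sketch: a conservative composite read across benchmarks.
# Scores are assumed to be normalised to 0-1; the weakest score is the headline.
def conservative_read(scores):
    """scores: dict like {"GAIA": 0.42, "SWE-bench": 0.55, "BFCL": 0.91}."""
    weakest = min(scores, key=scores.get)
    return {"headline": scores[weakest], "weakest_benchmark": weakest, "all": scores}
```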

Watch for benchmark leakage

Benchmark test sets get reused; over time, they leak into model training data, and scores creep up for reasons that have nothing to do with capability improvements. The community responds by releasing new versions (SWE-bench Verified is a newer, cleaner subset, for example). Look for whether vendors are reporting on the latest version, not the original.

Pair benchmarks with reliability methodology

Benchmarks measure capability ceiling; production reliability needs more than the ceiling. The 80-test methodology covered in how we test AI agents tests reliability under variation, which benchmarks do not. A vendor with strong benchmarks and no published reliability methodology is signalling capability without commitment to delivery.

Read benchmark scores against task economics

A 50% pass rate on GAIA Level 3 means agents fail half the time on multi-step tasks. Whether that is acceptable depends on the cost of failure, the cost of re-running, and the value of success. The economics framework in economics of bootstrapped AI agents is the right lens for that translation.
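A worked sketch of that translation; all the numbers are illustrative assumptions, and the model assumes failures are detected so a re-run is possible, with the failure cost only paid if every attempt fails.

```python
# Sketch: expected value per task under a given pass rate.
# All figures are illustrative assumptions, not measured numbers.
def expected_value_per_task(pass_rate, value_on_success, cost_per_run,
                            cost_of_failure, max_attempts=2):
    """Expected value allowing up to max_attempts runs, re-running on detected failure."""
    ev, p_still_failing = 0.0, 1.0
    for _ in range(max_attempts):
        ev -= p_still_failing * cost_per_run                  # pay for this attempt
        ev += p_still_failing * pass_rate * value_on_success  # succeed on this attempt
        p_still_failing *= (1 - pass_rate)
    return ev - p_still_failing * cost_of_failure             # give up after max_attempts

# Example: GAIA Level 3-like 50% pass rate, illustrative costs and value per task.
print(expected_value_per_task(0.5, 40, 2, 15))
```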

The benchmarks are useful and getting more useful. They are not the whole picture. The discipline I learned across three startups is that any single number can be made to flatter; the test of honesty is whether the publisher also publishes the unflattering numbers next to it. Buyers who learn to ask for both ship better.

Frequently asked questions

What are the main AI agent benchmarks?

Five widely cited benchmarks: GAIA (general AI assistant tasks across tool use and reasoning), SWE-bench (software engineering tasks on real GitHub issues), AgentBench (broad multi-domain agent evaluation), BFCL (Berkeley Function Calling Leaderboard, focused on tool-use accuracy), and ToolBench (large-scale tool-use across many APIs). Each measures something different.

What does GAIA measure for AI agents?

GAIA measures whether an AI agent can perform tasks a competent human can, across web browsing, file reading, multi-step reasoning, and tool use. The benchmark has three difficulty levels. Human pass rates are above 90 percent on all levels; the strongest evaluated AI agent systems are below 50 percent on the harder levels, which is why GAIA is a useful reality check.

What is SWE-bench?

SWE-bench evaluates AI agents on real software engineering tasks pulled from GitHub issues. The agent is given a repository and an issue and asked to produce a patch that resolves the issue and passes the repository's test suite. It is the gold-standard benchmark for code-writing agents and a hard test of multi-step reasoning.

What is the Berkeley Function Calling Leaderboard?

BFCL evaluates how accurately a model selects and invokes the right function (tool) for a given query, across simple, parallel, and multiple-tool scenarios. It is narrower than GAIA or SWE-bench but very useful for evaluating the tool-use layer of an agent. High BFCL scores indicate the model picks tools correctly; they do not guarantee the agent succeeds end-to-end.

Should buyers trust AI agent benchmark scores?

Trust them as one signal among several. Benchmarks can be optimised against, test sets leak into training data over time, and benchmark performance does not equal production reliability. Read benchmark scores in combination with reliability methodology, refusal correctness, and task economics. A vendor with strong benchmarks and a credible 80-test methodology is a stronger signal than benchmarks alone.

Three takeaways before you close this tab

Sources