How the AgentsDirectory Index works

The AgentsDirectory Index (ADI) is our own composite score. It combines the field's major coding and agent benchmarks into one number, so the ranking is not tied to any single vendor's leaderboard. ADI is not itself a benchmark: it aggregates them. Once we publish our own evaluation (the AgentsDirectory Benchmark, ADB), it will fold in as one transparent Tier A input, with no change to how the index is read.

Per benchmark: We score each model as a percent of the field leader on that benchmark, so the best model on each board is 100 and everyone else is measured against it. That keeps a strong runner-up high and a cheaper, decent model honest, instead of letting one outlier distort the scale.
Credibility weights: Each benchmark carries an editorial weight set by its tier: Tier A = 1, Tier B = 0.7, Tier C = 0.4. Stale boards drop to zero.
Recency gate: A benchmark counts for a model only if it is still testing current models and has a result for that model. Boards that stopped running the latest frontier models self-exclude.
Coverage gate: A model is ranked only once its results cover at least 40 percent of the active weight, so a single result can't reach the top.
Coverage adjustment: A model's score is its weighted average on the boards it ran, blended with the field average on the boards it skipped. So a model that only ran a flattering subset can't out-rank one tested across the board, while a clear leader that is merely undertested still stays near the top.
Composite: The index is that coverage-adjusted capability, rescaled so the current leader shows as 100. Value divides capability by a blended price, (3 × input price + output price) / 4, and is computed only for models above a capability floor. The Coding view runs the same math over coding benchmarks only.
Update cadence: Recomputed whenever benchmark results change. Last updated June 2026.
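
The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code behind the index: the `Board` type, the function names, and the sample scores are all hypothetical, raw scores are assumed to be positive with higher meaning better, and the coverage adjustment is read as "use the field average, at the board's weight, wherever a model has no result". The exact blending rule is not spelled out above, so treat that part as one plausible reading.

```python
from dataclasses import dataclass

TIER_WEIGHTS = {"A": 1.0, "B": 0.7, "C": 0.4}  # credibility weights from the text
COVERAGE_THRESHOLD = 0.40                      # coverage gate from the text


@dataclass
class Board:  # hypothetical container for one benchmark
    name: str
    tier: str
    stale: bool            # True if the board stopped testing current models
    raw: dict              # model name -> raw score (assumed positive, higher is better)


def percent_of_leader(raw):
    """Per benchmark: the best model is 100, everyone else is a percent of it."""
    leader = max(raw.values())
    return {model: 100.0 * score / leader for model, score in raw.items()}


def adi(boards):
    # Recency gate: stale boards carry zero weight, so drop them entirely.
    active = [b for b in boards if not b.stale and b.raw]
    weights = {b.name: TIER_WEIGHTS[b.tier] for b in active}
    total_weight = sum(weights.values())
    normalized = {b.name: percent_of_leader(b.raw) for b in active}
    field_avg = {name: sum(s.values()) / len(s) for name, s in normalized.items()}

    capability = {}
    for model in {m for s in normalized.values() for m in s}:
        covered = sum(weights[n] for n, s in normalized.items() if model in s)
        # Coverage gate: skip models covering under 40 percent of active weight.
        if covered / total_weight < COVERAGE_THRESHOLD:
            continue
        # Coverage adjustment (assumed form): field average on skipped boards.
        blended = sum(weights[n] * s.get(model, field_avg[n])
                      for n, s in normalized.items())
        capability[model] = blended / total_weight

    if not capability:
        return {}
    top = max(capability.values())
    # Composite: rescale so the current leader shows as 100.
    return {model: 100.0 * c / top for model, c in capability.items()}


# Made-up data: "z" only ran a Tier C board plus a stale one, so it is not ranked.
boards = [
    Board("board-1", "A", False, {"x": 80, "y": 40}),
    Board("board-2", "B", False, {"x": 50, "y": 50}),
    Board("board-3", "C", False, {"x": 10, "y": 5, "z": 10}),
    Board("board-4", "A", True, {"z": 99}),
]
scores = adi(boards)
```

With this sample data, `x` leads every active board and lands at 100, `y` lands at about 66.7, and `z` is left out because 0.4 of 2.1 active weight is below the 40 percent gate.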

Included benchmarks

Tier A (1): Headline, well-run, frontier-current boards.

Artificial Analysis Intelligence Index
CursorBench 3.1
SWE-bench Verified

Tier B (0.7): Strong but narrower or slightly lagging boards.

DeepSWE
FrontierCode Diamond
FrontierCode Main
SWE-Lancer (IC Diamond)
Terminal-Bench 2.0

Tier C (0.4): Useful supporting signals.

BrowseComp
GAIA2
OSWorld-Verified
Tau2-Bench Telecom

Currently excluded: Aider Polyglot, Berkeley Function Calling Leaderboard V4, Design Arena, LiveCodeBench v6, METR 50% Time Horizon, SWE-bench Pro, Vending-Bench 2. These boards have stopped testing the latest frontier models, so they contribute nothing until they refresh.

The ADI value frontier

Every ranked model plotted as ADI capability against its blended price. Up and to the left is better value: the same capability for less money. The models hugging the top-left edge are the ones the Value tab rewards.
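
One way to read "hugging the top-left edge" is as a Pareto frontier: a model is on the edge if no other model is both at least as cheap and at least as capable. That reading is an assumption for illustration, as are the function name and the sample points below; the chart itself may draw the edge differently.

```python
def value_frontier(points):
    """Return the models on the top-left edge, cheapest first.

    points maps a model name to (blended_price, capability).
    A model is kept only if it is more capable than every model
    that costs the same or less (a Pareto frontier, assumed here).
    """
    frontier = []
    best_capability = float("-inf")
    # Sort by price ascending; at equal price, the more capable model comes first.
    ordered = sorted(points.items(), key=lambda kv: (kv[1][0], -kv[1][1]))
    for name, (_price, capability) in ordered:
        if capability > best_capability:
            frontier.append(name)
            best_capability = capability
    return frontier


# Made-up points: (blended price, ADI capability)
sample = {
    "a": (1.0, 60),
    "b": (2.0, 55),   # costs more than "a" but is less capable
    "c": (3.0, 80),
    "d": (8.0, 100),
    "e": (8.0, 90),   # same price as "d" but less capable
}
frontier = value_frontier(sample)
```

Here `b` and `e` fall off the edge because a cheaper or equally priced model beats them on capability.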

Frequently asked questions

What is the AgentsDirectory Index?

ADI is our own composite score for AI models. It combines the field's major coding and agent benchmarks into one number, so the ranking is not tied to any single vendor's leaderboard. We score each model as a percent of the field leader on every active benchmark, then take a credibility-weighted average and rescale it so the current leader shows as 100.

Is ADI a benchmark?

No. ADI is not a benchmark; it aggregates benchmarks. It does not run any new evaluation of its own; it combines results from boards like CursorBench, SWE-bench Verified, and the Artificial Analysis Intelligence Index. A separate first-party evaluation we plan to run, the AgentsDirectory Benchmark (ADB), would be a benchmark, and it would fold into ADI as one more Tier A input.

Why does a model lead Intelligence but not appear on Coding?

The coverage gate. A model is only ranked once its results cover at least 40 percent of the active weight for that view. A brand-new model can lead the boards it has run while still missing too many of the coding-only boards to be ranked on the Coding tab. It rejoins automatically once those results land.
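
A small worked example of the per-view gate, as a sketch only. The tiers and weights come from the lists on this page, but which boards make up the Coding view is not stated here, so the `CODING_VIEW` set is a hypothetical guess, as are the function name and the set of boards the new model has run.

```python
TIER_WEIGHTS = {"A": 1.0, "B": 0.7, "C": 0.4}

# Tiers as listed under "Included benchmarks".
TIERS = {
    "Artificial Analysis Intelligence Index": "A",
    "CursorBench 3.1": "A",
    "SWE-bench Verified": "A",
    "DeepSWE": "B",
    "FrontierCode Diamond": "B",
    "FrontierCode Main": "B",
    "SWE-Lancer (IC Diamond)": "B",
    "Terminal-Bench 2.0": "B",
    "BrowseComp": "C",
    "GAIA2": "C",
    "OSWorld-Verified": "C",
    "Tau2-Bench Telecom": "C",
}

# Hypothetical: the real membership of the Coding view is an assumption.
CODING_VIEW = {
    "CursorBench 3.1", "SWE-bench Verified", "DeepSWE", "FrontierCode Diamond",
    "FrontierCode Main", "SWE-Lancer (IC Diamond)", "Terminal-Bench 2.0",
}


def coverage(ran, view):
    """Share of the view's active weight covered by the boards a model ran."""
    total = sum(TIER_WEIGHTS[TIERS[b]] for b in view)
    covered = sum(TIER_WEIGHTS[TIERS[b]] for b in view if b in ran)
    return covered / total


# Hypothetical brand-new model: general and agent boards, one coding board.
ran = {
    "Artificial Analysis Intelligence Index", "SWE-bench Verified",
    "BrowseComp", "GAIA2", "OSWorld-Verified", "Tau2-Bench Telecom",
}
overall = coverage(ran, set(TIERS))   # 3.6 of 8.1 active weight
coding = coverage(ran, CODING_VIEW)   # 1.0 of 5.5 active weight
```

Under these assumptions the model clears the 40 percent gate overall (about 44 percent) but not on the Coding view (about 18 percent), which is the situation the question describes.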

How is the Value score calculated?

Value divides a model's capability by its blended price, where blended price weights input tokens three to one against output (the Artificial Analysis convention). Only models above a capability floor are ranked, so a cheap but weak model can't top the board on price alone. The result is rescaled so the best value shows as 100.
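
The calculation can be sketched as follows. The function names, the sample prices, and the floor of 50 are hypothetical; the actual floor is not stated here, and whether a model exactly at the floor counts is an assumption (this sketch includes it).

```python
def blended_price(input_price, output_price):
    """Blended price: input weighted three to one against output."""
    return (3 * input_price + output_price) / 4


def value_scores(models, capability_floor):
    """models maps a name to (capability, input_price, output_price).

    Returns value rescaled so the best value shows as 100.
    Models below the capability floor are not ranked.
    """
    raw = {
        name: capability / blended_price(input_price, output_price)
        for name, (capability, input_price, output_price) in models.items()
        if capability >= capability_floor
    }
    if not raw:
        return {}
    best = max(raw.values())
    return {name: 100.0 * value / best for name, value in raw.items()}


# Made-up models: (capability, input price, output price), prices in the same unit
models = {
    "big": (100, 4.0, 20.0),   # blended price 8.0
    "mid": (80, 1.0, 4.0),     # blended price 1.75
    "tiny": (30, 0.1, 0.4),    # below the floor, so not ranked
}
values = value_scores(models, capability_floor=50)
```

In this example `mid` sets the scale at 100 and `big` scores about 27.3, while `tiny` never enters the ranking despite having the lowest price.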

How often is the index updated?

ADI is recomputed whenever benchmark results change. The current snapshot was last updated June 2026.

Details:

Type: Composite index
Inputs: 12 benchmarks
Scale: Leader = 100
Updated: June 2026
