How the AgentsDirectory Index works
The AgentsDirectory Index (ADI) is our own composite score. It combines the field's major coding and agent benchmarks into one number, so the ranking is not tied to any single vendor's leaderboard. ADI is not itself a benchmark: it aggregates benchmarks. Once we publish our own evaluation (the AgentsDirectory Benchmark, ADB), it will fold in as one transparent Tier A input, with no change to how the index is read.
- Per benchmark: We score each model as a percent of the field leader on that benchmark, so the best model on each board is 100 and everyone else is measured against it. That keeps a strong runner-up high and a cheaper, decent model honest, instead of letting one outlier distort the scale.
- Credibility weights: Each benchmark carries an editorial weight: 1.0 for Tier A, 0.7 for Tier B, and 0.4 for Tier C. Stale boards drop to a weight of zero.
- Recency gate: A benchmark counts for a model only if it is still testing current models and has a result for that model. Boards that stopped running the latest frontier models self-exclude.
- Coverage gate: A model is ranked only once its results cover at least 40 percent of the active weight, so a single result can't carry a model to the top.
- Coverage adjustment: A model's score is its weighted average on the boards it ran, blended with the field average on the boards it skipped. A model that only ran a flattering subset therefore can't out-rank one tested across the board, while a clear leader that is merely undertested still stays near the top.
- Composite: The index is that coverage-adjusted capability, rescaled so the current leader shows as 100. Value divides capability by a blended price of (3 times input plus output) over 4, for models above a capability floor. The Coding view runs the same math over coding benchmarks only.
- Update cadence: Recomputed whenever benchmark results change. Last updated June 2026.
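The steps above can be sketched in Python. This is a minimal sketch under stated assumptions, not the site's actual implementation: the function names and input layout are hypothetical, raw scores are assumed to be positive with higher meaning better, and stale boards are assumed to have been dropped by the recency gate before the call. The reading of "blended with the field average" is also an assumption: here each skipped board is filled in with the mean percent-of-leader score of the models that did run it.

```python
# Tier weights and coverage threshold as stated in the text.
TIER_WEIGHTS = {"A": 1.0, "B": 0.7, "C": 0.4}
COVERAGE_GATE = 0.40


def percent_of_leader(raw):
    """Scale one board's raw scores so the best model on it is 100."""
    best = max(raw.values())
    return {model: 100.0 * score / best for model, score in raw.items()}


def adi(boards):
    """Compute the composite for a set of active boards.

    boards: {board_name: (tier, {model: raw_score})}
    Returns {model: index}, with the leader at 100. Models that fail the
    coverage gate are left out.
    """
    weights = {name: TIER_WEIGHTS[tier] for name, (tier, _) in boards.items()}
    scaled = {name: percent_of_leader(raw) for name, (_, raw) in boards.items()}

    # ASSUMPTION: "field average" = mean scaled score of models that ran the board.
    field_avg = {name: sum(s.values()) / len(s) for name, s in scaled.items()}

    total_weight = sum(weights.values())
    models = {m for s in scaled.values() for m in s}

    capability = {}
    for m in models:
        covered = sum(w for name, w in weights.items() if m in scaled[name])
        if covered / total_weight < COVERAGE_GATE:
            continue  # coverage gate: not enough of the active weight
        # Coverage adjustment: own score where available, field average elsewhere.
        blended = sum(
            w * scaled[name].get(m, field_avg[name])
            for name, w in weights.items()
        )
        capability[m] = blended / total_weight

    top = max(capability.values())
    return {m: 100.0 * c / top for m, c in capability.items()}
```

Filling skipped boards with the field average pulls an undertested model toward the middle on exactly the weight it did not cover, which matches the stated intent that a flattering subset cannot out-rank broad testing.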
Included benchmarks
- Artificial Analysis Intelligence Index
- CursorBench 3.1
- SWE-bench Verified
- DeepSWE
- FrontierCode Diamond
- FrontierCode Main
- SWE-Lancer (IC Diamond)
- Terminal-Bench 2.0
- BrowseComp
- GAIA2
- OSWorld-Verified
- Tau2-Bench Telecom
Currently excluded: Aider Polyglot, Berkeley Function Calling Leaderboard V4, Design Arena, LiveCodeBench v6, METR 50% Time Horizon, SWE-bench Pro, Vending-Bench 2. These boards have stopped testing the latest frontier models, so they contribute nothing until they refresh.
Every ranked model plotted as ADI capability against its blended price. Up and to the left is better value: the same capability for less money. The models hugging the top-left edge are the ones the Value tab rewards.
What is the AgentsDirectory Index?
ADI is our own composite score for AI models. It combines the field's major coding and agent benchmarks into one number, so the ranking is not tied to any single vendor's leaderboard. We score each model as a percent of the field leader on every active benchmark, then take a credibility-weighted average and rescale it so the current leader shows as 100.
Is ADI a benchmark?
No. ADI is not a benchmark; it aggregates benchmarks. It does not run any new evaluation of its own. Instead, it combines results from boards like CursorBench, SWE-bench Verified, and the Artificial Analysis Intelligence Index. A separate first-party evaluation we plan to run, the AgentsDirectory Benchmark (ADB), would be a benchmark, and it would fold into ADI as one more Tier A input.
Why does a model lead Intelligence but not appear on Coding?
Because of the coverage gate. A model is only ranked once its results cover at least 40 percent of the active weight for that view. A brand-new model can lead the boards it has run while still missing too many of the coding-only boards to be ranked on the Coding tab. It rejoins automatically once those results land.
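A small worked example may make the per-view gate concrete. The board names and weights below are hypothetical and chosen only for illustration; the 40 percent threshold is the one stated above. The full board set is assumed to stand in for the Intelligence view.

```python
COVERAGE_GATE = 0.40  # threshold stated in the text


def coverage(ran, view_weights):
    """Share of a view's active weight covered by the boards a model ran."""
    total = sum(view_weights.values())
    covered = sum(w for board, w in view_weights.items() if board in ran)
    return covered / total


# Hypothetical boards and weights, for illustration only.
all_boards = {
    "general_1": 1.0,
    "general_2": 1.0,
    "coding_1": 1.0,
    "coding_2": 0.7,
    "coding_3": 0.4,
}
# The Coding view keeps only the coding boards.
coding_boards = {b: w for b, w in all_boards.items() if b.startswith("coding")}

# A new model that has run both general boards but only one small coding board.
new_model_ran = {"general_1", "general_2", "coding_3"}

overall = coverage(new_model_ran, all_boards)    # 2.4 of 4.1 active weight
coding = coverage(new_model_ran, coding_boards)  # 0.4 of 2.1 active weight

ranked_overall = overall >= COVERAGE_GATE
ranked_coding = coding >= COVERAGE_GATE
```

In this example the model clears the gate on the full board set (about 59 percent) but not on the coding-only set (about 19 percent), so it would be ranked in one view and absent from the other until more coding results arrive.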
How is the Value score calculated?
Value divides a model's capability by its blended price, where blended price weights input tokens three to one against output (the Artificial Analysis convention). Only models above a capability floor are ranked, so a cheap but weak model can't top the board on price alone. The result is rescaled so the best value shows as 100.
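The Value calculation can be sketched as follows. This is an illustrative sketch, not the site's code: the function names are hypothetical, the capability floor is passed in as a parameter because the text does not state its value, and prices can be in any consistent unit.

```python
def blended_price(input_price, output_price):
    """Blend prices three parts input to one part output: (3 * in + out) / 4."""
    return (3 * input_price + output_price) / 4


def value_scores(models, capability_floor):
    """Rank models by capability per unit of blended price.

    models: {name: (capability, input_price, output_price)}
    capability_floor: hypothetical parameter; models at or below it are skipped.
    Returns {name: value}, rescaled so the best value is 100.
    """
    raw = {
        name: cap / blended_price(inp, out)
        for name, (cap, inp, out) in models.items()
        if cap > capability_floor
    }
    best = max(raw.values())
    return {name: 100.0 * v / best for name, v in raw.items()}


# Made-up numbers for illustration only.
example = {
    "big": (100.0, 3.0, 15.0),   # blended price 6.0
    "mid": (80.0, 1.0, 5.0),     # blended price 2.0
    "weak": (30.0, 0.1, 0.4),    # below the floor used here
}
scores = value_scores(example, capability_floor=50.0)
```

With these made-up numbers, "weak" is filtered out by the floor despite being the cheapest, and "mid" sets the 100 mark because it delivers the most capability per unit of blended price.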
How often is the index updated?
ADI is recomputed whenever benchmark results change. The current snapshot was last updated June 2026.
Type: Composite index
Inputs: 12 benchmarks
Scale: Leader = 100
Updated: June 2026