Work in progress: Agents Directory has just launched. Stay tuned, more content is on the way.

Rankings

Independent model, benchmark, and agent rankings for AI coding, showing what actually leads right now. Refreshed regularly.

Updated June 2026

Best models for your agent

Best models by AgentsDirectory Index

Overall capability across every active benchmark, weighted by credibility.

| # | Model | Provider | Open | ADI | CursorBench | Context | Input price |
|---|-------|----------|------|-----|-------------|---------|-------------|
| 1 | Claude Fable 5 | Anthropic | - | 100 | 72.9% | 1M | $10/M |
| 2 | Claude Opus 4.8 | Anthropic | - | 97.3 | 63.8% | 1M | $5/M |
| 3 | GPT-5.5 | OpenAI | - | 97.2 | 64.3% | 1.05M | $5/M |
| 4 | GPT-5.4 | OpenAI | - | 90.2 | - | 1.05M | $2.5/M |
| 5 | Claude Opus 4.7 | Anthropic | - | 89.7 | 64.8% | 1M | $5/M |
| 6 | GPT-5.2 | OpenAI | - | 87.9 | - | 400K | $1.75/M |
| 7 | MiniMax-M3 | Minimax | - | 82.4 | - | 205K | $0.3/M |
| 8 | Gemini 3.1 Pro Preview | Google | - | 77.9 | - | 1.049M | $2/M |
| 9 | Claude Sonnet 4.6 | Anthropic | - | 77.8 | 49% | 1M | $3/M |
| 10 | Gemini 3 Flash | Google | - | 76.3 | - | 1.049M | $0.5/M |
| 11 | Kimi K2.6 | Moonshot | Yes | 73.3 | 47.6% | 262K | $0.66/M |
| 12 | GPT-5.4 Mini | OpenAI | - | 72.2 | - | 400K | $0.75/M |
| 13 | Kimi K2.5 | Moonshot | Yes | 65.3 | 31.9% | 262K | $0.375/M |

The AgentsDirectory Index (ADI) is our own composite score. It combines the field's major coding and agent benchmarks, so the ranking is not tied to any one vendor's leaderboard. ADI is not itself a benchmark; it aggregates them.
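
The exact ADI formula is not given on this page, so the following is only a minimal sketch of how a credibility-weighted composite could be computed. It assumes each benchmark score is already normalized to 0-100, that a model is averaged only over the benchmarks it reports, and that results are rescaled so the top model reads 100. The function name, the weights, and the sample data are all hypothetical.

```python
from typing import Dict, Optional


def composite_index(
    scores: Dict[str, Dict[str, Optional[float]]],
    weights: Dict[str, float],
) -> Dict[str, float]:
    """Weighted mean of per-benchmark scores, rescaled so the best model is 100.

    scores[model][benchmark] is a 0-100 value, or None when unreported.
    weights[benchmark] is a hypothetical credibility weight.
    """
    raw: Dict[str, float] = {}
    for model, results in scores.items():
        weighted_sum = 0.0
        weight_total = 0.0
        for bench, weight in weights.items():
            value = results.get(bench)
            if value is None:
                continue  # skip benchmarks the model has no score for
            weighted_sum += weight * value
            weight_total += weight
        if weight_total > 0:
            raw[model] = weighted_sum / weight_total
    if not raw:
        return {}
    top = max(raw.values())
    return {model: round(100 * value / top, 1) for model, value in raw.items()}


# Hypothetical data, not taken from the leaderboard.
example = composite_index(
    scores={
        "model_a": {"bench_x": 90.0, "bench_y": 60.0},
        "model_b": {"bench_x": 60.0, "bench_y": None},
    },
    weights={"bench_x": 2.0, "bench_y": 1.0},
)
print(example)
```

Averaging only over reported benchmarks keeps a model from being penalized for a missing score, though it also means models are not always compared on the same set of tests.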

How the index works

Benchmark leaderboards

Artificial Analysis Intelligence Index

The most-cited composite intelligence score: a 0–100 index combining knowledge, reasoning, math, coding, and agentic evaluations (GPQA Diamond, HLE, IFBench, SciCode, Terminal-Bench Hard, τ²-Bench, and more). Higher is better.

Leader: Claude Fable 5, 59.9

SWE-bench Verified

The most-cited agentic coding benchmark: can a model fix a real GitHub issue in a real repository? 500 human-validated tasks, scored by the repo's own tests. Higher is better.

Leader: Claude Fable 5, 95%

DeepSWE

Datacurve's agentic coding benchmark: each model runs as an autonomous agent on real software engineering tasks and is scored on whether its final patch resolves the issue. Higher is better.

Leader: GPT-5.5, 70%

FrontierCode Main

Cognition's test of whether a model writes code maintainers would actually merge, not just code that passes tests. Main is the 100 hardest of 150 tasks. Higher is better.

Leader: Claude Fable 5, 46.3%

CursorBench 3.1

Ambiguous, multi-file tasks from real Cursor sessions that test codebase understanding, bug finding, planning, and code review.

Leader: Claude Fable 5, 72.9%

Aider Polyglot

The practitioner favorite for code editing: 225 hard Exercism exercises across six languages, solved end to end through the aider tool and checked by unit tests. Higher is better.

Leader: GPT-5, 88%

Berkeley Function Calling Leaderboard V4

The reference leaderboard for tool calling: overall accuracy across single-turn, multi-turn, web search, memory and hallucination-resistance categories. Higher is better.

Leader: Claude Opus 4.5, 77.47%

BrowseComp

OpenAI's hard web-browsing benchmark: 1,266 questions whose answers are hard to find but easy to verify, requiring persistent multi-step browsing. Higher is better.

Leader: GPT-5.5 Pro, 90.1%

Design Arena

A crowdsourced Elo arena for AI-generated design and frontend code. Models go head to head on the same prompt (websites, UI components, games, mobile apps, SVG), and human votes set the rating. Higher Elo is better.

Leader: GLM-5.2, 1360 Elo
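
To illustrate how head-to-head votes can move a rating, here is a minimal sketch of the standard Elo update. Design Arena's actual rating procedure and K-factor are not described here, so the `k=32.0` default and the starting ratings are assumptions made for the example.

```python
from typing import Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(
    rating_a: float, rating_b: float, a_won: bool, k: float = 32.0
) -> Tuple[float, float]:
    """Return the new (rating_a, rating_b) after one human vote.

    k is an assumed K-factor; it controls how far one vote moves a rating.
    """
    actual_a = 1.0 if a_won else 0.0
    delta = k * (actual_a - expected_score(rating_a, rating_b))
    # The update is zero-sum: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta


# Two equally rated models; the voter prefers model A's output.
new_a, new_b = update_elo(1500.0, 1500.0, a_won=True)
print(new_a, new_b)
```

An upset (a low-rated model beating a high-rated one) moves both ratings more than an expected result does, which is why a rating settles once a model's win rate matches its predicted win rate.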

FrontierCode Diamond

The 50 hardest FrontierCode tasks: the toughest production-code problems, graded on whether maintainers would merge the patch. Scores stay low by design. Higher is better.

Leader: Claude Opus 4.8, 13.4%

GAIA2

Meta's general-assistant benchmark: agents complete realistic multi-app tasks in a simulated smartphone environment under time pressure, ambiguity and dynamic events. Higher is better.

Leader: Claude Opus 4.6, 57%

LiveCodeBench v6

Contamination-free competitive programming: problems are continuously collected from LeetCode, AtCoder and Codeforces after model cutoffs and scored as pass@1. Higher is better.

Leader: DeepSeek V4, 93.5%
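
For readers unfamiliar with pass@1, the sketch below uses the widely used unbiased pass@k estimator, which for k=1 reduces to the fraction of sampled solutions that pass all tests. Whether LiveCodeBench samples once or several times per problem is not stated here, so the sample counts are hypothetical.

```python
from math import comb
from typing import List, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that passed every test, k: attempt budget.
    """
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_1(results: List[Tuple[int, int]]) -> float:
    """Average pass@1 over problems, given (n, c) per problem."""
    return sum(pass_at_k(n, c, 1) for n, c in results) / len(results)


# Hypothetical run: three problems, ten samples each.
score = benchmark_pass_at_1([(10, 10), (10, 5), (10, 0)])
print(f"{score:.1%}")
```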

METR 50% Time Horizon

The length of task, measured in minutes of human expert time, that a model can complete autonomously with 50% reliability. The field's most-cited capability trend. Higher is better.

Leader: Claude Opus 4.6, 718.81 min
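
One way to read a 50% time horizon is as the point where a fitted success curve crosses 50%. The sketch below assumes success probability follows a logistic curve in the base-2 logarithm of human task time. The intercept and slope are hypothetical values chosen for illustration, not fitted to any real data.

```python
import math


def success_probability(minutes: float, intercept: float, slope: float) -> float:
    """Assumed model: logistic curve in log2 of human task time."""
    z = intercept + slope * math.log2(minutes)
    return 1.0 / (1.0 + math.exp(-z))


def time_horizon(intercept: float, slope: float, reliability: float = 0.5) -> float:
    """Task length (minutes) at which predicted success equals `reliability`."""
    logit = math.log(reliability / (1.0 - reliability))
    return 2.0 ** ((logit - intercept) / slope)


# Hypothetical coefficients: success falls as tasks get longer (negative slope).
horizon = time_horizon(intercept=6.0, slope=-1.0)
print(horizon, success_probability(horizon, 6.0, -1.0))
```

Raising `reliability` above 0.5 gives a shorter horizon under the same curve, so a horizon is only meaningful alongside the reliability level it was computed at.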

OSWorld-Verified

The standard computer-use benchmark: agents complete real desktop tasks in a live Ubuntu VM from screenshots, mouse and keyboard, scored by execution-based checks. Higher is better.

Leader: Claude Fable 5, 85%

SWE-bench Pro

Scale AI's harder, contamination-resistant successor to SWE-bench: 731 public long-horizon engineering tasks, every model run on identical scaffolding. Higher is better.

Leader: GPT-5.4, 59.1%

SWE-Lancer (IC Diamond)

OpenAI's freelance-work benchmark: real Upwork software tasks with real dollar payouts, scored as the share of the task pool's value a model earns. Higher is better.

Leader: GPT-5.3-Codex, 81.4%
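
The "share of the task pool's value" metric is a payout-weighted success rate, so expensive tasks count for more than cheap ones. A minimal sketch with hypothetical payouts:

```python
from typing import List, Tuple


def earned_share(tasks: List[Tuple[float, bool]]) -> float:
    """Fraction of total payout earned, given (payout_usd, solved) per task."""
    total = sum(payout for payout, _ in tasks)
    earned = sum(payout for payout, solved in tasks if solved)
    return earned / total


# Hypothetical pool: the model solves two of three tasks.
pool = [(1000.0, True), (250.0, False), (750.0, True)]
print(f"{earned_share(pool):.1%}")
```

In this made-up pool the model solves 67% of tasks by count but earns 87.5% of the value, because the task it missed was the cheapest one.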

Tau2-Bench Telecom

Sierra's customer-support agent benchmark: resolve telecom troubleshooting tasks by talking to a simulated user, calling tools and strictly following policy. Pass@1, higher is better.

Leader: Claude Opus 4.6, 99.3%

Terminal-Bench 2.0

Measures how well AI agents complete real tasks in a terminal: software engineering, ML, security and data science, each run in a containerized sandbox. Higher is better.

Leader: GPT-5.5, 84.7%

Vending-Bench 2

Andon Labs' long-horizon coherence test: a model runs a simulated vending machine business for a full year and is scored on the money it ends with, in USD. Higher is better, no ceiling.

Leader: Claude Opus 4.6, 8017.59 USD

Top AI coding agents

Agent

Hermes: Skills, Integrations & Self-hosting for Hermes
Claude Code: Skills, Plugins & MCP Servers for Claude Code
Codex: Skills & MCP Servers for OpenAI Codex
OpenClaw: Skills & Automation for OpenClaw

Want real usage numbers? See the full agent leaderboard, ranked by tokens routed through OpenRouter and refreshed daily.