Agents Directory
AI coding benchmarks
The evaluations that matter for agentic coding, and which models top each leaderboard.
Updated June 2026
Intelligence
Artificial Analysis Intelligence Index
The most-cited composite intelligence score: a 0–100 index combining knowledge, reasoning, math, coding, and agentic evaluations (GPQA Diamond, HLE, IFBench, SciCode, Terminal-Bench Hard, τ²-Bench, and more). Higher is better.
Leader: Claude Fable 5 (Claude), 59.9
Web
OpenAI
BrowseComp
OpenAI's hard web-browsing benchmark: 1,266 questions whose answers are hard to find but easy to verify, requiring persistent multi-step browsing. Higher is better.
Leader: GPT-5.5 Pro (OpenAI), 90.1%
Coding
Cursor
CursorBench 3.1
Ambiguous, multi-file tasks from real Cursor sessions that test codebase understanding, bug finding, planning, and code review.
Leader: Claude Fable 5 (Claude), 72.9%
Coding
DeepSWE
Datacurve's agentic coding benchmark: each model runs as an autonomous agent on real software engineering tasks and is scored on whether its final patch resolves the issue. Higher is better.
Leader: GPT-5.5 (OpenAI), 70%
Design
Design Arena
A crowdsourced Elo arena for AI-generated design and frontend code. Models go head to head on the same prompt (websites, UI components, games, mobile apps, SVG), and human votes set the rating. Higher Elo is better.
Leader: GLM-5.2, 1360 Elo
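The head-to-head voting described for Design Arena can be illustrated with the textbook Elo update. This is a minimal sketch, not Design Arena's actual rating code: the source does not say which rating variant or K-factor the arena uses, so `k = 32` and the function names below are assumptions.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return the new (rating_a, rating_b) after one human vote.

    k is an assumed K-factor; the arena's real value is not given in the source.
    """
    exp_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    delta = k * (score_a - exp_a)
    # Elo is zero-sum: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta


if __name__ == "__main__":
    print(update_elo(1360.0, 1360.0, a_won=True))
```

With two models both rated 1360, a single vote for the first moves it to 1376 and the second to 1344 under these assumptions.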
Coding
FrontierCode Diamond
The 50 hardest FrontierCode tasks: the toughest production-code problems, graded on whether maintainers would merge the patch. Scores stay low by design. Higher is better.
Leader: Claude Opus 4.8 (Claude), 13.4%
Coding
FrontierCode Main
Cognition's test of whether a model writes code maintainers would actually merge, not just code that passes tests. Main is the 100 hardest of 150 tasks. Higher is better.
Leader: Claude Fable 5 (Claude), 46.3%
Agentic
Meta
GAIA2
Meta's general-assistant benchmark: agents complete realistic multi-app tasks in a simulated smartphone environment under time pressure, ambiguity and dynamic events. Higher is better.
Leader: Claude Opus 4.6 (Claude), 57%
Agentic
OSWorld-Verified
The standard computer-use benchmark: agents complete real desktop tasks in a live Ubuntu VM from screenshots, mouse and keyboard, scored by execution-based checks. Higher is better.
Leader: Claude Fable 5 (Claude), 85%
Coding
SWE-bench Verified
The most-cited agentic coding benchmark: can a model fix a real GitHub issue in a real repository? 500 human-validated tasks, scored by the repo's own tests. Higher is better.
Leader: Claude Fable 5 (Claude), 95%
Coding
OpenAI
SWE-Lancer (IC Diamond)
OpenAI's freelance-work benchmark: real Upwork software tasks with real dollar payouts, scored as the share of the task pool's value a model earns. Higher is better.
Leader: GPT-5.3-Codex (OpenAI), 81.4%
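The SWE-Lancer score described above is a value-weighted pass rate rather than a plain task count. A minimal sketch of that arithmetic follows; the payout figures and the `earned_share` helper are hypothetical and are not taken from the benchmark's own harness.

```python
from typing import Iterable, Tuple


def earned_share(tasks: Iterable[Tuple[float, bool]]) -> float:
    """Percentage of the task pool's dollar value earned by a model.

    Each task is (payout_usd, solved). Payouts here are illustrative only.
    """
    tasks = list(tasks)
    total = sum(payout for payout, _ in tasks)
    if total <= 0:
        raise ValueError("task pool must have positive total value")
    earned = sum(payout for payout, solved in tasks if solved)
    return 100.0 * earned / total


if __name__ == "__main__":
    # Hypothetical pool: two of three tasks solved, but only half the value earned.
    pool = [(250.0, True), (1000.0, False), (750.0, True)]
    print(earned_share(pool))
```

In this hypothetical pool the model solves two of three tasks yet earns only 50% of the value, which is why a dollar-weighted score can differ from a simple solve rate.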
Agentic
Tau2-Bench Telecom
Sierra's customer-support agent benchmark: resolve telecom troubleshooting tasks by talking to a simulated user, calling tools and strictly following policy. Pass@1, higher is better.
Leader: Claude Opus 4.6 (Claude), 99.3%
Coding
Terminal-Bench 2.0
Measures how well AI agents complete real tasks in a terminal: software engineering, ML, security and data science, each run in a containerized sandbox. Higher is better.
Leader: GPT-5.5 (OpenAI), 84.7%
Not testing the latest models
Coding
Aider Polyglot
The practitioner favorite for code editing: 225 hard Exercism exercises across six languages, solved end to end through the aider tool and checked by unit tests. Higher is better.
Leader: GPT-5 (OpenAI), 88%
Agentic
Berkeley Function Calling Leaderboard V4
The reference leaderboard for tool calling: overall accuracy across single-turn, multi-turn, web search, memory and hallucination-resistance categories. Higher is better.
Leader: Claude Opus 4.5 (Claude), 77.47%
Coding
LiveCodeBench v6
Contamination-free competitive programming: problems are continuously collected from LeetCode, AtCoder and Codeforces after model cutoffs and scored as pass@1. Higher is better.
Leader: DeepSeek V4 (DeepSeek), 93.5%
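The pass@1 metric named in the LiveCodeBench description can be sketched with the widely used unbiased pass@k estimator, which for k = 1 reduces to the fraction of sampled solutions that pass. Whether LiveCodeBench computes its score exactly this way is an assumption, and the sample counts below are made up for illustration.

```python
from math import comb
from typing import Iterable, Tuple


def pass_at_k(n: int, c: int, k: int = 1) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if n - c < k:
        # Every size-k subset must contain at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_1(results: Iterable[Tuple[int, int]]) -> float:
    """Mean pass@1 (as a percentage) over (n_samples, n_correct) pairs."""
    results = list(results)
    return 100.0 * sum(pass_at_k(n, c, 1) for n, c in results) / len(results)


if __name__ == "__main__":
    # Hypothetical results for three problems, 10 samples each.
    print(benchmark_pass_at_1([(10, 10), (10, 5), (10, 0)]))
```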
Agentic
METR 50% Time Horizon
The length of task, measured in human expert time (minutes), that a model can complete autonomously with 50% reliability. The field's most-cited capability trend. Higher is better.
Leader: Claude Opus 4.6 (Claude), 718.81 min
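One way to read a "50% time horizon" is as the crossing point of a fitted success curve. The sketch below assumes success probability follows a logistic curve in log2 of the human task length, and solves for the length at which that curve hits a chosen reliability. The curve form, the coefficients, and the `time_horizon_minutes` helper are assumptions for illustration, not values or code from the source.

```python
from math import log


def time_horizon_minutes(intercept: float, slope: float, reliability: float = 0.5) -> float:
    """Task length (minutes) at which predicted success equals `reliability`.

    Assumed model: P(success) = sigmoid(intercept + slope * log2(minutes)),
    with slope < 0 because longer tasks are harder.
    """
    if not 0.0 < reliability < 1.0:
        raise ValueError("reliability must be strictly between 0 and 1")
    if slope == 0:
        raise ValueError("slope must be non-zero")
    logit = log(reliability / (1.0 - reliability))
    log2_minutes = (logit - intercept) / slope
    return 2.0 ** log2_minutes


if __name__ == "__main__":
    # Hypothetical fitted coefficients.
    print(time_horizon_minutes(intercept=6.0, slope=-1.0))
```

With these hypothetical coefficients the 50% horizon comes out at 64 minutes; asking for a higher reliability gives a shorter horizon.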
Coding
SWE-bench Pro
Scale AI's harder, contamination-resistant successor to SWE-bench: 731 public long-horizon engineering tasks, every model run on identical scaffolding. Higher is better.
Leader: GPT-5.4 (OpenAI), 59.1%
Agentic
Vending-Bench 2
Andon Labs' long-horizon coherence test: a model runs a simulated vending machine business for a full year and is scored on the money it ends with, in USD. Higher is better, no ceiling.
Leader: Claude Opus 4.6 (Claude), 8017.59 USD