Agents Directory

AI coding benchmarks

The evaluations that matter for agentic coding, and which models top each leaderboard.

Updated June 2026

Category: Intelligence

Artificial Analysis Intelligence Index

The most-cited composite intelligence score: a 0–100 index combining knowledge, reasoning, math, coding, and agentic evaluations (GPQA Diamond, HLE, IFBench, SciCode, Terminal-Bench Hard, τ²-Bench, and more). Higher is better.
Leader: Claude Fable 5, 59.9

Category: Web

BrowseComp

OpenAI's hard web-browsing benchmark: 1,266 questions whose answers are hard to find but easy to verify, requiring persistent multi-step browsing. Higher is better.
Leader: GPT-5.5 Pro, 90.1%

Category: Coding

CursorBench 3.1

Ambiguous, multi-file tasks from real Cursor sessions that test codebase understanding, bug finding, planning, and code review.
Leader: Claude Fable 5, 72.9%

Category: Coding

DeepSWE

Datacurve's agentic coding benchmark: each model runs as an autonomous agent on real software engineering tasks and is scored on whether its final patch resolves the issue. Higher is better.
Leader: GPT-5.5, 70%

Category: Design

Design Arena

A crowdsourced Elo arena for AI-generated design and frontend code. Models go head to head on the same prompt (websites, UI components, games, mobile apps, SVG), and human votes set the rating. Higher Elo is better.
Leader: GLM-5.2, 1360 Elo
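
To make the head-to-head rating idea concrete, here is a minimal sketch of a textbook Elo update after one human vote. It is an illustration only: the K-factor of 32 and the 400-point scale are conventional defaults assumed here, and the function names are hypothetical. The arena's actual rating procedure is not described above and may differ.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return the new (rating_a, rating_b) after one vote.

    k is an assumed K-factor, not a value taken from the arena.
    """
    exp_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    delta = k * (score_a - exp_a)
    # Elo is zero-sum: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta


# Two equally rated models; the voter prefers model A.
new_a, new_b = update_elo(1300.0, 1300.0, a_won=True)
print(new_a, new_b)
```

An upset (a low-rated model beating a high-rated one) moves both ratings further than an expected result does, which is why a rating settles as votes accumulate.
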
Category: Coding

FrontierCode Diamond

The 50 hardest FrontierCode tasks: the toughest production-code problems, graded on whether maintainers would merge the patch. Scores stay low by design. Higher is better.
Leader: Claude Opus 4.8, 13.4%

Category: Coding

FrontierCode Main

Cognition's test of whether a model writes code maintainers would actually merge, not just code that passes tests. Main is the 100 hardest of 150 tasks. Higher is better.
Leader: Claude Fable 5, 46.3%

Category: Agentic

GAIA2

Meta's general-assistant benchmark: agents complete realistic multi-app tasks in a simulated smartphone environment under time pressure, ambiguity and dynamic events. Higher is better.
Leader: Claude Opus 4.6, 57%

Category: Agentic

OSWorld-Verified

The standard computer-use benchmark: agents complete real desktop tasks in a live Ubuntu VM from screenshots, mouse and keyboard, scored by execution-based checks. Higher is better.
Leader: Claude Fable 5, 85%

Category: Coding

SWE-bench Verified

The most-cited agentic coding benchmark: can a model fix a real GitHub issue in a real repository? 500 human-validated tasks, scored by the repo's own tests. Higher is better.
Leader: Claude Fable 5, 95%

Category: Coding

SWE-Lancer (IC Diamond)

OpenAI's freelance-work benchmark: real Upwork software tasks with real dollar payouts, scored as the share of the task pool's value a model earns. Higher is better.
Leader: GPT-5.3-Codex, 81.4%
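
The "share of the task pool's value" score can be sketched as a payout-weighted solve rate. The snippet below is a minimal illustration under stated assumptions: the task payouts and outcomes are invented, and `earned_share` is a hypothetical helper, not part of any official harness.

```python
def earned_share(tasks):
    """Fraction of the total payout pool earned by solved tasks.

    tasks: iterable of (payout_usd, solved) pairs. Illustrative data only.
    """
    tasks = list(tasks)
    total = sum(payout for payout, _ in tasks)
    if total <= 0:
        raise ValueError("task pool must have a positive total payout")
    earned = sum(payout for payout, solved in tasks if solved)
    return earned / total


# Hypothetical pool: two of three tasks solved, but the largest one is missed.
pool = [(250.0, True), (1000.0, False), (750.0, True)]
print(f"{earned_share(pool):.1%}")  # 50.0%
```

Because each task is weighted by its payout, a model that solves many cheap tasks but misses the expensive ones scores lower than its raw solve rate would suggest.
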
Category: Agentic

Tau2-Bench Telecom

Sierra's customer-support agent benchmark: resolve telecom troubleshooting tasks by talking to a simulated user, calling tools and strictly following policy. Scored as pass@1. Higher is better.
Leader: Claude Opus 4.6, 99.3%

Category: Coding

Terminal-Bench 2.0

Measures how well AI agents complete real tasks in a terminal: software engineering, ML, security and data science, each run in a containerized sandbox. Higher is better.
Leader: GPT-5.5, 84.7%

Not testing the latest models

Category: Coding

Aider Polyglot

The practitioner favorite for code editing: 225 hard Exercism exercises across six languages, solved end to end through the aider tool and checked by unit tests. Higher is better.
Leader: GPT-5, 88%

Category: Agentic

Berkeley Function Calling Leaderboard V4

The reference leaderboard for tool calling: overall accuracy across single-turn, multi-turn, web search, memory and hallucination-resistance categories. Higher is better.
Leader: Claude Opus 4.5, 77.47%

Category: Coding

LiveCodeBench v6

Contamination-free competitive programming: problems are continuously collected from LeetCode, AtCoder and Codeforces after model cutoffs and scored as pass@1. Higher is better.
Leader: DeepSeek V4, 93.5%
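
For readers unfamiliar with pass@1, the sketch below uses the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), where n is the number of samples drawn per problem and c the number that pass all tests. With k = 1 it reduces to c / n. The per-problem counts are made up for illustration, and whether a given leaderboard samples once or several times per problem is an assumption here.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k from n samples with c correct."""
    if n - c < k:
        # Fewer than k failures: every size-k draw contains a correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_1(results) -> float:
    """Average pass@1 over problems; results is a list of (n, c) pairs."""
    results = list(results)
    return sum(pass_at_k(n, c, 1) for n, c in results) / len(results)


# Hypothetical run: 10 samples on each of three problems.
runs = [(10, 3), (10, 10), (10, 0)]
print(round(benchmark_pass_at_1(runs), 4))  # 0.4333
```
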
Category: Agentic

METR 50% Time Horizon

The task length, measured in minutes of human expert time, that a model can complete autonomously with 50% reliability. The field's most-cited capability trend. Higher is better.
Leader: Claude Opus 4.6, 718.81 min
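
One way to read a "50% time horizon" is as the point where a fitted success curve crosses 0.5. The sketch below assumes a logistic model of success probability against log2 of task length; the model form, the parameter values, and the function names are all assumptions made for illustration, not the benchmark's published fitting code.

```python
import math


def success_probability(task_minutes: float, intercept: float, slope: float) -> float:
    """Assumed logistic model: p = sigmoid(intercept + slope * log2(minutes))."""
    z = intercept + slope * math.log2(task_minutes)
    return 1.0 / (1.0 + math.exp(-z))


def time_horizon_minutes(intercept: float, slope: float, reliability: float = 0.5) -> float:
    """Task length at which the modeled success rate equals `reliability`."""
    if slope >= 0:
        raise ValueError("success is expected to fall as tasks get longer")
    logit = math.log(reliability / (1.0 - reliability))
    return 2.0 ** ((logit - intercept) / slope)


# Hypothetical fit whose curve crosses 50% at 60-minute tasks.
intercept, slope = math.log2(60.0), -1.0
print(round(time_horizon_minutes(intercept, slope), 2))  # 60.0
```

Asking for a stricter reliability, for example 0.8, yields a shorter horizon under the same fitted curve.
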
Category: Coding

SWE-bench Pro

Scale AI's harder, contamination-resistant successor to SWE-bench: 731 public long-horizon engineering tasks, every model run on identical scaffolding. Higher is better.
Leader: GPT-5.4, 59.1%

Category: Agentic

Vending-Bench 2

Andon Labs' long-horizon coherence test: a model runs a simulated vending machine business for a full year and is scored on the money it ends with, in USD. Higher is better, no ceiling.
Leader: Claude Opus 4.6, 8017.59 USD