Agents Directory

Work in progress: Agents Directory has just launched. Stay tuned, more content is on the way.

AI coding benchmarks

The evaluations that matter for agentic coding, and which models top each leaderboard.

Updated June 2026

Artificial Analysis Intelligence Index

The most-cited composite intelligence score: a 0–100 index combining knowledge, reasoning, math, coding, and agentic evaluations (GPQA Diamond, HLE, IFBench, SciCode, Terminal-Bench Hard, τ²-Bench, and more). Higher is better.

Leader: Claude Fable 5 (59.9)

BrowseComp

OpenAI's hard web-browsing benchmark: 1,266 questions whose answers are hard to find but easy to verify, requiring persistent multi-step browsing. Higher is better.

Leader: GPT-5.5 Pro (90.1%)

CursorBench 3.1

Ambiguous, multi-file tasks from real Cursor sessions that test codebase understanding, bug finding, planning, and code review.

Leader: Claude Fable 5 (72.9%)

DeepSWE

Datacurve's agentic coding benchmark: each model runs as an autonomous agent on real software engineering tasks and is scored on whether its final patch resolves the issue. Higher is better.

Leader: GPT-5.5 (70%)

Design Arena

A crowdsourced Elo arena for AI-generated design and frontend code. Models go head to head on the same prompt (websites, UI components, games, mobile apps, SVG), and human votes set the rating. Higher Elo is better.

Leader: GLM-5.2 (1360 Elo)
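To make the head-to-head rating idea concrete, here is a minimal sketch of a textbook Elo update after one human vote. The page does not say which rating formula Design Arena uses, so the logistic expectation, the 400-point scale, and the K-factor of 32 are all assumptions, and the function names are hypothetical.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return the new (rating_a, rating_b) after a single vote.

    k is an assumed K-factor; a real arena may use a different value
    or a different rating model altogether.
    """
    exp_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    new_a = rating_a + k * (score_a - exp_a)
    # B's change mirrors A's, so the total rating in the pool is conserved.
    new_b = rating_b - k * (score_a - exp_a)
    return new_a, new_b


if __name__ == "__main__":
    # Two equally rated models: the winner gains what the loser gives up.
    print(update_elo(1300.0, 1300.0, a_won=True))
```

In this sketch an upset win against a higher-rated model moves the ratings more than an expected win, which is why a long run of votes is needed before a leaderboard position stabilizes.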

FrontierCode Diamond

The 50 hardest FrontierCode tasks: the toughest production-code problems, graded on whether maintainers would merge the patch. Scores stay low by design. Higher is better.

Leader: Claude Opus 4.8 (13.4%)

FrontierCode Main

Cognition's test of whether a model writes code maintainers would actually merge, not just code that passes tests. Main is the 100 hardest of 150 tasks. Higher is better.

Leader: Claude Fable 5 (46.3%)

GAIA2

Meta's general-assistant benchmark: agents complete realistic multi-app tasks in a simulated smartphone environment under time pressure, ambiguity and dynamic events. Higher is better.

Leader: Claude Opus 4.6 (57%)

OSWorld-Verified

The standard computer-use benchmark: agents complete real desktop tasks in a live Ubuntu VM from screenshots, mouse and keyboard, scored by execution-based checks. Higher is better.

Leader: Claude Fable 5 (85%)

SWE-bench Verified

The most-cited agentic coding benchmark: can a model fix a real GitHub issue in a real repository? 500 human-validated tasks, scored by the repo's own tests. Higher is better.

Leader: Claude Fable 5 (95%)

SWE-Lancer (IC Diamond)

OpenAI's freelance-work benchmark: real Upwork software tasks with real dollar payouts, scored as the share of the task pool's value a model earns. Higher is better.

Leader: GPT-5.3-Codex (81.4%)
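The "share of the task pool's value" score can be illustrated with a short sketch. It assumes each task has a fixed payout and a binary solved/unsolved outcome; the function name and the sample payouts are hypothetical and not taken from the benchmark.

```python
def earned_share(tasks: list[tuple[float, bool]]) -> float:
    """Percentage of the pool's total payout earned by the model.

    tasks: (payout_in_dollars, solved) pairs. An empty or zero-value
    pool returns 0.0 rather than dividing by zero.
    """
    total = sum(payout for payout, _ in tasks)
    if total == 0:
        return 0.0
    earned = sum(payout for payout, solved in tasks if solved)
    return 100.0 * earned / total


if __name__ == "__main__":
    # Hypothetical pool: two solved tasks worth $1,000 out of $2,000 total.
    pool = [(250.0, True), (1000.0, False), (750.0, True)]
    print(earned_share(pool))
```

Because the score is weighted by payout, solving one expensive task counts for more than solving several cheap ones, which differs from a plain pass rate.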

Tau2-Bench Telecom

Sierra's customer-support agent benchmark: resolve telecom troubleshooting tasks by talking to a simulated user, calling tools and strictly following policy. Pass@1, higher is better.

Leader: Claude Opus 4.6 (99.3%)

Terminal-Bench 2.0

Measures how well AI agents complete real tasks in a terminal: software engineering, ML, security and data science, each run in a containerized sandbox. Higher is better.

Leader: GPT-5.5 (84.7%)

Not testing the latest models

Aider Polyglot

The practitioner favorite for code editing: 225 hard Exercism exercises across six languages, solved end to end through the aider tool and checked by unit tests. Higher is better.

Berkeley Function Calling Leaderboard V4

The reference leaderboard for tool calling: overall accuracy across single-turn, multi-turn, web search, memory and hallucination-resistance categories. Higher is better.

Leader: Claude Opus 4.5 (77.47%)

LiveCodeBench v6

Contamination-free competitive programming: problems are continuously collected from LeetCode, AtCoder and Codeforces after model cutoffs and scored as pass@1. Higher is better.

Leader: DeepSeek V4 (93.5%)
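For readers unfamiliar with pass@1, the sketch below uses the widely cited unbiased pass@k estimator, which reduces to "fraction of correct samples" when k is 1. Whether LiveCodeBench computes its score exactly this way is an assumption; the function names and sample counts are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k for one problem.

    n: samples generated, c: samples that passed all tests, k: budget.
    """
    if n - c < k:
        # Every size-k subset must contain at least one passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_1(results: list[tuple[int, int]]) -> float:
    """Mean pass@1 over problems, as a percentage.

    results: (n_samples, n_correct) per problem.
    """
    return 100.0 * sum(pass_at_k(n, c, 1) for n, c in results) / len(results)


if __name__ == "__main__":
    # Hypothetical run: four problems with differing sample counts.
    print(benchmark_pass_at_1([(1, 1), (1, 0), (4, 2), (4, 4)]))
```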

METR 50% Time Horizon

The length of task, measured in minutes of human expert time, that a model can complete autonomously with 50% reliability. The field's most-cited capability trend. Higher is better.

Leader: Claude Opus 4.6 (718.81 min)
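One way to read a "50% time horizon" is as the task length at which a fitted success curve crosses 50%. The sketch below fits a logistic curve to success against log2 of human task time using plain gradient ascent. The page does not describe the actual fitting procedure, so the model form, the optimizer settings, and the sample data are all assumptions.

```python
import math


def fit_time_horizon(runs, steps: int = 5000, lr: float = 0.1) -> float:
    """Estimate the task length (minutes) with 50% predicted success.

    runs: (human_minutes, succeeded) pairs. Assumes the data contains
    both successes and failures that overlap in length; otherwise the
    fit diverges or the slope is zero and no horizon exists.
    """
    xs = [math.log2(minutes) for minutes, _ in runs]
    ys = [1.0 if ok else 0.0 for _, ok in runs]
    mean_x = sum(xs) / len(xs)
    centered = [x - mean_x for x in xs]  # centering keeps the fit stable
    a, b = 0.0, 0.0  # intercept and slope of the logistic model
    n = len(xs)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, y in zip(centered, ys):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            grad_a += y - p
            grad_b += (y - p) * x
        # Gradient ascent on the mean log-likelihood.
        a += lr * grad_a / n
        b += lr * grad_b / n
    # Success probability is 0.5 where a + b * (x - mean_x) = 0.
    return 2 ** (mean_x - a / b)


if __name__ == "__main__":
    # Hypothetical runs: reliable on short tasks, mixed at 8 and 16 minutes.
    runs = [(1, True), (2, True), (4, True), (8, True), (8, False),
            (16, True), (16, False), (32, False), (64, False), (128, False)]
    print(round(fit_time_horizon(runs), 2))
```

With this symmetric sample the crossover falls between the 8-minute and 16-minute tasks, where successes and failures are evenly mixed.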

SWE-bench Pro

Scale AI's harder, contamination-resistant successor to SWE-bench: 731 public long-horizon engineering tasks, every model run on identical scaffolding. Higher is better.

Leader: GPT-5.4 (59.1%)

Vending-Bench 2

Andon Labs' long-horizon coherence test: a model runs a simulated vending machine business for a full year and is scored on the money it ends with, in USD. Higher is better, no ceiling.

Leader: Claude Opus 4.6 (8017.59 USD)