Work in progress: Agents Directory has just launched. Stay tuned, more content is on the way.
Rankings
Independent model, benchmark, and agent rankings for AI coding, showing what actually leads right now. Refreshed regularly.
Updated June 2026
Best models for your agent
Best cheap AI models
You no longer need a $15-per-million flagship for most work. These are the best cheap AI models in 2026: everything here costs at most about $1 per million input tokens, ranked by how much real capability you keep while the price drops.
Best cheap models for Hermes
Hermes runs all day, so the model price is your real subscription fee. These are the best cheap models for Hermes right now: everything costs at most about $0.75 per million input tokens, ranked by how much agent capability survives the price cut.
Best Chinese AI models
Chinese labs now ship frontier-class models at a fraction of US flagship prices, and most of them publish open weights. These are the best Chinese AI models in 2026, ranked on real agentic-coding performance and value, from Moonshot, DeepSeek, Alibaba, MiniMax, Z.AI, and Xiaomi.
Best free coding models on OpenRouter
Want a coding model that costs $0? OpenRouter serves several capable coding models for free (rate limits apply). These are the best free coding models on OpenRouter, ranked on real agentic-coding ability, from quick edits to full agent runs.
Best free models for Hermes
Want to run Hermes without paying per token? OpenRouter serves plenty of capable models at $0 (rate limits apply). These are the best free models for Hermes, grouped by what you run it for: general agent work, coding, and fast high-volume steps.
Best free models for OpenClaw
Want to run OpenClaw without paying per token? OpenRouter serves plenty of capable models at $0 (rate limits apply). These are the best free models for OpenClaw, grouped by what you run it for: general operator work, coding, and fast high-volume steps.
Best free models on OpenRouter
OpenRouter lists dozens of models you can call at $0 (rate limits apply). These are the best free models on OpenRouter right now, grouped by what you want to do: general use, coding, chat and reasoning, vision, and uncensored.
Best models for Claude Code
Claude Code is tuned for the Claude family, but it's worth knowing where each model lands on capability and cost for your workload.
Best models for Codex
Codex only runs OpenAI models, so the real choice is which GPT tier fits your work. Here's how the three options in the picker compare on capability and cost.
Best models for Hermes
Hermes runs your skills locally and leans on the model for planning and skill use. These are the models that pair best with it right now, grouped by what you actually want to spend and ranked on real agentic-coding performance.
Best models for OpenClaw
OpenClaw lives in your messaging apps and acts on your behalf, so it rewards models with reliable skill use, long sessions, and sane costs. These are the models that pair best with it right now, grouped by what you actually want to spend and weighed against the always-on token bill.
Best models to run with Ollama
Running models locally with Ollama means no API keys, no rate limits, and no data leaving your machine. These are the best open-weight models to run locally in 2026, grouped by the hardware you actually have, from a 16GB laptop to a multi-GPU server.
Best open-source AI models
Open-weight models you can download, inspect, and self-host now sit within reach of the closed frontier. These are the best open-source AI models in 2026, ranked on real capability, from full-size flagships to models you can run on a workstation.
Best open-source models for Hermes
Open-weight models you can inspect, fine-tune, and self-host. Ideal for privacy-sensitive or air-gapped Hermes setups. We rank a deep bench here because Hermes routes more than 380 different models in the wild, and most of its real volume goes to open weights.
Best open-source models for OpenClaw
Open-weight models you can inspect, fine-tune, and self-host. A natural fit for OpenClaw, which is a self-hostable operator runtime, so you can keep the whole stack in-house with no per-token bill. We rank a deep bench here because most of OpenClaw's real volume goes to open weights.
Best Qwen models
Alibaba's Qwen family is the broadest model lineup in open AI: frontier flagships, coding specialists, vision models, and tiny MoEs that run on a laptop. These are the best Qwen models in 2026 and what each one is actually for.
Best models by AgentsDirectory Index
Overall capability across every active benchmark, weighted by credibility.
# | Model | Provider | Open | ADI | CursorBench | Context | Input price
The AgentsDirectory Index (ADI) is our own composite score: it combines the field's major coding and agent benchmarks, so the ranking is not tied to any one vendor's leaderboard. ADI is not itself a benchmark; it aggregates them.
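As a rough illustration of what "aggregates benchmarks, weighted by credibility" could look like, here is a minimal sketch of a weighted mean. The actual ADI formula is not given here, so the function name `composite_score`, the benchmark names, the weights, and the assumption that scores are already normalized to a common 0-100 scale are all hypothetical.

```python
from typing import Dict, Optional


def composite_score(
    scores: Dict[str, Optional[float]],
    weights: Dict[str, float],
) -> Optional[float]:
    """Weighted mean over the benchmarks a model actually has a score for.

    Assumption: scores are already normalized to the same scale.
    Benchmarks with no score are skipped and the remaining weights are
    renormalized, so a missing result does not count as a zero.
    """
    total = 0.0
    weight_sum = 0.0
    for name, weight in weights.items():
        value = scores.get(name)
        if value is None:
            continue  # model was never run on this benchmark
        total += weight * value
        weight_sum += weight
    return total / weight_sum if weight_sum > 0 else None


# Hypothetical credibility weights and normalized scores.
weights = {"bench_a": 3.0, "bench_b": 1.0}
model_scores = {"bench_a": 80.0, "bench_b": 40.0}
print(composite_score(model_scores, weights))  # 70.0
```

Renormalizing over available benchmarks is only one possible design choice; a real index may instead require minimum coverage or impute missing scores.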
How the index works
Benchmark leaderboards
Artificial Analysis Intelligence Index
The most-cited composite intelligence score: a 0–100 index combining knowledge, reasoning, math, coding, and agentic evaluations (GPQA Diamond, HLE, IFBench, SciCode, Terminal-Bench Hard, τ²-Bench, and more). Higher is better.
Leader: Claude Fable 5, 59.9
SWE-bench Verified
The most-cited agentic coding benchmark: can a model fix a real GitHub issue in a real repository? 500 human-validated tasks, scored by the repo's own tests. Higher is better.
Leader: Claude Fable 5, 95%
DeepSWE
Datacurve's agentic coding benchmark: each model runs as an autonomous agent on real software engineering tasks and is scored on whether its final patch resolves the issue. Higher is better.
Leader: GPT-5.5, 70%
FrontierCode Main
Cognition's test of whether a model writes code maintainers would actually merge, not just code that passes tests. Main is the 100 hardest of 150 tasks. Higher is better.
Leader: Claude Fable 5, 46.3%
CursorBench 3.1
Ambiguous, multi-file tasks from real Cursor sessions that test codebase understanding, bugfinding, planning, and code review.
Leader: Claude Fable 5, 72.9%
Aider Polyglot
The practitioner favorite for code editing: 225 hard Exercism exercises across six languages, solved end to end through the aider tool and checked by unit tests. Higher is better.
Leader: GPT-5, 88%
Berkeley Function Calling Leaderboard V4
The reference leaderboard for tool calling: overall accuracy across single-turn, multi-turn, web search, memory and hallucination-resistance categories. Higher is better.
Leader: Claude Opus 4.5, 77.47%
BrowseComp
OpenAI's hard web-browsing benchmark: 1,266 questions whose answers are hard to find but easy to verify, requiring persistent multi-step browsing. Higher is better.
Leader: GPT-5.5 Pro, 90.1%
Design Arena
A crowdsourced Elo arena for AI-generated design and frontend code. Models go head to head on the same prompt (websites, UI components, games, mobile apps, SVG), and human votes set the rating. Higher Elo is better.
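To make "human votes set the rating" concrete, here is a minimal sketch of a classic Elo update after one head-to-head vote. The source does not say which rating variant or parameters Design Arena uses, so the K-factor of 32 and the 400-point scale below are the textbook defaults, assumed for illustration only.

```python
from typing import Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(
    rating_a: float,
    rating_b: float,
    a_won: bool,
    k: float = 32.0,  # assumed K-factor, not taken from the source
) -> Tuple[float, float]:
    """Return both ratings after a single vote between A and B."""
    delta = k * ((1.0 if a_won else 0.0) - expected_score(rating_a, rating_b))
    # Points gained by one side are lost by the other.
    return rating_a + delta, rating_b - delta


# Two equally rated models; the voter prefers A.
print(update_elo(1500.0, 1500.0, a_won=True))  # (1516.0, 1484.0)
```

An upset (a lower-rated model winning) moves the ratings more than an expected result, which is why a model's Elo reflects who it beat as well as how often it won.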
FrontierCode Diamond
The 50 hardest FrontierCode tasks: the toughest production-code problems, graded on whether maintainers would merge the patch. Scores stay low by design. Higher is better.
Leader: Claude Opus 4.8, 13.4%
GAIA2
Meta's general-assistant benchmark: agents complete realistic multi-app tasks in a simulated smartphone environment under time pressure, ambiguity and dynamic events. Higher is better.
Leader: Claude Opus 4.6, 57%
LiveCodeBench v6
Contamination-free competitive programming: problems are continuously collected from LeetCode, AtCoder and Codeforces after model cutoffs and scored as pass@1. Higher is better.
Leader: DeepSeek V4, 93.5%
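For readers unfamiliar with pass@1: with one attempt per problem, it is simply the fraction of problems solved on that first attempt. The sketch below assumes that single-sample setup; the helper name `pass_at_1` and the sample data are hypothetical, and the benchmark's own harness may average over several samples per problem.

```python
from typing import List


def pass_at_1(first_attempt_passed: List[bool]) -> float:
    """Fraction of problems whose first generated solution passed all tests."""
    if not first_attempt_passed:
        raise ValueError("need at least one problem")
    return sum(first_attempt_passed) / len(first_attempt_passed)


# Hypothetical results for four problems: three solved on the first try.
print(pass_at_1([True, False, True, True]))  # 0.75
```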
METR 50% Time Horizon
The longest task, measured by how many minutes it takes a human expert, that a model can complete autonomously with 50% reliability. The field's most-cited capability trend. Higher is better.
Leader: Claude Opus 4.6, 718.81 min
OSWorld-Verified
The standard computer-use benchmark: agents complete real desktop tasks in a live Ubuntu VM from screenshots, mouse and keyboard, scored by execution-based checks. Higher is better.
Leader: Claude Fable 5, 85%
SWE-bench Pro
Scale AI's harder, contamination-resistant successor to SWE-bench: 731 public long-horizon engineering tasks, every model run on identical scaffolding. Higher is better.
Leader: GPT-5.4, 59.1%
SWE-Lancer (IC Diamond)
OpenAI's freelance-work benchmark: real Upwork software tasks with real dollar payouts, scored as the share of the task pool's value a model earns. Higher is better.
Leader: GPT-5.3-Codex, 81.4%
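Scoring by "share of the task pool's value" differs from a plain solve rate because expensive tasks count for more. A minimal sketch, assuming each task is an all-or-nothing payout; the function name `earned_share` and the dollar amounts are made up for illustration.

```python
from typing import List, Tuple


def earned_share(tasks: List[Tuple[float, bool]]) -> float:
    """Share of total payout earned, given (payout_usd, solved) per task."""
    total = sum(payout for payout, _ in tasks)
    if total <= 0:
        raise ValueError("task pool has no payout value")
    earned = sum(payout for payout, solved in tasks if solved)
    return earned / total


# Hypothetical pool: two of three tasks solved, but the missed task is
# the most valuable, so the model earns only half the pool.
pool = [(250.0, True), (1000.0, False), (750.0, True)]
print(earned_share(pool))  # 0.5
```

Here the solve rate would be about 67%, while the value-weighted score is 50%.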
Tau2-Bench Telecom
Sierra's customer-support agent benchmark: resolve telecom troubleshooting tasks by talking to a simulated user, calling tools and strictly following policy. Pass@1, higher is better.
Leader: Claude Opus 4.6, 99.3%
Terminal-Bench 2.0
Measures how well AI agents complete real tasks in a terminal: software engineering, ML, security and data science, each run in a containerized sandbox. Higher is better.
Leader: GPT-5.5, 84.7%
Vending-Bench 2
Andon Labs' long-horizon coherence test: a model runs a simulated vending machine business for a full year and is scored on the money it ends with, in USD. Higher is better, no ceiling.
Leader: Claude Opus 4.6, 8017.59 USD
Top AI coding agents
Agent
- Hermes: Skills, Integrations & Self-hosting for Hermes
- Claude Code: Skills, Plugins & MCP Servers for Claude Code
- Codex: Skills & MCP Servers for OpenAI Codex
- OpenClaw: Skills & Automation for OpenClaw
Want real usage numbers? See the full agent leaderboard, ranked by tokens routed through OpenRouter and refreshed daily.