Agents Directory

Work in progress: Agents Directory has just launched. Stay tuned, more content is on the way.

Rankings

Independent model, benchmark, and agent rankings for AI coding, showing what actually leads right now. Refreshed regularly.

Updated June 2026


Best models for your agent

Best cheap AI models

You no longer need a $15-per-million flagship for most work. These are the best cheap AI models in 2026: everything here costs at most about $1 per million input tokens, ranked by how much real capability you keep while the price drops.

Best cheap models for Hermes

Hermes runs all day, so the model price is your real subscription fee. These are the best cheap models for Hermes right now: everything costs at most about $0.75 per million input tokens, ranked by how much agent capability survives the price cut.

Best Chinese AI models

Chinese labs now ship frontier-class models at a fraction of US flagship prices, and most of them publish open weights. These are the best Chinese AI models in 2026, ranked on real agentic-coding performance and value, from Moonshot, DeepSeek, Alibaba, MiniMax, Z.AI, and Xiaomi.

Best free coding models on OpenRouter

Want a coding model that costs $0? OpenRouter serves several capable coding models for free (rate limits apply). These are the best free coding models on OpenRouter, ranked on real agentic-coding ability, from quick edits to full agent runs.

Best free models for Hermes

Want to run Hermes without paying per token? OpenRouter serves plenty of capable models at $0 (rate limits apply). These are the best free models for Hermes, grouped by what you run it for: general agent work, coding, and fast high-volume steps.

Best free models for OpenClaw

Want to run OpenClaw without paying per token? OpenRouter serves plenty of capable models at $0 (rate limits apply). These are the best free models for OpenClaw, grouped by what you run it for: general operator work, coding, and fast high-volume steps.

Best free models on OpenRouter

OpenRouter lists dozens of models you can call at $0 (rate limits apply). These are the best free models on OpenRouter right now, grouped by what you want to do: general use, coding, chat and reasoning, vision, and uncensored.

Best models for Claude Code

Claude Code is tuned for the Claude family, but it's worth knowing where each model lands on capability and cost for your workload.

Best models for Codex

Codex only runs OpenAI models, so the real choice is which GPT tier fits your work. Here's how the three options in the picker compare on capability and cost.

Best models for Hermes

Hermes runs your skills locally and leans on the model for planning and skill use. These are the models that pair best with it right now, grouped by what you actually want to spend and ranked on real agentic-coding performance.

Best models for OpenClaw

OpenClaw lives in your messaging apps and acts on your behalf, so it rewards models with reliable skill use, long sessions, and sane costs. These are the models that pair best with it right now, grouped by what you actually want to spend and weighed against the always-on token bill.

Best models to run with Ollama

Running models locally with Ollama means no API keys, no rate limits, and no data leaving your machine. These are the best open-weight models to run locally in 2026, grouped by the hardware you actually have, from a 16GB laptop to a multi-GPU server.

Best open-source AI models

Open-weight models you can download, inspect, and self-host now sit within reach of the closed frontier. These are the best open-source AI models in 2026, ranked on real capability, from full-size flagships to models you can run on a workstation.

Best open-source models for Hermes

Open-weight models you can inspect, fine-tune, and self-host are ideal for privacy-sensitive or air-gapped Hermes setups. We rank a deep bench here because Hermes routes more than 380 different models in the wild, and most of its real volume goes to open weights.

Best open-source models for OpenClaw

Open-weight models you can inspect, fine-tune, and self-host are a natural fit for OpenClaw, a self-hostable operator runtime: you can keep the whole stack in-house with no per-token bill. We rank a deep bench here because most of OpenClaw's real volume goes to open weights.

Best Qwen models

Alibaba's Qwen family is the broadest model lineup in open AI: frontier flagships, coding specialists, vision models, and tiny MoEs that run on a laptop. These are the best Qwen models in 2026 and what each one is actually for.

Best models by AgentsDirectory Index

Overall capability across every active benchmark, weighted by credibility.

#  | Model                  | Provider  | Open | ADI  | CursorBench | Context | Input price
1  | Claude Fable 5         | Anthropic | No   | 100  | 72.9%       | 1M      | $10/M
2  | Claude Opus 4.8        | Anthropic | No   | 97.3 | 63.8%       | 1M      | $5/M
3  | GPT-5.5                | OpenAI    | No   | 97.2 | 64.3%       | 1.05M   | $5/M
4  | GPT-5.4                | OpenAI    | No   | 90.2 | n/a         | 1.05M   | $2.5/M
5  | Claude Opus 4.7        | Anthropic | No   | 89.7 | 64.8%       | 1M      | $5/M
6  | GPT-5.2                | OpenAI    | No   | 87.9 | n/a         | 400K    | $1.75/M
7  | MiniMax-M3             | MiniMax   | No   | 82.4 | n/a         | 205K    | $0.3/M
8  | Gemini 3.1 Pro Preview | Google    | No   | 77.9 | n/a         | 1.049M  | $2/M
9  | Claude Sonnet 4.6      | Anthropic | No   | 77.8 | 49%         | 1M      | $3/M
10 | Gemini 3 Flash         | Google    | No   | 76.3 | n/a         | 1.049M  | $0.5/M
11 | Kimi K2.6              | Moonshot  | Yes  | 73.3 | 47.6%       | 262K    | $0.66/M
12 | GPT-5.4 Mini           | OpenAI    | No   | 72.2 | n/a         | 400K    | $0.75/M
13 | Kimi K2.5              | Moonshot  | Yes  | 65.3 | 31.9%       | 262K    | $0.375/M

The AgentsDirectory Index (ADI) is our own composite score: it combines the field's major coding and agent benchmarks, so the ranking is not tied to any one vendor's leaderboard. ADI is not itself a benchmark; it aggregates them.
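
To make the idea of a credibility-weighted composite concrete, here is a minimal Python sketch. It is an illustration only, not the actual ADI formula: the benchmark names, weights, and scores are hypothetical, every score is assumed to already be on a 0-100 scale, missing results are simply skipped, and the final rescaling (top model = 100) is an assumption suggested by the table above.

```python
def weighted_mean(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average over the benchmarks a model actually has a score for."""
    covered = [name for name in scores if name in weights]
    if not covered:
        raise ValueError("model has no score on any weighted benchmark")
    total_weight = sum(weights[name] for name in covered)
    return sum(weights[name] * scores[name] for name in covered) / total_weight


def composite_index(
    models: dict[str, dict[str, float]], weights: dict[str, float]
) -> dict[str, float]:
    """Rescale each model's weighted mean so the top model lands on 100."""
    raw = {model: weighted_mean(scores, weights) for model, scores in models.items()}
    top = max(raw.values())
    return {model: round(100 * value / top, 1) for model, value in raw.items()}


# Hypothetical inputs: names, weights, and scores are made up for illustration.
weights = {"bench_a": 2.0, "bench_b": 1.0}
models = {
    "model_x": {"bench_a": 80.0, "bench_b": 50.0},
    "model_y": {"bench_a": 60.0},  # no bench_b result, so it is skipped
}
index = composite_index(models, weights)
print(index)  # {'model_x': 100.0, 'model_y': 85.7}
```

Benchmarks reported in other units (Elo, minutes, USD) would need to be normalized to a common scale before entering a sum like this.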


Benchmark leaderboards


Artificial Analysis Intelligence Index

The most-cited composite intelligence score: a 0–100 index combining knowledge, reasoning, math, coding, and agentic evaluations (GPQA Diamond, HLE, IFBench, SciCode, Terminal-Bench Hard, τ²-Bench, and more). Higher is better.
Leader: Claude Fable 5 (59.9)

SWE-bench Verified

The most-cited agentic coding benchmark: can a model fix a real GitHub issue in a real repository? 500 human-validated tasks, scored by the repo's own tests. Higher is better.
Leader: Claude Fable 5 (95%)

DeepSWE

Datacurve's agentic coding benchmark: each model runs as an autonomous agent on real software engineering tasks and is scored on whether its final patch resolves the issue. Higher is better.
Leader: GPT-5.5 (70%)

FrontierCode Main

Cognition's test of whether a model writes code maintainers would actually merge, not just code that passes tests. Main is the 100 hardest of 150 tasks. Higher is better.
Leader: Claude Fable 5 (46.3%)

CursorBench 3.1

Ambiguous, multi-file tasks from real Cursor sessions that test codebase understanding, bug finding, planning, and code review.
Leader: Claude Fable 5 (72.9%)

Aider Polyglot

The practitioner favorite for code editing: 225 hard Exercism exercises across six languages, solved end to end through the aider tool and checked by unit tests. Higher is better.
Leader: GPT-5 (88%)

Berkeley Function Calling Leaderboard V4

The reference leaderboard for tool calling: overall accuracy across single-turn, multi-turn, web search, memory and hallucination-resistance categories. Higher is better.
Leader: Claude Opus 4.5 (77.47%)

BrowseComp

OpenAI's hard web-browsing benchmark: 1,266 questions whose answers are hard to find but easy to verify, requiring persistent multi-step browsing. Higher is better.
Leader: GPT-5.5 Pro (90.1%)

Design Arena

A crowdsourced Elo arena for AI-generated design and frontend code. Models go head to head on the same prompt (websites, UI components, games, mobile apps, SVG), and human votes set the rating. Higher Elo is better.
Leader: GLM-5.2 (1360 Elo)
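
For readers unfamiliar with Elo, the sketch below shows how a single head-to-head vote moves two ratings under the classic Elo formula. It is an assumption that an arena like this uses anything close to it: the K-factor of 32, the 400-point scale, and the starting ratings are conventional textbook values, not figures from Design Arena.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the classic Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_ratings(
    rating_a: float, rating_b: float, a_won: bool, k: float = 32.0
) -> tuple[float, float]:
    """Apply one vote: the winner gains exactly what the loser drops."""
    actual = 1.0 if a_won else 0.0
    delta = k * (actual - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


# Two hypothetical models start level; one human vote goes to model A.
new_a, new_b = update_ratings(1200.0, 1200.0, a_won=True)
print(new_a, new_b)  # 1216.0 1184.0
```

An upset win against a higher-rated model moves the ratings more than an expected win, which is why a rating gap reflects sustained preference rather than a single vote.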

FrontierCode Diamond

The 50 hardest FrontierCode tasks: the toughest production-code problems, graded on whether maintainers would merge the patch. Scores stay low by design. Higher is better.
Leader: Claude Opus 4.8 (13.4%)

GAIA2

Meta's general-assistant benchmark: agents complete realistic multi-app tasks in a simulated smartphone environment under time pressure, ambiguity and dynamic events. Higher is better.
Leader: Claude Opus 4.6 (57%)

LiveCodeBench v6

Contamination-free competitive programming: problems are continuously collected from LeetCode, AtCoder and Codeforces after model cutoffs and scored as pass@1. Higher is better.
Leader: DeepSeek V4 (93.5%)

METR 50% Time Horizon

The length of human expert task time (in minutes) a model can complete autonomously with 50% reliability. The field's most-cited capability trend. Higher is better.
Leader: Claude Opus 4.6 (718.81 min)

OSWorld-Verified

The standard computer-use benchmark: agents complete real desktop tasks in a live Ubuntu VM from screenshots, mouse and keyboard, scored by execution-based checks. Higher is better.
Leader: Claude Fable 5 (85%)

SWE-bench Pro

Scale AI's harder, contamination-resistant successor to SWE-bench: 731 public long-horizon engineering tasks, every model run on identical scaffolding. Higher is better.
Leader: GPT-5.4 (59.1%)

SWE-Lancer (IC Diamond)

OpenAI's freelance-work benchmark: real Upwork software tasks with real dollar payouts, scored as the share of the task pool's value a model earns. Higher is better.
Leader: GPT-5.3-Codex (81.4%)
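
A value-weighted score like this differs from a plain solve rate, which a small worked example can show. The sketch below assumes the score is simply earned payout divided by total payout; the task payouts and outcomes are hypothetical, not taken from the benchmark.

```python
def earnings_share(tasks: list[tuple[float, bool]]) -> float:
    """Percent of the pool's total payout earned; tasks are (payout_usd, solved)."""
    total = sum(payout for payout, _ in tasks)
    if total == 0:
        raise ValueError("task pool has no payout value")
    earned = sum(payout for payout, solved in tasks if solved)
    return 100.0 * earned / total


# Hypothetical pool: two small tasks solved, one large task failed.
tasks = [(250.0, True), (1000.0, False), (750.0, True)]
share = earnings_share(tasks)
solve_rate = 100.0 * sum(solved for _, solved in tasks) / len(tasks)
print(share)                 # 50.0
print(round(solve_rate, 1))  # 66.7
```

Here the model solves two of three tasks but earns only half the money, because the task it failed carried the largest payout.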

Tau2-Bench Telecom

Sierra's customer-support agent benchmark: resolve telecom troubleshooting tasks by talking to a simulated user, calling tools and strictly following policy. Pass@1, higher is better.
Leader: Claude Opus 4.6 (99.3%)

Terminal-Bench 2.0

Measures how well AI agents complete real tasks in a terminal: software engineering, ML, security and data science, each run in a containerized sandbox. Higher is better.
Leader: GPT-5.5 (84.7%)

Vending-Bench 2

Andon Labs' long-horizon coherence test: a model runs a simulated vending machine business for a full year and is scored on the money it ends with, in USD. Higher is better, no ceiling.
Leader: Claude Opus 4.6 (8017.59 USD)

Top AI coding agents

  • Hermes: Skills, Integrations & Self-hosting for Hermes
  • Claude Code: Skills, Plugins & MCP Servers for Claude Code
  • Codex: Skills & MCP Servers for OpenAI Codex
  • OpenClaw: Skills & Automation for OpenClaw

Want real usage numbers? See the full agent leaderboard, ranked by tokens routed through OpenRouter and refreshed daily.


© 2026 Agents Directory