Agents Directory

SWE-bench Verified

Coding

The most-cited agentic coding benchmark: can a model fix a real GitHub issue in a real repository? 500 human-validated tasks, scored by the repo's own tests. Higher is better.

Vendor harness rows and standardized mini-SWE-agent rows are listed side by side; compare models within one regime, not across them.

SWE-bench Verified is a 500-instance subset of SWE-bench, screened by human annotators (an OpenAI collaboration with the Princeton authors) to remove underspecified or unsolvable tasks. Each task pairs a real GitHub issue with a repository snapshot, and the model must produce a patch that makes the issue's failing tests pass without breaking existing tests.

Two score regimes coexist. Vendors self-report scores obtained with their own agent harnesses (the headline rows here). The official swebench.com leaderboard reruns models on a standardized mini-SWE-agent scaffold, which typically lands 10 to 20 points lower, and publishes an average cost per task (those rows power the score vs. cost chart). Tasks are Python-only issues from pre-2023 repositories, so contamination is a standing concern, and top scores are nearing saturation.
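
The pass/fail rule described above can be sketched in a few lines. This is a minimal illustration, not the official evaluation harness: the function names and the `results` mapping are hypothetical, and the two test lists are modeled on the `FAIL_TO_PASS` and `PASS_TO_PASS` fields of the public dataset, which is an assumption beyond what this page states.

```python
def is_resolved(fail_to_pass, pass_to_pass, results):
    """Return True if a patched repository counts as resolving the task.

    fail_to_pass: tests that fail before the patch and must pass after it.
    pass_to_pass: tests that already pass and must keep passing.
    results:      hypothetical mapping of test name -> True if the test
                  passed after the model's patch was applied.
    A test missing from `results` is treated as a failure.
    """
    fixes_issue = all(results.get(name, False) for name in fail_to_pass)
    keeps_existing = all(results.get(name, False) for name in pass_to_pass)
    return fixes_issue and keeps_existing


def score_percent(resolved_flags):
    """Benchmark score: percentage of tasks resolved."""
    if not resolved_flags:
        raise ValueError("no tasks were evaluated")
    return 100.0 * sum(resolved_flags) / len(resolved_flags)
```

Under this reading, each resolved task is worth 0.2 points on the 500-task set, so a score of 76.8% corresponds to 384 resolved tasks.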

[Chart: score vs. cost per task, mini-SWE-agent rows]
Leaderboard
| #  | Model                  | Configuration           | Score | Avg. cost per task |
|----|------------------------|-------------------------|-------|--------------------|
| 1  | Claude Fable 5         | Vendor harness          | 95%   | n/a                |
| 2  | GPT-5.5                | Vendor harness          | 88.7% | n/a                |
| 3  | Claude Opus 4.8        | Vendor harness          | 88.6% | n/a                |
| 4  | Claude Opus 4.7        | Vendor harness          | 87.6% | n/a                |
| 5  | DeepSeek V4            | Pro Max, vendor harness | 80.6% | n/a                |
| 6  | Gemini 3.1 Pro Preview | Vendor harness          | 80.6% | n/a                |
| 7  | MiniMax-M3             | Vendor harness          | 80.5% | n/a                |
| 8  | Qwen3.7 Max            | Vendor harness          | 80.4% | n/a                |
| 9  | Kimi K2.6              | Vendor harness          | 80.2% | n/a                |
| 10 | GPT-5.2                | Vendor harness          | 80%   | n/a                |
| 11 | Claude Sonnet 4.6      | Vendor harness          | 79.6% | n/a                |
| 12 | Claude Opus 4.5        | live-SWE-agent          | 79.2% | n/a                |
| 13 | DeepSeek V4 Flash      | Max, vendor harness     | 79%   | n/a                |
| 14 | MiMo-V2.5-Pro          | Vendor harness          | 78.9% | n/a                |
| 15 | Mistral Medium 3.5     | Vendor harness          | 77.6% | n/a                |
| 16 | Claude Opus 4.5        | mini-SWE-agent, High    | 76.8% | $0.75              |
| 17 | Gemini 3 Flash         | mini-SWE-agent, High    | 75.8% | $0.36              |
| 18 | MiniMax M2.5           | mini-SWE-agent, High    | 75.8% | $0.07              |
| 19 | Claude Opus 4.6        | mini-SWE-agent          | 75.6% | $0.55              |
| 20 | GLM 5                  | mini-SWE-agent, High    | 72.8% | $0.53              |
| 21 | GPT-5.2 Codex          | mini-SWE-agent          | 72.8% | $0.45              |
| 22 | GPT-5.2                | mini-SWE-agent, High    | 72.8% | $0.47              |
| 23 | Claude Sonnet 4.5      | mini-SWE-agent, High    | 71.4% | $0.66              |
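
The advice to compare models within one regime can be made concrete with a small sketch. It assumes a hand-copied subset of the leaderboard rows above; the tuple layout, the function names, and the simplified regime labels (the "High" setting is dropped) are illustrative choices, not part of any official tooling.

```python
from collections import defaultdict

# (model, regime, score in %, average cost per task in USD or None).
# Values copied from a few leaderboard rows; None means no cost is published.
ROWS = [
    ("GPT-5.2", "vendor harness", 80.0, None),
    ("Claude Sonnet 4.6", "vendor harness", 79.6, None),
    ("Claude Opus 4.5", "mini-SWE-agent", 76.8, 0.75),
    ("Gemini 3 Flash", "mini-SWE-agent", 75.8, 0.36),
    ("MiniMax M2.5", "mini-SWE-agent", 75.8, 0.07),
    ("GPT-5.2", "mini-SWE-agent", 72.8, 0.47),
]


def rank_within_regime(rows):
    """Group rows by regime, then sort each group by score, highest first."""
    by_regime = defaultdict(list)
    for model, regime, score, cost in rows:
        by_regime[regime].append((model, score, cost))
    return {
        regime: sorted(entries, key=lambda entry: entry[1], reverse=True)
        for regime, entries in by_regime.items()
    }


def cheapest_at_or_above(rows, regime, min_score):
    """Cheapest row in one regime that reaches min_score, or None."""
    candidates = [
        row for row in rows
        if row[1] == regime and row[3] is not None and row[2] >= min_score
    ]
    return min(candidates, key=lambda row: row[3], default=None)
```

Calling `rank_within_regime(ROWS)` keeps the two GPT-5.2 entries in separate lists, so the vendor-harness score is never ranked against the mini-SWE-agent score. `cheapest_at_or_above` only returns a result for regimes that publish a cost, which on this page means the mini-SWE-agent rows.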
Sources:
  • Official leaderboard data (leaderboards.json)
  • SWE-bench official leaderboard
  • OpenAI: Introducing SWE-bench Verified
  • SWE-bench_Verified on Hugging Face
  • SWE-bench paper (ICLR 2024)
Details:
  • Category: Coding
  • Created by: SWE-bench
  • Models tested: 21
  • Configs tested: 23
  • Leader: Claude Fable 5
  • Top score: 95%

Updated June 2026

© 2026 Agents Directory