Agents Directory

DeepSWE

Coding

Datacurve's agentic coding benchmark: each model runs as an autonomous agent on real software engineering tasks and is scored on whether its final patch resolves the issue. Higher is better.

Official source
Claude Fable 5 has not been submitted to DeepSWE yet, so it does not appear on this board.

DeepSWE drops a model into real open source repositories as an autonomous coding agent and counts a task as solved only when the agent's final patch passes the hidden test suite. The headline score is the resolve rate averaged over the task set, and the ± figure is the standard error across trials, so close scores with wide bars should be read as a tie. Each run also tracks average cost, time, and output tokens per task, which turns the board into a cost-vs-capability frontier: the same model at a higher reasoning effort usually scores more but costs more per task.
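The ± figure and the "read as a tie" advice can be made concrete with a small sketch. It assumes the standard error is the sample standard deviation of per-trial resolve rates divided by the square root of the trial count, and it uses a two-standard-error rule of thumb for ties. Neither the per-trial numbers nor the tie threshold come from DeepSWE; the function names are hypothetical and only the final scores fed to the tie check are taken from the leaderboard.

```python
import math
import statistics


def resolve_rate_with_stderr(trial_rates):
    """Return (mean resolve rate, standard error) across trials.

    Assumes each entry is one trial's resolve rate over the full task set.
    Needs at least two trials, otherwise the sample stdev is undefined.
    """
    mean = statistics.mean(trial_rates)
    stderr = statistics.stdev(trial_rates) / math.sqrt(len(trial_rates))
    return mean, stderr


def is_statistical_tie(score_a, se_a, score_b, se_b, z=2.0):
    """Rule of thumb (an assumption, not DeepSWE's method): treat two scores
    as tied when their gap is within z combined standard errors."""
    combined_se = math.hypot(se_a, se_b)  # sqrt(se_a**2 + se_b**2)
    return abs(score_a - score_b) <= z * combined_se


if __name__ == "__main__":
    # Hypothetical per-trial resolve rates for one model configuration.
    mean, se = resolve_rate_with_stderr([0.68, 0.70, 0.72])
    print(f"resolve rate {mean:.1%} ± {se:.1%}")

    # Scores and ± values below are copied from the leaderboard (in percent).
    print(is_statistical_tie(58, 0, 56, 2))  # Claude Opus 4.8 max vs GPT-5.4 xhigh
    print(is_statistical_tie(70, 3, 58, 0))  # GPT-5.5 xhigh vs Claude Opus 4.8 max
```

Under this rule of thumb the first pair counts as a tie and the second does not; a stricter or looser `z` would move that boundary.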

Leaderboard
  • OpenAI: GPT-5.5 (xhigh), 70% ± 3%
  • Claude: Claude Opus 4.8 (max), 58% ± 0%
  • OpenAI: GPT-5.4 (xhigh), 56% ± 2%
  • Claude: Claude Opus 4.7 (max), 54% ± 5%
  • Claude: Claude Sonnet 4.6 (high), 32% ± 2%
  • Gemini: Gemini 3.5 Flash (medium), 28% ± 4%
  • MoonshotAI: Kimi K2.6, 24% ± 2%
  • OpenAI: GPT-5.4 Mini (xhigh), 24% ± 3%
  • Minimax: MiniMax-M3, 20.5% ± 0%
  • Gemini: Gemini 3.1 Pro Preview, 10% ± 3%
  • DeepSeek: DeepSeek V4, 8% ± 3%
  • Gemini: Gemini 3 Flash, 5% ± 2%
[Chart: Score vs. cost]
All results
#    Provider     Model                    Effort       Score   Cost
1    OpenAI       GPT-5.5                  Extra High   70%     $6.80
2    OpenAI       GPT-5.5                  High         62%     $4.60
3    Claude       Claude Opus 4.8          Max          58%     $8.50
4    Claude       Claude Opus 4.8          Extra High   57%     $7.00
5    OpenAI       GPT-5.4                  Extra High   56%     $5.50
6    Claude       Claude Opus 4.7          Max          54%     $16.50
7    Claude       Claude Opus 4.8          High         50%     $4.50
8    OpenAI       GPT-5.5                  Medium       48%     $2.40
9    Claude       Claude Opus 4.8          Medium       47%     $3.30
10   Claude       Claude Opus 4.7          Extra High   45%     $11.50
11   Claude       Claude Opus 4.7          High         40%     $5.00
12   Claude       Claude Opus 4.7          Medium       32%     $3.30
13   Claude       Claude Sonnet 4.6        High         32%     $4.50
14   Gemini       Gemini 3.5 Flash         Medium       28%     $7.00
15   MoonshotAI   Kimi K2.6                -            24%     $4.50
16   OpenAI       GPT-5.4 Mini             Extra High   24%     $1.50
17   Minimax      MiniMax-M3               -            20.5%   $5.50
18   Gemini       Gemini 3.1 Pro Preview   -            10%     $2.00
19   DeepSeek     DeepSeek V4              -            8%      $5.50
20   Gemini       Gemini 3 Flash           -            5%      $1.50
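One way to read these results as a cost-vs-capability frontier is to keep only the configurations that no other configuration beats on both score and cost. The sketch below is illustrative, not something the board publishes: the `Config` type and `pareto_frontier` helper are hypothetical names, and the scores and costs are copied from the results above.

```python
from typing import NamedTuple


class Config(NamedTuple):
    name: str
    score: float  # resolve rate in percent, from the Score column
    cost: float   # dollars, from the Cost column


RESULTS = [
    Config("GPT-5.5 (Extra High)", 70, 6.80),
    Config("GPT-5.5 (High)", 62, 4.60),
    Config("Claude Opus 4.8 (Max)", 58, 8.50),
    Config("Claude Opus 4.8 (Extra High)", 57, 7.00),
    Config("GPT-5.4 (Extra High)", 56, 5.50),
    Config("Claude Opus 4.7 (Max)", 54, 16.50),
    Config("Claude Opus 4.8 (High)", 50, 4.50),
    Config("GPT-5.5 (Medium)", 48, 2.40),
    Config("Claude Opus 4.8 (Medium)", 47, 3.30),
    Config("Claude Opus 4.7 (Extra High)", 45, 11.50),
    Config("Claude Opus 4.7 (High)", 40, 5.00),
    Config("Claude Opus 4.7 (Medium)", 32, 3.30),
    Config("Claude Sonnet 4.6 (High)", 32, 4.50),
    Config("Gemini 3.5 Flash (Medium)", 28, 7.00),
    Config("Kimi K2.6", 24, 4.50),
    Config("GPT-5.4 Mini (Extra High)", 24, 1.50),
    Config("MiniMax-M3", 20.5, 5.50),
    Config("Gemini 3.1 Pro Preview", 10, 2.00),
    Config("DeepSeek V4", 8, 5.50),
    Config("Gemini 3 Flash", 5, 1.50),
]


def pareto_frontier(configs):
    """Return configs not dominated on (higher score, lower cost).

    Sweeps from cheapest to most expensive and keeps a config only if it
    scores strictly higher than everything cheaper or equally priced.
    Exact duplicates (same cost and score) are reported once.
    """
    frontier = []
    best_score = float("-inf")
    for cfg in sorted(configs, key=lambda c: (c.cost, -c.score)):
        if cfg.score > best_score:
            frontier.append(cfg)
            best_score = cfg.score
    return frontier


if __name__ == "__main__":
    for cfg in pareto_frontier(RESULTS):
        print(f"{cfg.name}: {cfg.score}% at ${cfg.cost:.2f}")
```

On the numbers above this keeps five configurations, from GPT-5.4 Mini (Extra High) at $1.50 up to GPT-5.5 (Extra High) at $6.80. It ignores the ± bars, so two near-tied configurations can land on opposite sides of the frontier.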
Sources: DeepSWE leaderboard; DeepSWE blog (Datacurve)
Details:
  • Category: Coding
  • Created by: Datacurve
  • Models tested: 12
  • Configs tested: 20
  • Leader: OpenAI GPT-5.5
  • Top score: 70%

Updated June 2026


© 2026 Agents Directory