Agents Directory

DeepSWE

Coding

Datacurve's agentic coding benchmark: each model runs as an autonomous agent on real software engineering tasks and is scored on whether its final patch resolves the issue. Higher is better.

Claude Fable 5 has not been submitted to DeepSWE yet, so it does not appear on this board.

DeepSWE drops a model into real open source repositories as an autonomous coding agent and counts a task as solved only when the agent's final patch passes the hidden test suite. The headline score is the resolve rate averaged over the task set, and the ± figure is the standard error across trials, so close scores with wide bars should be read as a tie. Each run also tracks average cost, time, and output tokens per task, which turns the board into a cost-vs-capability frontier: the same model at a higher reasoning effort usually scores more but costs more per task.
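As a rough illustration of how a headline score and its ± figure relate, the sketch below assumes each trial is recorded as a list of pass/fail outcomes, one per task. The function names `resolve_rate` and `is_tie`, the sample data, and the two-standard-error tie threshold are assumptions made for this example, not DeepSWE's published method.

```python
import math
import statistics


def resolve_rate(trials):
    """Return (mean resolve rate, standard error across trials).

    `trials` is a list of trials; each trial is a list of booleans,
    True when the final patch passed the hidden tests for that task.
    """
    per_trial = [sum(t) / len(t) for t in trials]
    mean = statistics.mean(per_trial)
    if len(per_trial) < 2:
        return mean, None  # a single trial gives no spread to estimate
    se = statistics.stdev(per_trial) / math.sqrt(len(per_trial))
    return mean, se


def is_tie(score_a, se_a, score_b, se_b, z=2.0):
    """Treat two scores as tied when their gap is within z combined
    standard errors (z=2.0 is an assumed, conventional threshold)."""
    combined = math.sqrt(se_a ** 2 + se_b ** 2)
    return abs(score_a - score_b) <= z * combined


# Hypothetical data: three trials over the same four tasks.
trials = [
    [True, True, False, False],
    [True, False, False, False],
    [True, True, True, False],
]
mean, se = resolve_rate(trials)
print(f"resolve rate {mean:.0%} ± {se:.0%}")

# Leaderboard figures, in percentage points.
print(is_tie(56, 2, 54, 5))  # GPT-5.4 xhigh vs. Claude Opus 4.7 max
print(is_tie(70, 3, 56, 2))  # GPT-5.5 xhigh vs. GPT-5.4 xhigh
```

Under these assumptions, 56% ± 2% and 54% ± 5% come out as a tie, while 70% ± 3% and 56% ± 2% do not.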

Leaderboard

GPT-5.5 (xhigh): 70% ± 3%
Claude Opus 4.8 (max): 58%
GPT-5.4 (xhigh): 56% ± 2%
Claude Opus 4.7 (max): 54% ± 5%
Claude Sonnet 4.6 (high): 32% ± 2%
Gemini 3.5 Flash (medium): 28% ± 4%
Kimi K2.6: 24% ± 2%
GPT-5.4 Mini (xhigh): 24% ± 3%
MiniMax-M3: 20.5%
Gemini 3.1 Pro Preview: 10% ± 3%
DeepSeek V4: 8% ± 3%
Gemini 3 Flash: 5% ± 2%

[Figure: Score vs. cost]

All results

#   Model                   Effort      Score  Cost per task
1   GPT-5.5                 Extra High  70%    $6.80
2   GPT-5.5                 High        62%    $4.60
3   Claude Opus 4.8         Max         58%    $8.50
4   Claude Opus 4.8         Extra High  57%    $7.00
5   GPT-5.4                 Extra High  56%    $5.50
6   Claude Opus 4.7         Max         54%    $16.50
7   Claude Opus 4.8         High        50%    $4.50
8   GPT-5.5                 Medium      48%    $2.40
9   Claude Opus 4.8         Medium      47%    $3.30
10  Claude Opus 4.7         Extra High  45%    $11.50
11  Claude Opus 4.7         High        40%    $5.00
12  Claude Opus 4.7         Medium      32%    $3.30
13  Claude Sonnet 4.6       High        32%    $4.50
14  Gemini 3.5 Flash        Medium      28%    $7.00
15  Kimi K2.6               -           24%    $4.50
16  GPT-5.4 Mini            Extra High  24%    $1.50
17  MiniMax-M3              -           20.5%  $5.50
18  Gemini 3.1 Pro Preview  -           10%    $2.00
19  DeepSeek V4             -           8%     $5.50
20  Gemini 3 Flash          -           5%     $1.50
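The cost-vs-capability frontier can be read off these rows directly. The sketch below is one simple way to do it, assuming the cost figure is the average cost per task in US dollars: it keeps every configuration that no cheaper-or-equal configuration matches or beats on score. The function name `pareto_frontier` and the tuple layout are choices made for this example.

```python
# (configuration, score in %, cost per task in USD), copied from the table.
RESULTS = [
    ("GPT-5.5 (Extra High)", 70.0, 6.80),
    ("GPT-5.5 (High)", 62.0, 4.60),
    ("Claude Opus 4.8 (Max)", 58.0, 8.50),
    ("Claude Opus 4.8 (Extra High)", 57.0, 7.00),
    ("GPT-5.4 (Extra High)", 56.0, 5.50),
    ("Claude Opus 4.7 (Max)", 54.0, 16.50),
    ("Claude Opus 4.8 (High)", 50.0, 4.50),
    ("GPT-5.5 (Medium)", 48.0, 2.40),
    ("Claude Opus 4.8 (Medium)", 47.0, 3.30),
    ("Claude Opus 4.7 (Extra High)", 45.0, 11.50),
    ("Claude Opus 4.7 (High)", 40.0, 5.00),
    ("Claude Opus 4.7 (Medium)", 32.0, 3.30),
    ("Claude Sonnet 4.6 (High)", 32.0, 4.50),
    ("Gemini 3.5 Flash (Medium)", 28.0, 7.00),
    ("Kimi K2.6", 24.0, 4.50),
    ("GPT-5.4 Mini (Extra High)", 24.0, 1.50),
    ("MiniMax-M3", 20.5, 5.50),
    ("Gemini 3.1 Pro Preview", 10.0, 2.00),
    ("DeepSeek V4", 8.0, 5.50),
    ("Gemini 3 Flash", 5.0, 1.50),
]


def pareto_frontier(rows):
    """Return the rows not beaten on both score and cost, cheapest first."""
    frontier = []
    best_score = float("-inf")
    # Walk from cheapest to most expensive; at equal cost, best score first.
    for name, score, cost in sorted(rows, key=lambda r: (r[2], -r[1])):
        if score > best_score:  # strictly better than everything cheaper
            frontier.append((name, score, cost))
            best_score = score
    return frontier


for name, score, cost in pareto_frontier(RESULTS):
    print(f"{name}: {score:g}% at ${cost:.2f}")
```

On the rows above, this leaves five configurations on the frontier: GPT-5.4 Mini (Extra High), GPT-5.5 (Medium), Claude Opus 4.8 (High), GPT-5.5 (High), and GPT-5.5 (Extra High). The calculation ignores the ± figures, so configurations that sit just off the frontier may be statistically indistinguishable from those on it.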

Sources: DeepSWE leaderboard; DeepSWE blog (Datacurve)

Details:

Category: Coding
Created by: Datacurve
Models tested: 12
Configs tested: 20
Leader: GPT-5.5
Top score: 70%

Updated June 2026

© 2026 Agents Directory