DeepSWE
Coding
Datacurve's agentic coding benchmark: each model runs as an autonomous agent on real software engineering tasks and is scored on whether its final patch resolves the issue. Higher is better.
Claude Fable 5 has not been submitted to DeepSWE yet, so it does not appear on this board.
DeepSWE drops a model into real open-source repositories as an autonomous coding agent and counts a task as solved only when the agent's final patch passes the hidden test suite. The headline score is the resolve rate averaged over the task set, and the ± figure is the standard error across trials, so close scores with wide error bars should be read as a tie. Each run also tracks average cost, time, and output tokens per task, which turns the board into a cost-vs-capability frontier: the same model at a higher reasoning effort usually scores higher but costs more per task.
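To make the "read close scores as a tie" advice concrete, here is a minimal Python sketch. It assumes the ± figure is the standard error of the mean over per-trial resolve rates (sample standard deviation divided by the square root of the trial count); the board does not spell out its exact formula. The function names, the per-trial numbers, and the two-standard-error tie threshold are all illustrative assumptions, not part of DeepSWE.

```python
import math
import statistics


def resolve_rate_with_se(trial_rates):
    """Return (mean resolve rate, standard error of the mean) across trials.

    Assumption: each entry is one trial's resolve rate over the full task set.
    """
    n = len(trial_rates)
    mean = statistics.fmean(trial_rates)
    if n < 2:
        # A single trial gives no spread estimate.
        return mean, float("nan")
    se = statistics.stdev(trial_rates) / math.sqrt(n)
    return mean, se


def is_statistical_tie(score_a, se_a, score_b, se_b, z=2.0):
    """Treat two scores as tied when their gap is within z combined standard errors.

    Assumption: z=2.0 is an illustrative threshold, not a rule from the board.
    """
    combined_se = math.hypot(se_a, se_b)  # sqrt(se_a**2 + se_b**2)
    return abs(score_a - score_b) <= z * combined_se


# Hypothetical per-trial resolve rates for one model (not real DeepSWE data).
trials = [0.68, 0.74, 0.70, 0.66, 0.72]
mean, se = resolve_rate_with_se(trials)
print(f"resolve rate: {mean:.1%} ± {se:.1%}")

# Scores and error bars taken from the leaderboard, in percentage points.
print(is_statistical_tie(56, 2, 54, 5))  # 56%±2% vs. 54%±5%
print(is_statistical_tie(70, 3, 56, 2))  # 70%±3% vs. 56%±2%
```

Under these assumptions, the 56%±2% and 54%±5% entries come out as a tie, while the gap between 70%±3% and 56%±2% is too large to be explained by the error bars alone.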
Leaderboard
- GPT-5.5 (xhigh): 70% ± 3%
- 58%
- GPT-5.4 (xhigh): 56% ± 2%
- 54% ± 5%
- 32% ± 2%
- Gemini 3.5 Flash (medium): 28% ± 4%
- 24% ± 2%
- GPT-5.4 Mini (xhigh): 24% ± 3%
- 20.5%
- 10% ± 3%
- 8% ± 3%
- 5% ± 2%
Score vs. cost
All results
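| # | Model | Effort | Score | Cost |
|---|---|---|---|---|
| 1 | GPT-5.5 | Extra High | 70% | $6.80 |
| 2 | GPT-5.5 | High | 62% | $4.60 |
| 3 | | | 58% | $8.50 |
| 4 | Claude Opus 4.8 | Extra High | 57% | $7.00 |
| 5 | GPT-5.4 | Extra High | 56% | $5.50 |
| 6 | | | 54% | $16.50 |
| 7 | Claude Opus 4.8 | High | 50% | $4.50 |
| 8 | GPT-5.5 | Medium | 48% | $2.40 |
| 9 | Claude Opus 4.8 | Medium | 47% | $3.30 |
| 10 | Claude Opus 4.7 | Extra High | 45% | $11.50 |
| 11 | Claude Opus 4.7 | High | 40% | $5.00 |
| 12 | Claude Opus 4.7 | Medium | 32% | $3.30 |
| 13 | | | 32% | $4.50 |
| 14 | Gemini 3.5 Flash | Medium | 28% | $7.00 |
| 15 | | | 24% | $4.50 |
| 16 | GPT-5.4 Mini | Extra High | 24% | $1.50 |
| 17 | | | 20.5% | $5.50 |
| 18 | | | 10% | $2.00 |
| 19 | | | 8% | $5.50 |
| 20 | | | 5% | $1.50 |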
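The cost-vs-capability frontier can be read off these rows directly. The sketch below assumes the Cost column is the average cost per task in US dollars and treats a run as on the frontier when no other run scores at least as high for the same or lower cost. The `Run` type and `pareto_frontier` function are hypothetical names for illustration, and rows whose model name is missing from the board are labeled by rank only.

```python
from typing import NamedTuple


class Run(NamedTuple):
    label: str    # model and effort; "row N" where the board shows no name
    score: float  # resolve rate, in percent
    cost: float   # assumed: average cost per task, in USD


RUNS = [
    Run("GPT-5.5 Extra High", 70, 6.80),
    Run("GPT-5.5 High", 62, 4.60),
    Run("row 3", 58, 8.50),
    Run("Claude Opus 4.8 Extra High", 57, 7.00),
    Run("GPT-5.4 Extra High", 56, 5.50),
    Run("row 6", 54, 16.50),
    Run("Claude Opus 4.8 High", 50, 4.50),
    Run("GPT-5.5 Medium", 48, 2.40),
    Run("Claude Opus 4.8 Medium", 47, 3.30),
    Run("Claude Opus 4.7 Extra High", 45, 11.50),
    Run("Claude Opus 4.7 High", 40, 5.00),
    Run("Claude Opus 4.7 Medium", 32, 3.30),
    Run("row 13", 32, 4.50),
    Run("Gemini 3.5 Flash Medium", 28, 7.00),
    Run("row 15", 24, 4.50),
    Run("GPT-5.4 Mini Extra High", 24, 1.50),
    Run("row 17", 20.5, 5.50),
    Run("row 18", 10, 2.00),
    Run("row 19", 8, 5.50),
    Run("row 20", 5, 1.50),
]


def pareto_frontier(runs):
    """Return the runs not dominated on (lower cost, higher score), cheapest first."""
    frontier = []
    best_score = float("-inf")
    # Walk from cheapest to most expensive; at equal cost, try the higher score first.
    for run in sorted(runs, key=lambda r: (r.cost, -r.score)):
        if run.score > best_score:
            frontier.append(run)
            best_score = run.score
    return frontier


for run in pareto_frontier(RUNS):
    print(f"{run.label}: {run.score}% at ${run.cost:.2f}")
```

With the rows above, the frontier is GPT-5.4 Mini Extra High, GPT-5.5 Medium, Claude Opus 4.8 High, GPT-5.5 High, and GPT-5.5 Extra High; every other run is matched or beaten on score by something that costs no more. This comparison ignores the error bars, so near-ties on score could reorder the frontier.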