Agents Directory

SWE-bench Pro

Coding

Scale AI's harder, contamination-resistant successor to SWE-bench: 731 public long-horizon engineering tasks, every model run on identical scaffolding. Higher is better.

Scale's standardized board lags vendor launches by months: Claude Fable 5, Claude Opus 4.8 and GPT-5.5 have no Scale-run scores yet (Anthropic self-reports Fable 5 at 80.3% with its own harness, which is not comparable).

Tasks are built from curated repositories with reproducible Docker environments, commit-scraped fail-to-pass test transitions, and human-augmented problem statements verified at three checkpoints. The full benchmark has 1,865 problems across a public set (731 GPL-repo instances, the leaderboard shown here), a commercial set of proprietary startup codebases, and a held-out set. A task counts as resolved only if the failing tests pass and existing tests do not regress, and Scale runs every model through the same mini-swe-agent harness, so scores isolate model capability from harness quality. Scores carry 95% confidence intervals, shown as whiskers on the leaderboard.
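
The resolution rule described above can be sketched in a few lines. This is a minimal illustration, not Scale's harness code: the `TaskResult` structure and the function names are hypothetical, and the sketch assumes per-test pass/fail outcomes have already been collected after the model's patch was applied.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Test outcomes for one task after the model's patch is applied (hypothetical structure)."""
    fail_to_pass: dict[str, bool]  # tests that failed before the fix; True means now passing
    pass_to_pass: dict[str, bool]  # tests that passed before the fix; True means still passing


def is_resolved(result: TaskResult) -> bool:
    """A task is resolved only if every target test passes and nothing regresses."""
    # Require at least one target test so an empty result is never counted as a fix.
    fixed = bool(result.fail_to_pass) and all(result.fail_to_pass.values())
    no_regression = all(result.pass_to_pass.values())
    return fixed and no_regression


def resolve_rate(results: list[TaskResult]) -> float:
    """Percentage of tasks resolved, i.e. the kind of score shown on the leaderboard."""
    if not results:
        raise ValueError("no results to score")
    return 100.0 * sum(is_resolved(r) for r in results) / len(results)
```

Under this rule a patch that fixes the target tests but breaks one previously passing test scores the same as no patch at all.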

Leaderboard
  • GPT-5.4 (xhigh): 59.1% ± 3.56%
  • Claude Opus 4.6 (Thinking): 51.9% ± 3.61%
  • Gemini 3.1 Pro Preview (Thinking): 46.1% ± 3.6%
  • Claude Opus 4.5: 45.89% ± 3.6%
  • Claude Sonnet 4.5: 43.6% ± 3.6%
  • Gemini 3 Pro: 43.3% ± 3.6%
  • GPT-5 (high): 41.78% ± 3.49%
  • GPT-5.2 Codex: 41.04% ± 3.57%
  • Claude Haiku 4.5: 39.45% ± 3.55%
  • Qwen3 Coder 480B A35B: 38.7% ± 3.55%
  • MiniMax M2.1: 36.81% ± 3.55%
  • Gemini 3 Flash: 34.63% ± 3.55%
  • GPT-5.2: 29.94% ± 2.15%
  • Qwen3 235B A22B: 21.41% ± 2.25%
  • gpt-oss-120b: 16.2% ± 2.67%
  • DeepSeek V3.2: 15.56% ± 2.63%
  • Gemma 3 27B: 11.38% ± 2.15%
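
The page does not say how the 95% intervals are computed. As a rough sketch, assuming a normal-approximation (Wald) interval for a binomial proportion over the 731 public tasks, the half-width for the top entry comes out close to the published value. The same assumption does not reproduce every row (GPT-5.2's ±2.15%, for example), so the published whiskers may rest on a different method or sample size. The function names below are illustrative only.

```python
import math


def wald_half_width_95(score_pct: float, n_tasks: int) -> float:
    """Half-width, in percentage points, of a 95% normal-approximation interval."""
    p = score_pct / 100.0
    return 100.0 * 1.96 * math.sqrt(p * (1.0 - p) / n_tasks)


def intervals_overlap(score_a: float, hw_a: float, score_b: float, hw_b: float) -> bool:
    """True if the two intervals share any point (a rough heuristic, not a formal test)."""
    return abs(score_a - score_b) <= hw_a + hw_b


# Assumption: n = 731, the size of the public set.
top_half_width = wald_half_width_95(59.1, 731)

# Adjacent entries using the half-widths published on the leaderboard.
close_pair = intervals_overlap(46.1, 3.6, 45.89, 3.6)   # Gemini 3.1 Pro Preview vs Claude Opus 4.5
top_pair = intervals_overlap(59.1, 3.56, 51.9, 3.61)    # GPT-5.4 vs Claude Opus 4.6
```

Read this way, the gap between the top two entries is just wider than their combined whiskers, while several mid-table neighbors have overlapping intervals, so their ordering should not be over-interpreted.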
All results
#    Model                               Score    Provider
1    GPT-5.4 (xhigh)                     59.1%    OpenAI
2    Claude Opus 4.6 (Thinking)          51.9%    Anthropic
3    Gemini 3.1 Pro Preview (Thinking)   46.1%    Google DeepMind
4    Claude Opus 4.5                     45.89%   Anthropic
5    Claude Sonnet 4.5                   43.6%    Anthropic
6    Gemini 3 Pro                        43.3%    Google DeepMind
7    GPT-5 (high)                        41.78%   OpenAI
8    GPT-5.2 Codex                       41.04%   OpenAI
9    Claude Haiku 4.5                    39.45%   Anthropic
10   Qwen3 Coder 480B A35B               38.7%    Qwen
11   MiniMax M2.1                        36.81%   MiniMax
12   Gemini 3 Flash                      34.63%   Google DeepMind
13   GPT-5.2                             29.94%   OpenAI
14   Qwen3 235B A22B                     21.41%   Qwen
15   gpt-oss-120b                        16.2%    OpenAI
16   DeepSeek V3.2                       15.56%   DeepSeek
17   Gemma 3 27B                         11.38%   Google DeepMind
Sources:
  • morphllm SWE-bench Pro tracker
  • Scale SEAL SWE-bench Pro (public set)
  • ScaleAI/SWE-bench_Pro on Hugging Face
  • SWE-bench Pro paper (arXiv 2509.16941)
  • scaleapi/SWE-bench_Pro-os
Details:
  • Category: Coding
  • Created by: Scale AI
  • Models tested: 17
  • Leader: GPT-5.4
  • Top score: 59.1%

Updated June 2026


© 2026 Agents Directory