Agents Directory

FrontierCode Diamond

Coding

The 50 hardest FrontierCode tasks: the toughest production-code problems, graded on whether maintainers would merge the patch. Scores stay low by design. Higher is better.

Claude Fable 5's Diamond score is still pending from Cognition, so it does not yet appear on this board.

Diamond is the 50 most difficult FrontierCode tasks: the hardest production-code problems, drawn from real open source repositories. The tasks use the same maintainer-merge rubric, with hard blocking criteria (correctness, regression safety, scope). A trial's score is its gated weighted rubric value: the weighted rubric value if the trial clears every blocker, and 0 otherwise. The benchmark score is the average over the tasks. Because this is the toughest agentic-coding measure on the board, scores stay low.
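The gated scoring rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Cognition's implementation: the `Trial` class, its field names, the example weights, the normalization by total weight, and the one-trial-per-task averaging are all hypothetical, since the page does not publish those details.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    # Hypothetical shape: blocker name -> whether the trial cleared it.
    blockers_passed: dict[str, bool]
    # Hypothetical shape: (weight, score in [0, 1]) for each rubric criterion.
    rubric: list[tuple[float, float]]


def gated_score(trial: Trial) -> float:
    """Weighted rubric value, or 0 if any blocking criterion failed."""
    if not all(trial.blockers_passed.values()):
        return 0.0
    total_weight = sum(weight for weight, _ in trial.rubric)
    if total_weight == 0:
        return 0.0
    # Assumption: the rubric value is normalized by the total weight.
    return sum(weight * score for weight, score in trial.rubric) / total_weight


def benchmark_score(trials: list[Trial]) -> float:
    """Average of gated scores, assuming one trial per task."""
    if not trials:
        return 0.0
    return sum(gated_score(t) for t in trials) / len(trials)


all_clear = {"correctness": True, "regression_safety": True, "scope": True}
scope_fail = {"correctness": True, "regression_safety": True, "scope": False}

passing = Trial(all_clear, [(2.0, 1.0), (1.0, 0.5)])
blocked = Trial(scope_fail, [(2.0, 1.0), (1.0, 0.5)])

print(f"passing trial: {gated_score(passing):.3f}")
print(f"blocked trial: {gated_score(blocked):.3f}")
print(f"benchmark:     {benchmark_score([passing, blocked]):.3f}")
```

The sketch shows why scores stay low: a trial with a strong rubric value still contributes 0 if it fails a single blocker, so the average is pulled down by every patch that would not be mergeable.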

Leaderboard
#   Model                    Score  Provider
1   Claude Opus 4.8          13.4%  Anthropic
2   GPT-5.5                  6.3%   OpenAI
3   Claude Opus 4.7          5.2%   Anthropic
4   Gemini 3.1 Pro Preview   4.7%   Google DeepMind
5   GPT-5.4 Mini             4.6%   OpenAI
6   Kimi K2.6                3.8%   Moonshot AI
7   Claude Sonnet 4.6        3.5%   Anthropic
8   MiniMax M2.7             2.4%   MiniMax
9   MiniMax M2.5             1.1%   MiniMax
10  Kimi K2.5                1%     Moonshot AI
11  Gemini 3.1 Flash Lite    0.7%   Google DeepMind

Sources:
FrontierCode (Cognition)
Details:
  • Category: Coding
  • Created by: Cognition
  • Models tested: 11
  • Leader: Claude Opus 4.8
  • Top score: 13.4%

Updated June 2026

© 2026 Agents Directory