Agents Directory

METR 50% Time Horizon

Agentic

The length of task, measured in minutes of human expert working time, that a model can complete autonomously with 50% reliability. It is the field's most-cited capability trend. Higher is better.

METR has not yet published horizons for Claude Fable 5, Claude Opus 4.8 or GPT-5.5; the newest measured frontier model is Claude Opus 4.6.

METR runs frontier models as autonomous agents on 228 diverse software and reasoning tasks (the Time Horizon 1.1 suite, built on HCAST, RE-Bench and SWAA). Each task has a human baseline: the time skilled professionals take to complete it. For every model METR fits a logistic curve relating human task length to the model's success probability, and the 50% time horizon is the task length where that curve crosses 50% success. Horizons have grown exponentially, doubling roughly every 129 days since 2023 under the v1.1 analysis. METR cautions that measurements above 16 hours are unreliable with the current task suite, and confidence intervals on the newest models are wide.
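
The curve-fitting step described above can be illustrated with a minimal sketch. It is not METR's code: the run outcomes are invented for illustration, the predictor is assumed to be the base-2 logarithm of human task length, and the fit is a plain unweighted maximum-likelihood logistic regression done by gradient ascent. The names fit_logistic and time_horizon are hypothetical.

```python
import math


def fit_logistic(xs, ys, lr=0.01, steps=20000):
    """Fit P(success) = sigmoid(a + b * x) by gradient ascent on the log-likelihood."""
    mean_x = sum(xs) / len(xs)
    centred = [x - mean_x for x in xs]  # centring x keeps the ascent well conditioned
    a, b = 0.0, 0.0
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, y in zip(centred, ys):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            grad_a += y - p
            grad_b += (y - p) * x
        a += lr * grad_a
        b += lr * grad_b
    # Re-express the intercept for the original (uncentred) x values.
    return a - b * mean_x, b


def time_horizon(a, b, p=0.5):
    """Task length in minutes at which the fitted success probability equals p."""
    logit = math.log(p / (1.0 - p))
    return 2.0 ** ((logit - a) / b)


# Hypothetical (human_minutes, model_succeeded) outcomes for a single model.
runs = [(1, 1), (2, 1), (4, 1), (8, 1), (15, 0), (30, 1),
        (60, 0), (120, 1), (240, 0), (480, 0), (960, 0)]
xs = [math.log2(minutes) for minutes, _ in runs]
ys = [succeeded for _, succeeded in runs]

a, b = fit_logistic(xs, ys)
horizon_50 = time_horizon(a, b)
print(f"50% time horizon: {horizon_50:.1f} human-minutes")
```

Because success becomes less likely as tasks get longer, the fitted slope b is negative, and the horizon is simply the task length at which the curve crosses the chosen probability. Passing a stricter threshold such as p=0.8 to time_horizon gives a shorter horizon from the same fit.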

Leaderboard
#    Model                     Score        Provider
1    Claude Opus 4.6           718.81 min   Anthropic
2    Gemini 3.1 Pro Preview    384.15 min   Google DeepMind
3    GPT-5.2                   352.25 min   OpenAI
4    GPT-5.3-Codex             349.53 min   OpenAI
5    GPT-5.4                   341.74 min   OpenAI
6    Claude Opus 4.5           292.99 min   Anthropic
7    Gemini 3 Pro              224.33 min   Google DeepMind
8    GPT-5.1-Codex-Max         223.71 min   OpenAI
9    GPT-5                     203.01 min   OpenAI
10   o3                        119.73 min   OpenAI
11   Claude Opus 4.1           100.47 min   Anthropic
12   Claude Opus 4             100.37 min   Anthropic
13   o1                        38.83 min    OpenAI
14   GPT-4                     3.99 min     OpenAI
15   GPT-4 Turbo               3.73 min     OpenAI
16   GPT-3.5 Turbo Instruct    0.6 min      OpenAI
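
To put the scores in context, the short sketch below converts a horizon from minutes to hours, checks it against the 16-hour reliability ceiling that METR cautions about, and counts how many doublings separate two leaderboard entries. The scores are copied from the leaderboard; the helper names are hypothetical, and multiplying the doubling count by 129 days is plain arithmetic on the quoted trend, not a statement about when any model was released.

```python
import math

RELIABLE_CEILING_MIN = 16 * 60  # METR's 16-hour caution, expressed in minutes
DOUBLING_DAYS = 129             # doubling period quoted for the v1.1 analysis

# Scores in minutes, taken from the leaderboard.
scores_min = {"Claude Opus 4.6": 718.81, "GPT-4": 3.99}


def to_hours(minutes):
    """Convert a time horizon from minutes to hours."""
    return minutes / 60.0


def doublings_between(low_minutes, high_minutes):
    """Number of doublings needed to grow from the lower horizon to the higher one."""
    return math.log2(high_minutes / low_minutes)


top_hours = to_hours(scores_min["Claude Opus 4.6"])
within_ceiling = scores_min["Claude Opus 4.6"] <= RELIABLE_CEILING_MIN
n_doublings = doublings_between(scores_min["GPT-4"], scores_min["Claude Opus 4.6"])
implied_days = n_doublings * DOUBLING_DAYS  # arithmetic illustration only

print(f"Top score: {top_hours:.2f} hours (within 16-hour ceiling: {within_ceiling})")
print(f"GPT-4 to Claude Opus 4.6: {n_doublings:.2f} doublings, "
      f"about {implied_days:.0f} days at one doubling per {DOUBLING_DAYS} days")
```

The top score works out to about 12 hours, which is still under the 16-hour ceiling, and the gap between the GPT-4 and Claude Opus 4.6 entries is roughly 7.5 doublings.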
Sources:
  • Per-model results YAML (benchmark_results_1_1.yaml)
  • METR time horizons data page
  • Time Horizon 1.1 (METR blog)
  • Measuring AI Ability to Complete Long Tasks (arXiv 2503.14499)
  • METR/eval-analysis-public
Details:
  • Category: Agentic
  • Created by: METR
  • Models tested: 16
  • Leader: Claude Opus 4.6
  • Top score: 718.81 min

Updated May 2026


© 2026 Agents Directory