METR 50% Time Horizon

Name: METR 50% Time Horizon leaderboard
Creator: METR

Agentic

The length of task, measured in minutes of human expert working time, that a model can complete autonomously with 50% reliability. The field's most-cited capability trend. Higher is better.

METR has not yet published horizons for Claude Fable 5, Claude Opus 4.8 or GPT-5.5; the newest measured frontier model is Claude Opus 4.6.

METR runs frontier models as autonomous agents on 228 diverse software and reasoning tasks (the Time Horizon 1.1 suite, built on HCAST, RE-Bench and SWAA). Each task has a human baseline: the time skilled professionals take to complete it. For every model, METR fits a logistic curve relating human task length to the model's success probability; the 50% time horizon is the task length at which that curve crosses 50% success. Under the v1.1 analysis, horizons have grown exponentially since 2023, doubling roughly every 129 days. METR cautions that measurements above 16 hours are unreliable with the current task suite, and confidence intervals on the newest models are wide.
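The curve fit and the doubling-time arithmetic described above can be sketched in a few lines. This is a minimal illustration, not METR's pipeline: it assumes task length enters the logistic curve on a log2 scale, uses a plain Newton's-method fit, and runs on made-up results. The names fit_logistic, time_horizon and projected_horizon are hypothetical helpers introduced here for illustration.

```python
import math

def fit_logistic(xs, ys, iters=100, tol=1e-12):
    """Fit P(success) = sigmoid(a + b*x) by Newton's method; return (a, b)."""
    a = b = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            w = p * (1.0 - p)
            g0 += y - p            # log-likelihood gradient w.r.t. intercept
            g1 += (y - p) * x      # log-likelihood gradient w.r.t. slope
            h00 += w               # entries of the 2x2 information matrix
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        da = (h11 * g0 - h01 * g1) / det   # Newton step: solve H * step = g
        db = (h00 * g1 - h01 * g0) / det
        a, b = a + da, b + db
        if abs(da) + abs(db) < tol:
            break
    return a, b

def time_horizon(a, b, success=0.5):
    """Task length in minutes at which the fitted curve equals `success`."""
    logit = math.log(success / (1.0 - success))  # 0 when success is 0.5
    return 2.0 ** ((logit - a) / b)              # undo the log2 transform

def projected_horizon(current_minutes, days_ahead, doubling_days=129):
    """Extrapolate a horizon under a constant doubling time (assumption)."""
    return current_minutes * 2.0 ** (days_ahead / doubling_days)

# Hypothetical (human_minutes, model_succeeded) results. NOT METR data.
results = [(1, 1), (2, 1), (4, 1), (8, 1),
           (16, 1), (16, 0), (32, 1), (32, 0),
           (64, 1), (64, 0), (128, 1), (128, 0),
           (256, 0), (512, 0), (1024, 0), (2048, 0)]

xs = [math.log2(minutes) for minutes, _ in results]
ys = [succeeded for _, succeeded in results]

intercept, slope = fit_logistic(xs, ys)
horizon_minutes = time_horizon(intercept, slope)
print(f"50% time horizon: {horizon_minutes:.1f} minutes")  # roughly 45.3 here

# A 129-day doubling time implies about a 7x increase per year.
yearly_growth = projected_horizon(1.0, 365)
print(f"growth per year: {yearly_growth:.1f}x")
```

The toy data is deliberately non-separable (successes and failures overlap between 16 and 128 minutes), so the maximum-likelihood fit is finite. Extrapolating with projected_horizon only restates the trend; it says nothing about whether horizons beyond the 16-hour range noted above can be measured reliably.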

Leaderboard

# | Model | Score | Provider