Agents Directory

Terminal-Bench 2.0

Coding

Measures how well AI agents complete real tasks in a terminal: software engineering, ML, security and data science, each run in a containerized sandbox. Higher is better.

A newer Terminal-Bench 2.1 board is live but still sparse (11 entries, including Claude Opus 4.8 with Claude Code at 78.9%). Scores are not comparable across versions, so this board stays on 2.0 until 2.1 fills out.

Terminal-Bench 2.0 contains 89 verified tasks, each run in a containerized terminal environment via the Harbor framework. An agent receives a task instruction and works inside the sandbox; task-specific checks then score the run pass or fail. The headline number is the percentage of tasks resolved, published with a confidence interval.

Every entry is an agent harness plus model pair, not a bare model: the same model scores differently depending on the scaffold wrapping it (Codex CLI, Claude Code, Droid, the reference Terminus 2 agent and others). On this board the gap between a model's best and worst listed harness runs from about 1 point (GPT-5.3-Codex, Claude Opus 4.6) to 18.8 points (Gemini 3.1 Pro Preview: 80.2% with TongAgents, 61.4% with Gemini CLI). The Leaderboard shows each model's strongest harness, named on the row, and All results lists every configuration. Submissions are verified by the Terminal-Bench team.
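The scoring rule described above (pass or fail per run, reported as a percentage with an interval) can be sketched in a few lines. The source does not say how the published interval is computed, so this sketch assumes a normal-approximation (Wald) interval over independent pass/fail outcomes; the official method may differ. The function name and the sample counts are hypothetical, chosen only for illustration.

```python
import math

def pass_rate_with_interval(outcomes, z=1.96):
    """Return (rate_pct, half_width_pct) for a list of pass/fail booleans.

    Assumption: a Wald interval at roughly 95% (z=1.96). This is not
    confirmed to be the method used on the official leaderboard.
    """
    n = len(outcomes)
    if n == 0:
        raise ValueError("need at least one outcome")
    p = sum(outcomes) / n  # fraction of runs that passed
    half_width = z * math.sqrt(p * (1 - p) / n)
    return 100 * p, 100 * half_width

# Hypothetical data: 445 scored runs, 377 of them passing.
outcomes = [True] * 377 + [False] * 68
rate, half = pass_rate_with_interval(outcomes)
print(f"{rate:.1f}% ± {half:.1f}%")
```

The width of the interval depends on how many runs are scored, which is one reason the ± values differ from row to row.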

Leaderboard
  • GPT-5.5 (NexAU-AHE): 84.7% ± 2.1%
  • Claude Opus 4.7 (WOZCODE): 80.2% ± 2.1%
  • Gemini 3.1 Pro Preview (TongAgents): 80.2% ± 2.6%
  • GPT-5.3-Codex (SageAgent): 78.4% ± 2.2%
  • Claude Opus 4.6 (Meta-Harness): 76.4% ± 2.4%
  • Gemini 3 Pro (Ante): 69.4% ± 2.1%
  • GPT-5.2 Codex (Deep Agents): 66.5% ± 3.1%
  • GPT-5.2 (Droid): 64.9% ± 2.8%
  • Gemini 3 Flash (Junie CLI): 64.3% ± 2.8%
  • Claude Opus 4.5 (Droid): 63.1% ± 2.7%
  • GPT-5.1-Codex-Mini (hookele): 61.6% ± 1.9%
  • GPT-5.1-Codex-Max (Codex CLI): 60.4% ± 2.7%
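Because each leaderboard score carries a ± margin, neighbouring rows are not always clearly separated. A minimal sketch of an overlap check follows, using values from the leaderboard above. Treating overlap as "not clearly separated" is a rough heuristic assumed here, not a formal significance test, and the helper name is hypothetical.

```python
def intervals_overlap(a, b):
    """a and b are (score_pct, half_width_pct) pairs.

    Two intervals overlap when the gap between the scores is no larger
    than the sum of the two half-widths.
    """
    (score_a, half_a), (score_b, half_b) = a, b
    return abs(score_a - score_b) <= half_a + half_b

# Values copied from the leaderboard rows.
gpt_5_5 = (84.7, 2.1)       # NexAU-AHE
opus_4_7 = (80.2, 2.1)      # WOZCODE
gemini_3_1 = (80.2, 2.6)    # TongAgents

print(intervals_overlap(gpt_5_5, opus_4_7))    # gap 4.5 vs combined margin 4.2
print(intervals_overlap(opus_4_7, gemini_3_1)) # identical scores
```

Under this heuristic the top row is separated from Claude Opus 4.7, while the two 80.2% rows are indistinguishable.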
All results
#   Model                   Harness        Score   Provider
1   GPT-5.5                 NexAU-AHE      84.7%   OpenAI
2   GPT-5.5                 Capy           83.1%   OpenAI
3   GPT-5.5                 Codex CLI      82.2%   OpenAI
4   Claude Opus 4.7         WOZCODE        80.2%   Anthropic
5   Gemini 3.1 Pro Preview  TongAgents     80.2%   Google DeepMind
6   GPT-5.3-Codex           SageAgent      78.4%   OpenAI
7   GPT-5.3-Codex           Droid          77.3%   OpenAI
8   Claude Opus 4.6         Meta-Harness   76.4%   Anthropic
9   Claude Opus 4.6         Capy           75.3%   Anthropic
10  Gemini 3.1 Pro Preview  Terminus-KIRA  74.8%   Google DeepMind
11  Gemini 3 Pro            Ante           69.4%   Google DeepMind
12  GPT-5.2 Codex           Deep Agents    66.5%   OpenAI
13  Gemini 3 Pro            SageAgent      65.2%   Google DeepMind
14  GPT-5.2                 Droid          64.9%   OpenAI
15  Gemini 3 Flash          Junie CLI      64.3%   Google DeepMind
16  Claude Opus 4.5         Droid          63.1%   Anthropic
17  GPT-5.2                 Codex CLI      62.9%   OpenAI
18  GPT-5.1-Codex-Mini      hookele        61.6%   OpenAI
19  Gemini 3.1 Pro Preview  Gemini CLI     61.4%   Google DeepMind
20  GPT-5.1-Codex-Max       Codex CLI      60.4%   OpenAI
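The relationship between the two views (20 configurations in All results, one strongest row per model on the Leaderboard) can be reproduced with a short script. This is a sketch over the figures listed above; the function names are hypothetical and the site's own selection logic is not documented here.

```python
from collections import defaultdict

# (model, harness, score in percent), copied from the All results table.
ROWS = [
    ("GPT-5.5", "NexAU-AHE", 84.7),
    ("GPT-5.5", "Capy", 83.1),
    ("GPT-5.5", "Codex CLI", 82.2),
    ("Claude Opus 4.7", "WOZCODE", 80.2),
    ("Gemini 3.1 Pro Preview", "TongAgents", 80.2),
    ("GPT-5.3-Codex", "SageAgent", 78.4),
    ("GPT-5.3-Codex", "Droid", 77.3),
    ("Claude Opus 4.6", "Meta-Harness", 76.4),
    ("Claude Opus 4.6", "Capy", 75.3),
    ("Gemini 3.1 Pro Preview", "Terminus-KIRA", 74.8),
    ("Gemini 3 Pro", "Ante", 69.4),
    ("GPT-5.2 Codex", "Deep Agents", 66.5),
    ("Gemini 3 Pro", "SageAgent", 65.2),
    ("GPT-5.2", "Droid", 64.9),
    ("Gemini 3 Flash", "Junie CLI", 64.3),
    ("Claude Opus 4.5", "Droid", 63.1),
    ("GPT-5.2", "Codex CLI", 62.9),
    ("GPT-5.1-Codex-Mini", "hookele", 61.6),
    ("Gemini 3.1 Pro Preview", "Gemini CLI", 61.4),
    ("GPT-5.1-Codex-Max", "Codex CLI", 60.4),
]

def best_per_model(rows):
    """Keep the highest-scoring (harness, score) for each model."""
    best = {}
    for model, harness, score in rows:
        if model not in best or score > best[model][1]:
            best[model] = (harness, score)
    return best

def harness_spread(rows):
    """Best-minus-worst score for models listed with more than one harness."""
    by_model = defaultdict(list)
    for model, _harness, score in rows:
        by_model[model].append(score)
    return {m: max(s) - min(s) for m, s in by_model.items() if len(s) > 1}

best = best_per_model(ROWS)
spread = harness_spread(ROWS)

# Print one row per model, highest score first.
for model, (harness, score) in sorted(best.items(), key=lambda kv: -kv[1][1]):
    print(f"{model:24}{harness:15}{score:.1f}%")
```

Run against these rows, `best` holds 12 models, matching the "Models tested" figure, and `spread` shows how much the choice of harness moves each multi-harness model.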
Sources:
  • Leaderboard data on Hugging Face
  • Terminal-Bench 2.1 leaderboard
  • Terminal-Bench 2.0 leaderboard
  • laude-institute/terminal-bench
Details:
  • Category: Coding
  • Created by: Laude Institute
  • Models tested: 12
  • Configs tested: 20
  • Leader: GPT-5.5
  • Top score: 84.7%

Updated June 2026


© 2026 Agents Directory