Agents Directory

OSWorld-Verified

Agentic

The standard computer-use benchmark: agents complete real desktop tasks in a live Ubuntu VM, observing screenshots and acting with mouse and keyboard, and are scored by execution-based checks. Higher is better.


OSWorld defines 369 real desktop tasks spanning Chrome, GIMP, LibreOffice, VLC, VS Code, Thunderbird, OS settings and multi-app workflows. Agents observe raw screenshots and act with mouse and keyboard. Scoring is execution-based: per-task scripts verify the final system state rather than judging the agent's text output. The Verified revision fixed about 300 task and infrastructure issues, and the team reruns submissions under unified settings at fixed step budgets (15, 50 and 100 steps), since the same agent can gain 5 to 10 points with a larger budget. Human performance is estimated at around 72%, a bar the best 2026 entries exceed. Official rerun rows and vendor self-reported rows are both listed, with the scaffold or harness named on each row.
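The execution-based scoring described above can be sketched roughly as follows. This is a minimal illustration, not the OSWorld harness: the `Task` class, the `SystemState` mapping, the two toy tasks and the binary pass/fail aggregation are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical stand-in for the VM's final state: a flat mapping of observed facts.
SystemState = Dict[str, str]


@dataclass
class Task:
    task_id: str
    # Per-task check: inspects the final system state, never the agent's text.
    check: Callable[[SystemState], bool]


def success_rate(tasks: List[Task], final_states: Dict[str, SystemState]) -> float:
    """Return the percentage of tasks whose check accepts the final state.

    Assumes binary pass/fail per task; a task with no recorded state fails.
    """
    passed = sum(1 for t in tasks if t.check(final_states.get(t.task_id, {})))
    return 100.0 * passed / len(tasks)


# Two made-up tasks, purely illustrative.
tasks = [
    Task("rename-file", lambda s: s.get("file") == "report.odt"),
    Task("set-wallpaper", lambda s: s.get("wallpaper") == "blue.png"),
]

# Final states after a hypothetical agent run: the first task succeeded,
# the second left the wallpaper unchanged.
final_states = {
    "rename-file": {"file": "report.odt"},
    "set-wallpaper": {"wallpaper": "default.png"},
}

rate = success_rate(tasks, final_states)
print(f"{rate:.1f}%")
```

Because the check looks only at the resulting state, an agent that claims success without actually changing the system gets no credit.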

Leaderboard

#    Model                    Scaffold / harness                    Score     Provider
1    Claude Fable 5           Vendor harness                        85%       Anthropic
2    Claude Opus 4.7          Pointer Agent, 100 steps              83.64%    Anthropic
3    Claude Opus 4.8          Vendor harness                        83.4%     Anthropic
4    Claude Opus 4.7          Vendor harness, updated methodology   82.3%     Anthropic
5    Claude Sonnet 4.6        Pointer Agent, 100 steps              81.45%    Anthropic
6    GPT-5.5                  Vendor harness                        78.7%     OpenAI
7    Gemini 3.5 Flash         Vendor harness                        78.4%     Google DeepMind
8    Gemini 3.1 Pro Preview   OpenAPA, 100 steps                    78.34%    Google DeepMind
9    MiniMax-M3               100 steps                             75.19%    MiniMax
10   GPT-5.4                  Vendor harness                        75%       OpenAI
11   Kimi K2.6                100 steps                             73.06%    Moonshot AI
12   Claude Sonnet 4.6        Native, 100 steps                     72.11%    Anthropic
13   Claude Sonnet 4.5        Native, 100 steps                     62.88%    Anthropic
14   UI-TARS 7B               100 steps                             29.6%     ByteDance
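As a quick cross-check of the leaderboard, the sketch below transcribes the 14 rows by hand and derives a few summary figures: the number of distinct models versus configurations, the best score per model, and which rows exceed the roughly 72% human estimate. The tuple layout and variable names are assumptions made for the example; the scores are copied from the rows listed here.

```python
# (model, scaffold or harness, score in percent), transcribed from the leaderboard.
rows = [
    ("Claude Fable 5", "Vendor harness", 85.0),
    ("Claude Opus 4.7", "Pointer Agent, 100 steps", 83.64),
    ("Claude Opus 4.8", "Vendor harness", 83.4),
    ("Claude Opus 4.7", "Vendor harness, updated methodology", 82.3),
    ("Claude Sonnet 4.6", "Pointer Agent, 100 steps", 81.45),
    ("GPT-5.5", "Vendor harness", 78.7),
    ("Gemini 3.5 Flash", "Vendor harness", 78.4),
    ("Gemini 3.1 Pro Preview", "OpenAPA, 100 steps", 78.34),
    ("MiniMax-M3", "100 steps", 75.19),
    ("GPT-5.4", "Vendor harness", 75.0),
    ("Kimi K2.6", "100 steps", 73.06),
    ("Claude Sonnet 4.6", "Native, 100 steps", 72.11),
    ("Claude Sonnet 4.5", "Native, 100 steps", 62.88),
    ("UI-TARS 7B", "100 steps", 29.6),
]

HUMAN_ESTIMATE = 72.0  # approximate human performance quoted in the description

# A model can appear under several scaffolds, so configs >= models.
models = {model for model, _, _ in rows}

# Best score per model across all of its configurations.
best_by_model = {}
for model, _, score in rows:
    best_by_model[model] = max(score, best_by_model.get(model, 0.0))

# Configurations that score above the human estimate.
above_human = [(model, config) for model, config, score in rows if score > HUMAN_ESTIMATE]

print(len(rows), "configs,", len(models), "models,", len(above_human), "above human estimate")
```

The same model can land several places apart depending on the scaffold, so per-configuration rows and per-model bests answer different questions.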
Sources:
  • OSWorld-Verified results file (xlsx)
  • OSWorld official leaderboard
  • Introducing OSWorld-Verified
  • OSWorld paper (arXiv 2404.07972)
  • xlang-ai/OSWorld
Details:
  • Category: Agentic
  • Created by: XLANG Lab
  • Models tested: 12
  • Configs tested: 14
  • Leader: Claude Fable 5
  • Top score: 85%

Updated June 2026


© 2026 Agents Directory