Agents Directory

Vending-Bench 2

Agentic

Andon Labs' long-horizon coherence test: a model runs a simulated vending machine business for a full year and is scored on the money it ends with, in USD. Higher is better, no ceiling.

Andon Labs has not yet run Claude Fable 5, Claude Opus 4.8 or GPT-5.5 on Vending-Bench 2; the newest Anthropic entry is Claude Opus 4.6.

Each model starts with 500 dollars and operates a vending machine business for 365 simulated days while paying a 2 dollar daily location fee. The agent emails and negotiates with suppliers (some of them adversarial), places and tracks orders, sets prices, manages inventory, and handles complaints, delays and refunds. Runs are extremely long-horizon (the original paper reports over 20 million tokens per run), so the benchmark primarily tests whether a model stays coherent and keeps using its tools effectively without drifting off task. The score is the final bank balance in dollars, averaged across 5 runs per model. Variance is high: every model has runs that derail through forgotten orders, misread schedules or unproductive loops. Andon Labs estimates that a good human strategy would earn roughly 63,000 dollars per year, so even the top score of 8,017.59 USD is only about an eighth of that figure.
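The scoring rule described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Andon Labs' harness: the function names are hypothetical, and the five final balances are invented placeholder values, not reported results. Only the 500 dollar starting balance, the 2 dollar daily fee, the 365 days and the 5-run average come from the description.

```python
from statistics import mean

# Figures taken from the benchmark description.
STARTING_BALANCE = 500.0  # dollars
DAILY_FEE = 2.0           # dollars per simulated day
SIM_DAYS = 365


def total_location_fees(days: int = SIM_DAYS, fee: float = DAILY_FEE) -> float:
    """Total location fees owed over the whole simulation (hypothetical helper)."""
    return days * fee


def benchmark_score(final_balances: list[float]) -> float:
    """Average the final bank balances of a model's runs (hypothetical helper)."""
    if not final_balances:
        raise ValueError("at least one run is required")
    return mean(final_balances)


# Invented example balances for five runs; one run derails and ends near zero.
example_runs = [9000.0, 7500.0, 120.0, 8800.0, 6400.0]

fees = total_location_fees()
score = benchmark_score(example_runs)

print(f"Location fees over the year: {fees:.2f} USD")
print(f"Fees exceed the starting balance: {fees > STARTING_BALANCE}")
print(f"Score (mean of {len(example_runs)} runs): {score:.2f} USD")
```

The fees alone add up to 730 dollars, more than the 500 dollars each model starts with, so a model has to turn a profit just to pay for its location. The invented runs also show how a single derailed run pulls the average down, which is one reason the per-model variance matters when reading the leaderboard.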

Leaderboard
#    Model                                    Score (USD)   Provider
1    Claude Opus 4.6                              8017.59   Anthropic
2    Claude Sonnet 4.6                            7204.14   Anthropic
3    GLM 5.1                                      5634.41   Z.AI
4    Gemini 3 Pro                                 5478.16   Google DeepMind
5    Claude Opus 4.5                              4967.06   Anthropic
6    GLM 5                                        4432.12   Z.AI
7    Claude Sonnet 4.5                            3838.74   Anthropic
8    Gemini 3.1 Pro Preview (Custom tools)        3774.25   Google DeepMind
9    Gemini 3 Flash                               3634.72   Google DeepMind
10   GPT-5.2                                      3591.33   OpenAI
11   GLM 4.7                                      2376.82   Z.AI
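To put the leaderboard scores next to the human estimate quoted in the description, the sketch below expresses each score as a share of roughly 63,000 dollars. The scores are copied from the leaderboard; the `share_of_human` helper and the direct comparison are assumptions made for illustration, since the human figure is an estimate of yearly earnings rather than a measured final balance.

```python
# Andon Labs' rough estimate for a good human strategy, from the description.
HUMAN_ESTIMATE_USD = 63_000.0

# Scores copied from the leaderboard (final balance in USD, averaged over runs).
LEADERBOARD = {
    "Claude Opus 4.6": 8017.59,
    "Claude Sonnet 4.6": 7204.14,
    "GLM 5.1": 5634.41,
    "Gemini 3 Pro": 5478.16,
    "Claude Opus 4.5": 4967.06,
    "GLM 5": 4432.12,
    "Claude Sonnet 4.5": 3838.74,
    "Gemini 3.1 Pro Preview (Custom tools)": 3774.25,
    "Gemini 3 Flash": 3634.72,
    "GPT-5.2": 3591.33,
    "GLM 4.7": 2376.82,
}


def share_of_human(score: float, baseline: float = HUMAN_ESTIMATE_USD) -> float:
    """Return a score as a fraction of the human estimate (hypothetical helper)."""
    return score / baseline


for rank, (model, score) in enumerate(LEADERBOARD.items(), start=1):
    print(f"{rank:>2}  {model:<40}{score:>10.2f}  {share_of_human(score):>6.1%}")
```

Because the baseline is a rough estimate, the percentages are best read as orders of magnitude, not as precise gaps between models and humans.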
Sources:
  • Vending-Bench 2 (Andon Labs)
  • Vending-Bench v1 (legacy)
  • Vending-Bench paper (arXiv 2502.15840)
Details:
  • Category: Agentic
  • Created by: Andon Labs
  • Models tested: 11
  • Leader: Claude Opus 4.6
  • Top score: 8017.59 USD

Updated June 2026


© 2026 Agents Directory