Vending-Bench 2

Name: Vending-Bench 2 leaderboard
Creator: Andon Labs

Agentic

Andon Labs' long-horizon coherence test: a model runs a simulated vending machine business for a full year and is scored on the money it ends with, in USD. Higher is better, and there is no ceiling.

Andon Labs has not yet run Claude Fable 5, Claude Opus 4.8 or GPT-5.5 on Vending-Bench 2; the newest Anthropic entry is Claude Opus 4.6.

Each model starts with 500 dollars and operates a vending machine business for 365 simulated days while paying a 2-dollar daily location fee. The agent emails and negotiates with suppliers (some adversarial), places and tracks orders, sets prices, manages inventory, and handles complaints, delays and refunds. Runs are extremely long-horizon (the original paper reports over 20 million tokens per run), so the benchmark primarily tests whether a model stays coherent and keeps using tools effectively without drifting off task. The score is the final bank balance in dollars, averaged across 5 runs per model. Variance is high: every model has runs that derail through forgotten orders, misread schedules or unproductive loops. Andon Labs estimates a good human strategy would earn roughly 63,000 dollars per year, so even the best models reach only a fraction of human performance.
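The scoring arithmetic described above can be sketched in a few lines of Python. This is a minimal illustration, not Andon Labs' code: the function names are hypothetical, and the five run balances are made-up placeholder values, not results for any model. Only the constants (500 dollar start, 2 dollar daily fee, 365 days, 5 runs, 63,000 dollar human estimate) come from the description.

```python
from statistics import mean, stdev

# Constants taken from the benchmark description above.
STARTING_BALANCE_USD = 500
DAILY_FEE_USD = 2
SIM_DAYS = 365
HUMAN_ESTIMATE_USD = 63_000


def total_location_fees(days=SIM_DAYS, daily_fee=DAILY_FEE_USD):
    """Location fees paid over the whole simulated year."""
    return days * daily_fee


def score(final_balances):
    """Leaderboard-style score: mean final bank balance across runs."""
    return mean(final_balances)


def summarize(final_balances):
    """Score, run-to-run spread, and share of the human estimate."""
    s = score(final_balances)
    return {
        "score_usd": s,
        "stdev_usd": stdev(final_balances),  # sample standard deviation
        "fraction_of_human": s / HUMAN_ESTIMATE_USD,
    }


# HYPOTHETICAL final balances for 5 runs (placeholders, not real results).
example_runs = [1200.0, 300.0, 4500.0, 800.0, 2200.0]

if __name__ == "__main__":
    print("Fees over the year:", total_location_fees())
    print(summarize(example_runs))
```

Two things fall out of the arithmetic. The location fee alone comes to 730 dollars over the year, more than the 500 dollar starting balance, so an agent has to turn a profit just to end above where it started. And because the score is a plain mean over only 5 runs, a single derailed or unusually strong run moves it substantially, which is why the spread is worth reporting alongside it.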

Leaderboard

#    Model    Score (USD)    Provider