GAIA2

Name: GAIA2 leaderboard
Creator: Meta

Agentic

Meta's general-assistant benchmark: agents complete realistic multi-app tasks in a simulated smartphone environment under time pressure, ambiguity and dynamic events. Higher is better.

Agents operate in the mobile universe of Meta's Agents Research Environments (ARE), which provides 11 working apps (including email, calendar, contacts, messaging, shopping, cab and file system). Time flows asynchronously and events fire independently of the agent. The benchmark has 800 human-authored scenarios split across five capabilities: Execution, Search, Adaptability, Time and Ambiguity. Unlike the original GAIA, which scores exact-match answers, GAIA2 uses an automated verifier that checks the agent's write actions against an oracle event graph (0.98 agreement with human labels), so success means producing the right state changes in the right order and on time. Every model runs the same ReAct baseline scaffold over three runs per scenario, which keeps scores attributable to the model rather than to a custom harness. 2026 leaderboard rows average the five core splits; paper-baseline rows average seven splits (adding Noise and Agent2Agent) and read a few points lower.
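The difference between the five-split and seven-split averages can be illustrated with a minimal sketch. It assumes each split score is the mean success rate over the three runs and that the overall score is an unweighted mean of split scores; the function names and all numbers below are hypothetical and are not taken from the leaderboard or from ARE.

```python
from statistics import mean

# Split names as listed in the description above.
CORE_SPLITS = ("Execution", "Search", "Adaptability", "Time", "Ambiguity")
EXTRA_SPLITS = ("Noise", "Agent2Agent")


def split_score(runs):
    """Mean success rate (percent) over repeated runs of one split.

    `runs` is a list of runs; each run is a list of booleans, one per
    scenario, where True means the verifier accepted the agent's actions.
    Assumption: runs are averaged with equal weight.
    """
    per_run = [100.0 * sum(run) / len(run) for run in runs]
    return mean(per_run)


def overall_score(split_scores, splits):
    """Unweighted mean of the selected split scores (assumed aggregation)."""
    return mean(split_scores[name] for name in splits)


# Hypothetical split scores for one model, for illustration only.
example = {
    "Execution": 60.0, "Search": 70.0, "Adaptability": 30.0,
    "Time": 20.0, "Ambiguity": 20.0, "Noise": 25.0, "Agent2Agent": 20.0,
}

core_avg = overall_score(example, CORE_SPLITS)                 # five core splits
full_avg = overall_score(example, CORE_SPLITS + EXTRA_SPLITS)  # seven splits
```

With these made-up numbers the seven-split average is lower than the five-split average because the two extra splits score below the core mean, which is the pattern the description reports for paper-baseline rows.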

Leaderboard

# | Model | Score | Provider