Agents Directory
Meta

GAIA2

Agentic

Meta's general-assistant benchmark: agents complete realistic multi-app tasks in a simulated smartphone environment under time pressure, ambiguity and dynamic events. Higher is better.


Agents operate in the mobile universe of Meta's Agents Research Environments (ARE), which has 11 working apps (including email, calendar, contacts, messaging, shopping, cab and file system). Time flows asynchronously and events fire independently of the agent. The benchmark has 800 human-authored scenarios split across five capabilities: Execution, Search, Adaptability, Time and Ambiguity. Unlike the original GAIA, which scores exact-match answers, GAIA2 uses an automated verifier that checks the agent's write actions against an oracle event graph (0.98 agreement with human labels), so success means producing the right state changes in the right order and on time.

Every model runs the same ReAct baseline scaffold with three runs per scenario, which keeps scores attributable to the model rather than to a custom harness. 2026 leaderboard rows average the five core splits; paper-baseline rows average seven splits (the five core splits plus Noise and Agent2Agent) and read a few points lower.
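To make the verification idea above concrete, here is a minimal sketch of checking an agent's write actions against an oracle event graph. It is a simplification under stated assumptions, not the ARE verifier itself: the names `WriteAction`, `OracleEvent` and `verify` are hypothetical, arguments are compared by exact equality, and extra write actions are assumed to fail the scenario.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WriteAction:
    """A state-changing call made by the agent (hypothetical structure)."""
    app: str
    tool: str
    args: tuple   # sorted (key, value) pairs
    time: float   # simulated seconds since scenario start


@dataclass(frozen=True)
class OracleEvent:
    """One node of the oracle event graph (hypothetical structure)."""
    event_id: str
    app: str
    tool: str
    args: tuple
    deadline: float          # latest acceptable simulated time
    depends_on: tuple = ()   # ids of events that must happen first


def verify(agent_actions, oracle_events):
    """Return True if every oracle event is matched by a distinct agent
    action, after its dependencies and before its deadline.

    Assumes oracle_events is listed in topological order.
    """
    matched_time = {}
    used = set()
    for ev in oracle_events:
        # An event may only be matched after all of its parents.
        earliest = max((matched_time[d] for d in ev.depends_on),
                       default=float("-inf"))
        hit = None
        for i, act in enumerate(agent_actions):
            same_call = (act.app, act.tool, act.args) == (ev.app, ev.tool, ev.args)
            if i not in used and same_call and earliest <= act.time <= ev.deadline:
                hit = i
                break
        if hit is None:
            return False
        used.add(hit)
        matched_time[ev.event_id] = agent_actions[hit].time
    # Assumption: unmatched extra writes count as a failure.
    return len(used) == len(agent_actions)


oracle = [
    OracleEvent("e1", "email", "send", (("to", "bob"),), deadline=60.0),
    OracleEvent("e2", "calendar", "add_event", (("title", "sync"),),
                deadline=120.0, depends_on=("e1",)),
]
in_order = [
    WriteAction("email", "send", (("to", "bob"),), 10.0),
    WriteAction("calendar", "add_event", (("title", "sync"),), 30.0),
]
print(verify(in_order, oracle))
```

In this sketch, the same two actions fail if the calendar entry is written before the email or after its deadline, which mirrors the "right order and on time" requirement.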

Leaderboard
| # | Model | Configuration | Score | Provider |
|---|---|---|---|---|
| 1 | Claude Opus 4.6 | High, ReAct baseline | 57% | Anthropic |
| 2 | GPT-5.5 | xHigh, ReAct baseline | 56.4% | OpenAI |
| 3 | GPT-5.4 | High, ReAct baseline | 55.6% | OpenAI |
| 4 | Gemini 3.1 Pro Preview | High, ReAct baseline | 52% | Google DeepMind |
| 5 | Claude Sonnet 4.6 | High, ReAct baseline | 51.9% | Anthropic |
| 6 | GLM 5.1 | Thinking, ReAct baseline | 50.5% | Z.AI |
| 7 | GPT-5 | High, paper baseline | 42.1% | OpenAI |
| 8 | Claude Sonnet 4 | Thinking, paper baseline | 37.8% | Anthropic |
| 9 | Claude Sonnet 4 | Paper baseline | 34.8% | Anthropic |
| 10 | GPT-5 | Low, paper baseline | 34.6% | OpenAI |
| 11 | Gemini 2.5 Pro | Paper baseline | 25.8% | Google DeepMind |
| 12 | gpt-oss-120b | High, paper baseline | 13.7% | OpenAI |
| 13 | Llama 4 Maverick | Paper baseline | 7.4% | Meta |
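The difference between five-split and seven-split rows can be illustrated with a short sketch. It assumes each split score is a pass rate averaged over the runs of each scenario and that the overall score is an unweighted mean of split scores; the function names and all numbers below are hypothetical and are not taken from the leaderboard.

```python
from statistics import mean

CORE_SPLITS = ("Execution", "Search", "Adaptability", "Time", "Ambiguity")
EXTRA_SPLITS = ("Noise", "Agent2Agent")


def split_score(scenario_runs):
    """Percent of successful runs, averaged per scenario first.

    scenario_runs: list of per-scenario lists of booleans, one per run.
    """
    per_scenario = [sum(runs) / len(runs) for runs in scenario_runs]
    return 100 * mean(per_scenario)


def overall(split_scores, splits):
    """Unweighted mean of the chosen split scores (assumed aggregation)."""
    return mean(split_scores[s] for s in splits)


# Hypothetical split scores for one model configuration.
scores = {
    "Execution": 60.0, "Search": 50.0, "Adaptability": 40.0,
    "Time": 30.0, "Ambiguity": 20.0,
    "Noise": 30.0, "Agent2Agent": 29.0,
}

five_split = overall(scores, CORE_SPLITS)
seven_split = overall(scores, CORE_SPLITS + EXTRA_SPLITS)
print(f"five-split: {five_split:.1f}, seven-split: {seven_split:.1f}")
```

Whenever the two extra splits score below the five-split mean, the seven-split average comes out lower, so rows from the two groups are not directly comparable.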
Sources:
  • Gaia2 leaderboard (Hugging Face)
  • gaia2 dataset
  • Gaia2 paper (ICLR 2026)
  • facebookresearch/meta-agents-research-environments
Details:
  • Category: Agentic
  • Created by: Meta
  • Models tested: 11
  • Configs tested: 13
  • Leader: Claude Opus 4.6
  • Top score: 57%

Updated May 2026


© 2026 Agents Directory