Agents Directory

Tau2-Bench Telecom

Agentic

Sierra's customer-support agent benchmark: resolve telecom troubleshooting tasks by talking to a simulated user, calling tools and strictly following policy. Pass@1, higher is better.

Each task pairs the agent with an LLM-simulated user in a dual-control environment where both sides can act and use tools, so the agent must guide the user through steps it cannot perform itself (rebooting a phone, toggling settings) while staying inside a written policy document. A run passes only if the final database state and required actions match the ground truth. Telecom is the hardest of the original three domains (retail, airline, telecom) and the one vendors quote; it has roughly 114 tasks, and scores are averaged over repeated trials. The paper also reports pass^k reliability (the probability of succeeding on all k independent trials), which falls off sharply for less consistent models. Scores here are vendor-reported pass@1 figures; Sierra's own re-evaluations with a standardized user simulator run several points lower.
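To make the difference between pass@1 and pass^k concrete, here is a minimal Python sketch. It assumes each task is run n times and uses the combinatorial estimator C(c, k) / C(n, k), where c is the number of successful trials for that task; treat that estimator, the function names, and the toy data as illustrative assumptions, not as the benchmark's actual evaluation code.

```python
from math import comb

def pass_at_1(results):
    """Mean per-task success rate; results is a list of per-task trial outcomes (booleans)."""
    return sum(sum(trials) / len(trials) for trials in results) / len(results)

def pass_hat_k(results, k):
    """Estimate the chance that k trials drawn from a task's n trials all succeed, averaged over tasks.

    Assumed estimator: C(c, k) / C(n, k), with c successes out of n trials per task.
    """
    total = 0.0
    for trials in results:
        n, c = len(trials), sum(trials)
        if k > n:
            raise ValueError("k cannot exceed the number of trials per task")
        # comb(c, k) is 0 when c < k, so a task with fewer than k successes contributes 0
        total += comb(c, k) / comb(n, k)
    return total / len(results)

# Hypothetical toy data: 3 tasks, 4 trials each (not real benchmark results)
toy_results = [
    [True, True, True, True],      # always succeeds
    [True, False, True, False],    # succeeds half the time
    [False, False, False, False],  # never succeeds
]

print(pass_at_1(toy_results))      # 0.5
print(pass_hat_k(toy_results, 4))  # only the first task counts: 1/3
```

In this toy data the inconsistent second task contributes 0.5 to pass@1 but nothing to pass^4, which is the sense in which pass^k penalizes less consistent models.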

Leaderboard
#    Model              Score   Provider
1    Claude Opus 4.6    99.3%   Anthropic
2    GPT-5.4            98.9%   OpenAI
3    GPT-5.2            98.7%   OpenAI
4    Claude Opus 4.5    98.2%   Anthropic
5    GPT-5.5            98%     OpenAI
6    Claude Sonnet 4.6  97.9%   Anthropic
7    GPT-5              96.7%   OpenAI
8    GPT-5.1            95.6%   OpenAI
9    GPT-5.4 Mini       93.4%   OpenAI
10   GPT-5.4 Nano       92.5%   OpenAI
11   MiniMax M2.1       87%     MiniMax
12   Claude Haiku 4.5   83%     Anthropic
13   Nemotron 3 Super   64.4%   Nvidia
14   o3                 58.2%   OpenAI
Sources:
  • Official tau-bench leaderboard
  • sierra-research/tau2-bench
  • Tau2-Bench paper (arXiv 2506.07982)
Details:
  • Category: Agentic
  • Created by: Sierra
  • Models tested: 14
  • Leader: Claude Opus 4.6
  • Top score: 99.3%

Updated June 2026

© 2026 Agents Directory