Agents Directory

Berkeley Function Calling Leaderboard V4

Agentic

The reference leaderboard for tool calling: overall accuracy across single-turn, multi-turn, web search, memory and hallucination-resistance categories. Higher is better.

The board was last updated in April 2026, so the very newest frontier models (Claude Fable 5, Claude Opus 4.8, GPT-5.5) are not on it yet.

BFCL V4 averages a wide set of subcategories into one overall accuracy figure. Single-turn calls (simple, multiple, parallel) are graded by AST matching against possible answers, while multi-turn agentic tasks are graded by executing the calls and checking state. V4 added web search, memory (key-value, vector store, recursive summarization backends) and format sensitivity categories, and hallucination resistance is measured through relevance and irrelevance detection (calling a function when appropriate, abstaining when none fits). Models appear in native function-calling (FC) or prompt mode, recorded on each row; the board also publishes total run cost and latency columns, which are whole-benchmark figures rather than per-task costs, so they are not charted here.
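The single-turn and hallucination checks described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the BFCL checker itself: the helper names (`parse_call`, `matches`, `relevance_ok`), the `get_weather` example, and the shape of the `possible_answers` mapping (parameter name to list of accepted values) are assumptions made for this sketch, and it handles keyword arguments only.

```python
import ast


def parse_call(call_src: str) -> tuple[str, dict]:
    """Parse a Python-style call string into (function_name, keyword_args)."""
    node = ast.parse(call_src, mode="eval").body
    if not isinstance(node, ast.Call):
        raise ValueError("not a function call")
    name = ast.unparse(node.func)
    # Only keyword arguments with literal values are supported in this sketch.
    kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
    return name, kwargs


def matches(call_src: str, expected_name: str,
            possible_answers: dict, required: list[str]) -> bool:
    """Return True if the call names the expected function, supplies every
    required parameter, and each supplied value is among the accepted ones."""
    try:
        name, kwargs = parse_call(call_src)
    except (SyntaxError, ValueError):
        return False  # unparseable output counts as a failure
    if name != expected_name:
        return False
    if not set(required) <= set(kwargs):
        return False
    for key, value in kwargs.items():
        if key not in possible_answers or value not in possible_answers[key]:
            return False
    return True


def relevance_ok(calls: list[str], function_fits: bool) -> bool:
    """Relevance: some call is made when a function fits.
    Irrelevance: no call is made when none fits."""
    return bool(calls) if function_fits else not calls


# Hypothetical test case: several accepted spellings per parameter.
possible = {"city": ["Paris", "Paris, France"], "unit": ["celsius", "C"]}
ok = matches('get_weather(city="Paris", unit="C")', "get_weather", possible, ["city"])
bad = matches('get_weather(city="Lyon")', "get_weather", possible, ["city"])
```

Here `ok` is true and `bad` is false. Comparing parsed structure rather than raw strings means argument order and whitespace do not affect the grade, which is the main reason to match on the AST. Multi-turn tasks would need a different check, since they are graded on resulting state rather than on the call text.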

Leaderboard
#    Model               Mode               Score    Provider
1    Claude Opus 4.5     FC                 77.47%   Anthropic
2    Claude Sonnet 4.5   FC                 73.24%   Anthropic
3    Gemini 3 Pro        Prompt             72.51%   Google DeepMind
4    GLM 4.6             FC, Thinking       72.38%   Z.AI
5    Claude Haiku 4.5    FC                 68.7%    Anthropic
6    Gemini 3 Pro        FC                 68.14%   Google DeepMind
7    o3                  Prompt             63.05%   OpenAI
8    Kimi K2 0711        FC                 59.06%   Moonshot AI
9    DeepSeek V3.2 Exp   Prompt, Thinking   56.73%   DeepSeek
10   Gemini 2.5 Flash    FC                 56.24%   Google DeepMind
11   GPT-5.2             FC                 55.87%   OpenAI
12   GPT-5 Mini          FC                 55.46%   OpenAI
13   DeepSeek V3.2 Exp   FC                 54.12%   DeepSeek
14   GPT-4.1             FC                 53.96%   OpenAI
Sources:
  • Raw leaderboard CSV (data_overall.csv)
  • BFCL V4 leaderboard
  • BFCL V4 announcement
  • BFCL paper (ICML 2025)
  • ShishirPatil/gorilla (BFCL eval code)
Details:
  • Category: Agentic
  • Created by: UC Berkeley
  • Models tested: 12
  • Configs tested: 14
  • Leader: Claude Opus 4.5
  • Top score: 77.47%
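
The summary figures in Details can be re-derived from the leaderboard rows. The sketch below assumes the rows are transcribed by hand into a list of tuples (this layout and the variable names are illustrative, not the schema of data_overall.csv). It shows why 14 configs correspond to 12 models: two models appear in both FC and prompt mode.

```python
# (model, mode, score in percent, provider), transcribed from the leaderboard.
rows = [
    ("Claude Opus 4.5", "FC", 77.47, "Anthropic"),
    ("Claude Sonnet 4.5", "FC", 73.24, "Anthropic"),
    ("Gemini 3 Pro", "Prompt", 72.51, "Google DeepMind"),
    ("GLM 4.6", "FC, Thinking", 72.38, "Z.AI"),
    ("Claude Haiku 4.5", "FC", 68.70, "Anthropic"),
    ("Gemini 3 Pro", "FC", 68.14, "Google DeepMind"),
    ("o3", "Prompt", 63.05, "OpenAI"),
    ("Kimi K2 0711", "FC", 59.06, "Moonshot AI"),
    ("DeepSeek V3.2 Exp", "Prompt, Thinking", 56.73, "DeepSeek"),
    ("Gemini 2.5 Flash", "FC", 56.24, "Google DeepMind"),
    ("GPT-5.2", "FC", 55.87, "OpenAI"),
    ("GPT-5 Mini", "FC", 55.46, "OpenAI"),
    ("DeepSeek V3.2 Exp", "FC", 54.12, "DeepSeek"),
    ("GPT-4.1", "FC", 53.96, "OpenAI"),
]

configs_tested = len(rows)                          # one row per model/mode pair
models_tested = len({model for model, *_ in rows})  # distinct model names
leader = max(rows, key=lambda row: row[2])          # highest overall accuracy

# Models listed under more than one mode account for the configs/models gap.
seen, repeated = set(), set()
for model, *_ in rows:
    (repeated if model in seen else seen).add(model)

print(f"Models tested: {models_tested}")
print(f"Configs tested: {configs_tested}")
print(f"Leader: {leader[0]} ({leader[2]}%)")
print(f"Listed in more than one mode: {sorted(repeated)}")
```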

Updated April 2026


© 2026 Agents Directory