Berkeley Function Calling Leaderboard V4

Agentic

The reference leaderboard for tool calling: overall accuracy across single-turn, multi-turn, web search, memory and hallucination-resistance categories. Higher is better.

The board was last updated in April 2026, so the very newest frontier models (Claude Fable 5, Claude Opus 4.8, GPT-5.5) are not on it yet.

BFCL V4 averages a wide set of subcategories into one overall accuracy figure. Single-turn calls (simple, multiple, parallel) are graded by abstract syntax tree (AST) matching against a list of possible answers, while multi-turn agentic tasks are graded by executing the calls and checking the resulting state. V4 added web search, memory (with key-value, vector store and recursive summarization backends) and format sensitivity categories. Hallucination resistance is measured through relevance and irrelevance detection: calling a function when one is appropriate and abstaining when none fits. Each row records whether the model ran in native function-calling (FC) or prompt mode. The board also publishes total run cost and latency columns; these are whole-benchmark figures rather than per-task costs, so they are not charted here.
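To make the single-turn grading idea concrete, here is a minimal sketch of AST-style matching against possible answers. It is not BFCL's actual checker, which is more elaborate. The names parse_call and matches, the get_weather example, the keyword-only restriction, and the convention that None in an accepted list marks an optional parameter are all assumptions made for illustration.

```python
import ast


def parse_call(source: str) -> tuple[str, dict]:
    """Parse one Python-style call, e.g. f(a=1, b="x"), into (name, kwargs)."""
    node = ast.parse(source, mode="eval").body
    if not isinstance(node, ast.Call):
        raise ValueError("expected a single function call")
    if node.args:
        # Simplification for this sketch: only keyword arguments are handled.
        raise ValueError("positional arguments are not supported")
    name = ast.unparse(node.func)
    # literal_eval only accepts literals, so nothing in the call is executed.
    kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
    return name, kwargs


def matches(source: str, expected_name: str, possible_answers: dict[str, list]) -> bool:
    """Return True if the call uses the expected function and accepted values.

    possible_answers maps each parameter to its list of accepted values.
    Assumed convention: None in that list means the parameter may be omitted.
    """
    try:
        name, kwargs = parse_call(source)
    except (SyntaxError, ValueError):
        return False  # unparseable or non-literal output counts as a miss
    if name != expected_name:
        return False
    # Reject arguments that the reference answer does not list at all.
    if any(key not in possible_answers for key in kwargs):
        return False
    for param, accepted in possible_answers.items():
        if param not in kwargs:
            if None not in accepted:
                return False  # required parameter is missing
            continue
        if kwargs[param] not in accepted:
            return False
    return True


# Hypothetical reference answer for a made-up get_weather function.
answers = {"city": ["Berkeley", "Berkeley, CA"], "unit": ["celsius", None]}
print(matches('get_weather(city="Berkeley")', "get_weather", answers))
print(matches('get_weather(city="Oakland")', "get_weather", answers))
```

Comparing parsed structure rather than raw strings means that differences in whitespace, quoting style or argument order do not affect the result, which is the main reason to grade on the AST. Multi-turn tasks cannot be graded this way because, as described above, they are judged on the state left behind after the calls are executed.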

Leaderboard

#    Model    Score    Provider