How Claude Fable 5 ranks on benchmarks

Anthropic released Claude Fable 5 on June 9. It is the company's first Mythos-class model, built for long-running autonomous work, with a 1M-token context window and pricing of $10 per million input tokens and $50 per million output tokens. Here is where it lands, based on Anthropic's announcement, the independent CursorBench leaderboard, and the Artificial Analysis Intelligence Index.
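To make the per-token pricing concrete, here is a minimal sketch that estimates the cost of one request at the listed rates. The function name and the token counts in the usage line are hypothetical illustrations, not measurements, and the sketch ignores any discounts or surcharges not stated in this article.

```python
# Listed Fable 5 rates from this article, in USD per one million tokens.
INPUT_USD_PER_MTOK = 10.0
OUTPUT_USD_PER_MTOK = 50.0


def request_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one request at the listed per-million-token rates."""
    total = input_tokens * INPUT_USD_PER_MTOK + output_tokens * OUTPUT_USD_PER_MTOK
    return total / 1_000_000


# Hypothetical workload: 200k tokens of context in, 20k tokens out.
print(f"${request_cost_usd(200_000, 20_000):.2f}")  # prints $3.00
```

Output tokens cost five times as much as input tokens at these rates, so long generations dominate the bill faster than long prompts do.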

CursorBench 3.1

CursorBench evaluates models on ambiguous, multi-file tasks taken from real Cursor sessions. It's the closest thing we have to a production agentic-coding benchmark.

Fable 5 high (default): 70.6% at $10.81 per task, more than 7 points clear of every other default configuration.
Fable 5 Max: 72.9%, the top score on the whole leaderboard.
Next-best defaults: Cursor's Composer 2.5 at 63.2% ($0.55 per task, the value outlier), GPT-5.5 high at 62.6%, Claude Opus 4.8 high at 58.4%.
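The gaps behind "more than 7 points clear" and "the value outlier" can be checked with a few lines of arithmetic. This is a small sketch using only the figures quoted above; the helper names are hypothetical, and per-task cost is included only for the two models where this article states it.

```python
# Default-configuration scores (percent) quoted above from CursorBench 3.1.
scores = {
    "Fable 5 high": 70.6,
    "Composer 2.5": 63.2,
    "GPT-5.5 high": 62.6,
    "Claude Opus 4.8 high": 58.4,
}
# Per-task cost in USD; this article quotes it only for these two entries.
cost_per_task = {"Fable 5 high": 10.81, "Composer 2.5": 0.55}


def lead_over(leader: str, other: str) -> float:
    """Gap in percentage points between two entries, rounded to one decimal."""
    return round(scores[leader] - scores[other], 1)


def cost_ratio(a: str, b: str) -> float:
    """How many times more entry a costs per task than entry b."""
    return cost_per_task[a] / cost_per_task[b]


for name in scores:
    if name != "Fable 5 high":
        print(f"Fable 5 high leads {name} by {lead_over('Fable 5 high', name)} points")

ratio = cost_ratio("Fable 5 high", "Composer 2.5")
print(f"Fable 5 high costs {ratio:.1f}x as much per task as Composer 2.5")
```

On these numbers the smallest lead is 7.4 points (over Composer 2.5), while the per-task cost is roughly 20 times higher, which is the trade-off the leaderboard's cost column makes visible.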

Full interactive leaderboard on our CursorBench page.

Artificial Analysis Intelligence Index

Artificial Analysis publishes a composite 0-100 intelligence score that blends knowledge, reasoning, math, coding, and agentic evaluations. It is the most widely cited all-up benchmark outside vendor tables.

Fable 5 (default): 64.9, the top score in our catalog, more than 7 points above Claude Opus 4.7 (57.3) and Gemini 3.1 Pro Preview (57.2).
Next in this cut: Qwen3.7 Max (56.6), Gemini 3.5 Flash (55.3), MiniMax-M3 (54.7), Grok 4.3 high (53.2).

Full interactive leaderboard on our Intelligence Index page.

Anthropic's reported numbers

From the announcement, compared against Claude Mythos Preview, Claude Opus 4.8, GPT-5.5, and Gemini 3.1 Pro:

Category	Benchmark	Claude Mythos 5 / Fable 5	Claude Mythos Preview	Claude Opus 4.8	GPT-5.5	Gemini 3.1 Pro
Agentic coding	SWE-Bench Pro	80.3%	77.8%	69.2%	58.6%	54.2%
Agentic coding	FrontierCode (Diamond), xhigh	29.3%	—	13.4%	5.7%	—
Agentic coding	Terminal-Bench 2.1	88.0%*	—	82.7%	83.4% (Codex CLI)	70.7% (Gemini CLI)
Knowledge work	GDPval-AA	1932	—	1890	1769	1314
Knowledge work vision	GDP.pdf, no tools	29.8%	—	22.5%	24.9%	16.7%
Spatial reasoning	Blueprint-Bench 2	38.6%	—	14.5%	36.2%	26.5%
Tool use	AutomationBench	17.4%	—	15.5%	12.9%	9.6%
Computer use	OSWorld-Verified	85.0%	85.4%	83.4%	78.7%	76.2%
Legal	Legal Agent Benchmark	13.3%	—	10.4%	2.1%	0.0%
Multidisciplinary reasoning	Humanity's Last Exam, no tools	59.0%*	56.8%	49.8%	41.4%	44.4%
Multidisciplinary reasoning	Humanity's Last Exam, with tools	64.5%*	64.7%	57.9%	52.2%	51.4%
Biology	BioMysteryBench, hard	46.1%*	29.6%	40.0%	—	—
Biology	BioMysteryBench, human solved	83.9%*	82.6%	80.4%	—	—
Cybersecurity	ExploitBench (Cap)	78.0%*	69.0%	40.0%	34.0%	—
Health	HealthBench Professional	66.0%*	64.7%	56.9%	51.8%	—

Anthropic reports the higher score of Claude Mythos 5 and Claude Fable 5; the two land within 1-3 points of each other except on starred (*) benchmarks. See the Mythos caveat below.
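To see which column leads each row of the table above, the figures can be loaded into a small structure and compared. This is a rough sketch: the data is transcribed from the table with missing cells as None, the names COLUMNS, ROWS, and leader are hypothetical, and it assumes higher is better on every benchmark, which the table does not state explicitly.

```python
from typing import Optional

COLUMNS = [
    "Claude Mythos 5 / Fable 5",
    "Claude Mythos Preview",
    "Claude Opus 4.8",
    "GPT-5.5",
    "Gemini 3.1 Pro",
]

# Scores transcribed from the table above; None marks a cell with no reported score.
ROWS: dict[str, tuple[Optional[float], ...]] = {
    "SWE-Bench Pro": (80.3, 77.8, 69.2, 58.6, 54.2),
    "FrontierCode (Diamond), xhigh": (29.3, None, 13.4, 5.7, None),
    "Terminal-Bench 2.1": (88.0, None, 82.7, 83.4, 70.7),
    "GDPval-AA": (1932, None, 1890, 1769, 1314),
    "GDP.pdf, no tools": (29.8, None, 22.5, 24.9, 16.7),
    "Blueprint-Bench 2": (38.6, None, 14.5, 36.2, 26.5),
    "AutomationBench": (17.4, None, 15.5, 12.9, 9.6),
    "OSWorld-Verified": (85.0, 85.4, 83.4, 78.7, 76.2),
    "Legal Agent Benchmark": (13.3, None, 10.4, 2.1, 0.0),
    "Humanity's Last Exam, no tools": (59.0, 56.8, 49.8, 41.4, 44.4),
    "Humanity's Last Exam, with tools": (64.5, 64.7, 57.9, 52.2, 51.4),
    "BioMysteryBench, hard": (46.1, 29.6, 40.0, None, None),
    "BioMysteryBench, human solved": (83.9, 82.6, 80.4, None, None),
    "ExploitBench (Cap)": (78.0, 69.0, 40.0, 34.0, None),
    "HealthBench Professional": (66.0, 64.7, 56.9, 51.8, None),
}


def leader(scores: tuple[Optional[float], ...]) -> str:
    """Return the column name with the highest reported score in one row."""
    reported = [(score, index) for index, score in enumerate(scores) if score is not None]
    _, best_index = max(reported, key=lambda pair: pair[0])
    return COLUMNS[best_index]


# Rows where the combined Mythos 5 / Fable 5 column is not the top score.
exceptions = [name for name, scores in ROWS.items() if leader(scores) != COLUMNS[0]]
print(exceptions)
```

On the figures quoted here, the only rows not led by the Mythos 5 / Fable 5 column are OSWorld-Verified and Humanity's Last Exam with tools, where Claude Mythos Preview is narrowly ahead.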

The Mythos caveat

Each figure in the first column is the higher score of two models: Claude Mythos 5 (the same model with safeguards lifted, restricted to vetted researchers) and the generally available Fable 5. The two land within 1-3 points of each other except on starred (*) benchmarks, where Fable 5's safeguards redirect cybersecurity and biology queries to Opus 4.8, which happens in under 5% of sessions. On those benchmarks, Fable 5's effective score sits closer to Opus 4.8's.

Bottom line

At twice Opus 4.8's price, Fable 5 is not the default for everything. But on long-horizon agentic coding it is currently the strongest model available, and the CursorBench cost curve shows the premium buying capability, not just tokens. Pricing, host availability, and sources are on the Claude Fable 5 model page.

Sources

Anthropic, "Introducing Claude Fable 5 and Claude Mythos 5": the announcement, pricing, and the reported benchmark table
CursorBench leaderboard: independent agentic-coding scores and per-task cost
CursorBench 3.1 on Agents Directory: the live leaderboard we keep updated
Artificial Analysis Intelligence Index: the composite intelligence methodology
Intelligence Index on Agents Directory: the live leaderboard we keep updated
Claude Fable 5 model page: pricing, context window, and host availability
