Agents Directory

Claude Sonnet 4.6 vs GPT-5.4

The workhorse tier, where most real coding spend actually goes: Claude Sonnet 4.6 against GPT-5.4. Neither is the headline flagship; both are what teams quietly run all day.

The verdict

GPT-5.4 is the better value of the two workhorses: $2.50 per million input tokens versus Sonnet 4.6's $3, with a stronger showing on our boards (56.8 versus 51.7 on the Intelligence Index at best settings, and a standout 56% on DeepSWE).

Sonnet 4.6 remains the dependable pick inside the Anthropic ecosystem: if your agent, skills, and workflows are Claude-based, it is the model you run when Opus or Fable is overkill, and its quality on everyday edits is well-proven.

If you are choosing fresh with no ecosystem pull, GPT-5.4 wins this tier. If you are already on Claude, Sonnet 4.6 is close enough that switching is not worth the cost.

The facts, side by side
| | Claude Sonnet 4.6 | GPT-5.4 |
|---|---|---|
| Provider | Anthropic | OpenAI |
| Input price | $3 / 1M tokens | $2.50 / 1M tokens |
| Output price | $15 / 1M tokens | $15 / 1M tokens |
| Context | 1M tokens | 1.1M tokens |
| Open weights | No | No |
| Free tier | No | No |
| Released | Feb 2026 | Mar 2026 |

Prices and context are synced from live provider listings. Deep dives: Claude Sonnet 4.6 and GPT-5.4.
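To see what the input-price gap means in practice, here is a minimal cost sketch using the list prices from the table above. The workload size (400M input tokens, 50M output tokens per month) is a hypothetical figure chosen for illustration, and the calculation assumes plain list pricing with no caching or batch discounts.

```python
# List prices in USD per 1M tokens, copied from the comparison table.
PRICES = {
    "Claude Sonnet 4.6": {"input": 3.00, "output": 15.00},
    "GPT-5.4": {"input": 2.50, "output": 15.00},
}


def workload_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of a workload at list prices (no discounts assumed)."""
    price = PRICES[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


# Hypothetical monthly workload, for illustration only.
INPUT_TOKENS = 400_000_000
OUTPUT_TOKENS = 50_000_000

for model in PRICES:
    cost = workload_cost(model, INPUT_TOKENS, OUTPUT_TOKENS)
    print(f"{model}: ${cost:,.2f}")
```

Because output pricing is identical, the saving comes entirely from input tokens: for this assumed mix it is $200 on $1,950, or roughly 10%. Workloads that are heavier on output will see a smaller relative gap.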

Benchmark scores
| Benchmark | Claude Sonnet 4.6 | GPT-5.4 |
|---|---|---|
| Vending-Bench 2 | 7204.14 USD | — |
| Design Arena | 1325 Elo (Code) | — |
| Tau2-Bench Telecom | 97.9% | 98.9% |
| OSWorld-Verified | 81.45% (Pointer Agent, 100 steps) | 75% (Vendor harness) |
| SWE-bench Verified | 79.6% (Vendor harness) | — |
| BrowseComp | 74.7% (Max thinking, tools) | 82.7% (Browsing) |
| GAIA2 | 51.9% (High, ReAct baseline) | 55.6% (High, ReAct baseline) |
| CursorBench 3.1 | 49% (Max) | — |
| Artificial Analysis Intelligence Index | 47.2 (Adaptive Reasoning, Max Effort) | 51.4 (xhigh) |
| DeepSWE | 32% (High) | 56% (Extra High) |
| FrontierCode Main | 15.1% | — |
| FrontierCode Diamond | 3.5% | — |
| METR 50% Time Horizon | — | 341.74 min |
| SWE-bench Pro | — | 59.1% (xHigh) |

Best published configuration per model. Every config and source is on the benchmark leaderboards.

Benchmarks, head to head

Every published configuration for Claude Sonnet 4.6 and GPT-5.4 on the benchmarks they share, charted side by side. Only these two models are plotted.
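As a rough illustration of how the shared set is derived, the sketch below filters the best published scores down to benchmarks both models report and computes the gap on each. The scores are a subset of the rows in the benchmark table above, and the function name is illustrative. Note that the configurations differ between models (harness, effort level), so these deltas are not like-for-like measurements.

```python
# (Claude Sonnet 4.6, GPT-5.4) best published scores from the benchmark table.
# None marks a benchmark with no published result for that model.
SCORES = {
    "Tau2-Bench Telecom": (97.9, 98.9),
    "OSWorld-Verified": (81.45, 75.0),
    "SWE-bench Verified": (79.6, None),
    "BrowseComp": (74.7, 82.7),
    "GAIA2": (51.9, 55.6),
    "Artificial Analysis Intelligence Index": (47.2, 51.4),
    "DeepSWE": (32.0, 56.0),
    "SWE-bench Pro": (None, 59.1),
}


def shared_deltas(scores):
    """Return {benchmark: GPT-5.4 minus Sonnet 4.6} where both models have a score."""
    return {
        name: round(gpt - sonnet, 2)
        for name, (sonnet, gpt) in scores.items()
        if sonnet is not None and gpt is not None
    }


deltas = shared_deltas(SCORES)

# Print largest GPT-5.4 lead first; there are no ties in this data.
for name, delta in sorted(deltas.items(), key=lambda kv: kv[1], reverse=True):
    leader = "GPT-5.4" if delta > 0 else "Claude Sonnet 4.6"
    print(f"{name}: {delta:+.2f} ({leader} ahead)")
```

On this data, six benchmarks are shared: GPT-5.4 leads on five of them, and Sonnet 4.6 leads on OSWorld-Verified.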

DeepSWE

Datacurve's agentic coding benchmark: each model runs as an autonomous agent on real software engineering tasks and is scored on whether its final patch resolves the issue. Higher is better.

Tau2-Bench Telecom

Sierra's customer-support agent benchmark: resolve telecom troubleshooting tasks by talking to a simulated user, calling tools and strictly following policy. Pass@1, higher is better.

OSWorld-Verified

The standard computer-use benchmark: agents complete real desktop tasks in a live Ubuntu VM from screenshots, mouse and keyboard, scored by execution-based checks. Higher is better.

GAIA2

Meta's general-assistant benchmark: agents complete realistic multi-app tasks in a simulated smartphone environment under time pressure, ambiguity and dynamic events. Higher is better.

BrowseComp

OpenAI's hard web-browsing benchmark: 1,266 questions whose answers are hard to find but easy to verify, requiring persistent multi-step browsing. Higher is better.

Artificial Analysis Intelligence Index

The most-cited composite intelligence score: a 0–100 index combining knowledge, reasoning, math, coding, and agentic evaluations (GPQA Diamond, HLE, IFBench, SciCode, Terminal-Bench Hard, τ²-Bench, and more). Higher is better.

Frequently asked questions
Is GPT-5.4 better than Claude Sonnet 4.6?

By our boards, modestly yes: GPT-5.4 scores higher on the Intelligence Index (56.8 versus 51.7 at best settings) and costs less ($2.50 versus $3 per million input tokens). Sonnet 4.6 stays competitive on everyday coding quality and is the natural pick if you are already in the Claude ecosystem.

Should I use a workhorse model or a flagship for coding agents?

Workhorses for the everyday 90%: models like GPT-5.4 and Sonnet 4.6 handle routine edits, tests, and refactors at a third of flagship prices. Route the genuinely hard tasks (architecture, gnarly debugging, long autonomous runs) to a flagship. Most agent setups support per-task model switching, so use it.
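Per-task model switching can be as simple as a lookup on task type. The sketch below is one possible shape, not any particular agent's API: the model identifiers, the `Task` type, and the task categories are all hypothetical placeholders that you would map onto your own setup.

```python
from dataclasses import dataclass

# Hypothetical model identifiers; substitute whatever your agent setup expects.
WORKHORSE = "workhorse-model"
FLAGSHIP = "flagship-model"

# Task kinds treated as "genuinely hard" (an assumption based on the examples above).
HARD_KINDS = {"architecture", "debugging", "long_autonomous_run"}


@dataclass
class Task:
    kind: str          # e.g. "edit", "test", "refactor", "architecture"
    description: str


def pick_model(task: Task) -> str:
    """Route hard task kinds to the flagship and everything else to the workhorse."""
    return FLAGSHIP if task.kind in HARD_KINDS else WORKHORSE


tasks = [
    Task("refactor", "Rename a helper across the package"),
    Task("architecture", "Design the new sync protocol"),
]
for task in tasks:
    print(f"{task.kind}: {pick_model(task)}")
```

A static lookup like this is the simplest starting point; unknown task kinds fall through to the cheaper model by default, which keeps spend predictable.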

Details:
  • Type: Model comparison
  • Claude Sonnet 4.6: Model page
  • GPT-5.4: Model page
  • Updated: June 2026

© 2026 Agents Directory