Flash Matches Sonnet on Tasks. Costs 8x Less.
One-hundredth of a point. That's the task quality gap between Gemini 3 Flash Preview and Claude Sonnet 4.6 in AgentPulse's latest benchmark. Flash scores 4.33 out of 5.00 on structured task prompts; Sonnet scores 4.34. The cost gap is not subtle: $0.063 per run versus $0.494. Flash costs 8x less and responds 4x faster.
That combination warrants serious attention for anyone running AI at volume.
What the benchmark actually measured
AgentPulse v2.2 ran 10 models through 28 prompts across five tracks: everyday writing, comprehension, reasoning, professional tasks, and constrained creativity. The structured task tracks (the first four) are where Flash competes directly with Sonnet. The prompts aren't toy examples. They include multi-constraint emails with specific forbidden phrases, document comprehension under adversarial framing, multi-step logic problems, and professional scenarios with competing obligations.
Flash's track breakdown on task prompts: 4.33 everyday writing, 4.55 comprehension, 4.28 reasoning, 4.03 professional. Sonnet's equivalent: 4.41, 4.34, 4.55, 4.27. Flash actually beats Sonnet on comprehension by 0.21 points. Sonnet edges it on reasoning by 0.27. On professional tasks, Sonnet leads by 0.24 points (4.27 vs 4.03).
Latency: Flash averages 6.6 seconds per response; Sonnet averages 27.4. For synchronous workflows where users are waiting, that 4x difference is not cosmetic.
Why you shouldn't trust Google's own benchmark numbers for Flash
Google publishes benchmark results for Gemini models. Those numbers show Flash performing strongly. They're probably right directionally. But AgentPulse's methodology includes a detail that changes how you should read those figures: self-bias detection.
Our benchmark uses three independent evaluators: Claude, Gemini, and OpenAI. The evaluator-agreement data shows Gemini scoring its own models an average of +0.50 higher than the other evaluators do. That figure rests on just 6 Google-model data points in this run, so treat the magnitude cautiously, but the direction is consistent: Gemini's mean on its own models is 4.75; the other evaluators' mean is 4.25.
Flash's AgentPulse score of 4.33 comes from a methodology that detects and surfaces this bias. It doesn't correct by removing Gemini's votes entirely, but the number you see reflects a multi-evaluator consensus, not a single judge grading its own work. When Google publishes Flash benchmark results, Gemini is doing the evaluation. Our data is more trustworthy for exactly that reason.
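The self-bias check can be sketched in a few lines. The row schema (evaluator, model family, score) is hypothetical, not AgentPulse's actual data format; the toy data just mirrors the 4.75-versus-4.25 means reported above.

```python
# Sketch of the self-bias check, under an assumed (evaluator,
# model_family, score) row schema.
from statistics import mean

def self_bias(rows, family):
    """How much higher the `family` evaluator scores its own models
    than the other evaluators score those same models."""
    own    = [s for ev, mf, s in rows if mf == family and ev == family]
    others = [s for ev, mf, s in rows if mf == family and ev != family]
    return mean(own) - mean(others)

# Toy data mirroring the numbers above: Gemini gives its own models
# 4.75 on average; the other evaluators give them 4.25.
rows = [
    ("gemini", "gemini", 4.75),
    ("claude", "gemini", 4.25),
    ("openai", "gemini", 4.25),
]
print(self_bias(rows, "gemini"))  # 0.5
```

The same function, run per evaluator family, is what lets a benchmark surface the bias rather than silently average it away.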
The creative gap is real, and it matters for some workflows
Here's where the routing decision gets concrete. Flash's creative score is 3.39. Sonnet's is 4.04. That's a 0.65-point gap across 7 creative prompts: psychological horror, unreliable narrator fiction, tonal comedy, micro-fiction under tight constraints, and others.
This isn't a rounding error. Flash's creative score sits 0.94 points below its own task score. Sonnet's creative score sits only 0.30 points below its task score. That gap-within-gap tells you something: Sonnet handles the mode-shift from structured work to open-ended creative work far more smoothly. Flash doesn't.
What this means practically: if you're building a pipeline that generates marketing copy, product descriptions with brand voice, narrative elements in user-facing products, or anything requiring original register, Sonnet is worth the premium. The $0.431 difference per run adds up quickly at scale, but so does the cost of flat, merely-acceptable output that no reader would choose.
For everything else, the case for Flash is strong.
Where Flash wins without reservation
High-volume extraction, classification, structured summarization, and document comprehension are Flash's domain. These are tasks where precision against instruction matters more than expressive range. Flash's comprehension track score of 4.55 is the third-highest of any model in the benchmark on that track.
Consider the use cases: pulling structured data from contracts, classifying support tickets, summarizing meeting transcripts to a template, generating consistent metadata at scale. In these scenarios, the 0.01 task score advantage Sonnet holds is statistical noise. The 4x latency difference and 8x cost difference are not.
At 218 output tokens per second, Flash is also the fastest model Google has shipped at this tier. That matters for any synchronous user experience where response time is part of the product.
How to route
My read on the data: Flash for the inbox, Sonnet for the page. Emails, tickets, structured reports, data tasks, form completions: use Flash. Anything a reader would judge on voice, originality, or emotional quality: use Sonnet. The $0.063 vs. $0.494 cost point makes this routing economically significant if you're running thousands of calls per day.
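The inbox-versus-page rule reduces to a few lines of code. The model labels and task categories below are illustrative, not tied to any real API; the per-run costs are the benchmark's.

```python
# A minimal "Flash for the inbox, Sonnet for the page" router using the
# benchmark's per-run costs. Model labels and task categories are
# illustrative placeholders.
COST_PER_RUN = {"gemini-3-flash": 0.063, "claude-sonnet-4.6": 0.494}

CREATIVE = {"marketing_copy", "brand_voice", "narrative", "fiction"}

def route_model(task_type: str) -> str:
    # Creative work pays the Sonnet premium; everything else takes the
    # 8x-cheaper Flash path, where task quality is a statistical tie.
    return "claude-sonnet-4.6" if task_type in CREATIVE else "gemini-3-flash"

def daily_cost(task_counts: dict) -> float:
    return sum(n * COST_PER_RUN[route_model(t)] for t, n in task_counts.items())

# 5,000 tickets plus 200 pieces of brand copy per day:
routed = daily_cost({"ticket": 5000, "brand_voice": 200})  # ~$413.80
all_sonnet = 5200 * COST_PER_RUN["claude-sonnet-4.6"]      # ~$2,568.80
```

At that hypothetical volume, routing cuts the daily bill by roughly 6x while sending every voice-sensitive task to the stronger creative model.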
One caveat worth stating plainly: this benchmark covers non-coding text tasks only. Code generation, tool use, and agentic tasks aren't in scope. Don't route those decisions based on this data.
The number to remember: 0.01 points of quality difference, 8x cost difference. If you're not routing on that, you're overpaying.
Data from AgentPulse v2.2 benchmark, 28 prompts, 10 models, triple-evaluator methodology. Full dataset: data.makerpulse.ai/agentpulse/v1/text-models/latest.json.
Frequently Asked Questions
Is the 0.01 task score difference statistically meaningful?
No. The confidence intervals for both models overlap: both task-score CIs are ±0.21. A 0.01-point gap inside those intervals is effectively a tie on structured tasks. The point isn't that Flash is better than Sonnet on tasks; it's that they're indistinguishable at a fraction of the cost.
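The overlap claim is simple interval arithmetic. One assumption below: the ±0.21 half-width is reused for the creative scores, which the benchmark writeup doesn't state; it's shown only to contrast a gap that clears the combined width with one that doesn't.

```python
# Two score intervals overlap when the gap between their means is no
# larger than the sum of their half-widths. Illustrative arithmetic,
# not the benchmark's own statistical test.
def cis_overlap(mean_a, half_a, mean_b, half_b) -> bool:
    return abs(mean_a - mean_b) <= half_a + half_b

task_tie = cis_overlap(4.33, 0.21, 4.34, 0.21)
# True: a 0.01 gap sits well inside the 0.42 combined width.

creative_gap = cis_overlap(3.39, 0.21, 4.04, 0.21)
# False (assuming the same +/-0.21 half-widths): 0.65 exceeds 0.42.
```

By this rough test, the task difference vanishes while the creative difference survives, which is exactly the routing story above.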
What makes Flash's comprehension score so high?
Flash scored 4.55 on comprehension prompts, the third-highest in the benchmark on that track. The prompts included multi-document synthesis, adversarial framing designed to trip up models that skim rather than read, and comprehension questions where the wrong answer is superficially plausible. Flash's architecture may be optimized for this kind of retrieval-and-synthesis work, but we have no visibility into why the score is high; we can only report that it consistently is.
Does Flash's lower creative score mean it produces bad creative output?
A 3.39 out of 5.00 isn't bad. It means Flash produces competent, structurally complete creative responses that score below average on originality, voice distinctiveness, and emotional risk-taking. The evaluation rubric for creative prompts penalizes AI-typical patterns (hedging, predictable structure, safe word choices). Flash scores lower because it hits those patterns more often, not because it fails at the task. For most production uses, "competent and complete" is fine. For anything where the writing quality is the product, the gap matters.
How does Flash compare to the rest of the 10-model benchmark field?
Flash at 4.33 on tasks ranks fifth in the 10-model benchmark, behind GPT-5.2 (4.61), Gemini 3.1 Pro (4.55), Kimi K2.5 (4.44), and Sonnet (4.34). DeepSeek V3.2 and Grok 4.1 Fast both score 4.00 on tasks at lower costs ($0.015 and $0.020 respectively), but the quality gap widens noticeably there. Flash sits at a sweet spot: near-Sonnet task quality, sub-$0.10 cost per run, sub-10-second latency.
Should I run my own benchmark before switching to Flash?
Yes, if your use case is at all specific. AgentPulse measures general-purpose non-coding text quality across varied prompt types. If your workload is narrow, run 50-100 representative prompts through both models, evaluate the outputs against your actual quality bar, and let the results decide. The benchmark gives you a prior; your data gives you the answer.
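A head-to-head of that kind needs very little scaffolding. In the sketch below, `call_model` and `score` are placeholders: plug in whatever client library you use and a scoring function that encodes your actual quality bar.

```python
# Skeleton for a 50-100 prompt head-to-head. `call_model` and `score`
# are user-supplied placeholders, not any real API.
def head_to_head(prompts, call_model, score, models=("flash", "sonnet")):
    """Run every prompt through every model; count which model's
    output scores highest on each prompt."""
    wins = {m: 0 for m in models}
    for p in prompts:
        scored = {m: score(p, call_model(m, p)) for m in models}
        wins[max(scored, key=scored.get)] += 1
    return wins
```

Feed it representative prompts from your own workload and let the win counts, not the general-purpose benchmark, make the routing call.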