28 Real Tasks Reveal What AI Leaderboards Miss
4.61 versus 4.55.
That's the gap between the top two models in our first AgentPulse benchmark run: GPT-5.2 and Gemini 3.1 Pro, separated by six hundredths of a point on task quality, scored by three independent AI evaluators across 28 real-world prompts. One costs $0.74 to run the full suite. The other costs $1.61. A third model, Claude Opus 4.6, sits at 4.30 but finishes in about two-thirds the time, and at less than half the latency of the most expensive option. And a speed-tier model from xAI that nobody is talking about costs two cents for the entire run while scoring within striking distance of models costing 30-80x more.
These aren't the numbers you'll find on any company's marketing page. They're from AgentPulse, a benchmark we built specifically because no existing evaluation answers the question practitioners actually ask: which model should I use for the work I'm doing right now?
Why We Built This
Every major AI lab publishes benchmark scores. MMLU, HumanEval, MATH, GPQA. If you follow frontier model releases, you've seen the charts: a new model launches, it tops a leaderboard, the marketing team celebrates. Two weeks later another model tops the same leaderboard. The cycle repeats.
The problem is that none of those benchmarks measure the work most AI-native builders actually do. Writing emails to difficult stakeholders. Extracting structured data from messy documents. Planning a trip under hard budget constraints. Turning rambling meeting notes into clear action items. Writing a cover letter that doesn't sound like a template. These are the tasks that eat real hours, and no standardized benchmark tests them.
So we built one.
AgentPulse v2.2 runs 28 prompts across 6 tracks: everyday writing, comprehension and extraction, reasoning and planning, professional communication, constrained creativity, and open-ended creative writing, the track where model rankings shift most dramatically. Each prompt is designed to test a specific failure mode. The "Bad News Email" prompt forces models to handle shared blame between competing interests. The "Extract Structured Data" prompt plants ambiguity traps where obvious answers are wrong. The "Tokyo Trip Planning" prompt sets a hard budget that requires genuine arithmetic, not just vibes.
Three independent evaluators (Claude Opus 4.6, Gemini 3.1 Pro Preview, and GPT-5.2) score every response blind, meaning no evaluator knows which model produced the response it's scoring. Their mean pairwise inter-rater reliability is a Pearson r of 0.7055, with the strongest agreement between Claude and Gemini (r = 0.80) and the weakest between Gemini and GPT-5.2 (r = 0.61). We run self-bias detection on all three: no evaluator systematically favors its own provider's models beyond our 0.3-point threshold. The methodology and raw data are open source at github.com/Arithrix/agentpulse-data.
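The inter-rater numbers can be reproduced from the per-response scores in the repository. Here's a minimal sketch of the computation; the score vectors below are hypothetical placeholders, not real benchmark data.

```python
from itertools import combinations
from statistics import mean

def pearson(xs, ys):
    # Pearson correlation between two equal-length score vectors
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sd_x = sum((x - mx) ** 2 for x in xs) ** 0.5
    sd_y = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sd_x * sd_y)

def mean_pairwise_r(scores_by_evaluator):
    # average Pearson r over every pair of evaluators
    pairs = combinations(scores_by_evaluator.values(), 2)
    return mean(pearson(a, b) for a, b in pairs)

# Hypothetical per-response scores from three blinded evaluators
scores = {
    "claude": [4.5, 4.0, 3.5, 4.8, 3.0],
    "gemini": [4.6, 3.9, 3.6, 4.7, 3.2],
    "gpt":    [4.2, 4.1, 3.0, 4.9, 3.5],
}
```

With the real data, `mean_pairwise_r` yields the 0.7055 figure; the individual `pearson` calls give the per-pair numbers (Claude-Gemini 0.80, Gemini-GPT 0.61).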
The Rankings Tell One Story. The Gaps Tell Another.
Here's the high-level picture across all five models on task quality (scored 1-5, averaged across 21 non-creative prompts):
| Model | Task Score | Creative Score | Avg Latency | Total Cost |
|---|---|---|---|---|
| GPT-5.2 | 4.61 | 3.94 | 45.0s | $0.74 |
| Gemini 3.1 Pro | 4.55 | 3.39 | 68.3s | $1.61 |
| Claude Opus 4.6 | 4.30 | 3.95 | 30.3s | $0.82 |
| Grok 4.1 Fast | 4.00 | 2.96 | 23.9s | $0.02 |
| Mistral Large 2512 | 3.76 | 2.67 | 14.7s | $0.04 |
The first thing that jumps out: GPT-5.2 leads on task quality, but the gap to Gemini 3.1 Pro is only 0.06 points. Both models' 95% confidence intervals overlap substantially (GPT-5.2 at plus or minus 0.18, Gemini at plus or minus 0.16). On most individual tasks, you wouldn't notice the difference.
Claude Opus 4.6 sits further back at 4.30, a 0.31-point gap from GPT-5.2. That's still within overlapping confidence intervals (Claude's is plus or minus 0.19), but it's a meaningful gap. The top two have separated from the pack.
The second thing: there's a tier break below Claude. The 0.30-point gap from Claude (4.30) to Grok (4.00) is comparable to the 0.31-point spread across the entire top three. Below Grok, Mistral Large at 3.76 sits in its own tier. For work where quality is the primary constraint, the top three are your shortlist, but GPT-5.2 and Gemini are the safest bets on raw task execution.
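One more way to read the table is to fold cost in directly. Dividing task score by full-suite cost gives a rough quality-per-dollar ranking; note this is my framing for illustration, not an official AgentPulse metric, and the values are copied straight from the table above.

```python
# (task score, full-suite cost in USD) from the results table
results = {
    "GPT-5.2":            (4.61, 0.74),
    "Gemini 3.1 Pro":     (4.55, 1.61),
    "Claude Opus 4.6":    (4.30, 0.82),
    "Grok 4.1 Fast":      (4.00, 0.02),
    "Mistral Large 2512": (3.76, 0.04),
}

def quality_per_dollar(results):
    # rank models by task score divided by suite cost, best first
    return sorted(
        ((name, round(score / cost, 1)) for name, (score, cost) in results.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )
```

On this axis the table inverts: Grok 4.1 Fast lands at 200 score-points per dollar, roughly 30x the nearest premium model.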
There's a growing consensus among AI researchers that newer models are less sensitive to prompting technique than older ones. Our data adds a layer to that: the spread within a single model's scores across 28 prompts is wider than the spread between models on most individual tasks. In other words, how you prompt matters more than which top model you pick.
Where Each Model Pulls Ahead
The aggregate scores hide the interesting part. When you break results down by track, clear specializations emerge.
Reasoning and planning is where GPT-5.2 takes an unambiguous lead. It scored 4.75 on the four reasoning prompts, with Claude Opus 4.6 at 4.62 and Gemini 3.1 Pro at 4.50. But the telling detail is what happened on the hardest prompt in the set: "Decision Analysis (Hidden Variable)," which asks the model to evaluate a business decision where the most important factor isn't stated in the prompt. GPT-5.2 scored 4.82. Claude and Gemini both scored 4.47. All three caught the hidden variable, but GPT-5.2's treatment was the most thorough.
On the prioritization prompt, which requires recognizing interdependencies between tasks rather than just ranking them by urgency, GPT-5.2 scored 4.85 and Claude scored 4.62. Gemini was close at 4.52. Mistral and Grok both dropped noticeably, scoring 4.33 and 4.27 respectively.
Everyday writing tells a different story. Gemini 3.1 Pro leads at 4.63, with GPT-5.2 right behind at 4.61 and Claude at 4.33. The picture shifts prompt by prompt, though. On the "Thank-You Note (Authenticity Test)," which penalizes generic, template-sounding language, GPT-5.2 and Claude both scored 4.83. Gemini scored 4.78. Grok scored 4.67. The spread here is tight enough that personal preference matters more than statistical differences.
But there's a prompt where the models genuinely diverge. "Social Media Posts (Multi-format)" asks for content across multiple platforms with platform-specific constraints. Gemini scored 4.33. Claude scored 4.00. GPT-5.2 scored 3.94. This is a task with hard structural requirements (character limits, format conventions), and models that are more verbose by default struggle with constraints.
Creative writing is where the real surprises live. Claude Opus 4.6 leads with a creative score of 3.95, with GPT-5.2 nearly matching at 3.94. Gemini sits at 3.39. That's a meaningful gap below the top two, though narrower than you might expect given Gemini's task dominance. On the seven open-ended creative prompts (spanning literary fiction, science fiction, horror, comedy, an unreliable narrator, and micro-fiction), Gemini's scores ranged from 2.63 to 4.03. Its science fiction story scored 3.05. Its micro-fiction triptych scored 2.63. For a model that's competitive on task quality, the creative gap is notable.
I'll go deeper on the creative results in a dedicated piece. For now, the headline is clear: if your work involves creative writing, Claude Opus and GPT-5.2 are the only two models in this lineup that consistently produce strong creative output. But even the leaders have room to grow, with neither breaking 4.0 on the creative composite.
The Two-Cent Contender
Grok 4.1 Fast deserves its own section, and a note on what it is. This isn't xAI's flagship model. It's their speed-optimized tier, designed for high-throughput use cases where latency and cost matter more than peak quality. We included it specifically to test how far a fast-tier model can stretch on real tasks.
The cost numbers are almost absurd. Running all 28 prompts through Grok costs $0.02. That's not a typo. Two cents for a full benchmark suite that costs $1.61 on Gemini and $0.74 on GPT-5.2.
Its task score of 4.00 sits below the top three, but consider what that means in practice. On the "Explain HTTPS to Non-Expert" prompt, Grok scored 4.42 versus Gemini's 4.79. On "Meeting Notes to Action Items," Grok scored 4.55 versus GPT-5.2's 4.77. On the "Domain-Locked Analogy" creativity prompt, Grok scored 4.44 versus Claude's 3.83. There are specific tasks where Grok matches or beats premium models.
The gap between speed-tier and flagship models has been closing with each generation, and our data puts a number on it: Grok 4.1 Fast is roughly 11% behind the top-3 average on task quality, at 2-3% of the cost. A solo developer running 50 API calls a day for draft generation would spend about $0.04/day on Grok versus $2.88/day on Gemini. For workflows where you're running multiple iterations before a human reviews the output, the economics are hard to ignore.
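The daily-cost figures follow directly from the suite costs. A sketch of the arithmetic, assuming cost scales linearly with call count:

```python
SUITE_PROMPTS = 28  # one full AgentPulse run

def daily_cost(full_suite_cost, calls_per_day=50):
    # per-prompt cost, scaled to a day's worth of API calls
    return round(full_suite_cost / SUITE_PROMPTS * calls_per_day, 2)

grok_daily = daily_cost(0.02)    # Grok 4.1 Fast suite cost
gemini_daily = daily_cost(1.61)  # Gemini 3.1 Pro suite cost
```

That's where the roughly $0.04-versus-$2.88 per day comparison comes from, a ~70x spread at 50 calls per day.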
The catch is reliability. Grok's confidence interval (plus or minus 0.28) is significantly wider than GPT-5.2's (plus or minus 0.18) or Gemini's (plus or minus 0.16). It's more variable across prompts. On its best prompts it's competitive with premium models. On its worst, it drops to scores in the low 3s. If consistency matters as much as peak performance, the premium models justify their price.
What the Hallucination Data Shows (and Doesn't)
All five models hallucinated at least once during the benchmark run. Mistral Large had the highest rate at 21%, fabricating content on 6 of the 28 prompts. Claude Opus 4.6 was next at 18% (5 prompts), followed by Gemini 3.1 Pro and Grok 4.1 Fast each at 11% (3 prompts). GPT-5.2 had the lowest rate at 4% (1 prompt).
I want to be precise about what this means and what it doesn't. Our hallucination detection uses a three-evaluator consensus where any evaluator can flag fabricated content, combined with automated factual pre-checks. The 28-prompt sample is small enough that a single hallucination shifts the rate by 3.6 percentage points. I wouldn't draw strong conclusions about any individual model's hallucination tendency from this dataset alone.
What I can say is that the addition of a third evaluator caught hallucinations that two evaluators missed, particularly on prompts requiring real-world knowledge. Mistral's 21% rate and Claude's 18% rate are both high enough to flag for production use cases where factual accuracy is non-negotiable. GPT-5.2's single hallucination across 28 prompts is the strongest showing, but one run doesn't make it hallucination-proof.
The Tokyo Trip Planning prompt was the hallucination hot spot, consistent with what we found in our earlier v2.1 testing. Models fabricate restaurant names, cite attractions that have closed, and invent transit routes. If you're using AI for anything involving real-world facts, verifying the output isn't optional.
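The rates in this section are simple ratios over the 28-prompt suite, which is also why they're so sensitive to single flags. A quick sketch, with the flag counts copied from above:

```python
SUITE_PROMPTS = 28

def hallucination_rate(flagged_prompts):
    # percent of prompts with at least one consensus-flagged fabrication
    return round(flagged_prompts / SUITE_PROMPTS * 100)

flags = {"Mistral Large": 6, "Claude Opus 4.6": 5,
         "Gemini 3.1 Pro": 3, "Grok 4.1 Fast": 3, "GPT-5.2": 1}
rates = {model: hallucination_rate(n) for model, n in flags.items()}

# one additional flag moves any model's rate by 100/28 percentage points
one_flag_shift = round(100 / SUITE_PROMPTS, 1)
```

A single extra flag is worth 3.6 percentage points, which is why a 28-prompt sample can rank models but can't pin down anyone's true hallucination tendency.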
What This Means for Practitioners
Skip the leaderboard thinking. Here's my read on how to actually use this data.
If quality is your only constraint: GPT-5.2 leads on task quality and has the lowest hallucination rate. Gemini 3.1 Pro is close behind on tasks. Both are strong choices. Claude Opus 4.6 is further back on task execution but leads on creative output. Try all three on your specific tasks and pick the one whose output style you prefer editing.
If you need fast turnaround: Claude Opus 4.6 averages 30 seconds per prompt, about a third faster than GPT-5.2 (45s) and less than half the latency of Gemini (68s). For interactive workflows where you're iterating in real time, latency is a feature, not a spec line.
If budget matters: Grok 4.1 Fast at $0.02 per full suite is a legitimate option for draft generation, bulk processing, and any task where a human reviews the output before it ships. Don't use it for final-draft quality work without review.
If you write creatively: Claude Opus (3.95 creative score) and GPT-5.2 (3.94) are the clear top two. Gemini's creative output drops below its task performance. This is the single biggest divergence in the dataset.

If factual accuracy is non-negotiable: GPT-5.2 had the lowest hallucination rate at 4%. That doesn't make it hallucination-proof, but it's a data point worth weighing.
What's Coming Next
This is the first AgentPulse benchmark publication, not the last. We're running these monthly and publishing the data openly. The v2.2 methodology document, evaluation rubrics, and prompt specifications are all in the open-source repository.
Future analyses will cut the data differently. Track-by-track deep dives on where models genuinely diverge. Cost-efficiency frontier analysis for builders optimizing spend. Evaluator disagreement patterns that reveal which types of quality are subjective versus measurable. Hallucination analysis by task type. We'll also add models as they launch; the Chinese frontier models (Qwen, DeepSeek, Kimi) and upcoming releases from Anthropic, Google, and OpenAI will all run through the same pipeline.
We're also launching a companion study on prompt architecture: holding the model constant and varying prompt strategies to measure the quality difference. The benchmark tells you which model to pick. The prompt study will tell you how to get better output from whatever model you choose.
The full dataset is live at data.makerpulse.ai/agentpulse/v1/text-models/latest.json. Download it, slice it, challenge our methodology. That's the point.
Six hundredths of a point separate the top two models on task quality. But the real question was never "which model is best." It's "best at what, for how much, and how fast." Now there's data to answer it.