MakerPulse
  • About
  • AgentPulse
Sign in Subscribe
Data

AI Can't Agree on What Good Writing Is

05 Mar 2026 3 min read
AI Can't Agree on What Good Writing Is

AI Can't Agree on What Good Writing Is

We use three independent AI evaluators in the AgentPulse v2.2 benchmark: Claude Opus 4.6, Gemini 3.1 Pro, and GPT-5.2. Different providers, blinded evaluation, identical rubrics. They grade 15 models across 28 prompts with 3 runs each.

On structured tasks, they mostly agree. On creative writing, they fight.

The numbers

We measure evaluator agreement using mean absolute deviation (MAD) across all three pairwise combinations. A MAD of 0.4 means evaluators are typically less than half a point apart on a 5-point scale. A MAD of 0.8 means they're almost a full point apart.

Dimension MAD Agreement
Tone appropriateness 0.398 Strong
Instruction following 0.464 Strong
Completeness 0.478 Strong
AI tell absence 0.492 Decent
Prose craft 0.638 Moderate
Character & dialogue 0.703 Weak
Imagination & risk 0.716 Weak
Accuracy 0.734 Weak
Emotional architecture 0.844 Poor

Read that table from top to bottom. The pattern is clean: the more a dimension requires understanding human experience, the less three frontier AI models can agree on how to score it.

Tone appropriateness (0.398) is basically "did the email sound professional?" They nail that. Instruction following (0.464) is "did you do what was asked?" No trouble. These are pattern-matching problems. AI is good at pattern-matching.

Emotional architecture (0.844) is "did this story make you feel something, and was that feeling earned through craft rather than declared through exposition?" Nearly a full point of disagreement. These evaluators are reading the same story and coming back with fundamentally different scores.

Not all evaluators are equal

The three pairwise disagreement rates tell their own story:

  • Claude-OpenAI: 9.2% disagreement rate (Pearson r = 0.72)
  • Claude-Gemini: 14.1% (r = 0.77)
  • Gemini-OpenAI: 22.0% (r = 0.64)

Gemini and GPT-5.2 disagree on more than one in five creative writing scores. By research standards, a Pearson correlation of 0.64 is borderline. If these were human raters on an academic study, reviewers would ask questions.

There's also a self-bias angle. Claude scores its own provider's models 0.185 points higher than the other evaluators do. Gemini gives Google models a 0.159 boost. But OpenAI is actually harsher on its own models by 0.189 points. The blinding protocol prevents brand bias (evaluators never see model names), but style preferences still leak through.

Why accuracy is surprisingly contentious

You'd expect accuracy to land near the top of the agreement table. Facts are facts. But accuracy's MAD is 0.734, worse than prose craft.

The reason: our accuracy dimension covers two different things depending on the prompt. For extraction tasks with source material, it measures factual fidelity. For open-ended prompts, it measures analytical soundness. "Did you get the facts right?" is objective. "Is your analysis of this ethical dilemma sound?" is not. Same dimension label, very different evaluation problems.

This is one of those things you only discover by measuring evaluator agreement. The dimension looked clean in the rubric. The data said otherwise.

This isn't a benchmark problem

If three frontier AI models can't converge on what makes writing emotionally resonant, that has implications far beyond our benchmark.

Every company using AI to evaluate, filter, or rank creative content has this problem. AI-powered writing assistants that score your drafts? They're scoring against standards that other AI models would disagree with 15-22% of the time. Content platforms using LLMs to assess quality? The "quality" they're measuring shifts depending on which model they chose as their judge.

The disagreement isn't random noise. It's structural. Each model has been trained on different data, with different reward signals, by different teams with different ideas about what "good writing" looks like. When the task is mechanical, those differences don't matter. When the task requires understanding what a reader might feel, the differences surface immediately.

What I take away from this

We've now published three articles using AgentPulse creative data in a row. The first showed that AI writing is detectable even when it's good. The second showed that creative quality is where models drop hardest. This one shows that AI can't even agree on how to measure creative quality.

The thread connects: AI is still far from replicating human emotion. The evaluator disagreement isn't a measurement flaw. It's evidence that these models don't understand what makes creative writing land with a reader. They can approximate structure. They can identify whether instructions were followed. But the thing that separates competent writing from writing that makes you feel something? That's not a pattern-matching problem. It's a human one.

AI is a creative accelerator. I use it constantly. But every time I look at this data, it reinforces something I keep coming back to: without a human actually reading the output and reacting to it as a human, the AI is guessing. And three different guesses diverge by a full point on a five-point scale.

Good writing still needs people. Not because AI can't help. It can. But because the part of writing that matters most is the part AI understands least.

Read next

Your AI Writes Well. Everyone Can Tell.

Your AI Writes Well. Everyone Can Tell.

Llama 4 Maverick scores 3.48 on task quality in our AgentPulse v2.2 benchmark. Respectable. Then you ask it to write something creative, and the score drops to 2.02. A 42% collapse. Same model, same evaluation, different kind of work. It's not alone. We tested 15
Michael Blickenstaff 03 Mar 2026
The Best Coding Model You Can Actually Afford

The Best Coding Model You Can Actually Afford

MiniMax M2.5 scores 80.2% on SWE-bench Verified. Claude Opus 4.6 scores 80.8%. The difference is 0.6 percentage points. The price difference is 17x. At $0.30 per million input tokens and $1.20 per million output tokens, M2.5 is the cheapest model to crack
Michael Blickenstaff 02 Mar 2026
Every AI Writes Like an AI. Some Just Hide It Better

Every AI Writes Like an AI. Some Just Hide It Better

The prose is good now. Really good. In our AgentPulse v2.2 benchmark, which tested 15 models across 28 prompts with 3 runs each, the top four models score between 4.24 and 4.36 on prose craft (out of 5). GPT-5.3-Codex and Claude Sonnet 4.6 tie at
Michael Blickenstaff 01 Mar 2026

Stay ahead of the curve

Weekly benchmark data, tool breakdowns, and honest analysis for AI practitioners. No hype, no fluff.

Topics

  • News
  • Deep Dives
  • Data
  • Tools
  • Building
  • Frontier

MakerPulse

  • About
  • AgentPulse Data

Connect

  • Bluesky
  • GitHub
MakerPulse © 2026