
Chinese Models Benchmarked: What We Found

26 Feb 2026 · 9 min read

The hype around Chinese AI is everywhere right now. Kimi K2.5 generated more revenue in its first 20 days than Moonshot AI earned in all of 2025. Three of the top-token-consuming models on OpenRouter are now Chinese. Anthropic just publicly accused DeepSeek, Moonshot AI, and MiniMax of running an industrial-scale distillation campaign against Claude using roughly 24,000 fake accounts. The discourse is loud and getting louder.

So we actually tested them.

Over the past week, our AgentPulse benchmark ran DeepSeek V3.2, Kimi K2.5, and Qwen3-Max through 28 prompts spanning everyday writing, comprehension, reasoning, professional communication, and creative fiction. Triple-evaluated by Claude, GPT-5.2, and Gemini. The results are more interesting than "pick the Chinese model" framing suggests. Each has a clear fit. Each has a clear limitation. None of them belongs in the same bucket.

The Distillation Allegations

Before the benchmark data, the context: Anthropic's allegations are significant and worth understanding accurately.

According to Anthropic's own published findings, the three companies collectively generated over 16 million exchanges with Claude through approximately 24,000 fraudulently created accounts. MiniMax drove the most traffic at over 13 million exchanges. DeepSeek accounted for more than 150,000 exchanges, apparently targeting foundational logic and alignment with a focus on censorship-safe alternatives. Moonshot AI had more than 3.4 million exchanges targeting agentic reasoning, coding, computer-use development, and computer vision.

Distillation, for context, is when a smaller model learns to mimic a larger one by studying its outputs at scale. It's a known technique in ML, and doing it at this scale without authorization raises obvious questions about both intellectual property and what these models actually learned from.
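As a toy illustration of the mechanics (a sketch only, not a claim about any company's pipeline), the classic distillation objective minimizes the KL divergence between a teacher's softened output distribution and a student's. Everything here, including the temperature value, is illustrative:

```python
import numpy as np

def softmax(logits, T=1.0):
    # Temperature-scaled softmax; higher T softens the distribution.
    z = np.asarray(logits, dtype=float) / T
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

def distill_loss(student_logits, teacher_logits, T=2.0):
    # KL(teacher || student) at temperature T: the student is pushed
    # to reproduce the teacher's full output distribution, not just
    # its top answer.
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return float(np.sum(p * np.log(p / q)))
```

At scale, minimizing this loss over millions of teacher responses is how a student model absorbs behaviors it was never explicitly trained on.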

I'm not going to editorialize beyond the facts here. Anthropic published their methodology at anthropic.com/news/detecting-and-preventing-distillation-attacks. The Register, TechCrunch, and Bloomberg all covered it. The allegations are specific, sourced, and detailed. What they mean for long-term model differentiation is a genuine open question, and one worth sitting with while reading the data below.

The Adoption Numbers

Separate from the allegations, the adoption signals are real.

OpenRouter data published around February 24 shows Chinese models now account for 61% of total token consumption on the platform. MiniMax M2.5 sits at the top with 2.45 trillion tokens in weekly usage. Kimi K2.5 is second with 1.21 trillion tokens. Zhipu AI's GLM-5 is third with 780 billion. DeepSeek V3.2 rounds out the Chinese contingent in fifth position.

Meanwhile, Moonshot AI secured over $700 million in new funding, is reportedly targeting a $12 billion valuation, and has become the fastest Chinese startup ever to reach decacorn status, according to reporting from TechNode and South China Morning Post. The 20-day revenue milestone is real: Kimi's K2.5 generated more in its first 20 days of global availability than the company did in all of 2025, driven by international API usage.

Whether adoption is organic or partially a product of the promotional strategies reported on OpenRouter (Kilo Code offered MiniMax M2.5 free for a week starting February 12), these are the numbers practitioners are seeing on pricing dashboards. They matter.

What We Actually Found

Here's our data from the AgentPulse v2.2 benchmark run completed February 26, 2026. Scores are on a 1-5 scale.

DeepSeek V3.2: The Value Case

  • Task score: 4.0
  • Creative score: 3.27
  • Hallucination rate: 11% (tied for second-lowest in our dataset; GPT-5.2 leads at 4%)
  • Cost per test run: $0.015
  • Median latency: 48.5s

DeepSeek V3.2 is the cheapest model we've benchmarked by a significant margin. At $0.0146 per 28-prompt run, it costs roughly one-twentieth of Kimi K2.5 and one-tenth of Qwen3-Max. If you're running high volumes of structured tasks, that gap compounds fast.
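To make the compounding concrete, here's a back-of-envelope sketch using the per-run prices from our tables; the volume figures are hypothetical:

```python
# Measured cost per 28-prompt benchmark run (via OpenRouter).
COST_PER_RUN = {
    "DeepSeek V3.2": 0.0146,
    "Kimi K2.5":     0.33,
    "Qwen3-Max":     0.15,
}

def monthly_cost(model: str, runs_per_day: int, days: int = 30) -> float:
    # Linear extrapolation: per-run price times total runs.
    return round(COST_PER_RUN[model] * runs_per_day * days, 2)

# At 1,000 runs/day, the absolute gap is what compounds:
for model in COST_PER_RUN:
    print(model, monthly_cost(model, 1000))
```

At that hypothetical volume, DeepSeek's monthly bill stays in the hundreds of dollars while Kimi's climbs near ten thousand.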

Its track scores tell a clean story: everyday writing (4.26), comprehension (4.05), reasoning (4.05), professional (3.73), creativity (3.94 on constrained tasks). These are solid numbers across the board, with creative writing as the softer spot.

The 11% hallucination rate is notable. That puts DeepSeek in the group tied for second-lowest in our dataset, behind only GPT-5.2 at 4%. For a model at this price point, the factual reliability is genuinely impressive.

The creative writing tells a more complicated story. On prompt 22, an open-ended literary fiction task, DeepSeek generated a piece called "The Last Frame" set in a projection booth in St. Louis on the night of the Apollo 11 moon landing. The story follows Leo, a 67-year-old projectionist facing the closure of his theater, and Jimmy, his summer hire. Here's the final exchange:

"Leo flipped a switch. The film motor died. But he left the lamp on. The booth was plunged into a sudden, startling silence... The frozen image on the Granada's screen bubbled, darkened, then burst into a small, black hole of nothingness... Leo's grip on Jimmy's wrist tightened, not in restraint, but as if he were trying to anchor himself."

The final line: "Which one," he asked the hot, still air, "is the ghost now?"

That's good writing. Meditative, specific, earned. The overall creative score of 3.27 reflects weaker performance on other creative prompts, not a ceiling. DeepSeek can write; it just doesn't do it consistently.

Best fit: High-volume document processing, summarization, reasoning chains, budget-constrained API applications.

Clear limitation: Creative work is inconsistent. The 48.5s latency is manageable for batch tasks but noticeable for interactive use.

Kimi K2.5: The Performance Case

  • Task score: 4.44
  • Creative score: 3.73
  • Hallucination rate: 21%
  • Cost per test run: $0.33
  • Median latency: 138.8s

Kimi K2.5 posted the highest task score of the three Chinese models we tested. Its everyday writing score of 4.73 is exceptional. Comprehension comes in at 4.60, constrained creativity at 4.45. These numbers are competitive with Anthropic's Claude Opus 4.6 in several tracks.

But the latency is a hard constraint: 138.8 seconds median across 28 prompts. That's the slowest model in our entire benchmark. We tested via OpenRouter, and latency can vary significantly by provider and infrastructure configuration; your experience via the native Kimi API may differ. Still, for any workflow with a human waiting for a response, 139 seconds is outside acceptable bounds.

The 21% hallucination rate deserves attention, particularly for professional tasks where factual errors have real consequences.

The creative fiction holds up better. On prompt 26, the unreliable narrator task, Kimi produced a story set in Bar Harbor, Maine, in November. A narrator cares for her bedridden sister "Maggie" with increasing obsessiveness. The horror is quiet:

"I took the preserves upstairs on a tray, along with the chamomile tea I brew every afternoon... The room was dim, the curtains drawn tight against the harbor view... I set the jar on the dresser and sat on the edge of the bed. 'Mrs. Gable brought you a gift,' I said."

The final line: "We never let go."

The reader pieces it together before the narrator acknowledges it: Maggie is dead, preserved, and the narrator has no visible path back to reality. Subtle, effective, disturbing. Score on this prompt: 3.82.

Best fit: Offline batch tasks, content generation pipelines where quality matters more than speed, agentic workflows with async task execution.

Clear limitation: Do not use for interactive applications. The 139s latency will be a hard user experience failure for any synchronous interface.
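One way to live with that latency in a batch pipeline is to fan prompts out concurrently, so wall-clock time approaches the latency of one call rather than the sum of all of them. A minimal asyncio sketch, where `call_model` is a hypothetical stand-in for a real API client (here it just sleeps):

```python
import asyncio

async def call_model(prompt: str) -> str:
    # Placeholder for a real API call (e.g. via an OpenRouter client);
    # we simulate a slow response with a short sleep.
    await asyncio.sleep(0.1)  # imagine ~139 s in production
    return f"response to: {prompt}"

async def run_batch(prompts):
    # Fan out all prompts concurrently; gather preserves input order.
    return await asyncio.gather(*(call_model(p) for p in prompts))

# A 28-prompt batch completes in roughly one call's latency.
results = asyncio.run(run_batch([f"prompt {i}" for i in range(28)]))
```

The same shape works with any async HTTP client; the point is that a 139-second model is tolerable only when no single caller is blocked on it.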

Qwen3-Max: The Specialist Case

  • Task score: 4.07
  • Creative score: 2.87
  • Hallucination rate: 21%
  • Cost per test run: $0.15
  • Median latency: 28.2s

Qwen3-Max has the largest task-to-creative score gap in our dataset: 1.20 points. That's not an accident. Alibaba has explicitly positioned Qwen3-Max as a business and technical model. Our data confirms that design choice. It's fast (28.2s median latency), technically capable, and reasonably priced at $0.15 per run.

Its professional track (4.0) and comprehension track (4.24) are strong. Reasoning comes in at 4.19. For structured business tasks, this is a well-rounded model at a competitive price.

The 2.87 creative score initially looks like a weakness. But Qwen's prompt 26 response complicates that interpretation. Its unreliable narrator story, set in Portland, follows Martin, a neighbor convinced he's "looking out for" Eleanor, the young woman in the apartment above his. The reader realizes well before Martin does that he's describing stalking behavior:

"She didn't answer, but I knew she was grateful. She's always been too proud to ask for help, so I've learned to anticipate her needs... I pressed my ear to the door. Heard muffled sobs. My heart clenched. She was clearly in distress... I tried the knob. Unlocked."

The story ends with Martin watching from his window, convinced Eleanor will return:

"And Portland's rain, as I told her that first day, never truly stops. It just waits. Patiently. Like me."

That's technically accomplished. The unreliable narrator device is handled correctly, the reveal is gradual, and the voice is convincingly self-deceived. But our creative rubric penalizes for following genre conventions too closely, lacking imaginative risk. Qwen writes competently; it doesn't often surprise.

I'll say plainly: the 2.87 creative score likely reflects a Western literary aesthetics bias in our rubric as much as it reflects actual output quality. The prompt rewards unconventional structural choices, genre subversion, and stylistic risk. Qwen produces polished, conventional fiction. Those aren't the same thing as poor fiction.

Best fit: Technical documentation, professional communication, reasoning-heavy workflows, anything requiring fast response times with solid comprehension.

Clear limitation: If creative originality matters, pick something else. The 21% hallucination rate also warrants attention for high-stakes factual tasks.

The Distillation Question, Revisited

Here's the tension that's genuinely hard to resolve: if Moonshot AI sent Claude 3.4 million queries specifically targeting agentic reasoning and tool use, and Kimi K2.5 now scores exceptionally on agentic tasks, what does that mean?

It doesn't mean Kimi is "Claude in disguise." Distillation doesn't produce a copy; it produces a model with different architecture that has learned behaviors from the teacher's outputs. The resulting model can differ substantially. But it does mean that some of what you're buying with a Kimi K2.5 API call may be, at some level of abstraction, learned from Claude.

For builders, this is more of a philosophical concern than a practical one. The model either handles your task well or it doesn't. But for Anthropic and for the broader question of model differentiation over time, the distillation dynamic is worth watching. If all frontier models can learn each other's capabilities via large-scale querying, the moats get shallower, faster.

How to Pick

Three models, three different decisions:

Use DeepSeek V3.2 if: You're running high-volume structured tasks and cost matters. Its 11% hallucination rate is among the benchmark's best (only GPT-5.2 at 4% does better), latency is acceptable, and at $0.015 per run you can afford to run it often.

Use Kimi K2.5 if: You need top-tier task performance and your workflow is async. Don't put it anywhere a human is waiting for a response. Its 139s latency will frustrate users, but for batch pipelines it's a legitimate choice.

Use Qwen3-Max if: You need fast, professional-grade responses at a price that works. 28.2s latency, solid reasoning, and $0.15 per run put it in a reasonable sweet spot for most business applications.
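The three recommendations above can be collapsed into a toy routing rule. The function and its thresholds are ours, purely for illustration, not part of any API:

```python
def pick_model(interactive: bool, creative: bool, budget_per_run: float) -> str:
    """Illustrative routing rule distilled from the 'How to Pick' section.
    Thresholds are dollar cost per 28-prompt run from our benchmark."""
    if interactive:
        # Kimi K2.5's ~139 s median latency rules it out for anything
        # synchronous; choose by budget between the two fast options.
        return "Qwen3-Max" if budget_per_run >= 0.15 else "DeepSeek V3.2"
    if creative:
        # Kimi posted the strongest creative score (3.73) of the three.
        return "Kimi K2.5"
    # Async, non-creative work: cost decides.
    return "DeepSeek V3.2" if budget_per_run < 0.15 else "Qwen3-Max"
```

In practice you'd route per task, not per project; the point is that the decision keys on latency tolerance, creative demands, and budget, not on country of origin.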

None of these models belongs on a "Chinese AI" tier list. They belong on a use-case-specific evaluation list, the same one you'd use for any model. Chinese models are no longer a novelty you evaluate out of curiosity. They're production options that deserve the same rigorous task-specific assessment as anything from Anthropic, OpenAI, or Google. The fact that some of their training may have run through Claude's outputs doesn't change that calculation. It just makes the question of long-term model differentiation more interesting to track.


Frequently Asked Questions

How were these models tested?

We used AgentPulse benchmark v2.2, running 28 prompts across five tracks: everyday writing, comprehension/extraction, reasoning/planning, professional communication, and creative writing. Each response was evaluated by a panel of three AI evaluators (Claude, GPT-5.2, and Gemini) using a consistent rubric. Scores are on a 1-5 scale. The full methodology is available at data.makerpulse.ai/agentpulse.
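For a feel of how a three-evaluator panel might collapse into one number, here is a hypothetical aggregation sketch; the evaluator names and averaging scheme are assumptions, and the published methodology at data.makerpulse.ai/agentpulse is authoritative:

```python
from statistics import mean

def aggregate_scores(panel_scores: dict) -> float:
    """Average each evaluator's per-prompt scores (1-5 scale),
    then average across evaluators into one track score."""
    per_evaluator = [mean(scores) for scores in panel_scores.values()]
    return round(mean(per_evaluator), 2)

score = aggregate_scores({
    "claude":  [4.0, 4.5, 4.0],
    "gpt-5.2": [4.5, 4.0, 4.5],
    "gemini":  [4.0, 4.0, 4.5],
})
```

Averaging per evaluator first keeps one verbose judge from dominating a track with more prompts.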

Does the distillation allegation mean these models are unreliable?

Not directly. Distillation affects how a model was trained, not necessarily whether its outputs are accurate or capable. Our benchmark tested outputs independently of training history. The practical implication is more about model differentiation over time than immediate output quality. That said, we flag hallucination rates in every benchmark, and both Kimi K2.5 and Qwen3-Max came in at 21%.

Why is Kimi K2.5 so slow if it scores so high?

Latency and task quality are measured separately. Kimi K2.5 is a large mixture-of-experts model that requires significant compute per inference. At 138.8 seconds median latency in our benchmark (tested via OpenRouter), it's the slowest model in our dataset. Native Kimi API access may be faster; we don't have comparison data yet. The high task scores hold regardless of the delivery speed.

Is the 2.87 creative score for Qwen3-Max actually fair?

Partially. Our creative rubric rewards imaginative risk, unconventional structure, and genre subversion. Qwen3-Max produces polished, conventional fiction that can be genuinely effective (as the Portland unreliable narrator story demonstrates) without scoring highly on originality criteria. If your creative use case values consistent quality over stylistic surprise, the score undersells what Qwen actually delivers.

Are these models available via standard APIs?

Yes. All three are accessible via OpenRouter. DeepSeek V3.2 and Qwen3-Max are also available through their respective native APIs (DeepSeek Platform, Alibaba Cloud Model Studio). Kimi K2.5 is available through the Kimi API. Pricing varies by provider and may differ from our benchmark costs, which were measured via OpenRouter at point of collection.
