MakerPulse
Data

The Best Coding Model You Can Actually Afford

02 Mar 2026

MiniMax M2.5 scores 80.2% on SWE-bench Verified. Claude Opus 4.6 scores 80.8%. The difference is 0.6 percentage points. The price difference is roughly 17x on input and 21x on output.

At $0.30 per million input tokens and $1.20 per million output tokens, M2.5 is the cheapest model to crack the 80% SWE-bench barrier. For builders running AI-assisted coding at any kind of scale, that math changes decisions.
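The per-workload arithmetic is worth spelling out. A quick sketch using the prices from the comparison table below; the monthly token volumes are purely illustrative:

```python
def run_cost(input_tokens: int, output_tokens: int,
             in_price_per_m: float, out_price_per_m: float) -> float:
    """Dollar cost of one workload at per-million-token prices."""
    return (input_tokens / 1e6) * in_price_per_m + (output_tokens / 1e6) * out_price_per_m

# Illustrative workload: 50M input tokens, 10M output tokens per month.
m25 = run_cost(50_000_000, 10_000_000, 0.30, 1.20)    # MiniMax M2.5
opus = run_cost(50_000_000, 10_000_000, 5.00, 25.00)  # Claude Opus 4.6

print(f"M2.5: ${m25:,.2f}")   # $27.00
print(f"Opus: ${opus:,.2f}")  # $500.00
```

At agent-loop volumes, the gap compounds: every retry, every candidate generation, every re-run of a failing patch gets billed at those rates.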

Who is MiniMax?

MiniMax is a Shanghai-based AI lab founded in 2021 by former SenseTime researchers. Their early backers included MiHoYo (the studio behind Genshin Impact), and they raised $600 million from Alibaba in 2024. They went public on the Hong Kong Stock Exchange in January 2026. Most people know them through Hailuo AI, their video generation product, but their language models have been climbing benchmarks quietly for months.

M2.5 uses a Mixture-of-Experts architecture: 229 billion total parameters, but only 10 billion activate per token. That extreme sparsity is the whole pricing story. You get frontier-class coding performance while paying for a model one-twentieth the size.
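A rough sketch of what sparse expert routing means for that ratio. The router logits and k=2 below are illustrative, not MiniMax's documented configuration; only the parameter counts come from the model card:

```python
def topk_experts(router_logits: list[float], k: int = 2) -> list[int]:
    """Pick the k highest-scoring experts for one token (top-k gating)."""
    return sorted(range(len(router_logits)),
                  key=lambda i: router_logits[i], reverse=True)[:k]

# Per token, only the chosen experts' weights do any work; the rest idle.
total_params = 229e9
active_params = 10e9
print(f"active fraction: {active_params / total_params:.1%}")  # 4.4%

# Hypothetical router scores for one token across 8 experts:
logits = [0.1, 2.3, -0.5, 1.7, 0.0, 0.9, -1.2, 0.4]
print(topk_experts(logits, k=2))  # [1, 3]
```

Compute (and therefore serving cost) scales with the active parameters, which is why a 229B-parameter model can be priced like a 10B one.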

The numbers

Here's how M2.5 stacks up against the models builders are actually choosing between:

| Model | SWE-bench Verified | Input /1M | Output /1M | Context |
|---|---|---|---|---|
| Claude Opus 4.6 | 80.8% | $5.00 | $25.00 | 200K |
| Gemini 3.1 Pro | 80.6% | $2.00 | $12.00 | 1M |
| MiniMax M2.5 | 80.2% | $0.30 | $1.20 | 200K |
| GLM-5 (open source) | 77.8% | $1.00 | $3.20 | 200K |

M2.5 also comes in a Lightning variant (called "highspeed" in the API) that doubles throughput to roughly 100 tokens per second for $0.30/$2.40. Still cheaper than everything else on the table.

The model is open-weight under a Modified MIT license, with weights on Hugging Face. You can self-host it via vLLM or SGLang, or access it through OpenRouter, Fireworks, DeepInfra, NVIDIA NIM, and MiniMax's own API.
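Since OpenRouter exposes an OpenAI-compatible chat-completions endpoint, calling M2.5 there looks like any OpenAI-style request. A minimal sketch; the model slug `minimax/minimax-m2.5` is a guess, so check OpenRouter's model list before relying on it:

```python
import json

def build_chat_request(prompt: str, model: str = "minimax/minimax-m2.5") -> dict:
    """OpenAI-style chat-completions payload, as OpenRouter accepts."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 1024,
    }

payload = build_chat_request("Write a unit test for an LRU cache.")
print(json.dumps(payload, indent=2))
# POST this to https://openrouter.ai/api/v1/chat/completions with an
# Authorization: Bearer <key> header, via requests or the openai SDK.
```

The same payload works against MiniMax's own API or a self-hosted vLLM/SGLang server, since all of them speak the OpenAI chat-completions dialect; only the base URL and model name change.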

The caveats

Benchmark numbers require context. Here's what you should know before switching.

The SWE-bench score used Claude Code as scaffolding. MiniMax's 80.2% was measured using Claude Code's agent framework with the default system prompt overridden, averaged over four runs. On other scaffoldings like Droid (79.7%) and OpenCode (76.1%), scores drop a few points. The number is real, but it reflects M2.5 running inside a specific agent setup, not M2.5 in isolation.

Hallucination rates went up, not down. Artificial Analysis measured an 88% hallucination rate on their Omniscience benchmark, up from 67% on the predecessor M2.1. If your use case involves factual accuracy rather than code generation, that's a real problem.

General reasoning still trails frontier models. M2.5 scores 42 on the Artificial Analysis Intelligence Index, behind GLM-5 (50), Kimi K2.5 (47), and well behind Opus and Gemini. On GPQA-Diamond, which tests graduate-level reasoning, M2.5 hits 85.2% while Gemini 3.1 Pro scores 94.3%. This is a coding specialist, not a general-purpose model.

The license has a branding catch. The Modified MIT license requires you to prominently display "MiniMax M2.5" in the UI of any commercial product using the model. That's more restrictive than Apache 2.0 and could matter for white-label or embedded use cases.

The architecture hasn't changed. M2.5 is structurally identical to M2, which shipped in October 2025. Same parameter count, same expert routing, same activation pattern. All the improvement comes from post-training via reinforcement learning. That's encouraging for what RL can extract from a fixed architecture, but it means the ceiling is the same.

Where it fits

The honest recommendation: M2.5 is compelling for coding workloads where you're already running agent scaffolding and cost is a real constraint. If you're spinning up AI-assisted code generation at volume, paying $1.20 per million output tokens instead of $25.00 adds up fast. At those prices, you can afford to run more iterations, more candidates, more test-and-retry loops.
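Cheap output tokens change loop structure: instead of one careful attempt, you can sample candidates until one passes your test suite. A minimal sketch of that pattern, with a stubbed generator (`generate` and `passes_tests` are placeholders for a real model call and test runner):

```python
from typing import Callable, Optional

def first_passing(generate: Callable[[int], str],
                  passes_tests: Callable[[str], bool],
                  max_attempts: int = 5) -> Optional[str]:
    """Sample candidate patches until one passes the tests, up to a budget."""
    for attempt in range(max_attempts):
        candidate = generate(attempt)
        if passes_tests(candidate):
            return candidate
    return None  # budget exhausted

# Stub: the third attempt is the first "correct" patch.
result = first_passing(
    generate=lambda i: f"patch-{i}",
    passes_tests=lambda p: p == "patch-2",
)
print(result)  # patch-2
```

At $25.00 per million output tokens, five attempts per task is an expensive habit; at $1.20, it's a rounding error, which is what makes this loop shape economical.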

For anything requiring factual reliability, long-form reasoning, or creative quality, look elsewhere. The hallucination regression alone rules it out for production use cases where accuracy matters more than cost.

We're testing it ourselves

We're adding MiniMax M2.5 to the AgentPulse benchmark in our next evaluation run. The benchmark tests 28 prompts across five tracks with three independent runs per model and a triple-evaluator grading panel. We'll see how the $0.30 model holds up on our own methodology rather than SWE-bench alone.

If the coding performance translates to broader task quality at that price point, M2.5 changes the cost calculus for a lot of builder workflows. If it doesn't, you'll have the data to know exactly where it falls short.

MiniMax M2.5 is available now via OpenRouter, MiniMax API, and Hugging Face for self-hosting.
