The Best Coding Model You Can Actually Afford
MiniMax M2.5 scores 80.2% on SWE-bench Verified. Claude Opus 4.6 scores 80.8%. The difference is 0.6 percentage points. The price difference is 17x.
At $0.30 per million input tokens and $1.20 per million output tokens, M2.5 is the cheapest model to crack the 80% SWE-bench barrier. For builders running AI-assisted coding at any kind of scale, that math changes decisions.
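To make that math concrete, here's a quick sketch of the per-request cost using the prices quoted above. The token counts are hypothetical; real agent runs vary widely.

```python
def request_cost(input_tokens, output_tokens, in_price, out_price):
    """Cost in dollars, given per-million-token prices."""
    return (input_tokens / 1e6) * in_price + (output_tokens / 1e6) * out_price

# A hypothetical agent turn: 50K tokens of context in, 5K tokens of code out.
m25 = request_cost(50_000, 5_000, in_price=0.30, out_price=1.20)
opus = request_cost(50_000, 5_000, in_price=5.00, out_price=25.00)

print(f"M2.5:  ${m25:.4f}")        # $0.0210
print(f"Opus:  ${opus:.4f}")       # $0.3750
print(f"Ratio: {opus / m25:.1f}x") # 17.9x
```

Fractions of a cent per turn versus tens of cents: that gap is what compounds across thousands of agent turns.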
Who is MiniMax?
MiniMax is a Shanghai-based AI lab founded in 2021 by former SenseTime researchers. Their early backers included MiHoYo (the studio behind Genshin Impact), and they raised $600 million from Alibaba in 2024. They went public on the Hong Kong Stock Exchange in January 2026. Most people know them through Hailuo AI, their video generation product, but their language models have been climbing benchmarks quietly for months.
M2.5 uses a Mixture-of-Experts architecture: 229 billion total parameters, but only 10 billion activate per token. That extreme sparsity is the whole pricing story. You get frontier-class coding performance while paying for a model one-twentieth the size.
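The sparsity claim is easy to sanity-check from the figures above: per-token compute scales with active parameters, not total parameters.

```python
# Back-of-envelope check on the MoE sparsity figures quoted above.
total_params = 229e9   # total parameters
active_params = 10e9   # parameters activated per token

fraction = active_params / total_params
print(f"Active fraction: {fraction:.1%}")  # ~4.4% of weights touched per token
print(f"Dense-equivalent ratio: 1/{total_params / active_params:.0f}")  # ~1/23
```

Only about one in twenty-three parameters does work on any given token, which is why the inference bill looks like a ~10B model's.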
The numbers
Here's how M2.5 stacks up against the models builders are actually choosing between:
| Model | SWE-bench Verified | Input/1M | Output/1M | Context |
|---|---|---|---|---|
| Claude Opus 4.6 | 80.8% | $5.00 | $25.00 | 200K |
| Gemini 3.1 Pro | 80.6% | $2.00 | $12.00 | 1M |
| MiniMax M2.5 | 80.2% | $0.30 | $1.20 | 200K |
| GLM-5 (open source) | 77.8% | $1.00 | $3.20 | 200K |
M2.5 also comes in a Lightning variant (called "highspeed" in the API) that doubles throughput to roughly 100 tokens per second for $0.30/$2.40. Still cheaper than everything else on the table.
The model is open-weight under a Modified MIT license, with weights on Hugging Face. You can self-host it via vLLM or SGLang, or access it through OpenRouter, Fireworks, DeepInfra, NVIDIA NIM, and MiniMax's own API.

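Hosted access goes through OpenAI-compatible chat-completion endpoints. As a sketch, here's what building such a request against OpenRouter might look like; the endpoint URL and model slug below are assumptions, so verify them against the provider's model list before use.

```python
# Sketch of a chat-completion request against an OpenAI-compatible endpoint.
# The URL and model slug are assumptions -- check your provider's docs.
import json
import urllib.request

API_URL = "https://openrouter.ai/api/v1/chat/completions"  # assumed endpoint
MODEL = "minimax/minimax-m2.5"  # assumed slug; confirm in the model list

def build_request(prompt: str, api_key: str) -> urllib.request.Request:
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 1024,
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )

req = build_request("Write a function that reverses a linked list.", "sk-...")
print(req.get_full_url())
```

Because the shape is OpenAI-compatible, the same request works against Fireworks, DeepInfra, or a self-hosted vLLM/SGLang server by swapping the base URL and model name.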
The caveats
Benchmark numbers require context. Here's what you should know before switching.
The SWE-bench score used Claude Code as scaffolding. MiniMax's 80.2% was measured inside Claude Code's agent framework with the default system prompt overridden, averaged over four runs. On other scaffolds, scores drop a few points: 79.7% with Droid, 76.1% with OpenCode. The number is real, but it reflects M2.5 running inside a specific agent setup, not M2.5 in isolation.
Hallucination rates went up, not down. Artificial Analysis measured an 88% hallucination rate on their Omniscience benchmark, up from 67% on the predecessor M2.1. If your use case involves factual accuracy rather than code generation, that's a real problem.
General reasoning still trails frontier models. M2.5 scores 42 on the Artificial Analysis Intelligence Index, behind GLM-5 (50), Kimi K2.5 (47), and well behind Opus and Gemini. On GPQA-Diamond, which tests graduate-level reasoning, M2.5 hits 85.2% while Gemini 3.1 Pro scores 94.3%. This is a coding specialist, not a general-purpose model.
The license has a branding catch. The Modified MIT license requires you to prominently display "MiniMax M2.5" in the UI of any commercial product using the model. That's more restrictive than Apache 2.0 and could matter for white-label or embedded use cases.
The architecture hasn't changed. M2.5 is structurally identical to M2, which shipped in October 2025. Same parameter count, same expert routing, same activation pattern. All the improvement comes from post-training via reinforcement learning. That's encouraging for what RL can extract from a fixed architecture, but it means the ceiling is the same.
Where it fits
The honest recommendation: M2.5 is compelling for coding workloads where you're already running agent scaffolding and cost is a real constraint. If you're spinning up AI-assisted code generation at volume, paying $1.20 per million output tokens instead of $25.00 adds up fast. At those prices, you can afford to run more iterations, more candidates, more test-and-retry loops.
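The retry-loop argument can be made concrete with a rough sketch: output tokens dominate generation cost, so comparing output prices alone shows how many candidate generations a fixed budget buys.

```python
# How many M2.5 candidate generations fit in the budget of one Opus generation?
# Rough sketch: input-token cost is ignored; output tokens dominate generation.
OPUS_OUT = 25.00  # $/1M output tokens
M25_OUT = 1.20    # $/1M output tokens

candidates = OPUS_OUT / M25_OUT
print(f"~{candidates:.0f} M2.5 candidates per Opus-priced generation")  # ~21
```

Twenty-odd attempts per single-attempt budget is what makes best-of-N sampling and test-and-retry loops economically sane.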
For anything requiring factual reliability, long-form reasoning, or creative quality, look elsewhere. The hallucination regression alone rules it out for production use cases where accuracy matters more than cost.
We're testing it ourselves
We're adding MiniMax M2.5 to the AgentPulse benchmark in our next evaluation run. The benchmark tests 28 prompts across five tracks with three independent runs per model and a triple-evaluator grading panel. We'll see how the $0.30 model holds up on our own methodology rather than SWE-bench alone.
If the coding performance translates to broader task quality at that price point, M2.5 changes the cost calculus for a lot of builder workflows. If it doesn't, you'll have the data to know exactly where it falls short.
MiniMax M2.5 is available now via OpenRouter, MiniMax API, and Hugging Face for self-hosting.