Gemini 3.1 Pro Just Shipped. Same Price, More Than Double the Reasoning.

Gemini 3.1 Pro scored 77.1% on ARC-AGI-2. The previous version, Gemini 3 Pro, scored 31.1%. That's not an incremental update. It's a 2.5x jump on the hardest public reasoning benchmark available, and Google didn't raise the price by a single cent.

Released on February 19, 2026, Gemini 3.1 Pro is available in preview through the Gemini API, Google AI Studio, Vertex AI, Android Studio, NotebookLM, and the Gemini app (for Pro and Ultra subscribers). The pricing stays at $2.00 per million input tokens and $12.00 per million output tokens, identical to Gemini 3 Pro. The 1M token context window and 64K token output limit carry over. If you're already paying for Gemini 3 Pro API calls, this is a free upgrade.

The real question isn't the benchmark number. It's whether that number translates into better results on the work you actually do.

Where 2.5x Reasoning Shows Up

ARC-AGI-2 tests a model's ability to solve novel abstract reasoning patterns it hasn't seen during training. A jump from 31.1% to 77.1% means the model is dramatically better at figuring out unfamiliar problems from limited examples. But "abstract reasoning" is vague. Here are three practitioner use cases where this kind of improvement makes a concrete difference.

1. Synthesizing contradictory information across long documents

If you've ever fed a model a stack of analyst reports, regulatory filings, or research papers that disagree with each other, you know the failure mode: the model picks a side, or it produces bland "on one hand, on the other hand" summaries that add nothing. The underlying problem is that holding contradictory claims in working memory while evaluating their relative credibility requires sustained reasoning, not pattern matching.

Gemini 3.1 Pro's three-tier thinking system (low, medium, high) lets you dial reasoning effort to match the task. For a quick summary, low mode keeps things fast. For a 200-page due diligence package with conflicting financial projections, high mode activates what Google describes as a "mini version of Gemini Deep Think" (the experimental reasoning system that produced these benchmark gains in the first place).

The 1M token context window means you can load the full document set in a single call instead of chunking and losing cross-document context. That combination (deep reasoning plus full context) is where the improvement should be most noticeable.
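That routing decision (which thinking tier, and does the document set fit in one call) can be made programmatically before you ever hit the API. A minimal sketch: the 4-characters-per-token heuristic, the function names, and the task labels are illustrative assumptions, not part of any SDK.

```python
# Rough heuristic: ~4 characters per token for English prose.
CHARS_PER_TOKEN = 4
CONTEXT_LIMIT = 1_000_000  # Gemini 3.1 Pro input context, in tokens

def estimate_tokens(text: str) -> int:
    """Crude token estimate; use a real tokenizer for billing-grade numbers."""
    return len(text) // CHARS_PER_TOKEN

def pick_thinking_level(task: str, num_conflicting_sources: int) -> str:
    """Map task shape to one of the three thinking tiers described above."""
    if task == "quick_summary":
        return "low"
    if num_conflicting_sources >= 3:
        return "high"  # contradictory sources need sustained reasoning
    return "medium"

def fits_in_context(documents: list[str]) -> bool:
    """Check whether the full document set fits in a single call."""
    return sum(estimate_tokens(d) for d in documents) <= CONTEXT_LIMIT
```

The point of the pre-check is cost control: low mode on routine calls, high mode only when the task profile actually warrants it.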

2. Multi-step analysis with many simultaneous constraints

Tax planning, supply chain optimization, compliance mapping: these are tasks where the model needs to track dozens of interacting rules, exceptions, and dependencies at once. Weaker reasoners drop constraints partway through the analysis. You get an answer that's correct for eight of ten rules and silently wrong on the other two.

On GPQA Diamond (graduate-level scientific reasoning), Gemini 3.1 Pro scored 94.3%, up from 84.2% for Gemini 3 Pro and ahead of both Claude Opus 4.6 (91.3%) and GPT-5.2 (92.4%). That benchmark specifically tests whether a model can hold complex domain knowledge in memory and apply it accurately across multiple reasoning steps. The improvement tracks with what practitioners need for constraint-heavy analytical work.
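Regardless of which model you use, the "silently wrong on two of ten rules" failure mode has a cheap mitigation: verify the model's structured answer against each rule programmatically instead of trusting the prose. A minimal sketch; the rule set and answer shape are invented for illustration.

```python
from typing import Callable

# Each rule is a named predicate over the model's structured answer.
Rule = Callable[[dict], bool]

RULES: dict[str, Rule] = {
    "deduction_under_cap": lambda a: a["deduction"] <= 10_000,
    "rate_matches_bracket": lambda a: 0.0 <= a["rate"] <= 0.37,
    "filing_status_valid": lambda a: a["status"] in {"single", "joint"},
}

def check_constraints(answer: dict, rules: dict[str, Rule]) -> list[str]:
    """Return the names of every rule the answer violates."""
    return [name for name, rule in rules.items() if not rule(answer)]
```

An empty list means every constraint held; anything else names exactly which rules the model dropped, which is far more actionable than re-reading a page of analysis.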

3. Coding tasks that require architectural understanding

Writing a function is easy for every frontier model. Understanding how that function fits into a system, where its assumptions about caller behavior are wrong, which edge cases propagate across module boundaries: that's where reasoning matters.

Gemini 3.1 Pro scored 80.6% on SWE-Bench Verified, essentially matching Claude Opus 4.6's 80.8%. On LiveCodeBench Pro, a competitive coding benchmark, it posted an Elo of 2,887, well ahead of GPT-5.2 at 2,393 and Gemini 3 Pro at 2,439. SWE-Bench Verified measures multi-file fixes in real repositories, not isolated LeetCode problems, while LiveCodeBench Pro draws on recent competition problems specifically chosen to avoid training contamination.

With the 64K output limit, the model can produce substantial refactoring plans, complete implementations, or detailed architectural analyses in a single response. Pair that with 1M tokens of input context, and you can feed an entire mid-size codebase and ask questions that span modules.
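Feeding a whole codebase into one call still needs an explicit token budget so you fail predictably rather than mid-request. A hypothetical packing helper; the header format and the 4-chars-per-token heuristic are assumptions for illustration.

```python
def pack_codebase(files: dict[str, str], budget_tokens: int = 1_000_000) -> str:
    """Concatenate source files into one prompt, stopping at the budget.

    Files are sorted by path so packing is deterministic; each file is
    wrapped in a header the model can use to reference modules by name.
    """
    parts: list[str] = []
    used = 0
    for path in sorted(files):
        chunk = f"--- {path} ---\n{files[path]}\n"
        cost = len(chunk) // 4  # rough ~4 chars per token
        if used + cost > budget_tokens:
            break  # budget exhausted; remaining files are omitted
        parts.append(chunk)
        used += cost
    return "".join(parts)
```

In practice you would prioritize files relevant to the question (e.g. by dependency graph) rather than alphabetically, but the budget discipline is the part that matters.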

The Pricing Picture

This is where Gemini 3.1 Pro gets interesting for anyone running production workloads.

Model              | Input (per 1M tokens) | Output (per 1M tokens)
Gemini 3.1 Pro     | $2.00                 | $12.00
Claude Opus 4.6    | $5.00                 | $25.00
GPT-5.2            | $1.75                 | $14.00
Claude Sonnet 4.6  | $3.00                 | $15.00

GPT-5.2 is slightly cheaper on input, but Gemini 3.1 Pro undercuts it on output. Claude Opus 4.6 costs 2.5x as much on input and more than twice as much on output. For high-volume API workloads, the cost difference between Gemini 3.1 Pro and Opus 4.6 compounds fast.

Gemini 3.1 Pro also applies a premium rate for requests over 200K tokens: $4.00 input and $18.00 output per million tokens. That's still cheaper than Claude Sonnet 4.6's long-context rates ($6.00/$22.50) and far cheaper than Opus 4.6's ($10.00/$37.50).
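The rates above are easy to turn into a per-request estimator that makes the 200K tier boundary concrete. One caveat: the announcement says the premium applies to "requests over 200K tokens" without specifying whether input-only or total tokens decide the tier; this sketch assumes input tokens.

```python
def gemini_31_pro_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate request cost in USD at the published preview rates.

    Assumes the long-context premium triggers on input tokens alone.
    """
    if input_tokens > 200_000:
        in_rate, out_rate = 4.00, 18.00  # long-context premium tier
    else:
        in_rate, out_rate = 2.00, 12.00  # standard tier
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000
```

For example, a 100K-token analysis call with a 10K-token response lands around $0.32, while a 500K-token long-context call with a 20K-token response runs about $2.36.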

What It Doesn't Win

Benchmarks don't tell the full story, and Gemini 3.1 Pro has real gaps.

On GDPval-AA, a benchmark evaluating professional-grade task performance across 44 occupations and 9 industries, Gemini 3.1 Pro posted an Elo of 1,317. Claude Sonnet 4.6 leads that benchmark at 1,633. If your work product needs to read like it was written by a senior analyst, Claude's output quality under human scrutiny still holds an edge.

On Humanity's Last Exam with tools enabled (search and code execution), Claude Opus 4.6 scored 53.1% to Gemini 3.1 Pro's 51.4%. For complex agentic workflows with many tool calls, the gap is narrow but real.

And this release is still in preview. Google hasn't announced a stable release timeline. If you need production guarantees and not just strong benchmarks, that matters.

The Verdict: Who Should Switch Today

Switch your default to Gemini 3.1 Pro if:

  • You run high-volume analytical workloads through the API and cost matters. At $2.00/$12.00, you're getting frontier-class reasoning for less than Claude Sonnet 4.6's $3.00/$15.00.
  • Your primary tasks involve synthesizing large document sets, multi-step constraint analysis, or codebase-level code review. These are the use cases where the reasoning gains translate most directly.
  • You're prototyping and want to evaluate without commitment. Google AI Studio offers free preview access with rate limits.

Stay on your current model if:

  • You produce client-facing written output where tone, precision, and professional polish are the primary quality bar. Claude Opus 4.6 and Sonnet 4.6 still lead on expert-evaluated output quality.
  • You're running production agentic systems that depend on deterministic tool use patterns. Preview-stage models carry risk. Wait for the stable release.
  • Your existing pipeline is working and well-tuned. Swapping models mid-production to chase benchmark points is how you introduce regressions.

That Google shipped a ".1" increment instead of waiting for a mid-year ".5" update tells you something about the pace. On reasoning benchmarks, the gap between Gemini and the competition didn't just close; it flipped in Google's favor. Whether that matters for your specific work is a question only your own testing can answer, but at these prices and these scores, the testing is worth doing now.


Benchmark data sourced from Google's February 19, 2026 model release announcement and the ARC-AGI-2 public leaderboard. Pricing verified against Google's Gemini API pricing page on February 21, 2026. Competitive benchmark comparisons use each provider's published figures.