MiniMax M3 Benchmark: Coding, Agentic, and Long-Context Scores

MiniMax M3 benchmark results on SWE-Bench Pro, Terminal Bench 2.1, BrowseComp, and MCP Atlas, how the model compares to GPT 5.5 and Claude Opus 4.7, and what early developers are saying.

Jun 1, 2026 · M-Chat Team

MiniMax M3 benchmark results are easiest to read as a table, not as a handful of headline numbers. In the official June 1, 2026 release, MiniMax compared M3 with MiniMax M2.7, Claude Opus 4.7, GPT 5.5, Gemini 3.1 Pro, Claude Sonnet 4.6, DeepSeek V4 Pro, GLM 5.1 Thinking, and Kimi K2.6 Thinking across coding, agent, GUI, multimodal, and reasoning tasks.

The short version: MiniMax M3 is strongest where the work looks like real software engineering or real agent execution. It posts 59.0 on SWE-Bench Pro, 66.0 on Terminal Bench 2.1, 83.52 on BrowseComp, 74.2 on MCP Atlas, and 70.06 on OSWorld-Verified. It does not win every row, but it is broadly competitive while keeping the 1M context and open-weight positioning that make it practical for long-context development work.

MiniMax M3 Benchmark: Quick Answer

Benchmark	MiniMax M3	What it measures
SWE-Bench Pro	59.0	Real software engineering fixes
Terminal Bench 2.1	66.0	Shell and terminal task completion
BrowseComp	83.52	Long-horizon web-browsing agents
MCP Atlas	74.2	Tool-connected MCP task execution
OSWorld-Verified	70.06	Desktop GUI task completion

Full Benchmark Table

Dashes match empty cells in the official MiniMax table.

Coding Benchmarks

Benchmark	MiniMax M3	MiniMax M2.7	Claude Opus 4.7	GPT 5.5	Gemini 3.1 Pro	Claude Sonnet 4.6	DeepSeek V4 Pro	GLM 5.1 Thinking	Kimi K2.6 Thinking
SWE-Bench Verified	80.5	79.9	87.6	82.9	80.6	79.6	80.6	-	80.2
SWE-Bench Pro	59.0	56.2	64.3	58.6	54.2	-	55.4	58.4	58.6
Terminal Bench 2.1	66.0	51.1	66.1	78.2	70.3	-	-	-	-
SWE Atlas-QnA	37.9	11.29	45.16	45.43	13.5	31.20	-	-	-
nl2repo	42.13	34.99	56.28	52.9	21.62	-	35.5	41	42.8
SWE Atlas-Test Writing	30.83	18.89	38.21	42.59	29.84	31.76	-	-	-
SWE-fficiency	34.8	13.98	42.2	46.6	19.7	-	-	-	-
LiveSQLBench	40.17	33.17	41.00	40.17	39.83	-	-	-	-
CL-bench	20.48	15.38	22.92	25.38	21.06	-	-	-	-
VIBE-V2	50.12	37.89	55.87	50.50	28.00	-	-	-	-
SVG-Bench	63.7	48.0	62.3	58.2	59.2	-	-	-	-
PostTrainBench	37.1	13.1	42.4	39.3	15.2	-	-	-	-
KernelBench Hard	28.8	10.5	30.7	20.9	18.6	-	-	-	-
PaperBench	52.6	30.6	58.5	57.5	46.7	-	-	-	-

Cowork and Agent Benchmarks

Benchmark	MiniMax M3	MiniMax M2.7	Claude Opus 4.7	GPT 5.5	Gemini 3.1 Pro	Claude Sonnet 4.6	DeepSeek V4 Pro	GLM 5.1 Thinking	Kimi K2.6 Thinking
BrowseComp	83.52	76.3	79.3	84.4	85.9	74.7	83.4	79.3	83.2
DRACO	73.23	66.77	77.7	-	-	75.8	-	-	-
GDPval rubrics	74.78	66.44	79.8	80.66	57.82	75.65	70.32	68.26	65.12
BankerToolBench	76.12	63.89	81.34	70.04	67.03	-	-	-	-
OfficeQA Pro	45.1	-	43.6	52.6	18.1	-	-	-	-
SpreadSheetBench-v1	89.35	84.92	88.49	88.11	56.06	-	84.9	85.2	84.5
YC-Bench	2.10M	0	2.19M	1.28M	1.05M	-	-	-	-
LOCA-Bench (256k)	49.3	0	57	-	-	-	-	-	-
MCP Atlas	74.2	49.4	77	75.3	69.2	61.3	73.6	71.8	66.6
Apex-Agents	27.7	5.6	37.2	41.7	33.4	26.2	-	-	-
Claw-Eval	74.5	49.7	71.6	-	57.8	68.3	58.4	62.7	61.5

GUI, Multimodal, and Reasoning Benchmarks

Benchmark	MiniMax M3	MiniMax M2.7	Claude Opus 4.7	GPT 5.5	Gemini 3.1 Pro	Claude Sonnet 4.6	DeepSeek V4 Pro	GLM 5.1 Thinking	Kimi K2.6 Thinking
OSWorld-Verified	70.06	-	82.8	78.7	76.2	72.5	80.6	-	73.1
OmniDocBench	91.6	-	89.3	87.5	88.1	86.9	-	-	-
MMMU-Pro	78.1	-	77	81.2	80.5	74.5	-	-	79.4
Video-MMMU	84.6	-	83	86.4	87.9	-	-	-	-
VideoMME (w/ sub)	85.4	-	-	89.4	87.9	-	-	-	-
IMO 2025	35 / 42	-	-	-	-	-	-	-	-
USAMO 2026	36 / 42	-	52.8%	98.21%	74.40%	-	-	-	-
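The math rows mix formats: M3 is reported as raw points (35 / 42, 36 / 42), while other models on USAMO 2026 are reported as percentages. A minimal sketch of putting such cells on one scale follows. The `to_percent` helper is hypothetical, and it assumes the published percentages mean share of maximum points, which the table does not state.

```python
from typing import Optional


def to_percent(cell: str) -> Optional[float]:
    """Convert a table cell to a percentage; return None for an empty cell.

    Assumption: "a / b" means points earned out of points available, and
    a value with a trailing "%" is already on the same 0-100 scale.
    """
    text = cell.strip()
    if text in {"", "-"}:
        return None  # dash = no published score
    if "/" in text:
        earned, total = (float(part) for part in text.split("/"))
        return round(100.0 * earned / total, 2)
    return float(text.rstrip("%"))


usamo_row = ["36 / 42", "-", "52.8%", "98.21%", "74.40%"]
print([to_percent(cell) for cell in usamo_row])
```

Under that assumption, 36 / 42 works out to about 85.7%, which places M3 between GPT 5.5 and Gemini 3.1 Pro on that row.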

What the MiniMax M3 Benchmark Says

The official table supports three practical conclusions. First, M3 is stronger than M2.7 on every coding row, from a narrow edge on SWE-Bench Verified (80.5 vs 79.9) to gaps of 20 points or more on SWE Atlas-QnA, SWE-fficiency, PostTrainBench, and PaperBench. Second, it is closest to the frontier models on agent and tool-use work: it has the top listed score on Claw-Eval (74.5) and SpreadSheetBench-v1 (89.35), and it trails the row leader by under 3 points on BrowseComp (83.52 vs 85.9) and MCP Atlas (74.2 vs 77). Third, Claude Opus 4.7, GPT 5.5, and Gemini 3.1 Pro still lead most individual rows, so the correct read is not "M3 wins everything." The useful read is that M3 offers a strong open-weight, 1M-context baseline for coding and agent workflows.
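To check claims like these against the table yourself, a small script can report the leader on each row and how far M3 sits behind it. The sketch below copies four rows from the tables above. The `gap_to_leader` helper is hypothetical, and it assumes higher scores are better on these rows.

```python
MODELS = [
    "MiniMax M3", "MiniMax M2.7", "Claude Opus 4.7", "GPT 5.5",
    "Gemini 3.1 Pro", "Claude Sonnet 4.6", "DeepSeek V4 Pro",
    "GLM 5.1 Thinking", "Kimi K2.6 Thinking",
]

# Values copied from the tables above; None stands for a dash.
ROWS = {
    "SWE-Bench Pro": [59.0, 56.2, 64.3, 58.6, 54.2, None, 55.4, 58.4, 58.6],
    "BrowseComp": [83.52, 76.3, 79.3, 84.4, 85.9, 74.7, 83.4, 79.3, 83.2],
    "MCP Atlas": [74.2, 49.4, 77, 75.3, 69.2, 61.3, 73.6, 71.8, 66.6],
    "Claw-Eval": [74.5, 49.7, 71.6, None, 57.8, 68.3, 58.4, 62.7, 61.5],
}


def gap_to_leader(values, target="MiniMax M3"):
    """Return (leading model, leader score minus target score) for one row."""
    scored = {m: v for m, v in zip(MODELS, values) if v is not None}
    leader = max(scored, key=scored.get)  # assumes higher is better
    return leader, round(scored[leader] - scored[target], 2)


for name, values in ROWS.items():
    leader, gap = gap_to_leader(values)
    print(f"{name}: leader={leader}, M3 gap={gap}")
```

A gap of 0.0 means M3 itself holds the top listed score on that row.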

Turn the Benchmark Into Your Own Evaluation

Use the public table to choose test areas, then run your own tasks. A good local suite is one code-review prompt against a real repository, one repo-navigation task, one terminal task, one spreadsheet or browser-agent task, and one long-context summary. Log success, retries, latency, and human correction cost. That will tell you more about your stack than any single SWE-Bench score.
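One way to keep that log is a small harness that records the four signals per task. This is a minimal sketch, not a prescribed method: `TaskResult`, `run_task`, and `summarize` are hypothetical names, and the `attempt` callable stands in for whatever code calls your model and checks the result.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class TaskResult:
    name: str
    success: bool
    retries: int
    latency_s: float
    correction_minutes: float = 0.0  # filled in later by a human reviewer


def run_task(name: str, attempt: Callable[[], bool], max_retries: int = 2) -> TaskResult:
    """Run one task, retrying on failure, and record total wall-clock time."""
    start = time.perf_counter()
    retries = 0
    success = attempt()
    while not success and retries < max_retries:
        retries += 1
        success = attempt()
    return TaskResult(name, success, retries, time.perf_counter() - start)


def summarize(results: List[TaskResult]) -> Dict[str, float]:
    """Aggregate the per-task log into suite-level numbers."""
    n = len(results)
    return {
        "success_rate": sum(r.success for r in results) / n,
        "mean_retries": sum(r.retries for r in results) / n,
        "mean_latency_s": sum(r.latency_s for r in results) / n,
        "total_correction_minutes": sum(r.correction_minutes for r in results),
    }
```

In use, each of the five suggested tasks becomes one `run_task` call, and `summarize` gives a single set of numbers to compare across models or prompt versions.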
