MiniMax M3 Benchmark: Coding, Agentic, and Long-Context Scores
MiniMax M3 benchmark results on SWE-Bench Pro, Terminal Bench 2.1, BrowseComp, and MCP Atlas, how the model compares to GPT 5.5 and Claude Opus 4.7, and what early developers are saying.
MiniMax M3 benchmark results are easiest to read as a table, not as a handful of headline numbers. In the official June 1, 2026 release, MiniMax compared M3 with MiniMax M2.7, Claude Opus 4.7, GPT 5.5, Gemini 3.1 Pro, Claude Sonnet 4.6, DeepSeek V4 Pro, GLM 5.1 Thinking, and Kimi K2.6 Thinking across coding, agent, GUI, multimodal, and reasoning tasks.
The short version: MiniMax M3 is strongest where the work looks like real software engineering or real agent execution. It posts 59.0 on SWE-Bench Pro, 66.0 on Terminal Bench 2.1, 83.52 on BrowseComp, 74.2 on MCP Atlas, and 70.06 on OSWorld-Verified. It does not win every row: among the reported scores it leads only on SVG-Bench, SpreadSheetBench-v1, Claw-Eval, and OmniDocBench. It is still broadly competitive while keeping the 1M context and open-weight positioning that make it practical for long-context development work.
MiniMax M3 Benchmark: Quick Answer
| Benchmark | MiniMax M3 | What it measures |
|---|---|---|
| SWE-Bench Pro | 59.0 | Real software engineering fixes |
| Terminal Bench 2.1 | 66.0 | Shell and terminal task completion |
| BrowseComp | 83.52 | Long-horizon web-browsing agents |
| MCP Atlas | 74.2 | Tool-connected MCP task execution |
| OSWorld-Verified | 70.06 | Desktop GUI task completion |
Full Benchmark Table
A dash means the official MiniMax table has no value for that cell.
Coding Benchmarks
| Benchmark | MiniMax M3 | MiniMax M2.7 | Claude Opus 4.7 | GPT 5.5 | Gemini 3.1 Pro | Claude Sonnet 4.6 | DeepSeek V4 Pro | GLM 5.1 Thinking | Kimi K2.6 Thinking |
|---|---|---|---|---|---|---|---|---|---|
| SWE-Bench Verified | 80.5 | 79.9 | 87.6 | 82.9 | 80.6 | 79.6 | 80.6 | - | 80.2 |
| SWE-Bench Pro | 59.0 | 56.2 | 64.3 | 58.6 | 54.2 | - | 55.4 | 58.4 | 58.6 |
| Terminal Bench 2.1 | 66.0 | 51.1 | 66.1 | 78.2 | 70.3 | - | - | - | - |
| SWE Atlas-QnA | 37.9 | 11.29 | 45.16 | 45.43 | 13.5 | 31.20 | - | - | - |
| nl2repo | 42.13 | 34.99 | 56.28 | 52.9 | 21.62 | - | 35.5 | 41 | 42.8 |
| SWE Atlas-Test Writing | 30.83 | 18.89 | 38.21 | 42.59 | 29.84 | 31.76 | - | - | - |
| SWE-fficiency | 34.8 | 13.98 | 42.2 | 46.6 | 19.7 | - | - | - | - |
| LiveSQLBench | 40.17 | 33.17 | 41.00 | 40.17 | 39.83 | - | - | - | - |
| CL-bench | 20.48 | 15.38 | 22.92 | 25.38 | 21.06 | - | - | - | - |
| VIBE-V2 | 50.12 | 37.89 | 55.87 | 50.50 | 28.00 | - | - | - | - |
| SVG-Bench | 63.7 | 48.0 | 62.3 | 58.2 | 59.2 | - | - | - | - |
| PostTrainBench | 37.1 | 13.1 | 42.4 | 39.3 | 15.2 | - | - | - | - |
| KernelBench Hard | 28.8 | 10.5 | 30.7 | 20.9 | 18.6 | - | - | - | - |
| PaperBench | 52.6 | 30.6 | 58.5 | 57.5 | 46.7 | - | - | - | - |
Cowork and Agent Benchmarks
| Benchmark | MiniMax M3 | MiniMax M2.7 | Claude Opus 4.7 | GPT 5.5 | Gemini 3.1 Pro | Claude Sonnet 4.6 | DeepSeek V4 Pro | GLM 5.1 Thinking | Kimi K2.6 Thinking |
|---|---|---|---|---|---|---|---|---|---|
| BrowseComp | 83.52 | 76.3 | 79.3 | 84.4 | 85.9 | 74.7 | 83.4 | 79.3 | 83.2 |
| DRACO | 73.23 | 66.77 | 77.7 | - | - | 75.8 | - | - | - |
| GDPval rubrics | 74.78 | 66.44 | 79.8 | 80.66 | 57.82 | 75.65 | 70.32 | 68.26 | 65.12 |
| BankerToolBench | 76.12 | 63.89 | 81.34 | 70.04 | 67.03 | - | - | - | - |
| OfficeQA Pro | 45.1 | - | 43.6 | 52.6 | 18.1 | - | - | - | - |
| SpreadSheetBench-v1 | 89.35 | 84.92 | 88.49 | 88.11 | 56.06 | - | 84.9 | 85.2 | 84.5 |
| YC-Bench | 2.10M | 0 | 2.19M | 1.28M | 1.05M | - | - | - | - |
| LOCA-Bench (256k) | 49.3 | 0 | 57 | - | - | - | - | - | - |
| MCP Atlas | 74.2 | 49.4 | 77 | 75.3 | 69.2 | 61.3 | 73.6 | 71.8 | 66.6 |
| Apex-Agents | 27.7 | 5.6 | 37.2 | 41.7 | 33.4 | 26.2 | - | - | - |
| Claw-Eval | 74.5 | 49.7 | 71.6 | - | 57.8 | 68.3 | 58.4 | 62.7 | 61.5 |
GUI, Multimodal, and Reasoning Benchmarks
| Benchmark | MiniMax M3 | MiniMax M2.7 | Claude Opus 4.7 | GPT 5.5 | Gemini 3.1 Pro | Claude Sonnet 4.6 | DeepSeek V4 Pro | GLM 5.1 Thinking | Kimi K2.6 Thinking |
|---|---|---|---|---|---|---|---|---|---|
| OSWorld-Verified | 70.06 | - | 82.8 | 78.7 | 76.2 | 72.5 | 80.6 | - | 73.1 |
| OmniDocBench | 91.6 | - | 89.3 | 87.5 | 88.1 | 86.9 | - | - | - |
| MMMU-Pro | 78.1 | - | 77 | 81.2 | 80.5 | 74.5 | - | - | 79.4 |
| Video-MMMU | 84.6 | - | 83 | 86.4 | 87.9 | - | - | - | - |
| VideoMME (w/ sub) | 85.4 | - | - | 89.4 | 87.9 | - | - | - | - |
| IMO 2025 | 35 / 42 | - | - | - | - | - | - | - | - |
| USAMO 2026 | 36 / 42 (85.71%) | - | 52.8% | 98.21% | 74.40% | - | - | - | - |
What the MiniMax M3 Benchmark Says
The official table supports three practical conclusions. First, M3 is clearly stronger than M2.7 across the engineering-heavy rows, for example SWE-Bench Pro (59.0 vs 56.2), Terminal Bench 2.1 (66.0 vs 51.1), and KernelBench Hard (28.8 vs 10.5). Second, it is closest to the frontier models on agent and tool-use work: it has the highest reported score on Claw-Eval (74.5) and SpreadSheetBench-v1 (89.35), and sits within about three points of the best score on BrowseComp and MCP Atlas. Third, Claude Opus 4.7, GPT 5.5, and Gemini 3.1 Pro still lead most rows, so the correct read is not "M3 wins everything." The useful read is that M3 offers a strong open-weight, 1M-context baseline for coding and agent workflows.
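The gaps behind these conclusions are simple arithmetic on the table rows. As a minimal sketch, the Python below copies five rows and the first five model columns from the tables above (the other columns are left out for brevity, so "leader" here means the leader among these five models) and computes M3's gain over M2.7 and its distance from the best score in each row. The names `ROWS`, `summarize`, and `SUMMARY` are illustrative and not part of any MiniMax tooling.

```python
# Scores copied from the tables above; None marks a dash (no reported value).
MODELS = ["MiniMax M3", "MiniMax M2.7", "Claude Opus 4.7", "GPT 5.5", "Gemini 3.1 Pro"]

ROWS = {
    "SWE-Bench Pro":       [59.0, 56.2, 64.3, 58.6, 54.2],
    "Terminal Bench 2.1":  [66.0, 51.1, 66.1, 78.2, 70.3],
    "MCP Atlas":           [74.2, 49.4, 77.0, 75.3, 69.2],
    "Claw-Eval":           [74.5, 49.7, 71.6, None, 57.8],
    "SpreadSheetBench-v1": [89.35, 84.92, 88.49, 88.11, 56.06],
}


def summarize(scores):
    """Return (M3 gain over M2.7, leader name, gap from M3 to the leader)."""
    m3, m27 = scores[0], scores[1]
    # Skip models with no reported value for this row.
    reported = [(s, name) for s, name in zip(scores, MODELS) if s is not None]
    best_score, leader = max(reported, key=lambda pair: pair[0])
    return round(m3 - m27, 2), leader, round(best_score - m3, 2)


SUMMARY = {bench: summarize(scores) for bench, scores in ROWS.items()}

if __name__ == "__main__":
    for bench, (gain, leader, gap) in SUMMARY.items():
        print(f"{bench:20s} {gain:+6.2f} vs M2.7 | leader: {leader} | M3 gap: {gap:.2f}")
```

A gap of 0.00 means M3 has the highest reported score in that row. Extending `ROWS` with the remaining columns only requires adding the model names to `MODELS` in the same order.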
Turn the Benchmark Into Your Own Evaluation
Use the public table to choose test areas, then run your own tasks. A good local suite is one code-review prompt against a real repository, one repo-navigation task, one terminal task, one spreadsheet or browser-agent task, and one long-context summary. Log success, retries, latency, and human correction cost. That will tell you more about your stack than any single SWE-Bench score.
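One way to keep that log consistent is a small harness that records the four signals for every task. The sketch below is a starting point under stated assumptions: `run_task`, `TaskResult`, and `summarize` are hypothetical names, the `attempt` callable stands in for whatever code actually calls the model and checks the outcome, and human correction cost is entered by a reviewer afterwards as minutes.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class TaskResult:
    name: str
    success: bool
    retries: int
    latency_s: float
    correction_minutes: float = 0.0  # filled in later by a human reviewer


def run_task(name: str, attempt: Callable[[], bool], max_retries: int = 2) -> TaskResult:
    """Run `attempt` until it returns True or the retry budget is spent."""
    start = time.perf_counter()
    retries = 0
    success = attempt()
    while not success and retries < max_retries:
        retries += 1
        success = attempt()
    # Latency covers the first attempt plus all retries.
    return TaskResult(name, success, retries, time.perf_counter() - start)


def summarize(results: List[TaskResult]) -> Dict[str, float]:
    """Aggregate success, retries, latency, and correction cost over a suite."""
    if not results:
        raise ValueError("no results to summarize")
    n = len(results)
    return {
        "tasks": n,
        "success_rate": sum(r.success for r in results) / n,
        "mean_retries": sum(r.retries for r in results) / n,
        "mean_latency_s": sum(r.latency_s for r in results) / n,
        "total_correction_minutes": sum(r.correction_minutes for r in results),
    }


if __name__ == "__main__":
    # Stub attempts; replace each lambda with a real model call plus a pass/fail check.
    suite = [
        run_task("code-review", lambda: True),
        run_task("terminal-task", lambda: False),
    ]
    suite[1].correction_minutes = 15.0  # reviewer estimate, entered by hand
    print(summarize(suite))
```

Running the same suite against each candidate model gives rows that are directly comparable for your own workload, which the public table cannot provide.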
