MiniMax M3 Benchmark: Coding, Agentic, and Long-Context Scores

MiniMax M3 benchmark results on SWE-Bench Pro, Terminal Bench 2.1, BrowseComp, and MCP Atlas, and how the model compares to GPT 5.5 and Claude Opus 4.7.

Jun 1, 2026 · M-Chat Team

MiniMax M3 benchmark results are easiest to read as a table, not as a handful of headline numbers. In the official June 1, 2026 release, MiniMax compared M3 with MiniMax M2.7, Claude Opus 4.7, GPT 5.5, Gemini 3.1 Pro, Claude Sonnet 4.6, DeepSeek V4 Pro, GLM 5.1 Thinking, and Kimi K2.6 Thinking across coding, agent, GUI, multimodal, and reasoning tasks.

The short version: MiniMax M3 is strongest where the work looks like real software engineering or real agent execution. It posts 59.0 on SWE-Bench Pro, 66.0 on Terminal Bench 2.1, 83.52 on BrowseComp, 74.2 on MCP Atlas, and 70.06 on OSWorld-Verified. It does not win every row, but it is broadly competitive while keeping the 1M context and open-weight positioning that make it practical for long-context development work.

MiniMax M3 Benchmark: Quick Answer

| Benchmark | MiniMax M3 | What it measures |
| --- | --- | --- |
| SWE-Bench Pro | 59.0 | Real software engineering fixes |
| Terminal Bench 2.1 | 66.0 | Shell and terminal task completion |
| BrowseComp | 83.52 | Long-horizon web-browsing agents |
| MCP Atlas | 74.2 | Tool-connected MCP task execution |
| OSWorld-Verified | 70.06 | Desktop GUI task completion |

Full Benchmark Table

Dashes match empty cells in the official MiniMax table.

Coding Benchmarks

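| Benchmark | MiniMax M3 | MiniMax M2.7 | Claude Opus 4.7 | GPT 5.5 | Gemini 3.1 Pro | Claude Sonnet 4.6 | DeepSeek V4 Pro | GLM 5.1 Thinking | Kimi K2.6 Thinking |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SWE-Bench Verified | 80.5 | 79.9 | 87.6 | 82.9 | 80.6 | 79.6 | 80.6 | - | 80.2 |
| SWE-Bench Pro | 59.0 | 56.2 | 64.3 | 58.6 | 54.2 | - | 55.4 | 58.4 | 58.6 |
| Terminal Bench 2.1 | 66.0 | 51.1 | 66.1 | 78.2 | 70.3 | - | - | - | - |
| SWE Atlas-QnA | 37.9 | 11.29 | 45.16 | 45.43 | 13.5 | 31.20 | - | - | - |
| nl2repo | 42.13 | 34.99 | 56.28 | 52.9 | 21.62 | - | 35.5 | 41 | 42.8 |
| SWE Atlas-Test Writing | 30.83 | 18.89 | 38.21 | 42.59 | 29.84 | 31.76 | - | - | - |
| SWE-fficiency | 34.8 | 13.98 | 42.2 | 46.6 | 19.7 | - | - | - | - |
| LiveSQLBench | 40.17 | 33.17 | 41.00 | 40.17 | 39.83 | - | - | - | - |
| CL-bench | 20.48 | 15.38 | 22.92 | 25.38 | 21.06 | - | - | - | - |
| VIBE-V2 | 50.12 | 37.89 | 55.87 | 50.50 | 28.00 | - | - | - | - |
| SVG-Bench | 63.7 | 48.0 | 62.3 | 58.2 | 59.2 | - | - | - | - |
| PostTrainBench | 37.1 | 13.1 | 42.4 | 39.3 | 15.2 | - | - | - | - |
| KernelBench Hard | 28.8 | 10.5 | 30.7 | 20.9 | 18.6 | - | - | - | - |
| PaperBench | 52.6 | 30.6 | 58.5 | 57.5 | 46.7 | - | - | - | - |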

Cowork and Agent Benchmarks

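| Benchmark | MiniMax M3 | MiniMax M2.7 | Claude Opus 4.7 | GPT 5.5 | Gemini 3.1 Pro | Claude Sonnet 4.6 | DeepSeek V4 Pro | GLM 5.1 Thinking | Kimi K2.6 Thinking |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BrowseComp | 83.52 | 76.3 | 79.3 | 84.4 | 85.9 | 74.7 | 83.4 | 79.3 | 83.2 |
| DRACO | 73.23 | 66.77 | 77.7 | - | - | 75.8 | - | - | - |
| GDPval rubrics | 74.78 | 66.44 | 79.8 | 80.66 | 57.82 | 75.65 | 70.32 | 68.26 | 65.12 |
| BankerToolBench | 76.12 | 63.89 | 81.34 | 70.04 | 67.03 | - | - | - | - |
| OfficeQA Pro | 45.1 | - | 43.6 | 52.6 | 18.1 | - | - | - | - |
| SpreadSheetBench-v1 | 89.35 | 84.92 | 88.49 | 88.11 | 56.06 | - | 84.9 | 85.2 | 84.5 |
| YC-Bench | 2.10M | 0 | 2.19M | 1.28M | 1.05M | - | - | - | - |
| LOCA-Bench (256k) | 49.3 | 0 | 57 | - | - | - | - | - | - |
| MCP Atlas | 74.2 | 49.4 | 77 | 75.3 | 69.2 | 61.3 | 73.6 | 71.8 | 66.6 |
| Apex-Agents | 27.7 | 5.6 | 37.2 | 41.7 | 33.4 | 26.2 | - | - | - |
| Claw-Eval | 74.5 | 49.7 | 71.6 | - | 57.8 | 68.3 | 58.4 | 62.7 | 61.5 |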

GUI, Multimodal, and Reasoning Benchmarks

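| Benchmark | MiniMax M3 | MiniMax M2.7 | Claude Opus 4.7 | GPT 5.5 | Gemini 3.1 Pro | Claude Sonnet 4.6 | DeepSeek V4 Pro | GLM 5.1 Thinking | Kimi K2.6 Thinking |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| OSWorld-Verified | 70.06 | - | 82.8 | 78.7 | 76.2 | 72.5 | 80.6 | - | 73.1 |
| OmniDocBench | 91.6 | - | 89.3 | 87.5 | 88.1 | 86.9 | - | - | - |
| MMMU-Pro | 78.1 | - | 77 | 81.2 | 80.5 | 74.5 | - | - | 79.4 |
| Video-MMMU | 84.6 | - | 83 | 86.4 | 87.9 | - | - | - | - |
| VideoMME (w/ sub) | 85.4 | - | - | 89.4 | 87.9 | - | - | - | - |
| IMO 2025 | 35 / 42 | - | - | - | - | - | - | - | - |
| USAMO 2026 | 36 / 42 | - | 52.8% | 98.21% | 74.40% | - | - | - | - |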

What the MiniMax M3 Benchmark Says

The official table supports three practical conclusions. First, M3 is clearly stronger than M2.7 across the engineering-heavy rows: SWE-Bench Pro rises from 56.2 to 59.0 and Terminal Bench 2.1 from 51.1 to 66.0. Second, it is closest to the frontier models on agent and tool-use work. It trails the best reported score by under three points on BrowseComp and MCP Atlas, and it posts the highest reported score on Claw-Eval and SpreadSheetBench-v1. Third, Claude Opus 4.7, GPT 5.5, and Gemini 3.1 Pro still lead individual rows, so the correct read is not "M3 wins everything." The useful read is that M3 offers a strong open-weight, 1M-context baseline for coding and agent workflows.
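To make the first two conclusions concrete, here is a minimal Python sketch that computes, for a few rows, M3's gain over M2.7 and its gap to the best of the three frontier models named above. The scores are copied from the official table; the names `ROWS`, `summarize`, and `SUMMARY` are illustrative, and the comparison deliberately ignores the other columns.

```python
# Scores copied from the benchmark tables in this article.
# Column order: MiniMax M3, MiniMax M2.7, Claude Opus 4.7, GPT 5.5, Gemini 3.1 Pro.
# None stands for a dash (no score reported in the official table).
ROWS = {
    "SWE-Bench Pro":      [59.0, 56.2, 64.3, 58.6, 54.2],
    "Terminal Bench 2.1": [66.0, 51.1, 66.1, 78.2, 70.3],
    "BrowseComp":         [83.52, 76.3, 79.3, 84.4, 85.9],
    "MCP Atlas":          [74.2, 49.4, 77.0, 75.3, 69.2],
    "Claw-Eval":          [74.5, 49.7, 71.6, None, 57.8],
}


def summarize(scores):
    """Return M3's gain over M2.7 and its gap to the best of the other three models."""
    m3, m27, *others = scores
    reported = [s for s in others if s is not None]  # skip dashes
    return {
        "gain_over_m27": round(m3 - m27, 2),
        "gap_to_best_of_three": round(m3 - max(reported), 2),
    }


SUMMARY = {name: summarize(scores) for name, scores in ROWS.items()}

for name, s in SUMMARY.items():
    print(f"{name:20s} vs M2.7 {s['gain_over_m27']:+6.2f}   "
          f"vs best of three {s['gap_to_best_of_three']:+6.2f}")
```

A positive gap means M3 leads that row among the models compared; a negative gap shows how far it trails the row leader.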

Turn the Benchmark Into Your Own Evaluation

Use the public table to choose test areas, then run your own tasks. A good local suite is one code-review prompt against a real repository, one repo-navigation task, one terminal task, one spreadsheet or browser-agent task, and one long-context summary. Log success, retries, latency, and human correction cost. That will tell you more about your stack than any single SWE-Bench score.
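One way to keep that log is a small record per task run plus a summary function. The sketch below is only an illustration: the `TaskRun` fields, the `summarize_runs` helper, and every number in the sample runs are hypothetical placeholders, not measurements of MiniMax M3 or any other model.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class TaskRun:
    """One attempt at one task in a local evaluation suite (hypothetical schema)."""
    task: str              # e.g. "code-review", "terminal"
    success: bool          # did the output meet your acceptance criteria?
    retries: int           # extra attempts needed before accepting or giving up
    latency_s: float       # wall-clock seconds for the final attempt
    correction_min: float  # human minutes spent fixing the output


def summarize_runs(runs):
    """Aggregate the four metrics named in the text over a list of runs."""
    if not runs:
        raise ValueError("no runs to summarize")
    return {
        "success_rate": sum(r.success for r in runs) / len(runs),
        "mean_retries": mean(r.retries for r in runs),
        "mean_latency_s": mean(r.latency_s for r in runs),
        "total_correction_min": sum(r.correction_min for r in runs),
    }


# Placeholder values for illustration only; replace with your own logged runs.
runs = [
    TaskRun("code-review", True, 0, 42.0, 5.0),
    TaskRun("repo-navigation", True, 1, 95.5, 0.0),
    TaskRun("terminal", False, 3, 180.0, 30.0),
    TaskRun("spreadsheet-agent", True, 0, 60.5, 10.0),
    TaskRun("long-context-summary", True, 1, 122.0, 15.0),
]

report = summarize_runs(runs)
print(report)
```

Running the same suite against each candidate model and comparing the resulting reports keeps the comparison on your own tasks rather than on public leaderboards.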
