Xiaomi MiMo-V2-Flash
Overview
Xiaomi MiMo-V2-Flash is a cutting-edge open-source Mixture-of-Experts (MoE) large language model (LLM) developed by Xiaomi, released on December 16, 2025. It represents a significant advance in efficient AI inference, combining frontier-level performance with fast processing and low cost. Designed primarily for reasoning, coding, agentic tasks, and long-context applications, it builds on Xiaomi's earlier MiMo efforts and is positioned as a competitive alternative to models like DeepSeek-V3.2 and Claude Sonnet 4.5. The model is fully open-sourced under the MIT license, making it accessible for research and deployment.
Key Features and Technical Specifications
- Parameters: 309 billion total parameters, with only 15 billion active per forward pass—enabling massive scale without proportional compute demands.
- Context Window: Up to 256,000 tokens, supporting extended agent interactions, tool calls, and multi-turn conversations.
- Inference Speed: Up to 150 tokens per second, with a 2.0–2.6× speedup via native Multi-Token Prediction (MTP).
- Cost Efficiency: $0.1 per million input tokens and $0.3 per million output tokens (assuming a 3:1 input/output token ratio); a quick cost sketch follows this list.
- Training Data: Trained on 27 trillion tokens in FP8 mixed precision for efficiency.
- Modes: Supports a "hybrid thinking mode" for step-by-step reasoning or instant responses, toggleable by users.
- Multilingual Focus: Strong performance in Chinese and English, with emphasis on agentic workflows like web development and code generation.
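
As a back-of-the-envelope illustration of the pricing above, the following sketch estimates the cost of a single request at the listed rates. The example token counts are hypothetical and actual billing may differ:

```python
# Rough cost estimate for one request at the rates listed above
# ($0.1 per 1M input tokens, $0.3 per 1M output tokens).
# Prices are taken from this overview and may not reflect actual billing.

PRICE_IN_PER_M = 0.10   # USD per million input tokens
PRICE_OUT_PER_M = 0.30  # USD per million output tokens

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost of one request."""
    return (input_tokens * PRICE_IN_PER_M
            + output_tokens * PRICE_OUT_PER_M) / 1_000_000

# Hypothetical long-context agent call with a 3:1 input/output split:
# 30,000 input tokens and 10,000 output tokens.
print(f"${estimate_cost(30_000, 10_000):.4f}")  # -> $0.0060
```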
Architecture
MiMo-V2-Flash employs a hybrid attention mechanism to balance efficiency and long-context capability:
- Sliding Window Attention (SWA) with an aggressive 128-token window for local focus.
- Global Attention (GA) interleaved with SWA in a 5:1 ratio (GA:SWA) across 8 hybrid blocks (a toy layer schedule is sketched after this list).
- Learnable Attention Sink Bias: Maintains performance over ultra-long contexts.
- Multi-Token Prediction (MTP) Module: A lightweight 0.33B-parameter addition per block using dense Feed-Forward Networks (FFNs) and SWA. It enables self-speculative decoding, accepting 2.8–3.6 tokens per draft for faster generation (a generic draft-and-verify loop is sketched below).
- Post-Training: Uses MOPD (Multi-Teacher Online Policy Distillation), an efficient on-policy method that distills knowledge from multiple expert teachers with <1/50th the compute of traditional Supervised Fine-Tuning (SFT) + Reinforcement Learning (RL) pipelines. This supports continuous self-improvement.
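
To make the interleaving concrete, here is a toy sketch of a hybrid layer schedule. The window size, block count, and 5:1 GA:SWA ratio are taken from the description above; the function and variable names are illustrative, not Xiaomi's implementation.

```python
# Illustrative only: a toy layer schedule for a hybrid-attention stack.
# Numbers follow the description above; names are hypothetical.

from typing import List

SWA_WINDOW = 128                     # sliding-window size stated above
HYBRID_BLOCKS = 8                    # hybrid block count stated above
GA_PER_BLOCK, SWA_PER_BLOCK = 5, 1   # 5:1 GA:SWA ratio stated above

def build_layer_schedule() -> List[str]:
    """Return a flat list of attention types, one entry per layer."""
    schedule = []
    for _ in range(HYBRID_BLOCKS):
        schedule += ["global"] * GA_PER_BLOCK               # full-context attention
        schedule += [f"swa-{SWA_WINDOW}"] * SWA_PER_BLOCK   # local 128-token attention
    return schedule

print(build_layer_schedule())
# ['global', 'global', 'global', 'global', 'global', 'swa-128', 'global', ...]
```

The learnable attention-sink bias and the MoE feed-forward layers live inside each layer and are not modeled in this sketch.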
The base model (MiMo-V2-Flash-Base) is also released, allowing fine-tuning for specialized tasks.
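
Returning to the MTP module described above, self-speculative decoding can be pictured with a small, generic draft-and-verify loop. This is an illustration of the technique, not Xiaomi's decoder: the draft and verification functions are stubs, and the toy acceptance probability stands in for the 2.8–3.6 accepted tokens per draft quoted above.

```python
# Generic self-speculative decoding loop (illustrative stubs, not MiMo code).
import random
from typing import List

def mtp_draft(context: List[int], k: int = 4) -> List[int]:
    """Stand-in for the lightweight MTP head: propose k draft tokens."""
    return [random.randint(0, 99) for _ in range(k)]

def verify(context: List[int], draft: List[int]) -> List[int]:
    """Stand-in for the main model's single verification pass.

    Returns the accepted prefix of the draft plus one corrected token,
    so every iteration emits at least one token.
    """
    accepted = []
    for tok in draft:
        if random.random() < 0.8:   # toy acceptance probability
            accepted.append(tok)
        else:
            break
    return accepted + [random.randint(0, 99)]  # corrected/bonus token

def generate(prompt: List[int], max_new: int = 32) -> List[int]:
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        draft = mtp_draft(out)
        out += verify(out, draft)   # 1..k+1 tokens per main-model pass
    return out[: len(prompt) + max_new]

print(generate([1, 2, 3]))
```

In a real deployment the draft comes from the MTP head and verification from a single forward pass of the main model, so every accepted token amortizes the cost of that pass.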
Benchmarks and Performance Comparisons
MiMo-V2-Flash performs strongly among open-source models, particularly on coding and agentic benchmarks, and in several cases matches or approaches closed-source leaders like GPT-5 and Gemini 3.0 Pro. It ranks #1 among open-source models on SWE-Bench Verified and sits near the top of open-source results on math/reasoning tasks like AIME 2025.
Here's a summary of key benchmarks (post-training version):
| Benchmark Category | Benchmark | MiMo-V2-Flash | DeepSeek-V3.2 | K2-Thinking | Claude Sonnet 4.5 | GPT-5 (High) | Gemini 3.0 Pro |
|---|---|---|---|---|---|---|---|
| Reasoning | MMLU-Pro | 84.9 | 85.0 | 84.6 | 88.2 | 87.5 | 90.1 |
| | GPQA-Diamond | 83.7 | 82.4 | 84.5 | 83.4 | 85.7 | 91.9 |
| | AIME 2025 | 94.1 | 93.1 | 94.5 | 87.0 | 94.6 | 95.0 |
| | HMMT Feb. 2025 | 84.4 | 92.5 | 89.4 | 79.2 | 88.3 | 97.5 |
| Coding | LiveCodeBench-v6 | 80.6 | 83.3 | 83.1 | 64.0 | 84.5 | 90.7 |
| | SWE-Bench Verified | 73.4 | 73.1 | 71.3 | 77.2 | 74.9 | 76.2 |
| | SWE-Bench Multilingual | 71.7 | 70.2 | 61.1 | 68.0 | 55.3 | — |
| Agentic Tasks | Terminal Bench Hard | 30.5 | 35.4 | 30.6 | 33.3 | 30.5 | 39.0 |
| | τ²-Bench | 80.3 | 80.3 | 74.3 | 84.7 | 80.2 | 85.4 |
| Long Context | LongBench V2 | 60.6 | 58.4 | 45.1 | — | — | 65.6 |
| | MRCR | 45.7 | — | 44.2 | — | — | 89.7 |
| General | Arena-Hard (Hard Prompt) | 54.1 | 53.4 | 71.9 | 63.3 | 71.9 | 72.6 |
- Strengths: Tops open-source models on SWE-Bench Verified (issue resolution) and is near the top of open-source results on AIME 2025 (math), with long-context scores outperforming full-attention models like K2-Thinking.
- Base Model Benchmarks: MMLU-Pro (73.2), MATH (71.0), HumanEval+ (70.7), and 96.7% on NIAH-Multi (256K context).
Availability and Usage
- Download: Model weights on Hugging Face at XiaomiMiMo/MiMo-V2-Flash. Includes base model and 3-layer MTP weights.
- API Access: Free limited-time trial on Xiaomi's platform (platform.xiaomimimo.com) and AI Studio (aistudio.xiaomimimo.com).
- Inference: Optimized for the SGLang framework (GitHub: sgl-project/sglang). Example launch flags: --tp-size 8 --enable-mtp --speculative-algorithm EAGLE. Use FP8 for efficiency.
- API Example (Chat Completion): {"messages": [{"role": "user", "content": "Your query here"}], "max_tokens": 1024, "temperature": 0.8, "top_p": 0.95} (a full request sketch follows this list).
- For agentic/math tasks: Set temperature=0.3.
- System Prompt: "You are MiMo, an AI assistant developed by Xiaomi. The current date is [date]."
- Tool-Use: Persist reasoning history across multi-turn calls.
- Technical Report: Available on GitHub (XiaomiMiMo/MiMo-V2-Flash).
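
Putting the API notes above together, here is a minimal request sketch in Python. The endpoint URL, environment variable, and response shape are assumptions (an OpenAI-compatible layout); only the payload fields and system prompt mirror the examples above.

```python
import os
import requests

# Placeholder endpoint and key variable -- check the Xiaomi platform docs
# for real values. Only the payload shape mirrors the example above; a
# "model" field may also be required depending on the endpoint.
API_URL = "https://platform.xiaomimimo.com/v1/chat/completions"  # assumed
API_KEY = os.environ["MIMO_API_KEY"]                              # assumed

messages = [
    {"role": "system",
     "content": "You are MiMo, an AI assistant developed by Xiaomi. "
                "The current date is [date]."},
    {"role": "user", "content": "Your query here"},
]

resp = requests.post(
    API_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "messages": messages,
        "max_tokens": 1024,
        "temperature": 0.8,   # drop to 0.3 for agentic/math tasks
        "top_p": 0.95,
    },
    timeout=60,
)

# Assumes an OpenAI-compatible response layout.
reply = resp.json()["choices"][0]["message"]
print(reply["content"])

# For multi-turn tool use, persist the full history (including any
# reasoning content) and append it before the next call.
messages.append(reply)
```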
Unique Aspects and Applications
- Web Development: Generates fully functional HTML pages from prompts, e.g., simulating macOS interfaces with interactive Terminal and Finder.
- Agentic Excellence: Strong in multi-step workflows, tool calling, and browsing (e.g., 58.3% on BrowseComp with context management).
- General Assistance: Handles philosophical discussions, creative writing, and everyday tasks with structured reasoning.
- Limitations: May underperform on very long contexts without the attention-sink bias adjustments; requires significant GPU resources (e.g., 8-way tensor parallelism for inference); agentic tasks need reasoning-history persistence.
MiMo-V2-Flash democratizes high-performance AI through its open-source release and efficiency, making it well suited to developers building agents, to coding assistants, and to researchers working in multilingual or long-context scenarios. For hands-on trials, start with the Hugging Face weights or the limited-time API trial.