Xiaomi MiMo-V2-Flash

Overview

Xiaomi MiMo-V2-Flash is an open-source Mixture-of-Experts (MoE) large language model (LLM) developed by Xiaomi and released on December 16, 2025. It pairs frontier-level benchmark performance with fast, low-cost inference. Designed primarily for reasoning, coding, agentic tasks, and long-context applications, it builds on Xiaomi's earlier MiMo models and is positioned as a competitive alternative to models such as DeepSeek-V3.2 and Claude Sonnet 4.5. The model is fully open-sourced under the MIT license, making it accessible for research and deployment.

Key Features and Technical Specifications

  • Parameters: 309 billion total parameters, with only 15 billion active per forward pass, enabling massive scale without proportional compute demands.
  • Context Window: Up to 256,000 tokens, supporting extended agent interactions, tool calls, and multi-turn conversations.
  • Inference Speed: Up to 150 tokens per second, with a 2.0–2.6× speedup via native Multi-Token Prediction (MTP).
  • Cost Efficiency: $0.1 per million input tokens and $0.3 per million output tokens (assuming a 3:1 input-to-output ratio; see the cost sketch after this list).
  • Training Data: Trained on 27 trillion tokens in FP8 mixed precision for efficiency.
  • Modes: Supports a "hybrid thinking mode" for step-by-step reasoning or instant responses, toggleable by users.
  • Multilingual Focus: Strong performance in Chinese and English, with emphasis on agentic workflows like web development and code generation.
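
To make the pricing concrete, here is a minimal cost sketch in Python. The prices come from the list above; the function name and the example token counts are illustrative, not part of any official SDK.

```python
# Illustrative cost estimate for MiMo-V2-Flash API usage.
# Prices are taken from the feature list above.

INPUT_PRICE_PER_M = 0.10   # USD per million input tokens
OUTPUT_PRICE_PER_M = 0.30  # USD per million output tokens

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of a single request."""
    return (input_tokens / 1e6) * INPUT_PRICE_PER_M \
         + (output_tokens / 1e6) * OUTPUT_PRICE_PER_M

# Example: a long-context call with the stated 3:1 input/output ratio.
print(f"${estimate_cost(150_000, 50_000):.4f}")  # -> $0.0300
```

At these rates, even a request that fills the full 256K-token context costs only a few cents.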

Architecture

MiMo-V2-Flash employs a hybrid attention mechanism that balances efficiency with long-context capability:

  • Sliding Window Attention (SWA) with an aggressive 128-token window for local focus.
  • Global Attention (GA) interleaved with SWA in a 5:1 ratio (GA:SWA) across 8 hybrid blocks (see the mask sketch after this list).
  • Learnable Attention Sink Bias: Maintains performance over ultra-long contexts.
  • Multi-Token Prediction (MTP) Module: A lightweight 0.33B-parameter addition per block that uses dense Feed-Forward Networks (FFNs) and SWA. It enables self-speculative decoding, accepting 2.8–3.6 tokens per draft for faster generation (a decoding sketch follows the next paragraph).
  • Post-Training: Uses MOPD (Multi-Teacher Online Policy Distillation), an efficient on-policy method that distills knowledge from multiple expert teachers at less than 1/50th of the compute of traditional Supervised Fine-Tuning (SFT) plus Reinforcement Learning (RL) pipelines, supporting continuous self-improvement.
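
To illustrate how the two attention types combine, here is a minimal NumPy sketch of the two mask patterns. The 128-token window follows the figure above; the `ga_every` interleaving parameter is an assumption for illustration, since the exact block layout is not spelled out here.

```python
import numpy as np

def causal_mask(n: int) -> np.ndarray:
    """Full (global) causal attention: token i attends to every j <= i."""
    return np.tril(np.ones((n, n), dtype=bool))

def sliding_window_mask(n: int, window: int = 128) -> np.ndarray:
    """Causal attention restricted to the most recent `window` tokens."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return (j <= i) & (j > i - window)

def layer_mask(layer_idx: int, n: int, ga_every: int = 6) -> np.ndarray:
    # Which layers are global vs. windowed is a free parameter here;
    # MiMo-V2-Flash interleaves GA and SWA layers across hybrid blocks.
    if layer_idx % ga_every == 0:
        return causal_mask(n)         # occasional global layer
    return sliding_window_mask(n)     # cheap local layer otherwise
```

Because the SWA layers touch only a 128-token neighborhood, their cost stays flat as the context grows, while the sparse global layers preserve long-range retrieval.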

The base model (MiMo-V2-Flash-Base) is also released, allowing fine-tuning for specialized tasks.
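
The MTP module's self-speculative decoding is easiest to see as a draft-then-verify loop. The sketch below uses placeholder `draft` and `verify` callables standing in for the MTP head and the full model, and the acceptance rule (keep the longest agreeing prefix plus one corrected token) is the standard speculative-decoding scheme, not necessarily Xiaomi's exact implementation.

```python
from typing import Callable, List

def speculative_step(
    draft: Callable[[List[int], int], List[int]],   # cheap MTP head
    verify: Callable[[List[int], int], List[int]],  # full model, one pass
    context: List[int],
    k: int = 3,                                     # draft length
) -> List[int]:
    """One draft-then-verify step of self-speculative decoding."""
    proposed = draft(context, k)           # k cheap draft tokens
    checked = verify(context, k)           # verifier tokens for the same slots
    accepted: List[int] = []
    for p, c in zip(proposed, checked):
        accepted.append(c)                 # verifier output is always kept
        if p != c:                         # first mismatch ends the draft
            break
    return context + accepted

# Toy demo: the draft gets its last guess wrong, the verifier corrects it.
target = [1, 2, 3, 4, 5, 6]
draft = lambda ctx, k: target[len(ctx):len(ctx) + k - 1] + [0]
verify = lambda ctx, k: target[len(ctx):len(ctx) + k]
print(speculative_step(draft, verify, [1, 2], k=3))  # -> [1, 2, 3, 4, 5]
```

At the reported 2.8–3.6 accepted tokens per draft, each full-model pass advances generation by roughly three tokens, which is where the 2.0–2.6× speedup comes from.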

Benchmarks and Performance Comparisons

MiMo-V2-Flash excels in open-source rankings, particularly on coding and agentic benchmarks, and approaches closed-source leaders like GPT-5 and Gemini 3.0 Pro in several categories. It ranks #1 among open-source models on SWE-Bench Verified and leads in math/reasoning tasks like AIME 2025.

Here's a summary of key benchmarks (post-training version):

| Category | Benchmark | MiMo-V2-Flash | DeepSeek-V3.2 | K2-Thinking | Claude Sonnet 4.5 | GPT-5 (High) | Gemini 3.0 Pro |
|---|---|---|---|---|---|---|---|
| Reasoning | MMLU-Pro | 84.9 | 85.0 | 84.6 | 88.2 | 87.5 | 90.1 |
| | GPQA-Diamond | 83.7 | 82.4 | 84.5 | 83.4 | 85.7 | 91.9 |
| | AIME 2025 | 94.1 | 93.1 | 94.5 | 87.0 | 94.6 | 95.0 |
| | HMMT Feb. 2025 | 84.4 | 92.5 | 89.4 | 79.2 | 88.3 | 97.5 |
| Coding | LiveCodeBench-v6 | 80.6 | 83.3 | 83.1 | 64.0 | 84.5 | 90.7 |
| | SWE-Bench Verified | 73.4 | 73.1 | 71.3 | 77.2 | 74.9 | 76.2 |
| | SWE-Bench Multilingual | 71.7 | 70.2 | 61.1 | 68.0 | 55.3 | — |
| Agentic Tasks | Terminal Bench Hard | 30.5 | 35.4 | 30.6 | 33.3 | 30.5 | 39.0 |
| | τ²-Bench | 80.3 | 80.3 | 74.3 | 84.7 | 80.2 | 85.4 |
| Long Context | LongBench V2 | 60.6 | 58.4 | 45.1 | 65.6 | — | — |
| | MRCR | 45.7 | 44.2 | — | — | — | 89.7 |
| General | Arena-Hard (Hard Prompt) | 54.1 | 53.4 | 71.9 | 63.3 | 71.9 | 72.6 |

(— = no reported score.)
  • Strengths: Tops open-source models on SWE-Bench Verified (code-issue resolution) and AIME 2025 (math), and its long-context scores beat full-attention models like K2-Thinking.
  • Base Model Benchmarks: MMLU-Pro 73.2, MATH 71.0, HumanEval+ 70.7, and 96.7% on NIAH-Multi at 256K context.

Availability and Usage

  • Download: Model weights on Hugging Face at XiaomiMiMo/MiMo-V2-Flash. Includes base model and 3-layer MTP weights.
  • API Access: Free limited-time trial on Xiaomi's platform (platform.xiaomimimo.com) and AI Studio (aistudio.xiaomimimo.com).
  • Inference: Optimized for the SGLang framework (GitHub: sgl-project/sglang). Example launch: python -m sglang.launch_server --model-path XiaomiMiMo/MiMo-V2-Flash --tp-size 8 --enable-mtp --speculative-algorithm EAGLE. Use FP8 weights for efficiency.
  • API Example (Chat Completion):

        {
          "messages": [{"role": "user", "content": "Your query here"}],
          "max_tokens": 1024,
          "temperature": 0.8,
          "top_p": 0.95
        }
    • For agentic/math tasks: Set temperature=0.3.
    • System Prompt: "You are MiMo, an AI assistant developed by Xiaomi. The current date is [date]."
    • Tool-Use: Persist reasoning history across multi-turn calls (a Python client sketch follows this list).
  • Technical Report: Available on GitHub (XiaomiMiMo/MiMo-V2-Flash).
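
As a usage illustration, here is a minimal Python client sketch against the OpenAI-compatible endpoint that an SGLang server exposes by default. The base URL, port, and model identifier are assumptions for a local deployment, not documented values for Xiaomi's hosted platform.

```python
import requests

# Assumed local SGLang server with its default OpenAI-compatible API;
# URL, port, and model name are illustrative placeholders.
BASE_URL = "http://localhost:30000/v1/chat/completions"

payload = {
    "model": "XiaomiMiMo/MiMo-V2-Flash",
    "messages": [
        # Recommended system prompt; fill in the current date.
        {"role": "system", "content": ("You are MiMo, an AI assistant "
         "developed by Xiaomi. The current date is [date].")},
        {"role": "user", "content": "Your query here"},
    ],
    "max_tokens": 1024,
    "temperature": 0.8,   # drop to 0.3 for agentic/math tasks
    "top_p": 0.95,
}

resp = requests.post(BASE_URL, json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```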

Unique Aspects and Applications

  • Web Development: Generates fully functional HTML pages from prompts, e.g., simulating macOS interfaces with interactive Terminal and Finder.
  • Agentic Excellence: Strong in multi-step workflows, tool calling, and browsing (e.g., 58.3% on BrowseComp with context management).
  • General Assistance: Handles philosophical discussions, creative writing, and everyday tasks with structured reasoning.
  • Limitations: May underperform on very long contexts without the attention-sink bias adjustments; requires significant GPU resources (e.g., 8-way tensor parallelism for inference); agentic tasks need history persistence (see the sketch below).
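
Because tool-use quality depends on persisting reasoning history, the following sketch shows one way a multi-turn loop can carry earlier turns forward. The message-role schema mirrors the OpenAI-style payload above; `call_model` is a placeholder for the request function from the usage section.

```python
# Minimal multi-turn loop that keeps the whole conversation, including
# assistant replies (and any tool results), visible to later turns.
history = [
    {"role": "system", "content": ("You are MiMo, an AI assistant "
     "developed by Xiaomi. The current date is [date].")},
]

def chat(user_msg: str, call_model) -> str:
    """Append the user turn, call the model, and store its reply so the
    next turn retains the full reasoning context."""
    history.append({"role": "user", "content": user_msg})
    reply = call_model(history)   # e.g., the requests call shown earlier
    history.append({"role": "assistant", "content": reply})
    return reply
```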

MiMo-V2-Flash democratizes high-performance AI through its open-source license and inference efficiency, making it a strong fit for developers building agents, coding assistants, and research in multilingual or long-context scenarios. For hands-on trials, start with the Hugging Face weights or the free API.
