XiaomiMiMo
MiMo-V2-Flash
MiMo-7B-Base
Unlocking the Reasoning Potential of Language Models From Pretraining to Posttraining

| 🤗 HuggingFace | 🤖️ ModelScope | 📄 Technical Report |

[2025.05.30] We scaled the SFT dataset from approximately 500K to 6M instances and continuously expanded the RL training window size from 32K to 48K. With this recipe, the performance of MiMo-7B-RL-0530 on AIME24 improves continuously and eventually surpasses that of DeepSeek R1 (79.8).

| Benchmark | MiMo-7B-RL | MiMo-7B-RL-0530 |
| ----------------------------- | :--------: | :-------------: |
| Mathematics | | |
| MATH500 (Pass@1) | 95.8 | 97.2 |
| AIME 2024 (Pass@1) | 68.2 | 80.1 |
| AIME 2025 (Pass@1) | 55.4 | 70.2 |
| Code | | |
| LiveCodeBench v5 (Pass@1) | 57.8 | 60.9 |
| LiveCodeBench v6 (Pass@1) | 49.3 | 52.2 |
| STEM | | |
| GPQA-Diamond (Pass@1) | 54.4 | 60.6 |
| General | | |
| AlignBench 1.1 (evaluated by GPT-4.1) | 6.9 | 7.4 |

Currently, most successful RL work, including open-source research, relies on relatively large base models, e.g., 32B models, particularly for enhancing code reasoning capabilities. Moreover, it was widely considered that achieving uniform and simultaneous improvements in both mathematical and code capabilities within a small model is challenging. Nonetheless, we believe that the effectiveness of an RL-trained reasoning model relies on the inherent reasoning potential of the base model. To fully unlock the reasoning potential of language models, efforts must focus not only on post-training but also on pre-training strategies tailored to reasoning.

In this work, we present MiMo-7B, a series of models trained from scratch and born for reasoning tasks. Our RL experiments starting from MiMo-7B-Base show that the model possesses extraordinary reasoning potential, even surpassing much larger 32B models. Additionally, we perform RL training on a cold-started SFT model, resulting in MiMo-7B-RL, which demonstrates superior performance on both mathematics and code reasoning tasks, matching the performance of OpenAI o1-mini. We open-source the MiMo-7B series, including checkpoints of the base model, the SFT model, the RL model trained from the base model, and the RL model trained from the SFT model.
We believe this report, along with the models, will provide valuable insights for developing powerful reasoning LLMs that benefit the larger community.

Pre-Training: Base Model Born for Reasoning
- We optimize the data preprocessing pipeline, enhancing text extraction toolkits and applying multi-dimensional data filtering to increase the reasoning-pattern density of the pre-training data. We also employ multiple strategies to generate massive, diverse synthetic reasoning data.
- We adopt a three-stage data mixture strategy for pre-training. Overall, MiMo-7B-Base is pre-trained on approximately 25 trillion tokens.
- We incorporate Multiple-Token Prediction (MTP) as an additional training objective, which enhances model performance and accelerates inference.

Post-Training Recipe: Pioneering Reasoning Model
- We curate 130K mathematics and code problems as RL training data, all verifiable by rule-based verifiers. Each problem undergoes careful cleaning and difficulty assessment to ensure quality. We employ only rule-based accuracy rewards to avoid potential reward hacking.
- To mitigate the sparse-reward issue on challenging code problems, we introduce a test-difficulty-driven code reward. By assigning fine-grained scores to test cases of varying difficulty, the policy can be optimized more effectively via dense reward signals.
- We implement a data re-sampling strategy for easy problems to enhance rollout sampling efficiency and stabilize policy updates, particularly in the later phases of RL training.

RL Infrastructure
- We develop a Seamless Rollout Engine to accelerate RL training and validation. Our design integrates continuous rollout, asynchronous reward computation, and early termination to minimize GPU idle time, achieving $2.29\times$ faster training and $1.96\times$ faster validation.
- We support MTP in vLLM and enhance the robustness of the inference engine in the RL system.
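The test-difficulty-driven code reward described above can be illustrated with a small sketch. The weighting scheme below is illustrative only (the report's exact scoring rule is not reproduced here): each test case contributes a difficulty weight, so partially correct solutions still receive a gradient signal instead of an all-or-nothing reward.

```python
def code_reward(passed, weights):
    """Dense code reward: each test case contributes a score scaled by its
    difficulty weight, instead of a single binary pass/fail signal.

    passed  -- list of booleans, one per test case
    weights -- difficulty weight per test case (harder test => larger weight)
    """
    total = sum(weights)
    earned = sum(w for ok, w in zip(passed, weights) if ok)
    return earned / total  # normalized to [0, 1]

# A solution passing only the two easy tests would earn 0 under a binary
# all-tests-must-pass reward, but gets partial credit here:
sparse = 1.0 if all([True, True, False]) else 0.0          # -> 0.0
dense = code_reward([True, True, False], [1.0, 1.0, 3.0])  # -> 0.4
```

The denser signal is what makes policy optimization tractable on hard problems, where fully passing solutions are initially rare.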
The MTP layers of MiMo-7B are tuned during pre-training and SFT and frozen during RL. With one MTP layer for speculative decoding, the acceptance rate is about 90%.

> Models are available at https://huggingface.co/XiaomiMiMo and https://www.modelscope.cn/organization/XiaomiMiMo

| Model | Description | Download (HuggingFace) | Download (ModelScope) |
| :-------------: | :---------------------------------------------------------------------------: | :---------------------------: | :----------------------------: |
| MiMo-7B-Base | Base model with extraordinary reasoning potential | 🤗 XiaomiMiMo/MiMo-7B-Base | 🤖️ XiaomiMiMo/MiMo-7B-Base |
| MiMo-7B-RL-Zero | RL model trained from the base model | 🤗 XiaomiMiMo/MiMo-7B-RL-Zero | 🤖️ XiaomiMiMo/MiMo-7B-RL-Zero |
| MiMo-7B-SFT | SFT model trained from the base model | 🤗 XiaomiMiMo/MiMo-7B-SFT | 🤖️ XiaomiMiMo/MiMo-7B-SFT |
| MiMo-7B-RL | RL model trained from the SFT model; superior performance matching OpenAI o1-mini | 🤗 XiaomiMiMo/MiMo-7B-RL | 🤖️ XiaomiMiMo/MiMo-7B-RL |

| Benchmark | GPT-4o-0513 | Claude-3.5-Sonnet-1022 | OpenAI o1-mini | QwQ-32B-Preview | R1-Distill-Qwen-14B | R1-Distill-Qwen-7B | MiMo-7B-RL |
| ----------------------------- | :---------: | :--------------------: | :------------: | :-------------: | :-----------------: | :----------------: | :--------: |
| General | | | | | | | |
| GPQA Diamond (Pass@1) | 49.9 | 65.0 | 60.0 | 54.5 | 59.1 | 49.1 | 54.4 |
| SuperGPQA (Pass@1) | 42.4 | 48.2 | 45.2 | 43.6 | 40.6 | 28.9 | 40.5 |
| DROP (3-shot F1) | 83.7 | 88.3 | 83.9 | 71.2 | 85.5 | 77.0 | 78.7 |
| MMLU-Pro (EM) | 72.6 | 78.0 | 80.3 | 52.0 | 68.8 | 53.5 | 58.6 |
| IF-Eval (Prompt Strict) | 84.3 | 86.5 | 84.8 | 40.4 | 78.3 | 60.5 | 61.0 |
| Mathematics | | | | | | | |
| MATH-500 (Pass@1) | 74.6 | 78.3 | 90.0 | 90.6 | 93.9 | 92.8 | 95.8 |
| AIME 2024 (Pass@1) | 9.3 | 16.0 | 63.6 | 50.0 | 69.7 | 55.5 | 68.2 |
| AIME 2025 (Pass@1) | 11.6 | 7.4 | 50.7 | 32.4 | 48.2 | 38.8 | 55.4 |
| Code | | | | | | | |
| LiveCodeBench v5 (Pass@1) | 32.9 | 38.9 | 53.8 | 41.9 | 53.1 | 37.6 | 57.8 |
| LiveCodeBench v6 (Pass@1) | 30.9 | 37.2 | 46.8 | 39.1 | 31.9 | 23.9 | 49.3 |

| Benchmark | MiMo-7B-Base | MiMo-7B-RL-Zero | MiMo-7B-SFT | MiMo-7B-RL |
| ----------------------------- | :----------: | :-------------: | :---------: | :--------: |
| Mathematics | | | | |
| MATH500 (Pass@1) | 37.4 | 93.6 | 93.0 | 95.8 |
| AIME 2024 (Pass@1) | 32.9 | 56.4 | 58.7 | 68.2 |
| AIME 2025 (Pass@1) | 24.3 | 46.3 | 44.3 | 55.4 |
| Code | | | | |
| LiveCodeBench v5 (Pass@1) | 32.9 | 49.1 | 52.3 | 57.8 |
| LiveCodeBench v6 (Pass@1) | 29.1 | 42.9 | 45.5 | 49.3 |

> [!IMPORTANT]
> The evaluations are conducted with `temperature=0.6`.
>
> AIME24 and AIME25 scores are averaged over 32 repetitions. LiveCodeBench v5 (20240801-20250201), LiveCodeBench v6 (20250201-20250501), GPQA-Diamond, and IF-Eval scores are averaged over 8 repetitions. MATH500 and SuperGPQA are single runs.

Thanks to MiMo model and MTP support from the SGLang team, MiMo is supported in SGLang mainline.

1. [Recommended] We officially support inference with MiMo-MTP using our fork of vLLM.
2. Alternatively, you can register a vLLM loader for MiMo without loading the MTP parameters: copy `registry/register_mimo_in_vllm.py` into your directory and import it with `import register_mimo_in_vllm`.

- We recommend using our fork of vLLM, which is based on vLLM 0.7.3.
- We recommend using an empty system prompt.

> We haven't verified MiMo with other inference engines and welcome contributions based on the model definition in the HuggingFace repo 💻.

Please contact us at [email protected] or open an issue if you have any questions.
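The quoted ~90% acceptance rate translates directly into decoding throughput. A simplified back-of-envelope sketch (assuming independent per-token acceptance, and ignoring verification overhead, neither of which the report states):

```python
def expected_tokens_per_step(acceptance_rate: float, draft_len: int = 1) -> float:
    """Expected tokens emitted per decoding step under speculative decoding:
    the target model always contributes one token, plus each draft token that
    is accepted (draft token k is only reached if tokens 1..k-1 were accepted,
    giving the geometric sum below)."""
    a = acceptance_rate
    return 1 + sum(a ** k for k in range(1, draft_len + 1))

# With one MTP draft layer and ~90% acceptance, each step yields ~1.9 tokens,
# i.e. up to ~1.9x decoding throughput before verification costs.
speedup = expected_tokens_per_step(0.9, draft_len=1)  # -> 1.9
```

This is only a rough model; real-world gains depend on batch size and how cheaply the draft tokens are produced.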
MiMo-7B-RL
MiMo-VL-7B-SFT
| 🤗 HuggingFace | 🤖️ ModelScope | 📄 Technical Report | 💻 GitHub Repo |

In this report, we share our efforts to build a compact yet powerful VLM, MiMo-VL-7B. MiMo-VL-7B comprises (1) a native-resolution ViT encoder that preserves fine-grained visual details, (2) an MLP projector for efficient cross-modal alignment, and (3) our MiMo-7B language model, specifically optimized for complex reasoning tasks.

The development of MiMo-VL-7B involves two sequential training processes: (1) a four-stage pre-training phase, comprising projector warmup, vision-language alignment, general multi-modal pre-training, and long-context Supervised Fine-Tuning (SFT), which yields the MiMo-VL-7B-SFT model; and (2) a subsequent post-training phase, where we introduce Mixed On-policy Reinforcement Learning (MORL), a novel framework that seamlessly integrates diverse reward signals spanning perception accuracy, visual grounding precision, logical reasoning capabilities, and human/AI preferences, which yields the MiMo-VL-7B-RL model.

We open-source the MiMo-VL-7B series, including checkpoints of the SFT and RL models. We believe this report, along with the models, will provide valuable insights for developing powerful reasoning VLMs that benefit the larger community.

- Incorporating high-quality, broad-coverage reasoning data from the pre-training stage is crucial for enhancing model performance.
  - We curate high-quality reasoning data by identifying diverse queries, employing large reasoning models to regenerate responses with long CoT, and applying rejection sampling to ensure quality.
  - Rather than treating this as supplementary fine-tuning data, we incorporate substantial volumes of this synthetic reasoning data directly into the later pre-training stages, where extended training yields continued performance improvements without saturation.
- Mixed On-policy Reinforcement Learning further enhances model performance, while achieving stable, simultaneous improvements remains challenging.
  - We apply RL across diverse capabilities, including reasoning, perception, grounding, and human preference alignment, spanning text, image, and video modalities. While this hybrid training approach further unlocks the model's potential, interference across data domains remains a challenge.

> Models are available at Huggingface Collections: MiMo-VL and ModelScope Collections: MiMo-VL

| Model | Description | Download (HuggingFace) | Download (ModelScope) |
| :------------: | :-------------------------------------------------------------------: | :--------------------------: | :---------------------------: |
| MiMo-VL-7B-SFT | VLM with extraordinary reasoning potential after 4-stage pre-training | 🤗 XiaomiMiMo/MiMo-VL-7B-SFT | 🤖️ XiaomiMiMo/MiMo-VL-7B-SFT |
| MiMo-VL-7B-RL | RL model leapfrogging existing open-source models | 🤗 XiaomiMiMo/MiMo-VL-7B-RL | 🤖️ XiaomiMiMo/MiMo-VL-7B-RL |

In general visual-language understanding, the MiMo-VL-7B models achieve state-of-the-art open-source results. In multi-modal reasoning, both the SFT and RL models significantly outperform all compared open-source baselines across these benchmarks.

> [!IMPORTANT]
> Results marked with \* are obtained using our evaluation framework.
> Tasks marked with ${\dagger}$ are evaluated by GPT-4o.

MiMo-VL-7B-RL possesses exceptional GUI understanding and grounding capabilities. As a general-purpose VL model, MiMo-VL achieves comparable or even superior performance to GUI-specialized models. With our in-house evaluation dataset and GPT-4o judgments, MiMo-VL-7B-RL achieves the highest Elo rating among all evaluated open-source vision-language models, ranking first across models spanning 7B to 72B parameters.
The MiMo-VL-7B series maintains full compatibility with the `Qwen2_5_VLForConditionalGeneration` architecture for deployment and inference. Please contact us at [email protected] or open an issue if you have any questions.
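The three-component pipeline described above (ViT encoder, MLP projector, language model) can be sketched in miniature. Everything here is a toy stand-in with made-up dimensions; the point is only the data flow: per-patch vision features are projected into the LM's embedding space and prepended to the text token sequence.

```python
# Toy sketch of the MiMo-VL composition; dimensions and weights are invented.
VIT_DIM, LM_DIM = 4, 6   # real models use much larger dimensions

def vit_encode(patches):
    """Stand-in for the native-resolution ViT: one feature vector per patch."""
    return [[float(i + j) for j in range(VIT_DIM)] for i in range(len(patches))]

def mlp_project(features, w):
    """MLP projector: maps each VIT_DIM vector into the LM's LM_DIM space."""
    return [[sum(x * w[i][o] for i, x in enumerate(vec)) for o in range(LM_DIM)]
            for vec in features]

w = [[0.1] * LM_DIM for _ in range(VIT_DIM)]      # dummy projector weights
patches = ["patch"] * 9                           # e.g. a 3x3 grid of patches
vision_tokens = mlp_project(vit_encode(patches), w)
text_tokens = [[0.0] * LM_DIM for _ in range(5)]  # embedded text prompt

# The language model consumes one sequence: visual tokens first, then text.
lm_input = vision_tokens + text_tokens            # 9 + 5 = 14 tokens of dim 6
```

The native-resolution design means the number of vision tokens varies with image size, which is why the projector, not the encoder, fixes the LM-facing dimension.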
MiMo-VL-7B-RL
MiMo-Audio-7B-Instruct
MiMo Audio: Audio Language Models are Few-Shot Learners

| 💻 GitHub | 📄 Paper | 📰 Blog | 🔥 Online Demo | 📊 MiMo-Audio-Eval |

Existing audio language models typically rely on task-specific fine-tuning to accomplish particular audio tasks. In contrast, humans can generalize to new audio tasks with only a few examples or simple instructions. GPT-3 showed that scaling next-token-prediction pretraining enables strong generalization in text, and we believe this paradigm is equally applicable to the audio domain. By scaling MiMo-Audio's pretraining data to over one hundred million hours, we observe the emergence of few-shot learning capabilities across a diverse set of audio tasks. We develop a systematic evaluation of these capabilities and find that MiMo-Audio-7B-Base achieves SOTA performance on both speech intelligence and audio understanding benchmarks among open-source models. Beyond standard metrics, MiMo-Audio-7B-Base generalizes to tasks absent from its training data, such as voice conversion, style transfer, and speech editing. MiMo-Audio-7B-Base also demonstrates powerful speech continuation capabilities, generating highly realistic talk shows, recitations, livestreams, and debates. At the post-training stage, we curate a diverse instruction-tuning corpus and introduce thinking mechanisms into both audio understanding and generation. MiMo-Audio-7B-Instruct achieves open-source SOTA on audio understanding benchmarks, spoken dialogue benchmarks, and instruct-TTS evaluations, approaching or surpassing closed-source models.

Architecture

MiMo-Audio-Tokenizer

MiMo-Audio-Tokenizer is a 1.2B-parameter Transformer operating at 25 Hz. It employs an eight-layer RVQ stack to generate 200 tokens per second. By jointly optimizing semantic and reconstruction objectives, we train MiMo-Audio-Tokenizer from scratch on a 10-million-hour corpus, achieving superior reconstruction quality and facilitating downstream language modeling.
MiMo-Audio

MiMo-Audio couples a patch encoder, an LLM, and a patch decoder to improve modeling efficiency for high-rate sequences and to bridge the length mismatch between speech and text. The patch encoder aggregates four consecutive time steps of RVQ tokens into a single patch, downsampling the sequence to a 6.25 Hz representation for the LLM. The patch decoder autoregressively generates the full 25 Hz RVQ token sequence via a delayed-generation scheme.

Explore MiMo-Audio Now! 🎉
- 🎧 Try the Hugging Face demo: MiMo-Audio Demo
- 📰 Read the official blog: MiMo-Audio Blog
- 📄 Dive into the technical report: MiMo-Audio Technical Report

Model Download

| Models | 🤗 Hugging Face |
| ------ | --------------- |
| MiMo-Audio-Tokenizer | XiaomiMiMo/MiMo-Audio-Tokenizer |
| MiMo-Audio-7B-Base | XiaomiMiMo/MiMo-Audio-7B-Base |
| MiMo-Audio-7B-Instruct | XiaomiMiMo/MiMo-Audio-7B-Instruct |

Spin up the MiMo-Audio demo in minutes with the built-in Gradio app.

> [!Note]
> If the compilation of flash-attn takes too long, you can download the precompiled wheel and install it manually:
>
> Download Precompiled Wheel

This launches a local Gradio interface where you can try MiMo-Audio interactively. Enter the local paths for `MiMo-Audio-Tokenizer` and `MiMo-Audio-7B-Instruct`, then enjoy the full functionality of MiMo-Audio!

Base Model

We provide an example script to explore the in-context learning capabilities of `MiMo-Audio-7B-Base`. See `inference_example_pretrain.py`.

Instruct Model

To try the instruction-tuned model `MiMo-Audio-7B-Instruct`, use the corresponding inference script. See `inference_example_sft.py`.

Evaluation Toolkit

The full evaluation suite is available at 📊 MiMo-Audio-Eval. This toolkit is designed to evaluate MiMo-Audio and other recent audio LLMs mentioned in the paper. It provides a flexible and extensible framework supporting a wide range of datasets, tasks, and models.

Please contact us at [email protected] or open an issue if you have any questions.
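The token-rate bookkeeping above (25 Hz tokenizer, 8 RVQ layers, patches of 4 time steps) can be sketched directly. Only the numbers quoted in the text are used; the `patchify` helper is an illustrative stand-in for the patch encoder's grouping step, not the actual implementation.

```python
# Rate arithmetic from the card: 25 Hz frames, 8 RVQ codebooks, patch size 4.
FRAME_RATE_HZ = 25
RVQ_LAYERS = 8
PATCH_SIZE = 4    # consecutive time steps aggregated per patch

tokens_per_second = FRAME_RATE_HZ * RVQ_LAYERS   # 25 * 8 = 200 tokens/s
patch_rate_hz = FRAME_RATE_HZ / PATCH_SIZE       # 25 / 4 = 6.25 Hz for the LLM

def patchify(frames, patch_size=PATCH_SIZE):
    """Group consecutive RVQ frames (each a list of 8 codes) into patches,
    shortening the sequence the LLM must model by a factor of patch_size."""
    return [frames[i:i + patch_size] for i in range(0, len(frames), patch_size)]

four_seconds = [[0] * RVQ_LAYERS for _ in range(4 * FRAME_RATE_HZ)]  # 100 frames
patches = patchify(four_seconds)  # 25 patches, i.e. 6.25 patches per second
```

This is why the patch encoder/decoder pair matters: the LLM models a 6.25 Hz sequence while the decoder restores the full 200 tokens-per-second stream.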
MiMo-Audio-Tokenizer
MiMo-VL-7B-RL-2508
MiMo-VL-7B-SFT-2508
| 🤗 HuggingFace | 🤖️ ModelScope | 📖 Technical Report | 📄 Paper |

We're excited to announce improvements to MiMo-VL (MiMo-VL-7B-RL-2508 and MiMo-VL-7B-SFT-2508), featuring enhanced performance across multiple benchmarks, improved thinking-control capabilities, and a better user experience. MiMo-VL-7B-RL-2508 demonstrates consistent improvements across both image and video benchmarks, achieving notable milestones of 70.6 on MMMU and 70.8 on VideoMME.

A thinking-control capability allows users to turn off the model's reasoning mode using the `/nothink` parameter:
- Thinking mode (default): full reasoning process visible, with a 100% control success rate;
- Non-thinking mode: direct responses without reasoning, with a 99.84% control success rate.

Our internal VLM Arena ratings show a meaningful improvement in real-world performance:
- Current model (MiMo-VL-7B-RL-2508): 1131.2 rating
- Previous version (MiMo-VL-7B-RL): 1093.9 rating

These updates deliver a more capable, flexible, and reliable vision-language model for both academic evaluation and practical applications.

📌 Case Study: What are the appealing features of this car?

Both versions of the MiMo-VL-7B-2508 model are now open-sourced on Hugging Face:
- 🤗 MiMo-VL-7B-RL-2508: recommended for most users.
- 🤗 MiMo-VL-7B-SFT-2508: users may perform SFT and RL based on this model. Compared to the previous SFT version, it demonstrates higher RL stability.

Deployment Parameters
- `temperature=0.3`, `top_p=0.95`
- The system prompt is already set in `chat_template.json` and requires no additional configuration.

Users can control the thinking mode by appending `/nothink` to queries:
- Thinking-mode query (default): "What is the answer to the question in the image?"
- Non-thinking-mode query: "Identify the text in the image. /nothink"

⚠️ Important: `/nothink` must be the very last part of the user message; no user content, such as an image or video, may follow it.

Placing Visual Input

For prompts with a single image or video, always place the visual media before the text.
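The two placement rules above (visual media before text, `/nothink` strictly last) can be sketched as a small message-builder. The OpenAI-style `content` schema used here is a common convention for multimodal chat APIs, shown only as an illustration; adapt it to your serving stack's actual message format.

```python
def build_user_message(text, image_url=None, nothink=False):
    """Build a user chat message that (1) places visual media before the text
    and (2) appends /nothink as the very last content when reasoning is off."""
    content = []
    if image_url is not None:
        # Rule 1: the image/video entry must precede the text entry.
        content.append({"type": "image_url", "image_url": {"url": image_url}})
    if nothink:
        # Rule 2: /nothink is the final token of the final (text) entry.
        text = text.rstrip() + " /nothink"
    content.append({"type": "text", "text": text})
    return {"role": "user", "content": content}

msg = build_user_message("Identify the text in the image.",
                         image_url="https://example.com/sign.png",  # hypothetical URL
                         nothink=True)
```

With `nothink=False` (the default), the query is sent unchanged and the model produces its full reasoning trace.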
The MiMo-VL-7B series maintains full compatibility with the `Qwen2_5_VLForConditionalGeneration` architecture for deployment and inference. Please contact us at [email protected] or open an issue if you have any questions.
MiMo-VL-7B-SFT-GGUF
MiMo VL 7B RL GGUF