# InteractiveOmni

InteractiveOmni-4B 🤗 | InteractiveOmni-8B 🤗 | [📑 Paper](https://arxiv.org/abs/2510.13747)

## Introduction

InteractiveOmni is a unified omni-modal model that simultaneously accepts image, audio, text, and video inputs and directly generates coherent text and speech streams, achieving truly integrated interaction.

*Schematic diagram of multi-turn audio-visual interaction.*

## Key Features

- **Strong Performance Across Modalities:** Exhibits omni-modal understanding and speech-generation capabilities, outperforming similarly sized vision-language, audio-language, and omni-modal models.
- **State-of-the-Art Performance:** Achieves SOTA results on a range of open-source benchmarks for image, audio, and video understanding, as well as speech conversation.
- **Excellent Interactive Performance:** Delivers a more intelligent audio-visual experience through multi-turn and long-term memory capabilities.
- **Multi-turn Interactive Benchmarks:** Proposes multi-modal, multi-turn benchmarks to evaluate the multi-turn memory and speech-interaction abilities of leading MLLMs.
- **On-device Model:** The 4B model achieves 97% of the 8B model's performance with only 50% of its size.

## Model Architecture

*See the paper for the detailed architecture.*

## Quickstart

We provide example code for running `InteractiveOmni` with 🤗 Transformers.

> Please use `transformers>=4.51.0` and FlashAttention2 to ensure the model works as expected.

### Model Loading
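The snippet below is a minimal loading sketch, not the official example: the checkpoint path is illustrative, and `trust_remote_code=True` is assumed because omni-modal checkpoints typically ship custom modeling code.

```python
import torch
from transformers import AutoModel, AutoProcessor

# Illustrative checkpoint path; substitute the actual 🤗 Hub repository id.
MODEL_PATH = "InteractiveOmni-8B"

# Load in bfloat16 with FlashAttention2, as recommended above.
model = AutoModel.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    trust_remote_code=True,  # assumed: the repo ships custom modeling code
).eval().to("cuda")

processor = AutoProcessor.from_pretrained(MODEL_PATH, trust_remote_code=True)
```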
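### Use audio output

If you need audio output, the system prompt must be set as shown below; otherwise audio output may not work as expected. The exact prompt string and generation interface are defined by the model repository, so this is only a sketch of where they plug in: the message schema, the `generate` return signature, and the 24 kHz sample rate are assumptions.

```python
import soundfile as sf

# Placeholder; substitute the official audio-output system prompt from the
# model repository. Without it, speech generation may not work as expected.
AUDIO_SYSTEM_PROMPT = "..."

messages = [
    {"role": "system", "content": AUDIO_SYSTEM_PROMPT},
    {
        "role": "user",
        "content": [
            # Field names follow the common omni-model chat convention (assumed).
            {"type": "audio", "audio": "question.wav"},
            {"type": "text", "text": "Please answer the question in the audio."},
        ],
    },
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

# Assumed interface: generation returns text token ids plus a speech waveform.
text_ids, audio = model.generate(**inputs, max_new_tokens=512)

print(processor.batch_decode(text_ids, skip_special_tokens=True)[0])
sf.write("answer.wav", audio.reshape(-1).float().cpu().numpy(), samplerate=24000)
```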
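You can also use a custom speaker to generate the output audio, similar to voice cloning. How the reference voice is passed is again repository-specific; continuing from the sketch above, the `speaker_audio` keyword below is a hypothetical illustration of conditioning generation on a short reference recording.

```python
# Hypothetical 'speaker_audio' argument: a short reference recording whose
# timbre the generated speech should imitate (voice cloning).
text_ids, audio = model.generate(
    **inputs,
    max_new_tokens=512,
    speaker_audio="reference_speaker.wav",
)
sf.write("cloned_answer.wav", audio.reshape(-1).float().cpu().numpy(), samplerate=24000)
```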
## Evaluation

InteractiveOmni achieves state-of-the-art performance across a wide range of multi-modal understanding and speech generation benchmarks.

**Image understanding**

| Model | MMBench | MMStar | MMMU | MathVista | HallusionBench | AI2D | OCRBench | Avg |
|---|---|---|---|---|---|---|---|---|
| InternVL3-8B | 82.1 | 68.7 | 62.2 | 70.5 | 49.0 | 85.1 | 88.4 | 72.3 |
| InternVL3.5-8B | 79.5 | 69.3 | 73.4 | 78.4 | 54.5 | 84.0 | 84.0 | 74.7 |
| Qwen2.5-VL-7B | 82.2 | 64.1 | 58.0 | 68.1 | 51.9 | 84.3 | 88.8 | 71.1 |
| GPT-4o-mini | 76.0 | 54.8 | 60.0 | 52.5 | 46.1 | 77.8 | 78.5 | 63.7 |
| Ming-Lite-Omni | 80.8 | 64.7 | 56.3 | 71.6 | 55.0 | 83.1 | 88.4 | 71.4 |
| Qwen2.5-Omni-7B | 81.3 | 64.0 | 59.2 | 67.9 | 47.4 | 83.2 | 83.4 | 69.5 |
| InteractiveOmni-4B | 78.9 | 62.6 | 61.1 | 61.7 | 52.2 | 83.8 | 80.0 | 68.6 |
| InteractiveOmni-8B | 81.4 | 66.8 | 66.9 | 68.0 | 61.3 | 84.3 | 83.7 | 73.2 |

**Video understanding**

| Model | Video-MME (wo sub) | Video-MME (w sub) | MLVU (M-Avg) | LongVideoBench (val total) | Avg |
|---|---|---|---|---|---|

**Speech recognition (WER %, lower is better)**

| Dataset | Qwen2-Audio | Step-Audio-Chat | Kimi-Audio | Qwen2.5-Omni-7B | InteractiveOmni-4B | InteractiveOmni-8B |
|---|---|---|---|---|---|---|
| WenetSpeech test-net | 10.60 | 8.75 | 5.37 | 5.90 | 5.40 | 5.04 |
| WenetSpeech test-meeting | 10.68 | 9.52 | 6.28 | 7.70 | 6.95 | 5.55 |
| LibriSpeech test-clean | 1.60 | 3.19 | 1.28 | 1.80 | 1.73 | 1.64 |
| LibriSpeech test-other | 3.60 | 10.67 | 2.42 | 3.40 | 3.69 | 3.41 |

**OpenAudioBench**

| Model | Reasoning QA | Llama Questions | Web Questions | TriviaQA | AlpacaEval | Avg |
|---|---|---|---|---|---|---|
| Qwen2-Audio | 42.77 | 69.67 | 45.20 | 40.30 | 57.19 | 51.03 |
| GLM-4-Voice | 47.43 | 76.00 | 55.40 | 51.80 | 57.89 | 57.70 |
| VITA-1.5 | 41.00 | 74.20 | 57.30 | 46.80 | 68.20 | 57.50 |
| Step-Audio-chat | 60.00 | 72.33 | 73.00 | 56.80 | 56.53 | 63.73 |
| Baichuan-Audio | 41.90 | 78.40 | 64.50 | 61.70 | 77.40 | 64.78 |
| Kimi-Audio | 58.02 | 79.33 | 70.20 | 62.10 | 75.73 | 69.08 |
| MiniCPM-o-2.6 | 38.60 | 77.80 | 68.60 | 61.90 | 51.80 | 59.74 |
| Baichuan-Omni-1.5 | 50.00 | 78.50 | 59.10 | 57.20 | 77.90 | 64.54 |
| Qwen2.5-Omni-7B | 63.76 | 75.33 | 62.80 | 57.06 | 72.76 | 66.34 |
| InteractiveOmni-4B | 69.11 | 79.33 | 65.80 | 56.40 | 74.87 | 69.10 |
| InteractiveOmni-8B | 71.68 | 80.67 | 70.30 | 66.50 | 74.57 | 72.74 |

**VoiceBench**

| Model | AlpacaEval | CommonEval | WildVoice | SD-QA | MMSU |
|---|---|---|---|---|---|
| Qwen2-Audio | 3.69 | 3.40 | 3.01 | 35.35 | 35.43 |
| Baichuan-Omni-1.5 | 4.50 | 4.05 | 4.06 | 43.40 | 57.25 |
| InteractiveOmni-4B | 4.27 | 4.20 | 3.94 | 41.41 | 63.24 |
| InteractiveOmni-8B | 4.61 | 4.34 | 4.21 | 44.67 | 65.26 |

| Model | OpenBookQA | IFEval | BBH | AdvBench | Avg |
|---|---|---|---|---|---|
| Qwen2-Audio | 49.01 | 54.70 | 22.57 | 98.85 | 55.32 |
| Step-Audio-chat | 31.87 | 50.60 | 29.19 | 65.77 | 50.13 |
| Baichuan-Audio | 71.65 | 54.80 | 50.31 | 99.42 | 69.27 |
| MiniCPM-o-2.6 | 78.02 | 60.40 | 49.25 | 97.69 | 71.23 |
| Baichuan-Omni-1.5 | 74.51 | 62.70 | 54.54 | 97.31 | 71.32 |
| Qwen2.5-Omni-7B | 80.90 | 66.70 | 53.50 | 99.20 | 73.60 |
| InteractiveOmni-4B | 82.64 | 55.90 | 60.90 | 99.62 | 73.10 |
| InteractiveOmni-8B | 86.37 | 73.30 | 57.99 | 99.42 | 76.69 |

## Citation

If you find our paper and code useful in your research, please cite our technical report.

```bibtex
@misc{tong2025interactiveomniunifiedomnimodalmodel,
      title={InteractiveOmni: A Unified Omni-modal Model for Audio-Visual Multi-turn Dialogue},
      author={Wenwen Tong and Hewei Guo and Dongchuan Ran and Jiangnan Chen and Jiefan Lu and Kaibin Wang and Keqiang Li and Xiaoxu Zhu and Jiakui Li and Kehan Li and Xueheng Li and Lumin Li and Chenxu Guo and Jiasheng Zhou and Jiandong Chen and Xianye Wu and Jiahao Wang and Silei Wu and Lei Chen and Hanming Deng and Yuxuan Song and Dinghao Zhou and Guiping Zhong and Ken Zheng and Shiyin Kang and Lewei Lu},
      year={2025},
      eprint={2510.13747},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.13747},
}
```