fnlp
MOSS-Speech
MOSS-Speech is an open-source bilingual native speech-to-speech model that requires no text guidance and supports both Chinese and English. Our approach combines a modality-based layer-splitting architecture with a frozen pre-training strategy, leveraging pretrained text LLMs while extending them with native speech capabilities. Experiments show state-of-the-art results in spoken question answering and competitive speech-to-speech performance compared to text-guided systems.

- True Speech-to-Speech Modeling: No text guidance required.
- Layer-Splitting Architecture: Integrates modality-specific layers on top of pretrained text LLM backbones.
- Frozen Pre-Training Strategy: Preserves LLM reasoning while enhancing speech understanding and generation.
- State-of-the-Art Performance: Excels in spoken question answering and speech-to-speech tasks.
- Expressive & Efficient: Preserves paralinguistic cues such as tone, emotion, and prosody that are often lost in cascaded pipelines.
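As a rough illustration of the text-free pipeline, here is a hypothetical inference sketch. The repo id `fnlp/MOSS-Speech`, the `trust_remote_code` loading path, and the `generate_speech` entry point are assumptions rather than the confirmed API; consult the model card for actual usage.

```python
# Hypothetical usage sketch -- the real API is defined by the model's own
# remote code; repo id and method names below are assumptions.
import torch
import soundfile as sf
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "fnlp/MOSS-Speech",        # assumed repo id on the fnlp org
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).eval().cuda()

# Load a spoken question (mono 16 kHz is a common expectation; verify).
wav, sr = sf.read("question.wav")

with torch.no_grad():
    # `generate_speech` is a placeholder name for the speech-to-speech
    # entry point; no intermediate text transcript is produced.
    answer, answer_sr = model.generate_speech(wav, sampling_rate=sr)

sf.write("answer.wav", answer, answer_sr)
```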
MOSS-Speech-Codec
RoboOmni
RoboOmni: Proactive Robot Manipulation in Omni-modal Context

📖 arXiv Paper | 🌐 Website | 🤗 Model | 🤗 Dataset | 🛠️ GitHub

Recent advances in Multimodal Large Language Models (MLLMs) have driven rapid progress in Vision–Language–Action (VLA) models for robotic manipulation. Although effective in many scenarios, current approaches largely rely on explicit instructions, whereas in real-world interactions humans rarely issue instructions directly. Effective collaboration requires robots to infer user intentions proactively. In this work, we introduce cross-modal contextual instructions, a new setting where intent is derived from spoken dialogue, environmental sounds, and visual cues rather than explicit commands. To address this setting, we present RoboOmni, a Perceiver-Thinker-Talker-Executor framework based on end-to-end omni-modal LLMs that unifies intention recognition, interaction confirmation, and action execution. RoboOmni fuses auditory and visual signals spatiotemporally for robust intention recognition, while supporting direct speech interaction. To address the absence of training data for proactive intention recognition in robotic manipulation, we build OmniAction, comprising 140k episodes, 5k+ speakers, 2.4k event sounds, 640 backgrounds, and six contextual instruction types. Experiments in simulation and real-world settings show that RoboOmni surpasses text- and ASR-based baselines in success rate, inference speed, intention recognition, and proactive assistance.

At the heart of RoboOmni lies the Perceiver-Thinker-Talker-Executor architecture, which unifies multiple modalities (vision, speech, environmental sounds) into a single, seamless framework for robot action execution.

If you find our paper and code useful in your research, please cite our paper.
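To make the Perceiver-Thinker-Talker-Executor flow concrete, the sketch below walks one observation through the four stages. Every function here is a stub invented for exposition; it mirrors only the data flow described above, not the released implementation.

```python
# Illustrative stub of the Perceiver-Thinker-Talker-Executor data flow.
# All names are invented for exposition; none of this is the released API.

def perceiver(frames, audio):
    # Fuse visual frames, spoken dialogue, and environmental sounds
    # spatiotemporally into one omni-modal token sequence (stubbed).
    return [("vision", f) for f in frames] + [("audio", a) for a in audio]

def thinker(tokens):
    # Infer the user's latent intention from context rather than from
    # an explicit command (stubbed as a fixed guess).
    return "bring the user a glass of water"

def talker(intent):
    # Proactively confirm the inferred intention via direct speech.
    return f"It sounds like I should {intent}. Shall I go ahead?"

def executor(tokens, intent):
    # Emit low-level manipulation actions once intent is confirmed.
    return ["move_to(glass)", "grasp(glass)", "move_to(user)", "release()"]

frames = ["frame_0", "frame_1"]
audio = ["speech: 'I'm so thirsty...'", "sound: faucet running"]
tokens = perceiver(frames, audio)
intent = thinker(tokens)
print(talker(intent))            # interaction confirmation
print(executor(tokens, intent))  # action execution
```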
RoboOmni-LIBERO-Spatial
(Same model card as RoboOmni above.)
MOSS-TTSD v0.7
MOSS-TTSD (text to spoken dialogue) is an open-source bilingual spoken dialogue synthesis model that supports both Chinese and English. It can transform dialogue scripts between two speakers into natural, expressive conversational speech. MOSS-TTSD supports voice cloning and single-session speech generation of up to 1700 seconds, making it ideal for AI podcast production.

- Highly Expressive Dialogue Speech: Built on a unified semantic-acoustic neural audio codec, a pre-trained large language model, and millions of hours of TTS and conversational speech data, MOSS-TTSD generates highly expressive, human-like dialogue speech with natural conversational prosody.
- Two-Speaker Voice Cloning: MOSS-TTSD supports zero-shot two-speaker voice cloning and can generate conversational speech with accurate speaker switching based on dialogue scripts.
- Chinese-English Bilingual Support: MOSS-TTSD enables highly expressive speech generation in both Chinese and English.
- Long-Form Speech Generation (up to 1700 seconds): Thanks to its low-bitrate codec and training-framework optimizations, MOSS-TTSD has been trained for long speech generation, enabling single-session speech generation of up to 1700 seconds.
- Fully Open Source & Commercial-Ready: MOSS-TTSD and its future updates will be fully open-source and support free commercial use.
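As a hypothetical illustration of how a two-speaker script plus voice-cloning references might be packaged for synthesis: the JSON field names, the `[S1]`/`[S2]` speaker-tag format, and the `inference.py` entry point below are assumptions, not the confirmed schema; check the repository for the actual input format.

```python
# Hypothetical input-preparation sketch for two-speaker dialogue synthesis.
# Field names and tag format are assumed, not the confirmed schema.
import json

example = {
    # Dialogue script with inline speaker turns.
    "text": "[S1]Welcome back to the show![S2]Thanks, happy to be here.",
    # Optional reference audio + transcripts for zero-shot voice cloning
    # of each speaker (field names assumed).
    "prompt_audio_speaker1": "ref_s1.wav",
    "prompt_text_speaker1": "Hi, this is the first speaker's reference.",
    "prompt_audio_speaker2": "ref_s2.wav",
    "prompt_text_speaker2": "And this is the second speaker's reference.",
}

with open("examples.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps(example, ensure_ascii=False) + "\n")

# A batch-inference entry point would then consume this file, e.g.
# (script name and flags assumed):
#   python inference.py --jsonl examples.jsonl --output_dir outputs/
```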
RoboOmni-LIBERO-Goal
(Same model card as RoboOmni above.)
RoboOmni-LIBERO-Object
(Same model card as RoboOmni above.)
RoboOmni-LIBERO-Long
Llama-2-7B-MHA-dkv256
Research Paper "Towards Economical Inference: Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer-based LLMs" - Step 2(Option): For MHA2MLA models using Partial-RoPE 2-nrom method, Download the qk2-norm file. Take `qktensor7B.pth` as an example: - Step 3: Download the MHA2MLA models and run inference. Take `fnlp/Llama-2-7B-MHA-dkv256` as an example: