ICTNLP
llava-mini-llama-3.1-8b
Llama-3.1-8B-Omni
Auto-RAG-Llama-3-8B-Instruct
SLED-TTS-Streaming-Libriheavy
# SLED-TTS: Efficient Speech Language Modeling via Energy Distance in Continuous Space

[Hugging Face Collection](https://huggingface.co/collections/ICTNLP/sled-tts-680253e19c889010a1a376ac)

## Key features

- **Autoregressive Continuous Modeling**: SLED models speech in a continuous latent space, using a special type of maximum mean discrepancy (the energy distance) as the training objective.
- **Streaming Synthesis**: SLED supports streaming synthesis, enabling speech generation to start as soon as the text stream begins.
- **Voice Cloning**: SLED can generate speech conditioned on a 3-second prefix or a reference utterance used as a prompt.

## Demo

You can see SLED in action by exploring the demo page.

We have made SLED available on Hugging Face, currently offering two distinct English models for different use cases:

1. **SLED-TTS-Libriheavy**: trained on the Libriheavy dataset, this model provides high-quality text-to-speech synthesis.
2. **SLED-TTS-Streaming-Libriheavy**: this variant supports streaming decoding, generating a 0.6-second speech chunk for every 5 text tokens received. It's ideal for applications requiring low-latency audio generation.

The Mandarin models are on the way! Alternatively, you can train your own SLED-TTS models by following the guidelines below.

## Usage

We provide the training and inference code for SLED-TTS. We currently use the sum of the first 8 embedding vectors from Encodec24khz as the continuous latent vector. Before proceeding, ensure that Encodec24khz is downloaded and cached in your Hugging Face cache directory.

### Inference

- Set the `CHECKPOINT` variable to the path of the cached SLED-TTS-Libriheavy or SLED-TTS-Streaming-Libriheavy model.
- Vary the `SEED` variable to obtain diverse generation results. You can change the prompt speech by setting `--prompttext` and `--promptaudio`.
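The streaming cadence described above (a 0.6-second speech chunk emitted for every 5 text tokens received) can be pictured as a simple schedule. The function below is a minimal illustration of that cadence only, not part of the SLED codebase, and it ignores any final partial chunk:

```python
def streaming_schedule(n_text_tokens: int,
                       tokens_per_chunk: int = 5,
                       chunk_seconds: float = 0.6):
    """Illustrative scheduler for the streaming variant's cadence:
    one 0.6 s speech chunk per 5 text tokens received.

    Returns (number of chunks emitted, total seconds of audio).
    """
    n_chunks = n_text_tokens // tokens_per_chunk
    return n_chunks, n_chunks * chunk_seconds
```

For example, a 23-token text stream triggers 4 chunks (2.4 s of audio) while the stream is still arriving.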
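The continuous latent described in Usage (the sum of the first 8 embedding vectors from Encodec24khz) can be pictured as summing the codebook embeddings of the first 8 residual-quantizer levels per frame. This is a NumPy sketch with made-up shapes, purely to illustrate the summation; the actual pipeline uses the cached Encodec24khz model:

```python
import numpy as np

def continuous_latent(codes: np.ndarray, codebooks: np.ndarray) -> np.ndarray:
    """Sum the embedding vectors of the first 8 quantizer levels per frame.

    codes:     (Q, T) integer code indices, one row per quantizer level
    codebooks: (Q, K, D) embedding tables, one per quantizer level
    returns:   (T, D) continuous latent vectors
    """
    assert codes.shape[0] >= 8 and codebooks.shape[0] >= 8
    # codebooks[q][codes[q]] looks up the (T, D) embeddings of level q;
    # the latent is the elementwise sum over the first 8 levels.
    return sum(codebooks[q][codes[q]] for q in range(8))
```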
## Acknowledgement

This work is inspired by the following great works:

- Continuous Visual Autoregressive Generation via Score Maximization
- Autoregressive Image Generation without Vector Quantization
- A Spectral Energy Distance for Parallel Speech Synthesis
Llama-2-7b-chat-TruthX
LLaMA-Omni2-0.5B
stream-omni-8b
# Stream-Omni: Simultaneous Multimodal Interactions with Large Language-Vision-Speech Model

[Paper](https://arxiv.org/abs/2506.13642) | [Code](https://github.com/ictnlp/Stream-Omni) | [Model](https://huggingface.co/ICTNLP/stream-omni-8b) | [Dataset](https://huggingface.co/datasets/ICTNLP/InstructOmni)

> Shaolei Zhang, Shoutao Guo, Qingkai Fang, Yan Zhou, Yang Feng

For the introduction and usage of Stream-Omni, refer to https://github.com/ictnlp/Stream-Omni.

Stream-Omni is an end-to-end language-vision-speech chatbot that simultaneously supports interaction across various modality combinations, with the following features:

- **Omni Interaction**: supports any combination of multimodal inputs, including text, vision, and speech, and generates both text and speech responses.
- **Seamless "see-while-hear" Experience**: simultaneously outputs intermediate textual results (e.g., ASR transcriptions and model responses) during speech interactions, like the advanced voice service of GPT-4o.
- **Efficient Training**: requires only a small amount of omni-modal data for training.

## Demo

Demo videos (microphone input and file input) are available in the GitHub repository.

> [!NOTE]
> Stream-Omni can produce intermediate textual results (ASR transcription and text response) during speech interaction, offering users a seamless "see-while-hear" experience.
bayling-13b-v1.1
LLaMA-Omni2-7B
LLaMA-Omni2-7B-Bilingual
bayling-13b-diff
LLaMA-Omni2-0.5B-Bilingual
LLaMA-Omni2-14B-Bilingual
LLaMA-Omni2-1.5B
LLaMA-Omni2-3B
FastLongSpeech
bayling-2-llama-3-8b
LLaMA-Omni2-14B
LLaMA-Omni2-1.5B-Bilingual
LLaMA-Omni2-3B-Bilingual
SLED-TTS-Libriheavy
bayling-2-7b
LLaMA-Omni2-32B-Bilingual
StreamSpeech_Models
bayling-7b-diff
TruthX
StreamUni-Phi4
The model for the paper "StreamUni: Achieving Streaming Speech Translation with a Unified Large Speech-Language Model".

The Phi-4 family has been integrated into `transformers` as of version `4.48.2`. You can verify your current `transformers` version with `pip list | grep transformers`. We suggest running with Python 3.10. Examples of required packages:

## Training Datasets

- https://huggingface.co/datasets/ICTNLP/StreamUni

## GitHub Pages

- https://github.com/ictnlp/StreamUni
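As an alternative to grepping pip output, the version requirement can be checked programmatically. The snippet below is a generic sketch (only the `4.48.2` minimum comes from the note above); `parse_version` is a simple helper defined here, not a `transformers` API:

```python
from importlib.metadata import PackageNotFoundError, version

REQUIRED = (4, 48, 2)  # transformers release that integrates the Phi-4 family

def parse_version(v: str) -> tuple:
    """Turn a dotted version string like '4.48.2' into a comparable tuple,
    keeping only the leading numeric components (so '4.50.0.dev0' -> (4, 50, 0))."""
    parts = []
    for piece in v.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)

def transformers_supports_phi4(installed: str) -> bool:
    return parse_version(installed) >= REQUIRED

# Check the installed transformers, if present.
try:
    print(transformers_supports_phi4(version("transformers")))
except PackageNotFoundError:
    print("transformers is not installed")
```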