ICTNLP
llava-mini-llama-3.1-8b
Llama-3.1-8B-Omni
Auto-RAG-Llama-3-8B-Instruct
SLED-TTS-Streaming-Libriheavy
# SLED-TTS: Efficient Speech Language Modeling via Energy Distance in Continuous Space

[Hugging Face Collection](https://huggingface.co/collections/ICTNLP/sled-tts-680253e19c889010a1a376ac)

## Key features

- **Autoregressive Continuous Modeling**: SLED models speech in a continuous latent space, using a special type of maximum mean discrepancy (the energy distance) as the training objective.
- **Streaming Synthesis**: SLED supports streaming synthesis, enabling speech generation to start as soon as the text stream begins.
- **Voice Cloning**: SLED can generate speech conditioned on a 3-second prefix or a reference utterance used as a prompt.

## Demo

You can see SLED in action by exploring the demo page.

We have made SLED available on Hugging Face, currently offering two distinct English models for different use cases:

1. **SLED-TTS-Libriheavy**: trained on the Libriheavy dataset, this model provides high-quality text-to-speech synthesis.
2. **SLED-TTS-Streaming-Libriheavy**: this variant supports streaming decoding, generating a 0.6-second speech chunk for every 5 text tokens received. It's ideal for applications requiring low-latency audio generation.

The Mandarin models are on the way! Alternatively, you can train your own SLED-TTS models by following the guidelines below.

## Usage

We provide the training and inference code for SLED-TTS. We currently use the sum of the first 8 embedding vectors from Encodec24khz as the continuous latent vector. Before proceeding, ensure that Encodec24khz is downloaded and cached in your Hugging Face cache directory.

### Inference

- Set the `CHECKPOINT` variable to the path of the cached SLED-TTS-Libriheavy or SLED-TTS-Streaming-Libriheavy model.
- Vary the `SEED` variable to obtain diverse generation results. You can change the prompt speech by setting `--prompttext` and `--promptaudio`.
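The streaming cadence described above (a 0.6-second speech chunk emitted for every 5 text tokens received) can be pictured as a simple schedule. The function below is a minimal illustration of that cadence only, not part of the SLED codebase, and it ignores any final partial chunk:

```python
def streaming_schedule(n_text_tokens: int,
                       tokens_per_chunk: int = 5,
                       chunk_seconds: float = 0.6):
    """Illustrative scheduler for the streaming variant's cadence:
    one 0.6 s speech chunk per 5 text tokens received.

    Returns (number of chunks emitted, total seconds of audio).
    """
    n_chunks = n_text_tokens // tokens_per_chunk
    return n_chunks, n_chunks * chunk_seconds
```

For example, a 23-token text stream triggers 4 chunks (2.4 s of audio) while the stream is still arriving.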
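The continuous latent described in Usage (the sum of the first 8 embedding vectors from Encodec24khz) can be pictured as summing the codebook embeddings of the first 8 residual-quantizer levels per frame. This is a NumPy sketch with made-up shapes, purely to illustrate the summation; the actual pipeline uses the cached Encodec24khz model:

```python
import numpy as np

def continuous_latent(codes: np.ndarray, codebooks: np.ndarray) -> np.ndarray:
    """Sum the embedding vectors of the first 8 quantizer levels per frame.

    codes:     (Q, T) integer code indices, one row per quantizer level
    codebooks: (Q, K, D) embedding tables, one per quantizer level
    returns:   (T, D) continuous latent vectors
    """
    assert codes.shape[0] >= 8 and codebooks.shape[0] >= 8
    # codebooks[q][codes[q]] looks up the (T, D) embeddings of level q;
    # the latent is the elementwise sum over the first 8 levels.
    return sum(codebooks[q][codes[q]] for q in range(8))
```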
## Acknowledgement

This work is inspired by the following great works:

- Continuous Visual Autoregressive Generation via Score Maximization
- Autoregressive Image Generation without Vector Quantization
- A Spectral Energy Distance for Parallel Speech Synthesis
Llama-2-7b-chat-TruthX
LLaMA-Omni2-0.5B
stream-omni-8b
# Stream-Omni: Simultaneous Multimodal Interactions with Large Language-Vision-Speech Model

[Paper](https://arxiv.org/abs/2506.13642) | [Code](https://github.com/ictnlp/Stream-Omni) | [Model](https://huggingface.co/ICTNLP/stream-omni-8b) | [Dataset](https://huggingface.co/datasets/ICTNLP/InstructOmni)

> Shaolei Zhang, Shoutao Guo, Qingkai Fang, Yan Zhou, Yang Feng

For the introduction and usage of Stream-Omni, refer to https://github.com/ictnlp/Stream-Omni.

Stream-Omni is an end-to-end language-vision-speech chatbot that simultaneously supports interaction across various modality combinations, with the following features:

- **Omni Interaction**: supports any combination of multimodal inputs, including text, vision, and speech, and generates both text and speech responses.
- **Seamless "see-while-hear" Experience**: simultaneously outputs intermediate textual results (e.g., ASR transcriptions and model responses) during speech interactions, like the advanced voice service of GPT-4o.
- **Efficient Training**: requires only a small amount of omni-modal data for training.

## Demo

Demo videos (microphone input and file input) are available in the GitHub repository.

> [!NOTE]
> Stream-Omni can produce intermediate textual results (ASR transcription and text response) during speech interaction, offering users a seamless "see-while-hear" experience.
bayling-13b-v1.1
LLaMA-Omni2-7B
LLaMA-Omni2-7B-Bilingual
bayling-13b-diff
LLaMA-Omni2-0.5B-Bilingual
LLaMA-Omni2-14B-Bilingual
LLaMA-Omni2-1.5B
LLaMA-Omni2-3B
FastLongSpeech
bayling-2-llama-3-8b
LLaMA-Omni2-14B
LLaMA-Omni2-1.5B-Bilingual
LLaMA-Omni2-3B-Bilingual
SLED-TTS-Libriheavy
bayling-2-7b
LLaMA-Omni2-32B-Bilingual
StreamSpeech_Models
bayling-7b-diff
TruthX
StreamUni-Phi4
The model for the paper "StreamUni: Achieving Streaming Speech Translation with a Unified Large Speech-Language Model".

The Phi-4 family has been integrated into `transformers` as of version `4.48.2`. You can verify your current `transformers` version with `pip list | grep transformers`. We suggest running with Python 3.10. Examples of required packages:

## Training Datasets

- https://huggingface.co/datasets/ICTNLP/StreamUni

## GitHub Pages

- https://github.com/ictnlp/StreamUni
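As an alternative to grepping pip output, the version requirement can be checked programmatically. The snippet below is a generic sketch (only the `4.48.2` minimum comes from the note above); `parse_version` is a simple helper defined here, not a `transformers` API:

```python
from importlib.metadata import PackageNotFoundError, version

REQUIRED = (4, 48, 2)  # transformers release that integrates the Phi-4 family

def parse_version(v: str) -> tuple:
    """Turn a dotted version string like '4.48.2' into a comparable tuple,
    keeping only the leading numeric components (so '4.50.0.dev0' -> (4, 50, 0))."""
    parts = []
    for piece in v.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)

def transformers_supports_phi4(installed: str) -> bool:
    return parse_version(installed) >= REQUIRED

# Check the installed transformers, if present.
try:
    print(transformers_supports_phi4(version("transformers")))
except PackageNotFoundError:
    print("transformers is not installed")
```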