ICTNLP

30 models

llava-mini-llama-3.1-8b

llava_mini_llama
803
56

Llama-3.1-8B-Omni

llama-omni
169
413

Auto-RAG-Llama-3-8B-Instruct

llama
164
6

SLED-TTS-Streaming-Libriheavy

SLED-TTS: Efficient Speech Language Modeling via Energy Distance in Continuous Space

[Collection](https://huggingface.co/collections/ICTNLP/sled-tts-680253e19c889010a1a376ac)

Key features

- Autoregressive Continuous Modeling: SLED models speech in a continuous latent space, using a special type of maximum mean discrepancy as the training objective.
- Streaming Synthesis: SLED supports streaming synthesis, enabling speech generation to start as soon as the text stream begins.
- Voice Cloning: SLED can generate speech conditioned on a 3-second prefix or a reference utterance used as a prompt.

Demo

You can see SLED in action by exploring the demo page. We have made SLED available on Hugging Face, currently offering two distinct English models for different use cases:

1. SLED-TTS-Libriheavy: trained on the Libriheavy dataset; provides high-quality text-to-speech synthesis.
2. SLED-TTS-Streaming-Libriheavy: supports streaming decoding, generating a 0.6-second speech chunk for every 5 text tokens received; ideal for applications requiring low-latency audio generation.

Mandarin models are on the way! Alternatively, you can train your own SLED-TTS models by following the guidelines below.

Usage

We provide the training and inference code for SLED-TTS. We currently use the sum of the first 8 embedding vectors from Encodec24khz as the continuous latent vector. To proceed, ensure that Encodec24khz is downloaded and cached in your Hugging Face cache directory.

Inference

- Set the `CHECKPOINT` variable to the path of the cached SLED-TTS-Libriheavy or SLED-TTS-Streaming-Libriheavy model.
- Diverse generation results can be obtained by varying the `SEED` variable. You can adjust the prompt speech by setting `--prompttext` and `--promptaudio`.
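The continuous latent described above (the sum of the first 8 Encodec embedding vectors per frame) can be sketched as follows. This is an illustrative stand-in, not SLED's actual code: random arrays play the role of Encodec's residual-VQ codebooks, and the shapes are assumptions.

```python
import numpy as np

# Illustrative sketch of SLED's continuous latent construction:
# Encodec's residual VQ yields 8 code indices per audio frame; the
# continuous latent is the sum of the corresponding codebook
# embedding vectors. Random arrays stand in for real codebooks here.
rng = np.random.default_rng(0)
n_codebooks, codebook_size, dim, n_frames = 8, 1024, 128, 75

codebooks = rng.normal(size=(n_codebooks, codebook_size, dim))
codes = rng.integers(0, codebook_size, size=(n_codebooks, n_frames))

# Sum the 8 looked-up embeddings per frame -> one latent per frame.
latents = sum(codebooks[q][codes[q]] for q in range(n_codebooks))
print(latents.shape)  # (75, 128)
```

The result is one dense vector per frame, which the autoregressive model predicts directly instead of discrete code indices.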
Acknowledgement

This work is inspired by the following great works:

- Continuous Visual Autoregressive Generation via Score Maximization
- Autoregressive Image Generation without Vector Quantization
- A Spectral Energy Distance for Parallel Speech Synthesis
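The streaming decoding schedule described above (a 0.6-second speech chunk for every 5 text tokens) can be sketched as a simple chunking loop. This is an assumption-laden illustration, not SLED's implementation; in particular, how a final partial token group is handled is a guess.

```python
# Sketch of the streaming schedule: the streaming SLED variant emits
# a 0.6-second speech chunk for every 5 text tokens received, so
# synthesis starts before the full text stream has arrived.
def chunk_schedule(tokens, tokens_per_chunk=5, chunk_seconds=0.6):
    """Yield (token_group, seconds_of_speech) pairs as tokens stream in."""
    for i in range(0, len(tokens), tokens_per_chunk):
        yield tokens[i:i + tokens_per_chunk], chunk_seconds

text = "the quick brown fox jumps over the lazy dog".split()
chunks = list(chunk_schedule(text))
print(len(chunks))  # 9 tokens -> 2 chunks
```

The first audio chunk is thus available after only 5 tokens, which is what makes low-latency generation possible.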

speech_llama
34
7

Llama-2-7b-chat-TruthX

llama
24
6

LLaMA-Omni2-0.5B

—
16
6

Stream Omni 8b

Stream-Omni: Simultaneous Multimodal Interactions with Large Language-Vision-Speech Model

[Paper](https://arxiv.org/abs/2506.13642) [Code](https://github.com/ictnlp/Stream-Omni) [Model](https://huggingface.co/ICTNLP/stream-omni-8b) [Dataset](https://huggingface.co/datasets/ICTNLP/InstructOmni)

> Shaolei Zhang, Shoutao Guo, Qingkai Fang, Yan Zhou, Yang Feng

For the introduction and usage of Stream-Omni, see https://github.com/ictnlp/Stream-Omni.

Stream-Omni is an end-to-end language-vision-speech chatbot that simultaneously supports interaction across various modality combinations, with the following features 💡:

- Omni Interaction: supports any multimodal input, including text, vision, and speech, and generates both text and speech responses.
- Seamless "see-while-hear" Experience: simultaneously outputs intermediate textual results (e.g., ASR transcriptions and model responses) during speech interactions, like the advanced voice service of GPT-4o.
- Efficient Training: requires only a small amount of omni-modal data for training.

🖥 Demo

| Microphone Input | File Input |
| ---------------- | ---------- |
| (video demo)     | (video demo) |

> [!NOTE]
> Stream-Omni can produce intermediate textual results (ASR transcription and text response) during speech interaction, offering users a seamless "see-while-hear" experience.
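The "see-while-hear" behavior described above can be sketched as an event stream: intermediate text surfaces while speech is still being processed. This is a hypothetical illustration of the interaction pattern, not Stream-Omni's API; `interact`, `transcribe`, and `respond` are invented names.

```python
# Illustrative sketch (not Stream-Omni's actual interface) of the
# "see-while-hear" interaction: during a speech turn the model surfaces
# the growing ASR transcript, then the text response, then the speech.
def interact(speech_frames, transcribe, respond):
    """Yield (kind, payload) events as a speech turn is processed."""
    transcript = ""
    for frame in speech_frames:
        transcript = transcribe(transcript, frame)
        yield "asr", transcript          # intermediate ASR text
    response = respond(transcript)
    yield "text", response               # intermediate text response
    yield "speech", f"<tts:{response}>"  # final speech response (stub)

events = list(interact(
    ["hel", "lo"],
    transcribe=lambda t, f: t + f,      # toy incremental ASR
    respond=lambda t: t.upper(),        # toy response model
))
print(events[-1])  # ('speech', '<tts:HELLO>')
```

A UI consuming this stream can render the transcript and text response live, which is the "seamless see-while-hear experience" the card refers to.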

stream_omni_llama
15
48

bayling-13b-v1.1

llama
14
5

LLaMA-Omni2-7B

—
12
4

LLaMA-Omni2-7B-Bilingual

—
12
1

bayling-13b-diff

llama
10
12

LLaMA-Omni2-0.5B-Bilingual

—
9
1

LLaMA-Omni2-14B-Bilingual

—
7
0

LLaMA-Omni2-1.5B

—
6
2

LLaMA-Omni2-3B

—
5
2

FastLongSpeech

license:apache-2.0
5
1

bayling-2-llama-3-8b

llama
4
1

LLaMA-Omni2-14B

—
4
1

LLaMA-Omni2-1.5B-Bilingual

—
4
0

LLaMA-Omni2-3B-Bilingual

—
3
0

SLED-TTS-Libriheavy


speech_llama
2
6

bayling-2-7b

llama
1
1

LLaMA-Omni2-32B-Bilingual

—
1
1

StreamSpeech_Models

—
0
32

bayling-7b-diff

llama
0
8

TruthX

license:gpl-3.0
0
5

StreamUni-Phi4

The model for the paper "StreamUni: Achieving Streaming Speech Translation with a Unified Large Speech-Language Model".

The Phi-4 family has been integrated into `transformers` since version `4.48.2`. The currently installed `transformers` version can be verified with `pip list | grep transformers`. We suggest running with Python 3.10. Examples of required packages:

Training Datasets
- https://huggingface.co/datasets/ICTNLP/StreamUni

GitHub Pages
- https://github.com/ictnlp/StreamUni
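The version requirement above (`transformers` >= 4.48.2 for Phi-4 support) can be checked programmatically. A minimal sketch using a pure-Python tuple comparison; real projects would typically use `packaging.version.parse` instead, and `version_at_least` is a hypothetical helper name.

```python
# Minimal version check for the requirement above: Phi-4 support
# landed in transformers 4.48.2, so the installed version must be
# at least that. Dotted versions compare as integer tuples.
def version_at_least(installed: str, required: str) -> bool:
    to_tuple = lambda v: tuple(int(p) for p in v.split("."))
    return to_tuple(installed) >= to_tuple(required)

print(version_at_least("4.48.2", "4.48.2"))  # True
print(version_at_least("4.47.0", "4.48.2"))  # False
```

In practice you would pass `transformers.__version__` as the first argument and upgrade with `pip install -U transformers` if the check fails.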

license:apache-2.0
0
5

NAST-S2X

license:apache-2.0
0
4

TACS_Truth_Detection_Classifiers

—
0
2

ComSpeech_Models

—
0
2