HKUSTAudio

21 models

xcodec2

Paper: [LLaSA: Scaling Train-Time and Inference-Time Compute for LLaMA-based Speech Synthesis](https://arxiv.org/abs/2502.04128)

Update (2025-02-13): Added Llasa finetuning instructions.

Related: Codec Does Matter: Exploring the Semantic Shortcoming of Codec for Audio Language Models (AAAI 2025, xcodec 1.0)

Getting Started with XCodec2 on Hugging Face

XCodec2 is a speech tokenizer with the following key features:

1. Single vector quantization
2. 50 tokens per second
3. Multilingual speech semantic support and high-quality speech reconstruction

To use `xcodec2`, ensure it is installed first. If you want to train your own xcodec2, run batch inference, or perform large-scale code extraction, the code is released in the training repository.
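The headline numbers make context budgeting straightforward: a single vector-quantized stream at 50 tokens per second means a clip costs duration × 50 tokens in a downstream language model. A minimal illustrative helper (the constant is from the card; the function name is ours, not part of the `xcodec2` API):

```python
TOKENS_PER_SECOND = 50  # XCodec2 emits 50 tokens per second (single codebook)

def num_codec_tokens(duration_seconds: float) -> int:
    """Number of XCodec2 tokens needed to represent a clip of this length."""
    return round(duration_seconds * TOKENS_PER_SECOND)

# A 10-second utterance occupies 500 token positions of LLM context.
print(num_codec_tokens(10.0))
```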

license:cc-by-nc-4.0
11,627
91

Llasa-1B

Update (2025-05-10): top_p=0.95 and temperature=0.9 sometimes produce more stable results.

Update (2025-02-13): Added Llasa finetuning instructions.

Paper: LLaSA: Scaling Train-Time and Inference-Time Compute for LLaMA-based Speech Synthesis

- Train from scratch: to train the model from scratch, use the LLaSA Training Repository.
- Scale test-time computation: to experiment with scaling test-time computation, use the LLaSA Testing Repository.

Model Information

Llasa is a text-to-speech (TTS) system that extends the text-based LLaMA (1B, 3B, and 8B) language models by incorporating speech tokens from the XCodec2 codebook, which contains 65,536 tokens. Llasa was trained on a dataset comprising 250,000 hours of Chinese-English speech data. The model supports two modes:

1. Speech synthesis solely from input text
2. Speech synthesis utilizing a given speech prompt

This model is licensed under CC BY-NC 4.0, which prohibits commercial use because of ethics and privacy concerns; detected violations will result in legal consequences. This codebase must not be used for any illegal purpose in any country or region. Please refer to your local laws regarding the DMCA and other related regulations.
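The recommended settings above are standard sampling knobs: temperature rescales the logits, and top_p (nucleus) sampling restricts the choice to the smallest set of tokens covering 95% of the probability mass. A minimal pure-Python sketch of what those two parameters do (illustrative only, not Llasa's actual inference code):

```python
import math
import random

def sample_next_token(logits, temperature=0.9, top_p=0.95, rng=random):
    """Temperature + nucleus (top-p) sampling over a list of raw logits."""
    # Temperature scaling: values < 1 sharpen the distribution.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]  # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    # Nucleus filtering: keep the smallest high-probability set whose
    # cumulative mass reaches top_p, then sample within that set.
    order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    r = rng.random() * cum
    for i in kept:
        r -= probs[i]
        if r <= 0:
            return i
    return kept[-1]

# With one dominant logit, the nucleus collapses to a single token.
print(sample_next_token([10.0, 0.0, 0.0]))
```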

llama
6,181
100

Llasa-8B

Update (2025-02-13): Added Llasa finetuning instructions.

New Features

We have observed that Llasa 8B exhibits excellent text-comprehension capability. You can try complex sentences like:

- English: "He shouted, 'Everyone, please gather 'round! Here's the plan: 1) Set-up at 9:15 a.m.; 2) Lunch at 12:00 p.m. (please RSVP!); 3) Playing—e.g., games, music, etc.—from 1:15 to 4:45; and 4) Clean-up at 5 p.m.'"
- Chinese: "昨夜雨疏风骤,浓睡不消残酒。试问卷帘人,却道海棠依旧。知否,知否?应是绿肥红瘦。" "帘外雨潺潺,春意阑珊。罗衾不耐五更寒。梦里不知身是客,一晌贪欢。独自莫凭栏,无限江山。别时容易见时难。流水落花春去也,天上人间。"

Paper: LLaSA: Scaling Train-Time and Inference-Time Compute for LLaMA-based Speech Synthesis (coming soon)

- Train from scratch: to train the model from scratch, use the LLaSA Training Repository.
- Scale test-time computation: to experiment with scaling test-time computation, use the LLaSA Testing Repository.

Model Information

Llasa is a text-to-speech (TTS) system that extends the text-based LLaMA (1B, 3B, and 8B) language models by incorporating speech tokens from the XCodec2 codebook, which contains 65,536 tokens. Llasa was trained on a dataset comprising 250,000 hours of Chinese-English speech data. The method is seamlessly compatible with the Llama framework, making TTS training similar to LLM training: audio is converted into single-codebook tokens and simply viewed as a special language. This opens the possibility of applying existing LLM compression, acceleration, and finetuning methods. The model supports two modes:

1. Speech synthesis solely from input text
2. Speech synthesis utilizing a given speech prompt

This model is licensed under CC BY-NC 4.0, which prohibits commercial use because of ethics and privacy concerns; detected violations will result in legal consequences. This codebase must not be used for any illegal purpose in any country or region. Please refer to your local laws regarding the DMCA and other related regulations.
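The "special language" framing above is concrete: each of the 65,536 XCodec2 codes becomes one extra entry appended to the LLaMA vocabulary, so a waveform turns into an ordinary token sequence the LLM can model. A sketch of that id mapping (the 128,256 base size is LLaMA-3's text vocabulary; whether Llasa uses exactly this offset is an assumption):

```python
SPEECH_CODEBOOK_SIZE = 65_536  # XCodec2 single codebook, per the model card
TEXT_VOCAB_SIZE = 128_256      # LLaMA-3 text vocabulary size (assumed offset)

def speech_to_llm_token(code: int) -> int:
    """Map an XCodec2 code to its id in the extended LLM vocabulary."""
    if not 0 <= code < SPEECH_CODEBOOK_SIZE:
        raise ValueError(f"code {code} is outside the codebook")
    return TEXT_VOCAB_SIZE + code

def llm_to_speech_token(token_id: int) -> int:
    """Inverse mapping, used to turn generated ids back into audio codes."""
    code = token_id - TEXT_VOCAB_SIZE
    if not 0 <= code < SPEECH_CODEBOOK_SIZE:
        raise ValueError(f"token id {token_id} is not a speech token")
    return code
```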

llama
629
96

Llasa-3B

Update (2025-05-10): top_p=0.95 and temperature=0.9 sometimes produce more stable results.

Update (2025-02-13): Added Llasa finetuning instructions.

Paper: LLaSA: Scaling Train-Time and Inference-Time Compute for LLaMA-based Speech Synthesis

- Train from scratch: to train the model from scratch, use the LLaSA Training Repository.
- Scale test-time computation: to experiment with scaling test-time computation, use the LLaSA Testing Repository.

Model Information

Llasa is a text-to-speech (TTS) system that extends the text-based LLaMA (1B, 3B, and 8B) language models by incorporating speech tokens from the XCodec2 codebook, which contains 65,536 tokens. Llasa was trained on a dataset comprising 250,000 hours of Chinese-English speech data. The method is seamlessly compatible with the Llama framework, making TTS training similar to LLM training: audio is converted into single-codebook tokens and simply viewed as a special language. This opens the possibility of applying existing LLM compression, acceleration, and finetuning methods. The model supports two modes:

1. Speech synthesis solely from input text
2. Speech synthesis utilizing a given speech prompt

This model is licensed under CC BY-NC 4.0, which prohibits commercial use because of ethics and privacy concerns; detected violations will result in legal consequences. This codebase must not be used for any illegal purpose in any country or region. Please refer to your local laws regarding the DMCA and other related regulations.

llama
565
523

Llasa-1B-Multilingual

llama
319
39

VidMuse

license:cc-by-4.0
132
5

Llasa-1B-multi-speakers-genshin-zh-en-ja-ko

llama
61
4

AudioX-MAF

license:cc-by-nc-4.0
33
0

AudioX-MAF-MMDiT

license:cc-by-nc-4.0
18
1

YuE-s1-7B-anneal-jp-kr-icl

llama
10
0

Llasa-3B-Preserve-TextChat

llama
9
2

Spark-TTS-0.5B

license:cc-by-nc-sa-4.0
4
6

YuE-s2-1B-general

llama
4
4

YuE-s1-7B-anneal-en-cot

llama
4
2

YuE-s1-7B-anneal-zh-cot

llama
2
0

YuE-s1-7B-anneal-en-icl

llama
1
14

Llasa-1B-Preserve-TextChat

llama
1
2

YuE-s1-7B-anneal-jp-kr-cot

llama
1
0

AudioX

license:cc-by-nc-4.0
0
105

Llasa-1B-two-speakers-kore-puck

llama
0
5

Audio-Omni

license:cc-by-nc-4.0
0
2