Soul-AILab

9 models

SoulX-FlashTalk-14B

license:apache-2.0
2,189
35

SoulX-Singer

license:apache-2.0
1,583
137

SoulX-FlashHead-1_3B

license:apache-2.0
1,152
40

SoulX-Podcast-1.7B

Official inference code for **SoulX-Podcast: Towards Realistic Long-form Podcasts with Dialectal and Paralinguistic Diversity**.

### Overview

SoulX-Podcast is designed for podcast-style, multi-turn, multi-speaker dialogic speech generation, while also achieving superior performance on the conventional monologue TTS task. To meet the higher naturalness demands of multi-turn spoken dialogue, SoulX-Podcast integrates a range of paralinguistic controls and supports both Mandarin and English, as well as several Chinese dialects, including Sichuanese, Henanese, and Cantonese, enabling more personalized podcast-style speech generation.

- **Long-form, multi-turn, multi-speaker dialogic speech generation**: SoulX-Podcast excels at generating high-quality, natural-sounding dialogic speech in multi-turn, multi-speaker scenarios.
- **Cross-dialectal, zero-shot voice cloning**: SoulX-Podcast supports zero-shot voice cloning across different Chinese dialects, enabling the generation of high-quality, personalized speech in any of the supported dialects.
- **Paralinguistic controls**: SoulX-Podcast supports a variety of paralinguistic events, such as laughter and sighs, to enhance the realism of synthesized results.

### Clone and Install

To install on Linux:

- Clone the repo.
- Install Conda: see https://docs.conda.io/en/latest/miniconda.html
- Create a Conda env, then run the demo with the commands provided in the repo.

### TODOs

- [ ] Add example scripts for monologue TTS.
- [x] Publish the technical report.
- [ ] Develop a WebUI for easy inference.
- [ ] Deploy an online demo on Hugging Face Spaces.
- [ ] Dockerize the project with vLLM support.
- [ ] Add support for streaming inference.

### License

We use the Apache 2.0 license. Researchers and developers are free to use the code and model weights of SoulX-Podcast. See LICENSE for details.
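The "Clone and Install" steps above might look like the following on Linux. This is only a sketch: the repository URL, environment name, and Python version are assumptions, not taken from the official instructions.

```shell
# Assumed repo URL -- check the official Soul-AILab GitHub for the real one.
git clone https://github.com/Soul-AILab/SoulX-Podcast.git
cd SoulX-Podcast

# Create and activate a Conda environment (env name and Python version are assumptions).
conda create -n soulx-podcast python=3.10 -y
conda activate soulx-podcast

# Install the project's dependencies, then follow the repo's demo commands.
pip install -r requirements.txt
```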
### Acknowledgement

This repo benefits from FlashCosyVoice.

### Usage Disclaimer

This project provides a speech synthesis model for podcast generation, capable of zero-shot voice cloning, intended for academic research, educational purposes, and legitimate applications such as personalized speech synthesis, assistive technologies, and linguistic research. Do not use this model for unauthorized voice cloning, impersonation, fraud, scams, deepfakes, or any other illegal activity. Ensure compliance with local laws and regulations when using this model, and uphold ethical standards. The developers assume no liability for any misuse of this model. We advocate for the responsible development and use of AI and encourage the community to uphold safety and ethical principles in AI research and applications. If you have any concerns regarding ethics or misuse, please contact us.

### Contact Us

If you are interested in our work, feel free to email [email protected], [email protected], [email protected], or [email protected]. You are also welcome to join our WeChat group for technical discussions and updates. Due to group limits, if you cannot scan the QR code, please add my WeChat (Tiamo James) for group access.

license:apache-2.0
912
224

LiveAct

license:apache-2.0
492
11

SoulX-Podcast-1.7B-dialect

Dialect-focused variant of SoulX-Podcast. See the SoulX-Podcast-1.7B card above for the full description.

license:apache-2.0
418
24

SAC-16k-37_5Hz

license:apache-2.0
35
0

SAC-16k-62_5Hz

SAC: Neural Speech Codec with Semantic-Acoustic Dual-Stream Quantization

A semantic-acoustic dual-stream speech codec achieving state-of-the-art performance in speech reconstruction and semantic representation across bitrates.

To use SAC, prepare the pretrained dependencies: the GLM-4-Voice-Tokenizer for semantic tokenization and the ERes2Net speaker encoder for speaker feature extraction (used during codec training). Make sure the corresponding model paths are correctly set in your configuration file (e.g., `configs/xxx.yaml`).

The following table lists the available SAC checkpoints:

| Model Name | Hugging Face | Sample Rate | Token Rate | BPS |
|:----------:|:------------:|:-----------:|:----------:|:---:|
| SAC-16k-37_5Hz | 🤗 Soul-AILab/SAC-16k-37_5Hz | 16 kHz | 37.5 Hz | 525 |
| SAC-16k-62_5Hz | 🤗 Soul-AILab/SAC-16k-62_5Hz | 16 kHz | 62.5 Hz | 875 |

To perform audio reconstruction, use the reconstruction command from the repo. We also provide batch scripts for audio reconstruction, encoding, decoding, and embedding extraction in the `scripts/batch` directory as references (see the batch scripts guide for details). For evaluation, first refer to the evaluation guide for dataset preparation and setup.

### 🚀 Training

**Step 1: Prepare training data**

Before training, organize your dataset in JSONL format (see `example/training_data.jsonl`). Each entry should include:

- `utt` — unique utterance ID (customizable)
- `wav_path` — path to the raw audio
- `ssl_path` — path to offline-extracted Whisper features (for semantic supervision)
- `semantic_token_path` — path to offline-extracted semantic tokens

To accelerate training, extract the semantic tokens and Whisper features offline before starting. Refer to the feature extraction guide for detailed instructions.
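As a sketch, one line of the JSONL manifest described above could be written like this. All paths are hypothetical placeholders, and the underscored field names are an assumption based on the field list; check `example/training_data.jsonl` in the repo for the authoritative format.

```shell
# Write a one-entry JSONL manifest; each dataset example is one JSON object per line.
cat > training_data.jsonl <<'EOF'
{"utt": "spk001_utt0001", "wav_path": "data/wav/spk001_utt0001.wav", "ssl_path": "data/ssl/spk001_utt0001.pt", "semantic_token_path": "data/tok/spk001_utt0001.npy"}
EOF
```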
**Step 2: Modify configuration files**

Adjust the training and DeepSpeed configurations by editing:

- `configs/xxx.yaml` — main training configuration
- `configs/ds_stage2.json` — DeepSpeed configuration

**Step 3: Start training**

Run the provided training script to start SAC training.

### 🙏 Acknowledgement

Our codebase builds upon the awesome SparkVox and DAC. We thank the authors for their excellent work.

### 🔖 Citation

If you find this work useful in your research, please consider citing our work.

### 📜 License

This project is licensed under the Apache 2.0 License.

license:apache-2.0
24
1

SoulX-Duplug-0.6B

license:apache-2.0
0
6