# hynt/F5-TTS-Vietnamese-ViVoice
## 🛑 Important Note

⚠️ This model is intended for research purposes only. Access requests must be made from an institutional, academic, or corporate email address; requests from public email providers will be denied. We appreciate your understanding.

## 🎙️ F5-TTS-Vietnamese-1000h

A compact fine-tuned version of F5-TTS trained on 1000 hours of Vietnamese speech.

🔗 For more fine-tuning experiments, visit: https://github.com/nguyenthienhy/F5-TTS-Vietnamese

📜 License: CC-BY-NC-SA-4.0 — Non-commercial research use only.

## 📌 Model Details

- Dataset: ViVoice, VLSP 2021, VLSP 2022, VLSP 2023
- Total dataset duration: 1000 hours
- Data processing techniques:
  - Remove all background music from the audio using Facebook's Demucs model: https://github.com/facebookresearch/demucs
  - Discard audio files shorter than 1 second or longer than 30 seconds.
  - Filter out audio with bad transcripts using the Chunk-Large-Former Speech2Text model by Zalo-AI.
  - Keep the default punctuation marks unchanged.
  - Normalize text to lowercase.
- Training configuration:
  - Base model: F5-TTS Base
  - GPU: RTX 3090
  - Batch size: 3200 frames
  - Training time: 1.5 months

## 📝 Usage

To load and use the model, follow the example below:

```bash
git clone https://github.com/nguyenthienhy/F5-TTS-Vietnamese
cd F5-TTS-Vietnamese
pip install -e .
f5-tts_infer-cli \
  --model "F5TTS_Base" \
  --ref_audio ref.wav \
  --ref_text "cả hai bên hãy cố gắng hiểu cho nhau" \
  --gen_text "mình muốn ra nước ngoài để tiếp xúc nhiều công ty lớn, sau đó mang những gì học được về việt nam giúp xây dựng các công trình tốt hơn" \
  --speed 1.0 \
  --vocoder_name vocos \
  --vocab_file F5-TTS-Vietnamese-ViVoice/vocab.txt \
  --ckpt_file F5-TTS-Vietnamese-ViVoice/model_last.pt
```
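The duration and text-normalization rules in the data-processing list above can be sketched as follows. This is a minimal illustration, not the card's actual pipeline: the clip names and durations are hypothetical, and durations are assumed to be precomputed elsewhere (e.g. read from WAV headers).

```python
# Sketch of the filtering rules described above: clips shorter than 1 s or
# longer than 30 s are discarded, punctuation is kept as-is, and
# transcripts are lowercased. Thresholds match the model card.

MIN_DUR_S = 1.0   # discard clips shorter than 1 second
MAX_DUR_S = 30.0  # discard clips longer than 30 seconds

def keep_clip(duration_s: float) -> bool:
    """Return True if a clip's duration falls inside the allowed range."""
    return MIN_DUR_S <= duration_s <= MAX_DUR_S

def normalize_transcript(text: str) -> str:
    """Lowercase the transcript while leaving punctuation untouched."""
    return text.lower()

if __name__ == "__main__":
    clips = [("a.wav", 0.4), ("b.wav", 5.2), ("c.wav", 31.0)]
    kept = [name for name, dur in clips if keep_clip(dur)]
    print(kept)  # only b.wav survives the duration filter
    print(normalize_transcript("Xin Chào, Việt Nam!"))
```

Note that Python's `str.lower()` handles Vietnamese diacritics correctly, so no extra Unicode handling is needed for the lowercase step.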
# Zipformer-30M-RNNT-6000h
Vietnamese Speech-to-Text (ASR) — ZipFormer-30M-RNNT-6000h

## 🔍 Overview

The Vietnamese Speech-to-Text (ASR) model is built on the ZipFormer architecture — an improved variant of the Conformer — feat...
# ZipVoice-Vietnamese-2500h
## 🛑 Important Note

⚠️ This model is intended for research purposes only. Access requests must be made from an institutional, academic, or corporate email address; requests from public email providers will be denied. We appreciate your understanding.

## 🎙️ ZipVoice-Vietnamese-2500h

ZipVoice is a series of fast, high-quality zero-shot TTS models based on flow matching. Key features:

1. Small and fast: only 123M parameters.
2. High-quality voice cloning: state-of-the-art performance in speaker similarity, intelligibility, and naturalness.
3. Multi-mode: supports both single-speaker and dialogue speech generation.

This checkpoint is a compact fine-tuned version of ZipVoice trained on 2500 hours of Vietnamese speech.

🔗 For more fine-tuning and inference experiments, visit: https://github.com/k2-fsa/ZipVoice

📜 License: CC-BY-NC-SA-4.0 — Non-commercial research use only.

## 📌 Model Details

- Dataset: PhoAudioBook, ViVoice, TeacherDinh-UEH
- Total dataset duration: 2500 hours
- Data processing techniques:
  - Remove all background music from the audio using Facebook's Demucs model: https://github.com/facebookresearch/demucs
  - Discard audio files shorter than 1 second or longer than 30 seconds.
  - Keep the default punctuation marks unchanged.
  - Normalize text to lowercase.
- Training configuration:
  - Base model: ZipVoice with the espeak-ng `vi` tokenizer
  - GPU: RTX 3090
  - Batch size: max duration 200
  - Training progress: stopped at 525,000 steps (epoch 11)

## 🛑 Update Note

Thank you, Teacher Định from the University of Economics Ho Chi Minh City (UEH), for providing an additional 50-hour high-quality labeled dataset.
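The "max duration 200" batch size above refers to dynamic batching by total audio duration rather than by a fixed utterance count, as is common in k2/icefall-style training. A simplified sketch of the idea, with hypothetical utterance IDs and durations (real samplers such as Lhotse's `DynamicBucketingSampler` also sort and bucket utterances by length first):

```python
# Simplified duration-based dynamic batching: utterances are accumulated
# into a batch until adding one more would push the batch's total
# duration past the cap (200 seconds, per the model card).

from typing import List, Tuple

def batch_by_duration(utts: List[Tuple[str, float]],
                      max_duration: float = 200.0) -> List[List[str]]:
    """Group (utt_id, duration_s) pairs into batches capped by total duration."""
    batches, current, total = [], [], 0.0
    for utt_id, dur in utts:
        if current and total + dur > max_duration:
            batches.append(current)
            current, total = [], 0.0
        current.append(utt_id)
        total += dur
    if current:
        batches.append(current)
    return batches

if __name__ == "__main__":
    utts = [("u1", 120.0), ("u2", 90.0), ("u3", 60.0), ("u4", 150.0)]
    print(batch_by_duration(utts))  # [['u1'], ['u2', 'u3'], ['u4']]
```

Capping by duration keeps GPU memory use roughly constant across batches even when clip lengths vary widely, which matters for a corpus with clips anywhere from 1 to 30 seconds.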