Vietnamese Text-to-Speech model finetuned from NeuTTS-Air on 2.6M+ Vietnamese audio samples.
NeuTTS-Air Vietnamese is a Text-to-Speech (TTS) model for Vietnamese, fine-tuned from the NeuTTS-Air base model on a large dataset of more than 2.6M Vietnamese audio samples.
- Base Model: neuphonic/neutts-air (Qwen2.5 0.5B, 552M parameters)
- Language: Vietnamese (vi)
- Task: Text-to-Speech (TTS)
- Training Data: 2.6M+ Vietnamese audio samples
- Audio Codec: NeuCodec
- Sample Rate: 24kHz
- License: Apache 2.0
- ✅ High Quality Vietnamese TTS - Natural Vietnamese speech synthesis
- ✅ Large-scale Training - Trained on 2.6M+ samples
- ✅ Voice Cloning - Clone voice from reference audio
- ✅ Text Normalization - Automatic Vietnamese text normalization with ViNorm
- ✅ Fast Inference - Optimized for production use
- ✅ Easy to Use - Simple API and Gradio UI
For easier usage, use the inference script provided in the project's GitHub repository.
- Dataset Size: 2.6M+ Vietnamese audio samples
- Audio Format: WAV, 16kHz, mono
- Text: Vietnamese with diacritics
- Train/Val Split: 99.5% / 0.5%
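To make the 99.5% / 0.5% split concrete: on roughly 2.6M samples it holds out about 13,000 for validation. The sketch below shows one way such a split could be produced. The function name `split_train_val`, the seed, and the use of integer sample ids are assumptions for illustration, not the procedure used in the training code.

```python
import random


def split_train_val(sample_ids, val_fraction=0.005, seed=42):
    """Shuffle ids deterministically and hold out `val_fraction` for validation.

    Hypothetical helper: the real training code may split differently.
    """
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)  # fixed seed keeps the split reproducible
    n_val = max(1, round(len(ids) * val_fraction))
    return ids[n_val:], ids[:n_val]


# Small stand-in for the real dataset (200k ids instead of 2.6M+).
train_ids, val_ids = split_train_val(range(200_000))
print(len(train_ids), len(val_ids))  # 199000 1000
```

A fixed seed matters here because the validation set is small: re-shuffling between runs would make validation losses hard to compare.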
- Base Model: neuphonic/neutts-air (Qwen2.5 0.5B)
- Epochs: 3
- Batch Size: 4 per device
- Gradient Accumulation: 2 steps (effective batch size: 8)
- Learning Rate: 4e-5
- Optimizer: AdamW (fused)
- Precision: BFloat16
- Hardware: NVIDIA RTX 3090 (24GB)
- Training Time: ~2.5-3 days
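The effective batch size above follows from per-device batch size × gradient accumulation steps × number of devices. The sketch below records the listed hyperparameters in a small config object and derives the step count. The class `TrainConfig`, the single-GPU assumption, and the training-set size of 2,587,000 (99.5% of an assumed 2.6M) are illustrative and not taken from the training code.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hypothetical container for the hyperparameters listed above."""
    epochs: int = 3
    per_device_batch_size: int = 4
    gradient_accumulation_steps: int = 2
    learning_rate: float = 4e-5
    num_devices: int = 1  # assumption: a single RTX 3090

    @property
    def effective_batch_size(self) -> int:
        return (self.per_device_batch_size
                * self.gradient_accumulation_steps
                * self.num_devices)

    def optimizer_steps(self, num_train_samples: int) -> int:
        """Total optimizer updates over all epochs."""
        steps_per_epoch = math.ceil(num_train_samples / self.effective_batch_size)
        return steps_per_epoch * self.epochs


config = TrainConfig()
print(config.effective_batch_size)        # 8
print(config.optimizer_steps(2_587_000))  # 970125
```

Under these assumptions, the run performs just under one million optimizer updates.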
- ✅ Pre-encoded Dataset - 6x faster training
- ✅ TF32 Precision - 20% speedup on Ampere GPUs
- ✅ Fused AdamW - 10% faster optimizer
- ✅ Dataloader Optimizations - Pin memory, prefetch
- ✅ Increased Batch Size - Better GPU utilization
Total Speedup: 10-12x faster than baseline (30 days → 2.5-3 days)
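As a sanity check on these figures, the arithmetic can be worked through directly. The sketch below assumes the individual gains compose multiplicatively, which the source does not state. Under that assumption the three quantified optimizations give about 7.9x, so the dataloader and batch-size changes would account for the remaining factor of roughly 1.3-1.5x.

```python
from math import prod


def days_after_speedup(baseline_days: float, speedup: float) -> float:
    """Wall-clock training time after applying an overall speedup factor."""
    return baseline_days / speedup


def combined_speedup(factors) -> float:
    """Overall speedup if the individual gains multiply (an assumption)."""
    return prod(factors)


baseline_days = 30.0
print(days_after_speedup(baseline_days, 10))  # 3.0
print(days_after_speedup(baseline_days, 12))  # 2.5

# Quantified gains: pre-encoding 6x, TF32 +20%, fused AdamW +10%.
quantified = combined_speedup([6.0, 1.20, 1.10])
print(round(quantified, 2))  # 7.92

# Factor the unquantified optimizations would need to supply.
print(round(10 / quantified, 2), round(12 / quantified, 2))  # 1.26 1.52
```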
- Sample Rate: 24kHz
- Natural Prosody: Yes
- Voice Cloning: Supported
- Text Normalization: Automatic (numbers, dates, abbreviations)
- GPU (RTX 3090): ~0.5s per sentence
- CPU: ~3-5s per sentence
- Requires reference audio for voice cloning
- Best results with clear, high-quality reference audio (3-10 seconds)
- May struggle with very long sentences (>100 words)
- Requires Vietnamese text with proper diacritics for best quality
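Two of these limitations can be guarded against before calling the model: reference clips outside the 3-10 second range, and inputs longer than about 100 words. The sketch below uses only the Python standard library. The helper names and the sentence-boundary regex are assumptions for illustration and are not part of the project's inference script.

```python
import re
import wave


def wav_duration_seconds(path: str) -> float:
    """Duration of a PCM WAV file, read from its header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def is_good_reference_length(seconds: float, min_s: float = 3.0,
                             max_s: float = 10.0) -> bool:
    """True if a reference clip falls in the recommended 3-10 s range."""
    return min_s <= seconds <= max_s


def split_long_text(text: str, max_words: int = 100) -> list:
    """Split text into chunks of at most `max_words` words.

    Chunks break at sentence boundaries where possible; a single sentence
    longer than `max_words` is hard-wrapped.
    """
    sentences = [s for s in re.split(r"(?<=[.!?…])\s+", text.strip()) if s]
    chunks, current = [], []
    for sentence in sentences:
        words = sentence.split()
        while len(words) > max_words:  # oversized sentence: hard-wrap it
            if current:
                chunks.append(" ".join(current))
                current = []
            chunks.append(" ".join(words[:max_words]))
            words = words[max_words:]
        if len(current) + len(words) > max_words:
            chunks.append(" ".join(current))
            current = []
        current.extend(words)
    if current:
        chunks.append(" ".join(current))
    return chunks


chunks = split_long_text("Xin chào. Tôi là trợ lý ảo.", max_words=4)
print(chunks)  # ['Xin chào.', 'Tôi là trợ lý', 'ảo.']
```

Each chunk can then be synthesized separately and the resulting audio concatenated. Hard-wrapping mid-sentence may hurt prosody, so splitting on commas first could be a better fallback.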
⚠️ Voice Cloning Ethics:
- Only use reference audio with proper consent
- Do not use for impersonation or fraud
- Respect privacy and intellectual property rights

⚠️ Potential Misuse:
- Deepfake audio generation
- Unauthorized voice cloning
- Misinformation campaigns

Recommended Use:
- Accessibility tools (text-to-speech for visually impaired users)
- Educational content
- Virtual assistants
- Audiobook narration (with consent)
- Language learning applications
- Base Model: Neuphonic for NeuTTS-Air
- Backbone: Qwen Team for Qwen2.5
- Codec: Neuphonic for NeuCodec
- Phonemizer: espeak-ng
- Text Normalization: ViNorm
Full training and inference code: https://github.com/iamdinhthuan/neutts-air-fintune
For questions or issues, please open an issue on GitHub.
Model Card Authors: Your Name

Last Updated: 2025-01-01