The first 8-bit VibeVoice model that actually works
[License](LICENSE) · [Model on Hugging Face](https://huggingface.co/FabioSarracino/VibeVoice-Large-Q8)
If you've tried other 8-bit quantized VibeVoice models, you probably got nothing but static noise. This one actually works.
The secret? Selective quantization: I only quantized the language model (the most robust part), while keeping audio-critical components (diffusion head, VAE, connectors) at full precision.
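The selective policy can be sketched as a simple filter over module names. This is a minimal illustration of the idea, not the actual quantization script; the prefixes below are hypothetical stand-ins, not VibeVoice's real module names.

```python
# Sketch of the selective-quantization policy: quantize only the language
# model's weights, keep the audio path at full precision.
# NOTE: these prefixes are hypothetical stand-ins for the real module names.
AUDIO_CRITICAL_PREFIXES = ("diffusion_head", "acoustic_vae", "connector")

def should_quantize(module_name: str) -> bool:
    """Return True only for modules that are safe to quantize to 8-bit."""
    return not module_name.startswith(AUDIO_CRITICAL_PREFIXES)
```

With `bitsandbytes`, the same effect can be achieved by passing the audio-critical module names to `BitsAndBytesConfig(load_in_8bit=True, llm_int8_skip_modules=[...])`.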
Results

- ✅ Perfect audio, identical to the original model
- ✅ 11.6 GB instead of 18.7 GB (-38%)
- ✅ Uses ~12 GB VRAM instead of 20 GB
- ✅ Works on 12 GB GPUs (RTX 3060, 4070 Ti, etc.)
Most 8-bit models you'll find online quantize everything aggressively. Result: audio components get quantized → numerical errors propagate → audio = pure noise.
I only quantized what can be safely quantized without losing quality.
Result: 52% of parameters quantized, 48% at full precision = perfect audio quality.
| Model | Size | Audio Quality | Status |
|-------|------|---------------|--------|
| Original VibeVoice | 18.7 GB | ⭐⭐⭐⭐⭐ | Full precision |
| Other 8-bit models | 10.6 GB | 💥 NOISE | ❌ Don't work |
| This model | 11.6 GB | ⭐⭐⭐⭐⭐ | ✅ Perfect |
+1.0 GB vs other 8-bit models = perfect audio instead of noise. Worth it.
2. Download this model to `ComfyUI/models/vibevoice/`
Minimum

- VRAM: 12 GB
- RAM: 16 GB
- GPU: NVIDIA with CUDA (required)
- Storage: 11 GB

Recommended

- VRAM: 16+ GB
- RAM: 32 GB
- GPU: RTX 3090/4090, A5000 or better
⚠️ Not supported: CPU, Apple Silicon (MPS), AMD GPUs
1. Requires NVIDIA GPU with CUDA - won't work on CPU or Apple Silicon
2. Inference only - don't use for fine-tuning
3. Requires:
   - `transformers>=4.51.3`
   - `bitsandbytes>=0.43.0`
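For use outside ComfyUI, loading the pre-quantized checkpoint might look like the sketch below. This is an assumption, not a verified recipe: the `AutoModel` class and the `trust_remote_code` flag are guesses at how a custom architecture on the Hub would load — if you use ComfyUI, rely on the node's own loader instead.

```python
# Hedged sketch: loading the pre-quantized checkpoint with transformers.
# Assumes the repo is loadable via transformers' Auto classes with remote code;
# the exact class may differ for VibeVoice's custom architecture.
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "FabioSarracino/VibeVoice-Large-Q8",
    device_map="auto",       # place layers on the available CUDA device
    trust_remote_code=True,  # VibeVoice is not a stock transformers class
)
```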
✅ Use this 8-bit if:

- You have 12-16 GB VRAM
- You want maximum quality with reduced size
- You need a production-ready model
- You want the best size/quality balance

Use full precision (18.7 GB) if:

- You have unlimited VRAM (24+ GB)
- You're doing research requiring absolute precision

Use 4-bit NF4 (~6.6 GB) if:

- You only have 8-10 GB VRAM
- You can accept a small quality trade-off
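The decision guide above condenses into a tiny helper. A sketch only: `recommended_variant` is a hypothetical name, and its thresholds simply restate the VRAM ranges listed above.

```python
def recommended_variant(vram_gb: float) -> str:
    """Map available VRAM to the variant suggested in the guide above."""
    if vram_gb >= 24:
        return "full precision (18.7 GB)"
    if vram_gb >= 12:
        return "8-bit (this model, 11.6 GB)"
    if vram_gb >= 8:
        return "4-bit NF4 (~6.6 GB)"
    return "not enough VRAM for VibeVoice-Large"
```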
- Close other GPU applications
- Use `device_map="auto"`
- Reduce batch size to 1
This shouldn't happen! If it does:

1. Verify you downloaded the correct model
2. Update transformers: `pip install --upgrade transformers`
3. Check CUDA: `torch.cuda.is_available()` should return `True`
- Original Model - full precision base
- ComfyUI Node - ComfyUI integration
- Issues: GitHub Issues
- Questions: HuggingFace Discussions