nineninesix
kani-tts-2-en
kani-tts-450m-0.2-pt
A high-speed, high-fidelity Text-to-Speech model optimized for real-time conversational AI applications. KaniTTS uses a two-stage pipeline combining a large language model with an efficient audio codec for exceptional speed and audio quality. The architecture generates compressed token representations through a backbone LLM, then rapidly synthesizes waveforms via a neural audio codec, achieving extremely low latency.

Key Specifications:
- Model Size: 450M parameters
- Sample Rate: 22kHz
- Languages: English, German, Arabic, Chinese, Korean, French, Japanese, Spanish
- License: Apache 2.0

Nvidia RTX 5080 Benchmarks:
- Latency: ~1 second to generate 15 seconds of audio
- Memory: 2GB GPU VRAM
- Quality Metrics: MOS 4.3/5 (naturalness), WER

Sample prompts:
- What do we say to the god of death? Not today!
- What do you call a lawyer with an IQ of 60? Your honor.
- You mean, let me understand this cause, you know maybe it's me, it's a little fucked up maybe, but I'm funny how, I mean funny like I'm a clown, I amuse you?

Use Cases:
- Conversational AI: Real-time speech for chatbots and virtual assistants
- Edge/Server Deployment: Resource-efficient inference on affordable hardware
- Accessibility: Screen readers and language learning applications
- Research: Fine-tuning for specific voices, accents, or emotions

Limitations:
- Performance degrades with inputs exceeding 2000 tokens
- Limited expressivity without fine-tuning for specific emotions
- May inherit biases from training data in prosody or pronunciation
- Optimized primarily for English; other languages may require additional training

Optimization Tips:
- Multilingual Performance: Continually pretrain on target language datasets and fine-tune NanoCodec
- Batch Processing: Use batches of 8-16 for high-throughput scenarios
- Hardware: Optimized for NVIDIA Blackwell architecture GPUs

Models:
- Pretrained Model
- Fine-tuned Model
- HuggingFace Space

Examples:
- Inference Example
- Fine-tuning code
- Example Dataset
- GitHub Repository

Acknowledgments

Built on top of LiquidAI LFM2 350M as the backbone and Nvidia NanoCodec for audio processing.

Responsible Use

Prohibited activities include:
- Illegal content or harmful, threatening, defamatory, or obscene material
- Hate speech, harassment, or incitement of violence
- Generating false or misleading information
- Impersonating individuals without consent
- Malicious activities such as spamming, phishing, or fraud

By using this model, you agree to comply with these restrictions and all applicable laws.

Contact

Have a question, feedback, or need support? Please fill out our contact form and we'll get back to you as soon as possible.
kani-tts-400m-0.3-pt
[Discord](https://discord.gg/NzP3rjB4SB) · [License: Apache 2.0](https://opensource.org/licenses/Apache-2.0)

A high-speed, high-fidelity Text-to-Speech model optimized for real-time conversational AI applications. KaniTTS uses a two-stage pipeline combining a large language model with an efficient audio codec for exceptional speed and audio quality. The architecture generates compressed token representations through a backbone LLM, then rapidly synthesizes waveforms via a neural audio codec, achieving extremely low latency.

Key Specifications:
- Model Size: 400M parameters
- Sample Rate: 22kHz
- Languages: English, Chinese, Korean, Spanish, German, Japanese, Kyrgyz
- License: Apache 2.0

It's lightweight, so you can install, load a model, and speak in minutes. Designed for quick starts and simple workflows: no heavy setup, just pip install and run. More details...

This model does NOT support multiple speakers. For models that do, you can check the supported speakers and select a specific voice.

You can listen to generated audio directly in Jupyter notebooks or IPython.

On NovitaAI RTX 5090 using vLLM:
- RTF: ~0.2 (5 times faster than realtime)
- Memory: 16GB GPU VRAM used
- Source Code: https://github.com/nineninesix-ai/kanitts-vllm

GPU Benchmark Results

| GPU Model | VRAM | Cost ($/hr) | RTF |
|-----------|------|-------------|-----|
| RTX 5090 | 32GB | $0.423 | 0.190 |
| RTX 4080 | 16GB | $0.220 | 0.200 |
| RTX 5060 Ti | 16GB | $0.138 | 0.529 |
| RTX 4060 Ti | 16GB | $0.122 | 0.537 |
| RTX 3060 | 12GB | $0.093 | 0.600 |

Datasets:
- https://huggingface.co/datasets/laion/Emolia
- https://huggingface.co/datasets/nytopop/expresso-conversational
- https://huggingface.co/datasets/NightPrince/MasriSpeech-Full

Use Cases:
- Conversational AI: Real-time speech for chatbots and virtual assistants
- Edge/Server Deployment: Resource-efficient inference on affordable hardware
- Accessibility: Screen readers and language learning applications
- Research: Fine-tuning for specific voices, accents, or emotions
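For context on the benchmark numbers above: the real-time factor (RTF) is the ratio of generation time to the duration of the audio produced, so lower is faster. A minimal sketch with illustrative numbers (not from an actual benchmark run):

```python
def rtf(generation_seconds: float, audio_seconds: float) -> float:
    """Real-time factor: values below 1.0 mean faster than real time."""
    return generation_seconds / audio_seconds

# At RTF 0.2, producing 15 seconds of speech takes about 3 seconds of compute.
assert rtf(3.0, 15.0) == 0.2

# The speedup over real time is the inverse of the RTF.
assert round(1 / rtf(3.0, 15.0)) == 5
```

This is why the table reports the RTX 5090 (RTF 0.190) as roughly five times faster than real time, while the RTX 3060 (RTF 0.600) is still comfortably faster than playback.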
Limitations:
- Performance degrades with inputs exceeding 15 seconds (use sliding-window chunking)
- Limited expressivity without fine-tuning for specific emotions
- May inherit biases from training data in prosody or pronunciation
- Optimized primarily for English; other languages may require additional training

Optimization Tips:
- Multilingual Performance: Continually pretrain on target language datasets and fine-tune NanoCodec
- Batch Processing: Use batches of 8-16 for high-throughput scenarios
- Hardware: Optimized for NVIDIA Blackwell architecture GPUs

Models:
- Pretrained Model: https://huggingface.co/nineninesix/kani-tts-500m-0.3-pt
- English: https://huggingface.co/nineninesix/kani-tts-400m-en
- Chinese: https://huggingface.co/nineninesix/kani-tts-400m-zh
- Korean: https://huggingface.co/nineninesix/kani-tts-400m-ko
- German: https://huggingface.co/nineninesix/kani-tts-400m-de
- Spanish: https://huggingface.co/nineninesix/kani-tts-400m-es
- Arabic: https://huggingface.co/nineninesix/kani-tts-400m-ar
- Japanese: https://huggingface.co/nineninesix/kani-tts-370m-expo2025-osaka-ja

Examples:
- Space: https://huggingface.co/spaces/nineninesix/KaniTTS
- OpenAI-compatible API: https://github.com/nineninesix-ai/kanitts-vllm
- Fine-tuning code pipeline: https://github.com/nineninesix-ai/KaniTTS-Finetune-pipeline
- Dataset preparation pipeline: https://github.com/nineninesix-ai/nano-codec-dataset-pipeline
- Example Dataset: https://huggingface.co/datasets/nineninesix/expresso-conversational-en-nano-codec-dataset
- ComfyUI node: https://github.com/wildminder/ComfyUI-KaniTTS by WildAi
- Next.js basic app: https://github.com/nineninesix-ai/open-audio. It uses the OpenAI npm package to connect to the OpenAI-compatible server API provided by kanitts-vllm.
- GitHub Repository: https://github.com/nineninesix-ai/kani-tts

Links:
- Website: https://www.nineninesix.ai
- Contact Form: https://airtable.com/appX2G2TpoRk4M5Bf/pagO2xbIOjiwulPcP/form

Acknowledgments

Built on top of LiquidAI LFM2 350M as the backbone and Nvidia NanoCodec for audio processing.

Responsible Use

Prohibited activities include:
- Illegal content or harmful, threatening, defamatory, or obscene material
- Hate speech, harassment, or incitement of violence
- Generating false or misleading information
- Impersonating individuals without consent
- Malicious activities such as spamming, phishing, or fraud

By using this model, you agree to comply with these restrictions and all applicable laws.

Contact

Have a question, feedback, or need support? Please fill out our contact form and we'll get back to you as soon as possible.
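The sliding-window chunking mentioned under Limitations (for inputs longer than about 15 seconds of speech) could be handled by splitting text at sentence boundaries before synthesis and concatenating the resulting audio. A minimal sketch of such a splitter; `chunk_text` is a hypothetical helper, not part of the KaniTTS API:

```python
import re

def chunk_text(text: str, max_chars: int = 300) -> list[str]:
    """Split text at sentence boundaries so each chunk stays under max_chars."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

parts = chunk_text("First sentence. Second sentence! Third?", max_chars=20)
# Each part is a complete sentence, short enough to synthesize on its own.
```

Each chunk would then be passed to the model separately, and the generated waveforms joined in order.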
kani-tts-2-pt
kani-tts-370m-MLX
This model nineninesix/kani-tts-370m-MLX was converted to MLX format from nineninesix/kani-tts-370m using mlx-lm version 0.28.2.
kani-tts-450m-0.1-pt
kani-tts-370m
A high-speed, high-fidelity Text-to-Speech model optimized for real-time conversational AI applications. KaniTTS uses a two-stage pipeline combining a large language model with an efficient audio codec for exceptional speed and audio quality.
kani-tts-400m-en
[Discord](https://discord.gg/NzP3rjB4SB) · [License: Apache 2.0](https://opensource.org/licenses/Apache-2.0)

A high-speed, high-fidelity Text-to-Speech model optimized for real-time conversational AI applications. KaniTTS uses a two-stage pipeline combining a large language model with an efficient audio codec for exceptional speed and audio quality. The architecture generates compressed token representations through a backbone LLM, then rapidly synthesizes waveforms via a neural audio codec, achieving extremely low latency.

Key Specifications:
- Model Size: 400M parameters
- Sample Rate: 22kHz
- Language: English
- License: Apache 2.0

On NovitaAI RTX 5090 using vLLM:
- RTF: ~0.2 (5 times faster than realtime)
- Memory: 16GB GPU VRAM used
- Source Code: https://github.com/nineninesix-ai/kanitts-vllm

| GPU Model | VRAM | Cost ($/hr) | RTF |
|-----------|------|-------------|-----|
| RTX 5090 | 32GB | $0.423 | 0.190 |
| RTX 4080 | 16GB | $0.220 | 0.200 |
| RTX 5060 Ti | 16GB | $0.138 | 0.529 |
| RTX 4060 Ti | 16GB | $0.122 | 0.537 |
| RTX 3060 | 12GB | $0.093 | 0.600 |

Lower RTF is better.

Sample prompts:
- Colleges of Oxford, Cambridge, Durham and the University of the Highlands and Islands UHI are 'listed bodies', as bodies that appear to the Secretary of State to be constituent colleges, schools, halls or other institutions of a university.
- A joyful flock of sparrows chirped merrily in the old oak tree outside my window this morning.
- Darlin', I still ain't feelin' so well. I'm goin' to bed.

Use Cases:
- Conversational AI: Real-time speech for chatbots and virtual assistants
- Edge/Server Deployment: Resource-efficient inference on affordable hardware
- Accessibility: Screen readers and language learning applications
- Research: Fine-tuning for specific voices, accents, or emotions

Limitations:
- Performance degrades with inputs exceeding 15 seconds (use sliding-window chunking)
- Limited expressivity without fine-tuning for specific emotions
- May inherit biases from training data in prosody or pronunciation
- Optimized primarily for English; other languages may require additional training

Optimization Tips:
- Multilingual Performance: Continually pretrain on target language datasets and fine-tune NanoCodec
- Batch Processing: Use batches of 8-16 for high-throughput scenarios
- Hardware: Optimized for NVIDIA Blackwell architecture GPUs

Models:
- Pretrained Model: https://huggingface.co/nineninesix/kani-tts-500m-0.3-pt
- Space: https://huggingface.co/spaces/nineninesix/KaniTTS

Examples:
- OpenAI-compatible API Example: https://github.com/nineninesix-ai/kanitts-vllm
- Fine-tuning code pipeline: https://github.com/nineninesix-ai/KaniTTS-Finetune-pipeline
- Dataset preparation pipeline: https://github.com/nineninesix-ai/nano-codec-dataset-pipeline
- Example Dataset: https://huggingface.co/datasets/nineninesix/expresso-conversational-en-nano-codec-dataset
- GitHub Repository: https://github.com/nineninesix-ai/kani-tts
- ComfyUI node: https://github.com/wildminder/ComfyUI-KaniTTS by WildAi
- Next.js basic app: https://github.com/nineninesix-ai/open-audio. It uses the OpenAI npm package to connect to the OpenAI-compatible server API provided by kanitts-vllm.

Links:
- Website: https://www.nineninesix.ai
- Contact Form: https://airtable.com/appX2G2TpoRk4M5Bf/pagO2xbIOjiwulPcP/form

Acknowledgments

Built on top of LiquidAI LFM2 350M as the backbone and Nvidia NanoCodec for audio processing.
Responsible Use

Prohibited activities include:
- Illegal content or harmful, threatening, defamatory, or obscene material
- Hate speech, harassment, or incitement of violence
- Generating false or misleading information
- Impersonating individuals without consent
- Malicious activities such as spamming, phishing, or fraud

By using this model, you agree to comply with these restrictions and all applicable laws.

Contact

Have a question, feedback, or need support? Please fill out our contact form and we'll get back to you as soon as possible.
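The batch-size recommendation in the Optimization Tips above (batches of 8-16 for high throughput) amounts to grouping requests before sending them to the model or serving stack. A minimal, framework-agnostic batching helper; the `texts` list and batch handling are illustrative, and real batched inference would go through the serving layer (e.g. vLLM):

```python
from itertools import islice

def batched(items, batch_size=8):
    """Yield successive fixed-size batches from an iterable."""
    it = iter(items)
    while batch := list(islice(it, batch_size)):
        yield batch

texts = [f"utterance {i}" for i in range(20)]
batches = list(batched(texts, batch_size=8))
# 20 utterances become batches of sizes 8, 8, and 4.
```

Each batch would then be submitted as one synthesis call, trading a little latency per request for much higher overall throughput.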
kani-tts-400m-es
[Discord](https://discord.gg/NzP3rjB4SB) · [License: Apache 2.0](https://opensource.org/licenses/Apache-2.0)

A high-speed, high-fidelity Text-to-Speech model optimized for real-time conversational AI applications. KaniTTS uses a two-stage pipeline combining a large language model with an efficient audio codec for exceptional speed and audio quality. The architecture generates compressed token representations through a backbone LLM, then rapidly synthesizes waveforms via a neural audio codec, achieving extremely low latency.

Key Specifications:
- Model Size: 400M parameters
- Sample Rate: 22kHz
- Language: Spanish
- License: Apache 2.0

On NovitaAI RTX 5090 using vLLM:
- RTF: ~0.2 (5 times faster than realtime)
- Memory: 16GB GPU VRAM used
- Source Code: https://github.com/nineninesix-ai/kanitts-vllm

| GPU Model | VRAM | Cost ($/hr) | RTF |
|-----------|------|-------------|-----|
| RTX 5090 | 32GB | $0.423 | 0.190 |
| RTX 4080 | 16GB | $0.220 | 0.200 |
| RTX 5060 Ti | 16GB | $0.138 | 0.529 |
| RTX 4060 Ti | 16GB | $0.122 | 0.537 |
| RTX 3060 | 12GB | $0.093 | 0.600 |

Lower RTF is better.

Sample prompts (Spanish):
- A veces, una simple mirada dice más que mil palabras. (Sometimes, a simple glance says more than a thousand words.)
- ¿Será que todavía me recuerdas como antes? (Could it be that you still remember me like before?)
- ¡Qué alegría volver a verte después de tanto tiempo! (What a joy to see you again after so long!)

Use Cases:
- Conversational AI: Real-time speech for chatbots and virtual assistants
- Edge/Server Deployment: Resource-efficient inference on affordable hardware
- Accessibility: Screen readers and language learning applications
- Research: Fine-tuning for specific voices, accents, or emotions

Limitations:
- Performance degrades with inputs exceeding 15 seconds (use sliding-window chunking)
- Limited expressivity without fine-tuning for specific emotions
- May inherit biases from training data in prosody or pronunciation
- Optimized primarily for English; other languages may require additional training

Optimization Tips:
- Multilingual Performance: Continually pretrain on target language datasets and fine-tune NanoCodec
- Batch Processing: Use batches of 8-16 for high-throughput scenarios
- Hardware: Optimized for NVIDIA Blackwell architecture GPUs

Models:
- Pretrained Model: https://huggingface.co/nineninesix/kani-tts-500m-0.3-pt
- Space: https://huggingface.co/spaces/nineninesix/KaniTTS

Examples:
- OpenAI-compatible API Example: https://github.com/nineninesix-ai/kanitts-vllm
- Fine-tuning code pipeline: https://github.com/nineninesix-ai/KaniTTS-Finetune-pipeline
- Dataset preparation pipeline: https://github.com/nineninesix-ai/nano-codec-dataset-pipeline
- Example Dataset: https://huggingface.co/datasets/nineninesix/expresso-conversational-en-nano-codec-dataset
- GitHub Repository: https://github.com/nineninesix-ai/kani-tts
- ComfyUI node: https://github.com/wildminder/ComfyUI-KaniTTS by WildAi
- Next.js basic app: https://github.com/nineninesix-ai/open-audio. It uses the OpenAI npm package to connect to the OpenAI-compatible server API provided by kanitts-vllm.

Links:
- Website: https://www.nineninesix.ai
- Contact Form: https://airtable.com/appX2G2TpoRk4M5Bf/pagO2xbIOjiwulPcP/form

Acknowledgments

Built on top of LiquidAI LFM2 350M as the backbone and Nvidia NanoCodec for audio processing.
Responsible Use

Prohibited activities include:
- Illegal content or harmful, threatening, defamatory, or obscene material
- Hate speech, harassment, or incitement of violence
- Generating false or misleading information
- Impersonating individuals without consent
- Malicious activities such as spamming, phishing, or fraud

By using this model, you agree to comply with these restrictions and all applicable laws.

Contact

Have a question, feedback, or need support? Please fill out our contact form and we'll get back to you as soon as possible.
kani-tts-450m-0.2-ft
kani-tts-400m-ky-kani
kani-tts-400m-de
[Discord](https://discord.gg/NzP3rjB4SB) · [License: Apache 2.0](https://opensource.org/licenses/Apache-2.0)

A high-speed, high-fidelity Text-to-Speech model optimized for real-time conversational AI applications. KaniTTS uses a two-stage pipeline combining a large language model with an efficient audio codec for exceptional speed and audio quality. The architecture generates compressed token representations through a backbone LLM, then rapidly synthesizes waveforms via a neural audio codec, achieving extremely low latency.

Key Specifications:
- Model Size: 400M parameters
- Sample Rate: 22kHz
- Language: German
- License: Apache 2.0

On NovitaAI RTX 5090 using vLLM:
- RTF: ~0.2 (5 times faster than realtime)
- Memory: 16GB GPU VRAM used
- Source Code: https://github.com/nineninesix-ai/kanitts-vllm

| GPU Model | VRAM | Cost ($/hr) | RTF |
|-----------|------|-------------|-----|
| RTX 5090 | 32GB | $0.423 | 0.190 |
| RTX 4080 | 16GB | $0.220 | 0.200 |
| RTX 5060 Ti | 16GB | $0.138 | 0.529 |
| RTX 4060 Ti | 16GB | $0.122 | 0.537 |
| RTX 3060 | 12GB | $0.093 | 0.600 |

Lower RTF is better.

Sample prompts (German):
- Was für ein unglaublicher Tag, voller kleiner Wunder! (What an incredible day, full of little wonders!)
- Hast du jemals das Gefühl, dass die Zeit einfach davonrennt? (Do you ever get the feeling that time is simply running away?)
- Warum habe ich das Gefühl, dass heute etwas Großes passiert? (Why do I have the feeling that something big is happening today?)

Use Cases:
- Conversational AI: Real-time speech for chatbots and virtual assistants
- Edge/Server Deployment: Resource-efficient inference on affordable hardware
- Accessibility: Screen readers and language learning applications
- Research: Fine-tuning for specific voices, accents, or emotions

Limitations:
- Performance degrades with inputs exceeding 15 seconds (use sliding-window chunking)
- Limited expressivity without fine-tuning for specific emotions
- May inherit biases from training data in prosody or pronunciation
- Optimized primarily for English; other languages may require additional training

Optimization Tips:
- Multilingual Performance: Continually pretrain on target language datasets and fine-tune NanoCodec
- Batch Processing: Use batches of 8-16 for high-throughput scenarios
- Hardware: Optimized for NVIDIA Blackwell architecture GPUs

Resources

Models:
- Pretrained Model: https://huggingface.co/nineninesix/kani-tts-500m-0.3-pt
- Space: https://huggingface.co/spaces/nineninesix/KaniTTS

Examples:
- OpenAI-compatible API Example: https://github.com/nineninesix-ai/kanitts-vllm
- Fine-tuning code pipeline: https://github.com/nineninesix-ai/KaniTTS-Finetune-pipeline
- Dataset preparation pipeline: https://github.com/nineninesix-ai/nano-codec-dataset-pipeline
- Example Dataset: https://huggingface.co/datasets/nineninesix/expresso-conversational-en-nano-codec-dataset
- GitHub Repository: https://github.com/nineninesix-ai/kani-tts
- ComfyUI node: https://github.com/wildminder/ComfyUI-KaniTTS by WildAi
- Next.js basic app: https://github.com/nineninesix-ai/open-audio. It uses the OpenAI npm package to connect to the OpenAI-compatible server API provided by kanitts-vllm.

Links:
- Website: https://www.nineninesix.ai
- Contact Form: https://airtable.com/appX2G2TpoRk4M5Bf/pagO2xbIOjiwulPcP/form

Acknowledgments

Built on top of LiquidAI LFM2 350M as the backbone and Nvidia NanoCodec for audio processing.
Responsible Use

Prohibited activities include:
- Illegal content or harmful, threatening, defamatory, or obscene material
- Hate speech, harassment, or incitement of violence
- Generating false or misleading information
- Impersonating individuals without consent
- Malicious activities such as spamming, phishing, or fraud

By using this model, you agree to comply with these restrictions and all applicable laws.

Contact

Have a question, feedback, or need support? Please fill out our contact form and we'll get back to you as soon as possible.
kani-tts-400m-ar
Nemo Nano Codec 22khz 0.6kbps 12.5fps MLX
kani-tts-370m-expo2025-osaka-ja
A high-speed, high-fidelity Text-to-Speech model optimized for real-time conversational AI applications.

『いのち輝く未来社会のデザイン』という大阪・関西万博2025のテーマを祝し、キルギスの人々から日本の皆さまへ、心と心をつなぐ贈り物として、どうぞお受け取りください。

In honor of Expo Osaka 2025 and its motto 'Designing Future Society for Our Lives,' we humbly present this gift from the people of the Kyrgyz Republic to the people of Japan, heart to heart.

KaniTTS uses a two-stage pipeline combining a large language model with an efficient audio codec for exceptional speed and audio quality. The architecture generates compressed token representations through a backbone LLM, then rapidly synthesizes waveforms via a neural audio codec, achieving extremely low latency.

Key Specifications:
- Model Size: 370M parameters
- Sample Rate: 22kHz
- Language: Japanese
- License: Apache 2.0

It's lightweight, so you can install, load a model, and speak in minutes. Designed for quick starts and simple workflows: no heavy setup, just pip install and run. More details...

You can listen to generated audio directly in Jupyter notebooks or IPython.

Nvidia RTX 5090 Benchmarks:
- Latency: ~1 second to generate 15 seconds of audio
- Memory: 2GB GPU VRAM
- Quality Metrics: MOS 4.3/5 (naturalness), WER

Sample prompts (Japanese):
- 2025年の大阪・関西万博は素晴らしいイベントでした。 (The 2025 Osaka-Kansai Expo was a wonderful event.)
- 「いのち輝く未来社会のデザイン」というテーマが多くの人の心に残りました。 (The theme "Designing Future Society for Our Lives" stayed in many people's hearts.)
- 世界中の国々が未来の技術を紹介しました。 (Countries from around the world showcased future technologies.)
- 小さな一歩でも、前に進めば景色が変わります。 (Even a small step forward changes the view.)
- 何気ない日常の中にも、心が温まる瞬間があります。 (Even in ordinary daily life, there are moments that warm the heart.)

Use Cases:
- Conversational AI: Real-time speech for chatbots and virtual assistants
- Edge/Server Deployment: Resource-efficient inference on affordable hardware
- Accessibility: Screen readers and language learning applications
- Research: Fine-tuning for specific voices, accents, or emotions

Limitations:
- Performance degrades with inputs exceeding 2000 tokens
- Limited expressivity without fine-tuning for specific emotions
- May inherit biases from training data in prosody or pronunciation

Optimization Tips:
- Multilingual Performance: Continually pretrain on target language datasets and fine-tune NanoCodec
- Batch Processing: Use batches of 8-16 for high-throughput scenarios
- Hardware: Optimized for NVIDIA Blackwell architecture GPUs

Models:
- Pretrained Model
- Fine-tuned Model
- HuggingFace Space

Examples:
- Inference Example
- Fine-tuning code
- Example Dataset
- GitHub Repository

Acknowledgments

Built on top of LiquidAI LFM2 350M as the backbone and Nvidia NanoCodec for audio processing.

Responsible Use

Prohibited activities include:
- Illegal content or harmful, threatening, defamatory, or obscene material
- Hate speech, harassment, or incitement of violence
- Generating false or misleading information
- Impersonating individuals without consent
- Malicious activities such as spamming, phishing, or fraud

By using this model, you agree to comply with these restrictions and all applicable laws.

Contact

Have a question, feedback, or need support? Please fill out our contact form and we'll get back to you as soon as possible.
kani-tts-400m-ky
[Discord](https://discord.gg/NzP3rjB4SB) · [License: Apache 2.0](https://opensource.org/licenses/Apache-2.0)

A high-speed, high-fidelity Text-to-Speech model optimized for real-time conversational AI applications. KaniTTS uses a two-stage pipeline combining a large language model with an efficient audio codec for exceptional speed and audio quality. The architecture generates compressed token representations through a backbone LLM, then rapidly synthesizes waveforms via a neural audio codec, achieving extremely low latency.

Key Specifications:
- Model Size: 400M parameters
- Sample Rate: 22kHz
- Language: Kyrgyz
- License: Apache 2.0

On NovitaAI RTX 5090 using vLLM:
- RTF: ~0.2 (5 times faster than realtime)
- Memory: 16GB GPU VRAM used
- Source Code: https://github.com/nineninesix-ai/kanitts-vllm

| GPU Model | VRAM | Cost ($/hr) | RTF |
|-----------|------|-------------|-----|
| RTX 5090 | 32GB | $0.423 | 0.190 |
| RTX 4080 | 16GB | $0.220 | 0.200 |
| RTX 5060 Ti | 16GB | $0.138 | 0.529 |
| RTX 4060 Ti | 16GB | $0.122 | 0.537 |
| RTX 3060 | 12GB | $0.093 | 0.600 |

Lower RTF is better.

Sample prompts (Kyrgyz):
- Кантип жардам бере алам силерге? (How can I help you?)
- Ысык-Көлдүн жээгинде күн батканда, асман көгүлтүр алтынга айланат — көз жоосун алган сулуулук! (When the sun sets on the shore of Issyk-Kul, the sky turns bluish gold; a breathtaking beauty!)
- Тоолорду карачы, канчалык бийик болсо да, кыргыздын руху андан да бийик! (Look at the mountains: however high they are, the Kyrgyz spirit is higher still!)
- Бишкек бүгүн кандай сонун, ээ? – Ооба, шамал да жумшак, асман да ачык! (Bishkek is so lovely today, isn't it? – Yes, the wind is gentle and the sky is clear!)
- Ой, бозо менен самсанын жыты чыгып кетти, ачка болуп кеттим го! (Oh, the smell of bozo and samsa is in the air; I've gotten hungry!)

During fine-tuning, the dataset included special emotional control tokens. These tags help the model adjust tone, rhythm, and expressive style during speech synthesis.

Use Cases:
- Conversational AI: Real-time speech for chatbots and virtual assistants
- Edge/Server Deployment: Resource-efficient inference on affordable hardware
- Accessibility: Screen readers and language learning applications
- Research: Fine-tuning for specific voices, accents, or emotions

Limitations:
- Performance degrades with inputs exceeding 15 seconds (use sliding-window chunking)
- Limited expressivity without fine-tuning for specific emotions
- May inherit biases from training data in prosody or pronunciation
- Optimized primarily for English; other languages may require additional training

Optimization Tips:
- Multilingual Performance: Continually pretrain on target language datasets and fine-tune NanoCodec
- Batch Processing: Use batches of 8-16 for high-throughput scenarios
- Hardware: Optimized for NVIDIA Blackwell architecture GPUs

Models:
- Pretrained Model: nineninesix/kani-tts-400m-0.3-pt
- Space: nineninesix/KaniTTS-KyrGyz

Examples:
- OpenAI-compatible API Example: https://github.com/nineninesix-ai/kanitts-vllm
- Fine-tuning code pipeline: https://github.com/nineninesix-ai/KaniTTS-Finetune-pipeline
- Dataset preparation pipeline: https://github.com/nineninesix-ai/nano-codec-dataset-pipeline
- Example Dataset: https://huggingface.co/datasets/nineninesix/expresso-conversational-en-nano-codec-dataset
- GitHub Repository: https://github.com/nineninesix-ai/kani-tts
- ComfyUI node: https://github.com/wildminder/ComfyUI-KaniTTS by WildAi
- Next.js basic app: https://github.com/nineninesix-ai/open-audio. It uses the OpenAI npm package to connect to the OpenAI-compatible server API provided by kanitts-vllm.

Links:
- Website: https://www.nineninesix.ai
- Contact Form: https://airtable.com/appX2G2TpoRk4M5Bf/pagO2xbIOjiwulPcP/form

Acknowledgments

Built on top of LiquidAI LFM2 350M as the backbone and Nvidia NanoCodec for audio processing.
Responsible Use

Prohibited activities include:
- Illegal content or harmful, threatening, defamatory, or obscene material
- Hate speech, harassment, or incitement of violence
- Generating false or misleading information
- Impersonating individuals without consent
- Malicious activities such as spamming, phishing, or fraud

By using this model, you agree to comply with these restrictions and all applicable laws.

Contact

Have a question, feedback, or need support? Please fill out our contact form and we'll get back to you as soon as possible.
kani-tts-450m-0.1-ft
kani-tts-400m-ko
[Discord](https://discord.gg/NzP3rjB4SB) · [License: Apache 2.0](https://opensource.org/licenses/Apache-2.0)

A high-speed, high-fidelity Text-to-Speech model optimized for real-time conversational AI applications. KaniTTS uses a two-stage pipeline combining a large language model with an efficient audio codec for exceptional speed and audio quality. The architecture generates compressed token representations through a backbone LLM, then rapidly synthesizes waveforms via a neural audio codec, achieving extremely low latency.

Key Specifications:
- Model Size: 400M parameters
- Sample Rate: 22kHz
- Language: Korean
- License: Apache 2.0

It's lightweight, so you can install, load a model, and speak in minutes. Designed for quick starts and simple workflows: no heavy setup, just pip install and run. More details...

You can listen to generated audio directly in Jupyter notebooks or IPython.

On NovitaAI RTX 5090 using vLLM:
- RTF: ~0.2 (5 times faster than realtime)
- Memory: 16GB GPU VRAM used
- Source Code: https://github.com/nineninesix-ai/kanitts-vllm

| GPU Model | VRAM | Cost ($/hr) | RTF |
|-----------|------|-------------|-----|
| RTX 5090 | 32GB | $0.423 | 0.190 |
| RTX 4080 | 16GB | $0.220 | 0.200 |
| RTX 5060 Ti | 16GB | $0.138 | 0.529 |
| RTX 4060 Ti | 16GB | $0.122 | 0.537 |
| RTX 3060 | 12GB | $0.093 | 0.600 |

Lower RTF is better.

Sample prompts (Korean):
- 이 느낌... 왠지 처음이 아닌 것 같아. (This feeling... somehow it doesn't seem like the first time.)
- 조용한 밤에 혼자 있으니까 마음이 좀 이상해. (Being alone on a quiet night, I feel a little strange.)

Use Cases:
- Conversational AI: Real-time speech for chatbots and virtual assistants
- Edge/Server Deployment: Resource-efficient inference on affordable hardware
- Accessibility: Screen readers and language learning applications
- Research: Fine-tuning for specific voices, accents, or emotions

Limitations:
- Performance degrades with inputs exceeding 15 seconds (use sliding-window chunking)
- Limited expressivity without fine-tuning for specific emotions
- May inherit biases from training data in prosody or pronunciation
- Optimized primarily for English; other languages may require additional training

Optimization Tips:
- Multilingual Performance: Continually pretrain on target language datasets and fine-tune NanoCodec
- Batch Processing: Use batches of 8-16 for high-throughput scenarios
- Hardware: Optimized for NVIDIA Blackwell architecture GPUs

Models:
- Pretrained Model: https://huggingface.co/nineninesix/kani-tts-500m-0.3-pt
- Space: https://huggingface.co/spaces/nineninesix/KaniTTS

Examples:
- OpenAI-compatible API Example: https://github.com/nineninesix-ai/kanitts-vllm
- Fine-tuning code pipeline: https://github.com/nineninesix-ai/KaniTTS-Finetune-pipeline
- Dataset preparation pipeline: https://github.com/nineninesix-ai/nano-codec-dataset-pipeline
- Example Dataset: https://huggingface.co/datasets/nineninesix/expresso-conversational-en-nano-codec-dataset
- GitHub Repository: https://github.com/nineninesix-ai/kani-tts
- ComfyUI node: https://github.com/wildminder/ComfyUI-KaniTTS by WildAi
- Next.js basic app: https://github.com/nineninesix-ai/open-audio. It uses the OpenAI npm package to connect to the OpenAI-compatible server API provided by kanitts-vllm.

Links:
- Website: https://www.nineninesix.ai
- Contact Form: https://airtable.com/appX2G2TpoRk4M5Bf/pagO2xbIOjiwulPcP/form

Acknowledgments

Built on top of LiquidAI LFM2 350M as the backbone and Nvidia NanoCodec for audio processing.
Responsible Use

Prohibited activities include:
- Illegal content or harmful, threatening, defamatory, or obscene material
- Hate speech, harassment, or incitement of violence
- Generating false or misleading information
- Impersonating individuals without consent
- Malicious activities such as spamming, phishing, or fraud

By using this model, you agree to comply with these restrictions and all applicable laws.

Contact

Have a question, feedback, or need support? Please fill out our contact form and we'll get back to you as soon as possible.

Citation

Kyubyong Park, KSS Dataset: Korean Single Speaker Speech Dataset, https://kaggle.com/bryanpark/korean-single-speaker-speech-dataset, 2018.
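The in-notebook playback these cards mention typically relies on `IPython.display.Audio`. As a self-contained sketch, the following writes a 22 kHz mono WAV with the Python standard library; the generated sine tone is a stand-in for actual model output, and the file name `sample.wav` is arbitrary:

```python
import math
import struct
import wave

SAMPLE_RATE = 22050  # matches the models' 22kHz output

def write_wav(path: str, samples: list[float]) -> None:
    """Write mono 16-bit PCM audio (floats in [-1, 1]) to a WAV file."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(SAMPLE_RATE)
        frames = b"".join(
            struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767)) for s in samples
        )
        wav.writeframes(frames)

# Stand-in waveform: half a second of a quiet 440 Hz tone.
tone = [0.3 * math.sin(2 * math.pi * 440 * t / SAMPLE_RATE)
        for t in range(SAMPLE_RATE // 2)]
write_wav("sample.wav", tone)

# In a Jupyter/IPython session you could then play it inline:
# from IPython.display import Audio
# Audio("sample.wav")
```

Real model output would replace `tone` with the decoded waveform from the NanoCodec stage.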
kani-tts-400m-zh
[Discord](https://discord.gg/NzP3rjB4SB) · [License: Apache 2.0](https://opensource.org/licenses/Apache-2.0)

A high-speed, high-fidelity Text-to-Speech model optimized for real-time conversational AI applications. KaniTTS uses a two-stage pipeline combining a large language model with an efficient audio codec for exceptional speed and audio quality. The architecture generates compressed token representations through a backbone LLM, then rapidly synthesizes waveforms via a neural audio codec, achieving extremely low latency.

Key Specifications:
- Model Size: 400M parameters
- Sample Rate: 22kHz
- Language: Chinese (Mandarin and Cantonese)
- License: Apache 2.0

On NovitaAI RTX 5090 using vLLM:
- RTF: ~0.2 (5 times faster than realtime)
- Memory: 16GB GPU VRAM used
- Source Code: https://github.com/nineninesix-ai/kanitts-vllm

| GPU Model | VRAM | Cost ($/hr) | RTF |
|-----------|------|-------------|-----|
| RTX 5090 | 32GB | $0.423 | 0.190 |
| RTX 4080 | 16GB | $0.220 | 0.200 |
| RTX 5060 Ti | 16GB | $0.138 | 0.529 |
| RTX 4060 Ti | 16GB | $0.122 | 0.537 |
| RTX 3060 | 12GB | $0.093 | 0.600 |

Lower RTF is better.

Sample prompts (Chinese):
- 今朝心情好得来,连天色都看得欢喜咯。 (I'm in such a good mood today that even the sky looks delightful.)
- 啊哟,真格个,听到介话心里一热咯! (Oh my, truly, hearing those words warms my heart!)
- 有辰光一个人坐着,想东想西,真是闲不牢呃。 (Sometimes, sitting alone and letting my mind wander, I really can't sit still.)

Use Cases:
- Conversational AI: Real-time speech for chatbots and virtual assistants
- Edge/Server Deployment: Resource-efficient inference on affordable hardware
- Accessibility: Screen readers and language learning applications
- Research: Fine-tuning for specific voices, accents, or emotions

Limitations:
- Performance degrades with inputs exceeding 15 seconds (use sliding-window chunking)
- Limited expressivity without fine-tuning for specific emotions
- May inherit biases from training data in prosody or pronunciation
- Optimized primarily for English; other languages may require additional training

Optimization Tips:
- Multilingual Performance: Continually pretrain on target language datasets and fine-tune NanoCodec
- Batch Processing: Use batches of 8-16 for high-throughput scenarios
- Hardware: Optimized for NVIDIA Blackwell architecture GPUs

Resources

Models:
- Pretrained Model: https://huggingface.co/nineninesix/kani-tts-500m-0.3-pt
- Space: https://huggingface.co/spaces/nineninesix/KaniTTS

Examples:
- OpenAI-compatible API Example: https://github.com/nineninesix-ai/kanitts-vllm
- Fine-tuning code pipeline: https://github.com/nineninesix-ai/KaniTTS-Finetune-pipeline
- Dataset preparation pipeline: https://github.com/nineninesix-ai/nano-codec-dataset-pipeline
- Example Dataset: https://huggingface.co/datasets/nineninesix/expresso-conversational-en-nano-codec-dataset
- GitHub Repository: https://github.com/nineninesix-ai/kani-tts
- ComfyUI node: https://github.com/wildminder/ComfyUI-KaniTTS by WildAi
- Next.js basic app: https://github.com/nineninesix-ai/open-audio. It uses the OpenAI npm package to connect to the OpenAI-compatible server API provided by kanitts-vllm.

Links:
- Website: https://www.nineninesix.ai
- Contact Form: https://airtable.com/appX2G2TpoRk4M5Bf/pagO2xbIOjiwulPcP/form

Acknowledgments

Built on top of LiquidAI LFM2 350M as the backbone and Nvidia NanoCodec for audio processing.
Responsible Use

Prohibited activities include:
- Illegal content or harmful, threatening, defamatory, or obscene material
- Hate speech, harassment, or incitement of violence
- Generating false or misleading information
- Impersonating individuals without consent
- Malicious activities such as spamming, phishing, or fraud

By using this model, you agree to comply with these restrictions and all applicable laws.

Contact

Have a question, feedback, or need support? Please fill out our contact form and we'll get back to you as soon as possible.
kyrgyz-whisper-medium
kani-tts-400m-en-mlx
This model nineninesix/kani-tts-400m-en-mlx was converted to MLX format from nineninesix/kani-tts-400m-en using mlx-lm version 0.28.2.