OuteAI
Llama-OuteTTS-1.0-1B
> [!IMPORTANT]
> **Important Sampling Considerations**
>
> When using OuteTTS version 1.0, it is crucial to use the settings specified in the Sampling Configuration section.
> The repetition penalty implementation is particularly important: this model requires the penalty to be applied to a 64-token recent window, rather than across the entire context window. Penalizing the entire context will cause the model to produce broken or low-quality output.
>
> To address this limitation, all necessary samplers and patches for all backends are set up automatically in the outetts library.
> If you use a custom implementation, make sure you implement these requirements correctly.

This update brings significant improvements in speech synthesis and voice cloning, delivering a more powerful, accurate, and user-friendly experience in a compact size.

### 1. Prompt Revamp & Dependency Removal
- **Automatic Word Alignment:** The model now performs word alignment internally. Simply input raw text, with no pre-processing required, and the model handles the rest, streamlining your workflow. For optimal results, use normalized, readable text without newlines (light normalization is applied automatically by the outetts library).
- **Native Multilingual Text Support:** Direct support for native text across multiple languages eliminates the need for romanization.
- **Enhanced Metadata Integration:** The updated prompt system incorporates additional metadata (time, energy, spectral centroid, pitch) at both the global and word levels, improving speaker flow and synthesis quality.
- **Special Tokens for Audio Codebooks:** New tokens for c1 (codebook 1) and c2 (codebook 2).

### 2. New Audio Encoder Model
- **DAC Encoder:** Integrates the DAC audio encoder from ibm-research/DAC.speech.v1.0, using two codebooks for high-quality audio reconstruction.
- **Performance Trade-off:** Improved audio fidelity increases the token generation rate from 75 to 150 tokens per second. This trade-off prioritizes quality, especially for multilingual applications.
### 3. Voice Cloning
- **One-Shot Voice Cloning:** The model typically requires only around 10 seconds of reference audio to produce an accurate voice representation.
- **Improved Accuracy:** Enhanced by the new encoder and additional training metadata, voice cloning is now more natural and precise.

### 4. Auto Text Alignment & Numerical Support
- **Automatic Text Alignment:** Aligns raw text at the word level, even for languages without clear word boundaries (e.g., Japanese, Chinese), using insights from pre-processed training data.
- **Direct Numerical Input:** Built-in multilingual numerical support allows numbers to be used directly in prompts, with no textual conversion needed. (The model typically follows the dominant language of the prompt; mixing languages in a single prompt may lead to mistakes.)
- **Supported Languages:** OuteTTS offers varying proficiency levels across languages, depending on training data exposure.
  - **High Training Data Languages:** These languages received extensive training: English, Arabic, Chinese, Dutch, French, German, Italian, Japanese, Korean, Lithuanian, Russian, Spanish
  - **Moderate Training Data Languages:** These languages received moderate training and offer good performance with occasional limitations: Portuguese, Belarusian, Bengali, Georgian, Hungarian, Latvian, Persian/Farsi, Polish, Swahili, Tamil, Ukrainian
  - **Beyond Supported Languages:** The model can generate speech in untrained languages with varying success. Experiment with unlisted languages, though results may not be optimal.

### More Configuration Options
For advanced settings and customization, visit the official repository: 🔗 interfaceusage.md

### Speaker Reference
The model is designed to be used with a speaker reference. Without one, it generates random vocal characteristics, which often leads to lower-quality output. The model inherits the referenced speaker's emotion, style, and accent. When generating in other languages with the same speaker, you may observe the model retaining the original accent.
### Multilingual Application
It is recommended to create a speaker profile in the language you intend to use. This helps achieve the best results in that specific language, including tone, accent, and linguistic features. While the model supports cross-lingual speech, it still relies on the reference speaker: if the speaker has a distinct accent, such as British English, other languages may carry that accent as well.

### Optimal Audio Length
- **Best Performance:** Generate up to around 42 seconds of audio in a single run (approximately 8,192 tokens). It is recommended not to approach the limits of this window when generating; the best results are usually achieved with up to 7,000 tokens.
- **Context Reduction with Speaker Reference:** If the speaker reference is 10 seconds long, the effective context is reduced to approximately 32 seconds.

### Temperature Setting Recommendations
Testing shows that a temperature of 0.4 is an ideal starting point for accuracy (with the sampling settings below). However, some voice references may benefit from higher temperatures for enhanced expressiveness, or from slightly lower temperatures for more precise voice replication.

### Verifying Speaker Encoding
If the cloned voice quality is subpar, check the encoded speaker sample. The DAC audio reconstruction model is lossy, and samples with clipping, excessive loudness, or unusual vocal features may introduce encoding issues that degrade output quality.

### Sampling Configuration
For optimal results with this TTS model, use the following sampling settings.
| Parameter          | Value |
|--------------------|-------|
| Temperature        | 0.4   |
| Repetition Penalty | 1.1   |
| Repetition Range   | 64    |
| Top-k              | 40    |
| Top-p              | 0.9   |
| Min-p              | 0.05  |

### Model Details
- **Training Data:** Trained on ~60k hours of audio
- **Context Length:** Supports a maximum context window of 8,192 tokens

### Pre-Training
- **Optimizer:** AdamW
- **Batch Size:** 1 million tokens
- **Max Learning Rate:** 3e-4
- **Min Learning Rate:** 3e-5
- **Context Length:** 8,192

### Fine-Tuning
- **Optimizer:** AdamW
- **Max Learning Rate:** 1e-5
- **Min Learning Rate:** 5e-6
- **Data:** 10,000 diverse, high-quality examples

### License
- **Initial Llama 3.2 Components:** Llama 3.2 Community License Agreement
- **Our Continued Pre-Training, Fine-Tuning, and Additional Components:** CC-BY-NC-SA-4.0

### Acknowledgments
- Big thanks to Hugging Face for their continued resource support through their grant program!
- Audio encoding and decoding utilize ibm-research/DAC.speech.v1.0
- OuteTTS is built on Llama3.2-1B as the base model, with continued pre-training and fine-tuning.

### Ethical Use Guidelines
This text-to-speech model is intended for legitimate applications that enhance accessibility, creativity, and communication. Prohibited uses include: impersonation without consent; creation of deliberately misleading content; generation of harmful or harassing material; distribution of synthetic audio without proper disclosure; voice cloning without permission; and any use that violates applicable laws, regulations, or copyrights.
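The windowed repetition penalty required by the sampling configuration above (penalty 1.1 over a 64-token recent window) can be illustrated with a minimal logit-processing sketch. This is a plain-Python illustration of the idea, not the outetts implementation; the function name and the divide/multiply penalty convention (borrowed from common llama.cpp-style samplers) are assumptions.

```python
def windowed_repetition_penalty(logits, generated_ids, penalty=1.1, window=64):
    """Penalize only tokens seen in the most recent `window` generated tokens.

    Convention (as in llama.cpp-style samplers): positive logits are divided
    by the penalty, negative logits are multiplied by it. Applying this over
    the whole context instead of a recent window is exactly the failure mode
    the card warns against for this model.
    """
    recent = set(generated_ids[-window:])  # only the last `window` tokens count
    out = list(logits)
    for tok in recent:
        if out[tok] > 0:
            out[tok] /= penalty
        else:
            out[tok] *= penalty
    return out
```

Setting `window=len(generated_ids)` degenerates into whole-context penalization, which this card states produces broken or low-quality output.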
Llama-OuteTTS-1.0-1B-GGUF
OuteTTS-0.2-500M-GGUF
OuteTTS-1.0-0.6B-GGUF
OuteTTS-0.1-350M-GGUF
OuteTTS 0.3 1B
OuteAI 🌐 OuteAI.com 💬 Join our Discord 𝕏 @OuteAI

OuteTTS 0.3 1B | OuteTTS 0.3 1B GGUF | OuteTTS 0.3 500M | OuteTTS 0.3 500M GGUF | OuteTTS 0.3 Demo Space | GitHub - OuteTTS

OuteTTS version 0.3 introduces multiple model variants tailored for diverse use cases. This release significantly enhances the naturalness and coherence of speech synthesis by adding punctuation support, improving the flow and clarity of generated speech. The following punctuation marks are supported: `'.', '!', '?', ',', '"', '„', '¡', '¿', '…', '...', '。', '!', '?', ',', '؟'`. These are converted into special tokens; for instance, `.` is transformed into ` `.

Additionally, the models were trained on refined and extended datasets, offering broader linguistic coverage. This version adds two new languages, German (de) and French (fr), bringing the total to six: English (en), Japanese (jp), Korean (ko), Chinese (zh), French (fr), and German (de).

OuteTTS is a solution designed to extend any existing large language model (LLM) with text-to-speech (TTS) and speech-to-speech capabilities. By preserving the original architecture, it ensures high compatibility with a broad range of libraries and tools, making it easy to integrate speech functionality without compromising flexibility.

Experimental voice control features are also included, though they are at a very early stage of development. Due to limited data, these features may produce inconsistent results and might sometimes be ignored by the model.

Special thanks to Hugging Face 🤗 for providing the GPU grant that made training this model possible!
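The punctuation-to-special-token conversion described above can be sketched as a simple character mapping. The token strings below are placeholders, not the actual OuteTTS special tokens (the real token names are not shown in this card), and multi-character marks such as `...` would need longest-match handling that this per-character sketch omits.

```python
# Placeholder token names: the real OuteTTS special-token strings differ.
PUNCT_TOKENS = {
    '.': '<punct_period>',  '!': '<punct_exclaim>', '?': '<punct_question>',
    ',': '<punct_comma>',   '。': '<punct_period>', '!': '<punct_exclaim>',
    '?': '<punct_question>', ',': '<punct_comma>', '…': '<punct_ellipsis>',
}

def mark_punctuation(text):
    """Replace each supported punctuation character with its special token."""
    return ''.join(PUNCT_TOKENS.get(ch, ch) for ch in text)
```

The point of the conversion is that the model sees punctuation as discrete prosody cues rather than as ordinary text characters.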
### OuteTTS-0.3-500M
- **Base:** Qwen2.5-0.5B (Apache-2.0)
- **TTS Model License:** CC-BY-SA-4.0
- **Training:** 10,000 hours of speech audio (~4 billion tokens)
- **Supported Languages:** en, jp, ko (small dataset), zh, fr, de

### OuteTTS-0.3-1B
- **Base:** OLMo-1B (Apache-2.0)
- **TTS Model License:** CC-BY-NC-SA-4.0 (incorporates the Emilia dataset for improved quality)
- **Training:** 20,000 hours of speech audio (~8 billion tokens)
- **Supported Languages:** en, jp, ko, zh, fr, de

> [!IMPORTANT]
> For additional usage examples and recommendations, visit the GitHub repository.

> [!IMPORTANT]
> The model performs best with 30-second generation batches. This window is reduced by the length of your speaker samples. For example, if the speaker reference sample is 10 seconds, the effective window becomes approximately 20 seconds.

I am currently working on adding batched generation capabilities to the library, along with further improvements that are not yet implemented.

### Training Data Sources
The OuteTTS-0.3-1B training data incorporates various publicly available speech datasets. Below is a summary of the key data sources:
- Emilia Dataset: CC-BY-NC 4.0
- Mozilla Common Voice: CC-0
- MLCommons People's Speech Dataset (selected portions): CC-BY 4.0
- Noisy Speech Database (Edinburgh DataShare): CC-BY 4.0
- Multilingual LibriSpeech (MLS): CC-BY 4.0
- CSTR VCTK Corpus (Edinburgh DataShare): CC-BY 4.0
- THCHS-30 (Open Speech and Language Resources): Apache-2.0
- Zeroth-Korean (Open Speech and Language Resources): CC-BY 4.0
- Aishell (Open Speech and Language Resources): Apache-2.0
- Other permissively licensed datasets

Special acknowledgment to the open-source community and researchers for their valuable contributions.

### Credits & References
- WavTokenizer GitHub | WavTokenizer HF
- CTC Forced Alignment
- Qwen-2.5-0.5B
- OLMo-1B
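The 30-second generation window noted in the important box above shrinks with the speaker sample; a back-of-envelope helper (not part of the outetts API) makes the arithmetic explicit:

```python
BATCH_WINDOW_SECONDS = 30  # best-performing generation window for OuteTTS 0.3

def effective_window_seconds(speaker_sample_seconds):
    """Approximate seconds of new audio per batch once the speaker
    reference sample has consumed part of the window."""
    return max(0, BATCH_WINDOW_SECONDS - speaker_sample_seconds)
```

A 10-second reference leaves roughly a 20-second window, matching the example given in the card.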
Llama-OuteTTS-1.0-1B-FP8
OuteTTS-0.3-500M-GGUF
OuteTTS-1.0-0.6B
Lite-Mistral-150M-v2-Instruct-GGUF
OuteTTS-0.3-1B-GGUF
Lite-Oute-1-300M-Instruct-GGUF
Lite-Oute-1-300M-Instruct
OuteTTS-0.3-500M
Lite-Oute-1-300M-GGUF
OuteTTS 0.2 500M
OuteAI 🌐 OuteAI.com 💬 Join our Discord 𝕏 @OuteAI

🤗 Hugging Face - OuteTTS 0.2 500M | 🤗 Hugging Face - OuteTTS 0.2 500M GGUF | 🤗 Hugging Face - Demo Space | GitHub - OuteTTS

OuteTTS-0.2-500M is our improved successor to the v0.1 release. The model maintains the same approach of using audio prompts without architectural changes to the foundation model itself. Built upon Qwen-2.5-0.5B, this version was trained on larger and more diverse datasets, resulting in significant improvements across all aspects of performance.

Special thanks to Hugging Face for providing the GPU grant that supported the training of this model!

- **Enhanced Accuracy:** Significantly improved prompt following and output coherence compared to the previous version
- **Natural Speech:** Produces more natural and fluid speech synthesis
- **Expanded Vocabulary:** Trained on over 5 billion audio prompt tokens
- **Voice Cloning:** Improved voice cloning capabilities with greater diversity and accuracy
- **Multilingual Support:** New experimental support for Chinese, Japanese, and Korean

**Important:**
- For GGUF support, install `llama-cpp-python` manually. (Installation Guide)
- For EXL2 support, install `exllamav2` manually. (Installation Guide)

You can create a speaker profile for voice cloning, which is compatible across all backends; profiles can be saved and loaded on any supported backend. OuteTTS also includes a set of default speaker profiles that you can use directly. The generation process is consistent across all backends, and you can initialize custom backend configurations for specific needs (for example, Flash Attention with Hugging Face Transformers).

To achieve the best results when creating a speaker profile, consider the following recommendations:

1. **Audio Clip Duration:**
   - Use an audio clip of around 10-15 seconds.
   - This duration provides sufficient data for the model to learn the speaker's characteristics while keeping the input manageable.
   - The model's context length is 4,096 tokens, allowing it to generate around 54 seconds of audio in total. However, when a speaker profile is included, this capacity is reduced proportionally to the length of the speaker's audio clip.
2. **Audio Quality:**
   - Ensure the audio is clear and noise-free. Background noise or distortion reduces the model's ability to extract accurate voice features.
3. **Accurate Transcription:**
   - Provide a highly accurate transcription of the audio clip. Mismatches between the audio and transcription can lead to suboptimal results.
4. **Speaker Familiarity:**
   - The model performs best with voices similar to those seen during training. Using a voice that differs significantly from typical training samples (e.g., unique accents, rare vocal characteristics) might result in inaccurate replication.
   - In such cases, you may need to fine-tune the model specifically on your target speaker's voice to achieve a better representation.
5. **Parameter Adjustments:**
   - Adjust parameters like `temperature` in the `generate` function to refine the expressive quality and consistency of the synthesized voice.

### Model Specifications
- **Base Model:** Qwen-2.5-0.5B
- **Parameter Count:** 500M
- **Language Support:**
  - Primary: English
  - Experimental: Chinese, Japanese, Korean
- **License:** CC-BY-NC 4.0

### Training Datasets
- Emilia-Dataset (CC-BY-NC 4.0)
- LibriTTS-R (CC-BY 4.0)
- Multilingual LibriSpeech (MLS) (CC-BY 4.0)

### Credits & References
- WavTokenizer
- CTC Forced Alignment
- Qwen-2.5-0.5B
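The context arithmetic above (4,096 tokens for roughly 54 seconds of audio, which works out to about 75 audio tokens per second) can be captured in a small helper. The tokens-per-second rate is inferred from those two figures; it is an approximation for planning purposes, not a documented constant of the library:

```python
CONTEXT_TOKENS = 4096
TOKENS_PER_SECOND = 4096 / 54  # ≈ 75.9, inferred from the figures in the card

def remaining_audio_seconds(speaker_clip_seconds):
    """Approximate seconds of audio the model can still generate once a
    speaker profile of the given length occupies part of the context."""
    used = speaker_clip_seconds * TOKENS_PER_SECOND
    return max(0.0, CONTEXT_TOKENS - used) / TOKENS_PER_SECOND
```

For example, a 15-second speaker clip leaves roughly 39 seconds of generation capacity, which is why the card recommends keeping reference clips to 10-15 seconds.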
Lite-Oute-1-65M-Instruct-GGUF
Lite-Mistral-150M-v2-Instruct
Lite-Oute-1-65M-GGUF
OuteTTS-0.1-350M
OuteTTS-1.0-0.6B-FP8
Lite-Oute-1-300M
Lite-Oute-1-65M-Instruct
Llama-OuteTTS-1.0-1B-ONNX
Lite-Oute-1-65M
OuteTTS-1.0-0.6B-ONNX
Note: Having a separate repo for ONNX weights is intended to be a temporary solution until WebML gains more traction. If you would like to make your models web-ready, we recommend converting to ONNX using 🤗 Optimum and structuring your repo like this one (with ONNX weights located in a subfolder named `onnx`).