hexgrad
Kokoro-82M
Kokoro is an open-weight TTS model with 82 million parameters. Despite its lightweight architecture, it delivers quality comparable to larger models while being significantly faster and more cost-efficient. With Apache-licensed weights, Kokoro can be deployed anywhere from production environments to personal projects.

> [!NOTE]
> As of April 2025, the market rate of Kokoro served over API is under $1 per million characters of text input, or under $0.06 per hour of audio output. (On average, 1000 characters of input is about 1 minute of output.) Sources: ArtificialAnalysis/Replicate at 65 cents per M chars and DeepInfra at 80 cents per M chars.
>
> This is an Apache-licensed model, and Kokoro has been deployed in numerous projects and commercial APIs. We welcome the deployment of the model in real use cases.

> [!CAUTION]
> Fake websites like kokorottsai_com (snapshot: https://archive.ph/nRRnk) and kokorotts_net (snapshot: https://archive.ph/60opa) are likely scams masquerading under the banner of a popular model.
>
> Any website containing "kokoro" in its root domain (e.g. kokorottsai_com, kokorotts_net) is NOT owned by and NOT affiliated with this model page or its author, and attempts to imply otherwise are red flags.

- Releases
- Usage
- EVAL.md ↗️
- SAMPLES.md ↗️
- VOICES.md ↗️
- Model Facts
- Training Details
- Creative Commons Attribution
- Acknowledgements

| Model | Published | Training Data | Langs & Voices | SHA256 |
| ----- | --------- | ------------- | -------------- | ------ |
| v1.0 | 2025 Jan 27 | Few hundred hrs | 8 & 54 | `496dba11` |
| v0.19 | 2024 Dec 25 |

```py
# Notebook cell: install the package, the audio writer, and the espeak-ng system dependency.
!pip install "kokoro>=0.9.2" soundfile
!apt-get -qq -y install espeak-ng > /dev/null 2>&1

from kokoro import KPipeline
from IPython.display import display, Audio
import soundfile as sf
import torch

# 'a' selects American English
pipeline = KPipeline(lang_code='a')
text = '''
Kokoro is an open-weight TTS model with 82 million parameters. Despite its lightweight architecture, it delivers comparable quality to larger models while being significantly faster and more cost-efficient. With Apache-licensed weights, Kokoro can be deployed anywhere from production environments to personal projects.
'''
generator = pipeline(text, voice='af_heart')
for i, (gs, ps, audio) in enumerate(generator):
    print(i, gs, ps)  # chunk index, graphemes, phonemes
    display(Audio(data=audio, rate=24000, autoplay=i==0))
    sf.write(f'{i}.wav', audio, 24000)  # one 24 kHz WAV file per chunk
```

Under the hood, `kokoro` uses `misaki`, a G2P library at https://github.com/hexgrad/misaki

Architecture:
- StyleTTS 2: https://arxiv.org/abs/2306.07691
- ISTFTNet: https://arxiv.org/abs/2203.02395
- Decoder only: no diffusion, no encoder release

Architected by: Li et al @ https://github.com/yl4579/StyleTTS2

Model SHA256 Hash: `496dba118d1a58f5f3db2efc88dbdc216e0483fc89fe6e47ee1f2c53f18ad1e4`

Data: Kokoro was trained exclusively on permissive/non-copyrighted audio data and IPA phoneme labels. Examples of permissive/non-copyrighted audio include:
- Public domain audio
- Audio licensed under Apache, MIT, etc
- Synthetic audio [1] generated by closed [2] TTS models from large providers

[1] https://copyright.gov/ai/aipolicyguidance.pdf
[2] No synthetic audio from open TTS models or "custom voice clones"

Total Training Cost: About $1000 for 1000 hours of A100 80GB VRAM

The following CC BY audio was part of the dataset used to train Kokoro v1.0.

| Audio Data | Duration Used | License | Added to Training Set After |
| ---------- | ------------- | ------- | --------------------------- |
| Koniwa `tnc` |
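
The cost figures quoted on this card reduce to simple arithmetic. The sketch below is only a sanity check of those numbers, assuming the card's rule of thumb that 1000 characters of input yields about 1 minute of audio; the function and constant names are illustrative and not part of any Kokoro API, and the rates are the April 2025 figures quoted above, not live prices.

```python
# Figures copied from the model card (April 2025); labels and names here are illustrative.
CHARS_PER_MINUTE = 1_000  # "1000 characters of input is about 1 minute of output"
USD_PER_MILLION_CHARS = {
    "upper bound quoted on the card": 1.00,
    "DeepInfra": 0.80,
    "ArtificialAnalysis/Replicate": 0.65,
}
TRAINING_COST_USD = 1_000   # "About $1000"
TRAINING_GPU_HOURS = 1_000  # "1000 hours of A100 80GB"


def usd_per_audio_hour(usd_per_million_chars: float) -> float:
    """Convert a price per million input characters into a price per hour of audio."""
    chars_per_hour = CHARS_PER_MINUTE * 60
    return usd_per_million_chars * chars_per_hour / 1_000_000


def usd_per_gpu_hour(total_usd: float, gpu_hours: float) -> float:
    """Average training spend per GPU hour."""
    return total_usd / gpu_hours


if __name__ == "__main__":
    for name, rate in USD_PER_MILLION_CHARS.items():
        print(f"{name}: ${usd_per_audio_hour(rate):.3f} per hour of audio")
    rate = usd_per_gpu_hour(TRAINING_COST_USD, TRAINING_GPU_HOURS)
    print(f"training: about ${rate:.2f} per A100 hour")
```

At $1 per million characters this gives $0.06 per hour of audio, which matches the bound stated in the note, and the training figure works out to roughly $1 per A100 hour.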
Kokoro-82M-v1.1-zh
Kokoro is an open-weight series of small but powerful TTS models.

This model is the result of a short training run that added 100 Chinese speakers from a professional dataset. The Chinese data was freely and permissively granted to us by LongMaoData, a professional dataset company. Thank you for making this model possible.

Separately, some crowdsourced synthetic English data also entered the training mix: [1]
- 1 hour of Maple, an American female.
- 1 hour of Sol, another American female.
- And 1 hour of Vale, an older British female.

This model is not a strict upgrade over its predecessor since it drops many voices, but it is released early to gather feedback on new voices and tokenization. Aside from the Chinese dataset and the 3 hours of English, the rest of the data was left behind for this training run. The goal is to push the model series forward and ultimately restore some of the voices that were left behind.

Current guidance from the U.S. Copyright Office indicates that synthetic data generally does not qualify for copyright protection. Since this synthetic data is crowdsourced, the model trainer is not bound by any Terms of Service. This Apache-licensed model also aligns with OpenAI's stated mission of broadly distributing the benefits of AI. If you would like to help further that mission, consider contributing permissive audio data to the cause.

[1] LongMaoData had no involvement in the crowdsourced synthetic English data.
[2] The following Chinese text is machine-translated.
> Kokoro 是一系列体积虽小但功能强大的 TTS 模型。
>
> 该模型是经过短期训练的结果，从专业数据集中添加了100名中文使用者。中文数据由专业数据集公司「龙猫数据」免费且无偿地提供给我们。感谢你们让这个模型成为可能。
>
> 另外，一些众包合成英语数据也进入了训练组合：
> - 1小时的 Maple，美国女性。
> - 1小时的 Sol，另一位美国女性。
> - 和1小时的 Vale，一位年长的英国女性。
>
> 由于该模型删除了许多声音，因此它并不是对其前身的严格升级，但它提前发布以收集有关新声音和标记化的反馈。除了中文数据集和3小时的英语之外，其余数据都留在本次训练中。目标是推动模型系列的发展，并最终恢复一些被遗留的声音。
>
> 美国版权局目前的指导表明，合成数据通常不符合版权保护的资格。由于这些合成数据是众包的，因此模型训练师不受任何服务条款的约束。该 Apache 许可模式也符合 OpenAI 所宣称的广泛传播 AI 优势的使命。如果您愿意帮助进一步完成这一使命，请考虑为此贡献许可的音频数据。

- Releases
- Usage
- Samples ↗️
- Model Facts
- Acknowledgements

| Model | Published | Training Data | Langs & Voices | SHA256 |
| ----- | --------- | ------------- | -------------- | ------ |
| v1.1-zh | 2025 Feb 26 | >100 hours | 2 & 103 | `b1d8410f` |
| v1.0 | 2025 Jan 27 | Few hundred hrs | 8 & 54 | `496dba11` |
| v0.19 | 2024 Dec 25 |

```py
# Notebook cell: install the package with Chinese G2P support, plus espeak-ng.
!pip install "kokoro>=0.8.2" "misaki[zh]>=0.8.2" soundfile
!apt-get -qq -y install espeak-ng > /dev/null 2>&1
from IPython.display import display, Audio

# English sample: download the script, run it, and play the file it writes.
!wget https://huggingface.co/hexgrad/Kokoro-82M-v1.1-zh/resolve/main/samples/make_en.py
!python make_en.py
display(Audio('HEARME_en.wav', rate=24000, autoplay=True))

# Chinese sample: same steps with the Chinese script.
!wget https://huggingface.co/hexgrad/Kokoro-82M-v1.1-zh/resolve/main/samples/make_zh.py
!python make_zh.py
display(Audio('HEARME_zf_001.wav', rate=24000, autoplay=False))
```

TODO: Improve usage. Similar to https://hf.co/hexgrad/Kokoro-82M#usage but you should pass `repo_id='hexgrad/Kokoro-82M-v1.1-zh'` when constructing a `KModel` or `KPipeline`. See `make_en.py` and `make_zh.py`.

Architecture:
- StyleTTS 2: https://arxiv.org/abs/2306.07691
- ISTFTNet: https://arxiv.org/abs/2203.02395
- Decoder only: no diffusion, no encoder release
- 82 million parameters, same as https://hf.co/hexgrad/Kokoro-82M

Architected by: Li et al @ https://github.com/yl4579/StyleTTS2

Model SHA256 Hash: `b1d8410fa44dfb5c15471fd6c4225ea6b4e9ac7fa03c98e8bea47a9928476e2b`

Acknowledgements

TODO: Write acknowledgements. Similar to https://hf.co/hexgrad/Kokoro-82M#acknowledgements
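
The SHA256 hashes listed on these cards can be used to confirm that a downloaded checkpoint is the published one. The following is a minimal sketch using only the Python standard library. The helper names are hypothetical, and the checkpoint path in the usage note is a placeholder, since the cards do not name the weight files here.

```python
import hashlib
from pathlib import Path

# Full hashes as listed on the model cards.
EXPECTED_SHA256 = {
    "v1.0": "496dba118d1a58f5f3db2efc88dbdc216e0483fc89fe6e47ee1f2c53f18ad1e4",
    "v1.1-zh": "b1d8410fa44dfb5c15471fd6c4225ea6b4e9ac7fa03c98e8bea47a9928476e2b",
}


def sha256_of_file(path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large checkpoints need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_release(path, release: str) -> bool:
    """True if the file's SHA256 equals the hash listed for the given release."""
    return sha256_of_file(path) == EXPECTED_SHA256[release]


def short_hash(full_hash: str) -> str:
    """First 8 hex characters, the form shown in the releases table."""
    return full_hash[:8]
```

For example, `matches_release(Path("path/to/checkpoint.pth"), "v1.1-zh")` should return `True` for an intact download of that release, where the path is whatever local file you saved the weights to. `short_hash(EXPECTED_SHA256["v1.1-zh"])` gives `b1d8410f`, the abbreviated value in the releases table.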
styletts2
kLegacy
Legacy models that were once in https://hf.co/hexgrad/Kokoro-82M are moved here when superseded.