# NandemoGHS
## Anime-XCodec2
A Japanese fine-tuned variant of XCodec2. [License: CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/)

**TL;DR:** Anime-XCodec2 is a fine-tuned variant of HKUSTAudio/xcodec2, trained on ~25k hours of Japanese anime/game-style voices. Only the decoder was updated; the encoder and codebook remain frozen, so speech tokens are identical to the original XCodec2. This makes the model a drop-in decoder for downstream systems that already work with XCodec2 tokens (e.g., Llasa).

### Links

- Demo (Gradio / Hugging Face Spaces): https://huggingface.co/spaces/OmniAICreator/Anime-XCodec2-Demo
- Baseline model (pretrained): `HKUSTAudio/xcodec2`
- This repository (fine-tuned): `NandemoGHS/Anime-XCodec2`
- Training logs (Weights & Biases): View Report

### Model Description

- What it is: a neural speech codec / speech tokenizer based on XCodec2, with a decoder fine-tuned for Japanese speech, particularly anime/game-style voices.
- Training scope: decoder-only fine-tuning on ~25,000 hours of Japanese data; the encoder and codebook are frozen.
- Compatibility: because the encoder and codebook are unchanged, speech tokens produced at encode time are identical to the original XCodec2. Any downstream model expecting XCodec2 codes can use Anime-XCodec2 as a drop-in decoder (e.g., Llasa).
- Sampling rate: 16 kHz (XCodec2 operates at 16 kHz).

### Intended Use

- Decoding XCodec2 speech tokens (e.g., from Llasa or other AR token generators trained on XCodec2 codes) into Japanese speech with improved naturalness for anime/game-style voices.
- Reconstructing Japanese speech from XCodec2 tokens when analyzing or building Japanese-focused speech pipelines.

### Limitations

- Language scope: optimized for Japanese. Performance on other languages may degrade compared to the baseline XCodec2.
- Sampling rate: 16 kHz only (resample inputs to 16 kHz before encoding; decoding assumes 16 kHz).
- Content domain: tuned toward anime/game-style voices; out-of-domain speech may not benefit.
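The card requires inputs to be resampled to 16 kHz before encoding. As a toy illustration of what that step does, here is a minimal linear-interpolation resampler in pure Python; a real pipeline would use a proper polyphase resampler (e.g., torchaudio or librosa), so treat this as a sketch only.

```python
def resample_linear(samples, sr_in, sr_out):
    """Resample a mono waveform (list of floats) by linear interpolation."""
    if sr_in == sr_out:
        return list(samples)
    n_out = int(len(samples) * sr_out / sr_in)
    out = []
    for i in range(n_out):
        pos = i * sr_in / sr_out          # fractional index into the input
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1.0 - frac) + samples[hi] * frac)
    return out

# A one-second 48 kHz clip becomes one third as many samples at 16 kHz.
clip_48k = [0.0] * 48_000
clip_16k = resample_linear(clip_48k, 48_000, 16_000)
print(len(clip_16k))  # 16000
```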
### Training Data

- ~25,000 hours of Japanese speech, with a focus on anime/game-style voices (acting, character voices, etc.).
- Data preparation included resampling to 16 kHz and standard loudness/peak checks where appropriate.

### Fine-Tuning Scope

- Updated (fine-tuned): `generator.backbone`, `generator.head`, `fc_post_a`
- Frozen: all other components
- Goal: preserve token compatibility with `HKUSTAudio/xcodec2` while improving reconstruction quality for Japanese anime/game-style speech.

### Audio Samples

The original card embeds an audio comparison table (samples 1-3) with three columns: Original (reference), Baseline Reconstruct (`HKUSTAudio/xcodec2`), and Anime-XCodec2 Reconstruct (this model); the audio players are not reproduced here.

Note: the original audio is 48 / 44.1 kHz, while the reconstructed audio is at 16 kHz. These samples come from NandemoGHS/Japanese-Eroge-Voice and were not included in the training or validation data.

### License

CC BY-NC 4.0 (same as the original XCodec2 license). See: https://creativecommons.org/licenses/by-nc/4.0/

### Acknowledgments

- Original model: HKUSTAudio/xcodec2
- Thanks to contributors and the community around Japanese speech resources.
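The decoder-only fine-tuning scope described above (train only the listed generator modules, freeze everything else) can be sketched as a prefix filter over parameter names. The module names follow the `xcodec2` codebase with underscores restored (`fc_post_a`), and the example parameter names below are illustrative, not taken from the actual checkpoint.

```python
# Train only parameters under these module prefixes; freeze all others.
TRAINABLE_PREFIXES = ("generator.backbone", "generator.head", "fc_post_a")

def is_trainable(param_name: str) -> bool:
    return param_name.startswith(TRAINABLE_PREFIXES)

# With a real torch model this would be applied as:
#   for name, p in model.named_parameters():
#       p.requires_grad = is_trainable(name)
names = [
    "encoder.layers.0.weight",      # frozen -> token compatibility preserved
    "quantizer.codebook.weight",    # frozen -> token compatibility preserved
    "generator.backbone.0.weight",  # updated
    "generator.head.bias",          # updated
    "fc_post_a.weight",             # updated
]
print([n for n in names if is_trainable(n)])
```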
## Anime-Llasa-3B
This is Anime-Llasa-3B, a Text-to-Speech (TTS) model fine-tuned for Japanese, based on HKUSTAudio/Llasa-3B. You can try a demo on Hugging Face Spaces: Anime-Llasa-3B-Demo.

The primary improvement in this version is a significant increase in training data: from approximately 14,000 hours (3 epochs) to approximately 33,000 hours (1 epoch). This change aims to further improve the model's expressiveness and overall stability.
## Anime-Llasa-3B-Captions
This is Anime-Llasa-3B-Captions, a Text-to-Speech (TTS) model fine-tuned for Japanese, based on NandemoGHS/Anime-Llasa-3B. This version has been further fine-tuned with additional data, incorporating detailed audio metadata generated by Gemini 2.5 Pro.

The key improvement in this model is its training methodology. I used Gemini 2.5 Pro to generate detailed metadata (captions, speaker profiles, emotions, etc.) for the audio data. The model was then fine-tuned on this dataset, learning to associate text with these rich descriptive tags. This allows for highly controllable speech synthesis by specifying desired audio characteristics in the prompt.

You can control the generated speech in two main ways.

**1. Tags in the system prompt.** You can guide the speech synthesis by providing specific tags in the system prompt. The model expects the following tags (note: `emotion` values are in English, while the others should be in Japanese):

- `caption`: (required) a general description of the audio content.
- `emotion`: emotion tag (e.g., `angry`, `sad`, `happy`, `serious`).
- `profile`: speaker profile (e.g., `若い女性声` "young female voice", `大人の男性声` "adult male voice").
- `mood`: mood (e.g., `恥ずかしさ` "embarrassment", `悲しみ` "sadness").
- `speed`: speaking speed (e.g., `ゆっくり` "slow", `速い` "fast").
- `prosody`: prosody/rhythm (e.g., `震え声` "trembling voice", `平坦` "flat").
- `pitch_timbre`: pitch/timbre (e.g., `高め` "high", `低め` "low", `息多め` "breathy").
- `style`: style (e.g., `ナレーション風` "narration-like", `会話調` "conversational").
- `notes`: special notes (distance, breaths, etc.).

**2. Inline style directives.** You can also control the speech style directly within the transcription text by using full-width Japanese parentheses `（ ）`. For example, adding `（囁き）` ("whisper") to the text will prompt the model to generate that part of the speech in a whispering voice.

For detailed usage instructions and to try the model, please see the Hugging Face Space.

Please note that, due to limitations in the amount and quality of the training data, the model cannot be controlled perfectly; the generated speech may not always reflect the specified tags precisely.
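The control tags above could be assembled into a system prompt with a small helper. The exact prompt template the model was trained on is not reproduced in this card, so the simple `key: value` line format below is a hypothetical illustration, not the model's actual format.

```python
def build_system_prompt(caption: str, **tags: str) -> str:
    """Assemble control tags into a hypothetical 'key: value' prompt.

    `caption` is required by the card; the other tags are optional and
    emitted in a fixed order, skipping anything not provided.
    """
    if not caption:
        raise ValueError("`caption` is required")
    lines = [f"caption: {caption}"]
    allowed = ("emotion", "profile", "mood", "speed",
               "prosody", "pitch_timbre", "style", "notes")
    for key in allowed:
        if key in tags:
            lines.append(f"{key}: {tags[key]}")
    return "\n".join(lines)

prompt = build_system_prompt(
    "少女が恥ずかしそうに囁く",   # "a girl whispers bashfully"
    emotion="happy",
    profile="若い女性声",          # young female voice
    speed="ゆっくり",              # slow
)
print(prompt)
```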
The dataset used for this fine-tuning, including the Gemini 2.5 Pro-generated captions, is also available. Additionally, because this model includes outputs from Gemini 2.5 Pro in its training data, any use that competes with Gemini is prohibited.
## Anime XCodec2 44.1kHz
A 44.1 kHz upsampling variant of Anime-XCodec2. [License: CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/)

**TL;DR:** `Anime-XCodec2-44.1kHz` is a fine-tuned variant of NandemoGHS/Anime-XCodec2. It adds upsampling layers and an RMS loss inspired by the Inworld TTS-1 paper to produce 44.1 kHz output, and was trained on ~22k hours of Japanese speech. Only the decoder was updated; the encoder and codebook remain frozen, so speech tokens are identical to the original XCodec2. This makes the model a drop-in decoder for downstream systems that already work with XCodec2 tokens (e.g., Llasa).

### Links

- Demo (Gradio / Hugging Face Spaces): https://huggingface.co/spaces/OmniAICreator/Anime-XCodec2-44.1kHz-Demo
- This repository (44.1 kHz fine-tune): `NandemoGHS/Anime-XCodec2-44.1kHz`
- Baseline 16 kHz model: `NandemoGHS/Anime-XCodec2`
- Original XCodec2: `HKUSTAudio/xcodec2`
- Reference paper (Inworld TTS-1): https://arxiv.org/abs/2507.21138
- Reference implementation (Inworld TTS): https://github.com/inworld-ai/tts

### Model Description

- What it is: a neural speech codec based on Anime-XCodec2 (itself based on XCodec2), fine-tuned to output 44.1 kHz high-fidelity Japanese speech (anime/game-style).
- Key change: integrates an `UpSamplerBlock` into the decoder architecture and uses an RMS loss, both inspired by Inworld TTS-1.
- Training scope: decoder-only fine-tuning on ~22,000 hours of Japanese data; the encoder and codebook are frozen.
- Compatibility: speech tokens are identical to `HKUSTAudio/xcodec2` and `NandemoGHS/Anime-XCodec2`.
- Input sampling rate: 16 kHz (for encoding, same as XCodec2).
- Output sampling rate: 44.1 kHz (decoded audio).

### Intended Use

- Decoding XCodec2 speech tokens (e.g., from Llasa or other AR generators) into high-fidelity 44.1 kHz Japanese speech (anime/game-style).
- Upgrading existing `Anime-XCodec2` (16 kHz) pipelines to 44.1 kHz output.
- Audio super-resolution: because the model accepts 16 kHz input and outputs 44.1 kHz reconstructed audio, it can also be used as a form of audio super-resolution.
However, its performance for this specific purpose is untested.

### Inference Requirements

This model modifies the original XCodec2 architecture by adding upsampler blocks. You MUST use the provided custom `xcodec2` library fork for inference; the standard library will not work.

Usage: once the custom library is installed, you can load and use this model just as you would the original XCodec2 or Anime-XCodec2 models; the core inference logic remains the same. For a complete, working code example, please refer to my Hugging Face Spaces demo: https://huggingface.co/spaces/OmniAICreator/Anime-XCodec2-44.1kHz-Demo

### Limitations

- Language scope: optimized for Japanese. Performance on other languages may degrade.
- Content domain: tuned toward anime/game-style voices.
- Library dependency: requires the custom `xcodec2` library linked above; it is not compatible with the original `xcodec2` library.

### Training Data

- ~22,000 hours of Japanese speech, with a focus on anime/game-style voices.
- Data was prepared for a 44.1 kHz target output during training.

### Training Details

- Base model: `NandemoGHS/Anime-XCodec2` (16 kHz)
- Architecture modification: integrated the `UpSamplerBlock` from the Inworld TTS-1 implementation into the decoder. Upsampler parameters: `hop_length=147`, `upsample_factors=[3, 2]`, `kernel_sizes=[7, 6]`.
- Loss function: adopted the RMS (root-mean-square) loss introduced in Inworld TTS-1, in addition to the original losses.
- Frozen: encoder and codebook (token compatibility preserved).
- Updated (fine-tuned): `generator.backbone`, `generator.head`, `generator.upsampler`, `fc_post_a`

### Audio Samples

The original card embeds an audio comparison table (samples 1-3) with four columns: Original (reference), Baseline Reconstruct (`HKUSTAudio/xcodec2`) [16 kHz], `Anime-XCodec2` [16 kHz], and `Anime-XCodec2-44.1kHz` (this model) [44.1 kHz]; the audio players are not reproduced here.

Note: the original audio is 48 / 44.1 kHz; the baseline and Anime-XCodec2 reconstructions are 16 kHz, while this model outputs 44.1 kHz.

### License

CC BY-NC 4.0 (inherited from XCodec2 and Anime-XCodec2).
See: https://creativecommons.org/licenses/by-nc/4.0/

### Acknowledgments

- HKUSTAudio/xcodec2 (original model)
- Inworld AI, for their work on Inworld TTS-1 (upsampler architecture and RMS loss)
- Thanks to contributors and the community around Japanese speech resources.
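The upsampler parameters in this card (`hop_length=147`, `upsample_factors=[3, 2]`) fix the number of output samples produced per speech token. A quick consistency check, assuming XCodec2's 50 tokens-per-second rate (i.e., a 320-sample hop at 16 kHz; that rate is not stated in this card and is an assumption here):

```python
from math import prod

hop_length = 147
upsample_factors = [3, 2]

samples_per_token = hop_length * prod(upsample_factors)  # 147 * 3 * 2
print(samples_per_token)           # 882
print(44_100 / samples_per_token)  # 50.0 tokens per second at 44.1 kHz
print(16_000 / 320)                # 50.0 -- the assumed 16 kHz frame rate
```

So the 44.1 kHz decoder keeps the same token frame rate as the 16 kHz codec, which is what makes it a drop-in replacement at the token level.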
## Anime XCodec2 44.1kHz V2
A 44.1 kHz upsampling variant of Anime-XCodec2 (v2). [License: CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/)

**TL;DR:** `Anime-XCodec2-44.1kHz-v2` is a fine-tuned variant of NandemoGHS/Anime-XCodec2. It adds upsampling layers and an RMS loss (inspired by Inworld TTS-1) to produce 44.1 kHz output, and was trained on ~22k hours of Japanese speech. This v2 updates the upsampler parameters and loss configuration, and fixes a RoPE bug from the original XCodec2. Only the decoder was updated; the encoder and codebook remain frozen, so speech tokens are identical to the original XCodec2. This makes the model a drop-in decoder for downstream systems that already work with XCodec2 tokens (e.g., Llasa).

### Links

- Demo (Gradio / Hugging Face Spaces): https://huggingface.co/spaces/OmniAICreator/Anime-XCodec2-44.1kHz-v2-Demo
- This repository (v2 44.1 kHz fine-tune): `NandemoGHS/Anime-XCodec2-44.1kHz-v2`
- Baseline 16 kHz model: `NandemoGHS/Anime-XCodec2`
- Original XCodec2: `HKUSTAudio/xcodec2`
- Reference paper (Inworld TTS-1): https://arxiv.org/abs/2507.21138
- Reference implementation (Inworld TTS): https://github.com/inworld-ai/tts

### Model Description

- What it is: a neural speech codec based on Anime-XCodec2 (itself based on XCodec2), fine-tuned to output 44.1 kHz high-fidelity Japanese speech (anime/game-style). This is version 2.
- Key change: integrates an `UpSamplerBlock` into the decoder architecture and uses an RMS loss, both inspired by Inworld TTS-1.
- Training scope: decoder-only fine-tuning on ~22,000 hours of Japanese data; the encoder and codebook are frozen.
- Compatibility: speech tokens are identical to `HKUSTAudio/xcodec2` and `NandemoGHS/Anime-XCodec2`.
- Input sampling rate: 16 kHz (for encoding, same as XCodec2).
- Output sampling rate: 44.1 kHz (decoded audio).

### Intended Use

- Decoding XCodec2 speech tokens (e.g., from Llasa or other AR generators) into high-fidelity 44.1 kHz Japanese speech (anime/game-style).
- Upgrading existing `Anime-XCodec2` (16 kHz) pipelines to 44.1 kHz output.
- Audio super-resolution: because the model accepts 16 kHz input and outputs 44.1 kHz reconstructed audio, it can also be used as a form of audio super-resolution. However, its performance for this specific purpose is untested.

### Inference Requirements

This model modifies the original XCodec2 architecture (upsampler blocks) and requires a custom library version that includes a fix for the RoPE bug (Issue #36). You MUST use the provided custom `xcodec2` library fork (v0.1.7 or later) for inference; the standard library and older custom forks (such as v0.1.6) will not work.

Usage: once the custom library is installed, you can load and use this model just as you would the original XCodec2 or Anime-XCodec2 models; the core inference logic remains the same. For a complete, working code example, please refer to my Hugging Face Spaces demo: https://huggingface.co/spaces/OmniAICreator/Anime-XCodec2-44.1kHz-v2-Demo

### Limitations

- Language scope: optimized for Japanese. Performance on other languages may degrade.
- Content domain: tuned toward anime/game-style voices.
- Library dependency: requires the specific custom `xcodec2` library (v0.1.7) linked above; it is not compatible with the original `xcodec2` library or previous custom forks (e.g., v0.1.6).

### Training Data

- ~22,000 hours of Japanese speech, with a focus on anime/game-style voices.
- Data was prepared for a 44.1 kHz target output during training.

### Training Details

- Base model: `NandemoGHS/Anime-XCodec2` (16 kHz)
- Architecture modification: integrated the `UpSamplerBlock` from the Inworld TTS-1 implementation into the decoder.
- Loss function: adopted the RMS (root-mean-square) loss from Inworld TTS-1, in addition to the original losses.
- Frozen: encoder and codebook (token compatibility preserved).
- Updated (fine-tuned): `generator.backbone`, `generator.head`, `generator.upsampler`, `fc_post_a`

### Changes from v1

Compared to the first version, this v2 model includes the following key updates to the training configuration:

1. RoPE bug fix: corrected a RoPE (Rotary Position Embedding) bug present in the original XCodec2 implementation (see Issue #36).
2. Upsampler parameters: changed to `hop_length=98`, `upsample_factors=[3, 3]`, and `kernel_sizes=[9, 9]`.
3. Perceptual-loss model: switched the model used for the perceptual loss from facebook/wav2vec2-large-xlsr-53 to imprt/kushinada-hubert-large.
4. Spectral-discriminator tuning: adjusted the STFT (short-time Fourier transform) settings of the spectral discriminator to better suit 44.1 kHz high-sampling-rate audio.

### License

CC BY-NC 4.0 (inherited from XCodec2 and Anime-XCodec2). See: https://creativecommons.org/licenses/by-nc/4.0/

### Acknowledgments

- HKUSTAudio/xcodec2 (original model)
- Inworld AI, for their work on Inworld TTS-1 (upsampler architecture and RMS loss)
- imprt, for the `kushinada-hubert-large` model used in the perceptual loss
- Thanks to contributors and the community around Japanese speech resources.
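The v2 upsampler refactorization (v1: `hop_length=147`, factors `[3, 2]`, kernels `[7, 6]`; v2: `hop_length=98`, factors `[3, 3]`, kernels `[9, 9]`) changes the stage layout but preserves the total upsampling ratio. The check below also applies the HiFi-GAN-style symmetric padding rule `p = (k - u) // 2` for transposed-convolution upsamplers; that rule is an assumption about the implementation, not something stated in this card.

```python
from math import prod

CONFIGS = {
    "v1": {"hop_length": 147, "upsample_factors": [3, 2], "kernel_sizes": [7, 6]},
    "v2": {"hop_length": 98,  "upsample_factors": [3, 3], "kernel_sizes": [9, 9]},
}

totals = {}    # output samples per token frame
paddings = {}  # symmetric padding per upsampling stage (HiFi-GAN convention)
for name, cfg in CONFIGS.items():
    totals[name] = cfg["hop_length"] * prod(cfg["upsample_factors"])
    paddings[name] = [(k - u) // 2
                      for k, u in zip(cfg["kernel_sizes"], cfg["upsample_factors"])]
    print(name, totals[name], paddings[name])
# v1 882 [2, 2]
# v2 882 [3, 3]
```

Both configurations produce 882 samples per token frame at 44.1 kHz, so the v2 change alters the kernel/stride balance without touching the token frame rate.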
## Anime-Llasa-3B-FP8
## Galgame-Orpheus-3B
## Anime-Speech-Japanese-Refiner
This model is a fine-tuned version of Qwen/Qwen3-Omni-30B-A3B-Instruct. It is an audio-processing model specialized for Japanese anime-style and game-style speech: given an audio input and its original transcription (text), it generates detailed descriptions (emotion, profile, etc.) and a refined transcription that includes non-speech events (e.g., breaths, sighs).

It was fine-tuned on the NandemoGHS/GalgameGeminiCaptions dataset. Training was conducted with the ms-swift library using the Megatron backend.

Demo: https://huggingface.co/spaces/OmniAICreator/Anime-Speech-Japanese-Refiner-Demo

### Limitations

This model is specifically designed for Japanese game-style and anime-style speech. Due to the nature of its training data, it is not expected to perform well on:

- languages other than Japanese;
- general conversational speech (e.g., meetings, casual dialogue).

### vLLM Requirement

This model requires building `vLLM` from a recent development commit, as it is not yet supported in the latest stable release (v0.11.0 as of this writing). It has been tested and confirmed to work with commit `18961c5ea62976efc50525b72e40337993c5e4f9`, so you must build vLLM from source. This requirement will likely become unnecessary after the `v0.11.1` release.

### Usage and Output

For a more detailed walkthrough, please see the inference_example.ipynb notebook (note: you will need to adapt the prompt for this Refiner model). The model outputs a structured description of the audio in Japanese in a fixed format.

Furthermore, the training data utilized outputs from Gemini 2.5 Pro. Therefore, any use that competes with or violates the terms of service of Gemini is strictly prohibited.
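Since the card recommends serving this model with vLLM, a request against a vLLM OpenAI-compatible server could look like the sketch below. The `audio_url` content-part shape follows vLLM's multimodal chat format; the URL, the instruction text, and the transcription are placeholders, not the exact prompt the model was trained with.

```python
import json

# Hypothetical chat-completions request body for a locally served instance.
payload = {
    "model": "NandemoGHS/Anime-Speech-Japanese-Refiner",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "audio_url",
                 "audio_url": {"url": "https://example.com/sample.wav"}},
                {"type": "text",
                 "text": "Original transcription: こんにちは"},
            ],
        }
    ],
    "temperature": 0.0,
}
body = json.dumps(payload, ensure_ascii=False)
print([part["type"] for part in payload["messages"][0]["content"]])
# ['audio_url', 'text']
```

POST this body to `/v1/chat/completions` on the vLLM server (e.g., with an OpenAI client or `requests`); the assistant message in the response carries the structured Japanese description and refined transcription.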
## Anime-Speech-Japanese-Captioner
This model is a fine-tuned version of Qwen/Qwen3-Omni-30B-A3B-Captioner. It is an audio-captioning model specialized for Japanese anime-style and game-style speech: given an audio input, it generates a detailed description in Japanese, including emotion, speaker profile, mood, speed, prosody, pitch/timbre, style, and an overall caption.

It was fine-tuned on the NandemoGHS/GalgameGeminiCaptions dataset. Training was conducted with the ms-swift library using the Megatron backend.

### Limitations

This model is specifically designed for Japanese game-style and anime-style speech. Due to the nature of its training data, it is not expected to perform well on:

- languages other than Japanese;
- general conversational speech (e.g., meetings, casual dialogue).

### vLLM Requirement

This model requires building `vLLM` from a recent development commit, as it is not yet supported in the latest stable release (v0.11.0 as of this writing). It has been tested and confirmed to work with commit `18961c5ea62976efc50525b72e40337993c5e4f9`, so you must build vLLM from source. This requirement will likely become unnecessary after the `v0.11.1` release.

### Usage and Output

For a more detailed walkthrough, please see the inference_example.ipynb notebook. The model outputs a structured description of the audio in Japanese in a fixed format.

Furthermore, the training data utilized outputs from Gemini 2.5 Pro. Therefore, any use that competes with or violates the terms of service of Gemini is strictly prohibited.
## Anime-Speech-Japanese-Captioner-FP8-DYNAMIC
This is the FP8-DYNAMIC quantized version of NandemoGHS/Anime-Speech-Japanese-Captioner. For detailed information on how to use the model, inference examples, vLLM setup, and output formats, please refer to the README on the original model page. As with the original model, the training data utilized outputs from Gemini 2.5 Pro; therefore, any use that competes with or violates the terms of service of Gemini is strictly prohibited.
## Anime-Speech-Japanese-Refiner-FP8-DYNAMIC
This is the FP8-DYNAMIC quantized version of NandemoGHS/Anime-Speech-Japanese-Refiner. For detailed information on how to use the model, inference examples, vLLM setup, and output formats, please refer to the README on the original model page. As with the original model, the training data utilized outputs from Gemini 2.5 Pro; therefore, any use that competes with or violates the terms of service of Gemini is strictly prohibited.