# MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer

[Paper](https://arxiv.org/abs/2409.00750) | [Model](https://huggingface.co/amphion/maskgct) | [Demo](https://huggingface.co/spaces/amphion/maskgct) | [Code](https://github.com/open-mmlab/Amphion/tree/main/models/tts/maskgct)

## Pretrained Models

| Model Name | Description |
|----------------|-------------|
| Semantic Codec | Converts speech to semantic tokens. |
| Acoustic Codec | Converts speech to acoustic tokens and reconstructs the waveform from them. |
| MaskGCT-T2S | Predicts semantic tokens from text and prompt semantic tokens. |
| MaskGCT-S2A | Predicts acoustic tokens conditioned on semantic tokens. |

You can download all pretrained checkpoints from Hugging Face, either manually or through the Hugging Face Hub API.

## Usage

You can use the following code to generate speech from text and a prompt speech clip.

## Training Data

We trained our models on the Emilia dataset, a multilingual, diverse, in-the-wild speech dataset designed for large-scale speech generation. In this work, we use the English and Chinese subsets of Emilia, each with 50K hours of speech (100K hours in total).

## Citation

If you use MaskGCT in your research, please cite the following paper:
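Both generative stages (T2S and S2A) decode their tokens non-autoregressively by iterative unmasking rather than left-to-right prediction. The sketch below illustrates that idea under a MaskGIT-style cosine schedule; it is a simplified stand-in for the paper's method, not the repository's actual code, and `predict` plays the role of the trained masked transformer.

```python
import math

MASK = -1  # hypothetical mask-token id, standing in for the model's real one

def masked_iterative_decode(predict, length, steps=8):
    """Confidence-based iterative unmasking (MaskGIT-style parallel decoding).

    `predict` maps the current partially-masked sequence to a list of
    (token, confidence) proposals, one per position; it stands in for
    the trained masked generative transformer.
    """
    seq = [MASK] * length
    for step in range(1, steps + 1):
        # cosine schedule: how many positions stay masked after this step
        n_masked = int(length * math.cos(math.pi / 2 * step / steps))
        proposals = predict(seq)
        # fill every masked slot; already-committed tokens keep full confidence
        filled = [p if tok == MASK else (tok, 1.0)
                  for tok, p in zip(seq, proposals)]
        seq = [tok for tok, _ in filled]
        # re-mask the n_masked least confident positions for the next round
        for i in sorted(range(length), key=lambda i: filled[i][1])[:n_masked]:
            seq[i] = MASK
    return seq  # fully unmasked after the final step
```

On the last step the schedule reaches zero, so every position is committed in parallel; this is what makes decoding take a fixed, small number of steps regardless of sequence length.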
# AnyAccomp: Generalizable Accompaniment Generation via Quantized Melodic Bottleneck

This is the official Hugging Face model repository for AnyAccomp, an accompaniment generation framework from the paper *AnyAccomp: Generalizable Accompaniment Generation via Quantized Melodic Bottleneck*.

AnyAccomp addresses two critical challenges in accompaniment generation: generalization to in-the-wild singing voices and versatility in handling solo instrumental inputs. The core of the framework is a quantized melodic bottleneck, which extracts robust melodic features; a subsequent flow matching model then generates a matching accompaniment from these features. For more details, please visit our GitHub Repository.

## Pretrained Models

This repository contains the three pretrained components of the AnyAccomp framework:

| Model Name | Directory | Description |
| ----------------- | ---------------------------- | ------------------------------------------------- |
| VQ | `./pretrained/vq` | Extracts core melodic features from audio. |
| Flow Matching | `./pretrained/flowmatching` | Generates accompaniments from melodic features. |
| Vocoder | `./pretrained/vocoder` | Converts generated features into audio waveforms. |

## Quick Start

To run this model, follow these steps:

1. Clone the repository and set up the environment.
2. Run the Gradio demo or the inference script.

## Installation

1. Clone the repository.
2. Set up the environment following the guide below.

Before installing, make sure you are in the `AnyAccomp` root directory; if not, use `cd` to enter it.

## Downloading Pretrained Models

We provide a simple Python script that downloads all the necessary pretrained models from Hugging Face into the correct directories. Before running the script, make sure you are in the `AnyAccomp` root directory. If you have trouble connecting to Hugging Face, you can try switching to a mirror endpoint before running the command.
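The quantized melodic bottleneck at the heart of AnyAccomp (the VQ component in the table above) discretizes continuous melodic features via nearest-neighbour codebook lookup. A minimal sketch of that quantization step, with a toy codebook standing in for the pretrained one in `./pretrained/vq`:

```python
def vector_quantize(frames, codebook):
    """Map each feature frame to its nearest codebook entry (squared L2).

    Returns the discrete code indices (the "bottleneck") and the
    quantized vectors that the flow matching stage would consume.
    """
    codes, quantized = [], []
    for frame in frames:
        dists = [sum((f - c) ** 2 for f, c in zip(frame, entry))
                 for entry in codebook]
        k = min(range(len(codebook)), key=dists.__getitem__)
        codes.append(k)
        quantized.append(codebook[k])
    return codes, quantized
```

Squeezing features through a small discrete codebook keeps only coarse melodic content and discards timbre and recording-condition details, which is what lets the model generalize from clean vocals to in-the-wild singing and solo instruments.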
## Running the Model

Once the setup is complete, you can run the model using either the Gradio demo or the inference script.

### Gradio Demo

Run the following command to interact with the playground:

### Inference Script

To run inference on multiple audio files, use the Python inference script. By default, the script loads input audio from `./example/input` and saves the results to `./example/output`; you can customize these paths in the inference script.

## Citation

If you use AnyAccomp in your research, please cite our paper:
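After the bottleneck, the flow matching stage generates the accompaniment by integrating a learned velocity field from noise (t=0) toward data (t=1). A minimal Euler sampler sketch of that idea, assuming a generic flow-matching formulation rather than the repository's exact implementation; `velocity` stands in for the trained network, which in practice is conditioned on the quantized melodic features:

```python
def flow_matching_sample(velocity, x0, steps=16):
    """Integrate dx/dt = velocity(x, t) from t=0 (noise) to t=1 (data)
    with fixed-step Euler. `velocity` is a stand-in for the trained,
    feature-conditioned flow matching network."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

More steps trade compute for integration accuracy; the sampled output is then passed to the vocoder to produce the final waveform.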