amphion

20 models

MaskGCT

**MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer** — [paper](https://arxiv.org/abs/2409.00750) · [model](https://huggingface.co/amphion/maskgct) · [demo](https://huggingface.co/spaces/amphion/maskgct) · [code](https://github.com/open-mmlab/Amphion/tree/main/models/tts/maskgct)

| Model Name | Description |
|----------------|-------------|
| Semantic Codec | Converts speech to semantic tokens. |
| Acoustic Codec | Converts speech to acoustic tokens and reconstructs the waveform from them. |
| MaskGCT-T2S | Predicts semantic tokens from text and prompt semantic tokens. |
| MaskGCT-S2A | Predicts acoustic tokens conditioned on semantic tokens. |

You can download all pretrained checkpoints from Hugging Face (manually or via the `huggingface_hub` API) and then generate speech from text and a prompt speech.

The models are trained on the Emilia dataset, a multilingual and diverse in-the-wild speech corpus designed for large-scale speech generation. This work uses the English and Chinese subsets of Emilia, each with 50K hours of speech (100K hours in total).

If you use MaskGCT in your research, please cite the paper linked above.
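The checkpoint download can be sketched with the `huggingface_hub` API. This is a minimal sketch, not the project's official script: the repo id `amphion/maskgct` comes from the model link above, while the helper name and local directory are arbitrary choices, and nothing is downloaded until the helper is actually called.

```python
from huggingface_hub import snapshot_download


def fetch_maskgct_checkpoints(local_dir: str = "./maskgct_ckpts") -> str:
    """Download all four MaskGCT components (semantic codec, acoustic codec,
    T2S, S2A) from the Hugging Face Hub into local_dir and return its path."""
    return snapshot_download(repo_id="amphion/maskgct", local_dir=local_dir)
```

Calling `fetch_maskgct_checkpoints()` pulls every file in the repository; if you only need one component, `hf_hub_download` fetches individual files instead.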

license:cc-by-nc-4.0
2,167
299

TaDiCodec

license:apache-2.0
54
27

anyaccomp

**AnyAccomp: Generalizable Accompaniment Generation via Quantized Melodic Bottleneck**

This is the official Hugging Face model repository for AnyAccomp, an accompaniment generation framework from the paper *AnyAccomp: Generalizable Accompaniment Generation via Quantized Melodic Bottleneck*. AnyAccomp addresses two critical challenges in accompaniment generation: generalizing to in-the-wild singing voices and handling solo instrumental inputs. The core of the framework is a quantized melodic bottleneck, which extracts robust melodic features; a subsequent flow-matching model then generates a matching accompaniment from these features. For more details, please visit the GitHub repository.

This repository contains the three pretrained components of the AnyAccomp framework:

| Model Name | Directory | Description |
|---------------|------------------------------|---------------------------------------------------|
| VQ | `./pretrained/vq` | Extracts core melodic features from audio. |
| Flow Matching | `./pretrained/flowmatching` | Generates accompaniments from melodic features. |
| Vocoder | `./pretrained/vocoder` | Converts generated features into audio waveforms. |

To run the model, follow these steps:

1. Clone the repository and install the environment. Before installing, make sure you are in the `AnyAccomp` root directory; if not, use `cd` to enter it.
2. Download the pretrained models. A simple Python script downloads all the necessary checkpoints from Hugging Face into the correct directories; run it from the `AnyAccomp` root. If you have trouble connecting to Hugging Face, try switching to a mirror endpoint before running the command.
3. Run the Gradio demo or the inference script. The Gradio demo gives you an interactive playground; to process several audios, use the Python inference script instead. By default, the script loads input audio from `./example/input` and saves the results to `./example/output`; both paths can be customized in the script.

If you use AnyAccomp in your research, please cite the paper.
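The mirror-endpoint workaround mentioned above amounts to setting one environment variable before any download runs. `HF_ENDPOINT` is the variable `huggingface_hub` reads; the mirror URL below is a commonly used public example, not an Amphion-specific endpoint.

```python
import os

# huggingface_hub picks up HF_ENDPOINT when it is imported, so set the
# mirror before importing the library or launching the download script.
os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"
```

After this, any `huggingface_hub` download in the same process (or a script launched with this variable exported) is routed through the mirror.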

license:cc-by-4.0
19
7

TaDiCodec-TTS-MGM

license:apache-2.0
19
3

Vevo

license:cc-by-nc-4.0
18
41

TaDiCodec-TTS-AR-Qwen2.5-0.5B

license:apache-2.0
16
8

Vevo1.5

license:cc-by-nc-nd-4.0
14
19

Metis

license:cc-by-nc-4.0
13
25

TaDiCodec-TTS-AR-Qwen2.5-3B

license:apache-2.0
7
5

valle

3
1

naturalspeech3_facodec

license:apache-2.0
0
87

singing_voice_conversion

license:mit
0
29

text_to_audio

license:mit
0
10

naturalspeech2_libritts

license:mit
0
8

valle_libritts

license:mit
0
4

hifigan_speech_bigdata

license:mit
0
4

dualcodec

license:apache-2.0
0
4

dualcodec-tts

0
4

BigVGAN_singing_bigdata

license:mit
0
2

valle_librilight_6k

license:mit
0
1