ASLP-lab

27 models

YingMusic-Singer-Plus

license:cc-by-4.0
323
5

YingMusic-Singer

license:cc-by-4.0
222
4

DiffRhythm-full

license:apache-2.0
143
45

DiffRhythm2

DiffRhythm 2: Efficient And High Fidelity Song Generation Via Block Flow Matching

Yuepeng Jiang, Huakang Chen, Ziqian Ning, Jixun Yao, Zerui Han, Di Wu, Meng Meng, Jian Luan, Zhonghua Fu, Lei Xie†

DiffRhythm 2 | 📑 Paper | 🎵 Demo

DiffRhythm 2 (Chinese: 谛韵, Dì Yùn) is a next-generation open-source music generation framework that advances the original DiffRhythm with a semi-autoregressive diffusion architecture. It can generate full-length songs with precise lyric alignment and coherent musical structure. The name inherits the essence of DiffRhythm: "Diff" reflects its diffusion-based generative backbone, while "Rhythm" emphasizes its dedication to musicality and temporal flow. The Chinese name 谛韵 (Dì Yùn) continues this dual symbolism: "谛" (attentive listening) represents perceptual awareness, and "韵" (melodic charm) captures the expressive beauty of music.

2025.10.30 🚀 We released the DiffRhythm2 paper, demo code, and model weights.

📋 TODOs

- [ ] Support Colab.
- [ ] Gradio support.
- [ ] Song extension.
- [ ] Instrumental music generation.
- [x] Release code and weights.
- [x] Release paper to Arxiv.

Follow the steps below to clone the repository and install the environment. On Linux you can then simply use the inference script; weights are downloaded automatically from Hugging Face on the first run. Example lyrics and reference audio can be found in `example`.

DiffRhythm 2 (code and weights) is released under the Apache License 2.0. This open-source license allows you to freely use, modify, and distribute the model, as long as you include the appropriate copyright notice and disclaimer. We do not make any profit from this model. Our goal is to provide a high-quality base model for music generation, fostering innovation in AI music and contributing to the advancement of human creativity. We hope that DiffRhythm 2 will serve as a foundation for further research and development in the field of AI-generated music.
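DiffRhythm 2's block flow matching builds on the flow matching family of generative objectives. As rough orientation, the standard conditional flow matching loss with a linear noise-to-data path can be sketched as follows (a generic textbook formulation, not necessarily the paper's exact notation):

```latex
% Linear path between noise x_0 ~ N(0, I) and data x_1, with t ~ U[0, 1]:
%   x_t = (1 - t) x_0 + t x_1
% The network v_theta regresses the path's constant velocity x_1 - x_0:
\[
  \mathcal{L}_{\mathrm{CFM}}
  = \mathbb{E}_{t,\, x_0,\, x_1}
    \bigl\| v_\theta\bigl((1-t)\,x_0 + t\,x_1,\; t\bigr) - (x_1 - x_0) \bigr\|^2
\]
```

The "block" aspect refers to generating the song semi-autoregressively in chunks rather than denoising the full track at once; see the paper for the actual formulation.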
DiffRhythm 2 enables the creation of original music across diverse genres, supporting applications in artistic creation, education, and entertainment. While designed for positive use cases, potential risks include unintentional copyright infringement through stylistic similarities, inappropriate blending of cultural musical elements, and misuse for generating harmful content. To ensure responsible deployment, users must implement verification mechanisms to confirm musical originality, disclose AI involvement in generated works, and obtain permissions when adapting protected styles.

license:apache-2.0
127
20

SongFormer

SongFormer: Scaling Music Structure Analysis with Heterogeneous Supervision

[Paper](https://arxiv.org/abs/2510.02797) | [GitHub](https://github.com/ASLP-lab/SongFormer) | [Space](https://huggingface.co/spaces/ASLP-lab/SongFormer) | [Model](https://huggingface.co/ASLP-lab/SongFormer) | [SongFormDB](https://huggingface.co/datasets/ASLP-lab/SongFormDB) | [SongFormBench](https://huggingface.co/datasets/ASLP-lab/SongFormBench) | [Discord](https://discord.gg/p5uBryC4Zs) | [ASLP@NPU](http://www.npu-aslp.org/)

Chunbo Hao (1), Ruibin Yuan (2,5), Jixun Yao (1), Qixin Deng (3,5), Xinyi Bai (4,5), Wei Xue (2), Lei Xie (1)†

Equal contribution    † Corresponding author

1 Audio, Speech and Language Processing Group (ASLP@NPU), Northwestern Polytechnical University
2 Hong Kong University of Science and Technology
3 Northwestern University
4 Cornell University
5 Multimodal Art Projection (M-A-P)

SongFormer is a music structure analysis framework that leverages multi-resolution self-supervised representations and heterogeneous supervision. It is accompanied by the large-scale multilingual dataset SongFormDB and the high-quality benchmark SongFormBench, to foster fair and reproducible research. For a more detailed deployment guide, please refer to the GitHub repository.

Before running the model, follow the instructions in the GitHub repository to set up the required Python environment. You can perform inference by providing the path to an audio file. Alternatively, you can directly feed a raw audio waveform as a NumPy array or PyTorch tensor.

> ⚠️ Note: The expected sampling rate for input audio is 24,000 Hz.

The model returns a structured list of segment predictions, with each entry containing timing and label information.

- The initialization logic of MusicFM has been modified to eliminate the need for loading checkpoint files during instantiation, improving both reliability and startup efficiency.

If you use SongFormer in your research or application, please cite our work:
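To make the described I/O contract concrete, here is a small illustrative sketch: inputs are audio at 24 kHz, and the model is described as returning a list of segment predictions with timing and label information. The `format_segments` helper and the example values are hypothetical, not part of the SongFormer API.

```python
# Illustration of the documented output shape (values are made up):
# a list of dicts, each with start/end times in seconds and a structure label.

EXPECTED_SR = 24_000  # per the model card, input audio must be 24,000 Hz

def format_segments(segments):
    """Render [{'start': s, 'end': e, 'label': l}, ...] as readable lines."""
    return [f"{seg['start']:7.2f}s -{seg['end']:8.2f}s  {seg['label']}"
            for seg in segments]

preds = [
    {"start": 0.0,  "end": 12.8, "label": "intro"},
    {"start": 12.8, "end": 41.6, "label": "verse"},
    {"start": 41.6, "end": 70.4, "label": "chorus"},
]
for line in format_segments(preds):
    print(line)
```

Resample any input that is not already 24 kHz before inference (e.g. with torchaudio or librosa).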

84
3

VoiceSculptor-VD

llama
62
17

DiffRhythm-1_2

47
17

DiffRhythm-1_2-full

license:apache-2.0
43
7

DiffRhythm-base

license:apache-2.0
39
169

Llasa-1B-Yue

llama
33
0

LLasa-1B-Yue-Updated

llama
31
1

SenSE

GitHub: https://github.com/ASLP-lab/SenSE
Paper: SenSE: Semantic-Aware High-Fidelity Universal Speech Enhancement

18
6

Easy Turn

Easy Turn: Integrating Acoustic and Linguistic Modalities for Robust Turn-Taking in Full-Duplex Spoken Dialogue Systems

Guojian Li (1), Chengyou Wang (1), Hongfei Xue (1), Shuiyuan Wang (1), Dehui Gao (1), Zihan Zhang (2), Yuke Lin (2), Wenjie Li (2), Longshuai Xiao (2), Zhonghua Fu (1)†, Lei Xie (1)†

1 Audio, Speech and Language Processing Group (ASLP@NPU), Northwestern Polytechnical University
2 Huawei Technologies, China

| 🎤 Demo Page | 🤖 Easy Turn Model | 📑 Paper | 🌐 Huggingface |
|:---:|:---:|:---:|:---:|

Download: The Easy Turn resources are available at Model, Trainset, and Testset.

Full-duplex interaction is crucial for natural human–machine communication, yet it remains challenging because it requires robust turn-taking detection to decide when the system should speak, listen, or remain silent. Existing solutions either rely on dedicated turn-taking models, most of which are not open-sourced (the few available ones are limited by large parameter counts or support only a single modality, acoustic or linguistic), or finetune LLM backbones to enable full-duplex capability, which requires large amounts of full-duplex data that remain scarce in open-source form. To address these issues, we propose Easy Turn, an open-source, modular turn-taking detection model that integrates acoustic and linguistic bimodal information to predict four dialogue turn states: complete (semantically complete), incomplete (semantically incomplete), backchannel (brief feedback), and wait (request to pause or end the dialogue). We also release the Easy Turn trainset, a 1,145-hour speech dataset designed for training turn-taking detection models. Compared to existing open-source models such as TEN Turn Detection and Smart Turn V2, our model achieves state-of-the-art turn-taking detection accuracy on our open-source Easy Turn testset.
Easy Turn Trainset

The Easy Turn trainset is a large-scale audio dataset for turn-taking detection, comprising both real and synthetic data. It contains four subsets corresponding to the conversational turn-taking states: 580 hours of complete, 532 hours of incomplete, 10 hours of backchannel, and 23 hours of wait, totaling approximately 1,145 hours. Each recording is accompanied by a text transcription and labeled with one of the four turn-taking states.

Experiments

Main Results

We evaluate Easy Turn against two open-source turn-taking detection models, TEN Turn Detection and Smart Turn V2, on the Easy Turn testset. All experiments are conducted on a single NVIDIA RTX 4090 GPU. Notably, since TEN Turn Detection lacks direct speech support, we use Paraformer as the ASR model to transcribe speech into text and feed the text as its input. In the table below, ACCcp, ACCincp, ACCbc, and ACCwait denote turn-taking detection accuracy for the complete, incomplete, backchannel, and wait states (higher is better); Params, Latency, and Memory denote total model size, average inference time, and GPU usage (lower is better).

| Model | Params (MB) ↓ | Latency (ms) ↓ | Memory (MB) ↓ | ACCcp (%) ↑ | ACCincp (%) ↑ | ACCbc (%) ↑ | ACCwait (%) ↑ |
|---|---|---|---|---|---|---|---|
| Paraformer + TEN Turn Detection | 7220 | 204 | 15419 | 86.67 | 89.3 | – | 91 |
| Smart Turn V2 | 95 | 27 | 370 | 78.67 | 62 | – | – |
| Easy Turn (proposed) | 850 | 263 | 2559 | 96.33 | 97.67 | 91 | 98 |

Examples

We present several examples of Easy Turn applications in spoken dialogue systems. Content inside angle brackets indicates the dialogue turn state detected by Easy Turn, while text in parentheses describes the action the system should take based on that state.
To evaluate its turn-taking detection performance in practice, we deploy Easy Turn in our laboratory spoken dialogue system, OSUM-EChat, where human users interact with the system through microphone input. Easy Turn performs effectively, accurately identifying dialogue turn states and enabling the system to respond appropriately. For a live demonstration, see our Demo Page.

Environment

Follow the steps below to clone the repository and install the environment.

Data

This project supports the following data formats: raw and shard. Raw data is stored in jsonl format, one JSON object per line, with the following fields. Shard data is packed into tar files, storing multiple entries together for efficient bulk loading.

Training

Set `stage=0` and `stop_stage=0` for model training. After training, set `stage=1` and `stop_stage=1` for model merging. See the shell script for details.

Inference

Please first download the Easy Turn checkpoint at Easy Turn.

Citation

Please cite our paper if you find this work useful:
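The four turn states map naturally onto system behaviors in a full-duplex loop. The sketch below illustrates one such mapping; the state names come from the model card, while the action names and dispatcher are illustrative, not part of the released code.

```python
# Hypothetical dispatcher over Easy Turn's four predicted turn states.
ACTIONS = {
    "complete": "respond",           # utterance semantically complete: speak
    "incomplete": "keep_listening",  # user not finished: stay silent
    "backchannel": "continue",       # brief feedback ("uh-huh"): keep speaking
    "wait": "pause_or_end",          # user asked to pause or end the dialogue
}

def act_on_turn_state(state: str) -> str:
    """Map a detected turn state to a system action."""
    try:
        return ACTIONS[state]
    except KeyError:
        raise ValueError(f"unknown turn state: {state!r}") from None

print(act_on_turn_state("complete"))
print(act_on_turn_state("backchannel"))
```

In a real deployment the detector runs continuously on the microphone stream, and the dispatcher gates the TTS output of the dialogue system.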

license:apache-2.0
15
11

OSUM-EChat

license:apache-2.0
10
3

DiffRhythm-vae

0
41

LLaSE-G1

license:apache-2.0
0
25

OSUM

license:apache-2.0
0
11

WSYue-ASR

license:apache-2.0
0
8

WSChuan-ASR

ASR Leaderboard

Results are error rates (%); lower is better.

| Model | Model Size | WSC-Eval-ASR Easy | WSC-Eval-ASR Hard | WSC-Eval-ASR Total | Magicdata Conversation | Magicdata Daily-Use | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- |
| **with LLM** | | | | | | | |
| Kimi-Audio | 7B | 16.65 | 28.66 | 17.66 | 24.67 | 5.77 | 18.68 |
| FireRedASR-LLM | 8.3B | 12.80 | 25.27 | 14.40 | 17.68 | 6.69 | 15.37 |
| Qwen2.5-omni | 3B | 16.94 | 26.01 | 18.20 | 20.40 | 6.32 | 17.69 |
| Qwen2.5-omni-WSC-Finetune⭐ | 3B | 14.36 | 24.14 | 15.61 | 18.45 | 6.15 | 15.74 |
| Qwen2.5-omni+internal data⭐ | 3B | 13.17 | 23.36 | 14.81 | 18.50 | 5.88 | 15.14 |
| Qwen2.5-omni-WSC-Finetune + internal data⭐ | 3B | 12.93 | 23.19 | 14.25 | 17.95 | 5.89 | 14.84 |
| **without LLM** | | | | | | | |
| SenseVoice-small | 234M | 17.43 | 28.38 | 18.39 | 23.50 | 8.77 | 19.29 |
| Whisper | 244M | 52.06 | 63.99 | 53.59 | 55.88 | 52.03 | 55.51 |
| FireRedASR-AED | 1.1B | 13.29 | 23.64 | 14.62 | 17.84 | 6.69 | 15.14 |
| Paraformer | 220M | 14.34 | 24.61 | 15.66 | 19.81 | 8.16 | 16.52 |
| Paraformer-WSC-Finetune⭐ | 220M | 12.15 | 22.60 | 13.51 | 16.60 | 8.02 | 14.58 |
| Paraformer + internal data⭐ | 220M | 11.93 | 21.82 | 13.14 | 15.61 | 6.77 | 13.85 |
| Paraformer-WSC-Finetune + internal data⭐ | 220M | 11.59 | 21.59 | 12.87 | 14.59 | 6.28 | 13.38 |
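The Avg. column is the unweighted mean of the five per-testset error rates, rounded to two decimals. A minimal check against the table, using the best row (Paraformer-WSC-Finetune + internal data):

```python
def leaderboard_average(scores):
    """Unweighted mean of per-testset error rates, rounded as in the table."""
    return round(sum(scores) / len(scores), 2)

# Easy, Hard, Total, Magicdata Conversation, Magicdata Daily-Use
row = [11.59, 21.59, 12.87, 14.59, 6.28]
print(leaderboard_average(row))  # 13.38, matching the table's Avg. column
```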

license:apache-2.0
0
5

WSYue-TTS

👉🏻 WenetSpeech-Yue 👈🏻

WenetSpeech-Yue: Demos | Paper | Github | HuggingFace

📢 News and Updates

- 2025.11.15 🚀 Released Llasa-1B-Yue-Updated! You can access the complete inference pipeline and scripts on WenetSpeech-Yue.

WenetSpeech-Yue TTS models have been released! This repository contains two versions of the TTS models:

1. ASLP-lab/Cosyvoice2-Yue: the base model for Cantonese TTS.
2. ASLP-lab/Cosyvoice2-Yue-ZoengJyutGaai: a fine-tuned, higher-quality version for more natural speech generation.

Contact

If you are interested in leaving a message to our research team, feel free to email [email protected] or [email protected].

license:apache-2.0
0
4

Emotion2Vec-S

license:apache-2.0
0
3

OSUM-Pangu

license:apache-2.0
0
2

LLaSA_Plus

license:apache-2.0
0
2

Cosyvoice2-Yue

👉🏻 WenetSpeech-Yue 👈🏻

WenetSpeech-Yue: Demos | Paper | Github | HuggingFace

WenetSpeech-Yue TTS models have been released! This repository contains two versions of the TTS models:

1. ASLP-lab/Cosyvoice2-Yue: the base model for Cantonese TTS.
2. ASLP-lab/Cosyvoice2-Yue-ZoengJyutGaai: a fine-tuned, higher-quality version for more natural speech generation.

Installation

- Install Conda: please see https://docs.conda.io/en/latest/miniconda.html
- Create a Conda env.

We strongly recommend using `CosyVoice2-0.5B` for better performance. Follow the code below for detailed usage of each model.

Contact

If you are interested in leaving a message to our research team, feel free to email [email protected] or [email protected].

license:apache-2.0
0
2

I-OSUM-Pangu

0
1

VoiceSculptor

0
1

WSChuan-TTS

The snippet below restores the underscores that were stripped from the original identifiers (standard CosyVoice 2 API names such as `load_wav` and `inference_instruct2`); `text` must be set to the Sichuanese sentence you want to synthesize.

```python
import sys
sys.path.append('third_party/Matcha-TTS')

import torchaudio
from cosyvoice.cli.cosyvoice import CosyVoice2
from cosyvoice.utils.file_utils import load_wav

# Load the Sichuanese base model (checkpoint directory from this repo).
cosyvoice_base = CosyVoice2(
    'pretrained_models/Cosyvoice2-Chuan',
    load_jit=False,
    load_trt=False,
    load_vllm=False,
    fp16=False,
)

# Reference audio for timbre, resampled to 16 kHz by load_wav.
prompt_speech_16k = load_wav('asset/sg017090.wav', 16000)

text = '...'  # replace with the Sichuanese text to synthesize

# '用四川话说这句话' instructs the model to speak the text in Sichuanese.
for i, j in enumerate(cosyvoice_base.inference_instruct2(
        text, '用四川话说这句话', prompt_speech_16k, stream=False)):
    torchaudio.save('base_{}.wav'.format(i), j['tts_speech'],
                    cosyvoice_base.sample_rate)
```

Contact: If you are interested in leaving a message to our research team, feel free to email [email protected].

license:apache-2.0
0
1