# SAO-Instruct: Free-form Audio Editing using Natural Language Instructions

Paper | Sample Page | Code

SAO-Instruct is a model based on Stable Audio Open that can edit audio clips using any free-form natural language instruction. To train our model, we create a dataset of audio editing triplets (input audio, edit instruction, output audio) using Prompt-to-Prompt, DDPM inversion, and a manual editing pipeline. Although partially trained on synthetic data, our model generalizes well to real in-the-wild audio clips and unseen edit instructions.

## Inference

To get started, clone the repository and install the dependencies. Then run the inference script with the SAO-Instruct weights from 🤗 Hugging Face. When `encode_audio` is set to `True`, the provided audio is encoded into the latent space and used as the starting point for generation. You can control the amount of noise added to the encoded audio with the `encoded_audio_noise` parameter. Experiment with different configurations to achieve optimal results.

## Data Generation

The files required to generate audio editing triplets are in the `dataset/` folder.

### Prompt Generation

The script `generate_prompts.py` performs prompt generation. It accepts a `.jsonl` file as input, which can be created with the `prepare_captions.py` script for AudioCaps, WavCaps, and AudioSetSL. If you download audio clips from captioning datasets (e.g., to use DDPM inversion for paired sample generation), the `metadata` field can be used to match records to their specific filenames. The script outputs a `.jsonl` file of processed prompts containing the input caption, edit instruction, and output caption.

### Paired Sample Generation

#### Prompt-to-Prompt

After generating prompts, you can use Prompt-to-Prompt to generate a synthetic dataset of edited audio pairs.
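The prompt files described in the Prompt Generation step above are `.jsonl`: one JSON object per line. A minimal sketch of writing and reading such records, using illustrative field names (`input_caption`, `edit_instruction`, `output_caption`, `metadata`) based on the surrounding description rather than the repository's exact schema:

```python
import json

# Illustrative editing-prompt records; the exact field names used by
# generate_prompts.py may differ, so treat these as assumptions.
records = [
    {
        "input_caption": "A dog barks in the distance",
        "edit_instruction": "Add heavy rain in the background",
        "output_caption": "A dog barks in the distance while heavy rain falls",
        "metadata": {"filename": "audiocaps_12345.wav"},
    }
]

# Write one JSON object per line (.jsonl).
with open("prompts.jsonl", "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

# Read the file back the same way a downstream script might.
with open("prompts.jsonl") as f:
    loaded = [json.loads(line) for line in f]

print(loaded[0]["edit_instruction"])
```

The `metadata` field here carries a source filename so that, as described above, downloaded clips can be matched to their records when generating DDPM-inversion pairs.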
The Prompt-to-Prompt pipeline consists of two parts:

- Candidate Search: searching for ideal candidates (CFG scale, seed) for all prompts in the prompt file.
- Sample Generation: generating the edited audio pairs using the candidates found in the previous step.

Use the script `generate_candidates.py` for the candidate search, and `generate_samples.py` (with the mode `p2p`) for Prompt-to-Prompt sample generation. We include the source code of Stable Audio Open with the adaptations made for Prompt-to-Prompt in `audio_generation/p2p/stable-audio-tools` (particularly in `audio_generation/p2p/stable-audio-tools/models/transformer.py`); install its requirements before running. Make sure that the `k-diffusion` package is configured to use the same starting noise by adapting the function `sample_dpmpp_3m_sde` in the `k_diffusion/sampling.py` file accordingly.

#### DDPM Inversion

The script `generate_samples.py` can also create samples using DDPM inversion (use the mode `edit`). We follow the implementation from the paper Zero-Shot Unsupervised and Text-Based Audio Editing Using DDPM Inversion; clone that repository and install its dependencies first.

#### Manual Edits

To generate manual edits, use the script `manual_edits/generate_manual_samples.py`.

## Fine-tuning Stable Audio Open

We provide training and data loading scripts to enable fine-tuning on audio editing triplets:

- `model/stable-audio-tools/train_edit.py` - modified training script for audio editing tasks
- `model/stable-audio-tools/stable_audio_tools/data/dataset_edit.py` - custom dataset loader for editing triplets
- `model/stable-audio-tools/stable_audio_tools/configs` - configuration files for both the model and dataset

Otherwise, follow the official recommendations from Stable Audio Open to fine-tune the model.

## Attribution and License

This repository builds upon Stable Audio Open, a model developed by Stability AI.
It uses checkpoints and components from `stabilityai/stable-audio-open-1.0` that are licensed under the Stability AI Community License. Please see the NOTICE file for required attribution.

Powered by Stability AI.

This repository and its contents are released for academic research and non-commercial use only.