AudioX-MAF
33
license:cc-by-nc-4.0
by
HKUSTAudio
Audio Model
OTHER
New
33 downloads
Early-stage
Edge AI:
Mobile
Laptop
Server
Unknown
Mobile
Laptop
Server
Quick Summary
AI model with specialized capabilities.
Code Examples
Load pretrained modelpythonpytorch
import torch
import torchaudio
from einops import rearrange
from audiox import get_pretrained_model
from audiox.inference.generation import generate_diffusion_cond
from audiox.data.utils import read_video, merge_video_audio, load_and_process_audio, encode_video_with_synchformer
import os
device = "cuda" if torch.cuda.is_available() else "cpu"
# Load pretrained model
# Choose one: "HKUSTAudio/AudioX", "HKUSTAudio/AudioX-MAF", or "HKUSTAudio/AudioX-MAF-MMDiT"
model_name = "HKUSTAudio/AudioX"
model, model_config = get_pretrained_model(model_name)
sample_rate = model_config["sample_rate"]
sample_size = model_config["sample_size"]
target_fps = model_config["video_fps"]
seconds_start = 0
seconds_total = 10
model = model.to(device)
# Example: Video-to-Music generation
video_path = "example/V2M_sample-1.mp4"
text_prompt = "Generate music for the video"
audio_path = None
# Prepare inputs
video_tensor = read_video(video_path, seek_time=seconds_start, duration=seconds_total, target_fps=target_fps)
if audio_path:
audio_tensor = load_and_process_audio(audio_path, sample_rate, seconds_start, seconds_total)
else:
# Use zero tensor when no audio is provided
audio_tensor = torch.zeros((2, int(sample_rate * seconds_total)))
# For AudioX-MAF and AudioX-MAF-MMDiT: encode video with synchformer
video_sync_frames = None
if "MAF" in model_name:
video_sync_frames = encode_video_with_synchformer(
video_path, model_name, seconds_start, seconds_total, device
)
# Create conditioning
conditioning = [{
"video_prompt": {"video_tensors": video_tensor.unsqueeze(0), "video_sync_frames": video_sync_frames},
"text_prompt": text_prompt,
"audio_prompt": audio_tensor.unsqueeze(0),
"seconds_start": seconds_start,
"seconds_total": seconds_total
}]
# Generate audio
output = generate_diffusion_cond(
model,
steps=250,
cfg_scale=7,
conditioning=conditioning,
sample_size=sample_size,
sigma_min=0.3,
sigma_max=500,
sampler_type="dpmpp-3m-sde",
device=device
)
# Post-process audio
output = rearrange(output, "b d n -> d (b n)")
output = output.to(torch.float32).div(torch.max(torch.abs(output))).clamp(-1, 1).mul(32767).to(torch.int16).cpu()
torchaudio.save("output.wav", output, sample_rate)Citationbibtex
@article{tian2025audiox,
title={AudioX: Diffusion Transformer for Anything-to-Audio Generation},
author={Tian, Zeyue and Jin, Yizhu and Liu, Zhaoyang and Yuan, Ruibin and Tan, Xu and Chen, Qifeng and Xue, Wei and Guo, Yike},
journal={arXiv preprint arXiv:2503.10522},
year={2025}
}Deploy This Model
Production-ready deployment in minutes
Together.ai
Instant API access to this model
Production-ready inference API. Start free, scale to millions.
Try Free APIReplicate
One-click model deployment
Run models in the cloud with simple API. No DevOps required.
Deploy NowDisclosure: We may earn a commission from these partners. This helps keep LLMYourWay free.