Ovi: Twin Backbone Cross-Modal Fusion for Audio-Video Generation
Chetwin Low¹, Weimin Wang¹†, Calder Katyal²
Equal contribution, † Project Lead
¹ Character AI, ² Yale University
Ovi 1.1: 10-second temporally consistent video generation (960 × 960 resolution)
## Ovi 1.1 Update (10 November 2025)

- Key Feature: enables temporally consistent 10-second video generation at 960 × 960 resolution
- Training Improvements:
  - Trained natively on 960 × 960 resolution videos
  - Dataset includes 100% more videos for greater diversity
- Prompt Format Update:
  - Audio descriptions should now be written as an `Audio:` caption at the end of the prompt (see the prompt format section below)
## Key Features

Ovi is a veo-3-like video + audio generation model that simultaneously generates both video and audio content from text or text + image inputs.

- Video + Audio Generation: generates synchronized video and audio content simultaneously
- High-Quality Audio Branch: we designed and pretrained our 5B audio branch from scratch on our high-quality in-house audio datasets
- Flexible Input: supports text-only or text + image conditioning
- 10-second (or 5-second) Videos: generates 10-second or 5-second videos at 24 FPS and 960×960 resolution, at various aspect ratios (9:16, 16:9, 1:1, etc.)
- ComfyUI Integration: ComfyUI support is now available via ComfyUI-WanVideoWrapper (see the related PR)
- Create videos now on wavespeed.ai: https://wavespeed.ai/models/character-ai/ovi/image-to-video & https://wavespeed.ai/models/character-ai/ovi/text-to-video
- Create videos now on HuggingFace: https://huggingface.co/spaces/akhaliq/Ovi
Click the ▶ button on any video to view it full screen.
- [x] Release research paper and website for demos
- [x] Checkpoint of 11B model
- [x] Inference code
- [x] Text or text + image as input
- [x] Gradio application code
- [x] Multi-GPU inference with or without sequence-parallel support
- [x] fp8 weights and improved memory efficiency (credits to @rkfg)
- [x] qint8 quantization thanks to @gluttony-10
- [ ] Improve efficiency of the sequence-parallel implementation
- [ ] Implement sharded inference with FSDP
- [x] Video creation example prompts and format
- [x] Finetune model with higher-resolution data, and RL for performance improvement
- [x] Longer video generation (10s)
- [ ] Reference voice conditioning
- [ ] Distilled model for faster inference
- [ ] Training scripts
We provide example prompts to help you get started with Ovi:

- Text-to-Audio-Video (T2AV), 10s: `example_prompts/gpt_examples_t2v.csv`
- Image-to-Audio-Video (I2AV), 10s: `example_prompts/gpt_examples_i2v.csv`
- Text-to-Audio-Video (T2AV): `example_prompts/gpt_examples_t2v.csv`
- Image-to-Audio-Video (I2AV): `example_prompts/gpt_examples_i2v.csv`
Our prompts use special tags to control speech and audio:

- Speech: text wrapped in the speech tags (`Your speech content here`) will be converted to speech.
- Audio Description: an `Audio: YOUR AUDIO DESCRIPTION` caption describes the audio or sound effects present in the video; it goes at the end of the prompt.
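As a minimal illustration of the audio-caption convention above (the `append_audio_caption` helper is a hypothetical sketch for this README, not part of the Ovi codebase):

```python
def append_audio_caption(prompt: str, audio_description: str) -> str:
    """Append an `Audio:` caption to the end of a prompt, following the
    convention that the audio description comes last. Hypothetical helper."""
    return f"{prompt.rstrip()} Audio: {audio_description.strip()}"

prompt = append_audio_caption(
    "A barista steams milk behind a sunlit counter.",
    "espresso machine hissing, soft cafe chatter",
)
print(prompt)
# A barista steams milk behind a sunlit counter. Audio: espresso machine hissing, soft cafe chatter
```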
## Alternative Flash Attention Installation (Optional)

If the `flash-attn` installation above fails, you can try the Flash Attention 3 method:
## Download Weights

To download our main Ovi checkpoint, as well as the T5 text encoder and VAE decoder from Wan, and the audio VAE from MMAudio:
Ovi's behavior and output can be customized by modifying the `ovi/configs/inference/inference_fusion.yaml` configuration file. The following parameters control generation quality, video resolution, and how text, image, and audio inputs are balanced:
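To give a feel for what such a file might contain, here is an illustrative sketch; the key names below are hypothetical and may not match the shipped configuration file, so consult it for the real schema:

```yaml
# Hypothetical illustration of an inference config; actual key names may differ.
ckpt_dir: ./ckpts                  # where downloaded weights live
num_steps: 50                      # denoising steps; more steps = higher quality, slower
video_guidance_scale: 4.0          # how strongly video follows the text prompt
audio_guidance_scale: 3.0          # how strongly audio follows the audio caption
video_frame_height_width: [960, 960]  # output resolution
text_prompt: "A chef narrates while chopping onions. Audio: knife on cutting board"
```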
Use this for single-GPU setups. The `text_prompt` can be a single string or a path to a CSV file.
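The string-or-CSV behavior can be pictured with a small helper. This `load_prompts` function is an illustrative sketch, not the repository's actual loader, and it assumes one prompt per row in the CSV's first column:

```python
import csv
import os

def load_prompts(text_prompt: str) -> list[str]:
    """Return a list of prompts: either the single string given, or one
    prompt per row if the argument is a path to an existing CSV file.
    Illustrative sketch only."""
    if os.path.isfile(text_prompt) and text_prompt.endswith(".csv"):
        with open(text_prompt, newline="") as f:
            return [row[0] for row in csv.reader(f) if row]
    return [text_prompt]

# A plain string is treated as a single prompt.
prompts = load_prompts("A single prompt string")
```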
Use this to run samples in parallel across multiple GPUs for faster processing.
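One common way to split a batch of samples across GPUs is round-robin sharding by rank, sketched below; this is an illustration of the general idea, not necessarily how the repository's multi-GPU dispatch works:

```python
def shard_for_rank(samples: list, rank: int, world_size: int) -> list:
    """Give each GPU rank a disjoint slice of the samples.
    Rank r processes samples[r], samples[r + world_size], ...,
    so the work is spread evenly with no overlap between ranks."""
    return samples[rank::world_size]

# Example: 5 prompts across 2 GPUs
prompts = ["p0", "p1", "p2", "p3", "p4"]
print(shard_for_rank(prompts, 0, 2))  # ['p0', 'p2', 'p4']
print(shard_for_rank(prompts, 1, 2))  # ['p1', 'p3']
```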
## Memory & Performance Requirements

Below are approximate GPU memory requirements for different configurations. The sequence-parallel implementation will be optimized in the future. All end-to-end times are measured on a 121-frame, 720×720 video with 50 denoising steps. The minimum GPU VRAM required to run our model is 32 GB; fp8 parameters are currently supported, reducing peak VRAM usage to 24 GB with slight quality degradation.
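The fp8 saving follows directly from parameter width: an 11B-parameter model needs roughly 22 GB for the weights alone at 2 bytes/param (fp16/bf16) versus roughly 11 GB at 1 byte/param (fp8), before activations and other buffers. A back-of-envelope sketch (illustrative arithmetic only, not a measured profile):

```python
def weight_memory_gb(num_params: float, bytes_per_param: int) -> float:
    """Rough memory footprint of model weights alone, in GB (1 GB = 1e9 bytes).
    Peak VRAM in practice is higher: activations, VAEs, and attention
    buffers add on top of this."""
    return num_params * bytes_per_param / 1e9

# 11B-parameter fusion model
print(weight_memory_gb(11e9, 2))  # fp16/bf16 weights: 22.0 GB
print(weight_memory_gb(11e9, 1))  # fp8 weights: 11.0 GB
```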
| Sequence Parallel Size | FlashAttention-3 Enabled | CPU Offload | With Image Gen Model | Peak VRAM Required | End-to-End Time |
|------------------------|--------------------------|-------------|----------------------|--------------------|-----------------|
| 1 | Yes | No  | No  | ~80 GB | ~83s  |
| 1 | No  | No  | No  | ~80 GB | ~96s  |
| 1 | Yes | Yes | No  | ~80 GB | ~105s |
| 1 | No  | Yes | No  | ~32 GB | ~118s |
| 1 | Yes | Yes | Yes | ~32 GB | ~140s |
| 4 | Yes | No  | No  | ~80 GB | ~55s  |
| 8 | Yes | No  | No  | ~80 GB | ~40s  |

## Gradio

We provide a simple script to run our model in a Gradio UI. It uses the `ckpt_dir` in `ovi/configs/inference/inference_fusion.yaml` to initialize the model.
- Wan2.2: our video branch is initialized from the Wan2.2 repository.
- MMAudio: we reuse MMAudio's audio VAE.
We welcome all types of collaboration! Whether you have feedback, want to contribute, or have any questions, please feel free to reach out.
If you find this project useful for your research, please consider citing our paper.