UniAnimate-DiT: Human Image Animation with Large-Scale Video Diffusion Transformer
This repo contains checkpoints for UniAnimate-DiT. The model is described in the papers UniAnimate-DiT: Human Image Animation with Large-Scale Video Diffusion Transformer and Replace Anyone in Videos.
- UniAnimate-Wan2.1-14B-Lora-12000.ckpt: the LoRA weights and additional learnable modules after 12,000 training steps.
- dw-ll_ucoco_384.onnx: the DWPose model used for pose extraction.
UniAnimate-DiT builds on the state-of-the-art DiT-based Wan2.1-14B-I2V model for consistent human image animation. Wan2.1 is a collection of video synthesis models open-sourced by Alibaba. Our code is based on DiffSynth-Studio; thanks to the authors for this nice open-source project.
Before using this model, please create a conda environment and install DiffSynth-Studio from source.
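A minimal setup sketch, assuming Python 3.10 and the official DiffSynth-Studio repository (the environment name is arbitrary):

```bash
# Create and activate a conda environment (Python version is an assumption)
conda create -n unianimate-dit python=3.10 -y
conda activate unianimate-dit

# Install DiffSynth-Studio from source
git clone https://github.com/modelscope/DiffSynth-Studio.git
cd DiffSynth-Studio
pip install -e .
```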
UniAnimate-DiT supports multiple attention implementations. If more than one of the following is installed, the one with the highest priority is used:
- Flash Attention 3
- Flash Attention 2
- Sage Attention
- torch SDPA (default; `torch>=2.5.0` is recommended)
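If you want one of the optional backends, typical installation commands look like the sketch below (these packages are not required by this repo; version choices are up to you):

```bash
# Flash Attention 2 (Flash Attention 3 currently has to be built from source)
pip install flash-attn --no-build-isolation

# Sage Attention
pip install sageattention

# torch SDPA is built into PyTorch; torch>=2.5.0 is recommended
pip install "torch>=2.5.0"
```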
Download Wan2.1-14B-I2V-720P models using huggingface-cli:
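For example (the local directory under `./checkpoints/` is an assumption; adjust it to your layout):

```bash
pip install "huggingface_hub[cli]"
huggingface-cli download Wan-AI/Wan2.1-I2V-14B-720P --local-dir ./checkpoints/Wan2.1-I2V-14B-720P
```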
Or download Wan2.1-14B-I2V-720P models using modelscope-cli:
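For example (same assumption about the local directory):

```bash
pip install modelscope
modelscope download Wan-AI/Wan2.1-I2V-14B-720P --local_dir ./checkpoints/Wan2.1-I2V-14B-720P
```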
Download the pretrained UniAnimate-DiT models (these only include the LoRA weights and additional learnable modules):
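For example, with huggingface-cli (the repo id below is a placeholder for this model page; the target directory is an assumption):

```bash
huggingface-cli download <this-model-repo-id> --local-dir ./checkpoints/
```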
Finally, the model weights will be organized in `./checkpoints/` as follows:
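An assumed layout, using the file names listed above (the exact structure may differ slightly):

```
./checkpoints/
├── Wan2.1-I2V-14B-720P/                    # base Wan2.1-14B-I2V model
├── UniAnimate-Wan2.1-14B-Lora-12000.ckpt   # UniAnimate-DiT LoRA + additional learnable modules
└── dw-ll_ucoco_384.onnx                    # DWPose model for pose extraction
```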
Rescale the target pose sequence to match the pose of the reference image (you can also run `pip install onnxruntime-gpu==1.18.1` for faster pose extraction on GPU):
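A hedged example invocation; the script name `run_align_pose.py` and the paths are assumptions, so substitute the pose-alignment script and data shipped with this repo:

```bash
python run_align_pose.py \
    --ref_name ./data/images/demo_reference.jpg \
    --source_video_paths ./data/videos/source_video.mp4 \
    --saved_pose_dir ./data/saved_pose/demo_reference
```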
The processed target poses for the demo videos will be saved to the directory given by `--saved_pose_dir`. Here `--ref_name` is the path of the reference image, `--source_video_paths` specifies the source pose videos, and `--saved_pose_dir` is the output directory for the processed target poses.
(4) Run UniAnimate-Wan2.1-14B-I2V to generate 480P videos
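A sketch of the launch command; the script name below is a placeholder, so use the 480P inference script provided in this repo:

```bash
# Placeholder script name; CUDA_VISIBLE_DEVICES just selects the GPU
CUDA_VISIBLE_DEVICES=0 python run_unianimate_wan_480p_inference.py
```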
About 23 GB of GPU memory is needed. After this, 81-frame video clips at 832x480 (height x width) resolution will be generated under the `./outputs` folder:
For long video generation, run the following command:
(5) Run UniAnimate-Wan2.1-14B-I2V to generate 720P videos
About 36 GB of GPU memory is needed. After this, 81-frame video clips at 1280x720 resolution will be generated:
Note: even though our model was trained at 832x480 resolution, we observed that direct inference at 1280x720 usually works and produces satisfactory results.
For long video generation, run the following command:
We support training UniAnimate-DiT on your own dataset.
To speed up training, we preprocess the videos in advance: video frames and the corresponding DWPose results are extracted and packaged with the `pickle` package. You need to organize the training data as follows:
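An assumed organization, with one pickle file per training video containing its frames and DWPose results as described above (exact naming may differ):

```
./data/train/
├── video_00001.pkl   # extracted frames + DWPose for one video
├── video_00002.pkl
└── ...
```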
We encourage finetuning with large amounts of data for better results. Our experiments show that about 1,000 training videos are enough to finetune a good human image animation model.
For convenience, we do not pre-extract VAE features; instead, VAE encoding and DiT training are performed in a single training script, which also makes data augmentation easier and improves performance. Alternatively, you can extract the VAE features first and then run the subsequent DiT training.
You can also finetune our trained model by setting `--pretrained_lora_path="./checkpoints/UniAnimate-Wan2.1-14B-Lora.ckpt"`.
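For example (the training entry point below is a placeholder for the training script you already use; only the extra flag comes from this repo):

```bash
# Append the flag to your existing training command (script name is a placeholder)
python train_unianimate_dit.py \
    --pretrained_lora_path="./checkpoints/UniAnimate-Wan2.1-14B-Lora.ckpt"
```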
To test a LoRA model finetuned on multiple GPUs with DeepSpeed, first run `python zero_to_fp32.py . output_dir/ --safe_serialization` to convert the .pt checkpoint files to .safetensors files, and then run:
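A hedged sketch of the two steps; the checkpoint directory is illustrative, and the final step is whatever inference command you used in the 480P/720P steps above:

```bash
# 1. Run DeepSpeed's zero_to_fp32.py inside the checkpoint directory to merge
#    the sharded ZeRO checkpoint and save it in .safetensors format
cd <deepspeed_checkpoint_dir>      # illustrative placeholder
python zero_to_fp32.py . output_dir/ --safe_serialization

# 2. Then run the inference/test script as in the 480P or 720P steps above
```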
If you find this codebase useful for your research, please cite the following paper:
This project is intended for academic research, and we explicitly disclaim any responsibility for user-generated content. Users are solely liable for their actions while using the generative model. The project contributors have no legal affiliation with, nor accountability for, users' behaviors. It is imperative to use the generative model responsibly, adhering to both ethical and legal standards.