ali-vilab

11 models

text-to-video-ms-1.7b

This model is a multi-stage text-to-video generation diffusion model: it takes a description text as input and returns a video that matches the description. Only English input is supported.

We Are Hiring! (Based in Beijing / Hangzhou, China.) If you're looking for an exciting challenge and the opportunity to work with cutting-edge technologies in AIGC and large-scale pretraining, then we are the place for you. We are looking for talented, motivated and creative individuals to join our team. If you are interested, please send your CV to us.

The model consists of three sub-networks: a text feature extraction model, a text-feature-to-video latent space diffusion model, and a video latent space to video visual space model, with about 1.7 billion parameters overall. The diffusion model adopts a UNet3D structure and generates video through an iterative denoising process starting from pure Gaussian noise. This model is meant for research purposes. Please look at the model limitations and biases and the misuse, malicious use and excessive use sections.

- Developed by: ModelScope
- Model type: Diffusion-based text-to-video generation model
- Language(s): English
- License: CC-BY-NC-ND
- Resources for more information: ModelScope GitHub Repository, Summary.
- Cite as:

This model has a wide range of applications and can generate videos from arbitrary English text descriptions. You can optimize memory usage by enabling attention and VAE slicing and using Torch 2.0; this should allow you to generate videos of up to 25 seconds on less than 16GB of GPU VRAM. The example code displays the save path of the output video. The output mp4 file can be played with VLC media player; some other media players may not play it correctly.
Model limitations and biases:

- The model is trained on public datasets such as WebVid, so generated results may reflect biases in the training-data distribution.
- The model cannot achieve perfect film and television quality generation.
- The model cannot generate clear text.
- The model is trained mainly on an English corpus and does not support other languages at the moment.
- Performance on complex compositional generation tasks needs improvement.

Misuse, malicious use and excessive use:

- The model was not trained to realistically represent people or events, so using it to generate such content is beyond the model's capabilities.
- It is prohibited to generate content that is demeaning or harmful to people or their environment, culture, religion, etc.
- It is prohibited to generate pornographic, violent or gory content.
- It is prohibited to generate erroneous or false information.

The training data includes LAION5B, ImageNet, WebVid and other public datasets. Image and video filtering is performed after pre-training, such as by aesthetic score, watermark score, and deduplication.

license:cc-by-nc-4.0
36,790
643

VACE-Wan2.1-1.3B-Preview

Zeyinzi Jiang · Zhen Han · Chaojie Mao † · Jingfeng Zhang · Yulin Pan · Yu Liu

Introduction

VACE is an all-in-one model designed for video creation and editing. It encompasses various tasks, including reference-to-video generation (R2V), video-to-video editing (V2V), and masked video-to-video editing (MV2V), allowing users to compose these tasks freely. This functionality enables users to explore diverse possibilities and streamlines their workflows effectively, offering a range of capabilities such as Move-Anything, Swap-Anything, Reference-Anything, Expand-Anything, Animate-Anything, and more.

🎉 News

- [x] Mar 31, 2025: 🔥VACE-Wan2.1-1.3B-Preview and VACE-LTX-Video-0.9 models are now available at HuggingFace and ModelScope!
- [x] Mar 31, 2025: 🔥Release code of model inference, preprocessing, and gradio demos.
- [x] Mar 11, 2025: We propose VACE, an all-in-one model for video creation and editing.

🪄 Models

| Models                   | Download Link                 | Video Size        | License    |
|--------------------------|-------------------------------|-------------------|------------|
| VACE-Wan2.1-1.3B-Preview | Huggingface 🤗 ModelScope 🤖 | ~ 81 x 480 x 832  | Apache-2.0 |
| VACE-Wan2.1-1.3B         | To be released                | ~ 81 x 480 x 832  | Apache-2.0 |
| VACE-Wan2.1-14B          | To be released                | ~ 81 x 720 x 1080 | Apache-2.0 |
| VACE-LTX-Video-0.9       | Huggingface 🤗 ModelScope 🤖 | ~ 97 x 512 x 768  | RAIL-M     |

- The input supports any resolution, but to achieve optimal results, the video size should fall within a specific range.
- All models inherit the license of the original model.

⚙️ Installation

The codebase was tested with Python 3.10.13, CUDA 12.4, and PyTorch >= 2.5.1.
Setup for Model Inference

You can set up VACE model inference by running the setup script. Please download your preferred base model to `/models/`.

Setup for Preprocess Tools

If you need the preprocessing tools, please install them.

Local Directories Setup

It is recommended to download VACE-Benchmark to `/benchmarks/`, matching the examples in `run_vace_xxx.sh`.

🚀 Usage

In VACE, users can input a text prompt and an optional video, mask, and image for video generation or editing. Detailed instructions for using VACE can be found in the User Guide.

Inference CLI

1) End-to-End Running

To simply run VACE without diving into any implementation details, we suggest an end-to-end pipeline. This script runs video preprocessing and model inference sequentially, and you need to specify all the required args for preprocessing (`--task`, `--mode`, `--bbox`, `--video`, etc.) and for inference (`--prompt`, etc.). The output video, together with the intermediate video, mask, and images, is saved into `./results/` by default.

> 💡Note:
> Please refer to run_vace_pipeline.sh for usage examples of different task pipelines.

2) Preprocessing

For more flexible control over the input, user inputs need to be preprocessed into `src_video`, `src_mask`, and `src_ref_images` before VACE model inference. We assign each preprocessor a task name, so simply call `vace_preprocess.py` and specify the task name and task params. The outputs are saved to `./processed/` by default.

> 💡Note:
> Please refer to run_vace_pipeline.sh for the preprocessing methods of different tasks. Moreover, refer to vace/configs/ for all the pre-defined tasks and required params. You can also customize preprocessors by implementing them at `annotators` and registering them at `configs`.

3) Model Inference

Using the input data obtained from preprocessing, model inference can be performed as follows. The output video, together with the intermediate video, mask, and images, is saved into `./results/` by default.
> 💡Note:
> (1) Please refer to vace/vace_wan_inference.py and vace/vace_ltx_inference.py for the inference args.
> (2) For LTX-Video and English-language Wan2.1 users, you need prompt extension to unlock the full model performance. Please follow the instructions for Wan2.1 and set `--use_prompt_extend` while running inference.

We are grateful for the following awesome projects: Scepter, Wan, and LTX-Video.

```bibtex
@article{vace,
  title   = {VACE: All-in-One Video Creation and Editing},
  author  = {Jiang, Zeyinzi and Han, Zhen and Mao, Chaojie and Zhang, Jingfeng and Pan, Yulin and Liu, Yu},
  journal = {arXiv preprint arXiv:2503.07598},
  year    = {2025}
}
```

license:apache-2.0
3,578
125

In-Context-LoRA

license:mit
2,284
630

i2vgen-xl

VGen is an open-source video synthesis codebase developed by the Tongyi Lab of Alibaba Group, featuring state-of-the-art video generative models. This repository includes implementations of the following methods:

- I2VGen-xl: High-quality image-to-video synthesis via cascaded diffusion models
- VideoComposer: Compositional Video Synthesis with Motion Controllability
- Hierarchical Spatio-temporal Decoupling for Text-to-Video Generation
- [A Recipe for Scaling up Text-to-Video Generation with Text-free Videos]()
- [InstructVideo: Instructing Video Diffusion Models with Human Feedback]()
- DreamVideo: Composing Your Dream Videos with Customized Subject and Motion
- VideoLCM: Video Latent Consistency Model
- ModelScope text-to-video technical report

VGen can produce high-quality videos from input text, images, desired motion, desired subjects, and even provided feedback signals. It also offers a variety of commonly used video generation tools, such as visualization, sampling, training, inference, joint training using images and videos, acceleration, and more.

🔥News!!!

- [2023.12] We release the high-efficiency video generation method VideoLCM
- [2023.12] We release the code and model of I2VGen-XL and the ModelScope T2V
- [2023.12] We release the T2V method HiGen and the customization T2V method DreamVideo
- [2023.12] We write an introduction document for VGen and compare I2VGen-XL with SVD
- [2023.11] We release a high-quality I2VGen-XL model, please refer to the Webpage

TODO

- [x] Release the technical papers and webpage of I2VGen-XL
- [x] Release the code and pretrained models that can generate 1280x720 videos
- [ ] Release models optimized specifically for the human body and faces
- [ ] Updated version that can fully maintain the ID and capture large and accurate motions simultaneously
- [ ] Release other methods and the corresponding models

The main features of VGen are as follows:

- Expandability, allowing for easy management of your own experiments.
- Completeness, encompassing all common components for video generation.
- Excellent performance, featuring powerful pre-trained models for multiple tasks.

We have provided a demo dataset that includes images and videos, along with their lists, in ``data``. Please note that the demo images used here are for testing purposes only and were not included in the training.

Enabling distributed training is as easy as executing the training command. In the `t2v_train.yaml` configuration file, you can specify the data, adjust the video-to-image ratio using `frame_lens`, validate your ideas with different diffusion settings, and so on.

- Before training, you can download any of our open-source models for initialization. Our codebase supports custom initialization and `grad_scale` settings, all of which are included in the `Pretrain` item in the yaml file.
- During training, you can view the saved models and intermediate inference results in the `workspace/experiments/t2v_train` directory.

After training is completed, you can perform inference on the model using the inference command. For specific configurations such as data, models, and seed, please refer to the `t2v_infer.yaml` file. In a few minutes, you can retrieve the high-definition video you generated from the `workspace/experiments/test_img_01` directory.

At present, we find that the current model performs inadequately on anime images and images with a black background, due to the lack of relevant training data. We are continually working to optimize it. Because the video quality is compressed in GIF format, please click 'HERE' below to view the original video. Our codebase essentially supports all the commonly used components in video generation.
You can manage your experiments flexibly by adding corresponding registration classes, including `ENGINE, MODEL, DATASETS, EMBEDDER, AUTOENCODER, DISTRIBUTION, VISUAL, DIFFUSION, PRETRAIN`, and these can be made compatible with all our open-source algorithms according to your own needs. If you have any questions, feel free to give us your feedback at any time.

I2VGenXL is supported in the 🧨 diffusers library. Here's how to use it:

If this repo is useful to you, please cite our corresponding technical paper. This open-source model is trained using the WebVid-10M and LAION-400M datasets and is intended for RESEARCH/NON-COMMERCIAL USE ONLY.

license:mit
1,430
175

modelscope-damo-text-to-video-synthesis

We Are Hiring! (Based in Beijing / Hangzhou, China.) If you're looking for an exciting challenge and the opportunity to work with cutting-edge technologies in AIGC and large-scale pretraining, then we are the place for you. We are looking for talented, motivated and creative individuals to join our team. If you are interested, please send your CV to us.

This model is a multi-stage text-to-video generation diffusion model: it takes a description text as input and returns a video that matches the description. Only English input is supported. The model consists of three sub-networks: text feature extraction, a text-feature-to-video latent space diffusion model, and a video latent space to video visual space model, with about 1.7 billion parameters overall. The diffusion model adopts the UNet3D structure and generates video through an iterative denoising process starting from pure Gaussian noise. This model is meant for research purposes. Please look at the model limitations and biases and the misuse, malicious use and excessive use sections.

How the model is expected to be used and where it is applicable

This model has a wide range of applications and can generate videos from arbitrary English text descriptions. The model has been launched on ModelScope Studio and Hugging Face, where you can try it directly; you can also refer to the Colab page to build it yourself. To make trying the model easier, users can refer to the Aliyun Notebook Tutorial to quickly get started with this text-to-video model. This demo requires about 16GB of CPU RAM and 16GB of GPU RAM. Under the ModelScope framework, the model can be used through a simple pipeline call, where the input must be in dictionary format, the only legal key is 'text', and its value is a short text. This model currently only supports inference on GPU.
A specific code example: the code displays the save path of the output video. The output mp4 file can be played with VLC media player; some other media players may not play it correctly.

Model limitations and biases:

- The model is trained on public datasets such as WebVid, so generated results may reflect biases in the training-data distribution.
- The model cannot achieve perfect film and television quality generation.
- The model cannot generate clear text.
- The model is trained mainly on an English corpus and does not support other languages at the moment.
- Performance on complex compositional generation tasks needs improvement.

Misuse, malicious use and excessive use:

- The model was not trained to realistically represent people or events, so using it to generate such content is beyond the model's capabilities.
- It is prohibited to generate content that is demeaning or harmful to people or their environment, culture, religion, etc.
- It is prohibited to generate pornographic, violent or gory content.
- It is prohibited to generate erroneous or false information.

The training data includes LAION5B, ImageNet, WebVid and other public datasets. Image and video filtering is performed after pre-training, such as by aesthetic score, watermark score, and deduplication.

license:cc-by-nc-4.0
1,282
473

VACE-Annotators

license:apache-2.0
159
24

text-to-video-ms-1.7b-legacy

license:cc-by-nc-4.0
141
11

VACE-LTX-Video-0.9

license:apache-2.0
80
27

MS-Vid2Vid-XL

license:cc-by-nc-nd-4.0
45
55

ACE Plus

ACE++: Instruction-Based Image Creation and Editing via Context-Aware Content Filling

Chaojie Mao · Jingfeng Zhang · Yulin Pan · Zeyinzi Jiang · Zhen Han

The original intention behind the design of ACE++ was to unify reference-image generation, local editing, and controllable generation into a single framework, enabling one model to adapt to a wider range of tasks. A more versatile model is often capable of handling more complex tasks. We have already released three LoRA models, focusing on portraits, objects, and regional editing, with the expectation that each would demonstrate strong adaptability within its respective domain. Undoubtedly, this presents certain challenges. We are currently training a fully fine-tuned model, which has now entered the final stage of quality tuning. We are confident it will be released soon. This model will support a broader range of capabilities and is expected to empower community developers to build even more interesting applications.

📢 News

- [x] [2025.01.06] Release the code and models of ACE++.
- [x] [2025.01.07] Release the demo on HuggingFace.
- [x] [2025.01.16] Release the training code for LoRA.
- [x] [2025.02.15] Collection of workflows in ComfyUI.
- [x] [2025.02.15] Release the config for fully fine-tuning.
- [x] [2025.03.03] Release a unified FFT model for ACE++, supporting more image-to-image tasks.

🔥 The unified FFT model for ACE++

We fully fine-tune a composite model with ACE's data to support various editing and reference-generation tasks through an instructive approach. We found that there are conflicts between the repainting task and the editing task during experiments: the edited image is concatenated with noise in the channel dimension, whereas the repainting task modifies the region using zero pixel values in the VAE's latent space.
The editing task uses RGB pixel values of the modified region through the VAE's latent space, which is similar to the distribution of the unmodified part of the repainting task, making it a challenge for the model to distinguish between the two tasks. To address this issue, we introduced 64 additional channels in the channel dimension to differentiate between the two tasks. In these channels, we place the latent representation of the pixel space from the edited image, while keeping the other channels consistent with the repainting task. This approach significantly enhances the model's adaptability to different tasks. One consequence of this approach is that it changes the input channel count of the FLUX-Fill-Dev model from 384 to 448. The specific configuration can be referenced in the configuration file.

Example instructions and their functions (the original table's image columns — Input Reference Image, Input Edit Image, Input Edit Mask, Output — are omitted here):

| Instruction | Function |
|-------------|----------|
| "Display the logo in a minimalist style printed in white on a matte black ceramic coffee mug, alongside a steaming cup of coffee on a cozy cafe table." | Subject Consistency Generation |
| "The item is put on the table." | Subject Consistency Editing |
| "The logo is printed on the headphones." | Subject Consistency Editing |
| "{image} features a close-up of a young, furry tiger cub on a rock. The tiger, which appears to be quite young, has distinctive orange, black, and white striped fur, typical of tigers. The cub's eyes have a bright and curious expression, and its ears are perked up, indicating alertness. The cub seems to be in the act of climbing or resting on the rock. The background is a blurred grassland with trees, but the focus is on the cub, which is vividly colored while the rest of the image is in grayscale, drawing attention to the tiger's details. The photo captures a moment in the wild, depicting the charming and tenacious nature of this young tiger, as well as its typical interaction with the environment." | Super-resolution |
| "{image} Beautiful female portrait, Robot with smooth White transparent carbon shell, rococo detailing, Natural lighting, Highly detailed, Cinematic, 4K." | Recolorizing |
| "{image} Beautiful female portrait, Robot with smooth White transparent carbon shell, rococo detailing, Natural lighting, Highly detailed, Cinematic, 4K." | Depth Guided Generation |
| "{image} Beautiful female portrait, Robot with smooth White transparent carbon shell, rococo detailing, Natural lighting, Highly detailed, Cinematic, 4K." | Contour Guided Generation |

ComfyUI Workflows in the community

We are deeply grateful to the community developers for building many fascinating applications based on the ACE++ series of models. During this process, we have received valuable feedback, particularly regarding artifacts in generated images and the stability of the results. In response to these issues, many developers have proposed creative solutions, which have greatly inspired us, and we pay tribute to them. At the same time, we will take these concerns into account in our further optimization efforts, carefully evaluating and testing before releasing new models. Below, we briefly list some workflows for everyone to use.

- flux ace++ subject without reference image — leeguandong
- Scepter-ACE++ More convenient replacement of everything — HaoBeen

Additionally, many bloggers have published tutorials on how to use it, listed below.

- ACE++ In ComfyUI All-round Creator & Editor - More Than Just A Faceswap AI
- AI Painting Advanced 140 - Huh? Is everyone using it wrong?! The correct way and logic to build the ACE Plus workflow, detailed parameter explanation, using Flux Fill and Redux together - T8 ComfyUI tutorial
- ACE++: Say goodbye to LoRA training, no PuLID needed, easily create exclusive characters! | No Lora Training, Easily Create Exclusive Characters!
- Ace++ and Flux Fill: Advanced Face Swapping Made Easy in ComfyUI

🔥 ACE Models

ACE++ provides a comprehensive toolkit for image editing and generation to support various applications.
We encourage developers to choose the appropriate model based on their own scenarios and to fine-tune their models using data from their specific scenarios to achieve more stable results.

ACE++ Portrait

Portrait-consistent generation that maintains the consistency of the portrait. Models' scepter_path:

- ModelScope: ms://iic/ACEPlus@portrait/xxxx.safetensors
- HuggingFace: hf://ali-vilab/ACEPlus@portrait/xxxx.safetensors

ACE++ Subject

Subject-driven image generation that maintains the consistency of a specific subject across different scenes. Example instruction: "Display the logo in a minimalist style printed in white on a matte black ceramic coffee mug, alongside a steaming cup of coffee on a cozy cafe table." Models' scepter_path:

- ModelScope: ms://iic/ACEPlus@subject/xxxx.safetensors
- HuggingFace: hf://ali-vilab/ACEPlus@subject/xxxx.safetensors

ACE++ LocalEditing

Redraws the masked area of images while maintaining the original structural information of the edited area. Example instruction: "By referencing the mask, restore a partial image from the doodle {image} that aligns with the textual explanation: '1 white old owl'." Models' scepter_path:

- ModelScope: ms://iic/ACEPlus@localediting/xxxx.safetensors
- HuggingFace: hf://ali-vilab/ACEPlus@localediting/xxxx.safetensors

🔥 Applications

The ACE++ model supports a wide range of downstream tasks through simple adaptations. Here are some examples, and we look forward to seeing the community explore even more exciting applications using the ACE++ model.

⚙️️ Installation

Download the code using the following command. ACE++ depends on FLUX.1-Fill-dev as its base model, which you can download from [black-forest-labs/FLUX.1-Fill-dev](https://huggingface.co/black-forest-labs/FLUX.1-Fill-dev). To run the inference code or the Gradio demo, we have defined environment variables that specify the location of the models. For model preparation, we provide three methods for downloading the models; the relevant settings are summarized below.
| Model Downloading Method | Clone to Local Path | Automatic Downloading during Runtime (set the environment variables using the scepter_path in ACE Models) |
|:---:|:---|:---|
| Environment Variables Setting | export FLUX_FILL_PATH="path/to/FLUX.1-Fill-dev"<br>export PORTRAIT_MODEL_PATH="path/to/ACE++ PORTRAIT PATH"<br>export SUBJECT_MODEL_PATH="path/to/ACE++ SUBJECT PATH"<br>export LOCAL_MODEL_PATH="path/to/ACE++ LOCAL EDITING PATH" | export FLUX_FILL_PATH="hf://black-forest-labs/FLUX.1-Fill-dev"<br>export PORTRAIT_MODEL_PATH="${scepter_path}"<br>export SUBJECT_MODEL_PATH="${scepter_path}"<br>export LOCAL_MODEL_PATH="${scepter_path}" |

🚀 Inference

Once the environment variables defined in Installation are set, you can run the examples or test your own samples by executing infer.py.

🚀 Train

We provide training code that allows users to train on their own data. Refer to the data in 'data/train.csv' and 'data/eval.csv' to construct the training data and test data, respectively. We use '#;#' to separate fields. Six fields are required; their explanations are as follows. All parameters related to training are stored in 'train_config/ace_plus_lora.yaml'. To run the training code, execute the training command. The models trained by ACE++ can be found in ./examples/exp_example/xxxx/checkpoints/xxxx/0_SwiftLoRA/comfyui_model.safetensors.

💻 Demo

We have built a GUI demo based on Gradio to help users better utilize the ACE++ model.
Just execute the launch command.

📚 Limitations

- For certain tasks, such as deleting and adding objects, there are flaws in instruction following. For adding and replacing objects, we recommend trying the repainting method of the local editing model instead.
- The generated results may contain artifacts, especially in the generation of hands, which still exhibit distortions.
- The current version of ACE++ is still in the development stage. We are working on improving the model's performance and adding more features.

📝 Citation

ACE++ is a post-training model based on the FLUX.1-dev series from black-forest-labs; please adhere to its open-source license. The test materials used in ACE++ come from the internet and are intended for academic research and communication purposes. If the original creators are uncomfortable with their inclusion, please contact us to have them removed. If you use this model in your research, please cite the works of FLUX.1-dev and the following papers:

30
289

MS-Image2Video

license:cc-by-nc-nd-4.0
4
118