
# Ditto: Scaling Instruction-Based Video Editing with a High-Quality Synthetic Dataset

This repository contains the Ditto framework and the Editto model, introduced in the paper *Scaling Instruction-Based Video Editing with a High-Quality Synthetic Dataset*. Ditto takes a holistic approach to the scarcity of high-quality training data for instruction-based video editing, enabling the creation of the Ditto-1M dataset and the training of the state-of-the-art Editto model.

- 📄 Paper
- 🌐 Project Page
- 💻 GitHub Repository
- 📦 Model Weights (on HF)
- 📊 Dataset (on HF)

## Abstract

Instruction-based video editing promises to democratize content creation, yet its progress is severely hampered by the scarcity of large-scale, high-quality training data. We introduce Ditto, a holistic framework designed to tackle this fundamental challenge. At its heart, Ditto features a novel data generation pipeline that fuses the creative diversity of a leading image editor with an in-context video generator, overcoming the limited scope of existing models. To make this process viable, our framework resolves the prohibitive cost-quality trade-off by employing an efficient, distilled model architecture augmented by a temporal enhancer, which simultaneously reduces computational overhead and improves temporal coherence. Finally, to achieve full scalability, this entire pipeline is driven by an intelligent agent that crafts diverse instructions and rigorously filters the output, ensuring quality control at scale. Using this framework, we invested over 12,000 GPU-days to build Ditto-1M, a new dataset of one million high-fidelity video editing examples. We trained our model, Editto, on Ditto-1M with a curriculum learning strategy. The results demonstrate superior instruction-following ability and establish a new state-of-the-art in instruction-based video editing.
## Inference

### 1. Script / Python Inference

Download the base model and our models from Google Drive or Hugging Face. You can either use the provided script or run Python directly. Some test cases can be found in the HF Dataset, and reference editing prompts are available in `inference/example_prompts.txt`.

### 2. Using with ComfyUI

Note: While ComfyUI runs faster with lower computational requirements (an 832×480×73 video needs 11 GB of GPU memory and ~4 min on an A6000), the use of quantized and distilled models may introduce some quality degradation.

First, follow the ComfyUI installation guide to set up the base ComfyUI environment. We strongly recommend installing ComfyUI-Manager for easy custom node management:

- Option 1 (recommended): use ComfyUI-Manager's Install Missing Custom Nodes function to install all required custom nodes automatically.
- Option 2: manually install the required custom nodes (you can refer to this page):
  - ComfyUI-WanVideoWrapper
  - KJNodes for ComfyUI
  - comfyui-mixlab-nodes
  - ComfyUI-VideoHelperSuite

Download the required model weights from Kijai/WanVideo_comfy into subfolders of `models/`. Required files include:

- Wan21-T2V-14Bfp8e4m3fn.safetensors to `diffusion_models/`
- Wan21CausVid14BT2Vlorarank32v2.safetensors to `loras/` (for inference acceleration)
- Wan21VAEbf16.safetensors to `vae/wan/`
- umt5-xxl-enc-bf16.safetensors to `text_encoders/`

Then download our models from Google Drive or Hugging Face into `diffusion_models/` (load them with the VACE Module Select node). Use the workflow `ditto_comfyui_workflow.json` in this repo to get started; we provide some reference prompts in its note, and test cases can be found in the HF Dataset.

Note: If you want to test sim2real cases, try prompts like "Turn it into the real domain".

## Citation

If you find this work useful, please consider citing our paper.

## Acknowledgements

We thank Wan, VACE, and Qwen-Image for providing the powerful foundation models, and QwenVL for its advanced visual understanding capabilities. We also thank DiffSynth-Studio, which serves as the codebase for this repository.

## License

This project is licensed under CC BY-NC-SA 4.0 (Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International). The code is provided for academic research purposes only.

## Contact

For any questions, please contact [email protected].
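As a convenience, the ComfyUI weight layout described in the setup instructions above can be sanity-checked with a short script. This is only a sketch: the subfolder names follow standard ComfyUI conventions plus the `vae/wan/` path from this card, the filenames are copied verbatim from the list above (adjust them if your downloads are named differently), and `ComfyUI/models` is an assumed install path.

```python
from pathlib import Path

# Expected weight locations under the ComfyUI models/ directory,
# as listed in the setup instructions above. Subfolder and file
# names are assumptions taken from this card; adjust as needed.
EXPECTED = {
    "diffusion_models": ["Wan21-T2V-14Bfp8e4m3fn.safetensors"],
    "loras": ["Wan21CausVid14BT2Vlorarank32v2.safetensors"],
    "vae/wan": ["Wan21VAEbf16.safetensors"],
    "text_encoders": ["umt5-xxl-enc-bf16.safetensors"],
}

def missing_weights(models_root):
    """Return expected weight files not present under models_root."""
    root = Path(models_root)
    return [
        f"{sub}/{name}"
        for sub, names in EXPECTED.items()
        for name in names
        if not (root / sub / name).is_file()
    ]

if __name__ == "__main__":
    missing = missing_weights("ComfyUI/models")  # assumed install path
    if missing:
        print("Missing weights:")
        for entry in missing:
            print(" -", entry)
    else:
        print("All expected weights found.")
```

Run it from the directory containing your ComfyUI checkout before opening the workflow, so missing downloads surface as a simple list rather than as node errors inside ComfyUI.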