MUG-V
MUG-V Inference
---
license: apache-2.0
pipeline_tag: image-to-video
---

# MUG-V 10B: High-efficiency Training Pipeline for Large Video Generation Models

Yongshun Zhang · Zhongyi Fan · Yonghang Zhang · Zhangzikang Li · Weifeng Chen
Zhongwei Feng · Chaoyue Wang† · Peng Hou† · Anxiang Zeng†

[Paper](https://arxiv.org/abs/2510.17519) · [Weights](https://huggingface.co/MUG-V/MUG-V-inference) · [Code](https://github.com/Shopee-MUG/MUG-V) · [Training Code](https://github.com/Shopee-MUG/MUG-V-Megatron-LM-Training) · [License](https://github.com/Shopee-MUG/MUG-V/blob/main/LICENSE)

MUG-V 10B is a large-scale video generation system built by the Shopee Multimodal Understanding and Generation (MUG) team. The core generator is a Diffusion Transformer (DiT) with ~10B parameters trained via flow-matching objectives. We release the complete stack:

- Model weights
- Megatron-Core-based training code
- Inference pipelines for video generation and video enhancement

To our knowledge, this is the first publicly available large-scale video-generation training framework that leverages Megatron-Core for high training efficiency (e.g., high GPU utilization, strong MFU) and near-linear multi-node scaling. By open-sourcing the end-to-end framework, we aim to accelerate progress and lower the barrier for scalable modeling of the visual world.

## News

- **Oct. 21, 2025:** 👋 We are excited to announce the release of the MUG-V 10B technical report. We welcome feedback and discussions.
- **Oct. 21, 2025:** 👋 We've released our Megatron-LM-based training framework, addressing the key challenges of training billion-parameter video generators.
- **Oct. 21, 2025:** 👋 We've released MUG-V video enhancement inference code and weights (based on WAN-2.1 1.3B).
- **Oct. 21, 2025:** 👋 We've released MUG-V 10B (e-commerce edition) inference code and weights.
- **Apr. 25, 2025:** 👋 We submitted our model to the VBench-I2V leaderboard; at submission time, MUG-V ranked #3.
## Roadmap

- MUG-V Model & Inference
  - [x] Inference code for MUG-V 10B
  - [x] Checkpoints: e-commerce edition (Image-to-Video Generation, I2V)
  - [ ] Checkpoints: general-domain edition
  - [ ] Diffusers integration
  - [ ] Text prompt rewriter
- MUG-V Training
  - [x] Data preprocessing tools (video encoding, text encoding)
  - [x] Pre-training framework on Megatron-LM
- MUG-V Video Enhancer
  - [x] Inference code
  - [x] Lightweight I2V model checkpoints (trained on the WAN-2.1 1.3B T2V model)
  - [x] MUG-V Video Enhancer LoRA checkpoints (based on the above I2V model)
  - [ ] Training code

## Key Features

- High-quality video generation: up to 720p, 3–5 s clips
- Image-to-Video (I2V): conditioning on a reference image
- Flexible aspect ratios: 16:9, 4:3, 1:1, 3:4, 9:16
- Advanced architecture: MUG-DiT (≈10B parameters) with flow-matching training

## Contents

- Installation
- Quick Start
- API Reference
- Video Enhancement
- Model Architecture
- License

## Installation

Requirements:

- Python ≥ 3.8 (tested with 3.10)
- CUDA 12.1
- NVIDIA GPU with ≥ 24 GB VRAM (for 10B-parameter inference)

Download the pre-trained models with `huggingface-cli`, then update the VAE and DiT model paths in your configuration at `infer_pipeline.MUGDiTConfig`.

## Quick Start

The script uses the default configuration and generates a video based on the built-in prompt and reference image.

## Video Enhancement

Use the MUG-V Video Enhancer to improve videos generated by MUG-DiT-10B (e.g., detail restoration, temporal consistency). Details can be found in the `./mugenhancer` folder. The output video will be saved to `./mugenhancer/videooutputs/year-month-day_hour:minute:second/0000generatedvideoenhance.mp4`.

## API Reference

### `MUGDiTConfig`

Parameters:

- `device` (str): Device to run inference on. Default: `"cuda"`
- `dtype` (torch.dtype): Data type for computations. Default: `torch.bfloat16`
- `vae_pretrained_path` (str): Path to VAE model checkpoint
- `dit_pretrained_path` (str): Path to DiT model checkpoint
- `resolution` (str): Video resolution. Currently only `"720p"` is supported
- `video_length` (str): Video duration. Options: `"3s"`, `"5s"`
- `video_ar_ratio` (str): Aspect ratio. Options: `"16:9"`, `"4:3"`, `"1:1"`, `"3:4"`, `"9:16"`
- `cfg_scale` (float): Classifier-free guidance scale. Default: 4.0
- `num_sampling_steps` (int): Number of denoising steps. Default: 25
- `fps` (int): Frames per second. Default: 30
- `aes_score` (float): Aesthetic score for prompt enhancement. Default: 6.0
- `seed` (int): Random seed for reproducibility. Default: 42

### `__init__(config: MUGDiTConfig)`

Initialize the pipeline with a configuration.

### `generate(prompt=None, reference_image_path=None, output_path=None, seed=None, **kwargs) -> str`

Generate a video from text and a reference image.

Parameters:

- `prompt` (str, optional): Text description of the desired video
- `reference_image_path` (str|Path, optional): Path to the reference image
- `output_path` (str|Path, optional): Output video file path
- `seed` (int, optional): Random seed for this generation

## Model Architecture

MUG-DiT adopts the latent diffusion transformer paradigm with rectified flow matching objectives:

1. VideoVAE: 8×8×8 spatiotemporal compression
   - Encoder: 3D convolutions + temporal attention
   - Decoder: 3D transposed convolutions + temporal upsampling
   - KL regularization for a stable latent space
2. 3D Patch Embedding: converts video latents to tokens
   - Patch size: 2×2×2 (non-overlapping)
   - Final compression: ~2048× vs. pixel space
3. Position Encoding: 3D Rotary Position Embeddings (RoPE)
   - Extends 2D RoPE to handle the temporal dimension
   - Frequency-based encoding for spatiotemporal modeling
4. Conditioning Modules:
   - Caption Embedder: projects text embeddings (4096-dim) for cross-attention
   - Timestep Embedder: embeds the diffusion timestep via sinusoidal encoding
   - Size Embedder: handles variable-resolution inputs
5. Rectified Flow Scheduler:
   - More stable training than DDPM
   - Logit-normal timestep sampling
   - Linear interpolation between noise and data

## Citation

If you find our work helpful, please cite us.

## License

This project is licensed under the Apache License 2.0; see the LICENSE file for details.
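To make the compression numbers concrete, here is a back-of-envelope sketch of the token count for a single clip. The resolution, clip length, and frame rate below are illustrative assumptions, not values from the released code; the README's ~2048× figure is an overall element-count ratio that also depends on channel dimensions not shown here.

```python
# Illustrative token-count arithmetic for MUG-DiT inputs.
# Assumptions: 720p (1280x720), 5 s at 30 fps ~= 150 frames,
# 8x8x8 VideoVAE compression, 2x2x2 non-overlapping patchification.

frames, height, width = 150, 720, 1280

# Latent grid after the 8x8x8 VideoVAE
lt, lh, lw = frames // 8, height // 8, width // 8    # 18 x 90 x 160

# Tokens after grouping 2x2x2 latent cells into one token
tokens = (lt // 2) * (lh // 2) * (lw // 2)           # 9 * 45 * 80

# Spatiotemporal positions folded into each token: 8^3 * 2^3
positions_per_token = 8 ** 3 * 2 ** 3

print(tokens)                # 32400
print(positions_per_token)   # 4096
```

Even a 5 s 720p clip thus reduces to a few tens of thousands of tokens, which is what makes full attention over video latents tractable at this scale.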
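The rectified-flow ingredients listed above (logit-normal timestep sampling, linear interpolation between noise and data) can be sketched in a few lines. This is a generic flow-matching step on toy vectors, not the released training code:

```python
import math
import random

def sample_timestep(rng: random.Random) -> float:
    """Logit-normal timestep sampling: t = sigmoid(z), z ~ N(0, 1)."""
    z = rng.gauss(0.0, 1.0)
    return 1.0 / (1.0 + math.exp(-z))

def rectified_flow_pair(x0, noise, t):
    """Linear interpolation x_t = (1 - t) * x0 + t * noise.

    The network is trained to predict the constant velocity
    d x_t / d t = noise - x0 at every t.
    """
    xt = [(1.0 - t) * a + t * b for a, b in zip(x0, noise)]
    velocity = [b - a for a, b in zip(x0, noise)]
    return xt, velocity

rng = random.Random(42)
x0 = [1.0, -2.0, 0.5]                          # toy "data" sample
noise = [rng.gauss(0.0, 1.0) for _ in x0]      # Gaussian noise sample
t = sample_timestep(rng)
xt, v = rectified_flow_pair(x0, noise, t)
```

Because the interpolation is linear, the regression target `noise - x0` is the same at every timestep, which is the source of the training stability noted above.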
Note: This is a research project. Generated content may not always be perfect. Please use it responsibly and in accordance with applicable laws and regulations.

## Acknowledgements

We would like to thank the contributors to the Open-Sora, DeepFloyd/t5-v1_1-xxl, Wan-Video, Qwen, HuggingFace, Megatron-LM, TransformerEngine, DiffSynth, diffusers, PixArt, and other repositories for their open research.
MUG-V-training
Pre-trained Megatron-format checkpoints for the MUG-V 10B video generation model.

## Checkpoint Formats

### Torch Distributed Checkpoint (flexible parallelism support)

- Format: Torch Distributed (`.distcp`)
- Parallelism: can be loaded with any TP/PP configuration
- Use case: production training, flexible distributed setup

### Torch Checkpoint (legacy)

- Format: Torch format (`mp_rank_XX/model_optim_rng.pt`)
- Parallelism: must be loaded with TP=4
- Use case: fixed TP setup, or conversion to Torch Distributed

## Usage

Use the Torch Distributed checkpoint directly for training. Convert the Megatron checkpoint to HuggingFace format for inference.

| Format | Parallelism | File Structure | Training | Conversion |
|--------|-------------|----------------|----------|------------|
| Torch Distributed | Flexible TP/PP | `.distcp` files | ✅ Recommended | ✅ To HF |
| Torch (Legacy) | Fixed TP=4 | `mp_rank_XX/` dirs | ⚠️ TP=4 only | ✅ To Torch Dist / HF |
| HuggingFace | None (inference) | Single `.pt` file | ❌ Not for training | - |

## Model Specifications

- Parameters: ~10 billion
- Architecture: Diffusion Transformer (DiT)
- Hidden size: 3456
- Attention heads: 48
- Layers: 56
- Compression: VideoVAE 8×8×8

## Related Resources

- Training code: MUG-V-Megatron-LM-Training
- Inference code: MUG-V
- Inference weights: MUG-V-inference
- Sample dataset: MUG-V-Training-Samples
- Training guide: `examples/mugv/README.md`
- Checkpoint conversion: Conversion Guide

Developed by the Shopee Multimodal Understanding and Generation (MUG) Team.
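As a sanity check on the specifications above, a standard transformer block with hidden size h spends roughly 4h² parameters on attention projections and 8h² on a 4× expansion MLP. The block layout and MLP ratio below are assumptions, not the released architecture; under them, the transformer core alone accounts for ~8B parameters, with embedders and conditioning modules plausibly making up the rest of the stated ~10B:

```python
# Back-of-envelope parameter estimate (assumed standard DiT block:
# Q/K/V/O projections plus a 4x-expansion MLP; conditioning overhead
# and embedders are not counted here).
hidden = 3456
heads = 48
layers = 56

head_dim = hidden // heads             # per-head dimension
attn_params = 4 * hidden ** 2          # Q, K, V, O projection matrices
mlp_params = 8 * hidden ** 2           # two h x 4h matrices
per_layer = attn_params + mlp_params   # 12 * h^2 per block

core_params = layers * per_layer
print(head_dim)                        # 72
print(f"{core_params / 1e9:.1f}B")     # ~8.0B for the transformer core
```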