Skywork


Skywork-R1V-38B

1. Model Introduction

| Model Name | Vision Encoder | Language Model | HF Link |
| --- | --- | --- | --- |
| Skywork-R1V-38B | InternViT-6B-448px-V25 | deepseek-ai/DeepSeek-R1-Distill-Qwen-32B | 🤗 Link |
| Skywork-R1V2-38B | InternViT-6B-448px-V25 | Qwen/QwQ-32B | 🤗 Link |

2. Features

- Visual Chain-of-Thought: Enables multi-step logical reasoning on visual inputs, breaking down complex image-based problems into manageable steps.
- Mathematical & Scientific Analysis: Capable of solving visual math problems and interpreting scientific/medical imagery with high precision.
- Cross-Modal Understanding: Seamlessly integrates text and images for richer, context-aware comprehension.

Comparison with Larger-Scale Open-Source and Closed-Source Models

[Chart: evaluation results of state-of-the-art LLMs and VLMs, comparing QwQ-32B-Preview, InternVL-2.5-38B, VILA 1.5-40B, InternVL2-40B, and Skywork-R1V-38B]

5. Citation

If you use Skywork-R1V in your research, please cite. This project is released under an open-source license.

license:mit · 49,043 downloads · 127 likes

Skywork-Reward-Llama-3.1-8B-v0.2

> IMPORTANT: This model was trained on the decontaminated version of the original Skywork Reward Preference dataset, now referred to as v0.2. The updated dataset, Skywork-Reward-Preference-80K-v0.2, removes 4,957 contaminated pairs from the magpie-ultra-v0.1 subset, which had significant n-gram overlap with the evaluation prompts in RewardBench. You can find the set of removed pairs here. For more detailed information, please refer to this GitHub gist.
>
> If your task involves evaluation on RewardBench, we strongly recommend using v0.2 of both the dataset and the models instead of v0.1, to ensure proper decontamination and avoid any contamination issues.

Skywork-Reward-Gemma-2-27B-v0.2 and Skywork-Reward-Llama-3.1-8B-v0.2 are two advanced reward models built on the gemma-2-27b-it and Llama-3.1-8B-Instruct architectures, respectively. Both models were trained on the Skywork Reward Data Collection, which contains only 80K high-quality preference pairs sourced from publicly available data. We include only public data to demonstrate that high-performance reward models can be achieved with a relatively small dataset and straightforward data curation techniques, without further algorithmic or architectural modifications. The sources of data used in the Skywork Reward Data Collection are detailed in the Data Mixture section below.

The resulting reward models excel at handling preferences in complex scenarios, including challenging preference pairs, and span various domains such as mathematics, coding, and safety.

Instead of relying on existing large preference datasets, we carefully curate the Skywork Reward Data Collection (1) to include high-quality preference pairs and (2) to target specific capability and knowledge domains. The curated training dataset consists of approximately 80K samples, subsampled from multiple publicly available data sources, including:

1. HelpSteer2
2. OffsetBias
3. WildGuard (adversarial)
4. Magpie DPO series: Ultra, Pro (Llama-3.1), Pro, Air

Disclaimer: We made no modifications to the original datasets listed above, other than subsampling them to create the Skywork Reward Data Collection.

During dataset curation, we adopt several tricks to achieve both performance improvement and balance between domains, without compromising overall performance:

1. We select top samples from the math, code, and other categories in the combined Magpie dataset independently, based on the average ArmoRM score provided with the dataset. We subtract 0.1 and 0.05 from the ArmoRM average scores in the Magpie-Air and Magpie-Pro subsets, respectively, to prioritize Magpie-Ultra and Magpie-Pro-Llama-3.1 samples.
2. Instead of including all preference pairs in WildGuard, we first train a reward model (RM) on the three other data sources. We then (1) use this RM to score the chosen and rejected responses for all samples in WildGuard and (2) select only samples where the chosen response's RM score is greater than the rejected response's RM score. We observe that this approach largely preserves the original performance on Chat, Chat Hard, and Reasoning while improving Safety. For both models, we use the 27B model to score the WildGuard samples.

We evaluate our models on RewardBench using the official test script. As of October 2024, Skywork-Reward-Llama-3.1-8B-v0.2 ranks first among 8B models on the RewardBench leaderboard.

| Rank | Model | Model Type | Score | Chat | Chat Hard | Safety | Reasoning |
| :---: | --- | --- | :---: | :---: | :---: | :---: | :---: |
| 1 | Skywork/Skywork-Reward-Gemma-2-27B-v0.2 | Seq. Classifier | 94.3 | 96.1 | 89.9 | 93.0 | 98.1 |
| 2 | nvidia/Llama-3.1-Nemotron-70B-Reward | Custom Classifier | 94.1 | 97.5 | 85.7 | 95.1 | 98.1 |
| 3 | Skywork/Skywork-Reward-Gemma-2-27B | Seq. Classifier | 93.8 | 95.8 | 91.4 | 91.9 | 96.1 |
| 4 | SF-Foundation/TextEval-Llama3.1-70B | Generative | 93.5 | 94.1 | 90.1 | 93.2 | 96.4 |
| 5 | meta-metrics/MetaMetrics-RM-v1.0 | Custom Classifier | 93.4 | 98.3 | 86.4 | 90.8 | 98.2 |
| 6 | Skywork/Skywork-Critic-Llama-3.1-70B | Generative | 93.3 | 96.6 | 87.9 | 93.1 | 95.5 |
| 7 | Skywork/Skywork-Reward-Llama-3.1-8B-v0.2 | Seq. Classifier | 93.1 | 94.7 | 88.4 | 92.7 | 96.7 |
| 8 | nicolinho/QRM-Llama3.1-8B | Seq. Classifier | 93.1 | 94.4 | 89.7 | 92.3 | 95.8 |
| 9 | LxzGordon/URM-LLaMa-3.1-8B | Seq. Classifier | 92.9 | 95.5 | 88.2 | 91.1 | 97.0 |
| 10 | Salesforce/SFR-LLaMa-3.1-70B-Judge-r | Generative | 92.7 | 96.9 | 84.8 | 91.6 | 97.6 |
| 11 | Skywork/Skywork-Reward-Llama-3.1-8B | Seq. Classifier | 92.5 | 95.8 | 87.3 | 90.8 | 96.2 |
| 12 | general-preference/GPM-Llama-3.1-8B | Custom Classifier | 92.2 | 93.3 | 88.6 | 91.1 | 96.0 |

We provide example usage of the Skywork reward model series below. Please note:

1. To enable optimal performance for the 27B reward model, ensure that you have enabled either the `flash_attention_2` or `eager` attention implementation. The default `sdpa` implementation may trigger bugs that significantly degrade performance for this particular model.

Below is an example of obtaining the reward scores of two conversations.

We hereby declare that the Skywork model should not be used for any activities that pose a threat to national or societal security or engage in unlawful actions. Additionally, we request that users not deploy the Skywork model for internet services without appropriate security reviews and records. We hope that all users will adhere to this principle to ensure that technological advancements occur in a regulated and lawful environment.

We have done our utmost to ensure the compliance of the data used during the model's training process. However, despite our extensive efforts, due to the complexity of the model and data, there may still be unpredictable risks and issues. Therefore, if any problems arise as a result of using the Skywork open-source model, including but not limited to data security issues, public opinion risks, or any risks and problems arising from the model being misled, abused, disseminated, or improperly utilized, we will not assume any responsibility.

Community usage of the Skywork model requires the Skywork Community License. The Skywork model supports commercial use. If you plan to use the Skywork model or its derivatives for commercial purposes, you must abide by the terms and conditions of the Skywork Community License.

Skywork-Reward: Bag of Tricks for Reward Modeling in LLMs

If you have any questions, please feel free to reach us at or . If you find our work helpful, please feel free to cite us using the following BibTeX entry:
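The conversation-scoring example mentioned above is not reproduced on this page. Below is a minimal sketch of the usual pattern for reward models exposed as `transformers` sequence classifiers; the helper names are ours, and details such as dtype and attention implementation may differ from the official example.

```python
# Sketch: scoring conversations with a Skywork reward model via the standard
# transformers sequence-classification API. Helper names are illustrative.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer


def build_conversation(prompt: str, response: str) -> list:
    """Package a prompt/response pair in the chat format the tokenizer expects."""
    return [{"role": "user", "content": prompt},
            {"role": "assistant", "content": response}]


def score(model, tokenizer, conversation, device="cpu") -> float:
    """Return the scalar reward (single logit) for one conversation."""
    input_ids = tokenizer.apply_chat_template(
        conversation, tokenize=True, return_tensors="pt").to(device)
    with torch.no_grad():
        return model(input_ids).logits[0][0].item()


# Example usage (requires downloading the model; shown as comments only):
# name = "Skywork/Skywork-Reward-Llama-3.1-8B-v0.2"
# tokenizer = AutoTokenizer.from_pretrained(name)
# model = AutoModelForSequenceClassification.from_pretrained(
#     name, torch_dtype=torch.bfloat16, num_labels=1)
# chosen = build_conversation("What is 2+2?", "2+2 equals 4.")
# rejected = build_conversation("What is 2+2?", "2+2 equals 5.")
# print(score(model, tokenizer, chosen), score(model, tokenizer, rejected))
```

A well-trained reward model should assign the chosen response a higher score than the rejected one; that margin is what the WildGuard filtering trick above relies on.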

llama · 24,320 downloads · 37 likes

Skywork-Reward-V2-Qwen3-4B

license:apache-2.0 · 13,319 downloads · 8 likes

Skywork-Reward-V2-Llama-3.2-1B

llama · 11,919 downloads · 5 likes

Skywork-Reward-V2-Llama-3.1-8B

llama · 11,236 downloads · 22 likes

Skywork-Reward-V2-Qwen3-8B

license:apache-2.0 · 6,877 downloads · 18 likes

Skywork-o1-Open-PRM-Qwen-2.5-1.5B

3,324 downloads · 33 likes

Skywork-Reward-V2-Qwen3-0.6B

license:apache-2.0 · 2,990 downloads · 11 likes

SkyReels-V2-DF-1.3B-540P

πŸ“‘ Technical Report Β· πŸ‘‹ Playground Β· πŸ’¬ Discord Β· πŸ€— Hugging Face Β· πŸ€– ModelScope Β· 🌐 GitHub --- Welcome to the SkyReels V2 repository! Here, you'll find the model weights for our infinite-length film generative models. To the best of our knowledge, it represents the first open-source video generative model employing AutoRegressive Diffusion-Forcing architecture that achieves the SOTA performance among publicly available models. πŸ”₯πŸ”₯πŸ”₯ News!! Apr 24, 2025: πŸ”₯ We release the 720P models, SkyReels-V2-DF-14B-720P and SkyReels-V2-I2V-14B-720P. The former facilitates infinite-length autoregressive video generation, and the latter focuses on Image2Video synthesis. Apr 21, 2025: πŸ‘‹ We release the inference code and model weights of SkyReels-V2 Series Models and the video captioning model SkyCaptioner-V1 . Apr 3, 2025: πŸ”₯ We also release SkyReels-A2. This is an open-sourced controllable video generation framework capable of assembling arbitrary visual elements. Feb 18, 2025: πŸ”₯ we released SkyReels-A1. This is an open-sourced and effective framework for portrait image animation. Feb 18, 2025: πŸ”₯ We released SkyReels-V1. This is the first and most advanced open-source human-centric video foundation model. The demos above showcase 30-second videos generated using our SkyReels-V2 Diffusion Forcing model. 
- [x] Technical Report
- [x] Checkpoints of the 14B and 1.3B Models Series
- [x] Single-GPU & Multi-GPU Inference Code
- [x] SkyCaptioner-V1: A Video Captioning Model
- [x] Prompt Enhancer
- [ ] Diffusers integration
- [ ] Checkpoints of the 5B Models Series
- [ ] Checkpoints of the Camera Director Models
- [ ] Checkpoints of the Step & Guidance Distill Model

Model Download

You can download our models from Hugging Face:

| Type | Model Variant | Recommended Height/Width/Frames | Link |
| --- | --- | --- | --- |
| Diffusion Forcing | 1.3B-540P | 544 × 960, 97 frames | 🤗 Huggingface · 🤖 ModelScope |
| Diffusion Forcing | 14B-720P | 720 × 1280, 121 frames | 🤗 Huggingface · 🤖 ModelScope |
| Image-to-Video | 1.3B-540P | 544 × 960, 97 frames | 🤗 Huggingface · 🤖 ModelScope |
| Image-to-Video | 14B-720P | 720 × 1280, 121 frames | 🤗 Huggingface · 🤖 ModelScope |

After downloading, set the model path in your generation commands.

The Diffusion Forcing version of the model allows us to generate infinite-length videos. It supports both text-to-video (T2V) and image-to-video (I2V) tasks, and can perform inference in both synchronous and asynchronous modes. We demonstrate two running scripts as examples for long video generation. If you want to adjust the inference parameters, e.g., the duration of the video or the inference mode, read the Note below first.

> Note:
> - If you want to run the image-to-video (I2V) task, add `--image ${imagepath}` to your command. It is also better to use a text-to-video (T2V)-style prompt that includes some description of the first-frame image.
> - For long video generation, you can simply switch `--numframes`, e.g., `--numframes 257` for a 10s video, `--numframes 377` for 15s, `--numframes 737` for 30s, `--numframes 1457` for 60s. The number is not strictly aligned with the logical frame count for the specified duration, but it is aligned with some training parameters, which means it may perform better. When you use asynchronous inference with causalblocksize > 1, `--numframes` should be set carefully.
> - You can use `--arstep 5` to enable asynchronous inference. For asynchronous inference, `--causalblocksize 5` is recommended, while it should not be set for synchronous generation. REMEMBER that the number of frame latents fed into the model in every iteration, e.g., the base frame latent number ((97-1)//4+1 = 25 for basenumframes=97) and the last-iteration latent number ((237-97-(97-17)×1+17-1)//4+1 = 20 for basenumframes=97, numframes=237, overlaphistory=17), MUST be divisible by causalblocksize. If you find it too hard to calculate and set proper values, just use our recommended settings above :). Asynchronous inference takes more steps to diffuse the whole sequence, which means it is SLOWER than synchronous mode. In our experiments, asynchronous inference may improve instruction following and visual consistency.
> - To reduce peak VRAM, lower `--basenumframes`, e.g., to 77 or 57, while keeping the same generative length `--numframes` you want to generate. This may slightly reduce video quality, and it should not be set too small.
> - `--addnoisecondition` helps smooth long video generation by adding some noise to the clean condition. Too much noise can cause inconsistency as well. 20 is a recommended value; you may try larger ones, but it is recommended not to exceed 50.
> - Generating a 540P video using the 1.3B model requires approximately 14.7GB peak VRAM, while the same resolution video using the 14B model demands around 51.2GB peak VRAM.

> Note:
> - When using an image-to-video (I2V) model, you must provide an input image using the `--image ${imagepath}` parameter. `--guidancescale 5.0` and `--shift 3.0` are recommended for the I2V model.
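The latent-count arithmetic quoted in the note above can be checked with a short script. This is a sketch of the bookkeeping only; the helper names are ours, not from the SkyReels-V2 codebase, and the formulas are taken verbatim from the note (temporal stride 4, overlap of 17 frames).

```python
# Sketch of the frame-latent bookkeeping from the note above.
# Helper names are illustrative, not from the SkyReels-V2 codebase.

def base_latents(base_num_frames: int) -> int:
    """Latents fed to the model in a full iteration (temporal stride 4)."""
    return (base_num_frames - 1) // 4 + 1

def last_iter_latents(base_num_frames: int, num_frames: int,
                      overlap_history: int, num_mid_iters: int = 1) -> int:
    """Latents in the final iteration of long-video generation."""
    remaining = (num_frames - base_num_frames
                 - (base_num_frames - overlap_history) * num_mid_iters
                 + overlap_history - 1)
    return remaining // 4 + 1

def check_causal_block(base_num_frames, num_frames, overlap_history, block):
    """Both per-iteration latent counts must be divisible by causalblocksize."""
    b = base_latents(base_num_frames)
    last = last_iter_latents(base_num_frames, num_frames, overlap_history)
    return b % block == 0 and last % block == 0

# Values from the note: 25 and 20 latents, both divisible by blocksize 5.
print(base_latents(97))                    # 25
print(last_iter_latents(97, 237, 17))      # 20
print(check_causal_block(97, 237, 17, 5))  # True
```

If `check_causal_block` returns False for your settings, fall back to the recommended values above rather than hand-tuning `--numframes`.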
> - Generating a 540P video using the 1.3B model requires approximately 14.7GB peak VRAM, while the same resolution video using the 14B model demands around 43.4GB peak VRAM.

The prompt enhancer is implemented based on Qwen2.5-32B-Instruct and is enabled via the `--promptenhancer` parameter. It works well for short prompts; for long prompts, it might generate an excessively lengthy prompt that leads to over-saturation in the generated video. Note that peak GPU memory is 64GB+ if you use `--promptenhancer`. If you want to obtain the enhanced prompt separately, you can also run the prompt enhancer script on its own for testing. The steps are as follows:

> Note:
> - `--promptenhancer` is not allowed when using `--useusp`. We recommend running the skyreelsv2infer/pipelines/promptenhancer.py script first to generate the enhanced prompt before enabling the `--useusp` parameter.

Below are the key parameters you can customize for video generation:

| Parameter | Recommended Value | Description |
| :---: | :---: | :--- |
| --prompt | | Text description for generating your video |
| --image | | Path to input image for image-to-video generation |
| --resolution | 540P or 720P | Output video resolution (select based on model type) |
| --numframes | 97 or 121 | Total frames to generate (97 for 540P models, 121 for 720P models) |
| --inferencesteps | 50 | Number of denoising steps |
| --fps | 24 | Frames per second in the output video |
| --shift | 8.0 or 5.0 | Flow matching scheduler parameter (8.0 for T2V, 5.0 for I2V) |
| --guidancescale | 6.0 or 5.0 | Controls text adherence strength (6.0 for T2V, 5.0 for I2V) |
| --seed | | Fixed seed for reproducible results (omit for random generation) |
| --offload | True | Offloads model components to CPU to reduce VRAM usage (recommended) |
| --useusp | True | Enables multi-GPU acceleration with xDiT USP |
| --outdir | ./videoout | Directory where generated videos will be saved |
| --promptenhancer | True | Expands the prompt into a more detailed description |
| --teacache | False | Enables teacache for faster inference |
| --teacachethresh | 0.2 | Higher speedup causes worse quality |
| --useretsteps | False | Retention steps for teacache |

Diffusion Forcing Additional Parameters

| Parameter | Recommended Value | Description |
| :---: | :---: | :--- |
| --arstep | 0 | Controls asynchronous inference (0 for synchronous mode) |
| --basenumframes | 97 or 121 | Base frame count (97 for 540P, 121 for 720P) |
| --overlaphistory | 17 | Number of frames to overlap for smooth transitions in long videos |
| --addnoisecondition | 20 | Improves consistency in long video generation |
| --causalblocksize | 5 | Recommended when using asynchronous inference (--arstep > 0) |

We use xDiT USP to accelerate inference. For example, to generate a video with 2 GPUs, you can use the following command:

- Diffusion Forcing

> Note:
> - When using an image-to-video (I2V) model, you must provide an input image using the `--image ${imagepath}` parameter. `--guidancescale 5.0` and `--shift 3.0` are recommended for the I2V model.

Contents

- Abstract
- Methodology of SkyReels-V2
- Key Contributions of SkyReels-V2
- Video Captioner
- Reinforcement Learning
- Diffusion Forcing
- High-Quality Supervised Fine-Tuning (SFT)
- Performance
- Acknowledgements
- Citation

---

Abstract

Recent advances in video generation have been driven by diffusion models and autoregressive frameworks, yet critical challenges persist in harmonizing prompt adherence, visual quality, motion dynamics, and duration: compromises in motion dynamics to enhance temporal visual quality, constrained video duration (5-10 seconds) to prioritize resolution, and inadequate shot-aware generation stemming from general-purpose MLLMs' inability to interpret cinematic grammar, such as shot composition, actor expressions, and camera motions.
These intertwined limitations hinder realistic long-form synthesis and professional film-style generation. To address these limitations, we introduce SkyReels-V2, the world's first infinite-length film generative model using a Diffusion Forcing framework. Our approach synergizes Multi-modal Large Language Models (MLLM), Multi-stage Pretraining, Reinforcement Learning, and Diffusion Forcing techniques to achieve comprehensive optimization. Beyond its technical innovations, SkyReels-V2 enables multiple practical applications, including Story Generation, Image-to-Video Synthesis, Camera Director functionality, and multi-subject consistent video generation through our SkyReels-A2 system.

The SkyReels-V2 methodology consists of several interconnected components. It starts with a comprehensive data processing pipeline that prepares training data of various quality levels. At its core is the Video Captioner architecture, which provides detailed annotations for video content. The system employs a multi-task pretraining strategy to build fundamental video generation capabilities. Post-training optimization includes Reinforcement Learning to enhance motion quality, Diffusion Forcing Training for generating extended videos, and High-quality Supervised Fine-Tuning (SFT) stages for visual refinement. The model runs on optimized computational infrastructure for efficient training and inference. SkyReels-V2 supports multiple applications, including Story Generation, Image-to-Video Synthesis, Camera Director functionality, and Elements-to-Video Generation.

Video Captioner

SkyCaptioner-V1 serves as our video captioning model for data annotation. This model is trained on the captioning results from the base model Qwen2.5-VL-72B-Instruct and the sub-expert captioners, using balanced video data: a carefully curated dataset of approximately 2 million videos selected to ensure conceptual balance and annotation quality.
Built upon the Qwen2.5-VL-7B-Instruct foundation model, SkyCaptioner-V1 is fine-tuned to enhance performance in domain-specific video captioning tasks. To compare performance with SOTA models, we conducted a manual assessment of accuracy across different captioning fields using a test set of 1,000 samples. The proposed SkyCaptioner-V1 achieves the highest average accuracy among the baseline models (Qwen2.5-VL-7B-Ins., Qwen2.5-VL-72B-Ins., Tarsier2-Recap-7b), with a dramatic improvement in the shot-related fields.

Reinforcement Learning

Inspired by previous successes in LLMs, we propose to enhance the performance of the generative model through Reinforcement Learning. Specifically, we focus on motion quality, because we find the main drawbacks of our generative model are:

- the generative model does not handle large, deformable motions well;
- the generated videos may violate physical laws.

To avoid degradation in other metrics, such as text alignment and video quality, we ensure the preference data pairs have comparable text alignment and video quality, while only the motion quality varies. This requirement poses greater challenges in obtaining preference annotations due to the inherently higher costs of human annotation. To address this challenge, we propose a semi-automatic pipeline that strategically combines automatically generated motion pairs and human annotation results. This hybrid approach not only enhances the data scale but also improves alignment with human preferences through curated quality control. Leveraging this enhanced dataset, we first train a specialized reward model to capture the generic motion-quality differences between paired samples. This learned reward function subsequently guides the sample selection process for Direct Preference Optimization (DPO), enhancing the motion quality of the generative model.
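The reward-guided selection step described above can be sketched as a simple filter: keep a candidate pair for DPO only when the learned reward model prefers the chosen sample by a margin. The function names, the margin threshold, and the toy data below are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch of reward-guided pair selection for DPO (not the actual
# SkyReels-V2 pipeline): keep pairs where the learned motion-quality reward
# model prefers "chosen" over "rejected" by at least `margin`.

def select_dpo_pairs(pairs, reward_fn, margin=0.0):
    """pairs: iterable of (chosen, rejected) samples.
    reward_fn: maps a sample to a scalar motion-quality reward."""
    selected = []
    for chosen, rejected in pairs:
        if reward_fn(chosen) - reward_fn(rejected) > margin:
            selected.append((chosen, rejected))
    return selected

# Toy usage with a stand-in reward (here just a stored score).
pairs = [({"score": 0.9}, {"score": 0.2}),   # kept: reward agrees with label
         ({"score": 0.4}, {"score": 0.5})]   # dropped: reward disagrees
kept = select_dpo_pairs(pairs, reward_fn=lambda s: s["score"])
print(len(kept))  # 1
```

Filtering out pairs the reward model disagrees with keeps the DPO training signal consistent with the learned notion of motion quality.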
We introduce the Diffusion Forcing Transformer to unlock our model's ability to generate long videos. Diffusion Forcing is a training and sampling strategy in which each token is assigned an independent noise level. This allows tokens to be denoised according to arbitrary, per-token schedules. Conceptually, this approach functions as a form of partial masking: a token with zero noise is fully unmasked, while complete noise fully masks it. Diffusion Forcing trains the model to "unmask" any combination of variably noised tokens, using the cleaner tokens as conditional information to guide the recovery of noisy ones. Building on this, our Diffusion Forcing Transformer can extend video generation indefinitely based on the last frames of the previous segment. Note that synchronous full-sequence diffusion is a special case of Diffusion Forcing, in which all tokens share the same noise level. This relationship allows us to fine-tune the Diffusion Forcing Transformer from a full-sequence diffusion model.

We implement two sequential high-quality supervised fine-tuning (SFT) stages at 540p and 720p resolutions respectively, with the initial SFT phase conducted immediately after pretraining but prior to the reinforcement learning (RL) stage. This first-stage SFT serves as a conceptual equilibrium trainer, building upon the foundation model's pretraining outcomes that utilized only fps24 video data, while strategically removing FPS embedding components to streamline the architecture. Trained with high-quality concept-balanced samples, this phase establishes optimized initialization parameters for subsequent training processes. Following this, we execute a secondary high-resolution SFT at 720p after completing the diffusion forcing stage, incorporating identical loss formulations and higher-quality concept-balanced datasets curated by manual filtering. This final refinement phase focuses on resolution increase so that overall video quality is further enhanced.
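The per-token noise idea behind Diffusion Forcing can be illustrated with a toy schedule (our own sketch, not the paper's implementation): clean history tokens carry zero noise and act as conditioning, while new tokens are assigned their own noise level and denoised step by step.

```python
# Toy illustration of Diffusion Forcing's per-token noise levels (our own
# sketch, not the SkyReels-V2 implementation). Noise level 0.0 = fully
# "unmasked" (clean conditioning); 1.0 = fully masked (pure noise).

def noise_schedule(num_tokens: int, num_clean: int,
                   step: int, total_steps: int) -> list:
    """Per-token noise at a given denoising step: the first `num_clean`
    history tokens stay clean; the rest are denoised from 1.0 toward 0.0.
    Real Diffusion Forcing allows arbitrary per-token schedules."""
    remaining = 1.0 - step / total_steps
    return [0.0] * num_clean + [remaining] * (num_tokens - num_clean)

# Extending a segment: 2 clean history tokens condition 4 noisy new tokens.
print(noise_schedule(6, 2, 0, 10))  # [0.0, 0.0, 1.0, 1.0, 1.0, 1.0]

# Synchronous full-sequence diffusion is the special case num_clean == 0:
# every token shares the same noise level at each step.
print(noise_schedule(6, 0, 5, 10))  # [0.5, 0.5, 0.5, 0.5, 0.5, 0.5]
```

Setting `num_clean` to the overlap with the previous segment is what lets generation extend indefinitely from the last clean frames.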
To comprehensively evaluate our proposed method, we construct SkyReels-Bench for human assessment and leverage the open-source V-Bench for automated evaluation. This allows us to compare our model with state-of-the-art (SOTA) baselines, including both open-source and proprietary models.

For human evaluation, we design SkyReels-Bench with 1,020 text prompts, systematically assessing four dimensions: instruction adherence, motion quality, consistency, and visual quality. This benchmark is designed to evaluate both text-to-video (T2V) and image-to-video (I2V) generation models, providing comprehensive assessment across different generation paradigms. To ensure fairness, all models were evaluated under default settings with consistent resolutions, and no post-generation filtering was applied.

[Table: per-model scores for average, instruction adherence, consistency, visual quality, and motion quality]

The evaluation demonstrates that our model achieves significant advancements in instruction adherence (3.15) compared to baseline methods, while maintaining competitive performance in motion quality (2.74) without sacrificing consistency (3.35).

[Table: per-model scores for average, instruction adherence, consistency, visual quality, and motion quality]

Our results demonstrate that both SkyReels-V2-I2V (3.29) and SkyReels-V2-DF (3.24) achieve state-of-the-art performance among open-source models, significantly outperforming HunyuanVideo-13B (2.84) and Wan2.1-14B (2.85) across all quality dimensions. With an average score of 3.29, SkyReels-V2-I2V demonstrates performance comparable to the proprietary models Kling-1.6 (3.4) and Runway-Gen4 (3.39).

VBench

To objectively compare the SkyReels-V2 model against other leading open-source text-to-video models, we conduct comprehensive evaluations using the public benchmark V-Bench. Our evaluation specifically leverages the benchmark's longer-version prompts. For fair comparison with baseline models, we strictly follow their recommended settings for inference.
The VBench results demonstrate that SkyReels-V2 outperforms all compared models, including HunyuanVideo-13B and Wan2.1-14B, with the highest total score (83.9%) and quality score (84.7%). In this evaluation, the semantic score is slightly lower than Wan2.1-14B's, although we outperform Wan2.1-14B in human evaluations; the primary gap is attributed to V-Bench's insufficient evaluation of shot-scenario semantic adherence.

Acknowledgements

We would like to thank the contributors of the Wan 2.1, xDiT, and Qwen 2.5 repositories for their open research and contributions.

1,652 downloads · 37 likes

Skywork-o1-Open-Llama-3.1-8B

This model is based on the Meta Llama 3.1 8B Instruct architecture and is designed for text generation. It is distributed under a custom ("other") license.

llama · 1,435 downloads · 114 likes

Skywork-OR1-7B

1,297 downloads · 13 likes

Skywork-Reward-V2-Qwen3-1.7B

license:apache-2.0 · 913 downloads · 7 likes

Skywork-Reward-V2-Llama-3.1-8B-40M

llama · 912 downloads · 18 likes

Skywork-OR1-32B

841 downloads · 18 likes

SkyReels-V2-DF-14B-720P

πŸ“‘ Technical Report Β· πŸ‘‹ Playground Β· πŸ’¬ Discord Β· πŸ€— Hugging Face Β· πŸ€– ModelScope Β· 🌐 GitHub --- Welcome to the SkyReels V2 repository! Here, you'll find the model weights for our infinite-length film generative models. To the best of our knowledge, it represents the first open-source video generative model employing AutoRegressive Diffusion-Forcing architecture that achieves the SOTA performance among publicly available models. πŸ”₯πŸ”₯πŸ”₯ News!! Apr 24, 2025: πŸ”₯ We release the 720P models, SkyReels-V2-DF-14B-720P and SkyReels-V2-I2V-14B-720P. The former facilitates infinite-length autoregressive video generation, and the latter focuses on Image2Video synthesis. Apr 21, 2025: πŸ‘‹ We release the inference code and model weights of SkyReels-V2 Series Models and the video captioning model SkyCaptioner-V1 . Apr 3, 2025: πŸ”₯ We also release SkyReels-A2. This is an open-sourced controllable video generation framework capable of assembling arbitrary visual elements. Feb 18, 2025: πŸ”₯ we released SkyReels-A1. This is an open-sourced and effective framework for portrait image animation. Feb 18, 2025: πŸ”₯ We released SkyReels-V1. This is the first and most advanced open-source human-centric video foundation model. The demos above showcase 30-second videos generated using our SkyReels-V2 Diffusion Forcing model. 
- [x] Technical Report - [x] Checkpoints of the 14B and 1.3B Models Series - [x] Single-GPU & Multi-GPU Inference Code - [x] SkyCaptioner-V1 : A Video Captioning Model - [x] Prompt Enhancer - [ ] Diffusers integration - [ ] Checkpoints of the 5B Models Series - [ ] Checkpoints of the Camera Director Models - [ ] Checkpoints of the Step & Guidance Distill Model Model Download You can download our models from Hugging Face: Type Model Variant Recommended Height/Width/Frame Link Diffusion Forcing 1.3B-540P 544 960 97f πŸ€— Huggingface πŸ€– ModelScope 14B-720P 720 1280 121f πŸ€— Huggingface πŸ€– ModelScope 14B-720P 720 1280 121f πŸ€— Huggingface πŸ€– ModelScope Image-to-Video 1.3B-540P 544 960 97f πŸ€— Huggingface πŸ€– ModelScope 14B-720P 720 1280 121f πŸ€— Huggingface πŸ€– ModelScope After downloading, set the model path in your generation commands: The Diffusion Forcing version model allows us to generate Infinite-Length videos. This model supports both text-to-video (T2V) and image-to-video (I2V) tasks, and it can perform inference in both synchronous and asynchronous modes. Here we demonstrate 2 running scripts as examples for long video generation. If you want to adjust the inference parameters, e.g., the duration of video, inference mode, read the Note below first. > Note: > - If you want to run the image-to-video (I2V) task, add `--image ${imagepath}` to your command and it is also better to use text-to-video (T2V)-like prompt which includes some descriptions of the first-frame image. > - For long video generation, you can just switch the `--numframes`, e.g., `--numframes 257` for 10s video, `--numframes 377` for 15s video, `--numframes 737` for 30s video, `--numframes 1457` for 60s video. The number is not strictly aligned with the logical frame number for specified time duration, but it is aligned with some training parameters, which means it may perform better. 
When you use asynchronous inference with causalblocksize > 1, the `--numframes` should be carefully set. > - You can use `--arstep 5` to enable asynchronous inference. When asynchronous inference, `--causalblocksize 5` is recommended while it is not supposed to be set for synchronous generation. REMEMBER that the frame latent number inputted into the model in every iteration, e.g., base frame latent number (e.g., (97-1)//4+1=25 for basenumframes=97) and (e.g., (237-97-(97-17)x1+17-1)//4+1=20 for basenumframes=97, numframes=237, overlaphistory=17) for the last iteration, MUST be divided by causalblocksize. If you find it too hard to calculate and set proper values, just use our recommended setting above :). Asynchronous inference will take more steps to diffuse the whole sequence which means it will be SLOWER than synchronous mode. In our experiments, asynchronous inference may improve the instruction following and visual consistent performance. > - To reduce peak VRAM, just lower the `--basenumframes`, e.g., to 77 or 57, while keeping the same generative length `--numframes` you want to generate. This may slightly reduce video quality, and it should not be set too small. > - `--addnoisecondition` is used to help smooth the long video generation by adding some noise to the clean condition. Too large noise can cause the inconsistency as well. 20 is a recommended value, and you may try larger ones, but it is recommended to not exceed 50. > - Generating a 540P video using the 1.3B model requires approximately 14.7GB peak VRAM, while the same resolution video using the 14B model demands around 51.2GB peak VRAM. > Note: > - When using an image-to-video (I2V) model, you must provide an input image using the `--image ${imagepath}` parameter. The `--guidancescale 5.0` and `--shift 3.0` is recommended for I2V model. 
> - Generating a 540P video using the 1.3B model requires approximately 14.7GB peak VRAM, while the same resolution video using the 14B model demands around 43.4GB peak VRAM. The prompt enhancer is implemented based on Qwen2.5-32B-Instruct and is utilized via the `--promptenhancer` parameter. It works ideally for short prompts, while for long prompts, it might generate an excessively lengthy prompt that could lead to over-saturation in the generative video. Note the peak memory of GPU is 64G+ if you use `--promptenhancer`. If you want to obtain the enhanced prompt separately, you can also run the promptenhancer script separately for testing. The steps are as follows: > Note: > - `--promptenhancer` is not allowed if using `--useusp`. We recommend running the skyreelsv2infer/pipelines/promptenhancer.py script first to generate enhanced prompt before enabling the `--useusp` parameter. Below are the key parameters you can customize for video generation: | Parameter | Recommended Value | Description | |:----------------------:|:---------:|:-----------------------------------------:| | --prompt | | Text description for generating your video | | --image | | Path to input image for image-to-video generation | | --resolution | 540P or 720P | Output video resolution (select based on model type) | | --numframes | 97 or 121 | Total frames to generate (97 for 540P models, 121 for 720P models) | | --inferencesteps | 50 | Number of denoising steps | | --fps | 24 | Frames per second in the output video | | --shift | 8.0 or 5.0 | Flow matching scheduler parameter (8.0 for T2V, 5.0 for I2V) | | --guidancescale | 6.0 or 5.0 | Controls text adherence strength (6.0 for T2V, 5.0 for I2V) | | --seed | | Fixed seed for reproducible results (omit for random generation) | | --offload | True | Offloads model components to CPU to reduce VRAM usage (recommended) | | --useusp | True | Enables multi-GPU acceleration with xDiT USP | | --outdir | ./videoout | Directory where generated videos will 
be saved | | --promptenhancer | True | Expand the prompt into a more detailed description | | --teacache | False | Enables teacache for faster inference | | --teacachethresh | 0.2 | Higher speedup will cause to worse quality | | --useretsteps | False | Retention Steps for teacache | Diffusion Forcing Additional Parameters | Parameter | Recommended Value | Description | |:----------------------:|:---------:|:-----------------------------------------:| | --arstep | 0 | Controls asynchronous inference (0 for synchronous mode) | | --basenumframes | 97 or 121 | Base frame count (97 for 540P, 121 for 720P) | | --overlaphistory | 17 | Number of frames to overlap for smooth transitions in long videos | | --addnoisecondition | 20 | Improves consistency in long video generation | | --causalblocksize | 5 | Recommended when using asynchronous inference (--arstep > 0) | We use xDiT USP to accelerate inference. For example, to generate a video with 2 GPUs, you can use the following command: - Diffusion Forcing > Note: > - When using an image-to-video (I2V) model, you must provide an input image using the `--image ${imagepath}` parameter. The `--guidancescale 5.0` and `--shift 3.0` is recommended for I2V model. Contents - Abstract - Methodology of SkyReels-V2 - Key Contributions of SkyReels-V2 - Video Captioner - Reinforcement Learning - Diffusion Forcing - High-Quality Supervised Fine-Tuning(SFT) - Performance - Acknowledgements - Citation --- Abstract Recent advances in video generation have been driven by diffusion models and autoregressive frameworks, yet critical challenges persist in harmonizing prompt adherence, visual quality, motion dynamics, and duration: compromises in motion dynamics to enhance temporal visual quality, constrained video duration (5-10 seconds) to prioritize resolution, and inadequate shot-aware generation stemming from general-purpose MLLMs' inability to interpret cinematic grammar, such as shot composition, actor expressions, and camera motions. 
These intertwined limitations hinder realistic long-form synthesis and professional film-style generation. To address these limitations, we introduce SkyReels-V2, the world's first infinite-length film generative model using a Diffusion Forcing framework. Our approach synergizes Multi-modal Large Language Models (MLLM), Multi-stage Pretraining, Reinforcement Learning, and Diffusion Forcing techniques to achieve comprehensive optimization. Beyond its technical innovations, SkyReels-V2 enables multiple practical applications, including Story Generation, Image-to-Video Synthesis, Camera Director functionality, and multi-subject consistent video generation through our Skyreels-A2 system. The SkyReels-V2 methodology consists of several interconnected components. It starts with a comprehensive data processing pipeline that prepares various quality training data. At its core is the Video Captioner architecture, which provides detailed annotations for video content. The system employs a multi-task pretraining strategy to build fundamental video generation capabilities. Post-training optimization includes Reinforcement Learning to enhance motion quality, Diffusion Forcing Training for generating extended videos, and High-quality Supervised Fine-Tuning (SFT) stages for visual refinement. The model runs on optimized computational infrastructure for efficient training and inference. SkyReels-V2 supports multiple applications, including Story Generation, Image-to-Video Synthesis, Camera Director functionality, and Elements-to-Video Generation. SkyCaptioner-V1 serves as our video captioning model for data annotation. This model is trained on the captioning result from the base model Qwen2.5-VL-72B-Instruct and the sub-expert captioners on a balanced video data. The balanced video data is a carefully curated dataset of approximately 2 million videos to ensure conceptual balance and annotation quality. 
Built upon the Qwen2.5-VL-7B-Instruct foundation model, SkyCaptioner-V1 is fine-tuned to enhance performance in domain-specific video captioning tasks. To compare the performance with the SOTA models, we conducted a manual assessment of accuracy across different captioning fields using a test set of 1,000 samples. The proposed SkyCaptioner-V1 achieves the highest average accuracy among the baseline models, and show a dramatic result in the shot related fields model Qwen2.5-VL-7B-Ins. Qwen2.5-VL-72B-Ins. Tarsier2-Recap-7b SkyCaptioner-V1 Reinforcement Learning Inspired by the previous success in LLM, we propose to enhance the performance of the generative model by Reinforcement Learning. Specifically, we focus on the motion quality because we find that the main drawback of our generative model is: - the generative model does not handle well with large, deformable motions. - the generated videos may violate the physical law. To avoid the degradation in other metrics, such as text alignment and video quality, we ensure the preference data pairs have comparable text alignment and video quality, while only the motion quality varies. This requirement poses greater challenges in obtaining preference annotations due to the inherently higher costs of human annotation. To address this challenge, we propose a semi-automatic pipeline that strategically combines automatically generated motion pairs and human annotation results. This hybrid approach not only enhances the data scale but also improves alignment with human preferences through curated quality control. Leveraging this enhanced dataset, we first train a specialized reward model to capture the generic motion quality differences between paired samples. This learned reward function subsequently guides the sample selection process for Direct Preference Optimization (DPO), enhancing the motion quality of the generative model. 
We introduce the Diffusion Forcing Transformer to unlock our model’s ability to generate long videos. Diffusion Forcing is a training and sampling strategy where each token is assigned an independent noise level. This allows tokens to be denoised according to arbitrary, per-token schedules. Conceptually, this approach functions as a form of partial masking: a token with zero noise is fully unmasked, while complete noise fully masks it. Diffusion Forcing trains the model to "unmask" any combination of variably noised tokens, using the cleaner tokens as conditional information to guide the recovery of noisy ones. Building on this, our Diffusion Forcing Transformer can extend video generation indefinitely based on the last frames of the previous segment. Note that the synchronous full sequence diffusion is a special case of Diffusion Forcing, where all tokens share the same noise level. This relationship allows us to fine-tune the Diffusion Forcing Transformer from a full-sequence diffusion model. We implement two sequential high-quality supervised fine-tuning (SFT) stages at 540p and 720p resolutions respectively, with the initial SFT phase conducted immediately after pretraining but prior to reinforcement learning (RL) stage.This first-stage SFT serves as a conceptual equilibrium trainer, building upon the foundation model’s pretraining outcomes that utilized only fps24 video data, while strategically removing FPS embedding components to streamline thearchitecture. Trained with the high-quality concept-balanced samples, this phase establishes optimized initialization parameters for subsequent training processes. Following this, we execute a secondary high-resolution SFT at 720p after completing the diffusion forcing stage, incorporating identical loss formulations and the higher-quality concept-balanced datasets by the manually filter. This final refinement phase focuses on resolution increase such that the overall video quality will be further enhanced. 
To comprehensively evaluate our proposed method, we construct the SkyReels-Bench for human assessment and leveraged the open-source V-Bench for automated evaluation. This allows us to compare our model with the state-of-the-art (SOTA) baselines, including both open-source and proprietary models. For human evaluation, we design SkyReels-Bench with 1,020 text prompts, systematically assessing three dimensions: Instruction Adherence, Motion Quality, Consistency and Visual Quality. This benchmark is designed to evaluate both text-to-video (T2V) and image-to-video (I2V) generation models, providing comprehensive assessment across different generation paradigms. To ensure fairness, all models were evaluated under default settings with consistent resolutions, and no post-generation filtering was applied. Model Name Average Instruction Adherence Consistency Visual Quality Motion Quality The evaluation demonstrates that our model achieves significant advancements in instruction adherence (3.15) compared to baseline methods, while maintaining competitive performance in motion quality (2.74) without sacrificing the consistency (3.35). Model Average Instruction Adherence Consistency Visual Quality Motion Quality Our results demonstrate that both SkyReels-V2-I2V (3.29) and SkyReels-V2-DF (3.24) achieve state-of-the-art performance among open-source models, significantly outperforming HunyuanVideo-13B (2.84) and Wan2.1-14B (2.85) across all quality dimensions. With an average score of 3.29, SkyReels-V2-I2V demonstrates comparable performance to proprietary models Kling-1.6 (3.4) and Runway-Gen4 (3.39). VBench To objectively compare SkyReels-V2 Model against other leading open-source Text-To-Video models, we conduct comprehensive evaluations using the public benchmark V-Bench . Our evaluation specifically leverages the benchmark’s longer version prompt. For fair comparison with baseline models, we strictly follow their recommended setting for inference. 
The VBench results demonstrate that SkyReels-V2 outperforms all compared models including HunyuanVideo-13B and Wan2.1-14B, With the highest total score (83.9%) and quality score (84.7%). In this evaluation, the semantic score is slightly lower than Wan2.1-14B, while we outperform Wan2.1-14B in human evaluations, with the primary gap attributed to V-Bench’s insufficient evaluation of shot-scenario semantic adherence. Acknowledgements We would like to thank the contributors of Wan 2.1 , XDit and Qwen 2.5 repositories, for their open research and contributions.

NaNK
β€”
749
27

SkyReels-V2-T2V-14B-720P

NaNK
β€”
724
38

Skywork-R1V3-38B

NaNK
license:mit
665
87

SkyReels-V1-Hunyuan-I2V

---

This repo contains Diffusers-format model weights for SkyReels V1 Image-to-Video models. You can find the inference code in our GitHub repository, SkyReels-V1.

Introduction

SkyReels V1 is the first and most advanced open-source human-centric video foundation model. By fine-tuning HunyuanVideo on O(10M) high-quality film and television clips, SkyReels V1 offers three key advantages:

1. Open-Source Leadership: Our Text-to-Video model achieves state-of-the-art (SOTA) performance among open-source models, comparable to proprietary models like Kling and Hailuo.
2. Advanced Facial Animation: Captures 33 distinct facial expressions with over 400 natural movement combinations, accurately reflecting human emotions.
3. Cinematic Lighting and Aesthetics: Trained on high-quality Hollywood-level film and television data, each generated frame exhibits cinematic quality in composition, actor positioning, and camera angles.

1. Self-Developed Data Cleaning and Annotation Pipeline

Our model is built on a self-developed data cleaning and annotation pipeline, creating a vast dataset of high-quality film, television, and documentary content.

- Expression Classification: Categorizes human facial expressions into 33 distinct types.
- Character Spatial Awareness: Utilizes 3D human reconstruction technology to understand spatial relationships between multiple people in a video, enabling film-level character positioning.
- Action Recognition: Constructs over 400 action semantic units to achieve a precise understanding of human actions.
- Scene Understanding: Conducts cross-modal correlation analysis of clothing, scenes, and plots.

Our multi-stage pretraining pipeline, inspired by the HunyuanVideo design, consists of the following stages:

- Stage 1: Model Domain Transfer Pretraining: We use a large dataset (O(10M) film and television clips) to adapt the text-to-video model to the human-centric video domain.
- Stage 2: Image-to-Video Model Pretraining: We convert the text-to-video model from Stage 1 into an image-to-video model by adjusting the conv-in parameters. This new model is then pretrained on the same dataset used in Stage 1.
- Stage 3: High-Quality Fine-Tuning: We fine-tune the image-to-video model on a high-quality subset of the original dataset, ensuring superior performance and quality.

Model Introduction

| Model Name | Resolution | Video Length | FPS | Download Link |
|-----------------|------------|--------------|-----|---------------|
| SkyReels-V1-Hunyuan-I2V (Current) | 544 Γ— 960 | 97 frames | 24 | πŸ€— Download |
| SkyReels-V1-Hunyuan-T2V | 544 Γ— 960 | 97 frames | 24 | πŸ€— Download |
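The Stage 2 conv-in adjustment mentioned above can be sketched as follows. This is a minimal illustration, not the actual HunyuanVideo code: the shapes, channel counts, and the `inflate_conv_in` helper are hypothetical, but the core idea is standard, as it grows the input convolution along the channel axis and zero-initializes the new channels so the pretrained text-to-video behavior is preserved at the start of image-to-video pretraining.

```python
import numpy as np

def inflate_conv_in(weight: np.ndarray, extra_in_ch: int) -> np.ndarray:
    """Grow a conv-in weight tensor along the input-channel axis.

    New channels (e.g. for the encoded first frame plus a mask) are
    zero-initialized, so the inflated layer initially ignores them and
    reproduces the pretrained T2V model's output exactly.
    """
    out_ch, in_ch, *kernel = weight.shape
    new_weight = np.zeros((out_ch, in_ch + extra_in_ch, *kernel), dtype=weight.dtype)
    new_weight[:, :in_ch] = weight  # copy pretrained weights unchanged
    return new_weight

# Illustrative shapes only (out_ch=128, in_ch=16, 3D kernel 1x2x2).
rng = np.random.default_rng(0)
w = rng.standard_normal((128, 16, 1, 2, 2)).astype(np.float32)
w2 = inflate_conv_in(w, extra_in_ch=17)
print(w2.shape)  # β†’ (128, 33, 1, 2, 2)
```

Because the extra channels start at zero, conditioning inputs contribute nothing until fine-tuning updates those weights, which makes the domain transfer stable.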

β€”
652
273

Skywork-Reward-Llama-3.1-8B

NaNK
llama
581
30

Skywork-VL-Reward-7B

NaNK
license:mit
562
46

SkyReels-V1-Hunyuan-T2V

license:apache-2.0
548
90

Skywork-Reward-Gemma-2-27B

NaNK
β€”
509
46

Skywork-13B-base

NaNK
β€”
505
69

Skywork-Reward-Gemma-2-27B-v0.2

This model is based on the transformers library and utilizes the google/gemma-2-27b-it base model.

NaNK
β€”
465
33

SkyCaptioner-V1

license:apache-2.0
366
49

Skywork-OR1-Math-7B

NaNK
β€”
355
13

SkyReels V2 I2V 14B 720P

πŸ“‘ Technical Report Β· πŸ‘‹ Playground Β· πŸ’¬ Discord Β· πŸ€— Hugging Face Β· πŸ€– ModelScope Β· 🌐 GitHub

---

Welcome to the SkyReels V2 repository! Here, you'll find the model weights for our infinite-length film generative models. To the best of our knowledge, this is the first open-source video generative model employing an autoregressive Diffusion-Forcing architecture to achieve SOTA performance among publicly available models.

πŸ”₯πŸ”₯πŸ”₯ News!!

- Apr 24, 2025: πŸ”₯ We release the 720P models, SkyReels-V2-DF-14B-720P and SkyReels-V2-I2V-14B-720P. The former facilitates infinite-length autoregressive video generation, and the latter focuses on image-to-video synthesis.
- Apr 21, 2025: πŸ‘‹ We release the inference code and model weights of the SkyReels-V2 series models and the video captioning model SkyCaptioner-V1.
- Apr 3, 2025: πŸ”₯ We also release SkyReels-A2, an open-source controllable video generation framework capable of assembling arbitrary visual elements.
- Feb 18, 2025: πŸ”₯ We released SkyReels-A1, an open-source and effective framework for portrait image animation.
- Feb 18, 2025: πŸ”₯ We released SkyReels-V1, the first and most advanced open-source human-centric video foundation model.

The demos above showcase 30-second videos generated using our SkyReels-V2 Diffusion Forcing model.
- [x] Technical Report
- [x] Checkpoints of the 14B and 1.3B Model Series
- [x] Single-GPU & Multi-GPU Inference Code
- [x] SkyCaptioner-V1: A Video Captioning Model
- [x] Prompt Enhancer
- [ ] Diffusers integration
- [ ] Checkpoints of the 5B Model Series
- [ ] Checkpoints of the Camera Director Models
- [ ] Checkpoints of the Step & Guidance Distill Model

Model Download

You can download our models from Hugging Face:

| Type | Model Variant | Recommended Height/Width/Frame | Link |
|:---:|:---:|:---:|:---:|
| Diffusion Forcing | 1.3B-540P | 544 Γ— 960 Γ— 97f | πŸ€— Huggingface Β· πŸ€– ModelScope |
| Diffusion Forcing | 14B-720P | 720 Γ— 1280 Γ— 121f | πŸ€— Huggingface Β· πŸ€– ModelScope |
| Diffusion Forcing | 14B-720P | 720 Γ— 1280 Γ— 121f | πŸ€— Huggingface Β· πŸ€– ModelScope |
| Image-to-Video | 1.3B-540P | 544 Γ— 960 Γ— 97f | πŸ€— Huggingface Β· πŸ€– ModelScope |
| Image-to-Video | 14B-720P | 720 Γ— 1280 Γ— 121f | πŸ€— Huggingface Β· πŸ€– ModelScope |

After downloading, set the model path in your generation commands.

The Diffusion Forcing version of the model allows us to generate infinite-length videos. It supports both text-to-video (T2V) and image-to-video (I2V) tasks, and can perform inference in both synchronous and asynchronous modes. Here we demonstrate two running scripts as examples for long video generation. If you want to adjust the inference parameters, e.g., the video duration or the inference mode, read the Note below first.

> Note:
> - If you want to run the image-to-video (I2V) task, add `--image ${imagepath}` to your command; it is also better to use a text-to-video (T2V)-style prompt that includes some description of the first-frame image.
> - For long video generation, simply adjust `--numframes`, e.g., `--numframes 257` for a 10s video, `--numframes 377` for a 15s video, `--numframes 737` for a 30s video, `--numframes 1457` for a 60s video. The number is not strictly aligned with the logical frame count for the specified duration, but it is aligned with some training parameters, which means it may perform better.
> When using asynchronous inference with causalblocksize > 1, `--numframes` must be set carefully.
> - You can use `--arstep 5` to enable asynchronous inference. For asynchronous inference, `--causalblocksize 5` is recommended, while it should not be set for synchronous generation. REMEMBER that the frame latent number fed into the model at every iteration, e.g., the base frame latent number ((97-1)//4+1=25 for basenumframes=97) and that of the last iteration ((237-97-(97-17)Γ—1+17-1)//4+1=20 for basenumframes=97, numframes=237, overlaphistory=17), MUST be divisible by causalblocksize. If you find it too hard to calculate proper values, just use our recommended settings above :). Asynchronous inference takes more steps to diffuse the whole sequence, which means it is SLOWER than synchronous mode. In our experiments, asynchronous inference may improve instruction following and visual consistency.
> - To reduce peak VRAM, lower `--basenumframes`, e.g., to 77 or 57, while keeping the same generative length `--numframes`. This may slightly reduce video quality, and it should not be set too small.
> - `--addnoisecondition` helps smooth long video generation by adding some noise to the clean condition. Too much noise can cause inconsistency as well. 20 is a recommended value; you may try larger ones, but it is recommended not to exceed 50.
> - Generating a 540P video using the 1.3B model requires approximately 14.7GB peak VRAM, while the same resolution video using the 14B model demands around 51.2GB peak VRAM.

> Note:
> - When using an image-to-video (I2V) model, you must provide an input image using the `--image ${imagepath}` parameter. `--guidancescale 5.0` and `--shift 3.0` are recommended for the I2V model.
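The divisibility arithmetic in the asynchronous-inference note above can be sanity-checked with a few lines of Python. This is a sketch of the quoted formulas only; the `latent_len` and `check_async_config` helpers are hypothetical names, and the actual pipeline's windowing logic may differ in detail.

```python
def latent_len(frames: int) -> int:
    # VAE temporal compression: 4 frames per latent step, plus the first frame,
    # reproducing the (97-1)//4+1 = 25 example from the note.
    return (frames - 1) // 4 + 1

def check_async_config(base_num_frames=97, num_frames=237,
                       overlap_history=17, causal_block_size=5):
    base_latents = latent_len(base_num_frames)       # first iteration
    stride = base_num_frames - overlap_history       # new frames per extra iteration
    remaining = num_frames - base_num_frames
    iters = -(-remaining // stride)                  # extra iterations (ceiling)
    # Frames processed in the last iteration, mirroring the
    # (237 - 97 - (97-17)*1 + 17) expression quoted in the note.
    last_frames = num_frames - base_num_frames - stride * (iters - 1) + overlap_history
    last_latents = latent_len(last_frames)
    ok = (base_latents % causal_block_size == 0
          and last_latents % causal_block_size == 0)
    return base_latents, last_latents, ok

print(check_async_config())  # β†’ (25, 20, True): both divisible by causalblocksize=5
```

Running the check with a candidate `--numframes` before launching a long generation avoids discovering an invalid configuration mid-run.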
> - Generating a 540P video using the 1.3B model requires approximately 14.7GB peak VRAM, while the same resolution video using the 14B model demands around 43.4GB peak VRAM.

The prompt enhancer is implemented based on Qwen2.5-32B-Instruct and is enabled via the `--promptenhancer` parameter. It works well for short prompts, while for long prompts it might generate an excessively lengthy prompt that could lead to over-saturation in the generated video. Note that peak GPU memory exceeds 64GB when using `--promptenhancer`. If you want to obtain the enhanced prompt separately, you can also run the prompt enhancer script on its own for testing. The steps are as follows:

> Note:
> - `--promptenhancer` is not allowed when using `--useusp`. We recommend running the skyreelsv2infer/pipelines/promptenhancer.py script first to generate the enhanced prompt before enabling the `--useusp` parameter.

Below are the key parameters you can customize for video generation:

| Parameter | Recommended Value | Description |
|:----------------------:|:---------:|:-----------------------------------------:|
| --prompt | | Text description for generating your video |
| --image | | Path to input image for image-to-video generation |
| --resolution | 540P or 720P | Output video resolution (select based on model type) |
| --numframes | 97 or 121 | Total frames to generate (97 for 540P models, 121 for 720P models) |
| --inferencesteps | 50 | Number of denoising steps |
| --fps | 24 | Frames per second in the output video |
| --shift | 8.0 or 5.0 | Flow matching scheduler parameter (8.0 for T2V, 5.0 for I2V) |
| --guidancescale | 6.0 or 5.0 | Controls text adherence strength (6.0 for T2V, 5.0 for I2V) |
| --seed | | Fixed seed for reproducible results (omit for random generation) |
| --offload | True | Offloads model components to CPU to reduce VRAM usage (recommended) |
| --useusp | True | Enables multi-GPU acceleration with xDiT USP |
| --outdir | ./videoout | Directory where generated videos will be saved |
| --promptenhancer | True | Expands the prompt into a more detailed description |
| --teacache | False | Enables TeaCache for faster inference |
| --teacachethresh | 0.2 | Higher values give more speedup at the cost of quality |
| --useretsteps | False | Retention steps for TeaCache |

Diffusion Forcing Additional Parameters

| Parameter | Recommended Value | Description |
|:----------------------:|:---------:|:-----------------------------------------:|
| --arstep | 0 | Controls asynchronous inference (0 for synchronous mode) |
| --basenumframes | 97 or 121 | Base frame count (97 for 540P, 121 for 720P) |
| --overlaphistory | 17 | Number of frames to overlap for smooth transitions in long videos |
| --addnoisecondition | 20 | Improves consistency in long video generation |
| --causalblocksize | 5 | Recommended when using asynchronous inference (--arstep > 0) |

We use xDiT USP to accelerate inference. For example, to generate a video with 2 GPUs, you can use the following command:

- Diffusion Forcing

> Note:
> - When using an image-to-video (I2V) model, you must provide an input image using the `--image ${imagepath}` parameter. `--guidancescale 5.0` and `--shift 3.0` are recommended for the I2V model.

Contents
- Abstract
- Methodology of SkyReels-V2
- Key Contributions of SkyReels-V2
- Video Captioner
- Reinforcement Learning
- Diffusion Forcing
- High-Quality Supervised Fine-Tuning (SFT)
- Performance
- Acknowledgements
- Citation

---

Abstract

Recent advances in video generation have been driven by diffusion models and autoregressive frameworks, yet critical challenges persist in harmonizing prompt adherence, visual quality, motion dynamics, and duration: compromises in motion dynamics to enhance temporal visual quality, constrained video duration (5-10 seconds) to prioritize resolution, and inadequate shot-aware generation stemming from general-purpose MLLMs' inability to interpret cinematic grammar, such as shot composition, actor expressions, and camera motions.
These intertwined limitations hinder realistic long-form synthesis and professional film-style generation. To address these limitations, we introduce SkyReels-V2, the world's first infinite-length film generative model using a Diffusion Forcing framework. Our approach synergizes Multi-modal Large Language Models (MLLMs), multi-stage pretraining, Reinforcement Learning, and Diffusion Forcing techniques to achieve comprehensive optimization. Beyond its technical innovations, SkyReels-V2 enables multiple practical applications, including Story Generation, Image-to-Video Synthesis, Camera Director functionality, and multi-subject consistent video generation through our SkyReels-A2 system.

The SkyReels-V2 methodology consists of several interconnected components. It starts with a comprehensive data processing pipeline that prepares training data of varying quality. At its core is the Video Captioner architecture, which provides detailed annotations for video content. The system employs a multi-task pretraining strategy to build fundamental video generation capabilities. Post-training optimization includes Reinforcement Learning to enhance motion quality, Diffusion Forcing training for generating extended videos, and high-quality Supervised Fine-Tuning (SFT) stages for visual refinement. The model runs on optimized computational infrastructure for efficient training and inference. SkyReels-V2 supports multiple applications, including Story Generation, Image-to-Video Synthesis, Camera Director functionality, and Elements-to-Video Generation.

SkyCaptioner-V1 serves as our video captioning model for data annotation. This model is trained on the captioning results from the base model Qwen2.5-VL-72B-Instruct and the sub-expert captioners on balanced video data. The balanced video data is a carefully curated dataset of approximately 2 million videos, selected to ensure conceptual balance and annotation quality.
Built upon the Qwen2.5-VL-7B-Instruct foundation model, SkyCaptioner-V1 is fine-tuned to enhance performance in domain-specific video captioning tasks. To compare performance with SOTA models, we conducted a manual assessment of accuracy across different captioning fields using a test set of 1,000 samples. SkyCaptioner-V1 achieves the highest average accuracy among the baseline models (Qwen2.5-VL-7B-Instruct, Qwen2.5-VL-72B-Instruct, and Tarsier2-Recap-7b), and shows a dramatic improvement in the shot-related fields.

Reinforcement Learning

Inspired by previous successes with LLMs, we propose to enhance the performance of the generative model through Reinforcement Learning. Specifically, we focus on motion quality, because we find that the main drawbacks of our generative model are:

- it does not handle large, deformable motions well;
- the generated videos may violate physical laws.

To avoid degradation in other metrics, such as text alignment and video quality, we ensure the preference data pairs have comparable text alignment and video quality, with only the motion quality varying. This requirement poses greater challenges in obtaining preference annotations due to the inherently higher costs of human annotation. To address this challenge, we propose a semi-automatic pipeline that strategically combines automatically generated motion pairs and human annotation results. This hybrid approach not only enhances the data scale but also improves alignment with human preferences through curated quality control. Leveraging this enhanced dataset, we first train a specialized reward model to capture the generic motion quality differences between paired samples. This learned reward function subsequently guides the sample selection process for Direct Preference Optimization (DPO), enhancing the motion quality of the generative model.
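The DPO step described above optimizes a preference margin against a frozen reference model. As a minimal illustration (not Skywork's actual implementation, and with scalar log-probabilities standing in for full sequence likelihoods), the per-pair loss can be sketched as:

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """-log sigmoid(beta * margin), where the margin is how much more the
    policy prefers the motion-preferred sample over the rejected one,
    relative to the frozen reference model."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# When policy and reference agree exactly, the margin is 0 and the loss is log(2).
# A positive margin (policy already prefers the chosen sample more) lowers it.
print(dpo_loss(-10.0, -12.0, -11.0, -11.0) < math.log(2.0))  # β†’ True
```

In the pipeline above, the learned reward model selects which generated pairs are fed into this objective, so annotation cost stays bounded while the loss still reflects human motion-quality preferences.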
We introduce the Diffusion Forcing Transformer to unlock our model's ability to generate long videos. Diffusion Forcing is a training and sampling strategy where each token is assigned an independent noise level. This allows tokens to be denoised according to arbitrary, per-token schedules. Conceptually, this approach functions as a form of partial masking: a token with zero noise is fully unmasked, while complete noise fully masks it. Diffusion Forcing trains the model to "unmask" any combination of variably noised tokens, using the cleaner tokens as conditional information to guide the recovery of noisy ones. Building on this, our Diffusion Forcing Transformer can extend video generation indefinitely based on the last frames of the previous segment. Note that synchronous full-sequence diffusion is a special case of Diffusion Forcing, where all tokens share the same noise level. This relationship allows us to fine-tune the Diffusion Forcing Transformer from a full-sequence diffusion model.

We implement two sequential high-quality supervised fine-tuning (SFT) stages at 540p and 720p resolutions respectively, with the initial SFT phase conducted immediately after pretraining but prior to the reinforcement learning (RL) stage. This first-stage SFT serves as a conceptual equilibrium trainer, building upon the foundation model's pretraining outcomes, which used only 24 FPS video data, while strategically removing FPS embedding components to streamline the architecture. Trained with high-quality concept-balanced samples, this phase establishes optimized initialization parameters for subsequent training processes. Following this, we execute a secondary high-resolution SFT at 720p after completing the Diffusion Forcing stage, incorporating identical loss formulations and higher-quality, manually filtered concept-balanced datasets. This final refinement phase focuses on increasing resolution, further enhancing overall video quality.
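The per-token noise assignment at the heart of the Diffusion Forcing strategy described above can be sketched in a few lines. This is a conceptual toy, not the actual implementation: real diffusion schedules and latent shapes differ, and the linear interpolation here merely illustrates how noise level 0 leaves a token clean ("unmasked") while level 1 fully masks it.

```python
import numpy as np

rng = np.random.default_rng(0)
tokens = rng.standard_normal((8, 4))   # 8 latent frames, feature dim 4 (illustrative)
noise = rng.standard_normal(tokens.shape)

noise_levels = rng.uniform(size=8)     # independent per-token noise schedule
noise_levels[0] = 0.0                  # e.g. a clean conditioning frame from the previous segment
noise_levels[-1] = 1.0                 # e.g. a fully masked frame to be generated

# Toy corruption: interpolate each token toward pure noise by its own level.
noisy = (1.0 - noise_levels)[:, None] * tokens + noise_levels[:, None] * noise

assert np.allclose(noisy[0], tokens[0])    # zero noise: token untouched
assert np.allclose(noisy[-1], noise[-1])   # full noise: token fully masked
```

Setting every entry of `noise_levels` to the same value recovers synchronous full-sequence diffusion, which is why a full-sequence model can be fine-tuned into a Diffusion Forcing Transformer.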
To comprehensively evaluate our proposed method, we construct SkyReels-Bench for human assessment and leverage the open-source V-Bench for automated evaluation. This allows us to compare our model with state-of-the-art (SOTA) baselines, including both open-source and proprietary models.

For human evaluation, we design SkyReels-Bench with 1,020 text prompts, systematically assessing four dimensions: Instruction Adherence, Motion Quality, Consistency, and Visual Quality. This benchmark is designed to evaluate both text-to-video (T2V) and image-to-video (I2V) generation models, providing comprehensive assessment across different generation paradigms. To ensure fairness, all models were evaluated under default settings with consistent resolutions, and no post-generation filtering was applied.

(Table: Model Name | Average | Instruction Adherence | Consistency | Visual Quality | Motion Quality)

The evaluation demonstrates that our model achieves significant advancements in instruction adherence (3.15) compared to baseline methods, while maintaining competitive performance in motion quality (2.74) without sacrificing consistency (3.35).

(Table: Model | Average | Instruction Adherence | Consistency | Visual Quality | Motion Quality)

Our results demonstrate that both SkyReels-V2-I2V (3.29) and SkyReels-V2-DF (3.24) achieve state-of-the-art performance among open-source models, significantly outperforming HunyuanVideo-13B (2.84) and Wan2.1-14B (2.85) across all quality dimensions. With an average score of 3.29, SkyReels-V2-I2V demonstrates performance comparable to the proprietary models Kling-1.6 (3.40) and Runway-Gen4 (3.39).

VBench

To objectively compare the SkyReels-V2 model against other leading open-source text-to-video models, we conduct comprehensive evaluations using the public benchmark V-Bench. Our evaluation specifically leverages the benchmark's longer-version prompts. For fair comparison with baseline models, we strictly follow their recommended inference settings.
The VBench results demonstrate that SkyReels-V2 outperforms all compared models, including HunyuanVideo-13B and Wan2.1-14B, with the highest total score (83.9%) and quality score (84.7%). In this evaluation, the semantic score is slightly lower than Wan2.1-14B's, while we outperform Wan2.1-14B in human evaluations, with the primary gap attributed to V-Bench's insufficient evaluation of shot-scenario semantic adherence.

Acknowledgements

We would like to thank the contributors of the Wan 2.1, xDiT, and Qwen 2.5 repositories for their open research and contributions.

NaNK
β€”
326
31

Skywork-o1-Open-PRM-Qwen-2.5-7B

NaNK
β€”
302
51

SkyReels-V2-DF-1.3B-540P-Diffusers

NaNK
β€”
203
1

Skywork-R1V2-38B

Skywork-R1V2-38B is a state-of-the-art open-source multimodal reasoning model, achieving top-tier performance across multiple benchmarks:

- On MMMU, it scores 73.6%, the highest among all open-source models to date.
- On OlympiadBench, it achieves 62.6%, leading by a large margin over other open models.
- R1V2 also performs strongly on MathVision, MMMU-Pro, and MathVista, rivaling proprietary commercial models.
- Overall, R1V2 stands out as a high-performing, open-source VLM combining powerful visual reasoning and text understanding.

| Model Name | Vision Encoder | Language Model | Hugging Face Link |
|---|---|---|---|
| Skywork-R1V2-38B | InternViT-6B-448px-V2_5 | Qwen/QwQ-32B | πŸ€— Link |

| Model | Supports Vision | AIME24 | LiveCodeBench | LiveBench | IFEVAL | BFCL | GPQA | MMMU (val) | MathVista (mini) | MathVision (mini) | OlympiadBench | MMMU-Pro |
|---|:---:|---|---|---|---|---|---|---|---|---|---|---|
| R1V2‑38B | βœ… | 78.9 | 63.6 | 73.2 | 82.9 | 66.3 | 61.6 | 73.6 | 74.0 | 49.0 | 62.6 | 52.0 |
| R1V1‑38B | βœ… | 72.0 | 57.2 | 54.6 | 72.5 | 53.5 | – | 68.0 | 67.0 | – | 40.4 | – |
| Deepseek‑R1‑671B | ❌ | 74.3 | 65.9 | 71.6 | 83.3 | 60.3 | 71.5 | – | – | – | – | – |
| GPT‑o4‑mini | βœ… | 93.4 | 74.6 | 78.1 | – | – | 49.9 | 81.6 | 84.3 | 58.0 | – | – |
| Qwen2.5‑VL‑72B‑Instruct | βœ… | – | – | – | – | – | – | 70.2 | 74.8 | – | – | – |

Evaluation Results of State-of-the-Art LLMs and VLMs (columns AIME24 through GPQA measure text reasoning in %; the remaining columns measure multimodal reasoning in %).

4. Citation

If you use Skywork-R1V in your research, please cite:

This project is released under an open-source license.

license:mit
198
126

SkyReels-V2-T2V-14B-540P

β€”
188
14

MindLink-72B-0801

license:apache-2.0
172
33

SkyReels-V2-DF-14B-540P

β€”
149
17

SkyReels-V2-DF-14B-540P-Diffusers

β€”
148
3

Skywork-R1V3-38B-GGUF

license:mit
144
7

Skywork-MoE-Base

β€”
137
41

Skywork-Critic-Llama-3.1-8B

llama
132
12

Skywork-13B-Math

β€”
129
10

SkyReels-V2-I2V-1.3B-540P-Diffusers

β€”
125
2

Skywork-SWE-32B

license:apache-2.0
124
77

MindLink-32B-0801

license:apache-2.0
122
35

Skywork-Critic-Llama-3.1-70B

llama
120
10

SkyReels-V2-I2V-1.3B-540P

β€”
114
43

Skywork-MoE-Base-FP8

β€”
113
7

Skywork-Reward-V2-Llama-3.2-3B

llama
112
4

Skywork-OR1-7B-Preview

β€”
96
13

Skywork-OR1-32B-Preview

β€”
95
68

Unipic3

license:mit
71
18

Skywork-R1V-38B-AWQ

license:mit
71
7

Skywork-R1V3-38B-AWQ

[](https://github.com/SkyworkAI/Skywork-R1V/stargazers)[](https://github.com/SkyworkAI/Skywork-R1V/fork)

Comprehensive performance comparison across text and multimodal reasoning benchmarks.

Usage

You can use the quantized model with different inference frameworks, e.g. vLLM.

Using vLLM

The AWQ quantization reduces the memory footprint compared to the original FP16 model. We recommend:
- At least one GPU with 30 GB+ VRAM for inference
- For optimal performance with longer contexts, 40 GB+ VRAM

If you use this model in your research, please cite:
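The VRAM recommendations above can be sanity-checked with a rough back-of-the-envelope weight-size estimate. This is a sketch only: it assumes ~38B parameters and ignores activations, KV cache, and framework overhead, which is why the practical recommendation is 30 GB+ rather than the bare ~18 GiB of 4-bit weights.

```python
def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate in-VRAM size of the model weights alone, in GiB."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30

# FP16 stores 16 bits per weight; AWQ quantizes weights to 4 bits.
fp16 = weight_gib(38, 16)  # ~70.8 GiB
awq4 = weight_gib(38, 4)   # ~17.7 GiB

print(f"FP16 weights: {fp16:.1f} GiB, AWQ 4-bit weights: {awq4:.1f} GiB")
print(f"Reduction: {fp16 / awq4:.0f}x")
```

Weight storage alone shrinks by 4x, which is what brings a 38B model within reach of a single 30 GB+ GPU once runtime overhead is added back in.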

license:mit
71
5

Skywork-UniPic-1.5B

license:mit
70
113

Skywork-R1V2-38B-AWQ

[](https://github.com/SkyworkAI/Skywork-R1V/stargazers)[](https://github.com/SkyworkAI/Skywork-R1V/fork)

Comprehensive performance comparison across text and multimodal reasoning benchmarks.

| Model | MMMU | MathVista | MathVision | OlympiadBench | AIME24 | LiveCodeBench | LiveBench | IFEVAL |
| ---------------- | ---- | --------- | ---------- | ------------- | ------ | ------------- | --------- | ------ |
| Skywork-R1V2 | 73.6 | 74.0 | 49.0 | 62.6 | 78.9 | 63.6 | 73.2 | 82.9 |
| Skywork-R1V2-AWQ | 64.4 | 64.8 | 42.9 | 54.8 | 77.3 | 55.7 | 64.1 | 72.5 |

Usage

You can use the quantized model with different inference frameworks, e.g. vLLM.

Using vLLM

The AWQ quantization reduces the memory footprint compared to the original FP16 model. We recommend:
- At least one GPU with 30 GB+ VRAM for inference
- For optimal performance with longer contexts, 40 GB+ VRAM

If you use this model in your research, please cite:
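To make the quantization cost concrete, here is a small sketch that computes per-benchmark score retention (AWQ score divided by FP16 score) from the numbers reported above:

```python
# FP16 (Skywork-R1V2) and AWQ (Skywork-R1V2-AWQ) benchmark scores from the card.
fp16 = {"MMMU": 73.6, "MathVista": 74.0, "MathVision": 49.0, "OlympiadBench": 62.6,
        "AIME24": 78.9, "LiveCodeBench": 63.6, "LiveBench": 73.2, "IFEVAL": 82.9}
awq = {"MMMU": 64.4, "MathVista": 64.8, "MathVision": 42.9, "OlympiadBench": 54.8,
       "AIME24": 77.3, "LiveCodeBench": 55.7, "LiveBench": 64.1, "IFEVAL": 72.5}

# Fraction of the FP16 score retained after AWQ quantization, per benchmark.
retention = {name: awq[name] / fp16[name] for name in fp16}

for name, r in sorted(retention.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:15s} {r:.1%}")
```

AIME24 degrades the least (about 98% of the FP16 score is retained), while the other benchmarks each retain roughly 87-88%.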

license:mit
69
11

Skywork-13B-Base-8bits

β€”
65
7

SkyReels-V2-I2V-14B-540P

β€”
64
83

SkyworkVL-38B

license:apache-2.0
64
11

SkyworkVL-2B

license:apache-2.0
64
8

Skywork-13B-Math-8bits

β€”
63
4

UniPic2-SD3.5M-Kontext-2B

license:mit
60
23

Unipic3-DMD

license:mit
60
10

SkyReels-A2

license:apache-2.0
55
138

SkyReels-V2-DF-14B-720P-Diffusers

β€”
54
5

UniPic2-SD3.5M-Kontext-GRPO-2B

UniPic2-SD3.5M-Kontext-GRPO-2B is a GRPO post-trained text-to-image (T2I) model built on UniPic2-SD3.5M-Kontext-2B, with enhanced text rendering. It excels at text-to-image generation and image editing, delivering high quality at fast speeds, and runs smoothly on a single 16 GB consumer GPU.

Citation

If you use Skywork-UniPic in your research, please cite:

license:mit
50
9

UniPic2-Metaquery-9B

UniPic2-Metaquery-9B is a unified multimodal model built on Qwen2.5-VL-Instruct and SD3.5-Medium. It delivers end-to-end image understanding, text-to-image (T2I) generation, and image editing. It requires approximately 40 GB of VRAM; for NVIDIA RTX 40-series GPUs, we recommend using Skywork/UniPic2-Metaquery-Flash instead.

UniPic2-Metaquery-9B w/o GRPO achieves competitive results across a variety of vision-language tasks:

| Task | Score |
|--------------------|--------|
| 🧠 GenEval | 0.86 |
| 🖼️ DPG-Bench | 83.63 |
| ✂️ GEditBench-EN | 6.90 |
| 🧪 ImgEdit-Bench | 4.10 |

📄 License

This model is released under the MIT License.

Citation

If you use Skywork-UniPic in your research, please cite:

license:mit
46
19

UniPic2-Metaquery-GRPO-9B

license:mit
36
5

SkyReels-A1

license:apache-2.0
35
63

UniPic2-Metaquery-GRPO-Flash

license:mit
34
4

UniPic2-Metaquery-Flash

UniPic2-Metaquery-Flash is a quantized variant of UniPic2-Metaquery, offering end-to-end image understanding, text-to-image (T2I) generation, and image editing. Optimized for efficiency, it runs smoothly on NVIDIA RTX 40-series GPUs with under 16 GB of VRAM, without any performance degradation.

UniPic2-Metaquery-9B w/o GRPO achieves competitive results across a variety of vision-language tasks:

| Task | Score |
|--------------------|--------|
| 🧠 GenEval | 0.86 |
| 🖼️ DPG-Bench | 83.63 |
| ✂️ GEditBench-EN | 6.90 |
| 🧪 ImgEdit-Bench | 4.10 |

📄 License

This model is released under the MIT License.

Citation

If you use Skywork-UniPic in your research, please cite:

license:mit
32
6

Unipic3-Consistency-Model

license:mit
31
6

SkyReels-V2-I2V-14B-720P-Diffusers

β€”
30
4

SkyReels-V2-T2V-14B-720P-Diffusers

β€”
28
7

Matrix-Game-3.0

license:apache-2.0
14
105

SkyReels-V2-T2V-14B-540P-Diffusers

β€”
12
1

SkyReels-V2-I2V-14B-540P-Diffusers

β€”
7
1

Matrix-Game-2.0

license:mit
0
292

Matrix-3D

license:mit
0
48

Matrix-Game

license:mit
0
46

Skywork-13B-Base-Intermediate

β€”
0
7

R1V4

license:mit
0
4