QiWang98
VideoRFT-SFT
# 🎥 VideoRFT: Incentivizing Video Reasoning Capability in MLLMs via Reinforced Fine-Tuning

⭐️ Project │ 📖 ArXiv │ 📀 CoT Dataset │ 📀 RL Dataset │ 🤗 Models

## 📰 News

- [2025/09/19] Our paper has been accepted to NeurIPS 2025 🎉!
- [2025/06/01] We released our 3B models (🤗VideoRFT-SFT-3B and 🤗VideoRFT-3B) on Hugging Face.
- [2025/05/25] We released our 7B models (🤗VideoRFT-SFT-7B and 🤗VideoRFT-7B) on Hugging Face.
- [2025/05/20] We released our datasets (📀CoT Dataset and 📀RL Dataset) on Hugging Face.
- [2025/05/18] Our paper is released on ArXiv, and we have open-sourced our code on GitHub!

Reinforcement fine-tuning (RFT) has shown great promise in achieving human-level reasoning capabilities in Large Language Models (LLMs), and has recently been extended to MLLMs. Nevertheless, reasoning about videos, which is a fundamental aspect of human intelligence, remains a persistent challenge due to the complex logical, temporal, and causal structures inherent in video data. To fill this gap, we propose VideoRFT, a novel approach that extends the RFT paradigm to cultivate human-like video reasoning capabilities in MLLMs. VideoRFT follows the standard two-stage scheme in RFT: supervised fine-tuning (SFT) with chain-of-thought (CoT) annotations, followed by reinforcement learning (RL) to improve generalization.

A central challenge in achieving this in the video domain is the scarcity of large-scale, high-quality video CoT datasets. We address this by building a fully automatic CoT curation pipeline. First, we devise a cognition-inspired prompting strategy that prompts a reasoning LLM to generate preliminary CoTs based solely on rich, structured, and literal representations of video content. Subsequently, these CoTs are revised by a visual-language model conditioned on the actual video, ensuring visual consistency and reducing visual hallucinations. This pipeline results in two new datasets: VideoRFT-CoT-102K for SFT and VideoRFT-RL-310K for RL.
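The two-stage curation flow described above (draft from text only, then revise against the video) can be sketched as follows. This is a minimal illustration, not the authors' implementation: `curate_cot`, `build_draft_prompt`, `build_revision_prompt`, `CoTSample`, and the prompt wording are all hypothetical names and text introduced here, and the two models are passed in as plain callables so that any backend could be substituted.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical interfaces: any callable with these shapes can be plugged in.
ReasoningLLM = Callable[[str], str]              # prompt -> draft CoT
VisionLanguageModel = Callable[[str, str], str]  # (video_path, prompt) -> revised CoT


@dataclass
class CoTSample:
    question: str
    draft_cot: str    # stage 1: written from the text description only
    revised_cot: str  # stage 2: checked against the actual video


def build_draft_prompt(video_description: str, question: str) -> str:
    """Stage-1 prompt: the reasoning LLM never sees pixels, only text."""
    return (
        "Video description:\n"
        f"{video_description}\n\n"
        f"Question: {question}\n"
        "Reason step by step, then give the answer."
    )


def build_revision_prompt(question: str, draft_cot: str) -> str:
    """Stage-2 prompt: ask the VLM to fix claims the video does not support."""
    return (
        f"Question: {question}\n"
        f"Draft reasoning:\n{draft_cot}\n\n"
        "Watch the video and rewrite the reasoning so that every step "
        "is consistent with what is actually shown."
    )


def curate_cot(
    video_path: str,
    video_description: str,
    question: str,
    reasoning_llm: ReasoningLLM,
    vlm: VisionLanguageModel,
) -> CoTSample:
    """Run both stages for one (video, question) pair."""
    draft = reasoning_llm(build_draft_prompt(video_description, question))
    revised = vlm(video_path, build_revision_prompt(question, draft))
    return CoTSample(question=question, draft_cot=draft, revised_cot=revised)


if __name__ == "__main__":
    # Stand-in models so the sketch runs without any model weights.
    def fake_llm(prompt: str) -> str:
        return "Step 1: the person lifts the cup. Answer: drinking."

    def fake_vlm(video_path: str, prompt: str) -> str:
        return "Step 1: the person lifts a glass. Answer: drinking."

    sample = curate_cot(
        "clip.mp4",
        "A person lifts a glass and drinks.",
        "What is the person doing?",
        fake_llm,
        fake_vlm,
    )
    print(sample)
```

Keeping the models behind callables separates the pipeline logic from any particular inference stack, so the same function could be driven by a local model, an API client, or test stubs.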
To further strengthen the RL phase, we introduce a novel semantic-consistency reward that explicitly promotes the alignment between textual reasoning and visual evidence. This reward encourages the model to produce coherent, context-aware reasoning outputs grounded in visual input. Extensive experiments show that VideoRFT achieves state-of-the-art performance on six video reasoning benchmarks.

To overcome the scarcity of video CoTs, we develop a scalable, cognitively inspired pipeline for high-quality video CoT dataset construction. Based on this pipeline, we construct two large-scale datasets, i.e., 📀VideoRFT-CoT-102K and 📀VideoRFT-RL-310K.

## Requirements

- `Python >= 3.11`
- `PyTorch >= 2.5.1`
- `transformers == 4.51.3`
- `vLLM == 0.7.3`
- `trl == 0.16.0`

## Supervised Fine-Tuning (SFT)

We begin with supervised fine-tuning on the VideoRFT-CoT dataset for one epoch:

This step can be skipped by directly using our pretrained SFT models, available at 🤗VideoRFT-SFT-7B or 🤗VideoRFT-SFT-3B.

Next, perform reinforcement learning using the VideoRFT-RL dataset:

> Note: During training, we adopt the following settings for efficiency:

All frame-related configurations can be adjusted in `src/qwen-vl-utils`.

> During inference, we increase the maximum frame resolution and length to boost performance:

You can configure these parameters in `src/qwen-vl-utils`.

> We evaluate all models under a unified decoding configuration following the official Qwen2.5-VL demo:

1. Download the preprocessed evaluation JSONs from: [🤗 eval]
2. Download the video data from the official site of each benchmark and organize it as specified in the JSON files.

We gratefully acknowledge the contributions of the open-source community, particularly DeepSeek-R1, Open-R1, and R1-V.

If you find this work helpful, please consider citing:
VideoRFT
This repository contains the `VideoRFT` model, presented in the paper VideoRFT: Incentivizing Video Reasoning Capability in MLLMs via Reinforced Fine-Tuning.
VideoRFT-3B
VideoRFT-SFT-3B
This repository contains the VideoRFT model as presented in the paper VideoRFT: Incentivizing Video Reasoning Capability in MLLMs via Reinforced Fine-Tuning.

📖 ArXiv │ 📀 CoT Dataset │ 📀 RL Dataset │ 🤗 Models │ 💻 Code

We provide a simple generation process for using our model.