# chamber111


## VPPO-7B

VPPO-7B is a state-of-the-art Large Vision-Language Model (LVLM) specialized for complex multimodal reasoning tasks. It is the 7B-parameter version of our model, fine-tuned from `Qwen2.5-VL-7B-Instruct` using a novel reinforcement learning algorithm called Visually-Perceptive Policy Optimization (VPPO).

The core innovation of VPPO is its ability to solve the "uniform learning signal" problem that plagues standard RL fine-tuning. Instead of broadcasting a single reward to all tokens in a reasoning chain, VPPO identifies the sparse, critical tokens that are highly dependent on visual input and focuses policy updates on them. This hierarchical "spotlight" mechanism allows the model to develop a more robust, genuinely perception-grounded reasoning capability. As a result, VPPO-7B demonstrates significant performance improvements over strong baselines across a wide range of challenging benchmarks, including mathematics, geometry, and logic problems, and exhibits superior training stability and faster convergence.

- Model type: Large Vision-Language Model (LVLM)
- Fine-tuned from model: `Qwen/Qwen2.5-VL-7B-Instruct`

The model was fine-tuned on ViRL39K, a diverse collection of multimodal reasoning problems. The original dataset can be found on the Hugging Face Hub: `TIGER-Lab/ViRL39K`.

The model was trained using our Visually-Perceptive Policy Optimization (VPPO) algorithm, a modification of the Group Relative Policy Optimization (GRPO) framework. The procedure involves generating responses, calculating token-level visual dependency, and using this dependency to shape the advantage and filter gradients during the policy update step.
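The shape-and-filter step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: measuring visual dependency as the with-image vs. without-image log-probability gap is an assumed proxy, and the shaping/filtering formulas are one plausible reading of the card's k = 0.4 and βmin = 0.9 hyperparameters.

```python
import numpy as np

def visual_dependency(logp_with_image, logp_without_image):
    """Per-token visual dependency, proxied here by how much the image
    shifts each token's log-probability (an assumed proxy)."""
    return np.abs(np.asarray(logp_with_image) - np.asarray(logp_without_image))

def vppo_token_weights(dep, k=0.4, beta_min=0.9):
    """Advantage shaping + gradient filtering (illustrative).

    - Shaping: scale each token's advantage into [beta_min, 1.0] by its
      normalized visual dependency.
    - Filtering: keep gradients only for the top-k fraction of visually
      dependent tokens; mask the rest.
    """
    dep = np.asarray(dep, dtype=float)
    norm = (dep - dep.min()) / (dep.max() - dep.min() + 1e-8)
    scale = beta_min + (1.0 - beta_min) * norm        # advantage shaping
    n_keep = max(1, int(np.ceil(k * len(dep))))
    keep = np.zeros_like(dep)
    keep[np.argsort(dep)[-n_keep:]] = 1.0             # gradient filter mask
    return scale, keep

# Toy example: 5 tokens, two of which depend strongly on the image.
dep = visual_dependency([-1.0, -0.2, -3.0, -0.5, -2.5],
                        [-1.1, -0.2, -0.4, -0.5, -0.6])
scale, keep = vppo_token_weights(dep, k=0.4, beta_min=0.9)
```

With k = 0.4 over five tokens, only the two most visually dependent tokens keep their gradients, which is the "spotlight" behavior the card describes.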
- Base Model: Qwen2.5-VL-7B-Instruct
- Algorithm: VPPO
- Epochs: 2
- Learning Rate: 1e-6
- Rollout Batch Size: 384
- Max Response Length: 2048
- Entropy Penalty Coefficient: 0.06
- Gradient Filtering Ratio (k): 0.4
- Advantage Shaping Min (βmin): 0.9
- Training Regime: bf16 mixed precision

The model was evaluated on a comprehensive suite of 8 diverse multimodal reasoning benchmarks:

- Math & Geometry: Geo3k, We-Math, MathVerse, MathVision, DynaMath, MMK12
- Logic: LogicVista
- Multi-discipline: MMMU-Pro

Performance is measured by average accuracy@8, the average success rate over 8 independent generations per problem (at temperature 1.0) using exact-match scoring.

If you use this model in your work, please cite our paper.
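The accuracy@8 metric above is straightforward to compute. A toy sketch (function and variable names are ours, not from an official evaluation harness):

```python
def accuracy_at_k(correct_flags_per_problem):
    """Average accuracy@k: mean, over problems, of the fraction of the
    k sampled generations that exact-match the reference answer.
    The card uses k = 8 generations at temperature 1.0."""
    per_problem = [sum(flags) / len(flags) for flags in correct_flags_per_problem]
    return sum(per_problem) / len(per_problem)

# Toy example: 2 problems, 8 generations each (1 = exact match).
score = accuracy_at_k([
    [1, 1, 0, 1, 0, 1, 1, 0],  # 5/8 correct
    [1, 1, 1, 1, 1, 1, 1, 1],  # 8/8 correct
])
```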

License: MIT

## VPPO-32B

VPPO-32B is a state-of-the-art Large Vision-Language Model (LVLM) specialized for complex multimodal reasoning tasks. It is the 32B-parameter version of our model, fine-tuned from `Qwen2.5-VL-32B-Instruct` using a novel reinforcement learning algorithm called Visually-Perceptive Policy Optimization (VPPO).

The core innovation of VPPO is its ability to solve the "uniform learning signal" problem that plagues standard RL fine-tuning. Instead of broadcasting a single reward to all tokens in a reasoning chain, VPPO identifies the sparse, critical tokens that are highly dependent on visual input and focuses policy updates on them. This hierarchical "spotlight" mechanism allows the model to develop a more robust, genuinely perception-grounded reasoning capability. As a result, VPPO-32B demonstrates significant performance improvements over strong baselines across a wide range of challenging benchmarks, including mathematics, geometry, and logic problems, and exhibits superior training stability and faster convergence.

- Model type: Large Vision-Language Model (LVLM)
- Fine-tuned from model: `Qwen/Qwen2.5-VL-32B-Instruct`
- Repository: `VPPO-RL`
- Paper: `2510.09285`

### Training Details

The model was fine-tuned on ViRL39K, a diverse collection of multimodal reasoning problems. The original dataset can be found on the Hugging Face Hub: `TIGER-Lab/ViRL39K`.

The model was trained using our Visually-Perceptive Policy Optimization (VPPO) algorithm, a modification of the Group Relative Policy Optimization (GRPO) framework. The procedure involves generating responses, calculating token-level visual dependency, and using this dependency to shape the advantage and filter gradients during the policy update step.
- Base Model: Qwen2.5-VL-32B-Instruct
- Algorithm: VPPO
- Epochs: 2
- Learning Rate: 1e-6
- Rollout Batch Size: 384
- Max Response Length: 2048
- Entropy Penalty Coefficient: 0.06
- Gradient Filtering Ratio (k): 0.4
- Advantage Shaping Min (βmin): 0.9
- Training Regime: bf16 mixed precision

The model was evaluated on a comprehensive suite of 8 diverse multimodal reasoning benchmarks:

- Math & Geometry: Geo3k, We-Math, MathVerse, MathVision, DynaMath, MMK12
- Logic: LogicVista
- Multi-discipline: MMMU-Pro

Performance is measured by average accuracy@8, the average success rate over 8 independent generations per problem (at temperature 1.0) using exact-match scoring.

If you use this model in your work, please cite our paper.
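Since VPPO modifies GRPO, the group-relative baseline it starts from is worth making concrete. A minimal sketch of one common GRPO formulation (rewards for a group of rollouts on the same prompt, normalized by the group mean and standard deviation; the paper's exact normalization may differ):

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages: normalize each rollout's scalar reward
    against the group of rollouts sampled for the same prompt."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid div-by-zero for uniform groups
    return [(r - mean) / std for r in rewards]

# Toy group of 4 rollouts with binary correctness rewards.
adv = grpo_advantages([1.0, 0.0, 1.0, 0.0])
```

In plain GRPO this per-sequence advantage is broadcast uniformly to every token; VPPO's contribution is to reshape and filter it per token by visual dependency.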

License: MIT

## VPPO-8B

VPPO-8B is a state-of-the-art Large Vision-Language Model (LVLM) specialized for complex multimodal reasoning tasks. It is the 8B-parameter version of our model, fine-tuned from `Qwen3-VL-8B-Instruct` using a novel reinforcement learning algorithm called Visually-Perceptive Policy Optimization (VPPO).

The core innovation of VPPO is its ability to solve the "uniform learning signal" problem that plagues standard RL fine-tuning. Instead of broadcasting a single reward to all tokens in a reasoning chain, VPPO identifies the sparse, critical tokens that are highly dependent on visual input and focuses policy updates on them. This hierarchical "spotlight" mechanism allows the model to develop a more robust, genuinely perception-grounded reasoning capability. As a result, VPPO-8B demonstrates significant performance improvements over strong baselines across a wide range of challenging benchmarks, including mathematics, geometry, and logic problems, and exhibits superior training stability and faster convergence.

- Model type: Large Vision-Language Model (LVLM)
- Fine-tuned from model: `Qwen/Qwen3-VL-8B-Instruct`

The model was fine-tuned on ViRL39K, a diverse collection of multimodal reasoning problems. The original dataset can be found on the Hugging Face Hub: `TIGER-Lab/ViRL39K`.

The model was trained using our Visually-Perceptive Policy Optimization (VPPO) algorithm, a modification of the Group Relative Policy Optimization (GRPO) framework. The procedure involves generating responses, calculating token-level visual dependency, and using this dependency to shape the advantage and filter gradients during the policy update step.
- Base Model: Qwen3-VL-8B-Instruct
- Algorithm: VPPO
- Steps: 150
- Learning Rate: 1e-6
- Rollout Batch Size: 384
- Max Response Length: 8192
- Entropy Penalty Coefficient: 0.12 for steps 0-130; 0.18 for steps 131-150
- Gradient Filtering Ratio (k): 0.4
- Advantage Shaping Min (βmin): 0.9
- Training Regime: bf16 mixed precision

The model was evaluated on a comprehensive suite of 8 diverse multimodal reasoning benchmarks:

- Math & Geometry: Geo3k, We-Math, MathVerse, MathVision, DynaMath, MMK12
- Logic: LogicVista
- Multi-discipline: MMMU-Pro

Performance is measured by average accuracy@8, the average success rate over 8 independent generations per problem (at temperature 1.0) using exact-match scoring.

If you use this model in your work, please cite our paper.
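Unlike the 7B and 32B runs, this run uses a stepwise entropy-penalty schedule. A trivial sketch of the card's numbers (the boundary placement follows the card's "steps 0-130 / 131-150" split):

```python
def entropy_coef(step: int) -> float:
    """Entropy penalty coefficient for the VPPO-8B run:
    0.12 for steps 0-130, then 0.18 for steps 131-150."""
    if not 0 <= step <= 150:
        raise ValueError("schedule is defined for steps 0-150")
    return 0.12 if step <= 130 else 0.18
```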

License: MIT