TIGER-Lab
VLM2Vec-Full
MAmmoTH2-8B-Plus
MAmmoTH2-8x7B-Plus
general-verifier
VLM2Vec-Qwen2VL-2B
PixelReasoner-RL-v1
The model is trained with the curiosity-driven RL described in the paper. We have released vLLM-based inference code at https://github.com/TIGER-AI-Lab/Pixel-Reasoner/. Project page: https://tiger-ai-lab.github.io/Pixel-Reasoner/ GitHub repository: https://github.com/TIGER-AI-Lab/Pixel-Reasoner/ We will also release a simple hf.generate()-based inference script.
VideoScore-v1.1
Mantis-8B-siglip-llama3
VLM2Vec-Qwen2VL-7B
PixelReasoner-WarmStart
This is the warm-start model trained with https://github.com/TIGER-AI-Lab/Pixel-Reasoner/.
VL-Rethinker-7B
TIGERScore-13B
VideoScore2
📃Paper | 🌐Website | 💻Code | 🛢️Dataset (VideoFeedback2) | 🤗Model (VideoScore2) | 🤗Space (VideoScore2)

🤔Ablation1: SFT-only | 🤔Ablation2: SFT-w/o-CoT | 🤔Ablation3: RL-w/o-SFT

Introduction

We present VideoScore2, a multi-dimensional, interpretable, and human-aligned framework that explicitly evaluates visual quality, text-to-video alignment, and physical/common-sense consistency while producing detailed chain-of-thought rationales. Our model is trained on VideoFeedback2, a large-scale dataset containing 27,168 human-annotated videos with both scores and reasoning traces across the three dimensions, using a two-stage pipeline of supervised fine-tuning followed by reinforcement learning with Group Relative Policy Optimization (GRPO) to enhance analytical robustness. Extensive experiments demonstrate that VideoScore2 achieves superior performance, with 44.35 (+5.94) accuracy on our in-domain benchmark VideoScore-Bench-v2 and 50.37 (+4.32) average performance across four out-of-domain benchmarks (VideoGenReward-Bench, VideoPhy2, etc.).

Inference

To run inference with VideoScore2, first install the dependencies.

Training (SFT and RL)

See VideoScore2/training for details.
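The GRPO stage mentioned in this card scores each sampled response relative to the other responses drawn for the same prompt. A minimal sketch of that group-relative advantage computation, as a generic GRPO formulation rather than the actual VideoScore2 training code:

```python
from statistics import mean, pstdev

def grpo_advantages(group_rewards, eps=1e-6):
    """Group-relative advantages as in GRPO: each sampled response's
    reward is normalized by the mean and (population) std of the
    rewards of all responses sampled for the same prompt."""
    mu = mean(group_rewards)
    sigma = pstdev(group_rewards)
    return [(r - mu) / (sigma + eps) for r in group_rewards]

# Rewards for 4 sampled scoring rationales on one prompt
adv = grpo_advantages([1.0, 0.0, 0.5, 0.5])
```

Responses above the group mean get positive advantage and are reinforced; no learned value network is needed, which is the main appeal of GRPO over PPO-style baselines.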
EditReward-MiMo-VL-7B-SFT-2508
EditReward: A Human-Aligned Reward Model for Instruction-Guided Image Editing

[](https://tiger-ai-lab.github.io/EditReward/) [](https://arxiv.org/abs/2509.26346) [](https://huggingface.co/papers/2509.26346) [](https://huggingface.co/collections/TIGER-Lab/editreward-68ddf026ef9eb1510458abc6) [](https://huggingface.co/datasets/TIGER-Lab/EditReward-Data) [](https://huggingface.co/datasets/TIGER-Lab/EditReward-Bench)

This repository contains the official implementation of the paper EditReward: A Human-Aligned Reward Model for Instruction-Guided Image Editing.

📖 Introduction

We introduce EditReward, a human-aligned reward model powered by a high-quality dataset for instruction-guided image editing. EditReward is trained with EditReward-Data, a large-scale, high-fidelity preference dataset comprising over 200K manually annotated preference pairs. This dataset covers diverse edits produced by seven state-of-the-art models across twelve distinct sources, ensuring high alignment with human judgment. EditReward demonstrates superior alignment with human preferences in instruction-guided image editing tasks, achieving state-of-the-art human correlation on established benchmarks like GenAI-Bench, AURORA-Bench, ImagenHub, and our new EditReward-Bench.

🚀 Quick Start

To use the EditReward model for inference, follow these steps. For more details, including installation and training, please refer to the GitHub Repository.

📊 Benchmark

EditReward achieves superior alignment with human preferences in instruction-guided image editing tasks. The following tables show its performance against other models on various benchmarks.
| Method | GenAI-Bench | AURORA-Bench | ImagenHub | EditReward-Bench (Overall) |
| :--- | :--- | :--- | :--- | :--- |
| Random | 25.90 | 33.43 | -- | 13.84 |
| Human-to-Human | -- | -- | 41.84 | -- |
| Proprietary Models | | | | |
| GPT-4o | 53.54 | 50.81 | 38.21 | 28.31 |
| GPT-5 | 59.61 | 47.27 | 40.85 | 37.81 |
| Gemini-2.0-Flash | 53.32 | 44.31 | 23.69 | 33.47 |
| Gemini-2.5-Flash | 57.01 | 47.63 | 41.62 | 38.02 |
| Open-Source VLMs | | | | |
| Qwen2.5-VL-3B-Inst | 42.76 | 30.69 | -2.54 | 26.86 |
| Qwen2.5-VL-7B-Inst | 40.48 | 38.62 | 18.59 | 29.75 |
| Qwen2.5-VL-32B-Inst | 39.28 | 37.06 | 26.87 | 28.72 |
| MiMo-VL-7B-SFT-2508 | 57.89 | 30.43 | 22.14 | 31.19 |
| ADIEE | 59.96 | 55.56 | 34.50 | -- |
| Reward Models (Ours) | | | | |
| EditReward (on Qwen2.5-VL-7B) | 63.97 | 59.50 | 36.18 | 36.78 |
| EditReward (on MiMo-VL-7B) | 65.72 | 63.62 | 35.20 | 38.42 |

| Method | EditReward-Bench (K=2) | EditReward-Bench (K=3) | EditReward-Bench (K=4) | EditReward-Bench (Overall) |
| :--- | :--- | :--- | :--- | :--- |
| Random | 25.81 | 11.33 | 1.35 | 13.84 |
| Human-to-Human | -- | -- | -- | -- |
| Proprietary Models | | | | |
| GPT-4o | 45.69 | 27.33 | 7.31 | 28.31 |
| GPT-5 | 57.53 | 38.51 | 12.84 | 37.81 |
| Gemini-2.0-Flash | 52.43 | 33.33 | 13.51 | 33.47 |
| Gemini-2.5-Flash | 58.61 | 39.86 | 12.16 | 38.02 |
| Open-Source VLMs | | | | |
| Qwen2.5-VL-3B-Inst | 51.07 | 20.27 | 2.71 | 26.86 |
| Qwen2.5-VL-7B-Inst | 52.69 | 24.67 | 3.38 | 29.75 |
| Qwen2.5-VL-32B-Inst | 50.54 | 25.27 | 4.05 | 28.72 |
| MiMo-VL-7B-SFT-2508 | 49.46 | 30.41 | 9.46 | 31.19 |
| ADIEE | -- | -- | -- | -- |
| Reward Models (Ours) | | | | |
| EditReward (on Qwen2.5-VL-7B) | 56.99 | 36.00 | 10.81 | 36.78 |
| EditReward (on MiMo-VL-7B) | 56.45 | 42.67 | 11.49 | 38.42 |

📚 Citation

Please kindly cite our paper if you use our code, data, models, or results.

🙏 Acknowledgements

We would like to thank the HPSv3, VideoAlign, and GenAI-Bench codebases for providing valuable references.
---

⭐ Star History

[](https://star-history.com/#TIGER-AI-Lab/EditReward&Date)

💬 Support

For questions and support:
- Issues: GitHub Issues
- Email: [email protected] & [email protected]
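Reward models trained on preference pairs like EditReward-Data are typically optimized with a Bradley–Terry objective: the scalar score of the human-preferred edit should exceed that of the rejected one. A hedged sketch of that objective (generic formulation, not the repository's actual training code):

```python
import math

def bt_loss(r_chosen, r_rejected):
    """Bradley-Terry pairwise loss: -log sigmoid(r_chosen - r_rejected).
    Minimized when the reward model scores the preferred edit higher."""
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

good = bt_loss(2.0, 0.5)  # preferred edit scored higher -> small loss
bad = bt_loss(0.5, 2.0)   # preferred edit scored lower -> large loss
```

In training, `r_chosen` and `r_rejected` would be the model's scores for the two edited images of one annotated pair, and the loss is averaged over the dataset.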
ConsistI2V
General-Reasoner-Qwen2.5-7B
VLM2Vec-LLaVa-Next
Mantis-8B-clip-llama3
Mantis-8B-Idefics2
MAmmoTH-Coder-7B
Mantis-llava-7b
Mantis-8B-Fuyu
VL-Rethinker-72B
VisCoder2-14B
MAmmoTH-7B
MAmmoTH-VL2
VideoScore
BrowserAgent-SFT
Model

We release the SFT (Supervised Fine-Tuned) model used in BrowserAgent, based on `Qwen/Qwen2.5-7B-Instruct`. This model learns structured web-browsing behaviors (such as click, type, scroll, read, and submit) from human-style demonstrations and produces schema-constrained action sequences for browser environments.

Paper

BrowserAgent: Building Web Agents with Human-Inspired Web Browsing Actions
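A schema-constrained action sequence can be pictured as follows. The action vocabulary is taken from the card, but the field names and validation logic here are illustrative assumptions, not the schema defined in the BrowserAgent repository:

```python
from dataclasses import dataclass

# Action vocabulary from the card; field layout is a hypothetical sketch.
ALLOWED_ACTIONS = {"click", "type", "scroll", "read", "submit"}

@dataclass
class BrowserAction:
    action: str       # one of ALLOWED_ACTIONS
    target: str = ""  # element selector or description (illustrative)
    text: str = ""    # text to type, if any

def parse_action(raw: dict) -> BrowserAction:
    """Validate a model-emitted action dict against the schema,
    rejecting anything outside the allowed vocabulary."""
    if raw.get("action") not in ALLOWED_ACTIONS:
        raise ValueError(f"unknown action: {raw.get('action')!r}")
    return BrowserAction(raw["action"], raw.get("target", ""), raw.get("text", ""))

step = parse_action({"action": "type", "target": "#search", "text": "weather"})
```

Constraining generation to such a schema is what lets a browser environment execute the model's output directly instead of parsing free-form text.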
VisCoder2-7B
VisCoder2-3B
VideoScore2-SFT
This is an ablation variant of VideoScore2. Please refer to the main model for more details: 🤗Model (VideoScore2)
SWE-Next-14B
VisCoder2-32B
SWE-Next-7B
BrowserAgent-RFT
Model

We release the RFT (Reward Fine-Tuned) model used in BrowserAgent, initialized from the SFT checkpoint based on `Qwen/Qwen2.5-7B-Instruct`. This model further optimizes browsing trajectories with task-level reward signals that encourage a higher success rate, shorter action paths, and safer interactions.

Paper

BrowserAgent: Building Web Agents with Human-Inspired Web Browsing Actions
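The three reward signals named in this card (task success, path length, safety) could be combined in many ways; the following is a purely hypothetical shaping function illustrating the idea, with made-up coefficients, not the paper's actual reward:

```python
def trajectory_reward(success: bool, num_actions: int, unsafe_actions: int,
                      length_penalty: float = 0.01,
                      safety_penalty: float = 0.5) -> float:
    """Toy task-level reward: +1 for task success, minus small costs
    for long action paths and for unsafe interactions."""
    return ((1.0 if success else 0.0)
            - length_penalty * num_actions
            - safety_penalty * unsafe_actions)

short = trajectory_reward(True, num_actions=5, unsafe_actions=0)   # ~0.95
long_ = trajectory_reward(True, num_actions=30, unsafe_actions=0)  # ~0.70
```

Under such shaping, two successful trajectories are ranked by efficiency and safety, which matches the behaviors the RFT stage is said to encourage.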
Mantis-bakllava-7b
EditReward-Qwen2.5-VL-7B
EditReward: A Human-Aligned Reward Model for Instruction-Guided Image Editing

[](https://tiger-ai-lab.github.io/EditReward/) [](https://arxiv.org/abs/2509.26346) [](https://huggingface.co/papers/2509.26346) [](https://huggingface.co/collections/TIGER-Lab/editreward-68ddf026ef9eb1510458abc6) [](https://huggingface.co/datasets/TIGER-Lab/EditReward-Data) [](https://huggingface.co/datasets/TIGER-Lab/EditReward-Bench)

This repository contains the official implementation of the paper EditReward: A Human-Aligned Reward Model for Instruction-Guided Image Editing.

📖 Introduction

We introduce EditReward, a human-aligned reward model powered by a high-quality dataset for instruction-guided image editing. EditReward is trained with EditReward-Data, a large-scale, high-fidelity preference dataset comprising over 200K manually annotated preference pairs. This dataset covers diverse edits produced by seven state-of-the-art models across twelve distinct sources, ensuring high alignment with human judgment. EditReward demonstrates superior alignment with human preferences in instruction-guided image editing tasks, achieving state-of-the-art human correlation on established benchmarks like GenAI-Bench, AURORA-Bench, ImagenHub, and our new EditReward-Bench.

🚀 Quick Start

To use the EditReward model for inference, follow these steps. For more details, including installation and training, please refer to the GitHub Repository.

📊 Benchmark

EditReward achieves superior alignment with human preferences in instruction-guided image editing tasks. The following tables show its performance against other models on various benchmarks.
| Method | GenAI-Bench | AURORA-Bench | ImagenHub | EditReward-Bench (Overall) |
| :--- | :--- | :--- | :--- | :--- |
| Random | 25.90 | 33.43 | -- | 13.84 |
| Human-to-Human | -- | -- | 41.84 | -- |
| Proprietary Models | | | | |
| GPT-4o | 53.54 | 50.81 | 38.21 | 28.31 |
| GPT-5 | 59.61 | 47.27 | 40.85 | 37.81 |
| Gemini-2.0-Flash | 53.32 | 44.31 | 23.69 | 33.47 |
| Gemini-2.5-Flash | 57.01 | 47.63 | 41.62 | 38.02 |
| Open-Source VLMs | | | | |
| Qwen2.5-VL-3B-Inst | 42.76 | 30.69 | -2.54 | 26.86 |
| Qwen2.5-VL-7B-Inst | 40.48 | 38.62 | 18.59 | 29.75 |
| Qwen2.5-VL-32B-Inst | 39.28 | 37.06 | 26.87 | 28.72 |
| MiMo-VL-7B-SFT-2508 | 57.89 | 30.43 | 22.14 | 31.19 |
| ADIEE | 59.96 | 55.56 | 34.50 | -- |
| Reward Models (Ours) | | | | |
| EditReward (on Qwen2.5-VL-7B) | 63.97 | 59.50 | 36.18 | 36.78 |
| EditReward (on MiMo-VL-7B) | 65.72 | 63.62 | 35.20 | 38.42 |

| Method | EditReward-Bench (K=2) | EditReward-Bench (K=3) | EditReward-Bench (K=4) | EditReward-Bench (Overall) |
| :--- | :--- | :--- | :--- | :--- |
| Random | 25.81 | 11.33 | 1.35 | 13.84 |
| Human-to-Human | -- | -- | -- | -- |
| Proprietary Models | | | | |
| GPT-4o | 45.69 | 27.33 | 7.31 | 28.31 |
| GPT-5 | 57.53 | 38.51 | 12.84 | 37.81 |
| Gemini-2.0-Flash | 52.43 | 33.33 | 13.51 | 33.47 |
| Gemini-2.5-Flash | 58.61 | 39.86 | 12.16 | 38.02 |
| Open-Source VLMs | | | | |
| Qwen2.5-VL-3B-Inst | 51.07 | 20.27 | 2.71 | 26.86 |
| Qwen2.5-VL-7B-Inst | 52.69 | 24.67 | 3.38 | 29.75 |
| Qwen2.5-VL-32B-Inst | 50.54 | 25.27 | 4.05 | 28.72 |
| MiMo-VL-7B-SFT-2508 | 49.46 | 30.41 | 9.46 | 31.19 |
| ADIEE | -- | -- | -- | -- |
| Reward Models (Ours) | | | | |
| EditReward (on Qwen2.5-VL-7B) | 56.99 | 36.00 | 10.81 | 36.78 |
| EditReward (on MiMo-VL-7B) | 56.45 | 42.67 | 11.49 | 38.42 |

📚 Citation

Please kindly cite our paper if you use our code, data, models, or results.

🙏 Acknowledgements

We would like to thank the HPSv3, VideoAlign, and GenAI-Bench codebases for providing valuable references.
---

⭐ Star History

[](https://star-history.com/#TIGER-AI-Lab/EditReward&Date)

💬 Support

For questions and support:
- Issues: GitHub Issues
- Email: [email protected] & [email protected]
VL-Rethinker-32B
Vamba-Qwen2-VL-7B
TIGERScore-7B
ABC-Qwen2VL-Pretrain
ABC-Qwen2VL-Instruct
ScholarCopilot-v1
RationalRewards-8B-T2I
RationalRewards-8B-Edit
StructLM-7B
MAmmoTH-13B
Qwen2.5-32B-Instruct-CFT
MAmmoTH2-8B
VL-Reasoner-7B
General-Reasoner-Qwen3-4B
General-Reasoner: Advancing LLM Reasoning Across All Domains

💻 Code | 📄 Paper | 📊 Dataset | 🤗 Model | 🌐 Project Page

Figure: Effectiveness of General-Reasoner, trained on diverse verifiable reasoning questions with a model-based verifier, compared to baseline methods on various reasoning tasks.

General-Reasoner is a training paradigm for large language models (LLMs), designed to robustly enhance reasoning abilities across diverse domains: not just mathematics and coding, but also physics, chemistry, finance, the humanities, and more.

Key features:
- Zero RL Training: Direct reinforcement learning from base LLMs, bypassing intermediate supervised stages.
- Diverse Reasoning Data: 230K+ high-quality, verifiable questions sourced from the web and filtered for answer verifiability across disciplines.
- Model-Based Verifier: A compact 1.5B generative verifier model for context-aware, chain-of-thought answer validation, outperforming traditional rule-based methods.

This specific model is the General-Reasoner variant trained from Qwen3-4B-Base.

Main Results

General-Reasoner outperforms base and supervised models on a variety of reasoning benchmarks, demonstrating robust generalization across domains:
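For context on the model-based verifier claim: the rule-based baseline it is compared against is essentially normalized string matching on the final answer. A generic sketch of such a rule-based check (not the repository's code), showing why it breaks down outside math-style answers:

```python
import re

def rule_based_verify(predicted: str, reference: str) -> bool:
    """Rule-based verification: strip whitespace, '$', and commas,
    lowercase, then exact-match the final answer. Brittle for
    free-form answers, which motivates a generative, context-aware
    verifier instead."""
    norm = lambda s: re.sub(r"[\s$,]+", "", s).lower()
    return norm(predicted) == norm(reference)

ok = rule_based_verify("1,000", "1000")          # formatting difference: handled
miss = rule_based_verify("one thousand", "1000")  # semantic match: missed
```

A generative verifier like the 1.5B model described here would instead read the question, the reference, and the candidate's chain of thought, and judge equivalence in context.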
VideoScore2-RL-no-SFT
This is an ablation variant of VideoScore2. Please refer to the main model for more details: 🤗Model (VideoScore2)
MAmmoTH2-8x7B
General-Reasoner-Qwen2.5-14B
MAmmoTH-70B
MAmmoTH-Coder-13B
MAmmoTH2-7B
VisCoder-7B
Critique-Coder-8B
VLM2Vec-LoRA
One-Shot-CFT-Math-Qwen-14B
One-Shot-CFT: Unleashing the Reasoning Potential of Pre-trained LLMs by Critique Fine-Tuning on One Problem

💻 Code | 📄 Paper | 📊 Dataset | 🤗 Model | 🌐 Project Page

One-Shot Critique Fine-Tuning (CFT) is a simple, robust, and compute-efficient training paradigm for unleashing the reasoning capabilities of pretrained LLMs in both mathematical and logical domains. By leveraging critiques on just one problem, One-Shot CFT enables models like Qwen and LLaMA to match or even outperform reinforcement learning, while using 20× less compute.

Instead of learning from reference answers (as in supervised fine-tuning) or reward signals (as in reinforcement learning), One-Shot CFT trains models on critiques of diverse solutions to a single problem. This exposes the LLMs to varied reasoning patterns, multiple perspectives, and different error types while mitigating overfitting, thereby more effectively unleashing their reasoning potential.

- Unleashes Reasoning with One Example: One-Shot CFT uses critiques of diverse model-generated solutions to a single problem to significantly boost performance across math and logic tasks. For example, with just 5 GPU hours of training on Qwen2.5-Math-7B, One-Shot CFT achieves an average improvement of +15% on six math benchmarks and +16% on three logic reasoning benchmarks.
- Outperforms RLVR and Full SFT with 20× Less Compute: One-Shot CFT outperforms both one-shot Reinforcement Learning with Verifiable Rewards (RLVR) and full-dataset supervised fine-tuning, while requiring only 5 GPU hours on a 7B model, offering a much more efficient and stable training alternative.
- Robust Across Seeds and Model Scales: One-Shot CFT remains effective across different seed problem choices and model sizes, from 1.5B to 14B parameters, demonstrating strong generalization and scalability.

This specific model is the One-Shot CFT variant trained from Qwen2.5-14B with the DSR-CFT-p0 dataset.
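The training signal (critiques of many solutions to one problem) amounts to a small dataset-construction step. A sketch of that idea; the prompt wording and field names are illustrative assumptions, not the released DSR-CFT-p0 format:

```python
def build_cft_examples(problem: str, solutions: list, critiques: list) -> list:
    """Pair each candidate solution to the single seed problem with its
    critique; the model is then fine-tuned to generate the critique."""
    assert len(solutions) == len(critiques)
    return [
        {
            "prompt": (f"Problem: {problem}\n"
                       f"Candidate solution: {sol}\n"
                       "Critique this solution."),
            "completion": crit,
        }
        for sol, crit in zip(solutions, critiques)
    ]

examples = build_cft_examples(
    "Compute 3 + 4 * 2.",
    ["3 + 4 * 2 = 14", "3 + 4 * 2 = 11"],
    ["Incorrect: multiplication binds first, so the answer is 11.",
     "Correct: 4 * 2 = 8, then 3 + 8 = 11."],
)
```

One seed problem with N diverse solutions yields N supervised pairs, which is how a single problem can still expose the model to many reasoning patterns and error types.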
One-shot CFT consistently improves mathematical and logical reasoning. Left: Average accuracy on six mathematical reasoning benchmarks for Qwen and LLaMA models, comparing base, SFT, RLVR, and CFT with only one training example. Right: In-domain accuracy on three logic reasoning benchmarks (BBEH subtasks) for Qwen2.5-Math-7B. Across both domains, CFT with a single problem significantly outperforms standard SFT and matches or exceeds reinforcement learning with much lower compute.
Critique-Coder-4B
StructLM-34B
MAmmoTH-Coder-34B
VL-Reasoner-72B
One-Shot-CFT-Math-Qwen-7B
One-Shot-CFT: Unleashing the Reasoning Potential of Pre-trained LLMs by Critique Fine-Tuning on One Problem

💻 Code | 📄 Paper | 📊 Dataset | 🤗 Model | 🌐 Project Page

One-Shot Critique Fine-Tuning (CFT) is a simple, robust, and compute-efficient training paradigm for unleashing the reasoning capabilities of pretrained LLMs in both mathematical and logical domains. By leveraging critiques on just one problem, One-Shot CFT enables models like Qwen and LLaMA to match or even outperform reinforcement learning, while using 20× less compute.

Instead of learning from reference answers (as in supervised fine-tuning) or reward signals (as in reinforcement learning), One-Shot CFT trains models on critiques of diverse solutions to a single problem. This exposes the LLMs to varied reasoning patterns, multiple perspectives, and different error types while mitigating overfitting, thereby more effectively unleashing their reasoning potential.

- Unleashes Reasoning with One Example: One-Shot CFT uses critiques of diverse model-generated solutions to a single problem to significantly boost performance across math and logic tasks. For example, with just 5 GPU hours of training on Qwen2.5-Math-7B, One-Shot CFT achieves an average improvement of +15% on six math benchmarks and +16% on three logic reasoning benchmarks.
- Outperforms RLVR and Full SFT with 20× Less Compute: One-Shot CFT outperforms both one-shot Reinforcement Learning with Verifiable Rewards (RLVR) and full-dataset supervised fine-tuning, while requiring only 5 GPU hours on a 7B model, offering a much more efficient and stable training alternative.
- Robust Across Seeds and Model Scales: One-Shot CFT remains effective across different seed problem choices and model sizes, from 1.5B to 14B parameters, demonstrating strong generalization and scalability.

This specific model is the One-Shot CFT variant trained from Qwen2.5-Math-7B with the DSR-CFT-p0 dataset.
One-shot CFT consistently improves mathematical and logical reasoning. Left: Average accuracy on six mathematical reasoning benchmarks for Qwen and LLaMA models, comparing base, SFT, RLVR, and CFT with only one training example. Right: In-domain accuracy on three logic reasoning benchmarks (BBEH subtasks) for Qwen2.5-Math-7B. Across both domains, CFT with a single problem significantly outperforms standard SFT and matches or exceeds reinforcement learning with much lower compute.
VideoScore-Qwen2-VL
Mantis-8B-siglip-llama3-pretraind
MAmmoTH-7B-Mistral
General-Reasoner-Qwen3-14B
VisCoder-3B
One-Shot-CFT-Math-Qwen-1.5B
One-Shot-CFT: Unleashing the Reasoning Potential of Pre-trained LLMs by Critique Fine-Tuning on One Problem

💻 Code | 📄 Paper | 📊 Dataset | 🤗 Model | 🌐 Project Page

One-Shot Critique Fine-Tuning (CFT) is a simple, robust, and compute-efficient training paradigm for unleashing the reasoning capabilities of pretrained LLMs in both mathematical and logical domains. By leveraging critiques on just one problem, One-Shot CFT enables models like Qwen and LLaMA to match or even outperform reinforcement learning, while using 20× less compute.

Instead of learning from reference answers (as in supervised fine-tuning) or reward signals (as in reinforcement learning), One-Shot CFT trains models on critiques of diverse solutions to a single problem. This exposes the LLMs to varied reasoning patterns, multiple perspectives, and different error types while mitigating overfitting, thereby more effectively unleashing their reasoning potential.

- Unleashes Reasoning with One Example: One-Shot CFT uses critiques of diverse model-generated solutions to a single problem to significantly boost performance across math and logic tasks. For example, with just 5 GPU hours of training on Qwen2.5-Math-7B, One-Shot CFT achieves an average improvement of +15% on six math benchmarks and +16% on three logic reasoning benchmarks.
- Outperforms RLVR and Full SFT with 20× Less Compute: One-Shot CFT outperforms both one-shot Reinforcement Learning with Verifiable Rewards (RLVR) and full-dataset supervised fine-tuning, while requiring only 5 GPU hours on a 7B model, offering a much more efficient and stable training alternative.
- Robust Across Seeds and Model Scales: One-Shot CFT remains effective across different seed problem choices and model sizes, from 1.5B to 14B parameters, demonstrating strong generalization and scalability.

This specific model is the One-Shot CFT variant trained from Qwen2.5-Math-1.5B with the DSR-CFT-p0 dataset.
One-shot CFT consistently improves mathematical and logical reasoning. Left: Average accuracy on six mathematical reasoning benchmarks for Qwen and LLaMA models, comparing base, SFT, RLVR, and CFT with only one training example. Right: In-domain accuracy on three logic reasoning benchmarks (BBEH subtasks) for Qwen2.5-Math-7B. Across both domains, CFT with a single problem significantly outperforms standard SFT and matches or exceeds reinforcement learning with much lower compute.
AceCodeRM-32B
StructLM-7B-Mistral
StructLM-13B
VISTA-LongVA
This repo contains model checkpoints for VISTA-LongVA. VISTA is a video spatiotemporal augmentation method that generates long-duration and high-resolution video instruction-following data to enhance the video understanding capabilities of video LMMs.

🌐 Homepage | 📖 arXiv | 💻 GitHub | 🤗 VISTA-400K | 🤗 Models | 🤗 HRVideoBench

VISTA leverages insights from image and video classification data augmentation techniques such as CutMix, MixUp, and VideoMix, which demonstrate that training on synthetic data created by overlaying or mixing multiple images or videos results in more robust classifiers. Similarly, our method spatially and temporally combines videos to create (artificial) augmented video samples with longer durations and higher resolutions, and then synthesizes instruction data based on these new videos. Our data synthesis pipeline uses only existing public video-caption datasets, making it fully open-source and scalable. This allows us to construct VISTA-400K, a high-quality video instruction-following dataset aimed at improving the long and high-resolution video understanding capabilities of video LMMs.

Citation

If you find our paper useful, please cite us with
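The spatial and temporal combination can be sketched with plain arrays. This is a generic illustration of the two operations (temporal concatenation for longer clips, a 2×2 spatial grid for higher resolution), not the actual VISTA pipeline; frames are toy 2×2 grayscale grids:

```python
def temporal_concat(clip_a, clip_b):
    """Longer-duration sample: play clip_a's frames, then clip_b's."""
    return clip_a + clip_b

def spatial_grid_2x2(tl, tr, bl, br):
    """Higher-resolution sample: tile four equally sized frames into
    a 2x2 grid, doubling both height and width."""
    top = [ra + rb for ra, rb in zip(tl, tr)]
    bottom = [ra + rb for ra, rb in zip(bl, br)]
    return top + bottom

# Two 2-frame "videos" of 2x2 single-channel frames
frame = lambda v: [[v, v], [v, v]]
clip_a, clip_b = [frame(0), frame(1)], [frame(2), frame(3)]

long_clip = temporal_concat(clip_a, clip_b)  # 4 frames long
big_frame = spatial_grid_2x2(frame(0), frame(1), frame(2), frame(3))  # 4x4 frame
```

The augmented clip or frame is then paired with newly synthesized instructions (e.g. questions that require locating one sub-video or sub-region), which is what turns the mixed media into instruction-following training data.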
AceCoder-Qwen2.5-Coder-7B-Ins-V1.1
VISTA-VideoLLaVA
This repo contains model checkpoints for VISTA-VideoLLaVA. VISTA is a video spatiotemporal augmentation method that generates long-duration and high-resolution video instruction-following data to enhance the video understanding capabilities of video LMMs.

🌐 Homepage | 📖 arXiv | 💻 GitHub | 🤗 VISTA-400K | 🤗 Models | 🤗 HRVideoBench

VISTA leverages insights from image and video classification data augmentation techniques such as CutMix, MixUp, and VideoMix, which demonstrate that training on synthetic data created by overlaying or mixing multiple images or videos results in more robust classifiers. Similarly, our method spatially and temporally combines videos to create (artificial) augmented video samples with longer durations and higher resolutions, and then synthesizes instruction data based on these new videos. Our data synthesis pipeline uses only existing public video-caption datasets, making it fully open-source and scalable. This allows us to construct VISTA-400K, a high-quality video instruction-following dataset aimed at improving the long and high-resolution video understanding capabilities of video LMMs.

Citation

If you find our paper useful, please cite us with
AceCoder-Qwen2.5-Coder-7B-Ins-RM
UniIR
VISTA-Mantis
This repo contains model checkpoints for VISTA-Mantis. VISTA is a video spatiotemporal augmentation method that generates long-duration and high-resolution video instruction-following data to enhance the video understanding capabilities of video LMMs.

🌐 Homepage | 📖 arXiv | 💻 GitHub | 🤗 VISTA-400K | 🤗 Models | 🤗 HRVideoBench

VISTA leverages insights from image and video classification data augmentation techniques such as CutMix, MixUp, and VideoMix, which demonstrate that training on synthetic data created by overlaying or mixing multiple images or videos results in more robust classifiers. Similarly, our method spatially and temporally combines videos to create (artificial) augmented video samples with longer durations and higher resolutions, and then synthesizes instruction data based on these new videos. Our data synthesis pipeline uses only existing public video-caption datasets, making it fully open-source and scalable. This allows us to construct VISTA-400K, a high-quality video instruction-following dataset aimed at improving the long and high-resolution video understanding capabilities of video LMMs.

Citation

If you find our paper useful, please cite us with