TIGER-Lab
VLM2Vec-Full
MAmmoTH2-8B-Plus
MAmmoTH2-8x7B-Plus
general-verifier
VLM2Vec-Qwen2VL-2B
PixelReasoner-RL-v1
The model is trained with the curiosity-driven RL described in the paper. We have released vLLM-based inference code at https://github.com/TIGER-AI-Lab/Pixel-Reasoner/. Project page: https://tiger-ai-lab.github.io/Pixel-Reasoner/ GitHub repository: https://github.com/TIGER-AI-Lab/Pixel-Reasoner/ We will also release a simple hf.generate()-based inference script.
VideoScore-v1.1
Mantis-8B-siglip-llama3
VLM2Vec-Qwen2VL-7B
PixelReasoner-WarmStart
This is the warm-start model trained with https://github.com/TIGER-AI-Lab/Pixel-Reasoner/.
VL-Rethinker-7B
TIGERScore-13B
VideoScore2
📃Paper | 🌐Website | 💻Code | 🛢️Dataset (VideoFeedback2) | 🤗Model (VideoScore2) | 🤗Space (VideoScore2)

🤔Ablation1: SFT-only | 🤔Ablation2: SFT-w/o-CoT | 🤔Ablation3: RL-w/o-SFT

Introduction

We present VideoScore2, a multi-dimensional, interpretable, and human-aligned framework that explicitly evaluates visual quality, text-to-video alignment, and physical/common-sense consistency while producing detailed chain-of-thought rationales. Our model is trained on VideoFeedback2, a large-scale dataset containing 27,168 human-annotated videos with both scores and reasoning traces across the three dimensions, using a two-stage pipeline of supervised fine-tuning followed by reinforcement learning with Group Relative Policy Optimization (GRPO) to enhance analytical robustness. Extensive experiments demonstrate that VideoScore2 achieves superior performance, with 44.35 (+5.94) accuracy on our in-domain benchmark VideoScore-Bench-v2 and 50.37 (+4.32) average performance across four out-of-domain benchmarks (VideoGenReward-Bench, VideoPhy2, etc.).

Inference

To run inference with VideoScore2, first install the dependencies.

Training (SFT and RL)

See VideoScore2/training for details.
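The GRPO stage mentioned in this card scores each sampled response relative to the other responses drawn for the same prompt. A minimal sketch of that group-relative advantage computation, as a generic GRPO formulation rather than the actual VideoScore2 training code:

```python
from statistics import mean, pstdev

def grpo_advantages(group_rewards, eps=1e-6):
    """Group-relative advantages as in GRPO: each sampled response's
    reward is normalized by the mean and (population) std of the
    rewards of all responses sampled for the same prompt."""
    mu = mean(group_rewards)
    sigma = pstdev(group_rewards)
    return [(r - mu) / (sigma + eps) for r in group_rewards]

# Rewards for 4 sampled scoring rationales on one prompt
adv = grpo_advantages([1.0, 0.0, 0.5, 0.5])
```

Responses above the group mean get positive advantage and are reinforced; no learned value network is needed, which is the main appeal of GRPO over PPO-style baselines.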
EditReward-MiMo-VL-7B-SFT-2508
EditReward: A Human-Aligned Reward Model for Instruction-Guided Image Editing

[](https://tiger-ai-lab.github.io/EditReward/) [](https://arxiv.org/abs/2509.26346) [](https://huggingface.co/papers/2509.26346) [](https://huggingface.co/collections/TIGER-Lab/editreward-68ddf026ef9eb1510458abc6) [](https://huggingface.co/datasets/TIGER-Lab/EditReward-Data) [](https://huggingface.co/datasets/TIGER-Lab/EditReward-Bench)

This repository contains the official implementation of the paper EditReward: A Human-Aligned Reward Model for Instruction-Guided Image Editing.

📖 Introduction

We introduce EditReward, a human-aligned reward model powered by a high-quality dataset for instruction-guided image editing. EditReward is trained with EditReward-Data, a large-scale, high-fidelity preference dataset comprising over 200K manually annotated preference pairs. This dataset covers diverse edits produced by seven state-of-the-art models across twelve distinct sources, ensuring high alignment with human judgment. EditReward demonstrates superior alignment with human preferences in instruction-guided image editing tasks, achieving state-of-the-art human correlation on established benchmarks like GenAI-Bench, AURORA-Bench, ImagenHub, and our new EditReward-Bench.

🚀 Quick Start

To use the EditReward model for inference, follow these steps. For more details, including installation and training, please refer to the GitHub Repository.

📊 Benchmark

EditReward achieves superior alignment with human preferences in instruction-guided image editing tasks. The following tables show its performance against other models on various benchmarks.
| Method | GenAI-Bench | AURORA-Bench | ImagenHub | EditReward-Bench (Overall) |
| :--- | :--- | :--- | :--- | :--- |
| Random | 25.90 | 33.43 | -- | 13.84 |
| Human-to-Human | -- | -- | 41.84 | -- |
| Proprietary Models | | | | |
| GPT-4o | 53.54 | 50.81 | 38.21 | 28.31 |
| GPT-5 | 59.61 | 47.27 | 40.85 | 37.81 |
| Gemini-2.0-Flash | 53.32 | 44.31 | 23.69 | 33.47 |
| Gemini-2.5-Flash | 57.01 | 47.63 | 41.62 | 38.02 |
| Open-Source VLMs | | | | |
| Qwen2.5-VL-3B-Inst | 42.76 | 30.69 | -2.54 | 26.86 |
| Qwen2.5-VL-7B-Inst | 40.48 | 38.62 | 18.59 | 29.75 |
| Qwen2.5-VL-32B-Inst | 39.28 | 37.06 | 26.87 | 28.72 |
| MiMo-VL-7B-SFT-2508 | 57.89 | 30.43 | 22.14 | 31.19 |
| ADIEE | 59.96 | 55.56 | 34.50 | -- |
| Reward Models (Ours) | | | | |
| EditReward (on Qwen2.5-VL-7B) | 63.97 | 59.50 | 36.18 | 36.78 |
| EditReward (on MiMo-VL-7B) | 65.72 | 63.62 | 35.20 | 38.42 |

| Method | EditReward-Bench (K=2) | EditReward-Bench (K=3) | EditReward-Bench (K=4) | EditReward-Bench (Overall) |
| :--- | :--- | :--- | :--- | :--- |
| Random | 25.81 | 11.33 | 1.35 | 13.84 |
| Human-to-Human | -- | -- | -- | -- |
| Proprietary Models | | | | |
| GPT-4o | 45.69 | 27.33 | 7.31 | 28.31 |
| GPT-5 | 57.53 | 38.51 | 12.84 | 37.81 |
| Gemini-2.0-Flash | 52.43 | 33.33 | 13.51 | 33.47 |
| Gemini-2.5-Flash | 58.61 | 39.86 | 12.16 | 38.02 |
| Open-Source VLMs | | | | |
| Qwen2.5-VL-3B-Inst | 51.07 | 20.27 | 2.71 | 26.86 |
| Qwen2.5-VL-7B-Inst | 52.69 | 24.67 | 3.38 | 29.75 |
| Qwen2.5-VL-32B-Inst | 50.54 | 25.27 | 4.05 | 28.72 |
| MiMo-VL-7B-SFT-2508 | 49.46 | 30.41 | 9.46 | 31.19 |
| ADIEE | -- | -- | -- | -- |
| Reward Models (Ours) | | | | |
| EditReward (on Qwen2.5-VL-7B) | 56.99 | 36.00 | 10.81 | 36.78 |
| EditReward (on MiMo-VL-7B) | 56.45 | 42.67 | 11.49 | 38.42 |

📚 Citation

Please kindly cite our paper if you use our code, data, models, or results.

🙏 Acknowledgements

We would like to thank the HPSv3, VideoAlign, and GenAI-Bench codebases for providing valuable references.
---

⭐ Star History

[](https://star-history.com/#TIGER-AI-Lab/EditReward&Date)

💬 Support

For questions and support:
- Issues: GitHub Issues
- Email: [email protected] & [email protected]
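Reward models trained on preference pairs like EditReward-Data are typically optimized with a Bradley–Terry objective: the scalar score of the human-preferred edit should exceed that of the rejected one. A hedged sketch of that objective (generic formulation, not the repository's actual training code):

```python
import math

def bt_loss(r_chosen, r_rejected):
    """Bradley-Terry pairwise loss: -log sigmoid(r_chosen - r_rejected).
    Minimized when the reward model scores the preferred edit higher."""
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

good = bt_loss(2.0, 0.5)  # preferred edit scored higher -> small loss
bad = bt_loss(0.5, 2.0)   # preferred edit scored lower -> large loss
```

In training, `r_chosen` and `r_rejected` would be the model's scores for the two edited images of one annotated pair, and the loss is averaged over the dataset.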
ConsistI2V
General-Reasoner-Qwen2.5-7B
VLM2Vec-LLaVa-Next
Mantis-8B-clip-llama3
Mantis-8B-Idefics2
MAmmoTH-Coder-7B
Mantis-llava-7b
Mantis-8B-Fuyu
VL-Rethinker-72B
VisCoder2-14B
MAmmoTH-7B
MAmmoTH-VL2
VideoScore
BrowserAgent-SFT
Model

We release the SFT (Supervised Fine-Tuned) model used in BrowserAgent, based on `Qwen/Qwen2.5-7B-Instruct`. This model learns structured web-browsing behaviors (such as click, type, scroll, read, and submit) from human-style demonstrations and produces schema-constrained action sequences for browser environments.

Paper

BrowserAgent: Building Web Agents with Human-Inspired Web Browsing Actions
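A schema-constrained action sequence can be pictured as follows. The action vocabulary is taken from the card, but the field names and validation logic here are illustrative assumptions, not the schema defined in the BrowserAgent repository:

```python
from dataclasses import dataclass

# Action vocabulary from the card; field layout is a hypothetical sketch.
ALLOWED_ACTIONS = {"click", "type", "scroll", "read", "submit"}

@dataclass
class BrowserAction:
    action: str       # one of ALLOWED_ACTIONS
    target: str = ""  # element selector or description (illustrative)
    text: str = ""    # text to type, if any

def parse_action(raw: dict) -> BrowserAction:
    """Validate a model-emitted action dict against the schema,
    rejecting anything outside the allowed vocabulary."""
    if raw.get("action") not in ALLOWED_ACTIONS:
        raise ValueError(f"unknown action: {raw.get('action')!r}")
    return BrowserAction(raw["action"], raw.get("target", ""), raw.get("text", ""))

step = parse_action({"action": "type", "target": "#search", "text": "weather"})
```

Constraining generation to such a schema is what lets a browser environment execute the model's output directly instead of parsing free-form text.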
VisCoder2-7B
VisCoder2-3B
VideoScore2-SFT
This is an ablation variant of VideoScore2. Please refer to the main model for more details: 🤗Model (VideoScore2)
SWE-Next-14B
VisCoder2-32B
SWE-Next-7B
BrowserAgent-RFT
Model

We release the RFT (Reward Fine-Tuned) model used in BrowserAgent, initialized from the SFT checkpoint based on `Qwen/Qwen2.5-7B-Instruct`. This model further optimizes browsing trajectories with task-level reward signals that encourage a higher success rate, shorter action paths, and safer interactions.

Paper

BrowserAgent: Building Web Agents with Human-Inspired Web Browsing Actions
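The three reward signals named in this card (task success, path length, safety) could be combined in many ways; the following is a purely hypothetical shaping function illustrating the idea, with made-up coefficients, not the paper's actual reward:

```python
def trajectory_reward(success: bool, num_actions: int, unsafe_actions: int,
                      length_penalty: float = 0.01,
                      safety_penalty: float = 0.5) -> float:
    """Toy task-level reward: +1 for task success, minus small costs
    for long action paths and for unsafe interactions."""
    return ((1.0 if success else 0.0)
            - length_penalty * num_actions
            - safety_penalty * unsafe_actions)

short = trajectory_reward(True, num_actions=5, unsafe_actions=0)   # ~0.95
long_ = trajectory_reward(True, num_actions=30, unsafe_actions=0)  # ~0.70
```

Under such shaping, two successful trajectories are ranked by efficiency and safety, which matches the behaviors the RFT stage is said to encourage.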
Mantis-bakllava-7b
EditReward-Qwen2.5-VL-7B
EditReward: A Human-Aligned Reward Model for Instruction-Guided Image Editing

[](https://tiger-ai-lab.github.io/EditReward/) [](https://arxiv.org/abs/2509.26346) [](https://huggingface.co/papers/2509.26346) [](https://huggingface.co/collections/TIGER-Lab/editreward-68ddf026ef9eb1510458abc6) [](https://huggingface.co/datasets/TIGER-Lab/EditReward-Data) [](https://huggingface.co/datasets/TIGER-Lab/EditReward-Bench)

This repository contains the official implementation of the paper EditReward: A Human-Aligned Reward Model for Instruction-Guided Image Editing.

📖 Introduction

We introduce EditReward, a human-aligned reward model powered by a high-quality dataset for instruction-guided image editing. EditReward is trained with EditReward-Data, a large-scale, high-fidelity preference dataset comprising over 200K manually annotated preference pairs. This dataset covers diverse edits produced by seven state-of-the-art models across twelve distinct sources, ensuring high alignment with human judgment. EditReward demonstrates superior alignment with human preferences in instruction-guided image editing tasks, achieving state-of-the-art human correlation on established benchmarks like GenAI-Bench, AURORA-Bench, ImagenHub, and our new EditReward-Bench.

🚀 Quick Start

To use the EditReward model for inference, follow these steps. For more details, including installation and training, please refer to the GitHub Repository.

📊 Benchmark

EditReward achieves superior alignment with human preferences in instruction-guided image editing tasks. The following tables show its performance against other models on various benchmarks.
| Method | GenAI-Bench | AURORA-Bench | ImagenHub | EditReward-Bench (Overall) |
| :--- | :--- | :--- | :--- | :--- |
| Random | 25.90 | 33.43 | -- | 13.84 |
| Human-to-Human | -- | -- | 41.84 | -- |
| Proprietary Models | | | | |
| GPT-4o | 53.54 | 50.81 | 38.21 | 28.31 |
| GPT-5 | 59.61 | 47.27 | 40.85 | 37.81 |
| Gemini-2.0-Flash | 53.32 | 44.31 | 23.69 | 33.47 |
| Gemini-2.5-Flash | 57.01 | 47.63 | 41.62 | 38.02 |
| Open-Source VLMs | | | | |
| Qwen2.5-VL-3B-Inst | 42.76 | 30.69 | -2.54 | 26.86 |
| Qwen2.5-VL-7B-Inst | 40.48 | 38.62 | 18.59 | 29.75 |
| Qwen2.5-VL-32B-Inst | 39.28 | 37.06 | 26.87 | 28.72 |
| MiMo-VL-7B-SFT-2508 | 57.89 | 30.43 | 22.14 | 31.19 |
| ADIEE | 59.96 | 55.56 | 34.50 | -- |
| Reward Models (Ours) | | | | |
| EditReward (on Qwen2.5-VL-7B) | 63.97 | 59.50 | 36.18 | 36.78 |
| EditReward (on MiMo-VL-7B) | 65.72 | 63.62 | 35.20 | 38.42 |

| Method | EditReward-Bench (K=2) | EditReward-Bench (K=3) | EditReward-Bench (K=4) | EditReward-Bench (Overall) |
| :--- | :--- | :--- | :--- | :--- |
| Random | 25.81 | 11.33 | 1.35 | 13.84 |
| Human-to-Human | -- | -- | -- | -- |
| Proprietary Models | | | | |
| GPT-4o | 45.69 | 27.33 | 7.31 | 28.31 |
| GPT-5 | 57.53 | 38.51 | 12.84 | 37.81 |
| Gemini-2.0-Flash | 52.43 | 33.33 | 13.51 | 33.47 |
| Gemini-2.5-Flash | 58.61 | 39.86 | 12.16 | 38.02 |
| Open-Source VLMs | | | | |
| Qwen2.5-VL-3B-Inst | 51.07 | 20.27 | 2.71 | 26.86 |
| Qwen2.5-VL-7B-Inst | 52.69 | 24.67 | 3.38 | 29.75 |
| Qwen2.5-VL-32B-Inst | 50.54 | 25.27 | 4.05 | 28.72 |
| MiMo-VL-7B-SFT-2508 | 49.46 | 30.41 | 9.46 | 31.19 |
| ADIEE | -- | -- | -- | -- |
| Reward Models (Ours) | | | | |
| EditReward (on Qwen2.5-VL-7B) | 56.99 | 36.00 | 10.81 | 36.78 |
| EditReward (on MiMo-VL-7B) | 56.45 | 42.67 | 11.49 | 38.42 |

📚 Citation

Please kindly cite our paper if you use our code, data, models, or results.

🙏 Acknowledgements

We would like to thank the HPSv3, VideoAlign, and GenAI-Bench codebases for providing valuable references.
---

⭐ Star History

[](https://star-history.com/#TIGER-AI-Lab/EditReward&Date)

💬 Support

For questions and support:
- Issues: GitHub Issues
- Email: [email protected] & [email protected]
VL-Rethinker-32B
Vamba-Qwen2-VL-7B
TIGERScore-7B
ABC-Qwen2VL-Pretrain
ABC-Qwen2VL-Instruct
ScholarCopilot-v1
RationalRewards-8B-T2I
RationalRewards-8B-Edit
StructLM-7B
MAmmoTH-13B
Qwen2.5-32B-Instruct-CFT
MAmmoTH2-8B
VL-Reasoner-7B
General-Reasoner-Qwen3-4B
General-Reasoner: Advancing LLM Reasoning Across All Domains

💻 Code | 📄 Paper | 📊 Dataset | 🤗 Model | 🌐 Project Page

Figure: Effectiveness of General-Reasoner, trained on diverse verifiable reasoning questions with a model-based verifier, compared to baseline methods on various reasoning tasks.

General-Reasoner is a training paradigm for large language models (LLMs), designed to robustly enhance reasoning abilities across diverse domains: not just mathematics and coding, but also physics, chemistry, finance, the humanities, and more.

Key features:
- Zero RL Training: Direct reinforcement learning from base LLMs, bypassing intermediate supervised stages.
- Diverse Reasoning Data: 230K+ high-quality, verifiable questions sourced from the web and filtered for answer verifiability across disciplines.
- Model-Based Verifier: A compact 1.5B generative verifier model for context-aware, chain-of-thought answer validation, outperforming traditional rule-based methods.

This specific model is the General-Reasoner variant trained from Qwen3-4B-Base.

Main Results

General-Reasoner outperforms base and supervised models on a variety of reasoning benchmarks, demonstrating robust generalization across domains:
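For context on the model-based verifier claim: the rule-based baseline it is compared against is essentially normalized string matching on the final answer. A generic sketch of such a rule-based check (not the repository's code), showing why it breaks down outside math-style answers:

```python
import re

def rule_based_verify(predicted: str, reference: str) -> bool:
    """Rule-based verification: strip whitespace, '$', and commas,
    lowercase, then exact-match the final answer. Brittle for
    free-form answers, which motivates a generative, context-aware
    verifier instead."""
    norm = lambda s: re.sub(r"[\s$,]+", "", s).lower()
    return norm(predicted) == norm(reference)

ok = rule_based_verify("1,000", "1000")          # formatting difference: handled
miss = rule_based_verify("one thousand", "1000")  # semantic match: missed
```

A generative verifier like the 1.5B model described here would instead read the question, the reference, and the candidate's chain of thought, and judge equivalence in context.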
VideoScore2-RL-no-SFT
This is an ablation variant of VideoScore2. Please refer to the main model for more details: 🤗Model (VideoScore2)
MAmmoTH2-8x7B
General-Reasoner-Qwen2.5-14B
MAmmoTH-70B
MAmmoTH-Coder-13B
MAmmoTH2-7B
VisCoder-7B
Critique-Coder-8B
VLM2Vec-LoRA
One-Shot-CFT-Math-Qwen-14B
One-Shot-CFT: Unleashing the Reasoning Potential of Pre-trained LLMs by Critique Fine-Tuning on One Problem

💻 Code | 📄 Paper | 📊 Dataset | 🤗 Model | 🌐 Project Page

One-Shot Critique Fine-Tuning (CFT) is a simple, robust, and compute-efficient training paradigm for unleashing the reasoning capabilities of pretrained LLMs in both mathematical and logical domains. By leveraging critiques on just one problem, One-Shot CFT enables models like Qwen and LLaMA to match or even outperform reinforcement learning, while using 20× less compute.

Instead of learning from reference answers (as in supervised fine-tuning) or reward signals (as in reinforcement learning), One-Shot CFT trains models on critiques of diverse solutions to a single problem. This exposes the LLMs to varied reasoning patterns, multiple perspectives, and different error types while mitigating overfitting, thereby more effectively unleashing their reasoning potential.

- Unleashes Reasoning with One Example: One-Shot CFT uses critiques of diverse model-generated solutions to a single problem to significantly boost performance across math and logic tasks. For example, with just 5 GPU hours of training on Qwen2.5-Math-7B, One-Shot CFT achieves an average improvement of +15% on six math benchmarks and +16% on three logic reasoning benchmarks.
- Outperforms RLVR and Full SFT with 20× Less Compute: One-Shot CFT outperforms both one-shot Reinforcement Learning with Verifiable Rewards (RLVR) and full-dataset supervised fine-tuning, while requiring only 5 GPU hours on a 7B model, offering a much more efficient and stable training alternative.
- Robust Across Seeds and Model Scales: One-Shot CFT remains effective across different seed problem choices and model sizes, from 1.5B to 14B parameters, demonstrating strong generalization and scalability.

This specific model is the One-Shot CFT variant trained from Qwen2.5-14B with the DSR-CFT-p0 dataset.
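The training signal (critiques of many solutions to one problem) amounts to a small dataset-construction step. A sketch of that idea; the prompt wording and field names are illustrative assumptions, not the released DSR-CFT-p0 format:

```python
def build_cft_examples(problem: str, solutions: list, critiques: list) -> list:
    """Pair each candidate solution to the single seed problem with its
    critique; the model is then fine-tuned to generate the critique."""
    assert len(solutions) == len(critiques)
    return [
        {
            "prompt": (f"Problem: {problem}\n"
                       f"Candidate solution: {sol}\n"
                       "Critique this solution."),
            "completion": crit,
        }
        for sol, crit in zip(solutions, critiques)
    ]

examples = build_cft_examples(
    "Compute 3 + 4 * 2.",
    ["3 + 4 * 2 = 14", "3 + 4 * 2 = 11"],
    ["Incorrect: multiplication binds first, so the answer is 11.",
     "Correct: 4 * 2 = 8, then 3 + 8 = 11."],
)
```

One seed problem with N diverse solutions yields N supervised pairs, which is how a single problem can still expose the model to many reasoning patterns and error types.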
One-shot CFT consistently improves mathematical and logical reasoning. Left: Average accuracy on six mathematical reasoning benchmarks for Qwen and LLaMA models, comparing base, SFT, RLVR, and CFT with only one training example. Right: In-domain accuracy on three logic reasoning benchmarks (BBEH subtasks) for Qwen2.5-Math-7B. Across both domains, CFT with a single problem significantly outperforms standard SFT and matches or exceeds reinforcement learning with much lower compute.
Critique-Coder-4B
StructLM-34B
MAmmoTH-Coder-34B
VL-Reasoner-72B
One-Shot-CFT-Math-Qwen-7B
One-Shot-CFT: Unleashing the Reasoning Potential of Pre-trained LLMs by Critique Fine-Tuning on One Problem

💻 Code | 📄 Paper | 📊 Dataset | 🤗 Model | 🌐 Project Page

One-Shot Critique Fine-Tuning (CFT) is a simple, robust, and compute-efficient training paradigm for unleashing the reasoning capabilities of pretrained LLMs in both mathematical and logical domains. By leveraging critiques on just one problem, One-Shot CFT enables models like Qwen and LLaMA to match or even outperform reinforcement learning, while using 20× less compute.

Instead of learning from reference answers (as in supervised fine-tuning) or reward signals (as in reinforcement learning), One-Shot CFT trains models on critiques of diverse solutions to a single problem. This exposes the LLMs to varied reasoning patterns, multiple perspectives, and different error types while mitigating overfitting, thereby more effectively unleashing their reasoning potential.

- Unleashes Reasoning with One Example: One-Shot CFT uses critiques of diverse model-generated solutions to a single problem to significantly boost performance across math and logic tasks. For example, with just 5 GPU hours of training on Qwen2.5-Math-7B, One-Shot CFT achieves an average improvement of +15% on six math benchmarks and +16% on three logic reasoning benchmarks.
- Outperforms RLVR and Full SFT with 20× Less Compute: One-Shot CFT outperforms both one-shot Reinforcement Learning with Verifiable Rewards (RLVR) and full-dataset supervised fine-tuning, while requiring only 5 GPU hours on a 7B model, offering a much more efficient and stable training alternative.
- Robust Across Seeds and Model Scales: One-Shot CFT remains effective across different seed problem choices and model sizes, from 1.5B to 14B parameters, demonstrating strong generalization and scalability.

This specific model is the One-Shot CFT variant trained from Qwen2.5-Math-7B with the DSR-CFT-p0 dataset.
One-shot CFT consistently improves mathematical and logical reasoning. Left: Average accuracy on six mathematical reasoning benchmarks for Qwen and LLaMA models, comparing base, SFT, RLVR, and CFT with only one training example. Right: In-domain accuracy on three logic reasoning benchmarks (BBEH subtasks) for Qwen2.5-Math-7B. Across both domains, CFT with a single problem significantly outperforms standard SFT and matches or exceeds reinforcement learning with much lower compute.
VideoScore-Qwen2-VL
Mantis-8B-siglip-llama3-pretraind
MAmmoTH-7B-Mistral
General-Reasoner-Qwen3-14B
VisCoder-3B
One-Shot-CFT-Math-Qwen-1.5B
One-Shot-CFT: Unleashing the Reasoning Potential of Pre-trained LLMs by Critique Fine-Tuning on One Problem

💻 Code | 📄 Paper | 📊 Dataset | 🤗 Model | 🌐 Project Page

One-Shot Critique Fine-Tuning (CFT) is a simple, robust, and compute-efficient training paradigm for unleashing the reasoning capabilities of pretrained LLMs in both mathematical and logical domains. By leveraging critiques on just one problem, One-Shot CFT enables models like Qwen and LLaMA to match or even outperform reinforcement learning, while using 20× less compute.

Instead of learning from reference answers (as in supervised fine-tuning) or reward signals (as in reinforcement learning), One-Shot CFT trains models on critiques of diverse solutions to a single problem. This exposes the LLMs to varied reasoning patterns, multiple perspectives, and different error types while mitigating overfitting, thereby more effectively unleashing their reasoning potential.

- Unleashes Reasoning with One Example: One-Shot CFT uses critiques of diverse model-generated solutions to a single problem to significantly boost performance across math and logic tasks. For example, with just 5 GPU hours of training on Qwen2.5-Math-7B, One-Shot CFT achieves an average improvement of +15% on six math benchmarks and +16% on three logic reasoning benchmarks.
- Outperforms RLVR and Full SFT with 20× Less Compute: One-Shot CFT outperforms both one-shot Reinforcement Learning with Verifiable Rewards (RLVR) and full-dataset supervised fine-tuning, while requiring only 5 GPU hours on a 7B model, offering a much more efficient and stable training alternative.
- Robust Across Seeds and Model Scales: One-Shot CFT remains effective across different seed problem choices and model sizes, from 1.5B to 14B parameters, demonstrating strong generalization and scalability.

This specific model is the One-Shot CFT variant trained from Qwen2.5-Math-1.5B with the DSR-CFT-p0 dataset.
One-shot CFT consistently improves mathematical and logical reasoning. Left: Average accuracy on six mathematical reasoning benchmarks for Qwen and LLaMA models, comparing base, SFT, RLVR, and CFT with only one training example. Right: In-domain accuracy on three logic reasoning benchmarks (BBEH subtasks) for Qwen2.5-Math-7B. Across both domains, CFT with a single problem significantly outperforms standard SFT and matches or exceeds reinforcement learning with much lower compute.
AceCodeRM-32B
StructLM-7B-Mistral
StructLM-13B
VISTA-LongVA
This repo contains model checkpoints for VISTA-LongVA. VISTA is a video spatiotemporal augmentation method that generates long-duration and high-resolution video instruction-following data to enhance the video understanding capabilities of video LMMs.

🌐 Homepage | 📖 arXiv | 💻 GitHub | 🤗 VISTA-400K | 🤗 Models | 🤗 HRVideoBench

VISTA leverages insights from image and video classification data augmentation techniques such as CutMix, MixUp, and VideoMix, which demonstrate that training on synthetic data created by overlaying or mixing multiple images or videos results in more robust classifiers. Similarly, our method spatially and temporally combines videos to create (artificial) augmented video samples with longer durations and higher resolutions, and then synthesizes instruction data based on these new videos. Our data synthesis pipeline uses only existing public video-caption datasets, making it fully open-source and scalable. This allows us to construct VISTA-400K, a high-quality video instruction-following dataset aimed at improving the long and high-resolution video understanding capabilities of video LMMs.

Citation

If you find our paper useful, please cite us with
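The spatial and temporal combination can be sketched with plain arrays. This is a generic illustration of the two operations (temporal concatenation for longer clips, a 2×2 spatial grid for higher resolution), not the actual VISTA pipeline; frames are toy 2×2 grayscale grids:

```python
def temporal_concat(clip_a, clip_b):
    """Longer-duration sample: play clip_a's frames, then clip_b's."""
    return clip_a + clip_b

def spatial_grid_2x2(tl, tr, bl, br):
    """Higher-resolution sample: tile four equally sized frames into
    a 2x2 grid, doubling both height and width."""
    top = [ra + rb for ra, rb in zip(tl, tr)]
    bottom = [ra + rb for ra, rb in zip(bl, br)]
    return top + bottom

# Two 2-frame "videos" of 2x2 single-channel frames
frame = lambda v: [[v, v], [v, v]]
clip_a, clip_b = [frame(0), frame(1)], [frame(2), frame(3)]

long_clip = temporal_concat(clip_a, clip_b)  # 4 frames long
big_frame = spatial_grid_2x2(frame(0), frame(1), frame(2), frame(3))  # 4x4 frame
```

The augmented clip or frame is then paired with newly synthesized instructions (e.g. questions that require locating one sub-video or sub-region), which is what turns the mixed media into instruction-following training data.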
AceCoder-Qwen2.5-Coder-7B-Ins-V1.1
VISTA-VideoLLaVA
This repo contains model checkpoints for VISTA-VideoLLaVA. VISTA is a video spatiotemporal augmentation method that generates long-duration and high-resolution video instruction-following data to enhance the video understanding capabilities of video LMMs.

🌐 Homepage | 📖 arXiv | 💻 GitHub | 🤗 VISTA-400K | 🤗 Models | 🤗 HRVideoBench

VISTA leverages insights from image and video classification data augmentation techniques such as CutMix, MixUp, and VideoMix, which demonstrate that training on synthetic data created by overlaying or mixing multiple images or videos results in more robust classifiers. Similarly, our method spatially and temporally combines videos to create (artificial) augmented video samples with longer durations and higher resolutions, and then synthesizes instruction data based on these new videos. Our data synthesis pipeline uses only existing public video-caption datasets, making it fully open-source and scalable. This allows us to construct VISTA-400K, a high-quality video instruction-following dataset aimed at improving the long and high-resolution video understanding capabilities of video LMMs.

Citation

If you find our paper useful, please cite us with
AceCoder-Qwen2.5-Coder-7B-Ins-RM
UniIR
VISTA-Mantis
This repo contains model checkpoints for VISTA-Mantis. VISTA is a video spatiotemporal augmentation method that generates long-duration and high-resolution video instruction-following data to enhance the video understanding capabilities of video LMMs.

🌐 Homepage | 📖 arXiv | 💻 GitHub | 🤗 VISTA-400K | 🤗 Models | 🤗 HRVideoBench

VISTA leverages insights from image and video classification data augmentation techniques such as CutMix, MixUp, and VideoMix, which demonstrate that training on synthetic data created by overlaying or mixing multiple images or videos results in more robust classifiers. Similarly, our method spatially and temporally combines videos to create (artificial) augmented video samples with longer durations and higher resolutions, and then synthesizes instruction data based on these new videos. Our data synthesis pipeline uses only existing public video-caption datasets, making it fully open-source and scalable. This allows us to construct VISTA-400K, a high-quality video instruction-following dataset aimed at improving the long and high-resolution video understanding capabilities of video LMMs.

Citation

If you find our paper useful, please cite us with