TIGER-Lab

81 models

VLM2Vec-Full

license:apache-2.0
46,052
28

MAmmoTH2-8B-Plus

llama
12,509
22

MAmmoTH2-8x7B-Plus

license:mit
8,447
14

general-verifier

license:apache-2.0
6,312
20

VLM2Vec-Qwen2VL-2B

license:apache-2.0
5,664
0

PixelReasoner-RL-v1

This model is trained with the curiosity-driven RL described in the paper. We have released vLLM-based inference code at https://github.com/TIGER-AI-Lab/Pixel-Reasoner/.

Project page: https://tiger-ai-lab.github.io/Pixel-Reasoner/
GitHub repository: https://github.com/TIGER-AI-Lab/Pixel-Reasoner/

A simple hf.generate()-based inference script will be released later.
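Since the released inference path goes through vLLM, a request to a vLLM OpenAI-compatible server can be sketched as below. The payload shape follows vLLM's OpenAI-compatible chat API; the model id, image URL, and prompt here are illustrative assumptions, not taken from the repository.

```python
# Hypothetical multimodal chat-completion request body for a vLLM
# OpenAI-compatible server serving a Pixel-Reasoner checkpoint.
def build_chat_request(model: str, question: str, image_url: str) -> dict:
    """Assemble a chat-completion request with one image and one question."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": image_url}},
                    {"type": "text", "text": question},
                ],
            }
        ],
        "temperature": 0.0,
    }

req = build_chat_request(
    "TIGER-Lab/PixelReasoner-RL-v1",
    "What is written on the sign?",
    "https://example.com/sign.jpg",
)
```

The body would then be POSTed to the server's `/v1/chat/completions` endpoint.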

license:apache-2.0
3,248
9

VideoScore-v1.1

license:mit
2,980
7

Mantis-8B-siglip-llama3

llama3
2,227
33

VLM2Vec-Qwen2VL-7B

license:apache-2.0
1,802
10

PixelReasoner-WarmStart

This is the warm-start model trained with the code at https://github.com/TIGER-AI-Lab/Pixel-Reasoner/.

license:apache-2.0
944
4

VL-Rethinker-7B

license:apache-2.0
901
13

TIGERScore-13B

llama
736
18

VideoScore2

📃Paper | 🌐Website | 💻Code | 🛢️Dataset (VideoFeedback2) | 🤗Model (VideoScore2) | 🤗Space (VideoScore2)
🤔Ablation1: SFT-only | 🤔Ablation2: SFT-w/o-CoT | 🤔Ablation3: RL-w/o-SFT

Introduction
We present VideoScore2, a multi-dimensional, interpretable, and human-aligned framework that explicitly evaluates visual quality, text-to-video alignment, and physical/common-sense consistency while producing detailed chain-of-thought rationales. The model is trained on VideoFeedback2, a large-scale dataset of 27,168 human-annotated videos with both scores and reasoning traces across the three dimensions, using a two-stage pipeline of supervised fine-tuning followed by reinforcement learning with Group Relative Policy Optimization (GRPO) to enhance analytical robustness. Extensive experiments demonstrate that VideoScore2 achieves superior performance, with 44.35 (+5.94) accuracy on our in-domain benchmark VideoScore-Bench-v2 and 50.37 (+4.32) average performance across four out-of-domain benchmarks (VideoGenReward-Bench, VideoPhy2, etc.).

Inference
To run inference with VideoScore2, first install:

Training (SFT and RL)
See VideoScore2/training for details.
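A multi-dimensional evaluator's text output has to be turned back into per-dimension numbers. The sketch below parses a hypothetical "Dimension: score" response format; this format and the field names are assumptions for illustration, not VideoScore2's documented output schema.

```python
import re

# Hypothetical parser for a VideoScore2-style response that reports one
# numeric score per evaluation dimension.
DIMENSIONS = ("visual quality", "text-to-video alignment", "physical consistency")

def parse_scores(text: str) -> dict:
    """Extract a float score for each known dimension, if present."""
    scores = {}
    for dim in DIMENSIONS:
        m = re.search(rf"{re.escape(dim)}\s*:\s*([0-9.]+)", text, re.IGNORECASE)
        if m:
            scores[dim] = float(m.group(1))
    return scores

sample = (
    "Visual quality: 3.5\n"
    "Text-to-video alignment: 4.0\n"
    "Physical consistency: 2.5\n"
)
scores = parse_scores(sample)
```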

license:apache-2.0
604
3

EditReward-MiMo-VL-7B-SFT-2508

EditReward: A Human-Aligned Reward Model for Instruction-Guided Image Editing

[](https://tiger-ai-lab.github.io/EditReward/) [](https://arxiv.org/abs/2509.26346) [](https://huggingface.co/papers/2509.26346) [](https://huggingface.co/collections/TIGER-Lab/editreward-68ddf026ef9eb1510458abc6) [](https://huggingface.co/datasets/TIGER-Lab/EditReward-Data) [](https://huggingface.co/datasets/TIGER-Lab/EditReward-Bench)

This repository contains the official implementation of the paper EditReward: A Human-Aligned Reward Model for Instruction-Guided Image Editing.

📖 Introduction
We introduce EditReward, a human-aligned reward model powered by a high-quality dataset for instruction-guided image editing. EditReward is trained with EditReward-Data, a large-scale, high-fidelity preference dataset comprising over 200K manually annotated preference pairs. The dataset covers diverse edits produced by seven state-of-the-art models across twelve distinct sources, ensuring high alignment with human judgment. EditReward achieves state-of-the-art human correlation on established benchmarks such as GenAI-Bench, AURORA-Bench, ImagenHub, and our new EditReward-Bench.

🚀 Quick Start
To use the EditReward model for inference, follow these steps. For more details, including installation and training, please refer to the GitHub Repository.

📊 Benchmark
EditReward achieves superior alignment with human preferences in instruction-guided image editing tasks. The following tables show its performance against other models on various benchmarks.

| Method | GenAI-Bench | AURORA-Bench | ImagenHub | EditReward-Bench (Overall) |
| :--- | :--- | :--- | :--- | :--- |
| Random | 25.90 | 33.43 | -- | 13.84 |
| Human-to-Human | -- | -- | 41.84 | -- |
| Proprietary Models | | | | |
| GPT-4o | 53.54 | 50.81 | 38.21 | 28.31 |
| GPT-5 | 59.61 | 47.27 | 40.85 | 37.81 |
| Gemini-2.0-Flash | 53.32 | 44.31 | 23.69 | 33.47 |
| Gemini-2.5-Flash | 57.01 | 47.63 | 41.62 | 38.02 |
| Open-Source VLMs | | | | |
| Qwen2.5-VL-3B-Inst | 42.76 | 30.69 | -2.54 | 26.86 |
| Qwen2.5-VL-7B-Inst | 40.48 | 38.62 | 18.59 | 29.75 |
| Qwen2.5-VL-32B-Inst | 39.28 | 37.06 | 26.87 | 28.72 |
| MiMo-VL-7B-SFT-2508 | 57.89 | 30.43 | 22.14 | 31.19 |
| ADIEE | 59.96 | 55.56 | 34.50 | -- |
| Reward Models (Ours) | | | | |
| EditReward (on Qwen2.5-VL-7B) | 63.97 | 59.50 | 36.18 | 36.78 |
| EditReward (on MiMo-VL-7B) | 65.72 | 63.62 | 35.20 | 38.42 |

| Method | EditReward-Bench (K=2) | EditReward-Bench (K=3) | EditReward-Bench (K=4) | EditReward-Bench (Overall) |
| :--- | :--- | :--- | :--- | :--- |
| Random | 25.81 | 11.33 | 1.35 | 13.84 |
| Human-to-Human | -- | -- | -- | -- |
| Proprietary Models | | | | |
| GPT-4o | 45.69 | 27.33 | 7.31 | 28.31 |
| GPT-5 | 57.53 | 38.51 | 12.84 | 37.81 |
| Gemini-2.0-Flash | 52.43 | 33.33 | 13.51 | 33.47 |
| Gemini-2.5-Flash | 58.61 | 39.86 | 12.16 | 38.02 |
| Open-Source VLMs | | | | |
| Qwen2.5-VL-3B-Inst | 51.07 | 20.27 | 2.71 | 26.86 |
| Qwen2.5-VL-7B-Inst | 52.69 | 24.67 | 3.38 | 29.75 |
| Qwen2.5-VL-32B-Inst | 50.54 | 25.27 | 4.05 | 28.72 |
| MiMo-VL-7B-SFT-2508 | 49.46 | 30.41 | 9.46 | 31.19 |
| ADIEE | -- | -- | -- | -- |
| Reward Models (Ours) | | | | |
| EditReward (on Qwen2.5-VL-7B) | 56.99 | 36.00 | 10.81 | 36.78 |
| EditReward (on MiMo-VL-7B) | 56.45 | 42.67 | 11.49 | 38.42 |

📚 Citation
Please cite our paper if you use our code, data, models, or results:

🙏 Acknowledgements
We thank the HPSv3, VideoAlign, and GenAI-Bench codebases for providing valuable references.

---
⭐ Star History
[](https://star-history.com/#TIGER-AI-Lab/EditReward&Date)

💬 Support
For questions and support:
- Issues: GitHub Issues
- Email: [email protected] & [email protected]
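Benchmarks like the ones above typically score a pairwise reward model by how often it assigns a higher reward to the human-preferred edit. This is a generic sketch of that metric, not the official evaluation code.

```python
# Pairwise preference accuracy: a reward model "wins" a pair when the
# human-chosen edit receives the higher reward score.
def preference_accuracy(pairs):
    """pairs: iterable of (reward_chosen, reward_rejected) tuples."""
    pairs = list(pairs)
    correct = sum(1 for chosen, rejected in pairs if chosen > rejected)
    return correct / len(pairs)

# Toy rewards for four annotated pairs; 3 of 4 ranked correctly.
acc = preference_accuracy([(0.9, 0.2), (0.4, 0.7), (0.8, 0.1), (0.6, 0.5)])
```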

license:apache-2.0
378
1

ConsistI2V

license:mit
305
8

General-Reasoner-Qwen2.5-7B

license:apache-2.0
248
3

VLM2Vec-LLaVa-Next

license:apache-2.0
204
1

Mantis-8B-clip-llama3

llama3
179
1

Mantis-8B-Idefics2

llama3
168
15

MAmmoTH-Coder-7B

llama
114
27

Mantis-llava-7b

license:apache-2.0
101
15

Mantis-8B-Fuyu

license:cc-by-nc-4.0
84
4

VL-Rethinker-72B

license:apache-2.0
76
5

VisCoder2-14B

license:apache-2.0
65
2

MAmmoTH-7B

llama
59
8

MAmmoTH-VL2

license:apache-2.0
54
13

VideoScore

license:apache-2.0
49
7

BrowserAgent-SFT

Model
We release the SFT (supervised fine-tuned) model used in BrowserAgent, based on `Qwen/Qwen2.5-7B-Instruct`. This model learns structured web-browsing behaviors (click, type, scroll, read, submit) from human-style demonstrations and produces schema-constrained action sequences for browser environments.

Paper
BrowserAgent: Building Web Agents with Human-Inspired Web Browsing Actions
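"Schema-constrained" means each emitted action must match a fixed name-and-arguments contract before it is executed. The sketch below illustrates the idea with a toy schema over the five action names listed above; the argument fields are hypothetical, not BrowserAgent's actual action space.

```python
# Toy action schema: each action name maps to the exact set of argument
# fields it requires. Field names here are illustrative assumptions.
ACTION_SCHEMA = {
    "click":  {"element_id"},
    "type":   {"element_id", "text"},
    "scroll": {"direction"},
    "read":   set(),
    "submit": {"element_id"},
}

def validate_action(action: dict) -> bool:
    """Accept an action only if its name and argument set match the schema."""
    name = action.get("name")
    if name not in ACTION_SCHEMA:
        return False
    return set(action.get("args", {})) == ACTION_SCHEMA[name]

ok = validate_action({"name": "type", "args": {"element_id": "q", "text": "cats"}})
bad = validate_action({"name": "hover", "args": {"element_id": "q"}})
```

Constraining generation this way lets the browser environment reject malformed model outputs instead of executing them.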

license:apache-2.0
49
0

VisCoder2-7B

license:apache-2.0
47
2

VisCoder2-3B

license:apache-2.0
45
2

VideoScore2-SFT

This is an ablation variant of VideoScore2; please refer to the main model for details: 🤗Model (VideoScore2)

license:apache-2.0
42
0

SWE-Next-14B

license:mit
41
0

VisCoder2-32B

license:apache-2.0
39
1

SWE-Next-7B

license:mit
36
0

BrowserAgent-RFT

Model
We release the RFT (reward fine-tuned) model used in BrowserAgent, initialized from the BrowserAgent SFT checkpoint (itself based on `Qwen/Qwen2.5-7B-Instruct`). This model further optimizes browsing trajectories with task-level reward signals that encourage a higher success rate, shorter action paths, and safer interactions.

Paper
BrowserAgent: Building Web Agents with Human-Inspired Web Browsing Actions

license:apache-2.0
33
2

Mantis-bakllava-7b

license:apache-2.0
31
5

EditReward-Qwen2.5-VL-7B

This entry is the EditReward variant trained on Qwen2.5-VL-7B. Its model card is identical to the EditReward-MiMo-VL-7B-SFT-2508 entry above; see that entry for the introduction, quick start, benchmark tables, citation, and support links.

license:apache-2.0
31
3

VL-Rethinker-32B

29
1

Vamba-Qwen2-VL-7B

license:mit
26
16

TIGERScore-7B

llama
24
2

ABC-Qwen2VL-Pretrain

19
1

ABC-Qwen2VL-Instruct

license:mit
19
0

ScholarCopilot-v1

license:apache-2.0
16
8

RationalRewards-8B-T2I

license:apache-2.0
16
0

RationalRewards-8B-Edit

license:apache-2.0
15
0

StructLM-7B

llama
13
23

MAmmoTH-13B

llama
13
9

Qwen2.5-32B-Instruct-CFT

license:apache-2.0
13
6

MAmmoTH2-8B

llama
11
2

VL-Reasoner-7B

license:apache-2.0
11
1

General-Reasoner-Qwen3-4B

General-Reasoner: Advancing LLM Reasoning Across All Domains

💻 Code | 📄 Paper | 📊 Dataset | 🤗 Model | 🌐 Project Page

Figure: Effectiveness of General-Reasoner, trained with diverse verifiable reasoning questions using a model-based verifier, compared to baseline methods on various reasoning tasks.

General-Reasoner is a training paradigm for large language models (LLMs), designed to robustly enhance reasoning abilities across diverse domains, not just mathematics and coding but also physics, chemistry, finance, the humanities, and more.

Key features:
- Zero RL Training: direct reinforcement learning from base LLMs, bypassing intermediate supervised stages.
- Diverse Reasoning Data: 230K+ high-quality, verifiable questions sourced from the web and filtered for answer verifiability across disciplines.
- Model-Based Verifier: a compact 1.5B generative verifier model for context-aware, chain-of-thought answer validation, outperforming traditional rule-based methods.

This specific model is the General-Reasoner variant trained from Qwen3-4B-Base.

Main Results
General-Reasoner outperforms base and supervised models on a variety of reasoning benchmarks, demonstrating robust generalization across domains:
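The case for a model-based verifier is easiest to see from where rule-based checking breaks: string normalization rescues formatting noise but still rejects semantically equivalent answers. The toy checker below is illustrative only; it is not General-Reasoner's verifier.

```python
# A naive rule-based answer checker: normalize both strings, then compare.
# It handles surface noise but misses equivalent answers ("0.5" vs "1/2"),
# which is the gap the 1.5B generative verifier is meant to close.
def normalize(ans: str) -> str:
    return ans.strip().lower().rstrip(".").replace(" ", "")

def rule_based_match(prediction: str, reference: str) -> bool:
    return normalize(prediction) == normalize(reference)

exact = rule_based_match("1/2", "1/2")        # trivially accepted
rescued = rule_based_match(" 1/2 .", "1/2")   # formatting noise rescued
missed = rule_based_match("0.5", "1/2")       # equivalent, but rules miss it
```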

license:apache-2.0
10
3

VideoScore2-RL-no-SFT

This is an ablation variant of VideoScore2; please refer to the main model for details: 🤗Model (VideoScore2)

license:apache-2.0
10
1

MAmmoTH2-8x7B

license:mit
10
0

General-Reasoner-Qwen2.5-14B

license:apache-2.0
9
5

MAmmoTH-70B

llama
8
10

MAmmoTH-Coder-13B

llama
8
8

MAmmoTH2-7B

license:mit
7
0

VisCoder-7B

license:apache-2.0
6
8

Critique-Coder-8B

license:apache-2.0
6
2

VLM2Vec-LoRA

license:apache-2.0
5
11

One-Shot-CFT-Math-Qwen-14B

One-Shot-CFT: Unleashing the Reasoning Potential of Pre-trained LLMs by Critique Fine-Tuning on One Problem

💻 Code | 📄 Paper | 📊 Dataset | 🤗 Model | 🌐 Project Page

One-Shot Critique Fine-Tuning (CFT) is a simple, robust, and compute-efficient training paradigm for unleashing the reasoning capabilities of pretrained LLMs in both mathematical and logical domains. By leveraging critiques of just one problem, One-Shot CFT enables models like Qwen and LLaMA to match or even outperform reinforcement learning while using 20× less compute. Instead of learning from reference answers (as in supervised fine-tuning) or reward signals (as in reinforcement learning), One-Shot CFT trains models on critiques of diverse solutions to a single problem, exposing them to varied reasoning patterns, multiple perspectives, and error types while mitigating overfitting.

- Unleashes Reasoning with One Example: One-Shot CFT uses critiques of diverse model-generated solutions to a single problem to significantly boost performance across math and logic tasks. For example, with just 5 GPU hours of training on Qwen2.5-Math-7B, One-Shot CFT achieves an average improvement of +15% on six math benchmarks and +16% on three logic reasoning benchmarks.
- Outperforms RLVR and Full SFT with 20× Less Compute: One-Shot CFT outperforms both one-shot Reinforcement Learning with Verifiable Rewards (RLVR) and full-dataset supervised fine-tuning while requiring only 5 GPU hours on a 7B model, offering a much more efficient and stable training alternative.
- Robust Across Seeds and Model Scales: One-Shot CFT remains effective across different seed-problem choices and model sizes, from 1.5B to 14B parameters, demonstrating strong generalization and scalability.

This specific model is the One-Shot CFT variant trained from Qwen2.5-14B with the DSR-CFT-p0 dataset.
One-shot CFT consistently improves mathematical and logical reasoning. Left: Average accuracy on six mathematical reasoning benchmarks for Qwen and LLaMA models, comparing base, SFT, RLVR, and CFT with only one training example. Right: In-domain accuracy on three logic reasoning benchmarks (BBEH subtasks) for Qwen2.5-Math-7B. Across both domains, CFT with a single problem significantly outperforms standard SFT and matches or exceeds reinforcement learning with much lower compute.
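The data recipe behind One-Shot CFT can be sketched as: take one seed problem, sample many candidate solutions, attach a critique to each, and fine-tune on (problem + solution → critique) pairs. The prompt template and field names below are illustrative assumptions, not the released dataset's schema.

```python
# Build critique fine-tuning examples from diverse solutions to ONE problem.
def build_cft_examples(problem: str, solutions_with_critiques):
    """solutions_with_critiques: iterable of (solution, critique) pairs."""
    examples = []
    for solution, critique in solutions_with_critiques:
        prompt = f"Problem: {problem}\nCandidate solution: {solution}\nCritique:"
        examples.append({"prompt": prompt, "target": critique})
    return examples

examples = build_cft_examples(
    "Compute 3 + 4 * 2.",
    [
        ("14", "Incorrect: addition was applied before multiplication; 4*2=8, so 3+8=11."),
        ("11", "Correct: multiplication precedes addition, giving 3+8=11."),
    ],
)
```

Because the critiques target both wrong and right solutions, the model sees multiple reasoning paths and error types from a single seed problem.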

license:cc-by-4.0
5
2

Critique-Coder-4B

license:apache-2.0
5
0

StructLM-34B

llama
4
15

MAmmoTH-Coder-34B

llama
4
7

VL-Reasoner-72B

license:apache-2.0
4
3

One-Shot-CFT-Math-Qwen-7B

This is the One-Shot CFT variant trained from Qwen2.5-Math-7B with the DSR-CFT-p0 dataset; see the One-Shot-CFT-Math-Qwen-14B entry above for the full method description and results.

license:cc-by-4.0
4
2

VideoScore-Qwen2-VL

4
1

Mantis-8B-siglip-llama3-pretraind

license:llama3
4
0

MAmmoTH-7B-Mistral

license:mit
3
7

General-Reasoner-Qwen3-14B

license:apache-2.0
3
6

VisCoder-3B

license:apache-2.0
3
3

One-Shot-CFT-Math-Qwen-1.5B

This is the One-Shot CFT variant trained from Qwen2.5-Math-1.5B with the DSR-CFT-p0 dataset; see the One-Shot-CFT-Math-Qwen-14B entry above for the full method description and results.

license:cc-by-4.0
3
0

AceCodeRM-32B

license:mit
2
8

StructLM-7B-Mistral

license:mit
2
6

StructLM-13B

llama
1
9

VISTA-LongVA

This repo contains model checkpoints for VISTA-LongVA. VISTA is a video spatiotemporal augmentation method that generates long-duration and high-resolution video instruction-following data to enhance the video understanding capabilities of video LMMs.

🌐 Homepage | 📖 arXiv | 💻 GitHub | 🤗 VISTA-400K | 🤗 Models | 🤗 HRVideoBench

VISTA leverages insights from image and video classification data augmentation techniques such as CutMix, MixUp, and VideoMix, which demonstrate that training on synthetic data created by overlaying or mixing multiple images or videos results in more robust classifiers. Similarly, our method spatially and temporally combines videos to create (artificial) augmented video samples with longer durations and higher resolutions, then synthesizes instruction data based on these new videos. Our data synthesis pipeline uses existing public video-caption datasets, making it fully open-source and scalable. This allows us to construct VISTA-400K, a high-quality video instruction-following dataset aimed at improving the long and high-resolution video understanding capabilities of video LMMs.

Citation
If you find our paper useful, please cite us with
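The core combination step can be shown in miniature: concatenating two clips along the frame axis lengthens duration, and stacking frames spatially raises resolution. Real VISTA uses richer CutMix/VideoMix-inspired mixing schemes; this sketch only illustrates the idea.

```python
import numpy as np

# Toy VISTA-style augmentation on (frames, H, W, C) uint8 clips.
def temporal_concat(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Concatenate two clips in time, doubling the duration."""
    return np.concatenate([a, b], axis=0)

def spatial_stack(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Stack two clips top/bottom, doubling the height of every frame."""
    return np.concatenate([a, b], axis=1)

clip_a = np.zeros((8, 32, 32, 3), dtype=np.uint8)
clip_b = np.ones((8, 32, 32, 3), dtype=np.uint8)
longer = temporal_concat(clip_a, clip_b)   # 16 frames of 32x32
larger = spatial_stack(clip_a, clip_b)     # 8 frames of 64x32
```

Instruction data is then synthesized against the combined clip, e.g. questions whose answers require attending to both source videos.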

license:mit
1
2

AceCoder-Qwen2.5-Coder-7B-Ins-V1.1

license:mit
1
1

VISTA-VideoLLaVA

This repo contains model checkpoints for VISTA-VideoLLaVA; see the VISTA-LongVA entry above for the full description of the VISTA augmentation method, links, and citation.

license:mit
1
0

AceCoder-Qwen2.5-Coder-7B-Ins-RM

license:mit
1
0

UniIR

license:mit
0
6

VISTA-Mantis

This repo contains model checkpoints for VISTA-Mantis; see the VISTA-LongVA entry above for the full description of the VISTA augmentation method, links, and citation.

license:mit
0
1