lmms-lab

61 models

llava-onevision-qwen2-7b-ov

1. Model Summary 2. Use 3. Limitations 4. Training 5. License 6. Citation

The LLaVA-OneVision models are 0.5B/7B/72B-parameter models trained on LLaVA-OneVision, based on the Qwen2 language model with a context window of 32K tokens.

- Repository: LLaVA-VL/LLaVA-NeXT
- Project Website: llava-onevision.lmms-lab.com
- Paper: LLaVA-OneVision
- Point of Contact: Bo Li
- Languages: English, Chinese

The model was trained on the LLaVA-OneVision Dataset and can interact with single images, multi-image inputs, and videos. Feel free to share your generations in the Community tab! We provide a simple generation process for using our model; for more details, refer to GitHub.

- Architecture: SO400M + Qwen2
- Pretraining Stage: LCS-558K, 1 epoch, projector only
- Mid Stage: a mixture of 4.7M high-quality synthetic samples, 1 epoch, full model
- Final-Image Stage: a mixture of 3.6M single-image samples, 1 epoch, full model
- OneVision Stage: a mixture of 1.6M single-image/multi-image/video samples, 1 epoch, full model
- Precision: bfloat16
- GPUs: 256 NVIDIA A100 (for the whole model series)
- Orchestration: Hugging Face Trainer
- Neural networks: PyTorch
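As a concrete starting point for the "simple generation process" mentioned above, here is a minimal sketch using the community `transformers` conversion of this checkpoint (`llava-hf/llava-onevision-qwen2-7b-ov-hf`); the original repository ships its own loading code, so treat the checkpoint id and class names as assumptions:

```python
# Hedged sketch: single-image inference with the HF `transformers` port of
# LLaVA-OneVision. The original LLaVA-NeXT repo uses a different loader.

def build_conversation(question: str, num_images: int = 1) -> list:
    """Build the chat-template message list the processor expects."""
    content = [{"type": "image"} for _ in range(num_images)]
    content.append({"type": "text", "text": question})
    return [{"role": "user", "content": content}]

def generate(image, question: str) -> str:
    """One image-question round. Requires a GPU and the model weights."""
    import torch
    from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

    model_id = "llava-hf/llava-onevision-qwen2-7b-ov-hf"  # assumed conversion
    processor = AutoProcessor.from_pretrained(model_id)
    model = LlavaOnevisionForConditionalGeneration.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="auto"
    )
    prompt = processor.apply_chat_template(
        build_conversation(question), add_generation_prompt=True
    )
    inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=256)
    return processor.decode(out[0], skip_special_tokens=True)
```

Multi-image and video prompts follow the same pattern: add one `{"type": "image"}` entry per image before the text turn.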

license:apache-2.0
141,666
59

LLaVA-Video-7B-Qwen2

The LLaVA-Video models are 7B/72B-parameter models trained on LLaVA-Video-178K and the LLaVA-OneVision Dataset, based on the Qwen2 language model with a context window of 32K tokens.

- Project Page: Project Page
- Paper: for more details, please check our paper
- Repository: LLaVA-VL/LLaVA-NeXT
- Point of Contact: Yuanhan Zhang
- Languages: English, Chinese

The model was trained on LLaVA-Video-178K and the LLaVA-OneVision Dataset and can interact with images, multi-image inputs, and videos, with a particular focus on videos. Feel free to share your generations in the Community tab! We provide a simple generation process for using our model; for more details, refer to GitHub.

- Architecture: SO400M + Qwen2
- Initialized Model: lmms-lab/llava-onevision-qwen2-7b-si
- Data: a mixture of 1.6M single-image/multi-image/video samples, 1 epoch, full model
- Precision: bfloat16
- GPUs: 256 NVIDIA A100 (for the whole model series)
- Orchestration: Hugging Face Trainer
- Neural networks: PyTorch
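Video inputs to models like this are typically reduced to a fixed budget of uniformly sampled frames before the visual encoder sees them. A minimal, self-contained sketch of that sampling step (the 32-frame default here is illustrative, not a documented setting of this checkpoint):

```python
def sample_frame_indices(total_frames: int, num_frames: int = 32) -> list:
    """Pick `num_frames` frame indices spread uniformly over a video.

    If the video is shorter than the budget, every frame is used.
    Otherwise the video is split into `num_frames` equal spans and the
    midpoint frame of each span is taken.
    """
    if total_frames <= num_frames:
        return list(range(total_frames))
    step = total_frames / num_frames
    return [int(step * i + step / 2) for i in range(num_frames)]
```

The selected frames are then decoded (e.g. with decord or PyAV) and passed to the processor as an image list.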

license:apache-2.0
56,713
115

llava-onevision-qwen2-0.5b-ov

The LLaVA-OneVision models are 0.5B/7B/72B-parameter models trained on LLaVA-OneVision, based on the Qwen2 language model with a context window of 32K tokens.

- Repository: LLaVA-VL/LLaVA-NeXT
- Project Website: llava-onevision.lmms-lab.com
- Paper: LLaVA-OneVision
- Point of Contact: Bo Li
- Languages: English, Chinese

The model was trained on the LLaVA-OneVision Dataset and can interact with single images, multi-image inputs, and videos. Feel free to share your generations in the Community tab! We provide a simple generation process for using our model; for more details, refer to GitHub.

- Architecture: SO400M + Qwen2
- Pretraining Stage: LCS-558K, 1 epoch, projector only
- Mid Stage: a mixture of 4.7M high-quality synthetic samples, 1 epoch, full model
- Final-Image Stage: a mixture of 3.6M single-image samples, 1 epoch, full model
- OneVision Stage: a mixture of 1.6M single-image/multi-image/video samples, 1 epoch, full model
- Precision: bfloat16
- GPUs: 256 NVIDIA A100 (for the whole model series)
- Orchestration: Hugging Face Trainer
- Neural networks: PyTorch

license:apache-2.0
28,609
24

LLaVA-NeXT-Video-32B-Qwen

In our LLaVA-Video blog released this April, we shared two key observations:

- 🎬 AnyRes provides a shared and flexible representation between images and videos, and thus accommodates capability transfer between the two most common vision signals. Stronger image LMMs therefore naturally lead to stronger zero-shot video LMMs.
- 🗂️ There is a lack of high-quality language-video data, including video instruction-following data, so naive tuning on the public data available at the time degraded performance. There is therefore an urgent need to build high-quality video caption and QA datasets to train LMMs for improved video performance.

Based on these insights, the new LLaVA-NeXT-Video in this release improves on two fronts:

- 🎬 A stronger image LMM (LLaVA-NeXT-32B-Qwen), built by initializing from the Qwen-1.5 32B LLM. We further initialize our video training from this image checkpoint.
- 🗂️ A new high-quality video dataset with 830K samples. It is combined with the LLaVA-1.6 image training data, and applying the same image-video mixed training procedure yields the new video model.

The new model achieves the best open-source performance on several video benchmarks, including Video-MME.

Evaluation Results

| Model | NExT-QA (MC) | Video-MME (w/o subs) | Video-MME (w subs) | EgoSchema | Perception Test (val) |
|-----------------------------|-----------|----------|--------|----------|------------------------|
| *Proprietary* | | | | | |
| GPT-4o | - | 71.9 | 77.2 | 72.2 | - |
| Gemini 1.5 Pro | - | 75.0 | 81.3 | 72.2 | - |
| *Open-Source* | | | | | |
| VideoLLaMA 2 (8x7B) | 76.3 | 47.9 | 50.3 | 53.3 | 51.2 |
| VILA-1.5-34B | 67.89 | 60.1 | 61.1 | 58.04 | 54 |
| LLaVA-NeXT-Video (Qwen-32B) | 77.31 | 60.2 | 63.0 | 60.85 | 59.38 |

Results are reproduced by lmms-eval; please refer to lmms-eval to reproduce them.

LLaVA-NeXT-Video is an open-source chatbot trained by fine-tuning an LLM on multimodal instruction-following data. LLaVA-NeXT-Video-32B-Qwen was trained in June 2024.

Where to send questions or comments about the model: https://github.com/LLaVA-VL/LLaVA-NeXT/issues

The primary use of LLaVA is research on large multimodal models and chatbots. The primary intended users of the model are researchers and hobbyists in computer vision, natural language processing, machine learning, and artificial intelligence.

Training data

Image:
- 558K filtered image-text pairs from LAION/CC/SBU, captioned by BLIP.
- 158K GPT-generated multimodal instruction-following samples.
- 500K academic-task-oriented VQA data mixture.
- 50K GPT-4V data mixture.
- 40K ShareGPT data.

Video:
- 830K samples.

@misc{zhang2024videoinstructiontuningsynthetic,
  title={Video Instruction Tuning With Synthetic Data},
  author={Yuanhan Zhang and Jinming Wu and Wei Li and Bo Li and Zejun Ma and Ziwei Liu and Chunyuan Li},
  year={2024},
  eprint={2410.02713},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2410.02713},
}
@misc{zhang2024llavanextvideo,
  title={LLaVA-NeXT: A Strong Zero-shot Video Understanding Model},
  url={https://llava-vl.github.io/blog/2024-04-30-llava-next-video/},
  author={Zhang, Yuanhan and Li, Bo and Liu, Haotian and Lee, Yong Jae and Gui, Liangke and Fu, Di and Feng, Jiashi and Liu, Ziwei and Li, Chunyuan},
  month={April},
  year={2024}
}
@misc{li2024llavanext-interleave,
  title={LLaVA-NeXT: Tackling Multi-image, Video, and 3D in Large Multimodal Models},
  url={https://llava-vl.github.io/blog/2024-06-16-llava-next-interleave/},
  author={Li, Feng and Zhang, Renrui and Zhang, Hao and Zhang, Yuanhan and Li, Bo and Li, Wei and Ma, Zejun and Li, Chunyuan},
  month={June},
  year={2024}
}

license:apache-2.0
21,216
17

LLaVA-OneVision-1.5-8B-Instruct

LLaVA-OneVision-1.5: Fully Open-Source State-of-the-Art VLM

LLaVA-OneVision-1.5 is a fully open-source family of large multimodal models (LMMs) built to democratize multimodal training. Trained on native-resolution images, it delivers state-of-the-art performance at substantially lower cost. The project also releases high-quality pretraining and SFT data, a complete and efficient training framework with recipes and configs, and comprehensive logs to support transparent, reproducible research.

Superior Performance
- The model leads on multiple multimodal benchmarks and generally surpasses Qwen2.5-VL.
- Training on native-resolution images significantly improves its visual understanding.

High-Quality Data at Scale
- The pretraining corpus comprises large-scale, concept-balanced, diverse, and high-quality captions curated with strict filtering and quality control.
- The instruction-tuning dataset is comprehensive and covers a wide range of tasks.

Ultra-Efficient Training Framework
- The end-to-end training cost is about $16,000 on A100 GPUs at roughly $0.60 per GPU-hour.
- The system is built on Megatron-LM with support for MoE, FP8, and long-sequence parallelism, and the codebase is optimized for cost-effective scaling.

Fully Open Framework
- The project releases high-quality pretraining and SFT datasets along with the complete training framework, configurations, and recipes.
- It also provides detailed training logs and metrics to enable reproducibility and community adoption.
| Model | HF Link | Training Log |
|--------------------------|---------------------|----------------|
| LLaVA-OV-1.5-4B-Instruct | 🤗 HF / 4B-Instruct | 📈 Tensorboard |
| LLaVA-OV-1.5-8B-Instruct | 🤗 HF / 8B-Instruct | 📈 Tensorboard |

| Description | Link | Status |
|--------------------------------------|--------------------------|------------|
| LLaVA-OneVision-1.5-Mid-Training-85M | 🤗 HF / Mid-Training 85M | Uploading… |
| LLaVA-OneVision-1.5-Instruct | 🤗 HF / Instruct-Data | Available |

Evaluation Results

All evaluations were conducted using lmms-eval.

| | LLaVA-OV-1.5-8B | Qwen2.5 VL 7B |
|:----------------------------------|:---------------:|:-------------:|
| MMMU (Validation) | 55.44 | 51.33 |
| MMMU-Pro (Standard) | 37.40 | 36.30 |
| MMMU-Pro (Vision) | 25.15 | 32.83 |
| MMBench (English; Test) | 84.14 | 83.40 |
| MMBench (Chinese; Test) | 81.00 | 81.61 |
| MME-RealWorld (English) | 62.31 | 57.33 |
| MME-RealWorld (Chinese) | 56.11 | 51.50 |
| AI2D (With Mask) | 84.16 | 82.58 |
| AI2D (Without Mask) | 94.11 | 93.36 |
| CV-Bench | 80.82 | 79.95 |
| VL-RewardBench | 45.90 | 49.65 |
| V | 78.01 | 76.96 |
| PixmoCount | 62.19 | 63.33 |
| CountBench | 88.19 | 86.35 |
| ChartQA | 86.48 | 84.08 |
| CharXiv (Direct Questions) | 74.10 | 69.80 |
| DocVQA (Test) | 95.00 | 94.93 |
| InfoVQA (Test) | 78.42 | 81.67 |
| WeMath | 33.62 | 33.33 |
| MathVista (Mini) | 69.57 | 68.60 |
| MathVision | 25.56 | 22.37 |
| MMStar | 67.72 | 62.54 |
| SEED-Bench (Image) | 77.32 | 77.53 |
| ScienceQA | 94.98 | 88.75 |
| SEED-Bench 2-Plus | 69.21 | 70.93 |
| OCRBench | 82.90 | 84.20 |
| RealWorldQA | 68.10 | 68.50 |

Using 🤗 Transformers to Chat

Here we show how to chat with the model using `transformers` and `qwen_vl_utils`.

If you find LLaVA-OneVision-1.5 useful in your research, please consider citing the related papers.

We extend our sincere gratitude to the AIAK team of the Baige AI computing platform from Baidu AI Cloud for providing an exceptional training framework. The outstanding capabilities of AIAK-Training-LLM and AIAK-Megatron significantly accelerated our training process with remarkable efficiency. These cutting-edge frameworks have been instrumental in achieving our research goals. To get full AIAK support, you can contact Baidu Cloud.

We acknowledge the support of Synvo AI for contributing to the partial data annotation in this work, and also thank the maintainers and contributors of the following open-source projects, whose work greatly inspired and supported our research:

- LLaVA: Large Language-and-Vision Assistant — LLaVA
- LLaVA-NeXT: next-generation multimodal assistant — LLaVA-NeXT
- lmms-eval: a standardized evaluation framework for Large Multimodal Models — lmms-eval
- Megatron-LM: efficient, scalable training for large language models — Megatron-LM
- Qwen2.5-VL: strong vision-language foundation model — Qwen2.5-VL
- InternVL: open-source large-scale vision-language foundation model — InternVL
- Qwen3: next-generation Qwen LLM — Qwen
- MetaCLIP: scalable contrastive pretraining — MetaCLIP
- FineVision: Open Data Is All You Need — FineVision
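The "Using 🤗 Transformers to Chat" section above refers to a `transformers` + `qwen_vl_utils` snippet; since the snippet itself did not survive on this card, here is a hedged sketch of the usual Qwen-VL-style chat flow. The checkpoint id and `AutoModelForCausalLM`/`trust_remote_code` loading path are assumptions; consult the project repository for the authoritative example:

```python
# Hedged sketch of the Qwen-VL-style chat recipe this card references.

def build_messages(image_url: str, question: str) -> list:
    """Chat messages mixing one image and one text question."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_url},
                {"type": "text", "text": question},
            ],
        }
    ]

def chat(image_url: str, question: str) -> str:
    """One chat round. Requires a GPU, model weights, and qwen_vl_utils."""
    from transformers import AutoModelForCausalLM, AutoProcessor
    from qwen_vl_utils import process_vision_info

    model_id = "lmms-lab/LLaVA-OneVision-1.5-8B-Instruct"
    processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype="auto", device_map="auto", trust_remote_code=True
    )
    messages = build_messages(image_url, question)
    text = processor.apply_chat_template(messages, add_generation_prompt=True)
    image_inputs, video_inputs = process_vision_info(messages)
    inputs = processor(
        text=[text], images=image_inputs, videos=video_inputs, return_tensors="pt"
    ).to(model.device)
    out = model.generate(**inputs, max_new_tokens=256)
    return processor.batch_decode(out, skip_special_tokens=True)[0]
```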

license:apache-2.0
7,673
49

llama3-llava-next-8b

5,759
102

llava-onevision-qwen2-7b-si

license:apache-2.0
4,203
12

LLaVA-NeXT-Video-7B-DPO

license:llama2
3,539
28

BAGEL-7B-MoT-ver.LE

This repository contains converted weights from the Bagel family of models, adapted for use with the lmms-engine.

license:apache-2.0
3,438
1

llava-onevision-qwen2-0.5b-si

license:apache-2.0
2,951
15

llava-onevision-qwen2-7b-ov-chat

license:apache-2.0
2,643
23

LLaVA-OneVision-1.5-4B-Instruct

license:apache-2.0
2,391
16

LongVA-7B-DPO

1,276
9

llava-critic-7b

license:apache-2.0
986
15

llava-onevision-qwen2-72b-ov-sft

license:apache-2.0
843
14

llava-critic-72b

license:apache-2.0
573
15

LLaVA-Video-72B-Qwen2

license:apache-2.0
492
20

llava-onevision-qwen2-0.5b-mid-stage-a4

481
0

llava-next-interleave-qwen-0.5b

445
12

LongVA-7B

325
15

llava-next-interleave-qwen-7b-dpo

201
12

LLaVA-Critic-R1-7B-Plus-Qwen

196
5

LLaVA-NeXT-Video-7B

license:llama2
192
49

LLaVA-Critic-R1-7B

190
0

llava-next-interleave-qwen-7b

157
27

EgoGPT-7b-EgoIT-EgoLife

license:apache-2.0
149
2

LLaVA-Video-7B-Qwen2-Video-Only

license:apache-2.0
131
5

llava-next-qwen-32b

license:apache-2.0
113
7

llava-onevision-qwen2-7b-mid-stage-a4

96
0

LLaVA-OneVision-1.5-4B-Base

license:apache-2.0
94
1

Qwen2-7B-Instruct-224K

75
5

Aero-1-Audio

license:mit
63
91

Qwen2-VL-2B-GRPO-8k

license:mit
45
17

MMSearch-R1-7B

license:apache-2.0
40
7

LLaVA-OneVision-1.5-8B-Base

LLaVA-OneVision-1.5: Fully Open-Source State-of-the-Art VLM

LLaVA-OneVision-1.5 introduces a novel family of fully open-source Large Multimodal Models (LMMs) that achieves state-of-the-art performance at substantially lower cost through training on native-resolution images.

- Superior Performance: a family of fully open-source large multimodal models demonstrating superior performance across multiple multimodal benchmarks, outperforming Qwen2.5-VL in most evaluation tasks.
- High-Quality Data at Scale: meticulously curated pre-training and SFT data with rigorous filtering and quality control, achieving superior data efficiency with only 64B tokens.
  - Concept-balanced, highly diverse, high-quality caption data
  - Comprehensive instruction fine-tuning data covering a wide range of tasks
- Ultra-Efficient Training Framework: a complete end-to-end training framework designed for maximum efficiency.
  - $16,000 total budget for full model training on A100 GPUs ($0.60 per GPU-hour)
  - 45% HFU efficiency at 8K context length
  - Built on Megatron-LM with support for MoE, FP8, and long-sequence parallelism
  - Optimized codebase for cost-effective scaling
- Fully Open Framework for community access and reproducibility:
  - High-quality pre-training & SFT data
  - Complete training framework & code
  - Training recipes & configurations
  - Comprehensive training logs & metrics

If you find LLaVA-OneVision-1.5 useful in your research, please consider citing the related papers.

license:apache-2.0
40
0

LLaVA-OneVision-1.5-4B-stage0

This model provides an initialization checkpoint for training LLaVA-OneVision-1.5, designed to combine strong language and vision capabilities. It integrates a powerful LLM and a state-of-the-art vision encoder, with a flexible adapter to enable efficient multimodal learning.

- Vision Encoder: the pretrained ViT from DeepGlint-AI/rice-vit-large-patch14-560, used to extract rich visual features.
- Adapter: a randomly initialized adapter module with 4× token compression, enabling efficient fusion of the image and text modalities.
- Language Model: the pretrained Qwen/Qwen3-4B-Instruct-2507 for robust text understanding and generation.

This initialization checkpoint is intended for downstream training and fine-tuning. For usage and training scripts, please refer to the EvolvingLMMs-Lab/LLaVA-OneVision-1.5 repository.

- DeepGlint-AI/rice-vit-large-patch14-560
- Qwen/Qwen3-4B-Instruct-2507
- EvolvingLMMs-Lab/LLaVA-OneVision-1.5

If you find LLaVA-OneVision-1.5 useful in your research, please consider citing the related papers.
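A 4× token compression adapter is commonly implemented as a 2×2 spatial merge: each 2×2 block of vision tokens is concatenated channel-wise into one token, quartering the sequence length before an MLP projects it into the LLM embedding space. A minimal pure-Python sketch of that reshaping (an illustration of the general pattern, not this repository's actual adapter code):

```python
def merge_2x2_tokens(tokens: list, grid: int) -> list:
    """Merge each 2x2 block of vision tokens into one wider token.

    tokens: row-major list of grid*grid feature vectors (lists of floats).
    Returns grid*grid//4 vectors, each the concatenation of a 2x2 block:
    4x fewer tokens, 4x wider features.
    """
    assert grid % 2 == 0, "grid side must be even for a 2x2 merge"
    merged = []
    for br in range(0, grid, 2):          # block row
        for bc in range(0, grid, 2):      # block column
            block = []
            for dr in (0, 1):
                for dc in (0, 1):
                    block.extend(tokens[(br + dr) * grid + (bc + dc)])
            merged.append(block)
    return merged
```

For a 560-pixel, patch-14 encoder this turns a 40×40 token grid (1,600 tokens) into 400 tokens, which is what makes high-resolution multimodal training affordable.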

license:apache-2.0
28
1

llama3-llava-next-8b-hf-sae-131k

license:mit
22
7

llava-onevision-qwen2-72b-si

license:apache-2.0
19
1

EgoGPT-0.5b-Demo

license:apache-2.0
19
0

LLaVA-Critic-R1-7B-Plus-Mimo

19
0

llava-onevision-qwen2-72b-ov-chat

license:apache-2.0
14
9

Qwen2-VL-7B-GRPO-8k

license:mit
14
3

llava-next-72b

11
14

LLaVA-NeXT-Video-7B-32K

license:apache-2.0
9
7

llavanext-qwen-tokenizer

9
2

LLaVA-NeXT-Video-34B-DPO

license:llama2
7
10

MMSearch-R1-7B-0807

Introduction

MMSearch-R1-7B is a search-augmented LMM trained with end-to-end reinforcement learning and equipped with the ability to invoke multimodal search tools on demand. In August 2025, we updated this model with improved reasoning capabilities; please check our blog.

Model Details
- Model name: MMSearch-R1-7B-0807
- Architecture: Qwen2.5-VL-7B base model, fine-tuned with reinforcement learning (GRPO)
- Model type: multimodal large language model with search augmentation
- Languages: English (primary), multilingual (partial)
- License: Apache License 2.0
- Paper: MMSearch-R1: Incentivizing LMMs to Search
- Code: EvolvingLMMs-Lab/multimodal-search-r1

| Models | MMK12 | MathVerse (testmini) | MathVision (testmini) | MathVista (testmini) | MMMU (val) | AI2D | ChartQA | MME | RealWorldQA | OCRBench | DocVQA | MMBench | MMStar | MIA-Bench |
|--------|-------|------|------|------|------|------|---------|-----|------|------|------|---------|--------|------|
| Qwen2.5-VL-7B | 34.4 | 46.2 | 24.0 | 66.6 | 49.8 | 93.3 | 94.4 | 630.4/1685.2 | 68.5 | 85.2 | 94.6 | 82.9 | 62.6 | 81.7 |
| General Search | 43.6 | 52.0 | 27.3 | 74.7 | 56.1 | 94.6 | 94.0 | 718.9/1775.3 | 65.5 | 77.8 | 89.4 | 84.0 | 60.4 | 44.4 |

| Models | InfoSeek | MMSearch | FVQA | SimpleVQA |
|--------|----------|----------|------|-----------|
| Qwen2.5-VL-7B | 20.1 | 12.8 | 20.3 | 38.4 |
| MMSearch | 55.1 | 53.8 | 58.4 | 57.4 |
| General Search | 52.0 | 54.9 | 52.8 | 57.0 |
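The "invoke search tools on demand" behavior amounts to a generate-act loop: at each round the model either answers directly or emits a tool request, whose results are appended to the context before regenerating. A schematic of that control flow (the `<search>…</search>` marker and helper names here are illustrative assumptions, not the repository's actual interface):

```python
def run_with_search(generate, search, question: str, max_rounds: int = 3) -> str:
    """Generic search-augmented answering loop.

    generate(context) -> str: model output, either a final answer or a
        tool request of the form "<search>query</search>".
    search(query) -> str: text results from an external (image/web) search tool.
    """
    context = question
    for _ in range(max_rounds):
        out = generate(context)
        if "<search>" not in out:
            return out  # the model chose to answer directly
        query = out.split("<search>", 1)[1].split("</search>", 1)[0]
        context += f"\n[search results for '{query}']\n{search(query)}"
    return generate(context)  # force a final answer after the round budget
```

The RL objective rewards answering correctly while penalizing unnecessary tool calls, which is what teaches the policy to search only when its parametric knowledge is insufficient.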

license:apache-2.0
6
0

LLaVA-OneVision-1.5-8B-stage0

This model provides an initialization checkpoint for training LLaVA-OneVision-1.5, designed to combine strong language and vision capabilities. It integrates a powerful LLM and a state-of-the-art vision encoder, with a flexible adapter to enable efficient multimodal learning.

- Vision Encoder: the pretrained ViT from DeepGlint-AI/rice-vit-large-patch14-560, used to extract rich visual features.
- Adapter: a randomly initialized adapter module with 4× token compression, enabling efficient fusion of the image and text modalities.
- Language Model: the pretrained Qwen/Qwen3-8B-Base for robust text understanding and generation.

This initialization checkpoint is intended for downstream training and fine-tuning. For usage and training scripts, please refer to the EvolvingLMMs-Lab/LLaVA-OneVision-1.5 repository.

- DeepGlint-AI/rice-vit-large-patch14-560
- Qwen/Qwen3-8B-Base
- EvolvingLMMs-Lab/LLaVA-OneVision-1.5

If you find LLaVA-OneVision-1.5 useful in your research, please consider citing the related papers.

license:apache-2.0
5
2

MovieChat-ckpt

llama
4
4

llava-next-110b

3
21

llava-next-vicuna-v1.5-7b-s2

llava_llama
3
3

llavanext-qwen-siglip-tokenizer

3
3

llava-onevision-qwen2-72b-mid-stage-a4

3
0

EgoGPT-7b-EgoIT

license:apache-2.0
2
0

LLaVA-Critic-R1-7B-LLaMA32v

mllama
2
0

LLaVA-NeXT-Video-34B

license:apache-2.0
1
15

llava-mistral-7b-tokenizer

1
2

EgoGPT-7b-Demo

license:apache-2.0
1
0

llama3-llava-next-8b-tokenizer

0
3

llava-onevision-projectors

license:apache-2.0
0
3

PG_Video_LLaVA-projector

0
1