llava-hf

29 models • 5 total models in database

Sort by:

llava-1.5-7b-hf

--- language: - en datasets: - liuhaotian/LLaVA-Instruct-150K pipeline_tag: image-text-to-text arxiv: 2304.08485 license: llama2 tags: - vision - image-text-to-text ---

llava-v1.6-mistral-7b-hf

--- license: apache-2.0 tags: - vision - image-text-to-text language: - en pipeline_tag: image-text-to-text inference: true ---

llava-onevision-qwen2-0.5b-ov-hf

--- language: - en - zh license: apache-2.0 tags: - vision - image-text-to-text - transformers.js datasets: - lmms-lab/LLaVA-OneVision-Data pipeline_tag: image-text-to-text arxiv: 2408.03326 library_name: transformers ---

NaNK

license:apache-2.0

258,377

LLaVA-NeXT-Video-7B-hf

--- language: - en license: llama2 pipeline_tag: video-text-to-text datasets: - lmms-lab/VideoChatGPT ---

llava-onevision-qwen2-7b-ov-hf

Check out also the Google Colab demo to run Llava on a free-tier Google Colab instance: [](https://colab.research.google.com/drive/1-4AtYjR8UMtCALV0AswU1kiNkWCLTALT?usp=sharing) Below is the model card of 7B LLaVA-Onevision model which is copied from the original LLaVA-Onevision model card that you can find here. Model type: LLaVA-Onevision is an open-source multimodal LLM trained by fine-tuning Qwen2 on GPT-generated multimodal instruction-following data. LLaVA-OneVision is the first single model that can simultaneously push the performance boundaries of open LMMs in three important computer vision scenarios: single-image, multi-image, and video scenarios. Importantly, the design of LLaVA-OneVision allows strong transfer learning across different modalities/scenarios, yielding new emerging capabilities. In particular, strong video understanding and cross-scenario capabilities are demonstrated through task transfer from images to videos. Model date: LLaVA-Onevision-7b-si was added in August 2024. Paper or resources for more information: https://llava-vl.github.io/ - Architecture: SO400M + Qwen2 - Pretraining Stage: LCS-558K, 1 epoch, projector - Mid Stage: A mixture of 4.7M high-quality synthetic data, 1 epoch, full model - Final-Image Stage: A mixture of 3.6M single-image data, 1 epoch, full model - OneVision Stage: A mixture of 1.6M single-image/multi-image/video data, 1 epoch, full model - Precision: bfloat16 First, make sure to have `transformers` installed from branch or `transformers >= 4.45.0`. The model supports multi-image and multi-prompt generation. Meaning that you can pass multiple images in your prompt. Make sure also to follow the correct prompt template by applyong chat template: Below we used `"llava-hf/llava-onevision-qwen2-7b-ov-hf"` checkpoint. Below is an example script to run generation in `float16` precision on a GPU device: First make sure to install `bitsandbytes`, `pip install bitsandbytes` and make sure to have access to a CUDA compatible GPU device. Simply change the snippet above with: Use Flash-Attention 2 to further speed-up generation First make sure to install `flash-attn`. Refer to the original repository of Flash Attention regarding that package installation. Simply change the snippet above with:

NaNK

license:apache-2.0

50,629

llava-hf

llava-1.5-7b-hf

llava-v1.6-mistral-7b-hf

llava-onevision-qwen2-0.5b-ov-hf

LLaVA-NeXT-Video-7B-hf

llava-onevision-qwen2-7b-ov-hf

llava-v1.6-vicuna-7b-hf

llama3-llava-next-8b-hf

llava-1.5-13b-hf

llava-interleave-qwen-0.5b-hf

llava-v1.6-vicuna-13b-hf

bakLlava-v1-hf

llava-v1.6-34b-hf

llava-onevision-qwen2-0.5b-si-hf

LLaVA-NeXT-Video-7B-DPO-hf

vip-llava-7b-hf

llava-onevision-qwen2-7b-ov-chat-hf

llava-onevision-qwen2-7b-si-hf

llava-interleave-qwen-7b-hf

LLaVA-NeXT-Video-34B-hf

llava-onevision-qwen2-72b-ov-hf

llava-next-72b-hf

LLaVA-Next-Video-7B-Qwen2-hf

llava-next-110b-hf

LLaVA-NeXT-Video-34B-DPO-hf

llava-onevision-qwen2-72b-ov-chat-hf

llava-interleave-qwen-7b-dpo-hf

LLaVA-NeXT-Video-7B-32K-hf

llava-onevision-qwen2-72b-si-hf

vip-llava-13b-hf