omni-research

4 models

Tarsier2-Recap-7b

Tarsier Model Card

Introduction

Tarsier2-Recap-7b is built upon Qwen2-VL-7B-Instruct by distilling the video description capabilities of Tarsier2-7b. Specifically, we finetuned Qwen2-VL-7B-Instruct on Tarsier2-Recap-585K for 2 epochs with a learning rate of 2e-5. Tarsier2-Recap-7b has video captioning ability similar to Tarsier2-7b, reaching an overall F1 score of 40.7% on DREAM-1K, which is behind only Tarsier2-7b (42.0%) and surpasses GPT-4o's 39.2%. See the Tarsier2 technical report for more details.

Note: Please use Tarsier2-7b if you need the full-fledged Tarsier2.

Model details
- Base Model: Qwen2-VL-7B-Instruct
- Training Data: Tarsier2-Recap-585K
- Model date: Tarsier2-Recap-7b was trained in December 2024.

Paper or resources for more information:
- github repo: https://github.com/bytedance/tarsier/tree/tarsier2
- paper link: https://arxiv.org/abs/2501.07888
- leaderboard: https://tarsier-vlm.github.io/

Intended use

Primary intended uses: The primary use of Tarsier is research on large multimodal models, especially video description.

Primary intended users: The primary intended users of the model are researchers and hobbyists in computer vision, natural language processing, machine learning, and artificial intelligence.

Model Performance

Video Description

We evaluate Tarsier2-Recap-7b on DREAM-1K, a detailed video description benchmark featuring dynamic and diverse videos that assesses the model's ability to describe fine-grained actions and events. Here is the evaluation result:

Note: The results of Tarsier2-Recap-7b differ from those reported in Table 11 of the Tarsier2 technical report, as Tarsier2-Recap-7b is more fully trained (2 epochs vs. 1 epoch).

Video Question-Answering

We evaluate Tarsier2-Recap-7b on TVBench, a novel multiple-choice question-answering benchmark that requires a high level of temporal understanding. As Tarsier2-Recap-7b is trained only on video caption data, it needs an additional prompt to induce it to perform multiple-choice question-answering; see the TVBench samples for an example. Here is the evaluation result:

| Task | Tarsier2-Recap-7b | Tarsier2-7b |
| ------- | :--------: | :-------: |
| Action Antonym | 91.2 | 94.1 |
| Action Count | 43.1 | 40.5 |
| Action Localization | 42.5 | 37.5 |
| Action Sequence | 70.5 | 72.3 |
| Egocentric Sequence | 22.0 | 24.5 |
| Moving Direction | 37.1 | 33.2 |
| Object Count | 46.6 | 62.8 |
| Object Shuffle | 36.9 | 31.6 |
| Scene Transition | 85.9 | 88.1 |
| Unexpected Action | 28.0 | 41.5 |
| OVERALL | 54.0 | 54.7 |

How to Use

See https://github.com/bytedance/tarsier?tab=readme-ov-file#usage.

Where to send questions or comments about the model: https://github.com/bytedance/tarsier/issues
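Since Tarsier2-Recap-7b is finetuned from Qwen2-VL-7B-Instruct, the sketch below assumes the released checkpoint can be loaded through the standard Qwen2-VL classes in transformers plus the qwen-vl-utils helper package; the video path, prompt wording, and generation settings are illustrative, and the repo's usage instructions linked above remain authoritative.

```python
# Minimal inference sketch, assuming the checkpoint is loadable via the
# Qwen2-VL classes in transformers (Tarsier2-Recap-7b is finetuned from
# Qwen2-VL-7B-Instruct). For official usage, see the github repo above.
import torch
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration
from qwen_vl_utils import process_vision_info  # helper package from the Qwen2-VL authors

model_id = "omni-research/Tarsier2-Recap-7b"
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# For multiple-choice VQA (e.g. TVBench), replace the text below with the
# question, the candidate options, and an instruction such as "Answer with
# the option's letter." (illustrative wording; the card only says an extra
# prompt is needed).
messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": "path/to/video.mp4"},
        {"type": "text", "text": "Describe the video in detail."},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=512)
answer = processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```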


Tarsier-7b

Tarsier Model Card

Model details

Model type: Tarsier-7b is a member of the Tarsier family -- open-source large-scale video-language models designed to generate high-quality video descriptions, together with strong general video understanding capability (Tarsier-34b achieves SOTA results on 6 open benchmarks).

Base LLM: liuhaotian/llava-v1.6-vicuna-7b

Paper or resources for more information:
- github repo: https://github.com/bytedance/tarsier
- paper link: https://arxiv.org/abs/2407.00634

Where to send questions or comments about the model: https://github.com/bytedance/tarsier/issues

Intended use

Primary intended uses: The primary use of Tarsier is research on large multimodal models, especially video description.

Primary intended users: The primary intended users of the model are researchers and hobbyists in computer vision, natural language processing, machine learning, and artificial intelligence.

Training dataset

Tarsier adopts a two-stage training strategy:
- Stage-1: Multi-task pre-training on 13M data
- Stage-2: Multi-grained instruction tuning on 500K data

In both stages, we freeze the ViT and train all parameters of the projection layer and the LLM (a minimal sketch of this scheme follows this card).

Evaluation dataset
- A challenging video description benchmark: DREAM-1K
- Multi-choice VQA: MVBench, NeXT-QA and EgoSchema
- Open-ended VQA: MSVD-QA, MSR-VTT-QA, ActivityNet-QA and TGIF-QA
- Video caption: MSVD-Caption, MSRVTT-Caption, VATEX

How to Use

See https://github.com/bytedance/tarsier?tab=readme-ov-file#usage
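As a reference for the freezing scheme mentioned above, here is a minimal PyTorch sketch; the attribute names vision_tower, multi_modal_projector and language_model are hypothetical LLaVA-style names, not necessarily those used in the Tarsier codebase.

```python
# Sketch of the freezing scheme described above: the ViT stays frozen in both
# training stages while the projection layer and the LLM are trained.
# Attribute names (vision_tower, multi_modal_projector, language_model) are
# hypothetical LLaVA-style names, not necessarily those in the Tarsier repo.
import torch

def configure_trainable_params(model: torch.nn.Module) -> None:
    for p in model.vision_tower.parameters():            # ViT: frozen in both stages
        p.requires_grad = False
    for p in model.multi_modal_projector.parameters():   # projection layer: trained
        p.requires_grad = True
    for p in model.language_model.parameters():          # LLM: trained
        p.requires_grad = True

# The optimizer then receives only the trainable parameters, e.g.:
# torch.optim.AdamW((p for p in model.parameters() if p.requires_grad), lr=...)
```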

License: llama2

Tarsier2-7b-0115

Tarsier Model Card

Introduction

We propose Tarsier2-7B(-0115) as the latest member of the Tarsier series. Tarsier2-7B sets new state-of-the-art results across 16 public video understanding benchmarks, spanning tasks such as video captioning, video question-answering, video grounding, hallucination testing, etc. On the Tarsier series' main feature -- detailed video description -- Tarsier2-7B consistently outperforms leading proprietary models, including GPT-4o and Gemini 1.5 Pro, in both automatic metrics and human evaluation.

Compared to Tarsier-7B, Tarsier2-7B is comprehensively upgraded in base model (Qwen2-VL-7B) and in training data and stages:
- Pre-train: We scale up the training data to 40M video-text pairs, expanding both volume and diversity.
- SFT: Fine-grained temporal alignment is performed during supervised fine-tuning.
- DPO: Model-based sampling is used to automatically construct preference data, and DPO training is applied for optimization (a generic sketch of the DPO loss follows this card).

Model details
- Base Model: Qwen2-VL-7B-Instruct
- Training Data:
  - Pre-train: Over 40M samples of mixed video, image and text data, with 20.4M open-source and 19.8M in-house (Figure 1: Summary of datasets used in the pre-training stage of Tarsier2).
  - Post-train: 150K human-annotated detailed video descriptions for SFT and 20K automatically sampled and filtered preference pairs for DPO.
- Model date: Tarsier2-7b-0115 was trained in December 2024.

Paper or resources for more information:
- online demo: https://huggingface.co/spaces/omni-research/Tarsier2-7b
- github repo: https://github.com/bytedance/tarsier/tree/tarsier2
- paper link: https://arxiv.org/abs/2501.07888
- leaderboard: https://tarsier-vlm.github.io/

Performance

Tarsier2-7B excels in various video understanding tasks, including video captioning, video question-answering, video grounding, hallucination testing, etc. (Figure 2: Performance comparison of Tarsier2 with previous SOTA models at 7B scale and GPT-4o.)

Intended use

Primary intended uses: The primary use of Tarsier is research on large multimodal models, especially video description.

Primary intended users: The primary intended users of the model are researchers and hobbyists in computer vision, natural language processing, machine learning, and artificial intelligence.

How to Use

See https://github.com/bytedance/tarsier?tab=readme-ov-file#usage.

Where to send questions or comments about the model: https://github.com/bytedance/tarsier/issues

Citation

If you find our work helpful, feel free to cite us as:
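For reference, the DPO stage mentioned above optimizes a direct-preference objective over the sampled preference pairs; below is a self-contained sketch of the standard DPO loss (Rafailov et al., 2023), not the authors' actual training code.

```python
# Generic DPO loss on per-sequence log-probabilities (Rafailov et al., 2023),
# shown for reference only; this is the standard formulation, not the
# authors' training code.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Each input is a (batch,) tensor of summed per-sequence log-probs."""
    chosen_margin = policy_chosen_logps - ref_chosen_logps        # policy shift on preferred
    rejected_margin = policy_rejected_logps - ref_rejected_logps  # policy shift on rejected
    # Push the preferred description's margin above the rejected one's.
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```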

License: apache-2.0

Tarsier-34b

License: apache-2.0