ByteDance
Verified Enterprise. ByteDance AI, parent company of TikTok.
Sa2VA-4B
Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos

[\[GitHub\]](https://github.com/magic-research/Sa2VA) [\[Sa2VA paper\]](https://arxiv.org/abs/2501.04001) [\[Quick Start\]](#quick-start)

Sa2VA is an MLLM capable of question answering, visual prompt understanding, and dense object segmentation at both image and video levels. It achieves performance comparable to the SOTA MLLMs Qwen2-VL and InternVL2.5 on question-answering benchmarks, and it additionally offers the visual prompt understanding and dense object segmentation capabilities that those models lack. Sa2VA achieves SOTA performance on both image and video grounding and segmentation benchmarks.

We built the Sa2VA series on Qwen2-VL and InternVL2/2.5. The table below lists the Sa2VA models built on InternVL2.5; other Sa2VA models will be open-sourced soon.

| Model Name | Base MLLM | Language Part | HF Link |
|:----------:|:---------:|:-------------:|:-------:|
| Sa2VA-1B   | InternVL2.5-1B  | Qwen2.5-0.5B-Instruct | link |
| Sa2VA-4B   | InternVL2.5-4B  | Qwen2.5-3B-Instruct   | link |
| Sa2VA-8B   | InternVL2.5-8B  | internlm2_5-7b-chat   | link |
| Sa2VA-26B  | InternVL2.5-26B | internlm2_5-20b-chat  | link |

Sa2VA Performance

| Model Name | MME | MMBench | RefCOCO | RefCOCO+ | RefCOCOg | MeViS (val_u) | DAVIS |
|:----------:|:--------:|:----:|:----:|:----:|:----:|:----:|:----:|
| Sa2VA-1B   | 1504/434 | 71.9 | 79.6 | 73.6 | 77.7 | 53.4 | 69.5 |
| Sa2VA-4B   | 1691/610 | 81.8 | 82.4 | 77.6 | 79.7 | 55.9 | 73.7 |
| Sa2VA-8B   | 1690/610 | 84.4 | 82.6 | 78.0 | 80.3 | 58.9 | 75.9 |
| Sa2VA-26B  | 1698/653 | 85.8 | 82.9 | 79.3 | 81.2 | 61.8 | 78.6 |

We provide example code to run `Sa2VA` using `transformers`.

If you find this project useful in your research, please consider citing:
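A minimal sketch of such a `transformers` call, following the quick-start pattern of the official model card (the `predict_forward` entry point and the input-dict keys come from the card's remote code; verify them against the model card before relying on this):

```python
def build_inputs(image, question, tokenizer):
    """Assemble the input dict Sa2VA's remote code expects.
    Key names follow the official quick-start snippet; treat them
    as assumptions and check the model card."""
    return {
        "image": image,
        "text": f"<image>{question}",  # <image> placeholder precedes the question
        "past_text": "",
        "mask_prompts": None,
        "tokenizer": tokenizer,
    }

def run_demo(model_path: str = "ByteDance/Sa2VA-4B", image_path: str = "demo.png"):
    """Load Sa2VA via remote code and answer one question about an image.
    Requires a CUDA GPU and downloads the checkpoint on first use."""
    import torch
    from PIL import Image
    from transformers import AutoModel, AutoTokenizer

    model = AutoModel.from_pretrained(
        model_path,
        torch_dtype=torch.bfloat16,
        low_cpu_mem_usage=True,
        trust_remote_code=True,
    ).eval().cuda()
    tokenizer = AutoTokenizer.from_pretrained(
        model_path, trust_remote_code=True, use_fast=False
    )

    image = Image.open(image_path).convert("RGB")
    out = model.predict_forward(
        **build_inputs(image, "Please describe the image.", tokenizer)
    )
    return out["prediction"]  # segmentation masks, if any, under "prediction_masks"
```

Call `run_demo()` on a machine with a GPU; for segmentation, phrase the question as a referring prompt (e.g. "Please segment the person on the left").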
SDXL-Lightning
SDXL-Lightning is a lightning-fast text-to-image generation model. It can generate high-quality 1024px images in a few steps. For more information, please refer to our research paper: SDXL-Lightnin...
Hyper-SD
---
library_name: diffusers
inference: false
tags:
- lora
- text-to-image
- stable-diffusion
- flux
base_model: black-forest-labs/FLUX.1-dev
---
LatentSync-1.6
Many people have reported that the teeth and lips generated by LatentSync 1.5 are blurry. To address this issue, we trained LatentSync 1.6 on 512 x 512 resolution videos. Notably, we did not make any changes to the model structure or training strategy; the only modification was upgrading the training dataset to 512 x 512 videos. Therefore, the current code is compatible with both LatentSync 1.5 and 1.6. To switch between versions, you only need to load the corresponding checkpoint and modify the `resolution` parameter in the U-Net config file. You can view the demo in LatentSync's official GitHub repo.
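The version switch described above can be sketched as a checkpoint/resolution pairing. Everything here is illustrative: the file names are placeholders, and the 256 px value for 1.5 is an assumption (only 1.6's 512 px comes from this card); use the repository's actual checkpoints and U-Net config.

```python
# Hypothetical version table; names and the 1.5 resolution are assumptions.
VERSIONS = {
    "1.5": {"checkpoint": "latentsync_unet_v1.5.pt", "resolution": 256},
    "1.6": {"checkpoint": "latentsync_unet_v1.6.pt", "resolution": 512},
}

def unet_settings(version: str) -> dict:
    """Return the checkpoint and the `resolution` value to write into
    the U-Net config for a given LatentSync version."""
    if version not in VERSIONS:
        raise ValueError(f"unknown LatentSync version: {version!r}")
    return VERSIONS[version]
```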
AnimateDiff-Lightning
AnimateDiff-Lightning is a lightning-fast text-to-video generation model. It can generate videos more than ten times faster than the original AnimateDiff. For more information, please refer to our ...
Ouro-1.4B
Ouro-2.6B-Thinking
LatentSync-1.5
Dolphin-1.5
Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting

Dolphin (Document Image Parsing via Heterogeneous Anchor Prompting) is a novel multimodal document image parsing model that follows an analyze-then-parse paradigm. Document image parsing is challenging because of complexly intertwined elements such as text paragraphs, figures, formulas, and tables. Dolphin addresses these challenges through a two-stage approach:

1. Stage 1: comprehensive page-level layout analysis, generating the element sequence in natural reading order
2. Stage 2: efficient parallel parsing of document elements using heterogeneous anchors and task-specific prompts

Dolphin achieves promising performance across diverse page-level and element-level parsing tasks while ensuring superior efficiency through its lightweight architecture and parallel parsing mechanism.

Dolphin is built on a vision encoder-decoder architecture using transformers:
- Vision Encoder: based on Swin Transformer, extracts visual features from document images
- Text Decoder: based on MBart, decodes text from visual features
- Prompt-based interface: uses natural language prompts to control parsing tasks

The model is implemented as a Hugging Face `VisionEncoderDecoderModel` for easy integration with the Transformers ecosystem. Our demo will be released in the coming days; please stay tuned! Please refer to our GitHub repository for detailed usage of the two parsing granularities:
- Page-wise parsing: an entire document image
- Element-wise parsing: a single element (paragraph, table, formula) image

This model builds on several open-source projects, including Hugging Face Transformers, Donut, Nougat, and Swin Transformer.
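Because the model ships as a standard `VisionEncoderDecoderModel`, loading it follows the usual Transformers encoder-decoder pattern. A sketch of Stage-1 page parsing is below; the `<s>... <Answer/>` prompt scaffold and the instruction wording are taken from the public demo scripts and should be treated as assumptions (check the GitHub repository for the exact prompts):

```python
def format_prompt(instruction: str) -> str:
    """Wrap a task instruction in Dolphin's prompt scaffold
    (scaffold tokens are an assumption; see the GitHub repo)."""
    return f"<s>{instruction} <Answer/>"

def parse_page(image_path: str, repo: str = "ByteDance/Dolphin") -> str:
    """Run Stage-1 layout analysis on a full page image."""
    from PIL import Image
    from transformers import AutoProcessor, VisionEncoderDecoderModel

    processor = AutoProcessor.from_pretrained(repo)
    model = VisionEncoderDecoderModel.from_pretrained(repo).eval()

    image = Image.open(image_path).convert("RGB")
    pixel_values = processor(image, return_tensors="pt").pixel_values

    # Prompt wording is illustrative, not the repository's exact task prompt.
    prompt_ids = processor.tokenizer(
        format_prompt("Parse the reading order of this document."),
        add_special_tokens=False,
        return_tensors="pt",
    ).input_ids
    outputs = model.generate(
        pixel_values=pixel_values,
        decoder_input_ids=prompt_ids,
        max_length=2048,
    )
    return processor.tokenizer.decode(outputs[0], skip_special_tokens=True)
```

Element-wise parsing would follow the same pattern with a cropped element image and a task-specific prompt.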
Dolphin
InfiniteYou
Ouro-1.4B-Thinking
Sa2VA-8B
Sa2VA-1B
Sa2VA-Qwen2_5-VL-3B
Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos

[\[GitHub\]](https://github.com/bytedance/Sa2VA) [\[Sa2VA paper\]](https://arxiv.org/abs/2501.04001) [\[Quick Start\]](#quick-start)

Sa2VA is an MLLM capable of question answering, visual prompt understanding, and dense object segmentation at both image and video levels. It achieves performance comparable to the SOTA MLLMs Qwen2.5-VL and InternVL3 on question-answering benchmarks, and it additionally offers the visual prompt understanding and dense object segmentation capabilities that those models lack. Sa2VA achieves SOTA performance on both image and video grounding and segmentation benchmarks.

We built the Sa2VA series on Qwen2.5-VL and InternVL2.5/3. The table below lists the Sa2VA models built on Qwen2.5-VL and InternVL3.

| Model Name | Base MLLM | Language Part | HF Link |
|:----------:|:---------:|:-------------:|:-------:|
| Sa2VA-InternVL3-2B  | InternVL3-2B  | Qwen2.5-1.5B | link |
| Sa2VA-InternVL3-8B  | InternVL3-8B  | Qwen2.5-7B   | link |
| Sa2VA-InternVL3-14B | InternVL3-14B | Qwen2.5-14B  | link |
| Sa2VA-Qwen2_5-VL-3B | Qwen2.5-VL-3B-Instruct | Qwen2.5-3B | link |
| Sa2VA-Qwen2_5-VL-7B | Qwen2.5-VL-7B-Instruct | Qwen2.5-7B | link |

Sa2VA Performance

| Model Name | MME | MMBench | RefCOCO | RefCOCO+ | RefCOCOg | MeViS (val_u) | DAVIS |
|:----------:|:--------:|:----:|:----:|:----:|:----:|:----:|:----:|
| Sa2VA-InternVL3-2B  | 1631/559 | 79.8 | 81.4 | 75.7 | 80.3 | 53.9 | 74.5 |
| Sa2VA-InternVL3-8B  | 1743/633 | 83.0 | 83.3 | 78.9 | 81.8 | 56.4 | 76.3 |
| Sa2VA-InternVL3-14B | 1746/724 | 84.3 | 83.6 | 79.9 | 83.6 | 59.2 | 76.6 |
| Sa2VA-Qwen2_5-VL-3B | 1533/572 | 78.4 | 79.6 | 74.0 | 77.1 | 51.6 | 73.4 |
| Sa2VA-Qwen2_5-VL-7B | 1552/676 | 84.5 | 82.4 | 77.5 | 81.5 | 56.4 | 79.4 |

We provide example code to run `Sa2VA` using `transformers`.

If you find this project useful in your research, please consider citing:
Sa2VA Qwen3 VL 4B
Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos

[\[GitHub\]](https://github.com/bytedance/Sa2VA) [\[Sa2VA paper\]](https://arxiv.org/abs/2501.04001) [\[Quick Start\]](#quick-start)

Sa2VA is an MLLM capable of question answering, visual prompt understanding, and dense object segmentation at both image and video levels. It achieves performance comparable to the SOTA MLLMs Qwen2.5-VL and InternVL3 on question-answering benchmarks, and it additionally offers the visual prompt understanding and dense object segmentation capabilities that those models lack. Sa2VA achieves SOTA performance on both image and video grounding and segmentation benchmarks.

We built the Sa2VA series on Qwen2.5/3-VL and InternVL2.5/3. The table below lists the Sa2VA models built on Qwen2.5/3-VL and InternVL3.

| Model Name | Base MLLM | Language Part | HF Link |
|:----------:|:---------:|:-------------:|:-------:|
| Sa2VA-InternVL3-2B  | InternVL3-2B  | Qwen2.5-1.5B | link |
| Sa2VA-InternVL3-8B  | InternVL3-8B  | Qwen2.5-7B   | link |
| Sa2VA-InternVL3-14B | InternVL3-14B | Qwen2.5-14B  | link |
| Sa2VA-Qwen2_5-VL-3B | Qwen2.5-VL-3B-Instruct | Qwen2.5-3B | link |
| Sa2VA-Qwen2_5-VL-7B | Qwen2.5-VL-7B-Instruct | Qwen2.5-7B | link |
| Sa2VA-Qwen3-VL-4B   | Qwen3-VL-4B-Instruct   | Qwen3-4B   | link |

Sa2VA Performance

| Model Name | MME | MMBench | RefCOCO | RefCOCO+ | RefCOCOg | MeViS (val_u) | DAVIS |
|:----------:|:--------:|:----:|:----:|:----:|:----:|:----:|:----:|
| Sa2VA-InternVL3-2B  | 1631/559 | 79.8 | 81.4 | 75.7 | 80.3 | 53.9 | 74.5 |
| Sa2VA-InternVL3-8B  | 1743/633 | 83.0 | 83.3 | 78.9 | 81.8 | 56.4 | 76.3 |
| Sa2VA-InternVL3-14B | 1746/724 | 84.3 | 83.6 | 79.9 | 83.6 | 59.2 | 76.6 |
| Sa2VA-Qwen2_5-VL-3B | 1533/572 | 78.4 | 79.6 | 74.0 | 77.1 | 51.6 | 73.4 |
| Sa2VA-Qwen2_5-VL-7B | 1552/676 | 84.5 | 82.4 | 77.5 | 81.5 | 56.4 | 79.4 |
| Sa2VA-Qwen3-VL-4B   | 1660/655 | 86.3 | 81.7 | 77.4 | 80.0 | 57.1 | 75.9 |

We provide example code to run `Sa2VA` using `transformers`.

If you find this project useful in your research, please consider citing:
Ouro-2.6B
Sa2VA Qwen2 5 VL 7B
Sa2VA-InternVL3-2B
Video As Prompt Wan2.1 14B
MegaTTS3
Video As Prompt CogVideoX 5B
Sa2VA InternVL3 14B
ListConRanker
XVerse
XVerse: Consistent Multi-Subject Control of Identity and Semantic Attributes via DiT Modulation

This repository contains the official model of the paper XVerse: Consistent Multi-Subject Control of Identity and Semantic Attributes via DiT Modulation. XVerse introduces a novel approach to multi-subject image synthesis, offering precise and independent control over individual subjects without disrupting the overall image latents or features. We achieve this by transforming reference images into offsets for token-specific text-stream modulation. This innovation enables high-fidelity, editable image generation in which you can robustly control both individual subject characteristics (identity) and their semantic attributes. XVerse significantly enhances capabilities for personalized and complex scene generation.

Where to send questions or comments about the model: https://github.com/bytedance/XVerse/issues

Citation: If XVerse is helpful, please help to star the repo. If you find this project useful for your research, please consider citing our paper:
Sa2VA-26B
Sa2VA InternVL3 8B
ID-Patch
sd2.1-base-zsnr-laionaes5
ContentV-8B
sd2.1-base-zsnr-laionaes6
sd2.1-base-zsnr-laionaes6-perceptual
Lynx
Shen Sang, Tiancheng Zhi, Tianpei Gu, Jing Liu, Linjie Luo

Lynx is a state-of-the-art high-fidelity personalized video generation model that creates videos from a single input image while preserving the subject's identity. It is built on a Diffusion Transformer (DiT) foundation model with lightweight ID-adapters and Ref-adapters for identity preservation and spatial detail enhancement.

- Lynx Full Model (`lynx_full`): Complete v...
DreamO
FaceCLIP
This repository provides the official models for the following paper:

Learning Joint ID-Textual Representation for ID-Preserving Image Synthesis
Zichuan Liu, Liming Jiang, Qing Yan, Yumin Jia, Hao Kang, Xin Lu, Min Jin Chong

> Abstract: Recent progress in text-to-image (T2I) diffusion models has greatly improved image quality and flexibility. However, a major challenge in personalized generation remains: preserving the subject's identity (ID) while allowing diverse visual changes. We address this with a new framework for ID-preserving image generation. Instead of relying on adapter modules to inject identity features into pre-trained models, we propose a unified multi-modal encoding strategy that jointly captures identity and text information. Our method, called FaceCLIP, learns a shared embedding space for facial identity and textual semantics. Given a reference face image and a text prompt, FaceCLIP produces a joint representation that guides the generative model to synthesize images consistent with both the subject's identity and the prompt. To train FaceCLIP, we introduce a multi-modal alignment loss that aligns features across face, text, and image domains. We then integrate FaceCLIP with existing UNet and Diffusion Transformer (DiT) architectures, forming a complete synthesis pipeline, FaceCLIP-x. Compared to existing ID-preserving approaches, our method produces more photorealistic portraits with better identity retention and text alignment. Extensive experiments demonstrate that FaceCLIP-x outperforms prior methods in both qualitative and quantitative evaluations.

Please clone our GitHub code repository and follow the detailed instructions to install and use the released models for local inference.

| Version | Description |
|:-------------:|:--------------------------------------------------------------------------:|
| FaceCLIP-SDXL | SDXL base model trained with FaceCLIP-L-14 and FaceCLIP-bigG-14 encoders. |
| FaceT5-FLUX   | FLUX.1-dev base model trained with FaceT5 encoder. |

The images used in this repository and related demos are sourced from consented subjects or generated by the models. These pictures are intended solely to showcase the capabilities of our research. If you have any concerns, please feel free to contact us, and we will promptly remove any inappropriate content.

Our model is released under the Creative Commons Attribution-NonCommercial 4.0 International Public License for academic research purposes only. Any manual or automatic downloading of the face models from OpenAI-CLIP-L-14 and OpenCLIP-bigG-14, the SDXL-base-1.0 base model, the FLUX.1-dev base model, etc., must follow their original licenses and be used only for academic research purposes.

This research aims to positively impact the field of Generative AI. Any usage of this method must be responsible and comply with local laws. The developers do not assume any responsibility for any potential misuse.

If you find FaceCLIP useful for your research or applications, please cite our paper. We would also appreciate a star for our GitHub repository. Thanks a lot!
BindWeave
BindWeave: Subject-Consistent Video Generation via Cross-Modal Integration

[Paper](https://arxiv.org/pdf/2510.00438) | [Project Page](https://lzy-dot.github.io/BindWeave/)

Zhaoyang Li (1,2), Dongjun Qian (2), Kai Su (2), Qishuai Diao (2), Xiangyang Xia (2), Chang Liu (2), Wenfei Yang (1), Tianzhu Zhang (1), Zehuan Yuan (2)
(1) University of Science and Technology of China, (2) ByteDance

Overview

BindWeave is a unified subject-consistent video generation framework for single- and multi-subject prompts, built on an MLLM-DiT architecture that couples a pretrained multimodal large language model with a diffusion transformer. It achieves cross-modal integration via entity grounding and representation alignment, leveraging the MLLM to parse complex prompts and produce subject-aware hidden states that condition the DiT for high-fidelity generation. For more details or tutorials, refer to ByteDance/BindWeave.

OpenS2V-Eval Performance

BindWeave achieves a solid score of 57.61 on the OpenS2V-Eval benchmark, highlighting its robust capabilities across multiple evaluation dimensions and demonstrating competitive performance against several leading open-source and commercial systems.

| Model | TotalScore↑ | AestheticScore↑ | MotionSmoothness↑ | MotionAmplitude↑ | FaceSim↑ | GmeScore↑ | NexusScore↑ | NaturalScore↑ |
|------|----|----|----|----|----|----|----|----|
| BindWeave | 57.61% | 45.55% | 95.90% | 13.91% | 53.71% | 67.79% | 46.84% | 66.85% |
| VACE-14B | 57.55% | 47.21% | 94.97% | 15.02% | 55.09% | 67.27% | 44.08% | 67.04% |
| Phantom-14B | 56.77% | 46.39% | 96.31% | 33.42% | 51.46% | 70.65% | 37.43% | 69.35% |
| Kling1.6 (20250503) | 56.23% | 44.59% | 86.93% | 41.60% | 40.10% | 66.20% | 45.89% | 74.59% |
| Phantom-1.3B | 54.89% | 46.67% | 93.30% | 14.29% | 48.56% | 69.43% | 42.48% | 62.50% |
| MAGREF-480P | 52.51% | 45.02% | 93.17% | 21.81% | 30.83% | 70.47% | 43.04% | 66.90% |
| SkyReels-A2-P14B | 52.25% | 39.41% | 87.93% | 25.60% | 45.95% | 64.54% | 43.75% | 60.32% |
| Vidu2.0 (20250503) | 51.95% | 41.48% | 90.45% | 13.52% | 35.11% | 67.57% | 43.37% | 65.88% |
| Pika2.1 (20250503) | 51.88% | 46.88% | 87.06% | 24.71% | 30.38% | 69.19% | 45.40% | 63.32% |
| VACE-1.3B | 49.89% | 48.24% | 97.20% | 18.83% | 20.57% | 71.26% | 37.91% | 65.46% |
| VACE-P1.3B | 48.98% | 47.34% | 96.80% | 12.03% | 16.59% | 71.38% | 40.19% | 64.31% |

BibTeX

```bibtex
@article{li2025bindweave,
  title={BindWeave: Subject-Consistent Video Generation via Cross-Modal Integration},
  author={Li, Zhaoyang and Qian, Dongjun and Su, Kai and Diao, Qishuai and Xia, Xiangyang and Liu, Chang and Yang, Wenfei and Zhang, Tianzhu and Yuan, Zehuan},
  journal={arXiv preprint arXiv:2510.00438},
  year={2025}
}
```
LatentSync
shot2story
Q-Insight
HLLM
Make-An-Audio-2
feature-preserve-portrait-editing
CascadeV
NEVC1.0
NEVC-1.0 (EHVC: Efficient Hierarchical Reference and Quality Structure for Neural Video Coding)

Introduction

This repository provides the pretrained model weights for NEVC-1.0, which integrates contributions from EHVC (Efficient Hierarchical Reference and Quality Structure for Neural Video Coding), one of the core components of the framework. EHVC introduces a hierarchical reference and quality structure that significantly improves both compression efficiency and rate-distortion performance. The corresponding code repository can be found here: NEVC-1.0-EHVC.

Key designs of EHVC include:
- Hierarchical multi-reference: resolves reference-quality mismatches using a hierarchical reference structure and a multi-reference scheme, optimized for low-delay configurations.
- Lookahead mechanism: enhances encoder-side context by leveraging forward features, thereby improving prediction accuracy and compression.
- Layer-wise quantization scale with random quality training: provides a flexible and efficient quality structure that adapts during training, resulting in improved encoding performance.

Models

EHVC uses two models: an intra model and an inter model.
- The intra model handles intra-frame coding.
- The inter model is responsible for inter-frame (predictive) coding.

Intra Model: The main contributions of NEVC-1.0 focus on inter coding. For intra coding, we directly adopt the pretrained model `cvpr2023_image_psnr.pth.tar` from DCVC-DC, without further training.

Inter Model: The inter model of NEVC-1.0 is provided at `/models/nevc1.0inter.pth.tar`.

(Figures: architecture of the inter model; BD-rate (%) comparison for PSNR against the VTM-23.4 LDB anchor, tested with 96 frames and intra-period = 32 and with full sequences and intra-period = -1; rate-distortion curves on the HEVC B, HEVC C, UVG, and MCL-JCV datasets under both settings.)

Citation

If you find NEVC-1.0 useful in your research or projects, please cite the following paper:
- EHVC: Efficient Hierarchical Reference and Quality Structure for Neural Video Coding. Junqi Liao, Yaojun Wu, Chaoyi Lin, Zhipin Deng, Li Li, Dong Liu, Xiaoyan Sun. Proceedings of the 33rd ACM International Conference on Multimedia (ACM MM 2025).

Acknowledgement

The intra model of this project is based on DCVC-DC.
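A minimal sketch of restoring the released inter-model weights with PyTorch. The checkpoint layout (an optional `state_dict` wrapper) is an assumption, and the model class itself comes from the NEVC-1.0-EHVC repository; this only loads parameters.

```python
def unwrap_state_dict(checkpoint: dict) -> dict:
    """Return the raw parameter dict whether or not the checkpoint
    nests it under a 'state_dict' key (layout is an assumption)."""
    return checkpoint.get("state_dict", checkpoint)

def load_inter_weights(path: str = "models/nevc1.0inter.pth.tar") -> dict:
    """Load the NEVC-1.0 inter-model checkpoint on CPU and return its
    parameter dict, ready for `model.load_state_dict(...)` with the
    inter-model class from the NEVC-1.0-EHVC repository."""
    import torch
    checkpoint = torch.load(path, map_location="cpu")
    return unwrap_state_dict(checkpoint)
```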