AIDC-AI

47 models

Ovis2-4B

It is recommended to use the latest version: Ovis2.5.

We are pleased to announce the release of Ovis2, our latest advancement in multi-modal large language models (MLLMs). Ovis2 inherits the innovative architectural design of the Ovis series, aimed at structurally aligning visual and textual embeddings. As the successor to Ovis1.6, Ovis2 incorporates significant improvements in both dataset curation and training methodologies.

- Small Model Performance: Optimized training strategies enable small-scale models to achieve higher capability density, demonstrating cross-tier leading advantages.
- Enhanced Reasoning Capabilities: Significantly strengthens Chain-of-Thought (CoT) reasoning abilities through the combination of instruction tuning and preference learning.
- Video and Multi-Image Processing: Video and multi-image data are incorporated into training to enhance the ability to handle complex visual information across frames and images.
- Multilingual Support and OCR: Enhances multilingual OCR beyond English and Chinese and improves structured data extraction from complex visual elements like tables and charts.

| Ovis MLLMs | ViT | LLM | Model Weights | Demo |
|:-----------|:-----------------------:|:---------------------:|:-------------:|:-----:|
| Ovis2-1B   | aimv2-large-patch14-448 | Qwen2.5-0.5B-Instruct | Huggingface   | Space |
| Ovis2-2B   | aimv2-large-patch14-448 | Qwen2.5-1.5B-Instruct | Huggingface   | Space |
| Ovis2-4B   | aimv2-huge-patch14-448  | Qwen2.5-3B-Instruct   | Huggingface   | Space |
| Ovis2-8B   | aimv2-huge-patch14-448  | Qwen2.5-7B-Instruct   | Huggingface   | Space |
| Ovis2-16B  | aimv2-huge-patch14-448  | Qwen2.5-14B-Instruct  | Huggingface   | Space |
| Ovis2-34B  | aimv2-1B-patch14-448    | Qwen2.5-32B-Instruct  | Huggingface   | -     |

Performance

We use VLMEvalKit, as employed in the OpenCompass multimodal and reasoning leaderboard, to evaluate Ovis2.
Image Benchmark

| Benchmark | Qwen2.5-VL-7B | InternVL2.5-8B-MPO | MiniCPM-o-2.6 | Ovis1.6-9B | InternVL2.5-4B-MPO | Ovis2-4B | Ovis2-8B |
|:----------|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| MMBench-V1.1 test | 82.6 | 82.0 | 80.6 | 80.5 | 77.8 | 81.4 | 83.6 |
| MMStar | 64.1 | 65.2 | 63.3 | 62.9 | 61 | 61.9 | 64.6 |
| MMMU val | 56.2 | 54.8 | 50.9 | 55 | 51.8 | 49.0 | 57.4 |
| MathVista testmini | 65.8 | 67.9 | 73.3 | 67.3 | 64.1 | 69.6 | 71.8 |
| HallusionBench | 56.3 | 51.7 | 51.1 | 52.2 | 47.5 | 53.8 | 56.3 |
| AI2D | 84.1 | 84.5 | 86.1 | 84.4 | 81.5 | 85.7 | 86.6 |
| OCRBench | 87.7 | 88.2 | 88.9 | 83 | 87.9 | 91.1 | 89.1 |
| MMVet | 66.6 | 68.1 | 67.2 | 65 | 66 | 65.5 | 65.1 |
| MMBench test | 83.4 | 83.2 | 83.2 | 82.7 | 79.6 | 83.2 | 84.9 |
| MMT-Bench val | 62.7 | 62.5 | 62.3 | 64.9 | 61.6 | 65.2 | 66.6 |
| RealWorldQA | 68.8 | 71.1 | 68.0 | 70.7 | 64.4 | 71.1 | 72.5 |
| BLINK | 56.1 | 56.6 | 53.9 | 48.5 | 50.6 | 53.0 | 54.3 |
| QBench | 77.9 | 73.8 | 78.7 | 76.7 | 71.5 | 78.1 | 78.9 |
| ABench | 75.6 | 77.0 | 77.5 | 74.4 | 75.9 | 77.5 | 76.4 |
| MTVQA | 28.5 | 27.2 | 23.1 | 19.2 | 28 | 29.4 | 29.7 |

Video Benchmark

| Benchmark | Qwen2.5-VL-7B | InternVL2.5-8B | LLaVA-OV-7B | InternVL2.5-4B | Ovis2-4B | Ovis2-8B |
|:----------|:---:|:---:|:---:|:---:|:---:|:---:|
| VideoMME (wo/w subs) | 65.1/71.6 | 64.2/66.9 | 58.2/61.5 | 62.3/63.6 | 64.0/66.3 | 68.0/71.6 |
| MVBench | 69.6 | 72.0 | 56.7 | 71.6 | 68.45 | 68.15 |
| MLVU (M-Avg/G-Avg) | 70.2/- | 68.9/- | 64.7/- | 68.3/- | 70.8/4.23 | 76.4/4.25 |
| MMBench-Video | 1.79 | 1.68 | - | 1.73 | 1.69 | 1.85 |
| TempCompass | 71.7 | - | - | - | 67.02 | 69.28 |

Usage

Below is a code snippet demonstrating how to run Ovis with various input types. For additional usage instructions, including the inference wrapper and Gradio UI, please refer to the Ovis GitHub.

Citation

If you find Ovis useful, please consider citing the paper.

License

This project is licensed under the Apache License, Version 2.0 (SPDX-License-Identifier: Apache-2.0).

Disclaimer

We used compliance-checking algorithms during the training process to ensure the compliance of the trained model to the best of our ability. Due to the complexity of the data and the diversity of language model usage scenarios, we cannot guarantee that the model is completely free of copyright issues or improper content. If you believe anything infringes on your rights or generates improper content, please contact us, and we will promptly address the matter.
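A minimal single-image sketch, assuming the remote-code helpers that ship with the checkpoint (`preprocess_inputs`, `get_text_tokenizer`, `get_visual_tokenizer`); treat these names and arguments as assumptions and refer to the official example on the Ovis GitHub for the canonical version:

```python
# Hedged sketch of single-image inference with Ovis2-4B (assumes a CUDA GPU and
# the helpers exposed by the checkpoint's trust_remote_code implementation).
import torch
from PIL import Image
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "AIDC-AI/Ovis2-4B",
    torch_dtype=torch.bfloat16,
    multimodal_max_length=32768,
    trust_remote_code=True,
).cuda()
text_tokenizer = model.get_text_tokenizer()
visual_tokenizer = model.get_visual_tokenizer()

# Build a single-image query; "<image>" marks where the visual tokens are inserted.
images = [Image.open("example.jpg")]
query = "<image>\nDescribe the content of this image."

prompt, input_ids, pixel_values = model.preprocess_inputs(query, images, max_partition=9)
attention_mask = torch.ne(input_ids, text_tokenizer.pad_token_id)
input_ids = input_ids.unsqueeze(0).to(device=model.device)
attention_mask = attention_mask.unsqueeze(0).to(device=model.device)
if pixel_values is not None:
    pixel_values = pixel_values.to(dtype=visual_tokenizer.dtype, device=visual_tokenizer.device)
pixel_values = [pixel_values]

with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        pixel_values=pixel_values,
        attention_mask=attention_mask,
        max_new_tokens=1024,
        do_sample=False,
        eos_token_id=model.generation_config.eos_token_id,
        pad_token_id=text_tokenizer.pad_token_id,
        use_cache=True,
    )[0]

print(text_tokenizer.decode(output_ids, skip_special_tokens=True))
```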

license:apache-2.0
78,725
62

Ovis2.5-2B

We are pleased to announce the release of Ovis2.5, the successor to Ovis2, designed for native-resolution visual perception and enhanced multimodal reasoning. It integrates a native-resolution vision transformer (NaViT) that processes images at their original, variable resolutions, eliminating the need for fixed-resolution tiling and preserving both fine details and global layout—crucial for visually dense content such as charts and diagrams. To strengthen reasoning, Ovis2.5 is trained not only on linear chain-of-thought (CoT) but also on reflective reasoning, including self-checking and revision. This advanced capability is available at inference as an optional thinking mode, enabling users to trade latency for higher accuracy on complex inputs.

Building on these advances, Ovis2.5-9B achieves an average score of 78.3 on the OpenCompass multimodal evaluation suite (SOTA among open-source MLLMs under 40B parameters), while the lightweight Ovis2.5-2B scores 73.9, continuing the "small model, big performance" philosophy for resource-constrained scenarios.

Key Features

- Native-Resolution Perception — NaViT vision encoder preserves fine details and global structure without lossy tiling.
- Deep-Reasoning Capability — Optional thinking mode for self-checking and revision beyond linear CoT. A thinking budget is supported.
- Chart & Document OCR — State-of-the-art at its scale for complex chart analysis, document understanding (including tables and forms), and OCR.
- Broad Task Coverage — Demonstrates leading performance on image reasoning, video understanding, and grounding benchmarks, showcasing strong general multimodal capability.

Usage

Below is a simple example demonstrating how to run Ovis2.5 with a single image input. For accelerated inference with vLLM, refer to GitHub. The thinking and thinking-budget logic can be applied in the same way to multi-image, video, and pure-text scenarios.

Note (answer extraction for CoT/Thinking): to make evaluation and usage easier, we recommend appending a fixed suffix to prompts when using chain-of-thought (CoT) or thinking mode. This ensures the model clearly outputs a final answer that can be extracted programmatically.
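A minimal sketch of the single-image flow with the optional thinking mode, assuming the remote-code helpers and generation arguments suggested by the description above (`preprocess_inputs`, `enable_thinking`, `thinking_budget`); consult the official Ovis2.5 example for the exact API:

```python
# Hedged sketch: single-image inference with the optional thinking mode.
# Helper names and the thinking-related generate() arguments are assumptions
# based on the feature description; check the official Ovis2.5 snippet.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM

MODEL_PATH = "AIDC-AI/Ovis2.5-2B"
ENABLE_THINKING = True   # optional reflective reasoning (self-checking, revision)
THINKING_BUDGET = 2048   # assumed cap on tokens spent on thinking
MAX_NEW_TOKENS = 3072    # total budget for thinking plus the final answer

model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).cuda()

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": Image.open("example_chart.png")},
        {"type": "text", "text": "What is the highest value in the chart?"},
    ],
}]

# Multimodal preprocessing provided by the checkpoint's remote code (assumed name).
input_ids, pixel_values, grid_thws = model.preprocess_inputs(
    messages=messages,
    add_generation_prompt=True,
    enable_thinking=ENABLE_THINKING,
)
input_ids = input_ids.cuda()
pixel_values = pixel_values.cuda() if pixel_values is not None else None
grid_thws = grid_thws.cuda() if grid_thws is not None else None

with torch.inference_mode():
    outputs = model.generate(
        inputs=input_ids,
        pixel_values=pixel_values,
        grid_thws=grid_thws,
        enable_thinking=ENABLE_THINKING,
        enable_thinking_budget=True,
        thinking_budget=THINKING_BUDGET,
        max_new_tokens=MAX_NEW_TOKENS,
    )

print(model.text_tokenizer.decode(outputs[0], skip_special_tokens=True))
```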
Tip: the sections below include an optional streaming helper (compatible with two-phase thinking/budget runs) and extra inference modes: multi-image, video, and text-only. To support the thinking budget, we modified the implementation of the Ovis `generate` method, so the default `TextIteratorStreamer` is no longer compatible. If you need to stream model output, be sure to use the helper class below.

- Example (multi-image): demonstrates how to run inference with multiple images and a related question.
- Example (video): demonstrates how to run inference on a video by sampling multiple frames and asking the model to describe the content.
- Example (text-only): demonstrates how to run inference using only text input, without any images or videos.

Grounding

To enable grounding, end your prompt with `Please provide the bounding box coordinates.` (for boxes) or `Please provide the point coordinates.` (for points). To target a specific object, wrap its description in reference tags. Coordinates are normalized to `[0,1)` with the origin `(0,0)` at the top-left corner of the image.

- Point: `(x,y)`
- Bounding box: `(x1,y1),(x2,y2)`, where `(x1,y1)` is the top-left and `(x2,y2)` is the bottom-right corner.
- Multiple results can be listed in square brackets: `[(...), (...)]`

| Ovis MLLMs | ViT | LLM | Model Weights | Demo |
|:-----------|:--------------------------:|:----------:|:-------------:|:-----:|
| Ovis2.5-2B | siglip2-so400m-patch16-512 | Qwen3-1.7B | Huggingface   | Space |
| Ovis2.5-9B | siglip2-so400m-patch16-512 | Qwen3-8B   | Huggingface   | Space |

Performance

We evaluate Ovis2.5 using VLMEvalKit, as employed in the OpenCompass multimodal and reasoning evaluation suite.

Citation

If you find Ovis useful, please consider citing the paper.

License

This project is licensed under the Apache License, Version 2.0 (SPDX-License-Identifier: Apache-2.0).

Disclaimer

We used compliance-checking algorithms during the training process to ensure the compliance of the trained model to the best of our ability. Due to the complexity of the data and the diversity of language model usage scenarios, we cannot guarantee that the model is completely free of copyright issues or improper content. If you believe anything infringes on your rights or generates improper content, please contact us, and we will promptly address the matter.

license:apache-2.0
64,812
200

Ovis2-8B

license:apache-2.0
36,069
75

Ovis2.6-30B-A3B

license:apache-2.0
15,076
139

Ovis2-2B

license:apache-2.0
5,544
59

Ovis2.5-9B

license:apache-2.0
4,345
303

Ovis2-1B

license:apache-2.0
3,867
96

Ovis2-4B-GPTQ-Int4

license:apache-2.0
3,750
3

Ovis-U1-3B

license:apache-2.0
3,145
203

Ovis2-34B-GPTQ-Int4

license:apache-2.0
2,901
8

Ovis-Image-7B

license:apache-2.0
2,380
183

Marco-LLM-GLO

license:apache-2.0
1,328
5

Ovis2-34B

license:apache-2.0
1,059
151

Ovis2-8B-GPTQ-Int4

license:apache-2.0
1,048
3

Marco-Nano-Instruct

license:apache-2.0
1,000
26

Marco-LLM-ES

license:apache-2.0
847
1

Marco MT Algharb

license:apache-2.0
656
19

Ovis2-16B

license:apache-2.0
378
101

Ovis1.6-Gemma2-9B

license:apache-2.0
342
275

CHATS

CHATS: Combining Human-Aligned Optimization and Test-Time Sampling for Text-to-Image Generation (ICML 2025)

CHATS is a next-generation framework that unifies human preference alignment with classifier-free guidance by modeling both preferred and dispreferred distributions and using a proxy-prompt-based sampling strategy for superior text–image alignment, fidelity, and aesthetic consistency. See the generation examples below (cf. Fig. 1 in our paper).

- Human-Aligned Fine-Tuning with CFG Integration: we integrate human preference alignment with classifier-free guidance sampling into a unified framework.
- Proxy-Prompt Sampling: leverages useful signals from both preferred and dispreferred distributions at test time.
- Data Efficiency: state-of-the-art results across benchmarks with minimal fine-tuning effort on a small, high-quality dataset.
- Plug-and-Play: compatible with any diffusion backbone and existing guidance methods.

We provide pretrained CHATS checkpoints on SDXL for easy download and evaluation:

- Model Repository: https://huggingface.co/AIDC-AI/CHATS

To train CHATS from scratch or fine-tune on your own data, run the training script with the following arguments:

- config_file: this DeepSpeed parameter allows you to specify the configuration file. If you wish to adjust the number of GPUs used for training, simply change the value of num_processes in the acdsxgpuzero0.yaml file to reflect the desired GPU count.
- pretrained_model_name_or_path: name or path of the UNet model to load
- pretrained_vae_model_name_or_path: name or path of the VAE model to load
- max_train_steps: maximum number of training steps
- output: output directory
- dataset_name: the Hugging Face suffix of the selected dataset (e.g. OIP)

The code is built upon DiffusionDPO, Diffusers, and Transformers.

The project is released under the Apache License 2.0 (http://www.apache.org/licenses/LICENSE-2.0, SPDX-License-Identifier: Apache-2.0).

We used compliance-checking algorithms during the training process to ensure the compliance of the trained model to the best of our ability. Due to the complexity of the data and the diversity of language model usage scenarios, we cannot guarantee that the model is completely free of copyright issues or improper content. If you believe anything infringes on your rights or generates improper content, please contact us, and we will promptly address the matter.

license:apache-2.0
296
7

Marco-Mini-Instruct

license:apache-2.0
248
12

Ovis2-16B-GPTQ-Int4

license:apache-2.0
236
5

Ovis1.5-Llama3-8B

license:apache-2.0
227
27

Ovis1.6-Llama3.2-3B

license:apache-2.0
182
49

Marco-LLM-AR-V2

license:apache-2.0
147
2

Marco-LLM-SEA

license:apache-2.0
138
1

Marco-LLM-AR-V4

license:apache-2.0
138
0

Parrot-7B

license:apache-2.0
134
4

Ovis1.5-Gemma2-9B

license:apache-2.0
130
19

Ovis1.6-Gemma2-27B

license:apache-2.0
127
62

Ovis1.6-Gemma2-9B-GPTQ-Int4

license:apache-2.0
125
9

Ovis1.6-Llama3.2-3B-GPTQ-Int4

base_model:AIDC-AI/Ovis1.6-Llama3.2-3B
115
4

Parrot-14B

license:apache-2.0
66
3

Ovis-Clip-Llama3-8B

license:apache-2.0
64
7

Ovis-Clip-Qwen1_5-14B

license:apache-2.0
64
3

Ovis-Clip-Qwen1_5-7B

license:apache-2.0
64
2

Ovis2-34B-GPTQ-Int8

license:apache-2.0
62
2

Ovis2-2B-GPTQ-Int4

license:apache-2.0
61
2

TeEFusion

license:cc-by-nc-4.0
20
2

CHATS-SD1d5

CHATS: Combining Human-Aligned Optimization and Test-Time Sampling for Text-to-Image Generation (ICML 2025)

CHATS is a next-generation framework that unifies human preference alignment with classifier-free guidance by modeling both preferred and dispreferred distributions and using a proxy-prompt-based sampling strategy for superior text–image alignment, fidelity, and aesthetic consistency. See the generation examples below (cf. Fig. 1 in our paper).

- Human-Aligned Fine-Tuning with CFG Integration: we integrate human preference alignment with classifier-free guidance sampling into a unified framework.
- Proxy-Prompt Sampling: leverages useful signals from both preferred and dispreferred distributions at test time.
- Data Efficiency: state-of-the-art results across benchmarks with minimal fine-tuning effort on a small, high-quality dataset.
- Plug-and-Play: compatible with any diffusion backbone and existing guidance methods.

We provide pretrained CHATS checkpoints for easy download and evaluation:

- SD1.5: https://huggingface.co/AIDC-AI/CHATS-SD1d5

To train CHATS from scratch or fine-tune on your own data, run the training script with the following arguments:

- config_file: this DeepSpeed parameter allows you to specify the configuration file. If you wish to adjust the number of GPUs used for training, simply change the value of num_processes in the acdsxgpuzero0.yaml file to reflect the desired GPU count.
- pretrained_model_name_or_path: name or path of the UNet model to load
- pretrained_vae_model_name_or_path: name or path of the VAE model to load
- max_train_steps: maximum number of training steps
- output: output directory
- dataset_name: the Hugging Face suffix of the selected dataset (e.g. OIP)

See the scripts in the `scripts` folder for further information.

The code is built upon DiffusionDPO, Diffusers, and Transformers.

The project is released under the Apache License 2.0 (http://www.apache.org/licenses/LICENSE-2.0, SPDX-License-Identifier: Apache-2.0).

We used compliance-checking algorithms during the training process to ensure the compliance of the trained model to the best of our ability. Due to the complexity of the data and the diversity of language model usage scenarios, we cannot guarantee that the model is completely free of copyright issues or improper content. If you believe anything infringes on your rights or generates improper content, please contact us, and we will promptly address the matter.

license:apache-2.0
4
4

Meissonic

license:apache-2.0
1
5

UNIC-Adapter

UNIC-Adapter: Unified Image-Instruction Adapter for Multimodal Image Generation

Paper: https://arxiv.org/abs/2412.18928 · Code: https://github.com/AIDC-AI/UNIC-Adapter · License: MIT

UNIC-Adapter is a unified image-instruction adapter that integrates multimodal instructions for controllable image generation. This model card hosts the official models for the CVPR 2025 paper "UNIC-Adapter: Unified Image-instruction Adapter with Multi-modal Transformer for Image Generation". On this model card, we release a model based on SD3 Medium, which supports the tasks described in our paper. In addition, we provide two further models: one built on SD3.5 Medium, which is capable of traditional computer vision perception tasks, and another on FLUX.1-dev, which supports both instruction-based image editing and traditional computer vision perception tasks.

- Pixel-level Control (Left: condition image; Center left: SD3 Medium with UNIC-Adapter; Center right: SD3.5 Medium with UNIC-Adapter; Right: FLUX.1-dev with UNIC-Adapter)
- Subject-driven Generation (Left: condition image; Center left: SD3 Medium with UNIC-Adapter; Center right: SD3.5 Medium with UNIC-Adapter; Right: FLUX.1-dev with UNIC-Adapter) and (Left: condition image; Center: SD3.5 Medium with UNIC-Adapter; Right: FLUX.1-dev with UNIC-Adapter)
- Style-driven Generation (Left: condition image; Center left: SD3 Medium with UNIC-Adapter; Center right: SD3.5 Medium with UNIC-Adapter; Right: FLUX.1-dev with UNIC-Adapter)
- Image Understanding (Left: source image; Center: SD3.5 Medium with UNIC-Adapter; Right: FLUX.1-dev with UNIC-Adapter)
- Image Editing (Left: source image; Right: FLUX.1-dev with UNIC-Adapter)

License

This project is licensed under the MIT License (SPDX-License-Identifier: MIT). The adapter models cannot be used independently of a base diffusion model. If you use our model in conjunction with the Flux model, you must review the FLUX.1 [dev] Non-Commercial License (https://github.com/black-forest-labs/flux/blob/main/model_licenses/LICENSE-FLUX1-dev) and comply with all of its terms; if you use our model in conjunction with the stable-diffusion-3-medium model, you must review the STABILITY AI COMMUNITY LICENSE AGREEMENT of the SD3 model and comply with all of its terms; if you use our model in conjunction with the stable-diffusion-3.5-medium model, you must review the STABILITY AI COMMUNITY LICENSE AGREEMENT of the SD3.5 model and comply with all of its terms.

Citation

If you find this repo helpful for your research, please cite our paper.

Disclaimer

We used compliance-checking algorithms during the training process to ensure the compliance of the trained model(s) to the best of our ability. Due to the complexity of the data and the diversity of language model usage scenarios, we cannot guarantee that the model is completely free of copyright issues or improper content. If you believe anything infringes on your rights or generates improper content, please contact us, and we will promptly address the matter.

license:mit
0
6

Omni-View

license:apache-2.0
0
5

Marco-Voice

license:apache-2.0
0
4

Wings-Qwen1_5-8B

Wings: A Versatile Multimodal LLM without Text-only Forgetting 🪽

A demo of inference is provided. We apologize for any inconvenience: currently, Wings can only be loaded through the raw method, but we are working on improving this. We have released Wings-Qwen1_5-8B, a version aligned with the LLaVA-v1.5 pretrain and finetune training data.

- Wings is a brand-new universal multimodal large language model (MLLM). Its flexible multimodal structure enhances the MLLM as if giving it wings, boosting multimodal performance while minimizing text-only forgetting.
- Any MLLM architecture can adopt the Wings component.

Multimodal large language models (MLLMs), initiated with a trained LLM, first align images with text and then fine-tune on multimodal mixed inputs. However, the MLLM catastrophically forgets the text-only instructions, which do not include images and can be addressed within the initial LLM. In this work, we present Wings, a novel MLLM that excels in both text-only dialogues and multimodal comprehension. Analyzing MLLM attention in multimodal instructions reveals that text-only forgetting is related to the attention shift from pre-image to post-image text. From that, we construct extra modules that act as boosted learners to compensate for the attention shift. The complementary visual and textual learners, like "wings" on either side, are connected in parallel within each layer's attention block. Initially, image and text inputs are aligned with visual learners operating alongside the main attention, balancing focus on visual elements. Textual learners are later collaboratively integrated with attention-based routing to blend the outputs of the visual and textual learners. We design Low-Rank Residual Attention (LoRRA) to guarantee high efficiency for the learners. Our experimental results demonstrate that Wings outperforms equally-scaled MLLMs in both text-only and visual question-answering tasks. On a newly constructed Interleaved Image-Text (IIT) benchmark, Wings exhibits superior performance from text-only-rich to multimodal-rich question-answering tasks.

License

This project is licensed under the Apache License, Version 2.0 (SPDX-License-Identifier: Apache-2.0).

Disclaimer

We used compliance-checking algorithms during the training process to ensure the compliance of the trained model to the best of our ability. Due to the complexity of the data and the diversity of language model usage scenarios, we cannot guarantee that the model is completely free of copyright issues or improper content. If you believe anything infringes on your rights or generates improper content, please contact us, and we will promptly address the matter.
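As a conceptual illustration of the learner design described above (not the released Wings implementation), the following sketch shows a pair of low-rank residual learners whose outputs are blended by a soft router and added to the main attention output; the module names, rank, and routing formulation are illustrative assumptions:

```python
# Conceptual PyTorch sketch of the Low-Rank Residual Attention (LoRRA) idea:
# cheap low-rank visual/textual "learners" run in parallel with the main
# attention block, and a per-token router blends their outputs.
import torch
import torch.nn as nn


class LoRRALearner(nn.Module):
    """Low-rank residual learner: down/up projection applied to hidden states."""

    def __init__(self, hidden_size: int, rank: int = 16):
        super().__init__()
        self.down = nn.Linear(hidden_size, rank, bias=False)
        self.up = nn.Linear(rank, hidden_size, bias=False)
        nn.init.zeros_(self.up.weight)  # start as a no-op residual correction

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return self.up(self.down(hidden_states))


class WingsBlockSketch(nn.Module):
    """Blends visual and textual learner outputs with a soft router and adds
    the result to the output of the layer's main attention."""

    def __init__(self, hidden_size: int, rank: int = 16):
        super().__init__()
        self.visual_learner = LoRRALearner(hidden_size, rank)
        self.textual_learner = LoRRALearner(hidden_size, rank)
        self.router = nn.Linear(hidden_size, 2)  # per-token weights over the two learners

    def forward(self, attn_output: torch.Tensor, hidden_states: torch.Tensor) -> torch.Tensor:
        weights = torch.softmax(self.router(hidden_states), dim=-1)  # (B, T, 2)
        vis = self.visual_learner(hidden_states)
        txt = self.textual_learner(hidden_states)
        blended = weights[..., :1] * vis + weights[..., 1:] * txt
        return attn_output + blended


# Toy usage with dummy tensors standing in for one transformer layer.
block = WingsBlockSketch(hidden_size=256)
h = torch.randn(2, 10, 256)         # hidden states entering the layer
attn_out = torch.randn(2, 10, 256)  # output of the layer's main attention
out = block(attn_out, h)
print(out.shape)  # torch.Size([2, 10, 256])
```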

license:apache-2.0
0
2

Marco-Mini-Global-Base

license:apache-2.0
0
1

Diffusion-SDPO

license:apache-2.0
0
1