NCSOFT
VARCO-VISION-2.0-1.7B
Llama-VARCO-8B-Instruct
Language support for English and Korean.
VARCO-VISION-14B
🚨News🎙️

- The 2.0 model has been released. Please use the new version.
- 📰 2025-07-16: We released VARCO-VISION-2.0-14B at link
- 📰 2025-07-16: We released GME-VARCO-VISION-Embedding at link

VARCO-VISION-14B is a powerful English-Korean Vision-Language Model (VLM). Its training pipeline consists of four stages: Feature Alignment Pre-training, Basic Supervised Fine-tuning, Advanced Supervised Fine-tuning, and Preference Optimization. On both multimodal and text-only benchmarks, VARCO-VISION-14B not only outperforms other models of similar size but also achieves scores comparable to those of proprietary models. The model currently accepts a single image and a text prompt as input and generates a text output. It supports grounding and referring as well as OCR (Optical Character Recognition).

- Developed by: NC Research, Multimodal Generation Team
- Technical Report: VARCO-VISION: Expanding Frontiers in Korean Vision-Language Models
- Blog (Korean): VARCO-VISION Technical Report Summary
- Demo Page: The demo page is no longer available.
- Languages: Korean, English
- License: CC BY-NC 4.0
- Architecture: VARCO-VISION-14B follows the architecture of LLaVA-OneVision.
- Base Model:
  - Language Model: Qwen/Qwen2.5-14B-Instruct
  - Vision Encoder: google/siglip-so400m-patch14-384
- Huggingface Version Model: NCSOFT/VARCO-VISION-14B-HF
- Korean VLM Benchmarks:
  - You can use the following benchmark datasets in the LLMs-Eval toolkit:
    - NCSOFT/K-MMBench
    - NCSOFT/K-SEED
    - NCSOFT/K-MMStar
    - NCSOFT/K-DTCBench
    - NCSOFT/K-LLaVA-W
  - You can also evaluate VARCO-VISION-14B with the VLMEval kit.
- This model is for research purposes only. Commercial use is prohibited.

To load VARCO-VISION-14B, start by cloning and installing LLaVA-NeXT. After installation, load the model, then prepare an image and a text input: preprocess the image, tokenize the text, and pass the processed inputs to the model to generate predictions.
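The image-preprocessing step mentioned above can be sketched as a simple normalization; this is a minimal illustration assuming SigLIP-style normalization (mean 0.5, std 0.5), and in practice the model's own image processor handles this, together with resizing to the 384x384 input resolution of google/siglip-so400m-patch14-384:

```python
def preprocess_pixels(pixels):
    """Scale raw 0-255 pixel values to the [-1, 1] range.

    Assumes SigLIP-style normalization (mean 0.5, std 0.5 per channel);
    the model's image processor performs this step for real inputs,
    along with resizing to the vision encoder's 384x384 resolution.
    """
    return [(p / 255.0 - 0.5) / 0.5 for p in pixels]

print(preprocess_pixels([0, 255]))  # [-1.0, 1.0]
```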
If a question is based on bounding boxes or requires bounding boxes as output, include the special tokens in the input text. The following special tokens define specific tasks, inputs, and outputs for the model:

- ` `: Indicates that the model's response should include bounding box information.
- ` `: Specifies OCR tasks for recognizing text within an image.
- ` ` and ` `: Used to mark a text phrase.
- ` ` and ` `: Used to indicate an object.
- ` ` and ` `: Used to represent a bounding box.
- ` `: Represents multiple location points for a single object or text.

Grounding

Grounding is a task in which the model must identify specific locations within an image to provide an appropriate answer. To perform grounding, prepend the special token ` ` to the question. VARCO-VISION-14B can handle location-specific questions using bounding boxes.

Referring

To perform referring tasks, build a conversation that wraps the object of interest in ` ` and ` ` tags and specifies its location with ` ` and ` ` tags. This allows the model to understand the context and focus on the object at the specified location. A bounding box is represented as (x1, y1, x2, y2): the first two values give the top-left corner and the latter two the bottom-right corner.

OCR

To perform Optical Character Recognition (OCR), use the ` ` token.

Citation

If you use VARCO-VISION-14B in your research, please cite the following:
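The (x1, y1, x2, y2) convention above can be sketched with a small helper. Note that treating the coordinates as normalized to the 0-1 range is an assumption made here for illustration; check the model's actual output format before relying on it:

```python
def bbox_to_pixels(bbox, width, height):
    """Convert an (x1, y1, x2, y2) box to pixel coordinates.

    (x1, y1) is the top-left corner and (x2, y2) the bottom-right
    corner, as described above. Normalized 0-1 coordinates are an
    assumption here, not confirmed by the model card.
    """
    x1, y1, x2, y2 = bbox
    return (round(x1 * width), round(y1 * height),
            round(x2 * width), round(y2 * height))

print(bbox_to_pixels((0.1, 0.2, 0.5, 0.8), 640, 480))  # (64, 96, 320, 384)
```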
VARCO-VISION-2.0-14B
Introduction

VARCO-VISION-2.0 is a multimodal AI model capable of understanding both images and text to answer user queries. It supports multi-image inputs, enabling effective processing of complex content such as documents, tables, and charts. The model demonstrates strong comprehension in both Korean and English, with significantly improved text generation capabilities and a deeper understanding of Korean cultural context. Compared to its predecessor, performance has been notably enhanced across various benchmarks, and its usability in real-world scenarios, such as everyday Q&A and information summarization, has also improved. In addition to the 14B full-scale model, a lightweight 1.7B version is available for on-device use, making it accessible on personal devices such as smartphones and PCs. VARCO-VISION-2.0 is a powerful open-weight AI model built for Korean users and is freely available for a wide range of applications.

🚨News🎙️

- 📝 2025-09-12: We published the technical report of VARCO-VISION-2.0 at link
- 🛠️ 2025-08-22: We updated the checkpoint of VARCO-VISION-2.0-1.7B for improved performance.
- 📰 2025-07-28: We released VARCO-VISION-2.0-1.7B-OCR at link
- 📰 2025-07-28: We released VARCO-VISION-2.0-1.7B at link
- 🛠️ 2025-07-18: We updated the checkpoint of VARCO-VISION-2.0-14B for improved performance.
- 📰 2025-07-16: We released VARCO-VISION-2.0-14B at link
- 📰 2025-07-16: We released GME-VARCO-VISION-Embedding at link

Key Features

- Multi-image Understanding: Newly added support for multi-image inputs enables the model to analyze multiple images simultaneously and make more holistic, context-aware decisions.
- Korean Language Specialization: The model is further specialized for Korean, with a deeper understanding of Korean language, context, and culture. Korean text generation has been significantly improved, resulting in more natural, fluent, and accurate responses.
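A multi-image request can be expressed as a chat-style message with several image slots followed by the question. The `{"type": "image"}` / `{"type": "text"}` content schema below follows the LLaVA-OneVision style used by Hugging Face processors; treat it as an assumption and check the model card's own usage snippet for the exact format:

```python
def build_conversation(question, num_images):
    """Build a user message carrying several images plus a text question.

    The content-item schema ({"type": "image"} placeholders followed by
    a {"type": "text"} item) is the LLaVA-OneVision-style chat format;
    it is assumed here, not quoted from the model card.
    """
    content = [{"type": "image"} for _ in range(num_images)]
    content.append({"type": "text", "text": question})
    return [{"role": "user", "content": content}]

conv = build_conversation("Compare these two charts.", num_images=2)
```

The resulting list would typically be passed to the processor's `apply_chat_template` together with the actual image objects, in the same order as the image placeholders.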
- OCR with Text Localization: Unlike typical models that only recognize and generate text from images, VARCO-VISION-2.0 can also identify the position of the text and provide bounding boxes around it. This makes it especially useful for document understanding, signage interpretation, and structured visual data.
- Enhanced Safety: The model now offers improved handling of harmful or sexually explicit content, ensuring safer and more reliable interactions.

VARCO-VISION-2.0 Family

| Model Name | Base Models (Vision / Language) | HF Link |
| :------------------------: | :-------------------------------------: | :--: |
| VARCO-VISION-2.0-14B | siglip2-so400m-patch16-384 / Qwen3-14B | link |
| VARCO-VISION-2.0-1.7B | siglip2-so400m-patch16-384 / Qwen3-1.7B | link |
| VARCO-VISION-2.0-1.7B-OCR | siglip2-so400m-patch16-384 / Qwen3-1.7B | link |
| GME-VARCO-VISION-Embedding | Qwen2-VL-7B-Instruct | link |

Model Architecture

VARCO-VISION-2.0 follows the architecture of LLaVA-OneVision.

Evaluation

We used VLMEvalKit for evaluation whenever possible, and implemented our own evaluations only for benchmarks not supported by the toolkit, ensuring fair comparisons with various open-weight models. Please note that for certain benchmarks involving LLM-based evaluation (e.g., LLaVABench), results may not be exactly reproducible due to variations in the underlying LLM behavior.
Korean Benchmark

| Benchmark | InternVL3-14B | Ovis2-16B | Qwen2.5-VL-7B | VARCO-VISION-2.0-14B |
| :-------------: | :---: | :---: | :---: | :---: |
| K-MMBench (DEV) | 89.1 | 86.0 | 84.7 | 87.7 |
| K-MMStar | 64.9 | 29.7 | 49.3 | 63.6 |
| K-SEED | 78.2 | 73.2 | 75.7 | 77.2 |
| K-LLaVA-W | 80.9 | 86.3 | 94.1 | 96.5 |
| K-DTCBench | 87.9 | 81.7 | 82.1 | 78.3 |
| AVERAGE | 80.2 | 71.4 | 77.2 | 80.7 |

English Benchmark

| Benchmark | InternVL3-14B | Ovis2-16B | Qwen2.5-VL-7B | VARCO-VISION-2.0-14B |
| :--------------: | :---: | :---: | :---: | :---: |
| MMStar | 68.9 | 67.2 | 64.1 | 66.9 |
| MMMU (VAL) | 64.8 | 60.7 | 58.0 | 61.9 |
| MathVista | 74.4 | 73.7 | 68.1 | 73.2 |
| OCRBench | 87.7 | 87.9 | 88.8 | 86.9 |
| AI2D | 86.0 | 86.3 | 84.3 | 85.7 |
| HallusionBench | 55.9 | 56.8 | 51.9 | 53.2 |
| MMVet | 80.5 | 68.4 | 69.7 | 68.9 |
| SEEDBench (IMG) | 77.5 | 77.7 | 77.0 | 78.0 |
| LLaVABench | 84.4 | 93.0 | 91.0 | 90.2 |
| RealWorldQA | 69.8 | 74.1 | 68.4 | 74.6 |
| POPE | 89.4 | 87.5 | 85.9 | 89.2 |
| ScienceQA (TEST) | 98.6 | 95.2 | 89.0 | 93.5 |
| SEEDBench2Plus | 70.1 | 72.1 | 70.7 | 71.9 |
| BLINK | 59.9 | 59.0 | 55.3 | 54.5 |
| TextVQA (VAL) | 82.2 | 83.0 | 85.4 | 80.4 |
| ChartQA (TEST) | 87.8 | 79.1 | 80.6 | 84.2 |
| Q-Bench1 (VAL) | 76.5 | 79.2 | 78.2 | 79.9 |
| A-Bench (VAL) | 76.3 | 79.6 | 75.4 | 79.5 |
| DocVQA (TEST) | 94.1 | 94.9 | 95.7 | 90.9 |
| InfoVQA (TEST) | 83.6 | 82.8 | 82.6 | 80.4 |
| AVERAGE | 78.4 | 77.9 | 76.0 | 77.2 |

Text-only Benchmark

| Benchmark | InternVL3-14B | Ovis2-16B | Qwen2.5-VL-7B | VARCO-VISION-2.0-14B |
| :--------: | :---: | :---: | :---: | :---: |
| MMLU | 78.5 | 78.4 | 4.6 | 77.9 |
| MT-Bench | 89.3 | 85.9 | 80.7 | 89.8 |
| KMMLU | 51.4 | 49.3 | 39.6 | 57.5 |
| KoMT-Bench | 70.1 | 79.1 | 68.4 | 78.3 |
| LogicKor | 70.0 | 79.4 | 65.5 | 74.0 |
| AVERAGE | 71.9 | 74.4 | 51.7 | 75.5 |

> Note: Some models show unusually low performance on the MMLU benchmark. This is primarily due to their failure to correctly follow the expected output format when only few-shot exemplars are provided in the prompts. Please take this into consideration when interpreting the results.

Korean Cultural Benchmark

| Benchmark | InternVL3-14B | Ovis2-16B | Qwen2.5-VL-7B | VARCO-VISION-2.0-14B |
| :--------------: | :---: | :---: | :---: | :---: |
| K-Viscuit | 71.7 | 77.0 | 70.9 | 73.7 |
| PangeaBench (ko) | 77.2 | 76.9 | 76.6 | 74.5 |
| AVERAGE | 74.5 | 77.0 | 73.8 | 74.1 |

OCR Benchmark

| Benchmark | PaddleOCR | EasyOCR | VARCO-VISION-2.0-14B |
| :-------: | :---: | :---: | :---: |
| CORD | 91.4 | 77.8 | 97.1 |
| ICDAR2013 | 92.0 | 85.0 | 95.7 |
| ICDAR2015 | 73.7 | 57.9 | 79.4 |
| AVERAGE | 85.7 | 73.6 | 90.7 |

Usage

To use this model, we recommend installing `transformers` version 4.53.1 or higher. While it may work with earlier versions, using 4.53.1 or above is strongly recommended, especially to ensure optimal performance for the multi-image feature.

Batch inference

All inputs in a batch must have the same modality structure (text-only with text-only, single-image with single-image, and multi-image with multi-image) to ensure correct batch inference.
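The same-modality rule for batching can be enforced with a small pre-flight check. The helper below is an illustrative sketch, not part of the library; the `"images"` key on each sample is a hypothetical representation chosen here for demonstration:

```python
def modality_of(sample):
    """Classify one batch sample by how many images it carries."""
    n = len(sample.get("images", []))
    if n == 0:
        return "text-only"
    return "single-image" if n == 1 else "multi-image"

def check_batch(batch):
    """Raise if samples in one batch mix modality structures.

    Per the note above, text-only, single-image, and multi-image
    samples must not be combined in the same batch.
    """
    kinds = {modality_of(s) for s in batch}
    if len(kinds) > 1:
        raise ValueError(f"mixed modalities in one batch: {sorted(kinds)}")
    return kinds.pop()

print(check_batch([{"images": ["a.png"]}, {"images": ["b.png"]}]))  # single-image
```

Run such a check before calling the processor so that a mixed batch fails fast with a clear error instead of producing silently wrong generations.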
VARCO-VISION-2.0-1.7B-OCR
GME-VARCO-VISION-Embedding
Llama-3-OffsetBias-RM-8B
VARCO-VISION-14B-HF
🚨News🎙️

- The 2.0 model has been released. Please use the new version.
- 📰 2025-07-16: We released VARCO-VISION-2.0-14B at link
- 📰 2025-07-16: We released GME-VARCO-VISION-Embedding at link

VARCO-VISION-14B is a powerful English-Korean Vision-Language Model (VLM). Its training pipeline consists of four stages: Feature Alignment Pre-training, Basic Supervised Fine-tuning, Advanced Supervised Fine-tuning, and Preference Optimization. On both multimodal and text-only benchmarks, VARCO-VISION-14B not only outperforms other models of similar size but also achieves scores comparable to those of proprietary models. The model currently accepts a single image and a text prompt as input and generates a text output. It supports grounding and referring as well as OCR (Optical Character Recognition).

- Developed by: NC Research, Multimodal Generation Team
- Technical Report: VARCO-VISION: Expanding Frontiers in Korean Vision-Language Models
- Blog (Korean): VARCO-VISION Technical Report Summary
- Demo Page: The demo page is no longer available.
- Languages: Korean, English
- License: CC BY-NC 4.0
- Architecture: VARCO-VISION-14B follows the architecture of LLaVA-OneVision.
- Base Model:
  - Language Model: Qwen/Qwen2.5-14B-Instruct
  - Vision Encoder: google/siglip-so400m-patch14-384
- LLaVA-NeXT Codebase Model: NCSOFT/VARCO-VISION-14B
- Korean VLM Benchmarks:
  - You can use the following benchmark datasets in the LLMs-Eval toolkit:
    - NCSOFT/K-MMBench
    - NCSOFT/K-SEED
    - NCSOFT/K-MMStar
    - NCSOFT/K-DTCBench
    - NCSOFT/K-LLaVA-W
  - You can also evaluate VARCO-VISION-14B with the VLMEval kit.
- This model is for research purposes only. Commercial use is prohibited.

Direct Use

To use this model, ensure you have `transformers >= 4.45.0` installed. If a question is based on bounding boxes or requires bounding boxes as output, include the special tokens in the input text.
The following special tokens define specific tasks, inputs, and outputs for the model:

- ` `: Indicates that the model's response should include bounding box information.
- ` `: Specifies OCR tasks for recognizing text within an image.
- ` ` and ` `: Used to mark a text phrase.
- ` ` and ` `: Used to indicate an object.
- ` ` and ` `: Used to represent a bounding box.
- ` `: Represents multiple location points for a single object or text.

Grounding

Grounding is a task in which the model must identify specific locations within an image to provide an appropriate answer. To perform grounding, prepend the special token ` ` to the question. VARCO-VISION-14B can handle location-specific questions using bounding boxes.

Referring

To perform referring tasks, build a conversation that wraps the object of interest in ` ` and ` ` tags and specifies its location with ` ` and ` ` tags. This allows the model to understand the context and focus on the object at the specified location. A bounding box is represented as (x1, y1, x2, y2): the first two values give the top-left corner and the latter two the bottom-right corner.

OCR

To perform Optical Character Recognition (OCR), use the ` ` token.

Citation

If you use VARCO-VISION-14B in your research, please cite the following: