OpenGVLab

✓ Verified Research Lab

Open General Vision Lab, multimodal AI research

280 models β€’ 23 total models in database

InternVL3_5-241B-A28B-Instruct

[\[📂 GitHub\]](https://github.com/OpenGVLab/InternVL) [\[📜 InternVL 1.0\]](https://huggingface.co/papers/2312.14238) [\[📜 InternVL 1.5\]](https://huggingface.co/papers/2404.16821) [\[📜 InternVL 2.5\]](https://huggingface.co/papers/2412.05271) [\[📜 InternVL2.5-MPO\]](https://huggingface.co/papers/2411.10442) [\[📜 InternVL3\]](https://huggingface.co/papers/2504.10479) [\[📜 InternVL3.5\]](https://huggingface.co/papers/2508.18265) [\[🆕 Blog\]](https://internvl.github.io/blog/) [\[🗨️ Chat Demo\]](https://chat.intern-ai.org.cn/) [\[🚀 Quick Start\]](#quick-start) [\[📖 Documents\]](https://internvl.readthedocs.io/en/latest/)

We introduce InternVL3.5, a new family of open-source multimodal models that significantly advances versatility, reasoning capability, and inference efficiency within the InternVL series. A key innovation is the Cascade Reinforcement Learning (Cascade RL) framework, which enhances reasoning through a two-stage process: offline RL for stable convergence and online RL for refined alignment. This coarse-to-fine training strategy yields substantial improvements on downstream reasoning tasks, e.g., MMMU and MathVista. To optimize efficiency, we propose a Visual Resolution Router (ViR) that dynamically adjusts the resolution of visual tokens without compromising performance. Coupled with ViR, our Decoupled Vision-Language Deployment (DvD) strategy separates the vision encoder and language model across different GPUs, effectively balancing computational load. These contributions collectively enable InternVL3.5 to achieve up to a +16.0\% gain in overall reasoning performance and a 4.05\\(\times\\) inference speedup compared to its predecessor, i.e., InternVL3. In addition, InternVL3.5 supports novel capabilities such as GUI interaction and embodied agency.
Notably, our largest model, i.e., InternVL3.5-241B-A28B, attains state-of-the-art results among open-source MLLMs across general multimodal, reasoning, text, and agentic tasks, narrowing the performance gap with leading commercial models like GPT-5. All models and code are publicly released.

> Hatched bars represent closed-source commercial models. We report average scores on a set of multimodal general, reasoning, text, and agentic benchmarks: MMBench v1.1 (en), MMStar, BLINK, HallusionBench, AI2D, OCRBench, MMVet, MME-RealWorld (en), MVBench, VideoMME, MMMU, MathVista, MathVision, MathVerse, DynaMath, WeMath, LogicVista, MATH500, AIME24, AIME25, GPQA, MMLU-Pro, GAOKAO, IFEval, SGP-Bench, VSI-Bench, ERQA, SpaCE-10, and OmniSpatial.

In the following table, we provide an overview of the InternVL3.5 series. To maintain consistency with earlier generations, we provide two model formats: the GitHub format, consistent with prior releases, and the HF format, aligned with the official Transformers standard.

> If you want to convert the checkpoint between these two formats, please refer to the scripts about custom2hf and hf2custom.
| Model                 | #Vision Param | #Language Param | #Total Param | HF Link | ModelScope Link |
| --------------------- | ------------- | --------------- | ------------ | ------- | --------------- |
| InternVL3.5-1B        | 0.3B          | 0.8B            | 1.1B         | 🤗 link | 🤖 link         |
| InternVL3.5-2B        | 0.3B          | 2.0B            | 2.3B         | 🤗 link | 🤖 link         |
| InternVL3.5-4B        | 0.3B          | 4.4B            | 4.7B         | 🤗 link | 🤖 link         |
| InternVL3.5-8B        | 0.3B          | 8.2B            | 8.5B         | 🤗 link | 🤖 link         |
| InternVL3.5-14B       | 0.3B          | 14.8B           | 15.1B        | 🤗 link | 🤖 link         |
| InternVL3.5-38B       | 5.5B          | 32.8B           | 38.4B        | 🤗 link | 🤖 link         |
| InternVL3.5-20B-A4B   | 0.3B          | 20.9B           | 21.2B-A4B    | 🤗 link | 🤖 link         |
| InternVL3.5-30B-A3B   | 0.3B          | 30.5B           | 30.8B-A3B    | 🤗 link | 🤖 link         |
| InternVL3.5-241B-A28B | 5.5B          | 235.1B          | 240.7B-A28B  | 🤗 link | 🤖 link         |

| Model                    | #Vision Param | #Language Param | #Total Param | HF Link | ModelScope Link |
| ------------------------ | ------------- | --------------- | ------------ | ------- | --------------- |
| InternVL3.5-1B-HF        | 0.3B          | 0.8B            | 1.1B         | 🤗 link | 🤖 link         |
| InternVL3.5-2B-HF        | 0.3B          | 2.0B            | 2.3B         | 🤗 link | 🤖 link         |
| InternVL3.5-4B-HF        | 0.3B          | 4.4B            | 4.7B         | 🤗 link | 🤖 link         |
| InternVL3.5-8B-HF        | 0.3B          | 8.2B            | 8.5B         | 🤗 link | 🤖 link         |
| InternVL3.5-14B-HF       | 0.3B          | 14.8B           | 15.1B        | 🤗 link | 🤖 link         |
| InternVL3.5-38B-HF       | 5.5B          | 32.8B           | 38.4B        | 🤗 link | 🤖 link         |
| InternVL3.5-20B-A4B-HF   | 0.3B          | 20.9B           | 21.2B-A4B    | 🤗 link | 🤖 link         |
| InternVL3.5-30B-A3B-HF   | 0.3B          | 30.5B           | 30.8B-A3B    | 🤗 link | 🤖 link         |
| InternVL3.5-241B-A28B-HF | 5.5B          | 235.1B          | 240.7B-A28B  | 🤗 link | 🤖 link         |

> We conduct the evaluation with VLMEvalKit.
To enable the Thinking mode of our model, please set the system prompt to `R1_SYSTEM_PROMPT`. When enabling Thinking mode, we recommend setting `do_sample=True` and `temperature=0.6` to mitigate undesired repetition.

Our training pipeline comprises four stages: Multimodal Continual Pre-Training (CPT), Supervised Fine-Tuning (SFT), and Cascade Reinforcement Learning (CascadeRL), which itself consists of an offline and an online stage. In CascadeRL, we first fine-tune the model using Mixed Preference Optimization (MPO) under an offline RL setting, followed by GSPO under an online RL setting. For the Flash version of InternVL3.5, we additionally introduce a lightweight training stage, termed Visual Consistency Learning (ViCO), which reduces the token cost required to represent an image patch.

Here, we also open-source the model weights after different training stages for potential research usage. If you're unsure which version to use, please select the one without any suffix, as it has completed the full training pipeline.

| Model                            | Training Pipeline     | HF Link | ModelScope Link |
| -------------------------------- | --------------------- | ------- | --------------- |
| InternVL3.5-1B-Pretrained        | CPT                   | 🤗 link | 🤖 link         |
| InternVL3.5-1B-Instruct          | CPT + SFT             | 🤗 link | 🤖 link         |
| InternVL3.5-1B-MPO               | CPT + SFT + MPO       | 🤗 link | 🤖 link         |
| InternVL3.5-1B                   | CPT + SFT + CascadeRL | 🤗 link | 🤖 link         |
| InternVL3.5-2B-Pretrained        | CPT                   | 🤗 link | 🤖 link         |
| InternVL3.5-2B-Instruct          | CPT + SFT             | 🤗 link | 🤖 link         |
| InternVL3.5-2B-MPO               | CPT + SFT + MPO       | 🤗 link | 🤖 link         |
| InternVL3.5-2B                   | CPT + SFT + CascadeRL | 🤗 link | 🤖 link         |
| InternVL3.5-4B-Pretrained        | CPT                   | 🤗 link | 🤖 link         |
| InternVL3.5-4B-Instruct          | CPT + SFT             | 🤗 link | 🤖 link         |
| InternVL3.5-4B-MPO               | CPT + SFT + MPO       | 🤗 link | 🤖 link         |
| InternVL3.5-4B                   | CPT + SFT + CascadeRL | 🤗 link | 🤖 link         |
| InternVL3.5-8B-Pretrained        | CPT                   | 🤗 link | 🤖 link         |
| InternVL3.5-8B-Instruct          | CPT + SFT             | 🤗 link | 🤖 link         |
| InternVL3.5-8B-MPO               | CPT + SFT + MPO       | 🤗 link | 🤖 link         |
| InternVL3.5-8B                   | CPT + SFT + CascadeRL | 🤗 link | 🤖 link         |
| InternVL3.5-14B-Pretrained       | CPT                   | 🤗 link | 🤖 link         |
| InternVL3.5-14B-Instruct         | CPT + SFT             | 🤗 link | 🤖 link         |
| InternVL3.5-14B-MPO              | CPT + SFT + MPO       | 🤗 link | 🤖 link         |
| InternVL3.5-14B                  | CPT + SFT + CascadeRL | 🤗 link | 🤖 link         |
| InternVL3.5-30B-A3B-Pretrained   | CPT                   | 🤗 link | 🤖 link         |
| InternVL3.5-30B-A3B-Instruct     | CPT + SFT             | 🤗 link | 🤖 link         |
| InternVL3.5-30B-A3B-MPO          | CPT + SFT + MPO       | 🤗 link | 🤖 link         |
| InternVL3.5-30B-A3B              | CPT + SFT + CascadeRL | 🤗 link | 🤖 link         |
| InternVL3.5-38B-Pretrained       | CPT                   | 🤗 link | 🤖 link         |
| InternVL3.5-38B-Instruct         | CPT + SFT             | 🤗 link | 🤖 link         |
| InternVL3.5-38B-MPO              | CPT + SFT + MPO       | 🤗 link | 🤖 link         |
| InternVL3.5-38B                  | CPT + SFT + CascadeRL | 🤗 link | 🤖 link         |
| InternVL3.5-241B-A28B-Pretrained | CPT                   | 🤗 link | 🤖 link         |
| InternVL3.5-241B-A28B-Instruct   | CPT + SFT             | 🤗 link | 🤖 link         |
| InternVL3.5-241B-A28B-MPO        | CPT + SFT + MPO       | 🤗 link | 🤖 link         |
| InternVL3.5-241B-A28B            | CPT + SFT + CascadeRL | 🤗 link | 🤖 link         |

The Flash version of our model will be released as soon as possible.

`InternVL3.5`: This series of models follows the "ViT-MLP-LLM" paradigm adopted in previous versions of InternVL. We initialize the language model with the Qwen3 series and GPT-OSS, and the vision encoder with InternViT-300M and InternViT-6B. The Dynamic High Resolution strategy introduced in InternVL1.5 is also retained in our design.

`InternVL3.5-Flash`: Compared to InternVL3.5, InternVL3.5-Flash further integrates the Visual Resolution Router (ViR), yielding a series of efficient variants suitable for resource-constrained scenarios.
Specifically, in InternVL3.5, each image patch is initially represented as 1024 visual tokens for the vision encoder, which are then compressed into 256 tokens via a pixel shuffle module before being passed to the Large Language Model (LLM). In InternVL3.5-Flash, as shown in the Figure below, an additional pixel shuffle module with a higher compression rate is included, enabling the compression of visual tokens down to 64 tokens. For each patch, the patch router determines the appropriate compression rate by assessing its semantic richness, and routes it to the corresponding pixel shuffle module accordingly. Benefiting from this patch-aware compression mechanism, InternVL3.5-Flash is able to reduce the number of visual tokens by 50\% while maintaining nearly 100\% of the performance of InternVL3.5.

During the pre-training stage, we update all model parameters jointly using the combination of large-scale text and multimodal corpora. Specifically, given an arbitrary training sample consisting of a multimodal token sequence \\(\mathbf{x}=\left(x_1, x_2, \ldots, x_L\right)\\), the next token prediction (NTP) loss is calculated on each text token as follows:

$$ \mathcal{L}_{i}=-\log p_{\theta}\left(x_i \mid x_1, \ldots, x_{i-1}\right), $$

where \\(x_i\\) is the predicted token and the prefix tokens in \\(\{x_1, x_2, \ldots, x_{i-1}\}\\) can be either text tokens or image tokens. Notably, for conversation samples, only response tokens are included in the calculation of the loss. Additionally, to mitigate bias toward either longer or shorter responses during training, we adopt square-root averaging to re-weight the NTP loss as follows:

$$ \mathcal{L}_{i}^{'} = \frac{w_i}{\sum_j w_j} \cdot \mathcal{L}_i, \quad w_i = \frac{1}{N^{0.5}}, $$

where \\(N\\) denotes the number of tokens in the training sample on which the loss needs to be calculated. Random JPEG compression is also applied to enhance the model's real-world performance.
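The square-root re-weighting above can be sketched in a few lines of plain Python (an illustration, not the training code): each token inherits a weight \\(w_i = 1/N^{0.5}\\) from the sample it belongs to, and the weights are normalized over all loss-bearing tokens.

```python
def reweight_ntp_losses(samples):
    """samples: one list of per-token NTP losses per training sample.
    Returns the flat list of re-weighted per-token losses L'_i."""
    # w_i = 1 / N^0.5, where N is the token count of the sample containing token i
    flat = [(loss, 1.0 / (len(s) ** 0.5)) for s in samples for loss in s]
    total_w = sum(w for _, w in flat)
    # L'_i = (w_i / sum_j w_j) * L_i
    return [(w / total_w) * loss for loss, w in flat]
```

With this scheme a short response is not drowned out by a long one: a 1-token sample carries twice the per-token weight of the tokens in a 4-token sample.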
During the SFT phase, we adopt the same objective as in the pre-training stage and use the square-root averaging strategy to calculate the final loss. In this stage, the context window is set to 32K tokens to accommodate long-context information.

Compared to InternVL3, the SFT stage of InternVL3.5 contains more high-quality and diverse training data derived from three sources: (1) Instruction-following data from InternVL3, which are reused to preserve broad coverage of vision-language tasks. (2) Multimodal reasoning data in the "Thinking" mode, which are included to instill long-thinking capabilities in the model. To construct such data, we first use InternVL3-78B to describe the image and then input the description into DeepSeek-R1 to sample rollouts with detailed reasoning processes. Rollouts with an incorrect final answer are filtered out. The questions in these datasets cover various expert domains, such as mathematics and scientific disciplines, thereby strengthening performance on different reasoning tasks. (3) Capability-expansion datasets, which endow InternVL3.5 with new skills, including GUI-based interaction, embodied interaction, and scalable vector graphics understanding.

Cascade RL aims to combine the benefits of offline RL and online RL to progressively facilitate the post-training of MLLMs in an efficient manner. Specifically, we first fine-tune the model using an offline RL algorithm as an efficient warm-up stage to reach satisfactory results, which guarantees high-quality rollouts for the subsequent stage. We then employ an online RL algorithm to further refine the output distribution based on rollouts generated by the model itself. Compared to a single offline or online RL stage, our cascaded RL achieves significant performance improvements at a fraction of the GPU time cost. During the offline RL stage, we employ mixed preference optimization (MPO) to fine-tune the model.
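The reasoning-data construction loop described above (caption the image, sample rollouts from a reasoner, keep only rollouts with the correct final answer) can be sketched as follows; `caption` and `reason` are stand-ins for InternVL3-78B and DeepSeek-R1, and the exact-match check is a simplification of the actual answer verification.

```python
def build_thinking_data(samples, caption, reason, n_rollouts=4):
    """samples: (image, question, gold_answer) triples.
    caption(image) -> str; reason(description, question) -> dict with
    'final_answer' and 'text'. Incorrect rollouts are filtered out."""
    data = []
    for image, question, answer in samples:
        desc = caption(image)  # image -> textual description
        for _ in range(n_rollouts):
            rollout = reason(desc, question)
            if rollout["final_answer"] == answer:  # drop wrong-answer rollouts
                data.append({"question": question, "response": rollout["text"]})
    return data
```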
Specifically, the training objective of MPO is a combination of preference loss \\(\mathcal{L}_{p}\\), quality loss \\(\mathcal{L}_{q}\\), and generation loss \\(\mathcal{L}_{g}\\), which can be formulated as follows:

$$ \mathcal{L}_{\text{MPO}} = w_{p} \mathcal{L}_{p} + w_{q} \mathcal{L}_{q} + w_{g} \mathcal{L}_{g}, $$

where \\(w_{p}\\), \\(w_{q}\\), and \\(w_{g}\\) represent the weights assigned to each loss component. The DPO loss, BCO loss, and LM loss serve as the preference loss, quality loss, and generation loss, respectively.

During the online RL stage, we employ GSPO, without reference model constraints, as our online RL algorithm, which we find more effective in training both dense and mixture-of-experts (MoE) models. Similar to GRPO, the advantage is defined as the normalized reward across responses sampled from the same query. The training objective of GSPO is given by:

$$ \mathcal{L}_{\mathrm{GSPO}}(\theta)=\mathbb{E}_{x \sim \mathcal{D},\left\{y_i\right\}_{i=1}^G \sim \pi_{\theta_{\text{old}}}(\cdot \mid x)}\left[\frac{1}{G} \sum_{i=1}^G \min \left(s_i(\theta) \widehat{A}_i, \operatorname{clip}\left(s_i(\theta), 1-\varepsilon, 1+\varepsilon\right) \widehat{A}_i\right)\right], $$

where the importance sampling ratio \\(s_i(\theta)\\) is defined as the geometric mean of the per-token ratios.

> Please see our paper for more technical and experimental details.

We further include ViCO as an additional training stage to integrate the visual resolution router (ViR) into InternVL3.5, thereby reducing its inference cost. The resulting efficient version of InternVL3.5 is termed InternVL3.5-Flash. In particular, ViCO comprises two stages:

`Consistency training`: In this stage, the entire model is trained to minimize the divergence between response distributions conditioned on visual tokens with different compression rates. In practice, we introduce an extra reference model, which is frozen and initialized with InternVL3.5.
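The sequence-level importance ratio used by GSPO can be illustrated in a few lines (a sketch, not the training code): the geometric mean of per-token probability ratios is computed as the exponent of the mean log-ratio, and the PPO-style clipping is then applied at the sequence level.

```python
import math

def gspo_ratio(logprobs_new, logprobs_old):
    """Sequence-level ratio s_i(theta): geometric mean of per-token ratios,
    given per-token log-probs of the same response under pi_theta and pi_old."""
    n = len(logprobs_new)
    mean_log_ratio = sum(a - b for a, b in zip(logprobs_new, logprobs_old)) / n
    return math.exp(mean_log_ratio)

def clipped_objective(ratio, advantage, eps=0.2):
    # min(s * A, clip(s, 1-eps, 1+eps) * A), as in the GSPO objective above
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)
```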
Given a sample, each image patch is represented as either 256 or 64 tokens, and the training objective is defined as follows:

$$ \mathcal{L}_{\text{ViCO}} = \mathbb{E}_{\xi \sim \mathcal{R}} \left[ \frac{1}{N} \sum_{i=1}^{N} \mathrm{KL} \Big( \pi_{\theta_{\text{ref}}}\left(y_i \mid y_{<i}\right) \, \Big\| \, \pi_{\theta}\left(y_i \mid y_{<i}, \xi\right) \Big) \right], $$

where \\(\xi\\) denotes the compression rate sampled from \\(\mathcal{R}\\).

`Router training`: In this stage, only the ViR is trained to select an appropriate compression rate for each image patch, while the rest of the model is kept frozen.

> Please see our paper for more technical and experimental details.

Test-time scaling (TTS) has been empirically demonstrated as an effective approach to enhance the reasoning capabilities of LLMs and MLLMs, particularly for complex tasks necessitating multi-step inference. In this work, we implement a comprehensive test-time scaling approach that simultaneously improves reasoning depth (i.e., deep thinking) and breadth (i.e., parallel thinking).

`Deep Thinking`: By activating the Thinking mode, we guide the model to deliberately engage in step-by-step reasoning (i.e., decomposing complex problems into logical steps and validating intermediate conclusions) prior to generating the final answer. This approach systematically improves the logical structure of solutions for complex problems, particularly those requiring multi-step inference, and enhances reasoning depth.

`Parallel Thinking`: Following InternVL3, for reasoning tasks, we adopt the Best-of-N (BoN) strategy by employing VisualPRM-v1.1 as the critic model to select the optimal response from multiple reasoning candidates. This approach improves reasoning breadth.

> Notably, unless otherwise specified, the experimental results reported in our paper are obtained without applying TTS. Thus far, we have only applied TTS to reasoning benchmarks, since we found that the model already exhibits strong perception and understanding capabilities, and initiating TTS yields no significant improvement.

In multimodal inference, the vision encoder and language model have distinct computational characteristics. The vision encoder, which transforms images into semantic features, is highly parallelizable and does not rely on long-term history states.
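The Best-of-N selection behind parallel thinking reduces to scoring each candidate with the critic and keeping the argmax. A minimal sketch, where `critic_score` stands in for VisualPRM-v1.1 (any callable mapping a question/candidate pair to a float works):

```python
def best_of_n(question, candidates, critic_score):
    """Return the candidate response with the highest critic score."""
    return max(candidates, key=lambda c: critic_score(question, c))
```

In practice the N candidates are sampled from the policy model and the critic is itself a learned reward model; the selection step stays this simple.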
In contrast, the language model performs inference in an autoregressive manner, which requires previous states to compute the next one. This sequential property makes the language part more sensitive to memory bandwidth and latency. When MLLMs are deployed online at scale, the vision and language models often block each other, incurring additional inference cost. This effect becomes more pronounced with larger vision models or higher-resolution images.

As shown in the Figure above, we propose decoupled vision-language deployment (DvD) to address this issue by separating vision and language processing, with a particular focus on optimizing the prefilling stage. The vision subsystem batches and processes images to produce compact feature embeddings, which are then transmitted to the language subsystem for fusion with the text context prior to decoding. This separation alleviates blocking and brings multimodal prefilling performance closer to that of pure language models.

In our system implementation, the ViT and MLP (and ViR for InternVL3.5-Flash) are deployed on the vision server, while the language server executes only the LLM. The communication is unidirectional, transmitting BF16 visual features over TCP, with RDMA optionally employed to achieve higher transmission speed. Vision processing, feature transmission, and language processing are organized into an asynchronous three-stage pipeline, enabling overlapped execution and minimizing pipeline stalls.

DvD increases GPU utilization and processing efficiency on the vision side, while enabling the language server to focus exclusively on the LLM's prefilling and decoding without being blocked by vision computation. This design leads to improved throughput and responsiveness. Moreover, the architecture supports independent hardware cost optimization for the vision and language modules, and facilitates the seamless integration of new modules without requiring modifications to the language server deployment.
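The asynchronous three-stage pipeline can be sketched with threads and bounded queues (a toy illustration only: real DvD moves BF16 features between servers over TCP/RDMA, not between threads):

```python
import queue
import threading

SENTINEL = None  # marks end of the image stream

def run_pipeline(images, encode, transmit, prefill):
    """Three overlapped stages: vision encoding -> feature transmission ->
    language prefill. Bounded queues provide backpressure between stages."""
    q_img, q_feat, q_out = queue.Queue(2), queue.Queue(2), queue.Queue()

    def stage(fn, src, dst):
        while (item := src.get()) is not SENTINEL:
            dst.put(fn(item))
        dst.put(SENTINEL)  # propagate shutdown downstream

    threads = [
        threading.Thread(target=stage, args=(encode, q_img, q_feat)),
        threading.Thread(target=stage, args=(transmit, q_feat, q_out)),
    ]
    for t in threads:
        t.start()
    for img in images:
        q_img.put(img)
    q_img.put(SENTINEL)

    results = []
    while (feat := q_out.get()) is not SENTINEL:
        results.append(prefill(feat))  # language stage runs on the main thread
    for t in threads:
        t.join()
    return results
```

Because each stage only waits on its own queue, encoding of image k+1 overlaps with transmission of image k and prefilling of image k-1, which is the overlap DvD exploits.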
Multi-Image Understanding & Real-World Comprehension

Comprehensive Multimodal Understanding & Multimodal Hallucination Evaluation

We provide an example code to run `InternVL3.5-8B` using `transformers`. Please note that our models with up to 30B parameters can be deployed on a single A100 GPU, while the 38B model requires two A100 GPUs and the 235B model requires eight A100 GPUs.

> In most cases, both LMDeploy and vLLM can be used for model deployment. However, for InternVL3.5-20B-A4B, we recommend using vLLM, since LMDeploy does not yet support GPT-OSS.

> Please use transformers>=4.52.1 to ensure the model works normally. For the 20B version of our model, transformers>=4.55.0 is required.

To enable Thinking mode, please set the system prompt to our Thinking System Prompt. When enabling Thinking mode, we recommend setting `do_sample=True` and `temperature=0.6` to mitigate undesired repetition. Besides this method, you can also use the following code to get streamed output.

Many repositories now support fine-tuning of the InternVL series models, including InternVL, SWIFT, XTuner, and others. Please refer to their documentation for more details on fine-tuning.

LMDeploy is a toolkit for compressing, deploying, and serving LLMs & VLMs. LMDeploy abstracts the complex inference process of multi-modal Vision-Language Models (VLM) into an easy-to-use pipeline, similar to the Large Language Model (LLM) inference pipeline.

When dealing with multiple images, you can put them all in one list. Keep in mind that multiple images will lead to a higher number of input tokens, and as a result, the size of the context window typically needs to be increased.

Conducting inference with batch prompts is quite straightforward; just place them within a list structure:

There are two ways to do multi-turn conversations with the pipeline. One is to construct messages according to the OpenAI format and use the method introduced above; the other is to use the `pipeline.chat` interface.
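For the OpenAI-format path, multiple images go into a single `content` list alongside the text prompt. A minimal helper (field names follow the OpenAI chat schema; no LMDeploy-specific API is assumed):

```python
def build_multi_image_message(prompt, image_urls):
    """Pack a text prompt and several images into one OpenAI-format user message."""
    content = [{"type": "text", "text": prompt}]
    content += [{"type": "image_url", "image_url": {"url": u}} for u in image_urls]
    return [{"role": "user", "content": content}]
```

The resulting list can be passed wherever OpenAI-format messages are accepted; remember that each extra image adds visual tokens, so the context window may need to grow accordingly.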
LMDeploy's `api_server` enables models to be easily packed into services with a single command. The provided RESTful APIs are compatible with OpenAI's interfaces. Below is an example of service startup: To use the OpenAI-style interface, you need to install OpenAI:

This project is released under the Apache-2.0 License. This project uses the pre-trained Qwen3 as a component, which is licensed under the Apache-2.0 License. If you find this project useful in your research, please consider citing:
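Since the served API is OpenAI-compatible, a `/v1/chat/completions` request body can be assembled with the standard chat schema. A sketch using only the standard library (the model name is whatever the server reports via `/v1/models`; the endpoint and port depend on how you launched `api_server`):

```python
import json

def chat_request(model, prompt, image_url, temperature=0.6):
    """Build an OpenAI-compatible chat-completions request body as JSON."""
    return json.dumps({
        "model": model,
        "temperature": temperature,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    })
```

The string returned here is what an OpenAI client would POST to the server on your behalf.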

license:apache-2.0
1,545,878
15

InternVL2-2B

---
license: mit
pipeline_tag: image-text-to-text
library_name: transformers
base_model:
  - OpenGVLab/InternViT-300M-448px
  - internlm/internlm2-chat-1_8b
new_version: OpenGVLab/InternVL2_5-2B
base_model_relation: merge
language:
  - multilingual
tags:
  - internvl
  - custom_code
---

license:mit
1,056,058
76

InternVL3_5-GPT-OSS-20B-A4B-Preview-HF

---
license: apache-2.0
pipeline_tag: image-text-to-text
library_name: transformers
base_model:
  - OpenGVLab/InternViT-300M-448px-V2_5
  - openai/gpt-oss-20b
base_model_relation: merge
datasets:
  - OpenGVLab/MMPR-v1.2
  - OpenGVLab/MMPR-Tiny
language:
  - multilingual
tags:
  - internvl
  - custom_code
---

license:apache-2.0
658,576
5

InternVL3-14B

---
license: apache-2.0
license_name: qwen
license_link: https://huggingface.co/Qwen/Qwen2.5-72B-Instruct/blob/main/LICENSE
pipeline_tag: image-text-to-text
library_name: transformers
base_model:
  - OpenGVLab/InternVL3-14B-Instruct
base_model_relation: finetune
datasets:
  - OpenGVLab/MMPR-v1.2
language:
  - multilingual
tags:
  - internvl
  - custom_code
---

license:apache-2.0
651,110
77

InternVL2-1B

---
license: mit
pipeline_tag: image-text-to-text
library_name: transformers
base_model:
  - OpenGVLab/InternViT-300M-448px
  - Qwen/Qwen2-0.5B-Instruct
new_version: OpenGVLab/InternVL2_5-1B
base_model_relation: merge
language:
  - multilingual
tags:
  - internvl
  - custom_code
---

license:mit
519,769
76

InternVL2_5-4B-MPO

---
license: mit
pipeline_tag: image-text-to-text
library_name: transformers
base_model:
  - OpenGVLab/InternVL2_5-4B
base_model_relation: finetune
datasets:
  - OpenGVLab/MMPR-v1.1
language:
  - multilingual
tags:
  - internvl
  - custom_code
---

license:mit
231,216
18

InternVL3-8B

---
license: apache-2.0
license_name: qwen
license_link: https://huggingface.co/Qwen/Qwen2.5-72B-Instruct/blob/main/LICENSE
pipeline_tag: image-text-to-text
library_name: transformers
base_model:
  - OpenGVLab/InternVL3-8B-Instruct
base_model_relation: finetune
datasets:
  - OpenGVLab/MMPR-v1.2
language:
  - multilingual
tags:
  - internvl
  - custom_code
---

license:apache-2.0
225,632
102

InternVL2_5-4B-AWQ

license:mit
113,254
7

InternVL3-2B

[\[📂 GitHub\]](https://github.com/OpenGVLab/InternVL) [\[📜 InternVL 1.0\]](https://huggingface.co/papers/2312.14238) [\[📜 InternVL 1.5\]](https://huggingface.co/papers/2404.16821) [\[📜 InternVL 2.5\]](https://huggingface.co/papers/2412.05271) [\[📜 InternVL2.5-MPO\]](https://huggingface.co/papers/2411.10442) [\[📜 InternVL3\]](https://huggingface.co/papers/2504.10479) [\[🆕 Blog\]](https://internvl.github.io/blog/) [\[🗨️ Chat Demo\]](https://internvl.opengvlab.com/) [\[🤗 HF Demo\]](https://huggingface.co/spaces/OpenGVLab/InternVL) [\[🚀 Quick Start\]](#quick-start) [\[📖 Documents\]](https://internvl.readthedocs.io/en/latest/)

We introduce InternVL3, an advanced multimodal large language model (MLLM) series that demonstrates superior overall performance. Compared to InternVL 2.5, InternVL3 exhibits superior multimodal perception and reasoning capabilities, while further extending its multimodal capabilities to encompass tool usage, GUI agents, industrial image analysis, 3D vision perception, and more. Additionally, we compare InternVL3 with Qwen2.5 Chat models, whose corresponding pre-trained base models are employed as the initialization of the language component in InternVL3. Benefiting from Native Multimodal Pre-Training, the InternVL3 series achieves even better overall text performance than the Qwen2.5 series. In the following table, we provide an overview of the InternVL3 series.
| Model Name    | Vision Part               | Language Part         | HF Link |
| :-----------: | :-----------------------: | :-------------------: | :-----: |
| InternVL3-1B  | InternViT-300M-448px-V2_5 | Qwen2.5-0.5B          | 🤗 link |
| InternVL3-2B  | InternViT-300M-448px-V2_5 | Qwen2.5-1.5B          | 🤗 link |
| InternVL3-8B  | InternViT-300M-448px-V2_5 | Qwen2.5-7B            | 🤗 link |
| InternVL3-9B  | InternViT-300M-448px-V2_5 | internlm3-8b-instruct | 🤗 link |
| InternVL3-14B | InternViT-300M-448px-V2_5 | Qwen2.5-14B           | 🤗 link |
| InternVL3-38B | InternViT-6B-448px-V2_5   | Qwen2.5-32B           | 🤗 link |
| InternVL3-78B | InternViT-6B-448px-V2_5   | Qwen2.5-72B           | 🤗 link |

As shown in the following figure, InternVL3 retains the same model architecture as InternVL 2.5 and its predecessors, InternVL 1.5 and 2.0, following the "ViT-MLP-LLM" paradigm. In this new version, we integrate an incrementally pre-trained InternViT with various pre-trained LLMs, including InternLM 3 and Qwen 2.5, using a randomly initialized MLP projector. As in the previous version, we applied a pixel unshuffle operation, reducing the number of visual tokens to one-quarter of the original. Besides, we adopted a similar dynamic resolution strategy as InternVL 1.5, dividing images into tiles of 448×448 pixels. The key difference, starting from InternVL 2.0, is that we additionally introduced support for multi-image and video data.

Notably, in InternVL3, we integrate the Variable Visual Position Encoding (V2PE), which utilizes smaller, more flexible position increments for visual tokens. Benefiting from V2PE, InternVL3 exhibits better long context understanding capabilities compared to its predecessors. We propose a Native Multimodal Pre-Training approach that consolidates language and vision learning into a single pre-training stage.
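The pixel-unshuffle step can be illustrated on a nested-list feature map (a toy sketch, not the model code): each r×r neighborhood of visual tokens is folded into one token with r² times the channels, so r=2 cuts the token count to one quarter, e.g. a 32×32 grid (1024 tokens) becomes 16×16 (256 tokens).

```python
def pixel_unshuffle(grid, r=2):
    """grid: H x W nested list of channel lists.
    Returns an (H/r) x (W/r) grid whose tokens concatenate each r x r block."""
    H, W = len(grid), len(grid[0])
    assert H % r == 0 and W % r == 0
    return [
        [
            # concatenate the r*r neighboring tokens into one longer token
            sum((grid[i * r + di][j * r + dj] for di in range(r) for dj in range(r)), [])
            for j in range(W // r)
        ]
        for i in range(H // r)
    ]
```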
In contrast to standard paradigms that first train a language-only model and subsequently adapt it to handle additional modalities, our method interleaves multimodal data (e.g., image-text, video-text, or image-text interleaved sequences) with large-scale textual corpora. This unified training scheme allows the model to learn both linguistic and multimodal representations simultaneously, ultimately enhancing its capability to handle vision-language tasks without the need for separate alignment or bridging modules. Please see our paper for more details. In this phase, the techniques of random JPEG compression, square loss re-weighting, and multimodal data packing proposed in InternVL2.5 are also employed in the InternVL3 series. The main advancement of the SFT phase in InternVL3 compared to InternVL2.5 lies in the use of higher-quality and more diverse training data. Specifically, we further extend training samples for tool use, 3D scene understanding, GUI operations, long context tasks, video understanding, scientific diagrams, creative writing, and multimodal reasoning. During Pre-training and SFT, the model is trained to predict the next token conditioned on previous ground-truth tokens. However, during inference, the model predicts each token based on its own prior outputs. This discrepancy between ground-truth tokens and model-predicted tokens introduces a distribution shift, which can impair the model’s Chain-of-Thought (CoT) reasoning capabilities. To mitigate this issue, we employ MPO, which introduces additional supervision from both positive and negative samples to align the model response distribution with the ground-truth distribution, thereby improving reasoning performance. 
Specifically, the training objective of MPO is a combination of preference loss \\(\mathcal{L}_{\text{p}}\\), quality loss \\(\mathcal{L}_{\text{q}}\\), and generation loss \\(\mathcal{L}_{\text{g}}\\), which can be formulated as follows:

$$ \mathcal{L}=w_{p}\cdot\mathcal{L}_{\text{p}} + w_{q}\cdot\mathcal{L}_{\text{q}} + w_{g}\cdot\mathcal{L}_{\text{g}}, $$

where \\(w_{p}\\), \\(w_{q}\\), and \\(w_{g}\\) represent the weights assigned to each loss component. Please see our paper for more details about MPO.

Test-Time Scaling has been shown to be an effective method to enhance the reasoning abilities of LLMs and MLLMs. In this work, we use the Best-of-N evaluation strategy and employ VisualPRM-8B as the critic model to select the best response for reasoning and mathematics evaluation.

Comprehensive Multimodal & Hallucination Evaluation

We compare InternVL3 with Qwen2.5 Chat models, whose corresponding pre-trained base models are employed as the initialization of the language component in InternVL3. Benefiting from Native Multimodal Pre-Training, the InternVL3 series achieves even better overall text performance than the Qwen2.5 series. Please note that the evaluation scores of the Qwen2.5 series may differ from those officially reported, as we have adopted the prompt versions provided in the table across all datasets for OpenCompass evaluation.

We conduct experiments on the InternVL2-8B model while keeping its architecture, initialization parameters, and training data entirely unchanged. Traditionally, InternVL2-8B employs a training pipeline that begins with an MLP warmup phase for feature alignment, followed by an Instruction Tuning stage. In our experiments, we substitute the conventional MLP warmup phase with a native multimodal pre-training process. This modification isolates the contribution of native multimodal pre-training to the overall multimodal capability of the model.
The evaluation results in the Figure below show that the model with native multimodal pre-training exhibits performance on most benchmarks comparable to the fully multi-stage-trained InternVL2-8B baseline. Furthermore, when followed by instruction tuning on higher-quality data, the model demonstrates further performance gains across evaluated multimodal tasks. These findings underscore the efficiency of native multimodal pre-training in imparting powerful multimodal capabilities to MLLMs.

As shown in the table below, models fine-tuned with MPO demonstrate superior reasoning performance across seven multimodal reasoning benchmarks compared to their counterparts without MPO. Specifically, InternVL3-78B and InternVL3-38B outperform their counterparts by 4.1 and 4.5 points, respectively. Notably, the training data used for MPO is a subset of that used for SFT, indicating that the performance improvements primarily stem from the training algorithm rather than the training data.

As reported in the table below, the introduction of V2PE leads to significant performance gains across most evaluation metrics. In addition, our ablation studies, varying the positional increment \\( \delta \\), reveal that even for tasks primarily involving conventional contexts, relatively small \\( \delta \\) values can achieve optimal performance. These findings provide important insights for future efforts aimed at refining position encoding strategies for visual tokens in MLLMs.

We provide an example code to run `InternVL3-2B` using `transformers`.

> Please use transformers>=4.37.2 to ensure the model works normally.

The reason for writing the code this way is to avoid errors that occur during multi-GPU inference due to tensors not being on the same device. By ensuring that the first and last layers of the large language model (LLM) are on the same device, we prevent such errors. Besides this method, you can also use the following code to get streamed output.
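The idea behind V2PE's smaller increments can be sketched in a few lines (an illustration of the position-assignment rule, not the model code): text tokens advance the position index by 1, while visual tokens advance it by a smaller increment \\( \delta \\), so long visual sequences consume less of the positional range.

```python
def v2pe_positions(token_types, delta=0.25):
    """token_types: iterable of 'text' or 'image'.
    Returns the (possibly fractional) position assigned to each token."""
    positions, pos = [], 0.0
    for t in token_types:
        positions.append(pos)
        pos += 1.0 if t == "text" else delta  # visual tokens advance by delta < 1
    return positions
```

With delta=0.25, a run of 1024 visual tokens spans only 256 positions, which is how V2PE stretches the effective context for image-heavy inputs.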
Many repositories now support fine-tuning of the InternVL series models, including InternVL, SWIFT, XTuner, and others. Please refer to their documentation for more details on fine-tuning.

LMDeploy is a toolkit for compressing, deploying, and serving LLMs & VLMs. LMDeploy abstracts the complex inference process of multi-modal Vision-Language Models (VLMs) into an easy-to-use pipeline, similar to the Large Language Model (LLM) inference pipeline. If an `ImportError` occurs while executing this case, please install the required dependency packages as prompted.

When dealing with multiple images, you can put them all in one list. Keep in mind that multiple images will lead to a higher number of input tokens, and as a result, the size of the context window typically needs to be increased. Conducting inference with batch prompts is quite straightforward; just place them within a list structure:

There are two ways to conduct multi-turn conversations with the pipeline. One is to construct messages according to the OpenAI format and use the method introduced above; the other is to use the `pipeline.chat` interface. LMDeploy's `api_server` enables models to be easily packed into services with a single command. The provided RESTful APIs are compatible with OpenAI's interfaces. Below is an example of service startup: To use the OpenAI-style interface, you need to install OpenAI:

This project is released under the MIT License. This project uses the pre-trained Qwen2.5 as a component, which is licensed under the Apache-2.0 License. If you find this project useful in your research, please consider citing:
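For the multi-image case mentioned above, the prompt can be assembled as an OpenAI-style message whose content list mixes one text item with several image items. The helper below only builds that structure (the URLs are placeholders); the resulting list would be passed to an LMDeploy pipeline:

```python
def build_multi_image_message(question, image_urls):
    """Assemble an OpenAI-format user message carrying several images.

    Each image becomes one `image_url` content item; the question is a
    single `text` item. More images mean more input tokens, so the
    pipeline's context window may need to be enlarged accordingly.
    """
    content = [{"type": "text", "text": question}]
    content += [
        {"type": "image_url", "image_url": {"url": url}}
        for url in image_urls
    ]
    return [{"role": "user", "content": content}]

# Placeholder URLs for illustration only.
messages = build_multi_image_message(
    "Describe these two images.",
    ["https://example.com/a.jpg", "https://example.com/b.jpg"],
)
```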

license:apache-2.0

InternVL3-1B

[\[πŸ“‚ GitHub\]](https://github.com/OpenGVLab/InternVL) [\[πŸ“œ InternVL 1.0\]](https://huggingface.co/papers/2312.14238) [\[πŸ“œ InternVL 1.5\]](https://huggingface.co/papers/2404.16821) [\[πŸ“œ InternVL 2.5\]](https://huggingface.co/papers/2412.05271) [\[πŸ“œ InternVL2.5-MPO\]](https://huggingface.co/papers/2411.10442) [\[πŸ“œ InternVL3\]](https://huggingface.co/papers/2504.10479) [\[πŸ†• Blog\]](https://internvl.github.io/blog/) [\[πŸ—¨οΈ Chat Demo\]](https://internvl.opengvlab.com/) [\[πŸ€— HF Demo\]](https://huggingface.co/spaces/OpenGVLab/InternVL) [\[πŸš€ Quick Start\]](#quick-start) [\[πŸ“– Documents\]](https://internvl.readthedocs.io/en/latest/)

We introduce InternVL3, an advanced multimodal large language model (MLLM) series that demonstrates superior overall performance. Compared to InternVL 2.5, InternVL3 exhibits superior multimodal perception and reasoning capabilities, while further extending its multimodal capabilities to encompass tool usage, GUI agents, industrial image analysis, 3D vision perception, and more. Additionally, we compare InternVL3 with Qwen2.5 Chat models, whose corresponding pre-trained base models are employed as the initialization of the language component in InternVL3. Benefiting from Native Multimodal Pre-Training, the InternVL3 series achieves even better overall text performance than the Qwen2.5 series. In the following table, we provide an overview of the InternVL3 series.
| Model Name | Vision Part | Language Part | HF Link |
| :-----------: | :-----------------------: | :-------------------: | :-----: |
| InternVL3-1B | InternViT-300M-448px-V2_5 | Qwen2.5-0.5B | πŸ€— link |
| InternVL3-2B | InternViT-300M-448px-V2_5 | Qwen2.5-1.5B | πŸ€— link |
| InternVL3-8B | InternViT-300M-448px-V2_5 | Qwen2.5-7B | πŸ€— link |
| InternVL3-9B | InternViT-300M-448px-V2_5 | internlm3-8b-instruct | πŸ€— link |
| InternVL3-14B | InternViT-300M-448px-V2_5 | Qwen2.5-14B | πŸ€— link |
| InternVL3-38B | InternViT-6B-448px-V2_5 | Qwen2.5-32B | πŸ€— link |
| InternVL3-78B | InternViT-6B-448px-V2_5 | Qwen2.5-72B | πŸ€— link |

As shown in the following figure, InternVL3 retains the same model architecture as InternVL 2.5 and its predecessors, InternVL 1.5 and 2.0, following the "ViT-MLP-LLM" paradigm. In this new version, we integrate a newly incrementally pre-trained InternViT with various pre-trained LLMs, including InternLM 3 and Qwen 2.5, using a randomly initialized MLP projector. As in the previous version, we applied a pixel unshuffle operation, reducing the number of visual tokens to one-quarter of the original. We also adopted a dynamic resolution strategy similar to that of InternVL 1.5, dividing images into tiles of 448Γ—448 pixels. The key difference, starting from InternVL 2.0, is that we additionally introduced support for multi-image and video data. Notably, in InternVL3, we integrate the Variable Visual Position Encoding (V2PE), which utilizes smaller, more flexible position increments for visual tokens. Benefiting from V2PE, InternVL3 exhibits better long-context understanding capabilities than its predecessors.

We propose a Native Multimodal Pre-Training approach that consolidates language and vision learning into a single pre-training stage.
In contrast to standard paradigms that first train a language-only model and subsequently adapt it to handle additional modalities, our method interleaves multimodal data (e.g., image-text, video-text, or image-text interleaved sequences) with large-scale textual corpora. This unified training scheme allows the model to learn both linguistic and multimodal representations simultaneously, ultimately enhancing its capability to handle vision-language tasks without the need for separate alignment or bridging modules. Please see our paper for more details. In this phase, the techniques of random JPEG compression, square loss re-weighting, and multimodal data packing proposed in InternVL2.5 are also employed in the InternVL3 series. The main advancement of the SFT phase in InternVL3 compared to InternVL2.5 lies in the use of higher-quality and more diverse training data. Specifically, we further extend training samples for tool use, 3D scene understanding, GUI operations, long context tasks, video understanding, scientific diagrams, creative writing, and multimodal reasoning. During Pre-training and SFT, the model is trained to predict the next token conditioned on previous ground-truth tokens. However, during inference, the model predicts each token based on its own prior outputs. This discrepancy between ground-truth tokens and model-predicted tokens introduces a distribution shift, which can impair the model’s Chain-of-Thought (CoT) reasoning capabilities. To mitigate this issue, we employ MPO, which introduces additional supervision from both positive and negative samples to align the model response distribution with the ground-truth distribution, thereby improving reasoning performance. 
Specifically, the training objective of MPO is a combination of preference loss \\(\mathcal{L}_{\text{p}}\\), quality loss \\(\mathcal{L}_{\text{q}}\\), and generation loss \\(\mathcal{L}_{\text{g}}\\), which can be formulated as follows:

$$ \mathcal{L}=w_{p}\cdot\mathcal{L}_{\text{p}} + w_{q}\cdot\mathcal{L}_{\text{q}} + w_{g}\cdot\mathcal{L}_{\text{g}}, $$

where \\(w_{*}\\) represents the weight assigned to each loss component. Please see our paper for more details about MPO.

Test-Time Scaling has been shown to be an effective method to enhance the reasoning abilities of LLMs and MLLMs. In this work, we use the Best-of-N evaluation strategy and employ VisualPRM-8B as the critic model to select the best response for reasoning and mathematics evaluation.

Comprehensive Multimodal & Hallucination Evaluation

We compare InternVL3 with Qwen2.5 Chat models, whose corresponding pre-trained base models are employed as the initialization of the language component in InternVL3. Benefiting from Native Multimodal Pre-Training, the InternVL3 series achieves even better overall text performance than the Qwen2.5 series. Please note that the evaluation scores of the Qwen2.5 series may differ from those officially reported, as we have adopted the prompt versions provided in the table across all datasets for OpenCompass evaluation.

We conduct experiments on the InternVL2-8B model while keeping its architecture, initialization parameters, and training data entirely unchanged. Traditionally, InternVL2-8B employs a training pipeline that begins with an MLP warmup phase for feature alignment, followed by an Instruction Tuning stage. In our experiments, we substitute the conventional MLP warmup phase with a native multimodal pre-training process. This modification isolates the contribution of native multimodal pre-training to the overall multimodal capability of the model.
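The Best-of-N strategy mentioned above reduces to scoring N sampled responses with a critic and keeping the argmax. The sketch below assumes scores already produced by a critic such as VisualPRM-8B; the responses and score values are made up:

```python
def best_of_n(responses, critic_scores):
    """Pick the response whose critic score is highest (Best-of-N).

    `critic_scores` stands in for per-response scores from a process
    reward model such as VisualPRM-8B; here they are plain floats.
    """
    if len(responses) != len(critic_scores):
        raise ValueError("one score per response is required")
    best_idx = max(range(len(responses)), key=lambda i: critic_scores[i])
    return responses[best_idx]

# Illustrative candidates and scores only.
answer = best_of_n(["x = 3", "x = 5", "x = 4"], [0.21, 0.87, 0.45])
```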
The evaluation results in the figure below show that the model with native multimodal pre-training exhibits performance on most benchmarks comparable to the fully multi-stage-trained InternVL2-8B baseline. Furthermore, when followed by instruction tuning on higher-quality data, the model demonstrates further performance gains across the evaluated multimodal tasks. These findings underscore the efficiency of native multimodal pre-training in imparting powerful multimodal capabilities to MLLMs.

As shown in the table below, models fine-tuned with MPO demonstrate superior reasoning performance across seven multimodal reasoning benchmarks compared to their counterparts without MPO. Specifically, InternVL3-78B and InternVL3-38B outperform their counterparts by 4.1 and 4.5 points, respectively. Notably, the training data used for MPO is a subset of that used for SFT, indicating that the performance improvements primarily stem from the training algorithm rather than the training data.

As reported in the table below, the introduction of V2PE leads to significant performance gains across most evaluation metrics. In addition, our ablation studies, in which we vary the positional increment \\( \delta \\), reveal that even for tasks primarily involving conventional contexts, relatively small \\( \delta \\) values can achieve optimal performance. These findings provide important insights for future efforts aimed at refining position encoding strategies for visual tokens in MLLMs.

We provide example code to run `InternVL3-1B` using `transformers`.

> Please use transformers>=4.37.2 to ensure the model works normally.

The reason for writing the code this way is to avoid errors that occur during multi-GPU inference due to tensors not being on the same device. By ensuring that the first and last layers of the large language model (LLM) are on the same device, we prevent such errors. Besides this method, you can also use the following code to get streamed output.
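The streamed-output option follows the producer-consumer pattern behind transformers' `TextIteratorStreamer`: generation runs in a background thread and pushes text chunks into a queue that the caller iterates over. A minimal stand-in, with a fake generator in place of the model:

```python
import queue
import threading

class TinyStreamer:
    """Minimal queue-backed streamer mimicking TextIteratorStreamer."""

    _END = object()

    def __init__(self):
        self._queue = queue.Queue()

    def put(self, text):
        # Called from the generation thread for each new chunk.
        self._queue.put(text)

    def end(self):
        # Signals that generation has finished.
        self._queue.put(self._END)

    def __iter__(self):
        while True:
            item = self._queue.get()
            if item is self._END:
                return
            yield item

def fake_generate(streamer):
    """Stand-in for model.generate(..., streamer=streamer)."""
    for chunk in ["Inter", "nVL", "3"]:
        streamer.put(chunk)
    streamer.end()

streamer = TinyStreamer()
threading.Thread(target=fake_generate, args=(streamer,)).start()
text = "".join(streamer)  # consume chunks as they arrive
```

With the real model, `model.generate` runs in the background thread and the main thread prints each chunk as it is yielded.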
Many repositories now support fine-tuning of the InternVL series models, including InternVL, SWIFT, XTuner, and others. Please refer to their documentation for more details on fine-tuning.

LMDeploy is a toolkit for compressing, deploying, and serving LLMs & VLMs. LMDeploy abstracts the complex inference process of multi-modal Vision-Language Models (VLMs) into an easy-to-use pipeline, similar to the Large Language Model (LLM) inference pipeline. If an `ImportError` occurs while executing this case, please install the required dependency packages as prompted.

When dealing with multiple images, you can put them all in one list. Keep in mind that multiple images will lead to a higher number of input tokens, and as a result, the size of the context window typically needs to be increased. Conducting inference with batch prompts is quite straightforward; just place them within a list structure:

There are two ways to conduct multi-turn conversations with the pipeline. One is to construct messages according to the OpenAI format and use the method introduced above; the other is to use the `pipeline.chat` interface. LMDeploy's `api_server` enables models to be easily packed into services with a single command. The provided RESTful APIs are compatible with OpenAI's interfaces. Below is an example of service startup: To use the OpenAI-style interface, you need to install OpenAI:

This project is released under the MIT License. This project uses the pre-trained Qwen2.5 as a component, which is licensed under the Apache-2.0 License. If you find this project useful in your research, please consider citing:
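Once the `api_server` is up, requests follow the standard OpenAI chat-completions shape. The payload below is what an OpenAI-style client would send; the model id and image URL are placeholders (query the server's `/v1/models` endpoint for the real id):

```python
import json

def build_chat_completion_request(model, question, image_url):
    """Build an OpenAI-compatible /v1/chat/completions payload.

    The model name and image URL are placeholders for whatever the
    running api_server actually reports.
    """
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
        "temperature": 0.8,
    }

payload = build_chat_completion_request(
    "internvl3-latest", "Describe this image.", "https://example.com/cat.jpg")
body = json.dumps(payload)  # ready to POST to the api_server
```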

license:apache-2.0

InternVL3-1B-hf

[\[πŸ“œ InternVL 1.0\]](https://huggingface.co/papers/2312.14238) [\[πŸ“œ InternVL 1.5\]](https://huggingface.co/papers/2404.16821) [\[πŸ“œ InternVL 2.5\]](https://huggingface.co/papers/2412.05271) [\[πŸ“œ InternVL2.5-MPO\]](https://huggingface.co/papers/2411.10442) [\[πŸ“œ InternVL3\]](https://huggingface.co/papers/2504.10479) [\[πŸ†• Blog\]](https://internvl.github.io/blog/) [\[πŸ—¨οΈ Chat Demo\]](https://internvl.opengvlab.com/) [\[πŸ€— HF Demo\]](https://huggingface.co/spaces/OpenGVLab/InternVL) [\[πŸš€ Quick Start\]](#quick-start) [\[πŸ“– Documents\]](https://internvl.readthedocs.io/en/latest/)

> [!IMPORTANT]
> This repository contains the Hugging Face πŸ€— Transformers implementation for the OpenGVLab/InternVL3-1B model.
> It is intended to be functionally equivalent to the original OpenGVLab release.
> As a native Transformers model, it supports core library features such as various attention implementations (eager, SDPA, and FA2) and enables efficient batched inference with interleaved image, video, and text inputs.

We introduce InternVL3, an advanced multimodal large language model (MLLM) series that demonstrates superior overall performance. Compared to InternVL 2.5, InternVL3 exhibits superior multimodal perception and reasoning capabilities, while further extending its multimodal capabilities to encompass tool usage, GUI agents, industrial image analysis, 3D vision perception, and more. Additionally, we compare InternVL3 with Qwen2.5 Chat models, whose corresponding pre-trained base models are employed as the initialization of the language component in InternVL3. Benefiting from Native Multimodal Pre-Training, the InternVL3 series achieves even better overall text performance than the Qwen2.5 series.
You can find more info on the InternVL3 family in the original checkpoint OpenGVLab/InternVL3-1B.

Here is how you can use the `image-text-to-text` pipeline to perform inference with the `InternVL3` models in just a few lines of code. This example demonstrates how to perform inference on a single image with the InternVL models using chat templates.

> [!NOTE]
> Note that the model has been trained with a specific prompt format for chatting. Use `processor.apply_chat_template(my_conversation_dict)` to correctly format your prompts.

Text-only generation: this example shows how to generate text using the InternVL model without providing any image input.

Batched image and text inputs: InternVL models also support batched image and text inputs.

Batched multi-image input: this implementation of the InternVL models supports batched text-image inputs with a different number of images for each text.

Video input: InternVL models can also handle video inputs. Here is an example of how to perform inference on a video input using chat templates.

Interleaved image and video inputs: this example showcases how to handle a batch of chat conversations with interleaved image and video inputs using the chat template.

This project is released under the MIT License. This project uses the pre-trained Qwen2.5 as a component, which is licensed under the Qwen License. If you find this project useful in your research, please consider citing:
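A conversation for `apply_chat_template` is just a list of role/content dicts in which image items sit alongside text items. The toy renderer below mimics what the template does, inlining an `<image>` placeholder where each image slot goes; the tag format is illustrative, not the model's actual template:

```python
def render_prompt(conversation):
    """Very small stand-in for processor.apply_chat_template.

    Renders role-tagged turns, replacing image content items with an
    `<image>` placeholder token, the way InternVL-style templates inline
    image slots into the text stream. The tag format is illustrative.
    """
    parts = []
    for turn in conversation:
        chunks = []
        for item in turn["content"]:
            if item["type"] == "image":
                chunks.append("<image>")
            else:
                chunks.append(item["text"])
        parts.append(f"<|{turn['role']}|>\n" + "\n".join(chunks))
    parts.append("<|assistant|>\n")  # generation prompt
    return "\n".join(parts)

conversation = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "What is in the picture?"},
    ],
}]
prompt = render_prompt(conversation)
```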


InternVL3_5-2B-Instruct

[\[πŸ“‚ GitHub\]](https://github.com/OpenGVLab/InternVL) [\[πŸ“œ InternVL 1.0\]](https://huggingface.co/papers/2312.14238) [\[πŸ“œ InternVL 1.5\]](https://huggingface.co/papers/2404.16821) [\[πŸ“œ InternVL 2.5\]](https://huggingface.co/papers/2412.05271) [\[πŸ“œ InternVL2.5-MPO\]](https://huggingface.co/papers/2411.10442) [\[πŸ“œ InternVL3\]](https://huggingface.co/papers/2504.10479) [\[πŸ“œ InternVL3.5\]](https://huggingface.co/papers/2508.18265) [\[πŸ†• Blog\]](https://internvl.github.io/blog/) [\[πŸ—¨οΈ Chat Demo\]](https://chat.intern-ai.org.cn/) [\[πŸš€ Quick Start\]](#quick-start) [\[πŸ“– Documents\]](https://internvl.readthedocs.io/en/latest/) We introduce InternVL3.5, a new family of open-source multimodal models that significantly advances versatility, reasoning capability, and inference efficiency along the InternVL series. A key innovation is the Cascade Reinforcement Learning (Cascade RL) framework, which enhances reasoning through a two-stage process: offline RL for stable convergence and online RL for refined alignment. This coarse-to-fine training strategy leads to substantial improvements on downstream reasoning tasks, e.g., MMMU and MathVista. To optimize efficiency, we propose a Visual Resolution Router (ViR) that dynamically adjusts the resolution of visual tokens without compromising performance. Coupled with ViR, our Decoupled Vision-Language Deployment (DvD) strategy separates the vision encoder and language model across different GPUs, effectively balancing computational load. These contributions collectively enable InternVL3.5 to achieve up to a +16.0\% gain in overall reasoning performance and a 4.05 \\(\times\\) inference speedup compared to its predecessor, i.e., InternVL3. In addition, InternVL3.5 supports novel capabilities such as GUI interaction and embodied agency. 
Notably, our largest model, i.e., InternVL3.5-241B-A28B, attains state-of-the-art results among open-source MLLMs across general multimodal, reasoning, text, and agentic tasks, narrowing the performance gap with leading commercial models like GPT-5. All models and code are publicly released.

> Hatched bars represent closed-source commercial models. We report average scores on a set of multimodal general, reasoning, text, and agentic benchmarks: MMBench v1.1 (en), MMStar, BLINK, HallusionBench, AI2D, OCRBench, MMVet, MME-RealWorld (en), MVBench, VideoMME, MMMU, MathVista, MathVision, MathVerse, DynaMath, WeMath, LogicVista, MATH500, AIME24, AIME25, GPQA, MMLU-Pro, GAOKAO, IFEval, SGP-Bench, VSI-Bench, ERQA, SpaCE-10, and OmniSpatial.

In the following table, we provide an overview of the InternVL3.5 series. To maintain consistency with earlier generations, we provide two model formats: the GitHub format, consistent with prior releases, and the HF format, aligned with the official Transformers standard.

> If you want to convert the checkpoint between these two formats, please refer to the scripts about custom2hf and hf2custom.
| Model | #Vision Param | #Language Param | #Total Param | HF Link | ModelScope Link |
| --------------------- | ------------- | --------------- | ------------ | ------- | --------------- |
| InternVL3.5-1B | 0.3B | 0.8B | 1.1B | πŸ€— link | πŸ€– link |
| InternVL3.5-2B | 0.3B | 2.0B | 2.3B | πŸ€— link | πŸ€– link |
| InternVL3.5-4B | 0.3B | 4.4B | 4.7B | πŸ€— link | πŸ€– link |
| InternVL3.5-8B | 0.3B | 8.2B | 8.5B | πŸ€— link | πŸ€– link |
| InternVL3.5-14B | 0.3B | 14.8B | 15.1B | πŸ€— link | πŸ€– link |
| InternVL3.5-38B | 5.5B | 32.8B | 38.4B | πŸ€— link | πŸ€– link |
| InternVL3.5-20B-A4B | 0.3B | 20.9B | 21.2B-A4B | πŸ€— link | πŸ€– link |
| InternVL3.5-30B-A3B | 0.3B | 30.5B | 30.8B-A3B | πŸ€— link | πŸ€– link |
| InternVL3.5-241B-A28B | 5.5B | 235.1B | 240.7B-A28B | πŸ€— link | πŸ€– link |

| Model | #Vision Param | #Language Param | #Total Param | HF Link | ModelScope Link |
| ------------------------ | ------------- | --------------- | ------------ | ------- | --------------- |
| InternVL3.5-1B-HF | 0.3B | 0.8B | 1.1B | πŸ€— link | πŸ€– link |
| InternVL3.5-2B-HF | 0.3B | 2.0B | 2.3B | πŸ€— link | πŸ€– link |
| InternVL3.5-4B-HF | 0.3B | 4.4B | 4.7B | πŸ€— link | πŸ€– link |
| InternVL3.5-8B-HF | 0.3B | 8.2B | 8.5B | πŸ€— link | πŸ€– link |
| InternVL3.5-14B-HF | 0.3B | 14.8B | 15.1B | πŸ€— link | πŸ€– link |
| InternVL3.5-38B-HF | 5.5B | 32.8B | 38.4B | πŸ€— link | πŸ€– link |
| InternVL3.5-20B-A4B-HF | 0.3B | 20.9B | 21.2B-A4B | πŸ€— link | πŸ€– link |
| InternVL3.5-30B-A3B-HF | 0.3B | 30.5B | 30.8B-A3B | πŸ€— link | πŸ€– link |
| InternVL3.5-241B-A28B-HF | 5.5B | 235.1B | 240.7B-A28B | πŸ€— link | πŸ€– link |

> We conduct the evaluation with VLMEvalKit.
To enable the Thinking mode of our model, please set the system prompt to R1_SYSTEM_PROMPT. When enabling Thinking mode, we recommend setting `do_sample=True` and `temperature=0.6` to mitigate undesired repetition.

Our training pipeline comprises three stages: Multimodal Continual Pre-Training (CPT), Supervised Fine-Tuning (SFT), and Cascade Reinforcement Learning (CascadeRL). In CascadeRL, we first fine-tune the model using Mixed Preference Optimization (MPO) under an offline RL setting, followed by GSPO under an online RL setting. For the Flash version of InternVL3.5, we additionally introduce a lightweight training stage, termed Visual Consistency Learning (ViCO), which reduces the token cost required to represent an image patch.

Here, we also open-source the model weights after different training stages for potential research usage. If you're unsure which version to use, please select the one without any suffix, as it has completed the full training pipeline.

| Model | Training Pipeline | HF Link | ModelScope Link |
| -------------------------------- | --------------------- | ------- | --------------- |
| InternVL3.5-1B-Pretrained | CPT | πŸ€— link | πŸ€– link |
| InternVL3.5-1B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link |
| InternVL3.5-1B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link |
| InternVL3.5-1B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link |
| InternVL3.5-2B-Pretrained | CPT | πŸ€— link | πŸ€– link |
| InternVL3.5-2B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link |
| InternVL3.5-2B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link |
| InternVL3.5-2B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link |
| InternVL3.5-4B-Pretrained | CPT | πŸ€— link | πŸ€– link |
| InternVL3.5-4B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link |
| InternVL3.5-4B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link |
| InternVL3.5-4B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link |
| InternVL3.5-8B-Pretrained | CPT | πŸ€— link | πŸ€– link |
| InternVL3.5-8B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link |
| InternVL3.5-8B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link |
| InternVL3.5-8B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link |
| InternVL3.5-14B-Pretrained | CPT | πŸ€— link | πŸ€– link |
| InternVL3.5-14B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link |
| InternVL3.5-14B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link |
| InternVL3.5-14B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link |
| InternVL3.5-30B-A3B-Pretrained | CPT | πŸ€— link | πŸ€– link |
| InternVL3.5-30B-A3B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link |
| InternVL3.5-30B-A3B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link |
| InternVL3.5-30B-A3B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link |
| InternVL3.5-38B-Pretrained | CPT | πŸ€— link | πŸ€– link |
| InternVL3.5-38B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link |
| InternVL3.5-38B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link |
| InternVL3.5-38B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link |
| InternVL3.5-241B-A28B-Pretrained | CPT | πŸ€— link | πŸ€– link |
| InternVL3.5-241B-A28B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link |
| InternVL3.5-241B-A28B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link |
| InternVL3.5-241B-A28B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link |

The Flash version of our model will be released as soon as possible.

`InternVL3.5`: This series of models follows the "ViT-MLP-LLM" paradigm adopted in previous versions of InternVL. We initialize the language model using the Qwen3 series and GPT-OSS, and the vision encoder using InternViT-300M and InternViT-6B. The Dynamic High Resolution strategy introduced in InternVL1.5 is also retained in our design.

`InternVL3.5-Flash`: Compared to InternVL3.5, InternVL3.5-Flash further integrates the Visual Resolution Router (ViR), yielding a series of efficient variants suitable for resource-constrained scenarios.
Specifically, in InternVL3.5, each image patch is initially represented as 1024 visual tokens for the vision encoder, which are then compressed into 256 tokens via a pixel shuffle module before being passed to the Large Language Model (LLM). In InternVL3.5-Flash, as shown in the figure below, an additional pixel shuffle module with a higher compression rate is included, enabling the compression of visual tokens down to 64 tokens. For each patch, the patch router determines the appropriate compression rate by assessing its semantic richness, and routes it to the corresponding pixel shuffle module accordingly. Benefiting from this patch-aware compression mechanism, InternVL3.5-Flash is able to reduce the number of visual tokens by 50\% while maintaining nearly 100\% of the performance of InternVL3.5.

During the pre-training stage, we update all model parameters jointly using the combination of large-scale text and multimodal corpora. Specifically, given an arbitrary training sample consisting of a multimodal token sequence \\(\mathbf{x}=\left(x_1, x_2, \ldots, x_L\right)\\), the next token prediction (NTP) loss is calculated on each text token as follows:

$$ \mathcal{L}_{i}=-\log p_\theta\left(x_i \mid x_1, \ldots, x_{i-1}\right), $$

where \\(x_i\\) is the predicted token and the prefix tokens \\(\{x_1, x_2, \ldots, x_{i-1}\}\\) can be either text tokens or image tokens. Notably, for conversation samples, only response tokens are included in the calculation of the loss. Additionally, to mitigate bias toward either longer or shorter responses during training, we adopt square-root averaging to re-weight the NTP loss as follows:

$$ \mathcal{L}_{i}^{'} = \frac{w_i}{\sum_j w_j} \cdot \mathcal{L}_i, \quad w_i = \frac{1}{N^{0.5}}, $$

where \\(N\\) denotes the number of tokens in the training sample on which the loss needs to be calculated. Random JPEG compression is also included to enhance the model's real-world performance.
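The re-weighting above can be checked numerically: each token's loss is scaled by \\(w = 1/N^{0.5}\\) (with \\(N\\) the token count of its sample), normalized over the batch, so long responses no longer dominate purely by token count. A small sketch with made-up loss values:

```python
def reweighted_ntp_loss(sample_token_losses):
    """Square-root re-weighting of next-token-prediction losses.

    `sample_token_losses` is a list of per-sample lists of token losses.
    Each token's weight is 1 / N**0.5 (N = tokens in its sample),
    normalized over all tokens in the batch, so that neither long nor
    short responses dominate the objective.
    """
    weights, losses = [], []
    for token_losses in sample_token_losses:
        n = len(token_losses)
        weights += [1.0 / n ** 0.5] * n
        losses += token_losses
    total_w = sum(weights)
    return sum(w / total_w * l for w, l in zip(weights, losses))
```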
During the SFT phase, we adopt the same objective as in the pre-training stage and use the square-root averaging strategy to calculate the final loss. In this stage, the context window is set to 32K tokens to accommodate long-context information. Compared to InternVL3, the SFT stage of InternVL3.5 contains higher-quality and more diverse training data derived from three sources: (1) Instruction-following data from InternVL3, which are reused to preserve broad coverage of vision-language tasks. (2) Multimodal reasoning data in the "Thinking" mode, which are included to instill long-thinking capabilities in the model. To construct such data, we first use InternVL3-78B to describe the image and then input the description into DeepSeek-R1 to sample rollouts with detailed reasoning processes. Rollouts with an incorrect final answer are filtered out. The questions in these datasets cover various expert domains, such as mathematics and scientific disciplines, thereby strengthening performance on different reasoning tasks. (3) Capability-expansion datasets, which endow InternVL3.5 with new skills, including GUI-based interaction, embodied interaction, and scalable vector graphics (SVG) understanding.

Cascade RL aims to combine the benefits of offline RL and online RL to progressively facilitate the post-training of MLLMs in an efficient manner. Specifically, we first fine-tune the model using an offline RL algorithm as an efficient warm-up stage, which guarantees high-quality rollouts for the subsequent stage. We then employ an online RL algorithm to further refine the output distribution based on rollouts generated by the model itself. Compared to a single offline or online RL stage, our cascaded RL achieves significant performance improvements at a fraction of the GPU time cost. During the offline RL stage, we employ mixed preference optimization (MPO) to fine-tune the model.
Specifically, the training objective of MPO is a combination of preference loss \\(\mathcal{L}_{p}\\), quality loss \\(\mathcal{L}_{q}\\), and generation loss \\(\mathcal{L}_{g}\\), which can be formulated as follows:

$$ \mathcal{L}_{\text{MPO}}= w_{p} \mathcal{L}_{p} + w_{q} \mathcal{L}_{q} + w_{g} \mathcal{L}_{g}, $$

where \\(w_{*}\\) represents the weight assigned to each loss component. The DPO loss, BCO loss, and LM loss serve as the preference loss, quality loss, and generation loss, respectively.

During the online RL stage, we employ GSPO, without reference model constraints, as our online RL algorithm, which we find more effective for training both dense and mixture-of-experts (MoE) models. Similar to GRPO, the advantage is defined as the normalized reward across responses sampled from the same query. The training objective of GSPO is given by:

$$ \mathcal{L}_{\mathrm{GSPO}}(\theta)=\mathbb{E}_{x \sim \mathcal{D},\,\{y_i\}_{i=1}^G \sim \pi_{\theta_{\text{old}}}(\cdot \mid x)}\left[\frac{1}{G} \sum_{i=1}^G \min \left(s_i(\theta) \widehat{A}_i,\ \operatorname{clip}\left(s_i(\theta), 1-\varepsilon, 1+\varepsilon\right) \widehat{A}_i\right)\right], $$

where the importance sampling ratio \\(s_i(\theta)\\) is defined as the geometric mean of the per-token ratios.

> Please see our paper for more technical and experimental details.

We further include ViCO as an additional training stage to integrate the visual resolution router (ViR) into InternVL3.5, thereby reducing its inference cost. The resulting efficient version of InternVL3.5 is termed InternVL3.5-Flash. In particular, ViCO comprises two stages:

`Consistency training`: In this stage, the entire model is trained to minimize the divergence between response distributions conditioned on visual tokens with different compression rates. In practice, we introduce an extra reference model, which is frozen and initialized with InternVL3.5.
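The GSPO objective mirrors PPO-style clipped surrogates at the sequence level. The toy computation below evaluates it for a group of G responses given sequence-level importance ratios \\(s_i\\) and normalized advantages (all numbers illustrative):

```python
def gspo_objective(ratios, advantages, eps=0.2):
    """Sequence-level clipped surrogate, as in the GSPO objective.

    `ratios` are sequence-level importance ratios s_i(theta) (geometric
    mean of per-token ratios); `advantages` are the group-normalized
    rewards. Returns the mean over the group of G responses.
    """
    def clip(s):
        return min(max(s, 1 - eps), 1 + eps)

    terms = [
        min(s * a, clip(s) * a)  # pessimistic (clipped) surrogate
        for s, a in zip(ratios, advantages)
    ]
    return sum(terms) / len(terms)
```

Note that with a negative advantage the `min` picks the clipped branch when the ratio drops below \\(1-\varepsilon\\), which is what bounds the update size.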
Given a sample, each image patch is represented as either 256 or 64 tokens, and the training objective is defined as follows:

$$ \mathcal{L}_{\text{ViCO}} = \mathbb{E}_{\xi \sim \mathcal{R}} \left[ \frac{1}{N} \sum_{i=1}^{N} \mathrm{KL} \Big( \pi_{\theta_{\text{ref}}}\left(y_i \mid y_{<i}\right) \,\Big\|\, \pi_{\theta}\left(y_i \mid y_{<i}, \xi\right) \Big) \right], $$

where \\(\xi\\) denotes the compression rate sampled from the router \\(\mathcal{R}\\) and the frozen reference model \\(\pi_{\theta_{\text{ref}}}\\) conditions on uncompressed visual tokens. Please see our paper for more technical and experimental details.

Test-time scaling (TTS) has been empirically demonstrated as an effective approach to enhance the reasoning capabilities of LLMs and MLLMs, particularly for complex tasks necessitating multi-step inference. In this work, we implement a comprehensive test-time scaling approach that simultaneously improves reasoning depth (i.e., deep thinking) and breadth (i.e., parallel thinking).

`Deep Thinking`: By activating the Thinking mode, we guide the model to deliberately engage in step-by-step reasoning (i.e., decomposing complex problems into logical steps and validating intermediate conclusions) prior to generating the final answer. This approach systematically improves the logical structure of solutions for complex problems, particularly those requiring multi-step inference, and enhances reasoning depth.

`Parallel Thinking`: Following InternVL3, for reasoning tasks, we adopt the Best-of-N (BoN) strategy, employing VisualPRM-v1.1 as the critic model to select the optimal response from multiple reasoning candidates. This approach improves reasoning breadth.

> Notably, unless otherwise specified, the experimental results reported in our paper are obtained without applying TTS. Thus far, we have only applied TTS to reasoning benchmarks, since we found that the model already exhibits strong perception and understanding capabilities, and initiating TTS yields no significant improvement.

In multimodal inference, the vision encoder and language model have distinct computational characteristics. The vision encoder, which transforms images into semantic features, is highly parallelizable and does not rely on long-term history states.
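The consistency term reduces to a per-token KL between the reference model's next-token distribution (full-resolution visual tokens) and the policy's (compressed visual tokens). A toy version over explicit categorical distributions, with made-up probabilities:

```python
import math

def kl_divergence(p, q):
    """KL(p || q) between two categorical distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def vico_consistency_loss(ref_dists, policy_dists):
    """Mean per-token KL between reference (uncompressed visual tokens)
    and policy (compressed visual tokens) next-token predictions, in the
    spirit of the consistency-training objective. Toy values only.
    """
    kls = [kl_divergence(p, q) for p, q in zip(ref_dists, policy_dists)]
    return sum(kls) / len(kls)
```

When the policy matches the reference exactly, the loss is zero; any divergence introduced by compression is penalized.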
In contrast, the language model performs inference in an autoregressive manner, which requires previous states to compute the next token. This sequential property makes the language part more sensitive to memory bandwidth and latency. When MLLMs are deployed online at scale, the vision and language models often block each other, thus incurring additional inference cost. This effect becomes more pronounced with larger vision models or higher-resolution images.

As shown in the Figure above, we propose decoupled vision-language deployment (DvD) to address this issue by separating vision and language processing, with a particular focus on optimizing the prefilling stage. The vision subsystem batches and processes images to produce compact feature embeddings, which are then transmitted to the language subsystem for fusion with the text context prior to decoding. This separation alleviates blocking and brings multimodal prefilling performance closer to that of pure language models.

In our system implementation, the ViT and MLP (and ViR for InternVL3.5-Flash) are deployed on the vision server, while the language server executes only the LLM. The communication is unidirectional, transmitting BF16 visual features over TCP, with RDMA optionally employed to achieve higher transmission speed. Vision processing, feature transmission, and language processing are organized into an asynchronous three-stage pipeline, enabling overlapped execution and minimizing pipeline stalls.

DvD increases GPU utilization and processing efficiency on the vision side, while enabling the language server to focus exclusively on the LLM's prefilling and decoding without being blocked by vision computation. This design leads to improved throughput and responsiveness. Moreover, the architecture supports independent hardware cost optimization for the vision and language modules, and facilitates the seamless integration of new modules without requiring modifications to the language server deployment.
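The asynchronous three-stage pipelining described above can be sketched with thread-backed queues; the stage functions are hypothetical stand-ins for vision encoding, feature transmission, and language prefilling (the real system runs these on separate servers):

```python
import queue
import threading

def run_stage(fn, inbox, outbox):
    """Consume items from inbox, process with fn, forward results to outbox."""
    while True:
        item = inbox.get()
        if item is None:          # sentinel: propagate shutdown downstream
            outbox.put(None)
            break
        outbox.put(fn(item))

def pipeline(items, stages):
    """Run items through a chain of stages, each on its own thread, so that
    successive stages overlap in time instead of blocking each other."""
    queues = [queue.Queue() for _ in range(len(stages) + 1)]
    threads = [threading.Thread(target=run_stage, args=(fn, queues[i], queues[i + 1]))
               for i, fn in enumerate(stages)]
    for t in threads:
        t.start()
    for item in items:
        queues[0].put(item)
    queues[0].put(None)
    results = []
    while (out := queues[-1].get()) is not None:
        results.append(out)
    for t in threads:
        t.join()
    return results
```

Because each stage has its own worker and FIFO queue, the "vision" stage can already encode the next request while the "language" stage is still prefilling the previous one, which is the overlap DvD exploits.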
Multi-Image Understanding & Real-World Comprehension

Comprehensive Multimodal Understanding & Multimodal Hallucination Evaluation

We provide example code to run `InternVL3.5-8B` using `transformers`. Please note that our models with up to 30B parameters can be deployed on a single A100 GPU, while the 38B model requires two A100 GPUs and the 235B model requires eight A100 GPUs.

> In most cases, both LMDeploy and vLLM can be used for model deployment. However, for InternVL3.5-20B-A4B, we recommend using vLLM, since LMDeploy does not yet support GPT-OSS.

> Please use transformers>=4.52.1 to ensure the model works normally. For the 20B version of our model, transformers>=4.55.0 is required.

To enable thinking mode, please set the system prompt to our Thinking System Prompt. When enabling Thinking mode, we recommend setting `do_sample=True` and `temperature=0.6` to mitigate undesired repetition. Besides this method, you can also use the following code to get streamed output.

Many repositories now support fine-tuning of the InternVL series models, including InternVL, SWIFT, XTuner, and others. Please refer to their documentation for more details on fine-tuning.

LMDeploy is a toolkit for compressing, deploying, and serving LLMs & VLMs. It abstracts the complex inference process of multimodal vision-language models (VLMs) into an easy-to-use pipeline, similar to the large language model (LLM) inference pipeline.

When dealing with multiple images, you can put them all in one list. Keep in mind that multiple images will lead to a higher number of input tokens, and as a result, the size of the context window typically needs to be increased. Conducting inference with batch prompts is quite straightforward; just place them within a list structure.

There are two ways to conduct multi-turn conversations with the pipeline. One is to construct messages according to the OpenAI format and use the method introduced above; the other is to use the `pipeline.chat` interface.
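The Thinking-mode settings recommended above can be collected into a small request-building helper. This is a hedged sketch, not the official API: `THINKING_SYSTEM_PROMPT` is a hypothetical placeholder for the actual Thinking System Prompt from the model card, and the non-thinking defaults are illustrative assumptions:

```python
# Hypothetical placeholder -- substitute the Thinking System Prompt from the model card.
THINKING_SYSTEM_PROMPT = "You should first think step by step, then give the final answer."

def build_chat_request(question, thinking=False):
    """Assemble OpenAI-style messages plus the sampling settings recommended
    above (do_sample=True, temperature=0.6) when Thinking mode is enabled."""
    messages = []
    if thinking:
        messages.append({"role": "system", "content": THINKING_SYSTEM_PROMPT})
    messages.append({"role": "user", "content": question})
    gen_kwargs = {"do_sample": thinking, "temperature": 0.6 if thinking else 1.0}
    return messages, gen_kwargs
```

The same messages/kwargs pair can then be passed to whichever backend (transformers, LMDeploy, or vLLM) serves the model.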
LMDeploy's `api_server` enables models to be easily packed into services with a single command. The provided RESTful APIs are compatible with OpenAI's interfaces. Below is an example of service startup. To use the OpenAI-style interface, you need to install the OpenAI Python package.

This project is released under the Apache-2.0 License. This project uses the pre-trained Qwen3 as a component, which is licensed under the Apache-2.0 License.

If you find this project useful in your research, please consider citing:

license:apache-2.0

InternVL3_5-8B

[\[πŸ“‚ GitHub\]](https://github.com/OpenGVLab/InternVL) [\[πŸ“œ InternVL 1.0\]](https://huggingface.co/papers/2312.14238) [\[πŸ“œ InternVL 1.5\]](https://huggingface.co/papers/2404.16821) [\[πŸ“œ InternVL 2.5\]](https://huggingface.co/papers/2412.05271) [\[πŸ“œ InternVL2.5-MPO\]](https://huggingface.co/papers/2411.10442) [\[πŸ“œ InternVL3\]](https://huggingface.co/papers/2504.10479) [\[πŸ“œ InternVL3.5\]](https://huggingface.co/papers/2508.18265) [\[πŸ†• Blog\]](https://internvl.github.io/blog/) [\[πŸ—¨οΈ Chat Demo\]](https://chat.intern-ai.org.cn/) [\[πŸš€ Quick Start\]](#quick-start) [\[πŸ“– Documents\]](https://internvl.readthedocs.io/en/latest/) We introduce InternVL3.5, a new family of open-source multimodal models that significantly advances versatility, reasoning capability, and inference efficiency along the InternVL series. A key innovation is the Cascade Reinforcement Learning (Cascade RL) framework, which enhances reasoning through a two-stage process: offline RL for stable convergence and online RL for refined alignment. This coarse-to-fine training strategy leads to substantial improvements on downstream reasoning tasks, e.g., MMMU and MathVista. To optimize efficiency, we propose a Visual Resolution Router (ViR) that dynamically adjusts the resolution of visual tokens without compromising performance. Coupled with ViR, our Decoupled Vision-Language Deployment (DvD) strategy separates the vision encoder and language model across different GPUs, effectively balancing computational load. These contributions collectively enable InternVL3.5 to achieve up to a +16.0\% gain in overall reasoning performance and a 4.05 \\(\times\\) inference speedup compared to its predecessor, i.e., InternVL3. In addition, InternVL3.5 supports novel capabilities such as GUI interaction and embodied agency. 
Notably, our largest model, i.e., InternVL3.5-241B-A28B, attains state-of-the-art results among open-source MLLMs across general multimodal, reasoning, text, and agentic tasks, narrowing the performance gap with leading commercial models like GPT-5. All models and code are publicly released.

> Hatched bars represent closed-source commercial models. We report average scores on a set of multimodal general, reasoning, text, and agentic benchmarks: MMBench v1.1 (en), MMStar, BLINK, HallusionBench, AI2D, OCRBench, MMVet, MME-RealWorld (en), MVBench, VideoMME, MMMU, MathVista, MathVision, MathVerse, DynaMath, WeMath, LogicVista, MATH500, AIME24, AIME25, GPQA, MMLU-Pro, GAOKAO, IFEval, SGP-Bench, VSI-Bench, ERQA, SpaCE-10, and OmniSpatial.

In the following table, we provide an overview of the InternVL3.5 series. To maintain consistency with earlier generations, we provide two model formats: the GitHub format, consistent with prior releases, and the HF format, aligned with the official Transformers standard.

> If you want to convert the checkpoint between these two formats, please refer to the scripts about custom2hf and hf2custom.
| Model | #Vision Param | #Language Param | #Total Param | HF Link | ModelScope Link |
| --- | --- | --- | --- | --- | --- |
| InternVL3.5-1B | 0.3B | 0.8B | 1.1B | 🤗 link | 🤖 link |
| InternVL3.5-2B | 0.3B | 2.0B | 2.3B | 🤗 link | 🤖 link |
| InternVL3.5-4B | 0.3B | 4.4B | 4.7B | 🤗 link | 🤖 link |
| InternVL3.5-8B | 0.3B | 8.2B | 8.5B | 🤗 link | 🤖 link |
| InternVL3.5-14B | 0.3B | 14.8B | 15.1B | 🤗 link | 🤖 link |
| InternVL3.5-38B | 5.5B | 32.8B | 38.4B | 🤗 link | 🤖 link |
| InternVL3.5-20B-A4B | 0.3B | 20.9B | 21.2B-A4B | 🤗 link | 🤖 link |
| InternVL3.5-30B-A3B | 0.3B | 30.5B | 30.8B-A3B | 🤗 link | 🤖 link |
| InternVL3.5-241B-A28B | 5.5B | 235.1B | 240.7B-A28B | 🤗 link | 🤖 link |

| Model | #Vision Param | #Language Param | #Total Param | HF Link | ModelScope Link |
| --- | --- | --- | --- | --- | --- |
| InternVL3.5-1B-HF | 0.3B | 0.8B | 1.1B | 🤗 link | 🤖 link |
| InternVL3.5-2B-HF | 0.3B | 2.0B | 2.3B | 🤗 link | 🤖 link |
| InternVL3.5-4B-HF | 0.3B | 4.4B | 4.7B | 🤗 link | 🤖 link |
| InternVL3.5-8B-HF | 0.3B | 8.2B | 8.5B | 🤗 link | 🤖 link |
| InternVL3.5-14B-HF | 0.3B | 14.8B | 15.1B | 🤗 link | 🤖 link |
| InternVL3.5-38B-HF | 5.5B | 32.8B | 38.4B | 🤗 link | 🤖 link |
| InternVL3.5-20B-A4B-HF | 0.3B | 20.9B | 21.2B-A4B | 🤗 link | 🤖 link |
| InternVL3.5-30B-A3B-HF | 0.3B | 30.5B | 30.8B-A3B | 🤗 link | 🤖 link |
| InternVL3.5-241B-A28B-HF | 5.5B | 235.1B | 240.7B-A28B | 🤗 link | 🤖 link |

> We conduct the evaluation with VLMEvalKit.
To enable the Thinking mode of our model, please set the system prompt to R1_SYSTEM_PROMPT. When enabling Thinking mode, we recommend setting `do_sample=True` and `temperature=0.6` to mitigate undesired repetition.

Our training pipeline comprises three stages: Multimodal Continual Pre-Training (CPT), Supervised Fine-Tuning (SFT), and Cascade Reinforcement Learning (CascadeRL). In CascadeRL, we first fine-tune the model using Mixed Preference Optimization (MPO) under an offline RL setting, followed by GSPO under an online RL setting. For the Flash version of InternVL3.5, we additionally introduce a lightweight training stage, termed Visual Consistency Learning (ViCO), which reduces the token cost required to represent an image patch.

Here, we also open-source the model weights after different training stages for potential research usage. If you're unsure which version to use, please select the one without any suffix, as it has completed the full training pipeline.

| Model | Training Pipeline | HF Link | ModelScope Link |
| --- | --- | --- | --- |
| InternVL3.5-1B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-1B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-1B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-1B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |
| InternVL3.5-2B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-2B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-2B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-2B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |
| InternVL3.5-4B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-4B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-4B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-4B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |
| InternVL3.5-8B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-8B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-8B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-8B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |
| InternVL3.5-14B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-14B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-14B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-14B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |
| InternVL3.5-30B-A3B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-30B-A3B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-30B-A3B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-30B-A3B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |
| InternVL3.5-38B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-38B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-38B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-38B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |
| InternVL3.5-241B-A28B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-241B-A28B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-241B-A28B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-241B-A28B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |

The Flash version of our model will be released as soon as possible.

`InternVL3.5`: This series of models follows the "ViT–MLP–LLM" paradigm adopted in previous versions of InternVL. We initialize the language model using the Qwen3 series and GPT-OSS, and the vision encoder using InternViT-300M and InternViT-6B. The Dynamic High Resolution strategy introduced in InternVL1.5 is also retained in our design.

`InternVL3.5-Flash`: Compared to InternVL3.5, InternVL3.5-Flash further integrates the Visual Resolution Router (ViR), thus yielding a series of efficient variants suitable for resource-constrained scenarios.
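The patch-aware compression behind InternVL3.5-Flash (each 1024-token patch routed to one of two pixel-shuffle rates, 256 or 64 tokens, as detailed below) can be sketched in plain Python. Both functions and the richness threshold are hypothetical illustrations, not the model's implementation:

```python
def pixel_shuffle_tokens(tokens, grid, factor):
    """Merge factor x factor neighborhoods of visual tokens into one token by
    channel concatenation, reducing the token count by factor**2.
    `tokens` is a row-major list of per-token feature lists on a grid x grid map."""
    merged = []
    for r in range(0, grid, factor):
        for c in range(0, grid, factor):
            feat = []
            for dr in range(factor):
                for dc in range(factor):
                    feat.extend(tokens[(r + dr) * grid + (c + dc)])
            merged.append(feat)
    return merged

def route_patch(tokens, grid, richness, threshold=0.5):
    """Hypothetical router: semantically rich patches keep the lower
    compression (factor 2, e.g. 1024 -> 256 tokens), while simpler patches
    take the higher compression (factor 4, e.g. 1024 -> 64 tokens)."""
    factor = 2 if richness >= threshold else 4
    return pixel_shuffle_tokens(tokens, grid, factor)
```

On a 32x32 token patch, factor 2 yields 256 tokens and factor 4 yields 64, matching the two rates the ViR chooses between.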
Specifically, in InternVL3.5, each image patch is initially represented as 1024 visual tokens for the vision encoder, which are then compressed into 256 tokens via a pixel shuffle module before being passed to the Large Language Model (LLM). In InternVL3.5-Flash, as shown in the Figure below, an additional pixel shuffle module with a higher compression rate is included, enabling the compression of visual tokens down to 64 tokens. For each patch, the patch router determines the appropriate compression rate by assessing its semantic richness, and routes it to the corresponding pixel shuffle module accordingly. Benefiting from this patch-aware compression mechanism, InternVL3.5-Flash is able to reduce the number of visual tokens by 50\% while maintaining nearly 100\% of the performance of InternVL3.5.

During the pre-training stage, we update all model parameters jointly using the combination of large-scale text and multimodal corpora. Specifically, given an arbitrary training sample consisting of a multimodal token sequence \\(\mathbf{x}=\left(x_1, x_2, \ldots, x_L\right)\\), the next token prediction (NTP) loss is calculated on each text token as follows: $$ \mathcal{L}_{i}=-\log p_{\theta}\left(x_i \mid x_1, \ldots, x_{i-1}\right), $$ where \\(x_i\\) is the predicted token and the prefix tokens in \\(\{x_1, x_2, \ldots, x_{i-1}\}\\) can be either text tokens or image tokens. Notably, for conversation samples, only response tokens are included in the calculation of the loss.

Additionally, to mitigate bias toward either longer or shorter responses during training, we adopt square-root averaging to re-weight the NTP loss as follows: $$ \mathcal{L}_{i}^{'} = \frac{w_{i}}{\sum_{j} w_{j}} \cdot \mathcal{L}_{i}, \quad w_{i} = \frac{1}{N^{0.5}}, $$ where \\(N\\) denotes the number of tokens in the training sample on which the loss needs to be calculated. Random JPEG compression is also applied to enhance the model's real-world performance.
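The square-root re-weighting formula above can be illustrated numerically; `squareroot_token_weights` is a hypothetical helper, not the training code:

```python
def squareroot_token_weights(sample_lengths):
    """Per-token weights w_i = N ** -0.5, where N is the number of supervised
    tokens in that token's sample, normalized so the batch weights sum to 1.
    Tokens from long samples are down-weighted relative to plain averaging,
    mitigating the bias toward longer responses."""
    raw = [n ** -0.5 for n in sample_lengths for _ in range(n)]
    total = sum(raw)
    return [w / total for w in raw]
```

For a batch with a 1-token and a 4-token sample, plain token averaging gives the short sample a 1/5 share of the loss, whereas square-root averaging raises it to 1/3.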
During the SFT phase, we adopt the same objective as in the pre-training stage and use the square-root averaging strategy to calculate the final loss. In this stage, the context window is set to 32K tokens to accommodate long-context information.

Compared to InternVL3, the SFT stage of InternVL3.5 contains more high-quality and diverse training data derived from three sources: (1) Instruction-following data from InternVL3, which are reused to preserve broad coverage of vision–language tasks. (2) Multimodal reasoning data in the "Thinking" mode, which are included to instill long-thinking capabilities in the model. To construct such data, we first use InternVL3-78B to describe the image and then input the description into DeepSeek-R1 to sample rollouts with detailed reasoning processes. Rollouts with an incorrect final answer are filtered out. The questions in these datasets cover various expert domains, such as mathematics and scientific disciplines, thereby strengthening performance on different reasoning tasks. (3) Capability-expansion datasets, which endow InternVL3.5 with new skills, including GUI-based interaction, embodied interaction, and scalable vector graphics (SVG) understanding.

Cascade RL aims to combine the benefits of offline RL and online RL to progressively facilitate the post-training of MLLMs in an efficient manner. Specifically, we first fine-tune the model with an offline RL algorithm as an efficient warm-up stage to reach satisfactory results, which guarantees high-quality rollouts for the subsequent stage. We then employ an online RL algorithm to further refine the output distribution based on rollouts generated by the model itself. Compared to a single offline or online RL stage, our cascaded RL achieves significant performance improvements at a fraction of the GPU time cost.

During the offline RL stage, we employ mixed preference optimization (MPO) to fine-tune the model.
Specifically, the training objective of MPO is a combination of preference loss \\(\mathcal{L}_{p}\\), quality loss \\(\mathcal{L}_{q}\\), and generation loss \\(\mathcal{L}_{g}\\), which can be formulated as follows: $$ \mathcal{L}_{\text{MPO}} = w_{p} \mathcal{L}_{p} + w_{q} \mathcal{L}_{q} + w_{g} \mathcal{L}_{g}, $$ where \\(w_{*}\\) represents the weight assigned to each loss component. The DPO loss, BCO loss, and LM loss serve as the preference loss, quality loss, and generation loss, respectively.

During the online RL stage, we employ GSPO, without reference model constraints, as our online RL algorithm, which we find more effective for training both dense and mixture-of-experts (MoE) models. As in GRPO, the advantage is defined as the normalized reward across responses sampled from the same query. The training objective of GSPO is given by: $$ \mathcal{L}_{\mathrm{GSPO}}(\theta)=\mathbb{E}_{x \sim \mathcal{D},\,\left\{y_i\right\}_{i=1}^G \sim \pi_{\theta_{\text{old}}}(\cdot \mid x)}\left[\frac{1}{G} \sum_{i=1}^G \min \left(s_i(\theta) \widehat{A}_i, \operatorname{clip}\left(s_i(\theta), 1-\varepsilon, 1+\varepsilon\right) \widehat{A}_i\right)\right], $$ where the importance sampling ratio \\(s_i(\theta)\\) is defined as the geometric mean of the per-token ratios.

> Please see our paper for more technical and experimental details.

We further include ViCO as an additional training stage to integrate the Visual Resolution Router (ViR) into InternVL3.5, thereby reducing its inference cost. The resulting efficient versions of InternVL3.5 are termed InternVL3.5-Flash. In particular, ViCO comprises two stages:

`Consistency training`: In this stage, the entire model is trained to minimize the divergence between response distributions conditioned on visual tokens with different compression rates. In practice, we introduce an extra reference model, which is frozen and initialized with InternVL3.5.
Given a sample, each image patch is represented as either 256 or 64 tokens, and the training objective is defined as follows: $$ \mathcal{L}_{\text{ViCO}} = \mathbb{E}_{\xi \sim \mathcal{R}} \left[ \frac{1}{N} \sum_{i=1}^{N} \mathrm{KL} \Big( \pi_{\theta_{\text{ref}}}\left(y_i \mid y_{<i}, I\right) \,\Big\|\, \pi_{\theta}\left(y_i \mid y_{<i}, I_{\xi}\right) \Big) \right], $$ where \\(\xi\\) is the compression rate sampled from \\(\mathcal{R}\\): the frozen reference policy \\(\pi_{\theta_{\text{ref}}}\\) is conditioned on the uncompressed visual tokens \\(I\\), while the policy \\(\pi_{\theta}\\) is conditioned on the visual tokens \\(I_{\xi}\\) compressed at rate \\(\xi\\).

`Router training`: In this stage, only the ViR is trained to select the appropriate compression rate for each image patch, while the rest of the model is kept frozen.

> Please see our paper for more technical and experimental details.

Test-time scaling (TTS) has been empirically demonstrated as an effective approach to enhance the reasoning capabilities of LLMs and MLLMs, particularly for complex tasks necessitating multi-step inference. In this work, we implement a comprehensive test-time scaling approach that simultaneously improves reasoning depth (i.e., deep thinking) and breadth (i.e., parallel thinking).

`Deep Thinking`: By activating the Thinking mode, we guide the model to deliberately engage in step-by-step reasoning (i.e., decomposing complex problems into logical steps and validating intermediate conclusions) prior to generating the final answer. This approach systematically improves the logical structure of solutions for complex problems, particularly those requiring multi-step inference, and enhances reasoning depth.

`Parallel Thinking`: Following InternVL3, for reasoning tasks, we adopt the Best-of-N (BoN) strategy, employing VisualPRM-v1.1 as the critic model to select the optimal response from multiple reasoning candidates. This approach improves reasoning breadth.

> Notably, unless otherwise specified, the experimental results reported in our paper are obtained without applying TTS. Thus far, we have only applied TTS to reasoning benchmarks, since we found that the model already exhibits strong perception and understanding capabilities, and initiating TTS yields no significant improvement.

In multimodal inference, the vision encoder and language model have distinct computational characteristics. The vision encoder, which transforms images into semantic features, is highly parallelizable and does not rely on long-term history states.
In contrast, the language model performs inference in an autoregressive manner, which requires previous states to compute the next token. This sequential property makes the language part more sensitive to memory bandwidth and latency. When MLLMs are deployed online at scale, the vision and language models often block each other, thus incurring additional inference cost. This effect becomes more pronounced with larger vision models or higher-resolution images.

As shown in the Figure above, we propose decoupled vision-language deployment (DvD) to address this issue by separating vision and language processing, with a particular focus on optimizing the prefilling stage. The vision subsystem batches and processes images to produce compact feature embeddings, which are then transmitted to the language subsystem for fusion with the text context prior to decoding. This separation alleviates blocking and brings multimodal prefilling performance closer to that of pure language models.

In our system implementation, the ViT and MLP (and ViR for InternVL3.5-Flash) are deployed on the vision server, while the language server executes only the LLM. The communication is unidirectional, transmitting BF16 visual features over TCP, with RDMA optionally employed to achieve higher transmission speed. Vision processing, feature transmission, and language processing are organized into an asynchronous three-stage pipeline, enabling overlapped execution and minimizing pipeline stalls.

DvD increases GPU utilization and processing efficiency on the vision side, while enabling the language server to focus exclusively on the LLM's prefilling and decoding without being blocked by vision computation. This design leads to improved throughput and responsiveness. Moreover, the architecture supports independent hardware cost optimization for the vision and language modules, and facilitates the seamless integration of new modules without requiring modifications to the language server deployment.
Multi-Image Understanding & Real-World Comprehension

Comprehensive Multimodal Understanding & Multimodal Hallucination Evaluation

We provide example code to run `InternVL3.5-8B` using `transformers`. Please note that our models with up to 30B parameters can be deployed on a single A100 GPU, while the 38B model requires two A100 GPUs and the 235B model requires eight A100 GPUs.

> In most cases, both LMDeploy and vLLM can be used for model deployment. However, for InternVL3.5-20B-A4B, we recommend using vLLM, since LMDeploy does not yet support GPT-OSS.

> Please use transformers>=4.52.1 to ensure the model works normally. For the 20B version of our model, transformers>=4.55.0 is required.

To enable thinking mode, please set the system prompt to our Thinking System Prompt. When enabling Thinking mode, we recommend setting `do_sample=True` and `temperature=0.6` to mitigate undesired repetition. Besides this method, you can also use the following code to get streamed output.

Many repositories now support fine-tuning of the InternVL series models, including InternVL, SWIFT, XTuner, and others. Please refer to their documentation for more details on fine-tuning.

LMDeploy is a toolkit for compressing, deploying, and serving LLMs & VLMs. It abstracts the complex inference process of multimodal vision-language models (VLMs) into an easy-to-use pipeline, similar to the large language model (LLM) inference pipeline.

When dealing with multiple images, you can put them all in one list. Keep in mind that multiple images will lead to a higher number of input tokens, and as a result, the size of the context window typically needs to be increased. Conducting inference with batch prompts is quite straightforward; just place them within a list structure.

There are two ways to conduct multi-turn conversations with the pipeline. One is to construct messages according to the OpenAI format and use the method introduced above; the other is to use the `pipeline.chat` interface.
LMDeploy's `api_server` enables models to be easily packed into services with a single command. The provided RESTful APIs are compatible with OpenAI's interfaces. Below is an example of service startup. To use the OpenAI-style interface, you need to install the OpenAI Python package.

This project is released under the Apache-2.0 License. This project uses the pre-trained Qwen3 as a component, which is licensed under the Apache-2.0 License.

If you find this project useful in your research, please consider citing:

license:apache-2.0

InternVL3_5-1B

[\[πŸ“‚ GitHub\]](https://github.com/OpenGVLab/InternVL) [\[πŸ“œ InternVL 1.0\]](https://huggingface.co/papers/2312.14238) [\[πŸ“œ InternVL 1.5\]](https://huggingface.co/papers/2404.16821) [\[πŸ“œ InternVL 2.5\]](https://huggingface.co/papers/2412.05271) [\[πŸ“œ InternVL2.5-MPO\]](https://huggingface.co/papers/2411.10442) [\[πŸ“œ InternVL3\]](https://huggingface.co/papers/2504.10479) [\[πŸ“œ InternVL3.5\]](https://huggingface.co/papers/2508.18265) [\[πŸ†• Blog\]](https://internvl.github.io/blog/) [\[πŸ—¨οΈ Chat Demo\]](https://chat.intern-ai.org.cn/) [\[πŸš€ Quick Start\]](#quick-start) [\[πŸ“– Documents\]](https://internvl.readthedocs.io/en/latest/) We introduce InternVL3.5, a new family of open-source multimodal models that significantly advances versatility, reasoning capability, and inference efficiency along the InternVL series. A key innovation is the Cascade Reinforcement Learning (Cascade RL) framework, which enhances reasoning through a two-stage process: offline RL for stable convergence and online RL for refined alignment. This coarse-to-fine training strategy leads to substantial improvements on downstream reasoning tasks, e.g., MMMU and MathVista. To optimize efficiency, we propose a Visual Resolution Router (ViR) that dynamically adjusts the resolution of visual tokens without compromising performance. Coupled with ViR, our Decoupled Vision-Language Deployment (DvD) strategy separates the vision encoder and language model across different GPUs, effectively balancing computational load. These contributions collectively enable InternVL3.5 to achieve up to a +16.0\% gain in overall reasoning performance and a 4.05 \\(\times\\) inference speedup compared to its predecessor, i.e., InternVL3. In addition, InternVL3.5 supports novel capabilities such as GUI interaction and embodied agency. 
Notably, our largest model, i.e., InternVL3.5-241B-A28B, attains state-of-the-art results among open-source MLLMs across general multimodal, reasoning, text, and agentic tasks, narrowing the performance gap with leading commercial models like GPT-5. All models and code are publicly released.

> Hatched bars represent closed-source commercial models. We report average scores on a set of multimodal general, reasoning, text, and agentic benchmarks: MMBench v1.1 (en), MMStar, BLINK, HallusionBench, AI2D, OCRBench, MMVet, MME-RealWorld (en), MVBench, VideoMME, MMMU, MathVista, MathVision, MathVerse, DynaMath, WeMath, LogicVista, MATH500, AIME24, AIME25, GPQA, MMLU-Pro, GAOKAO, IFEval, SGP-Bench, VSI-Bench, ERQA, SpaCE-10, and OmniSpatial.

In the following table, we provide an overview of the InternVL3.5 series. To maintain consistency with earlier generations, we provide two model formats: the GitHub format, consistent with prior releases, and the HF format, aligned with the official Transformers standard.

> If you want to convert the checkpoint between these two formats, please refer to the scripts about custom2hf and hf2custom.
| Model | #Vision Param | #Language Param | #Total Param | HF Link | ModelScope Link |
| --- | --- | --- | --- | --- | --- |
| InternVL3.5-1B | 0.3B | 0.8B | 1.1B | 🤗 link | 🤖 link |
| InternVL3.5-2B | 0.3B | 2.0B | 2.3B | 🤗 link | 🤖 link |
| InternVL3.5-4B | 0.3B | 4.4B | 4.7B | 🤗 link | 🤖 link |
| InternVL3.5-8B | 0.3B | 8.2B | 8.5B | 🤗 link | 🤖 link |
| InternVL3.5-14B | 0.3B | 14.8B | 15.1B | 🤗 link | 🤖 link |
| InternVL3.5-38B | 5.5B | 32.8B | 38.4B | 🤗 link | 🤖 link |
| InternVL3.5-20B-A4B | 0.3B | 20.9B | 21.2B-A4B | 🤗 link | 🤖 link |
| InternVL3.5-30B-A3B | 0.3B | 30.5B | 30.8B-A3B | 🤗 link | 🤖 link |
| InternVL3.5-241B-A28B | 5.5B | 235.1B | 240.7B-A28B | 🤗 link | 🤖 link |

| Model | #Vision Param | #Language Param | #Total Param | HF Link | ModelScope Link |
| --- | --- | --- | --- | --- | --- |
| InternVL3.5-1B-HF | 0.3B | 0.8B | 1.1B | 🤗 link | 🤖 link |
| InternVL3.5-2B-HF | 0.3B | 2.0B | 2.3B | 🤗 link | 🤖 link |
| InternVL3.5-4B-HF | 0.3B | 4.4B | 4.7B | 🤗 link | 🤖 link |
| InternVL3.5-8B-HF | 0.3B | 8.2B | 8.5B | 🤗 link | 🤖 link |
| InternVL3.5-14B-HF | 0.3B | 14.8B | 15.1B | 🤗 link | 🤖 link |
| InternVL3.5-38B-HF | 5.5B | 32.8B | 38.4B | 🤗 link | 🤖 link |
| InternVL3.5-20B-A4B-HF | 0.3B | 20.9B | 21.2B-A4B | 🤗 link | 🤖 link |
| InternVL3.5-30B-A3B-HF | 0.3B | 30.5B | 30.8B-A3B | 🤗 link | 🤖 link |
| InternVL3.5-241B-A28B-HF | 5.5B | 235.1B | 240.7B-A28B | 🤗 link | 🤖 link |

> We conduct the evaluation with VLMEvalKit.
To enable the Thinking mode of our model, please set the system prompt to R1SYSTEMPROMPT. When enabling Thinking mode, we recommend setting `dosample=True` and `temperature=0.6` to mitigate undesired repetition. Our training pipeline comprises four stages: Multimodal Continual Pre-Training (CPT), Supervised Fine-Tuning (SFT), and Cascade Reinforcement Learning (CascadeRL). In CascadeRL, we first fine-tune the model using Mixed Preference Optimization (MPO) under an offline RL setting, followed by GSPO under an oneline RL setting. For the Flash version of InternVL3.5, we additionally introduce a lightweight training stage, termed Visual Consistency Learning (ViCO), which reduces the token cost required to represent an image patch. Here, we also open-source the model weights after different training stages for potential research usage. If you're unsure which version to use, please select the one without any suffix, as it has completed the full training pipeline. | Model | Training Pipeline | HF Link | ModelScope Link | | -------------------------------- | --------------------- | --------------------------------------------------------------------------- | ------------------------------------------------------------------------------------- | | InternVL3.5-1B-Pretrained | CPT | πŸ€— link | πŸ€– link | | InternVL3.5-1B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link | | InternVL3.5-1B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link | | InternVL3.5-1B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link | | InternVL3.5-2B-Pretrained | CPT | πŸ€— link | πŸ€– link | | InternVL3.5-2B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link | | InternVL3.5-2B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link | | InternVL3.5-2B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link | | InternVL3.5-4B-Pretrained | CPT | πŸ€— link | πŸ€– link | | InternVL3.5-4B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link | | InternVL3.5-4B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link | | InternVL3.5-4B | CPT + SFT + 
CascadeRL | πŸ€— link | πŸ€– link | | InternVL3.5-8B-Pretrained | CPT | πŸ€— link | πŸ€– link | | InternVL3.5-8B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link | | InternVL3.5-8B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link | | InternVL3.5-8B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link | | InternVL3.5-14B-Pretrained | CPT | πŸ€— link | πŸ€– link | | InternVL3.5-14B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link | | InternVL3.5-14B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link | | InternVL3.5-14B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link | | InternVL3.5-30B-A3B-Pretrained | CPT | πŸ€— link | πŸ€– link | | InternVL3.5-30B-A3B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link | | InternVL3.5-30B-A3B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link | | InternVL3.5-30B-A3B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link | | InternVL3.5-38B-Pretrained | CPT | πŸ€— link | πŸ€– link | | InternVL3.5-38B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link | | InternVL3.5-38B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link | | InternVL3.5-38B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link | | InternVL3.5-241B-A28B-Pretrained | CPT | πŸ€— link | πŸ€– link | | InternVL3.5-241B-A28B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link | | InternVL3.5-241B-A28B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link | | InternVL3.5-241B-A28B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link | The Flash version of our model will be released as soon as possible. `InternVL3.5`: This series of models follows the "ViT–MLP–LLM" paradigm adopted in previous versions of InternVL. We initialize the language model using the Qwen3 series and GPT-OSS, and the vision encoder using InternViT-300M and InternViT-6B. The Dynamic High Resolution strategy introduced in InternVL1.5 is also retained in our design. `InternVL3.5-Flash`: Compared to InternVL3.5, InternVL3.5-Flash further integrates the Visual Resolution Router (ViR), thus yielding a series of efficient variants suitable for resource-constrained scenarios. 
Specifically, in InternVL3.5, each image patch is initially represented as 1024 visual tokens by the vision encoder, which are then compressed into 256 tokens via a pixel shuffle module before being passed to the Large Language Model (LLM). In InternVL3.5-Flash, as shown in the Figure below, an additional pixel shuffle module with a higher compression rate is included, enabling the compression of visual tokens down to 64 tokens. For each patch, the patch router determines the appropriate compression rate by assessing its semantic richness, and routes it to the corresponding pixel shuffle module accordingly. Benefiting from this patch-aware compression mechanism, InternVL3.5-Flash is able to reduce the number of visual tokens by 50\% while maintaining nearly 100\% of the performance of InternVL3.5. During the pre-training stage, we update all model parameters jointly using a combination of large-scale text and multimodal corpora. Specifically, given an arbitrary training sample consisting of a multimodal token sequence \\(\mathbf{x}=\left(x_1, x_2, \ldots, x_L\right)\\), the next token prediction (NTP) loss is calculated on each text token as follows: $$ \mathcal{L}_{i}=-\log p_{\theta}\left(x_i \mid x_1, \ldots, x_{i-1}\right), $$ where \\(x_i\\) is the predicted token and the prefix tokens \\(\{x_1, x_2, \ldots, x_{i-1}\}\\) can be either text tokens or image tokens. Notably, for conversation samples, only response tokens are included in the loss calculation. Additionally, to mitigate bias toward either longer or shorter responses during training, we adopt square-root averaging to re-weight the NTP loss as follows: $$ \mathcal{L}_{i}^{'} = \frac{w_i}{\sum_j w_j} \cdot \mathcal{L}_i, \quad w_i = \frac{1}{N^{0.5}}, $$ where \\(N\\) denotes the number of tokens in the training sample on which the loss is calculated. Random JPEG compression is also applied during training to enhance the model's robustness to real-world images. 
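As a concrete illustration, the square-root re-weighting above can be sketched in a few lines of pure Python (function and variable names are ours, not from the InternVL codebase):

```python
import math

def reweight_ntp_losses(token_losses, sample_sizes):
    """Re-weight per-token NTP losses with square-root averaging.

    token_losses: flat list of per-token losses over a batch, one entry
        per supervised token.
    sample_sizes: for each token, N = the number of supervised tokens in
        the sample it belongs to.
    Each token receives weight w_i = 1 / N**0.5, normalized over the batch.
    """
    weights = [1.0 / math.sqrt(n) for n in sample_sizes]
    total = sum(weights)
    return sum(w / total * l for w, l in zip(weights, token_losses))
```

With this weighting, a sample of N tokens contributes in proportion to \\(\sqrt{N}\\) rather than \\(N\\), so long responses do not dominate the batch loss; when all samples have the same length, the result reduces to the plain mean.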
During the SFT phase, we adopt the same objective as in the pre-training stage and use the square-root averaging strategy to calculate the final loss. In this stage, the context window is set to 32K tokens to accommodate long-context information. Compared to InternVL3, the SFT stage of InternVL3.5 contains more high-quality and diverse training data derived from three sources: (1) Instruction-following data from InternVL3, which are reused to preserve broad coverage of vision–language tasks. (2) Multimodal reasoning data in the "Thinking" mode, which are included to instill long-thinking capabilities in the model. To construct such data, we first use InternVL3-78B to describe the image and then input the description into DeepSeek-R1 to sample rollouts with detailed reasoning processes. Rollouts with an incorrect final answer are filtered out. The questions in these datasets cover various expert domains, such as mathematics and scientific disciplines, thereby strengthening performance on different reasoning tasks. (3) Capability-expansion datasets, which endow InternVL3.5 with new skills, including GUI-based interaction, embodied interaction, and scalable vector graphics (SVG) understanding and generation. Cascade RL aims to combine the benefits of offline RL and online RL to progressively facilitate the post-training of MLLMs in an efficient manner. Specifically, we first fine-tune the model with an offline RL algorithm as an efficient warm-up stage to reach satisfactory results, which guarantees high-quality rollouts for the subsequent stage. We then employ an online RL algorithm to further refine the output distribution based on rollouts generated by the model itself. Compared to a single offline or online RL stage, our cascaded RL achieves significant performance improvements at a fraction of the GPU time cost. During the offline RL stage, we employ mixed preference optimization (MPO) to fine-tune the model. 
Specifically, the training objective of MPO is a combination of preference loss \\(\mathcal{L}_{p}\\), quality loss \\(\mathcal{L}_{q}\\), and generation loss \\(\mathcal{L}_{g}\\), which can be formulated as follows: $$ \mathcal{L}_{\text{MPO}}= w_{p} \mathcal{L}_{p} + w_{q} \mathcal{L}_{q} + w_{g} \mathcal{L}_{g}, $$ where \\(w_{*}\\) represents the weight assigned to each loss component. The DPO loss, BCO loss, and LM loss serve as the preference loss, quality loss, and generation loss, respectively. During the online RL stage, we employ GSPO, without reference model constraints, as our online RL algorithm, which we find more effective for training both dense and mixture-of-experts (MoE) models. Similar to GRPO, the advantage is defined as the normalized reward across responses sampled from the same query. The training objective of GSPO is given by: $$ \mathcal{L}_{\mathrm{GSPO}}(\theta)=\mathbb{E}_{x \sim \mathcal{D},\,\left\{y_i\right\}_{i=1}^G \sim \pi_{\theta_{\text{old}}}(\cdot \mid x)}\left[\frac{1}{G} \sum_{i=1}^G \min \left(s_i(\theta) \widehat{A}_i,\ \operatorname{clip}\left(s_i(\theta), 1-\varepsilon, 1+\varepsilon\right) \widehat{A}_i\right)\right], $$ where the importance sampling ratio \\(s_i(\theta)\\) is defined as the geometric mean of the per-token ratios. > Please see our paper for more technical and experimental details. We further include ViCO as an additional training stage to integrate the visual resolution router (ViR) into InternVL3.5, thereby reducing its inference cost. The resulting efficient version of InternVL3.5 is termed InternVL3.5-Flash. In particular, ViCO comprises two stages: `Consistency training`: In this stage, the entire model is trained to minimize the divergence between response distributions conditioned on visual tokens with different compression rates. In practice, we introduce an extra reference model, which is frozen and initialized with InternVL3.5. 
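The GSPO objective above can be sketched in pure Python; this is a minimal illustration with names of our own choosing, omitting batching and gradient computation:

```python
import math

def sequence_ratio(new_logprobs, old_logprobs):
    """s_i(theta): sequence-level importance ratio, the geometric mean of
    per-token ratios, i.e. exp of the mean log-ratio over response tokens."""
    diffs = [n - o for n, o in zip(new_logprobs, old_logprobs)]
    return math.exp(sum(diffs) / len(diffs))

def normalized_advantages(rewards):
    """Group-normalized advantages, as in GRPO: (r - mean) / std."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + 1e-8) for r in rewards]

def gspo_objective(ratios, advantages, eps=0.2):
    """Clipped surrogate, averaged over the G responses of one query."""
    total = 0.0
    for s, a in zip(ratios, advantages):
        clipped = min(max(s, 1.0 - eps), 1.0 + eps)
        total += min(s * a, clipped * a)
    return total / len(ratios)
```

Note that, unlike GRPO's per-token ratios, the single sequence-level ratio applies one clipping decision to the whole response, which is what makes the update less noisy for MoE models.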
Given a sample, each image patch is represented as either 256 or 64 tokens, and the training objective is defined as follows: $$ \mathcal{L}_{\text{ViCO}} = \mathbb{E}_{\xi \sim \mathcal{R}} \left[ \frac{1}{N} \sum_{i=1}^{N} \mathrm{KL} \Big( \pi_{\theta_{\text{ref}}}\left(y_i \mid y_{<i}\right) \,\Big\|\, \pi_{\theta}\left(y_i \mid y_{<i}, \xi\right) \Big) \right], $$ where \\(\mathcal{R}\\) denotes the set of candidate compression rates, \\(\xi\\) is the compression rate sampled for the patches, and \\(N\\) is the number of response tokens. `Router training`: In this stage, only the visual resolution router is trained to select the appropriate compression rate for each patch, while the rest of the model is kept frozen. > Please see our paper for more technical and experimental details. Test-time scaling (TTS) has been empirically demonstrated as an effective approach to enhance the reasoning capabilities of LLMs and MLLMs, particularly for complex tasks necessitating multi-step inference. In this work, we implement a comprehensive test-time scaling approach that simultaneously improves reasoning depth (i.e., deep thinking) and breadth (i.e., parallel thinking). `Deep Thinking`: By activating the Thinking mode, we guide the model to deliberately engage in step-by-step reasoning (i.e., decomposing complex problems into logical steps and validating intermediate conclusions) prior to generating the final answer. This approach systematically improves the logical structure of solutions for complex problems, particularly those requiring multi-step inference, and enhances reasoning depth. `Parallel Thinking`: Following InternVL3, for reasoning tasks, we adopt the Best-of-N (BoN) strategy by employing VisualPRM-v1.1 as the critic model to select the optimal response from multiple reasoning candidates. This approach improves reasoning breadth. > Notably, unless otherwise specified, the experimental results reported in our paper are obtained without applying TTS. Thus far, we have only applied TTS to reasoning benchmarks, since we found that the model already exhibits strong perception and understanding capabilities, and applying TTS yields no significant improvement. In multimodal inference, the vision encoder and language model have distinct computational characteristics. The vision encoder, which transforms images into semantic features, is highly parallelizable and does not rely on long-term history states. 
In contrast, the language model performs inference in an autoregressive manner, which requires previous states to compute the next one. This sequential property makes the language part more sensitive to memory bandwidth and latency. When MLLMs are deployed online at scale, the vision and language models often block each other, thus incurring additional inference cost. This effect becomes more pronounced with larger vision models or higher-resolution images. As shown in the Figure above, we propose decoupled vision-language deployment (DvD) to address this issue by separating vision and language processing, with a particular focus on optimizing the prefilling stage. The vision subsystem batches and processes images to produce compact feature embeddings, which are then transmitted to the language subsystem and fused with the text context prior to decoding. This separation alleviates blocking and brings multimodal prefilling performance closer to that of pure language models. In our system implementation, the ViT and MLP (and ViR for InternVL3.5-Flash) are deployed on the vision server, while the language server executes only the LLM. Communication is unidirectional, transmitting BF16 visual features over TCP, with RDMA optionally employed for higher transmission speed. Vision processing, feature transmission, and language processing are organized into an asynchronous three-stage pipeline, enabling overlapped execution and minimizing pipeline stalls. DvD increases GPU utilization and processing efficiency on the vision side, while enabling the language server to focus exclusively on the LLM's prefilling and decoding without being blocked by vision computation. This design leads to improved throughput and responsiveness. Moreover, the architecture supports independent hardware cost optimization for the vision and language modules, and facilitates the seamless integration of new modules without requiring modifications to the language server deployment. 
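To make the asynchronous three-stage pipeline concrete, here is a toy sketch in pure Python, using threads and queues as stand-ins for the vision server, the TCP/RDMA link, and the language server. All names are illustrative, and the stage callables are placeholders for the real components:

```python
import queue
import threading

def run_dvd_pipeline(requests, encode_image, transmit, prefill_decode):
    """Toy DvD pipeline: vision encoding, feature transmission, and
    language prefill/decode run in separate workers connected by queues,
    so the stages overlap instead of blocking one another."""
    q_feat, q_lang, results = queue.Queue(), queue.Queue(), []

    def vision_worker():
        for req in requests:
            q_feat.put((req, encode_image(req)))
        q_feat.put(None)  # sentinel: no more work

    def transmit_worker():
        while (item := q_feat.get()) is not None:
            req, feat = item
            q_lang.put((req, transmit(feat)))
        q_lang.put(None)

    def language_worker():
        while (item := q_lang.get()) is not None:
            req, feat = item
            results.append(prefill_decode(req, feat))

    workers = [threading.Thread(target=w)
               for w in (vision_worker, transmit_worker, language_worker)]
    for t in workers:
        t.start()
    for t in workers:
        t.join()
    return results
```

Because each stage has its own worker and a FIFO queue, request *k*'s language prefill can run while request *k+1* is still being encoded, which is exactly the overlap that hides the vision cost in the real deployment.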
Multi-Image Understanding & Real-World Comprehension Comprehensive Multimodal Understanding & Multimodal Hallucination Evaluation We provide example code to run `InternVL3.5-8B` using `transformers`. Please note that our models with up to 30B parameters can be deployed on a single A100 GPU, while the 38B model requires two A100 GPUs and the 235B model requires eight A100 GPUs. > In most cases, both LMDeploy and vLLM can be used for model deployment. However, for InternVL3.5-20B-A4B, we recommend using vLLM, since LMDeploy does not yet support GPT-OSS. > Please use transformers>=4.52.1 to ensure the model works normally. For the 20B version of our model, transformers>=4.55.0 is required. To enable Thinking mode, please set the system prompt to our Thinking System Prompt. When Thinking mode is enabled, we recommend setting `do_sample=True` and `temperature=0.6` to mitigate undesired repetition. Besides this method, you can also use the following code to get streamed output. Many repositories now support fine-tuning of the InternVL series models, including InternVL, SWIFT, XTuner, and others. Please refer to their documentation for more details on fine-tuning. LMDeploy is a toolkit for compressing, deploying, and serving LLMs & VLMs. LMDeploy abstracts the complex inference process of multi-modal Vision-Language Models (VLM) into an easy-to-use pipeline, similar to the Large Language Model (LLM) inference pipeline. When dealing with multiple images, you can put them all in one list. Keep in mind that multiple images will lead to a higher number of input tokens, and as a result, the size of the context window typically needs to be increased. Conducting inference with batch prompts is quite straightforward; just place them within a list structure: There are two ways to do multi-turn conversations with the pipeline. One is to construct messages according to the OpenAI format and use the method introduced above; the other is to use the `pipeline.chat` interface. 
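The Thinking-mode settings recommended in this section (the R1 system prompt plus `do_sample=True` and `temperature=0.6`) can be sketched as a small helper. The prompt string below is a placeholder, not the actual prompt, which ships with the InternVL repository, and the helper itself is our illustration rather than a repository API:

```python
# Placeholder: substitute the actual R1 system prompt from the InternVL repo.
R1_SYSTEM_PROMPT = "<thinking system prompt from the InternVL repository>"

def build_thinking_request(question):
    """Assemble OpenAI-style chat messages plus the sampling settings
    recommended for Thinking mode (do_sample=True, temperature=0.6)."""
    messages = [
        {"role": "system", "content": R1_SYSTEM_PROMPT},
        {"role": "user", "content": question},
    ]
    generation_config = {"do_sample": True, "temperature": 0.6}
    return messages, generation_config
```

The returned messages and sampling settings can then be passed to whichever backend you deploy with, e.g. the `transformers` chat interface or an LMDeploy pipeline.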
LMDeploy's `api_server` enables models to be easily packed into services with a single command. The provided RESTful APIs are compatible with OpenAI's interfaces. Below is an example of service startup: To use the OpenAI-style interface, you need to install the OpenAI Python package: This project is released under the Apache-2.0 License. This project uses the pre-trained Qwen3 as a component, which is licensed under the Apache-2.0 License. If you find this project useful in your research, please consider citing:


InternVL3_5-4B-Instruct

Notably, our largest model, i.e., InternVL3.5-241B-A28B, attains state-of-the-art results among open-source MLLMs across general multimodal, reasoning, text, and agentic tasksβ€”narrowing the performance gap with leading commercial models like GPT-5. All models and code are publicly released. > Hatched bars represent closed-source commercial models. We report average scores on a set of multimodal general, reasoning, text, and agentic benchmarks: MMBench v1.1 (en), MMStar,BLINK, HallusionBench, AI2D, OCRBench, MMVet, MME-RealWorld (en), MVBench, VideoMME, MMMU, MathVista, MathVision, MathVerse, DynaMath, WeMath, LogicVista, MATH500, AIME24, AIME25, GPQA, MMLU-Pro, GAOKAO, IFEval, SGP-Bench, VSI-Bench, ERQA, SpaCE-10, and OmniSpatial. In the following table, we provide an overview of the InternVL3.5 series. To maintain consistency with earlier generations, we provide two model formats: the GitHub format, consistent with prior releases, and the HF format, aligned with the official Transformers standard. > If you want to convert the checkpoint between these two formats, please refer to the scripts about custom2hf and hf2custom. 
| Model | #Vision Param | #Language Param | #Total Param | HF Link | ModelScope Link | | --------------------- | ------------- | --------------- | ------------ | ------------------------------------------------------------------------------ | ---------------------------------------------------------------------------------------- | | InternVL3.5-1B | 0.3B | 0.8B | 1.1B | πŸ€— link | πŸ€– link | | InternVL3.5-2B | 0.3B | 2.0B | 2.3B | πŸ€— link | πŸ€– link | | InternVL3.5-4B | 0.3B | 4.4B | 4.7B | πŸ€— link | πŸ€– link | | InternVL3.5-8B | 0.3B | 8.2B | 8.5B | πŸ€— link | πŸ€– link | | InternVL3.5-14B | 0.3B | 14.8B | 15.1B | πŸ€— link | πŸ€– link | | InternVL3.5-38B | 5.5B | 32.8B | 38.4B | πŸ€— link | πŸ€– link | | InternVL3.5-20B-A4B | 0.3B | 20.9B | 21.2B-A4B | πŸ€— link | πŸ€– link | | InternVL3.5-30B-A3B | 0.3B | 30.5B | 30.8B-A3B | πŸ€— link | πŸ€– link | | InternVL3.5-241B-A28B | 5.5B | 235.1B | 240.7B-A28B | πŸ€— link | πŸ€– link | | Model | #Vision Param | #Language Param | #Total Param | HF Link | ModelScope Link | | ------------------------ | ------------- | --------------- | ------------ | --------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------- | | InternVL3.5-1B-HF | 0.3B | 0.8B | 1.1B | πŸ€— link | πŸ€– link | | InternVL3.5-2B-HF | 0.3B | 2.0B | 2.3B | πŸ€— link | πŸ€– link | | InternVL3.5-4B-HF | 0.3B | 4.4B | 4.7B | πŸ€— link | πŸ€– link | | InternVL3.5-8B-HF | 0.3B | 8.2B | 8.5B | πŸ€— link | πŸ€– link | | InternVL3.5-14B-HF | 0.3B | 14.8B | 15.1B | πŸ€— link | πŸ€– link | | InternVL3.5-38B-HF | 5.5B | 32.8B | 38.4B | πŸ€— link | πŸ€– link | | InternVL3.5-20B-A4B-HF | 0.3B | 20.9B | 21.2B-A4B | πŸ€— link | πŸ€– link | | InternVL3.5-30B-A3B-HF | 0.3B | 30.5B | 30.8B-A3B | πŸ€— link | πŸ€– link | | InternVL3.5-241B-A28B-HF | 5.5B | 235.1B | 240.7B-A28B | πŸ€— link | πŸ€– link | > We conduct the evaluation with VLMEvalkit. 
To enable the Thinking mode of our model, please set the system prompt to R1SYSTEMPROMPT. When enabling Thinking mode, we recommend setting `dosample=True` and `temperature=0.6` to mitigate undesired repetition. Our training pipeline comprises four stages: Multimodal Continual Pre-Training (CPT), Supervised Fine-Tuning (SFT), and Cascade Reinforcement Learning (CascadeRL). In CascadeRL, we first fine-tune the model using Mixed Preference Optimization (MPO) under an offline RL setting, followed by GSPO under an oneline RL setting. For the Flash version of InternVL3.5, we additionally introduce a lightweight training stage, termed Visual Consistency Learning (ViCO), which reduces the token cost required to represent an image patch. Here, we also open-source the model weights after different training stages for potential research usage. If you're unsure which version to use, please select the one without any suffix, as it has completed the full training pipeline. | Model | Training Pipeline | HF Link | ModelScope Link | | -------------------------------- | --------------------- | --------------------------------------------------------------------------- | ------------------------------------------------------------------------------------- | | InternVL3.5-1B-Pretrained | CPT | πŸ€— link | πŸ€– link | | InternVL3.5-1B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link | | InternVL3.5-1B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link | | InternVL3.5-1B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link | | InternVL3.5-2B-Pretrained | CPT | πŸ€— link | πŸ€– link | | InternVL3.5-2B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link | | InternVL3.5-2B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link | | InternVL3.5-2B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link | | InternVL3.5-4B-Pretrained | CPT | πŸ€— link | πŸ€– link | | InternVL3.5-4B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link | | InternVL3.5-4B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link | | InternVL3.5-4B | CPT + SFT + 
CascadeRL | πŸ€— link | πŸ€– link | | InternVL3.5-8B-Pretrained | CPT | πŸ€— link | πŸ€– link | | InternVL3.5-8B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link | | InternVL3.5-8B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link | | InternVL3.5-8B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link | | InternVL3.5-14B-Pretrained | CPT | πŸ€— link | πŸ€– link | | InternVL3.5-14B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link | | InternVL3.5-14B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link | | InternVL3.5-14B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link | | InternVL3.5-30B-A3B-Pretrained | CPT | πŸ€— link | πŸ€– link | | InternVL3.5-30B-A3B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link | | InternVL3.5-30B-A3B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link | | InternVL3.5-30B-A3B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link | | InternVL3.5-38B-Pretrained | CPT | πŸ€— link | πŸ€– link | | InternVL3.5-38B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link | | InternVL3.5-38B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link | | InternVL3.5-38B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link | | InternVL3.5-241B-A28B-Pretrained | CPT | πŸ€— link | πŸ€– link | | InternVL3.5-241B-A28B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link | | InternVL3.5-241B-A28B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link | | InternVL3.5-241B-A28B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link | The Flash version of our model will be released as soon as possible. `InternVL3.5`: This series of models follow the "ViT–MLP–LLM" paradigm adopted in previous versions of InternVL. We initialize the language model using the Qwen3 series and GPT-OSS, and the vision encoder using InternViT-300M and InternViT-6B. The Dynamic High Resolution strategy introduced in InternVL1.5 is also retained in our design. `InternVL3.5-Flash`: Compared to InternVL3.5, InternVL3.5-Flash further integrates the Visual Resolution Router (ViR), thus yielding a series of efficient variants friendly suitable for resource-constrained scenarios. 
Specifically, in InternVL3.5, each image patch is initially represented as 1024 visual tokens for the vision encoder, which are then compressed into 256 tokens via a pixel shuffle module before being passed to the Large Language Model (LLM). In InternVL3.5-Flash, as shown in the Figure below, an additional pixel shuffle module with a higher compression rate is included, enabling the compression of visual tokens down to 64 tokens. For each patch, the patch router determines the appropriate compression rate by assessing its semantic richness, and routes it to the corresponding pixel shuffle module accordingly. Benefiting from this patch-aware compression mechanism, InternVL3.5-Flash is able to reduce the number of visual tokens by 50\% while maintaining nearly 100\% of the performance of InternVL3.5. During the pre-training stage, we update all model parameters jointly using the combination of large-scale text and multimodal corpora. Specifically, given an arbitrary training sample consisting of a multimodal token sequence \\(\mathbf{x}=\left(x1, x2, \ldots, xL\right)\\), the next token prediction (NTP) loss is calculated on each text token as follows: $$ \mathcal{L}{i}=-\log p\theta\left(xi \mid x1, \ldots, x{i-1}\right), $$ where \\(xi\\) is the predicted token and prefix tokens in \\(\{x1, x2, \ldots, x{i-1}\}\\) can be either text tokens or image tokens. Notably, for conversation samples, only response tokens are included for the calculation of the loss. Additionally, to mitigate bias toward either longer or shorter responses during training, we adopt the square averaging to re-weight the NTP loss as follows: $$ \mathcal{L}{i}^{'} = \frac{wi}{\sumj wj} \cdot \mathcal{L}i, \quad wi = \frac{1}{N^{0.5}}, $$ where \\(N\\) denotes the number of tokens in the training sample on which the loss needs to be calculated. The random JPEG compression is also included to enhance the model's real-world performance. 
During the SFT phase, we adopt the same objective as in the pre-training stage and use the square-root averaging strategy to calculate the final loss. In this stage, the context window is set to 32K tokens to adapt long-context information. Compared to InternVL3, the SFT stage of InternVL3.5 contains more high-quality and diverse training data derived from three sources: (1) Instruction-following data from InternVL3, which are reused to preserve broad coverage of vision–language tasks. (2) Multimodal reasoning data in the "Thinking" mode, which are included to instill long-thinking capabilities in the model. To construct such data, we first use InternVL3-78B to describe the image and then input the description into DeepSeek-R1 to sample rollouts with detailed reasoning processes. Rollouts with an incorrect final answer are filtered out. The questions in these datasets cover various expert domains, such as mathematics and scientific disciplines, thereby strengthening performance on different reasoning tasks. (3) Capability-expansion datasets, which endow InternVL3.5 with new skills, including GUI-based interaction, embodied interaction, and scalable vect Cascade RL aims to combine the benefits of offline RL and online RL to progressively facilitate the post-training of MLLMs in an efficient manner. Specifically, we first fine-tune the model using an offline RL algorithm as an efficient warm-up stage to reach a satisfied results, which can guarantee the high-quality rollouts for the latter stage. Subsequently, we employ an online RL algorithm to further refine the output distribution based on rollouts generated by the model itself. Compared to the single offline or online RL stage, our cascaded RL achieves significant performance improvements at a fraction of the GPU time cost. During the offline RL stage, we employ mixed preference optimization (MPO) to fine-tune the model. 
Specifically, the training objective of MPO is a combination of preference loss \\(\mathcal{L}{p}\\), quality loss \\(\mathcal{L}{q}\\), and generation loss \\(\mathcal{L}{g}\\), which can be formulated as follows: $$ \mathcal{L}{\text{MPO}}= w{p} \mathcal{L}{p} + w{q} \mathcal{L}{q} + w{g} \mathcal{L}{g} , $$ where \\(w{}\\) represents the weight assigned to each loss component. The DPO loss, BCO loss, and LM loss serve as the preference loss, quality loss, and generation loss, respectively. During the online RL stage, we employ GSPO, without reference model constraints, as our online RL algorithm, which we find more effective in training both dense and mixture-of-experts (MoE) models. Similar to GRPO, the advantage is defined as the normalized reward across responses sampled from the same query. The training objective of GSPO is given by: $$ \mathcal{L}{\mathrm{GSPO}}(\theta)=\mathbb{E}{x \sim \mathcal{D},\left\{yi\right\}{i=1}^G \sim \pi{\theta \text { old }}(\cdot \mid x)}\left[\frac{1}{G} \sum{i=1}^G \min \left(si(\theta) \widehat{A}i, \operatorname{clip}\left(si(\theta), 1-\varepsilon, 1+\varepsilon\right) \widehat{A}i\right)\right], $$ where the importance sampling ratio is defined as the geometric mean of the per-token ratios. > Please see our paper for more technical and experimental details. We further include ViCO as an additional training stage to integrate the visual resolution router (ViR) into InternVL3.5, thereby reducing the inference cost of InternVL3.5. The obtained efficient version of InternVL3.5 are termed as InternVL3.5-Flash. In particular, ViCO comprises two stages: `Consistency training`: In this stage, the entire model is trained to minimize the divergence between response distributions conditioned on visual tokens with different compression rates. In practice, we introduce an extra reference model, which is frozen and initialized with InternVL3.5. 
Given a sample, each image patch is represented as either 256 or 64 tokens, and the training objective is defined as follows: $$ \mathcal{L}\text{ViCO} = \mathbb{E}{\xi \sim \mathcal{R}} \Bigg \frac{1}{N} \sum{i=1}^{N} \mathrm{KL} \Big( \pi{\theta{ref}}\left(yi \mid y{ Please see [our paper for more technical and experimental details. Test-time scaling (TTS) has been empirically demonstrated as an effective approach to enhance the reasoning capabilities of LLMs and MLLMs, particularly for complex tasks necessitating multi-step inference. In this work, we implement a comprehensive test-time scaling approach that simultaneously improves reasoning depth (i.e., deep thinking) and breadth (i.e., parallel thinking). `Deep Thinking`: By activating the Thinking mode, we guide the model to deliberately engage in step-by-step reasoning (i.e., decomposing complex problems into logical steps and validating intermediate conclusions) prior to generating the final answer. This approach systematically improves the logical structure of solutions for complex problems, particularly those requiring multi-step inference, and enhances reasoning depth. `Parallel Thinking`: Following InternVL3, for reasoning tasks, we adopt the Best-of-N (BoN) strategy by employing VisualPRM-v1.1 as the critic model to select the optimal response from multiple reasoning candidates. This approach improves reasoning breadth. > Notably, unless otherwise specified, the experimental results reported in our paper are obtained without applying TTS. Thus far, we have only applied TTS to reasoning benchmarks, since we found that the model already exhibits strong perception and understanding capabilities, and initiating TTS yields no significant improvement. In multimodal inference, the vision encoder and language model have distinct computational characteristics. The vision encoder that transforms images into semantic features is highly parallelizable and does not rely on long-term history state. 
In contrast, the language model adopts the inference in an autoregressive manner, which requires previous states to compute the next one. This sequential property makes the language part more sensitive to memory bandwidth and latency. When MLLMs are deployed online at scale, the vision and language models often block each other, thus incurring additional inference cost. This effect becomes more pronounced with larger vision models or higher-resolution images. As shown in the Figure above, we propose decoupled vision-language deployment (DvD) to address this issue by separating vision and language processing, with a particular focus on optimizing the prefilling stage. The vision subsystem batches and processes images to produce compact feature embeddings, which are then transmitted to the language subsystem for fusion with the text context prior to decoding. This separation alleviates blocking and brings multimodal prefilling performance closer to that of pure language models. In our system implementation, the ViT and MLP (and ViR for InternVL3.5-Flash) are deployed on the vision server, while the language server executes only the LLM. The communication is unidirectional, transmitting BF16 visual features over TCP, with RDMA optionally employed to achieve higher transmission speed. Vision processing, feature transmission, and language processing are organized into an asynchronous three-stage pipeline, enabling overlapped execution and minimizing pipeline stalls. DvD increases GPU utilization and processing efficiency on the vision side, while enabling the language server to focus exclusively on the LLM’s prefilling and decoding without being blocked by vision computation. This design leads to improved throughput and responsiveness. Moreover, the architecture supports independent hardware cost optimization for the vision and language modules, and facilitates the seamless integration of new modules without requiring modifications to the language server deployment. 
Multi-Image Understanding & Real-World Comprehension

Comprehensive Multimodal Understanding & Multimodal Hallucination Evaluation

We provide an example code to run `InternVL3.5-8B` using `transformers`. Please note that our models with up to 30B parameters can be deployed on a single A100 GPU, while the 38B model requires two A100 GPUs and the 235B model requires eight A100 GPUs.

> In most cases, both LMDeploy and vLLM can be used for model deployment. However, for InternVL3.5-20B-A4B, we recommend using vLLM, since LMDeploy does not yet support GPT-OSS.

> Please use transformers>=4.52.1 to ensure the model works normally. For the 20B version of our model, transformers>=4.55.0 is required.

To enable thinking mode, please set the system prompt to our Thinking System Prompt. When enabling Thinking mode, we recommend setting `do_sample=True` and `temperature=0.6` to mitigate undesired repetition. Besides this method, you can also use the following code to get streamed output.

Many repositories now support fine-tuning of the InternVL series models, including InternVL, SWIFT, XTuner, and others. Please refer to their documentation for more details on fine-tuning.

LMDeploy is a toolkit for compressing, deploying, and serving LLMs & VLMs. It abstracts the complex inference process of multi-modal vision-language models (VLMs) into an easy-to-use pipeline, similar to the large language model (LLM) inference pipeline.

When dealing with multiple images, you can put them all in one list. Keep in mind that multiple images will lead to a higher number of input tokens, and as a result, the size of the context window typically needs to be increased. Conducting inference with batch prompts is quite straightforward; just place them within a list structure. There are two ways to do multi-turn conversations with the pipeline: one is to construct messages according to the OpenAI format and use the method introduced above, the other is to use the `pipeline.chat` interface.
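For reference, a message in the OpenAI format mentioned above combines text and image parts in a single `content` list. A minimal sketch of composing such a message (the URLs are placeholders):

```python
def make_messages(prompt, image_urls):
    """Build an OpenAI-style multimodal message: one text part followed by
    one image_url part per image."""
    content = [{"type": "text", "text": prompt}]
    for url in image_urls:
        content.append({"type": "image_url", "image_url": {"url": url}})
    return [{"role": "user", "content": content}]

msgs = make_messages(
    "Describe the two images.",
    ["https://example.com/a.jpg", "https://example.com/b.jpg"],
)
```

A list of such message lists can then be passed as batch prompts, and multi-turn conversations are built by appending assistant replies and further user turns to the same list.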
LMDeploy's `api_server` enables models to be easily packed into services with a single command. The provided RESTful APIs are compatible with OpenAI's interfaces. Below is an example of service startup:

To use the OpenAI-style interface, you need to install OpenAI:

This project is released under the Apache-2.0 License. This project uses the pre-trained Qwen3 as a component, which is licensed under the Apache-2.0 License. If you find this project useful in your research, please consider citing:


InternVL2_5-2B

[\[πŸ“‚ GitHub\]](https://github.com/OpenGVLab/InternVL) [\[πŸ“œ InternVL 1.0\]](https://huggingface.co/papers/2312.14238) [\[πŸ“œ InternVL 1.5\]](https://huggingface.co/papers/2404.16821) [\[πŸ“œ Mini-InternVL\]](https://arxiv.org/abs/2410.16261) [\[πŸ“œ InternVL 2.5\]](https://huggingface.co/papers/2412.05271) [\[πŸ†• Blog\]](https://internvl.github.io/blog/) [\[πŸ—¨οΈ Chat Demo\]](https://internvl.opengvlab.com/) [\[πŸ€— HF Demo\]](https://huggingface.co/spaces/OpenGVLab/InternVL) [\[πŸš€ Quick Start\]](#quick-start) [\[πŸ“– Documents\]](https://internvl.readthedocs.io/en/latest/)

We are excited to introduce InternVL 2.5, an advanced multimodal large language model (MLLM) series that builds upon InternVL 2.0, maintaining its core model architecture while introducing significant enhancements in training and testing strategies as well as data quality. In the following table, we provide an overview of the InternVL 2.5 series.

| Model Name | Vision Part | Language Part | HF Link |
| :---: | :---: | :---: | :---: |
| InternVL2_5-1B | InternViT-300M-448px-V2_5 | Qwen2.5-0.5B-Instruct | πŸ€— link |
| InternVL2_5-2B | InternViT-300M-448px-V2_5 | internlm2_5-1_8b-chat | πŸ€— link |
| InternVL2_5-4B | InternViT-300M-448px-V2_5 | Qwen2.5-3B-Instruct | πŸ€— link |
| InternVL2_5-8B | InternViT-300M-448px-V2_5 | internlm2_5-7b-chat | πŸ€— link |
| InternVL2_5-26B | InternViT-6B-448px-V2_5 | internlm2_5-20b-chat | πŸ€— link |
| InternVL2_5-38B | InternViT-6B-448px-V2_5 | Qwen2.5-32B-Instruct | πŸ€— link |
| InternVL2_5-78B | InternViT-6B-448px-V2_5 | Qwen2.5-72B-Instruct | πŸ€— link |

As shown in the following figure, InternVL 2.5 retains the same model architecture as its predecessors, InternVL 1.5 and 2.0, following the "ViT-MLP-LLM" paradigm.
In this new version, we integrate a newly incrementally pre-trained InternViT with various pre-trained LLMs, including InternLM 2.5 and Qwen 2.5, using a randomly initialized MLP projector. As in the previous version, we applied a pixel unshuffle operation, reducing the number of visual tokens to one-quarter of the original. Besides, we adopted a similar dynamic resolution strategy as InternVL 1.5, dividing images into tiles of 448×448 pixels. The key difference, starting from InternVL 2.0, is that we additionally introduced support for multi-image and video data. In InternVL 2.0 and 2.5, we extend the dynamic high-resolution training approach, enhancing its capabilities to handle multi-image and video datasets.

- For single-image datasets, the total number of tiles `n_max` is allocated to a single image for maximum resolution. Visual tokens are enclosed in `<img>` and `</img>` tags.
- For multi-image datasets, the total number of tiles `n_max` is distributed across all images in a sample. Each image is labeled with auxiliary tags like `Image-1` and enclosed in `<img>` and `</img>` tags.
- For videos, each frame is resized to 448×448. Frames are labeled with tags like `Frame-1` and enclosed in `<img>` and `</img>` tags, similar to images.

The training pipeline for a single model in InternVL 2.5 is structured across three stages, designed to enhance the model's visual perception and multimodal capabilities.

- Stage 1: MLP Warmup. In this stage, only the MLP projector is trained while the vision encoder and language model are frozen. A dynamic high-resolution training strategy is applied for better performance, despite increased cost. This phase ensures robust cross-modal alignment and prepares the model for stable multimodal training.
- Stage 1.5: ViT Incremental Learning (Optional). This stage allows incremental training of the vision encoder and MLP projector using the same data as Stage 1.
It enhances the encoder's ability to handle rare domains like multilingual OCR and mathematical charts. Once trained, the encoder can be reused across LLMs without retraining, making this stage optional unless new domains are introduced.

- Stage 2: Full Model Instruction Tuning. The entire model is trained on high-quality multimodal instruction datasets. Strict data quality controls are enforced to prevent degradation of the LLM, as noisy data can cause issues like repetitive or incorrect outputs. After this stage, the training process is complete.

We introduce a progressive scaling strategy to align the vision encoder with LLMs efficiently. This approach trains with smaller LLMs first (e.g., 20B) to optimize foundational visual capabilities and cross-modal alignment before transferring the vision encoder to larger LLMs (e.g., 72B) without retraining. This reuse skips intermediate stages for larger models. Compared to Qwen2-VL's 1.4 trillion tokens, InternVL2.5-78B uses only 120 billion tokens, less than one-tenth. This strategy minimizes redundancy, maximizes pre-trained component reuse, and enables efficient training for complex vision-language tasks.

To improve real-world adaptability and performance, we introduce two key techniques:

- Random JPEG Compression: Random JPEG compression with quality levels between 75 and 100 is applied as a data augmentation technique. This simulates image degradation from internet sources, enhancing the model's robustness to noisy images.
- Loss Reweighting: To balance the NTP loss across responses of different lengths, we use a reweighting strategy called square averaging. This method balances contributions from responses of varying lengths, mitigating biases toward longer or shorter responses.

In InternVL 2.0 and 2.5, the organization of the training data is controlled by several key parameters to optimize the balance and distribution of datasets during training.
- Data Augmentation: JPEG compression is applied conditionally: enabled for image datasets to enhance robustness and disabled for video datasets to maintain consistent frame quality.
- Maximum Tile Number: The parameter `n_max` controls the maximum tiles per dataset. For example, higher values (24-36) are used for multi-image or high-resolution data, lower values (6-12) for standard images, and 1 for videos.
- Repeat Factor: The repeat factor `r` adjusts dataset sampling frequency. Values below 1 reduce a dataset's weight, while values above 1 increase it. This ensures balanced training across tasks and prevents overfitting or underfitting.

During development, we found that LLMs are highly sensitive to data noise, with even small anomalies, such as outliers or repetitive data, causing abnormal behavior during inference. Repetitive generation, especially in long-form or CoT reasoning tasks, proved particularly harmful. To address this challenge and support future research, we designed an efficient data filtering pipeline to remove low-quality samples.

The pipeline includes two modules. For pure-text data, three key strategies are used:

1. LLM-Based Quality Scoring: Each sample is scored (0-10) using a pre-trained LLM with domain-specific prompts. Samples scoring below a threshold (e.g., 7) are removed to ensure high-quality data.
2. Repetition Detection: Repetitive samples are flagged using LLM-based prompts and manually reviewed. Samples scoring below a stricter threshold (e.g., 3) are excluded to avoid repetitive patterns.
3. Heuristic Rule-Based Filtering: Anomalies like abnormal sentence lengths or duplicate lines are detected using rules. Flagged samples undergo manual verification to ensure accuracy before removal.

For multimodal data, two strategies are used:

1. Repetition Detection: Repetitive samples in non-academic datasets are flagged and manually reviewed to prevent pattern loops. High-quality datasets are exempt from this process.
2. Heuristic Rule-Based Filtering: Similar rules are applied to detect visual anomalies, with flagged data verified manually to maintain integrity.

As shown in the following figure, from InternVL 1.5 to 2.0 and then to 2.5, the fine-tuning data mixture has undergone iterative improvements in scale, quality, and diversity. For more information about the training data, please refer to our technical report.

Comprehensive Multimodal & Hallucination Evaluation

Training InternVL 2.0 models led to a decline in pure language capabilities. InternVL 2.5 addresses this by collecting more high-quality open-source data and filtering out low-quality data, achieving better preservation of pure language performance.

We provide an example code to run `InternVL2_5-2B` using `transformers`.

> Please use transformers>=4.37.2 to ensure the model works normally.

The reason for writing the code this way is to avoid errors that occur during multi-GPU inference due to tensors not being on the same device. By ensuring that the first and last layers of the large language model (LLM) are on the same device, we prevent such errors. Besides this method, you can also use the following code to get streamed output.

Many repositories now support fine-tuning of the InternVL series models, including InternVL, SWIFT, XTuner, and others. Please refer to their documentation for more details on fine-tuning.

LMDeploy is a toolkit for compressing, deploying, and serving LLMs & VLMs. It abstracts the complex inference process of multi-modal vision-language models (VLMs) into an easy-to-use pipeline, similar to the large language model (LLM) inference pipeline. If `ImportError` occurs while executing this case, please install the required dependency packages as prompted.

When dealing with multiple images, you can put them all in one list. Keep in mind that multiple images will lead to a higher number of input tokens, and as a result, the size of the context window typically needs to be increased.
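As a rough illustration of why multiple images enlarge the input: each 448×448 tile contributes a fixed number of visual tokens (256 after pixel unshuffle), so the total grows linearly with the tile count. A sketch of the budget, ignoring special tokens and the optional thumbnail tile:

```python
def visual_token_budget(tiles_per_image, tokens_per_tile=256):
    """Rough count of visual tokens contributed by each image in a request.
    tokens_per_tile=256 matches one 448x448 tile after pixel unshuffle; real
    inputs also include special tokens, so treat this as a lower bound."""
    return [n * tokens_per_tile for n in tiles_per_image]

# Three images dynamically tiled into 1, 4, and 6 tiles respectively:
budget = visual_token_budget([1, 4, 6])
total = sum(budget)
```

Summing the per-image budgets against the session length is a quick way to decide whether the context window needs to be enlarged before sending a multi-image request.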
Conducting inference with batch prompts is quite straightforward; just place them within a list structure:

There are two ways to do multi-turn conversations with the pipeline. One is to construct messages according to the OpenAI format and use the method introduced above; the other is to use the `pipeline.chat` interface.

LMDeploy's `api_server` enables models to be easily packed into services with a single command. The provided RESTful APIs are compatible with OpenAI's interfaces. Below is an example of service startup:

To use the OpenAI-style interface, you need to install OpenAI:

This project is released under the MIT License. This project uses the pre-trained internlm2_5-1_8b-chat as a component, which is licensed under the Apache License 2.0. If you find this project useful in your research, please consider citing:


InternVL3-1B-Instruct

[\[πŸ“‚ GitHub\]](https://github.com/OpenGVLab/InternVL) [\[πŸ“œ InternVL 1.0\]](https://huggingface.co/papers/2312.14238) [\[πŸ“œ InternVL 1.5\]](https://huggingface.co/papers/2404.16821) [\[πŸ“œ InternVL 2.5\]](https://huggingface.co/papers/2412.05271) [\[πŸ“œ InternVL2.5-MPO\]](https://huggingface.co/papers/2411.10442) [\[πŸ“œ InternVL3\]](https://huggingface.co/papers/2504.10479) [\[πŸ†• Blog\]](https://internvl.github.io/blog/) [\[πŸ—¨οΈ Chat Demo\]](https://internvl.opengvlab.com/) [\[πŸ€— HF Demo\]](https://huggingface.co/spaces/OpenGVLab/InternVL) [\[πŸš€ Quick Start\]](#quick-start) [\[πŸ“– Documents\]](https://internvl.readthedocs.io/en/latest/)

This is the SFT version of InternVL3-1B, which has undergone native multimodal pre-training and SFT but has not undergone MPO. If you're unsure which version to use, please use the InternVL3-1B version.

We introduce InternVL3, an advanced multimodal large language model (MLLM) series that demonstrates superior overall performance. Compared to InternVL 2.5, InternVL3 exhibits superior multimodal perception and reasoning capabilities, while further extending its multimodal capabilities to encompass tool usage, GUI agents, industrial image analysis, 3D vision perception, and more. Additionally, we compare InternVL3 with Qwen2.5 Chat models, whose corresponding pre-trained base models are employed as the initialization of the language component in InternVL3. Benefiting from Native Multimodal Pre-Training, the InternVL3 series achieves even better overall text performance than the Qwen2.5 series. In the following table, we provide an overview of the InternVL3 series.
| Model Name | Vision Part | Language Part | HF Link |
| :---: | :---: | :---: | :---: |
| InternVL3-1B | InternViT-300M-448px-V2_5 | Qwen2.5-0.5B | πŸ€— link |
| InternVL3-2B | InternViT-300M-448px-V2_5 | Qwen2.5-1.5B | πŸ€— link |
| InternVL3-8B | InternViT-300M-448px-V2_5 | Qwen2.5-7B | πŸ€— link |
| InternVL3-9B | InternViT-300M-448px-V2_5 | internlm3-8b-instruct | πŸ€— link |
| InternVL3-14B | InternViT-300M-448px-V2_5 | Qwen2.5-14B | πŸ€— link |
| InternVL3-38B | InternViT-6B-448px-V2_5 | Qwen2.5-32B | πŸ€— link |
| InternVL3-78B | InternViT-6B-448px-V2_5 | Qwen2.5-72B | πŸ€— link |

As shown in the following figure, InternVL3 retains the same model architecture as InternVL 2.5 and its predecessors, InternVL 1.5 and 2.0, following the "ViT-MLP-LLM" paradigm. In this new version, we integrate a newly incrementally pre-trained InternViT with various pre-trained LLMs, including InternLM 3 and Qwen 2.5, using a randomly initialized MLP projector. As in the previous version, we applied a pixel unshuffle operation, reducing the number of visual tokens to one-quarter of the original. Besides, we adopted a similar dynamic resolution strategy as InternVL 1.5, dividing images into tiles of 448×448 pixels. The key difference, starting from InternVL 2.0, is that we additionally introduced support for multi-image and video data. Notably, in InternVL3, we integrate the Variable Visual Position Encoding (V2PE), which utilizes smaller, more flexible position increments for visual tokens. Benefiting from V2PE, InternVL3 exhibits better long context understanding capabilities compared to its predecessors. We propose a Native Multimodal Pre-Training approach that consolidates language and vision learning into a single pre-training stage.
In contrast to standard paradigms that first train a language-only model and subsequently adapt it to handle additional modalities, our method interleaves multimodal data (e.g., image-text, video-text, or image-text interleaved sequences) with large-scale textual corpora. This unified training scheme allows the model to learn both linguistic and multimodal representations simultaneously, ultimately enhancing its capability to handle vision-language tasks without the need for separate alignment or bridging modules. Please see our paper for more details.

In this phase, the techniques of random JPEG compression, square loss re-weighting, and multimodal data packing proposed in InternVL2.5 are also employed in the InternVL3 series. The main advancement of the SFT phase in InternVL3 compared to InternVL2.5 lies in the use of higher-quality and more diverse training data. Specifically, we further extend training samples for tool use, 3D scene understanding, GUI operations, long context tasks, video understanding, scientific diagrams, creative writing, and multimodal reasoning.

During Pre-training and SFT, the model is trained to predict the next token conditioned on previous ground-truth tokens. However, during inference, the model predicts each token based on its own prior outputs. This discrepancy between ground-truth tokens and model-predicted tokens introduces a distribution shift, which can impair the model's Chain-of-Thought (CoT) reasoning capabilities. To mitigate this issue, we employ MPO, which introduces additional supervision from both positive and negative samples to align the model response distribution with the ground-truth distribution, thereby improving reasoning performance.
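The supervision from paired positive and negative samples can be illustrated with a DPO-style preference term; the exact functional form and the `beta` value below are assumptions for illustration, not necessarily those used in training:

```python
import math

def preference_loss(logp_chosen, logp_rejected,
                    ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """A DPO-style preference term: push the policy to widen the
    log-likelihood margin between the positive (chosen) and negative
    (rejected) response, measured relative to a frozen reference model."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# With no margin the loss is log(2); it shrinks once the policy prefers
# the chosen response more strongly than the reference does.
loss_neutral = preference_loss(-5.0, -5.0, -5.0, -5.0)
loss_better  = preference_loss(-4.0, -6.0, -5.0, -5.0)
```

The key property is that the gradient rewards raising the likelihood of the positive sample relative to the negative one, which is exactly the extra signal unavailable to plain next-token prediction.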
Specifically, the training objective of MPO is a combination of preference loss \\(\mathcal{L}_{\text{p}}\\), quality loss \\(\mathcal{L}_{\text{q}}\\), and generation loss \\(\mathcal{L}_{\text{g}}\\), which can be formulated as follows:

$$ \mathcal{L}=w_{p}\cdot\mathcal{L}_{\text{p}} + w_{q}\cdot\mathcal{L}_{\text{q}} + w_{g}\cdot\mathcal{L}_{\text{g}}, $$

where \\(w_{*}\\) represents the weight assigned to each loss component. Please see our paper for more details about MPO.

Test-Time Scaling has been shown to be an effective method to enhance the reasoning abilities of LLMs and MLLMs. In this work, we use the Best-of-N evaluation strategy and employ VisualPRM-8B as the critic model to select the best response for reasoning and mathematics evaluation.

Comprehensive Multimodal & Hallucination Evaluation

We compare InternVL3 with Qwen2.5 Chat models, whose corresponding pre-trained base models are employed as the initialization of the language component in InternVL3. Benefiting from Native Multimodal Pre-Training, the InternVL3 series achieves even better overall text performance than the Qwen2.5 series. Please note that the evaluation scores of the Qwen2.5 series may differ from those officially reported, as we have adopted the prompt versions provided in the table across all datasets for OpenCompass evaluation.

We conduct experiments on the InternVL2-8B model while keeping its architecture, initialization parameters, and training data entirely unchanged. Traditionally, InternVL2-8B employs a training pipeline that begins with an MLP warmup phase for feature alignment, followed by an instruction tuning stage. In our experiments, we substitute the conventional MLP warmup phase with a native multimodal pre-training process. This modification isolates the contribution of native multimodal pre-training to the overall multimodal capability of the model.
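The Best-of-N strategy mentioned above reduces to scoring each candidate with the critic and keeping the argmax; a minimal sketch in which `score_fn` is a hypothetical stand-in for the VisualPRM-8B critic:

```python
def best_of_n(candidates, score_fn):
    """Score every candidate response with the critic and return the best one."""
    return max(candidates, key=score_fn)

# Toy usage: a placeholder critic that rewards explicit step-by-step work.
candidates = [
    "The answer is 12.",
    "Step 1: 3 * 4 = 12. Step 2: check by addition. Answer: 12.",
]
best = best_of_n(candidates, score_fn=lambda r: r.count("Step"))
```

In practice the critic scores each reasoning process rather than counting keywords, but the selection rule is the same single `argmax` over N sampled responses.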
The evaluation results in the figure below show that the model with native multimodal pre-training exhibits performance on most benchmarks comparable to the fully multi-stage-trained InternVL2-8B baseline. Furthermore, when followed by instruction tuning on higher-quality data, the model demonstrates further performance gains across evaluated multimodal tasks. These findings underscore the efficiency of native multimodal pre-training in imparting powerful multimodal capabilities to MLLMs.

As shown in the table below, models fine-tuned with MPO demonstrate superior reasoning performance across seven multimodal reasoning benchmarks compared to their counterparts without MPO. Specifically, InternVL3-78B and InternVL3-38B outperform their counterparts by 4.1 and 4.5 points, respectively. Notably, the training data used for MPO is a subset of that used for SFT, indicating that the performance improvements primarily stem from the training algorithm rather than the training data.

As reported in the table below, the introduction of V2PE leads to significant performance gains across most evaluation metrics. In addition, our ablation studies, varying the positional increment \\( \delta \\), reveal that even for tasks primarily involving conventional contexts, relatively small \\( \delta \\) values can achieve optimal performance. These findings provide important insights for future efforts aimed at refining position encoding strategies for visual tokens in MLLMs.

We provide an example code to run `InternVL3-1B` using `transformers`.

> Please use transformers>=4.37.2 to ensure the model works normally.

The reason for writing the code this way is to avoid errors that occur during multi-GPU inference due to tensors not being on the same device. By ensuring that the first and last layers of the large language model (LLM) are on the same device, we prevent such errors. Besides this method, you can also use the following code to get streamed output.
Many repositories now support fine-tuning of the InternVL series models, including InternVL, SWIFT, XTuner, and others. Please refer to their documentation for more details on fine-tuning.

LMDeploy is a toolkit for compressing, deploying, and serving LLMs & VLMs. It abstracts the complex inference process of multi-modal vision-language models (VLMs) into an easy-to-use pipeline, similar to the large language model (LLM) inference pipeline. If `ImportError` occurs while executing this case, please install the required dependency packages as prompted.

When dealing with multiple images, you can put them all in one list. Keep in mind that multiple images will lead to a higher number of input tokens, and as a result, the size of the context window typically needs to be increased.

Conducting inference with batch prompts is quite straightforward; just place them within a list structure:

There are two ways to do multi-turn conversations with the pipeline. One is to construct messages according to the OpenAI format and use the method introduced above; the other is to use the `pipeline.chat` interface.

LMDeploy's `api_server` enables models to be easily packed into services with a single command. The provided RESTful APIs are compatible with OpenAI's interfaces. Below is an example of service startup:

To use the OpenAI-style interface, you need to install OpenAI:

This project is released under the MIT License. This project uses the pre-trained Qwen2.5 as a component, which is licensed under the Apache-2.0 License. If you find this project useful in your research, please consider citing:


InternVL3-38B

[\[πŸ“‚ GitHub\]](https://github.com/OpenGVLab/InternVL) [\[πŸ“œ InternVL 1.0\]](https://huggingface.co/papers/2312.14238) [\[πŸ“œ InternVL 1.5\]](https://huggingface.co/papers/2404.16821) [\[πŸ“œ InternVL 2.5\]](https://huggingface.co/papers/2412.05271) [\[πŸ“œ InternVL2.5-MPO\]](https://huggingface.co/papers/2411.10442) [\[πŸ“œ InternVL3\]](https://huggingface.co/papers/2504.10479) [\[πŸ†• Blog\]](https://internvl.github.io/blog/) [\[πŸ—¨οΈ Chat Demo\]](https://internvl.opengvlab.com/) [\[πŸ€— HF Demo\]](https://huggingface.co/spaces/OpenGVLab/InternVL) [\[πŸš€ Quick Start\]](#quick-start) [\[πŸ“– Documents\]](https://internvl.readthedocs.io/en/latest/)

We introduce InternVL3, an advanced multimodal large language model (MLLM) series that demonstrates superior overall performance. Compared to InternVL 2.5, InternVL3 exhibits superior multimodal perception and reasoning capabilities, while further extending its multimodal capabilities to encompass tool usage, GUI agents, industrial image analysis, 3D vision perception, and more. Additionally, we compare InternVL3 with Qwen2.5 Chat models, whose corresponding pre-trained base models are employed as the initialization of the language component in InternVL3. Benefiting from Native Multimodal Pre-Training, the InternVL3 series achieves even better overall text performance than the Qwen2.5 series. In the following table, we provide an overview of the InternVL3 series.
| Model Name | Vision Part | Language Part | HF Link |
| :---: | :---: | :---: | :---: |
| InternVL3-1B | InternViT-300M-448px-V2_5 | Qwen2.5-0.5B | πŸ€— link |
| InternVL3-2B | InternViT-300M-448px-V2_5 | Qwen2.5-1.5B | πŸ€— link |
| InternVL3-8B | InternViT-300M-448px-V2_5 | Qwen2.5-7B | πŸ€— link |
| InternVL3-9B | InternViT-300M-448px-V2_5 | internlm3-8b-instruct | πŸ€— link |
| InternVL3-14B | InternViT-300M-448px-V2_5 | Qwen2.5-14B | πŸ€— link |
| InternVL3-38B | InternViT-6B-448px-V2_5 | Qwen2.5-32B | πŸ€— link |
| InternVL3-78B | InternViT-6B-448px-V2_5 | Qwen2.5-72B | πŸ€— link |

As shown in the following figure, InternVL3 retains the same model architecture as InternVL 2.5 and its predecessors, InternVL 1.5 and 2.0, following the "ViT-MLP-LLM" paradigm. In this new version, we integrate a newly incrementally pre-trained InternViT with various pre-trained LLMs, including InternLM 3 and Qwen 2.5, using a randomly initialized MLP projector. As in the previous version, we applied a pixel unshuffle operation, reducing the number of visual tokens to one-quarter of the original. Besides, we adopted a similar dynamic resolution strategy as InternVL 1.5, dividing images into tiles of 448×448 pixels. The key difference, starting from InternVL 2.0, is that we additionally introduced support for multi-image and video data. Notably, in InternVL3, we integrate the Variable Visual Position Encoding (V2PE), which utilizes smaller, more flexible position increments for visual tokens. Benefiting from V2PE, InternVL3 exhibits better long context understanding capabilities compared to its predecessors. We propose a Native Multimodal Pre-Training approach that consolidates language and vision learning into a single pre-training stage.
In contrast to standard paradigms that first train a language-only model and subsequently adapt it to handle additional modalities, our method interleaves multimodal data (e.g., image-text, video-text, or image-text interleaved sequences) with large-scale textual corpora. This unified training scheme allows the model to learn both linguistic and multimodal representations simultaneously, ultimately enhancing its capability to handle vision-language tasks without the need for separate alignment or bridging modules. Please see our paper for more details.

In this phase, the techniques of random JPEG compression, square loss re-weighting, and multimodal data packing proposed in InternVL2.5 are also employed in the InternVL3 series. The main advancement of the SFT phase in InternVL3 compared to InternVL2.5 lies in the use of higher-quality and more diverse training data. Specifically, we further extend training samples for tool use, 3D scene understanding, GUI operations, long context tasks, video understanding, scientific diagrams, creative writing, and multimodal reasoning.

During Pre-training and SFT, the model is trained to predict the next token conditioned on previous ground-truth tokens. However, during inference, the model predicts each token based on its own prior outputs. This discrepancy between ground-truth tokens and model-predicted tokens introduces a distribution shift, which can impair the model's Chain-of-Thought (CoT) reasoning capabilities. To mitigate this issue, we employ MPO, which introduces additional supervision from both positive and negative samples to align the model response distribution with the ground-truth distribution, thereby improving reasoning performance.
Specifically, the training objective of MPO is a combination of preference loss \\(\mathcal{L}_{\text{p}}\\), quality loss \\(\mathcal{L}_{\text{q}}\\), and generation loss \\(\mathcal{L}_{\text{g}}\\), which can be formulated as follows:

$$ \mathcal{L}=w_{p}\cdot\mathcal{L}_{\text{p}} + w_{q}\cdot\mathcal{L}_{\text{q}} + w_{g}\cdot\mathcal{L}_{\text{g}}, $$

where \\(w_{*}\\) represents the weight assigned to each loss component. Please see our paper for more details about MPO.

Test-Time Scaling has been shown to be an effective method to enhance the reasoning abilities of LLMs and MLLMs. In this work, we use the Best-of-N evaluation strategy and employ VisualPRM-8B as the critic model to select the best response for reasoning and mathematics evaluation.

Comprehensive Multimodal & Hallucination Evaluation

We compare InternVL3 with Qwen2.5 Chat models, whose corresponding pre-trained base models are employed as the initialization of the language component in InternVL3. Benefiting from Native Multimodal Pre-Training, the InternVL3 series achieves even better overall text performance than the Qwen2.5 series. Please note that the evaluation scores of the Qwen2.5 series may differ from those officially reported, as we have adopted the prompt versions provided in the table across all datasets for OpenCompass evaluation.

We conduct experiments on the InternVL2-8B model while keeping its architecture, initialization parameters, and training data entirely unchanged. Traditionally, InternVL2-8B employs a training pipeline that begins with an MLP warmup phase for feature alignment, followed by an instruction tuning stage. In our experiments, we substitute the conventional MLP warmup phase with a native multimodal pre-training process. This modification isolates the contribution of native multimodal pre-training to the overall multimodal capability of the model.
The evaluation results in the figure below show that the model with native multimodal pre-training exhibits performance on most benchmarks comparable to the fully multi-stage-trained InternVL2-8B baseline. Furthermore, when followed by instruction tuning on higher-quality data, the model demonstrates further performance gains across evaluated multimodal tasks. These findings underscore the efficiency of native multimodal pre-training in imparting powerful multimodal capabilities to MLLMs.

As shown in the table below, models fine-tuned with MPO demonstrate superior reasoning performance across seven multimodal reasoning benchmarks compared to their counterparts without MPO. Specifically, InternVL3-78B and InternVL3-38B outperform their counterparts by 4.1 and 4.5 points, respectively. Notably, the training data used for MPO is a subset of that used for SFT, indicating that the performance improvements primarily stem from the training algorithm rather than the training data.

As reported in the table below, the introduction of V2PE leads to significant performance gains across most evaluation metrics. In addition, our ablation studies, varying the positional increment \\( \delta \\), reveal that even for tasks primarily involving conventional contexts, relatively small \\( \delta \\) values can achieve optimal performance. These findings provide important insights for future efforts aimed at refining position encoding strategies for visual tokens in MLLMs.

We provide an example code to run `InternVL3-38B` using `transformers`.

> Please use transformers>=4.37.2 to ensure the model works normally.

The reason for writing the code this way is to avoid errors that occur during multi-GPU inference due to tensors not being on the same device. By ensuring that the first and last layers of the large language model (LLM) are on the same device, we prevent such errors. Besides this method, you can also use the following code to get streamed output.
Many repositories now support fine-tuning of the InternVL series models, including InternVL, SWIFT, XTuner, and others. Please refer to their documentation for more details on fine-tuning.

LMDeploy is a toolkit for compressing, deploying, and serving LLMs & VLMs. LMDeploy abstracts the complex inference process of multi-modal Vision-Language Models (VLMs) into an easy-to-use pipeline, similar to the Large Language Model (LLM) inference pipeline. If an `ImportError` occurs while executing this case, please install the required dependency packages as prompted.

When dealing with multiple images, you can put them all in one list. Keep in mind that multiple images will lead to a higher number of input tokens, and as a result, the size of the context window typically needs to be increased. Conducting inference with batch prompts is quite straightforward; just place them within a list structure: There are two ways to do multi-turn conversations with the pipeline. One is to construct messages according to the OpenAI format and use the method introduced above; the other is to use the `pipeline.chat` interface.

LMDeploy's `api_server` enables models to be easily packed into services with a single command. The provided RESTful APIs are compatible with OpenAI's interfaces. Below is an example of service startup: To use the OpenAI-style interface, you need to install OpenAI: This project is released under the MIT License. This project uses the pre-trained Qwen2.5 as a component, which is licensed under the Apache-2.0 License. If you find this project useful in your research, please consider citing:

license:apache-2.0

InternVL3_5-30B-A3B

[\[πŸ“‚ GitHub\]](https://github.com/OpenGVLab/InternVL) [\[πŸ“œ InternVL 1.0\]](https://huggingface.co/papers/2312.14238) [\[πŸ“œ InternVL 1.5\]](https://huggingface.co/papers/2404.16821) [\[πŸ“œ InternVL 2.5\]](https://huggingface.co/papers/2412.05271) [\[πŸ“œ InternVL2.5-MPO\]](https://huggingface.co/papers/2411.10442) [\[πŸ“œ InternVL3\]](https://huggingface.co/papers/2504.10479) [\[πŸ“œ InternVL3.5\]](https://huggingface.co/papers/2508.18265) [\[πŸ†• Blog\]](https://internvl.github.io/blog/) [\[πŸ—¨οΈ Chat Demo\]](https://chat.intern-ai.org.cn/) [\[πŸš€ Quick Start\]](#quick-start) [\[πŸ“– Documents\]](https://internvl.readthedocs.io/en/latest/) We introduce InternVL3.5, a new family of open-source multimodal models that significantly advances versatility, reasoning capability, and inference efficiency along the InternVL series. A key innovation is the Cascade Reinforcement Learning (Cascade RL) framework, which enhances reasoning through a two-stage process: offline RL for stable convergence and online RL for refined alignment. This coarse-to-fine training strategy leads to substantial improvements on downstream reasoning tasks, e.g., MMMU and MathVista. To optimize efficiency, we propose a Visual Resolution Router (ViR) that dynamically adjusts the resolution of visual tokens without compromising performance. Coupled with ViR, our Decoupled Vision-Language Deployment (DvD) strategy separates the vision encoder and language model across different GPUs, effectively balancing computational load. These contributions collectively enable InternVL3.5 to achieve up to a +16.0\% gain in overall reasoning performance and a 4.05 \\(\times\\) inference speedup compared to its predecessor, i.e., InternVL3. In addition, InternVL3.5 supports novel capabilities such as GUI interaction and embodied agency. 
Notably, our largest model, i.e., InternVL3.5-241B-A28B, attains state-of-the-art results among open-source MLLMs across general multimodal, reasoning, text, and agentic tasksβ€”narrowing the performance gap with leading commercial models like GPT-5. All models and code are publicly released. > Hatched bars represent closed-source commercial models. We report average scores on a set of multimodal general, reasoning, text, and agentic benchmarks: MMBench v1.1 (en), MMStar, BLINK, HallusionBench, AI2D, OCRBench, MMVet, MME-RealWorld (en), MVBench, VideoMME, MMMU, MathVista, MathVision, MathVerse, DynaMath, WeMath, LogicVista, MATH500, AIME24, AIME25, GPQA, MMLU-Pro, GAOKAO, IFEval, SGP-Bench, VSI-Bench, ERQA, SpaCE-10, and OmniSpatial. In the following table, we provide an overview of the InternVL3.5 series. To maintain consistency with earlier generations, we provide two model formats: the GitHub format, consistent with prior releases, and the HF format, aligned with the official Transformers standard. > If you want to convert the checkpoint between these two formats, please refer to the scripts about custom2hf and hf2custom. 
| Model | #Vision Param | #Language Param | #Total Param | HF Link | ModelScope Link |
| --------------------- | ------------- | --------------- | ------------ | ------- | --------------- |
| InternVL3.5-1B | 0.3B | 0.8B | 1.1B | πŸ€— link | πŸ€– link |
| InternVL3.5-2B | 0.3B | 2.0B | 2.3B | πŸ€— link | πŸ€– link |
| InternVL3.5-4B | 0.3B | 4.4B | 4.7B | πŸ€— link | πŸ€– link |
| InternVL3.5-8B | 0.3B | 8.2B | 8.5B | πŸ€— link | πŸ€– link |
| InternVL3.5-14B | 0.3B | 14.8B | 15.1B | πŸ€— link | πŸ€– link |
| InternVL3.5-38B | 5.5B | 32.8B | 38.4B | πŸ€— link | πŸ€– link |
| InternVL3.5-20B-A4B | 0.3B | 20.9B | 21.2B-A4B | πŸ€— link | πŸ€– link |
| InternVL3.5-30B-A3B | 0.3B | 30.5B | 30.8B-A3B | πŸ€— link | πŸ€– link |
| InternVL3.5-241B-A28B | 5.5B | 235.1B | 240.7B-A28B | πŸ€— link | πŸ€– link |

| Model | #Vision Param | #Language Param | #Total Param | HF Link | ModelScope Link |
| ------------------------ | ------------- | --------------- | ------------ | ------- | --------------- |
| InternVL3.5-1B-HF | 0.3B | 0.8B | 1.1B | πŸ€— link | πŸ€– link |
| InternVL3.5-2B-HF | 0.3B | 2.0B | 2.3B | πŸ€— link | πŸ€– link |
| InternVL3.5-4B-HF | 0.3B | 4.4B | 4.7B | πŸ€— link | πŸ€– link |
| InternVL3.5-8B-HF | 0.3B | 8.2B | 8.5B | πŸ€— link | πŸ€– link |
| InternVL3.5-14B-HF | 0.3B | 14.8B | 15.1B | πŸ€— link | πŸ€– link |
| InternVL3.5-38B-HF | 5.5B | 32.8B | 38.4B | πŸ€— link | πŸ€– link |
| InternVL3.5-20B-A4B-HF | 0.3B | 20.9B | 21.2B-A4B | πŸ€— link | πŸ€– link |
| InternVL3.5-30B-A3B-HF | 0.3B | 30.5B | 30.8B-A3B | πŸ€— link | πŸ€– link |
| InternVL3.5-241B-A28B-HF | 5.5B | 235.1B | 240.7B-A28B | πŸ€— link | πŸ€– link |

> We conduct the evaluation with VLMEvalKit.
To enable the Thinking mode of our model, please set the system prompt to R1_SYSTEM_PROMPT. When enabling Thinking mode, we recommend setting `do_sample=True` and `temperature=0.6` to mitigate undesired repetition.

Our training pipeline comprises three stages: Multimodal Continual Pre-Training (CPT), Supervised Fine-Tuning (SFT), and Cascade Reinforcement Learning (CascadeRL). In CascadeRL, we first fine-tune the model using Mixed Preference Optimization (MPO) under an offline RL setting, followed by GSPO under an online RL setting. For the Flash version of InternVL3.5, we additionally introduce a lightweight training stage, termed Visual Consistency Learning (ViCO), which reduces the token cost required to represent an image patch.

Here, we also open-source the model weights after different training stages for potential research usage. If you're unsure which version to use, please select the one without any suffix, as it has completed the full training pipeline.

| Model | Training Pipeline | HF Link | ModelScope Link |
| -------------------------------- | --------------------- | ------- | --------------- |
| InternVL3.5-1B-Pretrained | CPT | πŸ€— link | πŸ€– link |
| InternVL3.5-1B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link |
| InternVL3.5-1B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link |
| InternVL3.5-1B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link |
| InternVL3.5-2B-Pretrained | CPT | πŸ€— link | πŸ€– link |
| InternVL3.5-2B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link |
| InternVL3.5-2B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link |
| InternVL3.5-2B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link |
| InternVL3.5-4B-Pretrained | CPT | πŸ€— link | πŸ€– link |
| InternVL3.5-4B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link |
| InternVL3.5-4B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link |
| InternVL3.5-4B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link |
| InternVL3.5-8B-Pretrained | CPT | πŸ€— link | πŸ€– link |
| InternVL3.5-8B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link |
| InternVL3.5-8B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link |
| InternVL3.5-8B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link |
| InternVL3.5-14B-Pretrained | CPT | πŸ€— link | πŸ€– link |
| InternVL3.5-14B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link |
| InternVL3.5-14B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link |
| InternVL3.5-14B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link |
| InternVL3.5-30B-A3B-Pretrained | CPT | πŸ€— link | πŸ€– link |
| InternVL3.5-30B-A3B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link |
| InternVL3.5-30B-A3B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link |
| InternVL3.5-30B-A3B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link |
| InternVL3.5-38B-Pretrained | CPT | πŸ€— link | πŸ€– link |
| InternVL3.5-38B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link |
| InternVL3.5-38B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link |
| InternVL3.5-38B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link |
| InternVL3.5-241B-A28B-Pretrained | CPT | πŸ€— link | πŸ€– link |
| InternVL3.5-241B-A28B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link |
| InternVL3.5-241B-A28B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link |
| InternVL3.5-241B-A28B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link |

The Flash version of our model will be released as soon as possible.

`InternVL3.5`: This series of models follows the "ViT–MLP–LLM" paradigm adopted in previous versions of InternVL. We initialize the language model using the Qwen3 series and GPT-OSS, and the vision encoder using InternViT-300M and InternViT-6B. The Dynamic High Resolution strategy introduced in InternVL1.5 is also retained in our design.

`InternVL3.5-Flash`: Compared to InternVL3.5, InternVL3.5-Flash further integrates the Visual Resolution Router (ViR), yielding a series of efficient variants suitable for resource-constrained scenarios.
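The Thinking-mode setup recommended earlier (system prompt plus sampling settings) can be sketched as a request builder. This is a hedged illustration: `R1_SYSTEM_PROMPT` here is a placeholder string, not the exact prompt shipped with the model.

```python
# Hedged sketch: Thinking mode = R1-style system prompt + do_sample=True,
# temperature=0.6. The system prompt below is a placeholder, not the real one.

R1_SYSTEM_PROMPT = "You are a helpful assistant. Think step by step ..."  # placeholder

def build_thinking_request(question: str) -> dict:
    return {
        "messages": [
            {"role": "system", "content": R1_SYSTEM_PROMPT},
            {"role": "user", "content": question},
        ],
        # Recommended sampling settings to mitigate undesired repetition.
        "generation_config": {"do_sample": True, "temperature": 0.6},
    }
```

The resulting dict mirrors the OpenAI-style message format accepted by the chat interfaces mentioned in this card.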
Specifically, in InternVL3.5, each image patch is initially represented as 1024 visual tokens by the vision encoder, which are then compressed into 256 tokens via a pixel shuffle module before being passed to the Large Language Model (LLM). In InternVL3.5-Flash, as shown in the Figure below, an additional pixel shuffle module with a higher compression rate is included, enabling the compression of visual tokens down to 64 tokens. For each patch, the patch router determines the appropriate compression rate by assessing its semantic richness, and routes it to the corresponding pixel shuffle module accordingly. Benefiting from this patch-aware compression mechanism, InternVL3.5-Flash is able to reduce the number of visual tokens by 50\% while maintaining nearly 100\% of the performance of InternVL3.5.

During the pre-training stage, we update all model parameters jointly using a combination of large-scale text and multimodal corpora. Specifically, given an arbitrary training sample consisting of a multimodal token sequence \\(\mathbf{x}=\left(x_1, x_2, \ldots, x_L\right)\\), the next token prediction (NTP) loss is calculated on each text token as follows: $$ \mathcal{L}_{i}=-\log p_{\theta}\left(x_i \mid x_1, \ldots, x_{i-1}\right), $$ where \\(x_i\\) is the predicted token and the prefix tokens in \\(\{x_1, x_2, \ldots, x_{i-1}\}\\) can be either text tokens or image tokens. Notably, for conversation samples, only response tokens are included in the calculation of the loss. Additionally, to mitigate bias toward either longer or shorter responses during training, we adopt square-root averaging to re-weight the NTP loss as follows: $$ \mathcal{L}_{i}^{'} = \frac{w_i}{\sum_j w_j} \cdot \mathcal{L}_i, \quad w_i = \frac{1}{N^{0.5}}, $$ where \\(N\\) denotes the number of tokens in the training sample on which the loss needs to be calculated. Random JPEG compression is also included to enhance the model's real-world performance.
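The square-root re-weighting above can be sketched numerically: every token in a sample with \\(N\\) loss-bearing tokens gets weight \\(1/N^{0.5}\\), normalized over the batch. Plain floats stand in for the per-token cross-entropy values.

```python
# Hedged sketch of the square-root-averaged NTP loss described above.
# batch[k] holds the per-token NTP losses of sample k.

def reweighted_ntp_loss(batch: list[list[float]]) -> float:
    token_losses, token_weights = [], []
    for sample in batch:
        n = len(sample)  # number of loss-bearing tokens in this sample
        for loss in sample:
            token_losses.append(loss)
            token_weights.append(1.0 / n ** 0.5)  # w_i = 1 / N**0.5
    z = sum(token_weights)  # normalizer: sum of w_j over all tokens
    return sum(w / z * l for w, l in zip(token_weights, token_losses))
```

With this scheme a long response contributes less per token than a short one, which is the length-bias mitigation the text describes.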
During the SFT phase, we adopt the same objective as in the pre-training stage and use the square-root averaging strategy to calculate the final loss. In this stage, the context window is set to 32K tokens to accommodate long-context information.

Compared to InternVL3, the SFT stage of InternVL3.5 contains more high-quality and diverse training data derived from three sources: (1) Instruction-following data from InternVL3, which are reused to preserve broad coverage of vision–language tasks. (2) Multimodal reasoning data in the "Thinking" mode, which are included to instill long-thinking capabilities in the model. To construct such data, we first use InternVL3-78B to describe the image and then input the description into DeepSeek-R1 to sample rollouts with detailed reasoning processes. Rollouts with an incorrect final answer are filtered out. The questions in these datasets cover various expert domains, such as mathematics and scientific disciplines, thereby strengthening performance on different reasoning tasks. (3) Capability-expansion datasets, which endow InternVL3.5 with new skills, including GUI-based interaction, embodied interaction, and scalable vector graphics (SVG) understanding.

Cascade RL aims to combine the benefits of offline RL and online RL to progressively facilitate the post-training of MLLMs in an efficient manner. Specifically, we first fine-tune the model with an offline RL algorithm as an efficient warm-up stage to reach satisfactory results, which guarantees high-quality rollouts for the subsequent stage. We then employ an online RL algorithm to further refine the output distribution based on rollouts generated by the model itself. Compared to a single offline or online RL stage, our cascaded RL achieves significant performance improvements at a fraction of the GPU time cost. During the offline RL stage, we employ mixed preference optimization (MPO) to fine-tune the model.
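The two-stage cascade described above can be sketched as a training driver: an offline warm-up on pre-collected preference data, then online refinement on rollouts sampled from the current model. This is a hedged control-flow sketch under stated assumptions; `mpo_step` and `gspo_step` are stand-ins for real optimizer updates, not actual APIs.

```python
# Hedged sketch of the Cascade RL recipe: offline MPO warm-up, then online
# GSPO on the model's own rollouts. All step functions are placeholders.

def cascade_rl(model, offline_data, prompts, mpo_step, gspo_step,
               offline_epochs: int = 1, online_iters: int = 2):
    # Stage 1: offline RL (MPO) on pre-collected preference batches.
    for _ in range(offline_epochs):
        for batch in offline_data:
            model = mpo_step(model, batch)
    # Stage 2: online RL (GSPO) on rollouts generated by the current model.
    for _ in range(online_iters):
        rollouts = [model(p) for p in prompts]
        model = gspo_step(model, prompts, rollouts)
    return model
```

The point of the ordering is that the warm-up raises rollout quality before the online stage starts consuming the model's own samples.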
Specifically, the training objective of MPO is a combination of preference loss \\(\mathcal{L}_{p}\\), quality loss \\(\mathcal{L}_{q}\\), and generation loss \\(\mathcal{L}_{g}\\), which can be formulated as follows: $$ \mathcal{L}_{\text{MPO}}= w_{p} \mathcal{L}_{p} + w_{q} \mathcal{L}_{q} + w_{g} \mathcal{L}_{g}, $$ where \\(w_{*}\\) represents the weight assigned to each loss component. The DPO loss, BCO loss, and LM loss serve as the preference loss, quality loss, and generation loss, respectively.

During the online RL stage, we employ GSPO, without reference model constraints, as our online RL algorithm, which we find more effective for training both dense and mixture-of-experts (MoE) models. Similar to GRPO, the advantage is defined as the normalized reward across responses sampled from the same query. The training objective of GSPO is given by: $$ \mathcal{L}_{\mathrm{GSPO}}(\theta)=\mathbb{E}_{x \sim \mathcal{D},\; \{y_i\}_{i=1}^G \sim \pi_{\theta_{\text{old}}}(\cdot \mid x)}\left[\frac{1}{G} \sum_{i=1}^G \min \left(s_i(\theta) \widehat{A}_i,\; \operatorname{clip}\left(s_i(\theta), 1-\varepsilon, 1+\varepsilon\right) \widehat{A}_i\right)\right], $$ where the importance sampling ratio \\(s_i(\theta)\\) is defined as the geometric mean of the per-token ratios.

> Please see our paper for more technical and experimental details.

We further include ViCO as an additional training stage to integrate the visual resolution router (ViR) into InternVL3.5, thereby reducing its inference cost. The resulting efficient version of InternVL3.5 is termed InternVL3.5-Flash. In particular, ViCO comprises two stages:

`Consistency training`: In this stage, the entire model is trained to minimize the divergence between response distributions conditioned on visual tokens with different compression rates. In practice, we introduce an extra reference model, which is frozen and initialized with InternVL3.5.
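The two post-training objectives described earlier in this section (MPO's weighted sum and GSPO's clipped sequence-level surrogate) can be illustrated with toy numbers. This is a hedged sketch: weights and values are illustrative and no model is involved.

```python
import math

# Toy sketch of the MPO combination and the GSPO clipped surrogate above.

def mpo_loss(l_p: float, l_q: float, l_g: float,
             w_p: float = 1.0, w_q: float = 1.0, w_g: float = 1.0) -> float:
    """L_MPO = w_p*L_p + w_q*L_q + w_g*L_g (weights are placeholders)."""
    return w_p * l_p + w_q * l_q + w_g * l_g

def gspo_sequence_ratio(logp_new: list[float], logp_old: list[float]) -> float:
    """s_i(theta): geometric mean of per-token ratios = exp(mean log-ratio)."""
    n = len(logp_new)
    return math.exp(sum(a - b for a, b in zip(logp_new, logp_old)) / n)

def gspo_objective(ratios: list[float], advantages: list[float],
                   eps: float = 0.2) -> float:
    """Mean over G responses of min(s*A, clip(s, 1-eps, 1+eps)*A)."""
    return sum(min(s * a, max(1 - eps, min(1 + eps, s)) * a)
               for s, a in zip(ratios, advantages)) / len(ratios)
```

Working at the sequence level (one ratio per response, via the geometric mean) rather than per token is what distinguishes GSPO's clipping from GRPO's.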
Given a sample, each image patch is represented as either 256 or 64 tokens, and the training objective is defined as follows: $$ \mathcal{L}_{\text{ViCO}} = \mathbb{E}_{\xi \sim \mathcal{R}} \left[ \frac{1}{N} \sum_{i=1}^{N} \mathrm{KL}\Big( \pi_{\theta_{\text{ref}}}\left(y_i \mid y_{<i}, I\right) \;\big\|\; \pi_{\theta}\left(y_i \mid y_{<i}, I_{\xi}\right) \Big) \right], $$ where \\(N\\) denotes the number of response tokens and \\(I_{\xi}\\) denotes the visual tokens under the compression rate \\(\xi\\) sampled from \\(\mathcal{R}\\).

`Router training`: In this stage, the ViR is trained to select the appropriate compression rate for each image patch, while the rest of the model is kept frozen.

> Please see our paper for more technical and experimental details.

Test-time scaling (TTS) has been empirically demonstrated as an effective approach to enhance the reasoning capabilities of LLMs and MLLMs, particularly for complex tasks necessitating multi-step inference. In this work, we implement a comprehensive test-time scaling approach that simultaneously improves reasoning depth (i.e., deep thinking) and breadth (i.e., parallel thinking).

`Deep Thinking`: By activating the Thinking mode, we guide the model to deliberately engage in step-by-step reasoning (i.e., decomposing complex problems into logical steps and validating intermediate conclusions) prior to generating the final answer. This approach systematically improves the logical structure of solutions for complex problems, particularly those requiring multi-step inference, and enhances reasoning depth.

`Parallel Thinking`: Following InternVL3, for reasoning tasks, we adopt the Best-of-N (BoN) strategy by employing VisualPRM-v1.1 as the critic model to select the optimal response from multiple reasoning candidates. This approach improves reasoning breadth.

> Notably, unless otherwise specified, the experimental results reported in our paper are obtained without applying TTS. Thus far, we have only applied TTS to reasoning benchmarks, since we found that the model already exhibits strong perception and understanding capabilities, and enabling TTS yields no significant improvement.

In multimodal inference, the vision encoder and language model have distinct computational characteristics. The vision encoder that transforms images into semantic features is highly parallelizable and does not rely on long-term history states.
In contrast, the language model performs inference autoregressively, where computing each new state requires all previous ones. This sequential property makes the language part more sensitive to memory bandwidth and latency. When MLLMs are deployed online at scale, the vision and language models often block each other, incurring additional inference cost. This effect becomes more pronounced with larger vision models or higher-resolution images.

As shown in the Figure above, we propose decoupled vision-language deployment (DvD) to address this issue by separating vision and language processing, with a particular focus on optimizing the prefilling stage. The vision subsystem batches and processes images to produce compact feature embeddings, which are then transmitted to the language subsystem for fusion with the text context prior to decoding. This separation alleviates blocking and brings multimodal prefilling performance closer to that of pure language models. In our system implementation, the ViT and MLP (and ViR for InternVL3.5-Flash) are deployed on the vision server, while the language server executes only the LLM. The communication is unidirectional, transmitting BF16 visual features over TCP, with RDMA optionally employed to achieve higher transmission speed. Vision processing, feature transmission, and language processing are organized into an asynchronous three-stage pipeline, enabling overlapped execution and minimizing pipeline stalls.

DvD increases GPU utilization and processing efficiency on the vision side, while enabling the language server to focus exclusively on the LLM's prefilling and decoding without being blocked by vision computation. This design leads to improved throughput and responsiveness. Moreover, the architecture supports independent hardware cost optimization for the vision and language modules, and facilitates the seamless integration of new modules without requiring modifications to the language server deployment.
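The asynchronous three-stage pipeline described above can be sketched with queues and threads: vision encoding, feature transmission, and language prefill run concurrently so the stages overlap instead of blocking one another. This is a hedged structural sketch; the stage bodies are stand-ins (no real ViT, TCP/RDMA transfer, or LLM here).

```python
import queue, threading

# Hedged sketch of DvD's three-stage pipeline: each stage is a worker thread
# connected to the next by a FIFO queue; a None sentinel signals completion.

def run_pipeline(images, encode, transmit, prefill):
    q_feat, q_recv, out = queue.Queue(), queue.Queue(), []

    def vision_worker():
        for img in images:
            q_feat.put(encode(img))      # batched ViT forward in practice
        q_feat.put(None)

    def transmit_worker():
        while (feat := q_feat.get()) is not None:
            q_recv.put(transmit(feat))   # BF16 features over TCP/RDMA
        q_recv.put(None)

    def language_worker():
        while (feat := q_recv.get()) is not None:
            out.append(prefill(feat))    # LLM prefill (and decode)

    threads = [threading.Thread(target=t) for t in
               (vision_worker, transmit_worker, language_worker)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out
```

Because each stage only waits on its input queue, the vision server can already encode image k+1 while features of image k are in flight and the language server prefills image k-1.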
Multi-Image Understanding & Real-World Comprehension

Comprehensive Multimodal Understanding & Multimodal Hallucination Evaluation

We provide example code to run `InternVL3.5-8B` using `transformers`. Please note that our models with up to 30B parameters can be deployed on a single A100 GPU, while the 38B model requires two A100 GPUs and the 241B model requires eight A100 GPUs.

> In most cases, both LMDeploy and vLLM can be used for model deployment. However, for InternVL3.5-20B-A4B, we recommend using vLLM, since LMDeploy does not yet support GPT-OSS.

> Please use transformers>=4.52.1 to ensure the model works normally. For the 20B version of our model, transformers>=4.55.0 is required.

To enable Thinking mode, please set the system prompt to our Thinking System Prompt. When enabling Thinking mode, we recommend setting `do_sample=True` and `temperature=0.6` to mitigate undesired repetition. Besides this method, you can also use the following code to get streamed output.

Many repositories now support fine-tuning of the InternVL series models, including InternVL, SWIFT, XTuner, and others. Please refer to their documentation for more details on fine-tuning.

LMDeploy is a toolkit for compressing, deploying, and serving LLMs & VLMs. LMDeploy abstracts the complex inference process of multi-modal Vision-Language Models (VLMs) into an easy-to-use pipeline, similar to the Large Language Model (LLM) inference pipeline. When dealing with multiple images, you can put them all in one list. Keep in mind that multiple images will lead to a higher number of input tokens, and as a result, the size of the context window typically needs to be increased. Conducting inference with batch prompts is quite straightforward; just place them within a list structure: There are two ways to do multi-turn conversations with the pipeline. One is to construct messages according to the OpenAI format and use the method introduced above; the other is to use the `pipeline.chat` interface.
LMDeploy's `api_server` enables models to be easily packed into services with a single command. The provided RESTful APIs are compatible with OpenAI's interfaces. Below is an example of service startup: To use the OpenAI-style interface, you need to install OpenAI: This project is released under the Apache-2.0 License. This project uses the pre-trained Qwen3 as a component, which is licensed under the Apache-2.0 License. If you find this project useful in your research, please consider citing:

license:apache-2.0

InternVL3_5-GPT-OSS-20B-A4B-Preview

[\[πŸ“‚ GitHub\]](https://github.com/OpenGVLab/InternVL) [\[πŸ“œ InternVL 1.0\]](https://huggingface.co/papers/2312.14238) [\[πŸ“œ InternVL 1.5\]](https://huggingface.co/papers/2404.16821) [\[πŸ“œ InternVL 2.5\]](https://huggingface.co/papers/2412.05271) [\[πŸ“œ InternVL2.5-MPO\]](https://huggingface.co/papers/2411.10442) [\[πŸ“œ InternVL3\]](https://huggingface.co/papers/2504.10479) [\[πŸ“œ InternVL3.5\]](https://huggingface.co/papers/2508.18265) [\[πŸ†• Blog\]](https://internvl.github.io/blog/) [\[πŸ—¨οΈ Chat Demo\]](https://chat.intern-ai.org.cn/) [\[πŸš€ Quick Start\]](#quick-start) [\[πŸ“– Documents\]](https://internvl.readthedocs.io/en/latest/) We introduce InternVL3.5, a new family of open-source multimodal models that significantly advances versatility, reasoning capability, and inference efficiency along the InternVL series. A key innovation is the Cascade Reinforcement Learning (Cascade RL) framework, which enhances reasoning through a two-stage process: offline RL for stable convergence and online RL for refined alignment. This coarse-to-fine training strategy leads to substantial improvements on downstream reasoning tasks, e.g., MMMU and MathVista. To optimize efficiency, we propose a Visual Resolution Router (ViR) that dynamically adjusts the resolution of visual tokens without compromising performance. Coupled with ViR, our Decoupled Vision-Language Deployment (DvD) strategy separates the vision encoder and language model across different GPUs, effectively balancing computational load. These contributions collectively enable InternVL3.5 to achieve up to a +16.0\% gain in overall reasoning performance and a 4.05 \\(\times\\) inference speedup compared to its predecessor, i.e., InternVL3. In addition, InternVL3.5 supports novel capabilities such as GUI interaction and embodied agency. 
Notably, our largest model, i.e., InternVL3.5-241B-A28B, attains state-of-the-art results among open-source MLLMs across general multimodal, reasoning, text, and agentic tasksβ€”narrowing the performance gap with leading commercial models like GPT-5. All models and code are publicly released. > Hatched bars represent closed-source commercial models. We report average scores on a set of multimodal general, reasoning, text, and agentic benchmarks: MMBench v1.1 (en), MMStar,BLINK, HallusionBench, AI2D, OCRBench, MMVet, MME-RealWorld (en), MVBench, VideoMME, MMMU, MathVista, MathVision, MathVerse, DynaMath, WeMath, LogicVista, MATH500, AIME24, AIME25, GPQA, MMLU-Pro, GAOKAO, IFEval, SGP-Bench, VSI-Bench, ERQA, SpaCE-10, and OmniSpatial. In the following table, we provide an overview of the InternVL3.5 series. To maintain consistency with earlier generations, we provide two model formats: the GitHub format, consistent with prior releases, and the HF format, aligned with the official Transformers standard. > If you want to convert the checkpoint between these two formats, please refer to the scripts about custom2hf and hf2custom. 
| Model | #Vision Param | #Language Param | #Total Param | HF Link | ModelScope Link | | --------------------- | ------------- | --------------- | ------------ | ------------------------------------------------------------------------------ | ---------------------------------------------------------------------------------------- | | InternVL3.5-1B | 0.3B | 0.8B | 1.1B | πŸ€— link | πŸ€– link | | InternVL3.5-2B | 0.3B | 2.0B | 2.3B | πŸ€— link | πŸ€– link | | InternVL3.5-4B | 0.3B | 4.4B | 4.7B | πŸ€— link | πŸ€– link | | InternVL3.5-8B | 0.3B | 8.2B | 8.5B | πŸ€— link | πŸ€– link | | InternVL3.5-14B | 0.3B | 14.8B | 15.1B | πŸ€— link | πŸ€– link | | InternVL3.5-38B | 5.5B | 32.8B | 38.4B | πŸ€— link | πŸ€– link | | InternVL3.5-20B-A4B | 0.3B | 20.9B | 21.2B-A4B | πŸ€— link | πŸ€– link | | InternVL3.5-30B-A3B | 0.3B | 30.5B | 30.8B-A3B | πŸ€— link | πŸ€– link | | InternVL3.5-241B-A28B | 5.5B | 235.1B | 240.7B-A28B | πŸ€— link | πŸ€– link | | Model | #Vision Param | #Language Param | #Total Param | HF Link | ModelScope Link | | ------------------------ | ------------- | --------------- | ------------ | --------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------- | | InternVL3.5-1B-HF | 0.3B | 0.8B | 1.1B | πŸ€— link | πŸ€– link | | InternVL3.5-2B-HF | 0.3B | 2.0B | 2.3B | πŸ€— link | πŸ€– link | | InternVL3.5-4B-HF | 0.3B | 4.4B | 4.7B | πŸ€— link | πŸ€– link | | InternVL3.5-8B-HF | 0.3B | 8.2B | 8.5B | πŸ€— link | πŸ€– link | | InternVL3.5-14B-HF | 0.3B | 14.8B | 15.1B | πŸ€— link | πŸ€– link | | InternVL3.5-38B-HF | 5.5B | 32.8B | 38.4B | πŸ€— link | πŸ€– link | | InternVL3.5-20B-A4B-HF | 0.3B | 20.9B | 21.2B-A4B | πŸ€— link | πŸ€– link | | InternVL3.5-30B-A3B-HF | 0.3B | 30.5B | 30.8B-A3B | πŸ€— link | πŸ€– link | | InternVL3.5-241B-A28B-HF | 5.5B | 235.1B | 240.7B-A28B | πŸ€— link | πŸ€– link | > We conduct the evaluation with VLMEvalkit. 
To enable the Thinking mode of our model, please set the system prompt to R1SYSTEMPROMPT. When enabling Thinking mode, we recommend setting `dosample=True` and `temperature=0.6` to mitigate undesired repetition. Our training pipeline comprises four stages: Multimodal Continual Pre-Training (CPT), Supervised Fine-Tuning (SFT), and Cascade Reinforcement Learning (CascadeRL). In CascadeRL, we first fine-tune the model using Mixed Preference Optimization (MPO) under an offline RL setting, followed by GSPO under an oneline RL setting. For the Flash version of InternVL3.5, we additionally introduce a lightweight training stage, termed Visual Consistency Learning (ViCO), which reduces the token cost required to represent an image patch. Here, we also open-source the model weights after different training stages for potential research usage. If you're unsure which version to use, please select the one without any suffix, as it has completed the full training pipeline. | Model | Training Pipeline | HF Link | ModelScope Link | | -------------------------------- | --------------------- | --------------------------------------------------------------------------- | ------------------------------------------------------------------------------------- | | InternVL3.5-1B-Pretrained | CPT | πŸ€— link | πŸ€– link | | InternVL3.5-1B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link | | InternVL3.5-1B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link | | InternVL3.5-1B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link | | InternVL3.5-2B-Pretrained | CPT | πŸ€— link | πŸ€– link | | InternVL3.5-2B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link | | InternVL3.5-2B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link | | InternVL3.5-2B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link | | InternVL3.5-4B-Pretrained | CPT | πŸ€— link | πŸ€– link | | InternVL3.5-4B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link | | InternVL3.5-4B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link | | InternVL3.5-4B | CPT + SFT + 
CascadeRL | πŸ€— link | πŸ€– link | | InternVL3.5-8B-Pretrained | CPT | πŸ€— link | πŸ€– link | | InternVL3.5-8B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link | | InternVL3.5-8B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link | | InternVL3.5-8B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link | | InternVL3.5-14B-Pretrained | CPT | πŸ€— link | πŸ€– link | | InternVL3.5-14B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link | | InternVL3.5-14B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link | | InternVL3.5-14B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link | | InternVL3.5-30B-A3B-Pretrained | CPT | πŸ€— link | πŸ€– link | | InternVL3.5-30B-A3B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link | | InternVL3.5-30B-A3B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link | | InternVL3.5-30B-A3B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link | | InternVL3.5-38B-Pretrained | CPT | πŸ€— link | πŸ€– link | | InternVL3.5-38B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link | | InternVL3.5-38B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link | | InternVL3.5-38B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link | | InternVL3.5-241B-A28B-Pretrained | CPT | πŸ€— link | πŸ€– link | | InternVL3.5-241B-A28B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link | | InternVL3.5-241B-A28B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link | | InternVL3.5-241B-A28B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link | The Flash version of our model will be released as soon as possible. `InternVL3.5`: This series of models follow the "ViT–MLP–LLM" paradigm adopted in previous versions of InternVL. We initialize the language model using the Qwen3 series and GPT-OSS, and the vision encoder using InternViT-300M and InternViT-6B. The Dynamic High Resolution strategy introduced in InternVL1.5 is also retained in our design. `InternVL3.5-Flash`: Compared to InternVL3.5, InternVL3.5-Flash further integrates the Visual Resolution Router (ViR), thus yielding a series of efficient variants friendly suitable for resource-constrained scenarios. 
Specifically, in InternVL3.5, each image patch is initially represented as 1024 visual tokens for the vision encoder, which are then compressed into 256 tokens via a pixel shuffle module before being passed to the Large Language Model (LLM). In InternVL3.5-Flash, as shown in the Figure below, an additional pixel shuffle module with a higher compression rate is included, enabling the compression of visual tokens down to 64 tokens. For each patch, the patch router determines the appropriate compression rate by assessing its semantic richness, and routes it to the corresponding pixel shuffle module accordingly. Benefiting from this patch-aware compression mechanism, InternVL3.5-Flash is able to reduce the number of visual tokens by 50\% while maintaining nearly 100\% of the performance of InternVL3.5. During the pre-training stage, we update all model parameters jointly using the combination of large-scale text and multimodal corpora. Specifically, given an arbitrary training sample consisting of a multimodal token sequence \\(\mathbf{x}=\left(x1, x2, \ldots, xL\right)\\), the next token prediction (NTP) loss is calculated on each text token as follows: $$ \mathcal{L}{i}=-\log p\theta\left(xi \mid x1, \ldots, x{i-1}\right), $$ where \\(xi\\) is the predicted token and prefix tokens in \\(\{x1, x2, \ldots, x{i-1}\}\\) can be either text tokens or image tokens. Notably, for conversation samples, only response tokens are included for the calculation of the loss. Additionally, to mitigate bias toward either longer or shorter responses during training, we adopt the square averaging to re-weight the NTP loss as follows: $$ \mathcal{L}{i}^{'} = \frac{wi}{\sumj wj} \cdot \mathcal{L}i, \quad wi = \frac{1}{N^{0.5}}, $$ where \\(N\\) denotes the number of tokens in the training sample on which the loss needs to be calculated. The random JPEG compression is also included to enhance the model's real-world performance. 
During the SFT phase, we adopt the same objective as in the pre-training stage and use the same square-root averaging strategy to calculate the final loss. In this stage, the context window is set to 32K tokens to accommodate long-context information. Compared to InternVL3, the SFT stage of InternVL3.5 contains more high-quality and diverse training data derived from three sources: (1) Instruction-following data from InternVL3, which are reused to preserve broad coverage of vision–language tasks. (2) Multimodal reasoning data in the "Thinking" mode, which are included to instill long-thinking capabilities in the model. To construct such data, we first use InternVL3-78B to describe the image and then feed the description into DeepSeek-R1 to sample rollouts with detailed reasoning processes. Rollouts with an incorrect final answer are filtered out. The questions in these datasets cover various expert domains, such as mathematics and scientific disciplines, thereby strengthening performance on different reasoning tasks. (3) Capability-expansion datasets, which endow InternVL3.5 with new skills, including GUI-based interaction, embodied interaction, and scalable vector graphics (SVG) understanding.

Cascade RL aims to combine the benefits of offline RL and online RL to progressively facilitate the post-training of MLLMs in an efficient manner. Specifically, we first fine-tune the model with an offline RL algorithm as an efficient warm-up stage to reach satisfactory results, which guarantees high-quality rollouts for the subsequent stage. We then employ an online RL algorithm to further refine the output distribution based on rollouts generated by the model itself. Compared to a single offline or online RL stage, our cascaded RL achieves significant performance improvements at a fraction of the GPU time cost. During the offline RL stage, we employ mixed preference optimization (MPO) to fine-tune the model.
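The square-root re-weighting used in both pre-training and SFT can be sketched in a few lines (function name and batch layout are ours): each token's weight is \\(1/N^{0.5}\\) for its sample's token count \\(N\\), normalized over the batch, so long samples no longer dominate the gradient purely by length.

```python
import numpy as np

def square_root_averaged_loss(per_sample_losses: list) -> float:
    """L'_i = w_i / sum_j w_j * L_i with w_i = 1 / N**0.5, where N is the
    number of loss-bearing tokens in the sample containing token i."""
    weights = np.concatenate(
        [np.full(len(l), 1.0 / np.sqrt(len(l))) for l in per_sample_losses])
    losses = np.concatenate(per_sample_losses)
    return float(np.sum(weights / weights.sum() * losses))

# With uniform per-token losses the weighted mean equals the plain mean, but
# each token of the 4-token sample carries sqrt(100/4) = 5x the weight of a
# token from the 100-token sample.
short, long_ = np.ones(4), np.ones(100)
assert abs(square_root_averaged_loss([short, long_]) - 1.0) < 1e-9
```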
Specifically, the training objective of MPO is a combination of preference loss \\(\mathcal{L}_{p}\\), quality loss \\(\mathcal{L}_{q}\\), and generation loss \\(\mathcal{L}_{g}\\), which can be formulated as follows: $$ \mathcal{L}_{\text{MPO}}= w_{p} \mathcal{L}_{p} + w_{q} \mathcal{L}_{q} + w_{g} \mathcal{L}_{g}, $$ where \\(w_{*}\\) represents the weight assigned to each loss component. The DPO loss, BCO loss, and LM loss serve as the preference loss, quality loss, and generation loss, respectively.

During the online RL stage, we employ GSPO, without reference model constraints, as our online RL algorithm, which we find more effective for training both dense and mixture-of-experts (MoE) models. Similar to GRPO, the advantage is defined as the normalized reward across responses sampled from the same query. The training objective of GSPO is given by: $$ \mathcal{L}_{\mathrm{GSPO}}(\theta)=\mathbb{E}_{x \sim \mathcal{D},\,\left\{y_i\right\}_{i=1}^G \sim \pi_{\theta_{\text{old}}}(\cdot \mid x)}\left[\frac{1}{G} \sum_{i=1}^G \min \left(s_i(\theta) \widehat{A}_i, \operatorname{clip}\left(s_i(\theta), 1-\varepsilon, 1+\varepsilon\right) \widehat{A}_i\right)\right], $$ where the importance sampling ratio \\(s_i(\theta)\\) is defined as the geometric mean of the per-token ratios.

> Please see our paper for more technical and experimental details.

We further include ViCO as an additional training stage to integrate the visual resolution router (ViR) into InternVL3.5, thereby reducing its inference cost. The resulting efficient version of InternVL3.5 is termed InternVL3.5-Flash. In particular, ViCO comprises two stages:

`Consistency training`: In this stage, the entire model is trained to minimize the divergence between response distributions conditioned on visual tokens with different compression rates. In practice, we introduce an extra reference model, which is frozen and initialized with InternVL3.5.
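The GSPO objective above can be rendered as a toy function (our simplified sketch, not the training code): advantages are rewards normalized within the group of \\(G\\) responses, the sequence-level ratio \\(s_i(\theta)\\) is the geometric mean of per-token probability ratios, and the surrogate is clipped PPO-style.

```python
import numpy as np

def gspo_loss(logp_new, logp_old, rewards, eps=0.2):
    """Negative clipped surrogate. s_i = exp(mean(logp_new - logp_old)) is the
    geometric mean of per-token ratios; advantages are rewards normalized
    across the G responses sampled for one query."""
    rewards = np.asarray(rewards, dtype=float)
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    objs = []
    for lp_n, lp_o, a in zip(logp_new, logp_old, adv):
        s = np.exp((lp_n - lp_o).mean())  # sequence-level importance ratio
        objs.append(min(s * a, np.clip(s, 1 - eps, 1 + eps) * a))
    return -float(np.mean(objs))

# On-policy sanity check: identical log-probs give ratio 1, so the loss is
# minus the mean normalized advantage, which is 0 by construction.
lp = [np.array([-1.0, -2.0]), np.array([-0.5, -0.5]), np.array([-3.0])]
assert abs(gspo_loss(lp, lp, [1.0, 0.0, 0.5])) < 1e-6
```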
Given a sample, each image patch is represented as either 256 or 64 tokens, and the training objective is defined as follows: $$ \mathcal{L}_{\text{ViCO}} = \mathbb{E}_{\xi \sim \mathcal{R}} \Bigg[ \frac{1}{N} \sum_{i=1}^{N} \mathrm{KL} \Big( \pi_{\theta_{\text{ref}}}\left(y_i \mid y_{<i}\right) \,\Big\|\, \pi_{\theta}\left(y_i \mid y_{<i}, \xi\right) \Big) \Bigg], $$ where \\(\xi\\) denotes the compression rate sampled from the router \\(\mathcal{R}\\) and \\(\pi_{\theta_{\text{ref}}}\\) is the frozen reference model conditioned on uncompressed visual tokens.

`Router training`: In this stage, only the ViR is trained to select the appropriate compression rate for each patch, while the rest of the model remains frozen.

> Please see our paper for more technical and experimental details.

Test-time scaling (TTS) has been empirically demonstrated as an effective approach to enhance the reasoning capabilities of LLMs and MLLMs, particularly for complex tasks necessitating multi-step inference. In this work, we implement a comprehensive test-time scaling approach that simultaneously improves reasoning depth (i.e., deep thinking) and breadth (i.e., parallel thinking).

`Deep Thinking`: By activating the Thinking mode, we guide the model to deliberately engage in step-by-step reasoning (i.e., decomposing complex problems into logical steps and validating intermediate conclusions) prior to generating the final answer. This approach systematically improves the logical structure of solutions for complex problems, particularly those requiring multi-step inference, and enhances reasoning depth.

`Parallel Thinking`: Following InternVL3, for reasoning tasks, we adopt the Best-of-N (BoN) strategy, employing VisualPRM-v1.1 as the critic model to select the optimal response from multiple reasoning candidates. This approach improves reasoning breadth.

> Notably, unless otherwise specified, the experimental results reported in our paper are obtained without applying TTS. Thus far, we have only applied TTS to reasoning benchmarks, since we found that the model already exhibits strong perception and understanding capabilities, and applying TTS yields no significant improvement.

In multimodal inference, the vision encoder and language model have distinct computational characteristics. The vision encoder, which transforms images into semantic features, is highly parallelizable and does not rely on long-term history states.
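Parallel thinking reduces to a Best-of-N selection loop. The sketch below uses a stand-in scoring function where the real system would query the VisualPRM-v1.1 critic (the toy critic and rollout strings are ours):

```python
def best_of_n(candidates, critic):
    """Return the candidate the critic scores highest (Best-of-N / BoN)."""
    return max(candidates, key=critic)

# Hypothetical critic: prefer rollouts whose final line states an answer.
# A real PRM would instead score each reasoning step and aggregate.
def toy_critic(rollout: str) -> float:
    return float(rollout.splitlines()[-1].startswith("Answer:"))

rollouts = ["I think it's 7", "Step 1: 3+4=7\nAnswer: 7", "maybe 8?"]
assert best_of_n(rollouts, toy_critic) == "Step 1: 3+4=7\nAnswer: 7"
```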
In contrast, the language model performs inference in an autoregressive manner, which requires previous states to compute the next one. This sequential property makes the language part more sensitive to memory bandwidth and latency. When MLLMs are deployed online at scale, the vision and language models often block each other, incurring additional inference cost. This effect becomes more pronounced with larger vision models or higher-resolution images. As shown in the Figure above, we propose decoupled vision-language deployment (DvD) to address this issue by separating vision and language processing, with a particular focus on optimizing the prefilling stage. The vision subsystem batches and processes images to produce compact feature embeddings, which are then transmitted to the language subsystem and fused with the text context prior to decoding. This separation alleviates blocking and brings multimodal prefilling performance closer to that of pure language models.

In our system implementation, the ViT and MLP (and ViR for InternVL3.5-Flash) are deployed on the vision server, while the language server executes only the LLM. Communication is unidirectional, transmitting BF16 visual features over TCP, with RDMA optionally employed for higher transmission speed. Vision processing, feature transmission, and language processing are organized into an asynchronous three-stage pipeline, enabling overlapped execution and minimizing pipeline stalls. DvD increases GPU utilization and processing efficiency on the vision side, while enabling the language server to focus exclusively on the LLM's prefilling and decoding without being blocked by vision computation. This design leads to improved throughput and responsiveness. Moreover, the architecture supports independent hardware cost optimization for the vision and language modules, and facilitates the seamless integration of new modules without requiring modifications to the language server deployment.
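The asynchronous three-stage pipeline can be sketched with threads and bounded queues; the stage callables below are stand-ins for the vision server, the TCP/RDMA transport, and the language server's prefill (all names and the toy arithmetic are ours):

```python
import queue
import threading

def run_dvd_pipeline(images, encode, transmit, prefill):
    """Toy decoupled vision-language pipeline: vision encoding, feature
    transmission, and LLM prefill run in separate threads connected by
    queues, so the three stages overlap instead of blocking one another."""
    q_feat, q_lang, results = queue.Queue(2), queue.Queue(2), []

    def vision():                       # stage 1: encode images
        for img in images:
            q_feat.put(encode(img))
        q_feat.put(None)                # sentinel: no more work

    def transport():                    # stage 2: ship features downstream
        while (feat := q_feat.get()) is not None:
            q_lang.put(transmit(feat))
        q_lang.put(None)

    def language():                     # stage 3: prefill with the features
        while (feat := q_lang.get()) is not None:
            results.append(prefill(feat))

    threads = [threading.Thread(target=f) for f in (vision, transport, language)]
    for t in threads: t.start()
    for t in threads: t.join()
    return results

# Stage stand-ins (assumptions, not the real servers):
out = run_dvd_pipeline([1, 2, 3], encode=lambda x: x * 10,
                       transmit=lambda f: f, prefill=lambda f: f + 1)
assert out == [11, 21, 31]
```

The bounded queues model the back-pressure between servers: if the language side stalls, the vision side fills its queue and pauses instead of wasting GPU memory on unconsumed features.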
Multi-Image Understanding & Real-World Comprehension

Comprehensive Multimodal Understanding & Multimodal Hallucination Evaluation

We provide example code to run `InternVL3.5-8B` using `transformers`. Please note that our models with up to 30B parameters can be deployed on a single A100 GPU, while the 38B model requires two A100 GPUs and the 235B model requires eight A100 GPUs.

> In most cases, both LMDeploy and vLLM can be used for model deployment. However, for InternVL3.5-20B-A4B, we recommend using vLLM, since LMDeploy does not yet support GPT-OSS.

> Please use transformers>=4.52.1 to ensure the model works normally. For the 20B version of our model, transformers>=4.55.0 is required.

To enable thinking mode, please set the system prompt to our Thinking System Prompt. When enabling Thinking mode, we recommend setting `do_sample=True` and `temperature=0.6` to mitigate undesired repetition. Besides this method, you can also use the following code to get streamed output.

Many repositories now support fine-tuning of the InternVL series models, including InternVL, SWIFT, XTuner, and others. Please refer to their documentation for more details on fine-tuning.

vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs and MLLMs. Please refer to the documentation for how to deploy the InternVL series.

NOTE: Up to version 0.10.1.1, vLLM exhibits compatibility issues with GPT-OSS when applied in MLLMs. If you encounter any errors, please try replacing the `vllm/model_executor/models/gpt_oss.py` file with the following content:

WARNING: Up to version 0.9.2, LMDeploy does not provide support for GPT-OSS. To deploy InternVL3.5-GPT-OSS-20B-Preview, we recommend using vLLM.

LMDeploy is a toolkit for compressing, deploying, and serving LLMs & VLMs. LMDeploy abstracts the complex inference process of multi-modal Vision-Language Models (VLM) into an easy-to-use pipeline, similar to the Large Language Model (LLM) inference pipeline.
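Wiring up the recommended Thinking-mode settings can be sketched as below; the placeholder system prompt and the helper name are ours, and the actual Thinking System Prompt ships with the model repository:

```python
THINKING_SYSTEM_PROMPT = "..."  # placeholder: use the official Thinking System Prompt

def build_thinking_request(question: str) -> dict:
    """Chat messages plus the sampling settings recommended for Thinking mode
    (do_sample=True, temperature=0.6 to mitigate undesired repetition)."""
    return {
        "messages": [
            {"role": "system", "content": THINKING_SYSTEM_PROMPT},
            {"role": "user", "content": question},
        ],
        "generation_config": {"do_sample": True, "temperature": 0.6},
    }

req = build_thinking_request("How many triangles are in the figure?")
assert req["generation_config"] == {"do_sample": True, "temperature": 0.6}
```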
When dealing with multiple images, you can put them all in one list. Keep in mind that multiple images will lead to a higher number of input tokens, and as a result, the size of the context window typically needs to be increased.

Conducting inference with batch prompts is quite straightforward; just place them within a list structure:

There are two ways to do multi-turn conversations with the pipeline. One is to construct messages according to the OpenAI format and use the method introduced above; the other is to use the `pipeline.chat` interface.

LMDeploy's `api_server` enables models to be easily packed into services with a single command. The provided RESTful APIs are compatible with OpenAI's interfaces. Below is an example of service startup:

To use the OpenAI-style interface, you need to install OpenAI:

This project is released under the Apache-2.0 License. This project uses the pre-trained Qwen3 as a component, which is licensed under the Apache-2.0 License.

If you find this project useful in your research, please consider citing:
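Returning to the OpenAI-style interface mentioned above: an image travels as an `image_url` content part next to the text. A minimal payload builder looks like this (the function name, model id, and URL are illustrative placeholders):

```python
def build_vision_chat_payload(model: str, prompt: str, image_url: str) -> dict:
    """OpenAI-compatible chat payload mixing text and image content parts,
    as accepted by an api_server's /v1/chat/completions endpoint."""
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    }

payload = build_vision_chat_payload(
    "OpenGVLab/InternVL3_5-8B", "Describe this image.",
    "https://example.com/image.jpg")
assert payload["messages"][0]["content"][1]["type"] == "image_url"
```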

license:apache-2.0
40,039
76

InternVL3_5-1B-HF

license:apache-2.0
39,263
5

InternVL3_5-4B-MPO

Notably, our largest model, i.e., InternVL3.5-241B-A28B, attains state-of-the-art results among open-source MLLMs across general multimodal, reasoning, text, and agentic tasks, narrowing the performance gap with leading commercial models like GPT-5. All models and code are publicly released.

> Hatched bars represent closed-source commercial models. We report average scores on a set of multimodal general, reasoning, text, and agentic benchmarks: MMBench v1.1 (en), MMStar, BLINK, HallusionBench, AI2D, OCRBench, MMVet, MME-RealWorld (en), MVBench, VideoMME, MMMU, MathVista, MathVision, MathVerse, DynaMath, WeMath, LogicVista, MATH500, AIME24, AIME25, GPQA, MMLU-Pro, GAOKAO, IFEval, SGP-Bench, VSI-Bench, ERQA, SpaCE-10, and OmniSpatial.

In the following table, we provide an overview of the InternVL3.5 series. To maintain consistency with earlier generations, we provide two model formats: the GitHub format, consistent with prior releases, and the HF format, aligned with the official Transformers standard.

> If you want to convert a checkpoint between these two formats, please refer to the scripts about custom2hf and hf2custom.
| Model | #Vision Param | #Language Param | #Total Param | HF Link | ModelScope Link |
| --------------------- | ------------- | --------------- | ------------ | ------- | --------------- |
| InternVL3.5-1B | 0.3B | 0.8B | 1.1B | 🤗 link | 🤖 link |
| InternVL3.5-2B | 0.3B | 2.0B | 2.3B | 🤗 link | 🤖 link |
| InternVL3.5-4B | 0.3B | 4.4B | 4.7B | 🤗 link | 🤖 link |
| InternVL3.5-8B | 0.3B | 8.2B | 8.5B | 🤗 link | 🤖 link |
| InternVL3.5-14B | 0.3B | 14.8B | 15.1B | 🤗 link | 🤖 link |
| InternVL3.5-38B | 5.5B | 32.8B | 38.4B | 🤗 link | 🤖 link |
| InternVL3.5-20B-A4B | 0.3B | 20.9B | 21.2B-A4B | 🤗 link | 🤖 link |
| InternVL3.5-30B-A3B | 0.3B | 30.5B | 30.8B-A3B | 🤗 link | 🤖 link |
| InternVL3.5-241B-A28B | 5.5B | 235.1B | 240.7B-A28B | 🤗 link | 🤖 link |

| Model | #Vision Param | #Language Param | #Total Param | HF Link | ModelScope Link |
| ------------------------ | ------------- | --------------- | ------------ | ------- | --------------- |
| InternVL3.5-1B-HF | 0.3B | 0.8B | 1.1B | 🤗 link | 🤖 link |
| InternVL3.5-2B-HF | 0.3B | 2.0B | 2.3B | 🤗 link | 🤖 link |
| InternVL3.5-4B-HF | 0.3B | 4.4B | 4.7B | 🤗 link | 🤖 link |
| InternVL3.5-8B-HF | 0.3B | 8.2B | 8.5B | 🤗 link | 🤖 link |
| InternVL3.5-14B-HF | 0.3B | 14.8B | 15.1B | 🤗 link | 🤖 link |
| InternVL3.5-38B-HF | 5.5B | 32.8B | 38.4B | 🤗 link | 🤖 link |
| InternVL3.5-20B-A4B-HF | 0.3B | 20.9B | 21.2B-A4B | 🤗 link | 🤖 link |
| InternVL3.5-30B-A3B-HF | 0.3B | 30.5B | 30.8B-A3B | 🤗 link | 🤖 link |
| InternVL3.5-241B-A28B-HF | 5.5B | 235.1B | 240.7B-A28B | 🤗 link | 🤖 link |

> We conduct the evaluation with VLMEvalKit.
To enable the Thinking mode of our model, please set the system prompt to R1_SYSTEM_PROMPT. When enabling Thinking mode, we recommend setting `do_sample=True` and `temperature=0.6` to mitigate undesired repetition.

Our training pipeline comprises three main stages: Multimodal Continual Pre-Training (CPT), Supervised Fine-Tuning (SFT), and Cascade Reinforcement Learning (CascadeRL). In CascadeRL, we first fine-tune the model using Mixed Preference Optimization (MPO) under an offline RL setting, followed by GSPO under an online RL setting. For the Flash version of InternVL3.5, we additionally introduce a lightweight training stage, termed Visual Consistency Learning (ViCO), which reduces the token cost required to represent an image patch.

Here, we also open-source the model weights after different training stages for potential research usage. If you're unsure which version to use, please select the one without any suffix, as it has completed the full training pipeline.

| Model | Training Pipeline | HF Link | ModelScope Link |
| -------------------------------- | --------------------- | ------- | --------------- |
| InternVL3.5-1B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-1B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-1B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-1B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |
| InternVL3.5-2B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-2B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-2B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-2B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |
| InternVL3.5-4B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-4B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-4B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-4B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |
| InternVL3.5-8B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-8B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-8B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-8B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |
| InternVL3.5-14B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-14B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-14B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-14B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |
| InternVL3.5-30B-A3B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-30B-A3B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-30B-A3B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-30B-A3B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |
| InternVL3.5-38B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-38B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-38B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-38B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |
| InternVL3.5-241B-A28B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-241B-A28B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-241B-A28B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-241B-A28B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |

The Flash version of our model will be released as soon as possible.

`InternVL3.5`: This series of models follows the "ViT–MLP–LLM" paradigm adopted in previous versions of InternVL. We initialize the language model with the Qwen3 series and GPT-OSS, and the vision encoder with InternViT-300M and InternViT-6B. The Dynamic High Resolution strategy introduced in InternVL1.5 is retained in our design.

`InternVL3.5-Flash`: Compared to InternVL3.5, InternVL3.5-Flash further integrates the Visual Resolution Router (ViR), yielding a series of efficient variants suitable for resource-constrained scenarios.

license:apache-2.0
38,330
2

InternViT-300M-448px-V2_5

[\[πŸ“‚ GitHub\]](https://github.com/OpenGVLab/InternVL) [\[πŸ“œ InternVL 1.0\]](https://huggingface.co/papers/2312.14238) [\[πŸ“œ InternVL 1.5\]](https://huggingface.co/papers/2404.16821) [\[πŸ“œ Mini-InternVL\]](https://arxiv.org/abs/2410.16261) [\[πŸ“œ InternVL 2.5\]](https://huggingface.co/papers/2412.05271) [\[πŸ†• Blog\]](https://internvl.github.io/blog/) [\[πŸ—¨οΈ Chat Demo\]](https://internvl.opengvlab.com/) [\[πŸ€— HF Demo\]](https://huggingface.co/spaces/OpenGVLab/InternVL) [\[πŸš€ Quick Start\]](#quick-start) [\[πŸ“– Documents\]](https://internvl.readthedocs.io/en/latest/) We are excited to announce the release of `InternViT-300M-448px-V2_5`, a significant enhancement built on the foundation of `InternViT-300M-448px`. By employing ViT incremental learning with NTP loss (Stage 1.5), the vision encoder has improved its ability to extract visual features, enabling it to capture more comprehensive information. This improvement is particularly noticeable in domains that are underrepresented in large-scale web datasets such as LAION-5B, including multilingual OCR data and mathematical charts, among others. In the following table, we provide an overview of the InternViT 2.5 series.

| Model Name | HF Link |
| :-----------------------: | :-----: |
| InternViT-300M-448px-V2_5 | πŸ€— link |
| InternViT-6B-448px-V2_5 | πŸ€— link |

As shown in the following figure, InternVL 2.5 retains the same model architecture as its predecessors, InternVL 1.5 and 2.0, following the "ViT-MLP-LLM" paradigm. In this new version, we integrate a newly incrementally pre-trained InternViT with various pre-trained LLMs, including InternLM 2.5 and Qwen 2.5, using a randomly initialized MLP projector. As in the previous version, we applied a pixel unshuffle operation, reducing the number of visual tokens to one-quarter of the original.
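The pixel unshuffle operation mentioned above can be sketched in a few lines of NumPy. This is an illustrative re-implementation, not the model's actual code: with a downscale factor of 2, a 32Γ—32 grid of ViT tokens becomes 16Γ—16 (one quarter of the tokens), while the channel dimension grows 4Γ—.

```python
import numpy as np

def pixel_unshuffle(tokens: np.ndarray, grid: int, factor: int = 2) -> np.ndarray:
    """Fold each factor x factor spatial neighborhood into the channel dim.

    tokens: (grid*grid, C) visual tokens laid out on a grid x grid map.
    Returns ((grid/factor)**2, factor*factor*C) tokens.
    """
    c = tokens.shape[1]
    x = tokens.reshape(grid, grid, c)
    g = grid // factor
    # Group neighborhoods, then move the factor x factor block next to channels.
    x = x.reshape(g, factor, g, factor, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(g * g, factor * factor * c)

tokens = np.arange(1024 * 64, dtype=np.float64).reshape(1024, 64)  # 32x32 grid
out = pixel_unshuffle(tokens, grid=32)  # 256 tokens, 4x the channels
```

No information is lost: the operation is a pure rearrangement, trading spatial resolution for channel depth before the MLP projector.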
Besides, we adopted a similar dynamic resolution strategy as InternVL 1.5, dividing images into tiles of 448Γ—448 pixels. The key difference, starting from InternVL 2.0, is that we additionally introduced support for multi-image and video data. In InternVL 2.0 and 2.5, we extend the dynamic high-resolution training approach, enhancing its capabilities to handle multi-image and video datasets.

- For single-image datasets, the total number of tiles `n_max` is allocated to a single image for maximum resolution. Visual tokens are enclosed in `<img>` and `</img>` tags.
- For multi-image datasets, the total number of tiles `n_max` is distributed across all images in a sample. Each image is labeled with auxiliary tags like `Image-1` and enclosed in `<img>` and `</img>` tags.
- For videos, each frame is resized to 448Γ—448. Frames are labeled with tags like `Frame-1` and enclosed in `<img>` and `</img>` tags, similar to images.

The training pipeline for a single model in InternVL 2.5 is structured across three stages, designed to enhance the model's visual perception and multimodal capabilities.

- Stage 1: MLP Warmup. In this stage, only the MLP projector is trained while the vision encoder and language model are frozen. A dynamic high-resolution training strategy is applied for better performance, despite increased cost. This phase ensures robust cross-modal alignment and prepares the model for stable multimodal training.
- Stage 1.5: ViT Incremental Learning (Optional). This stage allows incremental training of the vision encoder and MLP projector using the same data as Stage 1. It enhances the encoder's ability to handle rare domains like multilingual OCR and mathematical charts. Once trained, the encoder can be reused across LLMs without retraining, making this stage optional unless new domains are introduced.
- Stage 2: Full Model Instruction Tuning. The entire model is trained on high-quality multimodal instruction datasets.
Strict data quality controls are enforced to prevent degradation of the LLM, as noisy data can cause issues like repetitive or incorrect outputs. After this stage, the training process is complete. We present a comprehensive evaluation of the vision encoder’s performance across various domains and tasks. The evaluation is divided into two key categories: (1) image classification, representing global-view semantic quality, and (2) semantic segmentation, capturing local-view semantic quality. This approach allows us to assess the representation quality of InternViT across its successive version updates. Please refer to our technical report for more details. Image classification performance across different versions of InternViT. We use IN-1K for training and evaluate on the IN-1K validation set as well as multiple ImageNet variants, including IN-ReaL, IN-V2, IN-A, IN-R, and IN-Sketch. Results are reported for both linear probing and attention pooling probing methods, with average accuracy for each method. βˆ† represents the performance gap between attention pooling probing and linear probing, where a larger βˆ† suggests a shift from learning simple linear features to capturing more complex, nonlinear semantic representations. Semantic segmentation performance across different versions of InternViT. The models are evaluated on ADE20K and COCO-Stuff-164K using three configurations: linear probing, head tuning, and full tuning. The table shows the mIoU scores for each configuration and their averages. βˆ†1 represents the gap between head tuning and linear probing, while βˆ†2 shows the gap between full tuning and linear probing. A larger βˆ† value indicates a shift from simple linear features to more complex, nonlinear representations. > \[!Warning\] > 🚨 Note: In our experience, the InternViT V2.5 series is better suited for building MLLMs than traditional computer vision tasks. If you find this project useful in your research, please consider citing:

license:mit
33,773
48

InternVL3-8B-hf

[\[πŸ“œ InternVL 1.0\]](https://huggingface.co/papers/2312.14238) [\[πŸ“œ InternVL 1.5\]](https://huggingface.co/papers/2404.16821) [\[πŸ“œ InternVL 2.5\]](https://huggingface.co/papers/2412.05271) [\[πŸ“œ InternVL2.5-MPO\]](https://huggingface.co/papers/2411.10442) [\[πŸ“œ InternVL3\]](https://huggingface.co/papers/2504.10479) [\[πŸ†• Blog\]](https://internvl.github.io/blog/) [\[πŸ—¨οΈ Chat Demo\]](https://internvl.opengvlab.com/) [\[πŸ€— HF Demo\]](https://huggingface.co/spaces/OpenGVLab/InternVL) [\[πŸš€ Quick Start\]](#quick-start) [\[πŸ“– Documents\]](https://internvl.readthedocs.io/en/latest/) > [!IMPORTANT] > This repository contains the Hugging Face πŸ€— Transformers implementation for the OpenGVLab/InternVL3-8B model. > It is intended to be functionally equivalent to the original OpenGVLab release. > As a native Transformers model, it supports core library features such as various attention implementations (eager, SDPA, and FA2) and enables efficient batched inference with interleaved image, video, and text inputs. We introduce InternVL3, an advanced multimodal large language model (MLLM) series that demonstrates superior overall performance. Compared to InternVL 2.5, InternVL3 exhibits superior multimodal perception and reasoning capabilities, while further extending its multimodal capabilities to encompass tool usage, GUI agents, industrial image analysis, 3D vision perception, and more. Additionally, we compare InternVL3 with Qwen2.5 Chat models, whose corresponding pre-trained base models are employed as the initialization of the language component in InternVL3. Benefiting from Native Multimodal Pre-Training, the InternVL3 series achieves even better overall text performance than the Qwen2.5 series.
You can find more info on the InternVL3 family in the original checkpoint OpenGVLab/InternVL3-8B. Here is how you can use the `image-text-to-text` pipeline to perform inference with the `InternVL3` models in just a few lines of code: This example demonstrates how to perform inference on a single image with the InternVL models using chat templates. > [!NOTE] > Note that the model has been trained with a specific prompt format for chatting. Use `processor.apply_chat_template(my_conversation_dict)` to correctly format your prompts. Text-only generation This example shows how to generate text using the InternVL model without providing any image input. Batched image and text inputs InternVL models also support batched image and text inputs. Batched multi-image input This implementation of the InternVL models supports batched text-image inputs with a different number of images for each text. Video input InternVL models can also handle video inputs. Here is an example of how to perform inference on a video input using chat templates. Interleaved image and video inputs This example showcases how to handle a batch of chat conversations with interleaved image and video inputs using chat templates. This project is released under the MIT License. This project uses the pre-trained Qwen2.5 as a component, which is licensed under the Qwen License. If you find this project useful in your research, please consider citing:

β€”
32,964
9

InternVL2-8B

license:mit
31,186
180

InternVL3-8B-Instruct

license:apache-2.0
27,142
11

InternVL3-2B-hf

[\[πŸ“œ InternVL 1.0\]](https://huggingface.co/papers/2312.14238) [\[πŸ“œ InternVL 1.5\]](https://huggingface.co/papers/2404.16821) [\[πŸ“œ InternVL 2.5\]](https://huggingface.co/papers/2412.05271) [\[πŸ“œ InternVL2.5-MPO\]](https://huggingface.co/papers/2411.10442) [\[πŸ“œ InternVL3\]](https://huggingface.co/papers/2504.10479) [\[πŸ†• Blog\]](https://internvl.github.io/blog/) [\[πŸ—¨οΈ Chat Demo\]](https://internvl.opengvlab.com/) [\[πŸ€— HF Demo\]](https://huggingface.co/spaces/OpenGVLab/InternVL) [\[πŸš€ Quick Start\]](#quick-start) [\[πŸ“– Documents\]](https://internvl.readthedocs.io/en/latest/) > [!IMPORTANT] > This repository contains the Hugging Face πŸ€— Transformers implementation for the OpenGVLab/InternVL3-2B model. > It is intended to be functionally equivalent to the original OpenGVLab release. > As a native Transformers model, it supports core library features such as various attention implementations (eager, SDPA, and FA2) and enables efficient batched inference with interleaved image, video, and text inputs. We introduce InternVL3, an advanced multimodal large language model (MLLM) series that demonstrates superior overall performance. Compared to InternVL 2.5, InternVL3 exhibits superior multimodal perception and reasoning capabilities, while further extending its multimodal capabilities to encompass tool usage, GUI agents, industrial image analysis, 3D vision perception, and more. Additionally, we compare InternVL3 with Qwen2.5 Chat models, whose corresponding pre-trained base models are employed as the initialization of the language component in InternVL3. Benefiting from Native Multimodal Pre-Training, the InternVL3 series achieves even better overall text performance than the Qwen2.5 series.
You can find more info on the InternVL3 family in the original checkpoint OpenGVLab/InternVL3-2B. Here is how you can use the `image-text-to-text` pipeline to perform inference with the `InternVL3` models in just a few lines of code: This example demonstrates how to perform inference on a single image with the InternVL models using chat templates. > [!NOTE] > Note that the model has been trained with a specific prompt format for chatting. Use `processor.apply_chat_template(my_conversation_dict)` to correctly format your prompts. Text-only generation This example shows how to generate text using the InternVL model without providing any image input. Batched image and text inputs InternVL models also support batched image and text inputs. Batched multi-image input This implementation of the InternVL models supports batched text-image inputs with a different number of images for each text. Video input InternVL models can also handle video inputs. Here is an example of how to perform inference on a video input using chat templates. Interleaved image and video inputs This example showcases how to handle a batch of chat conversations with interleaved image and video inputs using chat templates. This project is released under the MIT License. This project uses the pre-trained Qwen2.5 as a component, which is licensed under the Qwen License. If you find this project useful in your research, please consider citing:

β€”
27,067
3

InternVideo2_5_Chat_8B

license:apache-2.0
17,992
85

InternVL2_5-1B

[\[πŸ“‚ GitHub\]](https://github.com/OpenGVLab/InternVL) [\[πŸ“œ InternVL 1.0\]](https://huggingface.co/papers/2312.14238) [\[πŸ“œ InternVL 1.5\]](https://huggingface.co/papers/2404.16821) [\[πŸ“œ Mini-InternVL\]](https://arxiv.org/abs/2410.16261) [\[πŸ“œ InternVL 2.5\]](https://huggingface.co/papers/2412.05271) [\[πŸ†• Blog\]](https://internvl.github.io/blog/) [\[πŸ—¨οΈ Chat Demo\]](https://internvl.opengvlab.com/) [\[πŸ€— HF Demo\]](https://huggingface.co/spaces/OpenGVLab/InternVL) [\[πŸš€ Quick Start\]](#quick-start) [\[πŸ“– Documents\]](https://internvl.readthedocs.io/en/latest/) We are excited to introduce InternVL 2.5, an advanced multimodal large language model (MLLM) series that builds upon InternVL 2.0, maintaining its core model architecture while introducing significant enhancements in training and testing strategies as well as data quality. In the following table, we provide an overview of the InternVL 2.5 series.

| Model Name | Vision Part | Language Part | HF Link |
| :-------------: | :-----------------------: | :-------------------: | :-----: |
| InternVL2_5-1B | InternViT-300M-448px-V2_5 | Qwen2.5-0.5B-Instruct | πŸ€— link |
| InternVL2_5-2B | InternViT-300M-448px-V2_5 | internlm2_5-1_8b-chat | πŸ€— link |
| InternVL2_5-4B | InternViT-300M-448px-V2_5 | Qwen2.5-3B-Instruct | πŸ€— link |
| InternVL2_5-8B | InternViT-300M-448px-V2_5 | internlm2_5-7b-chat | πŸ€— link |
| InternVL2_5-26B | InternViT-6B-448px-V2_5 | internlm2_5-20b-chat | πŸ€— link |
| InternVL2_5-38B | InternViT-6B-448px-V2_5 | Qwen2.5-32B-Instruct | πŸ€— link |
| InternVL2_5-78B | InternViT-6B-448px-V2_5 | Qwen2.5-72B-Instruct | πŸ€— link |

As shown in the following figure, InternVL 2.5 retains the same model architecture as its predecessors, InternVL 1.5 and 2.0, following the "ViT-MLP-LLM" paradigm.
In this new version, we integrate a newly incrementally pre-trained InternViT with various pre-trained LLMs, including InternLM 2.5 and Qwen 2.5, using a randomly initialized MLP projector. As in the previous version, we applied a pixel unshuffle operation, reducing the number of visual tokens to one-quarter of the original. Besides, we adopted a similar dynamic resolution strategy as InternVL 1.5, dividing images into tiles of 448Γ—448 pixels. The key difference, starting from InternVL 2.0, is that we additionally introduced support for multi-image and video data. In InternVL 2.0 and 2.5, we extend the dynamic high-resolution training approach, enhancing its capabilities to handle multi-image and video datasets.

- For single-image datasets, the total number of tiles `n_max` is allocated to a single image for maximum resolution. Visual tokens are enclosed in `<img>` and `</img>` tags.
- For multi-image datasets, the total number of tiles `n_max` is distributed across all images in a sample. Each image is labeled with auxiliary tags like `Image-1` and enclosed in `<img>` and `</img>` tags.
- For videos, each frame is resized to 448Γ—448. Frames are labeled with tags like `Frame-1` and enclosed in `<img>` and `</img>` tags, similar to images.

The training pipeline for a single model in InternVL 2.5 is structured across three stages, designed to enhance the model's visual perception and multimodal capabilities.

- Stage 1: MLP Warmup. In this stage, only the MLP projector is trained while the vision encoder and language model are frozen. A dynamic high-resolution training strategy is applied for better performance, despite increased cost. This phase ensures robust cross-modal alignment and prepares the model for stable multimodal training.
- Stage 1.5: ViT Incremental Learning (Optional). This stage allows incremental training of the vision encoder and MLP projector using the same data as Stage 1.
It enhances the encoder’s ability to handle rare domains like multilingual OCR and mathematical charts. Once trained, the encoder can be reused across LLMs without retraining, making this stage optional unless new domains are introduced. - Stage 2: Full Model Instruction Tuning. The entire model is trained on high-quality multimodal instruction datasets. Strict data quality controls are enforced to prevent degradation of the LLM, as noisy data can cause issues like repetitive or incorrect outputs. After this stage, the training process is complete. We introduce a progressive scaling strategy to align the vision encoder with LLMs efficiently. This approach trains with smaller LLMs first (e.g., 20B) to optimize foundational visual capabilities and cross-modal alignment before transferring the vision encoder to larger LLMs (e.g., 72B) without retraining. This reuse skips intermediate stages for larger models. Compared to Qwen2-VL's 1.4 trillion tokens, InternVL2.5-78B uses only 120 billion tokensβ€”less than one-tenth. This strategy minimizes redundancy, maximizes pre-trained component reuse, and enables efficient training for complex vision-language tasks. To improve real-world adaptability and performance, we introduce two key techniques: - Random JPEG Compression: Random JPEG compression with quality levels between 75 and 100 is applied as a data augmentation technique. This simulates image degradation from internet sources, enhancing the model's robustness to noisy images. - Loss Reweighting: To balance the NTP loss across responses of different lengths, we use a reweighting strategy called square averaging. This method balances contributions from responses of varying lengths, mitigating biases toward longer or shorter responses. In InternVL 2.0 and 2.5, the organization of the training data is controlled by several key parameters to optimize the balance and distribution of datasets during training. 
- Data Augmentation: JPEG compression is applied conditionally: enabled for image datasets to enhance robustness and disabled for video datasets to maintain consistent frame quality.
- Maximum Tile Number: The parameter `n_max` controls the maximum tiles per dataset. For example, higher values (24–36) are used for multi-image or high-resolution data, lower values (6–12) for standard images, and 1 for videos.
- Repeat Factor: The repeat factor `r` adjusts dataset sampling frequency. Values below 1 reduce a dataset's weight, while values above 1 increase it. This ensures balanced training across tasks and prevents overfitting or underfitting.

During development, we found that LLMs are highly sensitive to data noise, with even small anomaliesβ€”like outliers or repetitive dataβ€”causing abnormal behavior during inference. Repetitive generation, especially in long-form or CoT reasoning tasks, proved particularly harmful. To address this challenge and support future research, we designed an efficient data filtering pipeline to remove low-quality samples. The pipeline includes two modules. For pure-text data, three key strategies are used:

1. LLM-Based Quality Scoring: Each sample is scored (0–10) using a pre-trained LLM with domain-specific prompts. Samples scoring below a threshold (e.g., 7) are removed to ensure high-quality data.
2. Repetition Detection: Repetitive samples are flagged using LLM-based prompts and manually reviewed. Samples scoring below a stricter threshold (e.g., 3) are excluded to avoid repetitive patterns.
3. Heuristic Rule-Based Filtering: Anomalies like abnormal sentence lengths or duplicate lines are detected using rules. Flagged samples undergo manual verification to ensure accuracy before removal.

For multimodal data, two strategies are used:

1. Repetition Detection: Repetitive samples in non-academic datasets are flagged and manually reviewed to prevent pattern loops. High-quality datasets are exempt from this process.
2.
Heuristic Rule-Based Filtering: Similar rules are applied to detect visual anomalies, with flagged data verified manually to maintain integrity. As shown in the following figure, from InternVL 1.5 to 2.0 and then to 2.5, the fine-tuning data mixture has undergone iterative improvements in scale, quality, and diversity. For more information about the training data, please refer to our technical report. Comprehensive Multimodal & Hallucination Evaluation Training InternVL 2.0 models led to a decline in pure language capabilities. InternVL 2.5 addresses this by collecting more high-quality open-source data and filtering out low-quality data, achieving better preservation of pure language performance. We provide an example code to run `InternVL2_5-1B` using `transformers`. > Please use transformers>=4.37.2 to ensure the model works normally. The reason for writing the code this way is to avoid errors that occur during multi-GPU inference due to tensors not being on the same device. By ensuring that the first and last layers of the large language model (LLM) are on the same device, we prevent such errors. Besides this method, you can also use the following code to get streamed output. Many repositories now support fine-tuning of the InternVL series models, including InternVL, SWIFT, XTuner, and others. Please refer to their documentation for more details on fine-tuning. LMDeploy is a toolkit for compressing, deploying, and serving LLMs & VLMs. LMDeploy abstracts the complex inference process of multi-modal Vision-Language Models (VLM) into an easy-to-use pipeline, similar to the Large Language Model (LLM) inference pipeline. If `ImportError` occurs while executing this case, please install the required dependency packages as prompted. When dealing with multiple images, you can put them all in one list. Keep in mind that multiple images will lead to a higher number of input tokens, and as a result, the size of the context window typically needs to be increased.
Conducting inference with batch prompts is quite straightforward; just place them within a list structure: There are two ways to do multi-turn conversations with the pipeline. One is to construct messages according to the OpenAI format and use the method introduced above; the other is to use the `pipeline.chat` interface. LMDeploy's `api_server` enables models to be easily packed into services with a single command. The provided RESTful APIs are compatible with OpenAI's interfaces. Below is an example of service startup: To use the OpenAI-style interface, you need to install OpenAI: This project is released under the MIT License. This project uses the pre-trained Qwen2.5-0.5B-Instruct as a component, which is licensed under the Apache License 2.0. If you find this project useful in your research, please consider citing:

license:mit
17,800
62

InternVL3_5-4B

[\[πŸ“‚ GitHub\]](https://github.com/OpenGVLab/InternVL) [\[πŸ“œ InternVL 1.0\]](https://huggingface.co/papers/2312.14238) [\[πŸ“œ InternVL 1.5\]](https://huggingface.co/papers/2404.16821) [\[πŸ“œ InternVL 2.5\]](https://huggingface.co/papers/2412.05271) [\[πŸ“œ InternVL2.5-MPO\]](https://huggingface.co/papers/2411.10442) [\[πŸ“œ InternVL3\]](https://huggingface.co/papers/2504.10479) [\[πŸ“œ InternVL3.5\]](https://huggingface.co/papers/2508.18265) [\[πŸ†• Blog\]](https://internvl.github.io/blog/) [\[πŸ—¨οΈ Chat Demo\]](https://chat.intern-ai.org.cn/) [\[πŸš€ Quick Start\]](#quick-start) [\[πŸ“– Documents\]](https://internvl.readthedocs.io/en/latest/) We introduce InternVL3.5, a new family of open-source multimodal models that significantly advances versatility, reasoning capability, and inference efficiency along the InternVL series. A key innovation is the Cascade Reinforcement Learning (Cascade RL) framework, which enhances reasoning through a two-stage process: offline RL for stable convergence and online RL for refined alignment. This coarse-to-fine training strategy leads to substantial improvements on downstream reasoning tasks, e.g., MMMU and MathVista. To optimize efficiency, we propose a Visual Resolution Router (ViR) that dynamically adjusts the resolution of visual tokens without compromising performance. Coupled with ViR, our Decoupled Vision-Language Deployment (DvD) strategy separates the vision encoder and language model across different GPUs, effectively balancing computational load. These contributions collectively enable InternVL3.5 to achieve up to a +16.0\% gain in overall reasoning performance and a 4.05 \\(\times\\) inference speedup compared to its predecessor, i.e., InternVL3. In addition, InternVL3.5 supports novel capabilities such as GUI interaction and embodied agency. 
Notably, our largest model, i.e., InternVL3.5-241B-A28B, attains state-of-the-art results among open-source MLLMs across general multimodal, reasoning, text, and agentic tasksβ€”narrowing the performance gap with leading commercial models like GPT-5. All models and code are publicly released. > Hatched bars represent closed-source commercial models. We report average scores on a set of multimodal general, reasoning, text, and agentic benchmarks: MMBench v1.1 (en), MMStar, BLINK, HallusionBench, AI2D, OCRBench, MMVet, MME-RealWorld (en), MVBench, VideoMME, MMMU, MathVista, MathVision, MathVerse, DynaMath, WeMath, LogicVista, MATH500, AIME24, AIME25, GPQA, MMLU-Pro, GAOKAO, IFEval, SGP-Bench, VSI-Bench, ERQA, SpaCE-10, and OmniSpatial. In the following table, we provide an overview of the InternVL3.5 series. To maintain consistency with earlier generations, we provide two model formats: the GitHub format, consistent with prior releases, and the HF format, aligned with the official Transformers standard. > If you want to convert the checkpoint between these two formats, please refer to the scripts about custom2hf and hf2custom.
| Model | #Vision Param | #Language Param | #Total Param | HF Link | ModelScope Link |
| --------------------- | ------------- | --------------- | ------------ | ------- | --------------- |
| InternVL3.5-1B | 0.3B | 0.8B | 1.1B | πŸ€— link | πŸ€– link |
| InternVL3.5-2B | 0.3B | 2.0B | 2.3B | πŸ€— link | πŸ€– link |
| InternVL3.5-4B | 0.3B | 4.4B | 4.7B | πŸ€— link | πŸ€– link |
| InternVL3.5-8B | 0.3B | 8.2B | 8.5B | πŸ€— link | πŸ€– link |
| InternVL3.5-14B | 0.3B | 14.8B | 15.1B | πŸ€— link | πŸ€– link |
| InternVL3.5-38B | 5.5B | 32.8B | 38.4B | πŸ€— link | πŸ€– link |
| InternVL3.5-20B-A4B | 0.3B | 20.9B | 21.2B-A4B | πŸ€— link | πŸ€– link |
| InternVL3.5-30B-A3B | 0.3B | 30.5B | 30.8B-A3B | πŸ€— link | πŸ€– link |
| InternVL3.5-241B-A28B | 5.5B | 235.1B | 240.7B-A28B | πŸ€— link | πŸ€– link |

| Model | #Vision Param | #Language Param | #Total Param | HF Link | ModelScope Link |
| ------------------------ | ------------- | --------------- | ------------ | ------- | --------------- |
| InternVL3.5-1B-HF | 0.3B | 0.8B | 1.1B | πŸ€— link | πŸ€– link |
| InternVL3.5-2B-HF | 0.3B | 2.0B | 2.3B | πŸ€— link | πŸ€– link |
| InternVL3.5-4B-HF | 0.3B | 4.4B | 4.7B | πŸ€— link | πŸ€– link |
| InternVL3.5-8B-HF | 0.3B | 8.2B | 8.5B | πŸ€— link | πŸ€– link |
| InternVL3.5-14B-HF | 0.3B | 14.8B | 15.1B | πŸ€— link | πŸ€– link |
| InternVL3.5-38B-HF | 5.5B | 32.8B | 38.4B | πŸ€— link | πŸ€– link |
| InternVL3.5-20B-A4B-HF | 0.3B | 20.9B | 21.2B-A4B | πŸ€— link | πŸ€– link |
| InternVL3.5-30B-A3B-HF | 0.3B | 30.5B | 30.8B-A3B | πŸ€— link | πŸ€– link |
| InternVL3.5-241B-A28B-HF | 5.5B | 235.1B | 240.7B-A28B | πŸ€— link | πŸ€– link |

> We conduct the evaluation with VLMEvalKit.
To enable the Thinking mode of our model, please set the system prompt to `R1_SYSTEM_PROMPT`. When enabling Thinking mode, we recommend setting `do_sample=True` and `temperature=0.6` to mitigate undesired repetition. Our training pipeline comprises four stages: Multimodal Continual Pre-Training (CPT), Supervised Fine-Tuning (SFT), and the two-stage Cascade Reinforcement Learning (Cascade RL). In Cascade RL, we first fine-tune the model using Mixed Preference Optimization (MPO) under an offline RL setting, followed by GSPO under an online RL setting. For the Flash version of InternVL3.5, we additionally introduce a lightweight training stage, termed Visual Consistency Learning (ViCO), which reduces the token cost required to represent an image patch. Here, we also open-source the model weights after different training stages for potential research usage. If you're unsure which version to use, please select the one without any suffix, as it has completed the full training pipeline.

| Model | Training Pipeline | HF Link | ModelScope Link |
| -------------------------------- | --------------------- | ------- | --------------- |
| InternVL3.5-1B-Pretrained | CPT | πŸ€— link | πŸ€– link |
| InternVL3.5-1B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link |
| InternVL3.5-1B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link |
| InternVL3.5-1B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link |
| InternVL3.5-2B-Pretrained | CPT | πŸ€— link | πŸ€– link |
| InternVL3.5-2B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link |
| InternVL3.5-2B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link |
| InternVL3.5-2B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link |
| InternVL3.5-4B-Pretrained | CPT | πŸ€— link | πŸ€– link |
| InternVL3.5-4B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link |
| InternVL3.5-4B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link |
| InternVL3.5-4B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link |
| InternVL3.5-8B-Pretrained | CPT | πŸ€— link | πŸ€– link |
| InternVL3.5-8B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link |
| InternVL3.5-8B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link |
| InternVL3.5-8B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link |
| InternVL3.5-14B-Pretrained | CPT | πŸ€— link | πŸ€– link |
| InternVL3.5-14B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link |
| InternVL3.5-14B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link |
| InternVL3.5-14B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link |
| InternVL3.5-30B-A3B-Pretrained | CPT | πŸ€— link | πŸ€– link |
| InternVL3.5-30B-A3B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link |
| InternVL3.5-30B-A3B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link |
| InternVL3.5-30B-A3B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link |
| InternVL3.5-38B-Pretrained | CPT | πŸ€— link | πŸ€– link |
| InternVL3.5-38B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link |
| InternVL3.5-38B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link |
| InternVL3.5-38B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link |
| InternVL3.5-241B-A28B-Pretrained | CPT | πŸ€— link | πŸ€– link |
| InternVL3.5-241B-A28B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link |
| InternVL3.5-241B-A28B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link |
| InternVL3.5-241B-A28B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link |

The Flash version of our model will be released as soon as possible. `InternVL3.5`: This series of models follows the "ViT–MLP–LLM" paradigm adopted in previous versions of InternVL. We initialize the language model using the Qwen3 series and GPT-OSS, and the vision encoder using InternViT-300M and InternViT-6B. The Dynamic High Resolution strategy introduced in InternVL 1.5 is also retained in our design. `InternVL3.5-Flash`: Compared to InternVL3.5, InternVL3.5-Flash further integrates the Visual Resolution Router (ViR), yielding a series of efficient variants suitable for resource-constrained scenarios.
Specifically, in InternVL3.5, each image patch is initially represented as 1024 visual tokens for the vision encoder, which are then compressed into 256 tokens via a pixel shuffle module before being passed to the Large Language Model (LLM). In InternVL3.5-Flash, as shown in the Figure below, an additional pixel shuffle module with a higher compression rate is included, enabling the compression of visual tokens down to 64 tokens. For each patch, the patch router determines the appropriate compression rate by assessing its semantic richness, and routes it to the corresponding pixel shuffle module accordingly. Benefiting from this patch-aware compression mechanism, InternVL3.5-Flash reduces the number of visual tokens by 50\% while maintaining nearly 100\% of the performance of InternVL3.5.

During the pre-training stage, we update all model parameters jointly using a combination of large-scale text and multimodal corpora. Specifically, given an arbitrary training sample consisting of a multimodal token sequence \\(\mathbf{x}=\left(x_1, x_2, \ldots, x_L\right)\\), the next token prediction (NTP) loss is calculated on each text token as follows: $$ \mathcal{L}_{i}=-\log p_{\theta}\left(x_i \mid x_1, \ldots, x_{i-1}\right), $$ where \\(x_i\\) is the predicted token and the prefix tokens in \\(\{x_1, x_2, \ldots, x_{i-1}\}\\) can be either text tokens or image tokens. Notably, for conversation samples, only response tokens are included in the calculation of the loss. Additionally, to mitigate bias toward either longer or shorter responses during training, we adopt square-root averaging to re-weight the NTP loss as follows: $$ \mathcal{L}_{i}^{'} = \frac{w_i}{\sum_j w_j} \cdot \mathcal{L}_i, \quad w_i = \frac{1}{N^{0.5}}, $$ where \\(N\\) denotes the number of tokens in the training sample on which the loss is calculated. Random JPEG compression is also applied to enhance the model's real-world performance.
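The square-root re-weighting above can be sketched in a few lines of plain Python. This is an illustrative re-implementation (the function and variable names are ours, not from the InternVL codebase), assuming per-token NTP losses have already been computed for each sample in a batch:

```python
import math

def sqrt_reweighted_loss(per_token_losses):
    """Combine per-sample NTP losses with square-root re-weighting.

    per_token_losses: one inner list of token losses per training sample.
    Each token in a sample with N loss tokens gets weight w = 1 / N**0.5,
    and the final loss is the weight-normalized sum over all tokens.
    """
    weights, losses = [], []
    for sample in per_token_losses:
        w = 1.0 / math.sqrt(len(sample))  # w_i = 1 / N^0.5
        for token_loss in sample:
            weights.append(w)
            losses.append(token_loss)
    return sum(w * l for w, l in zip(weights, losses)) / sum(weights)
```

Compared with plain token averaging, long samples no longer dominate the batch, while very short samples are not over-weighted either.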
During the SFT phase, we adopt the same objective as in the pre-training stage and use the square-root averaging strategy to calculate the final loss. In this stage, the context window is set to 32K tokens to accommodate long-context information. Compared to InternVL3, the SFT stage of InternVL3.5 contains more high-quality and diverse training data derived from three sources: (1) Instruction-following data from InternVL3, which are reused to preserve broad coverage of vision–language tasks. (2) Multimodal reasoning data in the "Thinking" mode, which are included to instill long-thinking capabilities in the model. To construct such data, we first use InternVL3-78B to describe the image and then input the description into DeepSeek-R1 to sample rollouts with detailed reasoning processes. Rollouts with an incorrect final answer are filtered out. The questions in these datasets cover various expert domains, such as mathematics and scientific disciplines, thereby strengthening performance on different reasoning tasks. (3) Capability-expansion datasets, which endow InternVL3.5 with new skills, including GUI-based interaction, embodied interaction, and scalable vector graphics (SVG) understanding and generation.

Cascade RL aims to combine the benefits of offline RL and online RL to progressively facilitate the post-training of MLLMs in an efficient manner. Specifically, we first fine-tune the model using an offline RL algorithm as an efficient warm-up stage to reach satisfactory results, which guarantees high-quality rollouts for the subsequent stage. We then employ an online RL algorithm to further refine the output distribution based on rollouts generated by the model itself. Compared to a single offline or online RL stage, our cascaded RL achieves significant performance improvements at a fraction of the GPU time cost. During the offline RL stage, we employ mixed preference optimization (MPO) to fine-tune the model.
Specifically, the training objective of MPO is a combination of preference loss \\(\mathcal{L}_{p}\\), quality loss \\(\mathcal{L}_{q}\\), and generation loss \\(\mathcal{L}_{g}\\), which can be formulated as follows: $$ \mathcal{L}_{\text{MPO}}= w_{p} \mathcal{L}_{p} + w_{q} \mathcal{L}_{q} + w_{g} \mathcal{L}_{g}, $$ where \\(w_{*}\\) represents the weight assigned to each loss component. The DPO loss, BCO loss, and LM loss serve as the preference loss, quality loss, and generation loss, respectively. During the online RL stage, we employ GSPO, without reference model constraints, as our online RL algorithm, which we find more effective for training both dense and mixture-of-experts (MoE) models. Similar to GRPO, the advantage is defined as the normalized reward across responses sampled from the same query. The training objective of GSPO is given by: $$ \mathcal{L}_{\mathrm{GSPO}}(\theta)=\mathbb{E}_{x \sim \mathcal{D},\,\left\{y_i\right\}_{i=1}^{G} \sim \pi_{\theta_{\text{old}}}(\cdot \mid x)}\left[\frac{1}{G} \sum_{i=1}^{G} \min \left(s_i(\theta) \widehat{A}_i, \operatorname{clip}\left(s_i(\theta), 1-\varepsilon, 1+\varepsilon\right) \widehat{A}_i\right)\right], $$ where the importance sampling ratio \\(s_i(\theta)\\) is defined as the geometric mean of the per-token ratios.

> Please see our paper for more technical and experimental details.

We further include ViCO as an additional training stage to integrate the visual resolution router (ViR) into InternVL3.5, thereby reducing its inference cost. The resulting efficient version of InternVL3.5 is termed InternVL3.5-Flash. In particular, ViCO comprises two stages:

`Consistency training`: In this stage, the entire model is trained to minimize the divergence between response distributions conditioned on visual tokens with different compression rates. In practice, we introduce an extra reference model, which is frozen and initialized with InternVL3.5.
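As a concrete illustration of the clipped sequence-level objective, the following plain-Python sketch (our own toy re-implementation, not the training code) computes the GSPO surrogate for one group of sampled responses, using the geometric-mean importance ratio and a GRPO-style group-normalized advantage:

```python
import math

def gspo_surrogate(logp_new, logp_old, rewards, eps=0.2):
    """Toy GSPO surrogate for one group of G sampled responses.

    logp_new / logp_old: per-response lists of per-token log-probs under the
    current and old policies; rewards: one scalar reward per response.
    """
    # Group-normalized advantage, as in GRPO.
    mean_r = sum(rewards) / len(rewards)
    std_r = math.sqrt(sum((r - mean_r) ** 2 for r in rewards) / len(rewards))
    advs = [(r - mean_r) / (std_r + 1e-8) for r in rewards]

    terms = []
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advs):
        # Sequence-level ratio s_i = geometric mean of per-token ratios
        #                         = exp(mean per-token log-ratio).
        s = math.exp(sum(n - o for n, o in zip(lp_new, lp_old)) / len(lp_new))
        clipped = min(max(s, 1.0 - eps), 1.0 + eps)
        terms.append(min(s * adv, clipped * adv))
    return sum(terms) / len(terms)
```

With identical old and new policies the ratio is 1 for every response and the surrogate reduces to the mean advantage (zero, by normalization); increasing the likelihood of a higher-reward response raises the surrogate, with clipping bounding the update.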
Given a sample, each image patch is represented as either 256 or 64 tokens, and the training objective is defined as follows: $$ \mathcal{L}_{\text{ViCO}} = \mathbb{E}_{\xi \sim \mathcal{R}} \left[ \frac{1}{N} \sum_{i=1}^{N} \mathrm{KL} \Big( \pi_{\theta_{\text{ref}}}\big(y_i \mid y_{<i}\big) \;\big\|\; \pi_{\theta}\big(y_i \mid y_{<i}, \xi\big) \Big) \right], $$ where \\(\mathcal{R}\\) denotes the distribution of candidate compression rates, \\(\xi\\) is the sampled rate applied to the visual tokens, and \\(N\\) is the number of response tokens.

`Router training`: In this stage, only the ViR is trained to select the appropriate compression rate for each image patch, while the rest of the model is kept frozen.

> Please see our paper for more technical and experimental details.

Test-time scaling (TTS) has been empirically demonstrated as an effective approach to enhance the reasoning capabilities of LLMs and MLLMs, particularly for complex tasks necessitating multi-step inference. In this work, we implement a comprehensive test-time scaling approach that simultaneously improves reasoning depth (i.e., deep thinking) and breadth (i.e., parallel thinking).

`Deep Thinking`: By activating the Thinking mode, we guide the model to deliberately engage in step-by-step reasoning (i.e., decomposing complex problems into logical steps and validating intermediate conclusions) prior to generating the final answer. This approach systematically improves the logical structure of solutions for complex problems, particularly those requiring multi-step inference, and enhances reasoning depth.

`Parallel Thinking`: Following InternVL3, for reasoning tasks, we adopt the Best-of-N (BoN) strategy by employing VisualPRM-v1.1 as the critic model to select the optimal response from multiple reasoning candidates. This approach improves reasoning breadth.

> Notably, unless otherwise specified, the experimental results reported in our paper are obtained without applying TTS. Thus far, we have only applied TTS to reasoning benchmarks, since we found that the model already exhibits strong perception and understanding capabilities, and applying TTS yields no significant improvement.

In multimodal inference, the vision encoder and language model have distinct computational characteristics. The vision encoder that transforms images into semantic features is highly parallelizable and does not rely on long-term history states.
In contrast, the language model performs inference in an autoregressive manner, which requires previous states to compute the next token. This sequential property makes the language part more sensitive to memory bandwidth and latency. When MLLMs are deployed online at scale, the vision and language models often block each other, thus incurring additional inference cost. This effect becomes more pronounced with larger vision models or higher-resolution images. As shown in the Figure above, we propose decoupled vision-language deployment (DvD) to address this issue by separating vision and language processing, with a particular focus on optimizing the prefilling stage. The vision subsystem batches and processes images to produce compact feature embeddings, which are then transmitted to the language subsystem for fusion with the text context prior to decoding. This separation alleviates blocking and brings multimodal prefilling performance closer to that of pure language models. In our system implementation, the ViT and MLP (and ViR for InternVL3.5-Flash) are deployed on the vision server, while the language server executes only the LLM. The communication is unidirectional, transmitting BF16 visual features over TCP, with RDMA optionally employed for higher transmission speed. Vision processing, feature transmission, and language processing are organized into an asynchronous three-stage pipeline, enabling overlapped execution and minimizing pipeline stalls. DvD increases GPU utilization and processing efficiency on the vision side, while enabling the language server to focus exclusively on the LLM’s prefilling and decoding without being blocked by vision computation. This design leads to improved throughput and responsiveness. Moreover, the architecture supports independent hardware cost optimization for the vision and language modules, and facilitates the seamless integration of new modules without requiring modifications to the language server deployment.
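The asynchronous three-stage pipeline can be pictured with a small producer/consumer sketch. This is purely schematic (thread and queue names are ours, and strings stand in for real feature tensors and the TCP/RDMA transport), but it shows how the three stages overlap instead of blocking one another:

```python
import queue
import threading

def run_pipeline(images):
    """Toy three-stage pipeline: vision encoding -> feature transmission
    -> language prefilling/decoding. Each stage runs in its own thread,
    connected by queues, so stages overlap as in DvD."""
    q_feat, q_recv, results = queue.Queue(), queue.Queue(), []
    STOP = object()  # sentinel to shut each stage down

    def vision_worker():   # stage 1: batchable, stateless vision encoding
        for img in images:
            q_feat.put(f"feat({img})")
        q_feat.put(STOP)

    def transmit_worker():  # stage 2: ship features to the language server
        while (item := q_feat.get()) is not STOP:
            q_recv.put(item)  # stands in for sending BF16 features over TCP/RDMA
        q_recv.put(STOP)

    def language_worker():  # stage 3: autoregressive prefill + decode
        while (item := q_recv.get()) is not STOP:
            results.append(f"decode({item})")

    workers = (vision_worker, transmit_worker, language_worker)
    threads = [threading.Thread(target=w) for w in workers]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Because the queues decouple the stages, the vision thread can already encode image *k+1* while features for image *k* are in flight and the language thread is prefilling image *k-1*.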
Multi-Image Understanding & Real-World Comprehension

Comprehensive Multimodal Understanding & Multimodal Hallucination Evaluation

We provide example code to run `InternVL3.5-8B` using `transformers`. Please note that our models with up to 30B parameters can be deployed on a single A100 GPU, while the 38B model requires two A100 GPUs and the 241B-A28B model requires eight A100 GPUs.

> In most cases, both LMDeploy and vLLM can be used for model deployment. However, for InternVL3.5-20B-A4B, we recommend using vLLM, since LMDeploy does not yet support GPT-OSS.

> Please use transformers>=4.52.1 to ensure the model works normally. For the 20B version of our model, transformers>=4.55.0 is required.

To enable Thinking mode, please set the system prompt to our Thinking System Prompt. When enabling Thinking mode, we recommend setting `do_sample=True` and `temperature=0.6` to mitigate undesired repetition. Besides this method, you can also use the following code to get streamed output.

Many repositories now support fine-tuning of the InternVL series models, including InternVL, SWIFT, XTuner, and others. Please refer to their documentation for more details on fine-tuning.

LMDeploy is a toolkit for compressing, deploying, and serving LLMs & VLMs. LMDeploy abstracts the complex inference process of multi-modal Vision-Language Models (VLM) into an easy-to-use pipeline, similar to the Large Language Model (LLM) inference pipeline.

When dealing with multiple images, you can put them all in one list. Keep in mind that multiple images will lead to a higher number of input tokens, and as a result, the size of the context window typically needs to be increased.

Conducting inference with batch prompts is quite straightforward; just place them within a list structure:

There are two ways to do multi-turn conversations with the pipeline. One is to construct messages according to the OpenAI format and use the method introduced above; the other is to use the `pipeline.chat` interface.
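For the OpenAI-format route to multi-turn conversation, the messages object is simply a growing list of role-tagged turns. The snippet below is schematic: the helper name, image URL, and assistant reply are placeholders of ours, not actual LMDeploy outputs; only the message structure follows the OpenAI chat format:

```python
# Build an OpenAI-format multi-turn conversation (e.g. for an LMDeploy pipeline).
# The pipeline call itself is elided; after it returns, its text is appended
# as an "assistant" turn and the next question as a new "user" turn.

def make_user_turn(text, image_url=None):
    """Compose one user message; images ride along as extra content items."""
    content = [{"type": "text", "text": text}]
    if image_url:
        content.append({"type": "image_url", "image_url": {"url": image_url}})
    return {"role": "user", "content": content}

messages = [make_user_turn("Describe this image.", "https://example.com/tiger.jpg")]
# ... run the pipeline on `messages`, then extend the history:
reply_text = "A tiger lying on the grass."  # placeholder for the model's reply
messages.append({"role": "assistant", "content": reply_text})
messages.append(make_user_turn("What is the animal doing?"))
```

Keeping the full history in `messages` is what makes the follow-up question resolve "the animal" to the tiger from the first turn.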
LMDeploy's `api_server` enables models to be easily packed into services with a single command. The provided RESTful APIs are compatible with OpenAI's interfaces. Below is an example of service startup: To use the OpenAI-style interface, you need to install the OpenAI Python package: This project is released under the Apache-2.0 License. This project uses the pre-trained Qwen3 as a component, which is licensed under the Apache-2.0 License. If you find this project useful in your research, please consider citing:

license:apache-2.0
17,069
21

VideoMAEv2-Base

license:cc-by-nc-4.0
17,043
8

InternVL2_5-8B

[\[πŸ“‚ GitHub\]](https://github.com/OpenGVLab/InternVL) [\[πŸ“œ InternVL 1.0\]](https://huggingface.co/papers/2312.14238) [\[πŸ“œ InternVL 1.5\]](https://huggingface.co/papers/2404.16821) [\[πŸ“œ Mini-Int...

license:mit
16,478
96

InternVL2-2B-AWQ

license:mit
14,933
16

InternVL3_5-14B

[\[πŸ“‚ GitHub\]](https://github.com/OpenGVLab/InternVL) [\[πŸ“œ InternVL 1.0\]](https://huggingface.co/papers/2312.14238) [\[πŸ“œ InternVL 1.5\]](https://huggingface.co/papers/2404.16821) [\[πŸ“œ InternVL 2.5\]](https://huggingface.co/papers/2412.05271) [\[πŸ“œ InternVL2.5-MPO\]](https://huggingface.co/papers/2411.10442) [\[πŸ“œ InternVL3\]](https://huggingface.co/papers/2504.10479) [\[πŸ“œ InternVL3.5\]](https://huggingface.co/papers/2508.18265) [\[πŸ†• Blog\]](https://internvl.github.io/blog/) [\[πŸ—¨οΈ Chat Demo\]](https://chat.intern-ai.org.cn/) [\[πŸš€ Quick Start\]](#quick-start) [\[πŸ“– Documents\]](https://internvl.readthedocs.io/en/latest/) We introduce InternVL3.5, a new family of open-source multimodal models that significantly advances versatility, reasoning capability, and inference efficiency along the InternVL series. A key innovation is the Cascade Reinforcement Learning (Cascade RL) framework, which enhances reasoning through a two-stage process: offline RL for stable convergence and online RL for refined alignment. This coarse-to-fine training strategy leads to substantial improvements on downstream reasoning tasks, e.g., MMMU and MathVista. To optimize efficiency, we propose a Visual Resolution Router (ViR) that dynamically adjusts the resolution of visual tokens without compromising performance. Coupled with ViR, our Decoupled Vision-Language Deployment (DvD) strategy separates the vision encoder and language model across different GPUs, effectively balancing computational load. These contributions collectively enable InternVL3.5 to achieve up to a +16.0\% gain in overall reasoning performance and a 4.05 \\(\times\\) inference speedup compared to its predecessor, i.e., InternVL3. In addition, InternVL3.5 supports novel capabilities such as GUI interaction and embodied agency. 
Notably, our largest model, i.e., InternVL3.5-241B-A28B, attains state-of-the-art results among open-source MLLMs across general multimodal, reasoning, text, and agentic tasks, narrowing the performance gap with leading commercial models like GPT-5. All models and code are publicly released.

> Hatched bars represent closed-source commercial models. We report average scores on a set of multimodal general, reasoning, text, and agentic benchmarks: MMBench v1.1 (en), MMStar, BLINK, HallusionBench, AI2D, OCRBench, MMVet, MME-RealWorld (en), MVBench, VideoMME, MMMU, MathVista, MathVision, MathVerse, DynaMath, WeMath, LogicVista, MATH500, AIME24, AIME25, GPQA, MMLU-Pro, GAOKAO, IFEval, SGP-Bench, VSI-Bench, ERQA, SpaCE-10, and OmniSpatial.

In the following table, we provide an overview of the InternVL3.5 series. To maintain consistency with earlier generations, we provide two model formats: the GitHub format, consistent with prior releases, and the HF format, aligned with the official Transformers standard.

> If you want to convert the checkpoint between these two formats, please refer to the scripts about custom2hf and hf2custom.
| Model | #Vision Param | #Language Param | #Total Param | HF Link | ModelScope Link |
| --- | --- | --- | --- | --- | --- |
| InternVL3.5-1B | 0.3B | 0.8B | 1.1B | πŸ€— link | πŸ€– link |
| InternVL3.5-2B | 0.3B | 2.0B | 2.3B | πŸ€— link | πŸ€– link |
| InternVL3.5-4B | 0.3B | 4.4B | 4.7B | πŸ€— link | πŸ€– link |
| InternVL3.5-8B | 0.3B | 8.2B | 8.5B | πŸ€— link | πŸ€– link |
| InternVL3.5-14B | 0.3B | 14.8B | 15.1B | πŸ€— link | πŸ€– link |
| InternVL3.5-38B | 5.5B | 32.8B | 38.4B | πŸ€— link | πŸ€– link |
| InternVL3.5-20B-A4B | 0.3B | 20.9B | 21.2B-A4B | πŸ€— link | πŸ€– link |
| InternVL3.5-30B-A3B | 0.3B | 30.5B | 30.8B-A3B | πŸ€— link | πŸ€– link |
| InternVL3.5-241B-A28B | 5.5B | 235.1B | 240.7B-A28B | πŸ€— link | πŸ€– link |

| Model | #Vision Param | #Language Param | #Total Param | HF Link | ModelScope Link |
| --- | --- | --- | --- | --- | --- |
| InternVL3.5-1B-HF | 0.3B | 0.8B | 1.1B | πŸ€— link | πŸ€– link |
| InternVL3.5-2B-HF | 0.3B | 2.0B | 2.3B | πŸ€— link | πŸ€– link |
| InternVL3.5-4B-HF | 0.3B | 4.4B | 4.7B | πŸ€— link | πŸ€– link |
| InternVL3.5-8B-HF | 0.3B | 8.2B | 8.5B | πŸ€— link | πŸ€– link |
| InternVL3.5-14B-HF | 0.3B | 14.8B | 15.1B | πŸ€— link | πŸ€– link |
| InternVL3.5-38B-HF | 5.5B | 32.8B | 38.4B | πŸ€— link | πŸ€– link |
| InternVL3.5-20B-A4B-HF | 0.3B | 20.9B | 21.2B-A4B | πŸ€— link | πŸ€– link |
| InternVL3.5-30B-A3B-HF | 0.3B | 30.5B | 30.8B-A3B | πŸ€— link | πŸ€– link |
| InternVL3.5-241B-A28B-HF | 5.5B | 235.1B | 240.7B-A28B | πŸ€— link | πŸ€– link |

> We conduct the evaluation with VLMEvalKit.
To enable the Thinking mode of our model, please set the system prompt to R1SYSTEMPROMPT. When enabling Thinking mode, we recommend setting `dosample=True` and `temperature=0.6` to mitigate undesired repetition. Our training pipeline comprises four stages: Multimodal Continual Pre-Training (CPT), Supervised Fine-Tuning (SFT), and Cascade Reinforcement Learning (CascadeRL). In CascadeRL, we first fine-tune the model using Mixed Preference Optimization (MPO) under an offline RL setting, followed by GSPO under an oneline RL setting. For the Flash version of InternVL3.5, we additionally introduce a lightweight training stage, termed Visual Consistency Learning (ViCO), which reduces the token cost required to represent an image patch. Here, we also open-source the model weights after different training stages for potential research usage. If you're unsure which version to use, please select the one without any suffix, as it has completed the full training pipeline. | Model | Training Pipeline | HF Link | ModelScope Link | | -------------------------------- | --------------------- | --------------------------------------------------------------------------- | ------------------------------------------------------------------------------------- | | InternVL3.5-1B-Pretrained | CPT | πŸ€— link | πŸ€– link | | InternVL3.5-1B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link | | InternVL3.5-1B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link | | InternVL3.5-1B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link | | InternVL3.5-2B-Pretrained | CPT | πŸ€— link | πŸ€– link | | InternVL3.5-2B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link | | InternVL3.5-2B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link | | InternVL3.5-2B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link | | InternVL3.5-4B-Pretrained | CPT | πŸ€— link | πŸ€– link | | InternVL3.5-4B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link | | InternVL3.5-4B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link | | InternVL3.5-4B | CPT + SFT + 
CascadeRL | πŸ€— link | πŸ€– link | | InternVL3.5-8B-Pretrained | CPT | πŸ€— link | πŸ€– link | | InternVL3.5-8B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link | | InternVL3.5-8B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link | | InternVL3.5-8B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link | | InternVL3.5-14B-Pretrained | CPT | πŸ€— link | πŸ€– link | | InternVL3.5-14B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link | | InternVL3.5-14B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link | | InternVL3.5-14B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link | | InternVL3.5-30B-A3B-Pretrained | CPT | πŸ€— link | πŸ€– link | | InternVL3.5-30B-A3B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link | | InternVL3.5-30B-A3B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link | | InternVL3.5-30B-A3B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link | | InternVL3.5-38B-Pretrained | CPT | πŸ€— link | πŸ€– link | | InternVL3.5-38B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link | | InternVL3.5-38B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link | | InternVL3.5-38B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link | | InternVL3.5-241B-A28B-Pretrained | CPT | πŸ€— link | πŸ€– link | | InternVL3.5-241B-A28B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link | | InternVL3.5-241B-A28B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link | | InternVL3.5-241B-A28B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link | The Flash version of our model will be released as soon as possible. `InternVL3.5`: This series of models follow the "ViT–MLP–LLM" paradigm adopted in previous versions of InternVL. We initialize the language model using the Qwen3 series and GPT-OSS, and the vision encoder using InternViT-300M and InternViT-6B. The Dynamic High Resolution strategy introduced in InternVL1.5 is also retained in our design. `InternVL3.5-Flash`: Compared to InternVL3.5, InternVL3.5-Flash further integrates the Visual Resolution Router (ViR), thus yielding a series of efficient variants friendly suitable for resource-constrained scenarios. 
Specifically, in InternVL3.5, each image patch is initially represented as 1024 visual tokens for the vision encoder, which are then compressed into 256 tokens via a pixel shuffle module before being passed to the Large Language Model (LLM). In InternVL3.5-Flash, as shown in the Figure below, an additional pixel shuffle module with a higher compression rate is included, enabling the compression of visual tokens down to 64 tokens. For each patch, the patch router determines the appropriate compression rate by assessing its semantic richness, and routes it to the corresponding pixel shuffle module accordingly. Benefiting from this patch-aware compression mechanism, InternVL3.5-Flash is able to reduce the number of visual tokens by 50\% while maintaining nearly 100\% of the performance of InternVL3.5. During the pre-training stage, we update all model parameters jointly using the combination of large-scale text and multimodal corpora. Specifically, given an arbitrary training sample consisting of a multimodal token sequence \\(\mathbf{x}=\left(x1, x2, \ldots, xL\right)\\), the next token prediction (NTP) loss is calculated on each text token as follows: $$ \mathcal{L}{i}=-\log p\theta\left(xi \mid x1, \ldots, x{i-1}\right), $$ where \\(xi\\) is the predicted token and prefix tokens in \\(\{x1, x2, \ldots, x{i-1}\}\\) can be either text tokens or image tokens. Notably, for conversation samples, only response tokens are included for the calculation of the loss. Additionally, to mitigate bias toward either longer or shorter responses during training, we adopt the square averaging to re-weight the NTP loss as follows: $$ \mathcal{L}{i}^{'} = \frac{wi}{\sumj wj} \cdot \mathcal{L}i, \quad wi = \frac{1}{N^{0.5}}, $$ where \\(N\\) denotes the number of tokens in the training sample on which the loss needs to be calculated. The random JPEG compression is also included to enhance the model's real-world performance. 
During the SFT phase, we adopt the same objective as in the pre-training stage and use the square-root averaging strategy to calculate the final loss. In this stage, the context window is set to 32K tokens to adapt long-context information. Compared to InternVL3, the SFT stage of InternVL3.5 contains more high-quality and diverse training data derived from three sources: (1) Instruction-following data from InternVL3, which are reused to preserve broad coverage of vision–language tasks. (2) Multimodal reasoning data in the "Thinking" mode, which are included to instill long-thinking capabilities in the model. To construct such data, we first use InternVL3-78B to describe the image and then input the description into DeepSeek-R1 to sample rollouts with detailed reasoning processes. Rollouts with an incorrect final answer are filtered out. The questions in these datasets cover various expert domains, such as mathematics and scientific disciplines, thereby strengthening performance on different reasoning tasks. (3) Capability-expansion datasets, which endow InternVL3.5 with new skills, including GUI-based interaction, embodied interaction, and scalable vect Cascade RL aims to combine the benefits of offline RL and online RL to progressively facilitate the post-training of MLLMs in an efficient manner. Specifically, we first fine-tune the model using an offline RL algorithm as an efficient warm-up stage to reach a satisfied results, which can guarantee the high-quality rollouts for the latter stage. Subsequently, we employ an online RL algorithm to further refine the output distribution based on rollouts generated by the model itself. Compared to the single offline or online RL stage, our cascaded RL achieves significant performance improvements at a fraction of the GPU time cost. During the offline RL stage, we employ mixed preference optimization (MPO) to fine-tune the model. 
Specifically, the training objective of MPO is a combination of preference loss \\(\mathcal{L}{p}\\), quality loss \\(\mathcal{L}{q}\\), and generation loss \\(\mathcal{L}{g}\\), which can be formulated as follows: $$ \mathcal{L}{\text{MPO}}= w{p} \mathcal{L}{p} + w{q} \mathcal{L}{q} + w{g} \mathcal{L}{g} , $$ where \\(w{}\\) represents the weight assigned to each loss component. The DPO loss, BCO loss, and LM loss serve as the preference loss, quality loss, and generation loss, respectively. During the online RL stage, we employ GSPO, without reference model constraints, as our online RL algorithm, which we find more effective in training both dense and mixture-of-experts (MoE) models. Similar to GRPO, the advantage is defined as the normalized reward across responses sampled from the same query. The training objective of GSPO is given by: $$ \mathcal{L}{\mathrm{GSPO}}(\theta)=\mathbb{E}{x \sim \mathcal{D},\left\{yi\right\}{i=1}^G \sim \pi{\theta \text { old }}(\cdot \mid x)}\left[\frac{1}{G} \sum{i=1}^G \min \left(si(\theta) \widehat{A}i, \operatorname{clip}\left(si(\theta), 1-\varepsilon, 1+\varepsilon\right) \widehat{A}i\right)\right], $$ where the importance sampling ratio is defined as the geometric mean of the per-token ratios. > Please see our paper for more technical and experimental details. We further include ViCO as an additional training stage to integrate the visual resolution router (ViR) into InternVL3.5, thereby reducing the inference cost of InternVL3.5. The obtained efficient version of InternVL3.5 are termed as InternVL3.5-Flash. In particular, ViCO comprises two stages: `Consistency training`: In this stage, the entire model is trained to minimize the divergence between response distributions conditioned on visual tokens with different compression rates. In practice, we introduce an extra reference model, which is frozen and initialized with InternVL3.5. 
Given a sample, each image patch is represented as either 256 or 64 tokens, and the training objective is defined as follows: $$ \mathcal{L}\text{ViCO} = \mathbb{E}{\xi \sim \mathcal{R}} \Bigg \frac{1}{N} \sum{i=1}^{N} \mathrm{KL} \Big( \pi{\theta{ref}}\left(yi \mid y{ Please see [our paper for more technical and experimental details. Test-time scaling (TTS) has been empirically demonstrated as an effective approach to enhance the reasoning capabilities of LLMs and MLLMs, particularly for complex tasks necessitating multi-step inference. In this work, we implement a comprehensive test-time scaling approach that simultaneously improves reasoning depth (i.e., deep thinking) and breadth (i.e., parallel thinking). `Deep Thinking`: By activating the Thinking mode, we guide the model to deliberately engage in step-by-step reasoning (i.e., decomposing complex problems into logical steps and validating intermediate conclusions) prior to generating the final answer. This approach systematically improves the logical structure of solutions for complex problems, particularly those requiring multi-step inference, and enhances reasoning depth. `Parallel Thinking`: Following InternVL3, for reasoning tasks, we adopt the Best-of-N (BoN) strategy by employing VisualPRM-v1.1 as the critic model to select the optimal response from multiple reasoning candidates. This approach improves reasoning breadth. > Notably, unless otherwise specified, the experimental results reported in our paper are obtained without applying TTS. Thus far, we have only applied TTS to reasoning benchmarks, since we found that the model already exhibits strong perception and understanding capabilities, and initiating TTS yields no significant improvement. In multimodal inference, the vision encoder and language model have distinct computational characteristics. The vision encoder that transforms images into semantic features is highly parallelizable and does not rely on long-term history state. 
In contrast, the language model adopts the inference in an autoregressive manner, which requires previous states to compute the next one. This sequential property makes the language part more sensitive to memory bandwidth and latency. When MLLMs are deployed online at scale, the vision and language models often block each other, thus incurring additional inference cost. This effect becomes more pronounced with larger vision models or higher-resolution images. As shown in the Figure above, we propose decoupled vision-language deployment (DvD) to address this issue by separating vision and language processing, with a particular focus on optimizing the prefilling stage. The vision subsystem batches and processes images to produce compact feature embeddings, which are then transmitted to the language subsystem for fusion with the text context prior to decoding. This separation alleviates blocking and brings multimodal prefilling performance closer to that of pure language models. In our system implementation, the ViT and MLP (and ViR for InternVL3.5-Flash) are deployed on the vision server, while the language server executes only the LLM. The communication is unidirectional, transmitting BF16 visual features over TCP, with RDMA optionally employed to achieve higher transmission speed. Vision processing, feature transmission, and language processing are organized into an asynchronous three-stage pipeline, enabling overlapped execution and minimizing pipeline stalls. DvD increases GPU utilization and processing efficiency on the vision side, while enabling the language server to focus exclusively on the LLM’s prefilling and decoding without being blocked by vision computation. This design leads to improved throughput and responsiveness. Moreover, the architecture supports independent hardware cost optimization for the vision and language modules, and facilitates the seamless integration of new modules without requiring modifications to the language server deployment. 
Multi-Image Understanding & Real-World Comprehension Comprehensive Multimodal Understanding & Multimodal Hallucination Evaluation We provide an example code to run `InternVL3.5-8B` using `transformers`. Please note that our models with up to 30B parameters can be deployed on a single A100 GPU, while the 38B model requires two A100 GPUs and the 235B model requires eight A100 GPUs. > In most cases, both LMDeploy and vLLM can be used for model deployment. However, for InternVL3.5-20B-A4B, we recommend using vLLM since lmdeploy has not yet supported GPT-OSS. > Please use transformers>=4.52.1 to ensure the model works normally. For the 20B version of our model, transformers>=4.55.0 is required. To enable thinking mode, please set the system prompt to our Thinking System Prompt. When enabling Thinking mode, we recommend setting `dosample=True` and `temperature=0.6` to mitigate undesired repetition. Besides this method, you can also use the following code to get streamed output. Many repositories now support fine-tuning of the InternVL series models, including InternVL, SWIFT, XTuner, and others. Please refer to their documentation for more details on fine-tuning. LMDeploy is a toolkit for compressing, deploying, and serving LLMs & VLMs. LMDeploy abstracts the complex inference process of multi-modal Vision-Language Models (VLM) into an easy-to-use pipeline, similar to the Large Language Model (LLM) inference pipeline. When dealing with multiple images, you can put them all in one list. Keep in mind that multiple images will lead to a higher number of input tokens, and as a result, the size of the context window typically needs to be increased. Conducting inference with batch prompts is quite straightforward; just place them within a list structure: There are two ways to do the multi-turn conversations with the pipeline. One is to construct messages according to the format of OpenAI and use above introduced method, the other is to use the `pipeline.chat` interface. 
LMDeploy's `api_server` enables models to be easily packed into services with a single command. The provided RESTful APIs are compatible with OpenAI's interfaces. Below is an example of service startup: To use the OpenAI-style interface, you need to install OpenAI: This project is released under the Apache-2.0 License. This project uses the pre-trained Qwen3 as a component, which is licensed under the Apache-2.0 License. If you find this project useful in your research, please consider citing:

license:apache-2.0
12,783
26

InternVL3_5-8B-Instruct

Notably, our largest model, i.e., InternVL3.5-241B-A28B, attains state-of-the-art results among open-source MLLMs across general multimodal, reasoning, text, and agentic tasks, narrowing the performance gap with leading commercial models like GPT-5. All models and code are publicly released. > Hatched bars represent closed-source commercial models. We report average scores on a set of multimodal general, reasoning, text, and agentic benchmarks: MMBench v1.1 (en), MMStar, BLINK, HallusionBench, AI2D, OCRBench, MMVet, MME-RealWorld (en), MVBench, VideoMME, MMMU, MathVista, MathVision, MathVerse, DynaMath, WeMath, LogicVista, MATH500, AIME24, AIME25, GPQA, MMLU-Pro, GAOKAO, IFEval, SGP-Bench, VSI-Bench, ERQA, SpaCE-10, and OmniSpatial. In the following table, we provide an overview of the InternVL3.5 series. To maintain consistency with earlier generations, we provide two model formats: the GitHub format, consistent with prior releases, and the HF format, aligned with the official Transformers standard. > If you want to convert checkpoints between these two formats, please refer to the custom2hf and hf2custom scripts.
| Model | #Vision Param | #Language Param | #Total Param | HF Link | ModelScope Link |
| --------------------- | ------------- | --------------- | ------------ | ------- | --------------- |
| InternVL3.5-1B | 0.3B | 0.8B | 1.1B | 🤗 link | 🤖 link |
| InternVL3.5-2B | 0.3B | 2.0B | 2.3B | 🤗 link | 🤖 link |
| InternVL3.5-4B | 0.3B | 4.4B | 4.7B | 🤗 link | 🤖 link |
| InternVL3.5-8B | 0.3B | 8.2B | 8.5B | 🤗 link | 🤖 link |
| InternVL3.5-14B | 0.3B | 14.8B | 15.1B | 🤗 link | 🤖 link |
| InternVL3.5-38B | 5.5B | 32.8B | 38.4B | 🤗 link | 🤖 link |
| InternVL3.5-20B-A4B | 0.3B | 20.9B | 21.2B-A4B | 🤗 link | 🤖 link |
| InternVL3.5-30B-A3B | 0.3B | 30.5B | 30.8B-A3B | 🤗 link | 🤖 link |
| InternVL3.5-241B-A28B | 5.5B | 235.1B | 240.7B-A28B | 🤗 link | 🤖 link |

| Model | #Vision Param | #Language Param | #Total Param | HF Link | ModelScope Link |
| ------------------------ | ------------- | --------------- | ------------ | ------- | --------------- |
| InternVL3.5-1B-HF | 0.3B | 0.8B | 1.1B | 🤗 link | 🤖 link |
| InternVL3.5-2B-HF | 0.3B | 2.0B | 2.3B | 🤗 link | 🤖 link |
| InternVL3.5-4B-HF | 0.3B | 4.4B | 4.7B | 🤗 link | 🤖 link |
| InternVL3.5-8B-HF | 0.3B | 8.2B | 8.5B | 🤗 link | 🤖 link |
| InternVL3.5-14B-HF | 0.3B | 14.8B | 15.1B | 🤗 link | 🤖 link |
| InternVL3.5-38B-HF | 5.5B | 32.8B | 38.4B | 🤗 link | 🤖 link |
| InternVL3.5-20B-A4B-HF | 0.3B | 20.9B | 21.2B-A4B | 🤗 link | 🤖 link |
| InternVL3.5-30B-A3B-HF | 0.3B | 30.5B | 30.8B-A3B | 🤗 link | 🤖 link |
| InternVL3.5-241B-A28B-HF | 5.5B | 235.1B | 240.7B-A28B | 🤗 link | 🤖 link |

> We conduct the evaluation with VLMEvalKit.
To enable the Thinking mode of our model, please set the system prompt to `R1_SYSTEM_PROMPT`. When enabling Thinking mode, we recommend setting `do_sample=True` and `temperature=0.6` to mitigate undesired repetition. Our training pipeline comprises three stages: Multimodal Continual Pre-Training (CPT), Supervised Fine-Tuning (SFT), and Cascade Reinforcement Learning (CascadeRL). In CascadeRL, we first fine-tune the model using Mixed Preference Optimization (MPO) under an offline RL setting, followed by GSPO under an online RL setting. For the Flash version of InternVL3.5, we additionally introduce a lightweight training stage, termed Visual Consistency Learning (ViCO), which reduces the token cost required to represent an image patch. Here, we also open-source the model weights after different training stages for potential research use. If you're unsure which version to use, please select the one without any suffix, as it has completed the full training pipeline.

| Model | Training Pipeline | HF Link | ModelScope Link |
| -------------------------------- | --------------------- | ------- | --------------- |
| InternVL3.5-1B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-1B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-1B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-1B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |
| InternVL3.5-2B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-2B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-2B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-2B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |
| InternVL3.5-4B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-4B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-4B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-4B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |
| InternVL3.5-8B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-8B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-8B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-8B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |
| InternVL3.5-14B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-14B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-14B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-14B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |
| InternVL3.5-30B-A3B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-30B-A3B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-30B-A3B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-30B-A3B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |
| InternVL3.5-38B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-38B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-38B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-38B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |
| InternVL3.5-241B-A28B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-241B-A28B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-241B-A28B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-241B-A28B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |

The Flash version of our model will be released as soon as possible. `InternVL3.5`: This series of models follows the "ViT–MLP–LLM" paradigm adopted in previous versions of InternVL. We initialize the language model with the Qwen3 series and GPT-OSS, and the vision encoder with InternViT-300M and InternViT-6B. The Dynamic High Resolution strategy introduced in InternVL1.5 is also retained in our design. `InternVL3.5-Flash`: Compared to InternVL3.5, InternVL3.5-Flash further integrates the Visual Resolution Router (ViR), yielding a series of efficient variants suitable for resource-constrained scenarios.
Specifically, in InternVL3.5, each image patch is initially represented as 1024 visual tokens by the vision encoder, which are then compressed into 256 tokens via a pixel shuffle module before being passed to the Large Language Model (LLM). In InternVL3.5-Flash, as shown in the figure below, an additional pixel shuffle module with a higher compression rate is included, enabling the compression of visual tokens down to 64 tokens. For each patch, the patch router determines the appropriate compression rate by assessing its semantic richness, and routes it to the corresponding pixel shuffle module accordingly. Benefiting from this patch-aware compression mechanism, InternVL3.5-Flash reduces the number of visual tokens by 50\% while maintaining nearly 100\% of the performance of InternVL3.5. During the pre-training stage, we update all model parameters jointly using a combination of large-scale text and multimodal corpora. Specifically, given an arbitrary training sample consisting of a multimodal token sequence \\(\mathbf{x}=\left(x_1, x_2, \ldots, x_L\right)\\), the next token prediction (NTP) loss is calculated on each text token as follows: $$ \mathcal{L}_{i}=-\log p_{\theta}\left(x_i \mid x_1, \ldots, x_{i-1}\right), $$ where \\(x_i\\) is the predicted token and the prefix tokens \\(\{x_1, x_2, \ldots, x_{i-1}\}\\) can be either text tokens or image tokens. Notably, for conversation samples, only response tokens are included in the calculation of the loss. Additionally, to mitigate bias toward either longer or shorter responses during training, we adopt square-root averaging to re-weight the NTP loss as follows: $$ \mathcal{L}_{i}^{\prime} = \frac{w_i}{\sum_j w_j} \cdot \mathcal{L}_i, \quad w_i = \frac{1}{N^{0.5}}, $$ where \\(N\\) denotes the number of tokens in the training sample on which the loss is calculated. Random JPEG compression is also applied to enhance the model's real-world performance.
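The pixel shuffle compression described above is essentially a space-to-depth operation: neighboring tokens are merged by concatenating their channels, so the token count drops while all feature values are preserved. The numpy sketch below is our own illustration (the released module also involves learned projections); it shows how a 32x32 grid of 1024 tokens becomes 256 tokens (factor 2) or 64 tokens (factor 4).

```python
import numpy as np

def pixel_shuffle_compress(tokens, grid, factor):
    """Merge each factor x factor block of visual tokens into one token by
    concatenating their channels (space-to-depth), reducing the token count
    by factor**2 while preserving every feature value."""
    h = w = grid
    c = tokens.shape[-1]
    x = tokens.reshape(h, w, c)
    x = x.reshape(h // factor, factor, w // factor, factor, c)
    x = x.transpose(0, 2, 1, 3, 4)  # group each factor x factor neighborhood
    return x.reshape((h // factor) * (w // factor), factor * factor * c)

patch = np.random.randn(1024, 64)             # 32x32 grid of 1024 visual tokens
full = pixel_shuffle_compress(patch, 32, 2)   # default route: 256 tokens
flash = pixel_shuffle_compress(patch, 32, 4)  # high-compression route: 64 tokens
```

Note that no information is discarded by the shuffle itself; the router's job is to decide which patches can tolerate the more aggressive 64-token route.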
During the SFT phase, we adopt the same objective as in the pre-training stage and use the same square-root averaging strategy to calculate the final loss. In this stage, the context window is set to 32K tokens to accommodate long-context information. Compared to InternVL3, the SFT stage of InternVL3.5 contains more high-quality and diverse training data derived from three sources: (1) Instruction-following data from InternVL3, which are reused to preserve broad coverage of vision–language tasks. (2) Multimodal reasoning data in the "Thinking" mode, which are included to instill long-thinking capabilities in the model. To construct such data, we first use InternVL3-78B to describe the image and then feed the description into DeepSeek-R1 to sample rollouts with detailed reasoning processes; rollouts with an incorrect final answer are filtered out. The questions in these datasets cover various expert domains, such as mathematics and scientific disciplines, thereby strengthening performance on different reasoning tasks. (3) Capability-expansion datasets, which endow InternVL3.5 with new skills, including GUI-based interaction, embodied interaction, and scalable vect

Cascade RL aims to combine the benefits of offline RL and online RL to progressively facilitate the post-training of MLLMs in an efficient manner. Specifically, we first fine-tune the model with an offline RL algorithm as an efficient warm-up stage to reach satisfactory results, which guarantees high-quality rollouts for the subsequent stage. We then employ an online RL algorithm to further refine the output distribution based on rollouts generated by the model itself. Compared to a single offline or online RL stage, our cascaded RL achieves significant performance improvements at a fraction of the GPU time cost. During the offline RL stage, we employ mixed preference optimization (MPO) to fine-tune the model.
Specifically, the training objective of MPO is a combination of preference loss \\(\mathcal{L}_{p}\\), quality loss \\(\mathcal{L}_{q}\\), and generation loss \\(\mathcal{L}_{g}\\), which can be formulated as follows: $$ \mathcal{L}_{\text{MPO}}= w_{p} \mathcal{L}_{p} + w_{q} \mathcal{L}_{q} + w_{g} \mathcal{L}_{g}, $$ where \\(w_{*}\\) represents the weight assigned to each loss component. The DPO loss, BCO loss, and LM loss serve as the preference loss, quality loss, and generation loss, respectively. During the online RL stage, we employ GSPO, without reference model constraints, as our online RL algorithm, which we find more effective for training both dense and mixture-of-experts (MoE) models. Similar to GRPO, the advantage is defined as the normalized reward across responses sampled from the same query. The training objective of GSPO is given by: $$ \mathcal{L}_{\mathrm{GSPO}}(\theta)=\mathbb{E}_{x \sim \mathcal{D},\, \{y_i\}_{i=1}^{G} \sim \pi_{\theta_{\text{old}}}(\cdot \mid x)}\left[\frac{1}{G} \sum_{i=1}^{G} \min \left(s_i(\theta) \widehat{A}_i,\ \operatorname{clip}\left(s_i(\theta), 1-\varepsilon, 1+\varepsilon\right) \widehat{A}_i\right)\right], $$ where the importance sampling ratio \\(s_i(\theta)\\) is defined as the geometric mean of the per-token ratios. > Please see our paper for more technical and experimental details. We further include ViCO as an additional training stage to integrate the visual resolution router (ViR) into InternVL3.5, thereby reducing its inference cost. The resulting efficient version of InternVL3.5 is termed InternVL3.5-Flash. In particular, ViCO comprises two stages: `Consistency training`: In this stage, the entire model is trained to minimize the divergence between response distributions conditioned on visual tokens with different compression rates. In practice, we introduce an extra reference model, which is frozen and initialized with InternVL3.5.
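The GSPO surrogate can be sketched numerically as follows. This is a simplified single-batch illustration under our own function names: the sequence-level ratio \(s_i(\theta)\) is computed as the geometric mean of per-token probability ratios (i.e., the exponential of the mean per-token log-ratio), and advantages are group-normalized rewards as in GRPO.

```python
import numpy as np

def group_advantages(rewards):
    """GRPO-style advantage: reward normalized across the group of
    responses sampled from the same query."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

def gspo_objective(logp_new, logp_old, advantages, eps=0.2):
    """Clipped GSPO-style surrogate over a group of sequences.
    logp_new/logp_old are per-sequence lists of per-token log-probs."""
    terms = []
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        # Geometric mean of per-token ratios == exp(mean of log-ratios).
        s = np.exp(np.mean(np.asarray(lp_new) - np.asarray(lp_old)))
        clipped = np.clip(s, 1 - eps, 1 + eps)
        terms.append(min(s * adv, clipped * adv))  # PPO-style pessimistic min
    return -float(np.mean(terms))  # negate: maximize surrogate by minimizing loss
```

With identical old and new log-probs the ratio is exactly 1, so the loss reduces to the negative mean advantage.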
Given a sample, each image patch is represented as either 256 or 64 tokens, and the training objective is defined as follows: $$ \mathcal{L}_{\text{ViCO}} = \mathbb{E}_{\xi \sim \mathcal{R}} \left[ \frac{1}{N} \sum_{i=1}^{N} \mathrm{KL} \left( \pi_{\theta_{\text{ref}}}\left(y_i \mid y_{<i}\right) \,\big\|\, \pi_{\theta}\left(y_i \mid y_{<i}, \xi\right) \right) \right], $$ where \\(\xi\\) denotes the compression rate sampled from the router \\(\mathcal{R}\\) and \\(N\\) is the number of response tokens. > Please see our paper for more technical and experimental details. Test-time scaling (TTS) has been empirically demonstrated as an effective approach to enhance the reasoning capabilities of LLMs and MLLMs, particularly for complex tasks necessitating multi-step inference. In this work, we implement a comprehensive test-time scaling approach that simultaneously improves reasoning depth (i.e., deep thinking) and breadth (i.e., parallel thinking). `Deep Thinking`: By activating the Thinking mode, we guide the model to deliberately engage in step-by-step reasoning (i.e., decomposing complex problems into logical steps and validating intermediate conclusions) before generating the final answer. This approach systematically improves the logical structure of solutions for complex problems, particularly those requiring multi-step inference, and enhances reasoning depth. `Parallel Thinking`: Following InternVL3, for reasoning tasks we adopt the Best-of-N (BoN) strategy, employing VisualPRM-v1.1 as the critic model to select the optimal response from multiple reasoning candidates. This approach improves reasoning breadth. > Notably, unless otherwise specified, the experimental results reported in our paper are obtained without applying TTS. Thus far, we have applied TTS only to reasoning benchmarks, since the model already exhibits strong perception and understanding capabilities, and applying TTS there yields no significant improvement. In multimodal inference, the vision encoder and language model have distinct computational characteristics. The vision encoder, which transforms images into semantic features, is highly parallelizable and does not rely on long-term history states.
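The consistency objective can be illustrated with a toy numpy sketch. This is our own simplification over explicit next-token distributions (not the training code): it averages, over response positions, the KL divergence between the frozen reference model's distribution and the policy's distribution under a sampled compression rate.

```python
import numpy as np

def kl(p, q):
    """KL(p || q) for discrete next-token distributions over the vocabulary."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

def vico_consistency_loss(ref_dists, policy_dists):
    """Average per-position KL between the frozen reference model's
    distributions (full-resolution visual tokens) and the policy's
    distributions under a sampled compression rate."""
    n = len(ref_dists)
    return sum(kl(p, q) for p, q in zip(ref_dists, policy_dists)) / n
```

When the policy matches the reference exactly, the loss is zero; any drift caused by the more aggressive 64-token compression is penalized, which is what keeps InternVL3.5-Flash close to full performance.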

license:apache-2.0
11,225
12

InternVL3_5-38B

license:apache-2.0
10,587
34

Mono-InternVL-2B

license:mit
8,430
36

InternVL3_5-2B

[\[πŸ“‚ GitHub\]](https://github.com/OpenGVLab/InternVL) [\[πŸ“œ InternVL 1.0\]](https://huggingface.co/papers/2312.14238) [\[πŸ“œ InternVL 1.5\]](https://huggingface.co/papers/2404.16821) [\[πŸ“œ InternVL 2.5\]](https://huggingface.co/papers/2412.05271) [\[πŸ“œ InternVL2.5-MPO\]](https://huggingface.co/papers/2411.10442) [\[πŸ“œ InternVL3\]](https://huggingface.co/papers/2504.10479) [\[πŸ“œ InternVL3.5\]](https://huggingface.co/papers/2508.18265) [\[πŸ†• Blog\]](https://internvl.github.io/blog/) [\[πŸ—¨οΈ Chat Demo\]](https://chat.intern-ai.org.cn/) [\[πŸš€ Quick Start\]](#quick-start) [\[πŸ“– Documents\]](https://internvl.readthedocs.io/en/latest/) We introduce InternVL3.5, a new family of open-source multimodal models that significantly advances versatility, reasoning capability, and inference efficiency along the InternVL series. A key innovation is the Cascade Reinforcement Learning (Cascade RL) framework, which enhances reasoning through a two-stage process: offline RL for stable convergence and online RL for refined alignment. This coarse-to-fine training strategy leads to substantial improvements on downstream reasoning tasks, e.g., MMMU and MathVista. To optimize efficiency, we propose a Visual Resolution Router (ViR) that dynamically adjusts the resolution of visual tokens without compromising performance. Coupled with ViR, our Decoupled Vision-Language Deployment (DvD) strategy separates the vision encoder and language model across different GPUs, effectively balancing computational load. These contributions collectively enable InternVL3.5 to achieve up to a +16.0\% gain in overall reasoning performance and a 4.05 \\(\times\\) inference speedup compared to its predecessor, i.e., InternVL3. In addition, InternVL3.5 supports novel capabilities such as GUI interaction and embodied agency. 
Notably, our largest model, i.e., InternVL3.5-241B-A28B, attains state-of-the-art results among open-source MLLMs across general multimodal, reasoning, text, and agentic tasksβ€”narrowing the performance gap with leading commercial models like GPT-5. All models and code are publicly released. > Hatched bars represent closed-source commercial models. We report average scores on a set of multimodal general, reasoning, text, and agentic benchmarks: MMBench v1.1 (en), MMStar,BLINK, HallusionBench, AI2D, OCRBench, MMVet, MME-RealWorld (en), MVBench, VideoMME, MMMU, MathVista, MathVision, MathVerse, DynaMath, WeMath, LogicVista, MATH500, AIME24, AIME25, GPQA, MMLU-Pro, GAOKAO, IFEval, SGP-Bench, VSI-Bench, ERQA, SpaCE-10, and OmniSpatial. In the following table, we provide an overview of the InternVL3.5 series. To maintain consistency with earlier generations, we provide two model formats: the GitHub format, consistent with prior releases, and the HF format, aligned with the official Transformers standard. > If you want to convert the checkpoint between these two formats, please refer to the scripts about custom2hf and hf2custom. 
| Model | #Vision Param | #Language Param | #Total Param | HF Link | ModelScope Link | | --------------------- | ------------- | --------------- | ------------ | ------------------------------------------------------------------------------ | ---------------------------------------------------------------------------------------- | | InternVL3.5-1B | 0.3B | 0.8B | 1.1B | πŸ€— link | πŸ€– link | | InternVL3.5-2B | 0.3B | 2.0B | 2.3B | πŸ€— link | πŸ€– link | | InternVL3.5-4B | 0.3B | 4.4B | 4.7B | πŸ€— link | πŸ€– link | | InternVL3.5-8B | 0.3B | 8.2B | 8.5B | πŸ€— link | πŸ€– link | | InternVL3.5-14B | 0.3B | 14.8B | 15.1B | πŸ€— link | πŸ€– link | | InternVL3.5-38B | 5.5B | 32.8B | 38.4B | πŸ€— link | πŸ€– link | | InternVL3.5-20B-A4B | 0.3B | 20.9B | 21.2B-A4B | πŸ€— link | πŸ€– link | | InternVL3.5-30B-A3B | 0.3B | 30.5B | 30.8B-A3B | πŸ€— link | πŸ€– link | | InternVL3.5-241B-A28B | 5.5B | 235.1B | 240.7B-A28B | πŸ€— link | πŸ€– link | | Model | #Vision Param | #Language Param | #Total Param | HF Link | ModelScope Link | | ------------------------ | ------------- | --------------- | ------------ | --------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------- | | InternVL3.5-1B-HF | 0.3B | 0.8B | 1.1B | πŸ€— link | πŸ€– link | | InternVL3.5-2B-HF | 0.3B | 2.0B | 2.3B | πŸ€— link | πŸ€– link | | InternVL3.5-4B-HF | 0.3B | 4.4B | 4.7B | πŸ€— link | πŸ€– link | | InternVL3.5-8B-HF | 0.3B | 8.2B | 8.5B | πŸ€— link | πŸ€– link | | InternVL3.5-14B-HF | 0.3B | 14.8B | 15.1B | πŸ€— link | πŸ€– link | | InternVL3.5-38B-HF | 5.5B | 32.8B | 38.4B | πŸ€— link | πŸ€– link | | InternVL3.5-20B-A4B-HF | 0.3B | 20.9B | 21.2B-A4B | πŸ€— link | πŸ€– link | | InternVL3.5-30B-A3B-HF | 0.3B | 30.5B | 30.8B-A3B | πŸ€— link | πŸ€– link | | InternVL3.5-241B-A28B-HF | 5.5B | 235.1B | 240.7B-A28B | πŸ€— link | πŸ€– link | > We conduct the evaluation with VLMEvalkit. 
To enable the Thinking mode of our model, please set the system prompt to R1SYSTEMPROMPT. When enabling Thinking mode, we recommend setting `dosample=True` and `temperature=0.6` to mitigate undesired repetition. Our training pipeline comprises four stages: Multimodal Continual Pre-Training (CPT), Supervised Fine-Tuning (SFT), and Cascade Reinforcement Learning (CascadeRL). In CascadeRL, we first fine-tune the model using Mixed Preference Optimization (MPO) under an offline RL setting, followed by GSPO under an oneline RL setting. For the Flash version of InternVL3.5, we additionally introduce a lightweight training stage, termed Visual Consistency Learning (ViCO), which reduces the token cost required to represent an image patch. Here, we also open-source the model weights after different training stages for potential research usage. If you're unsure which version to use, please select the one without any suffix, as it has completed the full training pipeline. | Model | Training Pipeline | HF Link | ModelScope Link | | -------------------------------- | --------------------- | --------------------------------------------------------------------------- | ------------------------------------------------------------------------------------- | | InternVL3.5-1B-Pretrained | CPT | πŸ€— link | πŸ€– link | | InternVL3.5-1B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link | | InternVL3.5-1B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link | | InternVL3.5-1B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link | | InternVL3.5-2B-Pretrained | CPT | πŸ€— link | πŸ€– link | | InternVL3.5-2B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link | | InternVL3.5-2B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link | | InternVL3.5-2B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link | | InternVL3.5-4B-Pretrained | CPT | πŸ€— link | πŸ€– link | | InternVL3.5-4B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link | | InternVL3.5-4B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link | | InternVL3.5-4B | CPT + SFT + 
CascadeRL | πŸ€— link | πŸ€– link | | InternVL3.5-8B-Pretrained | CPT | πŸ€— link | πŸ€– link | | InternVL3.5-8B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link | | InternVL3.5-8B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link | | InternVL3.5-8B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link | | InternVL3.5-14B-Pretrained | CPT | πŸ€— link | πŸ€– link | | InternVL3.5-14B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link | | InternVL3.5-14B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link | | InternVL3.5-14B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link | | InternVL3.5-30B-A3B-Pretrained | CPT | πŸ€— link | πŸ€– link | | InternVL3.5-30B-A3B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link | | InternVL3.5-30B-A3B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link | | InternVL3.5-30B-A3B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link | | InternVL3.5-38B-Pretrained | CPT | πŸ€— link | πŸ€– link | | InternVL3.5-38B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link | | InternVL3.5-38B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link | | InternVL3.5-38B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link | | InternVL3.5-241B-A28B-Pretrained | CPT | πŸ€— link | πŸ€– link | | InternVL3.5-241B-A28B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link | | InternVL3.5-241B-A28B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link | | InternVL3.5-241B-A28B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link | The Flash version of our model will be released as soon as possible. `InternVL3.5`: This series of models follow the "ViT–MLP–LLM" paradigm adopted in previous versions of InternVL. We initialize the language model using the Qwen3 series and GPT-OSS, and the vision encoder using InternViT-300M and InternViT-6B. The Dynamic High Resolution strategy introduced in InternVL1.5 is also retained in our design. `InternVL3.5-Flash`: Compared to InternVL3.5, InternVL3.5-Flash further integrates the Visual Resolution Router (ViR), thus yielding a series of efficient variants friendly suitable for resource-constrained scenarios. 
Specifically, in InternVL3.5, each image patch is initially represented as 1024 visual tokens for the vision encoder, which are then compressed into 256 tokens via a pixel shuffle module before being passed to the Large Language Model (LLM). In InternVL3.5-Flash, as shown in the Figure below, an additional pixel shuffle module with a higher compression rate is included, enabling the compression of visual tokens down to 64 tokens. For each patch, the patch router determines the appropriate compression rate by assessing its semantic richness, and routes it to the corresponding pixel shuffle module accordingly. Benefiting from this patch-aware compression mechanism, InternVL3.5-Flash is able to reduce the number of visual tokens by 50\% while maintaining nearly 100\% of the performance of InternVL3.5. During the pre-training stage, we update all model parameters jointly using the combination of large-scale text and multimodal corpora. Specifically, given an arbitrary training sample consisting of a multimodal token sequence \\(\mathbf{x}=\left(x1, x2, \ldots, xL\right)\\), the next token prediction (NTP) loss is calculated on each text token as follows: $$ \mathcal{L}{i}=-\log p\theta\left(xi \mid x1, \ldots, x{i-1}\right), $$ where \\(xi\\) is the predicted token and prefix tokens in \\(\{x1, x2, \ldots, x{i-1}\}\\) can be either text tokens or image tokens. Notably, for conversation samples, only response tokens are included for the calculation of the loss. Additionally, to mitigate bias toward either longer or shorter responses during training, we adopt the square averaging to re-weight the NTP loss as follows: $$ \mathcal{L}{i}^{'} = \frac{wi}{\sumj wj} \cdot \mathcal{L}i, \quad wi = \frac{1}{N^{0.5}}, $$ where \\(N\\) denotes the number of tokens in the training sample on which the loss needs to be calculated. The random JPEG compression is also included to enhance the model's real-world performance. 
During the SFT phase, we adopt the same objective as in the pre-training stage and use the same square-root averaging strategy to calculate the final loss. In this stage, the context window is set to 32K tokens to accommodate long-context information. Compared to InternVL3, the SFT stage of InternVL3.5 contains higher-quality and more diverse training data derived from three sources:

(1) Instruction-following data from InternVL3, which are reused to preserve broad coverage of vision–language tasks.

(2) Multimodal reasoning data in the "Thinking" mode, which are included to instill long-thinking capabilities in the model. To construct such data, we first use InternVL3-78B to describe the image and then input the description into DeepSeek-R1 to sample rollouts with detailed reasoning processes. Rollouts with an incorrect final answer are filtered out. The questions in these datasets cover various expert domains, such as mathematics and scientific disciplines, thereby strengthening performance on different reasoning tasks.

(3) Capability-expansion datasets, which endow InternVL3.5 with new skills, including GUI-based interaction, embodied interaction, and scalable vector graphics (SVG) understanding.

Cascade RL aims to combine the benefits of offline RL and online RL to progressively facilitate the post-training of MLLMs in an efficient manner. Specifically, we first fine-tune the model with an offline RL algorithm as an efficient warm-up stage to reach satisfactory results, which guarantees high-quality rollouts for the subsequent stage. We then employ an online RL algorithm to further refine the output distribution based on rollouts generated by the model itself. Compared to a single offline or online RL stage, our cascaded RL achieves significant performance improvements at a fraction of the GPU time cost. During the offline RL stage, we employ mixed preference optimization (MPO) to fine-tune the model.
Specifically, the training objective of MPO is a combination of preference loss \\(\mathcal{L}_{p}\\), quality loss \\(\mathcal{L}_{q}\\), and generation loss \\(\mathcal{L}_{g}\\), which can be formulated as follows:

$$ \mathcal{L}_{\text{MPO}}= w_{p} \mathcal{L}_{p} + w_{q} \mathcal{L}_{q} + w_{g} \mathcal{L}_{g}, $$

where \\(w_{*}\\) represents the weight assigned to each loss component. The DPO loss, BCO loss, and LM loss serve as the preference loss, quality loss, and generation loss, respectively.

During the online RL stage, we employ GSPO, without reference model constraints, as our online RL algorithm, which we find more effective for training both dense and mixture-of-experts (MoE) models. Similar to GRPO, the advantage is defined as the normalized reward across responses sampled from the same query. The training objective of GSPO is given by:

$$ \mathcal{L}_{\mathrm{GSPO}}(\theta)=\mathbb{E}_{x \sim \mathcal{D},\,\{y_i\}_{i=1}^{G} \sim \pi_{\theta_{\text{old}}}(\cdot \mid x)}\left[\frac{1}{G} \sum_{i=1}^{G} \min \left(s_i(\theta) \widehat{A}_i, \operatorname{clip}\left(s_i(\theta), 1-\varepsilon, 1+\varepsilon\right) \widehat{A}_i\right)\right], $$

where the importance sampling ratio \\(s_i(\theta)\\) is defined as the geometric mean of the per-token ratios.

> Please see our paper for more technical and experimental details.

We further include ViCO as an additional training stage to integrate the visual resolution router (ViR) into InternVL3.5, thereby reducing its inference cost. The resulting efficient version of InternVL3.5 is termed InternVL3.5-Flash. In particular, ViCO comprises two stages:

`Consistency training`: In this stage, the entire model is trained to minimize the divergence between response distributions conditioned on visual tokens with different compression rates. In practice, we introduce an extra reference model, which is frozen and initialized with InternVL3.5.
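At the level of a single token, the consistency signal is simply a KL divergence between the reference model's next-token distribution (conditioned on uncompressed visual tokens) and the policy's (conditioned on compressed ones). A toy numerical sketch with made-up distributions, not the actual training code:

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) between two discrete next-token distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

# Hypothetical next-token distributions over a 3-word vocabulary.
ref_256 = np.array([0.70, 0.20, 0.10])  # reference: patch kept at 256 visual tokens
pol_64  = np.array([0.60, 0.25, 0.15])  # policy: same patch compressed to 64 tokens
consistency_loss = kl_divergence(ref_256, pol_64)  # shrinks as the policy matches the reference
```

Training drives this divergence toward zero, so the compressed representation yields (nearly) the same responses as the full-resolution one.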
Given a sample, each image patch is represented as either 256 or 64 tokens, and the training objective is defined as follows:

$$ \mathcal{L}_{\text{ViCO}} = \mathbb{E}_{\xi \sim \mathcal{R}} \Bigg[ \frac{1}{N} \sum_{i=1}^{N} \mathrm{KL} \Big( \pi_{\theta_{\text{ref}}}\left(y_i \mid y_{<i}, I\right) \,\Big\|\, \pi_{\theta}\left(y_i \mid y_{<i}, I_{\xi}\right) \Big) \Bigg], $$

where \\(\xi\\) denotes the compression rate sampled from \\(\mathcal{R}\\), the frozen reference model \\(\pi_{\theta_{\text{ref}}}\\) is conditioned on the uncompressed visual input \\(I\\), and the policy \\(\pi_{\theta}\\) is conditioned on the visual input \\(I_{\xi}\\) compressed at rate \\(\xi\\).

`Router training`: In this stage, only the ViR is trained to select the appropriate compression rate for each patch, while the rest of the model is kept frozen.

> Please see our paper for more technical and experimental details.

Test-time scaling (TTS) has been empirically demonstrated as an effective approach to enhance the reasoning capabilities of LLMs and MLLMs, particularly for complex tasks necessitating multi-step inference. In this work, we implement a comprehensive test-time scaling approach that simultaneously improves reasoning depth (i.e., deep thinking) and breadth (i.e., parallel thinking).

`Deep Thinking`: By activating the Thinking mode, we guide the model to deliberately engage in step-by-step reasoning (i.e., decomposing complex problems into logical steps and validating intermediate conclusions) prior to generating the final answer. This approach systematically improves the logical structure of solutions for complex problems, particularly those requiring multi-step inference, and enhances reasoning depth.

`Parallel Thinking`: Following InternVL3, for reasoning tasks, we adopt the Best-of-N (BoN) strategy and employ VisualPRM-v1.1 as the critic model to select the optimal response from multiple reasoning candidates. This approach improves reasoning breadth.

> Notably, unless otherwise specified, the experimental results reported in our paper are obtained without applying TTS. Thus far, we have only applied TTS to reasoning benchmarks, since we found that the model already exhibits strong perception and understanding capabilities, and applying TTS yields no significant improvement.

In multimodal inference, the vision encoder and language model have distinct computational characteristics. The vision encoder, which transforms images into semantic features, is highly parallelizable and does not rely on long-term history states.
In contrast, the language model performs inference autoregressively, requiring the previous states to compute the next one. This sequential property makes the language part more sensitive to memory bandwidth and latency. When MLLMs are deployed online at scale, the vision and language models often block each other, incurring additional inference cost. This effect becomes more pronounced with larger vision models or higher-resolution images.

As shown in the Figure above, we propose decoupled vision-language deployment (DvD) to address this issue by separating vision and language processing, with a particular focus on optimizing the prefilling stage. The vision subsystem batches and processes images to produce compact feature embeddings, which are then transmitted to the language subsystem for fusion with the text context prior to decoding. This separation alleviates blocking and brings multimodal prefilling performance closer to that of pure language models.

In our system implementation, the ViT and MLP (and ViR for InternVL3.5-Flash) are deployed on the vision server, while the language server executes only the LLM. Communication is unidirectional, transmitting BF16 visual features over TCP, with RDMA optionally employed to achieve higher transmission speed. Vision processing, feature transmission, and language processing are organized into an asynchronous three-stage pipeline, enabling overlapped execution and minimizing pipeline stalls.

DvD increases GPU utilization and processing efficiency on the vision side, while enabling the language server to focus exclusively on the LLM's prefilling and decoding without being blocked by vision computation. This design leads to improved throughput and responsiveness. Moreover, the architecture supports independent hardware cost optimization for the vision and language modules, and facilitates the seamless integration of new modules without requiring modifications to the language server deployment.
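The asynchronous three-stage pipeline can be sketched with queues and worker threads. This is a toy illustration of the overlap, assuming nothing about the real servers: `encode`, `transmit`, and `prefill` are hypothetical stand-ins for the vision server, the feature link, and the language server.

```python
import queue
import threading

def run_stage(fn, inbox, outbox):
    """One pipeline stage: consume items until the None sentinel, process,
    pass results (and finally the sentinel) downstream."""
    for item in iter(inbox.get, None):
        outbox.put(fn(item))
    outbox.put(None)

# Hypothetical stand-ins for the three DvD stages.
encode   = lambda img: f"feat({img})"          # vision server: ViT + MLP (+ ViR)
transmit = lambda feat: feat                   # BF16 features over TCP/RDMA
prefill  = lambda feat: f"prefilled[{feat}]"   # language server: LLM prefilling

q0, q1, q2, q3 = (queue.Queue() for _ in range(4))
threads = [threading.Thread(target=run_stage, args=a)
           for a in [(encode, q0, q1), (transmit, q1, q2), (prefill, q2, q3)]]
for t in threads:
    t.start()
for img in ["img0", "img1", "img2"]:
    q0.put(img)
q0.put(None)                          # shut the pipeline down after the last image
results = list(iter(q3.get, None))    # collect outputs in order
for t in threads:
    t.join()
```

Because each stage runs in its own thread, image `k+1` can be encoded while image `k`'s features are in flight and image `k-1` is being prefilled, which is the overlap DvD exploits.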
Multi-Image Understanding & Real-World Comprehension

Comprehensive Multimodal Understanding & Multimodal Hallucination Evaluation

We provide example code to run `InternVL3.5-8B` using `transformers`. Please note that our models with up to 30B parameters can be deployed on a single A100 GPU, while the 38B model requires two A100 GPUs and the 241B-A28B model requires eight A100 GPUs.

> In most cases, both LMDeploy and vLLM can be used for model deployment. However, for InternVL3.5-20B-A4B, we recommend using vLLM, since LMDeploy does not yet support GPT-OSS.

> Please use transformers>=4.52.1 to ensure the model works normally. For the 20B version of our model, transformers>=4.55.0 is required.

To enable thinking mode, please set the system prompt to our Thinking System Prompt. When enabling Thinking mode, we recommend setting `do_sample=True` and `temperature=0.6` to mitigate undesired repetition. Besides this method, you can also use the following code to get streamed output.

Many repositories now support fine-tuning of the InternVL series models, including InternVL, SWIFT, XTuner, and others. Please refer to their documentation for more details on fine-tuning.

LMDeploy is a toolkit for compressing, deploying, and serving LLMs & VLMs. It abstracts the complex inference process of multimodal Vision-Language Models (VLMs) into an easy-to-use pipeline, similar to the Large Language Model (LLM) inference pipeline.

When dealing with multiple images, you can put them all in one list. Keep in mind that multiple images will lead to a higher number of input tokens, so the size of the context window typically needs to be increased. Conducting inference with batch prompts is quite straightforward; just place them within a list structure. There are two ways to conduct multi-turn conversations with the pipeline. One is to construct messages according to the OpenAI format and use the method introduced above; the other is to use the `pipeline.chat` interface.
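For reference, a multi-image request in the OpenAI message format might look like the following. This is an illustrative structure only; the URLs are placeholders, not real assets.

```python
# A single user turn carrying a text prompt plus two images, following the
# OpenAI chat message format; the URLs below are placeholders.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe the two images and compare them."},
            {"type": "image_url", "image_url": {"url": "https://example.com/a.jpg"}},
            {"type": "image_url", "image_url": {"url": "https://example.com/b.jpg"}},
        ],
    }
]
```

Each additional image entry adds its visual tokens to the prompt, which is why the context window typically needs to grow with the number of images.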
LMDeploy's `api_server` enables models to be easily packed into services with a single command. The provided RESTful APIs are compatible with OpenAI's interfaces. Below is an example of service startup. To use the OpenAI-style interface, you need to install the `openai` package.

This project is released under the Apache-2.0 License. This project uses the pre-trained Qwen3 as a component, which is also licensed under the Apache-2.0 License. If you find this project useful in your research, please consider citing:

license:apache-2.0
8,307
16

InternVL3_5-8B-HF

Notably, our largest model, i.e., InternVL3.5-241B-A28B, attains state-of-the-art results among open-source MLLMs across general multimodal, reasoning, text, and agentic tasks, narrowing the performance gap with leading commercial models like GPT-5. All models and code are publicly released.

> Hatched bars represent closed-source commercial models. We report average scores on a set of multimodal general, reasoning, text, and agentic benchmarks: MMBench v1.1 (en), MMStar, BLINK, HallusionBench, AI2D, OCRBench, MMVet, MME-RealWorld (en), MVBench, VideoMME, MMMU, MathVista, MathVision, MathVerse, DynaMath, WeMath, LogicVista, MATH500, AIME24, AIME25, GPQA, MMLU-Pro, GAOKAO, IFEval, SGP-Bench, VSI-Bench, ERQA, SpaCE-10, and OmniSpatial.

In the following table, we provide an overview of the InternVL3.5 series. To maintain consistency with earlier generations, we provide two model formats: the GitHub format, consistent with prior releases, and the HF format, aligned with the official Transformers standard.

> If you want to convert a checkpoint between these two formats, please refer to the custom2hf and hf2custom scripts.
| Model | #Vision Param | #Language Param | #Total Param | HF Link | ModelScope Link |
| --------------------- | ------------- | --------------- | ------------ | ------- | --------------- |
| InternVL3.5-1B | 0.3B | 0.8B | 1.1B | 🤗 link | 🤖 link |
| InternVL3.5-2B | 0.3B | 2.0B | 2.3B | 🤗 link | 🤖 link |
| InternVL3.5-4B | 0.3B | 4.4B | 4.7B | 🤗 link | 🤖 link |
| InternVL3.5-8B | 0.3B | 8.2B | 8.5B | 🤗 link | 🤖 link |
| InternVL3.5-14B | 0.3B | 14.8B | 15.1B | 🤗 link | 🤖 link |
| InternVL3.5-38B | 5.5B | 32.8B | 38.4B | 🤗 link | 🤖 link |
| InternVL3.5-20B-A4B | 0.3B | 20.9B | 21.2B-A4B | 🤗 link | 🤖 link |
| InternVL3.5-30B-A3B | 0.3B | 30.5B | 30.8B-A3B | 🤗 link | 🤖 link |
| InternVL3.5-241B-A28B | 5.5B | 235.1B | 240.7B-A28B | 🤗 link | 🤖 link |

| Model | #Vision Param | #Language Param | #Total Param | HF Link | ModelScope Link |
| ------------------------ | ------------- | --------------- | ------------ | ------- | --------------- |
| InternVL3.5-1B-HF | 0.3B | 0.8B | 1.1B | 🤗 link | 🤖 link |
| InternVL3.5-2B-HF | 0.3B | 2.0B | 2.3B | 🤗 link | 🤖 link |
| InternVL3.5-4B-HF | 0.3B | 4.4B | 4.7B | 🤗 link | 🤖 link |
| InternVL3.5-8B-HF | 0.3B | 8.2B | 8.5B | 🤗 link | 🤖 link |
| InternVL3.5-14B-HF | 0.3B | 14.8B | 15.1B | 🤗 link | 🤖 link |
| InternVL3.5-38B-HF | 5.5B | 32.8B | 38.4B | 🤗 link | 🤖 link |
| InternVL3.5-20B-A4B-HF | 0.3B | 20.9B | 21.2B-A4B | 🤗 link | 🤖 link |
| InternVL3.5-30B-A3B-HF | 0.3B | 30.5B | 30.8B-A3B | 🤗 link | 🤖 link |
| InternVL3.5-241B-A28B-HF | 5.5B | 235.1B | 240.7B-A28B | 🤗 link | 🤖 link |

> We conduct the evaluation with VLMEvalKit.
To enable the Thinking mode of our model, please set the system prompt to R1_SYSTEM_PROMPT. When enabling Thinking mode, we recommend setting `do_sample=True` and `temperature=0.6` to mitigate undesired repetition.

Our training pipeline comprises four stages: Multimodal Continual Pre-Training (CPT), Supervised Fine-Tuning (SFT), Cascade Reinforcement Learning (CascadeRL), and, for the Flash version, Visual Consistency Learning (ViCO). In CascadeRL, we first fine-tune the model using Mixed Preference Optimization (MPO) under an offline RL setting, followed by GSPO under an online RL setting. ViCO is a lightweight training stage that reduces the token cost required to represent an image patch.

Here, we also open-source the model weights after different training stages for potential research usage. If you're unsure which version to use, please select the one without any suffix, as it has completed the full training pipeline.

| Model | Training Pipeline | HF Link | ModelScope Link |
| -------------------------------- | --------------------- | ------- | --------------- |
| InternVL3.5-1B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-1B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-1B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-1B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |
| InternVL3.5-2B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-2B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-2B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-2B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |
| InternVL3.5-4B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-4B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-4B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-4B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |
| InternVL3.5-8B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-8B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-8B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-8B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |
| InternVL3.5-14B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-14B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-14B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-14B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |
| InternVL3.5-30B-A3B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-30B-A3B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-30B-A3B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-30B-A3B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |
| InternVL3.5-38B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-38B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-38B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-38B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |
| InternVL3.5-241B-A28B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-241B-A28B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-241B-A28B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-241B-A28B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |

The Flash version of our model will be released as soon as possible.

`InternVL3.5`: This series of models follows the "ViT–MLP–LLM" paradigm adopted in previous versions of InternVL. We initialize the language model using the Qwen3 series and GPT-OSS, and the vision encoder using InternViT-300M and InternViT-6B. The Dynamic High Resolution strategy introduced in InternVL1.5 is also retained in our design.

`InternVL3.5-Flash`: Compared to InternVL3.5, InternVL3.5-Flash further integrates the Visual Resolution Router (ViR), yielding a series of efficient variants suitable for resource-constrained scenarios.
Multi-Image Understanding & Real-World Comprehension

Comprehensive Multimodal Understanding & Multimodal Hallucination Evaluation

We provide example code to run `InternVL3.5-8B-HF` using `transformers`. Please note that our models with up to 30B parameters can be deployed on a single A100 GPU, while the 38B model requires two A100 GPUs and the 241B-A28B model requires eight A100 GPUs.

> In most cases, both LMDeploy and vLLM can be used for model deployment. However, for InternVL3.5-20B-A4B, we recommend using vLLM, since LMDeploy does not yet support GPT-OSS.

> Please use transformers>=4.52.1 to ensure the model works normally. For the 20B version of our model, transformers>=4.55.0 is required.

To enable thinking mode, please set the system prompt to our Thinking System Prompt. When enabling Thinking mode, we recommend setting `do_sample=True` and `temperature=0.6` to mitigate undesired repetition.

The HuggingFace-format checkpoints of our models are fully consistent with the APIs of the official HuggingFace models. For details, please refer to the official documentation.

Many repositories now support fine-tuning of the InternVL series models, including InternVL, SWIFT, XTuner, and others. Please refer to their documentation for more details on fine-tuning.

LMDeploy is a toolkit for compressing, deploying, and serving LLMs & VLMs. It abstracts the complex inference process of multimodal Vision-Language Models (VLMs) into an easy-to-use pipeline, similar to the Large Language Model (LLM) inference pipeline.

When dealing with multiple images, you can put them all in one list. Keep in mind that multiple images will lead to a higher number of input tokens, so the size of the context window typically needs to be increased. Conducting inference with batch prompts is quite straightforward; just place them within a list structure. There are two ways to conduct multi-turn conversations with the pipeline.
One is to construct messages according to the OpenAI format and use the method introduced above; the other is to use the `pipeline.chat` interface. LMDeploy's `api_server` enables models to be easily packed into services with a single command. The provided RESTful APIs are compatible with OpenAI's interfaces. Below is an example of service startup. To use the OpenAI-style interface, you need to install the `openai` package.

This project is released under the Apache-2.0 License. This project uses the pre-trained Qwen3 as a component, which is also licensed under the Apache-2.0 License. If you find this project useful in your research, please consider citing:

license:apache-2.0
6,849
7

InternVL3-78B

β€”
6,777
223

InternVL3_5-241B-A28B

license:apache-2.0
6,104
131

VideoMAEv2-Large

license:cc-by-nc-4.0
5,975
1

InternViT-300M-448px

license:mit
5,847
60

InternVL3-38B-Instruct

license:apache-2.0
5,826
9

ViCLIP-B-16-hf

β€”
5,195
1

InternVL3-9B

license:mit
5,167
25

InternVL3_5-2B-Flash

license:apache-2.0
5,122
3

InternVL3_5-2B-HF

license:apache-2.0
5,112
2

InternVL3_5-1B-Instruct

[\[📂 GitHub\]](https://github.com/OpenGVLab/InternVL) [\[📜 InternVL 1.0\]](https://huggingface.co/papers/2312.14238) [\[📜 InternVL 1.5\]](https://huggingface.co/papers/2404.16821) [\[📜 InternVL 2.5\]](https://huggingface.co/papers/2412.05271) [\[📜 InternVL2.5-MPO\]](https://huggingface.co/papers/2411.10442) [\[📜 InternVL3\]](https://huggingface.co/papers/2504.10479) [\[📜 InternVL3.5\]](https://huggingface.co/papers/2508.18265) [\[🆕 Blog\]](https://internvl.github.io/blog/) [\[🗨️ Chat Demo\]](https://chat.intern-ai.org.cn/) [\[🚀 Quick Start\]](#quick-start) [\[📖 Documents\]](https://internvl.readthedocs.io/en/latest/) We introduce InternVL3.5, a new family of open-source multimodal models that significantly advances versatility, reasoning capability, and inference efficiency along the InternVL series. A key innovation is the Cascade Reinforcement Learning (Cascade RL) framework, which enhances reasoning through a two-stage process: offline RL for stable convergence and online RL for refined alignment. This coarse-to-fine training strategy leads to substantial improvements on downstream reasoning tasks, e.g., MMMU and MathVista. To optimize efficiency, we propose a Visual Resolution Router (ViR) that dynamically adjusts the resolution of visual tokens without compromising performance. Coupled with ViR, our Decoupled Vision-Language Deployment (DvD) strategy separates the vision encoder and language model across different GPUs, effectively balancing computational load. These contributions collectively enable InternVL3.5 to achieve up to a +16.0% gain in overall reasoning performance and a 4.05\\(\times\\) inference speedup compared to its predecessor, i.e., InternVL3. In addition, InternVL3.5 supports novel capabilities such as GUI interaction and embodied agency.
Notably, our largest model, i.e., InternVL3.5-241B-A28B, attains state-of-the-art results among open-source MLLMs across general multimodal, reasoning, text, and agentic tasks, narrowing the performance gap with leading commercial models like GPT-5. All models and code are publicly released. > Hatched bars represent closed-source commercial models. We report average scores on a set of multimodal general, reasoning, text, and agentic benchmarks: MMBench v1.1 (en), MMStar, BLINK, HallusionBench, AI2D, OCRBench, MMVet, MME-RealWorld (en), MVBench, VideoMME, MMMU, MathVista, MathVision, MathVerse, DynaMath, WeMath, LogicVista, MATH500, AIME24, AIME25, GPQA, MMLU-Pro, GAOKAO, IFEval, SGP-Bench, VSI-Bench, ERQA, SpaCE-10, and OmniSpatial. In the following table, we provide an overview of the InternVL3.5 series. To maintain consistency with earlier generations, we provide two model formats: the GitHub format, consistent with prior releases, and the HF format, aligned with the official Transformers standard. > If you want to convert the checkpoint between these two formats, please refer to the scripts about custom2hf and hf2custom.
| Model | #Vision Param | #Language Param | #Total Param | HF Link | ModelScope Link |
| --------------------- | ------------- | --------------- | ------------ | ------- | --------------- |
| InternVL3.5-1B | 0.3B | 0.8B | 1.1B | 🤗 link | 🤖 link |
| InternVL3.5-2B | 0.3B | 2.0B | 2.3B | 🤗 link | 🤖 link |
| InternVL3.5-4B | 0.3B | 4.4B | 4.7B | 🤗 link | 🤖 link |
| InternVL3.5-8B | 0.3B | 8.2B | 8.5B | 🤗 link | 🤖 link |
| InternVL3.5-14B | 0.3B | 14.8B | 15.1B | 🤗 link | 🤖 link |
| InternVL3.5-38B | 5.5B | 32.8B | 38.4B | 🤗 link | 🤖 link |
| InternVL3.5-20B-A4B | 0.3B | 20.9B | 21.2B-A4B | 🤗 link | 🤖 link |
| InternVL3.5-30B-A3B | 0.3B | 30.5B | 30.8B-A3B | 🤗 link | 🤖 link |
| InternVL3.5-241B-A28B | 5.5B | 235.1B | 240.7B-A28B | 🤗 link | 🤖 link |

| Model | #Vision Param | #Language Param | #Total Param | HF Link | ModelScope Link |
| ------------------------ | ------------- | --------------- | ------------ | ------- | --------------- |
| InternVL3.5-1B-HF | 0.3B | 0.8B | 1.1B | 🤗 link | 🤖 link |
| InternVL3.5-2B-HF | 0.3B | 2.0B | 2.3B | 🤗 link | 🤖 link |
| InternVL3.5-4B-HF | 0.3B | 4.4B | 4.7B | 🤗 link | 🤖 link |
| InternVL3.5-8B-HF | 0.3B | 8.2B | 8.5B | 🤗 link | 🤖 link |
| InternVL3.5-14B-HF | 0.3B | 14.8B | 15.1B | 🤗 link | 🤖 link |
| InternVL3.5-38B-HF | 5.5B | 32.8B | 38.4B | 🤗 link | 🤖 link |
| InternVL3.5-20B-A4B-HF | 0.3B | 20.9B | 21.2B-A4B | 🤗 link | 🤖 link |
| InternVL3.5-30B-A3B-HF | 0.3B | 30.5B | 30.8B-A3B | 🤗 link | 🤖 link |
| InternVL3.5-241B-A28B-HF | 5.5B | 235.1B | 240.7B-A28B | 🤗 link | 🤖 link |

> We conduct the evaluation with VLMEvalKit.
To enable the Thinking mode of our model, please set the system prompt to R1_SYSTEM_PROMPT. When enabling Thinking mode, we recommend setting `do_sample=True` and `temperature=0.6` to mitigate undesired repetition. Our training pipeline comprises three stages: Multimodal Continual Pre-Training (CPT), Supervised Fine-Tuning (SFT), and Cascade Reinforcement Learning (CascadeRL). In CascadeRL, we first fine-tune the model using Mixed Preference Optimization (MPO) under an offline RL setting, followed by GSPO under an online RL setting. For the Flash version of InternVL3.5, we additionally introduce a lightweight training stage, termed Visual Consistency Learning (ViCO), which reduces the token cost required to represent an image patch. Here, we also open-source the model weights after different training stages for potential research usage. If you're unsure which version to use, please select the one without any suffix, as it has completed the full training pipeline.

| Model | Training Pipeline | HF Link | ModelScope Link |
| -------------------------------- | --------------------- | ------- | --------------- |
| InternVL3.5-1B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-1B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-1B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-1B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |
| InternVL3.5-2B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-2B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-2B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-2B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |
| InternVL3.5-4B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-4B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-4B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-4B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |
| InternVL3.5-8B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-8B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-8B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-8B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |
| InternVL3.5-14B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-14B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-14B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-14B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |
| InternVL3.5-30B-A3B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-30B-A3B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-30B-A3B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-30B-A3B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |
| InternVL3.5-38B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-38B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-38B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-38B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |
| InternVL3.5-241B-A28B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-241B-A28B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-241B-A28B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-241B-A28B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |

The Flash version of our model will be released as soon as possible. `InternVL3.5`: This series of models follows the "ViT–MLP–LLM" paradigm adopted in previous versions of InternVL. We initialize the language model using the Qwen3 series and GPT-OSS, and the vision encoder using InternViT-300M and InternViT-6B. The Dynamic High Resolution strategy introduced in InternVL1.5 is also retained in our design. `InternVL3.5-Flash`: Compared to InternVL3.5, InternVL3.5-Flash further integrates the Visual Resolution Router (ViR), yielding a series of efficient variants suitable for resource-constrained scenarios.
Specifically, in InternVL3.5, each image patch is initially represented as 1024 visual tokens for the vision encoder, which are then compressed into 256 tokens via a pixel shuffle module before being passed to the Large Language Model (LLM). In InternVL3.5-Flash, as shown in the figure below, an additional pixel shuffle module with a higher compression rate is included, enabling the compression of visual tokens down to 64 tokens. For each patch, the patch router determines the appropriate compression rate by assessing its semantic richness, and routes it to the corresponding pixel shuffle module accordingly. Benefiting from this patch-aware compression mechanism, InternVL3.5-Flash is able to reduce the number of visual tokens by 50% while maintaining nearly 100% of the performance of InternVL3.5. During the pre-training stage, we update all model parameters jointly using a combination of large-scale text and multimodal corpora. Specifically, given an arbitrary training sample consisting of a multimodal token sequence \\(\mathbf{x}=\left(x_1, x_2, \ldots, x_L\right)\\), the next token prediction (NTP) loss is calculated on each text token as follows: $$ \mathcal{L}_{i}=-\log p_{\theta}\left(x_i \mid x_1, \ldots, x_{i-1}\right), $$ where \\(x_i\\) is the predicted token and the prefix tokens in \\(\{x_1, x_2, \ldots, x_{i-1}\}\\) can be either text tokens or image tokens. Notably, for conversation samples, only response tokens are included in the calculation of the loss. Additionally, to mitigate bias toward either longer or shorter responses during training, we adopt square-root averaging to re-weight the NTP loss as follows: $$ \mathcal{L}_{i}^{'} = \frac{w_i}{\sum_j w_j} \cdot \mathcal{L}_i, \quad w_i = \frac{1}{N^{0.5}}, $$ where \\(N\\) denotes the number of tokens in the training sample on which the loss is calculated. Random JPEG compression is also applied to enhance the model's real-world performance.
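The patch-level compression described above can be sketched with plain array reshapes. This is an illustrative sketch, not the released implementation: the channel width (64 here) is a placeholder, and only the token-count arithmetic (1024 → 256 → 64 tokens per 448×448 patch) follows the text.

```python
import numpy as np

def pixel_shuffle(tokens: np.ndarray, factor: int) -> np.ndarray:
    """Merge each factor x factor neighborhood of visual tokens into one token
    by stacking it along the channel dimension (token count drops by factor**2)."""
    h, w, c = tokens.shape
    assert h % factor == 0 and w % factor == 0
    t = tokens.reshape(h // factor, factor, w // factor, factor, c)
    t = t.transpose(0, 2, 1, 3, 4)  # group each factor x factor neighborhood
    return t.reshape(h // factor, w // factor, factor * factor * c)

# A 448x448 patch yields a 32x32 grid of 1024 vision-encoder tokens.
patch_tokens = np.random.randn(32, 32, 64).astype(np.float32)
default = pixel_shuffle(patch_tokens, 2)  # 16x16 = 256 tokens for the LLM
flash = pixel_shuffle(patch_tokens, 4)    # 8x8   = 64 tokens for "simple" patches
print(default.shape, flash.shape)  # (16, 16, 256) (8, 8, 1024)
```

In InternVL3.5-Flash, the patch router picks between the two compression rates per patch; here both are simply applied to the same patch to show the token counts.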
During the SFT phase, we adopt the same objective as in the pre-training stage and use the square-root averaging strategy to calculate the final loss. In this stage, the context window is set to 32K tokens to accommodate long-context information. Compared to InternVL3, the SFT stage of InternVL3.5 contains more high-quality and diverse training data derived from three sources: (1) Instruction-following data from InternVL3, which are reused to preserve broad coverage of vision–language tasks. (2) Multimodal reasoning data in the "Thinking" mode, which are included to instill long-thinking capabilities in the model. To construct such data, we first use InternVL3-78B to describe the image and then input the description into DeepSeek-R1 to sample rollouts with detailed reasoning processes. Rollouts with an incorrect final answer are filtered out. The questions in these datasets cover various expert domains, such as mathematics and scientific disciplines, thereby strengthening performance on different reasoning tasks. (3) Capability-expansion datasets, which endow InternVL3.5 with new skills, including GUI-based interaction, embodied interaction, and scalable vector graphics (SVG) understanding. Cascade RL aims to combine the benefits of offline RL and online RL to progressively facilitate the post-training of MLLMs in an efficient manner. Specifically, we first fine-tune the model using an offline RL algorithm as an efficient warm-up stage to reach satisfactory results, which guarantees high-quality rollouts for the subsequent stage. Subsequently, we employ an online RL algorithm to further refine the output distribution based on rollouts generated by the model itself. Compared to a single offline or online RL stage, our cascaded RL achieves significant performance improvements at a fraction of the GPU time cost. During the offline RL stage, we employ mixed preference optimization (MPO) to fine-tune the model.
Specifically, the training objective of MPO is a combination of preference loss \\(\mathcal{L}_{p}\\), quality loss \\(\mathcal{L}_{q}\\), and generation loss \\(\mathcal{L}_{g}\\), which can be formulated as follows: $$ \mathcal{L}_{\text{MPO}}= w_{p} \mathcal{L}_{p} + w_{q} \mathcal{L}_{q} + w_{g} \mathcal{L}_{g}, $$ where \\(w_{*}\\) represents the weight assigned to each loss component. The DPO loss, BCO loss, and LM loss serve as the preference loss, quality loss, and generation loss, respectively. During the online RL stage, we employ GSPO, without reference model constraints, as our online RL algorithm, which we find more effective in training both dense and mixture-of-experts (MoE) models. Similar to GRPO, the advantage is defined as the normalized reward across responses sampled from the same query. The training objective of GSPO is given by: $$ \mathcal{L}_{\mathrm{GSPO}}(\theta)=\mathbb{E}_{x \sim \mathcal{D},\,\left\{y_i\right\}_{i=1}^{G} \sim \pi_{\theta_{\text{old}}}(\cdot \mid x)}\left[\frac{1}{G} \sum_{i=1}^{G} \min \left(s_i(\theta) \widehat{A}_i,\ \operatorname{clip}\left(s_i(\theta), 1-\varepsilon, 1+\varepsilon\right) \widehat{A}_i\right)\right], $$ where the importance sampling ratio \\(s_i(\theta)\\) is defined as the geometric mean of the per-token ratios. > Please see our paper for more technical and experimental details. We further include ViCO as an additional training stage to integrate the Visual Resolution Router (ViR) into InternVL3.5, thereby reducing its inference cost. The resulting efficient version of InternVL3.5 is termed InternVL3.5-Flash. In particular, ViCO comprises two stages: `Consistency training`: In this stage, the entire model is trained to minimize the divergence between response distributions conditioned on visual tokens with different compression rates. In practice, we introduce an extra reference model, which is frozen and initialized with InternVL3.5.
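The clipped GSPO surrogate above can be sketched numerically. This is a hedged sketch, not the training code: the helper name `gspo_loss` and the toy numbers are ours; only the geometric-mean importance ratio, the group-normalized advantage, and the clipping follow the stated objective.

```python
import numpy as np

def gspo_loss(logp_new, logp_old, advantages, eps=0.2):
    """Clipped sequence-level surrogate: the importance ratio s_i is the
    geometric mean of per-token ratios, i.e. exp(mean per-token log-ratio)."""
    objs = []
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        s = np.exp(np.mean(np.asarray(lp_new) - np.asarray(lp_old)))
        # Pessimistic clipped objective, as in PPO-style surrogates.
        objs.append(min(s * adv, np.clip(s, 1.0 - eps, 1.0 + eps) * adv))
    return -float(np.mean(objs))  # maximizing the objective = minimizing its negative

# Advantage = reward normalized across the G rollouts of one query (as in GRPO).
rewards = np.array([1.0, 0.0, 0.5])
adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
loss = gspo_loss([[-0.1, -0.2], [-0.3], [-0.2, -0.2]],
                 [[-0.1, -0.25], [-0.3], [-0.4, -0.1]], adv)
```

Because the ratio is computed at the sequence level, every token of a response is scaled by the same factor, which is the key difference from per-token GRPO ratios.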
Given a sample, each image patch is represented as either 256 or 64 tokens, and the training objective is defined as follows: $$ \mathcal{L}_{\text{ViCO}} = \mathbb{E}_{\xi \sim \mathcal{R}} \left[ \frac{1}{N} \sum_{i=1}^{N} \mathrm{KL} \Big( \pi_{\theta_{\text{ref}}}\left(y_i \mid y_{<i}\right) \,\Big\|\, \pi_{\theta}\left(y_i \mid y_{<i}, \xi\right) \Big) \right], $$ where \\(\xi\\) denotes the compression rate sampled from \\(\mathcal{R}\\) and the policy \\(\pi_{\theta}\\) is conditioned on visual tokens compressed at rate \\(\xi\\). > Please see our paper for more technical and experimental details. Test-time scaling (TTS) has been empirically demonstrated as an effective approach to enhance the reasoning capabilities of LLMs and MLLMs, particularly for complex tasks necessitating multi-step inference. In this work, we implement a comprehensive test-time scaling approach that simultaneously improves reasoning depth (i.e., deep thinking) and breadth (i.e., parallel thinking). `Deep Thinking`: By activating the Thinking mode, we guide the model to deliberately engage in step-by-step reasoning (i.e., decomposing complex problems into logical steps and validating intermediate conclusions) prior to generating the final answer. This approach systematically improves the logical structure of solutions for complex problems, particularly those requiring multi-step inference, and enhances reasoning depth. `Parallel Thinking`: Following InternVL3, for reasoning tasks, we adopt the Best-of-N (BoN) strategy by employing VisualPRM-v1.1 as the critic model to select the optimal response from multiple reasoning candidates. This approach improves reasoning breadth. > Notably, unless otherwise specified, the experimental results reported in our paper are obtained without applying TTS. Thus far, we have only applied TTS to reasoning benchmarks, since we found that the model already exhibits strong perception and understanding capabilities, and applying TTS yields no significant improvement. In multimodal inference, the vision encoder and language model have distinct computational characteristics. The vision encoder, which transforms images into semantic features, is highly parallelizable and does not rely on long-term history states.
In contrast, the language model performs inference in an autoregressive manner, which requires previous states to compute the next one. This sequential property makes the language part more sensitive to memory bandwidth and latency. When MLLMs are deployed online at scale, the vision and language models often block each other, incurring additional inference cost. This effect becomes more pronounced with larger vision models or higher-resolution images. As shown in the figure above, we propose Decoupled Vision-Language Deployment (DvD) to address this issue by separating vision and language processing, with a particular focus on optimizing the prefilling stage. The vision subsystem batches and processes images to produce compact feature embeddings, which are then transmitted to the language subsystem for fusion with the text context prior to decoding. This separation alleviates blocking and brings multimodal prefilling performance closer to that of pure language models. In our system implementation, the ViT and MLP (and ViR for InternVL3.5-Flash) are deployed on the vision server, while the language server executes only the LLM. The communication is unidirectional, transmitting BF16 visual features over TCP, with RDMA optionally employed to achieve higher transmission speed. Vision processing, feature transmission, and language processing are organized into an asynchronous three-stage pipeline, enabling overlapped execution and minimizing pipeline stalls. DvD increases GPU utilization and processing efficiency on the vision side, while enabling the language server to focus exclusively on the LLM's prefilling and decoding without being blocked by vision computation. This design leads to improved throughput and responsiveness. Moreover, the architecture supports independent hardware cost optimization for the vision and language modules, and facilitates the seamless integration of new modules without requiring modifications to the language server deployment.
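The asynchronous three-stage pipeline (vision processing → feature transmission → language prefilling) can be sketched with thread-per-stage queues. This is a structural sketch only: the stage functions below are string-producing stand-ins for the real vision server, TCP/RDMA transfer, and LLM prefill.

```python
import queue
import threading

def run_stage(inbox, outbox, work, sink=None):
    """Pull items, process them, and push results downstream.
    A None item is the shutdown signal and is forwarded to the next stage."""
    while (item := inbox.get()) is not None:
        out = work(item)
        if outbox is not None:
            outbox.put(out)
        else:
            sink.append(out)
    if outbox is not None:
        outbox.put(None)

encode_q, send_q, prefill_q = queue.Queue(2), queue.Queue(2), queue.Queue(2)
results = []

# Three overlapped stages: vision encoding, feature transmission, LLM prefill.
stages = [
    threading.Thread(target=run_stage, args=(encode_q, send_q, lambda im: f"feat[{im}]")),
    threading.Thread(target=run_stage, args=(send_q, prefill_q, lambda f: f"sent:{f}")),
    threading.Thread(target=run_stage, args=(prefill_q, None, lambda f: f"prefilled:{f}"),
                     kwargs={"sink": results}),
]
for t in stages:
    t.start()
for im in ["img0", "img1", "img2"]:
    encode_q.put(im)   # later requests stream in while earlier ones are in flight
encode_q.put(None)     # signal shutdown once all requests are enqueued
for t in stages:
    t.join()
print(results)  # ['prefilled:sent:feat[img0]', 'prefilled:sent:feat[img1]', 'prefilled:sent:feat[img2]']
```

Because each stage runs in its own thread with a bounded queue, image `img1` can be encoded while `img0`'s features are being transmitted, which is exactly the overlap that keeps the language server from being blocked by vision computation.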
Multi-Image Understanding & Real-World Comprehension Comprehensive Multimodal Understanding & Multimodal Hallucination Evaluation We provide example code to run `InternVL3.5-8B` using `transformers`. Please note that our models with up to 30B parameters can be deployed on a single A100 GPU, while the 38B model requires two A100 GPUs and the 235B model requires eight A100 GPUs. > In most cases, both LMDeploy and vLLM can be used for model deployment. However, for InternVL3.5-20B-A4B, we recommend using vLLM, since LMDeploy does not yet support GPT-OSS. > Please use transformers>=4.52.1 to ensure the model works normally. For the 20B version of our model, transformers>=4.55.0 is required. To enable thinking mode, please set the system prompt to our Thinking System Prompt. When enabling Thinking mode, we recommend setting `do_sample=True` and `temperature=0.6` to mitigate undesired repetition. Besides this method, you can also use the following code to get streamed output. Many repositories now support fine-tuning of the InternVL series models, including InternVL, SWIFT, XTuner, and others. Please refer to their documentation for more details on fine-tuning. LMDeploy is a toolkit for compressing, deploying, and serving LLMs & VLMs. LMDeploy abstracts the complex inference process of multimodal Vision-Language Models (VLMs) into an easy-to-use pipeline, similar to the Large Language Model (LLM) inference pipeline. When dealing with multiple images, you can put them all in one list. Keep in mind that multiple images will lead to a higher number of input tokens, and as a result, the size of the context window typically needs to be increased. Conducting inference with batch prompts is quite straightforward; just place them within a list structure: There are two ways to do multi-turn conversations with the pipeline. One is to construct messages according to the OpenAI format and use the method introduced above; the other is to use the `pipeline.chat` interface.
LMDeploy's `api_server` enables models to be easily packed into services with a single command. The provided RESTful APIs are compatible with OpenAI's interfaces. Below is an example of service startup: To use the OpenAI-style interface, you need to install OpenAI: This project is released under the Apache-2.0 License. This project uses the pre-trained Qwen3 as a component, which is licensed under the Apache-2.0 License. If you find this project useful in your research, please consider citing:

license:apache-2.0
4,964
6

InternVL2-4B

license:mit
4,801
53

InternVL3-2B-Instruct

license:apache-2.0
4,491
6

InternVL2_5-2B-MPO-hf

β€”
4,459
0

Mini-InternVL-Chat-2B-V1-5

license:mit
4,452
73

InternVL2_5-4B

license:mit
4,227
55

VideoChat-Flash-Qwen2_5-7B_InternVideo2-1B

license:apache-2.0
3,908
5

InternVL3-8B-AWQ

β€”
3,589
8

InternVL2-26B

license:mit
3,106
117

InternVL3_5-30B-A3B-Flash

license:apache-2.0
3,067
5

pvt_v2_b0

license:apache-2.0
2,977
2

InternVL-Chat-V1-5

[\[📂 GitHub\]](https://github.com/OpenGVLab/InternVL) [\[📜 InternVL 1.0\]](https://huggingface.co/papers/2312.14238) [\[📜 InternVL 1.5\]](https://huggingface.co/papers/2404.16821) [\[📜 Mini-InternVL\]](https://arxiv.org/abs/2410.16261) [\[📜 InternVL 2.5\]](https://huggingface.co/papers/2412.05271) [\[🆕 Blog\]](https://internvl.github.io/blog/) [\[🗨️ Chat Demo\]](https://internvl.opengvlab.com/) [\[🤗 HF Demo\]](https://huggingface.co/spaces/OpenGVLab/InternVL) [\[🚀 Quick Start\]](#quick-start) [\[📖 Documents\]](https://internvl.readthedocs.io/en/latest/) > Two interns holding hands, symbolizing the integration of InternViT and InternLM. We introduce InternVL 1.5, an open-source multimodal large language model (MLLM), to bridge the capability gap between open-source and proprietary commercial models in multimodal understanding. 1. Strong Vision Encoder: we explored a continuous learning strategy for the large-scale vision foundation model InternViT-6B, boosting its visual understanding capabilities and enabling it to be transferred to and reused in different LLMs. 2. Dynamic High-Resolution: we divide images into tiles ranging from 1 to 40 of 448 × 448 pixels according to the aspect ratio and resolution of the input images, which supports up to 4K resolution input during inference. 3. High-Quality Bilingual Dataset: we carefully collected a high-quality bilingual dataset that covers common scenes and document images, annotated with English and Chinese question-answer pairs, significantly enhancing performance in OCR- and Chinese-related tasks. - Model Type: multimodal large language model (MLLM) - Architecture: InternViT-6B-448px-V1-5 + MLP + InternLM2-Chat-20B - Image size: dynamic resolution, up to 40 tiles of 448 × 448 (4K resolution).
- Params: 25.5B - Learnable component in the pre-training stage: ViT + MLP - Learnable component in the fine-tuning stage: ViT + MLP + LLM - For more details on training hyperparameters, please see our blog. - We simultaneously use the InternVL and VLMEvalKit repositories for model evaluation. Specifically, the results reported for DocVQA, ChartQA, InfoVQA, TextVQA, MME, AI2D, MMBench, CCBench, MMVet, and SEED-Image were tested using the InternVL repository. OCRBench, RealWorldQA, HallBench, and MathVista were evaluated using VLMEvalKit. Limitations: Although we have made efforts to ensure the safety of the model during the training process and to encourage the model to generate text that complies with ethical and legal requirements, the model may still produce unexpected outputs due to its size and probabilistic generation paradigm. For example, the generated responses may contain biases, discrimination, or other harmful content. Please do not propagate such content. We are not responsible for any consequences resulting from the dissemination of harmful information. We provide example code to run InternVL-Chat-V1-5 using `transformers`. > Please use transformers>=4.37.2 to ensure the model works normally. > ⚠️ Warning: Due to significant quantization errors with BNB 4-bit quantization on InternViT-6B, the model may produce nonsensical outputs and fail to understand images. Therefore, please avoid using BNB 4-bit quantization. The reason for writing the code this way is to avoid errors that occur during multi-GPU inference due to tensors not being on the same device. By ensuring that the first and last layers of the large language model (LLM) are on the same device, we prevent such errors. Besides this method, you can also use the following code to get streamed output. Many repositories now support fine-tuning of the InternVL series models, including InternVL, SWIFT, XTuner, and others. Please refer to their documentation for more details on fine-tuning.
LMDeploy is a toolkit for compressing, deploying, and serving LLMs & VLMs. LMDeploy abstracts the complex inference process of multimodal Vision-Language Models (VLMs) into an easy-to-use pipeline, similar to the Large Language Model (LLM) inference pipeline. If an `ImportError` occurs while executing this case, please install the required dependency packages as prompted. When dealing with multiple images, you can put them all in one list. Keep in mind that multiple images will lead to a higher number of input tokens, and as a result, the size of the context window typically needs to be increased. > Warning: Due to the scarcity of multi-image conversation data, the performance on multi-image tasks may be unstable, and it may require multiple attempts to achieve satisfactory results. Conducting inference with batch prompts is quite straightforward; just place them within a list structure: There are two ways to do multi-turn conversations with the pipeline. One is to construct messages according to the OpenAI format and use the method introduced above; the other is to use the `pipeline.chat` interface. LMDeploy's `api_server` enables models to be easily packed into services with a single command. The provided RESTful APIs are compatible with OpenAI's interfaces. Below is an example of service startup: To use the OpenAI-style interface, you need to install OpenAI: This project is released under the MIT License. This project uses the pre-trained internlm2-chat-20b as a component, which is licensed under the Apache License 2.0. If you find this project useful in your research, please consider citing:
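The Dynamic High-Resolution tiling described in this card (1 to 40 tiles of 448 × 448, chosen by aspect ratio) can be sketched as a search over candidate tile grids. This is an assumption-laden sketch: `pick_tile_grid` is a hypothetical helper, and the released preprocessing additionally considers the image area when breaking ties, which this sketch does not.

```python
def pick_tile_grid(width, height, tile=448, max_tiles=40):
    """Choose a (cols, rows) tile grid whose aspect ratio best matches the
    input image, with 1 <= cols*rows <= max_tiles; ties prefer fewer tiles."""
    aspect = width / height
    candidates = [(c, r) for c in range(1, max_tiles + 1)
                  for r in range(1, max_tiles + 1) if c * r <= max_tiles]
    # Primary key: aspect-ratio mismatch; secondary key: fewer tiles.
    best = min(candidates, key=lambda cr: (abs(cr[0] / cr[1] - aspect), cr[0] * cr[1]))
    return best, (best[0] * tile, best[1] * tile)

grid, resized = pick_tile_grid(1920, 1080)  # a 16:9 input
print(grid, resized)
```

For a 1920×1080 input, the closest grid with at most 40 tiles is 7×4 (ratio 1.75 vs. 1.78), so the image would be resized to 3136×1792 and split into 28 tiles of 448×448.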

license:mit
2,900
416

InternVL3_5-38B-Instruct

[\[πŸ“‚ GitHub\]](https://github.com/OpenGVLab/InternVL) [\[πŸ“œ InternVL 1.0\]](https://huggingface.co/papers/2312.14238) [\[πŸ“œ InternVL 1.5\]](https://huggingface.co/papers/2404.16821) [\[πŸ“œ InternVL 2.5\]](https://huggingface.co/papers/2412.05271) [\[πŸ“œ InternVL2.5-MPO\]](https://huggingface.co/papers/2411.10442) [\[πŸ“œ InternVL3\]](https://huggingface.co/papers/2504.10479) [\[πŸ“œ InternVL3.5\]](https://huggingface.co/papers/2508.18265) [\[πŸ†• Blog\]](https://internvl.github.io/blog/) [\[πŸ—¨οΈ Chat Demo\]](https://chat.intern-ai.org.cn/) [\[πŸš€ Quick Start\]](#quick-start) [\[πŸ“– Documents\]](https://internvl.readthedocs.io/en/latest/) We introduce InternVL3.5, a new family of open-source multimodal models that significantly advances versatility, reasoning capability, and inference efficiency along the InternVL series. A key innovation is the Cascade Reinforcement Learning (Cascade RL) framework, which enhances reasoning through a two-stage process: offline RL for stable convergence and online RL for refined alignment. This coarse-to-fine training strategy leads to substantial improvements on downstream reasoning tasks, e.g., MMMU and MathVista. To optimize efficiency, we propose a Visual Resolution Router (ViR) that dynamically adjusts the resolution of visual tokens without compromising performance. Coupled with ViR, our Decoupled Vision-Language Deployment (DvD) strategy separates the vision encoder and language model across different GPUs, effectively balancing computational load. These contributions collectively enable InternVL3.5 to achieve up to a +16.0\% gain in overall reasoning performance and a 4.05 \\(\times\\) inference speedup compared to its predecessor, i.e., InternVL3. In addition, InternVL3.5 supports novel capabilities such as GUI interaction and embodied agency. 
Notably, our largest model, i.e., InternVL3.5-241B-A28B, attains state-of-the-art results among open-source MLLMs across general multimodal, reasoning, text, and agentic tasksβ€”narrowing the performance gap with leading commercial models like GPT-5. All models and code are publicly released. > Hatched bars represent closed-source commercial models. We report average scores on a set of multimodal general, reasoning, text, and agentic benchmarks: MMBench v1.1 (en), MMStar,BLINK, HallusionBench, AI2D, OCRBench, MMVet, MME-RealWorld (en), MVBench, VideoMME, MMMU, MathVista, MathVision, MathVerse, DynaMath, WeMath, LogicVista, MATH500, AIME24, AIME25, GPQA, MMLU-Pro, GAOKAO, IFEval, SGP-Bench, VSI-Bench, ERQA, SpaCE-10, and OmniSpatial. In the following table, we provide an overview of the InternVL3.5 series. To maintain consistency with earlier generations, we provide two model formats: the GitHub format, consistent with prior releases, and the HF format, aligned with the official Transformers standard. > If you want to convert the checkpoint between these two formats, please refer to the scripts about custom2hf and hf2custom. 
| Model | #Vision Param | #Language Param | #Total Param | HF Link | ModelScope Link | | --------------------- | ------------- | --------------- | ------------ | ------------------------------------------------------------------------------ | ---------------------------------------------------------------------------------------- | | InternVL3.5-1B | 0.3B | 0.8B | 1.1B | πŸ€— link | πŸ€– link | | InternVL3.5-2B | 0.3B | 2.0B | 2.3B | πŸ€— link | πŸ€– link | | InternVL3.5-4B | 0.3B | 4.4B | 4.7B | πŸ€— link | πŸ€– link | | InternVL3.5-8B | 0.3B | 8.2B | 8.5B | πŸ€— link | πŸ€– link | | InternVL3.5-14B | 0.3B | 14.8B | 15.1B | πŸ€— link | πŸ€– link | | InternVL3.5-38B | 5.5B | 32.8B | 38.4B | πŸ€— link | πŸ€– link | | InternVL3.5-20B-A4B | 0.3B | 20.9B | 21.2B-A4B | πŸ€— link | πŸ€– link | | InternVL3.5-30B-A3B | 0.3B | 30.5B | 30.8B-A3B | πŸ€— link | πŸ€– link | | InternVL3.5-241B-A28B | 5.5B | 235.1B | 240.7B-A28B | πŸ€— link | πŸ€– link | | Model | #Vision Param | #Language Param | #Total Param | HF Link | ModelScope Link | | ------------------------ | ------------- | --------------- | ------------ | --------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------- | | InternVL3.5-1B-HF | 0.3B | 0.8B | 1.1B | πŸ€— link | πŸ€– link | | InternVL3.5-2B-HF | 0.3B | 2.0B | 2.3B | πŸ€— link | πŸ€– link | | InternVL3.5-4B-HF | 0.3B | 4.4B | 4.7B | πŸ€— link | πŸ€– link | | InternVL3.5-8B-HF | 0.3B | 8.2B | 8.5B | πŸ€— link | πŸ€– link | | InternVL3.5-14B-HF | 0.3B | 14.8B | 15.1B | πŸ€— link | πŸ€– link | | InternVL3.5-38B-HF | 5.5B | 32.8B | 38.4B | πŸ€— link | πŸ€– link | | InternVL3.5-20B-A4B-HF | 0.3B | 20.9B | 21.2B-A4B | πŸ€— link | πŸ€– link | | InternVL3.5-30B-A3B-HF | 0.3B | 30.5B | 30.8B-A3B | πŸ€— link | πŸ€– link | | InternVL3.5-241B-A28B-HF | 5.5B | 235.1B | 240.7B-A28B | πŸ€— link | πŸ€– link | > We conduct the evaluation with VLMEvalkit. 
To enable the Thinking mode of our model, please set the system prompt to `R1_SYSTEM_PROMPT`. When enabling Thinking mode, we recommend setting `do_sample=True` and `temperature=0.6` to mitigate undesired repetition. Our training pipeline comprises four stages: Multimodal Continual Pre-Training (CPT), Supervised Fine-Tuning (SFT), Cascade Reinforcement Learning (CascadeRL), and, for the Flash variants, Visual Consistency Learning (ViCO). In CascadeRL, we first fine-tune the model using Mixed Preference Optimization (MPO) under an offline RL setting, followed by GSPO under an online RL setting. ViCO, the lightweight stage introduced for the Flash version of InternVL3.5, reduces the token cost required to represent an image patch. Here, we also open-source the model weights after different training stages for potential research usage. If you're unsure which version to use, please select the one without any suffix, as it has completed the full training pipeline.

| Model | Training Pipeline | HF Link | ModelScope Link |
| --- | --- | --- | --- |
| InternVL3.5-1B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-1B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-1B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-1B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |
| InternVL3.5-2B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-2B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-2B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-2B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |
| InternVL3.5-4B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-4B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-4B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-4B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |
| InternVL3.5-8B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-8B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-8B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-8B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |
| InternVL3.5-14B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-14B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-14B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-14B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |
| InternVL3.5-30B-A3B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-30B-A3B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-30B-A3B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-30B-A3B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |
| InternVL3.5-38B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-38B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-38B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-38B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |
| InternVL3.5-241B-A28B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-241B-A28B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-241B-A28B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-241B-A28B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |

The Flash version of our model will be released as soon as possible. `InternVL3.5`: This series of models follows the "ViT-MLP-LLM" paradigm adopted in previous versions of InternVL. We initialize the language model with the Qwen3 series and GPT-OSS, and the vision encoder with InternViT-300M and InternViT-6B. The Dynamic High Resolution strategy introduced in InternVL1.5 is also retained in our design. `InternVL3.5-Flash`: Compared to InternVL3.5, InternVL3.5-Flash further integrates the Visual Resolution Router (ViR), yielding a series of efficient variants suitable for resource-constrained scenarios.
Specifically, in InternVL3.5, each image patch is initially represented as 1024 visual tokens for the vision encoder, which are then compressed into 256 tokens via a pixel shuffle module before being passed to the Large Language Model (LLM). In InternVL3.5-Flash, as shown in the Figure below, an additional pixel shuffle module with a higher compression rate is included, enabling visual tokens to be compressed down to 64 tokens. For each patch, the patch router determines the appropriate compression rate by assessing its semantic richness, and routes it to the corresponding pixel shuffle module accordingly. Benefiting from this patch-aware compression mechanism, InternVL3.5-Flash is able to reduce the number of visual tokens by 50\% while maintaining nearly 100\% of the performance of InternVL3.5. During the pre-training stage, we update all model parameters jointly on a combination of large-scale text and multimodal corpora. Specifically, given an arbitrary training sample consisting of a multimodal token sequence \\(\mathbf{x}=\left(x_1, x_2, \ldots, x_L\right)\\), the next token prediction (NTP) loss is calculated on each text token as follows: $$ \mathcal{L}_{i}=-\log p_{\theta}\left(x_i \mid x_1, \ldots, x_{i-1}\right), $$ where \\(x_i\\) is the predicted token and the prefix tokens \\(\{x_1, x_2, \ldots, x_{i-1}\}\\) can be either text tokens or image tokens. Notably, for conversation samples, only response tokens are included in the loss calculation. Additionally, to mitigate bias toward either longer or shorter responses during training, we adopt square averaging to re-weight the NTP loss as follows: $$ \mathcal{L}_{i}^{'} = \frac{w_i}{\sum_j w_j} \cdot \mathcal{L}_i, \quad w_i = \frac{1}{N^{0.5}}, $$ where \\(N\\) denotes the number of tokens in the training sample on which the loss is calculated. Random JPEG compression is also applied to enhance the model's robustness to real-world images.
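The square-averaging re-weight above can be sketched in a few lines. A minimal, self-contained example, assuming per-token NTP losses are already computed (the function name is illustrative, not from the official codebase):

```python
def reweight_ntp_loss(token_losses):
    """Square-averaging re-weight of per-token NTP losses.

    token_losses: list of samples, each a list of per-token losses.
    Every token in a sample with N loss-bearing tokens gets weight
    w = N ** -0.5, normalized over all tokens in the batch, so long
    responses are down-weighted relative to plain token averaging.
    """
    weights, losses = [], []
    for sample in token_losses:
        n = len(sample)  # N: number of loss-bearing tokens in this sample
        for loss in sample:
            weights.append(n ** -0.5)
            losses.append(loss)
    total = sum(weights)
    # L'_i = (w_i / sum_j w_j) * L_i, summed over the batch
    return sum(w / total * l for w, l in zip(weights, losses))
```

For a batch containing a 4-token sample and a 1-token sample, each long-sample token carries weight 0.5 while the single token carries weight 1.0 before normalization, which is the bias mitigation the formula describes.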
During the SFT phase, we adopt the same objective as in the pre-training stage and use the square-root averaging strategy to calculate the final loss. In this stage, the context window is set to 32K tokens to accommodate long-context information. Compared to InternVL3, the SFT stage of InternVL3.5 contains higher-quality and more diverse training data derived from three sources: (1) Instruction-following data from InternVL3, which are reused to preserve broad coverage of vision-language tasks. (2) Multimodal reasoning data in the "Thinking" mode, which are included to instill long-thinking capabilities in the model. To construct such data, we first use InternVL3-78B to describe the image and then input the description into DeepSeek-R1 to sample rollouts with detailed reasoning processes. Rollouts with an incorrect final answer are filtered out. The questions in these datasets cover various expert domains, such as mathematics and scientific disciplines, thereby strengthening performance on different reasoning tasks. (3) Capability-expansion datasets, which endow InternVL3.5 with new skills, including GUI-based interaction, embodied interaction, and scalable vector graphics (SVG) understanding and generation. Cascade RL aims to combine the benefits of offline RL and online RL to progressively facilitate the post-training of MLLMs in an efficient manner. Specifically, we first fine-tune the model with an offline RL algorithm as an efficient warm-up stage to reach satisfactory results, which guarantees high-quality rollouts for the subsequent stage. Subsequently, we employ an online RL algorithm to further refine the output distribution based on rollouts generated by the model itself. Compared to a single offline or online RL stage, our cascaded RL achieves significant performance improvements at a fraction of the GPU time cost. During the offline RL stage, we employ mixed preference optimization (MPO) to fine-tune the model.
Specifically, the training objective of MPO is a combination of preference loss \\(\mathcal{L}_{p}\\), quality loss \\(\mathcal{L}_{q}\\), and generation loss \\(\mathcal{L}_{g}\\), which can be formulated as follows: $$ \mathcal{L}_{\text{MPO}}= w_{p} \mathcal{L}_{p} + w_{q} \mathcal{L}_{q} + w_{g} \mathcal{L}_{g}, $$ where \\(w_{*}\\) represents the weight assigned to each loss component. The DPO loss, BCO loss, and LM loss serve as the preference loss, quality loss, and generation loss, respectively. During the online RL stage, we employ GSPO, without reference model constraints, as our online RL algorithm, which we find more effective in training both dense and mixture-of-experts (MoE) models. Similar to GRPO, the advantage is defined as the normalized reward across responses sampled from the same query. The training objective of GSPO is given by: $$ \mathcal{L}_{\mathrm{GSPO}}(\theta)=\mathbb{E}_{x \sim \mathcal{D},\,\{y_i\}_{i=1}^{G} \sim \pi_{\theta_{\text{old}}}(\cdot \mid x)}\left[\frac{1}{G} \sum_{i=1}^{G} \min \left(s_i(\theta) \widehat{A}_i, \operatorname{clip}\left(s_i(\theta), 1-\varepsilon, 1+\varepsilon\right) \widehat{A}_i\right)\right], $$ where the importance sampling ratio \\(s_i(\theta)=\left(\frac{\pi_{\theta}(y_i \mid x)}{\pi_{\theta_{\text{old}}}(y_i \mid x)}\right)^{1/|y_i|}\\) is defined as the geometric mean of the per-token ratios. > Please see our paper for more technical and experimental details. We further include ViCO as an additional training stage to integrate the visual resolution router (ViR) into InternVL3.5, thereby reducing its inference cost. The resulting efficient versions of InternVL3.5 are termed InternVL3.5-Flash. In particular, ViCO comprises two stages: `Consistency training`: In this stage, the entire model is trained to minimize the divergence between response distributions conditioned on visual tokens with different compression rates. In practice, we introduce an extra reference model, which is frozen and initialized with InternVL3.5.
Given a sample, each image patch is represented as either 256 or 64 tokens, and the training objective is defined as follows: $$ \mathcal{L}_{\text{ViCO}} = \mathbb{E}_{\xi \sim \mathcal{R}} \left[ \frac{1}{N} \sum_{i=1}^{N} \mathrm{KL}\Big( \pi_{\theta_{\text{ref}}}\left(y_i \mid y_{<i}, I\right) \,\big\|\, \pi_{\theta}\left(y_i \mid y_{<i}, I_{\xi}\right) \Big) \right], $$ where \\(\xi\\) denotes the compression rate sampled from the router \\(\mathcal{R}\\), \\(I\\) the uncompressed visual tokens, and \\(I_{\xi}\\) the visual tokens compressed at rate \\(\xi\\). `Router training`: In this stage, only the ViR is trained to select the appropriate compression rate for each image patch, while the rest of the model remains frozen. > Please see our paper for more technical and experimental details. Test-time scaling (TTS) has been empirically demonstrated as an effective approach to enhance the reasoning capabilities of LLMs and MLLMs, particularly for complex tasks necessitating multi-step inference. In this work, we implement a comprehensive test-time scaling approach that simultaneously improves reasoning depth (i.e., deep thinking) and breadth (i.e., parallel thinking). `Deep Thinking`: By activating the Thinking mode, we guide the model to deliberately engage in step-by-step reasoning (i.e., decomposing complex problems into logical steps and validating intermediate conclusions) prior to generating the final answer. This approach systematically improves the logical structure of solutions for complex problems, particularly those requiring multi-step inference, and enhances reasoning depth. `Parallel Thinking`: Following InternVL3, for reasoning tasks, we adopt the Best-of-N (BoN) strategy, employing VisualPRM-v1.1 as the critic model to select the optimal response from multiple reasoning candidates. This approach improves reasoning breadth. > Notably, unless otherwise specified, the experimental results reported in our paper are obtained without applying TTS. Thus far, we have only applied TTS to reasoning benchmarks, since we found that the model already exhibits strong perception and understanding capabilities, and applying TTS yields no significant improvement. In multimodal inference, the vision encoder and language model have distinct computational characteristics. The vision encoder, which transforms images into semantic features, is highly parallelizable and does not rely on long-term historical states.
In contrast, the language model performs inference autoregressively, requiring the previous states to compute the next one. This sequential property makes the language part more sensitive to memory bandwidth and latency. When MLLMs are deployed online at scale, the vision and language models often block each other, incurring additional inference cost. This effect becomes more pronounced with larger vision models or higher-resolution images. As shown in the Figure above, we propose decoupled vision-language deployment (DvD) to address this issue by separating vision and language processing, with a particular focus on optimizing the prefilling stage. The vision subsystem batches and processes images to produce compact feature embeddings, which are then transmitted to the language subsystem for fusion with the text context prior to decoding. This separation alleviates blocking and brings multimodal prefilling performance closer to that of pure language models. In our system implementation, the ViT and MLP (and ViR for InternVL3.5-Flash) are deployed on the vision server, while the language server executes only the LLM. The communication is unidirectional, transmitting BF16 visual features over TCP, with RDMA optionally employed to achieve higher transmission speed. Vision processing, feature transmission, and language processing are organized into an asynchronous three-stage pipeline, enabling overlapped execution and minimizing pipeline stalls. DvD increases GPU utilization and processing efficiency on the vision side, while enabling the language server to focus exclusively on the LLM's prefilling and decoding without being blocked by vision computation. This design leads to improved throughput and responsiveness. Moreover, the architecture supports independent hardware cost optimization for the vision and language modules, and facilitates the seamless integration of new modules without requiring modifications to the language server deployment.
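The asynchronous three-stage pipeline can be illustrated with ordinary queues and threads. This is only a toy sketch of the overlap DvD exploits, not the actual server code; the three stage callables stand in for vision encoding, TCP/RDMA feature transmission, and LLM prefilling:

```python
import queue
import threading

def run_dvd_pipeline(images, encode, transmit, prefill):
    """Toy sketch of DvD's asynchronous three-stage pipeline:
    vision encoding -> feature transmission -> language prefilling.
    Each stage runs in its own thread, so the vision stage can work on
    image i+1 while transmission handles image i and the language stage
    prefills image i-1, instead of the stages blocking one another."""
    q_feat, q_sent = queue.Queue(), queue.Queue()
    results = []

    def vision_worker():
        for img in images:
            q_feat.put(encode(img))      # ViT + MLP on the vision server
        q_feat.put(None)                 # sentinel: no more items

    def transmit_worker():
        while (feat := q_feat.get()) is not None:
            q_sent.put(transmit(feat))   # e.g. BF16 features over TCP/RDMA
        q_sent.put(None)

    def language_worker():
        while (feat := q_sent.get()) is not None:
            results.append(prefill(feat))  # LLM prefilling on the language server

    workers = [vision_worker, transmit_worker, language_worker]
    threads = [threading.Thread(target=w) for w in workers]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Because each hand-off goes through a FIFO queue, outputs stay in input order while the three stages execute concurrently.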
Multi-Image Understanding & Real-World Comprehension Comprehensive Multimodal Understanding & Multimodal Hallucination Evaluation We provide example code to run `InternVL3.5-8B` using `transformers`. Please note that our models with up to 30B parameters can be deployed on a single A100 GPU, while the 38B model requires two A100 GPUs and the 241B model requires eight A100 GPUs. > In most cases, both LMDeploy and vLLM can be used for model deployment. However, for InternVL3.5-20B-A4B, we recommend using vLLM, since LMDeploy does not yet support GPT-OSS. > Please use transformers>=4.52.1 to ensure the model works normally. For the 20B version of our model, transformers>=4.55.0 is required. To enable Thinking mode, please set the system prompt to our Thinking System Prompt. When enabling Thinking mode, we recommend setting `do_sample=True` and `temperature=0.6` to mitigate undesired repetition. Besides this method, you can also use the following code to get streamed output. Many repositories now support fine-tuning of the InternVL series models, including InternVL, SWIFT, XTuner, and others. Please refer to their documentation for more details on fine-tuning. LMDeploy is a toolkit for compressing, deploying, and serving LLMs & VLMs. LMDeploy abstracts the complex inference process of multi-modal Vision-Language Models (VLM) into an easy-to-use pipeline, similar to the Large Language Model (LLM) inference pipeline. When dealing with multiple images, you can put them all in one list. Keep in mind that multiple images will lead to a higher number of input tokens, and as a result, the size of the context window typically needs to be increased. Conducting inference with batch prompts is quite straightforward; just place them within a list structure: There are two ways to do multi-turn conversations with the pipeline. One is to construct messages according to the OpenAI format and use the method introduced above; the other is to use the `pipeline.chat` interface.
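For the first approach, the conversation history is simply a list of OpenAI-format messages. A minimal sketch, where the image URL and the assistant's reply are placeholders:

```python
# OpenAI-style message history for a multi-turn multimodal conversation.
# Each user turn may mix text and image parts; assistant turns carry the
# earlier responses so the model sees the full dialogue.
def build_messages(turns):
    """turns: list of (role, content) pairs; content is a plain string for
    text-only turns, or a list of content parts for multimodal turns."""
    return [{"role": role, "content": content} for role, content in turns]

messages = build_messages([
    ("user", [
        {"type": "text", "text": "Describe this image."},
        {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},
    ]),
    ("assistant", "A cat sitting on a windowsill."),
    ("user", "What color is it?"),  # follow-up turn reuses the history above
])
```

This `messages` list can then be passed to the pipeline call; alternatively, the `pipeline.chat` interface manages the history for you.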
LMDeploy's `api_server` enables models to be easily packed into services with a single command. The provided RESTful APIs are compatible with OpenAI's interfaces. Below is an example of service startup: To use the OpenAI-style interface, you need to install the OpenAI Python package: This project is released under the Apache-2.0 License. This project uses the pre-trained Qwen3 as a component, which is licensed under the Apache-2.0 License. If you find this project useful in your research, please consider citing:

[\[πŸ“‚ GitHub\]](https://github.com/OpenGVLab/InternVL) [\[πŸ“œ InternVL 1.0\]](https://huggingface.co/papers/2312.14238) [\[πŸ“œ InternVL 1.5\]](https://huggingface.co/papers/2404.16821) [\[πŸ“œ InternVL 2.5\]](https://huggingface.co/papers/2412.05271) [\[πŸ“œ InternVL2.5-MPO\]](https://huggingface.co/papers/2411.10442) [\[πŸ“œ InternVL3\]](https://huggingface.co/papers/2504.10479) [\[πŸ“œ InternVL3.5\]](https://huggingface.co/papers/2508.18265) [\[πŸ†• Blog\]](https://internvl.github.io/blog/) [\[πŸ—¨οΈ Chat Demo\]](https://chat.intern-ai.org.cn/) [\[πŸš€ Quick Start\]](#quick-start) [\[πŸ“– Documents\]](https://internvl.readthedocs.io/en/latest/) We introduce InternVL3.5, a new family of open-source multimodal models that significantly advances versatility, reasoning capability, and inference efficiency along the InternVL series. A key innovation is the Cascade Reinforcement Learning (Cascade RL) framework, which enhances reasoning through a two-stage process: offline RL for stable convergence and online RL for refined alignment. This coarse-to-fine training strategy leads to substantial improvements on downstream reasoning tasks, e.g., MMMU and MathVista. To optimize efficiency, we propose a Visual Resolution Router (ViR) that dynamically adjusts the resolution of visual tokens without compromising performance. Coupled with ViR, our Decoupled Vision-Language Deployment (DvD) strategy separates the vision encoder and language model across different GPUs, effectively balancing computational load. These contributions collectively enable InternVL3.5 to achieve up to a +16.0\% gain in overall reasoning performance and a 4.05 \\(\times\\) inference speedup compared to its predecessor, i.e., InternVL3. In addition, InternVL3.5 supports novel capabilities such as GUI interaction and embodied agency. 
Notably, our largest model, i.e., InternVL3.5-241B-A28B, attains state-of-the-art results among open-source MLLMs across general multimodal, reasoning, text, and agentic tasksβ€”narrowing the performance gap with leading commercial models like GPT-5. All models and code are publicly released. > Hatched bars represent closed-source commercial models. We report average scores on a set of multimodal general, reasoning, text, and agentic benchmarks: MMBench v1.1 (en), MMStar,BLINK, HallusionBench, AI2D, OCRBench, MMVet, MME-RealWorld (en), MVBench, VideoMME, MMMU, MathVista, MathVision, MathVerse, DynaMath, WeMath, LogicVista, MATH500, AIME24, AIME25, GPQA, MMLU-Pro, GAOKAO, IFEval, SGP-Bench, VSI-Bench, ERQA, SpaCE-10, and OmniSpatial. In the following table, we provide an overview of the InternVL3.5 series. To maintain consistency with earlier generations, we provide two model formats: the GitHub format, consistent with prior releases, and the HF format, aligned with the official Transformers standard. > If you want to convert the checkpoint between these two formats, please refer to the scripts about custom2hf and hf2custom. 
| Model | #Vision Param | #Language Param | #Total Param | HF Link | ModelScope Link | | --------------------- | ------------- | --------------- | ------------ | ------------------------------------------------------------------------------ | ---------------------------------------------------------------------------------------- | | InternVL3.5-1B | 0.3B | 0.8B | 1.1B | πŸ€— link | πŸ€– link | | InternVL3.5-2B | 0.3B | 2.0B | 2.3B | πŸ€— link | πŸ€– link | | InternVL3.5-4B | 0.3B | 4.4B | 4.7B | πŸ€— link | πŸ€– link | | InternVL3.5-8B | 0.3B | 8.2B | 8.5B | πŸ€— link | πŸ€– link | | InternVL3.5-14B | 0.3B | 14.8B | 15.1B | πŸ€— link | πŸ€– link | | InternVL3.5-38B | 5.5B | 32.8B | 38.4B | πŸ€— link | πŸ€– link | | InternVL3.5-20B-A4B | 0.3B | 20.9B | 21.2B-A4B | πŸ€— link | πŸ€– link | | InternVL3.5-30B-A3B | 0.3B | 30.5B | 30.8B-A3B | πŸ€— link | πŸ€– link | | InternVL3.5-241B-A28B | 5.5B | 235.1B | 240.7B-A28B | πŸ€— link | πŸ€– link | | Model | #Vision Param | #Language Param | #Total Param | HF Link | ModelScope Link | | ------------------------ | ------------- | --------------- | ------------ | --------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------- | | InternVL3.5-1B-HF | 0.3B | 0.8B | 1.1B | πŸ€— link | πŸ€– link | | InternVL3.5-2B-HF | 0.3B | 2.0B | 2.3B | πŸ€— link | πŸ€– link | | InternVL3.5-4B-HF | 0.3B | 4.4B | 4.7B | πŸ€— link | πŸ€– link | | InternVL3.5-8B-HF | 0.3B | 8.2B | 8.5B | πŸ€— link | πŸ€– link | | InternVL3.5-14B-HF | 0.3B | 14.8B | 15.1B | πŸ€— link | πŸ€– link | | InternVL3.5-38B-HF | 5.5B | 32.8B | 38.4B | πŸ€— link | πŸ€– link | | InternVL3.5-20B-A4B-HF | 0.3B | 20.9B | 21.2B-A4B | πŸ€— link | πŸ€– link | | InternVL3.5-30B-A3B-HF | 0.3B | 30.5B | 30.8B-A3B | πŸ€— link | πŸ€– link | | InternVL3.5-241B-A28B-HF | 5.5B | 235.1B | 240.7B-A28B | πŸ€— link | πŸ€– link | > We conduct the evaluation with VLMEvalkit. 
To enable the Thinking mode of our model, please set the system prompt to R1SYSTEMPROMPT. When enabling Thinking mode, we recommend setting `dosample=True` and `temperature=0.6` to mitigate undesired repetition. Our training pipeline comprises four stages: Multimodal Continual Pre-Training (CPT), Supervised Fine-Tuning (SFT), and Cascade Reinforcement Learning (CascadeRL). In CascadeRL, we first fine-tune the model using Mixed Preference Optimization (MPO) under an offline RL setting, followed by GSPO under an oneline RL setting. For the Flash version of InternVL3.5, we additionally introduce a lightweight training stage, termed Visual Consistency Learning (ViCO), which reduces the token cost required to represent an image patch. Here, we also open-source the model weights after different training stages for potential research usage. If you're unsure which version to use, please select the one without any suffix, as it has completed the full training pipeline. | Model | Training Pipeline | HF Link | ModelScope Link | | -------------------------------- | --------------------- | --------------------------------------------------------------------------- | ------------------------------------------------------------------------------------- | | InternVL3.5-1B-Pretrained | CPT | πŸ€— link | πŸ€– link | | InternVL3.5-1B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link | | InternVL3.5-1B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link | | InternVL3.5-1B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link | | InternVL3.5-2B-Pretrained | CPT | πŸ€— link | πŸ€– link | | InternVL3.5-2B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link | | InternVL3.5-2B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link | | InternVL3.5-2B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link | | InternVL3.5-4B-Pretrained | CPT | πŸ€— link | πŸ€– link | | InternVL3.5-4B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link | | InternVL3.5-4B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link | | InternVL3.5-4B | CPT + SFT + 
CascadeRL | πŸ€— link | πŸ€– link | | InternVL3.5-8B-Pretrained | CPT | πŸ€— link | πŸ€– link | | InternVL3.5-8B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link | | InternVL3.5-8B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link | | InternVL3.5-8B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link | | InternVL3.5-14B-Pretrained | CPT | πŸ€— link | πŸ€– link | | InternVL3.5-14B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link | | InternVL3.5-14B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link | | InternVL3.5-14B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link | | InternVL3.5-30B-A3B-Pretrained | CPT | πŸ€— link | πŸ€– link | | InternVL3.5-30B-A3B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link | | InternVL3.5-30B-A3B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link | | InternVL3.5-30B-A3B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link | | InternVL3.5-38B-Pretrained | CPT | πŸ€— link | πŸ€– link | | InternVL3.5-38B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link | | InternVL3.5-38B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link | | InternVL3.5-38B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link | | InternVL3.5-241B-A28B-Pretrained | CPT | πŸ€— link | πŸ€– link | | InternVL3.5-241B-A28B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link | | InternVL3.5-241B-A28B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link | | InternVL3.5-241B-A28B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link | The Flash version of our model will be released as soon as possible. `InternVL3.5`: This series of models follow the "ViT–MLP–LLM" paradigm adopted in previous versions of InternVL. We initialize the language model using the Qwen3 series and GPT-OSS, and the vision encoder using InternViT-300M and InternViT-6B. The Dynamic High Resolution strategy introduced in InternVL1.5 is also retained in our design. `InternVL3.5-Flash`: Compared to InternVL3.5, InternVL3.5-Flash further integrates the Visual Resolution Router (ViR), thus yielding a series of efficient variants friendly suitable for resource-constrained scenarios. 
Specifically, in InternVL3.5, each image patch is initially represented as 1024 visual tokens for the vision encoder, which are then compressed into 256 tokens via a pixel shuffle module before being passed to the Large Language Model (LLM). In InternVL3.5-Flash, as shown in the Figure below, an additional pixel shuffle module with a higher compression rate is included, enabling the compression of visual tokens down to 64 tokens. For each patch, the patch router determines the appropriate compression rate by assessing its semantic richness, and routes it to the corresponding pixel shuffle module accordingly. Benefiting from this patch-aware compression mechanism, InternVL3.5-Flash is able to reduce the number of visual tokens by 50\% while maintaining nearly 100\% of the performance of InternVL3.5. During the pre-training stage, we update all model parameters jointly using the combination of large-scale text and multimodal corpora. Specifically, given an arbitrary training sample consisting of a multimodal token sequence \\(\mathbf{x}=\left(x1, x2, \ldots, xL\right)\\), the next token prediction (NTP) loss is calculated on each text token as follows: $$ \mathcal{L}{i}=-\log p\theta\left(xi \mid x1, \ldots, x{i-1}\right), $$ where \\(xi\\) is the predicted token and prefix tokens in \\(\{x1, x2, \ldots, x{i-1}\}\\) can be either text tokens or image tokens. Notably, for conversation samples, only response tokens are included for the calculation of the loss. Additionally, to mitigate bias toward either longer or shorter responses during training, we adopt the square averaging to re-weight the NTP loss as follows: $$ \mathcal{L}{i}^{'} = \frac{wi}{\sumj wj} \cdot \mathcal{L}i, \quad wi = \frac{1}{N^{0.5}}, $$ where \\(N\\) denotes the number of tokens in the training sample on which the loss needs to be calculated. The random JPEG compression is also included to enhance the model's real-world performance. 
During the SFT phase, we adopt the same objective as in the pre-training stage and use the square-root averaging strategy to calculate the final loss. In this stage, the context window is set to 32K tokens to adapt long-context information. Compared to InternVL3, the SFT stage of InternVL3.5 contains more high-quality and diverse training data derived from three sources: (1) Instruction-following data from InternVL3, which are reused to preserve broad coverage of vision–language tasks. (2) Multimodal reasoning data in the "Thinking" mode, which are included to instill long-thinking capabilities in the model. To construct such data, we first use InternVL3-78B to describe the image and then input the description into DeepSeek-R1 to sample rollouts with detailed reasoning processes. Rollouts with an incorrect final answer are filtered out. The questions in these datasets cover various expert domains, such as mathematics and scientific disciplines, thereby strengthening performance on different reasoning tasks. (3) Capability-expansion datasets, which endow InternVL3.5 with new skills, including GUI-based interaction, embodied interaction, and scalable vect Cascade RL aims to combine the benefits of offline RL and online RL to progressively facilitate the post-training of MLLMs in an efficient manner. Specifically, we first fine-tune the model using an offline RL algorithm as an efficient warm-up stage to reach a satisfied results, which can guarantee the high-quality rollouts for the latter stage. Subsequently, we employ an online RL algorithm to further refine the output distribution based on rollouts generated by the model itself. Compared to the single offline or online RL stage, our cascaded RL achieves significant performance improvements at a fraction of the GPU time cost. During the offline RL stage, we employ mixed preference optimization (MPO) to fine-tune the model. 
Specifically, the training objective of MPO is a combination of preference loss \\(\mathcal{L}{p}\\), quality loss \\(\mathcal{L}{q}\\), and generation loss \\(\mathcal{L}{g}\\), which can be formulated as follows: $$ \mathcal{L}{\text{MPO}}= w{p} \mathcal{L}{p} + w{q} \mathcal{L}{q} + w{g} \mathcal{L}{g} , $$ where \\(w{}\\) represents the weight assigned to each loss component. The DPO loss, BCO loss, and LM loss serve as the preference loss, quality loss, and generation loss, respectively. During the online RL stage, we employ GSPO, without reference model constraints, as our online RL algorithm, which we find more effective in training both dense and mixture-of-experts (MoE) models. Similar to GRPO, the advantage is defined as the normalized reward across responses sampled from the same query. The training objective of GSPO is given by: $$ \mathcal{L}{\mathrm{GSPO}}(\theta)=\mathbb{E}{x \sim \mathcal{D},\left\{yi\right\}{i=1}^G \sim \pi{\theta \text { old }}(\cdot \mid x)}\left[\frac{1}{G} \sum{i=1}^G \min \left(si(\theta) \widehat{A}i, \operatorname{clip}\left(si(\theta), 1-\varepsilon, 1+\varepsilon\right) \widehat{A}i\right)\right], $$ where the importance sampling ratio is defined as the geometric mean of the per-token ratios. > Please see our paper for more technical and experimental details. We further include ViCO as an additional training stage to integrate the visual resolution router (ViR) into InternVL3.5, thereby reducing the inference cost of InternVL3.5. The obtained efficient version of InternVL3.5 are termed as InternVL3.5-Flash. In particular, ViCO comprises two stages: `Consistency training`: In this stage, the entire model is trained to minimize the divergence between response distributions conditioned on visual tokens with different compression rates. In practice, we introduce an extra reference model, which is frozen and initialized with InternVL3.5. 
Given a sample, each image patch is represented as either 256 or 64 tokens, and the training objective is defined as follows: $$ \mathcal{L}\text{ViCO} = \mathbb{E}{\xi \sim \mathcal{R}} \Bigg \frac{1}{N} \sum{i=1}^{N} \mathrm{KL} \Big( \pi{\theta{ref}}\left(yi \mid y{ Please see [our paper for more technical and experimental details. Test-time scaling (TTS) has been empirically demonstrated as an effective approach to enhance the reasoning capabilities of LLMs and MLLMs, particularly for complex tasks necessitating multi-step inference. In this work, we implement a comprehensive test-time scaling approach that simultaneously improves reasoning depth (i.e., deep thinking) and breadth (i.e., parallel thinking). `Deep Thinking`: By activating the Thinking mode, we guide the model to deliberately engage in step-by-step reasoning (i.e., decomposing complex problems into logical steps and validating intermediate conclusions) prior to generating the final answer. This approach systematically improves the logical structure of solutions for complex problems, particularly those requiring multi-step inference, and enhances reasoning depth. `Parallel Thinking`: Following InternVL3, for reasoning tasks, we adopt the Best-of-N (BoN) strategy by employing VisualPRM-v1.1 as the critic model to select the optimal response from multiple reasoning candidates. This approach improves reasoning breadth. > Notably, unless otherwise specified, the experimental results reported in our paper are obtained without applying TTS. Thus far, we have only applied TTS to reasoning benchmarks, since we found that the model already exhibits strong perception and understanding capabilities, and initiating TTS yields no significant improvement. In multimodal inference, the vision encoder and language model have distinct computational characteristics. The vision encoder that transforms images into semantic features is highly parallelizable and does not rely on long-term history state. 
In contrast, the language model performs inference in an autoregressive manner, requiring previous states to compute the next one. This sequential property makes the language part more sensitive to memory bandwidth and latency. When MLLMs are deployed online at scale, the vision and language models often block each other, incurring additional inference cost. This effect becomes more pronounced with larger vision models or higher-resolution images.

As shown in the Figure above, we propose decoupled vision-language deployment (DvD) to address this issue by separating vision and language processing, with a particular focus on optimizing the prefilling stage. The vision subsystem batches and processes images to produce compact feature embeddings, which are then transmitted to the language subsystem for fusion with the text context prior to decoding. This separation alleviates blocking and brings multimodal prefilling performance closer to that of pure language models.

In our system implementation, the ViT and MLP (and ViR for InternVL3.5-Flash) are deployed on the vision server, while the language server executes only the LLM. The communication is unidirectional, transmitting BF16 visual features over TCP, with RDMA optionally employed to achieve higher transmission speed. Vision processing, feature transmission, and language processing are organized into an asynchronous three-stage pipeline, enabling overlapped execution and minimizing pipeline stalls.

DvD increases GPU utilization and processing efficiency on the vision side, while enabling the language server to focus exclusively on the LLM's prefilling and decoding without being blocked by vision computation. This design leads to improved throughput and responsiveness. Moreover, the architecture supports independent hardware cost optimization for the vision and language modules, and facilitates the seamless integration of new modules without requiring modifications to the language server deployment.
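The asynchronous three-stage pipeline described above can be sketched with standard queues. The stage functions below are hypothetical stand-ins for ViT encoding, BF16 feature transmission, and LLM prefill; the point is only that FIFO queues let the three stages overlap while preserving request order:

```python
import threading
import queue

def run_stage(fn, inbox, outbox):
    # Pull items until the sentinel, apply the stage function, pass results on.
    while True:
        item = inbox.get()
        if item is None:
            outbox.put(None)
            break
        outbox.put(fn(item))

def three_stage_pipeline(requests, encode, transmit, prefill):
    """Overlap vision encoding, feature transmission, and LLM prefill."""
    q0, q1, q2, out = queue.Queue(), queue.Queue(), queue.Queue(), queue.Queue()
    stages = [
        threading.Thread(target=run_stage, args=(encode, q0, q1)),
        threading.Thread(target=run_stage, args=(transmit, q1, q2)),
        threading.Thread(target=run_stage, args=(prefill, q2, out)),
    ]
    for t in stages:
        t.start()
    for r in requests:
        q0.put(r)
    q0.put(None)  # sentinel: no more requests
    results = []
    while (item := out.get()) is not None:
        results.append(item)
    for t in stages:
        t.join()
    return results

# Stand-in stage functions (real stages: ViT+MLP encode, BF16 send, LLM prefill).
feats = three_stage_pipeline(
    ["img0", "img1", "img2"],
    encode=lambda x: f"{x}:enc",
    transmit=lambda x: f"{x}:tx",
    prefill=lambda x: f"{x}:pf",
)
print(feats)  # ['img0:enc:tx:pf', 'img1:enc:tx:pf', 'img2:enc:tx:pf']
```

Because each stage is a single thread reading from a single FIFO queue, ordering is preserved end-to-end while the three stages execute concurrently on different requests.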
Multi-Image Understanding & Real-World Comprehension Comprehensive Multimodal Understanding & Multimodal Hallucination Evaluation

We provide example code to run `InternVL3.5-8B` using `transformers`. Please note that our models with up to 30B parameters can be deployed on a single A100 GPU, while the 38B model requires two A100 GPUs and the 235B model requires eight A100 GPUs.

> In most cases, both LMDeploy and vLLM can be used for model deployment. However, for InternVL3.5-20B-A4B, we recommend using vLLM, since LMDeploy does not yet support GPT-OSS.

> Please use transformers>=4.52.1 to ensure the model works normally. For the 20B version of our model, transformers>=4.55.0 is required.

To enable thinking mode, please set the system prompt to our Thinking System Prompt. When enabling Thinking mode, we recommend setting `do_sample=True` and `temperature=0.6` to mitigate undesired repetition. Besides this method, you can also use the following code to get streamed output.

Many repositories now support fine-tuning of the InternVL series models, including InternVL, SWIFT, XTuner, and others. Please refer to their documentation for more details on fine-tuning.

LMDeploy is a toolkit for compressing, deploying, and serving LLMs & VLMs. LMDeploy abstracts the complex inference process of multimodal Vision-Language Models (VLM) into an easy-to-use pipeline, similar to the Large Language Model (LLM) inference pipeline.

When dealing with multiple images, you can put them all in one list. Keep in mind that multiple images will lead to a higher number of input tokens, and as a result, the size of the context window typically needs to be increased. Conducting inference with batch prompts is quite straightforward; just place them within a list structure. There are two ways to conduct multi-turn conversations with the pipeline: one is to construct messages according to the OpenAI format and use the method introduced above; the other is to use the `pipeline.chat` interface.
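For reference, a multi-image, multi-turn message list in the OpenAI format mentioned above might be assembled as follows. The URLs, prompt text, and assistant reply are illustrative placeholders, not outputs of the model:

```python
def make_user_turn(text, image_urls):
    """Build one OpenAI-format user message carrying several images in one list."""
    content = [{"type": "text", "text": text}]
    for url in image_urls:
        content.append({"type": "image_url", "image_url": {"url": url}})
    return {"role": "user", "content": content}

messages = [
    make_user_turn(
        "Describe the two images.",
        ["https://example.com/a.jpg", "https://example.com/b.jpg"],
    ),
    # The assistant reply from the previous turn is appended verbatim,
    # then the next user turn continues the conversation.
    {"role": "assistant", "content": "The first image shows ..."},
    make_user_turn("What differs between them?", []),
]
print(len(messages))  # 3
```

Each extra image adds an `image_url` entry to the user turn's content list, which is why multi-image prompts consume proportionally more input tokens.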
LMDeploy's `api_server` enables models to be easily packed into services with a single command. The provided RESTful APIs are compatible with OpenAI's interfaces. To use the OpenAI-style interface, you need to install the `openai` package.

This project is released under the Apache-2.0 License. This project uses the pre-trained Qwen3 as a component, which is licensed under the Apache-2.0 License. If you find this project useful in your research, please consider citing:

license:apache-2.0
1,915
8

VideoMAEv2-giant

license:cc-by-nc-4.0
1,871
4

InternVL3_5-8B-MPO

Notably, our largest model, i.e., InternVL3.5-241B-A28B, attains state-of-the-art results among open-source MLLMs across general multimodal, reasoning, text, and agentic tasks, narrowing the performance gap with leading commercial models like GPT-5. All models and code are publicly released.

> Hatched bars represent closed-source commercial models. We report average scores on a set of multimodal general, reasoning, text, and agentic benchmarks: MMBench v1.1 (en), MMStar, BLINK, HallusionBench, AI2D, OCRBench, MMVet, MME-RealWorld (en), MVBench, VideoMME, MMMU, MathVista, MathVision, MathVerse, DynaMath, WeMath, LogicVista, MATH500, AIME24, AIME25, GPQA, MMLU-Pro, GAOKAO, IFEval, SGP-Bench, VSI-Bench, ERQA, SpaCE-10, and OmniSpatial.

In the following table, we provide an overview of the InternVL3.5 series. To maintain consistency with earlier generations, we provide two model formats: the GitHub format, consistent with prior releases, and the HF format, aligned with the official Transformers standard.

> If you want to convert the checkpoint between these two formats, please refer to the scripts about custom2hf and hf2custom.
| Model | #Vision Param | #Language Param | #Total Param | HF Link | ModelScope Link |
| --------------------- | ------------- | --------------- | ------------ | ------- | --------------- |
| InternVL3.5-1B | 0.3B | 0.8B | 1.1B | πŸ€— link | πŸ€– link |
| InternVL3.5-2B | 0.3B | 2.0B | 2.3B | πŸ€— link | πŸ€– link |
| InternVL3.5-4B | 0.3B | 4.4B | 4.7B | πŸ€— link | πŸ€– link |
| InternVL3.5-8B | 0.3B | 8.2B | 8.5B | πŸ€— link | πŸ€– link |
| InternVL3.5-14B | 0.3B | 14.8B | 15.1B | πŸ€— link | πŸ€– link |
| InternVL3.5-38B | 5.5B | 32.8B | 38.4B | πŸ€— link | πŸ€– link |
| InternVL3.5-20B-A4B | 0.3B | 20.9B | 21.2B-A4B | πŸ€— link | πŸ€– link |
| InternVL3.5-30B-A3B | 0.3B | 30.5B | 30.8B-A3B | πŸ€— link | πŸ€– link |
| InternVL3.5-241B-A28B | 5.5B | 235.1B | 240.7B-A28B | πŸ€— link | πŸ€– link |

| Model | #Vision Param | #Language Param | #Total Param | HF Link | ModelScope Link |
| ------------------------ | ------------- | --------------- | ------------ | ------- | --------------- |
| InternVL3.5-1B-HF | 0.3B | 0.8B | 1.1B | πŸ€— link | πŸ€– link |
| InternVL3.5-2B-HF | 0.3B | 2.0B | 2.3B | πŸ€— link | πŸ€– link |
| InternVL3.5-4B-HF | 0.3B | 4.4B | 4.7B | πŸ€— link | πŸ€– link |
| InternVL3.5-8B-HF | 0.3B | 8.2B | 8.5B | πŸ€— link | πŸ€– link |
| InternVL3.5-14B-HF | 0.3B | 14.8B | 15.1B | πŸ€— link | πŸ€– link |
| InternVL3.5-38B-HF | 5.5B | 32.8B | 38.4B | πŸ€— link | πŸ€– link |
| InternVL3.5-20B-A4B-HF | 0.3B | 20.9B | 21.2B-A4B | πŸ€— link | πŸ€– link |
| InternVL3.5-30B-A3B-HF | 0.3B | 30.5B | 30.8B-A3B | πŸ€— link | πŸ€– link |
| InternVL3.5-241B-A28B-HF | 5.5B | 235.1B | 240.7B-A28B | πŸ€— link | πŸ€– link |

> We conduct the evaluation with VLMEvalKit.
To enable the Thinking mode of our model, please set the system prompt to `R1_SYSTEM_PROMPT`. When enabling Thinking mode, we recommend setting `do_sample=True` and `temperature=0.6` to mitigate undesired repetition.

Our training pipeline comprises four stages: Multimodal Continual Pre-Training (CPT), Supervised Fine-Tuning (SFT), and a two-stage Cascade Reinforcement Learning (CascadeRL). In CascadeRL, we first fine-tune the model using Mixed Preference Optimization (MPO) under an offline RL setting, followed by GSPO under an online RL setting. For the Flash version of InternVL3.5, we additionally introduce a lightweight training stage, termed Visual Consistency Learning (ViCO), which reduces the token cost required to represent an image patch.

Here, we also open-source the model weights after different training stages for potential research usage. If you're unsure which version to use, please select the one without any suffix, as it has completed the full training pipeline.

| Model | Training Pipeline | HF Link | ModelScope Link |
| -------------------------------- | --------------------- | ------- | --------------- |
| InternVL3.5-1B-Pretrained | CPT | πŸ€— link | πŸ€– link |
| InternVL3.5-1B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link |
| InternVL3.5-1B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link |
| InternVL3.5-1B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link |
| InternVL3.5-2B-Pretrained | CPT | πŸ€— link | πŸ€– link |
| InternVL3.5-2B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link |
| InternVL3.5-2B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link |
| InternVL3.5-2B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link |
| InternVL3.5-4B-Pretrained | CPT | πŸ€— link | πŸ€– link |
| InternVL3.5-4B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link |
| InternVL3.5-4B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link |
| InternVL3.5-4B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link |
| InternVL3.5-8B-Pretrained | CPT | πŸ€— link | πŸ€– link |
| InternVL3.5-8B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link |
| InternVL3.5-8B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link |
| InternVL3.5-8B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link |
| InternVL3.5-14B-Pretrained | CPT | πŸ€— link | πŸ€– link |
| InternVL3.5-14B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link |
| InternVL3.5-14B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link |
| InternVL3.5-14B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link |
| InternVL3.5-30B-A3B-Pretrained | CPT | πŸ€— link | πŸ€– link |
| InternVL3.5-30B-A3B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link |
| InternVL3.5-30B-A3B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link |
| InternVL3.5-30B-A3B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link |
| InternVL3.5-38B-Pretrained | CPT | πŸ€— link | πŸ€– link |
| InternVL3.5-38B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link |
| InternVL3.5-38B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link |
| InternVL3.5-38B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link |
| InternVL3.5-241B-A28B-Pretrained | CPT | πŸ€— link | πŸ€– link |
| InternVL3.5-241B-A28B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link |
| InternVL3.5-241B-A28B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link |
| InternVL3.5-241B-A28B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link |

The Flash version of our model will be released as soon as possible.

`InternVL3.5`: This series of models follows the "ViT–MLP–LLM" paradigm adopted in previous versions of InternVL. We initialize the language model using the Qwen3 series and GPT-OSS, and the vision encoder using InternViT-300M and InternViT-6B. The Dynamic High Resolution strategy introduced in InternVL1.5 is also retained in our design.

`InternVL3.5-Flash`: Compared to InternVL3.5, InternVL3.5-Flash further integrates the Visual Resolution Router (ViR), yielding a series of efficient variants suitable for resource-constrained scenarios.
Specifically, in InternVL3.5, each image patch is initially represented as 1024 visual tokens for the vision encoder, which are then compressed into 256 tokens via a pixel shuffle module before being passed to the Large Language Model (LLM). In InternVL3.5-Flash, as shown in the Figure below, an additional pixel shuffle module with a higher compression rate is included, enabling the compression of visual tokens down to 64 tokens. For each patch, the patch router determines the appropriate compression rate by assessing its semantic richness, and routes it to the corresponding pixel shuffle module accordingly. Benefiting from this patch-aware compression mechanism, InternVL3.5-Flash is able to reduce the number of visual tokens by 50\% while maintaining nearly 100\% of the performance of InternVL3.5.

During the pre-training stage, we update all model parameters jointly using a combination of large-scale text and multimodal corpora. Specifically, given an arbitrary training sample consisting of a multimodal token sequence \\(\mathbf{x}=\left(x_1, x_2, \ldots, x_L\right)\\), the next token prediction (NTP) loss is calculated on each text token as follows:

$$ \mathcal{L}_{i}=-\log p_{\theta}\left(x_i \mid x_1, \ldots, x_{i-1}\right), $$

where \\(x_i\\) is the predicted token and the prefix tokens in \\(\{x_1, x_2, \ldots, x_{i-1}\}\\) can be either text tokens or image tokens. Notably, for conversation samples, only response tokens are included in the calculation of the loss. Additionally, to mitigate bias toward either longer or shorter responses during training, we adopt square-root averaging to re-weight the NTP loss as follows:

$$ \mathcal{L}_{i}^{'} = \frac{w_i}{\sum_j w_j} \cdot \mathcal{L}_i, \quad w_i = \frac{1}{N^{0.5}}, $$

where \\(N\\) denotes the number of tokens in the training sample on which the loss needs to be calculated. Random JPEG compression is also included to enhance the model's real-world performance.
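The re-weighting above can be illustrated in a few lines. In this schematic (not the training code), every token in a sample of \\(N\\) loss-bearing tokens carries weight \\(N^{-0.5}\\), and the weights are normalized over all tokens in the batch:

```python
def sqrt_reweighted_loss(per_sample_token_losses):
    """Combine per-token NTP losses with w = 1/sqrt(N), normalized over the batch.

    per_sample_token_losses: list of lists, one inner list of token losses per
    training sample (only tokens that actually contribute to the loss).
    """
    token_weights, token_losses = [], []
    for sample in per_sample_token_losses:
        n = len(sample)
        w = n ** -0.5  # w_i = 1 / N^{0.5}, shared by all tokens of the sample
        for loss in sample:
            token_weights.append(w)
            token_losses.append(loss)
    z = sum(token_weights)  # normalizer: sum of w_j over all tokens in the batch
    return sum(w / z * l for w, l in zip(token_weights, token_losses))

# A 4-token sample and a 1-token sample, all token losses equal to 1.0:
# the long sample's tokens get weight 0.5 each, the short sample's token gets 1.0.
print(sqrt_reweighted_loss([[1.0] * 4, [1.0]]))  # 1.0
```

Compared to plain token averaging, which would let the 4-token sample dominate 4:1, the square-root weights soften that imbalance (here 2:1), which is the stated goal of mitigating length bias.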
During the SFT phase, we adopt the same objective as in the pre-training stage and use the square-root averaging strategy to calculate the final loss. In this stage, the context window is set to 32K tokens to accommodate long-context information. Compared to InternVL3, the SFT stage of InternVL3.5 contains more high-quality and diverse training data derived from three sources: (1) Instruction-following data from InternVL3, which are reused to preserve broad coverage of vision–language tasks. (2) Multimodal reasoning data in the "Thinking" mode, which are included to instill long-thinking capabilities in the model. To construct such data, we first use InternVL3-78B to describe the image and then input the description into DeepSeek-R1 to sample rollouts with detailed reasoning processes. Rollouts with an incorrect final answer are filtered out. The questions in these datasets cover various expert domains, such as mathematics and scientific disciplines, thereby strengthening performance on different reasoning tasks. (3) Capability-expansion datasets, which endow InternVL3.5 with new skills, including GUI-based interaction, embodied interaction, and scalable vector graphics (SVG) understanding.

Cascade RL aims to combine the benefits of offline RL and online RL to progressively facilitate the post-training of MLLMs in an efficient manner. Specifically, we first fine-tune the model using an offline RL algorithm as an efficient warm-up stage to reach satisfactory results, which guarantees high-quality rollouts for the subsequent stage. We then employ an online RL algorithm to further refine the output distribution based on rollouts generated by the model itself. Compared to a single offline or online RL stage, our cascaded RL achieves significant performance improvements at a fraction of the GPU time cost. During the offline RL stage, we employ mixed preference optimization (MPO) to fine-tune the model.
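In code, the MPO objective used in this offline stage is just a weighted sum of its three component losses (DPO as preference loss, BCO as quality loss, LM as generation loss). The weight values below are illustrative placeholders, not the values used in training:

```python
def mpo_loss(l_pref, l_qual, l_gen, w_pref=0.8, w_qual=0.1, w_gen=0.1):
    """L_MPO = w_p * L_p + w_q * L_q + w_g * L_g.

    l_pref: preference loss (DPO), l_qual: quality loss (BCO),
    l_gen: generation loss (LM). Weights here are hypothetical.
    """
    return w_pref * l_pref + w_qual * l_qual + w_gen * l_gen

print(round(mpo_loss(1.0, 2.0, 3.0), 6))  # 1.3
```

Keeping the generation (LM) term in the mix prevents the preference objective from degrading plain language-modeling quality during the offline warm-up.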

license:apache-2.0
1,629
3

InternVL3_5-4B-HF

license:apache-2.0
1,512
3

InternVL_2_5_HiCo_R16

[\[πŸ“‚ GitHub\]](https://github.com/OpenGVLab/InternVideo/tree/main/InternVideo2.5) [\[πŸ“œ Tech Report\]](https://arxiv.org/abs/2501.12386) InternVideo2.5 is a video multimodal large language model (MLLM, built upon InternVL2.5) enhanced with long and rich context (LRC) modeling. It significantly improves upon existing MLLMs by enhancing their ability to perceive fine-grained details and capture long-form temporal structures. We achieve this through dense vision task annotations using task preference optimization (TPO) and compact spatiotemporal representations via adaptive hierarchical token compression (HiCo). This model is a variant of InternVideo2.5's ablation experiments, built on HiCo technology only (R16 means 16 tokens per frame).

πŸ“ˆ Performance

| Model | MVBench | LongVideoBench | VideoMME (w/o sub) |
| --- | --- | --- | --- |
| InternVL2.5_HiCo_R16 | 74.0 | 59.6 | 64.9 |

First, you need to install Flash Attention 2 and some other modules; please refer to the GitHub repository for a simple installation example.
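To make the "R16" token budget concrete, here is a hypothetical sketch that compresses one frame's token vectors down to 16 by mean-pooling contiguous groups. The real HiCo scheme is adaptive and hierarchical; this only illustrates the 16-tokens-per-frame budget:

```python
def pool_frame_tokens(tokens, target=16):
    """Mean-pool a frame's token vectors into `target` tokens.

    tokens: list of equal-length feature vectors (lists of floats);
    len(tokens) must be divisible by `target` for this simple sketch.
    """
    group = len(tokens) // target
    pooled = []
    for g in range(target):
        chunk = tokens[g * group:(g + 1) * group]
        dim = len(chunk[0])
        # Average each feature dimension over the group's tokens.
        pooled.append([sum(vec[d] for vec in chunk) / group for d in range(dim)])
    return pooled

frame = [[float(i)] * 4 for i in range(256)]  # 256 tokens, 4-dim features
compressed = pool_frame_tokens(frame)
print(len(compressed))  # 16
```

At 16 tokens per frame, a 256-frame clip costs the LLM only 4096 visual tokens, which is what makes long-video inputs tractable.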

license:apache-2.0
1,485
6

InternVL3_5-4B-Flash

license:apache-2.0
1,396
3

InternVL3-1B-Pretrained

license:apache-2.0
1,310
3

InternVL3-8B-Pretrained

license:apache-2.0
1,303
0

InternVL3-14B-Instruct

license:apache-2.0
1,294
9

InternVL-Chat-V1-2

license:mit
1,293
17

InternVideo2_CLIP_S

license:apache-2.0
1,268
1

InternVL3_5-14B-HF

license:apache-2.0
1,238
3

InternVL3_5-1B-Flash

license:apache-2.0
1,160
4

InternVL2_5-8B-MPO

license:mit
1,089
48

InternVL2_5-38B-MPO

license:mit
1,036
20

VideoChat-R1_5-7B

[\[πŸ“‚ GitHub\]](https://github.com/OpenGVLab/VideoChat-R1) [\[πŸ“œ Tech Report\]](https://arxiv.org/pdf/2509.21100v1) Use the `qwen_vl_utils` helpers in https://github.com/OpenGVLab/VideoChat-R1/blob/main/Videochat-R1.5/srceval/myvisionprocess.py for vision preprocessing. If you find this project useful in your research, please consider citing:

license:apache-2.0
987
7

InternVL3-78B-Instruct

β€”
925
8

VideoChat-Flash-Qwen2_5-2B_res448

license:apache-2.0
828
26

InternViT 6B 448px V2 5

license:mit
752
47

InternVL3_5-1B-MPO

[\[πŸ“‚ GitHub\]](https://github.com/OpenGVLab/InternVL) [\[πŸ“œ InternVL 1.0\]](https://huggingface.co/papers/2312.14238) [\[πŸ“œ InternVL 1.5\]](https://huggingface.co/papers/2404.16821) [\[πŸ“œ InternVL 2.5\]](https://huggingface.co/papers/2412.05271) [\[πŸ“œ InternVL2.5-MPO\]](https://huggingface.co/papers/2411.10442) [\[πŸ“œ InternVL3\]](https://huggingface.co/papers/2504.10479) [\[πŸ“œ InternVL3.5\]](https://huggingface.co/papers/2508.18265) [\[πŸ†• Blog\]](https://internvl.github.io/blog/) [\[πŸ—¨οΈ Chat Demo\]](https://chat.intern-ai.org.cn/) [\[πŸš€ Quick Start\]](#quick-start) [\[πŸ“– Documents\]](https://internvl.readthedocs.io/en/latest/) We introduce InternVL3.5, a new family of open-source multimodal models that significantly advances versatility, reasoning capability, and inference efficiency along the InternVL series. A key innovation is the Cascade Reinforcement Learning (Cascade RL) framework, which enhances reasoning through a two-stage process: offline RL for stable convergence and online RL for refined alignment. This coarse-to-fine training strategy leads to substantial improvements on downstream reasoning tasks, e.g., MMMU and MathVista. To optimize efficiency, we propose a Visual Resolution Router (ViR) that dynamically adjusts the resolution of visual tokens without compromising performance. Coupled with ViR, our Decoupled Vision-Language Deployment (DvD) strategy separates the vision encoder and language model across different GPUs, effectively balancing computational load. These contributions collectively enable InternVL3.5 to achieve up to a +16.0\% gain in overall reasoning performance and a 4.05 \\(\times\\) inference speedup compared to its predecessor, i.e., InternVL3. In addition, InternVL3.5 supports novel capabilities such as GUI interaction and embodied agency. 
Notably, our largest model, i.e., InternVL3.5-241B-A28B, attains state-of-the-art results among open-source MLLMs across general multimodal, reasoning, text, and agentic tasksβ€”narrowing the performance gap with leading commercial models like GPT-5. All models and code are publicly released. > Hatched bars represent closed-source commercial models. We report average scores on a set of multimodal general, reasoning, text, and agentic benchmarks: MMBench v1.1 (en), MMStar,BLINK, HallusionBench, AI2D, OCRBench, MMVet, MME-RealWorld (en), MVBench, VideoMME, MMMU, MathVista, MathVision, MathVerse, DynaMath, WeMath, LogicVista, MATH500, AIME24, AIME25, GPQA, MMLU-Pro, GAOKAO, IFEval, SGP-Bench, VSI-Bench, ERQA, SpaCE-10, and OmniSpatial. In the following table, we provide an overview of the InternVL3.5 series. To maintain consistency with earlier generations, we provide two model formats: the GitHub format, consistent with prior releases, and the HF format, aligned with the official Transformers standard. > If you want to convert the checkpoint between these two formats, please refer to the scripts about custom2hf and hf2custom. 
| Model | #Vision Param | #Language Param | #Total Param | HF Link | ModelScope Link | | --------------------- | ------------- | --------------- | ------------ | ------------------------------------------------------------------------------ | ---------------------------------------------------------------------------------------- | | InternVL3.5-1B | 0.3B | 0.8B | 1.1B | πŸ€— link | πŸ€– link | | InternVL3.5-2B | 0.3B | 2.0B | 2.3B | πŸ€— link | πŸ€– link | | InternVL3.5-4B | 0.3B | 4.4B | 4.7B | πŸ€— link | πŸ€– link | | InternVL3.5-8B | 0.3B | 8.2B | 8.5B | πŸ€— link | πŸ€– link | | InternVL3.5-14B | 0.3B | 14.8B | 15.1B | πŸ€— link | πŸ€– link | | InternVL3.5-38B | 5.5B | 32.8B | 38.4B | πŸ€— link | πŸ€– link | | InternVL3.5-20B-A4B | 0.3B | 20.9B | 21.2B-A4B | πŸ€— link | πŸ€– link | | InternVL3.5-30B-A3B | 0.3B | 30.5B | 30.8B-A3B | πŸ€— link | πŸ€– link | | InternVL3.5-241B-A28B | 5.5B | 235.1B | 240.7B-A28B | πŸ€— link | πŸ€– link | | Model | #Vision Param | #Language Param | #Total Param | HF Link | ModelScope Link | | ------------------------ | ------------- | --------------- | ------------ | --------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------- | | InternVL3.5-1B-HF | 0.3B | 0.8B | 1.1B | πŸ€— link | πŸ€– link | | InternVL3.5-2B-HF | 0.3B | 2.0B | 2.3B | πŸ€— link | πŸ€– link | | InternVL3.5-4B-HF | 0.3B | 4.4B | 4.7B | πŸ€— link | πŸ€– link | | InternVL3.5-8B-HF | 0.3B | 8.2B | 8.5B | πŸ€— link | πŸ€– link | | InternVL3.5-14B-HF | 0.3B | 14.8B | 15.1B | πŸ€— link | πŸ€– link | | InternVL3.5-38B-HF | 5.5B | 32.8B | 38.4B | πŸ€— link | πŸ€– link | | InternVL3.5-20B-A4B-HF | 0.3B | 20.9B | 21.2B-A4B | πŸ€— link | πŸ€– link | | InternVL3.5-30B-A3B-HF | 0.3B | 30.5B | 30.8B-A3B | πŸ€— link | πŸ€– link | | InternVL3.5-241B-A28B-HF | 5.5B | 235.1B | 240.7B-A28B | πŸ€— link | πŸ€– link | > We conduct the evaluation with VLMEvalkit. 
To enable the Thinking mode of our model, please set the system prompt to R1_SYSTEM_PROMPT. When enabling Thinking mode, we recommend setting `do_sample=True` and `temperature=0.6` to mitigate undesired repetition.

Our training pipeline comprises three stages: Multimodal Continual Pre-Training (CPT), Supervised Fine-Tuning (SFT), and Cascade Reinforcement Learning (CascadeRL). In CascadeRL, we first fine-tune the model using Mixed Preference Optimization (MPO) under an offline RL setting, followed by GSPO under an online RL setting. For the Flash version of InternVL3.5, we additionally introduce a lightweight training stage, termed Visual Consistency Learning (ViCO), which reduces the token cost required to represent an image patch.

Here, we also open-source the model weights after different training stages for potential research usage. If you're unsure which version to use, please select the one without any suffix, as it has completed the full training pipeline.

| Model | Training Pipeline | HF Link | ModelScope Link |
| -------------------------------- | --------------------- | ------- | --------------- |
| InternVL3.5-1B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-1B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-1B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-1B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |
| InternVL3.5-2B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-2B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-2B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-2B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |
| InternVL3.5-4B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-4B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-4B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-4B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |
| InternVL3.5-8B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-8B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-8B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-8B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |
| InternVL3.5-14B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-14B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-14B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-14B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |
| InternVL3.5-30B-A3B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-30B-A3B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-30B-A3B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-30B-A3B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |
| InternVL3.5-38B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-38B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-38B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-38B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |
| InternVL3.5-241B-A28B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-241B-A28B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-241B-A28B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-241B-A28B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |

The Flash version of our model will be released as soon as possible.

`InternVL3.5`: This series of models follows the "ViT–MLP–LLM" paradigm adopted in previous versions of InternVL. We initialize the language model using the Qwen3 series and GPT-OSS, and the vision encoder using InternViT-300M and InternViT-6B. The Dynamic High Resolution strategy introduced in InternVL1.5 is also retained in our design.

`InternVL3.5-Flash`: Compared to InternVL3.5, InternVL3.5-Flash further integrates the Visual Resolution Router (ViR), yielding a series of efficient variants suitable for resource-constrained scenarios.
Specifically, in InternVL3.5, each image patch is initially represented as 1024 visual tokens for the vision encoder, which are then compressed into 256 tokens via a pixel shuffle module before being passed to the Large Language Model (LLM). In InternVL3.5-Flash, as shown in the Figure below, an additional pixel shuffle module with a higher compression rate is included, enabling the compression of visual tokens down to 64 tokens. For each patch, the patch router determines the appropriate compression rate by assessing its semantic richness, and routes it to the corresponding pixel shuffle module accordingly. Benefiting from this patch-aware compression mechanism, InternVL3.5-Flash is able to reduce the number of visual tokens by 50\% while maintaining nearly 100\% of the performance of InternVL3.5.

During the pre-training stage, we update all model parameters jointly using a combination of large-scale text and multimodal corpora. Specifically, given an arbitrary training sample consisting of a multimodal token sequence \\(\mathbf{x}=\left(x_1, x_2, \ldots, x_L\right)\\), the next token prediction (NTP) loss is calculated on each text token as follows:

$$
\mathcal{L}_{i}=-\log p_{\theta}\left(x_i \mid x_1, \ldots, x_{i-1}\right),
$$

where \\(x_i\\) is the predicted token and the prefix tokens \\(\{x_1, x_2, \ldots, x_{i-1}\}\\) can be either text tokens or image tokens. Notably, for conversation samples, only response tokens are included in the calculation of the loss. Additionally, to mitigate bias toward either longer or shorter responses during training, we adopt square-root averaging to re-weight the NTP loss as follows:

$$
\mathcal{L}_{i}^{\prime} = \frac{w_i}{\sum_j w_j} \cdot \mathcal{L}_i, \quad w_i = \frac{1}{N^{0.5}},
$$

where \\(N\\) denotes the number of tokens in the training sample on which the loss needs to be calculated. Random JPEG compression is also applied to enhance the model's real-world performance.
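As a concrete illustration, the square-root re-weighting above can be sketched in a few lines of NumPy. This is a toy sketch, not the actual training code; the function name and the batch layout (one list of per-token losses per sample) are hypothetical:

```python
import numpy as np

def reweighted_ntp_loss(token_losses_per_sample):
    """Toy sketch of square-root loss averaging: every token in a sample
    with N loss tokens gets weight w_i = 1 / N**0.5, and the weights are
    normalized over the whole batch (w_i / sum_j w_j)."""
    weights, losses = [], []
    for sample in token_losses_per_sample:
        n = len(sample)
        weights.extend([1.0 / n ** 0.5] * n)
        losses.extend(sample)
    weights = np.asarray(weights)
    weights /= weights.sum()
    return float(np.dot(weights, np.asarray(losses)))

# Compared with plain token averaging, a 4-token response is no longer
# drowned out by a 100-token response in the same batch.
batch = [[2.0] * 4, [1.0] * 100]
print(reweighted_ntp_loss(batch))  # 14/12 ≈ 1.167 (plain averaging gives ≈ 1.038)
```

With plain averaging, the long sample contributes 100 of 104 terms; under square-root weighting, the two samples contribute in proportion to \\(\sqrt{N}\\) rather than \\(N\\).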
During the SFT phase, we adopt the same objective as in the pre-training stage and use the square-root averaging strategy to calculate the final loss. In this stage, the context window is set to 32K tokens to accommodate long-context information. Compared to InternVL3, the SFT stage of InternVL3.5 contains more high-quality and diverse training data derived from three sources:

(1) Instruction-following data from InternVL3, which are reused to preserve broad coverage of vision–language tasks.

(2) Multimodal reasoning data in the "Thinking" mode, which are included to instill long-thinking capabilities in the model. To construct such data, we first use InternVL3-78B to describe the image and then input the description into DeepSeek-R1 to sample rollouts with detailed reasoning processes. Rollouts with an incorrect final answer are filtered out. The questions in these datasets cover various expert domains, such as mathematics and scientific disciplines, thereby strengthening performance on different reasoning tasks.

(3) Capability-expansion datasets, which endow InternVL3.5 with new skills, including GUI-based interaction, embodied interaction, and scalable vector graphics (SVG) understanding and generation.

Cascade RL aims to combine the benefits of offline RL and online RL to progressively facilitate the post-training of MLLMs in an efficient manner. Specifically, we first fine-tune the model with an offline RL algorithm as an efficient warm-up stage to reach satisfactory results, which guarantees high-quality rollouts for the subsequent stage. We then employ an online RL algorithm to further refine the output distribution based on rollouts generated by the model itself. Compared to a single offline or online RL stage, our cascaded RL achieves significant performance improvements at a fraction of the GPU time cost. During the offline RL stage, we employ mixed preference optimization (MPO) to fine-tune the model.
Specifically, the training objective of MPO is a combination of preference loss \\(\mathcal{L}_{p}\\), quality loss \\(\mathcal{L}_{q}\\), and generation loss \\(\mathcal{L}_{g}\\), which can be formulated as follows:

$$
\mathcal{L}_{\text{MPO}}= w_{p} \mathcal{L}_{p} + w_{q} \mathcal{L}_{q} + w_{g} \mathcal{L}_{g},
$$

where \\(w_{*}\\) represents the weight assigned to each loss component. The DPO loss, BCO loss, and LM loss serve as the preference loss, quality loss, and generation loss, respectively.

During the online RL stage, we employ GSPO, without reference model constraints, as our online RL algorithm, which we find more effective for training both dense and mixture-of-experts (MoE) models. Similar to GRPO, the advantage is defined as the normalized reward across responses sampled from the same query. The training objective of GSPO is given by:

$$
\mathcal{L}_{\mathrm{GSPO}}(\theta)=\mathbb{E}_{x \sim \mathcal{D},\,\left\{y_i\right\}_{i=1}^{G} \sim \pi_{\theta_{\text{old}}}(\cdot \mid x)}\left[\frac{1}{G} \sum_{i=1}^{G} \min \left(s_i(\theta) \widehat{A}_i, \operatorname{clip}\left(s_i(\theta), 1-\varepsilon, 1+\varepsilon\right) \widehat{A}_i\right)\right],
$$

where the importance sampling ratio \\(s_i(\theta)\\) is defined as the geometric mean of the per-token ratios.

> Please see our paper for more technical and experimental details.

We further include ViCO as an additional training stage to integrate the visual resolution router (ViR) into InternVL3.5, thereby reducing its inference cost. The resulting efficient version of InternVL3.5 is termed InternVL3.5-Flash. In particular, ViCO comprises two stages:

`Consistency training`: In this stage, the entire model is trained to minimize the divergence between response distributions conditioned on visual tokens with different compression rates. In practice, we introduce an extra reference model, which is frozen and initialized with InternVL3.5.
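In toy form, this consistency objective amounts to an expected per-token KL divergence between the frozen reference model (conditioned on uncompressed visual tokens) and the trained model (conditioned on compressed ones). The sketch below is purely illustrative; the function names, toy next-token distributions, and rate-sampling probabilities are hypothetical:

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) between two discrete next-token distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

def vico_consistency_loss(ref_dists, policy_dists_by_rate, rate_probs):
    """Expected mean per-token KL over sampled compression rates xi:
    ref_dists[i]             -- reference next-token distribution at step i,
    policy_dists_by_rate[xi] -- policy distributions when each patch is
                                compressed to xi tokens (e.g. 256 or 64),
    rate_probs[xi]           -- probability of sampling rate xi."""
    loss = 0.0
    for rate, prob in rate_probs.items():
        per_token = [kl_divergence(p_ref, p_pol)
                     for p_ref, p_pol in zip(ref_dists, policy_dists_by_rate[rate])]
        loss += prob * float(np.mean(per_token))
    return loss
```

When the policy matches the reference under every compression rate, the loss is zero; stronger compression typically raises the KL term, which consistency training drives back down.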
Given a sample, each image patch is represented as either 256 or 64 tokens, and the training objective is defined as follows:

$$
\mathcal{L}_{\text{ViCO}} = \mathbb{E}_{\xi \sim \mathcal{R}} \left[ \frac{1}{N} \sum_{i=1}^{N} \mathrm{KL} \Big( \pi_{\theta_{\text{ref}}}\left(y_i \mid y_{<i}, I\right) \,\big\|\, \pi_{\theta}\left(y_i \mid y_{<i}, I_{\xi}\right) \Big) \right],
$$

where \\(N\\) denotes the number of response tokens and \\(\xi\\) is the compression rate sampled from the candidate rates \\(\mathcal{R}\\), i.e., representing each patch with 256 or 64 visual tokens.

`Router training`: In this stage, only the ViR is trained to select an appropriate compression rate for each image patch, while the rest of the model is kept frozen.

> Please see our paper for more technical and experimental details.

Test-time scaling (TTS) has been empirically demonstrated as an effective approach to enhance the reasoning capabilities of LLMs and MLLMs, particularly for complex tasks necessitating multi-step inference. In this work, we implement a comprehensive test-time scaling approach that simultaneously improves reasoning depth (i.e., deep thinking) and breadth (i.e., parallel thinking).

`Deep Thinking`: By activating the Thinking mode, we guide the model to deliberately engage in step-by-step reasoning (i.e., decomposing complex problems into logical steps and validating intermediate conclusions) prior to generating the final answer. This approach systematically improves the logical structure of solutions for complex problems, particularly those requiring multi-step inference, and enhances reasoning depth.

`Parallel Thinking`: Following InternVL3, for reasoning tasks, we adopt the Best-of-N (BoN) strategy, employing VisualPRM-v1.1 as the critic model to select the optimal response from multiple reasoning candidates. This approach improves reasoning breadth.

> Notably, unless otherwise specified, the experimental results reported in our paper are obtained without applying TTS. Thus far, we have only applied TTS to reasoning benchmarks, since we found that the model already exhibits strong perception and understanding capabilities, and applying TTS yields no significant improvement.

In multimodal inference, the vision encoder and language model have distinct computational characteristics. The vision encoder, which transforms images into semantic features, is highly parallelizable and does not rely on long-term history state.
In contrast, the language model performs inference autoregressively, requiring previous states to compute the next one. This sequential property makes the language part more sensitive to memory bandwidth and latency. When MLLMs are deployed online at scale, the vision and language models often block each other, incurring additional inference cost. This effect becomes more pronounced with larger vision models or higher-resolution images.

As shown in the Figure above, we propose decoupled vision-language deployment (DvD) to address this issue by separating vision and language processing, with a particular focus on optimizing the prefilling stage. The vision subsystem batches and processes images to produce compact feature embeddings, which are then transmitted to the language subsystem for fusion with the text context prior to decoding. This separation alleviates blocking and brings multimodal prefilling performance closer to that of pure language models.

In our system implementation, the ViT and MLP (and ViR for InternVL3.5-Flash) are deployed on the vision server, while the language server executes only the LLM. Communication is unidirectional, transmitting BF16 visual features over TCP, with RDMA optionally employed for higher transmission speed. Vision processing, feature transmission, and language processing are organized into an asynchronous three-stage pipeline, enabling overlapped execution and minimizing pipeline stalls.

DvD increases GPU utilization and processing efficiency on the vision side, while enabling the language server to focus exclusively on the LLM's prefilling and decoding without being blocked by vision computation. This design leads to improved throughput and responsiveness. Moreover, the architecture supports independent hardware cost optimization for the vision and language modules, and facilitates the seamless integration of new modules without requiring modifications to the language server deployment.
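The asynchronous three-stage pipeline can be illustrated with a small queue-based simulation. This is a toy sketch only: the real DvD runs across GPU servers with TCP/RDMA transport, whereas here each stage is just a thread and the "features" are strings:

```python
import queue
import threading

def run_dvd_pipeline(images):
    """Toy simulation of DvD's three overlapped stages (vision encoding,
    feature transmission, language prefill/decode), connected by queues
    so no stage blocks the others. None marks end of stream."""
    to_send, to_decode, results = queue.Queue(), queue.Queue(), []

    def vision_worker():                      # stand-in for ViT + MLP (+ ViR)
        for img in images:
            to_send.put(f"feat({img})")
        to_send.put(None)

    def transmit_worker():                    # stand-in for TCP/RDMA transfer
        while (item := to_send.get()) is not None:
            to_decode.put(item)
        to_decode.put(None)

    def language_worker():                    # stand-in for LLM prefill + decode
        while (item := to_decode.get()) is not None:
            results.append(f"answer<{item}>")

    workers = [threading.Thread(target=w)
               for w in (vision_worker, transmit_worker, language_worker)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return results

print(run_dvd_pipeline(["img0", "img1"]))  # ['answer<feat(img0)>', 'answer<feat(img1)>']
```

Because each stage runs in its own worker and hands results over a queue, the vision stage can already be encoding image *k+1* while the language stage is still decoding image *k*, which is the overlapping behavior DvD exploits.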
Multi-Image Understanding & Real-World Comprehension

Comprehensive Multimodal Understanding & Multimodal Hallucination Evaluation

We provide example code to run `InternVL3.5-8B` using `transformers`. Please note that our models with up to 30B parameters can be deployed on a single A100 GPU, while the 38B model requires two A100 GPUs and the 235B model requires eight A100 GPUs.

> In most cases, both LMDeploy and vLLM can be used for model deployment. However, for InternVL3.5-20B-A4B, we recommend using vLLM, since LMDeploy does not yet support GPT-OSS.

> Please use transformers>=4.52.1 to ensure the model works normally. For the 20B version of our model, transformers>=4.55.0 is required.

To enable Thinking mode, please set the system prompt to our Thinking System Prompt. When enabling Thinking mode, we recommend setting `do_sample=True` and `temperature=0.6` to mitigate undesired repetition. Besides this method, you can also use the following code to get streamed output.

Many repositories now support fine-tuning of the InternVL series models, including InternVL, SWIFT, XTuner, and others. Please refer to their documentation for more details on fine-tuning.

LMDeploy is a toolkit for compressing, deploying, and serving LLMs & VLMs. It abstracts the complex inference process of multimodal Vision-Language Models (VLMs) into an easy-to-use pipeline, similar to the Large Language Model (LLM) inference pipeline.

When dealing with multiple images, you can put them all in one list. Keep in mind that multiple images will lead to a higher number of input tokens, and as a result, the size of the context window typically needs to be increased. Conducting inference with batch prompts is quite straightforward; just place them within a list structure.

There are two ways to conduct multi-turn conversations with the pipeline. One is to construct messages according to the OpenAI format and use the method introduced above; the other is to use the `pipeline.chat` interface.
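For the OpenAI-format route, a multi-image user message is simply a `content` list that mixes one text entry with one image entry per image. The helper below is a hypothetical convenience wrapper (the URLs are placeholders), showing the message shape that OpenAI-compatible pipelines accept:

```python
def build_multi_image_message(question, image_urls):
    """Build one OpenAI-format user message whose content holds a text
    part followed by one image_url part per image."""
    content = [{"type": "text", "text": question}]
    for url in image_urls:
        content.append({"type": "image_url", "image_url": {"url": url}})
    return {"role": "user", "content": content}

# A multi-turn conversation is then a growing list of such messages,
# with the assistant replies appended between user turns.
messages = [build_multi_image_message(
    "Describe the two images.",
    ["https://example.com/a.jpg", "https://example.com/b.jpg"],
)]
```

Remember that each additional image adds visual tokens, so the pipeline's context window (session length) may need to be enlarged accordingly, as noted above.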
LMDeploy's `api_server` enables models to be easily packed into services with a single command. The provided RESTful APIs are compatible with OpenAI's interfaces. Below is an example of service startup. To use the OpenAI-style interface, you need to install the OpenAI package.

This project is released under the Apache-2.0 License. This project uses the pre-trained Qwen3 as a component, which is licensed under the Apache-2.0 License.

If you find this project useful in your research, please consider citing:


InternVL3_5-8B-Flash

[\[πŸ“‚ GitHub\]](https://github.com/OpenGVLab/InternVL) [\[πŸ“œ InternVL 1.0\]](https://huggingface.co/papers/2312.14238) [\[πŸ“œ InternVL 1.5\]](https://huggingface.co/papers/2404.16821) [\[πŸ“œ InternVL 2.5\]](https://huggingface.co/papers/2412.05271) [\[πŸ“œ InternVL2.5-MPO\]](https://huggingface.co/papers/2411.10442) [\[πŸ“œ InternVL3\]](https://huggingface.co/papers/2504.10479) [\[πŸ“œ InternVL3.5\]](https://huggingface.co/papers/2508.18265) [\[πŸ†• Blog\]](https://internvl.github.io/blog/) [\[πŸ—¨οΈ Chat Demo\]](https://chat.intern-ai.org.cn/) [\[πŸš€ Quick Start\]](#quick-start) [\[πŸ“– Documents\]](https://internvl.readthedocs.io/en/latest/) We introduce InternVL3.5, a new family of open-source multimodal models that significantly advances versatility, reasoning capability, and inference efficiency along the InternVL series. A key innovation is the Cascade Reinforcement Learning (Cascade RL) framework, which enhances reasoning through a two-stage process: offline RL for stable convergence and online RL for refined alignment. This coarse-to-fine training strategy leads to substantial improvements on downstream reasoning tasks, e.g., MMMU and MathVista. To optimize efficiency, we propose a Visual Resolution Router (ViR) that dynamically adjusts the resolution of visual tokens without compromising performance. Coupled with ViR, our Decoupled Vision-Language Deployment (DvD) strategy separates the vision encoder and language model across different GPUs, effectively balancing computational load. These contributions collectively enable InternVL3.5 to achieve up to a +16.0\% gain in overall reasoning performance and a 4.05 \\(\times\\) inference speedup compared to its predecessor, i.e., InternVL3. In addition, InternVL3.5 supports novel capabilities such as GUI interaction and embodied agency. 
Notably, our largest model, i.e., InternVL3.5-241B-A28B, attains state-of-the-art results among open-source MLLMs across general multimodal, reasoning, text, and agentic tasksβ€”narrowing the performance gap with leading commercial models like GPT-5. All models and code are publicly released. > Hatched bars represent closed-source commercial models. We report average scores on a set of multimodal general, reasoning, text, and agentic benchmarks: MMBench v1.1 (en), MMStar,BLINK, HallusionBench, AI2D, OCRBench, MMVet, MME-RealWorld (en), MVBench, VideoMME, MMMU, MathVista, MathVision, MathVerse, DynaMath, WeMath, LogicVista, MATH500, AIME24, AIME25, GPQA, MMLU-Pro, GAOKAO, IFEval, SGP-Bench, VSI-Bench, ERQA, SpaCE-10, and OmniSpatial. In the following table, we provide an overview of the InternVL3.5 series. To maintain consistency with earlier generations, we provide two model formats: the GitHub format, consistent with prior releases, and the HF format, aligned with the official Transformers standard. > If you want to convert the checkpoint between these two formats, please refer to the scripts about custom2hf and hf2custom. 
| Model | #Vision Param | #Language Param | #Total Param | HF Link | ModelScope Link | | --------------------- | ------------- | --------------- | ------------ | ------------------------------------------------------------------------------ | ---------------------------------------------------------------------------------------- | | InternVL3.5-1B | 0.3B | 0.8B | 1.1B | πŸ€— link | πŸ€– link | | InternVL3.5-2B | 0.3B | 2.0B | 2.3B | πŸ€— link | πŸ€– link | | InternVL3.5-4B | 0.3B | 4.4B | 4.7B | πŸ€— link | πŸ€– link | | InternVL3.5-8B | 0.3B | 8.2B | 8.5B | πŸ€— link | πŸ€– link | | InternVL3.5-14B | 0.3B | 14.8B | 15.1B | πŸ€— link | πŸ€– link | | InternVL3.5-38B | 5.5B | 32.8B | 38.4B | πŸ€— link | πŸ€– link | | InternVL3.5-20B-A4B | 0.3B | 20.9B | 21.2B-A4B | πŸ€— link | πŸ€– link | | InternVL3.5-30B-A3B | 0.3B | 30.5B | 30.8B-A3B | πŸ€— link | πŸ€– link | | InternVL3.5-241B-A28B | 5.5B | 235.1B | 240.7B-A28B | πŸ€— link | πŸ€– link | | Model | #Vision Param | #Language Param | #Total Param | HF Link | ModelScope Link | | ------------------------ | ------------- | --------------- | ------------ | --------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------- | | InternVL3.5-1B-HF | 0.3B | 0.8B | 1.1B | πŸ€— link | πŸ€– link | | InternVL3.5-2B-HF | 0.3B | 2.0B | 2.3B | πŸ€— link | πŸ€– link | | InternVL3.5-4B-HF | 0.3B | 4.4B | 4.7B | πŸ€— link | πŸ€– link | | InternVL3.5-8B-HF | 0.3B | 8.2B | 8.5B | πŸ€— link | πŸ€– link | | InternVL3.5-14B-HF | 0.3B | 14.8B | 15.1B | πŸ€— link | πŸ€– link | | InternVL3.5-38B-HF | 5.5B | 32.8B | 38.4B | πŸ€— link | πŸ€– link | | InternVL3.5-20B-A4B-HF | 0.3B | 20.9B | 21.2B-A4B | πŸ€— link | πŸ€– link | | InternVL3.5-30B-A3B-HF | 0.3B | 30.5B | 30.8B-A3B | πŸ€— link | πŸ€– link | | InternVL3.5-241B-A28B-HF | 5.5B | 235.1B | 240.7B-A28B | πŸ€— link | πŸ€– link | > We conduct the evaluation with VLMEvalkit. 
To enable the Thinking mode of our model, please set the system prompt to R1SYSTEMPROMPT. When enabling Thinking mode, we recommend setting `dosample=True` and `temperature=0.6` to mitigate undesired repetition. Our training pipeline comprises four stages: Multimodal Continual Pre-Training (CPT), Supervised Fine-Tuning (SFT), and Cascade Reinforcement Learning (CascadeRL). In CascadeRL, we first fine-tune the model using Mixed Preference Optimization (MPO) under an offline RL setting, followed by GSPO under an oneline RL setting. For the Flash version of InternVL3.5, we additionally introduce a lightweight training stage, termed Visual Consistency Learning (ViCO), which reduces the token cost required to represent an image patch. Here, we also open-source the model weights after different training stages for potential research usage. If you're unsure which version to use, please select the one without any suffix, as it has completed the full training pipeline. | Model | Training Pipeline | HF Link | ModelScope Link | | -------------------------------- | --------------------- | --------------------------------------------------------------------------- | ------------------------------------------------------------------------------------- | | InternVL3.5-1B-Pretrained | CPT | πŸ€— link | πŸ€– link | | InternVL3.5-1B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link | | InternVL3.5-1B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link | | InternVL3.5-1B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link | | InternVL3.5-2B-Pretrained | CPT | πŸ€— link | πŸ€– link | | InternVL3.5-2B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link | | InternVL3.5-2B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link | | InternVL3.5-2B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link | | InternVL3.5-4B-Pretrained | CPT | πŸ€— link | πŸ€– link | | InternVL3.5-4B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link | | InternVL3.5-4B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link | | InternVL3.5-4B | CPT + SFT + 
CascadeRL | πŸ€— link | πŸ€– link | | InternVL3.5-8B-Pretrained | CPT | πŸ€— link | πŸ€– link | | InternVL3.5-8B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link | | InternVL3.5-8B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link | | InternVL3.5-8B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link | | InternVL3.5-14B-Pretrained | CPT | πŸ€— link | πŸ€– link | | InternVL3.5-14B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link | | InternVL3.5-14B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link | | InternVL3.5-14B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link | | InternVL3.5-30B-A3B-Pretrained | CPT | πŸ€— link | πŸ€– link | | InternVL3.5-30B-A3B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link | | InternVL3.5-30B-A3B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link | | InternVL3.5-30B-A3B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link | | InternVL3.5-38B-Pretrained | CPT | πŸ€— link | πŸ€– link | | InternVL3.5-38B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link | | InternVL3.5-38B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link | | InternVL3.5-38B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link | | InternVL3.5-241B-A28B-Pretrained | CPT | πŸ€— link | πŸ€– link | | InternVL3.5-241B-A28B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link | | InternVL3.5-241B-A28B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link | | InternVL3.5-241B-A28B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link | The Flash version of our model will be released as soon as possible. `InternVL3.5`: This series of models follow the "ViT–MLP–LLM" paradigm adopted in previous versions of InternVL. We initialize the language model using the Qwen3 series and GPT-OSS, and the vision encoder using InternViT-300M and InternViT-6B. The Dynamic High Resolution strategy introduced in InternVL1.5 is also retained in our design. `InternVL3.5-Flash`: Compared to InternVL3.5, InternVL3.5-Flash further integrates the Visual Resolution Router (ViR), thus yielding a series of efficient variants friendly suitable for resource-constrained scenarios. 
Specifically, in InternVL3.5, each image patch is initially represented as 1024 visual tokens for the vision encoder, which are then compressed into 256 tokens via a pixel shuffle module before being passed to the Large Language Model (LLM). In InternVL3.5-Flash, as shown in the Figure below, an additional pixel shuffle module with a higher compression rate is included, enabling the compression of visual tokens down to 64 tokens. For each patch, the patch router determines the appropriate compression rate by assessing its semantic richness, and routes it to the corresponding pixel shuffle module accordingly. Benefiting from this patch-aware compression mechanism, InternVL3.5-Flash is able to reduce the number of visual tokens by 50\% while maintaining nearly 100\% of the performance of InternVL3.5. During the pre-training stage, we update all model parameters jointly using the combination of large-scale text and multimodal corpora. Specifically, given an arbitrary training sample consisting of a multimodal token sequence \\(\mathbf{x}=\left(x1, x2, \ldots, xL\right)\\), the next token prediction (NTP) loss is calculated on each text token as follows: $$ \mathcal{L}{i}=-\log p\theta\left(xi \mid x1, \ldots, x{i-1}\right), $$ where \\(xi\\) is the predicted token and prefix tokens in \\(\{x1, x2, \ldots, x{i-1}\}\\) can be either text tokens or image tokens. Notably, for conversation samples, only response tokens are included for the calculation of the loss. Additionally, to mitigate bias toward either longer or shorter responses during training, we adopt the square averaging to re-weight the NTP loss as follows: $$ \mathcal{L}{i}^{'} = \frac{wi}{\sumj wj} \cdot \mathcal{L}i, \quad wi = \frac{1}{N^{0.5}}, $$ where \\(N\\) denotes the number of tokens in the training sample on which the loss needs to be calculated. The random JPEG compression is also included to enhance the model's real-world performance. 
During the SFT phase, we adopt the same objective as in the pre-training stage and use the square-root averaging strategy to calculate the final loss. In this stage, the context window is set to 32K tokens to adapt long-context information. Compared to InternVL3, the SFT stage of InternVL3.5 contains more high-quality and diverse training data derived from three sources: (1) Instruction-following data from InternVL3, which are reused to preserve broad coverage of vision–language tasks. (2) Multimodal reasoning data in the "Thinking" mode, which are included to instill long-thinking capabilities in the model. To construct such data, we first use InternVL3-78B to describe the image and then input the description into DeepSeek-R1 to sample rollouts with detailed reasoning processes. Rollouts with an incorrect final answer are filtered out. The questions in these datasets cover various expert domains, such as mathematics and scientific disciplines, thereby strengthening performance on different reasoning tasks. (3) Capability-expansion datasets, which endow InternVL3.5 with new skills, including GUI-based interaction, embodied interaction, and scalable vect Cascade RL aims to combine the benefits of offline RL and online RL to progressively facilitate the post-training of MLLMs in an efficient manner. Specifically, we first fine-tune the model using an offline RL algorithm as an efficient warm-up stage to reach a satisfied results, which can guarantee the high-quality rollouts for the latter stage. Subsequently, we employ an online RL algorithm to further refine the output distribution based on rollouts generated by the model itself. Compared to the single offline or online RL stage, our cascaded RL achieves significant performance improvements at a fraction of the GPU time cost. During the offline RL stage, we employ mixed preference optimization (MPO) to fine-tune the model. 
Specifically, the training objective of MPO is a combination of preference loss \\(\mathcal{L}{p}\\), quality loss \\(\mathcal{L}{q}\\), and generation loss \\(\mathcal{L}{g}\\), which can be formulated as follows: $$ \mathcal{L}{\text{MPO}}= w{p} \mathcal{L}{p} + w{q} \mathcal{L}{q} + w{g} \mathcal{L}{g} , $$ where \\(w{}\\) represents the weight assigned to each loss component. The DPO loss, BCO loss, and LM loss serve as the preference loss, quality loss, and generation loss, respectively. During the online RL stage, we employ GSPO, without reference model constraints, as our online RL algorithm, which we find more effective in training both dense and mixture-of-experts (MoE) models. Similar to GRPO, the advantage is defined as the normalized reward across responses sampled from the same query. The training objective of GSPO is given by: $$ \mathcal{L}{\mathrm{GSPO}}(\theta)=\mathbb{E}{x \sim \mathcal{D},\left\{yi\right\}{i=1}^G \sim \pi{\theta \text { old }}(\cdot \mid x)}\left[\frac{1}{G} \sum{i=1}^G \min \left(si(\theta) \widehat{A}i, \operatorname{clip}\left(si(\theta), 1-\varepsilon, 1+\varepsilon\right) \widehat{A}i\right)\right], $$ where the importance sampling ratio is defined as the geometric mean of the per-token ratios. > Please see our paper for more technical and experimental details. We further include ViCO as an additional training stage to integrate the visual resolution router (ViR) into InternVL3.5, thereby reducing the inference cost of InternVL3.5. The obtained efficient version of InternVL3.5 are termed as InternVL3.5-Flash. In particular, ViCO comprises two stages: `Consistency training`: In this stage, the entire model is trained to minimize the divergence between response distributions conditioned on visual tokens with different compression rates. In practice, we introduce an extra reference model, which is frozen and initialized with InternVL3.5. 
Given a sample, each image patch is represented as either 256 or 64 tokens, and the training objective is defined as follows: $$ \mathcal{L}\text{ViCO} = \mathbb{E}{\xi \sim \mathcal{R}} \Bigg \frac{1}{N} \sum{i=1}^{N} \mathrm{KL} \Big( \pi{\theta{ref}}\left(yi \mid y{ Please see [our paper for more technical and experimental details. Test-time scaling (TTS) has been empirically demonstrated as an effective approach to enhance the reasoning capabilities of LLMs and MLLMs, particularly for complex tasks necessitating multi-step inference. In this work, we implement a comprehensive test-time scaling approach that simultaneously improves reasoning depth (i.e., deep thinking) and breadth (i.e., parallel thinking). `Deep Thinking`: By activating the Thinking mode, we guide the model to deliberately engage in step-by-step reasoning (i.e., decomposing complex problems into logical steps and validating intermediate conclusions) prior to generating the final answer. This approach systematically improves the logical structure of solutions for complex problems, particularly those requiring multi-step inference, and enhances reasoning depth. `Parallel Thinking`: Following InternVL3, for reasoning tasks, we adopt the Best-of-N (BoN) strategy by employing VisualPRM-v1.1 as the critic model to select the optimal response from multiple reasoning candidates. This approach improves reasoning breadth. > Notably, unless otherwise specified, the experimental results reported in our paper are obtained without applying TTS. Thus far, we have only applied TTS to reasoning benchmarks, since we found that the model already exhibits strong perception and understanding capabilities, and initiating TTS yields no significant improvement. In multimodal inference, the vision encoder and language model have distinct computational characteristics. The vision encoder that transforms images into semantic features is highly parallelizable and does not rely on long-term history state. 
In contrast, the language model performs inference in an autoregressive manner, which requires previous states to compute the next one. This sequential property makes the language part more sensitive to memory bandwidth and latency. When MLLMs are deployed online at scale, the vision and language models often block each other, incurring additional inference cost. This effect becomes more pronounced with larger vision models or higher-resolution images. As shown in the Figure above, we propose Decoupled Vision-Language Deployment (DvD) to address this issue by separating vision and language processing, with a particular focus on optimizing the prefilling stage. The vision subsystem batches and processes images to produce compact feature embeddings, which are then transmitted to the language subsystem and fused with the text context prior to decoding. This separation alleviates blocking and brings multimodal prefilling performance closer to that of pure language models. In our system implementation, the ViT and MLP (and ViR for InternVL3.5-Flash) are deployed on the vision server, while the language server executes only the LLM. The communication is unidirectional, transmitting BF16 visual features over TCP, with RDMA optionally employed for higher transmission speed. Vision processing, feature transmission, and language processing are organized into an asynchronous three-stage pipeline, enabling overlapped execution and minimizing pipeline stalls. DvD increases GPU utilization and processing efficiency on the vision side, while enabling the language server to focus exclusively on the LLM's prefilling and decoding without being blocked by vision computation. This design leads to improved throughput and responsiveness. Moreover, the architecture supports independent hardware cost optimization for the vision and language modules, and facilitates the seamless integration of new modules without requiring modifications to the language server deployment. 
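The three-stage pipeline can be sketched with standard queues and threads. This is a toy model of the idea only: the stage bodies (string tagging in place of encoding, TCP/RDMA transfer, and prefilling/decoding) are illustrative, not the real servers.

```python
import queue
import threading

def run_pipeline(images):
    """Toy decoupled deployment: vision encoding, feature transmission, and
    language processing run as an asynchronous three-stage pipeline, so each
    stage can start on the next item while downstream stages are still busy."""
    q_feat, q_sent, results = queue.Queue(), queue.Queue(), []

    def vision_stage():  # stands in for ViT+MLP producing feature embeddings
        for img in images:
            q_feat.put(f"feat({img})")
        q_feat.put(None)  # sentinel: no more items

    def transmit_stage():  # stands in for BF16 feature transfer over TCP/RDMA
        while (item := q_feat.get()) is not None:
            q_sent.put(item)
        q_sent.put(None)

    def language_stage():  # stands in for LLM prefilling + decoding
        while (item := q_sent.get()) is not None:
            results.append(f"answer({item})")

    threads = [threading.Thread(target=t)
               for t in (vision_stage, transmit_stage, language_stage)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Because the queues decouple the stages, the vision thread never waits for decoding and the language thread never waits for image batching, which is the blocking that DvD is designed to remove.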
Multi-Image Understanding & Real-World Comprehension Comprehensive Multimodal Understanding & Multimodal Hallucination Evaluation We provide example code to run `InternVL3.5-8B` using `transformers`. Please note that our models with up to 30B parameters can be deployed on a single A100 GPU, while the 38B model requires two A100 GPUs and the 235B model requires eight A100 GPUs. > In most cases, both LMDeploy and vLLM can be used for model deployment. However, for InternVL3.5-20B-A4B, we recommend using vLLM, since LMDeploy does not yet support GPT-OSS. > Please use transformers>=4.52.1 to ensure the model works normally. For the 20B version of our model, transformers>=4.55.0 is required. To enable thinking mode, please set the system prompt to our Thinking System Prompt. When enabling Thinking mode, we recommend setting `do_sample=True` and `temperature=0.6` to mitigate undesired repetition. Besides this method, you can also use the following code to get streamed output. Many repositories now support fine-tuning of the InternVL series models, including InternVL, SWIFT, XTuner, and others. Please refer to their documentation for more details on fine-tuning. LMDeploy is a toolkit for compressing, deploying, and serving LLMs & VLMs. LMDeploy abstracts the complex inference process of multimodal Vision-Language Models (VLM) into an easy-to-use pipeline, similar to the Large Language Model (LLM) inference pipeline. When dealing with multiple images, you can put them all in one list. Keep in mind that multiple images will lead to a higher number of input tokens, and as a result, the size of the context window typically needs to be increased. Conducting inference with batch prompts is quite straightforward; just place them within a list structure. There are two ways to conduct multi-turn conversations with the pipeline. One is to construct messages according to the OpenAI format and use the method introduced above; the other is to use the `pipeline.chat` interface. 
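For the OpenAI-format route, a multi-image, multi-turn request is just nested lists and dicts. The following is a minimal sketch of the expected message structure (the URLs are placeholders, and the helper function is our own illustration, not part of any library):

```python
def build_messages(question, image_urls, history=()):
    """Assemble OpenAI-format chat messages: prior turns come first, then a
    user turn whose content mixes one text part with one image_url part per
    image. Multiple images simply become multiple entries in the content list."""
    messages = [dict(m) for m in history]
    content = [{"type": "text", "text": question}]
    content += [{"type": "image_url", "image_url": {"url": u}} for u in image_urls]
    messages.append({"role": "user", "content": content})
    return messages

msgs = build_messages(
    "Describe the two images.",
    ["https://example.com/a.jpg", "https://example.com/b.jpg"],
)
```

Note that each extra image adds visual tokens to the request, which is why the context window typically needs to grow with the number of images.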
LMDeploy's `api_server` enables models to be easily packed into services with a single command. The provided RESTful APIs are compatible with OpenAI's interfaces. Below is an example of service startup: To use the OpenAI-style interface, you need to install the `openai` package: This project is released under the Apache-2.0 License. This project uses the pre-trained Qwen3 as a component, which is licensed under the Apache-2.0 License. If you find this project useful in your research, please consider citing:

InternVL2_5-8B-MPO-hf

InternVL2_5-26B (license: mit)

InternVL2_5-38B (license: mit)

VideoChat-R1_7B (license: apache-2.0)
InternVL3_5-8B-Pretrained

Notably, our largest model, i.e., InternVL3.5-241B-A28B, attains state-of-the-art results among open-source MLLMs across general multimodal, reasoning, text, and agentic tasksβ€”narrowing the performance gap with leading commercial models like GPT-5. All models and code are publicly released. > Hatched bars represent closed-source commercial models. We report average scores on a set of multimodal general, reasoning, text, and agentic benchmarks: MMBench v1.1 (en), MMStar, BLINK, HallusionBench, AI2D, OCRBench, MMVet, MME-RealWorld (en), MVBench, VideoMME, MMMU, MathVista, MathVision, MathVerse, DynaMath, WeMath, LogicVista, MATH500, AIME24, AIME25, GPQA, MMLU-Pro, GAOKAO, IFEval, SGP-Bench, VSI-Bench, ERQA, SpaCE-10, and OmniSpatial. In the following table, we provide an overview of the InternVL3.5 series. To maintain consistency with earlier generations, we provide two model formats: the GitHub format, consistent with prior releases, and the HF format, aligned with the official Transformers standard. > If you want to convert the checkpoint between these two formats, please refer to the scripts about custom2hf and hf2custom. 
| Model | #Vision Param | #Language Param | #Total Param | HF Link | ModelScope Link |
| --------------------- | ------------- | --------------- | ------------ | ------- | --------------- |
| InternVL3.5-1B | 0.3B | 0.8B | 1.1B | πŸ€— link | πŸ€– link |
| InternVL3.5-2B | 0.3B | 2.0B | 2.3B | πŸ€— link | πŸ€– link |
| InternVL3.5-4B | 0.3B | 4.4B | 4.7B | πŸ€— link | πŸ€– link |
| InternVL3.5-8B | 0.3B | 8.2B | 8.5B | πŸ€— link | πŸ€– link |
| InternVL3.5-14B | 0.3B | 14.8B | 15.1B | πŸ€— link | πŸ€– link |
| InternVL3.5-38B | 5.5B | 32.8B | 38.4B | πŸ€— link | πŸ€– link |
| InternVL3.5-20B-A4B | 0.3B | 20.9B | 21.2B-A4B | πŸ€— link | πŸ€– link |
| InternVL3.5-30B-A3B | 0.3B | 30.5B | 30.8B-A3B | πŸ€— link | πŸ€– link |
| InternVL3.5-241B-A28B | 5.5B | 235.1B | 240.7B-A28B | πŸ€— link | πŸ€– link |

| Model | #Vision Param | #Language Param | #Total Param | HF Link | ModelScope Link |
| ------------------------ | ------------- | --------------- | ------------ | ------- | --------------- |
| InternVL3.5-1B-HF | 0.3B | 0.8B | 1.1B | πŸ€— link | πŸ€– link |
| InternVL3.5-2B-HF | 0.3B | 2.0B | 2.3B | πŸ€— link | πŸ€– link |
| InternVL3.5-4B-HF | 0.3B | 4.4B | 4.7B | πŸ€— link | πŸ€– link |
| InternVL3.5-8B-HF | 0.3B | 8.2B | 8.5B | πŸ€— link | πŸ€– link |
| InternVL3.5-14B-HF | 0.3B | 14.8B | 15.1B | πŸ€— link | πŸ€– link |
| InternVL3.5-38B-HF | 5.5B | 32.8B | 38.4B | πŸ€— link | πŸ€– link |
| InternVL3.5-20B-A4B-HF | 0.3B | 20.9B | 21.2B-A4B | πŸ€— link | πŸ€– link |
| InternVL3.5-30B-A3B-HF | 0.3B | 30.5B | 30.8B-A3B | πŸ€— link | πŸ€– link |
| InternVL3.5-241B-A28B-HF | 5.5B | 235.1B | 240.7B-A28B | πŸ€— link | πŸ€– link |

> We conduct the evaluation with VLMEvalKit. 
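For reference, a typical VLMEvalKit run launches its `run.py` entry point with a benchmark and model name. The identifiers below are illustrative and should be checked against the VLMEvalKit model and dataset registry:

```shell
# Evaluate one model on one benchmark with VLMEvalKit (names illustrative)
python run.py --data MMBench_DEV_EN --model InternVL3_5-8B --verbose
```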
To enable the Thinking mode of our model, please set the system prompt to R1_SYSTEM_PROMPT. When enabling Thinking mode, we recommend setting `do_sample=True` and `temperature=0.6` to mitigate undesired repetition. Our training pipeline comprises three main stages: Multimodal Continual Pre-Training (CPT), Supervised Fine-Tuning (SFT), and Cascade Reinforcement Learning (CascadeRL). In CascadeRL, we first fine-tune the model using Mixed Preference Optimization (MPO) under an offline RL setting, followed by GSPO under an online RL setting. For the Flash version of InternVL3.5, we additionally introduce a lightweight training stage, termed Visual Consistency Learning (ViCO), which reduces the token cost required to represent an image patch. Here, we also open-source the model weights after different training stages for potential research usage. If you're unsure which version to use, please select the one without any suffix, as it has completed the full training pipeline.

| Model | Training Pipeline | HF Link | ModelScope Link |
| -------------------------------- | --------------------- | ------- | --------------- |
| InternVL3.5-1B-Pretrained | CPT | πŸ€— link | πŸ€– link |
| InternVL3.5-1B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link |
| InternVL3.5-1B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link |
| InternVL3.5-1B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link |
| InternVL3.5-2B-Pretrained | CPT | πŸ€— link | πŸ€– link |
| InternVL3.5-2B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link |
| InternVL3.5-2B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link |
| InternVL3.5-2B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link |
| InternVL3.5-4B-Pretrained | CPT | πŸ€— link | πŸ€– link |
| InternVL3.5-4B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link |
| InternVL3.5-4B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link |
| InternVL3.5-4B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link |
| InternVL3.5-8B-Pretrained | CPT | πŸ€— link | πŸ€– link |
| InternVL3.5-8B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link |
| InternVL3.5-8B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link |
| InternVL3.5-8B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link |
| InternVL3.5-14B-Pretrained | CPT | πŸ€— link | πŸ€– link |
| InternVL3.5-14B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link |
| InternVL3.5-14B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link |
| InternVL3.5-14B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link |
| InternVL3.5-30B-A3B-Pretrained | CPT | πŸ€— link | πŸ€– link |
| InternVL3.5-30B-A3B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link |
| InternVL3.5-30B-A3B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link |
| InternVL3.5-30B-A3B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link |
| InternVL3.5-38B-Pretrained | CPT | πŸ€— link | πŸ€– link |
| InternVL3.5-38B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link |
| InternVL3.5-38B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link |
| InternVL3.5-38B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link |
| InternVL3.5-241B-A28B-Pretrained | CPT | πŸ€— link | πŸ€– link |
| InternVL3.5-241B-A28B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link |
| InternVL3.5-241B-A28B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link |
| InternVL3.5-241B-A28B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link |

The Flash version of our model will be released as soon as possible. `InternVL3.5`: This series of models follows the "ViT–MLP–LLM" paradigm adopted in previous versions of InternVL. We initialize the language model with the Qwen3 series and GPT-OSS, and the vision encoder with InternViT-300M and InternViT-6B. The Dynamic High Resolution strategy introduced in InternVL1.5 is also retained in our design. `InternVL3.5-Flash`: Compared to InternVL3.5, InternVL3.5-Flash further integrates the Visual Resolution Router (ViR), yielding a series of efficient variants suitable for resource-constrained scenarios. 
Specifically, in InternVL3.5, each image patch is initially represented as 1024 visual tokens for the vision encoder, which are then compressed into 256 tokens via a pixel shuffle module before being passed to the Large Language Model (LLM). In InternVL3.5-Flash, as shown in the Figure below, an additional pixel shuffle module with a higher compression rate is included, enabling the compression of visual tokens down to 64 tokens. For each patch, the patch router determines the appropriate compression rate by assessing its semantic richness, and routes it to the corresponding pixel shuffle module accordingly. Benefiting from this patch-aware compression mechanism, InternVL3.5-Flash is able to reduce the number of visual tokens by 50\% while maintaining nearly 100\% of the performance of InternVL3.5. During the pre-training stage, we update all model parameters jointly using the combination of large-scale text and multimodal corpora. Specifically, given an arbitrary training sample consisting of a multimodal token sequence \\(\mathbf{x}=\left(x_1, x_2, \ldots, x_L\right)\\), the next token prediction (NTP) loss is calculated on each text token as follows: $$ \mathcal{L}_{i}=-\log p_{\theta}\left(x_i \mid x_1, \ldots, x_{i-1}\right), $$ where \\(x_i\\) is the predicted token and the prefix tokens \\(\{x_1, x_2, \ldots, x_{i-1}\}\\) can be either text tokens or image tokens. Notably, for conversation samples, only response tokens are included in the calculation of the loss. Additionally, to mitigate bias toward either longer or shorter responses during training, we adopt square averaging to re-weight the NTP loss as follows: $$ \mathcal{L}_{i}^{'} = \frac{w_{i}}{\sum_{j} w_{j}} \cdot \mathcal{L}_{i}, \quad w_{i} = \frac{1}{N^{0.5}}, $$ where \\(N\\) denotes the number of tokens in the training sample on which the loss needs to be calculated. Random JPEG compression is also included to enhance the model's real-world performance. 
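The square-averaging re-weighting above can be illustrated numerically. A minimal sketch, assuming a batch of two samples whose per-token NTP losses are already computed (all values are illustrative):

```python
def reweight_ntp_losses(sample_losses):
    """Square averaging: token i in a sample with N loss tokens gets weight
    w_i = N ** -0.5, and weights are normalized across all tokens in the
    batch, i.e. L'_i = (w_i / sum_j w_j) * L_i."""
    per_token_w = [[len(losses) ** -0.5] * len(losses) for losses in sample_losses]
    z = sum(w for ws in per_token_w for w in ws)
    return [[(w / z) * l for w, l in zip(ws, losses)]
            for ws, losses in zip(per_token_w, sample_losses)]

# A short sample (4 tokens) vs. a long one (100 tokens): each sample's total
# weight grows only like sqrt(N), so long responses cannot dominate the batch.
short = [1.0] * 4
long_ = [1.0] * 100
out = reweight_ntp_losses([short, long_])
```

With unit losses, the short sample's total weight is sqrt(4) = 2 and the long sample's is sqrt(100) = 10, so the long sample contributes only 5x as much despite having 25x as many tokens.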
During the SFT phase, we adopt the same objective as in the pre-training stage and use the square-root averaging strategy to calculate the final loss. In this stage, the context window is set to 32K tokens to accommodate long-context information. Compared to InternVL3, the SFT stage of InternVL3.5 contains more high-quality and diverse training data derived from three sources: (1) Instruction-following data from InternVL3, which are reused to preserve broad coverage of vision–language tasks. (2) Multimodal reasoning data in the "Thinking" mode, which are included to instill long-thinking capabilities in the model. To construct such data, we first use InternVL3-78B to describe the image and then input the description into DeepSeek-R1 to sample rollouts with detailed reasoning processes. Rollouts with an incorrect final answer are filtered out. The questions in these datasets cover various expert domains, such as mathematics and scientific disciplines, thereby strengthening performance on different reasoning tasks. (3) Capability-expansion datasets, which endow InternVL3.5 with new skills, including GUI-based interaction, embodied interaction, and scalable vector graphics (SVG) understanding. Cascade RL aims to combine the benefits of offline RL and online RL to progressively facilitate the post-training of MLLMs in an efficient manner. Specifically, we first fine-tune the model with an offline RL algorithm as an efficient warm-up stage to reach satisfactory results, which guarantees high-quality rollouts for the subsequent stage. Subsequently, we employ an online RL algorithm to further refine the output distribution based on rollouts generated by the model itself. Compared to a single offline or online RL stage, our cascaded RL achieves significant performance improvements at a fraction of the GPU time cost. During the offline RL stage, we employ mixed preference optimization (MPO) to fine-tune the model. 
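The MPO objective combines three scalar loss terms with fixed weights. As a minimal sketch (the weight values are hypothetical; in practice the three terms are the DPO, BCO, and LM losses):

```python
def mpo_loss(l_pref, l_quality, l_gen, w_pref=0.8, w_quality=0.2, w_gen=1.0):
    """L_MPO = w_p * L_p + w_q * L_q + w_g * L_g.
    l_pref / l_quality / l_gen are the preference, quality, and generation
    losses; the default weights here are illustrative only."""
    return w_pref * l_pref + w_quality * l_quality + w_gen * l_gen
```

Because the combination is a plain weighted sum, gradients from preference ranking, absolute response quality, and ordinary language modeling all flow into the same update.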


InternVL3_5-4B-Pretrained

[\[πŸ“‚ GitHub\]](https://github.com/OpenGVLab/InternVL) [\[πŸ“œ InternVL 1.0\]](https://huggingface.co/papers/2312.14238) [\[πŸ“œ InternVL 1.5\]](https://huggingface.co/papers/2404.16821) [\[πŸ“œ InternVL 2.5\]](https://huggingface.co/papers/2412.05271) [\[πŸ“œ InternVL2.5-MPO\]](https://huggingface.co/papers/2411.10442) [\[πŸ“œ InternVL3\]](https://huggingface.co/papers/2504.10479) [\[πŸ“œ InternVL3.5\]](https://huggingface.co/papers/2508.18265) [\[πŸ†• Blog\]](https://internvl.github.io/blog/) [\[πŸ—¨οΈ Chat Demo\]](https://chat.intern-ai.org.cn/) [\[πŸš€ Quick Start\]](#quick-start) [\[πŸ“– Documents\]](https://internvl.readthedocs.io/en/latest/) We introduce InternVL3.5, a new family of open-source multimodal models that significantly advances versatility, reasoning capability, and inference efficiency along the InternVL series. A key innovation is the Cascade Reinforcement Learning (Cascade RL) framework, which enhances reasoning through a two-stage process: offline RL for stable convergence and online RL for refined alignment. This coarse-to-fine training strategy leads to substantial improvements on downstream reasoning tasks, e.g., MMMU and MathVista. To optimize efficiency, we propose a Visual Resolution Router (ViR) that dynamically adjusts the resolution of visual tokens without compromising performance. Coupled with ViR, our Decoupled Vision-Language Deployment (DvD) strategy separates the vision encoder and language model across different GPUs, effectively balancing computational load. These contributions collectively enable InternVL3.5 to achieve up to a +16.0\% gain in overall reasoning performance and a 4.05 \\(\times\\) inference speedup compared to its predecessor, i.e., InternVL3. In addition, InternVL3.5 supports novel capabilities such as GUI interaction and embodied agency. 
Notably, our largest model, i.e., InternVL3.5-241B-A28B, attains state-of-the-art results among open-source MLLMs across general multimodal, reasoning, text, and agentic tasksβ€”narrowing the performance gap with leading commercial models like GPT-5. All models and code are publicly released. > Hatched bars represent closed-source commercial models. We report average scores on a set of multimodal general, reasoning, text, and agentic benchmarks: MMBench v1.1 (en), MMStar,BLINK, HallusionBench, AI2D, OCRBench, MMVet, MME-RealWorld (en), MVBench, VideoMME, MMMU, MathVista, MathVision, MathVerse, DynaMath, WeMath, LogicVista, MATH500, AIME24, AIME25, GPQA, MMLU-Pro, GAOKAO, IFEval, SGP-Bench, VSI-Bench, ERQA, SpaCE-10, and OmniSpatial. In the following table, we provide an overview of the InternVL3.5 series. To maintain consistency with earlier generations, we provide two model formats: the GitHub format, consistent with prior releases, and the HF format, aligned with the official Transformers standard. > If you want to convert the checkpoint between these two formats, please refer to the scripts about custom2hf and hf2custom. 
| Model | #Vision Param | #Language Param | #Total Param | HF Link | ModelScope Link | | --------------------- | ------------- | --------------- | ------------ | ------------------------------------------------------------------------------ | ---------------------------------------------------------------------------------------- | | InternVL3.5-1B | 0.3B | 0.8B | 1.1B | πŸ€— link | πŸ€– link | | InternVL3.5-2B | 0.3B | 2.0B | 2.3B | πŸ€— link | πŸ€– link | | InternVL3.5-4B | 0.3B | 4.4B | 4.7B | πŸ€— link | πŸ€– link | | InternVL3.5-8B | 0.3B | 8.2B | 8.5B | πŸ€— link | πŸ€– link | | InternVL3.5-14B | 0.3B | 14.8B | 15.1B | πŸ€— link | πŸ€– link | | InternVL3.5-38B | 5.5B | 32.8B | 38.4B | πŸ€— link | πŸ€– link | | InternVL3.5-20B-A4B | 0.3B | 20.9B | 21.2B-A4B | πŸ€— link | πŸ€– link | | InternVL3.5-30B-A3B | 0.3B | 30.5B | 30.8B-A3B | πŸ€— link | πŸ€– link | | InternVL3.5-241B-A28B | 5.5B | 235.1B | 240.7B-A28B | πŸ€— link | πŸ€– link | | Model | #Vision Param | #Language Param | #Total Param | HF Link | ModelScope Link | | ------------------------ | ------------- | --------------- | ------------ | --------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------- | | InternVL3.5-1B-HF | 0.3B | 0.8B | 1.1B | πŸ€— link | πŸ€– link | | InternVL3.5-2B-HF | 0.3B | 2.0B | 2.3B | πŸ€— link | πŸ€– link | | InternVL3.5-4B-HF | 0.3B | 4.4B | 4.7B | πŸ€— link | πŸ€– link | | InternVL3.5-8B-HF | 0.3B | 8.2B | 8.5B | πŸ€— link | πŸ€– link | | InternVL3.5-14B-HF | 0.3B | 14.8B | 15.1B | πŸ€— link | πŸ€– link | | InternVL3.5-38B-HF | 5.5B | 32.8B | 38.4B | πŸ€— link | πŸ€– link | | InternVL3.5-20B-A4B-HF | 0.3B | 20.9B | 21.2B-A4B | πŸ€— link | πŸ€– link | | InternVL3.5-30B-A3B-HF | 0.3B | 30.5B | 30.8B-A3B | πŸ€— link | πŸ€– link | | InternVL3.5-241B-A28B-HF | 5.5B | 235.1B | 240.7B-A28B | πŸ€— link | πŸ€– link | > We conduct the evaluation with VLMEvalkit. 
To enable the Thinking mode of our model, please set the system prompt to R1SYSTEMPROMPT. When enabling Thinking mode, we recommend setting `dosample=True` and `temperature=0.6` to mitigate undesired repetition. Our training pipeline comprises four stages: Multimodal Continual Pre-Training (CPT), Supervised Fine-Tuning (SFT), and Cascade Reinforcement Learning (CascadeRL). In CascadeRL, we first fine-tune the model using Mixed Preference Optimization (MPO) under an offline RL setting, followed by GSPO under an oneline RL setting. For the Flash version of InternVL3.5, we additionally introduce a lightweight training stage, termed Visual Consistency Learning (ViCO), which reduces the token cost required to represent an image patch. Here, we also open-source the model weights after different training stages for potential research usage. If you're unsure which version to use, please select the one without any suffix, as it has completed the full training pipeline. | Model | Training Pipeline | HF Link | ModelScope Link | | -------------------------------- | --------------------- | --------------------------------------------------------------------------- | ------------------------------------------------------------------------------------- | | InternVL3.5-1B-Pretrained | CPT | πŸ€— link | πŸ€– link | | InternVL3.5-1B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link | | InternVL3.5-1B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link | | InternVL3.5-1B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link | | InternVL3.5-2B-Pretrained | CPT | πŸ€— link | πŸ€– link | | InternVL3.5-2B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link | | InternVL3.5-2B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link | | InternVL3.5-2B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link | | InternVL3.5-4B-Pretrained | CPT | πŸ€— link | πŸ€– link | | InternVL3.5-4B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link | | InternVL3.5-4B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link | | InternVL3.5-4B | CPT + SFT + 
CascadeRL | πŸ€— link | πŸ€– link | | InternVL3.5-8B-Pretrained | CPT | πŸ€— link | πŸ€– link | | InternVL3.5-8B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link | | InternVL3.5-8B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link | | InternVL3.5-8B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link | | InternVL3.5-14B-Pretrained | CPT | πŸ€— link | πŸ€– link | | InternVL3.5-14B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link | | InternVL3.5-14B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link | | InternVL3.5-14B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link | | InternVL3.5-30B-A3B-Pretrained | CPT | πŸ€— link | πŸ€– link | | InternVL3.5-30B-A3B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link | | InternVL3.5-30B-A3B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link | | InternVL3.5-30B-A3B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link | | InternVL3.5-38B-Pretrained | CPT | πŸ€— link | πŸ€– link | | InternVL3.5-38B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link | | InternVL3.5-38B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link | | InternVL3.5-38B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link | | InternVL3.5-241B-A28B-Pretrained | CPT | πŸ€— link | πŸ€– link | | InternVL3.5-241B-A28B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link | | InternVL3.5-241B-A28B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link | | InternVL3.5-241B-A28B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link | The Flash version of our model will be released as soon as possible. `InternVL3.5`: This series of models follows the "ViT–MLP–LLM" paradigm adopted in previous versions of InternVL. We initialize the language model with the Qwen3 series and GPT-OSS, and the vision encoder with InternViT-300M and InternViT-6B. The Dynamic High Resolution strategy introduced in InternVL1.5 is also retained in our design. `InternVL3.5-Flash`: Compared to InternVL3.5, InternVL3.5-Flash further integrates the Visual Resolution Router (ViR), yielding a series of efficient variants suitable for resource-constrained scenarios. 
Specifically, in InternVL3.5, each image patch is initially represented as 1024 visual tokens for the vision encoder, which are then compressed into 256 tokens via a pixel shuffle module before being passed to the Large Language Model (LLM). In InternVL3.5-Flash, as shown in the Figure below, an additional pixel shuffle module with a higher compression rate is included, enabling the compression of visual tokens down to 64 tokens. For each patch, the patch router determines the appropriate compression rate by assessing its semantic richness, and routes it to the corresponding pixel shuffle module accordingly. Benefiting from this patch-aware compression mechanism, InternVL3.5-Flash is able to reduce the number of visual tokens by 50\% while maintaining nearly 100\% of the performance of InternVL3.5. During the pre-training stage, we update all model parameters jointly using a combination of large-scale text and multimodal corpora. Specifically, given an arbitrary training sample consisting of a multimodal token sequence \\(\mathbf{x}=\left(x_1, x_2, \ldots, x_L\right)\\), the next token prediction (NTP) loss is calculated on each text token as follows: $$ \mathcal{L}_{i}=-\log p_{\theta}\left(x_i \mid x_1, \ldots, x_{i-1}\right), $$ where \\(x_i\\) is the predicted token and the prefix tokens \\(\{x_1, x_2, \ldots, x_{i-1}\}\\) can be either text tokens or image tokens. Notably, for conversation samples, only response tokens are included in the calculation of the loss. Additionally, to mitigate bias toward either longer or shorter responses during training, we adopt square-root averaging to re-weight the NTP loss as follows: $$ \mathcal{L}_{i}^{'} = \frac{w_i}{\sum_j w_j} \cdot \mathcal{L}_i, \quad w_i = \frac{1}{N^{0.5}}, $$ where \\(N\\) denotes the number of tokens in the training sample on which the loss is calculated. Random JPEG compression is also applied to enhance the model's robustness to real-world images. 
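The square-root re-weighting above can be checked numerically. The following is an illustrative sketch (the function name is ours, and the batching scheme is simplified relative to real packed training): each token of a sample with \\(N\\) loss tokens receives weight \\(1/\sqrt{N}\\), and weights are normalized across the whole batch, so long responses cannot dominate the update.

```python
import math

def sqrt_averaged_loss(per_sample_token_losses):
    """Square-root averaged NTP loss over a batch.

    per_sample_token_losses: list of per-sample lists of token losses.
    Each token of a sample with N loss tokens gets weight 1/sqrt(N);
    weights are normalized across the batch before summing.
    """
    weights, losses = [], []
    for sample in per_sample_token_losses:
        n = len(sample)
        weights.extend([1.0 / math.sqrt(n)] * n)
        losses.extend(sample)
    total_w = sum(weights)
    return sum(w / total_w * l for w, l in zip(weights, losses))

# A 1-token sample and a 4-token sample: the short sample's single token
# carries twice the per-token weight of the long sample's tokens.
batch = [[2.0], [1.0, 1.0, 1.0, 1.0]]
loss = sqrt_averaged_loss(batch)
```

Compared with plain token-mean averaging (which would give 1.2 here), the short response is upweighted, matching the stated goal of balancing long and short responses.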
During the SFT phase, we adopt the same objective as in the pre-training stage and use the square-root averaging strategy to calculate the final loss. In this stage, the context window is set to 32K tokens to accommodate long-context information. Compared to InternVL3, the SFT stage of InternVL3.5 contains more high-quality and diverse training data derived from three sources: (1) Instruction-following data from InternVL3, which are reused to preserve broad coverage of vision–language tasks. (2) Multimodal reasoning data in the "Thinking" mode, which are included to instill long-thinking capabilities in the model. To construct such data, we first use InternVL3-78B to describe the image and then feed the description into DeepSeek-R1 to sample rollouts with detailed reasoning processes. Rollouts with an incorrect final answer are filtered out. The questions in these datasets cover various expert domains, such as mathematics and scientific disciplines, thereby strengthening performance on different reasoning tasks. (3) Capability-expansion datasets, which endow InternVL3.5 with new skills, including GUI-based interaction, embodied interaction, and scalable vector graphics (SVG) understanding and generation. Cascade RL aims to combine the benefits of offline RL and online RL to progressively facilitate the post-training of MLLMs in an efficient manner. Specifically, we first fine-tune the model with an offline RL algorithm as an efficient warm-up stage to reach satisfactory results, which guarantees high-quality rollouts for the subsequent stage. We then employ an online RL algorithm to further refine the output distribution based on rollouts generated by the model itself. Compared to a single offline or online RL stage, our cascaded RL achieves significant performance improvements at a fraction of the GPU time cost. During the offline RL stage, we employ mixed preference optimization (MPO) to fine-tune the model. 
Specifically, the training objective of MPO is a combination of preference loss \\(\mathcal{L}_{p}\\), quality loss \\(\mathcal{L}_{q}\\), and generation loss \\(\mathcal{L}_{g}\\), which can be formulated as follows: $$ \mathcal{L}_{\text{MPO}}= w_{p} \mathcal{L}_{p} + w_{q} \mathcal{L}_{q} + w_{g} \mathcal{L}_{g} , $$ where \\(w_{*}\\) represents the weight assigned to each loss component. The DPO loss, BCO loss, and LM loss serve as the preference loss, quality loss, and generation loss, respectively. During the online RL stage, we employ GSPO, without reference model constraints, as our online RL algorithm, which we find more effective for training both dense and mixture-of-experts (MoE) models. Similar to GRPO, the advantage is defined as the normalized reward across responses sampled from the same query. The training objective of GSPO is given by: $$ \mathcal{L}_{\mathrm{GSPO}}(\theta)=\mathbb{E}_{x \sim \mathcal{D},\,\left\{y_i\right\}_{i=1}^{G} \sim \pi_{\theta_{\text{old}}}(\cdot \mid x)}\left[\frac{1}{G} \sum_{i=1}^{G} \min \left(s_i(\theta) \widehat{A}_i, \operatorname{clip}\left(s_i(\theta), 1-\varepsilon, 1+\varepsilon\right) \widehat{A}_i\right)\right], $$ where the importance sampling ratio \\(s_i(\theta)\\) is defined as the geometric mean of the per-token ratios. > Please see our paper for more technical and experimental details. We further include ViCO as an additional training stage to integrate the visual resolution router (ViR) into InternVL3.5, thereby reducing its inference cost. The resulting efficient version of InternVL3.5 is termed InternVL3.5-Flash. In particular, ViCO comprises two stages: `Consistency training`: In this stage, the entire model is trained to minimize the divergence between response distributions conditioned on visual tokens with different compression rates. In practice, we introduce an extra reference model, which is frozen and initialized with InternVL3.5. 
Given a sample, each image patch is represented as either 256 or 64 tokens, and the training objective is defined as follows: $$ \mathcal{L}_{\text{ViCO}} = \mathbb{E}_{\xi \sim \mathcal{R}} \Bigg[ \frac{1}{N} \sum_{i=1}^{N} \mathrm{KL} \Big( \pi_{\theta_{\text{ref}}}\left(y_i \mid y_{<i}\right) \;\Big\|\; \pi_{\theta}\left(y_i \mid y_{<i}, \xi\right) \Big) \Bigg], $$ where the reference policy \\(\pi_{\theta_{\text{ref}}}\\) is conditioned on the uncompressed visual tokens, while the policy \\(\pi_{\theta}\\) is conditioned on visual tokens under a compression rate \\(\xi\\) sampled from \\(\mathcal{R}\\). `Router training`: In this stage, the ViR is trained to select the appropriate compression rate for each patch, while the rest of the model is kept frozen. > Please see our paper for more technical and experimental details. Test-time scaling (TTS) has been empirically demonstrated as an effective approach to enhance the reasoning capabilities of LLMs and MLLMs, particularly for complex tasks necessitating multi-step inference. In this work, we implement a comprehensive test-time scaling approach that simultaneously improves reasoning depth (i.e., deep thinking) and breadth (i.e., parallel thinking). `Deep Thinking`: By activating the Thinking mode, we guide the model to deliberately engage in step-by-step reasoning (i.e., decomposing complex problems into logical steps and validating intermediate conclusions) prior to generating the final answer. This approach systematically improves the logical structure of solutions for complex problems, particularly those requiring multi-step inference, and enhances reasoning depth. `Parallel Thinking`: Following InternVL3, for reasoning tasks, we adopt the Best-of-N (BoN) strategy by employing VisualPRM-v1.1 as the critic model to select the optimal response from multiple reasoning candidates. This approach improves reasoning breadth. > Notably, unless otherwise specified, the experimental results reported in our paper are obtained without applying TTS. Thus far, we have only applied TTS to reasoning benchmarks, since we found that the model already exhibits strong perception and understanding capabilities, and applying TTS yields no significant improvement. In multimodal inference, the vision encoder and language model have distinct computational characteristics. The vision encoder, which transforms images into semantic features, is highly parallelizable and does not rely on long-range history states. 
In contrast, the language model performs inference autoregressively, requiring previous states to compute the next one. This sequential property makes the language part more sensitive to memory bandwidth and latency. When MLLMs are deployed online at scale, the vision and language models often block each other, incurring additional inference cost. This effect becomes more pronounced with larger vision models or higher-resolution images. As shown in the Figure above, we propose decoupled vision-language deployment (DvD) to address this issue by separating vision and language processing, with a particular focus on optimizing the prefilling stage. The vision subsystem batches and processes images to produce compact feature embeddings, which are then transmitted to the language subsystem for fusion with the text context prior to decoding. This separation alleviates blocking and brings multimodal prefilling performance closer to that of pure language models. In our system implementation, the ViT and MLP (and ViR for InternVL3.5-Flash) are deployed on the vision server, while the language server executes only the LLM. Communication is unidirectional, transmitting BF16 visual features over TCP, with RDMA optionally employed for higher transmission speed. Vision processing, feature transmission, and language processing are organized into an asynchronous three-stage pipeline, enabling overlapped execution and minimizing pipeline stalls. DvD increases GPU utilization and processing efficiency on the vision side, while enabling the language server to focus exclusively on the LLM’s prefilling and decoding without being blocked by vision computation. This design leads to improved throughput and responsiveness. Moreover, the architecture supports independent hardware cost optimization for the vision and language modules, and facilitates the seamless integration of new modules without requiring modifications to the language server deployment. 
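The asynchronous three-stage pipeline can be illustrated with a toy sketch. Everything below is invented for illustration (stage functions, queue wiring); the real DvD runs the stages on separate servers with TCP/RDMA transport, but the queue-decoupled overlap is the same idea:

```python
import asyncio

async def vision_encode(img):
    # Stage 1: ViT + MLP on the vision server (parallel-friendly, stateless).
    await asyncio.sleep(0)
    return f"feat({img})"

async def transmit(feat):
    # Stage 2: ship BF16 visual features to the language server (TCP/RDMA).
    await asyncio.sleep(0)
    return feat

async def llm_prefill(feat, prompt):
    # Stage 3: autoregressive prefill + decode on the language server.
    await asyncio.sleep(0)
    return f"answer({feat},{prompt})"

async def dvd_pipeline(requests):
    # Queues decouple the stages so work on different requests overlaps.
    q1, q2, out = asyncio.Queue(), asyncio.Queue(), []

    async def vision_worker():
        for img, prompt in requests:
            await q1.put((await vision_encode(img), prompt))
        await q1.put(None)  # sentinel: no more work

    async def transmit_worker():
        while (item := await q1.get()) is not None:
            feat, prompt = item
            await q2.put((await transmit(feat), prompt))
        await q2.put(None)

    async def language_worker():
        while (item := await q2.get()) is not None:
            feat, prompt = item
            out.append(await llm_prefill(feat, prompt))

    await asyncio.gather(vision_worker(), transmit_worker(), language_worker())
    return out

results = asyncio.run(dvd_pipeline([("img0", "q0"), ("img1", "q1")]))
```

While the language worker prefills request 0, the vision worker can already encode request 1, which is exactly the overlap that keeps either side from blocking the other.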
Multi-Image Understanding & Real-World Comprehension Comprehensive Multimodal Understanding & Multimodal Hallucination Evaluation We provide example code to run `InternVL3.5-8B` using `transformers`. Please note that our models with up to 30B parameters can be deployed on a single A100 GPU, while the 38B model requires two A100 GPUs and the 235B model requires eight A100 GPUs. > In most cases, both LMDeploy and vLLM can be used for model deployment. However, for InternVL3.5-20B-A4B, we recommend using vLLM, since LMDeploy does not yet support GPT-OSS. > Please use transformers>=4.52.1 to ensure the model works normally. For the 20B version of our model, transformers>=4.55.0 is required. To enable Thinking mode, please set the system prompt to our Thinking System Prompt. When enabling Thinking mode, we recommend setting `do_sample=True` and `temperature=0.6` to mitigate undesired repetition. Besides this method, you can also use the following code to get streamed output. Many repositories now support fine-tuning of the InternVL series models, including InternVL, SWIFT, XTuner, and others. Please refer to their documentation for more details on fine-tuning. LMDeploy is a toolkit for compressing, deploying, and serving LLMs & VLMs. LMDeploy abstracts the complex inference process of multi-modal Vision-Language Models (VLM) into an easy-to-use pipeline, similar to the Large Language Model (LLM) inference pipeline. When dealing with multiple images, you can put them all in one list. Keep in mind that multiple images will lead to a higher number of input tokens, and as a result, the size of the context window typically needs to be increased. Conducting inference with batch prompts is quite straightforward; just place them within a list structure. There are two ways to conduct multi-turn conversations with the pipeline. One is to construct messages according to the OpenAI format and use the method introduced above; the other is to use the `pipeline.chat` interface. 
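The `transformers` example code referenced above did not survive extraction. Below is a hedged sketch of a typical chat-template workflow for the HF-format checkpoints; the exact processor behavior and the Thinking System Prompt text are assumptions, and `THINKING_SYSTEM_PROMPT` is a placeholder. Consult the official quick-start for the released snippet.

```python
from typing import Optional

THINKING_SYSTEM_PROMPT = "..."  # placeholder: substitute the official R1 system prompt

def build_messages(question: str, image_url: Optional[str] = None,
                   thinking: bool = False) -> list:
    """Assemble a chat-template style message list for InternVL3.5."""
    messages = []
    if thinking:
        messages.append({"role": "system", "content": THINKING_SYSTEM_PROMPT})
    content = [{"type": "text", "text": question}]
    if image_url is not None:
        content.insert(0, {"type": "image", "url": image_url})
    messages.append({"role": "user", "content": content})
    return messages

if __name__ == "__main__":
    # Heavy part: requires a GPU and the released checkpoint.
    import torch
    from transformers import AutoModelForImageTextToText, AutoProcessor

    model_id = "OpenGVLab/InternVL3_5-8B-HF"
    processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModelForImageTextToText.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="auto",
        trust_remote_code=True)

    messages = build_messages("Describe this image.",
                              image_url="https://example.com/cat.jpg",
                              thinking=True)
    inputs = processor.apply_chat_template(
        messages, add_generation_prompt=True, tokenize=True,
        return_dict=True, return_tensors="pt").to(model.device)
    # Thinking mode: sample with temperature 0.6 to curb repetition.
    out = model.generate(**inputs, max_new_tokens=1024,
                         do_sample=True, temperature=0.6)
    print(processor.decode(out[0], skip_special_tokens=True))
```

The message-building helper mirrors the OpenAI-style format that LMDeploy's pipeline also accepts, so the same structure can be reused across backends.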
LMDeploy's `api_server` enables models to be easily packed into services with a single command. The provided RESTful APIs are compatible with OpenAI's interfaces. Below is an example of service startup: To use the OpenAI-style interface, you need to install OpenAI: This project is released under the Apache-2.0 License. This project uses the pre-trained Qwen3 as a component, which is licensed under the Apache-2.0 License. If you find this project useful in your research, please consider citing:
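The service-startup and client snippets referenced above did not survive extraction. As a sketch (server address, port, and image URL are assumptions), the OpenAI-style interface can be exercised like this once the service is running, e.g. via `lmdeploy serve api_server OpenGVLab/InternVL3_5-8B --server-port 23333`:

```python
def build_chat_payload(model: str, question: str, image_url: str) -> dict:
    """OpenAI chat-completions payload with an interleaved image part."""
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    }

if __name__ == "__main__":
    # Requires `pip install openai` and a running api_server (see above).
    from openai import OpenAI

    client = OpenAI(api_key="none", base_url="http://127.0.0.1:23333/v1")
    payload = build_chat_payload(
        model=client.models.list().data[0].id,
        question="Describe the image.",
        image_url="https://example.com/demo.jpg")
    resp = client.chat.completions.create(**payload)
    print(resp.choices[0].message.content)
```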

license:apache-2.0

InternVL-Chat-V1-5-AWQ

license:mit

VideoMAEv2-Huge

license:cc-by-nc-4.0

InternVL-14B-224px

license:mit

InternVL3-38B-hf


SDLM-3B-D8

[\[πŸ“‚ GitHub\]](https://github.com/OpenGVLab/SDLM) [\[πŸ“œ Tech Report\]](https://arxiv.org/abs/2509.24007) [\[πŸš€ Project Page\]](https://internvl.github.io/blog/2025-09-29-SDLM/) [\[πŸ€— HuggingFace\]](https://huggingface.co/collections/OpenGVLab/sdlm-68ac82709d7c343ad36aa552) We propose the Sequential Diffusion Language Model (SDLM) to elicit the parallel prediction capabilities of diffusion models at low cost. Specifically, SDLM reduces distribution shift by limiting the prediction range to a fixed block length and enforces decoding order through a longest-prefix decoding method, thereby significantly improving prediction efficiency while ensuring generation quality. Our method can be viewed as a further generalization of the autoregressive (AR) paradigm, so it is possible to reuse pre-trained AR weights and quickly migrate to the diffusion framework with only minimal instruction fine-tuning. In the following table, we provide an overview of the SDLM series. | Model Name | Base Model πŸ€— | HF Link πŸ€— | | ----------- | ------------------------------------------------------------ | -------------------------------------------- | | SDLM-3B-D4 | Qwen2.5-3B | https://huggingface.co/OpenGVLab/SDLM-3B-D4 | | SDLM-3B-D8 | Qwen2.5-3B | https://huggingface.co/OpenGVLab/SDLM-3B-D8 | | SDLM-32B-D4 | Qwen2.5-32B | https://huggingface.co/OpenGVLab/SDLM-32B-D4 | We propose a sequential blockwise masked prediction method that reduces error accumulation in diffusion-based generation. Our method leverages the observation that predictions for tokens at lower positional indices typically benefit from more reliable contextual information, resulting in lower deviation and improved accuracy. (a) Training pipeline. Reordered input enables a structured mask with causal prefix (top-left), visible cross-block prefix (bottom-left), and intra-block bidirectional attention (bottom-right). (b) Sampling pipeline. Confidence-based dynamic block decoding with KV cache reuse. 
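The confidence-based prefix-acceptance rule at the heart of this sampling scheme can be written as a minimal sketch (the function name and the always-accept-one fallback are our assumptions; the real decoder works on logits and reuses the KV cache):

```python
def accept_prefix(confidences, tau):
    """Length of the longest prefix whose confidences all meet threshold tau.

    SDLM predicts a block of B tokens per forward pass; only the longest
    high-confidence prefix is committed. At least one token is always
    accepted, so decoding degrades gracefully to plain autoregressive
    decoding when confidence is low.
    """
    accepted = 0
    for c in confidences:
        if c < tau:
            break
        accepted += 1
    return max(accepted, 1)
```

Raising Ο„ accepts shorter prefixes per forward pass (slower, safer); lowering it commits more tokens at once, which is the controllable speed/quality trade-off described below.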
At each step, a block of B tokens is predicted with B-1 padding masks. The longest high-confidence prefix is selected as the dynamic output. Cached KV states enable efficient decoding. SDLM delivers strong performance with significantly faster decoding speed: it runs approximately 2x faster than comparable autoregressive models while matching their accuracy, and achieves up to a 5x speedup over other diffusion language models, as evidenced by results on the MATH-500 benchmark. Trade-off between performance and speed under different confidence thresholds Ο„ for SDLM-3B (B=4) and SDLM-3B (B=8): by adjusting Ο„, a controllable trade-off between speed and performance can be achieved. SpeedUp denotes the average number of tokens output per forward pass. 2. Download the model generation script `sdlm_inference.py` to your working directory. 3. We provide example code to run `SDLM-3B-D8` using `transformers`. If you find this project useful in your research, please consider citing:

license:apache-2.0

SDLM-3B-D4

license:apache-2.0

InternVL3_5-2B-MPO

[\[πŸ“‚ GitHub\]](https://github.com/OpenGVLab/InternVL) [\[πŸ“œ InternVL 1.0\]](https://huggingface.co/papers/2312.14238) [\[πŸ“œ InternVL 1.5\]](https://huggingface.co/papers/2404.16821) [\[πŸ“œ InternVL 2.5\]](https://huggingface.co/papers/2412.05271) [\[πŸ“œ InternVL2.5-MPO\]](https://huggingface.co/papers/2411.10442) [\[πŸ“œ InternVL3\]](https://huggingface.co/papers/2504.10479) [\[πŸ“œ InternVL3.5\]](https://huggingface.co/papers/2508.18265) [\[πŸ†• Blog\]](https://internvl.github.io/blog/) [\[πŸ—¨οΈ Chat Demo\]](https://chat.intern-ai.org.cn/) [\[πŸš€ Quick Start\]](#quick-start) [\[πŸ“– Documents\]](https://internvl.readthedocs.io/en/latest/) We introduce InternVL3.5, a new family of open-source multimodal models that significantly advances versatility, reasoning capability, and inference efficiency along the InternVL series. A key innovation is the Cascade Reinforcement Learning (Cascade RL) framework, which enhances reasoning through a two-stage process: offline RL for stable convergence and online RL for refined alignment. This coarse-to-fine training strategy leads to substantial improvements on downstream reasoning tasks, e.g., MMMU and MathVista. To optimize efficiency, we propose a Visual Resolution Router (ViR) that dynamically adjusts the resolution of visual tokens without compromising performance. Coupled with ViR, our Decoupled Vision-Language Deployment (DvD) strategy separates the vision encoder and language model across different GPUs, effectively balancing computational load. These contributions collectively enable InternVL3.5 to achieve up to a +16.0\% gain in overall reasoning performance and a 4.05 \\(\times\\) inference speedup compared to its predecessor, i.e., InternVL3. In addition, InternVL3.5 supports novel capabilities such as GUI interaction and embodied agency. 
Notably, our largest model, i.e., InternVL3.5-241B-A28B, attains state-of-the-art results among open-source MLLMs across general multimodal, reasoning, text, and agentic tasksβ€”narrowing the performance gap with leading commercial models like GPT-5. All models and code are publicly released. > Hatched bars represent closed-source commercial models. We report average scores on a set of multimodal general, reasoning, text, and agentic benchmarks: MMBench v1.1 (en), MMStar,BLINK, HallusionBench, AI2D, OCRBench, MMVet, MME-RealWorld (en), MVBench, VideoMME, MMMU, MathVista, MathVision, MathVerse, DynaMath, WeMath, LogicVista, MATH500, AIME24, AIME25, GPQA, MMLU-Pro, GAOKAO, IFEval, SGP-Bench, VSI-Bench, ERQA, SpaCE-10, and OmniSpatial. In the following table, we provide an overview of the InternVL3.5 series. To maintain consistency with earlier generations, we provide two model formats: the GitHub format, consistent with prior releases, and the HF format, aligned with the official Transformers standard. > If you want to convert the checkpoint between these two formats, please refer to the scripts about custom2hf and hf2custom. 
| Model | #Vision Param | #Language Param | #Total Param | HF Link | ModelScope Link | | --------------------- | ------------- | --------------- | ------------ | ------------------------------------------------------------------------------ | ---------------------------------------------------------------------------------------- | | InternVL3.5-1B | 0.3B | 0.8B | 1.1B | πŸ€— link | πŸ€– link | | InternVL3.5-2B | 0.3B | 2.0B | 2.3B | πŸ€— link | πŸ€– link | | InternVL3.5-4B | 0.3B | 4.4B | 4.7B | πŸ€— link | πŸ€– link | | InternVL3.5-8B | 0.3B | 8.2B | 8.5B | πŸ€— link | πŸ€– link | | InternVL3.5-14B | 0.3B | 14.8B | 15.1B | πŸ€— link | πŸ€– link | | InternVL3.5-38B | 5.5B | 32.8B | 38.4B | πŸ€— link | πŸ€– link | | InternVL3.5-20B-A4B | 0.3B | 20.9B | 21.2B-A4B | πŸ€— link | πŸ€– link | | InternVL3.5-30B-A3B | 0.3B | 30.5B | 30.8B-A3B | πŸ€— link | πŸ€– link | | InternVL3.5-241B-A28B | 5.5B | 235.1B | 240.7B-A28B | πŸ€— link | πŸ€– link | | Model | #Vision Param | #Language Param | #Total Param | HF Link | ModelScope Link | | ------------------------ | ------------- | --------------- | ------------ | --------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------- | | InternVL3.5-1B-HF | 0.3B | 0.8B | 1.1B | πŸ€— link | πŸ€– link | | InternVL3.5-2B-HF | 0.3B | 2.0B | 2.3B | πŸ€— link | πŸ€– link | | InternVL3.5-4B-HF | 0.3B | 4.4B | 4.7B | πŸ€— link | πŸ€– link | | InternVL3.5-8B-HF | 0.3B | 8.2B | 8.5B | πŸ€— link | πŸ€– link | | InternVL3.5-14B-HF | 0.3B | 14.8B | 15.1B | πŸ€— link | πŸ€– link | | InternVL3.5-38B-HF | 5.5B | 32.8B | 38.4B | πŸ€— link | πŸ€– link | | InternVL3.5-20B-A4B-HF | 0.3B | 20.9B | 21.2B-A4B | πŸ€— link | πŸ€– link | | InternVL3.5-30B-A3B-HF | 0.3B | 30.5B | 30.8B-A3B | πŸ€— link | πŸ€– link | | InternVL3.5-241B-A28B-HF | 5.5B | 235.1B | 240.7B-A28B | πŸ€— link | πŸ€– link | > We conduct the evaluation with VLMEvalkit. 
To enable the Thinking mode of our model, please set the system prompt to R1SYSTEMPROMPT. When enabling Thinking mode, we recommend setting `dosample=True` and `temperature=0.6` to mitigate undesired repetition. Our training pipeline comprises four stages: Multimodal Continual Pre-Training (CPT), Supervised Fine-Tuning (SFT), and Cascade Reinforcement Learning (CascadeRL). In CascadeRL, we first fine-tune the model using Mixed Preference Optimization (MPO) under an offline RL setting, followed by GSPO under an oneline RL setting. For the Flash version of InternVL3.5, we additionally introduce a lightweight training stage, termed Visual Consistency Learning (ViCO), which reduces the token cost required to represent an image patch. Here, we also open-source the model weights after different training stages for potential research usage. If you're unsure which version to use, please select the one without any suffix, as it has completed the full training pipeline. | Model | Training Pipeline | HF Link | ModelScope Link | | -------------------------------- | --------------------- | --------------------------------------------------------------------------- | ------------------------------------------------------------------------------------- | | InternVL3.5-1B-Pretrained | CPT | πŸ€— link | πŸ€– link | | InternVL3.5-1B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link | | InternVL3.5-1B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link | | InternVL3.5-1B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link | | InternVL3.5-2B-Pretrained | CPT | πŸ€— link | πŸ€– link | | InternVL3.5-2B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link | | InternVL3.5-2B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link | | InternVL3.5-2B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link | | InternVL3.5-4B-Pretrained | CPT | πŸ€— link | πŸ€– link | | InternVL3.5-4B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link | | InternVL3.5-4B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link | | InternVL3.5-4B | CPT + SFT + 
CascadeRL | πŸ€— link | πŸ€– link | | InternVL3.5-8B-Pretrained | CPT | πŸ€— link | πŸ€– link | | InternVL3.5-8B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link | | InternVL3.5-8B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link | | InternVL3.5-8B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link | | InternVL3.5-14B-Pretrained | CPT | πŸ€— link | πŸ€– link | | InternVL3.5-14B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link | | InternVL3.5-14B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link | | InternVL3.5-14B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link | | InternVL3.5-30B-A3B-Pretrained | CPT | πŸ€— link | πŸ€– link | | InternVL3.5-30B-A3B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link | | InternVL3.5-30B-A3B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link | | InternVL3.5-30B-A3B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link | | InternVL3.5-38B-Pretrained | CPT | πŸ€— link | πŸ€– link | | InternVL3.5-38B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link | | InternVL3.5-38B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link | | InternVL3.5-38B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link | | InternVL3.5-241B-A28B-Pretrained | CPT | πŸ€— link | πŸ€– link | | InternVL3.5-241B-A28B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link | | InternVL3.5-241B-A28B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link | | InternVL3.5-241B-A28B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link | The Flash version of our model will be released as soon as possible. `InternVL3.5`: This series of models follow the "ViT–MLP–LLM" paradigm adopted in previous versions of InternVL. We initialize the language model using the Qwen3 series and GPT-OSS, and the vision encoder using InternViT-300M and InternViT-6B. The Dynamic High Resolution strategy introduced in InternVL1.5 is also retained in our design. `InternVL3.5-Flash`: Compared to InternVL3.5, InternVL3.5-Flash further integrates the Visual Resolution Router (ViR), thus yielding a series of efficient variants friendly suitable for resource-constrained scenarios. 
Specifically, in InternVL3.5, each image patch is initially represented as 1024 visual tokens for the vision encoder, which are then compressed into 256 tokens via a pixel shuffle module before being passed to the Large Language Model (LLM). In InternVL3.5-Flash, as shown in the Figure below, an additional pixel shuffle module with a higher compression rate is included, enabling the compression of visual tokens down to 64 tokens. For each patch, the patch router determines the appropriate compression rate by assessing its semantic richness, and routes it to the corresponding pixel shuffle module accordingly. Benefiting from this patch-aware compression mechanism, InternVL3.5-Flash is able to reduce the number of visual tokens by 50\% while maintaining nearly 100\% of the performance of InternVL3.5. During the pre-training stage, we update all model parameters jointly using the combination of large-scale text and multimodal corpora. Specifically, given an arbitrary training sample consisting of a multimodal token sequence \\(\mathbf{x}=\left(x1, x2, \ldots, xL\right)\\), the next token prediction (NTP) loss is calculated on each text token as follows: $$ \mathcal{L}{i}=-\log p\theta\left(xi \mid x1, \ldots, x{i-1}\right), $$ where \\(xi\\) is the predicted token and prefix tokens in \\(\{x1, x2, \ldots, x{i-1}\}\\) can be either text tokens or image tokens. Notably, for conversation samples, only response tokens are included for the calculation of the loss. Additionally, to mitigate bias toward either longer or shorter responses during training, we adopt the square averaging to re-weight the NTP loss as follows: $$ \mathcal{L}{i}^{'} = \frac{wi}{\sumj wj} \cdot \mathcal{L}i, \quad wi = \frac{1}{N^{0.5}}, $$ where \\(N\\) denotes the number of tokens in the training sample on which the loss needs to be calculated. The random JPEG compression is also included to enhance the model's real-world performance. 
During the SFT phase, we adopt the same objective as in the pre-training stage and use the square-root averaging strategy to calculate the final loss. In this stage, the context window is set to 32K tokens to adapt long-context information. Compared to InternVL3, the SFT stage of InternVL3.5 contains more high-quality and diverse training data derived from three sources: (1) Instruction-following data from InternVL3, which are reused to preserve broad coverage of vision–language tasks. (2) Multimodal reasoning data in the "Thinking" mode, which are included to instill long-thinking capabilities in the model. To construct such data, we first use InternVL3-78B to describe the image and then input the description into DeepSeek-R1 to sample rollouts with detailed reasoning processes. Rollouts with an incorrect final answer are filtered out. The questions in these datasets cover various expert domains, such as mathematics and scientific disciplines, thereby strengthening performance on different reasoning tasks. (3) Capability-expansion datasets, which endow InternVL3.5 with new skills, including GUI-based interaction, embodied interaction, and scalable vect Cascade RL aims to combine the benefits of offline RL and online RL to progressively facilitate the post-training of MLLMs in an efficient manner. Specifically, we first fine-tune the model using an offline RL algorithm as an efficient warm-up stage to reach a satisfied results, which can guarantee the high-quality rollouts for the latter stage. Subsequently, we employ an online RL algorithm to further refine the output distribution based on rollouts generated by the model itself. Compared to the single offline or online RL stage, our cascaded RL achieves significant performance improvements at a fraction of the GPU time cost. During the offline RL stage, we employ mixed preference optimization (MPO) to fine-tune the model. 
Specifically, the training objective of MPO is a combination of preference loss \\(\mathcal{L}{p}\\), quality loss \\(\mathcal{L}{q}\\), and generation loss \\(\mathcal{L}{g}\\), which can be formulated as follows: $$ \mathcal{L}{\text{MPO}}= w{p} \mathcal{L}{p} + w{q} \mathcal{L}{q} + w{g} \mathcal{L}{g} , $$ where \\(w{}\\) represents the weight assigned to each loss component. The DPO loss, BCO loss, and LM loss serve as the preference loss, quality loss, and generation loss, respectively. During the online RL stage, we employ GSPO, without reference model constraints, as our online RL algorithm, which we find more effective in training both dense and mixture-of-experts (MoE) models. Similar to GRPO, the advantage is defined as the normalized reward across responses sampled from the same query. The training objective of GSPO is given by: $$ \mathcal{L}{\mathrm{GSPO}}(\theta)=\mathbb{E}{x \sim \mathcal{D},\left\{yi\right\}{i=1}^G \sim \pi{\theta \text { old }}(\cdot \mid x)}\left[\frac{1}{G} \sum{i=1}^G \min \left(si(\theta) \widehat{A}i, \operatorname{clip}\left(si(\theta), 1-\varepsilon, 1+\varepsilon\right) \widehat{A}i\right)\right], $$ where the importance sampling ratio is defined as the geometric mean of the per-token ratios. > Please see our paper for more technical and experimental details. We further include ViCO as an additional training stage to integrate the visual resolution router (ViR) into InternVL3.5, thereby reducing the inference cost of InternVL3.5. The obtained efficient version of InternVL3.5 are termed as InternVL3.5-Flash. In particular, ViCO comprises two stages: `Consistency training`: In this stage, the entire model is trained to minimize the divergence between response distributions conditioned on visual tokens with different compression rates. In practice, we introduce an extra reference model, which is frozen and initialized with InternVL3.5. 
Given a sample, each image patch is represented as either 256 or 64 tokens, and the training objective is defined as follows: $$ \mathcal{L}_{\text{ViCO}} = \mathbb{E}_{\xi \sim \mathcal{R}} \left[ \frac{1}{N} \sum_{i=1}^{N} \mathrm{KL} \Big( \pi_{\theta_{\text{ref}}}\left(y_i \mid y_{<i}\right) \;\Big\|\; \pi_{\theta}\left(y_i \mid y_{<i}, \xi\right) \Big) \right], $$ where \\(\mathcal{R}\\) denotes the distribution over compression rates (i.e., each patch compressed to 256 or 64 tokens), \\(\xi\\) is the sampled rate conditioning the policy's visual input, and \\(N\\) is the number of response tokens.

`Router training`: In this stage, only the ViR is trained to select the appropriate compression rate for each image patch based on its semantic richness, while the rest of the model is kept frozen.

> Please see our paper for more technical and experimental details.

Test-time scaling (TTS) has been empirically demonstrated to be an effective approach for enhancing the reasoning capabilities of LLMs and MLLMs, particularly for complex tasks necessitating multi-step inference. In this work, we implement a comprehensive test-time scaling approach that simultaneously improves reasoning depth (i.e., deep thinking) and breadth (i.e., parallel thinking).

`Deep Thinking`: By activating the Thinking mode, we guide the model to deliberately engage in step-by-step reasoning (i.e., decomposing complex problems into logical steps and validating intermediate conclusions) prior to generating the final answer. This approach systematically improves the logical structure of solutions for complex problems, particularly those requiring multi-step inference, and enhances reasoning depth.

`Parallel Thinking`: Following InternVL3, for reasoning tasks, we adopt the Best-of-N (BoN) strategy, employing VisualPRM-v1.1 as the critic model to select the optimal response from multiple reasoning candidates. This approach improves reasoning breadth.

> Notably, unless otherwise specified, the experimental results reported in our paper are obtained without applying TTS. Thus far, we have only applied TTS to reasoning benchmarks, since we found that the model already exhibits strong perception and understanding capabilities, and applying TTS yields no significant improvement.

In multimodal inference, the vision encoder and language model have distinct computational characteristics. The vision encoder, which transforms images into semantic features, is highly parallelizable and does not rely on long-term history states.
In contrast, the language model performs inference in an autoregressive manner, which requires previous states to compute the next one. This sequential property makes the language part more sensitive to memory bandwidth and latency. When MLLMs are deployed online at scale, the vision and language models often block each other, incurring additional inference cost. This effect becomes more pronounced with larger vision models or higher-resolution images. As shown in the figure above, we propose Decoupled Vision-Language Deployment (DvD) to address this issue by separating vision and language processing, with a particular focus on optimizing the prefilling stage. The vision subsystem batches and processes images to produce compact feature embeddings, which are then transmitted to the language subsystem and fused with the text context prior to decoding. This separation alleviates blocking and brings multimodal prefilling performance closer to that of pure language models.

In our system implementation, the ViT and MLP (and the ViR for InternVL3.5-Flash) are deployed on the vision server, while the language server executes only the LLM. Communication is unidirectional, transmitting BF16 visual features over TCP, with RDMA optionally employed for higher transmission speed. Vision processing, feature transmission, and language processing are organized into an asynchronous three-stage pipeline, enabling overlapped execution and minimizing pipeline stalls. DvD increases GPU utilization and processing efficiency on the vision side, while enabling the language server to focus exclusively on the LLM's prefilling and decoding without being blocked by vision computation. This design leads to improved throughput and responsiveness. Moreover, the architecture supports independent hardware cost optimization for the vision and language modules, and facilitates the seamless integration of new modules without requiring modifications to the language server deployment.
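The asynchronous three-stage pipeline behind DvD can be sketched as a toy thread-and-queue program. This is a simplified illustration, not the production system: the stage functions (`encode_fn` for ViT+MLP, `transmit_fn` for the TCP/RDMA hop, `prefill_fn` for the LLM) are hypothetical placeholders supplied by the caller.

```python
import queue
import threading

def run_dvd_pipeline(images, encode_fn, transmit_fn, prefill_fn):
    """Toy three-stage pipeline: vision encoding -> feature transmission -> LLM
    prefill. Stages run in separate threads and overlap via bounded queues."""
    q_feat = queue.Queue(maxsize=4)   # vision server -> transport
    q_recv = queue.Queue(maxsize=4)   # transport -> language server
    results = []

    def vision_worker():
        for img in images:
            q_feat.put(encode_fn(img))      # batched ViT+MLP in the real system
        q_feat.put(None)                    # sentinel: no more work

    def transport_worker():
        while (feat := q_feat.get()) is not None:
            q_recv.put(transmit_fn(feat))   # BF16 features over TCP/RDMA
        q_recv.put(None)

    def language_worker():
        while (feat := q_recv.get()) is not None:
            results.append(prefill_fn(feat))  # fuse with text context, then decode

    workers = [threading.Thread(target=f)
               for f in (vision_worker, transport_worker, language_worker)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return results
```

Because each stage only blocks on its own queue, vision encoding of image *k+1* overlaps with transmission of image *k* and prefilling of image *k-1*, which is the overlapped execution the text describes.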
Multi-Image Understanding & Real-World Comprehension

Comprehensive Multimodal Understanding & Multimodal Hallucination Evaluation

We provide example code to run `InternVL3.5-8B` using `transformers`. Please note that our models with up to 30B parameters can be deployed on a single A100 GPU, while the 38B model requires two A100 GPUs and the 235B model requires eight A100 GPUs.

> In most cases, both LMDeploy and vLLM can be used for model deployment. However, for InternVL3.5-20B-A4B, we recommend using vLLM, since LMDeploy does not yet support GPT-OSS.

> Please use transformers>=4.52.1 to ensure the model works normally. For the 20B version of our model, transformers>=4.55.0 is required.

To enable thinking mode, please set the system prompt to our Thinking System Prompt. When enabling Thinking mode, we recommend setting `do_sample=True` and `temperature=0.6` to mitigate undesired repetition. Besides this method, you can also use the following code to get streamed output.

Many repositories now support fine-tuning of the InternVL series models, including InternVL, SWIFT, XTuner, and others. Please refer to their documentation for more details on fine-tuning.

LMDeploy is a toolkit for compressing, deploying, and serving LLMs & VLMs. It abstracts the complex inference process of multimodal vision-language models (VLMs) into an easy-to-use pipeline, similar to the large language model (LLM) inference pipeline. When dealing with multiple images, you can put them all in one list. Keep in mind that multiple images lead to a higher number of input tokens, so the size of the context window typically needs to be increased. Conducting inference with batch prompts is quite straightforward; just place them within a list structure. There are two ways to conduct multi-turn conversations with the pipeline: one is to construct messages according to the OpenAI format and use the method introduced above, the other is to use the `pipeline.chat` interface.
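The OpenAI-style message construction for multi-image, multi-turn queries can be sketched as a small helper. This is our illustrative function, not part of LMDeploy; it only assembles the message list that such pipelines accept:

```python
def build_messages(question, image_urls, history=None):
    """Assemble an OpenAI-style message list for a multi-image query.
    `history` is an optional list of prior (user_text, assistant_text) turns."""
    messages = []
    for user_text, assistant_text in history or []:
        messages.append({"role": "user",
                         "content": [{"type": "text", "text": user_text}]})
        messages.append({"role": "assistant", "content": assistant_text})
    # Current turn: one text part followed by one image_url part per image.
    content = [{"type": "text", "text": question}]
    content += [{"type": "image_url", "image_url": {"url": u}} for u in image_urls]
    messages.append({"role": "user", "content": content})
    return messages
```

Note that every image appended to `content` adds visual tokens to the prompt, which is why the context window usually needs to grow with the number of images.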
LMDeploy's `api_server` enables models to be easily packed into services with a single command. The provided RESTful APIs are compatible with OpenAI's interfaces. Below is an example of service startup. To use the OpenAI-style interface, you need to install the OpenAI SDK. This project is released under the Apache-2.0 License. This project uses the pre-trained Qwen3 as a component, which is also licensed under the Apache-2.0 License. If you find this project useful in your research, please consider citing:

license:apache-2.0
377
5

ScaleCUA-3B

license:apache-2.0
375
9

SDLM-32B-D4

[\[📂 GitHub\]](https://github.com/OpenGVLab/SDLM) [\[📜 Tech Report\]](https://arxiv.org/abs/2509.24007) 🚀 Project Page [\[🤗 HuggingFace Collection\]](https://huggingface.co/collections/OpenGVLab/sdlm-68ac82709d7c343ad36aa552) We propose a Sequential Diffusion Language Model (SDLM) to elicit the parallel prediction capabilities of diffusion models at low cost. Specifically, SDLM reduces distribution shift by limiting the prediction range to a fixed block length and enforces decoding order through a longest-prefix decoding method, thereby significantly improving prediction efficiency while ensuring generation quality. Our method can be viewed as a further generalization of the autoregressive (AR) paradigm, so pre-trained AR weights can be reused and quickly migrated to the diffusion framework with only minimal instruction fine-tuning. SDLM delivers strong performance with significantly faster decoding: it operates approximately 2x faster than comparable autoregressive models while matching their accuracy, and achieves up to 5x speedup over other diffusion language models, as evidenced by results on the MATH-500 benchmark.

- Autoregression: predicts tokens one by one.
- Diffusion: regenerates all tokens at each step.
- SDLM (ours): decodes D tokens per step, then keeps the longest consecutive run of n confident tokens (1 ≤ n ≤ D). Cached tokens are reused, saving computation.

In the following table, we provide an overview of the SDLM series.
| Model Name | Base Model 🤗 | HF Link 🤗 |
| :--------- | :------------ | :--------- |
| SDLM-3B-D4 | Qwen2.5-3B | https://huggingface.co/OpenGVLab/SDLM-3B-D4 |
| SDLM-3B-D8 | Qwen2.5-3B | https://huggingface.co/OpenGVLab/SDLM-3B-D8 |
| SDLM-32B-D4 | Qwen2.5-32B | https://huggingface.co/OpenGVLab/SDLM-32B-D4 |

We propose a sequential blockwise masked prediction method that reduces error accumulation in diffusion-based generation. Our method leverages the observation that predictions for tokens at lower positional indices typically benefit from more reliable contextual information, resulting in lower deviation and improved accuracy.

(a) Training pipeline. Reordered input enables a structured mask with a causal prefix (top-left), visible cross-block prefix (bottom-left), and intra-block bidirectional attention (bottom-right). (b) Sampling pipeline. Confidence-based dynamic block decoding with KV cache reuse. At each step, a block of B tokens is predicted with B-1 padding masks. The longest high-confidence prefix is selected as the dynamic output. Cached KV states enable efficient decoding.

Trade-off between performance and speed under different confidence thresholds τ for SDLM-3B (B=4) and SDLM-3B (B=8). By adjusting τ, a controllable trade-off between speed and performance can be achieved. SpeedUp denotes the average number of tokens output per forward pass.

2. Download the model generation script sdlm_inference.py to your working directory.
3. We provide example code to run `SDLM-32B-D4` using `transformers`. Note: additional setup is required if using Flex Attention.
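The confidence-based dynamic block decoding described above can be sketched in a few lines of Python. This is a toy sketch of the selection rule, not the released implementation: `predict_block` is a hypothetical stand-in for the model call, and KV-cache reuse is elided.

```python
def accept_prefix(confidences, tau=0.9):
    """Keep the longest prefix of the predicted block whose token confidences
    all reach the threshold tau; always commit at least one token (1 <= n <= D)."""
    n = 0
    for c in confidences:
        if c < tau:
            break
        n += 1
    return max(n, 1)

def sdlm_decode(predict_block, prompt, max_tokens=32, tau=0.9):
    """Toy decode loop: each step predicts a block of D tokens with per-token
    confidences, then commits only the longest confident prefix."""
    out = list(prompt)
    while len(out) - len(prompt) < max_tokens:
        tokens, confs = predict_block(out)   # hypothetical model call
        n = accept_prefix(confs, tau)
        out.extend(tokens[:n])
        if "<eos>" in tokens[:n]:
            break
    return out[len(prompt):]
```

Lowering `tau` commits longer prefixes per forward pass (higher SpeedUp) at some cost in quality, which is exactly the τ trade-off reported for SDLM-3B.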
The training dataset we used is specified in the meta file meta.json and is organized in the ShareGPT style, according to the InternVL chat data format. This dataset is composed of several open-source datasets, with the following structure:

| Dataset Name | # Samples | Domain |
| :----------------- | :----- | :------ |
| ScaleQuest-Math | 1,000K | Math |
| Opc-sft-stage2 | 436K | Code |
| Smoltalk | 1,100K | General |
| Tulu-3-sft-mixture | 939K | General |
| SciRIFF | 79K | Science |
| Table-GPT | 13K | Table |
| Total | 3,506K | -- |

All training scripts are available in the shell/train directory. Key parameters include:

`block_size`: the size of the diffusion window. Current settings use `4`; we have also tried `8`, and larger sizes are under exploration.
`attn_implementation`: the attention implementation. Options include sdpa, eager, or flex_attn. Using Flex Attention requires additional setup; prefer `sdpa` for a quick start.
`causal_attn`: whether to use causal attention within the window. Currently set to non-causal (`False`).

For more training details, please refer to GitHub. Currently, we use OpenCompass for evaluation. For more details, please refer to the evaluation guide.

We extend our gratitude to the open-source community for their foundational contributions: InternVL, the codebase we build upon; SMDM, LLaDA, Dream, and Block Diffusion for insights into diffusion-based generative modeling; Qwen2.5 as a robust base model for comparative studies; OpenCompass for providing a comprehensive evaluation framework; and the creators of all datasets used in this work, enabling rigorous training and validation. If you find this project useful in your research, please consider citing:

license:apache-2.0
373
11

InternVL3-38B-AWQ

β€”
361
4

InternVL3_5-1B-Pretrained

[\[πŸ“‚ GitHub\]](https://github.com/OpenGVLab/InternVL) [\[πŸ“œ InternVL 1.0\]](https://huggingface.co/papers/2312.14238) [\[πŸ“œ InternVL 1.5\]](https://huggingface.co/papers/2404.16821) [\[πŸ“œ InternVL 2.5\]](https://huggingface.co/papers/2412.05271) [\[πŸ“œ InternVL2.5-MPO\]](https://huggingface.co/papers/2411.10442) [\[πŸ“œ InternVL3\]](https://huggingface.co/papers/2504.10479) [\[πŸ“œ InternVL3.5\]](https://huggingface.co/papers/2508.18265) [\[πŸ†• Blog\]](https://internvl.github.io/blog/) [\[πŸ—¨οΈ Chat Demo\]](https://chat.intern-ai.org.cn/) [\[πŸš€ Quick Start\]](#quick-start) [\[πŸ“– Documents\]](https://internvl.readthedocs.io/en/latest/) We introduce InternVL3.5, a new family of open-source multimodal models that significantly advances versatility, reasoning capability, and inference efficiency along the InternVL series. A key innovation is the Cascade Reinforcement Learning (Cascade RL) framework, which enhances reasoning through a two-stage process: offline RL for stable convergence and online RL for refined alignment. This coarse-to-fine training strategy leads to substantial improvements on downstream reasoning tasks, e.g., MMMU and MathVista. To optimize efficiency, we propose a Visual Resolution Router (ViR) that dynamically adjusts the resolution of visual tokens without compromising performance. Coupled with ViR, our Decoupled Vision-Language Deployment (DvD) strategy separates the vision encoder and language model across different GPUs, effectively balancing computational load. These contributions collectively enable InternVL3.5 to achieve up to a +16.0\% gain in overall reasoning performance and a 4.05 \\(\times\\) inference speedup compared to its predecessor, i.e., InternVL3. In addition, InternVL3.5 supports novel capabilities such as GUI interaction and embodied agency. 
Notably, our largest model, i.e., InternVL3.5-241B-A28B, attains state-of-the-art results among open-source MLLMs across general multimodal, reasoning, text, and agentic tasks, narrowing the performance gap with leading commercial models such as GPT-5. All models and code are publicly released.

> Hatched bars represent closed-source commercial models. We report average scores on a set of multimodal general, reasoning, text, and agentic benchmarks: MMBench v1.1 (en), MMStar, BLINK, HallusionBench, AI2D, OCRBench, MMVet, MME-RealWorld (en), MVBench, VideoMME, MMMU, MathVista, MathVision, MathVerse, DynaMath, WeMath, LogicVista, MATH500, AIME24, AIME25, GPQA, MMLU-Pro, GAOKAO, IFEval, SGP-Bench, VSI-Bench, ERQA, SpaCE-10, and OmniSpatial.

In the following table, we provide an overview of the InternVL3.5 series. To maintain consistency with earlier generations, we provide two model formats: the GitHub format, consistent with prior releases, and the HF format, aligned with the official Transformers standard.

> If you want to convert a checkpoint between these two formats, please refer to the custom2hf and hf2custom scripts.
| Model | #Vision Param | #Language Param | #Total Param | HF Link | ModelScope Link |
| --------------------- | ------------- | --------------- | ------------ | ------- | --------------- |
| InternVL3.5-1B | 0.3B | 0.8B | 1.1B | 🤗 link | 🤖 link |
| InternVL3.5-2B | 0.3B | 2.0B | 2.3B | 🤗 link | 🤖 link |
| InternVL3.5-4B | 0.3B | 4.4B | 4.7B | 🤗 link | 🤖 link |
| InternVL3.5-8B | 0.3B | 8.2B | 8.5B | 🤗 link | 🤖 link |
| InternVL3.5-14B | 0.3B | 14.8B | 15.1B | 🤗 link | 🤖 link |
| InternVL3.5-38B | 5.5B | 32.8B | 38.4B | 🤗 link | 🤖 link |
| InternVL3.5-20B-A4B | 0.3B | 20.9B | 21.2B-A4B | 🤗 link | 🤖 link |
| InternVL3.5-30B-A3B | 0.3B | 30.5B | 30.8B-A3B | 🤗 link | 🤖 link |
| InternVL3.5-241B-A28B | 5.5B | 235.1B | 240.7B-A28B | 🤗 link | 🤖 link |

| Model | #Vision Param | #Language Param | #Total Param | HF Link | ModelScope Link |
| ------------------------ | ------------- | --------------- | ------------ | ------- | --------------- |
| InternVL3.5-1B-HF | 0.3B | 0.8B | 1.1B | 🤗 link | 🤖 link |
| InternVL3.5-2B-HF | 0.3B | 2.0B | 2.3B | 🤗 link | 🤖 link |
| InternVL3.5-4B-HF | 0.3B | 4.4B | 4.7B | 🤗 link | 🤖 link |
| InternVL3.5-8B-HF | 0.3B | 8.2B | 8.5B | 🤗 link | 🤖 link |
| InternVL3.5-14B-HF | 0.3B | 14.8B | 15.1B | 🤗 link | 🤖 link |
| InternVL3.5-38B-HF | 5.5B | 32.8B | 38.4B | 🤗 link | 🤖 link |
| InternVL3.5-20B-A4B-HF | 0.3B | 20.9B | 21.2B-A4B | 🤗 link | 🤖 link |
| InternVL3.5-30B-A3B-HF | 0.3B | 30.5B | 30.8B-A3B | 🤗 link | 🤖 link |
| InternVL3.5-241B-A28B-HF | 5.5B | 235.1B | 240.7B-A28B | 🤗 link | 🤖 link |

> We conduct the evaluation with VLMEvalKit.
To enable the Thinking mode of our model, please set the system prompt to R1_SYSTEM_PROMPT. When enabling Thinking mode, we recommend setting `do_sample=True` and `temperature=0.6` to mitigate undesired repetition.

Our training pipeline comprises four stages: Multimodal Continual Pre-Training (CPT), Supervised Fine-Tuning (SFT), and a two-stage Cascade Reinforcement Learning (Cascade RL), in which we first fine-tune the model using Mixed Preference Optimization (MPO) under an offline RL setting, followed by GSPO under an online RL setting. For the Flash version of InternVL3.5, we additionally introduce a lightweight training stage, termed Visual Consistency Learning (ViCO), which reduces the token cost required to represent an image patch.

Here, we also open-source the model weights after different training stages for potential research usage. If you're unsure which version to use, please select the one without any suffix, as it has completed the full training pipeline.

| Model | Training Pipeline | HF Link | ModelScope Link |
| -------------------------------- | --------------------- | ------- | --------------- |
| InternVL3.5-1B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-1B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-1B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-1B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |
| InternVL3.5-2B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-2B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-2B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-2B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |
| InternVL3.5-4B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-4B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-4B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-4B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |
| InternVL3.5-8B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-8B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-8B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-8B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |
| InternVL3.5-14B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-14B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-14B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-14B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |
| InternVL3.5-30B-A3B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-30B-A3B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-30B-A3B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-30B-A3B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |
| InternVL3.5-38B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-38B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-38B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-38B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |
| InternVL3.5-241B-A28B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-241B-A28B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-241B-A28B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-241B-A28B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |

The Flash version of our model will be released as soon as possible.

`InternVL3.5`: This series of models follows the "ViT–MLP–LLM" paradigm adopted in previous versions of InternVL. We initialize the language model with the Qwen3 series or GPT-OSS, and the vision encoder with InternViT-300M or InternViT-6B. The Dynamic High Resolution strategy introduced in InternVL1.5 is also retained in our design.

`InternVL3.5-Flash`: Compared to InternVL3.5, InternVL3.5-Flash further integrates the Visual Resolution Router (ViR), yielding a series of efficient variants suitable for resource-constrained scenarios.
Specifically, in InternVL3.5, each image patch is initially represented as 1024 visual tokens by the vision encoder, which are then compressed into 256 tokens via a pixel shuffle module before being passed to the Large Language Model (LLM). In InternVL3.5-Flash, as shown in the figure below, an additional pixel shuffle module with a higher compression rate is included, enabling the compression of visual tokens down to 64 tokens. For each patch, the patch router determines the appropriate compression rate by assessing its semantic richness, and routes it to the corresponding pixel shuffle module accordingly. Benefiting from this patch-aware compression mechanism, InternVL3.5-Flash is able to reduce the number of visual tokens by 50\% while maintaining nearly 100\% of the performance of InternVL3.5.

During the pre-training stage, we update all model parameters jointly using a combination of large-scale text and multimodal corpora. Specifically, given an arbitrary training sample consisting of a multimodal token sequence \\(\mathbf{x}=\left(x_1, x_2, \ldots, x_L\right)\\), the next token prediction (NTP) loss is calculated on each text token as follows: $$ \mathcal{L}_{i}=-\log p_{\theta}\left(x_i \mid x_1, \ldots, x_{i-1}\right), $$ where \\(x_i\\) is the predicted token and the prefix tokens \\(\{x_1, x_2, \ldots, x_{i-1}\}\\) can be either text tokens or image tokens. Notably, for conversation samples, only response tokens are included in the calculation of the loss. Additionally, to mitigate bias toward either longer or shorter responses during training, we adopt square-root averaging to re-weight the NTP loss as follows: $$ \mathcal{L}_{i}^{'} = \frac{w_i}{\sum_j w_j} \cdot \mathcal{L}_i, \quad w_i = \frac{1}{N^{0.5}}, $$ where \\(N\\) denotes the number of tokens in the training sample on which the loss needs to be calculated. Random JPEG compression is also applied during training to enhance the model's robustness to real-world images.
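The square-root re-weighting can be illustrated with a small sketch. This is a simplified per-sample version (our own helper, not from the released code) that treats each sample as a (mean token loss, token count) pair and applies w_i = 1 / N_i**0.5 normalized over the batch:

```python
def reweighted_losses(samples):
    """Re-weight per-sample NTP losses so that neither long nor short responses
    dominate the gradient. `samples` is a list of
    (mean_token_loss, num_loss_tokens) pairs for one batch."""
    weights = [1.0 / (n ** 0.5) for _, n in samples]   # w_i = N_i^{-0.5}
    total = sum(weights)
    # L'_i = (w_i / sum_j w_j) * L_i
    return [w / total * loss for (loss, _), w in zip(samples, weights)]
```

With a 1-token and a 4-token sample of equal mean loss, the short sample receives twice the weight of the long one (1 vs. 1/2) rather than four times, which is the intended compromise between per-token and per-sample averaging.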
During the SFT phase, we adopt the same objective as in the pre-training stage and use the square-root averaging strategy to calculate the final loss. In this stage, the context window is set to 32K tokens to adapt long-context information. Compared to InternVL3, the SFT stage of InternVL3.5 contains more high-quality and diverse training data derived from three sources: (1) Instruction-following data from InternVL3, which are reused to preserve broad coverage of vision–language tasks. (2) Multimodal reasoning data in the "Thinking" mode, which are included to instill long-thinking capabilities in the model. To construct such data, we first use InternVL3-78B to describe the image and then input the description into DeepSeek-R1 to sample rollouts with detailed reasoning processes. Rollouts with an incorrect final answer are filtered out. The questions in these datasets cover various expert domains, such as mathematics and scientific disciplines, thereby strengthening performance on different reasoning tasks. (3) Capability-expansion datasets, which endow InternVL3.5 with new skills, including GUI-based interaction, embodied interaction, and scalable vect Cascade RL aims to combine the benefits of offline RL and online RL to progressively facilitate the post-training of MLLMs in an efficient manner. Specifically, we first fine-tune the model using an offline RL algorithm as an efficient warm-up stage to reach a satisfied results, which can guarantee the high-quality rollouts for the latter stage. Subsequently, we employ an online RL algorithm to further refine the output distribution based on rollouts generated by the model itself. Compared to the single offline or online RL stage, our cascaded RL achieves significant performance improvements at a fraction of the GPU time cost. During the offline RL stage, we employ mixed preference optimization (MPO) to fine-tune the model. 
Specifically, the training objective of MPO is a combination of preference loss \\(\mathcal{L}{p}\\), quality loss \\(\mathcal{L}{q}\\), and generation loss \\(\mathcal{L}{g}\\), which can be formulated as follows: $$ \mathcal{L}{\text{MPO}}= w{p} \mathcal{L}{p} + w{q} \mathcal{L}{q} + w{g} \mathcal{L}{g} , $$ where \\(w{}\\) represents the weight assigned to each loss component. The DPO loss, BCO loss, and LM loss serve as the preference loss, quality loss, and generation loss, respectively. During the online RL stage, we employ GSPO, without reference model constraints, as our online RL algorithm, which we find more effective in training both dense and mixture-of-experts (MoE) models. Similar to GRPO, the advantage is defined as the normalized reward across responses sampled from the same query. The training objective of GSPO is given by: $$ \mathcal{L}{\mathrm{GSPO}}(\theta)=\mathbb{E}{x \sim \mathcal{D},\left\{yi\right\}{i=1}^G \sim \pi{\theta \text { old }}(\cdot \mid x)}\left[\frac{1}{G} \sum{i=1}^G \min \left(si(\theta) \widehat{A}i, \operatorname{clip}\left(si(\theta), 1-\varepsilon, 1+\varepsilon\right) \widehat{A}i\right)\right], $$ where the importance sampling ratio is defined as the geometric mean of the per-token ratios. > Please see our paper for more technical and experimental details. We further include ViCO as an additional training stage to integrate the visual resolution router (ViR) into InternVL3.5, thereby reducing the inference cost of InternVL3.5. The obtained efficient version of InternVL3.5 are termed as InternVL3.5-Flash. In particular, ViCO comprises two stages: `Consistency training`: In this stage, the entire model is trained to minimize the divergence between response distributions conditioned on visual tokens with different compression rates. In practice, we introduce an extra reference model, which is frozen and initialized with InternVL3.5. 
Given a sample, each image patch is represented as either 256 or 64 tokens, and the training objective is defined as follows: $$ \mathcal{L}\text{ViCO} = \mathbb{E}{\xi \sim \mathcal{R}} \Bigg \frac{1}{N} \sum{i=1}^{N} \mathrm{KL} \Big( \pi{\theta{ref}}\left(yi \mid y{ Please see [our paper for more technical and experimental details. Test-time scaling (TTS) has been empirically demonstrated as an effective approach to enhance the reasoning capabilities of LLMs and MLLMs, particularly for complex tasks necessitating multi-step inference. In this work, we implement a comprehensive test-time scaling approach that simultaneously improves reasoning depth (i.e., deep thinking) and breadth (i.e., parallel thinking). `Deep Thinking`: By activating the Thinking mode, we guide the model to deliberately engage in step-by-step reasoning (i.e., decomposing complex problems into logical steps and validating intermediate conclusions) prior to generating the final answer. This approach systematically improves the logical structure of solutions for complex problems, particularly those requiring multi-step inference, and enhances reasoning depth. `Parallel Thinking`: Following InternVL3, for reasoning tasks, we adopt the Best-of-N (BoN) strategy by employing VisualPRM-v1.1 as the critic model to select the optimal response from multiple reasoning candidates. This approach improves reasoning breadth. > Notably, unless otherwise specified, the experimental results reported in our paper are obtained without applying TTS. Thus far, we have only applied TTS to reasoning benchmarks, since we found that the model already exhibits strong perception and understanding capabilities, and initiating TTS yields no significant improvement. In multimodal inference, the vision encoder and language model have distinct computational characteristics. The vision encoder that transforms images into semantic features is highly parallelizable and does not rely on long-term history state. 
In contrast, the language model adopts the inference in an autoregressive manner, which requires previous states to compute the next one. This sequential property makes the language part more sensitive to memory bandwidth and latency. When MLLMs are deployed online at scale, the vision and language models often block each other, thus incurring additional inference cost. This effect becomes more pronounced with larger vision models or higher-resolution images. As shown in the Figure above, we propose decoupled vision-language deployment (DvD) to address this issue by separating vision and language processing, with a particular focus on optimizing the prefilling stage. The vision subsystem batches and processes images to produce compact feature embeddings, which are then transmitted to the language subsystem for fusion with the text context prior to decoding. This separation alleviates blocking and brings multimodal prefilling performance closer to that of pure language models. In our system implementation, the ViT and MLP (and ViR for InternVL3.5-Flash) are deployed on the vision server, while the language server executes only the LLM. The communication is unidirectional, transmitting BF16 visual features over TCP, with RDMA optionally employed to achieve higher transmission speed. Vision processing, feature transmission, and language processing are organized into an asynchronous three-stage pipeline, enabling overlapped execution and minimizing pipeline stalls. DvD increases GPU utilization and processing efficiency on the vision side, while enabling the language server to focus exclusively on the LLM’s prefilling and decoding without being blocked by vision computation. This design leads to improved throughput and responsiveness. Moreover, the architecture supports independent hardware cost optimization for the vision and language modules, and facilitates the seamless integration of new modules without requiring modifications to the language server deployment. 
Multi-Image Understanding & Real-World Comprehension Comprehensive Multimodal Understanding & Multimodal Hallucination Evaluation We provide an example code to run `InternVL3.5-8B` using `transformers`. Please note that our models with up to 30B parameters can be deployed on a single A100 GPU, while the 38B model requires two A100 GPUs and the 235B model requires eight A100 GPUs. > In most cases, both LMDeploy and vLLM can be used for model deployment. However, for InternVL3.5-20B-A4B, we recommend using vLLM since lmdeploy has not yet supported GPT-OSS. > Please use transformers>=4.52.1 to ensure the model works normally. For the 20B version of our model, transformers>=4.55.0 is required. To enable thinking mode, please set the system prompt to our Thinking System Prompt. When enabling Thinking mode, we recommend setting `dosample=True` and `temperature=0.6` to mitigate undesired repetition. Besides this method, you can also use the following code to get streamed output. Many repositories now support fine-tuning of the InternVL series models, including InternVL, SWIFT, XTuner, and others. Please refer to their documentation for more details on fine-tuning. LMDeploy is a toolkit for compressing, deploying, and serving LLMs & VLMs. LMDeploy abstracts the complex inference process of multi-modal Vision-Language Models (VLM) into an easy-to-use pipeline, similar to the Large Language Model (LLM) inference pipeline. When dealing with multiple images, you can put them all in one list. Keep in mind that multiple images will lead to a higher number of input tokens, and as a result, the size of the context window typically needs to be increased. Conducting inference with batch prompts is quite straightforward; just place them within a list structure: There are two ways to do the multi-turn conversations with the pipeline. One is to construct messages according to the format of OpenAI and use above introduced method, the other is to use the `pipeline.chat` interface. 
LMDeploy's `api_server` enables models to be easily packed into services with a single command. The provided RESTful APIs are compatible with OpenAI's interfaces. Below is an example of service startup. To use the OpenAI-style interface, you need to install the OpenAI SDK. This project is released under the Apache-2.0 License. This project uses the pre-trained Qwen3 as a component, which is licensed under the Apache-2.0 License. If you find this project useful in your research, please consider citing:
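A sketch of querying the OpenAI-compatible endpoint once `api_server` is running, assuming a launch along the lines of `lmdeploy serve api_server <model> --server-port 23333`; the host, port, model name, and image URL are placeholders:

```python
def build_chat_request(model: str, image_url: str, question: str) -> dict:
    """OpenAI-style multimodal chat payload: a text part plus an
    image_url content part in a single user message."""
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    }

def ask_server(question: str, image_url: str) -> str:
    """Send the request to a running api_server (requires `pip install
    openai` and a live server; not executed here)."""
    from openai import OpenAI

    client = OpenAI(api_key="none", base_url="http://0.0.0.0:23333/v1")
    req = build_chat_request("OpenGVLab/InternVL3_5-8B", image_url, question)
    resp = client.chat.completions.create(**req)
    return resp.choices[0].message.content
```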

license:apache-2.0
359
1

InternVL3_5-2B-Pretrained

[\[πŸ“‚ GitHub\]](https://github.com/OpenGVLab/InternVL) [\[πŸ“œ InternVL 1.0\]](https://huggingface.co/papers/2312.14238) [\[πŸ“œ InternVL 1.5\]](https://huggingface.co/papers/2404.16821) [\[πŸ“œ InternVL 2.5\]](https://huggingface.co/papers/2412.05271) [\[πŸ“œ InternVL2.5-MPO\]](https://huggingface.co/papers/2411.10442) [\[πŸ“œ InternVL3\]](https://huggingface.co/papers/2504.10479) [\[πŸ“œ InternVL3.5\]](https://huggingface.co/papers/2508.18265) [\[πŸ†• Blog\]](https://internvl.github.io/blog/) [\[πŸ—¨οΈ Chat Demo\]](https://chat.intern-ai.org.cn/) [\[πŸš€ Quick Start\]](#quick-start) [\[πŸ“– Documents\]](https://internvl.readthedocs.io/en/latest/) We introduce InternVL3.5, a new family of open-source multimodal models that significantly advances versatility, reasoning capability, and inference efficiency along the InternVL series. A key innovation is the Cascade Reinforcement Learning (Cascade RL) framework, which enhances reasoning through a two-stage process: offline RL for stable convergence and online RL for refined alignment. This coarse-to-fine training strategy leads to substantial improvements on downstream reasoning tasks, e.g., MMMU and MathVista. To optimize efficiency, we propose a Visual Resolution Router (ViR) that dynamically adjusts the resolution of visual tokens without compromising performance. Coupled with ViR, our Decoupled Vision-Language Deployment (DvD) strategy separates the vision encoder and language model across different GPUs, effectively balancing computational load. These contributions collectively enable InternVL3.5 to achieve up to a +16.0\% gain in overall reasoning performance and a 4.05 \\(\times\\) inference speedup compared to its predecessor, i.e., InternVL3. In addition, InternVL3.5 supports novel capabilities such as GUI interaction and embodied agency. 
Notably, our largest model, i.e., InternVL3.5-241B-A28B, attains state-of-the-art results among open-source MLLMs across general multimodal, reasoning, text, and agentic tasks, narrowing the performance gap with leading commercial models like GPT-5. All models and code are publicly released. > Hatched bars represent closed-source commercial models. We report average scores on a set of multimodal general, reasoning, text, and agentic benchmarks: MMBench v1.1 (en), MMStar, BLINK, HallusionBench, AI2D, OCRBench, MMVet, MME-RealWorld (en), MVBench, VideoMME, MMMU, MathVista, MathVision, MathVerse, DynaMath, WeMath, LogicVista, MATH500, AIME24, AIME25, GPQA, MMLU-Pro, GAOKAO, IFEval, SGP-Bench, VSI-Bench, ERQA, SpaCE-10, and OmniSpatial. In the following table, we provide an overview of the InternVL3.5 series. To maintain consistency with earlier generations, we provide two model formats: the GitHub format, consistent with prior releases, and the HF format, aligned with the official Transformers standard. > If you want to convert a checkpoint between these two formats, please refer to the custom2hf and hf2custom scripts.
| Model | #Vision Param | #Language Param | #Total Param | HF Link | ModelScope Link |
| --------------------- | ------------- | --------------- | ------------ | ------- | --------------- |
| InternVL3.5-1B | 0.3B | 0.8B | 1.1B | πŸ€— link | πŸ€– link |
| InternVL3.5-2B | 0.3B | 2.0B | 2.3B | πŸ€— link | πŸ€– link |
| InternVL3.5-4B | 0.3B | 4.4B | 4.7B | πŸ€— link | πŸ€– link |
| InternVL3.5-8B | 0.3B | 8.2B | 8.5B | πŸ€— link | πŸ€– link |
| InternVL3.5-14B | 0.3B | 14.8B | 15.1B | πŸ€— link | πŸ€– link |
| InternVL3.5-38B | 5.5B | 32.8B | 38.4B | πŸ€— link | πŸ€– link |
| InternVL3.5-20B-A4B | 0.3B | 20.9B | 21.2B-A4B | πŸ€— link | πŸ€– link |
| InternVL3.5-30B-A3B | 0.3B | 30.5B | 30.8B-A3B | πŸ€— link | πŸ€– link |
| InternVL3.5-241B-A28B | 5.5B | 235.1B | 240.7B-A28B | πŸ€— link | πŸ€– link |

| Model | #Vision Param | #Language Param | #Total Param | HF Link | ModelScope Link |
| ------------------------ | ------------- | --------------- | ------------ | ------- | --------------- |
| InternVL3.5-1B-HF | 0.3B | 0.8B | 1.1B | πŸ€— link | πŸ€– link |
| InternVL3.5-2B-HF | 0.3B | 2.0B | 2.3B | πŸ€— link | πŸ€– link |
| InternVL3.5-4B-HF | 0.3B | 4.4B | 4.7B | πŸ€— link | πŸ€– link |
| InternVL3.5-8B-HF | 0.3B | 8.2B | 8.5B | πŸ€— link | πŸ€– link |
| InternVL3.5-14B-HF | 0.3B | 14.8B | 15.1B | πŸ€— link | πŸ€– link |
| InternVL3.5-38B-HF | 5.5B | 32.8B | 38.4B | πŸ€— link | πŸ€– link |
| InternVL3.5-20B-A4B-HF | 0.3B | 20.9B | 21.2B-A4B | πŸ€— link | πŸ€– link |
| InternVL3.5-30B-A3B-HF | 0.3B | 30.5B | 30.8B-A3B | πŸ€— link | πŸ€– link |
| InternVL3.5-241B-A28B-HF | 5.5B | 235.1B | 240.7B-A28B | πŸ€— link | πŸ€– link |

> We conduct the evaluation with VLMEvalKit.
To enable the Thinking mode of our model, please set the system prompt to R1_SYSTEM_PROMPT. When enabling Thinking mode, we recommend setting `do_sample=True` and `temperature=0.6` to mitigate undesired repetition. Our training pipeline comprises three core stages: Multimodal Continual Pre-Training (CPT), Supervised Fine-Tuning (SFT), and Cascade Reinforcement Learning (CascadeRL). In CascadeRL, we first fine-tune the model using Mixed Preference Optimization (MPO) under an offline RL setting, followed by GSPO under an online RL setting. For the Flash version of InternVL3.5, we additionally introduce a lightweight training stage, termed Visual Consistency Learning (ViCO), which reduces the token cost required to represent an image patch. Here, we also open-source the model weights after different training stages for potential research usage. If you're unsure which version to use, please select the one without any suffix, as it has completed the full training pipeline.

| Model | Training Pipeline | HF Link | ModelScope Link |
| -------------------------------- | --------------------- | ------- | --------------- |
| InternVL3.5-1B-Pretrained | CPT | πŸ€— link | πŸ€– link |
| InternVL3.5-1B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link |
| InternVL3.5-1B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link |
| InternVL3.5-1B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link |
| InternVL3.5-2B-Pretrained | CPT | πŸ€— link | πŸ€– link |
| InternVL3.5-2B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link |
| InternVL3.5-2B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link |
| InternVL3.5-2B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link |
| InternVL3.5-4B-Pretrained | CPT | πŸ€— link | πŸ€– link |
| InternVL3.5-4B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link |
| InternVL3.5-4B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link |
| InternVL3.5-4B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link |
| InternVL3.5-8B-Pretrained | CPT | πŸ€— link | πŸ€– link |
| InternVL3.5-8B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link |
| InternVL3.5-8B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link |
| InternVL3.5-8B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link |
| InternVL3.5-14B-Pretrained | CPT | πŸ€— link | πŸ€– link |
| InternVL3.5-14B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link |
| InternVL3.5-14B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link |
| InternVL3.5-14B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link |
| InternVL3.5-30B-A3B-Pretrained | CPT | πŸ€— link | πŸ€– link |
| InternVL3.5-30B-A3B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link |
| InternVL3.5-30B-A3B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link |
| InternVL3.5-30B-A3B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link |
| InternVL3.5-38B-Pretrained | CPT | πŸ€— link | πŸ€– link |
| InternVL3.5-38B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link |
| InternVL3.5-38B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link |
| InternVL3.5-38B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link |
| InternVL3.5-241B-A28B-Pretrained | CPT | πŸ€— link | πŸ€– link |
| InternVL3.5-241B-A28B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link |
| InternVL3.5-241B-A28B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link |
| InternVL3.5-241B-A28B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link |

The Flash version of our model will be released as soon as possible. `InternVL3.5`: This series of models follows the "ViT–MLP–LLM" paradigm adopted in previous versions of InternVL. We initialize the language model with the Qwen3 series and GPT-OSS, and the vision encoder with InternViT-300M and InternViT-6B. The Dynamic High Resolution strategy introduced in InternVL1.5 is also retained in our design. `InternVL3.5-Flash`: Compared to InternVL3.5, InternVL3.5-Flash further integrates the Visual Resolution Router (ViR), yielding a series of efficient variants suitable for resource-constrained scenarios.
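A shape-level sketch of how this kind of pixel-shuffle compression trades token count for channel width, assuming a 32Γ—32 grid of ViT tokens per patch; the function and grid layout are illustrative, not the released implementation:

```python
import numpy as np

def pixel_shuffle_compress(tokens: np.ndarray, grid: int, r: int) -> np.ndarray:
    """Merge each r x r neighborhood of visual tokens into one token.

    tokens: (grid*grid, C) token sequence laid out as a grid x grid map.
    Returns ((grid//r)**2, C*r*r) compressed tokens: r**2 fewer tokens,
    each carrying the concatenated channels of its neighborhood.
    """
    c = tokens.shape[1]
    x = tokens.reshape(grid, grid, c)
    x = x.reshape(grid // r, r, grid // r, r, c)
    x = x.transpose(0, 2, 1, 3, 4)            # group r x r neighborhoods
    return x.reshape((grid // r) ** 2, c * r * r)

# 1024 ViT tokens per patch (32 x 32 grid); channel dim kept small for the demo
vit_tokens = np.random.randn(1024, 8)
default_tokens = pixel_shuffle_compress(vit_tokens, grid=32, r=2)  # 256 tokens
flash_tokens = pixel_shuffle_compress(vit_tokens, grid=32, r=4)    # 64 tokens
```

A router choosing between `r=2` and `r=4` per patch is exactly a choice between the 256-token and 64-token representations.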
Specifically, in InternVL3.5, each image patch is initially represented as 1024 visual tokens for the vision encoder, which are then compressed into 256 tokens via a pixel shuffle module before being passed to the Large Language Model (LLM). In InternVL3.5-Flash, as shown in the Figure below, an additional pixel shuffle module with a higher compression rate is included, enabling the compression of visual tokens down to 64 tokens. For each patch, the patch router determines the appropriate compression rate by assessing its semantic richness, and routes it to the corresponding pixel shuffle module accordingly. Benefiting from this patch-aware compression mechanism, InternVL3.5-Flash is able to reduce the number of visual tokens by 50\% while maintaining nearly 100\% of the performance of InternVL3.5. During the pre-training stage, we update all model parameters jointly using a combination of large-scale text and multimodal corpora. Specifically, given an arbitrary training sample consisting of a multimodal token sequence \\(\mathbf{x}=\left(x_1, x_2, \ldots, x_L\right)\\), the next token prediction (NTP) loss is calculated on each text token as follows: $$ \mathcal{L}_{i}=-\log p_{\theta}\left(x_i \mid x_1, \ldots, x_{i-1}\right), $$ where \\(x_i\\) is the predicted token and the prefix tokens \\(\{x_1, x_2, \ldots, x_{i-1}\}\\) can be either text tokens or image tokens. Notably, for conversation samples, only response tokens are included in the calculation of the loss. Additionally, to mitigate bias toward either longer or shorter responses during training, we adopt square-root averaging to re-weight the NTP loss as follows: $$ \mathcal{L}_{i}^{'} = \frac{w_i}{\sum_j w_j} \cdot \mathcal{L}_i, \quad w_i = \frac{1}{N^{0.5}}, $$ where \\(N\\) denotes the number of tokens in the training sample on which the loss is calculated. Random JPEG compression is also applied to enhance the model's real-world performance.
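The square-root re-weighting above can be rendered as a toy batch computation (hypothetical helper; per-token losses supplied as plain lists):

```python
def sqrt_averaged_ntp_loss(samples):
    """Combine per-token NTP losses with square-root averaging.

    samples: list of per-sample lists of token losses L_i.  Every token in
    a sample with N loss-bearing tokens gets weight w_i = N ** -0.5, and
    the batch loss is sum_i (w_i / sum_j w_j) * L_i, so a sample
    contributes roughly in proportion to sqrt(N) rather than N.
    """
    weights, losses = [], []
    for sample in samples:
        w = len(sample) ** -0.5        # w_i = 1 / N**0.5, shared within a sample
        weights += [w] * len(sample)
        losses += list(sample)
    total = sum(weights)
    return sum(w / total * l for w, l in zip(weights, losses))
```

For a single sample the weights cancel and the result reduces to the plain mean of its token losses; the re-weighting only changes how samples of different lengths trade off against each other.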
During the SFT phase, we adopt the same objective as in the pre-training stage and use the square-root averaging strategy to calculate the final loss. In this stage, the context window is set to 32K tokens to accommodate long-context information. Compared to InternVL3, the SFT stage of InternVL3.5 contains more high-quality and diverse training data derived from three sources: (1) Instruction-following data from InternVL3, which are reused to preserve broad coverage of vision–language tasks. (2) Multimodal reasoning data in the "Thinking" mode, which are included to instill long-thinking capabilities in the model. To construct such data, we first use InternVL3-78B to describe the image and then feed the description into DeepSeek-R1 to sample rollouts with detailed reasoning processes. Rollouts with an incorrect final answer are filtered out. The questions in these datasets cover various expert domains, such as mathematics and scientific disciplines, thereby strengthening performance on different reasoning tasks. (3) Capability-expansion datasets, which endow InternVL3.5 with new skills, including GUI-based interaction, embodied interaction, and scalable vector graphics (SVG) understanding. Cascade RL aims to combine the benefits of offline RL and online RL to progressively facilitate the post-training of MLLMs in an efficient manner. Specifically, we first fine-tune the model with an offline RL algorithm as an efficient warm-up stage to reach satisfactory results, which guarantees high-quality rollouts for the subsequent stage. We then employ an online RL algorithm to further refine the output distribution based on rollouts generated by the model itself. Compared to a single offline or online RL stage, our cascaded RL achieves significant performance improvements at a fraction of the GPU time cost. During the offline RL stage, we employ mixed preference optimization (MPO) to fine-tune the model.
Specifically, the training objective of MPO is a combination of preference loss \\(\mathcal{L}_{p}\\), quality loss \\(\mathcal{L}_{q}\\), and generation loss \\(\mathcal{L}_{g}\\), which can be formulated as follows: $$ \mathcal{L}_{\text{MPO}} = w_{p}\, \mathcal{L}_{p} + w_{q}\, \mathcal{L}_{q} + w_{g}\, \mathcal{L}_{g}, $$ where \\(w_{*}\\) represents the weight assigned to each loss component. The DPO loss, BCO loss, and LM loss serve as the preference loss, quality loss, and generation loss, respectively. During the online RL stage, we employ GSPO, without reference model constraints, as our online RL algorithm, which we find more effective for training both dense and mixture-of-experts (MoE) models. Similar to GRPO, the advantage is defined as the normalized reward across responses sampled from the same query. The training objective of GSPO is given by: $$ \mathcal{L}_{\mathrm{GSPO}}(\theta)=\mathbb{E}_{x \sim \mathcal{D},\,\left\{y_i\right\}_{i=1}^{G} \sim \pi_{\theta_{\text{old}}}(\cdot \mid x)}\left[\frac{1}{G} \sum_{i=1}^{G} \min \left(s_i(\theta)\, \widehat{A}_i,\ \operatorname{clip}\left(s_i(\theta), 1-\varepsilon, 1+\varepsilon\right) \widehat{A}_i\right)\right], $$ where the importance sampling ratio \\(s_i(\theta)\\) is defined as the geometric mean of the per-token ratios. > Please see our paper for more technical and experimental details. We further include ViCO as an additional training stage to integrate the visual resolution router (ViR) into InternVL3.5, thereby reducing its inference cost. The resulting efficient version of InternVL3.5 is termed InternVL3.5-Flash. In particular, ViCO comprises two stages: `Consistency training`: In this stage, the entire model is trained to minimize the divergence between response distributions conditioned on visual tokens with different compression rates. In practice, we introduce an extra reference model, which is frozen and initialized with InternVL3.5.
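As a scalar sketch of the GRPO-style advantages and the GSPO objective above (hypothetical helper names; per-response token log-probs passed as plain lists):

```python
import math

def group_advantages(rewards):
    """Normalize rewards across the G responses sampled for one query."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + 1e-8) for r in rewards]

def gspo_objective(logp_new, logp_old, advantages, eps=0.2):
    """Clipped sequence-level objective.  The ratio s_i is the geometric
    mean of per-token probability ratios, i.e. exp of the mean per-token
    log-prob difference between the current and old policies."""
    terms = []
    for lp_n, lp_o, adv in zip(logp_new, logp_old, advantages):
        s = math.exp(sum(n - o for n, o in zip(lp_n, lp_o)) / len(lp_n))
        clipped = min(max(s, 1 - eps), 1 + eps)
        terms.append(min(s * adv, clipped * adv))
    return sum(terms) / len(terms)
```

When the current and old policies agree, every ratio is 1 and the objective reduces to the mean advantage; the clip bound limits how far a single update can push each sequence.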
Given a sample, each image patch is represented as either 256 or 64 tokens, and the training objective is defined as follows: $$ \mathcal{L}_{\text{ViCO}} = \mathbb{E}_{\xi \sim \mathcal{R}} \left[ \frac{1}{N} \sum_{i=1}^{N} \mathrm{KL}\Big( \pi_{\theta_{\text{ref}}}\big(y_i \mid y_{<i}, I\big) \,\big\|\, \pi_{\theta}\big(y_i \mid y_{<i}, I_{\xi}\big) \Big) \right], $$ where \\(\xi\\) denotes the compression rate sampled from \\(\mathcal{R}\\), \\(I\\) the uncompressed visual tokens seen by the frozen reference model, and \\(I_{\xi}\\) the visual tokens compressed at rate \\(\xi\\). `Router training`: In this stage, the visual resolution router is trained to select an appropriate compression rate for each patch. > Please see our paper for more technical and experimental details. Test-time scaling (TTS) has been empirically demonstrated as an effective approach to enhance the reasoning capabilities of LLMs and MLLMs, particularly for complex tasks necessitating multi-step inference. In this work, we implement a comprehensive test-time scaling approach that simultaneously improves reasoning depth (i.e., deep thinking) and breadth (i.e., parallel thinking). `Deep Thinking`: By activating the Thinking mode, we guide the model to deliberately engage in step-by-step reasoning (i.e., decomposing complex problems into logical steps and validating intermediate conclusions) before generating the final answer. This approach systematically improves the logical structure of solutions for complex problems, particularly those requiring multi-step inference, and enhances reasoning depth. `Parallel Thinking`: Following InternVL3, for reasoning tasks, we adopt the Best-of-N (BoN) strategy, employing VisualPRM-v1.1 as the critic model to select the optimal response from multiple reasoning candidates. This approach improves reasoning breadth. > Notably, unless otherwise specified, the experimental results reported in our paper are obtained without applying TTS. Thus far, we have applied TTS only to reasoning benchmarks, since we found that the model already exhibits strong perception and understanding capabilities, and initiating TTS yields no significant improvement. In multimodal inference, the vision encoder and language model have distinct computational characteristics. The vision encoder, which transforms images into semantic features, is highly parallelizable and does not rely on long-term history state.

license:apache-2.0
358
2

InternVL2_5-4B-MPO-AWQ

license:mit
357
5

InternVideo2-Stage2_6B

license:mit
357
1

InternVL2_5-26B-AWQ

license:mit
354
10

Mini-InternVL-Chat-4B-V1-5

license:mit
347
62

ViCLIP-L-14-hf

β€”
340
1

InternVL2-Llama3-76B

base_model:NousResearch/Hermes-2-Theta-Llama-3-70B
338
211

VideoChat2_HD_stage4_Mistral_7B_hf

license:mit
326
3

InternVideo2-Chat-8B

license:mit
310
23

VideoChat-R1-thinking_7B

license:apache-2.0
302
0

VisualPRM-8B

license:mit
293
17

InternVL2-40B

license:mit
290
93

pvt_v2_b2

license:apache-2.0
290
1

InternVL3-9B-Instruct

license:mit
282
4

Mini-InternVL2-4B-DA-DriveLM

license:mit
277
3

InternVL2_5-2B-MPO

license:mit
267
12

internimage_b_1k_224

license:mit
263
1

InternVL3_5-14B-Flash

license:apache-2.0
262
5

InternVL2_5-78B

β€”
246
192

InternVL3_5-30B-A3B-Instruct

[\[πŸ“‚ GitHub\]](https://github.com/OpenGVLab/InternVL) [\[πŸ“œ InternVL 1.0\]](https://huggingface.co/papers/2312.14238) [\[πŸ“œ InternVL 1.5\]](https://huggingface.co/papers/2404.16821) [\[πŸ“œ InternVL 2.5\]](https://huggingface.co/papers/2412.05271) [\[πŸ“œ InternVL2.5-MPO\]](https://huggingface.co/papers/2411.10442) [\[πŸ“œ InternVL3\]](https://huggingface.co/papers/2504.10479) [\[πŸ“œ InternVL3.5\]](https://huggingface.co/papers/2508.18265) [\[πŸ†• Blog\]](https://internvl.github.io/blog/) [\[πŸ—¨οΈ Chat Demo\]](https://chat.intern-ai.org.cn/) [\[πŸš€ Quick Start\]](#quick-start) [\[πŸ“– Documents\]](https://internvl.readthedocs.io/en/latest/) We introduce InternVL3.5, a new family of open-source multimodal models that significantly advances versatility, reasoning capability, and inference efficiency along the InternVL series. A key innovation is the Cascade Reinforcement Learning (Cascade RL) framework, which enhances reasoning through a two-stage process: offline RL for stable convergence and online RL for refined alignment. This coarse-to-fine training strategy leads to substantial improvements on downstream reasoning tasks, e.g., MMMU and MathVista. To optimize efficiency, we propose a Visual Resolution Router (ViR) that dynamically adjusts the resolution of visual tokens without compromising performance. Coupled with ViR, our Decoupled Vision-Language Deployment (DvD) strategy separates the vision encoder and language model across different GPUs, effectively balancing computational load. These contributions collectively enable InternVL3.5 to achieve up to a +16.0\% gain in overall reasoning performance and a 4.05 \\(\times\\) inference speedup compared to its predecessor, i.e., InternVL3. In addition, InternVL3.5 supports novel capabilities such as GUI interaction and embodied agency. 
Notably, our largest model, i.e., InternVL3.5-241B-A28B, attains state-of-the-art results among open-source MLLMs across general multimodal, reasoning, text, and agentic tasksβ€”narrowing the performance gap with leading commercial models like GPT-5. All models and code are publicly released. > Hatched bars represent closed-source commercial models. We report average scores on a set of multimodal general, reasoning, text, and agentic benchmarks: MMBench v1.1 (en), MMStar,BLINK, HallusionBench, AI2D, OCRBench, MMVet, MME-RealWorld (en), MVBench, VideoMME, MMMU, MathVista, MathVision, MathVerse, DynaMath, WeMath, LogicVista, MATH500, AIME24, AIME25, GPQA, MMLU-Pro, GAOKAO, IFEval, SGP-Bench, VSI-Bench, ERQA, SpaCE-10, and OmniSpatial. In the following table, we provide an overview of the InternVL3.5 series. To maintain consistency with earlier generations, we provide two model formats: the GitHub format, consistent with prior releases, and the HF format, aligned with the official Transformers standard. > If you want to convert the checkpoint between these two formats, please refer to the scripts about custom2hf and hf2custom. 
| Model | #Vision Param | #Language Param | #Total Param | HF Link | ModelScope Link | | --------------------- | ------------- | --------------- | ------------ | ------------------------------------------------------------------------------ | ---------------------------------------------------------------------------------------- | | InternVL3.5-1B | 0.3B | 0.8B | 1.1B | πŸ€— link | πŸ€– link | | InternVL3.5-2B | 0.3B | 2.0B | 2.3B | πŸ€— link | πŸ€– link | | InternVL3.5-4B | 0.3B | 4.4B | 4.7B | πŸ€— link | πŸ€– link | | InternVL3.5-8B | 0.3B | 8.2B | 8.5B | πŸ€— link | πŸ€– link | | InternVL3.5-14B | 0.3B | 14.8B | 15.1B | πŸ€— link | πŸ€– link | | InternVL3.5-38B | 5.5B | 32.8B | 38.4B | πŸ€— link | πŸ€– link | | InternVL3.5-20B-A4B | 0.3B | 20.9B | 21.2B-A4B | πŸ€— link | πŸ€– link | | InternVL3.5-30B-A3B | 0.3B | 30.5B | 30.8B-A3B | πŸ€— link | πŸ€– link | | InternVL3.5-241B-A28B | 5.5B | 235.1B | 240.7B-A28B | πŸ€— link | πŸ€– link | | Model | #Vision Param | #Language Param | #Total Param | HF Link | ModelScope Link | | ------------------------ | ------------- | --------------- | ------------ | --------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------- | | InternVL3.5-1B-HF | 0.3B | 0.8B | 1.1B | πŸ€— link | πŸ€– link | | InternVL3.5-2B-HF | 0.3B | 2.0B | 2.3B | πŸ€— link | πŸ€– link | | InternVL3.5-4B-HF | 0.3B | 4.4B | 4.7B | πŸ€— link | πŸ€– link | | InternVL3.5-8B-HF | 0.3B | 8.2B | 8.5B | πŸ€— link | πŸ€– link | | InternVL3.5-14B-HF | 0.3B | 14.8B | 15.1B | πŸ€— link | πŸ€– link | | InternVL3.5-38B-HF | 5.5B | 32.8B | 38.4B | πŸ€— link | πŸ€– link | | InternVL3.5-20B-A4B-HF | 0.3B | 20.9B | 21.2B-A4B | πŸ€— link | πŸ€– link | | InternVL3.5-30B-A3B-HF | 0.3B | 30.5B | 30.8B-A3B | πŸ€— link | πŸ€– link | | InternVL3.5-241B-A28B-HF | 5.5B | 235.1B | 240.7B-A28B | πŸ€— link | πŸ€– link | > We conduct the evaluation with VLMEvalkit. 
To enable the Thinking mode of our model, please set the system prompt to R1SYSTEMPROMPT. When enabling Thinking mode, we recommend setting `dosample=True` and `temperature=0.6` to mitigate undesired repetition. Our training pipeline comprises four stages: Multimodal Continual Pre-Training (CPT), Supervised Fine-Tuning (SFT), and Cascade Reinforcement Learning (CascadeRL). In CascadeRL, we first fine-tune the model using Mixed Preference Optimization (MPO) under an offline RL setting, followed by GSPO under an oneline RL setting. For the Flash version of InternVL3.5, we additionally introduce a lightweight training stage, termed Visual Consistency Learning (ViCO), which reduces the token cost required to represent an image patch. Here, we also open-source the model weights after different training stages for potential research usage. If you're unsure which version to use, please select the one without any suffix, as it has completed the full training pipeline. | Model | Training Pipeline | HF Link | ModelScope Link | | -------------------------------- | --------------------- | --------------------------------------------------------------------------- | ------------------------------------------------------------------------------------- | | InternVL3.5-1B-Pretrained | CPT | πŸ€— link | πŸ€– link | | InternVL3.5-1B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link | | InternVL3.5-1B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link | | InternVL3.5-1B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link | | InternVL3.5-2B-Pretrained | CPT | πŸ€— link | πŸ€– link | | InternVL3.5-2B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link | | InternVL3.5-2B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link | | InternVL3.5-2B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link | | InternVL3.5-4B-Pretrained | CPT | πŸ€— link | πŸ€– link | | InternVL3.5-4B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link | | InternVL3.5-4B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link | | InternVL3.5-4B | CPT + SFT + 
CascadeRL | πŸ€— link | πŸ€– link | | InternVL3.5-8B-Pretrained | CPT | πŸ€— link | πŸ€– link | | InternVL3.5-8B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link | | InternVL3.5-8B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link | | InternVL3.5-8B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link | | InternVL3.5-14B-Pretrained | CPT | πŸ€— link | πŸ€– link | | InternVL3.5-14B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link | | InternVL3.5-14B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link | | InternVL3.5-14B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link | | InternVL3.5-30B-A3B-Pretrained | CPT | πŸ€— link | πŸ€– link | | InternVL3.5-30B-A3B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link | | InternVL3.5-30B-A3B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link | | InternVL3.5-30B-A3B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link | | InternVL3.5-38B-Pretrained | CPT | πŸ€— link | πŸ€– link | | InternVL3.5-38B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link | | InternVL3.5-38B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link | | InternVL3.5-38B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link | | InternVL3.5-241B-A28B-Pretrained | CPT | πŸ€— link | πŸ€– link | | InternVL3.5-241B-A28B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link | | InternVL3.5-241B-A28B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link | | InternVL3.5-241B-A28B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link | The Flash version of our model will be released as soon as possible. `InternVL3.5`: This series of models follow the "ViT–MLP–LLM" paradigm adopted in previous versions of InternVL. We initialize the language model using the Qwen3 series and GPT-OSS, and the vision encoder using InternViT-300M and InternViT-6B. The Dynamic High Resolution strategy introduced in InternVL1.5 is also retained in our design. `InternVL3.5-Flash`: Compared to InternVL3.5, InternVL3.5-Flash further integrates the Visual Resolution Router (ViR), thus yielding a series of efficient variants friendly suitable for resource-constrained scenarios. 
Specifically, in InternVL3.5, each image patch is initially represented as 1024 visual tokens for the vision encoder, which are then compressed into 256 tokens via a pixel shuffle module before being passed to the Large Language Model (LLM). In InternVL3.5-Flash, as shown in the Figure below, an additional pixel shuffle module with a higher compression rate is included, enabling the compression of visual tokens down to 64 tokens. For each patch, the patch router determines the appropriate compression rate by assessing its semantic richness, and routes it to the corresponding pixel shuffle module accordingly. Benefiting from this patch-aware compression mechanism, InternVL3.5-Flash is able to reduce the number of visual tokens by 50\% while maintaining nearly 100\% of the performance of InternVL3.5. During the pre-training stage, we update all model parameters jointly using the combination of large-scale text and multimodal corpora. Specifically, given an arbitrary training sample consisting of a multimodal token sequence \\(\mathbf{x}=\left(x1, x2, \ldots, xL\right)\\), the next token prediction (NTP) loss is calculated on each text token as follows: $$ \mathcal{L}{i}=-\log p\theta\left(xi \mid x1, \ldots, x{i-1}\right), $$ where \\(xi\\) is the predicted token and prefix tokens in \\(\{x1, x2, \ldots, x{i-1}\}\\) can be either text tokens or image tokens. Notably, for conversation samples, only response tokens are included for the calculation of the loss. Additionally, to mitigate bias toward either longer or shorter responses during training, we adopt the square averaging to re-weight the NTP loss as follows: $$ \mathcal{L}{i}^{'} = \frac{wi}{\sumj wj} \cdot \mathcal{L}i, \quad wi = \frac{1}{N^{0.5}}, $$ where \\(N\\) denotes the number of tokens in the training sample on which the loss needs to be calculated. The random JPEG compression is also included to enhance the model's real-world performance. 
During the SFT phase, we adopt the same objective as in the pre-training stage and use the same square-root averaging strategy to calculate the final loss. In this stage, the context window is set to 32K tokens to accommodate long-context information. Compared to InternVL3, the SFT stage of InternVL3.5 contains more high-quality and diverse training data derived from three sources: (1) Instruction-following data from InternVL3, which are reused to preserve broad coverage of vision–language tasks. (2) Multimodal reasoning data in the "Thinking" mode, which are included to instill long-thinking capabilities in the model. To construct such data, we first use InternVL3-78B to describe the image and then feed the description into DeepSeek-R1 to sample rollouts with detailed reasoning processes; rollouts with an incorrect final answer are filtered out. The questions in these datasets cover various expert domains, such as mathematics and scientific disciplines, thereby strengthening performance on different reasoning tasks. (3) Capability-expansion datasets, which endow InternVL3.5 with new skills, including GUI-based interaction, embodied interaction, and scalable vector graphics (SVG) understanding.

Cascade RL aims to combine the benefits of offline RL and online RL to progressively facilitate the post-training of MLLMs in an efficient manner. Specifically, we first fine-tune the model with an offline RL algorithm as an efficient warm-up stage to reach satisfactory results, which guarantees high-quality rollouts for the subsequent stage. We then employ an online RL algorithm to further refine the output distribution based on rollouts generated by the model itself. Compared to a single offline or online RL stage, our cascaded RL achieves significant performance improvements at a fraction of the GPU time cost. During the offline RL stage, we employ mixed preference optimization (MPO) to fine-tune the model.
Specifically, the training objective of MPO is a combination of preference loss \\(\mathcal{L}_{p}\\), quality loss \\(\mathcal{L}_{q}\\), and generation loss \\(\mathcal{L}_{g}\\), which can be formulated as follows: $$ \mathcal{L}_{\text{MPO}}= w_{p} \mathcal{L}_{p} + w_{q} \mathcal{L}_{q} + w_{g} \mathcal{L}_{g}, $$ where \\(w_{*}\\) represents the weight assigned to each loss component. The DPO loss, BCO loss, and LM loss serve as the preference loss, quality loss, and generation loss, respectively.

During the online RL stage, we employ GSPO, without reference model constraints, as our online RL algorithm, which we find more effective for training both dense and mixture-of-experts (MoE) models. Similar to GRPO, the advantage is defined as the normalized reward across responses sampled from the same query. The training objective of GSPO is given by: $$ \mathcal{L}_{\mathrm{GSPO}}(\theta)=\mathbb{E}_{x \sim \mathcal{D},\,\left\{y_i\right\}_{i=1}^G \sim \pi_{\theta_{\text{old}}}(\cdot \mid x)}\left[\frac{1}{G} \sum_{i=1}^G \min \left(s_i(\theta) \widehat{A}_i, \operatorname{clip}\left(s_i(\theta), 1-\varepsilon, 1+\varepsilon\right) \widehat{A}_i\right)\right], $$ where the importance sampling ratio \\(s_i(\theta)\\) is defined as the geometric mean of the per-token ratios.

> Please see our paper for more technical and experimental details.

We further include ViCO as an additional training stage to integrate the visual resolution router (ViR) into InternVL3.5, thereby reducing its inference cost. The resulting efficient versions of InternVL3.5 are termed InternVL3.5-Flash. In particular, ViCO comprises two stages:

`Consistency training`: In this stage, the entire model is trained to minimize the divergence between response distributions conditioned on visual tokens with different compression rates. In practice, we introduce an extra reference model, which is frozen and initialized with InternVL3.5.
Given a sample, each image patch is represented as either 256 or 64 tokens, and the training objective is defined as follows: $$ \mathcal{L}_{\text{ViCO}} = \mathbb{E}_{\xi \sim \mathcal{R}} \left[ \frac{1}{N} \sum_{i=1}^{N} \mathrm{KL} \Big( \pi_{\theta_{\text{ref}}}\left(y_i \mid y_{<i}\right) \,\Big\|\, \pi_{\theta}\left(y_i \mid y_{<i}, \xi\right) \Big) \right], $$ where \\(\xi\\) denotes the compression rate sampled from the router \\(\mathcal{R}\\), and \\(\pi_{\theta_{\text{ref}}}\\) is the frozen reference model conditioned on uncompressed visual tokens.

`Router training`: In this stage, only the ViR is trained to select the appropriate compression rate for each patch, while the rest of the model is kept frozen.

> Please see our paper for more technical and experimental details.

Test-time scaling (TTS) has been empirically demonstrated as an effective approach to enhance the reasoning capabilities of LLMs and MLLMs, particularly for complex tasks necessitating multi-step inference. In this work, we implement a comprehensive test-time scaling approach that simultaneously improves reasoning depth (i.e., deep thinking) and breadth (i.e., parallel thinking).

`Deep Thinking`: By activating the Thinking mode, we guide the model to deliberately engage in step-by-step reasoning (i.e., decomposing complex problems into logical steps and validating intermediate conclusions) before generating the final answer. This approach systematically improves the logical structure of solutions for complex problems, particularly those requiring multi-step inference, and enhances reasoning depth.

`Parallel Thinking`: Following InternVL3, for reasoning tasks, we adopt the Best-of-N (BoN) strategy, employing VisualPRM-v1.1 as the critic model to select the optimal response from multiple reasoning candidates. This approach improves reasoning breadth.

> Notably, unless otherwise specified, the experimental results reported in our paper are obtained without applying TTS. Thus far, we have only applied TTS to reasoning benchmarks, since we found that the model already exhibits strong perception and understanding capabilities, and applying TTS yields no significant improvement.

In multimodal inference, the vision encoder and language model have distinct computational characteristics. The vision encoder, which transforms images into semantic features, is highly parallelizable and does not rely on long-term history states.
In contrast, the language model performs inference in an autoregressive manner, which requires previous states to compute the next one. This sequential property makes the language part more sensitive to memory bandwidth and latency. When MLLMs are deployed online at scale, the vision and language models often block each other, incurring additional inference cost. This effect becomes more pronounced with larger vision models or higher-resolution images.

As shown in the Figure above, we propose decoupled vision-language deployment (DvD) to address this issue by separating vision and language processing, with a particular focus on optimizing the prefilling stage. The vision subsystem batches and processes images to produce compact feature embeddings, which are then transmitted to the language subsystem for fusion with the text context prior to decoding. This separation alleviates blocking and brings multimodal prefilling performance closer to that of pure language models.

In our system implementation, the ViT and MLP (and ViR for InternVL3.5-Flash) are deployed on the vision server, while the language server executes only the LLM. Communication is unidirectional, transmitting BF16 visual features over TCP, with RDMA optionally employed for higher transmission speed. Vision processing, feature transmission, and language processing are organized into an asynchronous three-stage pipeline, enabling overlapped execution and minimizing pipeline stalls.

DvD increases GPU utilization and processing efficiency on the vision side, while enabling the language server to focus exclusively on the LLM’s prefilling and decoding without being blocked by vision computation. This design leads to improved throughput and responsiveness. Moreover, the architecture supports independent hardware cost optimization for the vision and language modules, and facilitates the seamless integration of new modules without requiring modifications to the language server deployment.
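The overlapped pipeline behind DvD can be illustrated with a toy `asyncio` sketch (all names are made up; the stage bodies are stand-ins for the real vision server and language server, and the queue stands in for the feature-transmission stage):

```python
import asyncio

async def dvd_pipeline(requests):
    """Toy decoupled vision-language deployment: vision encoding and LLM
    prefill/decode run as separate workers connected by a queue, so the
    stages overlap instead of blocking each other."""
    feat_q, results = asyncio.Queue(), []

    async def vision_server():
        for req in requests:
            await asyncio.sleep(0)                   # stand-in for ViT + MLP (+ ViR)
            await feat_q.put((req, f"feat({req})"))  # "transmit" the visual features
        await feat_q.put(None)                       # signal end of stream

    async def language_server():
        while (item := await feat_q.get()) is not None:
            req, feat = item
            await asyncio.sleep(0)                   # stand-in for LLM prefill + decode
            results.append((req, feat))

    await asyncio.gather(vision_server(), language_server())
    return results

# e.g. asyncio.run(dvd_pipeline(["img0", "img1"]))
```

Because the two workers only synchronize through the queue, the vision stage can already encode the next image while the language stage is still prefilling the previous one, which is the blocking-avoidance property the deployment relies on.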
Multi-Image Understanding & Real-World Comprehension

Comprehensive Multimodal Understanding & Multimodal Hallucination Evaluation

We provide example code to run `InternVL3.5-8B` using `transformers`. Please note that our models with up to 30B parameters can be deployed on a single A100 GPU, while the 38B model requires two A100 GPUs and the 235B model requires eight A100 GPUs.

> In most cases, both LMDeploy and vLLM can be used for model deployment. However, for InternVL3.5-20B-A4B, we recommend using vLLM, since LMDeploy does not yet support GPT-OSS.

> Please use transformers>=4.52.1 to ensure the model works normally. For the 20B version of our model, transformers>=4.55.0 is required.

To enable Thinking mode, please set the system prompt to our Thinking System Prompt. When enabling Thinking mode, we recommend setting `do_sample=True` and `temperature=0.6` to mitigate undesired repetition. Besides this method, you can also use the following code to get streamed output.

Many repositories now support fine-tuning of the InternVL series models, including InternVL, SWIFT, XTuner, and others. Please refer to their documentation for more details on fine-tuning.

LMDeploy is a toolkit for compressing, deploying, and serving LLMs & VLMs. It abstracts the complex inference process of multimodal Vision-Language Models (VLMs) into an easy-to-use pipeline, similar to the Large Language Model (LLM) inference pipeline.

When dealing with multiple images, you can put them all in one list. Keep in mind that multiple images will lead to a higher number of input tokens, so the size of the context window typically needs to be increased. Conducting inference with batch prompts is quite straightforward; just place them within a list structure.

There are two ways to conduct multi-turn conversations with the pipeline. One is to construct messages according to the OpenAI format and use the method introduced above; the other is to use the `pipeline.chat` interface.
LMDeploy's `api_server` enables models to be easily packed into services with a single command. The provided RESTful APIs are compatible with OpenAI's interfaces. Below is an example of service startup. To use the OpenAI-style interface, you need to install the OpenAI SDK.

This project is released under the Apache-2.0 License. This project uses the pre-trained Qwen3 as a component, which is licensed under the Apache-2.0 License. If you find this project useful in your research, please consider citing:

license:apache-2.0
235
10

InternVL3-78B-AWQ

β€”
221
10

ScaleCUA-7B

license:apache-2.0
203
8

ASMv2

license:apache-2.0
197
16

InternVL2_5-8B-AWQ

license:mit
195
7

InternVL3_5-14B-MPO


license:apache-2.0
189
3

internimage_s_1k_224

license:mit
187
1

InternVL3_5-14B-Pretrained

[\[πŸ“‚ GitHub\]](https://github.com/OpenGVLab/InternVL) [\[πŸ“œ InternVL 1.0\]](https://huggingface.co/papers/2312.14238) [\[πŸ“œ InternVL 1.5\]](https://huggingface.co/papers/2404.16821) [\[πŸ“œ InternVL 2.5\]](https://huggingface.co/papers/2412.05271) [\[πŸ“œ InternVL2.5-MPO\]](https://huggingface.co/papers/2411.10442) [\[πŸ“œ InternVL3\]](https://huggingface.co/papers/2504.10479) [\[πŸ“œ InternVL3.5\]](https://huggingface.co/papers/2508.18265) [\[πŸ†• Blog\]](https://internvl.github.io/blog/) [\[πŸ—¨οΈ Chat Demo\]](https://chat.intern-ai.org.cn/) [\[πŸš€ Quick Start\]](#quick-start) [\[πŸ“– Documents\]](https://internvl.readthedocs.io/en/latest/) We introduce InternVL3.5, a new family of open-source multimodal models that significantly advances versatility, reasoning capability, and inference efficiency along the InternVL series. A key innovation is the Cascade Reinforcement Learning (Cascade RL) framework, which enhances reasoning through a two-stage process: offline RL for stable convergence and online RL for refined alignment. This coarse-to-fine training strategy leads to substantial improvements on downstream reasoning tasks, e.g., MMMU and MathVista. To optimize efficiency, we propose a Visual Resolution Router (ViR) that dynamically adjusts the resolution of visual tokens without compromising performance. Coupled with ViR, our Decoupled Vision-Language Deployment (DvD) strategy separates the vision encoder and language model across different GPUs, effectively balancing computational load. These contributions collectively enable InternVL3.5 to achieve up to a +16.0\% gain in overall reasoning performance and a 4.05 \\(\times\\) inference speedup compared to its predecessor, i.e., InternVL3. In addition, InternVL3.5 supports novel capabilities such as GUI interaction and embodied agency. 
Notably, our largest model, i.e., InternVL3.5-241B-A28B, attains state-of-the-art results among open-source MLLMs across general multimodal, reasoning, text, and agentic tasks, narrowing the performance gap with leading commercial models like GPT-5. All models and code are publicly released.

> Hatched bars represent closed-source commercial models. We report average scores on a set of multimodal general, reasoning, text, and agentic benchmarks: MMBench v1.1 (en), MMStar, BLINK, HallusionBench, AI2D, OCRBench, MMVet, MME-RealWorld (en), MVBench, VideoMME, MMMU, MathVista, MathVision, MathVerse, DynaMath, WeMath, LogicVista, MATH500, AIME24, AIME25, GPQA, MMLU-Pro, GAOKAO, IFEval, SGP-Bench, VSI-Bench, ERQA, SpaCE-10, and OmniSpatial.

In the following table, we provide an overview of the InternVL3.5 series. To maintain consistency with earlier generations, we provide two model formats: the GitHub format, consistent with prior releases, and the HF format, aligned with the official Transformers standard.

> If you want to convert a checkpoint between these two formats, please refer to the custom2hf and hf2custom scripts.
| Model | #Vision Param | #Language Param | #Total Param | HF Link | ModelScope Link |
| --------------------- | ------------- | --------------- | ------------ | ------- | --------------- |
| InternVL3.5-1B | 0.3B | 0.8B | 1.1B | πŸ€— link | πŸ€– link |
| InternVL3.5-2B | 0.3B | 2.0B | 2.3B | πŸ€— link | πŸ€– link |
| InternVL3.5-4B | 0.3B | 4.4B | 4.7B | πŸ€— link | πŸ€– link |
| InternVL3.5-8B | 0.3B | 8.2B | 8.5B | πŸ€— link | πŸ€– link |
| InternVL3.5-14B | 0.3B | 14.8B | 15.1B | πŸ€— link | πŸ€– link |
| InternVL3.5-38B | 5.5B | 32.8B | 38.4B | πŸ€— link | πŸ€– link |
| InternVL3.5-20B-A4B | 0.3B | 20.9B | 21.2B-A4B | πŸ€— link | πŸ€– link |
| InternVL3.5-30B-A3B | 0.3B | 30.5B | 30.8B-A3B | πŸ€— link | πŸ€– link |
| InternVL3.5-241B-A28B | 5.5B | 235.1B | 240.7B-A28B | πŸ€— link | πŸ€– link |

| Model | #Vision Param | #Language Param | #Total Param | HF Link | ModelScope Link |
| ------------------------ | ------------- | --------------- | ------------ | ------- | --------------- |
| InternVL3.5-1B-HF | 0.3B | 0.8B | 1.1B | πŸ€— link | πŸ€– link |
| InternVL3.5-2B-HF | 0.3B | 2.0B | 2.3B | πŸ€— link | πŸ€– link |
| InternVL3.5-4B-HF | 0.3B | 4.4B | 4.7B | πŸ€— link | πŸ€– link |
| InternVL3.5-8B-HF | 0.3B | 8.2B | 8.5B | πŸ€— link | πŸ€– link |
| InternVL3.5-14B-HF | 0.3B | 14.8B | 15.1B | πŸ€— link | πŸ€– link |
| InternVL3.5-38B-HF | 5.5B | 32.8B | 38.4B | πŸ€— link | πŸ€– link |
| InternVL3.5-20B-A4B-HF | 0.3B | 20.9B | 21.2B-A4B | πŸ€— link | πŸ€– link |
| InternVL3.5-30B-A3B-HF | 0.3B | 30.5B | 30.8B-A3B | πŸ€— link | πŸ€– link |
| InternVL3.5-241B-A28B-HF | 5.5B | 235.1B | 240.7B-A28B | πŸ€— link | πŸ€– link |

> We conduct the evaluation with VLMEvalKit.
To enable the Thinking mode of our model, please set the system prompt to `R1_SYSTEM_PROMPT`. When enabling Thinking mode, we recommend setting `do_sample=True` and `temperature=0.6` to mitigate undesired repetition. Our training pipeline comprises four stages: Multimodal Continual Pre-Training (CPT), Supervised Fine-Tuning (SFT), and Cascade Reinforcement Learning (CascadeRL), which itself consists of two stages. In CascadeRL, we first fine-tune the model using Mixed Preference Optimization (MPO) in an offline RL setting, followed by GSPO in an online RL setting. For the Flash version of InternVL3.5, we additionally introduce a lightweight training stage, termed Visual Consistency Learning (ViCO), which reduces the token cost required to represent an image patch. Here, we also open-source the model weights after different training stages for potential research usage. If you're unsure which version to use, please select the one without any suffix, as it has completed the full training pipeline.

| Model | Training Pipeline | HF Link | ModelScope Link |
| -------------------------------- | --------------------- | ------- | --------------- |
| InternVL3.5-1B-Pretrained | CPT | πŸ€— link | πŸ€– link |
| InternVL3.5-1B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link |
| InternVL3.5-1B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link |
| InternVL3.5-1B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link |
| InternVL3.5-2B-Pretrained | CPT | πŸ€— link | πŸ€– link |
| InternVL3.5-2B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link |
| InternVL3.5-2B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link |
| InternVL3.5-2B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link |
| InternVL3.5-4B-Pretrained | CPT | πŸ€— link | πŸ€– link |
| InternVL3.5-4B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link |
| InternVL3.5-4B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link |
| InternVL3.5-4B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link |
| InternVL3.5-8B-Pretrained | CPT | πŸ€— link | πŸ€– link |
| InternVL3.5-8B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link |
| InternVL3.5-8B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link |
| InternVL3.5-8B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link |
| InternVL3.5-14B-Pretrained | CPT | πŸ€— link | πŸ€– link |
| InternVL3.5-14B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link |
| InternVL3.5-14B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link |
| InternVL3.5-14B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link |
| InternVL3.5-30B-A3B-Pretrained | CPT | πŸ€— link | πŸ€– link |
| InternVL3.5-30B-A3B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link |
| InternVL3.5-30B-A3B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link |
| InternVL3.5-30B-A3B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link |
| InternVL3.5-38B-Pretrained | CPT | πŸ€— link | πŸ€– link |
| InternVL3.5-38B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link |
| InternVL3.5-38B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link |
| InternVL3.5-38B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link |
| InternVL3.5-241B-A28B-Pretrained | CPT | πŸ€— link | πŸ€– link |
| InternVL3.5-241B-A28B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link |
| InternVL3.5-241B-A28B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link |
| InternVL3.5-241B-A28B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link |

The Flash version of our model will be released as soon as possible. `InternVL3.5`: This series of models follows the "ViT–MLP–LLM" paradigm adopted in previous versions of InternVL. We initialize the language model with the Qwen3 series and GPT-OSS, and the vision encoder with InternViT-300M and InternViT-6B. The Dynamic High Resolution strategy introduced in InternVL1.5 is also retained in our design. `InternVL3.5-Flash`: Compared to InternVL3.5, InternVL3.5-Flash further integrates the Visual Resolution Router (ViR), yielding a series of efficient variants suitable for resource-constrained scenarios.
Specifically, in InternVL3.5, each image patch is initially represented as 1024 visual tokens for the vision encoder, which are then compressed into 256 tokens via a pixel shuffle module before being passed to the Large Language Model (LLM). In InternVL3.5-Flash, as shown in the Figure below, an additional pixel shuffle module with a higher compression rate is included, enabling the compression of visual tokens down to 64 tokens. For each patch, the patch router determines the appropriate compression rate by assessing its semantic richness, and routes it to the corresponding pixel shuffle module accordingly. Benefiting from this patch-aware compression mechanism, InternVL3.5-Flash is able to reduce the number of visual tokens by 50\% while maintaining nearly 100\% of the performance of InternVL3.5. During the pre-training stage, we update all model parameters jointly using a combination of large-scale text and multimodal corpora. Specifically, given an arbitrary training sample consisting of a multimodal token sequence \\(\mathbf{x}=\left(x_1, x_2, \ldots, x_L\right)\\), the next token prediction (NTP) loss is calculated on each text token as follows:

$$ \mathcal{L}_{i}=-\log p_{\theta}\left(x_i \mid x_1, \ldots, x_{i-1}\right), $$

where \\(x_i\\) is the predicted token and the prefix tokens \\(\{x_1, x_2, \ldots, x_{i-1}\}\\) can be either text tokens or image tokens. Notably, for conversation samples, only response tokens are included in the calculation of the loss. Additionally, to mitigate bias toward either longer or shorter responses during training, we adopt square-root averaging to re-weight the NTP loss as follows:

$$ \mathcal{L}_{i}^{'} = \frac{w_i}{\sum_j w_j} \cdot \mathcal{L}_i, \quad w_i = \frac{1}{N^{0.5}}, $$

where \\(N\\) denotes the number of tokens in the training sample on which the loss needs to be calculated. Random JPEG compression is also included to enhance the model's real-world performance.
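The square-root re-weighting can be checked numerically. Below is a minimal NumPy sketch; the per-token losses are made-up toy numbers and `reweight_ntp_losses` is an illustrative helper, not part of the released code:

```python
import numpy as np

def reweight_ntp_losses(samples):
    """Re-weight per-token NTP losses with w_i = 1 / N**0.5, where N is
    the number of loss tokens in the sample a token belongs to.
    `samples` is a list of 1-D arrays of per-token losses (toy values)."""
    losses = np.concatenate(samples)
    # Every token in a sample of N loss tokens receives the same weight.
    weights = np.concatenate(
        [np.full(len(s), 1.0 / len(s) ** 0.5) for s in samples])
    # L'_i = (w_i / sum_j w_j) * L_i, summed over all tokens in the batch.
    return np.sum(weights / weights.sum() * losses)

short = np.array([2.0, 1.0])                      # N = 2 response tokens
long = np.array([2.0, 1.0, 2.0, 1.0, 2.0, 1.0])   # N = 6 response tokens

# A sample's total weight grows as sqrt(N) rather than N, so long
# responses dominate the batch loss less than under plain summation.
loss = reweight_ntp_losses([short, long])
```

Because both toy samples share the same mean per-token loss, the re-weighted batch loss equals that mean, while the long sample's total weight exceeds the short one's only by a factor of \\(\sqrt{3}\\) instead of 3.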
During the SFT phase, we adopt the same objective as in the pre-training stage and use the same square-root averaging strategy to calculate the final loss. In this stage, the context window is set to 32K tokens to accommodate long-context information. Compared to InternVL3, the SFT stage of InternVL3.5 contains more high-quality and diverse training data derived from three sources: (1) Instruction-following data from InternVL3, which are reused to preserve broad coverage of vision–language tasks. (2) Multimodal reasoning data in the "Thinking" mode, which are included to instill long-thinking capabilities in the model. To construct such data, we first use InternVL3-78B to describe the image and then input the description into DeepSeek-R1 to sample rollouts with detailed reasoning processes. Rollouts with an incorrect final answer are filtered out. The questions in these datasets cover various expert domains, such as mathematics and scientific disciplines, thereby strengthening performance on different reasoning tasks. (3) Capability-expansion datasets, which endow InternVL3.5 with new skills, including GUI-based interaction, embodied interaction, and scalable vector graphics (SVG) understanding. Cascade RL aims to combine the benefits of offline RL and online RL to progressively facilitate the post-training of MLLMs in an efficient manner. Specifically, we first fine-tune the model with an offline RL algorithm as an efficient warm-up stage to reach satisfactory results, which guarantees high-quality rollouts for the subsequent stage. Subsequently, we employ an online RL algorithm to further refine the output distribution based on rollouts generated by the model itself. Compared to a single offline or online RL stage, our cascaded RL achieves significant performance improvements at a fraction of the GPU time cost. During the offline RL stage, we employ mixed preference optimization (MPO) to fine-tune the model.
Specifically, the training objective of MPO is a combination of preference loss \\(\mathcal{L}_{p}\\), quality loss \\(\mathcal{L}_{q}\\), and generation loss \\(\mathcal{L}_{g}\\), which can be formulated as follows:

$$ \mathcal{L}_{\text{MPO}} = w_{p} \mathcal{L}_{p} + w_{q} \mathcal{L}_{q} + w_{g} \mathcal{L}_{g}, $$

where \\(w_{*}\\) represents the weight assigned to each loss component. The DPO loss, BCO loss, and LM loss serve as the preference loss, quality loss, and generation loss, respectively. During the online RL stage, we employ GSPO, without reference model constraints, as our online RL algorithm, which we find more effective for training both dense and mixture-of-experts (MoE) models. Similar to GRPO, the advantage is defined as the normalized reward across responses sampled from the same query. The training objective of GSPO is given by:

$$ \mathcal{L}_{\mathrm{GSPO}}(\theta)=\mathbb{E}_{x \sim \mathcal{D},\,\left\{y_i\right\}_{i=1}^{G} \sim \pi_{\theta_{\text{old}}}(\cdot \mid x)}\left[\frac{1}{G} \sum_{i=1}^{G} \min \left(s_i(\theta) \widehat{A}_i, \operatorname{clip}\left(s_i(\theta), 1-\varepsilon, 1+\varepsilon\right) \widehat{A}_i\right)\right], $$

where the importance sampling ratio \\(s_i(\theta)\\) is defined as the geometric mean of the per-token ratios.

> Please see our paper for more technical and experimental details.

We further include ViCO as an additional training stage to integrate the visual resolution router (ViR) into InternVL3.5, thereby reducing its inference cost. The resulting efficient version of InternVL3.5 is termed InternVL3.5-Flash. In particular, ViCO comprises two stages: `Consistency training`: In this stage, the entire model is trained to minimize the divergence between response distributions conditioned on visual tokens with different compression rates. In practice, we introduce an extra reference model, which is frozen and initialized with InternVL3.5.
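For illustration, the GSPO objective above can be sketched in NumPy. The helper `gspo_objective` is a toy, single-query version written from the stated definitions (group-normalized advantages as in GRPO, and a sequence-level ratio equal to the geometric mean of per-token ratios); it is not the released training code:

```python
import numpy as np

def gspo_objective(logps_new, logps_old, rewards, eps=0.2):
    """Clipped GSPO objective for G sampled responses to one query.
    logps_new / logps_old: per-token log-probs under the current and
    old policies (one array per response); rewards: scalar rewards."""
    rewards = np.asarray(rewards, dtype=float)
    # Advantage: reward normalized across the group of G responses.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    terms = []
    for lp_new, lp_old, a in zip(logps_new, logps_old, adv):
        # Sequence-level ratio s_i(theta): geometric mean of per-token
        # ratios, i.e. exp of the mean per-token log-ratio.
        s = np.exp(np.mean(lp_new - lp_old))
        terms.append(min(s * a, np.clip(s, 1 - eps, 1 + eps) * a))
    return np.mean(terms)  # this objective is maximized during training

# Two toy responses: with identical old/new policies every s_i = 1,
# so the objective reduces to the mean normalized advantage (zero).
logps = [np.array([-0.5, -1.0]), np.array([-0.7, -0.2, -0.9])]
obj = gspo_objective(logps, logps, rewards=[1.0, 0.0])
```

When a response's ratio drifts outside \\([1-\varepsilon, 1+\varepsilon]\\), the `min` with the clipped term caps its contribution, exactly as in the PPO-style surrogate above.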
Given a sample, each image patch is represented as either 256 or 64 tokens, and the training objective is defined as follows:

$$ \mathcal{L}_{\text{ViCO}} = \mathbb{E}_{\xi \sim \mathcal{R}} \left[ \frac{1}{N} \sum_{i=1}^{N} \mathrm{KL} \Big( \pi_{\theta_{\text{ref}}}\left(y_i \mid y_{<i}\right) \,\Big\|\, \pi_{\theta}\left(y_i \mid y_{<i}\right) \Big) \right], $$

where \\(\xi\\) denotes the compression rate sampled from \\(\mathcal{R}\\): the frozen reference policy \\(\pi_{\theta_{\text{ref}}}\\) is conditioned on uncompressed visual tokens, while the policy \\(\pi_{\theta}\\) is conditioned on visual tokens compressed at rate \\(\xi\\). `Router training`: In this stage, only the ViR is trained to select an appropriate compression rate for each image patch based on its semantic richness, while the remaining model parameters are kept frozen.

> Please see our paper for more technical and experimental details.

Test-time scaling (TTS) has been empirically demonstrated as an effective approach to enhance the reasoning capabilities of LLMs and MLLMs, particularly for complex tasks necessitating multi-step inference. In this work, we implement a comprehensive test-time scaling approach that simultaneously improves reasoning depth (i.e., deep thinking) and breadth (i.e., parallel thinking). `Deep Thinking`: By activating the Thinking mode, we guide the model to deliberately engage in step-by-step reasoning (i.e., decomposing complex problems into logical steps and validating intermediate conclusions) prior to generating the final answer. This approach systematically improves the logical structure of solutions for complex problems, particularly those requiring multi-step inference, and enhances reasoning depth. `Parallel Thinking`: Following InternVL3, for reasoning tasks, we adopt the Best-of-N (BoN) strategy by employing VisualPRM-v1.1 as the critic model to select the optimal response from multiple reasoning candidates. This approach improves reasoning breadth.

> Notably, unless otherwise specified, the experimental results reported in our paper are obtained without applying TTS. Thus far, we have only applied TTS to reasoning benchmarks, since we found that the model already exhibits strong perception and understanding capabilities, and applying TTS there yields no significant improvement.

In multimodal inference, the vision encoder and language model have distinct computational characteristics. The vision encoder, which transforms images into semantic features, is highly parallelizable and does not rely on long-term history states.
In contrast, the language model performs inference in an autoregressive manner, which requires previous states to compute the next one. This sequential property makes the language side more sensitive to memory bandwidth and latency. When MLLMs are deployed online at scale, the vision and language models often block each other, incurring additional inference cost. This effect becomes more pronounced with larger vision models or higher-resolution images. As shown in the Figure above, we propose decoupled vision-language deployment (DvD) to address this issue by separating vision and language processing, with a particular focus on optimizing the prefilling stage. The vision subsystem batches and processes images to produce compact feature embeddings, which are then transmitted to the language subsystem for fusion with the text context prior to decoding. This separation alleviates blocking and brings multimodal prefilling performance closer to that of pure language models. In our system implementation, the ViT and MLP (and ViR for InternVL3.5-Flash) are deployed on the vision server, while the language server executes only the LLM. The communication is unidirectional, transmitting BF16 visual features over TCP, with RDMA optionally employed to achieve higher transmission speed. Vision processing, feature transmission, and language processing are organized into an asynchronous three-stage pipeline, enabling overlapped execution and minimizing pipeline stalls. DvD increases GPU utilization and processing efficiency on the vision side, while enabling the language server to focus exclusively on the LLM's prefilling and decoding without being blocked by vision computation. This design leads to improved throughput and responsiveness. Moreover, the architecture supports independent hardware cost optimization for the vision and language modules, and facilitates the seamless integration of new modules without requiring modifications to the language server deployment.
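The asynchronous three-stage pipeline can be mimicked with a toy producer–consumer sketch. The stage functions and timings below are illustrative stand-ins, not the actual DvD implementation; they only show how one worker per stage plus FIFO queues yields overlapped, order-preserving execution:

```python
import queue
import threading
import time

def run_stage(fn, inbox, outbox):
    # Pull items until the poison pill (None), forwarding results downstream.
    while (item := inbox.get()) is not None:
        outbox.put(fn(item))
    outbox.put(None)

# Illustrative stand-ins for the three DvD stages.
def encode(img):    time.sleep(0.01); return f"feat({img})"  # vision server (ViT+MLP)
def transmit(feat): time.sleep(0.01); return feat            # BF16 features over TCP/RDMA
def prefill(feat):  time.sleep(0.01); return f"ctx({feat})"  # language server prefill

q_in, q_feat, q_sent, q_out = (queue.Queue() for _ in range(4))
for fn, i, o in [(encode, q_in, q_feat),
                 (transmit, q_feat, q_sent),
                 (prefill, q_sent, q_out)]:
    threading.Thread(target=run_stage, args=(fn, i, o), daemon=True).start()

for i in range(4):
    q_in.put(f"img{i}")
q_in.put(None)

results = []
while (r := q_out.get()) is not None:
    results.append(r)
# With overlap, total wall time approaches (num_requests + 2) * stage_time
# instead of num_requests * 3 * stage_time for serial execution.
```

While request *n* is being prefetched into the language context, request *n+1* can already be encoded on the vision side, which is the blocking-avoidance property DvD exploits.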
Multi-Image Understanding & Real-World Comprehension. Comprehensive Multimodal Understanding & Multimodal Hallucination Evaluation.

We provide example code to run `InternVL3.5-8B` using `transformers`. Please note that our models with up to 30B parameters can be deployed on a single A100 GPU, while the 38B model requires two A100 GPUs and the 235B model requires eight A100 GPUs.

> In most cases, both LMDeploy and vLLM can be used for model deployment. However, for InternVL3.5-20B-A4B, we recommend using vLLM, since LMDeploy does not yet support GPT-OSS.

> Please use transformers>=4.52.1 to ensure the model works normally. For the 20B version of our model, transformers>=4.55.0 is required.

To enable Thinking mode, please set the system prompt to our Thinking System Prompt. When enabling Thinking mode, we recommend setting `do_sample=True` and `temperature=0.6` to mitigate undesired repetition. Besides this method, you can also use the following code to get streamed output. Many repositories now support fine-tuning of the InternVL series models, including InternVL, SWIFT, XTuner, and others; please refer to their documentation for more details on fine-tuning. LMDeploy is a toolkit for compressing, deploying, and serving LLMs & VLMs. LMDeploy abstracts the complex inference process of multimodal Vision-Language Models (VLM) into an easy-to-use pipeline, similar to the Large Language Model (LLM) inference pipeline. When dealing with multiple images, you can put them all in one list. Keep in mind that multiple images will lead to a higher number of input tokens, and as a result, the size of the context window typically needs to be increased. Conducting inference with batch prompts is quite straightforward; just place them within a list structure. There are two ways to conduct multi-turn conversations with the pipeline: one is to construct messages according to the OpenAI format and use the method introduced above, the other is to use the `pipeline.chat` interface.
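The pipeline usage described above might look like the following sketch, based on LMDeploy's documented VLM pipeline API. The model ID, image URLs, `session_len`, and sampling settings are illustrative placeholders, and running it requires a GPU with the downloaded weights:

```python
from lmdeploy import pipeline, GenerationConfig, TurbomindEngineConfig
from lmdeploy.vl import load_image

# Model ID is a placeholder; pick the variant you actually deploy.
pipe = pipeline('OpenGVLab/InternVL3_5-8B',
                backend_config=TurbomindEngineConfig(session_len=16384))

# Single-image inference.
image = load_image('https://example.com/tiger.jpeg')
print(pipe(('describe this image', image)).text)

# Multiple images in one request: put them all in one list. Remember to
# enlarge session_len, since more images mean more input tokens.
images = [load_image(u) for u in ('https://example.com/a.jpg',
                                  'https://example.com/b.jpg')]
print(pipe(('describe these two images', images)).text)

# Batch prompts: just place them within a list structure.
print([out.text for out in pipe([('describe this image', image),
                                 ('what animal is this?', image)])])

# Multi-turn conversation via the pipeline.chat interface, with the
# sampling settings recommended for Thinking mode.
gen_config = GenerationConfig(do_sample=True, temperature=0.6)
sess = pipe.chat(('what is in the image?', image), gen_config=gen_config)
sess = pipe.chat('describe it in one sentence', session=sess,
                 gen_config=gen_config)
print(sess.response.text)
```

The alternative multi-turn route is to build an OpenAI-format `messages` list and pass it to `pipe()` directly, as noted above.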
LMDeploy's `api_server` enables models to be easily packed into services with a single command. The provided RESTful APIs are compatible with OpenAI's interfaces. Below is an example of service startup. To use the OpenAI-style interface, you need to install the OpenAI client. This project is released under the Apache-2.0 License. This project uses the pre-trained Qwen3 as a component, which is licensed under the Apache-2.0 License. If you find this project useful in your research, please consider citing:
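The elided startup commands presumably resembled the following, based on LMDeploy's documented CLI; the model ID and port are illustrative:

```shell
# Start an OpenAI-compatible RESTful service with a single command
# (model ID and port are examples; adjust to your deployment).
lmdeploy serve api_server OpenGVLab/InternVL3_5-8B --server-port 23333

# Install the OpenAI client to use the OpenAI-style interface.
pip install openai
```

A client can then point the official `openai` SDK at `http://0.0.0.0:23333/v1` and use the standard chat-completions API against the served model.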

license:apache-2.0
180
1

InternVL2_5-78B-AWQ

β€”
176
14

InternVL3-14B-AWQ

β€”
174
7

InternOmni

license:mit
173
6

InternVL3-1B-AWQ

β€”
171
2

InternVL2_5-38B-AWQ

license:mit
167
9

InternVL3-2B-AWQ

β€”
162
1

internimage_t_1k_224

license:mit
159
2

InternVL2-26B-AWQ

license:mit
157
21

InternViT-6B-224px

license:mit
148
24

VideoChat-Flash-Qwen2-7B_res224

license:apache-2.0
148
7

Vlaser-2B

β€”
142
1

VideoChat-R1_7B_caption

license:apache-2.0
136
4

InternVL-Chat-V1-1

base_model:meta-llama/Llama-2-13b-hf
134
13

InternVideo2_chat_8B_HD

license:mit
124
18

InternVL3_5-241B-A28B-HF

license:apache-2.0
124
11

Mini-InternVL2-2B-DA-DriveLM

license:mit
124
0

InternVL2-40B-AWQ

license:mit
117
18

InternVL3_5-241B-A28B-Flash

license:apache-2.0
117
4

Mini-InternVL2-1B-DA-DriveLM

license:mit
114
1

InternVL-Chat-ViT-6B-Vicuna-7B

β€”
108
8

InternVL2_5-78B-MPO

β€”
102
54

InternVL-Chat-V1-2-Plus

license:mit
100
34

InternVL2-8B-AWQ

license:mit
100
13

HoVLE

license:mit
99
13

InternVL-Chat-ViT-6B-Vicuna-13B

β€”
98
7

InternVL3_5-30B-A3B-Pretrained

[\[πŸ“‚ GitHub\]](https://github.com/OpenGVLab/InternVL) [\[πŸ“œ InternVL 1.0\]](https://huggingface.co/papers/2312.14238) [\[πŸ“œ InternVL 1.5\]](https://huggingface.co/papers/2404.16821) [\[πŸ“œ InternVL 2.5\]](https://huggingface.co/papers/2412.05271) [\[πŸ“œ InternVL2.5-MPO\]](https://huggingface.co/papers/2411.10442) [\[πŸ“œ InternVL3\]](https://huggingface.co/papers/2504.10479) [\[πŸ“œ InternVL3.5\]](https://huggingface.co/papers/2508.18265) [\[πŸ†• Blog\]](https://internvl.github.io/blog/) [\[πŸ—¨οΈ Chat Demo\]](https://chat.intern-ai.org.cn/) [\[πŸš€ Quick Start\]](#quick-start) [\[πŸ“– Documents\]](https://internvl.readthedocs.io/en/latest/) We introduce InternVL3.5, a new family of open-source multimodal models that significantly advances versatility, reasoning capability, and inference efficiency along the InternVL series. A key innovation is the Cascade Reinforcement Learning (Cascade RL) framework, which enhances reasoning through a two-stage process: offline RL for stable convergence and online RL for refined alignment. This coarse-to-fine training strategy leads to substantial improvements on downstream reasoning tasks, e.g., MMMU and MathVista. To optimize efficiency, we propose a Visual Resolution Router (ViR) that dynamically adjusts the resolution of visual tokens without compromising performance. Coupled with ViR, our Decoupled Vision-Language Deployment (DvD) strategy separates the vision encoder and language model across different GPUs, effectively balancing computational load. These contributions collectively enable InternVL3.5 to achieve up to a +16.0\% gain in overall reasoning performance and a 4.05 \\(\times\\) inference speedup compared to its predecessor, i.e., InternVL3. In addition, InternVL3.5 supports novel capabilities such as GUI interaction and embodied agency. 
Notably, our largest model, i.e., InternVL3.5-241B-A28B, attains state-of-the-art results among open-source MLLMs across general multimodal, reasoning, text, and agentic tasksβ€”narrowing the performance gap with leading commercial models like GPT-5. All models and code are publicly released. > Hatched bars represent closed-source commercial models. We report average scores on a set of multimodal general, reasoning, text, and agentic benchmarks: MMBench v1.1 (en), MMStar,BLINK, HallusionBench, AI2D, OCRBench, MMVet, MME-RealWorld (en), MVBench, VideoMME, MMMU, MathVista, MathVision, MathVerse, DynaMath, WeMath, LogicVista, MATH500, AIME24, AIME25, GPQA, MMLU-Pro, GAOKAO, IFEval, SGP-Bench, VSI-Bench, ERQA, SpaCE-10, and OmniSpatial. In the following table, we provide an overview of the InternVL3.5 series. To maintain consistency with earlier generations, we provide two model formats: the GitHub format, consistent with prior releases, and the HF format, aligned with the official Transformers standard. > If you want to convert the checkpoint between these two formats, please refer to the scripts about custom2hf and hf2custom. 
| Model | #Vision Param | #Language Param | #Total Param | HF Link | ModelScope Link | | --------------------- | ------------- | --------------- | ------------ | ------------------------------------------------------------------------------ | ---------------------------------------------------------------------------------------- | | InternVL3.5-1B | 0.3B | 0.8B | 1.1B | πŸ€— link | πŸ€– link | | InternVL3.5-2B | 0.3B | 2.0B | 2.3B | πŸ€— link | πŸ€– link | | InternVL3.5-4B | 0.3B | 4.4B | 4.7B | πŸ€— link | πŸ€– link | | InternVL3.5-8B | 0.3B | 8.2B | 8.5B | πŸ€— link | πŸ€– link | | InternVL3.5-14B | 0.3B | 14.8B | 15.1B | πŸ€— link | πŸ€– link | | InternVL3.5-38B | 5.5B | 32.8B | 38.4B | πŸ€— link | πŸ€– link | | InternVL3.5-20B-A4B | 0.3B | 20.9B | 21.2B-A4B | πŸ€— link | πŸ€– link | | InternVL3.5-30B-A3B | 0.3B | 30.5B | 30.8B-A3B | πŸ€— link | πŸ€– link | | InternVL3.5-241B-A28B | 5.5B | 235.1B | 240.7B-A28B | πŸ€— link | πŸ€– link | | Model | #Vision Param | #Language Param | #Total Param | HF Link | ModelScope Link | | ------------------------ | ------------- | --------------- | ------------ | --------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------- | | InternVL3.5-1B-HF | 0.3B | 0.8B | 1.1B | πŸ€— link | πŸ€– link | | InternVL3.5-2B-HF | 0.3B | 2.0B | 2.3B | πŸ€— link | πŸ€– link | | InternVL3.5-4B-HF | 0.3B | 4.4B | 4.7B | πŸ€— link | πŸ€– link | | InternVL3.5-8B-HF | 0.3B | 8.2B | 8.5B | πŸ€— link | πŸ€– link | | InternVL3.5-14B-HF | 0.3B | 14.8B | 15.1B | πŸ€— link | πŸ€– link | | InternVL3.5-38B-HF | 5.5B | 32.8B | 38.4B | πŸ€— link | πŸ€– link | | InternVL3.5-20B-A4B-HF | 0.3B | 20.9B | 21.2B-A4B | πŸ€— link | πŸ€– link | | InternVL3.5-30B-A3B-HF | 0.3B | 30.5B | 30.8B-A3B | πŸ€— link | πŸ€– link | | InternVL3.5-241B-A28B-HF | 5.5B | 235.1B | 240.7B-A28B | πŸ€— link | πŸ€– link | > We conduct the evaluation with VLMEvalkit. 
To enable the Thinking mode of our model, please set the system prompt to R1SYSTEMPROMPT. When enabling Thinking mode, we recommend setting `dosample=True` and `temperature=0.6` to mitigate undesired repetition. Our training pipeline comprises four stages: Multimodal Continual Pre-Training (CPT), Supervised Fine-Tuning (SFT), and Cascade Reinforcement Learning (CascadeRL). In CascadeRL, we first fine-tune the model using Mixed Preference Optimization (MPO) under an offline RL setting, followed by GSPO under an oneline RL setting. For the Flash version of InternVL3.5, we additionally introduce a lightweight training stage, termed Visual Consistency Learning (ViCO), which reduces the token cost required to represent an image patch. Here, we also open-source the model weights after different training stages for potential research usage. If you're unsure which version to use, please select the one without any suffix, as it has completed the full training pipeline. | Model | Training Pipeline | HF Link | ModelScope Link | | -------------------------------- | --------------------- | --------------------------------------------------------------------------- | ------------------------------------------------------------------------------------- | | InternVL3.5-1B-Pretrained | CPT | πŸ€— link | πŸ€– link | | InternVL3.5-1B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link | | InternVL3.5-1B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link | | InternVL3.5-1B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link | | InternVL3.5-2B-Pretrained | CPT | πŸ€— link | πŸ€– link | | InternVL3.5-2B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link | | InternVL3.5-2B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link | | InternVL3.5-2B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link | | InternVL3.5-4B-Pretrained | CPT | πŸ€— link | πŸ€– link | | InternVL3.5-4B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link | | InternVL3.5-4B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link | | InternVL3.5-4B | CPT + SFT + 
CascadeRL | πŸ€— link | πŸ€– link | | InternVL3.5-8B-Pretrained | CPT | πŸ€— link | πŸ€– link | | InternVL3.5-8B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link | | InternVL3.5-8B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link | | InternVL3.5-8B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link | | InternVL3.5-14B-Pretrained | CPT | πŸ€— link | πŸ€– link | | InternVL3.5-14B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link | | InternVL3.5-14B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link | | InternVL3.5-14B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link | | InternVL3.5-30B-A3B-Pretrained | CPT | πŸ€— link | πŸ€– link | | InternVL3.5-30B-A3B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link | | InternVL3.5-30B-A3B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link | | InternVL3.5-30B-A3B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link | | InternVL3.5-38B-Pretrained | CPT | πŸ€— link | πŸ€– link | | InternVL3.5-38B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link | | InternVL3.5-38B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link | | InternVL3.5-38B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link | | InternVL3.5-241B-A28B-Pretrained | CPT | πŸ€— link | πŸ€– link | | InternVL3.5-241B-A28B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link | | InternVL3.5-241B-A28B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link | | InternVL3.5-241B-A28B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link | The Flash version of our model will be released as soon as possible. `InternVL3.5`: This series of models follow the "ViT–MLP–LLM" paradigm adopted in previous versions of InternVL. We initialize the language model using the Qwen3 series and GPT-OSS, and the vision encoder using InternViT-300M and InternViT-6B. The Dynamic High Resolution strategy introduced in InternVL1.5 is also retained in our design. `InternVL3.5-Flash`: Compared to InternVL3.5, InternVL3.5-Flash further integrates the Visual Resolution Router (ViR), thus yielding a series of efficient variants friendly suitable for resource-constrained scenarios. 
Specifically, in InternVL3.5, each image patch is initially represented as 1024 visual tokens for the vision encoder, which are then compressed into 256 tokens via a pixel shuffle module before being passed to the Large Language Model (LLM). In InternVL3.5-Flash, as shown in the Figure below, an additional pixel shuffle module with a higher compression rate is included, enabling the compression of visual tokens down to 64 tokens. For each patch, the patch router determines the appropriate compression rate by assessing its semantic richness, and routes it to the corresponding pixel shuffle module accordingly. Benefiting from this patch-aware compression mechanism, InternVL3.5-Flash is able to reduce the number of visual tokens by 50\% while maintaining nearly 100\% of the performance of InternVL3.5. During the pre-training stage, we update all model parameters jointly using the combination of large-scale text and multimodal corpora. Specifically, given an arbitrary training sample consisting of a multimodal token sequence \\(\mathbf{x}=\left(x1, x2, \ldots, xL\right)\\), the next token prediction (NTP) loss is calculated on each text token as follows: $$ \mathcal{L}{i}=-\log p\theta\left(xi \mid x1, \ldots, x{i-1}\right), $$ where \\(xi\\) is the predicted token and prefix tokens in \\(\{x1, x2, \ldots, x{i-1}\}\\) can be either text tokens or image tokens. Notably, for conversation samples, only response tokens are included for the calculation of the loss. Additionally, to mitigate bias toward either longer or shorter responses during training, we adopt the square averaging to re-weight the NTP loss as follows: $$ \mathcal{L}{i}^{'} = \frac{wi}{\sumj wj} \cdot \mathcal{L}i, \quad wi = \frac{1}{N^{0.5}}, $$ where \\(N\\) denotes the number of tokens in the training sample on which the loss needs to be calculated. The random JPEG compression is also included to enhance the model's real-world performance. 
During the SFT phase, we adopt the same objective as in the pre-training stage and use the square-root averaging strategy to calculate the final loss. In this stage, the context window is set to 32K tokens to adapt long-context information. Compared to InternVL3, the SFT stage of InternVL3.5 contains more high-quality and diverse training data derived from three sources: (1) Instruction-following data from InternVL3, which are reused to preserve broad coverage of vision–language tasks. (2) Multimodal reasoning data in the "Thinking" mode, which are included to instill long-thinking capabilities in the model. To construct such data, we first use InternVL3-78B to describe the image and then input the description into DeepSeek-R1 to sample rollouts with detailed reasoning processes. Rollouts with an incorrect final answer are filtered out. The questions in these datasets cover various expert domains, such as mathematics and scientific disciplines, thereby strengthening performance on different reasoning tasks. (3) Capability-expansion datasets, which endow InternVL3.5 with new skills, including GUI-based interaction, embodied interaction, and scalable vect Cascade RL aims to combine the benefits of offline RL and online RL to progressively facilitate the post-training of MLLMs in an efficient manner. Specifically, we first fine-tune the model using an offline RL algorithm as an efficient warm-up stage to reach a satisfied results, which can guarantee the high-quality rollouts for the latter stage. Subsequently, we employ an online RL algorithm to further refine the output distribution based on rollouts generated by the model itself. Compared to the single offline or online RL stage, our cascaded RL achieves significant performance improvements at a fraction of the GPU time cost. During the offline RL stage, we employ mixed preference optimization (MPO) to fine-tune the model. 
Specifically, the training objective of MPO is a combination of preference loss \\(\mathcal{L}_{p}\\), quality loss \\(\mathcal{L}_{q}\\), and generation loss \\(\mathcal{L}_{g}\\), which can be formulated as follows:

$$
\mathcal{L}_{\text{MPO}} = w_{p} \mathcal{L}_{p} + w_{q} \mathcal{L}_{q} + w_{g} \mathcal{L}_{g},
$$

where \\(w_{*}\\) represents the weight assigned to each loss component. The DPO loss, BCO loss, and LM loss serve as the preference loss, quality loss, and generation loss, respectively.

During the online RL stage, we employ GSPO, without reference model constraints, as our online RL algorithm, which we find more effective for training both dense and mixture-of-experts (MoE) models. Similar to GRPO, the advantage is defined as the reward normalized across responses sampled from the same query. The training objective of GSPO is given by:

$$
\mathcal{L}_{\mathrm{GSPO}}(\theta)=\mathbb{E}_{x \sim \mathcal{D},\,\left\{y_i\right\}_{i=1}^{G} \sim \pi_{\theta_{\text{old}}}(\cdot \mid x)}\left[\frac{1}{G} \sum_{i=1}^{G} \min \left(s_i(\theta) \widehat{A}_i, \operatorname{clip}\left(s_i(\theta), 1-\varepsilon, 1+\varepsilon\right) \widehat{A}_i\right)\right],
$$

where the importance sampling ratio \\(s_i(\theta)\\) is defined as the geometric mean of the per-token ratios.

> Please see our paper for more technical and experimental details.

We further include ViCO as an additional training stage to integrate the visual resolution router (ViR) into InternVL3.5, thereby reducing its inference cost. The resulting efficient version of InternVL3.5 is termed InternVL3.5-Flash. In particular, ViCO comprises two stages:

`Consistency training`: In this stage, the entire model is trained to minimize the divergence between response distributions conditioned on visual tokens with different compression rates. In practice, we introduce an extra reference model, which is frozen and initialized with InternVL3.5.
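As a concrete aside on the online stage above, the group-normalized advantage and the clipped, sequence-level GSPO update can be sketched in plain Python (a simplified illustration of ours; per-token log-probabilities are assumed precomputed):

```python
import math

def group_advantages(rewards):
    """Normalize rewards across a group of responses to the same query."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + 1e-8) for r in rewards]

def gspo_objective(logp_new, logp_old, advantages, eps=0.2):
    """Clipped GSPO objective with a sequence-level importance ratio.

    The ratio s_i is the geometric mean of per-token ratios, i.e.
    exp(mean(logp_new - logp_old)) over the tokens of response i.
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        s = math.exp((sum(lp_new) - sum(lp_old)) / len(lp_new))
        s_clipped = min(max(s, 1.0 - eps), 1.0 + eps)
        total += min(s * adv, s_clipped * adv)
    return total / len(advantages)
```

Using the geometric mean keeps the ratio at the sequence level, so long responses are not penalized by the product of many per-token ratios.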
Given a sample, each image patch is represented as either 256 or 64 tokens, and the training objective is defined as follows:

$$
\mathcal{L}_{\text{ViCO}} = \mathbb{E}_{\xi \sim \mathcal{R}} \left[ \frac{1}{N} \sum_{i=1}^{N} \mathrm{KL} \Big( \pi_{\theta_{\text{ref}}}\left(y_i \mid y_{<i}, I\right) \,\Big\|\, \pi_{\theta}\left(y_i \mid y_{<i}, I_{\xi}\right) \Big) \right],
$$

where the reference model \\(\pi_{\theta_{\text{ref}}}\\) conditions on the uncompressed visual tokens \\(I\\), while the policy \\(\pi_{\theta}\\) conditions on the visual tokens \\(I_{\xi}\\) compressed at the sampled rate \\(\xi\\).

`Router training`: In this stage, only the ViR is trained to select an appropriate compression rate for each patch, while the rest of the model is kept frozen.

> Please see our paper for more technical and experimental details.

Test-time scaling (TTS) has been empirically demonstrated as an effective approach to enhance the reasoning capabilities of LLMs and MLLMs, particularly for complex tasks necessitating multi-step inference. In this work, we implement a comprehensive test-time scaling approach that simultaneously improves reasoning depth (i.e., deep thinking) and breadth (i.e., parallel thinking).

`Deep Thinking`: By activating the Thinking mode, we guide the model to deliberately engage in step-by-step reasoning (i.e., decomposing complex problems into logical steps and validating intermediate conclusions) prior to generating the final answer. This approach systematically improves the logical structure of solutions for complex problems, particularly those requiring multi-step inference, and enhances reasoning depth.

`Parallel Thinking`: Following InternVL3, for reasoning tasks, we adopt the Best-of-N (BoN) strategy, employing VisualPRM-v1.1 as the critic model to select the optimal response from multiple reasoning candidates. This approach improves reasoning breadth.

> Notably, unless otherwise specified, the experimental results reported in our paper are obtained without applying TTS. Thus far, we have only applied TTS to reasoning benchmarks, since we found that the model already exhibits strong perception and understanding capabilities, and applying TTS there yields no significant improvement.

In multimodal inference, the vision encoder and language model have distinct computational characteristics. The vision encoder, which transforms images into semantic features, is highly parallelizable and does not rely on long-term history states.
In contrast, the language model performs inference in an autoregressive manner, where each step depends on previously computed states. This sequential property makes the language part more sensitive to memory bandwidth and latency. When MLLMs are deployed online at scale, the vision and language models often block each other, incurring additional inference cost. This effect becomes more pronounced with larger vision models or higher-resolution images.

As shown in the Figure above, we propose decoupled vision-language deployment (DvD) to address this issue by separating vision and language processing, with a particular focus on optimizing the prefilling stage. The vision subsystem batches and processes images to produce compact feature embeddings, which are then transmitted to the language subsystem and fused with the text context prior to decoding. This separation alleviates blocking and brings multimodal prefilling performance closer to that of pure language models.

In our system implementation, the ViT and MLP (and ViR for InternVL3.5-Flash) are deployed on the vision server, while the language server executes only the LLM. Communication is unidirectional, transmitting BF16 visual features over TCP, with RDMA optionally employed for higher transmission speed. Vision processing, feature transmission, and language processing are organized into an asynchronous three-stage pipeline, enabling overlapped execution and minimizing pipeline stalls.

DvD increases GPU utilization and processing efficiency on the vision side, while enabling the language server to focus exclusively on the LLM's prefilling and decoding without being blocked by vision computation. This design leads to improved throughput and responsiveness. Moreover, the architecture supports independent hardware cost optimization for the vision and language modules, and facilitates the seamless integration of new modules without requiring modifications to the language server deployment.
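The asynchronous three-stage pipeline can be illustrated with threads and queues (a toy sketch of ours; `encode`, `transmit`, and `prefill` stand in for the vision server, feature transfer, and language-server prefilling):

```python
import queue
import threading

def run_dvd_pipeline(images, encode, transmit, prefill):
    """Overlap vision encoding, feature transmission, and LLM prefilling.

    Each stage runs in its own thread; queues decouple the stages so a
    new image can be encoded while earlier features are transmitted and
    prefilled. None is used as an end-of-stream sentinel.
    """
    encoded, transmitted, results = queue.Queue(), queue.Queue(), []

    def vision_stage():
        for img in images:
            encoded.put(encode(img))
        encoded.put(None)

    def transfer_stage():
        while (feat := encoded.get()) is not None:
            transmitted.put(transmit(feat))
        transmitted.put(None)

    def language_stage():
        while (feat := transmitted.get()) is not None:
            results.append(prefill(feat))

    stages = [threading.Thread(target=s)
              for s in (vision_stage, transfer_stage, language_stage)]
    for t in stages:
        t.start()
    for t in stages:
        t.join()
    return results
```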
Multi-Image Understanding & Real-World Comprehension; Comprehensive Multimodal Understanding & Multimodal Hallucination Evaluation.

We provide example code to run `InternVL3.5-8B` using `transformers`. Please note that our models with up to 30B parameters can be deployed on a single A100 GPU, while the 38B model requires two A100 GPUs and the 235B model requires eight A100 GPUs.

> In most cases, both LMDeploy and vLLM can be used for model deployment. However, for InternVL3.5-20B-A4B, we recommend using vLLM, since LMDeploy does not yet support GPT-OSS.

> Please use transformers>=4.52.1 to ensure the model works normally. For the 20B version of our model, transformers>=4.55.0 is required.

To enable thinking mode, please set the system prompt to our Thinking System Prompt. When enabling Thinking mode, we recommend setting `do_sample=True` and `temperature=0.6` to mitigate undesired repetition. Besides this method, you can also use the following code to get streamed output.

Many repositories now support fine-tuning of the InternVL series models, including InternVL, SWIFT, XTuner, and others. Please refer to their documentation for more details on fine-tuning.

LMDeploy is a toolkit for compressing, deploying, and serving LLMs & VLMs. It abstracts the complex inference process of multimodal Vision-Language Models (VLMs) into an easy-to-use pipeline, similar to the Large Language Model (LLM) inference pipeline. When dealing with multiple images, you can put them all in one list. Keep in mind that multiple images will lead to a higher number of input tokens, so the size of the context window typically needs to be increased. Conducting inference with batch prompts is quite straightforward; just place them within a list structure. There are two ways to conduct multi-turn conversations with the pipeline. One is to construct messages according to the OpenAI format and use the method introduced above; the other is to use the `pipeline.chat` interface.
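For the OpenAI-format route to multi-turn conversation, the messages are simply a list of role/content dicts, with images attached to a user turn as content parts (a sketch of the message layout; field names follow the OpenAI chat format):

```python
def build_messages(turns, image_url=None):
    """Build OpenAI-style chat messages for a multi-turn conversation.

    turns: list of (user_text, assistant_text_or_None) pairs; the image,
    if given, is attached to the first user turn as a content part.
    """
    messages = []
    for i, (user_text, assistant_text) in enumerate(turns):
        if i == 0 and image_url is not None:
            content = [
                {"type": "text", "text": user_text},
                {"type": "image_url", "image_url": {"url": image_url}},
            ]
        else:
            content = user_text
        messages.append({"role": "user", "content": content})
        if assistant_text is not None:
            messages.append({"role": "assistant", "content": assistant_text})
    return messages
```

The resulting list can be passed either to the pipeline directly or to an OpenAI-compatible endpoint.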
LMDeploy's `api_server` enables models to be easily packed into services with a single command. The provided RESTful APIs are compatible with OpenAI's interfaces. Below is an example of service startup. To use the OpenAI-style interface, you need to install the OpenAI Python package.

This project is released under the Apache-2.0 license. This project uses the pre-trained Qwen3 as a component, which is licensed under the Apache-2.0 license. If you find this project useful in your research, please consider citing:

license:apache-2.0
96
1

InternVL3-2B-Pretrained

license:apache-2.0
95
1

InternVL2_5-78B-MPO-AWQ

—
93
9

InternVL3-9B-Pretrained

license:mit
93
0

internimage_xl_22k_384

license:mit
91
2

InternVL3_5-38B-Flash

license:apache-2.0
89
5

InternVL3_5-38B-MPO

Notably, our largest model, i.e., InternVL3.5-241B-A28B, attains state-of-the-art results among open-source MLLMs across general multimodal, reasoning, text, and agentic tasks, narrowing the performance gap with leading commercial models like GPT-5. All models and code are publicly released.

> Hatched bars represent closed-source commercial models. We report average scores on a set of multimodal general, reasoning, text, and agentic benchmarks: MMBench v1.1 (en), MMStar, BLINK, HallusionBench, AI2D, OCRBench, MMVet, MME-RealWorld (en), MVBench, VideoMME, MMMU, MathVista, MathVision, MathVerse, DynaMath, WeMath, LogicVista, MATH500, AIME24, AIME25, GPQA, MMLU-Pro, GAOKAO, IFEval, SGP-Bench, VSI-Bench, ERQA, SpaCE-10, and OmniSpatial.

In the following table, we provide an overview of the InternVL3.5 series. To maintain consistency with earlier generations, we provide two model formats: the GitHub format, consistent with prior releases, and the HF format, aligned with the official Transformers standard.

> If you want to convert a checkpoint between these two formats, please refer to the custom2hf and hf2custom scripts.
| Model | #Vision Param | #Language Param | #Total Param | HF Link | ModelScope Link |
| --- | --- | --- | --- | --- | --- |
| InternVL3.5-1B | 0.3B | 0.8B | 1.1B | 🤗 link | 🤖 link |
| InternVL3.5-2B | 0.3B | 2.0B | 2.3B | 🤗 link | 🤖 link |
| InternVL3.5-4B | 0.3B | 4.4B | 4.7B | 🤗 link | 🤖 link |
| InternVL3.5-8B | 0.3B | 8.2B | 8.5B | 🤗 link | 🤖 link |
| InternVL3.5-14B | 0.3B | 14.8B | 15.1B | 🤗 link | 🤖 link |
| InternVL3.5-38B | 5.5B | 32.8B | 38.4B | 🤗 link | 🤖 link |
| InternVL3.5-20B-A4B | 0.3B | 20.9B | 21.2B-A4B | 🤗 link | 🤖 link |
| InternVL3.5-30B-A3B | 0.3B | 30.5B | 30.8B-A3B | 🤗 link | 🤖 link |
| InternVL3.5-241B-A28B | 5.5B | 235.1B | 240.7B-A28B | 🤗 link | 🤖 link |

| Model | #Vision Param | #Language Param | #Total Param | HF Link | ModelScope Link |
| --- | --- | --- | --- | --- | --- |
| InternVL3.5-1B-HF | 0.3B | 0.8B | 1.1B | 🤗 link | 🤖 link |
| InternVL3.5-2B-HF | 0.3B | 2.0B | 2.3B | 🤗 link | 🤖 link |
| InternVL3.5-4B-HF | 0.3B | 4.4B | 4.7B | 🤗 link | 🤖 link |
| InternVL3.5-8B-HF | 0.3B | 8.2B | 8.5B | 🤗 link | 🤖 link |
| InternVL3.5-14B-HF | 0.3B | 14.8B | 15.1B | 🤗 link | 🤖 link |
| InternVL3.5-38B-HF | 5.5B | 32.8B | 38.4B | 🤗 link | 🤖 link |
| InternVL3.5-20B-A4B-HF | 0.3B | 20.9B | 21.2B-A4B | 🤗 link | 🤖 link |
| InternVL3.5-30B-A3B-HF | 0.3B | 30.5B | 30.8B-A3B | 🤗 link | 🤖 link |
| InternVL3.5-241B-A28B-HF | 5.5B | 235.1B | 240.7B-A28B | 🤗 link | 🤖 link |

> We conduct the evaluation with VLMEvalKit.
To enable the Thinking mode of our model, please set the system prompt to R1_SYSTEM_PROMPT. When enabling Thinking mode, we recommend setting `do_sample=True` and `temperature=0.6` to mitigate undesired repetition.

Our training pipeline comprises four stages: Multimodal Continual Pre-Training (CPT), Supervised Fine-Tuning (SFT), and the two-stage Cascade Reinforcement Learning (CascadeRL). In CascadeRL, we first fine-tune the model using Mixed Preference Optimization (MPO) under an offline RL setting, followed by GSPO under an online RL setting. For the Flash version of InternVL3.5, we additionally introduce a lightweight training stage, termed Visual Consistency Learning (ViCO), which reduces the token cost required to represent an image patch.

Here, we also open-source the model weights after different training stages for potential research usage. If you're unsure which version to use, please select the one without any suffix, as it has completed the full training pipeline.

| Model | Training Pipeline | HF Link | ModelScope Link |
| --- | --- | --- | --- |
| InternVL3.5-1B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-1B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-1B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-1B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |
| InternVL3.5-2B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-2B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-2B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-2B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |
| InternVL3.5-4B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-4B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-4B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-4B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |
| InternVL3.5-8B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-8B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-8B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-8B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |
| InternVL3.5-14B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-14B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-14B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-14B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |
| InternVL3.5-30B-A3B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-30B-A3B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-30B-A3B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-30B-A3B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |
| InternVL3.5-38B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-38B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-38B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-38B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |
| InternVL3.5-241B-A28B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-241B-A28B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-241B-A28B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-241B-A28B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |

The Flash version of our model will be released as soon as possible.

`InternVL3.5`: This series of models follows the "ViT–MLP–LLM" paradigm adopted in previous versions of InternVL. We initialize the language model with the Qwen3 series and GPT-OSS, and the vision encoder with InternViT-300M and InternViT-6B. The Dynamic High Resolution strategy introduced in InternVL1.5 is also retained in our design.

`InternVL3.5-Flash`: Compared to InternVL3.5, InternVL3.5-Flash further integrates the Visual Resolution Router (ViR), yielding a series of efficient variants suitable for resource-constrained scenarios.
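The token reduction behind the pixel shuffle modules is a space-to-depth rearrangement: each r×r neighborhood of the token grid is merged into one token whose channel dimension grows by r², so a 32×32 grid of 1024 tokens becomes 256 tokens at r=2 or 64 tokens at r=4 (the grid sizes here are our illustrative assumption). A pure-Python sketch:

```python
def pixel_shuffle_compress(tokens, h, w, r=2):
    """Merge each r x r block of visual tokens into a single token.

    tokens: list of h * w token vectors (lists), row-major on an
    h x w grid. Returns (h // r) * (w // r) merged tokens, each the
    concatenation of the r * r tokens in its block, so the token count
    drops by r**2 while all features survive in the channel dimension.
    """
    out = []
    for bi in range(h // r):
        for bj in range(w // r):
            merged = []
            for di in range(r):
                for dj in range(r):
                    merged.extend(tokens[(bi * r + di) * w + (bj * r + dj)])
            out.append(merged)
    return out
```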

license:apache-2.0
89
2

InternVL2-8B-MPO

license:mit
88
37

InternVL2_5-26B-MPO

license:mit
88
14

InternVL3-14B-Pretrained

license:apache-2.0
88
1

InternVL-Chat-ViT-6B-Vicuna-13B-448px

—
87
4

InternVL3_5-30B-A3B-MPO

[\[πŸ“‚ GitHub\]](https://github.com/OpenGVLab/InternVL) [\[πŸ“œ InternVL 1.0\]](https://huggingface.co/papers/2312.14238) [\[πŸ“œ InternVL 1.5\]](https://huggingface.co/papers/2404.16821) [\[πŸ“œ InternVL 2.5\]](https://huggingface.co/papers/2412.05271) [\[πŸ“œ InternVL2.5-MPO\]](https://huggingface.co/papers/2411.10442) [\[πŸ“œ InternVL3\]](https://huggingface.co/papers/2504.10479) [\[πŸ“œ InternVL3.5\]](https://huggingface.co/papers/2508.18265) [\[πŸ†• Blog\]](https://internvl.github.io/blog/) [\[πŸ—¨οΈ Chat Demo\]](https://chat.intern-ai.org.cn/) [\[πŸš€ Quick Start\]](#quick-start) [\[πŸ“– Documents\]](https://internvl.readthedocs.io/en/latest/) We introduce InternVL3.5, a new family of open-source multimodal models that significantly advances versatility, reasoning capability, and inference efficiency along the InternVL series. A key innovation is the Cascade Reinforcement Learning (Cascade RL) framework, which enhances reasoning through a two-stage process: offline RL for stable convergence and online RL for refined alignment. This coarse-to-fine training strategy leads to substantial improvements on downstream reasoning tasks, e.g., MMMU and MathVista. To optimize efficiency, we propose a Visual Resolution Router (ViR) that dynamically adjusts the resolution of visual tokens without compromising performance. Coupled with ViR, our Decoupled Vision-Language Deployment (DvD) strategy separates the vision encoder and language model across different GPUs, effectively balancing computational load. These contributions collectively enable InternVL3.5 to achieve up to a +16.0\% gain in overall reasoning performance and a 4.05 \\(\times\\) inference speedup compared to its predecessor, i.e., InternVL3. In addition, InternVL3.5 supports novel capabilities such as GUI interaction and embodied agency. 
Notably, our largest model, i.e., InternVL3.5-241B-A28B, attains state-of-the-art results among open-source MLLMs across general multimodal, reasoning, text, and agentic tasksβ€”narrowing the performance gap with leading commercial models like GPT-5. All models and code are publicly released. > Hatched bars represent closed-source commercial models. We report average scores on a set of multimodal general, reasoning, text, and agentic benchmarks: MMBench v1.1 (en), MMStar,BLINK, HallusionBench, AI2D, OCRBench, MMVet, MME-RealWorld (en), MVBench, VideoMME, MMMU, MathVista, MathVision, MathVerse, DynaMath, WeMath, LogicVista, MATH500, AIME24, AIME25, GPQA, MMLU-Pro, GAOKAO, IFEval, SGP-Bench, VSI-Bench, ERQA, SpaCE-10, and OmniSpatial. In the following table, we provide an overview of the InternVL3.5 series. To maintain consistency with earlier generations, we provide two model formats: the GitHub format, consistent with prior releases, and the HF format, aligned with the official Transformers standard. > If you want to convert the checkpoint between these two formats, please refer to the scripts about custom2hf and hf2custom. 
GitHub format:

| Model | #Vision Param | #Language Param | #Total Param | HF Link | ModelScope Link |
| --- | --- | --- | --- | --- | --- |
| InternVL3.5-1B | 0.3B | 0.8B | 1.1B | 🤗 link | 🤖 link |
| InternVL3.5-2B | 0.3B | 2.0B | 2.3B | 🤗 link | 🤖 link |
| InternVL3.5-4B | 0.3B | 4.4B | 4.7B | 🤗 link | 🤖 link |
| InternVL3.5-8B | 0.3B | 8.2B | 8.5B | 🤗 link | 🤖 link |
| InternVL3.5-14B | 0.3B | 14.8B | 15.1B | 🤗 link | 🤖 link |
| InternVL3.5-38B | 5.5B | 32.8B | 38.4B | 🤗 link | 🤖 link |
| InternVL3.5-20B-A4B | 0.3B | 20.9B | 21.2B-A4B | 🤗 link | 🤖 link |
| InternVL3.5-30B-A3B | 0.3B | 30.5B | 30.8B-A3B | 🤗 link | 🤖 link |
| InternVL3.5-241B-A28B | 5.5B | 235.1B | 240.7B-A28B | 🤗 link | 🤖 link |

HF format:

| Model | #Vision Param | #Language Param | #Total Param | HF Link | ModelScope Link |
| --- | --- | --- | --- | --- | --- |
| InternVL3.5-1B-HF | 0.3B | 0.8B | 1.1B | 🤗 link | 🤖 link |
| InternVL3.5-2B-HF | 0.3B | 2.0B | 2.3B | 🤗 link | 🤖 link |
| InternVL3.5-4B-HF | 0.3B | 4.4B | 4.7B | 🤗 link | 🤖 link |
| InternVL3.5-8B-HF | 0.3B | 8.2B | 8.5B | 🤗 link | 🤖 link |
| InternVL3.5-14B-HF | 0.3B | 14.8B | 15.1B | 🤗 link | 🤖 link |
| InternVL3.5-38B-HF | 5.5B | 32.8B | 38.4B | 🤗 link | 🤖 link |
| InternVL3.5-20B-A4B-HF | 0.3B | 20.9B | 21.2B-A4B | 🤗 link | 🤖 link |
| InternVL3.5-30B-A3B-HF | 0.3B | 30.5B | 30.8B-A3B | 🤗 link | 🤖 link |
| InternVL3.5-241B-A28B-HF | 5.5B | 235.1B | 240.7B-A28B | 🤗 link | 🤖 link |

> We conduct the evaluation with VLMEvalKit.
To enable the Thinking mode of our model, please set the system prompt to R1_SYSTEM_PROMPT. When Thinking mode is enabled, we recommend setting `do_sample=True` and `temperature=0.6` to mitigate undesired repetition.

Our training pipeline comprises three stages: Multimodal Continual Pre-Training (CPT), Supervised Fine-Tuning (SFT), and Cascade Reinforcement Learning (CascadeRL). In CascadeRL, we first fine-tune the model using Mixed Preference Optimization (MPO) under an offline RL setting, followed by GSPO under an online RL setting. For the Flash version of InternVL3.5, we additionally introduce a lightweight training stage, termed Visual Consistency Learning (ViCO), which reduces the token cost required to represent an image patch.

Here, we also open-source the model weights after the different training stages for potential research usage. If you are unsure which version to use, please select the one without any suffix, as it has completed the full training pipeline.

| Model | Training Pipeline | HF Link | ModelScope Link |
| --- | --- | --- | --- |
| InternVL3.5-1B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-1B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-1B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-1B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |
| InternVL3.5-2B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-2B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-2B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-2B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |
| InternVL3.5-4B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-4B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-4B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-4B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |
| InternVL3.5-8B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-8B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-8B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-8B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |
| InternVL3.5-14B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-14B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-14B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-14B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |
| InternVL3.5-30B-A3B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-30B-A3B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-30B-A3B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-30B-A3B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |
| InternVL3.5-38B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-38B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-38B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-38B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |
| InternVL3.5-241B-A28B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-241B-A28B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-241B-A28B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-241B-A28B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |

The Flash version of our models will be released as soon as possible.

`InternVL3.5`: This series of models follows the "ViT–MLP–LLM" paradigm adopted in previous versions of InternVL. We initialize the language model using the Qwen3 series and GPT-OSS, and the vision encoder using InternViT-300M and InternViT-6B. The Dynamic High Resolution strategy introduced in InternVL1.5 is also retained in our design.

`InternVL3.5-Flash`: Compared to InternVL3.5, InternVL3.5-Flash further integrates the Visual Resolution Router (ViR), yielding a series of efficient variants suitable for resource-constrained scenarios.
Specifically, in InternVL3.5, each image patch is initially represented as 1024 visual tokens for the vision encoder, which are then compressed into 256 tokens via a pixel shuffle module before being passed to the Large Language Model (LLM). In InternVL3.5-Flash, as shown in the Figure below, an additional pixel shuffle module with a higher compression rate is included, enabling compression of the visual tokens down to 64 tokens. For each patch, the patch router determines the appropriate compression rate by assessing its semantic richness, and routes it to the corresponding pixel shuffle module accordingly. Benefiting from this patch-aware compression mechanism, InternVL3.5-Flash is able to reduce the number of visual tokens by 50\% while maintaining nearly 100\% of the performance of InternVL3.5.

During the pre-training stage, we update all model parameters jointly using a combination of large-scale text and multimodal corpora. Specifically, given an arbitrary training sample consisting of a multimodal token sequence \\(\mathbf{x}=\left(x_1, x_2, \ldots, x_L\right)\\), the next token prediction (NTP) loss is calculated on each text token as follows:

$$ \mathcal{L}_{i}=-\log p_{\theta}\left(x_i \mid x_1, \ldots, x_{i-1}\right), $$

where \\(x_i\\) is the predicted token and the prefix tokens \\(\{x_1, x_2, \ldots, x_{i-1}\}\\) can be either text tokens or image tokens. Notably, for conversation samples, only response tokens are included in the loss calculation. Additionally, to mitigate bias toward either longer or shorter responses during training, we adopt square-root averaging to re-weight the NTP loss as follows:

$$ \mathcal{L}_{i}^{\prime} = \frac{w_i}{\sum_j w_j} \cdot \mathcal{L}_i, \quad w_i = \frac{1}{N^{0.5}}, $$

where \\(N\\) denotes the number of loss-bearing tokens in the training sample. Random JPEG compression is also applied to enhance the model's real-world performance.
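As a toy illustration of the square-root re-weighting above, here is a pure-Python sketch (real training computes these losses over batched logits; the function name is ours, not from the InternVL codebase):

```python
import math

def reweighted_ntp_loss(per_sample_log_probs):
    """Square-root averaging of per-token NTP losses.

    per_sample_log_probs: one list per training sample, containing
    log p(x_i | x_1..x_{i-1}) for each loss-bearing (response) token.
    Each token in a sample of N tokens gets weight w_i = 1 / N**0.5,
    and the batch loss is the weight-normalized sum of -log p terms.
    """
    weights, losses = [], []
    for lps in per_sample_log_probs:
        w = 1.0 / math.sqrt(len(lps))  # w_i = N^{-0.5}
        for lp in lps:
            weights.append(w)
            losses.append(-lp)  # NTP loss is the negative log-likelihood
    z = sum(weights)
    return sum(w / z * l for w, l in zip(weights, losses))
```

Compared with plain token averaging, a long response no longer dominates the batch loss in proportion to its full length, only to the square root of it.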
During the SFT phase, we adopt the same objective as in the pre-training stage and use the same square-root averaging strategy to calculate the final loss. In this stage, the context window is set to 32K tokens to accommodate long-context information. Compared to InternVL3, the SFT stage of InternVL3.5 contains higher-quality and more diverse training data derived from three sources: (1) Instruction-following data from InternVL3, which are reused to preserve broad coverage of vision–language tasks. (2) Multimodal reasoning data in the "Thinking" mode, which are included to instill long-thinking capabilities in the model. To construct such data, we first use InternVL3-78B to describe the image and then feed the description into DeepSeek-R1 to sample rollouts with detailed reasoning processes. Rollouts with an incorrect final answer are filtered out. The questions in these datasets cover various expert domains, such as mathematics and scientific disciplines, thereby strengthening performance on different reasoning tasks. (3) Capability-expansion datasets, which endow InternVL3.5 with new skills, including GUI-based interaction, embodied interaction, and scalable vector graphics (SVG) understanding.

Cascade RL aims to combine the benefits of offline RL and online RL to progressively facilitate the post-training of MLLMs in an efficient manner. Specifically, we first fine-tune the model using an offline RL algorithm as an efficient warm-up stage, which guarantees high-quality rollouts for the subsequent stage. We then employ an online RL algorithm to further refine the output distribution based on rollouts generated by the model itself. Compared to a single offline or online RL stage, our cascaded RL achieves significant performance improvements at a fraction of the GPU time cost. During the offline RL stage, we employ Mixed Preference Optimization (MPO) to fine-tune the model.
Specifically, the training objective of MPO is a combination of preference loss \\(\mathcal{L}_{p}\\), quality loss \\(\mathcal{L}_{q}\\), and generation loss \\(\mathcal{L}_{g}\\), which can be formulated as follows:

$$ \mathcal{L}_{\text{MPO}}= w_{p} \mathcal{L}_{p} + w_{q} \mathcal{L}_{q} + w_{g} \mathcal{L}_{g}, $$

where \\(w_{*}\\) represents the weight assigned to each loss component. The DPO loss, BCO loss, and LM loss serve as the preference loss, quality loss, and generation loss, respectively.

During the online RL stage, we employ GSPO, without reference model constraints, as our online RL algorithm, which we find more effective for training both dense and mixture-of-experts (MoE) models. Similar to GRPO, the advantage is defined as the normalized reward across responses sampled from the same query. The training objective of GSPO is given by:

$$ \mathcal{L}_{\mathrm{GSPO}}(\theta)=\mathbb{E}_{x \sim \mathcal{D},\,\{y_i\}_{i=1}^{G} \sim \pi_{\theta_{\text{old}}}(\cdot \mid x)}\left[\frac{1}{G} \sum_{i=1}^{G} \min \Big(s_i(\theta) \widehat{A}_i,\; \operatorname{clip}\big(s_i(\theta), 1-\varepsilon, 1+\varepsilon\big) \widehat{A}_i\Big)\right], $$

where the importance sampling ratio \\(s_i(\theta)\\) is defined as the geometric mean of the per-token ratios.

> Please see our paper for more technical and experimental details.

We further include ViCO as an additional training stage to integrate the Visual Resolution Router (ViR) into InternVL3.5, thereby reducing its inference cost. The resulting efficient version of InternVL3.5 is termed InternVL3.5-Flash. In particular, ViCO comprises two stages:

`Consistency training`: In this stage, the entire model is trained to minimize the divergence between response distributions conditioned on visual tokens with different compression rates. In practice, we introduce an extra reference model, which is frozen and initialized with InternVL3.5.
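In code, the consistency idea — matching the model fed compressed visual tokens to the frozen reference fed full-resolution tokens — looks roughly like the following sketch (toy discrete distributions; real training operates on the models' next-token logits, and the function names here are ours):

```python
import math

def kl_divergence(p, q):
    """KL(p || q) between two discrete next-token distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def consistency_loss(ref_dists, policy_dists):
    """Mean per-token KL from the frozen reference model (conditioned on
    full-resolution visual tokens) to the trained model (conditioned on
    compressed visual tokens), averaged over the N response tokens."""
    n = len(ref_dists)
    return sum(kl_divergence(p, q) for p, q in zip(ref_dists, policy_dists)) / n
```

The loss is zero exactly when the compressed-input model reproduces the reference's token distributions, which is what allows the compression rate to vary per patch without changing the response.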
Given a sample, each image patch is represented as either 256 or 64 tokens, and the training objective is defined as follows:

$$ \mathcal{L}_{\text{ViCO}} = \mathbb{E}_{\xi \sim \mathcal{R}} \left[ \frac{1}{N} \sum_{i=1}^{N} \mathrm{KL}\Big( \pi_{\theta_{\text{ref}}}\left(y_i \mid y_{<i}, I\right) \,\Big\|\, \pi_{\theta}\left(y_i \mid y_{<i}, I_{\xi}\right) \Big) \right], $$

where \\(\xi\\) is the compression rate sampled from \\(\mathcal{R}\\), the reference model \\(\pi_{\theta_{\text{ref}}}\\) conditions on the uncompressed visual tokens \\(I\\), and the policy \\(\pi_{\theta}\\) conditions on the compressed tokens \\(I_{\xi}\\).

`Router training`: In this stage, only the visual resolution router is trained to select the appropriate compression rate for each image patch, while the rest of the model is kept frozen.

> Please see our paper for more technical and experimental details.

Test-time scaling (TTS) has been empirically demonstrated as an effective approach to enhance the reasoning capabilities of LLMs and MLLMs, particularly for complex tasks necessitating multi-step inference. In this work, we implement a comprehensive test-time scaling approach that simultaneously improves reasoning depth (i.e., deep thinking) and breadth (i.e., parallel thinking).

`Deep Thinking`: By activating the Thinking mode, we guide the model to deliberately engage in step-by-step reasoning (i.e., decomposing complex problems into logical steps and validating intermediate conclusions) prior to generating the final answer. This approach systematically improves the logical structure of solutions for complex problems, particularly those requiring multi-step inference, and enhances reasoning depth.

`Parallel Thinking`: Following InternVL3, for reasoning tasks, we adopt the Best-of-N (BoN) strategy by employing VisualPRM-v1.1 as the critic model to select the optimal response from multiple reasoning candidates. This approach improves reasoning breadth.

> Notably, unless otherwise specified, the experimental results reported in our paper are obtained without applying TTS. Thus far, we have only applied TTS to reasoning benchmarks, since we found that the model already exhibits strong perception and understanding capabilities, and applying TTS there yields no significant improvement.

In multimodal inference, the vision encoder and language model have distinct computational characteristics. The vision encoder, which transforms images into semantic features, is highly parallelizable and does not rely on long-term history states.
In contrast, the language model performs inference autoregressively, requiring previous states to compute the next token. This sequential property makes the language part more sensitive to memory bandwidth and latency. When MLLMs are deployed online at scale, the vision and language models often block each other, incurring additional inference cost. This effect becomes more pronounced with larger vision models or higher-resolution images.

As shown in the Figure above, we propose decoupled vision-language deployment (DvD) to address this issue by separating vision and language processing, with a particular focus on optimizing the prefilling stage. The vision subsystem batches and processes images to produce compact feature embeddings, which are then transmitted to the language subsystem and fused with the text context prior to decoding. This separation alleviates blocking and brings multimodal prefilling performance closer to that of pure language models.

In our system implementation, the ViT and MLP (and ViR for InternVL3.5-Flash) are deployed on the vision server, while the language server executes only the LLM. Communication is unidirectional, transmitting BF16 visual features over TCP, with RDMA optionally employed for higher transmission speed. Vision processing, feature transmission, and language processing are organized into an asynchronous three-stage pipeline, enabling overlapped execution and minimizing pipeline stalls.

DvD increases GPU utilization and processing efficiency on the vision side, while the language server focuses exclusively on the LLM's prefilling and decoding without being blocked by vision computation. This design improves throughput and responsiveness. Moreover, the architecture supports independent hardware cost optimization for the vision and language modules, and facilitates seamless integration of new modules without requiring modifications to the language server deployment.
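The asynchronous three-stage pipeline described above can be sketched with queues and threads; this is a minimal toy model of the overlap structure, with hypothetical caller-supplied stage functions standing in for ViT encoding, feature transmission, and LLM prefilling:

```python
import queue
import threading

def run_pipeline(images, encode, transmit, prefill):
    """Three-stage asynchronous pipeline: vision -> transmit -> language.

    Each stage runs in its own thread and hands results to the next stage
    through a bounded queue, so the vision server can encode image k+1
    while features of image k are in flight and image k-1 is prefilling.
    """
    q1, q2 = queue.Queue(maxsize=4), queue.Queue(maxsize=4)
    results, DONE = [], object()

    def vision_worker():
        for img in images:
            q1.put(encode(img))        # ViT + MLP on the vision server
        q1.put(DONE)

    def transmit_worker():
        while (item := q1.get()) is not DONE:
            q2.put(transmit(item))     # e.g. BF16 features over TCP/RDMA
        q2.put(DONE)

    def language_worker():
        while (item := q2.get()) is not DONE:
            results.append(prefill(item))  # LLM prefilling on the language server

    threads = [threading.Thread(target=w)
               for w in (vision_worker, transmit_worker, language_worker)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Because each stage has a single producer and a single consumer, output order matches input order while the three stages still execute concurrently.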
Multi-Image Understanding & Real-World Comprehension

Comprehensive Multimodal Understanding & Multimodal Hallucination Evaluation

We provide example code to run `InternVL3.5-8B` using `transformers`. Please note that our models with up to 30B parameters can be deployed on a single A100 GPU, while the 38B model requires two A100 GPUs and the 235B model requires eight A100 GPUs.

> In most cases, both LMDeploy and vLLM can be used for model deployment. However, for InternVL3.5-20B-A4B, we recommend using vLLM, since LMDeploy does not yet support GPT-OSS.

> Please use transformers>=4.52.1 to ensure the model works normally. For the 20B version of our model, transformers>=4.55.0 is required.

To enable Thinking mode, please set the system prompt to our Thinking System Prompt. When Thinking mode is enabled, we recommend setting `do_sample=True` and `temperature=0.6` to mitigate undesired repetition. Besides this method, you can also use the following code to get streamed output.

Many repositories now support fine-tuning of the InternVL series models, including InternVL, SWIFT, XTuner, and others. Please refer to their documentation for more details on fine-tuning.

LMDeploy is a toolkit for compressing, deploying, and serving LLMs & VLMs. It abstracts the complex inference process of multimodal vision-language models (VLMs) into an easy-to-use pipeline, similar to the large language model (LLM) inference pipeline. When dealing with multiple images, you can put them all in one list; keep in mind that multiple images lead to a higher number of input tokens, so the size of the context window typically needs to be increased. Conducting inference with batch prompts is straightforward: just place them within a list structure. There are two ways to conduct multi-turn conversations with the pipeline: one is to construct messages in the OpenAI format and use the method introduced above; the other is to use the `pipeline.chat` interface.
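For the OpenAI-format route, a minimal sketch of assembling multi-turn multimodal messages follows; the helper function and the placeholder system-prompt text are ours (substitute the actual Thinking System Prompt from the InternVL repository):

```python
# Placeholder text; replace with the official Thinking System Prompt.
THINKING_SYSTEM_PROMPT = "You are a helpful assistant. Think step by step ..."

def build_messages(history, user_text, image_url=None, thinking=False):
    """Assemble OpenAI-format chat messages for a multi-turn VLM request.

    history: prior turns as [{"role": ..., "content": ...}] dicts.
    When thinking=True, the system prompt enabling Thinking mode is
    prepended; images are attached as image_url content parts.
    """
    messages = []
    if thinking:
        messages.append({"role": "system", "content": THINKING_SYSTEM_PROMPT})
    messages.extend(history)
    content = [{"type": "text", "text": user_text}]
    if image_url:
        content.append({"type": "image_url", "image_url": {"url": image_url}})
    messages.append({"role": "user", "content": content})
    return messages
```

The resulting list can be passed unchanged to an OpenAI-compatible client pointed at an LMDeploy server; with `thinking=True`, remember to also request sampling with `temperature=0.6` as recommended above.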
LMDeploy's `api_server` enables models to be easily packed into services with a single command, and the provided RESTful APIs are compatible with OpenAI's interfaces. Below is an example of service startup. To use the OpenAI-style interface, you need to install the `openai` package.

This project is released under the Apache-2.0 License. This project uses the pre-trained Qwen3 as a component, which is licensed under the Apache-2.0 License.

If you find this project useful in your research, please consider citing:

InternVL3_5-241B-A28B-MPO

[\[πŸ“‚ GitHub\]](https://github.com/OpenGVLab/InternVL) [\[πŸ“œ InternVL 1.0\]](https://huggingface.co/papers/2312.14238) [\[πŸ“œ InternVL 1.5\]](https://huggingface.co/papers/2404.16821) [\[πŸ“œ InternVL 2.5\]](https://huggingface.co/papers/2412.05271) [\[πŸ“œ InternVL2.5-MPO\]](https://huggingface.co/papers/2411.10442) [\[πŸ“œ InternVL3\]](https://huggingface.co/papers/2504.10479) [\[πŸ“œ InternVL3.5\]](https://huggingface.co/papers/2508.18265) [\[πŸ†• Blog\]](https://internvl.github.io/blog/) [\[πŸ—¨οΈ Chat Demo\]](https://chat.intern-ai.org.cn/) [\[πŸš€ Quick Start\]](#quick-start) [\[πŸ“– Documents\]](https://internvl.readthedocs.io/en/latest/) We introduce InternVL3.5, a new family of open-source multimodal models that significantly advances versatility, reasoning capability, and inference efficiency along the InternVL series. A key innovation is the Cascade Reinforcement Learning (Cascade RL) framework, which enhances reasoning through a two-stage process: offline RL for stable convergence and online RL for refined alignment. This coarse-to-fine training strategy leads to substantial improvements on downstream reasoning tasks, e.g., MMMU and MathVista. To optimize efficiency, we propose a Visual Resolution Router (ViR) that dynamically adjusts the resolution of visual tokens without compromising performance. Coupled with ViR, our Decoupled Vision-Language Deployment (DvD) strategy separates the vision encoder and language model across different GPUs, effectively balancing computational load. These contributions collectively enable InternVL3.5 to achieve up to a +16.0\% gain in overall reasoning performance and a 4.05 \\(\times\\) inference speedup compared to its predecessor, i.e., InternVL3. In addition, InternVL3.5 supports novel capabilities such as GUI interaction and embodied agency. 
Notably, our largest model, i.e., InternVL3.5-241B-A28B, attains state-of-the-art results among open-source MLLMs across general multimodal, reasoning, text, and agentic tasksβ€”narrowing the performance gap with leading commercial models like GPT-5. All models and code are publicly released. > Hatched bars represent closed-source commercial models. We report average scores on a set of multimodal general, reasoning, text, and agentic benchmarks: MMBench v1.1 (en), MMStar,BLINK, HallusionBench, AI2D, OCRBench, MMVet, MME-RealWorld (en), MVBench, VideoMME, MMMU, MathVista, MathVision, MathVerse, DynaMath, WeMath, LogicVista, MATH500, AIME24, AIME25, GPQA, MMLU-Pro, GAOKAO, IFEval, SGP-Bench, VSI-Bench, ERQA, SpaCE-10, and OmniSpatial. In the following table, we provide an overview of the InternVL3.5 series. To maintain consistency with earlier generations, we provide two model formats: the GitHub format, consistent with prior releases, and the HF format, aligned with the official Transformers standard. > If you want to convert the checkpoint between these two formats, please refer to the scripts about custom2hf and hf2custom. 
| Model | #Vision Param | #Language Param | #Total Param | HF Link | ModelScope Link | | --------------------- | ------------- | --------------- | ------------ | ------------------------------------------------------------------------------ | ---------------------------------------------------------------------------------------- | | InternVL3.5-1B | 0.3B | 0.8B | 1.1B | πŸ€— link | πŸ€– link | | InternVL3.5-2B | 0.3B | 2.0B | 2.3B | πŸ€— link | πŸ€– link | | InternVL3.5-4B | 0.3B | 4.4B | 4.7B | πŸ€— link | πŸ€– link | | InternVL3.5-8B | 0.3B | 8.2B | 8.5B | πŸ€— link | πŸ€– link | | InternVL3.5-14B | 0.3B | 14.8B | 15.1B | πŸ€— link | πŸ€– link | | InternVL3.5-38B | 5.5B | 32.8B | 38.4B | πŸ€— link | πŸ€– link | | InternVL3.5-20B-A4B | 0.3B | 20.9B | 21.2B-A4B | πŸ€— link | πŸ€– link | | InternVL3.5-30B-A3B | 0.3B | 30.5B | 30.8B-A3B | πŸ€— link | πŸ€– link | | InternVL3.5-241B-A28B | 5.5B | 235.1B | 240.7B-A28B | πŸ€— link | πŸ€– link | | Model | #Vision Param | #Language Param | #Total Param | HF Link | ModelScope Link | | ------------------------ | ------------- | --------------- | ------------ | --------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------- | | InternVL3.5-1B-HF | 0.3B | 0.8B | 1.1B | πŸ€— link | πŸ€– link | | InternVL3.5-2B-HF | 0.3B | 2.0B | 2.3B | πŸ€— link | πŸ€– link | | InternVL3.5-4B-HF | 0.3B | 4.4B | 4.7B | πŸ€— link | πŸ€– link | | InternVL3.5-8B-HF | 0.3B | 8.2B | 8.5B | πŸ€— link | πŸ€– link | | InternVL3.5-14B-HF | 0.3B | 14.8B | 15.1B | πŸ€— link | πŸ€– link | | InternVL3.5-38B-HF | 5.5B | 32.8B | 38.4B | πŸ€— link | πŸ€– link | | InternVL3.5-20B-A4B-HF | 0.3B | 20.9B | 21.2B-A4B | πŸ€— link | πŸ€– link | | InternVL3.5-30B-A3B-HF | 0.3B | 30.5B | 30.8B-A3B | πŸ€— link | πŸ€– link | | InternVL3.5-241B-A28B-HF | 5.5B | 235.1B | 240.7B-A28B | πŸ€— link | πŸ€– link | > We conduct the evaluation with VLMEvalkit. 
To enable the Thinking mode of our model, please set the system prompt to R1SYSTEMPROMPT. When enabling Thinking mode, we recommend setting `dosample=True` and `temperature=0.6` to mitigate undesired repetition. Our training pipeline comprises four stages: Multimodal Continual Pre-Training (CPT), Supervised Fine-Tuning (SFT), and Cascade Reinforcement Learning (CascadeRL). In CascadeRL, we first fine-tune the model using Mixed Preference Optimization (MPO) under an offline RL setting, followed by GSPO under an oneline RL setting. For the Flash version of InternVL3.5, we additionally introduce a lightweight training stage, termed Visual Consistency Learning (ViCO), which reduces the token cost required to represent an image patch. Here, we also open-source the model weights after different training stages for potential research usage. If you're unsure which version to use, please select the one without any suffix, as it has completed the full training pipeline. | Model | Training Pipeline | HF Link | ModelScope Link | | -------------------------------- | --------------------- | --------------------------------------------------------------------------- | ------------------------------------------------------------------------------------- | | InternVL3.5-1B-Pretrained | CPT | πŸ€— link | πŸ€– link | | InternVL3.5-1B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link | | InternVL3.5-1B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link | | InternVL3.5-1B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link | | InternVL3.5-2B-Pretrained | CPT | πŸ€— link | πŸ€– link | | InternVL3.5-2B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link | | InternVL3.5-2B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link | | InternVL3.5-2B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link | | InternVL3.5-4B-Pretrained | CPT | πŸ€— link | πŸ€– link | | InternVL3.5-4B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link | | InternVL3.5-4B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link | | InternVL3.5-4B | CPT + SFT + 
CascadeRL | πŸ€— link | πŸ€– link | | InternVL3.5-8B-Pretrained | CPT | πŸ€— link | πŸ€– link | | InternVL3.5-8B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link | | InternVL3.5-8B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link | | InternVL3.5-8B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link | | InternVL3.5-14B-Pretrained | CPT | πŸ€— link | πŸ€– link | | InternVL3.5-14B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link | | InternVL3.5-14B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link | | InternVL3.5-14B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link | | InternVL3.5-30B-A3B-Pretrained | CPT | πŸ€— link | πŸ€– link | | InternVL3.5-30B-A3B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link | | InternVL3.5-30B-A3B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link | | InternVL3.5-30B-A3B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link | | InternVL3.5-38B-Pretrained | CPT | πŸ€— link | πŸ€– link | | InternVL3.5-38B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link | | InternVL3.5-38B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link | | InternVL3.5-38B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link | | InternVL3.5-241B-A28B-Pretrained | CPT | πŸ€— link | πŸ€– link | | InternVL3.5-241B-A28B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link | | InternVL3.5-241B-A28B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link | | InternVL3.5-241B-A28B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link | The Flash version of our model will be released as soon as possible. `InternVL3.5`: This series of models follow the "ViT–MLP–LLM" paradigm adopted in previous versions of InternVL. We initialize the language model using the Qwen3 series and GPT-OSS, and the vision encoder using InternViT-300M and InternViT-6B. The Dynamic High Resolution strategy introduced in InternVL1.5 is also retained in our design. `InternVL3.5-Flash`: Compared to InternVL3.5, InternVL3.5-Flash further integrates the Visual Resolution Router (ViR), thus yielding a series of efficient variants friendly suitable for resource-constrained scenarios. 
Specifically, in InternVL3.5, each image patch is initially represented as 1024 visual tokens for the vision encoder, which are then compressed into 256 tokens via a pixel shuffle module before being passed to the Large Language Model (LLM). In InternVL3.5-Flash, as shown in the Figure below, an additional pixel shuffle module with a higher compression rate is included, enabling the compression of visual tokens down to 64 tokens. For each patch, the patch router determines the appropriate compression rate by assessing its semantic richness, and routes it to the corresponding pixel shuffle module accordingly. Benefiting from this patch-aware compression mechanism, InternVL3.5-Flash is able to reduce the number of visual tokens by 50\% while maintaining nearly 100\% of the performance of InternVL3.5. During the pre-training stage, we update all model parameters jointly using the combination of large-scale text and multimodal corpora. Specifically, given an arbitrary training sample consisting of a multimodal token sequence \\(\mathbf{x}=\left(x1, x2, \ldots, xL\right)\\), the next token prediction (NTP) loss is calculated on each text token as follows: $$ \mathcal{L}{i}=-\log p\theta\left(xi \mid x1, \ldots, x{i-1}\right), $$ where \\(xi\\) is the predicted token and prefix tokens in \\(\{x1, x2, \ldots, x{i-1}\}\\) can be either text tokens or image tokens. Notably, for conversation samples, only response tokens are included for the calculation of the loss. Additionally, to mitigate bias toward either longer or shorter responses during training, we adopt the square averaging to re-weight the NTP loss as follows: $$ \mathcal{L}{i}^{'} = \frac{wi}{\sumj wj} \cdot \mathcal{L}i, \quad wi = \frac{1}{N^{0.5}}, $$ where \\(N\\) denotes the number of tokens in the training sample on which the loss needs to be calculated. The random JPEG compression is also included to enhance the model's real-world performance. 
During the SFT phase, we adopt the same objective as in the pre-training stage and use the square-root averaging strategy to calculate the final loss. In this stage, the context window is set to 32K tokens to adapt long-context information. Compared to InternVL3, the SFT stage of InternVL3.5 contains more high-quality and diverse training data derived from three sources: (1) Instruction-following data from InternVL3, which are reused to preserve broad coverage of vision–language tasks. (2) Multimodal reasoning data in the "Thinking" mode, which are included to instill long-thinking capabilities in the model. To construct such data, we first use InternVL3-78B to describe the image and then input the description into DeepSeek-R1 to sample rollouts with detailed reasoning processes. Rollouts with an incorrect final answer are filtered out. The questions in these datasets cover various expert domains, such as mathematics and scientific disciplines, thereby strengthening performance on different reasoning tasks. (3) Capability-expansion datasets, which endow InternVL3.5 with new skills, including GUI-based interaction, embodied interaction, and scalable vect Cascade RL aims to combine the benefits of offline RL and online RL to progressively facilitate the post-training of MLLMs in an efficient manner. Specifically, we first fine-tune the model using an offline RL algorithm as an efficient warm-up stage to reach a satisfied results, which can guarantee the high-quality rollouts for the latter stage. Subsequently, we employ an online RL algorithm to further refine the output distribution based on rollouts generated by the model itself. Compared to the single offline or online RL stage, our cascaded RL achieves significant performance improvements at a fraction of the GPU time cost. During the offline RL stage, we employ mixed preference optimization (MPO) to fine-tune the model. 
Specifically, the training objective of MPO is a combination of preference loss \\(\mathcal{L}{p}\\), quality loss \\(\mathcal{L}{q}\\), and generation loss \\(\mathcal{L}{g}\\), which can be formulated as follows: $$ \mathcal{L}{\text{MPO}}= w{p} \mathcal{L}{p} + w{q} \mathcal{L}{q} + w{g} \mathcal{L}{g} , $$ where \\(w{}\\) represents the weight assigned to each loss component. The DPO loss, BCO loss, and LM loss serve as the preference loss, quality loss, and generation loss, respectively. During the online RL stage, we employ GSPO, without reference model constraints, as our online RL algorithm, which we find more effective in training both dense and mixture-of-experts (MoE) models. Similar to GRPO, the advantage is defined as the normalized reward across responses sampled from the same query. The training objective of GSPO is given by: $$ \mathcal{L}{\mathrm{GSPO}}(\theta)=\mathbb{E}{x \sim \mathcal{D},\left\{yi\right\}{i=1}^G \sim \pi{\theta \text { old }}(\cdot \mid x)}\left[\frac{1}{G} \sum{i=1}^G \min \left(si(\theta) \widehat{A}i, \operatorname{clip}\left(si(\theta), 1-\varepsilon, 1+\varepsilon\right) \widehat{A}i\right)\right], $$ where the importance sampling ratio is defined as the geometric mean of the per-token ratios. > Please see our paper for more technical and experimental details. We further include ViCO as an additional training stage to integrate the visual resolution router (ViR) into InternVL3.5, thereby reducing the inference cost of InternVL3.5. The obtained efficient version of InternVL3.5 are termed as InternVL3.5-Flash. In particular, ViCO comprises two stages: `Consistency training`: In this stage, the entire model is trained to minimize the divergence between response distributions conditioned on visual tokens with different compression rates. In practice, we introduce an extra reference model, which is frozen and initialized with InternVL3.5. 
Given a sample, each image patch is represented as either 256 or 64 tokens, and the training objective is defined as follows:

$$ \mathcal{L}_{\text{ViCO}} = \mathbb{E}_{\xi \sim \mathcal{R}} \left[ \frac{1}{N} \sum_{i=1}^{N} \mathrm{KL} \Big( \pi_{\theta_{\text{ref}}}\left(y_i \mid y_{<i}\right) \,\Big\|\, \pi_{\theta}\left(y_i \mid y_{<i}\right) \Big) \right], $$

where \\(\xi\\) denotes the compression rate sampled from \\(\mathcal{R}\\), \\(N\\) is the number of response tokens, \\(\pi_{\theta_{\text{ref}}}\\) is the frozen reference model conditioned on the uncompressed visual tokens, and \\(\pi_{\theta}\\) is the policy conditioned on the compressed visual tokens.

`Router training`: In this stage, only the ViR is trained to select an appropriate compression rate for each image patch, while the rest of the model is kept frozen.

> Please see our paper for more technical and experimental details.

Test-time scaling (TTS) has been empirically demonstrated as an effective approach to enhance the reasoning capabilities of LLMs and MLLMs, particularly for complex tasks necessitating multi-step inference. In this work, we implement a comprehensive test-time scaling approach that simultaneously improves reasoning depth (i.e., deep thinking) and breadth (i.e., parallel thinking). `Deep Thinking`: By activating the Thinking mode, we guide the model to deliberately engage in step-by-step reasoning (i.e., decomposing complex problems into logical steps and validating intermediate conclusions) before generating the final answer. This approach systematically improves the logical structure of solutions for complex problems, particularly those requiring multi-step inference, and enhances reasoning depth. `Parallel Thinking`: Following InternVL3, for reasoning tasks, we adopt the Best-of-N (BoN) strategy, employing VisualPRM-v1.1 as the critic model to select the optimal response from multiple reasoning candidates. This approach improves reasoning breadth. > Notably, unless otherwise specified, the experimental results reported in our paper are obtained without applying TTS. Thus far, we have only applied TTS to reasoning benchmarks, since we found that the model already exhibits strong perception and understanding capabilities, and applying TTS yields no significant improvement. In multimodal inference, the vision encoder and language model have distinct computational characteristics. The vision encoder, which transforms images into semantic features, is highly parallelizable and does not rely on long-term history states.
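A minimal numerical sketch of the ViCO consistency objective discussed earlier in this section: the KL term compares next-token distributions from the frozen reference model (full-resolution visual tokens) against the policy (compressed visual tokens). The distributions below are toy values:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two categorical distributions."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    return float(np.sum(p * np.log(p / q)))

def vico_consistency_loss(ref_dists, policy_dists):
    """Mean per-token KL between the frozen reference model's next-token
    distributions (uncompressed visual tokens) and the policy's
    distributions (compressed visual tokens)."""
    return float(np.mean([kl_divergence(p, q)
                          for p, q in zip(ref_dists, policy_dists)]))

ref = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
same = vico_consistency_loss(ref, ref)                     # 0: distributions agree
diff = vico_consistency_loss(ref, [[0.3, 0.4, 0.3]] * 2)  # > 0: compression hurt
```

Minimizing this loss drives the policy's responses under 64-token patches toward those the reference model produces under 256-token patches.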
In contrast, the language model performs inference in an autoregressive manner, which requires the previous states to compute the next one. This sequential property makes the language part more sensitive to memory bandwidth and latency. When MLLMs are deployed online at scale, the vision and language models often block each other, incurring additional inference cost. This effect becomes more pronounced with larger vision models or higher-resolution images. As shown in the Figure above, we propose decoupled vision-language deployment (DvD) to address this issue by separating vision and language processing, with a particular focus on optimizing the prefilling stage. The vision subsystem batches and processes images to produce compact feature embeddings, which are then transmitted to the language subsystem for fusion with the text context prior to decoding. This separation alleviates blocking and brings multimodal prefilling performance closer to that of pure language models. In our system implementation, the ViT and MLP (and ViR for InternVL3.5-Flash) are deployed on the vision server, while the language server executes only the LLM. The communication is unidirectional, transmitting BF16 visual features over TCP, with RDMA optionally employed to achieve higher transmission speed. Vision processing, feature transmission, and language processing are organized into an asynchronous three-stage pipeline, enabling overlapped execution and minimizing pipeline stalls. DvD increases GPU utilization and processing efficiency on the vision side, while enabling the language server to focus exclusively on the LLM’s prefilling and decoding without being blocked by vision computation. This design leads to improved throughput and responsiveness. Moreover, the architecture supports independent hardware cost optimization for the vision and language modules, and facilitates the seamless integration of new modules without requiring modifications to the language server deployment.
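The asynchronous three-stage pipeline can be mimicked in miniature with threads and queues; the stage functions below are trivial stand-ins for ViT encoding, TCP/RDMA transmission, and LLM prefilling, so this is a sketch of the dataflow, not the deployment system:

```python
import queue
import threading

def run_pipeline(images, encode, transmit, prefill):
    """Three overlapping stages connected by queues, as in DvD."""
    q_feat, q_recv, results = queue.Queue(), queue.Queue(), []

    def vision_stage():            # batched, parallel-friendly ViT + MLP
        for img in images:
            q_feat.put(encode(img))
        q_feat.put(None)           # sentinel: no more features

    def transmit_stage():          # BF16 features over TCP/RDMA
        while (feat := q_feat.get()) is not None:
            q_recv.put(transmit(feat))
        q_recv.put(None)

    def language_stage():          # LLM prefilling on received features
        while (feat := q_recv.get()) is not None:
            results.append(prefill(feat))

    threads = [threading.Thread(target=f)
               for f in (vision_stage, transmit_stage, language_stage)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

out = run_pipeline([1, 2, 3],
                   encode=lambda x: x * 10,
                   transmit=lambda f: f,
                   prefill=lambda f: f + 1)
```

Because each stage only blocks on its own queue, vision encoding of image *k+1* overlaps with transmission of image *k* and prefilling of image *k-1*, which is the overlap DvD exploits.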
Multi-Image Understanding & Real-World Comprehension Comprehensive Multimodal Understanding & Multimodal Hallucination Evaluation We provide example code to run `InternVL3.5-8B` using `transformers`. Please note that our models with up to 30B parameters can be deployed on a single A100 GPU, while the 38B model requires two A100 GPUs and the 235B model requires eight A100 GPUs. > In most cases, both LMDeploy and vLLM can be used for model deployment. However, for InternVL3.5-20B-A4B, we recommend using vLLM, since LMDeploy does not yet support GPT-OSS. > Please use transformers>=4.52.1 to ensure the model works normally. For the 20B version of our model, transformers>=4.55.0 is required. To enable thinking mode, please set the system prompt to our Thinking System Prompt. When enabling Thinking mode, we recommend setting `do_sample=True` and `temperature=0.6` to mitigate undesired repetition. Besides this method, you can also use the following code to get streamed output. Many repositories now support fine-tuning of the InternVL series models, including InternVL, SWIFT, XTuner, and others. Please refer to their documentation for more details on fine-tuning. LMDeploy is a toolkit for compressing, deploying, and serving LLMs & VLMs. It abstracts the complex inference process of multimodal Vision-Language Models (VLMs) into an easy-to-use pipeline, similar to the Large Language Model (LLM) inference pipeline. When dealing with multiple images, you can put them all in one list. Keep in mind that multiple images will lead to a higher number of input tokens, so the size of the context window typically needs to be increased. Conducting inference with batch prompts is straightforward; just place them within a list structure. There are two ways to conduct multi-turn conversations with the pipeline: one is to construct messages according to the OpenAI format and use the method introduced above; the other is to use the `pipeline.chat` interface.
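For the OpenAI-style multi-turn format mentioned above, a minimal sketch of the message structure follows; the URLs are placeholders, and actually passing `messages` to the pipeline requires LMDeploy itself:

```python
def user_turn(text, image_urls=()):
    """Build an OpenAI-format user message with optional images."""
    content = [{"type": "text", "text": text}]
    content += [{"type": "image_url", "image_url": {"url": u}} for u in image_urls]
    return {"role": "user", "content": content}

def assistant_turn(text):
    """Build an OpenAI-format assistant message."""
    return {"role": "assistant", "content": text}

# A two-image first turn followed by a text-only follow-up question.
messages = [
    user_turn("Describe these two images.",
              ["https://example.com/a.jpg", "https://example.com/b.jpg"]),
    assistant_turn("The first image shows ..."),
    user_turn("What do they have in common?"),
]
```

With LMDeploy installed, the same `messages` list can be fed to the pipeline call or to the OpenAI-compatible `api_server` endpoint.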
LMDeploy's `api_server` enables models to be easily packed into services with a single command. The provided RESTful APIs are compatible with OpenAI's interfaces. Below is an example of service startup: To use the OpenAI-style interface, you need to install the OpenAI Python package: This project is released under the Apache-2.0 License. This project uses the pre-trained Qwen3 as a component, which is licensed under the Apache-2.0 License. If you find this project useful in your research, please consider citing:

license:apache-2.0
77
2

Mini-InternVL2-1B-DA-Medical

license:mit
76
1

internimage_h_jointto22k_384

license:mit
75
2

InternVL2-Llama3-76B-AWQ

base_model:OpenGVLab/InternVL2-Llama3-76B
73
25

internimage_h_22kto1k_640

license:mit
73
2

InternVL3_5-38B-Pretrained

[\[πŸ“‚ GitHub\]](https://github.com/OpenGVLab/InternVL) [\[πŸ“œ InternVL 1.0\]](https://huggingface.co/papers/2312.14238) [\[πŸ“œ InternVL 1.5\]](https://huggingface.co/papers/2404.16821) [\[πŸ“œ InternVL 2.5\]](https://huggingface.co/papers/2412.05271) [\[πŸ“œ InternVL2.5-MPO\]](https://huggingface.co/papers/2411.10442) [\[πŸ“œ InternVL3\]](https://huggingface.co/papers/2504.10479) [\[πŸ“œ InternVL3.5\]](https://huggingface.co/papers/2508.18265) [\[πŸ†• Blog\]](https://internvl.github.io/blog/) [\[πŸ—¨οΈ Chat Demo\]](https://chat.intern-ai.org.cn/) [\[πŸš€ Quick Start\]](#quick-start) [\[πŸ“– Documents\]](https://internvl.readthedocs.io/en/latest/) We introduce InternVL3.5, a new family of open-source multimodal models that significantly advances versatility, reasoning capability, and inference efficiency along the InternVL series. A key innovation is the Cascade Reinforcement Learning (Cascade RL) framework, which enhances reasoning through a two-stage process: offline RL for stable convergence and online RL for refined alignment. This coarse-to-fine training strategy leads to substantial improvements on downstream reasoning tasks, e.g., MMMU and MathVista. To optimize efficiency, we propose a Visual Resolution Router (ViR) that dynamically adjusts the resolution of visual tokens without compromising performance. Coupled with ViR, our Decoupled Vision-Language Deployment (DvD) strategy separates the vision encoder and language model across different GPUs, effectively balancing computational load. These contributions collectively enable InternVL3.5 to achieve up to a +16.0\% gain in overall reasoning performance and a 4.05 \\(\times\\) inference speedup compared to its predecessor, i.e., InternVL3. In addition, InternVL3.5 supports novel capabilities such as GUI interaction and embodied agency. 
Notably, our largest model, i.e., InternVL3.5-241B-A28B, attains state-of-the-art results among open-source MLLMs across general multimodal, reasoning, text, and agentic tasks, narrowing the performance gap with leading commercial models like GPT-5. All models and code are publicly released. > Hatched bars represent closed-source commercial models. We report average scores on a set of multimodal general, reasoning, text, and agentic benchmarks: MMBench v1.1 (en), MMStar, BLINK, HallusionBench, AI2D, OCRBench, MMVet, MME-RealWorld (en), MVBench, VideoMME, MMMU, MathVista, MathVision, MathVerse, DynaMath, WeMath, LogicVista, MATH500, AIME24, AIME25, GPQA, MMLU-Pro, GAOKAO, IFEval, SGP-Bench, VSI-Bench, ERQA, SpaCE-10, and OmniSpatial. In the following table, we provide an overview of the InternVL3.5 series. To maintain consistency with earlier generations, we provide two model formats: the GitHub format, consistent with prior releases, and the HF format, aligned with the official Transformers standard. > If you want to convert checkpoints between these two formats, please refer to the custom2hf and hf2custom scripts.
| Model | #Vision Param | #Language Param | #Total Param | HF Link | ModelScope Link |
| --------------------- | ------------- | --------------- | ------------ | ------- | --------------- |
| InternVL3.5-1B | 0.3B | 0.8B | 1.1B | πŸ€— link | πŸ€– link |
| InternVL3.5-2B | 0.3B | 2.0B | 2.3B | πŸ€— link | πŸ€– link |
| InternVL3.5-4B | 0.3B | 4.4B | 4.7B | πŸ€— link | πŸ€– link |
| InternVL3.5-8B | 0.3B | 8.2B | 8.5B | πŸ€— link | πŸ€– link |
| InternVL3.5-14B | 0.3B | 14.8B | 15.1B | πŸ€— link | πŸ€– link |
| InternVL3.5-38B | 5.5B | 32.8B | 38.4B | πŸ€— link | πŸ€– link |
| InternVL3.5-20B-A4B | 0.3B | 20.9B | 21.2B-A4B | πŸ€— link | πŸ€– link |
| InternVL3.5-30B-A3B | 0.3B | 30.5B | 30.8B-A3B | πŸ€— link | πŸ€– link |
| InternVL3.5-241B-A28B | 5.5B | 235.1B | 240.7B-A28B | πŸ€— link | πŸ€– link |

| Model | #Vision Param | #Language Param | #Total Param | HF Link | ModelScope Link |
| ------------------------ | ------------- | --------------- | ------------ | ------- | --------------- |
| InternVL3.5-1B-HF | 0.3B | 0.8B | 1.1B | πŸ€— link | πŸ€– link |
| InternVL3.5-2B-HF | 0.3B | 2.0B | 2.3B | πŸ€— link | πŸ€– link |
| InternVL3.5-4B-HF | 0.3B | 4.4B | 4.7B | πŸ€— link | πŸ€– link |
| InternVL3.5-8B-HF | 0.3B | 8.2B | 8.5B | πŸ€— link | πŸ€– link |
| InternVL3.5-14B-HF | 0.3B | 14.8B | 15.1B | πŸ€— link | πŸ€– link |
| InternVL3.5-38B-HF | 5.5B | 32.8B | 38.4B | πŸ€— link | πŸ€– link |
| InternVL3.5-20B-A4B-HF | 0.3B | 20.9B | 21.2B-A4B | πŸ€— link | πŸ€– link |
| InternVL3.5-30B-A3B-HF | 0.3B | 30.5B | 30.8B-A3B | πŸ€— link | πŸ€– link |
| InternVL3.5-241B-A28B-HF | 5.5B | 235.1B | 240.7B-A28B | πŸ€— link | πŸ€– link |

> We conduct the evaluation with VLMEvalKit.
To enable the Thinking mode of our model, please set the system prompt to R1_SYSTEM_PROMPT. When enabling Thinking mode, we recommend setting `do_sample=True` and `temperature=0.6` to mitigate undesired repetition. Our training pipeline comprises four stages: Multimodal Continual Pre-Training (CPT), Supervised Fine-Tuning (SFT), and the two-stage Cascade Reinforcement Learning (CascadeRL). In CascadeRL, we first fine-tune the model using Mixed Preference Optimization (MPO) under an offline RL setting, followed by GSPO under an online RL setting. For the Flash version of InternVL3.5, we additionally introduce a lightweight training stage, termed Visual Consistency Learning (ViCO), which reduces the token cost required to represent an image patch. Here, we also open-source the model weights after different training stages for potential research usage. If you're unsure which version to use, please select the one without any suffix, as it has completed the full training pipeline.

| Model | Training Pipeline | HF Link | ModelScope Link |
| -------------------------------- | --------------------- | ------- | --------------- |
| InternVL3.5-1B-Pretrained | CPT | πŸ€— link | πŸ€– link |
| InternVL3.5-1B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link |
| InternVL3.5-1B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link |
| InternVL3.5-1B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link |
| InternVL3.5-2B-Pretrained | CPT | πŸ€— link | πŸ€– link |
| InternVL3.5-2B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link |
| InternVL3.5-2B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link |
| InternVL3.5-2B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link |
| InternVL3.5-4B-Pretrained | CPT | πŸ€— link | πŸ€– link |
| InternVL3.5-4B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link |
| InternVL3.5-4B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link |
| InternVL3.5-4B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link |
| InternVL3.5-8B-Pretrained | CPT | πŸ€— link | πŸ€– link |
| InternVL3.5-8B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link |
| InternVL3.5-8B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link |
| InternVL3.5-8B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link |
| InternVL3.5-14B-Pretrained | CPT | πŸ€— link | πŸ€– link |
| InternVL3.5-14B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link |
| InternVL3.5-14B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link |
| InternVL3.5-14B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link |
| InternVL3.5-30B-A3B-Pretrained | CPT | πŸ€— link | πŸ€– link |
| InternVL3.5-30B-A3B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link |
| InternVL3.5-30B-A3B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link |
| InternVL3.5-30B-A3B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link |
| InternVL3.5-38B-Pretrained | CPT | πŸ€— link | πŸ€– link |
| InternVL3.5-38B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link |
| InternVL3.5-38B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link |
| InternVL3.5-38B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link |
| InternVL3.5-241B-A28B-Pretrained | CPT | πŸ€— link | πŸ€– link |
| InternVL3.5-241B-A28B-Instruct | CPT + SFT | πŸ€— link | πŸ€– link |
| InternVL3.5-241B-A28B-MPO | CPT + SFT + MPO | πŸ€— link | πŸ€– link |
| InternVL3.5-241B-A28B | CPT + SFT + CascadeRL | πŸ€— link | πŸ€– link |

The Flash version of our model will be released as soon as possible.

`InternVL3.5`: This series of models follows the "ViT–MLP–LLM" paradigm adopted in previous versions of InternVL. We initialize the language model with the Qwen3 series and GPT-OSS, and the vision encoder with InternViT-300M and InternViT-6B. The Dynamic High Resolution strategy introduced in InternVL1.5 is also retained in our design.

`InternVL3.5-Flash`: Compared to InternVL3.5, InternVL3.5-Flash further integrates the Visual Resolution Router (ViR), yielding a series of efficient variants suitable for resource-constrained scenarios.
Specifically, in InternVL3.5, each image patch is initially represented as 1024 visual tokens for the vision encoder, which are then compressed into 256 tokens via a pixel shuffle module before being passed to the Large Language Model (LLM). In InternVL3.5-Flash, as shown in the Figure below, an additional pixel shuffle module with a higher compression rate is included, enabling the compression of visual tokens down to 64 tokens. For each patch, the patch router determines the appropriate compression rate by assessing its semantic richness, and routes it to the corresponding pixel shuffle module accordingly. Benefiting from this patch-aware compression mechanism, InternVL3.5-Flash is able to reduce the number of visual tokens by 50\% while maintaining nearly 100\% of the performance of InternVL3.5.

During the pre-training stage, we update all model parameters jointly using a combination of large-scale text and multimodal corpora. Specifically, given an arbitrary training sample consisting of a multimodal token sequence \\(\mathbf{x}=\left(x_1, x_2, \ldots, x_L\right)\\), the next token prediction (NTP) loss is calculated on each text token as follows:

$$ \mathcal{L}_{i} = -\log p_{\theta}\left(x_i \mid x_1, \ldots, x_{i-1}\right), $$

where \\(x_i\\) is the predicted token and the prefix tokens \\(\{x_1, x_2, \ldots, x_{i-1}\}\\) can be either text tokens or image tokens. Notably, for conversation samples, only response tokens are included in the calculation of the loss. Additionally, to mitigate bias toward either longer or shorter responses during training, we adopt square-root averaging to re-weight the NTP loss as follows:

$$ \mathcal{L}_{i}^{'} = \frac{w_i}{\sum_j w_j} \cdot \mathcal{L}_i, \quad w_i = \frac{1}{N^{0.5}}, $$

where \\(N\\) denotes the number of tokens in the training sample on which the loss is calculated. Random JPEG compression is also applied to enhance the model's real-world performance.
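The patch-level pixel-shuffle compression described above (1024 β†’ 256 tokens, or 64 tokens on the Flash path) amounts to folding spatial neighborhoods of tokens into the channel dimension; a NumPy sketch under that interpretation:

```python
import numpy as np

def pixel_shuffle_compress(tokens, grid, factor=2):
    """Merge each factor x factor neighborhood of visual tokens into one
    token with factor**2 times the channels, reducing the token count by
    factor**2. tokens: (grid*grid, C) patch features laid out row-major."""
    n, c = tokens.shape
    assert n == grid * grid and grid % factor == 0
    x = tokens.reshape(grid, grid, c)
    x = x.reshape(grid // factor, factor, grid // factor, factor, c)
    x = x.transpose(0, 2, 1, 3, 4)  # group factor x factor neighborhoods
    return x.reshape((grid // factor) ** 2, factor * factor * c)

# 1024 tokens (32x32 grid) -> 256 tokens; the higher-rate module used on
# the InternVL3.5-Flash path compresses the same grid down to 64 tokens.
feats = np.random.randn(1024, 64)
compressed = pixel_shuffle_compress(feats, grid=32)         # (256, 256)
flash = pixel_shuffle_compress(feats, grid=32, factor=4)    # (64, 1024)
```

In InternVL3.5-Flash the patch router chooses between the two modules per patch; here both paths are simply applied to the same features to show the shape arithmetic.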

license:apache-2.0
73
2

PVC-InternVL2-8B

license:mit
72
9

Mini-InternVL2-4B-DA-Medical

license:mit
72
6

Mono-InternVL-2B-S1-3

license:mit
72
1

InternViT-6B-448px-V1-0

license:mit
71
9

VisualPRM-8B-v1_1

license:mit
70
9

internimage_l_22k_384

license:mit
70
1

InternVL3-78B-Pretrained

β€”
70
1

Mini-InternVL2-2B-DA-Medical

license:mit
70
0

Docopilot-8B

license:mit
69
3

Mono-InternVL-2B-S1-2

license:mit
69
1

InternVL3_5-241B-A28B-Pretrained

[\[πŸ“‚ GitHub\]](https://github.com/OpenGVLab/InternVL) [\[πŸ“œ InternVL 1.0\]](https://huggingface.co/papers/2312.14238) [\[πŸ“œ InternVL 1.5\]](https://huggingface.co/papers/2404.16821) [\[πŸ“œ InternVL 2.5\]](https://huggingface.co/papers/2412.05271) [\[πŸ“œ InternVL2.5-MPO\]](https://huggingface.co/papers/2411.10442) [\[πŸ“œ InternVL3\]](https://huggingface.co/papers/2504.10479) [\[πŸ“œ InternVL3.5\]](https://huggingface.co/papers/2508.18265) [\[πŸ†• Blog\]](https://internvl.github.io/blog/) [\[πŸ—¨οΈ Chat Demo\]](https://chat.intern-ai.org.cn/) [\[πŸš€ Quick Start\]](#quick-start) [\[πŸ“– Documents\]](https://internvl.readthedocs.io/en/latest/) We introduce InternVL3.5, a new family of open-source multimodal models that significantly advances versatility, reasoning capability, and inference efficiency along the InternVL series. A key innovation is the Cascade Reinforcement Learning (Cascade RL) framework, which enhances reasoning through a two-stage process: offline RL for stable convergence and online RL for refined alignment. This coarse-to-fine training strategy leads to substantial improvements on downstream reasoning tasks, e.g., MMMU and MathVista. To optimize efficiency, we propose a Visual Resolution Router (ViR) that dynamically adjusts the resolution of visual tokens without compromising performance. Coupled with ViR, our Decoupled Vision-Language Deployment (DvD) strategy separates the vision encoder and language model across different GPUs, effectively balancing computational load. These contributions collectively enable InternVL3.5 to achieve up to a +16.0\% gain in overall reasoning performance and a 4.05 \\(\times\\) inference speedup compared to its predecessor, i.e., InternVL3. In addition, InternVL3.5 supports novel capabilities such as GUI interaction and embodied agency. 
Notably, our largest model, i.e., InternVL3.5-241B-A28B, attains state-of-the-art results among open-source MLLMs across general multimodal, reasoning, text, and agentic tasksβ€”narrowing the performance gap with leading commercial models like GPT-5. All models and code are publicly released. > Hatched bars represent closed-source commercial models. We report average scores on a set of multimodal general, reasoning, text, and agentic benchmarks: MMBench v1.1 (en), MMStar,BLINK, HallusionBench, AI2D, OCRBench, MMVet, MME-RealWorld (en), MVBench, VideoMME, MMMU, MathVista, MathVision, MathVerse, DynaMath, WeMath, LogicVista, MATH500, AIME24, AIME25, GPQA, MMLU-Pro, GAOKAO, IFEval, SGP-Bench, VSI-Bench, ERQA, SpaCE-10, and OmniSpatial. In the following table, we provide an overview of the InternVL3.5 series. To maintain consistency with earlier generations, we provide two model formats: the GitHub format, consistent with prior releases, and the HF format, aligned with the official Transformers standard. > If you want to convert the checkpoint between these two formats, please refer to the scripts about custom2hf and hf2custom. 
| Model | #Vision Param | #Language Param | #Total Param | HF Link | ModelScope Link | | --------------------- | ------------- | --------------- | ------------ | ------------------------------------------------------------------------------ | ---------------------------------------------------------------------------------------- | | InternVL3.5-1B | 0.3B | 0.8B | 1.1B | πŸ€— link | πŸ€– link | | InternVL3.5-2B | 0.3B | 2.0B | 2.3B | πŸ€— link | πŸ€– link | | InternVL3.5-4B | 0.3B | 4.4B | 4.7B | πŸ€— link | πŸ€– link | | InternVL3.5-8B | 0.3B | 8.2B | 8.5B | πŸ€— link | πŸ€– link | | InternVL3.5-14B | 0.3B | 14.8B | 15.1B | πŸ€— link | πŸ€– link | | InternVL3.5-38B | 5.5B | 32.8B | 38.4B | πŸ€— link | πŸ€– link | | InternVL3.5-20B-A4B | 0.3B | 20.9B | 21.2B-A4B | πŸ€— link | πŸ€– link | | InternVL3.5-30B-A3B | 0.3B | 30.5B | 30.8B-A3B | πŸ€— link | πŸ€– link | | InternVL3.5-241B-A28B | 5.5B | 235.1B | 240.7B-A28B | πŸ€— link | πŸ€– link | | Model | #Vision Param | #Language Param | #Total Param | HF Link | ModelScope Link | | ------------------------ | ------------- | --------------- | ------------ | --------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------- | | InternVL3.5-1B-HF | 0.3B | 0.8B | 1.1B | πŸ€— link | πŸ€– link | | InternVL3.5-2B-HF | 0.3B | 2.0B | 2.3B | πŸ€— link | πŸ€– link | | InternVL3.5-4B-HF | 0.3B | 4.4B | 4.7B | πŸ€— link | πŸ€– link | | InternVL3.5-8B-HF | 0.3B | 8.2B | 8.5B | πŸ€— link | πŸ€– link | | InternVL3.5-14B-HF | 0.3B | 14.8B | 15.1B | πŸ€— link | πŸ€– link | | InternVL3.5-38B-HF | 5.5B | 32.8B | 38.4B | πŸ€— link | πŸ€– link | | InternVL3.5-20B-A4B-HF | 0.3B | 20.9B | 21.2B-A4B | πŸ€— link | πŸ€– link | | InternVL3.5-30B-A3B-HF | 0.3B | 30.5B | 30.8B-A3B | πŸ€— link | πŸ€– link | | InternVL3.5-241B-A28B-HF | 5.5B | 235.1B | 240.7B-A28B | πŸ€— link | πŸ€– link | > We conduct the evaluation with VLMEvalkit. 
To enable the Thinking mode of our model, please set the system prompt to `R1_SYSTEM_PROMPT`. When enabling Thinking mode, we recommend setting `do_sample=True` and `temperature=0.6` to mitigate undesired repetition.

Our training pipeline comprises three stages: Multimodal Continual Pre-Training (CPT), Supervised Fine-Tuning (SFT), and Cascade Reinforcement Learning (CascadeRL). In CascadeRL, we first fine-tune the model with Mixed Preference Optimization (MPO) in an offline RL setting, followed by GSPO in an online RL setting. For the Flash version of InternVL3.5, we additionally introduce a lightweight training stage, termed Visual Consistency Learning (ViCO), which reduces the token cost required to represent an image patch.

Here, we also open-source the model weights after the different training stages for potential research use. If you are unsure which version to use, please select the one without any suffix, as it has completed the full training pipeline.

| Model | Training Pipeline | HF Link | ModelScope Link |
| -------------------------------- | --------------------- | ------- | --------------- |
| InternVL3.5-1B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-1B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-1B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-1B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |
| InternVL3.5-2B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-2B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-2B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-2B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |
| InternVL3.5-4B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-4B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-4B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-4B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |
| InternVL3.5-8B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-8B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-8B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-8B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |
| InternVL3.5-14B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-14B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-14B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-14B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |
| InternVL3.5-30B-A3B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-30B-A3B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-30B-A3B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-30B-A3B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |
| InternVL3.5-38B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-38B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-38B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-38B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |
| InternVL3.5-241B-A28B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-241B-A28B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-241B-A28B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-241B-A28B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |

The Flash version of our model will be released as soon as possible.

`InternVL3.5`: This series of models follows the "ViT–MLP–LLM" paradigm adopted in previous versions of InternVL. We initialize the language model with the Qwen3 series and GPT-OSS, and the vision encoder with InternViT-300M and InternViT-6B. The Dynamic High Resolution strategy introduced in InternVL 1.5 is also retained in our design.

`InternVL3.5-Flash`: Compared to InternVL3.5, InternVL3.5-Flash further integrates the Visual Resolution Router (ViR), yielding a series of efficient variants suitable for resource-constrained scenarios.
Specifically, in InternVL3.5, each image patch is initially represented as 1024 visual tokens by the vision encoder, which are then compressed into 256 tokens via a pixel shuffle module before being passed to the Large Language Model (LLM). In InternVL3.5-Flash, as shown in the figure below, an additional pixel shuffle module with a higher compression rate is included, enabling compression of the visual tokens down to 64 tokens. For each patch, the patch router determines the appropriate compression rate by assessing its semantic richness and routes it to the corresponding pixel shuffle module accordingly. Benefiting from this patch-aware compression mechanism, InternVL3.5-Flash reduces the number of visual tokens by 50\% while maintaining nearly 100\% of the performance of InternVL3.5.

During the pre-training stage, we update all model parameters jointly using a combination of large-scale text and multimodal corpora. Specifically, given a training sample consisting of a multimodal token sequence \\(\mathbf{x}=\left(x_1, x_2, \ldots, x_L\right)\\), the next token prediction (NTP) loss is calculated on each text token as follows:

$$
\mathcal{L}_{i}=-\log p_{\theta}\left(x_i \mid x_1, \ldots, x_{i-1}\right),
$$

where \\(x_i\\) is the predicted token and the prefix tokens \\(\{x_1, x_2, \ldots, x_{i-1}\}\\) can be either text tokens or image tokens. Notably, for conversation samples, only response tokens are included in the loss. Additionally, to mitigate bias toward either longer or shorter responses during training, we adopt square-root averaging to re-weight the NTP loss as follows:

$$
\mathcal{L}_{i}^{'} = \frac{w_{i}}{\sum_{j} w_{j}} \cdot \mathcal{L}_{i}, \quad w_{i} = \frac{1}{N^{0.5}},
$$

where \\(N\\) denotes the number of tokens in the training sample on which the loss is calculated. Random JPEG compression is also included to enhance the model's real-world performance.
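The token-count arithmetic above (1024 ViT tokens per patch, compressed to 256 by default or to 64 in the Flash variant) can be sketched as a space-to-depth style reshape. This is a minimal illustration, not the actual module; the function name and channel sizes are placeholders:

```python
import numpy as np

def pixel_shuffle_compress(tokens: np.ndarray, r: int) -> np.ndarray:
    """Merge each r x r neighborhood of a square token grid into one token,
    reducing the token count by r**2 while growing the channel dim by r**2."""
    n, c = tokens.shape
    h = w = int(np.sqrt(n))                         # assume a square token grid
    grid = tokens.reshape(h // r, r, w // r, r, c)  # split the grid into r x r blocks
    grid = grid.transpose(0, 2, 1, 3, 4)            # gather each block's tokens together
    return grid.reshape((h // r) * (w // r), r * r * c)

vit_tokens = np.zeros((1024, 1024), dtype=np.float32)   # 32 x 32 patch grid from the ViT
print(pixel_shuffle_compress(vit_tokens, 2).shape)      # (256, 4096): default rate
print(pixel_shuffle_compress(vit_tokens, 4).shape)      # (64, 16384): Flash routed rate
```

The 50\% figure quoted above is then the average reduction once the router mixes 256- and 64-token patches across an image.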
During the SFT phase, we adopt the same objective as in the pre-training stage and use the same square-root averaging strategy to calculate the final loss. In this stage, the context window is set to 32K tokens to accommodate long-context information. Compared to InternVL3, the SFT stage of InternVL3.5 contains more high-quality and diverse training data, derived from three sources: (1) Instruction-following data from InternVL3, which are reused to preserve broad coverage of vision-language tasks. (2) Multimodal reasoning data in the "Thinking" mode, which are included to instill long-thinking capabilities in the model. To construct such data, we first use InternVL3-78B to describe the image and then feed the description into DeepSeek-R1 to sample rollouts with detailed reasoning processes. Rollouts with an incorrect final answer are filtered out. The questions in these datasets cover various expert domains, such as mathematics and the scientific disciplines, thereby strengthening performance on different reasoning tasks. (3) Capability-expansion datasets, which endow InternVL3.5 with new skills, including GUI-based interaction, embodied interaction, and scalable vector graphics (SVG) understanding.

Cascade RL aims to combine the benefits of offline RL and online RL to progressively facilitate the post-training of MLLMs in an efficient manner. Specifically, we first fine-tune the model with an offline RL algorithm as an efficient warm-up stage; this reaches a satisfactory starting point and guarantees high-quality rollouts for the subsequent stage. We then employ an online RL algorithm to further refine the output distribution based on rollouts generated by the model itself. Compared to a single offline or online RL stage, our cascaded RL achieves significant performance improvements at a fraction of the GPU time cost. During the offline RL stage, we employ Mixed Preference Optimization (MPO) to fine-tune the model.
Specifically, the training objective of MPO is a combination of the preference loss \\(\mathcal{L}_{p}\\), the quality loss \\(\mathcal{L}_{q}\\), and the generation loss \\(\mathcal{L}_{g}\\), which can be formulated as follows:

$$
\mathcal{L}_{\text{MPO}} = w_{p} \mathcal{L}_{p} + w_{q} \mathcal{L}_{q} + w_{g} \mathcal{L}_{g},
$$

where \\(w_{*}\\) represents the weight assigned to each loss component. The DPO loss, BCO loss, and LM loss serve as the preference loss, quality loss, and generation loss, respectively.

During the online RL stage, we employ GSPO, without reference model constraints, as our online RL algorithm, which we find more effective for training both dense and mixture-of-experts (MoE) models. As in GRPO, the advantage is defined as the normalized reward across responses sampled from the same query. The training objective of GSPO is given by:

$$
\mathcal{L}_{\mathrm{GSPO}}(\theta)=\mathbb{E}_{x \sim \mathcal{D},\,\left\{y_i\right\}_{i=1}^{G} \sim \pi_{\theta_{\text{old}}}(\cdot \mid x)}\left[\frac{1}{G} \sum_{i=1}^{G} \min \left(s_{i}(\theta) \widehat{A}_{i},\ \operatorname{clip}\left(s_{i}(\theta), 1-\varepsilon, 1+\varepsilon\right) \widehat{A}_{i}\right)\right],
$$

where the importance sampling ratio \\(s_{i}(\theta)\\) is defined as the geometric mean of the per-token ratios.

> Please see our paper for more technical and experimental details.

We further include ViCO as an additional training stage to integrate the Visual Resolution Router (ViR) into InternVL3.5, thereby reducing its inference cost. The resulting efficient version of InternVL3.5 is termed InternVL3.5-Flash. In particular, ViCO comprises two stages:

`Consistency training`: In this stage, the entire model is trained to minimize the divergence between response distributions conditioned on visual tokens with different compression rates. In practice, we introduce an extra reference model, which is frozen and initialized with InternVL3.5.
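As a rough numerical sketch of the GSPO objective above, assuming per-token log-probabilities for each rollout are already available (`gspo_objective` is an illustrative name, not the official implementation):

```python
import numpy as np

def gspo_objective(logp_new, logp_old, rewards, eps=0.2):
    """Illustrative GSPO objective for one query with G sampled responses.

    logp_new / logp_old: per-token log-probs of each response under the current
    and old policies. The sequence-level ratio s_i is the geometric mean of the
    per-token ratios, i.e. exp of the mean per-token log-ratio; the advantage
    is the group-normalized reward."""
    rewards = np.asarray(rewards, dtype=np.float64)
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    terms = []
    for new, old, a in zip(logp_new, logp_old, adv):
        s = np.exp(np.mean(np.asarray(new) - np.asarray(old)))  # geometric-mean ratio
        terms.append(min(s * a, np.clip(s, 1.0 - eps, 1.0 + eps) * a))
    return float(np.mean(terms))  # objective to maximize

# With identical policies (s_i = 1), the clipped objective reduces to the
# mean group-normalized advantage, which is ~0 by construction.
print(gspo_objective([[0.0, 0.0]] * 2, [[0.0, 0.0]] * 2, [1.0, 0.0]))
```

Because the ratio is sequence-level rather than per-token, one response is either entirely inside or entirely outside the clipping band, which is the key difference from GRPO's token-level ratios.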
Given a sample, each image patch is represented as either 256 or 64 tokens, and the training objective is defined as follows:

$$
\mathcal{L}_{\text{ViCO}} = \mathbb{E}_{\xi \sim \mathcal{R}} \left[ \frac{1}{N} \sum_{i=1}^{N} \mathrm{KL} \Big( \pi_{\theta_{\text{ref}}}\left(y_i \mid y_{<i}, I\right) \,\Big\|\, \pi_{\theta}\left(y_i \mid y_{<i}, I_{\xi}\right) \Big) \right],
$$

where \\(\xi\\) denotes the compression rate sampled from \\(\mathcal{R}\\), \\(I_{\xi}\\) denotes the visual tokens under that compression rate, and \\(N\\) is the number of response tokens.

`Router training`: In this stage, only the ViR is trained, with the rest of the model frozen, to decide the appropriate compression rate for each image patch.

> Please see our paper for more technical and experimental details.

Test-time scaling (TTS) has been empirically demonstrated to be an effective approach for enhancing the reasoning capabilities of LLMs and MLLMs, particularly for complex tasks necessitating multi-step inference. In this work, we implement a comprehensive test-time scaling approach that simultaneously improves reasoning depth (i.e., deep thinking) and breadth (i.e., parallel thinking).

`Deep Thinking`: By activating the Thinking mode, we guide the model to deliberately engage in step-by-step reasoning (i.e., decomposing complex problems into logical steps and validating intermediate conclusions) before generating the final answer. This approach systematically improves the logical structure of solutions for complex problems, particularly those requiring multi-step inference, and enhances reasoning depth.

`Parallel Thinking`: Following InternVL3, for reasoning tasks we adopt a Best-of-N (BoN) strategy, employing VisualPRM-v1.1 as the critic model to select the optimal response from multiple reasoning candidates. This approach improves reasoning breadth.

> Notably, unless otherwise specified, the experimental results reported in our paper are obtained without applying TTS. Thus far, we have only applied TTS to reasoning benchmarks, since we found that the model already exhibits strong perception and understanding capabilities, and applying TTS there yields no significant improvement.

In multimodal inference, the vision encoder and the language model have distinct computational characteristics. The vision encoder, which transforms images into semantic features, is highly parallelizable and does not rely on long-term history states.
In contrast, the language model performs inference autoregressively, requiring previous states to compute the next token. This sequential property makes the language part more sensitive to memory bandwidth and latency. When MLLMs are deployed online at scale, the vision and language models often block each other, incurring additional inference cost. This effect becomes more pronounced with larger vision models or higher-resolution images. As shown in the figure above, we propose Decoupled Vision-Language Deployment (DvD) to address this issue by separating vision and language processing, with a particular focus on optimizing the prefilling stage. The vision subsystem batches and processes images to produce compact feature embeddings, which are then transmitted to the language subsystem and fused with the text context prior to decoding. This separation alleviates blocking and brings multimodal prefilling performance closer to that of pure language models. In our implementation, the ViT and MLP (and the ViR for InternVL3.5-Flash) are deployed on the vision server, while the language server executes only the LLM. Communication is unidirectional, transmitting BF16 visual features over TCP, with RDMA optionally employed for higher transmission speed. Vision processing, feature transmission, and language processing are organized into an asynchronous three-stage pipeline, enabling overlapped execution and minimizing pipeline stalls. DvD increases GPU utilization and processing efficiency on the vision side, while enabling the language server to focus exclusively on the LLM's prefilling and decoding without being blocked by vision computation. This design improves throughput and responsiveness. Moreover, the architecture supports independent hardware cost optimization for the vision and language modules, and facilitates seamless integration of new modules without requiring modifications to the language server deployment.
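The asynchronous three-stage pipeline described above can be sketched with plain queues and threads. This is a toy stand-in for the vision server, the feature link, and the language server; the stage functions are placeholders for the real ViT+MLP, TCP/RDMA transfer, and LLM prefill:

```python
import queue
import threading

STOP = object()  # sentinel that flushes the pipeline

def stage(fn, inbox, outbox):
    # Each stage consumes from its inbox, applies its function, and forwards
    # the result, so the three stages overlap on different items in flight.
    while True:
        item = inbox.get()
        if item is STOP:
            outbox.put(STOP)
            return
        outbox.put(fn(item))

encode = lambda img: f"feat({img})"          # vision server: ViT + MLP (toy)
transmit = lambda feat: feat                 # unidirectional feature transfer (toy)
prefill = lambda feat: f"prefilled[{feat}]"  # language server: LLM prefill (toy)

q0, q1, q2, q3 = (queue.Queue() for _ in range(4))
for fn, a, b in ((encode, q0, q1), (transmit, q1, q2), (prefill, q2, q3)):
    threading.Thread(target=stage, args=(fn, a, b), daemon=True).start()

for img in ("img0", "img1", "img2"):
    q0.put(img)
q0.put(STOP)

results = []
while (out := q3.get()) is not STOP:
    results.append(out)
print(results)  # order is preserved: FIFO queues with one worker per stage
```

Because each stage has its own worker, image `img1` can be encoded while `img0` is already being prefilled, which is the overlap that keeps the language server from being blocked by vision computation.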
Multi-Image Understanding & Real-World Comprehension

Comprehensive Multimodal Understanding & Multimodal Hallucination Evaluation

We provide example code to run `InternVL3.5-8B` using `transformers`. Please note that our models with up to 30B parameters can be deployed on a single A100 GPU, while the 38B model requires two A100 GPUs and the 241B-A28B model requires eight A100 GPUs.

> In most cases, both LMDeploy and vLLM can be used for model deployment. However, for InternVL3.5-20B-A4B, we recommend using vLLM, since LMDeploy does not yet support GPT-OSS.

> Please use transformers>=4.52.1 to ensure the model works normally. For the 20B version of our model, transformers>=4.55.0 is required.

To enable Thinking mode, please set the system prompt to our Thinking System Prompt. When enabling Thinking mode, we recommend setting `do_sample=True` and `temperature=0.6` to mitigate undesired repetition. Besides this method, you can also use the following code to get streamed output.

Many repositories now support fine-tuning of the InternVL series models, including InternVL, SWIFT, XTuner, and others. Please refer to their documentation for more details on fine-tuning.

LMDeploy is a toolkit for compressing, deploying, and serving LLMs & VLMs. It abstracts the complex inference process of multimodal vision-language models (VLMs) into an easy-to-use pipeline, similar to the large language model (LLM) inference pipeline.

When dealing with multiple images, you can put them all in one list. Keep in mind that multiple images will lead to a higher number of input tokens, so the size of the context window typically needs to be increased.

Conducting inference with batched prompts is quite straightforward; just place them within a list structure.

There are two ways to do multi-turn conversations with the pipeline. One is to construct messages according to the OpenAI format and use the method introduced above; the other is to use the `pipeline.chat` interface.
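As a minimal, hypothetical sketch of how the Thinking-mode settings above fit together with an OpenAI-style message list (`THINKING_SYSTEM_PROMPT` here is a placeholder name, not the official prompt text shipped with the repo):

```python
# Placeholder: substitute the official Thinking System Prompt here.
THINKING_SYSTEM_PROMPT = "..."

messages = [
    {"role": "system", "content": THINKING_SYSTEM_PROMPT},
    {"role": "user", "content": "Solve step by step: how many primes are below 20?"},
]

# Recommended sampling settings when Thinking mode is enabled,
# to mitigate undesired repetition.
generation_kwargs = {"do_sample": True, "temperature": 0.6}
```

The same `messages` structure works for both the OpenAI-format path and the `pipeline.chat` interface mentioned above.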
LMDeploy's `api_server` enables models to be easily packed into services with a single command. The provided RESTful APIs are compatible with OpenAI's interfaces. Below is an example of service startup:

To use the OpenAI-style interface, you need to install the OpenAI Python package:

This project is released under the Apache-2.0 License. This project uses the pre-trained Qwen3 as a component, which is licensed under the Apache-2.0 License.

If you find this project useful in your research, please consider citing:
