Qwen

✓ Verified • AI Startup

Alibaba Cloud's Qwen (Tongyi Qianwen) model family

390 models • 149 total models in database

Qwen2.5-7B-Instruct

Qwen2.5 is the latest series of Qwen large language models. For Qwen2.5, we release a number of base language models and instruction-tuned language models ranging from 0.5 to 72 billion parameters. Qwen2.5 brings the following improvements over Qwen2:

- Significantly more knowledge and greatly improved capabilities in coding and mathematics, thanks to our specialized expert models in these domains.
- Significant improvements in instruction following, generating long texts (over 8K tokens), understanding structured data (e.g., tables), and generating structured outputs, especially JSON. The models are also more resilient to diverse system prompts, which improves role-play implementation and condition-setting for chatbots.
- Long-context support up to 128K tokens, with generation of up to 8K tokens.
- Multilingual support for over 29 languages, including Chinese, English, French, Spanish, Portuguese, German, Italian, Russian, Japanese, Korean, Vietnamese, Thai, Arabic, and more.

This repo contains the instruction-tuned 7B Qwen2.5 model, which has the following features:

- Type: Causal Language Models
- Training Stage: Pretraining & Post-training
- Architecture: transformers with RoPE, SwiGLU, RMSNorm, and Attention QKV bias
- Number of Parameters: 7.61B
- Number of Parameters (Non-Embedding): 6.53B
- Number of Layers: 28
- Number of Attention Heads (GQA): 28 for Q and 4 for KV
- Context Length: Full 131,072 tokens and generation 8192 tokens
- Please refer to this section for detailed instructions on how to deploy Qwen2.5 for handling long texts.

For more details, please refer to our blog, GitHub, and Documentation.

The code for Qwen2.5 is included in the latest Hugging Face `transformers`, and we advise you to use the latest version. With `transformers<4.37.0`, loading the model fails with an error. The model card's quickstart uses `apply_chat_template` to show how to load the tokenizer and model and how to generate content.

The current `config.json` is set for a context length of up to 32,768 tokens. To handle inputs exceeding 32,768 tokens, we use YaRN, a technique for enhancing model length extrapolation, to maintain performance on lengthy texts. For supported frameworks, you can enable YaRN by adding a `rope_scaling` block to `config.json`.

For deployment, we recommend using vLLM. Please refer to our Documentation for usage if you are not familiar with vLLM. Presently, vLLM only supports static YaRN, which means the scaling factor remains constant regardless of input length, potentially impacting performance on shorter texts. We advise adding the `rope_scaling` configuration only when processing long contexts is required.

Detailed evaluation results are reported in this 📑 blog. For requirements on GPU memory and the respective throughput, see results here. If you find our work helpful, feel free to cite us.
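As a minimal sketch of the YaRN step described above, the helper below merges a scaling block into an already-loaded `config.json` dictionary. The key names (`rope_scaling`, `type`, `factor`, `original_max_position_embeddings`) follow the usual Hugging Face config convention and are assumptions here, as is the factor of 4.0, chosen only because 4 × 32,768 = 131,072 matches the full context length quoted above. Check the documentation of the serving framework before relying on any of them.

```python
import copy
import json


def with_yarn(config: dict, factor: float = 4.0, original_max: int = 32768) -> dict:
    """Return a copy of `config` with a YaRN scaling block added.

    The key names below are assumed, not taken from the original text.
    """
    patched = copy.deepcopy(config)
    patched["rope_scaling"] = {
        "type": "yarn",
        "factor": factor,
        "original_max_position_embeddings": original_max,
    }
    return patched


def effective_context(config: dict) -> int:
    """Context length after scaling; falls back to the unscaled limit."""
    scaling = config.get("rope_scaling")
    if not scaling:
        return config["max_position_embeddings"]
    return int(scaling["factor"] * scaling["original_max_position_embeddings"])


base = {"max_position_embeddings": 32768}  # stand-in for the shipped config.json
long_ctx = with_yarn(base)
print(json.dumps(long_ctx, indent=2))
print(effective_context(base), effective_context(long_ctx))
```

Because the scaling is static, it seems safer to keep two config files and load the patched one only for long-context workloads, in line with the advice above.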

—

9,345,395

862

Qwen2.5-VL-3B-Instruct

In the five months since Qwen2-VL’s release, numerous developers have built new models on the Qwen2-VL vision-language models and given us valuable feedback. During this period, we focused on building more useful vision-language models. Today, we are excited to introduce the latest addition to the Qwen family: Qwen2.5-VL.

Key Enhancements:

- Understand things visually: Qwen2.5-VL is not only proficient in recognizing common objects such as flowers, birds, fish, and insects, but is also highly capable of analyzing texts, charts, icons, graphics, and layouts within images.
- Being agentic: Qwen2.5-VL acts directly as a visual agent that can reason and dynamically direct tools, making it capable of computer use and phone use.
- Understanding long videos and capturing events: Qwen2.5-VL can comprehend videos of over 1 hour, and it gains a new ability to capture events by pinpointing the relevant video segments.
- Capable of visual localization in different formats: Qwen2.5-VL can accurately localize objects in an image by generating bounding boxes or points, and it can provide stable JSON outputs for coordinates and attributes.
- Generating structured outputs: for data such as scans of invoices, forms, and tables, Qwen2.5-VL supports structured outputs of their contents, which benefits uses in finance, commerce, and similar fields.

Dynamic Resolution and Frame Rate Training for Video Understanding: We extend dynamic resolution to the temporal dimension by adopting dynamic FPS sampling, enabling the model to comprehend videos at various sampling rates. Accordingly, we update mRoPE in the time dimension with IDs and absolute time alignment, enabling the model to learn temporal sequence and speed, and ultimately acquire the ability to pinpoint specific moments.

We enhance both training and inference speeds by strategically implementing window attention in the ViT. The ViT architecture is further optimized with SwiGLU and RMSNorm, aligning it with the structure of the Qwen2.5 LLM.
We have three models with 3, 7 and 72 billion parameters. This repo contains the instruction-tuned 3B Qwen2.5-VL model. For more information, visit our Blog and GitHub.

| Benchmark | InternVL2.5-4B | Qwen2-VL-7B | Qwen2.5-VL-3B |
| :--- | :---: | :---: | :---: |
| MMMU val | 52.3 | 54.1 | 53.1 |
| MMMU-Pro val | 32.7 | 30.5 | 31.6 |
| AI2D test | 81.4 | 83.0 | 81.5 |
| DocVQA test | 91.6 | 94.5 | 93.9 |
| InfoVQA test | 72.1 | 76.5 | 77.1 |
| TextVQA val | 76.8 | 84.3 | 79.3 |
| MMBench-V1.1 test | 79.3 | 80.7 | 77.6 |
| MMStar | 58.3 | 60.7 | 55.9 |
| MathVista testmini | 60.5 | 58.2 | 62.3 |
| MathVision full | 20.9 | 16.3 | 21.2 |

Video benchmark

| Benchmark | InternVL2.5-4B | Qwen2-VL-7B | Qwen2.5-VL-3B |
| :--- | :---: | :---: | :---: |
| MVBench | 71.6 | 67.0 | 67.0 |
| VideoMME | 63.6/62.3 | 69.0/63.3 | 67.6/61.5 |
| MLVU | 48.3 | - | 68.2 |
| LVBench | - | - | 43.3 |
| MMBench-Video | 1.73 | 1.44 | 1.63 |
| EgoSchema | - | - | 64.8 |
| PerceptionTest | - | - | 66.9 |
| TempCompass | - | - | 64.4 |
| LongVideoBench | 55.2 | 55.6 | 54.2 |
| CharadesSTA/mIoU | - | - | 38.8 |

Agent benchmark

| Benchmarks | Qwen2.5-VL-3B |
|-------------------------|---------------|
| ScreenSpot | 55.5 |
| ScreenSpot Pro | 23.9 |
| AITZ_EM | 76.9 |
| Android Control High_EM | 63.7 |
| Android Control Low_EM | 22.2 |
| AndroidWorld_SR | 90.8 |
| MobileMiniWob++_SR | 67.9 |

Requirements

The code for Qwen2.5-VL is included in the latest Hugging Face transformers, and we advise you to build it from source. Below, we provide simple examples to show how to use Qwen2.5-VL with 🤖 ModelScope and 🤗 Transformers.

We offer a toolkit to help you handle various types of visual input more conveniently, as if you were using an API. This includes base64, URLs, and interleaved images and videos.
The toolkit can be installed with pip. If you are not using Linux, you might not be able to install `decord` from PyPI. In that case, you can use `pip install qwen-vl-utils`, which falls back to torchvision for video processing. You can still install decord from source so that it is used when loading video.

The quickstart snippet uses the chat model with `transformers` and `qwen_vl_utils`.

Video URL compatibility largely depends on the third-party library version, as detailed in the table below. If you prefer not to use the default backend, change it by setting `FORCE_QWENVL_VIDEO_READER=torchvision` or `FORCE_QWENVL_VIDEO_READER=decord`.

| Backend | HTTP | HTTPS |
|-------------|------|-------|
| torchvision >= 0.19.0 | ✅ | ✅ |

🤖 ModelScope

We strongly advise users, especially those in mainland China, to use ModelScope. `snapshot_download` can help you resolve issues with downloading checkpoints.

For input images, we support local files, base64, and URLs. For videos, we currently only support local files.

The model supports a wide range of resolution inputs. By default, it uses the native resolution for input, but higher resolutions can enhance performance at the cost of more computation. Users can set the minimum and maximum number of pixels to achieve an optimal configuration for their needs, such as a token count range of 256-1280, to balance speed and memory usage.

We also provide two methods for fine-grained control over the image size input to the model:

1. Define `min_pixels` and `max_pixels`: Images will be resized to maintain their aspect ratio within the range of `min_pixels` and `max_pixels`.
2. Specify exact dimensions: Directly set `resized_height` and `resized_width`. These values will be rounded to the nearest multiple of 28.

The current `config.json` is set for a context length of up to 32,768 tokens.
To handle inputs exceeding 32,768 tokens, we use YaRN, a technique for enhancing model length extrapolation, to maintain performance on lengthy texts. For supported frameworks, you can enable YaRN by adding the corresponding configuration to `config.json`.

Note, however, that this method has a significant impact on the performance of temporal and spatial localization tasks and is therefore not recommended. For long video inputs, since MRoPE itself is more economical with position IDs, `max_position_embeddings` can instead be raised directly to a larger value, such as 64k.

If you find our work helpful, feel free to cite us.
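The token-count range quoted above for the pixel limits can be turned into concrete numbers. The sketch below assumes that one visual token covers a 28 × 28 pixel area, which is consistent with the multiple-of-28 rounding mentioned above but is still an assumption; `pixel_budget` is a hypothetical helper, not part of any toolkit.

```python
from typing import Tuple

PATCH = 28  # assumed side length, in pixels, of the area covered by one visual token


def pixel_budget(min_tokens: int, max_tokens: int, patch: int = PATCH) -> Tuple[int, int]:
    """Convert a visual-token range into (min_pixels, max_pixels)."""
    if min_tokens <= 0 or max_tokens < min_tokens:
        raise ValueError("expected 0 < min_tokens <= max_tokens")
    area = patch * patch  # pixels per token under the assumption above
    return min_tokens * area, max_tokens * area


# The 256-1280 token range mentioned in the text.
min_pixels, max_pixels = pixel_budget(256, 1280)
print(min_pixels, max_pixels)
```

A narrower range lowers memory use and latency at the cost of detail in high-resolution images, so the two values are worth tuning per workload.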

—

8,219,178

551

Qwen3-0.6B

—

7,302,619

777

Qwen3-4B-Instruct-2507

---
library_name: transformers
license: apache-2.0
license_link: https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507/blob/main/LICENSE
pipeline_tag: text-generation
---

—

5,030,248

464

Qwen2.5-VL-7B-Instruct

In the five months since Qwen2-VL’s release, numerous developers have built new models on the Qwen2-VL vision-language models and given us valuable feedback. During this period, we focused on building more useful vision-language models. Today, we are excited to introduce the latest addition to the Qwen family: Qwen2.5-VL.

Key Enhancements:

- Understand things visually: Qwen2.5-VL is not only proficient in recognizing common objects such as flowers, birds, fish, and insects, but is also highly capable of analyzing texts, charts, icons, graphics, and layouts within images.
- Being agentic: Qwen2.5-VL acts directly as a visual agent that can reason and dynamically direct tools, making it capable of computer use and phone use.
- Understanding long videos and capturing events: Qwen2.5-VL can comprehend videos of over 1 hour, and it gains a new ability to capture events by pinpointing the relevant video segments.
- Capable of visual localization in different formats: Qwen2.5-VL can accurately localize objects in an image by generating bounding boxes or points, and it can provide stable JSON outputs for coordinates and attributes.
- Generating structured outputs: for data such as scans of invoices, forms, and tables, Qwen2.5-VL supports structured outputs of their contents, which benefits uses in finance, commerce, and similar fields.

Dynamic Resolution and Frame Rate Training for Video Understanding: We extend dynamic resolution to the temporal dimension by adopting dynamic FPS sampling, enabling the model to comprehend videos at various sampling rates. Accordingly, we update mRoPE in the time dimension with IDs and absolute time alignment, enabling the model to learn temporal sequence and speed, and ultimately acquire the ability to pinpoint specific moments.

We enhance both training and inference speeds by strategically implementing window attention in the ViT. The ViT architecture is further optimized with SwiGLU and RMSNorm, aligning it with the structure of the Qwen2.5 LLM.
We have three models with 3, 7 and 72 billion parameters. This repo contains the instruction-tuned 7B Qwen2.5-VL model. For more information, visit our Blog and GitHub.

| Benchmark | InternVL2.5-8B | MiniCPM-o 2.6 | GPT-4o-mini | Qwen2-VL-7B | Qwen2.5-VL-7B |
| :--- | :---: | :---: | :---: | :---: | :---: |
| MMMU val | 56 | 50.4 | 60 | 54.1 | 58.6 |
| MMMU-Pro val | 34.3 | - | 37.6 | 30.5 | 41.0 |
| DocVQA test | 93 | 93 | - | 94.5 | 95.7 |
| InfoVQA test | 77.6 | - | - | 76.5 | 82.6 |
| ChartQA test | 84.8 | - | - | 83.0 | 87.3 |
| TextVQA val | 79.1 | 80.1 | - | 84.3 | 84.9 |
| OCRBench | 822 | 852 | 785 | 845 | 864 |
| CC_OCR | 57.7 | | | 61.6 | 77.8 |
| MMStar | 62.8 | | | 60.7 | 63.9 |
| MMBench-V1.1-En test | 79.4 | 78.0 | 76.0 | 80.7 | 82.6 |
| MMT-Bench test | - | - | - | 63.7 | 63.6 |
| MMStar | 61.5 | 57.5 | 54.8 | 60.7 | 63.9 |
| MMVet GPT-4-Turbo | 54.2 | 60.0 | 66.9 | 62.0 | 67.1 |
| HallBench avg | 45.2 | 48.1 | 46.1 | 50.6 | 52.9 |
| MathVista testmini | 58.3 | 60.6 | 52.4 | 58.2 | 68.2 |
| MathVision | - | - | - | 16.3 | 25.07 |

| Benchmark | Qwen2-VL-7B | Qwen2.5-VL-7B |
| :--- | :---: | :---: |
| MVBench | 67.0 | 69.6 |
| PerceptionTest test | 66.9 | 70.5 |
| Video-MME wo/w subs | 63.3/69.0 | 65.1/71.6 |
| LVBench | | 45.3 |
| LongVideoBench | | 54.7 |
| MMBench-Video | 1.44 | 1.79 |
| TempCompass | | 71.7 |
| MLVU | | 70.2 |
| CharadesSTA/mIoU | | 43.6 |

Agent benchmark

| Benchmarks | Qwen2.5-VL-7B |
|-------------------------|---------------|
| ScreenSpot | 84.7 |
| ScreenSpot Pro | 29.0 |
| AITZ_EM | 81.9 |
| Android Control High_EM | 60.1 |
| Android Control Low_EM | 93.7 |
| AndroidWorld_SR | 25.5 |
| MobileMiniWob++_SR | 91.4 |

Requirements

The code for Qwen2.5-VL is included in the latest Hugging Face transformers, and we advise you to build it from source. Below, we provide simple examples to show how to use Qwen2.5-VL with 🤖 ModelScope and 🤗 Transformers.
We offer a toolkit to help you handle various types of visual input more conveniently, as if you were using an API. This includes base64, URLs, and interleaved images and videos. The toolkit can be installed with pip. If you are not using Linux, you might not be able to install `decord` from PyPI. In that case, you can use `pip install qwen-vl-utils`, which falls back to torchvision for video processing. You can still install decord from source so that it is used when loading video.

The quickstart snippet uses the chat model with `transformers` and `qwen_vl_utils`.

Video URL compatibility largely depends on the third-party library version, as detailed in the table below. If you prefer not to use the default backend, change it by setting `FORCE_QWENVL_VIDEO_READER=torchvision` or `FORCE_QWENVL_VIDEO_READER=decord`.

| Backend | HTTP | HTTPS |
|-------------|------|-------|
| torchvision >= 0.19.0 | ✅ | ✅ |

🤖 ModelScope

We strongly advise users, especially those in mainland China, to use ModelScope. `snapshot_download` can help you resolve issues with downloading checkpoints.

For input images, we support local files, base64, and URLs. For videos, we currently only support local files.

The model supports a wide range of resolution inputs. By default, it uses the native resolution for input, but higher resolutions can enhance performance at the cost of more computation. Users can set the minimum and maximum number of pixels to achieve an optimal configuration for their needs, such as a token count range of 256-1280, to balance speed and memory usage.

We also provide two methods for fine-grained control over the image size input to the model:

1. Define `min_pixels` and `max_pixels`: Images will be resized to maintain their aspect ratio within the range of `min_pixels` and `max_pixels`.
2. Specify exact dimensions: Directly set `resized_height` and `resized_width`. These values will be rounded to the nearest multiple of 28.

The current `config.json` is set for a context length of up to 32,768 tokens. To handle inputs exceeding 32,768 tokens, we use YaRN, a technique for enhancing model length extrapolation, to maintain performance on lengthy texts. For supported frameworks, you can add the following to `config.json` to enable YaRN:

{ ..., "type": "yarn", "mrope_section": [ 16, 24, 24 ], "factor": 4, "original_max_position_embeddings": 32768 }

Note, however, that this method has a significant impact on the performance of temporal and spatial localization tasks and is therefore not recommended. For long video inputs, since MRoPE itself is more economical with position IDs, `max_position_embeddings` can instead be raised directly to a larger value, such as 64k.

If you find our work helpful, feel free to cite us.
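The alternative suggested above for long video inputs, raising the position limit rather than enabling YaRN, amounts to a one-field config change. This is a minimal sketch under two assumptions: that the field is spelled `max_position_embeddings`, and that "64k" means 64 × 1024 = 65,536. `raise_position_limit` is a hypothetical helper.

```python
import copy


def raise_position_limit(config: dict, max_positions: int = 64 * 1024) -> dict:
    """Return a copy of `config` with a larger `max_position_embeddings`.

    64 * 1024 is one reading of "64k"; it is an assumption, not a documented value.
    """
    current = config.get("max_position_embeddings", 0)
    if max_positions <= current:
        raise ValueError(
            f"new limit {max_positions} does not exceed current limit {current}"
        )
    patched = copy.deepcopy(config)  # leave the caller's dictionary untouched
    patched["max_position_embeddings"] = max_positions
    return patched


shipped = {"max_position_embeddings": 32768}  # stand-in for the shipped config.json
long_video = raise_position_limit(shipped)
print(long_video["max_position_embeddings"])
```

Since no rotary scaling is added, this route leaves position encoding unchanged, which is presumably why it is preferred over YaRN when localization quality matters.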

—

4,998,699

1,338

Qwen3-Embedding-0.6B

The Qwen3 Embedding model series is the latest proprietary model of the Qwen family, specifically designed for text embedding and ranking tasks. Building upon the dense foundational models of the Qwen3 series, it provides a comprehensive range of text embedding and reranking models in various sizes (0.6B, 4B, and 8B). This series inherits the exceptional multilingual capabilities, long-text understanding, and reasoning skills of its foundational model. The Qwen3 Embedding series represents significant advancements in multiple text embedding and ranking tasks, including text retrieval, code retrieval, text classification, text clustering, and bitext mining.

- Exceptional Versatility: The embedding model has achieved state-of-the-art performance across a wide range of downstream application evaluations. The 8B embedding model ranks No. 1 on the MTEB multilingual leaderboard (as of June 5, 2025, score 70.58), while the reranking model excels in various text retrieval scenarios.
- Comprehensive Flexibility: The Qwen3 Embedding series offers a full spectrum of sizes (from 0.6B to 8B) for both embedding and reranking models, catering to diverse use cases that prioritize efficiency and effectiveness. Developers can seamlessly combine these two modules. Additionally, the embedding model allows for flexible vector definitions across all dimensions, and both embedding and reranking models support user-defined instructions to enhance performance for specific tasks, languages, or scenarios.
- Multilingual Capability: The Qwen3 Embedding series offers support for over 100 languages, thanks to the multilingual capabilities of Qwen3 models. This includes various programming languages, and provides robust multilingual, cross-lingual, and code retrieval capabilities.
- Model Type: Text Embedding
- Supported Languages: 100+ Languages
- Number of Parameters: 0.6B
- Context Length: 32k
- Embedding Dimension: Up to 1024, supports user-defined output dimensions ranging from 32 to 1024

For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our blog and GitHub.

| Model Type | Models | Size | Layers | Sequence Length | Embedding Dimension | MRL Support | Instruction Aware |
|------------------|----------------------|------|--------|-----------------|---------------------|-------------|----------------|
| Text Embedding | Qwen3-Embedding-0.6B | 0.6B | 28 | 32K | 1024 | Yes | Yes |
| Text Embedding | Qwen3-Embedding-4B | 4B | 36 | 32K | 2560 | Yes | Yes |
| Text Embedding | Qwen3-Embedding-8B | 8B | 36 | 32K | 4096 | Yes | Yes |
| Text Reranking | Qwen3-Reranker-0.6B | 0.6B | 28 | 32K | - | - | Yes |
| Text Reranking | Qwen3-Reranker-4B | 4B | 36 | 32K | - | - | Yes |
| Text Reranking | Qwen3-Reranker-8B | 8B | 36 | 32K | - | - | Yes |

> Note:
> - `MRL Support` indicates whether the embedding model supports custom dimensions for the final embedding.
> - `Instruction Aware` notes whether the embedding or reranking model supports customizing the input instruction according to different tasks.
> - Our evaluation indicates that, for most downstream tasks, using instructions (instruct) typically yields an improvement of 1% to 5% compared to not using them. Therefore, we recommend that developers create tailored instructions specific to their tasks and scenarios. In multilingual contexts, we also advise users to write their instructions in English, as most instructions used during model training were originally written in English.

With Transformers versions earlier than 4.51.0, you may encounter an error when loading the model.

📌 Tip: We recommend that developers customize the `instruct` according to their specific scenarios, tasks, and languages.
Our tests have shown that in most retrieval scenarios, not using an `instruct` on the query side can lead to a drop in retrieval performance of approximately 1% to 5%.

When the model is served over HTTP, embeddings are generated by sending an HTTP POST request.

| Model | Size | Mean (Task) | Mean (Type) | Bitext Mining | Class. | Clust. | Inst. Retri. | Multi. Class. | Pair. Class. | Rerank | Retri. | STS |
|----------------------------------|:-------:|:-------------:|:-------------:|:--------------:|:--------:|:--------:|:--------------:|:---------------:|:--------------:|:--------:|:--------:|:------:|
| NV-Embed-v2 | 7B | 56.29 | 49.58 | 57.84 | 57.29 | 40.80 | 1.04 | 18.63 | 78.94 | 63.82 | 56.72 | 71.10 |
| GritLM-7B | 7B | 60.92 | 53.74 | 70.53 | 61.83 | 49.75 | 3.45 | 22.77 | 79.94 | 63.78 | 58.31 | 73.33 |
| BGE-M3 | 0.6B | 59.56 | 52.18 | 79.11 | 60.35 | 40.88 | -3.11 | 20.1 | 80.76 | 62.79 | 54.60 | 74.12 |
| multilingual-e5-large-instruct | 0.6B | 63.22 | 55.08 | 80.13 | 64.94 | 50.75 | -0.40 | 22.91 | 80.86 | 62.61 | 57.12 | 76.81 |
| gte-Qwen2-1.5B-instruct | 1.5B | 59.45 | 52.69 | 62.51 | 58.32 | 52.05 | 0.74 | 24.02 | 81.58 | 62.58 | 60.78 | 71.61 |
| gte-Qwen2-7b-Instruct | 7B | 62.51 | 55.93 | 73.92 | 61.55 | 52.77 | 4.94 | 25.48 | 85.13 | 65.55 | 60.08 | 73.98 |
| text-embedding-3-large | - | 58.93 | 51.41 | 62.17 | 60.27 | 46.89 | -2.68 | 22.03 | 79.17 | 63.89 | 59.27 | 71.68 |
| Cohere-embed-multilingual-v3.0 | - | 61.12 | 53.23 | 70.50 | 62.95 | 46.89 | -1.89 | 22.74 | 79.88 | 64.07 | 59.16 | 74.80 |
| Gemini Embedding | - | 68.37 | 59.59 | 79.28 | 71.82 | 54.59 | 5.18 | 29.16 | 83.63 | 65.58 | 67.71 | 79.40 |
| Qwen3-Embedding-0.6B | 0.6B | 64.33 | 56.00 | 72.22 | 66.83 | 52.33 | 5.09 | 24.59 | 80.83 | 61.41 | 64.64 | 76.17 |
| Qwen3-Embedding-4B | 4B | 69.45 | 60.86 | 79.36 | 72.33 | 57.15 | 11.56 | 26.77 | 85.05 | 65.08 | 69.60 | 80.86 |
| Qwen3-Embedding-8B | 8B | 70.58 | 61.69 | 80.89 | 74.00 | 57.65 | 10.06 | 28.66 | 86.40 | 65.63 | 70.88 | 81.08 |

> Note: For compared models, the scores are retrieved from the MTEB online leaderboard on May 24th, 2025.

| MTEB English / Models | Param. | Mean(Task) | Mean(Type) | Class. | Clust. | Pair Class. | Rerank. | Retri. | STS | Summ. |
|--------------------------------|:--------:|:------------:|:------------:|:--------:|:--------:|:-------------:|:---------:|:--------:|:-------:|:-------:|
| multilingual-e5-large-instruct | 0.6B | 65.53 | 61.21 | 75.54 | 49.89 | 86.24 | 48.74 | 53.47 | 84.72 | 29.89 |
| NV-Embed-v2 | 7.8B | 69.81 | 65.00 | 87.19 | 47.66 | 88.69 | 49.61 | 62.84 | 83.82 | 35.21 |
| GritLM-7B | 7.2B | 67.07 | 63.22 | 81.25 | 50.82 | 87.29 | 49.59 | 54.95 | 83.03 | 35.65 |
| gte-Qwen2-1.5B-instruct | 1.5B | 67.20 | 63.26 | 85.84 | 53.54 | 87.52 | 49.25 | 50.25 | 82.51 | 33.94 |
| stella_en_1.5B_v5 | 1.5B | 69.43 | 65.32 | 89.38 | 57.06 | 88.02 | 50.19 | 52.42 | 83.27 | 36.91 |
| gte-Qwen2-7B-instruct | 7.6B | 70.72 | 65.77 | 88.52 | 58.97 | 85.9 | 50.47 | 58.09 | 82.69 | 35.74 |
| gemini-embedding-exp-03-07 | - | 73.3 | 67.67 | 90.05 | 59.39 | 87.7 | 48.59 | 64.35 | 85.29 | 38.28 |
| Qwen3-Embedding-0.6B | 0.6B | 70.70 | 64.88 | 85.76 | 54.05 | 84.37 | 48.18 | 61.83 | 86.57 | 33.43 |
| Qwen3-Embedding-4B | 4B | 74.60 | 68.10 | 89.84 | 57.51 | 87.01 | 50.76 | 68.46 | 88.72 | 34.39 |
| Qwen3-Embedding-8B | 8B | 75.22 | 68.71 | 90.43 | 58.57 | 87.52 | 51.56 | 69.44 | 88.58 | 34.83 |

| C-MTEB | Param. | Mean(Task) | Mean(Type) | Class. | Clust. | Pair Class. | Rerank. | Retr. | STS |
|------------------|--------|------------|------------|--------|--------|-------------|---------|-------|-------|
| multilingual-e5-large-instruct | 0.6B | 58.08 | 58.24 | 69.80 | 48.23 | 64.52 | 57.45 | 63.65 | 45.81 |
| bge-multilingual-gemma2 | 9B | 67.64 | 75.31 | 59.30 | 86.67 | 68.28 | 73.73 | 55.19 | - |
| gte-Qwen2-1.5B-instruct | 1.5B | 67.12 | 67.79 | 72.53 | 54.61 | 79.5 | 68.21 | 71.86 | 60.05 |
| gte-Qwen2-7B-instruct | 7.6B | 71.62 | 72.19 | 75.77 | 66.06 | 81.16 | 69.24 | 75.70 | 65.20 |
| ritrieve_zh_v1 | 0.3B | 72.71 | 73.85 | 76.88 | 66.5 | 85.98 | 72.86 | 76.97 | 63.92 |
| Qwen3-Embedding-0.6B | 0.6B | 66.33 | 67.45 | 71.40 | 68.74 | 76.42 | 62.58 | 71.03 | 54.52 |
| Qwen3-Embedding-4B | 4B | 72.27 | 73.51 | 75.46 | 77.89 | 83.34 | 66.05 | 77.03 | 61.26 |
| Qwen3-Embedding-8B | 8B | 73.84 | 75.00 | 76.97 | 80.08 | 84.23 | 66.99 | 78.21 | 63.53 |

If you find our work helpful, feel free to cite us.
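The user-defined output dimension (32 to 1024 for the 0.6B model) can be illustrated in plain Python. The sketch assumes that a smaller embedding is obtained by keeping the leading components of the full vector and re-normalizing, which is the usual practice for MRL-style embeddings; the text above does not spell out the mechanism. `truncate_embedding` and `cosine` are hypothetical helpers, and the vectors are synthetic stand-ins, not model outputs.

```python
import math
from typing import List, Sequence


def truncate_embedding(vector: Sequence[float], dim: int, min_dim: int = 32) -> List[float]:
    """Keep the first `dim` components and rescale to unit length."""
    if not min_dim <= dim <= len(vector):
        raise ValueError(f"dim must be between {min_dim} and {len(vector)}")
    head = list(vector[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in head]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Synthetic 1024-dimensional vectors standing in for a query and a document.
full_query = [math.sin(i + 1.0) for i in range(1024)]
full_doc = [math.sin(i + 1.1) for i in range(1024)]

small_query = truncate_embedding(full_query, 256)
small_doc = truncate_embedding(full_doc, 256)
print(len(small_query), cosine(small_query, small_doc))
```

Smaller dimensions reduce index size and search cost; whether the accuracy loss is acceptable has to be measured on the target task.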

—

4,895,329

727

Qwen3-8B

---
library_name: transformers
license: apache-2.0
license_link: https://huggingface.co/Qwen/Qwen3-8B/blob/main/LICENSE
pipeline_tag: text-generation
base_model:
- Qwen/Qwen3-8B-Base
---

—

3,648,029

732

Qwen2.5-3B-Instruct

—

3,473,371

326

Qwen2.5-1.5B-Instruct

—

3,267,796

538

Qwen2-VL-2B-Instruct

We're excited to unveil Qwen2-VL, the latest iteration of our Qwen-VL model, representing nearly a year of innovation.

- SoTA understanding of images of various resolutions and ratios: Qwen2-VL achieves state-of-the-art performance on visual understanding benchmarks, including MathVista, DocVQA, RealWorldQA, MTVQA, etc.
- Understanding videos of 20min+: Qwen2-VL can understand videos over 20 minutes for high-quality video-based question answering, dialog, content creation, etc.
- Agent that can operate your mobiles, robots, etc.: with the abilities of complex reasoning and decision making, Qwen2-VL can be integrated with devices like mobile phones, robots, etc., for automatic operation based on visual environment and text instructions.
- Multilingual Support: to serve global users, besides English and Chinese, Qwen2-VL now supports the understanding of texts in different languages inside images, including most European languages, Japanese, Korean, Arabic, Vietnamese, etc.
- Naive Dynamic Resolution: Unlike before, Qwen2-VL can handle arbitrary image resolutions, mapping them into a dynamic number of visual tokens, offering a more human-like visual processing experience.
- Multimodal Rotary Position Embedding (M-ROPE): Decomposes positional embedding into parts to capture 1D textual, 2D visual, and 3D video positional information, enhancing its multimodal processing capabilities.

We have three models with 2, 7 and 72 billion parameters. This repo contains the instruction-tuned 2B Qwen2-VL model. For more information, visit our Blog and GitHub.
| Benchmark | InternVL2-2B | MiniCPM-V 2.0 | Qwen2-VL-2B |
| :--- | :---: | :---: | :---: |
| MMMU val | 36.3 | 38.2 | 41.1 |
| DocVQA test | 86.9 | - | 90.1 |
| InfoVQA test | 58.9 | - | 65.5 |
| ChartQA test | 76.2 | - | 73.5 |
| TextVQA val | 73.4 | - | 79.7 |
| OCRBench | 781 | 605 | 794 |
| MTVQA | - | - | 20.0 |
| VCR en easy | - | - | 81.45 |
| VCR zh easy | - | - | 46.16 |
| RealWorldQA | 57.3 | 55.8 | 62.9 |
| MME sum | 1876.8 | 1808.6 | 1872.0 |
| MMBench-EN test | 73.2 | 69.1 | 74.9 |
| MMBench-CN test | 70.9 | 66.5 | 73.5 |
| MMBench-V1.1 test | 69.6 | 65.8 | 72.2 |
| MMT-Bench test | - | - | 54.5 |
| MMStar | 49.8 | 39.1 | 48.0 |
| MMVet GPT-4-Turbo | 39.7 | 41.0 | 49.5 |
| HallBench avg | 38.0 | 36.1 | 41.7 |
| MathVista testmini | 46.0 | 39.8 | 43.0 |
| MathVision | - | - | 12.4 |

| Benchmark | Qwen2-VL-2B |
| :--- | :---: |
| MVBench | 63.2 |
| PerceptionTest test | 53.9 |
| EgoSchema test | 54.9 |
| Video-MME wo/w subs | 55.6/60.4 |

Requirements

The code for Qwen2-VL is included in the latest Hugging Face transformers, and we advise you to build it from source with the command `pip install git+https://github.com/huggingface/transformers`; otherwise you might encounter an error when loading the model.

Quickstart

We offer a toolkit to help you handle various types of visual input more conveniently. This includes base64, URLs, and interleaved images and videos. The toolkit can be installed with pip, and the quickstart snippet uses the chat model with `transformers` and `qwen_vl_utils`.

For input images, we support local files, base64, and URLs. For videos, we currently only support local files.

The model supports a wide range of resolution inputs. By default, it uses the native resolution for input, but higher resolutions can enhance performance at the cost of more computation. Users can set the minimum and maximum number of pixels to achieve an optimal configuration for their needs, such as a token count range of 256-1280, to balance speed and memory usage.
We also provide two methods for fine-grained control over the image size input to the model:

1. Define `min_pixels` and `max_pixels`: Images will be resized to maintain their aspect ratio within the range of `min_pixels` and `max_pixels`.
2. Specify exact dimensions: Directly set `resized_height` and `resized_width`. These values will be rounded to the nearest multiple of 28.

While Qwen2-VL is applicable to a wide range of visual tasks, it is equally important to understand its limitations. Here are some known restrictions:

1. Lack of Audio Support: The current model does not comprehend audio information within videos.
2. Data timeliness: Our image dataset is updated until June 2023, and information subsequent to this date may not be covered.
3. Constraints in Individuals and Intellectual Property (IP): The model's capacity to recognize specific individuals or IPs is limited, potentially failing to comprehensively cover all well-known personalities or brands.
4. Limited Capacity for Complex Instruction: When faced with intricate multi-step instructions, the model's understanding and execution capabilities require enhancement.
5. Insufficient Counting Accuracy: Particularly in complex scenes, the accuracy of object counting is not high, necessitating further improvements.
6. Weak Spatial Reasoning Skills: Especially in 3D spaces, the model's inference of object positional relationships is inadequate, making it difficult to precisely judge the relative positions of objects.

These limitations serve as ongoing directions for model optimization and improvement, and we are committed to continually enhancing the model's performance and scope of application.

If you find our work helpful, feel free to cite us.
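The two image-size controls described above can be sketched together: round each side to a multiple of 28, then rescale if the area falls outside the pixel range while roughly keeping the aspect ratio. This is an illustrative reimplementation under stated assumptions, not the toolkit's own routine; `round_to_multiple` and `fit_image` are hypothetical names, and the pixel limits in the example assume 28 × 28 pixels per visual token.

```python
import math
from typing import Tuple


def round_to_multiple(value: float, factor: int = 28) -> int:
    """Round to the nearest multiple of `factor`, never below `factor` itself."""
    return max(factor, round(value / factor) * factor)


def fit_image(height: int, width: int, min_pixels: int, max_pixels: int,
              factor: int = 28) -> Tuple[int, int]:
    """Return (height, width) as multiples of `factor` within the pixel range."""
    h = round_to_multiple(height, factor)
    w = round_to_multiple(width, factor)
    if h * w > max_pixels:
        # Shrink both sides by the same ratio, rounding down to stay under the cap.
        scale = math.sqrt((height * width) / max_pixels)
        h = max(factor, math.floor(height / scale / factor) * factor)
        w = max(factor, math.floor(width / scale / factor) * factor)
    elif h * w < min_pixels:
        # Enlarge both sides by the same ratio, rounding up to reach the floor.
        scale = math.sqrt(min_pixels / (height * width))
        h = math.ceil(height * scale / factor) * factor
        w = math.ceil(width * scale / factor) * factor
    return h, w


MIN_PIXELS = 256 * 28 * 28   # assumed budget for 256 visual tokens
MAX_PIXELS = 1280 * 28 * 28  # assumed budget for 1280 visual tokens

large = fit_image(1000, 2000, MIN_PIXELS, MAX_PIXELS)
small = fit_image(100, 100, MIN_PIXELS, MAX_PIXELS)
print(large, small)
```

Rounding down when shrinking and up when enlarging keeps the result inside the requested range even after snapping to the 28-pixel grid.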

Qwen

Qwen2.5-7B-Instruct

Qwen2.5-VL-3B-Instruct

Qwen3-0.6B

Qwen3-4B-Instruct-2507

Qwen2.5-VL-7B-Instruct

Qwen3-Embedding-0.6B

Qwen3-8B

Qwen2.5-3B-Instruct

Qwen2.5-1.5B-Instruct

Qwen2-VL-2B-Instruct

Qwen2.5-VL-32B-Instruct

Qwen3-4B

Qwen3-VL-8B-Instruct

Qwen2.5-0.5B-Instruct

Qwen2-VL-7B-Instruct

Qwen3-VL-30B-A3B-Instruct

Qwen2.5-1.5B

Qwen3-Reranker-0.6B

Qwen3-32B

Qwen3-Next-80B-A3B-Instruct

Qwen3-VL-32B-Instruct

Qwen2.5-7B

Qwen2.5-32B-Instruct-AWQ

Qwen3-1.7B

Qwen3-30B-A3B-Instruct-2507

Qwen3-Embedding-8B

Qwen2.5-7B-Instruct-AWQ

Qwen2.5-0.5B

Qwen3-14B

Qwen2-0.5B

Qwen2.5-14B-Instruct

Qwen2.5-VL-72B-Instruct

Qwen3-0.6B-Base

Qwen3-VL-4B-Instruct

Qwen3-Coder-30B-A3B-Instruct

Qwen2.5-Math-1.5B

Qwen3-30B-A3B-Instruct-2507-FP8

Qwen2.5-32B-Instruct-GPTQ-Int8

Qwen2.5-32B-Instruct

Qwen3-8B-FP8

Qwen3-4B-Base

Qwen2.5-Coder-7B-Instruct-AWQ

Qwen2.5-Coder-1.5B

Qwen2-1.5B-Instruct

Qwen3-30B-A3B

Qwen3-Embedding-4B

Qwen2.5-14B

Qwen2.5-72B-Instruct

Qwen3-Omni-30B-A3B-Instruct

Qwen2.5-Coder-7B-Instruct

Qwen2.5-VL-7B-Instruct-AWQ

Qwen-Image-Edit-2509

Qwen3-30B-A3B-GPTQ-Int4

Qwen3-4B-Thinking-2507

Qwen2-7B-Instruct

Qwen3-VL-2B-Instruct

Qwen3-235B-A22B

Qwen3-14B-Base

Qwen2.5-Omni-3B

Qwen3-30B-A3B-Thinking-2507

Qwen3-32B-AWQ

Qwen3-Next-80B-A3B-Instruct-FP8

Qwen3-8B-Base

Qwen3-VL-8B-Thinking

Qwen2.5-3B

Qwen3-14B-AWQ

Qwen3-4B-Thinking-2507-FP8

Qwen3-Next-80B-A3B-Thinking-FP8

Qwen-Image

Qwen2.5-32B

Qwen2-0.5B-Instruct

Qwen3-VL-235B-A22B-Instruct-FP8

Qwen3-VL-30B-A3B-Instruct-FP8

Qwen-Image-Edit

Qwen3-8B-AWQ

Qwen2.5-72B-Instruct-AWQ

Qwen3-Coder-30B-A3B-Instruct-FP8

Qwen-7B

Qwen2.5-Omni-7B