Qwen
Alibaba Cloud's Qwen (Tongyi Qianwen) model family
Qwen2.5-7B-Instruct
Qwen2.5 is the latest series of Qwen large language models. For Qwen2.5, we release a number of base and instruction-tuned language models ranging from 0.5 to 72 billion parameters. Qwen2.5 brings the following improvements over Qwen2:

- Significantly more knowledge and greatly improved capabilities in coding and mathematics, thanks to our specialized expert models in these domains.
- Significant improvements in instruction following, generating long texts (over 8K tokens), understanding structured data (e.g., tables), and generating structured outputs, especially JSON. The models are also more resilient to diverse system prompts, which improves role-play and condition-setting for chatbots.
- Long-context support for up to 128K tokens, with generation of up to 8K tokens.
- Multilingual support for over 29 languages, including Chinese, English, French, Spanish, Portuguese, German, Italian, Russian, Japanese, Korean, Vietnamese, Thai, Arabic, and more.

This repo contains the instruction-tuned 7B Qwen2.5 model, which has the following features:

- Type: Causal Language Models
- Training Stage: Pretraining & Post-training
- Architecture: transformers with RoPE, SwiGLU, RMSNorm, and Attention QKV bias
- Number of Parameters: 7.61B
- Number of Parameters (Non-Embedding): 6.53B
- Number of Layers: 28
- Number of Attention Heads (GQA): 28 for Q and 4 for KV
- Context Length: Full 131,072 tokens and generation 8192 tokens
  - See the long-text notes below for detailed instructions on deploying Qwen2.5 for long inputs.

For more details, please refer to our blog, GitHub, and Documentation.

The code of Qwen2.5 has been merged into the latest Hugging Face `transformers`, and we advise you to use the latest version. With `transformers<4.37.0`, loading the model fails with an error.

The quickstart uses `apply_chat_template` to show how to load the tokenizer and model and how to generate content.

The current `config.json` is set for a context length of up to 32,768 tokens. To handle inputs exceeding 32,768 tokens, we use YaRN, a technique for enhancing model length extrapolation, to maintain performance on lengthy texts. For supported frameworks, you can add a `rope_scaling` entry to `config.json` to enable YaRN.

For deployment, we recommend using vLLM. Please refer to our Documentation for usage if you are not familiar with vLLM. Presently, vLLM only supports static YaRN, which means the scaling factor remains constant regardless of input length, potentially impacting performance on shorter texts. We advise adding the `rope_scaling` configuration only when processing long contexts is required.

Detailed evaluation results are reported in this 📑 blog. For requirements on GPU memory and the respective throughput, see results here.

If you find our work helpful, feel free to give us a cite.
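As a rough illustration of the YaRN setting described above, the following sketch patches a config dictionary in Python. The key names (`rope_scaling`, `type`, `factor`, `original_max_position_embeddings`) and the factor of 4 are assumptions derived from the 32,768-token base context and the 131,072-token full context quoted in this card; check your framework's documentation before relying on them.

```python
import json

BASE_CONTEXT = 32_768      # context length set in the shipped config.json
TARGET_CONTEXT = 131_072   # full context length quoted for this model


def with_yarn(config: dict, target: int = TARGET_CONTEXT) -> dict:
    """Return a copy of `config` with an assumed YaRN rope_scaling entry."""
    factor = target / BASE_CONTEXT  # 131072 / 32768 = 4.0
    patched = dict(config)
    patched["rope_scaling"] = {
        "type": "yarn",  # assumed key names, see the note above
        "factor": factor,
        "original_max_position_embeddings": BASE_CONTEXT,
    }
    return patched


# Hypothetical minimal config; a real config.json has many more fields.
config = {"max_position_embeddings": BASE_CONTEXT}
patched = with_yarn(config)
print(json.dumps(patched, indent=2))
```

Because the scaling is static, a patched config of this kind is best kept separate from the default one and used only for long-context workloads.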
Qwen2.5-VL-3B-Instruct
In the five months since Qwen2-VL's release, numerous developers have built new models on the Qwen2-VL vision-language models and given us valuable feedback. During this period, we focused on building more useful vision-language models. Today, we are excited to introduce the latest addition to the Qwen family: Qwen2.5-VL.

Key Enhancements:

- Understand things visually: Qwen2.5-VL is not only proficient in recognizing common objects such as flowers, birds, fish, and insects, but is also highly capable of analyzing texts, charts, icons, graphics, and layouts within images.
- Being agentic: Qwen2.5-VL acts directly as a visual agent that can reason and dynamically direct tools, making it capable of computer use and phone use.
- Understanding long videos and capturing events: Qwen2.5-VL can comprehend videos of over 1 hour, and it has a new ability to capture events by pinpointing the relevant video segments.
- Capable of visual localization in different formats: Qwen2.5-VL can accurately localize objects in an image by generating bounding boxes or points, and it can provide stable JSON outputs for coordinates and attributes.
- Generating structured outputs: for data such as scans of invoices, forms, and tables, Qwen2.5-VL supports structured outputs of their contents, which benefits use cases in finance, commerce, and similar fields.

Dynamic Resolution and Frame Rate Training for Video Understanding: We extend dynamic resolution to the temporal dimension by adopting dynamic FPS sampling, enabling the model to comprehend videos at various sampling rates. Accordingly, we update mRoPE in the time dimension with IDs and absolute time alignment, enabling the model to learn temporal sequence and speed, and ultimately acquire the ability to pinpoint specific moments.

We enhance both training and inference speeds by strategically implementing window attention in the ViT. The ViT architecture is further optimized with SwiGLU and RMSNorm, aligning it with the structure of the Qwen2.5 LLM.
We have three models with 3, 7 and 72 billion parameters. This repo contains the instruction-tuned 3B Qwen2.5-VL model. For more information, visit our Blog and GitHub.

| Benchmark | InternVL2.5-4B | Qwen2-VL-7B | Qwen2.5-VL-3B |
| :--- | :---: | :---: | :---: |
| MMMU val | 52.3 | 54.1 | 53.1 |
| MMMU-Pro val | 32.7 | 30.5 | 31.6 |
| AI2D test | 81.4 | 83.0 | 81.5 |
| DocVQA test | 91.6 | 94.5 | 93.9 |
| InfoVQA test | 72.1 | 76.5 | 77.1 |
| TextVQA val | 76.8 | 84.3 | 79.3 |
| MMBench-V1.1 test | 79.3 | 80.7 | 77.6 |
| MMStar | 58.3 | 60.7 | 55.9 |
| MathVista testmini | 60.5 | 58.2 | 62.3 |
| MathVision full | 20.9 | 16.3 | 21.2 |

Video benchmark

| Benchmark | InternVL2.5-4B | Qwen2-VL-7B | Qwen2.5-VL-3B |
| :--- | :---: | :---: | :---: |
| MVBench | 71.6 | 67.0 | 67.0 |
| VideoMME | 63.6/62.3 | 69.0/63.3 | 67.6/61.5 |
| MLVU | 48.3 | - | 68.2 |
| LVBench | - | - | 43.3 |
| MMBench-Video | 1.73 | 1.44 | 1.63 |
| EgoSchema | - | - | 64.8 |
| PerceptionTest | - | - | 66.9 |
| TempCompass | - | - | 64.4 |
| LongVideoBench | 55.2 | 55.6 | 54.2 |
| CharadesSTA/mIoU | - | - | 38.8 |

Agent benchmark

| Benchmarks | Qwen2.5-VL-3B |
|-------------------------|---------------|
| ScreenSpot | 55.5 |
| ScreenSpot Pro | 23.9 |
| AITZ_EM | 76.9 |
| Android Control High_EM | 63.7 |
| Android Control Low_EM | 22.2 |
| AndroidWorld_SR | 90.8 |
| MobileMiniWob++_SR | 67.9 |

Requirements

The code of Qwen2.5-VL has been merged into the latest Hugging Face transformers, and we advise you to build it from source.

Below, we provide simple examples to show how to use Qwen2.5-VL with 🤖 ModelScope and 🤗 Transformers.

We offer a toolkit to help you handle various types of visual input more conveniently, as if you were using an API. This includes base64, URLs, and interleaved images and videos.
You can install the toolkit with pip. If you are not using Linux, you might not be able to install `decord` from PyPI. In that case, you can use `pip install qwen-vl-utils`, which will fall back to using torchvision for video processing. However, you can still install decord from source so that it is used when loading video.

The chat model is used through `transformers` together with `qwen_vl_utils`.

Video URL compatibility largely depends on the third-party library version. The details are in the table below. You can change the backend with `FORCE_QWENVL_VIDEO_READER=torchvision` or `FORCE_QWENVL_VIDEO_READER=decord` if you prefer not to use the default one.

| Backend | HTTP | HTTPS |
|-------------|------|-------|
| torchvision >= 0.19.0 | ✅ | ✅ |
| torchvision | | |

🤖 ModelScope

We strongly advise users, especially those in mainland China, to use ModelScope. `snapshot_download` can help you solve issues concerning downloading checkpoints.

For input images, we support local files, base64, and URLs. For videos, we currently only support local files.

The model supports a wide range of resolution inputs. By default, it uses the native resolution for input, but higher resolutions can enhance performance at the cost of more computation. Users can set the minimum and maximum number of pixels to achieve an optimal configuration for their needs, such as a token count range of 256-1280, to balance speed and memory usage.

We also provide two methods for fine-grained control over the image size input to the model:

1. Define `min_pixels` and `max_pixels`: Images will be resized to maintain their aspect ratio within the range of `min_pixels` and `max_pixels`.
2. Specify exact dimensions: Directly set `resized_height` and `resized_width`. These values will be rounded to the nearest multiple of 28.

The current `config.json` is set for a context length of up to 32,768 tokens. To handle inputs exceeding 32,768 tokens, we use YaRN, a technique for enhancing model length extrapolation, to maintain performance on lengthy texts. For supported frameworks, you can add a YaRN configuration to `config.json`.

However, this method has a significant impact on the performance of temporal and spatial localization tasks, and is therefore not recommended. For long video inputs, since mRoPE itself is more economical with IDs, `max_position_embeddings` can instead be set directly to a larger value, such as 64k.

If you find our work helpful, feel free to give us a cite.
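To make the two image-size options above more concrete, here is a minimal sketch of how a target size could be derived. It is not the `qwen_vl_utils` implementation; the function names, the use of the 28-pixel unit for both options, and the example pixel budget are assumptions for illustration.

```python
import math

PATCH = 28  # sizes are rounded to multiples of 28, as stated above


def round_to_patch(value: float) -> int:
    """Round to the nearest multiple of PATCH, never below one patch."""
    return max(PATCH, round(value / PATCH) * PATCH)


def fit_image_size(height: int, width: int, min_pixels: int, max_pixels: int) -> tuple[int, int]:
    """Hypothetical helper: keep the aspect ratio while moving the pixel
    count into [min_pixels, max_pixels], then snap both sides to PATCH."""
    h, w = round_to_patch(height), round_to_patch(width)
    if h * w > max_pixels:
        # Shrink, rounding down so the result stays under the budget.
        scale = math.sqrt(height * width / max_pixels)
        h = max(PATCH, math.floor(height / scale / PATCH) * PATCH)
        w = max(PATCH, math.floor(width / scale / PATCH) * PATCH)
    elif h * w < min_pixels:
        # Enlarge, rounding up so the result reaches the minimum.
        scale = math.sqrt(min_pixels / (height * width))
        h = math.ceil(height * scale / PATCH) * PATCH
        w = math.ceil(width * scale / PATCH) * PATCH
    return h, w


# Assumed budget: between 256 and 1280 blocks of 28x28 pixels.
min_pixels = 256 * PATCH * PATCH
max_pixels = 1280 * PATCH * PATCH

print(fit_image_size(1080, 1920, min_pixels, max_pixels))
print(round_to_patch(500), round_to_patch(700))
```

Rounding down when shrinking and up when enlarging is a design choice of this sketch so that the result always lands inside the requested range.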
Qwen3-0.6B
Qwen3 is the latest generation of large language models in the Qwen series, offering a comprehensive suite of dense and mixture-of-experts (MoE) models. Built upon extensive training, Qwen3 delivers advancements in reasoning, instruction following, agent capabilities, and multilingual support, with the following key features:

- Unique support for seamless switching between thinking mode (for complex logical reasoning, math, and coding) and non-thinking mode (for efficient, general-purpose dialogue) within a single model, ensuring optimal performance across various scenarios.
- Significant enhancement of its reasoning capabilities, surpassing previous QwQ (in thinking mode) and Qwen2.5 instruct models (in non-thinking mode) on mathematics, code generation, and commonsense logical reasoning.
- Superior human preference alignment, excelling in creative writing, role-playing, multi-turn dialogues, and instruction following, to deliver a more natural, engaging, and immersive conversational experience.
- Expertise in agent capabilities, enabling precise integration with external tools in both thinking and non-thinking modes and achieving leading performance among open-source models in complex agent-based tasks.
- Support for 100+ languages and dialects with strong capabilities for multilingual instruction following and translation.

Qwen3-0.6B has the following features:

- Type: Causal Language Models
- Training Stage: Pretraining & Post-training
- Number of Parameters: 0.6B
- Number of Parameters (Non-Embedding): 0.44B
- Number of Layers: 28
- Number of Attention Heads (GQA): 16 for Q and 8 for KV
- Context Length: 32,768

For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our blog, GitHub, and Documentation.

> [!TIP]
> If you encounter significant endless repetitions, please refer to the Best Practices section for optimal sampling parameters, and set `presence_penalty` to 1.5.
The code of Qwen3 has been merged into the latest Hugging Face `transformers`, and we advise you to use the latest version. With older versions of `transformers`, you will encounter an error.

Parsing the thinking content from the generated token ids:

```python
# `output_ids` (list of generated token ids) and `tokenizer` come from the generation step.
try:
    # Find the last occurrence of token id 151668, which ends the thinking block
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content)
print("content:", content)
```

Serving with SGLang:

```shell
python -m sglang.launch_server --model-path Qwen/Qwen3-0.6B --reasoning-parser qwen3
```

Serving with vLLM:

```shell
vllm serve Qwen/Qwen3-0.6B --enable-reasoning --reasoning-parser deepseek_r1
```

Thinking mode (`enable_thinking=True`, the default):

```python
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True  # True is the default value for enable_thinking
)
```

Non-thinking mode (`enable_thinking=False`):

```python
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False  # Setting enable_thinking=False disables thinking mode
)
```

Switching modes within a multi-turn conversation with the `/think` and `/no_think` tags:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

class QwenChatbot:
    def __init__(self, model_name="Qwen/Qwen3-0.6B"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(model_name)
        self.history = []

    def generate_response(self, user_input):
        messages = self.history + [{"role": "user", "content": user_input}]

        text = self.tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True
        )

        inputs = self.tokenizer(text, return_tensors="pt")
        # Keep only the newly generated tokens, not the prompt
        response_ids = self.model.generate(**inputs, max_new_tokens=32768)[0][len(inputs.input_ids[0]):].tolist()
        response = self.tokenizer.decode(response_ids, skip_special_tokens=True)

        # Update history
        self.history.append({"role": "user", "content": user_input})
        self.history.append({"role": "assistant", "content": response})

        return response

# Example Usage
if __name__ == "__main__":
    chatbot = QwenChatbot()

    # First input (without /think or /no_think tags, thinking mode is enabled by default)
    user_input_1 = "How many r's in strawberries?"
    print(f"User: {user_input_1}")
    response_1 = chatbot.generate_response(user_input_1)
    print(f"Bot: {response_1}")
    print("----------------------")

    # Second input with /no_think
    user_input_2 = "Then, how many r's in blueberries? /no_think"
    print(f"User: {user_input_2}")
    response_2 = chatbot.generate_response(user_input_2)
    print(f"Bot: {response_2}")
    print("----------------------")

    # Third input with /think
    user_input_3 = "Really? /think"
    print(f"User: {user_input_3}")
    response_3 = chatbot.generate_response(user_input_3)
    print(f"Bot: {response_3}")
```

Agentic use with Qwen-Agent:

```python
from qwen_agent.agents import Assistant

# Define LLM
llm_cfg = {
    'model': 'Qwen3-0.6B',  # assumed name; use the model name your endpoint serves

    # Use the endpoint provided by Alibaba Model Studio:
    # 'model_type': 'qwen_dashscope',
    # 'api_key': os.getenv('DASHSCOPE_API_KEY'),

    # Use a custom endpoint compatible with OpenAI API:
    'model_server': 'http://localhost:8000/v1',  # api_base
    'api_key': 'EMPTY',

    # Other parameters:
    # 'generate_cfg': {
    #     # Add: When the response content is `<think>this is the thought</think>this is the answer`;
    #     # Do not add: When the response has been separated by reasoning_content and content.
    #     'thought_in_content': True,
    # },
}

# Define Tools
tools = [
    {'mcpServers': {  # You can specify the MCP configuration file
            'time': {
                'command': 'uvx',
                'args': ['mcp-server-time', '--local-timezone=Asia/Shanghai']
            },
            "fetch": {
                "command": "uvx",
                "args": ["mcp-server-fetch"]
            }
        }
    },
    'code_interpreter',  # Built-in tools
]

# Define Agent
bot = Assistant(llm=llm_cfg, function_list=tools)

# Streaming generation
messages = [{'role': 'user', 'content': 'https://qwenlm.github.io/blog/ Introduce the latest developments of Qwen'}]
for responses in bot.run(messages=messages):
    pass
print(responses)
```

```
@misc{qwen3technicalreport,
      title={Qwen3 Technical Report},
      author={Qwen Team},
      year={2025},
      eprint={2505.09388},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.09388},
}
```
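The thinking/answer split used in the parsing snippet above can be tried without loading a model. This sketch works on plain lists of token ids; `split_thinking` is a hypothetical helper, and the id 151668 is taken from that snippet as the marker that ends the thinking block.

```python
END_OF_THINKING = 151668  # marker id used in the parsing snippet above


def split_thinking(output_ids: list[int], marker: int = END_OF_THINKING) -> tuple[list[int], list[int]]:
    """Split generated ids at the last occurrence of `marker`.

    The marker stays with the thinking part. If the marker is absent
    (for example in non-thinking mode), everything counts as the answer.
    """
    try:
        # Searching the reversed list finds the last occurrence.
        index = len(output_ids) - output_ids[::-1].index(marker)
    except ValueError:
        index = 0
    return output_ids[:index], output_ids[index:]


thinking, answer = split_thinking([11, 12, END_OF_THINKING, 21, 22])
print(thinking, answer)

no_thinking, plain_answer = split_thinking([21, 22])
print(no_thinking, plain_answer)
```

Splitting on token ids rather than on decoded text avoids false matches when the answer itself happens to mention the closing tag as ordinary text.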
Qwen3-4B-Instruct-2507
---
library_name: transformers
license: apache-2.0
license_link: https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507/blob/main/LICENSE
pipeline_tag: text-generation
---
Qwen2.5-VL-7B-Instruct
In the five months since Qwen2-VL's release, numerous developers have built new models on the Qwen2-VL vision-language models and given us valuable feedback. During this period, we focused on building more useful vision-language models. Today, we are excited to introduce the latest addition to the Qwen family: Qwen2.5-VL.

Key Enhancements:

- Understand things visually: Qwen2.5-VL is not only proficient in recognizing common objects such as flowers, birds, fish, and insects, but is also highly capable of analyzing texts, charts, icons, graphics, and layouts within images.
- Being agentic: Qwen2.5-VL acts directly as a visual agent that can reason and dynamically direct tools, making it capable of computer use and phone use.
- Understanding long videos and capturing events: Qwen2.5-VL can comprehend videos of over 1 hour, and it has a new ability to capture events by pinpointing the relevant video segments.
- Capable of visual localization in different formats: Qwen2.5-VL can accurately localize objects in an image by generating bounding boxes or points, and it can provide stable JSON outputs for coordinates and attributes.
- Generating structured outputs: for data such as scans of invoices, forms, and tables, Qwen2.5-VL supports structured outputs of their contents, which benefits use cases in finance, commerce, and similar fields.

Dynamic Resolution and Frame Rate Training for Video Understanding: We extend dynamic resolution to the temporal dimension by adopting dynamic FPS sampling, enabling the model to comprehend videos at various sampling rates. Accordingly, we update mRoPE in the time dimension with IDs and absolute time alignment, enabling the model to learn temporal sequence and speed, and ultimately acquire the ability to pinpoint specific moments.

We enhance both training and inference speeds by strategically implementing window attention in the ViT. The ViT architecture is further optimized with SwiGLU and RMSNorm, aligning it with the structure of the Qwen2.5 LLM.
We have three models with 3, 7 and 72 billion parameters. This repo contains the instruction-tuned 7B Qwen2.5-VL model. For more information, visit our Blog and GitHub.

| Benchmark | InternVL2.5-8B | MiniCPM-o 2.6 | GPT-4o-mini | Qwen2-VL-7B | Qwen2.5-VL-7B |
| :--- | :---: | :---: | :---: | :---: | :---: |
| MMMU val | 56 | 50.4 | 60 | 54.1 | 58.6 |
| MMMU-Pro val | 34.3 | - | 37.6 | 30.5 | 41.0 |
| DocVQA test | 93 | 93 | - | 94.5 | 95.7 |
| InfoVQA test | 77.6 | - | - | 76.5 | 82.6 |
| ChartQA test | 84.8 | - | - | 83.0 | 87.3 |
| TextVQA val | 79.1 | 80.1 | - | 84.3 | 84.9 |
| OCRBench | 822 | 852 | 785 | 845 | 864 |
| CC_OCR | 57.7 | | | 61.6 | 77.8 |
| MMStar | 62.8 | | | 60.7 | 63.9 |
| MMBench-V1.1-En test | 79.4 | 78.0 | 76.0 | 80.7 | 82.6 |
| MMT-Bench test | - | - | - | 63.7 | 63.6 |
| MMStar | 61.5 | 57.5 | 54.8 | 60.7 | 63.9 |
| MMVet GPT-4-Turbo | 54.2 | 60.0 | 66.9 | 62.0 | 67.1 |
| HallBench avg | 45.2 | 48.1 | 46.1 | 50.6 | 52.9 |
| MathVista testmini | 58.3 | 60.6 | 52.4 | 58.2 | 68.2 |
| MathVision | - | - | - | 16.3 | 25.07 |

| Benchmark | Qwen2-VL-7B | Qwen2.5-VL-7B |
| :--- | :---: | :---: |
| MVBench | 67.0 | 69.6 |
| PerceptionTest test | 66.9 | 70.5 |
| Video-MME wo/w subs | 63.3/69.0 | 65.1/71.6 |
| LVBench | | 45.3 |
| LongVideoBench | | 54.7 |
| MMBench-Video | 1.44 | 1.79 |
| TempCompass | | 71.7 |
| MLVU | | 70.2 |
| CharadesSTA/mIoU | | 43.6 |

Agent benchmark

| Benchmarks | Qwen2.5-VL-7B |
|-------------------------|---------------|
| ScreenSpot | 84.7 |
| ScreenSpot Pro | 29.0 |
| AITZ_EM | 81.9 |
| Android Control High_EM | 60.1 |
| Android Control Low_EM | 93.7 |
| AndroidWorld_SR | 25.5 |
| MobileMiniWob++_SR | 91.4 |

Requirements

The code of Qwen2.5-VL has been merged into the latest Hugging Face transformers, and we advise you to build it from source.

Below, we provide simple examples to show how to use Qwen2.5-VL with 🤖 ModelScope and 🤗 Transformers.
We offer a toolkit to help you handle various types of visual input more conveniently, as if you were using an API. This includes base64, URLs, and interleaved images and videos. You can install the toolkit with pip. If you are not using Linux, you might not be able to install `decord` from PyPI. In that case, you can use `pip install qwen-vl-utils`, which will fall back to using torchvision for video processing. However, you can still install decord from source so that it is used when loading video.

The chat model is used through `transformers` together with `qwen_vl_utils`.

Video URL compatibility largely depends on the third-party library version. The details are in the table below. You can change the backend with `FORCE_QWENVL_VIDEO_READER=torchvision` or `FORCE_QWENVL_VIDEO_READER=decord` if you prefer not to use the default one.

| Backend | HTTP | HTTPS |
|-------------|------|-------|
| torchvision >= 0.19.0 | ✅ | ✅ |
| torchvision | | |

🤖 ModelScope

We strongly advise users, especially those in mainland China, to use ModelScope. `snapshot_download` can help you solve issues concerning downloading checkpoints.

For input images, we support local files, base64, and URLs. For videos, we currently only support local files.

The model supports a wide range of resolution inputs. By default, it uses the native resolution for input, but higher resolutions can enhance performance at the cost of more computation. Users can set the minimum and maximum number of pixels to achieve an optimal configuration for their needs, such as a token count range of 256-1280, to balance speed and memory usage.

We also provide two methods for fine-grained control over the image size input to the model:

1. Define `min_pixels` and `max_pixels`: Images will be resized to maintain their aspect ratio within the range of `min_pixels` and `max_pixels`.
2. Specify exact dimensions: Directly set `resized_height` and `resized_width`. These values will be rounded to the nearest multiple of 28.

The current `config.json` is set for a context length of up to 32,768 tokens. To handle inputs exceeding 32,768 tokens, we use YaRN, a technique for enhancing model length extrapolation, to maintain performance on lengthy texts. For supported frameworks, you could add the following to `config.json` to enable YaRN:

```
{
    ...,
    "type": "yarn",
    "mrope_section": [
        16,
        24,
        24
    ],
    "factor": 4,
    "original_max_position_embeddings": 32768
}
```

However, this method has a significant impact on the performance of temporal and spatial localization tasks, and is therefore not recommended. For long video inputs, since mRoPE itself is more economical with IDs, `max_position_embeddings` can instead be set directly to a larger value, such as 64k.

If you find our work helpful, feel free to give us a cite.
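The token count range of 256-1280 mentioned above can be turned into pixel limits with a little arithmetic. This sketch assumes that one visual token covers a 28x28 pixel block, which fits the multiple-of-28 rounding described above but is not stated in this card; `pixel_budget` and `visual_tokens` are hypothetical helper names.

```python
BLOCK = 28
PIXELS_PER_TOKEN = BLOCK * BLOCK  # assumption: one visual token per 28x28 block


def pixel_budget(min_tokens: int, max_tokens: int) -> tuple[int, int]:
    """Convert a visual token range into (min_pixels, max_pixels)."""
    if not 0 < min_tokens <= max_tokens:
        raise ValueError("expected 0 < min_tokens <= max_tokens")
    return min_tokens * PIXELS_PER_TOKEN, max_tokens * PIXELS_PER_TOKEN


def visual_tokens(height: int, width: int) -> int:
    """Approximate token count for an image whose sides are multiples of 28."""
    return (height // BLOCK) * (width // BLOCK)


min_pixels, max_pixels = pixel_budget(256, 1280)
print(min_pixels, max_pixels)
print(visual_tokens(448, 672))
```

Under this assumption, lowering the upper token bound is the most direct lever for reducing memory use on large images.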
Qwen3-Embedding-0.6B
The Qwen3 Embedding model series is the latest proprietary model of the Qwen family, specifically designed for text embedding and ranking tasks. Building upon the dense foundational models of the Qwen3 series, it provides a comprehensive range of text embedding and reranking models in various sizes (0.6B, 4B, and 8B). This series inherits the exceptional multilingual capabilities, long-text understanding, and reasoning skills of its foundational model. The Qwen3 Embedding series represents significant advancements in multiple text embedding and ranking tasks, including text retrieval, code retrieval, text classification, text clustering, and bitext mining.

Exceptional Versatility: The embedding model has achieved state-of-the-art performance across a wide range of downstream application evaluations. The 8B embedding model ranks No. 1 on the MTEB multilingual leaderboard (as of June 5, 2025, score 70.58), while the reranking model excels in various text retrieval scenarios.

Comprehensive Flexibility: The Qwen3 Embedding series offers a full spectrum of sizes (from 0.6B to 8B) for both embedding and reranking models, catering to diverse use cases that prioritize efficiency and effectiveness. Developers can seamlessly combine these two modules. Additionally, the embedding model allows for flexible vector definitions across all dimensions, and both embedding and reranking models support user-defined instructions to enhance performance for specific tasks, languages, or scenarios.

Multilingual Capability: The Qwen3 Embedding series offers support for over 100 languages, thanks to the multilingual capabilities of Qwen3 models. This includes various programming languages, and provides robust multilingual, cross-lingual, and code retrieval capabilities.
- Model Type: Text Embedding
- Supported Languages: 100+ Languages
- Number of Parameters: 0.6B
- Context Length: 32k
- Embedding Dimension: Up to 1024, supports user-defined output dimensions ranging from 32 to 1024

For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our blog and GitHub.

| Model Type | Models | Size | Layers | Sequence Length | Embedding Dimension | MRL Support | Instruction Aware |
|------------------|----------------------|------|--------|-----------------|---------------------|-------------|----------------|
| Text Embedding | Qwen3-Embedding-0.6B | 0.6B | 28 | 32K | 1024 | Yes | Yes |
| Text Embedding | Qwen3-Embedding-4B | 4B | 36 | 32K | 2560 | Yes | Yes |
| Text Embedding | Qwen3-Embedding-8B | 8B | 36 | 32K | 4096 | Yes | Yes |
| Text Reranking | Qwen3-Reranker-0.6B | 0.6B | 28 | 32K | - | - | Yes |
| Text Reranking | Qwen3-Reranker-4B | 4B | 36 | 32K | - | - | Yes |
| Text Reranking | Qwen3-Reranker-8B | 8B | 36 | 32K | - | - | Yes |

> Note:
> - `MRL Support` indicates whether the embedding model supports custom dimensions for the final embedding.
> - `Instruction Aware` notes whether the embedding or reranking model supports customizing the input instruction according to different tasks.
> - Our evaluation indicates that, for most downstream tasks, using instructions (instruct) typically yields an improvement of 1% to 5% compared to not using them. Therefore, we recommend that developers create tailored instructions specific to their tasks and scenarios. In multilingual contexts, we also advise users to write their instructions in English, as most instructions utilized during the model training process were originally written in English.

With Transformers versions earlier than 4.51.0, you may encounter an error.

📌 Tip: We recommend that developers customize the `instruct` according to their specific scenarios, tasks, and languages. Our tests have shown that in most retrieval scenarios, not using an `instruct` on the query side can lead to a drop in retrieval performance of approximately 1% to 5%.

Then generate the embeddings by sending an HTTP POST request.

| Model | Size | Mean (Task) | Mean (Type) | Bitext Mining | Class. | Clust. | Inst. Retri. | Multi. Class. | Pair. Class. | Rerank | Retri. | STS |
|----------------------------------|:-------:|:-------------:|:-------------:|:--------------:|:--------:|:--------:|:--------------:|:---------------:|:--------------:|:--------:|:--------:|:------:|
| NV-Embed-v2 | 7B | 56.29 | 49.58 | 57.84 | 57.29 | 40.80 | 1.04 | 18.63 | 78.94 | 63.82 | 56.72 | 71.10 |
| GritLM-7B | 7B | 60.92 | 53.74 | 70.53 | 61.83 | 49.75 | 3.45 | 22.77 | 79.94 | 63.78 | 58.31 | 73.33 |
| BGE-M3 | 0.6B | 59.56 | 52.18 | 79.11 | 60.35 | 40.88 | -3.11 | 20.1 | 80.76 | 62.79 | 54.60 | 74.12 |
| multilingual-e5-large-instruct | 0.6B | 63.22 | 55.08 | 80.13 | 64.94 | 50.75 | -0.40 | 22.91 | 80.86 | 62.61 | 57.12 | 76.81 |
| gte-Qwen2-1.5B-instruct | 1.5B | 59.45 | 52.69 | 62.51 | 58.32 | 52.05 | 0.74 | 24.02 | 81.58 | 62.58 | 60.78 | 71.61 |
| gte-Qwen2-7b-Instruct | 7B | 62.51 | 55.93 | 73.92 | 61.55 | 52.77 | 4.94 | 25.48 | 85.13 | 65.55 | 60.08 | 73.98 |
| text-embedding-3-large | - | 58.93 | 51.41 | 62.17 | 60.27 | 46.89 | -2.68 | 22.03 | 79.17 | 63.89 | 59.27 | 71.68 |
| Cohere-embed-multilingual-v3.0 | - | 61.12 | 53.23 | 70.50 | 62.95 | 46.89 | -1.89 | 22.74 | 79.88 | 64.07 | 59.16 | 74.80 |
| Gemini Embedding | - | 68.37 | 59.59 | 79.28 | 71.82 | 54.59 | 5.18 | 29.16 | 83.63 | 65.58 | 67.71 | 79.40 |
| Qwen3-Embedding-0.6B | 0.6B | 64.33 | 56.00 | 72.22 | 66.83 | 52.33 | 5.09 | 24.59 | 80.83 | 61.41 | 64.64 | 76.17 |
| Qwen3-Embedding-4B | 4B | 69.45 | 60.86 | 79.36 | 72.33 | 57.15 | 11.56 | 26.77 | 85.05 | 65.08 | 69.60 | 80.86 |
| Qwen3-Embedding-8B | 8B | 70.58 | 61.69 | 80.89 | 74.00 | 57.65 | 10.06 | 28.66 | 86.40 | 65.63 | 70.88 | 81.08 |

> Note: For compared models, the scores are retrieved from the MTEB online leaderboard on May 24th, 2025.

| MTEB English / Models | Param. | Mean(Task) | Mean(Type) | Class. | Clust. | Pair Class. | Rerank. | Retri. | STS | Summ. |
|--------------------------------|:--------:|:------------:|:------------:|:--------:|:--------:|:-------------:|:---------:|:--------:|:-------:|:-------:|
| multilingual-e5-large-instruct | 0.6B | 65.53 | 61.21 | 75.54 | 49.89 | 86.24 | 48.74 | 53.47 | 84.72 | 29.89 |
| NV-Embed-v2 | 7.8B | 69.81 | 65.00 | 87.19 | 47.66 | 88.69 | 49.61 | 62.84 | 83.82 | 35.21 |
| GritLM-7B | 7.2B | 67.07 | 63.22 | 81.25 | 50.82 | 87.29 | 49.59 | 54.95 | 83.03 | 35.65 |
| gte-Qwen2-1.5B-instruct | 1.5B | 67.20 | 63.26 | 85.84 | 53.54 | 87.52 | 49.25 | 50.25 | 82.51 | 33.94 |
| stella_en_1.5B_v5 | 1.5B | 69.43 | 65.32 | 89.38 | 57.06 | 88.02 | 50.19 | 52.42 | 83.27 | 36.91 |
| gte-Qwen2-7B-instruct | 7.6B | 70.72 | 65.77 | 88.52 | 58.97 | 85.9 | 50.47 | 58.09 | 82.69 | 35.74 |
| gemini-embedding-exp-03-07 | - | 73.3 | 67.67 | 90.05 | 59.39 | 87.7 | 48.59 | 64.35 | 85.29 | 38.28 |
| Qwen3-Embedding-0.6B | 0.6B | 70.70 | 64.88 | 85.76 | 54.05 | 84.37 | 48.18 | 61.83 | 86.57 | 33.43 |
| Qwen3-Embedding-4B | 4B | 74.60 | 68.10 | 89.84 | 57.51 | 87.01 | 50.76 | 68.46 | 88.72 | 34.39 |
| Qwen3-Embedding-8B | 8B | 75.22 | 68.71 | 90.43 | 58.57 | 87.52 | 51.56 | 69.44 | 88.58 | 34.83 |

| C-MTEB | Param. | Mean(Task) | Mean(Type) | Class. | Clust. | Pair Class. | Rerank. | Retr. | STS |
|------------------|--------|------------|------------|--------|--------|-------------|---------|-------|-------|
| multilingual-e5-large-instruct | 0.6B | 58.08 | 58.24 | 69.80 | 48.23 | 64.52 | 57.45 | 63.65 | 45.81 |
| bge-multilingual-gemma2 | 9B | 67.64 | 75.31 | 59.30 | 86.67 | 68.28 | 73.73 | 55.19 | - |
| gte-Qwen2-1.5B-instruct | 1.5B | 67.12 | 67.79 | 72.53 | 54.61 | 79.5 | 68.21 | 71.86 | 60.05 |
| gte-Qwen2-7B-instruct | 7.6B | 71.62 | 72.19 | 75.77 | 66.06 | 81.16 | 69.24 | 75.70 | 65.20 |
| ritrieve_zh_v1 | 0.3B | 72.71 | 73.85 | 76.88 | 66.5 | 85.98 | 72.86 | 76.97 | 63.92 |
| Qwen3-Embedding-0.6B | 0.6B | 66.33 | 67.45 | 71.40 | 68.74 | 76.42 | 62.58 | 71.03 | 54.52 |
| Qwen3-Embedding-4B | 4B | 72.27 | 73.51 | 75.46 | 77.89 | 83.34 | 66.05 | 77.03 | 61.26 |
| Qwen3-Embedding-8B | 8B | 73.84 | 75.00 | 76.97 | 80.08 | 84.23 | 66.99 | 78.21 | 63.53 |

If you find our work helpful, feel free to give us a cite.
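The user-defined output dimension mentioned for this model (32 to 1024) is commonly applied by truncating the full vector and re-normalizing it. The sketch below shows that idea in plain Python. `truncate_embedding` is a hypothetical helper, and treating truncation plus L2 normalization as the way to shrink these embeddings is an assumption, not something this card spells out.

```python
import math


def truncate_embedding(vector: list[float], dim: int, full_dim: int = 1024) -> list[float]:
    """Keep the first `dim` components and rescale to unit length."""
    if not 32 <= dim <= full_dim:
        raise ValueError(f"dim must be between 32 and {full_dim}")
    if len(vector) != full_dim:
        raise ValueError(f"expected a {full_dim}-dimensional vector")
    head = vector[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in head]


def cosine(a: list[float], b: list[float]) -> float:
    """For unit-length vectors, cosine similarity is the dot product."""
    return sum(x * y for x, y in zip(a, b))


# Toy vectors standing in for real model outputs.
query = [1.0 if i % 2 == 0 else 0.5 for i in range(1024)]
doc = [0.9 if i % 2 == 0 else 0.6 for i in range(1024)]

q64 = truncate_embedding(query, 64)
d64 = truncate_embedding(doc, 64)
print(len(q64), cosine(q64, d64))
```

Smaller dimensions reduce index size and search cost, so it is worth measuring retrieval quality at a few sizes before settling on one.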
Qwen3-8B
---
library_name: transformers
license: apache-2.0
license_link: https://huggingface.co/Qwen/Qwen3-8B/blob/main/LICENSE
pipeline_tag: text-generation
base_model:
- Qwen/Qwen3-8B-Base
---
Qwen2.5-3B-Instruct
Qwen2.5 is the latest series of Qwen large language models. For Qwen2.5, we release a number of base language models and instruction-tuned language models ranging from 0.5 to 72 billion parameters. Qwen2.5 brings the following improvements upon Qwen2: - Significantly more knowledge and has greatly improved capabilities in coding and mathematics, thanks to our specialized expert models in these domains. - Significant improvements in instruction following, generating long texts (over 8K tokens), understanding structured data (e.g, tables), and generating structured outputs especially JSON. More resilient to the diversity of system prompts, enhancing role-play implementation and condition-setting for chatbots. - Long-context Support up to 128K tokens and can generate up to 8K tokens. - Multilingual support for over 29 languages, including Chinese, English, French, Spanish, Portuguese, German, Italian, Russian, Japanese, Korean, Vietnamese, Thai, Arabic, and more. This repo contains the instruction-tuned 3B Qwen2.5 model, which has the following features: - Type: Causal Language Models - Training Stage: Pretraining & Post-training - Architecture: transformers with RoPE, SwiGLU, RMSNorm, Attention QKV bias and tied word embeddings - Number of Parameters: 3.09B - Number of Paramaters (Non-Embedding): 2.77B - Number of Layers: 36 - Number of Attention Heads (GQA): 16 for Q and 2 for KV - Context Length: Full 32,768 tokens and generation 8192 tokens For more details, please refer to our blog, GitHub, and Documentation. The code of Qwen2.5 has been in the latest Hugging face `transformers` and we advise you to use the latest version of `transformers`. With `transformers<4.37.0`, you will encounter the following error: Here provides a code snippet with `applychattemplate` to show you how to load the tokenizer and model and how to generate contents. Detailed evaluation results are reported in this 📑 blog. 
For GPU memory requirements and the corresponding throughput, see the reported benchmark results. If you find our work helpful, feel free to cite us.
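The quickstart flow described above for Qwen2.5-3B-Instruct (load the tokenizer and model, format the conversation with `apply_chat_template`, then generate) can be sketched roughly as follows. This is a minimal sketch, not the card's original snippet: the repo id `Qwen/Qwen2.5-3B-Instruct`, the default system prompt, the helper names `build_messages` and `generate`, and `max_new_tokens=512` are all assumptions for illustration, and `device_map="auto"` assumes the `accelerate` package is installed.

```python
from typing import Dict, List

MODEL_NAME = "Qwen/Qwen2.5-3B-Instruct"  # assumed Hub repo id for this card


def build_messages(prompt: str, system: str = "You are a helpful assistant.") -> List[Dict[str, str]]:
    """Build the role/content message list that `apply_chat_template` consumes."""
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": prompt},
    ]


def generate(prompt: str, max_new_tokens: int = 512) -> str:
    """Load the tokenizer and model, then generate a reply to `prompt`."""
    # Imported lazily so build_messages() works without the heavy dependency.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_NAME, torch_dtype="auto", device_map="auto"
    )
    # Render the conversation into the model's chat format as plain text.
    text = tokenizer.apply_chat_template(
        build_messages(prompt), tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer([text], return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Drop the prompt tokens so only the newly generated reply is decoded.
    new_tokens = output_ids[0][inputs.input_ids.shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Calling `generate("Give me a short introduction to large language models.")` downloads the weights on first use, so it is best run on a machine with enough GPU memory for the 3B model.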
Qwen2.5-1.5B-Instruct
Qwen2.5 is the latest series of Qwen large language models. For Qwen2.5, we release a number of base language models and instruction-tuned language models ranging from 0.5 to 72 billion parameters. Qwen2.5 brings the following improvements over Qwen2:

- Significantly more knowledge and greatly improved capabilities in coding and mathematics, thanks to our specialized expert models in these domains.
- Significant improvements in instruction following, generating long texts (over 8K tokens), understanding structured data (e.g., tables), and generating structured outputs, especially JSON. It is also more resilient to diverse system prompts, which improves role-play implementation and condition-setting for chatbots.
- Long-context support up to 128K tokens, with generation of up to 8K tokens.
- Multilingual support for over 29 languages, including Chinese, English, French, Spanish, Portuguese, German, Italian, Russian, Japanese, Korean, Vietnamese, Thai, Arabic, and more.

This repo contains the instruction-tuned 1.5B Qwen2.5 model, which has the following features:

- Type: Causal Language Models
- Training Stage: Pretraining & Post-training
- Architecture: transformers with RoPE, SwiGLU, RMSNorm, Attention QKV bias and tied word embeddings
- Number of Parameters: 1.54B
- Number of Parameters (Non-Embedding): 1.31B
- Number of Layers: 28
- Number of Attention Heads (GQA): 12 for Q and 2 for KV
- Context Length: Full 32,768 tokens and generation 8192 tokens

For more details, please refer to our blog, GitHub, and Documentation.

The code of Qwen2.5 is included in the latest Hugging Face `transformers`, and we advise you to use the latest version of `transformers`. With `transformers<4.37.0`, loading the model fails with `KeyError: 'qwen2'`.

The quickstart uses `apply_chat_template` to show how to load the tokenizer and model and how to generate content. Detailed evaluation results are reported in the 📑 blog.
For GPU memory requirements and the corresponding throughput, see the reported benchmark results. If you find our work helpful, feel free to cite us.
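To make the effect of grouped-query attention (GQA) on GPU memory concrete, the arithmetic below estimates the key/value cache size from the figures listed above for Qwen2.5-1.5B-Instruct (28 layers, 12 query heads, 2 KV heads, 32,768-token context). It is a back-of-the-envelope sketch: the head dimension of 128 and 2-byte (16-bit) cache entries are assumptions not stated in the card, and the estimate ignores weights and activations.

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, bytes_per_value: int = 2) -> int:
    """Bytes of key/value cache stored for one token across all layers."""
    # Keys and values are both cached, hence the leading factor of 2.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


HEAD_DIM = 128       # assumption: not stated in the card
CONTEXT = 32_768     # full context length from the card

gqa = kv_cache_bytes_per_token(num_layers=28, num_kv_heads=2, head_dim=HEAD_DIM)
mha = kv_cache_bytes_per_token(num_layers=28, num_kv_heads=12, head_dim=HEAD_DIM)

print(gqa)                       # 28672 bytes per token with 2 KV heads
print(gqa * CONTEXT / 2**30)     # 0.875 GiB for a full 32,768-token context
print(mha // gqa)                # 6: the saving relative to caching all 12 heads
```

Under these assumptions, sharing 2 KV heads among 12 query heads shrinks the cache sixfold compared with giving every query head its own keys and values.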
Qwen2-VL-2B-Instruct
We're excited to unveil Qwen2-VL, the latest iteration of our Qwen-VL model, representing nearly a year of innovation.

- SoTA understanding of images of various resolution & ratio: Qwen2-VL achieves state-of-the-art performance on visual understanding benchmarks, including MathVista, DocVQA, RealWorldQA, MTVQA, etc.
- Understanding videos of 20min+: Qwen2-VL can understand videos over 20 minutes for high-quality video-based question answering, dialog, content creation, etc.
- Agent that can operate your mobiles, robots, etc.: with the abilities of complex reasoning and decision making, Qwen2-VL can be integrated with devices like mobile phones, robots, etc., for automatic operation based on visual environment and text instructions.
- Multilingual Support: to serve global users, besides English and Chinese, Qwen2-VL now supports the understanding of texts in different languages inside images, including most European languages, Japanese, Korean, Arabic, Vietnamese, etc.
- Naive Dynamic Resolution: Unlike before, Qwen2-VL can handle arbitrary image resolutions, mapping them into a dynamic number of visual tokens, offering a more human-like visual processing experience.
- Multimodal Rotary Position Embedding (M-ROPE): Decomposes positional embedding into parts to capture 1D textual, 2D visual, and 3D video positional information, enhancing its multimodal processing capabilities.

We have three models with 2, 7 and 72 billion parameters. This repo contains the instruction-tuned 2B Qwen2-VL model. For more information, visit our Blog and GitHub.
| Benchmark | InternVL2-2B | MiniCPM-V 2.0 | Qwen2-VL-2B |
| :--- | :---: | :---: | :---: |
| MMMU val | 36.3 | 38.2 | 41.1 |
| DocVQA test | 86.9 | - | 90.1 |
| InfoVQA test | 58.9 | - | 65.5 |
| ChartQA test | 76.2 | - | 73.5 |
| TextVQA val | 73.4 | - | 79.7 |
| OCRBench | 781 | 605 | 794 |
| MTVQA | - | - | 20.0 |
| VCR en easy | - | - | 81.45 |
| VCR zh easy | - | - | 46.16 |
| RealWorldQA | 57.3 | 55.8 | 62.9 |
| MME sum | 1876.8 | 1808.6 | 1872.0 |
| MMBench-EN test | 73.2 | 69.1 | 74.9 |
| MMBench-CN test | 70.9 | 66.5 | 73.5 |
| MMBench-V1.1 test | 69.6 | 65.8 | 72.2 |
| MMT-Bench test | - | - | 54.5 |
| MMStar | 49.8 | 39.1 | 48.0 |
| MMVet GPT-4-Turbo | 39.7 | 41.0 | 49.5 |
| HallBench avg | 38.0 | 36.1 | 41.7 |
| MathVista testmini | 46.0 | 39.8 | 43.0 |
| MathVision | - | - | 12.4 |

| Benchmark | Qwen2-VL-2B |
| :--- | :---: |
| MVBench | 63.2 |
| PerceptionTest test | 53.9 |
| EgoSchema test | 54.9 |
| Video-MME wo/w subs | 55.6/60.4 |

Requirements

The code of Qwen2-VL is included in the latest Hugging Face `transformers`, and we advise you to build from source with the command `pip install git+https://github.com/huggingface/transformers`; otherwise, loading the model may fail with `KeyError: 'qwen2_vl'`.

Quickstart

We offer a toolkit to help you handle various types of visual input more conveniently. This includes base64, URLs, and interleaved images and videos. You can install it with `pip install qwen-vl-utils`.

The chat model is used with `transformers` and `qwen_vl_utils`. For input images, we support local files, base64, and URLs. For videos, we currently only support local files.

The model supports a wide range of resolution inputs. By default, it uses the native resolution for input, but higher resolutions can enhance performance at the cost of more computation. Users can set the minimum and maximum number of pixels to achieve an optimal configuration for their needs, such as a token count range of 256-1280, to balance speed and memory usage.
In addition, we provide two methods for fine-grained control over the image size input to the model:

1. Define `min_pixels` and `max_pixels`: Images will be resized to maintain their aspect ratio within the range of `min_pixels` and `max_pixels`.
2. Specify exact dimensions: Directly set `resized_height` and `resized_width`. These values will be rounded to the nearest multiple of 28.

While Qwen2-VL is applicable to a wide range of visual tasks, it is equally important to understand its limitations. Here are some known restrictions:

1. Lack of Audio Support: The current model does not comprehend audio information within videos.
2. Data timeliness: Our image dataset is updated until June 2023, and information subsequent to this date may not be covered.
3. Constraints in Individuals and Intellectual Property (IP): The model's capacity to recognize specific individuals or IPs is limited, potentially failing to comprehensively cover all well-known personalities or brands.
4. Limited Capacity for Complex Instruction: When faced with intricate multi-step instructions, the model's understanding and execution capabilities require enhancement.
5. Insufficient Counting Accuracy: Particularly in complex scenes, the accuracy of object counting is not high, necessitating further improvements.
6. Weak Spatial Reasoning Skills: Especially in 3D spaces, the model's inference of object positional relationships is inadequate, making it difficult to precisely judge the relative positions of objects.

These limitations serve as ongoing directions for model optimization and improvement, and we are committed to continually enhancing the model's performance and scope of application. If you find our work helpful, feel free to cite us.
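The Qwen2-VL resizing rules described above (dimensions rounded to a multiple of 28, total pixels kept between a minimum and a maximum) can be illustrated with a small stand-alone sketch. This is not the toolkit's implementation: the function names are hypothetical, and the pixel bounds assume that one visual token corresponds to a 28×28 pixel block, so that the 256-1280 token range maps to `256 * 28 * 28` and `1280 * 28 * 28` pixels.

```python
import math

PATCH = 28                          # dimensions are rounded to multiples of 28
MIN_PIXELS = 256 * PATCH * PATCH    # assumed lower bound (about 256 visual tokens)
MAX_PIXELS = 1280 * PATCH * PATCH   # assumed upper bound (about 1280 visual tokens)


def round_to_patch(value: float) -> int:
    """Round to the nearest multiple of PATCH, never below one patch."""
    return max(PATCH, round(value / PATCH) * PATCH)


def fit_resolution(height: int, width: int,
                   min_pixels: int = MIN_PIXELS,
                   max_pixels: int = MAX_PIXELS) -> tuple:
    """Pick a (height, width) that roughly keeps the aspect ratio within the pixel bounds."""
    h, w = round_to_patch(height), round_to_patch(width)
    if h * w > max_pixels:
        # Too large: shrink both sides by the same factor, rounding down.
        scale = math.sqrt(height * width / max_pixels)
        h = max(PATCH, math.floor(height / scale / PATCH) * PATCH)
        w = max(PATCH, math.floor(width / scale / PATCH) * PATCH)
    elif h * w < min_pixels:
        # Too small: enlarge both sides by the same factor, rounding up.
        scale = math.sqrt(min_pixels / (height * width))
        h = math.ceil(height * scale / PATCH) * PATCH
        w = math.ceil(width * scale / PATCH) * PATCH
    return h, w


def visual_tokens(height: int, width: int) -> int:
    """Number of 28x28 blocks covered by the resized image."""
    return (height // PATCH) * (width // PATCH)


print(fit_resolution(448, 672))   # (448, 672): already a multiple of 28 and within bounds
print(visual_tokens(448, 672))    # 384
```

A 1080×1920 frame exceeds the assumed upper bound, so the sketch scales it down until the product of the two sides fits; a 100×100 thumbnail is scaled up to reach the lower bound.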
Qwen2.5-VL-32B-Instruct
--- license: apache-2.0 language: - en pipeline_tag: image-text-to-text tags: - multimodal library_name: transformers ---
Qwen3-4B
--- library_name: transformers license: apache-2.0 license_link: https://huggingface.co/Qwen/Qwen3-4B/blob/main/LICENSE pipeline_tag: text-generation base_model: - Qwen/Qwen3-4B-Base ---
Qwen3-VL-8B-Instruct
--- license: apache-2.0 pipeline_tag: image-text-to-text library_name: transformers ---
Qwen2.5-0.5B-Instruct
--- license: apache-2.0 license_link: https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct/blob/main/LICENSE language: - en pipeline_tag: text-generation base_model: Qwen/Qwen2.5-0.5B tags: - chat library_name: transformers ---
Qwen2-VL-7B-Instruct
--- license: apache-2.0 language: - en pipeline_tag: image-text-to-text tags: - multimodal library_name: transformers base_model: - Qwen/Qwen2-VL-7B new_version: Qwen/Qwen2.5-VL-7B-Instruct ---
Qwen3-VL-30B-A3B-Instruct
This model is designed for image-text-to-text tasks and is licensed under the Apache 2.0 license.
Qwen2.5-1.5B
--- license: apache-2.0 license_link: https://huggingface.co/Qwen/Qwen2.5-1.5B/blob/main/LICENSE language: - en pipeline_tag: text-generation library_name: transformers ---
Qwen3-Reranker-0.6B
--- license: apache-2.0 base_model: - Qwen/Qwen3-0.6B-Base library_name: transformers pipeline_tag: text-ranking ---
Qwen3-32B
Qwen3 is the latest generation of large language models in the Qwen series, offering a comprehensive suite of dense and mixture-of-experts (MoE) models. Built upon extensive training, Qwen3 delivers groundbreaking advancements in reasoning, instruction-following, agent capabilities, and multilingual support, with the following key features:

- Unique support for seamless switching between thinking mode (for complex logical reasoning, math, and coding) and non-thinking mode (for efficient, general-purpose dialogue) within a single model, ensuring optimal performance across various scenarios.
- Significantly enhanced reasoning capabilities, surpassing previous QwQ (in thinking mode) and Qwen2.5 instruct models (in non-thinking mode) on mathematics, code generation, and commonsense logical reasoning.
- Superior human preference alignment, excelling in creative writing, role-playing, multi-turn dialogues, and instruction following, to deliver a more natural, engaging, and immersive conversational experience.
- Expertise in agent capabilities, enabling precise integration with external tools in both thinking and non-thinking modes and achieving leading performance among open-source models in complex agent-based tasks.
- Support for 100+ languages and dialects with strong capabilities for multilingual instruction following and translation.

Qwen3-32B has the following features:

- Type: Causal Language Models
- Training Stage: Pretraining & Post-training
- Number of Parameters: 32.8B
- Number of Parameters (Non-Embedding): 31.2B
- Number of Layers: 64
- Number of Attention Heads (GQA): 64 for Q and 8 for KV
- Context Length: 32,768 natively and 131,072 tokens with YaRN.

For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our blog, GitHub, and Documentation.

The code of Qwen3 is included in the latest Hugging Face `transformers`, and we advise you to use the latest version of `transformers`.
With an outdated version of `transformers`, you will encounter an error when loading the model.

The quickstart ends by separating the thinking content from the final answer:

```python
# `output_ids` holds the newly generated token ids from the preceding generation step.
try:
    # Find the last occurrence of token 151668, which closes the thinking block.
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    # No thinking block was generated.
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content)
print("content:", content)
```

Serving with SGLang:

```shell
python -m sglang.launch_server --model-path Qwen/Qwen3-32B --reasoning-parser qwen3
```

Serving with vLLM:

```shell
vllm serve Qwen/Qwen3-32B --enable-reasoning --reasoning-parser deepseek_r1
```

Thinking mode enabled:

```python
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True  # True is the default value for enable_thinking
)
```

Thinking mode disabled:

```python
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False  # Setting enable_thinking=False disables thinking mode
)
```

Switching modes per turn with `/think` and `/no_think` in a multi-turn conversation:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer


class QwenChatbot:
    def __init__(self, model_name="Qwen/Qwen3-32B"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(model_name)
        self.history = []

    def generate_response(self, user_input):
        messages = self.history + [{"role": "user", "content": user_input}]

        text = self.tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True
        )

        inputs = self.tokenizer(text, return_tensors="pt")
        # Keep only the tokens generated after the prompt.
        response_ids = self.model.generate(**inputs, max_new_tokens=32768)[0][len(inputs.input_ids[0]):].tolist()
        response = self.tokenizer.decode(response_ids, skip_special_tokens=True)

        # Update history
        self.history.append({"role": "user", "content": user_input})
        self.history.append({"role": "assistant", "content": response})

        return response


# Example Usage
if __name__ == "__main__":
    chatbot = QwenChatbot()

    # First input (without /think or /no_think tags, thinking mode is enabled by default)
    user_input_1 = "How many r's in strawberries?"
    print(f"User: {user_input_1}")
    response_1 = chatbot.generate_response(user_input_1)
    print(f"Bot: {response_1}")
    print("----------------------")

    # Second input with /no_think
    user_input_2 = "Then, how many r's in blueberries? /no_think"
    print(f"User: {user_input_2}")
    response_2 = chatbot.generate_response(user_input_2)
    print(f"Bot: {response_2}")
    print("----------------------")

    # Third input with /think
    user_input_3 = "Really? /think"
    print(f"User: {user_input_3}")
    response_3 = chatbot.generate_response(user_input_3)
    print(f"Bot: {response_3}")
```

Agentic use with Qwen-Agent:

```python
from qwen_agent.agents import Assistant

# Define LLM
llm_cfg = {
    'model': 'Qwen3-32B',

    # Use the endpoint provided by Alibaba Model Studio:
    # 'model_type': 'qwen_dashscope',
    # 'api_key': os.getenv('DASHSCOPE_API_KEY'),

    # Use a custom endpoint compatible with OpenAI API:
    'model_server': 'http://localhost:8000/v1',  # api_base
    'api_key': 'EMPTY',

    # Other parameters:
    # 'generate_cfg': {
    #     # Add: When the response content is `<think>this is the thought</think>this is the answer;
    #     # Do not add: When the response has been separated by reasoning_content and content.
    #     'thought_in_content': True,
    # },
}

# Define Tools
tools = [
    {'mcpServers': {  # You can specify the MCP configuration file
        'time': {
            'command': 'uvx',
            'args': ['mcp-server-time', '--local-timezone=Asia/Shanghai']
        },
        "fetch": {
            "command": "uvx",
            "args": ["mcp-server-fetch"]
        }
    }
    },
    'code_interpreter',  # Built-in tools
]

# Define Agent
bot = Assistant(llm=llm_cfg, function_list=tools)

# Streaming generation
messages = [{'role': 'user', 'content': 'https://qwenlm.github.io/blog/ Introduce the latest developments of Qwen'}]
for responses in bot.run(messages=messages):
    pass
print(responses)
```

Enabling YaRN in `config.json`:

```json
{
    ...,
    "rope_scaling": {
        "rope_type": "yarn",
        "factor": 4.0,
        "original_max_position_embeddings": 32768
    }
}
```

Enabling YaRN with vLLM:

```shell
vllm serve ... --rope-scaling '{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768}' --max-model-len 131072
```

Enabling YaRN with SGLang:

```shell
python -m sglang.launch_server ... --json-model-override-args '{"rope_scaling":{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768}}'
```

Enabling YaRN with `llama-server`:

```shell
llama-server ... --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 32768
```

> Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'original_max_position_embeddings'}

```
@misc{qwen3technicalreport,
      title={Qwen3 Technical Report},
      author={Qwen Team},
      year={2025},
      eprint={2505.09388},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.09388},
}
```
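The Qwen3-32B parsing step above, which separates thinking content from the final answer, can be exercised without loading a model by working on plain token-id lists. The sketch below wraps the same reverse-search logic in a hypothetical helper `split_thinking`; the id 151668 is taken from the snippet above and is assumed here to be the token that closes the thinking block.

```python
from typing import List, Tuple

THINK_END_ID = 151668  # id searched for in the card's snippet; assumed to close the thinking block


def split_thinking(output_ids: List[int],
                   think_end_id: int = THINK_END_ID) -> Tuple[List[int], List[int]]:
    """Split generated ids into (thinking part, answer part) at the last end-of-thinking token."""
    try:
        # Searching the reversed list finds the last occurrence;
        # the resulting index points just past that token.
        index = len(output_ids) - output_ids[::-1].index(think_end_id)
    except ValueError:
        # No end-of-thinking token: everything is treated as the answer.
        index = 0
    return output_ids[:index], output_ids[index:]


thinking, answer = split_thinking([11, 12, THINK_END_ID, 21, 22])
print(thinking)  # [11, 12, 151668]
print(answer)    # [21, 22]
```

Each half can then be passed to `tokenizer.decode(..., skip_special_tokens=True)` as in the card's snippet; in non-thinking mode the first half is simply empty.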
Qwen3-Next-80B-A3B-Instruct
Over the past few months, we have observed increasingly clear trends toward scaling both total parameters and context lengths in the pursuit of more powerful and agentic artificial intelligence (AI). We are excited to share our latest advancements in addressing these demands, centered on improving scaling efficiency through innovative model architecture. We call these next-generation foundation models Qwen3-Next.

Qwen3-Next-80B-A3B is the first installment in the Qwen3-Next series and features the following key enhancements:

- Hybrid Attention: Replaces standard attention with the combination of Gated DeltaNet and Gated Attention, enabling efficient context modeling for ultra-long context length.
- High-Sparsity Mixture-of-Experts (MoE): Achieves an extremely low activation ratio in MoE layers, drastically reducing FLOPs per token while preserving model capacity.
- Stability Optimizations: Includes techniques such as zero-centered and weight-decayed layernorm, and other stabilizing enhancements for robust pre-training and post-training.
- Multi-Token Prediction (MTP): Boosts pretraining model performance and accelerates inference.

We are seeing strong performance in terms of both parameter efficiency and inference speed for Qwen3-Next-80B-A3B:

- Qwen3-Next-80B-A3B-Base outperforms Qwen3-32B-Base on downstream tasks with 10% of the total training cost and with 10 times the inference throughput for contexts over 32K tokens.
- Qwen3-Next-80B-A3B-Instruct performs on par with Qwen3-235B-A22B-Instruct-2507 on certain benchmarks, while demonstrating significant advantages in handling ultra-long-context tasks up to 256K tokens.

For more details, please refer to our blog post Qwen3-Next.

> [!Note]
> Qwen3-Next-80B-A3B-Instruct supports only instruct (non-thinking) mode and does not generate `<think></think>` blocks in its output.
Qwen3-Next-80B-A3B-Instruct has the following features:

- Type: Causal Language Models
- Training Stage: Pretraining (15T tokens) & Post-training
- Number of Parameters: 80B in total and 3B activated
- Number of Parameters (Non-Embedding): 79B
- Hidden Dimension: 2048
- Number of Layers: 48
  - Hybrid Layout: 12 * (3 * (Gated DeltaNet -> MoE) -> 1 * (Gated Attention -> MoE))
- Gated Attention:
  - Number of Attention Heads: 16 for Q and 2 for KV
  - Head Dimension: 256
  - Rotary Position Embedding Dimension: 64
- Gated DeltaNet:
  - Number of Linear Attention Heads: 32 for V and 16 for QK
  - Head Dimension: 128
- Mixture of Experts:
  - Number of Experts: 512
  - Number of Activated Experts: 10
  - Number of Shared Experts: 1
  - Expert Intermediate Dimension: 512
- Context Length: 262,144 natively and extensible up to 1,010,000 tokens

| | Qwen3-30B-A3B-Instruct-2507 | Qwen3-32B Non-Thinking | Qwen3-235B-A22B-Instruct-2507 | Qwen3-Next-80B-A3B-Instruct |
|--- | --- | --- | --- | --- |
| Knowledge | | | | |
| MMLU-Pro | 78.4 | 71.9 | 83.0 | 80.6 |
| MMLU-Redux | 89.3 | 85.7 | 93.1 | 90.9 |
| GPQA | 70.4 | 54.6 | 77.5 | 72.9 |
| SuperGPQA | 53.4 | 43.2 | 62.6 | 58.8 |
| Reasoning | | | | |
| AIME25 | 61.3 | 20.2 | 70.3 | 69.5 |
| HMMT25 | 43.0 | 9.8 | 55.4 | 54.1 |
| LiveBench 20241125 | 69.0 | 59.8 | 75.4 | 75.8 |
| Coding | | | | |
| LiveCodeBench v6 (25.02-25.05) | 43.2 | 29.1 | 51.8 | 56.6 |
| MultiPL-E | 83.8 | 76.9 | 87.9 | 87.8 |
| Aider-Polyglot | 35.6 | 40.0 | 57.3 | 49.8 |
| Alignment | | | | |
| IFEval | 84.7 | 83.2 | 88.7 | 87.6 |
| Arena-Hard v2* | 69.0 | 34.1 | 79.2 | 82.7 |
| Creative Writing v3 | 86.0 | 78.3 | 87.5 | 85.3 |
| WritingBench | 85.5 | 75.4 | 85.2 | 87.3 |
| Agent | | | | |
| BFCL-v3 | 65.1 | 63.0 | 70.9 | 70.3 |
| TAU1-Retail | 59.1 | 40.1 | 71.3 | 60.9 |
| TAU1-Airline | 40.0 | 17.0 | 44.0 | 44.0 |
| TAU2-Retail | 57.0 | 48.8 | 74.6 | 57.3 |
| TAU2-Airline | 38.0 | 24.0 | 50.0 | 45.5 |
| TAU2-Telecom | 12.3 | 24.6 | 32.5 | 13.2 |
| Multilingualism | | | | |
| MultiIF | 67.9 | 70.7 | 77.5 | 75.8 |
| MMLU-ProX | 72.0 | 69.3 | 79.4 | 76.7 |
| INCLUDE | 71.9 | 70.9 | 79.5 | 78.9 |
| PolyMATH | 43.1 | 22.5 | 50.2 | 45.9 |

*: For reproducibility, we report the win rates evaluated by GPT-4.1.

The code for Qwen3-Next has been merged into the main branch of Hugging Face `transformers`. With earlier versions, you will encounter an error when loading the model.

The quickstart snippet illustrates how to use the model to generate content based on given inputs.

> [!Note]
> Multi-Token Prediction (MTP) is not generally available in Hugging Face Transformers.

> [!Note]
> The efficiency or throughput improvement depends highly on the implementation.
> It is recommended to adopt a dedicated inference framework, e.g., SGLang and vLLM, for inference tasks.

> [!Tip]
> Depending on the inference settings, you may observe better efficiency with `flash-linear-attention` and `causal-conv1d`.
> See the links for detailed instructions and requirements.

For deployment, you can use the latest `sglang` or `vllm` to create an OpenAI-compatible API endpoint.

SGLang is a fast serving framework for large language models and vision language models. It can be used to launch a server with an OpenAI-compatible API service.

`sglang>=0.5.2` is required for Qwen3-Next. With it, an API endpoint can be created at `http://localhost:30000/v1` with a maximum context length of 256K tokens using tensor parallelism on 4 GPUs. A separate launch configuration is recommended for MTP, with the remaining settings the same.

> [!Note]
> The default context length is 256K. Consider reducing the context length to a smaller value, e.g., `32768`, if the server fails to start.

Please also refer to SGLang's usage guide on Qwen3-Next.

vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs. It can be used to launch a server with an OpenAI-compatible API service.
`vllm>=0.10.2` is required for Qwen3-Next. With it, an API endpoint can be created at `http://localhost:8000/v1` with a maximum context length of 256K tokens using tensor parallelism on 4 GPUs. A separate launch configuration is recommended for MTP, with the remaining settings the same.

> [!Note]
> The default context length is 256K. Consider reducing the context length to a smaller value, e.g., `32768`, if the server fails to start.

Please also refer to vLLM's usage guide on Qwen3-Next.

Qwen3 excels in tool calling capabilities. We recommend using Qwen-Agent to make the best use of the agentic ability of Qwen3. Qwen-Agent encapsulates tool-calling templates and tool-calling parsers internally, greatly reducing coding complexity. To define the available tools, you can use the MCP configuration file, use the integrated tools of Qwen-Agent, or integrate other tools yourself.

Qwen3-Next natively supports context lengths of up to 262,144 tokens. For conversations where the total length (including both input and output) significantly exceeds this limit, we recommend using RoPE scaling techniques to handle long texts effectively. We have validated the model's performance on context lengths of up to 1 million tokens using the YaRN method.

YaRN is currently supported by several inference frameworks, e.g., `transformers`, `vllm` and `sglang`. In general, there are two approaches to enabling YaRN for supported frameworks. The first is modifying the model files: in the `config.json` file, add the `rope_scaling` fields.

> [!NOTE]
> All the notable open-source frameworks implement static YaRN, which means the scaling factor remains constant regardless of input length, potentially impacting performance on shorter texts.
> We advise adding the `rope_scaling` configuration only when processing long contexts is required.
> It is also recommended to modify the `factor` as needed. For example, if the typical context length for your application is 524,288 tokens, it would be better to set `factor` as 2.0.

We test the model on a 1M version of the RULER benchmark.

| Model Name | Acc avg | 4k | 8k | 16k | 32k | 64k | 96k | 128k | 192k | 256k | 384k | 512k | 640k | 768k | 896k | 1000k |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen3-30B-A3B-Instruct-2507 | 86.8 | 98.0 | 96.7 | 96.9 | 97.2 | 93.4 | 91.0 | 89.1 | 89.8 | 82.5 | 83.6 | 78.4 | 79.7 | 77.6 | 75.7 | 72.8 |
| Qwen3-235B-A22B-Instruct-2507 | 92.5 | 98.5 | 97.6 | 96.9 | 97.3 | 95.8 | 94.9 | 93.9 | 94.5 | 91.0 | 92.2 | 90.9 | 87.8 | 84.8 | 86.5 | 84.5 |
| Qwen3-Next-80B-A3B-Instruct | 91.8 | 98.5 | 99.0 | 98.0 | 98.7 | 97.6 | 95.0 | 96.0 | 94.0 | 93.5 | 91.7 | 86.9 | 85.5 | 81.7 | 80.3 | 80.3 |

- Qwen3-Next is evaluated with YaRN enabled. Qwen3-2507 models are evaluated with Dual Chunk Attention enabled.
- Since the evaluation is time-consuming, we use 260 samples for each length (13 sub-tasks, 20 samples for each).

To achieve optimal performance, we recommend the following settings:

1. Sampling Parameters:
   - We suggest using `Temperature=0.7`, `TopP=0.8`, `TopK=20`, and `MinP=0`.
   - For supported frameworks, you can adjust the `presence_penalty` parameter between 0 and 2 to reduce endless repetitions. However, using a higher value may occasionally result in language mixing and a slight decrease in model performance.
2. Adequate Output Length: We recommend using an output length of 16,384 tokens for most queries, which is adequate for instruct models.
3. Standardize Output Format: We recommend using prompts to standardize model outputs when benchmarking.
   - Math Problems: Include "Please reason step by step, and put your final answer within \boxed{}." in the prompt.
   - Multiple-Choice Questions: Add the following JSON structure to the prompt to standardize responses: "Please show your choice in the `answer` field with only the choice letter, e.g., `"answer": "C"`."

If you find our work helpful, feel free to cite us.
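The advice above on choosing the YaRN `factor` for Qwen3-Next amounts to dividing the context length an application needs by the native context length. The sketch below builds the corresponding `rope_scaling` entry. The helper name `yarn_rope_scaling` is hypothetical, the key names follow the `rope_scaling` example shown for Qwen3-32B, and rounding the ratio up to a whole number is an assumption made here for simplicity rather than a documented requirement.

```python
import json
import math


def yarn_rope_scaling(target_context: int, native_context: int) -> dict:
    """Build a `rope_scaling` entry that stretches the native context to cover the target."""
    if target_context <= native_context:
        raise ValueError("YaRN is only needed beyond the native context length")
    # Assumption: round the ratio up so the scaled window covers the target.
    factor = float(math.ceil(target_context / native_context))
    return {
        "rope_type": "yarn",
        "factor": factor,
        "original_max_position_embeddings": native_context,
    }


NATIVE = 262_144  # native context length of Qwen3-Next stated above

print(json.dumps(yarn_rope_scaling(524_288, NATIVE)))
# {"rope_type": "yarn", "factor": 2.0, "original_max_position_embeddings": 262144}

print(yarn_rope_scaling(1_010_000, NATIVE)["factor"])  # 4.0
```

Because the frameworks mentioned above apply a static factor, a smaller value such as 2.0 is preferable when typical inputs stay near 524,288 tokens, and the entry is best omitted entirely when inputs fit in the native window.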
Qwen3-VL-32B-Instruct
--- license: apache-2.0 pipeline_tag: image-text-to-text library_name: transformers ---
Qwen2.5-7B
--- license: apache-2.0 license_link: https://huggingface.co/Qwen/Qwen2.5-7B/blob/main/LICENSE language: - en pipeline_tag: text-generation library_name: transformers ---
Qwen2.5-32B-Instruct-AWQ
--- base_model: Qwen/Qwen2.5-32B-Instruct language: - en library_name: transformers license: apache-2.0 license_link: https://huggingface.co/Qwen/Qwen2.5-32B-Instruct-AWQ/blob/main/LICENSE pipeline_tag: text-generation tags: - chat ---
Qwen3-1.7B
--- library_name: transformers license: apache-2.0 license_link: https://huggingface.co/Qwen/Qwen3-1.7B/blob/main/LICENSE pipeline_tag: text-generation base_model: - Qwen/Qwen3-1.7B-Base ---
Qwen3-30B-A3B-Instruct-2507
--- library_name: transformers license: apache-2.0 license_link: https://huggingface.co/Qwen/Qwen3-30B-A3B-Instruct-2507/blob/main/LICENSE pipeline_tag: text-generation ---
Qwen3-Embedding-8B
--- license: apache-2.0 base_model: - Qwen/Qwen3-8B-Base tags: - transformers - sentence-transformers - sentence-similarity - feature-extraction - text-embeddings-inference ---
Qwen2.5-7B-Instruct-AWQ
--- base_model: Qwen/Qwen2.5-7B-Instruct language: - en library_name: transformers license: apache-2.0 license_link: https://huggingface.co/Qwen/Qwen2.5-7B-Instruct-AWQ/blob/main/LICENSE pipeline_tag: text-generation tags: - chat ---
Qwen2.5-0.5B
--- license: apache-2.0 license_link: https://huggingface.co/Qwen/Qwen2.5-0.5B/blob/main/LICENSE language: - en pipeline_tag: text-generation library_name: transformers ---
Qwen3-14B
--- library_name: transformers license: apache-2.0 license_link: https://huggingface.co/Qwen/Qwen3-14B/blob/main/LICENSE pipeline_tag: text-generation base_model: - Qwen/Qwen3-14B-Base ---
Qwen2-0.5B
--- language: - en pipeline_tag: text-generation tags: - pretrained license: apache-2.0 new_version: Qwen/Qwen2.5-0.5B ---
Qwen2.5-14B-Instruct
--- license: apache-2.0 license_link: https://huggingface.co/Qwen/Qwen2.5-14B-Instruct/blob/main/LICENSE language: - en pipeline_tag: text-generation base_model: Qwen/Qwen2.5-14B tags: - chat library_name: transformers ---
Qwen2.5-VL-72B-Instruct
--- license: other license_name: qwen license_link: https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct/blob/main/LICENSE language: - en pipeline_tag: image-text-to-text tags: - multimodal library_name: transformers ---
Qwen3-0.6B-Base
--- license: apache-2.0 library_name: transformers pipeline_tag: text-generation ---
Qwen3-VL-4B-Instruct
--- license: apache-2.0 pipeline_tag: image-text-to-text library_name: transformers ---
Qwen3-Coder-30B-A3B-Instruct
--- library_name: transformers license: apache-2.0 license_link: https://huggingface.co/Qwen/Qwen3-Coder-30B-A3B-Instruct/blob/main/LICENSE pipeline_tag: text-generation ---
Qwen2.5-Math-1.5B
--- base_model: Qwen/Qwen2.5-1.5B language: - en pipeline_tag: text-generation library_name: transformers license: apache-2.0 license_link: https://huggingface.co/Qwen/Qwen2.5-Math-1.5B/blob/main/LICENSE ---
Qwen3-30B-A3B-Instruct-2507-FP8
--- library_name: transformers license: apache-2.0 license_link: https://huggingface.co/Qwen/Qwen3-30B-A3B-Instruct-2507-FP8/blob/main/LICENSE pipeline_tag: text-generation base_model: - Qwen/Qwen3-30B-A3B-Instruct-2507 ---
Qwen2.5-32B-Instruct-GPTQ-Int8
--- base_model: Qwen/Qwen2.5-32B-Instruct language: - en library_name: transformers license: apache-2.0 license_link: https://huggingface.co/Qwen/Qwen2.5-32B-Instruct-GPTQ-Int8/blob/main/LICENSE pipeline_tag: text-generation tags: - chat ---
Qwen2.5-32B-Instruct
--- license: apache-2.0 license_link: https://huggingface.co/Qwen/Qwen2.5-32B-Instruct/blob/main/LICENSE language: - en pipeline_tag: text-generation base_model: Qwen/Qwen2.5-32B tags: - chat library_name: transformers ---
Qwen3-8B-FP8
--- library_name: transformers license: apache-2.0 license_link: https://huggingface.co/Qwen/Qwen3-8B-FP8/blob/main/LICENSE pipeline_tag: text-generation base_model: - Qwen/Qwen3-8B ---
Qwen3-4B-Base
--- license: apache-2.0 library_name: transformers ---
Qwen2.5-Coder-7B-Instruct-AWQ
--- license: apache-2.0 license_link: https://huggingface.co/Qwen/Qwen2.5-Coder-7B-Instruct/blob/main/LICENSE language: - en base_model: - Qwen/Qwen2.5-Coder-7B-Instruct pipeline_tag: text-generation library_name: transformers tags: - code - codeqwen - chat - qwen - qwen-coder ---
Qwen2.5-Coder-1.5B
--- license: apache-2.0 license_link: https://huggingface.co/Qwen/Qwen2.5-Coder-1.5B/blob/main/LICENSE language: - en base_model: - Qwen/Qwen2.5-1.5B pipeline_tag: text-generation library_name: transformers tags: - code - qwen - qwen-coder - codeqwen ---
Qwen2-1.5B-Instruct
--- license: apache-2.0 language: - en pipeline_tag: text-generation tags: - chat ---
Qwen3-30B-A3B
--- library_name: transformers license: apache-2.0 license_link: https://huggingface.co/Qwen/Qwen3-30B-A3B/blob/main/LICENSE pipeline_tag: text-generation base_model: - Qwen/Qwen3-30B-A3B-Base ---
Qwen3-Embedding-4B
--- license: apache-2.0 base_model: - Qwen/Qwen3-4B-Base tags: - transformers - sentence-transformers - sentence-similarity - feature-extraction - text-embeddings-inference ---
Qwen2.5-14B
--- license: apache-2.0 license_link: https://huggingface.co/Qwen/Qwen2.5-14B/blob/main/LICENSE language: - en pipeline_tag: text-generation ---
Qwen2.5-72B-Instruct
--- license: other license_name: qwen license_link: https://huggingface.co/Qwen/Qwen2.5-72B-Instruct/blob/main/LICENSE language: - en pipeline_tag: text-generation base_model: Qwen/Qwen2.5-72B tags: - chat library_name: transformers ---
Qwen3-Omni-30B-A3B-Instruct
--- license: other license_name: apache-2.0 language: - en tags: - multimodal library_name: transformers pipeline_tag: any-to-any ---
Qwen2.5-Coder-7B-Instruct
--- license: apache-2.0 license_link: https://huggingface.co/Qwen/Qwen2.5-Coder-7B-Instruct/blob/main/LICENSE language: - en base_model: - Qwen/Qwen2.5-Coder-7B pipeline_tag: text-generation library_name: transformers tags: - code - codeqwen - chat - qwen - qwen-coder ---
Qwen2.5-VL-7B-Instruct-AWQ
--- license: apache-2.0 language: - en pipeline_tag: image-text-to-text tags: - multimodal library_name: transformers base_model: - Qwen/Qwen2.5-VL-7B-Instruct ---
Qwen-Image-Edit-2509
--- license: apache-2.0 language: - en - zh library_name: diffusers pipeline_tag: image-to-image ---
Qwen3-30B-A3B-GPTQ-Int4
--- library_name: transformers license: apache-2.0 license_link: https://huggingface.co/Qwen/Qwen3-30B-A3B/blob/main/LICENSE pipeline_tag: text-generation base_model: Qwen/Qwen3-30B-A3B ---
Qwen3-4B-Thinking-2507
--- library_name: transformers license: apache-2.0 license_link: https://huggingface.co/Qwen/Qwen3-4B-Thinking-2507/blob/main/LICENSE pipeline_tag: text-generation ---
Qwen2-7B-Instruct
--- license: apache-2.0 language: - en pipeline_tag: text-generation tags: - chat base_model: Qwen/Qwen2-7B ---
Qwen3-VL-2B-Instruct
--- license: apache-2.0 pipeline_tag: image-text-to-text library_name: transformers ---
Qwen3-235B-A22B
--- library_name: transformers license: apache-2.0 license_link: https://huggingface.co/Qwen/Qwen3-235B-A22B/blob/main/LICENSE pipeline_tag: text-generation ---
Qwen3-14B-Base
--- license: apache-2.0 library_name: transformers ---
Qwen2.5-Omni-3B
--- license: other license_name: qwen-research license_link: LICENSE language: - en tags: - multimodal library_name: transformers pipeline_tag: any-to-any ---
Qwen3-30B-A3B-Thinking-2507
--- library_name: transformers license: apache-2.0 license_link: https://huggingface.co/Qwen/Qwen3-30B-A3B-Thinking-2507/blob/main/LICENSE pipeline_tag: text-generation ---
Qwen3-32B-AWQ
--- library_name: transformers license: apache-2.0 license_link: https://huggingface.co/Qwen/Qwen3-32B/blob/main/LICENSE pipeline_tag: text-generation base_model: Qwen/Qwen3-32B ---
Qwen3-Next-80B-A3B-Instruct-FP8
--- library_name: transformers license: apache-2.0 license_link: https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Instruct-FP8/blob/main/LICENSE pipeline_tag: text-generation base_model: - Qwen/Qwen3-Next-80B-A3B-Instruct ---
Qwen3-8B-Base
--- license: apache-2.0 library_name: transformers ---
Qwen3-VL-8B-Thinking
Meet Qwen3-VL — the most powerful vision-language model in the Qwen series to date. This generation delivers comprehensive upgrades across the board: superior text understanding & generation, deepe...
Qwen2.5-3B
--- license: other license_name: qwen-research license_link: https://huggingface.co/Qwen/Qwen2.5-3B/blob/main/LICENSE language: - en pipeline_tag: text-generation ---
Qwen3-14B-AWQ
--- library_name: transformers license: apache-2.0 license_link: https://huggingface.co/Qwen/Qwen3-14B/blob/main/LICENSE pipeline_tag: text-generation base_model: Qwen/Qwen3-14B ---
Qwen3-4B-Thinking-2507-FP8
--- library_name: transformers license: apache-2.0 license_link: https://huggingface.co/Qwen/Qwen3-4B-Thinking-2507-FP8/blob/main/LICENSE pipeline_tag: text-generation base_model: - Qwen/Qwen3-4B-Thinking-2507 ---
Qwen3-Next-80B-A3B-Thinking-FP8
--- library_name: transformers license: apache-2.0 license_link: https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Thinking-FP8/blob/main/LICENSE pipeline_tag: text-generation base_model: - Qwen/Qwen3-Next-80B-A3B-Thinking ---
Qwen-Image
--- license: apache-2.0 language: - en - zh library_name: diffusers pipeline_tag: text-to-image ---
Qwen2.5-32B
--- license: apache-2.0 license_link: https://huggingface.co/Qwen/Qwen2.5-32B/blob/main/LICENSE language: - en pipeline_tag: text-generation ---
Qwen2-0.5B-Instruct
--- license: apache-2.0 language: - en pipeline_tag: text-generation tags: - chat base_model: Qwen/Qwen2-0.5B ---
Qwen3-VL-235B-A22B-Instruct-FP8
> This repository contains an FP8 quantized version of the Qwen3-VL-235B-A22B model. The quantization method is fine-grained fp8 quantization with block size of 128, and its performance metrics are nearly identical to those of the original BF16 model. Enjoy!

Meet Qwen3-VL, the most powerful vision-language model in the Qwen series to date. This generation delivers comprehensive upgrades across the board: superior text understanding & generation, deeper visual perception & reasoning, extended context length, enhanced spatial and video dynamics comprehension, and stronger agent interaction capabilities. Available in Dense and MoE architectures that scale from edge to cloud, with Instruct and reasoning‑enhanced Thinking editions for flexible, on‑demand deployment.

- Visual Agent: Operates PC/mobile GUIs: recognizes elements, understands functions, invokes tools, completes tasks.
- Visual Coding Boost: Generates Draw.io/HTML/CSS/JS from images/videos.
- Advanced Spatial Perception: Judges object positions, viewpoints, and occlusions; provides stronger 2D grounding and enables 3D grounding for spatial reasoning and embodied AI.
- Long Context & Video Understanding: Native 256K context, expandable to 1M; handles books and hours-long video with full recall and second-level indexing.
- Enhanced Multimodal Reasoning: Excels in STEM/Math, with causal analysis and logical, evidence-based answers.
- Upgraded Visual Recognition: Broader, higher-quality pretraining is able to “recognize everything”: celebrities, anime, products, landmarks, flora/fauna, etc.
- Expanded OCR: Supports 32 languages (up from 19); robust in low light, blur, and tilt; better with rare/ancient characters and jargon; improved long-document structure parsing.
- Text Understanding on par with pure LLMs: Seamless text–vision fusion for lossless, unified comprehension.

1. Interleaved-MRoPE: Full‑frequency allocation over time, width, and height via robust positional embeddings, enhancing long‑horizon video reasoning.
2. DeepStack: Fuses multi‑level ViT features to capture fine‑grained details and sharpen image–text alignment.
3. Text–Timestamp Alignment: Moves beyond T‑RoPE to precise, timestamp‑grounded event localization for stronger video temporal modeling.

This is the weight repository for Qwen3-VL-235B-A22B-Instruct-FP8. Currently, 🤗 Transformers does not support loading these weights directly. Stay tuned!

We recommend deploying the model using vLLM or SGLang, with example launch commands provided below. For details on the runtime environment and deployment, please refer to this link.

Here we provide a code snippet demonstrating how to use vLLM to run inference with Qwen3-VL locally. For more details on efficient deployment with vLLM, please refer to the community deployment guide.

Here we provide a code snippet demonstrating how to use SGLang to run inference with Qwen3-VL locally.

If you find our work helpful, please consider citing us.
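The card describes the quantization as fine-grained FP8 with a block size of 128. As a rough illustration of why per-block scaling helps, the sketch below computes one scale per block of 128 values so that each block uses the full FP8 range. It is a simplified sketch, not the repository's quantizer: the 1-D block layout, the E4M3 target format with a maximum of 448, and the function names are all assumptions, and the final rounding to FP8 is deliberately left out.

```python
from typing import List, Tuple

FP8_E4M3_MAX = 448.0  # assumed target format: largest finite E4M3 magnitude
BLOCK_SIZE = 128      # block size stated in the model card


def blockwise_scale(values: List[float],
                    block_size: int = BLOCK_SIZE) -> Tuple[List[float], List[List[float]]]:
    """Return one scale per block and the values rescaled into the FP8 range.

    Hypothetical helper: rounding to actual FP8 codes is not performed here.
    """
    scales: List[float] = []
    scaled_blocks: List[List[float]] = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        amax = max(abs(v) for v in block)
        # An all-zero block gets a neutral scale to avoid division by zero.
        scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
        scales.append(scale)
        scaled_blocks.append([v / scale for v in block])
    return scales, scaled_blocks


def dequantize(scales: List[float], scaled_blocks: List[List[float]]) -> List[float]:
    """Undo the scaling: multiply every value by its block's scale."""
    return [v * s for s, block in zip(scales, scaled_blocks) for v in block]
```

With a single scale for the whole tensor, a block of small values next to a block of large ones would be squeezed into a few FP8 codes; per-block scales let both blocks span the full range.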
Qwen3-VL-30B-A3B-Instruct-FP8
> This repository contains an FP8 quantized version of the Qwen3-VL-30B-A3B-Instruct model. The quantization method is fine-grained fp8 quantization with block size of 128, and its performance metr...
Qwen-Image-Edit
--- license: apache-2.0 language: - en - zh library_name: diffusers pipeline_tag: image-to-image ---
Qwen3-8B-AWQ
--- library_name: transformers license: apache-2.0 license_link: https://huggingface.co/Qwen/Qwen3-8B/blob/main/LICENSE pipeline_tag: text-generation base_model: Qwen/Qwen3-8B ---
Qwen2.5-72B-Instruct-AWQ
--- base_model: Qwen/Qwen2.5-72B-Instruct language: - en library_name: transformers license: other license_name: qwen license_link: https://huggingface.co/Qwen/Qwen2.5-72B-Instruct-AWQ/blob/main/LICENSE pipeline_tag: text-generation tags: - chat ---
Qwen3-Coder-30B-A3B-Instruct-FP8
--- library_name: transformers license: apache-2.0 license_link: https://huggingface.co/Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8/blob/main/LICENSE pipeline_tag: text-generation ---
Qwen-7B
Qwen-7B is the 7B-parameter version of the large language model series Qwen (abbr. Tongyi Qianwen), proposed by Alibaba Cloud. Qwen-7B is a Transformer-based large language model pretrained on a large volume of data, including web text, books, and code. Based on the pretrained Qwen-7B, we also release Qwen-7B-Chat, a large-model-based AI assistant trained with alignment techniques. Both the pretrained and chat models have since been updated to better-performing versions. This repository hosts the Qwen-7B base language model.

1. Large-scale high-quality training corpora: It is pretrained on over 2.4 trillion tokens, including Chinese, English, multilingual texts, code, and mathematics, covering general and professional fields. The distribution of the pre-training corpus has been optimized through a large number of ablation experiments.
2. Competitive performance: It significantly surpasses existing open-source models of similar scale on multiple Chinese and English downstream evaluation tasks (including commonsense, reasoning, code, mathematics, etc.), and even surpasses some larger-scale models on several benchmarks. See below for specific evaluation results.
3.
More comprehensive vocabulary coverage: Compared with other open-source models whose vocabularies focus mainly on Chinese and English, Qwen-7B uses a vocabulary of over 150K tokens. This vocabulary is friendlier to multiple languages, so users can enhance the model's capability in certain languages without expanding the vocabulary.

For more details about Qwen, please refer to the GitHub code repository.

- Python 3.8 and above
- PyTorch 1.12 and above; 2.0 and above is recommended
- CUDA 11.4 and above is recommended (for GPU users, flash-attention users, etc.)

To run Qwen-7B, please make sure you meet the above requirements, and then execute the following pip commands to install the dependent libraries.

In addition, we recommend installing the `flash-attention` library (flash attention 2 is now supported) for higher efficiency and lower memory usage.

You can easily call the model with the following code:

For more information, please refer to our GitHub repo.

Our tokenizer, based on tiktoken, differs from other tokenizers such as the sentencepiece tokenizer. Pay particular attention to special tokens, especially when fine-tuning. For more detailed information on the tokenizer and its use in fine-tuning, please refer to the documentation.

The details of the model architecture of Qwen-7B are listed as follows.
| Hyperparameter | Value |
|:----------------|:-------|
| n_layers | 32 |
| n_heads | 32 |
| d_model | 4096 |
| vocab size | 151851 |
| sequence length | 8192 |

For position encoding, the FFN activation function, and normalization, we adopt prevalent practices: RoPE relative position encoding, SwiGLU as the activation function, and RMSNorm for normalization (flash-attention can optionally be installed for acceleration).

For tokenization, in contrast to current mainstream open-source models whose vocabularies focus on Chinese and English, Qwen-7B uses a vocabulary of over 150K tokens. Built on `cl100k_base`, the BPE vocabulary used by GPT-4, it is optimized for Chinese and multilingual text: it encodes Chinese, English, and code data efficiently and is friendlier to many other languages, so users can enhance the model's capability in some languages without expanding the vocabulary. It splits numbers into single digits and calls the efficient tiktoken library for tokenization.

We randomly sampled 1 million documents per language to compare the encoding compression rates of different models (with XLM-R, which supports 100 languages, as the baseline value of 1; lower is better). The specific performance is shown in the figure above.
As can be seen, while ensuring the efficient decoding of Chinese, English, and code, Qwen-7B also achieves a high compression rate for many other widely used languages (such as th, he, ar, ko, vi, ja, tr, id, pl, ru, nl, pt, it, de, es, fr, etc.), giving the model strong scalability as well as high training and inference efficiency in these languages.

The pretraining corpus exceeds 2.4T tokens after deduplication and filtering, encompassing web text, encyclopedias, books, code, mathematics, and various specialized domains.

Evaluation

We selected MMLU, C-Eval, GSM8K, MATH, HumanEval, MBPP, BBH, and CMMLU, which are currently popular benchmarks, to test the model's Chinese and English knowledge, translation, mathematical reasoning, coding, and other capabilities. The following comprehensive evaluation results show that the Qwen models outperform similarly sized open-source models on all tasks.
| Model | MMLU | C-Eval | GSM8K | MATH | HumanEval | MBPP | BBH | CMMLU |
|:-------------------|:--------:|:--------:|:--------:|:--------:|:---------:|:--------:|:--------:|:--------:|
| | 5-shot | 5-shot | 8-shot | 4-shot | 0-shot | 3-shot | 3-shot | 5-shot |
| LLaMA2-7B | 46.8 | 32.5 | 16.7 | 3.3 | 12.8 | 20.8 | 38.2 | 31.8 |
| LLaMA2-13B | 55.0 | 41.4 | 29.6 | 5.0 | 18.9 | 30.3 | 45.6 | 38.4 |
| LLaMA2-34B | 62.6 | - | 42.2 | 6.2 | 22.6 | 33.0 | 44.1 | - |
| ChatGLM2-6B | 47.9 | 51.7 | 32.4 | 6.5 | - | - | 33.7 | - |
| InternLM-7B | 51.0 | 53.4 | 31.2 | 6.3 | 10.4 | 14.0 | 37.0 | 51.8 |
| InternLM-20B | 62.1 | 58.8 | 52.6 | 7.9 | 25.6 | 35.6 | 52.5 | 59.0 |
| Baichuan2-7B | 54.7 | 56.3 | 24.6 | 5.6 | 18.3 | 24.2 | 41.6 | 57.1 |
| Baichuan2-13B | 59.5 | 59.0 | 52.8 | 10.1 | 17.1 | 30.2 | 49.0 | 62.0 |
| Qwen-7B (original) | 56.7 | 59.6 | 51.6 | - | 24.4 | 31.2 | 40.6 | 58.8 |
| Qwen-7B | 58.2 | 63.5 | 51.7 | 11.6 | 29.9 | 31.6 | 45.0 | 62.2 |
| Qwen-14B | 66.3 | 72.1 | 61.3 | 24.8 | 32.3 | 40.8 | 53.4 | 71.0 |

We introduce NTK-aware interpolation, LogN attention scaling, window attention, and related techniques to extend the context length of Qwen-7B (original) and Qwen-14B from 2K to over 8K tokens, and that of Qwen-7B from 8K to 32K tokens. We conduct language modeling experiments on the arXiv dataset, evaluating Qwen-7B and Qwen-14B with perplexity (PPL) at different sequence lengths. Results are shown below. (To use NTK interpolation and LogN scaling, set `use_dynamic_ntk` and `use_logn_attn` to true in config.json.)

Qwen-7B (original): 4.23 | 3.78 | 39.35 | 469.81 | 2645.09 | -
+ dynamic_ntk + logn + window_attn: 4.23 | 3.78 | 3.58 | 3.49 | 4.32 | -
+ dynamic_ntk + logn + window_attn: 4.23 | 3.81 | 3.52 | 3.33 | 3.22 | 3.17
+ dynamic_ntk + logn + window_attn: - | 3.46 | 3.29 | 3.18 | 3.42 | -

We have provided evaluation scripts to reproduce the performance of our model; see the link for details. Note: small fluctuations in reproduced results are normal, owing to rounding errors introduced by hardware and frameworks.
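The context-extension techniques named above can be made concrete with a little arithmetic. The sketch below shows the commonly used formulas for NTK-aware RoPE base scaling and LogN attention scaling. It is a minimal illustration under stated assumptions, not the Qwen implementation: the function names, the power-of-two schedule for `alpha`, and the example lengths are all hypothetical.

```python
import math


def ntk_scaled_base(base: float, head_dim: int, alpha: float) -> float:
    """NTK-aware interpolation: enlarge the RoPE base by alpha^(d / (d - 2))."""
    return base * alpha ** (head_dim / (head_dim - 2))


def dynamic_ntk_alpha(seq_len: int, train_len: int) -> float:
    """Choose alpha from the current sequence length (assumed schedule).

    Inputs no longer than the training length are left untouched (alpha = 1).
    """
    if seq_len <= train_len:
        return 1.0
    return float(2 ** math.ceil(math.log2(seq_len / train_len) + 1) - 1)


def logn_scale(position: int, train_len: int) -> float:
    """LogN attention scaling: scale queries by log_{train_len}(position)
    once the position exceeds the training length, otherwise by 1."""
    if position <= train_len:
        return 1.0
    return math.log(position, train_len)


if __name__ == "__main__":
    alpha = dynamic_ntk_alpha(seq_len=16384, train_len=8192)
    print("alpha:", alpha)
    print("scaled base:", ntk_scaled_base(10000.0, 128, alpha))
    print("logn scale at 16384:", logn_scale(16384, 8192))
```

Both adjustments are identity operations inside the training length, which is consistent with the perplexity rows above staying unchanged at the shortest lengths.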
If you run into problems, please consult the FAQ and existing issues for a solution before opening a new issue.

If you find our work helpful, please consider citing us.

Our code and checkpoints are fully open for research purposes, and commercial use is also permitted. Check LICENSE for more details about the license. If you require commercial use, please fill out the form to apply.

If you would like to leave a message for our research team or product team, join our Discord or WeChat groups! Also, feel free to send an email to [email protected].
Qwen2.5-Omni-7B
Overview

Introduction

Qwen2.5-Omni is an end-to-end multimodal model designed to perceive diverse modalities, including text, images, audio, and video, while simultaneously generating text and natural speech responses in a streaming manner.

- Omni and Novel Architecture: We propose the Thinker-Talker architecture, an end-to-end multimodal model designed to perceive diverse modalities, including text, images, audio, and video, while simultaneously generating text and natural speech responses in a streaming manner. We also propose a novel position embedding, named TMRoPE (Time-aligned Multimodal RoPE), to synchronize the timestamps of video inputs with audio.
- Real-Time Voice and Video Chat: Architecture designed for fully real-time interactions, supporting chunked input and immediate output.
- Natural and Robust Speech Generation: Surpassing many existing streaming and non-streaming alternatives, demonstrating superior robustness and naturalness in speech generation.
- Strong Performance Across Modalities: Exhibiting exceptional performance across all modalities when benchmarked against similarly sized single-modality models. Qwen2.5-Omni outperforms the similarly sized Qwen2-Audio in audio capabilities and achieves comparable performance to Qwen2.5-VL-7B.
- Excellent End-to-End Speech Instruction Following: Qwen2.5-Omni shows performance in end-to-end speech instruction following that rivals its effectiveness with text inputs, evidenced by benchmarks such as MMLU and GSM8K.

We conducted a comprehensive evaluation of Qwen2.5-Omni, which demonstrates strong performance across all modalities when compared to similarly sized single-modality models and closed-source models like Qwen2.5-VL-7B, Qwen2-Audio, and Gemini-1.5-pro. In tasks requiring the integration of multiple modalities, such as OmniBench, Qwen2.5-Omni achieves state-of-the-art performance.
Furthermore, in single-modality tasks, it excels in areas including speech recognition (Common Voice), translation (CoVoST2), audio understanding (MMAU), image reasoning (MMMU, MMStar), video understanding (MVBench), and speech generation (Seed-tts-eval and subjective naturalness).

| OmniBench | Speech | Sound Event | Music | Avg |
|---|---|---|---|---|
| Gemini-1.5-Pro | 42.67% | 42.26% | 46.23% | 42.91% |

| Librispeech | dev-clean | dev other | test-clean | test-other |
|---|---|---|---|---|
| SALMONN | - | - | 2.1 | 4.9 |

| Common Voice 15 | en | zh | yue | fr |
|---|---|---|---|---|
| Whisper-large-v3 | 9.3 | 12.8 | 10.9 | 10.8 |

| Wenetspeech | test-net | test-meeting |
|---|---|---|
| Seed-ASR-Chinese | 4.7 | 5.7 |

| CoVoST2 | en-de | de-en | en-zh | zh-en |
|---|---|---|---|---|
| SALMONN | 18.6 | - | 33.1 | - |

| MusicCaps | | | | | | |
|---|---|---|---|---|---|---|
| LP-MusicCaps | 0.291 | 0.149 | 0.089 | 0.061 | 0.129 | 0.130 |
| Qwen2.5-Omni-3B | 0.325 | 0.163 | 0.093 | 0.057 | 0.132 | 0.229 |
| Qwen2.5-Omni-7B | 0.328 | 0.162 | 0.090 | 0.055 | 0.127 | 0.225 |

| MMAU | Sound | Music | Speech | Avg |
|---|---|---|---|---|
| Gemini-Pro-V1.5 | 56.75 | 49.40 | 58.55 | 54.90 |

| VoiceBench | AlpacaEval | CommonEval | SD-QA | MMSU |
|---|---|---|---|---|
| Ultravox-v0.4.1-LLaMA-3.1-8B | 4.55 | 3.90 | 53.35 | 47.17 |

| VoiceBench | OpenBookQA | IFEval | AdvBench | Avg |
|---|---|---|---|---|
| Ultravox-v0.4.1-LLaMA-3.1-8B | 65.27 | 66.88 | 98.46 | 71.45 |

| Dataset | Qwen2.5-Omni-7B | Qwen2.5-Omni-3B | Other Best | Qwen2.5-VL-7B | GPT-4o-mini |
|--------------------------------|--------------|------------|------------|---------------|-------------|
| MMMU val | 59.2 | 53.1 | 53.9 | 58.6 | 60.0 |
| MMMU-Pro overall | 36.6 | 29.7 | - | 38.3 | 37.6 |
| MathVista testmini | 67.9 | 59.4 | 71.9 | 68.2 | 52.5 |
| MathVision full | 25.0 | 20.8 | 23.1 | 25.1 | - |
| MMBench-V1.1-EN test | 81.8 | 77.8 | 80.5 | 82.6 | 76.0 |
| MMVet turbo | 66.8 | 62.1 | 67.5 | 67.1 | 66.9 |
| MMStar | 64.0 | 55.7 | 64.0 | 63.9 | 54.8 |
| MME sum | 2340 | 2117 | 2372 | 2347 | 2003 |
| MuirBench | 59.2 | 48.0 | - | 59.2 | - |
| CRPE relation | 76.5 | 73.7 | - | 76.4 | - |
| RealWorldQA avg | 70.3 | 62.6 | 71.9 | 68.5 | - |
| MME-RealWorld en | 61.6 | 55.6 | - | 57.4 | - |
| MM-MT-Bench | 6.0 | 5.0 | - | 6.3 | - |
| AI2D | 83.2 | 79.5 | 85.8 | 83.9 | - |
| TextVQA val | 84.4 | 79.8 | 83.2 | 84.9 | - |
| DocVQA test | 95.2 | 93.3 | 93.5 | 95.7 | - |
| ChartQA test Avg | 85.3 | 82.8 | 84.9 | 87.3 | - |
| OCRBenchV2 en | 57.8 | 51.7 | - | 56.3 | - |

| Dataset | Qwen2.5-Omni-7B | Qwen2.5-Omni-3B | Qwen2.5-VL-7B | Grounding DINO | Gemini 1.5 Pro |
|--------------------------|--------------|---------------|---------------|----------------|----------------|
| Refcoco val | 90.5 | 88.7 | 90.0 | 90.6 | 73.2 |
| Refcoco textA | 93.5 | 91.8 | 92.5 | 93.2 | 72.9 |
| Refcoco textB | 86.6 | 84.0 | 85.4 | 88.2 | 74.6 |
| Refcoco+ val | 85.4 | 81.1 | 84.2 | 88.2 | 62.5 |
| Refcoco+ textA | 91.0 | 87.5 | 89.1 | 89.0 | 63.9 |
| Refcoco+ textB | 79.3 | 73.2 | 76.9 | 75.9 | 65.0 |
| Refcocog+ val | 87.4 | 85.0 | 87.2 | 86.1 | 75.2 |
| Refcocog+ test | 87.9 | 85.1 | 87.2 | 87.0 | 76.2 |
| ODinW | 42.4 | 39.2 | 37.3 | 55.0 | 36.7 |
| PointGrounding | 66.5 | 46.2 | 67.3 | - | - |

| Dataset | Qwen2.5-Omni-7B | Qwen2.5-Omni-3B | Other Best | Qwen2.5-VL-7B | GPT-4o-mini |
|-----------------------------|--------------|------------|------------|---------------|-------------|
| Video-MME w/o sub | 64.3 | 62.0 | 63.9 | 65.1 | 64.8 |
| Video-MME w sub | 72.4 | 68.6 | 67.9 | 71.6 | - |
| MVBench | 70.3 | 68.7 | 67.2 | 69.6 | - |
| EgoSchema test | 68.6 | 61.4 | 63.2 | 65.0 | - |

| SEED | test-zh | test-en | test-hard |
|---|---|---|---|
| Seed-TTSICL | 1.11 | 2.24 | 7.58 |

| SEED | test-zh | test-en | test-hard |
|---|---|---|---|
| Seed-TTSICL | 0.796 | 0.762 | 0.776 |

| Dataset | Qwen2.5-Omni-7B | Qwen2.5-Omni-3B | Qwen2.5-7B | Qwen2.5-3B | Qwen2-7B | Llama3.1-8B | Gemma2-9B |
|-----------------------------------|-----------|------------|------------|------------|------------|-------------|-----------|
| MMLU-Pro | 47.0 | 40.4 | 56.3 | 43.7 | 44.1 | 48.3 | 52.1 |
| MMLU-redux | 71.0 | 60.9 | 75.4 | 64.4 | 67.3 | 67.2 | 72.8 |
| LiveBench 0831 | 29.6 | 22.3 | 35.9 | 26.8 | 29.2 | 26.7 | 30.6 |
| GPQA | 30.8 | 34.3 | 36.4 | 30.3 | 34.3 | 32.8 | 32.8 |
| MATH | 71.5 | 63.6 | 75.5 | 65.9 | 52.9 | 51.9 | 44.3 |
| GSM8K | 88.7 | 82.6 | 91.6 | 86.7 | 85.7 | 84.5 | 76.7 |
| HumanEval | 78.7 | 70.7 | 84.8 | 74.4 | 79.9 | 72.6 | 68.9 |
| MBPP | 73.2 | 70.4 | 79.2 | 72.7 | 67.2 | 69.6 | 74.9 |
| MultiPL-E | 65.8 | 57.6 | 70.4 | 60.2 | 59.1 | 50.7 | 53.4 |
| LiveCodeBench 2305-2409 | 24.6 | 16.5 | 28.7 | 19.9 | 23.9 | 8.3 | 18.9 |

Below, we provide simple examples to show how to use Qwen2.5-Omni with 🤗 Transformers. The code of Qwen2.5-Omni is included in the latest Hugging Face transformers, and we advise you to build from source with command:

We offer a toolkit to help you handle various types of audio and visual input more conveniently, as if you were using an API. This includes base64, URLs, and interleaved audio, images and videos. You can install it using the following command; make sure your system has `ffmpeg` installed:

If you are not using Linux, you might not be able to install `decord` from PyPI. In that case, you can use `pip install qwen-omni-utils -U`, which falls back to torchvision for video processing. You can still install decord from source so that it is used when loading video.

Here is a code snippet showing how to use the chat model with `transformers` and `qwen_omni_utils`:

| Model | Precision | 15(s) Video | 30(s) Video | 60(s) Video |
|--------------|-----------| ------------- | ------------- | ------------------ |
| Qwen-Omni-3B | FP32 | 89.10 GB | Not Recommended | Not Recommended |
| Qwen-Omni-3B | BF16 | 18.38 GB | 22.43 GB | 28.22 GB |
| Qwen-Omni-7B | FP32 | 93.56 GB | Not Recommended | Not Recommended |
| Qwen-Omni-7B | BF16 | 31.11 GB | 41.85 GB | 60.19 GB |

Note: The table above presents the theoretical minimum memory requirements for inference with `transformers`, and `BF16` is tested with `attn_implementation="flash_attention_2"`; in practice, actual memory usage is typically at least 1.2 times higher. For more information, see the linked resource here.

Video URL compatibility largely depends on the third-party library version.
The details are in the table below. To override the default backend, set `FORCE_QWENVL_VIDEO_READER=torchvision` or `FORCE_QWENVL_VIDEO_READER=decord`.

| Backend | HTTP | HTTPS |
|-------------|------|-------|
| torchvision >= 0.19.0 | ✅ | ✅ |
| torchvision

The model can batch inputs composed of mixed samples of various types, such as text, images, audio, and videos, when `return_audio=False` is set. Here is an example.

Prompt for audio output

If users need audio output, the system prompt must be set to "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."; otherwise the audio output may not work as expected.

Use audio in video

In multimodal interaction, the videos provided by users are often accompanied by audio (such as questions about the content of the video, or sounds generated by certain events in the video). This information helps the model provide a better interactive experience, so we provide the following options for users to decide whether to use the audio in a video. Note that during a multi-round conversation, the `use_audio_in_video` parameter must be set to the same value in all of these places; otherwise unexpected results will occur.

The model supports both text and audio outputs. If users do not need audio output, they can call `model.disable_talker()` after initializing the model. This saves about 2 GB of GPU memory, but the `return_audio` option of the `generate` function can then only be set to `False`. For a more flexible experience, we recommend deciding whether to return audio when the `generate` function is called. If `return_audio` is set to `False`, the model returns only text output, so text responses arrive faster.

Change voice type of output audio

Qwen2.5-Omni supports changing the voice of the output audio.
The `"Qwen/Qwen2.5-Omni-7B"` checkpoint supports two voice types, as follows:

| Voice Type | Gender | Description |
|------------|--------|-------------|
| Chelsie | Female | A honeyed, velvety voice that carries a gentle warmth and luminous clarity. |
| Ethan | Male | A bright, upbeat voice with infectious energy and a warm, approachable vibe. |

Users can use the `speaker` parameter of the `generate` function to specify the voice type. If `speaker` is not specified, the default voice type is `Chelsie`.

First, make sure to install the latest version of Flash Attention 2:

Also, you should have hardware that is compatible with FlashAttention 2. Read more about it in the official documentation of the flash attention repository. FlashAttention-2 can only be used when a model is loaded in `torch.float16` or `torch.bfloat16`.

To load and run a model using FlashAttention-2, add `attn_implementation="flash_attention_2"` when loading the model:

If you find our paper and code useful in your research, please consider giving a star :star: and citation :pencil: :)
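The usage rules above (a fixed system prompt when audio output is wanted, one consistent audio-in-video setting across a conversation, and a choice between two voices) can be gathered in a small helper. This is a hedged sketch rather than the card's own code: the helper names and the message layout (a list of role/content dictionaries with typed content items) are assumptions, and the option names `use_audio_in_video`, `return_audio`, and `speaker` are the parameters mentioned in the text, written with underscores.

```python
from typing import Any, Dict, List, Optional

# System prompt quoted from the model card; required when audio output is wanted.
AUDIO_SYSTEM_PROMPT = (
    "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, "
    "capable of perceiving auditory and visual inputs, as well as generating "
    "text and speech."
)

SUPPORTED_SPEAKERS = ("Chelsie", "Ethan")  # voice types listed in the card


def build_conversation(user_text: str,
                       video_path: Optional[str] = None) -> List[Dict[str, Any]]:
    """Build a two-message conversation (assumed chat-template layout)."""
    content: List[Dict[str, str]] = []
    if video_path is not None:
        content.append({"type": "video", "video": video_path})
    content.append({"type": "text", "text": user_text})
    return [
        {"role": "system", "content": [{"type": "text", "text": AUDIO_SYSTEM_PROMPT}]},
        {"role": "user", "content": content},
    ]


def generation_options(use_audio_in_video: bool,
                       speaker: str = "Chelsie",
                       return_audio: bool = True) -> Dict[str, Any]:
    """Collect per-conversation options in one place (hypothetical helper).

    Keeping them in a single dict makes it easier to pass the same
    use_audio_in_video value to every step of a multi-round conversation.
    """
    if speaker not in SUPPORTED_SPEAKERS:
        raise ValueError(f"unsupported speaker: {speaker!r}")
    return {
        "use_audio_in_video": use_audio_in_video,
        "speaker": speaker,
        "return_audio": return_audio,
    }
```

Building the options once and reusing the same dictionary is one simple way to avoid the mismatched `use_audio_in_video` values that the card warns about.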
Qwen3-1.7B-Base
Qwen2.5-Coder-14B-Instruct
Qwen3-32B-FP8
Qwen2-Audio-7B-Instruct
Qwen2-Audio is the new series of Qwen large audio-language models. Qwen2-Audio can accept various audio signal inputs and perform audio analysis or respond directly in text to speech instructions. We introduce two distinct audio interaction modes:

- voice chat: users can freely engage in voice interactions with Qwen2-Audio without text input;
- audio analysis: users can provide audio and text instructions for analysis during the interaction.

We release Qwen2-Audio-7B and Qwen2-Audio-7B-Instruct, which are the pretrained model and the chat model, respectively. For more details, please refer to our Blog, GitHub, and Report.

Requirements

The code of Qwen2-Audio is included in the latest Hugging Face transformers, and we advise you to build from source with the command `pip install git+https://github.com/huggingface/transformers`, or you might encounter the following error:

In the following, we demonstrate how to use `Qwen2-Audio-7B-Instruct` for inference, supporting both voice chat and audio analysis modes. Note that we use the ChatML format for dialog; in this demo we show how to leverage `apply_chat_template` for this purpose.

Voice Chat Inference

In the voice chat mode, users can freely engage in voice interactions with Qwen2-Audio without text input:

Audio Analysis Inference

In the audio analysis mode, users can provide both audio and text instructions for analysis:

If you find our work helpful, please consider citing us.
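Since the card says the dialog uses the ChatML format, it may help to see what that layout looks like as plain text. The sketch below renders text-only messages by hand; it is an illustration, not a replacement for `apply_chat_template`. The function name `to_chatml` is hypothetical, and the real template for this model also inserts audio placeholders, which are omitted here.

```python
from typing import Dict, List


def to_chatml(messages: List[Dict[str, str]],
              add_generation_prompt: bool = True) -> str:
    """Render text-only messages in the ChatML layout.

    Each turn is wrapped as <|im_start|>{role}\\n{content}<|im_end|>\\n.
    When add_generation_prompt is True, an open assistant turn is appended
    so the model continues from there.
    """
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    ]
    if add_generation_prompt:
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


if __name__ == "__main__":
    demo = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What sound is this?"},
    ]
    print(to_chatml(demo))
```

In practice the tokenizer or processor's own chat template should be preferred, because it carries the model-specific special tokens.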
Qwen3-4B-AWQ
Qwen2.5-14B-Instruct-AWQ
Qwen2.5-Coder-1.5B-Instruct
Qwen2.5-Coder is the latest series of code-specific Qwen large language models (formerly known as CodeQwen). Qwen2.5-Coder currently covers six mainstream model sizes (0.5, 1.5, 3, 7, 14, and 32 billion parameters) to meet the needs of different developers. Qwen2.5-Coder brings the following improvements upon CodeQwen1.5:

- Significant improvements in code generation, code reasoning, and code fixing. Building on the strong Qwen2.5, we scale the training data up to 5.5 trillion tokens, including source code, text-code grounding data, synthetic data, etc. Qwen2.5-Coder-32B has become the current state-of-the-art open-source code LLM, with coding abilities matching those of GPT-4o.
- A more comprehensive foundation for real-world applications such as Code Agents, enhancing coding capabilities while maintaining strengths in mathematics and general competencies.

This repo contains the instruction-tuned 1.5B Qwen2.5-Coder model, which has the following features:

- Type: Causal Language Models
- Training Stage: Pretraining & Post-training
- Architecture: transformers with RoPE, SwiGLU, RMSNorm, Attention QKV bias and tied word embeddings
- Number of Parameters: 1.54B
- Number of Parameters (Non-Embedding): 1.31B
- Number of Layers: 28
- Number of Attention Heads (GQA): 12 for Q and 2 for KV
- Context Length: Full 32,768 tokens

For more details, please refer to our blog, GitHub, Documentation, Arxiv.

The code of Qwen2.5-Coder is included in the latest Hugging Face `transformers`, and we advise you to use the latest version of `transformers`. With `transformers<4.37.0`, you will encounter the following error:

Here is a code snippet with `apply_chat_template` showing how to load the tokenizer and model and how to generate content.

Detailed evaluation results are reported in this 📑 blog. For requirements on GPU memory and the respective throughput, see results here.

If you find our work helpful, please consider citing us.
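The feature list gives 12 query heads and 2 key/value heads under grouped-query attention (GQA). The sketch below shows how query heads are typically mapped onto shared KV heads and how much smaller the KV cache becomes. It assumes the common contiguous grouping convention, which the card does not state, and the function names are hypothetical.

```python
def kv_head_for_query_head(q_head: int,
                           num_q_heads: int = 12,
                           num_kv_heads: int = 2) -> int:
    """Return the index of the KV head shared by a given query head.

    Assumes contiguous groups: with 12 query heads and 2 KV heads,
    query heads 0-5 share KV head 0 and query heads 6-11 share KV head 1.
    """
    if num_q_heads % num_kv_heads != 0:
        raise ValueError("num_q_heads must be a multiple of num_kv_heads")
    if not 0 <= q_head < num_q_heads:
        raise ValueError("q_head out of range")
    group_size = num_q_heads // num_kv_heads
    return q_head // group_size


def kv_cache_reduction(num_q_heads: int = 12, num_kv_heads: int = 2) -> float:
    """Factor by which GQA shrinks the KV cache relative to one KV head per query head."""
    return num_q_heads / num_kv_heads


if __name__ == "__main__":
    print([kv_head_for_query_head(i) for i in range(12)])
    print(kv_cache_reduction())
```

The same arithmetic applies to the other sizes in the family once their head counts are substituted.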
Qwen2.5-Coder-3B-Instruct
Qwen2.5-Coder is the latest series of code-specific Qwen large language models (formerly known as CodeQwen). Qwen2.5-Coder currently covers six mainstream model sizes (0.5, 1.5, 3, 7, 14, and 32 billion parameters) to meet the needs of different developers. Qwen2.5-Coder brings the following improvements upon CodeQwen1.5:

- Significant improvements in code generation, code reasoning, and code fixing. Building on the strong Qwen2.5, we scale the training data up to 5.5 trillion tokens, including source code, text-code grounding data, synthetic data, etc. Qwen2.5-Coder-32B has become the current state-of-the-art open-source code LLM, with coding abilities matching those of GPT-4o.
- A more comprehensive foundation for real-world applications such as Code Agents, enhancing coding capabilities while maintaining strengths in mathematics and general competencies.

This repo contains the instruction-tuned 3B Qwen2.5-Coder model, which has the following features:

- Type: Causal Language Models
- Training Stage: Pretraining & Post-training
- Architecture: transformers with RoPE, SwiGLU, RMSNorm, Attention QKV bias and tied word embeddings
- Number of Parameters: 3.09B
- Number of Parameters (Non-Embedding): 2.77B
- Number of Layers: 36
- Number of Attention Heads (GQA): 16 for Q and 2 for KV
- Context Length: Full 32,768 tokens

For more details, please refer to our blog, GitHub, Documentation, Arxiv.

The code of Qwen2.5-Coder is included in the latest Hugging Face `transformers`, and we advise you to use the latest version of `transformers`. With `transformers<4.37.0`, you will encounter the following error:

Here is a code snippet with `apply_chat_template` showing how to load the tokenizer and model and how to generate content.

Detailed evaluation results are reported in this 📑 blog. For requirements on GPU memory and the respective throughput, see results here.

If you find our work helpful, please consider citing us.
Qwen3-Next-80B-A3B-Thinking
--- library_name: transformers license: apache-2.0 license_link: https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Thinking/blob/main/LICENSE pipeline_tag: text-generation ---
Qwen2.5-VL-3B-Instruct-AWQ
Qwen3-VL-32B-Thinking
Meet Qwen3-VL, the most powerful vision-language model in the Qwen series to date. This generation delivers comprehensive upgrades across the board: superior text understanding and generation, deeper visual perception and reasoning, extended context length, enhanced spatial and video dynamics comprehension, and stronger agent interaction capabilities. It is available in Dense and MoE architectures that scale from edge to cloud, with Instruct and reasoning-enhanced Thinking editions for flexible, on-demand deployment.

- Visual Agent: Operates PC/mobile GUIs: recognizes elements, understands functions, invokes tools, completes tasks.
- Visual Coding Boost: Generates Draw.io/HTML/CSS/JS from images/videos.
- Advanced Spatial Perception: Judges object positions, viewpoints, and occlusions; provides stronger 2D grounding and enables 3D grounding for spatial reasoning and embodied AI.
- Long Context & Video Understanding: Native 256K context, expandable to 1M; handles books and hours-long video with full recall and second-level indexing.
- Enhanced Multimodal Reasoning: Excels in STEM/Math, with causal analysis and logical, evidence-based answers.
- Upgraded Visual Recognition: Broader, higher-quality pretraining lets the model "recognize everything": celebrities, anime, products, landmarks, flora/fauna, etc.
- Expanded OCR: Supports 32 languages (up from 19); robust in low light, blur, and tilt; better with rare/ancient characters and jargon; improved long-document structure parsing.
- Text Understanding on par with pure LLMs: Seamless text-vision fusion for lossless, unified comprehension.

1. Interleaved-MRoPE: Full-frequency allocation over time, width, and height via robust positional embeddings, enhancing long-horizon video reasoning.
2. DeepStack: Fuses multi-level ViT features to capture fine-grained details and sharpen image-text alignment.
3. Text-Timestamp Alignment: Moves beyond T-RoPE to precise, timestamp-grounded event localization for stronger video temporal modeling.

This is the weight repository for Qwen3-VL-32B-Thinking. Qwen3-VL can be used with 🤖 ModelScope and 🤗 Transformers. The code for Qwen3-VL is included in the latest Hugging Face `transformers`, and we advise you to build it from source.

If you find our work helpful, feel free to cite us.
Qwen3-235B-A22B-GPTQ-Int4
--- library_name: transformers license: apache-2.0 license_link: https://huggingface.co/Qwen/Qwen3-235B-A22B/blob/main/LICENSE pipeline_tag: text-generation base_model: Qwen/Qwen3-235B-A22B ---
Qwen2.5-Coder-32B-Instruct
Qwen2.5-Coder is the latest series of code-specific Qwen large language models (formerly known as CodeQwen). Qwen2.5-Coder currently covers six mainstream model sizes (0.5, 1.5, 3, 7, 14, and 32 billion parameters) to meet the needs of different developers. It brings the following improvements over CodeQwen1.5:

- Significant improvements in code generation, code reasoning, and code fixing. Building on the strong Qwen2.5, we scaled the training data up to 5.5 trillion tokens, including source code, text-code grounding data, synthetic data, etc. Qwen2.5-Coder-32B has become the current state-of-the-art open-source code LLM, with coding abilities matching those of GPT-4o.
- A more comprehensive foundation for real-world applications such as code agents, enhancing coding capabilities while maintaining strengths in mathematics and general competencies.
- Long-context support up to 128K tokens.

This repo contains the instruction-tuned 32B Qwen2.5-Coder model, which has the following features:

- Type: Causal Language Models
- Training Stage: Pretraining & Post-training
- Architecture: transformers with RoPE, SwiGLU, RMSNorm, and Attention QKV bias
- Number of Parameters: 32.5B
- Number of Parameters (Non-Embedding): 31.0B
- Number of Layers: 64
- Number of Attention Heads (GQA): 40 for Q and 8 for KV
- Context Length: Full 131,072 tokens
  - Please refer to this section for detailed instructions on how to deploy Qwen2.5 for handling long texts.

For more details, please refer to our blog, GitHub, Documentation, and arXiv paper.

The code for Qwen2.5-Coder is included in the latest Hugging Face `transformers`, and we advise you to use the latest version. With `transformers<4.37.0`, you will encounter a `KeyError: 'qwen2'` error.

Use `apply_chat_template` to format the conversation when loading the tokenizer and model and generating content.

The current `config.json` is set for a context length of up to 32,768 tokens. To handle inputs exceeding 32,768 tokens, we use YaRN, a technique for enhancing model length extrapolation, to maintain performance on lengthy texts. For supported frameworks, you can enable YaRN by adding a `rope_scaling` entry to `config.json`.

For deployment, we recommend using vLLM. Please refer to our Documentation for usage if you are not familiar with vLLM. Presently, vLLM only supports static YaRN, which means the scaling factor remains constant regardless of input length, potentially impacting performance on shorter texts. We advise adding the `rope_scaling` configuration only when processing long contexts is required.

Detailed evaluation results are reported in this 📑 blog. For GPU memory requirements and the corresponding throughput, see the results here.

If you find our work helpful, feel free to cite us.
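As a minimal sketch of the YaRN advice above, the snippet below adds a scaling entry to a configuration dictionary only when the target context exceeds the native 32,768 tokens. The key names (`rope_scaling`, `factor`, `original_max_position_embeddings`, `type`) are assumed to follow the usual Hugging Face config convention, and `with_yarn` is a hypothetical helper; check your inference framework's documentation before editing a real `config.json`.

```python
import json


def with_yarn(config: dict, target_len: int, native_len: int = 32_768) -> dict:
    """Return a copy of `config`, adding a static YaRN entry only for long contexts."""
    updated = dict(config)
    if target_len <= native_len:
        return updated  # short contexts: leave the config untouched
    updated["rope_scaling"] = {
        # Static scaling factor: ratio of the target length to the native length.
        "factor": target_len / native_len,
        "original_max_position_embeddings": native_len,
        "type": "yarn",
    }
    return updated


base = {"max_position_embeddings": 32_768}  # stand-in for the real config.json contents
long_cfg = with_yarn(base, 131_072)
print(json.dumps(long_cfg["rope_scaling"], indent=2))
```

For the full 131,072-token context this yields a factor of 4.0. Because the factor is static, it applies to short inputs too, which is why the entry is only added when long contexts are actually needed.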
Qwen2-1.5B
Qwen3-235B-A22B-Instruct-2507
--- library_name: transformers license: apache-2.0 license_link: https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507/blob/main/LICENSE pipeline_tag: text-generation ---
Qwen3-VL-8B-Instruct-FP8
> This repository contains an FP8 quantized version of the Qwen3-VL-8B-Instruct model. The quantization method is fine-grained fp8 quantization with block size of 128, and its performance metrics a...
Qwen3-VL-30B-A3B-Thinking-FP8
Qwen3-235B-A22B-Instruct-2507-FP8
Qwen3-4B-MLX-4bit
Qwen3 is the latest generation of large language models in the Qwen series, offering a comprehensive suite of dense and mixture-of-experts (MoE) models. Built upon extensive training, Qwen3 delivers groundbreaking advancements in reasoning, instruction following, agent capabilities, and multilingual support, with the following key features:

- Unique support for seamless switching between thinking mode (for complex logical reasoning, math, and coding) and non-thinking mode (for efficient, general-purpose dialogue) within a single model, ensuring optimal performance across various scenarios.
- Significantly enhanced reasoning capabilities, surpassing the previous QwQ (in thinking mode) and Qwen2.5 instruct models (in non-thinking mode) on mathematics, code generation, and commonsense logical reasoning.
- Superior human preference alignment, excelling in creative writing, role-playing, multi-turn dialogues, and instruction following, to deliver a more natural, engaging, and immersive conversational experience.
- Expertise in agent capabilities, enabling precise integration with external tools in both thinking and non-thinking modes and achieving leading performance among open-source models in complex agent-based tasks.
- Support for 100+ languages and dialects, with strong capabilities for multilingual instruction following and translation.

Qwen3-4B has the following features:

- Type: Causal Language Models
- Training Stage: Pretraining & Post-training
- Number of Parameters: 4.0B
- Number of Parameters (Non-Embedding): 3.6B
- Number of Layers: 36
- Number of Attention Heads (GQA): 32 for Q and 8 for KV
- Context Length: 32,768 tokens natively and 131,072 tokens with YaRN.

For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our blog, GitHub, and Documentation.
The code for Qwen3 is included in recent versions of both `transformers` (≥ 4.52.4) and `mlx_lm` (≥ 0.25.2), and we advise you to use the latest versions of both. Older versions of either library may not work.

> [!TIP]
> The `enable_thinking` switch is also available in APIs created by SGLang and vLLM.
> Please refer to our documentation for SGLang and vLLM users.

By default, Qwen3 has thinking capabilities enabled, similar to QwQ-32B. This means the model will use its reasoning abilities to enhance the quality of generated responses. For example, when explicitly setting `enable_thinking=True` or leaving it as the default value in `tokenizer.apply_chat_template`, the model will engage its thinking mode. In this mode, the model will generate think content wrapped in a `<think>...</think>` block, followed by the final response.

> [!NOTE]
> For thinking mode, use `Temperature=0.6`, `TopP=0.95`, `TopK=20`, and `MinP=0` (the default setting in `generation_config.json`). DO NOT use greedy decoding, as it can lead to performance degradation and endless repetitions. For more detailed guidance, please refer to the Best Practices section.

We provide a hard switch to strictly disable the model's thinking behavior, aligning its functionality with the previous Qwen2.5-Instruct models. This mode is particularly useful in scenarios where disabling thinking is essential for enhancing efficiency. In this mode, the model will not generate any think content and will not include a `<think>...</think>` block.

> [!NOTE]
> For non-thinking mode, we suggest using `Temperature=0.7`, `TopP=0.8`, `TopK=20`, and `MinP=0`. For more detailed guidance, please refer to the Best Practices section.

Advanced Usage: Switching Between Thinking and Non-Thinking Modes via User Input

We provide a soft switch mechanism that allows users to dynamically control the model's behavior when `enable_thinking=True`. Specifically, you can add `/think` and `/no_think` to user prompts or system messages to switch the model's thinking mode from turn to turn. The model will follow the most recent instruction in multi-turn conversations.

> [!NOTE]
> For API compatibility, when `enable_thinking=True`, regardless of whether the user uses `/think` or `/no_think`, the model will always output a block wrapped in `<think>...</think>`. However, the content inside this block may be empty if thinking is disabled.
> When `enable_thinking=False`, the soft switches are not valid. Regardless of any `/think` or `/no_think` tags input by the user, the model will not generate think content and will not include a `<think>...</think>` block.

Qwen3 excels in tool calling capabilities. We recommend using Qwen-Agent to make the best use of the agentic abilities of Qwen3. Qwen-Agent encapsulates tool-calling templates and tool-calling parsers internally, greatly reducing coding complexity. To define the available tools, you can use an MCP configuration file, use the integrated tools of Qwen-Agent, or integrate other tools yourself.

Qwen3 natively supports context lengths of up to 32,768 tokens. For conversations where the total length (including both input and output) significantly exceeds this limit, we recommend using RoPE scaling techniques to handle long texts effectively. We have validated the model's performance on context lengths of up to 131,072 tokens using the YaRN method.

YaRN is currently supported by several inference frameworks, e.g., `transformers` and `llama.cpp` for local use, and `vllm` and `sglang` for deployment. In general, there are two approaches to enabling YaRN for supported frameworks:

- Modifying the model files: In the `config.json` file, add the `rope_scaling` fields.

> [!IMPORTANT]
> If you encounter a warning after adding these fields, please upgrade to `transformers>=4.51.0`.
> [!NOTE]
> All the notable open-source frameworks implement static YaRN, which means the scaling factor remains constant regardless of input length, potentially impacting performance on shorter texts.
> We advise adding the `rope_scaling` configuration only when processing long contexts is required.
> It is also recommended to modify the `factor` as needed. For example, if the typical context length for your application is 65,536 tokens, it would be better to set `factor` to 2.0.

> [!NOTE]
> The default `max_position_embeddings` in `config.json` is set to 40,960. This allocation reserves 32,768 tokens for outputs and 8,192 tokens for typical prompts, which is sufficient for most scenarios involving short text processing. If the average context length does not exceed 32,768 tokens, we do not recommend enabling YaRN, as it may degrade model performance.

> [!TIP]
> The endpoint provided by Alibaba Model Studio supports dynamic YaRN by default and no extra configuration is needed.

To achieve optimal performance, we recommend the following settings:

1. Sampling Parameters:
   - For thinking mode (`enable_thinking=True`), use `Temperature=0.6`, `TopP=0.95`, `TopK=20`, and `MinP=0`. DO NOT use greedy decoding, as it can lead to performance degradation and endless repetitions.
   - For non-thinking mode (`enable_thinking=False`), we suggest using `Temperature=0.7`, `TopP=0.8`, `TopK=20`, and `MinP=0`.
   - For supported frameworks, you can adjust the `presence_penalty` parameter between 0 and 2 to reduce endless repetitions. However, using a higher value may occasionally result in language mixing and a slight decrease in model performance.
2. Adequate Output Length: We recommend using an output length of 32,768 tokens for most queries. For benchmarking on highly complex problems, such as those found in math and programming competitions, we suggest setting the max output length to 38,912 tokens. This provides the model with sufficient space to generate detailed and comprehensive responses, thereby enhancing its overall performance.
3. Standardize Output Format: We recommend using prompts to standardize model outputs when benchmarking.
   - Math Problems: Include "Please reason step by step, and put your final answer within \boxed{}." in the prompt.
   - Multiple-Choice Questions: Add the following JSON structure to the prompt to standardize responses: "Please show your choice in the `answer` field with only the choice letter, e.g., `"answer": "C"`."
4. No Thinking Content in History: In multi-turn conversations, the historical model output should include only the final output and does not need to include the thinking content. This is implemented in the provided Jinja2 chat template. However, for frameworks that do not directly use the Jinja2 chat template, it is up to the developers to ensure that this best practice is followed.

If you find our work helpful, feel free to cite us.
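For frameworks that bypass the Jinja2 chat template, the sampling and history recommendations above could be applied by hand, roughly as sketched below. This assumes the thinking content is delimited by `<think>` and `</think>` tags and that messages are plain `role`/`content` dictionaries; the helper names and the lowercase parameter keys are hypothetical and may need renaming for your framework.

```python
import re

# Non-greedy match so that each think block is removed separately.
THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)


def sampling_params(enable_thinking: bool) -> dict:
    """Sampling settings recommended above for each mode."""
    if enable_thinking:
        return {"temperature": 0.6, "top_p": 0.95, "top_k": 20, "min_p": 0.0}
    return {"temperature": 0.7, "top_p": 0.8, "top_k": 20, "min_p": 0.0}


def strip_thinking(text: str) -> str:
    """Drop think blocks and surrounding whitespace, keeping the final answer."""
    return THINK_BLOCK.sub("", text).strip()


def clean_history(messages: list[dict]) -> list[dict]:
    """Return a new history whose assistant turns contain only final outputs."""
    cleaned = []
    for message in messages:
        if message["role"] == "assistant":
            message = {**message, "content": strip_thinking(message["content"])}
        cleaned.append(message)
    return cleaned


history = [
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "content": "<think>\nSimple addition.\n</think>\n\n4"},
]
print(clean_history(history)[1]["content"])  # 4
```

`clean_history` builds new dictionaries instead of editing the originals, so the raw output (including the thinking content) stays available for logging.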
Qwen3Guard-Gen-0.6B
--- library_name: transformers license: apache-2.0 license_link: https://huggingface.co/Qwen/Qwen3Guard-Gen-0.6B/blob/main/LICENSE pipeline_tag: text-generation base_model: - Qwen/Qwen3-0.6B ---
Qwen3-VL-4B-Instruct-FP8
> This repository contains an FP8 quantized version of the Qwen3-VL-4B-Instruct model. The quantization method is fine-grained fp8 quantization with block size of 128, and its performance metrics a...
Qwen3-VL-235B-A22B-Instruct
Meet Qwen3-VL — the most powerful vision-language model in the Qwen series to date. This generation delivers comprehensive upgrades across the board: superior text understanding & generation, deepe...
Qwen2.5-Math-7B
--- base_model: Qwen/Qwen2.5-7B language: - en pipeline_tag: text-generation library_name: transformers license: apache-2.0 license_link: https://huggingface.co/Qwen/Qwen2.5-Math-7B/blob/main/LICENSE ---
Qwen2.5-Coder-32B-Instruct-GPTQ-Int4
Qwen2.5-Coder is the latest series of code-specific Qwen large language models (formerly known as CodeQwen). Qwen2.5-Coder currently covers six mainstream model sizes (0.5, 1.5, 3, 7, 14, and 32 billion parameters) to meet the needs of different developers. It brings the following improvements over CodeQwen1.5:

- Significant improvements in code generation, code reasoning, and code fixing. Building on the strong Qwen2.5, we scaled the training data up to 5.5 trillion tokens, including source code, text-code grounding data, synthetic data, etc. Qwen2.5-Coder-32B has become the current state-of-the-art open-source code LLM, with coding abilities matching those of GPT-4o.
- A more comprehensive foundation for real-world applications such as code agents, enhancing coding capabilities while maintaining strengths in mathematics and general competencies.
- Long-context support up to 128K tokens.

This repo contains the GPTQ-quantized 4-bit instruction-tuned 32B Qwen2.5-Coder model, which has the following features:

- Type: Causal Language Models
- Training Stage: Pretraining & Post-training
- Architecture: transformers with RoPE, SwiGLU, RMSNorm, and Attention QKV bias
- Number of Parameters: 32.5B
- Number of Parameters (Non-Embedding): 31.0B
- Number of Layers: 64
- Number of Attention Heads (GQA): 40 for Q and 8 for KV
- Context Length: Full 131,072 tokens
  - Please refer to this section for detailed instructions on how to deploy Qwen2.5 for handling long texts.
- Quantization: GPTQ 4-bit

For more details, please refer to our blog, GitHub, Documentation, and arXiv paper.

The code for Qwen2.5-Coder is included in the latest Hugging Face `transformers`, and we advise you to use the latest version. With `transformers<4.37.0`, you will encounter a `KeyError: 'qwen2'` error. Also check out our GPTQ documentation for a more detailed usage guide.

Use `apply_chat_template` to format the conversation when loading the tokenizer and model and generating content.

The current `config.json` is set for a context length of up to 32,768 tokens. To handle inputs exceeding 32,768 tokens, we use YaRN, a technique for enhancing model length extrapolation, to maintain performance on lengthy texts. For supported frameworks, you can enable YaRN by adding a `rope_scaling` entry to `config.json`.

For deployment, we recommend using vLLM. Please refer to our Documentation for usage if you are not familiar with vLLM. Presently, vLLM only supports static YaRN, which means the scaling factor remains constant regardless of input length, potentially impacting performance on shorter texts. We advise adding the `rope_scaling` configuration only when processing long contexts is required.

Detailed evaluation results are reported in this 📑 blog. For GPU memory requirements and the corresponding throughput, see the results here.

If you find our work helpful, feel free to cite us.
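To give a feel for what 4-bit quantization means for the 32.5B parameters listed above, here is a back-of-the-envelope weight-memory estimate. It is only a sketch: it treats every parameter as stored at exactly the stated bit width and ignores quantization scales, zero points, any layers kept in higher precision, activations, and the KV cache, so real memory use will be higher. `weight_gib` is a hypothetical helper.

```python
def weight_gib(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GiB at a uniform bit width."""
    return num_params * bits_per_weight / 8 / 2**30


PARAMS = 32.5e9  # "Number of Parameters: 32.5B" from the list above

bf16 = weight_gib(PARAMS, 16)  # original bfloat16 weights
int4 = weight_gib(PARAMS, 4)   # idealized 4-bit weights, overhead ignored

print(f"bf16: {bf16:.1f} GiB, 4-bit: {int4:.1f} GiB")
```

The 4x ratio between the two figures is exact under these assumptions; the measured GPU memory figures referenced in the card remain the authoritative numbers.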
Qwen3-Coder-480B-A35B-Instruct
Today, we're announcing Qwen3-Coder, our most agentic code model to date. Qwen3-Coder is available in multiple sizes, but we're excited to introduce its most powerful variant first: Qwen3-Coder-480B-A35B-Instruct, featuring the following key enhancements:

- Significant performance among open models on agentic coding, agentic browser use, and other foundational coding tasks, achieving results comparable to Claude Sonnet.
- Long-context capabilities with native support for 256K tokens, extendable up to 1M tokens using YaRN, optimized for repository-scale understanding.
- Agentic coding support for most platforms, such as Qwen Code and CLINE, featuring a specially designed function call format.

Qwen3-Coder-480B-A35B-Instruct has the following features:

- Type: Causal Language Models
- Training Stage: Pretraining & Post-training
- Number of Parameters: 480B in total and 35B activated
- Number of Layers: 62
- Number of Attention Heads (GQA): 96 for Q and 8 for KV
- Number of Experts: 160
- Number of Activated Experts: 8
- Context Length: 262,144 tokens natively.

NOTE: This model supports only non-thinking mode and does not generate `<think></think>` blocks in its output. Specifying `enable_thinking=False` is no longer required.

For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our blog, GitHub, and Documentation.

We advise you to use the latest version of `transformers`.
```python
from openai import OpenAI

# Define tools: a JSON-schema description of each function the model may call
tools = [
    {
        "type": "function",
        "function": {
            "name": "square_the_number",
            "description": "output the square of the number.",
            "parameters": {
                "type": "object",
                "required": ["input_num"],
                "properties": {
                    "input_num": {
                        "type": "number",
                        "description": "input_num is a number that will be squared",
                    }
                },
            },
        },
    }
]

# Define LLM
client = OpenAI(
    # Use a custom endpoint compatible with OpenAI API
    base_url="http://localhost:8000/v1",  # api_base
    api_key="EMPTY",
)

messages = [{"role": "user", "content": "square the number 1024"}]

# Send the request with the tool definitions attached
completion = client.chat.completions.create(
    messages=messages,
    model="Qwen3-Coder-480B-A35B-Instruct",
    max_tokens=65536,
    tools=tools,
)
```

```
@misc{qwen3technicalreport,
      title={Qwen3 Technical Report},
      author={Qwen Team},
      year={2025},
      eprint={2505.09388},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.09388},
}
```
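The request above only advertises the tool to the model; the calling application still has to run whatever function the model asks for. The following is a minimal, offline sketch of that step, assuming the model's reply carries a function name plus a JSON string of arguments, as OpenAI-compatible servers typically return. `square_the_number`, `TOOL_REGISTRY`, `run_tool_call`, and the `call_0` id are illustrative names, not part of any library.

```python
import json


def square_the_number(input_num: float) -> float:
    """Local implementation of the tool described in the schema."""
    return input_num ** 2


# Map tool names (as advertised to the model) to local Python callables.
TOOL_REGISTRY = {"square_the_number": square_the_number}


def run_tool_call(name: str, arguments_json: str, tool_call_id: str) -> dict:
    """Execute one requested tool call and build the reply message."""
    arguments = json.loads(arguments_json)       # arguments arrive as a JSON string
    result = TOOL_REGISTRY[name](**arguments)    # KeyError here means an unknown tool
    return {
        "role": "tool",
        "tool_call_id": tool_call_id,
        "content": json.dumps(result),
    }


# Hypothetical tool call, shaped like what the model might request for the prompt above.
reply = run_tool_call("square_the_number", '{"input_num": 1024}', "call_0")
print(reply["content"])  # 1048576
```

In a real loop, the returned message would be appended to `messages` and sent back to the model so it can phrase the final answer.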
Qwen2.5-1.5B-Instruct-AWQ
Qwen2.5 is the latest series of Qwen large language models. For Qwen2.5, we release a number of base language models and instruction-tuned language models ranging from 0.5 to 72 billion parameters. Qwen2.5 brings the following improvements over Qwen2:

- Significantly more knowledge and greatly improved capabilities in coding and mathematics, thanks to our specialized expert models in these domains.
- Significant improvements in instruction following, generating long texts (over 8K tokens), understanding structured data (e.g., tables), and generating structured outputs, especially JSON. The models are also more resilient to diverse system prompts, which enhances role-play implementation and condition-setting for chatbots.
- Long-context support up to 128K tokens, with generation of up to 8K tokens.
- Multilingual support for over 29 languages, including Chinese, English, French, Spanish, Portuguese, German, Italian, Russian, Japanese, Korean, Vietnamese, Thai, Arabic, and more.

This repo contains the AWQ-quantized 4-bit instruction-tuned 1.5B Qwen2.5 model, which has the following features:

- Type: Causal Language Models
- Training Stage: Pretraining & Post-training
- Architecture: transformers with RoPE, SwiGLU, RMSNorm, Attention QKV bias, and tied word embeddings
- Number of Parameters: 1.54B
- Number of Parameters (Non-Embedding): 1.31B
- Number of Layers: 28
- Number of Attention Heads (GQA): 12 for Q and 2 for KV
- Context Length: Full 32,768 tokens and generation 8,192 tokens
- Quantization: AWQ 4-bit

For more details, please refer to our blog, GitHub, and Documentation.

The code for Qwen2.5 is included in the latest Hugging Face `transformers`, and we advise you to use the latest version. With `transformers<4.37.0`, you will encounter a `KeyError: 'qwen2'` error. Also check out our AWQ documentation for a more detailed usage guide.

Use `apply_chat_template` to format the conversation when loading the tokenizer and model and generating content.

Detailed evaluation results are reported in this 📑 blog. For quantized models, the benchmark results against the original bfloat16 models can be found here. For GPU memory requirements and the corresponding throughput, see the results here.

If you find our work helpful, feel free to cite us.
Qwen3-0.6B-FP8
Qwen1.5-7B
Qwen1.5-32B-Chat
Qwen3-Embedding-4B-GGUF
The Qwen3 Embedding model series is the latest proprietary model of the Qwen family, specifically designed for text embedding and ranking tasks. Building upon the dense foundational models of the Qwen3 series, it provides a comprehensive range of text embedding and reranking models in various sizes (0.6B, 4B, and 8B). This series inherits the exceptional multilingual capabilities, long-text understanding, and reasoning skills of its foundational model. The Qwen3 Embedding series represents significant advancements in multiple text embedding and ranking tasks, including text retrieval, code retrieval, text classification, text clustering, and bitext mining.

Exceptional Versatility: The embedding model has achieved state-of-the-art performance across a wide range of downstream application evaluations. The 8B embedding model ranks No. 1 on the MTEB multilingual leaderboard (as of June 5, 2025, score 70.58), while the reranking model excels in various text retrieval scenarios.

Comprehensive Flexibility: The Qwen3 Embedding series offers a full spectrum of sizes (from 0.6B to 8B) for both embedding and reranking models, catering to diverse use cases that prioritize efficiency and effectiveness. Developers can seamlessly combine these two modules. Additionally, the embedding model allows for flexible vector definitions across all dimensions, and both embedding and reranking models support user-defined instructions to enhance performance for specific tasks, languages, or scenarios.

Multilingual Capability: The Qwen3 Embedding series offers support for over 100 languages, thanks to the multilingual capabilities of Qwen3 models. This includes various programming languages and provides robust multilingual, cross-lingual, and code retrieval capabilities.
Qwen3-Embedding-4B-GGUF has the following features:

- Model Type: Text Embedding
- Supported Languages: 100+ Languages
- Number of Parameters: 4B
- Context Length: 32k
- Embedding Dimension: Up to 2560, supports user-defined output dimensions ranging from 32 to 2560
- Quantization: q4_K_M, q5_0, q5_K_M, q6_K, q8_0, f16

For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our blog and GitHub.

| Model Type | Models | Size | Layers | Sequence Length | Embedding Dimension | MRL Support | Instruction Aware |
|------------------|----------------------|------|--------|-----------------|---------------------|-------------|----------------|
| Text Embedding | Qwen3-Embedding-0.6B | 0.6B | 28 | 32K | 1024 | Yes | Yes |
| Text Embedding | Qwen3-Embedding-4B | 4B | 36 | 32K | 2560 | Yes | Yes |
| Text Embedding | Qwen3-Embedding-8B | 8B | 36 | 32K | 4096 | Yes | Yes |
| Text Reranking | Qwen3-Reranker-0.6B | 0.6B | 28 | 32K | - | - | Yes |
| Text Reranking | Qwen3-Reranker-4B | 4B | 36 | 32K | - | - | Yes |
| Text Reranking | Qwen3-Reranker-8B | 8B | 36 | 32K | - | - | Yes |

> Note:
> - `MRL Support` indicates whether the embedding model supports custom dimensions for the final embedding.
> - `Instruction Aware` notes whether the embedding or reranking model supports customizing the input instruction according to different tasks.
> - Our evaluation indicates that, for most downstream tasks, using instructions (instruct) typically yields an improvement of 1% to 5% compared to not using them. Therefore, we recommend that developers create tailored instructions specific to their tasks and scenarios. In multilingual contexts, we also advise users to write their instructions in English, as most instructions utilized during the model training process were originally written in English.

📌 Tip: We recommend that developers customize the `instruct` according to their specific scenarios, tasks, and languages. Our tests have shown that in most retrieval scenarios, not using an `instruct` on the query side can lead to a drop in retrieval performance of approximately 1% to 5%.

llama.cpp

Check out our llama.cpp documentation for a more detailed usage guide. We advise you to clone `llama.cpp` and install it following the official guide. We follow the latest version of llama.cpp. In the following demonstration, we assume that you are running commands under the repository `llama.cpp`.

| Model | Size | Mean (Task) | Mean (Type) | Bitext Mining | Class. | Clust. | Inst. Retri. | Multi. Class. | Pair. Class. | Rerank | Retri. | STS |
|----------------------------------|:-------:|:-------------:|:-------------:|:--------------:|:--------:|:--------:|:--------------:|:---------------:|:--------------:|:--------:|:--------:|:------:|
| NV-Embed-v2 | 7B | 56.29 | 49.58 | 57.84 | 57.29 | 40.80 | 1.04 | 18.63 | 78.94 | 63.82 | 56.72 | 71.10 |
| GritLM-7B | 7B | 60.92 | 53.74 | 70.53 | 61.83 | 49.75 | 3.45 | 22.77 | 79.94 | 63.78 | 58.31 | 73.33 |
| BGE-M3 | 0.6B | 59.56 | 52.18 | 79.11 | 60.35 | 40.88 | -3.11 | 20.1 | 80.76 | 62.79 | 54.60 | 74.12 |
| multilingual-e5-large-instruct | 0.6B | 63.22 | 55.08 | 80.13 | 64.94 | 50.75 | -0.40 | 22.91 | 80.86 | 62.61 | 57.12 | 76.81 |
| gte-Qwen2-1.5B-instruct | 1.5B | 59.45 | 52.69 | 62.51 | 58.32 | 52.05 | 0.74 | 24.02 | 81.58 | 62.58 | 60.78 | 71.61 |
| gte-Qwen2-7b-Instruct | 7B | 62.51 | 55.93 | 73.92 | 61.55 | 52.77 | 4.94 | 25.48 | 85.13 | 65.55 | 60.08 | 73.98 |
| text-embedding-3-large | - | 58.93 | 51.41 | 62.17 | 60.27 | 46.89 | -2.68 | 22.03 | 79.17 | 63.89 | 59.27 | 71.68 |
| Cohere-embed-multilingual-v3.0 | - | 61.12 | 53.23 | 70.50 | 62.95 | 46.89 | -1.89 | 22.74 | 79.88 | 64.07 | 59.16 | 74.80 |
| gemini-embedding-exp-03-07 | - | 68.37 | 59.59 | 79.28 | 71.82 | 54.59 | 5.18 | 29.16 | 83.63 | 65.58 | 67.71 | 79.40 |
| Qwen3-Embedding-0.6B | 0.6B | 64.33 | 56.00 | 72.22 | 66.83 | 52.33 | 5.09 | 24.59 | 80.83 | 61.41 | 64.64 | 76.17 |
| Qwen3-Embedding-4B | 4B | 69.45 | 60.86 | 79.36 | 72.33 | 57.15 | 11.56 | 26.77 | 85.05 | 65.08 | 69.60 | 80.86 |
| Qwen3-Embedding-8B | 8B | 70.58 | 61.69 | 80.89 | 74.00 | 57.65 | 10.06 | 28.66 | 86.40 | 65.63 | 70.88 | 81.08 |

> Note: For compared models, the scores are retrieved from the MTEB online leaderboard on May 24th, 2025.

| MTEB English / Models | Param. | Mean(Task) | Mean(Type) | Class. | Clust. | Pair Class. | Rerank. | Retri. | STS | Summ. |
|--------------------------------|:--------:|:------------:|:------------:|:--------:|:--------:|:-------------:|:---------:|:--------:|:-------:|:-------:|
| multilingual-e5-large-instruct | 0.6B | 65.53 | 61.21 | 75.54 | 49.89 | 86.24 | 48.74 | 53.47 | 84.72 | 29.89 |
| NV-Embed-v2 | 7.8B | 69.81 | 65.00 | 87.19 | 47.66 | 88.69 | 49.61 | 62.84 | 83.82 | 35.21 |
| GritLM-7B | 7.2B | 67.07 | 63.22 | 81.25 | 50.82 | 87.29 | 49.59 | 54.95 | 83.03 | 35.65 |
| gte-Qwen2-1.5B-instruct | 1.5B | 67.20 | 63.26 | 85.84 | 53.54 | 87.52 | 49.25 | 50.25 | 82.51 | 33.94 |
| stella_en_1.5B_v5 | 1.5B | 69.43 | 65.32 | 89.38 | 57.06 | 88.02 | 50.19 | 52.42 | 83.27 | 36.91 |
| gte-Qwen2-7B-instruct | 7.6B | 70.72 | 65.77 | 88.52 | 58.97 | 85.9 | 50.47 | 58.09 | 82.69 | 35.74 |
| gemini-embedding-exp-03-07 | - | 73.3 | 67.67 | 90.05 | 59.39 | 87.7 | 48.59 | 64.35 | 85.29 | 38.28 |
| Qwen3-Embedding-0.6B | 0.6B | 70.70 | 64.88 | 85.76 | 54.05 | 84.37 | 48.18 | 61.83 | 86.57 | 33.43 |
| Qwen3-Embedding-4B | 4B | 74.60 | 68.10 | 89.84 | 57.51 | 87.01 | 50.76 | 68.46 | 88.72 | 34.39 |
| Qwen3-Embedding-8B | 8B | 75.22 | 68.71 | 90.43 | 58.57 | 87.52 | 51.56 | 69.44 | 88.58 | 34.83 |

| C-MTEB | Param. | Mean(Task) | Mean(Type) | Class. | Clust. | Pair Class. | Rerank. | Retr. | STS |
|------------------|--------|------------|------------|--------|--------|-------------|---------|-------|-------|
| multilingual-e5-large-instruct | 0.6B | 58.08 | 58.24 | 69.80 | 48.23 | 64.52 | 57.45 | 63.65 | 45.81 |
| bge-multilingual-gemma2 | 9B | 67.64 | 68.52 | 75.31 | 59.30 | 86.67 | 68.28 | 73.73 | 55.19 |
| gte-Qwen2-1.5B-instruct | 1.5B | 67.12 | 67.79 | 72.53 | 54.61 | 79.5 | 68.21 | 71.86 | 60.05 |
| gte-Qwen2-7B-instruct | 7.6B | 71.62 | 72.19 | 75.77 | 66.06 | 81.16 | 69.24 | 75.70 | 65.20 |
| ritrieve_zh_v1 | 0.3B | 72.71 | 73.85 | 76.88 | 66.5 | 85.98 | 72.86 | 76.97 | 63.92 |
| Qwen3-Embedding-0.6B | 0.6B | 66.33 | 67.45 | 71.40 | 68.74 | 76.42 | 62.58 | 71.03 | 54.52 |
| Qwen3-Embedding-4B | 4B | 72.27 | 73.51 | 75.46 | 77.89 | 83.34 | 66.05 | 77.03 | 61.26 |
| Qwen3-Embedding-8B | 8B | 73.84 | 75.00 | 76.97 | 80.08 | 84.23 | 66.99 | 78.21 | 63.53 |

If you find our work helpful, feel free to cite us.
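The card says the embedding model supports user-defined output dimensions (32 to 2560 for the 4B model) through MRL. One common way to use such embeddings, sketched below, is to keep the leading components of the full vector and re-normalize before computing cosine similarity. This follows the usual Matryoshka convention and is an assumption here, since the card does not spell out the procedure; the toy 4-dimensional vectors stand in for real model outputs, and both helpers are hypothetical.

```python
import math


def truncate_embedding(vec, dim):
    """Keep the first `dim` components and L2-normalize the result."""
    if not 1 <= dim <= len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm else head


def cosine(a, b):
    """Cosine similarity of two unit-length vectors (their dot product)."""
    return sum(x * y for x, y in zip(a, b))


# Toy 4-dimensional vectors standing in for full 2560-dimensional embeddings.
full_a = [0.5, 0.5, 0.5, 0.5]
full_b = [0.5, 0.5, -0.5, -0.5]

a2 = truncate_embedding(full_a, 2)
b2 = truncate_embedding(full_b, 2)

print(round(cosine(full_a, full_b), 6), round(cosine(a2, b2), 6))  # 0.0 1.0
```

The toy numbers are deliberately extreme: truncation discards information, so similarities can shift, and the smaller dimension trades retrieval quality for storage and speed.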
Qwen2.5-32B-Instruct-GPTQ-Int4
Qwen2.5 is the latest series of Qwen large language models. For Qwen2.5, we release a number of base language models and instruction-tuned language models ranging from 0.5 to 72 billion parameters. Qwen2.5 brings the following improvements upon Qwen2: - Significantly more knowledge and has greatly improved capabilities in coding and mathematics, thanks to our specialized expert models in these domains. - Significant improvements in instruction following, generating long texts (over 8K tokens), understanding structured data (e.g, tables), and generating structured outputs especially JSON. More resilient to the diversity of system prompts, enhancing role-play implementation and condition-setting for chatbots. - Long-context Support up to 128K tokens and can generate up to 8K tokens. - Multilingual support for over 29 languages, including Chinese, English, French, Spanish, Portuguese, German, Italian, Russian, Japanese, Korean, Vietnamese, Thai, Arabic, and more. This repo contains the GPTQ-quantized 4-bit instruction-tuned 32B Qwen2.5 model, which has the following features: - Type: Causal Language Models - Training Stage: Pretraining & Post-training - Architecture: transformers with RoPE, SwiGLU, RMSNorm, and Attention QKV bias - Number of Parameters: 32.5B - Number of Paramaters (Non-Embedding): 31.0B - Number of Layers: 64 - Number of Attention Heads (GQA): 40 for Q and 8 for KV - Context Length: Full 131,072 tokens and generation 8192 tokens - Please refer to this section for detailed instructions on how to deploy Qwen2.5 for handling long texts. - Quantization: GPTQ 4-bit For more details, please refer to our blog, GitHub, and Documentation. The code of Qwen2.5 has been in the latest Hugging face `transformers` and we advise you to use the latest version of `transformers`. With `transformers<4.37.0`, you will encounter the following error: Also check out our GPTQ documentation for more usage guide. 
Here is a code snippet that uses `apply_chat_template` to show how to load the tokenizer and model and how to generate content.

The current `config.json` is set for a context length of up to 32,768 tokens. To handle inputs exceeding 32,768 tokens, we use YaRN, a technique for enhancing model length extrapolation, which preserves performance on lengthy texts. For supported frameworks, you can add a `rope_scaling` entry to `config.json` to enable YaRN.

For deployment, we recommend using vLLM. Please refer to our Documentation for usage if you are not familiar with vLLM. Presently, vLLM only supports static YaRN, which means the scaling factor remains constant regardless of input length, potentially impacting performance on shorter texts. We advise adding the `rope_scaling` configuration only when processing long contexts is required.

Detailed evaluation results are reported in this 📑 blog. For quantized models, the benchmark results against the original bfloat16 models can be found here. For requirements on GPU memory and the respective throughput, see results here.

If you find our work helpful, feel free to give us a cite.
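The YaRN configuration described above can be sketched as a small helper that patches a `config.json`-style dictionary. This is a minimal sketch, not the official snippet: the helper name `enable_yarn` is hypothetical, the `rope_scaling` key layout (`type`, `factor`, `original_max_position_embeddings`) is assumed from the configuration fields that appear elsewhere on this page, and the factor of 4.0 is assumed from 131,072 / 32,768.

```python
import json


def enable_yarn(config: dict, factor: float = 4.0, original_max: int = 32768) -> dict:
    """Return a copy of a config.json-style dict with a YaRN rope_scaling entry.

    Assumption: the framework reads `rope_scaling` with these three keys.
    """
    patched = dict(config)  # shallow copy so the caller's dict is untouched
    patched["rope_scaling"] = {
        "type": "yarn",
        "factor": factor,
        "original_max_position_embeddings": original_max,
    }
    return patched


# Stand-in for the real contents of config.json (illustrative only).
base = {"max_position_embeddings": 32768}
patched = enable_yarn(base)
print(json.dumps(patched, indent=2))
```

Because static YaRN applies the same scaling factor to every request, it is safer to keep the patched configuration separate from the default one and use it only for long-context workloads.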
Qwen1.5-0.5B-Chat
Qwen2.5-VL-72B-Instruct-AWQ
Qwen3-30B-A3B-Base
Qwen3 is the latest generation of large language models in the Qwen series, offering a comprehensive suite of dense and mixture-of-experts (MoE) models. Building upon extensive advancements in training data, model architecture, and optimization techniques, Qwen3 delivers the following key improvements over the previously released Qwen2.5:

- Expanded Higher-Quality Pre-training Corpus: Qwen3 is pre-trained on 36 trillion tokens across 119 languages, tripling the language coverage of Qwen2.5, with a much richer mix of high-quality data, including coding, STEM, reasoning, book, multilingual, and synthetic data.
- Training Techniques and Model Architecture: Qwen3 incorporates a series of training techniques and architectural refinements, including global-batch load balancing loss for MoE models and qk layernorm for all models, leading to improved stability and overall performance.
- Three-stage Pre-training: Stage 1 focuses on broad language modeling and general knowledge acquisition, Stage 2 improves reasoning skills such as STEM, coding, and logical reasoning, and Stage 3 enhances long-context comprehension by extending training sequence lengths up to 32k tokens.
- Scaling Law Guided Hyperparameter Tuning: Through comprehensive scaling law studies across the three-stage pre-training pipeline, Qwen3 systematically tunes critical hyperparameters, such as the learning rate scheduler and batch size, separately for dense and MoE models, resulting in better training dynamics and final performance across different model scales.
Qwen3-30B-A3B-Base has the following features:

- Type: Causal Language Models
- Training Stage: Pretraining
- Number of Parameters: 30.5B in total and 3.3B activated
- Number of Parameters (Non-Embedding): 29.9B
- Number of Layers: 48
- Number of Attention Heads (GQA): 32 for Q and 4 for KV
- Number of Experts: 128
- Number of Activated Experts: 8
- Context Length: 32,768

For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our blog, GitHub, and Documentation.

The code of Qwen3-MoE is included in the latest Hugging Face `transformers`, and we advise you to use the latest version of `transformers`. With `transformers<4.51.0`, you will encounter an error.

Detailed evaluation results are reported in this 📑 blog.

If you find our work helpful, feel free to give us a cite.
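To make the MoE figures above concrete, the following sketch works out the sparsity they imply. It only does arithmetic on the numbers listed in the feature list; the variable names are illustrative.

```python
# Figures taken from the feature list above (parameters in billions).
total_params_b = 30.5
active_params_b = 3.3
num_experts = 128
active_experts = 8
q_heads, kv_heads = 32, 4

# Share of parameters that take part in a single forward pass.
active_param_share = active_params_b / total_params_b
# Share of experts routed to per token.
active_expert_share = active_experts / num_experts
# With GQA, this many query heads share each key/value head.
queries_per_kv_head = q_heads // kv_heads

print(f"activated parameters: {active_param_share:.1%}")
print(f"activated experts:    {active_expert_share:.2%}")
print(f"query heads per KV head: {queries_per_kv_head}")
```

Roughly one ninth of the parameters are active per token, which is why the model is named A3B despite having about 30B parameters in total.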
QwQ-32B
QwQ is the reasoning model of the Qwen series. Compared with conventional instruction-tuned models, QwQ is capable of thinking and reasoning, and achieves significantly better performance on downstream tasks, especially hard problems. QwQ-32B is the medium-sized reasoning model and achieves competitive performance against state-of-the-art reasoning models, e.g., DeepSeek-R1 and o1-mini.

This repo contains the QwQ 32B model, which has the following features:

- Type: Causal Language Models
- Training Stage: Pretraining & Post-training (Supervised Finetuning and Reinforcement Learning)
- Architecture: transformers with RoPE, SwiGLU, RMSNorm, and Attention QKV bias
- Number of Parameters: 32.5B
- Number of Parameters (Non-Embedding): 31.0B
- Number of Layers: 64
- Number of Attention Heads (GQA): 40 for Q and 8 for KV
- Context Length: Full 131,072 tokens
  - For prompts exceeding 8,192 tokens in length, you must enable YaRN as outlined in this section.

Note: For the best experience, please review the usage guidelines before deploying QwQ models.

You can try our demo or access QwQ models via QwenChat. For more details, please refer to our blog, GitHub, and Documentation.

QwQ is based on Qwen2.5, whose code is included in the latest Hugging Face `transformers`. We advise you to use the latest version of `transformers`. With `transformers<4.37.0`, you will encounter an error.

Usage guidelines:

1. Ensure the model starts with "<think>\n" to prevent generating empty thinking content, which can degrade output quality. If you use `apply_chat_template` and set `add_generation_prompt=True`, this is already done automatically, but it may cause the response to lack the <think> tag at the beginning. This is normal behavior.
2. Sampling Parameters:
   - Use Temperature=0.6, TopP=0.95, MinP=0 instead of greedy decoding to avoid endless repetitions.
   - Use TopK between 20 and 40 to filter out rare token occurrences while maintaining the diversity of the generated output.
   - For supported frameworks, you can adjust the `presence_penalty` parameter between 0 and 2 to reduce endless repetitions. However, a higher value may result in occasional language mixing and a slight decrease in performance.
3. No Thinking Content in History: In multi-turn conversations, the historical model output should only include the final output part and does not need to include the thinking content. This is already implemented in `apply_chat_template`.
4. Standardize Output Format: We recommend using prompts to standardize model outputs when benchmarking.
   - Math Problems: Include "Please reason step by step, and put your final answer within \boxed{}." in the prompt.
   - Multiple-Choice Questions: Add the following to the prompt to standardize responses: "Please show your choice in the `answer` field with only the choice letter, e.g.,`\"answer\": \"C\"`."
5. Handle Long Inputs: For inputs exceeding 8,192 tokens, enable YaRN to improve the model's ability to capture long-sequence information effectively. For supported frameworks, you can add a `rope_scaling` entry to `config.json` to enable YaRN.

For deployment, we recommend using vLLM. Please refer to our Documentation for usage if you are not familiar with vLLM. Presently, vLLM only supports static YaRN, which means the scaling factor remains constant regardless of input length, potentially impacting performance on shorter texts. We advise adding the `rope_scaling` configuration only when processing long contexts is required.

Detailed evaluation results are reported in this 📑 blog. For requirements on GPU memory and the respective throughput, see results here.

If you find our work helpful, feel free to give us a cite.
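The "no thinking content in history" guideline above can be sketched for frameworks that do not go through the chat template. This is a minimal illustration under stated assumptions: the `<think>`/`</think>` tag spelling, the message format (a list of `role`/`content` dicts), and the helper names `strip_thinking` and `clean_history` are all assumptions, not part of the model's tooling. The sampling dictionary simply restates the recommended values, with `top_k` set to one value inside the suggested 20 to 40 range.

```python
# Recommended sampling values from the guidelines above; key names are illustrative.
QWQ_SAMPLING = {"temperature": 0.6, "top_p": 0.95, "min_p": 0.0, "top_k": 30}


def strip_thinking(text: str) -> str:
    """Keep only the text after the last closing think tag (assumed to be </think>)."""
    # rpartition returns ('', '', text) when the tag is absent, so text is kept whole.
    return text.rpartition("</think>")[2].strip()


def clean_history(messages: list) -> list:
    """Return a copy of the conversation with thinking removed from assistant turns."""
    cleaned = []
    for message in messages:
        if message["role"] == "assistant":
            message = {**message, "content": strip_thinking(message["content"])}
        cleaned.append(message)
    return cleaned


history = [
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "content": "<think>\n2 + 2 = 4\n</think>\n\nThe answer is 4."},
]
print(clean_history(history))
```

Only assistant turns are rewritten, so user and system messages pass through unchanged.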
Qwen2-Audio-7B
Qwen2-Audio is the new series of Qwen large audio-language models. Qwen2-Audio accepts various audio signal inputs and performs audio analysis or responds directly in text to speech instructions. We introduce two distinct audio interaction modes:

- voice chat: users can freely engage in voice interactions with Qwen2-Audio without text input;
- audio analysis: users can provide audio and text instructions for analysis during the interaction.

We release Qwen2-Audio-7B and Qwen2-Audio-7B-Instruct, which are the pretrained model and the chat model respectively. For more details, please refer to our Blog, GitHub, and Report.

Requirements

The code of Qwen2-Audio is included in the latest Hugging Face transformers, and we advise you to build from source with the command `pip install git+https://github.com/huggingface/transformers`; otherwise you might encounter an error.

Here is a code snippet that shows how to load the processor and model and how to run the pretrained Qwen2-Audio base model for content generation.

If you find our work helpful, feel free to give us a cite.
Qwen3-Reranker-4B
Qwen2.5-3B-Instruct-GGUF
Qwen2.5 is the latest series of Qwen large language models. For Qwen2.5, we release a number of base language models and instruction-tuned language models ranging from 0.5 to 72 billion parameters. Qwen2.5 brings the following improvements over Qwen2:

- Significantly more knowledge and greatly improved capabilities in coding and mathematics, thanks to our specialized expert models in these domains.
- Significant improvements in instruction following, generating long texts (over 8K tokens), understanding structured data (e.g., tables), and generating structured outputs, especially JSON. The models are also more resilient to diverse system prompts, which improves role-play and condition-setting for chatbots.
- Long-context support up to 128K tokens, with generation of up to 8K tokens.
- Multilingual support for over 29 languages, including Chinese, English, French, Spanish, Portuguese, German, Italian, Russian, Japanese, Korean, Vietnamese, Thai, Arabic, and more.

This repo contains the instruction-tuned 3B Qwen2.5 model in the GGUF format, which has the following features:

- Type: Causal Language Models
- Training Stage: Pretraining & Post-training
- Architecture: transformers with RoPE, SwiGLU, RMSNorm, Attention QKV bias and tied word embeddings
- Number of Parameters: 3.09B
- Number of Parameters (Non-Embedding): 2.77B
- Number of Layers: 36
- Number of Attention Heads (GQA): 16 for Q and 2 for KV
- Context Length: Full 32,768 tokens and generation 8192 tokens
- Quantization: q2_K, q3_K_M, q4_0, q4_K_M, q5_0, q5_K_M, q6_K, q8_0

For more details, please refer to our blog, GitHub, and Documentation.

Check out our llama.cpp documentation for a more detailed usage guide. We advise you to clone `llama.cpp` and install it following the official guide. We follow the latest version of llama.cpp. In the following demonstration, we assume that you are running commands under the repository `llama.cpp`.
Since cloning the entire repo may be inefficient, you can manually download the GGUF file that you need or use `huggingface-cli`:

1. Install

For a chatbot-like experience, we recommend starting in conversation mode:

Detailed evaluation results are reported in this 📑 blog. For quantized models, the benchmark results against the original bfloat16 models can be found here. For requirements on GPU memory and the respective throughput, see results here.

If you find our work helpful, feel free to give us a cite.
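Downloading a single GGUF file with `huggingface-cli`, as mentioned above, can be sketched by assembling the command in Python. This is an illustration under assumptions rather than the card's own instructions: the repository id is inferred from the model name, the GGUF file name is hypothetical and should be checked against the repository's file list, and `hf_download_command` is a helper made up for this example.

```python
import shlex


def hf_download_command(repo_id: str, filename: str, local_dir: str = ".") -> str:
    """Build a `huggingface-cli download` command for one file of a repository."""
    args = ["huggingface-cli", "download", repo_id, filename, "--local-dir", local_dir]
    return shlex.join(args)  # quotes arguments safely for a POSIX shell


# Assumed repo id and file name (illustrative; pick the quantization you need).
cmd = hf_download_command("Qwen/Qwen2.5-3B-Instruct-GGUF", "qwen2.5-3b-instruct-q5_k_m.gguf")
print(cmd)
```

Running the printed command requires the `huggingface_hub` CLI to be installed; fetching one quantization this way avoids downloading every file in the repository.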
Qwen3-8B-GGUF
Qwen3 is the latest generation of large language models in the Qwen series, offering a comprehensive suite of dense and mixture-of-experts (MoE) models. Built upon extensive training, Qwen3 delivers advancements in reasoning, instruction following, agent capabilities, and multilingual support, with the following key features:

- Unique support for seamless switching between thinking mode (for complex logical reasoning, math, and coding) and non-thinking mode (for efficient, general-purpose dialogue) within a single model, ensuring optimal performance across various scenarios.
- Significantly enhanced reasoning capabilities, surpassing the previous QwQ (in thinking mode) and Qwen2.5 instruct models (in non-thinking mode) on mathematics, code generation, and commonsense logical reasoning.
- Superior human preference alignment, excelling in creative writing, role-playing, multi-turn dialogues, and instruction following, to deliver a more natural, engaging, and immersive conversational experience.
- Expertise in agent capabilities, enabling precise integration with external tools in both thinking and non-thinking modes and achieving leading performance among open-source models in complex agent-based tasks.
- Support for 100+ languages and dialects, with strong capabilities for multilingual instruction following and translation.

Qwen3-8B has the following features:

- Type: Causal Language Models
- Training Stage: Pretraining & Post-training
- Number of Parameters: 8.2B
- Number of Parameters (Non-Embedding): 6.95B
- Number of Layers: 36
- Number of Attention Heads (GQA): 32 for Q and 8 for KV
- Context Length: 32,768 natively and 131,072 tokens with YaRN.

For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our blog, GitHub, and Documentation.

Check out our llama.cpp documentation for a more detailed usage guide. We advise you to clone `llama.cpp` and install it following the official guide.
We follow the latest version of llama.cpp. In the following demonstration, we assume that you are running commands under the repository `llama.cpp`.

Check out our ollama documentation for a more detailed usage guide.

You can add `/think` and `/no_think` to user prompts or system messages to switch the model's thinking mode from turn to turn. The model will follow the most recent instruction in multi-turn conversations.

Qwen3 natively supports context lengths of up to 32,768 tokens. For conversations where the total length (including both input and output) significantly exceeds this limit, we recommend using RoPE scaling techniques to handle long texts effectively. We have validated the model's performance on context lengths of up to 131,072 tokens using the YaRN method.

> [!NOTE]
> All the notable open-source frameworks implement static YaRN, which means the scaling factor remains constant regardless of input length, potentially impacting performance on shorter texts.
> We advise adding the `rope_scaling` configuration only when processing long contexts is required.
> It is also recommended to modify the `factor` as needed. For example, if the typical context length for your application is 65,536 tokens, it would be better to set `factor` to 2.0.

> [!TIP]
> The endpoint provided by Alibaba Model Studio supports dynamic YaRN by default and no extra configuration is needed.

To achieve optimal performance, we recommend the following settings:

1. Sampling Parameters:
   - For thinking mode (`enable_thinking=True`), use `Temperature=0.6`, `TopP=0.95`, `TopK=20`, `MinP=0`, and `PresencePenalty=1.5`. DO NOT use greedy decoding, as it can lead to performance degradation and endless repetitions.
   - For non-thinking mode (`enable_thinking=False`), we suggest using `Temperature=0.7`, `TopP=0.8`, `TopK=20`, `MinP=0`, and `PresencePenalty=1.5`.
   - We recommend setting `presence_penalty` to 1.5 for quantized models to suppress repetitive outputs. You can adjust the `presence_penalty` parameter between 0 and 2. A higher value may occasionally lead to language mixing and a slight reduction in model performance.
2. Adequate Output Length: We recommend using an output length of 32,768 tokens for most queries. For benchmarking on highly complex problems, such as those found in math and programming competitions, we suggest setting the max output length to 38,912 tokens. This gives the model enough room to generate detailed and comprehensive responses.
3. Standardize Output Format: We recommend using prompts to standardize model outputs when benchmarking.
   - Math Problems: Include "Please reason step by step, and put your final answer within \boxed{}." in the prompt.
   - Multiple-Choice Questions: Add the following to the prompt to standardize responses: "Please show your choice in the `answer` field with only the choice letter, e.g., `"answer": "C"`."
4. No Thinking Content in History: In multi-turn conversations, the historical model output should only include the final output part and does not need to include the thinking content. This is implemented in the provided Jinja2 chat template. However, for frameworks that do not directly use the Jinja2 chat template, it is up to the developers to ensure that this best practice is followed.

If you find our work helpful, feel free to give us a cite.
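The recommendations above reduce to two small lookups: a YaRN `factor` derived from the context length an application typically needs, and a sampling preset per mode. The sketch below is one way to express them; the helper names and dictionary keys are illustrative assumptions, and the key names expected by a given inference framework may differ.

```python
NATIVE_CONTEXT = 32768  # native context length stated above


def yarn_factor(target_context: int, native_context: int = NATIVE_CONTEXT) -> float:
    """Scaling factor needed to reach target_context; never below 1.0."""
    return max(1.0, target_context / native_context)


# Recommended values from the list above, keyed by mode.
SAMPLING_PRESETS = {
    "thinking": {"temperature": 0.6, "top_p": 0.95, "top_k": 20, "min_p": 0.0, "presence_penalty": 1.5},
    "non_thinking": {"temperature": 0.7, "top_p": 0.8, "top_k": 20, "min_p": 0.0, "presence_penalty": 1.5},
}


def sampling_for(enable_thinking: bool) -> dict:
    """Return a copy of the preset matching the requested mode."""
    return dict(SAMPLING_PRESETS["thinking" if enable_thinking else "non_thinking"])


print(yarn_factor(65536))   # typical context of 65,536 tokens
print(yarn_factor(131072))  # maximum validated context
print(sampling_for(enable_thinking=True))
```

Clamping the factor at 1.0 reflects the advice to leave RoPE scaling off when the native 32,768-token window is already sufficient.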
Qwen3-VL-4B-Thinking
Meet Qwen3-VL — the most powerful vision-language model in the Qwen series to date. This generation delivers comprehensive upgrades across the board: superior text understanding & generation, deepe...
Qwen-VL-Chat
Qwen-VL (Qwen Large Vision Language Model) is the visual multimodal version of the large model series Qwen (Tongyi Qianwen), proposed by Alibaba Cloud. Qwen-VL accepts images, text, and bounding boxes as inputs, and outputs text and bounding boxes. The Qwen-VL models deliver strong performance, support multilingual and interleaved multi-image conversations, and support Chinese open-domain grounding as well as fine-grained image recognition and understanding.

We release Qwen-VL and Qwen-VL-Chat, which are the pretrained model and the chat model respectively. For more details about Qwen-VL, please refer to our technical memo. This repo is the one for Qwen-VL-Chat.

- python 3.8 and above
- pytorch 1.12 and above; 2.0 and above is recommended
- CUDA 11.4 and above is recommended (this is for GPU users)

Below, we provide simple examples to show how to use Qwen-VL-Chat with 🤗 Transformers. Before running the code, make sure you have set up the environment and installed the required packages. Make sure you meet the above requirements, and then install the dependent libraries.

Now you can start with Transformers. For more usage of the vision encoder, please refer to the tutorial.

To use Qwen-VL-Chat for inference, all you need to do is enter a few lines of code as demonstrated below. However, please make sure that you are using the latest code.
We provide a new solution based on AutoGPTQ and release an Int4 quantized model for Qwen-VL-Chat, Qwen-VL-Chat-Int4 (click here), which is nearly lossless in model quality while reducing memory cost and improving inference speed.

Here we demonstrate how to use our provided quantized models for inference. Before you start, make sure you meet the requirements (e.g., torch 2.0 and above, transformers 4.32.0 and above, etc.) and install the required packages:

If you meet problems installing `auto-gptq`, we advise you to check out the official repo to find a wheel.

Then you can load the quantized model easily and run inference as usual:

We report the performance of both the BF16 and Int4 models on the TouchStone benchmark and find that the quantized model does not suffer from significant performance degradation. Results are shown below:

| Quantization | ZH | EN |
| ------------ | :--------: | :-----------: |
| BF16 | 401.2 | 645.2 |
| Int4 | 386.6 | 651.4 |

We measured the average inference speed (tokens/s) of generating 1792 (2048-258) and 7934 (8192-258) tokens with the context of an image (which takes 258 tokens) under BF16 precision and Int4 quantization, respectively.

| Quantization | Speed (2048 tokens) | Speed (8192 tokens) |
| ------------ | :-----------------: | :-----------------: |
| BF16 | 28.87 | 24.32 |
| Int4 | 37.79 | 34.34 |

The profiling runs on a single A100-SXM4-80G GPU with PyTorch 2.0.1 and CUDA 11.4.
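As a worked reading of the speed table above, the following sketch computes the relative speedup of Int4 over BF16 at each sequence length. It only does arithmetic on the reported tokens/s figures; the dictionary layout is illustrative.

```python
# Average generation speed in tokens/s, copied from the table above.
speeds = {
    "BF16": {"2048": 28.87, "8192": 24.32},
    "Int4": {"2048": 37.79, "8192": 34.34},
}

# Ratio of Int4 speed to BF16 speed for each total sequence length.
speedup = {
    length: speeds["Int4"][length] / speeds["BF16"][length]
    for length in speeds["BF16"]
}

for length, ratio in speedup.items():
    print(f"{length} tokens: Int4 is {ratio:.2f}x the speed of BF16")
```

The gain is larger at the longer sequence length (about 1.41x versus about 1.31x), so the quantized model helps most on long generations.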
We also profile the peak GPU memory usage for encoding 1792 (2048-258) tokens (including an image) as context (and generating a single token) and for generating 7934 (8192-258) tokens (with an image as context) under BF16 or Int4 quantization, respectively. The results are shown below.

| Quantization | Peak Usage for Encoding 2048 Tokens | Peak Usage for Generating 8192 Tokens |
| ------------ | :---------------------------------: | :-----------------------------------: |
| BF16 | 22.60GB | 28.01GB |
| Int4 | 11.82GB | 17.23GB |

The above speed and memory profiling are conducted using this script.

We evaluated the model's ability from two perspectives:

1. Standard Benchmarks: We evaluate the model's basic task capabilities on four major categories of multimodal tasks:
   - Zero-shot Caption: Evaluate the model's zero-shot image captioning ability on unseen datasets;
   - General VQA: Evaluate general question answering about pictures, such as judgment, color, number, category, etc.;
   - Text-based VQA: Evaluate the model's ability to recognize and answer questions about text in pictures, such as document QA, chart QA, etc.;
   - Referring Expression Comprehension: Evaluate the ability to localize a target object in an image described by a referring expression.
2. TouchStone: To evaluate the overall text-image dialogue capability and alignment level with humans, we have constructed a benchmark called TouchStone, which uses GPT4 scoring to evaluate LVLM models.
   - The TouchStone benchmark covers a total of 300+ images, 800+ questions, and 27 categories, such as attribute-based Q&A, celebrity recognition, writing poetry, summarizing multiple images, product comparison, math problem solving, etc.;
   - To work around GPT4's current inability to take images as direct input, TouchStone provides fine-grained, human-labeled image annotations. These detailed annotations, along with the questions and the model's output, are then presented to GPT4 for scoring.
   - The benchmark includes both English and Chinese versions.

Qwen-VL outperforms current SOTA generalist models on multiple VL tasks and covers a more comprehensive range of capabilities.

Zero-shot Captioning & General VQA

| Model type | Model | NoCaps | Flickr30K | VQAv2 dev | OK-VQA | GQA | SciQA-Img (0-shot) | VizWiz (0-shot) |
|---|---|---|---|---|---|---|---|---|
| Generalist Models | Flamingo-9B | - | 61.5 | 51.8 | 44.7 | - | - | 28.8 |
| | BLIP-2 (Vicuna-13B) | 103.9 | 71.6 | 65.0 | 45.9 | 32.3 | 61.0 | 19.6 |
| | InstructBLIP (Vicuna-13B) | 121.9 | 82.8 | - | - | 49.5 | 63.1 | 33.4 |
| | Qwen-VL (Qwen-7B) | 121.4 | 85.8 | 78.8 | 58.6 | 59.3 | 67.1 | 35.2 |
| Previous SOTA (Per Task Fine-tuning) | - | 127.0 (PALI-17B) | 84.5 (InstructBLIP-FlanT5-XL) | 86.1 (PALI-X-55B) | 66.1 (PALI-X-55B) | 72.1 (CFR) | 92.53 (LLaVa+GPT-4) | 70.9 (PALI-X-55B) |

- For zero-shot image captioning, Qwen-VL achieves the SOTA on Flickr30K and results on NoCaps that are competitive with InstructBLIP.
- For general VQA, Qwen-VL achieves the SOTA among generalist LVLMs of the same scale and settings.
| Model type | Model | TextVQA | DocVQA | ChartQA | AI2D | OCR-VQA |
|---|---|---|---|---|---|---|
| Specialist SOTAs (Specialist/Finetuned) | PALI-X-55B (Single-task FT) (Without OCR Pipeline) | 71.44 | 80.0 | 70.0 | 81.2 | 75.0 |

- In text-related recognition/QA evaluation, Qwen-VL achieves the SOTA among generalist LVLMs of the same scale and settings.
- Resolution is important for several of the above evaluations. While most open-source LVLM models with 224 resolution are incapable of these evaluations or can only handle them by cutting images, Qwen-VL scales the resolution to 448 so that it can be evaluated end-to-end. Qwen-VL even outperforms Pic2Struct-Large models of 1024 resolution on some tasks.

| Model | val | test-A | test-B | val | test-A | test-B | val-u | test-u | refexp |
|---|---|---|---|---|---|---|---|---|---|
| OFA-L | 79.96 | 83.67 | 76.39 | 68.29 | 76.00 | 61.75 | 67.57 | 67.58 | 61.70 |
| Shikra-7B | 87.01 | 90.61 | 80.24 | 81.60 | 87.36 | 72.12 | 82.27 | 82.19 | 69.34 |
| Shikra-13B | 87.83 | 91.11 | 81.81 | 82.89 | 87.79 | 74.41 | 82.64 | 83.16 | 69.03 |
| Qwen-VL-7B | 89.36 | 92.26 | 85.34 | 83.12 | 88.25 | 77.21 | 85.58 | 85.48 | 78.22 |
| Qwen-VL-7B-Chat | 88.55 | 92.27 | 84.51 | 82.82 | 88.59 | 76.79 | 85.96 | 86.32 | - |
| Specialist SOTAs (Specialist/Finetuned) | | | | | | | | | |
| G-DINO-L | 90.56 | 93.19 | 88.24 | 82.75 | 88.95 | 75.92 | 86.13 | 87.02 | - |
| UNINEXT-H | 92.64 | 94.33 | 91.46 | 85.24 | 89.63 | 79.79 | 88.73 | 89.37 | - |
| ONE-PEACE | 92.58 | 94.18 | 89.26 | 88.77 | 92.21 | 83.23 | 89.22 | 89.27 | - |

- Qwen-VL outperforms Shikra-13B on all of the above referring expression comprehension benchmarks, achieving the SOTA among generalist LVLMs.
- Qwen-VL has not been trained on any Chinese grounding data, but it still generalizes to Chinese grounding tasks in a zero-shot way thanks to training on Chinese caption data and English grounding data.
We provide all of the above evaluation scripts for reproducing our experimental results. Please read eval/EVALUATION.md for more information.

TouchStone is a benchmark that uses GPT4 scoring to evaluate the text-image dialogue ability of LVLM models and their alignment with humans. It covers a total of 300+ images, 800+ questions, and 27 categories, such as attribute-based Q&A, celebrity recognition, writing poetry, summarizing multiple images, product comparison, math problem solving, etc. Please read touchstone/README_CN.md for more information.

English evaluation:

| Model | Score |
|---------------|-------|
| PandaGPT | 488.5 |
| MiniGPT4 | 531.7 |
| InstructBLIP | 552.4 |
| LLaMA-AdapterV2 | 590.1 |
| mPLUG-Owl | 605.4 |
| LLaVA | 602.7 |
| Qwen-VL-Chat | 645.2 |

Chinese evaluation:

| Model | Score |
|---------------|-------|
| VisualGLM | 247.1 |
| Qwen-VL-Chat | 401.2 |

Qwen-VL-Chat has achieved the best results in both Chinese and English alignment evaluation.

If you meet problems, please refer to the FAQ and the issues first to search for a solution before you open a new issue.

Researchers and developers are free to use the code and model weights of both Qwen-VL and Qwen-VL-Chat. We also allow their commercial use. Check our license at LICENSE for more details. For commercial use, please fill out the application form.

If you find our paper and code useful in your research, please consider giving a star :star: and citation :pencil: :)

If you would like to leave a message for either our research team or product team, feel free to send an email to [email protected].
Qwen2-7B
Qwen3-Coder-480B-A35B-Instruct-FP8
Qwen2.5-14B-Instruct-1M
Qwen2.5-VL-32B-Instruct-AWQ
Latest Updates: In addition to the original formula, we have further enhanced Qwen2.5-VL-32B's mathematical and problem-solving abilities through reinforcement learning. This has also significantly improved the model's subjective user experience, with response styles adjusted to better align with human preferences. Particularly for objective queries such as mathematics, logical reasoning, and knowledge-based Q&A, the level of detail in responses and the clarity of formatting have been noticeably enhanced.

In the five months since Qwen2-VL's release, numerous developers have built new models on the Qwen2-VL vision-language models, providing us with valuable feedback. During this period, we focused on building more useful vision-language models. Today, we are excited to introduce the latest addition to the Qwen family: Qwen2.5-VL.

Key Enhancements:

- Understand things visually: Qwen2.5-VL is not only proficient in recognizing common objects such as flowers, birds, fish, and insects, but is also highly capable of analyzing texts, charts, icons, graphics, and layouts within images.
- Being agentic: Qwen2.5-VL acts directly as a visual agent that can reason and dynamically direct tools, and is capable of computer use and phone use.
- Understanding long videos and capturing events: Qwen2.5-VL can comprehend videos of over 1 hour, and it has a new ability to capture events by pinpointing the relevant video segments.
- Capable of visual localization in different formats: Qwen2.5-VL can accurately localize objects in an image by generating bounding boxes or points, and it can provide stable JSON outputs for coordinates and attributes.
- Generating structured outputs: for data such as scans of invoices, forms, and tables, Qwen2.5-VL supports structured outputs of their contents, benefiting uses in finance, commerce, etc.
Dynamic Resolution and Frame Rate Training for Video Understanding: We extend dynamic resolution to the temporal dimension by adopting dynamic FPS sampling, enabling the model to comprehend videos at various sampling rates. Accordingly, we update mRoPE in the time dimension with IDs and absolute time alignment, enabling the model to learn temporal sequence and speed, and ultimately acquire the ability to pinpoint specific moments.

We enhance both training and inference speeds by strategically implementing window attention in the ViT. The ViT architecture is further optimized with SwiGLU and RMSNorm, aligning it with the structure of the Qwen2.5 LLM.

We have three models with 3, 7 and 72 billion parameters. This repository contains the quantized instruction-tuned 32B Qwen2.5-VL model. For more information, visit our Blog and GitHub.

| Model | MMMU | DocVQA_VAL | MMBench_DEV_EN | MathVista_MINI |
|---------------------------|--------------------|------------|------------------------|----------------|
| Qwen2.5-VL-32B-Instruct | 70.0 | 93.9107 | 87.3 | 74.7 |
| Qwen2.5-VL-32B-Instruct-AWQ | 67.8 | 94.1489 | 86.9 | 73.6 |

Requirements

The code of Qwen2.5-VL is included in the latest Hugging Face transformers, and we advise you to build from source with command:

Below, we provide simple examples to show how to use Qwen2.5-VL with 🤖 ModelScope and 🤗 Transformers.

We offer a toolkit to help you handle various types of visual input more conveniently, as if you were using an API. This includes base64, URLs, and interleaved images and videos. You can install it using the following command:

If you are not using Linux, you might not be able to install `decord` from PyPI. In that case, you can use `pip install qwen-vl-utils`, which will fall back to using torchvision for video processing.
However, you can still install decord from source to get decord used when loading video. Here we show a code snippet to show you how to use the chat model with `transformers` and `qwenvlutils`: Video URL compatibility largely depends on the third-party library version. The details are in the table below. change the backend by `FORCEQWENVLVIDEOREADER=torchvision` or `FORCEQWENVLVIDEOREADER=decord` if you prefer not to use the default one. | Backend | HTTP | HTTPS | |-------------|------|-------| | torchvision >= 0.19.0 | ✅ | ✅ | | torchvision We strongly advise users especially those in mainland China to use ModelScope. `snapshotdownload` can help you solve issues concerning downloading checkpoints. For input images, we support local files, base64, and URLs. For videos, we currently only support local files. The model supports a wide range of resolution inputs. By default, it uses the native resolution for input, but higher resolutions can enhance performance at the cost of more computation. Users can set the minimum and maximum number of pixels to achieve an optimal configuration for their needs, such as a token count range of 256-1280, to balance speed and memory usage. Besides, We provide two methods for fine-grained control over the image size input to the model: 1. Define minpixels and maxpixels: Images will be resized to maintain their aspect ratio within the range of minpixels and maxpixels. 2. Specify exact dimensions: Directly set `resizedheight` and `resizedwidth`. These values will be rounded to the nearest multiple of 28. The current `config.json` is set for context length up to 32,768 tokens. To handle extensive inputs exceeding 32,768 tokens, we utilize YaRN, a technique for enhancing model length extrapolation, ensuring optimal performance on lengthy texts. 
For supported frameworks, you could add the following to `config.json` to enable YaRN: `{ ..., "type": "yarn", "mrope_section": [16, 24, 24], "factor": 4, "original_max_position_embeddings": 32768 }`

However, note that this method has a significant impact on the performance of temporal and spatial localization tasks and is therefore not recommended. For long video inputs, since MRoPE itself is more economical with position IDs, `max_position_embeddings` can instead be set directly to a larger value, such as 64k.

If you find our work helpful, feel free to cite us.
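As a rough illustration of the YaRN settings quoted above, the sketch below multiplies the original position limit by the scaling factor to get the nominal context length the configuration targets. It assumes the usual reading of these two fields (target length = factor × original length); the helper name `yarn_context_length` is hypothetical and not part of any library.

```python
# Minimal sketch, assuming the nominal YaRN target length is
# factor * original_max_position_embeddings.

def yarn_context_length(original_max_position_embeddings: int, factor: float) -> int:
    """Return the nominal context length targeted by a YaRN scaling factor."""
    return int(original_max_position_embeddings * factor)


# Field values copied from the config fragment in the text above.
yarn_fields = {
    "type": "yarn",
    "mrope_section": [16, 24, 24],
    "factor": 4,
    "original_max_position_embeddings": 32768,
}

target = yarn_context_length(
    yarn_fields["original_max_position_embeddings"], yarn_fields["factor"]
)
print(target)  # 131072
```

For comparison, the alternative mentioned for long video inputs (raising the position limit to 64k) corresponds to 65,536 positions without any scaling factor.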
Qwen1.5-14B
Qwen3-235B-A22B-Thinking-2507
Over the past three months, we have continued to scale the thinking capability of Qwen3-235B-A22B, improving both the quality and depth of reasoning. We are pleased to introduce Qwen3-235B-A22B-Thinking-2507, featuring the following key enhancements:

- Significantly improved performance on reasoning tasks, including logical reasoning, mathematics, science, coding, and academic benchmarks that typically require human expertise, achieving state-of-the-art results among open-source thinking models.
- Markedly better general capabilities, such as instruction following, tool usage, text generation, and alignment with human preferences.
- Enhanced 256K long-context understanding capabilities.

NOTE: This version has an increased thinking length. We strongly recommend its use in highly complex reasoning tasks.

Qwen3-235B-A22B-Thinking-2507 has the following features:

- Type: Causal Language Models
- Training Stage: Pretraining & Post-training
- Number of Parameters: 235B in total and 22B activated
- Number of Parameters (Non-Embedding): 234B
- Number of Layers: 94
- Number of Attention Heads (GQA): 64 for Q and 4 for KV
- Number of Experts: 128
- Number of Activated Experts: 8
- Context Length: 262,144 natively.

Additionally, to enforce model thinking, the default chat template automatically includes `<think>`. Therefore, it is normal for the model's output to contain only `</think>` without an explicit opening `<think>` tag.

For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our blog, GitHub, and Documentation.
| | Deepseek-R1-0528 | OpenAI O4-mini | OpenAI O3 | Gemini-2.5 Pro | Claude4 Opus Thinking | Qwen3-235B-A22B Thinking | Qwen3-235B-A22B-Thinking-2507 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Knowledge | | | | | | | |
| MMLU-Pro | 85.0 | 81.9 | 85.9 | 85.6 | - | 82.8 | 84.4 |
| MMLU-Redux | 93.4 | 92.8 | 94.9 | 94.4 | 94.6 | 92.7 | 93.8 |
| GPQA | 81.0 | 81.4 | 83.3 | 86.4 | 79.6 | 71.1 | 81.1 |
| SuperGPQA | 61.7 | 56.4 | - | 62.3 | - | 60.7 | 64.9 |
| Reasoning | | | | | | | |
| AIME25 | 87.5 | 92.7 | 88.9 | 88.0 | 75.5 | 81.5 | 92.3 |
| HMMT25 | 79.4 | 66.7 | 77.5 | 82.5 | 58.3 | 62.5 | 83.9 |
| LiveBench 20241125 | 74.7 | 75.8 | 78.3 | 82.4 | 78.2 | 77.1 | 78.4 |
| HLE | 17.7# | 18.1 | 20.3 | 21.6 | 10.7 | 11.8# | 18.2# |
| Coding | | | | | | | |
| LiveCodeBench v6 (25.02-25.05) | 68.7 | 71.8 | 58.6 | 72.5 | 48.9 | 55.7 | 74.1 |
| CFEval | 2099 | 1929 | 2043 | 2001 | - | 2056 | 2134 |
| OJBench | 33.6 | 33.3 | 25.4 | 38.9 | - | 25.6 | 32.5 |
| Alignment | | | | | | | |
| IFEval | 79.1 | 92.4 | 92.1 | 90.8 | 89.7 | 83.4 | 87.8 |
| Arena-Hard v2$ | 72.2 | 59.3 | 80.8 | 72.5 | 59.1 | 61.5 | 79.7 |
| Creative Writing v3 | 86.3 | 78.8 | 87.7 | 85.9 | 83.8 | 84.6 | 86.1 |
| WritingBench | 83.2 | 78.4 | 85.3 | 83.1 | 79.1 | 80.3 | 88.3 |
| Agent | | | | | | | |
| BFCL-v3 | 63.8 | 67.2 | 72.4 | 67.2 | 61.8 | 70.8 | 71.9 |
| TAU1-Retail | 63.9 | 71.8 | 73.9 | 74.8 | - | 54.8 | 67.8 |
| TAU1-Airline | 53.5 | 49.2 | 52.0 | 52.0 | - | 26.0 | 46.0 |
| TAU2-Retail | 64.9 | 71.0 | 76.3 | 71.3 | - | 40.4 | 71.9 |
| TAU2-Airline | 60.0 | 59.0 | 70.0 | 60.0 | - | 30.0 | 58.0 |
| TAU2-Telecom | 33.3 | 42.0 | 60.5 | 37.4 | - | 21.9 | 45.6 |
| Multilingualism | | | | | | | |
| MultiIF | 63.5 | 78.0 | 80.3 | 77.8 | - | 71.9 | 80.6 |
| MMLU-ProX | 80.6 | 79.0 | 83.3 | 84.7 | - | 80.0 | 81.0 |
| INCLUDE | 79.4 | 80.8 | 86.6 | 85.1 | - | 78.7 | 81.0 |
| PolyMATH | 46.9 | 48.7 | 49.7 | 52.2 | - | 54.7 | 60.1 |

\* For OpenAI O4-mini and O3, we use a medium reasoning effort, except for scores marked with \*, which are generated using high reasoning effort.

\# According to the official evaluation criteria of HLE, scores marked with \# refer to models that are not multi-modal and were evaluated only on the text-only subset.

$ For reproducibility, we report the win rates evaluated by GPT-4.1.

\& For highly challenging tasks (including PolyMATH and all reasoning and coding tasks), we use an output length of 81,920 tokens. For all other tasks, we set the output length to 32,768.

The code of Qwen3-MoE is in the latest Hugging Face `transformers`, and we advise you to use the latest version of `transformers`. With older versions of `transformers`, you will encounter an error.

After generation, the thinking content is separated from the final answer. Here `output_ids` is the list of generated token ids and `tokenizer` is the loaded tokenizer:

```python
# parsing thinking content
try:
    # rindex: find the last occurrence of 151668 (</think>)
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content)  # no opening <think> tag
print("content:", content)
```

SGLang:

```shell
python -m sglang.launch_server --model-path Qwen/Qwen3-235B-A22B-Thinking-2507 --tp 8 --context-length 262144 --reasoning-parser deepseek-r1
```

vLLM:

```shell
vllm serve Qwen/Qwen3-235B-A22B-Thinking-2507 --tensor-parallel-size 8 --max-model-len 262144 --enable-reasoning --reasoning-parser deepseek_r1
```

Qwen-Agent:

```python
from qwen_agent.agents import Assistant

# Define LLM
# Using Alibaba Cloud Model Studio
llm_cfg = {
    'model': 'qwen3-235b-a22b-thinking-2507',
    'model_type': 'qwen_dashscope',
}

# Using OpenAI-compatible API endpoint. It is recommended to disable the reasoning and the tool call parsing
# functionality of the deployment frameworks and let Qwen-Agent automate the related operations. For example,
# `VLLM_USE_MODELSCOPE=true vllm serve Qwen/Qwen3-235B-A22B-Thinking-2507 --served-model-name Qwen3-235B-A22B-Thinking-2507 --tensor-parallel-size 8 --max-model-len 262144`.
#
# llm_cfg = {
#     'model': 'Qwen3-235B-A22B-Thinking-2507',
#
#     # Use a custom endpoint compatible with OpenAI API:
#     'model_server': 'http://localhost:8000/v1',  # api_base without reasoning and tool call parsing
#     'api_key': 'EMPTY',
#     'generate_cfg': {
#         'thought_in_content': True,
#     },
# }

# Define Tools
tools = [
    {'mcpServers': {  # You can specify the MCP configuration file
            'time': {
                'command': 'uvx',
                'args': ['mcp-server-time', '--local-timezone=Asia/Shanghai']
            },
            "fetch": {
                "command": "uvx",
                "args": ["mcp-server-fetch"]
            }
        }
    },
    'code_interpreter',  # Built-in tools
]

# Define Agent
bot = Assistant(llm=llm_cfg, function_list=tools)

# Streaming generation
messages = [{'role': 'user', 'content': 'https://qwenlm.github.io/blog/ Introduce the latest developments of Qwen'}]
for responses in bot.run(messages=messages):
    pass
print(responses)
```

To serve with a context length of 1,010,000 tokens, download the model and replace `config.json` with `config_1m.json`:

```bash
export MODELNAME=Qwen3-235B-A22B-Thinking-2507
huggingface-cli download Qwen/${MODELNAME} --local-dir ${MODELNAME}
mv ${MODELNAME}/config.json ${MODELNAME}/config.json.bak
mv ${MODELNAME}/config_1m.json ${MODELNAME}/config.json
```

vLLM:

```bash
pip install -U vllm \
    --torch-backend=auto \
    --extra-index-url https://wheels.vllm.ai/nightly
```

```bash
VLLM_ATTENTION_BACKEND=DUAL_CHUNK_FLASH_ATTN VLLM_USE_V1=0 \
vllm serve ./Qwen3-235B-A22B-Thinking-2507 \
  --tensor-parallel-size 8 \
  --max-model-len 1010000 \
  --enable-chunked-prefill \
  --max-num-batched-tokens 131072 \
  --enforce-eager \
  --max-num-seqs 1 \
  --gpu-memory-utilization 0.85 \
  --enable-reasoning --reasoning-parser deepseek_r1
```

SGLang:

```bash
git clone https://github.com/sgl-project/sglang.git
cd sglang
pip install -e "python[all]"
```

```bash
python3 -m sglang.launch_server \
    --model-path ./Qwen3-235B-A22B-Thinking-2507 \
    --context-length 1010000 \
    --mem-frac 0.75 \
    --attention-backend dual_chunk_flash_attn \
    --tp 8 \
    --chunked-prefill-size 131072 \
    --reasoning-parser deepseek-r1
```

```
@misc{qwen3technicalreport,
      title={Qwen3 Technical Report},
      author={Qwen Team},
      year={2025},
      eprint={2505.09388},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.09388},
}
```
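The parsing step in the quickstart for this model (finding the last occurrence of token id 151668 and splitting there) can be isolated into a small helper that works on plain lists of token ids. This is only a sketch: the function name `split_thinking` is hypothetical, and it operates on integers so it can be tried without loading a model or tokenizer.

```python
from typing import List, Tuple

# Token id used in the quickstart snippet to locate the end of the thinking
# part; treating it as the end-of-thinking marker is taken from that snippet.
THINK_END_ID = 151668


def split_thinking(output_ids: List[int], end_id: int = THINK_END_ID) -> Tuple[List[int], List[int]]:
    """Split generated ids into (thinking_ids, answer_ids).

    The split point is just after the LAST occurrence of `end_id`. If the
    marker is absent, everything is treated as the answer.
    """
    try:
        # Search the reversed list so the first hit is the last occurrence.
        index = len(output_ids) - output_ids[::-1].index(end_id)
    except ValueError:
        index = 0
    return output_ids[:index], output_ids[index:]


thinking_ids, answer_ids = split_thinking([11, 12, THINK_END_ID, 21, 22])
print(thinking_ids, answer_ids)
```

Each half would then be passed to the tokenizer's decode step, as in the original snippet.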
Qwen1.5-MoE-A2.7B
Qwen1.5-MoE is a transformer-based MoE decoder-only language model pretrained on a large amount of data. For more details, please refer to our blog post and GitHub repo.

Model Details

Qwen1.5-MoE employs a Mixture of Experts (MoE) architecture, where the models are upcycled from dense language models. For instance, `Qwen1.5-MoE-A2.7B` is upcycled from `Qwen-1.8B`. It has 14.3B parameters in total and 2.7B activated parameters during runtime. While achieving comparable performance to `Qwen1.5-7B`, it requires only 25% of the training resources. We also observed that the inference speed is 1.74 times that of `Qwen1.5-7B`.

Requirements

The code of Qwen1.5-MoE is in the latest Hugging Face transformers, and we advise you to build from source with the command `pip install git+https://github.com/huggingface/transformers`; otherwise you might encounter an error.

We do not advise you to use base language models for text generation. Instead, you can apply post-training, e.g., SFT, RLHF, continued pretraining, etc., on this model.
Qwen3-VL-30B-A3B-Thinking
Meet Qwen3-VL — the most powerful vision-language model in the Qwen series to date. This generation delivers comprehensive upgrades across the board: superior text understanding & generation, deepe...
Qwen3-1.7B-GGUF
Qwen3 is the latest generation of large language models in the Qwen series, offering a comprehensive suite of dense and mixture-of-experts (MoE) models. Built upon extensive training, Qwen3 delivers groundbreaking advancements in reasoning, instruction-following, agent capabilities, and multilingual support, with the following key features:

- Unique support for seamless switching between thinking mode (for complex logical reasoning, math, and coding) and non-thinking mode (for efficient, general-purpose dialogue) within a single model, ensuring optimal performance across various scenarios.
- Significant enhancement of its reasoning capabilities, surpassing previous QwQ (in thinking mode) and Qwen2.5 instruct models (in non-thinking mode) on mathematics, code generation, and commonsense logical reasoning.
- Superior human preference alignment, excelling in creative writing, role-playing, multi-turn dialogues, and instruction following, to deliver a more natural, engaging, and immersive conversational experience.
- Expertise in agent capabilities, enabling precise integration with external tools in both thinking and non-thinking modes and achieving leading performance among open-source models in complex agent-based tasks.
- Support for 100+ languages and dialects with strong capabilities for multilingual instruction following and translation.

Qwen3-1.7B has the following features:

- Type: Causal Language Models
- Training Stage: Pretraining & Post-training
- Number of Parameters: 1.7B
- Number of Parameters (Non-Embedding): 1.4B
- Number of Layers: 28
- Number of Attention Heads (GQA): 16 for Q and 8 for KV

For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our blog, GitHub, and Documentation.

Check out our llama.cpp documentation for more usage guidance. We advise you to clone `llama.cpp` and install it following the official guide. We follow the latest version of llama.cpp.
In the following demonstration, we assume that you are running commands under the repository `llama.cpp`.

Check out our ollama documentation for more usage guidance.

You can add `/think` and `/no_think` to user prompts or system messages to switch the model's thinking mode from turn to turn. The model will follow the most recent instruction in multi-turn conversations.

To achieve optimal performance, we recommend the following settings:

1. Sampling Parameters:
   - For thinking mode (`enable_thinking=True`), use `Temperature=0.6`, `TopP=0.95`, `TopK=20`, `MinP=0`, and `PresencePenalty=1.5`. DO NOT use greedy decoding, as it can lead to performance degradation and endless repetitions.
   - For non-thinking mode (`enable_thinking=False`), we suggest using `Temperature=0.7`, `TopP=0.8`, `TopK=20`, `MinP=0`, and `PresencePenalty=1.5`.
   - We recommend setting `presence_penalty` to 1.5 for quantized models to suppress repetitive outputs. You can adjust the `presence_penalty` parameter between 0 and 2. A higher value may occasionally lead to language mixing and a slight reduction in model performance.
2. Adequate Output Length: We recommend using an output length of 32,768 tokens for most queries. For benchmarking on highly complex problems, such as those found in math and programming competitions, we suggest setting the max output length to 38,912 tokens. This provides the model with sufficient space to generate detailed and comprehensive responses, thereby enhancing its overall performance.
3. Standardize Output Format: We recommend using prompts to standardize model outputs when benchmarking.
   - Math Problems: Include "Please reason step by step, and put your final answer within \boxed{}." in the prompt.
   - Multiple-Choice Questions: Add the following JSON structure to the prompt to standardize responses: "Please show your choice in the `answer` field with only the choice letter, e.g., `"answer": "C"`."
4. No Thinking Content in History: In multi-turn conversations, the historical model output should only include the final output part and does not need to include the thinking content. This is implemented in the provided chat template in Jinja2. However, for frameworks that do not directly use the Jinja2 chat template, it is up to the developers to ensure that the best practice is followed.

If you find our work helpful, feel free to cite us.
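For frameworks that do not apply the Jinja2 chat template, the recommendations above could be applied by hand along the following lines. This is a minimal sketch under stated assumptions: the reasoning is assumed to be delimited by `<think>...</think>` tags in the assistant text, the lowercase sampling keys are illustrative and may not match your inference framework's parameter names, and `strip_thinking` / `clean_history` are hypothetical helpers rather than part of any Qwen library.

```python
import re
from typing import Dict, List

# Sampling values copied from the recommendations above; key names are
# illustrative (assumption) and should be mapped to your framework's names.
SAMPLING_PRESETS = {
    "thinking": {"temperature": 0.6, "top_p": 0.95, "top_k": 20,
                 "min_p": 0.0, "presence_penalty": 1.5},
    "non_thinking": {"temperature": 0.7, "top_p": 0.8, "top_k": 20,
                     "min_p": 0.0, "presence_penalty": 1.5},
}

# Assumption: thinking content is wrapped in <think>...</think>.
_THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)


def strip_thinking(text: str) -> str:
    """Remove thinking content and keep only the final answer text."""
    text = _THINK_BLOCK.sub("", text)
    # Also handle output that carries only a closing tag.
    if "</think>" in text:
        text = text.split("</think>", 1)[1]
    return text.strip()


def clean_history(messages: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Return a copy of the chat history with thinking removed from assistant turns."""
    cleaned = []
    for message in messages:
        if message["role"] == "assistant":
            message = {**message, "content": strip_thinking(message["content"])}
        cleaned.append(message)
    return cleaned


history = [
    {"role": "user", "content": "hi"},
    {"role": "assistant", "content": "<think>greet back</think>\n\nHello!"},
]
print(clean_history(history))
```

Calling `clean_history` before appending the next user turn keeps earlier reasoning out of the prompt while leaving user and system messages untouched.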
Qwen2-VL-7B-Instruct-AWQ
We're excited to unveil Qwen2-VL, the latest iteration of our Qwen-VL model, representing nearly a year of innovation.

- SoTA understanding of images of various resolution & ratio: Qwen2-VL achieves state-of-the-art performance on visual understanding benchmarks, including MathVista, DocVQA, RealWorldQA, MTVQA, etc.
- Understanding videos of 20min+: Qwen2-VL can understand videos over 20 minutes for high-quality video-based question answering, dialog, content creation, etc.
- Agent that can operate your mobiles, robots, etc.: With the abilities of complex reasoning and decision making, Qwen2-VL can be integrated with devices like mobile phones, robots, etc., for automatic operation based on visual environment and text instructions.
- Multilingual Support: To serve global users, besides English and Chinese, Qwen2-VL now supports the understanding of texts in different languages inside images, including most European languages, Japanese, Korean, Arabic, Vietnamese, etc.
- Naive Dynamic Resolution: Unlike before, Qwen2-VL can handle arbitrary image resolutions, mapping them into a dynamic number of visual tokens, offering a more human-like visual processing experience.
- Multimodal Rotary Position Embedding (M-ROPE): Decomposes positional embedding into parts to capture 1D textual, 2D visual, and 3D video positional information, enhancing its multimodal processing capabilities.

We have three models with 2, 7 and 72 billion parameters. This repo contains the instruction-tuned 7B Qwen2-VL model. For more information, visit our Blog and GitHub.

Benchmark Performance of Quantized Models

This section reports the generation performance of quantized models (including GPTQ and AWQ) of the Qwen2-VL series.
Specifically, we report:

- MMMU_VAL (Accuracy)
- DocVQA_VAL (Accuracy)
- MMBench_DEV_EN (Accuracy)
- MathVista_MINI (Accuracy)

| Model Size | Quantization | MMMU | DocVQA | MMBench | MathVista |
| --- | --- | --- | --- | --- | --- |
| Qwen2-VL-7B-Instruct | BF16 (🤗🤖) | 53.77 | 93.89 | 81.78 | 58.20 |
| | GPTQ-Int8 (🤗🤖) | 53.00 | 93.94 | 82.38 | 57.90 |
| | GPTQ-Int4 (🤗🤖) | 52.55 | 93.16 | 81.27 | 60.30 |
| | AWQ (🤗🤖) | 53.66 | 93.10 | 81.61 | 56.80 |

Speed Benchmark

This section reports the speed performance of bf16 models and quantized models (including GPTQ-Int4, GPTQ-Int8 and AWQ) of the Qwen2-VL series. Specifically, we report the inference speed (tokens/s) as well as the memory footprint (GB) under different context lengths.

The environment of the evaluation with Hugging Face transformers is:

- NVIDIA A100 80GB
- CUDA 11.8
- PyTorch 2.2.1+cu118
- Flash Attention 2.6.1
- Transformers 4.38.2
- AutoGPTQ 0.6.0+cu118
- AutoAWQ 0.2.5+cu118 (autoawq_kernels 0.0.6+cu118)

Notes:

- We use a batch size of 1 and the fewest GPUs possible for the evaluation.
- We test the speed and memory of generating 2048 tokens with input lengths of 1, 6144, 14336, 30720, 63488, and 129024 tokens (>32k is only available for Qwen2-72B-Instruct and Qwen2-7B-Instruct).
| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | GPU Memory (GB) |
| --- | --- | --- | --- | --- | --- |
| Qwen2-VL-7B-Instruct | 1 | BF16 | 1 | 39.02 | 16.07 |
| | | GPTQ-Int8 | 1 | 31.60 | 10.11 |
| | | GPTQ-Int4 | 1 | 42.76 | 7.20 |
| | | AWQ | 1 | 32.08 | 7.07 |
| | 6144 | BF16 | 1 | 38.75 | 21.56 |
| | | GPTQ-Int8 | 1 | 31.31 | 15.61 |
| | | GPTQ-Int4 | 1 | 39.75 | 12.69 |
| | | AWQ | 1 | 32.66 | 12.56 |
| | 14336 | BF16 | 1 | 30.65 | 29.07 |
| | | GPTQ-Int8 | 1 | 27.96 | 23.11 |
| | | GPTQ-Int4 | 1 | 29.72 | 20.20 |
| | | AWQ | 1 | 31.42 | 20.07 |
| | 30720 | BF16 | 1 | 19.53 | 44.08 |
| | | GPTQ-Int8 | 1 | 18.37 | 38.13 |
| | | GPTQ-Int4 | 1 | 19.15 | 35.22 |
| | | AWQ | 1 | 19.95 | 35.08 |

Requirements

The code of Qwen2-VL is in the latest Hugging Face transformers, and we advise you to build from source with the command `pip install git+https://github.com/huggingface/transformers`; otherwise you might encounter an error.

Quickstart

We offer a toolkit to help you handle various types of visual input more conveniently. This includes base64, URLs, and interleaved images and videos. The chat model is used through `transformers` together with `qwen_vl_utils`.

For input images, we support local files, base64, and URLs. For videos, we currently only support local files.

The model supports a wide range of resolution inputs. By default, it uses the native resolution for input, but higher resolutions can enhance performance at the cost of more computation. Users can set the minimum and maximum number of pixels to achieve an optimal configuration for their needs, such as a token count range of 256-1280, to balance speed and memory usage.

In addition, we provide two methods for fine-grained control over the image size input to the model:

1. Define `min_pixels` and `max_pixels`: Images will be resized to maintain their aspect ratio within the range of `min_pixels` and `max_pixels`.
2. Specify exact dimensions: Directly set `resized_height` and `resized_width`. These values will be rounded to the nearest multiple of 28.

While Qwen2-VL is applicable to a wide range of visual tasks, it is equally important to understand its limitations. Here are some known restrictions:

1. Lack of Audio Support: The current model does not comprehend audio information within videos.
2. Data timeliness: Our image dataset is updated until June 2023, and information subsequent to this date may not be covered.
3. Constraints in Individuals and Intellectual Property (IP): The model's capacity to recognize specific individuals or IPs is limited, potentially failing to comprehensively cover all well-known personalities or brands.
4. Limited Capacity for Complex Instruction: When faced with intricate multi-step instructions, the model's understanding and execution capabilities require enhancement.
5. Insufficient Counting Accuracy: Particularly in complex scenes, the accuracy of object counting is not high, necessitating further improvements.
6. Weak Spatial Reasoning Skills: Especially in 3D spaces, the model's inference of object positional relationships is inadequate, making it difficult to precisely judge the relative positions of objects.

These limitations serve as ongoing directions for model optimization and improvement, and we are committed to continually enhancing the model's performance and scope of application.

If you find our work helpful, feel free to cite us.
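The two image-size controls described in this card (a pixel budget that preserves aspect ratio, and exact dimensions rounded to a multiple of 28) can be illustrated with plain arithmetic. The sketch below is not the toolkit's implementation: the function names are hypothetical, and it assumes that one visual token corresponds to a 28×28 pixel block, which is how the "token count range of 256-1280" is translated into pixel bounds here.

```python
import math
from typing import Tuple

PATCH = 28  # dimensions are rounded to multiples of 28 (from the text above)


def token_range_to_pixels(min_tokens: int, max_tokens: int, patch: int = PATCH) -> Tuple[int, int]:
    """Convert a visual-token range into (min_pixels, max_pixels).

    Assumption: one token per patch x patch pixel block.
    """
    return min_tokens * patch * patch, max_tokens * patch * patch


def round_to_multiple(value: float, multiple: int = PATCH) -> int:
    """Round to the nearest multiple, never returning less than one multiple."""
    return max(multiple, round(value / multiple) * multiple)


def fit_to_pixel_budget(height: int, width: int, min_pixels: int, max_pixels: int,
                        patch: int = PATCH) -> Tuple[int, int]:
    """Resize (height, width) so that the area lies in [min_pixels, max_pixels],
    roughly preserving aspect ratio, with both sides multiples of `patch`."""
    h, w = round_to_multiple(height, patch), round_to_multiple(width, patch)
    if h * w > max_pixels:
        # Shrink: round down so the area cannot exceed the upper bound.
        scale = math.sqrt(height * width / max_pixels)
        h = max(patch, math.floor(height / scale / patch) * patch)
        w = max(patch, math.floor(width / scale / patch) * patch)
    elif h * w < min_pixels:
        # Enlarge: round up so the area cannot fall below the lower bound.
        scale = math.sqrt(min_pixels / (height * width))
        h = math.ceil(height * scale / patch) * patch
        w = math.ceil(width * scale / patch) * patch
    return h, w


min_pixels, max_pixels = token_range_to_pixels(256, 1280)
print(min_pixels, max_pixels)  # 200704 1003520
print(fit_to_pixel_budget(2000, 3000, min_pixels, max_pixels))
print(round_to_multiple(1000))  # 1008
```

Rounding down when shrinking and up when enlarging is a design choice that keeps the final area inside the budget even after snapping both sides to multiples of 28.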
Qwen2.5-0.5B-Instruct-GGUF
Qwen2.5 is the latest series of Qwen large language models. For Qwen2.5, we release a number of base language models and instruction-tuned language models ranging from 0.5 to 72 billion parameters. Qwen2.5 brings the following improvements upon Qwen2:

- Significantly more knowledge and greatly improved capabilities in coding and mathematics, thanks to our specialized expert models in these domains.
- Significant improvements in instruction following, generating long texts (over 8K tokens), understanding structured data (e.g., tables), and generating structured outputs, especially JSON. More resilient to the diversity of system prompts, enhancing role-play implementation and condition-setting for chatbots.
- Long-context support up to 128K tokens, with generation of up to 8K tokens.
- Multilingual support for over 29 languages, including Chinese, English, French, Spanish, Portuguese, German, Italian, Russian, Japanese, Korean, Vietnamese, Thai, Arabic, and more.

This repo contains the instruction-tuned 0.5B Qwen2.5 model in the GGUF format, which has the following features:

- Type: Causal Language Models
- Training Stage: Pretraining & Post-training
- Architecture: transformers with RoPE, SwiGLU, RMSNorm, Attention QKV bias and tied word embeddings
- Number of Parameters: 0.49B
- Number of Parameters (Non-Embedding): 0.36B
- Number of Layers: 24
- Number of Attention Heads (GQA): 14 for Q and 2 for KV
- Context Length: Full 32,768 tokens and generation 8192 tokens
- Quantization: q2_K, q3_K_M, q4_0, q4_K_M, q5_0, q5_K_M, q6_K, q8_0

For more details, please refer to our blog, GitHub, and Documentation.

Check out our llama.cpp documentation for more usage guidance. We advise you to clone `llama.cpp` and install it following the official guide. We follow the latest version of llama.cpp. In the following demonstration, we assume that you are running commands under the repository `llama.cpp`.
Since cloning the entire repo may be inefficient, you can manually download the GGUF file that you need or use `huggingface-cli`.

For a chatbot-like experience, we recommend starting in conversation mode.

Detailed evaluation results are reported in this 📑 blog. For quantized models, the benchmark results against the original bfloat16 models can be found here. For requirements on GPU memory and the respective throughput, see results here.

If you find our work helpful, feel free to cite us.
Qwen3-4B-Instruct-2507-FP8
---
library_name: transformers
license: apache-2.0
license_link: https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507-FP8/blob/main/LICENSE
pipeline_tag: text-generation
base_model:
- Qwen/Qwen3-4B-Instruct-2507
---
Qwen3-Omni-30B-A3B-Thinking
Qwen3-Omni is a natively end-to-end, multilingual, omni-modal foundation model. It processes text, images, audio, and video, and delivers real-time streaming responses in both text and natural speech. We introduce several architectural upgrades to improve performance and efficiency. Key features:

- State-of-the-art across modalities: Early text-first pretraining and mixed multimodal training provide native multimodal support. While achieving strong audio and audio-video results, unimodal text and image performance does not regress. Reaches SOTA on 22 of 36 audio/video benchmarks and open-source SOTA on 32 of 36; ASR, audio understanding, and voice conversation performance is comparable to Gemini 2.5 Pro.
- Multilingual: Supports 119 text languages, 19 speech input languages, and 10 speech output languages.
  - Speech Input: English, Chinese, Korean, Japanese, German, Russian, Italian, French, Spanish, Portuguese, Malay, Dutch, Indonesian, Turkish, Vietnamese, Cantonese, Arabic, Urdu.
  - Speech Output: English, Chinese, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean.
- Novel Architecture: MoE-based Thinker–Talker design with AuT pretraining for strong general representations, plus a multi-codebook design that drives latency to a minimum.
- Real-time Audio/Video Interaction: Low-latency streaming with natural turn-taking and immediate text or speech responses.
- Flexible Control: Customize behavior via system prompts for fine-grained control and easy adaptation.
- Detailed Audio Captioner: Qwen3-Omni-30B-A3B-Captioner is now open source: a general-purpose, highly detailed, low-hallucination audio captioning model that fills a critical gap in the open-source community.

Qwen3-Omni supports a wide range of multimodal application scenarios, covering various domain tasks involving audio, image, video, and audio-visual modalities. Below are several cookbooks demonstrating use cases of Qwen3-Omni; these cookbooks include our actual execution logs.
You can first follow the QuickStart guide to download the model and install the necessary inference environment dependencies, then run and experiment locally. Try modifying prompts or switching model types, and enjoy exploring the capabilities of Qwen3-Omni!

| Category | Cookbook | Description |
| --- | --- | --- |
| Audio | Speech Recognition | Speech recognition, supporting multiple languages and long audio. |
| | Speech Translation | Speech-to-Text / Speech-to-Speech translation. |
| | Music Analysis | Detailed analysis and appreciation of any music, including style, genre, rhythm, etc. |
| | Sound Analysis | Description and analysis of various sound effects and audio signals. |
| | Audio Caption | Audio captioning, detailed description of any audio input. |
| | Mixed Audio Analysis | Analysis of mixed audio content, such as speech, music, and environmental sounds. |
| Visual | Image Question | Answering arbitrary questions about any image. |
| | Image Math | Solving complex mathematical problems in images, highlighting the capabilities of the Thinking model. |
| | Video Description | Detailed description of video content. |
| | Video Navigation | Generating navigation commands from first-person motion videos. |
| | Video Scene Transition | Analysis of scene transitions in videos. |
| Audio-Visual | Audio Visual Question | Answering arbitrary questions in audio-visual scenarios, demonstrating the model's ability to model temporal alignment between audio and video. |
| | Audio Visual Interaction | Interactive communication with the model using audio-visual inputs, including task specification via audio. |
| | Audio Visual Dialogue | Conversational interaction with the model using audio-visual inputs, showcasing its capabilities in casual chat and assistant-like behavior. |
| Agent | Audio Function Call | Using audio input to perform function calls, enabling agent-like behaviors. |
| Downstream Task Fine-tuning | Omni Captioner | Introduction and capability demonstration of Qwen3-Omni-30B-A3B-Captioner, a downstream fine-tuned model based on Qwen3-Omni-30B-A3B-Instruct, illustrating the strong generalization ability of the Qwen3-Omni foundation model. |
Below is the description of all Qwen3-Omni models. Please select and download the model that fits your needs.

| Model Name | Description |
| --- | --- |
| Qwen3-Omni-30B-A3B-Instruct | The Instruct model of Qwen3-Omni-30B-A3B, containing both thinker and talker, supporting audio, video, and text input, with audio and text output. For more information, please read the Qwen3-Omni Technical Report. |
| Qwen3-Omni-30B-A3B-Thinking | The Thinking model of Qwen3-Omni-30B-A3B, containing the thinker component, equipped with chain-of-thought reasoning, supporting audio, video, and text input, with text output. For more information, please read the Qwen3-Omni Technical Report. |
| Qwen3-Omni-30B-A3B-Captioner | A downstream fine-grained audio caption model fine-tuned from Qwen3-Omni-30B-A3B-Instruct, which produces detailed, low-hallucination captions for arbitrary audio inputs. It contains the thinker, supporting audio input and text output. For more information, you can refer to the model's cookbook. |

During loading in Hugging Face Transformers or vLLM, model weights will be automatically downloaded based on the model name. However, if your runtime environment is not conducive to downloading weights during execution, you can manually download the model weights to a local directory beforehand.

The Hugging Face Transformers code for Qwen3-Omni has been successfully merged, but the PyPI package has not yet been released. Therefore, you need to install it from source. We strongly recommend that you create a new Python environment to avoid environment runtime issues.

We offer a toolkit to help you handle various types of audio and visual input more conveniently, providing an API-like experience. This includes support for base64, URLs, and interleaved audio, images, and videos.
When installing it, make sure your system has `ffmpeg` installed.

Additionally, we recommend using FlashAttention 2 when running with Hugging Face Transformers to reduce GPU memory usage. However, if you are primarily using vLLM for inference, this installation is not necessary, as vLLM includes FlashAttention 2 by default. Also, you should have hardware that is compatible with FlashAttention 2. Read more about it in the official documentation of the FlashAttention repository. FlashAttention 2 can only be used when a model is loaded in `torch.float16` or `torch.bfloat16`.

Qwen3-Omni is used through `transformers` together with `qwen_omni_utils`. Here are some more advanced usage examples.

The model can batch inputs composed of mixed samples of various types, such as text, images, audio, and videos, when `return_audio=False` is set.

The model supports both text and audio outputs. If users do not need audio outputs, they can call `model.disable_talker()` after initializing the model. This option will save about `10GB` of GPU memory, but the `return_audio` option of the `generate` function will then only allow `False`. For a more flexible experience, we recommend that users decide whether to return audio when the `generate` function is called. If `return_audio` is set to `False`, the model will only return text outputs, resulting in faster text responses.

Qwen3-Omni supports changing the voice of the output audio. The `"Qwen/Qwen3-Omni-30B-A3B-Instruct"` checkpoint supports three voice types as follows:

| Voice Type | Gender | Description |
| --- | --- | --- |
| Ethan | Male | A bright, upbeat voice with infectious energy and a warm, approachable vibe. |
| Chelsie | Female | A honeyed, velvety voice that carries a gentle warmth and luminous clarity. |
| Aiden | Male | A warm, laid-back American voice with a gentle, boyish charm. |

Users can use the `speaker` parameter of the `generate` function to specify the voice type. By default, if `speaker` is not specified, the voice type is `Ethan`.

We strongly recommend using vLLM for inference and deployment of the Qwen3-Omni series models. Since our code is currently in the pull request stage, and audio output inference support for the Instruct model will be released in the near future, you need to install vLLM from source. Please note that we recommend you create a new Python environment to avoid runtime environment conflicts and incompatibilities. For more details on compiling vLLM from source, please refer to the vLLM official documentation.

For vLLM inference, the `limit_mm_per_prompt` parameter specifies the maximum number of each modality's data allowed per message. Since vLLM needs to pre-allocate GPU memory, larger values will require more GPU memory; if OOM issues occur, try reducing this value. Setting `tensor_parallel_size` greater than one enables multi-GPU parallel inference, improving concurrency and throughput. In addition, `max_num_seqs` indicates the number of sequences that vLLM processes in parallel during each inference step. A larger value requires more GPU memory but enables higher batch inference speed. For more details, please refer to the vLLM official documentation.

Here are some more advanced usage examples. Using vLLM enables fast batch inference, which can help you efficiently process large volumes of data or conduct benchmarking.

vLLM serve for Qwen3-Omni currently only supports the thinker model. The `use_audio_in_video` parameter is not available in vLLM serve; you can handle this by separately passing video and audio inputs for processing.
You can start vLLM serve through the following command:

Then you can use the chat API as below (via curl, for example):

| Model | Precision | 15s Video | 30s Video | 60s Video | 120s Video |
|------------------------------|-----------|-----------|-----------|-----------|------------|
| Qwen3-Omni-30B-A3B-Instruct | BF16 | 78.85 GB | 88.52 GB | 107.74 GB | 144.81 GB |
| Qwen3-Omni-30B-A3B-Thinking | BF16 | 68.74 GB | 77.79 GB | 95.76 GB | 131.65 GB |

Note: The table above presents the theoretical minimum memory requirements for inference with `transformers` and `BF16` precision, tested with `attn_implementation="flash_attention_2"`. The Instruct model includes both the thinker and talker components, whereas the Thinking model includes only the thinker part.

When using Qwen3-Omni for audio-visual multimodal interaction, where the input consists of a video and its corresponding audio (with the audio serving as a query), we recommend using the following system prompt. This setup helps the model maintain high reasoning capability while better assuming interactive roles such as a smart assistant. Additionally, the text generated by the thinker will be more readable, with a natural, conversational tone and without complex formatting that is difficult to vocalize, leading to more stable and fluent audio output from the talker. You can customize the `user_system_prompt` field in the system prompt to include character settings or other role-specific descriptions as needed.

The `Qwen3-Omni-30B-A3B-Thinking` model is primarily designed for understanding and interacting with multimodal inputs, including text, audio, image, and video. To achieve optimal performance, we recommend that users include an explicit textual instruction or task description in each round of dialogue alongside the multimodal input. This helps clarify the intent and significantly enhances the model's ability to leverage its reasoning capabilities.
For example:

In multimodal interaction, user-provided videos are often accompanied by audio (such as spoken questions or sounds from events in the video). This information helps the model provide a better interactive experience. We provide the following options for users to decide whether to use the audio from a video.

It is worth noting that during a multi-round conversation, the `use_audio_in_video` parameter must be set consistently across these steps; otherwise, unexpected results may occur.

Qwen3-Omni maintains state-of-the-art performance on text and visual modalities without degradation relative to same-size single-model Qwen counterparts. Across 36 audio and audio-visual benchmarks, it achieves open-source SOTA on 32 and sets the SOTA on 22, outperforming strong closed-source systems such as Gemini 2.5 Pro and GPT-4o.

| | | GPT-4o-0327 | Qwen3-235B-A22B Non Thinking | Qwen3-30B-A3B-Instruct-2507 | Qwen3-Omni-30B-A3B-Instruct | Qwen3-Omni-Flash-Instruct |
|---|---|---|---|---|---|---|
| Multilingual Tasks | MultiIF | 70.4 | 70.2 | 67.9 | 64.0 | 64.7 |

| | | Gemini-2.5-Flash Thinking | Qwen3-235B-A22B Thinking | Qwen3-30B-A3B-Thinking-2507 | Qwen3-Omni-30B-A3B-Thinking | Qwen3-Omni-Flash-Thinking |
|---|---|---|---|---|---|---|
| Multilingual Tasks | MultiIF | 74.4 | 71.9 | 76.4 | 72.9 | 73.2 |

| Datasets | Seed-ASR | Voxtral-Mini | Voxtral-Small | GPT-4o-Transcribe | Gemini-2.5-Pro | Qwen2.5-Omni | Qwen3-Omni-30B-A3B-Instruct | Qwen3-Omni-Flash-Instruct |
|---|---|---|---|---|---|---|---|---|
| Wenetspeech (net / meeting) | 4.66 / 5.69 | 24.30 / 31.53 | 20.33 / 26.08 | 15.30 / 32.27 | 14.43 / 13.47 | 5.91 / 7.65 | 4.69 / 5.89 | 4.62 / 5.75 |
| Librispeech (clean / other) | 1.58 / 2.84 | 1.88 / 4.12 | 1.56 / 3.30 | 1.39 / 3.75 | 2.89 / 3.56 | 1.74 / 3.45 | 1.22 / 2.48 | 1.27 / 2.44 |
| Fleurs-avg (19 lang) | - | 15.67 | 8.09 | 4.48 | 5.55 | 14.04 | 5.33 | 5.31 |
| MIR-1K (vocal-only) | 6.45 | 23.33 | 18.73 | 11.87 | 9.85 | 8.15 | 5.90 | 5.85 |
| Opencpop-test | 2.98 | 31.01 | 16.06 | 7.93 | 6.49 | 2.84 | 1.54 | 2.02 |
| Fleurs-en2xx | - | 30.35 | 37.85 | - | 39.25 | 29.22 | 37.50 | 36.22 |
| Fleurs-xx2en | - | 27.54 | 32.81 | - | 35.41 | 28.61 | 31.08 | 30.71 |
| Fleurs-zh2xx | - | 17.03 | 22.05 | - | 26.63 | 17.97 | 25.17 | 25.10 |
| Fleurs-xx2zh | - | 28.75 | 34.82 | - | 37.50 | 27.68 | 33.13 | 31.19 |

| Datasets | GPT-4o-Audio | Gemini-2.5-Flash | Gemini-2.5-Pro | Qwen2.5-Omni | Qwen3-Omni-30B-A3B-Instruct | Qwen3-Omni-30B-A3B-Thinking | Qwen3-Omni-Flash-Instruct | Qwen3-Omni-Flash-Thinking |
|---|---|---|---|---|---|---|---|---|
| MMAU-v05.15.25 | 62.5 | 71.8 | 77.4 | 65.5 | 77.5 | 75.4 | 77.6 | 76.5 |

| Datasets | Best Specialist Models | GPT-4o-Audio | Gemini-2.5-Pro | Qwen2.5-Omni | Qwen3-Omni-30B-A3B-Instruct | Qwen3-Omni-Flash-Instruct |
|---|---|---|---|---|---|---|
| RUL-MuchoMusic | 47.6 (Audio Flamingo 3) | 36.1 | 49.4 | 47.3 | 52.0 | 52.1 |
| MTG Genre Micro F1 | 35.8 (MuQ-MuLan) | 25.3 | 32.6 | 32.5 | 39.0 | 39.5 |
| MTG Mood/Theme Micro F1 | 10.9 (MuQ-MuLan) | 11.3 | 14.1 | 8.9 | 21.0 | 21.7 |
| MTG Instrument Micro F1 | 39.8 (MuQ-MuLan) | 34.2 | 33.0 | 22.6 | 40.5 | 40.7 |
| MTG Top50 Micro F1 | 33.2 (MuQ-MuLan) | 25.0 | 26.1 | 21.6 | 36.7 | 36.9 |
| MagnaTagATune Micro F1 | 41.6 (MuQ) | 29.2 | 28.1 | 30.1 | 44.3 | 46.8 |

Further comparison tables (against GPT4-o, Gemini-2.0-Flash, Qwen2.5-VL 72B, Gemini-2.5-Flash, InternVL-3.5-241B-A28B, Qwen2.5-Omni, previous open-source SoTA, MiniMax, and ElevenLabs) retain only their column headers here; their data is not available in this copy.

Decoding Strategy: For the Qwen3-Omni series across all evaluation benchmarks, `Instruct` models use greedy decoding during generation without sampling. For `Thinking` models, the decoding parameters should be taken from the `generation_config.json` file in the checkpoint.

Benchmark-Specific Formatting: Most evaluation benchmarks come with their own ChatML formatting to embed the question or prompt. Note that all video data are set to `fps=2` during evaluation.
Default Prompts: For tasks in certain benchmarks that do not include a prompt, we use the following prompt settings:

| Task Type | Prompt |
| :--- | :--- |
| Auto Speech Recognition (ASR) for Chinese | 请将这段中文语音转换为纯文本。 |
| Auto Speech Recognition (ASR) for Other languages | Transcribe the audio into text. |
| Speech-to-Text Translation (S2TT) | Listen to the provided speech and produce a translation in text. |
| Song Lyrics Recognition | Transcribe the song lyrics into text without any punctuation, separate lines with line breaks, and output only the lyrics without additional explanations. |

System Prompt: No `system prompt` should be set for any evaluation benchmark.

Input Sequence: The question or prompt should be input as user text. Unless otherwise specified by the benchmark, the text should come after multimodal data in the sequence. For example:
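A minimal sketch of a user turn that follows this ordering, with the multimodal items first and the text prompt last. The message layout follows the common chat-message convention of a `role` plus a list of typed content items; the helper name `build_user_message` and the file paths are hypothetical placeholders, not part of any library.

```python
from typing import Dict, List


def build_user_message(media: List[Dict[str, str]], prompt: str) -> Dict:
    """Return one user turn with all media items first and the text prompt last."""
    content = list(media)  # copy so the caller's list is not mutated
    content.append({"type": "text", "text": prompt})
    return {"role": "user", "content": content}


# No system message is included, matching the evaluation setting described above.
messages = [
    build_user_message(
        media=[
            {"type": "audio", "audio": "/path/to/audio.wav"},  # placeholder path
            {"type": "video", "video": "/path/to/video.mp4"},  # placeholder path
        ],
        prompt="Transcribe the audio into text.",
    )
]
```

Keeping the ordering in one helper makes it harder to accidentally place the prompt before the media when benchmark inputs are assembled in bulk.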
Qwen2.5-1.5B-Instruct-GGUF
Qwen2.5 is the latest series of Qwen large language models. For Qwen2.5, we release a number of base language models and instruction-tuned language models ranging from 0.5 to 72 billion parameters. Qwen2.5 brings the following improvements upon Qwen2:

- Significantly more knowledge and greatly improved capabilities in coding and mathematics, thanks to our specialized expert models in these domains.
- Significant improvements in instruction following, generating long texts (over 8K tokens), understanding structured data (e.g., tables), and generating structured outputs, especially JSON. More resilient to the diversity of system prompts, enhancing role-play implementation and condition-setting for chatbots.
- Long-context support up to 128K tokens, with generation of up to 8K tokens.
- Multilingual support for over 29 languages, including Chinese, English, French, Spanish, Portuguese, German, Italian, Russian, Japanese, Korean, Vietnamese, Thai, Arabic, and more.

This repo contains the instruction-tuned 1.5B Qwen2.5 model in the GGUF format, which has the following features:

- Type: Causal Language Models
- Training Stage: Pretraining & Post-training
- Architecture: transformers with RoPE, SwiGLU, RMSNorm, Attention QKV bias and tied word embeddings
- Number of Parameters: 1.54B
- Number of Parameters (Non-Embedding): 1.31B
- Number of Layers: 28
- Number of Attention Heads (GQA): 12 for Q and 2 for KV
- Context Length: Full 32,768 tokens and generation 8192 tokens
- Quantization: q2_K, q3_K_M, q4_0, q4_K_M, q5_0, q5_K_M, q6_K, q8_0

For more details, please refer to our blog, GitHub, and Documentation.

Check out our llama.cpp documentation for a more detailed usage guide. We advise you to clone `llama.cpp` and install it following the official guide. We follow the latest version of llama.cpp. In the following demonstration, we assume that you are running commands under the repository `llama.cpp`.
Since cloning the entire repo may be inefficient, you can manually download the GGUF file that you need or use `huggingface-cli`:

1. Install

For a chatbot-like experience, we recommend starting in conversation mode:

Detailed evaluation results are reported in this 📑 blog. For quantized models, the benchmark results against the original bfloat16 models can be found here. For requirements on GPU memory and the respective throughput, see results here.

If you find our work helpful, feel free to cite us.
Qwen3-1.7B-FP8
Qwen3Guard-Gen-8B
Qwen3Guard is a series of safety moderation models built upon Qwen3 and trained on a dataset of 1.19 million prompts and responses labeled for safety. The series includes models of three sizes (0.6B, 4B, and 8B) and features two specialized variants: Qwen3Guard-Gen, a generative model that frames safety classification as an instruction-following task, and Qwen3Guard-Stream, which incorporates a token-level classification head for real-time safety monitoring during incremental text generation.

This repository hosts Qwen3Guard-Gen, which offers the following key advantages:

- Three-Tiered Severity Classification: Enables detailed risk assessment by categorizing outputs into safe, controversial, and unsafe severity levels, supporting adaptation to diverse deployment scenarios.
- Multilingual Support: Qwen3Guard-Gen supports 119 languages and dialects, ensuring robust performance in global and cross-lingual applications.
- Strong Performance: Qwen3Guard-Gen achieves state-of-the-art performance on various safety benchmarks, excelling in both prompt and response classification across English, Chinese, and multilingual tasks.

For more details, please refer to our blog, GitHub, and Technical Report.

The latest version of `transformers` is recommended and `transformers>=4.51.0` is required.

For deployment, you can use `sglang>=0.4.6.post1` or `vllm>=0.9.0` to create an OpenAI-compatible API endpoint:

Here is an example API call using an OpenAI-compatible server:

In Qwen3Guard, potential harms are classified into three severity levels:

- Unsafe: Content generally considered harmful across most scenarios.
- Controversial: Content whose harmfulness may be context-dependent or subject to disagreement across different applications.
- Safe: Content generally considered safe across most scenarios.
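The three severity levels leave it to the deployer to decide how to treat the middle tier. The sketch below shows one way an application might map a severity label to a block or allow decision. The `Severity` enum, the `should_block` helper, and the `strict` switch are hypothetical names introduced for illustration and are not part of Qwen3Guard.

```python
from enum import Enum


class Severity(Enum):
    SAFE = "Safe"
    CONTROVERSIAL = "Controversial"
    UNSAFE = "Unsafe"


def should_block(severity: Severity, strict: bool) -> bool:
    """Decide whether to block content given its severity label.

    Unsafe content is always blocked and Safe content never is.
    Controversial content is blocked only under a strict policy,
    which is an application-level choice (assumption of this sketch).
    """
    if severity is Severity.UNSAFE:
        return True
    if severity is Severity.CONTROVERSIAL:
        return strict
    return False


# Example: a children's education app might run strict, a research tool might not.
decisions = {s.value: (should_block(s, True), should_block(s, False)) for s in Severity}
```

Each entry in `decisions` holds the strict decision followed by the lenient one, so only the `Controversial` row differs between the two policies.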
In the current version of Qwen3Guard, we consider the following safety categories: Violent: Content that provides detailed instructions, methods, or advice on how to commit acts of violence, including the manufacture, acquisition, or use of weapons. Also includes depictions of violence. Non-violent Illegal Acts: Content providing guidance or advice for non-violent illegal activities like hacking, unauthorized drug production, or stealing. Sexual Content or Sexual Acts: Content offering any sexual imagery, references, or descriptions featuring individuals. Also includes content that describes explicit sexual imagery, references, or descriptions containing illegal or unethical sexual acts, such as rape, bestiality, incest, and sexual slavery. Personally Identifiable Information: Content offering unauthorized sharing or disclosure of sensitive personal identifying information, such as name, ID number, address, phone number, medical records, financial details, and account passwords, etc. Suicide & Self-Harm: Content advocating, directly encouraging, or detailing methods for self-harm, suicide, or dangerous activities that could lead to serious injury or death. Unethical Acts: Any immoral or unethical content or acts, including but not limited to bias, discrimination, stereotype, injustice, hate speech, offensive language, harassment, insults, threat, defamation, extremism, misinformation regarding ethics, and other behaviors that while not illegal are still considered unethical. Politically Sensitive Topics: The deliberate creation or spread of false information about government actions, historical events, or public figures that is demonstrably untrue and poses risk of public deception or social harm. Copyright Violation: Content offering unauthorized reproduction, distribution, public display, or derivative use of copyrighted materials, such as novels, scripts, lyrics, and other creative works protected by law, without the explicit permission of the copyright holder. 
Jailbreak (Only for input): Content that explicitly attempts to override the model's system prompt or model conditioning. If you find our work helpful, feel free to give us a cite.
Qwen3-VL-235B-A22B-Thinking-FP8
> This repository contains an FP8 quantized version of the Qwen3-VL-235B-A22B-Thinking model. The quantization method is fine-grained fp8 quantization with block size of 128, and its performance metrics are nearly identical to those of the original BF16 model. Enjoy! Meet Qwen3-VL — the most powerful vision-language model in the Qwen series to date. This generation delivers comprehensive upgrades across the board: superior text understanding & generation, deeper visual perception & reasoning, extended context length, enhanced spatial and video dynamics comprehension, and stronger agent interaction capabilities. Available in Dense and MoE architectures that scale from edge to cloud, with Instruct and reasoning‑enhanced Thinking editions for flexible, on‑demand deployment. Visual Agent: Operates PC/mobile GUIs—recognizes elements, understands functions, invokes tools, completes tasks. Visual Coding Boost: Generates Draw.io/HTML/CSS/JS from images/videos. Advanced Spatial Perception: Judges object positions, viewpoints, and occlusions; provides stronger 2D grounding and enables 3D grounding for spatial reasoning and embodied AI. Long Context & Video Understanding: Native 256K context, expandable to 1M; handles books and hours-long video with full recall and second-level indexing. Enhanced Multimodal Reasoning: Excels in STEM/Math—causal analysis and logical, evidence-based answers. Upgraded Visual Recognition: Broader, higher-quality pretraining is able to “recognize everything”—celebrities, anime, products, landmarks, flora/fauna, etc. Expanded OCR: Supports 32 languages (up from 19); robust in low light, blur, and tilt; better with rare/ancient characters and jargon; improved long-document structure parsing. Text Understanding on par with pure LLMs: Seamless text–vision fusion for lossless, unified comprehension. 1. Interleaved-MRoPE: Full‑frequency allocation over time, width, and height via robust positional embeddings, enhancing long‑horizon video reasoning. 2. 
DeepStack: Fuses multi‑level ViT features to capture fine‑grained details and sharpen image–text alignment. 3. Text–Timestamp Alignment: Moves beyond T‑RoPE to precise, timestamp‑grounded event localization for stronger video temporal modeling. This is the weight repository for the FP8 version of Qwen3-VL-235B-A22B-Thinking. Currently, 🤗 Transformers does not support loading these weights directly. Stay tuned! We recommend deploying the model using vLLM or SGLang, with example launch commands provided below. For details on the runtime environment and deployment, please refer to this link. Here we provide a code snippet demonstrating how to use vLLM to run inference with Qwen3-VL locally. For more details on efficient deployment with vLLM, please refer to the community deployment guide. Here we provide a code snippet demonstrating how to use SGLang to run inference with Qwen3-VL locally. If you find our work helpful, feel free to give us a cite.
Qwen1.5-0.5B
Qwen3-Omni-30B-A3B-Captioner
Since the research community currently lacks a general-purpose audio captioning model, we fine-tuned Qwen3-Omni-30B-A3B to obtain Qwen3-Omni-30B-A3B-Captioner, which produces detailed, low-hallucination captions for arbitrary audio inputs. Qwen3-Omni-30B-A3B-Captioner is a powerful fine-grained audio analysis model, built upon the Qwen3-Omni-30B-A3B-Instruct base model. It is specifically designed to generate accurate and comprehensive content descriptions in complex and diverse audio scenarios. Without requiring any additional prompting, the model can automatically parse and describe various types of audio content, ranging from complex speech and environmental sounds to music and cinematic sound effects, delivering stable and reliable outputs even in multi-source, mixed audio environments. In terms of speech understanding, Qwen3-Omni-30B-A3B-Captioner excels at identifying multiple speaker emotions, multilingual expressions, and layered intentions. It can also perceive cultural context and implicit information within the audio, enabling a deep comprehension of the underlying meaning behind the spoken words. In non-speech scenarios, the model demonstrates exceptional sound recognition and analysis capabilities, accurately distinguishing and describing intricate layers of real-world sounds, ambient atmospheres, and dynamic audio details in film and media. Note: Qwen3-Omni-30B-A3B-Captioner is a single-turn model that accepts only one audio input per inference. It does not accept any text prompts and supports audio input only, with text output only. As Qwen3-Omni-30B-A3B-Captioner is designed for generating fine‑grained descriptions of audio, excessively long audio clips may diminish detail perception. We recommend, as a best practice, limiting audio length to no more than 30 seconds. 
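For recordings longer than the recommended 30 seconds, one option is to caption fixed-length segments separately. The sketch below only computes segment boundaries in seconds. Splitting at fixed boundaries is an assumption of this example rather than a method described by the authors, and it may cut through a sound event, so overlap or silence-based splitting may work better in practice. The helper name `split_into_clips` is hypothetical.

```python
from typing import List, Tuple

MAX_CLIP_SECONDS = 30.0  # recommended upper bound on audio length from the note above


def split_into_clips(
    duration_s: float, max_len_s: float = MAX_CLIP_SECONDS
) -> List[Tuple[float, float]]:
    """Return (start, end) boundaries in seconds, each clip at most max_len_s long."""
    if max_len_s <= 0:
        raise ValueError("max_len_s must be positive")
    clips: List[Tuple[float, float]] = []
    start = 0.0
    while start < duration_s:
        end = min(start + max_len_s, duration_s)
        clips.append((start, end))
        start = end
    return clips


clips = split_into_clips(75.0)  # a 75-second recording yields three clips
```

Because the model accepts only one audio input per inference, each clip would be sent as its own single-turn request.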
| Model Name | Description | |------------------------------|-------------| | Qwen3-Omni-30B-A3B-Captioner | A downstream audio fine-grained caption model fine-tuned from Qwen3-Omni-30B-A3B-Instruct, which produces detailed, low-hallucination captions for arbitrary audio inputs. It contains the thinker, supporting audio input and text output. For more information, you can refer to the model's cookbook or Hugging Face Demo and ModelScope Demo. | During loading in Hugging Face Transformers or vLLM, model weights will be automatically downloaded based on the model name. However, if your runtime environment is not conducive to downloading weights during execution, you can refer to the following commands to manually download the model weights to a local directory: The Hugging Face Transformers code for Qwen3-Omni has been successfully merged, but the PyPI package has not yet been released. Therefore, you need to install it from source using the following command. We strongly recommend that you create a new Python environment to avoid environment runtime issues. We offer a toolkit to help you handle various types of audio and visual input more conveniently, providing an API-like experience. This includes support for base64, URLs, and interleaved audio, images, and videos. You can install it using the following command and make sure your system has `ffmpeg` installed: Additionally, we recommend using FlashAttention 2 when running with Hugging Face Transformers to reduce GPU memory usage. However, if you are primarily using vLLM for inference, this installation is not necessary, as vLLM includes FlashAttention 2 by default. Also, you should have hardware that is compatible with FlashAttention 2. Read more about it in the official documentation of the FlashAttention repository. FlashAttention 2 can only be used when a model is loaded in `torch.float16` or `torch.bfloat16`. 
Here is a code snippet to show you how to use Qwen3-Omni-30B-A3B-Captioner with `transformers` and `qwen_omni_utils`:

We strongly recommend using vLLM for inference and deployment of the Qwen3-Omni series models. Since our code is currently in the pull request stage, you can follow the commands below to install vLLM from source. Please note that we recommend you create a new Python environment to avoid runtime environment conflicts and incompatibilities. For more details on compiling vLLM from source, please refer to the vLLM official documentation. Below is a simple example of how to run Qwen3-Omni-30B-A3B-Captioner with vLLM:

You can start vLLM serve through the following command:

Then you can use the API as below (via curl, for example):
Qwen3-VL-32B-Instruct-FP8
> This repository contains an FP8 quantized version of the Qwen3-VL-32B-Instruct model. The quantization method is fine-grained fp8 quantization with block size of 128, and its performance metrics are nearly identical to those of the original BF16 model. Enjoy! Meet Qwen3-VL — the most powerful vision-language model in the Qwen series to date. This generation delivers comprehensive upgrades across the board: superior text understanding & generation, deeper visual perception & reasoning, extended context length, enhanced spatial and video dynamics comprehension, and stronger agent interaction capabilities. Available in Dense and MoE architectures that scale from edge to cloud, with Instruct and reasoning‑enhanced Thinking editions for flexible, on‑demand deployment. Visual Agent: Operates PC/mobile GUIs—recognizes elements, understands functions, invokes tools, completes tasks. Visual Coding Boost: Generates Draw.io/HTML/CSS/JS from images/videos. Advanced Spatial Perception: Judges object positions, viewpoints, and occlusions; provides stronger 2D grounding and enables 3D grounding for spatial reasoning and embodied AI. Long Context & Video Understanding: Native 256K context, expandable to 1M; handles books and hours-long video with full recall and second-level indexing. Enhanced Multimodal Reasoning: Excels in STEM/Math—causal analysis and logical, evidence-based answers. Upgraded Visual Recognition: Broader, higher-quality pretraining is able to “recognize everything”—celebrities, anime, products, landmarks, flora/fauna, etc. Expanded OCR: Supports 32 languages (up from 19); robust in low light, blur, and tilt; better with rare/ancient characters and jargon; improved long-document structure parsing. Text Understanding on par with pure LLMs: Seamless text–vision fusion for lossless, unified comprehension. 1. Interleaved-MRoPE: Full‑frequency allocation over time, width, and height via robust positional embeddings, enhancing long‑horizon video reasoning. 2. 
DeepStack: Fuses multi‑level ViT features to capture fine‑grained details and sharpen image–text alignment. 3. Text–Timestamp Alignment: Moves beyond T‑RoPE to precise, timestamp‑grounded event localization for stronger video temporal modeling. This is the weight repository for Qwen3-VL-32B-Instruct-FP8. Currently, 🤗 Transformers does not support loading these weights directly. Stay tuned! We recommend deploying the model using vLLM or SGLang, with example launch commands provided below. For details on the runtime environment and deployment, please refer to this link. Here we provide a code snippet demonstrating how to use vLLM to run inference with Qwen3-VL locally. For more details on efficient deployment with vLLM, please refer to the community deployment guide. Here we provide a code snippet demonstrating how to use SGLang to run inference with Qwen3-VL locally. If you find our work helpful, feel free to give us a cite.
Qwen3-Reranker-8B
The Qwen3 Embedding model series is the latest proprietary model of the Qwen family, specifically designed for text embedding and ranking tasks. Building upon the dense foundational models of the Qwen3 series, it provides a comprehensive range of text embedding and reranking models in various sizes (0.6B, 4B, and 8B). This series inherits the exceptional multilingual capabilities, long-text understanding, and reasoning skills of its foundational model. The Qwen3 Embedding series represents significant advancements in multiple text embedding and ranking tasks, including text retrieval, code retrieval, text classification, text clustering, and bitext mining.

Exceptional Versatility: The embedding model has achieved state-of-the-art performance across a wide range of downstream application evaluations. The 8B size embedding model ranks No.1 in the MTEB multilingual leaderboard (as of June 5, 2025, score 70.58), while the reranking model excels in various text retrieval scenarios.

Comprehensive Flexibility: The Qwen3 Embedding series offers a full spectrum of sizes (from 0.6B to 8B) for both embedding and reranking models, catering to diverse use cases that prioritize efficiency and effectiveness. Developers can seamlessly combine these two modules. Additionally, the embedding model allows for flexible vector definitions across all dimensions, and both embedding and reranking models support user-defined instructions to enhance performance for specific tasks, languages, or scenarios.

Multilingual Capability: The Qwen3 Embedding series offers support for over 100 languages, thanks to the multilingual capabilities of Qwen3 models. This includes various programming languages, and provides robust multilingual, cross-lingual, and code retrieval capabilities.
- Model Type: Text Reranking
- Supported Languages: 100+ Languages
- Number of Parameters: 8B
- Context Length: 32k

For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our blog and GitHub.

| Model Type | Models | Size | Layers | Sequence Length | Embedding Dimension | MRL Support | Instruction Aware |
|------------------|----------------------|------|--------|-----------------|---------------------|-------------|----------------|
| Text Embedding | Qwen3-Embedding-0.6B | 0.6B | 28 | 32K | 1024 | Yes | Yes |
| Text Embedding | Qwen3-Embedding-4B | 4B | 36 | 32K | 2560 | Yes | Yes |
| Text Embedding | Qwen3-Embedding-8B | 8B | 36 | 32K | 4096 | Yes | Yes |
| Text Reranking | Qwen3-Reranker-0.6B | 0.6B | 28 | 32K | - | - | Yes |
| Text Reranking | Qwen3-Reranker-4B | 4B | 36 | 32K | - | - | Yes |
| Text Reranking | Qwen3-Reranker-8B | 8B | 36 | 32K | - | - | Yes |

> Note:
> - `MRL Support` indicates whether the embedding model supports custom dimensions for the final embedding.
> - `Instruction Aware` notes whether the embedding or reranking model supports customizing the input instruction according to different tasks.
> - Our evaluation indicates that, for most downstream tasks, using instructions (instruct) typically yields an improvement of 1% to 5% compared to not using them. Therefore, we recommend that developers create tailored instructions specific to their tasks and scenarios. In multilingual contexts, we also advise users to write their instructions in English, as most instructions utilized during the model training process were originally written in English.

With Transformers versions earlier than 4.51.0, you may encounter the following error:

📌 Tip: We recommend that developers customize the `instruct` according to their specific scenarios, tasks, and languages.
Our tests have shown that in most retrieval scenarios, not using an `instruct` on the query side can lead to a drop in retrieval performance by approximately 1% to 5%.

| Model | Param | MTEB-R | CMTEB-R | MMTEB-R | MLDR | MTEB-Code | FollowIR |
|------------------------------------|--------|---------|---------|---------|--------|-----------|----------|
| Qwen3-Embedding-0.6B | 0.6B | 61.82 | 71.02 | 64.64 | 50.26 | 75.41 | 5.09 |
| Jina-multilingual-reranker-v2-base | 0.3B | 58.22 | 63.37 | 63.73 | 39.66 | 58.98 | -0.68 |
| gte-multilingual-reranker-base | 0.3B | 59.51 | 74.08 | 59.44 | 66.33 | 54.18 | -1.64 |
| BGE-reranker-v2-m3 | 0.6B | 57.03 | 72.16 | 58.36 | 59.51 | 41.38 | -0.01 |
| Qwen3-Reranker-0.6B | 0.6B | 65.80 | 71.31 | 66.36 | 67.28 | 73.42 | 5.41 |
| Qwen3-Reranker-4B | 4B | 69.76 | 75.94 | 72.74 | 69.97 | 81.20 | 14.84 |
| Qwen3-Reranker-8B | 8B | 69.02 | 77.45 | 72.94 | 70.19 | 81.22 | 8.05 |

> Note:
> - Evaluation results for reranking models. We use the retrieval subsets of MTEB(eng, v2), MTEB(cmn, v1), MMTEB and MTEB (Code), which are MTEB-R, CMTEB-R, MMTEB-R and MTEB-Code.
> - All scores are our runs based on the top-100 candidates retrieved by dense embedding model Qwen3-Embedding-0.6B.

Citation

If you find our work helpful, feel free to give us a cite.
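The reranking flow described above (retrieve candidates with an embedding model, then rescore each query-document pair under a task instruction) can be sketched in plain Python. Everything model-specific here is an assumption for illustration: the prompt template in `format_pair`, the default instruction text, and the idea of turning the logits of a "yes" and a "no" token into a relevance score. The logits are passed in as plain numbers so the example runs without a model; the model card's own snippet remains the reference for the exact input format.

```python
import math
from typing import List, Optional, Tuple

# Hypothetical default instruction; tailor it to your task as the tip above recommends.
DEFAULT_INSTRUCT = "Given a web search query, retrieve relevant passages that answer the query"


def format_pair(query: str, doc: str, instruct: Optional[str] = None) -> str:
    """Build one reranker input from an instruction, a query and a document (assumed template)."""
    instruct = instruct or DEFAULT_INSTRUCT
    return f"<Instruct>: {instruct}\n<Query>: {query}\n<Document>: {doc}"


def yes_probability(yes_logit: float, no_logit: float) -> float:
    """Two-way softmax over the 'yes' and 'no' logits, computed stably."""
    m = max(yes_logit, no_logit)
    e_yes = math.exp(yes_logit - m)
    e_no = math.exp(no_logit - m)
    return e_yes / (e_yes + e_no)


def rerank(docs: List[str], logits: List[Tuple[float, float]]) -> List[str]:
    """Sort documents by descending relevance given (yes_logit, no_logit) per document."""
    scored = [(yes_probability(y, n), d) for d, (y, n) in zip(docs, logits)]
    return [d for _, d in sorted(scored, key=lambda t: t[0], reverse=True)]


# Made-up logits for two candidate documents.
ranked = rerank(["doc A", "doc B"], [(0.0, 1.0), (2.0, 0.0)])
```

In a real pipeline the logits would come from the reranker's output at the final position, and `docs` would be the top candidates returned by the embedding model.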
Qwen2.5-Math-1.5B-Instruct
Qwen2.5-Coder-32B-Instruct-AWQ
Qwen2.5-Coder is the latest series of code-specific Qwen large language models (formerly known as CodeQwen). As of now, Qwen2.5-Coder covers six mainstream model sizes (0.5, 1.5, 3, 7, 14, and 32 billion parameters) to meet the needs of different developers. Qwen2.5-Coder brings the following improvements upon CodeQwen1.5:

- Significant improvements in code generation, code reasoning and code fixing. Based on the strong Qwen2.5, we scale up the training tokens to 5.5 trillion, including source code, text-code grounding, synthetic data, etc. Qwen2.5-Coder-32B has become the current state-of-the-art open-source code LLM, with its coding abilities matching those of GPT-4o.
- A more comprehensive foundation for real-world applications such as Code Agents, not only enhancing coding capabilities but also maintaining its strengths in mathematics and general competencies.
- Long-context support up to 128K tokens.

This repo contains the AWQ-quantized 4-bit instruction-tuned 32B Qwen2.5-Coder model, which has the following features:

- Type: Causal Language Models
- Training Stage: Pretraining & Post-training
- Architecture: transformers with RoPE, SwiGLU, RMSNorm, and Attention QKV bias
- Number of Parameters: 32.5B
- Number of Parameters (Non-Embedding): 31.0B
- Number of Layers: 64
- Number of Attention Heads (GQA): 40 for Q and 8 for KV
- Context Length: Full 131,072 tokens
  - Please refer to this section for detailed instructions on how to deploy Qwen2.5 for handling long texts.
- Quantization: AWQ 4-bit

For more details, please refer to our blog, GitHub, Documentation, Arxiv.

The code of Qwen2.5-Coder is included in the latest Hugging Face `transformers`, and we advise you to use the latest version of `transformers`. With `transformers<4.37.0`, you will encounter the following error:

Also check out our AWQ documentation for a more detailed usage guide.

Here is a code snippet using `apply_chat_template` that shows how to load the tokenizer and model and how to generate content.
The current `config.json` is set for context length up to 32,768 tokens. To handle extensive inputs exceeding 32,768 tokens, we utilize YaRN, a technique for enhancing model length extrapolation, ensuring optimal performance on lengthy texts. For supported frameworks, you can add the following to `config.json` to enable YaRN:

For deployment, we recommend using vLLM. Please refer to our Documentation for usage if you are not familiar with vLLM. Presently, vLLM only supports static YaRN, which means the scaling factor remains constant regardless of input length, potentially impacting performance on shorter texts. We advise adding the `rope_scaling` configuration only when processing long contexts is required.

Detailed evaluation results are reported in this 📑 blog. For requirements on GPU memory and the respective throughput, see results here.

If you find our work helpful, feel free to cite us.
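As a sketch of the YaRN setting discussed above, the helper below adds a `rope_scaling` entry to a config dictionary. The key names (`factor`, `original_max_position_embeddings`, `type`) and the factor of 4.0 are assumptions of this example, chosen so that 4.0 × 32,768 matches the 131,072-token full context length quoted earlier; verify them against the framework you deploy with. The helper name `enable_yarn` is hypothetical.

```python
import json
from typing import Any, Dict


def enable_yarn(
    config: Dict[str, Any], factor: float = 4.0, original_max_len: int = 32768
) -> Dict[str, Any]:
    """Return a copy of a model config with a static YaRN rope_scaling entry added."""
    patched = dict(config)  # shallow copy; the input dict is left untouched
    patched["rope_scaling"] = {
        "factor": factor,
        "original_max_position_embeddings": original_max_len,
        "type": "yarn",
    }
    return patched


base_config = {"max_position_embeddings": 32768}  # stand-in for the real config.json
long_config = enable_yarn(base_config)
print(json.dumps(long_config, indent=2))
```

Because static YaRN applies the same scaling factor to every input, it can help to keep the unpatched config for short-context workloads and switch to the patched one only when long inputs are expected.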
Qwen2.5-Math-RM-72B
Introduction

Qwen2.5-Math-RM-72B is specifically designed to guide the Qwen2.5-Math model throughout the training process by offering more granular feedback on the quality of reasoning and intermediate steps, ultimately facilitating more robust model improvements.

- Multilingual and Multi-Modal Support: Offers preference signals across two languages (Chinese and English) and in dual modes (Chain-of-Thought and Tool-integrated Reasoning), enhancing versatility.
- Model Training Guide:
  - Training Data Enhancement: Employs a data selection process via reward model scoring combined with Rejection Sampling to incrementally enhance the quality of responses.
  - Reinforcement Learning Training: Integrates seamlessly into reinforcement learning training and provides an effective reward signal, further improving model performance.
- Inference Boosting:
  - Best of N: By combining response sampling with a Best-of-N strategy, we choose the response with the top score as judged by the reward model, yielding better results at the cost of more inference time. For example, Qwen2.5-Math-1.5B-Instruct obtains 83.9 on MATH in the RM@8 setting and even surpasses the performance of Qwen2.5-Math-7B-Instruct (83.6) with greedy decoding.
  - Comparison with majority voting (Maj@N): RM@N scores are substantially better than Maj@N scores across almost all benchmarks and models.

For more details, please refer to our blog post, Technical Report and GitHub repo.

Requirements

`transformers>=4.40.0` for Qwen2.5-Math models. The latest version is recommended.

> [!Warning]
> 🚨 This is a must because `transformers` has integrated Qwen2.5 code since `4.37.0`.

For requirements on GPU memory and the respective throughput, see similar results of Qwen2 here.

> [!Important]
> Qwen2.5-Math-RM-72B is a reward model typically used for offering feedback on the quality of reasoning and intermediate steps, serving in Rejection Sampling, reinforcement learning training and RM@N.
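To make the difference between RM@N and Maj@N concrete, here is a small sketch that selects a final answer from N sampled responses in both ways. The sampled answers and reward scores are made up for illustration, and the function names are hypothetical. In practice the scores would come from the reward model.

```python
from collections import Counter
from typing import List, Tuple

Candidate = Tuple[str, float]  # (final answer, reward-model score)


def best_of_n(candidates: List[Candidate]) -> str:
    """RM@N: return the answer of the single highest-scoring response."""
    return max(candidates, key=lambda c: c[1])[0]


def majority_vote(candidates: List[Candidate]) -> str:
    """Maj@N: return the most frequent answer, ignoring reward scores."""
    counts = Counter(answer for answer, _ in candidates)
    return counts.most_common(1)[0][0]


# Three sampled responses with made-up scores: two agree on "42",
# but the reward model strongly prefers the response that answers "41".
samples = [("42", 0.10), ("42", 0.20), ("41", 0.95)]
rm_choice = best_of_n(samples)
maj_choice = majority_vote(samples)
```

Here the two strategies disagree: majority voting follows the most common answer, while Best-of-N follows the reward model, which is where the quality of the reward signal matters.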
The model card provides a code snippet showing how to use Qwen2.5-Math-RM-72B with `transformers`.

If you find our work helpful, feel free to give us a citation.
Qwen-7B-Chat
Qwen2-VL-7B-Instruct-GPTQ-Int4
We're excited to unveil Qwen2-VL, the latest iteration of our Qwen-VL model, representing nearly a year of innovation.

- SoTA understanding of images of various resolutions and aspect ratios: Qwen2-VL achieves state-of-the-art performance on visual understanding benchmarks, including MathVista, DocVQA, RealWorldQA, MTVQA, etc.
- Understanding videos of 20min+: Qwen2-VL can understand videos over 20 minutes long for high-quality video-based question answering, dialog, content creation, etc.
- Agent that can operate your mobiles, robots, etc.: With the abilities of complex reasoning and decision making, Qwen2-VL can be integrated with devices like mobile phones, robots, etc., for automatic operation based on visual environment and text instructions.
- Multilingual Support: To serve global users, besides English and Chinese, Qwen2-VL now supports the understanding of texts in different languages inside images, including most European languages, Japanese, Korean, Arabic, Vietnamese, etc.
- Naive Dynamic Resolution: Unlike before, Qwen2-VL can handle arbitrary image resolutions, mapping them into a dynamic number of visual tokens, offering a more human-like visual processing experience.
- Multimodal Rotary Position Embedding (M-ROPE): Decomposes positional embedding into parts to capture 1D textual, 2D visual, and 3D video positional information, enhancing its multimodal processing capabilities.

We have three models with 2, 7 and 72 billion parameters. This repo contains the instruction-tuned 7B Qwen2-VL model. For more information, visit our Blog and GitHub.

Benchmark Performance of Quantized Models

This section reports the generation performance of quantized models (including GPTQ and AWQ) of the Qwen2-VL series.
Specifically, we report:

- MMMU_VAL (Accuracy)
- DocVQA_VAL (Accuracy)
- MMBench_DEV_EN (Accuracy)
- MathVista_MINI (Accuracy)

| Model Size | Quantization | MMMU | DocVQA | MMBench | MathVista |
| --- | --- | --- | --- | --- | --- |
| Qwen2-VL-7B-Instruct | BF16 (🤗🤖) | 53.77 | 93.89 | 81.78 | 58.20 |
| | GPTQ-Int8 (🤗🤖) | 53.00 | 93.94 | 82.38 | 57.90 |
| | GPTQ-Int4 (🤗🤖) | 52.55 | 93.16 | 81.27 | 60.30 |
| | AWQ (🤗🤖) | 53.66 | 93.10 | 81.61 | 56.80 |

Speed Benchmark

This section reports the speed performance of bf16 models and quantized models (including GPTQ-Int4, GPTQ-Int8 and AWQ) of the Qwen2-VL series. Specifically, we report the inference speed (tokens/s) as well as the memory footprint (GB) at different context lengths.

The environment of the evaluation with Hugging Face transformers is:

- NVIDIA A100 80GB
- CUDA 11.8
- Pytorch 2.2.1+cu118
- Flash Attention 2.6.1
- Transformers 4.38.2
- AutoGPTQ 0.6.0+cu118
- AutoAWQ 0.2.5+cu118 (autoawq_kernels 0.0.6+cu118)

Notes:

- We use a batch size of 1 and the fewest GPUs possible for the evaluation.
- We test the speed and memory of generating 2048 tokens with input lengths of 1, 6144, 14336, 30720, 63488, and 129024 tokens (>32k is only available for Qwen2-72B-Instruct and Qwen2-7B-Instruct).
| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | GPU Memory (GB) |
| --- | --- | --- | --- | --- | --- |
| Qwen2-VL-7B-Instruct | 1 | BF16 | 1 | 39.02 | 16.07 |
| | | GPTQ-Int8 | 1 | 31.60 | 10.11 |
| | | GPTQ-Int4 | 1 | 42.76 | 7.20 |
| | | AWQ | 1 | 32.08 | 7.07 |
| | 6144 | BF16 | 1 | 38.75 | 21.56 |
| | | GPTQ-Int8 | 1 | 31.31 | 15.61 |
| | | GPTQ-Int4 | 1 | 39.75 | 12.69 |
| | | AWQ | 1 | 32.66 | 12.56 |
| | 14336 | BF16 | 1 | 30.65 | 29.07 |
| | | GPTQ-Int8 | 1 | 27.96 | 23.11 |
| | | GPTQ-Int4 | 1 | 29.72 | 20.20 |
| | | AWQ | 1 | 31.42 | 20.07 |
| | 30720 | BF16 | 1 | 19.53 | 44.08 |
| | | GPTQ-Int8 | 1 | 18.37 | 38.13 |
| | | GPTQ-Int4 | 1 | 19.15 | 35.22 |
| | | AWQ | 1 | 19.95 | 35.08 |

Requirements

The code of Qwen2-VL has been merged into the latest Hugging Face transformers, and we advise you to build from source with the command `pip install git+https://github.com/huggingface/transformers`; otherwise you might encounter an error when loading the model.

Quickstart

We offer a toolkit to help you handle various types of visual input more conveniently, as if you were using an API. This includes base64, URLs, and interleaved images and videos. The toolkit is installed as a separate package, and the chat model is used through `transformers` together with `qwen_vl_utils`.

For input images, we support local files, base64, and URLs. For videos, we currently only support local files.

The model supports a wide range of input resolutions. By default, it uses the native resolution of the input, but higher resolutions can enhance performance at the cost of more computation. Users can set the minimum and maximum number of pixels to suit their needs, such as a token count range of 256-1280, to balance speed and memory usage.

In addition, we provide two methods for fine-grained control over the image size input to the model:

1. Define `min_pixels` and `max_pixels`: Images will be resized to maintain their aspect ratio within the range of `min_pixels` and `max_pixels`.
2. Specify exact dimensions: Directly set `resized_height` and `resized_width`. These values will be rounded to the nearest multiple of 28.

While Qwen2-VL is applicable to a wide range of visual tasks, it is equally important to understand its limitations. Here are some known restrictions:

1. Lack of Audio Support: The current model does not comprehend audio information within videos.
2. Data timeliness: Our image dataset is updated until June 2023, and information subsequent to this date may not be covered.
3. Constraints in Individuals and Intellectual Property (IP): The model's capacity to recognize specific individuals or IPs is limited, potentially failing to comprehensively cover all well-known personalities or brands.
4. Limited Capacity for Complex Instruction: When faced with intricate multi-step instructions, the model's understanding and execution capabilities require enhancement.
5. Insufficient Counting Accuracy: Particularly in complex scenes, the accuracy of object counting is not high, necessitating further improvements.
6. Weak Spatial Reasoning Skills: Especially in 3D spaces, the model's inference of object positional relationships is inadequate, making it difficult to precisely judge the relative positions of objects.

These limitations serve as ongoing directions for model optimization and improvement, and we are committed to continually enhancing the model's performance and scope of application.

If you find our work helpful, feel free to cite us.
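The pixel-budget and rounding rules described in this card can be illustrated with a little arithmetic. The sketch below assumes that one visual token corresponds to a 28 x 28 pixel area, which is consistent with the "multiple of 28" rule but is not stated explicitly on this page; the function names are hypothetical and this is not the toolkit's own implementation.

```python
PATCH = 28  # assumed side length, in pixels, of the area covered by one visual token


def pixel_budget(min_tokens=256, max_tokens=1280):
    """Translate a visual-token range into min/max pixel counts."""
    return min_tokens * PATCH * PATCH, max_tokens * PATCH * PATCH


def round_to_patch(size):
    """Round a requested height or width to the nearest multiple of 28."""
    return max(PATCH, int(size / PATCH + 0.5) * PATCH)


min_pixels, max_pixels = pixel_budget()
print(min_pixels, max_pixels)                      # 200704 1003520
print(round_to_patch(1000), round_to_patch(420))   # 1008 420
```

Under this assumption, the 256-1280 token range quoted above corresponds to images of roughly 0.2 to 1.0 megapixels.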
Qwen1.5-MoE-A2.7B-Chat
Qwen2.5-72B-Instruct-GPTQ-Int4
Qwen2.5 is the latest series of Qwen large language models. For Qwen2.5, we release a number of base language models and instruction-tuned language models ranging from 0.5 to 72 billion parameters. Qwen2.5 brings the following improvements upon Qwen2:

- Significantly more knowledge and greatly improved capabilities in coding and mathematics, thanks to our specialized expert models in these domains.
- Significant improvements in instruction following, generating long texts (over 8K tokens), understanding structured data (e.g., tables), and generating structured outputs, especially JSON. More resilient to the diversity of system prompts, enhancing role-play implementation and condition-setting for chatbots.
- Long-context support up to 128K tokens, with generation of up to 8K tokens.
- Multilingual support for over 29 languages, including Chinese, English, French, Spanish, Portuguese, German, Italian, Russian, Japanese, Korean, Vietnamese, Thai, Arabic, and more.

This repo contains the GPTQ-quantized 4-bit instruction-tuned 72B Qwen2.5 model, which has the following features:

- Type: Causal Language Models
- Training Stage: Pretraining & Post-training
- Architecture: transformers with RoPE, SwiGLU, RMSNorm, and Attention QKV bias
- Number of Parameters: 72.7B
- Number of Parameters (Non-Embedding): 70.0B
- Number of Layers: 80
- Number of Attention Heads (GQA): 64 for Q and 8 for KV
- Context Length: Full 131,072 tokens and generation 8192 tokens
  - Please refer to this section for detailed instructions on how to deploy Qwen2.5 for handling long texts.
- Quantization: GPTQ 4-bit

For more details, please refer to our blog, GitHub, and Documentation.

The code of Qwen2.5 has been merged into the latest Hugging Face `transformers`, and we advise you to use the latest version of `transformers`. With `transformers<4.37.0`, loading the model fails with an error.

Also check out our GPTQ documentation for a more detailed usage guide.
The model card provides a code snippet using `apply_chat_template` to show how to load the tokenizer and model and how to generate content.

The current `config.json` is set for a context length of up to 32,768 tokens. To handle inputs exceeding 32,768 tokens, we use YaRN, a technique for improving model length extrapolation, to maintain performance on long texts. For supported frameworks, you can add a `rope_scaling` entry to `config.json` to enable YaRN.

For deployment, we recommend using vLLM. Please refer to our Documentation for usage if you are not familiar with vLLM. Presently, vLLM only supports static YaRN, which means the scaling factor remains constant regardless of input length, potentially impacting performance on shorter texts. We advise adding the `rope_scaling` configuration only when processing long contexts is required.

Detailed evaluation results are reported in this 📑 blog. For quantized models, the benchmark results against the original bfloat16 models can be found here. For requirements on GPU memory and the respective throughput, see results here.

If you find our work helpful, feel free to cite us.
Qwen1.5-1.8B-Chat
Qwen2.5-72B
Qwen3-30B-A3B-FP8
Qwen3-VL-2B-Thinking
Meet Qwen3-VL — the most powerful vision-language model in the Qwen series to date. This generation delivers comprehensive upgrades across the board: superior text understanding & generation, deeper visual perception & reasoning, extended context length, enhanced spatial and video dynamics comprehension, and stronger agent interaction capabilities. Available in Dense and MoE architectures that scale from edge to cloud, with Instruct and reasoning‑enhanced Thinking editions for flexible, on‑demand deployment. Visual Agent: Operates PC/mobile GUIs—recognizes elements, understands functions, invokes tools, completes tasks. Visual Coding Boost: Generates Draw.io/HTML/CSS/JS from images/videos. Advanced Spatial Perception: Judges object positions, viewpoints, and occlusions; provides stronger 2D grounding and enables 3D grounding for spatial reasoning and embodied AI. Long Context & Video Understanding: Native 256K context, expandable to 1M; handles books and hours-long video with full recall and second-level indexing. Enhanced Multimodal Reasoning: Excels in STEM/Math—causal analysis and logical, evidence-based answers. Upgraded Visual Recognition: Broader, higher-quality pretraining is able to “recognize everything”—celebrities, anime, products, landmarks, flora/fauna, etc. Expanded OCR: Supports 32 languages (up from 19); robust in low light, blur, and tilt; better with rare/ancient characters and jargon; improved long-document structure parsing. Text Understanding on par with pure LLMs: Seamless text–vision fusion for lossless, unified comprehension. 1. Interleaved-MRoPE: Full‑frequency allocation over time, width, and height via robust positional embeddings, enhancing long‑horizon video reasoning. 2. DeepStack: Fuses multi‑level ViT features to capture fine‑grained details and sharpen image–text alignment. 3. Text–Timestamp Alignment: Moves beyond T‑RoPE to precise, timestamp‑grounded event localization for stronger video temporal modeling. 
This is the weight repository for Qwen3-VL-2B-Thinking.

The model card provides simple examples showing how to use Qwen3-VL with 🤖 ModelScope and 🤗 Transformers. The code of Qwen3-VL has been merged into the latest Hugging Face transformers, and we advise you to build from source. The chat model is then used through `transformers`.

If you find our work helpful, feel free to cite us.
Qwen2-72B-Instruct
Qwen2 is the new series of Qwen large language models. For Qwen2, we release a number of base language models and instruction-tuned language models ranging from 0.5 to 72 billion parameters, including a Mixture-of-Experts model. This repo contains the instruction-tuned 72B Qwen2 model.

Compared with state-of-the-art open-source language models, including the previously released Qwen1.5, Qwen2 has generally surpassed most open-source models and demonstrated competitiveness against proprietary models across a series of benchmarks targeting language understanding, language generation, multilingual capability, coding, mathematics, reasoning, etc.

Qwen2-72B-Instruct supports a context length of up to 131,072 tokens, enabling the processing of extensive inputs. Please refer to this section for detailed instructions on how to deploy Qwen2 for handling long texts.

For more details, please refer to our blog, GitHub, and Documentation.

Model Details

Qwen2 is a language model series including decoder language models of different model sizes. For each size, we release the base language model and the aligned chat model. It is based on the Transformer architecture with SwiGLU activation, attention QKV bias, group query attention, etc. Additionally, we have an improved tokenizer adapted to multiple natural languages and code.

Training details

We pretrained the models with a large amount of data, and we post-trained the models with both supervised finetuning and direct preference optimization.

Requirements

The code of Qwen2 has been merged into the latest Hugging Face transformers, and we advise you to install `transformers>=4.37.0`; otherwise you might encounter an error when loading the model.

The model card provides a code snippet using `apply_chat_template` to show how to load the tokenizer and model and how to generate content.

To handle extensive inputs exceeding 32,768 tokens, we utilize YaRN, a technique for improving model length extrapolation, to maintain performance on long texts.
For deployment, we recommend using vLLM. You can enable the long-context capabilities by following these steps:

1. Install vLLM.
2. Configure Model Settings: After downloading the model weights, modify the `config.json` file by adding a `rope_scaling` entry. This enables YaRN to support longer contexts.
3. Model Deployment: Use vLLM to deploy your model. For instance, you can set up an OpenAI-like server.

For further usage instructions of vLLM, please refer to our GitHub.

Note: Presently, vLLM only supports static YaRN, which means the scaling factor remains constant regardless of input length, potentially impacting performance on shorter texts. We advise adding the `rope_scaling` configuration only when processing long contexts is required.

We briefly compare Qwen2-72B-Instruct with similar-sized instruction-tuned LLMs, including our previous Qwen1.5-72B-Chat. The results are shown as follows:

| Datasets | Llama-3-70B-Instruct | Qwen1.5-72B-Chat | Qwen2-72B-Instruct |
| :--- | :---: | :---: | :---: |
| English | | | |
| MMLU | 82.0 | 75.6 | 82.3 |
| MMLU-Pro | 56.2 | 51.7 | 64.4 |
| GPQA | 41.9 | 39.4 | 42.4 |
| TheoremQA | 42.5 | 28.8 | 44.4 |
| MT-Bench | 8.95 | 8.61 | 9.12 |
| Arena-Hard | 41.1 | 36.1 | 48.1 |
| IFEval (Prompt Strict-Acc.) | 77.3 | 55.8 | 77.6 |
| Coding | | | |
| HumanEval | 81.7 | 71.3 | 86.0 |
| MBPP | 82.3 | 71.9 | 80.2 |
| MultiPL-E | 63.4 | 48.1 | 69.2 |
| EvalPlus | 75.2 | 66.9 | 79.0 |
| LiveCodeBench | 29.3 | 17.9 | 35.7 |
| Mathematics | | | |
| GSM8K | 93.0 | 82.7 | 91.1 |
| MATH | 50.4 | 42.5 | 59.7 |
| Chinese | | | |
| C-Eval | 61.6 | 76.1 | 83.8 |
| AlignBench | 7.42 | 7.28 | 8.27 |

If you find our work helpful, feel free to cite us.
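Once an OpenAI-like vLLM server is running as in the deployment step of this card, it can be queried over HTTP. The following is a minimal sketch using only the Python standard library. The base URL `http://localhost:8000/v1`, the served model name, and the helper names are assumptions for illustration; adjust them to match how the server was actually started.

```python
import json
import urllib.request


def build_chat_request(prompt,
                       model="Qwen/Qwen2-72B-Instruct",       # assumed served model name
                       base_url="http://localhost:8000/v1"):  # assumed server address
    """Build a POST request for an OpenAI-style chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt},
        ],
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt):
    """Send the request and return the assistant's reply (needs a running server)."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Keeping request construction separate from the network call makes it possible to inspect the payload without a server, which is handy when debugging long-context prompts.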
Qwen2.5-7B-Instruct-GPTQ-Int4
Qwen2.5 is the latest series of Qwen large language models. For Qwen2.5, we release a number of base language models and instruction-tuned language models ranging from 0.5 to 72 billion parameters. Qwen2.5 brings the following improvements upon Qwen2:

- Significantly more knowledge and greatly improved capabilities in coding and mathematics, thanks to our specialized expert models in these domains.
- Significant improvements in instruction following, generating long texts (over 8K tokens), understanding structured data (e.g., tables), and generating structured outputs, especially JSON. More resilient to the diversity of system prompts, enhancing role-play implementation and condition-setting for chatbots.
- Long-context support up to 128K tokens, with generation of up to 8K tokens.
- Multilingual support for over 29 languages, including Chinese, English, French, Spanish, Portuguese, German, Italian, Russian, Japanese, Korean, Vietnamese, Thai, Arabic, and more.

This repo contains the GPTQ-quantized 4-bit instruction-tuned 7B Qwen2.5 model, which has the following features:

- Type: Causal Language Models
- Training Stage: Pretraining & Post-training
- Architecture: transformers with RoPE, SwiGLU, RMSNorm, and Attention QKV bias
- Number of Parameters: 7.61B
- Number of Parameters (Non-Embedding): 6.53B
- Number of Layers: 28
- Number of Attention Heads (GQA): 28 for Q and 4 for KV
- Context Length: Full 131,072 tokens and generation 8192 tokens
  - Please refer to this section for detailed instructions on how to deploy Qwen2.5 for handling long texts.
- Quantization: GPTQ 4-bit

For more details, please refer to our blog, GitHub, and Documentation.

The code of Qwen2.5 has been merged into the latest Hugging Face `transformers`, and we advise you to use the latest version of `transformers`. With `transformers<4.37.0`, loading the model fails with an error.

Also check out our GPTQ documentation for a more detailed usage guide.
The model card provides a code snippet using `apply_chat_template` to show how to load the tokenizer and model and how to generate content.

The current `config.json` is set for a context length of up to 32,768 tokens. To handle inputs exceeding 32,768 tokens, we use YaRN, a technique for improving model length extrapolation, to maintain performance on long texts. For supported frameworks, you can add a `rope_scaling` entry to `config.json` to enable YaRN.

For deployment, we recommend using vLLM. Please refer to our Documentation for usage if you are not familiar with vLLM. Presently, vLLM only supports static YaRN, which means the scaling factor remains constant regardless of input length, potentially impacting performance on shorter texts. We advise adding the `rope_scaling` configuration only when processing long contexts is required.

Detailed evaluation results are reported in this 📑 blog. For quantized models, the benchmark results against the original bfloat16 models can be found here. For requirements on GPU memory and the respective throughput, see results here.

If you find our work helpful, feel free to cite us.
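The chat-template workflow this card refers to can be sketched as follows. This is an illustrative outline rather than the card's original snippet: it uses the standard `transformers` calls `AutoTokenizer.from_pretrained`, `AutoModelForCausalLM.from_pretrained`, `apply_chat_template` and `generate`, while the system prompt, the helper names and the default arguments are assumptions. Loading GPTQ weights may also require extra quantization packages that are not covered here.

```python
def build_messages(prompt, system="You are a helpful assistant."):
    """Return the chat history in the role/content format used by chat templates."""
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": prompt},
    ]


def generate(prompt, model_name="Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4", max_new_tokens=512):
    """Load tokenizer and model, apply the chat template, and decode the reply."""
    # Imported lazily so that build_messages can be used without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name, torch_dtype="auto", device_map="auto"
    )
    # Render the messages into one prompt string that ends with the assistant turn marker.
    text = tokenizer.apply_chat_template(
        build_messages(prompt), tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer([text], return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Drop the prompt tokens so that only the newly generated tokens are decoded.
    new_tokens = output_ids[0][inputs.input_ids.shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Calling `generate("Give me a short introduction to large language models.")` downloads the weights on first use, so it needs network access and a suitable GPU.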
Qwen2.5-14B-Instruct-GPTQ-Int4
Qwen2.5-Coder-3B-Instruct-GGUF
Qwen2.5-Coder-7B-Instruct-GGUF
Qwen2.5-Coder is the latest series of Code-Specific Qwen large language models (formerly known as CodeQwen). As of now, Qwen2.5-Coder covers six mainstream model sizes (0.5, 1.5, 3, 7, 14, and 32 billion parameters) to meet the needs of different developers. Qwen2.5-Coder brings the following improvements upon CodeQwen1.5:

- Significant improvements in code generation, code reasoning and code fixing. Building on the strong Qwen2.5, we scale up the training tokens to 5.5 trillion, including source code, text-code grounding, synthetic data, etc. Qwen2.5-Coder-32B has become the current state-of-the-art open-source code LLM, with its coding abilities matching those of GPT-4o.
- A more comprehensive foundation for real-world applications such as Code Agents, not only enhancing coding capabilities but also maintaining its strengths in mathematics and general competencies.
- Long-context support up to 128K tokens.

This repo contains the instruction-tuned 7B Qwen2.5-Coder model in the GGUF format, which has the following features:

- Type: Causal Language Models
- Training Stage: Pretraining & Post-training
- Architecture: transformers with RoPE, SwiGLU, RMSNorm, and Attention QKV bias
- Number of Parameters: 7.61B
- Number of Parameters (Non-Embedding): 6.53B
- Number of Layers: 28
- Number of Attention Heads (GQA): 28 for Q and 4 for KV
- Context Length: Full 32,768 tokens
  - Note: Currently, only vLLM supports YaRN for length extrapolation. If you want to process sequences up to 131,072 tokens, please refer to the non-GGUF models.
- Quantization: q2_K, q3_K_M, q4_0, q4_K_M, q5_0, q5_K_M, q6_K, q8_0

For more details, please refer to our blog, GitHub, Documentation, Arxiv.

Check out our llama.cpp documentation for a more detailed usage guide.

We advise you to clone `llama.cpp` and install it following the official guide. We follow the latest version of llama.cpp. In the following demonstration, we assume that you are running commands from within the `llama.cpp` repository.
Since cloning the entire repo may be inefficient, you can manually download the GGUF file that you need or use `huggingface-cli`:

1. Install `huggingface-cli`.
2. Download the GGUF file or files that you need. For large files, we split them into multiple segments due to the file upload size limit. They share a prefix, with a suffix indicating the index, for example `qwen2.5-coder-7b-instruct-q5_k_m-00001-of-00002.gguf` and `qwen2.5-coder-7b-instruct-q5_k_m-00002-of-00002.gguf`. You need to download all of them.
3. (Optional) Merge: Split files must first be merged with the `llama-gguf-split` command.

For a chatbot-like experience, we recommend starting in conversation mode.

Detailed evaluation results are reported in this 📑 blog. For requirements on GPU memory and the respective throughput, see results here.

If you find our work helpful, feel free to cite us.
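The naming scheme for split GGUF files described in this card (a shared prefix plus a zero-padded index suffix) can be enumerated with a short helper, which is useful for checking that every segment has been downloaded before merging. This is a sketch: `shard_names` is a hypothetical helper, and the five-digit padding is inferred from the two example file names rather than from a specification.

```python
def shard_names(prefix, total):
    """List the file names of all segments of a split GGUF file.

    Follows the pattern <prefix>-<index>-of-<total>.gguf with five-digit,
    1-based indices, as seen in the example file names.
    """
    return [f"{prefix}-{i:05d}-of-{total:05d}.gguf" for i in range(1, total + 1)]


for name in shard_names("qwen2.5-coder-7b-instruct-q5_k_m", 2):
    print(name)
```

Comparing this list against the files present in the download directory catches a missing segment early, before the merge step is attempted.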
Qwen2.5-Coder-7B
Qwen2.5-Coder is the latest series of Code-Specific Qwen large language models (formerly known as CodeQwen). As of now, Qwen2.5-Coder covers six mainstream model sizes (0.5, 1.5, 3, 7, 14, and 32 billion parameters) to meet the needs of different developers. Qwen2.5-Coder brings the following improvements upon CodeQwen1.5:

- Significant improvements in code generation, code reasoning and code fixing. Building on the strong Qwen2.5, we scale up the training tokens to 5.5 trillion, including source code, text-code grounding, synthetic data, etc. Qwen2.5-Coder-32B has become the current state-of-the-art open-source code LLM, with its coding abilities matching those of GPT-4o.
- A more comprehensive foundation for real-world applications such as Code Agents, not only enhancing coding capabilities but also maintaining its strengths in mathematics and general competencies.
- Long-context support up to 128K tokens.

This repo contains the 7B Qwen2.5-Coder model, which has the following features:

- Type: Causal Language Models
- Training Stage: Pretraining
- Architecture: transformers with RoPE, SwiGLU, RMSNorm, and Attention QKV bias
- Number of Parameters: 7.61B
- Number of Parameters (Non-Embedding): 6.53B
- Number of Layers: 28
- Number of Attention Heads (GQA): 28 for Q and 4 for KV
- Context Length: Full 131,072 tokens
  - Please refer to this section for detailed instructions on how to deploy Qwen2.5 for handling long texts.

We do not recommend using base language models for conversations. Instead, you can apply post-training (e.g., SFT, RLHF, or continued pretraining) or use this model for fill-in-the-middle tasks.

For more details, please refer to our blog, GitHub, Documentation, Arxiv.

The code of Qwen2.5-Coder has been merged into the latest Hugging Face `transformers`, and we advise you to use the latest version of `transformers`. With `transformers<4.37.0`, loading the model fails with an error.

The current `config.json` is set for a context length of up to 32,768 tokens.
To handle inputs exceeding 32,768 tokens, we use YaRN, a technique for improving model length extrapolation, to maintain performance on long texts. For supported frameworks, you can add a `rope_scaling` entry to `config.json` to enable YaRN.

For deployment, we recommend using vLLM. Please refer to our Documentation for usage if you are not familiar with vLLM. Presently, vLLM only supports static YaRN, which means the scaling factor remains constant regardless of input length, potentially impacting performance on shorter texts. We advise adding the `rope_scaling` configuration only when processing long contexts is required.

Detailed evaluation results are reported in this 📑 blog. For requirements on GPU memory and the respective throughput, see results here.

If you find our work helpful, feel free to cite us.
Qwen2.5-Coder-0.5B
Qwen2-7B-Instruct-AWQ
Qwen2 is the new series of Qwen large language models. For Qwen2, we release a number of base language models and instruction-tuned language models ranging from 0.5 to 72 billion parameters, including a Mixture-of-Experts model. This repo contains the instruction-tuned 7B Qwen2 model.

Compared with state-of-the-art open-source language models, including the previously released Qwen1.5, Qwen2 has generally surpassed most open-source models and demonstrated competitiveness against proprietary models across a series of benchmarks targeting language understanding, language generation, multilingual capability, coding, mathematics, reasoning, etc.

Qwen2-7B-Instruct-AWQ supports a context length of up to 131,072 tokens, enabling the processing of extensive inputs. Please refer to this section for detailed instructions on how to deploy Qwen2 for handling long texts.

For more details, please refer to our blog, GitHub, and Documentation.

Model Details

Qwen2 is a language model series including decoder language models of different model sizes. For each size, we release the base language model and the aligned chat model. It is based on the Transformer architecture with SwiGLU activation, attention QKV bias, group query attention, etc. Additionally, we have an improved tokenizer adapted to multiple natural languages and code.

Training details

We pretrained the models with a large amount of data, and we post-trained the models with both supervised finetuning and direct preference optimization.

Requirements

The code of Qwen2 has been merged into the latest Hugging Face transformers, and we advise you to install `transformers>=4.37.0`; otherwise you might encounter an error when loading the model.

The model card provides a code snippet using `apply_chat_template` to show how to load the tokenizer and model and how to generate content.

To handle extensive inputs exceeding 32,768 tokens, we utilize YaRN, a technique for improving model length extrapolation, to maintain performance on long texts.
For deployment, we recommend using vLLM. You can enable the long-context capabilities by following these steps:

1. Install vLLM.
2. Configure Model Settings: After downloading the model weights, modify the `config.json` file by adding a `rope_scaling` entry. This enables YaRN to support longer contexts.
3. Model Deployment: Use vLLM to deploy your model. For instance, you can set up an OpenAI-like server.

For further usage instructions of vLLM, please refer to our GitHub.

Note: Presently, vLLM only supports static YaRN, which means the scaling factor remains constant regardless of input length, potentially impacting performance on shorter texts. We advise adding the `rope_scaling` configuration only when processing long contexts is required.

To compare the generation performance between bfloat16 (bf16) and quantized models such as GPTQ-Int8, GPTQ-Int4, and AWQ, please consult our Benchmark of Quantized Models. This benchmark shows how different quantization techniques affect model performance. For those interested in the inference speed and memory consumption when deploying these models with either `transformers` or `vLLM`, we have compiled an extensive Speed Benchmark.

If you find our work helpful, feel free to cite us.
Qwen2.5-Omni-7B-AWQ
Overview

Introduction

Qwen2.5-Omni is an end-to-end multimodal model designed to perceive diverse modalities, including text, images, audio, and video, while simultaneously generating text and natural speech responses in a streaming manner.

- Omni and Novel Architecture: We propose the Thinker-Talker architecture, an end-to-end multimodal model designed to perceive diverse modalities, including text, images, audio, and video, while simultaneously generating text and natural speech responses in a streaming manner. We propose a novel position embedding, named TMRoPE (Time-aligned Multimodal RoPE), to synchronize the timestamps of video inputs with audio.
- Real-Time Voice and Video Chat: Architecture designed for fully real-time interactions, supporting chunked input and immediate output.
- Natural and Robust Speech Generation: Surpassing many existing streaming and non-streaming alternatives, demonstrating superior robustness and naturalness in speech generation.
- Strong Performance Across Modalities: Exhibiting exceptional performance across all modalities when benchmarked against similarly sized single-modality models. Qwen2.5-Omni outperforms the similarly sized Qwen2-Audio in audio capabilities and achieves comparable performance to Qwen2.5-VL-7B.
- Excellent End-to-End Speech Instruction Following: Qwen2.5-Omni shows performance in end-to-end speech instruction following that rivals its effectiveness with text inputs, evidenced by benchmarks such as MMLU and GSM8K.

This model card introduces a series of enhancements designed to improve the operability of Qwen2.5-Omni-7B on devices with constrained GPU memory. Key optimizations include:

- Implemented 4-bit quantization of the Thinker's weights using AWQ, effectively reducing GPU VRAM usage.
- Enhanced the inference pipeline to load model weights on demand for each module and offload them to CPU memory once inference is complete, preventing peak VRAM usage from becoming excessive.
- Converted the token2wav module to support streaming inference, thereby avoiding the pre-allocation of excessive GPU memory.
- Adjusted the ODE solver from a fourth-order (RK4) to a first-order (Euler) method to further decrease computational overhead.

These improvements aim to ensure efficient performance of Qwen2.5-Omni across a range of hardware configurations, particularly those with lower GPU memory availability (RTX3080, 4080, 5070, etc.).

The model card provides a simple example showing how to use Qwen2.5-Omni-7B-AWQ with `autoawq`.

We offer a toolkit to help you handle various types of audio and visual input more conveniently, as if you were using an API. This includes base64, URLs, and interleaved audio, images and videos. When installing it, make sure your system has `ffmpeg` installed. If you are not using Linux, you might not be able to install `decord` from PyPI. In that case, you can use `pip install qwen-omni-utils -U`, which will fall back to using torchvision for video processing. You can still install decord from source so that it is used when loading video.

The following two tables present a performance comparison and the GPU memory consumption of Qwen2.5-Omni-7B-AWQ and Qwen2.5-Omni-7B on specific evaluation benchmarks. The data show that the AWQ model maintains comparable performance while reducing GPU memory requirements by roughly half or more relative to BF16, enabling a broader range of devices to run the high-performance Qwen2.5-Omni-7B model. Notably, the AWQ variant exhibits slightly slower inference than the native Qwen2.5-Omni-7B model because of quantization and the CPU offload mechanism.
| Evaluation Set | Task | Metrics | Qwen2.5-Omni-7B | Qwen2.5-Omni-7B-AWQ |
|--------------|-----------|-------------|-------------|------------------|
| LibriSpeech test-other | ASR | WER ⬇️ | 3.4 | 3.91 |
| WenetSpeech test-net | ASR | WER ⬇️ | 5.9 | 6.31 |
| Seed-TTS test-hard | TTS (Speaker: Chelsie) | WER ⬇️ | 8.7 | 8.88 |
| MMLU-Pro | Text -> Text | Accuracy ⬆️ | 47.0 | 45.66 |
| OmniBench | Speech -> Text | Accuracy ⬆️ | 56.13 | 54.64 |
| VideoMME | Multimodality -> Text | Accuracy ⬆️ | 72.4 | 72.0 |

| Model | Precision | 15 s Video | 30 s Video | 60 s Video |
|--------------|-----------|-------------|-------------|------------------|
| Qwen-Omni-7B | FP32 | 93.56 GB | Not Recommended | Not Recommended |
| Qwen-Omni-7B | BF16 | 31.11 GB | 41.85 GB | 60.19 GB |
| Qwen-Omni-7B | AWQ | 11.77 GB | 17.84 GB | 30.31 GB |

If you find our paper and code useful in your research, please consider giving a star :star: and citation :pencil: :)
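The memory savings implied by the GPU memory table above can be checked with a few lines of arithmetic. The sketch below only re-uses the BF16 and AWQ figures from that table; the helper name `reduction_percent` is made up for illustration.

```python
# GPU memory figures (GB) copied from the memory table above.
memory_gb = {
    "15s": {"BF16": 31.11, "AWQ": 11.77},
    "30s": {"BF16": 41.85, "AWQ": 17.84},
    "60s": {"BF16": 60.19, "AWQ": 30.31},
}


def reduction_percent(baseline: float, quantized: float) -> float:
    """Relative memory saving of `quantized` versus `baseline`, in percent."""
    return (baseline - quantized) / baseline * 100.0


# Saving of AWQ relative to BF16 for each video length, rounded to 0.1%.
reductions = {
    length: round(reduction_percent(row["BF16"], row["AWQ"]), 1)
    for length, row in memory_gb.items()
}

for length, pct in reductions.items():
    print(f"{length} video: {pct}% less GPU memory than BF16")
```

The saving shrinks as the video gets longer: the quantized Thinker weights are a fixed saving, while the remaining memory grows with input length.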
Qwen2.5-7B-Instruct-1M
Qwen2.5-1M is the long-context version of the Qwen2.5 series models, supporting a context length of up to 1M tokens. Compared to the Qwen2.5 128K version, Qwen2.5-1M demonstrates significantly improved performance in handling long-context tasks while maintaining its capability in short tasks.

The model has the following features:

- Type: Causal Language Models
- Training Stage: Pretraining & Post-training
- Architecture: transformers with RoPE, SwiGLU, RMSNorm, and Attention QKV bias
- Number of Parameters: 7.61B
- Number of Parameters (Non-Embedding): 6.53B
- Number of Layers: 28
- Number of Attention Heads (GQA): 28 for Q and 4 for KV
- Context Length: Full 1,010,000 tokens and generation 8192 tokens
  - We recommend deploying with our custom vLLM, which introduces sparse attention and length extrapolation methods to ensure efficiency and accuracy for long-context tasks. For specific guidance, refer to this section.
  - You can also use the previous framework that supports Qwen2.5 for inference, but accuracy degradation may occur for sequences exceeding 262,144 tokens.

For more details, please refer to our blog, GitHub, Technical Report, and Documentation.

Requirements

The code of Qwen2.5 is included in the latest Hugging Face `transformers`, and we advise you to use the latest version of `transformers`.

For processing 1 million-token sequences:

- Qwen2.5-7B-Instruct-1M: At least 120GB VRAM (total across GPUs).
- Qwen2.5-14B-Instruct-1M: At least 320GB VRAM (total across GPUs).

If your GPUs do not have sufficient VRAM, you can still use Qwen2.5-1M for shorter tasks.

For now, you need to clone the vLLM repository from our custom branch and install it manually. We are working on getting our branch merged into the main vLLM project.

vLLM supports offline inference or launching an OpenAI-like server. Once the server is running, you can use curl or Python to interact with the deployed model.

- `--tensor-parallel-size`: Set to the number of GPUs you are using.
Max 4 GPUs for the 7B model, and 8 GPUs for the 14B model.
- `--max-model-len`: Defines the maximum input sequence length. Reduce this value if you encounter Out of Memory issues.
- `--max-num-batched-tokens`: Sets the chunk size in Chunked Prefill. A smaller value reduces activation memory usage but may slow down inference. We recommend 131072 for optimal performance.
- `--max-num-seqs`: Limits the number of sequences processed concurrently.

You can also refer to our Documentation for usage of vLLM.

1. Encountering the error: "The model's max sequence length (xxxxx) is larger than the maximum number of tokens that can be stored in the KV cache."

   The VRAM reserved for the KV cache is insufficient. Consider reducing `max_model_len` or increasing `tensor_parallel_size`. Alternatively, you can reduce `max_num_batched_tokens`, although this may significantly slow down inference.

2. Encountering the error: "torch.OutOfMemoryError: CUDA out of memory."

   The VRAM reserved for activation weights is insufficient. You can try setting `gpu_memory_utilization` to 0.85 or lower, but be aware that this might reduce the VRAM available for the KV cache.

3. Encountering the error: "Input prompt (xxxxx tokens) + lookahead slots (0) is too long and exceeds the capacity of the block manager."

   The input is too long. Consider using a shorter sequence or increasing `max_model_len`.

Detailed evaluation results are reported in this 📑 blog and our technical report.

If you find our work helpful, feel free to cite us.
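The KV-cache error in the first troubleshooting item above is easier to reason about with a rough size estimate. The sketch below uses the layer count (28) and KV-head count (4) listed in the features; the head dimension of 128 and the 2-byte (BF16) cache entries are assumptions not stated in this card, and the estimate ignores vLLM's block allocation, sparse attention, and any tensor-parallel sharding.

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, bytes_per_value: int = 2) -> int:
    """Bytes of KV cache needed for one token: a K and a V vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


# 28 layers and 4 KV heads come from the feature list; head_dim=128 is assumed.
per_token = kv_cache_bytes_per_token(num_layers=28, num_kv_heads=4, head_dim=128)


def kv_cache_gib(num_tokens: int) -> float:
    """Approximate KV-cache size in GiB for a single sequence of num_tokens."""
    return per_token * num_tokens / 2**30


for n in (131_072, 262_144, 1_010_000):
    print(f"{n:>9,} tokens -> {kv_cache_gib(n):6.1f} GiB of KV cache")
```

Under these assumptions the cache grows linearly with `max_model_len`, which is why lowering that value, or spreading the cache over more GPUs with a larger `tensor_parallel_size`, resolves the error.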
Qwen2.5-Coder-0.5B-Instruct
Qwen2.5-Coder is the latest series of code-specific Qwen large language models (formerly known as CodeQwen). As of now, Qwen2.5-Coder covers six mainstream model sizes (0.5, 1.5, 3, 7, 14, and 32 billion parameters) to meet the needs of different developers. Qwen2.5-Coder brings the following improvements upon CodeQwen1.5:

- Significant improvements in code generation, code reasoning, and code fixing. Building on the strong Qwen2.5, we scale up the training tokens to 5.5 trillion, including source code, text-code grounding, synthetic data, etc. Qwen2.5-Coder-32B has become the current state-of-the-art open-source code LLM, with coding abilities matching those of GPT-4o.
- A more comprehensive foundation for real-world applications such as Code Agents, not only enhancing coding capabilities but also maintaining strengths in mathematics and general competencies.

This repo contains the instruction-tuned 0.5B Qwen2.5-Coder model, which has the following features:

- Type: Causal Language Models
- Training Stage: Pretraining & Post-training
- Architecture: transformers with RoPE, SwiGLU, RMSNorm, Attention QKV bias, and tied word embeddings
- Number of Parameters: 0.49B
- Number of Parameters (Non-Embedding): 0.36B
- Number of Layers: 24
- Number of Attention Heads (GQA): 14 for Q and 2 for KV
- Context Length: Full 32,768 tokens

For more details, please refer to our blog, GitHub, Documentation, and arXiv.

The code of Qwen2.5-Coder is included in the latest Hugging Face `transformers`, and we advise you to use the latest version of `transformers`. With `transformers<4.37.0`, loading the model fails with an error.

The usage snippet relies on `apply_chat_template` to show how to load the tokenizer and model and how to generate content.

Detailed evaluation results are reported in this 📑 blog. For requirements on GPU memory and the respective throughput, see results here.

If you find our work helpful, feel free to cite us.
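To make the role of `apply_chat_template` more concrete, here is a minimal, dependency-free sketch of the ChatML-style prompt that Qwen chat models consume. The function `render_chatml` is hypothetical and deliberately simplified (no tool definitions, no default system prompt); in real code you would rely on `tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)` rather than building the string by hand.

```python
from typing import Dict, List


def render_chatml(messages: List[Dict[str, str]],
                  add_generation_prompt: bool = True) -> str:
    """Render chat messages in a simplified ChatML layout (illustrative only)."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    ]
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "Write a quick sort algorithm."},
]

prompt = render_chatml(messages)
print(prompt)
```

The resulting string is what gets tokenized and passed to `generate`; the trailing open assistant turn is what `add_generation_prompt=True` contributes.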
Qwen3-30B-A3B-Thinking-2507-FP8
Over the past three months, we have continued to scale the thinking capability of Qwen3-30B-A3B, improving both the quality and depth of reasoning. We are pleased to introduce Qwen3-30B-A3B-Thinking-2507, featuring the following key enhancements:

- Significantly improved performance on reasoning tasks, including logical reasoning, mathematics, science, coding, and academic benchmarks that typically require human expertise.
- Markedly better general capabilities, such as instruction following, tool usage, text generation, and alignment with human preferences.
- Enhanced 256K long-context understanding capabilities.

NOTE: This version has an increased thinking length. We strongly recommend its use in highly complex reasoning tasks.

This repo contains the FP8 version of Qwen3-30B-A3B-Thinking-2507, which has the following features:

- Type: Causal Language Models
- Training Stage: Pretraining & Post-training
- Number of Parameters: 30.5B in total and 3.3B activated
- Number of Parameters (Non-Embedding): 29.9B
- Number of Layers: 48
- Number of Attention Heads (GQA): 32 for Q and 4 for KV
- Number of Experts: 128
- Number of Activated Experts: 8
- Context Length: 262,144 natively.

NOTE: This model supports only thinking mode, so specifying `enable_thinking=True` is no longer required. Additionally, to enforce model thinking, the default chat template automatically includes `<think>`. Therefore, it is normal for the model's output to contain only `</think>` without an explicit opening `<think>` tag.

For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our blog, GitHub, and Documentation.
| | Gemini2.5-Flash-Thinking | Qwen3-235B-A22B Thinking | Qwen3-30B-A3B Thinking | Qwen3-30B-A3B-Thinking-2507 |
|--- | --- | --- | --- | --- |
| Knowledge | | | | |
| MMLU-Pro | 81.9 | 82.8 | 78.5 | 80.9 |
| MMLU-Redux | 92.1 | 92.7 | 89.5 | 91.4 |
| GPQA | 82.8 | 71.1 | 65.8 | 73.4 |
| SuperGPQA | 57.8 | 60.7 | 51.8 | 56.8 |
| Reasoning | | | | |
| AIME25 | 72.0 | 81.5 | 70.9 | 85.0 |
| HMMT25 | 64.2 | 62.5 | 49.8 | 71.4 |
| LiveBench 20241125 | 74.3 | 77.1 | 74.3 | 76.8 |
| Coding | | | | |
| LiveCodeBench v6 (25.02-25.05) | 61.2 | 55.7 | 57.4 | 66.0 |
| CFEval | 1995 | 2056 | 1940 | 2044 |
| OJBench | 23.5 | 25.6 | 20.7 | 25.1 |
| Alignment | | | | |
| IFEval | 89.8 | 83.4 | 86.5 | 88.9 |
| Arena-Hard v2$ | 56.7 | 61.5 | 36.3 | 56.0 |
| Creative Writing v3 | 85.0 | 84.6 | 79.1 | 84.4 |
| WritingBench | 83.9 | 80.3 | 77.0 | 85.0 |
| Agent | | | | |
| BFCL-v3 | 68.6 | 70.8 | 69.1 | 72.4 |
| TAU1-Retail | 65.2 | 54.8 | 61.7 | 67.8 |
| TAU1-Airline | 54.0 | 26.0 | 32.0 | 48.0 |
| TAU2-Retail | 66.7 | 40.4 | 34.2 | 58.8 |
| TAU2-Airline | 52.0 | 30.0 | 36.0 | 58.0 |
| TAU2-Telecom | 31.6 | 21.9 | 22.8 | 26.3 |
| Multilingualism | | | | |
| MultiIF | 74.4 | 71.9 | 72.2 | 76.4 |
| MMLU-ProX | 80.2 | 80.0 | 73.1 | 76.4 |
| INCLUDE | 83.9 | 78.7 | 71.9 | 74.4 |
| PolyMATH | 49.8 | 54.7 | 46.1 | 52.6 |

$ For reproducibility, we report the win rates evaluated by GPT-4.1.

& For highly challenging tasks (including PolyMATH and all reasoning and coding tasks), we use an output length of 81,920 tokens. For all other tasks, we set the output length to 32,768.

The code of Qwen3-MoE is included in the latest Hugging Face `transformers`, and we advise you to use the latest version of `transformers`.
With an outdated `transformers` release, loading the model fails with an error.

The closing part of the usage snippet separates the thinking content from the final answer by locating the last occurrence of token id 151668 in the generated ids:

```python
# `output_ids` holds the newly generated token ids from the preceding generation step.
try:
    # Position just past the last occurrence of token id 151668.
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content)  # no opening <think> tag
print("content:", content)
```

SGLang:

```shell
python -m sglang.launch_server --model-path Qwen/Qwen3-30B-A3B-Thinking-2507-FP8 --context-length 262144 --reasoning-parser deepseek-r1
```

vLLM:

```shell
vllm serve Qwen/Qwen3-30B-A3B-Thinking-2507-FP8 --max-model-len 262144 --enable-reasoning --reasoning-parser deepseek_r1
```

```python
from qwen_agent.agents import Assistant

# Define LLM
# Using Alibaba Cloud Model Studio
llm_cfg = {
    'model': 'qwen3-30b-a3b-thinking-2507-FP8',
    'model_type': 'qwen_dashscope',
}

# Using OpenAI-compatible API endpoint. It is recommended to disable the reasoning and the tool call parsing
# functionality of the deployment frameworks and let Qwen-Agent automate the related operations. For example,
# `VLLM_USE_MODELSCOPE=true vllm serve Qwen/Qwen3-30B-A3B-Thinking-2507-FP8 --served-model-name Qwen3-30B-A3B-Thinking-2507-FP8 --tensor-parallel-size 8 --max-model-len 262144`.
#
# llm_cfg = {
#     'model': 'Qwen3-30B-A3B-Thinking-2507-FP8',
#
#     # Use a custom endpoint compatible with OpenAI API:
#     'model_server': 'http://localhost:8000/v1',  # api_base without reasoning and tool call parsing
#     'api_key': 'EMPTY',
#     'generate_cfg': {
#         'thought_in_content': True,
#     },
# }

# Define Tools
tools = [
    {'mcpServers': {  # You can specify the MCP configuration file
            'time': {
                'command': 'uvx',
                'args': ['mcp-server-time', '--local-timezone=Asia/Shanghai']
            },
            "fetch": {
                "command": "uvx",
                "args": ["mcp-server-fetch"]
            }
        }
    },
    'code_interpreter',  # Built-in tools
]

# Define Agent
bot = Assistant(llm=llm_cfg, function_list=tools)

# Streaming generation
messages = [{'role': 'user', 'content': 'https://qwenlm.github.io/blog/ Introduce the latest developments of Qwen'}]
for responses in bot.run(messages=messages):
    pass
print(responses)
```

```
@misc{qwen3technicalreport,
      title={Qwen3 Technical Report},
      author={Qwen Team},
      year={2025},
      eprint={2505.09388},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.09388},
}
```
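The index arithmetic used to separate thinking content from the final answer (a reverse search for token id 151668) can be tried on plain lists without loading a model. The sketch below wraps the same expression in a hypothetical helper, `split_thinking`, and feeds it made-up token ids.

```python
from typing import List, Tuple

THINK_END_ID = 151668  # token id searched for in the usage snippet


def split_thinking(output_ids: List[int],
                   end_id: int = THINK_END_ID) -> Tuple[List[int], List[int]]:
    """Split generated ids into (thinking part, answer part)."""
    try:
        # Searching the reversed list finds the LAST occurrence of end_id;
        # converting back gives the position just past that token.
        index = len(output_ids) - output_ids[::-1].index(end_id)
    except ValueError:
        # No end-of-thinking token: treat the whole output as the answer.
        index = 0
    return output_ids[:index], output_ids[index:]


# Made-up ids: two "thinking" tokens, the end marker, two "answer" tokens.
thinking, answer = split_thinking([11, 12, THINK_END_ID, 21, 22])
print(thinking)  # [11, 12, 151668]
print(answer)    # [21, 22]

# Without the marker, everything is returned as the answer.
no_thinking, answer_only = split_thinking([21, 22])
print(no_thinking, answer_only)  # [] [21, 22]
```

Searching from the end matters when the marker could appear more than once: only the text after the final marker is treated as the answer.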
Qwen-VL
Qwen-VL (Qwen Large Vision Language Model) is the visual multimodal version of the large model series Qwen (abbr. Tongyi Qianwen), proposed by Alibaba Cloud. Qwen-VL accepts images, text, and bounding boxes as inputs, and outputs text and bounding boxes. The Qwen-VL series delivers strong performance, supports multilingual and multi-image interleaved conversation, and supports Chinese open-domain grounding as well as fine-grained image recognition and understanding.

We release Qwen-VL and Qwen-VL-Chat, which are the pretrained model and the chat model respectively. For more details about Qwen-VL, please refer to our technical memo. This repo is the one for Qwen-VL.

- python 3.8 and above
- pytorch 1.12 and above; 2.0 and above are recommended
- CUDA 11.4 and above are recommended (this is for GPU users)

Below, we provide simple examples to show how to use Qwen-VL with 🤗 Transformers.

Before running the code, make sure you have set up the environment and installed the required packages. Make sure you meet the above requirements, and then install the dependent libraries.

Now you can start with Transformers. For more usage of the vision encoder, please refer to the tutorial.

To use Qwen-VL for inference, all you need to do is enter a few lines of code. However, please make sure that you are using the latest code.

We evaluated the model's ability from two perspectives:

1. Standard Benchmarks: We evaluate the model's basic task capabilities on four major categories of multimodal tasks:
   - Zero-shot Caption: Evaluate the model's zero-shot image captioning ability on unseen datasets;
   - General VQA: Evaluate general question answering about pictures, such as judgment, color, number, and category questions;
   - Text-based VQA: Evaluate the model's ability to recognize and answer questions about text in pictures, such as document QA, chart QA, and text QA;
   - Referring Expression Comprehension: Evaluate the ability to localize a target object in an image described by a referring expression.
2. TouchStone: To evaluate the overall text-image dialogue capability and alignment level with humans, we have constructed a benchmark called TouchStone, which uses GPT4 scoring to evaluate LVLM models. In TouchStone-v0.1:
   - The benchmark covers a total of 300+ images, 800+ questions, and 27 categories, such as attribute-based Q&A, celebrity and landmark recognition, film and TV Q&A, visual reasoning, counterfactual reasoning, writing poetry, story writing, summarizing multiple images, product comparison, and math problem solving;
   - To work around GPT4's current inability to read images directly, TouchStone provides fine-grained, human-labeled annotations for every image. These detailed annotations, along with the questions and the model's output, are then presented to GPT4 for scoring.
   - The benchmark includes both English and Chinese versions.

Qwen-VL outperforms current SOTA generalist models on multiple VL tasks and has more comprehensive coverage in terms of capability range.
Zero-shot Captioning & General VQA

| Model type | Model | NoCaps | Flickr30K | VQAv2 dev | OK-VQA | GQA | SciQA-Img (0-shot) | VizWiz (0-shot) |
|---|---|---|---|---|---|---|---|---|
| Generalist Models | Flamingo-9B | - | 61.5 | 51.8 | 44.7 | - | - | 28.8 |
| Generalist Models | BLIP-2 (Vicuna-13B) | 103.9 | 71.6 | 65.0 | 45.9 | 32.3 | 61.0 | 19.6 |
| Generalist Models | InstructBLIP (Vicuna-13B) | 121.9 | 82.8 | - | - | 49.5 | 63.1 | 33.4 |
| Generalist Models | Qwen-VL (Qwen-7B) | 121.4 | 85.8 | 78.8 | 58.6 | 59.3 | 67.1 | 35.2 |
| Previous SOTA (Per Task Fine-tuning) | - | 127.0 (PALI-17B) | 84.5 (InstructBLIP-FlanT5-XL) | 86.1 (PALI-X-55B) | 66.1 (PALI-X-55B) | 72.1 (CFR) | 92.53 (LLaVa+GPT-4) | 70.9 (PALI-X-55B) |

- For zero-shot image captioning, Qwen-VL achieves the SOTA on Flickr30K and results on NoCaps that are competitive with InstructBLIP.
- For general VQA, Qwen-VL achieves the SOTA among generalist LVLMs of the same scale and settings.

| Model type | Model | TextVQA | DocVQA | ChartQA | AI2D | OCR-VQA |
|---|---|---|---|---|---|---|
| Specialist SOTAs (Specialist/Finetuned) | PALI-X-55B (Single-task FT) (Without OCR Pipeline) | 71.44 | 80.0 | 70.0 | 81.2 | 75.0 |

- In text-related recognition/QA evaluation, Qwen-VL achieves the SOTA among generalist LVLMs of the same scale and settings.
- Resolution is important for several of the above evaluations. While most open-source LVLM models with 224 resolution are incapable of these evaluations or can only solve them by cutting images, Qwen-VL scales the resolution to 448 so that it can be evaluated end-to-end. Qwen-VL even outperforms Pix2Struct-Large models of 1024 resolution on some tasks.
| Model type | Model | RefCOCO val | RefCOCO test-A | RefCOCO test-B | RefCOCO+ val | RefCOCO+ test-A | RefCOCO+ test-B | RefCOCOg val-u | RefCOCOg test-u | GRIT refexp |
|---|---|---|---|---|---|---|---|---|---|---|
| Generalist Models | OFA-L | 79.96 | 83.67 | 76.39 | 68.29 | 76.00 | 61.75 | 67.57 | 67.58 | 61.70 |
| Generalist Models | Shikra-7B | 87.01 | 90.61 | 80.24 | 81.60 | 87.36 | 72.12 | 82.27 | 82.19 | 69.34 |
| Generalist Models | Shikra-13B | 87.83 | 91.11 | 81.81 | 82.89 | 87.79 | 74.41 | 82.64 | 83.16 | 69.03 |
| Generalist Models | Qwen-VL-7B | 89.36 | 92.26 | 85.34 | 83.12 | 88.25 | 77.21 | 85.58 | 85.48 | 78.22 |
| Generalist Models | Qwen-VL-7B-Chat | 88.55 | 92.27 | 84.51 | 82.82 | 88.59 | 76.79 | 85.96 | 86.32 | - |
| Specialist SOTAs (Specialist/Finetuned) | G-DINO-L | 90.56 | 93.19 | 88.24 | 82.75 | 88.95 | 75.92 | 86.13 | 87.02 | - |
| Specialist SOTAs (Specialist/Finetuned) | UNINEXT-H | 92.64 | 94.33 | 91.46 | 85.24 | 89.63 | 79.79 | 88.73 | 89.37 | - |
| Specialist SOTAs (Specialist/Finetuned) | ONE-PEACE | 92.58 | 94.18 | 89.26 | 88.77 | 92.21 | 83.23 | 89.22 | 89.27 | - |

- Among generalist LVLMs, Qwen-VL achieves the SOTA on all of the above referring expression comprehension benchmarks, surpassing Shikra-13B across the board.
- Qwen-VL has not been trained on any Chinese grounding data, but it can still generalize to Chinese grounding tasks in a zero-shot way through training on Chinese caption data and English grounding data.

We provide all of the above evaluation scripts for reproducing our experimental results. Please read eval/EVALUATION.md for more information.

TouchStone is a benchmark that uses GPT4 scoring to evaluate the abilities of LVLM models in text-image dialogue and their alignment with humans. It covers a total of 300+ images, 800+ questions, and 27 categories, such as attribute-based Q&A, celebrity recognition, writing poetry, summarizing multiple images, product comparison, math problem solving, etc. Please read touchstone/README_CN.md for more information.
English evaluation:

| Model | Score |
|---------------|-------|
| PandaGPT | 488.5 |
| MiniGPT4 | 531.7 |
| InstructBLIP | 552.4 |
| LLaMA-AdapterV2 | 590.1 |
| mPLUG-Owl | 605.4 |
| LLaVA | 602.7 |
| Qwen-VL-Chat | 645.2 |

Chinese evaluation:

| Model | Score |
|---------------|-------|
| VisualGLM | 247.1 |
| Qwen-VL-Chat | 401.2 |

Qwen-VL-Chat has achieved the best results in both Chinese and English alignment evaluation.

If you run into problems, please check the FAQ and the existing issues for a solution before opening a new issue.

Researchers and developers are free to use the code and model weights of both Qwen-VL and Qwen-VL-Chat. We also allow their commercial use; for commercial use, please fill out the questionnaire to apply. Check our license at LICENSE for more details.

If you find our paper and code useful in your research, please consider giving a star :star: and citation :pencil: :)

If you would like to leave a message for either our research team or product team, feel free to send an email to [email protected].
Qwen1.5-14B-Chat
Qwen2.5-Coder-14B-Instruct-AWQ
Qwen2.5-Coder is the latest series of code-specific Qwen large language models (formerly known as CodeQwen). As of now, Qwen2.5-Coder covers six mainstream model sizes (0.5, 1.5, 3, 7, 14, and 32 billion parameters) to meet the needs of different developers. Qwen2.5-Coder brings the following improvements upon CodeQwen1.5:

- Significant improvements in code generation, code reasoning, and code fixing. Building on the strong Qwen2.5, we scale up the training tokens to 5.5 trillion, including source code, text-code grounding, synthetic data, etc. Qwen2.5-Coder-32B has become the current state-of-the-art open-source code LLM, with coding abilities matching those of GPT-4o.
- A more comprehensive foundation for real-world applications such as Code Agents, not only enhancing coding capabilities but also maintaining strengths in mathematics and general competencies.
- Long-context support up to 128K tokens.

This repo contains the AWQ-quantized 4-bit instruction-tuned 14B Qwen2.5-Coder model, which has the following features:

- Type: Causal Language Models
- Training Stage: Pretraining & Post-training
- Architecture: transformers with RoPE, SwiGLU, RMSNorm, and Attention QKV bias
- Number of Parameters: 14.7B
- Number of Parameters (Non-Embedding): 13.1B
- Number of Layers: 48
- Number of Attention Heads (GQA): 40 for Q and 8 for KV
- Context Length: Full 131,072 tokens
  - Please refer to this section for detailed instructions on how to deploy Qwen2.5 for handling long texts.
- Quantization: AWQ 4-bit

For more details, please refer to our blog, GitHub, Documentation, and arXiv.

The code of Qwen2.5-Coder is included in the latest Hugging Face `transformers`, and we advise you to use the latest version of `transformers`. With `transformers<4.37.0`, loading the model fails with an error.

Also check out our AWQ documentation for more usage guidance.

The usage snippet relies on `apply_chat_template` to show how to load the tokenizer and model and how to generate content.
The current `config.json` is set for a context length of up to 32,768 tokens. To handle inputs exceeding 32,768 tokens, we use YaRN, a technique for enhancing model length extrapolation, to maintain performance on lengthy texts.

For supported frameworks, you can enable YaRN by adding a `rope_scaling` entry to `config.json`.

For deployment, we recommend using vLLM. Please refer to our Documentation for usage if you are not familiar with vLLM. Presently, vLLM only supports static YaRN, which means the scaling factor remains constant regardless of input length, potentially impacting performance on shorter texts. We advise adding the `rope_scaling` configuration only when processing long contexts is required.

Detailed evaluation results are reported in this 📑 blog. For requirements on GPU memory and the respective throughput, see results here.

If you find our work helpful, feel free to cite us.
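As a rough illustration of the YaRN setting discussed above, the sketch below derives a static scaling factor from the two context lengths given in this card (32,768 native, 131,072 full). The key names `type`, `factor`, and `original_max_position_embeddings` are assumptions based on common `rope_scaling` layouts and are not quoted from this card; check your framework's documentation before copying them into `config.json`.

```python
import json

NATIVE_CONTEXT = 32_768    # context length the shipped config.json is set for
TARGET_CONTEXT = 131_072   # full context length listed in the features


def yarn_rope_scaling(native: int, target: int) -> dict:
    """Build a static YaRN-style rope_scaling entry (key names are assumed)."""
    return {
        "type": "yarn",
        # Static factor: how many times longer the target context is.
        "factor": target / native,
        "original_max_position_embeddings": native,
    }


rope_scaling = yarn_rope_scaling(NATIVE_CONTEXT, TARGET_CONTEXT)
print(json.dumps({"rope_scaling": rope_scaling}, indent=2))
```

Because the factor is static, it applies to every request regardless of length, which is why the card advises enabling it only when long contexts are actually needed.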
Qwen2.5-Math-7B-Instruct
Qwen1.5-4B-Chat
Qwen1.5-1.8B
Qwen3-235B-A22B-Thinking-2507-FP8
Over the past three months, we have continued to scale the thinking capability of Qwen3-235B-A22B, improving both the quality and depth of reasoning. We are pleased to introduce Qwen3-235B-A22B-Thinking-2507-FP8, featuring the following key enhancements:

- Significantly improved performance on reasoning tasks, including logical reasoning, mathematics, science, coding, and academic benchmarks that typically require human expertise, achieving state-of-the-art results among open-source thinking models.
- Markedly better general capabilities, such as instruction following, tool usage, text generation, and alignment with human preferences.
- Enhanced 256K long-context understanding capabilities.

NOTE: This version has an increased thinking length. We strongly recommend its use in highly complex reasoning tasks.

This repo contains the FP8 version of Qwen3-235B-A22B-Thinking-2507, which has the following features:

- Type: Causal Language Models
- Training Stage: Pretraining & Post-training
- Number of Parameters: 235B in total and 22B activated
- Number of Parameters (Non-Embedding): 234B
- Number of Layers: 94
- Number of Attention Heads (GQA): 64 for Q and 4 for KV
- Number of Experts: 128
- Number of Activated Experts: 8
- Context Length: 262,144 natively.

NOTE: This model supports only thinking mode, so specifying `enable_thinking=True` is no longer required. Additionally, to enforce model thinking, the default chat template automatically includes `<think>`. Therefore, it is normal for the model's output to contain only `</think>` without an explicit opening `<think>` tag.

For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our blog, GitHub, and Documentation.
| | Deepseek-R1-0528 | OpenAI O4-mini | OpenAI O3 | Gemini-2.5 Pro | Claude4 Opus Thinking | Qwen3-235B-A22B Thinking | Qwen3-235B-A22B-Thinking-2507 |
|--- | --- | --- | --- | --- | --- | --- | --- |
| Knowledge | | | | | | | |
| MMLU-Pro | 85.0 | 81.9 | 85.9 | 85.6 | - | 82.8 | 84.4 |
| MMLU-Redux | 93.4 | 92.8 | 94.9 | 94.4 | 94.6 | 92.7 | 93.8 |
| GPQA | 81.0 | 81.4 | 83.3 | 86.4 | 79.6 | 71.1 | 81.1 |
| SuperGPQA | 61.7 | 56.4 | - | 62.3 | - | 60.7 | 64.9 |
| Reasoning | | | | | | | |
| AIME25 | 87.5 | 92.7 | 88.9 | 88.0 | 75.5 | 81.5 | 92.3 |
| HMMT25 | 79.4 | 66.7 | 77.5 | 82.5 | 58.3 | 62.5 | 83.9 |
| LiveBench 20241125 | 74.7 | 75.8 | 78.3 | 82.4 | 78.2 | 77.1 | 78.4 |
| HLE | 17.7# | 18.1 | 20.3 | 21.6 | 10.7 | 11.8# | 18.2# |
| Coding | | | | | | | |
| LiveCodeBench v6 (25.02-25.05) | 68.7 | 71.8 | 58.6 | 72.5 | 48.9 | 55.7 | 74.1 |
| CFEval | 2099 | 1929 | 2043 | 2001 | - | 2056 | 2134 |
| OJBench | 33.6 | 33.3 | 25.4 | 38.9 | - | 25.6 | 32.5 |
| Alignment | | | | | | | |
| IFEval | 79.1 | 92.4 | 92.1 | 90.8 | 89.7 | 83.4 | 87.8 |
| Arena-Hard v2$ | 72.2 | 59.3 | 80.8 | 72.5 | 59.1 | 61.5 | 79.7 |
| Creative Writing v3 | 86.3 | 78.8 | 87.7 | 85.9 | 83.8 | 84.6 | 86.1 |
| WritingBench | 83.2 | 78.4 | 85.3 | 83.1 | 79.1 | 80.3 | 88.3 |
| Agent | | | | | | | |
| BFCL-v3 | 63.8 | 67.2 | 72.4 | 67.2 | 61.8 | 70.8 | 71.9 |
| TAU1-Retail | 63.9 | 71.8 | 73.9 | 74.8 | - | 54.8 | 67.8 |
| TAU1-Airline | 53.5 | 49.2 | 52.0 | 52.0 | - | 26.0 | 46.0 |
| TAU2-Retail | 64.9 | 71.0 | 76.3 | 71.3 | - | 40.4 | 71.9 |
| TAU2-Airline | 60.0 | 59.0 | 70.0 | 60.0 | - | 30.0 | 58.0 |
| TAU2-Telecom | 33.3 | 42.0 | 60.5 | 37.4 | - | 21.9 | 45.6 |
| Multilingualism | | | | | | | |
| MultiIF | 63.5 | 78.0 | 80.3 | 77.8 | - | 71.9 | 80.6 |
| MMLU-ProX | 80.6 | 79.0 | 83.3 | 84.7 | - | 80.0 | 81.0 |
| INCLUDE | 79.4 | 80.8 | 86.6 | 85.1 | - | 78.7 | 81.0 |
| PolyMATH | 46.9 | 48.7 | 49.7 | 52.2 | - | 54.7 | 60.1 |

\* For OpenAI O4-mini and O3, we use a medium reasoning effort,
except for scores marked with \*, which are generated using high reasoning effort.

\# According to the official evaluation criteria of HLE, scores marked with \# refer to models that are not multi-modal and were evaluated only on the text-only subset.

$ For reproducibility, we report the win rates evaluated by GPT-4.1.

& For highly challenging tasks (including PolyMATH and all reasoning and coding tasks), we use an output length of 81,920 tokens. For all other tasks, we set the output length to 32,768.

The code of Qwen3-MoE is included in the latest Hugging Face `transformers`, and we advise you to use the latest version of `transformers`. With an outdated `transformers` release, loading the model fails with an error.

The closing part of the usage snippet separates the thinking content from the final answer by locating the last occurrence of token id 151668 in the generated ids:

```python
# `output_ids` holds the newly generated token ids from the preceding generation step.
try:
    # Position just past the last occurrence of token id 151668.
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content)  # no opening <think> tag
print("content:", content)
```

SGLang:

```shell
python -m sglang.launch_server --model-path Qwen/Qwen3-235B-A22B-Thinking-2507-FP8 --tp 4 --context-length 262144 --reasoning-parser deepseek-r1
```

vLLM:

```shell
vllm serve Qwen/Qwen3-235B-A22B-Thinking-2507-FP8 --tensor-parallel-size 4 --max-model-len 262144 --enable-reasoning --reasoning-parser deepseek_r1
```

```python
from qwen_agent.agents import Assistant

# Define LLM
# Using Alibaba Cloud Model Studio
llm_cfg = {
    'model': 'qwen3-235b-a22b-thinking-2507',
    'model_type': 'qwen_dashscope',
}

# Using OpenAI-compatible API endpoint. It is recommended to disable the reasoning and the tool call parsing
# functionality of the deployment frameworks and let Qwen-Agent automate the related operations. For example,
# `VLLM_USE_MODELSCOPE=true vllm serve Qwen/Qwen3-235B-A22B-Thinking-2507-FP8 --served-model-name Qwen3-235B-A22B-Thinking-2507 --tensor-parallel-size 4 --max-model-len 262144`.
#
# llm_cfg = {
#     'model': 'Qwen3-235B-A22B-Thinking-2507',
#
#     # Use a custom endpoint compatible with OpenAI API:
#     'model_server': 'http://localhost:8000/v1',  # api_base without reasoning and tool call parsing
#     'api_key': 'EMPTY',
#     'generate_cfg': {
#         'thought_in_content': True,
#     },
# }

# Define Tools
tools = [
    {'mcpServers': {  # You can specify the MCP configuration file
            'time': {
                'command': 'uvx',
                'args': ['mcp-server-time', '--local-timezone=Asia/Shanghai']
            },
            "fetch": {
                "command": "uvx",
                "args": ["mcp-server-fetch"]
            }
        }
    },
    'code_interpreter',  # Built-in tools
]

# Define Agent
bot = Assistant(llm=llm_cfg, function_list=tools)

# Streaming generation
messages = [{'role': 'user', 'content': 'https://qwenlm.github.io/blog/ Introduce the latest developments of Qwen'}]
for responses in bot.run(messages=messages):
    pass
print(responses)
```

```
@misc{qwen3technicalreport,
      title={Qwen3 Technical Report},
      author={Qwen Team},
      year={2025},
      eprint={2505.09388},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.09388},
}
```
Qwen3-235B-A22B-FP8
Qwen3-Embedding-0.6B-GGUF
The Qwen3 Embedding model series is the latest proprietary model of the Qwen family, specifically designed for text embedding and ranking tasks. Building upon the dense foundational models of the Qwen3 series, it provides a comprehensive range of text embedding and reranking models in various sizes (0.6B, 4B, and 8B). This series inherits the exceptional multilingual capabilities, long-text understanding, and reasoning skills of its foundational model. The Qwen3 Embedding series represents significant advancements in multiple text embedding and ranking tasks, including text retrieval, code retrieval, text classification, text clustering, and bitext mining.

Exceptional Versatility: The embedding model has achieved state-of-the-art performance across a wide range of downstream application evaluations. The 8B embedding model ranks No. 1 on the MTEB multilingual leaderboard (as of June 5, 2025, score 70.58), while the reranking model excels in various text retrieval scenarios.

Comprehensive Flexibility: The Qwen3 Embedding series offers a full spectrum of sizes (from 0.6B to 8B) for both embedding and reranking models, catering to diverse use cases that prioritize efficiency and effectiveness. Developers can seamlessly combine these two modules. Additionally, the embedding model allows for flexible vector definitions across all dimensions, and both embedding and reranking models support user-defined instructions to enhance performance for specific tasks, languages, or scenarios.

Multilingual Capability: The Qwen3 Embedding series offers support for over 100 languages, thanks to the multilingual capabilities of Qwen3 models. This includes various programming languages, and provides robust multilingual, cross-lingual, and code retrieval capabilities.
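The "flexible vector definitions" mentioned above are commonly used by keeping only the leading components of an embedding and re-normalizing the result. The sketch below shows that idea with plain Python lists; `truncate_embedding` is a hypothetical helper, the 8-dimensional vector is a toy stand-in for a real embedding, and whether a given runtime applies this step for you is not stated here.

```python
import math
from typing import List


def truncate_embedding(vec: List[float], dim: int) -> List[float]:
    """Keep the first `dim` components and rescale them to unit L2 norm."""
    if not 1 <= dim <= len(vec):
        raise ValueError(f"dim must be between 1 and {len(vec)}")
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # nothing to rescale
    return [x / norm for x in head]


# Toy 8-dimensional vector standing in for a full-size embedding.
full = [0.5, -0.5, 0.5, -0.5, 0.1, 0.2, 0.3, 0.4]
short = truncate_embedding(full, 4)
print(short)  # [0.5, -0.5, 0.5, -0.5]
```

Re-normalizing after truncation keeps cosine similarity and dot product interchangeable for the shortened vectors, which most vector stores assume.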
Qwen3-Embedding-0.6B-GGUF has the following features:

- Model Type: Text Embedding
- Supported Languages: 100+ Languages
- Number of Parameters: 0.6B
- Context Length: 32k
- Embedding Dimension: Up to 1024, supports user-defined output dimensions ranging from 32 to 1024
- Quantization: q8_0, f16

For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our blog, GitHub.

| Model Type | Models | Size | Layers | Sequence Length | Embedding Dimension | MRL Support | Instruction Aware |
|------------------|----------------------|------|--------|-----------------|---------------------|-------------|----------------|
| Text Embedding | Qwen3-Embedding-0.6B | 0.6B | 28 | 32K | 1024 | Yes | Yes |
| Text Embedding | Qwen3-Embedding-4B | 4B | 36 | 32K | 2560 | Yes | Yes |
| Text Embedding | Qwen3-Embedding-8B | 8B | 36 | 32K | 4096 | Yes | Yes |
| Text Reranking | Qwen3-Reranker-0.6B | 0.6B | 28 | 32K | - | - | Yes |
| Text Reranking | Qwen3-Reranker-4B | 4B | 36 | 32K | - | - | Yes |
| Text Reranking | Qwen3-Reranker-8B | 8B | 36 | 32K | - | - | Yes |

> Note:
> - `MRL Support` indicates whether the embedding model supports custom dimensions for the final embedding.
> - `Instruction Aware` notes whether the embedding or reranking model supports customizing the input instruction according to different tasks.
> - Our evaluation indicates that, for most downstream tasks, using instructions (instruct) typically yields an improvement of 1% to 5% compared to not using them. Therefore, we recommend that developers create tailored instructions specific to their tasks and scenarios. In multilingual contexts, we also advise users to write their instructions in English, as most instructions utilized during the model training process were originally written in English.

📌 Tip: We recommend that developers customize the `instruct` according to their specific scenarios, tasks, and languages. Our tests have shown that in most retrieval scenarios, not using an `instruct` on the query side can lead to a drop in retrieval performance of approximately 1% to 5%.

llama.cpp

Check out our llama.cpp documentation for a more detailed usage guide. We advise you to clone `llama.cpp` and install it following the official guide. We follow the latest version of llama.cpp. In the following demonstration, we assume that you are running commands under the repository `llama.cpp`.

| Model | Size | Mean (Task) | Mean (Type) | Bitext Mining | Class. | Clust. | Inst. Retri. | Multi. Class. | Pair. Class. | Rerank | Retri. | STS |
|----------------------------------|:-------:|:-------------:|:-------------:|:--------------:|:--------:|:--------:|:--------------:|:---------------:|:--------------:|:--------:|:--------:|:------:|
| NV-Embed-v2 | 7B | 56.29 | 49.58 | 57.84 | 57.29 | 40.80 | 1.04 | 18.63 | 78.94 | 63.82 | 56.72 | 71.10 |
| GritLM-7B | 7B | 60.92 | 53.74 | 70.53 | 61.83 | 49.75 | 3.45 | 22.77 | 79.94 | 63.78 | 58.31 | 73.33 |
| BGE-M3 | 0.6B | 59.56 | 52.18 | 79.11 | 60.35 | 40.88 | -3.11 | 20.1 | 80.76 | 62.79 | 54.60 | 74.12 |
| multilingual-e5-large-instruct | 0.6B | 63.22 | 55.08 | 80.13 | 64.94 | 50.75 | -0.40 | 22.91 | 80.86 | 62.61 | 57.12 | 76.81 |
| gte-Qwen2-1.5B-instruct | 1.5B | 59.45 | 52.69 | 62.51 | 58.32 | 52.05 | 0.74 | 24.02 | 81.58 | 62.58 | 60.78 | 71.61 |
| gte-Qwen2-7B-instruct | 7B | 62.51 | 55.93 | 73.92 | 61.55 | 52.77 | 4.94 | 25.48 | 85.13 | 65.55 | 60.08 | 73.98 |
| text-embedding-3-large | - | 58.93 | 51.41 | 62.17 | 60.27 | 46.89 | -2.68 | 22.03 | 79.17 | 63.89 | 59.27 | 71.68 |
| Cohere-embed-multilingual-v3.0 | - | 61.12 | 53.23 | 70.50 | 62.95 | 46.89 | -1.89 | 22.74 | 79.88 | 64.07 | 59.16 | 74.80 |
| Gemini Embedding | - | 68.37 | 59.59 | 79.28 | 71.82 | 54.59 | 5.18 | 29.16 | 83.63 | 65.58 | 67.71 | 79.40 |
| Qwen3-Embedding-0.6B | 0.6B | 64.33 | 56.00 | 72.22 | 66.83 | 52.33 | 5.09 | 24.59 | 80.83 | 61.41 | 64.64 | 76.17 |
| Qwen3-Embedding-4B | 4B | 69.45 | 60.86 | 79.36 | 72.33 | 57.15 | 11.56 | 26.77 | 85.05 | 65.08 | 69.60 | 80.86 |
| Qwen3-Embedding-8B | 8B | 70.58 | 61.69 | 80.89 | 74.00 | 57.65 | 10.06 | 28.66 | 86.40 | 65.63 | 70.88 | 81.08 |

> Note: For compared models, the scores are retrieved from MTEB online leaderboard on May 24th, 2025.

| MTEB English / Models | Param. | Mean(Task) | Mean(Type) | Class. | Clust. | Pair Class. | Rerank. | Retri. | STS | Summ. |
|--------------------------------|:--------:|:------------:|:------------:|:--------:|:--------:|:-------------:|:---------:|:--------:|:-------:|:-------:|
| multilingual-e5-large-instruct | 0.6B | 65.53 | 61.21 | 75.54 | 49.89 | 86.24 | 48.74 | 53.47 | 84.72 | 29.89 |
| NV-Embed-v2 | 7.8B | 69.81 | 65.00 | 87.19 | 47.66 | 88.69 | 49.61 | 62.84 | 83.82 | 35.21 |
| GritLM-7B | 7.2B | 67.07 | 63.22 | 81.25 | 50.82 | 87.29 | 49.59 | 54.95 | 83.03 | 35.65 |
| gte-Qwen2-1.5B-instruct | 1.5B | 67.20 | 63.26 | 85.84 | 53.54 | 87.52 | 49.25 | 50.25 | 82.51 | 33.94 |
| stella_en_1.5B_v5 | 1.5B | 69.43 | 65.32 | 89.38 | 57.06 | 88.02 | 50.19 | 52.42 | 83.27 | 36.91 |
| gte-Qwen2-7B-instruct | 7.6B | 70.72 | 65.77 | 88.52 | 58.97 | 85.9 | 50.47 | 58.09 | 82.69 | 35.74 |
| gemini-embedding-exp-03-07 | - | 73.3 | 67.67 | 90.05 | 59.39 | 87.7 | 48.59 | 64.35 | 85.29 | 38.28 |
| Qwen3-Embedding-0.6B | 0.6B | 70.70 | 64.88 | 85.76 | 54.05 | 84.37 | 48.18 | 61.83 | 86.57 | 33.43 |
| Qwen3-Embedding-4B | 4B | 74.60 | 68.10 | 89.84 | 57.51 | 87.01 | 50.76 | 68.46 | 88.72 | 34.39 |
| Qwen3-Embedding-8B | 8B | 75.22 | 68.71 | 90.43 | 58.57 | 87.52 | 51.56 | 69.44 | 88.58 | 34.83 |

| C-MTEB | Param. | Mean(Task) | Mean(Type) | Class. | Clust. | Pair Class. | Rerank. | Retr. | STS |
|------------------|--------|------------|------------|--------|--------|-------------|---------|-------|-------|
| multilingual-e5-large-instruct | 0.6B | 58.08 | 58.24 | 69.80 | 48.23 | 64.52 | 57.45 | 63.65 | 45.81 |
| bge-multilingual-gemma2 | 9B | 67.64 | 68.52 | 75.31 | 59.30 | 86.67 | 68.28 | 73.73 | 55.19 |
| gte-Qwen2-1.5B-instruct | 1.5B | 67.12 | 67.79 | 72.53 | 54.61 | 79.5 | 68.21 | 71.86 | 60.05 |
| gte-Qwen2-7B-instruct | 7.6B | 71.62 | 72.19 | 75.77 | 66.06 | 81.16 | 69.24 | 75.70 | 65.20 |
| ritrieve_zh_v1 | 0.3B | 72.71 | 73.85 | 76.88 | 66.5 | 85.98 | 72.86 | 76.97 | 63.92 |
| Qwen3-Embedding-0.6B | 0.6B | 66.33 | 67.45 | 71.40 | 68.74 | 76.42 | 62.58 | 71.03 | 54.52 |
| Qwen3-Embedding-4B | 4B | 72.27 | 73.51 | 75.46 | 77.89 | 83.34 | 66.05 | 77.03 | 61.26 |
| Qwen3-Embedding-8B | 8B | 73.84 | 75.00 | 76.97 | 80.08 | 84.23 | 66.99 | 78.21 | 63.53 |

If you find our work helpful, feel free to give us a cite.
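The `MRL Support` column and the "user-defined output dimensions ranging from 32 to 1024" feature above can be illustrated with a small sketch. It assumes the common practice for MRL-style embeddings of keeping the leading components and re-normalizing; the helper name `truncate_embedding` is hypothetical, and this is not presented as the official procedure for this model.

```python
import math


def truncate_embedding(vec, dim):
    """Keep the first `dim` components of an embedding and rescale to unit length.

    Assumes an MRL-style embedding whose leading components carry most of the
    information; the lower bound of 32 follows the range quoted in the card.
    """
    if not 32 <= dim <= len(vec):
        raise ValueError(f"dim must be between 32 and {len(vec)}, got {dim}")
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in head]


# Toy stand-in for a 1024-dimensional embedding (not real model output).
full = [1.0] * 1024
small = truncate_embedding(full, 256)
```

Smaller dimensions reduce index size and search cost, usually at some cost in retrieval quality, so the chosen dimension is worth validating on your own data.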
Qwen2.5-Coder-3B
Qwen2.5-Coder is the latest series of code-specific Qwen large language models (formerly known as CodeQwen). As of now, Qwen2.5-Coder covers six mainstream model sizes (0.5, 1.5, 3, 7, 14, and 32 billion parameters) to meet the needs of different developers. Qwen2.5-Coder brings the following improvements upon CodeQwen1.5:

- Significant improvements in code generation, code reasoning, and code fixing. Based on the strong Qwen2.5, we scale up the training tokens to 5.5 trillion, including source code, text-code grounding data, synthetic data, etc. Qwen2.5-Coder-32B has become the current state-of-the-art open-source code LLM, with its coding abilities matching those of GPT-4o.
- A more comprehensive foundation for real-world applications such as Code Agents, not only enhancing coding capabilities but also maintaining its strengths in mathematics and general competencies.

This repo contains the 3B Qwen2.5-Coder model, which has the following features:

- Type: Causal Language Models
- Training Stage: Pretraining
- Architecture: transformers with RoPE, SwiGLU, RMSNorm, Attention QKV bias and tied word embeddings
- Number of Parameters: 3.09B
- Number of Parameters (Non-Embedding): 2.77B
- Number of Layers: 36
- Number of Attention Heads (GQA): 16 for Q and 2 for KV
- Context Length: Full 32,768 tokens

We do not recommend using base language models for conversations. Instead, you can apply post-training, e.g., SFT, RLHF, continued pretraining, etc., or use this model for fill-in-the-middle tasks.

For more details, please refer to our blog, GitHub, Documentation, Arxiv.

The code of Qwen2.5-Coder is included in the latest Hugging Face `transformers`, and we advise you to use the latest version of `transformers`. With `transformers<4.37.0`, you will encounter the following error: `KeyError: 'qwen2'`.

Detailed evaluation results are reported in this 📑 blog. For requirements on GPU memory and the respective throughput, see results here.

If you find our work helpful, feel free to give us a cite.
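Since the card points to fill-in-the-middle as a suitable use of the base model, here is a minimal sketch of how such a prompt could be assembled. The special tokens `<|fim_prefix|>`, `<|fim_suffix|>`, and `<|fim_middle|>` are assumed from the Qwen2.5-Coder documentation rather than stated in this card, and `build_fim_prompt` is a hypothetical helper name.

```python
# Assumed FIM control tokens; verify them against the tokenizer before use.
FIM_PREFIX = "<|fim_prefix|>"
FIM_SUFFIX = "<|fim_suffix|>"
FIM_MIDDLE = "<|fim_middle|>"


def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Lay out code before and after the gap so the model generates the middle."""
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"


# The model is expected to continue after the final token with the missing body.
prompt = build_fim_prompt(
    prefix="def add(a, b):\n    ",
    suffix="\n\nprint(add(1, 2))\n",
)
```

The resulting string would be tokenized and passed to `generate` like any other prompt; no chat template is involved for a base model.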
Qwen1.5-7B-Chat
Qwen2-72B
Qwen1.5-32B
Qwen2.5-72B-Instruct-GPTQ-Int8
Qwen3-Embedding-8B-GGUF
The Qwen3 Embedding model series is the latest proprietary model of the Qwen family, specifically designed for text embedding and ranking tasks. Building upon the dense foundational models of the Qwen3 series, it provides a comprehensive range of text embedding and reranking models in various sizes (0.6B, 4B, and 8B). This series inherits the exceptional multilingual capabilities, long-text understanding, and reasoning skills of its foundational model. The Qwen3 Embedding series represents significant advancements in multiple text embedding and ranking tasks, including text retrieval, code retrieval, text classification, text clustering, and bitext mining.

Exceptional Versatility: The embedding model has achieved state-of-the-art performance across a wide range of downstream application evaluations. The 8B size embedding model ranks No.1 in the MTEB multilingual leaderboard (as of June 5, 2025, score 70.58), while the reranking model excels in various text retrieval scenarios.

Comprehensive Flexibility: The Qwen3 Embedding series offers a full spectrum of sizes (from 0.6B to 8B) for both embedding and reranking models, catering to diverse use cases that prioritize efficiency and effectiveness. Developers can seamlessly combine these two modules. Additionally, the embedding model allows for flexible vector definitions across all dimensions, and both embedding and reranking models support user-defined instructions to enhance performance for specific tasks, languages, or scenarios.

Multilingual Capability: The Qwen3 Embedding series offers support for over 100 languages, thanks to the multilingual capabilities of Qwen3 models. This includes various programming languages, and provides robust multilingual, cross-lingual, and code retrieval capabilities.
Model Overview

Qwen3-Embedding-8B-GGUF has the following features:

- Model Type: Text Embedding
- Supported Languages: 100+ Languages
- Number of Parameters: 8B
- Context Length: 32k
- Embedding Dimension: Up to 4096, supports user-defined output dimensions ranging from 32 to 4096
- Quantization: q4_K_M, q5_0, q5_K_M, q6_K, q8_0, f16

For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our blog, GitHub.

| Model Type | Models | Size | Layers | Sequence Length | Embedding Dimension | MRL Support | Instruction Aware |
|------------------|----------------------|------|--------|-----------------|---------------------|-------------|----------------|
| Text Embedding | Qwen3-Embedding-0.6B | 0.6B | 28 | 32K | 1024 | Yes | Yes |
| Text Embedding | Qwen3-Embedding-4B | 4B | 36 | 32K | 2560 | Yes | Yes |
| Text Embedding | Qwen3-Embedding-8B | 8B | 36 | 32K | 4096 | Yes | Yes |
| Text Reranking | Qwen3-Reranker-0.6B | 0.6B | 28 | 32K | - | - | Yes |
| Text Reranking | Qwen3-Reranker-4B | 4B | 36 | 32K | - | - | Yes |
| Text Reranking | Qwen3-Reranker-8B | 8B | 36 | 32K | - | - | Yes |

> Note:
> - `MRL Support` indicates whether the embedding model supports custom dimensions for the final embedding.
> - `Instruction Aware` notes whether the embedding or reranking model supports customizing the input instruction according to different tasks.
> - Our evaluation indicates that, for most downstream tasks, using instructions (instruct) typically yields an improvement of 1% to 5% compared to not using them. Therefore, we recommend that developers create tailored instructions specific to their tasks and scenarios. In multilingual contexts, we also advise users to write their instructions in English, as most instructions utilized during the model training process were originally written in English.

📌 Tip: We recommend that developers customize the `instruct` according to their specific scenarios, tasks, and languages. Our tests have shown that in most retrieval scenarios, not using an `instruct` on the query side can lead to a drop in retrieval performance of approximately 1% to 5%.

llama.cpp

Check out our llama.cpp documentation for a more detailed usage guide. We advise you to clone `llama.cpp` and install it following the official guide. We follow the latest version of llama.cpp. In the following demonstration, we assume that you are running commands under the repository `llama.cpp`.

| Model | Size | Mean (Task) | Mean (Type) | Bitext Mining | Class. | Clust. | Inst. Retri. | Multi. Class. | Pair. Class. | Rerank | Retri. | STS |
|----------------------------------|:-------:|:-------------:|:-------------:|:--------------:|:--------:|:--------:|:--------------:|:---------------:|:--------------:|:--------:|:--------:|:------:|
| NV-Embed-v2 | 7B | 56.29 | 49.58 | 57.84 | 57.29 | 40.80 | 1.04 | 18.63 | 78.94 | 63.82 | 56.72 | 71.10 |
| GritLM-7B | 7B | 60.92 | 53.74 | 70.53 | 61.83 | 49.75 | 3.45 | 22.77 | 79.94 | 63.78 | 58.31 | 73.33 |
| BGE-M3 | 0.6B | 59.56 | 52.18 | 79.11 | 60.35 | 40.88 | -3.11 | 20.1 | 80.76 | 62.79 | 54.60 | 74.12 |
| multilingual-e5-large-instruct | 0.6B | 63.22 | 55.08 | 80.13 | 64.94 | 50.75 | -0.40 | 22.91 | 80.86 | 62.61 | 57.12 | 76.81 |
| gte-Qwen2-1.5B-instruct | 1.5B | 59.45 | 52.69 | 62.51 | 58.32 | 52.05 | 0.74 | 24.02 | 81.58 | 62.58 | 60.78 | 71.61 |
| gte-Qwen2-7B-instruct | 7B | 62.51 | 55.93 | 73.92 | 61.55 | 52.77 | 4.94 | 25.48 | 85.13 | 65.55 | 60.08 | 73.98 |
| text-embedding-3-large | - | 58.93 | 51.41 | 62.17 | 60.27 | 46.89 | -2.68 | 22.03 | 79.17 | 63.89 | 59.27 | 71.68 |
| Cohere-embed-multilingual-v3.0 | - | 61.12 | 53.23 | 70.50 | 62.95 | 46.89 | -1.89 | 22.74 | 79.88 | 64.07 | 59.16 | 74.80 |
| gemini-embedding-exp-03-07 | - | 68.37 | 59.59 | 79.28 | 71.82 | 54.59 | 5.18 | 29.16 | 83.63 | 65.58 | 67.71 | 79.40 |
| Qwen3-Embedding-0.6B | 0.6B | 64.33 | 56.00 | 72.22 | 66.83 | 52.33 | 5.09 | 24.59 | 80.83 | 61.41 | 64.64 | 76.17 |
| Qwen3-Embedding-4B | 4B | 69.45 | 60.86 | 79.36 | 72.33 | 57.15 | 11.56 | 26.77 | 85.05 | 65.08 | 69.60 | 80.86 |
| Qwen3-Embedding-8B | 8B | 70.58 | 61.69 | 80.89 | 74.00 | 57.65 | 10.06 | 28.66 | 86.40 | 65.63 | 70.88 | 81.08 |

> Note: For compared models, the scores are retrieved from MTEB online leaderboard on May 24th, 2025.

| MTEB English / Models | Param. | Mean(Task) | Mean(Type) | Class. | Clust. | Pair Class. | Rerank. | Retri. | STS | Summ. |
|--------------------------------|:--------:|:------------:|:------------:|:--------:|:--------:|:-------------:|:---------:|:--------:|:-------:|:-------:|
| multilingual-e5-large-instruct | 0.6B | 65.53 | 61.21 | 75.54 | 49.89 | 86.24 | 48.74 | 53.47 | 84.72 | 29.89 |
| NV-Embed-v2 | 7.8B | 69.81 | 65.00 | 87.19 | 47.66 | 88.69 | 49.61 | 62.84 | 83.82 | 35.21 |
| GritLM-7B | 7.2B | 67.07 | 63.22 | 81.25 | 50.82 | 87.29 | 49.59 | 54.95 | 83.03 | 35.65 |
| gte-Qwen2-1.5B-instruct | 1.5B | 67.20 | 63.26 | 85.84 | 53.54 | 87.52 | 49.25 | 50.25 | 82.51 | 33.94 |
| stella_en_1.5B_v5 | 1.5B | 69.43 | 65.32 | 89.38 | 57.06 | 88.02 | 50.19 | 52.42 | 83.27 | 36.91 |
| gte-Qwen2-7B-instruct | 7.6B | 70.72 | 65.77 | 88.52 | 58.97 | 85.9 | 50.47 | 58.09 | 82.69 | 35.74 |
| gemini-embedding-exp-03-07 | - | 73.3 | 67.67 | 90.05 | 59.39 | 87.7 | 48.59 | 64.35 | 85.29 | 38.28 |
| Qwen3-Embedding-0.6B | 0.6B | 70.70 | 64.88 | 85.76 | 54.05 | 84.37 | 48.18 | 61.83 | 86.57 | 33.43 |
| Qwen3-Embedding-4B | 4B | 74.60 | 68.10 | 89.84 | 57.51 | 87.01 | 50.76 | 68.46 | 88.72 | 34.39 |
| Qwen3-Embedding-8B | 8B | 75.22 | 68.71 | 90.43 | 58.57 | 87.52 | 51.56 | 69.44 | 88.58 | 34.83 |

| C-MTEB | Param. | Mean(Task) | Mean(Type) | Class. | Clust. | Pair Class. | Rerank. | Retr. | STS |
|------------------|--------|------------|------------|--------|--------|-------------|---------|-------|-------|
| multilingual-e5-large-instruct | 0.6B | 58.08 | 58.24 | 69.80 | 48.23 | 64.52 | 57.45 | 63.65 | 45.81 |
| bge-multilingual-gemma2 | 9B | 67.64 | 68.52 | 75.31 | 59.30 | 86.67 | 68.28 | 73.73 | 55.19 |
| gte-Qwen2-1.5B-instruct | 1.5B | 67.12 | 67.79 | 72.53 | 54.61 | 79.5 | 68.21 | 71.86 | 60.05 |
| gte-Qwen2-7B-instruct | 7.6B | 71.62 | 72.19 | 75.77 | 66.06 | 81.16 | 69.24 | 75.70 | 65.20 |
| ritrieve_zh_v1 | 0.3B | 72.71 | 73.85 | 76.88 | 66.5 | 85.98 | 72.86 | 76.97 | 63.92 |
| Qwen3-Embedding-0.6B | 0.6B | 66.33 | 67.45 | 71.40 | 68.74 | 76.42 | 62.58 | 71.03 | 54.52 |
| Qwen3-Embedding-4B | 4B | 72.27 | 73.51 | 75.46 | 77.89 | 83.34 | 66.05 | 77.03 | 61.26 |
| Qwen3-Embedding-8B | 8B | 73.84 | 75.00 | 76.97 | 80.08 | 84.23 | 66.99 | 78.21 | 63.53 |

If you find our work helpful, feel free to give us a cite.
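The card notes that the embedding and reranking models can be combined. As a rough illustration of the first stage of such a pipeline, the sketch below ranks documents by cosine similarity to a query embedding; the shortlist would then be handed to a reranker. The vectors are toy values, the function names are hypothetical, and cosine similarity is an assumed scoring choice rather than something this card prescribes.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs, top_k=3):
    """Return indices of the `top_k` documents most similar to the query."""
    scored = [(cosine_similarity(query_vec, d), i) for i, d in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:top_k]]


# Toy 2-dimensional "embeddings" standing in for real model output.
query = [1.0, 0.0]
docs = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
shortlist = rank_documents(query, docs, top_k=2)
```

In practice the shortlist size trades recall against reranking cost, since the reranker scores each query-document pair separately.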
Qwen2.5-Math-PRM-7B
Qwen3-14B-FP8
Qwen3 is the latest generation of large language models in the Qwen series, offering a comprehensive suite of dense and mixture-of-experts (MoE) models. Built upon extensive training, Qwen3 delivers groundbreaking advancements in reasoning, instruction-following, agent capabilities, and multilingual support, with the following key features:

- Unique support for seamless switching between thinking mode (for complex logical reasoning, math, and coding) and non-thinking mode (for efficient, general-purpose dialogue) within a single model, ensuring optimal performance across various scenarios.
- Significant enhancement of its reasoning capabilities, surpassing previous QwQ (in thinking mode) and Qwen2.5 instruct models (in non-thinking mode) on mathematics, code generation, and commonsense logical reasoning.
- Superior human preference alignment, excelling in creative writing, role-playing, multi-turn dialogues, and instruction following, to deliver a more natural, engaging, and immersive conversational experience.
- Expertise in agent capabilities, enabling precise integration with external tools in both thinking and non-thinking modes and achieving leading performance among open-source models in complex agent-based tasks.
- Support for 100+ languages and dialects with strong capabilities for multilingual instruction following and translation.

This repo contains the FP8 version of Qwen3-14B, which has the following features:

- Type: Causal Language Models
- Training Stage: Pretraining & Post-training
- Number of Parameters: 14.8B
- Number of Parameters (Non-Embedding): 13.2B
- Number of Layers: 40
- Number of Attention Heads (GQA): 40 for Q and 8 for KV
- Context Length: 32,768 natively and 131,072 tokens with YaRN.

For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our blog, GitHub, and Documentation.
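The "40 for Q and 8 for KV" entry above describes grouped-query attention. As a small worked example, the sketch below maps each query head to the key/value head it shares, assuming contiguous groups (the layout used by common implementations); the function name is hypothetical and nothing here is taken from the model code itself.

```python
def kv_head_for_query_head(q_head, num_q_heads=40, num_kv_heads=8):
    """Return the KV head index shared by a given query head under GQA.

    Assumes query heads are split into contiguous, equally sized groups.
    """
    if num_q_heads % num_kv_heads != 0:
        raise ValueError("query heads must be a multiple of KV heads")
    if not 0 <= q_head < num_q_heads:
        raise ValueError("query head index out of range")
    group_size = num_q_heads // num_kv_heads  # 40 // 8 = 5 query heads per KV head
    return q_head // group_size


# Query heads 0..4 share KV head 0, heads 5..9 share KV head 1, and so on.
mapping = [kv_head_for_query_head(h) for h in range(40)]
```

With 8 KV heads instead of 40, the key/value cache per token is one fifth of what full multi-head attention with the same head size would need.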
The code of Qwen3 is included in the latest Hugging Face `transformers`, and we advise you to use the latest version of `transformers`. With `transformers<4.51.0`, you will encounter the following error: `KeyError: 'qwen3'`.

The final step of the quickstart snippet separates the thinking content from the answer; `tokenizer` and `output_ids` come from the generation step that precedes it:

```python
# parsing thinking content
try:
    # rindex finding 151668 (</think>)
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content)
print("content:", content)
```

```shell
python -m sglang.launch_server --model-path Qwen/Qwen3-14B-FP8 --reasoning-parser qwen3
```

```shell
vllm serve Qwen/Qwen3-14B-FP8 --enable-reasoning --reasoning-parser deepseek_r1
```

```python
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True  # True is the default value for enable_thinking
)
```

```python
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False  # Setting enable_thinking=False disables thinking mode
)
```

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

class QwenChatbot:
    def __init__(self, model_name="Qwen/Qwen3-14B-FP8"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(model_name)
        self.history = []

    def generate_response(self, user_input):
        messages = self.history + [{"role": "user", "content": user_input}]

        text = self.tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True
        )

        inputs = self.tokenizer(text, return_tensors="pt")
        response_ids = self.model.generate(**inputs, max_new_tokens=32768)[0][len(inputs.input_ids[0]):].tolist()
        response = self.tokenizer.decode(response_ids, skip_special_tokens=True)

        # Update history
        self.history.append({"role": "user", "content": user_input})
        self.history.append({"role": "assistant", "content": response})

        return response

# Example Usage
if __name__ == "__main__":
    chatbot = QwenChatbot()

    # First input (without /think or /no_think tags, thinking mode is enabled by default)
    user_input_1 = "How many r's in strawberries?"
    print(f"User: {user_input_1}")
    response_1 = chatbot.generate_response(user_input_1)
    print(f"Bot: {response_1}")
    print("----------------------")

    # Second input with /no_think
    user_input_2 = "Then, how many r's in blueberries? /no_think"
    print(f"User: {user_input_2}")
    response_2 = chatbot.generate_response(user_input_2)
    print(f"Bot: {response_2}")
    print("----------------------")

    # Third input with /think
    user_input_3 = "Really? /think"
    print(f"User: {user_input_3}")
    response_3 = chatbot.generate_response(user_input_3)
    print(f"Bot: {response_3}")
```

```python
from qwen_agent.agents import Assistant

# Define LLM
llm_cfg = {
    'model': 'Qwen3-14B-FP8',

    # Use the endpoint provided by Alibaba Model Studio:
    # 'model_type': 'qwen_dashscope',
    # 'api_key': os.getenv('DASHSCOPE_API_KEY'),

    # Use a custom endpoint compatible with OpenAI API:
    'model_server': 'http://localhost:8000/v1',  # api_base
    'api_key': 'EMPTY',

    # Other parameters:
    # 'generate_cfg': {
    #     # Add: When the response content is `<think>this is the thought</think>this is the answer;
    #     # Do not add: When the response has been separated by reasoning_content and content.
    #     'thought_in_content': True,
    # },
}

# Define Tools
tools = [
    {'mcpServers': {  # You can specify the MCP configuration file
        'time': {
            'command': 'uvx',
            'args': ['mcp-server-time', '--local-timezone=Asia/Shanghai']
        },
        "fetch": {
            "command": "uvx",
            "args": ["mcp-server-fetch"]
        }
    }},
    'code_interpreter',  # Built-in tools
]

# Define Agent
bot = Assistant(llm=llm_cfg, function_list=tools)

# Streaming generation
messages = [{'role': 'user', 'content': 'https://qwenlm.github.io/blog/ Introduce the latest developments of Qwen'}]
for responses in bot.run(messages=messages):
    pass
print(responses)
```

```json
{
    ...,
    "rope_scaling": {
        "rope_type": "yarn",
        "factor": 4.0,
        "original_max_position_embeddings": 32768
    }
}
```

```shell
vllm serve ... --rope-scaling '{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768}' --max-model-len 131072
```

```shell
python -m sglang.launch_server ... --json-model-override-args '{"rope_scaling":{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768}}'
```

```shell
llama-server ... --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 32768
```

> Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'original_max_position_embeddings'}

```
@misc{qwen3technicalreport,
      title={Qwen3 Technical Report},
      author={Qwen Team},
      year={2025},
      eprint={2505.09388},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.09388},
}
```
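The YaRN settings above use a factor of 4.0 to stretch the native 32,768-token window to 131,072 tokens. The sketch below shows that arithmetic as a small helper. Choosing the factor as the ratio of target length to native length is a rule of thumb assumed here, and `yarn_rope_scaling` is a hypothetical name, not part of any library.

```python
def yarn_rope_scaling(target_len, native_len=32768):
    """Build a rope_scaling dict for a target context length, or None if not needed.

    Assumes factor = target_len / native_len, e.g. 131072 / 32768 = 4.0.
    """
    if target_len <= native_len:
        return None  # the native window already covers the request
    return {
        "rope_type": "yarn",
        "factor": float(target_len / native_len),
        "original_max_position_embeddings": native_len,
    }


full_window = yarn_rope_scaling(131072)
half_window = yarn_rope_scaling(65536)
```

Returning `None` for short targets reflects a design choice in this sketch: scaling is only applied when the requested length actually exceeds the native window.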
Qwen2.5-7B-Instruct-GGUF
Qwen2.5-3B-Instruct-AWQ
Qwen2.5 is the latest series of Qwen large language models. For Qwen2.5, we release a number of base language models and instruction-tuned language models ranging from 0.5 to 72 billion parameters. Qwen2.5 brings the following improvements upon Qwen2:

- Significantly more knowledge and greatly improved capabilities in coding and mathematics, thanks to our specialized expert models in these domains.
- Significant improvements in instruction following, generating long texts (over 8K tokens), understanding structured data (e.g., tables), and generating structured outputs, especially JSON. More resilient to the diversity of system prompts, enhancing role-play implementation and condition-setting for chatbots.
- Long-context support up to 128K tokens, with generation of up to 8K tokens.
- Multilingual support for over 29 languages, including Chinese, English, French, Spanish, Portuguese, German, Italian, Russian, Japanese, Korean, Vietnamese, Thai, Arabic, and more.

This repo contains the AWQ-quantized 4-bit instruction-tuned 3B Qwen2.5 model, which has the following features:

- Type: Causal Language Models
- Training Stage: Pretraining & Post-training
- Architecture: transformers with RoPE, SwiGLU, RMSNorm, Attention QKV bias and tied word embeddings
- Number of Parameters: 3.09B
- Number of Parameters (Non-Embedding): 2.77B
- Number of Layers: 36
- Number of Attention Heads (GQA): 16 for Q and 2 for KV
- Context Length: Full 32,768 tokens and generation 8192 tokens
- Quantization: AWQ 4-bit

For more details, please refer to our blog, GitHub, and Documentation.

The code of Qwen2.5 is included in the latest Hugging Face `transformers`, and we advise you to use the latest version of `transformers`. With `transformers<4.37.0`, you will encounter the following error: `KeyError: 'qwen2'`.

Also check out our AWQ documentation for a more detailed usage guide.

Here is a code snippet with `apply_chat_template` that shows how to load the tokenizer and model and how to generate content.

Detailed evaluation results are reported in this 📑 blog. For quantized models, the benchmark results against the original bfloat16 models can be found here. For requirements on GPU memory and the respective throughput, see results here.

If you find our work helpful, feel free to give us a cite.
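To give a feel for what 4-bit quantization means for the 3.09B parameters listed above, here is a back-of-the-envelope sketch. It only multiplies parameter count by bits per weight; it deliberately ignores quantization scales and zero points, any layers kept in higher precision, the KV cache, and activations, so real memory use will be higher. The helper name is hypothetical, and the measured figures linked from the card should be preferred.

```python
def weight_memory_gib(num_params, bits_per_weight):
    """Rough weight-only footprint in GiB: params * bits / 8 bytes / 2**30."""
    return num_params * bits_per_weight / 8 / 2**30


params = 3.09e9  # parameter count quoted in the card
bf16_gib = weight_memory_gib(params, 16)
int4_gib = weight_memory_gib(params, 4)

print(f"bf16 weights: ~{bf16_gib:.2f} GiB, 4-bit weights: ~{int4_gib:.2f} GiB")
```

The ratio between the two estimates is exactly 4, which is the upper bound on the savings from moving 16-bit weights to 4 bits.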
QwQ-32B-Preview
QwQ-32B-Preview is an experimental research model developed by the Qwen Team, focused on advancing AI reasoning capabilities. As a preview release, it demonstrates promising analytical abilities while having several important limitations:

1. Language Mixing and Code-Switching: The model may mix languages or switch between them unexpectedly, affecting response clarity.
2. Recursive Reasoning Loops: The model may enter circular reasoning patterns, leading to lengthy responses without a conclusive answer.
3. Safety and Ethical Considerations: The model requires enhanced safety measures to ensure reliable and secure performance, and users should exercise caution when deploying it.
4. Performance and Benchmark Limitations: The model excels in math and coding but has room for improvement in other areas, such as common sense reasoning and nuanced language understanding.

Specification:

- Type: Causal Language Models
- Training Stage: Pretraining & Post-training
- Architecture: transformers with RoPE, SwiGLU, RMSNorm, and Attention QKV bias
- Number of Parameters: 32.5B
- Number of Parameters (Non-Embedding): 31.0B
- Number of Layers: 64
- Number of Attention Heads (GQA): 40 for Q and 8 for KV
- Context Length: Full 32,768 tokens

For more details, please refer to our blog. You can also check Qwen2.5 GitHub, and Documentation.

The code of Qwen2.5 is included in the latest Hugging Face `transformers`, and we advise you to use the latest version of `transformers`. With `transformers<4.37.0`, you will encounter the following error: `KeyError: 'qwen2'`.

Here is a code snippet with `apply_chat_template` that shows how to load the tokenizer and model and how to generate content.

If you find our work helpful, feel free to give us a cite.
Qwen2-VL-72B-Instruct
Qwen3-0.6B-GGUF
Qwen2.5-0.5B-Instruct-GPTQ-Int4
Qwen2.5 is the latest series of Qwen large language models. For Qwen2.5, we release a number of base language models and instruction-tuned language models ranging from 0.5 to 72 billion parameters. Qwen2.5 brings the following improvements upon Qwen2:

- Significantly more knowledge and greatly improved capabilities in coding and mathematics, thanks to our specialized expert models in these domains.
- Significant improvements in instruction following, generating long texts (over 8K tokens), understanding structured data (e.g., tables), and generating structured outputs, especially JSON. More resilient to the diversity of system prompts, enhancing role-play implementation and condition-setting for chatbots.
- Long-context support up to 128K tokens, with generation of up to 8K tokens.
- Multilingual support for over 29 languages, including Chinese, English, French, Spanish, Portuguese, German, Italian, Russian, Japanese, Korean, Vietnamese, Thai, Arabic, and more.

This repo contains the GPTQ-quantized 4-bit instruction-tuned 0.5B Qwen2.5 model, which has the following features:

- Type: Causal Language Models
- Training Stage: Pretraining & Post-training
- Architecture: transformers with RoPE, SwiGLU, RMSNorm, Attention QKV bias and tied word embeddings
- Number of Parameters: 0.49B
- Number of Parameters (Non-Embedding): 0.36B
- Number of Layers: 24
- Number of Attention Heads (GQA): 14 for Q and 2 for KV
- Context Length: Full 32,768 tokens and generation 8192 tokens
- Quantization: GPTQ 4-bit

For more details, please refer to our blog, GitHub, and Documentation.

The code of Qwen2.5 is included in the latest Hugging Face `transformers`, and we advise you to use the latest version of `transformers`. With `transformers<4.37.0`, you will encounter the following error: `KeyError: 'qwen2'`.

Also check out our GPTQ documentation for a more detailed usage guide.

Here is a code snippet with `apply_chat_template` that shows how to load the tokenizer and model and how to generate content.

Detailed evaluation results are reported in this 📑 blog. For quantized models, the benchmark results against the original bfloat16 models can be found here. For requirements on GPU memory and the respective throughput, see results here.

If you find our work helpful, feel free to give us a cite.
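For readers wondering what `apply_chat_template` roughly produces for instruction-tuned Qwen models, the sketch below renders a message list in ChatML style by hand. It is a simplified approximation under the assumption that the template wraps each turn in `<|im_start|>` and `<|im_end|>` markers; the real template shipped with the tokenizer also handles details such as a default system prompt and tool calls, so prefer the tokenizer method in practice. `to_chatml` is a hypothetical helper.

```python
def to_chatml(messages, add_generation_prompt=True):
    """Render chat messages as a ChatML-style prompt string (simplified)."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    ]
    if add_generation_prompt:
        # Open an assistant turn so the model continues with its reply.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Give me a short introduction to large language models."},
]
prompt = to_chatml(messages)
```

Seeing the rendered string can help when debugging prompts sent to a raw completion endpoint that does not apply the template for you.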
Qwen3-VL-235B-A22B-Thinking
Meet Qwen3-VL — the most powerful vision-language model in the Qwen series to date. This generation delivers comprehensive upgrades across the board: superior text understanding & generation, deeper visual perception & reasoning, extended context length, enhanced spatial and video dynamics comprehension, and stronger agent interaction capabilities. Available in Dense and MoE architectures that scale from edge to cloud, with Instruct and reasoning‑enhanced Thinking editions for flexible, on‑demand deployment. Visual Agent: Operates PC/mobile GUIs—recognizes elements, understands functions, invokes tools, completes tasks. Visual Coding Boost: Generates Draw.io/HTML/CSS/JS from images/videos. Advanced Spatial Perception: Judges object positions, viewpoints, and occlusions; provides stronger 2D grounding and enables 3D grounding for spatial reasoning and embodied AI. Long Context & Video Understanding: Native 256K context, expandable to 1M; handles books and hours-long video with full recall and second-level indexing. Enhanced Multimodal Reasoning: Excels in STEM/Math—causal analysis and logical, evidence-based answers. Upgraded Visual Recognition: Broader, higher-quality pretraining is able to “recognize everything”—celebrities, anime, products, landmarks, flora/fauna, etc. Expanded OCR: Supports 32 languages (up from 19); robust in low light, blur, and tilt; better with rare/ancient characters and jargon; improved long-document structure parsing. Text Understanding on par with pure LLMs: Seamless text–vision fusion for lossless, unified comprehension. 1. Interleaved-MRoPE: Full‑frequency allocation over time, width, and height via robust positional embeddings, enhancing long‑horizon video reasoning. 2. DeepStack: Fuses multi‑level ViT features to capture fine‑grained details and sharpen image–text alignment. 3. Text–Timestamp Alignment: Moves beyond T‑RoPE to precise, timestamp‑grounded event localization for stronger video temporal modeling. 
This is the weight repository for Qwen3-VL-235B-A22B-Thinking. Below, we provide simple examples to show how to use Qwen3-VL with 🤖 ModelScope and 🤗 Transformers. The code of Qwen3-VL is included in the latest Hugging Face `transformers`, and we advise you to build from source with command: Here is a code snippet that shows how to use the chat model with `transformers`: If you find our work helpful, feel free to give us a cite.
Qwen1.5-4B
License: other License Name: tongyi-qianwen-research License Link:
Qwen3-14B-GGUF
Qwen2-57B-A14B-Instruct
Qwen1.5-72B-Chat
Qwen1.5 is the beta version of Qwen2, a transformer-based decoder-only language model pretrained on a large amount of data. Compared with the previously released Qwen, the improvements include: 8 model sizes, including 0.5B, 1.8B, 4B, 7B, 14B, 32B and 72B dense models, and an MoE model of 14B with 2.7B activated; significant performance improvement in human preference for chat models; multilingual support in both base and chat models; stable support of 32K context length for models of all sizes; no need for `trust_remote_code`. For more details, please refer to our blog post and GitHub repo. Model Details Qwen1.5 is a language model series including decoder language models of different model sizes. For each size, we release the base language model and the aligned chat model. It is based on the Transformer architecture with SwiGLU activation, attention QKV bias, group query attention, a mixture of sliding window attention and full attention, etc. Additionally, we have an improved tokenizer adapted to multiple natural languages and code. For the beta version, we temporarily did not include GQA (except for 32B) or the mixture of SWA and full attention. Training details We pretrained the models with a large amount of data, and we post-trained the models with both supervised finetuning and direct preference optimization. Requirements The code of Qwen1.5 is included in the latest Hugging Face transformers, and we advise you to install `transformers>=4.37.0`; otherwise you might encounter an error when loading the model. Generation follows the usual pattern: load the tokenizer and model, format the conversation with `apply_chat_template`, and generate. For quantized models, we advise you to use the GPTQ, AWQ, and GGUF counterparts, namely `Qwen1.5-72B-Chat-GPTQ-Int4`, `Qwen1.5-72B-Chat-GPTQ-Int8`, `Qwen1.5-72B-Chat-AWQ`, and `Qwen1.5-72B-Chat-GGUF`. If you encounter code switching or other bad cases, we advise you to use the hyper-parameters provided in `generation_config.json`.
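As a rough illustration of what a chat template produces for Qwen chat models, the sketch below renders a message list in the ChatML layout by hand. The helper name `render_chatml` is hypothetical, and the template shipped with a given checkpoint may differ in details (for example, by injecting a default system prompt), so in real use the tokenizer's own chat template should be preferred.

```python
from typing import Dict, List

# ChatML turn delimiters (assumed to match the template shipped with the tokenizer).
IM_START = "<|im_start|>"
IM_END = "<|im_end|>"


def render_chatml(messages: List[Dict[str, str]], add_generation_prompt: bool = True) -> str:
    """Render messages as ChatML text; a hypothetical stand-in for the real chat template."""
    parts = []
    for message in messages:
        parts.append(f"{IM_START}{message['role']}\n{message['content']}{IM_END}\n")
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append(f"{IM_START}assistant\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Give me a short introduction to large language models."},
]
prompt = render_chatml(messages)
print(prompt)
```

The rendered string is what would be tokenized and passed to generation; only the text produced after the final assistant marker belongs to the reply.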
If you find our work helpful, feel free to give us a cite.
Qwen2-57B-A14B
Language model for text generation.
Qwen2.5-Coder-14B-Instruct-GGUF
Qwen2.5-Coder-14B-Instruct-GPTQ-Int4
Qwen2.5-Coder is the latest series of code-specific Qwen large language models (formerly known as CodeQwen). As of now, Qwen2.5-Coder covers six mainstream model sizes (0.5, 1.5, 3, 7, 14, and 32 billion parameters) to meet the needs of different developers. Qwen2.5-Coder brings the following improvements upon CodeQwen1.5: - Significant improvements in code generation, code reasoning and code fixing. Based on the strong Qwen2.5, we scale up the training tokens to 5.5 trillion, including source code, text-code grounding, synthetic data, etc. Qwen2.5-Coder-32B has become the current state-of-the-art open-source code LLM, with its coding abilities matching those of GPT-4o. - A more comprehensive foundation for real-world applications such as Code Agents, not only enhancing coding capabilities but also maintaining its strengths in mathematics and general competencies. - Long-context support up to 128K tokens. This repo contains the GPTQ-quantized 4-bit instruction-tuned 14B Qwen2.5-Coder model, which has the following features: - Type: Causal Language Models - Training Stage: Pretraining & Post-training - Architecture: transformers with RoPE, SwiGLU, RMSNorm, and Attention QKV bias - Number of Parameters: 14.7B - Number of Parameters (Non-Embedding): 13.1B - Number of Layers: 48 - Number of Attention Heads (GQA): 40 for Q and 8 for KV - Context Length: Full 131,072 tokens - Please refer to this section for detailed instructions on how to deploy Qwen2.5 for handling long texts. - Quantization: GPTQ 4-bit For more details, please refer to our blog, GitHub, Documentation, Arxiv. The code of Qwen2.5-Coder is included in the latest Hugging Face `transformers`, and we advise you to use the latest version of `transformers`. With `transformers<4.37.0`, loading the model fails with an error. Also check out our GPTQ documentation for a more detailed usage guide.
Generation follows the usual pattern: load the tokenizer and model, format the conversation with `apply_chat_template`, and generate. The current `config.json` is set for context length up to 32,768 tokens. To handle extensive inputs exceeding 32,768 tokens, we utilize YaRN, a technique for enhancing model length extrapolation, ensuring optimal performance on lengthy texts. For supported frameworks, you can add a `rope_scaling` entry to `config.json` to enable YaRN. For deployment, we recommend using vLLM. Please refer to our Documentation for usage if you are not familiar with vLLM. Presently, vLLM only supports static YaRN, which means the scaling factor remains constant regardless of input length, potentially impacting performance on shorter texts. We advise adding the `rope_scaling` configuration only when processing long contexts is required. Detailed evaluation results are reported in this 📑 blog. For requirements on GPU memory and the respective throughput, see results here. If you find our work helpful, feel free to cite us.
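To make the YaRN setting concrete, here is a minimal sketch that derives the static scaling factor from a target context length and prints a fragment that could be merged into `config.json`. The helper `yarn_rope_scaling` is hypothetical, and the key names (`type`, `factor`, `original_max_position_embeddings`) are an assumption about the schema that should be checked against the framework in use.

```python
import json

NATIVE_CONTEXT = 32768  # context length configured in config.json, per the text above


def yarn_rope_scaling(target_context: int, native_context: int = NATIVE_CONTEXT) -> dict:
    """Build a static-YaRN entry for config.json (key names are an assumption)."""
    if target_context <= native_context:
        raise ValueError("YaRN is only needed when the target exceeds the native context")
    return {
        "rope_scaling": {
            "type": "yarn",
            # Static factor: ratio of the desired context to the native one.
            "factor": target_context / native_context,
            "original_max_position_embeddings": native_context,
        }
    }


snippet = yarn_rope_scaling(131072)
print(json.dumps(snippet, indent=2))
```

Because the factor is static, it applies to short inputs as well, which is why the entry is best added only for long-context workloads.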
Qwen2-1.5B-Instruct-GGUF
Qwen1.5-72B
Qwen3-VL-32B-Thinking-GGUF
This repository provides GGUF-format weights for Qwen3-VL-32B-Thinking, split into two components: - Language model (LLM): FP16, Q8_0, Q4_K_M - Vision encoder (`mmproj`): FP16, Q8_0 These files are compatible with llama.cpp, Ollama, and other GGUF-based tools, supporting inference on CPU, NVIDIA GPU (CUDA), Apple Silicon (Metal), Intel GPUs (SYCL), and more. You can mix precision levels for the language and vision components based on your hardware and performance needs, and even perform custom quantization starting from the FP16 weights. Enjoy running this multimodal model on your personal device! 🚀 Meet Qwen3-VL, the most powerful vision-language model in the Qwen series to date. This generation delivers comprehensive upgrades across the board: superior text understanding & generation, deeper visual perception & reasoning, extended context length, enhanced spatial and video dynamics comprehension, and stronger agent interaction capabilities. Available in Dense and MoE architectures that scale from edge to cloud, with Instruct and reasoning‑enhanced Thinking editions for flexible, on‑demand deployment. Visual Agent: Operates PC/mobile GUIs: recognizes elements, understands functions, invokes tools, completes tasks. Visual Coding Boost: Generates Draw.io/HTML/CSS/JS from images/videos. Advanced Spatial Perception: Judges object positions, viewpoints, and occlusions; provides stronger 2D grounding and enables 3D grounding for spatial reasoning and embodied AI. Long Context & Video Understanding: Native 256K context, expandable to 1M; handles books and hours-long video with full recall and second-level indexing. Enhanced Multimodal Reasoning: Excels in STEM/Math with causal analysis and logical, evidence-based answers. Upgraded Visual Recognition: Broader, higher-quality pretraining is able to “recognize everything”: celebrities, anime, products, landmarks, flora/fauna, etc. 
Expanded OCR: Supports 32 languages (up from 19); robust in low light, blur, and tilt; better with rare/ancient characters and jargon; improved long-document structure parsing. Text Understanding on par with pure LLMs: Seamless text–vision fusion for lossless, unified comprehension. 1. Interleaved-MRoPE: Full‑frequency allocation over time, width, and height via robust positional embeddings, enhancing long‑horizon video reasoning. 2. DeepStack: Fuses multi‑level ViT features to capture fine‑grained details and sharpen image–text alignment. 3. Text–Timestamp Alignment: Moves beyond T‑RoPE to precise, timestamp‑grounded event localization for stronger video temporal modeling. To use these models with `llama.cpp`, please ensure you are using the latest version, either by building from source or by downloading the most recent release for your device. You can run inference via the command line or through a web-based chat interface. For example, you can run Qwen3-VL-32B-Thinking with an FP16 vision encoder and a Q8_0-quantized LLM, or serve Qwen3-VL-235B-A22B-Instruct via an OpenAI-compatible API with a web UI. > Tip: For models split into multiple GGUF files, simply specify the first shard (e.g., `...-00001-of-00003.gguf`). llama.cpp will automatically load all parts. Once the server is running, open your browser to `http://localhost:8080` to access the built-in chat interface, or send requests to the `/v1/chat/completions` endpoint. For more details, refer to the official documentation. You can further quantize the FP16 weights to other precision levels, for example to 2-bit. For a full list of supported quantization types and detailed instructions, refer to the quantization documentation. If you find our work helpful, feel free to cite us.
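A request to the `/v1/chat/completions` endpoint mentioned above can be sketched with the Python standard library alone. The helper `build_vision_request` is hypothetical, and the payload layout (an OpenAI-style `image_url` part carrying a base64 data URL) is an assumption about what the running server accepts; the request is built here but not sent.

```python
import base64
import json
import urllib.request

# Address and endpoint taken from the text above.
SERVER_URL = "http://localhost:8080/v1/chat/completions"


def build_vision_request(image_bytes: bytes, question: str, mime: str = "image/png") -> urllib.request.Request:
    """Build (but do not send) a chat request with one image and one question."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    data_url = f"data:{mime};base64,{encoded}"
    payload = {
        "messages": [
            {
                "role": "user",
                "content": [
                    # Assumed OpenAI-style multimodal content parts.
                    {"type": "image_url", "image_url": {"url": data_url}},
                    {"type": "text", "text": question},
                ],
            }
        ],
        "max_tokens": 512,  # illustrative value
    }
    return urllib.request.Request(
        SERVER_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_vision_request(b"placeholder image bytes", "What is in this image?")
print(request.full_url, request.get_method())
# With a server running, the reply could be read with:
#   json.load(urllib.request.urlopen(request))
```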
Qwen3-4B-GGUF
Qwen3 is the latest generation of large language models in the Qwen series, offering a comprehensive suite of dense and mixture-of-experts (MoE) models. Built upon extensive training, Qwen3 delivers groundbreaking advancements in reasoning, instruction-following, agent capabilities, and multilingual support, with the following key features: - Unique support for seamless switching between thinking mode (for complex logical reasoning, math, and coding) and non-thinking mode (for efficient, general-purpose dialogue) within a single model, ensuring optimal performance across various scenarios. - Significant enhancement of its reasoning capabilities, surpassing previous QwQ (in thinking mode) and Qwen2.5 instruct models (in non-thinking mode) on mathematics, code generation, and commonsense logical reasoning. - Superior human preference alignment, excelling in creative writing, role-playing, multi-turn dialogues, and instruction following, to deliver a more natural, engaging, and immersive conversational experience. - Expertise in agent capabilities, enabling precise integration with external tools in both thinking and non-thinking modes and achieving leading performance among open-source models in complex agent-based tasks. - Support for 100+ languages and dialects with strong capabilities for multilingual instruction following and translation. Qwen3-4B has the following features: - Type: Causal Language Models - Training Stage: Pretraining & Post-training - Number of Parameters: 4.0B - Number of Parameters (Non-Embedding): 3.6B - Number of Layers: 36 - Number of Attention Heads (GQA): 32 for Q and 8 for KV - Context Length: 32,768 natively and 131,072 tokens with YaRN. For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our blog, GitHub, and Documentation. Check out our llama.cpp documentation for a more detailed usage guide. We advise you to clone `llama.cpp` and install it following the official guide. 
We follow the latest version of llama.cpp. In the following demonstration, we assume that you are running commands under the repository `llama.cpp`. Check out our ollama documentation for a more detailed usage guide. You can add `/think` and `/no_think` to user prompts or system messages to switch the model's thinking mode from turn to turn. The model will follow the most recent instruction in multi-turn conversations. Qwen3 natively supports context lengths of up to 32,768 tokens. For conversations where the total length (including both input and output) significantly exceeds this limit, we recommend using RoPE scaling techniques to handle long texts effectively. We have validated the model's performance on context lengths of up to 131,072 tokens using the YaRN method. > Note: All the notable open-source frameworks implement static YaRN, which means the scaling factor remains constant regardless of input length, potentially impacting performance on shorter texts. We advise adding the `rope_scaling` configuration only when processing long contexts is required. It is also recommended to modify the `factor` as needed. For example, if the typical context length for your application is 65,536 tokens, it would be better to set `factor` to 2.0. > Tip: The endpoint provided by Alibaba Model Studio supports dynamic YaRN by default and no extra configuration is needed. To achieve optimal performance, we recommend the following settings: 1. Sampling Parameters: - For thinking mode (`enable_thinking=True`), use `Temperature=0.6`, `TopP=0.95`, `TopK=20`, `MinP=0`, and `PresencePenalty=1.5`. DO NOT use greedy decoding, as it can lead to performance degradation and endless repetitions. - For non-thinking mode (`enable_thinking=False`), we suggest using `Temperature=0.7`, `TopP=0.8`, `TopK=20`, `MinP=0`, and `PresencePenalty=1.5`. - We recommend setting `presence_penalty` to 1.5 for quantized models to suppress repetitive outputs. 
You can adjust the `presence_penalty` parameter between 0 and 2. A higher value may occasionally lead to language mixing and a slight reduction in model performance. 2. Adequate Output Length: We recommend using an output length of 32,768 tokens for most queries. For benchmarking on highly complex problems, such as those found in math and programming competitions, we suggest setting the max output length to 38,912 tokens. This provides the model with sufficient space to generate detailed and comprehensive responses, thereby enhancing its overall performance. 3. Standardize Output Format: We recommend using prompts to standardize model outputs when benchmarking. - Math Problems: Include "Please reason step by step, and put your final answer within \boxed{}." in the prompt. - Multiple-Choice Questions: Add the following JSON structure to the prompt to standardize responses: "Please show your choice in the `answer` field with only the choice letter, e.g., `"answer": "C"`." 4. No Thinking Content in History: In multi-turn conversations, the historical model output should only include the final output part and does not need to include the thinking content. This is implemented in the provided Jinja2 chat template. However, for frameworks that do not directly use the Jinja2 chat template, it is up to the developers to ensure that the best practice is followed. If you find our work helpful, feel free to cite us.
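The sampling recommendations and the "no thinking content in history" practice can be sketched together in a few lines. Everything named here is illustrative: the helpers `sampling_params`, `strip_thinking` and `append_assistant_turn` are hypothetical, the parameter keys follow common inference-API naming rather than any specific framework, and the sketch assumes that thinking content is wrapped in `<think>...</think>` tags.

```python
import re

# Recommended sampling settings quoted in the text above, keyed by thinking mode.
# Key names are illustrative; map them to whatever the inference framework expects.
SAMPLING_PRESETS = {
    True: {"temperature": 0.6, "top_p": 0.95, "top_k": 20, "min_p": 0.0, "presence_penalty": 1.5},
    False: {"temperature": 0.7, "top_p": 0.8, "top_k": 20, "min_p": 0.0, "presence_penalty": 1.5},
}

# Assumption: thinking content is delimited by <think>...</think> tags.
THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)


def sampling_params(enable_thinking: bool) -> dict:
    """Return a copy of the preset for the chosen mode."""
    return dict(SAMPLING_PRESETS[enable_thinking])


def strip_thinking(assistant_text: str) -> str:
    """Drop thinking blocks so only the final answer is kept in the history."""
    return THINK_BLOCK.sub("", assistant_text).strip()


def append_assistant_turn(history: list, raw_output: str) -> list:
    """Return a new history with the cleaned assistant reply appended."""
    return history + [{"role": "assistant", "content": strip_thinking(raw_output)}]


history = [{"role": "user", "content": "What is 2 + 2? /think"}]
history = append_assistant_turn(history, "<think>\n2 + 2 is 4.\n</think>\n\nThe answer is 4.")
print(history[-1]["content"])
print(sampling_params(enable_thinking=True))
```

Frameworks that apply the provided Jinja2 chat template already perform the stripping step; the helper only matters when the prompt is assembled by hand.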
Qwen2.5-Coder-7B-Instruct-GPTQ-Int8
Qwen2.5-Coder-7B-Instruct-GPTQ-Int4
Qwen3-VL-32B-Thinking-FP8
> This repository contains an FP8 quantized version of the Qwen3-VL-32B-Thinking model. The quantization method is fine-grained fp8 quantization with block size of 128, and its performance metrics are nearly identical to those of the original BF16 model. Enjoy! Meet Qwen3-VL — the most powerful vision-language model in the Qwen series to date. This generation delivers comprehensive upgrades across the board: superior text understanding & generation, deeper visual perception & reasoning, extended context length, enhanced spatial and video dynamics comprehension, and stronger agent interaction capabilities. Available in Dense and MoE architectures that scale from edge to cloud, with Instruct and reasoning‑enhanced Thinking editions for flexible, on‑demand deployment. Visual Agent: Operates PC/mobile GUIs—recognizes elements, understands functions, invokes tools, completes tasks. Visual Coding Boost: Generates Draw.io/HTML/CSS/JS from images/videos. Advanced Spatial Perception: Judges object positions, viewpoints, and occlusions; provides stronger 2D grounding and enables 3D grounding for spatial reasoning and embodied AI. Long Context & Video Understanding: Native 256K context, expandable to 1M; handles books and hours-long video with full recall and second-level indexing. Enhanced Multimodal Reasoning: Excels in STEM/Math—causal analysis and logical, evidence-based answers. Upgraded Visual Recognition: Broader, higher-quality pretraining is able to “recognize everything”—celebrities, anime, products, landmarks, flora/fauna, etc. Expanded OCR: Supports 32 languages (up from 19); robust in low light, blur, and tilt; better with rare/ancient characters and jargon; improved long-document structure parsing. Text Understanding on par with pure LLMs: Seamless text–vision fusion for lossless, unified comprehension. 1. Interleaved-MRoPE: Full‑frequency allocation over time, width, and height via robust positional embeddings, enhancing long‑horizon video reasoning. 2. 
DeepStack: Fuses multi‑level ViT features to capture fine‑grained details and sharpen image–text alignment. 3. Text–Timestamp Alignment: Moves beyond T‑RoPE to precise, timestamp‑grounded event localization for stronger video temporal modeling. This is the weight repository for Qwen3-VL-32B-Thinking-FP8. Currently, 🤗 Transformers does not support loading these weights directly. Stay tuned! We recommend deploying the model using vLLM or SGLang, with example launch commands provided below. For details on the runtime environment and deployment, please refer to this link. Here we provide a code snippet demonstrating how to use vLLM to run inference with Qwen3-VL locally. For more details on efficient deployment with vLLM, please refer to the community deployment guide. Here we provide a code snippet demonstrating how to use SGLang to run inference with Qwen3-VL locally. If you find our work helpful, feel free to give us a cite.
Qwen2-0.5B-Instruct-GGUF
Qwen2-1.5B-Instruct-GPTQ-Int4
Qwen3-32B-GGUF
Qwen2-1.5B-Instruct-AWQ
Qwen2.5-Coder-1.5B-Instruct-GGUF
QwQ-32B-AWQ
Qwen-1_8B-Chat
Qwen3-VL-2B-Instruct-FP8
> This repository contains an FP8 quantized version of the Qwen3-VL-2B-Instruct model. The quantization method is fine-grained fp8 quantization with block size of 128, and its performance metrics are nearly identical to those of the original BF16 model. Enjoy! Meet Qwen3-VL — the most powerful vision-language model in the Qwen series to date. This generation delivers comprehensive upgrades across the board: superior text understanding & generation, deeper visual perception & reasoning, extended context length, enhanced spatial and video dynamics comprehension, and stronger agent interaction capabilities. Available in Dense and MoE architectures that scale from edge to cloud, with Instruct and reasoning‑enhanced Thinking editions for flexible, on‑demand deployment. Visual Agent: Operates PC/mobile GUIs—recognizes elements, understands functions, invokes tools, completes tasks. Visual Coding Boost: Generates Draw.io/HTML/CSS/JS from images/videos. Advanced Spatial Perception: Judges object positions, viewpoints, and occlusions; provides stronger 2D grounding and enables 3D grounding for spatial reasoning and embodied AI. Long Context & Video Understanding: Native 256K context, expandable to 1M; handles books and hours-long video with full recall and second-level indexing. Enhanced Multimodal Reasoning: Excels in STEM/Math—causal analysis and logical, evidence-based answers. Upgraded Visual Recognition: Broader, higher-quality pretraining is able to “recognize everything”—celebrities, anime, products, landmarks, flora/fauna, etc. Expanded OCR: Supports 32 languages (up from 19); robust in low light, blur, and tilt; better with rare/ancient characters and jargon; improved long-document structure parsing. Text Understanding on par with pure LLMs: Seamless text–vision fusion for lossless, unified comprehension. 1. Interleaved-MRoPE: Full‑frequency allocation over time, width, and height via robust positional embeddings, enhancing long‑horizon video reasoning. 2. 
DeepStack: Fuses multi‑level ViT features to capture fine‑grained details and sharpen image–text alignment. 3. Text–Timestamp Alignment: Moves beyond T‑RoPE to precise, timestamp‑grounded event localization for stronger video temporal modeling. This is the weight repository for Qwen3-VL-2B-Instruct-FP8. Currently, 🤗 Transformers does not support loading these weights directly. Stay tuned! We recommend deploying the model using vLLM or SGLang, with example launch commands provided below. For details on the runtime environment and deployment, please refer to this link. Here we provide a code snippet demonstrating how to use vLLM to run inference with Qwen3-VL locally. For more details on efficient deployment with vLLM, please refer to the community deployment guide. Here we provide a code snippet demonstrating how to use SGLang to run inference with Qwen3-VL locally. If you find our work helpful, feel free to give us a cite.
Qwen2-VL-7B
We're excited to unveil Qwen2-VL, the latest iteration of our Qwen-VL model, representing nearly a year of innovation. > [!Important] > This is the base pretrained model of Qwen2-VL-7B without instruction tuning. SoTA understanding of images of various resolution & ratio: Qwen2-VL achieves state-of-the-art performance on visual understanding benchmarks, including MathVista, DocVQA, RealWorldQA, MTVQA, etc. Understanding videos of 20min+: Qwen2-VL can understand videos over 20 minutes for high-quality video-based question answering, dialog, content creation, etc. Agent that can operate your mobiles, robots, etc.: with the abilities of complex reasoning and decision making, Qwen2-VL can be integrated with devices like mobile phones, robots, etc., for automatic operation based on visual environment and text instructions. Multilingual Support: to serve global users, besides English and Chinese, Qwen2-VL now supports the understanding of texts in different languages inside images, including most European languages, Japanese, Korean, Arabic, Vietnamese, etc. Naive Dynamic Resolution: Unlike before, Qwen2-VL can handle arbitrary image resolutions, mapping them into a dynamic number of visual tokens, offering a more human-like visual processing experience. Multimodal Rotary Position Embedding (M-ROPE): Decomposes positional embedding into parts to capture 1D textual, 2D visual, and 3D video positional information, enhancing its multimodal processing capabilities. We have three models with 2, 7 and 72 billion parameters. This repo contains the pretrained 7B Qwen2-VL model. The code of Qwen2-VL has been in the latest Hugging Face `transformers` and we advise you to install the latest version with command `pip install -U transformers`, or you might encounter the following error: If you find our work helpful, feel free to give us a cite.
Qwen3-VL-2B-Thinking-FP8
> This repository contains an FP8 quantized version of the Qwen3-VL-2B-Thinking model. The quantization method is fine-grained fp8 quantization with block size of 128, and its performance metrics are nearly identical to those of the original BF16 model. Enjoy! Meet Qwen3-VL — the most powerful vision-language model in the Qwen series to date. This generation delivers comprehensive upgrades across the board: superior text understanding & generation, deeper visual perception & reasoning, extended context length, enhanced spatial and video dynamics comprehension, and stronger agent interaction capabilities. Available in Dense and MoE architectures that scale from edge to cloud, with Instruct and reasoning‑enhanced Thinking editions for flexible, on‑demand deployment. Visual Agent: Operates PC/mobile GUIs—recognizes elements, understands functions, invokes tools, completes tasks. Visual Coding Boost: Generates Draw.io/HTML/CSS/JS from images/videos. Advanced Spatial Perception: Judges object positions, viewpoints, and occlusions; provides stronger 2D grounding and enables 3D grounding for spatial reasoning and embodied AI. Long Context & Video Understanding: Native 256K context, expandable to 1M; handles books and hours-long video with full recall and second-level indexing. Enhanced Multimodal Reasoning: Excels in STEM/Math—causal analysis and logical, evidence-based answers. Upgraded Visual Recognition: Broader, higher-quality pretraining is able to “recognize everything”—celebrities, anime, products, landmarks, flora/fauna, etc. Expanded OCR: Supports 32 languages (up from 19); robust in low light, blur, and tilt; better with rare/ancient characters and jargon; improved long-document structure parsing. Text Understanding on par with pure LLMs: Seamless text–vision fusion for lossless, unified comprehension. 1. Interleaved-MRoPE: Full‑frequency allocation over time, width, and height via robust positional embeddings, enhancing long‑horizon video reasoning. 2. 
DeepStack: Fuses multi‑level ViT features to capture fine‑grained details and sharpen image–text alignment. 3. Text–Timestamp Alignment: Moves beyond T‑RoPE to precise, timestamp‑grounded event localization for stronger video temporal modeling. This is the weight repository for Qwen3-VL-2B-Thinking-FP8. Currently, 🤗 Transformers does not support loading these weights directly. Stay tuned! We recommend deploying the model using vLLM or SGLang, with example launch commands provided below. For details on the runtime environment and deployment, please refer to this link. Here we provide a code snippet demonstrating how to use vLLM to run inference with Qwen3-VL locally. For more details on efficient deployment with vLLM, please refer to the community deployment guide. Here we provide a code snippet demonstrating how to use SGLang to run inference with Qwen3-VL locally. If you find our work helpful, feel free to give us a cite.
Qwen3-4B-FP8
Qwen2.5-Coder-32B-Instruct-GGUF
Qwen2.5-Coder is the latest series of code-specific Qwen large language models (formerly known as CodeQwen). As of now, Qwen2.5-Coder covers six mainstream model sizes (0.5, 1.5, 3, 7, 14, and 32 billion parameters) to meet the needs of different developers. Qwen2.5-Coder brings the following improvements upon CodeQwen1.5: - Significant improvements in code generation, code reasoning and code fixing. Based on the strong Qwen2.5, we scale up the training tokens to 5.5 trillion, including source code, text-code grounding, synthetic data, etc. Qwen2.5-Coder-32B has become the current state-of-the-art open-source code LLM, with its coding abilities matching those of GPT-4o. - A more comprehensive foundation for real-world applications such as Code Agents, not only enhancing coding capabilities but also maintaining its strengths in mathematics and general competencies. - Long-context support up to 128K tokens. This repo contains the instruction-tuned 32B Qwen2.5-Coder model in the GGUF format, which has the following features: - Type: Causal Language Models - Training Stage: Pretraining & Post-training - Architecture: transformers with RoPE, SwiGLU, RMSNorm, and Attention QKV bias - Number of Parameters: 32.5B - Number of Parameters (Non-Embedding): 31.0B - Number of Layers: 64 - Number of Attention Heads (GQA): 40 for Q and 8 for KV - Context Length: Full 32,768 tokens - Note: Currently, only vLLM supports YaRN for length extrapolation. If you want to process sequences up to 131,072 tokens, please refer to non-GGUF models. - Quantization: q2_K, q3_K_M, q4_0, q4_K_M, q5_0, q5_K_M, q6_K, q8_0 For more details, please refer to our blog, GitHub, Documentation, Arxiv. Check out our llama.cpp documentation for a more detailed usage guide. We advise you to clone `llama.cpp` and install it following the official guide. We follow the latest version of llama.cpp. In the following demonstration, we assume that you are running commands under the repository `llama.cpp`. 
Since cloning the entire repo may be inefficient, you can manually download the GGUF file that you need or use `huggingface-cli`: 1. Install `huggingface-cli`. 2. Download the files you need. For large files, we split them into multiple segments due to the file upload size limit. They share a prefix, with a suffix indicating the index. For example, `qwen2.5-coder-32b-instruct-q5_k_m-00001-of-00003.gguf`, `qwen2.5-coder-32b-instruct-q5_k_m-00002-of-00003.gguf` and `qwen2.5-coder-32b-instruct-q5_k_m-00003-of-00003.gguf`. The download command fetches all of them. 3. (Optional) Merge: Split files need to be merged first with the `llama-gguf-split` command. To achieve a chatbot-like experience, it is recommended to start in conversation mode. Detailed evaluation results are reported in this 📑 blog. For requirements on GPU memory and the respective throughput, see results here. If you find our work helpful, feel free to cite us.
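The naming scheme for split files (a shared prefix plus a zero-padded index suffix) can be reproduced in a short sketch, which is handy for checking that every segment was downloaded before merging. The helper `shard_names` is hypothetical, and the prefix used here is only an example.

```python
from typing import List


def shard_names(prefix: str, total: int) -> List[str]:
    """List the file names of a GGUF model split into `total` segments."""
    if total < 1:
        raise ValueError("total must be at least 1")
    # Both the index and the total are zero-padded to five digits.
    return [f"{prefix}-{index:05d}-of-{total:05d}.gguf" for index in range(1, total + 1)]


names = shard_names("qwen2.5-coder-32b-instruct-q5_k_m", 3)
for name in names:
    print(name)
```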
Qwen1.5-0.5B-Chat-GGUF
Qwen3Guard-Gen-4B
Qwen3-30B-A3B-GGUF
Qwen3-VL-8B-Thinking-FP8
> This repository contains an FP8 quantized version of the Qwen3-VL-8B-Thinking model. The quantization method is fine-grained fp8 quantization with block size of 128, and its performance metrics a...
Qwen2.5-Coder-14B
This model is licensed under the Apache 2.0 license. For more information, visit the license link at https://huggingface.co/Qwen/Qwen2.5-Coder-14B/blob/main/LICENSE.
Qwen2-Math-7B-Instruct
Qwen2.5-Coder-0.5B-Instruct-GGUF
Qwen-1_8B-Chat-Int4
Qwen2.5-Coder-32B
Apache 2.0 license. License link: https://huggingface.co/Qwen/Qwen2.5-Coder-32B/blob/main/LICENSE.
Qwen2-72B-Instruct-GPTQ-Int4
Qwen2-7B-Instruct-GGUF
Qwen2 is the new series of Qwen large language models. For Qwen2, we release a number of base language models and instruction-tuned language models ranging from 0.5 to 72 billion parameters, including a Mixture-of-Experts model. This repo contains the instruction-tuned 7B Qwen2 model. Compared with state-of-the-art open-source language models, including the previously released Qwen1.5, Qwen2 has generally surpassed most open-source models and demonstrated competitiveness against proprietary models across a series of benchmarks targeting language understanding, language generation, multilingual capability, coding, mathematics, reasoning, etc. For more details, please refer to our blog, GitHub, and Documentation. In this repo, we provide the `fp16` model and quantized models in the GGUF format, including `q5_0`, `q5_k_m`, `q6_k` and `q8_0`. Model Details Qwen2 is a language model series including decoder language models of different model sizes. For each size, we release the base language model and the aligned chat model. It is based on the Transformer architecture with SwiGLU activation, attention QKV bias, group query attention, etc. Additionally, we have an improved tokenizer adapted to multiple natural languages and code. Training details We pretrained the models with a large amount of data, and we post-trained the models with both supervised finetuning and direct preference optimization. Requirements We advise you to clone `llama.cpp` and install it following the official guide. We follow the latest version of llama.cpp. In the following demonstration, we assume that you are running commands under the repository `llama.cpp`. How to use Cloning the repo may be inefficient, so you can manually download the GGUF file that you need or use `huggingface-cli` (`pip install huggingface_hub`). To run Qwen2, you can use `llama-cli` (the previous `main`) or `llama-server` (the previous `server`). 
We recommend using `llama-server` as it is simple and compatible with the OpenAI API. (Note: `-ngl 28` offloads 28 layers to GPUs, and `-fa` enables flash attention.) The deployed service can then be accessed through the OpenAI API. If you choose to use `llama-cli`, note that `-cml` for the ChatML template has been removed; use `--in-prefix` and `--in-suffix` instead. We implement perplexity evaluation using wikitext following the practice of `llama.cpp` with `./llama-perplexity` (the previous `./perplexity`). In the following we report the PPL of GGUF models of different sizes and different quantization levels.

| Size     | fp16  | q8_0  | q6_k  | q5_k_m | q5_0  | q4_k_m | q4_0  | q3_k_m | q2_k  | iq1_m |
|----------|-------|-------|-------|--------|-------|--------|-------|--------|-------|-------|
| 0.5B     | 15.11 | 15.13 | 15.14 | 15.24  | 15.40 | 15.36  | 16.28 | 15.70  | 16.74 | -     |
| 1.5B     | 10.43 | 10.43 | 10.45 | 10.50  | 10.56 | 10.61  | 10.79 | 11.08  | 13.04 | -     |
| 7B       | 7.93  | 7.94  | 7.96  | 7.97   | 7.98  | 8.02   | 8.19  | 8.20   | 10.58 | -     |
| 57B-A14B | 6.81  | 6.81  | 6.83  | 6.84   | 6.89  | 6.99   | 7.02  | 7.43   | -     | -     |
| 72B      | 5.58  | 5.58  | 5.59  | 5.59   | 5.60  | 5.61   | 5.66  | 5.68   | 5.91  | 6.75  |

If you find our work helpful, feel free to cite us.
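One way to read the perplexity table is as a relative increase over the fp16 baseline. The short sketch below does that arithmetic for the 7B and 72B rows; the helper `relative_increase` is hypothetical, and the quantization names are written with underscores.

```python
# Perplexity values copied from the 7B and 72B rows of the table above.
PPL = {
    "7B": {"fp16": 7.93, "q8_0": 7.94, "q4_k_m": 8.02, "q2_k": 10.58},
    "72B": {"fp16": 5.58, "q8_0": 5.58, "q4_k_m": 5.61, "q2_k": 5.91},
}


def relative_increase(size: str, quant: str) -> float:
    """Percentage increase in perplexity of `quant` over fp16 for one model size."""
    baseline = PPL[size]["fp16"]
    return 100.0 * (PPL[size][quant] - baseline) / baseline


for size in PPL:
    for quant in ("q8_0", "q4_k_m", "q2_k"):
        print(f"{size} {quant}: +{relative_increase(size, quant):.2f}%")
```

On these numbers, 4-bit quantization costs roughly 1% in perplexity for the 7B model, while 2-bit quantization costs about 33% for the 7B model and about 6% for the 72B model, so the larger model tolerates aggressive quantization better.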
Qwen2.5-14B-Instruct-GGUF
Qwen3-VL-30B-A3B-Thinking-GGUF
This repository provides GGUF-format weights for Qwen3-VL-30B-A3B-Thinking, split into two components:
- Language model (LLM): FP16, Q8_0, Q4_K_M
- Vision encoder (`mmproj`): FP16, Q8_0

These files are compatible with llama.cpp, Ollama, and other GGUF-based tools, supporting inference on CPU, NVIDIA GPU (CUDA), Apple Silicon (Metal), Intel GPUs (SYCL), and more. You can mix precision levels for the language and vision components based on your hardware and performance needs, and even perform custom quantization starting from the FP16 weights. Enjoy running this multimodal model on your personal device! 🚀

Meet Qwen3-VL, the most powerful vision-language model in the Qwen series to date. This generation delivers comprehensive upgrades across the board: superior text understanding & generation, deeper visual perception & reasoning, extended context length, enhanced spatial and video dynamics comprehension, and stronger agent interaction capabilities. Available in Dense and MoE architectures that scale from edge to cloud, with Instruct and reasoning-enhanced Thinking editions for flexible, on-demand deployment.

- Visual Agent: Operates PC/mobile GUIs: recognizes elements, understands functions, invokes tools, completes tasks.
- Visual Coding Boost: Generates Draw.io/HTML/CSS/JS from images/videos.
- Advanced Spatial Perception: Judges object positions, viewpoints, and occlusions; provides stronger 2D grounding and enables 3D grounding for spatial reasoning and embodied AI.
- Long Context & Video Understanding: Native 256K context, expandable to 1M; handles books and hours-long video with full recall and second-level indexing.
- Enhanced Multimodal Reasoning: Excels in STEM/Math, with causal analysis and logical, evidence-based answers.
- Upgraded Visual Recognition: Broader, higher-quality pretraining lets the model “recognize everything”: celebrities, anime, products, landmarks, flora/fauna, etc.
- Expanded OCR: Supports 32 languages (up from 19); robust in low light, blur, and tilt; better with rare/ancient characters and jargon; improved long-document structure parsing.
- Text Understanding on par with pure LLMs: Seamless text–vision fusion for lossless, unified comprehension.

1. Interleaved-MRoPE: Full-frequency allocation over time, width, and height via robust positional embeddings, enhancing long-horizon video reasoning.
2. DeepStack: Fuses multi-level ViT features to capture fine-grained details and sharpen image–text alignment.
3. Text–Timestamp Alignment: Moves beyond T-RoPE to precise, timestamp-grounded event localization for stronger video temporal modeling.

To use these models with `llama.cpp`, please ensure you are using the latest version, either by building from source or by downloading the most recent release for your device. You can run inference via the command line or through a web-based chat interface: for example, running Qwen3-VL-2B-Instruct with an FP16 vision encoder and a Q8_0 quantized LLM, or serving Qwen3-VL-235B-A22B-Instruct via an OpenAI-compatible API with a web UI.

> Tip: For models split into multiple GGUF files, simply specify the first shard (e.g., `...-00001-of-00003.gguf`). llama.cpp will automatically load all parts.

Once the server is running, open your browser to `http://localhost:8080` to access the built-in chat interface, or send requests to the `/v1/chat/completions` endpoint. For more details, refer to the official documentation.

You can further quantize the FP16 weights to other precision levels, for example to 2-bit. For a full list of supported quantization types and detailed instructions, refer to the quantization documentation.

If you find our work helpful, feel free to give us a cite.
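As a rough illustration of sending requests to the `/v1/chat/completions` endpoint mentioned above, the sketch below builds an OpenAI-style chat request with the Python standard library. The host and path come from the text; the helper names, the PNG media type, and the assumption that the server accepts images as base64 data URIs in `image_url` parts are illustrative and should be checked against the server documentation.

```python
import base64
import json
import urllib.request

def build_chat_request(prompt, image_bytes=None, base_url="http://localhost:8080"):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    content = [{"type": "text", "text": prompt}]
    if image_bytes is not None:
        # Assumption: images are passed as OpenAI-style base64 data URIs.
        encoded = base64.b64encode(image_bytes).decode("ascii")
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{encoded}"},
        })
    payload = {"messages": [{"role": "user", "content": content}]}
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def send(request):
    """Send the request to a running server and return the decoded JSON reply."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))

# Building the request needs no server; call send(request) once one is running.
request = build_chat_request("Describe the layout of this page.")
print(request.full_url)
```

Keeping request construction separate from sending makes the payload easy to inspect before any model is loaded.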
Qwen3-VL-4B-Instruct-GGUF
This repository provides GGUF-format weights for Qwen3-VL-4B-Instruct, split into two components:
- Language model (LLM): FP16, Q8_0, Q4_K_M
- Vision encoder (`mmproj`): FP16, Q8_0

These files are compatible with llama.cpp, Ollama, and other GGUF-based tools, supporting inference on CPU, NVIDIA GPU (CUDA), Apple Silicon (Metal), Intel GPUs (SYCL), and more. You can mix precision levels for the language and vision components based on your hardware and performance needs, and even perform custom quantization starting from the FP16 weights. Enjoy running this multimodal model on your personal device! 🚀

Meet Qwen3-VL, the most powerful vision-language model in the Qwen series to date. This generation delivers comprehensive upgrades across the board: superior text understanding & generation, deeper visual perception & reasoning, extended context length, enhanced spatial and video dynamics comprehension, and stronger agent interaction capabilities. Available in Dense and MoE architectures that scale from edge to cloud, with Instruct and reasoning-enhanced Thinking editions for flexible, on-demand deployment.

- Visual Agent: Operates PC/mobile GUIs: recognizes elements, understands functions, invokes tools, completes tasks.
- Visual Coding Boost: Generates Draw.io/HTML/CSS/JS from images/videos.
- Advanced Spatial Perception: Judges object positions, viewpoints, and occlusions; provides stronger 2D grounding and enables 3D grounding for spatial reasoning and embodied AI.
- Long Context & Video Understanding: Native 256K context, expandable to 1M; handles books and hours-long video with full recall and second-level indexing.
- Enhanced Multimodal Reasoning: Excels in STEM/Math, with causal analysis and logical, evidence-based answers.
- Upgraded Visual Recognition: Broader, higher-quality pretraining lets the model “recognize everything”: celebrities, anime, products, landmarks, flora/fauna, etc.
- Expanded OCR: Supports 32 languages (up from 19); robust in low light, blur, and tilt; better with rare/ancient characters and jargon; improved long-document structure parsing.
- Text Understanding on par with pure LLMs: Seamless text–vision fusion for lossless, unified comprehension.

1. Interleaved-MRoPE: Full-frequency allocation over time, width, and height via robust positional embeddings, enhancing long-horizon video reasoning.
2. DeepStack: Fuses multi-level ViT features to capture fine-grained details and sharpen image–text alignment.
3. Text–Timestamp Alignment: Moves beyond T-RoPE to precise, timestamp-grounded event localization for stronger video temporal modeling.

If you find our work helpful, feel free to give us a cite.
Qwen3-VL-4B-Thinking-FP8
> This repository contains an FP8 quantized version of the Qwen3-VL-4B-Thinking model. The quantization method is fine-grained fp8 quantization with block size of 128, and its performance metrics a...
Qwen3-VL-8B-Thinking-GGUF
This repository provides GGUF-format weights for Qwen3-VL-8B-Thinking, split into two components:
- Language model (LLM): FP16, Q8_0, Q4_K_M
- Vision encoder (`mmproj`): FP16, Q8_0

These files are compatible with llama.cpp, Ollama, and other GGUF-based tools, supporting inference on CPU, NVIDIA GPU (CUDA), Apple Silicon (Metal), Intel GPUs (SYCL), and more. You can mix precision levels for the language and vision components based on your hardware and performance needs, and even perform custom quantization starting from the FP16 weights. Enjoy running this multimodal model on your personal device! 🚀

Meet Qwen3-VL, the most powerful vision-language model in the Qwen series to date. This generation delivers comprehensive upgrades across the board: superior text understanding & generation, deeper visual perception & reasoning, extended context length, enhanced spatial and video dynamics comprehension, and stronger agent interaction capabilities. Available in Dense and MoE architectures that scale from edge to cloud, with Instruct and reasoning-enhanced Thinking editions for flexible, on-demand deployment.

- Visual Agent: Operates PC/mobile GUIs: recognizes elements, understands functions, invokes tools, completes tasks.
- Visual Coding Boost: Generates Draw.io/HTML/CSS/JS from images/videos.
- Advanced Spatial Perception: Judges object positions, viewpoints, and occlusions; provides stronger 2D grounding and enables 3D grounding for spatial reasoning and embodied AI.
- Long Context & Video Understanding: Native 256K context, expandable to 1M; handles books and hours-long video with full recall and second-level indexing.
- Enhanced Multimodal Reasoning: Excels in STEM/Math, with causal analysis and logical, evidence-based answers.
- Upgraded Visual Recognition: Broader, higher-quality pretraining lets the model “recognize everything”: celebrities, anime, products, landmarks, flora/fauna, etc.
- Expanded OCR: Supports 32 languages (up from 19); robust in low light, blur, and tilt; better with rare/ancient characters and jargon; improved long-document structure parsing.
- Text Understanding on par with pure LLMs: Seamless text–vision fusion for lossless, unified comprehension.

1. Interleaved-MRoPE: Full-frequency allocation over time, width, and height via robust positional embeddings, enhancing long-horizon video reasoning.
2. DeepStack: Fuses multi-level ViT features to capture fine-grained details and sharpen image–text alignment.
3. Text–Timestamp Alignment: Moves beyond T-RoPE to precise, timestamp-grounded event localization for stronger video temporal modeling.

To use these models with `llama.cpp`, please ensure you are using the latest version, either by building from source or by downloading the most recent release for your device. You can run inference via the command line or through a web-based chat interface: for example, running Qwen3-VL-8B-Thinking with an FP16 vision encoder and a Q8_0 quantized LLM, or serving Qwen3-VL-235B-A22B-Instruct via an OpenAI-compatible API with a web UI.

> Tip: For models split into multiple GGUF files, simply specify the first shard (e.g., `...-00001-of-00003.gguf`). llama.cpp will automatically load all parts.

Once the server is running, open your browser to `http://localhost:8080` to access the built-in chat interface, or send requests to the `/v1/chat/completions` endpoint. For more details, refer to the official documentation.

You can further quantize the FP16 weights to other precision levels, for example to 2-bit. For a full list of supported quantization types and detailed instructions, refer to the quantization documentation.

If you find our work helpful, feel free to give us a cite.
Qwen2-VL-2B
Qwen3-VL-2B-Instruct-GGUF
This repository provides GGUF-format weights for Qwen3-VL-2B-Instruct, split into two components:
- Language model (LLM): FP16, Q8_0, Q4_K_M
- Vision encoder (`mmproj`): FP16, Q8_0

These files are compatible with llama.cpp, Ollama, and other GGUF-based tools, supporting inference on CPU, NVIDIA GPU (CUDA), Apple Silicon (Metal), Intel GPUs (SYCL), and more. You can mix precision levels for the language and vision components based on your hardware and performance needs, and even perform custom quantization starting from the FP16 weights. Enjoy running this multimodal model on your personal device! 🚀

Meet Qwen3-VL, the most powerful vision-language model in the Qwen series to date. This generation delivers comprehensive upgrades across the board: superior text understanding & generation, deeper visual perception & reasoning, extended context length, enhanced spatial and video dynamics comprehension, and stronger agent interaction capabilities. Available in Dense and MoE architectures that scale from edge to cloud, with Instruct and reasoning-enhanced Thinking editions for flexible, on-demand deployment.

- Visual Agent: Operates PC/mobile GUIs: recognizes elements, understands functions, invokes tools, completes tasks.
- Visual Coding Boost: Generates Draw.io/HTML/CSS/JS from images/videos.
- Advanced Spatial Perception: Judges object positions, viewpoints, and occlusions; provides stronger 2D grounding and enables 3D grounding for spatial reasoning and embodied AI.
- Long Context & Video Understanding: Native 256K context, expandable to 1M; handles books and hours-long video with full recall and second-level indexing.
- Enhanced Multimodal Reasoning: Excels in STEM/Math, with causal analysis and logical, evidence-based answers.
- Upgraded Visual Recognition: Broader, higher-quality pretraining lets the model “recognize everything”: celebrities, anime, products, landmarks, flora/fauna, etc.
- Expanded OCR: Supports 32 languages (up from 19); robust in low light, blur, and tilt; better with rare/ancient characters and jargon; improved long-document structure parsing.
- Text Understanding on par with pure LLMs: Seamless text–vision fusion for lossless, unified comprehension.

1. Interleaved-MRoPE: Full-frequency allocation over time, width, and height via robust positional embeddings, enhancing long-horizon video reasoning.
2. DeepStack: Fuses multi-level ViT features to capture fine-grained details and sharpen image–text alignment.
3. Text–Timestamp Alignment: Moves beyond T-RoPE to precise, timestamp-grounded event localization for stronger video temporal modeling.

To use these models with `llama.cpp`, please ensure you are using the latest version, either by building from source or by downloading the most recent release for your device. You can run inference via the command line or through a web-based chat interface: for example, running Qwen3-VL-2B-Instruct with an FP16 vision encoder and a Q8_0 quantized LLM, or serving Qwen3-VL-235B-A22B-Instruct via an OpenAI-compatible API with a web UI.

> Tip: For models split into multiple GGUF files, simply specify the first shard (e.g., `...-00001-of-00003.gguf`). llama.cpp will automatically load all parts.

Once the server is running, open your browser to `http://localhost:8080` to access the built-in chat interface, or send requests to the `/v1/chat/completions` endpoint. For more details, refer to the official documentation.

You can further quantize the FP16 weights to other precision levels, for example to 2-bit. For a full list of supported quantization types and detailed instructions, refer to the quantization documentation.

If you find our work helpful, feel free to give us a cite.
Qwen3-VL-30B-A3B-Instruct-GGUF
Qwen-1_8B
QwQ-32B-GGUF
QwQ is the reasoning model of the Qwen series. Compared with conventional instruction-tuned models, QwQ, which is capable of thinking and reasoning, can achieve significantly enhanced performance in downstream tasks, especially hard problems. QwQ-32B is the medium-sized reasoning model, which achieves competitive performance against state-of-the-art reasoning models, e.g., DeepSeek-R1 and o1-mini.

This repo contains the QwQ 32B model in the GGUF format, which has the following features:
- Type: Causal Language Models
- Training Stage: Pretraining & Post-training (Supervised Finetuning and Reinforcement Learning)
- Architecture: transformers with RoPE, SwiGLU, RMSNorm, and Attention QKV bias
- Number of Parameters: 32.5B
- Number of Parameters (Non-Embedding): 31.0B
- Number of Layers: 64
- Number of Attention Heads (GQA): 40 for Q and 8 for KV
- Context Length: Full 131,072 tokens
- Quantization: q4_K_M, q5_0, q5_K_M, q6_K, q8_0

Note: For the best experience, please review the usage guidelines before deploying QwQ models.

You can try our demo or access QwQ models via QwenChat. For more details, please refer to our blog, GitHub, and Documentation.

QwQ is based on Qwen2.5, whose code is included in the latest Hugging Face `transformers`. We advise you to use the latest version of `transformers`.
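The attention figures listed for QwQ-32B (64 layers, 40 query heads, 8 key/value heads, 131,072-token context) suggest how much grouped-query attention reduces the key/value cache. The estimate below is a back-of-the-envelope sketch: the head dimension of 128 and the 2-byte cache precision are assumptions that the model card does not state, and real runtimes add their own overhead.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence."""
    # Factor 2: one tensor for keys and one for values in every layer.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value

HEAD_DIM = 128  # assumption: not stated in the model card

# Grouped-query attention as listed: 8 KV heads shared by 40 query heads.
gqa = kv_cache_bytes(num_layers=64, num_kv_heads=8, head_dim=HEAD_DIM, seq_len=131_072)
# Hypothetical comparison: one KV head per query head (plain multi-head attention).
mha = kv_cache_bytes(num_layers=64, num_kv_heads=40, head_dim=HEAD_DIM, seq_len=131_072)

print(f"GQA (8 KV heads):  {gqa / 2**30:.1f} GiB")
print(f"MHA (40 KV heads): {mha / 2**30:.1f} GiB")
print(f"reduction factor:  {mha / gqa:.1f}x")
```

Under these assumptions the cache shrinks in proportion to the ratio of query heads to key/value heads, which is the main reason long contexts remain practical on a single device.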
With `transformers<4.37.0`, you will encounter an error.

Example prompt in ChatML format: `"<|im_start|>user\nHow many r's are in the word \"strawberry\"<|im_end|>\n<|im_start|>assistant\n<think>\n"`

```
@misc{qwq32b,
    title = {QwQ-32B: Embracing the Power of Reinforcement Learning},
    url = {https://qwenlm.github.io/blog/qwq-32b/},
    author = {Qwen Team},
    month = {March},
    year = {2025}
}

@article{qwen2.5,
    title={Qwen2.5 Technical Report},
    author={An Yang and Baosong Yang and Beichen Zhang and Binyuan Hui and Bo Zheng and Bowen Yu and Chengyuan Li and Dayiheng Liu and Fei Huang and Haoran Wei and Huan Lin and Jian Yang and Jianhong Tu and Jianwei Zhang and Jianxin Yang and Jiaxi Yang and Jingren Zhou and Junyang Lin and Kai Dang and Keming Lu and Keqin Bao and Kexin Yang and Le Yu and Mei Li and Mingfeng Xue and Pei Zhang and Qin Zhu and Rui Men and Runji Lin and Tianhao Li and Tianyi Tang and Tingyu Xia and Xingzhang Ren and Xuancheng Ren and Yang Fan and Yang Su and Yichang Zhang and Yu Wan and Yuqiong Liu and Zeyu Cui and Zhenru Zhang and Zihan Qiu},
    journal={arXiv preprint arXiv:2412.15115},
    year={2024}
}
```
Qwen2.5-7B-Instruct-GPTQ-Int8
Qwen-14B
Qwen-14B is the 14B-parameter version of the large language model series, Qwen (abbr. Tongyi Qianwen), proposed by Alibaba Cloud. Qwen-14B is a Transformer-based large language model, which is pretrained on a large volume of data, including web texts, books, code, etc. Additionally, based on the pretrained Qwen-14B, we release Qwen-14B-Chat, a large-model-based AI assistant, which is trained with alignment techniques. This repository is the one for Qwen-14B.

1. Large-scale high-quality training corpora: It is pretrained on over 3 trillion tokens, including Chinese, English, multilingual texts, code, and mathematics, covering general and professional fields. The distribution of the pre-training corpus has been optimized through a large number of ablation experiments.
2. Competitive performance: It significantly surpasses existing open-source models of similar scale on multiple Chinese and English downstream evaluation tasks (including commonsense, reasoning, code, mathematics, etc.), and even surpasses some larger-scale models in several benchmarks. See below for specific evaluation results.
3. More comprehensive vocabulary coverage: Compared with other open-source models based on Chinese and English vocabularies, Qwen-14B uses a vocabulary of over 150K tokens. This vocabulary is more friendly to multiple languages, enabling users to further enhance the capability for certain languages without expanding the vocabulary.

For more details about the open-source model of Qwen-14B, please refer to the GitHub code repository.

- python 3.8 and above
- pytorch 1.12 and above, 2.0 and above are recommended
- CUDA 11.4 and above are recommended (this is for GPU users, flash-attention users, etc.)

To run Qwen-14B, please make sure you meet the above requirements, and then install the dependent libraries with pip. In addition, it is recommended to install the `flash-attention` library (flash attention 2 is now supported) for higher efficiency and lower memory usage.

You can easily call the model with a few lines of code. For more information, please refer to our GitHub repo.

Our tokenizer based on tiktoken is different from other tokenizers, e.g., the sentencepiece tokenizer. You need to pay attention to special tokens, especially in finetuning. For more detailed information on the tokenizer and related use in fine-tuning, please refer to the documentation.
The details of the model architecture of Qwen-14B are listed as follows:

| Hyperparameter | Value |
|:----------------|:-------|
| n_layers | 40 |
| n_heads | 40 |
| d_model | 5120 |
| vocab size | 151851 |
| sequence length | 2048 |

For position encoding, FFN activation function, and normalization methods, we adopt the prevalent practices, i.e., RoPE relative position encoding, SwiGLU for activation function, and RMSNorm for normalization (optional installation of flash-attention for acceleration).

For tokenization, compared to the current mainstream open-source models based on Chinese and English vocabularies, Qwen-14B uses a vocabulary of over 150K tokens. The vocabulary builds on `cl100k_base`, the BPE vocabulary used by GPT-4, and is optimized for Chinese and multilingual text. It first considers efficient encoding of Chinese, English, and code data, and is also more friendly to other languages, enabling users to directly enhance the capability of some languages without expanding the vocabulary. It segments numbers by single digit, and calls the tiktoken tokenizer library for efficient tokenization.

We randomly selected 1 million documents for each language to compare the encoding compression rates of different models (with XLM-R, which supports 100 languages, as the base value 1; lower is better). The specific performance is shown in the figure.

As can be seen, while ensuring the efficient decoding of Chinese, English, and code, Qwen-14B also achieves a high compression rate for many other languages (such as th, he, ar, ko, vi, ja, tr, id, pl, ru, nl, pt, it, de, es, fr etc.), equipping the model with strong scalability as well as high training and inference efficiency in these languages.

For pre-training data, on the one hand, Qwen-14B uses part of the open-source generic corpus. On the other hand, it uses a massive amount of accumulated web corpus and high-quality text content. The scale of the corpus reaches over 3T tokens after deduplication and filtration, encompassing web text, encyclopedias, books, code, mathematics, and various domains.

Evaluation

We selected MMLU, C-Eval, GSM8K, MATH, HumanEval, MBPP, BBH, and CMMLU, which are currently popular benchmarks, to test the model's Chinese and English knowledge capabilities, translation, mathematical reasoning, coding and other capabilities. From the following comprehensive evaluation results, we can see that the Qwen model outperforms the similarly sized open-source models on all tasks.
| Model | MMLU | C-Eval | GSM8K | MATH | HumanEval | MBPP | BBH | CMMLU |
|:-------------------|:--------:|:--------:|:--------:|:--------:|:---------:|:--------:|:--------:|:--------:|
| | 5-shot | 5-shot | 8-shot | 4-shot | 0-shot | 3-shot | 3-shot | 5-shot |
| LLaMA2-7B | 46.8 | 32.5 | 16.7 | 3.3 | 12.8 | 20.8 | 38.2 | 31.8 |
| LLaMA2-13B | 55.0 | 41.4 | 29.6 | 5.0 | 18.9 | 30.3 | 45.6 | 38.4 |
| LLaMA2-34B | 62.6 | - | 42.2 | 6.2 | 22.6 | 33.0 | 44.1 | - |
| ChatGLM2-6B | 47.9 | 51.7 | 32.4 | 6.5 | - | - | 33.7 | - |
| InternLM-7B | 51.0 | 53.4 | 31.2 | 6.3 | 10.4 | 14.0 | 37.0 | 51.8 |
| InternLM-20B | 62.1 | 58.8 | 52.6 | 7.9 | 25.6 | 35.6 | 52.5 | 59.0 |
| Baichuan2-7B | 54.7 | 56.3 | 24.6 | 5.6 | 18.3 | 24.2 | 41.6 | 57.1 |
| Baichuan2-13B | 59.5 | 59.0 | 52.8 | 10.1 | 17.1 | 30.2 | 49.0 | 62.0 |
| Qwen-7B (original) | 56.7 | 59.6 | 51.6 | - | 24.4 | 31.2 | 40.6 | 58.8 |
| Qwen-7B | 58.2 | 63.5 | 51.7 | 11.6 | 29.9 | 31.6 | 45.0 | 62.2 |
| Qwen-14B | 66.3 | 72.1 | 61.3 | 24.8 | 32.3 | 40.8 | 53.4 | 71.0 |

We introduce NTK-aware interpolation, LogN attention scaling, window attention, etc. to extend the context length of Qwen-7B (original) and Qwen-14B from 2K to over 8K tokens, and that of Qwen-7B from 8K to 32K tokens. We conduct language modeling experiments on the arXiv dataset with the PPL evaluation, measuring Qwen-7B and Qwen-14B at different sequence lengths. (To use NTK interpolation and LogN scaling, please set `use_dynamic_ntk` and `use_logn_attn` to true in config.json.)

PPL by configuration, one value per evaluated sequence length (shortest first):
- Qwen-7B (original): 4.23 | 3.78 | 39.35 | 469.81 | 2645.09 | -
- + dynamic_ntk + logn + window_attn: 4.23 | 3.78 | 3.58 | 3.49 | 4.32 | -
- + dynamic_ntk + logn + window_attn: 4.23 | 3.81 | 3.52 | 3.33 | 3.22 | 3.17
- + dynamic_ntk + logn + window_attn: - | 3.46 | 3.29 | 3.18 | 3.42 | -

We have provided evaluation scripts to reproduce the performance of our model; see the link for details. Note: small fluctuations in reproduced results are normal, owing to rounding errors caused by hardware and framework differences.
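The two context-extension techniques named above, LogN attention scaling and NTK-aware interpolation, can be sketched in a few lines. This illustrates the general ideas under stated assumptions and is not Qwen's actual implementation: the function names, the RoPE base of 10000, and the fixed scaling factor `alpha` are hypothetical, while the training length (2048) and head dimension (`d_model / n_heads` = 5120 / 40 = 128) follow from the hyperparameter table.

```python
import math

def logn_scale(position, train_len):
    """Query scaling factor: log base train_len of position, never below 1."""
    if position <= train_len:
        return 1.0
    return math.log(position) / math.log(train_len)

def ntk_scaled_base(base, head_dim, alpha):
    """RoPE base after NTK-aware scaling by a factor alpha >= 1."""
    return base * alpha ** (head_dim / (head_dim - 2))

def rope_inv_freq(base, head_dim):
    """Inverse frequencies for each pair of rotary dimensions."""
    return [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]

TRAIN_LEN = 2048   # sequence length from the hyperparameter table
HEAD_DIM = 128     # d_model / n_heads = 5120 / 40
BASE = 10000.0     # assumption: common RoPE default, not stated in the card

# LogN scaling leaves positions inside the training window untouched.
for n in (1024, 2048, 8192):
    print(n, round(logn_scale(n, TRAIN_LEN), 3))

# NTK-aware scaling keeps the fastest rotation and slows the slowest by alpha.
original = rope_inv_freq(BASE, HEAD_DIM)
stretched = rope_inv_freq(ntk_scaled_base(BASE, HEAD_DIM, alpha=4.0), HEAD_DIM)
print(original[0] / stretched[0], original[-1] / stretched[-1])
```

The exponent `head_dim / (head_dim - 2)` is chosen so that the lowest frequency is stretched by exactly `alpha` while the highest is unchanged, which preserves local position resolution while extending the usable range.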
If you run into problems, please consult the FAQ and the existing issues for a solution before opening a new issue.

If you find our work helpful, feel free to give us a cite.

Our code and checkpoints are open for research purposes, and commercial use is permitted. Check LICENSE for more details about the license. If you need commercial use, please fill out the form to apply.

If you would like to leave a message for our research team or product team, join our WeChat, DingTalk, or Discord groups! Also, feel free to send an email to [email protected].
Qwen3-VL-4B-Thinking-GGUF
This repository provides GGUF-format weights for Qwen3-VL-4B-Thinking, split into two components:
- Language model (LLM): FP16, Q8_0, Q4_K_M
- Vision encoder (`mmproj`): FP16, Q8_0

These files are compatible with llama.cpp, Ollama, and other GGUF-based tools, supporting inference on CPU, NVIDIA GPU (CUDA), Apple Silicon (Metal), Intel GPUs (SYCL), and more. You can mix precision levels for the language and vision components based on your hardware and performance needs, and even perform custom quantization starting from the FP16 weights. Enjoy running this multimodal model on your personal device! 🚀

Meet Qwen3-VL, the most powerful vision-language model in the Qwen series to date. This generation delivers comprehensive upgrades across the board: superior text understanding & generation, deeper visual perception & reasoning, extended context length, enhanced spatial and video dynamics comprehension, and stronger agent interaction capabilities. Available in Dense and MoE architectures that scale from edge to cloud, with Instruct and reasoning-enhanced Thinking editions for flexible, on-demand deployment.

- Visual Agent: Operates PC/mobile GUIs: recognizes elements, understands functions, invokes tools, completes tasks.
- Visual Coding Boost: Generates Draw.io/HTML/CSS/JS from images/videos.
- Advanced Spatial Perception: Judges object positions, viewpoints, and occlusions; provides stronger 2D grounding and enables 3D grounding for spatial reasoning and embodied AI.
- Long Context & Video Understanding: Native 256K context, expandable to 1M; handles books and hours-long video with full recall and second-level indexing.
- Enhanced Multimodal Reasoning: Excels in STEM/Math, with causal analysis and logical, evidence-based answers.
- Upgraded Visual Recognition: Broader, higher-quality pretraining lets the model “recognize everything”: celebrities, anime, products, landmarks, flora/fauna, etc.
- Expanded OCR: Supports 32 languages (up from 19); robust in low light, blur, and tilt; better with rare/ancient characters and jargon; improved long-document structure parsing.
- Text Understanding on par with pure LLMs: Seamless text–vision fusion for lossless, unified comprehension.

1. Interleaved-MRoPE: Full-frequency allocation over time, width, and height via robust positional embeddings, enhancing long-horizon video reasoning.
2. DeepStack: Fuses multi-level ViT features to capture fine-grained details and sharpen image–text alignment.
3. Text–Timestamp Alignment: Moves beyond T-RoPE to precise, timestamp-grounded event localization for stronger video temporal modeling.

To use these models with `llama.cpp`, please ensure you are using the latest version, either by building from source or by downloading the most recent release for your device. You can run inference via the command line or through a web-based chat interface: for example, running Qwen3-VL-4B-Thinking with an FP16 vision encoder and a Q8_0 quantized LLM, or serving Qwen3-VL-235B-A22B-Instruct via an OpenAI-compatible API with a web UI.

> Tip: For models split into multiple GGUF files, simply specify the first shard (e.g., `...-00001-of-00003.gguf`). llama.cpp will automatically load all parts.

Once the server is running, open your browser to `http://localhost:8080` to access the built-in chat interface, or send requests to the `/v1/chat/completions` endpoint. For more details, refer to the official documentation.

You can further quantize the FP16 weights to other precision levels, for example to 2-bit. For a full list of supported quantization types and detailed instructions, refer to the quantization documentation.

If you find our work helpful, feel free to give us a cite.
Qwen2.5-Coder-1.5B-Instruct-AWQ
Qwen2.5-0.5B-Instruct-AWQ
Qwen-14B-Chat
Qwen-14B is the 14B-parameter version of the large language model series, Qwen (abbr. Tongyi Qianwen), proposed by Alibaba Cloud. Qwen-14B is a Transformer-based large language model, which is pretrained on a large volume of data, including web texts, books, code, etc. Additionally, based on the pretrained Qwen-14B, we release Qwen-14B-Chat, a large-model-based AI assistant, which is trained with alignment techniques. This repository is the one for Qwen-14B-Chat.

For more details about the open-source model of Qwen-14B, please refer to the GitHub code repository.

- python 3.8 and above
- pytorch 1.12 and above, 2.0 and above are recommended
- CUDA 11.4 and above are recommended (this is for GPU users, flash-attention users, etc.)

To run Qwen-14B-Chat, please make sure you meet the above requirements, and then install the dependent libraries with pip. In addition, it is recommended to install the `flash-attention` library (flash attention 2 is now supported) for higher efficiency and lower memory usage.

Qwen-14B-Chat supports multi-turn interaction. For more information, please refer to our GitHub repo.
请注意:我们更新量化方案为基于AutoGPTQ的量化,提供Qwen-14B-Chat的Int4量化模型点击这里。相比此前方案,该方案在模型评测效果几乎无损,且存储需求更低,推理速度更优。 Note: we provide a new solution based on AutoGPTQ, and release an Int4 quantized model for Qwen-14B-Chat Click here, which achieves nearly lossless model effects but improved performance on both memory costs and inference speed, in comparison with the previous solution. 以下我们提供示例说明如何使用Int4量化模型。在开始使用前,请先保证满足要求(如torch 2.0及以上,transformers版本为4.32.0及以上,等等),并安装所需安装包: Here we demonstrate how to use our provided quantized models for inference. Before you start, make sure you meet the requirements of auto-gptq (e.g., torch 2.0 and above, transformers 4.32.0 and above, etc.) and install the required packages: If you meet problems installing `auto-gptq`, we advise you to check out the official repo to find a pre-build wheel. Then you can load the quantized model easily and run inference as same as usual: 我们对BF16,Int8和Int4模型在基准评测上做了测试(使用zero-shot设置),发现量化模型效果损失较小,结果如下所示: We illustrate the zero-shot performance of both BF16, Int8 and Int4 models on the benchmark, and we find that the quantized model does not suffer from significant performance degradation. Results are shown below: | Quantization | MMLU | CEval (val) | GSM8K | Humaneval | |--------------|:----:|:-----------:|:-----:|:---------:| | BF16 | 64.6 | 69.8 | 60.1 | 43.9 | | Int8 | 63.6 | 68.6 | 60.0 | 48.2 | | Int4 | 63.3 | 69.0 | 59.8 | 45.7 | 我们测算了不同精度模型以及不同FlashAttn库版本下模型生成2048和8192个token的平均推理速度。如图所示: We measured the average inference speed of generating 2048 and 8192 tokens with different quantization levels and versions of flash-attention, respectively. 
| Quantization | FlashAttn | Speed (2048 tokens) | Speed (8192 tokens) |
| ------------- | :-------: | :------------------:| :------------------:|
| BF16 | v2 | 32.88 | 24.87 |
| Int8 | v2 | 29.28 | 24.22 |
| Int4 | v2 | 38.72 | 27.33 |
| BF16 | v1 | 32.76 | 28.89 |
| Int8 | v1 | 28.31 | 23.87 |
| Int4 | v1 | 37.81 | 26.46 |
| BF16 | Disabled | 29.32 | 22.91 |
| Int8 | Disabled | 31.12 | 24.60 |
| Int4 | Disabled | 37.65 | 26.00 |

In detail, the profiling setting is generating 8192 new tokens from 1 context token. The profiling runs on a single A100-SXM4-80G GPU with PyTorch 2.0.1 and CUDA 11.8. The inference speed is averaged over the 8192 generated tokens.

Note: The generation speeds of the Int4/Int8 models above were measured with the auto-gptq library. A model loaded with `AutoModelForCausalLM.from_pretrained` currently generates approximately 20% slower. We have reported this issue to the HuggingFace team and will update promptly if a solution becomes available.

We also profiled the peak GPU memory usage under different quantization levels for two cases: encoding 2048 tokens as context (and generating a single token), and generating 8192 tokens (with a single token as context). GPU memory usage is similar with and without flash-attention. The results are shown below.

| Quantization Level | Peak Usage for Encoding 2048 Tokens | Peak Usage for Generating 8192 Tokens |
| ------------------ | :---------------------------------: | :-----------------------------------: |
| BF16 | 30.15GB | 38.94GB |
| Int8 | 18.81GB | 27.54GB |
| Int4 | 13.01GB | 21.79GB |

The above speed and memory profiling was conducted using this script.
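To put the memory table in context, the following back-of-the-envelope sketch compares a weights-only estimate with the measured peak usage for encoding 2048 tokens. It assumes a round count of 14e9 parameters and decimal gigabytes (1 GB = 1e9 bytes); the table does not state its unit convention, so the gap is only indicative. `weights_gb` is a hypothetical helper, not part of the profiling script.

```python
BYTES_PER_PARAM = {"BF16": 2.0, "Int8": 1.0, "Int4": 0.5}

# Peak usage for encoding 2048 tokens, copied from the table above (GB).
MEASURED_PEAK_GB = {"BF16": 30.15, "Int8": 18.81, "Int4": 13.01}

N_PARAMS = 14e9  # assumption: round parameter count for a "14B" model


def weights_gb(n_params: float, precision: str) -> float:
    """Memory needed just to hold the weights, in GB (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[precision] / 1e9


for precision, measured in MEASURED_PEAK_GB.items():
    estimate = weights_gb(N_PARAMS, precision)
    # Everything beyond the weights: activations, KV cache, buffers,
    # and any tensors kept at higher precision.
    other = measured - estimate
    print(f"{precision}: weights ~{estimate:.1f} GB, "
          f"measured peak {measured:.2f} GB, other ~{other:.2f} GB")
```

The weights-only figure is a lower bound: in every row the measured peak is higher, which is why the table rather than the estimate should guide hardware choices.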
The details of the model architecture of Qwen-14B-Chat are listed as follows:

| Hyperparameter | Value |
|:----------------|:------:|
| n_layers | 40 |
| n_heads | 40 |
| d_model | 5120 |
| vocab size | 151851 |
| sequence length | 2048 |

For position encoding, FFN activation function, and normalization, we adopt the prevalent practices: RoPE relative position encoding, SwiGLU as the activation function, and RMSNorm for normalization (flash-attention can optionally be installed for acceleration).

For tokenization, whereas current mainstream open-source models rely mainly on Chinese and English vocabularies, Qwen-14B-Chat uses a vocabulary of over 150K tokens. The vocabulary builds on `cl100k_base`, the BPE vocabulary used by GPT-4, and is optimized for Chinese and multilingual text. It prioritizes efficient encoding of Chinese, English, and code data while also being friendlier to other languages, so users can enhance the capability of some languages directly without expanding the vocabulary. Numbers are segmented into single digits, and tokenization is performed with the efficient tiktoken library.

For Qwen-14B-Chat, we evaluate the model on the standard benchmarks for Chinese understanding (C-Eval), English understanding (MMLU), coding (HumanEval), and mathematics (GSM8K), and we also report results for long-context understanding. Because alignment gives Qwen-14B-Chat a strong ability to call external systems, we evaluate tool usage as well.

Note: Due to rounding errors caused by hardware and framework, differences in reproduced results are possible.

The 0-shot and 5-shot accuracy of Qwen-14B-Chat on the C-Eval validation set is shown below:

| Model | Avg. Acc. |
|:--------------------------------:|:---------:|
| LLaMA2-7B-Chat | 31.9 |
| LLaMA2-13B-Chat | 36.2 |
| LLaMA2-70B-Chat | 44.3 |
| ChatGLM2-6B-Chat | 52.6 |
| InternLM-7B-Chat | 53.6 |
| Baichuan2-7B-Chat | 55.6 |
| Baichuan2-13B-Chat | 56.7 |
| Qwen-7B-Chat (original) (0-shot) | 54.2 |
| Qwen-7B-Chat (0-shot) | 59.7 |
| Qwen-7B-Chat (5-shot) | 59.3 |
| Qwen-14B-Chat (0-shot) | 69.8 |
| Qwen-14B-Chat (5-shot) | 71.7 |

The zero-shot accuracy of Qwen-14B-Chat on the C-Eval test set is provided below:

| Model | Avg. | STEM | Social Sciences | Humanities | Others |
| :---------------------- | :------: | :--: | :-------------: | :--------: | :----: |
| Chinese-Alpaca-Plus-13B | 41.5 | 36.6 | 49.7 | 43.1 | 41.2 |
| Chinese-Alpaca-2-7B | 40.3 | - | - | - | - |
| ChatGLM2-6B-Chat | 50.1 | 46.4 | 60.4 | 50.6 | 46.9 |
| Baichuan-13B-Chat | 51.5 | 43.7 | 64.6 | 56.2 | 49.2 |
| Qwen-7B-Chat (original) | 54.6 | 47.8 | 67.6 | 59.3 | 50.6 |
| Qwen-7B-Chat | 58.6 | 53.3 | 72.1 | 62.8 | 52.0 |
| Qwen-14B-Chat | 69.1 | 65.1 | 80.9 | 71.2 | 63.4 |

At the 14B scale, the human-aligned Qwen-14B-Chat still ranks among the top models of comparable size in C-Eval accuracy.

The 0-shot and 5-shot accuracy of Qwen-14B-Chat on MMLU is provided below. Here too, Qwen-14B-Chat ranks among the top human-aligned models of comparable size.

| Model | Avg. Acc. |
|:--------------------------------:|:---------:|
| ChatGLM2-6B-Chat | 46.0 |
| LLaMA2-7B-Chat | 46.2 |
| InternLM-7B-Chat | 51.1 |
| Baichuan2-7B-Chat | 52.9 |
| LLaMA2-13B-Chat | 54.6 |
| Baichuan2-13B-Chat | 57.3 |
| LLaMA2-70B-Chat | 63.8 |
| Qwen-7B-Chat (original) (0-shot) | 53.9 |
| Qwen-7B-Chat (0-shot) | 55.8 |
| Qwen-7B-Chat (5-shot) | 57.0 |
| Qwen-14B-Chat (0-shot) | 64.6 |
| Qwen-14B-Chat (5-shot) | 66.5 |

The zero-shot Pass@1 of Qwen-14B-Chat on HumanEval is shown below:

| Model | Pass@1 |
|:-----------------------:|:--------:|
| ChatGLM2-6B-Chat | 11.0 |
| LLaMA2-7B-Chat | 12.2 |
| InternLM-7B-Chat | 14.6 |
| Baichuan2-7B-Chat | 13.4 |
| LLaMA2-13B-Chat | 18.9 |
| Baichuan2-13B-Chat | 17.7 |
| LLaMA2-70B-Chat | 32.3 |
| Qwen-7B-Chat (original) | 24.4 |
| Qwen-7B-Chat | 37.2 |
| Qwen-14B-Chat | 43.9 |

The accuracy of Qwen-14B-Chat on GSM8K is shown below:

| Model | Acc. |
|:--------------------------------:|:--------:|
| LLaMA2-7B-Chat | 26.3 |
| ChatGLM2-6B-Chat | 28.8 |
| Baichuan2-7B-Chat | 32.8 |
| InternLM-7B-Chat | 33.0 |
| LLaMA2-13B-Chat | 37.1 |
| Baichuan2-13B-Chat | 55.3 |
| LLaMA2-70B-Chat | 59.3 |
| Qwen-7B-Chat (original) (0-shot) | 41.1 |
| Qwen-7B-Chat (0-shot) | 50.3 |
| Qwen-7B-Chat (8-shot) | 54.1 |
| Qwen-14B-Chat (0-shot) | 60.1 |
| Qwen-14B-Chat (8-shot) | 59.3 |

We introduce NTK-aware interpolation and LogN attention scaling to extend the context length of Qwen-14B-Chat. The Rouge-L results of Qwen-14B-Chat on the long-text summarization dataset VCSUM (average text length around 15K) are reported in the next table. To use these techniques, set `use_dynamic_ntk` and `use_logn_attn` to true in config.json.
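As a rough sketch of what NTK-aware interpolation and LogN attention scaling compute, the snippet below derives both factors from the sequence length. It assumes a training length of 2048 (the "sequence length" listed for this model), a RoPE base of 10000, and a head dimension of 5120 / 40 = 128; the stepwise rule for choosing the NTK factor and all function names are illustrative assumptions, and the released modeling code may differ in detail.

```python
import math

TRAIN_LEN = 2048     # training sequence length listed for this model
ROPE_BASE = 10000.0  # assumption: the common RoPE default
HEAD_DIM = 128       # d_model / n_heads = 5120 / 40


def ntk_alpha(seq_len: int, train_len: int = TRAIN_LEN) -> float:
    """Scaling factor that grows in steps once seq_len exceeds train_len."""
    if seq_len <= train_len:
        return 1.0
    return float(2 ** math.ceil(math.log2(seq_len / train_len) + 1) - 1)


def ntk_scaled_base(alpha: float, base: float = ROPE_BASE, dim: int = HEAD_DIM) -> float:
    """NTK-aware interpolation: enlarge the RoPE base instead of compressing positions."""
    return base * alpha ** (dim / (dim - 2))


def logn_scale(position: int, train_len: int = TRAIN_LEN) -> float:
    """LogN scaling: multiply queries by log_{train_len}(position) beyond train_len."""
    return math.log(position, train_len) if position > train_len else 1.0


for seq_len in (2048, 4096, 8192, 16384):
    alpha = ntk_alpha(seq_len)
    print(seq_len, alpha, round(ntk_scaled_base(alpha)), round(logn_scale(seq_len), 4))
```

Both factors equal 1 inside the training length, so under these assumptions short inputs behave exactly as they did during training and only longer inputs are rescaled.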
| Model | VCSUM (zh) |
|:------------------|:----------:|
| GPT-3.5-Turbo-16k | 16.0 |
| LLaMA2-7B-Chat | 0.2 |
| InternLM-7B-Chat | 13.0 |
| ChatGLM2-6B-Chat | 16.3 |
| Qwen-14B-Chat | 17.3 |

Qwen-Chat supports calling plugins/tools/APIs through ReAct Prompting. ReAct is also one of the main approaches used by the LangChain framework. On our open-source benchmark for assessing tool usage, Qwen-Chat is evaluated on three metrics: Tool Selection (Acc.↑), Tool Input (Rouge-L↑), and False Positive Error (↓).

> The plugins that appear in the evaluation set do not appear in the training set of Qwen. This benchmark evaluates the accuracy of the model in selecting the correct plugin from multiple candidate plugins, the reasonableness of the parameters passed to the plugin, and the false positive rate. False Positive: incorrectly invoking a plugin when responding to a query that should not trigger one.

To assess Qwen's ability to use the Python Code Interpreter for tasks such as mathematical problem solving, data visualization, and general-purpose tasks such as file handling and web scraping, we have created and open-sourced a benchmark specifically designed for evaluating these capabilities. You can find the benchmark at this link. The benchmark reports Math (↑), Visualization-Hard (↑), and Visualization-Easy (↑) scores, and we have observed that Qwen performs well in terms of code executability and result accuracy when generating code.

Qwen-Chat can also be used as a HuggingFace Agent, and it has been evaluated on the run-mode benchmark provided by HuggingFace.

If you run into problems, please consult the FAQ and the existing issues for a solution before opening a new issue.

If you find our work helpful, feel free to cite us.

Our code and checkpoints are fully open for academic research, and commercial use is also permitted. Check the LICENSE for details. For commercial use, please fill out the form to apply.

If you would like to leave a message for our research team or product team, join our Discord, WeChat, or DingTalk groups. You are also welcome to contact us by email.
Qwen1.5-7B-Chat-GGUF
Qwen1.5-110B-Chat
License: other License Name: tongyi-qianwen License Link:
Qwen-72B
Qwen-72B is the 72B-parameter version of Qwen (Tongyi Qianwen), the large language model series developed by Alibaba Cloud. It is a Transformer-based large language model pretrained on a large volume of diverse data, including web text, books, and code. Building on the pretrained Qwen-72B, we release Qwen-72B-Chat, an AI assistant trained with alignment techniques. This repository hosts Qwen-72B. Its main features are:

1. Large-scale high-quality training corpora: It is pretrained on over 3 trillion tokens, including Chinese, English, multilingual texts, code, and mathematics, covering general and professional fields. The distribution of the pre-training corpus has been optimized through a large number of ablation experiments.
2. Competitive performance: It significantly surpasses existing open-source models on multiple Chinese and English downstream evaluation tasks (including commonsense reasoning, code, mathematics, and translation). See below for specific evaluation results.
3. More comprehensive vocabulary coverage: Compared with other open-source models based mainly on Chinese and English vocabularies, Qwen-72B uses a vocabulary of over 150K tokens. This vocabulary is friendlier to multiple languages, so users can further enhance the capability for certain languages without expanding the vocabulary.
4. Longer context support: Qwen-72B supports a 32k context length.

For more details about the open-source model of Qwen-72B, please refer to the GitHub code repository.

The requirements are:

- Python 3.8 and above
- PyTorch 1.12 and above; 2.0 and above is recommended
- CUDA 11.4 and above is recommended (for GPU users, flash-attention users, etc.)
- To run the model in bf16/fp16, at least 144GB of GPU memory is required across devices (e.g., 2xA100-80G or 5xV100-32G). To run it in int4, at least 48GB of GPU memory is required (e.g., 1xA100-80G or 2xV100-32G).

To run Qwen-72B, make sure you meet the above requirements, then install the dependent libraries with pip. We also recommend installing the `flash-attention` library (flash attention 2 is now supported) for higher efficiency and lower memory usage.

The model can be loaded and called with a few lines of code; for more information, please refer to our GitHub repo.

Our tokenizer, which is based on tiktoken, differs from other tokenizers such as the sentencepiece tokenizer. You need to pay attention to special tokens, especially during fine-tuning. For more detailed information on the tokenizer and its use in fine-tuning, please refer to the documentation.
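The GPU memory requirements listed earlier in this section can be sanity-checked with simple arithmetic. The sketch below assumes a round 72e9 parameters at 2 bytes each for bf16/fp16, which gives the quoted 144GB, and then derives the minimum number of cards for the quoted totals; `min_gpus` is a hypothetical helper, and the count ignores per-device overhead and how evenly layers can be split.

```python
import math

REQUIRED_GB = {"bf16/fp16": 144, "int4": 48}    # totals quoted in the requirements
GPU_MEMORY_GB = {"A100-80G": 80, "V100-32G": 32}

# Weights alone at 2 bytes per parameter (assumed round parameter count).
bf16_weights_gb = 72e9 * 2 / 1e9


def min_gpus(required_gb: float, gpu_gb: float) -> int:
    """Smallest number of identical GPUs whose combined memory covers the requirement."""
    return math.ceil(required_gb / gpu_gb)


print(f"bf16 weights alone: {bf16_weights_gb:.0f} GB")
for precision, required in REQUIRED_GB.items():
    for gpu, memory in GPU_MEMORY_GB.items():
        print(f"{precision}: {min_gpus(required, memory)} x {gpu}")
```

The resulting counts reproduce the example configurations given in the requirements (2xA100-80G or 5xV100-32G for bf16/fp16, 1xA100-80G or 2xV100-32G for int4).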
The details of the model architecture of Qwen-72B are listed as follows:

| Hyperparameter | Value |
|:----------------|:-------|
| n_layers | 80 |
| n_heads | 64 |
| d_model | 8192 |
| vocab size | 151851 |
| sequence length | 32768 |

For position encoding, FFN activation function, and normalization, we adopt the prevalent practices: RoPE relative position encoding, SwiGLU as the activation function, and RMSNorm for normalization (flash-attention can optionally be installed for acceleration).

For tokenization, whereas current mainstream open-source models rely mainly on Chinese and English vocabularies, Qwen-72B uses a vocabulary of over 150K tokens. The vocabulary builds on `cl100k_base`, the BPE vocabulary used by GPT-4, and is optimized for Chinese and multilingual text. It prioritizes efficient encoding of Chinese, English, and code data while also being friendlier to other languages, so users can enhance the capability of some languages directly without expanding the vocabulary. Numbers are segmented into single digits, and tokenization is performed with the efficient tiktoken library.

We randomly sampled 1 million documents for each of a number of languages to compare the encoding compression rates of different models (with XLM-R, which supports 100 languages, as the base value 1; lower is better). The specific performance is shown in the figure.

As can be seen, while ensuring the efficient decoding of Chinese, English, and code, Qwen-72B also achieves a high compression rate for many other widely used languages (Thai th, Hebrew he, Arabic ar, Korean ko, Vietnamese vi, Japanese ja, Turkish tr, Indonesian id, Polish pl, Russian ru, Dutch nl, Portuguese pt, Italian it, German de, Spanish es, French fr, etc.), which gives the model strong scalability as well as high training and inference efficiency in these languages.

For pre-training data, Qwen-72B uses part of the open-source generic corpus together with a massive amount of accumulated web corpus and high-quality text content. After deduplication and filtering, the corpus exceeds 3T tokens and covers web text, encyclopedias, books, code, mathematics, and various vertical domains.

Evaluation

We selected MMLU, C-Eval, GSM8K, MATH, HumanEval, MBPP, BBH, and CMMLU, which are currently popular benchmarks, to comprehensively evaluate the model's Chinese and English knowledge, translation, mathematical reasoning, coding, and other capabilities. The results below show that the Qwen model outperforms similarly sized open-source models on all tasks.
| Model | Avg | MMLU | C-Eval | GSM8K | MATH | HumanEval | MBPP | BBH | AGIEval | GaokaoBench | CMMLU |
|:-------------------|:--------:|:--------:|:--------:|:--------:|:--------:|:---------:|:--------:|:--------:|:--------:|:--------:|:--------:|
| | | 5-shot | 5-shot | 8-shot | 4-shot | 0-shot | 3-shot | 3-shot | 0-shot | 0-shot | 5-shot |
| LLaMA2-7B | 24.4 | 46.8 | 32.5 | 16.7 | 3.3 | 12.8 | 20.8 | 38.2 | 21.8 | 18.9 | 31.8 |
| LLaMA2-13B | 31.3 | 55.0 | 41.4 | 29.6 | 5.0 | 18.9 | 30.3 | 45.6 | 30.9 | 18.2 | 38.4 |
| LLaMA2-70B | 45.7 | 69.7 | 50.1 | 63.5 | 12.0 | 26.2 | 39.6 | 64.9 | 54.2 | 23.3 | 53.6 |
| InternLM-20B | 47.2 | 62.1 | 58.8 | 52.6 | 7.9 | 25.6 | 35.6 | 52.5 | 59.0 | 59.0 | 59.0 |
| Yi-34B | 58.0 | 76.3 | 81.8 | 67.9 | 15.9 | 26.2 | 38.2 | 66.4 | 56.5 | 68.3 | 82.6 |
| XVERSE-65B | - | 70.8 | 68.6 | 60.3 | - | 26.3 | - | - | - | - | - |
| Qwen-7B | 46.2 | 58.2 | 63.5 | 51.7 | 11.6 | 29.9 | 31.6 | 45.0 | 45.3 | 62.5 | 62.2 |
| Qwen-14B | 52.7 | 66.3 | 72.1 | 61.3 | 24.8 | 32.3 | 40.8 | 53.4 | 51.9 | 52.7 | 71.0 |
| Qwen-72B | 66.4 | 77.4 | 83.3 | 78.9 | 35.2 | 35.4 | 52.2 | 67.7 | 62.5 | 87.6 | 83.6 |

Qwen-72B is trained with an extended RoPE base and supports an extrapolation length of 32k. We evaluate language modeling on arXiv data using perplexity (PPL, lower is better).

We provide evaluation scripts for reproducing the performance of our model; see the link for details. Note: small fluctuations in reproduced results are normal, owing to rounding errors caused by hardware and framework.

If you run into problems, please consult the FAQ and the existing issues for a solution before opening a new issue.

If you find our work helpful, feel free to cite us.

Our code and checkpoints are fully open for academic research, and commercial use is also permitted. Check the LICENSE for details. For commercial use, please fill out the form to apply.

If you would like to leave a message for our research team or product team, join our Discord, WeChat, or DingTalk groups. You are also welcome to contact us by email.
Qwen3-VL-32B-Instruct-GGUF
This repository provides GGUF-format weights for Qwen3-VL-32B-Instruct, split into two components:

- Language model (LLM): FP16, Q8_0, Q4_K_M
- Vision encoder (`mmproj`): FP16, Q8_0

These files are compatible with llama.cpp, Ollama, and other GGUF-based tools, supporting inference on CPU, NVIDIA GPU (CUDA), Apple Silicon (Metal), Intel GPUs (SYCL), and more. You can mix precision levels for the language and vision components based on your hardware and performance needs, and even perform custom quantization starting from the FP16 weights. Enjoy running this multimodal model on your personal device! 🚀

Meet Qwen3-VL, the most powerful vision-language model in the Qwen series to date. This generation delivers comprehensive upgrades across the board: superior text understanding and generation, deeper visual perception and reasoning, extended context length, enhanced spatial and video dynamics comprehension, and stronger agent interaction capabilities. It is available in Dense and MoE architectures that scale from edge to cloud, with Instruct and reasoning-enhanced Thinking editions for flexible, on-demand deployment.

- Visual Agent: Operates PC/mobile GUIs: recognizes elements, understands functions, invokes tools, completes tasks.
- Visual Coding Boost: Generates Draw.io/HTML/CSS/JS from images/videos.
- Advanced Spatial Perception: Judges object positions, viewpoints, and occlusions; provides stronger 2D grounding and enables 3D grounding for spatial reasoning and embodied AI.
- Long Context & Video Understanding: Native 256K context, expandable to 1M; handles books and hours-long video with full recall and second-level indexing.
- Enhanced Multimodal Reasoning: Excels in STEM/Math, with causal analysis and logical, evidence-based answers.
- Upgraded Visual Recognition: Broader, higher-quality pretraining enables the model to “recognize everything”: celebrities, anime, products, landmarks, flora/fauna, etc.
- Expanded OCR: Supports 32 languages (up from 19); robust in low light, blur, and tilt; better with rare/ancient characters and jargon; improved long-document structure parsing.
- Text Understanding on par with pure LLMs: Seamless text-vision fusion for lossless, unified comprehension.

The architecture updates are:

1. Interleaved-MRoPE: Full-frequency allocation over time, width, and height via robust positional embeddings, enhancing long-horizon video reasoning.
2. DeepStack: Fuses multi-level ViT features to capture fine-grained details and sharpen image-text alignment.
3. Text-Timestamp Alignment: Moves beyond T-RoPE to precise, timestamp-grounded event localization for stronger video temporal modeling.

To use these models with `llama.cpp`, please ensure you are using the latest version, either by building from source or by downloading the most recent release for your device. You can run inference via the command line (for example, Qwen3-VL-32B-Instruct with an FP16 vision encoder and a Q8_0 quantized LLM) or serve a model such as Qwen3-VL-235B-A22B-Instruct via an OpenAI-compatible API with a web UI.

> Tip: For models split into multiple GGUF files, simply specify the first shard (e.g., `...-00001-of-00003.gguf`). llama.cpp will automatically load all parts.

Once the server is running, open your browser to `http://localhost:8080` to access the built-in chat interface, or send requests to the `/v1/chat/completions` endpoint. For more details, refer to the official documentation.

You can further quantize the FP16 weights to other precision levels, for example to 2-bit. For a full list of supported quantization types and detailed instructions, refer to the quantization documentation.

If you find our work helpful, feel free to cite us.
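As an illustration of talking to the `/v1/chat/completions` endpoint mentioned above, the sketch below builds (but does not send) a request containing one image and one question. It assumes the server accepts OpenAI-style `image_url` content parts with a base64 data URL, which depends on the server being started with the vision projector; `build_request`, the `max_tokens` value, and the placeholder image bytes are illustrative only.

```python
import base64
import json
import urllib.request

SERVER_URL = "http://localhost:8080/v1/chat/completions"


def build_request(image_bytes: bytes, question: str) -> urllib.request.Request:
    """Build an OpenAI-style chat request with one image and one text part."""
    image_b64 = base64.b64encode(image_bytes).decode("ascii")
    payload = {
        "messages": [
            {
                "role": "user",
                "content": [
                    # Assumed format: OpenAI-compatible image part as a data URL.
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
                    {"type": "text", "text": question},
                ],
            }
        ],
        "max_tokens": 256,  # arbitrary illustrative limit
    }
    return urllib.request.Request(
        SERVER_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder bytes, not a real JPEG; read an actual image file in practice.
request = build_request(b"\xff\xd8\xff", "Describe this image.")

# To send it once the server is running:
# with urllib.request.urlopen(request) as response:
#     reply = json.load(response)["choices"][0]["message"]["content"]
```

Building the request separately from sending it keeps the payload easy to inspect or log before any network call is made.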
Qwen1.5-14B-Chat-AWQ
Qwen1.5-72B-Chat-GPTQ-Int4
Qwen1.5-32B-Chat-GPTQ-Int4
Qwen-Audio
Qwen2.5-3B-Instruct-GPTQ-Int4
Qwen2.5-32B-Instruct-GGUF
Qwen2-72B-Instruct-GGUF
Qwen-Audio-Chat
Qwen1.5-1.8B-Chat-GGUF
Qwen2.5-Math-72B-Instruct
Qwen 2.5 Math 72B is based on the Qwen model and is licensed under other license_name.
Qwen2.5-14B-Instruct-GPTQ-Int8
Qwen2.5-Math-7B-PRM800K
Qwen3-1.7B-GPTQ-Int8
Qwen2.5-Coder-32B-Instruct-GPTQ-Int8
CodeQwen1.5-7B
Qwen1.5-72B-Chat-AWQ
Qwen2-VL-2B-Instruct-AWQ
Qwen-VL-Chat-Int4
Qwen2.5-72B-Instruct-GGUF
CodeQwen1.5-7B-Chat-GGUF
Qwen3-0.6B-GPTQ-Int8
CodeQwen1.5-7B-Chat
CodeQwen1.5 is the code-specific version of Qwen1.5. It is a transformer-based decoder-only language model pretrained on a large amount of code data. Its key features are:

- Strong code generation capabilities and competitive performance across a series of benchmarks
- Long-context understanding and generation with a context length of 64K tokens
- Support for 92 coding languages
- Excellent performance in text-to-SQL, bug fixing, etc.

For more details, please refer to our blog post and GitHub repo.

Model Details

CodeQwen1.5 is based on Qwen1.5, a language model series that includes decoder language models of different sizes. It is trained on 3 trillion tokens of code data and uses grouped-query attention (GQA) for efficient inference.

Requirements

The code of Qwen1.5 is included in the latest Hugging Face transformers. We advise you to install `transformers>=4.37.0`; otherwise, loading the model may fail with an error.

The tokenizer and model can be loaded, and text generated, using `apply_chat_template`. If you encounter code switching or other bad cases, we advise you to use the hyper-parameters provided in `generation_config.json`.

If you find our work helpful, feel free to cite us.
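A minimal sketch of loading the tokenizer and model and generating with `apply_chat_template` might look as follows. It assumes the Hugging Face repository id `Qwen/CodeQwen1.5-7B-Chat`, a default system message, and that `transformers>=4.37.0`, `torch`, and `accelerate` (for `device_map="auto"`) are installed; the helper names are illustrative.

```python
from typing import Dict, List

MODEL_ID = "Qwen/CodeQwen1.5-7B-Chat"  # assumed repository id


def build_messages(prompt: str) -> List[Dict[str, str]]:
    """Wrap a user prompt in the message list expected by apply_chat_template."""
    return [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt},
    ]


def generate(prompt: str, max_new_tokens: int = 512) -> str:
    """Load the model and return its reply to a single prompt."""
    # Imported lazily so build_messages can be used without the heavy dependency.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype="auto", device_map="auto"
    )
    text = tokenizer.apply_chat_template(
        build_messages(prompt), tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer([text], return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Drop the prompt tokens so that only the newly generated reply is decoded.
    new_tokens = output_ids[0][inputs.input_ids.shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Calling `generate("Write a quicksort function in Python.")` downloads the weights on first use, so it needs a machine with enough memory for a 7B model.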
Qwen3-4B-SafeRL
Qwen3-4B-SafeRL is a safety-aligned version of the Qwen3-4B model. It has been trained using Reinforcement Learning (RL) with a reward signal from Qwen3Guard-Gen to enhance its robustness against h...
Qwen3Guard-Stream-0.6B
Qwen3Guard is a series of safety moderation models built upon Qwen3 and trained on a dataset of 1.19 million prompts and responses labeled for safety. The series includes models of three sizes (0.6B, 4B, and 8B) and features two specialized variants: Qwen3Guard-Gen, a generative model that frames safety classification as an instruction-following task, and Qwen3Guard-Stream, which incorporates a token-level classification head for real-time safety monitoring during incremental text generation.

This repository hosts Qwen3Guard-Stream, which offers the following key advantages:

- Real-Time Detection: Qwen3Guard-Stream is specifically optimized for streaming scenarios, allowing efficient and timely moderation during incremental token generation.
- Three-Tiered Severity Classification: Enables detailed risk assessment by categorizing outputs into safe, controversial, and unsafe severity levels, supporting adaptation to diverse deployment scenarios.
- Multilingual Support: Supports 119 languages and dialects, ensuring robust performance in global and cross-lingual applications.

For more details, please refer to our blog, GitHub, and Technical Report.

Qwen3Guard-Stream can perform real-time safety moderation on a streaming conversation.

> [!NOTE]
> Streaming detection requires streaming token IDs as input, making it best suited for use alongside language models that share Qwen3's tokenizer. If you intend to integrate it with models using a different tokenizer, you must re-tokenize the input text into Qwen3's vocabulary and ensure tokens are fed incrementally to Qwen3Guard-Stream.

SGLang Install

We recommend installing SGLang from source.

SGLang Streaming Safety Moderation Example

Qwen3Guard-Stream can be used with SGLang to perform real-time safety moderation on streaming conversations. We're currently working on adding support for Qwen3Guard-Stream to vLLM. Stay tuned!

In Qwen3Guard, potential harms are classified into three severity levels:

- Unsafe: Content generally considered harmful across most scenarios.
- Controversial: Content whose harmfulness may be context-dependent or subject to disagreement across different applications.
- Safe: Content generally considered safe across most scenarios.

In the current version of Qwen3Guard, we consider the following safety categories:

- Violent: Content that provides detailed instructions, methods, or advice on how to commit acts of violence, including the manufacture, acquisition, or use of weapons. Also includes depictions of violence.
- Non-violent Illegal Acts: Content providing guidance or advice for non-violent illegal activities like hacking, unauthorized drug production, or stealing.
- Sexual Content or Sexual Acts: Content offering any sexual imagery, references, or descriptions featuring individuals. Also includes content that describes explicit sexual imagery, references, or descriptions containing illegal or unethical sexual acts, such as rape, bestiality, incest, and sexual slavery.
- Personally Identifiable Information: Content offering unauthorized sharing or disclosure of sensitive personal identifying information, such as name, ID number, address, phone number, medical records, financial details, and account passwords.
- Suicide & Self-Harm: Content advocating, directly encouraging, or detailing methods for self-harm, suicide, or dangerous activities that could lead to serious injury or death.
- Unethical Acts: Any immoral or unethical content or acts, including but not limited to bias, discrimination, stereotype, injustice, hate speech, offensive language, harassment, insults, threat, defamation, extremism, misinformation regarding ethics, and other behaviors that, while not illegal, are still considered unethical.
- Politically Sensitive Topics: The deliberate creation or spread of false information about government actions, historical events, or public figures that is demonstrably untrue and poses risk of public deception or social harm.
- Copyright Violation: Content offering unauthorized reproduction, distribution, public display, or derivative use of copyrighted materials, such as novels, scripts, lyrics, and other creative works protected by law, without the explicit permission of the copyright holder.
- Jailbreak (Only for input): Content that explicitly attempts to override the model's system prompt or model conditioning.

If you find our work helpful, feel free to cite us.
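The three severity levels described above lend themselves to a deployment-specific policy. The sketch below assumes a per-token severity label is already available from a streaming classifier and shows one way to cut a stream off under a strict or a loose policy; `Severity`, `should_block`, and `moderate_stream` are hypothetical names and do not reflect the Qwen3Guard API.

```python
from enum import Enum
from typing import Iterable, List, Tuple


class Severity(Enum):
    SAFE = "safe"
    CONTROVERSIAL = "controversial"
    UNSAFE = "unsafe"


def should_block(level: Severity, strict: bool) -> bool:
    """Unsafe is always blocked; controversial is blocked only in strict mode."""
    if level is Severity.UNSAFE:
        return True
    return strict and level is Severity.CONTROVERSIAL


def moderate_stream(
    labelled_tokens: Iterable[Tuple[str, Severity]], strict: bool
) -> Tuple[List[str], bool]:
    """Emit tokens until one must be blocked; return (emitted tokens, was_blocked)."""
    emitted: List[str] = []
    for token, level in labelled_tokens:
        if should_block(level, strict):
            return emitted, True
        emitted.append(token)
    return emitted, False


# Made-up labels standing in for the classifier's per-token output.
stream = [
    ("Sure", Severity.SAFE),
    (",", Severity.SAFE),
    (" here", Severity.CONTROVERSIAL),
    (" is", Severity.UNSAFE),
]
print(moderate_stream(stream, strict=True))
print(moderate_stream(stream, strict=False))
```

Treating the controversial level as a switch rather than a fixed verdict is what lets the same classifier serve both conservative and permissive deployments.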
Qwen1.5-14B-Chat-GGUF
Qwen2-7B-Instruct-GPTQ-Int4
WorldPM-72B-HelpSteer2
Qwen3-8B-MLX-bf16
Qwen3-1.7B-MLX-bf16
Qwen2.5-0.5B-Instruct-GPTQ-Int8
Qwen1.5-32B-Chat-AWQ
Qwen3-0.6B-MLX-4bit
Qwen1.5-4B-Chat-GGUF
Qwen2-Math-7B
This model is licensed under Apache 2.0 and supports the English language.
Qwen2-VL-2B-Instruct-GPTQ-Int4
WorldPM-72B-UltraFeedback
Qwen2.5-1.5B-Instruct-GPTQ-Int4
Qwen3-VL-2B-Thinking-GGUF
This repository provides GGUF-format weights for Qwen3-VL-2B-Thinking, split into two components:

- Language model (LLM): FP16, Q8_0, Q4_K_M
- Vision encoder (`mmproj`): FP16, Q8_0

These files are compatible with llama.cpp, Ollama, and other GGUF-based tools, supporting inference on CPU, NVIDIA GPU (CUDA), Apple Silicon (Metal), Intel GPUs (SYCL), and more. You can mix precision levels for the language and vision components based on your hardware and performance needs, and even perform custom quantization starting from the FP16 weights. Enjoy running this multimodal model on your personal device! 🚀

Meet Qwen3-VL, the most powerful vision-language model in the Qwen series to date. This generation delivers comprehensive upgrades across the board: superior text understanding and generation, deeper visual perception and reasoning, extended context length, enhanced spatial and video dynamics comprehension, and stronger agent interaction capabilities. It is available in Dense and MoE architectures that scale from edge to cloud, with Instruct and reasoning-enhanced Thinking editions for flexible, on-demand deployment.

- Visual Agent: Operates PC/mobile GUIs: recognizes elements, understands functions, invokes tools, completes tasks.
- Visual Coding Boost: Generates Draw.io/HTML/CSS/JS from images/videos.
- Advanced Spatial Perception: Judges object positions, viewpoints, and occlusions; provides stronger 2D grounding and enables 3D grounding for spatial reasoning and embodied AI.
- Long Context & Video Understanding: Native 256K context, expandable to 1M; handles books and hours-long video with full recall and second-level indexing.
- Enhanced Multimodal Reasoning: Excels in STEM/Math, with causal analysis and logical, evidence-based answers.
- Upgraded Visual Recognition: Broader, higher-quality pretraining enables the model to “recognize everything”: celebrities, anime, products, landmarks, flora/fauna, etc.
- Expanded OCR: Supports 32 languages (up from 19); robust in low light, blur, and tilt; better with rare/ancient characters and jargon; improved long-document structure parsing.
- Text Understanding on par with pure LLMs: Seamless text-vision fusion for lossless, unified comprehension.

The architecture updates are:

1. Interleaved-MRoPE: Full-frequency allocation over time, width, and height via robust positional embeddings, enhancing long-horizon video reasoning.
2. DeepStack: Fuses multi-level ViT features to capture fine-grained details and sharpen image-text alignment.
3. Text-Timestamp Alignment: Moves beyond T-RoPE to precise, timestamp-grounded event localization for stronger video temporal modeling.

To use these models with `llama.cpp`, please ensure you are using the latest version, either by building from source or by downloading the most recent release for your device. You can run inference via the command line (for example, Qwen3-VL-2B-Instruct with an FP16 vision encoder and a Q8_0 quantized LLM) or serve a model such as Qwen3-VL-235B-A22B-Instruct via an OpenAI-compatible API with a web UI.

> Tip: For models split into multiple GGUF files, simply specify the first shard (e.g., `...-00001-of-00003.gguf`). llama.cpp will automatically load all parts.

Once the server is running, open your browser to `http://localhost:8080` to access the built-in chat interface, or send requests to the `/v1/chat/completions` endpoint. For more details, refer to the official documentation.

You can further quantize the FP16 weights to other precision levels, for example to 2-bit. For a full list of supported quantization types and detailed instructions, refer to the quantization documentation.

If you find our work helpful, feel free to cite us.
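For split models, the tip above says only the first shard needs to be specified. The sketch below shows one way to list the shard file names that should sit next to it, for example to check that a download is complete. It assumes the `-00001-of-00003.gguf` naming pattern from the tip; the `model-F16` prefix and the `expected_shards` helper are hypothetical.

```python
import re
from typing import List

# Matches names such as "<prefix>-00001-of-00003.gguf".
SHARD_PATTERN = re.compile(
    r"^(?P<prefix>.+)-(?P<index>\d{5})-of-(?P<total>\d{5})\.gguf$"
)


def expected_shards(first_shard: str) -> List[str]:
    """List every shard name implied by the name of one shard."""
    match = SHARD_PATTERN.match(first_shard)
    if match is None:
        return [first_shard]  # not a split file, nothing else to look for
    total = int(match["total"])
    return [
        f"{match['prefix']}-{index:05d}-of-{total:05d}.gguf"
        for index in range(1, total + 1)
    ]


print(expected_shards("model-F16-00001-of-00003.gguf"))
```

Comparing this list with the files actually present in the directory is a quick way to catch a missing shard before starting the server.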
Qwen3Guard-Stream-8B
Qwen3Guard is a series of safety moderation models built upon Qwen3 and trained on a dataset of 1.19 million prompts and responses labeled for safety. The series includes models of three sizes (0.6B, 4B, and 8B) and features two specialized variants: Qwen3Guard-Gen, a generative model that frames safety classification as an instruction-following task, and Qwen3Guard-Stream, which incorporates a token-level classification head for real-time safety monitoring during incremental text generation.

This repository hosts Qwen3Guard-Stream, which offers the following key advantages:
- Real-Time Detection: Qwen3Guard-Stream is specifically optimized for streaming scenarios, allowing efficient and timely moderation during incremental token generation.
- Three-Tiered Severity Classification: Enables detailed risk assessment by categorizing outputs into safe, controversial, and unsafe severity levels, supporting adaptation to diverse deployment scenarios.
- Multilingual Support: Supports 119 languages and dialects, ensuring robust performance in global and cross-lingual applications.

For more details, please refer to our blog, GitHub, and Technical Report.

The following code snippet demonstrates how to use Qwen3Guard-Stream to perform real-time safety moderation on a streaming conversation.

> [!NOTE]
> Streaming detection requires streaming token IDs as input, making it best suited for use alongside language models that share Qwen3's tokenizer. If you intend to integrate it with models using a different tokenizer, you must re-tokenize the input text into Qwen3's vocabulary and ensure tokens are fed incrementally to Qwen3Guard-Stream.

SGLang Install

We recommend installing SGLang from source. Run the following commands:

SGLang Streaming Safety Moderation Example

The following example demonstrates how to use Qwen3Guard-Stream with SGLang to perform real-time safety moderation on streaming conversations:

We're currently working on adding support for Qwen3Guard-Stream to vLLM. Stay tuned!

In Qwen3Guard, potential harms are classified into three severity levels:
- Unsafe: Content generally considered harmful across most scenarios.
- Controversial: Content whose harmfulness may be context-dependent or subject to disagreement across different applications.
- Safe: Content generally considered safe across most scenarios.

In the current version of Qwen3Guard, we consider the following safety categories:
- Violent: Content that provides detailed instructions, methods, or advice on how to commit acts of violence, including the manufacture, acquisition, or use of weapons. Also includes depictions of violence.
- Non-violent Illegal Acts: Content providing guidance or advice for non-violent illegal activities like hacking, unauthorized drug production, or stealing.
- Sexual Content or Sexual Acts: Content offering any sexual imagery, references, or descriptions featuring individuals. Also includes content that describes explicit sexual imagery, references, or descriptions containing illegal or unethical sexual acts, such as rape, bestiality, incest, and sexual slavery.
- Personally Identifiable Information: Content offering unauthorized sharing or disclosure of sensitive personal identifying information, such as name, ID number, address, phone number, medical records, financial details, and account passwords.
- Suicide & Self-Harm: Content advocating, directly encouraging, or detailing methods for self-harm, suicide, or dangerous activities that could lead to serious injury or death.
- Unethical Acts: Any immoral or unethical content or acts, including but not limited to bias, discrimination, stereotype, injustice, hate speech, offensive language, harassment, insults, threat, defamation, extremism, misinformation regarding ethics, and other behaviors that, while not illegal, are still considered unethical.
- Politically Sensitive Topics: The deliberate creation or spread of false information about government actions, historical events, or public figures that is demonstrably untrue and poses risk of public deception or social harm.
- Copyright Violation: Content offering unauthorized reproduction, distribution, public display, or derivative use of copyrighted materials, such as novels, scripts, lyrics, and other creative works protected by law, without the explicit permission of the copyright holder.
- Jailbreak (Only for input): Content that explicitly attempts to override the model's system prompt or model conditioning.

If you find our work helpful, feel free to cite us.
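To make the three-tier severity scheme concrete, here is a minimal Python sketch of one possible downstream policy: it consumes per-token severity labels as they arrive and reports the first position at which generation should be halted. This is not the Qwen3Guard-Stream API; the function name `first_violation`, the `threshold` and `patience` parameters, and the plain-string label format are assumptions made for illustration only.

```python
from typing import Iterable, Optional, Tuple

# Severity levels from the card, ordered from least to most severe.
SEVERITY = {"Safe": 0, "Controversial": 1, "Unsafe": 2}


def first_violation(
    labels: Iterable[str],
    threshold: str = "Unsafe",
    patience: int = 1,
) -> Tuple[Optional[int], str]:
    """Scan streamed per-token labels (hypothetical policy helper).

    Returns (index, worst_label): `index` is the position of the token
    that completes `patience` consecutive labels at or above
    `threshold`, or None if the stream never triggers; `worst_label`
    is the most severe label seen up to that point.
    """
    limit = SEVERITY[threshold]
    worst = "Safe"
    streak = 0
    for index, label in enumerate(labels):
        if SEVERITY[label] > SEVERITY[worst]:
            worst = label
        # Count consecutive tokens at or above the blocking threshold.
        streak = streak + 1 if SEVERITY[label] >= limit else 0
        if streak >= patience:
            return index, worst
    return None, worst


if __name__ == "__main__":
    stream = ["Safe", "Controversial", "Unsafe", "Safe"]
    print(first_violation(stream))                             # strict deployment
    print(first_violation(stream, threshold="Controversial"))  # stricter still
```

A stricter deployment might block at `Controversial`, while a more permissive one might block only at `Unsafe` and raise `patience` to tolerate isolated flags.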
Qwen1.5-MoE-A2.7B-Chat-GPTQ-Int4
Qwen2.5-Coder-3B-Instruct-GPTQ-Int4
Qwen2.5-1.5B-Instruct-GPTQ-Int8
Qwen2-VL-72B
Qwen2-VL-2B-Instruct-GPTQ-Int8
Qwen2.5-Coder-3B-Instruct-AWQ
Qwen1.5-7B-Chat-AWQ
Qwen1.5-4B-Chat-AWQ
Qwen-7B-Chat-Int4
Qwen-72B-Chat
Qwen-72B is the 72B-parameter version of the large language model series Qwen (short for Tongyi Qianwen), developed by Alibaba Cloud. Qwen-72B is a Transformer-based large language model pretrained on a large volume of diverse data, including web text, professional books, and code. Based on the pretrained Qwen-72B, we also release Qwen-72B-Chat, an AI assistant built on the large language model and trained with alignment techniques. This repository is the one for Qwen-72B-Chat.

1. Large-scale, high-quality training corpora: The model is pretrained on over 3 trillion tokens, including Chinese, English, multilingual text, code, and mathematics, covering general and professional fields. The distribution of the pretraining corpus has been optimized through a large number of ablation experiments.
2. Competitive performance: Qwen-72B significantly surpasses existing open-source models on multiple Chinese and English downstream evaluation tasks (including commonsense reasoning, code, mathematics, and translation). See below for specific evaluation results.
3. More comprehensive vocabulary coverage: Compared with other open-source models based mainly on Chinese and English vocabularies, Qwen-72B uses a vocabulary of over 150K tokens. This vocabulary is friendlier to multiple languages, so users can enhance the model's capability in certain languages without expanding the vocabulary.
4. Longer context support: Qwen-72B supports a 32k context length.
5. System prompt: By adjusting the system prompt, Qwen-72B-Chat can perform role playing, language style transfer, task setting, and behavior setting.

For more details about the open-source Qwen-72B model, please refer to the GitHub code repository.

Requirements:
- Python 3.8 and above
- PyTorch 1.12 and above; 2.0 and above is recommended
- CUDA 11.4 and above is recommended (for GPU users, flash-attention users, etc.)
- To run Qwen-72B-Chat in bf16/fp16, at least 144GB of GPU memory across multiple cards is required (e.g., 2xA100-80G or 5xV100-32G). To run it in int4, at least 48GB of GPU memory is required (e.g., 1xA100-80G or 2xV100-32G).

To run Qwen-72B-Chat, make sure you meet the above requirements, and then execute the following pip commands to install the dependent libraries.

In addition, we recommend installing the `flash-attention` library (flash attention 2 is now supported) for higher efficiency and lower memory usage.

Using vLLM for inference supports longer context lengths and yields at least a twofold generation speedup. You need to meet the following requirements:

If you use CUDA 12.1 and PyTorch 2.1, you can install vLLM directly with the following command. Otherwise, please refer to the official vLLM Installation Instructions, or to our vLLM repo for GPTQ quantization.
Inference with Hugging Face Transformers

The following code shows an example of multi-turn interaction with Qwen-72B-Chat:

Inference with vLLM and Transformers-like APIs

After installing vLLM according to the dependency section above, you can download the wrapper code to the current folder and execute the following commands for multi-turn dialogue. (Note: this method currently supports only the ``model.chat()`` interface.)

Inference with vLLM and an OpenAI-like API

Please refer to the introduction to vLLM deployment and OpenAI interface usage in our GitHub repo. If deploying with 2xA100-80G, you can run the following code:

Note that the ``--gpu-memory-utilization 0.98`` parameter is required to avoid OOM problems. For more information, please refer to our GitHub repo.

Here we demonstrate how to use our Int4/Int8 quantized models for inference. Before you start, make sure you meet the requirements of auto-gptq (e.g., torch 2.0 and above, transformers 4.32.0 and above) and install the required packages:

If you have problems installing `auto-gptq`, we advise you to check the official repo for a pre-built wheel.

> Note: The pre-compiled `auto-gptq` packages depend strictly on the version of `torch` and its CUDA version. Moreover, due to recent updates, you may also encounter unsupported-version errors from `transformers`, `optimum`, or `peft`.
> We recommend using the latest versions that meet the following requirements:
> - torch==2.1 auto-gptq>=0.5.1 transformers>=4.35.0 optimum>=1.14.0 peft>=0.6.1
> - torch>=2.0, =0.5.0,

The details of the model architecture of Qwen-72B-Chat are listed below:

| Hyperparameter  | Value  |
|:----------------|:-------|
| n_layers        | 80     |
| n_heads         | 64     |
| d_model         | 8192   |
| vocab size      | 151851 |
| sequence length | 32768  |

For position encoding, the FFN activation function, and normalization, we adopt the prevalent practices: RoPE relative position encoding, SwiGLU as the activation function, and RMSNorm for normalization (flash-attention can optionally be installed for acceleration).

For tokenization, compared with current mainstream open-source models based mainly on Chinese and English vocabularies, Qwen-72B-Chat uses a vocabulary of over 150K tokens. The vocabulary builds on `cl100k_base`, the BPE vocabulary used by GPT-4, and is optimized for Chinese and multilingual text. It prioritizes efficient encoding of Chinese, English, and code data, and is also friendlier to other languages, so users can enhance the capability of some languages without expanding the vocabulary. Numbers are segmented into single digits, and the efficient tiktoken library is used for tokenization.

For Qwen-72B-Chat, we evaluate the model on standard benchmarks for Chinese understanding (C-Eval), English understanding (MMLU), code (HumanEval), and mathematics (GSM8K), and we also report results on long-context tasks. Because alignment gives Qwen-72B-Chat a strong ability to call external systems, we additionally evaluate its tool usage.

Note: Due to rounding errors caused by hardware and framework, differences in reproduced results are possible.
The 0-shot and 5-shot accuracy of Qwen-72B-Chat on the C-Eval validation set is shown below:

| Model                            | Avg. Acc. |
|:--------------------------------:|:---------:|
| LLaMA2-7B-Chat                   | 31.9      |
| LLaMA2-13B-Chat                  | 36.2      |
| LLaMA2-70B-Chat                  | 44.3      |
| ChatGPT3.5                       | 52.5      |
| ChatGPT4                         | 69.9      |
| Yi-34B-Chat (0-shot)             | 77.0      |
| Yi-34B-Chat (5-shot)             | 78.5      |
| Qwen-7B-Chat (original) (0-shot) | 54.2      |
| Qwen-7B-Chat (0-shot)            | 59.7      |
| Qwen-7B-Chat (5-shot)            | 59.3      |
| Qwen-14B-Chat (0-shot)           | 69.8      |
| Qwen-14B-Chat (5-shot)           | 71.7      |
| Qwen-72B-Chat (0-shot)           | 80.1      |
| Qwen-72B-Chat (5-shot)           | 82.9      |

The zero-shot accuracy of Qwen-72B-Chat on the C-Eval test set is provided below:

| Model                   | Avg. | STEM | Social Sciences | Humanities | Others |
| :---------------------- | :--: | :--: | :-------------: | :--------: | :----: |
| Qwen-7B-Chat (original) | 54.6 | 47.8 | 67.6            | 59.3       | 50.6   |
| Qwen-7B-Chat            | 58.6 | 53.3 | 72.1            | 62.8       | 52.0   |
| Qwen-14B-Chat           | 69.1 | 65.1 | 80.9            | 71.2       | 63.4   |
| Qwen-72B-Chat           | 79.5 | 74.5 | 89.1            | 81.2       | 78.1   |

The 0-shot and 5-shot accuracy of Qwen-72B-Chat on MMLU is provided below. Qwen-72B-Chat remains among the top performers among human-aligned models of comparable size.

| Model                            | Avg. Acc. |
|:--------------------------------:|:---------:|
| LLaMA2-7B-Chat                   | 46.2      |
| LLaMA2-13B-Chat                  | 54.6      |
| LLaMA2-70B-Chat                  | 63.8      |
| Yi-34B-Chat (0-shot)             | 67.6      |
| Yi-34B-Chat (5-shot)             | 73.4      |
| ChatGPT3.5                       | 69.1      |
| ChatGPT4                         | 83.0      |
| Qwen-7B-Chat (original) (0-shot) | 53.9      |
| Qwen-7B-Chat (0-shot)            | 55.8      |
| Qwen-7B-Chat (5-shot)            | 57.0      |
| Qwen-14B-Chat (0-shot)           | 64.6      |
| Qwen-14B-Chat (5-shot)           | 66.5      |
| Qwen-72B-Chat (0-shot)           | 74.3      |
| Qwen-72B-Chat (5-shot)           | 75.0      |

The zero-shot Pass@1 of Qwen-72B-Chat on HumanEval is shown below:

| Model                   | Pass@1 |
|:-----------------------:|:------:|
| LLaMA2-7B-Chat          | 12.2   |
| LLaMA2-13B-Chat         | 18.9   |
| LLaMA2-70B-Chat         | 32.3   |
| Yi-34B-Chat             | 33.5   |
| ChatGPT3.5              | 73.2   |
| ChatGPT4                | 86.6   |
| Qwen-7B-Chat (original) | 24.4   |
| Qwen-7B-Chat            | 37.2   |
| Qwen-14B-Chat           | 43.9   |
| Qwen-72B-Chat           | 64.6   |

The accuracy of Qwen-72B-Chat on GSM8K is shown below:

| Model                            | Acc. |
|:--------------------------------:|:----:|
| LLaMA2-7B-Chat                   | 26.3 |
| LLaMA2-13B-Chat                  | 37.1 |
| LLaMA2-70B-Chat                  | 59.3 |
| Yi-34B-Chat                      | 71.6 |
| ChatGPT3.5                       | 73.2 |
| ChatGPT4                         | 91.4 |
| Qwen-7B-Chat (original) (0-shot) | 41.1 |
| Qwen-7B-Chat (0-shot)            | 50.3 |
| Qwen-7B-Chat (8-shot)            | 54.1 |
| Qwen-14B-Chat (0-shot)           | 60.1 |
| Qwen-14B-Chat (8-shot)           | 59.3 |
| Qwen-72B-Chat (0-shot)           | 76.4 |
| Qwen-72B-Chat (8-shot)           | 75.7 |

Qwen-72B-Chat supports context lengths of up to 32k.
The scores on L-Eval (closed-ended tasks) are as follows:

| Model           | Average | Coursera | GSM   | QuALITY | TOEFL | CodeU | SFiction |
|:----------------|:-------:|:--------:|:-----:|:-------:|:-----:|:-----:|:--------:|
| ChatGPT-3.5-16k | 60.73   | 63.51    | 84.00 | 61.38   | 78.43 | 12.22 | 64.84    |
| Qwen-72B-Chat   | 62.30   | 58.13    | 76.00 | 77.22   | 86.24 | 6.66  | 69.53    |

We also conducted the "needle in a haystack" experiment (the idea came from @Greg Kamradt) to test whether the model can retrieve information placed at different positions in inputs of different lengths. The result is as follows:

The above results show that Qwen-72B-Chat can accurately retrieve information placed at various positions within an input length of 32k, demonstrating its excellent long-text understanding capabilities.

If you run into problems, please consult the FAQ and the existing issues to look for a solution before opening a new issue.

If you find our work helpful, feel free to cite us.

Our code and checkpoints are fully open for academic research, and commercial use is also allowed. Check the LICENSE for details about the license. If you require commercial use, please fill out the form to apply.

If you would like to leave a message for our research team or product team, join our WeChat, DingTalk, or Discord groups. You can also send an email to [email protected].
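The GPU memory figures quoted in the requirements can be sanity-checked with simple arithmetic. The sketch below estimates memory for the weights alone, assuming a round 72 billion parameters, 16 bits per parameter for bf16/fp16, and 4 bits per parameter for int4. The helper names are illustrative, and the gap between the 36 GB int4 weight estimate and the stated 48 GB minimum is presumably headroom for activations, the KV cache, and quantization metadata, which this estimate ignores.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate memory for model weights only, in GB (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


def fits(required_gb: float, num_gpus: int, gb_per_gpu: float) -> bool:
    """Check whether the total GPU memory covers the stated requirement."""
    return num_gpus * gb_per_gpu >= required_gb


if __name__ == "__main__":
    params = 72e9  # assumed round parameter count for Qwen-72B

    bf16_gb = weight_memory_gb(params, 16)
    int4_gb = weight_memory_gb(params, 4)
    print(f"bf16/fp16 weights: {bf16_gb:.0f} GB")
    print(f"int4 weights:      {int4_gb:.0f} GB")

    # Hardware configurations listed in the requirements above.
    print(fits(144, 2, 80), fits(144, 5, 32))  # bf16/fp16 options
    print(fits(48, 1, 80), fits(48, 2, 32))    # int4 options
```

The bf16/fp16 estimate matches the stated 144GB minimum exactly, which suggests that figure covers weights only, so long prompts may need additional headroom.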
Qwen2-VL-72B-Instruct-AWQ
QVQ-72B-Preview
QVQ-72B-Preview is an experimental research model developed by the Qwen team, focusing on enhancing visual reasoning capabilities.

|                  | QVQ-72B-Preview | o1-2024-12-17 | gpt-4o-2024-05-13 | Claude3.5 Sonnet-20241022 | Qwen2VL-72B |
|------------------|-----------------|---------------|-------------------|---------------------------|-------------|
| MMMU(val)        | 70.3            | 77.3          | 69.1              | 70.4                      | 64.5        |
| MathVista(mini)  | 71.4            | 71.0          | 63.8              | 65.3                      | 70.5        |
| MathVision(full) | 35.9            | -             | 30.4              | 35.6                      | 25.9        |
| OlympiadBench    | 20.4            | -             | 25.9              | -                         | 11.2        |

QVQ-72B-Preview performs strongly on several benchmarks. It scored 70.3% on the Massive Multi-discipline Multimodal Understanding (MMMU) benchmark, showcasing QVQ's ability in multidisciplinary understanding and reasoning. The improvement on MathVision over Qwen2-VL-72B (35.9 versus 25.9) highlights the model's progress in mathematical reasoning tasks, and the OlympiadBench result (20.4 versus 11.2) shows an enhanced ability to tackle challenging problems.

But It's Not All Perfect: Acknowledging the Limitations

While QVQ-72B-Preview exhibits promising performance that surpasses expectations, it is important to acknowledge several limitations:

1. Language Mixing and Code-Switching: The model might occasionally mix different languages or unexpectedly switch between them, potentially affecting the clarity of its responses.
2. Recursive Reasoning Loops: There is a risk of the model getting caught in recursive reasoning loops, leading to lengthy responses that may not arrive at a final answer.
3. Safety and Ethical Considerations: Robust safety measures are needed to ensure reliable and safe performance. Users should exercise caution when deploying this model.
4. Performance and Benchmark Limitations: Despite the improvements in visual reasoning, QVQ does not entirely replace the capabilities of Qwen2-VL-72B. During multi-step visual reasoning, the model might gradually lose focus on the image content, leading to hallucinations. Moreover, QVQ does not show significant improvement over Qwen2-VL-72B in basic recognition tasks such as identifying people, animals, or plants.

Note: Currently, the model only supports single-round dialogues and image inputs. It does not support video inputs.

Quickstart

We offer a toolkit to help you handle various types of visual input more conveniently, including base64, URLs, and interleaved images and videos. You can install it using the following command:

The following code snippet shows how to use the chat model with `transformers` and `qwen_vl_utils`:

If you find our work helpful, feel free to cite us.
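Since the toolkit is described as accepting base64 and URL inputs, the following Python sketch shows one way a single-round, single-image request might be assembled. It assumes the message layout used by the Qwen2-VL family (a `user` turn whose `content` is a list of `image` and `text` parts) and a `data:image;base64,` prefix for inline images; both should be checked against the toolkit's documentation. The helper names `image_to_data_uri` and `build_messages` are hypothetical.

```python
import base64


def image_to_data_uri(image_bytes: bytes) -> str:
    """Encode raw image bytes as an inline base64 reference (assumed format)."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return "data:image;base64," + encoded


def build_messages(image_ref: str, question: str) -> list:
    """Build a single-round conversation with one image and one question.

    `image_ref` may be a URL or a data URI produced above. Only one
    user turn is created, in line with the single-round limitation
    noted in the card.
    """
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_ref},
                {"type": "text", "text": question},
            ],
        }
    ]


if __name__ == "__main__":
    # Placeholder bytes stand in for a real image file read from disk.
    fake_image = b"abc"
    messages = build_messages(image_to_data_uri(fake_image), "What is shown here?")
    print(messages)
```

The resulting list is the kind of structure that would then be passed to the processor's chat template and the toolkit's vision preprocessing before generation.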
Qwen3-VL-8B-Instruct-GGUF
This repository provides GGUF-format weights for Qwen3-VL-8B-Instruct, split into two components:
- Language model (LLM): FP16, Q8_0, Q4_K_M
- Vision encoder (`mmproj`): FP16, Q8_0

These files are compatible with llama.cpp, Ollama, and other GGUF-based tools, and support inference on CPU, NVIDIA GPU (CUDA), Apple Silicon (Metal), Intel GPUs (SYCL), and more. You can mix precision levels for the language and vision components based on your hardware and performance needs, and you can also perform custom quantization starting from the FP16 weights. Enjoy running this multimodal model on your personal device! 🚀

Qwen3-VL is the most powerful vision-language model in the Qwen series to date. This generation delivers upgrades across the board: superior text understanding and generation, deeper visual perception and reasoning, extended context length, enhanced comprehension of spatial relations and video dynamics, and stronger agent interaction capabilities. It is available in Dense and MoE architectures that scale from edge to cloud, with Instruct and reasoning-enhanced Thinking editions for flexible, on-demand deployment.

- Visual Agent: Operates PC and mobile GUIs by recognizing elements, understanding functions, invoking tools, and completing tasks.
- Visual Coding Boost: Generates Draw.io/HTML/CSS/JS from images and videos.
- Advanced Spatial Perception: Judges object positions, viewpoints, and occlusions; provides stronger 2D grounding and enables 3D grounding for spatial reasoning and embodied AI.
- Long Context & Video Understanding: Native 256K context, expandable to 1M; handles books and hours-long video with full recall and second-level indexing.
- Enhanced Multimodal Reasoning: Excels in STEM and math, with causal analysis and logical, evidence-based answers.
- Upgraded Visual Recognition: Broader, higher-quality pretraining lets the model "recognize everything": celebrities, anime, products, landmarks, flora and fauna, and more.
- Expanded OCR: Supports 32 languages (up from 19); robust in low light, blur, and tilt; better with rare or ancient characters and jargon; improved long-document structure parsing.
- Text Understanding on par with pure LLMs: Seamless text-vision fusion for lossless, unified comprehension.

1. Interleaved-MRoPE: Full-frequency allocation over time, width, and height via robust positional embeddings, enhancing long-horizon video reasoning.
2. DeepStack: Fuses multi-level ViT features to capture fine-grained details and sharpen image-text alignment.
3. Text-Timestamp Alignment: Moves beyond T-RoPE to precise, timestamp-grounded event localization for stronger video temporal modeling.

To use these models with `llama.cpp`, make sure you are using the latest version, either by building from source or by downloading the most recent release for your device. You can run inference from the command line or through a web-based chat interface.

For example, to run Qwen3-VL-8B-Instruct with an FP16 vision encoder and a Q8_0-quantized LLM:

To serve Qwen3-VL-235B-A22B-Instruct via an OpenAI-compatible API with a web UI:

> Tip: For models split into multiple GGUF files, specify only the first shard (e.g., `...-00001-of-00003.gguf`). llama.cpp will automatically load all parts.

Once the server is running, open `http://localhost:8080` in your browser to access the built-in chat interface, or send requests to the `/v1/chat/completions` endpoint. For more details, refer to the official documentation.

You can further quantize the FP16 weights to other precision levels. For example, to quantize the model to 2-bit:

For a full list of supported quantization types and detailed instructions, refer to the quantization documentation.

If you find our work helpful, feel free to cite us.
Qwen1.5-14B-Chat-GPTQ-Int4
Qwen3-0.6B-MLX-bf16
Qwen3Guard-Stream-4B
Qwen2-72B-Instruct-AWQ
Qwen2-0.5B-Instruct-MLX
Qwen2.5-Omni-7B-GPTQ-Int4
Qwen2.5-Coder-14B-Instruct-GPTQ-Int8
Qwen2-Math-1.5B-Instruct
Qwen3-4B-MLX-8bit
Qwen3-235B-A22B-GGUF
Qwen2.5-Math-72B
Qwen3-0.6B-MLX-8bit
Qwen2-57B-A14B-Instruct-GGUF
Qwen2-7B-Instruct-GPTQ-Int8
Qwen3-VL-235B-A22B-Thinking-GGUF
This repository provides GGUF-format weights for Qwen3-VL-235B-A22B-Thinking, split into two components:
- Language model (LLM): FP16, Q8_0, Q4_K_M
- Vision encoder (`mmproj`): FP16, Q8_0

These files are compatible with llama.cpp, Ollama, and other GGUF-based tools, and support inference on CPU, NVIDIA GPU (CUDA), Apple Silicon (Metal), Intel GPUs (SYCL), and more. You can mix precision levels for the language and vision components based on your hardware and performance needs, and you can also perform custom quantization starting from the FP16 weights. Enjoy running this multimodal model on your personal device! 🚀

Qwen3-VL is the most powerful vision-language model in the Qwen series to date. This generation delivers upgrades across the board: superior text understanding and generation, deeper visual perception and reasoning, extended context length, enhanced comprehension of spatial relations and video dynamics, and stronger agent interaction capabilities. It is available in Dense and MoE architectures that scale from edge to cloud, with Instruct and reasoning-enhanced Thinking editions for flexible, on-demand deployment.

- Visual Agent: Operates PC and mobile GUIs by recognizing elements, understanding functions, invoking tools, and completing tasks.
- Visual Coding Boost: Generates Draw.io/HTML/CSS/JS from images and videos.
- Advanced Spatial Perception: Judges object positions, viewpoints, and occlusions; provides stronger 2D grounding and enables 3D grounding for spatial reasoning and embodied AI.
- Long Context & Video Understanding: Native 256K context, expandable to 1M; handles books and hours-long video with full recall and second-level indexing.
- Enhanced Multimodal Reasoning: Excels in STEM and math, with causal analysis and logical, evidence-based answers.
- Upgraded Visual Recognition: Broader, higher-quality pretraining lets the model "recognize everything": celebrities, anime, products, landmarks, flora and fauna, and more.
- Expanded OCR: Supports 32 languages (up from 19); robust in low light, blur, and tilt; better with rare or ancient characters and jargon; improved long-document structure parsing.
- Text Understanding on par with pure LLMs: Seamless text-vision fusion for lossless, unified comprehension.

1. Interleaved-MRoPE: Full-frequency allocation over time, width, and height via robust positional embeddings, enhancing long-horizon video reasoning.
2. DeepStack: Fuses multi-level ViT features to capture fine-grained details and sharpen image-text alignment.
3. Text-Timestamp Alignment: Moves beyond T-RoPE to precise, timestamp-grounded event localization for stronger video temporal modeling.

To use these models with `llama.cpp`, make sure you are using the latest version, either by building from source or by downloading the most recent release for your device. You can run inference from the command line or through a web-based chat interface.

For example, to run Qwen3-VL-2B-Instruct with an FP16 vision encoder and a Q8_0-quantized LLM:

To serve Qwen3-VL-235B-A22B-Instruct via an OpenAI-compatible API with a web UI:

> Tip: For models split into multiple GGUF files, specify only the first shard (e.g., `...-00001-of-00003.gguf`). llama.cpp will automatically load all parts.

Once the server is running, open `http://localhost:8080` in your browser to access the built-in chat interface, or send requests to the `/v1/chat/completions` endpoint. For more details, refer to the official documentation.

You can further quantize the FP16 weights to other precision levels. For example, to quantize the model to 2-bit:

For a full list of supported quantization types and detailed instructions, refer to the quantization documentation.

If you find our work helpful, feel free to cite us.
Qwen2-VL-7B-Instruct-GPTQ-Int8
Qwen2-0.5B-Instruct-AWQ
Qwen1.5-32B-Chat-GGUF
Qwen2.5-Coder-0.5B-Instruct-GPTQ-Int4
Qwen3-32B-MLX-8bit
Qwen3-8B-MLX-4bit
Qwen2.5-3B-Instruct-GPTQ-Int8
Qwen1.5-110B
License: other. License name: tongyi-qianwen.
Qwen3-1.7B-MLX-4bit
Qwen3-14B-MLX-4bit
Qwen3-32B-MLX-bf16
Qwen2-Math-72B-Instruct
This model is released under a custom license named tongyi-qianwen. For more information, see the license at https://huggingface.co/Qwen/Qwen2-Math-72B-Instruct/blob/main/LICENSE.
Qwen2-Math-1.5B
Qwen2-Math-72B
Qwen2.5-Coder-1.5B-Instruct-GPTQ-Int4
Qwen2.5-Coder-0.5B-Instruct-AWQ
Qwen3-30B-A3B-MLX-4bit
Qwen3-235B-A22B-MLX-4bit
Qwen1.5-7B-Chat-GPTQ-Int4
Qwen-14B-Chat-Int4
Qwen3-1.7B-MLX-8bit
Qwen3-30B-A3B-MLX-8bit
Qwen3-VL-235B-A22B-Instruct-GGUF
This repository provides GGUF-format weights for Qwen3-VL-235B-A22B-Instruct, split into two components:
- Language model (LLM): FP16, Q8_0, Q4_K_M
- Vision encoder (`mmproj`): FP16, Q8_0

These files are compatible with llama.cpp, Ollama, and other GGUF-based tools, and support inference on CPU, NVIDIA GPU (CUDA), Apple Silicon (Metal), Intel GPUs (SYCL), and more. You can mix precision levels for the language and vision components based on your hardware and performance needs, and you can also perform custom quantization starting from the FP16 weights. Enjoy running this multimodal model on your personal device! 🚀

Introduction: Qwen3-VL is the most powerful vision-language model in the Qwen series to date. This generation delivers upgrades across the board: superior text understanding and generation, deeper visual perception and reasoning, extended context length, enhanced comprehension of spatial relations and video dynamics, and stronger agent interaction capabilities. It is available in Dense and MoE architectures that scale from edge to cloud, with Instruct and reasoning-enhanced Thinking editions for flexible, on-demand deployment.

- Visual Agent: Operates PC and mobile GUIs by recognizing elements, understanding functions, invoking tools, and completing tasks.
- Visual Coding Boost: Generates Draw.io/HTML/CSS/JS from images and videos.
- Advanced Spatial Perception: Judges object positions, viewpoints, and occlusions; provides stronger 2D grounding and enables 3D grounding for spatial reasoning and embodied AI.
- Long Context & Video Understanding: Native 256K context, expandable to 1M; handles books and hours-long video with full recall and second-level indexing.
- Enhanced Multimodal Reasoning: Excels in STEM and math, with causal analysis and logical, evidence-based answers.
- Upgraded Visual Recognition: Broader, higher-quality pretraining lets the model "recognize everything": celebrities, anime, products, landmarks, flora and fauna, and more.
- Expanded OCR: Supports 32 languages (up from 19); robust in low light, blur, and tilt; better with rare or ancient characters and jargon; improved long-document structure parsing.
- Text Understanding on par with pure LLMs: Seamless text-vision fusion for lossless, unified comprehension.

1. Interleaved-MRoPE: Full-frequency allocation over time, width, and height via robust positional embeddings, enhancing long-horizon video reasoning.
2. DeepStack: Fuses multi-level ViT features to capture fine-grained details and sharpen image-text alignment.
3. Text-Timestamp Alignment: Moves beyond T-RoPE to precise, timestamp-grounded event localization for stronger video temporal modeling.

To use these models with `llama.cpp`, make sure you are using the latest version, either by building from source or by downloading the most recent release for your device. You can run inference from the command line or through a web-based chat interface.

For example, to run Qwen3-VL-2B-Instruct with an FP16 vision encoder and a Q8_0-quantized LLM:

To serve Qwen3-VL-235B-A22B-Instruct via an OpenAI-compatible API with a web UI:

> Tip: For models split into multiple GGUF files, specify only the first shard (e.g., `...-00001-of-00003.gguf`). llama.cpp will automatically load all parts.

Once the server is running, open `http://localhost:8080` in your browser to access the built-in chat interface, or send requests to the `/v1/chat/completions` endpoint. For more details, refer to the official documentation.

You can further quantize the FP16 weights to other precision levels. For example, to quantize the model to 2-bit:

For a full list of supported quantization types and detailed instructions, refer to the quantization documentation.

If you find our work helpful, feel free to cite us.
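Because only the first shard is passed on the command line, it can be useful to confirm that every part of a split model has been downloaded before launching. The Python sketch below derives the full list of expected shard names from the first one, assuming the `-00001-of-00003.gguf` naming pattern shown in the tip above. The function name `expected_shards` and the example file stem are hypothetical.

```python
import os
import re

# Matches names such as "<stem>-00001-of-00003.gguf" (pattern taken from the tip above).
SHARD_PATTERN = re.compile(r"^(?P<stem>.+)-(?P<index>\d{5})-of-(?P<total>\d{5})\.gguf$")


def expected_shards(first_shard: str) -> list:
    """Return every shard filename implied by the first shard's name.

    A file that does not follow the split naming pattern is treated
    as a single, unsplit model.
    """
    match = SHARD_PATTERN.match(first_shard)
    if match is None:
        return [first_shard]
    stem = match.group("stem")
    total = int(match.group("total"))
    return [f"{stem}-{i:05d}-of-{total:05d}.gguf" for i in range(1, total + 1)]


def missing_shards(first_shard: str) -> list:
    """List the expected shard files that are not present on disk."""
    return [name for name in expected_shards(first_shard) if not os.path.exists(name)]


if __name__ == "__main__":
    # Hypothetical file stem, used only for illustration.
    for name in expected_shards("model-Q4_K_M-00001-of-00003.gguf"):
        print(name)
```

Running `missing_shards` before starting the server gives a clear list of absent files instead of a loader error partway through startup.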