QuantTrio
Qwen3-30B-A3B-Instruct-2507-GPTQ-Int8
---
library_name: transformers
license: apache-2.0
license_link: https://huggingface.co/Qwen/Qwen3-30B-A3B-Instruct-2507/blob/main/LICENSE
pipeline_tag: text-generation
tags:
- Qwen3
- GPTQ
- Int8
- Quantization Repair
- vLLM
base_model:
- Qwen/Qwen3-30B-A3B-Instruct-2507
base_model_relation: quantized
---
Qwen3-VL-30B-A3B-Instruct-AWQ
Qwen3-VL-30B-A3B-Instruct-AWQ
Base Model: Qwen/Qwen3-VL-30B-A3B-Instruct

【Dependencies / Installation】
As of 2025-10-08, create a fresh Python environment and run:
For more details, refer to the vLLM Official Qwen3-VL Guide.

【Model Files】
| File Size | Last Updated |
|-----------|--------------|
| `17GB` | `2025-10-04` |

Meet Qwen3-VL — the most powerful vision-language model in the Qwen series to date.

This generation delivers comprehensive upgrades across the board: superior text understanding & generation, deeper visual perception & reasoning, extended context length, enhanced spatial and video dynamics comprehension, and stronger agent interaction capabilities. It is available in Dense and MoE architectures that scale from edge to cloud, with Instruct and reasoning-enhanced Thinking editions for flexible, on-demand deployment.

- Visual Agent: Operates PC/mobile GUIs—recognizes elements, understands functions, invokes tools, completes tasks.
- Visual Coding Boost: Generates Draw.io/HTML/CSS/JS from images and videos.
- Advanced Spatial Perception: Judges object positions, viewpoints, and occlusions; provides stronger 2D grounding and enables 3D grounding for spatial reasoning and embodied AI.
- Long Context & Video Understanding: Native 256K context, expandable to 1M; handles books and hours-long video with full recall and second-level indexing.
- Enhanced Multimodal Reasoning: Excels in STEM/math—causal analysis and logical, evidence-based answers.
- Upgraded Visual Recognition: Broader, higher-quality pretraining lets it "recognize everything"—celebrities, anime, products, landmarks, flora/fauna, and more.
- Expanded OCR: Supports 32 languages (up from 19); robust in low light, blur, and tilt; better with rare/ancient characters and jargon; improved long-document structure parsing.
- Text Understanding on par with pure LLMs: Seamless text–vision fusion for lossless, unified comprehension.

1. Interleaved-MRoPE: Full-frequency allocation over time, width, and height via robust positional embeddings, enhancing long-horizon video reasoning.
2. DeepStack: Fuses multi-level ViT features to capture fine-grained details and sharpen image–text alignment.
3. Text–Timestamp Alignment: Moves beyond T-RoPE to precise, timestamp-grounded event localization for stronger video temporal modeling.

This is the weight repository for Qwen3-VL-30B-A3B-Instruct. Below, we provide simple examples showing how to use Qwen3-VL with 🤖 ModelScope and 🤗 Transformers. The code for Qwen3-VL has been merged into the latest Hugging Face `transformers`, and we advise you to build from source with the following command:
Here is a code snippet showing how to use the chat model with `transformers`:
If you find our work helpful, feel free to give us a cite.
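Since the `transformers` snippet itself is elided above, here is a minimal, hedged sketch of the multimodal chat-message structure that Qwen3-VL-style processors consume (the image URL and question are placeholders; the actual processor and model calls are not reproduced here):

```python
def build_messages(image_url: str, question: str) -> list:
    """Build a single-turn chat message mixing one image and one text prompt."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_url},  # image by URL or local path
                {"type": "text", "text": question},     # accompanying text prompt
            ],
        }
    ]

# One image plus a question, ready to feed a processor's chat template.
messages = build_messages("https://example.com/demo.jpeg", "Describe this image.")
```

A list shaped like this is what `processor.apply_chat_template(...)` would receive before generation.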
GLM-4.6-AWQ
Qwen3-235B-A22B-Instruct-2507-AWQ
Qwen3-235B-A22B-Instruct-2507-AWQ
Base model: Qwen/Qwen3-235B-A22B-Instruct-2507

【VLLM Launch Command for 8 GPUs (Single Node)】
Note: When launching with 8 GPUs, `--enable-expert-parallel` must be specified; otherwise, the expert tensors cannot be evenly split across tensor parallel ranks. This option is not required for 4-GPU setups.

【Model Files】
| File Size | Last Updated |
|---------|--------------|
| `116GB` | `2025-07-23` |

We introduce the updated version of the Qwen3-235B-A22B non-thinking mode, named Qwen3-235B-A22B-Instruct-2507, featuring the following key enhancements:
- Significant improvements in general capabilities, including instruction following, logical reasoning, text comprehension, mathematics, science, coding, and tool usage.
- Substantial gains in long-tail knowledge coverage across multiple languages.
- Markedly better alignment with user preferences in subjective and open-ended tasks, enabling more helpful responses and higher-quality text generation.
- Enhanced capabilities in 256K long-context understanding.

Qwen3-235B-A22B-Instruct-2507 has the following features:
- Type: Causal Language Models
- Training Stage: Pretraining & Post-training
- Number of Parameters: 235B in total and 22B activated
- Number of Parameters (Non-Embedding): 234B
- Number of Layers: 94
- Number of Attention Heads (GQA): 64 for Q and 4 for KV
- Number of Experts: 128
- Number of Activated Experts: 8
- Context Length: 262,144 natively

NOTE: This model supports only non-thinking mode and does not generate `<think></think>` blocks in its output. Meanwhile, specifying `enable_thinking=False` is no longer required.

For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our blog, GitHub, and Documentation.
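The launch command itself is elided above; as a hypothetical single-node sketch consistent with the note (the model path, port, and context length here are assumptions, not the card's exact command):

```shell
# Hedged 8-GPU launch sketch; --enable-expert-parallel is required at TP=8
# per the note above. Adjust model path, port, and context length as needed.
vllm serve QuantTrio/Qwen3-235B-A22B-Instruct-2507-AWQ \
  --tensor-parallel-size 8 \
  --enable-expert-parallel \
  --max-model-len 262144 \
  --port 8000
```

On a 4-GPU node, `--tensor-parallel-size 4` without `--enable-expert-parallel` would match the note.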
| | Deepseek-V3-0324 | GPT-4o-0327 | Claude Opus 4 Non-thinking | Kimi K2 | Qwen3-235B-A22B Non-thinking | Qwen3-235B-A22B-Instruct-2507 |
|--- | --- | --- | --- | --- | --- | ---|
| Knowledge | | | | | | |
| MMLU-Pro | 81.2 | 79.8 | 86.6 | 81.1 | 75.2 | 83.0 |
| MMLU-Redux | 90.4 | 91.3 | 94.2 | 92.7 | 89.2 | 93.1 |
| GPQA | 68.4 | 66.9 | 74.9 | 75.1 | 62.9 | 77.5 |
| SuperGPQA | 57.3 | 51.0 | 56.5 | 57.2 | 48.2 | 62.6 |
| SimpleQA | 27.2 | 40.3 | 22.8 | 31.0 | 12.2 | 54.3 |
| CSimpleQA | 71.1 | 60.2 | 68.0 | 74.5 | 60.8 | 84.3 |
| Reasoning | | | | | | |
| AIME25 | 46.6 | 26.7 | 33.9 | 49.5 | 24.7 | 70.3 |
| HMMT25 | 27.5 | 7.9 | 15.9 | 38.8 | 10.0 | 55.4 |
| ARC-AGI | 9.0 | 8.8 | 30.3 | 13.3 | 4.3 | 41.8 |
| ZebraLogic | 83.4 | 52.6 | - | 89.0 | 37.7 | 95.0 |
| LiveBench 20241125 | 66.9 | 63.7 | 74.6 | 76.4 | 62.5 | 75.4 |
| Coding | | | | | | |
| LiveCodeBench v6 (25.02-25.05) | 45.2 | 35.8 | 44.6 | 48.9 | 32.9 | 51.8 |
| MultiPL-E | 82.2 | 82.7 | 88.5 | 85.7 | 79.3 | 87.9 |
| Aider-Polyglot | 55.1 | 45.3 | 70.7 | 59.0 | 59.6 | 57.3 |
| Alignment | | | | | | |
| IFEval | 82.3 | 83.9 | 87.4 | 89.8 | 83.2 | 88.7 |
| Arena-Hard v2$ | 45.6 | 61.9 | 51.5 | 66.1 | 52.0 | 79.2 |
| Creative Writing v3 | 81.6 | 84.9 | 83.8 | 88.1 | 80.4 | 87.5 |
| WritingBench | 74.5 | 75.5 | 79.2 | 86.2 | 77.0 | 85.2 |
| Agent | | | | | | |
| BFCL-v3 | 64.7 | 66.5 | 60.1 | 65.2 | 68.0 | 70.9 |
| TAU-Retail | 49.6 | 60.3# | 81.4 | 70.7 | 65.2 | 71.3 |
| TAU-Airline | 32.0 | 42.8# | 59.6 | 53.5 | 32.0 | 44.0 |
| Multilingualism | | | | | | |
| MultiIF | 66.5 | 70.4 | - | 76.2 | 70.2 | 77.5 |
| MMLU-ProX | 75.8 | 76.2 | - | 74.5 | 73.2 | 79.4 |
| INCLUDE | 80.1 | 82.1 | - | 76.9 | 75.6 | 79.5 |
| PolyMATH | 32.2 | 25.5 | 30.0 | 44.8 | 27.0 | 50.2 |

$: For reproducibility, we report the win rates evaluated by GPT-4.1.
\#: Results were generated using GPT-4o-20241120, as access to the native function calling API of GPT-4o-0327 was unavailable.
The code of Qwen3-MoE has been merged into the latest Hugging Face `transformers`, and we advise you to use the latest version of `transformers`.

You can use `sglang>=0.4.6.post1` or `vllm>=0.8.5` to create an OpenAI-compatible API endpoint:
- SGLang:

Note: If you encounter out-of-memory (OOM) issues, consider reducing the context length to a shorter value, such as `32,768`.

For local use, applications such as Ollama, LMStudio, MLX-LM, llama.cpp, and KTransformers have also supported Qwen3.

Qwen3 excels in tool-calling capabilities. We recommend using Qwen-Agent to make the best use of the agentic ability of Qwen3. Qwen-Agent encapsulates tool-calling templates and tool-calling parsers internally, greatly reducing coding complexity. To define the available tools, you can use the MCP configuration file, use the integrated tools of Qwen-Agent, or integrate other tools by yourself.

To achieve optimal performance, we recommend the following settings:
1. Sampling Parameters:
   - We suggest using `Temperature=0.7`, `TopP=0.8`, `TopK=20`, and `MinP=0`.
   - For supported frameworks, you can adjust the `presence_penalty` parameter between 0 and 2 to reduce endless repetitions. However, a higher value may occasionally result in language mixing and a slight decrease in model performance.
2. Adequate Output Length: We recommend an output length of 16,384 tokens for most queries, which is adequate for instruct models.
3. Standardize Output Format: We recommend using prompts to standardize model outputs when benchmarking.
   - Math Problems: Include "Please reason step by step, and put your final answer within \boxed{}." in the prompt.
   - Multiple-Choice Questions: Add the following JSON structure to the prompt to standardize responses: "Please show your choice in the `answer` field with only the choice letter, e.g., `"answer": "C"`."

If you find our work helpful, feel free to give us a cite.
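As a small illustration, the recommended sampling settings above can be collected into request parameters for an OpenAI-compatible endpoint. This is a sketch, not the card's own code: the client call is hypothetical, and `top_k`/`min_p` ride in `extra_body` because the OpenAI request schema does not define them (vLLM and SGLang accept them there):

```python
# Recommended Qwen3 sampling settings as kwargs for a chat.completions request.
recommended_kwargs = {
    "temperature": 0.7,
    "top_p": 0.8,
    "max_tokens": 16384,       # adequate output length for most queries
    "presence_penalty": 1.5,   # optional: tune between 0 and 2 to curb repetition
    "extra_body": {"top_k": 20, "min_p": 0.0},  # non-OpenAI params pass through here
}

# e.g. client.chat.completions.create(model=..., messages=..., **recommended_kwargs)
```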
Qwen3.5-27B-AWQ
Qwen3-VL-235B-A22B-Instruct-AWQ
MiniMax-M2-AWQ
Qwen3-VL-30B-A3B-Thinking-AWQ
Qwen3-VL-30B-A3B-Thinking-AWQ
Base Model: Qwen/Qwen3-VL-30B-A3B-Thinking

【Dependencies / Installation】
As of 2025-10-08, create a fresh Python environment and run:
For more details, refer to the vLLM Official Qwen3-VL Guide.

【Model Files】
| File Size | Last Updated |
|-----------|--------------|
| `17GB` | `2025-10-04` |

Meet Qwen3-VL — the most powerful vision-language model in the Qwen series to date.

This generation delivers comprehensive upgrades across the board: superior text understanding & generation, deeper visual perception & reasoning, extended context length, enhanced spatial and video dynamics comprehension, and stronger agent interaction capabilities. It is available in Dense and MoE architectures that scale from edge to cloud, with Instruct and reasoning-enhanced Thinking editions for flexible, on-demand deployment.

- Visual Agent: Operates PC/mobile GUIs—recognizes elements, understands functions, invokes tools, completes tasks.
- Visual Coding Boost: Generates Draw.io/HTML/CSS/JS from images and videos.
- Advanced Spatial Perception: Judges object positions, viewpoints, and occlusions; provides stronger 2D grounding and enables 3D grounding for spatial reasoning and embodied AI.
- Long Context & Video Understanding: Native 256K context, expandable to 1M; handles books and hours-long video with full recall and second-level indexing.
- Enhanced Multimodal Reasoning: Excels in STEM/math—causal analysis and logical, evidence-based answers.
- Upgraded Visual Recognition: Broader, higher-quality pretraining lets it "recognize everything"—celebrities, anime, products, landmarks, flora/fauna, and more.
- Expanded OCR: Supports 32 languages (up from 19); robust in low light, blur, and tilt; better with rare/ancient characters and jargon; improved long-document structure parsing.
- Text Understanding on par with pure LLMs: Seamless text–vision fusion for lossless, unified comprehension.

1. Interleaved-MRoPE: Full-frequency allocation over time, width, and height via robust positional embeddings, enhancing long-horizon video reasoning.
2. DeepStack: Fuses multi-level ViT features to capture fine-grained details and sharpen image–text alignment.
3. Text–Timestamp Alignment: Moves beyond T-RoPE to precise, timestamp-grounded event localization for stronger video temporal modeling.

This is the weight repository for Qwen3-VL-30B-A3B-Thinking. Below, we provide simple examples showing how to use Qwen3-VL with 🤖 ModelScope and 🤗 Transformers. The code for Qwen3-VL has been merged into the latest Hugging Face `transformers`, and we advise you to build from source with the following command:
Here is a code snippet showing how to use the chat model with `transformers`:
If you find our work helpful, feel free to give us a cite.
GLM-4.7-AWQ
GLM-4.5-Air-AWQ-FP16Mix
GLM-4.5-Air-GPTQ-Int4-Int8Mix
Qwen3-VL-32B-Instruct-AWQ
Qwen3-VL-32B-Instruct-AWQ
Base Model: Qwen/Qwen3-VL-32B-Instruct

【Dependencies / Installation】
As of 2025-10-22, create a fresh Python environment and run:

【Model Files】
| File Size | Last Updated |
|-----------|--------------|
| `20 GiB` | `2025-10-22` |

Meet Qwen3-VL — the most powerful vision-language model in the Qwen series to date.

This generation delivers comprehensive upgrades across the board: superior text understanding & generation, deeper visual perception & reasoning, extended context length, enhanced spatial and video dynamics comprehension, and stronger agent interaction capabilities. It is available in Dense and MoE architectures that scale from edge to cloud, with Instruct and reasoning-enhanced Thinking editions for flexible, on-demand deployment.

- Visual Agent: Operates PC/mobile GUIs—recognizes elements, understands functions, invokes tools, completes tasks.
- Visual Coding Boost: Generates Draw.io/HTML/CSS/JS from images and videos.
- Advanced Spatial Perception: Judges object positions, viewpoints, and occlusions; provides stronger 2D grounding and enables 3D grounding for spatial reasoning and embodied AI.
- Long Context & Video Understanding: Native 256K context, expandable to 1M; handles books and hours-long video with full recall and second-level indexing.
- Enhanced Multimodal Reasoning: Excels in STEM/math—causal analysis and logical, evidence-based answers.
- Upgraded Visual Recognition: Broader, higher-quality pretraining lets it "recognize everything"—celebrities, anime, products, landmarks, flora/fauna, and more.
- Expanded OCR: Supports 32 languages (up from 19); robust in low light, blur, and tilt; better with rare/ancient characters and jargon; improved long-document structure parsing.
- Text Understanding on par with pure LLMs: Seamless text–vision fusion for lossless, unified comprehension.

1. Interleaved-MRoPE: Full-frequency allocation over time, width, and height via robust positional embeddings, enhancing long-horizon video reasoning.
2. DeepStack: Fuses multi-level ViT features to capture fine-grained details and sharpen image–text alignment.
3. Text–Timestamp Alignment: Moves beyond T-RoPE to precise, timestamp-grounded event localization for stronger video temporal modeling.

This is the weight repository for Qwen3-VL-32B-Instruct. Below, we provide simple examples showing how to use Qwen3-VL with 🤖 ModelScope and 🤗 Transformers. The code for Qwen3-VL has been merged into the latest Hugging Face `transformers`, and we advise you to build from source with the following command:
Here is a code snippet showing how to use the chat model with `transformers`:
If you find our work helpful, feel free to give us a cite.
Step3-VL-10B-AWQ
Qwen3-Coder-480B-A35B-Instruct-GPTQ-Int4-Int8Mix
Qwen3-Coder-480B-A35B-Instruct-GPTQ-Int4-Int8Mix
Base model: Qwen/Qwen3-Coder-480B-A35B-Instruct

【VLLM Launch Command for 8 GPUs (Single Node)】
Note: When launching with 8 GPUs, `--enable-expert-parallel` must be specified; otherwise, the expert tensors cannot be evenly split across tensor parallel ranks. This option is not required for 4-GPU setups.

【Model Files】
| File Size | Last Updated |
|---------|--------------|
| `261GB` | `2025-07-24` |

Today, we're announcing Qwen3-Coder, our most agentic code model to date. Qwen3-Coder is available in multiple sizes, but we're excited to introduce its most powerful variant first: Qwen3-Coder-480B-A35B-Instruct, featuring the following key enhancements:
- Significant performance among open models on Agentic Coding, Agentic Browser-Use, and other foundational coding tasks, achieving results comparable to Claude Sonnet.
- Long-context capabilities with native support for 256K tokens, extendable up to 1M tokens using YaRN, optimized for repository-scale understanding.
- Agentic coding support for most platforms such as Qwen Code and CLINE, featuring a specially designed function call format.

Qwen3-Coder-480B-A35B-Instruct has the following features:
- Type: Causal Language Models
- Training Stage: Pretraining & Post-training
- Number of Parameters: 480B in total and 35B activated
- Number of Layers: 62
- Number of Attention Heads (GQA): 96 for Q and 8 for KV
- Number of Experts: 160
- Number of Activated Experts: 8
- Context Length: 262,144 natively

NOTE: This model supports only non-thinking mode and does not generate `<think></think>` blocks in its output. Meanwhile, specifying `enable_thinking=False` is no longer required.

For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our blog, GitHub, and Documentation. We advise you to use the latest version of `transformers`.
Define Tools

```python
tools = [
    {
        "type": "function",
        "function": {
            "name": "square_the_number",
            "description": "output the square of the number.",
            "parameters": {
                "type": "object",
                "required": ["input_num"],
                "properties": {
                    "input_num": {
                        "type": "number",
                        "description": "input_num is a number that will be squared",
                    }
                },
            },
        },
    }
]
```

Define LLM

```python
from openai import OpenAI

client = OpenAI(
    # Use a custom endpoint compatible with the OpenAI API
    base_url="http://localhost:8000/v1",  # api_base
    api_key="EMPTY",
)

messages = [{"role": "user", "content": "square the number 1024"}]

completion = client.chat.completions.create(
    messages=messages,
    model="Qwen3-Coder-480B-A35B-Instruct",
    max_tokens=65536,
    tools=tools,
)
```

```bibtex
@misc{qwen3technicalreport,
      title={Qwen3 Technical Report},
      author={Qwen Team},
      year={2025},
      eprint={2505.09388},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.09388},
}
```
Qwen3-Coder-30B-A3B-Instruct-GPTQ-Int8
Qwen3.5-35B-A3B-AWQ
Qwen3-30B-A3B-Thinking-2507-AWQ
Qwen3-30B-A3B-Thinking-2507-AWQ
Base model: Qwen/Qwen3-30B-A3B-Thinking-2507

【vLLM Single Node with 4 GPUs Startup Command】
Note: You must use `--enable-expert-parallel` to start this model; otherwise, the expert tensors cannot be evenly divided across tensor parallel ranks. This is required even for 2 GPUs.

【Model Files】
| File Size | Last Updated |
|--------|--------------|
| `16GB` | `2025-07-31` |

Over the past three months, we have continued to scale the thinking capability of Qwen3-30B-A3B, improving both the quality and depth of reasoning. We are pleased to introduce Qwen3-30B-A3B-Thinking-2507, featuring the following key enhancements:
- Significantly improved performance on reasoning tasks, including logical reasoning, mathematics, science, coding, and academic benchmarks that typically require human expertise.
- Markedly better general capabilities, such as instruction following, tool usage, text generation, and alignment with human preferences.
- Enhanced 256K long-context understanding capabilities.

NOTE: This version has an increased thinking length. We strongly recommend its use in highly complex reasoning tasks.

Qwen3-30B-A3B-Thinking-2507 has the following features:
- Type: Causal Language Models
- Training Stage: Pretraining & Post-training
- Number of Parameters: 30.5B in total and 3.3B activated
- Number of Parameters (Non-Embedding): 29.9B
- Number of Layers: 48
- Number of Attention Heads (GQA): 32 for Q and 4 for KV
- Number of Experts: 128
- Number of Activated Experts: 8
- Context Length: 262,144 natively

NOTE: This model supports only thinking mode. Meanwhile, specifying `enable_thinking=True` is no longer required. Additionally, to enforce model thinking, the default chat template automatically includes `<think>`. Therefore, it is normal for the model's output to contain only `</think>` without an explicit opening `<think>` tag.

For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our blog, GitHub, and Documentation.
| | Gemini2.5-Flash-Thinking | Qwen3-235B-A22B Thinking | Qwen3-30B-A3B Thinking | Qwen3-30B-A3B-Thinking-2507 |
|--- | --- | --- | --- | --- |
| Knowledge | | | | |
| MMLU-Pro | 81.9 | 82.8 | 78.5 | 80.9 |
| MMLU-Redux | 92.1 | 92.7 | 89.5 | 91.4 |
| GPQA | 82.8 | 71.1 | 65.8 | 73.4 |
| SuperGPQA | 57.8 | 60.7 | 51.8 | 56.8 |
| Reasoning | | | | |
| AIME25 | 72.0 | 81.5 | 70.9 | 85.0 |
| HMMT25 | 64.2 | 62.5 | 49.8 | 71.4 |
| LiveBench 20241125 | 74.3 | 77.1 | 74.3 | 76.8 |
| Coding | | | | |
| LiveCodeBench v6 (25.02-25.05) | 61.2 | 55.7 | 57.4 | 66.0 |
| CFEval | 1995 | 2056 | 1940 | 2044 |
| OJBench | 23.5 | 25.6 | 20.7 | 25.1 |
| Alignment | | | | |
| IFEval | 89.8 | 83.4 | 86.5 | 88.9 |
| Arena-Hard v2$ | 56.7 | 61.5 | 36.3 | 56.0 |
| Creative Writing v3 | 85.0 | 84.6 | 79.1 | 84.4 |
| WritingBench | 83.9 | 80.3 | 77.0 | 85.0 |
| Agent | | | | |
| BFCL-v3 | 68.6 | 70.8 | 69.1 | 72.4 |
| TAU1-Retail | 65.2 | 54.8 | 61.7 | 67.8 |
| TAU1-Airline | 54.0 | 26.0 | 32.0 | 48.0 |
| TAU2-Retail | 66.7 | 40.4 | 34.2 | 58.8 |
| TAU2-Airline | 52.0 | 30.0 | 36.0 | 58.0 |
| TAU2-Telecom | 31.6 | 21.9 | 22.8 | 26.3 |
| Multilingualism | | | | |
| MultiIF | 74.4 | 71.9 | 72.2 | 76.4 |
| MMLU-ProX | 80.2 | 80.0 | 73.1 | 76.4 |
| INCLUDE | 83.9 | 78.7 | 71.9 | 74.4 |
| PolyMATH | 49.8 | 54.7 | 46.1 | 52.6 |

$ For reproducibility, we report the win rates evaluated by GPT-4.1.
\& For highly challenging tasks (including PolyMATH and all reasoning and coding tasks), we use an output length of 81,920 tokens. For all other tasks, we set the output length to 32,768.

The code of Qwen3-MoE has been merged into the latest Hugging Face `transformers`, and we advise you to use the latest version of `transformers`.
The thinking content can be parsed from the generated token IDs as follows:

```python
# 151668 is the token id of </think>
try:
    # rindex finding 151668 (</think>)
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content)  # no opening <think> tag
print("content:", content)
```

To create an OpenAI-compatible API endpoint:

- SGLang:

```shell
python -m sglang.launch_server --model-path Qwen/Qwen3-30B-A3B-Thinking-2507 --context-length 262144 --reasoning-parser deepseek-r1
```

- vLLM:

```shell
vllm serve Qwen/Qwen3-30B-A3B-Thinking-2507 --max-model-len 262144 --enable-reasoning --reasoning-parser deepseek_r1
```

For agentic use with Qwen-Agent:

```python
from qwen_agent.agents import Assistant

# Define LLM
# Using Alibaba Cloud Model Studio
llm_cfg = {
    'model': 'qwen3-30b-a3b-thinking-2507',
    'model_type': 'qwen_dashscope',
}
```

Using an OpenAI-compatible API endpoint: it is recommended to disable the reasoning and tool call parsing functionality of the deployment framework and let Qwen-Agent automate the related operations. For example, `VLLM_USE_MODELSCOPE=true vllm serve Qwen/Qwen3-30B-A3B-Thinking-2507 --served-model-name Qwen3-30B-A3B-Thinking-2507 --tensor-parallel-size 8 --max-model-len 262144`.
```python
llm_cfg = {
    'model': 'Qwen3-30B-A3B-Thinking-2507',
    # Use a custom endpoint compatible with OpenAI API:
    'model_server': 'http://localhost:8000/v1',  # api_base without reasoning and tool call parsing
    'api_key': 'EMPTY',
    'generate_cfg': {
        'thought_in_content': True,
    },
}

# Define Tools
tools = [
    {'mcpServers': {  # You can specify the MCP configuration file
        'time': {
            'command': 'uvx',
            'args': ['mcp-server-time', '--local-timezone=Asia/Shanghai']
        },
        'fetch': {
            'command': 'uvx',
            'args': ['mcp-server-fetch']
        }
    }},
    'code_interpreter',  # Built-in tools
]

# Define Agent
bot = Assistant(llm=llm_cfg, function_list=tools)

# Streaming generation
messages = [{'role': 'user', 'content': 'https://qwenlm.github.io/blog/ Introduce the latest developments of Qwen'}]
for responses in bot.run(messages=messages):
    pass
print(responses)
```

```bibtex
@misc{qwen3technicalreport,
      title={Qwen3 Technical Report},
      author={Qwen Team},
      year={2025},
      eprint={2505.09388},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.09388},
}
```
Qwen3-Coder-30B-A3B-Instruct-AWQ
Qwen3.5-397B-A17B-AWQ
Qwen3-235B-A22B-Thinking-2507-AWQ
Qwen3-235B-A22B-Thinking-2507-AWQ
Base model: Qwen/Qwen3-235B-A22B-Thinking-2507

【VLLM Launch Command for 8 GPUs (Single Node)】
Note: When launching with 8 GPUs, `--enable-expert-parallel` must be specified; otherwise, the expert tensors cannot be evenly split across tensor parallel ranks. This option is not required for 4-GPU setups.

【Model Files】
| File Size | Last Updated |
|---------|--------------|
| `116GB` | `2025-07-26` |

Over the past three months, we have continued to scale the thinking capability of Qwen3-235B-A22B, improving both the quality and depth of reasoning. We are pleased to introduce Qwen3-235B-A22B-Thinking-2507, featuring the following key enhancements:
- Significantly improved performance on reasoning tasks, including logical reasoning, mathematics, science, coding, and academic benchmarks that typically require human expertise — achieving state-of-the-art results among open-source thinking models.
- Markedly better general capabilities, such as instruction following, tool usage, text generation, and alignment with human preferences.
- Enhanced 256K long-context understanding capabilities.

NOTE: This version has an increased thinking length. We strongly recommend its use in highly complex reasoning tasks.

Qwen3-235B-A22B-Thinking-2507 has the following features:
- Type: Causal Language Models
- Training Stage: Pretraining & Post-training
- Number of Parameters: 235B in total and 22B activated
- Number of Parameters (Non-Embedding): 234B
- Number of Layers: 94
- Number of Attention Heads (GQA): 64 for Q and 4 for KV
- Number of Experts: 128
- Number of Activated Experts: 8
- Context Length: 262,144 natively

Additionally, to enforce model thinking, the default chat template automatically includes `<think>`. Therefore, it is normal for the model's output to contain only `</think>` without an explicit opening `<think>` tag.

For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our blog, GitHub, and Documentation.
| | Deepseek-R1-0528 | OpenAI O4-mini | OpenAI O3 | Gemini-2.5 Pro | Claude4 Opus Thinking | Qwen3-235B-A22B Thinking | Qwen3-235B-A22B-Thinking-2507 |
|--- | --- | --- | --- | --- | --- | --- | --- |
| Knowledge | | | | | | | |
| MMLU-Pro | 85.0 | 81.9 | 85.9 | 85.6 | - | 82.8 | 84.4 |
| MMLU-Redux | 93.4 | 92.8 | 94.9 | 94.4 | 94.6 | 92.7 | 93.8 |
| GPQA | 81.0 | 81.4 | 83.3 | 86.4 | 79.6 | 71.1 | 81.1 |
| SuperGPQA | 61.7 | 56.4 | - | 62.3 | - | 60.7 | 64.9 |
| Reasoning | | | | | | | |
| AIME25 | 87.5 | 92.7 | 88.9 | 88.0 | 75.5 | 81.5 | 92.3 |
| HMMT25 | 79.4 | 66.7 | 77.5 | 82.5 | 58.3 | 62.5 | 83.9 |
| LiveBench 20241125 | 74.7 | 75.8 | 78.3 | 82.4 | 78.2 | 77.1 | 78.4 |
| HLE | 17.7# | 18.1 | 20.3 | 21.6 | 10.7 | 11.8# | 18.2# |
| Coding | | | | | | | |
| LiveCodeBench v6 (25.02-25.05) | 68.7 | 71.8 | 58.6 | 72.5 | 48.9 | 55.7 | 74.1 |
| CFEval | 2099 | 1929 | 2043 | 2001 | - | 2056 | 2134 |
| OJBench | 33.6 | 33.3 | 25.4 | 38.9 | - | 25.6 | 32.5 |
| Alignment | | | | | | | |
| IFEval | 79.1 | 92.4 | 92.1 | 90.8 | 89.7 | 83.4 | 87.8 |
| Arena-Hard v2$ | 72.2 | 59.3 | 80.8 | 72.5 | 59.1 | 61.5 | 79.7 |
| Creative Writing v3 | 86.3 | 78.8 | 87.7 | 85.9 | 83.8 | 84.6 | 86.1 |
| WritingBench | 83.2 | 78.4 | 85.3 | 83.1 | 79.1 | 80.3 | 88.3 |
| Agent | | | | | | | |
| BFCL-v3 | 63.8 | 67.2 | 72.4 | 67.2 | 61.8 | 70.8 | 71.9 |
| TAU2-Retail | 64.9 | 71.0 | 76.3 | 71.3 | - | 40.4 | 71.9 |
| TAU2-Airline | 60.0 | 59.0 | 70.0 | 60.0 | - | 30.0 | 58.0 |
| TAU2-Telecom | 33.3 | 42.0 | 60.5 | 37.4 | - | 21.9 | 45.6 |
| Multilingualism | | | | | | | |
| MultiIF | 63.5 | 78.0 | 80.3 | 77.8 | - | 71.9 | 80.6 |
| MMLU-ProX | 80.6 | 79.0 | 83.3 | 84.7 | - | 80.0 | 81.0 |
| INCLUDE | 79.4 | 80.8 | 86.6 | 85.1 | - | 78.7 | 81.0 |
| PolyMATH | 46.9 | 48.7 | 49.7 | 52.2 | - | 54.7 | 60.1 |

For OpenAI O4-mini and O3, we use a medium reasoning effort, except for scores marked with , which are generated using high reasoning effort.
\# According to the official evaluation criteria of HLE, scores marked with \# refer to models that are not multi-modal and were evaluated only on the text-only subset.
$ For reproducibility, we report the win rates evaluated by GPT-4.1.
\& For highly challenging tasks (including PolyMATH and all reasoning and coding tasks), we use an output length of 81,920 tokens. For all other tasks, we set the output length to 32,768.

The code of Qwen3-MoE has been merged into the latest Hugging Face `transformers`, and we advise you to use the latest version of `transformers`.

The thinking content can be parsed from the generated token IDs as follows:

```python
# 151668 is the token id of </think>
try:
    # rindex finding 151668 (</think>)
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content)  # no opening <think> tag
print("content:", content)
```

To create an OpenAI-compatible API endpoint:

- SGLang:

```shell
python -m sglang.launch_server --model-path Qwen/Qwen3-235B-A22B-Thinking-2507 --tp 8 --context-length 262144 --reasoning-parser qwen3
```

- vLLM:

```shell
vllm serve Qwen/Qwen3-235B-A22B-Thinking-2507 --tensor-parallel-size 8 --max-model-len 262144 --enable-reasoning --reasoning-parser deepseek_r1
```

For agentic use with Qwen-Agent:

```python
from qwen_agent.agents import Assistant

# Define LLM
# Using Alibaba Cloud Model Studio
llm_cfg = {
    'model': 'qwen3-235b-a22b-thinking-2507',
    'model_type': 'qwen_dashscope',
}
```

Using an OpenAI-compatible API endpoint: it is recommended to disable the reasoning and tool call parsing functionality of the deployment framework and let Qwen-Agent automate the related operations. For example, `VLLM_USE_MODELSCOPE=true vllm serve Qwen/Qwen3-235B-A22B-Thinking-2507 --served-model-name Qwen3-235B-A22B-Thinking-2507 --tensor-parallel-size 8 --max-model-len 262144`.
```python
llm_cfg = {
    'model': 'Qwen3-235B-A22B-Thinking-2507',
    # Use a custom endpoint compatible with OpenAI API:
    'model_server': 'http://localhost:8000/v1',  # api_base without reasoning and tool call parsing
    'api_key': 'EMPTY',
    'generate_cfg': {
        'thought_in_content': True,
    },
}

# Define Tools
tools = [
    {'mcpServers': {  # You can specify the MCP configuration file
        'time': {
            'command': 'uvx',
            'args': ['mcp-server-time', '--local-timezone=Asia/Shanghai']
        },
        'fetch': {
            'command': 'uvx',
            'args': ['mcp-server-fetch']
        }
    }},
    'code_interpreter',  # Built-in tools
]

# Define Agent
bot = Assistant(llm=llm_cfg, function_list=tools)

# Streaming generation
messages = [{'role': 'user', 'content': 'https://qwenlm.github.io/blog/ Introduce the latest developments of Qwen'}]
for responses in bot.run(messages=messages):
    pass
print(responses)
```

```bibtex
@misc{qwen3technicalreport,
      title={Qwen3 Technical Report},
      author={Qwen Team},
      year={2025},
      eprint={2505.09388},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.09388},
}
```
GLM-4.5V-AWQ
Qwen3-Coder-480B-A35B-Instruct-AWQ
Qwen3-Coder-480B-A35B-Instruct-AWQ
Base model: Qwen/Qwen3-Coder-480B-A35B-Instruct

【VLLM Launch Command for 8 GPUs (Single Node)】
Note: When launching with 8 GPUs, `--enable-expert-parallel` must be specified; otherwise, the expert tensors cannot be evenly split across tensor parallel ranks. This option is not required for 4-GPU setups.

【Model Files】
| File Size | Last Updated |
|---------|--------------|
| `236GB` | `2025-07-23` |

Today, we're announcing Qwen3-Coder, our most agentic code model to date. Qwen3-Coder is available in multiple sizes, but we're excited to introduce its most powerful variant first: Qwen3-Coder-480B-A35B-Instruct, featuring the following key enhancements:
- Significant performance among open models on Agentic Coding, Agentic Browser-Use, and other foundational coding tasks, achieving results comparable to Claude Sonnet.
- Long-context capabilities with native support for 256K tokens, extendable up to 1M tokens using YaRN, optimized for repository-scale understanding.
- Agentic coding support for most platforms such as Qwen Code and CLINE, featuring a specially designed function call format.

Qwen3-Coder-480B-A35B-Instruct has the following features:
- Type: Causal Language Models
- Training Stage: Pretraining & Post-training
- Number of Parameters: 480B in total and 35B activated
- Number of Layers: 62
- Number of Attention Heads (GQA): 96 for Q and 8 for KV
- Number of Experts: 160
- Number of Activated Experts: 8
- Context Length: 262,144 natively

NOTE: This model supports only non-thinking mode and does not generate `<think></think>` blocks in its output. Meanwhile, specifying `enable_thinking=False` is no longer required.

For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our blog, GitHub, and Documentation. We advise you to use the latest version of `transformers`.
Define Tools

```python
tools = [
    {
        "type": "function",
        "function": {
            "name": "square_the_number",
            "description": "output the square of the number.",
            "parameters": {
                "type": "object",
                "required": ["input_num"],
                "properties": {
                    "input_num": {
                        "type": "number",
                        "description": "input_num is a number that will be squared",
                    }
                },
            },
        },
    }
]
```

Define LLM

```python
from openai import OpenAI

client = OpenAI(
    # Use a custom endpoint compatible with the OpenAI API
    base_url="http://localhost:8000/v1",  # api_base
    api_key="EMPTY",
)

messages = [{"role": "user", "content": "square the number 1024"}]

completion = client.chat.completions.create(
    messages=messages,
    model="Qwen3-Coder-480B-A35B-Instruct",
    max_tokens=65536,
    tools=tools,
)
```

```bibtex
@misc{qwen3technicalreport,
      title={Qwen3 Technical Report},
      author={Qwen Team},
      year={2025},
      eprint={2505.09388},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.09388},
}
```
Qwen3-VL-32B-Thinking-AWQ
Qwen3-VL-32B-Thinking-AWQ

Base Model: Qwen/Qwen3-VL-32B-Thinking

【Dependencies / Installation】

As of 2025-10-22, create a fresh Python environment and run:

【Model Files】

| File Size | Last Updated |
|-----------|--------------|
| `20 GiB` | `2025-10-22` |

Meet Qwen3-VL — the most powerful vision-language model in the Qwen series to date. This generation delivers comprehensive upgrades across the board: superior text understanding & generation, deeper visual perception & reasoning, extended context length, enhanced spatial and video dynamics comprehension, and stronger agent interaction capabilities. Available in Dense and MoE architectures that scale from edge to cloud, with Instruct and reasoning‑enhanced Thinking editions for flexible, on‑demand deployment.

- Visual Agent: Operates PC/mobile GUIs—recognizes elements, understands functions, invokes tools, completes tasks.
- Visual Coding Boost: Generates Draw.io/HTML/CSS/JS from images/videos.
- Advanced Spatial Perception: Judges object positions, viewpoints, and occlusions; provides stronger 2D grounding and enables 3D grounding for spatial reasoning and embodied AI.
- Long Context & Video Understanding: Native 256K context, expandable to 1M; handles books and hours-long video with full recall and second-level indexing.
- Enhanced Multimodal Reasoning: Excels in STEM/Math—causal analysis and logical, evidence-based answers.
- Upgraded Visual Recognition: Broader, higher-quality pretraining enables it to “recognize everything”—celebrities, anime, products, landmarks, flora/fauna, etc.
- Expanded OCR: Supports 32 languages (up from 19); robust in low light, blur, and tilt; better with rare/ancient characters and jargon; improved long-document structure parsing.
- Text Understanding on par with pure LLMs: Seamless text–vision fusion for lossless, unified comprehension.

1. Interleaved-MRoPE: Full‑frequency allocation over time, width, and height via robust positional embeddings, enhancing long‑horizon video reasoning.
2. DeepStack: Fuses multi‑level ViT features to capture fine‑grained details and sharpen image–text alignment.
3. Text–Timestamp Alignment: Moves beyond T‑RoPE to precise, timestamp‑grounded event localization for stronger video temporal modeling.

This is the weight repository for Qwen3-VL-32B-Thinking. Below, we provide simple examples to show how to use Qwen3-VL with 🤖 ModelScope and 🤗 Transformers. The code of Qwen3-VL has been merged into the latest Hugging Face `transformers`, and we advise you to build from source. Here we show a code snippet demonstrating how to use the chat model with `transformers`. If you find our work helpful, feel free to cite us.
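The `transformers` snippet referenced above did not survive extraction. Below is a minimal sketch under stated assumptions: the model id, example image URL, and the use of `AutoModelForImageTextToText` are illustrative, so verify against the official Qwen3-VL model card. The heavy model load is gated behind a flag so the message-building helper can be inspected on its own:

```python
RUN_MODEL = False  # flip to True on a machine with the weights and enough VRAM

def build_messages(image_url: str, question: str):
    """Single-turn multimodal chat in the list-of-dicts format that
    AutoProcessor.apply_chat_template expects."""
    return [{
        "role": "user",
        "content": [
            {"type": "image", "url": image_url},
            {"type": "text", "text": question},
        ],
    }]

if RUN_MODEL:
    from transformers import AutoModelForImageTextToText, AutoProcessor

    model_id = "Qwen/Qwen3-VL-32B-Thinking"  # or this AWQ repo's path
    processor = AutoProcessor.from_pretrained(model_id)
    model = AutoModelForImageTextToText.from_pretrained(
        model_id, torch_dtype="auto", device_map="auto"
    )
    messages = build_messages("https://example.com/demo.jpg", "Describe this image.")
    inputs = processor.apply_chat_template(
        messages, add_generation_prompt=True, tokenize=True,
        return_dict=True, return_tensors="pt",
    ).to(model.device)
    out = model.generate(**inputs, max_new_tokens=512)
    print(processor.batch_decode(
        out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )[0])
```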
Qwen3-VL-235B-A22B-Thinking-AWQ
Qwen3-30B-A3B-Thinking-2507-GPTQ-Int8
Qwen3-30B-A3B-Thinking-2507-GPTQ-Int8

Base model: Qwen/Qwen3-30B-A3B-Thinking-2507

【vLLM 4-GPU Single Node Launch Command】

Note: When using 4 GPUs, you must include `--enable-expert-parallel`; otherwise the expert tensors cannot be evenly split across tensor parallel ranks. For 2 GPUs this is not necessary.

【Model Files】

| File Size | Last Updated |
|--------|--------------|
| `30GB` | `2025-07-31` |

Over the past three months, we have continued to scale the thinking capability of Qwen3-30B-A3B, improving both the quality and depth of reasoning. We are pleased to introduce Qwen3-30B-A3B-Thinking-2507, featuring the following key enhancements:

- Significantly improved performance on reasoning tasks, including logical reasoning, mathematics, science, coding, and academic benchmarks that typically require human expertise.
- Markedly better general capabilities, such as instruction following, tool usage, text generation, and alignment with human preferences.
- Enhanced 256K long-context understanding capabilities.

NOTE: This version has an increased thinking length. We strongly recommend its use in highly complex reasoning tasks.

Qwen3-30B-A3B-Thinking-2507 has the following features:

- Type: Causal Language Models
- Training Stage: Pretraining & Post-training
- Number of Parameters: 30.5B in total and 3.3B activated
- Number of Parameters (Non-Embedding): 29.9B
- Number of Layers: 48
- Number of Attention Heads (GQA): 32 for Q and 4 for KV
- Number of Experts: 128
- Number of Activated Experts: 8
- Context Length: 262,144 natively

NOTE: This model supports only thinking mode. Meanwhile, specifying `enable_thinking=True` is no longer required. Additionally, to enforce model thinking, the default chat template automatically includes `<think>`. Therefore, it is normal for the model's output to contain only `</think>` without an explicit opening `<think>` tag.

For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our blog, GitHub, and Documentation.
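Because only the closing tag appears in the output, client-side parsing must treat everything before `</think>` as reasoning. A tokenizer-free sketch of that split (the official snippet later in this card does the same thing on token ids):

```python
def split_thinking(text: str, close_tag: str = "</think>"):
    """Split a Qwen3 thinking-model completion into (reasoning, answer).

    The default chat template pre-fills the opening <think> tag, so the raw
    completion contains only the closing tag."""
    head, sep, tail = text.partition(close_tag)
    if not sep:  # no closing tag: treat everything as the answer
        return "", text.strip()
    return head.strip(), tail.strip()

reasoning, answer = split_thinking("Let me compute 2+2.\n</think>\n4")
print(answer)  # 4
```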
| | Gemini2.5-Flash-Thinking | Qwen3-235B-A22B Thinking | Qwen3-30B-A3B Thinking | Qwen3-30B-A3B-Thinking-2507 |
|--- | --- | --- | --- | --- |
| Knowledge | | | | |
| MMLU-Pro | 81.9 | 82.8 | 78.5 | 80.9 |
| MMLU-Redux | 92.1 | 92.7 | 89.5 | 91.4 |
| GPQA | 82.8 | 71.1 | 65.8 | 73.4 |
| SuperGPQA | 57.8 | 60.7 | 51.8 | 56.8 |
| Reasoning | | | | |
| AIME25 | 72.0 | 81.5 | 70.9 | 85.0 |
| HMMT25 | 64.2 | 62.5 | 49.8 | 71.4 |
| LiveBench 20241125 | 74.3 | 77.1 | 74.3 | 76.8 |
| Coding | | | | |
| LiveCodeBench v6 (25.02-25.05) | 61.2 | 55.7 | 57.4 | 66.0 |
| CFEval | 1995 | 2056 | 1940 | 2044 |
| OJBench | 23.5 | 25.6 | 20.7 | 25.1 |
| Alignment | | | | |
| IFEval | 89.8 | 83.4 | 86.5 | 88.9 |
| Arena-Hard v2$ | 56.7 | 61.5 | 36.3 | 56.0 |
| Creative Writing v3 | 85.0 | 84.6 | 79.1 | 84.4 |
| WritingBench | 83.9 | 80.3 | 77.0 | 85.0 |
| Agent | | | | |
| BFCL-v3 | 68.6 | 70.8 | 69.1 | 72.4 |
| TAU1-Retail | 65.2 | 54.8 | 61.7 | 67.8 |
| TAU1-Airline | 54.0 | 26.0 | 32.0 | 48.0 |
| TAU2-Retail | 66.7 | 40.4 | 34.2 | 58.8 |
| TAU2-Airline | 52.0 | 30.0 | 36.0 | 58.0 |
| TAU2-Telecom | 31.6 | 21.9 | 22.8 | 26.3 |
| Multilingualism | | | | |
| MultiIF | 74.4 | 71.9 | 72.2 | 76.4 |
| MMLU-ProX | 80.2 | 80.0 | 73.1 | 76.4 |
| INCLUDE | 83.9 | 78.7 | 71.9 | 74.4 |
| PolyMATH | 49.8 | 54.7 | 46.1 | 52.6 |

$ For reproducibility, we report the win rates evaluated by GPT-4.1.
\& For highly challenging tasks (including PolyMATH and all reasoning and coding tasks), we use an output length of 81,920 tokens. For all other tasks, we set the output length to 32,768.

The code of Qwen3-MoE has been merged into the latest Hugging Face `transformers`, and we advise you to use the latest version of `transformers`.
With `transformers`, parse the thinking content out of the generated token ids:

```python
# token id 151668 corresponds to the closing </think> tag
try:
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content)  # no opening <think> tag
print("content:", content)
```

```shell
python -m sglang.launch_server --model-path Qwen/Qwen3-30B-A3B-Thinking-2507 --context-length 262144 --reasoning-parser deepseek-r1
```

```shell
vllm serve Qwen/Qwen3-30B-A3B-Thinking-2507 --max-model-len 262144 --enable-reasoning --reasoning-parser deepseek_r1
```

```python
from qwen_agent.agents import Assistant

# Define LLM
# Using Alibaba Cloud Model Studio
llm_cfg = {
    'model': 'qwen3-30b-a3b-thinking-2507',
    'model_type': 'qwen_dashscope',
}
```

Using an OpenAI-compatible API endpoint: it is recommended to disable the reasoning and tool call parsing functionality of the deployment framework and let Qwen-Agent automate the related operations. For example: `VLLM_USE_MODELSCOPE=true vllm serve Qwen/Qwen3-30B-A3B-Thinking-2507 --served-model-name Qwen3-30B-A3B-Thinking-2507 --tensor-parallel-size 8 --max-model-len 262144`.
```python
llm_cfg = {
    'model': 'Qwen3-30B-A3B-Thinking-2507',
    # Use a custom endpoint compatible with OpenAI API:
    'model_server': 'http://localhost:8000/v1',  # api_base without reasoning and tool call parsing
    'api_key': 'EMPTY',
    'generate_cfg': {
        'thought_in_content': True,
    },
}

# Define Tools
tools = [
    {'mcpServers': {  # You can specify the MCP configuration file
        'time': {
            'command': 'uvx',
            'args': ['mcp-server-time', '--local-timezone=Asia/Shanghai']
        },
        'fetch': {
            'command': 'uvx',
            'args': ['mcp-server-fetch']
        }
    }},
    'code_interpreter',  # Built-in tools
]

# Define Agent
bot = Assistant(llm=llm_cfg, function_list=tools)

# Streaming generation
messages = [{'role': 'user', 'content': 'https://qwenlm.github.io/blog/ Introduce the latest developments of Qwen'}]
for responses in bot.run(messages=messages):
    pass
print(responses)
```

```bibtex
@misc{qwen3technicalreport,
      title={Qwen3 Technical Report},
      author={Qwen Team},
      year={2025},
      eprint={2505.09388},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.09388}
}
```
DeepSeek-V3.2-Exp-AWQ
Qwen3.5-122B-A10B-AWQ
MiniMax-M2-REAP-162B-A10B-AWQ
Seed-OSS-36B-Instruct-AWQ
MiniMax-M2.5-AWQ
Qwen3-235B-A22B-Instruct-2507-GPTQ-Int4-Int8Mix
GLM-4.6-GPTQ-Int4-Int8Mix
GLM-4.6-GPTQ-Int4-Int8Mix

Base Model: zai-org/GLM-4.6

【Dependencies / Installation】

As of 2025-10-01, create a fresh Python environment and run:

【vLLM Startup Command】

Note: When launching with TP=8, include `--enable-expert-parallel`; otherwise the expert tensors cannot be evenly sharded across GPU devices.

【Model Files】

| File Size | Last Updated |
|-----------|--------------|
| `232GB` | `2025-10-03` |

📖 Check out the GLM-4.6 technical blog, the technical report (GLM-4.5), and the Zhipu AI technical documentation.

Compared with GLM-4.5, GLM-4.6 brings several key improvements:

- Longer context window: expanded from 128K to 200K tokens, enabling the model to handle more complex agentic tasks.
- Superior coding performance: higher scores on code benchmarks and better real-world performance in applications such as Claude Code, Cline, Roo Code, and Kilo Code, including improvements in generating visually polished front-end pages.
- Advanced reasoning: a clear improvement in reasoning performance and support for tool use during inference, leading to stronger overall capability.
- More capable agents: stronger performance in tool use and search-based agents, and more effective integration within agent frameworks.
- Refined writing: better alignment with human preferences in style and readability, and more natural performance in role-playing scenarios.

We evaluated GLM-4.6 across eight public benchmarks covering agents, reasoning, and coding. Results show clear gains over GLM-4.5, with GLM-4.6 also holding competitive advantages over leading domestic and international models such as DeepSeek-V3.1-Terminus and Claude Sonnet 4.

Both GLM-4.5 and GLM-4.6 use the same inference method. For general evaluations, we recommend using a sampling temperature of 1.0. For code-related evaluation tasks (such as LCB), it is further recommended to set:

- For tool-integrated reasoning, please refer to this doc.
- For search benchmarks, we design a specific format for search tool calls in thinking mode to support search agents; please refer to this doc for the detailed template.
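The card recommends a sampling temperature of 1.0 for general evaluations. A minimal sketch of building request parameters for an OpenAI-compatible endpoint (the served model name is an assumption; match it to your vLLM `--served-model-name`):

```python
def build_request(messages):
    """Chat-completion kwargs following the card's sampling recommendation
    (temperature 1.0 for general evaluations)."""
    return {
        "model": "GLM-4.6",    # assumption: served model name
        "messages": messages,
        "temperature": 1.0,    # recommended for general evaluations
    }

# Pass the kwargs to any OpenAI-compatible client, e.g.
# client.chat.completions.create(**build_request(messages))
```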
DeepSeek-V3.2-Speciale-AWQ
DeepSeek-V3.1-AWQ
【Dependencies / Installation】

As of 2025-08-27, create a fresh Python environment and run:

【Model Files】

| File Size | Last Updated |
|-----------|--------------|
| `371GB` | `2025-08-23` |

DeepSeek-V3.1 is a hybrid model that supports both thinking mode and non-thinking mode. Compared to the previous version, this upgrade brings improvements in multiple aspects:

- Hybrid thinking mode: one model supports both thinking mode and non-thinking mode by changing the chat template.
- Smarter tool calling: through post-training optimization, the model's performance in tool usage and agent tasks has significantly improved.
- Higher thinking efficiency: DeepSeek-V3.1-Think achieves comparable answer quality to DeepSeek-R1-0528, while responding more quickly.

DeepSeek-V3.1 is post-trained on top of DeepSeek-V3.1-Base, which is built upon the original V3 base checkpoint through a two-phase long-context extension approach, following the methodology outlined in the original DeepSeek-V3 report. We have expanded our dataset by collecting additional long documents and substantially extending both training phases. The 32K extension phase has been increased 10-fold to 630B tokens, while the 128K extension phase has been extended by 3.3x to 209B tokens. Additionally, DeepSeek-V3.1 is trained using the UE8M0 FP8 scale data format to ensure compatibility with microscaling data formats.

| Model | #Total Params | #Activated Params | Context Length | Download |
| :------------: | :------------: | :------------: | :------------: | :------------: |
| DeepSeek-V3.1-Base | 671B | 37B | 128K | HuggingFace \| ModelScope |
| DeepSeek-V3.1 | 671B | 37B | 128K | HuggingFace \| ModelScope |

The details of our chat template are described in `tokenizer_config.json` and `assets/chat_template.jinja`. Here is a brief description. With the given prefix, DeepSeek-V3.1 generates responses to queries in non-thinking mode. Unlike DeepSeek-V3, it introduces an additional token `</think>`.
Non-Thinking

Multi-Turn Context:

`<｜begin▁of▁sentence｜>{system prompt}<｜User｜>{query}<｜Assistant｜></think>{response}<｜end▁of▁sentence｜>...<｜User｜>{query}<｜Assistant｜></think>{response}<｜end▁of▁sentence｜>`

By concatenating the context and the prefix, we obtain the correct prompt for the query.

Thinking

The prefix of thinking mode is similar to DeepSeek-R1.

Multi-Turn Context:

`<｜begin▁of▁sentence｜>{system prompt}<｜User｜>{query}<｜Assistant｜></think>{response}<｜end▁of▁sentence｜>...<｜User｜>{query}<｜Assistant｜></think>{response}<｜end▁of▁sentence｜>`

The multi-turn template is the same as the non-thinking multi-turn chat template. This means the thinking tokens in the last turn are dropped, but `</think>` is retained in every turn of context.

ToolCall

Tool calling is supported in non-thinking mode. The format is:

`<｜begin▁of▁sentence｜>{system prompt}{tool_description}<｜User｜>{query}<｜Assistant｜></think>`

where the `tool_description` is given in the chat template.

Code-Agent

We support various code agent frameworks. Please refer to the above tool call format to create your own code agents. An example is shown in `assets/code_agent_trajectory.html`.

Search-Agent

We design a specific format for search tool calls in thinking mode to support search agents. For complex questions that require accessing external or up-to-date information, DeepSeek-V3.1 can leverage a user-provided search tool through a multi-turn tool-calling process. Please refer to `assets/search_tool_trajectory.html` and `assets/search_python_tool_trajectory.html` for the detailed template.
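The template description above can be sketched as a prompt-assembly function. The token strings are reconstructed from this card's description, so verify them against the official chat template in `tokenizer_config.json` before relying on them:

```python
# Token strings as reconstructed from the template description above (assumption).
BOS = "<｜begin▁of▁sentence｜>"
USER = "<｜User｜>"
ASSISTANT = "<｜Assistant｜>"
EOS = "<｜end▁of▁sentence｜>"

def build_prompt(system: str, turns, thinking: bool = False) -> str:
    """Assemble a DeepSeek-V3.1 style multi-turn prompt.

    `turns` is a list of (query, response) pairs; pass None as the last
    response to leave the prompt open for generation. Completed turns always
    keep `</think>`; only the open turn gets `<think>` in thinking mode."""
    prompt = BOS + system
    for query, response in turns:
        prompt += USER + query + ASSISTANT
        if response is None:
            prompt += "<think>" if thinking else "</think>"
        else:
            prompt += "</think>" + response + EOS
    return prompt
```

A sketch only; production code should call `tokenizer.apply_chat_template` instead of string concatenation.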
Evaluation

| Category | Benchmark (Metric) | DeepSeek-V3.1-NonThinking | DeepSeek-V3-0324 | DeepSeek-V3.1-Thinking | DeepSeek-R1-0528 |
|----------|----------------------------------|-----------------|---|---|---|
| General | MMLU-Redux (EM) | 91.8 | 90.5 | 93.7 | 93.4 |
| | MMLU-Pro (EM) | 83.7 | 81.2 | 84.8 | 85.0 |
| | GPQA-Diamond (Pass@1) | 74.9 | 68.4 | 80.1 | 81.0 |
| | Humanity's Last Exam (Pass@1) | - | - | 15.9 | 17.7 |
| Search Agent | BrowseComp | - | - | 30.0 | 8.9 |
| | BrowseComp_zh | - | - | 49.2 | 35.7 |
| | Humanity's Last Exam (Python + Search) | - | - | 29.8 | 24.8 |
| | SimpleQA | - | - | 93.4 | 92.3 |
| Code | LiveCodeBench (2408-2505) (Pass@1) | 56.4 | 43.0 | 74.8 | 73.3 |
| | Codeforces-Div1 (Rating) | - | - | 2091 | 1930 |
| | Aider-Polyglot (Acc.) | 68.4 | 55.1 | 76.3 | 71.6 |
| Code Agent | SWE Verified (Agent mode) | 66.0 | 45.4 | - | 44.6 |
| | SWE-bench Multilingual (Agent mode) | 54.5 | 29.3 | - | 30.5 |
| | Terminal-bench (Terminus 1 framework) | 31.3 | 13.3 | - | 5.7 |
| Math | AIME 2024 (Pass@1) | 66.3 | 59.4 | 93.1 | 91.4 |
| | AIME 2025 (Pass@1) | 49.8 | 51.3 | 88.4 | 87.5 |
| | HMMT 2025 (Pass@1) | 33.5 | 29.2 | 84.2 | 79.4 |

Note:
- Search agents are evaluated with our internal search framework, which uses a commercial search API + webpage filter + 128K context window. Search agent results of R1-0528 are evaluated with a pre-defined workflow.
- SWE-bench is evaluated with our internal code agent framework.

The model structure of DeepSeek-V3.1 is the same as DeepSeek-V3. Please visit the DeepSeek-V3 repo for more information about running this model locally.

This repository and the model weights are licensed under the MIT License.

If you have any questions, please raise an issue or contact us at [email protected].
Qwen3-30B-A3B-Thinking-2507-AWQ-BF16Mix
Qwen3-30B-A3B-Thinking-2507-AWQ-BF16Mix

Base model: Qwen/Qwen3-30B-A3B-Thinking-2507

【vLLM Single Node with 8 GPUs Startup Command】

Note: You must use `--enable-expert-parallel` to start this model; otherwise the expert tensors cannot be evenly split across tensor parallel ranks. This is required even for 2 GPUs.

【❗❗Temporary Patch for vllm==0.10.0❗❗】

The `awq_marlin` module in `vllm` does not check the `modules_to_not_convert` parameter when loading AWQ-MoE modules, which causes mixed quantization of MoE to fail or report errors. Refer to: [[PR #21888]](https://github.com/vllm-project/vllm/pull/21888) Before the PR is merged, temporarily replace `awq_marlin.py` at `vllm/model_executor/layers/quantization/awq_marlin.py`.

【Model Files】

| File Size | Last Updated |
|--------|--------------|
| `23GB` | `2025-07-31` |

Over the past three months, we have continued to scale the thinking capability of Qwen3-30B-A3B, improving both the quality and depth of reasoning. We are pleased to introduce Qwen3-30B-A3B-Thinking-2507, featuring the following key enhancements:

- Significantly improved performance on reasoning tasks, including logical reasoning, mathematics, science, coding, and academic benchmarks that typically require human expertise.
- Markedly better general capabilities, such as instruction following, tool usage, text generation, and alignment with human preferences.
- Enhanced 256K long-context understanding capabilities.

NOTE: This version has an increased thinking length. We strongly recommend its use in highly complex reasoning tasks.

Qwen3-30B-A3B-Thinking-2507 has the following features:

- Type: Causal Language Models
- Training Stage: Pretraining & Post-training
- Number of Parameters: 30.5B in total and 3.3B activated
- Number of Parameters (Non-Embedding): 29.9B
- Number of Layers: 48
- Number of Attention Heads (GQA): 32 for Q and 4 for KV
- Number of Experts: 128
- Number of Activated Experts: 8
- Context Length: 262,144 natively

NOTE: This model supports only thinking mode.
Meanwhile, specifying `enable_thinking=True` is no longer required. Additionally, to enforce model thinking, the default chat template automatically includes `<think>`. Therefore, it is normal for the model's output to contain only `</think>` without an explicit opening `<think>` tag.

For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our blog, GitHub, and Documentation.

| | Gemini2.5-Flash-Thinking | Qwen3-235B-A22B Thinking | Qwen3-30B-A3B Thinking | Qwen3-30B-A3B-Thinking-2507 |
|--- | --- | --- | --- | --- |
| Knowledge | | | | |
| MMLU-Pro | 81.9 | 82.8 | 78.5 | 80.9 |
| MMLU-Redux | 92.1 | 92.7 | 89.5 | 91.4 |
| GPQA | 82.8 | 71.1 | 65.8 | 73.4 |
| SuperGPQA | 57.8 | 60.7 | 51.8 | 56.8 |
| Reasoning | | | | |
| AIME25 | 72.0 | 81.5 | 70.9 | 85.0 |
| HMMT25 | 64.2 | 62.5 | 49.8 | 71.4 |
| LiveBench 20241125 | 74.3 | 77.1 | 74.3 | 76.8 |
| Coding | | | | |
| LiveCodeBench v6 (25.02-25.05) | 61.2 | 55.7 | 57.4 | 66.0 |
| CFEval | 1995 | 2056 | 1940 | 2044 |
| OJBench | 23.5 | 25.6 | 20.7 | 25.1 |
| Alignment | | | | |
| IFEval | 89.8 | 83.4 | 86.5 | 88.9 |
| Arena-Hard v2$ | 56.7 | 61.5 | 36.3 | 56.0 |
| Creative Writing v3 | 85.0 | 84.6 | 79.1 | 84.4 |
| WritingBench | 83.9 | 80.3 | 77.0 | 85.0 |
| Agent | | | | |
| BFCL-v3 | 68.6 | 70.8 | 69.1 | 72.4 |
| TAU1-Retail | 65.2 | 54.8 | 61.7 | 67.8 |
| TAU1-Airline | 54.0 | 26.0 | 32.0 | 48.0 |
| TAU2-Retail | 66.7 | 40.4 | 34.2 | 58.8 |
| TAU2-Airline | 52.0 | 30.0 | 36.0 | 58.0 |
| TAU2-Telecom | 31.6 | 21.9 | 22.8 | 26.3 |
| Multilingualism | | | | |
| MultiIF | 74.4 | 71.9 | 72.2 | 76.4 |
| MMLU-ProX | 80.2 | 80.0 | 73.1 | 76.4 |
| INCLUDE | 83.9 | 78.7 | 71.9 | 74.4 |
| PolyMATH | 49.8 | 54.7 | 46.1 | 52.6 |

$ For reproducibility, we report the win rates evaluated by GPT-4.1.
\& For highly challenging tasks (including PolyMATH and all reasoning and coding tasks), we use an output length of 81,920 tokens. For all other tasks, we set the output length to 32,768.
The code of Qwen3-MoE has been merged into the latest Hugging Face `transformers`, and we advise you to use the latest version of `transformers`. With `transformers`, parse the thinking content out of the generated token ids:

```python
# token id 151668 corresponds to the closing </think> tag
try:
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content)  # no opening <think> tag
print("content:", content)
```

```shell
python -m sglang.launch_server --model-path Qwen/Qwen3-30B-A3B-Thinking-2507 --context-length 262144 --reasoning-parser deepseek-r1
```

```shell
vllm serve Qwen/Qwen3-30B-A3B-Thinking-2507 --max-model-len 262144 --enable-reasoning --reasoning-parser deepseek_r1
```

```python
from qwen_agent.agents import Assistant

# Define LLM
# Using Alibaba Cloud Model Studio
llm_cfg = {
    'model': 'qwen3-30b-a3b-thinking-2507',
    'model_type': 'qwen_dashscope',
}
```

Using an OpenAI-compatible API endpoint: it is recommended to disable the reasoning and tool call parsing functionality of the deployment framework and let Qwen-Agent automate the related operations. For example: `VLLM_USE_MODELSCOPE=true vllm serve Qwen/Qwen3-30B-A3B-Thinking-2507 --served-model-name Qwen3-30B-A3B-Thinking-2507 --tensor-parallel-size 8 --max-model-len 262144`.
```python
llm_cfg = {
    'model': 'Qwen3-30B-A3B-Thinking-2507',
    # Use a custom endpoint compatible with OpenAI API:
    'model_server': 'http://localhost:8000/v1',  # api_base without reasoning and tool call parsing
    'api_key': 'EMPTY',
    'generate_cfg': {
        'thought_in_content': True,
    },
}

# Define Tools
tools = [
    {'mcpServers': {  # You can specify the MCP configuration file
        'time': {
            'command': 'uvx',
            'args': ['mcp-server-time', '--local-timezone=Asia/Shanghai']
        },
        'fetch': {
            'command': 'uvx',
            'args': ['mcp-server-fetch']
        }
    }},
    'code_interpreter',  # Built-in tools
]

# Define Agent
bot = Assistant(llm=llm_cfg, function_list=tools)

# Streaming generation
messages = [{'role': 'user', 'content': 'https://qwenlm.github.io/blog/ Introduce the latest developments of Qwen'}]
for responses in bot.run(messages=messages):
    pass
print(responses)
```

```bibtex
@misc{qwen3technicalreport,
      title={Qwen3 Technical Report},
      author={Qwen Team},
      year={2025},
      eprint={2505.09388},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.09388}
}
```
Qwen3-VL-235B-A22B-Thinking-FP8
DeepSeek-R1-0528-Qwen3-8B-GPTQ-Int4-Int8Mix
GLM-4.5-AWQ
Qwen3-VL-235B-A22B-Instruct-FP8
MiniMax-M2.1-AWQ
Qwen3-235B-A22B-GPTQ-Int8
Seed-OSS-36B-Instruct-GPTQ-Int8
Seed-OSS-36B-Instruct-GPTQ-Int3
Qwen3-235B-A22B-Thinking-2507-GPTQ-Int4-Int8Mix
Qwen3-235B-A22B-Thinking-2507-GPTQ-Int4-Int8Mix

Base model: Qwen/Qwen3-235B-A22B-Thinking-2507

【vLLM Launch Command for 8 GPUs (Single Node)】

Note: When launching with 8 GPUs, `--enable-expert-parallel` must be specified; otherwise, the expert tensors cannot be evenly split across tensor parallel ranks. This option is not required for 4-GPU setups.

【Model Files】

| File Size | Last Updated |
|---------|--------------|
| `125GB` | `2025-07-26` |

Over the past three months, we have continued to scale the thinking capability of Qwen3-235B-A22B, improving both the quality and depth of reasoning. We are pleased to introduce Qwen3-235B-A22B-Thinking-2507, featuring the following key enhancements:

- Significantly improved performance on reasoning tasks, including logical reasoning, mathematics, science, coding, and academic benchmarks that typically require human expertise — achieving state-of-the-art results among open-source thinking models.
- Markedly better general capabilities, such as instruction following, tool usage, text generation, and alignment with human preferences.
- Enhanced 256K long-context understanding capabilities.

NOTE: This version has an increased thinking length. We strongly recommend its use in highly complex reasoning tasks.

Qwen3-235B-A22B-Thinking-2507 has the following features:

- Type: Causal Language Models
- Training Stage: Pretraining & Post-training
- Number of Parameters: 235B in total and 22B activated
- Number of Parameters (Non-Embedding): 234B
- Number of Layers: 94
- Number of Attention Heads (GQA): 64 for Q and 4 for KV
- Number of Experts: 128
- Number of Activated Experts: 8
- Context Length: 262,144 natively

Additionally, to enforce model thinking, the default chat template automatically includes `<think>`. Therefore, it is normal for the model's output to contain only `</think>` without an explicit opening `<think>` tag.
For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our blog, GitHub, and Documentation.

| | Deepseek-R1-0528 | OpenAI O4-mini | OpenAI O3 | Gemini-2.5 Pro | Claude4 Opus Thinking | Qwen3-235B-A22B Thinking | Qwen3-235B-A22B-Thinking-2507 |
|--- | --- | --- | --- | --- | --- | --- | --- |
| Knowledge | | | | | | | |
| MMLU-Pro | 85.0 | 81.9 | 85.9 | 85.6 | - | 82.8 | 84.4 |
| MMLU-Redux | 93.4 | 92.8 | 94.9 | 94.4 | 94.6 | 92.7 | 93.8 |
| GPQA | 81.0 | 81.4 | 83.3 | 86.4 | 79.6 | 71.1 | 81.1 |
| SuperGPQA | 61.7 | 56.4 | - | 62.3 | - | 60.7 | 64.9 |
| Reasoning | | | | | | | |
| AIME25 | 87.5 | 92.7 | 88.9 | 88.0 | 75.5 | 81.5 | 92.3 |
| HMMT25 | 79.4 | 66.7 | 77.5 | 82.5 | 58.3 | 62.5 | 83.9 |
| LiveBench 20241125 | 74.7 | 75.8 | 78.3 | 82.4 | 78.2 | 77.1 | 78.4 |
| HLE | 17.7# | 18.1 | 20.3 | 21.6 | 10.7 | 11.8# | 18.2# |
| Coding | | | | | | | |
| LiveCodeBench v6 (25.02-25.05) | 68.7 | 71.8 | 58.6 | 72.5 | 48.9 | 55.7 | 74.1 |
| CFEval | 2099 | 1929 | 2043 | 2001 | - | 2056 | 2134 |
| OJBench | 33.6 | 33.3 | 25.4 | 38.9 | - | 25.6 | 32.5 |
| Alignment | | | | | | | |
| IFEval | 79.1 | 92.4 | 92.1 | 90.8 | 89.7 | 83.4 | 87.8 |
| Arena-Hard v2$ | 72.2 | 59.3 | 80.8 | 72.5 | 59.1 | 61.5 | 79.7 |
| Creative Writing v3 | 86.3 | 78.8 | 87.7 | 85.9 | 83.8 | 84.6 | 86.1 |
| WritingBench | 83.2 | 78.4 | 85.3 | 83.1 | 79.1 | 80.3 | 88.3 |
| Agent | | | | | | | |
| BFCL-v3 | 63.8 | 67.2 | 72.4 | 67.2 | 61.8 | 70.8 | 71.9 |
| TAU2-Retail | 64.9 | 71.0 | 76.3 | 71.3 | - | 40.4 | 71.9 |
| TAU2-Airline | 60.0 | 59.0 | 70.0 | 60.0 | - | 30.0 | 58.0 |
| TAU2-Telecom | 33.3 | 42.0 | 60.5 | 37.4 | - | 21.9 | 45.6 |
| Multilingualism | | | | | | | |
| MultiIF | 63.5 | 78.0 | 80.3 | 77.8 | - | 71.9 | 80.6 |
| MMLU-ProX | 80.6 | 79.0 | 83.3 | 84.7 | - | 80.0 | 81.0 |
| INCLUDE | 79.4 | 80.8 | 86.6 | 85.1 | - | 78.7 | 81.0 |
| PolyMATH | 46.9 | 48.7 | 49.7 | 52.2 | - | 54.7 | 60.1 |

\* For OpenAI O4-mini and O3, we use medium reasoning effort, except for scores marked with \*, which are generated using high reasoning effort.
\# According to the official evaluation criteria of HLE, scores marked with \# refer to models that are not multi-modal and were evaluated only on the text-only subset.
$ For reproducibility, we report the win rates evaluated by GPT-4.1.
\& For highly challenging tasks (including PolyMATH and all reasoning and coding tasks), we use an output length of 81,920 tokens. For all other tasks, we set the output length to 32,768.

The code of Qwen3-MoE has been merged into the latest Hugging Face `transformers`, and we advise you to use the latest version of `transformers`. With `transformers`, parse the thinking content out of the generated token ids:

```python
# token id 151668 corresponds to the closing </think> tag
try:
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content)  # no opening <think> tag
print("content:", content)
```

```shell
python -m sglang.launch_server --model-path Qwen/Qwen3-235B-A22B-Thinking-2507 --tp 8 --context-length 262144 --reasoning-parser qwen3
```

```shell
vllm serve Qwen/Qwen3-235B-A22B-Thinking-2507 --tensor-parallel-size 8 --max-model-len 262144 --enable-reasoning --reasoning-parser deepseek_r1
```

```python
from qwen_agent.agents import Assistant

# Define LLM
# Using Alibaba Cloud Model Studio
llm_cfg = {
    'model': 'qwen3-235b-a22b-thinking-2507',
    'model_type': 'qwen_dashscope',
}
```

Using an OpenAI-compatible API endpoint: it is recommended to disable the reasoning and tool call parsing functionality of the deployment framework and let Qwen-Agent automate the related operations. For example: `VLLM_USE_MODELSCOPE=true vllm serve Qwen/Qwen3-235B-A22B-Thinking-2507 --served-model-name Qwen3-235B-A22B-Thinking-2507 --tensor-parallel-size 8 --max-model-len 262144`.
```python
llm_cfg = {
    'model': 'Qwen3-235B-A22B-Thinking-2507',
    # Use a custom endpoint compatible with OpenAI API:
    'model_server': 'http://localhost:8000/v1',  # api_base without reasoning and tool call parsing
    'api_key': 'EMPTY',
    'generate_cfg': {
        'thought_in_content': True,
    },
}

# Define Tools
tools = [
    {'mcpServers': {  # You can specify the MCP configuration file
        'time': {
            'command': 'uvx',
            'args': ['mcp-server-time', '--local-timezone=Asia/Shanghai']
        },
        'fetch': {
            'command': 'uvx',
            'args': ['mcp-server-fetch']
        }
    }},
    'code_interpreter',  # Built-in tools
]

# Define Agent
bot = Assistant(llm=llm_cfg, function_list=tools)

# Streaming generation
messages = [{'role': 'user', 'content': 'https://qwenlm.github.io/blog/ Introduce the latest developments of Qwen'}]
for responses in bot.run(messages=messages):
    pass
print(responses)
```

```bibtex
@misc{qwen3technicalreport,
      title={Qwen3 Technical Report},
      author={Qwen Team},
      year={2025},
      eprint={2505.09388},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.09388}
}
```
DeepSeek-R1-0528-GPTQ-Int4-Int8Mix-Medium
QuantTrio/Kimi-Dev-72B-GPTQ-Int4
Kimi-Dev-72B-GPTQ-Int4

Base model: moonshotai/Kimi-Dev-72B

Calibrated with the https://huggingface.co/datasets/timdettmers/openassistant-guanaco/blob/main/openassistant_best_replies_eval.jsonl dataset.

Introducing Kimi-Dev: A Strong and Open-source Coding LLM for Issue Resolution

We introduce Kimi-Dev-72B, our new open-source coding LLM for software engineering tasks. Kimi-Dev-72B achieves a new state-of-the-art on SWE-bench Verified among open-source models.

- Kimi-Dev-72B achieves 60.4% on SWE-bench Verified, surpassing the runner-up and setting a new state-of-the-art result among open-source models.
- Kimi-Dev-72B is optimized via large-scale reinforcement learning. It autonomously patches real repositories in Docker and gains rewards only when the entire test suite passes. This ensures correct and robust solutions, aligning with real-world development standards.
- Kimi-Dev-72B is available for download and deployment on Hugging Face and GitHub. We welcome developers and researchers to explore its capabilities and contribute to development.

Performance of Open-source Models on SWE-bench Verified.
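The all-or-nothing reward described above (a patch scores only if the whole test suite passes) can be sketched as a tiny reward function. This is an illustrative sketch, not Kimi's harness: the real setup runs repositories inside Docker, and the default `test_cmd` here is an assumption:

```python
import subprocess

def patch_reward(repo_dir: str, test_cmd=("pytest", "-q")) -> float:
    """All-or-nothing reward in the spirit of Kimi-Dev's RL setup:
    1.0 only if the repository's entire test suite passes after the
    candidate patch; 0.0 otherwise."""
    result = subprocess.run(list(test_cmd), cwd=repo_dir, capture_output=True)
    return 1.0 if result.returncode == 0 else 0.0
```

Binary, suite-level rewards avoid credit for partially-passing patches, which the card argues aligns better with real-world development standards.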
KAT-V1-40B-AWQ
【Model Files】

| File Size | Last Updated |
|--------|--------------|
| `22GB` | `2025-07-31` |

- Kwaipilot-AutoThink ranks first among all open-source models on LiveCodeBench Pro, a challenging benchmark explicitly designed to prevent data leakage, and even surpasses strong proprietary systems such as Seed and o3-mini.

KAT (Kwaipilot-AutoThink) is an open-source large language model that mitigates over-thinking by learning when to produce explicit chain-of-thought and when to answer directly. Its development follows a concise two-stage training pipeline:

1. Pre-training: inject knowledge while separating "reasoning" from "direct answering".
   - Dual-regime data:
     - Think-off queries labeled via a custom tagging system.
     - Think-on queries generated by a multi-agent solver.
   - Knowledge Distillation + Multi-Token Prediction for fine-grained utility.
   - The base model attains strong factual and reasoning skills without full-scale pre-training costs.

2. Post-training: make reasoning optional and efficient.
   - Cold-start AutoThink: a majority vote sets the initial thinking mode.
   - Step-SRPO: intermediate supervision rewards correct mode selection and answer accuracy under that mode.
   - The model triggers CoT only when beneficial, reducing token use and speeding inference.

KAT produces responses in a structured template that makes the reasoning path explicit and machine-parsable. Two modes are supported:

| Token | Description |
|-------|-------------|
| `<judge>` | Analyzes the input to decide whether explicit reasoning is needed. |
| `<think_on>` / `<think_off>` | Indicates whether reasoning is activated ("on") or skipped ("off"). |
| `<think>` | Marks the start of the chain-of-thought segment when `think_on` is chosen. |
| `<answer>` | Marks the start of the final user-facing answer. |
| Looking ahead, we will publish a companion paper that fully documents the AutoThink training framework, covering: Cold-start initialization procedures Reinforcement-learning (Step-SRPO) strategies Data curation and reward design details Training resources – the curated dual-regime datasets and RL codebase Model suite – checkpoints at 1.5B, 7B, and 13B parameters, all trained with AutoThink gating
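The special-token strings in the table above were lost in this page's rendering, so a response in this template can only be sketched with placeholders. The token names `<judge>`, `<think_on>`, `<think_off>`, `<think>`, and `<answer>` below are hypothetical stand-ins, not the model's actual vocabulary; substitute the real special tokens from the tokenizer config:

```python
def parse_autothink(response: str) -> dict:
    """Split a KAT-style structured response into its segments.

    Token strings here are hypothetical placeholders; replace them with
    the model's real special tokens before use.
    """
    JUDGE, ON, OFF, THINK, ANSWER = (
        "<judge>", "<think_on>", "<think_off>", "<think>", "<answer>"
    )
    out = {"judge": "", "mode": None, "reasoning": "", "answer": ""}
    head, _, tail = response.partition(ANSWER)
    out["answer"] = tail.strip()
    if ON in head:
        out["mode"] = "on"
        # In think-on mode, a chain-of-thought segment precedes the answer.
        head, _, reasoning = head.partition(THINK)
        out["reasoning"] = reasoning.strip()
    elif OFF in head:
        out["mode"] = "off"
    # Whatever remains (minus markers) is the mode-selection judgment.
    out["judge"] = head.replace(JUDGE, "").replace(ON, "").replace(OFF, "").strip()
    return out
```

This mechanical split is what "machine-parsable" buys you: downstream tooling can log or hide the reasoning segment without any model-specific heuristics.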
Seed-OSS-36B-Instruct-GPTQ-Int4
KAT-Dev-GPTQ-Int4
Calibrated with the https://huggingface.co/datasets/timdettmers/openassistant-guanaco/blob/main/openassistant_best_replies_eval.jsonl dataset.

🔥 We're thrilled to announce the release of KAT-Dev-72B-Exp, our latest and most powerful model yet! 🔥

You can now try our strongest proprietary coder model, KAT-Coder, directly on the StreamLake platform for free.

Highlights

KAT-Dev-32B is an open-source 32B-parameter model for software engineering tasks. On SWE-bench Verified, KAT-Dev-32B resolves 62.4% of issues and ranks 5th among open-source models of all scales.

KAT-Dev-32B is optimized via several stages of training: a mid-training stage, a supervised fine-tuning (SFT) & reinforcement fine-tuning (RFT) stage, and a large-scale agentic reinforcement learning (RL) stage. In summary, our contributions include:

1. Mid-Training

We observe that adding extensive training for tool-use capability, multi-turn interaction, and instruction following at this stage may not yield large performance gains in current results (e.g., on leaderboards like SWE-bench). However, since our experiments are based on the Qwen3-32B model, we find that enhancing these foundational capabilities has a significant impact on the subsequent SFT and RL stages. This suggests that improving such core abilities can profoundly influence the model's capacity to handle more complex tasks.

2. SFT & RFT

We meticulously curated eight task types and eight programming scenarios during the SFT stage to ensure the model's generalization and comprehensive capabilities. Moreover, before RL, we innovatively introduced an RFT stage. Compared with traditional RL, we incorporate "teacher trajectories" annotated by human engineers as guidance during training, much like a learner driver being assisted by an experienced co-driver before driving alone after getting a license. This step not only boosts model performance but also further stabilizes the subsequent RL training.

3. Agentic RL Scaling

Scaling agentic RL hinges on three challenges: efficient learning over nonlinear trajectory histories, leveraging intrinsic model signals, and building scalable high-throughput infrastructure. We address these with a multi-level prefix caching mechanism in the RL training engine, an entropy-based trajectory pruning technique, and an in-house implementation of the SeamlessFlow[1] architecture that cleanly decouples agents from training while exploiting heterogeneous compute. Together, these innovations cut scaling costs and enable efficient large-scale RL.

For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our blog.

claude-code-router is a third-party routing utility that allows Claude Code to flexibly switch between different backend APIs. On the DashScope platform, you can install the claude-code-config extension package, which automatically generates a default configuration for `claude-code-router` with built-in DashScope support. Once the configuration files and plugin directory are generated, the environment required by `ccr` is ready. If needed, you can still manually edit `~/.claude-code-router/config.json` and the files under `~/.claude-code-router/plugins/` to customize the setup. Finally, simply start `ccr` to run Claude Code and seamlessly connect it to the powerful coding capabilities of KAT-Dev-32B. Happy coding!
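For the manual-editing path, pointing `claude-code-router` at a locally served model comes down to adding one provider entry to `config.json`. The sketch below is illustrative only: the `"Providers"` key and the entry fields are assumptions about the schema, so check the claude-code-router documentation for the exact shape before using it.

```python
import json
from pathlib import Path

def add_local_provider(config_path, name, base_url, model):
    """Append a provider entry to a claude-code-router config.json.

    The "Providers" key and the field names below are illustrative
    guesses at the schema, not a documented API.
    """
    path = Path(config_path).expanduser()
    config = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    providers = config.setdefault("Providers", [])
    # Replace an existing entry with the same name instead of duplicating it.
    providers[:] = [p for p in providers if p.get("name") != name]
    providers.append({
        "name": name,
        "api_base_url": base_url,
        "api_key": "EMPTY",  # local OpenAI-compatible servers typically ignore the key
        "models": [model],
    })
    path.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return config
```

Making the helper idempotent (replace-then-append) means it can be re-run safely whenever the local endpoint or model name changes.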
GLM-4.1V-9B-Thinking-AWQ
GLM-4.7-GPTQ-Int4-Int8Mix
DeepSeek-V3.2-Exp-AWQ-Lite
DeepSeek-V3.1-AWQ-Lite
【Dependencies / Installation】

As of 2025-08-28, create a fresh Python environment and run:

【Model Files】

| File Size | Last Updated |
|-----------|--------------|
| `337GB` | `2025-08-28` |

DeepSeek-V3.1 is a hybrid model that supports both thinking mode and non-thinking mode. Compared to the previous version, this upgrade brings improvements in multiple aspects:

- Hybrid thinking mode: one model supports both thinking mode and non-thinking mode by changing the chat template.
- Smarter tool calling: through post-training optimization, the model's performance in tool usage and agent tasks has significantly improved.
- Higher thinking efficiency: DeepSeek-V3.1-Think achieves answer quality comparable to DeepSeek-R1-0528, while responding more quickly.

DeepSeek-V3.1 is post-trained on top of DeepSeek-V3.1-Base, which is built upon the original V3 base checkpoint through a two-phase long-context extension approach, following the methodology outlined in the original DeepSeek-V3 report. We expanded our dataset by collecting additional long documents and substantially extending both training phases. The 32K extension phase was increased 10-fold to 630B tokens, and the 128K extension phase was extended by 3.3x to 209B tokens. Additionally, DeepSeek-V3.1 is trained using the UE8M0 FP8 scale data format to ensure compatibility with microscaling data formats.

| Model | #Total Params | #Activated Params | Context Length | Download |
| :------------: | :------------: | :------------: | :------------: | :------------: |
| DeepSeek-V3.1-Base | 671B | 37B | 128K | HuggingFace \| ModelScope |
| DeepSeek-V3.1 | 671B | 37B | 128K | HuggingFace \| ModelScope |

The details of our chat template are described in `tokenizer_config.json` and `assets/chat_template.jinja`. Here is a brief description. With the given prefix, DeepSeek-V3.1 generates responses to queries in non-thinking mode. Unlike DeepSeek-V3, it introduces an additional token ` `.
Multi-Turn Context: ` {system prompt} {query} {response} ... {query} {response} `

By concatenating the context and the prefix, we obtain the correct prompt for the query. The prefix of thinking mode is similar to DeepSeek-R1.

Multi-Turn Context: ` {system prompt} {query} {response} ... {query} {response} `

The multi-turn template is the same as the non-thinking multi-turn chat template. This means the thinking token in the last turn is dropped, but the ` ` is retained in every turn of the context.

ToolCall

Tool calls are supported in non-thinking mode. The format is:

` {system prompt}{tool_description} {query} `

where the tool_description is

Code-Agent

We support various code agent frameworks. Please refer to the above tool-call format to create your own code agents. An example is shown in `assets/code_agent_trajectory.html`.

Search-Agent

We design a specific format for search tool calls in thinking mode to support search agents. For complex questions that require access to external or up-to-date information, DeepSeek-V3.1 can leverage a user-provided search tool through a multi-turn tool-calling process. Please refer to `assets/search_tool_trajectory.html` and `assets/search_python_tool_trajectory.html` for the detailed template.
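The concatenation rule above can be sketched mechanically. The marker strings in the page's template were stripped, so `<BOS>`, `<USER>`, `<ASSISTANT>`, `<EOS>`, and the think markers below are placeholders, not DeepSeek-V3.1's real special tokens; in practice, let `tokenizer.apply_chat_template` render the official template from `tokenizer_config.json`:

```python
def build_prompt(system_prompt, turns, query, thinking=False):
    """Concatenate a multi-turn context with the new turn's mode prefix.

    All marker strings are hypothetical placeholders for the real
    special tokens; this only illustrates the concatenation order.
    """
    BOS, USER, ASSISTANT, EOS = "<BOS>", "<USER>", "<ASSISTANT>", "<EOS>"
    THINK_OPEN, THINK_CLOSE = "<think>", "</think>"
    prompt = BOS + system_prompt
    # Prior turns are replayed with their final responses only
    # (thinking content from earlier turns is already dropped).
    for q, a in turns:
        prompt += USER + q + ASSISTANT + a + EOS
    # The mode of the new turn is selected purely by the trailing prefix.
    prompt += USER + query + ASSISTANT
    prompt += THINK_OPEN if thinking else THINK_CLOSE
    return prompt
```

The point of the sketch is that thinking vs. non-thinking mode differs only in the final prefix appended for the current turn; everything before it is identical.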
Evaluation

| Category | Benchmark (Metric) | DeepSeek V3.1-NonThinking | DeepSeek V3 0324 | DeepSeek V3.1-Thinking | DeepSeek R1 0528 |
|----------|--------------------|---------------------------|------------------|------------------------|------------------|
| General | MMLU-Redux (EM) | 91.8 | 90.5 | 93.7 | 93.4 |
| | MMLU-Pro (EM) | 83.7 | 81.2 | 84.8 | 85.0 |
| | GPQA-Diamond (Pass@1) | 74.9 | 68.4 | 80.1 | 81.0 |
| | Humanity's Last Exam (Pass@1) | - | - | 15.9 | 17.7 |
| Search Agent | BrowseComp | - | - | 30.0 | 8.9 |
| | BrowseComp_zh | - | - | 49.2 | 35.7 |
| | Humanity's Last Exam (Python + Search) | - | - | 29.8 | 24.8 |
| | SimpleQA | - | - | 93.4 | 92.3 |
| Code | LiveCodeBench (2408-2505) (Pass@1) | 56.4 | 43.0 | 74.8 | 73.3 |
| | Codeforces-Div1 (Rating) | - | - | 2091 | 1930 |
| | Aider-Polyglot (Acc.) | 68.4 | 55.1 | 76.3 | 71.6 |
| Code Agent | SWE Verified (Agent mode) | 66.0 | 45.4 | - | 44.6 |
| | SWE-bench Multilingual (Agent mode) | 54.5 | 29.3 | - | 30.5 |
| | Terminal-bench (Terminus 1 framework) | 31.3 | 13.3 | - | 5.7 |
| Math | AIME 2024 (Pass@1) | 66.3 | 59.4 | 93.1 | 91.4 |
| | AIME 2025 (Pass@1) | 49.8 | 51.3 | 88.4 | 87.5 |
| | HMMT 2025 (Pass@1) | 33.5 | 29.2 | 84.2 | 79.4 |

Note:

- Search agents are evaluated with our internal search framework, which uses a commercial search API + webpage filter + 128K context window. Search agent results of R1-0528 are evaluated with a pre-defined workflow.
- SWE-bench is evaluated with our internal code agent framework.

The model structure of DeepSeek-V3.1 is the same as DeepSeek-V3. Please visit the DeepSeek-V3 repo for more information about running this model locally.

This repository and the model weights are licensed under the MIT License.

If you have any questions, please raise an issue or contact us at [email protected].
KAT-Dev-GPTQ-Int8
Calibrated with the https://huggingface.co/datasets/timdettmers/openassistant-guanaco/blob/main/openassistant_best_replies_eval.jsonl dataset.

🔥 We're thrilled to announce the release of KAT-Dev-72B-Exp, our latest and most powerful model yet! 🔥

You can now try our strongest proprietary coder model, KAT-Coder, directly on the StreamLake platform for free.

Highlights

KAT-Dev-32B is an open-source 32B-parameter model for software engineering tasks. On SWE-bench Verified, KAT-Dev-32B resolves 62.4% of issues and ranks 5th among open-source models of all scales.

KAT-Dev-32B is optimized via several stages of training: a mid-training stage, a supervised fine-tuning (SFT) & reinforcement fine-tuning (RFT) stage, and a large-scale agentic reinforcement learning (RL) stage. In summary, our contributions include:

1. Mid-Training

We observe that adding extensive training for tool-use capability, multi-turn interaction, and instruction following at this stage may not yield large performance gains in current results (e.g., on leaderboards like SWE-bench). However, since our experiments are based on the Qwen3-32B model, we find that enhancing these foundational capabilities has a significant impact on the subsequent SFT and RL stages. This suggests that improving such core abilities can profoundly influence the model's capacity to handle more complex tasks.

2. SFT & RFT

We meticulously curated eight task types and eight programming scenarios during the SFT stage to ensure the model's generalization and comprehensive capabilities. Moreover, before RL, we innovatively introduced an RFT stage. Compared with traditional RL, we incorporate "teacher trajectories" annotated by human engineers as guidance during training, much like a learner driver being assisted by an experienced co-driver before driving alone after getting a license. This step not only boosts model performance but also further stabilizes the subsequent RL training.

3. Agentic RL Scaling

Scaling agentic RL hinges on three challenges: efficient learning over nonlinear trajectory histories, leveraging intrinsic model signals, and building scalable high-throughput infrastructure. We address these with a multi-level prefix caching mechanism in the RL training engine, an entropy-based trajectory pruning technique, and an in-house implementation of the SeamlessFlow[1] architecture that cleanly decouples agents from training while exploiting heterogeneous compute. Together, these innovations cut scaling costs and enable efficient large-scale RL.

For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our blog.

claude-code-router is a third-party routing utility that allows Claude Code to flexibly switch between different backend APIs. On the DashScope platform, you can install the claude-code-config extension package, which automatically generates a default configuration for `claude-code-router` with built-in DashScope support. Once the configuration files and plugin directory are generated, the environment required by `ccr` is ready. If needed, you can still manually edit `~/.claude-code-router/config.json` and the files under `~/.claude-code-router/plugins/` to customize the setup. Finally, simply start `ccr` to run Claude Code and seamlessly connect it to the powerful coding capabilities of KAT-Dev-32B. Happy coding!
DeepSeek-R1-0528-Qwen3-8B-Int8-W8A16
GLM-5-AWQ
DeepSeek-R1-0528-GPTQ-Int4-Int8Mix-Compact
QuantTrio/Kimi-Dev-72B-GPTQ-Int8
Kimi-Dev-72B-GPTQ-Int8

Base model: moonshotai/Kimi-Dev-72B

Calibrated with the https://huggingface.co/datasets/timdettmers/openassistant-guanaco/blob/main/openassistant_best_replies_eval.jsonl dataset.

Introducing Kimi-Dev: A Strong and Open-source Coding LLM for Issue Resolution

We introduce Kimi-Dev-72B, our new open-source coding LLM for software engineering tasks. Kimi-Dev-72B achieves a new state-of-the-art on SWE-bench Verified among open-source models.

- Kimi-Dev-72B achieves 60.4% on SWE-bench Verified, surpassing the runner-up and setting a new state-of-the-art result among open-source models.
- Kimi-Dev-72B is optimized via large-scale reinforcement learning. It autonomously patches real repositories in Docker and gains a reward only when the entire test suite passes. This ensures correct and robust solutions that align with real-world development standards.
- Kimi-Dev-72B is available for download and deployment on Hugging Face and GitHub. We welcome developers and researchers to explore its capabilities and contribute to development.

Performance of Open-source Models on SWE-bench Verified.
DeepSeek-R1-0528-Qwen3-8B-Int4-W4A16
GLM-4.5-GPTQ-Int4-Int8Mix
DeepSeek-R1-0528-GPTQ-Int4-Int8Mix-Lite
KAT-V1-40B-GPTQ-Int4-Int8Mix
KAT-V1-40B-GPTQ-Int4-Int8Mix

Base model: Kwaipilot/KAT-V1-40B

| File Size | Last Updated |
|-----------|--------------|
| `25GB` | `2025-07-31` |

- Kwaipilot-AutoThink ranks first among all open-source models on LiveCodeBench Pro, a challenging benchmark explicitly designed to prevent data leakage, and even surpasses strong proprietary systems such as Seed and o3-mini.

KAT (Kwaipilot-AutoThink) is an open-source large language model that mitigates over-thinking by learning when to produce an explicit chain of thought and when to answer directly. Its development follows a concise two-stage training pipeline:

1. Pre-training: inject knowledge while separating "reasoning" from "direct answering".
   - Dual-regime data:
     • Think-off queries labeled via a custom tagging system.
     • Think-on queries generated by a multi-agent solver.
   - Knowledge Distillation + Multi-Token Prediction for fine-grained utility.
   - The base model attains strong factual and reasoning skills without full-scale pre-training costs.

2. Post-training: make reasoning optional and efficient.
   - Cold-start AutoThink: a majority vote sets the initial thinking mode.
   - Step-SRPO: intermediate supervision rewards correct mode selection and answer accuracy under that mode.
   - The model triggers CoT only when beneficial, reducing token use and speeding up inference.

KAT produces responses in a structured template that makes the reasoning path explicit and machine-parsable. Two modes are supported:

| Token | Description |
|-------|-------------|
| ` ` | Analyzes the input to decide whether explicit reasoning is needed. |
| ` ` / ` ` | Indicates whether reasoning is activated ("on") or skipped ("off"). |
| ` ` | Marks the start of the chain-of-thought segment when `thinkon` is chosen. |
| ` ` | Marks the start of the final user-facing answer. |

Looking ahead, we will publish a companion paper that fully documents the AutoThink training framework, covering:

- Cold-start initialization procedures
- Reinforcement-learning (Step-SRPO) strategies
- Data curation and reward design details
- Training resources – the curated dual-regime datasets and RL codebase
- Model suite – checkpoints at 1.5B, 7B, and 13B parameters, all trained with AutoThink gating