cpatonn

137 models

Qwen3-Coder-30B-A3B-Instruct-AWQ-4bit

The vllm-project/llm-compressor toolkit and the nvidia/Llama-Nemotron-Post-Training-Dataset were used to quantize the original model. For further quantization arguments and configuration details, see config.json and recipe.yaml. For inference, please install the latest vLLM release for best support.

Qwen3-Coder is available in multiple sizes. Today, we're excited to introduce Qwen3-Coder-30B-A3B-Instruct. This streamlined model maintains impressive performance and efficiency, featuring the following key enhancements:
- Significant performance among open models on agentic coding, agentic browser use, and other foundational coding tasks.
- Long-context capabilities with native support for 256K tokens, extendable up to 1M tokens using YaRN, optimized for repository-scale understanding.
- Agentic coding support for most platforms, such as Qwen Code and CLINE, featuring a specially designed function-call format.

Qwen3-Coder-30B-A3B-Instruct has the following features:
- Type: Causal Language Model
- Training Stage: Pretraining & Post-training
- Number of Parameters: 30.5B in total and 3.3B activated
- Number of Layers: 48
- Number of Attention Heads (GQA): 32 for Q and 4 for KV
- Number of Experts: 128
- Number of Activated Experts: 8
- Context Length: 262,144 natively

NOTE: This model supports only non-thinking mode and does not generate `<think></think>` blocks in its output. Meanwhile, specifying `enable_thinking=False` is no longer required. For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our blog, GitHub, and Documentation. We advise you to use the latest version of `transformers`.
Example usage — define the tools and call an OpenAI-compatible endpoint (such as a local vLLM server):

```python
from openai import OpenAI

# Define tools
tools = [
    {
        "type": "function",
        "function": {
            "name": "square_the_number",
            "description": "output the square of the number.",
            "parameters": {
                "type": "object",
                "required": ["input_num"],
                "properties": {
                    "input_num": {
                        "type": "number",
                        "description": "input_num is a number that will be squared",
                    }
                },
            },
        },
    }
]

# Define LLM client
client = OpenAI(
    # Use a custom endpoint compatible with the OpenAI API
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",
)

messages = [{"role": "user", "content": "square the number 1024"}]

completion = client.chat.completions.create(
    messages=messages,
    model="Qwen3-Coder-30B-A3B-Instruct",
    max_tokens=65536,
    tools=tools,
)
```

If you find our work helpful, feel free to give us a cite:

```bibtex
@misc{qwen3technicalreport,
      title={Qwen3 Technical Report},
      author={Qwen Team},
      year={2025},
      eprint={2505.09388},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.09388},
}
```

license:apache-2.0
516,062
24

GLM-4.5-Air-AWQ-4bit

---
license: mit
language:
- en
- zh
pipeline_tag: text-generation
library_name: transformers
base_model:
- zai-org/GLM-4.5-Air
---

license:mit
291,429
21

Qwen3-Next-80B-A3B-Instruct-AWQ-8bit

license:apache-2.0
209,472
2

Qwen3-30B-A3B-Instruct-2507-AWQ-4bit

The vllm-project/llm-compressor toolkit and the nvidia/Llama-Nemotron-Post-Training-Dataset were used to quantize the original model. For further quantization arguments and configuration details, see config.json and recipe.yaml. For inference, please install the latest vLLM release for best support.

We introduce the updated version of the Qwen3-30B-A3B non-thinking mode, named Qwen3-30B-A3B-Instruct-2507, featuring the following key enhancements:
- Significant improvements in general capabilities, including instruction following, logical reasoning, text comprehension, mathematics, science, coding, and tool usage.
- Substantial gains in long-tail knowledge coverage across multiple languages.
- Markedly better alignment with user preferences in subjective and open-ended tasks, enabling more helpful responses and higher-quality text generation.
- Enhanced capabilities in 256K long-context understanding.

Qwen3-30B-A3B-Instruct-2507 has the following features:
- Type: Causal Language Model
- Training Stage: Pretraining & Post-training
- Number of Parameters: 30.5B in total and 3.3B activated
- Number of Parameters (Non-Embedding): 29.9B
- Number of Layers: 48
- Number of Attention Heads (GQA): 32 for Q and 4 for KV
- Number of Experts: 128
- Number of Activated Experts: 8
- Context Length: 262,144 natively

NOTE: This model supports only non-thinking mode and does not generate `<think></think>` blocks in its output. Meanwhile, specifying `enable_thinking=False` is no longer required. For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our blog, GitHub, and Documentation.
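The card mentions example usage without including it. Below is a minimal sketch, assuming a local vLLM server (started with something like `vllm serve cpatonn/Qwen3-30B-A3B-Instruct-2507-AWQ-4bit`) is listening on localhost:8000; the served model name and endpoint are our assumptions, not part of this card.

```python
# Hedged sketch of querying the AWQ build through an OpenAI-compatible
# vLLM server. The model name and URL below are assumptions.
import json

def chat_request_body(prompt: str) -> bytes:
    """JSON body for POST http://localhost:8000/v1/chat/completions."""
    return json.dumps({
        "model": "Qwen3-30B-A3B-Instruct-2507-AWQ-4bit",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
        "top_p": 0.8,
        "max_tokens": 1024,
    }).encode("utf-8")

body = chat_request_body("Summarize the benefits of AWQ quantization.")

# To actually send the request (server required):
# import urllib.request
# req = urllib.request.Request("http://localhost:8000/v1/chat/completions",
#                              data=body, headers={"Content-Type": "application/json"})
# print(urllib.request.urlopen(req).read())
```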
| | Deepseek-V3-0324 | GPT-4o-0327 | Gemini-2.5-Flash Non-Thinking | Qwen3-235B-A22B Non-Thinking | Qwen3-30B-A3B Non-Thinking | Qwen3-30B-A3B-Instruct-2507 |
|--- | --- | --- | --- | --- | --- | --- |
| **Knowledge** | | | | | | |
| MMLU-Pro | 81.2 | 79.8 | 81.1 | 75.2 | 69.1 | 78.4 |
| MMLU-Redux | 90.4 | 91.3 | 90.6 | 89.2 | 84.1 | 89.3 |
| GPQA | 68.4 | 66.9 | 78.3 | 62.9 | 54.8 | 70.4 |
| SuperGPQA | 57.3 | 51.0 | 54.6 | 48.2 | 42.2 | 53.4 |
| **Reasoning** | | | | | | |
| AIME25 | 46.6 | 26.7 | 61.6 | 24.7 | 21.6 | 61.3 |
| HMMT25 | 27.5 | 7.9 | 45.8 | 10.0 | 12.0 | 43.0 |
| ZebraLogic | 83.4 | 52.6 | 57.9 | 37.7 | 33.2 | 90.0 |
| LiveBench 20241125 | 66.9 | 63.7 | 69.1 | 62.5 | 59.4 | 69.0 |
| **Coding** | | | | | | |
| LiveCodeBench v6 (25.02-25.05) | 45.2 | 35.8 | 40.1 | 32.9 | 29.0 | 43.2 |
| MultiPL-E | 82.2 | 82.7 | 77.7 | 79.3 | 74.6 | 83.8 |
| Aider-Polyglot | 55.1 | 45.3 | 44.0 | 59.6 | 24.4 | 35.6 |
| **Alignment** | | | | | | |
| IFEval | 82.3 | 83.9 | 84.3 | 83.2 | 83.7 | 84.7 |
| Arena-Hard v2 | 45.6 | 61.9 | 58.3 | 52.0 | 24.8 | 69.0 |
| Creative Writing v3 | 81.6 | 84.9 | 84.6 | 80.4 | 68.1 | 86.0 |
| WritingBench | 74.5 | 75.5 | 80.5 | 77.0 | 72.2 | 85.5 |
| **Agent** | | | | | | |
| BFCL-v3 | 64.7 | 66.5 | 66.1 | 68.0 | 58.6 | 65.1 |
| TAU1-Retail | 49.6 | 60.3# | 65.2 | 65.2 | 38.3 | 59.1 |
| TAU1-Airline | 32.0 | 42.8# | 48.0 | 32.0 | 18.0 | 40.0 |
| TAU2-Retail | 71.1 | 66.7# | 64.3 | 64.9 | 31.6 | 57.0 |
| TAU2-Airline | 36.0 | 42.0# | 42.5 | 36.0 | 18.0 | 38.0 |
| TAU2-Telecom | 34.0 | 29.8# | 16.9 | 24.6 | 18.4 | 12.3 |
| **Multilingualism** | | | | | | |
| MultiIF | 66.5 | 70.4 | 69.4 | 70.2 | 70.8 | 67.9 |
| MMLU-ProX | 75.8 | 76.2 | 78.3 | 73.2 | 65.1 | 72.0 |
| INCLUDE | 80.1 | 82.1 | 83.8 | 75.6 | 67.8 | 71.9 |
| PolyMATH | 32.2 | 25.5 | 41.9 | 27.0 | 23.3 | 43.1 |

For reproducibility, we report the win rates evaluated by GPT-4.1.
#: Results were generated using GPT-4o-20241120, as access to the native function calling API of GPT-4o-0327 was unavailable.

The code of Qwen3-MoE has been merged into the latest Hugging Face `transformers`, and we advise you to use the latest version of `transformers`. You can use `sglang>=0.4.6.post1` or `vllm>=0.8.5` to create an OpenAI-compatible API endpoint. Note: if you encounter out-of-memory (OOM) issues, consider reducing the context length to a shorter value, such as `32,768`. For local use, applications such as Ollama, LMStudio, MLX-LM, llama.cpp, and KTransformers also support Qwen3.

Qwen3 excels in tool-calling capabilities. We recommend using Qwen-Agent to make the best use of the agentic ability of Qwen3. Qwen-Agent encapsulates tool-calling templates and tool-calling parsers internally, greatly reducing coding complexity. To define the available tools, you can use an MCP configuration file, use the integrated tools of Qwen-Agent, or integrate other tools yourself.

To achieve optimal performance, we recommend the following settings:
1. Sampling parameters:
   - We suggest using `Temperature=0.7`, `TopP=0.8`, `TopK=20`, and `MinP=0`.
   - For supported frameworks, you can adjust the `presence_penalty` parameter between 0 and 2 to reduce endless repetitions. However, a higher value may occasionally result in language mixing and a slight decrease in model performance.
2. Adequate output length: we recommend an output length of 16,384 tokens for most queries, which is adequate for instruct models.
3. Standardize output format: we recommend using prompts to standardize model outputs when benchmarking.
   - Math problems: include "Please reason step by step, and put your final answer within \boxed{}." in the prompt.
   - Multiple-choice questions: add the following JSON structure to the prompt to standardize responses: "Please show your choice in the `answer` field with only the choice letter, e.g., `"answer": "C"`."
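The recommended settings above can be sketched as a reusable sampling dict plus a prompt helper; the helper name and structure are our illustration, not part of the card.

```python
# Sketch of the recommended sampling settings and output-format prompts.
# presence_penalty defaults to 0 here; raise it (up to 2) to curb repetition.
SAMPLING = {"temperature": 0.7, "top_p": 0.8, "top_k": 20, "min_p": 0,
            "presence_penalty": 0.0}

MATH_SUFFIX = "Please reason step by step, and put your final answer within \\boxed{}."
MCQ_SUFFIX = ('Please show your choice in the `answer` field with only the '
              'choice letter, e.g., `"answer": "C"`.')

def standardize_prompt(question: str, kind: str) -> str:
    """Append the recommended formatting instruction for benchmarking."""
    suffix = {"math": MATH_SUFFIX, "mcq": MCQ_SUFFIX}.get(kind, "")
    return f"{question}\n{suffix}".rstrip()

print(standardize_prompt("What is 17 * 3?", "math"))
```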
If you find our work helpful, feel free to give us a cite.

license:apache-2.0
148,481
16

Qwen3-VL-32B-Thinking-AWQ-4bit

- Quantization Method: AWQ
- Bits: 4
- Group Size: 32
- Calibration Dataset: 5CD-AI/LLaVA-CoT-o1-Instruct
- Quantization Tool: llm-compressor

Meet Qwen3-VL — the most powerful vision-language model in the Qwen series to date. This generation delivers comprehensive upgrades across the board: superior text understanding & generation, deeper visual perception & reasoning, extended context length, enhanced spatial and video dynamics comprehension, and stronger agent interaction capabilities. Available in Dense and MoE architectures that scale from edge to cloud, with Instruct and reasoning-enhanced Thinking editions for flexible, on-demand deployment.

- Visual Agent: operates PC/mobile GUIs — recognizes elements, understands functions, invokes tools, completes tasks.
- Visual Coding Boost: generates Draw.io/HTML/CSS/JS from images/videos.
- Advanced Spatial Perception: judges object positions, viewpoints, and occlusions; provides stronger 2D grounding and enables 3D grounding for spatial reasoning and embodied AI.
- Long Context & Video Understanding: native 256K context, expandable to 1M; handles books and hours-long video with full recall and second-level indexing.
- Enhanced Multimodal Reasoning: excels in STEM/math — causal analysis and logical, evidence-based answers.
- Upgraded Visual Recognition: broader, higher-quality pretraining enables it to "recognize everything" — celebrities, anime, products, landmarks, flora/fauna, etc.
- Expanded OCR: supports 32 languages (up from 19); robust in low light, blur, and tilt; better with rare/ancient characters and jargon; improved long-document structure parsing.
- Text Understanding on par with pure LLMs: seamless text–vision fusion for lossless, unified comprehension.

1. Interleaved-MRoPE: full-frequency allocation over time, width, and height via robust positional embeddings, enhancing long-horizon video reasoning.
2. DeepStack: fuses multi-level ViT features to capture fine-grained details and sharpen image–text alignment.
3. Text–Timestamp Alignment: moves beyond T-RoPE to precise, timestamp-grounded event localization for stronger video temporal modeling.

This is the weight repository for Qwen3-VL-32B-Thinking. Below, we provide simple examples showing how to use Qwen3-VL with 🤖 ModelScope and 🤗 Transformers. The code of Qwen3-VL has been merged into the latest Hugging Face `transformers`, and we advise you to build from source. Here we show a code snippet for using the chat model with `transformers`. If you find our work helpful, feel free to give us a cite.
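The `transformers` snippet referenced in the card is not reproduced here. Below is a hedged sketch: the message structure follows the Qwen-VL chat convention, while the class names (`AutoProcessor`, `AutoModelForImageTextToText`), checkpoint id, and image URL are our assumptions — consult the official model card for exact code.

```python
# Hedged sketch of chatting with Qwen3-VL via transformers.
# Only the message construction runs without the model weights.

def build_vl_messages(image_url: str, question: str) -> list:
    """Qwen-VL-style chat messages mixing an image and a text question."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_url},
                {"type": "text", "text": question},
            ],
        }
    ]

messages = build_vl_messages("https://example.com/demo.jpg", "Describe this image.")

# With the real model (requires weights download and a GPU), roughly:
# from transformers import AutoProcessor, AutoModelForImageTextToText
# processor = AutoProcessor.from_pretrained("cpatonn/Qwen3-VL-32B-Thinking-AWQ-4bit")
# model = AutoModelForImageTextToText.from_pretrained(
#     "cpatonn/Qwen3-VL-32B-Thinking-AWQ-4bit", device_map="auto")
# inputs = processor.apply_chat_template(
#     messages, add_generation_prompt=True, tokenize=True, return_tensors="pt")
# out = model.generate(**inputs, max_new_tokens=256)
```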

license:apache-2.0
140,636
1

Qwen3-Next-80B-A3B-Instruct-AWQ-4bit

---
library_name: transformers
license: apache-2.0
license_link: https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Instruct/blob/main/LICENSE
pipeline_tag: text-generation
base_model:
- Qwen/Qwen3-Next-80B-A3B-Instruct
---

license:apache-2.0
55,568
47

Qwen3-Next-80B-A3B-Thinking-AWQ-4bit

license:apache-2.0
45,395
16

InternVL3_5-38B-AWQ-4bit

license:apache-2.0
44,788
0

Qwen3-VL-30B-A3B-Instruct-AWQ-4bit

Meet Qwen3-VL — the most powerful vision-language model in the Qwen series to date. This generation delivers comprehensive upgrades across the board: superior text understanding & generation, deeper visual perception & reasoning, extended context length, enhanced spatial and video dynamics comprehension, and stronger agent interaction capabilities. Available in Dense and MoE architectures that scale from edge to cloud, with Instruct and reasoning-enhanced Thinking editions for flexible, on-demand deployment.

- Visual Agent: operates PC/mobile GUIs — recognizes elements, understands functions, invokes tools, completes tasks.
- Visual Coding Boost: generates Draw.io/HTML/CSS/JS from images/videos.
- Advanced Spatial Perception: judges object positions, viewpoints, and occlusions; provides stronger 2D grounding and enables 3D grounding for spatial reasoning and embodied AI.
- Long Context & Video Understanding: native 256K context, expandable to 1M; handles books and hours-long video with full recall and second-level indexing.
- Enhanced Multimodal Reasoning: excels in STEM/math — causal analysis and logical, evidence-based answers.
- Upgraded Visual Recognition: broader, higher-quality pretraining enables it to "recognize everything" — celebrities, anime, products, landmarks, flora/fauna, etc.
- Expanded OCR: supports 32 languages (up from 19); robust in low light, blur, and tilt; better with rare/ancient characters and jargon; improved long-document structure parsing.
- Text Understanding on par with pure LLMs: seamless text–vision fusion for lossless, unified comprehension.

1. Interleaved-MRoPE: full-frequency allocation over time, width, and height via robust positional embeddings, enhancing long-horizon video reasoning.
2. DeepStack: fuses multi-level ViT features to capture fine-grained details and sharpen image–text alignment.
3. Text–Timestamp Alignment: moves beyond T-RoPE to precise, timestamp-grounded event localization for stronger video temporal modeling.

This is the weight repository for Qwen3-VL-30B-A3B-Instruct. Below, we provide simple examples showing how to use Qwen3-VL with 🤖 ModelScope and 🤗 Transformers. The code of Qwen3-VL has been merged into the latest Hugging Face `transformers`, and we advise you to build from source. Here we show a code snippet for using the chat model with `transformers`. If you find our work helpful, feel free to give us a cite.

license:apache-2.0
34,650
6

NVIDIA-Nemotron-Nano-12B-v2-AWQ-8bit

The pretraining data has a cutoff date of September 2024. NVIDIA-Nemotron-Nano-12B-v2 is a large language model (LLM) trained from scratch by NVIDIA, and designed as a unified model for both reasoning and non-reasoning tasks. It responds to user queries and tasks by first generating a reasoning trace and then concluding with a final response. The model's reasoning capabilities can be controlled via a system prompt. If the user prefers the model to provide its final answer without intermediate reasoning traces, it can be configured to do so, albeit with a slight decrease in accuracy for harder prompts that require reasoning. Conversely, allowing the model to generate reasoning traces first generally results in higher-quality final solutions. The model was fine-tuned from NVIDIA-Nemotron-Nano-12B-v2-Base, which was further compressed into NVIDIA-Nemotron-Nano-9B-v2. The model uses a hybrid architecture consisting primarily of Mamba-2 and MLP layers combined with just four Attention layers. For the architecture, please refer to the Nemotron-H tech report. The model was trained using Megatron-LM and NeMo-RL. The supported languages include: English, German, Spanish, French, Italian, and Japanese. Improved using Qwen.

GOVERNING TERMS: Use of this model is governed by the NVIDIA Open Model License Agreement.

We evaluated our model in Reasoning-On mode across all benchmarks, except RULER, which is evaluated in Reasoning-Off mode.

| Benchmark | NVIDIA-Nemotron-Nano-12B-v2 |
| :---- | ----- |
| AIME25 | 76.25% |
| MATH500 | 97.75% |
| GPQA | 64.48% |
| LCB | 70.79% |
| BFCL v3 | 66.98% |
| IFEVAL-Prompt | 84.70% |
| IFEVAL-Instruction | 89.81% |

All evaluations were done using NeMo-Skills. We published a tutorial with all details necessary to reproduce our evaluation results. This model supports runtime "thinking" budget control. During inference, the user can specify how many tokens the model is allowed to "think".
- Architecture Type: Mamba2-Transformer Hybrid
- Network Architecture: Nemotron-Hybrid

NVIDIA-Nemotron-Nano-12B-v2 is a general-purpose reasoning and chat model intended to be used in English and coding languages. Other non-English languages (German, French, Italian, Spanish, and Japanese) are also supported. It is intended for developers designing AI agent systems, chatbots, RAG systems, and other AI-powered applications, and is also suitable for typical instruction-following tasks.

- Hugging Face, 08/29/2025, via https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-12B-v2
- NVIDIA Nemotron Nano 2: An Accurate and Efficient Hybrid Mamba-Transformer Reasoning Model

Input:
- Input Type(s): Text
- Input Format(s): String
- Input Parameters: One-Dimensional (1D): Sequences
- Other Properties Related to Input: Context length up to 128K. Supported languages include German, Spanish, French, Italian, Korean, Portuguese, Russian, Japanese, Chinese and English.

Output:
- Output Type(s): Text
- Output Format: String
- Output Parameters: One-Dimensional (1D): Sequences up to 128K

Our models are designed and optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.

- Runtime Engine(s): NeMo 25.07.nemotron-nano-v2
- Supported Hardware Microarchitecture Compatibility: NVIDIA A10G, NVIDIA H100-80GB, NVIDIA A100
- Operating System(s): Linux

The snippet below shows how to use this model with Hugging Face Transformers (tested on version 4.48.3).

- Case 1: if `/think` or no reasoning signal is provided in the system prompt, reasoning will be set to `True`.
- Case 2: if `/nothink` is provided, reasoning will be set to `False`.

Note: the `/think` or `/nothink` keywords can also be provided in "user" messages for turn-level reasoning control.
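The two cases above can be sketched as a small helper; this function is our illustration of the documented behavior, not NVIDIA's API.

```python
# Sketch of the /think vs /nothink reasoning control described above.
# Defaults to reasoning on when no signal is present in the system prompt.
def resolve_reasoning(system_prompt: str) -> bool:
    """Return True if the model should emit a reasoning trace."""
    if "/nothink" in system_prompt:
        return False
    # "/think" or no signal at all both enable reasoning
    return True

print(resolve_reasoning("You are helpful. /nothink"))  # False
print(resolve_reasoning("You are helpful."))           # True
```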
We recommend setting `temperature` to `0.6` and `top_p` to `0.95` for reasoning True, using greedy search for reasoning False, and increasing `max_new_tokens` to `1024` or higher for reasoning True.

The snippet below shows how to use this model with TRT-LLM. We tested this on the following commit and followed these instructions to build and install TRT-LLM in a docker container.

The snippet below shows how to use this model with vLLM. Use the latest version of vLLM and follow these instructions to build and install vLLM. Note:
- Remember to add `--mamba_ssm_cache_dtype float32` for accurate quality. Without this option, the model's accuracy may degrade.
- If you encounter a CUDA OOM issue, try `--max-num-seqs 64` and consider lowering the value further if the error persists.

Alternatively, you can use Docker to launch a vLLM server.

The thinking budget allows developers to keep accuracy high and meet response-time targets, which is especially crucial for customer support, autonomous agent steps, and edge devices where every millisecond counts. With budget control, you can set a limit for internal reasoning: `max_thinking_tokens` is a threshold that will attempt to end the reasoning trace at the next newline encountered in the reasoning trace. If no newline is encountered within 500 tokens, it will abruptly end the reasoning trace at `max_thinking_tokens + 500`.

Calling the server with a budget (restricted to 32 tokens here as an example): after launching a vLLM server, you can call the server with tool-call support using a Python script like the one below.

We follow the jinja chat template provided below. This template conditionally adds `<think>\n` to the start of the Assistant response if `/think` is found in either the system prompt or any user message. If no reasoning signal is added, the model defaults to reasoning "on" mode. The chat template adds `<think></think>` to the start of the Assistant response if `/nothink` is found in the system prompt, thus enforcing reasoning on/off behavior.
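The budget rule described above (cut at the first newline after the budget, hard cut at budget + 500) can be sketched over a list of already-decoded tokens; this is our illustration of the stated behavior, not NVIDIA's implementation.

```python
# Sketch of the thinking-budget truncation rule described above.
def truncate_reasoning(tokens: list[str], max_thinking_tokens: int) -> list[str]:
    hard_limit = max_thinking_tokens + 500
    for i in range(max_thinking_tokens, min(len(tokens), hard_limit)):
        if "\n" in tokens[i]:
            return tokens[: i + 1]  # end the trace at the next newline
    return tokens[:hard_limit]      # no newline found: abrupt cut

trace = ["step"] * 40 + ["done\n"] + ["extra"] * 10
print(len(truncate_reasoning(trace, 32)))  # 41 — cut at the newline token
```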
- Data Modality: Text
- Training Data Size: More than 10 Trillion Tokens
- Train/Test/Valid Split: We used 100% of the corpus for pre-training and relied on external benchmarks for testing.
- Data Collection Method by dataset: Hybrid: Automated, Human, Synthetic
- Labeling Method by dataset: Hybrid: Automated, Human, Synthetic

Properties: The post-training corpus for NVIDIA-Nemotron-Nano-12B-v2 consists of English and multilingual text (German, Spanish, French, Italian, Korean, Portuguese, Russian, Japanese, Chinese and English). Our sources cover a variety of document types such as: webpages, dialogue, articles, and other written materials. The corpus spans domains including code, legal, math, science, finance, and more. We also include a small portion of question-answering and alignment-style data to improve model accuracy. For several of the domains listed above we used synthetic data, specifically reasoning traces, from DeepSeek R1/R1-0528, Qwen3-235B-A22B, Nemotron 4 340B, Qwen2.5-32B-Instruct-AWQ, Qwen2.5-14B-Instruct, and Qwen 2.5 72B.

The pre-training corpus for NVIDIA-Nemotron-Nano-12B-v2 consists of high-quality curated and synthetically-generated data. It is trained in the English language, as well as 15 multilingual languages and 43 programming languages. Our sources cover a variety of document types such as: webpages, dialogue, articles, and other written materials. The corpus spans domains including legal, math, science, finance, and more. We also include a small portion of question-answering and alignment-style data to improve model accuracy. The model was pre-trained for approximately twenty trillion tokens.

Alongside the model, we release our final pretraining data, as outlined in this section. For ease of analysis, there is a sample set that is ungated. For all remaining code, math and multilingual data, gating and approval is required, and the dataset is permissively licensed for model training purposes.
More details on the datasets and synthetic data generation methods can be found in the technical report NVIDIA Nemotron Nano 2: An Accurate and Efficient Hybrid Mamba-Transformer Reasoning Model.

| Dataset | Collection Period |
| :---- | :---- |
| Problems in Elementary Mathematics for Home Study | 4/23/2025 |
| GSM8K | 4/23/2025 |
| PRM800K | 4/23/2025 |
| CC-NEWS | 4/23/2025 |
| Common Crawl | 4/23/2025 |
| Wikimedia | 4/23/2025 |
| Bespoke-Stratos-17k | 4/23/2025 |
| tigerbot-kaggle-leetcodesolutions-en-2k | 4/23/2025 |
| glaive-function-calling-v2 | 4/23/2025 |
| APIGen Function-Calling | 4/23/2025 |
| LMSYS-Chat-1M | 4/23/2025 |
| Open Textbook Library - CC BY-SA & GNU subset and OpenStax - CC BY-SA subset | 4/23/2025 |
| Advanced Reasoning Benchmark, tigerbot-kaggle-leetcodesolutions-en-2k, PRM800K, and SciBench | 4/23/2025 |
| FineWeb-2 | 4/23/2025 |
| Court Listener | Legacy Download |
| peS2o | Legacy Download |
| OpenWebMath | Legacy Download |
| BioRxiv | Legacy Download |
| PMC Open Access Subset | Legacy Download |
| OpenWebText2 | Legacy Download |
| Stack Exchange Data Dump | Legacy Download |
| PubMed Abstracts | Legacy Download |
| NIH ExPorter | Legacy Download |
| arXiv | Legacy Download |
| BigScience Workshop Datasets | Legacy Download |
| Reddit Dataset | Legacy Download |
| SEC's Electronic Data Gathering, Analysis, and Retrieval (EDGAR) | Legacy Download |
| Public Software Heritage S3 | Legacy Download |
| The Stack | Legacy Download |
| mC4 | Legacy Download |
| Advanced Mathematical Problem Solving | Legacy Download |
| MathPile | Legacy Download |
| NuminaMath CoT | Legacy Download |
| PMC Article | Legacy Download |
| FLAN | Legacy Download |
| Advanced Reasoning Benchmark | Legacy Download |
| SciBench | Legacy Download |
| WikiTableQuestions | Legacy Download |
| FinQA | Legacy Download |
| Riddles | Legacy Download |
| Problems in Elementary Mathematics for Home Study | Legacy Download |
| MedMCQA | Legacy Download |
| Cosmos QA | Legacy Download |
| MCTest | Legacy Download |
| AI2's Reasoning Challenge | Legacy Download |
| OpenBookQA | Legacy Download |
| MMLU Auxiliary Train | Legacy Download |
| social-chemestry-101 | Legacy Download |
| Moral Stories | Legacy Download |
| The Common Pile v0.1 | Legacy Download |
| FineMath | Legacy Download |
| MegaMath | Legacy Download |
| FastChat | 6/30/2025 |

Private non-publicly accessible datasets of third parties:

| Dataset |
| :---- |
| Global Regulation |
| Workbench |

The English Common Crawl data was downloaded from the Common Crawl Foundation (see their FAQ for details on their crawling) and includes the snapshots CC-MAIN-2013-20 through CC-MAIN-2025-13. The data was subsequently deduplicated and filtered in various ways described in the Nemotron-CC paper. Additionally, we extracted data for fifteen languages from the following three Common Crawl snapshots: CC-MAIN-2024-51, CC-MAIN-2025-08, CC-MAIN-2025-18. The fifteen languages included were Arabic, Chinese, Danish, Dutch, French, German, Italian, Japanese, Korean, Polish, Portuguese, Russian, Spanish, Swedish, and Thai. As we did not have reliable multilingual model-based quality classifiers available, we applied just heuristic filtering instead — similar to what we did for lower-quality English data in the Nemotron-CC pipeline, but selectively removing some filters for some languages that did not work well. Deduplication was done in the same way as for Nemotron-CC.

The GitHub Crawl was collected using the GitHub REST API and the Amazon S3 API. Each crawl was operated in accordance with the rate limits set by its respective source, either GitHub or S3. We collect raw source code and subsequently remove any having a license which does not exist in our permissive-license set (for additional details, refer to the technical report).
| Dataset | Modality | Dataset Size (Tokens) | Collection Period |
| :---- | :---- | :---- | :---- |
| English Common Crawl | Text | 3.360T | 4/8/2025 |
| Multilingual Common Crawl | Text | 812.7B | 5/1/2025 |
| GitHub Crawl | Text | 747.4B | 4/29/2025 |

| Dataset | Modality | Dataset Size (Tokens) | Seed Dataset | Model(s) used for generation |
| :---- | :---- | :---- | :---- | :---- |
| Synthetic Art of Problem Solving from DeepSeek-R1 | Text | 25.5B | Art of Problem Solving; American Mathematics Competitions 8; American Mathematics Competitions 10 | DeepSeek-R1 |
| Synthetic Moral Stories and Social Chemistry from Mixtral-8x22B-v0.1 | Text | 327M | social-chemestry-101; Moral Stories | Mixtral-8x22B-v0.1 |
| Synthetic Social Sciences seeded with OpenStax from DeepSeek-V3, Mixtral-8x22B-v0.1, and Qwen2.5-72B | Text | 83.6M | OpenStax - CC BY-SA subset | DeepSeek-V3; Mixtral-8x22B-v0.1; Qwen2.5-72B |
| Synthetic Health Sciences seeded with OpenStax from DeepSeek-V3, Mixtral-8x22B-v0.1, and Qwen2.5-72B | Text | 9.7M | OpenStax - CC BY-SA subset | DeepSeek-V3; Mixtral-8x22B-v0.1; Qwen2.5-72B |
| Synthetic STEM seeded with OpenStax, Open Textbook Library, and GSM8K from DeepSeek-R1, DeepSeek-V3, DeepSeek-V3-0324, and Qwen2.5-72B | Text | 175M | OpenStax - CC BY-SA subset; GSM8K; Open Textbook Library - CC BY-SA & GNU subset | DeepSeek-R1; DeepSeek-V3; DeepSeek-V3-0324; Qwen2.5-72B |
| Nemotron-PrismMath | Text | 4.6B | Big-Math-RL-Verified; OpenR1-Math-220k | Qwen2.5-0.5B-instruct; Qwen2.5-72B-Instruct; DeepSeek-R1-Distill-Qwen-32B |
| Synthetic Question Answering Data from Papers and Permissible Books from Qwen2.5-72B-Instruct | Text | 350M | arXiv; National Institutes of Health ExPorter; BioRxiv; PMC Article; USPTO Backgrounds; peS2o; Global Regulation; CORE; PG-19; DOAB CC BY & CC BY-SA subset; NDLTD | Qwen2.5-72B-Instruct |
| Synthetic FineMath-4+ Reprocessed from DeepSeek-V3 | Text | 9.2B | Common Crawl | DeepSeek-V3 |
| Synthetic FineMath-3+ Reprocessed from phi-4 | Text | 27.6B | Common Crawl | phi-4 |
| Synthetic Union-3+ Reprocessed from phi-4 | Text | 93.1B | Common Crawl | phi-4 |
| Refreshed Nemotron-MIND from phi-4 | Text | 73B | Common Crawl | phi-4 |
| Synthetic Union-4+ Reprocessed from phi-4 | Text | 14.12B | Common Crawl | phi-4 |
| Synthetic Union-3+ minus 4+ Reprocessed from phi-4 | Text | 78.95B | Common Crawl | phi-4 |
| Synthetic Union-3 Refreshed from phi-4 | Text | 80.94B | Common Crawl | phi-4 |
| Synthetic Union-4+ Refreshed from phi-4 | Text | 52.32B | Common Crawl | phi-4 |
| Synthetic AGIEval seeded with AQUA-RAT, LogiQA, and AR-LSAT from DeepSeek-V3 and DeepSeek-V3-0324 | Text | 4.0B | AQUA-RAT; LogiQA; AR-LSAT | DeepSeek-V3; DeepSeek-V3-0324 |
| Synthetic AGIEval seeded with AQUA-RAT, LogiQA, and AR-LSAT from Qwen3-30B-A3B | Text | 4.2B | AQUA-RAT; LogiQA; AR-LSAT | Qwen3-30B-A3B |
| Synthetic Art of Problem Solving from Qwen2.5-32B-Instruct, Qwen2.5-Math-72B, Qwen2.5-Math-7B, and Qwen2.5-72B-Instruct | Text | 83.1B | Art of Problem Solving; American Mathematics Competitions 8; American Mathematics Competitions 10; GSM8K; PRM800K | Qwen2.5-32B-Instruct; Qwen2.5-Math-72B; Qwen2.5-Math-7B; Qwen2.5-72B-Instruct |
| Synthetic MMLU Auxiliary Train from DeepSeek-R1 | Text | 0.5B | MMLU Auxiliary Train | DeepSeek-R1 |
| Synthetic Long Context Continued Post-Training Data from Papers and Permissible Books from Qwen2.5-72B-Instruct | Text | 5.4B | arXiv; National Institutes of Health ExPorter; BioRxiv; PMC Article; USPTO Backgrounds; peS2o; Global Regulation; CORE; PG-19; DOAB CC BY & CC BY-SA subset; NDLTD | Qwen2.5-72B-Instruct |
| Synthetic Common Crawl from Qwen3-30B-A3B and Mistral-Nemo-12B-Instruct | Text | 1.949T | Common Crawl | Qwen3-30B-A3B; Mistral-NeMo-12B-Instruct |
| Synthetic Multilingual Data from Common Crawl from Qwen3-30B-A3B | Text | 997.3B | Common Crawl | Qwen3-30B-A3B |
| Synthetic Multilingual Data from Wikimedia from Qwen3-30B-A3B | Text | 55.1B | Wikimedia | Qwen3-30B-A3B |
| Synthetic OpenMathReasoning from DeepSeek-R1-0528 | Text | 1.5M | OpenMathReasoning | DeepSeek-R1-0528 |
| Synthetic OpenCodeReasoning from DeepSeek-R1-0528 | Text | 1.1M | OpenCodeReasoning | DeepSeek-R1-0528 |
| Synthetic Science Data from DeepSeek-R1-0528 | Text | 1.5M | - | DeepSeek-R1-0528 |
| Synthetic Humanity's Last Exam from DeepSeek-R1-0528 | Text | 460K | Humanity's Last Exam | DeepSeek-R1-0528 |
| Synthetic ToolBench from Qwen3-235B-A22B | Text | 400K | ToolBench | Qwen3-235B-A22B |
| Synthetic Nemotron Content Safety Dataset V2, eval-safety, Gretel Synthetic Safety Alignment, and RedTeam_2K from DeepSeek-R1-0528 | Text | 52K | Nemotron Content Safety Dataset V2; eval-safety; Gretel Synthetic Safety Alignment; RedTeam_2K | DeepSeek-R1-0528 |
| Synthetic HelpSteer from Qwen3-235B-A22B | Text | 120K | HelpSteer3; HelpSteer2 | Qwen3-235B-A22B |
| Synthetic Alignment data from Mixtral-8x22B-Instruct-v0.1, Mixtral-8x7B-Instruct-v0.1, and Nemotron-4 Family | Text | 400K | HelpSteer2; C4; LMSYS-Chat-1M; ShareGPT52K; tigerbot-kaggle-leetcodesolutions-en-2k; GSM8K; PRM800K; lm_identity (NVIDIA internal); FinQA; WikiTableQuestions; Riddles; ChatQA nvolve-multiturn (NVIDIA internal); glaive-function-calling-v2; SciBench; OpenBookQA; Advanced Reasoning Benchmark; Public Software Heritage S3; Khan Academy Math Keywords | Nemotron-4-15B-Base (NVIDIA internal); Nemotron-4-15B-Instruct (NVIDIA internal); Nemotron-4-340B-Base; Nemotron-4-340B-Instruct; Nemotron-4-340B-Reward; Mixtral-8x7B-Instruct-v0.1; Mixtral-8x22B-Instruct-v0.1 |
| Synthetic LMSYS-Chat-1M from Qwen3-235B-A22B | Text | 1M | LMSYS-Chat-1M | Qwen3-235B-A22B |
| Synthetic Multilingual Reasoning data from DeepSeek-R1-0528, Qwen2.5-32B-Instruct-AWQ, and Qwen2.5-14B-Instruct | Text | 25M | OpenMathReasoning; OpenCodeReasoning | DeepSeek-R1-0528; Qwen2.5-32B-Instruct-AWQ (translation); Qwen2.5-14B-Instruct (translation) |
| Synthetic Multilingual Reasoning data from Qwen3-235B-A22B and Gemma 3 Post-Trained models | Text | 5M | WildChat | Qwen3-235B-A22B; Gemma 3 PT 12B; Gemma 3 PT 27B |

- Data Collection Method by dataset: Hybrid: Human, Synthetic
- Labeling Method by dataset: Hybrid: Automated, Human, Synthetic

NVIDIA believes Trustworthy AI is a shared responsibility, and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our Trustworthy AI terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. For more detailed information on ethical considerations for this model, please see the Model Card++ Bias, Explainability, Safety & Security, and Privacy Subcards. Please report security vulnerabilities or NVIDIA AI concerns here.

34,006
1

granite-4.0-h-micro-AWQ-4bit

Model Summary: Granite-4.0-H-Micro is a 3B-parameter long-context instruct model finetuned from Granite-4.0-H-Micro-Base using a combination of open-source instruction datasets with permissive licenses and internally collected synthetic datasets. This model is developed using a diverse set of techniques with a structured chat format, including supervised finetuning, model alignment using reinforcement learning, and model merging. Granite 4.0 instruct models feature improved instruction following (IF) and tool-calling capabilities, making them more effective in enterprise applications.

- Developers: Granite Team, IBM
- HF Collection: Granite 4.0 Language Models HF Collection
- GitHub Repository: ibm-granite/granite-4.0-language-models
- Website: Granite Docs
- Release Date: October 2nd, 2025
- License: Apache 2.0

Supported Languages: English, German, Spanish, French, Japanese, Portuguese, Arabic, Czech, Italian, Korean, Dutch, and Chinese. Users may finetune Granite 4.0 models for languages beyond these twelve.

Intended use: The model is designed to respond to general instructions and can be used to build AI assistants for multiple domains, including business applications.

Capabilities:
- Summarization
- Text classification
- Text extraction
- Question-answering
- Retrieval Augmented Generation (RAG)
- Code-related tasks
- Function-calling tasks
- Multilingual dialog use cases
- Fill-In-the-Middle (FIM) code completions

Generation: This is a simple example of how to use the Granite-4.0-H-Micro model. Copy the snippet from the section that is relevant for your use case.

Tool-calling: Granite-4.0-H-Micro comes with enhanced tool-calling capabilities, enabling seamless integration with external functions and APIs. To define a list of tools, please follow OpenAI's function definition schema.
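As a concrete illustration of the OpenAI function-definition schema mentioned above, here is a minimal sketch of a tool list and chat payload. The tool (`get_current_weather`) and its parameters are hypothetical, not part of the Granite card; an OpenAI-compatible server such as vLLM would accept this shape.

```python
# Sketch of a tool definition following OpenAI's function schema, as the card
# suggests for Granite-4.0-H-Micro tool-calling. The tool name and parameters
# (get_current_weather / city) are illustrative placeholders.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "required": ["city"],
                "properties": {
                    "city": {"type": "string", "description": "City name"},
                },
            },
        },
    }
]

messages = [{"role": "user", "content": "What is the weather in Paris?"}]

# With a running OpenAI-compatible server, the request would look roughly like:
# client.chat.completions.create(model="ibm-granite/granite-4.0-h-micro",
#                                messages=messages, tools=tools)
```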
This is an example of how to use the Granite-4.0-H-Micro model's tool-calling ability:

Benchmarks

| Metric | Micro Dense | H Micro Dense | H Tiny MoE | H Small MoE |

Multilingual benchmarks and the included languages:

| Benchmark | # Languages | Languages |
| --------- | ----------- | --------- |
| MMMLU | 11 | ar, de, en, es, fr, ja, ko, pt, zh, bn, hi |
| INCLUDE | 14 | hi, bn, ta, te, ar, de, es, fr, it, ja, ko, nl, pt, zh |

Model Architecture: The Granite-4.0-H-Micro baseline is built on a decoder-only dense transformer architecture. Core components of this architecture are: GQA, Mamba2, MLP with SwiGLU, RMSNorm, and shared input/output embeddings.

| Model | Micro Dense | H Micro Dense | H Tiny MoE | H Small MoE |
| ----- | ----------- | ------------- | ---------- | ----------- |
| Number of layers | 40 attention | 4 attention / 36 Mamba2 | 4 attention / 36 Mamba2 | 4 attention / 36 Mamba2 |
| MLP / shared expert hidden size | 8192 | 8192 | 1024 | 1536 |

Training Data: Overall, our SFT data is largely comprised of three key sources: (1) publicly available datasets with permissive licenses, (2) internal synthetic data targeting specific capabilities, and (3) a select set of human-curated data.

Infrastructure: We trained the Granite 4.0 Language Models on an NVIDIA GB200 NVL72 cluster hosted in CoreWeave. Intra-rack communication occurs via the 72-GPU NVLink domain, and a non-blocking, full Fat-Tree NDR 400 Gb/s InfiniBand network provides inter-rack communication. This cluster provides a scalable and efficient infrastructure for training our models over thousands of GPUs.

Ethical Considerations and Limitations: Granite 4.0 Instruction Models are primarily finetuned using instruction-response pairs, mostly in English, but also multilingual data covering multiple languages. Although this model can handle multilingual dialog use cases, its performance might not match that on English tasks. In such cases, introducing a small number of examples (few-shot) can help the model generate more accurate outputs. While this model has been aligned with safety in mind, it may in some cases produce inaccurate, biased, or unsafe responses to user prompts. We therefore urge the community to use this model with proper safety testing and tuning tailored to their specific tasks.

Resources
- ⭐️ Learn about the latest updates with Granite: https://www.ibm.com/granite
- 📄 Get started with tutorials, best practices, and prompt engineering advice: https://www.ibm.com/granite/docs/
- 💡 Learn about the latest Granite learning resources: https://ibm.biz/granite-learning-resources

license:apache-2.0
31,560
0

Qwen3-VL-8B-Instruct-AWQ-8bit

Meet Qwen3-VL — the most powerful vision-language model in the Qwen series to date. This generation delivers comprehensive upgrades across the board: superior text understanding & generation, deeper visual perception & reasoning, extended context length, enhanced spatial and video dynamics comprehension, and stronger agent interaction capabilities. Available in Dense and MoE architectures that scale from edge to cloud, with Instruct and reasoning‑enhanced Thinking editions for flexible, on‑demand deployment.

- Visual Agent: Operates PC/mobile GUIs—recognizes elements, understands functions, invokes tools, completes tasks.
- Visual Coding Boost: Generates Draw.io/HTML/CSS/JS from images/videos.
- Advanced Spatial Perception: Judges object positions, viewpoints, and occlusions; provides stronger 2D grounding and enables 3D grounding for spatial reasoning and embodied AI.
- Long Context & Video Understanding: Native 256K context, expandable to 1M; handles books and hours-long video with full recall and second-level indexing.
- Enhanced Multimodal Reasoning: Excels in STEM/Math—causal analysis and logical, evidence-based answers.
- Upgraded Visual Recognition: Broader, higher-quality pretraining enables the model to "recognize everything": celebrities, anime, products, landmarks, flora/fauna, etc.
- Expanded OCR: Supports 32 languages (up from 19); robust in low light, blur, and tilt; better with rare/ancient characters and jargon; improved long-document structure parsing.
- Text Understanding on par with pure LLMs: Seamless text–vision fusion for lossless, unified comprehension.

1. Interleaved-MRoPE: Full‑frequency allocation over time, width, and height via robust positional embeddings, enhancing long‑horizon video reasoning.
2. DeepStack: Fuses multi‑level ViT features to capture fine‑grained details and sharpen image–text alignment.
3. Text–Timestamp Alignment: Moves beyond T‑RoPE to precise, timestamp‑grounded event localization for stronger video temporal modeling.
This is the weight repository for Qwen3-VL-8B-Instruct. Below, we provide simple examples showing how to use Qwen3-VL with 🤖 ModelScope and 🤗 Transformers. The code for Qwen3-VL has been merged into the latest Hugging Face transformers, and we advise you to build from source with the command: Here we show a code snippet demonstrating how to use the chat model with `transformers`: If you find our work helpful, feel free to cite us.
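For orientation, here is a minimal sketch of the chat message layout used with vision-language models like Qwen3-VL-8B-Instruct. The image URL is a placeholder, and the processor/model calls in the comments are assumptions based on the standard transformers multimodal chat-template workflow, not copied from this card.

```python
# Illustrative chat message layout for a vision-language model such as
# Qwen3-VL-8B-Instruct. The image URL is a placeholder; the class names in
# the comments are assumptions, so check the model card for exact usage.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://example.com/demo.jpg"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# With transformers built from source, inference would look roughly like:
# processor = AutoProcessor.from_pretrained("Qwen/Qwen3-VL-8B-Instruct")
# inputs = processor.apply_chat_template(messages, tokenize=True,
#                                        return_dict=True, return_tensors="pt")
# output_ids = model.generate(**inputs, max_new_tokens=128)
```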

license:apache-2.0
19,208
1

Qwen3-VL-8B-Instruct-AWQ-4bit

Meet Qwen3-VL — the most powerful vision-language model in the Qwen series to date. This generation delivers comprehensive upgrades across the board: superior text understanding & generation, deeper visual perception & reasoning, extended context length, enhanced spatial and video dynamics comprehension, and stronger agent interaction capabilities. Available in Dense and MoE architectures that scale from edge to cloud, with Instruct and reasoning‑enhanced Thinking editions for flexible, on‑demand deployment.

- Visual Agent: Operates PC/mobile GUIs—recognizes elements, understands functions, invokes tools, completes tasks.
- Visual Coding Boost: Generates Draw.io/HTML/CSS/JS from images/videos.
- Advanced Spatial Perception: Judges object positions, viewpoints, and occlusions; provides stronger 2D grounding and enables 3D grounding for spatial reasoning and embodied AI.
- Long Context & Video Understanding: Native 256K context, expandable to 1M; handles books and hours-long video with full recall and second-level indexing.
- Enhanced Multimodal Reasoning: Excels in STEM/Math—causal analysis and logical, evidence-based answers.
- Upgraded Visual Recognition: Broader, higher-quality pretraining enables the model to "recognize everything": celebrities, anime, products, landmarks, flora/fauna, etc.
- Expanded OCR: Supports 32 languages (up from 19); robust in low light, blur, and tilt; better with rare/ancient characters and jargon; improved long-document structure parsing.
- Text Understanding on par with pure LLMs: Seamless text–vision fusion for lossless, unified comprehension.

1. Interleaved-MRoPE: Full‑frequency allocation over time, width, and height via robust positional embeddings, enhancing long‑horizon video reasoning.
2. DeepStack: Fuses multi‑level ViT features to capture fine‑grained details and sharpen image–text alignment.
3. Text–Timestamp Alignment: Moves beyond T‑RoPE to precise, timestamp‑grounded event localization for stronger video temporal modeling.
This is the weight repository for Qwen3-VL-8B-Instruct. Below, we provide simple examples showing how to use Qwen3-VL with 🤖 ModelScope and 🤗 Transformers. The code for Qwen3-VL has been merged into the latest Hugging Face transformers, and we advise you to build from source with the command: Here we show a code snippet demonstrating how to use the chat model with `transformers`: If you find our work helpful, feel free to cite us.

license:apache-2.0
18,289
3

InternVL3_5-14B-AWQ-4bit

Method: vllm-project/llm-compressor and nvidia/Llama-Nemotron-Post-Training-Dataset were used to quantize the original model. For further information on quantization arguments and configurations, please visit config.json and recipe.yaml.

[\[📂 GitHub\]](https://github.com/OpenGVLab/InternVL) [\[📜 InternVL 1.0\]](https://huggingface.co/papers/2312.14238) [\[📜 InternVL 1.5\]](https://huggingface.co/papers/2404.16821) [\[📜 InternVL 2.5\]](https://huggingface.co/papers/2412.05271) [\[📜 InternVL2.5-MPO\]](https://huggingface.co/papers/2411.10442) [\[📜 InternVL3\]](https://huggingface.co/papers/2504.10479) [\[📜 InternVL3.5\]](https://huggingface.co/papers/2508.18265) [\[🆕 Blog\]](https://internvl.github.io/blog/) [\[🗨️ Chat Demo\]](https://chat.intern-ai.org.cn/) [\[🚀 Quick Start\]](#quick-start) [\[📖 Documents\]](https://internvl.readthedocs.io/en/latest/)

We introduce InternVL3.5, a new family of open-source multimodal models that significantly advances versatility, reasoning capability, and inference efficiency across the InternVL series. A key innovation is the Cascade Reinforcement Learning (Cascade RL) framework, which enhances reasoning through a two-stage process: offline RL for stable convergence and online RL for refined alignment. This coarse-to-fine training strategy leads to substantial improvements on downstream reasoning tasks, e.g., MMMU and MathVista. To optimize efficiency, we propose a Visual Resolution Router (ViR) that dynamically adjusts the resolution of visual tokens without compromising performance. Coupled with ViR, our Decoupled Vision-Language Deployment (DvD) strategy separates the vision encoder and language model across different GPUs, effectively balancing computational load. These contributions collectively enable InternVL3.5 to achieve up to a +16.0\% gain in overall reasoning performance and a 4.05 \\(\times\\) inference speedup compared to its predecessor, i.e., InternVL3.
In addition, InternVL3.5 supports novel capabilities such as GUI interaction and embodied agency. Notably, our largest model, i.e., InternVL3.5-241B-A28B, attains state-of-the-art results among open-source MLLMs across general multimodal, reasoning, text, and agentic tasks, narrowing the performance gap with leading commercial models like GPT-5. All models and code are publicly released.

> Hatched bars represent closed-source commercial models. We report average scores on a set of multimodal general, reasoning, text, and agentic benchmarks: MMBench v1.1 (en), MMStar, BLINK, HallusionBench, AI2D, OCRBench, MMVet, MME-RealWorld (en), MVBench, VideoMME, MMMU, MathVista, MathVision, MathVerse, DynaMath, WeMath, LogicVista, MATH500, AIME24, AIME25, GPQA, MMLU-Pro, GAOKAO, IFEval, SGP-Bench, VSI-Bench, ERQA, SpaCE-10, and OmniSpatial.

In the following table, we provide an overview of the InternVL3.5 series. To maintain consistency with earlier generations, we provide two model formats: the GitHub format, consistent with prior releases, and the HF format, aligned with the official Transformers standard.

> If you want to convert a checkpoint between these two formats, please refer to the custom2hf and hf2custom scripts.
| Model | #Vision Param | #Language Param | #Total Param | HF Link | ModelScope Link |
| --------------------- | ------------- | --------------- | ------------ | ------- | --------------- |
| InternVL3.5-1B | 0.3B | 0.8B | 1.1B | 🤗 link | 🤖 link |
| InternVL3.5-2B | 0.3B | 2.0B | 2.3B | 🤗 link | 🤖 link |
| InternVL3.5-4B | 0.3B | 4.4B | 4.7B | 🤗 link | 🤖 link |
| InternVL3.5-8B | 0.3B | 8.2B | 8.5B | 🤗 link | 🤖 link |
| InternVL3.5-14B | 0.3B | 14.8B | 15.1B | 🤗 link | 🤖 link |
| InternVL3.5-38B | 5.5B | 32.8B | 38.4B | 🤗 link | 🤖 link |
| InternVL3.5-20B-A4B | 0.3B | 20.9B | 21.2B-A4B | 🤗 link | 🤖 link |
| InternVL3.5-30B-A3B | 0.3B | 30.5B | 30.8B-A3B | 🤗 link | 🤖 link |
| InternVL3.5-241B-A28B | 5.5B | 235.1B | 240.7B-A28B | 🤗 link | 🤖 link |

| Model | #Vision Param | #Language Param | #Total Param | HF Link | ModelScope Link |
| ------------------------ | ------------- | --------------- | ------------ | ------- | --------------- |
| InternVL3.5-1B-HF | 0.3B | 0.8B | 1.1B | 🤗 link | 🤖 link |
| InternVL3.5-2B-HF | 0.3B | 2.0B | 2.3B | 🤗 link | 🤖 link |
| InternVL3.5-4B-HF | 0.3B | 4.4B | 4.7B | 🤗 link | 🤖 link |
| InternVL3.5-8B-HF | 0.3B | 8.2B | 8.5B | 🤗 link | 🤖 link |
| InternVL3.5-14B-HF | 0.3B | 14.8B | 15.1B | 🤗 link | 🤖 link |
| InternVL3.5-38B-HF | 5.5B | 32.8B | 38.4B | 🤗 link | 🤖 link |
| InternVL3.5-20B-A4B-HF | 0.3B | 20.9B | 21.2B-A4B | 🤗 link | 🤖 link |
| InternVL3.5-30B-A3B-HF | 0.3B | 30.5B | 30.8B-A3B | 🤗 link | 🤖 link |
| InternVL3.5-241B-A28B-HF | 5.5B | 235.1B | 240.7B-A28B | 🤗 link | 🤖 link |

> We conduct the evaluation with VLMEvalKit. To enable the Thinking mode of our model, please set the system prompt to R1_SYSTEM_PROMPT.
When enabling Thinking mode, we recommend setting `do_sample=True` and `temperature=0.6` to mitigate undesired repetition. Our training pipeline comprises four stages: Multimodal Continual Pre-Training (CPT), Supervised Fine-Tuning (SFT), and Cascade Reinforcement Learning (CascadeRL). In CascadeRL, we first fine-tune the model using Mixed Preference Optimization (MPO) under an offline RL setting, followed by GSPO under an online RL setting. For the Flash version of InternVL3.5, we additionally introduce a lightweight training stage, termed Visual Consistency Learning (ViCO), which reduces the token cost required to represent an image patch. Here, we also open-source the model weights after different training stages for potential research usage. If you're unsure which version to use, please select the one without any suffix, as it has completed the full training pipeline.

| Model | Training Pipeline | HF Link | ModelScope Link |
| -------------------------------- | --------------------- | ------- | --------------- |
| InternVL3.5-1B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-1B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-1B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-1B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |
| InternVL3.5-2B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-2B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-2B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-2B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |
| InternVL3.5-4B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-4B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-4B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-4B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |
| InternVL3.5-8B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-8B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-8B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-8B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |
| InternVL3.5-14B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-14B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-14B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-14B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |
| InternVL3.5-30B-A3B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-30B-A3B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-30B-A3B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-30B-A3B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |
| InternVL3.5-38B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-38B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-38B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-38B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |
| InternVL3.5-241B-A28B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-241B-A28B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-241B-A28B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-241B-A28B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |

The Flash version of our model will be released as soon as possible.

`InternVL3.5`: This series of models follows the "ViT–MLP–LLM" paradigm adopted in previous versions of InternVL. We initialize the language model using the Qwen3 series and GPT-OSS, and the vision encoder using InternViT-300M and InternViT-6B. The Dynamic High Resolution strategy introduced in InternVL1.5 is also retained in our design.

`InternVL3.5-Flash`: Compared to InternVL3.5, InternVL3.5-Flash further integrates the Visual Resolution Router (ViR), yielding a series of efficient variants suitable for resource-constrained scenarios. Specifically, in InternVL3.5, each image patch is initially represented as 1024 visual tokens for the vision encoder, which are then compressed into 256 tokens via a pixel shuffle module before being passed to the Large Language Model (LLM).
In InternVL3.5-Flash, as shown in the Figure below, an additional pixel shuffle module with a higher compression rate is included, enabling the compression of visual tokens down to 64 tokens. For each patch, the patch router determines the appropriate compression rate by assessing its semantic richness, and routes it to the corresponding pixel shuffle module accordingly. Benefiting from this patch-aware compression mechanism, InternVL3.5-Flash is able to reduce the number of visual tokens by 50\% while maintaining nearly 100\% of the performance of InternVL3.5.

During the pre-training stage, we update all model parameters jointly using the combination of large-scale text and multimodal corpora. Specifically, given an arbitrary training sample consisting of a multimodal token sequence \\(\mathbf{x}=\left(x_1, x_2, \ldots, x_L\right)\\), the next token prediction (NTP) loss is calculated on each text token as follows:

$$ \mathcal{L}_{i}=-\log p_{\theta}\left(x_i \mid x_1, \ldots, x_{i-1}\right), $$

where \\(x_i\\) is the predicted token and the prefix tokens in \\(\{x_1, x_2, \ldots, x_{i-1}\}\\) can be either text tokens or image tokens. Notably, for conversation samples, only response tokens are included in the calculation of the loss. Additionally, to mitigate bias toward either longer or shorter responses during training, we adopt square averaging to re-weight the NTP loss as follows:

$$ \mathcal{L}_{i}^{'} = \frac{w_i}{\sum_j w_j} \cdot \mathcal{L}_i, \quad w_i = \frac{1}{N^{0.5}}, $$

where \\(N\\) denotes the number of tokens in the training sample on which the loss needs to be calculated. Random JPEG compression is also included to enhance the model's real-world performance. During the SFT phase, we adopt the same objective as in the pre-training stage and use the square-root averaging strategy to calculate the final loss. In this stage, the context window is set to 32K tokens to accommodate long-context information.
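The square-averaging re-weighting above can be sketched numerically: each sample's tokens receive weight \\(N^{-0.5}\\) (with \\(N\\) the number of loss-bearing tokens in that sample), and the weights are normalized across the batch before averaging. This is an illustrative sketch of the formula, not the authors' training code.

```python
# Minimal sketch of square-averaging NTP loss re-weighting: tokens from a
# sample with N loss-bearing tokens get weight N**-0.5, normalized over all
# tokens in the batch, so long samples do not dominate purely by length.
def reweighted_ntp_loss(per_sample_token_losses):
    weights, losses = [], []
    for sample in per_sample_token_losses:
        n = len(sample)
        weights.extend([n ** -0.5] * n)  # w_i = 1 / sqrt(N)
        losses.extend(sample)
    total = sum(weights)
    return sum(w / total * l for w, l in zip(weights, losses))

# One 4-token sample with per-token loss 1.0 and one 1-token sample with
# loss 2.0: the short sample keeps substantial weight.
loss = reweighted_ntp_loss([[1.0, 1.0, 1.0, 1.0], [2.0]])  # -> 4/3
```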
Compared to InternVL3, the SFT stage of InternVL3.5 contains more high-quality and diverse training data derived from three sources: (1) Instruction-following data from InternVL3, which is reused to preserve broad coverage of vision–language tasks. (2) Multimodal reasoning data in the "Thinking" mode, which is included to instill long-thinking capabilities in the model. To construct such data, we first use InternVL3-78B to describe the image and then feed the description into DeepSeek-R1 to sample rollouts with detailed reasoning processes. Rollouts with an incorrect final answer are filtered out. The questions in these datasets cover various expert domains, such as mathematics and scientific disciplines, thereby strengthening performance on different reasoning tasks. (3) Capability-expansion datasets, which endow InternVL3.5 with new skills, including GUI-based interaction, embodied interaction, and scalable vector graphics understanding.

Cascade RL aims to combine the benefits of offline RL and online RL to progressively facilitate the post-training of MLLMs in an efficient manner. Specifically, we first fine-tune the model with an offline RL algorithm as an efficient warm-up stage to reach satisfactory results, which guarantees high-quality rollouts for the subsequent stage. We then employ an online RL algorithm to further refine the output distribution based on rollouts generated by the model itself. Compared to a single offline or online RL stage, our cascaded RL achieves significant performance improvements at a fraction of the GPU time cost. During the offline RL stage, we employ mixed preference optimization (MPO) to fine-tune the model.
Specifically, the training objective of MPO is a combination of preference loss \\(\mathcal{L}_{p}\\), quality loss \\(\mathcal{L}_{q}\\), and generation loss \\(\mathcal{L}_{g}\\), which can be formulated as follows:

$$ \mathcal{L}_{\text{MPO}} = w_{p} \mathcal{L}_{p} + w_{q} \mathcal{L}_{q} + w_{g} \mathcal{L}_{g}, $$

where \\(w_{*}\\) represents the weight assigned to each loss component. The DPO loss, BCO loss, and LM loss serve as the preference loss, quality loss, and generation loss, respectively. During the online RL stage, we employ GSPO, without reference model constraints, as our online RL algorithm, which we find more effective for training both dense and mixture-of-experts (MoE) models. Similar to GRPO, the advantage is defined as the normalized reward across responses sampled from the same query. The training objective of GSPO is given by:

$$ \mathcal{L}_{\mathrm{GSPO}}(\theta)=\mathbb{E}_{x \sim \mathcal{D},\,\left\{y_i\right\}_{i=1}^{G} \sim \pi_{\theta_{\text{old}}}(\cdot \mid x)}\left[\frac{1}{G} \sum_{i=1}^{G} \min \left(s_{i}(\theta) \widehat{A}_{i}, \operatorname{clip}\left(s_{i}(\theta), 1-\varepsilon, 1+\varepsilon\right) \widehat{A}_{i}\right)\right], $$

where the importance sampling ratio is defined as the geometric mean of the per-token ratios.

> Please see our paper for more technical and experimental details.

We further include ViCO as an additional training stage to integrate the visual resolution router (ViR) into InternVL3.5, thereby reducing its inference cost. The resulting efficient version of InternVL3.5 is termed InternVL3.5-Flash. In particular, ViCO comprises two stages:

`Consistency training`: In this stage, the entire model is trained to minimize the divergence between response distributions conditioned on visual tokens with different compression rates. In practice, we introduce an extra reference model, which is frozen and initialized with InternVL3.5.
Given a sample, each image patch is represented as either 256 or 64 tokens, and the training objective is defined as follows:

$$ \mathcal{L}_{\text{ViCO}} = \mathbb{E}_{\xi \sim \mathcal{R}} \left[ \frac{1}{N} \sum_{i=1}^{N} \mathrm{KL}\left( \pi_{\theta_{\text{ref}}}\left(y_i \mid y_{<i}\right) \,\big\|\, \pi_{\theta}\left(y_i \mid y_{<i}, \xi\right) \right) \right], $$

where \\(\xi\\) denotes the compression rate sampled from the router \\(\mathcal{R}\\).

> Please see our paper for more technical and experimental details.

Test-time scaling (TTS) has been empirically demonstrated as an effective approach to enhance the reasoning capabilities of LLMs and MLLMs, particularly for complex tasks necessitating multi-step inference. In this work, we implement a comprehensive test-time scaling approach that simultaneously improves reasoning depth (i.e., deep thinking) and breadth (i.e., parallel thinking).

`Deep Thinking`: By activating the Thinking mode, we guide the model to deliberately engage in step-by-step reasoning (i.e., decomposing complex problems into logical steps and validating intermediate conclusions) prior to generating the final answer. This approach systematically improves the logical structure of solutions for complex problems, particularly those requiring multi-step inference, and enhances reasoning depth.

`Parallel Thinking`: Following InternVL3, for reasoning tasks, we adopt the Best-of-N (BoN) strategy by employing VisualPRM-v1.1 as the critic model to select the optimal response from multiple reasoning candidates. This approach improves reasoning breadth.

> Notably, unless otherwise specified, the experimental results reported in our paper are obtained without applying TTS. Thus far, we have only applied TTS to reasoning benchmarks, since we found that the model already exhibits strong perception and understanding capabilities, and initiating TTS yields no significant improvement.

In multimodal inference, the vision encoder and language model have distinct computational characteristics. The vision encoder, which transforms images into semantic features, is highly parallelizable and does not rely on long-term history states.
In contrast, the language model performs inference in an autoregressive manner, which requires previous states to compute the next one. This sequential property makes the language part more sensitive to memory bandwidth and latency. When MLLMs are deployed online at scale, the vision and language models often block each other, incurring additional inference cost. This effect becomes more pronounced with larger vision models or higher-resolution images. As shown in the Figure above, we propose decoupled vision-language deployment (DvD) to address this issue by separating vision and language processing, with a particular focus on optimizing the prefilling stage. The vision subsystem batches and processes images to produce compact feature embeddings, which are then transmitted to the language subsystem for fusion with the text context prior to decoding. This separation alleviates blocking and brings multimodal prefilling performance closer to that of pure language models. In our system implementation, the ViT and MLP (and ViR for InternVL3.5-Flash) are deployed on the vision server, while the language server executes only the LLM. The communication is unidirectional, transmitting BF16 visual features over TCP, with RDMA optionally employed for higher transmission speed. Vision processing, feature transmission, and language processing are organized into an asynchronous three-stage pipeline, enabling overlapped execution and minimizing pipeline stalls. DvD increases GPU utilization and processing efficiency on the vision side, while enabling the language server to focus exclusively on the LLM's prefilling and decoding without being blocked by vision computation. This design leads to improved throughput and responsiveness. Moreover, the architecture supports independent hardware cost optimization for the vision and language modules, and facilitates the seamless integration of new modules without requiring modifications to the language server deployment.
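The asynchronous three-stage pipeline described above can be sketched with queues, so that vision encoding, feature transmission, and language-side consumption overlap. The stage bodies here are string-manipulating stand-ins for ViT/MLP inference, TCP transfer, and LLM prefill; this is a toy illustration of the pipelining idea, not the DvD implementation.

```python
# Toy sketch of the DvD three-stage pipeline (vision encode -> feature
# transmission -> language prefill) using asyncio queues so stages overlap.
# Stage functions are stand-ins; real servers would run ViT/MLP and the LLM.
import asyncio

async def stage(inq, outq, fn):
    while True:
        item = await inq.get()
        if item is None:        # shutdown sentinel propagates downstream
            await outq.put(None)
            break
        await outq.put(fn(item))

async def run_pipeline(images):
    q1, q2, q3 = asyncio.Queue(), asyncio.Queue(), asyncio.Queue()
    results = []

    async def sink():           # language-side consumer (prefill stand-in)
        while True:
            item = await q3.get()
            if item is None:
                break
            results.append(item)

    tasks = [
        asyncio.create_task(stage(q1, q2, lambda x: f"feat({x})")),  # vision
        asyncio.create_task(stage(q2, q3, lambda x: f"sent({x})")),  # transmit
        asyncio.create_task(sink()),
    ]
    for img in images:
        await q1.put(img)
    await q1.put(None)
    await asyncio.gather(*tasks)
    return results

out = asyncio.run(run_pipeline(["img0", "img1"]))
```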
Multi-Image Understanding & Real-World Comprehension

Comprehensive Multimodal Understanding & Multimodal Hallucination Evaluation

We provide example code to run `InternVL3.5-8B` using `transformers`. Please note that our models with up to 30B parameters can be deployed on a single A100 GPU, while the 38B model requires two A100 GPUs and the 235B model requires eight A100 GPUs.

> In most cases, both LMDeploy and vLLM can be used for model deployment. However, for InternVL3.5-20B-A4B, we recommend using vLLM, since LMDeploy does not yet support GPT-OSS.

> Please use transformers>=4.52.1 to ensure the model works normally. For the 20B version of our model, transformers>=4.55.0 is required.

To enable Thinking mode, please set the system prompt to our Thinking System Prompt. When enabling Thinking mode, we recommend setting `do_sample=True` and `temperature=0.6` to mitigate undesired repetition. Besides this method, you can also use the following code to get streamed output.

Many repositories now support fine-tuning of the InternVL series models, including InternVL, SWIFT, XTuner, and others. Please refer to their documentation for more details on fine-tuning.

LMDeploy is a toolkit for compressing, deploying, and serving LLMs & VLMs. LMDeploy abstracts the complex inference process of multi-modal Vision-Language Models (VLM) into an easy-to-use pipeline, similar to the Large Language Model (LLM) inference pipeline. When dealing with multiple images, you can put them all in one list. Keep in mind that multiple images will lead to a higher number of input tokens, and as a result, the size of the context window typically needs to be increased. Conducting inference with batch prompts is quite straightforward; just place them within a list structure. There are two ways to conduct multi-turn conversations with the pipeline. One is to construct messages according to the OpenAI format and use the method introduced above; the other is to use the `pipeline.chat` interface.
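The batched and multi-turn usage described above can be sketched as follows. The prompts, model path, and the commented `pipeline` calls are illustrative placeholders (they assume LMDeploy and downloaded weights); only the payload shapes are shown executably.

```python
# Batch prompts for an LMDeploy-style pipeline are just a list; multi-turn
# chats use OpenAI-format message lists. Prompts and model path are
# placeholders, not taken from the card.
batch_prompts = ["Describe image A.", "Describe image B."]

multi_turn = [
    {"role": "user", "content": "What is InternVL3.5?"},
    {"role": "assistant", "content": "An open-source multimodal model family."},
    {"role": "user", "content": "Which sizes are available?"},
]

# With LMDeploy installed, inference would look roughly like:
# from lmdeploy import pipeline
# pipe = pipeline("OpenGVLab/InternVL3_5-8B")
# responses = pipe(batch_prompts)   # batched inference over the list
# reply = pipe.chat(multi_turn)     # multi-turn interface (see LMDeploy docs)
```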
LMDeploy's `api_server` enables models to be easily packed into services with a single command. The provided RESTful APIs are compatible with OpenAI's interfaces. Below is an example of starting the service: To use the OpenAI-style interface, you need to install the openai package: This project is released under the Apache-2.0 License. This project uses the pre-trained Qwen3 as a component, which is licensed under the Apache-2.0 License. If you find this project useful in your research, please consider citing:
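For reference, here is a sketch of the request body an OpenAI-style client would send to such a server. The model name and parameter values are placeholders; with the `openai` package, `client.chat.completions.create(**request)` against the server's base URL would issue the equivalent call.

```python
# Sketch of an OpenAI-compatible chat-completions request body, as would be
# POSTed to /v1/chat/completions on an LMDeploy api_server. Model name and
# sampling parameters here are placeholders.
import json

request = {
    "model": "OpenGVLab/InternVL3_5-14B",
    "messages": [
        {"role": "user", "content": "Summarize the InternVL3.5 series in one sentence."}
    ],
    "temperature": 0.6,
    "max_tokens": 256,
}

payload = json.dumps(request)  # serialized body for the HTTP POST
```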

license:apache-2.0
15,654
4

Llama-3_3-Nemotron-Super-49B-v1_5-AWQ-4bit

llama-3
14,767
3

Qwen3-4B-Instruct-2507-AWQ-4bit

license:apache-2.0
10,486
4

Qwen3-30B-A3B-Thinking-2507-AWQ-4bit

license:apache-2.0
6,485
13

Qwen3-Omni-30B-A3B-Instruct-AWQ-4bit

Qwen3-Omni is a natively end-to-end multilingual omni-modal foundation model. It processes text, images, audio, and video, and delivers real-time streaming responses in both text and natural speech. We introduce several architectural upgrades to improve performance and efficiency.

Key features:
- State-of-the-art across modalities: Early text-first pretraining and mixed multimodal training provide native multimodal support. While achieving strong audio and audio-video results, unimodal text and image performance does not regress. Reaches SOTA on 22 of 36 audio/video benchmarks and open-source SOTA on 32 of 36; ASR, audio understanding, and voice conversation performance is comparable to Gemini 2.5 Pro.
- Multilingual: Supports 119 text languages, 19 speech input languages, and 10 speech output languages.
  - Speech Input: English, Chinese, Korean, Japanese, German, Russian, Italian, French, Spanish, Portuguese, Malay, Dutch, Indonesian, Turkish, Vietnamese, Cantonese, Arabic, Urdu.
  - Speech Output: English, Chinese, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean.
- Novel Architecture: MoE-based Thinker–Talker design with AuT pretraining for strong general representations, plus a multi-codebook design that drives latency to a minimum.
- Real-time Audio/Video Interaction: Low-latency streaming with natural turn-taking and immediate text or speech responses.
- Flexible Control: Customize behavior via system prompts for fine-grained control and easy adaptation.
- Detailed Audio Captioner: Qwen3-Omni-30B-A3B-Captioner is now open source: a general-purpose, highly detailed, low-hallucination audio captioning model that fills a critical gap in the open-source community.

Qwen3-Omni supports a wide range of multimodal application scenarios, covering various domain tasks involving audio, image, video, and audio-visual modalities. Below are several cookbooks demonstrating usage of Qwen3-Omni; these cookbooks include our actual execution logs.
You can first follow the QuickStart guide to download the model and install the necessary inference environment dependencies, then run and experiment locally—try modifying prompts or switching model types, and enjoy exploring the capabilities of Qwen3-Omni!

Audio
- Speech Recognition: Speech recognition, supporting multiple languages and long audio.
- Speech Translation: Speech-to-Text / Speech-to-Speech translation.
- Music Analysis: Detailed analysis and appreciation of any music, including style, genre, rhythm, etc.
- Sound Analysis: Description and analysis of various sound effects and audio signals.
- Audio Caption: Audio captioning, detailed description of any audio input.
- Mixed Audio Analysis: Analysis of mixed audio content, such as speech, music, and environmental sounds.

Image
- Image Question Answering: Answering arbitrary questions about any image.
- Image Math: Solving complex mathematical problems in images, highlighting the capabilities of the Thinking model.

Video
- Video Description: Detailed description of video content.
- Video Navigation: Generating navigation commands from first-person motion videos.
- Video Scene Transition: Analysis of scene transitions in videos.

Audio-Visual
- Audio-Visual Question Answering: Answering arbitrary questions in audio-visual scenarios, demonstrating the model's ability to model temporal alignment between audio and video.
- Audio-Visual Interaction: Interactive communication with the model using audio-visual inputs, including task specification via audio.
- Audio-Visual Dialogue: Conversational interaction with the model using audio-visual inputs, showcasing its capabilities in casual chat and assistant-like behavior.

Agent
- Audio Function Call: Using audio input to perform function calls, enabling agent-like behaviors.

Downstream Task Fine-tuning
- Omni Captioner: Introduction and capability demonstration of Qwen3-Omni-30B-A3B-Captioner, a downstream fine-tuned model based on Qwen3-Omni-30B-A3B-Instruct, illustrating the strong generalization ability of the Qwen3-Omni foundation model.
Below is the description of all Qwen3-Omni models. Please select and download the model that fits your needs. | Model Name | Description | |------------------------------|-------------| | Qwen3-Omni-30B-A3B-Instruct | The Instruct model of Qwen3-Omni-30B-A3B, containing both thinker and talker, supporting audio, video, and text input, with audio and text output. For more information, please read the Qwen3-Omni Technical Report. | | Qwen3-Omni-30B-A3B-Thinking | The Thinking model of Qwen3-Omni-30B-A3B, containing the thinker component, equipped with chain-of-thought reasoning, supporting audio, video, and text input, with text output. For more information, please read the Qwen3-Omni Technical Report.| | Qwen3-Omni-30B-A3B-Captioner | A downstream audio fine-grained caption model fine-tuned from Qwen3-Omni-30B-A3B-Instruct, which produces detailed, low-hallucination captions for arbitrary audio inputs. It contains the thinker, supporting audio input and text output. For more information, you can refer to the model's cookbook. | During loading in Hugging Face Transformers or vLLM, model weights will be automatically downloaded based on the model name. However, if your runtime environment is not conducive to downloading weights during execution, you can refer to the following commands to manually download the model weights to a local directory: The Hugging Face Transformers code for Qwen3-Omni has been successfully merged, but the PyPI package has not yet been released. Therefore, you need to install it from source using the following command. We strongly recommend that you create a new Python environment to avoid environment runtime issues. We offer a toolkit to help you handle various types of audio and visual input more conveniently, providing an API-like experience. This includes support for base64, URLs, and interleaved audio, images, and videos. 
You can install it using the following command; make sure your system has `ffmpeg` installed. Additionally, we recommend using FlashAttention 2 when running with Hugging Face Transformers to reduce GPU memory usage. However, if you are primarily using vLLM for inference, this installation is not necessary, as vLLM includes FlashAttention 2 by default. You should also have hardware that is compatible with FlashAttention 2; read more about it in the official documentation of the FlashAttention repository. FlashAttention 2 can only be used when a model is loaded in `torch.float16` or `torch.bfloat16`.

Here is a code snippet showing how to use Qwen3-Omni with `transformers` and `qwen-omni-utils`. Here are some more advanced usage examples; you can expand the sections below to learn more.

The model can batch inputs composed of mixed samples of various types, such as text, images, audio, and videos, when `return_audio=False` is set. Here is an example.

The model supports both text and audio outputs. If users do not need audio outputs, they can call `model.disable_talker()` after initializing the model. This option saves about 10 GB of GPU memory, but the `return_audio` option of the `generate` function will then only allow `False`. For a more flexible experience, we recommend that users decide whether to return audio when the `generate` function is called: if `return_audio` is set to `False`, the model will only return text outputs, resulting in faster text responses.

Qwen3-Omni supports changing the voice of the output audio. The `"Qwen/Qwen3-Omni-30B-A3B-Instruct"` checkpoint supports three voice types:

| Voice Type | Gender | Description |
|------------|--------|-------------|
| Ethan | Male | A bright, upbeat voice with infectious energy and a warm, approachable vibe. |
| Chelsie | Female | A honeyed, velvety voice that carries a gentle warmth and luminous clarity. |
| Aiden | Male | A warm, laid-back American voice with a gentle, boyish charm. |
Users can use the `speaker` parameter of the `generate` function to specify the voice type. By default, if `speaker` is not specified, the voice type is `Ethan`.

We strongly recommend using vLLM for inference and deployment of the Qwen3-Omni series models. Since our code is currently in the pull-request stage, and audio output inference support for the Instruct model will be released in the near future, you can follow the commands below to install vLLM from source. Please note that we recommend you create a new Python environment to avoid runtime environment conflicts and incompatibilities. For more details on compiling vLLM from source, please refer to the vLLM official documentation.

You can use the following code for vLLM inference. The `limit_mm_per_prompt` parameter specifies the maximum number of items of each modality allowed per message. Since vLLM needs to pre-allocate GPU memory, larger values require more GPU memory; if OOM issues occur, try reducing this value. Setting `tensor_parallel_size` greater than one enables multi-GPU parallel inference, improving concurrency and throughput. In addition, `max_num_seqs` indicates the number of sequences that vLLM processes in parallel during each inference step; a larger value requires more GPU memory but enables higher batch inference speed. For more details, please refer to the vLLM official documentation. Below is a simple example of how to run Qwen3-Omni with vLLM. Here are some more advanced usage examples; you can expand the sections below to learn more.

Using vLLM enables fast batch inference, which can help you efficiently process large volumes of data or conduct benchmarking. Refer to the following code example. vLLM serve for Qwen3-Omni currently only supports the thinker model. The `use_audio_in_video` parameter is not available in vLLM serve; you can handle this by separately passing video and audio inputs for processing.
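The vLLM engine arguments discussed above can be collected in one place. This is a sketch only: the values are illustrative, not recommendations, and should be tuned for your GPUs.

```python
# vLLM engine arguments for Qwen3-Omni, as discussed above. Values are
# illustrative assumptions, not recommended settings.
engine_kwargs = dict(
    model="Qwen/Qwen3-Omni-30B-A3B-Instruct",
    tensor_parallel_size=2,  # >1 enables multi-GPU parallel inference
    max_num_seqs=8,          # sequences vLLM processes in parallel per step
    limit_mm_per_prompt={"image": 3, "video": 3, "audio": 3},  # per-message caps
)
# from vllm import LLM
# llm = LLM(**engine_kwargs)   # requires a GPU environment with vLLM installed
```

If OOM occurs, lower `limit_mm_per_prompt` first, since it directly controls pre-allocated multimodal memory.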
You can start vLLM serve through the following command, and then use the chat API as below (via curl, for example).

GPU memory requirements:

| Model | Precision | 15s Video | 30s Video | 60s Video | 120s Video |
|------------------------------|-----------|-----------|-----------|-----------|------------|
| Qwen3-Omni-30B-A3B-Instruct | BF16 | 78.85 GB | 88.52 GB | 107.74 GB | 144.81 GB |
| Qwen3-Omni-30B-A3B-Thinking | BF16 | 68.74 GB | 77.79 GB | 95.76 GB | 131.65 GB |

Note: The table above presents the theoretical minimum memory requirements for inference with `transformers` and `BF16` precision, tested with `attn_implementation="flash_attention_2"`. The Instruct model includes both the thinker and talker components, whereas the Thinking model includes only the thinker.

When using Qwen3-Omni for audio-visual multimodal interaction, where the input consists of a video and its corresponding audio (with the audio serving as a query), we recommend using the following system prompt. This setup helps the model maintain high reasoning capability while better assuming interactive roles such as a smart assistant. Additionally, the text generated by the thinker will be more readable, with a natural, conversational tone and without complex formatting that is difficult to vocalize, leading to more stable and fluent audio output from the talker. You can customize the `user_system_prompt` field in the system prompt to include character settings or other role-specific descriptions as needed.

The `Qwen3-Omni-30B-A3B-Thinking` model is primarily designed for understanding and interacting with multimodal inputs, including text, audio, image, and video. To achieve optimal performance, we recommend that users include an explicit textual instruction or task description in each round of dialogue alongside the multimodal input. This helps clarify the intent and significantly enhances the model's ability to leverage its reasoning capabilities.
For example: In multimodal interaction, user-provided videos are often accompanied by audio (such as spoken questions or sounds from events in the video). This information helps the model provide a better interactive experience. We provide the following options for users to decide whether to use the audio from a video. It is worth noting that during a multi-round conversation, the `use_audio_in_video` parameter must be set consistently across these steps; otherwise, unexpected results may occur.

Qwen3-Omni maintains state-of-the-art performance on text and visual modalities without degradation relative to same-size single-model Qwen counterparts. Across 36 audio and audio-visual benchmarks, it achieves open-source SOTA on 32 and sets the SOTA on 22, outperforming strong closed-source systems such as Gemini 2.5 Pro and GPT-4o.

Multilingual Tasks (Instruct models):

| Benchmark | GPT-4o-0327 | Qwen3-235B-A22B Non-Thinking | Qwen3-30B-A3B-Instruct-2507 | Qwen3-Omni-30B-A3B-Instruct | Qwen3-Omni-Flash-Instruct |
|---|---|---|---|---|---|
| MultiIF | 70.4 | 70.2 | 67.9 | 64.0 | 64.7 |

Multilingual Tasks (Thinking models):

| Benchmark | Gemini-2.5-Flash Thinking | Qwen3-235B-A22B Thinking | Qwen3-30B-A3B-Thinking-2507 | Qwen3-Omni-30B-A3B-Thinking | Qwen3-Omni-Flash-Thinking |
|---|---|---|---|---|---|
| MultiIF | 74.4 | 71.9 | 76.4 | 72.9 | 73.2 |

Speech recognition and translation:

| Benchmark | Seed-ASR | Voxtral-Mini | Voxtral-Small | GPT-4o-Transcribe | Gemini-2.5-Pro | Qwen2.5-Omni | Qwen3-Omni-30B-A3B-Instruct | Qwen3-Omni-Flash-Instruct |
|---|---|---|---|---|---|---|---|---|
| Wenetspeech net \| meeting | 4.66 \| 5.69 | 24.30 \| 31.53 | 20.33 \| 26.08 | 15.30 \| 32.27 | 14.43 \| 13.47 | 5.91 \| 7.65 | 4.69 \| 5.89 | 4.62 \| 5.75 |
| Librispeech clean \| other | 1.58 \| 2.84 | 1.88 \| 4.12 | 1.56 \| 3.30 | 1.39 \| 3.75 | 2.89 \| 3.56 | 1.74 \| 3.45 | 1.22 \| 2.48 | 1.27 \| 2.44 |
| Fleurs-avg (19 lang) | - | 15.67 | 8.09 | 4.48 | 5.55 | 14.04 | 5.33 | 5.31 |
| MIR-1K (vocal-only) | 6.45 | 23.33 | 18.73 | 11.87 | 9.85 | 8.15 | 5.90 | 5.85 |
| Opencpop-test | 2.98 | 31.01 | 16.06 | 7.93 | 6.49 | 2.84 | 1.54 | 2.02 |
| Fleurs-en2xx | - | 30.35 | 37.85 | - | 39.25 | 29.22 | 37.50 | 36.22 |
| Fleurs-xx2en | - | 27.54 | 32.81 | - | 35.41 | 28.61 | 31.08 | 30.71 |
| Fleurs-zh2xx | - | 17.03 | 22.05 | - | 26.63 | 17.97 | 25.17 | 25.10 |
| Fleurs-xx2zh | - | 28.75 | 34.82 | - | 37.50 | 27.68 | 33.13 | 31.19 |

Audio understanding:

| Benchmark | GPT-4o-Audio | Gemini-2.5-Flash | Gemini-2.5-Pro | Qwen2.5-Omni | Qwen3-Omni-30B-A3B-Instruct | Qwen3-Omni-30B-A3B-Thinking | Qwen3-Omni-Flash-Instruct | Qwen3-Omni-Flash-Thinking |
|---|---|---|---|---|---|---|---|---|
| MMAU-v05.15.25 | 62.5 | 71.8 | 77.4 | 65.5 | 77.5 | 75.4 | 77.6 | 76.5 |

Music understanding:

| Benchmark | Best Specialist Models | GPT-4o-Audio | Gemini-2.5-Pro | Qwen2.5-Omni | Qwen3-Omni-30B-A3B-Instruct | Qwen3-Omni-Flash-Instruct |
|---|---|---|---|---|---|---|
| RUL-MuchoMusic | 47.6 (Audio Flamingo 3) | 36.1 | 49.4 | 47.3 | 52.0 | 52.1 |
| MTG Genre Micro F1 | 35.8 (MuQ-MuLan) | 25.3 | 32.6 | 32.5 | 39.0 | 39.5 |
| MTG Mood/Theme Micro F1 | 10.9 (MuQ-MuLan) | 11.3 | 14.1 | 8.9 | 21.0 | 21.7 |
| MTG Instrument Micro F1 | 39.8 (MuQ-MuLan) | 34.2 | 33.0 | 22.6 | 40.5 | 40.7 |
| MTG Top50 Micro F1 | 33.2 (MuQ-MuLan) | 25.0 | 26.1 | 21.6 | 36.7 | 36.9 |
| MagnaTagATune Micro F1 | 41.6 (MuQ) | 29.2 | 28.1 | 30.1 | 44.3 | 46.8 |

(Additional comparison tables covering vision benchmarks against GPT-4o, Gemini-2.0-Flash, Qwen2.5-VL-72B, and InternVL-3.5-241B-A28B, video benchmarks against previous open-source SoTA, and speech generation against MiniMax and ElevenLabs survived extraction only as column headers and are omitted here.)

Decoding Strategy: For the Qwen3-Omni series across all evaluation benchmarks, `Instruct` models use greedy decoding during generation without sampling. For `Thinking` models, the decoding parameters should be taken from the `generation_config.json` file in the checkpoint.

Benchmark-Specific Formatting: Most evaluation benchmarks come with their own ChatML formatting to embed the question or prompt. Note that all video data are set to `fps=2` during evaluation.
Default Prompts: For tasks in certain benchmarks that do not include a prompt, we use the following prompt settings: | Task Type | Prompt | | :--- | :--- | | Auto Speech Recognition (ASR) for Chinese | 请将这段中文语音转换为纯文本。 | | Auto Speech Recognition (ASR) for Other languages | Transcribe the audio into text. | | Speech-to-Text Translation (S2TT) | Listen to the provided speech and produce a translation in text. | | Song Lyrics Recognition | Transcribe the song lyrics into text without any punctuation, separate lines with line breaks, and output only the lyrics without additional explanations. | System Prompt: No `system prompt` should be set for any evaluation benchmark. Input Sequence: The question or prompt should be input as user text. Unless otherwise specified by the benchmark, the text should come after multimodal data in the sequence. For example:
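Following the evaluation recipe above (no system prompt, text prompt placed after the multimodal data), a user message can be sketched like this. The audio URL is hypothetical; the prompt string is the default ASR prompt from the table.

```python
# Evaluation-style input: no system message, multimodal data first, text after.
# The audio URL below is hypothetical.
ASR_PROMPT = "Transcribe the audio into text."  # default ASR prompt from the table
messages = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": "https://example.com/sample.wav"},  # multimodal data first
            {"type": "text", "text": ASR_PROMPT},                          # prompt comes after
        ],
    }
]
```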

5,883
25

Devstral-Small-2507-AWQ-4bit

Method: Quantised using casper-hansen/AutoAWQ with the following configs. Inference: The quantised model's config and weights are stored in Hugging Face safetensors format, but the tokeniser remains in Mistral format. Please set the inference arguments accordingly, e.g.:
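A sketch of vLLM arguments matching the note above: HF-format config and safetensors weights, but a Mistral-format tokenizer. The flag names follow vLLM's documented options; treat the exact combination as an assumption.

```python
# vLLM arguments for a model whose weights/config are in HF safetensors
# format while the tokenizer is in Mistral format. The repo id is assumed
# from the card title.
llm_kwargs = dict(
    model="cpatonn/Devstral-Small-2507-AWQ-4bit",
    tokenizer_mode="mistral",    # tokenizer stays in Mistral format
    config_format="hf",          # model config stored in HF format
    load_format="safetensors",   # weights stored as safetensors
)
# from vllm import LLM
# llm = LLM(**llm_kwargs)   # requires vLLM and a GPU environment
```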

license:apache-2.0
3,437
7

Qwen3-VL-4B-Instruct-AWQ-4bit

license:apache-2.0
3,163
1

NVIDIA-Nemotron-Nano-12B-v2-AWQ-4bit

2,935
3

Qwen3-VL-8B-Thinking-AWQ-4bit

Meet Qwen3-VL — the most powerful vision-language model in the Qwen series to date. This generation delivers comprehensive upgrades across the board: superior text understanding & generation, deeper visual perception & reasoning, extended context length, enhanced spatial and video dynamics comprehension, and stronger agent interaction capabilities. Available in Dense and MoE architectures that scale from edge to cloud, with Instruct and reasoning‑enhanced Thinking editions for flexible, on‑demand deployment.

- Visual Agent: Operates PC/mobile GUIs—recognizes elements, understands functions, invokes tools, completes tasks.
- Visual Coding Boost: Generates Draw.io/HTML/CSS/JS from images/videos.
- Advanced Spatial Perception: Judges object positions, viewpoints, and occlusions; provides stronger 2D grounding and enables 3D grounding for spatial reasoning and embodied AI.
- Long Context & Video Understanding: Native 256K context, expandable to 1M; handles books and hours-long video with full recall and second-level indexing.
- Enhanced Multimodal Reasoning: Excels in STEM/Math—causal analysis and logical, evidence-based answers.
- Upgraded Visual Recognition: Broader, higher-quality pretraining lets it "recognize everything"—celebrities, anime, products, landmarks, flora/fauna, etc.
- Expanded OCR: Supports 32 languages (up from 19); robust in low light, blur, and tilt; better with rare/ancient characters and jargon; improved long-document structure parsing.
- Text Understanding on par with pure LLMs: Seamless text–vision fusion for lossless, unified comprehension.

1. Interleaved-MRoPE: Full‑frequency allocation over time, width, and height via robust positional embeddings, enhancing long‑horizon video reasoning.
2. DeepStack: Fuses multi‑level ViT features to capture fine‑grained details and sharpen image–text alignment.
3. Text–Timestamp Alignment: Moves beyond T‑RoPE to precise, timestamp‑grounded event localization for stronger video temporal modeling.
This is the weight repository for Qwen3-VL-8B-Thinking. Below, we provide simple examples showing how to use Qwen3-VL with 🤖 ModelScope and 🤗 Transformers. The code for Qwen3-VL is in the latest Hugging Face `transformers`, and we advise you to build from source with the following command. Here we show a code snippet demonstrating how to use the chat model with `transformers`. If you find our work helpful, feel free to give us a cite.
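A minimal sketch of the chat input for Qwen3-VL with `transformers`. The image URL is hypothetical, and the commented class/method names are assumptions based on earlier Qwen-VL releases, not the card's verbatim snippet.

```python
# Chat-template input for Qwen3-VL; the image URL is hypothetical.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://example.com/demo.jpg"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
# from transformers import AutoProcessor  # class names below are assumptions
# processor = AutoProcessor.from_pretrained("Qwen/Qwen3-VL-8B-Thinking")
# inputs = processor.apply_chat_template(
#     messages, add_generation_prompt=True, tokenize=True,
#     return_dict=True, return_tensors="pt",
# )
# output_ids = model.generate(**inputs, max_new_tokens=1024)
```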

license:apache-2.0
2,442
1

Qwen3-Omni-30B-A3B-Captioner-AWQ-4bit

2,342
2

Qwen3-VL-32B-Instruct-AWQ-4bit

- Quantization Method: AWQ
- Bits: 4
- Group Size: 32
- Calibration Dataset: HuggingFaceM4/FineVision
- Quantization Tool: llm-compressor

Meet Qwen3-VL — the most powerful vision-language model in the Qwen series to date. This generation delivers comprehensive upgrades across the board: superior text understanding & generation, deeper visual perception & reasoning, extended context length, enhanced spatial and video dynamics comprehension, and stronger agent interaction capabilities. Available in Dense and MoE architectures that scale from edge to cloud, with Instruct and reasoning‑enhanced Thinking editions for flexible, on‑demand deployment.

- Visual Agent: Operates PC/mobile GUIs—recognizes elements, understands functions, invokes tools, completes tasks.
- Visual Coding Boost: Generates Draw.io/HTML/CSS/JS from images/videos.
- Advanced Spatial Perception: Judges object positions, viewpoints, and occlusions; provides stronger 2D grounding and enables 3D grounding for spatial reasoning and embodied AI.
- Long Context & Video Understanding: Native 256K context, expandable to 1M; handles books and hours-long video with full recall and second-level indexing.
- Enhanced Multimodal Reasoning: Excels in STEM/Math—causal analysis and logical, evidence-based answers.
- Upgraded Visual Recognition: Broader, higher-quality pretraining lets it "recognize everything"—celebrities, anime, products, landmarks, flora/fauna, etc.
- Expanded OCR: Supports 32 languages (up from 19); robust in low light, blur, and tilt; better with rare/ancient characters and jargon; improved long-document structure parsing.
- Text Understanding on par with pure LLMs: Seamless text–vision fusion for lossless, unified comprehension.

1. Interleaved-MRoPE: Full‑frequency allocation over time, width, and height via robust positional embeddings, enhancing long‑horizon video reasoning.
2. DeepStack: Fuses multi‑level ViT features to capture fine‑grained details and sharpen image–text alignment.
3. Text–Timestamp Alignment: Moves beyond T‑RoPE to precise, timestamp‑grounded event localization for stronger video temporal modeling.

This is the weight repository for Qwen3-VL-32B-Instruct. Below, we provide simple examples showing how to use Qwen3-VL with 🤖 ModelScope and 🤗 Transformers. The code for Qwen3-VL is in the latest Hugging Face `transformers`, and we advise you to build from source with the following command. Here we show a code snippet demonstrating how to use the chat model with `transformers`. If you find our work helpful, feel free to give us a cite.

license:apache-2.0
2,238
1

granite-4.0-h-small-AWQ-4bit

- Quantization method: AWQ
- Bits: 4
- Group Size: 32
- Calibration Dataset: nvidia/Llama-Nemotron-Post-Training-Dataset
- Quantization Tool: llm-compressor
- The model cannot be loaded with tensor parallelism or pipeline parallelism.

Model Summary: Granite-4.0-H-Small is a 32B-parameter long-context instruct model finetuned from Granite-4.0-H-Small-Base using a combination of permissively licensed open-source instruction datasets and internally collected synthetic datasets. This model is developed using a diverse set of techniques with a structured chat format, including supervised finetuning, model alignment using reinforcement learning, and model merging. Granite 4.0 instruct models feature improved instruction following (IF) and tool-calling capabilities, making them more effective in enterprise applications.

- Developers: Granite Team, IBM
- HF Collection: Granite 4.0 Language Models HF Collection
- GitHub Repository: ibm-granite/granite-4.0-language-models
- Website: Granite Docs
- Release Date: October 2nd, 2025
- License: Apache 2.0

Supported Languages: English, German, Spanish, French, Japanese, Portuguese, Arabic, Czech, Italian, Korean, Dutch, and Chinese. Users may finetune Granite 4.0 models for languages beyond these.

Intended use: The model is designed to respond to general instructions and can be used to build AI assistants for multiple domains, including business applications.

Capabilities:
- Summarization
- Text classification
- Text extraction
- Question-answering
- Retrieval Augmented Generation (RAG)
- Code-related tasks
- Function-calling tasks
- Multilingual dialog use cases
- Fill-In-the-Middle (FIM) code completions

Generation: This is a simple example of how to use the Granite-4.0-H-Small model. Copy the snippet from the section that is relevant for your use case.
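A minimal sketch of a chat request for Granite-4.0-H-Small. The model id follows the card's naming and the prompt is hypothetical; the commented calls use the standard `transformers` chat-template API.

```python
# Minimal chat input for Granite-4.0-H-Small; the prompt is hypothetical.
model_id = "ibm-granite/granite-4.0-h-small"
messages = [
    {"role": "user", "content": "List three uses of retrieval augmented generation."},
]
# from transformers import AutoModelForCausalLM, AutoTokenizer
# tokenizer = AutoTokenizer.from_pretrained(model_id)
# model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
# inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
# output = model.generate(inputs, max_new_tokens=256)
```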
Tool-calling: Granite-4.0-H-Small comes with enhanced tool-calling capabilities, enabling seamless integration with external functions and APIs. To define a list of tools, please follow OpenAI's function definition schema. This is an example of how to use the Granite-4.0-H-Small model's tool-calling ability.

Benchmarks: (the per-benchmark score table for the Micro Dense, H Micro Dense, H Tiny MoE, and H Small MoE variants survived extraction only as column headers and is omitted here.)

Multilingual benchmarks and the included languages:
- MMMLU (11): ar, de, en, es, fr, ja, ko, pt, zh, bn, hi
- INCLUDE (14): hi, bn, ta, te, ar, de, es, fr, it, ja, ko, nl, pt, zh

Model Architecture: The Granite-4.0-H-Small baseline is built on a decoder-only MoE transformer architecture. Core components of this architecture are: GQA, Mamba2, MoEs with shared experts, SwiGLU activation, RMSNorm, and shared input/output embeddings.

| Model | Micro Dense | H Micro Dense | H Tiny MoE | H Small MoE |
|---|---|---|---|---|
| Number of layers | 40 attention | 4 attention / 36 Mamba2 | 4 attention / 36 Mamba2 | 4 attention / 36 Mamba2 |
| MLP / Shared expert hidden size | 8192 | 8192 | 1024 | 1536 |

Training Data: Overall, our SFT data is largely comprised of three key sources: (1) publicly available datasets with permissive licenses, (2) internal synthetic data targeting specific capabilities, and (3) a select set of human-curated data.

Infrastructure: We trained the Granite 4.0 Language Models utilizing an NVIDIA GB200 NVL72 cluster hosted in CoreWeave. Intra-rack communication occurs via the 72-GPU NVLink domain, and a non-blocking, full Fat-Tree NDR 400 Gb/s InfiniBand network provides inter-rack communication. This cluster provides a scalable and efficient infrastructure for training our models over thousands of GPUs.

Ethical Considerations and Limitations: Granite 4.0 Instruct Models are primarily finetuned using instruction-response pairs, mostly in English, but also multilingual data covering multiple languages.
Although this model can handle multilingual dialog use cases, its performance might not match that on English tasks. In such cases, introducing a small number of examples (few-shot) can help the model generate more accurate outputs. While this model has been aligned with safety in consideration, it may in some cases produce inaccurate, biased, or unsafe responses to user prompts. We therefore urge the community to use this model with proper safety testing and tuning tailored to their specific tasks.

Resources
- ⭐️ Learn about the latest updates with Granite: https://www.ibm.com/granite
- 📄 Get started with tutorials, best practices, and prompt engineering advice: https://www.ibm.com/granite/docs/
- 💡 Learn about the latest Granite learning resources: https://ibm.biz/granite-learning-resources
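The tool-calling section above references OpenAI's function definition schema. A minimal sketch of a tool list in that schema; the weather function and its parameters are hypothetical.

```python
# Tools declared per OpenAI's function-definition schema, as the card's
# tool-calling section suggests. "get_current_weather" is hypothetical.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather for a given city.",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "Name of the city"},
                },
                "required": ["city"],
            },
        },
    }
]
```

A list like this is passed to the chat template (or an OpenAI-compatible endpoint) so the model can emit structured function calls.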

license:apache-2.0
2,232
0

Qwen3-VL-30B-A3B-Thinking-AWQ-4bit

license:apache-2.0
2,100
5

NVIDIA-Nemotron-Nano-9B-v2-AWQ-4bit

1,672
2

Magistral-Small-2509-AWQ-4bit

license:apache-2.0
1,661
1

Qwen3-Omni-30B-A3B-Thinking-AWQ-4bit

1,344
4

Ling-flash-2.0-AWQ-8bit

license:mit
1,206
1

Qwen3-30B-A3B-Instruct-2507-AWQ-8bit

Method: vllm-project/llm-compressor and nvidia/Llama-Nemotron-Post-Training-Dataset were used to quantize the original model. For further quantization arguments and configuration information, please visit config.json and recipe.yaml. Inference: Please install the latest vllm release for better support. Qwen3-30B-A3B-Instruct-2507-AWQ-8bit example usage:

We introduce the updated version of the Qwen3-30B-A3B non-thinking mode, named Qwen3-30B-A3B-Instruct-2507, featuring the following key enhancements:
- Significant improvements in general capabilities, including instruction following, logical reasoning, text comprehension, mathematics, science, coding, and tool usage.
- Substantial gains in long-tail knowledge coverage across multiple languages.
- Markedly better alignment with user preferences in subjective and open-ended tasks, enabling more helpful responses and higher-quality text generation.
- Enhanced capabilities in 256K long-context understanding.

Qwen3-30B-A3B-Instruct-2507 has the following features:
- Type: Causal Language Models
- Training Stage: Pretraining & Post-training
- Number of Parameters: 30.5B in total and 3.3B activated
- Number of Parameters (Non-Embedding): 29.9B
- Number of Layers: 48
- Number of Attention Heads (GQA): 32 for Q and 4 for KV
- Number of Experts: 128
- Number of Activated Experts: 8
- Context Length: 262,144 natively

NOTE: This model supports only non-thinking mode and does not generate `<think></think>` blocks in its output. Meanwhile, specifying `enable_thinking=False` is no longer required. For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our blog, GitHub, and Documentation.
| | Deepseek-V3-0324 | GPT-4o-0327 | Gemini-2.5-Flash Non-Thinking | Qwen3-235B-A22B Non-Thinking | Qwen3-30B-A3B Non-Thinking | Qwen3-30B-A3B-Instruct-2507 |
|---|---|---|---|---|---|---|
| Knowledge | | | | | | |
| MMLU-Pro | 81.2 | 79.8 | 81.1 | 75.2 | 69.1 | 78.4 |
| MMLU-Redux | 90.4 | 91.3 | 90.6 | 89.2 | 84.1 | 89.3 |
| GPQA | 68.4 | 66.9 | 78.3 | 62.9 | 54.8 | 70.4 |
| SuperGPQA | 57.3 | 51.0 | 54.6 | 48.2 | 42.2 | 53.4 |
| Reasoning | | | | | | |
| AIME25 | 46.6 | 26.7 | 61.6 | 24.7 | 21.6 | 61.3 |
| HMMT25 | 27.5 | 7.9 | 45.8 | 10.0 | 12.0 | 43.0 |
| ZebraLogic | 83.4 | 52.6 | 57.9 | 37.7 | 33.2 | 90.0 |
| LiveBench 20241125 | 66.9 | 63.7 | 69.1 | 62.5 | 59.4 | 69.0 |
| Coding | | | | | | |
| LiveCodeBench v6 (25.02-25.05) | 45.2 | 35.8 | 40.1 | 32.9 | 29.0 | 43.2 |
| MultiPL-E | 82.2 | 82.7 | 77.7 | 79.3 | 74.6 | 83.8 |
| Aider-Polyglot | 55.1 | 45.3 | 44.0 | 59.6 | 24.4 | 35.6 |
| Alignment | | | | | | |
| IFEval | 82.3 | 83.9 | 84.3 | 83.2 | 83.7 | 84.7 |
| Arena-Hard v2 | 45.6 | 61.9 | 58.3 | 52.0 | 24.8 | 69.0 |
| Creative Writing v3 | 81.6 | 84.9 | 84.6 | 80.4 | 68.1 | 86.0 |
| WritingBench | 74.5 | 75.5 | 80.5 | 77.0 | 72.2 | 85.5 |
| Agent | | | | | | |
| BFCL-v3 | 64.7 | 66.5 | 66.1 | 68.0 | 58.6 | 65.1 |
| TAU1-Retail | 49.6 | 60.3# | 65.2 | 65.2 | 38.3 | 59.1 |
| TAU1-Airline | 32.0 | 42.8# | 48.0 | 32.0 | 18.0 | 40.0 |
| TAU2-Retail | 71.1 | 66.7# | 64.3 | 64.9 | 31.6 | 57.0 |
| TAU2-Airline | 36.0 | 42.0# | 42.5 | 36.0 | 18.0 | 38.0 |
| TAU2-Telecom | 34.0 | 29.8# | 16.9 | 24.6 | 18.4 | 12.3 |
| Multilingualism | | | | | | |
| MultiIF | 66.5 | 70.4 | 69.4 | 70.2 | 70.8 | 67.9 |
| MMLU-ProX | 75.8 | 76.2 | 78.3 | 73.2 | 65.1 | 72.0 |
| INCLUDE | 80.1 | 82.1 | 83.8 | 75.6 | 67.8 | 71.9 |
| PolyMATH | 32.2 | 25.5 | 41.9 | 27.0 | 23.3 | 43.1 |

*: For reproducibility, we report the win rates evaluated by GPT-4.1.
\#: Results were generated using GPT-4o-20241120, as access to the native function calling API of GPT-4o-0327 was unavailable.

The code of Qwen3-MoE is in the latest Hugging Face `transformers`, and we advise you to use the latest version of `transformers`. With `sglang>=0.4.6.post1` or `vllm>=0.8.5`, you can create an OpenAI-compatible API endpoint: - SGLang: Note: If you encounter out-of-memory (OOM) issues, consider reducing the context length to a shorter value, such as `32,768`. For local use, applications such as Ollama, LMStudio, MLX-LM, llama.cpp, and KTransformers have also supported Qwen3.

Qwen3 excels in tool-calling capabilities. We recommend using Qwen-Agent to make the best use of the agentic ability of Qwen3. Qwen-Agent encapsulates tool-calling templates and tool-calling parsers internally, greatly reducing coding complexity. To define the available tools, you can use the MCP configuration file, use the integrated tools of Qwen-Agent, or integrate other tools yourself.

To achieve optimal performance, we recommend the following settings:
1. Sampling Parameters:
   - We suggest using `Temperature=0.7`, `TopP=0.8`, `TopK=20`, and `MinP=0`.
   - For supported frameworks, you can adjust the `presence_penalty` parameter between 0 and 2 to reduce endless repetitions. However, using a higher value may occasionally result in language mixing and a slight decrease in model performance.
2. Adequate Output Length: We recommend using an output length of 16,384 tokens for most queries, which is adequate for instruct models.
3. Standardize Output Format: We recommend using prompts to standardize model outputs when benchmarking.
   - Math Problems: Include "Please reason step by step, and put your final answer within \boxed{}." in the prompt.
   - Multiple-Choice Questions: Add the following JSON structure to the prompt to standardize responses: "Please show your choice in the `answer` field with only the choice letter, e.g., `"answer": "C"`."
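The recommended generation settings above, written out as keyword arguments. The names follow common vLLM/transformers conventions; this is a sketch, not the card's verbatim config.

```python
# Recommended sampling settings from the card, as generation kwargs.
sampling_kwargs = dict(
    temperature=0.7,
    top_p=0.8,
    top_k=20,
    min_p=0.0,
    presence_penalty=1.0,  # 0-2; higher reduces repetition but may mix languages
    max_tokens=16384,      # adequate output length for most queries
)
```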
If you find our work helpful, feel free to give us a cite.

license:apache-2.0
886
2

Qwen3-VL-8B-Thinking-AWQ-8bit

Meet Qwen3-VL — the most powerful vision-language model in the Qwen series to date. This generation delivers comprehensive upgrades across the board: superior text understanding & generation, deeper visual perception & reasoning, extended context length, enhanced spatial and video dynamics comprehension, and stronger agent interaction capabilities. Available in Dense and MoE architectures that scale from edge to cloud, with Instruct and reasoning‑enhanced Thinking editions for flexible, on‑demand deployment.

- Visual Agent: Operates PC/mobile GUIs—recognizes elements, understands functions, invokes tools, completes tasks.
- Visual Coding Boost: Generates Draw.io/HTML/CSS/JS from images/videos.
- Advanced Spatial Perception: Judges object positions, viewpoints, and occlusions; provides stronger 2D grounding and enables 3D grounding for spatial reasoning and embodied AI.
- Long Context & Video Understanding: Native 256K context, expandable to 1M; handles books and hours-long video with full recall and second-level indexing.
- Enhanced Multimodal Reasoning: Excels in STEM/Math—causal analysis and logical, evidence-based answers.
- Upgraded Visual Recognition: Broader, higher-quality pretraining lets it "recognize everything"—celebrities, anime, products, landmarks, flora/fauna, etc.
- Expanded OCR: Supports 32 languages (up from 19); robust in low light, blur, and tilt; better with rare/ancient characters and jargon; improved long-document structure parsing.
- Text Understanding on par with pure LLMs: Seamless text–vision fusion for lossless, unified comprehension.

1. Interleaved-MRoPE: Full‑frequency allocation over time, width, and height via robust positional embeddings, enhancing long‑horizon video reasoning.
2. DeepStack: Fuses multi‑level ViT features to capture fine‑grained details and sharpen image–text alignment.
3. Text–Timestamp Alignment: Moves beyond T‑RoPE to precise, timestamp‑grounded event localization for stronger video temporal modeling.
This is the weight repository for Qwen3-VL-8B-Thinking. Below, we provide simple examples showing how to use Qwen3-VL with 🤖 ModelScope and 🤗 Transformers. The code for Qwen3-VL is available in the latest Hugging Face `transformers`, and we advise you to build from source. If you find our work helpful, feel free to cite us.
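The card's original chat snippet did not survive extraction. As a hedged sketch, this is the multimodal message structure commonly passed to Qwen-VL chat processors; the image URL is a placeholder, and loading the actual model/processor via `transformers` is omitted here:

```python
# Sketch of the multimodal chat-message format used with Qwen-VL models.
# The image URL is a hypothetical placeholder; model/processor loading
# (and the resulting generation call) is intentionally omitted.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://example.com/demo.jpeg"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
```

The processor's chat template consumes this structure, pairing each image entry with the text that follows it before generation.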

NaNK
license:apache-2.0
844
2

Qwen3-4B-Thinking-2507-AWQ-4bit

NaNK
license:apache-2.0
799
1

Qwen3-Coder-30B-A3B-Instruct-GPTQ-4bit

NaNK
license:apache-2.0
780
5

GLM-4.5V-AWQ-4bit

NaNK
license:mit
736
3

Qwen3-Omni-30B-A3B-Instruct-AWQ-8bit

NaNK
726
3

Qwen3-Coder-30B-A3B-Instruct-AWQ-8bit

Method: vllm-project/llm-compressor and nvidia/Llama-Nemotron-Post-Training-Dataset were used to quantize the original model. For further details on quantization arguments and configuration, please see config.json and recipe.yaml.

Inference: please install the latest vllm release for better support.

Qwen3-Coder-30B-A3B-Instruct-AWQ-8bit example usage: Qwen3-Coder is available in multiple sizes. Today, we're excited to introduce Qwen3-Coder-30B-A3B-Instruct. This streamlined model maintains impressive performance and efficiency, featuring the following key enhancements:
- Significant performance among open models on Agentic Coding, Agentic Browser-Use, and other foundational coding tasks.
- Long-context capabilities with native support for 256K tokens, extendable up to 1M tokens using YaRN, optimized for repository-scale understanding.
- Agentic coding support for most platforms such as Qwen Code and CLINE, featuring a specially designed function call format.

Qwen3-Coder-30B-A3B-Instruct has the following features:
- Type: Causal Language Models
- Training Stage: Pretraining & Post-training
- Number of Parameters: 30.5B in total and 3.3B activated
- Number of Layers: 48
- Number of Attention Heads (GQA): 32 for Q and 4 for KV
- Number of Experts: 128
- Number of Activated Experts: 8
- Context Length: 262,144 natively

NOTE: This model supports only non-thinking mode and does not generate `<think></think>` blocks in its output. Meanwhile, specifying `enable_thinking=False` is no longer required. For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our blog, GitHub, and Documentation. We advise you to use the latest version of `transformers`.
```python
from openai import OpenAI

# Define tools
tools = [
    {
        "type": "function",
        "function": {
            "name": "square_the_number",
            "description": "Output the square of the number.",
            "parameters": {
                "type": "object",
                "required": ["input_num"],
                "properties": {
                    "input_num": {
                        "type": "number",
                        "description": "input_num is a number that will be squared",
                    }
                },
            },
        },
    }
]

# Define LLM client
client = OpenAI(
    # Use a custom endpoint compatible with the OpenAI API
    base_url="http://localhost:8000/v1",  # api_base
    api_key="EMPTY",
)

messages = [{"role": "user", "content": "square the number 1024"}]

completion = client.chat.completions.create(
    messages=messages,
    model="Qwen3-Coder-30B-A3B-Instruct",
    max_tokens=65536,
    tools=tools,
)
```

```bibtex
@misc{qwen3technicalreport,
      title={Qwen3 Technical Report},
      author={Qwen Team},
      year={2025},
      eprint={2505.09388},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.09388}
}
```
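The completion returned by the server only declares which tool to call; executing it is up to your own code. A minimal sketch of that dispatch step (the local `square_the_number` implementation and the example payload are assumptions for illustration, not part of the original card):

```python
import json

def square_the_number(input_num):
    """Local implementation of the tool declared in the schema above."""
    return input_num ** 2

# Registry mapping tool names to local callables.
TOOLS = {"square_the_number": square_the_number}

def dispatch_tool_call(tool_call):
    """Execute one tool call shaped like completion.choices[0].message.tool_calls."""
    fn = TOOLS[tool_call["function"]["name"]]
    args = json.loads(tool_call["function"]["arguments"])
    return fn(**args)

# Example payload shaped like the OpenAI API would return it:
call = {"function": {"name": "square_the_number",
                     "arguments": json.dumps({"input_num": 1024})}}
print(dispatch_tool_call(call))  # 1048576
```

The result would then be appended to `messages` as a `"tool"` role message and sent back to the model for a final answer.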

NaNK
license:apache-2.0
667
1

Magistral-Small-2507-AWQ-4bit

NaNK
license:apache-2.0
654
0

Qwen3-VL-4B-Thinking-AWQ-4bit

NaNK
license:apache-2.0
595
0

granite-4.0-h-tiny-AWQ-4bit

- Quantization method: AWQ
- Bits: 4
- Group size: 32
- Calibration dataset: nvidia/Llama-Nemotron-Post-Training-Dataset
- Quantization tool: llm-compressor
- The model cannot be loaded with tensor parallelism or pipeline parallelism.

Model Summary: Granite-4.0-H-Tiny is a 7B-parameter long-context instruct model finetuned from Granite-4.0-H-Tiny-Base using a combination of open-source instruction datasets with permissive licenses and internally collected synthetic datasets. This model is developed using a diverse set of techniques with a structured chat format, including supervised finetuning, model alignment using reinforcement learning, and model merging. Granite 4.0 instruct models feature improved instruction following (IF) and tool-calling capabilities, making them more effective in enterprise applications.

- Developers: Granite Team, IBM
- HF Collection: Granite 4.0 Language Models HF Collection
- GitHub Repository: ibm-granite/granite-4.0-language-models
- Website: Granite Docs
- Release Date: October 2nd, 2025
- License: Apache 2.0

Supported Languages: English, German, Spanish, French, Japanese, Portuguese, Arabic, Czech, Italian, Korean, Dutch, and Chinese. Users may finetune Granite 4.0 models for languages beyond these.

Intended Use: the model is designed to respond to general instructions and can be used to build AI assistants for multiple domains, including business applications.

Capabilities: summarization, text classification, text extraction, question-answering, retrieval-augmented generation (RAG), code-related tasks, function-calling tasks, multilingual dialog use cases, and fill-in-the-middle (FIM) code completions.

Generation: this is a simple example of how to use the Granite-4.0-H-Tiny model; copy the snippet from the section that is relevant for your use case.
Tool-calling: Granite-4.0-H-Tiny comes with enhanced tool-calling capabilities, enabling seamless integration with external functions and APIs. To define a list of tools, please follow OpenAI's function definition schema.

Benchmarks cover four variants: Micro Dense, H Micro Dense, H Tiny MoE, and H Small MoE. Multilingual benchmarks and the included languages: MMMLU (11): ar, de, en, es, fr, ja, ko, pt, zh, bn, hi; INCLUDE (14): hi, bn, ta, te, ar, de, es, fr, it, ja, ko, nl, pt, zh.

Model Architecture: the Granite-4.0-H-Tiny baseline is built on a decoder-only MoE transformer architecture. Core components of this architecture are: GQA, Mamba2, MoEs with shared experts, SwiGLU activation, RMSNorm, and shared input/output embeddings.

| Model | Micro Dense | H Micro Dense | H Tiny MoE | H Small MoE |
| --- | --- | --- | --- | --- |
| Number of layers | 40 attention | 4 attention / 36 Mamba2 | 4 attention / 36 Mamba2 | 4 attention / 36 Mamba2 |
| MLP / Shared expert hidden size | 8192 | 8192 | 1024 | 1536 |

Training Data: overall, our SFT data is largely comprised of three key sources: (1) publicly available datasets with permissive licenses, (2) internal synthetic data targeting specific capabilities, and (3) a select set of human-curated data.

Infrastructure: we trained the Granite 4.0 language models on an NVIDIA GB200 NVL72 cluster hosted in CoreWeave. Intra-rack communication occurs via the 72-GPU NVLink domain, and a non-blocking, full Fat-Tree NDR 400 Gb/s InfiniBand network provides inter-rack communication. This cluster provides a scalable and efficient infrastructure for training our models over thousands of GPUs.

Ethical Considerations and Limitations: Granite 4.0 instruct models are primarily finetuned using instruction-response pairs mostly in English, but also multilingual data covering multiple languages.
Although this model can handle multilingual dialog use cases, its performance might not match that on English tasks. In such cases, introducing a small number of examples (few-shot) can help the model generate more accurate outputs. While this model has been aligned with safety in mind, it may in some cases produce inaccurate, biased, or unsafe responses to user prompts, so we urge the community to use it with proper safety testing and tuning tailored to their specific tasks.

Resources
- ⭐️ Learn about the latest updates with Granite: https://www.ibm.com/granite
- 📄 Get started with tutorials, best practices, and prompt engineering advice: https://www.ibm.com/granite/docs/
- 💡 Learn about the latest Granite learning resources: https://ibm.biz/granite-learning-resources
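The Granite card above references a tool-calling example that follows OpenAI's function definition schema. A minimal reconstruction of what such a tool list looks like (the `get_current_weather` function is a hypothetical placeholder, not from the original card):

```python
# Hypothetical tool definition following OpenAI's function schema,
# as referenced by the Granite-4.0-H-Tiny tool-calling section.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather for a given city.",
            "parameters": {
                "type": "object",
                "required": ["city"],
                "properties": {
                    "city": {
                        "type": "string",
                        "description": "Name of the city",
                    }
                },
            },
        },
    }
]
```

A list shaped like this is passed via the `tools` argument of an OpenAI-compatible chat completion request, or rendered into the prompt by the model's chat template.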

NaNK
license:apache-2.0
580
1

GLM-4.5V-AWQ-8bit

NaNK
license:mit
551
2

Apriel-1.5-15b-Thinker-AWQ-8bit

Apriel-1.5-15b-Thinker - Mid training is all you need! 1. Summary 2. Evaluation 3. Training Details 4. How to Use 5. Intended Use 6. Limitations 7. Security and Responsible Use 8. Software 9. License 10. Acknowledgements 11. Citation

Click here to skip to the technical report -> https://huggingface.co/ServiceNow-AI/Apriel-1.5-15b-Thinker/blob/main/Apriel-1.5-Thinker.pdf

Apriel-1.5-15b-Thinker is a multimodal reasoning model in ServiceNow’s Apriel SLM series which achieves competitive performance against models 10 times its size. Apriel-1.5 is the second model in the reasoning series. It introduces enhanced textual reasoning capabilities and adds image reasoning support to the previous text model. It has undergone extensive continual pretraining across both text and image domains. In terms of post-training, this model has undergone text-SFT only. Our research demonstrates that with a strong mid-training regimen, we are able to achieve SOTA performance on text and image reasoning tasks without any image SFT training or RL.

Highlights
- Achieves a score of 52 on the Artificial Analysis index and is competitive with Deepseek R1 0528, Gemini-Flash, etc.
- It is AT LEAST 1/10 the size of any other model that scores above 50 on the Artificial Analysis index.
- Scores 68 on Tau2 Bench Telecom and 62 on IFBench, which are key benchmarks for the enterprise domain.
- At 15B parameters, the model fits on a single GPU, making it highly memory-efficient.
- For text benchmarks, we report evaluations performed by a third party, Artificial Analysis.
- For image benchmarks, we report evaluations obtained by https://github.com/open-compass/VLMEvalKit Mid training / Continual Pre‑training In this stage, the model is trained on billions of tokens of carefully curated textual samples drawn from mathematical reasoning, coding challenges, scientific discourse, logical puzzles, and diverse knowledge-rich texts along with multimodal samples covering image understanding and reasoning, captioning, and interleaved image-text data. The objective is to strengthen foundational reasoning capabilities of the model. This stage is critical for the model to function as a reasoner and provides significant lifts in reasoning benchmarks. Supervised Fine‑Tuning (SFT) The model is fine-tuned on over 2M high-quality text samples spanning mathematical and scientific problem-solving, coding tasks, instruction-following, API/function invocation, and conversational use cases. This results in superior text performance comparable to models such as Deepseek R1 0528 and Gemini-Flash. Although no image-specific fine-tuning is performed, the model’s inherent multimodal capabilities and cross-modal transfer of reasoning behavior from the text SFT yield competitive image performance relative to other leading open-source VL models. As the upstream PR is not yet merged, you can use this custom image as an alternate way to run the model with tool and reasoning parsers enabled. This will start the vLLM OpenAI-compatible API server serving the Apriel-1.5-15B-Thinker model with Apriel’s custom tool parser and reasoning parser. Here is a code snippet demonstrating the model's usage with the transformers library's generate function: The model will first generate its thinking process and then generate its final response between `[BEGIN FINAL RESPONSE]` and `[END FINAL RESPONSE]`. Here is a code snippet demonstrating the application of the chat template: Usage Guidelines 1. Use the model’s default chat template, which already includes a system prompt. 
We recommend adding all other instructions within the user message. 2. We recommend setting temperature to `0.6`. 3. We ensure the model starts with `Here are my reasoning steps:\n` during all our evaluations. This is implemented in the default chat template. The Apriel family of models are designed for a variety of general-purpose instruction tasks, including: - Code assistance and generation - Logical reasoning and multi-step tasks - Question answering and information retrieval - Function calling, complex instruction following and agent use cases They are not intended for use in safety-critical applications without human oversight or in scenarios requiring guaranteed factual accuracy. - Factual accuracy: May produce incorrect, misleading, or outdated content. Outputs should be verified before use in critical contexts. - Bias: May reflect societal, cultural, or systemic biases present in training data. - Ethics: Do not use the model to produce harmful, unlawful, or unethical content. - Language: Strongest performance is in English. Output quality may degrade in underrepresented languages. - Critical use: Not suitable for medical, legal, financial, or other high-risk applications without safeguards. Security Responsibilities: Deployers and users are strongly encouraged to align their security practices with established frameworks and regulatory guidelines such as the EU AI Act and the NIST AI Risk Management Framework (RMF). - Regularly conduct robustness assessments to identify and mitigate adversarial inputs. - Implement validation and filtering processes to prevent harmful or biased outputs. - Continuously perform data privacy checks to guard against unintended data leaks. - Document and communicate the model's limitations, intended usage, and known security risks to all end-users. - Schedule periodic security reviews and updates to address emerging threats and vulnerabilities. - Follow established security policies and usage guidelines provided by deployers. 
- Protect and manage sensitive information when interacting with the model. - Report anomalies, suspicious behavior, or unsafe outputs to deployers or developers. - Maintain human oversight and apply judgment to mitigate potential security or ethical risks during interactions. Disclaimer: Users accept responsibility for securely deploying, managing, and using this open-source LLM. The model is provided "as-is," without explicit or implied warranty regarding security or fitness for any specific application or environment.
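As the Apriel card notes, the model emits its reasoning first and then places the final answer between `[BEGIN FINAL RESPONSE]` and `[END FINAL RESPONSE]`. A minimal sketch of extracting that final response (the helper name and sample text are assumptions for illustration):

```python
import re

def extract_final_response(text):
    """Return the text between Apriel's final-response markers, or None."""
    m = re.search(
        r"\[BEGIN FINAL RESPONSE\](.*?)\[END FINAL RESPONSE\]",
        text,
        flags=re.DOTALL,
    )
    return m.group(1).strip() if m else None

# Hypothetical model output: reasoning steps followed by the marked answer.
sample = (
    "Here are my reasoning steps:\nSquaring 32 gives 1024.\n"
    "[BEGIN FINAL RESPONSE]1024[END FINAL RESPONSE]"
)
print(extract_final_response(sample))  # 1024
```

Keeping the reasoning and the final response separate like this makes it easy to log the chain of thought while showing users only the answer.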

NaNK
license:mit
549
1

Apriel-1.5-15b-Thinker-AWQ-4bit

Apriel-1.5-15b-Thinker - Mid training is all you need! 1. Summary 2. Evaluation 3. Training Details 4. How to Use 5. Intended Use 6. Limitations 7. Security and Responsible Use 8. Software 9. License 10. Acknowledgements 11. Citation

Click here to skip to the technical report -> https://huggingface.co/ServiceNow-AI/Apriel-1.5-15b-Thinker/blob/main/Apriel-1.5-Thinker.pdf

Apriel-1.5-15b-Thinker is a multimodal reasoning model in ServiceNow’s Apriel SLM series which achieves competitive performance against models 10 times its size. Apriel-1.5 is the second model in the reasoning series. It introduces enhanced textual reasoning capabilities and adds image reasoning support to the previous text model. It has undergone extensive continual pretraining across both text and image domains. In terms of post-training, this model has undergone text-SFT only. Our research demonstrates that with a strong mid-training regimen, we are able to achieve SOTA performance on text and image reasoning tasks without any image SFT training or RL.

Highlights
- Achieves a score of 52 on the Artificial Analysis index and is competitive with Deepseek R1 0528, Gemini-Flash, etc.
- It is AT LEAST 1/10 the size of any other model that scores above 50 on the Artificial Analysis index.
- Scores 68 on Tau2 Bench Telecom and 62 on IFBench, which are key benchmarks for the enterprise domain.
- At 15B parameters, the model fits on a single GPU, making it highly memory-efficient.
- For text benchmarks, we report evaluations performed by a third party, Artificial Analysis.
- For image benchmarks, we report evaluations obtained by https://github.com/open-compass/VLMEvalKit Mid training / Continual Pre‑training In this stage, the model is trained on billions of tokens of carefully curated textual samples drawn from mathematical reasoning, coding challenges, scientific discourse, logical puzzles, and diverse knowledge-rich texts along with multimodal samples covering image understanding and reasoning, captioning, and interleaved image-text data. The objective is to strengthen foundational reasoning capabilities of the model. This stage is critical for the model to function as a reasoner and provides significant lifts in reasoning benchmarks. Supervised Fine‑Tuning (SFT) The model is fine-tuned on over 2M high-quality text samples spanning mathematical and scientific problem-solving, coding tasks, instruction-following, API/function invocation, and conversational use cases. This results in superior text performance comparable to models such as Deepseek R1 0528 and Gemini-Flash. Although no image-specific fine-tuning is performed, the model’s inherent multimodal capabilities and cross-modal transfer of reasoning behavior from the text SFT yield competitive image performance relative to other leading open-source VL models. As the upstream PR is not yet merged, you can use this custom image as an alternate way to run the model with tool and reasoning parsers enabled. This will start the vLLM OpenAI-compatible API server serving the Apriel-1.5-15B-Thinker model with Apriel’s custom tool parser and reasoning parser. Here is a code snippet demonstrating the model's usage with the transformers library's generate function: The model will first generate its thinking process and then generate its final response between `[BEGIN FINAL RESPONSE]` and `[END FINAL RESPONSE]`. Here is a code snippet demonstrating the application of the chat template: Usage Guidelines 1. Use the model’s default chat template, which already includes a system prompt. 
We recommend adding all other instructions within the user message. 2. We recommend setting temperature to `0.6`. 3. We ensure the model starts with `Here are my reasoning steps:\n` during all our evaluations. This is implemented in the default chat template. The Apriel family of models are designed for a variety of general-purpose instruction tasks, including: - Code assistance and generation - Logical reasoning and multi-step tasks - Question answering and information retrieval - Function calling, complex instruction following and agent use cases They are not intended for use in safety-critical applications without human oversight or in scenarios requiring guaranteed factual accuracy. - Factual accuracy: May produce incorrect, misleading, or outdated content. Outputs should be verified before use in critical contexts. - Bias: May reflect societal, cultural, or systemic biases present in training data. - Ethics: Do not use the model to produce harmful, unlawful, or unethical content. - Language: Strongest performance is in English. Output quality may degrade in underrepresented languages. - Critical use: Not suitable for medical, legal, financial, or other high-risk applications without safeguards. Security Responsibilities: Deployers and users are strongly encouraged to align their security practices with established frameworks and regulatory guidelines such as the EU AI Act and the NIST AI Risk Management Framework (RMF). - Regularly conduct robustness assessments to identify and mitigate adversarial inputs. - Implement validation and filtering processes to prevent harmful or biased outputs. - Continuously perform data privacy checks to guard against unintended data leaks. - Document and communicate the model's limitations, intended usage, and known security risks to all end-users. - Schedule periodic security reviews and updates to address emerging threats and vulnerabilities. - Follow established security policies and usage guidelines provided by deployers. 
- Protect and manage sensitive information when interacting with the model. - Report anomalies, suspicious behavior, or unsafe outputs to deployers or developers. - Maintain human oversight and apply judgment to mitigate potential security or ethical risks during interactions. Disclaimer: Users accept responsibility for securely deploying, managing, and using this open-source LLM. The model is provided "as-is," without explicit or implied warranty regarding security or fitness for any specific application or environment.

NaNK
license:mit
517
2

GLM-4.5-Air-GPTQ-4bit

Method: quantised using vllm-project/llm-compressor, nvidia/Llama-Nemotron-Post-Training-Dataset, and the following configs. Note: the last layer, i.e. the MTP layer (index 46), is ignored because transformers does not implement MTP.

Inference: please load the model into vLLM or SGLang as float16 for AWQ support and set `tensor_parallel_size` accordingly.

📍 Use GLM-4.5 API services on Z.ai API Platform (Global) or Zhipu AI Open Platform (Mainland China).

Model Introduction: The GLM-4.5 series models are foundation models designed for intelligent agents. GLM-4.5 has 355 billion total parameters with 32 billion active parameters, while GLM-4.5-Air adopts a more compact design with 106 billion total parameters and 12 billion active parameters. GLM-4.5 models unify reasoning, coding, and intelligent agent capabilities to meet the complex demands of intelligent agent applications. Both GLM-4.5 and GLM-4.5-Air are hybrid reasoning models that provide two modes: thinking mode for complex reasoning and tool usage, and non-thinking mode for immediate responses. We have open-sourced the base models, hybrid reasoning models, and FP8 versions of the hybrid reasoning models for both GLM-4.5 and GLM-4.5-Air. They are released under the MIT open-source license and can be used commercially and for secondary development. As demonstrated in our comprehensive evaluation across 12 industry-standard benchmarks, GLM-4.5 achieves exceptional performance with a score of 63.2, placing 3rd among all proprietary and open-source models. Notably, GLM-4.5-Air delivers competitive results at 59.8 while maintaining superior efficiency. For more eval results, showcases, and technical details, please visit our technical blog. The technical report will be released soon. The model code, tool parser, and reasoning parser can be found in the implementations of transformers, vLLM, and SGLang.

NaNK
license:mit
491
3

Kimi-Dev-72B-AWQ-4bit

NaNK
license:mit
488
2

GLM-4.5-Air-AWQ-8bit

Inference: please load the model into vLLM or SGLang as float16 for AWQ support and set `tensor_parallel_size` accordingly.

📍 Use GLM-4.5 API services on Z.ai API Platform (Global) or Zhipu AI Open Platform (Mainland China).

Model Introduction: The GLM-4.5 series models are foundation models designed for intelligent agents. GLM-4.5 has 355 billion total parameters with 32 billion active parameters, while GLM-4.5-Air adopts a more compact design with 106 billion total parameters and 12 billion active parameters. GLM-4.5 models unify reasoning, coding, and intelligent agent capabilities to meet the complex demands of intelligent agent applications. Both GLM-4.5 and GLM-4.5-Air are hybrid reasoning models that provide two modes: thinking mode for complex reasoning and tool usage, and non-thinking mode for immediate responses. We have open-sourced the base models, hybrid reasoning models, and FP8 versions of the hybrid reasoning models for both GLM-4.5 and GLM-4.5-Air. They are released under the MIT open-source license and can be used commercially and for secondary development. As demonstrated in our comprehensive evaluation across 12 industry-standard benchmarks, GLM-4.5 achieves exceptional performance with a score of 63.2, placing 3rd among all proprietary and open-source models. Notably, GLM-4.5-Air delivers competitive results at 59.8 while maintaining superior efficiency. For more eval results, showcases, and technical details, please visit our technical blog. The technical report will be released soon. The model code, tool parser, and reasoning parser can be found in the implementations of transformers, vLLM, and SGLang.

NaNK
license:mit
454
2

KAT-Dev-AWQ-4bit

Highlights: KAT-Dev-32B is an open-source 32B-parameter model for software engineering tasks. On SWE-Bench Verified, KAT-Dev-32B achieves comparable performance with 62.4% resolved and ranks 5th among all open-source models across different scales. KAT-Dev-32B is optimized via several stages of training, including a mid-training stage, a supervised fine-tuning (SFT) & reinforcement fine-tuning (RFT) stage, and a large-scale agentic reinforcement learning (RL) stage. In summary, our contributions include:

1. Mid-Training. We observe that adding extensive training for tool-use capability, multi-turn interaction, and instruction-following at this stage may not yield large performance gains in the current results (e.g., on leaderboards like SWE-bench). However, since our experiments are based on the Qwen3-32B model, we find that enhancing these foundational capabilities has a significant impact on the subsequent SFT and RL stages. This suggests that improving such core abilities can profoundly influence the model’s capacity to handle more complex tasks.

2. SFT & RFT. We meticulously curated eight task types and eight programming scenarios during the SFT stage to ensure the model’s generalization and comprehensive capabilities. Moreover, before RL, we innovatively introduced an RFT stage. Compared with traditional RL, we incorporate “teacher trajectories” annotated by human engineers as guidance during training, much like a learner driver assisted by an experienced co-driver before driving solo after getting a license. This step not only boosts model performance but also further stabilizes the subsequent RL training.

3. Agentic RL Scaling. Scaling agentic RL hinges on three challenges: efficient learning over nonlinear trajectory histories, leveraging intrinsic model signals, and building scalable high-throughput infrastructure.
We address these with a multi-level prefix caching mechanism in the RL training engine, an entropy-based trajectory pruning technique, and an inner implementation of SeamlessFlow[1] architecture that cleanly decouples agents from training while exploiting heterogeneous compute. These innovations together cut scaling costs and enable efficient large-scale RL. For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our blog. claude-code-router is a third-party routing utility that allows Claude Code to flexibly switch between different backend APIs. On the dashScope platform, you can install the claude-code-config extension package, which automatically generates a default configuration for `claude-code-router` with built-in dashScope support. Once the configuration files and plugin directory are generated, the environment required by `ccr` will be ready. If needed, you can still manually edit `~/.claude-code-router/config.json` and the files under `~/.claude-code-router/plugins/` to customize the setup. Finally, simply start `ccr` to run Claude Code and seamlessly connect it with the powerful coding capabilities of KAT-Dev-32B. Happy coding!

NaNK
430
0

Hermes-4-70B-AWQ-4bit

NaNK
llama
426
5

Qwopus3.5-27B-v3-AWQ-INT8-INT4

NaNK
license:apache-2.0
419
0

Qwen3-VL-4B-Instruct-AWQ-8bit

NaNK
license:apache-2.0
395
1

Qwen3-Next-80B-A3B-Thinking-AWQ-8bit

NaNK
license:apache-2.0
391
3

KAT-Dev-72B-Exp-AWQ-4bit

🔥 We’re thrilled to announce the release of KAT-Dev-72B-Exp, our latest and most powerful model yet! 🔥 You can now try our strongest proprietary coder model KAT-Coder directly on the StreamLake platform for free. Highlights KAT-Dev-72B-Exp is an open-source 72B-parameter model for software engineering tasks. On SWE-Bench Verified, KAT-Dev-72B-Exp achieves 74.6% accuracy ⚡ — when evaluated strictly with the SWE-agent scaffold. KAT-Dev-72B-Exp is the experimental reinforcement-learning version of the KAT-Coder model. Through this open-source release, we aim to reveal the technical innovations behind KAT-Coder’s large-scale RL to developers and researchers. We rewrote the attention kernel and redesigned the training engine for shared prefix trajectories to achieve highly efficient RL training, especially for scaffolds leveraging context management. Furthermore, to prevent exploration collapse observed in RL training, we reshaped advantage distribution based on pass rates: amplifying the advantage scale of highly exploratory groups while reducing that of low-exploration ones.

NaNK
license:apache-2.0
325
2

Qwen3-30B-A3B-Thinking-2507-AWQ-8bit

Method: vllm-project/llm-compressor and nvidia/Llama-Nemotron-Post-Training-Dataset were used to quantize the original model. For further details on quantization arguments and configuration, please see config.json and recipe.yaml.

Inference: please install the latest vllm release for better support.

Qwen3-30B-A3B-Thinking-2507-AWQ-8bit example usage: Over the past three months, we have continued to scale the thinking capability of Qwen3-30B-A3B, improving both the quality and depth of reasoning. We are pleased to introduce Qwen3-30B-A3B-Thinking-2507, featuring the following key enhancements:
- Significantly improved performance on reasoning tasks, including logical reasoning, mathematics, science, coding, and academic benchmarks that typically require human expertise.
- Markedly better general capabilities, such as instruction following, tool usage, text generation, and alignment with human preferences.
- Enhanced 256K long-context understanding capabilities.

NOTE: This version has an increased thinking length. We strongly recommend its use in highly complex reasoning tasks.

Qwen3-30B-A3B-Thinking-2507 has the following features:
- Type: Causal Language Models
- Training Stage: Pretraining & Post-training
- Number of Parameters: 30.5B in total and 3.3B activated
- Number of Parameters (Non-Embedding): 29.9B
- Number of Layers: 48
- Number of Attention Heads (GQA): 32 for Q and 4 for KV
- Number of Experts: 128
- Number of Activated Experts: 8
- Context Length: 262,144 natively

NOTE: This model supports only thinking mode. Meanwhile, specifying `enable_thinking=True` is no longer required. Additionally, to enforce model thinking, the default chat template automatically includes `<think>`. Therefore, it is normal for the model's output to contain only `</think>` without an explicit opening `<think>` tag. For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our blog, GitHub, and Documentation.
| | Gemini2.5-Flash-Thinking | Qwen3-235B-A22B Thinking | Qwen3-30B-A3B Thinking | Qwen3-30B-A3B-Thinking-2507 |
| --- | --- | --- | --- | --- |
| Knowledge | | | | |
| MMLU-Pro | 81.9 | 82.8 | 78.5 | 80.9 |
| MMLU-Redux | 92.1 | 92.7 | 89.5 | 91.4 |
| GPQA | 82.8 | 71.1 | 65.8 | 73.4 |
| SuperGPQA | 57.8 | 60.7 | 51.8 | 56.8 |
| Reasoning | | | | |
| AIME25 | 72.0 | 81.5 | 70.9 | 85.0 |
| HMMT25 | 64.2 | 62.5 | 49.8 | 71.4 |
| LiveBench 20241125 | 74.3 | 77.1 | 74.3 | 76.8 |
| Coding | | | | |
| LiveCodeBench v6 (25.02-25.05) | 61.2 | 55.7 | 57.4 | 66.0 |
| CFEval | 1995 | 2056 | 1940 | 2044 |
| OJBench | 23.5 | 25.6 | 20.7 | 25.1 |
| Alignment | | | | |
| IFEval | 89.8 | 83.4 | 86.5 | 88.9 |
| Arena-Hard v2$ | 56.7 | 61.5 | 36.3 | 56.0 |
| Creative Writing v3 | 85.0 | 84.6 | 79.1 | 84.4 |
| WritingBench | 83.9 | 80.3 | 77.0 | 85.0 |
| Agent | | | | |
| BFCL-v3 | 68.6 | 70.8 | 69.1 | 72.4 |
| TAU1-Retail | 65.2 | 54.8 | 61.7 | 67.8 |
| TAU1-Airline | 54.0 | 26.0 | 32.0 | 48.0 |
| TAU2-Retail | 66.7 | 40.4 | 34.2 | 58.8 |
| TAU2-Airline | 52.0 | 30.0 | 36.0 | 58.0 |
| TAU2-Telecom | 31.6 | 21.9 | 22.8 | 26.3 |
| Multilingualism | | | | |
| MultiIF | 74.4 | 71.9 | 72.2 | 76.4 |
| MMLU-ProX | 80.2 | 80.0 | 73.1 | 76.4 |
| INCLUDE | 83.9 | 78.7 | 71.9 | 74.4 |
| PolyMATH | 49.8 | 54.7 | 46.1 | 52.6 |

$ For reproducibility, we report the win rates evaluated by GPT-4.1. \& For highly challenging tasks (including PolyMATH and all reasoning and coding tasks), we use an output length of 81,920 tokens. For all other tasks, we set the output length to 32,768.

The code for Qwen3-MoE is available in the latest Hugging Face `transformers`, and we advise you to use the latest version of `transformers`.
With `transformers`, after generating, the thinking content can be parsed from the output (token id 151668 is the closing `</think>` token):

```python
# parsing thinking content
try:
    # rindex finding 151668 (</think>)
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content)  # no opening <think> tag
print("content:", content)
```

With SGLang:

```shell
python -m sglang.launch_server --model-path Qwen/Qwen3-30B-A3B-Thinking-2507 --context-length 262144 --reasoning-parser deepseek-r1
```

With vLLM:

```shell
vllm serve Qwen/Qwen3-30B-A3B-Thinking-2507 --max-model-len 262144 --enable-reasoning --reasoning-parser deepseek_r1
```

With Qwen-Agent:

```python
from qwen_agent.agents import Assistant

# Define LLM: using Alibaba Cloud Model Studio
llm_cfg = {
    'model': 'qwen3-30b-a3b-thinking-2507',
    'model_type': 'qwen_dashscope',
}
```

Alternatively, use an OpenAI-compatible API endpoint. It is recommended to disable the reasoning and tool-call parsing functionality of the deployment framework and let Qwen-Agent automate the related operations. For example: `VLLM_USE_MODELSCOPE=true vllm serve Qwen/Qwen3-30B-A3B-Thinking-2507 --served-model-name Qwen3-30B-A3B-Thinking-2507 --tensor-parallel-size 8 --max-model-len 262144`.
```python
llm_cfg = {
    'model': 'Qwen3-30B-A3B-Thinking-2507',
    # Use a custom endpoint compatible with OpenAI API:
    'model_server': 'http://localhost:8000/v1',  # api_base without reasoning and tool call parsing
    'api_key': 'EMPTY',
    'generate_cfg': {
        'thought_in_content': True,
    },
}

# Define Tools
tools = [
    {'mcpServers': {  # You can specify the MCP configuration file
        'time': {
            'command': 'uvx',
            'args': ['mcp-server-time', '--local-timezone=Asia/Shanghai']
        },
        'fetch': {
            'command': 'uvx',
            'args': ['mcp-server-fetch']
        }
    }},
    'code_interpreter',  # Built-in tools
]

# Define Agent
bot = Assistant(llm=llm_cfg, function_list=tools)

# Streaming generation
messages = [{'role': 'user', 'content': 'https://qwenlm.github.io/blog/ Introduce the latest developments of Qwen'}]
for responses in bot.run(messages=messages):
    pass
print(responses)
```

```bibtex
@misc{qwen3technicalreport,
      title={Qwen3 Technical Report},
      author={Qwen Team},
      year={2025},
      eprint={2505.09388},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.09388},
}
```

license:apache-2.0

Magistral-Small-2509-AWQ-8bit

Building upon Mistral Small 3.2 (2506) with added reasoning capabilities, obtained through SFT on Magistral Medium traces followed by RL on top, it is a small, efficient reasoning model with 24B parameters. Magistral Small can be deployed locally, fitting within a single RTX 4090 or a 32GB RAM MacBook once quantized.

- Multimodality: The model now has a vision encoder and can take multimodal inputs, extending its reasoning capabilities to vision.
- Performance upgrade: Magistral Small 1.2 should give you significantly better performance than Magistral Small 1.1, as seen in the benchmark results.
- Better tone and persona: You should experience better LaTeX and Markdown formatting, and shorter answers on easy general prompts.
- Finite generation: The model is less likely to enter infinite generation loops.
- Special think tokens: [THINK] and [/THINK] special tokens encapsulate the reasoning content in a thinking chunk. This makes it easier to parse the reasoning trace and prevents confusion when the '[THINK]' token is given as a string in the prompt.
- Reasoning prompt: The reasoning prompt is given in the system prompt.
- Reasoning: Capable of long chains of reasoning traces before providing an answer.
- Multilingual: Supports dozens of languages, including English, French, German, Greek, Hindi, Indonesian, Italian, Japanese, Korean, Malay, Nepali, Polish, Portuguese, Romanian, Russian, Serbian, Spanish, Turkish, Ukrainian, Vietnamese, Arabic, Bengali, Chinese, and Farsi.
- Vision: Vision capabilities enable the model to analyze images and reason based on visual content in addition to text.
- Apache 2.0 License: Open license allowing usage and modification for both commercial and non-commercial purposes.
- Context Window: A 128k context window. Performance might degrade past 40k, but Magistral should still give good results. Hence we recommend leaving the maximum model length at 128k and lowering it only if you encounter low performance.
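Because the reasoning is wrapped in `[THINK]`/`[/THINK]`, the trace is easy to separate from the final answer. A minimal parsing sketch in plain string handling (illustrative only; in production prefer the `mistral-common` tokenizer, which treats these markers as special tokens):

```python
def split_reasoning(text: str) -> tuple[str, str]:
    """Split a Magistral-style response into (reasoning_trace, final_answer).

    Assumes at most one [THINK]...[/THINK] chunk, as produced by the model.
    Hypothetical helper for illustration, not part of mistral-common.
    """
    start, end = "[THINK]", "[/THINK]"
    if start in text and end in text:
        before, rest = text.split(start, 1)
        trace, after = rest.split(end, 1)
        return trace.strip(), (before + after).strip()
    # No thinking chunk: everything is the final answer.
    return "", text.strip()

reasoning, answer = split_reasoning("[THINK]2+2 is 4[/THINK]The answer is 4.")
```

Working on raw strings like this only makes sense once the special tokens have been decoded to text; at the token level, `mistral-common` remains the source of truth.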
| Model | AIME24 pass@1 | AIME25 pass@1 | GPQA Diamond | Livecodebench (v5) |
|--------------------------|---------------|---------------|--------------|--------------------|
| Magistral Medium 1.2 | 91.82% | 83.48% | 76.26% | 75.00% |
| Magistral Medium 1.1 | 72.03% | 60.99% | 71.46% | 59.35% |
| Magistral Medium 1.0 | 73.59% | 64.95% | 70.83% | 59.36% |
| Magistral Small 1.2 | 86.14% | 77.34% | 70.07% | 70.88% |
| Magistral Small 1.1 | 70.52% | 62.03% | 65.78% | 59.17% |
| Magistral Small 1.0 | 70.68% | 62.76% | 68.18% | 55.84% |

Please make sure to use:
- `top_p`: 0.95
- `temperature`: 0.7
- `max_tokens`: 131072

We highly recommend including the following system prompt for best results; you can edit and customise it if needed for your specific use case. The `[THINK]` and `[/THINK]` markers are special tokens that must be encoded as such. Please make sure to use mistral-common as the source of truth. Find below examples from libraries supporting `mistral-common`. We invite you to choose, depending on your use case and requirements, between keeping reasoning traces during multi-turn interactions or keeping only the final assistant response.

The model can be used with the following frameworks:
- `vllm` (recommended): see below
- `transformers`: see below
- `llama.cpp`: see https://huggingface.co/mistralai/Magistral-Small-2509-GGUF
- Unsloth GGUFs: see https://huggingface.co/unsloth/Magistral-Small-2509-GGUF
- Kaggle: see https://www.kaggle.com/models/mistral-ai/magistral-small-2509
- Axolotl: see https://github.com/axolotl-ai-cloud/axolotl/tree/main/examples/magistral
- Unsloth: see https://docs.unsloth.ai/models/tutorials-how-to-fine-tune-and-run-llms/magistral-how-to-run-and-fine-tune

We recommend using this model with the vLLM library to implement production-ready inference pipelines. Doing so should automatically install `mistral_common >= 1.8.5`. You can also make use of a ready-to-go Docker image on Docker Hub.
Make sure you install the latest `Transformers` version:
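The original install command was not preserved here; it is likely along the lines of the following (the `mistral-common` pin mirrors the version requirement mentioned above):

```shell
pip install --upgrade transformers mistral-common
```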

license:apache-2.0

Qwen3-VL-30B-A3B-Instruct-AWQ-8bit

license:apache-2.0

NVIDIA-Nemotron-Nano-9B-v2-AWQ-8bit

The pretraining data has a cutoff date of September 2024. NVIDIA-Nemotron-Nano-9B-v2 is a large language model (LLM) trained from scratch by NVIDIA, and designed as a unified model for both reasoning and non-reasoning tasks. It responds to user queries and tasks by first generating a reasoning trace and then concluding with a final response. The model's reasoning capabilities can be controlled via a system prompt. If the user prefers the model to provide its final answer without intermediate reasoning traces, it can be configured to do so, albeit with a slight decrease in accuracy for harder prompts that require reasoning. Conversely, allowing the model to generate reasoning traces first generally results in higher-quality final solutions to queries and tasks.

The model uses a hybrid architecture consisting primarily of Mamba-2 and MLP layers combined with just four Attention layers. For the architecture, please refer to the Nemotron-H tech report. The model was trained using Megatron-LM and NeMo-RL. The supported languages include: English, German, Spanish, French, Italian, and Japanese. Improved using Qwen.

GOVERNING TERMS: This trial service is governed by the NVIDIA API Trial Terms of Service. Use of this model is governed by the NVIDIA Open Model License Agreement.

We evaluated our model in Reasoning-On mode across all benchmarks, except RULER, which is evaluated in Reasoning-Off mode.

| Benchmark | Qwen3-8B | NVIDIA-Nemotron-Nano-9B-v2 |
| :---- | ----: | ----: |
| AIME25 | 69.3% | 72.1% |
| MATH500 | 96.3% | 97.8% |
| GPQA | 59.6% | 64.0% |
| LCB | 59.5% | 71.1% |
| BFCL v3 | 66.3% | 66.9% |
| IFEval (Instruction Strict) | 89.4% | 90.3% |
| HLE | 4.4% | 6.5% |
| RULER (128K) | 74.1% | 78.9% |

All evaluations were done using NeMo-Skills. We published a tutorial with all details necessary to reproduce our evaluation results. This model supports runtime "thinking" budget control.
During inference, the user can specify how many tokens the model is allowed to "think".

- Architecture Type: Mamba2-Transformer Hybrid
- Network Architecture: Nemotron-Hybrid

NVIDIA-Nemotron-Nano-9B-v2 is a general-purpose reasoning and chat model intended to be used in English and coding languages. Other non-English languages (German, French, Italian, Spanish and Japanese) are also supported. Intended users are developers designing AI Agent systems, chatbots, RAG systems, and other AI-powered applications. It is also suitable for typical instruction-following tasks.

- Huggingface 08/18/2025 via https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-9B-v2
- API Catalog 08/18/2025 via https://build.nvidia.com/nvidia/nvidia-nemotron-nano-9b-v2
- NVIDIA Nemotron Nano 2: An Accurate and Efficient Hybrid Mamba-Transformer Reasoning Model

- Input Type(s): Text
- Input Format(s): String
- Input Parameters: One-Dimensional (1D): Sequences
- Other Properties Related to Input: Context length up to 128K. Supported languages include German, Spanish, French, Italian, Korean, Portuguese, Russian, Japanese, Chinese and English.

- Output Type(s): Text
- Output Format: String
- Output Parameters: One-Dimensional (1D): Sequences up to 128K

Our models are designed and optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.

- Runtime Engine(s): NeMo 25.07.nemotron-nano-v2
- Supported Hardware Microarchitecture Compatibility: NVIDIA A10G, NVIDIA H100-80GB, NVIDIA A100
- Operating System(s): Linux

The snippet below shows how to use this model with Huggingface Transformers (tested on version 4.48.3).
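The Transformers snippet itself is not reproduced in this listing. As an illustration of the reasoning-control convention described in the next section (`/think` vs `/nothink` in the system prompt), here is a tiny hypothetical helper (not part of any library):

```python
def build_messages(user_prompt: str, reasoning: bool = True) -> list:
    """Build a Nemotron chat with reasoning control.

    Illustrative helper: '/think' (or no signal) enables the reasoning
    trace, '/nothink' disables it. The same keywords can also be placed
    in user messages for turn-level control.
    """
    system = "/think" if reasoning else "/nothink"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user_prompt},
    ]

msgs = build_messages("Write a haiku about GPUs.", reasoning=False)
```

The resulting list is what you would pass to `tokenizer.apply_chat_template(...)` in the actual Transformers snippet.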
Case 1: If `/think` or no reasoning signal is provided in the system prompt, reasoning will be set to `True`.
Case 2: If `/nothink` is provided, reasoning will be set to `False`.

Note: `/think` or `/nothink` keywords can also be provided in "user" messages for turn-level reasoning control.

We recommend setting `temperature` to `0.6` and `top_p` to `0.95` for reasoning True, using greedy search for reasoning False, and increasing `max_new_tokens` to `1024` or higher for reasoning True.

The snippet below shows how to use this model with TRT-LLM. We tested this on the following commit and followed these instructions to build and install TRT-LLM in a docker container.

The snippet below shows how to use this model with vLLM. Use the latest version of vLLM and follow these instructions to build and install vLLM. Note:
- Remember to add `--mamba_ssm_cache_dtype float32` for accurate quality. Without this option, the model's accuracy may degrade.
- If you encounter a CUDA OOM issue, try `--max-num-seqs 64` and consider lowering the value further if the error persists.

Alternatively, you can use Docker to launch a vLLM server.

The thinking budget allows developers to keep accuracy high and meet response-time targets, which is especially crucial for customer support, autonomous agent steps, and edge devices where every millisecond counts. With budget control, you can set a limit for internal reasoning: `max_thinking_tokens` is a threshold that will attempt to end the reasoning trace at the next newline encountered in the reasoning trace. If no newline is encountered within 500 tokens, it will abruptly end the reasoning trace at `max_thinking_tokens + 500`.

Calling the server with a budget (restricted to 32 tokens here as an example): After launching a vLLM server, you can call the server with tool-call support using a Python script like below. We follow the jinja chat template provided below.
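The `max_thinking_tokens` heuristic described above can be sketched in plain Python. This is an illustrative re-implementation of the stated behavior, not the server's actual code:

```python
def truncate_thinking(tokens: list, max_thinking_tokens: int) -> list:
    """Sketch of thinking-budget control: once the budget is reached,
    stop the reasoning trace at the next newline-bearing token; if no
    newline appears within 500 tokens, cut abruptly at
    max_thinking_tokens + 500."""
    if len(tokens) <= max_thinking_tokens:
        return tokens
    for i in range(max_thinking_tokens, min(len(tokens), max_thinking_tokens + 500)):
        if "\n" in tokens[i]:
            return tokens[: i + 1]  # keep the newline, end the trace there
    return tokens[: max_thinking_tokens + 500]  # hard cut
```

This mirrors why budgeted traces usually end cleanly at a line break but are guaranteed to terminate within 500 tokens of the budget.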
This template conditionally adds `<think>\n` to the start of the Assistant response if `/think` is found in either the system prompt or any user message. If no reasoning signal is added, the model defaults to reasoning "on" mode. The chat template adds `<think></think>` to the start of the Assistant response if `/nothink` is found in the system prompt, thus enforcing reasoning on/off behavior.

Data Modality: Text
Training Data Size: More than 10 Trillion Tokens
Train/Test/Valid Split: We used 100% of the corpus for pre-training and relied on external benchmarks for testing.
Data Collection Method by dataset: Hybrid: Automated, Human, Synthetic
Labeling Method by dataset: Hybrid: Automated, Human, Synthetic

Properties: The post-training corpus for NVIDIA-Nemotron-Nano-9B-v2 consists of English and multilingual text (German, Spanish, French, Italian, Korean, Portuguese, Russian, Japanese, Chinese and English). Our sources cover a variety of document types such as: webpages, dialogue, articles, and other written materials. The corpus spans domains including code, legal, math, science, finance, and more. We also include a small portion of question-answering and alignment-style data to improve model accuracies. For several of the domains listed above we used synthetic data, specifically reasoning traces, from DeepSeek R1/R1-0528, Qwen3-235B-A22B, Nemotron 4 340B, Qwen2.5-32B-Instruct-AWQ, Qwen2.5-14B-Instruct, and Qwen 2.5 72B.

The pre-training corpus for NVIDIA-Nemotron-Nano-9B-v2 consists of high-quality curated and synthetically-generated data. It is trained in the English language, as well as 15 multilingual languages and 43 programming languages. Our sources cover a variety of document types such as: webpages, dialogue, articles, and other written materials. The corpus spans domains including legal, math, science, finance, and more. We also include a small portion of question-answering and alignment-style data to improve model accuracy.
The model was pre-trained for approximately twenty trillion tokens. Alongside the model, we release our final pretraining data, as outlined in this section. For ease of analysis, there is a sample set that is ungated. For all remaining code, math and multilingual data, gating and approval is required, and the dataset is permissively licensed for model training purposes. More details on the datasets and synthetic data generation methods can be found in the technical report NVIDIA Nemotron Nano 2: An Accurate and Efficient Hybrid Mamba-Transformer Reasoning Model.

| Dataset | Collection Period |
| :---- | :---- |
| Problems in Elementary Mathematics for Home Study | 4/23/2025 |
| GSM8K | 4/23/2025 |
| PRM800K | 4/23/2025 |
| CC-NEWS | 4/23/2025 |
| Common Crawl | 4/23/2025 |
| Wikimedia | 4/23/2025 |
| Bespoke-Stratos-17k | 4/23/2025 |
| tigerbot-kaggle-leetcodesolutions-en-2k | 4/23/2025 |
| glaive-function-calling-v2 | 4/23/2025 |
| APIGen Function-Calling | 4/23/2025 |
| LMSYS-Chat-1M | 4/23/2025 |
| Open Textbook Library - CC BY-SA & GNU subset and OpenStax - CC BY-SA subset | 4/23/2025 |
| Advanced Reasoning Benchmark, tigerbot-kaggle-leetcodesolutions-en-2k, PRM800K, and SciBench | 4/23/2025 |
| FineWeb-2 | 4/23/2025 |
| Court Listener | Legacy Download |
| peS2o | Legacy Download |
| OpenWebMath | Legacy Download |
| BioRxiv | Legacy Download |
| PMC Open Access Subset | Legacy Download |
| OpenWebText2 | Legacy Download |
| Stack Exchange Data Dump | Legacy Download |
| PubMed Abstracts | Legacy Download |
| NIH ExPorter | Legacy Download |
| arXiv | Legacy Download |
| BigScience Workshop Datasets | Legacy Download |
| Reddit Dataset | Legacy Download |
| SEC's Electronic Data Gathering, Analysis, and Retrieval (EDGAR) | Legacy Download |
| Public Software Heritage S3 | Legacy Download |
| The Stack | Legacy Download |
| mC4 | Legacy Download |
| Advanced Mathematical Problem Solving | Legacy Download |
| MathPile | Legacy Download |
| NuminaMath CoT | Legacy Download |
| PMC Article | Legacy Download |
| FLAN | Legacy Download |
| Advanced Reasoning Benchmark | Legacy Download |
| SciBench | Legacy Download |
| WikiTableQuestions | Legacy Download |
| FinQA | Legacy Download |
| Riddles | Legacy Download |
| Problems in Elementary Mathematics for Home Study | Legacy Download |
| MedMCQA | Legacy Download |
| Cosmos QA | Legacy Download |
| MCTest | Legacy Download |
| AI2's Reasoning Challenge | Legacy Download |
| OpenBookQA | Legacy Download |
| MMLU Auxiliary Train | Legacy Download |
| social-chemestry-101 | Legacy Download |
| Moral Stories | Legacy Download |
| The Common Pile v0.1 | Legacy Download |
| FineMath | Legacy Download |
| MegaMath | Legacy Download |
| FastChat | 6/30/2025 |

Private Non-publicly Accessible Datasets of Third Parties

| Dataset |
| :---- |
| Global Regulation |
| Workbench |

The English Common Crawl data was downloaded from the Common Crawl Foundation (see their FAQ for details on their crawling) and includes the snapshots CC-MAIN-2013-20 through CC-MAIN-2025-13. The data was subsequently deduplicated and filtered in various ways described in the Nemotron-CC paper. Additionally, we extracted data for fifteen languages from the following three Common Crawl snapshots: CC-MAIN-2024-51, CC-MAIN-2025-08, CC-MAIN-2025-18. The fifteen languages included were Arabic, Chinese, Danish, Dutch, French, German, Italian, Japanese, Korean, Polish, Portuguese, Russian, Spanish, Swedish, and Thai. As we did not have reliable multilingual model-based quality classifiers available, we applied just heuristic filtering instead, similar to what we did for lower-quality English data in the Nemotron-CC pipeline, but selectively removing some filters for some languages that did not work well. Deduplication was done in the same way as for Nemotron-CC. The GitHub Crawl was collected using the GitHub REST API and the Amazon S3 API.
Each crawl was operated in accordance with the rate limits set by its respective source, either GitHub or S3. We collect raw source code and subsequently remove any having a license which does not exist in our permissive-license set (for additional details, refer to the technical report).

| Dataset | Modality | Dataset Size (Tokens) | Collection Period |
| :---- | :---- | :---- | :---- |
| English Common Crawl | Text | 3.360T | 4/8/2025 |
| Multilingual Common Crawl | Text | 812.7B | 5/1/2025 |
| GitHub Crawl | Text | 747.4B | 4/29/2025 |

| Dataset | Modality | Dataset Size (Tokens) | Seed Dataset | Model(s) used for generation |
| :---- | :---- | :---- | :---- | :---- |
| Synthetic Art of Problem Solving from DeepSeek-R1 | Text | 25.5B | Art of Problem Solving; American Mathematics Competitions 8; American Mathematics Competitions 10 | DeepSeek-R1 |
| Synthetic Moral Stories and Social Chemistry from Mixtral-8x22B-v0.1 | Text | 327M | social-chemestry-101; Moral Stories | Mixtral-8x22B-v0.1 |
| Synthetic Social Sciences seeded with OpenStax from DeepSeek-V3, Mixtral-8x22B-v0.1, and Qwen2.5-72B | Text | 83.6M | OpenStax - CC BY-SA subset | DeepSeek-V3; Mixtral-8x22B-v0.1; Qwen2.5-72B |
| Synthetic Health Sciences seeded with OpenStax from DeepSeek-V3, Mixtral-8x22B-v0.1, and Qwen2.5-72B | Text | 9.7M | OpenStax - CC BY-SA subset | DeepSeek-V3; Mixtral-8x22B-v0.1; Qwen2.5-72B |
| Synthetic STEM seeded with OpenStax, Open Textbook Library, and GSM8K from DeepSeek-R1, DeepSeek-V3, DeepSeek-V3-0324, and Qwen2.5-72B | Text | 175M | OpenStax - CC BY-SA subset; GSM8K; Open Textbook Library - CC BY-SA & GNU subset | DeepSeek-R1; DeepSeek-V3; DeepSeek-V3-0324; Qwen2.5-72B |
| Nemotron-PrismMath | Text | 4.6B | Big-Math-RL-Verified; OpenR1-Math-220k | Qwen2.5-0.5B-instruct; Qwen2.5-72B-Instruct; DeepSeek-R1-Distill-Qwen-32B |
| Synthetic Question Answering Data from Papers and Permissible Books from Qwen2.5-72B-Instruct | Text | 350M | arXiv; National Institutes of Health ExPorter; BioRxiv; PMC Article; USPTO Backgrounds; peS2o; Global Regulation; CORE; PG-19; DOAB CC BY & CC BY-SA subset; NDLTD | Qwen2.5-72B-Instruct |
| Synthetic FineMath-4+ Reprocessed from DeepSeek-V3 | Text | 9.2B | Common Crawl | DeepSeek-V3 |
| Synthetic FineMath-3+ Reprocessed from phi-4 | Text | 27.6B | Common Crawl | phi-4 |
| Synthetic Union-3+ Reprocessed from phi-4 | Text | 93.1B | Common Crawl | phi-4 |
| Refreshed Nemotron-MIND from phi-4 | Text | 73B | Common Crawl | phi-4 |
| Synthetic Union-4+ Reprocessed from phi-4 | Text | 14.12B | Common Crawl | phi-4 |
| Synthetic Union-3+ minus 4+ Reprocessed from phi-4 | Text | 78.95B | Common Crawl | phi-4 |
| Synthetic Union-3 Refreshed from phi-4 | Text | 80.94B | Common Crawl | phi-4 |
| Synthetic Union-4+ Refreshed from phi-4 | Text | 52.32B | Common Crawl | phi-4 |
| Synthetic AGIEval seeded with AQUA-RAT, LogiQA, and AR-LSAT from DeepSeek-V3 and DeepSeek-V3-0324 | Text | 4.0B | AQUA-RAT; LogiQA; AR-LSAT | DeepSeek-V3; DeepSeek-V3-0324 |
| Synthetic AGIEval seeded with AQUA-RAT, LogiQA, and AR-LSAT from Qwen3-30B-A3B | Text | 4.2B | AQUA-RAT; LogiQA; AR-LSAT | Qwen3-30B-A3B |
| Synthetic Art of Problem Solving from Qwen2.5-32B-Instruct, Qwen2.5-Math-72B, Qwen2.5-Math-7B, and Qwen2.5-72B-Instruct | Text | 83.1B | Art of Problem Solving; American Mathematics Competitions 8; American Mathematics Competitions 10; GSM8K; PRM800K | Qwen2.5-32B-Instruct; Qwen2.5-Math-72B; Qwen2.5-Math-7B; Qwen2.5-72B-Instruct |
| Synthetic MMLU Auxiliary Train from DeepSeek-R1 | Text | 0.5B | MMLU Auxiliary Train | DeepSeek-R1 |
| Synthetic Long Context Continued Post-Training Data from Papers and Permissible Books from Qwen2.5-72B-Instruct | Text | 5.4B | arXiv; National Institutes of Health ExPorter; BioRxiv; PMC Article; USPTO Backgrounds; peS2o; Global Regulation; CORE; PG-19; DOAB CC BY & CC BY-SA subset; NDLTD | Qwen2.5-72B-Instruct |
| Synthetic Common Crawl from Qwen3-30B-A3B and Mistral-Nemo-12B-Instruct | Text | 1.949T | Common Crawl | Qwen3-30B-A3B; Mistral-NeMo-12B-Instruct |
| Synthetic Multilingual Data from Common Crawl from Qwen3-30B-A3B | Text | 997.3B | Common Crawl | Qwen3-30B-A3B |
| Synthetic Multilingual Data from Wikimedia from Qwen3-30B-A3B | Text | 55.1B | Wikimedia | Qwen3-30B-A3B |
| Synthetic OpenMathReasoning from DeepSeek-R1-0528 | Text | 1.5M | OpenMathReasoning | DeepSeek-R1-0528 |
| Synthetic OpenCodeReasoning from DeepSeek-R1-0528 | Text | 1.1M | OpenCodeReasoning | DeepSeek-R1-0528 |
| Synthetic Science Data from DeepSeek-R1-0528 | Text | 1.5M | - | DeepSeek-R1-0528 |
| Synthetic Humanity's Last Exam from DeepSeek-R1-0528 | Text | 460K | Humanity's Last Exam | DeepSeek-R1-0528 |
| Synthetic ToolBench from Qwen3-235B-A22B | Text | 400K | ToolBench | Qwen3-235B-A22B |
| Synthetic Nemotron Content Safety Dataset V2, eval-safety, Gretel Synthetic Safety Alignment, and RedTeam_2K from DeepSeek-R1-0528 | Text | 52K | Nemotron Content Safety Dataset V2; eval-safety; Gretel Synthetic Safety Alignment; RedTeam_2K | DeepSeek-R1-0528 |
| Synthetic HelpSteer from Qwen3-235B-A22B | Text | 120K | HelpSteer3; HelpSteer2 | Qwen3-235B-A22B |
| Synthetic Alignment data from Mixtral-8x22B-Instruct-v0.1, Mixtral-8x7B-Instruct-v0.1, and Nemotron-4 Family | Text | 400K | HelpSteer2; C4; LMSYS-Chat-1M; ShareGPT52K; tigerbot-kaggle-leetcodesolutions-en-2k; GSM8K; PRM800K; lm_identity (NVIDIA internal); FinQA; WikiTableQuestions; Riddles; ChatQA nvolve-multiturn (NVIDIA internal); glaive-function-calling-v2; SciBench; OpenBookQA; Advanced Reasoning Benchmark; Public Software Heritage S3; Khan Academy Math Keywords | Nemotron-4-15B-Base (NVIDIA internal); Nemotron-4-15B-Instruct (NVIDIA internal); Nemotron-4-340B-Base; Nemotron-4-340B-Instruct; Nemotron-4-340B-Reward; Mixtral-8x7B-Instruct-v0.1; Mixtral-8x22B-Instruct-v0.1 |
| Synthetic LMSYS-Chat-1M from Qwen3-235B-A22B | Text | 1M | LMSYS-Chat-1M | Qwen3-235B-A22B |
| Synthetic Multilingual Reasoning data from DeepSeek-R1-0528, Qwen2.5-32B-Instruct-AWQ, and Qwen2.5-14B-Instruct | Text | 25M | OpenMathReasoning; OpenCodeReasoning | DeepSeek-R1-0528; Qwen2.5-32B-Instruct-AWQ (translation); Qwen2.5-14B-Instruct (translation) |
| Synthetic Multilingual Reasoning data from Qwen3-235B-A22B and Gemma 3 Post-Trained models | Text | 5M | WildChat | Qwen3-235B-A22B; Gemma 3 PT 12B; Gemma 3 PT 27B |

Data Collection Method by dataset: Hybrid: Human, Synthetic
Labeling Method by dataset: Hybrid: Automated, Human, Synthetic

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our Trustworthy AI terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. For more detailed information on ethical considerations for this model, please see the Model Card++ Bias, Explainability, Safety & Security, and Privacy Subcards. Please report security vulnerabilities or NVIDIA AI Concerns here.


Ring-flash-2.0-AWQ-8bit

🤗 Hugging Face | 🤖 ModelScope

Today, we are officially open-sourcing Ring-flash-2.0. This is a high-performance thinking model, deeply optimized based on Ling-flash-2.0-base. Like Ling-flash-2.0, Ring-flash-2.0 has a total of 100B parameters, with only 6.1B activated per inference. Our independently developed icepop algorithm has successfully addressed the challenge of training instability in reinforcement learning (RL) for MoE LLMs after cold-start Long-CoT SFT, enabling the model's complex reasoning capabilities to continuously improve throughout extended RL training cycles.

Ring-flash-2.0 demonstrates significant breakthroughs across multiple challenging benchmarks, including math competitions, code generation, and logical reasoning. Its performance not only surpasses that of SOTA dense models under 40B parameters but also rivals larger open-weight MoE models and closed-source high-performance thinking model APIs. We selected representative open-source thinking models and closed-source APIs for comparison, including GPT-OSS-120B(medium), Qwen3-32B-Thinking, Seed-OSS-36B-Instruct, and Gemini-2.5-Flash. The benchmarking results demonstrate that Ring-flash-2.0 exhibits leading performance across multiple challenging general reasoning tasks, including:
- Math competitions (AIME 25, Omni-MATH)
- Code generation (LiveCodeBench, CodeForce-Elo)
- Logical reasoning (ARC-Prize)

It also shows strong competitiveness in specialized domains such as scientific and medical reasoning (GPQA-Diamond, HealthBench). More surprisingly, although Ring-flash-2.0 is primarily designed for complex reasoning, it outperforms all other compared models in creative writing (Creative Writing v3) and matches the creative capability of its "twin brother", the non-thinking model Ling-flash-2.0.
Building on the highly efficient MoE architecture of the Ling 2.0 series, and through structural optimizations such as a 1/32 expert activation ratio and MTP layers, Ring-flash-2.0 activates only 6.1B (4.8B non-embedding) parameters while delivering performance comparable to a ~40B dense model. Thanks to its low-activation, high-sparsity design, Ring-flash-2.0 achieves a generation speed of 200+ tokens/sec when deployed on just four H20 GPUs, significantly reducing inference costs for thinking models in high-concurrency scenarios.

IcePop: Cooling Down Training-Inference Gaps in RL for MoE Models

During RL for MoE models, the precision discrepancy between the training and inference engines is more pronounced than for dense models. This gap widens progressively as sequence length and training steps increase, particularly during long-sequence generation and extended training cycles. A more critical issue is that the original GRPO algorithm begins to break down within a limited number of training steps: the probability discrepancy for the same token between the training and inference phases gradually increases, and when this relative difference exceeds 5%, training effectively fails. This poses a significant challenge for long-horizon reinforcement learning with lengthy sequences. To address this issue, we introduced a key solution: distribution calibration via masked bidirectional truncation, which effectively narrows the gap between training and inference.
- Bidirectional Truncation: We truncate not only tokens where the training probability is significantly higher than the inference probability, but also the reverse scenario where the training probability is much lower.
- Masking: Tokens with excessively large discrepancies are excluded from gradient computation.
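The masked bidirectional truncation idea can be sketched as a toy masking rule. The thresholds below are made up for illustration; see the technical blog for the actual IcePop algorithm:

```python
def icepop_mask(p_train: list, p_infer: list,
                low: float = 0.5, high: float = 2.0) -> list:
    """Toy sketch of IcePop-style masking: keep a token's gradient only
    if the train/inference probability ratio stays inside [low, high].

    Truncation is bidirectional: ratios that are too high AND ratios
    that are too low are both excluded from gradient computation.
    The thresholds are illustrative placeholders, not the real ones.
    """
    return [low <= pt / pi <= high for pt, pi in zip(p_train, p_infer)]

# Token 2 (train prob 3x inference) and token 3 (train prob 50x lower)
# would both be masked out of the gradient.
mask = icepop_mask([0.2, 0.9, 0.01], [0.25, 0.3, 0.5])
```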
For a detailed introduction to the algorithm, please refer to our technical blog: https://ringtech.notion.site/icepop

SFT + RLVR + RLHF Multi-Stage Training

To comprehensively enhance the capabilities of Ring-flash-2.0, we designed a two-stage RL pipeline. First, lightweight Long-CoT SFT equips the Ling-flash-2.0-base model with diverse thinking patterns. This is followed by RL training with Verifiable Rewards (RLVR) to continually stimulate the model's reasoning potential. Finally, an RLHF phase is incorporated to improve the model's general abilities. During RL training, we compared directly combining RLVR and RLHF into joint training against the ultimately adopted two-stage RL pipeline. Both approaches showed relatively similar effectiveness in our experiments. However, due to the differing difficulty levels of RLVR and RLHF tasks, with RLHF involving relatively shorter model rollouts, joint training resulted in more long-tail generations. From an engineering-efficiency perspective, we ultimately adopted the two-stage RL approach.

Here is a code snippet showing how to use the chat model with `transformers`:

If you are in mainland China, we strongly recommend using our model from 🤖 ModelScope.

vLLM supports offline batched inference or launching an OpenAI-compatible API service for online inference. Since the Pull Request (PR) has not been submitted to the vLLM community at this stage, please prepare the environment by following the steps below:

To handle long context in vLLM using YaRN, we need to follow these two steps:
1. Add a `rope_scaling` field to the model's `config.json` file, for example:
2. Use an additional parameter `--max-model-len` to specify the desired maximum context length when starting the vLLM service.

For detailed guidance, please refer to the vLLM instructions.
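For step 1 above, an illustrative `rope_scaling` entry might look like the following. The `factor` and `original_max_position_embeddings` values here are placeholders; use the values appropriate for Ring-flash-2.0:

```json
{
  "rope_scaling": {
    "rope_type": "yarn",
    "factor": 4.0,
    "original_max_position_embeddings": 32768
  }
}
```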
We will later submit our model to the SGLang official release; for now, prepare the environment with the following steps: Then apply the patch to your SGLang installation: Both BF16 and FP8 models are supported by SGLang now; which is used depends on the dtype of the model in ${MODELPATH}. Both share the same launch command: MTP is supported for the base model, but not yet for the chat model. You can add the parameter `--speculative-algorithm NEXTN` to the start command. We recommend using Llama-Factory to fine-tune Ring. This code repository is licensed under the MIT License.

Tip: To facilitate academic research and downstream applications with customizable model naming, we did not conduct specific identity-recognition training.

license:mit

Qwen3-VL-32B-Instruct-AWQ-8bit

- Quantization Method: AWQ
- Bits: 8
- Group Size: 32
- Calibration Dataset: HuggingFaceM4/FineVision
- Quantization Tool: llm-compressor

Meet Qwen3-VL: the most powerful vision-language model in the Qwen series to date. This generation delivers comprehensive upgrades across the board: superior text understanding & generation, deeper visual perception & reasoning, extended context length, enhanced spatial and video dynamics comprehension, and stronger agent interaction capabilities. Available in Dense and MoE architectures that scale from edge to cloud, with Instruct and reasoning-enhanced Thinking editions for flexible, on-demand deployment.

Key enhancements:
- Visual Agent: operates PC/mobile GUIs; recognizes elements, understands functions, invokes tools, and completes tasks.
- Visual Coding Boost: generates Draw.io/HTML/CSS/JS from images and videos.
- Advanced Spatial Perception: judges object positions, viewpoints, and occlusions; provides stronger 2D grounding and enables 3D grounding for spatial reasoning and embodied AI.
- Long Context & Video Understanding: native 256K context, expandable to 1M; handles books and hours-long video with full recall and second-level indexing.
- Enhanced Multimodal Reasoning: excels in STEM/math with causal analysis and logical, evidence-based answers.
- Upgraded Visual Recognition: broader, higher-quality pretraining lets the model "recognize everything": celebrities, anime, products, landmarks, flora/fauna, etc.
- Expanded OCR: supports 32 languages (up from 19); robust in low light, blur, and tilt; better with rare/ancient characters and jargon; improved long-document structure parsing.
- Text Understanding on par with pure LLMs: seamless text-vision fusion for lossless, unified comprehension.

Key architectural updates:
1. Interleaved-MRoPE: full-frequency allocation over time, width, and height via robust positional embeddings, enhancing long-horizon video reasoning.
2. DeepStack: fuses multi-level ViT features to capture fine-grained details and sharpen image-text alignment.
3. Text-Timestamp Alignment: moves beyond T-RoPE to precise, timestamp-grounded event localization for stronger video temporal modeling.

This is the weight repository for Qwen3-VL-32B-Instruct. Below, we provide simple examples showing how to use Qwen3-VL with 🤖 ModelScope and 🤗 Transformers. The code for Qwen3-VL is in the latest Hugging Face transformers, and we advise you to build from source with the command: Here we show a code snippet demonstrating how to use the chat model with `transformers`: If you find our work helpful, feel free to cite us.
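As a complement to the usage note above, here is a minimal sketch of the multimodal chat-message structure that Qwen-VL-style processors expect from `apply_chat_template`. The helper name `build_messages` and the example URL are illustrative, and the commented-out transformers class names follow the pattern of earlier Qwen-VL releases, so they may differ for Qwen3-VL:

```python
# Hypothetical sketch: OpenAI-style multimodal message list, with image
# and text content parts, as consumed by the processor's chat template.

def build_messages(image_url: str, question: str) -> list:
    """Build a single-turn user message containing an image and a question."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_url},
                {"type": "text", "text": question},
            ],
        }
    ]

messages = build_messages("https://example.com/demo.jpg", "Describe this image.")

# Illustrative use with transformers (requires downloading the weights);
# the exact class names for Qwen3-VL may differ:
# from transformers import AutoProcessor, AutoModelForImageTextToText
# processor = AutoProcessor.from_pretrained("cpatonn/Qwen3-VL-32B-Instruct-AWQ-8bit")
# model = AutoModelForImageTextToText.from_pretrained(
#     "cpatonn/Qwen3-VL-32B-Instruct-AWQ-8bit", device_map="auto")
# inputs = processor.apply_chat_template(
#     messages, add_generation_prompt=True, tokenize=True, return_tensors="pt")
```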

license:apache-2.0
218
0

Qwopus3.5-27B-v3-AWQ-BF16-INT8

license:apache-2.0
207
1

Qwen3-Coder-30B-A3B-Instruct-GPTQ-8bit

Method Quantised using vllm-project/llm-compressor, nvidia/Llama-Nemotron-Post-Training-Dataset and the following configs: Qwen3-Coder is available in multiple sizes. Today, we're excited to introduce Qwen3-Coder-30B-A3B-Instruct. This streamlined model maintains impressive performance and efficiency, featuring the following key enhancements:
- Significant Performance among open models on Agentic Coding, Agentic Browser-Use, and other foundational coding tasks.
- Long-context Capabilities with native support for 256K tokens, extendable up to 1M tokens using YaRN, optimized for repository-scale understanding.
- Agentic Coding support for most platforms such as Qwen Code and CLINE, featuring a specially designed function call format.

Qwen3-Coder-30B-A3B-Instruct has the following features:
- Type: Causal Language Models
- Training Stage: Pretraining & Post-training
- Number of Parameters: 30.5B in total and 3.3B activated
- Number of Layers: 48
- Number of Attention Heads (GQA): 32 for Q and 4 for KV
- Number of Experts: 128
- Number of Activated Experts: 8
- Context Length: 262,144 natively.

NOTE: This model supports only non-thinking mode and does not generate `<think></think>` blocks in its output. Meanwhile, specifying `enable_thinking=False` is no longer required. For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our blog, GitHub, and Documentation. We advise you to use the latest version of `transformers`.
Define Tools:

```python
tools = [
    {
        "type": "function",
        "function": {
            "name": "square_the_number",
            "description": "output the square of the number.",
            "parameters": {
                "type": "object",
                "required": ["input_num"],
                "properties": {
                    "input_num": {
                        "type": "number",
                        "description": "input_num is a number that will be squared",
                    }
                },
            },
        },
    }
]
```

Define LLM and send the request:

```python
from openai import OpenAI

client = OpenAI(
    # Use a custom endpoint compatible with OpenAI API
    base_url="http://localhost:8000/v1",  # api_base
    api_key="EMPTY",
)

messages = [{"role": "user", "content": "square the number 1024"}]

completion = client.chat.completions.create(
    messages=messages,
    model="Qwen3-Coder-30B-A3B-Instruct",
    max_tokens=65536,
    tools=tools,
)
```

```bibtex
@misc{qwen3technicalreport,
      title={Qwen3 Technical Report},
      author={Qwen Team},
      year={2025},
      eprint={2505.09388},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.09388},
}
```
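Once the completion returns, the client is responsible for actually executing the requested tool and feeding the result back. A minimal, hypothetical dispatch sketch (the local `square_the_number` implementation and `TOOL_REGISTRY` are ours, not part of the model card) might look like:

```python
import json

# Hypothetical local implementation of the tool declared in the schema above.
def square_the_number(input_num: float) -> float:
    return input_num ** 2

TOOL_REGISTRY = {"square_the_number": square_the_number}

def dispatch_tool_call(tool_call: dict) -> float:
    """Execute one tool call shaped like completion.choices[0].message.tool_calls[i]."""
    fn = TOOL_REGISTRY[tool_call["function"]["name"]]
    args = json.loads(tool_call["function"]["arguments"])
    return fn(**args)

# Simulated tool call, mirroring the OpenAI response shape:
call = {
    "function": {
        "name": "square_the_number",
        "arguments": json.dumps({"input_num": 1024}),
    }
}
result = dispatch_tool_call(call)  # 1048576
```

In a full loop, `result` would be appended back to `messages` as a `"tool"` role message before calling the model again.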

license:apache-2.0
202
2

Qwen3-4B-Instruct-2507-AWQ-8bit

license:apache-2.0
193
1

Ring-flash-2.0-AWQ-4bit

🤗 Hugging Face | 🤖 ModelScope Today, we are officially open-sourcing Ring-flash-2.0. This is a high-performance thinking model, deeply optimized based on Ling-flash-2.0-base. Like Ling-flash-2.0, Ring-flash-2.0 has a total of 100B parameters, with only 6.1B activated per inference. Our independently developed icepop algorithm has successfully addressed the challenge of training instability in reinforcement learning (RL) for MoE LLMs after cold-start Long-CoT SFT, enabling the model’s complex reasoning capabilities to continuously improve throughout extended RL training cycles. Ring-flash-2.0 demonstrates significant breakthroughs across multiple challenging benchmarks, including math competitions, code generation, and logical reasoning. Its performance not only surpasses that of SOTA dense models under 40B parameters but also rivals larger open-weight MoE models and closed-source high-performance thinking model APIs. We selected representative open-source thinking models and closed-source APIs for comparison, including GPT-OSS-120B (medium), Qwen3-32B-Thinking, Seed-OSS-36B-Instruct, and Gemini-2.5-Flash. The benchmarking results demonstrate that Ring-flash-2.0 exhibits leading performance across multiple challenging general reasoning tasks, including: - Math competitions (AIME 25, Omni-MATH), - Code generation (LiveCodeBench, CodeForce-Elo), - Logical reasoning (ARC-Prize). It also shows strong competitiveness in specialized domains such as: - Scientific and medical reasoning (GPQA-Diamond, HealthBench). More surprisingly, although Ring-flash-2.0 is primarily designed for complex reasoning, it outperforms all other compared models in creative writing (Creative Writing v3) and matches the creative capability of its "twin brother"—the non-thinking model Ling-flash-2.0.
Building on the highly efficient MoE architecture of the Ling 2.0 series, and through structural optimizations such as a 1/32 expert activation ratio and MTP layers, Ring-flash-2.0 activates only 6.1B (4.8B non-embedding) parameters while delivering performance comparable to a ∼40B dense model. Thanks to its low activation and high sparsity design, Ring-flash-2.0 achieves a high generation speed of 200+ tokens/sec when deployed on just four H20 GPUs, significantly reducing inference costs for thinking models in high-concurrency scenarios. IcePop: Cooling Down Training-Inference Gaps in RL for MoE Models During the RL for MoE models, the discrepancy of precision between the training and inference engines is more pronounced compared to dense models. This gap widens progressively as sequence length and training steps increase—particularly during long-sequence generation and extended training cycles. A more critical issue is that the original GRPO algorithm begins to break down within a limited number of training steps. Specifically, the probabilistic discrepancy for the same token between training and inference phases gradually increases. When this relative difference exceeds 5%, training effectively fails, posing a significant challenge for long-horizon reinforcement learning with lengthy sequences. To address this issue, we introduced a key solution: distribution calibration via masked bidirectional truncation, which effectively narrows the gap between training and inference. - Bidirectional Truncation: We truncate not only tokens where the training probability is significantly higher than the inference probability but also the reverse scenario where the training probability is much lower. - Masking: Tokens with excessively large discrepancies are excluded from gradient computation. 
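The masked bidirectional truncation described above can be sketched in a few lines. This is an illustrative toy, not the icepop implementation; the threshold values (`low`, `high`, `mask_bound`) are made up for the example:

```python
def icepop_mask(p_train, p_infer, low=0.95, high=1.05, mask_bound=2.0):
    """Illustrative sketch of masked bidirectional truncation.

    For each token, compute the training/inference probability ratio.
    Ratios are truncated on BOTH sides (bidirectional truncation), and
    tokens with excessively large discrepancies are masked out of the
    gradient entirely. Returns (effective_ratio, contributes_gradient).
    """
    out = []
    for pt, pi in zip(p_train, p_infer):
        ratio = pt / pi
        if ratio > mask_bound or ratio < 1.0 / mask_bound:
            out.append((0.0, False))  # masked: excluded from gradient computation
        else:
            clipped = min(max(ratio, low), high)  # bidirectional truncation
            out.append((clipped, True))
    return out

masked = icepop_mask([0.30, 0.10, 0.50], [0.29, 0.04, 0.50])
# middle token: ratio 2.5 exceeds the mask bound, so it is masked
```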
For a detailed algorithm introduction, please refer to our technical blog: https://ringtech.notion.site/icepop

SFT + RLVR + RLHF Multi-Stage Training: To comprehensively enhance the capabilities of Ring-flash-2.0, we designed a two-stage RL pipeline. First, lightweight Long-CoT SFT equips the Ling-flash-2.0-base model with diverse thinking patterns. This is followed by RL training with Verifiable Rewards (RLVR) to continually stimulate the model’s reasoning potential. Finally, an RLHF phase is incorporated to improve the model’s general abilities. During RL training, we compared directly combining RLVR and RLHF into joint training with the ultimately adopted two-stage RL pipeline. Both approaches showed relatively similar effectiveness in our experiments. However, due to the differing difficulty levels of RLVR and RLHF tasks, with RLHF involving relatively shorter model rollouts, joint training resulted in more long-tail generations. From an engineering efficiency perspective, we ultimately adopted the two-stage RL approach.

Here is a code snippet to show you how to use the chat model with `transformers`: If you're in mainland China, we strongly recommend using our model from 🤖 ModelScope. vLLM supports offline batched inference or launching an OpenAI-compatible API service for online inference. Since the Pull Request (PR) has not been submitted to the vLLM community at this stage, please prepare the environment by following the steps below. To handle long context in vLLM using YaRN, we need to follow these two steps:
1. Add a `rope_scaling` field to the model's `config.json` file, for example:
2. Use an additional parameter `--max-model-len` to specify the desired maximum context length when starting the vLLM service.
For detailed guidance, please refer to the vLLM instructions.
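The two YaRN steps can be sketched as follows. The `factor` and length values below are illustrative placeholders, not the recommended settings for this model; take the actual values from the model card:

```python
import json

# Step 1 (illustrative): a `rope_scaling` entry for the model's config.json,
# enabling YaRN context extension. Values here are examples only.
config_patch = {
    "rope_scaling": {
        "rope_type": "yarn",
        "factor": 4.0,
        "original_max_position_embeddings": 32768,
    }
}

# Round-trip through JSON, as you would when editing config.json on disk.
patched = json.loads(json.dumps(config_patch))

# Step 2: start the vLLM service with the desired maximum context length, e.g.:
#   vllm serve ${MODEL_PATH} --max-model-len 131072
```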
We will submit our model to the official SGLang release later; for now, prepare the environment with the following steps: Then apply the patch to the SGLang installation: SGLang now supports both BF16 and FP8 models, depending on the dtype of the model in ${MODEL_PATH}; both use the same command: MTP is supported for the base model, but not yet for the chat model; add the parameter `--speculative-algorithm NEXTN` to the start command. We recommend using Llama-Factory to finetune Ring. This code repository is licensed under the MIT License. Tip: To facilitate academic research and downstream applications with customizable model naming, we did not conduct specific identity recognition training.

license:mit
191
2

GLM-4.5-AWQ-4bit

Method Quantised using vllm-project/llm-compressor, nvidia/Llama-Nemotron-Post-Training-Dataset and the following configs: Note: the last layer, i.e., the MTP layer (index 46), is ignored because transformers has no MTP implementation. Inference: please load the model into vLLM or SGLang as float16 data type for AWQ support and use `tensor_parallel_size`. 📍 Use GLM-4.5 API services on Z.ai API Platform (Global) or Zhipu AI Open Platform (Mainland China). The GLM-4.5 series models are foundation models designed for intelligent agents. GLM-4.5 has 355 billion total parameters with 32 billion active parameters, while GLM-4.5-Air adopts a more compact design with 106 billion total parameters and 12 billion active parameters. GLM-4.5 models unify reasoning, coding, and intelligent agent capabilities to meet the complex demands of intelligent agent applications. Both GLM-4.5 and GLM-4.5-Air are hybrid reasoning models that provide two modes: thinking mode for complex reasoning and tool usage, and non-thinking mode for immediate responses. We have open-sourced the base models, hybrid reasoning models, and FP8 versions of the hybrid reasoning models for both GLM-4.5 and GLM-4.5-Air. They are released under the MIT open-source license and can be used commercially and for secondary development. As demonstrated in our comprehensive evaluation across 12 industry-standard benchmarks, GLM-4.5 achieves exceptional performance with a score of 63.2, placing 3rd among all proprietary and open-source models. Notably, GLM-4.5-Air delivers competitive results at 59.8 while maintaining superior efficiency. For more eval results, show cases, and technical details, please visit our technical blog. The technical report will be released soon. The model code, tool parser and reasoning parser can be found in the implementation of transformers, vLLM and SGLang.

license:mit
185
1

Qwen3-VL-30B-A3B-Thinking-AWQ-8bit

Meet Qwen3-VL: the most powerful vision-language model in the Qwen series to date. This generation delivers comprehensive upgrades across the board: superior text understanding & generation, deeper visual perception & reasoning, extended context length, enhanced spatial and video dynamics comprehension, and stronger agent interaction capabilities. Available in Dense and MoE architectures that scale from edge to cloud, with Instruct and reasoning-enhanced Thinking editions for flexible, on-demand deployment.

Key enhancements:
- Visual Agent: operates PC/mobile GUIs; recognizes elements, understands functions, invokes tools, and completes tasks.
- Visual Coding Boost: generates Draw.io/HTML/CSS/JS from images and videos.
- Advanced Spatial Perception: judges object positions, viewpoints, and occlusions; provides stronger 2D grounding and enables 3D grounding for spatial reasoning and embodied AI.
- Long Context & Video Understanding: native 256K context, expandable to 1M; handles books and hours-long video with full recall and second-level indexing.
- Enhanced Multimodal Reasoning: excels in STEM/math with causal analysis and logical, evidence-based answers.
- Upgraded Visual Recognition: broader, higher-quality pretraining lets the model "recognize everything": celebrities, anime, products, landmarks, flora/fauna, etc.
- Expanded OCR: supports 32 languages (up from 19); robust in low light, blur, and tilt; better with rare/ancient characters and jargon; improved long-document structure parsing.
- Text Understanding on par with pure LLMs: seamless text-vision fusion for lossless, unified comprehension.

Key architectural updates:
1. Interleaved-MRoPE: full-frequency allocation over time, width, and height via robust positional embeddings, enhancing long-horizon video reasoning.
2. DeepStack: fuses multi-level ViT features to capture fine-grained details and sharpen image-text alignment.
3. Text-Timestamp Alignment: moves beyond T-RoPE to precise, timestamp-grounded event localization for stronger video temporal modeling.

This is the weight repository for Qwen3-VL-30B-A3B-Thinking. Below, we provide simple examples showing how to use Qwen3-VL with 🤖 ModelScope and 🤗 Transformers. The code for Qwen3-VL is in the latest Hugging Face transformers, and we advise you to build from source with the command: Here we show a code snippet demonstrating how to use the chat model with `transformers`: If you find our work helpful, feel free to cite us.

license:apache-2.0
179
3

granite-4.0-h-small-AWQ-8bit

- Quantization method: AWQ
- Bits: 8
- Group Size: 32
- Calibration Dataset: nvidia/Llama-Nemotron-Post-Training-Dataset
- Quantization Tool: llm-compressor
- The model cannot be loaded with tensor parallelism or pipeline parallelism.

Model Summary: Granite-4.0-H-Small is a 32B parameter long-context instruct model finetuned from Granite-4.0-H-Small-Base using a combination of open source instruction datasets with permissive license and internally collected synthetic datasets. This model is developed using a diverse set of techniques with a structured chat format, including supervised finetuning, model alignment using reinforcement learning, and model merging. Granite 4.0 instruct models feature improved instruction following (IF) and tool-calling capabilities, making them more effective in enterprise applications.

- Developers: Granite Team, IBM
- HF Collection: Granite 4.0 Language Models HF Collection
- GitHub Repository: ibm-granite/granite-4.0-language-models
- Website: Granite Docs
- Release Date: October 2nd, 2025
- License: Apache 2.0

Supported Languages: English, German, Spanish, French, Japanese, Portuguese, Arabic, Czech, Italian, Korean, Dutch, and Chinese. Users may finetune Granite 4.0 models for languages beyond these. Intended use: The model is designed to respond to general instructions and can be used to build AI assistants for multiple domains, including business applications.

Capabilities:
- Summarization
- Text classification
- Text extraction
- Question-answering
- Retrieval Augmented Generation (RAG)
- Code related tasks
- Function-calling tasks
- Multilingual dialog use cases
- Fill-In-the-Middle (FIM) code completions

Generation: This is a simple example of how to use the Granite-4.0-H-Small model. Copy the snippet from the section that is relevant for your use case.

Tool-calling: Granite-4.0-H-Small comes with enhanced tool-calling capabilities, enabling seamless integration with external functions and APIs. To define a list of tools, please follow OpenAI's function definition schema. This is an example of how to use the Granite-4.0-H-Small model's tool-calling ability:

Benchmarks are reported per model variant: Micro (Dense), H Micro (Dense), H Tiny (MoE), and H Small (MoE). Multilingual benchmarks and the included languages:
- MMMLU (11): ar, de, en, es, fr, ja, ko, pt, zh, bn, hi
- INCLUDE (14): hi, bn, ta, te, ar, de, es, fr, it, ja, ko, nl, pt, zh

Model Architecture: The Granite-4.0-H-Small baseline is built on a decoder-only MoE transformer architecture. Core components of this architecture are: GQA, Mamba2, MoEs with shared experts, SwiGLU activation, RMSNorm, and shared input/output embeddings.

| Model | Micro (Dense) | H Micro (Dense) | H Tiny (MoE) | H Small (MoE) |
| --- | --- | --- | --- | --- |
| Number of layers | 40 attention | 4 attention / 36 Mamba2 | 4 attention / 36 Mamba2 | 4 attention / 36 Mamba2 |
| MLP / Shared expert hidden size | 8192 | 8192 | 1024 | 1536 |

Training Data: Overall, our SFT data is largely comprised of three key sources: (1) publicly available datasets with permissive license, (2) internal synthetic data targeting specific capabilities, and (3) a select set of human-curated data.

Infrastructure: We trained the Granite 4.0 Language Models utilizing an NVIDIA GB200 NVL72 cluster hosted in CoreWeave. Intra-rack communication occurs via the 72-GPU NVLink domain, and a non-blocking, full Fat-Tree NDR 400 Gb/s InfiniBand network provides inter-rack communication. This cluster provides a scalable and efficient infrastructure for training our models over thousands of GPUs.

Ethical Considerations and Limitations: Granite 4.0 Instruction Models are primarily finetuned using instruction-response pairs mostly in English, but also multilingual data covering multiple languages. Although this model can handle multilingual dialog use cases, its performance might not match that on English tasks. In such cases, introducing a small number of examples (few-shot) can help the model generate more accurate outputs. While this model has been aligned with safety in consideration, it may in some cases produce inaccurate, biased, or unsafe responses to user prompts. We therefore urge the community to use this model with proper safety testing and tuning tailored to their specific tasks.

Resources
- ⭐️ Learn about the latest updates with Granite: https://www.ibm.com/granite
- 📄 Get started with tutorials, best practices, and prompt engineering advice: https://www.ibm.com/granite/docs/
- 💡 Learn about the latest Granite learning resources: https://ibm.biz/granite-learning-resources
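Since Granite's tool-calling follows OpenAI's function definition schema, a tool is just a JSON object of a fixed shape. A minimal, hypothetical definition (the `get_current_weather` tool is our example, not one shipped with the model) looks like:

```python
import json

# Hypothetical tool following OpenAI's function definition schema,
# suitable for passing to a chat template's `tools` argument.
get_weather_tool = {
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "required": ["city"],
            "properties": {
                "city": {"type": "string", "description": "City name."}
            },
        },
    },
}

# The list of tools is typically serialized into the prompt by the chat template.
tools_json = json.dumps([get_weather_tool])
```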

license:apache-2.0
164
0

Ling-mini-2.0-AWQ-4bit

license:mit
158
2

InternVL3_5-14B-AWQ-8bit

Method vllm-project/llm-compressor and nvidia/Llama-Nemotron-Post-Training-Dataset were used to quantize the original model. For further quantization arguments and configuration information, please visit config.json and recipe.yaml. [\[📂 GitHub\]](https://github.com/OpenGVLab/InternVL) [\[📜 InternVL 1.0\]](https://huggingface.co/papers/2312.14238) [\[📜 InternVL 1.5\]](https://huggingface.co/papers/2404.16821) [\[📜 InternVL 2.5\]](https://huggingface.co/papers/2412.05271) [\[📜 InternVL2.5-MPO\]](https://huggingface.co/papers/2411.10442) [\[📜 InternVL3\]](https://huggingface.co/papers/2504.10479) [\[📜 InternVL3.5\]](https://huggingface.co/papers/2508.18265) [\[🆕 Blog\]](https://internvl.github.io/blog/) [\[🗨️ Chat Demo\]](https://chat.intern-ai.org.cn/) [\[🚀 Quick Start\]](#quick-start) [\[📖 Documents\]](https://internvl.readthedocs.io/en/latest/) We introduce InternVL3.5, a new family of open-source multimodal models that significantly advances versatility, reasoning capability, and inference efficiency along the InternVL series. A key innovation is the Cascade Reinforcement Learning (Cascade RL) framework, which enhances reasoning through a two-stage process: offline RL for stable convergence and online RL for refined alignment. This coarse-to-fine training strategy leads to substantial improvements on downstream reasoning tasks, e.g., MMMU and MathVista. To optimize efficiency, we propose a Visual Resolution Router (ViR) that dynamically adjusts the resolution of visual tokens without compromising performance. Coupled with ViR, our Decoupled Vision-Language Deployment (DvD) strategy separates the vision encoder and language model across different GPUs, effectively balancing computational load. These contributions collectively enable InternVL3.5 to achieve up to a +16.0\% gain in overall reasoning performance and a 4.05 \\(\times\\) inference speedup compared to its predecessor, i.e., InternVL3.
In addition, InternVL3.5 supports novel capabilities such as GUI interaction and embodied agency. Notably, our largest model, i.e., InternVL3.5-241B-A28B, attains state-of-the-art results among open-source MLLMs across general multimodal, reasoning, text, and agentic tasks—narrowing the performance gap with leading commercial models like GPT-5. All models and code are publicly released. > Hatched bars represent closed-source commercial models. We report average scores on a set of multimodal general, reasoning, text, and agentic benchmarks: MMBench v1.1 (en), MMStar,BLINK, HallusionBench, AI2D, OCRBench, MMVet, MME-RealWorld (en), MVBench, VideoMME, MMMU, MathVista, MathVision, MathVerse, DynaMath, WeMath, LogicVista, MATH500, AIME24, AIME25, GPQA, MMLU-Pro, GAOKAO, IFEval, SGP-Bench, VSI-Bench, ERQA, SpaCE-10, and OmniSpatial. In the following table, we provide an overview of the InternVL3.5 series. To maintain consistency with earlier generations, we provide two model formats: the GitHub format, consistent with prior releases, and the HF format, aligned with the official Transformers standard. > If you want to convert the checkpoint between these two formats, please refer to the scripts about custom2hf and hf2custom. 
| Model | #Vision Param | #Language Param | #Total Param | HF Link | ModelScope Link |
| --- | --- | --- | --- | --- | --- |
| InternVL3.5-1B | 0.3B | 0.8B | 1.1B | 🤗 link | 🤖 link |
| InternVL3.5-2B | 0.3B | 2.0B | 2.3B | 🤗 link | 🤖 link |
| InternVL3.5-4B | 0.3B | 4.4B | 4.7B | 🤗 link | 🤖 link |
| InternVL3.5-8B | 0.3B | 8.2B | 8.5B | 🤗 link | 🤖 link |
| InternVL3.5-14B | 0.3B | 14.8B | 15.1B | 🤗 link | 🤖 link |
| InternVL3.5-38B | 5.5B | 32.8B | 38.4B | 🤗 link | 🤖 link |
| InternVL3.5-20B-A4B | 0.3B | 20.9B | 21.2B-A4B | 🤗 link | 🤖 link |
| InternVL3.5-30B-A3B | 0.3B | 30.5B | 30.8B-A3B | 🤗 link | 🤖 link |
| InternVL3.5-241B-A28B | 5.5B | 235.1B | 240.7B-A28B | 🤗 link | 🤖 link |

| Model | #Vision Param | #Language Param | #Total Param | HF Link | ModelScope Link |
| --- | --- | --- | --- | --- | --- |
| InternVL3.5-1B-HF | 0.3B | 0.8B | 1.1B | 🤗 link | 🤖 link |
| InternVL3.5-2B-HF | 0.3B | 2.0B | 2.3B | 🤗 link | 🤖 link |
| InternVL3.5-4B-HF | 0.3B | 4.4B | 4.7B | 🤗 link | 🤖 link |
| InternVL3.5-8B-HF | 0.3B | 8.2B | 8.5B | 🤗 link | 🤖 link |
| InternVL3.5-14B-HF | 0.3B | 14.8B | 15.1B | 🤗 link | 🤖 link |
| InternVL3.5-38B-HF | 5.5B | 32.8B | 38.4B | 🤗 link | 🤖 link |
| InternVL3.5-20B-A4B-HF | 0.3B | 20.9B | 21.2B-A4B | 🤗 link | 🤖 link |
| InternVL3.5-30B-A3B-HF | 0.3B | 30.5B | 30.8B-A3B | 🤗 link | 🤖 link |
| InternVL3.5-241B-A28B-HF | 5.5B | 235.1B | 240.7B-A28B | 🤗 link | 🤖 link |

> We conduct the evaluation with VLMEvalKit. To enable the Thinking mode of our model, please set the system prompt to `R1_SYSTEM_PROMPT`.
When enabling Thinking mode, we recommend setting `do_sample=True` and `temperature=0.6` to mitigate undesired repetition. Our training pipeline comprises four stages: Multimodal Continual Pre-Training (CPT), Supervised Fine-Tuning (SFT), and Cascade Reinforcement Learning (CascadeRL). In CascadeRL, we first fine-tune the model using Mixed Preference Optimization (MPO) under an offline RL setting, followed by GSPO under an online RL setting. For the Flash version of InternVL3.5, we additionally introduce a lightweight training stage, termed Visual Consistency Learning (ViCO), which reduces the token cost required to represent an image patch. Here, we also open-source the model weights after different training stages for potential research usage. If you're unsure which version to use, please select the one without any suffix, as it has completed the full training pipeline.

| Model | Training Pipeline | HF Link | ModelScope Link |
| --- | --- | --- | --- |
| InternVL3.5-1B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-1B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-1B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-1B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |
| InternVL3.5-2B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-2B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-2B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-2B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |
| InternVL3.5-4B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-4B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-4B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-4B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |
| InternVL3.5-8B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-8B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-8B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-8B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |
| InternVL3.5-14B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-14B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-14B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-14B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |
| InternVL3.5-30B-A3B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-30B-A3B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-30B-A3B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-30B-A3B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |
| InternVL3.5-38B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-38B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-38B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-38B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |
| InternVL3.5-241B-A28B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-241B-A28B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-241B-A28B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-241B-A28B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |

The Flash version of our model will be released as soon as possible.

`InternVL3.5`: This series of models follows the "ViT–MLP–LLM" paradigm adopted in previous versions of InternVL. We initialize the language model using the Qwen3 series and GPT-OSS, and the vision encoder using InternViT-300M and InternViT-6B. The Dynamic High Resolution strategy introduced in InternVL1.5 is also retained in our design.

`InternVL3.5-Flash`: Compared to InternVL3.5, InternVL3.5-Flash further integrates the Visual Resolution Router (ViR), yielding a series of efficient variants suitable for resource-constrained scenarios. Specifically, in InternVL3.5, each image patch is initially represented as 1024 visual tokens for the vision encoder, which are then compressed into 256 tokens via a pixel shuffle module before being passed to the Large Language Model (LLM).
In InternVL3.5-Flash, as shown in the Figure below, an additional pixel shuffle module with a higher compression rate is included, enabling the compression of visual tokens down to 64 tokens. For each patch, the patch router determines the appropriate compression rate by assessing its semantic richness, and routes it to the corresponding pixel shuffle module accordingly. Benefiting from this patch-aware compression mechanism, InternVL3.5-Flash is able to reduce the number of visual tokens by 50\% while maintaining nearly 100\% of the performance of InternVL3.5. During the pre-training stage, we update all model parameters jointly using the combination of large-scale text and multimodal corpora. Specifically, given an arbitrary training sample consisting of a multimodal token sequence \\(\mathbf{x}=\left(x_1, x_2, \ldots, x_L\right)\\), the next token prediction (NTP) loss is calculated on each text token as follows:

$$ \mathcal{L}_{i}=-\log p_{\theta}\left(x_i \mid x_1, \ldots, x_{i-1}\right), $$

where \\(x_i\\) is the predicted token and prefix tokens in \\(\{x_1, x_2, \ldots, x_{i-1}\}\\) can be either text tokens or image tokens. Notably, for conversation samples, only response tokens are included for the calculation of the loss. Additionally, to mitigate bias toward either longer or shorter responses during training, we adopt square-root averaging to re-weight the NTP loss as follows:

$$ \mathcal{L}_{i}^{'} = \frac{w_i}{\sum_j w_j} \cdot \mathcal{L}_i, \quad w_i = \frac{1}{N^{0.5}}, $$

where \\(N\\) denotes the number of tokens in the training sample on which the loss needs to be calculated. Random JPEG compression is also included to enhance the model's real-world performance. During the SFT phase, we adopt the same objective as in the pre-training stage and use the square-root averaging strategy to calculate the final loss. In this stage, the context window is set to 32K tokens to accommodate long-context information.
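The square-root re-weighting above can be sketched directly from the formula: every token in a sample with N loss tokens carries weight 1/N^0.5, and weights are normalized across the whole batch. A minimal sketch (function name ours):

```python
import math

def reweighted_ntp_loss(sample_losses):
    """Square-root averaged NTP loss over a batch.

    `sample_losses` is a list of samples, each a list of per-token NTP
    losses L_i. Every token in a sample with N loss tokens gets weight
    w_i = 1 / N**0.5; weights are normalized over all tokens j in the
    batch, so neither very long nor very short responses dominate.
    """
    weights, losses = [], []
    for toks in sample_losses:
        n = len(toks)
        for L in toks:
            weights.append(1.0 / math.sqrt(n))
            losses.append(L)
    total_w = sum(weights)
    # L_i' = (w_i / sum_j w_j) * L_i, summed over all tokens
    return sum(w / total_w * L for w, L in zip(weights, losses))

batch_loss = reweighted_ntp_loss([[1.0, 1.0]])  # equal weights -> 1.0
```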
Compared to InternVL3, the SFT stage of InternVL3.5 contains more high-quality and diverse training data derived from three sources: (1) Instruction-following data from InternVL3, which are reused to preserve broad coverage of vision–language tasks. (2) Multimodal reasoning data in the "Thinking" mode, which are included to instill long-thinking capabilities in the model. To construct such data, we first use InternVL3-78B to describe the image and then input the description into DeepSeek-R1 to sample rollouts with detailed reasoning processes. Rollouts with an incorrect final answer are filtered out. The questions in these datasets cover various expert domains, such as mathematics and scientific disciplines, thereby strengthening performance on different reasoning tasks. (3) Capability-expansion datasets, which endow InternVL3.5 with new skills, including GUI-based interaction, embodied interaction, and scalable vector graphics (SVG) understanding and generation.

Cascade RL aims to combine the benefits of offline RL and online RL to progressively facilitate the post-training of MLLMs in an efficient manner. Specifically, we first fine-tune the model using an offline RL algorithm as an efficient warm-up stage to reach satisfactory results, which guarantees high-quality rollouts for the later stage. Subsequently, we employ an online RL algorithm to further refine the output distribution based on rollouts generated by the model itself. Compared to a single offline or online RL stage, our cascaded RL achieves significant performance improvements at a fraction of the GPU time cost. During the offline RL stage, we employ mixed preference optimization (MPO) to fine-tune the model.
Specifically, the training objective of MPO is a combination of preference loss \\(\mathcal{L}_{p}\\), quality loss \\(\mathcal{L}_{q}\\), and generation loss \\(\mathcal{L}_{g}\\), which can be formulated as follows:

$$ \mathcal{L}_{\text{MPO}}= w_{p} \mathcal{L}_{p} + w_{q} \mathcal{L}_{q} + w_{g} \mathcal{L}_{g}, $$

where \\(w_{*}\\) represents the weight assigned to each loss component. The DPO loss, BCO loss, and LM loss serve as the preference loss, quality loss, and generation loss, respectively. During the online RL stage, we employ GSPO, without reference model constraints, as our online RL algorithm, which we find more effective in training both dense and mixture-of-experts (MoE) models. Similar to GRPO, the advantage is defined as the normalized reward across responses sampled from the same query. The training objective of GSPO is given by:

$$ \mathcal{L}_{\mathrm{GSPO}}(\theta)=\mathbb{E}_{x \sim \mathcal{D},\,\left\{y_i\right\}_{i=1}^{G} \sim \pi_{\theta_{\text{old}}}(\cdot \mid x)}\left[\frac{1}{G} \sum_{i=1}^{G} \min \left(s_i(\theta) \widehat{A}_i,\ \operatorname{clip}\left(s_i(\theta), 1-\varepsilon, 1+\varepsilon\right) \widehat{A}_i\right)\right], $$

where the importance sampling ratio \\(s_i(\theta)\\) is defined as the geometric mean of the per-token ratios. > Please see our paper for more technical and experimental details. We further include ViCO as an additional training stage to integrate the visual resolution router (ViR) into InternVL3.5, thereby reducing the inference cost of InternVL3.5. The resulting efficient version of InternVL3.5 is termed InternVL3.5-Flash. In particular, ViCO comprises two stages: `Consistency training`: In this stage, the entire model is trained to minimize the divergence between response distributions conditioned on visual tokens with different compression rates. In practice, we introduce an extra reference model, which is frozen and initialized with InternVL3.5.
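The sequence-level importance ratio and the clipped surrogate above can be sketched in a few lines. This is an illustrative sketch, not the training code; function names are ours, and the geometric mean is computed in log space for numerical stability:

```python
import math

def gspo_ratio(logp_new, logp_old):
    """Sequence importance ratio s_i(theta): the geometric mean of the
    per-token probability ratios, computed as exp(mean(logp_new - logp_old))."""
    n = len(logp_new)
    return math.exp(sum(a - b for a, b in zip(logp_new, logp_old)) / n)

def gspo_token_objective(s, advantage, eps=0.2):
    """Clipped surrogate for one response: min(s*A, clip(s, 1-eps, 1+eps)*A)."""
    clipped = max(min(s, 1 + eps), 1 - eps)
    return min(s * advantage, clipped * advantage)

# Identical old and new policies give per-token log-ratios of zero,
# so the sequence ratio is exactly 1.
s = gspo_ratio([-1.0, -2.0], [-1.0, -2.0])
```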
Given a sample, each image patch is represented as either 256 or 64 tokens, and the training objective is defined as follows:

$$ \mathcal{L}_{\text{ViCO}} = \mathbb{E}_{\xi \sim \mathcal{R}} \left[ \frac{1}{N} \sum_{i=1}^{N} \mathrm{KL}\Big( \pi_{\theta_{\text{ref}}}\left(y_i \mid y_{<i}\right) \,\Big\|\, \pi_{\theta}\left(y_i \mid y_{<i}\right) \Big) \right], $$

where \\(\xi\\) denotes the compression rate sampled from \\(\mathcal{R}\\) and \\(N\\) is the number of response tokens.

> Please see our paper for more technical and experimental details.

Test-time scaling (TTS) has been empirically demonstrated to be an effective approach for enhancing the reasoning capabilities of LLMs and MLLMs, particularly for complex tasks necessitating multi-step inference. In this work, we implement a comprehensive test-time scaling approach that simultaneously improves reasoning depth (i.e., deep thinking) and breadth (i.e., parallel thinking).

`Deep Thinking`: By activating the Thinking mode, we guide the model to deliberately engage in step-by-step reasoning (i.e., decomposing complex problems into logical steps and validating intermediate conclusions) prior to generating the final answer. This approach systematically improves the logical structure of solutions for complex problems, particularly those requiring multi-step inference, and enhances reasoning depth.

`Parallel Thinking`: Following InternVL3, for reasoning tasks, we adopt the Best-of-N (BoN) strategy and employ VisualPRM-v1.1 as the critic model to select the optimal response from multiple reasoning candidates. This approach improves reasoning breadth.

> Notably, unless otherwise specified, the experimental results reported in our paper are obtained without applying TTS. Thus far, we have only applied TTS to reasoning benchmarks, since we found that the model already exhibits strong perception and understanding capabilities, and applying TTS there yields no significant improvement.

In multimodal inference, the vision encoder and language model have distinct computational characteristics. The vision encoder, which transforms images into semantic features, is highly parallelizable and does not rely on long-term history states.
In contrast, the language model performs inference in an autoregressive manner, which requires previous states to compute the next one. This sequential property makes the language part more sensitive to memory bandwidth and latency. When MLLMs are deployed online at scale, the vision and language models often block each other, incurring additional inference cost. This effect becomes more pronounced with larger vision models or higher-resolution images.

As shown in the Figure above, we propose decoupled vision-language deployment (DvD) to address this issue by separating vision and language processing, with a particular focus on optimizing the prefilling stage. The vision subsystem batches and processes images to produce compact feature embeddings, which are then transmitted to the language subsystem for fusion with the text context prior to decoding. This separation alleviates blocking and brings multimodal prefilling performance closer to that of pure language models.

In our system implementation, the ViT and MLP (and ViR for InternVL3.5-Flash) are deployed on the vision server, while the language server executes only the LLM. The communication is unidirectional, transmitting BF16 visual features over TCP, with RDMA optionally employed to achieve higher transmission speed. Vision processing, feature transmission, and language processing are organized into an asynchronous three-stage pipeline, enabling overlapped execution and minimizing pipeline stalls.

DvD increases GPU utilization and processing efficiency on the vision side, while enabling the language server to focus exclusively on the LLM's prefilling and decoding without being blocked by vision computation. This design leads to improved throughput and responsiveness. Moreover, the architecture supports independent hardware cost optimization for the vision and language modules, and facilitates the seamless integration of new modules without requiring modifications to the language server deployment.
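The asynchronous three-stage pipeline described above can be sketched with standard queues and threads. This is a minimal simulation, not the actual DvD implementation: the encode and prefill steps are stand-in functions, and a `queue.Queue` stands in for the TCP/RDMA feature link between the vision and language servers.

```python
import queue
import threading

# Stand-in for the unidirectional feature link (TCP/RDMA in the real system).
feature_link = queue.Queue()
results = []

def vision_server(images):
    """Stage 1: batch-encode images into feature embeddings (stand-in for ViT+MLP)."""
    for img in images:
        feature_link.put((img, f"features({img})"))
    feature_link.put(None)  # end-of-stream marker

def language_server():
    """Stage 3: fuse received features with text context and decode (stand-in for the LLM)."""
    while True:
        item = feature_link.get()
        if item is None:
            break
        img, feats = item
        results.append(f"answer for {img} using {feats}")

images = ["img0", "img1", "img2"]
t_vision = threading.Thread(target=vision_server, args=(images,))
t_language = threading.Thread(target=language_server)
t_vision.start(); t_language.start()
t_vision.join(); t_language.join()
print(len(results))  # 3
```

Because the two stages run in separate threads connected by a queue, the language side never blocks on vision computation for later images, which is the property DvD exploits at scale.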
Multi-Image Understanding & Real-World Comprehension · Comprehensive Multimodal Understanding & Multimodal Hallucination Evaluation

We provide example code to run `InternVL3.5-8B` using `transformers`. Please note that our models with up to 30B parameters can be deployed on a single A100 GPU, while the 38B model requires two A100 GPUs and the 235B model requires eight A100 GPUs.

> In most cases, both LMDeploy and vLLM can be used for model deployment. However, for InternVL3.5-20B-A4B, we recommend using vLLM, since LMDeploy does not yet support GPT-OSS.

> Please use transformers>=4.52.1 to ensure the model works normally. For the 20B version of our model, transformers>=4.55.0 is required.

To enable thinking mode, please set the system prompt to our Thinking System Prompt. When enabling Thinking mode, we recommend setting `do_sample=True` and `temperature=0.6` to mitigate undesired repetition. Besides this method, you can also use the following code to get streamed output.

Many repositories now support fine-tuning of the InternVL series models, including InternVL, SWIFT, XTuner, and others. Please refer to their documentation for more details on fine-tuning.

LMDeploy is a toolkit for compressing, deploying, and serving LLMs & VLMs. LMDeploy abstracts the complex inference process of multimodal Vision-Language Models (VLMs) into an easy-to-use pipeline, similar to the Large Language Model (LLM) inference pipeline. When dealing with multiple images, you can put them all in one list. Keep in mind that multiple images will lead to a higher number of input tokens, and as a result, the size of the context window typically needs to be increased. Conducting inference with batch prompts is quite straightforward; just place them within a list structure: There are two ways to conduct multi-turn conversations with the pipeline. One is to construct messages according to the OpenAI format and use the method introduced above; the other is to use the `pipeline.chat` interface.
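As a minimal sketch of the first option (constructing OpenAI-format messages for a multi-turn conversation), the contents below are placeholders and the actual `lmdeploy` pipeline call is omitted; only the message structure is shown:

```python
# OpenAI-format multi-turn message construction (placeholder contents;
# the lmdeploy pipeline call itself is omitted).
messages = [
    {"role": "user", "content": [
        {"type": "text", "text": "Describe the image."},
        {"type": "image_url", "image_url": {"url": "https://example.com/a.jpg"}},
    ]},
]

# Suppose the pipeline returned this reply; append it before the next turn
# so the model sees the full conversation history.
messages.append({"role": "assistant", "content": "A cat on a sofa."})
messages.append({"role": "user", "content": "What color is the cat?"})

roles = [m["role"] for m in messages]
print(roles)
```

The key point is that each new user turn is appended after the previous assistant reply, so the list alternates roles across the whole conversation.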
LMDeploy's `api_server` enables models to be easily packed into services with a single command. The provided RESTful APIs are compatible with OpenAI's interfaces. Below is an example of service startup: To use the OpenAI-style interface, you need to install the OpenAI Python package: This project is released under the Apache-2.0 License. This project uses the pre-trained Qwen3 as a component, which is also licensed under the Apache-2.0 License. If you find this project useful in your research, please consider citing:

license:apache-2.0

Qwen3-VL-32B-Thinking-AWQ-8bit

- Quantization Method: AWQ
- Bits: 8
- Group Size: 32
- Calibration Dataset: 5CD-AI/LLaVA-CoT-o1-Instruct
- Quantization Tool: llm-compressor

Meet Qwen3-VL — the most powerful vision-language model in the Qwen series to date. This generation delivers comprehensive upgrades across the board: superior text understanding & generation, deeper visual perception & reasoning, extended context length, enhanced spatial and video dynamics comprehension, and stronger agent interaction capabilities. Available in Dense and MoE architectures that scale from edge to cloud, with Instruct and reasoning‑enhanced Thinking editions for flexible, on‑demand deployment.

- Visual Agent: Operates PC/mobile GUIs—recognizes elements, understands functions, invokes tools, completes tasks.
- Visual Coding Boost: Generates Draw.io/HTML/CSS/JS from images/videos.
- Advanced Spatial Perception: Judges object positions, viewpoints, and occlusions; provides stronger 2D grounding and enables 3D grounding for spatial reasoning and embodied AI.
- Long Context & Video Understanding: Native 256K context, expandable to 1M; handles books and hours-long video with full recall and second-level indexing.
- Enhanced Multimodal Reasoning: Excels in STEM/Math—causal analysis and logical, evidence-based answers.
- Upgraded Visual Recognition: Broader, higher-quality pretraining enables the model to "recognize everything"—celebrities, anime, products, landmarks, flora/fauna, etc.
- Expanded OCR: Supports 32 languages (up from 19); robust in low light, blur, and tilt; better with rare/ancient characters and jargon; improved long-document structure parsing.
- Text Understanding on par with pure LLMs: Seamless text–vision fusion for lossless, unified comprehension.

1. Interleaved-MRoPE: Full‑frequency allocation over time, width, and height via robust positional embeddings, enhancing long‑horizon video reasoning.
2. DeepStack: Fuses multi‑level ViT features to capture fine‑grained details and sharpen image–text alignment.
3. Text–Timestamp Alignment: Moves beyond T‑RoPE to precise, timestamp‑grounded event localization for stronger video temporal modeling.

This is the weight repository for Qwen3-VL-32B-Thinking. Below, we provide simple examples to show how to use Qwen3-VL with 🤖 ModelScope and 🤗 Transformers. The code of Qwen3-VL has been merged into the latest Hugging Face transformers, and we advise you to build from source with the following command. Here we show a code snippet demonstrating how to use the chat model with `transformers`. If you find our work helpful, feel free to give us a cite.
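As a minimal sketch of the input side of such a chat snippet: Qwen-style vision-language chat templates consume OpenAI-style messages whose content interleaves image and text parts. The URL below is a placeholder, and the processor/model calls themselves are omitted here since they require downloading the checkpoint.

```python
# Hypothetical message structure passed to the processor's chat template
# (placeholder image URL; generation calls omitted).
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://example.com/demo.jpg"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

def count_parts(msgs, part_type):
    """Count content parts of a given type across all messages."""
    return sum(
        1
        for m in msgs
        for part in m["content"]
        if part["type"] == part_type
    )

print(count_parts(messages, "image"), count_parts(messages, "text"))
```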

license:apache-2.0

Ring-mini-2.0-AWQ-4bit

license:mit

Apertus-8B-Instruct-2509-GPTQ-4bit

license:apache-2.0

gpt-oss-20b-BF16

license:apache-2.0

Qwen3-Omni-30B-A3B-Thinking-AWQ-8bit

Qwen3-Omni is a natively end-to-end multilingual omni-modal foundation model. It processes text, images, audio, and video, and delivers real-time streaming responses in both text and natural speech. We introduce several architectural upgrades to improve performance and efficiency. Key features:

- State-of-the-art across modalities: Early text-first pretraining and mixed multimodal training provide native multimodal support. While achieving strong audio and audio-video results, unimodal text and image performance does not regress. Reaches SOTA on 22 of 36 audio/video benchmarks and open-source SOTA on 32 of 36; ASR, audio understanding, and voice conversation performance is comparable to Gemini 2.5 Pro.
- Multilingual: Supports 119 text languages, 19 speech input languages, and 10 speech output languages.
  - Speech Input: English, Chinese, Korean, Japanese, German, Russian, Italian, French, Spanish, Portuguese, Malay, Dutch, Indonesian, Turkish, Vietnamese, Cantonese, Arabic, Urdu.
  - Speech Output: English, Chinese, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean.
- Novel Architecture: MoE-based Thinker–Talker design with AuT pretraining for strong general representations, plus a multi-codebook design that drives latency to a minimum.
- Real-time Audio/Video Interaction: Low-latency streaming with natural turn-taking and immediate text or speech responses.
- Flexible Control: Customize behavior via system prompts for fine-grained control and easy adaptation.
- Detailed Audio Captioner: Qwen3-Omni-30B-A3B-Captioner is now open source: a general-purpose, highly detailed, low-hallucination audio captioning model that fills a critical gap in the open-source community.

Qwen3-Omni supports a wide range of multimodal application scenarios, covering various domain tasks involving audio, image, video, and audio-visual modalities. Below are several cookbooks demonstrating the usage of Qwen3-Omni; these cookbooks include our actual execution logs.
You can first follow the QuickStart guide to download the model and install the necessary inference environment dependencies, then run and experiment locally—try modifying prompts or switching model types, and enjoy exploring the capabilities of Qwen3-Omni!

- Audio
  - Speech Recognition: speech recognition, supporting multiple languages and long audio.
  - Speech Translation: speech-to-text / speech-to-speech translation.
  - Music Analysis: detailed analysis and appreciation of any music, including style, genre, rhythm, etc.
  - Sound Analysis: description and analysis of various sound effects and audio signals.
  - Audio Caption: audio captioning, detailed description of any audio input.
  - Mixed Audio Analysis: analysis of mixed audio content, such as speech, music, and environmental sounds.
- Image
  - Image Question Answering: answering arbitrary questions about any image.
  - Image Math: solving complex mathematical problems in images, highlighting the capabilities of the Thinking model.
- Video
  - Video Description: detailed description of video content.
  - Video Navigation: generating navigation commands from first-person motion videos.
  - Video Scene Transition: analysis of scene transitions in videos.
- Audio-Visual
  - Audio Visual Question Answering: answering arbitrary questions in audio-visual scenarios, demonstrating the model's ability to model temporal alignment between audio and video.
  - Audio Visual Interaction: interactive communication with the model using audio-visual inputs, including task specification via audio.
  - Audio Visual Dialogue: conversational interaction with the model using audio-visual inputs, showcasing its capabilities in casual chat and assistant-like behavior.
- Agent
  - Audio Function Call: using audio input to perform function calls, enabling agent-like behaviors.
- Downstream Task Fine-tuning
  - Omni Captioner: introduction and capability demonstration of Qwen3-Omni-30B-A3B-Captioner, a downstream fine-tuned model based on Qwen3-Omni-30B-A3B-Instruct, illustrating the strong generalization ability of the Qwen3-Omni foundation model.
Below is a description of all Qwen3-Omni models. Please select and download the model that fits your needs.

| Model Name | Description |
|------------------------------|-------------|
| Qwen3-Omni-30B-A3B-Instruct | The Instruct model of Qwen3-Omni-30B-A3B, containing both the thinker and talker, supporting audio, video, and text input, with audio and text output. For more information, please read the Qwen3-Omni Technical Report. |
| Qwen3-Omni-30B-A3B-Thinking | The Thinking model of Qwen3-Omni-30B-A3B, containing the thinker component, equipped with chain-of-thought reasoning, supporting audio, video, and text input, with text output. For more information, please read the Qwen3-Omni Technical Report. |
| Qwen3-Omni-30B-A3B-Captioner | A downstream audio fine-grained caption model fine-tuned from Qwen3-Omni-30B-A3B-Instruct, which produces detailed, low-hallucination captions for arbitrary audio inputs. It contains the thinker, supporting audio input and text output. For more information, you can refer to the model's cookbook. |

During loading in Hugging Face Transformers or vLLM, model weights will be automatically downloaded based on the model name. However, if your runtime environment cannot download weights during execution, you can refer to the following commands to manually download the model weights to a local directory: The Hugging Face Transformers code for Qwen3-Omni has been merged, but the PyPI package has not yet been released; therefore, you need to install it from source using the following command. We strongly recommend that you create a new Python environment to avoid runtime issues. We offer a toolkit to help you handle various types of audio and visual input more conveniently, providing an API-like experience. It supports base64, URLs, and interleaved audio, images, and videos.
You can install it using the following command; make sure your system has `ffmpeg` installed: Additionally, we recommend using FlashAttention 2 when running with Hugging Face Transformers to reduce GPU memory usage. However, if you are primarily using vLLM for inference, this installation is not necessary, as vLLM includes FlashAttention 2 by default. You should also have hardware that is compatible with FlashAttention 2; read more about this in the official documentation of the FlashAttention repository. FlashAttention 2 can only be used when a model is loaded in `torch.float16` or `torch.bfloat16`.

Here is a code snippet to show you how to use Qwen3-Omni with `transformers` and `qwen_omni_utils`: Here are some more advanced usage examples. You can expand the sections below to learn more.

The model can batch inputs composed of mixed samples of various types, such as text, images, audio, and videos, when `return_audio=False` is set. Here is an example.

The model supports both text and audio outputs. If users do not need audio outputs, they can call `model.disable_talker()` after initializing the model. This option saves about `10GB` of GPU memory, but the `return_audio` option of the `generate` function will then only allow `False`. For a more flexible experience, we recommend that users decide whether to return audio when the `generate` function is called. If `return_audio` is set to `False`, the model will only return text outputs, resulting in faster text responses.

Qwen3-Omni supports changing the voice of the output audio. The `"Qwen/Qwen3-Omni-30B-A3B-Instruct"` checkpoint supports three voice types, as follows:

| Voice Type | Gender | Description |
|------------|--------|-------------|
| Ethan | Male | A bright, upbeat voice with infectious energy and a warm, approachable vibe. |
| Chelsie | Female | A honeyed, velvety voice that carries a gentle warmth and luminous clarity. |
| Aiden | Male | A warm, laid-back American voice with a gentle, boyish charm. |
Users can use the `speaker` parameter of the `generate` function to specify the voice type. By default, if `speaker` is not specified, the voice type is `Ethan`.

We strongly recommend using vLLM for inference and deployment of the Qwen3-Omni series models. Since our code is currently in the pull request stage, and audio output inference support for the Instruct model will be released in the near future, you can follow the commands below to install vLLM from source. Please note that we recommend you create a new Python environment to avoid runtime environment conflicts and incompatibilities. For more details on compiling vLLM from source, please refer to the vLLM official documentation.

You can use the following code for vLLM inference. The `limit_mm_per_prompt` parameter specifies the maximum number of items of each modality allowed per message. Since vLLM needs to pre-allocate GPU memory, larger values require more GPU memory; if OOM issues occur, try reducing this value. Setting `tensor_parallel_size` greater than one enables multi-GPU parallel inference, improving concurrency and throughput. In addition, `max_num_seqs` indicates the number of sequences that vLLM processes in parallel during each inference step. A larger value requires more GPU memory but enables higher batch inference speed. For more details, please refer to the vLLM official documentation.

Below is a simple example of how to run Qwen3-Omni with vLLM: Here are some more advanced usage examples. You can expand the sections below to learn more. Using vLLM enables fast batch inference, which can help you efficiently process large volumes of data or conduct benchmarking. Refer to the following code example: vLLM serve for Qwen3-Omni currently only supports the thinker model. The `use_audio_in_video` parameter is not available in vLLM serve; you can handle this by separately passing video and audio inputs for processing.
You can start vLLM serve through the following command: Then you can use the chat API as below (via curl, for example):

| Model | Precision | 15s Video | 30s Video | 60s Video | 120s Video |
|------------------------------|-----------|-----------|-----------|-----------|------------|
| Qwen3-Omni-30B-A3B-Instruct | BF16 | 78.85 GB | 88.52 GB | 107.74 GB | 144.81 GB |
| Qwen3-Omni-30B-A3B-Thinking | BF16 | 68.74 GB | 77.79 GB | 95.76 GB | 131.65 GB |

Note: The table above presents the theoretical minimum memory requirements for inference with `transformers` and `BF16` precision, tested with `attn_implementation="flash_attention_2"`. The Instruct model includes both the thinker and talker components, whereas the Thinking model includes only the thinker part.

When using Qwen3-Omni for audio-visual multimodal interaction, where the input consists of a video and its corresponding audio (with the audio serving as a query), we recommend using the following system prompt. This setup helps the model maintain high reasoning capability while better assuming interactive roles such as a smart assistant. Additionally, the text generated by the thinker will be more readable, with a natural, conversational tone and without complex formatting that is difficult to vocalize, leading to more stable and fluent audio output from the talker. You can customize the `user_system_prompt` field in the system prompt to include character settings or other role-specific descriptions as needed.

The `Qwen3-Omni-30B-A3B-Thinking` model is primarily designed for understanding and interacting with multimodal inputs, including text, audio, image, and video. To achieve optimal performance, we recommend that users include an explicit textual instruction or task description in each round of dialogue alongside the multimodal input. This helps clarify the intent and significantly enhances the model's ability to leverage its reasoning capabilities.
For example: In multimodal interaction, user-provided videos are often accompanied by audio (such as spoken questions or sounds from events in the video). This information helps the model provide a better interactive experience. We provide the following options for users to decide whether to use the audio from a video. It is worth noting that during a multi-round conversation, the `use_audio_in_video` parameter must be set consistently across these steps; otherwise, unexpected results may occur.

Qwen3-Omni maintains state-of-the-art performance on text and visual modalities without degradation relative to same-size single-model Qwen counterparts. Across 36 audio and audio-visual benchmarks, it achieves open-source SOTA on 32 and sets the SOTA on 22, outperforming strong closed-source systems such as Gemini 2.5 Pro and GPT-4o.

[Detailed benchmark tables (text and multilingual tasks, ASR, speech translation, music and audio understanding, and audio-visual results versus GPT-4o, Gemini 2.5, Qwen2.5-Omni, and specialist baselines) are omitted here; please refer to the original model card for the full numbers.]

Decoding Strategy: For the Qwen3-Omni series across all evaluation benchmarks, `Instruct` models use greedy decoding during generation without sampling. For `Thinking` models, the decoding parameters should be taken from the `generation_config.json` file in the checkpoint.

Benchmark-Specific Formatting: Most evaluation benchmarks come with their own ChatML formatting to embed the question or prompt. It should be noted that all video data are set to `fps=2` during evaluation.
Default Prompts: For tasks in certain benchmarks that do not include a prompt, we use the following prompt settings:

| Task Type | Prompt |
| :--- | :--- |
| Auto Speech Recognition (ASR) for Chinese | 请将这段中文语音转换为纯文本。 |
| Auto Speech Recognition (ASR) for other languages | Transcribe the audio into text. |
| Speech-to-Text Translation (S2TT) | Listen to the provided speech and produce a translation in text. |
| Song Lyrics Recognition | Transcribe the song lyrics into text without any punctuation, separate lines with line breaks, and output only the lyrics without additional explanations. |

System Prompt: No `system prompt` should be set for any evaluation benchmark.

Input Sequence: The question or prompt should be input as user text. Unless otherwise specified by the benchmark, the text should come after the multimodal data in the sequence. For example:
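Continuing the guidance on input ordering, a hypothetical user message that places the multimodal data before the text could look like the following; the field names follow the OpenAI-style multimodal content format, and the audio path is a placeholder:

```python
# Hypothetical message with multimodal data first, then the text prompt.
message = {
    "role": "user",
    "content": [
        # Multimodal data comes first in the sequence ...
        {"type": "audio", "audio": "placeholder.wav"},
        # ... followed by the question or prompt as user text.
        {"type": "text", "text": "Transcribe the audio into text."},
    ],
}

types = [part["type"] for part in message["content"]]
print(types)
```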


Qwen3-VL-4B-Thinking-AWQ-8bit

license:apache-2.0

Ling-flash-2.0-AWQ-4bit

license:mit

InternVL3_5-38B-AWQ-8bit

Method

vllm-project/llm-compressor and nvidia/Llama-Nemotron-Post-Training-Dataset were used to quantize the original model. For further information on the quantization arguments and configuration, please visit config.json and recipe.yaml.

[\[📂 GitHub\]](https://github.com/OpenGVLab/InternVL) [\[📜 InternVL 1.0\]](https://huggingface.co/papers/2312.14238) [\[📜 InternVL 1.5\]](https://huggingface.co/papers/2404.16821) [\[📜 InternVL 2.5\]](https://huggingface.co/papers/2412.05271) [\[📜 InternVL2.5-MPO\]](https://huggingface.co/papers/2411.10442) [\[📜 InternVL3\]](https://huggingface.co/papers/2504.10479) [\[📜 InternVL3.5\]](https://huggingface.co/papers/2508.18265) [\[🆕 Blog\]](https://internvl.github.io/blog/) [\[🗨️ Chat Demo\]](https://chat.intern-ai.org.cn/) [\[🚀 Quick Start\]](#quick-start) [\[📖 Documents\]](https://internvl.readthedocs.io/en/latest/)

We introduce InternVL3.5, a new family of open-source multimodal models that significantly advances versatility, reasoning capability, and inference efficiency along the InternVL series. A key innovation is the Cascade Reinforcement Learning (Cascade RL) framework, which enhances reasoning through a two-stage process: offline RL for stable convergence and online RL for refined alignment. This coarse-to-fine training strategy leads to substantial improvements on downstream reasoning tasks, e.g., MMMU and MathVista. To optimize efficiency, we propose a Visual Resolution Router (ViR) that dynamically adjusts the resolution of visual tokens without compromising performance. Coupled with ViR, our Decoupled Vision-Language Deployment (DvD) strategy separates the vision encoder and language model across different GPUs, effectively balancing computational load. These contributions collectively enable InternVL3.5 to achieve up to a +16.0\% gain in overall reasoning performance and a 4.05 \\(\times\\) inference speedup compared to its predecessor, i.e., InternVL3.
In addition, InternVL3.5 supports novel capabilities such as GUI interaction and embodied agency. Notably, our largest model, i.e., InternVL3.5-241B-A28B, attains state-of-the-art results among open-source MLLMs across general multimodal, reasoning, text, and agentic tasks—narrowing the performance gap with leading commercial models like GPT-5. All models and code are publicly released.

> Hatched bars represent closed-source commercial models. We report average scores on a set of multimodal general, reasoning, text, and agentic benchmarks: MMBench v1.1 (en), MMStar, BLINK, HallusionBench, AI2D, OCRBench, MMVet, MME-RealWorld (en), MVBench, VideoMME, MMMU, MathVista, MathVision, MathVerse, DynaMath, WeMath, LogicVista, MATH500, AIME24, AIME25, GPQA, MMLU-Pro, GAOKAO, IFEval, SGP-Bench, VSI-Bench, ERQA, SpaCE-10, and OmniSpatial.

In the following table, we provide an overview of the InternVL3.5 series. To maintain consistency with earlier generations, we provide two model formats: the GitHub format, consistent with prior releases, and the HF format, aligned with the official Transformers standard.

> If you want to convert a checkpoint between these two formats, please refer to the custom2hf and hf2custom scripts.
| Model | #Vision Param | #Language Param | #Total Param | HF Link | ModelScope Link |
| --------------------- | ------------- | --------------- | ------------ | ------- | --------------- |
| InternVL3.5-1B | 0.3B | 0.8B | 1.1B | 🤗 link | 🤖 link |
| InternVL3.5-2B | 0.3B | 2.0B | 2.3B | 🤗 link | 🤖 link |
| InternVL3.5-4B | 0.3B | 4.4B | 4.7B | 🤗 link | 🤖 link |
| InternVL3.5-8B | 0.3B | 8.2B | 8.5B | 🤗 link | 🤖 link |
| InternVL3.5-14B | 0.3B | 14.8B | 15.1B | 🤗 link | 🤖 link |
| InternVL3.5-38B | 5.5B | 32.8B | 38.4B | 🤗 link | 🤖 link |
| InternVL3.5-20B-A4B | 0.3B | 20.9B | 21.2B-A4B | 🤗 link | 🤖 link |
| InternVL3.5-30B-A3B | 0.3B | 30.5B | 30.8B-A3B | 🤗 link | 🤖 link |
| InternVL3.5-241B-A28B | 5.5B | 235.1B | 240.7B-A29B | 🤗 link | 🤖 link |

| Model | #Vision Param | #Language Param | #Total Param | HF Link | ModelScope Link |
| ------------------------ | ------------- | --------------- | ------------ | ------- | --------------- |
| InternVL3.5-1B-HF | 0.3B | 0.8B | 1.1B | 🤗 link | 🤖 link |
| InternVL3.5-2B-HF | 0.3B | 2.0B | 2.3B | 🤗 link | 🤖 link |
| InternVL3.5-4B-HF | 0.3B | 4.4B | 4.7B | 🤗 link | 🤖 link |
| InternVL3.5-8B-HF | 0.3B | 8.2B | 8.5B | 🤗 link | 🤖 link |
| InternVL3.5-14B-HF | 0.3B | 14.8B | 15.1B | 🤗 link | 🤖 link |
| InternVL3.5-38B-HF | 5.5B | 32.8B | 38.4B | 🤗 link | 🤖 link |
| InternVL3.5-20B-A4B-HF | 0.3B | 20.9B | 21.2B-A4B | 🤗 link | 🤖 link |
| InternVL3.5-30B-A3B-HF | 0.3B | 30.5B | 30.8B-A3B | 🤗 link | 🤖 link |
| InternVL3.5-241B-A28B-HF | 5.5B | 235.1B | 240.7B-A29B | 🤗 link | 🤖 link |

> We conduct the evaluation with VLMEvalKit. To enable the Thinking mode of our model, please set the system prompt to R1_SYSTEM_PROMPT.
When enabling Thinking mode, we recommend setting `do_sample=True` and `temperature=0.6` to mitigate undesired repetition.

Our training pipeline comprises Multimodal Continual Pre-Training (CPT), Supervised Fine-Tuning (SFT), and Cascade Reinforcement Learning (CascadeRL). In CascadeRL, we first fine-tune the model using Mixed Preference Optimization (MPO) under an offline RL setting, followed by GSPO under an online RL setting. For the Flash version of InternVL3.5, we additionally introduce a lightweight training stage, termed Visual Consistency Learning (ViCO), which reduces the token cost required to represent an image patch.

Here, we also open-source the model weights after different training stages for potential research usage. If you're unsure which version to use, please select the one without any suffix, as it has completed the full training pipeline.

| Model | Training Pipeline | HF Link | ModelScope Link |
| -------------------------------- | --------------------- | ------- | --------------- |
| InternVL3.5-1B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-1B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-1B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-1B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |
| InternVL3.5-2B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-2B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-2B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-2B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |
| InternVL3.5-4B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-4B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-4B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-4B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |
| InternVL3.5-8B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-8B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-8B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-8B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |
| InternVL3.5-14B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-14B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-14B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-14B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |
| InternVL3.5-30B-A3B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-30B-A3B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-30B-A3B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-30B-A3B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |
| InternVL3.5-38B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-38B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-38B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-38B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |
| InternVL3.5-241B-A28B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-241B-A28B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-241B-A28B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-241B-A28B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |

The Flash version of our model will be released as soon as possible.

`InternVL3.5`: This series of models follows the "ViT–MLP–LLM" paradigm adopted in previous versions of InternVL. We initialize the language model with the Qwen3 series and GPT-OSS, and the vision encoder with InternViT-300M and InternViT-6B. The Dynamic High Resolution strategy introduced in InternVL1.5 is also retained in our design.

`InternVL3.5-Flash`: Compared to InternVL3.5, InternVL3.5-Flash further integrates the Visual Resolution Router (ViR), yielding a series of efficient variants suitable for resource-constrained scenarios. Specifically, in InternVL3.5, each image patch is initially represented as 1024 visual tokens for the vision encoder, which are then compressed into 256 tokens via a pixel shuffle module before being passed to the Large Language Model (LLM).
In InternVL3.5-Flash, as shown in the Figure below, an additional pixel shuffle module with a higher compression rate is included, enabling the compression of visual tokens down to 64 tokens. For each patch, the patch router determines the appropriate compression rate by assessing its semantic richness, and routes it to the corresponding pixel shuffle module accordingly. Benefiting from this patch-aware compression mechanism, InternVL3.5-Flash is able to reduce the number of visual tokens by 50\% while maintaining nearly 100\% of the performance of InternVL3.5.

During the pre-training stage, we update all model parameters jointly using the combination of large-scale text and multimodal corpora. Specifically, given an arbitrary training sample consisting of a multimodal token sequence \\(\mathbf{x}=\left(x_1, x_2, \ldots, x_L\right)\\), the next token prediction (NTP) loss is calculated on each text token as follows:

$$ \mathcal{L}_{i}=-\log p_{\theta}\left(x_i \mid x_1, \ldots, x_{i-1}\right), $$

where \\(x_i\\) is the predicted token and the prefix tokens in \\(\{x_1, x_2, \ldots, x_{i-1}\}\\) can be either text tokens or image tokens. Notably, for conversation samples, only response tokens are included in the calculation of the loss. Additionally, to mitigate bias toward either longer or shorter responses during training, we adopt square averaging to re-weight the NTP loss as follows:

$$ \mathcal{L}_{i}^{'} = \frac{w_i}{\sum_j w_j} \cdot \mathcal{L}_i, \quad w_i = \frac{1}{N^{0.5}}, $$

where \\(N\\) denotes the number of tokens in the training sample on which the loss needs to be calculated. Random JPEG compression is also applied to enhance the model's real-world performance. During the SFT phase, we adopt the same objective as in the pre-training stage and use the square-root averaging strategy to calculate the final loss. In this stage, the context window is set to 32K tokens to accommodate long-context information.
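As a sanity check on the square-averaging re-weighting above, the following stdlib-only sketch applies the per-token weight \\(w_i = 1/N^{0.5}\\) and normalizes across a batch. The function name and list-of-lists layout are illustrative, not the authors' training code.

```python
def reweight_ntp_losses(sample_losses):
    """Re-weight per-token NTP losses with square averaging.

    sample_losses[k] holds the per-token losses of training sample k
    (response tokens only). Each token gets weight w = 1 / N**0.5,
    where N is the token count of the sample it belongs to, and the
    weights are normalized over all tokens in the batch.
    """
    token_weights = [
        1.0 / (len(toks) ** 0.5)
        for toks in sample_losses
        for _ in toks
    ]
    total_w = sum(token_weights)  # sum of w_j over every token in the batch
    flat_losses = [l for toks in sample_losses for l in toks]
    return sum((w / total_w) * l for w, l in zip(token_weights, flat_losses))
```

Note that for a batch containing a single sample this reduces to the plain mean token loss, while across samples each sample's contribution scales with \\(\sqrt{N}\\) rather than \\(N\\), damping the dominance of long responses.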
Compared to InternVL3, the SFT stage of InternVL3.5 contains more high-quality and diverse training data derived from three sources: (1) Instruction-following data from InternVL3, which are reused to preserve broad coverage of vision–language tasks. (2) Multimodal reasoning data in the "Thinking" mode, which are included to instill long-thinking capabilities in the model. To construct such data, we first use InternVL3-78B to describe the image and then input the description into DeepSeek-R1 to sample rollouts with detailed reasoning processes. Rollouts with an incorrect final answer are filtered out. The questions in these datasets cover various expert domains, such as mathematics and scientific disciplines, thereby strengthening performance on different reasoning tasks. (3) Capability-expansion datasets, which endow InternVL3.5 with new skills, including GUI-based interaction, embodied interaction, and scalable vector graphics (SVG) understanding.

Cascade RL aims to combine the benefits of offline RL and online RL to progressively facilitate the post-training of MLLMs in an efficient manner. Specifically, we first fine-tune the model using an offline RL algorithm as an efficient warm-up stage to reach satisfactory results, which guarantees high-quality rollouts for the later online stage. Subsequently, we employ an online RL algorithm to further refine the output distribution based on rollouts generated by the model itself. Compared to a single offline or online RL stage, our cascaded RL achieves significant performance improvements at a fraction of the GPU time cost. During the offline RL stage, we employ mixed preference optimization (MPO) to fine-tune the model.
Specifically, the training objective of MPO is a combination of preference loss \\(\mathcal{L}_{p}\\), quality loss \\(\mathcal{L}_{q}\\), and generation loss \\(\mathcal{L}_{g}\\), which can be formulated as follows:

$$ \mathcal{L}_{\text{MPO}}= w_{p} \mathcal{L}_{p} + w_{q} \mathcal{L}_{q} + w_{g} \mathcal{L}_{g}, $$

where \\(w_{*}\\) represents the weight assigned to each loss component. The DPO loss, BCO loss, and LM loss serve as the preference loss, quality loss, and generation loss, respectively. During the online RL stage, we employ GSPO, without reference model constraints, as our online RL algorithm, which we find more effective in training both dense and mixture-of-experts (MoE) models. Similar to GRPO, the advantage is defined as the normalized reward across responses sampled from the same query. The training objective of GSPO is given by:

$$ \mathcal{L}_{\mathrm{GSPO}}(\theta)=\mathbb{E}_{x \sim \mathcal{D},\,\left\{y_i\right\}_{i=1}^G \sim \pi_{\theta_{\text{old}}}(\cdot \mid x)}\left[\frac{1}{G} \sum_{i=1}^G \min \left(s_i(\theta) \widehat{A}_i, \operatorname{clip}\left(s_i(\theta), 1-\varepsilon, 1+\varepsilon\right) \widehat{A}_i\right)\right], $$

where the importance sampling ratio \\(s_i(\theta)\\) is defined as the geometric mean of the per-token ratios.

> Please see our paper for more technical and experimental details.

We further include ViCO as an additional training stage to integrate the visual resolution router (ViR) into InternVL3.5, thereby reducing its inference cost. The resulting efficient version of InternVL3.5 is termed InternVL3.5-Flash. In particular, ViCO comprises two stages:

`Consistency training`: In this stage, the entire model is trained to minimize the divergence between response distributions conditioned on visual tokens with different compression rates. In practice, we introduce an extra reference model, which is frozen and initialized with InternVL3.5.
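The clipped sequence-level GSPO objective above can be sketched in plain Python. The helper name and list-based inputs are illustrative; the sequence ratio \\(s_i(\theta)\\) is computed as the geometric mean of per-token probability ratios, i.e. the exponential of the mean per-token log-ratio.

```python
import math

def gspo_objective(token_logps_new, token_logps_old, advantages, eps=0.2):
    """Clipped GSPO surrogate over a group of G sampled responses.

    token_logps_new/old: per-response lists of per-token log-probs under
    the current and old policies; advantages: per-response normalized
    rewards (A-hat). eps is the clipping range epsilon.
    """
    total = 0.0
    for new, old, adv in zip(token_logps_new, token_logps_old, advantages):
        # Geometric mean of per-token ratios = exp(mean per-token log-ratio).
        log_ratio = sum(n - o for n, o in zip(new, old)) / len(new)
        s = math.exp(log_ratio)
        clipped = min(max(s, 1.0 - eps), 1.0 + eps)
        total += min(s * adv, clipped * adv)
    return total / len(advantages)
```

When the new and old policies agree, \\(s_i = 1\\) and the objective is just the mean advantage; a ratio outside the \\([1-\varepsilon, 1+\varepsilon]\\) band gets clipped, bounding the update.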
Given a sample, each image patch is represented as either 256 or 64 tokens, and the training objective is defined as follows:

$$ \mathcal{L}_{\text{ViCO}} = \mathbb{E}_{\xi \sim \mathcal{R}} \Bigg[ \frac{1}{N} \sum_{i=1}^{N} \mathrm{KL} \Big( \pi_{\theta_{\text{ref}}}\left(y_i \mid y_{<i}, I\right) \,\Big\|\, \pi_{\theta}\left(y_i \mid y_{<i}, I_{\xi}\right) \Big) \Bigg], $$

where \\(\xi\\) denotes the compression rate sampled from \\(\mathcal{R}\\), \\(I\\) and \\(I_{\xi}\\) denote the visual tokens before and after compression at rate \\(\xi\\), and \\(N\\) is the number of response tokens.

> Please see our paper for more technical and experimental details.

Test-time scaling (TTS) has been empirically demonstrated as an effective approach to enhance the reasoning capabilities of LLMs and MLLMs, particularly for complex tasks necessitating multi-step inference. In this work, we implement a comprehensive test-time scaling approach that simultaneously improves reasoning depth (i.e., deep thinking) and breadth (i.e., parallel thinking).

`Deep Thinking`: By activating the Thinking mode, we guide the model to deliberately engage in step-by-step reasoning (i.e., decomposing complex problems into logical steps and validating intermediate conclusions) prior to generating the final answer. This approach systematically improves the logical structure of solutions for complex problems, particularly those requiring multi-step inference, and enhances reasoning depth.

`Parallel Thinking`: Following InternVL3, for reasoning tasks, we adopt the Best-of-N (BoN) strategy by employing VisualPRM-v1.1 as the critic model to select the optimal response from multiple reasoning candidates. This approach improves reasoning breadth.

> Notably, unless otherwise specified, the experimental results reported in our paper are obtained without applying TTS. Thus far, we have only applied TTS to reasoning benchmarks, since we found that the model already exhibits strong perception and understanding capabilities, and applying TTS yields no significant improvement.

In multimodal inference, the vision encoder and language model have distinct computational characteristics. The vision encoder that transforms images into semantic features is highly parallelizable and does not rely on long-term history states.
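The Best-of-N selection used for parallel thinking reduces to a few lines once sampling and scoring are abstracted away; `sample_response` and `critic_score` below are illustrative stand-ins for temperature sampling from the MLLM and a critic model such as VisualPRM-v1.1.

```python
def best_of_n(sample_response, critic_score, n=8):
    """Parallel thinking via Best-of-N (BoN).

    sample_response: callable producing one reasoning candidate per call
    (stands in for sampling the model with temperature > 0).
    critic_score: callable scoring a candidate (stands in for a critic
    model); the highest-scoring candidate is returned.
    """
    candidates = [sample_response() for _ in range(n)]
    return max(candidates, key=critic_score)
```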
In contrast, the language model performs inference in an autoregressive manner, which requires the previous states to compute the next one. This sequential property makes the language part more sensitive to memory bandwidth and latency. When MLLMs are deployed online at scale, the vision and language models often block each other, incurring additional inference cost. This effect becomes more pronounced with larger vision models or higher-resolution images. As shown in the Figure above, we propose decoupled vision-language deployment (DvD) to address this issue by separating vision and language processing, with a particular focus on optimizing the prefilling stage. The vision subsystem batches and processes images to produce compact feature embeddings, which are then transmitted to the language subsystem for fusion with the text context prior to decoding. This separation alleviates blocking and brings multimodal prefilling performance closer to that of pure language models. In our system implementation, the ViT and MLP (and ViR for InternVL3.5-Flash) are deployed on the vision server, while the language server executes only the LLM. The communication is unidirectional, transmitting BF16 visual features over TCP, with RDMA optionally employed to achieve higher transmission speed. Vision processing, feature transmission, and language processing are organized into an asynchronous three-stage pipeline, enabling overlapped execution and minimizing pipeline stalls. DvD increases GPU utilization and processing efficiency on the vision side, while enabling the language server to focus exclusively on the LLM's prefilling and decoding without being blocked by vision computation. This design leads to improved throughput and responsiveness. Moreover, the architecture supports independent hardware cost optimization for the vision and language modules, and facilitates the seamless integration of new modules without requiring modifications to the language server deployment.
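A toy sketch of the DvD idea, with a queue standing in for the TCP/RDMA feature link and plain callables standing in for the vision and language servers (all names are illustrative, not the actual serving stack):

```python
import queue
import threading

def run_dvd_pipeline(images, encode_vision, prefill_language):
    """Decoupled vision-language deployment as an asynchronous pipeline.

    encode_vision stands in for the vision server (ViT + MLP, and ViR
    for the Flash variant); prefill_language stands in for the LLM
    server. The queue plays the role of the unidirectional feature
    link, so vision encoding and language prefilling overlap across
    requests instead of blocking each other.
    """
    feats, outputs = queue.Queue(), []

    def vision_worker():
        for img in images:
            feats.put(encode_vision(img))  # vision side never waits on the LLM
        feats.put(None)  # sentinel: no more features

    t = threading.Thread(target=vision_worker)
    t.start()
    while (f := feats.get()) is not None:
        outputs.append(prefill_language(f))  # consume features as they arrive
    t.join()
    return outputs
```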
Multi-Image Understanding & Real-World Comprehension

Comprehensive Multimodal Understanding & Multimodal Hallucination Evaluation

We provide example code to run `InternVL3.5-8B` using `transformers`. Please note that our models with up to 30B parameters can be deployed on a single A100 GPU, while the 38B model requires two A100 GPUs and the 241B model requires eight A100 GPUs.

> In most cases, both LMDeploy and vLLM can be used for model deployment. However, for InternVL3.5-20B-A4B, we recommend using vLLM, since LMDeploy does not yet support GPT-OSS.

> Please use transformers>=4.52.1 to ensure the model works normally. For the 20B version of our model, transformers>=4.55.0 is required.

To enable thinking mode, please set the system prompt to our Thinking System Prompt. When enabling Thinking mode, we recommend setting `do_sample=True` and `temperature=0.6` to mitigate undesired repetition. Besides this method, you can also use the following code to get streamed output.

Many repositories now support fine-tuning of the InternVL series models, including InternVL, SWIFT, XTuner, and others. Please refer to their documentation for more details on fine-tuning.

LMDeploy is a toolkit for compressing, deploying, and serving LLMs & VLMs. LMDeploy abstracts the complex inference process of multi-modal Vision-Language Models (VLM) into an easy-to-use pipeline, similar to the Large Language Model (LLM) inference pipeline. When dealing with multiple images, you can put them all in one list. Keep in mind that multiple images will lead to a higher number of input tokens, and as a result, the size of the context window typically needs to be increased. Conducting inference with batch prompts is quite straightforward; just place them within a list structure. There are two ways to conduct multi-turn conversations with the pipeline. One is to construct messages according to the OpenAI format and use the method introduced above; the other is to use the `pipeline.chat` interface.
LMDeploy's `api_server` enables models to be easily packed into services with a single command. The provided RESTful APIs are compatible with OpenAI's interfaces. Below is an example of service startup. To use the OpenAI-style interface, you need to install OpenAI:

This project is released under the Apache-2.0 License. This project uses the pre-trained Qwen3 as a component, which is licensed under the Apache-2.0 License. If you find this project useful in your research, please consider citing:

license:apache-2.0
96
0

granite-4.0-h-tiny-AWQ-8bit

- Quantization method: AWQ
- Bits: 8
- Group Size: 32
- Calibration Dataset: nvidia/Llama-Nemotron-Post-Training-Dataset
- Quantization Tool: llm-compressor
- The model cannot be loaded with tensor parallelism or pipeline parallelism.

Model Summary: Granite-4.0-H-Tiny is a 7B parameter long-context instruct model finetuned from Granite-4.0-H-Tiny-Base using a combination of open source instruction datasets with permissive licenses and internally collected synthetic datasets. This model is developed using a diverse set of techniques with a structured chat format, including supervised finetuning, model alignment using reinforcement learning, and model merging. Granite 4.0 instruct models feature improved instruction following (IF) and tool-calling capabilities, making them more effective in enterprise applications.

- Developers: Granite Team, IBM
- HF Collection: Granite 4.0 Language Models HF Collection
- GitHub Repository: ibm-granite/granite-4.0-language-models
- Website: Granite Docs
- Release Date: October 2nd, 2025
- License: Apache 2.0

Supported Languages: English, German, Spanish, French, Japanese, Portuguese, Arabic, Czech, Italian, Korean, Dutch, and Chinese. Users may finetune Granite 4.0 models for languages beyond these.

Intended use: The model is designed to respond to general instructions and can be used to build AI assistants for multiple domains, including business applications.

Capabilities
- Summarization
- Text classification
- Text extraction
- Question-answering
- Retrieval Augmented Generation (RAG)
- Code related tasks
- Function-calling tasks
- Multilingual dialog use cases
- Fill-In-the-Middle (FIM) code completions

Generation: This is a simple example of how to use the Granite-4.0-H-Tiny model. Then, copy the snippet from the section that is relevant for your use case.
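Tool definitions passed to such snippets follow OpenAI's function definition schema; a minimal, hypothetical example (the weather tool below is illustrative, not part of the Granite release) looks like:

```python
# A hypothetical weather-lookup tool written against OpenAI's function
# definition schema; name, description, and parameters are illustrative.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "required": ["city"],
                "properties": {
                    "city": {
                        "type": "string",
                        "description": "Name of the city to look up.",
                    }
                },
            },
        },
    }
]
```

A list like this is what gets handed to the chat template alongside the conversation messages.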
Tool-calling: Granite-4.0-H-Tiny comes with enhanced tool-calling capabilities, enabling seamless integration with external functions and APIs. To define a list of tools, please follow OpenAI's function definition schema. This is an example of how to use the Granite-4.0-H-Tiny model's tool-calling ability:

Benchmarks: metrics are reported for the Micro Dense, H Micro Dense, H Tiny MoE, and H Small MoE variants.

Multilingual benchmarks and the included languages:
- MMMLU (11): ar, de, en, es, fr, ja, ko, pt, zh, bn, hi
- INCLUDE (14): hi, bn, ta, te, ar, de, es, fr, it, ja, ko, nl, pt, zh

Model Architecture: The Granite-4.0-H-Tiny baseline is built on a decoder-only MoE transformer architecture. Core components of this architecture are: GQA, Mamba2, MoEs with shared experts, SwiGLU activation, RMSNorm, and shared input/output embeddings.

| Model | Micro Dense | H Micro Dense | H Tiny MoE | H Small MoE |
| --- | --- | --- | --- | --- |
| Number of layers | 40 attention | 4 attention / 36 Mamba2 | 4 attention / 36 Mamba2 | 4 attention / 36 Mamba2 |
| MLP / Shared expert hidden size | 8192 | 8192 | 1024 | 1536 |

Training Data: Overall, our SFT data is largely comprised of three key sources: (1) publicly available datasets with permissive licenses, (2) internal synthetic data targeting specific capabilities, and (3) a select set of human-curated data.

Infrastructure: We trained the Granite 4.0 Language Models utilizing an NVIDIA GB200 NVL72 cluster hosted in CoreWeave. Intra-rack communication occurs via the 72-GPU NVLink domain, and a non-blocking, full Fat-Tree NDR 400 Gb/s InfiniBand network provides inter-rack communication. This cluster provides a scalable and efficient infrastructure for training our models over thousands of GPUs.

Ethical Considerations and Limitations: Granite 4.0 Instruction Models are primarily finetuned using instruction-response pairs mostly in English, but also multilingual data covering multiple languages.
Although this model can handle multilingual dialog use cases, its performance might not match that on English tasks. In such cases, introducing a small number of examples (few-shot) can help the model generate more accurate outputs. While this model has been aligned with safety in mind, it may in some cases produce inaccurate, biased, or unsafe responses to user prompts. We therefore urge the community to use this model with proper safety testing and tuning tailored to their specific tasks.

Resources
- ⭐️ Learn about the latest updates with Granite: https://www.ibm.com/granite
- 📄 Get started with tutorials, best practices, and prompt engineering advice: https://www.ibm.com/granite/docs/
- 💡 Learn about the latest Granite learning resources: https://ibm.biz/granite-learning-resources

license:apache-2.0
92
0

Qwopus3.5-27B-v3-AWQ-BF16-INT4

license:apache-2.0
90
1

Tongyi-DeepResearch-30B-A3B-AWQ-8bit

We present Tongyi DeepResearch, an agentic large language model featuring 30 billion total parameters, with only 3 billion activated per token. Developed by Tongyi Lab, the model is specifically designed for long-horizon, deep information-seeking tasks. Tongyi-DeepResearch demonstrates state-of-the-art performance across a range of agentic search benchmarks, including Humanity's Last Exam, BrowseComp, BrowseComp-ZH, WebWalkerQA, GAIA, xbench-DeepSearch, and FRAMES.

- ⚙️ Fully automated synthetic data generation pipeline: We design a highly scalable data synthesis pipeline, which is fully automatic and empowers agentic pre-training, supervised fine-tuning, and reinforcement learning.
- 🔄 Large-scale continual pre-training on agentic data: Leveraging diverse, high-quality agentic interaction data to extend model capabilities, maintain freshness, and strengthen reasoning performance.
- 🔁 End-to-end reinforcement learning: We employ a strictly on-policy RL approach based on a customized Group Relative Policy Optimization framework, with token-level policy gradients, leave-one-out advantage estimation, and selective filtering of negative samples to stabilize training in a non-stationary environment.
- 🤖 Agent Inference Paradigm Compatibility: At inference, Tongyi-DeepResearch is compatible with two inference paradigms: ReAct, for rigorously evaluating the model's core intrinsic abilities, and an IterResearch-based "Heavy" mode, which uses a test-time scaling strategy to unlock the model's maximum performance ceiling.

You can download the model and then run the inference scripts in https://github.com/Alibaba-NLP/DeepResearch.
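The leave-one-out advantage estimation mentioned above can be written compactly: each rollout's baseline is the mean reward of the *other* rollouts in its group. This is an illustrative stdlib sketch, not the project's training code.

```python
def leave_one_out_advantages(rewards):
    """Leave-one-out advantages for one group of sampled rollouts.

    For rollout k with reward r_k, the baseline is the mean reward of
    the remaining n-1 rollouts, so the advantage is
    r_k - (sum(rewards) - r_k) / (n - 1). Requires n >= 2.
    """
    n = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]
```

Unlike a plain group-mean baseline, the leave-one-out baseline excludes the rollout's own reward, giving an unbiased estimate of its relative quality.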

license:apache-2.0
82
3

DeepSeek-V3.1-GPTQ-4bit

Method: vllm-project/llm-compressor and nvidia/Llama-Nemotron-Post-Training-Dataset were used to quantize the original model. For further quantization arguments and configuration information, please visit config.json and recipe.yaml. Note: the last layer, i.e., the MTP layer (index 61), is not included because transformers does not implement MTP layers.

Prerequisite: As recent vllm and transformers versions are tested to not work on this model, please install the compatible versions:

DeepSeek-V3.1 is a hybrid model that supports both thinking mode and non-thinking mode. Compared to the previous version, this upgrade brings improvements in multiple aspects:

- Hybrid thinking mode: One model supports both thinking mode and non-thinking mode by changing the chat template.
- Smarter tool calling: Through post-training optimization, the model's performance in tool usage and agent tasks has significantly improved.
- Higher thinking efficiency: DeepSeek-V3.1-Think achieves comparable answer quality to DeepSeek-R1-0528, while responding more quickly.

DeepSeek-V3.1 is post-trained on top of DeepSeek-V3.1-Base, which is built upon the original V3 base checkpoint through a two-phase long context extension approach, following the methodology outlined in the original DeepSeek-V3 report. We have expanded our dataset by collecting additional long documents and substantially extending both training phases. The 32K extension phase has been increased 10-fold to 630B tokens, while the 128K extension phase has been extended by 3.3x to 209B tokens. Additionally, DeepSeek-V3.1 is trained using the UE8M0 FP8 scale data format to ensure compatibility with microscaling data formats.
| Model | #Total Params | #Activated Params | Context Length | Download |
| :---: | :---: | :---: | :---: | :---: |
| DeepSeek-V3.1-Base | 671B | 37B | 128K | HuggingFace \| ModelScope |
| DeepSeek-V3.1 | 671B | 37B | 128K | HuggingFace \| ModelScope |

The details of our chat template are described in `tokenizer_config.json` and `assets/chat_template.jinja`. Here is a brief description. With the given prefix, DeepSeek-V3.1 generates responses to queries in non-thinking mode. Unlike DeepSeek-V3, it introduces an additional token ` `.

Multi-Turn Context: ` {system prompt} {query} {response} ... {query} {response} `

By concatenating the context and the prefix, we obtain the correct prompt for the query. The prefix of thinking mode is similar to DeepSeek-R1.

Multi-Turn Context: ` {system prompt} {query} {response} ... {query} {response} `

The multi-turn template is the same as the non-thinking multi-turn chat template. This means the thinking tokens of the last turn are dropped, but the ` ` is retained in every turn of the context.

ToolCall: Tool calling is supported in non-thinking mode. The format is: ` {system prompt}\n\n{tool_description} {query} `, where the tool_description is

Code-Agent: We support various code agent frameworks. Please refer to the above tool-call format to create your own code agents. An example is shown in `assets/code_agent_trajectory.html`.

Search-Agent: We design a specific format for search tool calls in thinking mode, to support search agents. For complex questions that require accessing external or up-to-date information, DeepSeek-V3.1 can leverage a user-provided search tool through a multi-turn tool-calling process. Please refer to `assets/search_tool_trajectory.html` and `assets/search_python_tool_trajectory.html` for the detailed template.
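The multi-turn concatenation described above can be sketched as follows. The real special tokens are defined in `tokenizer_config.json` and `assets/chat_template.jinja` and are elided in this card, so the marker strings below are placeholders, not the actual tokens.

```python
def build_context(system_prompt, turns,
                  bos="<BOS>", user="<USER>", asst="<ASSISTANT>", eos="<EOS>"):
    """Sketch of a multi-turn chat context: the system prompt followed
    by alternating {query}{response} pairs, each turn closed by an
    end-of-sentence marker. Marker strings are placeholders only."""
    ctx = bos + system_prompt
    for query, response in turns:
        ctx += user + query + asst + response + eos
    return ctx
```

Appending the assistant prefix for the next query to this context yields the prompt for the following turn.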
Evaluation

| Category | Benchmark (Metric) | DeepSeek V3.1-NonThinking | DeepSeek V3 0324 | DeepSeek V3.1-Thinking | DeepSeek R1 0528 |
| :---: | :---: | :---: | :---: | :---: | :---: |
| General | MMLU-Redux (EM) | 91.8 | 90.5 | 93.7 | 93.4 |
| | MMLU-Pro (EM) | 83.7 | 81.2 | 84.8 | 85.0 |
| | GPQA-Diamond (Pass@1) | 74.9 | 68.4 | 80.1 | 81.0 |
| | Humanity's Last Exam (Pass@1) | - | - | 15.9 | 17.7 |
| Search Agent | BrowseComp | - | - | 30.0 | 8.9 |
| | BrowseComp-zh | - | - | 49.2 | 35.7 |
| | Humanity's Last Exam (Python + Search) | - | - | 29.8 | 24.8 |
| | SimpleQA | - | - | 93.4 | 92.3 |
| Code | LiveCodeBench (2408-2505) (Pass@1) | 56.4 | 43.0 | 74.8 | 73.3 |
| | Codeforces-Div1 (Rating) | - | - | 2091 | 1930 |
| | Aider-Polyglot (Acc.) | 68.4 | 55.1 | 76.3 | 71.6 |
| Code Agent | SWE Verified (Agent mode) | 66.0 | 45.4 | - | 44.6 |
| | SWE-bench Multilingual (Agent mode) | 54.5 | 29.3 | - | 30.5 |
| | Terminal-bench (Terminus 1 framework) | 31.3 | 13.3 | - | 5.7 |
| Math | AIME 2024 (Pass@1) | 66.3 | 59.4 | 93.1 | 91.4 |
| | AIME 2025 (Pass@1) | 49.8 | 51.3 | 88.4 | 87.5 |
| | HMMT 2025 (Pass@1) | 33.5 | 29.2 | 84.2 | 79.4 |

Note:
- Search agents are evaluated with our internal search framework, which uses a commercial search API + webpage filter + 128K context window. Search agent results of R1-0528 are evaluated with a pre-defined workflow.
- SWE-bench is evaluated with our internal code agent framework.

The model structure of DeepSeek-V3.1 is the same as DeepSeek-V3. Please visit the DeepSeek-V3 repo for more information about running this model locally. This repository and the model weights are licensed under the MIT License. If you have any questions, please raise an issue or contact us at [email protected].

license:mit
81
0

KAT-Dev-AWQ-8bit

Highlights: KAT-Dev-32B is an open-source 32B-parameter model for software engineering tasks. On SWE-Bench Verified, KAT-Dev-32B achieves competitive performance with 62.4% resolved and ranks 5th among open-source models across all scales. KAT-Dev-32B is optimized via several stages of training, including a mid-training stage, a supervised fine-tuning (SFT) & reinforcement fine-tuning (RFT) stage, and a large-scale agentic reinforcement learning (RL) stage. In summary, our contributions include:

1. Mid-Training: We observe that adding extensive training for tool-use capability, multi-turn interaction, and instruction following at this stage may not yield large performance gains in the current results (e.g., on leaderboards like SWE-bench). However, since our experiments are based on the Qwen3-32B model, we find that enhancing these foundational capabilities has a significant impact on the subsequent SFT and RL stages. This suggests that improving such core abilities can profoundly influence the model's capacity to handle more complex tasks.

2. SFT & RFT: We meticulously curated eight task types and eight programming scenarios during the SFT stage to ensure the model's generalization and comprehensive capabilities. Moreover, before RL, we innovatively introduced an RFT stage. Compared with traditional RL, we incorporate "teacher trajectories" annotated by human engineers as guidance during training, much like a learner driver being assisted by an experienced co-driver before driving alone after getting a license. This step not only boosts model performance but also further stabilizes the subsequent RL training.

3. Agentic RL Scaling: Scaling agentic RL hinges on three challenges: efficient learning over nonlinear trajectory histories, leveraging intrinsic model signals, and building scalable high-throughput infrastructure.
We address these challenges with a multi-level prefix caching mechanism in the RL training engine, an entropy-based trajectory pruning technique, and an in-house implementation of the SeamlessFlow [1] architecture that cleanly decouples agents from training while exploiting heterogeneous compute. Together, these innovations cut scaling costs and enable efficient large-scale RL. For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our blog.

claude-code-router is a third-party routing utility that allows Claude Code to flexibly switch between different backend APIs. On the dashScope platform, you can install the claude-code-config extension package, which automatically generates a default configuration for `claude-code-router` with built-in dashScope support. Once the configuration files and plugin directory are generated, the environment required by `ccr` will be ready. If needed, you can still manually edit `~/.claude-code-router/config.json` and the files under `~/.claude-code-router/plugins/` to customize the setup. Finally, simply start `ccr` to run Claude Code and seamlessly connect it with the powerful coding capabilities of KAT-Dev-32B. Happy coding!
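As a generic illustration of prefix caching over agent trajectories (a trie sketch under our own naming, not KAT's actual RL engine): trajectories that share a token prefix reuse a single cached node instead of re-prefilling the shared span.

```python
def build_prefix_cache(trajectories):
    """Insert token trajectories into a trie; shared prefixes are
    stored (and would be prefilled) exactly once. Purely illustrative
    of the prefix-caching idea, not the actual training engine."""
    root = {}
    for traj in trajectories:
        node = root
        for token in traj:
            node = node.setdefault(token, {})
    return root

def cached_nodes(cache):
    """Count distinct cache entries; shared prefixes count once."""
    if not cache:
        return 0
    return sum(1 + cached_nodes(child) for child in cache.values())
```

Two trajectories `["a", "b"]` and `["a", "c"]` occupy three nodes instead of four, and the saving grows with the length of the shared history.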

76
0

Qwopus3.5-27B-v3-AWQ-4bit

license:apache-2.0
75
2

OpenReasoning-Nemotron-7B-AWQ

Method Quantised using vllm-project/llm-compressor and the following configs:

dataset:mit-han-lab/pile-val-backup
73
0

InternVL3_5-8B-AWQ-4bit

license:apache-2.0
72
0

Qwen3-4B-Thinking-2507-AWQ-8bit

license:apache-2.0
59
2

Seed-OSS-36B-Instruct-AWQ-4bit

Method: vllm-project/llm-compressor and nvidia/Llama-Nemotron-Post-Training-Dataset were used to quantize the original model. For further quantization arguments and configuration information, please visit config.json and recipe.yaml.

Prerequisite: To pick up the latest implementations, please install transformers from source:

You can get to know us better through the following channels👇

> [!NOTE]
> This model card is dedicated to the `Seed-OSS-36B-Instruct` model.

News
- [2025/08/20]🔥We release `Seed-OSS-36B-Base` (both with and without synthetic data versions) and `Seed-OSS-36B-Instruct`.

Introduction
Seed-OSS is a series of open-source large language models developed by ByteDance's Seed Team, designed for powerful long-context, reasoning, agent, and general capabilities, and versatile developer-friendly features. Although trained with only 12T tokens, Seed-OSS achieves excellent performance on several popular open benchmarks. We release this series of models to the open-source community under the Apache-2.0 license.

> [!NOTE]
> Seed-OSS is primarily optimized for international (i18n) use cases.

Key Features
- Flexible Control of Thinking Budget: Allows users to flexibly adjust the reasoning length as needed. This ability to dynamically control the reasoning length enhances inference efficiency in practical application scenarios.
- Enhanced Reasoning Capability: Specifically optimized for reasoning tasks while maintaining balanced and excellent general capabilities.
- Agentic Intelligence: Performs exceptionally well in agentic tasks such as tool use and issue resolving.
- Research-Friendly: Given that the inclusion of synthetic instruction data in pre-training may affect post-training research, we release pre-trained models both with and without instruction data, providing the research community with more diverse options.
- Native Long Context: Trained with up to 512K context length natively.
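The "thinking budget" control mentioned above can be wrapped in a small helper that follows the recommendations given later in this card (integer multiples of 512; budgets below 512 mapped to 0, i.e. a direct response; -1 for unlimited thinking). The function name is ours, not part of the Seed-OSS release.

```python
def snap_thinking_budget(requested):
    """Map a requested thinking budget to a recommended value.

    -1 keeps the default unlimited-thinking mode; budgets below 512
    fall back to 0 (direct response, no thinking); anything else is
    rounded to the nearest multiple of 512, the intervals the model
    was extensively trained on.
    """
    if requested == -1:   # default mode: unlimited thinking
        return -1
    if requested < 512:   # small budgets are discouraged; disable thinking
        return 0
    return round(requested / 512) * 512
```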
Seed-OSS adopts the popular causal language model architecture with RoPE, GQA attention, RMSNorm, and SwiGLU activation.

| | Seed-OSS-36B |
| :---: | :---: |
| Parameters | 36B |
| Attention | GQA |
| Activation Function | SwiGLU |
| Number of Layers | 64 |
| Number of QKV Heads | 80 / 8 / 8 |
| Head Size | 128 |
| Hidden Size | 5120 |
| Vocabulary Size | 155K |
| Context Length | 512K |
| RoPE Base Frequency | 1e7 |

Incorporating synthetic instruction data into pretraining leads to improved performance on most benchmarks. We adopt the version augmented with synthetic instruction data (i.e., w/ syn.) as `Seed-OSS-36B-Base`. We also release `Seed-OSS-36B-Base-woSyn`, trained without such data (i.e., w/o syn.), offering the community a high-performance foundation model unaffected by synthetic instruction data.

| Benchmark | Seed1.6-Base | Qwen3-30B-A3B-Base-2507 | Qwen2.5-32B-Base | Seed-OSS-36B-Base (w/ syn.) | Seed-OSS-36B-Base-woSyn (w/o syn.) |

- "*" indicates that the results in this column are presented in the format of "reproduced results (reported results, if any)".

| Benchmark | Seed1.6-Thinking-0715 | OAI-OSS-20B | Qwen3-30B-A3B-Thinking-2507 | Qwen3-32B | Gemma3-27B | Seed-OSS-36B-Instruct |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| GPQA-D | 80.7 | 72.2 (71.5) | 71.4 (73.4) | 66.7 (68.4) | 42.4 | 71.4 |
| LiveCodeBench v6 (02/2025-05/2025) | 66.8 | 63.8 | 60.3 (66) | 53.4 | - | 67.4 |
| SWE-Bench Verified (OpenHands) | 41.8 | (60.7) | 31 | 23.4 | - | 56 |
| SWE-Bench Verified (AgentLess 4*10) | 48.4 | - | 33.5 | 39.7 | - | 47 |

- Bold denotes open-source SOTA. Underlined indicates second place among open-source models.
- "*" indicates that the results in this column are presented in the format of "reproduced results (reported results, if any)". Some results have been omitted due to the failure of the evaluation run.
- The results of Gemma3-27B are sourced directly from its technical report.
- Generation configs for Seed-OSS-36B-Instruct: temperature=1.1, top_p=0.95. Specifically, for Taubench, temperature=1, top_p=0.7.

> [!NOTE]
> We recommend sampling with `temperature=1.1` and `top_p=0.95`.
Users can flexibly specify the model's thinking budget. The figure below shows the performance curves across different tasks as the thinking budget varies. For simpler tasks (such as IFEval), the model's chain of thought (CoT) is shorter, and the score fluctuates as the thinking budget increases. For more challenging tasks (such as AIME and LiveCodeBench), the model's CoT is longer, and the score improves as the thinking budget increases.

Here is an example with the thinking budget set to 512: during reasoning, the model periodically triggers self-reflection to estimate the consumed and remaining budget, and delivers the final response once the budget is exhausted or the reasoning concludes. If no thinking budget is set (the default mode), Seed-OSS will initiate thinking with unlimited length. If a thinking budget is specified, users are advised to prioritize values that are integer multiples of 512 (e.g., 512, 1K, 2K, 4K, 8K, or 16K), as the model has been extensively trained on these intervals. Models are instructed to output a direct response when the thinking budget is 0, and we recommend setting any budget below 512 to this value.

Download the Seed-OSS checkpoint to `./Seed-OSS-36B-Instruct`.

Transformers

The `generate.py` script provides a simple interface for model inference with configurable options.

Key Parameters

| Parameter | Description |
|-----------|-------------|
| `--model_path` | Path to the pretrained model directory (required) |
| `--prompts` | Input prompts (default: sample cooking/code questions) |
| `--max_new_tokens` | Maximum tokens to generate (default: 4096) |
| `--attn_implementation` | Attention mechanism: `flash_attention_2` (default) or `eager` |
| `--load_in_4bit/8bit` | Enable 4-bit/8-bit quantization (reduces memory usage) |
| `--thinking_budget` | Thinking budget in tokens (default: -1 for unlimited budget) |

- First install a vLLM version with Seed-OSS support.

License

This project is licensed under Apache-2.0.
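The thinking-budget guidance above (budgets should be integer multiples of 512, and budgets below 512 should be set to 0) can be sketched as a small helper. The function name is illustrative, not part of the release:

```python
def normalize_thinking_budget(budget: int) -> int:
    """Snap a requested thinking budget to a value the model was trained on.

    Per the guidance above: -1 means unlimited thinking, values below 512
    are set to 0 (direct response), and other values snap to the nearest
    multiple of 512.
    """
    if budget < 0:
        return -1  # unlimited thinking (default mode)
    if budget < 512:
        return 0   # direct response, no thinking
    return round(budget / 512) * 512
```

For example, a requested budget of 1000 would be snapped to 1024, while 300 would be treated as 0.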
See the LICENSE file for details.

Founded in 2023, ByteDance's Seed Team is dedicated to crafting the industry's most advanced AI foundation models. The team aspires to become a world-class research team and make significant contributions to the advancement of science and society.

license:apache-2.0

ERNIE-4.5-21B-A3B-Thinking-AWQ-8bit

Over the past three months, we have continued to scale the thinking capability of ERNIE-4.5-21B-A3B, improving both the quality and depth of reasoning, thereby advancing the competitiveness of ERNIE lightweight models in complex reasoning tasks. We are pleased to introduce ERNIE-4.5-21B-A3B-Thinking, featuring the following key enhancements:

- Significantly improved performance on reasoning tasks, including logical reasoning, mathematics, science, coding, text generation, and academic benchmarks that typically require human expertise.
- Efficient tool usage capabilities.
- Enhanced 128K long-context understanding capabilities.

> [!NOTE]
> Note: This version has an increased thinking length. We strongly recommend it for highly complex reasoning tasks.

ERNIE-4.5-21B-A3B-Thinking is a text MoE post-trained model with 21B total parameters and 3B activated parameters per token. The model configuration details are as follows:

|Key|Value|
|-|-|
|Modality|Text|
|Training Stage|Post-training|
|Params (Total / Activated)|21B / 3B|
|Layers|28|
|Heads (Q/KV)|20 / 4|
|Text Experts (Total / Activated)|64 / 6|
|Vision Experts (Total / Activated)|64 / 6|
|Shared Experts|2|
|Context Length|131072|

> [!NOTE]
> To align with the wider community, this model releases Transformer-style weights. Both PyTorch and PaddlePaddle ecosystem tools, such as vLLM, transformers, and FastDeploy, are expected to be able to load and run this model.

Quickly deploy services using FastDeploy as shown below. For more detailed usage, refer to the FastDeploy GitHub Repository. Note: 1 x 80GB GPU is required, and deploying this model requires FastDeploy version 2.2. The ERNIE-4.5-21B-A3B-Thinking model supports function calling. The `reasoning-parser` and `tool-call-parser` for vLLM Ernie are currently under development. Note: you'll need the `transformers` library (version 4.54.0 or newer) installed to use this model.
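Since the model supports function calling behind an OpenAI-compatible endpoint (e.g., one served by FastDeploy or vLLM), a request can be sketched as a plain chat-completions payload. The tool schema and model name here are illustrative assumptions, not part of the release:

```python
import json

# Illustrative OpenAI-style chat-completions payload for a served
# ERNIE-4.5-21B-A3B-Thinking instance; the get_weather tool is an example.
payload = {
    "model": "ERNIE-4.5-21B-A3B-Thinking",
    "messages": [{"role": "user", "content": "What's the weather in Beijing?"}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Look up current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
}
body = json.dumps(payload)  # send with any HTTP client to /v1/chat/completions
```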
The following contains a code snippet illustrating how to use the model to generate content based on given inputs.

The ERNIE 4.5 models are provided under the Apache License 2.0. This license permits commercial use, subject to its terms and conditions. Copyright (c) 2025 Baidu, Inc. All Rights Reserved.

If you find ERNIE 4.5 useful or wish to use it in your projects, please kindly cite our technical report:

license:apache-2.0

InternVL3_5-8B-AWQ-8bit

Method

vllm-project/llm-compressor was used to quantize the original model, with nvidia/Llama-Nemotron-Post-Training-Dataset as the calibration dataset. For further details on the quantization arguments and configuration, please see config.json and recipe.yaml.

[\[📂 GitHub\]](https://github.com/OpenGVLab/InternVL) [\[📜 InternVL 1.0\]](https://huggingface.co/papers/2312.14238) [\[📜 InternVL 1.5\]](https://huggingface.co/papers/2404.16821) [\[📜 InternVL 2.5\]](https://huggingface.co/papers/2412.05271) [\[📜 InternVL2.5-MPO\]](https://huggingface.co/papers/2411.10442) [\[📜 InternVL3\]](https://huggingface.co/papers/2504.10479) [\[📜 InternVL3.5\]](https://huggingface.co/papers/2508.18265) [\[🆕 Blog\]](https://internvl.github.io/blog/) [\[🗨️ Chat Demo\]](https://chat.intern-ai.org.cn/) [\[🚀 Quick Start\]](#quick-start) [\[📖 Documents\]](https://internvl.readthedocs.io/en/latest/)

We introduce InternVL3.5, a new family of open-source multimodal models that significantly advances versatility, reasoning capability, and inference efficiency along the InternVL series. A key innovation is the Cascade Reinforcement Learning (Cascade RL) framework, which enhances reasoning through a two-stage process: offline RL for stable convergence and online RL for refined alignment. This coarse-to-fine training strategy leads to substantial improvements on downstream reasoning tasks, e.g., MMMU and MathVista. To optimize efficiency, we propose a Visual Resolution Router (ViR) that dynamically adjusts the resolution of visual tokens without compromising performance. Coupled with ViR, our Decoupled Vision-Language Deployment (DvD) strategy separates the vision encoder and language model across different GPUs, effectively balancing computational load. These contributions collectively enable InternVL3.5 to achieve up to a +16.0\% gain in overall reasoning performance and a 4.05\\(\times\\) inference speedup compared to its predecessor, i.e., InternVL3.
In addition, InternVL3.5 supports novel capabilities such as GUI interaction and embodied agency. Notably, our largest model, i.e., InternVL3.5-241B-A28B, attains state-of-the-art results among open-source MLLMs across general multimodal, reasoning, text, and agentic tasks, narrowing the performance gap with leading commercial models like GPT-5. All models and code are publicly released.

> Hatched bars represent closed-source commercial models. We report average scores on a set of multimodal general, reasoning, text, and agentic benchmarks: MMBench v1.1 (en), MMStar, BLINK, HallusionBench, AI2D, OCRBench, MMVet, MME-RealWorld (en), MVBench, VideoMME, MMMU, MathVista, MathVision, MathVerse, DynaMath, WeMath, LogicVista, MATH500, AIME24, AIME25, GPQA, MMLU-Pro, GAOKAO, IFEval, SGP-Bench, VSI-Bench, ERQA, SpaCE-10, and OmniSpatial.

In the following table, we provide an overview of the InternVL3.5 series. To maintain consistency with earlier generations, we provide two model formats: the GitHub format, consistent with prior releases, and the HF format, aligned with the official Transformers standard.

> If you want to convert a checkpoint between these two formats, please refer to the custom2hf and hf2custom scripts.
| Model | #Vision Param | #Language Param | #Total Param | HF Link | ModelScope Link |
| --------------------- | ------------- | --------------- | ------------ | ------- | --------------- |
| InternVL3.5-1B | 0.3B | 0.8B | 1.1B | 🤗 link | 🤖 link |
| InternVL3.5-2B | 0.3B | 2.0B | 2.3B | 🤗 link | 🤖 link |
| InternVL3.5-4B | 0.3B | 4.4B | 4.7B | 🤗 link | 🤖 link |
| InternVL3.5-8B | 0.3B | 8.2B | 8.5B | 🤗 link | 🤖 link |
| InternVL3.5-14B | 0.3B | 14.8B | 15.1B | 🤗 link | 🤖 link |
| InternVL3.5-38B | 5.5B | 32.8B | 38.4B | 🤗 link | 🤖 link |
| InternVL3.5-20B-A4B | 0.3B | 20.9B | 21.2B-A4B | 🤗 link | 🤖 link |
| InternVL3.5-30B-A3B | 0.3B | 30.5B | 30.8B-A3B | 🤗 link | 🤖 link |
| InternVL3.5-241B-A28B | 5.5B | 235.1B | 240.7B-A28B | 🤗 link | 🤖 link |

| Model | #Vision Param | #Language Param | #Total Param | HF Link | ModelScope Link |
| ------------------------ | ------------- | --------------- | ------------ | ------- | --------------- |
| InternVL3.5-1B-HF | 0.3B | 0.8B | 1.1B | 🤗 link | 🤖 link |
| InternVL3.5-2B-HF | 0.3B | 2.0B | 2.3B | 🤗 link | 🤖 link |
| InternVL3.5-4B-HF | 0.3B | 4.4B | 4.7B | 🤗 link | 🤖 link |
| InternVL3.5-8B-HF | 0.3B | 8.2B | 8.5B | 🤗 link | 🤖 link |
| InternVL3.5-14B-HF | 0.3B | 14.8B | 15.1B | 🤗 link | 🤖 link |
| InternVL3.5-38B-HF | 5.5B | 32.8B | 38.4B | 🤗 link | 🤖 link |
| InternVL3.5-20B-A4B-HF | 0.3B | 20.9B | 21.2B-A4B | 🤗 link | 🤖 link |
| InternVL3.5-30B-A3B-HF | 0.3B | 30.5B | 30.8B-A3B | 🤗 link | 🤖 link |
| InternVL3.5-241B-A28B-HF | 5.5B | 235.1B | 240.7B-A28B | 🤗 link | 🤖 link |

> We conduct the evaluation with VLMEvalKit. To enable the Thinking mode of our model, please set the system prompt to R1_SYSTEM_PROMPT.
When enabling Thinking mode, we recommend setting `do_sample=True` and `temperature=0.6` to mitigate undesired repetition.

Our training pipeline comprises three stages: Multimodal Continual Pre-Training (CPT), Supervised Fine-Tuning (SFT), and Cascade Reinforcement Learning (Cascade RL). In Cascade RL, we first fine-tune the model using Mixed Preference Optimization (MPO) under an offline RL setting, followed by GSPO under an online RL setting. For the Flash version of InternVL3.5, we additionally introduce a lightweight training stage, termed Visual Consistency Learning (ViCO), which reduces the token cost required to represent an image patch.

Here, we also open-source the model weights after different training stages for potential research usage. If you're unsure which version to use, please select the one without any suffix, as it has completed the full training pipeline.

| Model | Training Pipeline | HF Link | ModelScope Link |
| -------------------------------- | --------------------- | ------- | --------------- |
| InternVL3.5-1B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-1B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-1B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-1B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |
| InternVL3.5-2B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-2B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-2B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-2B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |
| InternVL3.5-4B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-4B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-4B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-4B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |
| InternVL3.5-8B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-8B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-8B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-8B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |
| InternVL3.5-14B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-14B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-14B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-14B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |
| InternVL3.5-30B-A3B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-30B-A3B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-30B-A3B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-30B-A3B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |
| InternVL3.5-38B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-38B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-38B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-38B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |
| InternVL3.5-241B-A28B-Pretrained | CPT | 🤗 link | 🤖 link |
| InternVL3.5-241B-A28B-Instruct | CPT + SFT | 🤗 link | 🤖 link |
| InternVL3.5-241B-A28B-MPO | CPT + SFT + MPO | 🤗 link | 🤖 link |
| InternVL3.5-241B-A28B | CPT + SFT + CascadeRL | 🤗 link | 🤖 link |

The Flash version of our model will be released as soon as possible.

`InternVL3.5`: This series of models follows the "ViT–MLP–LLM" paradigm adopted in previous versions of InternVL. We initialize the language model with the Qwen3 series and GPT-OSS, and the vision encoder with InternViT-300M and InternViT-6B. The Dynamic High Resolution strategy introduced in InternVL1.5 is also retained in our design.

`InternVL3.5-Flash`: Compared to InternVL3.5, InternVL3.5-Flash further integrates the Visual Resolution Router (ViR), yielding a series of efficient variants suitable for resource-constrained scenarios. Specifically, in InternVL3.5, each image patch is initially represented as 1024 visual tokens for the vision encoder, which are then compressed into 256 tokens via a pixel shuffle module before being passed to the Large Language Model (LLM).
In InternVL3.5-Flash, as shown in the figure below, an additional pixel shuffle module with a higher compression rate is included, enabling the compression of visual tokens down to 64 tokens. For each patch, the patch router determines the appropriate compression rate by assessing its semantic richness, and routes it to the corresponding pixel shuffle module accordingly. Benefiting from this patch-aware compression mechanism, InternVL3.5-Flash is able to reduce the number of visual tokens by 50\% while maintaining nearly 100\% of the performance of InternVL3.5.

During the pre-training stage, we update all model parameters jointly using a combination of large-scale text and multimodal corpora. Specifically, given an arbitrary training sample consisting of a multimodal token sequence \\(\mathbf{x}=\left(x_1, x_2, \ldots, x_L\right)\\), the next token prediction (NTP) loss is calculated on each text token as follows:

$$ \mathcal{L}_{i}=-\log p_{\theta}\left(x_i \mid x_1, \ldots, x_{i-1}\right), $$

where \\(x_i\\) is the predicted token and the prefix tokens \\(\{x_1, x_2, \ldots, x_{i-1}\}\\) can be either text tokens or image tokens. Notably, for conversation samples, only response tokens are included in the calculation of the loss. Additionally, to mitigate bias toward either longer or shorter responses during training, we adopt square averaging to re-weight the NTP loss as follows:

$$ \mathcal{L}_{i}^{'} = \frac{w_i}{\sum_j w_j} \cdot \mathcal{L}_i, \quad w_i = \frac{1}{N^{0.5}}, $$

where \\(N\\) denotes the number of tokens in the training sample on which the loss is calculated. Random JPEG compression is also applied to enhance the model's real-world performance.

During the SFT phase, we adopt the same objective as in the pre-training stage and use the square-root averaging strategy to calculate the final loss. In this stage, the context window is set to 32K tokens to accommodate long-context information.
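The square-root re-weighting above can be sketched numerically. This is an illustrative stand-alone computation (function name and data are invented for the example), not the training code:

```python
def sqrt_reweighted_loss(per_token_losses_by_sample):
    """Re-weight NTP losses so long samples don't dominate a batch.

    Each token in a sample with N loss-bearing tokens gets weight
    w = N ** -0.5, and weights are normalized over the whole batch,
    matching the formula above.
    """
    weights, losses = [], []
    for sample in per_token_losses_by_sample:
        n = len(sample)
        weights.extend([n ** -0.5] * n)
        losses.extend(sample)
    total_w = sum(weights)
    return sum(w / total_w * l for w, l in zip(weights, losses))
```

With this weighting, a 4-token sample contributes twice as much per token as a 16-token sample (4^-0.5 = 2 * 16^-0.5), which counteracts the length bias described above.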
Compared to InternVL3, the SFT stage of InternVL3.5 contains more high-quality and diverse training data derived from three sources:

(1) Instruction-following data from InternVL3, reused to preserve broad coverage of vision-language tasks.

(2) Multimodal reasoning data in the "Thinking" mode, included to instill long-thinking capabilities in the model. To construct such data, we first use InternVL3-78B to describe the image and then feed the description into DeepSeek-R1 to sample rollouts with detailed reasoning processes. Rollouts with an incorrect final answer are filtered out. The questions in these datasets cover various expert domains, such as mathematics and scientific disciplines, thereby strengthening performance on different reasoning tasks.

(3) Capability-expansion datasets, which endow InternVL3.5 with new skills, including GUI-based interaction, embodied interaction, and scalable vector graphics (SVG) understanding.

Cascade RL aims to combine the benefits of offline RL and online RL to progressively facilitate the post-training of MLLMs in an efficient manner. Specifically, we first fine-tune the model with an offline RL algorithm as an efficient warm-up stage to reach satisfactory results, which guarantees high-quality rollouts for the subsequent stage. We then employ an online RL algorithm to further refine the output distribution based on rollouts generated by the model itself. Compared to a single offline or online RL stage, our cascaded RL achieves significant performance improvements at a fraction of the GPU-time cost. During the offline RL stage, we employ Mixed Preference Optimization (MPO) to fine-tune the model.
Specifically, the training objective of MPO is a combination of preference loss \\(\mathcal{L}_{p}\\), quality loss \\(\mathcal{L}_{q}\\), and generation loss \\(\mathcal{L}_{g}\\), which can be formulated as follows:

$$ \mathcal{L}_{\text{MPO}}= w_{p} \mathcal{L}_{p} + w_{q} \mathcal{L}_{q} + w_{g} \mathcal{L}_{g}, $$

where \\(w_{*}\\) represents the weight assigned to each loss component. The DPO loss, BCO loss, and LM loss serve as the preference loss, quality loss, and generation loss, respectively.

During the online RL stage, we employ GSPO, without reference model constraints, as our online RL algorithm, which we find more effective for training both dense and mixture-of-experts (MoE) models. Similar to GRPO, the advantage is defined as the normalized reward across responses sampled from the same query. The training objective of GSPO is given by:

$$ \mathcal{L}_{\mathrm{GSPO}}(\theta)=\mathbb{E}_{x \sim \mathcal{D},\,\left\{y_i\right\}_{i=1}^G \sim \pi_{\theta_{\text{old}}}(\cdot \mid x)}\left[\frac{1}{G} \sum_{i=1}^G \min \left(s_i(\theta)\, \widehat{A}_i, \operatorname{clip}\left(s_i(\theta), 1-\varepsilon, 1+\varepsilon\right) \widehat{A}_i\right)\right], $$

where the importance sampling ratio \\(s_i(\theta)\\) is defined as the geometric mean of the per-token ratios.

> Please see our paper for more technical and experimental details.

We further include ViCO as an additional training stage to integrate the Visual Resolution Router (ViR) into InternVL3.5, thereby reducing the inference cost of InternVL3.5. The resulting efficient version of InternVL3.5 is termed InternVL3.5-Flash. In particular, ViCO comprises two stages:

`Consistency training`: In this stage, the entire model is trained to minimize the divergence between response distributions conditioned on visual tokens with different compression rates. In practice, we introduce an extra reference model, which is frozen and initialized with InternVL3.5.
Given a sample, each image patch is represented as either 256 or 64 tokens, and the training objective is defined as follows:

$$ \mathcal{L}_{\text{ViCO}} = \mathbb{E}_{\xi \sim \mathcal{R}} \left[ \frac{1}{N} \sum_{i=1}^{N} \mathrm{KL}\Big( \pi_{\theta_{\text{ref}}}\left(y_i \mid y_{<i}\right) \,\Big\|\, \pi_{\theta}\left(y_i \mid y_{<i}, \xi\right) \Big) \right], $$

where \\(\xi\\) denotes the compression rate sampled from \\(\mathcal{R}\\).

> Please see our paper for more technical and experimental details.

Test-time scaling (TTS) has been empirically demonstrated as an effective approach to enhance the reasoning capabilities of LLMs and MLLMs, particularly for complex tasks necessitating multi-step inference. In this work, we implement a comprehensive test-time scaling approach that simultaneously improves reasoning depth (i.e., deep thinking) and breadth (i.e., parallel thinking).

`Deep Thinking`: By activating the Thinking mode, we guide the model to deliberately engage in step-by-step reasoning (i.e., decomposing complex problems into logical steps and validating intermediate conclusions) prior to generating the final answer. This approach systematically improves the logical structure of solutions for complex problems, particularly those requiring multi-step inference, and enhances reasoning depth.

`Parallel Thinking`: Following InternVL3, for reasoning tasks we adopt the Best-of-N (BoN) strategy, employing VisualPRM-v1.1 as the critic model to select the optimal response from multiple reasoning candidates. This approach improves reasoning breadth.

> Notably, unless otherwise specified, the experimental results reported in our paper are obtained without applying TTS. Thus far, we have only applied TTS to reasoning benchmarks, since we found that the model already exhibits strong perception and understanding capabilities, and initiating TTS yields no significant improvement.

In multimodal inference, the vision encoder and language model have distinct computational characteristics. The vision encoder, which transforms images into semantic features, is highly parallelizable and does not rely on long-term history state.
In contrast, the language model performs inference in an autoregressive manner, which requires previous states to compute the next one. This sequential property makes the language side more sensitive to memory bandwidth and latency. When MLLMs are deployed online at scale, the vision and language models often block each other, incurring additional inference cost. This effect becomes more pronounced with larger vision models or higher-resolution images.

As shown in the figure above, we propose Decoupled Vision-Language Deployment (DvD) to address this issue by separating vision and language processing, with a particular focus on optimizing the prefilling stage. The vision subsystem batches and processes images to produce compact feature embeddings, which are then transmitted to the language subsystem for fusion with the text context prior to decoding. This separation alleviates blocking and brings multimodal prefilling performance closer to that of pure language models.

In our system implementation, the ViT and MLP (and ViR for InternVL3.5-Flash) are deployed on the vision server, while the language server executes only the LLM. Communication is unidirectional, transmitting BF16 visual features over TCP, with RDMA optionally employed for higher transmission speed. Vision processing, feature transmission, and language processing are organized into an asynchronous three-stage pipeline, enabling overlapped execution and minimizing pipeline stalls.

DvD increases GPU utilization and processing efficiency on the vision side, while enabling the language server to focus exclusively on the LLM's prefilling and decoding without being blocked by vision computation. This design leads to improved throughput and responsiveness. Moreover, the architecture supports independent hardware cost optimization for the vision and language modules, and facilitates the seamless integration of new modules without requiring modifications to the language server deployment.
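The GSPO objective defined earlier can be sketched numerically: the sequence-level importance ratio is the geometric mean of per-token probability ratios, which is then clipped before being combined with the advantage. This is an illustrative computation under those definitions, not the training implementation:

```python
import math

def gspo_ratio(new_logprobs, old_logprobs):
    """Sequence-level importance ratio s(theta): geometric mean of the
    per-token probability ratios, computed in log space for stability."""
    assert len(new_logprobs) == len(old_logprobs) and new_logprobs
    log_ratio = sum(a - b for a, b in zip(new_logprobs, old_logprobs))
    return math.exp(log_ratio / len(new_logprobs))

def gspo_term(ratio, advantage, eps=0.2):
    """One response's contribution: min(s*A, clip(s, 1-eps, 1+eps)*A)."""
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped * advantage)
```

For a positive advantage, a ratio above 1 + eps is clipped (limiting the update), while for a negative advantage the unclipped term dominates, mirroring the standard PPO-style clipping structure in the objective above.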
Multi-Image Understanding & Real-World Comprehension

Comprehensive Multimodal Understanding & Multimodal Hallucination Evaluation

We provide example code to run `InternVL3.5-8B` using `transformers`. Please note that our models with up to 30B parameters can be deployed on a single A100 GPU, while the 38B model requires two A100 GPUs and the 235B model requires eight A100 GPUs.

> In most cases, both LMDeploy and vLLM can be used for model deployment. However, for InternVL3.5-20B-A4B, we recommend using vLLM, since LMDeploy does not yet support GPT-OSS.
> Please use transformers>=4.52.1 to ensure the model works normally. For the 20B version of our model, transformers>=4.55.0 is required.

To enable thinking mode, please set the system prompt to our Thinking System Prompt. When enabling Thinking mode, we recommend setting `do_sample=True` and `temperature=0.6` to mitigate undesired repetition. Besides this method, you can also use the following code to get streamed output.

Many repositories now support fine-tuning of the InternVL series models, including InternVL, SWIFT, XTuner, and others. Please refer to their documentation for more details on fine-tuning.

LMDeploy is a toolkit for compressing, deploying, and serving LLMs & VLMs. LMDeploy abstracts the complex inference process of multimodal vision-language models (VLMs) into an easy-to-use pipeline, similar to the large language model (LLM) inference pipeline.

When dealing with multiple images, you can put them all in one list. Keep in mind that multiple images will lead to a higher number of input tokens, so the size of the context window typically needs to be increased. Conducting inference with batch prompts is quite straightforward; just place them within a list structure.

There are two ways to do multi-turn conversations with the pipeline. One is to construct messages according to the OpenAI format and use the method introduced above; the other is to use the `pipeline.chat` interface.
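Multi-image inputs in the OpenAI message format described above can be constructed as plain dictionaries; the image URLs below are placeholders:

```python
# One user turn carrying two images plus a text question, in the
# OpenAI-compatible message format accepted by LMDeploy's pipeline/server.
messages = [{
    "role": "user",
    "content": [
        {"type": "text",
         "text": "Describe the differences between these two images."},
        {"type": "image_url",
         "image_url": {"url": "https://example.com/image1.jpg"}},
        {"type": "image_url",
         "image_url": {"url": "https://example.com/image2.jpg"}},
    ],
}]
```

Each additional image adds visual tokens to the input, which is why the context window typically needs to grow with the number of images.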
LMDeploy's `api_server` enables models to be easily packed into services with a single command. The provided RESTful APIs are compatible with OpenAI's interfaces. Below is an example of service startup. To use the OpenAI-style interface, you need to install OpenAI:

This project is released under the Apache-2.0 License. This project uses the pre-trained Qwen3 as a component, which is licensed under the Apache-2.0 License.

If you find this project useful in your research, please consider citing:

license:apache-2.0

OpenReasoning-Nemotron-32B-AWQ-4bit

Method Quantised using casper-hansen/AutoAWQ and the following configs:

license:apache-2.0

Hermes-4-14B-AWQ-4bit

license:apache-2.0

ERNIE-4.5-21B-A3B-Thinking-AWQ-4bit

license:apache-2.0

Hermes-4-70B-AWQ-8bit

Hermes 4 70B is a frontier, hybrid-mode reasoning model by Nous Research, based on Llama-3.1-70B, that is aligned to you.

Read the Hermes 4 technical report here: Hermes 4 Technical Report

Chat with Hermes in Nous Chat: https://chat.nousresearch.com

Training highlights include a newly synthesized post-training corpus emphasizing verified reasoning traces and massive improvements in math, code, STEM, logic, creativity, and format-faithful outputs, while preserving general assistant quality and broadly neutral alignment.

- Post-training corpus: massively increased dataset size, from 1M samples and 1.2B tokens to ~5M samples / ~60B tokens, blended across reasoning and non-reasoning data.
- Hybrid reasoning mode with explicit `<think> … </think>` segments when the model decides to deliberate, and options to make responses faster when you want.
- Top-quality, expressive reasoning that improves math, code, STEM, logic, and even creative writing and subjective responses.
- Schema adherence & structured outputs: trained to produce valid JSON for given schemas and to repair malformed objects.
- Much easier to steer and align: extreme improvements in steerability, especially reduced refusal rates.

In pursuit of the mission of producing models that are open, steerable, and capable of producing the full range of human expression, while being able to be aligned to your values, we created a new benchmark, RefusalBench, that tests a model's willingness to be helpful in a variety of scenarios commonly disallowed by closed and open models. Hermes 4 achieves SOTA on RefusalBench across all popular closed and open models in being helpful and conforming to your values, without censorship.

> Full tables, settings, and comparisons are in the technical report.

Hermes 4 uses the Llama-3 chat format with role headers and special tags.
Reasoning mode can be activated with the chat template via the flag `thinking=True` or by using the following system prompt. Note that you can add any additional system instructions before or after this system message, and it will adjust the model's policies, style, and effort of thinking, as well as its post-thinking style, format, identity, and more. You may also interleave the tool-definition system message with the reasoning one. Additionally, we provide a flag to keep the content between the `<think> ... </think>` tags, which you can play with by setting `keep_cots=True`.

Hermes 4 supports function/tool calls within a single assistant turn, produced after its reasoning. Note that you may also simply place tool definitions into the "tools:" field of your messages, and the chat template will parse them and create the system prompt for you. This also works with reasoning mode, for improved accuracy of tool use. The model will then generate tool calls within `<tool_call> ... </tool_call>` tags, for easy parsing. The tool-call tags are also added tokens, which makes them easy to parse while streaming! There are also automatic tool parsers built into vLLM and SGLang for Hermes; just set the tool parser to `hermes` in vLLM and to `qwen25` in SGLang.

- Sampling defaults that work well: `temperature=0.6, top_p=0.95, top_k=20`.
- Template: use the Llama chat format for Hermes 4 70B and 405B as shown above, or set `add_generation_prompt=True` when using `tokenizer.apply_chat_template(...)`.

For production serving on multi-GPU nodes, consider tensor-parallel inference engines (e.g., SGLang/vLLM backends) with prefix caching.

Hermes 4 is available as the original BF16 weights as well as FP8 and GGUF variants by LM Studio.

FP8: https://huggingface.co/NousResearch/Hermes-4-70B-FP8

GGUF (courtesy of the LM Studio team!): https://huggingface.co/lmstudio-community/Hermes-4-70B-GGUF

Hermes 4 is also available in other sizes with similar prompt formats.
See the Hermes 4 collection to explore them all: https://huggingface.co/collections/NousResearch/hermes-4-collection-68a731bfd452e20816725728
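Because tool calls are emitted between dedicated tags as described above, extracting them is a simple scan. The tag names follow that description, and the helper name and example output are illustrative:

```python
import json
import re

# Matches JSON bodies emitted between tool-call tags, as described above.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)

def extract_tool_calls(text: str):
    """Pull parsed JSON tool calls out of a model completion."""
    return [json.loads(body) for body in TOOL_CALL_RE.findall(text)]

# Illustrative model output containing one tool call.
output = (
    'Sure, let me check.'
    '<tool_call>{"name": "get_time", "arguments": {"tz": "UTC"}}</tool_call>'
)
calls = extract_tool_calls(output)
```

Because the tags are dedicated added tokens, a streaming client can buffer between an opening and closing tag and parse each call as soon as it completes.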

llama

command-a-reasoning-08-2025-AWQ-4bit

Method

vllm-project/llm-compressor was used to quantize the original model, with nvidia/Llama-Nemotron-Post-Training-Dataset as the calibration dataset. For further details on the quantization arguments and configuration, please see config.json and recipe.yaml.

Cohere Labs Command A Reasoning is an open-weights research release of a 111 billion parameter model optimized for tool use, agentic, and multilingual use cases, with reasoning capabilities. The model can be used with reasoning on for increased performance, or with reasoning off for lower-latency responses, via the `reasoning` parameter.

Point of Contact: Cohere Labs

License: CC-BY-NC; use also requires adhering to Cohere Labs' Acceptable Use Policy

Model: command-a-reasoning-08-2025

Model Size: 111 billion parameters

Context length: 256K

For more details about this model, please check out our blog post. You can try out Cohere Labs Command A Reasoning before downloading the weights in our hosted Hugging Face Space.

Please install transformers from the source repository, which includes the necessary changes for this model. As a result, you should get an output that looks like this, where the thinking is generated between the model's thinking start and end tags. Reasoning can be turned off by passing `reasoning=False` to `apply_chat_template`. The default value is `True`.

Model Architecture: This is an auto-regressive language model that uses an optimized transformer architecture. After pretraining, this model uses supervised fine-tuning (SFT) and preference training to align model behavior to human preferences for helpfulness and safety. The model features three layers with sliding window attention (window size 4096) and RoPE for efficient local context modeling and relative positional encoding. A fourth layer uses global attention without positional embeddings, enabling unrestricted token interactions across the entire sequence.
Languages covered: The model has been trained on 23 languages: English, French, Spanish, Italian, German, Portuguese, Japanese, Korean, Arabic, Chinese, Russian, Polish, Turkish, Vietnamese, Dutch, Czech, Indonesian, Ukrainian, Romanian, Greek, Hindi, Hebrew, and Persian.

Context Length: Command A Reasoning supports a 256K context length and a 32K output length.

Command A Reasoning has been specifically trained with conversational tool-use capabilities. This allows the model to interact with external tools such as APIs, databases, or search engines. Tool use with Command A Reasoning is supported through chat templates in Transformers. We recommend providing tool descriptions using JSON schema. If the model generates a plan and tool calls, you should add them to the chat history, then call the tool and append the result, as a dictionary, with the tool role. After that, you can call `generate()` again to let the model use the tool result in the chat. Note that this was a very brief introduction to tool calling; for more information, see the Command A prompt format docs and the Transformers tool-use documentation.

Tool Use with citations [CLICK TO EXPAND]

Optionally, one can ask the model to include grounding spans (citations) in its response, indicating the source of the information, by passing `enable_citations=True` to `tokenizer.apply_chat_template()`. When citations are turned on, the model associates pieces of text (called "spans") with the specific tool results that support them (called "sources"). Command A uses a pair of opening and closing tags to indicate when a span can be grounded onto a list of sources, listing them out in the closing tag. For example, a tagged span may indicate that it is supported by results 1 and 2 from `tool_call_id=0` as well as result 0 from `tool_call_id=1`. Sources from the same tool call are grouped together and listed as `{tool_call_id}:[{list of result indices}]`, before being joined together by ",".
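The tool-use flow above (assistant tool calls appended to history, then a tool-role result dictionary) can be sketched as plain message dictionaries. The tool name, ID, and result content are illustrative, not from the release:

```python
# Conversation history after the model requested a tool call and the tool
# was executed; the next generate() call sees the tool result in context.
messages = [
    {"role": "user", "content": "What's the weather in Toronto?"},
    {
        # The model's plan/tool calls, added back into the chat history.
        "role": "assistant",
        "tool_calls": [{
            "id": "0",
            "type": "function",
            "function": {"name": "get_weather",
                         "arguments": {"location": "Toronto"}},
        }],
    },
    # The executed tool's result, appended as a dictionary with the tool role.
    {"role": "tool", "tool_call_id": "0", "content": '{"temperature_c": 21}'},
]
```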
For errors or additional questions about details in this model card, contact [email protected]. We hope that this release will make community-based research efforts more accessible by putting the weights of a highly performant 111-billion-parameter model in the hands of researchers all over the world. This model is governed by a CC-BY-NC license and also requires adherence to Cohere Labs' Acceptable Use Policy. If you are interested in commercial use, please contact Cohere's Sales team. You can try Command A Reasoning in the playground here. You can also use it in our dedicated Hugging Face Space here.

license:apache-2.0

Seed-OSS-36B-Instruct-AWQ-8bit

Method: the original model was quantized using vllm-project/llm-compressor together with the nvidia/Llama-Nemotron-Post-Training-Dataset. For further quantization arguments and configuration details, please see config.json and recipe.yaml.

Prerequisite: please install transformers from source. You can get to know us better through the following channels 👇

> [!NOTE]
> This model card is dedicated to the `Seed-OSS-36B-Instruct` model.

News
- [2025/08/20] 🔥 We release `Seed-OSS-36B-Base` (both with and without synthetic data versions) and `Seed-OSS-36B-Instruct`.

Introduction
Seed-OSS is a series of open-source large language models developed by ByteDance's Seed Team, designed for powerful long-context, reasoning, agent, and general capabilities, along with versatile developer-friendly features. Although trained with only 12T tokens, Seed-OSS achieves excellent performance on several popular open benchmarks. We release this series of models to the open-source community under the Apache-2.0 license.

> [!NOTE]
> Seed-OSS is primarily optimized for international (i18n) use cases.

Key Features
- Flexible Control of Thinking Budget: allows users to adjust the reasoning length as needed. This ability to dynamically control the reasoning length improves inference efficiency in practical applications.
- Enhanced Reasoning Capability: specifically optimized for reasoning tasks while maintaining balanced, excellent general capabilities.
- Agentic Intelligence: performs exceptionally well in agentic tasks such as tool use and issue resolving.
- Research-Friendly: because including synthetic instruction data in pre-training may affect post-training research, we release pre-trained models both with and without instruction data, giving the research community more diverse options.
- Native Long Context: trained natively with up to 512K tokens of context.
Seed-OSS adopts the popular causal language model architecture with RoPE, GQA attention, RMSNorm, and SwiGLU activation.

| | Seed-OSS-36B |
|:---:|:---:|
| Parameters | 36B |
| Attention | GQA |
| Activation Function | SwiGLU |
| Number of Layers | 64 |
| Number of QKV Heads | 80 / 8 / 8 |
| Head Size | 128 |
| Hidden Size | 5120 |
| Vocabulary Size | 155K |
| Context Length | 512K |
| RoPE Base Frequency | 1e7 |

Incorporating synthetic instruction data into pretraining leads to improved performance on most benchmarks. We adopt the version augmented with synthetic instruction data (i.e., w/ syn.) as `Seed-OSS-36B-Base`. We also release `Seed-OSS-36B-Base-woSyn`, trained without such data (i.e., w/o syn.), offering the community a high-performance foundation model unaffected by synthetic instruction data. Base-model benchmarks compare Seed1.6-Base, Qwen3-30B-A3B-Base-2507, Qwen2.5-32B-Base, Seed-OSS-36B-Base (w/ syn.), and Seed-OSS-36B-Base-woSyn (w/o syn.).

| Benchmark | Seed1.6-Thinking-0715 | OAI-OSS-20B | Qwen3-30B-A3B-Thinking-2507 | Qwen3-32B | Gemma3-27B | Seed-OSS-36B-Instruct |
|---|---|---|---|---|---|---|
| GPQA-D | 80.7 | 72.2 (71.5) | 71.4 (73.4) | 66.7 (68.4) | 42.4 | 71.4 |
| LiveCodeBench v6 (02/2025-05/2025) | 66.8 | 63.8 | 60.3 (66) | 53.4 | - | 67.4 |
| SWE-Bench Verified (OpenHands) | 41.8 | (60.7) | 31 | 23.4 | - | 56 |
| SWE-Bench Verified (AgentLess 410) | 48.4 | - | 33.5 | 39.7 | - | 47 |

- Bold denotes the open-source SOTA; underlined indicates second place among open-source models.
- Results are presented in the format "reproduced_results (reported_results_if_any)". Some results have been omitted due to evaluation-run failures.
- The results of Gemma3-27B are sourced directly from its technical report.
- Generation configs for Seed-OSS-36B-Instruct: temperature=1.1, top_p=0.95. Specifically, for TauBench, temperature=1, top_p=0.7.

> [!NOTE]
> We recommend sampling with `temperature=1.1` and `top_p=0.95`.
Users can flexibly specify the model's thinking budget. The figure below shows performance curves across different tasks as the thinking budget varies. For simpler tasks (such as IFEval), the model's chain of thought (CoT) is shorter, and the score fluctuates as the thinking budget increases. For more challenging tasks (such as AIME and LiveCodeBench), the CoT is longer, and the score improves as the thinking budget grows.

Here is an example with a thinking budget of 512: during reasoning, the model periodically triggers self-reflection to estimate the consumed and remaining budget, and delivers the final response once the budget is exhausted or the reasoning concludes. If no thinking budget is set (the default mode), Seed-OSS will think with unlimited length. If a thinking budget is specified, we advise choosing integer multiples of 512 (e.g., 512, 1K, 2K, 4K, 8K, or 16K), since the model has been extensively trained on these intervals. The model is instructed to output a direct response when the thinking budget is 0, and we recommend rounding any budget below 512 down to 0.

Download the Seed-OSS checkpoint to `./Seed-OSS-36B-Instruct`.

Transformers: the `generate.py` script provides a simple interface for model inference with configurable options.

Key Parameters
| Parameter | Description |
|-----------|-------------|
| `--model_path` | Path to the pretrained model directory (required) |
| `--prompts` | Input prompts (default: sample cooking/code questions) |
| `--max_new_tokens` | Maximum tokens to generate (default: 4096) |
| `--attn_implementation` | Attention mechanism: `flash_attention_2` (default) or `eager` |
| `--load_in_4bit/8bit` | Enable 4-bit/8-bit quantization (reduces memory usage) |
| `--thinking_budget` | Thinking budget in tokens (default: -1 for unlimited budget) |

- First install the vLLM version with Seed-OSS support.

License: this project is licensed under Apache-2.0.
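The budget-selection guidance above (use multiples of 512; round anything below 512 down to 0, which requests a direct response; negative means unlimited) can be captured in a small helper. This is an illustrative utility, not part of the Seed-OSS release:

```python
def normalize_thinking_budget(requested: int) -> int:
    """Snap a requested thinking budget to the values Seed-OSS was trained on.

    A negative value keeps the default unlimited-thinking mode; budgets below
    512 become 0 (direct response, no thinking); everything else is rounded
    to the nearest multiple of 512.
    """
    if requested < 0:
        return -1          # unlimited thinking (default mode)
    if requested < 512:
        return 0           # direct response, no thinking
    return round(requested / 512) * 512

print(normalize_thinking_budget(-1))    # -1
print(normalize_thinking_budget(300))   # 0
print(normalize_thinking_budget(1000))  # 1024
```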
See the LICENSE file for details. Founded in 2023, ByteDance's Seed Team is dedicated to crafting the industry's most advanced AI foundation models. The team aspires to become a world-class research team and to make significant contributions to the advancement of science and society.

license:apache-2.0

Ring-mini-2.0-AWQ-8bit

license:mit

K2-Think-AWQ-4bit

license:apache-2.0

LFM2-8B-A1B-AWQ-8bit

LFM2 is a new generation of hybrid models developed by Liquid AI, specifically designed for edge AI and on-device deployment. It sets a new standard in terms of quality, speed, and memory efficiency. We're releasing the weights of our first MoE based on LFM2, with 8.3B total parameters and 1.5B active parameters.

- LFM2-8B-A1B is the best on-device MoE in terms of both quality (comparable to 3-4B dense models) and speed (faster than Qwen3-1.7B).
- Code and knowledge capabilities are significantly improved compared to LFM2-2.6B.
- Quantized variants fit comfortably on high-end phones, tablets, and laptops.

Find more information about LFM2-8B-A1B in our blog post. Due to their small size, we recommend fine-tuning LFM2 models on narrow use cases to maximize performance. They are particularly suited for agentic tasks, data extraction, RAG, creative writing, and multi-turn conversations. However, we do not recommend using them for tasks that are knowledge-intensive or require programming skills.

| Property | LFM2-8B-A1B |
| --------------------- | ----------------------------- |
| Total parameters | 8.3B |
| Active parameters | 1.5B |
| Layers | 24 (18 conv + 6 attn) |
| Context length | 32,768 tokens |
| Vocabulary size | 65,536 |
| Training precision | Mixed BF16/FP8 |
| Training budget | 12 trillion tokens |
| License | LFM Open License v1.0 |

Supported languages: English, Arabic, Chinese, French, German, Japanese, Korean, and Spanish.

Generation parameters: we recommend `temperature=0.3`, `min_p=0.15`, and `repetition_penalty=1.05`.

Chat template: LFM2 uses a ChatML-like chat template. You can apply it automatically with the dedicated `.apply_chat_template()` function from Hugging Face transformers.

Tool use: it consists of four main steps:
1. Function definition: LFM2 takes JSON function definitions as input (JSON objects between dedicated special tokens), usually in the system prompt.
2. Function call: LFM2 writes Pythonic function calls (a Python list between dedicated special tokens) as the assistant answer.
3. Function execution: the function call is executed and the result is returned (a string between dedicated special tokens) with a "tool" role.
4. Final answer: LFM2 interprets the outcome of the function call to address the original user prompt in plain text.

Here is a simple example of a conversation using tool use:

Architecture: hybrid model with multiplicative gates and short convolutions: 18 double-gated short-range LIV convolution blocks and 6 grouped query attention (GQA) blocks.

Pre-training mixture: approximately 75% English, 20% multilingual, and 5% code data sourced from the web and licensed materials.

Training approach:
- Very large-scale SFT on a 50/50 mix of downstream tasks and general domains
- Custom DPO with length normalization and semi-online datasets
- Iterative model merging

To run LFM2, you need to install Hugging Face `transformers` from source. Here is an example of how to generate an answer with transformers in Python. You can directly run and test the model with this Colab notebook. You can run the model in `vLLM` by building from source. You can run LFM2 with llama.cpp using its GGUF checkpoint; find more information in the model card. We recommend fine-tuning LFM2 models on your use cases to maximize performance.

| Notebook | Description | Link |
|-------|------|------|
| SFT (TRL) | Supervised Fine-Tuning (SFT) notebook with a LoRA adapter using TRL. | |
| DPO (TRL) | Preference alignment with Direct Preference Optimization (DPO) using TRL. | |

Compared to similar-sized models, LFM2-8B-A1B displays strong performance in instruction following and math while also running significantly faster.
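The four tool-use steps above can be illustrated end to end. The snippet below is a sketch under stated assumptions: the special tokens wrapping each stage are omitted (they are defined by the LFM2 chat template), `get_battery_level` is a hypothetical tool, and `ast`-based parsing of the Pythonic call list is our illustration rather than the official decoding path:

```python
import ast

# Step 1: a JSON function definition, as would be placed in the system prompt.
tool_def = {
    "name": "get_battery_level",  # hypothetical tool
    "description": "Return the device battery level as a percentage.",
    "parameters": {"type": "object", "properties": {}, "required": []},
}

# Step 2: LFM2 emits a Pythonic call list as the assistant answer, e.g.:
model_output = '[get_battery_level()]'

# Parse the call list safely, without executing arbitrary code.
calls = ast.parse(model_output, mode="eval").body  # ast.List of ast.Call nodes
parsed = [(c.func.id, {kw.arg: ast.literal_eval(kw.value) for kw in c.keywords})
          for c in calls.elts]
print(parsed)  # [('get_battery_level', {})]

# Step 3: execute the tool and return the result as a string with the "tool" role.
tool_result = '{"battery": 87}'

# Step 4: the model would then answer the user in plain text using this result.
```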
| Model | MMLU | MMLU-Pro | GPQA | IFEval | IFBench | Multi-IF |
|---|---|---|---|---|---|---|
| LFM2-8B-A1B | 64.84 | 37.42 | 29.29 | 77.58 | 25.85 | 58.19 |
| LFM2-2.6B | 64.42 | 25.96 | 26.57 | 79.56 | 22.19 | 60.26 |
| Llama-3.2-3B-Instruct | 60.35 | 22.25 | 30.6 | 71.43 | 20.78 | 50.91 |
| SmolLM3-3B | 59.84 | 23.90 | 26.31 | 72.44 | 17.93 | 58.86 |
| gemma-3-4b-it | 58.35 | 34.76 | 29.51 | 76.85 | 23.53 | 66.61 |
| Qwen3-4B-Instruct-2507 | 72.25 | 52.31 | 34.85 | 85.62 | 30.28 | 75.54 |
| granite-4.0-h-tiny | 66.79 | 32.03 | 26.46 | 81.06 | 18.37 | 52.99 |

| Model | GSM8K | GSMPlus | MATH 500 | MATH Lvl 5 | MGSM | MMMLU |
|---|---|---|---|---|---|---|
| LFM2-8B-A1B | 84.38 | 64.76 | 74.2 | 62.38 | 72.4 | 55.26 |
| LFM2-2.6B | 82.41 | 60.75 | 63.6 | 54.38 | 74.32 | 55.39 |
| Llama-3.2-3B-Instruct | 75.21 | 38.68 | 41.2 | 24.06 | 61.68 | 47.92 |
| SmolLM3-3B | 81.12 | 58.91 | 73.6 | 51.93 | 68.72 | 50.02 |
| gemma-3-4b-it | 89.92 | 68.38 | 73.2 | 52.18 | 87.28 | 50.14 |
| Qwen3-4B-Instruct-2507 | 68.46 | 56.16 | 85.6 | 73.62 | 81.76 | 60.67 |
| granite-4.0-h-tiny | 82.64 | 59.14 | 58.2 | 36.11 | 73.68 | 56.13 |

| Model | Active params | LCB v6 | LCB v5 | HumanEval+ | Creative Writing v3 |
|---|---|---|---|---|---|
| LFM2-8B-A1B | 1.5B | 21.04% | 21.36% | 69.51% | 44.22% |
| Gemma-3-1b-it | 1B | 4.27% | 4.43% | 37.20% | 41.67% |
| Granite-4.0-h-tiny | 1B | 26.73% | 27.27% | 73.78% | 32.60% |
| Llama-3.2-1B-Instruct | 1.2B | 4.08% | 3.64% | 23.17% | 31.43% |
| Qwen2.5-1.5B-Instruct | 1.5B | 11.18% | 10.57% | 48.78% | 22.18% |
| Qwen3-1.7B (/no_think) | 1.7B | 24.07% | 26.48% | 60.98% | 31.56% |
| LFM2-2.6B | 2.6B | 14.41% | 14.43% | 57.93% | 38.79% |
| SmolLM3-3B | 3.1B | 19.05% | 19.20% | 60.37% | 36.44% |
| Llama-3.2-3B-Instruct | 3.2B | 11.47% | 11.48% | 24.06% | 38.84% |
| Qwen3-4B (/no_think) | 4B | 36.11% | 38.64% | 71.95% | 37.49% |
| Qwen3-4B-Instruct-2507 | 4B | 48.72% | 50.80% | 82.32% | 51.71% |
| Gemma-3-4b-it | 4.3B | 18.86% | 19.09% | 62.8% | 68.56% |

LFM2-8B-A1B is significantly faster than models with a similar number of active parameters, like Qwen3-1.7B. The following plots showcase the performance of different models under int4 quantization with int8 dynamic activations on the AMD Ryzen AI 9 HX 370 CPU, using 16 threads. The results are obtained using our internal XNNPACK-based inference stack and a custom CPU MoE kernel. If you are interested in custom solutions with edge deployment, please contact our sales team.


cwm-AWQ-4bit

llama

KAT-Dev-72B-Exp-AWQ-8bit

🔥 We're thrilled to announce the release of KAT-Dev-72B-Exp, our latest and most powerful model yet! 🔥 You can now try our strongest proprietary coder model, KAT-Coder, for free on the StreamLake platform. Highlights: KAT-Dev-72B-Exp is an open-source 72B-parameter model for software engineering tasks. On SWE-Bench Verified, KAT-Dev-72B-Exp achieves 74.6% accuracy ⚡ when evaluated strictly with the SWE-agent scaffold. KAT-Dev-72B-Exp is the experimental reinforcement-learning version of the KAT-Coder model. Through this open-source release, we aim to share with developers and researchers the technical innovations behind KAT-Coder's large-scale RL. We rewrote the attention kernel and redesigned the training engine for shared-prefix trajectories to achieve highly efficient RL training, especially for scaffolds that leverage context management. Furthermore, to prevent the exploration collapse observed during RL training, we reshape the advantage distribution based on pass rates: amplifying the advantage scale of highly exploratory groups while reducing that of low-exploration ones.
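One way to read the pass-rate-based advantage reshaping described above is as a group-level rescaling on top of group-normalized (GRPO-style) advantages: groups whose pass rate sits near 0.5 carry the most exploration signal, while groups that are almost always or almost never solved carry little. The binary-entropy weighting below is our illustrative interpretation, not the released training code:

```python
import math

def reshape_advantages(rewards: list[float]) -> list[float]:
    """Group-normalized advantages rescaled by an exploration weight.

    rewards: binary pass/fail outcomes (1.0 / 0.0) for one group of rollouts
    sampled from the same prompt. The exploration weight uses the binary
    entropy of the group's pass rate (an illustrative choice): it peaks at
    pass rate 0.5 and vanishes as the group becomes trivially easy or hard.
    """
    n = len(rewards)
    p = sum(rewards) / n                          # group pass rate
    std = math.sqrt(sum((r - p) ** 2 for r in rewards) / n) or 1.0
    base = [(r - p) / std for r in rewards]       # GRPO-style normalized advantage
    if p in (0.0, 1.0):
        weight = 0.0                              # no exploration signal left
    else:
        weight = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
    return [weight * a for a in base]

# A group solved half the time keeps full scale; a fully solved group is zeroed.
print(reshape_advantages([1, 1, 0, 0]))  # [1.0, 1.0, -1.0, -1.0]
```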

license:apache-2.0

LFM2-8B-A1B-AWQ-4bit



cogito-v2-preview-llama-109B-MoE-AWQ-4bit

The Cogito v2 LLMs are instruction-tuned generative models. All models are released under an open license for commercial use.

- Cogito v2 models are hybrid reasoning models: each model can answer directly (like a standard LLM) or self-reflect before answering (like a reasoning model).
- The LLMs are trained using Iterated Distillation and Amplification (IDA), a scalable and efficient alignment strategy for superintelligence based on iterative self-improvement.
- The models have been optimized for coding, STEM, instruction following, and general helpfulness, and have significantly stronger multilingual, coding, and tool-calling capabilities than size-equivalent counterparts.
- In both standard and reasoning modes, Cogito v2-preview models outperform their size-equivalent counterparts on common industry benchmarks.
- This model is trained in over 30 languages and supports long contexts (up to 10M tokens).

Evaluations: here is the model's performance on some standard industry benchmarks. For detailed evaluations, please refer to the Blog Post.

Usage: here is a snippet for usage with Transformers.

Implementing extended thinking
- By default, the model will answer in standard mode.
- To enable thinking, you can use either of two methods:
  - Set `enable_thinking=True` while applying the chat template.
  - Add a specific system prompt and prefill the response with the thinking prefix.

NOTE: unlike Cogito v1 models, we initiate the response with the thinking prefix at the beginning of every output when reasoning is enabled. This is because hybrid models can be brittle at times, and prefilling the response ensures that the model does indeed respect thinking.

Method 1 - Set `enable_thinking=True` in the tokenizer: if you are using Hugging Face tokenizers, you can simply add the argument `enable_thinking=True` to the tokenization call (this option is added to the chat template).

Method 2 - Add a specific system prompt, along with prefilling the response with the thinking prefix.
To enable thinking using this method, you need two steps.

Step 1 - Use this in the system prompt: `system_instruction = 'Enable deep thinking subroutine.'` If you already have a system instruction, then use `system_instruction = 'Enable deep thinking subroutine.' + '\n\n' + system_instruction`.

Step 2 - Prefill the response with the thinking-prefix tokens.

Similarly, if you have a system prompt, you can prepend the `DEEP_THINKING_INSTRUCTION` to it in the same way.

Tool Calling: Cogito models support tool calling (single, parallel, multiple, and parallel-multiple) in both standard and extended thinking modes. You can then generate text from this input as normal. If the model generates a tool call, you should add it to the chat, then call the tool and append the result with the `tool` role. After that, you can `generate()` again to let the model use the tool result in the chat.

License: this repository and the model weights are licensed under the Llama 4 Community License Agreement (Llama models' default license agreement).

Contact: if you would like to reach out to our team, send an email to [email protected].
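Step 1 above can be sketched as plain message-list manipulation. The helper below is illustrative: the instruction string is the one quoted in the card, while the helper's name and the exact prefill token for step 2 are assumptions to verify against the released chat template:

```python
DEEP_THINKING_INSTRUCTION = "Enable deep thinking subroutine."

def with_deep_thinking(messages: list[dict]) -> list[dict]:
    """Prepend the deep-thinking instruction to the system prompt (step 1)."""
    msgs = [dict(m) for m in messages]  # avoid mutating the caller's list
    if msgs and msgs[0]["role"] == "system":
        msgs[0]["content"] = DEEP_THINKING_INSTRUCTION + "\n\n" + msgs[0]["content"]
    else:
        msgs.insert(0, {"role": "system", "content": DEEP_THINKING_INSTRUCTION})
    return msgs

chat = [{"role": "user", "content": "What is 17 * 24?"}]
prepared = with_deep_thinking(chat)
print(prepared[0])
# {'role': 'system', 'content': 'Enable deep thinking subroutine.'}
# Step 2 would then prefill the assistant response with the thinking-prefix
# token before calling generate().
```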

llama4

Qwen3-Omni-30B-A3B-Captioner-AWQ-8bit


granite-4.0-h-micro-AWQ-8bit

Model Summary: Granite-4.0-H-Micro is a 3B-parameter long-context instruct model finetuned from Granite-4.0-H-Micro-Base using a combination of permissively licensed open-source instruction datasets and internally collected synthetic datasets. This model is developed using a diverse set of techniques with a structured chat format, including supervised finetuning, model alignment using reinforcement learning, and model merging. Granite 4.0 instruct models feature improved instruction-following (IF) and tool-calling capabilities, making them more effective in enterprise applications.

- Developers: Granite Team, IBM
- HF Collection: Granite 4.0 Language Models HF Collection
- GitHub Repository: ibm-granite/granite-4.0-language-models
- Website: Granite Docs
- Release Date: October 2nd, 2025
- License: Apache 2.0

Supported Languages: English, German, Spanish, French, Japanese, Portuguese, Arabic, Czech, Italian, Korean, Dutch, and Chinese. Users may finetune Granite 4.0 models for languages beyond these.

Intended use: the model is designed to respond to general instructions and can be used to build AI assistants for multiple domains, including business applications.

Capabilities
- Summarization
- Text classification
- Text extraction
- Question-answering
- Retrieval Augmented Generation (RAG)
- Code-related tasks
- Function-calling tasks
- Multilingual dialog use cases
- Fill-In-the-Middle (FIM) code completions

Generation: this is a simple example of how to use the Granite-4.0-H-Micro model. Copy the snippet from the section that is relevant for your use case.

Tool-calling: Granite-4.0-H-Micro comes with enhanced tool-calling capabilities, enabling seamless integration with external functions and APIs. To define a list of tools, please follow OpenAI's function definition schema.
This is an example of how to use the Granite-4.0-H-Micro model's tool-calling ability:

Benchmarks: results are reported for the Micro (Dense), H Micro (Dense), H Tiny (MoE), and H Small (MoE) variants. Multilingual benchmarks and the included languages:
- MMMLU (11): ar, de, en, es, fr, ja, ko, pt, zh, bn, hi
- INCLUDE (14): hi, bn, ta, te, ar, de, es, fr, it, ja, ko, nl, pt, zh

Model Architecture: the Granite-4.0-H-Micro baseline is built on a decoder-only dense transformer architecture. Core components of this architecture are: GQA, Mamba2, MLP with SwiGLU, RMSNorm, and shared input/output embeddings.

| Model | Micro (Dense) | H Micro (Dense) | H Tiny (MoE) | H Small (MoE) |
|---|---|---|---|---|
| Number of layers | 40 attention | 4 attention / 36 Mamba2 | 4 attention / 36 Mamba2 | 4 attention / 36 Mamba2 |
| MLP / Shared expert hidden size | 8192 | 8192 | 1024 | 1536 |

Training Data: overall, our SFT data is largely comprised of three key sources: (1) publicly available datasets with permissive licenses, (2) internal synthetic data targeting specific capabilities, and (3) a select set of human-curated data.

Infrastructure: we trained the Granite 4.0 language models on an NVIDIA GB200 NVL72 cluster hosted in CoreWeave. Intra-rack communication occurs via the 72-GPU NVLink domain, and a non-blocking, full Fat-Tree NDR 400 Gb/s InfiniBand network provides inter-rack communication. This cluster provides a scalable and efficient infrastructure for training our models over thousands of GPUs.

Ethical Considerations and Limitations: Granite 4.0 instruct models are primarily finetuned using instruction-response pairs, mostly in English, but also multilingual data covering multiple languages. Although this model can handle multilingual dialog use cases, its performance might not match that on English tasks. In such cases, introducing a small number of examples (few-shot) can help the model generate more accurate outputs.
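As noted above, tool definitions follow OpenAI's function definition schema. A minimal example of such a definition looks like this (the `get_stock_price` tool is a made-up illustration, not part of the model card):

```python
# A tool list in OpenAI's function-definition schema, as the card recommends.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_stock_price",  # hypothetical tool
            "description": "Get the current price of a stock ticker.",
            "parameters": {
                "type": "object",
                "properties": {
                    "ticker": {
                        "type": "string",
                        "description": "Stock ticker symbol, e.g. IBM",
                    }
                },
                "required": ["ticker"],
            },
        },
    }
]

# With transformers, such a list is typically passed to the chat template via
# tokenizer.apply_chat_template(messages, tools=tools, ...).
assert tools[0]["function"]["parameters"]["required"] == ["ticker"]
```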
While this model has been aligned with safety in mind, it may in some cases produce inaccurate, biased, or unsafe responses to user prompts. We therefore urge the community to use this model with proper safety testing and tuning tailored to their specific tasks.

Resources
- ⭐️ Learn about the latest updates with Granite: https://www.ibm.com/granite
- 📄 Get started with tutorials, best practices, and prompt engineering advice: https://www.ibm.com/granite/docs/
- 💡 Learn about the latest Granite learning resources: https://ibm.biz/granite-learning-resources

license:apache-2.0

command-a-reasoning-08-2025-AWQ-8bit

Method vllm-project/llm-compressor and nvidia/Llama-Nemotron-Post-Training-Dataset were used to quantize the original model. For further quantization arguments and configurations information, please visit config.json and recipe.yaml. Cohere Labs Command A Reasoning is an open weights research release of a 111 billion parameter model optimized for tool use, agentic, and multilingual use cases with reasoning capabilities. The model can be used both with reasoning on for increased performance or with reasoning off for lower latency responses, using the ‘reasoning’ parameter. Point of Contact: Cohere Labs License:CC-BY-NC, requires also adhering to Cohere Lab's Acceptable Use Policy Model: command-a-reasoning-08-2025 Model Size: 111 billion parameters Context length: 256K For more details about this model, please check out our blog post. You can try out Cohere Labs Command A Reasoning before downloading the weights in our hosted Hugging Face Space. Please install transformers from the source repository that includes the necessary changes for this model. As a result, you should get an output that looks like this, where the thinking is generated between the ` ` and ` `: Reasoning can be turned off by passing `reasoning=False` to `applychattemplate`. The default value is `True`. Model Architecture: This is an auto-regressive language model that uses an optimized transformer architecture. After pretraining, this model uses supervised fine-tuning (SFT) and preference training to align model behavior to human preferences for helpfulness and safety. The model features three layers with sliding window attention (window size 4096) and RoPE for efficient local context modeling and relative positional encoding. A fourth layer uses global attention without positional embeddings, enabling unrestricted token interactions across the entire sequence. 
Languages covered: The model has been trained on 23 languages: English, French, Spanish, Italian, German, Portuguese, Japanese, Korean, Arabic, Chinese, Russian, Polish, Turkish, Vietnamese, Dutch, Czech, Indonesian, Ukrainian, Romanian, Greek, Hindi, Hebrew, and Persian. Context Length: Command A Reasoning supports a context length of 256K & 32K output length. Command A Reasoning has been specifically trained with conversational tool use capabilities. This allows the model to interact with external tools like APIs, databases, or search engines. Tool use with Command A Reasoning is supported through chat templates in Transformers. We recommend providing tool descriptions using JSON schema. If the model generates a plan and tool calls, you should add them to the chat history like so: and then call the tool and append the result, as a dictionary, with the tool role, like so: After that, you can `generate()` again to let the model use the tool result in the chat. Note that this was a very brief introduction to tool calling - for more information, see the Command A prompt format docs and the Transformers tool use documentation. Tool Use with citations [CLICK TO EXPAND] Optionally, one can ask the model to include grounding spans (citations) in its response to indicate the source of the information, by using enablecitations=True in tokenizer.applychattemplate(). The generation would look like this: When citations are turned on, the model associates pieces of texts (called "spans") with those specific tool results that support them (called "sources"). Command A uses a pair of tags " " and " " to indicate when a span can be grounded onto a list of sources, listing them out in the closing tag. For example, " span " means that "span" is supported by result 1 and 2 from "toolcallid=0" as well as result 0 from "toolcallid=1". Sources from the same tool call are grouped together and listed as "{toolcallid}:[{list of result indices}]", before they are joined together by ",". 
For errors or additional questions about details in this model card, contact [email protected]. We hope that releasing the weights of a highly performant 111-billion-parameter model to researchers all over the world will make community-based research efforts more accessible. This model is governed by a CC-BY-NC license and also requires adherence to Cohere Labs' Acceptable Use Policy. If you are interested in commercial use, please contact Cohere's Sales team. You can try Command A Reasoning in the playground here. You can also use it in our dedicated Hugging Face Space here.

license:apache-2.0
23
0

gpt-oss-120b-BF16

license:apache-2.0
22
0

XBai-o4-AWQ-4bit

license:apache-2.0
21
3

cogito-v2-preview-llama-70B-AWQ-4bit

llama
20
0

cogito-v2-preview-llama-109B-MoE-GPTQ-4bit

llama4
18
0

Llama-3_3-Nemotron-Super-49B-v1_5-GPTQ-8bit

llama-3
14
0

Hermes-4-14B-AWQ-8bit

license:apache-2.0
14
0

cwm-AWQ-8bit

llama
14
0

MindLink-72B-0801-AWQ-4bit

license:apache-2.0
10
1

Apertus-8B-Instruct-2509-GPTQ-8bit

license:apache-2.0
9
0

Light-IF-32B-AWQ

Method
Quantised using vllm-project/llm-compressor, nvidia/Llama-Nemotron-Post-Training-Dataset and the following configs:

Evaluation

| Model | SuperClue | IFEval | CFBench | IFBench |
| ---- | ---- | ---- | ---- | ---- |
| Qwen3-32B | 0.234 | 0.877 | 0.823 | 0.384 |
| Qwen3-235B-A22B | 0.244 | 0.882 | 0.834 | 0.423 |
| Qwen3-235B-A22B-Thinking-2507 | 0.434 | - | - | 0.475 |
| DeepSeek-R1-0528 | 0.436 | 0.863 | 0.827 | 0.415 |
| Doubao-seed-1-6-thinking-250615 | 0.362 | 0.832 | 0.82 | 0.477 |
| ChatGPT-4o-latest | 0.260 | 0.836 | 0.807 | 0.365 |
| Deepseek-v3-250324 | 0.306 | 0.859 | 0.833 | 0.405 |
| Doubao-1.5-pro-32k-250115 | 0.285 | 0.889 | 0.797 | 0.375 |
| Kimi-K2 | 0.227 | 0.921 | 0.820 | 0.395 |
| Light-IF-32B (ours) 🤗 | 0.575 | 0.938 | 0.85 | 0.575 |

Introduction
Instruction following is a core ability of large language models (LLMs), but performance remains inconsistent, especially on complex tasks. We identify lazy reasoning during the thinking stage as a key cause of poor instruction adherence. To address this, we propose a framework that promotes rigorous reasoning through previewing and self-checking. Our method begins by generating instruction data with complex constraints, filtering out samples that are too easy or too difficult. We then use rejection sampling to build a small but high-quality dataset for model adaptation. Training involves entropy-preserving supervised fine-tuning (Entropy-SFT) and token-wise entropy-adaptive reinforcement learning (TEA-RL), guided by rule-based multidimensional rewards. This approach encourages models to plan ahead and verify their outputs, fostering more generalizable reasoning abilities. Experiments show consistent improvements across model sizes. Notably, our 32B model outperforms both larger open-source models such as DeepSeek-R1 and closed-source models such as ChatGPT-4o on challenging instruction-following benchmarks.

The following code snippet illustrates how to use the model to generate content based on given inputs; what follows is a sample reasoning trace produced by the model.

Got it, let's tackle this request.
First, I need to remember all the requirements: 15 lines total, 4 stanzas, each stanza starts with a highlighted subtitle, no more than 300 words. First, I should figure out how to split 15 lines into 4 stanzas. Since 4 stanzas, the most even split would be maybe 4, 4, 4, 3 lines. That adds up to 15. Let's check: 4+4+4+3=15. Perfect. Next, the subtitles. They need to be highlighted, so maybe italics or bold. The user said "highlighted subtitle," so I'll use italics for consistency. The subtitles should be meaningful and set the theme of each stanza. Let's think of a cohesive theme—maybe something natural, like a journey through the day or seasons, or growth. Let's go with "growth" since it's easy to split into stanzas. First stanza: Maybe "seed" or "beginning." Let's say Beneath the Soil as the first subtitle. That's a strong start for growth. Second stanza: Next step in growth—Sprout Through Cracks. That's good, continues the growth theme, adds imagery. Third stanza: Maybe Bend, But Not Break—growth often involves resilience. Fourth stanza: Bloom Into Sun—culmination, reaching the light. Now, I need to make sure each stanza starts with the subtitle, then the lines. Let's draft the first stanza: Beneath the Soil a heartbeat of green, wrapped tight in dark, waiting for the world to breathe. Wait, that's 4 lines (subtitle plus 3 lines). Wait, no: the user said each stanza begins with a highlighted subtitle. Does the subtitle count as a line? The original request says "total of 15 lines" and "4 stanzas, each beginning with a highlighted subtitle." So probably, the subtitles are part of the lines, so each stanza's first line is the subtitle, then the rest. Wait, but the example: if the first stanza is 4 lines, then subtitle (1) + 3 lines. Let's check: 4 stanzas, total lines: 15. Let's split: 4 stanzas: maybe stanza 1: 4 lines, stanza 2: 4, stanza 3: 4, stanza 4: 3? Wait, 4+4+4+3=15. That's a possible split. But need to make sure the flow is natural. 
Let's adjust the number of lines per stanza: for example, stanza 1: 4 lines (subtitle + 3), stanza 2: 4 (subtitle + 3), stanza 3: 4, stanza 4: 3. Let's try drafting stanza 1: Beneath the Soil a heartbeat of green, wrapped tight in dark, waiting for the world to breathe. Stanza 2: Sprout Through Cracks tip tilts toward light, pushing through stone and time, a thin stem holding sky. Stanza 3: Bend, But Not Break when wind screams like fear, I arch, then rise again, stronger where I was torn. Stanza 4: Bloom Into Sun petals open like questions, each one asking to stay, here. Wait, total lines: 4+4+4+3=15. Perfect. Now, check the word count. Let's count: Stanza 1: 22 words (approximately: "Beneath the Soil" is 3, then 3 lines: 22-3=19? Wait, need to count exactly. Let's do it: Beneath the Soil: 3 words. a heartbeat of green: 4 wrapped tight in dark: 4 waiting for the world to breathe: 6 Total stanza 1: 3+4+4+6=17 words. Stanza 2: 3+4+4+5=16 (wait: Sprout Through Cracks: 3 words. tip tilts toward light: 4 pushing through stone and time: 5 a thin stem holding sky: 5. Wait, no: 3+4+5+5=17? Wait, no: 3+4+5+5? Wait, no, each line is a separate count. Let's do it again: Stanza 2: 3 (subtitle) + 4 +5 +5 = 17? Wait, no: "tip tilts toward light" is 4 words. "pushing through stone and time" is 5 words. "a thin stem holding sky" is 5 words. So 3+4+5+5=17. Stanza 3: Bend, But Not Break (4) + 5 (when wind screams like fear) + 5 (I arch, then rise again) + 5 (stronger where I was torn) = 4+5+5+5=19. Stanza 4: Bloom Into Sun (3) + 4 (petals open like questions) + 5 (each one asking to stay) + 1 (here) = 3+4+5+1=13. Total words: 17+17+19+13=66 words. Way under 300. Good. Now, check the flow: the theme of growth from seed to sprout to resilience to bloom. Makes sense. But maybe adjust for more imagery or emotional depth. Let's revise some lines for better rhythm or impact. Stanza 1: Beneath the Soil a heartbeat of green, wrapped tight in dark, waiting for the world to breathe. 
Stanza 2: Sprout Through Cracks tip tilts toward the faint, pushing through stone and time, a fragile stem holding sky. Better: "faint" instead of "light" for more subtlety, "fragile" adds depth. Stanza 3: Bend, But Not Break when wind screams like rage, I arch, then rise again, stronger where I was torn. "Rage" instead of "fear" to make the wind more destructive. Stanza 4: Bloom Into Sun petals open like hearts, each one asking to be seen, here. "Like hearts" adds emotion, "asking to be seen" is more vivid than "stay". Stanza 2: "faint" is 1 word instead of 5 (wait, no: "toward the faint" is 3 words, so total stanza 2: 3 (subtitle) + 5 (tip tilts toward the faint) + 5 (pushing through stone and time) + 5 (a fragile stem holding sky) = 3+5+5+5=18. Stanza 3: 4 (subtitle) + 5 (when wind screams like rage) + 5 (I arch, then rise again) + 5 (stronger where I was torn) = 4+5+5+5=19. Stanza 4: 3 (subtitle) + 4 (petals open like hearts) + 6 (each one asking to be seen) + 1 (here) = 3+4+6+1=14. Now, check the rhythm. Each line is roughly similar in length, maybe iambic tetrameter or free verse. Since the user didn't specify rhythm, free verse is fine, but should flow naturally. Another revision: maybe make some lines more concise or impactful. For example, stanza 1: "waiting for the world to breathe" is good. Stanza 2: "toward the faint" could be "toward faint light" to be more specific. So: tip tilts toward faint light, (5 words). Then stanza 2 word count: 3+5+5+5=18. Stanza 3: "when wind screams like rage" is 5 words. Maybe "when winds roar like rage" for more intensity. So: 5 words. Then stanza 3 word count: 4+5+5+5=19. Stanza 4: "each one asking to be seen" is 6 words. Maybe change to "each begging to be seen" for more impact. So: 5 words. Then stanza 4 word count: 3+4+6+1=14. Now, make sure each stanza's subtitle is highlighted. The user said "highlighted subtitle," so in the response, we'll use italics. 
Another revision for flow: Maybe change the order of stanzas to make the growth more natural. Wait, the order is: beneath soil, sprout, bend, bloom. That's logical. Stanza 2: 5+3=4? Wait, no: each stanza is subtitle (1 line) plus number of lines. Wait, no: the user said "4 stanzas, each beginning with a highlighted subtitle." So the total number of lines includes the subtitles. So if each stanza has 4 lines, then 4 stanzas would be 16 lines. But the user wants 15. So one stanza has 3 lines. Wait, earlier I had stanza 4: 3 lines (1 subtitle +2). So let's adjust stanza 4 to have 3 lines. For example: Bloom Into Sun petals open like hearts, each begging to be seen. That's 3 lines (1 subtitle +2). Then the previous stanza (stanza 3) is 4 lines: subtitle +3. But wait, no: let's do this again. The total lines are 15. So if stanza 1 has 4 lines (1+3), stanza 2 has 4 (1+3), stanza 3 has 4 (1+3), stanza 4 has 3 (1+2). That's 4+4+4+3=15. So let's revise stanza 4: Bloom Into Sun petals open like hearts, each begging to be seen. 1. Beneath the Soil 2. a heartbeat of green, 3. wrapped tight in dark, 4. waiting for the world to breathe. 5. Sprout Through Cracks 6. tip tilts toward faint light, 7. pushing through stone and time, 8. a fragile stem holding sky. 9. Bend, But Not Break 10. when winds roar like rage, 11. I arch, then rise again, 12. stronger where I was torn. 13. Bloom Into Sun 14. petals open like hearts, 15. each begging to be seen. Now, check the flow: Each stanza builds on the growth theme. Another revision for better imagery: Maybe change "faint light" to "faint light through cracks" to tie back to the subtitle. But that adds words. Since we have to keep word count low, maybe keep it as is. Stanza 2: "tip tilts toward faint light,"—good, shows struggle. Stanza 3: "when winds roar like rage,"—good, shows adversity. Stanza 4: "petals open like hearts,"—adds emotion, "each begging to be seen,"—relatable. I think this works. Now, let's write the final version. 
Wait, let's check for any redundancy or weak lines. "Wrapped tight in dark" is good. "Waiting for the world to breathe"—beautiful, personifies the seed. Stanza 2: "pushing through stone and time"—good, combines physical and temporal struggle. Stanza 3: "stronger where I was torn"—great, conveys resilience. Stanza 4: "each begging to be seen"—adds a personal touch, makes the bloom more meaningful. Is there anything to change? Maybe make the transition between stanzas smoother? Not necessary; the theme is clear. Another check: Each subtitle is highlighted. Yes, we'll use italics. Stanza 2: "tip tilts toward faint light" is 5 words. "pushing through stone and time" is 5 words. "a fragile stem holding sky" is 5 words. So 3+5+5+5=18. Stanza 3: Bend, But Not Break (4) + 5 (when wind screams like fear) + 5 (I arch, then rise again) + 5 (stronger where I was torn) = 4+5+5+5=19. Stanza 4: Bloom Into Sun (3) + 4 (petals open like questions) + 5 (each one asking to stay) = 3+4+5=12. Total words: 17+18+19+12=66 words. Way under 300. Good. Beneath the Soil a heartbeat of green, wrapped tight in dark, waiting for the world to breathe. Sprout Through Cracks tip tilts toward faint light, pushing through stone and time, a fragile stem holding sky. Bend, But Not Break when winds roar like rage, I arch, then rise again, stronger where I was torn. Bloom Into Sun petals open like hearts, each begging to be seen.

For deployment, you can use `sglang>=0.4.6.post1` or `vllm>=0.8.5` to create an OpenAI-compatible API endpoint.
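The deployment note above can be sketched as a vLLM invocation. This is a sketch under assumptions: the repository id, port, and context length are placeholders, not taken from the card, so adjust them to your checkpoint and hardware.

```shell
# Sketch only: repository id, port, and context length are assumptions.
pip install "vllm>=0.8.5"
vllm serve cpatonn/Light-IF-32B-AWQ \
    --port 8000 \
    --max-model-len 32768
```

Once running, the server exposes an OpenAI-compatible `/v1/chat/completions` endpoint that standard OpenAI clients can target.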

license:apache-2.0
8
1

UIGEN-X-32B-0727-AWQ

UIGEN-X-32B-0727 — Reasoning Only UI Generation Model

> Tesslate's Reasoning Only UI generation model built on the Qwen3-32B architecture. Trained to systematically plan, architect, and implement complete user interfaces across modern development stacks.

Live Examples: https://uigenoutput.tesslate.com
Discord Community: https://discord.gg/EcCpcTv93U
Website: https://tesslate.com

UIGEN-X-32B-0727 implements Reasoning Only from the Qwen3 family, combining systematic planning with direct implementation. The model follows a structured thinking process:
1. Problem Analysis — Understanding requirements and constraints
2. Architecture Planning — Component structure and technology decisions
3. Design System Definition — Color schemes, typography, and styling approach
4. Implementation Strategy — Step-by-step code generation with reasoning

This approach enables both thoughtful planning and efficient code generation, making it suitable for complex UI development tasks.

UIGEN-X-32B-0727 supports 26 major categories spanning frameworks and libraries across 7 platforms:

Web Frameworks
- React: Next.js, Remix, Gatsby, Create React App, Vite
- Vue: Nuxt.js, Quasar, Gridsome
- Angular: Angular CLI, Ionic Angular
- Svelte: SvelteKit, Astro
- Modern: Solid.js, Qwik, Alpine.js
- Static: Astro, 11ty, Jekyll, Hugo

Styling Systems
- Utility-First: Tailwind CSS, UnoCSS, Windi CSS
- CSS-in-JS: Styled Components, Emotion, Stitches
- Component Systems: Material-UI, Chakra UI, Mantine
- Traditional: Bootstrap, Bulma, Foundation
- Design Systems: Carbon Design, IBM Design Language
- Framework-Specific: Angular Material, Vuetify, Quasar UI

Component Libraries
- React: shadcn/ui, Material-UI, Ant Design, Chakra UI, Mantine, PrimeReact, Headless UI, NextUI, DaisyUI
- Vue: Vuetify, PrimeVue, Quasar, Element Plus, Naive UI
- Angular: Angular Material, PrimeNG, ng-bootstrap, Clarity Design
- Svelte: Svelte Material UI, Carbon Components Svelte
- Headless: Radix UI, Reach UI, Ariakit, React Aria
State Management
- React: Redux Toolkit, Zustand, Jotai, Valtio, Context API
- Vue: Pinia, Vuex, Composables
- Angular: NgRx, Akita, Services
- Universal: MobX, XState, Recoil

Animation Libraries
- React: Framer Motion, React Spring, React Transition Group
- Vue: Vue Transition, Vueuse Motion
- Universal: GSAP, Lottie, CSS Animations, Web Animations API
- Mobile: React Native Reanimated, Expo Animations

Icon Systems
Lucide, Heroicons, Material Icons, Font Awesome, Ant Design Icons, Bootstrap Icons, Ionicons, Tabler Icons, Feather, Phosphor, React Icons, Vue Icons

Web Development
Complete coverage of modern web development, from simple HTML/CSS to complex enterprise applications.

Mobile Development
- React Native: Expo, CLI, with navigation and state management
- Flutter: Cross-platform mobile with Material and Cupertino designs
- Ionic: Angular, React, and Vue-based hybrid applications

Desktop Applications
- Electron: Cross-platform desktop apps (Slack, VSCode-style)
- Tauri: Rust-based lightweight desktop applications
- Flutter Desktop: Native desktop performance

Python Applications
- Web UI: Streamlit, Gradio, Flask, FastAPI
- Desktop GUI: Tkinter, PyQt5/6, Kivy, wxPython, Dear PyGui

Development Tools
Build tools, bundlers, testing frameworks, and development environments.
26 Languages and Approaches: JavaScript, TypeScript, Python, Dart, HTML5, CSS3, SCSS, SASS, Less, PostCSS, CSS Modules, Styled Components, JSX, TSX, Vue SFC, Svelte Components, Angular Templates, Tailwind, PHP

UIGEN-X-32B-0727 includes 21 distinct visual style categories that can be applied to any framework:

Modern Design Styles
- Glassmorphism: Frosted glass effects with blur and transparency
- Neumorphism: Soft, extruded design elements
- Material Design: Google's design system principles
- Fluent Design: Microsoft's design language

Traditional & Classic
- Skeuomorphism: Real-world object representations
- Swiss Design: Clean typography and grid systems
- Bauhaus: Functional, geometric design principles

Contemporary Trends
- Brutalism: Bold, raw, unconventional layouts
- Anti-Design: Intentionally imperfect, organic aesthetics
- Minimalism: Essential elements only, generous whitespace

Thematic Styles
- Cyberpunk: Neon colors, glitch effects, futuristic elements
- Dark Mode: High contrast, reduced eye strain
- Retro-Futurism: 80s/90s inspired futuristic design
- Geocities/90s Web: Nostalgic early web aesthetics

Experimental
- Maximalism: Rich, layered, abundant visual elements
- Madness/Experimental: Unconventional, boundary-pushing designs
- Abstract Shapes: Geometric, non-representational elements

Basic Structure
To achieve the best results, use the prompting structure below:

UIGEN-X-32B-0727 supports function calling for dynamic asset integration and enhanced development workflows.
Dynamic Asset Loading:
- Fetch relevant images during UI generation
- Generate realistic content for components
- Create cohesive color palettes from images
- Optimize assets for web performance

Multi-Step Development:
- Plan application architecture
- Generate individual components
- Integrate components into pages
- Apply consistent styling and theming
- Test responsive behavior

Content-Aware Design:
- Adapt layouts based on content types
- Optimize typography for readability
- Create responsive image galleries
- Generate accessible alt text

Rapid Prototyping
- Quick mockups for client presentations
- A/B testing different design approaches
- Concept validation with interactive prototypes

Production Development
- Component library creation
- Design system implementation
- Template and boilerplate generation

Educational & Learning
- Teaching modern web development
- Framework comparison and evaluation
- Best practices demonstration

Enterprise Solutions
- Dashboard and admin panel generation
- Internal tool development
- Legacy system modernization

Hardware
- GPU: 8GB+ VRAM recommended (RTX 3080/4070 or equivalent)
- RAM: 16GB system memory minimum
- Storage: 20GB for model weights and cache

Software
- Python: 3.8+ with transformers, torch, unsloth
- Node.js: For running generated JavaScript/TypeScript code
- Browser: Modern browser for testing generated UIs

Integration
- Compatible with HuggingFace transformers
- Supports GGML/GGUF quantization
- Works with text-generation-webui
- API-ready for production deployment

Limitations
- Token Usage: The reasoning process increases token consumption
- Complex Logic: Focuses on UI structure rather than business logic
- Real-time Features: Generated code requires backend integration
- Testing: Output may need manual testing and refinement
- Accessibility: While ARIA-aware, manual a11y testing is recommended

Discord: https://discord.gg/EcCpcTv93U
Website: https://tesslate.com
Examples: https://uigenoutput.tesslate.com

Join our community to share creations, get help, and contribute to the ecosystem. Built with Reasoning Only capabilities from Qwen3, UIGEN-X-32B-0727 represents a comprehensive approach to AI-driven UI development across the entire modern web development ecosystem.

license:apache-2.0
7
0

KAT-V1-40B-AWQ-4bit

6
2

K2-Think-AWQ-8bit

license:apache-2.0
6
1

cogito-v2-preview-llama-70B-GPTQ-4bit

llama
6
0

MetaStone-S1-32B-AWQ-4bit

license:apache-2.0
5
0

Datarus-R1-14B-preview-AWQ-4bit

license:apache-2.0
5
0

WebDancer-32B-AWQ

Method Quantised using casper-hansen/AutoAWQ and the following configs:

license:apache-2.0
4
0

DeepSeek-V3.1-Base-BF16

4
0

OpenCodeReasoning-Nemotron-1.1-32B-AWQ-4bit

license:apache-2.0
3
2

Jan-v1-2509-AWQ-4bit

license:apache-2.0
3
1

MindLink-32B-0801-AWQ-4bit

license:apache-2.0
3
1

OpenReasoning-Nemotron-32B-W8A8-INT8-Dynamic

license:apache-2.0
3
0

KAT-V1-40B-GPTQ-8bit

license:apache-2.0
3
0

Kimi-Dev-72B-AWQ-8bit

license:mit
3
0

Jan-v1-4B-AWQ-8bit

GitHub: https://github.com/menloresearch/deep-research · License: Apache-2.0 · Jan App: https://jan.ai/

Overview
Jan-v1 is the first release in the Jan Family, designed for agentic reasoning and problem-solving within the Jan App. Based on our Lucy model, Jan-v1 achieves improved performance through model scaling. Jan-v1 uses the Qwen3-4B-thinking model to provide enhanced reasoning capabilities and tool utilization; this architecture delivers better performance on complex agentic tasks.

Question Answering (SimpleQA)
For question answering, Jan-v1 shows a significant performance gain from model scaling, achieving 91.1% accuracy on SimpleQA — a significant milestone in factual question answering for models of this scale, and a demonstration of the effectiveness of our scaling and fine-tuning approach. These benchmarks evaluate the model's conversational and instructional capabilities.

Jan-v1 is optimized for direct integration with the Jan App: simply select the model from the Jan App interface for immediate access to its full capabilities.
- Discussions: HuggingFace Community
- Jan App: Learn more about the Jan App at jan.ai

license:apache-2.0
2
1

Ling-mini-2.0-AWQ-8bit

license:mit
2
0

Jan-v1-4B-AWQ-4bit

GitHub: https://github.com/menloresearch/deep-research · License: Apache-2.0 · Jan App: https://jan.ai/

Overview
Jan-v1 is the first release in the Jan Family, designed for agentic reasoning and problem-solving within the Jan App. Based on our Lucy model, Jan-v1 achieves improved performance through model scaling. Jan-v1 uses the Qwen3-4B-thinking model to provide enhanced reasoning capabilities and tool utilization; this architecture delivers better performance on complex agentic tasks.

Question Answering (SimpleQA)
For question answering, Jan-v1 shows a significant performance gain from model scaling, achieving 91.1% accuracy on SimpleQA — a significant milestone in factual question answering for models of this scale, and a demonstration of the effectiveness of our scaling and fine-tuning approach. These benchmarks evaluate the model's conversational and instructional capabilities.

Jan-v1 is optimized for direct integration with the Jan App: simply select the model from the Jan App interface for immediate access to its full capabilities.
- Discussions: HuggingFace Community
- Jan App: Learn more about the Jan App at jan.ai

license:apache-2.0
1
2

II-Search-4B-AWQ-8bit

A 4B-parameter language model specialized in information seeking, multi-hop reasoning, and web-integrated search, achieving state-of-the-art performance among models of similar size.

II-Search-4B is a 4B-parameter language model based on Qwen3-4B, fine-tuned specifically for information-seeking tasks and web-integrated reasoning. It excels at complex multi-hop information retrieval, fact verification, and comprehensive report generation.

- Enhanced tool usage for web search and webpage visits
- Multi-hop reasoning capabilities with sophisticated planning
- Verified information retrieval with cross-checking
- Strong performance on factual QA benchmarks
- Comprehensive report generation for research queries

Our training process consisted of three key phases:

1. Distillation: We used a distillation approach from larger models (Qwen3-235B) to generate reasoning paths with function calling on multi-hop datasets. This established the base capabilities for tool use.
2. Data refinement:
   - Creating synthetic problems requiring more reasoning turns, inspired by the Random Walk algorithm
   - Improving reasoning thought patterns for more efficient and cleaner reasoning paths
   - Filtering to keep only high-quality reasoning traces (correct answers with proper reasoning)
   - STORM-inspired techniques to enhance comprehensive report generation
3. Reinforcement learning:
   - Used dataset: dgslibisey/MuSiQue
   - Incorporated our in-house search database (containing Wiki data, Fineweb data, and ArXiv data)

| Benchmark | Qwen3-4B | Jan-4B | WebSailor-3B | II-Search-4B |
| --- | --- | --- | --- | --- |
| OpenAI/SimpleQA | 76.8 | 80.1 | 81.8 | 91.8 |
| Google/Frames | 30.7 | 24.8 | 34.0 | 67.5 |
| Seal0 | 6.31 | 2.7 | 1.8 | 22.5 |

| | Qwen3-4B | Jan-4B | WebSailor-3B | II-Search-4B |
| --- | --- | --- | --- | --- |
| # Search | 1.0 | 0.9 | 2.1 | 2.2 |
| # Visit | 0.1 | 1.9 | 6.4 | 3.5 |
| # Total Tools | 1.1 | 2.8 | 8.5 | 5.7 |

All benchmark traces from models can be found at: https://huggingface.co/datasets/II-Vietnam/Inspect-Search-Models-Benchmarking-Result

Intended uses:
- Information seeking and factual question answering
- Research assistance and comprehensive report generation
- Fact verification and evidence-based reasoning
- Educational and research applications requiring factual accuracy

Usage
To deploy and interact with the II-Search-4B model effectively, follow these options:

1. Serve the model using vLLM or SGLang. Use the following command to serve the model with vLLM (adjust parameters as needed for your hardware setup). This configuration enables distributed tensor parallelism across 8 GPUs, reasoning capabilities, custom RoPE scaling for extended context, and a maximum context length of 131,072 tokens. Equip the served model with web-search and web-visit tools to enable internet-aware functionality. Alternatively, use middleware such as MCP for tool integration; see this example repository: https://github.com/hoanganhpham1006/mcp-server-template.
2. Host on macOS with MLX for local use. As an alternative for Apple Silicon users, host the quantized II-Search-4B-MLX version on your Mac, then interact with it via user-friendly interfaces such as LM Studio or Ollama Desktop.

Tip: for a query where you need a short, exact answer, append the following phrase: "\n\nPlease reason step-by-step and put the final answer within \\\\boxed{}."
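The serving configuration described above can be sketched as a vLLM invocation. This is a sketch under assumptions: the original command block is not preserved on this page, and the repository id and YaRN rope-scaling values are placeholders, so adjust them to your checkpoint and hardware.

```shell
# Sketch only: repository id and rope-scaling JSON are assumptions.
vllm serve Intelligent-Internet/II-Search-4B \
    --tensor-parallel-size 8 \
    --max-model-len 131072 \
    --rope-scaling '{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768}'
```

The `--tensor-parallel-size 8` flag provides the 8-GPU tensor parallelism mentioned in the card, and `--max-model-len 131072` sets the stated 131,072-token context limit.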

1
1

Datarus-R1-14B-preview-AWQ-8bit

license:apache-2.0
1
1

Jan-v1-2509-AWQ-8bit

license:apache-2.0
1
1

II-Search-4B-AWQ-4bit

Method
Quantised using vllm-project/llm-compressor, nvidia/Llama-Nemotron-Post-Training-Dataset and the following configs:

A 4B-parameter language model specialized in information seeking, multi-hop reasoning, and web-integrated search, achieving state-of-the-art performance among models of similar size.

II-Search-4B is a 4B-parameter language model based on Qwen3-4B, fine-tuned specifically for information-seeking tasks and web-integrated reasoning. It excels at complex multi-hop information retrieval, fact verification, and comprehensive report generation.

- Enhanced tool usage for web search and webpage visits
- Multi-hop reasoning capabilities with sophisticated planning
- Verified information retrieval with cross-checking
- Strong performance on factual QA benchmarks
- Comprehensive report generation for research queries

Our training process consisted of three key phases:

1. Distillation: We used a distillation approach from larger models (Qwen3-235B) to generate reasoning paths with function calling on multi-hop datasets. This established the base capabilities for tool use.
2. Data refinement:
   - Creating synthetic problems requiring more reasoning turns, inspired by the Random Walk algorithm
   - Improving reasoning thought patterns for more efficient and cleaner reasoning paths
   - Filtering to keep only high-quality reasoning traces (correct answers with proper reasoning)
   - STORM-inspired techniques to enhance comprehensive report generation
3. Reinforcement learning:
   - Used dataset: dgslibisey/MuSiQue
   - Incorporated our in-house search database (containing Wiki data, Fineweb data, and ArXiv data)

| Benchmark | Qwen3-4B | Jan-4B | WebSailor-3B | II-Search-4B |
| --- | --- | --- | --- | --- |
| OpenAI/SimpleQA | 76.8 | 80.1 | 81.8 | 91.8 |
| Google/Frames | 30.7 | 24.8 | 34.0 | 67.5 |
| Seal0 | 6.31 | 2.7 | 1.8 | 22.5 |

| | Qwen3-4B | Jan-4B | WebSailor-3B | II-Search-4B |
| --- | --- | --- | --- | --- |
| # Search | 1.0 | 0.9 | 2.1 | 2.2 |
| # Visit | 0.1 | 1.9 | 6.4 | 3.5 |
| # Total Tools | 1.1 | 2.8 | 8.5 | 5.7 |

All benchmark traces from models can be found at: https://huggingface.co/datasets/II-Vietnam/Inspect-Search-Models-Benchmarking-Result

Intended uses:
- Information seeking and factual question answering
- Research assistance and comprehensive report generation
- Fact verification and evidence-based reasoning
- Educational and research applications requiring factual accuracy

Usage
To deploy and interact with the II-Search-4B model effectively, follow these options:

1. Serve the model using vLLM or SGLang. Use the following command to serve the model with vLLM (adjust parameters as needed for your hardware setup). This configuration enables distributed tensor parallelism across 8 GPUs, reasoning capabilities, custom RoPE scaling for extended context, and a maximum context length of 131,072 tokens. Equip the served model with web-search and web-visit tools to enable internet-aware functionality. Alternatively, use middleware such as MCP for tool integration; see this example repository: https://github.com/hoanganhpham1006/mcp-server-template.
2. Host on macOS with MLX for local use. As an alternative for Apple Silicon users, host the quantized II-Search-4B-MLX version on your Mac, then interact with it via user-friendly interfaces such as LM Studio or Ollama Desktop.

Tip: for a query where you need a short, exact answer, append the following phrase: "\n\nPlease reason step-by-step and put the final answer within \\\\boxed{}."

1
0

DeepSeek-V3.1-BF16

license:mit
1
0

DeepSeek-V3.1-Terminus-BF16

license:mit
1
0

Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-AWQ-INT8-INT4

license:apache-2.0
0
1