bullerwins

145 models

Wan2.2-I2V-A14B-GGUF

---
license: apache-2.0
language:
- en
- zh
pipeline_tag: image-to-video
base_model:
- Wan-AI/Wan2.2-I2V-A14B
---

license:apache-2.0
192,933
216

Wan2.2-T2V-A14B-GGUF

You need to download both a high-noise model and a low-noise model. The high-noise model is used for the first steps and the low-noise model for the details. Place them in `ComfyUI/models/unet`. An example workflow is included in the files; just drag it into ComfyUI.

💜 Wan | 🖥️ GitHub | 🤗 Hugging Face | 🤖 ModelScope | 📑 Technical Report | 📑 Blog | 💬 WeChat Group | 📖 Discord

**Wan: Open and Advanced Large-Scale Video Generative Models**

We are excited to introduce Wan2.2, a major upgrade to our foundational video models. With Wan2.2, we have focused on incorporating the following innovations:

- 👍 **Effective MoE Architecture**: Wan2.2 introduces a Mixture-of-Experts (MoE) architecture into video diffusion models. By splitting the denoising process across timesteps between specialized, powerful expert models, it enlarges the overall model capacity while keeping the computational cost unchanged.
- 👍 **Cinematic-level Aesthetics**: Wan2.2 incorporates meticulously curated aesthetic data, complete with detailed labels for lighting, composition, contrast, color tone, and more. This allows for more precise and controllable cinematic-style generation, facilitating the creation of videos with customizable aesthetic preferences.
- 👍 **Complex Motion Generation**: Compared to Wan2.1, Wan2.2 is trained on a significantly larger dataset, with +65.6% more images and +83.2% more videos. This expansion notably enhances the model's generalization across multiple dimensions such as motion, semantics, and aesthetics, achieving top performance among all open- and closed-source models.
- 👍 **Efficient High-Definition Hybrid TI2V**: Wan2.2 open-sources a 5B model built with our advanced Wan2.2-VAE that achieves a compression ratio of 16×16×4.
This model supports both text-to-video and image-to-video generation at 720P resolution and 24 fps, and can run on consumer-grade graphics cards such as the RTX 4090. It is one of the fastest 720P@24fps models currently available, capable of serving both the industrial and academic sectors.

This repository contains our T2V-A14B model, which supports generating 5-second videos at both 480P and 720P resolutions. Built with a Mixture-of-Experts (MoE) architecture, it delivers outstanding video generation quality. On our new benchmark, Wan-Bench 2.0, the model surpasses leading commercial models across most key evaluation dimensions.

**News**
- Jul 28, 2025: 👋 We've released the inference code and model weights of Wan2.2.

**Community Works**
If your research or project builds upon Wan2.1 or Wan2.2, we welcome you to share it with us so we can highlight it for the broader community.

**📑 Todo List**
- Wan2.2 Text-to-Video
  - [x] Multi-GPU inference code of the A14B and 14B models
  - [x] Checkpoints of the A14B and 14B models
  - [x] ComfyUI integration
  - [x] Diffusers integration
- Wan2.2 Image-to-Video
  - [x] Multi-GPU inference code of the A14B model
  - [x] Checkpoints of the A14B model
  - [x] ComfyUI integration
  - [x] Diffusers integration
- Wan2.2 Text-Image-to-Video
  - [x] Multi-GPU inference code of the 5B model
  - [x] Checkpoints of the 5B model
  - [x] ComfyUI integration
  - [x] Diffusers integration

| Models | Download Links | Description |
|--------|----------------|-------------|
| T2V-A14B | 🤗 Huggingface 🤖 ModelScope | Text-to-Video MoE model, supports 480P & 720P |
| I2V-A14B | 🤗 Huggingface 🤖 ModelScope | Image-to-Video MoE model, supports 480P & 720P |
| TI2V-5B | 🤗 Huggingface 🤖 ModelScope | High-compression VAE, T2V+I2V, supports 720P |

> 💡 Note: The TI2V-5B model supports 720P video generation at 24 FPS.
This repository supports the `Wan2.2-T2V-A14B` Text-to-Video model, which can generate videos at both 480P and 720P resolutions. To facilitate implementation, we will start with a basic version of the inference process that skips the prompt extension step.

> 💡 This command requires a GPU with at least 80GB of VRAM.
> 💡 If you encounter OOM (Out-of-Memory) issues, you can use the `--offload_model True`, `--convert_model_dtype` and `--t5_cpu` options to reduce GPU memory usage.

- Multi-GPU inference using FSDP + DeepSpeed Ulysses: we use PyTorch FSDP and DeepSpeed Ulysses to accelerate inference.

Extending the prompts can effectively enrich the details in the generated videos, further enhancing video quality, so we recommend enabling prompt extension. We provide the following two methods for prompt extension:

- Use the Dashscope API for extension.
  - Apply for a `dashscope.api_key` in advance (EN | CN).
  - Configure the environment variable `DASH_API_KEY` to specify the Dashscope API key. Users of Alibaba Cloud's international site also need to set the environment variable `DASH_API_URL` to `https://dashscope-intl.aliyuncs.com/api/v1`. For more detailed instructions, please refer to the Dashscope documentation.
  - Use the `qwen-plus` model for text-to-video tasks and `qwen-vl-max` for image-to-video tasks.
  - You can modify the model used for extension with the parameter `--prompt_extend_model`.
- Use a local model for extension.
  - By default, the Qwen model on HuggingFace is used for this extension. Users can choose Qwen models or other models based on the available GPU memory.
  - For text-to-video tasks, you can use models like `Qwen/Qwen2.5-14B-Instruct`, `Qwen/Qwen2.5-7B-Instruct` and `Qwen/Qwen2.5-3B-Instruct`.
  - For image-to-video tasks, you can use models like `Qwen/Qwen2.5-VL-7B-Instruct` and `Qwen/Qwen2.5-VL-3B-Instruct`.
  - Larger models generally provide better extension results but require more GPU memory.
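For the Dashscope route, the environment setup described above amounts to the following (the key is a placeholder; the intl URL only applies to users of Alibaba Cloud's international site):

```shell
# Placeholder key -- apply for a real dashscope.api_key first
export DASH_API_KEY="sk-xxxxxxxx"
# Only needed for Alibaba Cloud's international site:
export DASH_API_URL="https://dashscope-intl.aliyuncs.com/api/v1"
```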
- You can modify the model used for extension with the parameter `--prompt_extend_model`, allowing you to specify either a local model path or a Hugging Face model.

We test the computational efficiency of different Wan2.2 models on different GPUs in the following table. The results are presented in the format: total time (s) / peak GPU memory (GB).

> The parameter settings for the tests in this table are as follows:
> (1) Multi-GPU: 14B: `--ulysses_size 4/8 --dit_fsdp --t5_fsdp`, 5B: `--ulysses_size 4/8 --offload_model True --convert_model_dtype --t5_cpu`; Single-GPU: 14B: `--offload_model True --convert_model_dtype`, 5B: `--offload_model True --convert_model_dtype --t5_cpu` (`--convert_model_dtype` converts model parameter types to `config.param_dtype`);
> (2) The distributed testing uses the built-in FSDP and Ulysses implementations, with FlashAttention3 deployed on Hopper-architecture GPUs;
> (3) Tests were run without the `--use_prompt_extend` flag;
> (4) Reported results are the average of multiple samples taken after the warm-up phase.

Wan2.2 builds on the foundation of Wan2.1 with notable improvements in generation quality and model capability. This upgrade is driven by a series of key technical innovations, mainly the Mixture-of-Experts (MoE) architecture, upgraded training data, and high-compression video generation.

**(1) Mixture-of-Experts (MoE) Architecture**

Wan2.2 introduces a Mixture-of-Experts (MoE) architecture into the video generation diffusion model. MoE has been widely validated in large language models as an efficient approach to increase total model parameters while keeping inference cost nearly unchanged. In Wan2.2, the A14B model series adopts a two-expert design tailored to the denoising process of diffusion models: a high-noise expert for the early stages, focusing on overall layout, and a low-noise expert for the later stages, refining video details.
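As a toy sketch of this two-expert routing (not the actual Wan2.2 implementation; the threshold step value here is hypothetical), the switch reduces to a comparison on the denoising timestep:

```python
def select_expert(t: int, t_moe: int) -> str:
    """Route a denoising timestep to one of the two experts.

    Early steps (large t, high noise) go to the high-noise expert for
    overall layout; later steps go to the low-noise expert for details.
    """
    return "high_noise" if t >= t_moe else "low_noise"


# With a hypothetical threshold t_moe = 875 over 1000 denoising steps:
print(select_expert(950, 875))  # high_noise: early, noisy step
print(select_expert(300, 875))  # low_noise: late, refinement step
```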
Each expert model has about 14B parameters, resulting in a total of 27B parameters but only 14B active parameters per step, keeping inference computation and GPU memory nearly unchanged.

The transition point between the two experts is determined by the signal-to-noise ratio (SNR), a metric that decreases monotonically as the denoising step $t$ increases. At the beginning of the denoising process, $t$ is large and the noise level is high, so the SNR is at its minimum, denoted as $\mathrm{SNR}_{\min}$. In this stage, the high-noise expert is activated. We define a threshold step $t_{\text{moe}}$ corresponding to half of $\mathrm{SNR}_{\min}$, and switch to the low-noise expert when $t < t_{\text{moe}}$.

To validate the effectiveness of the MoE architecture, four settings are compared based on their validation loss curves. The baseline Wan2.1 model does not employ the MoE architecture. Among the MoE-based variants, Wan2.1 & High-Noise Expert reuses the Wan2.1 model as the low-noise expert and uses Wan2.2's high-noise expert, while Wan2.1 & Low-Noise Expert uses Wan2.1 as the high-noise expert and employs Wan2.2's low-noise expert. Wan2.2 (MoE), our final version, achieves the lowest validation loss, indicating that its generated video distribution is closest to the ground truth and exhibits superior convergence.

**(2) Efficient High-Definition Hybrid TI2V**

To enable more efficient deployment, Wan2.2 also explores a high-compression design. In addition to the 27B MoE models, a 5B dense model, TI2V-5B, is released. It is supported by a high-compression Wan2.2-VAE, which achieves a $T\times H\times W$ compression ratio of $4\times16\times16$, increasing the overall compression rate to 64 while maintaining high-quality video reconstruction. With an additional patchification layer, the total compression ratio of TI2V-5B reaches $4\times32\times32$.
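The downsampling factors quoted above multiply out as follows (illustrative arithmetic only; the 1280×720 frame size is a hypothetical example):

```python
# Wan2.2-VAE downsampling factors from the card: T x H x W = 4 x 16 x 16
t_f, h_f, w_f = 4, 16, 16
vae_ratio = t_f * h_f * w_f
print(vae_ratio)  # 1024 raw spatio-temporal downsampling

# The extra patchification layer doubles the spatial factors, giving 4 x 32 x 32
patch_ratio = t_f * (h_f * 2) * (w_f * 2)
print(patch_ratio)  # 4096

# e.g. a hypothetical 1280x720 frame maps to an 80x45 grid in VAE latent space
print(1280 // w_f, 720 // h_f)  # 80 45
```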
Without specific optimization, TI2V-5B can generate a 5-second 720P video in under 9 minutes on a single consumer-grade GPU, ranking among the fastest 720P@24fps video generation models. It also natively supports both text-to-video and image-to-video tasks within a single unified framework, covering both academic research and practical applications.

**Comparisons to SOTAs**

We compared Wan2.2 with leading closed-source commercial models on our new Wan-Bench 2.0, evaluating performance across multiple crucial dimensions. The results demonstrate that Wan2.2 achieves superior performance compared to these leading models.

**Citation**

If you find our work helpful, please cite us.

**License Agreement**

The models in this repository are licensed under the Apache 2.0 License. We claim no rights over your generated content, granting you the freedom to use it while ensuring that your usage complies with the provisions of this license. You are fully accountable for your use of the models, which must not involve sharing any content that violates applicable laws, causes harm to individuals or groups, disseminates personal information intended for harm, spreads misinformation, or targets vulnerable populations. For a complete list of restrictions and details regarding your rights, please refer to the full text of the license.

**Acknowledgements**

We would like to thank the contributors to the SD3, Qwen, umt5-xxl, diffusers and HuggingFace repositories for their open research.

**Contact Us**

If you would like to leave a message for our research or product teams, feel free to join our Discord or WeChat groups!

license:apache-2.0
55,512
61

FLUX.1-Kontext-dev-GGUF

β€”
16,589
236

MiniMax-M2-GGUF

license:mit
6,641
6

DeepSeek-V3-GGUF

llama.cpp
932
102

Meta-Llama-3.1-8B-Instruct-GGUF

llama
828
23

Qwen3-4B-Instruct-2507-GGUF

license:apache-2.0
496
1

Qwen3-VL-32B-Instruct-GGUF

license:apache-2.0
441
1

Hunyuan-A13B-Instruct-GGUF

β€”
366
24

Qwen3-14B-GGUF

Qwen3 is the latest generation of large language models in the Qwen series, offering a comprehensive suite of dense and mixture-of-experts (MoE) models. Built upon extensive training, Qwen3 delivers groundbreaking advancements in reasoning, instruction-following, agent capabilities, and multilingual support, with the following key features:

- Unique support for seamless switching between thinking mode (for complex logical reasoning, math, and coding) and non-thinking mode (for efficient, general-purpose dialogue) within a single model, ensuring optimal performance across various scenarios.
- Significantly enhanced reasoning capabilities, surpassing previous QwQ (in thinking mode) and Qwen2.5 instruct models (in non-thinking mode) on mathematics, code generation, and commonsense logical reasoning.
- Superior human preference alignment, excelling in creative writing, role-playing, multi-turn dialogues, and instruction following, to deliver a more natural, engaging, and immersive conversational experience.
- Expertise in agent capabilities, enabling precise integration with external tools in both thinking and non-thinking modes and achieving leading performance among open-source models in complex agent-based tasks.
- Support for 100+ languages and dialects with strong capabilities for multilingual instruction following and translation.

Qwen3-14B has the following features:
- Type: Causal Language Model
- Training Stage: Pretraining & Post-training
- Number of Parameters: 14.8B
- Number of Parameters (Non-Embedding): 13.2B
- Number of Layers: 40
- Number of Attention Heads (GQA): 40 for Q and 8 for KV
- Context Length: 32,768 natively and 131,072 tokens with YaRN

For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our blog, GitHub, and Documentation.

The code of Qwen3 has been merged into the latest Hugging Face `transformers`, and we advise you to use the latest version of `transformers`.
With `transformers<4.51.0`, you will encounter errors when loading the model, so use the latest version. After generation, the thinking content can be separated from the final response by locating the last `</think>` token (id `151668`) in the generated ids:

```python
try:
    # rindex finding 151668 (</think>)
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content)
print("content:", content)
```

For deployment, SGLang and vLLM can serve the model with a reasoning parser:

```shell
python -m sglang.launch_server --model-path Qwen/Qwen3-14B --reasoning-parser qwen3
```

```shell
vllm serve Qwen/Qwen3-14B --enable-reasoning --reasoning-parser deepseek_r1
```

Thinking mode is toggled through the chat template:

```python
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True  # True is the default value for enable_thinking
)
```

```python
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False  # Setting enable_thinking=False disables thinking mode
)
```

The `/think` and `/no_think` soft switches can also be placed in user messages, as in this multi-turn example:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

class QwenChatbot:
    def __init__(self, model_name="Qwen/Qwen3-14B"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(model_name)
        self.history = []

    def generate_response(self, user_input):
        messages = self.history + [{"role": "user", "content": user_input}]
        text = self.tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True
        )
        inputs = self.tokenizer(text, return_tensors="pt")
        response_ids = self.model.generate(**inputs, max_new_tokens=32768)[0][len(inputs.input_ids[0]):].tolist()
        response = self.tokenizer.decode(response_ids, skip_special_tokens=True)

        # Update history
        self.history.append({"role": "user", "content": user_input})
        self.history.append({"role": "assistant", "content": response})

        return response

# Example Usage
if __name__ == "__main__":
    chatbot = QwenChatbot()

    # First input (without /think or /no_think tags, thinking mode is enabled by default)
    user_input_1 = "How many r's in strawberries?"
    print(f"User: {user_input_1}")
    response_1 = chatbot.generate_response(user_input_1)
    print(f"Bot: {response_1}")
    print("----------------------")

    # Second input with /no_think
    user_input_2 = "Then, how many r's in blueberries? /no_think"
    print(f"User: {user_input_2}")
    response_2 = chatbot.generate_response(user_input_2)
    print(f"Bot: {response_2}")
    print("----------------------")

    # Third input with /think
    user_input_3 = "Really? /think"
    print(f"User: {user_input_3}")
    response_3 = chatbot.generate_response(user_input_3)
    print(f"Bot: {response_3}")
```

For tool use with Qwen-Agent:

```python
from qwen_agent.agents import Assistant

# Define LLM
llm_cfg = {
    'model': 'Qwen3-14B',

    # Use the endpoint provided by Alibaba Model Studio:
    # 'model_type': 'qwen_dashscope',
    # 'api_key': os.getenv('DASHSCOPE_API_KEY'),

    # Use a custom endpoint compatible with OpenAI API:
    'model_server': 'http://localhost:8000/v1',  # api_base
    'api_key': 'EMPTY',

    # Other parameters:
    # 'generate_cfg': {
    #     # Add: When the response content is `<think>this is the thought</think>this is the answer`;
    #     # Do not add: When the response has been separated by reasoning_content and content.
    #     'thought_in_content': True,
    # },
}

# Define Tools
tools = [
    {'mcpServers': {  # You can specify the MCP configuration file
        'time': {
            'command': 'uvx',
            'args': ['mcp-server-time', '--local-timezone=Asia/Shanghai']
        },
        "fetch": {
            "command": "uvx",
            "args": ["mcp-server-fetch"]
        }
    }},
    'code_interpreter',  # Built-in tools
]

# Define Agent
bot = Assistant(llm=llm_cfg, function_list=tools)

# Streaming generation
messages = [{'role': 'user', 'content': 'https://qwenlm.github.io/blog/ Introduce the latest developments of Qwen'}]
for responses in bot.run(messages=messages):
    pass
print(responses)
```

To extend the context beyond the native 32,768 tokens, add YaRN rope scaling to the model's `config.json`:

```json
{
    ...,
    "rope_scaling": {
        "rope_type": "yarn",
        "factor": 4.0,
        "original_max_position_embeddings": 32768
    }
}
```

Or pass it at serve time:

```shell
vllm serve ... --rope-scaling '{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768}' --max-model-len 131072
```

```shell
python -m sglang.launch_server ... --json-model-override-args '{"rope_scaling":{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768}}'
```

```shell
llama-server ... --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 32768
```

> If you see the warning `Unrecognized keys in rope_scaling for 'rope_type'='yarn': {'original_max_position_embeddings'}`, please upgrade `transformers` to the latest version.

```
@misc{qwen3technicalreport,
    title={Qwen3 Technical Report},
    author={Qwen Team},
    year={2025},
    eprint={2505.09388},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2505.09388},
}
```
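A quick sanity check on the YaRN numbers above: the extended window is simply the native context multiplied by the scaling factor (the helper name here is mine, not part of any library):

```python
def yarn_extended_context(original_max_position_embeddings: int, factor: float) -> int:
    # YaRN rope scaling stretches the native context window by `factor`
    return int(original_max_position_embeddings * factor)


print(yarn_extended_context(32768, 4.0))  # 131072
```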

license:apache-2.0
343
0

Qwen3-8B-GGUF

license:apache-2.0
257
0

Meta-Llama-3.1-70B-Instruct-GGUF

llama
192
10

Qwen3-Coder-30B-A3B-Instruct-GGUF

Qwen3-Coder is available in multiple sizes. Today, we're excited to introduce Qwen3-Coder-30B-A3B-Instruct. This streamlined model maintains impressive performance and efficiency, featuring the following key enhancements:

- Significant performance among open models on agentic coding, agentic browser use, and other foundational coding tasks.
- Long-context capabilities with native support for 256K tokens, extendable up to 1M tokens using YaRN, optimized for repository-scale understanding.
- Agentic coding support for most platforms, such as Qwen Code and CLINE, featuring a specially designed function call format.

Qwen3-Coder-30B-A3B-Instruct has the following features:
- Type: Causal Language Model
- Training Stage: Pretraining & Post-training
- Number of Parameters: 30.5B in total and 3.3B activated
- Number of Layers: 48
- Number of Attention Heads (GQA): 32 for Q and 4 for KV
- Number of Experts: 128
- Number of Activated Experts: 8
- Context Length: 262,144 natively

NOTE: This model supports only non-thinking mode and does not generate `<think></think>` blocks in its output. Meanwhile, specifying `enable_thinking=False` is no longer required.

For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our blog, GitHub, and Documentation. We advise you to use the latest version of `transformers`.
```python
from openai import OpenAI

# Define Tools
tools = [
    {
        "type": "function",
        "function": {
            "name": "square_the_number",
            "description": "output the square of the number.",
            "parameters": {
                "type": "object",
                "required": ["input_num"],
                "properties": {
                    "input_num": {
                        "type": "number",
                        "description": "input_num is a number that will be squared"
                    }
                },
            }
        }
    }
]

# Define LLM
client = OpenAI(
    # Use a custom endpoint compatible with OpenAI API
    base_url='http://localhost:8000/v1',  # api_base
    api_key="EMPTY"
)

messages = [{'role': 'user', 'content': 'square the number 1024'}]

completion = client.chat.completions.create(
    messages=messages,
    model="Qwen3-Coder-30B-A3B-Instruct",
    max_tokens=65536,
    tools=tools,
)
```

```
@misc{qwen3technicalreport,
    title={Qwen3 Technical Report},
    author={Qwen Team},
    year={2025},
    eprint={2505.09388},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2505.09388},
}
```
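On the client side, a tool the model calls still has to be executed locally. A minimal, hypothetical dispatcher for the `square_the_number` schema above might look like this (the helper names are mine, not part of the OpenAI SDK):

```python
import json

def square_the_number(input_num):
    """Local implementation matching the tool schema the model sees."""
    return input_num ** 2

LOCAL_TOOLS = {"square_the_number": square_the_number}

def dispatch(name: str, arguments_json: str):
    """Execute an OpenAI-style tool call: a function name plus JSON-encoded arguments."""
    args = json.loads(arguments_json)
    return LOCAL_TOOLS[name](**args)

print(dispatch("square_the_number", '{"input_num": 1024}'))  # 1048576
```

The result would then be appended to `messages` as a `tool` role message and sent back to the model for the final answer.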

license:apache-2.0
191
0

Athene-70B-GGUF

license:cc-by-nc-4.0
171
11

Qwen2.5-7B-Instruct-GGUF

license:apache-2.0
171
1

DeepSeek-V2-Chat-0628-GGUF

β€”
154
7

DeepSeek-R1T-Chimera-GGUF

β€”
144
2

Meta-Llama-3.1-405B-Instruct-GGUF

llama
119
6

Kimi-Dev-72B-GGUF

license:mit
96
6

Qwen3-30B-A3B-Instruct-2507-GGUF

license:apache-2.0
91
0

Reflection-Llama-3.1-70B-GGUF

base_model:mattshumer/Reflection-Llama-3.1-70B
84
1

L3.1-8B-Celeste-V1.5-GGUF

llama-factory
83
0

gemma-3-27b-it-fp8-Dynamic

β€”
77
1

L3-Aethora-15B-V2-GGUF

base_model:elinas/Llama-3-15B-Instruct-zeroed
75
4

DeepSeek-V2.5-GGUF

β€”
68
2

DeepSeek-R1-0528-Qwen3-8B-fp8

license:mit
66
1

L3-70B-Euryale-v2.1-GGUF

license:cc-by-nc-4.0
53
0

gemma-2-2b-it-GGUF

β€”
46
8

DeepSeek-Coder-V2-Instruct-GGUF

β€”
39
1

DeepSeek-R1-0528-Qwen3-8B-GGUF

license:mit
38
1

Codestral-22B-v0.1-hf

β€”
33
17

DeepSeek-V3.1-BF16

license:mit
30
0

Devstral-Small-2505-fp8

license:apache-2.0
14
4

GLM-4.5-exl3-4.0bpw

πŸ“ Use GLM-4.5 API services on Z.ai API Platform (Global) or Zhipu AI Open Platform (Mainland China) . The GLM-4.5 series models are foundation models designed for intelligent agents. GLM-4.5 has 355 billion total parameters with 32 billion active parameters, while GLM-4.5-Air adopts a more compact design with 106 billion total parameters and 12 billion active parameters. GLM-4.5 models unify reasoning, coding, and intelligent agent capabilities to meet the complex demands of intelligent agent applications. Both GLM-4.5 and GLM-4.5-Air are hybrid reasoning models that provide two modes: thinking mode for complex reasoning and tool usage, and non-thinking mode for immediate responses. We have open-sourced the base models, hybrid reasoning models, and FP8 versions of the hybrid reasoning models for both GLM-4.5 and GLM-4.5-Air. They are released under the MIT open-source license and can be used commercially and for secondary development. As demonstrated in our comprehensive evaluation across 12 industry-standard benchmarks, GLM-4.5 achieves exceptional performance with a score of 63.2, in the 3rd place among all the proprietary and open-source models. Notably, GLM-4.5-Air delivers competitive results at 59.8 while maintaining superior efficiency. For more eval results, show cases, and technical details, please visit our technical blog. The technical report will be released soon. The model code, tool parser and reasoning parser can be found in the implementation of transformers, vLLM and SGLang.

license:mit
14
0

mathstral-7B-v0.1-GGUF

license:apache-2.0
13
1

GLM-4.5-exl3-2.76bpw_optim

πŸ“ Use GLM-4.5 API services on Z.ai API Platform (Global) or Zhipu AI Open Platform (Mainland China) . The GLM-4.5 series models are foundation models designed for intelligent agents. GLM-4.5 has 355 billion total parameters with 32 billion active parameters, while GLM-4.5-Air adopts a more compact design with 106 billion total parameters and 12 billion active parameters. GLM-4.5 models unify reasoning, coding, and intelligent agent capabilities to meet the complex demands of intelligent agent applications. Both GLM-4.5 and GLM-4.5-Air are hybrid reasoning models that provide two modes: thinking mode for complex reasoning and tool usage, and non-thinking mode for immediate responses. We have open-sourced the base models, hybrid reasoning models, and FP8 versions of the hybrid reasoning models for both GLM-4.5 and GLM-4.5-Air. They are released under the MIT open-source license and can be used commercially and for secondary development. As demonstrated in our comprehensive evaluation across 12 industry-standard benchmarks, GLM-4.5 achieves exceptional performance with a score of 63.2, in the 3rd place among all the proprietary and open-source models. Notably, GLM-4.5-Air delivers competitive results at 59.8 while maintaining superior efficiency. For more eval results, show cases, and technical details, please visit our technical blog. The technical report will be released soon. The model code, tool parser and reasoning parser can be found in the implementation of transformers, vLLM and SGLang.

license:mit
13
0

DeepSeek-R1-0528-GGUF

license:mit
12
0

DeepSeek-TNG-R1T2-Chimera-GGUF

β€”
12
0

GLM-4.5-exl3-3.2bpw_optim

πŸ“ Use GLM-4.5 API services on Z.ai API Platform (Global) or Zhipu AI Open Platform (Mainland China) . The GLM-4.5 series models are foundation models designed for intelligent agents. GLM-4.5 has 355 billion total parameters with 32 billion active parameters, while GLM-4.5-Air adopts a more compact design with 106 billion total parameters and 12 billion active parameters. GLM-4.5 models unify reasoning, coding, and intelligent agent capabilities to meet the complex demands of intelligent agent applications. Both GLM-4.5 and GLM-4.5-Air are hybrid reasoning models that provide two modes: thinking mode for complex reasoning and tool usage, and non-thinking mode for immediate responses. We have open-sourced the base models, hybrid reasoning models, and FP8 versions of the hybrid reasoning models for both GLM-4.5 and GLM-4.5-Air. They are released under the MIT open-source license and can be used commercially and for secondary development. As demonstrated in our comprehensive evaluation across 12 industry-standard benchmarks, GLM-4.5 achieves exceptional performance with a score of 63.2, in the 3rd place among all the proprietary and open-source models. Notably, GLM-4.5-Air delivers competitive results at 59.8 while maintaining superior efficiency. For more eval results, show cases, and technical details, please visit our technical blog. The technical report will be released soon. The model code, tool parser and reasoning parser can be found in the implementation of transformers, vLLM and SGLang.

license:mit
12
0

GLM-4.5-exl3-3.0bpw

πŸ“ Use GLM-4.5 API services on Z.ai API Platform (Global) or Zhipu AI Open Platform (Mainland China) . The GLM-4.5 series models are foundation models designed for intelligent agents. GLM-4.5 has 355 billion total parameters with 32 billion active parameters, while GLM-4.5-Air adopts a more compact design with 106 billion total parameters and 12 billion active parameters. GLM-4.5 models unify reasoning, coding, and intelligent agent capabilities to meet the complex demands of intelligent agent applications. Both GLM-4.5 and GLM-4.5-Air are hybrid reasoning models that provide two modes: thinking mode for complex reasoning and tool usage, and non-thinking mode for immediate responses. We have open-sourced the base models, hybrid reasoning models, and FP8 versions of the hybrid reasoning models for both GLM-4.5 and GLM-4.5-Air. They are released under the MIT open-source license and can be used commercially and for secondary development. As demonstrated in our comprehensive evaluation across 12 industry-standard benchmarks, GLM-4.5 achieves exceptional performance with a score of 63.2, in the 3rd place among all the proprietary and open-source models. Notably, GLM-4.5-Air delivers competitive results at 59.8 while maintaining superior efficiency. For more eval results, show cases, and technical details, please visit our technical blog. The technical report will be released soon. The model code, tool parser and reasoning parser can be found in the implementation of transformers, vLLM and SGLang.

license:mit
11
0

GLM-4.5-exl3-5.0bpw

πŸ“ Use GLM-4.5 API services on Z.ai API Platform (Global) or Zhipu AI Open Platform (Mainland China) . The GLM-4.5 series models are foundation models designed for intelligent agents. GLM-4.5 has 355 billion total parameters with 32 billion active parameters, while GLM-4.5-Air adopts a more compact design with 106 billion total parameters and 12 billion active parameters. GLM-4.5 models unify reasoning, coding, and intelligent agent capabilities to meet the complex demands of intelligent agent applications. Both GLM-4.5 and GLM-4.5-Air are hybrid reasoning models that provide two modes: thinking mode for complex reasoning and tool usage, and non-thinking mode for immediate responses. We have open-sourced the base models, hybrid reasoning models, and FP8 versions of the hybrid reasoning models for both GLM-4.5 and GLM-4.5-Air. They are released under the MIT open-source license and can be used commercially and for secondary development. As demonstrated in our comprehensive evaluation across 12 industry-standard benchmarks, GLM-4.5 achieves exceptional performance with a score of 63.2, in the 3rd place among all the proprietary and open-source models. Notably, GLM-4.5-Air delivers competitive results at 59.8 while maintaining superior efficiency. For more eval results, show cases, and technical details, please visit our technical blog. The technical report will be released soon. The model code, tool parser and reasoning parser can be found in the implementation of transformers, vLLM and SGLang.

license:mit
11
0

GLM-4.5-exl3-3.5bpw

πŸ“ Use GLM-4.5 API services on Z.ai API Platform (Global) or Zhipu AI Open Platform (Mainland China) . The GLM-4.5 series models are foundation models designed for intelligent agents. GLM-4.5 has 355 billion total parameters with 32 billion active parameters, while GLM-4.5-Air adopts a more compact design with 106 billion total parameters and 12 billion active parameters. GLM-4.5 models unify reasoning, coding, and intelligent agent capabilities to meet the complex demands of intelligent agent applications. Both GLM-4.5 and GLM-4.5-Air are hybrid reasoning models that provide two modes: thinking mode for complex reasoning and tool usage, and non-thinking mode for immediate responses. We have open-sourced the base models, hybrid reasoning models, and FP8 versions of the hybrid reasoning models for both GLM-4.5 and GLM-4.5-Air. They are released under the MIT open-source license and can be used commercially and for secondary development. As demonstrated in our comprehensive evaluation across 12 industry-standard benchmarks, GLM-4.5 achieves exceptional performance with a score of 63.2, in the 3rd place among all the proprietary and open-source models. Notably, GLM-4.5-Air delivers competitive results at 59.8 while maintaining superior efficiency. For more eval results, show cases, and technical details, please visit our technical blog. The technical report will be released soon. The model code, tool parser and reasoning parser can be found in the implementation of transformers, vLLM and SGLang.

license:mit
11
0

GLM-4.5-exl3-2.75bpw

πŸ“ Use GLM-4.5 API services on Z.ai API Platform (Global) or Zhipu AI Open Platform (Mainland China) . The GLM-4.5 series models are foundation models designed for intelligent agents. GLM-4.5 has 355 billion total parameters with 32 billion active parameters, while GLM-4.5-Air adopts a more compact design with 106 billion total parameters and 12 billion active parameters. GLM-4.5 models unify reasoning, coding, and intelligent agent capabilities to meet the complex demands of intelligent agent applications. Both GLM-4.5 and GLM-4.5-Air are hybrid reasoning models that provide two modes: thinking mode for complex reasoning and tool usage, and non-thinking mode for immediate responses. We have open-sourced the base models, hybrid reasoning models, and FP8 versions of the hybrid reasoning models for both GLM-4.5 and GLM-4.5-Air. They are released under the MIT open-source license and can be used commercially and for secondary development. As demonstrated in our comprehensive evaluation across 12 industry-standard benchmarks, GLM-4.5 achieves exceptional performance with a score of 63.2, in the 3rd place among all the proprietary and open-source models. Notably, GLM-4.5-Air delivers competitive results at 59.8 while maintaining superior efficiency. For more eval results, show cases, and technical details, please visit our technical blog. The technical report will be released soon. The model code, tool parser and reasoning parser can be found in the implementation of transformers, vLLM and SGLang.

NaNK
license:mit
11
0

Qwen3-32B-awq

NaNK
license:apache-2.0
10
3

Qwen3-30B-A3B-Instruct-2507-exl3-4.0bpw

We introduce the updated version of the Qwen3-30B-A3B non-thinking mode, named Qwen3-30B-A3B-Instruct-2507, featuring the following key enhancements: - Significant improvements in general capabilities, including instruction following, logical reasoning, text comprehension, mathematics, science, coding, and tool usage. - Substantial gains in long-tail knowledge coverage across multiple languages. - Markedly better alignment with user preferences in subjective and open-ended tasks, enabling more helpful responses and higher-quality text generation. - Enhanced capabilities in 256K long-context understanding. Qwen3-30B-A3B-Instruct-2507 has the following features: - Type: Causal Language Models - Training Stage: Pretraining & Post-training - Number of Parameters: 30.5B in total and 3.3B activated - Number of Parameters (Non-Embedding): 29.9B - Number of Layers: 48 - Number of Attention Heads (GQA): 32 for Q and 4 for KV - Number of Experts: 128 - Number of Activated Experts: 8 - Context Length: 262,144 natively. NOTE: This model supports only non-thinking mode and does not generate `<think></think>` blocks in its output. Meanwhile, specifying `enable_thinking=False` is no longer required. For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our blog, GitHub, and Documentation.
| | Deepseek-V3-0324 | GPT-4o-0327 | Gemini-2.5-Flash Non-Thinking | Qwen3-235B-A22B Non-Thinking | Qwen3-30B-A3B Non-Thinking | Qwen3-30B-A3B-Instruct-2507 | |--- | --- | --- | --- | --- | --- | --- | | Knowledge | | | | | | | | MMLU-Pro | 81.2 | 79.8 | 81.1 | 75.2 | 69.1 | 78.4 | | MMLU-Redux | 90.4 | 91.3 | 90.6 | 89.2 | 84.1 | 89.3 | | GPQA | 68.4 | 66.9 | 78.3 | 62.9 | 54.8 | 70.4 | | SuperGPQA | 57.3 | 51.0 | 54.6 | 48.2 | 42.2 | 53.4 | | Reasoning | | | | | | | | AIME25 | 46.6 | 26.7 | 61.6 | 24.7 | 21.6 | 61.3 | | HMMT25 | 27.5 | 7.9 | 45.8 | 10.0 | 12.0 | 43.0 | | ZebraLogic | 83.4 | 52.6 | 57.9 | 37.7 | 33.2 | 90.0 | | LiveBench 20241125 | 66.9 | 63.7 | 69.1 | 62.5 | 59.4 | 69.0 | | Coding | | | | | | | | LiveCodeBench v6 (25.02-25.05) | 45.2 | 35.8 | 40.1 | 32.9 | 29.0 | 43.2 | | MultiPL-E | 82.2 | 82.7 | 77.7 | 79.3 | 74.6 | 83.8 | | Aider-Polyglot | 55.1 | 45.3 | 44.0 | 59.6 | 24.4 | 35.6 | | Alignment | | | | | | | | IFEval | 82.3 | 83.9 | 84.3 | 83.2 | 83.7 | 84.7 | | Arena-Hard v2 | 45.6 | 61.9 | 58.3 | 52.0 | 24.8 | 69.0 | | Creative Writing v3 | 81.6 | 84.9 | 84.6 | 80.4 | 68.1 | 86.0 | | WritingBench | 74.5 | 75.5 | 80.5 | 77.0 | 72.2 | 85.5 | | Agent | | | | | | | | BFCL-v3 | 64.7 | 66.5 | 66.1 | 68.0 | 58.6 | 65.1 | | TAU1-Retail | 49.6 | 60.3# | 65.2 | 65.2 | 38.3 | 59.1 | | TAU1-Airline | 32.0 | 42.8# | 48.0 | 32.0 | 18.0 | 40.0 | | TAU2-Retail | 71.1 | 66.7# | 64.3 | 64.9 | 31.6 | 57.0 | | TAU2-Airline | 36.0 | 42.0# | 42.5 | 36.0 | 18.0 | 38.0 | | TAU2-Telecom | 34.0 | 29.8# | 16.9 | 24.6 | 18.4 | 12.3 | | Multilingualism | | | | | | | | MultiIF | 66.5 | 70.4 | 69.4 | 70.2 | 70.8 | 67.9 | | MMLU-ProX | 75.8 | 76.2 | 78.3 | 73.2 | 65.1 | 72.0 | | INCLUDE | 80.1 | 82.1 | 83.8 | 75.6 | 67.8 | 71.9 | | PolyMATH | 32.2 | 25.5 | 41.9 | 27.0 | 23.3 | 43.1 | : For reproducibility, we report the win rates evaluated by GPT-4.1. 
\#: Results were generated using GPT-4o-20241120, as access to the native function calling API of GPT-4o-0327 was unavailable. The code of Qwen3-MoE has been merged into the latest Hugging Face `transformers`, and we advise you to use the latest version of `transformers`. With `sglang>=0.4.6.post1` or `vllm>=0.8.5`, you can create an OpenAI-compatible API endpoint. Note: If you encounter out-of-memory (OOM) issues, consider reducing the context length to a shorter value, such as `32,768`. For local use, applications such as Ollama, LMStudio, MLX-LM, llama.cpp, and KTransformers also support Qwen3. Qwen3 excels in tool-calling capabilities. We recommend using Qwen-Agent to make the best use of the agentic ability of Qwen3. Qwen-Agent encapsulates tool-calling templates and tool-calling parsers internally, greatly reducing coding complexity. To define the available tools, you can use the MCP configuration file, use the integrated tools of Qwen-Agent, or integrate other tools by yourself. To achieve optimal performance, we recommend the following settings: 1. Sampling Parameters: - We suggest using `Temperature=0.7`, `TopP=0.8`, `TopK=20`, and `MinP=0`. - For supported frameworks, you can adjust the `presence_penalty` parameter between 0 and 2 to reduce endless repetitions. However, using a higher value may occasionally result in language mixing and a slight decrease in model performance. 2. Adequate Output Length: We recommend using an output length of 16,384 tokens for most queries, which is adequate for instruct models. 3. Standardize Output Format: We recommend using prompts to standardize model outputs when benchmarking. - Math Problems: Include "Please reason step by step, and put your final answer within \boxed{}." in the prompt. - Multiple-Choice Questions: Add the following JSON structure to the prompt to standardize responses: "Please show your choice in the `answer` field with only the choice letter, e.g., `"answer": "C"`."
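To make the recommended sampling settings concrete, the sketch below applies temperature, top-k, top-p (nucleus), and min-p filtering to a toy logit vector. This is a generic illustration of what these parameters do, not the sampler used by any particular framework:

```python
import numpy as np

def filter_logits(logits, temperature=0.7, top_k=20, top_p=0.8, min_p=0.0):
    """Apply temperature, then keep only tokens allowed by top-k, top-p, and min-p.
    Returns the renormalized sampling distribution. Illustrative only."""
    probs = np.exp(np.array(logits, dtype=float) / temperature)
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]                 # tokens sorted by probability, descending
    cum = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cum, top_p)) + 1   # smallest prefix whose mass reaches top_p
    kept = order[:min(cutoff, top_k)]               # intersect nucleus with top-k
    kept = kept[probs[kept] >= min_p * probs[kept].max()]  # min-p: relative-probability floor
    out = np.zeros_like(probs)
    out[kept] = probs[kept] / probs[kept].sum()     # renormalize over surviving tokens
    return out

dist = filter_logits([2.0, 1.0, 0.0, -1.0], temperature=1.0, top_k=3, top_p=0.9)
print(dist)
```

With settings like `Temperature=0.7`, `TopP=0.8`, `TopK=20`, and `MinP=0`, low-probability tokens are pruned before sampling, trading a little diversity for fewer degenerate continuations.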
If you find our work helpful, feel free to give us a cite.

NaNK
license:apache-2.0
9
2

Big-Tiger-Gemma-27B-v1-exl2_5.0bpw

NaNK
β€”
9
0

Meta-Llama-3.1-70B-Instruct-exl2_4.0bpw

NaNK
llama
9
0

Qwen3-30B-A3B-Instruct-2507-exl3-8.0bpw

NaNK
license:apache-2.0
9
0

Qwen3-30B-A3B-Instruct-2507-exl3-6.0bpw

NaNK
license:apache-2.0
9
0

Qwen3-30B-A3B-Instruct-2507-exl3-5.0bpw


NaNK
license:apache-2.0
8
0

Qwen3-30B-A3B-Instruct-2507-exl3-3.0bpw


NaNK
license:apache-2.0
8
0

mathstral-7B-v0.1-exl2_8.0bpw

NaNK
license:apache-2.0
6
0

Llama-3.3-70B-Instruct-exl3-4.03bpw

NaNK
llama
6
0

Qwen3-235B-A22B-Thinking-2507-exl3-4.0bpw

NaNK
license:apache-2.0
6
0

Qwen3-235B-A22B-Thinking-2507-exl3-3.5bpw

NaNK
license:apache-2.0
6
0

Qwen3-235B-A22B-Thinking-2507-exl3-3.0bpw

Over the past three months, we have continued to scale the thinking capability of Qwen3-235B-A22B, improving both the quality and depth of reasoning. We are pleased to introduce Qwen3-235B-A22B-Thinking-2507, featuring the following key enhancements: - Significantly improved performance on reasoning tasks, including logical reasoning, mathematics, science, coding, and academic benchmarks that typically require human expertise β€” achieving state-of-the-art results among open-source thinking models. - Markedly better general capabilities, such as instruction following, tool usage, text generation, and alignment with human preferences. - Enhanced 256K long-context understanding capabilities. NOTE: This version has an increased thinking length. We strongly recommend its use in highly complex reasoning tasks. Qwen3-235B-A22B-Thinking-2507 has the following features: - Type: Causal Language Models - Training Stage: Pretraining & Post-training - Number of Parameters: 235B in total and 22B activated - Number of Parameters (Non-Embedding): 234B - Number of Layers: 94 - Number of Attention Heads (GQA): 64 for Q and 4 for KV - Number of Experts: 128 - Number of Activated Experts: 8 - Context Length: 262,144 natively. Additionally, to enforce model thinking, the default chat template automatically includes `<think>`. Therefore, it is normal for the model's output to contain only `</think>` without an explicit opening `<think>` tag. For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our blog, GitHub, and Documentation.
| | Deepseek-R1-0528 | OpenAI O4-mini | OpenAI O3 | Gemini-2.5 Pro | Claude4 Opus Thinking | Qwen3-235B-A22B Thinking | Qwen3-235B-A22B-Thinking-2507 | |--- | --- | --- | --- | --- | --- | --- | --- | | Knowledge | | | | | | | | | MMLU-Pro | 85.0 | 81.9 | 85.9 | 85.6 | - | 82.8 | 84.4 | | MMLU-Redux | 93.4 | 92.8 | 94.9 | 94.4 | 94.6 | 92.7 | 93.8 | | GPQA | 81.0 | 81.4 | 83.3 | 86.4 | 79.6 | 71.1 | 81.1 | | SuperGPQA | 61.7 | 56.4 | - | 62.3 | - | 60.7 | 64.9 | | Reasoning | | | | | | | | AIME25 | 87.5 | 92.7 | 88.9 | 88.0 | 75.5 | 81.5 | 92.3 | | HMMT25 | 79.4 | 66.7 | 77.5 | 82.5 | 58.3 | 62.5 | 83.9 | | LiveBench 20241125 | 74.7 | 75.8 | 78.3 | 82.4 | 78.2 | 77.1 | 78.4 | | HLE | 17.7# | 18.1 | 20.3 | 21.6 | 10.7 | 11.8# | 18.2# | | Coding | | | | | | | | | LiveCodeBench v6 (25.02-25.05) | 68.7 | 71.8 | 58.6 | 72.5 | 48.9 | 55.7 | 74.1 | | CFEval | 2099 | 1929 | 2043 | 2001 | - | 2056 | 2134 | | OJBench | 33.6 | 33.3 | 25.4 | 38.9 | - | 25.6 | 32.5 | | Alignment | | | | | | | | | IFEval | 79.1 | 92.4 | 92.1 | 90.8 | 89.7 | 83.4 | 87.8 | | Arena-Hard v2$ | 72.2 | 59.3 | 80.8 | 72.5 | 59.1 | 61.5 | 79.7 | | Creative Writing v3 | 86.3 | 78.8 | 87.7 | 85.9 | 83.8 | 84.6 | 86.1 | | WritingBench | 83.2 | 78.4 | 85.3 | 83.1 | 79.1 | 80.3 | 88.3 | | Agent | | | | | | | | | BFCL-v3 | 63.8 | 67.2 | 72.4 | 67.2 | 61.8 | 70.8 | 71.9 | | TAU2-Retail | 64.9 | 71.0 | 76.3 | 71.3 | - | 40.4 | 71.9 | | TAU2-Airline | 60.0 | 59.0 | 70.0 | 60.0 | - | 30.0 | 58.0 | | TAU2-Telecom | 33.3 | 42.0 | 60.5 | 37.4 | - | 21.9 | 45.6 | | Multilingualism | | | | | | | | | MultiIF | 63.5 | 78.0 | 80.3 | 77.8 | - | 71.9 | 80.6 | | MMLU-ProX | 80.6 | 79.0 | 83.3 | 84.7 | - | 80.0 | 81.0 | | INCLUDE | 79.4 | 80.8 | 86.6 | 85.1 | - | 78.7 | 81.0 | | PolyMATH | 46.9 | 48.7 | 49.7 | 52.2 | - | 54.7 | 60.1 | \ For OpenAI O4-mini and O3, we use a medium reasoning effort, except for scores marked with , which are generated using high reasoning effort. 
\# According to the official evaluation criteria of HLE, scores marked with \# refer to models that are not multi-modal and were evaluated only on the text-only subset. $ For reproducibility, we report the win rates evaluated by GPT-4.1. \& For highly challenging tasks (including PolyMATH and all reasoning and coding tasks), we use an output length of 81,920 tokens. For all other tasks, we set the output length to 32,768. The code of Qwen3-MoE has been merged into the latest Hugging Face `transformers`, and we advise you to use the latest version of `transformers`. Parsing the thinking content from the generated token IDs:

```python
try:
    # rindex finding 151668 (</think>)
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0
thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")
print("thinking content:", thinking_content)  # no opening <think> tag
print("content:", content)
```

To create an OpenAI-compatible API endpoint:

```shell
python -m sglang.launch_server --model-path Qwen/Qwen3-235B-A22B-Thinking-2507 --tp 8 --context-length 262144 --reasoning-parser deepseek-r1
```

```shell
vllm serve Qwen/Qwen3-235B-A22B-Thinking-2507 --tensor-parallel-size 8 --max-model-len 262144 --enable-reasoning --reasoning-parser deepseek_r1
```

```python
from qwen_agent.agents import Assistant

# Define LLM: using Alibaba Cloud Model Studio
llm_cfg = {
    'model': 'qwen3-235b-a22b-thinking-2507',
    'model_type': 'qwen_dashscope',
}

# Or use an OpenAI-compatible API endpoint. It is recommended to disable the
# reasoning and tool call parsing functionality of the deployment frameworks
# and let Qwen-Agent automate the related operations. For example:
# VLLM_USE_MODELSCOPE=true vllm serve Qwen/Qwen3-235B-A22B-Thinking-2507 \
#   --served-model-name Qwen3-235B-A22B-Thinking-2507 --tensor-parallel-size 8 --max-model-len 262144
llm_cfg = {
    'model': 'Qwen3-235B-A22B-Thinking-2507',
    # Use a custom endpoint compatible with OpenAI API:
    'model_server': 'http://localhost:8000/v1',  # api_base without reasoning and tool call parsing
    'api_key': 'EMPTY',
    'generate_cfg': {
        'thought_in_content': True,
    },
}

# Define Tools
tools = [
    {'mcpServers': {  # You can specify the MCP configuration file
        'time': {
            'command': 'uvx',
            'args': ['mcp-server-time', '--local-timezone=Asia/Shanghai']
        },
        "fetch": {
            "command": "uvx",
            "args": ["mcp-server-fetch"]
        }
    }},
    'code_interpreter',  # Built-in tools
]

# Define Agent
bot = Assistant(llm=llm_cfg, function_list=tools)

# Streaming generation
messages = [{'role': 'user', 'content': 'https://qwenlm.github.io/blog/ Introduce the latest developments of Qwen'}]
for responses in bot.run(messages=messages):
    pass
print(responses)
```

```bibtex
@misc{qwen3technicalreport,
      title={Qwen3 Technical Report},
      author={Qwen Team},
      year={2025},
      eprint={2505.09388},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.09388},
}
```

NaNK
license:apache-2.0
6
0

DeepSeek-R1T-Chimera-bf16

This is the BF16 model converted from the original FP8 weights so that it can be quantized to GGUF. An open-weights model combining the intelligence of R1 with the token efficiency of V3. - Architecture: DeepSeek-MoE Transformer-based language model - Combination Method: Merged model weights from DeepSeek-R1 and DeepSeek-V3 (0324) - Release Date: 2025-04-27
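The FP8-to-BF16 conversion mentioned above is essentially a dequantization pass: each block of FP8 values is multiplied by its stored scale and written out at higher precision, so that tools without FP8 support (such as GGUF quantizers) can read the weights. The sketch below illustrates generic blockwise dequantization; the function name and block size are illustrative, not DeepSeek's actual conversion script:

```python
import numpy as np

def dequantize_blockwise(q, scales, block=4):
    """Expand quantized values q (stored with one scale per block) back to
    full precision: w = q * scale. Illustrative of FP8 -> BF16 upcasting."""
    q = np.asarray(q, dtype=np.float32)
    out = np.empty_like(q)
    for i in range(0, len(q), block):
        out[i:i + block] = q[i:i + block] * scales[i // block]
    return out

# Two blocks of 4 quantized values, each block with its own scale factor.
w = dequantize_blockwise([1, 2, 3, 4, 2, 2, 2, 2], scales=[0.5, 0.25])
print(w)
```

The converted checkpoint is larger on disk (16 bits per weight instead of 8), but downstream quantizers can then requantize it to any GGUF format.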

license:mit
5
1

Qwen3-235B-A22B-Instruct-2507-exl3-3.0bpw

NaNK
license:apache-2.0
5
1

Qwen3-235B-A22B-Thinking-2507-exl3-2.5bpw

NaNK
license:apache-2.0
5
1

L3-8B-Stheno-v3.2-exl2_8.0bpw

NaNK
llama
5
0

Big-Tiger-Gemma-27B-v1-exl2_8.0bpw

NaNK
β€”
4
1

c4ai-command-r-08-2024-exl2_3.0bpw

NaNK
license:cc-by-nc-4.0
4
1

Hermes-2-Theta-Llama-3-70B-32k-exl2_5.0bpw

NaNK
llama
4
0

Hunyuan-A13B-Instruct-hf

NaNK
β€”
3
6

Hunyuan-A13B-Instruct

NaNK
β€”
3
1

DeepSeek-TNG-R1T2-Chimera-BF16

Assembly of Experts Chimera model constructed with the DeepSeek R1-0528, R1 and V3-0324 parent models We present our new DeepSeek-TNG R1T2 Chimera 671B model, the first successor to our original DeepSeek R1T Chimera that was released on April 26th. Unlike the original Chimera, which was based on the two parent models V3-0324 and R1, the new Chimera is a Tri-Mind with three parents, namely additionally R1-0528. It is constructed using the Assembly of Experts-method with relatively fine-granular direct brain edits. This more refined assembly allowed, among other improvements, the fixing of the <think> token consistency issue, which was a weakness of R1T and is now solved for R1T2. R1T2 operates at a new sweet spot in intelligence vs. output token length. It appears to be... - about 20% faster than the regular R1, and more than twice as fast as R1-0528 - significantly more intelligent than the regular R1 in benchmarks such as GPQA and AIME-24 - much more intelligent and also think-token consistent compared to the first R1T Chimera 0426 - and generally well-behaved and a nice persona to talk to, even without any system prompt. R1T2 compared... 
- vs R1: We hope that R1T2 is a very desirable, almost universally better, drop-in replacement for R1 - vs R1-0528: R1T2 is a much cheaper alternative to the full R1-0528, if the fullest 0528-level intelligence is not required - vs R1T: R1T2 is usually recommended over R1T, unless the specific personality of R1T was optimal, the think-token issue not important, or R1T's higher speed crucial - vs V3-0324: V3 is so much faster that if you can live with the lower intelligence, take V3; however, if you need reasoning, R1T2 is the go-to model - R1-0528 thinks much longer, but also achieves better hard-benchmark results than R1T2 - As measured by SpeechMap.ai (courtesy of xlr8harder), R1T2 is significantly more reserved than R1T, but not as much as R1-0528 - Due to the influence of its R1 parent, which does not support function calling, R1T2 is not yet recommended for function-calling-intensive applications at this stage (this may be fixed at a later stage) - When switching from R1T to R1T2 during development, we changed from AIME24 and MT-Bench to AIME24, AIME25 and GPQA-Diamond for the intelligence score. With the new benchmark set, there is a larger score difference between R1 and the original R1T Chimera than published earlier. For details on the AoE construction process, you can read our paper on arXiv. - Architecture: DeepSeek-MoE transformer-based language model - Combination Method: Assembly of Experts from the three DeepSeek parent models R1-0528, R1 and V3-0324 - Release Date: 2025-07-02 - Design Team: Robert Dahlke, Henrik Klagges, Benjamin Merkel, Fabian Klemm and David Reiss, Munich, Germany - Extra Thanks: Big thanks to DeepSeek for their great models and open-source generosity, and to the other researchers that have published on model merging methodologies. Use, Out-of-scope Use, Other Limitations, Risks, Recommendations et al.
Regarding the R1T/R1T2 Chimeras, we ask you to follow the careful guidelines that Microsoft has created for their "MAI-DS-R1" DeepSeek-based model. These professional guidelines are available here on Hugging Face. Due to the strict new guidelines of the EU AI Act that take effect on August 2nd, 2025, we recommend that each R1T/R1T2 user in the EU either familiarizes themselves with these requirements and assesses their compliance, or ceases using the model in the EU after August 1st, 2025. Please give us your feedback, especially if you find deficiencies in the model: - Email: [email protected] - X.com: @tngtech
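As a toy picture of weight merging (not TNG's actual Assembly-of-Experts procedure or its coefficients), combining parent checkpoints can be sketched as a per-tensor convex combination; the tensor names and weights here are purely illustrative:

```python
import numpy as np

def assemble(parents, weights):
    """Merge parent checkpoints tensor-by-tensor as a convex combination.
    parents: list of {tensor_name: ndarray}; weights: one coefficient per parent.
    Toy sketch of a merge step, not the real AoE method (which edits expert
    tensors selectively and with finer granularity)."""
    assert abs(sum(weights) - 1.0) < 1e-9
    merged = {}
    for name in parents[0]:
        merged[name] = sum(w * p[name] for w, p in zip(weights, parents))
    return merged

# Hypothetical single-tensor "checkpoints" from two parents.
r1 = {"expert.0.w": np.array([1.0, 2.0])}
v3 = {"expert.0.w": np.array([3.0, 4.0])}
child = assemble([r1, v3], weights=[0.75, 0.25])
print(child["expert.0.w"])  # closer to the first (R1-like) parent
```

The interesting part of an Assembly-of-Experts construction is choosing *which* tensors come from *which* parent and at what mixing ratio, which is what distinguishes R1T2 from a naive uniform merge.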

NaNK
license:mit
3
1

Qwen3-235B-A22B-Instruct-2507-exl3-3.5bpw

NaNK
license:apache-2.0
3
1

Llama-3-Instruct-8B-SPPO-Iter3-exl2_4.0bpw

NaNK
llama
3
0

Llama-3-Instruct-8B-SPPO-Iter3-exl2_5.0bpw

NaNK
llama
3
0

mathstral-7B-v0.1-exl2_4.0bpw

NaNK
license:apache-2.0
3
0

c4ai-command-r-plus-08-2024-exl2_3.0bpw

NaNK
license:cc-by-nc-4.0
3
0

Magistral-Small-2506-fp8

NaNK
license:apache-2.0
3
0

Qwen3-235B-A22B-Instruct-2507-exl3-4.0bpw

NaNK
license:apache-2.0
3
0

Qwen3-235B-A22B-Instruct-2507-exl3-5.0bpw

NaNK
license:apache-2.0
3
0

Qwen3-235B-A22B-Thinking-2507-exl3-2.0bpw

NaNK
license:apache-2.0
3
0

DeepSeek-V3.1-GGUF

DeepSeek-V3.1 is a hybrid model that supports both thinking mode and non-thinking mode. Compared to the previous version, this upgrade brings improvements in multiple aspects: - Hybrid thinking mode: One model supports both thinking mode and non-thinking mode by changing the chat template. - Smarter tool calling: Through post-training optimization, the model's performance in tool usage and agent tasks has significantly improved. - Higher thinking efficiency: DeepSeek-V3.1-Think achieves comparable answer quality to DeepSeek-R1-0528, while responding more quickly. DeepSeek-V3.1 is post-trained on top of DeepSeek-V3.1-Base, which is built upon the original V3 base checkpoint through a two-phase long-context extension approach, following the methodology outlined in the original DeepSeek-V3 report. We have expanded our dataset by collecting additional long documents and substantially extending both training phases. The 32K extension phase has been increased 10-fold to 630B tokens, while the 128K extension phase has been extended by 3.3x to 209B tokens. Additionally, DeepSeek-V3.1 is trained using the UE8M0 FP8 scale data format to ensure compatibility with microscaling data formats.

| Model | #Total Params | #Activated Params | Context Length | Download |
| :------------: | :------------: | :------------: | :------------: | :------------: |
| DeepSeek-V3.1-Base | 671B | 37B | 128K | HuggingFace \| ModelScope |
| DeepSeek-V3.1 | 671B | 37B | 128K | HuggingFace \| ModelScope |

The details of our chat template are described in `tokenizer_config.json` and `assets/chat_template.jinja`. Here is a brief description. With the given prefix, DeepSeek-V3.1 generates responses to queries in non-thinking mode. Unlike DeepSeek-V3, it introduces an additional token `</think>`. Multi-Turn Context: ` {system prompt} {query} {response} ... {query} {response} ` By concatenating the context and the prefix, we obtain the correct prompt for the query.
The prefix of thinking mode is similar to DeepSeek-R1. Multi-Turn Context: ` {system prompt} {query} {response} ... {query} {response} ` The multi-turn template is the same as the non-thinking multi-turn chat template: the thinking tokens of the last turn are dropped, but the ` ` is retained in every turn of context.

ToolCall: Tool calling is supported in non-thinking mode. The format is: ` {system prompt}\n\n{tool_description} {query} `, where tool_description describes the available tools.

Code-Agent: We support various code agent frameworks. Please refer to the above tool-call format to create your own code agents. An example is shown in `assets/code_agent_trajectory.html`.

Search-Agent: We designed a specific format for the search tool call in thinking mode to support the search agent. For complex questions that require accessing external or up-to-date information, DeepSeek-V3.1 can leverage a user-provided search tool through a multi-turn tool-calling process. Please refer to `assets/search_tool_trajectory.html` and `assets/search_python_tool_trajectory.html` for the detailed templates.

Evaluation

| Category | Benchmark (Metric) | DeepSeek V3.1-NonThinking | DeepSeek V3 0324 | DeepSeek V3.1-Thinking | DeepSeek R1 0528 |
| --- | --- | --- | --- | --- | --- |
| General | MMLU-Redux (EM) | 91.8 | 90.5 | 93.7 | 93.4 |
| | MMLU-Pro (EM) | 83.7 | 81.2 | 84.8 | 85.0 |
| | GPQA-Diamond (Pass@1) | 74.9 | 68.4 | 80.1 | 81.0 |
| | Humanity's Last Exam (Pass@1) | - | - | 15.9 | 17.7 |
| Search Agent | BrowseComp | - | - | 30.0 | 8.9 |
| | BrowseComp_zh | - | - | 49.2 | 35.7 |
| | Humanity's Last Exam (Python + Search) | - | - | 29.8 | 24.8 |
| | SimpleQA | - | - | 93.4 | 92.3 |
| Code | LiveCodeBench (2408-2505) (Pass@1) | 56.4 | 43.0 | 74.8 | 73.3 |
| | Codeforces-Div1 (Rating) | - | - | 2091 | 1930 |
| | Aider-Polyglot (Acc.) | 68.4 | 55.1 | 76.3 | 71.6 |
| Code Agent | SWE Verified (Agent mode) | 66.0 | 45.4 | - | 44.6 |
| | SWE-bench Multilingual (Agent mode) | 54.5 | 29.3 | - | 30.5 |
| | Terminal-bench (Terminus 1 framework) | 31.3 | 13.3 | - | 5.7 |
| Math | AIME 2024 (Pass@1) | 66.3 | 59.4 | 93.1 | 91.4 |
| | AIME 2025 (Pass@1) | 49.8 | 51.3 | 88.4 | 87.5 |
| | HMMT 2025 (Pass@1) | 33.5 | 29.2 | 84.2 | 79.4 |

Note: - Search agents are evaluated with our internal search framework, which uses a commercial search API + webpage filter + 128K context window. Search-agent results of R1-0528 are evaluated with a pre-defined workflow. - SWE-bench is evaluated with our internal code agent framework. The model structure of DeepSeek-V3.1 is the same as DeepSeek-V3. Please visit the DeepSeek-V3 repo for more information about running this model locally. This repository and the model weights are licensed under the MIT License. If you have any questions, please raise an issue or contact us at [email protected].
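The multi-turn concatenation described above can be sketched as plain string assembly. The marker strings below are placeholders (the real special tokens are defined in the model's `tokenizer_config.json`), so this shows only the structure, not DeepSeek's exact template:

```python
def build_prompt(system, turns, query,
                 bos="<BOS>", user="<USER>", assistant="<ASSISTANT>", eos="<EOS>"):
    """Concatenate prior turns and the new query into a single generation prefix.
    Token strings are placeholders, not DeepSeek's actual special tokens."""
    ctx = bos + system
    for q, r in turns:                      # completed turns keep their responses
        ctx += user + q + assistant + r + eos
    return ctx + user + query + assistant   # model generates after the assistant marker

p = build_prompt("You are helpful.", [("Hi", "Hello!")], "What is 2+2?")
print(p)
```

Switching between thinking and non-thinking mode is then a matter of which marker the template places after the final assistant tag, while prior turns keep only their final answers.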

NaNK
license:mit
3
0

Hermes-2-Theta-Llama-3-70B-exl2_5.0bpw

NaNK
llama
2
2

L3-70B-Euryale-v2.1_exl2_8.0bpw

NaNK
llama
2
1

L3-70B-Euryale-v2.1_exl2_6.0bpw

NaNK
llama
2
1

L3-Aethora-15B-V2-exl2_8.0bpw

NaNK
llama
2
1

Meta-Llama-3.1-8B-Instruct-exl2_5.0bpw

NaNK
llama
2
1

DeepSeek-R1-0528-Qwen3-8B-exl3-8.0bpw

NaNK
license:mit
2
1

WizardLM-2-8x22-exl2_4.5bpw

NaNK
license:apache-2.0
2
0

Big-Tiger-Gemma-27B-v1-exl2_3.5bpw

NaNK
—
2
0

c4ai-command-r-plus-08-2024-exl2_8.0bpw

NaNK
license:cc-by-nc-4.0
2
0

Mistral-Large-Instruct-2411-exl2_5.0bpw

NaNK
—
2
0

QVQ-72B-Preview-exl2_5.0bpw

NaNK
—
2
0

DeepSeek-R1-0528-Qwen3-8B-exl3-6.0bpw

NaNK
license:mit
2
0

pixtral-12b-240910

NaNK
—
1
6

Hermes-2-Theta-Llama-3-70B-exl2_4.0bpw

NaNK
llama
1
2

c4ai-command-r-plus-08-2024-exl2_3.5bpw

NaNK
license:cc-by-nc-4.0
1
2

c4ai-command-r-plus-6.0bpw-exl2

NaNK
license:cc-by-nc-4.0
1
1

gradientai_Llama-3-8B-Instruct-262k_v2_exl2_5.0bpw

NaNK
llama
1
1

Codestral-22B-v0.1-exl2_6.0bpw

NaNK
—
1
1

Codestral-22B-v0.1-exl2_5.0bpw

NaNK
—
1
1

L3-70B-Euryale-v2.1_exl2_4.0bpw

NaNK
llama
1
1

L3-Aethora-15B-V2-exl2_6.0bpw

NaNK
llama
1
1

Meta-Llama-3.1-8B-Instruct-exl2_6.0bpw

NaNK
llama
1
1

Meta-Llama-3.1-70B-Instruct-exl2_6.0bpw

NaNK
llama
1
1

c4ai-command-r-08-2024-exl2_5.0bpw

NaNK
license:cc-by-nc-4.0
1
1

Qwen3-32B-exl3-4.83bpw

NaNK
license:apache-2.0
1
1

DeepSeek-R1-0528-bf16

NaNK
license:mit
1
1

gradientai_Llama-3-8B-Instruct-262k_exl2_6.0bpw

NaNK
llama
1
0

Qwen2-72B-Instruct_exl2_6.0bpw

NaNK
—
1
0

Meta-Llama-3-8B-Instruct_exl2_8.0bpw

NaNK
llama
1
0

Llama-3-Instruct-8B-SPPO-Iter3-exl2_6.0bpw

NaNK
llama
1
0

Hermes-2-Theta-Llama-3-70B-32k-exl2_6.0bpw

NaNK
llama
1
0

Hermes-2-Theta-Llama-3-70B-32k-exl2_4.0bpw

NaNK
llama
1
0

Meta-Llama-3.1-70B-Instruct-exl2_5.0bpw

NaNK
llama
1
0

Meta-Llama-3.1-70B-Instruct-exl2_8.0bpw

NaNK
llama
1
0

Mistral-Large-Instruct-2407-exl2_4.0bpw

NaNK
—
1
0

Hermes-3-Llama-3.1-8B-exl2_5.0bpw

NaNK
llama
1
0

Hermes-3-Llama-3.1-70B-exl2_5.0bpw

NaNK
llama
1
0

c4ai-command-r-plus-08-2024-exl2_4.0bpw

NaNK
license:cc-by-nc-4.0
1
0

c4ai-command-r-plus-08-2024-exl2_5.0bpw

NaNK
license:cc-by-nc-4.0
1
0

Athene-V2-Chat-exl2_5.0bpw

NaNK
—
1
0

Mistral-Large-Instruct-2411-exl2_3.5bpw

Mistral-Large-Instruct-2411 is an advanced dense Large Language Model (LLM) of 123B parameters with state-of-the-art reasoning, knowledge and coding capabilities, extending Mistral-Large-Instruct-2407 with better long context, function calling and system prompt support.

Key features
- Multi-lingual by design: dozens of languages supported, including English, French, German, Spanish, Italian, Chinese, Japanese, Korean, Portuguese, Dutch and Polish.
- Proficient in coding: trained on 80+ coding languages such as Python, Java, C, C++, JavaScript, and Bash, as well as more specific languages such as Swift and Fortran.
- Agent-centric: best-in-class agentic capabilities with native function calling and JSON output.
- Advanced reasoning: state-of-the-art mathematical and reasoning capabilities.
- Mistral Research License: allows usage and modification for non-commercial use.
- Large context: a large 128k context window.
- Robust context adherence: ensures strong adherence for RAG and large-context applications.
- System prompt: maintains strong adherence and support for more reliable system prompts.

System Prompt

We appreciate the feedback received from our community regarding our system prompt handling. In response, we have implemented stronger support for system prompts. To achieve optimal results, we recommend always including a system prompt that clearly outlines the bot's purpose, even if it is minimal. Be careful with subtle missing or trailing white spaces! Please make sure to use `mistral_common` as the source of truth.

The model can be used with the following frameworks. We recommend using this model with the vLLM library to implement production-ready inference pipelines. Also make sure you have `mistral_common >= 1.5.0` installed. You can also make use of a ready-to-go Docker image on the Docker Hub. We recommend that you use Mistral-Large-Instruct-2411 in a server/client setting.

Note: running Mistral-Large-Instruct-2411 on GPU requires over 300 GB of GPU RAM. To ping the client you can use a simple Python snippet.

Mistral-Large-2411 has much improved function calling capabilities that are fully supported using `mistral_common >= 1.5.0` and `vLLM >= v0.6.4.post1`. Make sure to serve the model with the following flags in vLLM:

The Mistral AI Team: Albert Jiang, Alexandre Sablayrolles, Alexis Tacnet, Alok Kothari, Antoine Roux, Arthur Mensch, Audrey Herblin-Stoop, Augustin Garreau, Austin Birky, Bam4d, Baptiste Bout, Baudouin de Monicault, Blanche Savary, Carole Rambaud, Caroline Feldman, Devendra Singh Chaplot, Diego de las Casas, Diogo Costa, Eleonore Arcelin, Emma Bou Hanna, Etienne Metzger, Gaspard Blanchet, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Harizo Rajaona, Henri Roussez, Hichem Sattouf, Ian Mack, Jean-Malo Delignon, Jessica Chudnovsky, Justus Murke, Kartik Khandelwal, Lawrence Stewart, Louis Martin, Louis Ternon, Lucile Saulnier, Lélio Renard Lavaud, Margaret Jennings, Marie Pellat, Marie Torelli, Marie-Anne Lachaux, Marjorie Janiewicz, Mickaël Seznec, Nicolas Schuhl, Niklas Muhs, Olivier de Garrigues, Patrick von Platen, Paul Jacob, Pauline Buche, Pavan Kumar Reddy, Perry Savas, Pierre Stock, Romain Sauvestre, Sagar Vaze, Sandeep Subramanian, Saurabh Garg, Sophia Yang, Szymon Antoniak, Teven Le Scao, Thibault Schueller, Thibaut Lavril, Thomas Wang, Théophile Gervet, Timothée Lacroix, Valera Nemychnikova, Wendy Shang, William El Sayed, William Marshall
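As a minimal sketch of that ping step, assuming a vLLM server with its OpenAI-compatible endpoint is already running locally on port 8000 (the model name and sampling settings here are illustrative, not prescribed by the card):

```python
# Minimal client sketch for a vLLM OpenAI-compatible server.
# Assumes a server already running at http://localhost:8000
# (e.g. started with `vllm serve ...`); the model name below is
# illustrative and must match whatever the server is serving.
import json
import urllib.request

def build_chat_request(model: str, system: str, user: str,
                       temperature: float = 0.35) -> dict:
    """Build the JSON body for POST /v1/chat/completions."""
    return {
        "model": model,
        "temperature": temperature,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
    }

def ping(base_url: str, body: dict) -> dict:
    """Send the request and return the decoded JSON response."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

body = build_chat_request(
    "mistralai/Mistral-Large-Instruct-2411",
    "You are a helpful coding assistant.",
    "Write a one-line Python hello world.",
)
# ping("http://localhost:8000", body)  # uncomment with a live server
```

Note how the system prompt is always included, per the recommendation above; the low temperature is a common choice for this model family but not mandated by the card.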

NaNK
—
1
0

QVQ-72B-Preview-exl2_4.0bpw

NaNK
—
1
0

QVQ-72B-Preview-exl2_4.5bpw

NaNK
—
1
0

EVA-Qwen2.5-72B-v0.2-exl2_3.5bpw

NaNK
—
1
0

Llama-3.3-70B-Instruct-exl3-4.58bpw

NaNK
llama
1
0

Qwen3-32B-exl3-3.63bpw

NaNK
license:apache-2.0
1
0

DeepSeek-R1-0528-Qwen3-8B-exl3-5.0bpw

NaNK
license:mit
1
0

Mistral-Small-3.2-24B-Instruct-2506-FP8

NaNK
license:apache-2.0
1
0

gradientai_Llama-3-8B-Instruct-262k_exl2_8.0bpw

NaNK
llama
0
3

Codestral-22B-v0.1-exl2_8.0bpw

NaNK
—
0
3

Mistral-Large-Instruct-2407-exl2_5.0bpw

NaNK
—
0
3

Llama-3SOME-8B-v2-exl2_6.0bpw

NaNK
llama
0
2

L3.1-8B-Celeste-V1.5-exl2_8.0bpw

NaNK
llama
0
2

Qwen3-30B-A3B-awq

NaNK
—
0
2

VibeVoice-Realtime-0.5B

NaNK
license:mit
0
1

c4ai-command-r-plus-8.0bpw-exl2

NaNK
license:cc-by-nc-4.0
0
1

gradientai_Llama-3-8B-Instruct-262k_v2_exl2_6.0bpw

NaNK
llama
0
1

Qwen2-72B-Instruct_exl2_8.0bpw

NaNK
—
0
1

WizardLM-2-8x22-exl2_4.0bpw

NaNK
license:apache-2.0
0
1

L3-Aethora-15B-V2-exl2_4.0bpw

NaNK
llama
0
1

L3-Aethora-15B-V2-exl2_5.0bpw

NaNK
llama
0
1

Hermes-2-Theta-Llama-3-70B-32k-exl2_8.0bpw

NaNK
llama
0
1

Big-Tiger-Gemma-27B-v1-exl2_6.0bpw

NaNK
—
0
1

c4ai-command-r-08-2024-exl2_4.0bpw

NaNK
license:cc-by-nc-4.0
0
1

c4ai-command-r-08-2024-exl2_8.0bpw

NaNK
license:cc-by-nc-4.0
0
1

Reflection-70B-exl2_5.0bpw

NaNK
llama
0
1

Reflection-70B-exl2_6.0bpw

NaNK
llama
0
1

Athene-V2-Chat-exl2_8.0bpw

NaNK
—
0
1

Mistral-Large-Instruct-2411-exl2_3.0bpw

NaNK
—
0
1