ModelCloud
Qwen1.5-1.8B-Chat-GPTQ-4bits-dynamic-cfg-with-lm_head-symTrue
Qwen1.5-1.8B-Chat-GPTQ-4bits-dynamic-cfg-with-lm_head-symFalse
gemma-2-9b-it-gptq-4bit
MiniMax-M2-GPTQMODEL-W4A16
This 4bit W4A16 model has been quantized by @Qubitum at ModelCloud using GPT-QModel.
MiniMax-M2-BF16
Model conversion and HF Transformer code by @Qubitum at ModelCloud . Please cite/give credit if you use this model or code. !!!! Includes Prelim (Alpha Quality) HF Transformers Support !!!! Please submit PRs if you find and fix bugs in the HF Transformer code! LFG! Today, we release and open source MiniMax-M2, a Mini model built for Max coding & agentic workflows. MiniMax-M2 redefines efficiency for agents. It's a compact, fast, and cost-effective MoE model (230 billion total parameters with 10 billion active parameters) built for elite performance in coding and agentic tasks, all while maintaining powerful general intelligence. With just 10 billion activated parameters, MiniMax-M2 provides the sophisticated, end-to-end tool use performance expected from today's leading models, but in a streamlined form factor that makes deployment and scaling easier than ever. Superior Intelligence. According to benchmarks from Artificial Analysis, MiniMax-M2 demonstrates highly competitive general intelligence across mathematics, science, instruction following, coding, and agentic tool use. Its composite score ranks #1 among open-source models globally. Advanced Coding. Engineered for end-to-end developer workflows, MiniMax-M2 excels at multi-file edits, coding-run-fix loops, and test-validated repairs. Strong performance on Terminal-Bench and (Multi-)SWE-Bench–style tasks demonstrates practical effectiveness in terminals, IDEs, and CI across languages. Agent Performance. MiniMax-M2 plans and executes complex, long-horizon toolchains across shell, browser, retrieval, and code runners. In BrowseComp-style evaluations, it consistently locates hard-to-surface sources, maintains evidence traceable, and gracefully recovers from flaky steps. Efficient Design. With 10 billion activated parameters (230 billion in total), MiniMax-M2 delivers lower latency, lower cost, and higher throughput for interactive agents and batched sampling—perfectly aligned with the shift toward highly deployable models that still shine on coding and agentic tasks. These comprehensive evaluations test real-world end-to-end coding and agentic tool use: editing real repos, executing commands, browsing the web, and delivering functional solutions. Performance on this suite correlates with day-to-day developer experience in terminals, IDEs, and CI. | Benchmark | MiniMax-M2 | Claude Sonnet 4 | Claude Sonnet 4.5 | Gemini 2.5 Pro | GPT-5 (thinking) | GLM-4.6 | Kimi K2 0905 | DeepSeek-V3.2 | |-----------|------------|-----------------|-------------------|-----------------|------------------|---------|---------------|----------------| | SWE-bench Verified | 69.4 | 72.7 | 77.2 | 63.8 | 74.9 | 68 | 69.2 | 67.8 | | Multi-SWE-Bench | 36.2 | 35.7 | 44.3 | / | / | 30 | 33.5 | 30.6 | | SWE-bench Multilingual | 56.5 | 56.9 | 68 | / | / | 53.8 | 55.9 | 57.9 | | Terminal-Bench | 46.3 | 36.4 | 50 | 25.3 | 43.8 | 40.5 | 44.5 | 37.7 | | ArtifactsBench | 66.8 | 57.3 | 61.5 | 57.7 | 73 | 59.8 | 54.2 | 55.8 | | BrowseComp | 44 | 12.2 | 19.6 | 9.9 | 54.9 | 45.1 | 14.1 | 40.1 | | BrowseComp-zh | 48.5 | 29.1 | 40.8 | 32.2 | 65 | 49.5 | 28.8 | 47.9 | | GAIA (text only) | 75.7 | 68.3 | 71.2 | 60.2 | 76.4 | 71.9 | 60.2 | 63.5 | | xbench-DeepSearch | 72 | 64.6 | 66 | 56 | 77.8 | 70 | 61 | 71 | | HLE (w/ tools) | 31.8 | 20.3 | 24.5 | 28.4 | 35.2 | 30.4 | 26.9 | 27.2 | | τ²-Bench | 77.2 | 65.5 | 84.7 | 59.2 | 80.1 | 75.9 | 70.3 | 66.7 | | FinSearchComp-global | 65.5 | 42 | 60.8 | 42.6 | 63.9 | 29.2 | 29.5 | 26.2 | | AgentCompany | 36 | 37 | 41 | 39.3 | / | 35 | 30 | 34 | >Notes: Data points marked with an asterisk () are taken directly from the model's official tech report or blog. All other metrics were obtained using the evaluation methods described below. >- SWE-bench Verified: We use the same scaffold as R2E-Gym (Jain et al. 2025) on top of OpenHands to test with agents on SWE tasks. All scores are validated on our internal infrastructure with 128k context length, 100 max steps, and no test-time scaling. All git-related content is removed to ensure agent sees only the code at the issue point. >- Multi-SWE-Bench & SWE-bench Multilingual: All scores are averaged across 8 runs using the claude-code CLI (300 max steps) as the evaluation scaffold. >- Terminal-Bench: All scores are evaluated with the official claude-code from the original Terminal-Bench repository(commit `94bf692`), averaged over 8 runs to report the mean pass rate. >- ArtifactsBench: All Scores are computed by averaging three runs with the official implementation of ArtifactsBench, using the stable Gemini-2.5-Pro as the judge model. >- BrowseComp & BrowseComp-zh & GAIA (text only) & xbench-DeepSearch: All scores reported use the same agent framework as WebExplorer (Liu et al. 2025), with minor tools description adjustment. We use the 103-sample text-only GAIA validation subset following WebExplorer (Liu et al. 2025). >- HLE (w/ tools): All reported scores are obtained using search tools and a Python tool. The search tools employ the same agent framework as WebExplorer (Liu et al. 2025), and the Python tool runs in a Jupyter environment. We use the text-only HLE subset. >- τ²-Bench: All scores reported use "extended thinking with tool use", and employ GPT-4.1 as the user simulator. >- FinSearchComp-global: Official results are reported for GPT-5-Thinking, Gemini 2.5 Pro, and Kimi-K2. Other models are evaluated using the open-source FinSearchComp (Hu et al. 2025) framework using both search and Python tools, launched simultaneously for consistency. >- AgentCompany: All scores reported use OpenHands 0.42 agent framework. We align with Artificial Analysis, which aggregates challenging benchmarks using a consistent methodology to reflect a model’s broader intelligence profile across math, science, instruction following, coding, and agentic tool use. | Metric (AA) | MiniMax-M2 | Claude Sonnet 4 | Claude Sonnet 4.5 | Gemini 2.5 Pro | GPT-5 (thinking) | GLM-4.6 | Kimi K2 0905 | DeepSeek-V3.2 | |-----------------|----------------|---------------------|------------------------|---------------------|----------------------|-------------|------------------|-------------------| | AIME25 | 78 | 74 | 88 | 88 | 94 | 86 | 57 | 88 | | MMLU-Pro | 82 | 84 | 88 | 86 | 87 | 83 | 82 | 85 | | GPQA-Diamond | 78 | 78 | 83 | 84 | 85 | 78 | 77 | 80 | | HLE (w/o tools) | 12.5 | 9.6 | 17.3 | 21.1 | 26.5 | 13.3 | 6.3 | 13.8 | | LiveCodeBench (LCB) | 83 | 66 | 71 | 80 | 85 | 70 | 61 | 79 | | SciCode | 36 | 40 | 45 | 43 | 43 | 38 | 31 | 38 | | IFBench | 72 | 55 | 57 | 49 | 73 | 43 | 42 | 54 | | AA-LCR | 61 | 65 | 66 | 66 | 76 | 54 | 52 | 69 | | τ²-Bench-Telecom | 87 | 65 | 78 | 54 | 85 | 71 | 73 | 34 | | Terminal-Bench-Hard | 24 | 30 | 33 | 25 | 31 | 23 | 23 | 29 | | AA Intelligence | 61 | 57 | 63 | 60 | 69 | 56 | 50 | 57 | >AA: All scores of MiniMax-M2 aligned with Artificial Analysis Intelligence Benchmarking Methodology (https://artificialanalysis.ai/methodology/intelligence-benchmarking). All scores of other models reported from https://artificialanalysis.ai/. By maintaining activations around 10B , the plan → act → verify loop in the agentic workflow is streamlined, improving responsiveness and reducing compute overhead: - Faster feedback cycles in compile-run-test and browse-retrieve-cite chains. - More concurrent runs on the same budget for regression suites and multi-seed explorations. - Simpler capacity planning with smaller per-request memory and steadier tail latency. In short: 10B activations = responsive agent loops + better unit economics. If you need frontier-style coding and agents without frontier-scale costs, MiniMax-M2 hits the sweet spot: fast inference speeds, robust tool-use capabilities, and a deployment-friendly footprint. We look forward to your feedback and to collaborating with developers and researchers to bring the future of intelligent collaboration one step closer. - Our product MiniMax Agent, built on MiniMax-M2, is now publicly available and free for a limited time: https://agent.minimaxi.io/ - The MiniMax-M2 API is now live on the MiniMax Open Platform and is free for a limited time: https://platform.minimax.io/docs/guides/text-generation - The MiniMax-M2 model weights are now open-source, allowing for local deployment and use: https://huggingface.co/MiniMaxAI/MiniMax-M2. Download the model from HuggingFace repository: https://huggingface.co/MiniMaxAI/MiniMax-M2. We recommend using the following inference frameworks (listed alphabetically) to serve the model: We recommend using SGLang to serve MiniMax-M2. SGLang provides solid day-0 support for MiniMax-M2 model. Please refer to our SGLang Deployment Guide for more details, and thanks so much for our collaboration with the SGLang team. We recommend using vLLM to serve MiniMax-M2. vLLM provides efficient day-0 support of MiniMax-M2 model, check https://docs.vllm.ai/projects/recipes/en/latest/MiniMax/MiniMax-M2.html for latest deployment guide. We also provide our vLLM Deployment Guide. Inference Parameters We recommend using the following parameters for best performance: `temperature=1.0`, `topp = 0.95`, `topk = 40`. IMPORTANT: MiniMax-M2 is an interleaved thinking model. Therefore, when using it, it is important to retain the thinking content from the assistant's turns within the historical messages. In the model's output content, we use the ` ... ` format to wrap the assistant's thinking content. When using the model, you must ensure that the historical content is passed back in its original format. Do not remove the ` ... ` part, otherwise, the model's performance will be negatively affected.
Llama-3.2-1B-Instruct-gptqmodel-4bit-vortex-v2.5
GLM-4.6-GPTQMODEL-W4A16-v2
This 4bit W4A16 model has been quantized by @Qubitum at ModelCloud using GPT-QModel.
Qwen2.5-0.5B-Instruct-gptqmodel-w4a16
GLM-4.6-GPTQMODEL-W4A16-v1
This 4bit W4A16 model has been quantized by @Qubitum at ModelCloud using GPT-QModel.
sat-3l-sm-int8-onnx
Meta-Llama-3.1-8B-Instruct-gptq-4bit
Meta-Llama-3.1-8B-gptq-4bit
Llama-3.2-1B-Instruct-gptqmodel-4bit-vortex-v1
License: llama3.2 Language: en
GLM-4.6-REAP-268B-A32B-GPTQMODEL-W4A16
Brumby-14B-Base-GPTQMODEL-W4A16
This 4bit W4A16 model has been quantized by @Qubitum at ModelCloud using GPT-QModel.
glm-4-9b-chat-gptqmodel-w4a16
Granite-4.0-H-1B-GPTQMODEL-W4A16
This 4bit W4A16 model has been quantized by @Qubitum at ModelCloud using GPT-QModel.
Granite-4.0-H-350M-GPTQMODEL-W4A16
Llama3.2-1B-Instruct
Marin-32B-Base-GPTQMODEL-AWQ-W4A16
This 4bit W4A16 model has been quantized by @Qubitum at ModelCloud using GPT-QModel.
Llama-3.2-3B-Instruct-gptqmodel-4bit-vortex-v3
- bits: 4 - dynamic: null - groupsize: 32 - descact: true - staticgroups: false - sym: true - lmhead: false - truesequential: true - quantmethod: "gptq" - checkpointformat: "gptq" - meta: - quantizer: gptqmodel:1.1.0 - uri: https://github.com/modelcloud/gptqmodel - damppercent: 0.1 - dampautoincrement: 0.0015
Brumby-14B-Base-GPTQMODEL-W4A16-v2
This 4bit W4A16 model has been quantized by @Qubitum at ModelCloud using GPT-QModel.
Meta-Llama-3.1-405B-Instruct-gptq-4bit
DeepSeek-R1-Distill-Qwen-7B-gptqmodel-4bit-vortex-v2
Mistral-Nemo-Instruct-2407-gptq-4bit
glm-4-9b-chat-gptq-4bit
QwQ-32B-Preview-gptqmodel-4bit-vortex-v3
Llama-3.2-1B-gptqmodel-ci-4bit
Qwen2.5-Coder-32B-Instruct-gptqmodel-4bit-vortex-v1
gemma-2-27b-it-gptq-4bit
Marin-32B-Base-GPTQMODEL-W4A16
QwQ-32B-Preview-gguf-vortex-v1
DeepSeek-R1-Distill-Qwen-7B-gptqmodel-4bit-vortex-v1
🚨🚨🚨 Please use Vortex V2 model which fixed <think> token regression. 🚨🚨🚨 - bits: 4 - dynamic: null - groupsize: 32 - descact: true - staticgroups: false - sym: true - lmhead: false - truesequential: true - quantmethod: "gptq" - checkpointformat: "gptq" - meta: - quantizer: gptqmodel:1.7.4 - uri: https://github.com/modelcloud/gptqmodel - damppercent: 0.1 - dampautoincrement: 0.0025
QwQ-32B-Preview-gptqmodel-4bit-vortex-v1
Mistral-Large-Instruct-2407-gptq-4bit
Falcon3-10B-Instruct-gptqmodel-4bit-vortex-v1
- bits: 4 - dynamic: null - groupsize: 32 - descact: true - staticgroups: false - sym: true - lmhead: false - truesequential: true - quantmethod: "gptq" - checkpointformat: "gptq" - meta: - quantizer: gptqmodel:1.4.4 - uri: https://github.com/modelcloud/gptqmodel - damppercent: 0.1 - dampautoincrement: 0.0025
glm-4-9b-gptq-4bit
dbrx-instruct-converted-v2
TinyLlama-1.1B-Chat-v1.0-GPTQ-4bit-10-25-2024
Opt-125-GPTQ-4bit-10-25-2024
opt-125m-with-ipex-xpu
TinyLlama-1.1B-Chat-v1.0-autoround-4bit
TinyLlama-1.1B-Chat-v1.0-GPTQ-4bits-dynamic-cfg
GPTQ-v2-Llama-3.1-8B-Instruct
pangu_alpha_2_6B
PanGu-α is proposed by a joint technical team headed by PCNL. It was first released in this repository It is the first large-scale Chinese pre-trained language model with 200 billion parameters trained on 2048 Ascend processors using an automatic hybrid parallel training strategy. The whole training process is done on the “Peng Cheng Cloud Brain II” computing platform with the domestic deep learning framework called MindSpore. The PengCheng·PanGu-α pre-training model can support rich applications, has strong few-shot learning capabilities, and has outstanding performance in text generation tasks such as knowledge question and answer, knowledge retrieval, knowledge reasoning, and reading comprehension. This repository contains PyTorch implementation of PanGu model, with 2.6 billion parameters pretrained weights (FP32 precision), converted from original MindSpore checkpoint. Currently PanGu model is not supported by transformers, so `trustremotecode=True` is required to load model implementation in this repo.
QwQ-32B-Preview-gptqmodel-4bit-vortex-v2
QwQ-32B-Preview-gptqmodel-4bit-vortex-mlx-v3
gemma-2-9b-gptq-4bit
GRIN-MoE-gptq-4bit
DeepSeek-V3-0324-BF16
Falcon3-10B-Instruct-gptqmodel-4bit-vortex-mlx-v1
tinyllama-15M-stories
dbrx-base-converted-v2
DeepSeek-V2-Lite-gptq-4bit
internlm-2.5-7b-chat-1m-gptq-4bit
QwQ-32B-gptqmodel-4bit-vortex-v1
Meta-Llama-3.1-70B-Instruct-gptq-4bit
EXAONE-3.0-7.8B-Instruct-gptq-4bit
Llama-3.2-1B-Instruct-gptqmodel-4bit-vortex-v2
Llama-3.2-3B-Instruct-gptqmodel-4bit-vortex-mlx-v3
This model was quantized and exported to mlx using GPTQModel.