RedHatAI
Meta-Llama-3.1-70B-Instruct-quantized.w4a16
---
tags: [int4, vllm]
language: [en, de, fr, it, pt, hi, es, th]
pipeline_tag: text-generation
license: llama3.1
base_model: meta-llama/Meta-Llama-3.1-70B-Instruct
---
Mistral-Small-3.1-24B-Instruct-2503-quantized.w4a16
---
language: [en, fr, de, es, it, pt, hi, id, tl, vi, ar, bg, zh, da, el, fa, fi, he, ja, ko, ms, nl, no, pl, ro, ru, sr, sv, th, tr, uk, ur, zsm, nld]
base_model: [mistralai/Mistral-Small-3.1-24B-Instruct-2503]
pipeline_tag: image-text-to-text
tags: [mistralai, mistral, mistral3, mistral-small, neuralmagic, redhat, llmcompressor, quantized, W4A16, INT4, conversational, compressed-tensors, fast]
license: apache-2.0
license_name: apache-2.0
name: RedH
Mistral-7B-Instruct-v0.3-GPTQ-4bit
---
license: apache-2.0
base_model: mistralai/Mistral-7B-Instruct-v0.3
Meta-Llama-3.1-8B-Instruct-FP8
---
tags: [fp8, vllm]
language: [en, de, fr, it, pt, hi, es, th]
pipeline_tag: text-generation
license: llama3.1
base_model: meta-llama/Meta-Llama-3.1-8B-Instruct
---
Llama-3.3-70B-Instruct-FP8-dynamic
---
language: [en, de, fr, it, pt, hi, es, th]
base_model: [meta-llama/Llama-3.3-70B-Instruct]
pipeline_tag: text-generation
tags: [llama, facebook, meta, llama-3, fp8, quantized, conversational, text-generation-inference, compressed-tensors]
license: llama3.3
license_name: llama-3.3
name: RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic
description: This model was obtained by quantizing activations and weights of Llama-3.3-70B-Instruct to FP8 data type.
readme: https://huggingface.co/R
Llama-3.2-1B-Instruct-FP8-dynamic
gemma-3-27b-it-FP8-dynamic
Qwen2.5-1.5B-quantized.w8a8
Devstral-Small-2507-FP8-Dynamic
Llama-3.2-1B-Instruct-FP8
gpt-oss-20b-speculator.eagle3
Model Overview
- Verifier: openai/gpt-oss-20b
- Speculative Decoding Algorithm: EAGLE-3
- Model Architecture: Eagle3Speculator
- Release Date: 11/21/2025
- Version: 2.0
- Model Developers: RedHat

This is a speculator model designed for use with openai/gpt-oss-20b, based on the EAGLE-3 speculative decoding algorithm. It was trained using the speculators library on a combination of the Magpie-Align/Magpie-Llama-3.1-Pro-300K-Filtered dataset and the `train_sft` split of the HuggingFaceH4/ultrachat_200k dataset. This model should be used with the openai/gpt-oss-20b chat template, specifically through the `/chat/completions` endpoint.

Text Summarization: 1.63 / 2.05 / 2.18 / 2.31 / 2.33 / 2.38 / 2.35

Benchmarking settings:
- temperature: 0.6
- top_p: 0.95
- top_k: 20
- repetitions: 3
- time per experiment: 10 min
- hardware: 2xA100
- vLLM version: 0.11.0
- GuideLLM version: 0.3.0

Command
```bash
GUIDELLM__PREFERRED_ROUTE="chat_completions" \
guidellm benchmark \
  --target "http://localhost:8000/v1" \
  --data "RedHatAI/speculator_benchmarks" \
  --data-args '{"data_files": "HumanEval.jsonl"}' \
  --rate-type sweep \
  --max-seconds 600 \
  --output-path "gpt-oss-20b-HumanEval.json" \
  --backend-args '{"extra_body": {"chat_completions": {"temperature": 0.0}}}'
```
Qwen2.5-VL-7B-Instruct-FP8-Dynamic
Model Overview
- Model Architecture: Qwen2.5-VL-7B-Instruct
- Input: Vision-Text
- Output: Text
- Model Optimizations:
  - Weight quantization: FP8
  - Activation quantization: FP8
- Release Date: 2/24/2025
- Version: 1.0
- Model Developers: Neural Magic

This model was obtained by quantizing the weights of Qwen/Qwen2.5-VL-7B-Instruct to FP8 data type, ready for inference with vLLM >= 0.5.2.

This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details.

This model was created with llm-compressor by running the code snippet below, as part of a multimodal announcement blog.

The model was evaluated using mistral-evals for vision-related tasks and lm-evaluation-harness for select text-based benchmarks. The evaluations were conducted using the following commands:

Vision Tasks
- vqav2
- docvqa
- mathvista
- mmmu
- chartqa

| Category | Metric | Qwen/Qwen2.5-VL-7B-Instruct | neuralmagic/Qwen2.5-VL-7B-Instruct-FP8-Dynamic | Recovery (%) |
|---|---|---|---|---|
| Vision | MMMU (val, CoT) explicit_prompt_relaxed_correctness | 52.00 | 52.55 | 101.06% |
| Vision | ChartQA (test, CoT) anywhere_in_answer_relaxed_correctness | 86.44 | 86.80 | 100.42% |
| Vision | Mathvista (testmini, CoT) explicit_prompt_relaxed_correctness | 69.47 | 71.07 | 102.31% |

This model achieves up to 1.3x speedup in single-stream deployment and up to 1.37x speedup in multi-stream deployment, depending on hardware and use-case scenario. The following performance benchmarks were conducted with vLLM version 0.7.2 and GuideLLM.

Single-stream performance (measured with vLLM version 0.7.2)

Use case profiles, image size (WxH) / prompt tokens / generation tokens:
- Document Visual Question Answering: 1680W x 2240H, 64/128
- Visual Reasoning: 640W x 480H, 128/128
- Image Captioning: 480W x 360H, 0/128

| Hardware | Model | Average Cost Reduction | DocVQA Latency (s) | DocVQA QPD | Visual Reasoning Latency (s) | Visual Reasoning QPD | Captioning Latency (s) | Captioning QPD |
|---|---|---|---|---|---|---|---|---|
| | neuralmagic/Qwen2.5-VL-7B-Instruct-quantized.w8a8 | 1.50 | 3.6 | 1248 | 2.1 | 2163 | 2.0 | 2237 |
| | neuralmagic/Qwen2.5-VL-7B-Instruct-quantized.w4a16 | 2.05 | 3.3 | 1351 | 1.4 | 3252 | 1.4 | 3321 |
| | neuralmagic/Qwen2.5-VL-7B-Instruct-quantized.w8a8 | 1.24 | 2.4 | 851 | 1.4 | 1454 | 1.3 | 1512 |
| | neuralmagic/Qwen2.5-VL-7B-Instruct-quantized.w4a16 | 1.49 | 2.2 | 912 | 1.1 | 1791 | 1.0 | 1950 |
| | neuralmagic/Qwen2.5-VL-7B-Instruct-FP8-Dynamic | 1.28 | 1.6 | 698 | 0.9 | 1181 | 0.9 | 1219 |
| | neuralmagic/Qwen2.5-VL-7B-Instruct-quantized.w4a16 | 1.28 | 1.6 | 686 | 0.9 | 1191 | 0.9 | 1228 |

QPD: queries per dollar, based on on-demand cost at Lambda Labs (observed on 2/18/2025).

Multi-stream asynchronous performance (measured with vLLM version 0.7.2)

| Hardware | Model | Average Cost Reduction | DocVQA Maximum throughput (QPS) | DocVQA QPD | Visual Reasoning Maximum throughput (QPS) | Visual Reasoning QPD | Captioning Maximum throughput (QPS) | Captioning QPD |
|---|---|---|---|---|---|---|---|---|
| | neuralmagic/Qwen2.5-VL-7B-Instruct-quantized.w8a8 | 1.41 | 0.5 | 2297 | 2.3 | 10137 | 2.5 | 11472 |
| | neuralmagic/Qwen2.5-VL-7B-Instruct-quantized.w4a16 | 1.60 | 0.4 | 1828 | 2.7 | 12254 | 3.4 | 15477 |
| | neuralmagic/Qwen2.5-VL-7B-Instruct-quantized.w8a8 | 1.27 | 0.8 | 1639 | 3.4 | 6851 | 3.9 | 7918 |
| | neuralmagic/Qwen2.5-VL-7B-Instruct-quantized.w4a16 | 1.21 | 0.7 | 1314 | 3.0 | 5983 | 4.6 | 9206 |
| | neuralmagic/Qwen2.5-VL-7B-Instruct-FP8-Dynamic | 1.29 | 1.2 | 1331 | 3.8 | 4109 | 4.2 | 4598 |
| | neuralmagic/Qwen2.5-VL-7B-Instruct-quantized.w4a16 | 1.28 | 1.2 | 1298 | 3.8 | 4190 | 4.2 | 4573 |

QPD: queries per dollar, based on on-demand cost at Lambda Labs (observed on 2/18/2025).
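A minimal sketch of serving this model with vLLM's OpenAI-compatible server and querying it with the `openai` client; the server address, image URL, and prompt are illustrative.

```python
# Sketch: query a vLLM OpenAI-compatible server hosting the quantized model.
# Assumes the server was started separately, e.g.:
#   vllm serve neuralmagic/Qwen2.5-VL-7B-Instruct-FP8-Dynamic
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="neuralmagic/Qwen2.5-VL-7B-Instruct-FP8-Dynamic",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
                {"type": "text", "text": "Describe the contents of this image."},
            ],
        }
    ],
    max_tokens=128,
)
print(response.choices[0].message.content)
```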
Qwen3-30B-A3B-Instruct-2507-speculator.eagle3
Qwen3.5-122B-A10B-NVFP4
Meta-Llama-3.1-8B-Instruct-quantized.w4a16
Qwen3-VL-235B-A22B-Instruct-FP8-dynamic
Model Overview
- Model Architecture: Qwen3VLMoeForConditionalGeneration
- Input: Text/Image/Video
- Output: Text
- Model Optimizations:
  - Weight quantization: FP8
  - Activation quantization: FP8
- Release Date: 09/28/2025
- Version: 1.0
- Model Developers: Red Hat

Quantized version of Qwen/Qwen3-VL-235B-A22B-Instruct. This model was obtained by quantizing the weights and activations of Qwen/Qwen3-VL-235B-A22B-Instruct to FP8 data type. This optimization reduces the number of bits per parameter from 16 to 8, reducing the disk size and GPU memory requirements by approximately 50%. Only the weights and activations of the linear operators within transformers blocks of the language model are quantized.

This model was quantized using the llm-compressor library as shown below.

The model was evaluated on the OpenLLM v1 leaderboard tasks using lm-evaluation-harness, on reasoning tasks using lighteval, and on vision tasks using lmms-eval. vLLM was used for all evaluations.

| Category | Metric | Qwen3-VL-235B-A22B-Instruct | Qwen3-VL-235B-A22B-Instruct-FP8-dynamic | Recovery (%) |
|---|---|---|---|---|
| OpenLLM V1 | ARC-Challenge (Acc-Norm, 25-shot) | 76.54 | 75.94 | 99.2 |
Meta-Llama-3.1-8B-Instruct-FP8-dynamic
Qwen3-32B-FP8-dynamic
Model Overview
- Model Architecture: Qwen3ForCausalLM
- Input: Text
- Output: Text
- Model Optimizations:
  - Activation quantization: FP8
  - Weight quantization: FP8
- Intended Use Cases:
  - Reasoning.
  - Function calling.
  - Subject matter experts via fine-tuning.
  - Multilingual instruction following.
  - Translation.
- Out-of-scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws).
- Release Date: 05/02/2025
- Version: 1.0
- Model Developers: RedHat (Neural Magic)

This model was obtained by quantizing activations and weights of Qwen3-32B to FP8 data type. This optimization reduces the number of bits used to represent weights and activations from 16 to 8, reducing GPU memory requirements (by approximately 50%) and increasing matrix-multiply compute throughput (by approximately 2x). Weight quantization also reduces disk size requirements by approximately 50%. Only weights and activations of the linear operators within transformers blocks are quantized. Weights are quantized with a symmetric static per-channel scheme, whereas activations are quantized with a symmetric dynamic per-token scheme. The llm-compressor library is used for quantization.

This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details.

Creation details: This model was created with llm-compressor by running the code snippet below.

The model was evaluated on the OpenLLM leaderboard tasks (versions 1 and 2), using lm-evaluation-harness, and on reasoning tasks using lighteval. vLLM was used for all evaluations.
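A minimal sketch of offline inference with vLLM for this model; the prompt and sampling settings are illustrative.

```python
# Minimal sketch of offline inference with vLLM for the FP8-dynamic model.
from vllm import LLM, SamplingParams

llm = LLM(model="RedHatAI/Qwen3-32B-FP8-dynamic")
sampling = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=256)

messages = [{"role": "user", "content": "Explain the difference between static and dynamic activation quantization."}]
outputs = llm.chat(messages, sampling)
print(outputs[0].outputs[0].text)
```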
Llama-4-Scout-17B-16E-Instruct-FP8-dynamic
Meta-Llama-3.1-8B-Instruct-quantized.w8a8
Model Overview
- Model Architecture: Meta-Llama-3
- Input: Text
- Output: Text
- Model Optimizations:
  - Activation quantization: INT8
  - Weight quantization: INT8
- Intended Use Cases: Intended for commercial and research use in multiple languages. Similarly to Meta-Llama-3.1-8B-Instruct, this model is intended for assistant-like chat.
- Out-of-scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws).
- Release Date: 7/11/2024
- Version: 1.0
- Validated on: RHOAI 2.20, RHAIIS 3.0, RHELAI 1.5
- License(s): Llama3.1
- Model Developers: Neural Magic

This model is a quantized version of Meta-Llama-3.1-8B-Instruct. It was evaluated on several tasks to assess its quality in comparison to the unquantized model, including multiple-choice, math reasoning, and open-ended text generation. Meta-Llama-3.1-8B-Instruct-quantized.w8a8 achieves 105.4% recovery for the Arena-Hard evaluation, 100.3% for OpenLLM v1 (using Meta's prompting when available), 101.5% for OpenLLM v2, 99.7% for HumanEval pass@1, and 98.8% for HumanEval+ pass@1.

This model was obtained by quantizing the weights and activations of Meta-Llama-3.1-8B-Instruct to INT8 data type. This optimization reduces the number of bits used to represent weights and activations from 16 to 8, reducing GPU memory requirements (by approximately 50%) and increasing matrix-multiply compute throughput (by approximately 2x). Weight quantization also reduces disk size requirements by approximately 50%.

Only weights and activations of the linear operators within transformers blocks are quantized. Weights are quantized with a symmetric static per-channel scheme, where a fixed linear scaling factor is applied between INT8 and floating point representations for each output channel dimension. Activations are quantized with a symmetric dynamic per-token scheme, computing a linear scaling factor at runtime for each token between INT8 and floating point representations. The GPTQ algorithm is applied for quantization, as implemented in the llm-compressor library. GPTQ used a 1% damping factor and 256 sequences of 8,192 random tokens.

This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details. See the Red Hat AI Inference Server, Red Hat Enterprise Linux AI, and Red Hat Openshift AI documentation for deployment on those platforms.

This model was created by using the llm-compressor library as presented in the code snippet below.

This model was evaluated on the well-known Arena-Hard, OpenLLM v1, OpenLLM v2, HumanEval, and HumanEval+ benchmarks. In all cases, model outputs were generated with the vLLM engine. Arena-Hard evaluations were conducted using the Arena-Hard-Auto repository. The model generated a single answer for each prompt from Arena-Hard, and each answer was judged twice by GPT-4. We report below the scores obtained in each judgement and the average. OpenLLM v1 and v2 evaluations were conducted using Neural Magic's fork of lm-evaluation-harness (branch llama_3.1_instruct). This version of the lm-evaluation-harness includes versions of MMLU, ARC-Challenge and GSM-8K that match the prompting style of Meta-Llama-3.1-Instruct-evals and a few fixes to OpenLLM v2 tasks. HumanEval and HumanEval+ evaluations were conducted using Neural Magic's fork of the EvalPlus repository.
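A minimal sketch of the INT8 W8A8 GPTQ flow described above, using llm-compressor; the calibration dataset, import paths, and arguments are assumptions and may differ from the original recipe.

```python
# Sketch of INT8 weight-and-activation (W8A8) quantization with GPTQ via llm-compressor.
# Calibration data and hyperparameters are illustrative, not the exact original recipe.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Small calibration set rendered through the chat template (the card describes
# 256 sequences of 8,192 random tokens; a text dataset is used here for illustration).
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft[:256]")
ds = ds.map(lambda ex: {"text": tokenizer.apply_chat_template(ex["messages"], tokenize=False)})

# GPTQ with a 1% damping factor, quantizing linear layers except the output head.
recipe = GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"], dampening_frac=0.01)

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=8192,
    num_calibration_samples=256,
)

model.save_pretrained("Meta-Llama-3.1-8B-Instruct-quantized.w8a8", save_compressed=True)
tokenizer.save_pretrained("Meta-Llama-3.1-8B-Instruct-quantized.w8a8")
```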
Detailed model outputs are available as HuggingFace datasets for Arena-Hard, OpenLLM v2, and HumanEval. Note: results have been updated after Meta modified the chat template.

Meta-Llama-3.1-8B-Instruct-quantized.w8a8 (this model)

The results were obtained using the following commands:
gpt-oss-20b
Qwen3-VL-235B-A22B-Instruct-NVFP4
Model Overview
- Model Architecture: Qwen/Qwen3-VL-235B-A22B-Instruct
- Input: Text
- Output: Text
- Model Optimizations:
  - Weight quantization: FP4
  - Activation quantization: FP4
- Out-of-scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages other than English.
- Release Date: 10/29/2025
- Version: 1.0
- Model Developers: RedHatAI

This model is a quantized version of Qwen/Qwen3-VL-235B-A22B-Instruct. It was evaluated on several tasks to assess its quality in comparison to the unquantized model.

This model was obtained by quantizing the weights and activations of Qwen/Qwen3-VL-235B-A22B-Instruct to FP4 data type, ready for inference with vLLM >= 0.9.1. This optimization reduces the number of bits per parameter from 16 to 4, reducing the disk size and GPU memory requirements by approximately 75%. Only the weights and activations of the linear operators within transformers blocks are quantized, using LLM Compressor.

This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details.

This model was created by applying LLM Compressor with calibration samples from UltraChat, as presented in the code snippet below.

This model was evaluated on the well-known OpenLLM v1, OpenLLM v2 and HumanEval64 benchmarks using lm-evaluation-harness. The reasoning evals were done using lighteval.

Category Metric Qwen/Qwen3-VL-235B-A22B-Instruct RedHatAI/Qwen3-VL-235B-A22B-Instruct-NVFP4 (this model) Recovery

The results were obtained using the following commands:
Llama-3.3-70B-Instruct-quantized.w8a8
Model Overview
- Model Architecture: Llama
- Input: Text
- Output: Text
- Model Optimizations:
  - Activation quantization: INT8
  - Weight quantization: INT8
- Intended Use Cases: Intended for commercial and research use in multiple languages. Similarly to Llama-3.3-70B-Instruct, this model is intended for assistant-like chat.
- Out-of-scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws).
- Release Date: 01/20/2025
- Version: 1.0
- Validated on: RHOAI 2.20, RHAIIS 3.0, RHELAI 1.5
- Model Developers: Neural Magic

Quantized version of Llama-3.3-70B-Instruct. It was evaluated on several tasks to assess its quality in comparison to the unquantized model, including multiple-choice, math reasoning, and open-ended text generation. Llama-3.3-70B-Instruct-quantized.w8a8 achieves 99.4% recovery for OpenLLM v1 (using Meta's prompting when available) and 100% for both HumanEval and HumanEval+ pass@1.

This model was obtained by quantizing the weights and activations of Llama-3.3-70B-Instruct to INT8 data type. This optimization reduces the number of bits used to represent weights and activations from 16 to 8, reducing GPU memory requirements (by approximately 50%) and increasing matrix-multiply compute throughput (by approximately 2x). Weight quantization also reduces disk size requirements by approximately 50%.

Only weights and activations of the linear operators within transformers blocks are quantized. Weights are quantized with a symmetric static per-channel scheme, where a fixed linear scaling factor is applied between INT8 and floating point representations for each output channel dimension. Activations are quantized with a symmetric dynamic per-token scheme, computing a linear scaling factor at runtime for each token between INT8 and floating point representations.

This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details. See the Red Hat AI Inference Server, Red Hat Enterprise Linux AI, and Red Hat Openshift AI documentation for deployment on those platforms.

This model was created by using the llm-compressor library as presented in the code snippet below.

This model was evaluated on the well-known OpenLLM v1, OpenLLM v2, HumanEval, and HumanEval+ benchmarks. In all cases, model outputs were generated with the vLLM engine. OpenLLM v1 and v2 evaluations were conducted using Neural Magic's fork of lm-evaluation-harness (branch llama_3.1_instruct). This version of the lm-evaluation-harness includes versions of MMLU, ARC-Challenge and GSM-8K that match the prompting style of Meta-Llama-3.1-Instruct-evals and a few fixes to OpenLLM v2 tasks. HumanEval and HumanEval+ evaluations were conducted using Neural Magic's fork of the EvalPlus repository. The results were obtained using the following commands:
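A minimal sketch of loading this 70B model with vLLM across multiple GPUs, as referenced earlier in the card; the tensor-parallel degree, prompt, and sampling settings are illustrative.

```python
# Sketch: offline inference for the 70B W8A8 model with tensor parallelism in vLLM.
from vllm import LLM, SamplingParams

llm = LLM(model="RedHatAI/Llama-3.3-70B-Instruct-quantized.w8a8", tensor_parallel_size=4)
sampling = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256)

conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize the trade-offs of INT8 weight and activation quantization."},
]
outputs = llm.chat(conversation, sampling)
print(outputs[0].outputs[0].text)
```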
Qwen3-8B-FP8-dynamic
gemma-3-27b-it-quantized.w4a16
Model Overview
- Model Architecture: google/gemma-3-27b-it
- Input: Vision-Text
- Output: Text
- Model Optimizations:
  - Weight quantization: INT4
  - Activation quantization: FP16
- Release Date: 6/4/2025
- Version: 1.0
- Model Developers: RedHatAI

This model was obtained by quantizing the weights of google/gemma-3-27b-it to INT4 data type, ready for inference with vLLM >= 0.8.0.

This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details.

This model was created with llm-compressor by running the code snippet below.

The model was evaluated using lm-evaluation-harness for the OpenLLM v1 text benchmark. The evaluations were conducted using the following commands:

Category Metric google/gemma-3-27b-it RedHatAI/gemma-3-27b-it-quantized.w4a16 Recovery (%)
Meta-Llama-3-8B-Instruct-FP8-KV
Qwen2-7B-Instruct-FP8
Model Overview
- Model Architecture: Qwen2
- Input: Text
- Output: Text
- Model Optimizations:
  - Weight quantization: FP8
  - Activation quantization: FP8
- Intended Use Cases: Intended for commercial and research use in English. Similarly to Meta-Llama-3-8B-Instruct, this model is intended for assistant-like chat.
- Out-of-scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages other than English.
- Release Date: 6/14/2024
- Version: 1.0
- License(s): apache-2.0
- Model Developers: Neural Magic

Quantized version of Qwen2-7B-Instruct. It achieves an average score of 69.44 on the OpenLLM benchmark (version 1), whereas the unquantized model achieves 69.55.

This model was obtained by quantizing the weights and activations of Qwen2-7B-Instruct to FP8 data type, ready for inference with vLLM >= 0.5.0. This optimization reduces the number of bits per parameter from 16 to 8, reducing the disk size and GPU memory requirements by approximately 50%. Only the weights and activations of the linear operators within transformers blocks are quantized. Symmetric per-tensor quantization is applied, in which a single linear scaling maps the FP8 representations of the quantized weights and activations. AutoFP8 is used for quantization with 512 sequences of UltraChat.

This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details.

This model was created by applying AutoFP8 with calibration samples from UltraChat, as presented in the code snippet below. Although AutoFP8 was used for this particular model, Neural Magic is transitioning to llm-compressor, which supports several quantization schemes and models not supported by AutoFP8.

The model was evaluated on the OpenLLM leaderboard tasks (version 1) with the lm-evaluation-harness (commit 383bbd54bc621086e05aa1b030d8d4d5635b25e6) and the vLLM engine, using the following command:
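A sketch of an OpenLLM v1-style run through the lm-evaluation-harness Python API with the vLLM backend, not the exact original command; the task list, repository id, and model arguments are illustrative.

```python
# Sketch of an OpenLLM-v1-style evaluation with lm-evaluation-harness on the vLLM backend.
# Task names and arguments are illustrative; the original command may differ.
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args="pretrained=neuralmagic/Qwen2-7B-Instruct-FP8,dtype=auto,gpu_memory_utilization=0.9",
    tasks=["arc_challenge", "gsm8k", "hellaswag", "mmlu", "truthfulqa_mc2", "winogrande"],
)
print(results["results"])
```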
DeepSeek-Coder-V2-Lite-Instruct-FP8
Qwen3-8B-speculator.eagle3
Model Overview
- Verifier: Qwen/Qwen3-8B
- Speculative Decoding Algorithm: EAGLE-3
- Model Architecture: Eagle3Speculator
- Release Date: 07/27/2025
- Version: 1.0
- Model Developers: RedHat

This is a speculator model designed for use with Qwen/Qwen3-8B, based on the EAGLE-3 speculative decoding algorithm. It was trained using the speculators library on a combination of the Aeala/ShareGPT_Vicuna_unfiltered and HuggingFaceH4/ultrachat_200k datasets. The model was trained with thinking disabled. This model should be used with the Qwen/Qwen3-8B chat template, specifically through the `/chat/completions` endpoint.

Text Summarization: 1.62 / 1.96 / 2.13 / 2.24 / 2.25 / 2.29 / 2.30

Benchmarking settings:
- temperature: 0.6
- top_p: 0.95
- top_k: 20
- repetitions: 3
- time per experiment: 10 min
- hardware: 1xA100
- vLLM version: 0.11.0
- GuideLLM version: 0.3.0

Command
```bash
GUIDELLM__PREFERRED_ROUTE="chat_completions" \
guidellm benchmark \
  --target "http://localhost:8000/v1" \
  --data "RedHatAI/SpeculativeDecoding" \
  --rate-type sweep \
  --max-seconds 600 \
  --output-path "Qwen3-8B-HumanEval.json" \
  --backend-args '{"extra_body": {"chat_completions": {"temperature": 0.6, "top_p": 0.95, "top_k": 20}}}'
```
Llama-3.2-11B-Vision-Instruct-FP8-dynamic
DeepSeek-R1-Distill-Llama-8B-quantized.w8a8
Llama-3.2-1B-Instruct-quantized.w8a8
Model Overview
- Model Architecture: Llama-3
- Input: Text
- Output: Text
- Model Optimizations:
  - Activation quantization: INT8
  - Weight quantization: INT8
- Intended Use Cases: Intended for commercial and research use in multiple languages. Similarly to Llama-3.2-1B-Instruct, this model is intended for assistant-like chat.
- Out-of-scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws).
- Release Date: 9/25/2024
- Version: 1.0
- License(s): Llama3.2
- Model Developers: Neural Magic

Quantized version of Llama-3.2-1B-Instruct. It achieves scores within 5% of the scores of the unquantized model for MMLU, ARC-Challenge, GSM-8k, Hellaswag, Winogrande and TruthfulQA.

This model was obtained by quantizing the weights and activations of Llama-3.2-1B-Instruct to INT8 data type. This optimization reduces the number of bits used to represent weights and activations from 16 to 8, reducing GPU memory requirements (by approximately 50%) and increasing matrix-multiply compute throughput (by approximately 2x). Weight quantization also reduces disk size requirements by approximately 50%.

Only weights and activations of the linear operators within transformers blocks are quantized. Weights are quantized with a symmetric static per-channel scheme, where a fixed linear scaling factor is applied between INT8 and floating point representations for each output channel dimension. Activations are quantized with a symmetric dynamic per-token scheme, computing a linear scaling factor at runtime for each token between INT8 and floating point representations. The SmoothQuant algorithm is used to alleviate outliers in the activations, whereas the GPTQ algorithm is applied for quantization. Both algorithms are implemented in the llm-compressor library. GPTQ used a 1% damping factor and 512 sequences taken from Neural Magic's LLM compression calibration dataset.

This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details.

This model was created by using the llm-compressor library as presented in the code snippet below.

The model was evaluated on MMLU, ARC-Challenge, GSM-8K, Hellaswag, Winogrande and TruthfulQA. Evaluation was conducted using the Neural Magic fork of lm-evaluation-harness (branch llama_3.1_instruct) and the vLLM engine. This version of the lm-evaluation-harness includes versions of MMLU, ARC-Challenge and GSM-8K that match the prompting style of Meta-Llama-3.1-Instruct-evals. The results were obtained using the following commands:
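A minimal sketch of the SmoothQuant + GPTQ W8A8 flow described above, using llm-compressor; the calibration dataset, smoothing strength, and import paths are assumptions rather than the original recipe.

```python
# Sketch of the described flow: SmoothQuant to reduce activation outliers, then GPTQ
# INT8 weight/activation quantization. Hyperparameters and data are illustrative.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.modifiers.quantization import GPTQModifier

model_id = "meta-llama/Llama-3.2-1B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# 512 calibration sequences, mirroring the count mentioned in the card
# (the original used Neural Magic's LLM compression calibration dataset).
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft[:512]")
ds = ds.map(lambda ex: {"text": tokenizer.apply_chat_template(ex["messages"], tokenize=False)})

recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),
    GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"], dampening_frac=0.01),
]

oneshot(model=model, dataset=ds, recipe=recipe, max_seq_length=2048, num_calibration_samples=512)

model.save_pretrained("Llama-3.2-1B-Instruct-quantized.w8a8", save_compressed=True)
tokenizer.save_pretrained("Llama-3.2-1B-Instruct-quantized.w8a8")
```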
Mistral-Small-3.1-24B-Instruct-2503-FP8-dynamic
Qwen3-32B-NVFP4A16
Qwen3-Coder-Next-NVFP4
Meta-Llama-3.1-8B-FP8
Llama-3.1-8B-Instruct-speculator.eagle3
Model Overview
- Verifier: meta-llama/Llama-3.1-8B-Instruct
- Speculative Decoding Algorithm: EAGLE-3
- Model Architecture: Eagle3Speculator
- Release Date: 07/27/2025
- Version: 1.0
- Model Developers: RedHat

This is a speculator model designed for use with meta-llama/Llama-3.1-8B-Instruct, based on the EAGLE-3 speculative decoding algorithm. It was trained using the speculators library on a combination of the Aeala/ShareGPT_Vicuna_unfiltered and HuggingFaceH4/ultrachat_200k datasets. This model should be used with the meta-llama/Llama-3.1-8B-Instruct chat template, specifically through the `/chat/completions` endpoint.

Text Summarization: 1.70 / 2.19 / 2.50 / 2.78 / 2.77 / 2.98 / 2.99

Benchmarking settings:
- temperature: 0.6
- top_p: 0.9
- repetitions: 5
- time per experiment: 3 min
- hardware: 1xA100
- vLLM version: 0.11.0
- GuideLLM version: 0.3.0

Command
```bash
GUIDELLM__PREFERRED_ROUTE="chat_completions" \
guidellm benchmark \
  --target "http://localhost:8000/v1" \
  --data "RedHatAI/SpeculativeDecoding" \
  --rate-type sweep \
  --max-seconds 180 \
  --output-path "Llama-3.1-8B-Instruct-HumanEval.json" \
  --backend-args '{"extra_body": {"chat_completions": {"temperature": 0.0}}}'
```
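A hypothetical sketch of pairing this speculator with its verifier in vLLM; the `speculative_config` keys vary between vLLM versions, so consult the vLLM speculative decoding documentation for the exact format.

```python
# Hypothetical sketch: pair the verifier with this EAGLE-3 speculator in vLLM.
# The speculative_config keys below follow recent vLLM conventions and may differ by version.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # verifier
    speculative_config={
        "model": "RedHatAI/Llama-3.1-8B-Instruct-speculator.eagle3",
        "num_speculative_tokens": 3,
        "method": "eagle3",
    },
)
outputs = llm.chat(
    [{"role": "user", "content": "Write two sentences about speculative decoding."}],
    SamplingParams(temperature=0.6, top_p=0.9, max_tokens=128),
)
print(outputs[0].outputs[0].text)
```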
Meta-Llama-3-70B-Instruct-FP8
Llama-3.2-1B-quantized.w8a8
Qwen2.5-VL-3B-Instruct-quantized.w8a8
Llama-3.2-90B-Vision-Instruct-FP8-dynamic
Voxtral-Small-24B-2507-FP8-dynamic
Model Overview
- Model Architecture: VoxtralForConditionalGeneration
- Input: Audio-Text
- Output: Text
- Model Optimizations:
  - Weight quantization: FP8
  - Activation quantization: FP8
- Intended Use Cases: Voxtral builds upon Ministral-3B with powerful audio understanding capabilities.
  - Dedicated transcription mode: Voxtral can operate in a pure speech transcription mode to maximize performance. By default, Voxtral automatically predicts the source audio language and transcribes the text accordingly.
  - Long-form context: With a 32k token context length, Voxtral handles audios up to 30 minutes for transcription, or 40 minutes for understanding.
  - Built-in Q&A and summarization: Supports asking questions directly through audio, analyzing audio, and generating structured summaries without the need for separate ASR and language models.
  - Natively multilingual: Automatic language detection and state-of-the-art performance in the world's most widely used languages (English, Spanish, French, Portuguese, Hindi, German, Dutch, Italian).
  - Function-calling straight from voice: Enables direct triggering of backend functions, workflows, or API calls based on spoken user intents.
  - Highly capable at text: Retains the text understanding capabilities of its language model backbone, Ministral-3B.
- Release Date: 08/21/2025
- Version: 1.0
- Model Developers: Red Hat

This model was obtained by quantizing activations and weights of Voxtral-Small-24B-2507 to FP8 data type. This optimization reduces the number of bits used to represent weights and activations from 16 to 8, reducing GPU memory requirements (by approximately 50%) and increasing matrix-multiply compute throughput (by approximately 2x). Weight quantization also reduces disk size requirements by approximately 50%. Only weights and activations of the MLP operators within transformers blocks of the language model are quantized. Weights are quantized with a symmetric static per-channel scheme, whereas activations are quantized with a symmetric dynamic per-token scheme. The llm-compressor library is used for quantization.

Deploy the model on a vLLM server, then send requests to the server according to the use case. See the following examples.

This model was quantized using the llm-compressor library as shown below. After quantization, the model can be converted back into the mistralai format using the `convert_voxtral_hf_to_mistral.py` script included with the model.

The model was evaluated on the Fleurs transcription task. Recovery is computed with respect to the complement of the word error rate (WER).

Benchmark Language Voxtral-Small-24B-2507 Voxtral-Small-24B-2507-FP8-dynamic (this model) Recovery
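A minimal sketch of a text-only request against a vLLM server hosting this model; the server flags, repository id, and prompt are illustrative, and audio inputs follow vLLM's multimodal chat API (see the vLLM documentation for the audio payload format).

```python
# Sketch: text-only request to a vLLM server hosting the quantized Voxtral model.
# Assumes the server was started separately, e.g.:
#   vllm serve RedHatAI/Voxtral-Small-24B-2507-FP8-dynamic --tokenizer-mode mistral
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="RedHatAI/Voxtral-Small-24B-2507-FP8-dynamic",
    messages=[{"role": "user", "content": "In one paragraph, what is a dedicated transcription mode useful for?"}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```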
gemma-3-4b-it-quantized.w4a16
Qwen3-30B-A3B-FP8-dynamic
Qwen2-1.5B-Instruct-FP8
Qwen2.5-VL-3B-Instruct-FP8-dynamic
Mistral-Small-3.2-24B-Instruct-2506-NVFP4
Model Overview
- Model Architecture: unsloth/Mistral-Small-3.2-24B-Instruct-2506
- Input: Text
- Output: Text
- Model Optimizations:
  - Weight quantization: FP4
  - Activation quantization: FP4
- Out-of-scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages other than English.
- Release Date: 10/29/2025
- Version: 1.0
- Model Developers: RedHatAI

This model is a quantized version of unsloth/Mistral-Small-3.2-24B-Instruct-2506. It was evaluated on several tasks to assess its quality in comparison to the unquantized model.

This model was obtained by quantizing the weights and activations of unsloth/Mistral-Small-3.2-24B-Instruct-2506 to FP4 data type, ready for inference with vLLM >= 0.9.1. This optimization reduces the number of bits per parameter from 16 to 4, reducing the disk size and GPU memory requirements by approximately 75%. Only the weights and activations of the linear operators within transformers blocks are quantized, using LLM Compressor.

This model was created by applying LLM Compressor with calibration samples from UltraChat, as presented in the code snippet below.

This model was evaluated on the well-known OpenLLM v1, OpenLLM v2 and HumanEval64 benchmarks using lm-evaluation-harness.

Category Metric unsloth/Mistral-Small-3.2-24B-Instruct-2506 RedHatAI/Mistral-Small-3.2-24B-Instruct-2506-NVFP4 Recovery

The results were obtained using the following commands:
gemma-3-12b-it-FP8-dynamic
Meta-Llama-3.1-70B-Instruct-FP8
gemma-3-1b-it-FP8-dynamic
Qwen3-30B-A3B-quantized.w4a16
Llama-3.2-3B-Instruct-FP8
Mistral-Nemo-Instruct-2407-FP8
Meta-Llama-3.1-70B-Instruct-quantized.w8a8
DeepSeek-R1-Distill-Llama-70B-quantized.w8a8
Meta-Llama-3.1-8B-Instruct-quantized.w8a16
Meta-Llama-3-8B-Instruct-quantized.w8a8
Qwen2.5-Coder-14B-Instruct-FP8-dynamic
DeepSeek-R1-Distill-Qwen-7B-quantized.w8a8
Qwen3-8B-quantized.w4a16
phi-4-quantized.w8a8
DeepSeek-R1-Distill-Qwen-32B-quantized.w8a8
DeepSeek-R1-Distill-Qwen-1.5B-quantized.w8a8
Meta-Llama-3.1-405B-Instruct-FP8-dynamic
Llama-4-Maverick-17B-128E-Instruct-NVFP4
DeepSeek-R1-Distill-Qwen-14B-quantized.w8a8
Llama-3.2-3B-quantized.w8a8
Meta-Llama-3-8B-Instruct-FP8
Llama-4-Scout-17B-16E-Instruct-quantized.w4a16
gemma-3-4b-it-FP8-dynamic
Magistral-Small-2506-FP8
gemma-3-12b-it-quantized.w8a8
Qwen2.5-VL-72B-Instruct-FP8-dynamic
Qwen2.5-VL-7B-Instruct-quantized.w8a8
NVIDIA-Nemotron-Nano-9B-v2-FP8-dynamic
Model Overview - Model Architecture: NemotronHForCausalLM - Input: Text - Output: Text - Model Optimizations: - Weight quantization: FP8 - Activation quantization: FP8 - Release Date: 9/30/2025 - V...
Qwen3-0.6B-FP8-BLOCK
gemma-3-12b-it-quantized.w4a16
Qwen3-30B-A3B-NVFP4
Model Overview
- Model Architecture: Qwen/Qwen3-30B-A3B
- Input: Text
- Output: Text
- Model Optimizations:
  - Weight quantization: FP4
  - Activation quantization: FP4
- Out-of-scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages other than English.
- Release Date: 10/29/2025
- Version: 1.0
- Model Developers: RedHatAI

This model is a quantized version of Qwen/Qwen3-30B-A3B. It was evaluated on several tasks to assess its quality in comparison to the unquantized model.

This model was obtained by quantizing the weights and activations of Qwen/Qwen3-30B-A3B to FP4 data type, ready for inference with vLLM >= 0.9.1. This optimization reduces the number of bits per parameter from 16 to 4, reducing the disk size and GPU memory requirements by approximately 75%. Only the weights and activations of the linear operators within transformers blocks are quantized, using LLM Compressor.

This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details.

This model was created by applying LLM Compressor with calibration samples from UltraChat, as presented in the code snippet below.

This model was evaluated on the well-known OpenLLM v1, OpenLLM v2 and HumanEval64 benchmarks using lm-evaluation-harness. The reasoning evals were done using lighteval.

Category Metric Qwen3-30B-A3B Qwen3-30B-A3B-NVFP4 (this model) Recovery

The results were obtained using the following commands:
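A minimal sketch of NVFP4 quantization with LLM Compressor using UltraChat calibration samples, as described above; the scheme name, import paths, and calibration settings are assumptions based on current llm-compressor conventions, not the exact original recipe.

```python
# Sketch of NVFP4 (FP4) quantization with llm-compressor and UltraChat calibration data.
# Scheme name and settings are assumptions; the original recipe may differ.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

model_id = "Qwen/Qwen3-30B-A3B"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft[:512]")
ds = ds.map(lambda ex: {"text": tokenizer.apply_chat_template(ex["messages"], tokenize=False)})

recipe = QuantizationModifier(targets="Linear", scheme="NVFP4", ignore=["lm_head"])

oneshot(model=model, dataset=ds, recipe=recipe, max_seq_length=2048, num_calibration_samples=512)

model.save_pretrained("Qwen3-30B-A3B-NVFP4", save_compressed=True)
tokenizer.save_pretrained("Qwen3-30B-A3B-NVFP4")
```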
Meta-Llama-3.1-70B-Instruct-FP8-dynamic
gpt-oss-120b-FP8-dynamic
Model Overview - Model Architecture: gpt-oss-120b-BF16 - Input: Text - Output: Text - Model Optimizations: - Weight quantization: FP8 - Activation quantization: FP8 - Release Date: 08/13/2025 - Version: 1.0 - Model Developers: RedHatAI
DeepSeek-R1-Distill-Qwen-14B-quantized.w4a16
Pixtral-Large-Instruct-2411-hf-quantized.w4a16
Llama-4-Scout-17B-16E-Instruct
Qwen3-4B-quantized.w4a16
gemma-3-27b-it-quantized.w8a8
gpt-oss-120b-speculator.eagle3
whisper-large-v3-quantized.w4a16
Model Overview - Model Architecture: whisper-large-v3 - Input: Audio-Text - Output: Text - Model Optimizations: - Weight quantization: INT4 - Release Date: 04/16/2025 - Version: 1.0 - Model Developers: Neural Magic This model was obtained by quantizing the weights of openai/whisper-large-v3 to INT4 data type, ready for inference with vLLM >= 0.5.2. This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details. This model was created with llm-compressor by running the code snippet below. The model was evaluated on LibriSpeech and Fleurs datasets using lmms-eval, via the following commands:
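A sketch of querying a vLLM server through the OpenAI-compatible transcription endpoint, assuming the vLLM build exposes `/v1/audio/transcriptions` for Whisper models; the audio file and repository id are placeholders.

```python
# Sketch: transcription request against a vLLM OpenAI-compatible server, assumed to be
# started with something like: vllm serve RedHatAI/whisper-large-v3-quantized.w4a16
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

with open("sample.wav", "rb") as audio_file:  # placeholder audio file
    transcription = client.audio.transcriptions.create(
        model="RedHatAI/whisper-large-v3-quantized.w4a16",
        file=audio_file,
    )
print(transcription.text)
```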
Qwen3-0.6B-FP8-dynamic
Llama-3.1-8B-Instruct
Qwen3-14B-FP8-dynamic
Mistral-7B-Instruct-v0.3-FP8
gemma-3-1b-it-quantized.w8a8
Model Overview
- Model Architecture: google/gemma-3-1b-it
- Input: Vision-Text
- Output: Text
- Model Optimizations:
  - Weight quantization: INT8
  - Activation quantization: INT8
- Release Date: 6/4/2025
- Version: 1.0
- Model Developers: RedHatAI

This model was obtained by quantizing the weights of google/gemma-3-1b-it to INT8 data type, ready for inference with vLLM >= 0.8.0.

This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details.

This model was created with llm-compressor by running the code snippet below.

The model was evaluated using lm-evaluation-harness for the OpenLLM v1 text benchmark. The evaluations were conducted using the following commands:

Category Metric google/gemma-3-1b-it RedHatAI/gemma-3-1b-it-quantized.w8a8 Recovery (%)
Qwen2.5-VL-7B-Instruct-quantized.w4a16
Qwen2.5-VL-3B-Instruct-quantized.w4a16
Llama-2-7b-ultrachat200k
Mistral-Small-24B-Instruct-2501-FP8-dynamic
gemma-3-1b-it-quantized.w4a16
Qwen3.5-35B-A3B-FP8-dynamic
phi-4-quantized.w4a16
gpt-oss-120b
NVIDIA-Nemotron-3-Nano-30B-A3B-FP8
DeepSeek-R1-Distill-Qwen-32B-FP8-dynamic
Qwen3-30B-A3B-Instruct-2507-quantized.w4a16
Qwen3-30B-A3B-Instruct-2507.w4a16
DeepSeek-R1-Distill-Llama-8B-quantized.w4a16
Llama-3.1-Nemotron-70B-Instruct-HF-FP8-dynamic
Llama-2-7b-gsm8k
Qwen3-235B-A22B-FP8-dynamic
Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8
Qwen3-235B-A22B-Instruct-2507-NVFP4
Model Overview
- Model Architecture: Qwen/Qwen3-235B-A22B-Instruct-2507
- Input: Text
- Output: Text
- Model Optimizations:
  - Weight quantization: FP4
  - Activation quantization: FP4
- Out-of-scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages other than English.
- Release Date: 10/29/2025
- Version: 1.0
- Model Developers: RedHatAI

This model is a quantized version of Qwen/Qwen3-235B-A22B-Instruct-2507. It was evaluated on several tasks to assess its quality in comparison to the unquantized model.

This model was obtained by quantizing the weights and activations of Qwen/Qwen3-235B-A22B-Instruct-2507 to FP4 data type, ready for inference with vLLM >= 0.9.1. This optimization reduces the number of bits per parameter from 16 to 4, reducing the disk size and GPU memory requirements by approximately 75%. Only the weights and activations of the linear operators within transformers blocks are quantized, using LLM Compressor.

This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details.

This model was created by applying LLM Compressor with calibration samples from UltraChat, as presented in the code snippet below.

This model was evaluated on the well-known OpenLLM v1, OpenLLM v2 and HumanEval64 benchmarks using lm-evaluation-harness. The reasoning evals were done using lighteval.

Category Metric Qwen/Qwen3-235B-A22B-Instruct-2507 RedHatAI/Qwen3-235B-A22B-Instruct-2507-NVFP4 (this model) Recovery

The results were obtained using the following commands:
llama2.c-stories110M-pruned50
DeepSeek-R1-Distill-Qwen-32B-quantized.w4a16
granite-4.0-h-small-FP8-dynamic
Meta-Llama-3.1-70B-FP8
Llama-3.3-70B-Instruct-quantized.w4a16
Qwen3-32B-NVFP4
gemma-3n-E4B-it-FP8-dynamic
Llama-4-Maverick-17B-128E-Instruct-quantized.w4a16
Meta-Llama-3.1-405B-Instruct-FP8
Qwen3-32B-quantized.w4a16
Qwen3-32B-speculator.eagle3
Model Overview
- Verifier: Qwen/Qwen3-32B
- Speculative Decoding Algorithm: EAGLE-3
- Model Architecture: Eagle3Speculator
- Release Date: 09/17/2025
- Version: 1.0
- Model Developers: RedHat

This is a speculator model designed for use with Qwen/Qwen3-32B, based on the EAGLE-3 speculative decoding algorithm. It was trained using the speculators library on a combination of the Aeala/ShareGPT_Vicuna_unfiltered dataset and the `train_sft` split of the HuggingFaceH4/ultrachat_200k dataset. This model should be used with the Qwen/Qwen3-32B chat template, specifically through the `/chat/completions` endpoint.

Text Summarization: 1.62 / 1.95 / 2.15 / 2.23 / 2.27 / 2.32 / 2.33

Benchmarking settings:
- temperature: 0.6
- top_p: 0.95
- top_k: 20
- repetitions: 3
- time per experiment: 10 min
- hardware: 2xA100
- vLLM version: 0.11.0
- GuideLLM version: 0.3.0

Command
```bash
GUIDELLM__PREFERRED_ROUTE="chat_completions" \
guidellm benchmark \
  --target "http://localhost:8000/v1" \
  --data "RedHatAI/speculator_benchmarks" \
  --data-args '{"data_files": "HumanEval.jsonl"}' \
  --rate-type sweep \
  --max-seconds 600 \
  --output-path "Qwen3-32B-HumanEval.json" \
  --backend-args '{"extra_body": {"chat_completions": {"temperature": 0.6, "top_p": 0.95, "top_k": 20}}}'
```
Llama-3.3-70B-Instruct
Qwen3-Next-80B-A3B-Instruct-quantized.w4a16
DeepSeek-R1-Distill-Llama-70B-FP8-dynamic
granite-3.1-8b-instruct-quantized.w4a16
Devstral-Small-2507-quantized.w8a8
Model Overview
- Model Architecture: MistralForCausalLM
- Input: Text
- Output: Text
- Model Optimizations:
  - Activation quantization: INT8
  - Weight quantization: INT8
- Release Date: 08/29/2025
- Version: 1.0
- Model Developers: Red Hat (Neural Magic)

This model was obtained by quantizing weights and activations of Devstral-Small-2507 to INT8 data type. This optimization reduces the number of bits used to represent weights and activations from 16 to 8, reducing GPU memory requirements (by approximately 50%). Weight quantization also reduces disk size requirements by approximately 50%.

This model was created with llm-compressor by running the code snippet below.

This model can be deployed efficiently using the vLLM backend, as shown in the example below.

The model was evaluated on popular coding tasks (HumanEval, HumanEval+, MBPP, MBPP+) via EvalPlus and the vLLM backend (v0.10.1.1). For evaluations, we run greedy sampling and report pass@1. The command to reproduce the evals:

| | Recovery (%) | mistralai/Devstral-Small-2507 | RedHatAI/Devstral-Small-2507-quantized.w8a8 (this model) |
| --------------------------- | :----------: | :------------------: | :--------------------------------------------------: |
| HumanEval | 100.67 | 89.0 | 89.6 |
| HumanEval+ | 101.48 | 81.1 | 82.3 |
| MBPP | 98.71 | 77.5 | 76.5 |
| MBPP+ | 102.42 | 66.1 | 67.7 |
| Average Score | 100.77 | 78.43 | 79.03 |
Qwen3-0.6B-quantized.w4a16
Qwen3-14B-quantized.w4a16
Model Overview
- Model Architecture: Qwen3ForCausalLM
- Input: Text
- Output: Text
- Model Optimizations:
  - Weight quantization: INT4
- Intended Use Cases:
  - Reasoning.
  - Function calling.
  - Subject matter experts via fine-tuning.
  - Multilingual instruction following.
  - Translation.
- Out-of-scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws).
- Release Date: 05/05/2025
- Version: 1.0
- Model Developers: RedHat (Neural Magic)

This model was obtained by quantizing the weights of Qwen3-14B to INT4 data type. This optimization reduces the number of bits per parameter from 16 to 4, reducing the disk size and GPU memory requirements by approximately 75%. Only the weights of the linear operators within transformers blocks are quantized. Weights are quantized using an asymmetric per-group scheme, with group size 64. The GPTQ algorithm is applied for quantization, as implemented in the llm-compressor library.

This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details.

Creation details: This model was created with llm-compressor by running the code snippet below.

The model was evaluated on the OpenLLM leaderboard tasks (versions 1 and 2), using lm-evaluation-harness, and on reasoning tasks using lighteval. vLLM was used for all evaluations.
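A minimal sketch of a weight-only INT4 (W4A16) GPTQ flow with llm-compressor; the calibration data and arguments are illustrative, and the group size of 64 described above would require a custom quantization scheme that is omitted here for brevity.

```python
# Sketch of weight-only INT4 (W4A16) GPTQ quantization with llm-compressor.
# The W4A16 scheme defaults to per-group weight scales; the card describes group size 64,
# which would need a custom config_groups entry (omitted here for brevity).
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

model_id = "Qwen/Qwen3-14B"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft[:512]")
ds = ds.map(lambda ex: {"text": tokenizer.apply_chat_template(ex["messages"], tokenize=False)})

recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

oneshot(model=model, dataset=ds, recipe=recipe, max_seq_length=2048, num_calibration_samples=512)

model.save_pretrained("Qwen3-14B-quantized.w4a16", save_compressed=True)
tokenizer.save_pretrained("Qwen3-14B-quantized.w4a16")
```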
Voxtral-Mini-3B-2507-FP8-dynamic
Qwen2-72B-Instruct-FP8
Mixtral-8x7B-Instruct-v0.1-FP8
whisper-large-v3-turbo-quantized.w4a16
Mistral-Nemo-Instruct-2407-quantized.w4a16
granite-4.0-h-tiny-FP8-dynamic
phi-4-FP8-dynamic
Qwen2-0.5B-Instruct-FP8
Llama-4-Scout-17B-16E-Instruct-NVFP4
Llama-2-7b-chat-quantized.w8a8
whisper-large-v3-FP8-dynamic
Model Overview - Model Architecture: whisper-large-v3 - Input: Audio-Text - Output: Text - Model Optimizations: - Weight quantization: FP8 - Activation quantization: FP8 - Release Date: 04/16/2025 - Version: 1.0 - Model Developers: Neural Magic This model was obtained by quantizing the weights of openai/whisper-large-v3 to FP8 data type, ready for inference with vLLM >= 0.5.2. This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details. This model was created with llm-compressor by running the code snippet below. The model was evaluated on LibriSpeech and Fleurs datasets using lmms-eval, via the following commands:
Llama-3.2-3B-Instruct-quantized.w8a8
Qwen3-Next-80B-A3B-Instruct-FP8
Llama-3.3-70B-Instruct-speculator.eagle3
Model Overview
- Verifier: meta-llama/Llama-3.3-70B-Instruct
- Speculative Decoding Algorithm: EAGLE-3
- Model Architecture: Eagle3Speculator
- Release Date: 09/15/2025
- Version: 1.0
- Model Developers: RedHat

This is a speculator model designed for use with meta-llama/Llama-3.3-70B-Instruct, based on the EAGLE-3 speculative decoding algorithm. It was trained using the speculators library on a combination of the Aeala/ShareGPT_Vicuna_unfiltered dataset and the `train_sft` split of the HuggingFaceH4/ultrachat_200k dataset. This model should be used with the meta-llama/Llama-3.3-70B-Instruct chat template, specifically through the `/chat/completions` endpoint.

Text Summarization: 1.71 / 2.21 / 2.52 / 2.74 / 2.83 / 2.87 / 2.89

Benchmarking settings:
- temperature: 0
- repetitions: 5
- time per experiment: 4 min
- hardware: 4xA100
- vLLM version: 0.11.0
- GuideLLM version: 0.3.0

Command
```bash
GUIDELLM__PREFERRED_ROUTE="chat_completions" \
guidellm benchmark \
  --target "http://localhost:8000/v1" \
  --data "RedHatAI/speculator_benchmarks" \
  --data-args '{"data_files": "HumanEval.jsonl"}' \
  --rate-type sweep \
  --max-seconds 240 \
  --output-path "Llama-3.3-70B-Instruct-HumanEval.json" \
  --backend-args '{"extra_body": {"chat_completions": {"temperature": 0.0}}}'
```
DeepSeek-R1-Distill-Qwen-14B-FP8-dynamic
Llama-3.2-11B-Vision-Instruct-quantized.w4a16
Phi-4-reasoning-FP8-dynamic
DeepSeek-R1-Distill-Qwen-7B-quantized.w4a16
Qwen3-30B-A3B-Thinking-2507-speculator.eagle3
Qwen2.5-7B-Instruct-FP8-dynamic
Llama-2-7b-pruned70-retrained
DeepSeek-V2.5-1210-FP8
Apertus-8B-Instruct-2509-FP8-dynamic
whisper-large-v3-quantized.w8a8
Model Overview - Model Architecture: whisper-large-v3 - Input: Audio-Text - Output: Text - Model Optimizations: - Weight quantization: INT8 - Activation quantization: INT8 - Release Date: 04/16/2025 - Version: 1.0 - Model Developers: Neural Magic This model was obtained by quantizing the weights of openai/whisper-large-v3 to INT8 data type, ready for inference with vLLM >= 0.5.2. This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details. This model was created with llm-compressor by running the code snippet below. The model was evaluated on LibriSpeech and Fleurs datasets using lmms-eval, via the following commands:
SmolLM3-3B-FP8-dynamic
Meta-Llama-3.1-8B-quantized.w8a8
whisper-large-v3-turbo-FP8-dynamic
Model Overview - Model Architecture: whisper-large-v3-turbo - Input: Audio-Text - Output: Text - Model Optimizations: - Weight quantization: FP8 - Activation quantization: FP8 - Release Date: 04/16/2025 - Version: 1.0 - Model Developers: Neural Magic Quantized version of openai/whisper-large-v3-turbo. This model was obtained by quantizing the weights of openai/whisper-large-v3-turbo to FP8 data type, ready for inference with vLLM >= 0.5.2. This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details. This model was created with llm-compressor by running the code snippet below. The model was evaluated on LibriSpeech and Fleurs datasets using lmms-eval, via the following commands:
Qwen3-235B-A22B-NVFP4
Model Overview
- Model Architecture: Qwen/Qwen3-235B-A22B
- Input: Text
- Output: Text
- Model Optimizations:
  - Weight quantization: FP4
  - Activation quantization: FP4
- Out-of-scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages other than English.
- Release Date: 10/29/2025
- Version: 1.0
- Model Developers: RedHatAI

This model is a quantized version of Qwen/Qwen3-235B-A22B. It was evaluated on several tasks to assess its quality in comparison to the unquantized model.

This model was obtained by quantizing the weights and activations of Qwen/Qwen3-235B-A22B to FP4 data type, ready for inference with vLLM >= 0.9.1. This optimization reduces the number of bits per parameter from 16 to 4, reducing the disk size and GPU memory requirements by approximately 75%. Only the weights and activations of the linear operators within transformers blocks are quantized, using LLM Compressor.

This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details.

This model was created by applying LLM Compressor with calibration samples from UltraChat, as presented in the code snippet below.

This model was evaluated on the well-known OpenLLM v1, OpenLLM v2 and HumanEval64 benchmarks using lm-evaluation-harness. The reasoning evals were done using lighteval.

Category Metric Qwen/Qwen3-235B-A22B RedHatAI/Qwen3-235B-A22B-NVFP4 (this model) Recovery

The results were obtained using the following commands:
Llama-4-Maverick-17B-128E-Instruct-FP8
Qwen3-235B-A22B-Instruct-2507-speculator.eagle3
Meta-Llama-3-8B-Instruct-quantized.w8a16
Qwen2-57B-A14B-Instruct-FP8
Phi-3-medium-128k-instruct-quantized.w4a16
DeepSeek-R1-Distill-Llama-8B-FP8-dynamic
Mistral-7B-Instruct-v0.3-quantized.w4a16
gemma-2-9b-it-FP8
Llama-3.1-8B-Instruct-NVFP4
Model Overview
- Model Architecture: Meta-Llama-3.1
- Input: Text
- Output: Text
- Model Optimizations:
  - Weight quantization: FP4
  - Activation quantization: FP4
- Intended Use Cases: Intended for commercial and research use in multiple languages. Similarly to Meta-Llama-3.1-8B-Instruct, this model is intended for assistant-like chat.
- Out-of-scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages other than English.
- Release Date: 10/23/2025
- Version: 1.0
- License(s): llama3.1
- Model Developers: RedHatAI

This model is a quantized version of Meta-Llama-3.1-8B-Instruct. It was evaluated on several tasks to assess its quality in comparison to the unquantized model.

This model was obtained by quantizing the weights and activations of Meta-Llama-3.1-8B-Instruct to FP4 data type, ready for inference with vLLM >= 0.9.1. This optimization reduces the number of bits per parameter from 16 to 4, reducing the disk size and GPU memory requirements by approximately 75%. Only the weights and activations of the linear operators within transformers blocks are quantized, using LLM Compressor.

This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details.

This model was created by applying LLM Compressor with calibration samples from UltraChat, as presented in the code snippet below.

This model was evaluated on the well-known OpenLLM v1, OpenLLM v2 and HumanEval64 benchmarks. All evaluations were conducted using lm-evaluation-harness.

| Metric | Meta-Llama-3.1-8B-Instruct | Llama-3.1-8B-Instruct-NVFP4 (this model) | Recovery (%) |
|---|---|---|---|
| gsm8k_llama | 78.17 | 79.30 | 101.45 |
| hellaswag | 78.43 | 78.01 | 99.46 |
| mmlu_llama | 69.37 | 65.95 | 95.07 |
| mmlu_cot_llama | 72.86 | 68.60 | 94.15 |
| truthfulqa_mc2 | 55.09 | 52.95 | 96.12 |
| winogrande | 75.77 | 74.03 | 97.70 |
| Average | 73.29 | 71.59 | 97.68 |

Category Metric Meta-Llama-3.1-8B-Instruct RedHatAI/Llama-3.1-8B-Instruct-NVFP4 (this model) Recovery (%)

The results were obtained using the following commands:
phi-4
Kimi-K2-Instruct-quantized.w4a16
Apertus-70B-Instruct-2509-quantized.w4a16
Qwen3-4B-FP8-dynamic
gemma-3n-E2B-it-quantized.w8a8
Model Overview
- Model Architecture: gemma-3n-E2B-it
- Input: Audio-Vision-Text
- Output: Text
- Model Optimizations:
  - Weight quantization: INT8
  - Activation quantization: INT8
- Release Date: 08/01/2025
- Version: 1.0
- Model Developers: RedHatAI

This model was obtained by quantizing the weights and activations of google/gemma-3n-E2B-it to INT8 data type, ready for inference with vLLM >= 0.10.0.

This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details.

This model was created with llm-compressor by running the code snippet below.

The model was evaluated using lm-evaluation-harness for OpenLLM V1 and V2 text-based benchmarks. The evaluations were conducted using the following commands:

Category Metric google/gemma-3n-E2B-it RedHatAI/gemma-3n-E2B-it-quantized.w8a8 Recovery (%)
Llama-3.1-70B-Instruct-NVFP4
Qwen2.5-VL-72B-Instruct-quantized.w4a16
Qwen2.5-7B-quantized.w8a8
Llama-4-Scout-17B-16E-Instruct-FP8-block
Model Overview
- Model Architecture: Llama4ForConditionalGeneration
- Input: Text, Image
- Output: Text
- Model Optimizations:
  - Weight quantization: FP8
  - Activation quantization: FP8
- Release Date: -
- Version: 1.0
- Model Developers: Red Hat

Quantized version of meta-llama/Llama-4-Scout-17B-16E-Instruct. This model was obtained by quantizing the weights and activations of meta-llama/Llama-4-Scout-17B-16E-Instruct to FP8 data type. This optimization reduces the number of bits per parameter from 16 to 8, reducing the disk size and GPU memory requirements by approximately 50%. Only the weights and activations of the linear operators within transformers blocks of the language model are quantized.

This model was quantized using the llm-compressor library as shown below.

The model was evaluated on the OpenLLM v1 leaderboard tasks using lm-evaluation-harness and on reasoning tasks using lighteval. vLLM was used for all evaluations.

| Category | Metric | meta-llama/Llama-4-Scout-17B-16E-Instruct | RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-block | Recovery (%) |
|---|---|---|---|---|
| OpenLLM V1 | ARC-Challenge (Acc-Norm, 25-shot) | 69.62 | 68.60 | 98.53 |
| OpenLLM V2 | IFEval (Inst Level Strict Acc, 0-shot) | 89.09 | 89.93 | 100.94 |
Llama-3.2-1B-FP8
Ministral-3-14B-Instruct-2512
Mistral-Small-24B-Instruct-2501-quantized.w8a8
Mistral-Small-24B-Instruct-2501-quantized.w4a16
QwQ-32B-FP8-dynamic
gemma-3-4b-it-quantized.w8a8
gemma-3n-E2B-it-FP8-dynamic
Model Overview
- Model Architecture: gemma-3n-E2B-it
- Input: Audio-Vision-Text
- Output: Text
- Model Optimizations:
  - Weight quantization: FP8
  - Activation quantization: FP8
- Release Date: 08/01/2025
- Version: 1.0
- Model Developers: RedHatAI

This model was obtained by quantizing the weights of google/gemma-3n-E2B-it to FP8 data type, ready for inference with vLLM >= 0.10.0.

This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details.

This model was created with llm-compressor by running the code snippet below.

The model was evaluated using lm-evaluation-harness for OpenLLM V1 and V2 text-based benchmarks. The evaluations were conducted using the following commands:

Category Metric google/gemma-3n-E2B-it FP8 Dynamic Recovery (%)
whisper-large-v2-FP8-Dynamic
Model Overview
- Model Architecture: whisper-large-v2
- Input: Audio-Text
- Output: Text
- Model Optimizations:
  - Weight quantization: FP8
  - Activation quantization: FP8
- Release Date: 04/16/2025
- Version: 1.0
- Model Developers: Neural Magic

This model was obtained by quantizing the weights of openai/whisper-large-v2 to FP8 data type, ready for inference with vLLM >= 0.5.2.

This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details.

This model was created with llm-compressor by running the code snippet below.

The model was evaluated on LibriSpeech and Fleurs datasets using lmms-eval, via the following commands:

Fleurs (X→en, WER): cmn_hans_cn 15.2148 / 14.7614 / 103.07%
granite-3.1-8b-instruct
DeepSeek-R1-Distill-Llama-70B-quantized.w4a16
bge-base-en-v1.5-quant
Llama-3.3-70B-Instruct-NVFP4
Model Overview
- Model Architecture: Meta-Llama-3.3
- Input: Text
- Output: Text
- Model Optimizations:
  - Weight quantization: FP4
  - Activation quantization: FP4
- Intended Use Cases: Intended for commercial and research use in multiple languages. Similarly to Meta-Llama-3.3-70B-Instruct, this model is intended for assistant-like chat.
- Out-of-scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages other than English.
- Release Date: 6/25/2025
- Version: 1.0
- License(s): llama3.3
- Model Developers: RedHatAI

This model is a quantized version of Meta-Llama-3.3-70B-Instruct. It was evaluated on several tasks to assess its quality in comparison to the unquantized model.

This model was obtained by quantizing the weights and activations of Meta-Llama-3.3-70B-Instruct to FP4 data type, ready for inference with vLLM >= 0.9.1. This optimization reduces the number of bits per parameter from 16 to 4, reducing the disk size and GPU memory requirements by approximately 75%. Only the weights and activations of the linear operators within transformers blocks are quantized, using LLM Compressor.

This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details.

This model was created by applying LLM Compressor with calibration samples from UltraChat, as presented in the code snippet below.

This model was evaluated on the well-known OpenLLM v1, OpenLLM v2, HumanEval, and HumanEval64 benchmarks. All evaluations were conducted using lm-evaluation-harness.

| Metric | Meta-Llama-3.3-70B-Instruct | RedHatAI/Llama-3.3-70B-Instruct-NVFP4 (this model) | Recovery |
|---|---|---|---|
| gsm8k_llama (8-shot, strict-match) | 85.22 | 77.10 | 90.47 |

The results were obtained using the following commands:
TinyLlama-1.1B-Chat-v1.0-marlin
Llama-3.2-3B-Instruct-FP8-dynamic
Mixtral-8x7B-Instruct-v0.1
Apertus-70B-Instruct-2509-FP8-dynamic
Qwen3-8B-NVFP4
DeepSeek-R1-0528-quantized.w4a16
gemma-2-2b-it-quantized.w4a16
Qwen3-1.7B-FP8-dynamic
Qwen2-1.5B-Instruct-quantized.w8a8
Llama-3.1-8B-Instruct-FP8-block
Model Overview - Model Architecture: LlamaForCausalLM - Input: Text - Output: Text - Model Optimizations: - Weight quantization: FP8 - Activation quantization: FP8 - Release Date: - Version: 1.0 - Model Developers: Red Hat Quantized version of meta-llama/Llama-3.1-8B-Instruct. This model was obtained by quantizing the weights and activations of meta-llama/Llama-3.1-8B-Instruct to FP8 data type. This optimization reduces the number of bits per parameter from 16 to 8, reducing the disk size and GPU memory requirements by approximately 50%. Only the weights and activations of the linear operators within transformers blocks of the language model are quantized. This model was quantized using the llm-compressor library as shown below. The model was evaluated on the OpenLLM v1 leaderboard tasks using lm-evaluation-harness, and on reasoning tasks using lighteval. vLLM was used for all evaluations. Category Metric meta-llama/Llama-3.1-8B-Instruct RedHatAI/Llama-3.1-8B-Instruct-FP8-block Recovery (%) OpenLLM V1 ARC-Challenge (Acc-Norm, 25-shot) 60.92 60.92 100.00 OpenLLM V2 IFEval (Inst Level Strict Acc, 0-shot) 81.89 81.41 99.41
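The quantization snippet referenced above is not included in this excerpt. A sketch with llm-compressor is given below; the FP8_BLOCK scheme string and the import paths are assumptions that may differ between llm-compressor releases.

```python
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Block-wise FP8 quantization of the linear layers; the scheme name is assumed.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_BLOCK", ignore=["lm_head"])

# FP8 quantization is data-free here, so no calibration dataset is passed.
oneshot(model=model, recipe=recipe)

save_dir = "Llama-3.1-8B-Instruct-FP8-block"
model.save_pretrained(save_dir)
tokenizer.save_pretrained(save_dir)
```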
granite-3.1-8b-instruct-quantized.w8a8
gemma-2-2b-it-FP8
granite-3.1-2b-instruct-quantized.w4a16
Qwen2.5-7B-Instruct
Qwen2.5-VL-72B-Instruct-quantized.w8a8
DeepSeek-R1-Distill-Qwen-7B-FP8-dynamic
gemma-3n-E2B-it-quantized.w4a16
Model Overview - Model Architecture: gemma-3n-E2B-it - Input: Audio-Vision-Text - Output: Text - Model Optimizations: - Weight quantization: INT4 - Activation quantization: INT16 - Release Date: 08/01/2025 - Version: 1.0 - Model Developers: RedHatAI This model was obtained by quantizing the weights of google/gemma-3n-E2B-it to INT4 data type, ready for inference with vLLM >= 0.10.0. This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details. This model was created with llm-compressor by running the code snippet below. The model was evaluated using lm-evaluation-harness for OpenLLM V1 and V2 text-based benchmarks. The evaluations were conducted using the following commands: Category Metric google/gemma-3n-E2B-it RedHatAI/gemma-3n-E2B-it-quantized.w4a16 Recovery (%)
Qwen3-Next-80B-A3B-Thinking-FP8-dynamic
Mistral-Small-3.1-24B-Instruct-2503
Devstral-Small-2507-quantized.w4a16
Model Overview - Model Architecture: MistralForCausalLM - Input: Text - Output: Text - Model Optimizations: - Weight quantization: INT4 - Activation quantization: None - Release Date: 08/29/2025 - Version: 1.0 - Model Developers: Red Hat (Neural Magic) This model was obtained by quantizing the weights of Devstral-Small-2507 to INT4 data type. This optimization reduces the number of bits used to represent weights from 16 to 4, reducing GPU memory requirements (by approximately 75%). Weight quantization also reduces disk size requirements by approximately 75%. This model can be deployed efficiently using the vLLM backend, as shown in the example below. This model was created with llm-compressor by running the code snippet below. The model was evaluated on popular coding tasks (HumanEval, HumanEval+, MBPP, MBPP+) via EvalPlus with the vLLM backend (v0.10.1.1). For evaluations, we run greedy sampling and report pass@1. The command to reproduce evals: | | Recovery (%) | mistralai/Devstral-Small-2507 | RedHatAI/Devstral-Small-2507-quantized.w4a16 (this model) | | --------------------------- | :----------: | :------------------: | :--------------------------------------------------: | | HumanEval | 98.65 | 89.0 | 87.8 | | HumanEval+ | 100.0 | 81.1 | 81.1 | | MBPP | 98.97 | 77.5 | 76.7 | | MBPP+ | 102.12 | 66.1 | 67.5 | | Average Score | 99.81 | 78.43 | 78.28 |
gemma-3n-E4B-it-quantized.w8a8
Model Overview - Model Architecture: gemma-3n-E4B-it - Input: Audio-Vision-Text - Output: Text - Model Optimizations: - Weight quantization: INT8 - Activation quantization: INT8 - Release Date: 08/01/2025 - Version: 1.0 - Model Developers: RedHatAI This model was obtained by quantizing the weights and activations of google/gemma-3n-E4B-it to INT8 data type, ready for inference with vLLM >= 0.10.0. This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details. This model was created with llm-compressor by running the code snippet below. The model was evaluated using lm-evaluation-harness for OpenLLM V1 and V2 text-based benchmarks. The evaluations were conducted using the following commands: Category Metric google/gemma-3n-E4B-it RedHatAI/gemma-3n-E4B-it-quantized.w8a8 Recovery (%)
gemma-2-2b-it-quantized.w8a8
Qwen2-0.5B-Instruct-quantized.w8a16
Qwen2.5-7B-Instruct-quantized.w8a16
Qwen3.5-122B-A10B-FP8-Dynamic
Qwen2.5-7B-Instruct-quantized.w8a8
Qwen3-VL-235B-A22B-Instruct-FP8-block
Qwen3-8B-FP8-block
Model Overview - Model Architecture: Qwen3ForCausalLM - Input: Text - Output: Text - Model Optimizations: - Weight quantization: FP8 - Activation quantization: FP8 - Release Date: - Version: 1.0 - Model Developers: Red Hat This model was obtained by quantizing the weights and activations of Qwen/Qwen3-8B to FP8 data type. This optimization reduces the number of bits per parameter from 16 to 8, reducing the disk size and GPU memory requirements by approximately 50%. Only the weights and activations of the linear operators within transformers blocks of the language model are quantized. This model was quantized using the llm-compressor library as shown below. The model was evaluated on the OpenLLM leaderboard tasks, using lm-evaluation-harness. vLLM was used for all evaluations. Category Metric Qwen/Qwen3-8B nm-testing/Qwen3-8B-FP8-block Recovery (%) OpenLLM V1 ARC-Challenge (Acc-Norm, 25-shot) 67.66 67.92 100.38 OpenLLM V2 IFEval (Inst Level Strict Acc, 0-shot) 48.56 48.80 100.49
DeepSeek-R1-Distill-Qwen-1.5B-quantized.w4a16
pixtral-12b-FP8-dynamic
Mistral-Small-24B-Instruct-2501
MiniMax-M2.5
Qwen2.5-72B-Instruct-quantized.w8a8
Llama-3.1-Nemotron-70B-Instruct-HF
Llama-Guard-4-12B
Meta-Llama-3-70B-Instruct-quantized.w8a16
NVIDIA-Nemotron-3-Super-120B-A12B-BF16
NVIDIA-Nemotron-Nano-9B-v2-quantized.w4a16
Model Overview - Model Architecture: NemotronHForCausalLM - Input: Text - Output: Text - Model Optimizations: - Weight quantization: INT4 - Release Date: 10/22/2025 - Version: 1.0 - Model Developers: RedHat (Neural Magic) This model was obtained by quantizing the weights of NVIDIA-Nemotron-Nano-9B-v2 to INT4 data type. This optimization reduces the number of bits per parameter from 16 to 4, reducing the disk size and GPU memory requirements by approximately 75%. Only the weights of the linear operators within transformers blocks are quantized. Weights are quantized using a symmetric per-group scheme, with group size 64. The GPTQ algorithm is applied for quantization, as implemented in the llm-compressor library. This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details. Creation details This model was created with llm-compressor by running the code snippet below. The model was evaluated on the set of popular reasoning tasks AIME25, Math-500, and GPQA-Diamond, using lighteval `v0.11.1.dev0`. vLLM `v0.11.1rc2.dev191+g80e945298.precompiled` was used as the inference engine for all evaluations. NVIDIA-Nemotron-Nano-9B-v2-quantized.w4a16 (this model)
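The creation snippet is not reproduced in this excerpt. A hedged GPTQ W4A16 sketch with llm-compressor is shown below; the calibration dataset, sample count, and sequence length are illustrative assumptions, import paths vary slightly between releases, and the W4A16 preset defaults to group size 128 whereas the card states group size 64.

```python
from datasets import load_dataset
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nvidia/NVIDIA-Nemotron-Nano-9B-v2"  # assumed source checkpoint id
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# Small calibration set; dataset choice and size are illustrative.
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft[:512]")
ds = ds.map(lambda ex: {"text": tokenizer.apply_chat_template(ex["messages"], tokenize=False)})
ds = ds.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=2048, add_special_tokens=False),
    remove_columns=ds.column_names,
)

# GPTQ with symmetric INT4 weights on the linear layers (weights only).
recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

oneshot(model=model, dataset=ds, recipe=recipe, max_seq_length=2048, num_calibration_samples=512)

save_dir = "NVIDIA-Nemotron-Nano-9B-v2-quantized.w4a16"
model.save_pretrained(save_dir, save_compressed=True)
tokenizer.save_pretrained(save_dir)
```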
nomic-embed-text-v1.5
gemma-2-9b-it
Meta-Llama-3.1-405B-Instruct-quantized.w4a16
Qwen3-Next-80B-A3B-Instruct-FP8-dynamic
Ministral-3-3B-Instruct-2512
Mistral-Small-3.2-24B-Instruct-2506-FP8
DeepSeek-Coder-V2-Instruct-FP8
Sparse-Llama-3.1-8B-2of4
Qwen3-Next-80B-A3B-Thinking-quantized.w4a16
Qwen3-14B-speculator.eagle3
Model Overview - Verifier: Qwen/Qwen3-14B - Speculative Decoding Algorithm: EAGLE-3 - Model Architecture: Eagle3Speculator - Release Date: 09/18/2025 - Version: 1.0 - Model Developers: RedHat This is a speculator model designed for use with Qwen/Qwen3-14B, based on the EAGLE-3 speculative decoding algorithm. It was trained using the speculators library on a combination of the Aeala/ShareGPT_Vicuna_unfiltered dataset and the `train_sft` split of the HuggingFaceH4/ultrachat_200k dataset. This model should be used with the Qwen/Qwen3-14B chat template, specifically through the `/chat/completions` endpoint. Text Summarization 1.60 1.90 2.06 2.14 2.17 2.19 2.21 - temperature: 0.6 - top_p: 0.95 - top_k: 20 - repetitions: 3 - time per experiment: 10min - hardware: 1xA100 - vLLM version: 0.11.0 - GuideLLM version: 0.3.0 Command
```bash
GUIDELLM__PREFERRED_ROUTE="chat_completions" \
guidellm benchmark \
  --target "http://localhost:8000/v1" \
  --data "RedHatAI/speculator_benchmarks" \
  --data-args '{"data_files": "HumanEval.jsonl"}' \
  --rate-type sweep \
  --max-seconds 600 \
  --output-path "Qwen3-14B-HumanEval.json" \
  --backend-args '{"extra_body": {"chat_completions": {"temperature":0.6, "top_p":0.95, "top_k":20}}}'
```
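Beyond the benchmark command, serving the verifier together with this speculator can be sketched as follows; the speculative_config keys follow recent vLLM releases and the speculator repository id is assumed.

```python
from vllm import LLM, SamplingParams

# Verifier model plus EAGLE-3 speculator; the config keys follow recent vLLM releases.
llm = LLM(
    model="Qwen/Qwen3-14B",
    speculative_config={
        "model": "RedHatAI/Qwen3-14B-speculator.eagle3",  # assumed repository id
        "num_speculative_tokens": 3,
    },
)

messages = [{"role": "user", "content": "Summarize the benefits of speculative decoding."}]
outputs = llm.chat(messages, SamplingParams(temperature=0.6, top_p=0.95, top_k=20, max_tokens=256))
print(outputs[0].outputs[0].text)
```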
Qwen3-1.7B-quantized.w4a16
Llama-2-7b-chat-hf-FP8
GLM-4.6-NVFP4
SmolLM3-3B-quantized.w4a16
Model Overview - Model Architecture: SmolLM3-3B - Input: Text - Output: Text - Model Optimizations: - Weight quantization: INT4 - Activation quantization: None - Release Date: 07/31/2025 - Version: 1.0 - License(s): Apache-2.0 - Model Developers: RedHat (Neural Magic) This model was obtained by quantizing the weights of SmolLM3-3B to INT4 data type. This optimization reduces the number of bits used to represent weights from 16 to 4, reducing GPU memory requirements (by approximately 75%). Weight quantization also reduces disk size requirements by approximately 75%. Only weights of the linear operators within transformers blocks are quantized. The llm-compressor library is used for quantization. This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details. Creation details This model was created with llm-compressor by running the code snippet below with: This model was evaluated on the well-known reasoning tasks: AIME24, Math-500, and GPQA-Diamond. In all cases, model outputs were generated with the vLLM engine, and evaluations were collected with the LightEval library.
Llama-3.1-8B-tldr-FP8-dynamic
Model Overview - Model Architecture: LlamaForCausalLM - Input: Text - Output: Text - Model Optimizations: - Weight quantization: FP8 - Activation quantization: FP8 - Release Date: 06/06/2025 - Version: 1.0 - Intended Use Cases: This model is finetuned to summarize text in the style of Reddit posts. - Out-of-scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in any other way that is prohibited by the Acceptable Use Policy and Llama 3.1 Community License. - Model Developers: Red Hat (Neural Magic) This model is a quantized version of RedHatAI/Llama-3.1-8B-tldr, which is fine-tuned on the trl-lib/tldr dataset. This model recovers 100% of the BERTScore (0.366) obtained by RedHatAI/Llama-3.1-8B-tldr while providing up to 1.3x speedup. This model can be deployed efficiently using vLLM, as shown in the example below. Run the following command to start the vLLM server: Once your server is started, you can query the model using the OpenAI API: This model was created by applying llm-compressor, as presented in the code snippet below. The model was evaluated on the test split of trl-lib/tldr using the Neural Magic fork of lm-evaluation-harness (tldr branch). One can reproduce these results by using the following command: We evaluated the inference performance of this model using the first 1,000 samples from the training set of the trl-lib/tldr dataset. Benchmarking was conducted with vLLM version `0.9.0.1` and GuideLLM version `0.2.1`. The figure below presents the mean end-to-end latency per request across varying request rates. Results are shown for this model, as well as two variants: - Dense: Llama-3.1-8B-tldr - Sparse-quantized: Sparse-Llama-3.1-8B-tldr-2of4-FP8-dynamic 1. Generate a JSON file containing the first 1,000 training samples: > The average output length is approximately 30 tokens per sample. We capped the generation at 128 tokens to reduce performance skew from rare, unusually verbose completions.
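The serve command and the query example referenced above are not reproduced in this excerpt. The server can be started with `vllm serve RedHatAI/Llama-3.1-8B-tldr-FP8-dynamic` (repository id assumed); a query sketch with the OpenAI Python client, using an illustrative Reddit-style post, follows.

```python
from openai import OpenAI

# Point the OpenAI client at the local vLLM server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

post = (
    "SUBREDDIT: r/learnprogramming\n"
    "TITLE: Struggling to stay motivated while learning to code\n"
    "POST: I started learning Python three months ago and keep losing steam ..."
)

response = client.chat.completions.create(
    model="RedHatAI/Llama-3.1-8B-tldr-FP8-dynamic",  # assumed repository id
    messages=[{"role": "user", "content": post}],
    max_tokens=128,
    temperature=0.0,
)
print(response.choices[0].message.content)
```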
Qwen2.5-7B-Instruct-quantized.w4a16
gemma-2-2b-it-quantized.w8a16
Mixtral-8x22B-Instruct-v0.1-FP8
Qwen3-14B-FP8-block
Model Overview - Model Architecture: Qwen3ForCausalLM - Input: Text - Output: Text - Model Optimizations: - Weight quantization: FP8 - Activation quantization: FP8 - Release Date: - Version: 1.0 - Model Developers: Red Hat This model was obtained by quantizing the weights and activations of Qwen/Qwen3-14B to FP8 data type. This optimization reduces the number of bits per parameter from 16 to 8, reducing the disk size and GPU memory requirements by approximately 50%. Only the weights and activations of the linear operators within transformers blocks of the language model are quantized. This model was quantized using the llm-compressor library as shown below. The model was evaluated on the OpenLLM leaderboard tasks, using lm-evaluation-harness. vLLM was used for all evaluations. Category Metric Qwen/Qwen3-14B nm-testing/Qwen3-14B-FP8-block Recovery (%) OpenLLM V1 ARC-Challenge (Acc-Norm, 25-shot) 69.71 69.80 100.12 OpenLLM V2 IFEval (Inst Level Strict Acc, 0-shot) 48.56 48.80 100.49
granite-3.1-8b-instruct-FP8-dynamic
Qwen3-14B-NVFP4
granite-3.3-8b-instruct
Model Summary: Granite-3.3-8B-Instruct is an 8-billion-parameter, 128K-context-length language model fine-tuned for improved reasoning and instruction-following capabilities. Built on top of Granite-3.3-8B-Base, the model delivers significant gains on benchmarks for measuring generic performance, including AlpacaEval-2.0 and Arena-Hard, and improvements in mathematics, coding, and instruction following. It supports structured reasoning through `<think></think>` and `<response></response>` tags, providing clear separation between internal thoughts and final outputs. The model has been trained on a carefully balanced combination of permissively licensed data and curated synthetic tasks. - Developers: Granite Team, IBM - Website: Granite Docs - Release Date: April 16th, 2025 - License: Apache 2.0 Supported Languages: English, German, Spanish, French, Japanese, Portuguese, Arabic, Czech, Italian, Korean, Dutch, and Chinese. However, users may finetune this Granite model for languages beyond these 12 languages. Intended Use: This model is designed to handle general instruction-following tasks and can be integrated into AI assistants across various domains, including business applications. Capabilities: Thinking, Summarization, Text classification, Text extraction, Question-answering, Retrieval Augmented Generation (RAG), Code related tasks, Function-calling tasks, Multilingual dialog use cases, Long-context tasks including long document/meeting summarization, long document QA, etc. Generation: This is a simple example of how to use the Granite-3.3-8B-Instruct model (a minimal generation sketch appears at the end of this card). Then, copy the snippet from the section that is relevant for your use case. Comparison with different models over various benchmarks [1]. Scores of AlpacaEval-2.0 and Arena-Hard are calculated with thinking=True.

| Models | Arena-Hard | AlpacaEval-2.0 | MMLU | PopQA | TruthfulQA | BigBenchHard [2] | DROP [3] | GSM8K | HumanEval | HumanEval+ | IFEval | AttaQ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Granite-3.1-2B-Instruct | 23.3 | 27.17 | 57.11 | 20.55 | 59.79 | 61.82 | 20.99 | 67.55 | 79.45 | 75.26 | 63.59 | 84.7 |
| Granite-3.2-2B-Instruct | 24.86 | 34.51 | 57.18 | 20.56 | 59.8 | 61.39 | 23.84 | 67.02 | 80.13 | 73.39 | 61.55 | 83.23 |
| Granite-3.3-2B-Instruct | 28.86 | 43.45 | 55.88 | 18.4 | 58.97 | 63.91 | 44.33 | 72.48 | 80.51 | 75.68 | 65.8 | 87.47 |
| Llama-3.1-8B-Instruct | 36.43 | 27.22 | 69.15 | 28.79 | 52.79 | 73.43 | 71.23 | 83.24 | 85.32 | 80.15 | 79.10 | 83.43 |
| DeepSeek-R1-Distill-Llama-8B | 17.17 | 21.85 | 45.80 | 13.25 | 47.43 | 67.39 | 49.73 | 72.18 | 67.54 | 62.91 | 66.50 | 42.87 |
| Qwen-2.5-7B-Instruct | 25.44 | 30.34 | 74.30 | 18.12 | 63.06 | 69.19 | 64.06 | 84.46 | 93.35 | 89.91 | 74.90 | 81.90 |
| DeepSeek-R1-Distill-Qwen-7B | 10.36 | 15.35 | 50.72 | 9.94 | 47.14 | 67.38 | 51.78 | 78.47 | 79.89 | 78.43 | 59.10 | 42.45 |
| Granite-3.1-8B-Instruct | 37.58 | 30.34 | 66.77 | 28.7 | 65.84 | 69.87 | 58.57 | 79.15 | 89.63 | 85.79 | 73.20 | 85.73 |
| Granite-3.2-8B-Instruct | 55.25 | 61.19 | 66.79 | 28.04 | 66.92 | 71.86 | 58.29 | 81.65 | 89.35 | 85.72 | 74.31 | 84.7 |
| Granite-3.3-8B-Instruct | 57.56 | 62.68 | 65.54 | 26.17 | 66.86 | 69.13 | 59.36 | 80.89 | 89.73 | 86.09 | 74.82 | 88.5 |

Training Data: Overall, our training data is largely composed of two key sources: (1) publicly available datasets with permissive license, (2) internal synthetically generated data targeted to enhance reasoning capabilities. Infrastructure: We train Granite-3.3-8B-Instruct using IBM's supercomputing cluster, Blue Vela, which is outfitted with NVIDIA H100 GPUs. This cluster provides a scalable and efficient infrastructure for training our models over thousands of GPUs. Ethical Considerations and Limitations: Granite-3.3-8B-Instruct builds upon Granite-3.3-8B-Base, leveraging both permissively licensed open-source and select proprietary data for enhanced performance.
Since it inherits its foundation from the previous model, all ethical considerations and limitations applicable to Granite-3.3-8B-Base remain relevant. Resources - ⭐️ Learn about the latest updates with Granite: https://www.ibm.com/granite - 📄 Get started with tutorials, best practices, and prompt engineering advice: https://www.ibm.com/granite/docs/ - 💡 Learn about the latest Granite learning resources: https://github.com/ibm-granite-community/ [1] Evaluated using OLMES (except AttaQ and Arena-Hard scores) [2] Added regex for more efficient answer extraction. [3] Modified the implementation to handle some of the issues mentioned here
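The Generation section of this card refers to a usage snippet that is not reproduced here. A minimal transformers sketch is shown below; device placement and generation settings are illustrative.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ibm-granite/granite-3.3-8b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

messages = [{"role": "user", "content": "What is the capital of France?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Greedy decoding keeps the example deterministic.
output = model.generate(inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(output[0, inputs.shape[-1]:], skip_special_tokens=True))
```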
Mistral-7B-Instruct-v0.3-quantized.w8a16
pixtral-12b-quantized.w4a16
Qwen3-4B-Thinking-2507-quantized.w4a16
bge-small-en-v1.5-quant
Qwen3-4B-Thinking-2507-quantized.w8a8
Devstral-Small-2-24B-Instruct-2512
Qwen2.5-72B-Instruct-quantized.w4a16
Qwen2-VL-72B-Instruct-FP8-dynamic
gemma-2-2b-quantized.w8a16
all-MiniLM-L6-v2
gemma-3n-E4B-it-quantized.w4a16
Model Overview - Model Architecture: gemma-3n-E4B-it - Input: Audio-Vision-Text - Output: Text - Model Optimizations: - Weight quantization: INT4 - Activation quantization: INT16 - Release Date: 08/01/2025 - Version: 1.0 - Model Developers: RedHatAI This model was obtained by quantizing the weights of google/gemma-3n-E4B-it to INT4 data type, ready for inference with vLLM >= 0.10.0. This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details. This model was created with llm-compressor by running the code snippet below. The model was evaluated using lm-evaluation-harness for OpenLLM V1 and V2 text-based benchmarks. The evaluations were conducted using the following commands: Category Metric google/gemma-3n-E4B-it RedHatAI/gemma-3n-E4B-it-quantized.w4a16 Recovery (%)
pixtral-12b-quantized.w8a8
Mistral-Large-Instruct-2407-FP8
bert-large-uncased-finetuned-squadv1
Phi-3-vision-128k-instruct-W4A16-G128
Model Overview - Model Architecture: Phi-3-vision-128k-instruct - Input: Vision-Text - Output: Text - Model Optimizations: - Weight quantization: INT4 - Activation quantization: FP16 - Release Date: 1/31/2025 - Version: 1.0 - Model Developers: Neural Magic Quantized version of microsoft/Phi-3-vision-128k-instruct. This model was obtained by quantizing the weights of microsoft/Phi-3-vision-128k-instruct to INT4 data type, ready for inference with vLLM >= 0.5.2. This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details. This model was created with llm-compressor by running the code snippet below as part of a multimodal announcement blog. This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft’s Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties’ policies.
Qwen3-30B-A3B-Thinking-speculator.eagle3
whisper-small-FP8-Dynamic
Model Overview - Model Architecture: whisper-small - Input: Audio-Text - Output: Text - Model Optimizations: - Weight quantization: FP8 - Activation quantization: FP8 - Release Date: 04/16/2025 - Version: 1.0 - Model Developers: Neural Magic This model was obtained by quantizing the weights of openai/whisper-small to FP8 data type, ready for inference with vLLM >= 0.5.2. This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details. This model was created with llm-compressor by running the code snippet below. The model was evaluated on LibriSpeech and Fleurs datasets using lmms-eval, via the following commands: Fleurs (X→en, WER) cmn_hans_cn 23.0642 24.6761 93.50%
DeepSeek-V3.2-NVFP4-FP8-BLOCK
whisper-large-v3-turbo-quantized.w8a8
Model Overview - Model Architecture: whisper-large-v3-turbo - Input: Audio-Text - Output: Text - Model Optimizations: - Weight quantization: INT8 - Activation quantization: INT8 - Release Date: 04/16/2025 - Version: 1.0 - Model Developers: Neural Magic Quantized version of openai/whisper-large-v3-turbo. This model was obtained by quantizing the weights of openai/whisper-large-v3-turbo to INT8 data type, ready for inference with vLLM >= 0.5.2. This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details. This model was created with llm-compressor by running the code snippet below. The model was evaluated on LibriSpeech and Fleurs datasets using lmms-eval, via the following commands:
embeddinggemma-300m
Llama-4-Maverick-17B-128E-Instruct-FP8-block
Model Overview - Model Architecture: Llama4ForConditionalGeneration - Input: Text, Image - Output: Text - Model Optimizations: - Weight quantization: FP8 - Activation quantization: FP8 - Release Date: - Version: 1.0 - Model Developers: Red Hat Quantized version of meta-llama/Llama-4-Maverick-17B-128E-Instruct. This model was obtained by quantizing the weights and activations of meta-llama/Llama-4-Maverick-17B-128E-Instruct to FP8 data type. This optimization reduces the number of bits per parameter from 16 to 8, reducing the disk size and GPU memory requirements by approximately 50%. Only the weights and activations of the linear operators within transformers blocks of the language model are quantized. This model was quantized using the llm-compressor library as shown below. The model was evaluated on the OpenLLM leaderboard tasks, using lm-evaluation-harness. vLLM was used for all evaluations. Category Metric meta-llama/Llama-4-Maverick-17B-128E-Instruct RedHatAI/Llama-4-Maverick-17B-128E-Instruct-block-FP8 Recovery (%) OpenLLM V1 ARC-Challenge (Acc-Norm, 25-shot) 73.38 73.38 100.00 OpenLLM V2 IFEval (Inst Level Strict Acc, 0-shot) 89.93 90.89 101.07
Qwen3-30B-A3B-FP8-block
Llama-Guard-4-12B-FP8-dynamic
granite-3.1-2b-instruct-FP8-dynamic
DeepSeek-R1-Distill-Qwen-1.5B-FP8-dynamic
GLM-4.6-quantized.w8a8
GLM-4.6-FP8-dynamic
Qwen2.5-0.5B-Instruct-quantized.w8a8
TinyLlama-1.1B-Chat-v0.4-pruned50-quant-ds
zephyr-7b-beta-marlin
Llama-3.1-Nemotron-70B-Instruct-HF-quantized.w8a8
gemma-3-12b-it
Qwen3-32B-Thinking-speculator.eagle3
Phi-3-mini-128k-instruct-quantized.w8a8
Mistral-Large-3-675B-Instruct-2512-NVFP4
Mistral-Large-3-675B-Instruct-2512
Qwen2.5-Coder-7B-FP8-dynamic
Phi-3.5-mini-instruct-FP8-KV
Llama-2-7b-evolcodealpaca
Qwen3-Coder-480B-A35B-Instruct-FP8
Qwen2.5-1.5B-quantized.w4a16
Llama-Guard-4-12B-quantized.w8a8
Qwen2-0.5B-Instruct-quantized.w4a16
Qwen2.5-32B-quantized.w4a16
Llama-2-7b-chat-quantized.w8a16
Qwen2.5-32B-Instruct-FP8-dynamic
Llama-3.1-Nemotron-70B-Instruct-HF-quantized.w4a16
Llama-3.3-70B-Instruct-FP8-block
Model Overview - Model Architecture: LlamaForCausalLM - Input: Text - Output: Text - Model Optimizations: - Weight quantization: FP8 - Activation quantization: FP8 - Release Date: - Version: 1.0 - Model Developers: Red Hat Quantized version of meta-llama/Llama-3.3-70B-Instruct. This model was obtained by quantizing the weights and activations of meta-llama/Llama-3.3-70B-Instruct to FP8 data type. This optimization reduces the number of bits per parameter from 16 to 8, reducing the disk size and GPU memory requirements by approximately 50%. Only the weights and activations of the linear operators within transformers blocks of the language model are quantized. This model was quantized using the llm-compressor library as shown below. The model was evaluated on the OpenLLM v1 leaderboard tasks using lm-evaluation-harness, and on reasoning tasks using lighteval. vLLM was used for all evaluations. Category Metric meta-llama/Llama-3.3-70B-Instruct nm-testing/Llama-3.3-70B-Instruct-FP8-block Recovery (%) OpenLLM V1 ARC-Challenge (Acc-Norm, 25-shot) 72.53 72.61 100.12 OpenLLM V2 IFEval (Inst Level Strict Acc, 0-shot) 92.57 92.57 100.00
OpenHermes-2.5-Mistral-7B-marlin
gemma-2-9b-it-quantized.w4a16
gemma-2-9b-it-quantized.w8a16
Qwen2-1.5B-Instruct-quantized.w4a16
Qwen3-32B-FP8-block
Model Overview - Model Architecture: Qwen3ForCausalLM - Input: Text - Output: Text - Model Optimizations: - Weight quantization: FP8 - Activation quantization: FP8 - Release Date: - Version: 1.0 - Model Developers: Red Hat This model was obtained by quantizing the weights and activations of Qwen/Qwen3-32B to FP8 data type. This optimization reduces the number of bits per parameter from 16 to 8, reducing the disk size and GPU memory requirements by approximately 50%. Only the weights and activations of the linear operators within transformers blocks of the language model are quantized. This model was quantized using the llm-compressor library as shown below. The model was evaluated on the OpenLLM leaderboard tasks, using lm-evaluation-harness. vLLM was used for all evaluations. Category Metric Qwen/Qwen3-32B nm-testing/Qwen3-32B-FP8-block Recovery (%) OpenLLM V1 ARC-Challenge (Acc-Norm, 25-shot) 72.95 72.78 99.77 OpenLLM V2 IFEval (Inst Level Strict Acc, 0-shot) 49.04 49.28 100.49
bge-large-en-v1.5-quant
Meta-Llama-3-70B-Instruct-FP8-KV
Qwen3-4B-Instruct-2507-quantized.w4a16
Qwen3-4B-Instruct-2507-quantized.w8a8
Llama-3.1-8B-tldr
Qwen3-235B-A22B-speculator.eagle3
Qwen2.5-0.5B-quantized.w8a8
Meta-Llama-3.1-405B-Instruct-quantized.w8a8
granite-3.1-8b-base-quantized.w8a8
GLM-4.6-quantized.w4a16
granite-embedding-english-r2
Llama-2-7b-gsm8k-pruned_70
granite-3.1-2b-instruct-quantized.w8a8
granite-3.1-2b-base-quantized.w8a8
Qwen2-1.5B-Instruct-quantized.w8a16
DeepSeek-R1-quantized.w4a16
Mistral-7B-Instruct-v0.3-quantized.w8a8
granite-3.1-8b-base-quantized.w4a16
Qwen3-Embedding-8B
Qwen2.5-7B-quantized.w4a16
Pixtral-Large-Instruct-2411-hf-quantized.w8a8
oBERT-12-upstream-pruned-unstructured-97-finetuned-qqp
Llama4-Maverick-17B-128E-Instruct-speculator.eagle3
Llama-4-Maverick-17B-128E-Instruct-speculators.eagle3 Model Overview - Verifier: meta-llama/Llama-4-Maverick-17B-128E-Instruct - Speculative Decoding Algorithm: EAGLE-3 - Model Architecture: Eagle3Speculator - Release Date: 09/17/2025 - Version: 1.0 - Model Developers: RedHat This is a speculator model designed for use with meta-llama/Llama-4-Maverick-17B-128E-Instruct, based on the EAGLE-3 speculative decoding algorithm. It was converted into the speculators format from the model nvidia/Llama-4-Maverick-17B-128E-Eagle3. This model should be used with the meta-llama/Llama-4-Maverick-17B-128E-Instruct chat template, specifically through the `/chat/completions` endpoint. Text Summarization 1.69 2.12 2.37 2.52 2.60 2.63 2.63 - temperature: 0.6 - top_p: 0.9 - repetitions: 3 - time per experiment: 3min - hardware: 8xB200 - vLLM version: 0.11.0 - GuideLLM version: 0.3.0 If you use this model, please cite both the original NVIDIA model and the Speculators library: - Original model by NVIDIA Corporation - Conversion and formatting for Speculators/vLLM compatibility - Based on Eagle3 architecture with Llama3 draft head targeting Llama4 verifier
Qwen2-7B-Instruct-quantized.w4a16
TinyLlama-1.1B-Chat-v1.0-pruned2.4
QwQ-32B-quantized.w8a8
Meta-Llama-3.1-70B-Instruct-quantized.w8a16
Mistral-Small-4-119B-2603-NVFP4
NVIDIA-Nemotron-3-Super-120B-A12B-FP8
starcoder2-15b-quantized.w8a16
Qwen2.5-32B-Instruct-quantized.w4a16
Qwen3-30B-A3B-Thinking-2507-quantized.w8a8
bge-base-en-v1.5-dense
Llama-2-7b_oneshot-pruned70_C4_10k
Qwen2-72B-Instruct-quantized.w8a16
oBERT-12-upstream-pruned-unstructured-97-finetuned-mnli
oBERT-6-downstream-pruned-unstructured-90-squadv1
mpt-7b-gsm8k-pruned50-quant-ds
Qwen2-0.5B-Instruct-quantized.w8a8
Mixtral-8x7B-Instruct-v0.1-AutoFP8
Qwen2.5-14B-quantized.w8a8
oBERT-6-downstream-pruned-block4-80-squadv1
Qwen3-Next-80B-A3B-Instruct-FP8-block
oBERT-12-upstream-pruned-unstructured-97-finetuned-squadv1
whisper-small-quantized.w8a8
Model Overview - Model Architecture: whisper-small - Input: Audio-Text - Output: Text - Model Optimizations: - Weight quantization: INT8 - Activation quantization: INT8 - Release Date: 04/16/2025 - Version: 1.0 - Model Developers: Neural Magic This model was obtained by quantizing the weights of openai/whisper-small to INT8 data type, ready for inference with vLLM >= 0.5.2. This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details. This model was created with llm-compressor by running the code snippet below. The model was evaluated on LibriSpeech and Fleurs datasets using lmms-eval, via the following commands: Fleurs (X→en, WER) cmn_hans_cn 23.0642 24.1052 95.68%
Meta-Llama-3-8B-Instruct-quantized.w4a16
Llama-4-Maverick-17B-128E-Instruct
Phi-3-mini-128k-instruct-quantized.w4a16
Qwen2-72B-Instruct-quantized.w8a8
Qwen2.5-72B-FP8-dynamic
whisper-small-quantized.w4a16
Model Overview - Model Architecture: whisper-small - Input: Audio-Text - Output: Text - Model Optimizations: - Weight quantization: INT4 - Release Date: 04/16/2025 - Version: 1.0 - Model Developers: Neural Magic This model was obtained by quantizing the weights of openai/whisper-small to INT4 data type, ready for inference with vLLM >= 0.5.2. This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details. This model was created with llm-compressor by running the code snippet below. The model was evaluated on LibriSpeech and Fleurs datasets using lmms-eval, via the following commands: Fleurs (X→en, WER) cmn_hans_cn 23.0642 25.5212 90.37%
Qwen3-30B-A3B-Instruct-2507-quantized.w8a8
bge-small-en-v1.5-dense
Llama-2-7b-ultrachat200k-pruned_70-quantized-deepsparse
Phi-3-mini-128k-instruct-FP8
Qwen2-7B-Instruct-quantized.w8a8
granite-3.1-2b-base-quantized.w4a16
Qwen2-VL-72B-Instruct-quantized.w4a16
Model Overview - Model Architecture: Qwen/Qwen2-VL-72B-Instruct - Input: Vision-Text - Output: Text - Model Optimizations: - Weight quantization: INT4 - Activation quantization: FP16 - Release Date: 2/24/2025 - Version: 1.0 - Model Developers: Neural Magic This model was obtained by quantizing the weights of Qwen/Qwen2-VL-72B-Instruct to INT4 data type, ready for inference with vLLM >= 0.5.2. This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details. This model was created with llm-compressor by running the code snippet below as part of a multimodal announcement blog. The model was evaluated using mistral-evals for vision-related tasks and using lm-evaluation-harness for select text-based benchmarks. The evaluations were conducted using the following commands: Vision Tasks - vqav2 - docvqa - mathvista - mmmu - chartqa Category Metric Qwen/Qwen2-VL-72B-Instruct nm-testing/Qwen2-VL-72B-Instruct-quantized.W4A16 Recovery (%) Vision MMMU (val, CoT) explicit_prompt_relaxed_correctness 62.11 60.11 96.78% ChartQA (test, CoT) anywhere_in_answer_relaxed_correctness 83.40 80.72 96.78% Mathvista (testmini, CoT) explicit_prompt_relaxed_correctness 66.57 64.66 97.13% This model achieves up to 3.7x speedup in single-stream deployment and up to 3.3x speedup in multi-stream asynchronous deployment, depending on hardware and use-case scenario. The following performance benchmarks were conducted with vLLM version 0.7.2, and GuideLLM. Single-stream performance (measured with vLLM version 0.7.2) Document Visual Question Answering 1680W x 2240H 64/128 Visual Reasoning 640W x 480H 128/128 Image Captioning 480W x 360H 0/128 Hardware Number of GPUs Model Average Cost Reduction Latency (s) QPD Latency (s) QPD Latency (s) QPD 2 neuralmagic/Qwen2-VL-72B-Instruct-quantized.w8a8 1.85 7.2 139 4.9 206 4.8 211 1 neuralmagic/Qwen2-VL-72B-Instruct-quantized.w4a16 3.32 10.0 202 5.0 398 4.8 419 2 neuralmagic/Qwen2-VL-72B-Instruct-FP8-Dynamic 1.79 4.7 119 3.3 173 3.2 177 1 neuralmagic/Qwen2-VL-72B-Instruct-quantized.w4a16 2.60 6.4 172 4.3 253 4.2 259 Use case profiles: Image Size (WxH) / prompt tokens / generation tokens QPD: Queries per dollar, based on on-demand cost at Lambda Labs (observed on 2/18/2025). Multi-stream asynchronous performance (measured with vLLM version 0.7.2) Document Visual Question Answering 1680W x 2240H 64/128 Visual Reasoning 640W x 480H 128/128 Image Captioning 480W x 360H 0/128 Hardware Model Average Cost Reduction Maximum throughput (QPS) QPD Maximum throughput (QPS) QPD Maximum throughput (QPS) QPD neuralmagic/Qwen2-VL-72B-Instruct-quantized.w8a8 1.84 0.6 293 2.0 1021 2.3 1135 neuralmagic/Qwen2-VL-72B-Instruct-quantized.w4a16 2.73 0.6 314 3.2 1591 4.0 2019 neuralmagic/Qwen2-VL-72B-Instruct-FP8-Dynamic 1.70 0.8 236 2.2 623 2.4 669 neuralmagic/Qwen2-VL-72B-Instruct-quantized.w4a16 2.35 1.3 350 3.3 910 3.6 994 Use case profiles: Image Size (WxH) / prompt tokens / generation tokens QPD: Queries per dollar, based on on-demand cost at Lambda Labs (observed on 2/18/2025).
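The deployment example referenced earlier in this card is not reproduced here. A minimal vLLM vision-chat sketch is given below; the repository id, image URL, GPU count, and sampling settings are illustrative assumptions.

```python
from vllm import LLM, SamplingParams

# Assumed repository id; tensor_parallel_size should match your hardware.
llm = LLM(
    model="RedHatAI/Qwen2-VL-72B-Instruct-quantized.w4a16",
    tensor_parallel_size=2,
    max_model_len=8192,
)

# OpenAI-style multimodal message; the image URL is a placeholder.
messages = [{
    "role": "user",
    "content": [
        {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
        {"type": "text", "text": "Describe what this chart shows."},
    ],
}]

outputs = llm.chat(messages, SamplingParams(temperature=0.2, max_tokens=256))
print(outputs[0].outputs[0].text)
```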
whisper-large-v2-quantized.w8a8
Model Overview - Model Architecture: whisper-large-v2 - Input: Audio-Text - Output: Text - Model Optimizations: - Weight quantization: INT8 - Activation quantization: INT8 - Release Date: 04/16/2025 - Version: 1.0 - Model Developers: Neural Magic This model was obtained by quantizing the weights of openai/whisper-large-v2 to INT8 data type, ready for inference with vLLM >= 0.5.2. This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details. This model was created with llm-compressor by running the code snippet below. The model was evaluated on LibriSpeech and Fleurs datasets using lmms-eval, via the following commands: Fleurs (X→en, WER) cmn_hans_cn 15.2148 15.4498 98.48%
gemma-2-9b-it-quantized.w8a8
Qwen2.5-14B-FP8-dynamic
DeepSeek-Coder-V2-Instruct-0724-quantized.w4a16
oBERT-12-downstream-pruned-unstructured-80-mnli
oBERT-6-downstream-pruned-unstructured-80-squadv1
oBERT-12-upstream-pruned-unstructured-97-finetuned-squadv1-v2
Qwen2-7B-Instruct-quantized.w8a16
Sparse-Llama-3.1-8B-evolcodealpaca-2of4-quantized.w4a16
Sparse-Llama-3.1-8B-evolcodealpaca-2of4-quantized.w4a16 Model Overview - Model Architecture: Llama-3.1-8B - Input: Text - Output: Text - Model Optimizations: - Sparsity: 2:4 - Weight quantization: INT4 - Release Date: 11/21/2024 - Version: 1.0 - License(s): llama3.1 - Model Developers: Neural Magic This is a code completion AI model obtained by fine-tuning the 2:4 sparse Sparse-Llama-3.1-8B-2of4 on the evol-codealpaca-v1 dataset, followed by quantization. On the HumanEval benchmark, it achieves a pass@1 of 50.6, compared to 48.5 for the fine-tuned dense model Llama-3.1-8B-evolcodealpaca — demonstrating over 100% accuracy recovery. This model was obtained by quantizing the weights of Sparse-Llama-3.1-8B-evolcodealpaca-2of4 to INT4 data type. This optimization reduces the number of bits per parameter from 16 to 4, reducing the disk size and GPU memory requirements by approximately 75%. That is on top of the reduction of 50% of weights via 2:4 pruning employed on Sparse-Llama-3.1-8B-evolcodealpaca-2of4. Only the weights of the linear operators within transformers blocks are quantized. Symmetric per-channel quantization is applied, in which a linear scaling per output dimension maps the INT4 and floating point representations of the quantized weights. The GPTQ algorithm is applied for quantization, as implemented in the llm-compressor library. This model can be deployed efficiently using the vLLM backend. vLLM also supports OpenAI-compatible serving. See the documentation for more details. This model was evaluated on Neural Magic's fork of EvalPlus. Metric Llama-3.1-8B-evolcodealpaca Sparse-Llama-3.1-8B-evolcodealpaca-2of4 Sparse-Llama-3.1-8B-evolcodealpaca-2of4-quantized.w4a16
Qwen2.5-7B-FP8-dynamic
Sparse-Llama-3.1-8B-ultrachat_200k-2of4-FP8-dynamic
oBERT-12-downstream-pruned-unstructured-90-mnli
oBERT-teacher-qqp
oBERT-12-upstream-pruned-unstructured-90-finetuned-mnli
oBERT-12-upstream-pruned-unstructured-97-finetuned-qqp-v2
MiniChat-3B-pruned50-quant-ds
Llama-2-7b-ultrachat200k-pruned_50
Llama-2-7b-dolphin-open_platypus-pruned_50-quantized-deepsparse
starcoder2-3b-FP8
starcoder2-7b-quantized.w8a16
Qwen2.5-14B-Instruct-FP8-dynamic
Phi-3-medium-128k-instruct-FP8
bge-small-en-v1.5-sparse
mpt-7b-gsm8k-pruned80-quant-ds
Sparse-Llama-3.1-8B-ultrachat_200k-2of4
Model Overview - Model Architecture: Llama-3.1-8B - Input: Text - Output: Text - Model Optimizations: - Sparsity: 2:4 - Release Date: 11/21/2024 - Version: 1.0 - License(s): llama3.1 - Model Developers: Neural Magic This is a multi-turn conversational AI model obtained by fine-tuning the 2:4 sparse Sparse-Llama-3.1-8B-2of4 on the ultrachat_200k dataset. On the AlpacaEval benchmark (version 1), it achieves a score of 61.1, compared to 62.0 for the fine-tuned dense model Llama-3.1-8B-ultrachat_200k — demonstrating a 98.5% accuracy recovery. This model inherits the optimizations from its parent, Sparse-Llama-3.1-8B-2of4. Namely, all linear operators within transformer blocks were pruned to the 2:4 sparsity pattern: in each group of four weights, two are retained while two are pruned. This model can be deployed efficiently using the vLLM backend. vLLM also supports OpenAI-compatible serving. See the documentation for more details. This model was evaluated on Neural Magic's fork of the AlpacaEval benchmark. We adopt the same setup as in Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment, using version 1 of the benchmark and Llama-2-70b-chat as the annotator. Metric Llama-3.1-8B-ultrachat_200k Sparse-Llama-3.1-8B-ultrachat_200k-2of4
oBERT-teacher-squadv1
oBERT-12-downstream-pruned-unstructured-80-squadv1
oBERT-teacher-mnli
oBERT-12-downstream-pruned-unstructured-90-qqp
oBERT-12-upstream-pruned-unstructured-90
oBERT-12-upstream-pruned-unstructured-90-v2
oBERT-12-upstream-pruned-unstructured-90-finetuned-qqp-v2
OpenHermes-2.5-Mistral-7B-pruned2.4
Nous-Hermes-2-SOLAR-10.7B-pruned2.4
Llama-2-7b-evol-code-alpaca-pruned_70-quantized-deepsparse
starcoder2-15b-FP8
starcoder2-7b-FP8
starcoder2-15b-quantized.w8a8
Qwen2.5-0.5B-quantized.w4a16
ToolACE-2-Llama-3.1-8B-FP8-dynamic
Nous-Hermes-2-Yi-34B-marlin
Qwen2-72B-Instruct-quantized.w4a16
Phi-3-medium-128k-instruct-quantized.w8a8
mpt-7b-gsm8k-pruned70-quant-ds
Pixtral-Large-Instruct-2411-hf-FP8-dynamic
oBERT-12-downstream-pruned-unstructured-97-mnli
oBERT-12-downstream-pruned-unstructured-97-qqp
oBERT-12-upstream-pretrained-dense
oBERT-12-upstream-pruned-unstructured-97
oBERT-6-downstream-dense-squadv1
oBERT-12-upstream-pruned-unstructured-90-finetuned-mnli-v2
mpt-7b-gsm8k-pruned60-quant-ds
mpt-7b-gsm8k-pruned60-pt
llama-2-7b-chat-marlin
Llama-2-7b-ultrachat200k-pruned_50-quantized-deepsparse
Llama-2-7b-evol-code-alpaca-pruned_50
DeepSeek-Coder-V2-Base-FP8
Qwen2.5-1.5B-quantized.w8a16
Qwen2.5-1.5B-FP8-dynamic
Model Overview - Model Architecture: Qwen2 - Input: Text - Output: Text - Model Optimizations: - Activation quantization: INT8 - Weight quantization: INT8 - Intended Use Cases: Intended for commercial and research use in multiple languages. Similarly to Qwen2.5-1.5B, this model is intended for assistant-like chat. - Out-of-scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws). - Release Date: 11/27/2024 - Version: 1.0 - License(s): apache-2.0 - Model Developers: Neural Magic Quantized version of Qwen2.5-1.5B. It achieves an average score of 58.34 on the OpenLLM benchmark (version 1), whereas the unquantized model achieves 58.48. This model was obtained by quantizing the weights of Qwen2.5-1.5B to INT8 data type. This optimization reduces the number of bits used to represent weights and activations from 16 to 8, reducing GPU memory requirements (by approximately 50%) and increasing matrix-multiply compute throughput (by approximately 2x). Weight quantization also reduces disk size requirements by approximately 50%. Only weights and activations of the linear operators within transformers blocks are quantized. Weights are quantized with a symmetric static per-channel scheme, where a fixed linear scaling factor is applied between INT8 and floating point representations for each output channel dimension. Activations are quantized with a symmetric dynamic per-token scheme, computing a linear scaling factor at runtime for each token between INT8 and floating point representations. This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details. The model was evaluated on the OpenLLM leaderboard tasks (version 1) with the lm-evaluation-harness (commit 383bbd54bc621086e05aa1b030d8d4d5635b25e6) and the vLLM engine, using the following command:
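The evaluation command itself is missing from this excerpt. An equivalent sketch through the lm-evaluation-harness Python API is given below for a single representative OpenLLM v1 task; the repository id, task choice, and engine arguments are assumptions, and the original card most likely used the lm_eval CLI with the vLLM backend.

```python
import lm_eval

# One representative OpenLLM v1 task; the full leaderboard covers several tasks,
# each with its own few-shot setting.
results = lm_eval.simple_evaluate(
    model="vllm",
    model_args="pretrained=RedHatAI/Qwen2.5-1.5B-FP8-dynamic,dtype=auto,max_model_len=4096",
    tasks=["arc_challenge"],
    num_fewshot=25,
    batch_size="auto",
)
print(results["results"]["arc_challenge"])
```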
Qwen2.5-3B-FP8-dynamic
Model Overview - Model Architecture: Qwen2 - Input: Text - Output: Text - Model Optimizations: - Activation quantization: INT8 - Weight quantization: INT8 - Intended Use Cases: Intended for commercial and research use in multiple languages. Similarly to Qwen2.5-3B, this model is intended for assistant-like chat. - Out-of-scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws). - Release Date: 11/27/2024 - Version: 1.0 - License(s): apache-2.0 - Model Developers: Neural Magic Quantized version of Qwen2.5-3B. It achieves an average score of 62.50 on the OpenLLM benchmark (version 1), whereas the unquantized model achieves 63.59. This model was obtained by quantizing the weights of Qwen2.5-3B to INT8 data type. This optimization reduces the number of bits used to represent weights and activations from 16 to 8, reducing GPU memory requirements (by approximately 50%) and increasing matrix-multiply compute throughput (by approximately 2x). Weight quantization also reduces disk size requirements by approximately 50%. Only weights and activations of the linear operators within transformers blocks are quantized. Weights are quantized with a symmetric static per-channel scheme, where a fixed linear scaling factor is applied between INT8 and floating point representations for each output channel dimension. Activations are quantized with a symmetric dynamic per-token scheme, computing a linear scaling factor at runtime for each token between INT8 and floating point representations. This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details. The model was evaluated on the OpenLLM leaderboard tasks (version 1) with the lm-evaluation-harness (commit 383bbd54bc621086e05aa1b030d8d4d5635b25e6) and the vLLM engine, using the following command:
Qwen2.5-72B-Instruct-FP8-dynamic
granite-3.1-8b-instruct-GGUF
Qwen2.5-3B-quantized.w4a16
Mixtral-8x7B-v0.1-quantized.w4a16
whisper-medium-quantized.w8a8
Model Overview - Model Architecture: whisper-medium - Input: Audio-Text - Output: Text - Model Optimizations: - Weight quantization: INT8 - Activation quantization: INT8 - Release Date: 04/16/2025 - Version: 1.0 - Model Developers: Neural Magic This model was obtained by quantizing the weights of openai/whisper-medium to INT8 data type, ready for inference with vLLM >= 0.5.2. This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details. This model was created with llm-compressor by running the code snippet below. The model was evaluated on LibriSpeech and Fleurs datasets using lmms-eval, via the following commands: Fleurs (X→en, WER) cmn_hans_cn 13.3371 12.6123 105.75%
SOLAR-10.7B-Instruct-v1.0-pruned50-quant-ds
Sparse-Llama-3.1-8B-ultrachat_200k-2of4-quantized.w4a16
Meta-Llama-3-70B-Instruct-quantized.w4a16
bge-base-en-v1.5-sparse
SmolLM-135M-Instruct-quantized.w8a8
Sparse-Llama-3.1-8B-gsm8k-2of4-FP8-dynamic
DeepSeek-V3-BF16
whisper-large-v2-quantized.w4a16
Model Overview - Model Architecture: whisper-large-v2 - Input: Audio-Text - Output: Text - Model Optimizations: - Weight quantization: INT4 - Release Date: 04/16/2025 - Version: 1.0 - Model Developers: Neural Magic This model was obtained by quantizing the weights of openai/whisper-large-v2 to INT4 data type, ready for inference with vLLM >= 0.5.2. This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details. This model was created with llm-compressor by running the code snippet below. The model was evaluated on LibriSpeech and Fleurs datasets using lmms-eval, via the following commands: Fleurs (X→en, WER) cmn_hans_cn 15.2148 23.5763 64.53%
oBERT-12-downstream-pruned-unstructured-97-squadv1
oBERT-3-upstream-pretrained-dense
oBERT-12-upstream-pruned-unstructured-90-finetuned-qqp
oBERT-3-downstream-pruned-block4-80-squadv1
oBERT-6-downstream-pruned-block4-80-QAT-squadv1
bge-large-en-v1.5-dense
zephyr-7b-beta-pruned50-quant-ds
Nous-Hermes-2-Yi-34B-pruned2.4
Nous-Hermes-2-Yi-34B-pruned50
llama2.c-stories110M-pruned2.4
Llama-2-7b-cnn-daily-mail-pruned_70-quantized-deepsparse
SparseLLama-2-7b-ultrachat_200k-pruned_50.2of4
SparseLlama-2-7b-evolcodealpaca-pruned_50.2of4
Llama-2-7b-chat-quantized.w4a16
Meta-Llama-3-70B-Instruct-quantized.w8a8
gemma-2-27b-it-quantized.w8a16
SmolLM-135M-Instruct-quantized.w8a16
SmolLM-360M-Instruct-quantized.w8a8
Qwen2.5-72B-quantized.w8a8
Qwen2.5-32B-quantized.w8a16
granite-3.1-2b-base-FP8-dynamic
Model Overview - Model Architecture: granite-3.1-2b-base - Input: Text - Output: Text - Model Optimizations: - Weight quantization: FP8 - Activation quantization: FP8 - Release Date: 1/8/2025 - Version: 1.0 - Model Developers: Neural Magic Quantized version of ibm-granite/granite-3.1-2b-base. It achieves an average score of 57.37 on the OpenLLM benchmark (version 1), whereas the unquantized model achieves 57.65. This model was obtained by quantizing the weights and activations of ibm-granite/granite-3.1-2b-base to FP8 data type, ready for inference with vLLM >= 0.5.2. This optimization reduces the number of bits per parameter from 16 to 8, reducing the disk size and GPU memory requirements by approximately 50%. Only the weights and activations of the linear operators within transformers blocks are quantized. This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details. This model was created with llm-compressor by running the code snippet below. The model was evaluated on OpenLLM Leaderboard V1, OpenLLM Leaderboard V2 and on HumanEval, using the following commands: Category Metric ibm-granite/granite-3.1-2b-base neuralmagic/granite-3.1-2b-base-FP8-dynamic Recovery (%) OpenLLM V1 ARC-Challenge (Acc-Norm, 25-shot) 53.75 53.50 99.54 This model achieves up to 1.2x speedup in single-stream deployment on L40 GPUs. The following performance benchmarks were conducted with vLLM version 0.6.6.post1, and GuideLLM. Single-stream performance (measured with vLLM version 0.6.6.post1) GPU class Model Speedup Code Completion prefill: 256 tokens decode: 1024 tokens Docstring Generation prefill: 768 tokens decode: 128 tokens Code Fixing prefill: 1024 tokens decode: 1024 tokens RAG prefill: 1024 tokens decode: 128 tokens Instruction Following prefill: 256 tokens decode: 128 tokens Multi-turn Chat prefill: 512 tokens decode: 256 tokens Large Summarization prefill: 4096 tokens decode: 512 tokens granite-3.1-2b-base-FP8-dynamic (this model) 1.26 7.3 0.9 7.4 1.0 0.9 1.8 4.1 granite-3.1-2b-base-quantized.w4a16 1.88 4.8 0.6 4.9 0.6 0.6 1.2 2.8
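The creation snippet referenced in this card is not reproduced here. A data-free FP8-dynamic sketch with llm-compressor is shown below; import paths vary slightly between llm-compressor releases.

```python
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ibm-granite/granite-3.1-2b-base"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# FP8 weights with dynamic per-token FP8 activations on linear layers;
# this scheme is data-free, so no calibration set is required.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])
oneshot(model=model, recipe=recipe)

save_dir = "granite-3.1-2b-base-FP8-dynamic"
model.save_pretrained(save_dir)
tokenizer.save_pretrained(save_dir)
```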
Llama-3.1-70B-Instruct-NVFP4A16
watt-tool-8B-FP8-dynamic
mpt-7b-chat-pruned50-quant-ds
Mixtral-8x22B-Instruct-v0.1-AutoFP8
mobilebert-uncased-finetuned-squadv1
OpenHermes-2.5-Mistral-7B-pruned50
Llama-2-7b-dolphin-open_platypus-pruned_70-quantized-deepsparse
Meta-Llama-3.1-8B-quantized.w8a16
Meta-Llama-3.1-405B-Instruct-quantized.w8a16
Qwen2.5-7B-quantized.w8a16
Sparse-Llama-3.1-8B-gsm8k-2of4
oBERT-12-downstream-pruned-unstructured-90-squadv1
oBERT-12-upstream-pruned-unstructured-90-finetuned-squadv1
oBERT-12-downstream-pruned-block4-80-squadv1
oBERT-12-downstream-pruned-block4-90-squadv1
oBERT-3-downstream-pruned-unstructured-80-squadv1
oBERT-6-downstream-dense-QAT-squadv1
oBERT-6-downstream-pruned-block4-90-QAT-squadv1
oBERT-12-upstream-pruned-unstructured-97-v2
oBERT-12-upstream-pruned-unstructured-97-finetuned-mnli-v2
mpt-7b-gsm8k-pruned75-quant-ds
Llama-2-7b-evol-code-alpaca-pruned_70
Llama-2-7b-evol-code-alpaca-pruned_50-quantized-deepsparse
Llama-2-7b-dolphin-open_platypus-pruned_50
Llama-2-7b-cnn-daily-mail-pruned_50-quantized-deepsparse
Phi-3-mini-128k-instruct-quantized.w8a16
starcoder2-7b-quantized.w8a8
Phi-3-small-128k-instruct-quantized.w8a16
Qwen2.5-72B-quantized.w8a16
granite-3.0-8b-instruct-GGUF
granite-3.0-2b-instruct-GGUF
Sparse-Llama-3.1-8B-gsm8k-2of4-quantized.w4a16
Model Overview - Model Architecture: Llama-3.1-8B - Input: Text - Output: Text - Model Optimizations: - Sparsity: 2:4 - Weight quantization: INT4 - Release Date: 11/21/2024 - Version: 1.0 - License(s): llama3.1 - Model Developers: Neural Magic This is an AI model specialized in grade-school math obtained by fine-tuning the 2:4 sparse Sparse-Llama-3.1-8B-2of4 on the GSM8k dataset, followed by one-shot quantization. It achieves 64.3% 0-shot accuracy on the test set of GSM8k, compared to 66.3% for the fine-tuned dense model Llama-3.1-8B-gsm8k — demonstrating over 96.9% accuracy recovery. In contrast, the pretrained Llama-3.1-8B achieves 50.7% 5-shot accuracy and the sparse foundational Sparse-Llama-3.1-8B-2of4 model achieves 56.3% 5-shot accuracy. This model was obtained by quantizing the weights of Sparse-Llama-3.1-8B-gsm8k-2of4 to INT4 data type. This optimization reduces the number of bits per parameter from 16 to 4, reducing the disk size and GPU memory requirements by approximately 75%. That is on top of the reduction of 50% of weights via 2:4 pruning employed on Sparse-Llama-3.1-8B-gsm8k-2of4. Only the weights of the linear operators within transformers blocks are quantized. Symmetric per-channel quantization is applied, in which a linear scaling per output dimension maps the INT4 and floating point representations of the quantized weights. The GPTQ algorithm is applied for quantization, as implemented in the llm-compressor library. This model can be deployed efficiently using the vLLM backend. vLLM also supports OpenAI-compatible serving. See the documentation for more details. This model was evaluated on the lm-evaluation-harness. Metric Llama-3.1-8B (5-shot) Sparse-Llama-3.1-8B-2of4 (5-shot) Llama-3.1-8B-gsm8k (0-shot) Sparse-Llama-3.1-8B-gsm8k-2of4 (0-shot) Sparse-Llama-3.1-8B-gsm8k-2of4-quantized.w4a16 (0-shot)
Llama-3.1-8B-evolcodealpaca
Qwen2.5-0.5B-FP8-dynamic
Qwen2.5-Coder-32B-Instruct-FP8-dynamic
Phi-4-mini-instruct-quantized.w8a8
Llama2-7b-chat-pruned50-quant-ds
Nous-Hermes-2-SOLAR-10.7B-pruned50-quant-ds
OpenHermes-2.5-Mistral-7B-pruned50-quant-ds
Phi-3-medium-128k-instruct-quantized.w8a16
mpt-7b-gsm8k-pt
mpt-7b-gsm8k-quant-ds
MiniChat-1.5-3B-pruned50-quant-ds
phi-2-super-marlin
Qwen2.5-3B-quantized.w8a8
whisper-large-v2-W4A16-G128
Model Overview - Model Architecture: whisper-large-v2 - Input: Audio-Text - Output: Text - Model Optimizations: - Weight quantization: INT4 - Activation quantization: FP16 - Release Date: 1/31/2025 - Version: 1.0 - Model Developers: Neural Magic This model was obtained by quantizing the weights of openai/whisper-large-v2 to INT4 data type, ready for inference with vLLM >= 0.5.2. This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details. This model was created with llm-compressor by running the code snippet below as part a multimodal announcement blog. BibTeX entry and citation info ```bibtex @misc{radford2022whisper, doi = {10.48550/ARXIV.2212.04356}, url = {https://arxiv.org/abs/2212.04356}, author = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya}, title = {Robust Speech Recognition via Large-Scale Weak Supervision}, publisher = {arXiv}, year = {2022}, copyright = {arXiv.org perpetual, non-exclusive license} }