RedHatAI

500 models

Meta-Llama-3.1-70B-Instruct-quantized.w4a16

--- tags: - int4 - vllm language: - en - de - fr - it - pt - hi - es - th pipeline_tag: text-generation license: llama3.1 base_model: meta-llama/Meta-Llama-3.1-70B-Instruct ---

NaNK
llama
564,620
32

Mistral-Small-3.1-24B-Instruct-2503-quantized.w4a16

--- language: - en - fr - de - es - it - pt - hi - id - tl - vi - ar - bg - zh - da - el - fa - fi - he - ja - ko - ms - nl - no - pl - ro - ru - sr - sv - th - tr - uk - ur - zsm - nld base_model: - mistralai/Mistral-Small-3.1-24B-Instruct-2503 pipeline_tag: image-text-to-text tags: - mistralai - mistral - mistral3 - mistral-small - neuralmagic - redhat - llmcompressor - quantized - W4A16 - INT4 - conversational - compressed-tensors - fast license: apache-2.0 license_name: apache-2.0 name: RedH

NaNK
license:apache-2.0
343,096
9

Mistral-7B-Instruct-v0.3-GPTQ-4bit

--- license: apache-2.0 base_model: mistralai/Mistral-7B-Instruct-v0.3

NaNK
license:apache-2.0
316,226
23

Meta-Llama-3.1-8B-Instruct-FP8

--- tags: - fp8 - vllm language: - en - de - fr - it - pt - hi - es - th pipeline_tag: text-generation license: llama3.1 base_model: meta-llama/Meta-Llama-3.1-8B-Instruct ---

NaNK
llama
177,429
42

Llama-3.3-70B-Instruct-FP8-dynamic

--- language: - en - de - fr - it - pt - hi - es - th base_model: - meta-llama/Llama-3.3-70B-Instruct pipeline_tag: text-generation tags: - llama - facebook - meta - llama-3 - fp8 - quantized - conversational - text-generation-inference - compressed-tensors license: llama3.3 license_name: llama-3.3 name: RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic description: This model was obtained by quantizing activation and weights of Llama-3.3-70B-Instruct to FP8 data type. readme: https://huggingface.co/R

NaNK
llama
120,095
13

Llama-3.2-1B-Instruct-FP8-dynamic

NaNK
llama
115,469
3

gemma-3-27b-it-FP8-dynamic

NaNK
license:apache-2.0
59,054
9

Qwen2.5-1.5B-quantized.w8a8

NaNK
license:apache-2.0
55,533
2

Devstral-Small-2507-FP8-Dynamic

license:mit
54,173
1

Llama-3.2-1B-Instruct-FP8

NaNK
llama
50,858
3

gpt-oss-20b-speculator.eagle3

Model Overview
- Verifier: openai/gpt-oss-20b
- Speculative Decoding Algorithm: EAGLE-3
- Model Architecture: Eagle3Speculator
- Release Date: 11/21/2025
- Version: 2.0
- Model Developers: RedHat

This is a speculator model designed for use with openai/gpt-oss-20b, based on the EAGLE-3 speculative decoding algorithm. It was trained using the speculators library on a combination of the Magpie-Align/Magpie-Llama-3.1-Pro-300K-Filtered dataset and the `train_sft` split of the HuggingFaceH4/ultrachat_200k dataset. This model should be used with the openai/gpt-oss-20b chat template, specifically through the `/chat/completions` endpoint.

Text Summarization 1.63 2.05 2.18 2.31 2.33 2.38 2.35

Benchmark settings:
- temperature: 0.6
- top_p: 0.95
- top_k: 20
- repetitions: 3
- time per experiment: 10min
- hardware: 2xA100
- vLLM version: 0.11.0
- GuideLLM version: 0.3.0

Command
```bash
GUIDELLM__PREFERRED_ROUTE="chat_completions" \
guidellm benchmark \
  --target "http://localhost:8000/v1" \
  --data "RedHatAI/speculator_benchmarks" \
  --data-args '{"data_files": "HumanEval.jsonl"}' \
  --rate-type sweep \
  --max-seconds 600 \
  --output-path "gpt-oss-20b-HumanEval.json" \
  --backend-args '{"extra_body": {"chat_completions": {"temperature":0.0}}}'
```
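The card points users at the `/chat/completions` endpoint. As a hedged illustration of the client side only (the vLLM server, including how the speculator is attached to the verifier, is assumed to already be running at localhost:8000; launch flags are not shown here):

```python
# Minimal sketch: query a vLLM OpenAI-compatible server through /chat/completions.
# Assumes the server is serving openai/gpt-oss-20b with this speculator configured.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[{"role": "user", "content": "Summarize the benefits of speculative decoding."}],
    temperature=0.0,
)
print(response.choices[0].message.content)
```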

NaNK
license:apache-2.0
40,788
0

Qwen2.5-VL-7B-Instruct-FP8-Dynamic

Model Overview - Model Architecture: Qwen2.5-VL-7B-Instruct - Input: Vision-Text - Output: Text - Model Optimizations: - Weight quantization: FP8 - Activation quantization: FP8 - Release Date: 2/24/2025 - Version: 1.0 - Model Developers: Neural Magic This model was obtained by quantizing the weights of Qwen/Qwen2.5-VL-7B-Instruct to FP8 data type, ready for inference with vLLM >= 0.5.2. This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details. This model was created with llm-compressor by running the code snippet below as part of a multimodal announcement blog. The model was evaluated using mistral-evals for vision-related tasks and using lm-evaluation-harness for select text-based benchmarks. The evaluations were conducted using the following commands: Vision Tasks - vqav2 - docvqa - mathvista - mmmu - chartqa

| Category | Metric | Qwen/Qwen2.5-VL-7B-Instruct | neuralmagic/Qwen2.5-VL-7B-Instruct-FP8-Dynamic | Recovery (%) |
|---|---|---|---|---|
| Vision | MMMU (val, CoT) explicit_prompt_relaxed_correctness | 52.00 | 52.55 | 101.06% |
| Vision | ChartQA (test, CoT) anywhere_in_answer_relaxed_correctness | 86.44 | 86.80 | 100.42% |
| Vision | Mathvista (testmini, CoT) explicit_prompt_relaxed_correctness | 69.47 | 71.07 | 102.31% |

This model achieves up to 1.3x speedup in single-stream deployment and 1.37x in multi-stream deployment, depending on hardware and use-case scenario. The following performance benchmarks were conducted with vLLM version 0.7.2, and GuideLLM.

Single-stream performance (measured with vLLM version 0.7.2). Use case profiles (Image Size (WxH) / prompt tokens / generation tokens): Document Visual Question Answering 1680W x 2240H 64/128; Visual Reasoning 640W x 480H 128/128; Image Captioning 480W x 360H 0/128.

| Hardware | Model | Average Cost Reduction | DocVQA Latency (s) | DocVQA Queries Per Dollar | Visual Reasoning Latency (s) | Visual Reasoning Queries Per Dollar | Captioning Latency (s) | Captioning Queries Per Dollar |
|---|---|---|---|---|---|---|---|---|
| | neuralmagic/Qwen2.5-VL-7B-Instruct-quantized.w8a8 | 1.50 | 3.6 | 1248 | 2.1 | 2163 | 2.0 | 2237 |
| | neuralmagic/Qwen2.5-VL-7B-Instruct-quantized.w4a16 | 2.05 | 3.3 | 1351 | 1.4 | 3252 | 1.4 | 3321 |
| | neuralmagic/Qwen2.5-VL-7B-Instruct-quantized.w8a8 | 1.24 | 2.4 | 851 | 1.4 | 1454 | 1.3 | 1512 |
| | neuralmagic/Qwen2.5-VL-7B-Instruct-quantized.w4a16 | 1.49 | 2.2 | 912 | 1.1 | 1791 | 1.0 | 1950 |
| | neuralmagic/Qwen2.5-VL-7B-Instruct-FP8-Dynamic | 1.28 | 1.6 | 698 | 0.9 | 1181 | 0.9 | 1219 |
| | neuralmagic/Qwen2.5-VL-7B-Instruct-quantized.w4a16 | 1.28 | 1.6 | 686 | 0.9 | 1191 | 0.9 | 1228 |

QPD: Queries per dollar, based on on-demand cost at Lambda Labs (observed on 2/18/2025).

Multi-stream asynchronous performance (measured with vLLM version 0.7.2). Use case profiles (Image Size (WxH) / prompt tokens / generation tokens): Document Visual Question Answering 1680W x 2240H 64/128; Visual Reasoning 640W x 480H 128/128; Image Captioning 480W x 360H 0/128.

| Hardware | Model | Average Cost Reduction | DocVQA Maximum throughput (QPS) | DocVQA Queries Per Dollar | Visual Reasoning Maximum throughput (QPS) | Visual Reasoning Queries Per Dollar | Captioning Maximum throughput (QPS) | Captioning Queries Per Dollar |
|---|---|---|---|---|---|---|---|---|
| | neuralmagic/Qwen2.5-VL-7B-Instruct-quantized.w8a8 | 1.41 | 0.5 | 2297 | 2.3 | 10137 | 2.5 | 11472 |
| | neuralmagic/Qwen2.5-VL-7B-Instruct-quantized.w4a16 | 1.60 | 0.4 | 1828 | 2.7 | 12254 | 3.4 | 15477 |
| | neuralmagic/Qwen2.5-VL-7B-Instruct-quantized.w8a8 | 1.27 | 0.8 | 1639 | 3.4 | 6851 | 3.9 | 7918 |
| | neuralmagic/Qwen2.5-VL-7B-Instruct-quantized.w4a16 | 1.21 | 0.7 | 1314 | 3.0 | 5983 | 4.6 | 9206 |
| | neuralmagic/Qwen2.5-VL-7B-Instruct-FP8-Dynamic | 1.29 | 1.2 | 1331 | 3.8 | 4109 | 4.2 | 4598 |
| | neuralmagic/Qwen2.5-VL-7B-Instruct-quantized.w4a16 | 1.28 | 1.2 | 1298 | 3.8 | 4190 | 4.2 | 4573 |

QPD: Queries per dollar, based on on-demand cost at Lambda Labs (observed on 2/18/2025).
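The deployment example referenced above is not included in this listing. A hedged sketch of a vision request against vLLM's OpenAI-compatible server (the server is assumed to have been started separately, e.g. `vllm serve RedHatAI/Qwen2.5-VL-7B-Instruct-FP8-Dynamic`, and the image URL is a placeholder):

```python
# Minimal sketch: send an image + text prompt to a vision-language model behind a vLLM server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="RedHatAI/Qwen2.5-VL-7B-Instruct-FP8-Dynamic",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
            {"type": "text", "text": "Describe what this chart shows."},
        ],
    }],
    max_tokens=128,
)
print(response.choices[0].message.content)
```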

NaNK
license:apache-2.0
39,582
4

Qwen3-30B-A3B-Instruct-2507-speculator.eagle3

NaNK
license:apache-2.0
37,297
1

Qwen3.5-122B-A10B-NVFP4

NaNK
28,858
12

Meta-Llama-3.1-8B-Instruct-quantized.w4a16

NaNK
llama
27,582
30

Qwen3-VL-235B-A22B-Instruct-FP8-dynamic

Model Overview - Model Architecture: Qwen3VLMoeForConditionalGeneration - Input: Text/Image/Video - Output: Text - Model Optimizations: - Weight quantization: FP8 - Activation quantization: FP8 - Release Date: 09/28/2025 - Version: 1.0 - Model Developers: Red Hat Quantized version of Qwen/Qwen3-VL-235B-A22B-Instruct. This model was obtained by quantizing the weights and activations of Qwen/Qwen3-VL-235B-A22B-Instruct to FP8 data type. This optimization reduces the number of bits per parameter from 16 to 8, reducing the disk size and GPU memory requirements by approximately 50%. Only the weights and activations of the linear operators within transformers blocks of the language model are quantized. This model was quantized using the llm-compressor library as shown below. The model was evaluated on the OpenLLM v1 leaderboard tasks using lm-evaluation-harness, on reasoning tasks using lighteval and on vision tasks using lmms-eval. vLLM was used for all evaluations.

| Category | Metric | Qwen3-VL-235B-A22B-Instruct | Qwen3-VL-235B-A22B-Instruct-FP8-dynamic | Recovery (%) |
|---|---|---|---|---|
| OpenLLM V1 | ARC-Challenge (Acc-Norm, 25-shot) | 76.54 | 75.94 | 99.2 |

NaNK
license:apache-2.0
26,410
4

Meta-Llama-3.1-8B-Instruct-FP8-dynamic

NaNK
llama
25,169
9

Qwen3-32B-FP8-dynamic

Model Overview - Model Architecture: Qwen3ForCausalLM - Input: Text - Output: Text - Model Optimizations: - Activation quantization: FP8 - Weight quantization: FP8 - Intended Use Cases: - Reasoning. - Function calling. - Subject matter experts via fine-tuning. - Multilingual instruction following. - Translation. - Out-of-scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws). - Release Date: 05/02/2025 - Version: 1.0 - Model Developers: RedHat (Neural Magic) This model was obtained by quantizing activations and weights of Qwen3-32B to FP8 data type. This optimization reduces the number of bits used to represent weights and activations from 16 to 8, reducing GPU memory requirements (by approximately 50%) and increasing matrix-multiply compute throughput (by approximately 2x). Weight quantization also reduces disk size requirements by approximately 50%. Only weights and activations of the linear operators within transformers blocks are quantized. Weights are quantized with a symmetric static per-channel scheme, whereas activations are quantized with a symmetric dynamic per-token scheme. The llm-compressor library is used for quantization. This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details. Creation details: This model was created with llm-compressor by running the code snippet below. The model was evaluated on the OpenLLM leaderboard tasks (versions 1 and 2), using lm-evaluation-harness, and on reasoning tasks using lighteval. vLLM was used for all evaluations.
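The creation snippet referenced under "Creation details" is not reproduced in this listing. A hedged sketch of the data-free FP8 dynamic flow described above (import paths and arguments follow llm-compressor's documented QuantizationModifier usage and may vary by version):

```python
# Sketch: FP8 weights with dynamic per-token FP8 activations via llm-compressor.
# The FP8_DYNAMIC scheme requires no calibration data.
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot  # older releases: from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

model_id = "Qwen/Qwen3-32B"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Quantize the linear layers inside transformer blocks; keep the output head in higher precision.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])
oneshot(model=model, recipe=recipe)

save_dir = "Qwen3-32B-FP8-dynamic"
model.save_pretrained(save_dir, save_compressed=True)
tokenizer.save_pretrained(save_dir)
```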

NaNK
license:apache-2.0
22,974
15

Llama-4-Scout-17B-16E-Instruct-FP8-dynamic

NaNK
llama4
22,654
27

Meta-Llama-3.1-8B-Instruct-quantized.w8a8

Model Overview - Model Architecture: Meta-Llama-3 - Input: Text - Output: Text - Model Optimizations: - Activation quantization: INT8 - Weight quantization: INT8 - Intended Use Cases: Intended for commercial and research use in multiple languages. Similarly to Meta-Llama-3.1-8B-Instruct, this model is intended for assistant-like chat. - Out-of-scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws). - Release Date: 7/11/2024 - Version: 1.0 - Validated on: RHOAI 2.20, RHAIIS 3.0, RHELAI 1.5 - License(s): Llama3.1 - Model Developers: Neural Magic This model is a quantized version of Meta-Llama-3.1-8B-Instruct. It was evaluated on several tasks to assess its quality in comparison to the unquantized model, including multiple-choice, math reasoning, and open-ended text generation. Meta-Llama-3.1-8B-Instruct-quantized.w8a8 achieves 105.4% recovery for the Arena-Hard evaluation, 100.3% for OpenLLM v1 (using Meta's prompting when available), 101.5% for OpenLLM v2, 99.7% for HumanEval pass@1, and 98.8% for HumanEval+ pass@1. This model was obtained by quantizing the weights and activations of Meta-Llama-3.1-8B-Instruct to INT8 data type. This optimization reduces the number of bits used to represent weights and activations from 16 to 8, reducing GPU memory requirements (by approximately 50%) and increasing matrix-multiply compute throughput (by approximately 2x). Weight quantization also reduces disk size requirements by approximately 50%. Only weights and activations of the linear operators within transformers blocks are quantized. Weights are quantized with a symmetric static per-channel scheme, where a fixed linear scaling factor is applied between INT8 and floating point representations for each output channel dimension. Activations are quantized with a symmetric dynamic per-token scheme, computing a linear scaling factor at runtime for each token between INT8 and floating point representations. The GPTQ algorithm is applied for quantization, as implemented in the llm-compressor library. GPTQ used a 1% damping factor and 256 sequences of 8,192 random tokens. This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details. See Red Hat AI Inference Server documentation for more details. See Red Hat Enterprise Linux AI documentation for more details. See Red Hat Openshift AI documentation for more details. This model was created by using the llm-compressor library as presented in the code snippet below. This model was evaluated on the well-known Arena-Hard, OpenLLM v1, OpenLLM v2, HumanEval, and HumanEval+ benchmarks. In all cases, model outputs were generated with the vLLM engine. Arena-Hard evaluations were conducted using the Arena-Hard-Auto repository. The model generated a single answer for each prompt from Arena-Hard, and each answer was judged twice by GPT-4. We report below the scores obtained in each judgement and the average. OpenLLM v1 and v2 evaluations were conducted using Neural Magic's fork of lm-evaluation-harness (branch llama_3.1_instruct). This version of the lm-evaluation-harness includes versions of MMLU, ARC-Challenge and GSM-8K that match the prompting style of Meta-Llama-3.1-Instruct-evals and a few fixes to OpenLLM v2 tasks. HumanEval and HumanEval+ evaluations were conducted using Neural Magic's fork of the EvalPlus repository.
Detailed model outputs are available as HuggingFace datasets for Arena-Hard, OpenLLM v2, and HumanEval. Note: Results have been updated after Meta modified the chat template. Meta-Llama-3.1-8B-Instruct-quantized.w8a8 (this model) The results were obtained using the following commands:
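The creation snippet and evaluation commands referenced above are not reproduced in this listing. As a hedged sketch of the INT8 flow the card describes (GPTQ with a 1% damping factor; the ultrachat_200k slice below is an illustrative stand-in for the 256 sequences of 8,192 random tokens mentioned in the card, and exact llm-compressor arguments may vary by version):

```python
# Sketch: W8A8 INT8 quantization with llm-compressor (GPTQ weights, dynamic INT8 activations).
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

NUM_SAMPLES, MAX_LEN = 256, 8192
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split=f"train_sft[:{NUM_SAMPLES}]")

def tokenize(example):
    # Render each chat into plain text, then tokenize for calibration.
    text = tokenizer.apply_chat_template(example["messages"], tokenize=False)
    return tokenizer(text, max_length=MAX_LEN, truncation=True)

ds = ds.map(tokenize, remove_columns=ds.column_names)

recipe = GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"], dampening_frac=0.01)

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_LEN,
    num_calibration_samples=NUM_SAMPLES,
)

out_dir = "Meta-Llama-3.1-8B-Instruct-quantized.w8a8"
model.save_pretrained(out_dir, save_compressed=True)
tokenizer.save_pretrained(out_dir)
```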

NaNK
llama
21,318
18

gpt-oss-20b

NaNK
license:apache-2.0
21,170
5

Qwen3-VL-235B-A22B-Instruct-NVFP4

Model Overview - Model Architecture: Qwen/Qwen3-VL-235B-A22B-Instruct - Input: Text - Output: Text - Model Optimizations: - Weight quantization: FP4 - Activation quantization: FP4 - Out-of-scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages other than English. - Release Date: 10/29/2025 - Version: 1.0 - Model Developers: RedHatAI This model is a quantized version of Qwen/Qwen3-VL-235B-A22B-Instruct. It was evaluated on several tasks to assess its quality in comparison to the unquantized model. This model was obtained by quantizing the weights and activations of Qwen/Qwen3-VL-235B-A22B-Instruct to FP4 data type, ready for inference with vLLM>=0.9.1. This optimization reduces the number of bits per parameter from 16 to 4, reducing the disk size and GPU memory requirements by approximately 75%. Only the weights and activations of the linear operators within transformers blocks are quantized using LLM Compressor. This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details. This model was created by applying LLM Compressor with calibration samples from UltraChat, as presented in the code snippet below. This model was evaluated on the well-known OpenLLM v1, OpenLLM v2 and HumanEval64 benchmarks using lm-evaluation-harness. The Reasoning evals were done using lighteval. Category Metric Qwen/Qwen3-VL-235B-A22B-Instruct RedHatAI/Qwen3-VL-235B-A22B-Instruct-NVFP4 (this model) Recovery The results were obtained using the following commands:
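The deployment example and evaluation commands referenced above are not included in this listing. A minimal sketch of offline inference through vLLM's Python API follows; the tensor_parallel_size, context length, and sampling settings are illustrative assumptions, not values taken from the card:

```python
# Sketch: offline chat inference with vLLM for the NVFP4 checkpoint.
from vllm import LLM, SamplingParams

llm = LLM(
    model="RedHatAI/Qwen3-VL-235B-A22B-Instruct-NVFP4",
    tensor_parallel_size=8,   # size to your hardware; illustrative only
    max_model_len=8192,
)
sampling = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=256)

messages = [{"role": "user", "content": "Describe the trade-offs of FP4 quantization in one paragraph."}]
outputs = llm.chat(messages, sampling)
print(outputs[0].outputs[0].text)
```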

NaNK
license:apache-2.0
20,427
13

Llama-3.3-70B-Instruct-quantized.w8a8

Model Overview - Model Architecture: Llama - Input: Text - Output: Text - Model Optimizations: - Activation quantization: INT8 - Weight quantization: INT8 - Intended Use Cases: Intended for commercial and research use in multiple languages. Similarly to Llama-3.3-70B-Instruct, this model is intended for assistant-like chat. - Out-of-scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws). - Release Date: 01/20/2025 - Version: 1.0 - Validated on: RHOAI 2.20, RHAIIS 3.0, RHELAI 1.5 - Model Developers: Neural Magic Quantized version of Llama-3.3-70B-Instruct. It was evaluated on several tasks to assess its quality in comparison to the unquantized model, including multiple-choice, math reasoning, and open-ended text generation. Llama-3.3-70B-Instruct-quantized.w8a8 achieves 99.4% recovery for OpenLLM v1 (using Meta's prompting when available) and 100% for both HumanEval and HumanEval+ pass@1. This model was obtained by quantizing the weights and activations of Llama-3.3-70B-Instruct to INT8 data type. This optimization reduces the number of bits used to represent weights and activations from 16 to 8, reducing GPU memory requirements (by approximately 50%) and increasing matrix-multiply compute throughput (by approximately 2x). Weight quantization also reduces disk size requirements by approximately 50%. Only weights and activations of the linear operators within transformers blocks are quantized. Weights are quantized with a symmetric static per-channel scheme, where a fixed linear scaling factor is applied between INT8 and floating point representations for each output channel dimension. Activations are quantized with a symmetric dynamic per-token scheme, computing a linear scaling factor at runtime for each token between INT8 and floating point representations. This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details. See Red Hat AI Inference Server documentation for more details. See Red Hat Enterprise Linux AI documentation for more details. See Red Hat Openshift AI documentation for more details. This model was created by using the llm-compressor library as presented in the code snippet below. This model was evaluated on the well-known OpenLLM v1, OpenLLM v2, HumanEval, and HumanEval+ benchmarks. In all cases, model outputs were generated with the vLLM engine. OpenLLM v1 and v2 evaluations were conducted using Neural Magic's fork of lm-evaluation-harness (branch llama_3.1_instruct). This version of the lm-evaluation-harness includes versions of MMLU, ARC-Challenge and GSM-8K that match the prompting style of Meta-Llama-3.1-Instruct-evals and a few fixes to OpenLLM v2 tasks. HumanEval and HumanEval+ evaluations were conducted using Neural Magic's fork of the EvalPlus repository. The results were obtained using the following commands:
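The evaluation commands themselves are not reproduced in this listing. A hedged sketch of an equivalent invocation through the lm-evaluation-harness Python API with the vLLM backend (task names and model arguments are illustrative; the card's exact setup used Neural Magic's fork and Meta-style prompting):

```python
# Sketch: OpenLLM-v1-style evaluation via lm-evaluation-harness with a vLLM backend.
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args=(
        "pretrained=RedHatAI/Llama-3.3-70B-Instruct-quantized.w8a8,"
        "dtype=auto,tensor_parallel_size=4,gpu_memory_utilization=0.9"
    ),
    tasks=["arc_challenge", "gsm8k", "hellaswag", "mmlu", "truthfulqa_mc2", "winogrande"],
)
print(results["results"])
```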

NaNK
llama
20,281
12

Qwen3-8B-FP8-dynamic

NaNK
license:apache-2.0
19,721
11

gemma-3-27b-it-quantized.w4a16

Model Overview - Model Architecture: google/gemma-3-27b-it - Input: Vision-Text - Output: Text - Model Optimizations: - Weight quantization: INT4 - Activation quantization: FP16 - Release Date: 6/4/2025 - Version: 1.0 - Model Developers: RedHatAI This model was obtained by quantizing the weights of google/gemma-3-27b-it to INT4 data type, ready for inference with vLLM >= 0.8.0. This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details. This model was created with llm-compressor by running the code snippet below: The model was evaluated using lm-evaluation-harness for the OpenLLM v1 text benchmark. The evaluations were conducted using the following commands: Category Metric google/gemma-3-27b-it RedHatAI/gemma-3-27b-it-quantized.w4a16 Recovery (%)

NaNK
18,379
9

Meta-Llama-3-8B-Instruct-FP8-KV

NaNK
llama
16,461
8

Qwen2-7B-Instruct-FP8

Model Overview - Model Architecture: Qwen2 - Input: Text - Output: Text - Model Optimizations: - Weight quantization: FP8 - Activation quantization: FP8 - Intended Use Cases: Intended for commercial and research use in English. Similarly to Meta-Llama-3-8B-Instruct, this model is intended for assistant-like chat. - Out-of-scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages other than English. - Release Date: 6/14/2024 - Version: 1.0 - License(s): apache-2.0 - Model Developers: Neural Magic Quantized version of Qwen2-7B-Instruct. It achieves an average score of 69.44 on the OpenLLM benchmark (version 1), whereas the unquantized model achieves 69.55. This model was obtained by quantizing the weights and activations of Qwen2-7B-Instruct to FP8 data type, ready for inference with vLLM >= 0.5.0. This optimization reduces the number of bits per parameter from 16 to 8, reducing the disk size and GPU memory requirements by approximately 50%. Only the weights and activations of the linear operators within transformers blocks are quantized. Symmetric per-tensor quantization is applied, in which a single linear scaling maps the FP8 representations of the quantized weights and activations. AutoFP8 is used for quantization with 512 sequences of UltraChat. This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details. This model was created by applying AutoFP8 with calibration samples from ultrachat, as presented in the code snippet below. Although AutoFP8 was used for this particular model, Neural Magic is transitioning to using llm-compressor which supports several quantization schemes and models not supported by AutoFP8. The model was evaluated on the OpenLLM leaderboard tasks (version 1) with the lm-evaluation-harness (commit 383bbd54bc621086e05aa1b030d8d4d5635b25e6) and the vLLM engine, using the following command:
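The AutoFP8 snippet and evaluation command referenced above are not included in this listing. A hedged sketch of the static FP8 flow the card describes, following the AutoFP8 project's documented API (names and arguments may differ by version; the calibration slice is illustrative):

```python
# Sketch: static per-tensor FP8 quantization with AutoFP8, calibrated on UltraChat samples.
from datasets import load_dataset
from transformers import AutoTokenizer
from auto_fp8 import AutoFP8ForCausalLM, BaseQuantizeConfig

model_id = "Qwen/Qwen2-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# 512 calibration sequences, as mentioned in the card.
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft[:512]")
texts = [tokenizer.apply_chat_template(row["messages"], tokenize=False) for row in ds]
examples = tokenizer(texts, padding=True, truncation=True, max_length=4096, return_tensors="pt")

quantize_config = BaseQuantizeConfig(quant_method="fp8", activation_scheme="static")
model = AutoFP8ForCausalLM.from_pretrained(model_id, quantize_config=quantize_config)
model.quantize(examples)
model.save_quantized("Qwen2-7B-Instruct-FP8")
```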

NaNK
license:apache-2.0
16,224
2

DeepSeek-Coder-V2-Lite-Instruct-FP8

15,111
7

Qwen3-8B-speculator.eagle3

Model Overview
- Verifier: Qwen/Qwen3-8B
- Speculative Decoding Algorithm: EAGLE-3
- Model Architecture: Eagle3Speculator
- Release Date: 07/27/2025
- Version: 1.0
- Model Developers: RedHat

This is a speculator model designed for use with Qwen/Qwen3-8B, based on the EAGLE-3 speculative decoding algorithm. It was trained using the speculators library on a combination of the Aeala/ShareGPT_Vicuna_unfiltered and the HuggingFaceH4/ultrachat_200k datasets. The model was trained with thinking disabled. This model should be used with the Qwen/Qwen3-8B chat template, specifically through the `/chat/completions` endpoint.

Text Summarization 1.62 1.96 2.13 2.24 2.25 2.29 2.30

Benchmark settings:
- temperature: 0.6
- top_p: 0.95
- top_k: 20
- repetitions: 3
- time per experiment: 10min
- hardware: 1xA100
- vLLM version: 0.11.0
- GuideLLM version: 0.3.0

Command
```bash
GUIDELLM__PREFERRED_ROUTE="chat_completions" \
guidellm benchmark \
  --target "http://localhost:8000/v1" \
  --data "RedHatAI/SpeculativeDecoding" \
  --rate-type sweep \
  --max-seconds 600 \
  --output-path "Qwen3-8B-HumanEval.json" \
  --backend-args '{"extra_body": {"chat_completions": {"temperature":0.6, "top_p":0.95, "top_k":20}}}'
```

NaNK
license:apache-2.0
13,288
0

Llama-3.2-11B-Vision-Instruct-FP8-dynamic

NaNK
mllama
11,956
24

DeepSeek-R1-Distill-Llama-8B-quantized.w8a8

NaNK
llama
10,577
2

Llama-3.2-1B-Instruct-quantized.w8a8

Model Overview - Model Architecture: Llama-3 - Input: Text - Output: Text - Model Optimizations: - Activation quantization: INT8 - Weight quantization: INT8 - Intended Use Cases: Intended for commercial and research use in multiple languages. Similarly to Llama-3.2-1B-Instruct, this model is intended for assistant-like chat. - Out-of-scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws). - Release Date: 9/25/2024 - Version: 1.0 - License(s): Llama3.2 - Model Developers: Neural Magic Quantized version of Llama-3.2-1B-Instruct. It achieves scores within 5% of the scores of the unquantized model for MMLU, ARC-Challenge, GSM-8k, Hellaswag, Winogrande and TruthfulQA. This model was obtained by quantizing the weights of Llama-3.2-1B-Instruct to INT8 data type. This optimization reduces the number of bits used to represent weights and activations from 16 to 8, reducing GPU memory requirements (by approximately 50%) and increasing matrix-multiply compute throughput (by approximately 2x). Weight quantization also reduces disk size requirements by approximately 50%. Only weights and activations of the linear operators within transformers blocks are quantized. Weights are quantized with a symmetric static per-channel scheme, where a fixed linear scaling factor is applied between INT8 and floating point representations for each output channel dimension. Activations are quantized with a symmetric dynamic per-token scheme, computing a linear scaling factor at runtime for each token between INT8 and floating point representations. The SmoothQuant algorithm is used to alleviate outliers in the activations, whereas the GPTQ algorithm is applied for quantization. Both algorithms are implemented in the llm-compressor library. GPTQ used a 1% damping factor and 512 sequences taken from Neural Magic's LLM compression calibration dataset. This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details. This model was created by using the llm-compressor library as presented in the code snippet below. The model was evaluated on MMLU, ARC-Challenge, GSM-8K, Hellaswag, Winogrande and TruthfulQA. Evaluation was conducted using the Neural Magic fork of lm-evaluation-harness (branch llama_3.1_instruct) and the vLLM engine. This version of the lm-evaluation-harness includes versions of MMLU, ARC-Challenge and GSM-8K that match the prompting style of Meta-Llama-3.1-Instruct-evals. The results were obtained using the following commands:

NaNK
llama
10,123
7

Mistral-Small-3.1-24B-Instruct-2503-FP8-dynamic

NaNK
license:apache-2.0
9,380
9

Qwen3-32B-NVFP4A16

NaNK
license:apache-2.0
9,334
1

Qwen3-Coder-Next-NVFP4

license:apache-2.0
9,056
3

Meta-Llama-3.1-8B-FP8

NaNK
llama
8,923
9

Llama-3.1-8B-Instruct-speculator.eagle3

Model Overview
- Verifier: meta-llama/Llama-3.1-8B-Instruct
- Speculative Decoding Algorithm: EAGLE-3
- Model Architecture: Eagle3Speculator
- Release Date: 07/27/2025
- Version: 1.0
- Model Developers: RedHat

This is a speculator model designed for use with meta-llama/Llama-3.1-8B-Instruct, based on the EAGLE-3 speculative decoding algorithm. It was trained using the speculators library on a combination of the Aeala/ShareGPT_Vicuna_unfiltered and the HuggingFaceH4/ultrachat_200k datasets. This model should be used with the meta-llama/Llama-3.1-8B-Instruct chat template, specifically through the `/chat/completions` endpoint.

Text Summarization 1.70 2.19 2.50 2.78 2.77 2.98 2.99

Benchmark settings:
- temperature: 0.6
- top_p: 0.9
- repetitions: 5
- time per experiment: 3min
- hardware: 1xA100
- vLLM version: 0.11.0
- GuideLLM version: 0.3.0

Command
```bash
GUIDELLM__PREFERRED_ROUTE="chat_completions" \
guidellm benchmark \
  --target "http://localhost:8000/v1" \
  --data "RedHatAI/SpeculativeDecoding" \
  --rate-type sweep \
  --max-seconds 180 \
  --output-path "Llama-3.1-8B-Instruct-HumanEval.json" \
  --backend-args '{"extra_body": {"chat_completions": {"temperature":0.0}}}'
```

NaNK
llama
8,627
1

Meta-Llama-3-70B-Instruct-FP8

NaNK
llama
8,545
13

Llama-3.2-1B-quantized.w8a8

NaNK
llama
8,320
1

Qwen2.5-VL-3B-Instruct-quantized.w8a8

NaNK
license:apache-2.0
7,775
2

Llama-3.2-90B-Vision-Instruct-FP8-dynamic

NaNK
mllama
7,267
10

Voxtral-Small-24B-2507-FP8-dynamic

Model Overview
- Model Architecture: VoxtralForConditionalGeneration
- Input: Audio-Text
- Output: Text
- Model Optimizations:
  - Weight quantization: FP8
  - Activation quantization: FP8
- Intended Use Cases: Voxtral builds upon Ministral-3B with powerful audio understanding capabilities.
  - Dedicated transcription mode: Voxtral can operate in a pure speech transcription mode to maximize performance. By default, Voxtral automatically predicts the source audio language and transcribes the text accordingly.
  - Long-form context: With a 32k token context length, Voxtral handles audios up to 30 minutes for transcription, or 40 minutes for understanding.
  - Built-in Q&A and summarization: Supports asking questions directly through audio. Analyze audio and generate structured summaries without the need for separate ASR and language models.
  - Natively multilingual: Automatic language detection and state-of-the-art performance in the world's most widely used languages (English, Spanish, French, Portuguese, Hindi, German, Dutch, Italian).
  - Function-calling straight from voice: Enables direct triggering of backend functions, workflows, or API calls based on spoken user intents.
  - Highly capable at text: Retains the text understanding capabilities of its language model backbone, Ministral-3B.
- Release Date: 08/21/2025
- Version: 1.0
- Model Developers: Red Hat

This model was obtained by quantizing activations and weights of Voxtral-Small-24B-2507 to FP8 data type. This optimization reduces the number of bits used to represent weights and activations from 16 to 8, reducing GPU memory requirements (by approximately 50%) and increasing matrix-multiply compute throughput (by approximately 2x). Weight quantization also reduces disk size requirements by approximately 50%. Only weights and activations of the MLP operators within transformers blocks of the language model are quantized. Weights are quantized with a symmetric static per-channel scheme, whereas activations are quantized with a symmetric dynamic per-token scheme. The llm-compressor library is used for quantization.

Once the model is served with vLLM, send requests to the server according to the use case; see the following examples.

This model was quantized using the llm-compressor library as shown below. After quantization, the model can be converted back into the mistralai format using the `convert_voxtral_hf_to_mistral.py` script included with the model.

The model was evaluated on the Fleurs transcription task. Recovery is computed with respect to the complement of the word error rate (WER).

Benchmark Language Voxtral-Small-24B-2507 Voxtral-Small-24B-2507-FP8-dynamic (this model) Recovery
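The request examples referenced above are not included in this listing. As a hedged illustration of an audio-understanding call against a vLLM OpenAI-compatible server (the `audio_url` content part is the multimodal extension vLLM uses for audio inputs; the exact schema, the server launch flags, and the URL below are assumptions):

```python
# Sketch: audio question answering through /chat/completions on a vLLM server assumed
# to be serving RedHatAI/Voxtral-Small-24B-2507-FP8-dynamic at localhost:8000.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="RedHatAI/Voxtral-Small-24B-2507-FP8-dynamic",
    messages=[{
        "role": "user",
        "content": [
            {"type": "audio_url", "audio_url": {"url": "https://example.com/meeting.wav"}},
            {"type": "text", "text": "Summarize this recording in three bullet points."},
        ],
    }],
    max_tokens=256,
)
print(response.choices[0].message.content)
```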

NaNK
license:apache-2.0
6,857
0

gemma-3-4b-it-quantized.w4a16

NaNK
6,756
2

Qwen3-30B-A3B-FP8-dynamic

NaNK
license:apache-2.0
6,679
3

Qwen2-1.5B-Instruct-FP8

NaNK
license:apache-2.0
6,000
0

Qwen2.5-VL-3B-Instruct-FP8-dynamic

NaNK
license:apache-2.0
5,878
3

Mistral-Small-3.2-24B-Instruct-2506-NVFP4

Model Overview - Model Architecture: unsloth/Mistral-Small-3.2-24B-Instruct-2506 - Input: Text - Output: Text - Model Optimizations: - Weight quantization: FP4 - Activation quantization: FP4 - Out-of-scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages other than English. - Release Date: 10/29/2025 - Version: 1.0 - Model Developers: RedHatAI This model is a quantized version of unsloth/Mistral-Small-3.2-24B-Instruct-2506. It was evaluated on several tasks to assess its quality in comparison to the unquantized model. This model was obtained by quantizing the weights and activations of unsloth/Mistral-Small-3.2-24B-Instruct-2506 to FP4 data type, ready for inference with vLLM>=0.9.1. This optimization reduces the number of bits per parameter from 16 to 4, reducing the disk size and GPU memory requirements by approximately 75%. Only the weights and activations of the linear operators within transformers blocks are quantized using LLM Compressor. This model was created by applying LLM Compressor with calibration samples from UltraChat, as presented in the code snippet below. This model was evaluated on the well-known OpenLLM v1, OpenLLM v2 and HumanEval64 benchmarks using lm-evaluation-harness. Category Metric unsloth/Mistral-Small-3.2-24B-Instruct-2506 RedHatAI/Mistral-Small-3.2-24B-Instruct-2506-NVFP4 Recovery The results were obtained using the following commands:
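The creation snippet and evaluation commands referenced above are not included in this listing. A hedged sketch of the NVFP4 flow described (the scheme name and arguments follow llm-compressor's FP4 examples and may vary by version; the UltraChat calibration slice is illustrative):

```python
# Sketch: NVFP4 weight + activation quantization with llm-compressor, calibrated on UltraChat.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

model_id = "unsloth/Mistral-Small-3.2-24B-Instruct-2506"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

NUM_SAMPLES, MAX_LEN = 512, 2048
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split=f"train_sft[:{NUM_SAMPLES}]")
ds = ds.map(
    lambda row: tokenizer(
        tokenizer.apply_chat_template(row["messages"], tokenize=False),
        max_length=MAX_LEN,
        truncation=True,
    ),
    remove_columns=ds.column_names,
)

recipe = QuantizationModifier(targets="Linear", scheme="NVFP4", ignore=["lm_head"])
oneshot(model=model, dataset=ds, recipe=recipe, max_seq_length=MAX_LEN, num_calibration_samples=NUM_SAMPLES)

out_dir = "Mistral-Small-3.2-24B-Instruct-2506-NVFP4"
model.save_pretrained(out_dir, save_compressed=True)
tokenizer.save_pretrained(out_dir)
```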

NaNK
license:apache-2.0
5,784
3

gemma-3-12b-it-FP8-dynamic

NaNK
license:apache-2.0
5,525
3

Meta-Llama-3.1-70B-Instruct-FP8

NaNK
llama
5,199
50

gemma-3-1b-it-FP8-dynamic

NaNK
5,040
0

Qwen3-30B-A3B-quantized.w4a16

NaNK
license:apache-2.0
4,919
5

Llama-3.2-3B-Instruct-FP8

NaNK
llama
4,817
6

Mistral-Nemo-Instruct-2407-FP8

license:apache-2.0
4,777
18

Meta-Llama-3.1-70B-Instruct-quantized.w8a8

NaNK
llama
4,682
21

DeepSeek-R1-Distill-Llama-70B-quantized.w8a8

NaNK
llama
4,125
2

Meta-Llama-3.1-8B-Instruct-quantized.w8a16

NaNK
llama
3,989
12

Meta-Llama-3-8B-Instruct-quantized.w8a8

NaNK
llama
3,909
2

Qwen2.5-Coder-14B-Instruct-FP8-dynamic

NaNK
license:apache-2.0
3,777
1

DeepSeek-R1-Distill-Qwen-7B-quantized.w8a8

NaNK
license:mit
3,368
4

Qwen3-8B-quantized.w4a16

NaNK
license:apache-2.0
3,249
2

phi-4-quantized.w8a8

license:mit
3,071
2

DeepSeek-R1-Distill-Qwen-32B-quantized.w8a8

NaNK
license:mit
2,912
13

DeepSeek-R1-Distill-Qwen-1.5B-quantized.w8a8

NaNK
license:mit
2,904
2

Meta-Llama-3.1-405B-Instruct-FP8-dynamic

NaNK
llama
2,869
15

Llama-4-Maverick-17B-128E-Instruct-NVFP4

NaNK
llama4
2,818
2

DeepSeek-R1-Distill-Qwen-14B-quantized.w8a8

NaNK
license:mit
2,792
2

Llama-3.2-3B-quantized.w8a8

NaNK
llama
2,725
0

Meta-Llama-3-8B-Instruct-FP8

NaNK
llama
2,638
24

Llama-4-Scout-17B-16E-Instruct-quantized.w4a16

NaNK
llama4
2,520
12

gemma-3-4b-it-FP8-dynamic

NaNK
license:apache-2.0
2,184
0

Magistral-Small-2506-FP8

license:apache-2.0
2,110
6

gemma-3-12b-it-quantized.w8a8

NaNK
2,109
3

Qwen2.5-VL-72B-Instruct-FP8-dynamic

NaNK
license:apache-2.0
2,006
11

Qwen2.5-VL-7B-Instruct-quantized.w8a8

NaNK
license:apache-2.0
1,869
8

NVIDIA-Nemotron-Nano-9B-v2-FP8-dynamic

Model Overview - Model Architecture: NemotronHForCausalLM - Input: Text - Output: Text - Model Optimizations: - Weight quantization: FP8 - Activation quantization: FP8 - Release Date: 9/30/2025 - V...

NaNK
1,842
3

Qwen3-0.6B-FP8-BLOCK

NaNK
1,828
0

gemma-3-12b-it-quantized.w4a16

NaNK
1,778
2

Qwen3-30B-A3B-NVFP4

Model Overview - Model Architecture: Qwen/Qwen3-30B-A3B - Input: Text - Output: Text - Model Optimizations: - Weight quantization: FP4 - Activation quantization: FP4 - Out-of-scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages other than English. - Release Date: 10/29/2025 - Version: 1.0 - Model Developers: RedHatAI This model is a quantized version of Qwen/Qwen3-30B-A3B. It was evaluated on several tasks to assess its quality in comparison to the unquantized model. This model was obtained by quantizing the weights and activations of Qwen/Qwen3-30B-A3B to FP4 data type, ready for inference with vLLM>=0.9.1. This optimization reduces the number of bits per parameter from 16 to 4, reducing the disk size and GPU memory requirements by approximately 75%. Only the weights and activations of the linear operators within transformers blocks are quantized using LLM Compressor. This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details. This model was created by applying LLM Compressor with calibration samples from UltraChat, as presented in the code snippet below. This model was evaluated on the well-known OpenLLM v1, OpenLLM v2 and HumanEval64 benchmarks using lm-evaluation-harness. The Reasoning evals were done using lighteval. Category Metric Qwen3-30B-A3B Qwen3-30B-A3B-NVFP4 (this model) Recovery The results were obtained using the following commands:

NaNK
license:apache-2.0
1,773
0

Meta-Llama-3.1-70B-Instruct-FP8-dynamic

NaNK
llama
1,585
7

gpt-oss-120b-FP8-dynamic

Model Overview - Model Architecture: gpt-oss-120b-BF16 - Input: Text - Output: Text - Model Optimizations: - Weight quantization: FP8 - Activation quantization: FP8 - Release Date: 08/13/2025 - Version: 1.0 - Model Developers: RedHatAI

NaNK
license:apache-2.0
1,493
7

DeepSeek-R1-Distill-Qwen-14B-quantized.w4a16

NaNK
license:mit
1,482
1

Pixtral-Large-Instruct-2411-hf-quantized.w4a16

1,409
0

Llama-4-Scout-17B-16E-Instruct

NaNK
llama4
1,353
0

Qwen3-4B-quantized.w4a16

NaNK
license:apache-2.0
1,297
1

gemma-3-27b-it-quantized.w8a8

NaNK
1,261
7

gpt-oss-120b-speculator.eagle3

NaNK
license:apache-2.0
1,067
0

whisper-large-v3-quantized.w4a16

Model Overview - Model Architecture: whisper-large-v3 - Input: Audio-Text - Output: Text - Model Optimizations: - Weight quantization: INT4 - Release Date: 04/16/2025 - Version: 1.0 - Model Developers: Neural Magic This model was obtained by quantizing the weights of openai/whisper-large-v3 to INT4 data type, ready for inference with vLLM >= 0.5.2. This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details. This model was created with llm-compressor by running the code snippet below. The model was evaluated on LibriSpeech and Fleurs datasets using lmms-eval, via the following commands:
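The evaluation commands are not reproduced in this listing. As a usage illustration, a hedged transcription request against a vLLM OpenAI-compatible server; availability of the `/v1/audio/transcriptions` route depends on the vLLM version, the server is assumed to be started separately (e.g. `vllm serve RedHatAI/whisper-large-v3-quantized.w4a16`), and the audio path is a placeholder:

```python
# Sketch: speech-to-text through the OpenAI-compatible transcription endpoint of a vLLM server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

with open("sample.flac", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="RedHatAI/whisper-large-v3-quantized.w4a16",
        file=audio_file,
    )
print(transcription.text)
```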

NaNK
license:apache-2.0
1,017
3

Qwen3-0.6B-FP8-dynamic

NaNK
license:apache-2.0
1,014
0

Llama-3.1-8B-Instruct

NaNK
llama
978
5

Qwen3-14B-FP8-dynamic

NaNK
license:apache-2.0
971
4

Mistral-7B-Instruct-v0.3-FP8

NaNK
license:apache-2.0
964
3

gemma-3-1b-it-quantized.w8a8

Model Overview - Model Architecture: google/gemma-3-1b-it - Input: Vision-Text - Output: Text - Model Optimizations: - Weight quantization: INT8 - Activation quantization: INT8 - Release Date: 6/4/2025 - Version: 1.0 - Model Developers: RedHatAI This model was obtained by quantizing the weights of google/gemma-3-1b-it to INT8 data type, ready for inference with vLLM >= 0.8.0. This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details. This model was created with llm-compressor by running the code snippet below: The model was evaluated using lm-evaluation-harness for the OpenLLM v1 text benchmark. The evaluations were conducted using the following commands: Category Metric google/gemma-3-1b-it RedHatAI/gemma-3-1b-it-quantized.w8a8 Recovery (%)

NaNK
942
0

Qwen2.5-VL-7B-Instruct-quantized.w4a16

NaNK
license:apache-2.0
888
7

Qwen2.5-VL-3B-Instruct-quantized.w4a16

NaNK
license:apache-2.0
875
3

Llama-2-7b-ultrachat200k

NaNK
llama
872
0

Mistral-Small-24B-Instruct-2501-FP8-dynamic

NaNK
license:apache-2.0
871
13

gemma-3-1b-it-quantized.w4a16

NaNK
863
0

Qwen3.5-35B-A3B-FP8-dynamic

NaNK
license:apache-2.0
851
2

phi-4-quantized.w4a16

license:mit
846
3

gpt-oss-120b

NaNK
license:apache-2.0
770
4

NVIDIA-Nemotron-3-Nano-30B-A3B-FP8

NaNK
767
4

DeepSeek-R1-Distill-Qwen-32B-FP8-dynamic

NaNK
license:mit
741
8

Qwen3-30B-A3B-Instruct-2507-quantized.w4a16

NaNK
license:apache-2.0
720
1

DeepSeek-R1-Distill-Llama-8B-quantized.w4a16

NaNK
llama
706
0

Llama-3.1-Nemotron-70B-Instruct-HF-FP8-dynamic

NaNK
llama
701
14

Llama-2-7b-gsm8k

NaNK
llama
680
5

Qwen3-235B-A22B-FP8-dynamic

NaNK
license:apache-2.0
678
2

Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8

NaNK
license:apache-2.0
658
5

Qwen3-235B-A22B-Instruct-2507-NVFP4

Model Overview - Model Architecture: Qwen/Qwen3-235B-A22B-Instruct-2507 - Input: Text - Output: Text - Model Optimizations: - Weight quantization: FP4 - Activation quantization: FP4 - Out-of-scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages other than English. - Release Date: 10/29/2025 - Version: 1.0 - Model Developers: RedHatAI This model is a quantized version of Qwen/Qwen3-235B-A22B-Instruct-2507. It was evaluated on several tasks to assess its quality in comparison to the unquantized model. This model was obtained by quantizing the weights and activations of Qwen/Qwen3-235B-A22B-Instruct-2507 to FP4 data type, ready for inference with vLLM>=0.9.1. This optimization reduces the number of bits per parameter from 16 to 4, reducing the disk size and GPU memory requirements by approximately 75%. Only the weights and activations of the linear operators within transformers blocks are quantized using LLM Compressor. This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details. This model was created by applying LLM Compressor with calibration samples from UltraChat, as presented in the code snippet below. This model was evaluated on the well-known OpenLLM v1, OpenLLM v2 and HumanEval64 benchmarks using lm-evaluation-harness. The Reasoning evals were done using lighteval. Category Metric Qwen/Qwen3-235B-A22B-Instruct-2507 RedHatAI/Qwen3-235B-A22B-Instruct-2507-NVFP4 (this model) Recovery The results were obtained using the following commands:

NaNK
license:apache-2.0
629
4

llama2.c-stories110M-pruned50

NaNK
llama
627
0

DeepSeek-R1-Distill-Qwen-32B-quantized.w4a16

NaNK
license:mit
625
5

granite-4.0-h-small-FP8-dynamic

license:apache-2.0
622
0

Meta-Llama-3.1-70B-FP8

NaNK
llama
595
2

Llama-3.3-70B-Instruct-quantized.w4a16

NaNK
llama
588
3

Qwen3-32B-NVFP4

NaNK
license:apache-2.0
579
3

gemma-3n-E4B-it-FP8-dynamic

NaNK
572
3

Llama-4-Maverick-17B-128E-Instruct-quantized.w4a16

NaNK
llama4
563
1

Meta-Llama-3.1-405B-Instruct-FP8

NaNK
llama
553
31

Qwen3-32B-quantized.w4a16

NaNK
license:apache-2.0
539
11

Qwen3-32B-speculator.eagle3

Model Overview
- Verifier: Qwen/Qwen3-32B
- Speculative Decoding Algorithm: EAGLE-3
- Model Architecture: Eagle3Speculator
- Release Date: 09/17/2025
- Version: 1.0
- Model Developers: RedHat

This is a speculator model designed for use with Qwen/Qwen3-32B, based on the EAGLE-3 speculative decoding algorithm. It was trained using the speculators library on a combination of the Aeala/ShareGPT_Vicuna_unfiltered and the `train_sft` split of the HuggingFaceH4/ultrachat_200k datasets. This model should be used with the Qwen/Qwen3-32B chat template, specifically through the `/chat/completions` endpoint.

Text Summarization 1.62 1.95 2.15 2.23 2.27 2.32 2.33

Benchmark settings:
- temperature: 0.6
- top_p: 0.95
- top_k: 20
- repetitions: 3
- time per experiment: 10min
- hardware: 2xA100
- vLLM version: 0.11.0
- GuideLLM version: 0.3.0

Command
```bash
GUIDELLM__PREFERRED_ROUTE="chat_completions" \
guidellm benchmark \
  --target "http://localhost:8000/v1" \
  --data "RedHatAI/speculator_benchmarks" \
  --data-args '{"data_files": "HumanEval.jsonl"}' \
  --rate-type sweep \
  --max-seconds 600 \
  --output-path "Qwen3-32B-HumanEval.json" \
  --backend-args '{"extra_body": {"chat_completions": {"temperature":0.6, "top_p":0.95, "top_k":20}}}'
```

NaNK
license:apache-2.0
527
4

Llama-3.3-70B-Instruct

NaNK
llama
522
0

Qwen3-Next-80B-A3B-Instruct-quantized.w4a16

NaNK
license:apache-2.0
507
1

DeepSeek-R1-Distill-Llama-70B-FP8-dynamic

NaNK
llama
500
9

granite-3.1-8b-instruct-quantized.w4a16

NaNK
license:apache-2.0
492
1

Devstral-Small-2507-quantized.w8a8

Model Overview - Model Architecture: MistralForCausalLM - Input: Text - Output: Text - Model Optimizations: - Activation quantization: INT8 - Weight quantization: INT8 - Release Date: 08/29/2025 - Version: 1.0 - Model Developers: Red Hat (Neural Magic) This model was obtained by quantizing weights and activations of Devstral-Small-2507 to INT8 data type. This optimization reduces the number of bits used to represent weights and activations from 16 to 8, reducing GPU memory requirements (by approximately 50%). Weight quantization also reduces disk size requirements by approximately 50%. This model was created with llm-compressor by running the code snippet below. This model can be deployed efficiently using the vLLM backend, as shown in the example below. The model was evaluated on popular coding tasks (HumanEval, HumanEval+, MBPP, MBPP+) via EvalPlus and vllm backend (v0.10.1.1). For evaluations, we run greedy sampling and report pass@1. The command to reproduce evals: | | Recovery (%) | mistralai/Devstral-Small-2507 | RedHatAI/Devstral-Small-2507-quantized.w8a8 (this model) | | --------------------------- | :----------: | :------------------: | :--------------------------------------------------: | | HumanEval | 100.67 | 89.0 | 89.6 | | HumanEval+ | 101.48 | 81.1 | 82.3 | | MBPP | 98.71 | 77.5 | 76.5 | | MBPP+ | 102.42 | 66.1 | 67.7 | | Average Score | 100.77 | 78.43 | 79.03 |
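The deployment example referenced above is not included in this listing. A minimal sketch of offline inference with vLLM; greedy sampling mirrors the evaluation setup described in the card, while the other settings (and any tokenizer-mode flags Devstral may need) are assumptions:

```python
# Sketch: offline code generation with vLLM for the INT8 Devstral checkpoint.
from vllm import LLM, SamplingParams

llm = LLM(model="RedHatAI/Devstral-Small-2507-quantized.w8a8")
params = SamplingParams(temperature=0.0, max_tokens=256)  # greedy, as in the card's evals

prompt = "Write a Python function that returns True when a string is a palindrome."
outputs = llm.generate([prompt], params)
print(outputs[0].outputs[0].text)
```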

NaNK
license:mit
488
1

Qwen3-0.6B-quantized.w4a16

NaNK
license:apache-2.0
481
0

Qwen3-14B-quantized.w4a16

Model Overview - Model Architecture: Qwen3ForCausalLM - Input: Text - Output: Text - Model Optimizations: - Weight quantization: INT4 - Intended Use Cases: - Reasoning. - Function calling. - Subject matter experts via fine-tuning. - Multilingual instruction following. - Translation. - Out-of-scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws). - Release Date: 05/05/2025 - Version: 1.0 - Model Developers: RedHat (Neural Magic) This model was obtained by quantizing the weights of Qwen3-14B to INT4 data type. This optimization reduces the number of bits per parameter from 16 to 4, reducing the disk size and GPU memory requirements by approximately 75%. Only the weights of the linear operators within transformers blocks are quantized. Weights are quantized using an asymmetric per-group scheme, with group size 64. The GPTQ algorithm is applied for quantization, as implemented in the llm-compressor library. This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details. Creation details: This model was created with llm-compressor by running the code snippet below. The model was evaluated on the OpenLLM leaderboard tasks (versions 1 and 2), using lm-evaluation-harness, and on reasoning tasks using lighteval. vLLM was used for all evaluations.
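The creation snippet referenced under "Creation details" is not included in this listing. A hedged sketch of an INT4 weight-only GPTQ run with llm-compressor; note that this uses the stock W4A16 preset, whereas the released model used an asymmetric per-group scheme with group size 64, which would require a custom quantization config not shown here:

```python
# Sketch: weight-only INT4 (W4A16) GPTQ quantization with llm-compressor.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

model_id = "Qwen/Qwen3-14B"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

NUM_SAMPLES, MAX_LEN = 512, 2048
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split=f"train_sft[:{NUM_SAMPLES}]")
ds = ds.map(
    lambda row: tokenizer(
        tokenizer.apply_chat_template(row["messages"], tokenize=False),
        max_length=MAX_LEN,
        truncation=True,
    ),
    remove_columns=ds.column_names,
)

recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])
oneshot(model=model, dataset=ds, recipe=recipe, max_seq_length=MAX_LEN, num_calibration_samples=NUM_SAMPLES)

out_dir = "Qwen3-14B-quantized.w4a16"
model.save_pretrained(out_dir, save_compressed=True)
tokenizer.save_pretrained(out_dir)
```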

NaNK
license:apache-2.0
460
0

Voxtral-Mini-3B-2507-FP8-dynamic

NaNK
license:apache-2.0
457
9

Qwen2-72B-Instruct-FP8

NaNK
450
15

Mixtral-8x7B-Instruct-v0.1-FP8

NaNK
license:apache-2.0
420
0

whisper-large-v3-turbo-quantized.w4a16

license:apache-2.0
407
6

Mistral-Nemo-Instruct-2407-quantized.w4a16

license:llama2
399
4

granite-4.0-h-tiny-FP8-dynamic

license:apache-2.0
395
1

phi-4-FP8-dynamic

NaNK
license:mit
395
0

Qwen2-0.5B-Instruct-FP8

NaNK
license:apache-2.0
390
3

Llama-4-Scout-17B-16E-Instruct-NVFP4

NaNK
llama4
389
0

Llama-2-7b-chat-quantized.w8a8

NaNK
llama
387
1

whisper-large-v3-FP8-dynamic

Model Overview - Model Architecture: whisper-large-v3 - Input: Audio-Text - Output: Text - Model Optimizations: - Weight quantization: FP8 - Activation quantization: FP8 - Release Date: 04/16/2025 - Version: 1.0 - Model Developers: Neural Magic This model was obtained by quantizing the weights of openai/whisper-large-v3 to FP8 data type, ready for inference with vLLM >= 0.5.2. This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details. This model was created with llm-compressor by running the code snippet below. The model was evaluated on LibriSpeech and Fleurs datasets using lmms-eval, via the following commands:

NaNK
license:apache-2.0
380
2

Llama-3.2-3B-Instruct-quantized.w8a8

NaNK
llama
376
1

Qwen3-Next-80B-A3B-Instruct-FP8

NaNK
license:apache-2.0
370
0

Llama-3.3-70B-Instruct-speculator.eagle3

Model Overview
- Verifier: meta-llama/Llama-3.3-70B-Instruct
- Speculative Decoding Algorithm: EAGLE-3
- Model Architecture: Eagle3Speculator
- Release Date: 09/15/2025
- Version: 1.0
- Model Developers: RedHat

This is a speculator model designed for use with meta-llama/Llama-3.3-70B-Instruct, based on the EAGLE-3 speculative decoding algorithm. It was trained using the speculators library on a combination of the Aeala/ShareGPT_Vicuna_unfiltered and the `train_sft` split of the HuggingFaceH4/ultrachat_200k datasets. This model should be used with the meta-llama/Llama-3.3-70B-Instruct chat template, specifically through the `/chat/completions` endpoint.

Text Summarization 1.71 2.21 2.52 2.74 2.83 2.87 2.89

Benchmark settings:
- temperature: 0
- repetitions: 5
- time per experiment: 4min
- hardware: 4xA100
- vLLM version: 0.11.0
- GuideLLM version: 0.3.0

Command
```bash
GUIDELLM__PREFERRED_ROUTE="chat_completions" \
guidellm benchmark \
  --target "http://localhost:8000/v1" \
  --data "RedHatAI/speculator_benchmarks" \
  --data-args '{"data_files": "HumanEval.jsonl"}' \
  --rate-type sweep \
  --max-seconds 240 \
  --output-path "Llama-3.3-70B-Instruct-HumanEval.json" \
  --backend-args '{"extra_body": {"chat_completions": {"temperature":0.0}}}'
```

NaNK
llama
360
1

DeepSeek-R1-Distill-Qwen-14B-FP8-dynamic

NaNK
license:mit
355
3

Llama-3.2-11B-Vision-Instruct-quantized.w4a16

NaNK
mllama
351
1

Phi-4-reasoning-FP8-dynamic

license:mit
349
1

DeepSeek-R1-Distill-Qwen-7B-quantized.w4a16

NaNK
license:mit
348
2

Qwen3-30B-A3B-Thinking-2507-speculator.eagle3

NaNK
license:apache-2.0
339
0

Qwen2.5-7B-Instruct-FP8-dynamic

NaNK
license:apache-2.0
337
1

Llama-2-7b-pruned70-retrained

NaNK
llama
326
0

DeepSeek-V2.5-1210-FP8

321
2

Apertus-8B-Instruct-2509-FP8-dynamic

NaNK
license:apache-2.0
314
3

whisper-large-v3-quantized.w8a8

Model Overview - Model Architecture: whisper-large-v3 - Input: Audio-Text - Output: Text - Model Optimizations: - Weight quantization: INT8 - Activation quantization: INT8 - Release Date: 04/16/2025 - Version: 1.0 - Model Developers: Neural Magic This model was obtained by quantizing the weights of openai/whisper-large-v3 to INT8 data type, ready for inference with vLLM >= 0.5.2. This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details. This model was created with llm-compressor by running the code snippet below. The model was evaluated on LibriSpeech and Fleurs datasets using lmms-eval, via the following commands:

NaNK
license:apache-2.0
314
1

SmolLM3-3B-FP8-dynamic

NaNK
license:apache-2.0
308
1

Meta-Llama-3.1-8B-quantized.w8a8

NaNK
llama
303
5

whisper-large-v3-turbo-FP8-dynamic

Model Overview - Model Architecture: whisper-large-v3-turbo - Input: Audio-Text - Output: Text - Model Optimizations: - Weight quantization: FP8 - Activation quantization: FP8 - Release Date: 04/16/2025 - Version: 1.0 - Model Developers: Neural Magic Quantized version of openai/whisper-large-v3-turbo. This model was obtained by quantizing the weights of openai/whisper-large-v3-turbo to FP8 data type, ready for inference with vLLM >= 0.5.2. This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details. This model was created with llm-compressor by running the code snippet below. The model was evaluated on LibriSpeech and Fleurs datasets using lmms-eval, via the following commands:

license:apache-2.0
292
5

Qwen3-235B-A22B-NVFP4

Model Overview - Model Architecture: Qwen/Qwen3-235B-A22B - Input: Text - Output: Text - Model Optimizations: - Weight quantization: FP4 - Activation quantization: FP4 - Out-of-scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages other than English. - Release Date: 10/29/2025 - Version: 1.0 - Model Developers: RedHatAI This model is a quantized version of Qwen/Qwen3-235B-A22B. It was evaluated on several tasks to assess its quality in comparison to the unquantized model. This model was obtained by quantizing the weights and activations of Qwen/Qwen3-235B-A22B to FP4 data type, ready for inference with vLLM>=0.9.1. This optimization reduces the number of bits per parameter from 16 to 4, reducing the disk size and GPU memory requirements by approximately 75%. Only the weights and activations of the linear operators within transformers blocks are quantized using LLM Compressor. This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details. This model was created by applying LLM Compressor with calibration samples from UltraChat, as presented in the code snippet below. This model was evaluated on the well-known OpenLLM v1, OpenLLM v2 and HumanEval64 benchmarks using lm-evaluation-harness. The Reasoning evals were done using lighteval. Category Metric Qwen/Qwen3-235B-A22B RedHatAI/Qwen3-235B-A22B-NVFP4 (this model) Recovery The results were obtained using the following commands:

NaNK
license:apache-2.0
290
0

Llama-4-Maverick-17B-128E-Instruct-FP8

NaNK
llama4
286
2

Qwen3-235B-A22B-Instruct-2507-speculator.eagle3

NaNK
license:apache-2.0
286
0

Meta-Llama-3-8B-Instruct-quantized.w8a16

NaNK
llama
278
3

Qwen2-57B-A14B-Instruct-FP8

NaNK
license:apache-2.0
277
1

Phi-3-medium-128k-instruct-quantized.w4a16

license:llama2
275
3

DeepSeek-R1-Distill-Llama-8B-FP8-dynamic

NaNK
llama
273
4

Mistral-7B-Instruct-v0.3-quantized.w4a16

NaNK
license:apache-2.0
269
2

gemma-2-9b-it-FP8

NaNK
266
5

Llama-3.1-8B-Instruct-NVFP4

Model Overview - Model Architecture: Meta-Llama-3.1 - Input: Text - Output: Text - Model Optimizations: - Weight quantization: FP4 - Activation quantization: FP4 - Intended Use Cases: Intended for commercial and research use in multiple languages. Similarly to Meta-Llama-3.1-8B-Instruct, this model is intended for assistant-like chat. - Out-of-scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages other than English. - Release Date: 10/23/2025 - Version: 1.0 - License(s): llama3.1 - Model Developers: RedHatAI This model is a quantized version of Meta-Llama-3.1-8B-Instruct. It was evaluated on several tasks to assess its quality in comparison to the unquantized model. This model was obtained by quantizing the weights and activations of Meta-Llama-3.1-8B-Instruct to FP4 data type, ready for inference with vLLM>=0.9.1. This optimization reduces the number of bits per parameter from 16 to 4, reducing the disk size and GPU memory requirements by approximately 75%. Only the weights and activations of the linear operators within transformers blocks are quantized using LLM Compressor. This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details. This model was created by applying LLM Compressor with calibration samples from UltraChat, as presented in the code snippet below. This model was evaluated on the well-known OpenLLM v1, OpenLLM v2 and HumanEval64 benchmarks. All evaluations were conducted using lm-evaluation-harness.

| Metric | Meta-Llama-3.1-8B-Instruct | Llama-3.1-8B-Instruct-NVFP4 (this model) | Recovery |
|---|---|---|---|
| gsm8k_llama | 78.17 | 79.30 | 101.45 |
| hellaswag | 78.43 | 78.01 | 99.46 |
| mmlu_llama | 69.37 | 65.95 | 95.07 |
| mmlu_cot_llama | 72.86 | 68.60 | 94.15 |
| truthfulqa_mc2 | 55.09 | 52.95 | 96.12 |
| winogrande | 75.77 | 74.03 | 97.70 |
| Average | 73.29 | 71.59 | 97.68 |

Category Metric Meta-Llama-3.1-8B-Instruct RedHatAI/Llama-3.1-8B-Instruct-NVFP4 (this model) Recovery (%) The results were obtained using the following commands:

NaNK
llama
266
0

phi-4

NaNK
license:mit
262
1

Kimi-K2-Instruct-quantized.w4a16

259
12

Apertus-70B-Instruct-2509-quantized.w4a16

NaNK
license:apache-2.0
259
1

Qwen3-4B-FP8-dynamic

NaNK
license:apache-2.0
253
0

gemma-3n-E2B-it-quantized.w8a8

Model Overview - Model Architecture: gemma-3n-E2B-it - Input: Audio-Vision-Text - Output: Text - Model Optimizations: - Weight quantization: INT8 - Activation quantization: INT8 - Release Date: 08/01/2025 - Version: 1.0 - Model Developers: RedHatAI This model was obtained by quantizing the weights and activations of google/gemma-3n-E2B-it to INT8 data type, ready for inference with vLLM >= 0.10.0. This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details. This model was created with llm-compressor by running the code snippet below. The model was evaluated using lm-evaluation-harness for OpenLLM V1 and V2 text-based benchmarks. The evaluations were conducted using the following commands: Category Metric google/gemma-3n-E2B-it RedHatAI/gemma-3n-E2B-it-quantized.w8a8 Recovery (%)

NaNK
license:mit
251
2

Llama-3.1-70B-Instruct-NVFP4

NaNK
llama
249
0

Qwen2.5-VL-72B-Instruct-quantized.w4a16

NaNK
license:apache-2.0
227
8

Qwen2.5-7B-quantized.w8a8

NaNK
license:apache-2.0
225
1

Llama-4-Scout-17B-16E-Instruct-FP8-block

Model Overview - Model Architecture: Llama4ForConditionalGeneration - Input: Text, Image - Output: Text - Model Optimizations: - Weight quantization: FP8 - Activation quantization: FP8 - Release Date: - Version: 1.0 - Model Developers: Red Hat Quantized version of meta-llama/Llama-4-Scout-17B-16E-Instruct. This model was obtained by quantizing the weights and activations of meta-llama/Llama-4-Scout-17B-16E-Instruct to FP8 data type. This optimization reduces the number of bits per parameter from 16 to 8, reducing the disk size and GPU memory requirements by approximately 50%. Only the weights and activations of the linear operators within transformers blocks of the language model are quantized. This model was quantized using the llm-compressor library as shown below. The model was evaluated on the OpenLLMv1 leaderboard task, using lm-evaluation-harness, and on reasoning tasks using lighteval. vLLM was used for all evaluations. Category Metric meta-llama/Llama-4-Scout-17B-16E-Instruct RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-block Recovery (%) OpenLLM V1 ARC-Challenge (Acc-Norm, 25-shot) 69.62 68.60 98.53 OpenLLM V2 IFEval (Inst Level Strict Acc, 0-shot) 89.09 89.93 100.94

NaNK
llama4
218
3

Llama-3.2-1B-FP8

NaNK
llama
215
0

Ministral-3-14B-Instruct-2512

NaNK
license:apache-2.0
211
2

Mistral-Small-24B-Instruct-2501-quantized.w8a8

NaNK
license:apache-2.0
196
1

Mistral-Small-24B-Instruct-2501-quantized.w4a16

NaNK
license:apache-2.0
195
1

QwQ-32B-FP8-dynamic

NaNK
license:mit
188
11

gemma-3-4b-it-quantized.w8a8

NaNK
186
0

gemma-3n-E2B-it-FP8-dynamic

Model Overview - Model Architecture: gemma-3n-E2B-it - Input: Audio-Vision-Text - Output: Text - Model Optimizations: - Weight quantization: FP8 - Activation quantization: FP8 - Release Date: 08/01/2025 - Version: 1.0 - Model Developers: RedHatAI This model was obtained by quantizing the weights of google/gemma-3n-E2B-it to FP8 data type, ready for inference with vLLM >= 0.10.0. This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details. This model was created with llm-compressor by running the code snippet below. The model was evaluated using lm-evaluation-harness for OpenLLM V1 and V2 text-based benchmarks. The evaluations were conducted using the following commands: Category Metric google/gemma-3n-E2B-it FP8 Dynamic Recovery (%)

NaNK
license:mit
183
1

whisper-large-v2-FP8-Dynamic

Model Overview - Model Architecture: whisper-large-v2 - Input: Audio-Text - Output: Text - Model Optimizations: - Weight quantization: FP8 - Activation quantization: FP8 - Release Date: 04/16/2025 - Version: 1.0 - Model Developers: Neural Magic This model was obtained by quantizing the weights of openai/whisper-large-v2 to FP8 data type, ready for inference with vLLM >= 0.5.2. This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details. This model was created with llm-compressor by running the code snippet below. The model was evaluated on LibriSpeech and Fleurs datasets using lmms-eval, via the following commands: Fleurs (X→en, WER) cmn_hans_cn 15.2148 14.7614 103.07%
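The original deployment example is not included in this listing; the sketch below assumes the model is already served with vLLM (for example via `vllm serve RedHatAI/whisper-large-v2-FP8-Dynamic`) and that the running vLLM build exposes the OpenAI-compatible `/v1/audio/transcriptions` endpoint for Whisper models. The file name is illustrative.

```python
# Minimal sketch: transcribing an audio clip through a vLLM server's OpenAI-compatible API.
# Assumes a server is already running locally and serving this model.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

with open("sample.flac", "rb") as audio_file:  # any short audio clip (illustrative file name)
    result = client.audio.transcriptions.create(
        model="RedHatAI/whisper-large-v2-FP8-Dynamic",
        file=audio_file,
    )

print(result.text)
```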

NaNK
license:apache-2.0
182
0

granite-3.1-8b-instruct

NaNK
license:apache-2.0
176
1

DeepSeek-R1-Distill-Llama-70B-quantized.w4a16

NaNK
llama
175
5

bge-base-en-v1.5-quant

license:mit
172
4

Llama-3.3-70B-Instruct-NVFP4

Model Overview - Model Architecture: Meta-Llama-3.3 - Input: Text - Output: Text - Model Optimizations: - Weight quantization: FP4 - Activation quantization: FP4 - Intended Use Cases: Intended for commercial and research use in multiple languages. Similarly to Llama-3.3-70B-Instruct, this model is intended for assistant-like chat. - Out-of-scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages other than English. - Release Date: 6/25/2025 - Version: 1.0 - License(s): llama3.3 - Model Developers: RedHatAI This model is a quantized version of Meta-Llama-3.3-70B-Instruct. It was evaluated on several tasks to assess its quality in comparison to the unquantized model. This model was obtained by quantizing the weights and activations of Meta-Llama-3.3-70B-Instruct to FP4 data type, ready for inference with vLLM >= 0.9.1. This optimization reduces the number of bits per parameter from 16 to 4, reducing the disk size and GPU memory requirements by approximately 75%. Only the weights and activations of the linear operators within transformers blocks are quantized using LLM Compressor. This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details. This model was created by applying LLM Compressor with calibration samples from UltraChat, as presented in the code snippet below. This model was evaluated on the well-known OpenLLM v1, OpenLLM v2, HumanEval, and HumanEval64 benchmarks. All evaluations were conducted using lm-evaluation-harness. Category Metric Meta-Llama-3.3-70B-Instruct RedHatAI/Llama-3.3-70B-Instruct-NVFP4 (this model) Recovery gsm8k_llama (8-shot, strict-match) 85.22 77.10 90.47 The results were obtained using the following commands:

NaNK
llama
172
1

TinyLlama-1.1B-Chat-v1.0-marlin

NaNK
llama
170
2

Llama-3.2-3B-Instruct-FP8-dynamic

NaNK
llama
166
3

Mixtral-8x7B-Instruct-v0.1

NaNK
license:apache-2.0
166
1

Apertus-70B-Instruct-2509-FP8-dynamic

NaNK
license:apache-2.0
160
1

Qwen3-8B-NVFP4

NaNK
license:apache-2.0
160
0

DeepSeek-R1-0528-quantized.w4a16

NaNK
license:mit
155
12

gemma-2-2b-it-quantized.w4a16

NaNK
license:llama2
151
1

Qwen3-1.7B-FP8-dynamic

NaNK
license:apache-2.0
147
0

Qwen2-1.5B-Instruct-quantized.w8a8

NaNK
license:apache-2.0
146
0

Llama-3.1-8B-Instruct-FP8-block

Model Overview - Model Architecture: LlamaForCausalLM - Input: Text - Output: Text - Model Optimizations: - Weight quantization: FP8 - Activation quantization: FP8 - Release Date: - Version: 1.0 - Model Developers: Red Hat Quantized version of meta-llama/Llama-3.1-8B-Instruct. This model was obtained by quantizing the weights and activations of meta-llama/Llama-3.1-8B-Instruct to FP8 data type. This optimization reduces the number of bits per parameter from 16 to 8, reducing the disk size and GPU memory requirements by approximately 50%. Only the weights and activations of the linear operators within transformers blocks of the language model are quantized. This model was quantized using the llm-compressor library as shown below. The model was evaluated on the OpenLLMv1 leaderboard task, using lm-evaluation-harness, and on reasoning tasks using lighteval. vLLM was used for all evaluations. Category Metric meta-llama/Llama-3.1-8B-Instruct RedHatAI/Llama-3.1-8B-Instruct-FP8-block Recovery (%) OpenLLM V1 ARC-Challenge (Acc-Norm, 25-shot) 60.92 60.92 100.00 OpenLLM V2 IFEval (Inst Level Strict Acc, 0-shot) 81.89 81.41 99.41
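The quantization snippet referenced above is not reproduced in this listing; a minimal llm-compressor sketch is given below. The "FP8_BLOCK" scheme string, the output directory, and the import paths (which have moved between llm-compressor releases) are assumptions rather than the card's exact recipe.

```python
# Minimal llm-compressor sketch (not the card's exact recipe; scheme name is an assumption).
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

model_id = "meta-llama/Llama-3.1-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Quantize only linear layers inside transformer blocks; keep the output head in higher precision.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_BLOCK", ignore=["lm_head"])

oneshot(model=model, recipe=recipe)

save_dir = "Llama-3.1-8B-Instruct-FP8-block"
model.save_pretrained(save_dir, save_compressed=True)
tokenizer.save_pretrained(save_dir)
```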

NaNK
base_model:meta-llama/Llama-3.1-8B-Instruct
144
1

granite-3.1-8b-instruct-quantized.w8a8

NaNK
license:apache-2.0
135
2

gemma-2-2b-it-FP8

NaNK
135
1

granite-3.1-2b-instruct-quantized.w4a16

NaNK
license:apache-2.0
135
0

Qwen2.5-7B-Instruct

NaNK
license:apache-2.0
135
0

Qwen2.5-VL-72B-Instruct-quantized.w8a8

NaNK
license:apache-2.0
133
0

DeepSeek-R1-Distill-Qwen-7B-FP8-dynamic

NaNK
license:mit
121
1

gemma-3n-E2B-it-quantized.w4a16

Model Overview - Model Architecture: gemma-3n-E2B-it - Input: Audio-Vision-Text - Output: Text - Model Optimizations: - Weight quantization: INT4 - Activation quantization: INT16 - Release Date: 08/01/2025 - Version: 1.0 - Model Developers: RedHatAI This model was obtained by quantizing the weights of google/gemma-3n-E2B-it to INT4 data type, ready for inference with vLLM >= 0.10.0. This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details. This model was created with llm-compressor by running the code snippet below. The model was evaluated using lm-evaluation-harness for OpenLLM V1 and V2 text-based benchmarks. The evaluations were conducted using the following commands: Category Metric google/gemma-3n-E2B-it RedHatAI/gemma-3n-E2B-it-quantized.w4a16 Recovery (%)

NaNK
license:mit
120
1

Qwen3-Next-80B-A3B-Thinking-FP8-dynamic

NaNK
license:apache-2.0
120
0

Mistral-Small-3.1-24B-Instruct-2503

NaNK
license:apache-2.0
116
1

Devstral-Small-2507-quantized.w4a16

Model Overview - Model Architecture: MistralForCausalLM - Input: Text - Output: Text - Model Optimizations: - Weight quantization: INT4 - Activation quantization: None - Release Date: 08/29/2025 - Version: 1.0 - Model Developers: Red Hat (Neural Magic) This model was obtained by quantizing weights of Devstral-Small-2507 to INT4 data type. This optimization reduces the number of bits used to represent weights from 16 to 4, reducing GPU memory requirements (by approximately 75%). Weight quantization also reduces disk size requirements by approximately 75%. This model can be deployed efficiently using the vLLM backend, as shown in the example below. This model was created with llm-compressor by running the code snippet below. The model was evaluated on popular coding tasks (HumanEval, HumanEval+, MBPP, MBPP+) via EvalPlus and the vLLM backend (v0.10.1.1). For evaluations, we run greedy sampling and report pass@1. The command to reproduce evals:

| | Recovery (%) | mistralai/Devstral-Small-2507 | RedHatAI/Devstral-Small-2507-quantized.w4a16 (this model) |
| --------------------------- | :----------: | :------------------: | :--------------------------------------------------: |
| HumanEval | 98.65 | 89.0 | 87.8 |
| HumanEval+ | 100.0 | 81.1 | 81.1 |
| MBPP | 98.97 | 77.5 | 76.7 |
| MBPP+ | 102.12 | 66.1 | 67.5 |
| Average Score | 99.81 | 78.43 | 78.28 |

NaNK
license:mit
113
1

gemma-3n-E4B-it-quantized.w8a8

Model Overview - Model Architecture: gemma-3n-E4B-it - Input: Audio-Vision-Text - Output: Text - Model Optimizations: - Weight quantization: INT8 - Activation quantization: INT8 - Release Date: 08/01/2025 - Version: 1.0 - Model Developers: RedHatAI This model was obtained by quantizing the weights and activations of google/gemma-3n-E4B-it to INT8 data type, ready for inference with vLLM >= 0.10.0. This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details. This model was created with llm-compressor by running the code snippet below. The model was evaluated using lm-evaluation-harness for OpenLLM V1 and V2 text-based benchmarks. The evaluations were conducted using the following commands: Category Metric google/gemma-3n-E4B-it RedHatAI/gemma-3n-E4B-it-quantized.w8a8 Recovery (%)

NaNK
license:mit
113
0

gemma-2-2b-it-quantized.w8a8

NaNK
111
0

Qwen2-0.5B-Instruct-quantized.w8a16

NaNK
license:apache-2.0
110
0

Qwen2.5-7B-Instruct-quantized.w8a16

NaNK
109
0

Qwen3.5-122B-A10B-FP8-Dynamic

NaNK
108
0

Qwen2.5-7B-Instruct-quantized.w8a8

NaNK
license:apache-2.0
105
2

Qwen3-VL-235B-A22B-Instruct-FP8-block

NaNK
license:apache-2.0
102
3

Qwen3-8B-FP8-block

Model Overview - Model Architecture: Qwen3ForCausalLM - Input: Text - Output: Text - Model Optimizations: - Weight quantization: FP8 - Activation quantization: FP8 - Release Date: - Version: 1.0 - Model Developers: Red Hat This model was obtained by quantizing the weights and activations of Qwen/Qwen3-8B to FP8 data type. This optimization reduces the number of bits per parameter from 16 to 8, reducing the disk size and GPU memory requirements by approximately 50%. Only the weights and activations of the linear operators within transformers blocks of the language model are quantized. This model was quantized using the llm-compressor library as shown below. The model was evaluated on the OpenLLM leaderboard task, using lm-evaluation-harness. vLLM was used for all evaluations. Category Metric Qwen/Qwen3-8B nm-testing/Qwen3-8B-FP8-block Recovery (%) OpenLLM V1 ARC-Challenge (Acc-Norm, 25-shot) 67.66 67.92 100.38 OpenLLM V2 IFEval (Inst Level Strict Acc, 0-shot) 48.56 48.80 100.49

NaNK
license:apache-2.0
99
0

DeepSeek-R1-Distill-Qwen-1.5B-quantized.w4a16

NaNK
license:mit
95
1

pixtral-12b-FP8-dynamic

NaNK
license:apache-2.0
93
10

Mistral-Small-24B-Instruct-2501

NaNK
license:apache-2.0
93
0

MiniMax-M2.5

91
1

Qwen2.5-72B-Instruct-quantized.w8a8

NaNK
91
0

Llama-3.1-Nemotron-70B-Instruct-HF

NaNK
llama
81
2

Llama-Guard-4-12B

NaNK
llama4
81
0

Meta-Llama-3-70B-Instruct-quantized.w8a16

NaNK
llama
80
5

NVIDIA-Nemotron-3-Super-120B-A12B-BF16

NaNK
80
0

NVIDIA-Nemotron-Nano-9B-v2-quantized.w4a16

Model Overview - Model Architecture: NemotronHForCausalLM - Input: Text - Output: Text - Model Optimizations: - Weight quantization: INT4 - Release Date: 10/22/2025 - Version: 1.0 - Model Developers: RedHat (Neural Magic) This model was obtained by quantizing the weights of NVIDIA-Nemotron-Nano-9B-v2 to INT4 data type. This optimization reduces the number of bits per parameter from 16 to 4, reducing the disk size and GPU memory requirements by approximately 75%. Only the weights of the linear operators within transformers blocks are quantized. Weights are quantized using a symmetric per-group scheme, with group size 64. The GPTQ algorithm is applied for quantization, as implemented in the llm-compressor library. This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details. Creation details This model was created with llm-compressor by running the code snippet below. The model was evaluated on the set of popular reasoning tasks AIME25, Math-500, and GPQA-Diamond, using lighteval `v0.11.1.dev0`. vLLM `v0.11.1rc2.dev191+g80e945298.precompiled` was used as the inference engine for all evaluations. NVIDIA-Nemotron-Nano-9B-v2-quantized.w4a16 (this model)
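The creation snippet referenced above is not reproduced in this listing; the sketch below shows the general GPTQ flow with llm-compressor. The calibration dataset alias, sample count, and sequence length are illustrative assumptions, and the group size is left at the W4A16 scheme default rather than the group size 64 described in the card.

```python
# Minimal llm-compressor GPTQ sketch (not the card's exact recipe; calibration setup is assumed).
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

model_id = "nvidia/NVIDIA-Nemotron-Nano-9B-v2"  # presumed base checkpoint
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")  # may require a recent transformers release with NemotronH support
tokenizer = AutoTokenizer.from_pretrained(model_id)

# INT4 weight-only quantization of linear layers inside transformer blocks.
recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

oneshot(
    model=model,
    dataset="ultrachat_200k",        # calibration data (assumed alias)
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)

save_dir = "NVIDIA-Nemotron-Nano-9B-v2-quantized.w4a16"
model.save_pretrained(save_dir, save_compressed=True)
tokenizer.save_pretrained(save_dir)
```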

NaNK
77
5

nomic-embed-text-v1.5

license:apache-2.0
77
0

gemma-2-9b-it

NaNK
74
1

Meta-Llama-3.1-405B-Instruct-quantized.w4a16

NaNK
llama
70
12

Qwen3-Next-80B-A3B-Instruct-FP8-dynamic

NaNK
license:apache-2.0
69
0

Ministral-3-3B-Instruct-2512

NaNK
license:apache-2.0
68
0

Mistral-Small-3.2-24B-Instruct-2506-FP8

NaNK
license:apache-2.0
67
6

DeepSeek-Coder-V2-Instruct-FP8

66
7

Sparse-Llama-3.1-8B-2of4

NaNK
base_model:meta-llama/Llama-3.1-8B
65
62

Qwen3-Next-80B-A3B-Thinking-quantized.w4a16

NaNK
license:apache-2.0
62
0

Qwen3-14B-speculator.eagle3

Model Overview - Verifier: Qwen/Qwen3-14B - Speculative Decoding Algorithm: EAGLE-3 - Model Architecture: Eagle3Speculator - Release Date: 09/18/2025 - Version: 1.0 - Model Developers: RedHat This is a speculator model designed for use with Qwen/Qwen3-14B, based on the EAGLE-3 speculative decoding algorithm. It was trained using the speculators library on a combination of the Aeala/ShareGPT_Vicuna_unfiltered dataset and the `train_sft` split of the HuggingFaceH4/ultrachat_200k dataset. This model should be used with the Qwen/Qwen3-14B chat template, specifically through the `/chat/completions` endpoint. Text Summarization 1.60 1.90 2.06 2.14 2.17 2.19 2.21 - temperature: 0.6 - top_p: 0.95 - top_k: 20 - repetitions: 3 - time per experiment: 10min - hardware: 1xA100 - vLLM version: 0.11.0 - GuideLLM version: 0.3.0 Command
```bash
GUIDELLM__PREFERRED_ROUTE="chat_completions" \
guidellm benchmark \
  --target "http://localhost:8000/v1" \
  --data "RedHatAI/speculator_benchmarks" \
  --data-args '{"data_files": "HumanEval.jsonl"}' \
  --rate-type sweep \
  --max-seconds 600 \
  --output-path "Qwen3-14B-HumanEval.json" \
  --backend-args '{"extra_body": {"chat_completions": {"temperature":0.6, "top_p":0.95, "top_k":20}}}'
```

NaNK
license:apache-2.0
62
0

Qwen3-1.7B-quantized.w4a16

NaNK
license:apache-2.0
61
3

Llama-2-7b-chat-hf-FP8

NaNK
llama
61
0

GLM-4.6-NVFP4

NaNK
60
0

SmolLM3-3B-quantized.w4a16

Model Overview - Model Architecture: SmolLM3-3B - Input: Text - Output: Text - Model Optimizations: - Weight quantization: INT4 - Activation quantization: None - Release Date: 07/31/2025 - Version: 1.0 - License(s): Apache-2.0 - Model Developers: RedHat (Neural Magic) This model was obtained by quantizing weights of SmolLM3-3B to INT4 data type. This optimization reduces the number of bits used to represent weights from 16 to 4, reducing GPU memory requirements (by approximately 75%). Weight quantization also reduces disk size requirements by approximately 75%. Only weights of the linear operators within transformers blocks are quantized. The llm-compressor library is used for quantization. This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details. Creation details This model was created with llm-compressor by running the code snippet below with: This model was evaluated on the well-known reasoning tasks: AIME24, Math-500, and GPQA-Diamond. In all cases, model outputs were generated with the vLLM engine, and evals are collected through the LightEval library.

NaNK
license:apache-2.0
59
1

Llama-3.1-8B-tldr-FP8-dynamic

Model Overview - Model Architecture: LlamaForCausalLM - Input: Text - Output: Text - Model Optimizations: - Weight quantization: FP8 - Activation quantization: FP8 - Release Date: 06/06/2025 - Version: 1.0 - Intended Use Cases: This model is finetuned to summarize text in the style of Reddit posts. - Out-of-scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in any other way that is prohibited by the Acceptable Use Policy and Llama 3.1 Community License. - Model Developers: Red Hat (Neural Magic) This model is a quantized version of RedHatAI/Llama-3.1-8B-tldr, which is fine-tuned on the trl-lib/tldr dataset. This model recovers 100% of the BERTScore (0.366) obtained by RedHatAI/Llama-3.1-8B-tldr while providing up to 1.3x speedup. This model can be deployed efficiently using vLLM, as shown in the example below. Run the following command to start the vLLM server: Once your server is started, you can query the model using the OpenAI API: This model was created by applying llm-compressor, as presented in the code snippet below. The model was evaluated on the test split of trl-lib/tldr using the Neural Magic fork of lm-evaluation-harness (tldr branch). One can reproduce these results by using the following command: We evaluated the inference performance of this model using the first 1,000 samples from the training set of the trl-lib/tldr dataset. Benchmarking was conducted with vLLM version `0.9.0.1` and GuideLLM version `0.2.1`. The figure below presents the mean end-to-end latency per request across varying request rates. Results are shown for this model, as well as two variants: - Dense: Llama-3.1-8B-tldr - Sparse-quantized: Sparse-Llama-3.1-8B-tldr-2of4-FP8-dynamic 1. Generate a JSON file containing the first 1,000 training samples: > The average output length is approximately 30 tokens per sample. We capped the generation at 128 tokens to reduce performance skew from rare, unusually verbose completions.
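The server start command and OpenAI API query referenced above are not reproduced in this listing; a minimal client-side sketch is given below, assuming a local vLLM server (for example started with `vllm serve RedHatAI/Llama-3.1-8B-tldr-FP8-dynamic`) and an illustrative prompt format.

```python
# Minimal sketch: querying a running vLLM server through the OpenAI-compatible completions API.
# The "TL;DR:" prompt format is illustrative; follow the model's own template if it differs.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

post = "I spent three weekends building a mechanical keyboard from scratch and learned to solder along the way ..."
response = client.completions.create(
    model="RedHatAI/Llama-3.1-8B-tldr-FP8-dynamic",
    prompt=f"{post}\n\nTL;DR:",
    max_tokens=128,
    temperature=0.0,
)

print(response.choices[0].text)
```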

NaNK
llama
58
1

Qwen2.5-7B-Instruct-quantized.w4a16

NaNK
license:apache-2.0
58
0

gemma-2-2b-it-quantized.w8a16

NaNK
57
1

Mixtral-8x22B-Instruct-v0.1-FP8

NaNK
license:apache-2.0
57
0

Qwen3-14B-FP8-block

Model Overview - Model Architecture: Qwen3ForCausalLM - Input: Text - Output: Text - Model Optimizations: - Weight quantization: FP8 - Activation quantization: FP8 - Release Date: - Version: 1.0 - Model Developers: Red Hat This model was obtained by quantizing the weights and activations of Qwen/Qwen3-14B to FP8 data type. This optimization reduces the number of bits per parameter from 16 to 8, reducing the disk size and GPU memory requirements by approximately 50%. Only the weights and activations of the linear operators within transformers blocks of the language model are quantized. This model was quantized using the llm-compressor library as shown below. The model was evaluated on the OpenLLM leaderboard task, using lm-evaluation-harness. vLLM was used for all evaluations. Category Metric Qwen/Qwen3-14B nm-testing/Qwen3-14B-FP8-block Recovery (%) OpenLLM V1 ARC-Challenge (Acc-Norm, 25-shot) 69.71 69.80 100.12 OpenLLM V2 IFEval (Inst Level Strict Acc, 0-shot) 48.56 48.80 100.49

NaNK
license:apache-2.0
57
0

granite-3.1-8b-instruct-FP8-dynamic

NaNK
license:apache-2.0
54
1

Qwen3-14B-NVFP4

NaNK
license:apache-2.0
54
0

granite-3.3-8b-instruct

Model Summary: Granite-3.3-8B-Instruct is an 8-billion-parameter, 128K-context-length language model fine-tuned for improved reasoning and instruction-following capabilities. Built on top of Granite-3.3-8B-Base, the model delivers significant gains on benchmarks for measuring generic performance including AlpacaEval-2.0 and Arena-Hard, and improvements in mathematics, coding, and instruction following. It supports structured reasoning through dedicated thinking and response tags, providing clear separation between internal thoughts and final outputs. The model has been trained on a carefully balanced combination of permissively licensed data and curated synthetic tasks. - Developers: Granite Team, IBM - Website: Granite Docs - Release Date: April 16th, 2025 - License: Apache 2.0 Supported Languages: English, German, Spanish, French, Japanese, Portuguese, Arabic, Czech, Italian, Korean, Dutch, and Chinese. However, users may finetune this Granite model for languages beyond these 12 languages. Intended Use: This model is designed to handle general instruction-following tasks and can be integrated into AI assistants across various domains, including business applications. Capabilities: Thinking, Summarization, Text classification, Text extraction, Question-answering, Retrieval Augmented Generation (RAG), Code related tasks, Function-calling tasks, Multilingual dialog use cases, and Long-context tasks including long document/meeting summarization, long document QA, etc. Generation: A simple generation example is sketched below; copy the snippet from the section that is relevant for your use case. Comparison with different models over various benchmarks [1]. Scores of AlpacaEval-2.0 and Arena-Hard are calculated with thinking=True.

| Models | Arena-Hard | AlpacaEval-2.0 | MMLU | PopQA | TruthfulQA | BigBenchHard [2] | DROP [3] | GSM8K | HumanEval | HumanEval+ | IFEval | AttaQ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Granite-3.1-2B-Instruct | 23.3 | 27.17 | 57.11 | 20.55 | 59.79 | 61.82 | 20.99 | 67.55 | 79.45 | 75.26 | 63.59 | 84.7 |
| Granite-3.2-2B-Instruct | 24.86 | 34.51 | 57.18 | 20.56 | 59.8 | 61.39 | 23.84 | 67.02 | 80.13 | 73.39 | 61.55 | 83.23 |
| Granite-3.3-2B-Instruct | 28.86 | 43.45 | 55.88 | 18.4 | 58.97 | 63.91 | 44.33 | 72.48 | 80.51 | 75.68 | 65.8 | 87.47 |
| Llama-3.1-8B-Instruct | 36.43 | 27.22 | 69.15 | 28.79 | 52.79 | 73.43 | 71.23 | 83.24 | 85.32 | 80.15 | 79.10 | 83.43 |
| DeepSeek-R1-Distill-Llama-8B | 17.17 | 21.85 | 45.80 | 13.25 | 47.43 | 67.39 | 49.73 | 72.18 | 67.54 | 62.91 | 66.50 | 42.87 |
| Qwen-2.5-7B-Instruct | 25.44 | 30.34 | 74.30 | 18.12 | 63.06 | 69.19 | 64.06 | 84.46 | 93.35 | 89.91 | 74.90 | 81.90 |
| DeepSeek-R1-Distill-Qwen-7B | 10.36 | 15.35 | 50.72 | 9.94 | 47.14 | 67.38 | 51.78 | 78.47 | 79.89 | 78.43 | 59.10 | 42.45 |
| Granite-3.1-8B-Instruct | 37.58 | 30.34 | 66.77 | 28.7 | 65.84 | 69.87 | 58.57 | 79.15 | 89.63 | 85.79 | 73.20 | 85.73 |
| Granite-3.2-8B-Instruct | 55.25 | 61.19 | 66.79 | 28.04 | 66.92 | 71.86 | 58.29 | 81.65 | 89.35 | 85.72 | 74.31 | 84.7 |
| Granite-3.3-8B-Instruct | 57.56 | 62.68 | 65.54 | 26.17 | 66.86 | 69.13 | 59.36 | 80.89 | 89.73 | 86.09 | 74.82 | 88.5 |

Training Data: Overall, our training data is largely comprised of two key sources: (1) publicly available datasets with permissive license, (2) internal synthetically generated data targeted to enhance reasoning capabilities. Infrastructure: We train Granite-3.3-8B-Instruct using IBM's super computing cluster, Blue Vela, which is outfitted with NVIDIA H100 GPUs. This cluster provides a scalable and efficient infrastructure for training our models over thousands of GPUs. Ethical Considerations and Limitations: Granite-3.3-8B-Instruct builds upon Granite-3.3-8B-Base, leveraging both permissively licensed open-source and select proprietary data for enhanced performance. Since it inherits its foundation from the previous model, all ethical considerations and limitations applicable to Granite-3.3-8B-Base remain relevant. Resources - ⭐️ Learn about the latest updates with Granite: https://www.ibm.com/granite - 📄 Get started with tutorials, best practices, and prompt engineering advice: https://www.ibm.com/granite/docs/ - 💡 Learn about the latest Granite learning resources: https://github.com/ibm-granite-community/ [1] Evaluated using OLMES (except AttaQ and Arena-Hard scores) [2] Added regex for more efficient answer extraction. [3] Modified the implementation to handle some of the issues mentioned here
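The generation snippet referenced above is not reproduced in this listing; the sketch below shows the basic transformers flow, assuming the upstream ibm-granite/granite-3.3-8b-instruct repository id and illustrative prompt and generation settings.

```python
# Minimal sketch: chat generation with transformers (repo id, prompt, and settings are illustrative).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ibm-granite/granite-3.3-8b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

messages = [{"role": "user", "content": "Explain retrieval augmented generation in one paragraph."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```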

NaNK
license:apache-2.0
52
0

Mistral-7B-Instruct-v0.3-quantized.w8a16

NaNK
license:apache-2.0
52
0

pixtral-12b-quantized.w4a16

NaNK
license:apache-2.0
51
1

Qwen3-4B-Thinking-2507-quantized.w4a16

NaNK
license:apache-2.0
51
0

bge-small-en-v1.5-quant

license:mit
48
9

Qwen3-4B-Thinking-2507-quantized.w8a8

NaNK
license:apache-2.0
48
0

Devstral-Small-2-24B-Instruct-2512

NaNK
license:apache-2.0
48
0

Qwen2.5-72B-Instruct-quantized.w4a16

NaNK
47
0

Qwen2-VL-72B-Instruct-FP8-dynamic

NaNK
license:apache-2.0
47
0

gemma-2-2b-quantized.w8a16

NaNK
46
0

all-MiniLM-L6-v2

license:apache-2.0
45
0

gemma-3n-E4B-it-quantized.w4a16

Model Overview - Model Architecture: gemma-3n-E4B-it - Input: Audio-Vision-Text - Output: Text - Model Optimizations: - Weight quantization: INT4 - Activation quantization: INT16 - Release Date: 08/01/2025 - Version: 1.0 - Model Developers: RedHatAI This model was obtained by quantizing the weights of google/gemma-3n-E4B-it to INT4 data type, ready for inference with vLLM >= 0.10.0. This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details. This model was created with llm-compressor by running the code snippet below. The model was evaluated using lm-evaluation-harness for OpenLLM V1 and V2 text-based benchmarks. The evaluations were conducted using the following commands: Category Metric google/gemma-3n-E4B-it RedHatAI/gemma-3n-E4B-it-quantized.w4a16 Recovery (%)

NaNK
license:mit
45
0

pixtral-12b-quantized.w8a8

NaNK
license:apache-2.0
42
1

Mistral-Large-Instruct-2407-FP8

42
0

bert-large-uncased-finetuned-squadv1

41
1

Phi-3-vision-128k-instruct-W4A16-G128

Model Overview - Model Architecture: Phi-3-vision-128k-instruct - Input: Vision-Text - Output: Text - Model Optimizations: - Weight quantization: INT4 - Activation quantization: FP16 - Release Date: 1/31/2025 - Version: 1.0 - Model Developers: Neural Magic Quantized version of microsoft/Phi-3-vision-128k-instruct. This model was obtained by quantizing the weights of microsoft/Phi-3-vision-128k-instruct to INT4 data type, ready for inference with vLLM >= 0.5.2. This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details. This model was created with llm-compressor by running the code snippet below as part of a multimodal announcement blog. This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft’s Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party’s policies.

license:apache-2.0
41
1

Qwen3-30B-A3B-Thinking-speculator.eagle3

NaNK
license:apache-2.0
39
0

whisper-small-FP8-Dynamic

Model Overview - Model Architecture: whisper-small - Input: Audio-Text - Output: Text - Model Optimizations: - Weight quantization: FP8 - Activation quantization: FP8 - Release Date: 04/16/2025 - Version: 1.0 - Model Developers: Neural Magic This model was obtained by quantizing the weights of openai/whisper-small to FP8 data type, ready for inference with vLLM >= 0.5.2. This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details. This model was created with llm-compressor by running the code snippet below. The model was evaluated on LibriSpeech and Fleurs datasets using lmms-eval, via the following commands: Fleurs (X→en, WER) cmn_hans_cn 23.0642 24.6761 93.50%

license:apache-2.0
39
0

DeepSeek-V3.2-NVFP4-FP8-BLOCK

NaNK
license:mit
38
0

whisper-large-v3-turbo-quantized.w8a8

Model Overview - Model Architecture: whisper-large-v3-turbo - Input: Audio-Text - Output: Text - Model Optimizations: - Weight quantization: INT8 - Activation quantization: INT8 - Release Date: 04/16/2025 - Version: 1.0 - Model Developers: Neural Magic Quantized version of openai/whisper-large-v3-turbo. This model was obtained by quantizing the weights of openai/whisper-large-v3-turbo to INT8 data type, ready for inference with vLLM >= 0.5.2. This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details. This model was created with llm-compressor by running the code snippet below. The model was evaluated on LibriSpeech and Fleurs datasets using lmms-eval, via the following commands:

license:apache-2.0
37
4

embeddinggemma-300m

37
0

Llama-4-Maverick-17B-128E-Instruct-FP8-block

Model Overview - Model Architecture: Llama4ForConditionalGeneration - Input: Text, Image - Output: Text - Model Optimizations: - Weight quantization: FP8 - Activation quantization: FP8 - Release Date: - Version: 1.0 - Model Developers: Red Hat Quantized version of meta-llama/Llama-4-Maverick-17B-128E-Instruct. This model was obtained by quantizing the weights and activations of meta-llama/Llama-4-Maverick-17B-128E-Instruct to FP8 data type. This optimization reduces the number of bits per parameter from 16 to 8, reducing the disk size and GPU memory requirements by approximately 50%. Only the weights and activations of the linear operators within transformers blocks of the language model are quantized. This model was quantized using the llm-compressor library as shown below. The model was evaluated on the OpenLLM leaderboard task, using lm-evaluation-harness. vLLM was used for all evaluations. Category Metric meta-llama/Llama-4-Maverick-17B-128E-Instruct RedHatAI/Llama-4-Maverick-17B-128E-Instruct-block-FP8 Recovery (%) OpenLLM V1 ARC-Challenge (Acc-Norm, 25-shot) 73.38 73.38 100.00 OpenLLM V2 IFEval (Inst Level Strict Acc, 0-shot) 89.93 90.89 101.07

NaNK
llama4
36
1

Qwen3-30B-A3B-FP8-block

NaNK
license:apache-2.0
36
0

Llama-Guard-4-12B-FP8-dynamic

NaNK
llama4
35
0

granite-3.1-2b-instruct-FP8-dynamic

NaNK
license:apache-2.0
35
0

DeepSeek-R1-Distill-Qwen-1.5B-FP8-dynamic

NaNK
license:mit
35
0

GLM-4.6-quantized.w8a8

NaNK
33
0

GLM-4.6-FP8-dynamic

NaNK
33
0

Qwen2.5-0.5B-Instruct-quantized.w8a8

NaNK
license:apache-2.0
33
0

TinyLlama-1.1B-Chat-v0.4-pruned50-quant-ds

NaNK
llama
31
0

zephyr-7b-beta-marlin

NaNK
31
0

Llama-3.1-Nemotron-70B-Instruct-HF-quantized.w8a8

NaNK
llama
31
0

gemma-3-12b-it

NaNK
30
1

Qwen3-32B-Thinking-speculator.eagle3

NaNK
license:apache-2.0
30
0

Phi-3-mini-128k-instruct-quantized.w8a8

license:mit
30
0

Mistral-Large-3-675B-Instruct-2512-NVFP4

NaNK
license:apache-2.0
29
1

Mistral-Large-3-675B-Instruct-2512

NaNK
license:apache-2.0
28
1

Qwen2.5-Coder-7B-FP8-dynamic

NaNK
28
0

Phi-3.5-mini-instruct-FP8-KV

license:mit
27
2

Llama-2-7b-evolcodealpaca

NaNK
llama
27
1

Qwen3-Coder-480B-A35B-Instruct-FP8

NaNK
license:apache-2.0
26
3

Qwen2.5-1.5B-quantized.w4a16

NaNK
license:apache-2.0
26
0

Llama-Guard-4-12B-quantized.w8a8

NaNK
llama4
25
0

Qwen2-0.5B-Instruct-quantized.w4a16

NaNK
license:apache-2.0
25
0

Qwen2.5-32B-quantized.w4a16

NaNK
25
0

Llama-2-7b-chat-quantized.w8a16

NaNK
llama
24
0

Qwen2.5-32B-Instruct-FP8-dynamic

NaNK
24
0

Llama-3.1-Nemotron-70B-Instruct-HF-quantized.w4a16

NaNK
llama
24
0

Llama-3.3-70B-Instruct-FP8-block

Model Overview - Model Architecture: LlamaForCausalLM - Input: Text - Output: Text - Model Optimizations: - Weight quantization: FP8 - Activation quantization: FP8 - Release Date: - Version: 1.0 - Model Developers: Red Hat Quantized version of meta-llama/Llama-3.3-70B-Instruct. This model was obtained by quantizing the weights and activations of meta-llama/Llama-3.3-70B-Instruct to FP8 data type. This optimization reduces the number of bits per parameter from 16 to 8, reducing the disk size and GPU memory requirements by approximately 50%. Only the weights and activations of the linear operators within transformers blocks of the language model are quantized. This model was quantized using the llm-compressor library as shown below. The model was evaluated on the OpenLLMv1 leaderboard task, using lm-evaluation-harness, and on reasoning tasks using lighteval. vLLM was used for all evaluations. Category Metric meta-llama/Llama-3.3-70B-Instruct nm-testing/Llama-3.3-70B-Instruct-FP8-block Recovery (%) OpenLLM V1 ARC-Challenge (Acc-Norm, 25-shot) 72.53 72.61 100.12 OpenLLM V2 IFEval (Inst Level Strict Acc, 0-shot) 92.57 92.57 100.00

NaNK
base_model:meta-llama/Llama-3.3-70B-Instruct
24
0

OpenHermes-2.5-Mistral-7B-marlin

NaNK
23
2

gemma-2-9b-it-quantized.w4a16

NaNK
license:llama2
23
2

gemma-2-9b-it-quantized.w8a16

NaNK
23
1

Qwen2-1.5B-Instruct-quantized.w4a16

NaNK
license:apache-2.0
23
0

Qwen3-32B-FP8-block

Model Overview - Model Architecture: Qwen3ForCausalLM - Input: Text - Output: Text - Model Optimizations: - Weight quantization: FP8 - Activation quantization: FP8 - Release Date: - Version: 1.0 - Model Developers: Red Hat This model was obtained by quantizing the weights and activations of Qwen/Qwen3-32B to FP8 data type. This optimization reduces the number of bits per parameter from 16 to 8, reducing the disk size and GPU memory requirements by approximately 50%. Only the weights and activations of the linear operators within transformers blocks of the language model are quantized. This model was quantized using the llm-compressor library as shown below. The model was evaluated on the OpenLLM leaderboard task, using lm-evaluation-harness. vLLM was used for all evaluations. Category Metric Qwen/Qwen3-32B nm-testing/Qwen3-32B-FP8-block Recovery (%) OpenLLM V1 ARC-Challenge (Acc-Norm, 25-shot) 72.95 72.78 99.77 OpenLLM V2 IFEval (Inst Level Strict Acc, 0-shot) 49.04 49.28 100.49

NaNK
license:apache-2.0
23
0

bge-large-en-v1.5-quant

license:mit
22
22

Meta-Llama-3-70B-Instruct-FP8-KV

NaNK
llama
22
2

Qwen3-4B-Instruct-2507-quantized.w4a16

NaNK
license:apache-2.0
22
0

Qwen3-4B-Instruct-2507-quantized.w8a8

NaNK
license:apache-2.0
22
0

Llama-3.1-8B-tldr

NaNK
llama
21
2

Qwen3-235B-A22B-speculator.eagle3

NaNK
license:apache-2.0
21
1

Qwen2.5-0.5B-quantized.w8a8

NaNK
license:apache-2.0
21
0

Meta-Llama-3.1-405B-Instruct-quantized.w8a8

NaNK
llama
20
2

granite-3.1-8b-base-quantized.w8a8

NaNK
license:apache-2.0
20
0

GLM-4.6-quantized.w4a16

NaNK
19
0

granite-embedding-english-r2

license:apache-2.0
19
0

Llama-2-7b-gsm8k-pruned_70

NaNK
llama
19
0

granite-3.1-2b-instruct-quantized.w8a8

NaNK
license:apache-2.0
19
0

granite-3.1-2b-base-quantized.w8a8

NaNK
license:apache-2.0
19
0

Qwen2-1.5B-Instruct-quantized.w8a16

NaNK
license:apache-2.0
18
0

DeepSeek-R1-quantized.w4a16

NaNK
license:mit
17
7

Mistral-7B-Instruct-v0.3-quantized.w8a8

NaNK
license:apache-2.0
17
2

granite-3.1-8b-base-quantized.w4a16

NaNK
license:apache-2.0
17
1

Qwen3-Embedding-8B

NaNK
license:apache-2.0
17
0

Qwen2.5-7B-quantized.w4a16

NaNK
license:apache-2.0
16
0

Pixtral-Large-Instruct-2411-hf-quantized.w8a8

16
0

oBERT-12-upstream-pruned-unstructured-97-finetuned-qqp

15
0

Llama4-Maverick-17B-128E-Instruct-speculator.eagle3

Llama-4-Maverick-17B-128E-Instruct-speculators.eagle3 Model Overview - Verifier: meta-llama/Llama-4-Maverick-17B-128E-Instruct - Speculative Decoding Algorithm: EAGLE-3 - Model Architecture: Eagle3Speculator - Release Date: 09/17/2025 - Version: 1.0 - Model Developers: RedHat This is a speculator model designed for use with meta-llama/Llama-4-Maverick-17B-128E-Instruct, based on the EAGLE-3 speculative decoding algorithm. It was converted into the speculators format from the model nvidia/Llama-4-Maverick-17B-128E-Eagle3. This model should be used with the meta-llama/Llama-4-Maverick-17B-128E-Instruct chat template, specifically through the `/chat/completions` endpoint. Text Summarization 1.69 2.12 2.37 2.52 2.60 2.63 2.63 - temperature: 0.6 - top_p: 0.9 - repetitions: 3 - time per experiment: 3min - hardware: 8xB200 - vLLM version: 0.11.0 - GuideLLM version: 0.3.0 If you use this model, please cite both the original NVIDIA model and the Speculators library: - Original model by NVIDIA Corporation - Conversion and formatting for Speculators/vLLM compatibility - Based on Eagle3 architecture with Llama3 draft head targeting Llama4 verifier

NaNK
llama3
14
0

Qwen2-7B-Instruct-quantized.w4a16

NaNK
license:apache-2.0
14
0

TinyLlama-1.1B-Chat-v1.0-pruned2.4

NaNK
llama
13
1

QwQ-32B-quantized.w8a8

NaNK
13
0

Meta-Llama-3.1-70B-Instruct-quantized.w8a16

NaNK
llama
12
5

Mistral-Small-4-119B-2603-NVFP4

NaNK
license:apache-2.0
12
1

NVIDIA-Nemotron-3-Super-120B-A12B-FP8

NaNK
12
0

starcoder2-15b-quantized.w8a16

NaNK
12
0

Qwen2.5-32B-Instruct-quantized.w4a16

NaNK
12
0

Qwen3-30B-A3B-Thinking-2507-quantized.w8a8

NaNK
license:apache-2.0
11
0

bge-base-en-v1.5-dense

license:mit
11
0

Llama-2-7b_oneshot-pruned70_C4_10k

NaNK
llama
11
0

Qwen2-72B-Instruct-quantized.w8a16

NaNK
license:apache-2.0
10
1

oBERT-12-upstream-pruned-unstructured-97-finetuned-mnli

10
0

oBERT-6-downstream-pruned-unstructured-90-squadv1

10
0

mpt-7b-gsm8k-pruned50-quant-ds

NaNK
10
0

Qwen2-0.5B-Instruct-quantized.w8a8

NaNK
license:apache-2.0
10
0

Mixtral-8x7B-Instruct-v0.1-AutoFP8

NaNK
license:apache-2.0
9
3

Qwen2.5-14B-quantized.w8a8

NaNK
license:apache-2.0
9
2

oBERT-6-downstream-pruned-block4-80-squadv1

9
1

Qwen3-Next-80B-A3B-Instruct-FP8-block

NaNK
license:apache-2.0
9
0

oBERT-12-upstream-pruned-unstructured-97-finetuned-squadv1

9
0

whisper-small-quantized.w8a8

Model Overview - Model Architecture: whisper-small - Input: Audio-Text - Output: Text - Model Optimizations: - Weight quantization: INT8 - Activation quantization: INT8 - Release Date: 04/16/2025 - Version: 1.0 - Model Developers: Neural Magic This model was obtained by quantizing the weights of openai/whisper-small to INT8 data type, ready for inference with vLLM >= 0.5.2. This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details. This model was created with llm-compressor by running the code snippet below. The model was evaluated on LibriSpeech and Fleurs datasets using lmms-eval, via the following commands: Fleurs (X→en, WER) cmn_hans_cn 23.0642 24.1052 95.68%

license:apache-2.0
9
0

Meta-Llama-3-8B-Instruct-quantized.w4a16

NaNK
llama
8
2

Llama-4-Maverick-17B-128E-Instruct

NaNK
llama4
8
2

Phi-3-mini-128k-instruct-quantized.w4a16

license:llama2
8
1

Qwen2-72B-Instruct-quantized.w8a8

NaNK
license:apache-2.0
8
1

Qwen2.5-72B-FP8-dynamic

NaNK
license:apache-2.0
8
1

whisper-small-quantized.w4a16

Model Overview - Model Architecture: whisper-small - Input: Audio-Text - Output: Text - Model Optimizations: - Weight quantization: INT4 - Release Date: 04/16/2025 - Version: 1.0 - Model Developers: Neural Magic This model was obtained by quantizing the weights of openai/whisper-small to INT4 data type, ready for inference with vLLM >= 0.5.2. This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details. This model was created with llm-compressor by running the code snippet below. The model was evaluated on LibriSpeech and Fleurs datasets using lmms-eval, via the following commands: Fleurs (X→en, WER) cmn_hans_cn 23.0642 25.5212 90.37%

license:apache-2.0
8
1

Qwen3-30B-A3B-Instruct-2507-quantized.w8a8

NaNK
license:apache-2.0
8
0

bge-small-en-v1.5-dense

license:mit
8
0

Llama-2-7b-ultrachat200k-pruned_70-quantized-deepsparse

NaNK
llama
8
0

Phi-3-mini-128k-instruct-FP8

license:mit
8
0

Qwen2-7B-Instruct-quantized.w8a8

NaNK
license:apache-2.0
8
0

granite-3.1-2b-base-quantized.w4a16

NaNK
license:apache-2.0
8
0

Qwen2-VL-72B-Instruct-quantized.w4a16

Model Overview - Model Architecture: Qwen/Qwen2-VL-72B-Instruct - Input: Vision-Text - Output: Text - Model Optimizations: - Weight quantization: INT4 - Activation quantization: FP16 - Release Date: 2/24/2025 - Version: 1.0 - Model Developers: Neural Magic This model was obtained by quantizing the weights of Qwen/Qwen2-VL-72B-Instruct to INT4 data type, ready for inference with vLLM >= 0.5.2. This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details. This model was created with llm-compressor by running the code snippet below as part of a multimodal announcement blog. The model was evaluated using mistral-evals for vision-related tasks and using lm-evaluation-harness for select text-based benchmarks. The evaluations were conducted using the following commands: Vision Tasks - vqav2 - docvqa - mathvista - mmmu - chartqa Category Metric Qwen/Qwen2-VL-72B-Instruct nm-testing/Qwen2-VL-72B-Instruct-quantized.W4A16 Recovery (%) Vision MMMU (val, CoT) explicit_prompt_relaxed_correctness 62.11 60.11 96.78% ChartQA (test, CoT) anywhere_in_answer_relaxed_correctness 83.40 80.72 96.78% Mathvista (testmini, CoT) explicit_prompt_relaxed_correctness 66.57 64.66 97.13% This model achieves up to 3.7x speedup in single-stream deployment and up to 3.3x speedup in multi-stream asynchronous deployment, depending on hardware and use-case scenario. The following performance benchmarks were conducted with vLLM version 0.7.2, and GuideLLM. Single-stream performance (measured with vLLM version 0.7.2) Document Visual Question Answering 1680W x 2240H 64/128 Visual Reasoning 640W x 480H 128/128 Image Captioning 480W x 360H 0/128 Hardware Number of GPUs Model Average Cost Reduction Latency (s) QPD Latency (s) QPD Latency (s) QPD 2 neuralmagic/Qwen2-VL-72B-Instruct-quantized.w8a8 1.85 7.2 139 4.9 206 4.8 211 1 neuralmagic/Qwen2-VL-72B-Instruct-quantized.w4a16 3.32 10.0 202 5.0 398 4.8 419 2 neuralmagic/Qwen2-VL-72B-Instruct-FP8-Dynamic 1.79 4.7 119 3.3 173 3.2 177 1 neuralmagic/Qwen2-VL-72B-Instruct-quantized.w4a16 2.60 6.4 172 4.3 253 4.2 259 Use case profiles: Image Size (WxH) / prompt tokens / generation tokens QPD: Queries per dollar, based on on-demand cost at Lambda Labs (observed on 2/18/2025). Multi-stream asynchronous performance (measured with vLLM version 0.7.2) Document Visual Question Answering 1680W x 2240H 64/128 Visual Reasoning 640W x 480H 128/128 Image Captioning 480W x 360H 0/128 Hardware Model Average Cost Reduction Maximum throughput (QPS) QPD Maximum throughput (QPS) QPD Maximum throughput (QPS) QPD neuralmagic/Qwen2-VL-72B-Instruct-quantized.w8a8 1.84 0.6 293 2.0 1021 2.3 1135 neuralmagic/Qwen2-VL-72B-Instruct-quantized.w4a16 2.73 0.6 314 3.2 1591 4.0 2019 neuralmagic/Qwen2-VL-72B-Instruct-FP8-Dynamic 1.70 0.8 236 2.2 623 2.4 669 neuralmagic/Qwen2-VL-72B-Instruct-quantized.w4a16 2.35 1.3 350 3.3 910 3.6 994 Use case profiles: Image Size (WxH) / prompt tokens / generation tokens QPD: Queries per dollar, based on on-demand cost at Lambda Labs (observed on 2/18/2025).
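The deployment example referenced above is not reproduced in this listing; the sketch below assumes the model is already served with vLLM (for example `vllm serve RedHatAI/Qwen2-VL-72B-Instruct-quantized.w4a16 --tensor-parallel-size 2`), and the image URL, question, and repository id are illustrative.

```python
# Minimal sketch: multimodal chat completion against a vLLM server through the OpenAI-compatible API.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="RedHatAI/Qwen2-VL-72B-Instruct-quantized.w4a16",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
            {"type": "text", "text": "What trend does this chart show?"},
        ],
    }],
    max_tokens=128,
)

print(response.choices[0].message.content)
```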

NaNK
license:apache-2.0
8
0

whisper-large-v2-quantized.w8a8

Model Overview - Model Architecture: whisper-large-v2 - Input: Audio-Text - Output: Text - Model Optimizations: - Weight quantization: INT8 - Activation quantization: INT8 - Release Date: 04/16/2025 - Version: 1.0 - Model Developers: Neural Magic This model was obtained by quantizing the weights of openai/whisper-large-v2 to INT8 data type, ready for inference with vLLM >= 0.5.2. This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details. This model was created with llm-compressor by running the code snippet below. The model was evaluated on LibriSpeech and Fleurs datasets using lmms-eval, via the following commands: Fleurs (X→en, WER) cmn_hans_cn 15.2148 15.4498 98.48%

NaNK
license:apache-2.0
8
0

gemma-2-9b-it-quantized.w8a8

NaNK
7
2

Qwen2.5-14B-FP8-dynamic

NaNK
license:apache-2.0
7
2

DeepSeek-Coder-V2-Instruct-0724-quantized.w4a16

7
1

oBERT-12-downstream-pruned-unstructured-80-mnli

7
0

oBERT-6-downstream-pruned-unstructured-80-squadv1

7
0

oBERT-12-upstream-pruned-unstructured-97-finetuned-squadv1-v2

7
0

Qwen2-7B-Instruct-quantized.w8a16

NaNK
license:apache-2.0
7
0

Sparse-Llama-3.1-8B-evolcodealpaca-2of4-quantized.w4a16

Sparse-Llama-3.1-8B-evolcodealpaca-2of4-quantized.w4a16 Model Overview - Model Architecture: Llama-3.1-8B - Input: Text - Output: Text - Model Optimizations: - Sparsity: 2:4 - Weight quantization: INT4 - Release Date: 11/21/2024 - Version: 1.0 - License(s): llama3.1 - Model Developers: Neural Magic This is a code completion AI model obtained by fine-tuning the 2:4 sparse Sparse-Llama-3.1-8B-2of4 on the evol-codealpaca-v1 dataset, followed by quantization. On the HumanEval benchmark, it achieves a pass@1 of 50.6, compared to 48.5 for the fine-tuned dense model Llama-3.1-8B-evolcodealpaca — demonstrating over 100% accuracy recovery. This model was obtained by quantizing the weights of Sparse-Llama-3.1-8B-evolcodealpaca-2of4 to INT4 data type. This optimization reduces the number of bits per parameter from 16 to 4, reducing the disk size and GPU memory requirements by approximately 75%. That is on top of the reduction of 50% of weights via 2:4 pruning employed on Sparse-Llama-3.1-8B-evolcodealpaca-2of4. Only the weights of the linear operators within transformers blocks are quantized. Symmetric per-channel quantization is applied, in which a linear scaling per output dimension maps the INT4 and floating point representations of the quantized weights. The GPTQ algorithm is applied for quantization, as implemented in the llm-compressor library. This model can be deployed efficiently using the vLLM backend. vLLM also supports OpenAI-compatible serving. See the documentation for more details. This model was evaluated on Neural Magic's fork of EvalPlus. Metric Llama-3.1-8B-evolcodealpaca Sparse-Llama-3.1-8B-evolcodealpaca-2of4 Sparse-Llama-3.1-8B-evolcodealpaca-2of4-quantized.w4a16

NaNK
llama
7
0

Qwen2.5-7B-FP8-dynamic

NaNK
license:apache-2.0
7
0

Sparse-Llama-3.1-8B-ultrachat_200k-2of4-FP8-dynamic

NaNK
llama
6
1

oBERT-12-downstream-pruned-unstructured-90-mnli

6
0

oBERT-teacher-qqp

6
0

oBERT-12-upstream-pruned-unstructured-90-finetuned-mnli

6
0

oBERT-12-upstream-pruned-unstructured-97-finetuned-qqp-v2

6
0

MiniChat-3B-pruned50-quant-ds

NaNK
llama
6
0

Llama-2-7b-ultrachat200k-pruned_50

NaNK
llama
6
0

Llama-2-7b-dolphin-open_platypus-pruned_50-quantized-deepsparse

NaNK
llama
6
0

starcoder2-3b-FP8

NaNK
6
0

starcoder2-7b-quantized.w8a16

NaNK
6
0

Qwen2.5-14B-Instruct-FP8-dynamic

NaNK
6
0

Phi-3-medium-128k-instruct-FP8

license:mit
5
5

bge-small-en-v1.5-sparse

license:mit
5
4

mpt-7b-gsm8k-pruned80-quant-ds

NaNK
5
2

Sparse-Llama-3.1-8B-ultrachat_200k-2of4

Model Overview - Model Architecture: Llama-3.1-8B - Input: Text - Output: Text - Model Optimizations: - Sparsity: 2:4 - Release Date: 11/21/2024 - Version: 1.0 - License(s): llama3.1 - Model Developers: Neural Magic This is a multi-turn conversational AI model obtained by fine-tuning the 2:4 sparse Sparse-Llama-3.1-8B-2of4 on the ultrachat_200k dataset. On the AlpacaEval benchmark (version 1), it achieves a score of 61.1, compared to 62.0 for the fine-tuned dense model Llama-3.1-8B-ultrachat_200k — demonstrating a 98.5% accuracy recovery. This model inherits the optimizations from its parent, Sparse-Llama-3.1-8B-2of4. Namely, all linear operators within transformer blocks were pruned to the 2:4 sparsity pattern: in each group of four weights, two are retained while two are pruned. This model can be deployed efficiently using the vLLM backend. vLLM also supports OpenAI-compatible serving. See the documentation for more details. This model was evaluated on Neural Magic's fork of the AlpacaEval benchmark. We adopt the same setup as in Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment, using version 1 of the benchmark and Llama-2-70b-chat as the annotator. Metric Llama-3.1-8B-ultrachat_200k Sparse-Llama-3.1-8B-ultrachat_200k-2of4

NaNK
llama
5
1

oBERT-teacher-squadv1

5
0

oBERT-12-downstream-pruned-unstructured-80-squadv1

5
0

oBERT-teacher-mnli

5
0

oBERT-12-downstream-pruned-unstructured-90-qqp

5
0

oBERT-12-upstream-pruned-unstructured-90

5
0

oBERT-12-upstream-pruned-unstructured-90-v2

5
0

oBERT-12-upstream-pruned-unstructured-90-finetuned-qqp-v2

5
0

OpenHermes-2.5-Mistral-7B-pruned2.4

NaNK
5
0

Nous-Hermes-2-SOLAR-10.7B-pruned2.4

NaNK
llama
5
0

Llama-2-7b-evol-code-alpaca-pruned_70-quantized-deepsparse

NaNK
llama
5
0

starcoder2-15b-FP8

NaNK
5
0

starcoder2-7b-FP8

NaNK
5
0

starcoder2-15b-quantized.w8a8

NaNK
5
0

Qwen2.5-0.5B-quantized.w4a16

NaNK
license:apache-2.0
5
0

ToolACE-2-Llama-3.1-8B-FP8-dynamic

NaNK
llama
5
0

Nous-Hermes-2-Yi-34B-marlin

NaNK
llama
4
5

Qwen2-72B-Instruct-quantized.w4a16

NaNK
license:apache-2.0
4
4

Phi-3-medium-128k-instruct-quantized.w8a8

license:mit
4
2

mpt-7b-gsm8k-pruned70-quant-ds

NaNK
4
1

Pixtral-Large-Instruct-2411-hf-FP8-dynamic

NaNK
4
1

oBERT-12-downstream-pruned-unstructured-97-mnli

4
0

oBERT-12-downstream-pruned-unstructured-97-qqp

4
0

oBERT-12-upstream-pretrained-dense

4
0

oBERT-12-upstream-pruned-unstructured-97

4
0

oBERT-6-downstream-dense-squadv1

4
0

oBERT-12-upstream-pruned-unstructured-90-finetuned-mnli-v2

4
0

mpt-7b-gsm8k-pruned60-quant-ds

NaNK
4
0

mpt-7b-gsm8k-pruned60-pt

NaNK
4
0

llama-2-7b-chat-marlin

NaNK
llama
4
0

Llama-2-7b-ultrachat200k-pruned_50-quantized-deepsparse

NaNK
llama
4
0

Llama-2-7b-evol-code-alpaca-pruned_50

NaNK
llama
4
0

DeepSeek-Coder-V2-Base-FP8

4
0

Qwen2.5-1.5B-quantized.w8a16

NaNK
license:apache-2.0
4
0

Qwen2.5-1.5B-FP8-dynamic

Model Overview - Model Architecture: Qwen2 - Input: Text - Output: Text - Model Optimizations: - Activation quantization: INT8 - Weight quantization: INT8 - Intended Use Cases: Intended for commercial and research use in multiple languages. Similarly to Qwen2.5-1.5B, this model is intended for assistant-like chat. - Out-of-scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws). - Release Date: 11/27/2024 - Version: 1.0 - License(s): apache-2.0 - Model Developers: Neural Magic Quantized version of Qwen2.5-1.5B. It achieves an average score of 58.34 on the OpenLLM benchmark (version 1), whereas the unquantized model achieves 58.48. This model was obtained by quantizing the weights of Qwen2.5-1.5B to INT8 data type. This optimization reduces the number of bits used to represent weights and activations from 16 to 8, reducing GPU memory requirements (by approximately 50%) and increasing matrix-multiply compute throughput (by approximately 2x). Weight quantization also reduces disk size requirements by approximately 50%. Only weights and activations of the linear operators within transformers blocks are quantized. Weights are quantized with a symmetric static per-channel scheme, where a fixed linear scaling factor is applied between INT8 and floating point representations for each output channel dimension. Activations are quantized with a symmetric dynamic per-token scheme, computing a linear scaling factor at runtime for each token between INT8 and floating point representations. This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details. The model was evaluated on the OpenLLM leaderboard tasks (version 1) with the lm-evaluation-harness (commit 383bbd54bc621086e05aa1b030d8d4d5635b25e6) and the vLLM engine, using the following command:
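The evaluation command referenced above is not reproduced in this listing; the sketch below shows an equivalent call through lm-evaluation-harness's Python API with the vLLM backend. The task list, model_args string, and repository id are illustrative assumptions rather than the card's exact setup.

```python
# Minimal sketch: OpenLLM-v1-style evaluation via lm-evaluation-harness with a vLLM backend.
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args="pretrained=RedHatAI/Qwen2.5-1.5B-FP8-dynamic,dtype=auto,gpu_memory_utilization=0.9",
    tasks=["arc_challenge", "gsm8k", "hellaswag", "mmlu", "truthfulqa_mc2", "winogrande"],
    batch_size="auto",
)

# Per-task metrics live under results["results"].
for task, metrics in results["results"].items():
    print(task, metrics)
```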

NaNK
license:apache-2.0
4
0

Qwen2.5-3B-FP8-dynamic

Model Overview - Model Architecture: Qwen2 - Input: Text - Output: Text - Model Optimizations: - Activation quantization: INT8 - Weight quantization: INT8 - Intended Use Cases: Intended for commercial and research use in multiple languages. Similarly to Qwen2.5-3B, this model is intended for assistant-like chat. - Out-of-scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws). - Release Date: 11/27/2024 - Version: 1.0 - License(s): apache-2.0 - Model Developers: Neural Magic Quantized version of Qwen2.5-3B. It achieves an average score of 62.50 on the OpenLLM benchmark (version 1), whereas the unquantized model achieves 63.59. This model was obtained by quantizing the weights of Qwen2.5-3B to INT8 data type. This optimization reduces the number of bits used to represent weights and activations from 16 to 8, reducing GPU memory requirements (by approximately 50%) and increasing matrix-multiply compute throughput (by approximately 2x). Weight quantization also reduces disk size requirements by approximately 50%. Only weights and activations of the linear operators within transformers blocks are quantized. Weights are quantized with a symmetric static per-channel scheme, where a fixed linear scaling factor is applied between INT8 and floating point representations for each output channel dimension. Activations are quantized with a symmetric dynamic per-token scheme, computing a linear scaling factor at runtime for each token between INT8 and floating point representations. This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details. The model was evaluated on the OpenLLM leaderboard tasks (version 1) with the lm-evaluation-harness (commit 383bbd54bc621086e05aa1b030d8d4d5635b25e6) and the vLLM engine, using the following command:

NaNK
license:apache-2.0
4
0

Qwen2.5-72B-Instruct-FP8-dynamic

NaNK
4
0

granite-3.1-8b-instruct-GGUF

NaNK
license:apache-2.0
4
0

Qwen2.5-3B-quantized.w4a16

NaNK
license:apache-2.0
4
0

Mixtral-8x7B-v0.1-quantized.w4a16

NaNK
license:apache-2.0
4
0

whisper-medium-quantized.w8a8

Model Overview - Model Architecture: whisper-medium - Input: Audio-Text - Output: Text - Model Optimizations: - Weight quantization: INT8 - Activation quantization: INT8 - Release Date: 04/16/2025 - Version: 1.0 - Model Developers: Neural Magic This model was obtained by quantizing the weights of openai/whisper-medium to INT8 data type, ready for inference with vLLM >= 0.5.2. This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details. This model was created with llm-compressor by running the code snippet below. The model was evaluated on LibriSpeech and Fleurs datasets using lmms-eval, via the following commands: Fleurs (X→en, WER) cmn_hans_cn 13.3371 12.6123 105.75%

license:apache-2.0
4
0

SOLAR-10.7B-Instruct-v1.0-pruned50-quant-ds

NaNK
llama
3
5

Sparse-Llama-3.1-8B-ultrachat_200k-2of4-quantized.w4a16

NaNK
llama
3
3

Meta-Llama-3-70B-Instruct-quantized.w4a16

NaNK
llama
3
2

bge-base-en-v1.5-sparse

license:mit
3
1

SmolLM-135M-Instruct-quantized.w8a8

NaNK
llama
3
1

Sparse-Llama-3.1-8B-gsm8k-2of4-FP8-dynamic

NaNK
llama
3
1

DeepSeek-V3-BF16

3
1

whisper-large-v2-quantized.w4a16

Model Overview - Model Architecture: whisper-large-v2 - Input: Audio-Text - Output: Text - Model Optimizations: - Weight quantization: INT4 - Release Date: 04/16/2025 - Version: 1.0 - Model Developers: Neural Magic This model was obtained by quantizing the weights of openai/whisper-large-v2 to INT4 data type, ready for inference with vLLM >= 0.5.2. This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details. This model was created with llm-compressor by running the code snippet below. The model was evaluated on LibriSpeech and Fleurs datasets using lmms-eval, via the following commands: Fleurs (X→en, WER) cmn_hans_cn 15.2148 23.5763 64.53%

NaNK
license:apache-2.0
3
1

oBERT-12-downstream-pruned-unstructured-97-squadv1

3
0

oBERT-3-upstream-pretrained-dense

3
0

oBERT-12-upstream-pruned-unstructured-90-finetuned-qqp

3
0

oBERT-3-downstream-pruned-block4-80-squadv1

3
0

oBERT-6-downstream-pruned-block4-80-QAT-squadv1

3
0

bge-large-en-v1.5-dense

license:mit
3
0

zephyr-7b-beta-pruned50-quant-ds

NaNK
3
0

Nous-Hermes-2-Yi-34B-pruned2.4

NaNK
llama
3
0

Nous-Hermes-2-Yi-34B-pruned50

NaNK
llama
3
0

llama2.c-stories110M-pruned2.4

NaNK
llama
3
0

Llama-2-7b-cnn-daily-mail-pruned_70-quantized-deepsparse

NaNK
llama
3
0

SparseLLama-2-7b-ultrachat_200k-pruned_50.2of4

NaNK
llama
3
0

SparseLlama-2-7b-evolcodealpaca-pruned_50.2of4

NaNK
llama
3
0

Llama-2-7b-chat-quantized.w4a16

NaNK
llama
3
0

Meta-Llama-3-70B-Instruct-quantized.w8a8

NaNK
llama
3
0

gemma-2-27b-it-quantized.w8a16

NaNK
3
0

SmolLM-135M-Instruct-quantized.w8a16

NaNK
llama
3
0

SmolLM-360M-Instruct-quantized.w8a8

NaNK
llama
3
0

Qwen2.5-72B-quantized.w8a8

NaNK
license:apache-2.0
3
0

Qwen2.5-32B-quantized.w8a16

NaNK
license:apache-2.0
3
0

granite-3.1-2b-base-FP8-dynamic

Model Overview - Model Architecture: granite-3.1-2b-base - Input: Text - Output: Text - Model Optimizations: - Weight quantization: FP8 - Activation quantization: FP8 - Release Date: 1/8/2025 - Version: 1.0 - Model Developers: Neural Magic Quantized version of ibm-granite/granite-3.1-2b-base. It achieves an average score of 57.37 on the OpenLLM benchmark (version 1), whereas the unquantized model achieves 57.65. This model was obtained by quantizing the weights and activations of ibm-granite/granite-3.1-2b-base to FP8 data type, ready for inference with vLLM >= 0.5.2. This optimization reduces the number of bits per parameter from 16 to 8, reducing the disk size and GPU memory requirements by approximately 50%. Only the weights and activations of the linear operators within transformer blocks are quantized. This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details. This model was created with llm-compressor by running the code snippet below. The model was evaluated on OpenLLM Leaderboard V1, OpenLLM Leaderboard V2 and on HumanEval, using the following commands: OpenLLM V1, ARC-Challenge (Acc-Norm, 25-shot): ibm-granite/granite-3.1-2b-base 53.75, neuralmagic/granite-3.1-2b-base-FP8-dynamic 53.50, recovery 99.54%. This model achieves up to 1.2x speedup in single-stream deployment on L40 GPUs. The following performance benchmarks were conducted with vLLM version 0.6.6.post1, and GuideLLM. Single-stream latency in seconds on L40 (measured with vLLM version 0.6.6.post1), across use cases defined by prefill/decode token counts of Code Completion (256/1024), Docstring Generation (768/128), Code Fixing (1024/1024), RAG (1024/128), Instruction Following (256/128), Multi-turn Chat (512/256), and Large Summarization (4096/512): granite-3.1-2b-base-FP8-dynamic (this model, 1.26x speedup): 7.3, 0.9, 7.4, 1.0, 0.9, 1.8, 4.1; granite-3.1-2b-base-quantized.w4a16 (1.88x speedup): 4.8, 0.6, 4.9, 0.6, 0.6, 1.2, 2.8
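The deployment example referenced above is not included on this page; as a minimal sketch (the repository ID is taken from the card's own metric table, while the serving flags and prompt are illustrative assumptions), deployment plus a completion request could look like:

```bash
# Serving sketch; flags are illustrative assumptions, not taken from the card.
vllm serve neuralmagic/granite-3.1-2b-base-FP8-dynamic --max-model-len 4096 &

# granite-3.1-2b-base is a base model, so the plain completions endpoint is used here.
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "neuralmagic/granite-3.1-2b-base-FP8-dynamic", "prompt": "def fibonacci(n):", "max_tokens": 64}'
```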

NaNK
license:apache-2.0
3
0

Llama-3.1-70B-Instruct-NVFP4A16

NaNK
llama
3
0

watt-tool-8B-FP8-dynamic

NaNK
llama
3
0

mpt-7b-chat-pruned50-quant-ds

NaNK
2
4

Mixtral-8x22B-Instruct-v0.1-AutoFP8

NaNK
license:apache-2.0
2
3

mobilebert-uncased-finetuned-squadv1

2
1

OpenHermes-2.5-Mistral-7B-pruned50

NaNK
2
1

Llama-2-7b-dolphin-open_platypus-pruned_70-quantized-deepsparse

NaNK
llama
2
1

Meta-Llama-3.1-8B-quantized.w8a16

NaNK
llama
2
1

Meta-Llama-3.1-405B-Instruct-quantized.w8a16

NaNK
llama
2
1

Qwen2.5-7B-quantized.w8a16

NaNK
license:apache-2.0
2
1

Sparse-Llama-3.1-8B-gsm8k-2of4

NaNK
llama
2
1

oBERT-12-downstream-pruned-unstructured-90-squadv1

2
0

oBERT-12-upstream-pruned-unstructured-90-finetuned-squadv1

2
0

oBERT-12-downstream-pruned-block4-80-squadv1

2
0

oBERT-12-downstream-pruned-block4-90-squadv1

2
0

oBERT-3-downstream-pruned-unstructured-80-squadv1

2
0

oBERT-6-downstream-dense-QAT-squadv1

2
0

oBERT-6-downstream-pruned-block4-90-QAT-squadv1

2
0

oBERT-12-upstream-pruned-unstructured-97-v2

2
0

oBERT-12-upstream-pruned-unstructured-97-finetuned-mnli-v2

2
0

mpt-7b-gsm8k-pruned75-quant-ds

NaNK
2
0

Llama-2-7b-evol-code-alpaca-pruned_70

NaNK
llama
2
0

Llama-2-7b-evol-code-alpaca-pruned_50-quantized-deepsparse

NaNK
llama
2
0

Llama-2-7b-dolphin-open_platypus-pruned_50

NaNK
llama
2
0

Llama-2-7b-cnn-daily-mail-pruned_50-quantized-deepsparse

NaNK
llama
2
0

Phi-3-mini-128k-instruct-quantized.w8a16

NaNK
license:mit
2
0

starcoder2-7b-quantized.w8a8

NaNK
2
0

Phi-3-small-128k-instruct-quantized.w8a16

NaNK
license:mit
2
0

Qwen2.5-72B-quantized.w8a16

NaNK
license:apache-2.0
2
0

granite-3.0-8b-instruct-GGUF

NaNK
license:apache-2.0
2
0

granite-3.0-2b-instruct-GGUF

NaNK
license:apache-2.0
2
0

Sparse-Llama-3.1-8B-gsm8k-2of4-quantized.w4a16

Model Overview - Model Architecture: Llama-3.1-8B - Input: Text - Output: Text - Model Optimizations: - Sparsity: 2:4 - Weight quantization: INT4 - Release Date: 11/21/2024 - Version: 1.0 - License(s): llama3.1 - Model Developers: Neural Magic This is an AI model specialized in grade-school math, obtained by fine-tuning the 2:4 sparse Sparse-Llama-3.1-8B-2of4 on the GSM8k dataset, followed by one-shot quantization. It achieves 64.3% 0-shot accuracy on the test set of GSM8k, compared to 66.3% for the fine-tuned dense model Llama-3.1-8B-gsm8k, demonstrating over 96.9% accuracy recovery. In contrast, the pretrained Llama-3.1-8B achieves 50.7% 5-shot accuracy and the sparse foundational Sparse-Llama-3.1-8B-2of4 model achieves 56.3% 5-shot accuracy. This model was obtained by quantizing the weights of Sparse-Llama-3.1-8B-gsm8k-2of4 to INT4 data type. This optimization reduces the number of bits per parameter from 16 to 4, reducing the disk size and GPU memory requirements by approximately 75%, on top of the 50% reduction in weights already achieved through the 2:4 pruning used for Sparse-Llama-3.1-8B-gsm8k-2of4. Only the weights of the linear operators within transformer blocks are quantized. Symmetric per-channel quantization is applied, in which a linear scaling per output dimension maps the INT4 and floating point representations of the quantized weights. The GPTQ algorithm is applied for quantization, as implemented in the llm-compressor library. This model can be deployed efficiently using the vLLM backend. vLLM also supports OpenAI-compatible serving. See the documentation for more details. This model was evaluated with the lm-evaluation-harness, comparing Llama-3.1-8B (5-shot), Sparse-Llama-3.1-8B-2of4 (5-shot), Llama-3.1-8B-gsm8k (0-shot), Sparse-Llama-3.1-8B-gsm8k-2of4 (0-shot), and Sparse-Llama-3.1-8B-gsm8k-2of4-quantized.w4a16 (0-shot).
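The card's exact evaluation command is not shown here; as a rough, hedged sketch (the repository namespace and flags are assumptions, not the card's original command), a 0-shot GSM8k run with the lm-evaluation-harness vLLM backend might look like:

```bash
# Hedged sketch: the org prefix and flags are assumptions, not the card's original command.
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Sparse-Llama-3.1-8B-gsm8k-2of4-quantized.w4a16",dtype=auto \
  --tasks gsm8k \
  --num_fewshot 0 \
  --batch_size auto
```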

NaNK
llama
2
0

Llama-3.1-8B-evolcodealpaca

NaNK
llama
2
0

Qwen2.5-0.5B-FP8-dynamic

NaNK
license:apache-2.0
2
0

Qwen2.5-Coder-32B-Instruct-FP8-dynamic

NaNK
2
0

Phi-4-mini-instruct-quantized.w8a8

2
0

Llama2-7b-chat-pruned50-quant-ds

NaNK
llama
1
9

Nous-Hermes-2-SOLAR-10.7B-pruned50-quant-ds

NaNK
llama
1
7

OpenHermes-2.5-Mistral-7B-pruned50-quant-ds

NaNK
1
2

Phi-3-medium-128k-instruct-quantized.w8a16

NaNK
license:mit
1
2

mpt-7b-gsm8k-pt

NaNK
1
1

mpt-7b-gsm8k-quant-ds

NaNK
1
1

MiniChat-1.5-3B-pruned50-quant-ds

NaNK
llama
1
1

phi-2-super-marlin

license:mit
1
1

Qwen2.5-3B-quantized.w8a8

NaNK
license:apache-2.0
1
1

whisper-large-v2-W4A16-G128

Model Overview - Model Architecture: whisper-large-v2 - Input: Audio-Text - Output: Text - Model Optimizations: - Weight quantization: INT4 - Activation quantization: FP16 - Release Date: 1/31/2025 - Version: 1.0 - Model Developers: Neural Magic This model was obtained by quantizing the weights of openai/whisper-large-v2 to INT4 data type, ready for inference with vLLM >= 0.5.2. This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details. This model was created with llm-compressor by running the code snippet below as part of a multimodal announcement blog. BibTeX entry and citation info

```bibtex
@misc{radford2022whisper,
  doi = {10.48550/ARXIV.2212.04356},
  url = {https://arxiv.org/abs/2212.04356},
  author = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  title = {Robust Speech Recognition via Large-Scale Weak Supervision},
  publisher = {arXiv},
  year = {2022},
  copyright = {arXiv.org perpetual, non-exclusive license}
}
```
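The deployment example referenced in this card is likewise not reproduced here; as a small sketch (the repository ID is assumed from this entry's title, and the check below is just the standard OpenAI-compatible model listing), one way to serve and verify the model is registered is:

```bash
# Serving sketch; the repo ID is assumed from this entry's title.
vllm serve RedHatAI/whisper-large-v2-W4A16-G128 &

# List registered models on the OpenAI-compatible server before sending audio requests.
curl http://localhost:8000/v1/models
```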

NaNK
license:apache-2.0
1
1

oBERT-12-downstream-pruned-unstructured-80-qqp

1
0

oBERT-6-upstream-pretrained-dense

1
0

oBERT-6-downstream-pruned-block4-90-squadv1

1
0