RedHatAI
Meta-Llama-3.1-70B-Instruct-quantized.w4a16
---
tags: [int4, vllm]
language: [en, de, fr, it, pt, hi, es, th]
pipeline_tag: text-generation
license: llama3.1
base_model: meta-llama/Meta-Llama-3.1-70B-Instruct
---
Mistral-Small-3.1-24B-Instruct-2503-quantized.w4a16
---
language: [en, fr, de, es, it, pt, hi, id, tl, vi, ar, bg, zh, da, el, fa, fi, he, ja, ko, ms, nl, no, pl, ro, ru, sr, sv, th, tr, uk, ur, zsm, nld]
base_model: [mistralai/Mistral-Small-3.1-24B-Instruct-2503]
pipeline_tag: image-text-to-text
tags: [mistralai, mistral, mistral3, mistral-small, neuralmagic, redhat, llmcompressor, quantized, W4A16, INT4, conversational, compressed-tensors, fast]
license: apache-2.0
license_name: apache-2.0
name: RedH
Mistral-7B-Instruct-v0.3-GPTQ-4bit
---
license: apache-2.0
base_model: mistralai/Mistral-7B-Instruct-v0.3
Meta-Llama-3.1-8B-Instruct-FP8
---
tags: [fp8, vllm]
language: [en, de, fr, it, pt, hi, es, th]
pipeline_tag: text-generation
license: llama3.1
base_model: meta-llama/Meta-Llama-3.1-8B-Instruct
---
Llama-3.3-70B-Instruct-FP8-dynamic
---
language: [en, de, fr, it, pt, hi, es, th]
base_model: [meta-llama/Llama-3.3-70B-Instruct]
pipeline_tag: text-generation
tags: [llama, facebook, meta, llama-3, fp8, quantized, conversational, text-generation-inference, compressed-tensors]
license: llama3.3
license_name: llama-3.3
name: RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic
description: This model was obtained by quantizing activations and weights of Llama-3.3-70B-Instruct to FP8 data type.
readme: https://huggingface.co/R
Llama-3.2-1B-Instruct-FP8-dynamic
gemma-3-27b-it-FP8-dynamic
Qwen2.5-1.5B-quantized.w8a8
Devstral-Small-2507-FP8-Dynamic
Llama-3.2-1B-Instruct-FP8
gpt-oss-20b-speculator.eagle3
Model Overview
- Verifier: openai/gpt-oss-20b
- Speculative Decoding Algorithm: EAGLE-3
- Model Architecture: Eagle3Speculator
- Release Date: 11/21/2025
- Version: 2.0
- Model Developers: RedHat

This is a speculator model designed for use with openai/gpt-oss-20b, based on the EAGLE-3 speculative decoding algorithm. It was trained using the speculators library on a combination of the Magpie-Align/Magpie-Llama-3.1-Pro-300K-Filtered dataset and the `train_sft` split of the HuggingFaceH4/ultrachat_200k dataset. This model should be used with the openai/gpt-oss-20b chat template, specifically through the `/chat/completions` endpoint.

Text Summarization: 1.63 / 2.05 / 2.18 / 2.31 / 2.33 / 2.38 / 2.35

Benchmarking settings:
- temperature: 0.6
- top_p: 0.95
- top_k: 20
- repetitions: 3
- time per experiment: 10 min
- hardware: 2xA100
- vLLM version: 0.11.0
- GuideLLM version: 0.3.0

Command
```bash
GUIDELLM__PREFERRED_ROUTE="chat_completions" \
guidellm benchmark \
  --target "http://localhost:8000/v1" \
  --data "RedHatAI/speculator_benchmarks" \
  --data-args '{"data_files": "HumanEval.jsonl"}' \
  --rate-type sweep \
  --max-seconds 600 \
  --output-path "gpt-oss-20b-HumanEval.json" \
  --backend-args '{"extra_body": {"chat_completions": {"temperature": 0.0}}}'
```
Qwen2.5-VL-7B-Instruct-FP8-Dynamic
Model Overview
- Model Architecture: Qwen2.5-VL-7B-Instruct
- Input: Vision-Text
- Output: Text
- Model Optimizations:
  - Weight quantization: FP8
  - Activation quantization: FP8
- Release Date: 2/24/2025
- Version: 1.0
- Model Developers: Neural Magic

This model was obtained by quantizing the weights of Qwen/Qwen2.5-VL-7B-Instruct to FP8 data type, ready for inference with vLLM >= 0.5.2.

This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details.

This model was created with llm-compressor by running the code snippet below, as part of a multimodal announcement blog.

The model was evaluated using mistral-evals for vision-related tasks and lm-evaluation-harness for select text-based benchmarks. The evaluations were conducted using the following commands:

Vision Tasks
- vqav2
- docvqa
- mathvista
- mmmu
- chartqa

| Category | Metric | Qwen/Qwen2.5-VL-7B-Instruct | neuralmagic/Qwen2.5-VL-7B-Instruct-FP8-Dynamic | Recovery (%) |
|---|---|---|---|---|
| Vision | MMMU (val, CoT) explicit_prompt_relaxed_correctness | 52.00 | 52.55 | 101.06% |
| Vision | ChartQA (test, CoT) anywhere_in_answer_relaxed_correctness | 86.44 | 86.80 | 100.42% |
| Vision | Mathvista (testmini, CoT) explicit_prompt_relaxed_correctness | 69.47 | 71.07 | 102.31% |

This model achieves up to 1.3x speedup in single-stream deployment and up to 1.37x speedup in multi-stream deployment, depending on hardware and use-case scenario. The following performance benchmarks were conducted with vLLM version 0.7.2 and GuideLLM.

Single-stream performance (measured with vLLM version 0.7.2)

Use case profiles, image size (WxH) / prompt tokens / generation tokens:
- Document Visual Question Answering: 1680W x 2240H, 64/128
- Visual Reasoning: 640W x 480H, 128/128
- Image Captioning: 480W x 360H, 0/128

| Hardware | Model | Average Cost Reduction | DocVQA Latency (s) | DocVQA QPD | Visual Reasoning Latency (s) | Visual Reasoning QPD | Captioning Latency (s) | Captioning QPD |
|---|---|---|---|---|---|---|---|---|
| | neuralmagic/Qwen2.5-VL-7B-Instruct-quantized.w8a8 | 1.50 | 3.6 | 1248 | 2.1 | 2163 | 2.0 | 2237 |
| | neuralmagic/Qwen2.5-VL-7B-Instruct-quantized.w4a16 | 2.05 | 3.3 | 1351 | 1.4 | 3252 | 1.4 | 3321 |
| | neuralmagic/Qwen2.5-VL-7B-Instruct-quantized.w8a8 | 1.24 | 2.4 | 851 | 1.4 | 1454 | 1.3 | 1512 |
| | neuralmagic/Qwen2.5-VL-7B-Instruct-quantized.w4a16 | 1.49 | 2.2 | 912 | 1.1 | 1791 | 1.0 | 1950 |
| | neuralmagic/Qwen2.5-VL-7B-Instruct-FP8-Dynamic | 1.28 | 1.6 | 698 | 0.9 | 1181 | 0.9 | 1219 |
| | neuralmagic/Qwen2.5-VL-7B-Instruct-quantized.w4a16 | 1.28 | 1.6 | 686 | 0.9 | 1191 | 0.9 | 1228 |

QPD: queries per dollar, based on on-demand cost at Lambda Labs (observed on 2/18/2025).

Multi-stream asynchronous performance (measured with vLLM version 0.7.2)

| Hardware | Model | Average Cost Reduction | DocVQA Maximum throughput (QPS) | DocVQA QPD | Visual Reasoning Maximum throughput (QPS) | Visual Reasoning QPD | Captioning Maximum throughput (QPS) | Captioning QPD |
|---|---|---|---|---|---|---|---|---|
| | neuralmagic/Qwen2.5-VL-7B-Instruct-quantized.w8a8 | 1.41 | 0.5 | 2297 | 2.3 | 10137 | 2.5 | 11472 |
| | neuralmagic/Qwen2.5-VL-7B-Instruct-quantized.w4a16 | 1.60 | 0.4 | 1828 | 2.7 | 12254 | 3.4 | 15477 |
| | neuralmagic/Qwen2.5-VL-7B-Instruct-quantized.w8a8 | 1.27 | 0.8 | 1639 | 3.4 | 6851 | 3.9 | 7918 |
| | neuralmagic/Qwen2.5-VL-7B-Instruct-quantized.w4a16 | 1.21 | 0.7 | 1314 | 3.0 | 5983 | 4.6 | 9206 |
| | neuralmagic/Qwen2.5-VL-7B-Instruct-FP8-Dynamic | 1.29 | 1.2 | 1331 | 3.8 | 4109 | 4.2 | 4598 |
| | neuralmagic/Qwen2.5-VL-7B-Instruct-quantized.w4a16 | 1.28 | 1.2 | 1298 | 3.8 | 4190 | 4.2 | 4573 |

QPD: queries per dollar, based on on-demand cost at Lambda Labs (observed on 2/18/2025).
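A minimal sketch of serving this model with vLLM's OpenAI-compatible server and querying it with the `openai` client; the server address, image URL, and prompt are illustrative.

```python
# Sketch: query a vLLM OpenAI-compatible server hosting the quantized model.
# Assumes the server was started separately, e.g.:
#   vllm serve neuralmagic/Qwen2.5-VL-7B-Instruct-FP8-Dynamic
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="neuralmagic/Qwen2.5-VL-7B-Instruct-FP8-Dynamic",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
                {"type": "text", "text": "Describe the contents of this image."},
            ],
        }
    ],
    max_tokens=128,
)
print(response.choices[0].message.content)
```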
Qwen3-30B-A3B-Instruct-2507-speculator.eagle3
Qwen3.5-122B-A10B-NVFP4
Meta-Llama-3.1-8B-Instruct-quantized.w4a16
Qwen3-VL-235B-A22B-Instruct-FP8-dynamic
Model Overview
- Model Architecture: Qwen3VLMoeForConditionalGeneration
- Input: Text/Image/Video
- Output: Text
- Model Optimizations:
  - Weight quantization: FP8
  - Activation quantization: FP8
- Release Date: 09/28/2025
- Version: 1.0
- Model Developers: Red Hat

Quantized version of Qwen/Qwen3-VL-235B-A22B-Instruct. This model was obtained by quantizing the weights and activations of Qwen/Qwen3-VL-235B-A22B-Instruct to FP8 data type. This optimization reduces the number of bits per parameter from 16 to 8, reducing the disk size and GPU memory requirements by approximately 50%. Only the weights and activations of the linear operators within transformers blocks of the language model are quantized.

This model was quantized using the llm-compressor library as shown below.

The model was evaluated on the OpenLLM v1 leaderboard tasks using lm-evaluation-harness, on reasoning tasks using lighteval, and on vision tasks using lmms-eval. vLLM was used for all evaluations.

| Category | Metric | Qwen3-VL-235B-A22B-Instruct | Qwen3-VL-235B-A22B-Instruct-FP8-dynamic | Recovery (%) |
|---|---|---|---|---|
| OpenLLM V1 | ARC-Challenge (Acc-Norm, 25-shot) | 76.54 | 75.94 | 99.2 |
Meta-Llama-3.1-8B-Instruct-FP8-dynamic
Qwen3-32B-FP8-dynamic
Model Overview
- Model Architecture: Qwen3ForCausalLM
- Input: Text
- Output: Text
- Model Optimizations:
  - Activation quantization: FP8
  - Weight quantization: FP8
- Intended Use Cases:
  - Reasoning.
  - Function calling.
  - Subject matter experts via fine-tuning.
  - Multilingual instruction following.
  - Translation.
- Out-of-scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws).
- Release Date: 05/02/2025
- Version: 1.0
- Model Developers: RedHat (Neural Magic)

This model was obtained by quantizing activations and weights of Qwen3-32B to FP8 data type. This optimization reduces the number of bits used to represent weights and activations from 16 to 8, reducing GPU memory requirements (by approximately 50%) and increasing matrix-multiply compute throughput (by approximately 2x). Weight quantization also reduces disk size requirements by approximately 50%. Only weights and activations of the linear operators within transformers blocks are quantized. Weights are quantized with a symmetric static per-channel scheme, whereas activations are quantized with a symmetric dynamic per-token scheme. The llm-compressor library is used for quantization.

This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details.

Creation details: This model was created with llm-compressor by running the code snippet below.

The model was evaluated on the OpenLLM leaderboard tasks (versions 1 and 2), using lm-evaluation-harness, and on reasoning tasks using lighteval. vLLM was used for all evaluations.
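A minimal sketch of offline inference with vLLM for this model; the prompt and sampling settings are illustrative.

```python
# Minimal sketch of offline inference with vLLM for the FP8-dynamic model.
from vllm import LLM, SamplingParams

llm = LLM(model="RedHatAI/Qwen3-32B-FP8-dynamic")
sampling = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=256)

messages = [{"role": "user", "content": "Explain the difference between static and dynamic activation quantization."}]
outputs = llm.chat(messages, sampling)
print(outputs[0].outputs[0].text)
```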
Llama-4-Scout-17B-16E-Instruct-FP8-dynamic
Meta-Llama-3.1-8B-Instruct-quantized.w8a8
Model Overview
- Model Architecture: Meta-Llama-3
- Input: Text
- Output: Text
- Model Optimizations:
  - Activation quantization: INT8
  - Weight quantization: INT8
- Intended Use Cases: Intended for commercial and research use in multiple languages. Similarly to Meta-Llama-3.1-8B-Instruct, this model is intended for assistant-like chat.
- Out-of-scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws).
- Release Date: 7/11/2024
- Version: 1.0
- Validated on: RHOAI 2.20, RHAIIS 3.0, RHELAI 1.5
- License(s): Llama3.1
- Model Developers: Neural Magic

This model is a quantized version of Meta-Llama-3.1-8B-Instruct. It was evaluated on several tasks to assess its quality in comparison to the unquantized model, including multiple-choice, math reasoning, and open-ended text generation. Meta-Llama-3.1-8B-Instruct-quantized.w8a8 achieves 105.4% recovery for the Arena-Hard evaluation, 100.3% for OpenLLM v1 (using Meta's prompting when available), 101.5% for OpenLLM v2, 99.7% for HumanEval pass@1, and 98.8% for HumanEval+ pass@1.

This model was obtained by quantizing the weights and activations of Meta-Llama-3.1-8B-Instruct to INT8 data type. This optimization reduces the number of bits used to represent weights and activations from 16 to 8, reducing GPU memory requirements (by approximately 50%) and increasing matrix-multiply compute throughput (by approximately 2x). Weight quantization also reduces disk size requirements by approximately 50%.

Only weights and activations of the linear operators within transformers blocks are quantized. Weights are quantized with a symmetric static per-channel scheme, where a fixed linear scaling factor is applied between INT8 and floating point representations for each output channel dimension. Activations are quantized with a symmetric dynamic per-token scheme, computing a linear scaling factor at runtime for each token between INT8 and floating point representations. The GPTQ algorithm is applied for quantization, as implemented in the llm-compressor library. GPTQ used a 1% damping factor and 256 sequences of 8,192 random tokens.

This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details. See the Red Hat AI Inference Server, Red Hat Enterprise Linux AI, and Red Hat Openshift AI documentation for deployment on those platforms.

This model was created by using the llm-compressor library as presented in the code snippet below.

This model was evaluated on the well-known Arena-Hard, OpenLLM v1, OpenLLM v2, HumanEval, and HumanEval+ benchmarks. In all cases, model outputs were generated with the vLLM engine. Arena-Hard evaluations were conducted using the Arena-Hard-Auto repository. The model generated a single answer for each prompt from Arena-Hard, and each answer was judged twice by GPT-4. We report below the scores obtained in each judgement and the average. OpenLLM v1 and v2 evaluations were conducted using Neural Magic's fork of lm-evaluation-harness (branch llama_3.1_instruct). This version of the lm-evaluation-harness includes versions of MMLU, ARC-Challenge and GSM-8K that match the prompting style of Meta-Llama-3.1-Instruct-evals and a few fixes to OpenLLM v2 tasks. HumanEval and HumanEval+ evaluations were conducted using Neural Magic's fork of the EvalPlus repository.
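A minimal sketch of the INT8 W8A8 GPTQ flow described above, using llm-compressor; the calibration dataset, import paths, and arguments are assumptions and may differ from the original recipe.

```python
# Sketch of INT8 weight-and-activation (W8A8) quantization with GPTQ via llm-compressor.
# Calibration data and hyperparameters are illustrative, not the exact original recipe.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Small calibration set rendered through the chat template (the card describes
# 256 sequences of 8,192 random tokens; a text dataset is used here for illustration).
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft[:256]")
ds = ds.map(lambda ex: {"text": tokenizer.apply_chat_template(ex["messages"], tokenize=False)})

# GPTQ with a 1% damping factor, quantizing linear layers except the output head.
recipe = GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"], dampening_frac=0.01)

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=8192,
    num_calibration_samples=256,
)

model.save_pretrained("Meta-Llama-3.1-8B-Instruct-quantized.w8a8", save_compressed=True)
tokenizer.save_pretrained("Meta-Llama-3.1-8B-Instruct-quantized.w8a8")
```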
Detailed model outputs are available as HuggingFace datasets for Arena-Hard, OpenLLM v2, and HumanEval. Note: results have been updated after Meta modified the chat template.

Meta-Llama-3.1-8B-Instruct-quantized.w8a8 (this model)

The results were obtained using the following commands:
gpt-oss-20b
Qwen3-VL-235B-A22B-Instruct-NVFP4
Model Overview
- Model Architecture: Qwen/Qwen3-VL-235B-A22B-Instruct
- Input: Text
- Output: Text
- Model Optimizations:
  - Weight quantization: FP4
  - Activation quantization: FP4
- Out-of-scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages other than English.
- Release Date: 10/29/2025
- Version: 1.0
- Model Developers: RedHatAI

This model is a quantized version of Qwen/Qwen3-VL-235B-A22B-Instruct. It was evaluated on several tasks to assess its quality in comparison to the unquantized model.

This model was obtained by quantizing the weights and activations of Qwen/Qwen3-VL-235B-A22B-Instruct to FP4 data type, ready for inference with vLLM >= 0.9.1. This optimization reduces the number of bits per parameter from 16 to 4, reducing the disk size and GPU memory requirements by approximately 75%. Only the weights and activations of the linear operators within transformers blocks are quantized, using LLM Compressor.

This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details.

This model was created by applying LLM Compressor with calibration samples from UltraChat, as presented in the code snippet below.

This model was evaluated on the well-known OpenLLM v1, OpenLLM v2 and HumanEval64 benchmarks using lm-evaluation-harness. The reasoning evals were done using lighteval.

Category Metric Qwen/Qwen3-VL-235B-A22B-Instruct RedHatAI/Qwen3-VL-235B-A22B-Instruct-NVFP4 (this model) Recovery

The results were obtained using the following commands:
Llama-3.3-70B-Instruct-quantized.w8a8
Model Overview
- Model Architecture: Llama
- Input: Text
- Output: Text
- Model Optimizations:
  - Activation quantization: INT8
  - Weight quantization: INT8
- Intended Use Cases: Intended for commercial and research use in multiple languages. Similarly to Llama-3.3-70B-Instruct, this model is intended for assistant-like chat.
- Out-of-scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws).
- Release Date: 01/20/2025
- Version: 1.0
- Validated on: RHOAI 2.20, RHAIIS 3.0, RHELAI 1.5
- Model Developers: Neural Magic

Quantized version of Llama-3.3-70B-Instruct. It was evaluated on several tasks to assess its quality in comparison to the unquantized model, including multiple-choice, math reasoning, and open-ended text generation. Llama-3.3-70B-Instruct-quantized.w8a8 achieves 99.4% recovery for OpenLLM v1 (using Meta's prompting when available) and 100% for both HumanEval and HumanEval+ pass@1.

This model was obtained by quantizing the weights and activations of Llama-3.3-70B-Instruct to INT8 data type. This optimization reduces the number of bits used to represent weights and activations from 16 to 8, reducing GPU memory requirements (by approximately 50%) and increasing matrix-multiply compute throughput (by approximately 2x). Weight quantization also reduces disk size requirements by approximately 50%.

Only weights and activations of the linear operators within transformers blocks are quantized. Weights are quantized with a symmetric static per-channel scheme, where a fixed linear scaling factor is applied between INT8 and floating point representations for each output channel dimension. Activations are quantized with a symmetric dynamic per-token scheme, computing a linear scaling factor at runtime for each token between INT8 and floating point representations.

This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details. See the Red Hat AI Inference Server, Red Hat Enterprise Linux AI, and Red Hat Openshift AI documentation for deployment on those platforms.

This model was created by using the llm-compressor library as presented in the code snippet below.

This model was evaluated on the well-known OpenLLM v1, OpenLLM v2, HumanEval, and HumanEval+ benchmarks. In all cases, model outputs were generated with the vLLM engine. OpenLLM v1 and v2 evaluations were conducted using Neural Magic's fork of lm-evaluation-harness (branch llama_3.1_instruct). This version of the lm-evaluation-harness includes versions of MMLU, ARC-Challenge and GSM-8K that match the prompting style of Meta-Llama-3.1-Instruct-evals and a few fixes to OpenLLM v2 tasks. HumanEval and HumanEval+ evaluations were conducted using Neural Magic's fork of the EvalPlus repository. The results were obtained using the following commands:
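A minimal sketch of loading this 70B model with vLLM across multiple GPUs, as referenced earlier in the card; the tensor-parallel degree, prompt, and sampling settings are illustrative.

```python
# Sketch: offline inference for the 70B W8A8 model with tensor parallelism in vLLM.
from vllm import LLM, SamplingParams

llm = LLM(model="RedHatAI/Llama-3.3-70B-Instruct-quantized.w8a8", tensor_parallel_size=4)
sampling = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256)

conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize the trade-offs of INT8 weight and activation quantization."},
]
outputs = llm.chat(conversation, sampling)
print(outputs[0].outputs[0].text)
```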
Qwen3-8B-FP8-dynamic
gemma-3-27b-it-quantized.w4a16
Model Overview
- Model Architecture: google/gemma-3-27b-it
- Input: Vision-Text
- Output: Text
- Model Optimizations:
  - Weight quantization: INT4
  - Activation quantization: FP16
- Release Date: 6/4/2025
- Version: 1.0
- Model Developers: RedHatAI

This model was obtained by quantizing the weights of google/gemma-3-27b-it to INT4 data type, ready for inference with vLLM >= 0.8.0.

This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details.

This model was created with llm-compressor by running the code snippet below.

The model was evaluated using lm-evaluation-harness for the OpenLLM v1 text benchmark. The evaluations were conducted using the following commands:

Category Metric google/gemma-3-27b-it RedHatAI/gemma-3-27b-it-quantized.w4a16 Recovery (%)
Meta-Llama-3-8B-Instruct-FP8-KV
Qwen2-7B-Instruct-FP8
Model Overview
- Model Architecture: Qwen2
- Input: Text
- Output: Text
- Model Optimizations:
  - Weight quantization: FP8
  - Activation quantization: FP8
- Intended Use Cases: Intended for commercial and research use in English. Similarly to Meta-Llama-3-8B-Instruct, this model is intended for assistant-like chat.
- Out-of-scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages other than English.
- Release Date: 6/14/2024
- Version: 1.0
- License(s): apache-2.0
- Model Developers: Neural Magic

Quantized version of Qwen2-7B-Instruct. It achieves an average score of 69.44 on the OpenLLM benchmark (version 1), whereas the unquantized model achieves 69.55.

This model was obtained by quantizing the weights and activations of Qwen2-7B-Instruct to FP8 data type, ready for inference with vLLM >= 0.5.0. This optimization reduces the number of bits per parameter from 16 to 8, reducing the disk size and GPU memory requirements by approximately 50%. Only the weights and activations of the linear operators within transformers blocks are quantized. Symmetric per-tensor quantization is applied, in which a single linear scaling maps the FP8 representations of the quantized weights and activations. AutoFP8 is used for quantization with 512 sequences of UltraChat.

This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details.

This model was created by applying AutoFP8 with calibration samples from UltraChat, as presented in the code snippet below. Although AutoFP8 was used for this particular model, Neural Magic is transitioning to llm-compressor, which supports several quantization schemes and models not supported by AutoFP8.

The model was evaluated on the OpenLLM leaderboard tasks (version 1) with the lm-evaluation-harness (commit 383bbd54bc621086e05aa1b030d8d4d5635b25e6) and the vLLM engine, using the following command:
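A sketch of an OpenLLM v1-style run through the lm-evaluation-harness Python API with the vLLM backend, not the exact original command; the task list, repository id, and model arguments are illustrative.

```python
# Sketch of an OpenLLM-v1-style evaluation with lm-evaluation-harness on the vLLM backend.
# Task names and arguments are illustrative; the original command may differ.
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args="pretrained=neuralmagic/Qwen2-7B-Instruct-FP8,dtype=auto,gpu_memory_utilization=0.9",
    tasks=["arc_challenge", "gsm8k", "hellaswag", "mmlu", "truthfulqa_mc2", "winogrande"],
)
print(results["results"])
```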
DeepSeek-Coder-V2-Lite-Instruct-FP8
Qwen3-8B-speculator.eagle3
Model Overview
- Verifier: Qwen/Qwen3-8B
- Speculative Decoding Algorithm: EAGLE-3
- Model Architecture: Eagle3Speculator
- Release Date: 07/27/2025
- Version: 1.0
- Model Developers: RedHat

This is a speculator model designed for use with Qwen/Qwen3-8B, based on the EAGLE-3 speculative decoding algorithm. It was trained using the speculators library on a combination of the Aeala/ShareGPT_Vicuna_unfiltered and HuggingFaceH4/ultrachat_200k datasets. The model was trained with thinking disabled. This model should be used with the Qwen/Qwen3-8B chat template, specifically through the `/chat/completions` endpoint.

Text Summarization: 1.62 / 1.96 / 2.13 / 2.24 / 2.25 / 2.29 / 2.30

Benchmarking settings:
- temperature: 0.6
- top_p: 0.95
- top_k: 20
- repetitions: 3
- time per experiment: 10 min
- hardware: 1xA100
- vLLM version: 0.11.0
- GuideLLM version: 0.3.0

Command
```bash
GUIDELLM__PREFERRED_ROUTE="chat_completions" \
guidellm benchmark \
  --target "http://localhost:8000/v1" \
  --data "RedHatAI/SpeculativeDecoding" \
  --rate-type sweep \
  --max-seconds 600 \
  --output-path "Qwen3-8B-HumanEval.json" \
  --backend-args '{"extra_body": {"chat_completions": {"temperature": 0.6, "top_p": 0.95, "top_k": 20}}}'
```
Llama-3.2-11B-Vision-Instruct-FP8-dynamic
DeepSeek-R1-Distill-Llama-8B-quantized.w8a8
Llama-3.2-1B-Instruct-quantized.w8a8
Model Overview
- Model Architecture: Llama-3
- Input: Text
- Output: Text
- Model Optimizations:
  - Activation quantization: INT8
  - Weight quantization: INT8
- Intended Use Cases: Intended for commercial and research use in multiple languages. Similarly to Llama-3.2-1B-Instruct, this model is intended for assistant-like chat.
- Out-of-scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws).
- Release Date: 9/25/2024
- Version: 1.0
- License(s): Llama3.2
- Model Developers: Neural Magic

Quantized version of Llama-3.2-1B-Instruct. It achieves scores within 5% of the scores of the unquantized model for MMLU, ARC-Challenge, GSM-8k, Hellaswag, Winogrande and TruthfulQA.

This model was obtained by quantizing the weights and activations of Llama-3.2-1B-Instruct to INT8 data type. This optimization reduces the number of bits used to represent weights and activations from 16 to 8, reducing GPU memory requirements (by approximately 50%) and increasing matrix-multiply compute throughput (by approximately 2x). Weight quantization also reduces disk size requirements by approximately 50%.

Only weights and activations of the linear operators within transformers blocks are quantized. Weights are quantized with a symmetric static per-channel scheme, where a fixed linear scaling factor is applied between INT8 and floating point representations for each output channel dimension. Activations are quantized with a symmetric dynamic per-token scheme, computing a linear scaling factor at runtime for each token between INT8 and floating point representations. The SmoothQuant algorithm is used to alleviate outliers in the activations, whereas the GPTQ algorithm is applied for quantization. Both algorithms are implemented in the llm-compressor library. GPTQ used a 1% damping factor and 512 sequences taken from Neural Magic's LLM compression calibration dataset.

This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details.

This model was created by using the llm-compressor library as presented in the code snippet below.

The model was evaluated on MMLU, ARC-Challenge, GSM-8K, Hellaswag, Winogrande and TruthfulQA. Evaluation was conducted using the Neural Magic fork of lm-evaluation-harness (branch llama_3.1_instruct) and the vLLM engine. This version of the lm-evaluation-harness includes versions of MMLU, ARC-Challenge and GSM-8K that match the prompting style of Meta-Llama-3.1-Instruct-evals. The results were obtained using the following commands:
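A minimal sketch of the SmoothQuant + GPTQ W8A8 flow described above, using llm-compressor; the calibration dataset, smoothing strength, and import paths are assumptions rather than the original recipe.

```python
# Sketch of the described flow: SmoothQuant to reduce activation outliers, then GPTQ
# INT8 weight/activation quantization. Hyperparameters and data are illustrative.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.modifiers.quantization import GPTQModifier

model_id = "meta-llama/Llama-3.2-1B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# 512 calibration sequences, mirroring the count mentioned in the card
# (the original used Neural Magic's LLM compression calibration dataset).
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft[:512]")
ds = ds.map(lambda ex: {"text": tokenizer.apply_chat_template(ex["messages"], tokenize=False)})

recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),
    GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"], dampening_frac=0.01),
]

oneshot(model=model, dataset=ds, recipe=recipe, max_seq_length=2048, num_calibration_samples=512)

model.save_pretrained("Llama-3.2-1B-Instruct-quantized.w8a8", save_compressed=True)
tokenizer.save_pretrained("Llama-3.2-1B-Instruct-quantized.w8a8")
```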
Mistral-Small-3.1-24B-Instruct-2503-FP8-dynamic
Qwen3-32B-NVFP4A16
Qwen3-Coder-Next-NVFP4
Meta-Llama-3.1-8B-FP8
Llama-3.1-8B-Instruct-speculator.eagle3
Model Overview
- Verifier: meta-llama/Llama-3.1-8B-Instruct
- Speculative Decoding Algorithm: EAGLE-3
- Model Architecture: Eagle3Speculator
- Release Date: 07/27/2025
- Version: 1.0
- Model Developers: RedHat

This is a speculator model designed for use with meta-llama/Llama-3.1-8B-Instruct, based on the EAGLE-3 speculative decoding algorithm. It was trained using the speculators library on a combination of the Aeala/ShareGPT_Vicuna_unfiltered and HuggingFaceH4/ultrachat_200k datasets. This model should be used with the meta-llama/Llama-3.1-8B-Instruct chat template, specifically through the `/chat/completions` endpoint.

Text Summarization: 1.70 / 2.19 / 2.50 / 2.78 / 2.77 / 2.98 / 2.99

Benchmarking settings:
- temperature: 0.6
- top_p: 0.9
- repetitions: 5
- time per experiment: 3 min
- hardware: 1xA100
- vLLM version: 0.11.0
- GuideLLM version: 0.3.0

Command
```bash
GUIDELLM__PREFERRED_ROUTE="chat_completions" \
guidellm benchmark \
  --target "http://localhost:8000/v1" \
  --data "RedHatAI/SpeculativeDecoding" \
  --rate-type sweep \
  --max-seconds 180 \
  --output-path "Llama-3.1-8B-Instruct-HumanEval.json" \
  --backend-args '{"extra_body": {"chat_completions": {"temperature": 0.0}}}'
```
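A hypothetical sketch of pairing this speculator with its verifier in vLLM; the `speculative_config` keys vary between vLLM versions, so consult the vLLM speculative decoding documentation for the exact format.

```python
# Hypothetical sketch: pair the verifier with this EAGLE-3 speculator in vLLM.
# The speculative_config keys below follow recent vLLM conventions and may differ by version.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # verifier
    speculative_config={
        "model": "RedHatAI/Llama-3.1-8B-Instruct-speculator.eagle3",
        "num_speculative_tokens": 3,
        "method": "eagle3",
    },
)
outputs = llm.chat(
    [{"role": "user", "content": "Write two sentences about speculative decoding."}],
    SamplingParams(temperature=0.6, top_p=0.9, max_tokens=128),
)
print(outputs[0].outputs[0].text)
```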
Meta-Llama-3-70B-Instruct-FP8
Llama-3.2-1B-quantized.w8a8
Qwen2.5-VL-3B-Instruct-quantized.w8a8
Llama-3.2-90B-Vision-Instruct-FP8-dynamic
Voxtral-Small-24B-2507-FP8-dynamic
Model Overview
- Model Architecture: VoxtralForConditionalGeneration
- Input: Audio-Text
- Output: Text
- Model Optimizations:
  - Weight quantization: FP8
  - Activation quantization: FP8
- Intended Use Cases: Voxtral builds upon Ministral-3B with powerful audio understanding capabilities.
  - Dedicated transcription mode: Voxtral can operate in a pure speech transcription mode to maximize performance. By default, Voxtral automatically predicts the source audio language and transcribes the text accordingly.
  - Long-form context: With a 32k token context length, Voxtral handles audios up to 30 minutes for transcription, or 40 minutes for understanding.
  - Built-in Q&A and summarization: Supports asking questions directly through audio, analyzing audio, and generating structured summaries without the need for separate ASR and language models.
  - Natively multilingual: Automatic language detection and state-of-the-art performance in the world's most widely used languages (English, Spanish, French, Portuguese, Hindi, German, Dutch, Italian).
  - Function-calling straight from voice: Enables direct triggering of backend functions, workflows, or API calls based on spoken user intents.
  - Highly capable at text: Retains the text understanding capabilities of its language model backbone, Ministral-3B.
- Release Date: 08/21/2025
- Version: 1.0
- Model Developers: Red Hat

This model was obtained by quantizing activations and weights of Voxtral-Small-24B-2507 to FP8 data type. This optimization reduces the number of bits used to represent weights and activations from 16 to 8, reducing GPU memory requirements (by approximately 50%) and increasing matrix-multiply compute throughput (by approximately 2x). Weight quantization also reduces disk size requirements by approximately 50%. Only weights and activations of the MLP operators within transformers blocks of the language model are quantized. Weights are quantized with a symmetric static per-channel scheme, whereas activations are quantized with a symmetric dynamic per-token scheme. The llm-compressor library is used for quantization.

Deploy the model on a vLLM server, then send requests to the server according to the use case. See the following examples.

This model was quantized using the llm-compressor library as shown below. After quantization, the model can be converted back into the mistralai format using the `convert_voxtral_hf_to_mistral.py` script included with the model.

The model was evaluated on the Fleurs transcription task. Recovery is computed with respect to the complement of the word error rate (WER).

Benchmark Language Voxtral-Small-24B-2507 Voxtral-Small-24B-2507-FP8-dynamic (this model) Recovery
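A minimal sketch of a text-only request against a vLLM server hosting this model; the server flags, repository id, and prompt are illustrative, and audio inputs follow vLLM's multimodal chat API (see the vLLM documentation for the audio payload format).

```python
# Sketch: text-only request to a vLLM server hosting the quantized Voxtral model.
# Assumes the server was started separately, e.g.:
#   vllm serve RedHatAI/Voxtral-Small-24B-2507-FP8-dynamic --tokenizer-mode mistral
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="RedHatAI/Voxtral-Small-24B-2507-FP8-dynamic",
    messages=[{"role": "user", "content": "In one paragraph, what is a dedicated transcription mode useful for?"}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```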
gemma-3-4b-it-quantized.w4a16
Qwen3-30B-A3B-FP8-dynamic
Qwen2-1.5B-Instruct-FP8
Qwen2.5-VL-3B-Instruct-FP8-dynamic
Mistral-Small-3.2-24B-Instruct-2506-NVFP4
Model Overview
- Model Architecture: unsloth/Mistral-Small-3.2-24B-Instruct-2506
- Input: Text
- Output: Text
- Model Optimizations:
  - Weight quantization: FP4
  - Activation quantization: FP4
- Out-of-scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages other than English.
- Release Date: 10/29/2025
- Version: 1.0
- Model Developers: RedHatAI

This model is a quantized version of unsloth/Mistral-Small-3.2-24B-Instruct-2506. It was evaluated on several tasks to assess its quality in comparison to the unquantized model.

This model was obtained by quantizing the weights and activations of unsloth/Mistral-Small-3.2-24B-Instruct-2506 to FP4 data type, ready for inference with vLLM >= 0.9.1. This optimization reduces the number of bits per parameter from 16 to 4, reducing the disk size and GPU memory requirements by approximately 75%. Only the weights and activations of the linear operators within transformers blocks are quantized, using LLM Compressor.

This model was created by applying LLM Compressor with calibration samples from UltraChat, as presented in the code snippet below.

This model was evaluated on the well-known OpenLLM v1, OpenLLM v2 and HumanEval64 benchmarks using lm-evaluation-harness.

Category Metric unsloth/Mistral-Small-3.2-24B-Instruct-2506 RedHatAI/Mistral-Small-3.2-24B-Instruct-2506-NVFP4 Recovery

The results were obtained using the following commands:
gemma-3-12b-it-FP8-dynamic
Meta-Llama-3.1-70B-Instruct-FP8
gemma-3-1b-it-FP8-dynamic
Qwen3-30B-A3B-quantized.w4a16
Llama-3.2-3B-Instruct-FP8
Mistral-Nemo-Instruct-2407-FP8
Meta-Llama-3.1-70B-Instruct-quantized.w8a8
DeepSeek-R1-Distill-Llama-70B-quantized.w8a8
Meta-Llama-3.1-8B-Instruct-quantized.w8a16
Meta-Llama-3-8B-Instruct-quantized.w8a8
Qwen2.5-Coder-14B-Instruct-FP8-dynamic
DeepSeek-R1-Distill-Qwen-7B-quantized.w8a8
Qwen3-8B-quantized.w4a16
phi-4-quantized.w8a8
DeepSeek-R1-Distill-Qwen-32B-quantized.w8a8
DeepSeek-R1-Distill-Qwen-1.5B-quantized.w8a8
Meta-Llama-3.1-405B-Instruct-FP8-dynamic
Llama-4-Maverick-17B-128E-Instruct-NVFP4
DeepSeek-R1-Distill-Qwen-14B-quantized.w8a8
Llama-3.2-3B-quantized.w8a8
Meta-Llama-3-8B-Instruct-FP8
Llama-4-Scout-17B-16E-Instruct-quantized.w4a16
gemma-3-4b-it-FP8-dynamic
Magistral-Small-2506-FP8
gemma-3-12b-it-quantized.w8a8
Qwen2.5-VL-72B-Instruct-FP8-dynamic
Qwen2.5-VL-7B-Instruct-quantized.w8a8
NVIDIA-Nemotron-Nano-9B-v2-FP8-dynamic
Model Overview - Model Architecture: NemotronHForCausalLM - Input: Text - Output: Text - Model Optimizations: - Weight quantization: FP8 - Activation quantization: FP8 - Release Date: 9/30/2025 - V...
Qwen3-0.6B-FP8-BLOCK
gemma-3-12b-it-quantized.w4a16
Qwen3-30B-A3B-NVFP4
Model Overview
- Model Architecture: Qwen/Qwen3-30B-A3B
- Input: Text
- Output: Text
- Model Optimizations:
  - Weight quantization: FP4
  - Activation quantization: FP4
- Out-of-scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages other than English.
- Release Date: 10/29/2025
- Version: 1.0
- Model Developers: RedHatAI

This model is a quantized version of Qwen/Qwen3-30B-A3B. It was evaluated on several tasks to assess its quality in comparison to the unquantized model.

This model was obtained by quantizing the weights and activations of Qwen/Qwen3-30B-A3B to FP4 data type, ready for inference with vLLM >= 0.9.1. This optimization reduces the number of bits per parameter from 16 to 4, reducing the disk size and GPU memory requirements by approximately 75%. Only the weights and activations of the linear operators within transformers blocks are quantized, using LLM Compressor.

This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details.

This model was created by applying LLM Compressor with calibration samples from UltraChat, as presented in the code snippet below.

This model was evaluated on the well-known OpenLLM v1, OpenLLM v2 and HumanEval64 benchmarks using lm-evaluation-harness. The reasoning evals were done using lighteval.

Category Metric Qwen3-30B-A3B Qwen3-30B-A3B-NVFP4 (this model) Recovery

The results were obtained using the following commands:
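A minimal sketch of NVFP4 quantization with LLM Compressor using UltraChat calibration samples, as described above; the scheme name, import paths, and calibration settings are assumptions based on current llm-compressor conventions, not the exact original recipe.

```python
# Sketch of NVFP4 (FP4) quantization with llm-compressor and UltraChat calibration data.
# Scheme name and settings are assumptions; the original recipe may differ.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

model_id = "Qwen/Qwen3-30B-A3B"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft[:512]")
ds = ds.map(lambda ex: {"text": tokenizer.apply_chat_template(ex["messages"], tokenize=False)})

recipe = QuantizationModifier(targets="Linear", scheme="NVFP4", ignore=["lm_head"])

oneshot(model=model, dataset=ds, recipe=recipe, max_seq_length=2048, num_calibration_samples=512)

model.save_pretrained("Qwen3-30B-A3B-NVFP4", save_compressed=True)
tokenizer.save_pretrained("Qwen3-30B-A3B-NVFP4")
```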
Meta-Llama-3.1-70B-Instruct-FP8-dynamic
gpt-oss-120b-FP8-dynamic
Model Overview - Model Architecture: gpt-oss-120b-BF16 - Input: Text - Output: Text - Model Optimizations: - Weight quantization: FP8 - Activation quantization: FP8 - Release Date: 08/13/2025 - Version: 1.0 - Model Developers: RedHatAI
DeepSeek-R1-Distill-Qwen-14B-quantized.w4a16
Pixtral-Large-Instruct-2411-hf-quantized.w4a16
Llama-4-Scout-17B-16E-Instruct
Qwen3-4B-quantized.w4a16
gemma-3-27b-it-quantized.w8a8
gpt-oss-120b-speculator.eagle3
whisper-large-v3-quantized.w4a16
Model Overview - Model Architecture: whisper-large-v3 - Input: Audio-Text - Output: Text - Model Optimizations: - Weight quantization: INT4 - Release Date: 04/16/2025 - Version: 1.0 - Model Developers: Neural Magic This model was obtained by quantizing the weights of openai/whisper-large-v3 to INT4 data type, ready for inference with vLLM >= 0.5.2. This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details. This model was created with llm-compressor by running the code snippet below. The model was evaluated on LibriSpeech and Fleurs datasets using lmms-eval, via the following commands:
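A sketch of querying a vLLM server through the OpenAI-compatible transcription endpoint, assuming the vLLM build exposes `/v1/audio/transcriptions` for Whisper models; the audio file and repository id are placeholders.

```python
# Sketch: transcription request against a vLLM OpenAI-compatible server, assumed to be
# started with something like: vllm serve RedHatAI/whisper-large-v3-quantized.w4a16
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

with open("sample.wav", "rb") as audio_file:  # placeholder audio file
    transcription = client.audio.transcriptions.create(
        model="RedHatAI/whisper-large-v3-quantized.w4a16",
        file=audio_file,
    )
print(transcription.text)
```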
Qwen3-0.6B-FP8-dynamic
Llama-3.1-8B-Instruct
Qwen3-14B-FP8-dynamic
Mistral-7B-Instruct-v0.3-FP8
gemma-3-1b-it-quantized.w8a8
Model Overview
- Model Architecture: google/gemma-3-1b-it
- Input: Vision-Text
- Output: Text
- Model Optimizations:
  - Weight quantization: INT8
  - Activation quantization: INT8
- Release Date: 6/4/2025
- Version: 1.0
- Model Developers: RedHatAI

This model was obtained by quantizing the weights of google/gemma-3-1b-it to INT8 data type, ready for inference with vLLM >= 0.8.0.

This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details.

This model was created with llm-compressor by running the code snippet below.

The model was evaluated using lm-evaluation-harness for the OpenLLM v1 text benchmark. The evaluations were conducted using the following commands:

Category Metric google/gemma-3-1b-it RedHatAI/gemma-3-1b-it-quantized.w8a8 Recovery (%)
Qwen2.5-VL-7B-Instruct-quantized.w4a16
Qwen2.5-VL-3B-Instruct-quantized.w4a16
Llama-2-7b-ultrachat200k
Mistral-Small-24B-Instruct-2501-FP8-dynamic
gemma-3-1b-it-quantized.w4a16
Qwen3.5-35B-A3B-FP8-dynamic
phi-4-quantized.w4a16
gpt-oss-120b
NVIDIA-Nemotron-3-Nano-30B-A3B-FP8
DeepSeek-R1-Distill-Qwen-32B-FP8-dynamic
Qwen3-30B-A3B-Instruct-2507-quantized.w4a16
Qwen3-30B-A3B-Instruct-2507.w4a16
DeepSeek-R1-Distill-Llama-8B-quantized.w4a16
Llama-3.1-Nemotron-70B-Instruct-HF-FP8-dynamic
Llama-2-7b-gsm8k
Qwen3-235B-A22B-FP8-dynamic
Mistral-Small-3.1-24B-Instruct-2503-quantized.w8a8
Qwen3-235B-A22B-Instruct-2507-NVFP4
Model Overview
- Model Architecture: Qwen/Qwen3-235B-A22B-Instruct-2507
- Input: Text
- Output: Text
- Model Optimizations:
  - Weight quantization: FP4
  - Activation quantization: FP4
- Out-of-scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages other than English.
- Release Date: 10/29/2025
- Version: 1.0
- Model Developers: RedHatAI

This model is a quantized version of Qwen/Qwen3-235B-A22B-Instruct-2507. It was evaluated on several tasks to assess its quality in comparison to the unquantized model.

This model was obtained by quantizing the weights and activations of Qwen/Qwen3-235B-A22B-Instruct-2507 to FP4 data type, ready for inference with vLLM >= 0.9.1. This optimization reduces the number of bits per parameter from 16 to 4, reducing the disk size and GPU memory requirements by approximately 75%. Only the weights and activations of the linear operators within transformers blocks are quantized, using LLM Compressor.

This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details.

This model was created by applying LLM Compressor with calibration samples from UltraChat, as presented in the code snippet below.

This model was evaluated on the well-known OpenLLM v1, OpenLLM v2 and HumanEval64 benchmarks using lm-evaluation-harness. The reasoning evals were done using lighteval.

Category Metric Qwen/Qwen3-235B-A22B-Instruct-2507 RedHatAI/Qwen3-235B-A22B-Instruct-2507-NVFP4 (this model) Recovery

The results were obtained using the following commands:
llama2.c-stories110M-pruned50
DeepSeek-R1-Distill-Qwen-32B-quantized.w4a16
granite-4.0-h-small-FP8-dynamic
Meta-Llama-3.1-70B-FP8
Llama-3.3-70B-Instruct-quantized.w4a16
Qwen3-32B-NVFP4
gemma-3n-E4B-it-FP8-dynamic
Llama-4-Maverick-17B-128E-Instruct-quantized.w4a16
Meta-Llama-3.1-405B-Instruct-FP8
Qwen3-32B-quantized.w4a16
Qwen3-32B-speculator.eagle3
Model Overview
- Verifier: Qwen/Qwen3-32B
- Speculative Decoding Algorithm: EAGLE-3
- Model Architecture: Eagle3Speculator
- Release Date: 09/17/2025
- Version: 1.0
- Model Developers: RedHat

This is a speculator model designed for use with Qwen/Qwen3-32B, based on the EAGLE-3 speculative decoding algorithm. It was trained using the speculators library on a combination of the Aeala/ShareGPT_Vicuna_unfiltered dataset and the `train_sft` split of the HuggingFaceH4/ultrachat_200k dataset. This model should be used with the Qwen/Qwen3-32B chat template, specifically through the `/chat/completions` endpoint.

Text Summarization: 1.62 / 1.95 / 2.15 / 2.23 / 2.27 / 2.32 / 2.33

Benchmarking settings:
- temperature: 0.6
- top_p: 0.95
- top_k: 20
- repetitions: 3
- time per experiment: 10 min
- hardware: 2xA100
- vLLM version: 0.11.0
- GuideLLM version: 0.3.0

Command
```bash
GUIDELLM__PREFERRED_ROUTE="chat_completions" \
guidellm benchmark \
  --target "http://localhost:8000/v1" \
  --data "RedHatAI/speculator_benchmarks" \
  --data-args '{"data_files": "HumanEval.jsonl"}' \
  --rate-type sweep \
  --max-seconds 600 \
  --output-path "Qwen3-32B-HumanEval.json" \
  --backend-args '{"extra_body": {"chat_completions": {"temperature": 0.6, "top_p": 0.95, "top_k": 20}}}'
```
Llama-3.3-70B-Instruct
Qwen3-Next-80B-A3B-Instruct-quantized.w4a16
DeepSeek-R1-Distill-Llama-70B-FP8-dynamic
granite-3.1-8b-instruct-quantized.w4a16
Devstral-Small-2507-quantized.w8a8
Model Overview
- Model Architecture: MistralForCausalLM
- Input: Text
- Output: Text
- Model Optimizations:
  - Activation quantization: INT8
  - Weight quantization: INT8
- Release Date: 08/29/2025
- Version: 1.0
- Model Developers: Red Hat (Neural Magic)

This model was obtained by quantizing weights and activations of Devstral-Small-2507 to INT8 data type. This optimization reduces the number of bits used to represent weights and activations from 16 to 8, reducing GPU memory requirements (by approximately 50%). Weight quantization also reduces disk size requirements by approximately 50%.

This model was created with llm-compressor by running the code snippet below.

This model can be deployed efficiently using the vLLM backend, as shown in the example below.

The model was evaluated on popular coding tasks (HumanEval, HumanEval+, MBPP, MBPP+) via EvalPlus and the vLLM backend (v0.10.1.1). For evaluations, we run greedy sampling and report pass@1. The command to reproduce the evals:

| | Recovery (%) | mistralai/Devstral-Small-2507 | RedHatAI/Devstral-Small-2507-quantized.w8a8 (this model) |
| --------------------------- | :----------: | :------------------: | :--------------------------------------------------: |
| HumanEval | 100.67 | 89.0 | 89.6 |
| HumanEval+ | 101.48 | 81.1 | 82.3 |
| MBPP | 98.71 | 77.5 | 76.5 |
| MBPP+ | 102.42 | 66.1 | 67.7 |
| Average Score | 100.77 | 78.43 | 79.03 |
Qwen3-0.6B-quantized.w4a16
Qwen3-14B-quantized.w4a16
Model Overview
- Model Architecture: Qwen3ForCausalLM
- Input: Text
- Output: Text
- Model Optimizations:
  - Weight quantization: INT4
- Intended Use Cases:
  - Reasoning.
  - Function calling.
  - Subject matter experts via fine-tuning.
  - Multilingual instruction following.
  - Translation.
- Out-of-scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws).
- Release Date: 05/05/2025
- Version: 1.0
- Model Developers: RedHat (Neural Magic)

This model was obtained by quantizing the weights of Qwen3-14B to INT4 data type. This optimization reduces the number of bits per parameter from 16 to 4, reducing the disk size and GPU memory requirements by approximately 75%. Only the weights of the linear operators within transformers blocks are quantized. Weights are quantized using an asymmetric per-group scheme, with group size 64. The GPTQ algorithm is applied for quantization, as implemented in the llm-compressor library.

This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details.

Creation details: This model was created with llm-compressor by running the code snippet below.

The model was evaluated on the OpenLLM leaderboard tasks (versions 1 and 2), using lm-evaluation-harness, and on reasoning tasks using lighteval. vLLM was used for all evaluations.
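A minimal sketch of a weight-only INT4 (W4A16) GPTQ flow with llm-compressor; the calibration data and arguments are illustrative, and the group size of 64 described above would require a custom quantization scheme that is omitted here for brevity.

```python
# Sketch of weight-only INT4 (W4A16) GPTQ quantization with llm-compressor.
# The W4A16 scheme defaults to per-group weight scales; the card describes group size 64,
# which would need a custom config_groups entry (omitted here for brevity).
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

model_id = "Qwen/Qwen3-14B"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft[:512]")
ds = ds.map(lambda ex: {"text": tokenizer.apply_chat_template(ex["messages"], tokenize=False)})

recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

oneshot(model=model, dataset=ds, recipe=recipe, max_seq_length=2048, num_calibration_samples=512)

model.save_pretrained("Qwen3-14B-quantized.w4a16", save_compressed=True)
tokenizer.save_pretrained("Qwen3-14B-quantized.w4a16")
```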
Voxtral-Mini-3B-2507-FP8-dynamic
Qwen2-72B-Instruct-FP8
Mixtral-8x7B-Instruct-v0.1-FP8
whisper-large-v3-turbo-quantized.w4a16
Mistral-Nemo-Instruct-2407-quantized.w4a16
granite-4.0-h-tiny-FP8-dynamic
phi-4-FP8-dynamic
Qwen2-0.5B-Instruct-FP8
Llama-4-Scout-17B-16E-Instruct-NVFP4
Llama-2-7b-chat-quantized.w8a8
whisper-large-v3-FP8-dynamic
Model Overview - Model Architecture: whisper-large-v3 - Input: Audio-Text - Output: Text - Model Optimizations: - Weight quantization: FP8 - Activation quantization: FP8 - Release Date: 04/16/2025 - Version: 1.0 - Model Developers: Neural Magic This model was obtained by quantizing the weights of openai/whisper-large-v3 to FP8 data type, ready for inference with vLLM >= 0.5.2. This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details. This model was created with llm-compressor by running the code snippet below. The model was evaluated on LibriSpeech and Fleurs datasets using lmms-eval, via the following commands:
Llama-3.2-3B-Instruct-quantized.w8a8
Qwen3-Next-80B-A3B-Instruct-FP8
Llama-3.3-70B-Instruct-speculator.eagle3
Model Overview
- Verifier: meta-llama/Llama-3.3-70B-Instruct
- Speculative Decoding Algorithm: EAGLE-3
- Model Architecture: Eagle3Speculator
- Release Date: 09/15/2025
- Version: 1.0
- Model Developers: RedHat

This is a speculator model designed for use with meta-llama/Llama-3.3-70B-Instruct, based on the EAGLE-3 speculative decoding algorithm. It was trained using the speculators library on a combination of the Aeala/ShareGPT_Vicuna_unfiltered dataset and the `train_sft` split of the HuggingFaceH4/ultrachat_200k dataset. This model should be used with the meta-llama/Llama-3.3-70B-Instruct chat template, specifically through the `/chat/completions` endpoint.

Text Summarization: 1.71 / 2.21 / 2.52 / 2.74 / 2.83 / 2.87 / 2.89

Benchmarking settings:
- temperature: 0
- repetitions: 5
- time per experiment: 4 min
- hardware: 4xA100
- vLLM version: 0.11.0
- GuideLLM version: 0.3.0

Command
```bash
GUIDELLM__PREFERRED_ROUTE="chat_completions" \
guidellm benchmark \
  --target "http://localhost:8000/v1" \
  --data "RedHatAI/speculator_benchmarks" \
  --data-args '{"data_files": "HumanEval.jsonl"}' \
  --rate-type sweep \
  --max-seconds 240 \
  --output-path "Llama-3.3-70B-Instruct-HumanEval.json" \
  --backend-args '{"extra_body": {"chat_completions": {"temperature": 0.0}}}'
```
DeepSeek-R1-Distill-Qwen-14B-FP8-dynamic
Llama-3.2-11B-Vision-Instruct-quantized.w4a16
Phi-4-reasoning-FP8-dynamic
DeepSeek-R1-Distill-Qwen-7B-quantized.w4a16
Qwen3-30B-A3B-Thinking-2507-speculator.eagle3
Qwen2.5-7B-Instruct-FP8-dynamic
Llama-2-7b-pruned70-retrained
DeepSeek-V2.5-1210-FP8
Apertus-8B-Instruct-2509-FP8-dynamic
whisper-large-v3-quantized.w8a8
Model Overview - Model Architecture: whisper-large-v3 - Input: Audio-Text - Output: Text - Model Optimizations: - Weight quantization: INT8 - Activation quantization: INT8 - Release Date: 04/16/2025 - Version: 1.0 - Model Developers: Neural Magic This model was obtained by quantizing the weights of openai/whisper-large-v3 to INT8 data type, ready for inference with vLLM >= 0.5.2. This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details. This model was created with llm-compressor by running the code snippet below. The model was evaluated on LibriSpeech and Fleurs datasets using lmms-eval, via the following commands:
SmolLM3-3B-FP8-dynamic
Meta-Llama-3.1-8B-quantized.w8a8
whisper-large-v3-turbo-FP8-dynamic
Model Overview - Model Architecture: whisper-large-v3-turbo - Input: Audio-Text - Output: Text - Model Optimizations: - Weight quantization: FP8 - Activation quantization: FP8 - Release Date: 04/16/2025 - Version: 1.0 - Model Developers: Neural Magic Quantized version of openai/whisper-large-v3-turbo. This model was obtained by quantizing the weights of openai/whisper-large-v3-turbo to FP8 data type, ready for inference with vLLM >= 0.5.2. This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details. This model was created with llm-compressor by running the code snippet below. The model was evaluated on LibriSpeech and Fleurs datasets using lmms-eval, via the following commands:
Qwen3-235B-A22B-NVFP4
Model Overview
- Model Architecture: Qwen/Qwen3-235B-A22B
- Input: Text
- Output: Text
- Model Optimizations:
  - Weight quantization: FP4
  - Activation quantization: FP4
- Out-of-scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages other than English.
- Release Date: 10/29/2025
- Version: 1.0
- Model Developers: RedHatAI

This model is a quantized version of Qwen/Qwen3-235B-A22B. It was evaluated on several tasks to assess its quality in comparison to the unquantized model.

This model was obtained by quantizing the weights and activations of Qwen/Qwen3-235B-A22B to FP4 data type, ready for inference with vLLM >= 0.9.1. This optimization reduces the number of bits per parameter from 16 to 4, reducing the disk size and GPU memory requirements by approximately 75%. Only the weights and activations of the linear operators within transformers blocks are quantized, using LLM Compressor.

This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details.

This model was created by applying LLM Compressor with calibration samples from UltraChat, as presented in the code snippet below.

This model was evaluated on the well-known OpenLLM v1, OpenLLM v2 and HumanEval64 benchmarks using lm-evaluation-harness. The reasoning evals were done using lighteval.

Category Metric Qwen/Qwen3-235B-A22B RedHatAI/Qwen3-235B-A22B-NVFP4 (this model) Recovery

The results were obtained using the following commands:
Llama-4-Maverick-17B-128E-Instruct-FP8
Qwen3-235B-A22B-Instruct-2507-speculator.eagle3
Meta-Llama-3-8B-Instruct-quantized.w8a16
Qwen2-57B-A14B-Instruct-FP8
Phi-3-medium-128k-instruct-quantized.w4a16
DeepSeek-R1-Distill-Llama-8B-FP8-dynamic
Mistral-7B-Instruct-v0.3-quantized.w4a16
gemma-2-9b-it-FP8
Llama-3.1-8B-Instruct-NVFP4
Model Overview
- Model Architecture: Meta-Llama-3.1
- Input: Text
- Output: Text
- Model Optimizations:
  - Weight quantization: FP4
  - Activation quantization: FP4
- Intended Use Cases: Intended for commercial and research use in multiple languages. Similarly to Meta-Llama-3.1-8B-Instruct, this model is intended for assistant-like chat.
- Out-of-scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages other than English.
- Release Date: 10/23/2025
- Version: 1.0
- License(s): llama3.1
- Model Developers: RedHatAI

This model is a quantized version of Meta-Llama-3.1-8B-Instruct. It was evaluated on several tasks to assess its quality in comparison to the unquantized model.

This model was obtained by quantizing the weights and activations of Meta-Llama-3.1-8B-Instruct to FP4 data type, ready for inference with vLLM >= 0.9.1. This optimization reduces the number of bits per parameter from 16 to 4, reducing the disk size and GPU memory requirements by approximately 75%. Only the weights and activations of the linear operators within transformers blocks are quantized, using LLM Compressor.

This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details.

This model was created by applying LLM Compressor with calibration samples from UltraChat, as presented in the code snippet below.

This model was evaluated on the well-known OpenLLM v1, OpenLLM v2 and HumanEval64 benchmarks. All evaluations were conducted using lm-evaluation-harness.

| Metric | Meta-Llama-3.1-8B-Instruct | Llama-3.1-8B-Instruct-NVFP4 (this model) | Recovery (%) |
|---|---|---|---|
| gsm8k_llama | 78.17 | 79.30 | 101.45 |
| hellaswag | 78.43 | 78.01 | 99.46 |
| mmlu_llama | 69.37 | 65.95 | 95.07 |
| mmlu_cot_llama | 72.86 | 68.60 | 94.15 |
| truthfulqa_mc2 | 55.09 | 52.95 | 96.12 |
| winogrande | 75.77 | 74.03 | 97.70 |
| Average | 73.29 | 71.59 | 97.68 |

Category Metric Meta-Llama-3.1-8B-Instruct RedHatAI/Llama-3.1-8B-Instruct-NVFP4 (this model) Recovery (%)

The results were obtained using the following commands:
phi-4
Kimi-K2-Instruct-quantized.w4a16
Apertus-70B-Instruct-2509-quantized.w4a16
Qwen3-4B-FP8-dynamic
gemma-3n-E2B-it-quantized.w8a8
Model Overview
- Model Architecture: gemma-3n-E2B-it
- Input: Audio-Vision-Text
- Output: Text
- Model Optimizations:
  - Weight quantization: INT8
  - Activation quantization: INT8
- Release Date: 08/01/2025
- Version: 1.0
- Model Developers: RedHatAI

This model was obtained by quantizing the weights and activations of google/gemma-3n-E2B-it to INT8 data type, ready for inference with vLLM >= 0.10.0.

This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details.

This model was created with llm-compressor by running the code snippet below.

The model was evaluated using lm-evaluation-harness for OpenLLM V1 and V2 text-based benchmarks. The evaluations were conducted using the following commands:

Category Metric google/gemma-3n-E2B-it RedHatAI/gemma-3n-E2B-it-quantized.w8a8 Recovery (%)
Llama-3.1-70B-Instruct-NVFP4
Qwen2.5-VL-72B-Instruct-quantized.w4a16
Qwen2.5-7B-quantized.w8a8
Llama-4-Scout-17B-16E-Instruct-FP8-block
Model Overview
- Model Architecture: Llama4ForConditionalGeneration
- Input: Text, Image
- Output: Text
- Model Optimizations:
  - Weight quantization: FP8
  - Activation quantization: FP8
- Release Date: -
- Version: 1.0
- Model Developers: Red Hat

Quantized version of meta-llama/Llama-4-Scout-17B-16E-Instruct. This model was obtained by quantizing the weights and activations of meta-llama/Llama-4-Scout-17B-16E-Instruct to FP8 data type. This optimization reduces the number of bits per parameter from 16 to 8, reducing the disk size and GPU memory requirements by approximately 50%. Only the weights and activations of the linear operators within transformers blocks of the language model are quantized.

This model was quantized using the llm-compressor library as shown below.

The model was evaluated on the OpenLLM v1 leaderboard tasks using lm-evaluation-harness and on reasoning tasks using lighteval. vLLM was used for all evaluations.

| Category | Metric | meta-llama/Llama-4-Scout-17B-16E-Instruct | RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-block | Recovery (%) |
|---|---|---|---|---|
| OpenLLM V1 | ARC-Challenge (Acc-Norm, 25-shot) | 69.62 | 68.60 | 98.53 |
| OpenLLM V2 | IFEval (Inst Level Strict Acc, 0-shot) | 89.09 | 89.93 | 100.94 |
Llama-3.2-1B-FP8
Ministral-3-14B-Instruct-2512
Mistral-Small-24B-Instruct-2501-quantized.w8a8
Mistral-Small-24B-Instruct-2501-quantized.w4a16
QwQ-32B-FP8-dynamic
gemma-3-4b-it-quantized.w8a8
gemma-3n-E2B-it-FP8-dynamic
Model Overview
- Model Architecture: gemma-3n-E2B-it
- Input: Audio-Vision-Text
- Output: Text
- Model Optimizations:
  - Weight quantization: FP8
  - Activation quantization: FP8
- Release Date: 08/01/2025
- Version: 1.0
- Model Developers: RedHatAI

This model was obtained by quantizing the weights of google/gemma-3n-E2B-it to FP8 data type, ready for inference with vLLM >= 0.10.0.

This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details.

This model was created with llm-compressor by running the code snippet below.

The model was evaluated using lm-evaluation-harness for OpenLLM V1 and V2 text-based benchmarks. The evaluations were conducted using the following commands:

Category Metric google/gemma-3n-E2B-it FP8 Dynamic Recovery (%)
whisper-large-v2-FP8-Dynamic
Model Overview
- Model Architecture: whisper-large-v2
- Input: Audio-Text
- Output: Text
- Model Optimizations:
  - Weight quantization: FP8
  - Activation quantization: FP8
- Release Date: 04/16/2025
- Version: 1.0
- Model Developers: Neural Magic

This model was obtained by quantizing the weights of openai/whisper-large-v2 to FP8 data type, ready for inference with vLLM >= 0.5.2.

This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details.

This model was created with llm-compressor by running the code snippet below.

The model was evaluated on LibriSpeech and Fleurs datasets using lmms-eval, via the following commands:

Fleurs (X→en, WER): cmn_hans_cn 15.2148 / 14.7614 / 103.07%
granite-3.1-8b-instruct
DeepSeek-R1-Distill-Llama-70B-quantized.w4a16
bge-base-en-v1.5-quant
Llama-3.3-70B-Instruct-NVFP4
Model Overview
- Model Architecture: Meta-Llama-3.3
- Input: Text
- Output: Text
- Model Optimizations:
  - Weight quantization: FP4
  - Activation quantization: FP4
- Intended Use Cases: Intended for commercial and research use in multiple languages. Similarly to Meta-Llama-3.3-70B-Instruct, this model is intended for assistant-like chat.
- Out-of-scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages other than English.
- Release Date: 6/25/2025
- Version: 1.0
- License(s): llama3.3
- Model Developers: RedHatAI

This model is a quantized version of Meta-Llama-3.3-70B-Instruct. It was evaluated on several tasks to assess its quality in comparison to the unquantized model.

This model was obtained by quantizing the weights and activations of Meta-Llama-3.3-70B-Instruct to FP4 data type, ready for inference with vLLM >= 0.9.1. This optimization reduces the number of bits per parameter from 16 to 4, reducing the disk size and GPU memory requirements by approximately 75%. Only the weights and activations of the linear operators within transformers blocks are quantized, using LLM Compressor.

This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details.

This model was created by applying LLM Compressor with calibration samples from UltraChat, as presented in the code snippet below.

This model was evaluated on the well-known OpenLLM v1, OpenLLM v2, HumanEval, and HumanEval64 benchmarks. All evaluations were conducted using lm-evaluation-harness.

| Metric | Meta-Llama-3.3-70B-Instruct | RedHatAI/Llama-3.3-70B-Instruct-NVFP4 (this model) | Recovery |
|---|---|---|---|
| gsm8k_llama (8-shot, strict-match) | 85.22 | 77.10 | 90.47 |

The results were obtained using the following commands:
TinyLlama-1.1B-Chat-v1.0-marlin
Llama-3.2-3B-Instruct-FP8-dynamic
Mixtral-8x7B-Instruct-v0.1
Apertus-70B-Instruct-2509-FP8-dynamic
Qwen3-8B-NVFP4
DeepSeek-R1-0528-quantized.w4a16
gemma-2-2b-it-quantized.w4a16
Qwen3-1.7B-FP8-dynamic
Qwen2-1.5B-Instruct-quantized.w8a8
Llama-3.1-8B-Instruct-FP8-block
Model Overview - Model Architecture: LlamaForCausalLM - Input: Text - Output: Text - Model Optimizations: - Weight quantization: FP8 - Activation quantization: FP8 - Release Date: - Version: 1.0 - Model Developers: Red Hat Quantized version of meta-llama/Llama-3.1-8B-Instruct. This model was obtained by quantizing the weights and activations of meta-llama/Llama-3.1-8B-Instruct to FP8 data type. This optimization reduces the number of bits per parameter from 16 to 8, reducing the disk size and GPU memory requirements by approximately 50%. Only the weights and activations of the linear operators within transformers blocks of the language model are quantized. This model was quantized using the llm-compressor library as shown below. The model was evaluated on the OpenLLM v1 leaderboard tasks using lm-evaluation-harness, and on reasoning tasks using lighteval. vLLM was used for all evaluations. Category Metric meta-llama/Llama-3.1-8B-Instruct RedHatAI/Llama-3.1-8B-Instruct-FP8-block Recovery (%) OpenLLM V1 ARC-Challenge (Acc-Norm, 25-shot) 60.92 60.92 100.00 OpenLLM V2 IFEval (Inst Level Strict Acc, 0-shot) 81.89 81.41 99.41
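The quantization snippet referenced above is not included in this excerpt. A sketch with llm-compressor is given below; the FP8_BLOCK scheme string and the import paths are assumptions that may differ between llm-compressor releases.

```python
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Block-wise FP8 quantization of the linear layers; the scheme name is assumed.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_BLOCK", ignore=["lm_head"])

# FP8 quantization is data-free here, so no calibration dataset is passed.
oneshot(model=model, recipe=recipe)

save_dir = "Llama-3.1-8B-Instruct-FP8-block"
model.save_pretrained(save_dir)
tokenizer.save_pretrained(save_dir)
```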
granite-3.1-8b-instruct-quantized.w8a8
gemma-2-2b-it-FP8
granite-3.1-2b-instruct-quantized.w4a16
Qwen2.5-7B-Instruct
Qwen2.5-VL-72B-Instruct-quantized.w8a8
DeepSeek-R1-Distill-Qwen-7B-FP8-dynamic
gemma-3n-E2B-it-quantized.w4a16
Model Overview - Model Architecture: gemma-3n-E2B-it - Input: Audio-Vision-Text - Output: Text - Model Optimizations: - Weight quantization: INT4 - Activation quantization: INT16 - Release Date: 08/01/2025 - Version: 1.0 - Model Developers: RedHatAI This model was obtained by quantizing the weights of google/gemma-3n-E2B-it to INT4 data type, ready for inference with vLLM >= 0.10.0. This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details. This model was created with llm-compressor by running the code snippet below. The model was evaluated using lm-evaluation-harness for OpenLLM V1 and V2 text-based benchmarks. The evaluations were conducted using the following commands: Category Metric google/gemma-3n-E2B-it RedHatAI/gemma-3n-E2B-it-quantized.w4a16 Recovery (%)
Qwen3-Next-80B-A3B-Thinking-FP8-dynamic
Mistral-Small-3.1-24B-Instruct-2503
Devstral-Small-2507-quantized.w4a16
Model Overview - Model Architecture: MistralForCausalLM - Input: Text - Output: Text - Model Optimizations: - Weight quantization: INT4 - Activation quantization: None - Release Date: 08/29/2025 - Version: 1.0 - Model Developers: Red Hat (Neural Magic) This model was obtained by quantizing the weights of Devstral-Small-2507 to INT4 data type. This optimization reduces the number of bits used to represent weights from 16 to 4, reducing GPU memory requirements (by approximately 75%). Weight quantization also reduces disk size requirements by approximately 75%. This model can be deployed efficiently using the vLLM backend, as shown in the example below. This model was created with llm-compressor by running the code snippet below. The model was evaluated on popular coding tasks (HumanEval, HumanEval+, MBPP, MBPP+) via EvalPlus with the vLLM backend (v0.10.1.1). For evaluations, we run greedy sampling and report pass@1. The command to reproduce evals: | | Recovery (%) | mistralai/Devstral-Small-2507 | RedHatAI/Devstral-Small-2507-quantized.w4a16 (this model) | | --------------------------- | :----------: | :------------------: | :--------------------------------------------------: | | HumanEval | 98.65 | 89.0 | 87.8 | | HumanEval+ | 100.0 | 81.1 | 81.1 | | MBPP | 98.97 | 77.5 | 76.7 | | MBPP+ | 102.12 | 66.1 | 67.5 | | Average Score | 99.81 | 78.43 | 78.28 |
gemma-3n-E4B-it-quantized.w8a8
Model Overview - Model Architecture: gemma-3n-E4B-it - Input: Audio-Vision-Text - Output: Text - Model Optimizations: - Weight quantization: INT8 - Activation quantization: INT8 - Release Date: 08/01/2025 - Version: 1.0 - Model Developers: RedHatAI This model was obtained by quantizing the weights and activations of google/gemma-3n-E4B-it to INT8 data type, ready for inference with vLLM >= 0.10.0. This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details. This model was created with llm-compressor by running the code snippet below. The model was evaluated using lm-evaluation-harness for OpenLLM V1 and V2 text-based benchmarks. The evaluations were conducted using the following commands: Category Metric google/gemma-3n-E4B-it RedHatAI/gemma-3n-E4B-it-quantized.w8a8 Recovery (%)
gemma-2-2b-it-quantized.w8a8
Qwen2-0.5B-Instruct-quantized.w8a16
Qwen2.5-7B-Instruct-quantized.w8a16
Qwen3.5-122B-A10B-FP8-Dynamic
Qwen2.5-7B-Instruct-quantized.w8a8
Qwen3-VL-235B-A22B-Instruct-FP8-block
Qwen3-8B-FP8-block
Model Overview - Model Architecture: Qwen3ForCausalLM - Input: Text - Output: Text - Model Optimizations: - Weight quantization: FP8 - Activation quantization: FP8 - Release Date: - Version: 1.0 - Model Developers: Red Hat This model was obtained by quantizing the weights and activations of Qwen/Qwen3-8B to FP8 data type. This optimization reduces the number of bits per parameter from 16 to 8, reducing the disk size and GPU memory requirements by approximately 50%. Only the weights and activations of the linear operators within transformers blocks of the language model are quantized. This model was quantized using the llm-compressor library as shown below. The model was evaluated on the OpenLLM leaderboard tasks, using lm-evaluation-harness. vLLM was used for all evaluations. Category Metric Qwen/Qwen3-8B nm-testing/Qwen3-8B-FP8-block Recovery (%) OpenLLM V1 ARC-Challenge (Acc-Norm, 25-shot) 67.66 67.92 100.38 OpenLLM V2 IFEval (Inst Level Strict Acc, 0-shot) 48.56 48.80 100.49
DeepSeek-R1-Distill-Qwen-1.5B-quantized.w4a16
pixtral-12b-FP8-dynamic
Mistral-Small-24B-Instruct-2501
MiniMax-M2.5
Qwen2.5-72B-Instruct-quantized.w8a8
Llama-3.1-Nemotron-70B-Instruct-HF
Llama-Guard-4-12B
Meta-Llama-3-70B-Instruct-quantized.w8a16
NVIDIA-Nemotron-3-Super-120B-A12B-BF16
NVIDIA-Nemotron-Nano-9B-v2-quantized.w4a16
Model Overview - Model Architecture: NemotronHForCausalLM - Input: Text - Output: Text - Model Optimizations: - Weight quantization: INT4 - Release Date: 10/22/2025 - Version: 1.0 - Model Developers: RedHat (Neural Magic) This model was obtained by quantizing the weights of NVIDIA-Nemotron-Nano-9B-v2 to INT4 data type. This optimization reduces the number of bits per parameter from 16 to 4, reducing the disk size and GPU memory requirements by approximately 75%. Only the weights of the linear operators within transformers blocks are quantized. Weights are quantized using a symmetric per-group scheme, with group size 64. The GPTQ algorithm is applied for quantization, as implemented in the llm-compressor library. This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details. Creation details This model was created with llm-compressor by running the code snippet below. The model was evaluated on the set of popular reasoning tasks AIME25, Math-500, and GPQA-Diamond, using lighteval `v0.11.1.dev0`. vLLM `v0.11.1rc2.dev191+g80e945298.precompiled` was used as the inference engine for all evaluations. NVIDIA-Nemotron-Nano-9B-v2-quantized.w4a16 (this model)
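The creation snippet is not reproduced in this excerpt. A hedged GPTQ W4A16 sketch with llm-compressor is shown below; the calibration dataset, sample count, and sequence length are illustrative assumptions, import paths vary slightly between releases, and the W4A16 preset defaults to group size 128 whereas the card states group size 64.

```python
from datasets import load_dataset
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nvidia/NVIDIA-Nemotron-Nano-9B-v2"  # assumed source checkpoint id
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# Small calibration set; dataset choice and size are illustrative.
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft[:512]")
ds = ds.map(lambda ex: {"text": tokenizer.apply_chat_template(ex["messages"], tokenize=False)})
ds = ds.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=2048, add_special_tokens=False),
    remove_columns=ds.column_names,
)

# GPTQ with symmetric INT4 weights on the linear layers (weights only).
recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

oneshot(model=model, dataset=ds, recipe=recipe, max_seq_length=2048, num_calibration_samples=512)

save_dir = "NVIDIA-Nemotron-Nano-9B-v2-quantized.w4a16"
model.save_pretrained(save_dir, save_compressed=True)
tokenizer.save_pretrained(save_dir)
```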
nomic-embed-text-v1.5
gemma-2-9b-it
Meta-Llama-3.1-405B-Instruct-quantized.w4a16
Qwen3-Next-80B-A3B-Instruct-FP8-dynamic
Ministral-3-3B-Instruct-2512
Mistral-Small-3.2-24B-Instruct-2506-FP8
DeepSeek-Coder-V2-Instruct-FP8
Sparse-Llama-3.1-8B-2of4
Qwen3-Next-80B-A3B-Thinking-quantized.w4a16
Qwen3-14B-speculator.eagle3
Model Overview - Verifier: Qwen/Qwen3-14B - Speculative Decoding Algorithm: EAGLE-3 - Model Architecture: Eagle3Speculator - Release Date: 09/18/2025 - Version: 1.0 - Model Developers: RedHat This is a speculator model designed for use with Qwen/Qwen3-14B, based on the EAGLE-3 speculative decoding algorithm. It was trained using the speculators library on a combination of the Aeala/ShareGPT_Vicuna_unfiltered dataset and the `train_sft` split of the HuggingFaceH4/ultrachat_200k dataset. This model should be used with the Qwen/Qwen3-14B chat template, specifically through the `/chat/completions` endpoint. Text Summarization 1.60 1.90 2.06 2.14 2.17 2.19 2.21 - temperature: 0.6 - top_p: 0.95 - top_k: 20 - repetitions: 3 - time per experiment: 10min - hardware: 1xA100 - vLLM version: 0.11.0 - GuideLLM version: 0.3.0 Command
```bash
GUIDELLM__PREFERRED_ROUTE="chat_completions" \
guidellm benchmark \
  --target "http://localhost:8000/v1" \
  --data "RedHatAI/speculator_benchmarks" \
  --data-args '{"data_files": "HumanEval.jsonl"}' \
  --rate-type sweep \
  --max-seconds 600 \
  --output-path "Qwen3-14B-HumanEval.json" \
  --backend-args '{"extra_body": {"chat_completions": {"temperature":0.6, "top_p":0.95, "top_k":20}}}'
```
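Beyond the benchmark command, serving the verifier together with this speculator can be sketched as follows; the speculative_config keys follow recent vLLM releases and the speculator repository id is assumed.

```python
from vllm import LLM, SamplingParams

# Verifier model plus EAGLE-3 speculator; the config keys follow recent vLLM releases.
llm = LLM(
    model="Qwen/Qwen3-14B",
    speculative_config={
        "model": "RedHatAI/Qwen3-14B-speculator.eagle3",  # assumed repository id
        "num_speculative_tokens": 3,
    },
)

messages = [{"role": "user", "content": "Summarize the benefits of speculative decoding."}]
outputs = llm.chat(messages, SamplingParams(temperature=0.6, top_p=0.95, top_k=20, max_tokens=256))
print(outputs[0].outputs[0].text)
```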
Qwen3-1.7B-quantized.w4a16
Llama-2-7b-chat-hf-FP8
GLM-4.6-NVFP4
SmolLM3-3B-quantized.w4a16
Model Overview - Model Architecture: SmolLM3-3B - Input: Text - Output: Text - Model Optimizations: - Weight quantization: INT4 - Activation quantization: None - Release Date: 07/31/2025 - Version: 1.0 - License(s): Apache-2.0 - Model Developers: RedHat (Neural Magic) This model was obtained by quantizing the weights of SmolLM3-3B to INT4 data type. This optimization reduces the number of bits used to represent weights from 16 to 4, reducing GPU memory requirements (by approximately 75%). Weight quantization also reduces disk size requirements by approximately 75%. Only weights of the linear operators within transformers blocks are quantized. The llm-compressor library is used for quantization. This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details. Creation details This model was created with llm-compressor by running the code snippet below with: This model was evaluated on the well-known reasoning tasks: AIME24, Math-500, and GPQA-Diamond. In all cases, model outputs were generated with the vLLM engine, and evaluations were collected with the LightEval library.
Llama-3.1-8B-tldr-FP8-dynamic
Model Overview - Model Architecture: LlamaForCausalLM - Input: Text - Output: Text - Model Optimizations: - Weight quantization: FP8 - Activation quantization: FP8 - Release Date: 06/06/2025 - Version: 1.0 - Intended Use Cases: This model is finetuned to summarize text in the style of Reddit posts. - Out-of-scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in any other way that is prohibited by the Acceptable Use Policy and Llama 3.1 Community License. - Model Developers: Red Hat (Neural Magic) This model is a quantized version of RedHatAI/Llama-3.1-8B-tldr, which is fine-tuned on the trl-lib/tldr dataset. This model recovers 100% of the BERTScore (0.366) obtained by RedHatAI/Llama-3.1-8B-tldr while providing up to 1.3x speedup. This model can be deployed efficiently using vLLM, as shown in the example below. Run the following command to start the vLLM server: Once your server is started, you can query the model using the OpenAI API: This model was created by applying llm-compressor, as presented in the code snippet below. The model was evaluated on the test split of trl-lib/tldr using the Neural Magic fork of lm-evaluation-harness (tldr branch). One can reproduce these results by using the following command: We evaluated the inference performance of this model using the first 1,000 samples from the training set of the trl-lib/tldr dataset. Benchmarking was conducted with vLLM version `0.9.0.1` and GuideLLM version `0.2.1`. The figure below presents the mean end-to-end latency per request across varying request rates. Results are shown for this model, as well as two variants: - Dense: Llama-3.1-8B-tldr - Sparse-quantized: Sparse-Llama-3.1-8B-tldr-2of4-FP8-dynamic 1. Generate a JSON file containing the first 1,000 training samples: > The average output length is approximately 30 tokens per sample. We capped the generation at 128 tokens to reduce performance skew from rare, unusually verbose completions.
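The serve command and the query example referenced above are not reproduced in this excerpt. The server can be started with `vllm serve RedHatAI/Llama-3.1-8B-tldr-FP8-dynamic` (repository id assumed); a query sketch with the OpenAI Python client, using an illustrative Reddit-style post, follows.

```python
from openai import OpenAI

# Point the OpenAI client at the local vLLM server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

post = (
    "SUBREDDIT: r/learnprogramming\n"
    "TITLE: Struggling to stay motivated while learning to code\n"
    "POST: I started learning Python three months ago and keep losing steam ..."
)

response = client.chat.completions.create(
    model="RedHatAI/Llama-3.1-8B-tldr-FP8-dynamic",  # assumed repository id
    messages=[{"role": "user", "content": post}],
    max_tokens=128,
    temperature=0.0,
)
print(response.choices[0].message.content)
```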
Qwen2.5-7B-Instruct-quantized.w4a16
gemma-2-2b-it-quantized.w8a16
Mixtral-8x22B-Instruct-v0.1-FP8
Qwen3-14B-FP8-block
Model Overview - Model Architecture: Qwen3ForCausalLM - Input: Text - Output: Text - Model Optimizations: - Weight quantization: FP8 - Activation quantization: FP8 - Release Date: - Version: 1.0 - Model Developers: Red Hat This model was obtained by quantizing the weights and activations of Qwen/Qwen3-14B to FP8 data type. This optimization reduces the number of bits per parameter from 16 to 8, reducing the disk size and GPU memory requirements by approximately 50%. Only the weights and activations of the linear operators within transformers blocks of the language model are quantized. This model was quantized using the llm-compressor library as shown below. The model was evaluated on the OpenLLM leaderboard tasks, using lm-evaluation-harness. vLLM was used for all evaluations. Category Metric Qwen/Qwen3-14B nm-testing/Qwen3-14B-FP8-block Recovery (%) OpenLLM V1 ARC-Challenge (Acc-Norm, 25-shot) 69.71 69.80 100.12 OpenLLM V2 IFEval (Inst Level Strict Acc, 0-shot) 48.56 48.80 100.49
granite-3.1-8b-instruct-FP8-dynamic
Qwen3-14B-NVFP4
granite-3.3-8b-instruct
Model Summary: Granite-3.3-8B-Instruct is an 8-billion-parameter, 128K-context-length language model fine-tuned for improved reasoning and instruction-following capabilities. Built on top of Granite-3.3-8B-Base, the model delivers significant gains on benchmarks for measuring generic performance, including AlpacaEval-2.0 and Arena-Hard, and improvements in mathematics, coding, and instruction following. It supports structured reasoning through `<think></think>` and `<response></response>` tags, providing clear separation between internal thoughts and final outputs. The model has been trained on a carefully balanced combination of permissively licensed data and curated synthetic tasks. - Developers: Granite Team, IBM - Website: Granite Docs - Release Date: April 16th, 2025 - License: Apache 2.0 Supported Languages: English, German, Spanish, French, Japanese, Portuguese, Arabic, Czech, Italian, Korean, Dutch, and Chinese. However, users may finetune this Granite model for languages beyond these 12 languages. Intended Use: This model is designed to handle general instruction-following tasks and can be integrated into AI assistants across various domains, including business applications. Capabilities: Thinking, Summarization, Text classification, Text extraction, Question-answering, Retrieval Augmented Generation (RAG), Code related tasks, Function-calling tasks, Multilingual dialog use cases, Long-context tasks including long document/meeting summarization, long document QA, etc. Generation: This is a simple example of how to use the Granite-3.3-8B-Instruct model (a minimal generation sketch appears at the end of this card). Then, copy the snippet from the section that is relevant for your use case. Comparison with different models over various benchmarks [1]. Scores of AlpacaEval-2.0 and Arena-Hard are calculated with thinking=True.

| Models | Arena-Hard | AlpacaEval-2.0 | MMLU | PopQA | TruthfulQA | BigBenchHard [2] | DROP [3] | GSM8K | HumanEval | HumanEval+ | IFEval | AttaQ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Granite-3.1-2B-Instruct | 23.3 | 27.17 | 57.11 | 20.55 | 59.79 | 61.82 | 20.99 | 67.55 | 79.45 | 75.26 | 63.59 | 84.7 |
| Granite-3.2-2B-Instruct | 24.86 | 34.51 | 57.18 | 20.56 | 59.8 | 61.39 | 23.84 | 67.02 | 80.13 | 73.39 | 61.55 | 83.23 |
| Granite-3.3-2B-Instruct | 28.86 | 43.45 | 55.88 | 18.4 | 58.97 | 63.91 | 44.33 | 72.48 | 80.51 | 75.68 | 65.8 | 87.47 |
| Llama-3.1-8B-Instruct | 36.43 | 27.22 | 69.15 | 28.79 | 52.79 | 73.43 | 71.23 | 83.24 | 85.32 | 80.15 | 79.10 | 83.43 |
| DeepSeek-R1-Distill-Llama-8B | 17.17 | 21.85 | 45.80 | 13.25 | 47.43 | 67.39 | 49.73 | 72.18 | 67.54 | 62.91 | 66.50 | 42.87 |
| Qwen-2.5-7B-Instruct | 25.44 | 30.34 | 74.30 | 18.12 | 63.06 | 69.19 | 64.06 | 84.46 | 93.35 | 89.91 | 74.90 | 81.90 |
| DeepSeek-R1-Distill-Qwen-7B | 10.36 | 15.35 | 50.72 | 9.94 | 47.14 | 67.38 | 51.78 | 78.47 | 79.89 | 78.43 | 59.10 | 42.45 |
| Granite-3.1-8B-Instruct | 37.58 | 30.34 | 66.77 | 28.7 | 65.84 | 69.87 | 58.57 | 79.15 | 89.63 | 85.79 | 73.20 | 85.73 |
| Granite-3.2-8B-Instruct | 55.25 | 61.19 | 66.79 | 28.04 | 66.92 | 71.86 | 58.29 | 81.65 | 89.35 | 85.72 | 74.31 | 84.7 |
| Granite-3.3-8B-Instruct | 57.56 | 62.68 | 65.54 | 26.17 | 66.86 | 69.13 | 59.36 | 80.89 | 89.73 | 86.09 | 74.82 | 88.5 |

Training Data: Overall, our training data is largely composed of two key sources: (1) publicly available datasets with permissive license, (2) internal synthetically generated data targeted to enhance reasoning capabilities. Infrastructure: We train Granite-3.3-8B-Instruct using IBM's supercomputing cluster, Blue Vela, which is outfitted with NVIDIA H100 GPUs. This cluster provides a scalable and efficient infrastructure for training our models over thousands of GPUs. Ethical Considerations and Limitations: Granite-3.3-8B-Instruct builds upon Granite-3.3-8B-Base, leveraging both permissively licensed open-source and select proprietary data for enhanced performance.
Since it inherits its foundation from the previous model, all ethical considerations and limitations applicable to Granite-3.3-8B-Base remain relevant. Resources - ⭐️ Learn about the latest updates with Granite: https://www.ibm.com/granite - 📄 Get started with tutorials, best practices, and prompt engineering advice: https://www.ibm.com/granite/docs/ - 💡 Learn about the latest Granite learning resources: https://github.com/ibm-granite-community/ [1] Evaluated using OLMES (except AttaQ and Arena-Hard scores) [2] Added regex for more efficient answer extraction. [3] Modified the implementation to handle some of the issues mentioned here
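The Generation section of this card refers to a usage snippet that is not reproduced here. A minimal transformers sketch is shown below; device placement and generation settings are illustrative.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ibm-granite/granite-3.3-8b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

messages = [{"role": "user", "content": "What is the capital of France?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Greedy decoding keeps the example deterministic.
output = model.generate(inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(output[0, inputs.shape[-1]:], skip_special_tokens=True))
```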
Mistral-7B-Instruct-v0.3-quantized.w8a16
pixtral-12b-quantized.w4a16
Qwen3-4B-Thinking-2507-quantized.w4a16
bge-small-en-v1.5-quant
Qwen3-4B-Thinking-2507-quantized.w8a8
Devstral-Small-2-24B-Instruct-2512
Qwen2.5-72B-Instruct-quantized.w4a16
Qwen2-VL-72B-Instruct-FP8-dynamic
gemma-2-2b-quantized.w8a16
all-MiniLM-L6-v2
gemma-3n-E4B-it-quantized.w4a16
Model Overview - Model Architecture: gemma-3n-E4B-it - Input: Audio-Vision-Text - Output: Text - Model Optimizations: - Weight quantization: INT4 - Activation quantization: INT16 - Release Date: 08/01/2025 - Version: 1.0 - Model Developers: RedHatAI This model was obtained by quantizing the weights of google/gemma-3n-E4B-it to INT4 data type, ready for inference with vLLM >= 0.10.0. This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details. This model was created with llm-compressor by running the code snippet below. The model was evaluated using lm-evaluation-harness for OpenLLM V1 and V2 text-based benchmarks. The evaluations were conducted using the following commands: Category Metric google/gemma-3n-E4B-it RedHatAI/gemma-3n-E4B-it-quantized.w4a16 Recovery (%)
pixtral-12b-quantized.w8a8
Mistral-Large-Instruct-2407-FP8
bert-large-uncased-finetuned-squadv1
Phi-3-vision-128k-instruct-W4A16-G128
Model Overview - Model Architecture: Phi-3-vision-128k-instruct - Input: Vision-Text - Output: Text - Model Optimizations: - Weight quantization: INT4 - Activation quantization: FP16 - Release Date: 1/31/2025 - Version: 1.0 - Model Developers: Neural Magic Quantized version of microsoft/Phi-3-vision-128k-instruct. This model was obtained by quantizing the weights of microsoft/Phi-3-vision-128k-instruct to INT4 data type, ready for inference with vLLM >= 0.5.2. This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details. This model was created with llm-compressor by running the code snippet below as part of a multimodal announcement blog. This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft’s Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties’ policies.
Qwen3-30B-A3B-Thinking-speculator.eagle3
whisper-small-FP8-Dynamic
Model Overview - Model Architecture: whisper-small - Input: Audio-Text - Output: Text - Model Optimizations: - Weight quantization: FP8 - Activation quantization: FP8 - Release Date: 04/16/2025 - Version: 1.0 - Model Developers: Neural Magic This model was obtained by quantizing the weights of openai/whisper-small to FP8 data type, ready for inference with vLLM >= 0.5.2. This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details. This model was created with llm-compressor by running the code snippet below. The model was evaluated on LibriSpeech and Fleurs datasets using lmms-eval, via the following commands: Fleurs (X→en, WER) cmn_hans_cn 23.0642 24.6761 93.50%
DeepSeek-V3.2-NVFP4-FP8-BLOCK
whisper-large-v3-turbo-quantized.w8a8
Model Overview - Model Architecture: whisper-large-v3-turbo - Input: Audio-Text - Output: Text - Model Optimizations: - Weight quantization: INT8 - Activation quantization: INT8 - Release Date: 04/16/2025 - Version: 1.0 - Model Developers: Neural Magic Quantized version of openai/whisper-large-v3-turbo. This model was obtained by quantizing the weights of openai/whisper-large-v3-turbo to INT8 data type, ready for inference with vLLM >= 0.5.2. This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details. This model was created with llm-compressor by running the code snippet below. The model was evaluated on LibriSpeech and Fleurs datasets using lmms-eval, via the following commands:
embeddinggemma-300m
Llama-4-Maverick-17B-128E-Instruct-FP8-block
Model Overview - Model Architecture: Llama4ForConditionalGeneration - Input: Text, Image - Output: Text - Model Optimizations: - Weight quantization: FP8 - Activation quantization: FP8 - Release Date: - Version: 1.0 - Model Developers: Red Hat Quantized version of meta-llama/Llama-4-Maverick-17B-128E-Instruct. This model was obtained by quantizing the weights and activations of meta-llama/Llama-4-Maverick-17B-128E-Instruct to FP8 data type. This optimization reduces the number of bits per parameter from 16 to 8, reducing the disk size and GPU memory requirements by approximately 50%. Only the weights and activations of the linear operators within transformers blocks of the language model are quantized. This model was quantized using the llm-compressor library as shown below. The model was evaluated on the OpenLLM leaderboard tasks, using lm-evaluation-harness. vLLM was used for all evaluations. Category Metric meta-llama/Llama-4-Maverick-17B-128E-Instruct RedHatAI/Llama-4-Maverick-17B-128E-Instruct-block-FP8 Recovery (%) OpenLLM V1 ARC-Challenge (Acc-Norm, 25-shot) 73.38 73.38 100.00 OpenLLM V2 IFEval (Inst Level Strict Acc, 0-shot) 89.93 90.89 101.07
Qwen3-30B-A3B-FP8-block
Llama-Guard-4-12B-FP8-dynamic
granite-3.1-2b-instruct-FP8-dynamic
DeepSeek-R1-Distill-Qwen-1.5B-FP8-dynamic
GLM-4.6-quantized.w8a8
GLM-4.6-FP8-dynamic
Qwen2.5-0.5B-Instruct-quantized.w8a8
TinyLlama-1.1B-Chat-v0.4-pruned50-quant-ds
zephyr-7b-beta-marlin
Llama-3.1-Nemotron-70B-Instruct-HF-quantized.w8a8
gemma-3-12b-it
Qwen3-32B-Thinking-speculator.eagle3
Phi-3-mini-128k-instruct-quantized.w8a8
Mistral-Large-3-675B-Instruct-2512-NVFP4
Mistral-Large-3-675B-Instruct-2512
Qwen2.5-Coder-7B-FP8-dynamic
Phi-3.5-mini-instruct-FP8-KV
Llama-2-7b-evolcodealpaca
Qwen3-Coder-480B-A35B-Instruct-FP8
Qwen2.5-1.5B-quantized.w4a16
Llama-Guard-4-12B-quantized.w8a8
Qwen2-0.5B-Instruct-quantized.w4a16
Qwen2.5-32B-quantized.w4a16
Llama-2-7b-chat-quantized.w8a16
Qwen2.5-32B-Instruct-FP8-dynamic
Llama-3.1-Nemotron-70B-Instruct-HF-quantized.w4a16
Llama-3.3-70B-Instruct-FP8-block
Model Overview - Model Architecture: LlamaForCausalLM - Input: Text - Output: Text - Model Optimizations: - Weight quantization: FP8 - Activation quantization: FP8 - Release Date: - Version: 1.0 - Model Developers: Red Hat Quantized version of meta-llama/Llama-3.3-70B-Instruct. This model was obtained by quantizing the weights and activations of meta-llama/Llama-3.3-70B-Instruct to FP8 data type. This optimization reduces the number of bits per parameter from 16 to 8, reducing the disk size and GPU memory requirements by approximately 50%. Only the weights and activations of the linear operators within transformers blocks of the language model are quantized. This model was quantized using the llm-compressor library as shown below. The model was evaluated on the OpenLLM v1 leaderboard tasks using lm-evaluation-harness, and on reasoning tasks using lighteval. vLLM was used for all evaluations. Category Metric meta-llama/Llama-3.3-70B-Instruct nm-testing/Llama-3.3-70B-Instruct-FP8-block Recovery (%) OpenLLM V1 ARC-Challenge (Acc-Norm, 25-shot) 72.53 72.61 100.12 OpenLLM V2 IFEval (Inst Level Strict Acc, 0-shot) 92.57 92.57 100.00
OpenHermes-2.5-Mistral-7B-marlin
gemma-2-9b-it-quantized.w4a16
gemma-2-9b-it-quantized.w8a16
Qwen2-1.5B-Instruct-quantized.w4a16
Qwen3-32B-FP8-block
Model Overview - Model Architecture: Qwen3ForCausalLM - Input: Text - Output: Text - Model Optimizations: - Weight quantization: FP8 - Activation quantization: FP8 - Release Date: - Version: 1.0 - Model Developers: Red Hat This model was obtained by quantizing the weights and activations of Qwen/Qwen3-32B to FP8 data type. This optimization reduces the number of bits per parameter from 16 to 8, reducing the disk size and GPU memory requirements by approximately 50%. Only the weights and activations of the linear operators within transformers blocks of the language model are quantized. This model was quantized using the llm-compressor library as shown below. The model was evaluated on the OpenLLM leaderboard tasks, using lm-evaluation-harness. vLLM was used for all evaluations. Category Metric Qwen/Qwen3-32B nm-testing/Qwen3-32B-FP8-block Recovery (%) OpenLLM V1 ARC-Challenge (Acc-Norm, 25-shot) 72.95 72.78 99.77 OpenLLM V2 IFEval (Inst Level Strict Acc, 0-shot) 49.04 49.28 100.49
bge-large-en-v1.5-quant
Meta-Llama-3-70B-Instruct-FP8-KV
Qwen3-4B-Instruct-2507-quantized.w4a16
Qwen3-4B-Instruct-2507-quantized.w8a8
Llama-3.1-8B-tldr
Qwen3-235B-A22B-speculator.eagle3
Qwen2.5-0.5B-quantized.w8a8
Meta-Llama-3.1-405B-Instruct-quantized.w8a8
granite-3.1-8b-base-quantized.w8a8
GLM-4.6-quantized.w4a16
granite-embedding-english-r2
Llama-2-7b-gsm8k-pruned_70
granite-3.1-2b-instruct-quantized.w8a8
granite-3.1-2b-base-quantized.w8a8
Qwen2-1.5B-Instruct-quantized.w8a16
DeepSeek-R1-quantized.w4a16
Mistral-7B-Instruct-v0.3-quantized.w8a8
granite-3.1-8b-base-quantized.w4a16
Qwen3-Embedding-8B
Qwen2.5-7B-quantized.w4a16
Pixtral-Large-Instruct-2411-hf-quantized.w8a8
oBERT-12-upstream-pruned-unstructured-97-finetuned-qqp
Llama4-Maverick-17B-128E-Instruct-speculator.eagle3
Llama-4-Maverick-17B-128E-Instruct-speculators.eagle3 Model Overview - Verifier: meta-llama/Llama-4-Maverick-17B-128E-Instruct - Speculative Decoding Algorithm: EAGLE-3 - Model Architecture: Eagle3Speculator - Release Date: 09/17/2025 - Version: 1.0 - Model Developers: RedHat This is a speculator model designed for use with meta-llama/Llama-4-Maverick-17B-128E-Instruct, based on the EAGLE-3 speculative decoding algorithm. It was converted into the speculators format from the model nvidia/Llama-4-Maverick-17B-128E-Eagle3. This model should be used with the meta-llama/Llama-4-Maverick-17B-128E-Instruct chat template, specifically through the `/chat/completions` endpoint. Text Summarization 1.69 2.12 2.37 2.52 2.60 2.63 2.63 - temperature: 0.6 - top_p: 0.9 - repetitions: 3 - time per experiment: 3min - hardware: 8xB200 - vLLM version: 0.11.0 - GuideLLM version: 0.3.0 If you use this model, please cite both the original NVIDIA model and the Speculators library: - Original model by NVIDIA Corporation - Conversion and formatting for Speculators/vLLM compatibility - Based on Eagle3 architecture with Llama3 draft head targeting Llama4 verifier
Qwen2-7B-Instruct-quantized.w4a16
TinyLlama-1.1B-Chat-v1.0-pruned2.4
QwQ-32B-quantized.w8a8
Meta-Llama-3.1-70B-Instruct-quantized.w8a16
Mistral-Small-4-119B-2603-NVFP4
NVIDIA-Nemotron-3-Super-120B-A12B-FP8
starcoder2-15b-quantized.w8a16
Qwen2.5-32B-Instruct-quantized.w4a16
Qwen3-30B-A3B-Thinking-2507-quantized.w8a8
bge-base-en-v1.5-dense
Llama-2-7b_oneshot-pruned70_C4_10k
Qwen2-72B-Instruct-quantized.w8a16
oBERT-12-upstream-pruned-unstructured-97-finetuned-mnli
oBERT-6-downstream-pruned-unstructured-90-squadv1
mpt-7b-gsm8k-pruned50-quant-ds
Qwen2-0.5B-Instruct-quantized.w8a8
Mixtral-8x7B-Instruct-v0.1-AutoFP8
Qwen2.5-14B-quantized.w8a8
oBERT-6-downstream-pruned-block4-80-squadv1
Qwen3-Next-80B-A3B-Instruct-FP8-block
oBERT-12-upstream-pruned-unstructured-97-finetuned-squadv1
whisper-small-quantized.w8a8
Model Overview - Model Architecture: whisper-small - Input: Audio-Text - Output: Text - Model Optimizations: - Weight quantization: INT8 - Activation quantization: INT8 - Release Date: 04/16/2025 - Version: 1.0 - Model Developers: Neural Magic This model was obtained by quantizing the weights of openai/whisper-small to INT8 data type, ready for inference with vLLM >= 0.5.2. This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details. This model was created with llm-compressor by running the code snippet below. The model was evaluated on LibriSpeech and Fleurs datasets using lmms-eval, via the following commands: Fleurs (X→en, WER) cmn_hans_cn 23.0642 24.1052 95.68%
Meta-Llama-3-8B-Instruct-quantized.w4a16
Llama-4-Maverick-17B-128E-Instruct
Phi-3-mini-128k-instruct-quantized.w4a16
Qwen2-72B-Instruct-quantized.w8a8
Qwen2.5-72B-FP8-dynamic
whisper-small-quantized.w4a16
Model Overview - Model Architecture: whisper-small - Input: Audio-Text - Output: Text - Model Optimizations: - Weight quantization: INT4 - Release Date: 04/16/2025 - Version: 1.0 - Model Developers: Neural Magic This model was obtained by quantizing the weights of openai/whisper-small to INT4 data type, ready for inference with vLLM >= 0.5.2. This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details. This model was created with llm-compressor by running the code snippet below. The model was evaluated on LibriSpeech and Fleurs datasets using lmms-eval, via the following commands: Fleurs (X→en, WER) cmn_hans_cn 23.0642 25.5212 90.37%
Qwen3-30B-A3B-Instruct-2507-quantized.w8a8
bge-small-en-v1.5-dense
Llama-2-7b-ultrachat200k-pruned_70-quantized-deepsparse
Phi-3-mini-128k-instruct-FP8
Qwen2-7B-Instruct-quantized.w8a8
granite-3.1-2b-base-quantized.w4a16
Qwen2-VL-72B-Instruct-quantized.w4a16
Model Overview - Model Architecture: Qwen/Qwen2-VL-72B-Instruct - Input: Vision-Text - Output: Text - Model Optimizations: - Weight quantization: INT4 - Activation quantization: FP16 - Release Date: 2/24/2025 - Version: 1.0 - Model Developers: Neural Magic This model was obtained by quantizing the weights of Qwen/Qwen2-VL-72B-Instruct to INT4 data type, ready for inference with vLLM >= 0.5.2. This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details. This model was created with llm-compressor by running the code snippet below as part of a multimodal announcement blog. The model was evaluated using mistral-evals for vision-related tasks and using lm-evaluation-harness for select text-based benchmarks. The evaluations were conducted using the following commands: Vision Tasks - vqav2 - docvqa - mathvista - mmmu - chartqa Category Metric Qwen/Qwen2-VL-72B-Instruct nm-testing/Qwen2-VL-72B-Instruct-quantized.W4A16 Recovery (%) Vision MMMU (val, CoT) explicit_prompt_relaxed_correctness 62.11 60.11 96.78% ChartQA (test, CoT) anywhere_in_answer_relaxed_correctness 83.40 80.72 96.78% Mathvista (testmini, CoT) explicit_prompt_relaxed_correctness 66.57 64.66 97.13% This model achieves up to 3.7x speedup in single-stream deployment and up to 3.3x speedup in multi-stream asynchronous deployment, depending on hardware and use-case scenario. The following performance benchmarks were conducted with vLLM version 0.7.2, and GuideLLM. Single-stream performance (measured with vLLM version 0.7.2) Document Visual Question Answering 1680W x 2240H 64/128 Visual Reasoning 640W x 480H 128/128 Image Captioning 480W x 360H 0/128 Hardware Number of GPUs Model Average Cost Reduction Latency (s) QPD Latency (s) QPD Latency (s) QPD 2 neuralmagic/Qwen2-VL-72B-Instruct-quantized.w8a8 1.85 7.2 139 4.9 206 4.8 211 1 neuralmagic/Qwen2-VL-72B-Instruct-quantized.w4a16 3.32 10.0 202 5.0 398 4.8 419 2 neuralmagic/Qwen2-VL-72B-Instruct-FP8-Dynamic 1.79 4.7 119 3.3 173 3.2 177 1 neuralmagic/Qwen2-VL-72B-Instruct-quantized.w4a16 2.60 6.4 172 4.3 253 4.2 259 Use case profiles: Image Size (WxH) / prompt tokens / generation tokens QPD: Queries per dollar, based on on-demand cost at Lambda Labs (observed on 2/18/2025). Multi-stream asynchronous performance (measured with vLLM version 0.7.2) Document Visual Question Answering 1680W x 2240H 64/128 Visual Reasoning 640W x 480H 128/128 Image Captioning 480W x 360H 0/128 Hardware Model Average Cost Reduction Maximum throughput (QPS) QPD Maximum throughput (QPS) QPD Maximum throughput (QPS) QPD neuralmagic/Qwen2-VL-72B-Instruct-quantized.w8a8 1.84 0.6 293 2.0 1021 2.3 1135 neuralmagic/Qwen2-VL-72B-Instruct-quantized.w4a16 2.73 0.6 314 3.2 1591 4.0 2019 neuralmagic/Qwen2-VL-72B-Instruct-FP8-Dynamic 1.70 0.8 236 2.2 623 2.4 669 neuralmagic/Qwen2-VL-72B-Instruct-quantized.w4a16 2.35 1.3 350 3.3 910 3.6 994 Use case profiles: Image Size (WxH) / prompt tokens / generation tokens QPD: Queries per dollar, based on on-demand cost at Lambda Labs (observed on 2/18/2025).
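The deployment example referenced earlier in this card is not reproduced here. A minimal vLLM vision-chat sketch is given below; the repository id, image URL, GPU count, and sampling settings are illustrative assumptions.

```python
from vllm import LLM, SamplingParams

# Assumed repository id; tensor_parallel_size should match your hardware.
llm = LLM(
    model="RedHatAI/Qwen2-VL-72B-Instruct-quantized.w4a16",
    tensor_parallel_size=2,
    max_model_len=8192,
)

# OpenAI-style multimodal message; the image URL is a placeholder.
messages = [{
    "role": "user",
    "content": [
        {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
        {"type": "text", "text": "Describe what this chart shows."},
    ],
}]

outputs = llm.chat(messages, SamplingParams(temperature=0.2, max_tokens=256))
print(outputs[0].outputs[0].text)
```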
whisper-large-v2-quantized.w8a8
Model Overview - Model Architecture: whisper-large-v2 - Input: Audio-Text - Output: Text - Model Optimizations: - Weight quantization: INT8 - Activation quantization: INT8 - Release Date: 04/16/2025 - Version: 1.0 - Model Developers: Neural Magic This model was obtained by quantizing the weights of openai/whisper-large-v2 to INT8 data type, ready for inference with vLLM >= 0.5.2. This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details. This model was created with llm-compressor by running the code snippet below. The model was evaluated on LibriSpeech and Fleurs datasets using lmms-eval, via the following commands: Fleurs (X→en, WER) cmn_hans_cn 15.2148 15.4498 98.48%
gemma-2-9b-it-quantized.w8a8
Qwen2.5-14B-FP8-dynamic
DeepSeek-Coder-V2-Instruct-0724-quantized.w4a16
oBERT-12-downstream-pruned-unstructured-80-mnli
oBERT-6-downstream-pruned-unstructured-80-squadv1
oBERT-12-upstream-pruned-unstructured-97-finetuned-squadv1-v2
Qwen2-7B-Instruct-quantized.w8a16
Sparse-Llama-3.1-8B-evolcodealpaca-2of4-quantized.w4a16
Sparse-Llama-3.1-8B-evolcodealpaca-2of4-quantized.w4a16 Model Overview - Model Architecture: Llama-3.1-8B - Input: Text - Output: Text - Model Optimizations: - Sparsity: 2:4 - Weight quantization: INT4 - Release Date: 11/21/2024 - Version: 1.0 - License(s): llama3.1 - Model Developers: Neural Magic This is a code completion AI model obtained by fine-tuning the 2:4 sparse Sparse-Llama-3.1-8B-2of4 on the evol-codealpaca-v1 dataset, followed by quantization. On the HumanEval benchmark, it achieves a pass@1 of 50.6, compared to 48.5 for the fine-tuned dense model Llama-3.1-8B-evolcodealpaca — demonstrating over 100% accuracy recovery. This model was obtained by quantizing the weights of Sparse-Llama-3.1-8B-evolcodealpaca-2of4 to INT4 data type. This optimization reduces the number of bits per parameter from 16 to 4, reducing the disk size and GPU memory requirements by approximately 75%. That is on top of the reduction of 50% of weights via 2:4 pruning employed on Sparse-Llama-3.1-8B-evolcodealpaca-2of4. Only the weights of the linear operators within transformers blocks are quantized. Symmetric per-channel quantization is applied, in which a linear scaling per output dimension maps the INT4 and floating point representations of the quantized weights. The GPTQ algorithm is applied for quantization, as implemented in the llm-compressor library. This model can be deployed efficiently using the vLLM backend. vLLM also supports OpenAI-compatible serving. See the documentation for more details. This model was evaluated on Neural Magic's fork of EvalPlus. Metric Llama-3.1-8B-evolcodealpaca Sparse-Llama-3.1-8B-evolcodealpaca-2of4 Sparse-Llama-3.1-8B-evolcodealpaca-2of4-quantized.w4a16
Qwen2.5-7B-FP8-dynamic
Sparse-Llama-3.1-8B-ultrachat_200k-2of4-FP8-dynamic
oBERT-12-downstream-pruned-unstructured-90-mnli
oBERT-teacher-qqp
oBERT-12-upstream-pruned-unstructured-90-finetuned-mnli
oBERT-12-upstream-pruned-unstructured-97-finetuned-qqp-v2
MiniChat-3B-pruned50-quant-ds
Llama-2-7b-ultrachat200k-pruned_50
Llama-2-7b-dolphin-open_platypus-pruned_50-quantized-deepsparse
starcoder2-3b-FP8
starcoder2-7b-quantized.w8a16
Qwen2.5-14B-Instruct-FP8-dynamic
Phi-3-medium-128k-instruct-FP8
bge-small-en-v1.5-sparse
mpt-7b-gsm8k-pruned80-quant-ds
Sparse-Llama-3.1-8B-ultrachat_200k-2of4
Model Overview - Model Architecture: Llama-3.1-8B - Input: Text - Output: Text - Model Optimizations: - Sparsity: 2:4 - Release Date: 11/21/2024 - Version: 1.0 - License(s): llama3.1 - Model Developers: Neural Magic This is a multi-turn conversational AI model obtained by fine-tuning the 2:4 sparse Sparse-Llama-3.1-8B-2of4 on the ultrachat_200k dataset. On the AlpacaEval benchmark (version 1), it achieves a score of 61.1, compared to 62.0 for the fine-tuned dense model Llama-3.1-8B-ultrachat_200k — demonstrating a 98.5% accuracy recovery. This model inherits the optimizations from its parent, Sparse-Llama-3.1-8B-2of4. Namely, all linear operators within transformer blocks were pruned to the 2:4 sparsity pattern: in each group of four weights, two are retained while two are pruned. This model can be deployed efficiently using the vLLM backend. vLLM also supports OpenAI-compatible serving. See the documentation for more details. This model was evaluated on Neural Magic's fork of the AlpacaEval benchmark. We adopt the same setup as in Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment, using version 1 of the benchmark and Llama-2-70b-chat as the annotator. Metric Llama-3.1-8B-ultrachat_200k Sparse-Llama-3.1-8B-ultrachat_200k-2of4
oBERT-teacher-squadv1
oBERT-12-downstream-pruned-unstructured-80-squadv1
oBERT-teacher-mnli
oBERT-12-downstream-pruned-unstructured-90-qqp
oBERT-12-upstream-pruned-unstructured-90
oBERT-12-upstream-pruned-unstructured-90-v2
oBERT-12-upstream-pruned-unstructured-90-finetuned-qqp-v2
OpenHermes-2.5-Mistral-7B-pruned2.4
Nous-Hermes-2-SOLAR-10.7B-pruned2.4
Llama-2-7b-evol-code-alpaca-pruned_70-quantized-deepsparse
starcoder2-15b-FP8
starcoder2-7b-FP8
starcoder2-15b-quantized.w8a8
Qwen2.5-0.5B-quantized.w4a16
ToolACE-2-Llama-3.1-8B-FP8-dynamic
Nous-Hermes-2-Yi-34B-marlin
Qwen2-72B-Instruct-quantized.w4a16
Phi-3-medium-128k-instruct-quantized.w8a8
mpt-7b-gsm8k-pruned70-quant-ds
Pixtral-Large-Instruct-2411-hf-FP8-dynamic
oBERT-12-downstream-pruned-unstructured-97-mnli
oBERT-12-downstream-pruned-unstructured-97-qqp
oBERT-12-upstream-pretrained-dense
oBERT-12-upstream-pruned-unstructured-97
oBERT-6-downstream-dense-squadv1
oBERT-12-upstream-pruned-unstructured-90-finetuned-mnli-v2
mpt-7b-gsm8k-pruned60-quant-ds
mpt-7b-gsm8k-pruned60-pt
llama-2-7b-chat-marlin
Llama-2-7b-ultrachat200k-pruned_50-quantized-deepsparse
Llama-2-7b-evol-code-alpaca-pruned_50
DeepSeek-Coder-V2-Base-FP8
Qwen2.5-1.5B-quantized.w8a16
Qwen2.5-1.5B-FP8-dynamic
Model Overview - Model Architecture: Qwen2 - Input: Text - Output: Text - Model Optimizations: - Activation quantization: INT8 - Weight quantization: INT8 - Intended Use Cases: Intended for commercial and research use in multiple languages. Similarly to Qwen2.5-1.5B, this model is intended for assistant-like chat. - Out-of-scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws). - Release Date: 11/27/2024 - Version: 1.0 - License(s): apache-2.0 - Model Developers: Neural Magic Quantized version of Qwen2.5-1.5B. It achieves an average score of 58.34 on the OpenLLM benchmark (version 1), whereas the unquantized model achieves 58.48. This model was obtained by quantizing the weights of Qwen2.5-1.5B to INT8 data type. This optimization reduces the number of bits used to represent weights and activations from 16 to 8, reducing GPU memory requirements (by approximately 50%) and increasing matrix-multiply compute throughput (by approximately 2x). Weight quantization also reduces disk size requirements by approximately 50%. Only weights and activations of the linear operators within transformers blocks are quantized. Weights are quantized with a symmetric static per-channel scheme, where a fixed linear scaling factor is applied between INT8 and floating point representations for each output channel dimension. Activations are quantized with a symmetric dynamic per-token scheme, computing a linear scaling factor at runtime for each token between INT8 and floating point representations. This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details. The model was evaluated on the OpenLLM leaderboard tasks (version 1) with the lm-evaluation-harness (commit 383bbd54bc621086e05aa1b030d8d4d5635b25e6) and the vLLM engine, using the following command:
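The evaluation command itself is missing from this excerpt. An equivalent sketch through the lm-evaluation-harness Python API is given below for a single representative OpenLLM v1 task; the repository id, task choice, and engine arguments are assumptions, and the original card most likely used the lm_eval CLI with the vLLM backend.

```python
import lm_eval

# One representative OpenLLM v1 task; the full leaderboard covers several tasks,
# each with its own few-shot setting.
results = lm_eval.simple_evaluate(
    model="vllm",
    model_args="pretrained=RedHatAI/Qwen2.5-1.5B-FP8-dynamic,dtype=auto,max_model_len=4096",
    tasks=["arc_challenge"],
    num_fewshot=25,
    batch_size="auto",
)
print(results["results"]["arc_challenge"])
```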
Qwen2.5-3B-FP8-dynamic
Model Overview - Model Architecture: Qwen2 - Input: Text - Output: Text - Model Optimizations: - Activation quantization: INT8 - Weight quantization: INT8 - Intended Use Cases: Intended for commercial and research use in multiple languages. Similarly to Qwen2.5-3B, this model is intended for assistant-like chat. - Out-of-scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws). - Release Date: 11/27/2024 - Version: 1.0 - License(s): apache-2.0 - Model Developers: Neural Magic Quantized version of Qwen2.5-3B. It achieves an average score of 62.50 on the OpenLLM benchmark (version 1), whereas the unquantized model achieves 63.59. This model was obtained by quantizing the weights of Qwen2.5-3B to INT8 data type. This optimization reduces the number of bits used to represent weights and activations from 16 to 8, reducing GPU memory requirements (by approximately 50%) and increasing matrix-multiply compute throughput (by approximately 2x). Weight quantization also reduces disk size requirements by approximately 50%. Only weights and activations of the linear operators within transformers blocks are quantized. Weights are quantized with a symmetric static per-channel scheme, where a fixed linear scaling factor is applied between INT8 and floating point representations for each output channel dimension. Activations are quantized with a symmetric dynamic per-token scheme, computing a linear scaling factor at runtime for each token between INT8 and floating point representations. This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details. The model was evaluated on the OpenLLM leaderboard tasks (version 1) with the lm-evaluation-harness (commit 383bbd54bc621086e05aa1b030d8d4d5635b25e6) and the vLLM engine, using the following command:
Qwen2.5-72B-Instruct-FP8-dynamic
granite-3.1-8b-instruct-GGUF
Qwen2.5-3B-quantized.w4a16
Mixtral-8x7B-v0.1-quantized.w4a16
whisper-medium-quantized.w8a8
Model Overview - Model Architecture: whisper-medium - Input: Audio-Text - Output: Text - Model Optimizations: - Weight quantization: INT8 - Activation quantization: INT8 - Release Date: 04/16/2025 - Version: 1.0 - Model Developers: Neural Magic This model was obtained by quantizing the weights of openai/whisper-medium to INT8 data type, ready for inference with vLLM >= 0.5.2. This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details. This model was created with llm-compressor by running the code snippet below. The model was evaluated on LibriSpeech and Fleurs datasets using lmms-eval, via the following commands: Fleurs (X→en, WER) cmn_hans_cn 13.3371 12.6123 105.75%
SOLAR-10.7B-Instruct-v1.0-pruned50-quant-ds
Sparse-Llama-3.1-8B-ultrachat_200k-2of4-quantized.w4a16
Meta-Llama-3-70B-Instruct-quantized.w4a16
bge-base-en-v1.5-sparse
SmolLM-135M-Instruct-quantized.w8a8
Sparse-Llama-3.1-8B-gsm8k-2of4-FP8-dynamic
DeepSeek-V3-BF16
whisper-large-v2-quantized.w4a16
Model Overview - Model Architecture: whisper-large-v2 - Input: Audio-Text - Output: Text - Model Optimizations: - Weight quantization: INT4 - Release Date: 04/16/2025 - Version: 1.0 - Model Developers: Neural Magic This model was obtained by quantizing the weights of openai/whisper-large-v2 to INT4 data type, ready for inference with vLLM >= 0.5.2. This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details. This model was created with llm-compressor by running the code snippet below. The model was evaluated on LibriSpeech and Fleurs datasets using lmms-eval, via the following commands: Fleurs (X→en, WER) cmn_hans_cn 15.2148 23.5763 64.53%
oBERT-12-downstream-pruned-unstructured-97-squadv1
oBERT-3-upstream-pretrained-dense
oBERT-12-upstream-pruned-unstructured-90-finetuned-qqp
oBERT-3-downstream-pruned-block4-80-squadv1
oBERT-6-downstream-pruned-block4-80-QAT-squadv1
bge-large-en-v1.5-dense
zephyr-7b-beta-pruned50-quant-ds
Nous-Hermes-2-Yi-34B-pruned2.4
Nous-Hermes-2-Yi-34B-pruned50
llama2.c-stories110M-pruned2.4
Llama-2-7b-cnn-daily-mail-pruned_70-quantized-deepsparse
SparseLLama-2-7b-ultrachat_200k-pruned_50.2of4
SparseLlama-2-7b-evolcodealpaca-pruned_50.2of4
Llama-2-7b-chat-quantized.w4a16
Meta-Llama-3-70B-Instruct-quantized.w8a8
gemma-2-27b-it-quantized.w8a16
SmolLM-135M-Instruct-quantized.w8a16
SmolLM-360M-Instruct-quantized.w8a8
Qwen2.5-72B-quantized.w8a8
Qwen2.5-32B-quantized.w8a16
granite-3.1-2b-base-FP8-dynamic
Model Overview - Model Architecture: granite-3.1-2b-base - Input: Text - Output: Text - Model Optimizations: - Weight quantization: FP8 - Activation quantization: FP8 - Release Date: 1/8/2025 - Version: 1.0 - Model Developers: Neural Magic Quantized version of ibm-granite/granite-3.1-2b-base. It achieves an average score of 57.37 on the OpenLLM benchmark (version 1), whereas the unquantized model achieves 57.65. This model was obtained by quantizing the weights and activations of ibm-granite/granite-3.1-2b-base to FP8 data type, ready for inference with vLLM >= 0.5.2. This optimization reduces the number of bits per parameter from 16 to 8, reducing the disk size and GPU memory requirements by approximately 50%. Only the weights and activations of the linear operators within transformers blocks are quantized. This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details. This model was created with llm-compressor by running the code snippet below. The model was evaluated on OpenLLM Leaderboard V1, OpenLLM Leaderboard V2 and on HumanEval, using the following commands: Category Metric ibm-granite/granite-3.1-2b-base neuralmagic/granite-3.1-2b-base-FP8-dynamic Recovery (%) OpenLLM V1 ARC-Challenge (Acc-Norm, 25-shot) 53.75 53.50 99.54 This model achieves up to 1.2x speedup in single-stream deployment on L40 GPUs. The following performance benchmarks were conducted with vLLM version 0.6.6.post1, and GuideLLM. Single-stream performance (measured with vLLM version 0.6.6.post1) GPU class Model Speedup Code Completion prefill: 256 tokens decode: 1024 tokens Docstring Generation prefill: 768 tokens decode: 128 tokens Code Fixing prefill: 1024 tokens decode: 1024 tokens RAG prefill: 1024 tokens decode: 128 tokens Instruction Following prefill: 256 tokens decode: 128 tokens Multi-turn Chat prefill: 512 tokens decode: 256 tokens Large Summarization prefill: 4096 tokens decode: 512 tokens granite-3.1-2b-base-FP8-dynamic (this model) 1.26 7.3 0.9 7.4 1.0 0.9 1.8 4.1 granite-3.1-2b-base-quantized.w4a16 1.88 4.8 0.6 4.9 0.6 0.6 1.2 2.8
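The creation snippet referenced in this card is not reproduced here. A data-free FP8-dynamic sketch with llm-compressor is shown below; import paths vary slightly between llm-compressor releases.

```python
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ibm-granite/granite-3.1-2b-base"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# FP8 weights with dynamic per-token FP8 activations on linear layers;
# this scheme is data-free, so no calibration set is required.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])
oneshot(model=model, recipe=recipe)

save_dir = "granite-3.1-2b-base-FP8-dynamic"
model.save_pretrained(save_dir)
tokenizer.save_pretrained(save_dir)
```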
Llama-3.1-70B-Instruct-NVFP4A16
watt-tool-8B-FP8-dynamic
mpt-7b-chat-pruned50-quant-ds
Mixtral-8x22B-Instruct-v0.1-AutoFP8
mobilebert-uncased-finetuned-squadv1
OpenHermes-2.5-Mistral-7B-pruned50
Llama-2-7b-dolphin-open_platypus-pruned_70-quantized-deepsparse
Meta-Llama-3.1-8B-quantized.w8a16
Meta-Llama-3.1-405B-Instruct-quantized.w8a16
Qwen2.5-7B-quantized.w8a16
Sparse-Llama-3.1-8B-gsm8k-2of4
oBERT-12-downstream-pruned-unstructured-90-squadv1
oBERT-12-upstream-pruned-unstructured-90-finetuned-squadv1
oBERT-12-downstream-pruned-block4-80-squadv1
oBERT-12-downstream-pruned-block4-90-squadv1
oBERT-3-downstream-pruned-unstructured-80-squadv1
oBERT-6-downstream-dense-QAT-squadv1
oBERT-6-downstream-pruned-block4-90-QAT-squadv1
oBERT-12-upstream-pruned-unstructured-97-v2
oBERT-12-upstream-pruned-unstructured-97-finetuned-mnli-v2
mpt-7b-gsm8k-pruned75-quant-ds
Llama-2-7b-evol-code-alpaca-pruned_70
Llama-2-7b-evol-code-alpaca-pruned_50-quantized-deepsparse
Llama-2-7b-dolphin-open_platypus-pruned_50
Llama-2-7b-cnn-daily-mail-pruned_50-quantized-deepsparse
Phi-3-mini-128k-instruct-quantized.w8a16
starcoder2-7b-quantized.w8a8
Phi-3-small-128k-instruct-quantized.w8a16
Qwen2.5-72B-quantized.w8a16
granite-3.0-8b-instruct-GGUF
granite-3.0-2b-instruct-GGUF
Sparse-Llama-3.1-8B-gsm8k-2of4-quantized.w4a16
Model Overview - Model Architecture: Llama-3.1-8B - Input: Text - Output: Text - Model Optimizations: - Sparsity: 2:4 - Weight quantization: INT4 - Release Date: 11/21/2024 - Version: 1.0 - License(s): llama3.1 - Model Developers: Neural Magic This is an AI model specialized in grade-school math obtained by fine-tuning the 2:4 sparse Sparse-Llama-3.1-8B-2of4 on the GSM8k dataset, followed by one-shot quantization. It achieves 64.3% 0-shot accuracy on the test set of GSM8k, compared to 66.3% for the fine-tuned dense model Llama-3.1-8B-gsm8k — demonstrating over 96.9% accuracy recovery. In contrast, the pretrained Llama-3.1-8B achieves 50.7% 5-shot accuracy and the sparse foundational Sparse-Llama-3.1-8B-2of4 model achieves 56.3% 5-shot accuracy. This model was obtained by quantizing the weights of Sparse-Llama-3.1-8B-gsm8k-2of4 to INT4 data type. This optimization reduces the number of bits per parameter from 16 to 4, reducing the disk size and GPU memory requirements by approximately 75%. That is on top of the reduction of 50% of weights via 2:4 pruning employed on Sparse-Llama-3.1-8B-gsm8k-2of4. Only the weights of the linear operators within transformers blocks are quantized. Symmetric per-channel quantization is applied, in which a linear scaling per output dimension maps the INT4 and floating point representations of the quantized weights. The GPTQ algorithm is applied for quantization, as implemented in the llm-compressor library. This model can be deployed efficiently using the vLLM backend. vLLM also supports OpenAI-compatible serving. See the documentation for more details. This model was evaluated on the lm-evaluation-harness. Metric Llama-3.1-8B (5-shot) Sparse-Llama-3.1-8B-2of4 (5-shot) Llama-3.1-8B-gsm8k (0-shot) Sparse-Llama-3.1-8B-gsm8k-2of4 (0-shot) Sparse-Llama-3.1-8B-gsm8k-2of4-quantized.w4a16 (0-shot)
Llama-3.1-8B-evolcodealpaca
Qwen2.5-0.5B-FP8-dynamic
Qwen2.5-Coder-32B-Instruct-FP8-dynamic
Phi-4-mini-instruct-quantized.w8a8
Llama2-7b-chat-pruned50-quant-ds
Nous-Hermes-2-SOLAR-10.7B-pruned50-quant-ds
OpenHermes-2.5-Mistral-7B-pruned50-quant-ds
Phi-3-medium-128k-instruct-quantized.w8a16
mpt-7b-gsm8k-pt
mpt-7b-gsm8k-quant-ds
MiniChat-1.5-3B-pruned50-quant-ds
phi-2-super-marlin
Qwen2.5-3B-quantized.w8a8
whisper-large-v2-W4A16-G128
Model Overview - Model Architecture: whisper-large-v2 - Input: Audio-Text - Output: Text - Model Optimizations: - Weight quantization: INT4 - Activation quantization: FP16 - Release Date: 1/31/2025 - Version: 1.0 - Model Developers: Neural Magic This model was obtained by quantizing the weights of openai/whisper-large-v2 to INT4 data type, ready for inference with vLLM >= 0.5.2. This model can be deployed efficiently using the vLLM backend, as shown in the example below. vLLM also supports OpenAI-compatible serving. See the documentation for more details. This model was created with llm-compressor by running the code snippet below as part a multimodal announcement blog. BibTeX entry and citation info ```bibtex @misc{radford2022whisper, doi = {10.48550/ARXIV.2212.04356}, url = {https://arxiv.org/abs/2212.04356}, author = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya}, title = {Robust Speech Recognition via Large-Scale Weak Supervision}, publisher = {arXiv}, year = {2022}, copyright = {arXiv.org perpetual, non-exclusive license} }