pytorch

21 models

gemma-3-12b-it-AWQ-INT4

license:apache-2.0
3,783
0

Phi-4-mini-instruct-AWQ-INT4

This repository hosts the Phi-4-mini-instruct model quantized with torchao using int4 weight-only quantization and the AWQ algorithm. This work is brought to you by the PyTorch team. The model can be used directly, or served with vLLM, for a 56% VRAM reduction (3.95 GB needed) and a 1.17x speedup on H100 GPUs. The model is calibrated with 2 samples from the `mmlu_pro` task to recover accuracy on `mmlu_pro` specifically: accuracy on `mmlu_pro` recovers from 36.98 (plain INT4 checkpoint) to 43.13, while the bfloat16 baseline is 46.43.

Inference with vLLM

Install vLLM nightly and torchao nightly to get some recent changes, then serve with vLLM. Note: please use `VLLM_DISABLE_COMPILE_CACHE=1` to disable the compile cache when running this code, e.g. `VLLM_DISABLE_COMPILE_CACHE=1 python example.py`; there are some issues with the composability of compile in vLLM and torchao that are expected to be resolved in PyTorch 2.8. Use a token with write access from https://huggingface.co/settings/tokens.

Model Quality

We rely on lm-evaluation-harness to evaluate the quality of the quantized model. Here we only run on mmlu as a sanity check.
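A sketch of the install and serve steps described above (the nightly index URLs and flags are illustrative assumptions; check the vLLM and torchao docs for current nightly install instructions):

```shell
# Install vLLM nightly and torchao nightly (index URLs are illustrative)
pip install --pre vllm --extra-index-url https://wheels.vllm.ai/nightly
pip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/cu126

# Serve the quantized checkpoint; disable the compile cache as noted above
VLLM_DISABLE_COMPILE_CACHE=1 vllm serve pytorch/Phi-4-mini-instruct-AWQ-INT4
```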
Since the checkpoint is tuned on `mmlu_pro`, we check accuracy on `mmlu_pro`:

| Benchmark | microsoft/Phi-4-mini-instruct | pytorch/Phi-4-mini-instruct-INT4 | pytorch/Phi-4-mini-instruct-AWQ-INT4 |
|---|---|---|---|
| mmlu_pro | 46.43 | 36.98 | 43.13 |

Need to install lm-eval from source: https://github.com/EleutherAI/lm-evaluation-harness#install

Peak Memory Usage

| Benchmark | microsoft/Phi-4-mini-instruct | jerryzh168/Phi-4-mini-instruct-INT4 | pytorch/Phi-4-mini-instruct-AWQ-INT4 |
|---|---|---|---|
| Peak Memory (GB) | 8.91 | 3.02 (66% reduction) | 3.95 (55.67% reduction) |

We can use the following code to get a sense of peak memory usage during inference.

Results (H100 machine)

| Benchmark (Latency) | microsoft/Phi-4-mini-instruct | jerryzh168/Phi-4-mini-instruct-INT4 | pytorch/Phi-4-mini-instruct-AWQ-INT4 |
|---|---|---|---|
| latency (batch size=1) | 1.61s | 1.33s (1.21x speedup) | 1.37s (1.17x speedup) |
| latency (batch size=256) | 5.31s | 5.38s (0.99x speedup) | 5.44s (0.98x speedup) |

Note: the AWQ-INT4 checkpoint is expected to be slower at batch size 256, since the workload becomes compute bound rather than memory bound at larger batch sizes, and int4 weight-only quantization only speeds up memory-bound workloads. Note: we compare against jerryzh168/Phi-4-mini-instruct-INT4, an H100 checkpoint, since AWQ-INT4 uses the new INT4 config optimized for H100 that doesn't regress performance at batch size 256.
It's possible to generate AWQ-INT4 for A100 as well using `Int4WeightOnlyConfig(group_size=128, int4_packing_format="tile_packed_to_4d", int4_choose_qparams_algorithm="hqq")`.

Paper: TorchAO: PyTorch-Native Training-to-Serving Model Optimization

The model's quantization is powered by TorchAO, a framework presented in the paper "TorchAO: PyTorch-Native Training-to-Serving Model Optimization".

Abstract: We present TorchAO, a PyTorch-native model optimization framework leveraging quantization and sparsity to provide an end-to-end, training-to-serving workflow for AI models. TorchAO supports a variety of popular model optimization techniques, including FP8 quantized training, quantization-aware training (QAT), post-training quantization (PTQ), and 2:4 sparsity, and leverages a novel tensor subclass abstraction to represent a variety of widely-used, backend-agnostic low-precision data types, including INT4, INT8, FP8, MXFP4, MXFP6, and MXFP8. TorchAO integrates closely with the broader ecosystem at each step of the model optimization pipeline, from pre-training (TorchTitan) to fine-tuning (TorchTune, Axolotl) to serving (HuggingFace, vLLM, SGLang, ExecuTorch), connecting an otherwise fragmented space in a single, unified workflow. TorchAO has enabled recent launches of the quantized Llama 3.2 1B/3B and LlamaGuard3-8B models and is open-source at https://github.com/pytorch/ao.

Resources

Official TorchAO GitHub Repository: https://github.com/pytorch/ao
TorchAO Documentation: https://docs.pytorch.org/ao/stable/index.html

Disclaimer

PyTorch has not performed safety evaluations or red teamed the quantized models. Performance characteristics, outputs, and behaviors may differ from the original models. Users are solely responsible for selecting appropriate use cases, evaluating and mitigating for accuracy, safety, and fairness, ensuring security, and complying with all applicable laws and regulations.
Nothing contained in this Model Card should be interpreted as or deemed a restriction or modification to the licenses the models are released under, including any limitations of liability or disclaimers of warranties provided therein.
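The AWQ calibration described above works by grid-searching per-channel scales that minimize the quantized layer's *output* error on calibration samples: channels with large activations get scaled up before int4 quantization, and the inverse scale is folded into the activations. A toy pure-Python sketch of that idea (this is an illustration, not torchao's implementation; the weights, activations, and search grid below are made up):

```python
def int4_quant_dequant(w):
    """Symmetric int4 (levels -8..7) quantize/dequantize of a weight vector."""
    s = max(abs(v) for v in w) / 7 or 1.0
    return [max(-8, min(7, round(v / s))) * s for v in w]

def output_error(w, x_rows, k):
    """Mean |y - y_hat| when per-channel scales k are folded into w and x."""
    wq = int4_quant_dequant([wi * ki for wi, ki in zip(w, k)])
    err = 0.0
    for x in x_rows:
        y = sum(wi * xi for wi, xi in zip(w, x))
        y_hat = sum(wqi * xi / ki for wqi, xi, ki in zip(wq, x, k))
        err += abs(y - y_hat)
    return err / len(x_rows)

w = [0.17, 0.31, -0.23, 0.09, 0.26, -0.11]
# Calibration activations: channel 0 is "salient" (large magnitude)
x_rows = [[8.0, 1.0, -1.0, 0.5, 1.0, -0.5],
          [7.5, -0.8, 1.2, 0.4, -1.1, 0.6]]

baseline = output_error(w, x_rows, [1.0] * len(w))
# Grid-search a scale for the salient channel (AWQ searches scales per
# channel group from activation statistics; this is a simplification)
best_k, best_err = 1.0, baseline
for k0 in [1.0, 1.25, 1.5, 1.75, 2.0, 2.5, 3.0]:
    k = [k0] + [1.0] * (len(w) - 1)
    e = output_error(w, x_rows, k)
    if e < best_err:
        best_k, best_err = k0, e

print(f"baseline error {baseline:.4f}, best k={best_k}, error {best_err:.4f}")
assert best_err <= baseline  # the search can only improve output error
```

Since `k=1` is in the grid, the searched scale never does worse than plain int4 on the calibration data, which is why the card above reports AWQ-INT4 recovering accuracy relative to the plain INT4 checkpoint.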

license:bsd-3-clause
3,462
3

gemma-3-27b-it-AWQ-INT4

- Developed by: pytorch
- License: apache-2.0
- Quantized from model: google/gemma-3-27b-it
- Quantization method: AWQ-INT4
- Terms of Use: [Terms][terms]

Calibrated with 30 samples of `mmlu_philosophy`; eval accuracy is 80.06, while gemma-3-27b-it-INT4 gets 77.17 and the bfloat16 baseline is 79.42.

Inference with vLLM

Install vLLM nightly and torchao nightly to get some recent changes, then serve with vLLM. Note: use `VLLM_DISABLE_COMPILE_CACHE=1` to disable the compile cache (e.g. `VLLM_DISABLE_COMPILE_CACHE=1 python example.py`), since there are some issues with the composability of compile in vLLM and torchao; this is expected to be resolved in PyTorch 2.8. Use a token with write access from https://huggingface.co/settings/tokens.

Model Quality

We rely on lm-evaluation-harness to evaluate the quality of the quantized model. Here we only run on mmlu as a sanity check.

| Benchmark | google/gemma-3-27b-it | jerryzh168/gemma-3-27b-it-INT4 | pytorch/gemma-3-27b-it-AWQ-INT4 |
|---|---|---|---|
| philosophy | 79.42 | 77.17 | 80.06 |

Note: jerryzh168/gemma-3-27b-it-INT4 is the H100-optimized INT4 checkpoint.

Need to install lm-eval from source: https://github.com/EleutherAI/lm-evaluation-harness#install

Peak Memory Usage

| Benchmark | google/gemma-3-27b-it | jerryzh168/gemma-3-27b-it-INT4 | pytorch/gemma-3-27b-it-AWQ-INT4 |
|---|---|---|---|
| Peak Memory (GB) | 55.02 | 19.93 (64% reduction) | 27.66 (50% reduction) |

We can use the following code to get a sense of peak memory usage during inference.

Results (H100 machine)

| Benchmark (Latency) | google/gemma-3-27b-it | jerryzh168/gemma-3-27b-it-INT4 | pytorch/gemma-3-27b-it-AWQ-INT4 |
|---|---|---|---|
| latency (batch size=1) | 7.44s | 4.81s (1.55x speedup) | 4.87s (1.53x speedup) |
| latency (batch size=256) | 40.30s | 27.43s (1.47x speedup) | 27.89s (1.44x speedup) |

Paper: TorchAO: PyTorch-Native Training-to-Serving Model Optimization. The model's quantization is powered by TorchAO (https://github.com/pytorch/ao); the paper abstract, resources, and disclaimer are identical to those in the Phi-4-mini-instruct-AWQ-INT4 card above.
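The headline figures in the tables above are simple ratios against the bf16 baseline; recomputing them from this card's raw numbers:

```python
# Recompute this card's memory-reduction and speedup figures
peak_bf16, peak_awq = 55.02, 27.66   # Peak Memory (GB)
lat_bf16, lat_awq = 7.44, 4.87       # latency in seconds, batch size 1

reduction = (1 - peak_awq / peak_bf16) * 100
speedup = lat_bf16 / lat_awq
print(f"{reduction:.0f}% memory reduction, {speedup:.2f}x speedup")
assert round(reduction) == 50
assert round(speedup, 2) == 1.53
```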

license:apache-2.0
3,215
2

gemma-3-27b-it-INT4

- Developed by: pytorch
- License: apache-2.0
- Quantized from model: google/gemma-3-27b-it
- Quantization method: INT4
- Terms of Use: [Terms][terms]

Inference with vLLM

Install vLLM nightly and torchao nightly to get some recent changes, then serve with vLLM. Note: use `VLLM_DISABLE_COMPILE_CACHE=1` to disable the compile cache (e.g. `VLLM_DISABLE_COMPILE_CACHE=1 python example.py`), since there are some issues with the composability of compile in vLLM and torchao; this is expected to be resolved in PyTorch 2.8. Use a token with write access from https://huggingface.co/settings/tokens.

Model Quality

We rely on lm-evaluation-harness to evaluate the quality of the quantized model. Here we only run on mmlu as a sanity check.

| Benchmark | google/gemma-3-27b-it | pytorch/gemma-3-27b-it-INT4 |
|---|---|---|
| mmlu | 76.48 | 75.15 |
| chartqa (multimodal) | 51.80 | 51.92 |

Language eval: need to install lm-eval from source: https://github.com/EleutherAI/lm-evaluation-harness#install
Multi-modal eval: need to install lmms-eval from source: `pip install git+https://github.com/EvolvingLMMs-Lab/lmms-eval.git`

Peak Memory Usage

| Benchmark | google/gemma-3-27b-it | pytorch/gemma-3-27b-it-INT4 |
|---|---|---|
| Peak Memory (GB) | 55.01 | 18.62 (66% reduction) |

We can use the following code to get a sense of peak memory usage during inference.

Results (A100 machine)

| Benchmark (Latency) | google/gemma-3-27b-it | pytorch/gemma-3-27b-it-INT4 |
|---|---|---|
| latency (batch size=1) | 7.46s | 3.90s (1.91x speedup) |

Paper: TorchAO: PyTorch-Native Training-to-Serving Model Optimization. The model's quantization is powered by TorchAO (https://github.com/pytorch/ao); the paper abstract, resources, and disclaimer are identical to those in the Phi-4-mini-instruct-AWQ-INT4 card above.
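With lm-eval installed from source as described above, the mmlu sanity check typically looks like the following (model arguments and batch size are illustrative):

```shell
# Language eval with lm-evaluation-harness (installed from source)
lm_eval --model hf \
  --model_args pretrained=pytorch/gemma-3-27b-it-INT4 \
  --tasks mmlu \
  --batch_size 8
```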

license:apache-2.0
2,613
0

Phi-4-mini-instruct-INT8-INT4

license:mit
1,875
2

Qwen3-4B-INT8-INT4

license:mit
1,769
2

gemma-3-12b-it-INT4

- Developed by: pytorch
- License: apache-2.0
- Quantized from model: google/gemma-3-12b-it
- Quantization method: INT4
- Terms of Use: [Terms][terms]

Inference with vLLM

Install vLLM nightly and torchao nightly to get some recent changes, then serve with vLLM. Note: use `VLLM_DISABLE_COMPILE_CACHE=1` to disable the compile cache (e.g. `VLLM_DISABLE_COMPILE_CACHE=1 python example.py`), since there are some issues with the composability of compile in vLLM and torchao; this is expected to be resolved in PyTorch 2.8. Use a token with write access from https://huggingface.co/settings/tokens.

Model Quality

We rely on lm-evaluation-harness to evaluate the quality of the quantized model. Here we only run on mmlu as a sanity check.

| Benchmark | google/gemma-3-12b-it | pytorch/gemma-3-12b-it-INT4 |
|---|---|---|
| mmlu | 71.51 | 68.96 |
| chartqa (multimodal) | 55.80 | 56.28 |

Language eval: need to install lm-eval from source: https://github.com/EleutherAI/lm-evaluation-harness#install
Multi-modal eval: need to install lmms-eval from source: `pip install git+https://github.com/EvolvingLMMs-Lab/lmms-eval.git`

Peak Memory Usage

| Benchmark | google/gemma-3-12b-it | pytorch/gemma-3-12b-it-INT4 |
|---|---|---|
| Peak Memory (GB) | 24.50 | 8.68 (65% reduction) |

We can use the following code to get a sense of peak memory usage during inference.

Results (A100 machine)

| Benchmark (Latency) | google/gemma-3-12b-it | pytorch/gemma-3-12b-it-INT4 |
|---|---|---|
| latency (batch size=1) | 3.73s | 2.16s (1.73x speedup) |

Paper: TorchAO: PyTorch-Native Training-to-Serving Model Optimization. The model's quantization is powered by TorchAO (https://github.com/pytorch/ao); the paper abstract, resources, and disclaimer are identical to those in the Phi-4-mini-instruct-AWQ-INT4 card above.
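The memory reduction above comes from int4 weight-only quantization storing each group of weights as 4-bit integer codes plus one scale per group. A minimal pure-Python sketch of group-wise symmetric int4 quantize/dequantize (torchao's kernels pack these codes into efficient GPU layouts; this only shows the arithmetic):

```python
def quantize_int4_groupwise(weights, group_size=4):
    """Symmetric int4: one scale per group, integer codes in [-8, 7]."""
    codes, scales = [], []
    for i in range(0, len(weights), group_size):
        group = weights[i:i + group_size]
        scale = max(abs(w) for w in group) / 7 or 1.0
        scales.append(scale)
        codes.extend(max(-8, min(7, round(w / scale))) for w in group)
    return codes, scales

def dequantize_int4_groupwise(codes, scales, group_size=4):
    return [c * scales[i // group_size] for i, c in enumerate(codes)]

w = [0.12, -0.55, 0.30, 0.02, 1.4, -0.7, 0.9, -0.1]
codes, scales = quantize_int4_groupwise(w)
w_hat = dequantize_int4_groupwise(codes, scales)

assert all(-8 <= c <= 7 for c in codes)
# Round-to-nearest bounds the error by half a quantization step per group
for i, (a, b) in enumerate(zip(w, w_hat)):
    assert abs(a - b) <= scales[i // 4] / 2 + 1e-12
```

Each code fits in 4 bits versus 16 for bf16, so weight memory drops roughly 4x (less the per-group scale overhead), matching the ~65% peak-memory reduction reported above once activations and KV cache are included.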

license:apache-2.0
1,019
0

Phi-4-mini-instruct-FP8

license:mit
892
1

gemma-3-27b-it-FP8

- Developed by: pytorch
- License: apache-2.0
- Quantized from model: google/gemma-3-27b-it
- Quantization method: FP8
- Terms of Use: [Terms][terms]

Inference with vLLM

Install vLLM nightly and torchao nightly to get some recent changes, then serve with vLLM. Note: use `VLLM_DISABLE_COMPILE_CACHE=1` to disable the compile cache (e.g. `VLLM_DISABLE_COMPILE_CACHE=1 python example.py`), since there are some issues with the composability of compile in vLLM and torchao; this is expected to be resolved in PyTorch 2.8. Use a token with write access from https://huggingface.co/settings/tokens.

Model Quality

We rely on lm-evaluation-harness to evaluate the quality of the quantized model. Here we only run on mmlu as a sanity check.

| Benchmark | google/gemma-3-27b-it | pytorch/gemma-3-27b-it-FP8 |
|---|---|---|
| mmlu | 76.48 | 76.20 |
| chartqa (multimodal) | 51.80 | 51.32 |

Language eval: need to install lm-eval from source: https://github.com/EleutherAI/lm-evaluation-harness#install
Multi-modal eval: need to install lmms-eval from source: `pip install git+https://github.com/EvolvingLMMs-Lab/lmms-eval.git`

Peak Memory Usage

| Benchmark | google/gemma-3-27b-it | pytorch/gemma-3-27b-it-FP8 |
|---|---|---|
| Peak Memory (GB) | 55.01 | 32.09 (42% reduction) |

We can use the following code to get a sense of peak memory usage during inference.

Results (H100 machine)

| Benchmark (Latency) | google/gemma-3-27b-it | pytorch/gemma-3-27b-it-FP8 |
|---|---|---|
| latency (batch size=1) | 7.46s | 4.92s (1.52x speedup) |
| latency (batch size=256) | 39.55s | 24.14s (1.64x speedup) |

Paper: TorchAO: PyTorch-Native Training-to-Serving Model Optimization. The model's quantization is powered by TorchAO (https://github.com/pytorch/ao); the paper abstract, resources, and disclaimer are identical to those in the Phi-4-mini-instruct-AWQ-INT4 card above.
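FP8 weights halve the memory of bf16 by storing each value in 8 bits; torchao commonly uses the float8 e4m3 format (1 sign bit, 4 exponent bits with bias 7, 3 mantissa bits, max finite value 448). A pure-Python sketch of rounding a number to the nearest e4m3-representable value, assuming the e4m3fn variant (real kernels use `torch.float8_e4m3fn`; this is only an illustration of the format):

```python
import math

def round_to_e4m3(x):
    """Round x to the nearest float8 e4m3fn value: 4 exponent bits
    (bias 7), 3 mantissa bits, max finite 448, saturating (no inf)."""
    if x == 0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = abs(x)
    if mag >= 448:
        return sign * 448.0  # saturate: e4m3fn has no infinity
    # Exponent of the containing binade, clamped to the subnormal range
    e = max(math.floor(math.log2(mag)), -6)
    step = 2.0 ** (e - 3)  # 3 mantissa bits => 8 steps per binade
    # Python's round() is round-half-to-even, matching IEEE rounding
    return sign * round(mag / step) * step

for x in [0.3, 0.1, 500.0, -3.1]:
    print(f"{x} -> {round_to_e4m3(x)}")
```

Because only 3 mantissa bits survive (about two significant decimal digits), FP8 trades a small accuracy loss (76.48 vs 76.20 mmlu above) for ~42% peak-memory reduction.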

license:apache-2.0
801
3

Qwen3-8B-QAT-INT4

license:apache-2.0
422
1

gemma-3-12b-it-FP8

- Developed by: pytorch
- License: apache-2.0
- Quantized from model: google/gemma-3-12b-it
- Quantization method: FP8
- Terms of Use: [Terms][terms]

Inference with vLLM

Install vLLM nightly and torchao nightly to get some recent changes, then serve with vLLM. Note: use `VLLM_DISABLE_COMPILE_CACHE=1` to disable the compile cache (e.g. `VLLM_DISABLE_COMPILE_CACHE=1 python example.py`), since there are some issues with the composability of compile in vLLM and torchao; this is expected to be resolved in PyTorch 2.8. Use a token with write access from https://huggingface.co/settings/tokens.

Model Quality

We rely on lm-evaluation-harness to evaluate the quality of the quantized model. Here we only run on mmlu as a sanity check. We also rely on lmms-eval for multi-modal quality eval; we only tested on chartqa as a sanity check.

| Benchmark | google/gemma-3-12b-it | pytorch/gemma-3-12b-it-FP8 |
|---|---|---|
| mmlu | 71.51 | 71.30 |
| chartqa (multimodal) | 55.80 | 55.36 |

Language eval: need to install lm-eval from source: https://github.com/EleutherAI/lm-evaluation-harness#install
Multi-modal eval: need to install lmms-eval from source: `pip install git+https://github.com/EvolvingLMMs-Lab/lmms-eval.git`

Peak Memory Usage

| Benchmark | google/gemma-3-12b-it | pytorch/gemma-3-12b-it-FP8 |
|---|---|---|
| Peak Memory (GB) | 24.50 | 15.47 (37% reduction) |

We can use the following code to get a sense of peak memory usage during inference.

Results (A100 machine)

| Benchmark (Latency) | google/gemma-3-12b-it | pytorch/gemma-3-12b-it-FP8 |
|---|---|---|
| latency (batch size=1) | 3.73s | 2.76s (1.35x speedup) |
| latency (batch size=256) | 13.63s | 11.49s (1.19x speedup) |

Paper: TorchAO: PyTorch-Native Training-to-Serving Model Optimization. The model's quantization is powered by TorchAO (https://github.com/pytorch/ao); the paper abstract, resources, and disclaimer are identical to those in the Phi-4-mini-instruct-AWQ-INT4 card above.
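With lmms-eval installed from source as described above, a chartqa sanity check typically looks like the following (the `--model` backend name and other arguments are assumptions; check lmms-eval's documentation for the supported model types):

```shell
# Multi-modal eval with lmms-eval (installed from source);
# arguments are illustrative
python -m lmms_eval --model hf \
  --model_args pretrained=pytorch/gemma-3-12b-it-FP8 \
  --tasks chartqa \
  --batch_size 1
```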

license:apache-2.0
380
0

gemma-3-4b-it-HQQ-INT8-INT4

- Developed by: pytorch
- License: apache-2.0
- Quantized from model: google/gemma-3-4b-it
- Quantization method: HQQ-INT8-INT4
- Terms of Use: [Terms][terms]

Gemma3-4B is quantized by the PyTorch team using torchao, with 8-bit embeddings and 8-bit dynamic activations with 4-bit weight linears (INT8-INT4). The model is suitable for mobile deployment with ExecuTorch, and we provide the quantized pte for direct use in ExecuTorch. (The provided pte file is exported with a `max_seq_length`/`max_context_length` of 1024; if you wish to change this, re-export the quantized model following the instructions in Exporting to ExecuTorch.) To run in a mobile app, download the quantized pte and tokenizer and follow the instructions here.

Model Quality

We rely on lm-evaluation-harness to evaluate the quality of the quantized model; the weights use int8 dynamic activation and int4 weight quantization with quantization parameters chosen by HQQ (HQQ-INT8-INT4).

| Benchmark | gemma-3-4b-it | pytorch/gemma-3-4b-it-HQQ-INT8-INT4 |
|---|---|---|
| mmlu | 57.68 | 55.65 |
| chartqa (multimodal) | 50.56 | 42.88 |

Language eval: need to install lm-eval from source: https://github.com/EleutherAI/lm-evaluation-harness#install
Multi-modal eval: need to install lmms-eval from source: `pip install git+https://github.com/EvolvingLMMs-Lab/lmms-eval.git`

Exporting to ExecuTorch

To export to ExecuTorch, we use optimum-executorch. First install ExecuTorch and optimum-executorch, then export the model to an ExecuTorch pte file and upload it to HuggingFace. The export targets the XNNPACK backend with a context length of 1024, but this can be adjusted.

Paper: TorchAO: PyTorch-Native Training-to-Serving Model Optimization. The model's quantization is powered by TorchAO (https://github.com/pytorch/ao); the paper abstract, resources, and disclaimer are identical to those in the Phi-4-mini-instruct-AWQ-INT4 card above.
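An ExecuTorch export with optimum-executorch typically looks like the following sketch (the flag names and output path are assumptions; check the optimum-executorch docs for the exact CLI):

```shell
pip install optimum-executorch

# Export the quantized model to an ExecuTorch .pte for the XNNPACK
# backend (flags are illustrative; see the optimum-executorch docs)
optimum-cli export executorch \
  --model pytorch/gemma-3-4b-it-HQQ-INT8-INT4 \
  --recipe xnnpack \
  --output_dir ./gemma-3-4b-it-pte
```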

license:apache-2.0
307
0

Qwen3-8B-INT4

license:apache-2.0
136
2

Qwen3-8B-AWQ-INT4

This repository hosts the Qwen3-8B model quantized with torchao using int4 weight-only quantization and the awq algorithm. This work is brought to you by the PyTorch team. This model can be used directly or served using vLLM for 53% VRAM reduction (7.82 GB needed) and 1.34x speedup on H100 GPUs for batch size 1. The model is calibrated with 10 samples from `mmluabstractalgebra` task to recover the accuracy for `mmluabstractalgebra` specifically. AWQ-INT4 improves the accuracy of `mmluabstractalgebra` of INT4 from 55 to 56, while the bfloat16 baseline is 58. Inference with vLLM Install vllm nightly and torchao nightly to get some recent changes: Serving Then we can serve with the following command: Note: please use `VLLMDISABLECOMPILECACHE=1` to disable compile cache when running this code, e.g. `VLLMDISABLECOMPILECACHE=1 python example.py`, since there are some issues with the composability of compile in vLLM and torchao, this is expected be resolved in pytorch 2.8. and use a token with write access, from https://huggingface.co/settings/tokens Model Quality We rely on lm-evaluation-harness to evaluate the quality of the quantized model. | Benchmark | | | | |----------------------------------|----------------|------------------------|---------------------------| | | Qwen/Qwen3-8B | jerryzh168/Qwen3-8B-INT4-skiplmhead | pytorch/Qwen3-8B-AWQ-INT4 | | mmluabstractalgebra | 58 | 55 | 56 | Note that we only calibrate on a single `mmluabstractalgebra` task instead of the entire `mmlu` task since `mmlu` contains many different types of tasks and calibrating on all of them does not necessarily improve the accuracy for all the tasks, since it's harder to faithfully represent the distribution of data from all types of tasks with a selected small calibration sample data. Note: we skipped quantization for `lmhead` because in transformers lmhead is a `Linear` but in vllm lmhead becomes ParallelLMHead and the linear weight no longer works there. 
Need to install lm-eval from source: https://github.com/EleutherAI/lm-evaluation-harness#install | Benchmark | | | | |------------------|----------------|--------------------------------|--------------------------------| | | Qwen/Qwen3-8B | jerryzh168/Qwen3-8B-INT4-skiplmhead | pytorch/Qwen3-8B-AWQ-INT4 | | Peak Memory (GB) | 16.47 | 7.82 (53% reduction) | 7.82 (53% reduction) | We can use the following code to get a sense of peak memory usage during inference: Results (H100 machine) | Benchmark (Latency) | | | | |----------------------------------|----------------|---------------------------|---------------------------| | | Qwen/Qwen3-8B | jerryzh168/Qwen3-8B-INT4-skiplmhead | pytorch/Qwen3-8B-AWQ-INT4 | | latency (batchsize=1) | 2.46s | 1.40s (1.76x speedup) | 1.83s (1.34x speedup) | We benchmarked the throughput in a serving environment. Other datasets can be found in: https://github.com/vllm-project/vllm/tree/main/benchmarks Note: you can change the number of prompts to be benchmarked with `--num-prompts` argument for `benchmarkserving` script. Paper: TorchAO: PyTorch-Native Training-to-Serving Model Optimization The model's quantization is powered by TorchAO, a framework presented in the paper TorchAO: PyTorch-Native Training-to-Serving Model Optimization. Abstract: We present TorchAO, a PyTorch-native model optimization framework leveraging quantization and sparsity to provide an end-to-end, training-to-serving workflow for AI models. TorchAO supports a variety of popular model optimization techniques, including FP8 quantized training, quantization-aware training (QAT), post-training quantization (PTQ), and 2:4 sparsity, and leverages a novel tensor subclass abstraction to represent a variety of widely-used, backend agnostic low precision data types, including INT4, INT8, FP8, MXFP4, MXFP6, and MXFP8. 
TorchAO integrates closely with the broader ecosystem at each step of the model optimization pipeline, from pre-training (TorchTitan) to fine-tuning (TorchTune, Axolotl) to serving (HuggingFace, vLLM, SGLang, ExecuTorch), connecting an otherwise fragmented space in a single, unified workflow. TorchAO has enabled recent launches of the quantized Llama 3.2 1B/3B and LlamaGuard3-8B models and is open-source at this https URL.

## Resources

- Official TorchAO GitHub Repository: https://github.com/pytorch/ao
- TorchAO Documentation: https://docs.pytorch.org/ao/stable/index.html

## Disclaimer

PyTorch has not performed safety evaluations or red teamed the quantized models. Performance characteristics, outputs, and behaviors may differ from the original models. Users are solely responsible for selecting appropriate use cases, evaluating and mitigating for accuracy, safety, and fairness, ensuring security, and complying with all applicable laws and regulations. Nothing contained in this Model Card should be interpreted as or deemed a restriction or modification to the licenses the models are released under, including any limitations of liability or disclaimers of warranties provided therein.
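The peak-memory figures in this card are collected with the usual reset / run / read-peak pattern around an inference call. The `torch.cuda` calls for that pattern appear in comments below; the runnable part mimics the same pattern with `tracemalloc` so the sketch executes without a GPU (the workload is a stand-in, not the actual model):

```python
import tracemalloc

# GPU equivalent (requires CUDA and the loaded model):
#   torch.cuda.reset_peak_memory_stats()
#   model.generate(**inputs)
#   peak_gb = torch.cuda.max_memory_reserved() / 1e9
#
# CPU-only stand-in using the same reset -> run -> read-peak pattern:
tracemalloc.start()                              # reset and start tracking
workload = [bytes(1_000_000) for _ in range(8)]  # stand-in for inference work
peak_bytes = tracemalloc.get_traced_memory()[1]  # read the peak
tracemalloc.stop()

print(f"peak during workload: {peak_bytes / 1e6:.0f} MB")  # ~8 MB here
```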
License: apache-2.0

Phi-4-mini-instruct-parq-3w-4e-shared (license: mit)

Phi-4-mini-instruct-INT4 (license: bsd-3-clause)

Phi-4-mini-instruct-parq-2w-4e-shared (license: mit)

Phi-4-mini-instruct-parq-4w-4e-shared-gsm (license: mit)

Qwen3-32B-FP8 (license: apache-2.0)

# Gemma 3 12b It QAT INT4

- Developed by: pytorch
- License: apache-2.0
- Quantized from model: google/gemma-3-12b-it
- Quantization method: QAT INT4
- Terms of Use: [Terms][terms]

gemma-3-12b-it fine-tuned with unsloth using quantization-aware training (QAT) from torchao and quantized with int4 weight-only quantization, by the PyTorch team. Use it directly or serve it with vLLM for a 66% VRAM reduction (8.34 GB needed) and a 1.73x speedup on H100 GPUs.

## Inference with vLLM

Install vLLM nightly and torchao nightly to get some recent changes.

### Serving

Then we can serve the model with vLLM.

Note: please use `VLLM_DISABLE_COMPILE_CACHE=1` to disable the compile cache when running this code, e.g. `VLLM_DISABLE_COMPILE_CACHE=1 python example.py`, since there are some issues with the composability of compile in vLLM and torchao; this is expected to be resolved in PyTorch 2.8.

## Model Quality

We rely on lm-evaluation-harness to evaluate the quality of the quantized model. Here we only run on mmlu as a sanity check.

| Benchmark | mmlu accuracy | Normalized accuracy degradation |
|---|---|---|
| **google/gemma-3-12b-it** | | |
| bf16 | 71.51 | -0% |
| int4 | 69.48 | -100% |
| **Fine-tuned without QAT** | | |
| bf16 | 71.55 | +2% |
| int4 | 69.58 | -95% |
| **Fine-tuned with QAT** | | |
| int4 | 70.18 | -65.5% |

### Language eval

Need to install lm-eval from source: https://github.com/EleutherAI/lm-evaluation-harness#install

### Multi-modal eval

Need to install lmms-eval from source: `pip install git+https://github.com/EvolvingLMMs-Lab/lmms-eval.git`

## Peak Memory Usage

| Benchmark | google/gemma-3-12b-it | pytorch/gemma-3-12b-it-QAT-INT4 |
|---|---|---|
| Peak Memory (GB) | 24.50 | 8.34 (66% reduction) |

We can track torch CUDA memory statistics to get a sense of peak memory usage during inference.

## Results (H100 machine)

| Benchmark (Latency) | google/gemma-3-12b-it | pytorch/gemma-3-12b-it-QAT-INT4 |
|---|---|---|
| latency (batch size 1) | 3.73s | 2.16s (1.73x speedup) |

## Paper: TorchAO: PyTorch-Native Training-to-Serving Model Optimization

The model's quantization is powered by TorchAO, a framework presented in the paper "TorchAO: PyTorch-Native Training-to-Serving Model Optimization".

Abstract: We present TorchAO, a PyTorch-native model optimization framework leveraging quantization and sparsity to provide an end-to-end, training-to-serving workflow for AI models. TorchAO supports a variety of popular model optimization techniques, including FP8 quantized training, quantization-aware training (QAT), post-training quantization (PTQ), and 2:4 sparsity, and leverages a novel tensor subclass abstraction to represent a variety of widely used, backend-agnostic low-precision data types, including INT4, INT8, FP8, MXFP4, MXFP6, and MXFP8. TorchAO integrates closely with the broader ecosystem at each step of the model optimization pipeline, from pre-training (TorchTitan) to fine-tuning (TorchTune, Axolotl) to serving (HuggingFace, vLLM, SGLang, ExecuTorch), connecting an otherwise fragmented space in a single, unified workflow. TorchAO has enabled recent launches of the quantized Llama 3.2 1B/3B and LlamaGuard3-8B models and is open-source at this https URL.

## Resources

- Official TorchAO GitHub Repository: https://github.com/pytorch/ao
- TorchAO Documentation: https://docs.pytorch.org/ao/stable/index.html

## Disclaimer

PyTorch has not performed safety evaluations or red teamed the quantized models. Performance characteristics, outputs, and behaviors may differ from the original models. Users are solely responsible for selecting appropriate use cases, evaluating and mitigating for accuracy, safety, and fairness, ensuring security, and complying with all applicable laws and regulations.
Nothing contained in this Model Card should be interpreted as or deemed a restriction or modification to the licenses the models are released under, including any limitations of liability or disclaimers of warranties provided therein.
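For reference, the "Normalized accuracy degradation" column in the quality table above is the mmlu accuracy change expressed as a percentage of the base model's bf16-to-int4 gap (71.51 - 69.48 = 2.03 points). A sketch of that arithmetic as we read the table; the helper name is illustrative:

```python
def normalized_degradation(acc: float,
                           bf16_base: float = 71.51,
                           int4_base: float = 69.48) -> float:
    """Accuracy change vs. bf16, as a percentage of the base bf16->int4 gap."""
    gap = bf16_base - int4_base  # 2.03 mmlu points lost by plain int4
    return (acc - bf16_base) / gap * 100

# Reproduces the table (rounded): +2% (fine-tuned bf16),
# -95% (fine-tuned int4, no QAT), -65.5% (fine-tuned int4 with QAT).
for label, acc in [("fine-tuned bf16", 71.55),
                   ("fine-tuned int4, no QAT", 69.58),
                   ("fine-tuned int4, QAT", 70.18)]:
    print(f"{label}: {normalized_degradation(acc):+.1f}%")
```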
License: apache-2.0

SmolLM3-3B-INT8-INT4 (license: apache-2.0)