pytorch

21 models

gemma-3-12b-it-AWQ-INT4

license:apache-2.0
3,783
0

Phi-4-mini-instruct-AWQ-INT4

This repository hosts the Phi-4-mini-instruct model quantized with torchao using int4 weight-only quantization and the AWQ algorithm. This work is brought to you by the PyTorch team. The model can be used directly, or served with vLLM, for a 56% VRAM reduction (3.95 GB needed) and a 1.17x speedup on H100 GPUs. The model is calibrated with 2 samples from the `mmlu_pro` task to recover accuracy on `mmlu_pro` specifically: accuracy on `mmlu_pro` recovers from 36.98 (plain INT4 checkpoint) to 43.13, while the bfloat16 baseline is 46.43.

Inference with vLLM

Install vLLM nightly and torchao nightly to get some recent changes, then serve with vLLM. Note: please use `VLLM_DISABLE_COMPILE_CACHE=1` to disable the compile cache when running this code, e.g. `VLLM_DISABLE_COMPILE_CACHE=1 python example.py`; there are some issues with the composability of compile in vLLM and torchao that are expected to be resolved in PyTorch 2.8. Use a token with write access from https://huggingface.co/settings/tokens.

Model Quality

We rely on lm-evaluation-harness to evaluate the quality of the quantized model. Here we only run on mmlu as a sanity check.
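A sketch of the install and serve steps described above (the nightly index URLs and flags are illustrative assumptions; check the vLLM and torchao docs for current nightly install instructions):

```shell
# Install vLLM nightly and torchao nightly (index URLs are illustrative)
pip install --pre vllm --extra-index-url https://wheels.vllm.ai/nightly
pip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/cu126

# Serve the quantized checkpoint; disable the compile cache as noted above
VLLM_DISABLE_COMPILE_CACHE=1 vllm serve pytorch/Phi-4-mini-instruct-AWQ-INT4
```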
Since the checkpoint is tuned on `mmlu_pro`, we check accuracy on `mmlu_pro`:

| Benchmark | microsoft/Phi-4-mini-instruct | pytorch/Phi-4-mini-instruct-INT4 | pytorch/Phi-4-mini-instruct-AWQ-INT4 |
|---|---|---|---|
| mmlu_pro | 46.43 | 36.98 | 43.13 |

Need to install lm-eval from source: https://github.com/EleutherAI/lm-evaluation-harness#install

Peak Memory Usage

| Benchmark | microsoft/Phi-4-mini-instruct | jerryzh168/Phi-4-mini-instruct-INT4 | pytorch/Phi-4-mini-instruct-AWQ-INT4 |
|---|---|---|---|
| Peak Memory (GB) | 8.91 | 3.02 (66% reduction) | 3.95 (55.67% reduction) |

We can use the following code to get a sense of peak memory usage during inference.

Results (H100 machine)

| Benchmark (Latency) | microsoft/Phi-4-mini-instruct | jerryzh168/Phi-4-mini-instruct-INT4 | pytorch/Phi-4-mini-instruct-AWQ-INT4 |
|---|---|---|---|
| latency (batch size=1) | 1.61s | 1.33s (1.21x speedup) | 1.37s (1.17x speedup) |
| latency (batch size=256) | 5.31s | 5.38s (0.99x speedup) | 5.44s (0.98x speedup) |

Note: the AWQ-INT4 checkpoint is expected to be slower at batch size 256, since the workload becomes compute bound rather than memory bound at larger batch sizes, and int4 weight-only quantization only speeds up memory-bound workloads. Note: we compare against jerryzh168/Phi-4-mini-instruct-INT4, an H100 checkpoint, since AWQ-INT4 uses the new INT4 config optimized for H100 that doesn't regress performance at batch size 256.
It's possible to generate AWQ-INT4 for A100 as well using `Int4WeightOnlyConfig(group_size=128, int4_packing_format="tile_packed_to_4d", int4_choose_qparams_algorithm="hqq")`.

Paper: TorchAO: PyTorch-Native Training-to-Serving Model Optimization

The model's quantization is powered by TorchAO, a framework presented in the paper "TorchAO: PyTorch-Native Training-to-Serving Model Optimization".

Abstract: We present TorchAO, a PyTorch-native model optimization framework leveraging quantization and sparsity to provide an end-to-end, training-to-serving workflow for AI models. TorchAO supports a variety of popular model optimization techniques, including FP8 quantized training, quantization-aware training (QAT), post-training quantization (PTQ), and 2:4 sparsity, and leverages a novel tensor subclass abstraction to represent a variety of widely-used, backend-agnostic low-precision data types, including INT4, INT8, FP8, MXFP4, MXFP6, and MXFP8. TorchAO integrates closely with the broader ecosystem at each step of the model optimization pipeline, from pre-training (TorchTitan) to fine-tuning (TorchTune, Axolotl) to serving (HuggingFace, vLLM, SGLang, ExecuTorch), connecting an otherwise fragmented space in a single, unified workflow. TorchAO has enabled recent launches of the quantized Llama 3.2 1B/3B and LlamaGuard3-8B models and is open-source at https://github.com/pytorch/ao.

Resources

Official TorchAO GitHub Repository: https://github.com/pytorch/ao
TorchAO Documentation: https://docs.pytorch.org/ao/stable/index.html

Disclaimer

PyTorch has not performed safety evaluations or red teamed the quantized models. Performance characteristics, outputs, and behaviors may differ from the original models. Users are solely responsible for selecting appropriate use cases, evaluating and mitigating for accuracy, safety, and fairness, ensuring security, and complying with all applicable laws and regulations.
Nothing contained in this Model Card should be interpreted as or deemed a restriction or modification to the licenses the models are released under, including any limitations of liability or disclaimers of warranties provided therein.
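The AWQ calibration described above works by grid-searching per-channel scales that minimize the quantized layer's *output* error on calibration samples: channels with large activations get scaled up before int4 quantization, and the inverse scale is folded into the activations. A toy pure-Python sketch of that idea (this is an illustration, not torchao's implementation; the weights, activations, and search grid below are made up):

```python
def int4_quant_dequant(w):
    """Symmetric int4 (levels -8..7) quantize/dequantize of a weight vector."""
    s = max(abs(v) for v in w) / 7 or 1.0
    return [max(-8, min(7, round(v / s))) * s for v in w]

def output_error(w, x_rows, k):
    """Mean |y - y_hat| when per-channel scales k are folded into w and x."""
    wq = int4_quant_dequant([wi * ki for wi, ki in zip(w, k)])
    err = 0.0
    for x in x_rows:
        y = sum(wi * xi for wi, xi in zip(w, x))
        y_hat = sum(wqi * xi / ki for wqi, xi, ki in zip(wq, x, k))
        err += abs(y - y_hat)
    return err / len(x_rows)

w = [0.17, 0.31, -0.23, 0.09, 0.26, -0.11]
# Calibration activations: channel 0 is "salient" (large magnitude)
x_rows = [[8.0, 1.0, -1.0, 0.5, 1.0, -0.5],
          [7.5, -0.8, 1.2, 0.4, -1.1, 0.6]]

baseline = output_error(w, x_rows, [1.0] * len(w))
# Grid-search a scale for the salient channel (AWQ searches scales per
# channel group from activation statistics; this is a simplification)
best_k, best_err = 1.0, baseline
for k0 in [1.0, 1.25, 1.5, 1.75, 2.0, 2.5, 3.0]:
    k = [k0] + [1.0] * (len(w) - 1)
    e = output_error(w, x_rows, k)
    if e < best_err:
        best_k, best_err = k0, e

print(f"baseline error {baseline:.4f}, best k={best_k}, error {best_err:.4f}")
assert best_err <= baseline  # the search can only improve output error
```

Since `k=1` is in the grid, the searched scale never does worse than plain int4 on the calibration data, which is why the card above reports AWQ-INT4 recovering accuracy relative to the plain INT4 checkpoint.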

license:bsd-3-clause
3,462
3

gemma-3-27b-it-AWQ-INT4

- Developed by: pytorch
- License: apache-2.0
- Quantized from model: google/gemma-3-27b-it
- Quantization method: AWQ-INT4
- Terms of Use: [Terms][terms]

Calibrated with 30 samples of `mmlu_philosophy`; eval accuracy is 80.06, while gemma-3-27b-it-INT4 gets 77.17 and the bfloat16 baseline is 79.42.

Inference with vLLM

Install vLLM nightly and torchao nightly to get some recent changes, then serve with vLLM. Note: use `VLLM_DISABLE_COMPILE_CACHE=1` to disable the compile cache (e.g. `VLLM_DISABLE_COMPILE_CACHE=1 python example.py`), since there are some issues with the composability of compile in vLLM and torchao; this is expected to be resolved in PyTorch 2.8. Use a token with write access from https://huggingface.co/settings/tokens.

Model Quality

We rely on lm-evaluation-harness to evaluate the quality of the quantized model. Here we only run on mmlu as a sanity check.

| Benchmark | google/gemma-3-27b-it | jerryzh168/gemma-3-27b-it-INT4 | pytorch/gemma-3-27b-it-AWQ-INT4 |
|---|---|---|---|
| philosophy | 79.42 | 77.17 | 80.06 |

Note: jerryzh168/gemma-3-27b-it-INT4 is the H100-optimized INT4 checkpoint.

Need to install lm-eval from source: https://github.com/EleutherAI/lm-evaluation-harness#install

Peak Memory Usage

| Benchmark | google/gemma-3-27b-it | jerryzh168/gemma-3-27b-it-INT4 | pytorch/gemma-3-27b-it-AWQ-INT4 |
|---|---|---|---|
| Peak Memory (GB) | 55.02 | 19.93 (64% reduction) | 27.66 (50% reduction) |

We can use the following code to get a sense of peak memory usage during inference.

Results (H100 machine)

| Benchmark (Latency) | google/gemma-3-27b-it | jerryzh168/gemma-3-27b-it-INT4 | pytorch/gemma-3-27b-it-AWQ-INT4 |
|---|---|---|---|
| latency (batch size=1) | 7.44s | 4.81s (1.55x speedup) | 4.87s (1.53x speedup) |
| latency (batch size=256) | 40.30s | 27.43s (1.47x speedup) | 27.89s (1.44x speedup) |

Paper: TorchAO: PyTorch-Native Training-to-Serving Model Optimization. The model's quantization is powered by TorchAO (https://github.com/pytorch/ao); the paper abstract, resources, and disclaimer are identical to those in the Phi-4-mini-instruct-AWQ-INT4 card above.
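The headline figures in the tables above are simple ratios against the bf16 baseline; recomputing them from this card's raw numbers:

```python
# Recompute this card's memory-reduction and speedup figures
peak_bf16, peak_awq = 55.02, 27.66   # Peak Memory (GB)
lat_bf16, lat_awq = 7.44, 4.87       # latency in seconds, batch size 1

reduction = (1 - peak_awq / peak_bf16) * 100
speedup = lat_bf16 / lat_awq
print(f"{reduction:.0f}% memory reduction, {speedup:.2f}x speedup")
assert round(reduction) == 50
assert round(speedup, 2) == 1.53
```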

license:apache-2.0
3,215
2

gemma-3-27b-it-INT4

- Developed by: pytorch
- License: apache-2.0
- Quantized from model: google/gemma-3-27b-it
- Quantization method: INT4
- Terms of Use: [Terms][terms]

Inference with vLLM

Install vLLM nightly and torchao nightly to get some recent changes, then serve with vLLM. Note: use `VLLM_DISABLE_COMPILE_CACHE=1` to disable the compile cache (e.g. `VLLM_DISABLE_COMPILE_CACHE=1 python example.py`), since there are some issues with the composability of compile in vLLM and torchao; this is expected to be resolved in PyTorch 2.8. Use a token with write access from https://huggingface.co/settings/tokens.

Model Quality

We rely on lm-evaluation-harness to evaluate the quality of the quantized model. Here we only run on mmlu as a sanity check.

| Benchmark | google/gemma-3-27b-it | pytorch/gemma-3-27b-it-INT4 |
|---|---|---|
| mmlu | 76.48 | 75.15 |
| chartqa (multimodal) | 51.80 | 51.92 |

Language eval: need to install lm-eval from source: https://github.com/EleutherAI/lm-evaluation-harness#install
Multi-modal eval: need to install lmms-eval from source: `pip install git+https://github.com/EvolvingLMMs-Lab/lmms-eval.git`

Peak Memory Usage

| Benchmark | google/gemma-3-27b-it | pytorch/gemma-3-27b-it-INT4 |
|---|---|---|
| Peak Memory (GB) | 55.01 | 18.62 (66% reduction) |

We can use the following code to get a sense of peak memory usage during inference.

Results (A100 machine)

| Benchmark (Latency) | google/gemma-3-27b-it | pytorch/gemma-3-27b-it-INT4 |
|---|---|---|
| latency (batch size=1) | 7.46s | 3.90s (1.91x speedup) |

Paper: TorchAO: PyTorch-Native Training-to-Serving Model Optimization. The model's quantization is powered by TorchAO (https://github.com/pytorch/ao); the paper abstract, resources, and disclaimer are identical to those in the Phi-4-mini-instruct-AWQ-INT4 card above.
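With lm-eval installed from source as described above, the mmlu sanity check typically looks like the following (model arguments and batch size are illustrative):

```shell
# Language eval with lm-evaluation-harness (installed from source)
lm_eval --model hf \
  --model_args pretrained=pytorch/gemma-3-27b-it-INT4 \
  --tasks mmlu \
  --batch_size 8
```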

license:apache-2.0
2,613
0

Phi-4-mini-instruct-INT8-INT4

license:mit
1,875
2

Qwen3-4B-INT8-INT4

license:mit
1,769
2

gemma-3-12b-it-INT4

- Developed by: pytorch
- License: apache-2.0
- Quantized from model: google/gemma-3-12b-it
- Quantization method: INT4
- Terms of Use: [Terms][terms]

Inference with vLLM

Install vLLM nightly and torchao nightly to get some recent changes, then serve with vLLM. Note: use `VLLM_DISABLE_COMPILE_CACHE=1` to disable the compile cache (e.g. `VLLM_DISABLE_COMPILE_CACHE=1 python example.py`), since there are some issues with the composability of compile in vLLM and torchao; this is expected to be resolved in PyTorch 2.8. Use a token with write access from https://huggingface.co/settings/tokens.

Model Quality

We rely on lm-evaluation-harness to evaluate the quality of the quantized model. Here we only run on mmlu as a sanity check.

| Benchmark | google/gemma-3-12b-it | pytorch/gemma-3-12b-it-INT4 |
|---|---|---|
| mmlu | 71.51 | 68.96 |
| chartqa (multimodal) | 55.80 | 56.28 |

Language eval: need to install lm-eval from source: https://github.com/EleutherAI/lm-evaluation-harness#install
Multi-modal eval: need to install lmms-eval from source: `pip install git+https://github.com/EvolvingLMMs-Lab/lmms-eval.git`

Peak Memory Usage

| Benchmark | google/gemma-3-12b-it | pytorch/gemma-3-12b-it-INT4 |
|---|---|---|
| Peak Memory (GB) | 24.50 | 8.68 (65% reduction) |

We can use the following code to get a sense of peak memory usage during inference.

Results (A100 machine)

| Benchmark (Latency) | google/gemma-3-12b-it | pytorch/gemma-3-12b-it-INT4 |
|---|---|---|
| latency (batch size=1) | 3.73s | 2.16s (1.73x speedup) |

Paper: TorchAO: PyTorch-Native Training-to-Serving Model Optimization. The model's quantization is powered by TorchAO (https://github.com/pytorch/ao); the paper abstract, resources, and disclaimer are identical to those in the Phi-4-mini-instruct-AWQ-INT4 card above.
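The memory reduction above comes from int4 weight-only quantization storing each group of weights as 4-bit integer codes plus one scale per group. A minimal pure-Python sketch of group-wise symmetric int4 quantize/dequantize (torchao's kernels pack these codes into efficient GPU layouts; this only shows the arithmetic):

```python
def quantize_int4_groupwise(weights, group_size=4):
    """Symmetric int4: one scale per group, integer codes in [-8, 7]."""
    codes, scales = [], []
    for i in range(0, len(weights), group_size):
        group = weights[i:i + group_size]
        scale = max(abs(w) for w in group) / 7 or 1.0
        scales.append(scale)
        codes.extend(max(-8, min(7, round(w / scale))) for w in group)
    return codes, scales

def dequantize_int4_groupwise(codes, scales, group_size=4):
    return [c * scales[i // group_size] for i, c in enumerate(codes)]

w = [0.12, -0.55, 0.30, 0.02, 1.4, -0.7, 0.9, -0.1]
codes, scales = quantize_int4_groupwise(w)
w_hat = dequantize_int4_groupwise(codes, scales)

assert all(-8 <= c <= 7 for c in codes)
# Round-to-nearest bounds the error by half a quantization step per group
for i, (a, b) in enumerate(zip(w, w_hat)):
    assert abs(a - b) <= scales[i // 4] / 2 + 1e-12
```

Each code fits in 4 bits versus 16 for bf16, so weight memory drops roughly 4x (less the per-group scale overhead), matching the ~65% peak-memory reduction reported above once activations and KV cache are included.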

license:apache-2.0
1,019
0

Phi-4-mini-instruct-FP8

license:mit
892
1

gemma-3-27b-it-FP8

- Developed by: pytorch
- License: apache-2.0
- Quantized from model: google/gemma-3-27b-it
- Quantization method: FP8
- Terms of Use: [Terms][terms]

Inference with vLLM

Install vLLM nightly and torchao nightly to get some recent changes, then serve with vLLM. Note: use `VLLM_DISABLE_COMPILE_CACHE=1` to disable the compile cache (e.g. `VLLM_DISABLE_COMPILE_CACHE=1 python example.py`), since there are some issues with the composability of compile in vLLM and torchao; this is expected to be resolved in PyTorch 2.8. Use a token with write access from https://huggingface.co/settings/tokens.

Model Quality

We rely on lm-evaluation-harness to evaluate the quality of the quantized model. Here we only run on mmlu as a sanity check.

| Benchmark | google/gemma-3-27b-it | pytorch/gemma-3-27b-it-FP8 |
|---|---|---|
| mmlu | 76.48 | 76.20 |
| chartqa (multimodal) | 51.80 | 51.32 |

Language eval: need to install lm-eval from source: https://github.com/EleutherAI/lm-evaluation-harness#install
Multi-modal eval: need to install lmms-eval from source: `pip install git+https://github.com/EvolvingLMMs-Lab/lmms-eval.git`

Peak Memory Usage

| Benchmark | google/gemma-3-27b-it | pytorch/gemma-3-27b-it-FP8 |
|---|---|---|
| Peak Memory (GB) | 55.01 | 32.09 (42% reduction) |

We can use the following code to get a sense of peak memory usage during inference.

Results (H100 machine)

| Benchmark (Latency) | google/gemma-3-27b-it | pytorch/gemma-3-27b-it-FP8 |
|---|---|---|
| latency (batch size=1) | 7.46s | 4.92s (1.52x speedup) |
| latency (batch size=256) | 39.55s | 24.14s (1.64x speedup) |

Paper: TorchAO: PyTorch-Native Training-to-Serving Model Optimization. The model's quantization is powered by TorchAO (https://github.com/pytorch/ao); the paper abstract, resources, and disclaimer are identical to those in the Phi-4-mini-instruct-AWQ-INT4 card above.
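FP8 weights halve the memory of bf16 by storing each value in 8 bits; torchao commonly uses the float8 e4m3 format (1 sign bit, 4 exponent bits with bias 7, 3 mantissa bits, max finite value 448). A pure-Python sketch of rounding a number to the nearest e4m3-representable value, assuming the e4m3fn variant (real kernels use `torch.float8_e4m3fn`; this is only an illustration of the format):

```python
import math

def round_to_e4m3(x):
    """Round x to the nearest float8 e4m3fn value: 4 exponent bits
    (bias 7), 3 mantissa bits, max finite 448, saturating (no inf)."""
    if x == 0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = abs(x)
    if mag >= 448:
        return sign * 448.0  # saturate: e4m3fn has no infinity
    # Exponent of the containing binade, clamped to the subnormal range
    e = max(math.floor(math.log2(mag)), -6)
    step = 2.0 ** (e - 3)  # 3 mantissa bits => 8 steps per binade
    # Python's round() is round-half-to-even, matching IEEE rounding
    return sign * round(mag / step) * step

for x in [0.3, 0.1, 500.0, -3.1]:
    print(f"{x} -> {round_to_e4m3(x)}")
```

Because only 3 mantissa bits survive (about two significant decimal digits), FP8 trades a small accuracy loss (76.48 vs 76.20 mmlu above) for ~42% peak-memory reduction.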

license:apache-2.0
801
3

Qwen3-8B-QAT-INT4

license:apache-2.0
422
1

gemma-3-12b-it-FP8

- Developed by: pytorch
- License: apache-2.0
- Quantized from model: google/gemma-3-12b-it
- Quantization method: FP8
- Terms of Use: [Terms][terms]

Inference with vLLM

Install vLLM nightly and torchao nightly to get some recent changes, then serve with vLLM. Note: use `VLLM_DISABLE_COMPILE_CACHE=1` to disable the compile cache (e.g. `VLLM_DISABLE_COMPILE_CACHE=1 python example.py`), since there are some issues with the composability of compile in vLLM and torchao; this is expected to be resolved in PyTorch 2.8. Use a token with write access from https://huggingface.co/settings/tokens.

Model Quality

We rely on lm-evaluation-harness to evaluate the quality of the quantized model. Here we only run on mmlu as a sanity check. We also rely on lmms-eval for multi-modal quality eval; we only tested on chartqa as a sanity check.

| Benchmark | google/gemma-3-12b-it | pytorch/gemma-3-12b-it-FP8 |
|---|---|---|
| mmlu | 71.51 | 71.30 |
| chartqa (multimodal) | 55.80 | 55.36 |

Language eval: need to install lm-eval from source: https://github.com/EleutherAI/lm-evaluation-harness#install
Multi-modal eval: need to install lmms-eval from source: `pip install git+https://github.com/EvolvingLMMs-Lab/lmms-eval.git`

Peak Memory Usage

| Benchmark | google/gemma-3-12b-it | pytorch/gemma-3-12b-it-FP8 |
|---|---|---|
| Peak Memory (GB) | 24.50 | 15.47 (37% reduction) |

We can use the following code to get a sense of peak memory usage during inference.

Results (A100 machine)

| Benchmark (Latency) | google/gemma-3-12b-it | pytorch/gemma-3-12b-it-FP8 |
|---|---|---|
| latency (batch size=1) | 3.73s | 2.76s (1.35x speedup) |
| latency (batch size=256) | 13.63s | 11.49s (1.19x speedup) |

Paper: TorchAO: PyTorch-Native Training-to-Serving Model Optimization. The model's quantization is powered by TorchAO (https://github.com/pytorch/ao); the paper abstract, resources, and disclaimer are identical to those in the Phi-4-mini-instruct-AWQ-INT4 card above.
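With lmms-eval installed from source as described above, a chartqa sanity check typically looks like the following (the `--model` backend name and other arguments are assumptions; check lmms-eval's documentation for the supported model types):

```shell
# Multi-modal eval with lmms-eval (installed from source);
# arguments are illustrative
python -m lmms_eval --model hf \
  --model_args pretrained=pytorch/gemma-3-12b-it-FP8 \
  --tasks chartqa \
  --batch_size 1
```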

license:apache-2.0
380
0

gemma-3-4b-it-HQQ-INT8-INT4

- Developed by: pytorch
- License: apache-2.0
- Quantized from model: google/gemma-3-4b-it
- Quantization method: HQQ-INT8-INT4
- Terms of Use: [Terms][terms]

Gemma3-4B is quantized by the PyTorch team using torchao, with 8-bit embeddings and 8-bit dynamic activations with 4-bit weight linears (INT8-INT4). The model is suitable for mobile deployment with ExecuTorch, and we provide the quantized pte for direct use in ExecuTorch. (The provided pte file is exported with a `max_seq_length`/`max_context_length` of 1024; if you wish to change this, re-export the quantized model following the instructions in Exporting to ExecuTorch.) To run in a mobile app, download the quantized pte and tokenizer and follow the instructions here.

Model Quality

We rely on lm-evaluation-harness to evaluate the quality of the quantized model; the weights use int8 dynamic activation and int4 weight quantization with quantization parameters chosen by HQQ (HQQ-INT8-INT4).

| Benchmark | gemma-3-4b-it | pytorch/gemma-3-4b-it-HQQ-INT8-INT4 |
|---|---|---|
| mmlu | 57.68 | 55.65 |
| chartqa (multimodal) | 50.56 | 42.88 |

Language eval: need to install lm-eval from source: https://github.com/EleutherAI/lm-evaluation-harness#install
Multi-modal eval: need to install lmms-eval from source: `pip install git+https://github.com/EvolvingLMMs-Lab/lmms-eval.git`

Exporting to ExecuTorch

To export to ExecuTorch, we use optimum-executorch. First install ExecuTorch and optimum-executorch, then export the model to an ExecuTorch pte file and upload it to HuggingFace. The export targets the XNNPACK backend with a context length of 1024, but this can be adjusted.

Paper: TorchAO: PyTorch-Native Training-to-Serving Model Optimization. The model's quantization is powered by TorchAO (https://github.com/pytorch/ao); the paper abstract, resources, and disclaimer are identical to those in the Phi-4-mini-instruct-AWQ-INT4 card above.
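An ExecuTorch export with optimum-executorch typically looks like the following sketch (the flag names and output path are assumptions; check the optimum-executorch docs for the exact CLI):

```shell
pip install optimum-executorch

# Export the quantized model to an ExecuTorch .pte for the XNNPACK
# backend (flags are illustrative; see the optimum-executorch docs)
optimum-cli export executorch \
  --model pytorch/gemma-3-4b-it-HQQ-INT8-INT4 \
  --recipe xnnpack \
  --output_dir ./gemma-3-4b-it-pte
```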

license:apache-2.0
307
0

Qwen3-8B-INT4

license:apache-2.0
136
2

Qwen3-8B-AWQ-INT4

This repository hosts the Qwen3-8B model quantized with torchao using int4 weight-only quantization and the awq algorithm. This work is brought to you by the PyTorch team. This model can be used directly or served using vLLM for 53% VRAM reduction (7.82 GB needed) and 1.34x speedup on H100 GPUs for batch size 1. The model is calibrated with 10 samples from `mmluabstractalgebra` task to recover the accuracy for `mmluabstractalgebra` specifically. AWQ-INT4 improves the accuracy of `mmluabstractalgebra` of INT4 from 55 to 56, while the bfloat16 baseline is 58. Inference with vLLM Install vllm nightly and torchao nightly to get some recent changes: Serving Then we can serve with the following command: Note: please use `VLLMDISABLECOMPILECACHE=1` to disable compile cache when running this code, e.g. `VLLMDISABLECOMPILECACHE=1 python example.py`, since there are some issues with the composability of compile in vLLM and torchao, this is expected be resolved in pytorch 2.8. and use a token with write access, from https://huggingface.co/settings/tokens Model Quality We rely on lm-evaluation-harness to evaluate the quality of the quantized model. | Benchmark | | | | |----------------------------------|----------------|------------------------|---------------------------| | | Qwen/Qwen3-8B | jerryzh168/Qwen3-8B-INT4-skiplmhead | pytorch/Qwen3-8B-AWQ-INT4 | | mmluabstractalgebra | 58 | 55 | 56 | Note that we only calibrate on a single `mmluabstractalgebra` task instead of the entire `mmlu` task since `mmlu` contains many different types of tasks and calibrating on all of them does not necessarily improve the accuracy for all the tasks, since it's harder to faithfully represent the distribution of data from all types of tasks with a selected small calibration sample data. Note: we skipped quantization for `lmhead` because in transformers lmhead is a `Linear` but in vllm lmhead becomes ParallelLMHead and the linear weight no longer works there. 
Need to install lm-eval from source: https://github.com/EleutherAI/lm-evaluation-harness#install | Benchmark | | | | |------------------|----------------|--------------------------------|--------------------------------| | | Qwen/Qwen3-8B | jerryzh168/Qwen3-8B-INT4-skiplmhead | pytorch/Qwen3-8B-AWQ-INT4 | | Peak Memory (GB) | 16.47 | 7.82 (53% reduction) | 7.82 (53% reduction) | We can use the following code to get a sense of peak memory usage during inference: Results (H100 machine) | Benchmark (Latency) | | | | |----------------------------------|----------------|---------------------------|---------------------------| | | Qwen/Qwen3-8B | jerryzh168/Qwen3-8B-INT4-skiplmhead | pytorch/Qwen3-8B-AWQ-INT4 | | latency (batchsize=1) | 2.46s | 1.40s (1.76x speedup) | 1.83s (1.34x speedup) | We benchmarked the throughput in a serving environment. Other datasets can be found in: https://github.com/vllm-project/vllm/tree/main/benchmarks Note: you can change the number of prompts to be benchmarked with `--num-prompts` argument for `benchmarkserving` script. Paper: TorchAO: PyTorch-Native Training-to-Serving Model Optimization The model's quantization is powered by TorchAO, a framework presented in the paper TorchAO: PyTorch-Native Training-to-Serving Model Optimization. Abstract: We present TorchAO, a PyTorch-native model optimization framework leveraging quantization and sparsity to provide an end-to-end, training-to-serving workflow for AI models. TorchAO supports a variety of popular model optimization techniques, including FP8 quantized training, quantization-aware training (QAT), post-training quantization (PTQ), and 2:4 sparsity, and leverages a novel tensor subclass abstraction to represent a variety of widely-used, backend agnostic low precision data types, including INT4, INT8, FP8, MXFP4, MXFP6, and MXFP8. 
TorchAO integrates closely with the broader ecosystem at each step of the model optimization pipeline, from pre-training (TorchTitan) to fine-tuning (TorchTune, Axolotl) to serving (HuggingFace, vLLM, SGLang, ExecuTorch), connecting an otherwise fragmented space in a single, unified workflow. TorchAO has enabled recent launches of the quantized Llama 3.2 1B/3B and LlamaGuard3-8B models and is open-source at this https URL.

## Resources

- Official TorchAO GitHub Repository: https://github.com/pytorch/ao
- TorchAO Documentation: https://docs.pytorch.org/ao/stable/index.html

## Disclaimer

PyTorch has not performed safety evaluations or red teamed the quantized models. Performance characteristics, outputs, and behaviors may differ from the original models. Users are solely responsible for selecting appropriate use cases, evaluating and mitigating for accuracy, safety, and fairness, ensuring security, and complying with all applicable laws and regulations. Nothing contained in this Model Card should be interpreted as or deemed a restriction or modification to the licenses the models are released under, including any limitations of liability or disclaimers of warranties provided therein.
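The peak-memory figures in this card are collected with the usual reset / run / read-peak pattern around an inference call. The `torch.cuda` calls for that pattern appear in comments below; the runnable part mimics the same pattern with `tracemalloc` so the sketch executes without a GPU (the workload is a stand-in, not the actual model):

```python
import tracemalloc

# GPU equivalent (requires CUDA and the loaded model):
#   torch.cuda.reset_peak_memory_stats()
#   model.generate(**inputs)
#   peak_gb = torch.cuda.max_memory_reserved() / 1e9
#
# CPU-only stand-in using the same reset -> run -> read-peak pattern:
tracemalloc.start()                              # reset and start tracking
workload = [bytes(1_000_000) for _ in range(8)]  # stand-in for inference work
peak_bytes = tracemalloc.get_traced_memory()[1]  # read the peak
tracemalloc.stop()

print(f"peak during workload: {peak_bytes / 1e6:.0f} MB")  # ~8 MB here
```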
License: apache-2.0

Phi-4-mini-instruct-parq-3w-4e-shared (license: mit)

Phi-4-mini-instruct-INT4 (license: bsd-3-clause)

Phi-4-mini-instruct-parq-2w-4e-shared (license: mit)

Phi-4-mini-instruct-parq-4w-4e-shared-gsm (license: mit)

Qwen3-32B-FP8 (license: apache-2.0)

# Gemma 3 12b It QAT INT4

- Developed by: pytorch
- License: apache-2.0
- Quantized from model: google/gemma-3-12b-it
- Quantization method: QAT INT4
- Terms of Use: [Terms][terms]

gemma-3-12b-it fine-tuned with unsloth using quantization-aware training (QAT) from torchao and quantized with int4 weight-only quantization, by the PyTorch team. Use it directly or serve it with vLLM for a 66% VRAM reduction (8.34 GB needed) and a 1.73x speedup on H100 GPUs.

## Inference with vLLM

Install vLLM nightly and torchao nightly to get some recent changes.

### Serving

Then we can serve the model with vLLM.

Note: please use `VLLM_DISABLE_COMPILE_CACHE=1` to disable the compile cache when running this code, e.g. `VLLM_DISABLE_COMPILE_CACHE=1 python example.py`, since there are some issues with the composability of compile in vLLM and torchao; this is expected to be resolved in PyTorch 2.8.

## Model Quality

We rely on lm-evaluation-harness to evaluate the quality of the quantized model. Here we only run on mmlu as a sanity check.

| Benchmark | mmlu accuracy | Normalized accuracy degradation |
|---|---|---|
| **google/gemma-3-12b-it** | | |
| bf16 | 71.51 | -0% |
| int4 | 69.48 | -100% |
| **Fine-tuned without QAT** | | |
| bf16 | 71.55 | +2% |
| int4 | 69.58 | -95% |
| **Fine-tuned with QAT** | | |
| int4 | 70.18 | -65.5% |

### Language eval

Need to install lm-eval from source: https://github.com/EleutherAI/lm-evaluation-harness#install

### Multi-modal eval

Need to install lmms-eval from source: `pip install git+https://github.com/EvolvingLMMs-Lab/lmms-eval.git`

## Peak Memory Usage

| Benchmark | google/gemma-3-12b-it | pytorch/gemma-3-12b-it-QAT-INT4 |
|---|---|---|
| Peak Memory (GB) | 24.50 | 8.34 (66% reduction) |

We can track torch CUDA memory statistics to get a sense of peak memory usage during inference.

## Results (H100 machine)

| Benchmark (Latency) | google/gemma-3-12b-it | pytorch/gemma-3-12b-it-QAT-INT4 |
|---|---|---|
| latency (batch size 1) | 3.73s | 2.16s (1.73x speedup) |

## Paper: TorchAO: PyTorch-Native Training-to-Serving Model Optimization

The model's quantization is powered by TorchAO, a framework presented in the paper "TorchAO: PyTorch-Native Training-to-Serving Model Optimization".

Abstract: We present TorchAO, a PyTorch-native model optimization framework leveraging quantization and sparsity to provide an end-to-end, training-to-serving workflow for AI models. TorchAO supports a variety of popular model optimization techniques, including FP8 quantized training, quantization-aware training (QAT), post-training quantization (PTQ), and 2:4 sparsity, and leverages a novel tensor subclass abstraction to represent a variety of widely used, backend-agnostic low-precision data types, including INT4, INT8, FP8, MXFP4, MXFP6, and MXFP8. TorchAO integrates closely with the broader ecosystem at each step of the model optimization pipeline, from pre-training (TorchTitan) to fine-tuning (TorchTune, Axolotl) to serving (HuggingFace, vLLM, SGLang, ExecuTorch), connecting an otherwise fragmented space in a single, unified workflow. TorchAO has enabled recent launches of the quantized Llama 3.2 1B/3B and LlamaGuard3-8B models and is open-source at this https URL.

## Resources

- Official TorchAO GitHub Repository: https://github.com/pytorch/ao
- TorchAO Documentation: https://docs.pytorch.org/ao/stable/index.html

## Disclaimer

PyTorch has not performed safety evaluations or red teamed the quantized models. Performance characteristics, outputs, and behaviors may differ from the original models. Users are solely responsible for selecting appropriate use cases, evaluating and mitigating for accuracy, safety, and fairness, ensuring security, and complying with all applicable laws and regulations.
Nothing contained in this Model Card should be interpreted as or deemed a restriction or modification to the licenses the models are released under, including any limitations of liability or disclaimers of warranties provided therein.
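For reference, the "Normalized accuracy degradation" column in the quality table above is the mmlu accuracy change expressed as a percentage of the base model's bf16-to-int4 gap (71.51 - 69.48 = 2.03 points). A sketch of that arithmetic as we read the table; the helper name is illustrative:

```python
def normalized_degradation(acc: float,
                           bf16_base: float = 71.51,
                           int4_base: float = 69.48) -> float:
    """Accuracy change vs. bf16, as a percentage of the base bf16->int4 gap."""
    gap = bf16_base - int4_base  # 2.03 mmlu points lost by plain int4
    return (acc - bf16_base) / gap * 100

# Reproduces the table (rounded): +2% (fine-tuned bf16),
# -95% (fine-tuned int4, no QAT), -65.5% (fine-tuned int4 with QAT).
for label, acc in [("fine-tuned bf16", 71.55),
                   ("fine-tuned int4, no QAT", 69.58),
                   ("fine-tuned int4, QAT", 70.18)]:
    print(f"{label}: {normalized_degradation(acc):+.1f}%")
```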
License: apache-2.0

SmolLM3-3B-INT8-INT4 (license: apache-2.0)