# NVIDIA-Nemotron-Nano-9B-v2-gguf
GGUF quantizations of NVIDIA's NVIDIA-Nemotron-Nano-9B-v2. These files target llama.cpp-compatible runtimes.

## Available quantizations

| Model | Size | Bits/Weight | Description |
|-------|------|-------------|-------------|
| `NVIDIA-Nemotron-Nano-9B-v2-gguf-Q8_0.gguf` | 8.9GB | ~8.0 | Near-lossless, reference quality |
| `NVIDIA-Nemotron-Nano-9B-v2-gguf-Q6_K.gguf` | 8.6GB | ~6.0 | High quality, recommended for most users |
| `NVIDIA-Nemotron-Nano-9B-v2-gguf-Q5_K_M.gguf` | 6.6GB | ~5.0 | Good quality, balanced |
| `NVIDIA-Nemotron-Nano-9B-v2-gguf-Q4_K_M.gguf` | 6.1GB | ~4.0 | Standard choice, good compression |
| `NVIDIA-Nemotron-Nano-9B-v2-gguf-Q4_K_S.gguf` | 5.8GB | ~4.0 | 4-bit K (small), smaller than Q4_K_M |
| `NVIDIA-Nemotron-Nano-9B-v2-gguf-Q4_1.gguf` | 5.5GB | ~4.0 | Legacy 4-bit (Q4_1), better quality than Q4_0 |
| `NVIDIA-Nemotron-Nano-9B-v2-gguf-Q4_0.gguf` | 5.0GB | ~4.0 | Legacy 4-bit (Q4_0), smaller |
| `NVIDIA-Nemotron-Nano-9B-v2-gguf-IQ4_XS.gguf` | 5.0GB | 4.25 | i-quant, excellent compression |
| `NVIDIA-Nemotron-Nano-9B-v2-gguf-IQ3_M.gguf` | 4.9GB | 3.66 | Ultra-small, mobile/edge deployment |
| `NVIDIA-Nemotron-Nano-9B-v2-gguf-Q2_K.gguf` | 4.7GB | ~2.0 | 2-bit K, maximum compression |
| `NVIDIA-Nemotron-Nano-9B-v2-gguf-f16.gguf` | 17GB | 16.0 | Full-precision reference (optional) |

## Usage

Download a quantization:

```shell
huggingface-cli download weathermanj/NVIDIA-Nemotron-Nano-9B-v2-gguf \
  NVIDIA-Nemotron-Nano-9B-v2-gguf-Q4_K_M.gguf --local-dir ./
```

Run with llama.cpp:

```shell
./llama-server -m NVIDIA-Nemotron-Nano-9B-v2-gguf-Q4_K_M.gguf -c 4096
```

## Benchmarks

Generation throughput (tokens/s): CPU vs CUDA vs CUDA + Flash Attention on a 24GB RTX 3090, with `n_predict=64`, `temp=0.7`, `top_p=0.95`.
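Once `llama-server` is running, it exposes an OpenAI-compatible HTTP API (on `127.0.0.1:8080` by default). A minimal request sketch — the prompt and sampling values below are illustrative, not part of the benchmark setup:

```shell
# Query llama-server's OpenAI-compatible chat completions endpoint.
# Host and port assume llama-server defaults; adjust if you passed --host/--port.
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [{"role": "user", "content": "Summarize what a GGUF file is in one sentence."}],
        "temperature": 0.7,
        "max_tokens": 64
      }'
```

The server applies the model's chat template to the `messages` array, so no manual prompt formatting is needed.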
| Model | CPU Factoid | CPU Code | CPU Reasoning | CUDA Factoid | CUDA Code | CUDA Reasoning | CUDA+FA Factoid | CUDA+FA Code | CUDA+FA Reasoning |
|--------|------------:|---------:|--------------:|-------------:|----------:|---------------:|----------------:|-------------:|------------------:|
| IQ3_M  | 10.96 | 9.83 | 9.84 | 59.51 | 48.83 | 51.22 | 49.46 | 51.48 | 51.54 |
| Q4_K_M | 8.59 | 8.03 | 8.02 | 48.28 | 48.72 | 48.70 | 53.48 | 48.73 | 47.97 |
| Q5_K_M | 7.54 | 7.54 | 7.52 | 49.09 | 46.00 | 46.87 | 51.25 | 50.58 | 47.00 |
| Q6_K   | 6.65 | 6.19 | 5.89 | 52.77 | 41.84 | 42.06 | 47.59 | 41.48 | 42.85 |
| Q8_0   | 6.95 | 5.79 | 5.93 | 45.99 | 40.81 | 41.51 | 48.32 | 41.21 | 41.54 |

Notes:

- IQ3_M is the fastest on this setup; Q4_K_M offers stronger quality at similar speed.
- Flash Attention helps variably; larger micro-batches (e.g., `--ubatch-size 1024`) can improve throughput.

## About

- Base model: nvidia/NVIDIA-Nemotron-Nano-9B-v2
- These are GGUF files suitable for llama.cpp and compatible backends.
- Choose a quantization based on your resource/quality needs (see the table above).
- License: [NVIDIA Open Model License](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/)
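The CUDA + Flash Attention configuration from the benchmark table can be approximated with flags like the following. This is a sketch, not the exact benchmark invocation: `-ngl 99` (offload all layers to the GPU) and the 4096-token context are illustrative choices.

```shell
# Full GPU offload with Flash Attention enabled, plus a larger micro-batch
# per the note that --ubatch-size 1024 can improve throughput.
# -ngl 99 and -c 4096 are illustrative, not taken from the benchmark runs.
./llama-server -m NVIDIA-Nemotron-Nano-9B-v2-gguf-Q4_K_M.gguf \
  -c 4096 -ngl 99 -fa --ubatch-size 1024
```

On a 24GB RTX 3090 any quantization in the table fits entirely in VRAM, so full offload is the sensible default.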