ubergarm
MiniMax-M2.5-GGUF
Kimi-K2-Thinking-GGUF
imatrix Quantization of moonshotai/Kimi-K2-Thinking

UPDATE: The `smol-IQ3_KS` scored 77.3% on the aider polyglot benchmark with a 2x speed-up over the similarly sized mainline `UD-IQ3_XXS`! Details in discussion 14 here. Thanks Fernanda24!

The "full quality" baseline `Q4_X` quant runs on both mainline llama.cpp and ik_llama.cpp. The other quants in this collection REQUIRE the ik_llama.cpp fork to support ik's latest SOTA quants and optimizations! Do not download these big files and expect them to run on mainline vanilla llama.cpp, ollama, LM Studio, KoboldCpp, etc.!

NOTE: `ik_llama.cpp` can also run your existing GGUFs from bartowski, unsloth, mradermacher, etc. if you want to try it out before downloading my quants. Some of ik's new quants are supported in the Nexesenex/croco.cpp fork of KoboldCpp, with Windows builds for CUDA 12.9. Also check for Windows builds by Thireus here, which have been CUDA 12.8. These quants provide best-in-class perplexity for the given memory footprint.

Big Thanks

Great job ngxson, compilade, DevQuasar, Bartowski, AesSedai, and the other folks who pulled together hacking to get this out quickly! 🫶 And jukofyork for the `Q4_X` patch!

Shout out to Wendell and the Level1Techs crew, the community Forums, and YouTube Channel! BIG thanks for providing BIG hardware expertise and access to run these experiments and make these great quants available to the community!!!

Also thanks to all the folks in the quanting and inferencing community on the BeaverAI Club Discord and on r/LocalLLaMA for tips and tricks helping each other run, test, and benchmark all the fun new models!

Finally, I really appreciate all the support from aifoundry.org, so check out their open source RISC-V solutions, and of course huggingface for hosting all these big quants!

Quant Collection

Perplexity computed against wiki.test.raw. The `Q4_X` version scores perplexity equivalent to a full 1TB `Q8_0` test quant, using a one-line patch to adjust `q4_0` to better fit the original QAT target quantization.
Discussions are ongoing on llama.cpp PR#17064 and directly with moonshot on their huggingface discussions, as it seems they possibly only used 15 of the 16 possible 4-bit values.

The `Q4_X` is the "full quality" baseline version of the model and the only one in this collection which works on both ik_llama.cpp and mainline llama.cpp. It does not use an imatrix and was created by going from the original model to full bf16 before further quantization. The exact PR used is linked below in the references. This quant was used to make the imatrix for the rest of the collection.

- `smol-IQ4_KSS` 485.008 GiB (4.059 BPW) - Final estimate: PPL = 2.1343 +/- 0.00934
- `IQ3_K` 459.432 GiB (3.845 BPW) - Final estimate: PPL = 2.1456 +/- 0.00941
  - NOTE: Given there were some issues with the original q4_0 quantization, I've replaced the original IQ3_K with this new smaller one using the patched q4_X quantization. The original one was `474.772 GiB (3.973 BPW)` and will be squash-deleted soon to save on public quota. This new one uses the q4_X patch and only applies the imatrix to the iq3_k tensors, not to the q8_0 or q4_X. More details in discussion 4 here. It has almost the same perplexity, so a good improvement.
- `smol-IQ3_KS` 388.258 GiB (3.249 BPW) - Final estimate: PPL = 2.2363 +/- 0.01004
- `IQ2_KL` 348.883 GiB (2.920 BPW) - Final estimate: PPL = 2.3735 +/- 0.01082
- `smol-IQ2_KL` 329.195 GiB (2.755 BPW) - Final estimate: PPL = 2.4550 +/- 0.01129
- `smol-IQ2_KS` 270.133 GiB (2.261 BPW) - Final estimate: PPL = 2.9361 +/- 0.01451
- `smol-IQ1_KT` 218.936 GiB (1.832 BPW) - Final estimate: PPL = 3.5931 +/- 0.01889

Also keep in mind that `KT` trellis quants are generally slower during TG, likely due to a compute bottleneck when running on CPU, but if it is all you can fit, then well...

Quick Start

You will want to override the template, given they patched the original template here: https://huggingface.co/moonshotai/Kimi-K2-Thinking/blob/main/chat_template.jinja

You can do stuff like `--jinja --chat-template-file ./models/templates/Kimi-K2-Thinking.jinja`.
You will also need to pass `--special` for it to output the thinking tags correctly, depending on the endpoint and client used (thanks u/Melodic-Network4374), but note it will then also print the trailing special token, so you can set your client to use that as a stop string.

If you don't have enough RAM+VRAM, remove `--no-mmap` to mmap() "troll rig" it, paging weights read-only off of disk for a couple tok/sec, maybe, depending.

Adjust `--threads` and `--threads-batch` as needed. For smaller CPUs I recommend setting them both equal to the number of physical cores. For an AMD 9950X that would be `-t 16`, for example. Experiment on larger rigs, especially with multi-socket NUMA considerations (avoid cross-NUMA memory access if possible).

With ik_llama.cpp you can get some extra VRAM by using `-amb 512` to fix the size of the MLA computation buffers (only works on models with MLA-style attention like Kimi-K2 and DeepSeek).

References

- ik_llama.cpp Getting Started Guide (already out of date lol)
- ubergarm-imatrix-calibration-corpus-v02.txt
- moonshotai/Kimi-K2-Thinking/discussions/2
- vllm-project/compressed-tensors/issues/511
- llama.cpp PR#17069
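Putting the flags above together, a hypothetical ik_llama.cpp launch might look like this. The model filename, thread counts, and template path are illustrative, not from the card; adjust them to your rig:

```shell
# Illustrative launch sketch for a hybrid CPU+GPU rig (values are examples)
./build/bin/llama-server \
    -m ./Kimi-K2-Thinking-smol-IQ3_KS-00001-of-00009.gguf \
    --jinja --chat-template-file ./models/templates/Kimi-K2-Thinking.jinja \
    --special \
    --no-mmap \
    -t 16 --threads-batch 16 \
    -amb 512
```

Drop `--no-mmap` if you're short on RAM+VRAM as described above, and set `-t`/`--threads-batch` to your physical core count.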
Qwen3.5-122B-A10B-GGUF
GLM-5.1-GGUF
Qwen3.5-35B-A3B-GGUF
Step-3.5-Flash-GGUF
Qwen3.5-397B-A17B-GGUF
GLM-4.7-Flash-GGUF
GLM-4.6-GGUF
`ik_llama.cpp` imatrix Quantizations of zai-org/GLM-4.6

This quant collection REQUIRES the ik_llama.cpp fork to support ik's latest SOTA quants and optimizations! Do not download these big files and expect them to run on mainline vanilla llama.cpp, ollama, LM Studio, KoboldCpp, etc.!

NOTE: `ik_llama.cpp` can also run your existing GGUFs from bartowski, unsloth, mradermacher, etc. if you want to try it out before downloading my quants. Some of ik's new quants are supported in the Nexesenex/croco.cpp fork of KoboldCpp, with Windows builds for CUDA 12.9. Also check for Windows builds by Thireus here, which have been CUDA 12.8. These quants provide best-in-class perplexity for the given memory footprint.

Big Thanks

Shout out to Wendell and the Level1Techs crew, the community Forums, and YouTube Channel! BIG thanks for providing BIG hardware expertise and access to run these experiments and make these great quants available to the community!!!

Also thanks to all the folks in the quanting and inferencing community on the BeaverAI Club Discord and on r/LocalLLaMA for tips and tricks helping each other run, test, and benchmark all the fun new models!

Quant Collection

Perplexity computed against wiki.test.raw.
These first two are just test quants for baseline perplexity comparison:

- `BF16` 664.707 GiB (16.003 BPW) - Final estimate: PPL = 3.4454 +/- 0.01999
- `Q8_0` 353.259 GiB (8.505 BPW) - Final estimate: PPL = 3.4471 +/- 0.02001

- `IQ5_K` 249.099 GiB (5.997 BPW) - Final estimate: PPL = 3.4428 +/- 0.01993
- `IQ4_K` 207.708 GiB (5.001 BPW) - Final estimate: PPL = 3.4758 +/- 0.02023
- `IQ4_KS` 192.967 GiB (4.646 BPW) - Final estimate: PPL = 3.5309 +/- 0.02057
- `smol-IQ4_KSS` 169.895 GiB (4.090 BPW) - Final estimate: PPL = 3.5911 +/- 0.02092
- `IQ3_KS` 148.390 GiB (3.573 BPW) - Final estimate: PPL = 3.6427 +/- 0.02127
- `IQ2_KL` 127.516 GiB (3.070 BPW) - Final estimate: PPL = 4.1456 +/- 0.02521
- `smol-IQ2_KS` 97.990 GiB (2.359 BPW) - Final estimate: PPL = 5.2760 +/- 0.03410
  - Did not use PR624 https://github.com/ikawrakow/ik_llama.cpp/pull/624 (it would probably give slightly better perplexity, but a pain to rebase and confirm at this point, lol)
- `smol-IQ1_KT` 80.906 GiB (1.948 BPW) - Final estimate: PPL = 5.9034 +/- 0.03812

Quick Start

If you want to disable thinking, add `/nothink` (correct, no underscore) at the end of your prompt.

References

- ik_llama.cpp Getting Started Guide (already out of date lol)
- ubergarm-imatrix-calibration-corpus-v02.txt
- bartowski mainline llama.cpp GLM-4.6 fix PR16359
- ik_llama.cpp PR814
- Downtown-Case speed benchmarks on local gaming rig
- More good quants by AesSedai/GLM-4.6-GGUF
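As a quick sanity check on the size/BPW numbers above, the file size and bits-per-weight together imply the parameter count. Using the `Q8_0` row (a sketch; GGUF tensor accounting differs slightly from the headline parameter count):

```shell
# parameter count implied by file size (GiB) and bits-per-weight
gib=353.259   # Q8_0 file size from the table above
bpw=8.505     # its bits per weight
params_b=$(awk -v g="$gib" -v b="$bpw" 'BEGIN{printf "%.1f", g * 2^30 * 8 / b / 1e9}')
echo "implied parameter count: ~${params_b}B"
```

This lands in the mid-350B range, consistent with GLM-4.6's advertised size; the same arithmetic works for any row in the table.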
Qwen3.5-27B-GGUF
Ling-1T-GGUF
`ik_llama.cpp` imatrix Quantizations of inclusionAI/Ling-1T

This quant collection REQUIRES the ik_llama.cpp fork to support ik's latest SOTA quants and optimizations! Do not download these big files and expect them to run on mainline vanilla llama.cpp, ollama, LM Studio, KoboldCpp, etc.!

NOTE: `ik_llama.cpp` can also run your existing GGUFs from bartowski, unsloth, mradermacher, etc. if you want to try it out before downloading my quants. Some of ik's new quants are supported in the Nexesenex/croco.cpp fork of KoboldCpp, with Windows builds for CUDA 12.9. Also check for Windows builds by Thireus here, which have been CUDA 12.8. These quants provide best-in-class perplexity for the given memory footprint.

Big Thanks

Shout out to Wendell and the Level1Techs crew, the community Forums, and YouTube Channel! BIG thanks for providing BIG hardware expertise and access to run these experiments and make these great quants available to the community!!!

Also thanks to all the folks in the quanting and inferencing community on the BeaverAI Club Discord and on r/LocalLLaMA for tips and tricks helping each other run, test, and benchmark all the fun new models!

Finally, I appreciate all the support from aifoundry.org and team, as well as huggingface for hosting all these big quants!

Quant Collection

Perplexity computed against wiki.test.raw.

This one is just a test quant for baseline perplexity comparison:

- `Q8_0` 989.678 GiB (8.504 BPW) - Final estimate: PPL = 1.9859 +/- 0.00907

- `IQ5_K` 689.866 GiB (5.928 BPW) - Final estimate: PPL = 1.9897 +/- 0.00910
- `smol-IQ4_KSS` 471.923 GiB (4.055 BPW) - Final estimate: PPL = 2.0176 +/- 0.00927
- `smol-IQ3_KS` 378.853 GiB (3.255 BPW) - Final estimate: PPL = 2.0770 +/- 0.00968
- `IQ2_K` 330.923 GiB (2.843 BPW) - Final estimate: PPL = 2.2169 +/- 0.01055
  - This will use full q8_0 for the VRAM layers and likely suits 384 GiB RAM/VRAM.
- `smol-IQ2_KS` 264.984 GiB (2.277 BPW) - Final estimate: PPL = 2.4429 +/- 0.01191
  - Should hopefully fit in 250 GiB RAM + 15 GiB VRAM + kv-cache/context... 🤞 Leaving the `attn.`/first 4 dense layers/shexp at full q8_0 would take about 20.1 GiB VRAM, which is how the `iqN_k` quants are done.
- `smol-IQ2_XXS` 249.92 GiB (2.15 BPW) - Final estimate: PPL = 2.5870 +/- 0.01279
  - This is a rare mainline-compatible quant I released for folks to test this PR: https://github.com/ggml-org/llama.cpp/pull/16063
- `smol-IQ1_KT` 215.423 GiB (1.851 BPW) - Final estimate: PPL = 2.8581 +/- 0.01471
  - One of the smallest yet functional quants available, but keep in mind KT types can be slower for CPU inferencing, likely due to being compute-bottlenecked calculating the trellis during TG. Still worth a try if this is all your rig can fit!

References

- ik_llama.cpp Getting Started Guide (already out of date lol)
- ubergarm-imatrix-calibration-corpus-v02.txt
- ik_llama.cpp PR833
- mainline llama.cpp PR16063
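The RAM+VRAM split above is just file size minus whatever you offload to the GPU. A rough fit check (ignores kv-cache/context and compute-buffer overhead, so treat it as a lower bound):

```shell
# rough system-RAM requirement after GPU offload (lower bound)
total_gib=264.984   # smol-IQ2_KS file size from above
vram_gib=15         # GiB of weights held on the GPU
ram_gib=$(awk -v t="$total_gib" -v v="$vram_gib" 'BEGIN{printf "%.0f", t - v}')
echo "needs about ${ram_gib} GiB system RAM for the remaining weights"
```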
MiniMax-M2.7-GGUF
Qwen3-Coder-Next-GGUF
GLM-4.5-Air-GGUF
`ik_llama.cpp` imatrix Quantizations of zai-org/GLM-4.5-Air

This quant collection REQUIRES the ik_llama.cpp fork to support ik's latest SOTA quants and optimizations! Do not download these big files and expect them to run on mainline vanilla llama.cpp, ollama, LM Studio, KoboldCpp, etc.!

NOTE: `ik_llama.cpp` can also run your existing GGUFs from bartowski, unsloth, mradermacher, etc. if you want to try it out before downloading my quants. Some of ik's new quants are supported in the Nexesenex/croco.cpp fork of KoboldCpp, with Windows builds for CUDA 12.9. Also check for Windows builds by Thireus here, which have been CUDA 12.8. These quants provide best-in-class perplexity for the given memory footprint.

Big Thanks

Shout out to Wendell and the Level1Techs crew, the community Forums, and YouTube Channel! BIG thanks for providing BIG hardware expertise and access to run these experiments and make these great quants available to the community!!!

Also thanks to all the folks in the quanting and inferencing community on the BeaverAI Club Discord and on r/LocalLLaMA for tips and tricks helping each other run, test, and benchmark all the fun new models!

Quant Collection

Perplexity computed against wiki.test.raw.
These first two are just test quants for baseline perplexity comparison:

- `BF16` 205.811 GiB (16.004 BPW) - Final estimate: PPL = 4.5704 +/- 0.02796
- `Q8_0` 109.381 GiB (8.505 BPW) - Final estimate: PPL = 4.5798 +/- 0.02804

- `IQ5_K` 77.704 GiB (6.042 BPW) - Final estimate: PPL = 4.5867 +/- 0.02806
- `IQ5_KS` 72.855 GiB (5.665 BPW) - Final estimate: PPL = 4.5948 +/- 0.02815
- `IQ4_K` 62.910 GiB (4.892 BPW) - Final estimate: PPL = 4.6273 +/- 0.02839
- `IQ4_KSS` 54.801 GiB (4.261 BPW) - Final estimate: PPL = 4.7056 +/- 0.02909
- `IQ3_KS` 49.072 GiB (3.816 BPW) - Final estimate: PPL = 4.7975 +/- 0.02972
- `IQ2_KL` 43.870 GiB (3.411 BPW) - Final estimate: PPL = 5.0697 +/- 0.03166
- `IQ1_KT` 36.039 GiB (2.802 BPW) - Final estimate: PPL = 5.8214 +/- 0.03767

Quick Start

If you want to disable thinking, add `/nothink` (correct, no underscore) at the end of your prompt.

References

- ik_llama.cpp Getting Started Guide (already out of date lol)
- ubergarm-imatrix-calibration-corpus-v02.txt
- Mainline llama.cpp Draft PR14939
- ik_llama.cpp GLM-4.5 MoE PR668
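Note that the larger quants sit within each other's error bars. For example, comparing the IQ5_K row against the Q8_0 baseline, the PPL difference is far smaller than the combined one-sigma uncertainty of the two estimates (a rough check; the two runs share the same test set, so this overstates the independent error a bit):

```shell
# is the IQ5_K vs Q8_0 PPL gap distinguishable from measurement noise?
diff=$(awk 'BEGIN{printf "%.4f", 4.5867 - 4.5798}')
err=$(awk 'BEGIN{printf "%.4f", sqrt(0.02806^2 + 0.02804^2)}')
echo "PPL diff=$diff vs combined 1-sigma err=$err"
```

The gap (~0.007) is well inside the combined error (~0.04), so the two are statistically indistinguishable on this test.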
DeepSeek V3.1 GGUF
`ik_llama.cpp` imatrix Quantizations of deepseek-ai/DeepSeek-V3.1

This quant collection REQUIRES the ik_llama.cpp fork to support ik's latest SOTA quants and optimizations! Do not download these big files and expect them to run on mainline vanilla llama.cpp, ollama, LM Studio, KoboldCpp, etc.!

NOTE: `ik_llama.cpp` can also run your existing GGUFs from bartowski, unsloth, mradermacher, etc. if you want to try it out before downloading my quants. Some of ik's new quants are supported in the Nexesenex/croco.cpp fork of KoboldCpp, with Windows builds for CUDA 12.9. Also check for Windows builds by Thireus here, which have been CUDA 12.8. These quants provide best-in-class perplexity for the given memory footprint.

Big Thanks

Shout out to Wendell and the Level1Techs crew, the community Forums, and YouTube Channel! BIG thanks for providing BIG hardware expertise and access to run these experiments and make these great quants available to the community!!!

Also thanks to all the folks in the quanting and inferencing community on the BeaverAI Club Discord and on r/LocalLLaMA for tips and tricks helping each other run, test, and benchmark all the fun new models!

Quant Collection

Perplexity computed against wiki.test.raw.
This first one is just a "pure" test quant for baseline perplexity comparison:

- `Q8_0` 664.295 GiB (8.504 BPW) - Final estimate: PPL = 3.3473 +/- 0.01935

- `IQ5_K` 465.075 GiB (5.944 BPW) - Final estimate: PPL = 3.3550 +/- 0.01942
- `IQ4_K` 384.765 GiB (4.925 BPW) - Final estimate: PPL = 3.3715 +/- 0.01956
- `IQ4_KS` 363.151 GiB (4.649 BPW) - Final estimate: PPL = 3.3806 +/- 0.01966
- `IQ4_KSS` 325.088 GiB (4.162 BPW) - Final estimate: PPL = 3.3887 +/- 0.01968
- `smol-IQ4_KSS` 318.745 GiB (4.080 BPW) - Final estimate: PPL = 3.3898 +/- 0.01964
- `IQ3_K` 293.177 GiB (3.753 BPW) - Final estimate: PPL = 3.4260 +/- 0.01995
  - NOTE: Made with https://github.com/ikawrakow/ik_llama.cpp/pull/624
- `IQ3_KS` 277.397 GiB (3.551 BPW) - Final estimate: PPL = 3.4534 +/- 0.02019
  - NOTE: Made with https://github.com/ikawrakow/ik_llama.cpp/pull/624
- `IQ2_KL` 231.206 GiB (2.960 BPW) - Final estimate: PPL = 3.6312 +/- 0.02161
  - NOTE: Made with https://github.com/ikawrakow/ik_llama.cpp/pull/624
- `IQ2_KT` 204.592 GiB (2.619 BPW) - Final estimate: PPL = 3.8109 +/- 0.02294
  - Remember, the KT quants are better suited for full GPU offload, as calculating the trellis on CPU bottlenecks token generation.
- `IQ2_KS` 193.144 GiB (2.472 BPW) - Final estimate: PPL = 3.9583 +/- 0.02433
  - NOTE: Made with https://github.com/ikawrakow/ik_llama.cpp/pull/624
- `IQ1_KT` 154.968 GiB (1.984 BPW) - Final estimate: PPL = 4.3987 +/- 0.02786
  - Remember, the KT quants are better suited for full GPU offload, e.g. 2x RTX 6000 Pro Blackwells in this case.
- `IQ1_S` 133.610 GiB (1.710 BPW) - Final estimate: PPL = 5.3113 +/- 0.03507

Multi-GPU is well supported with custom `-ot ...=CUDA1` offload regex arguments etc.

References

- ik_llama.cpp Getting Started Guide (already out of date lol)
- Quant Cookers Guide
- Compiling triton-cpu
- fp8 to bf16 safetensors casting without GPU
- avx512 avxvnni Zen5 experimental optimizations
- ubergarm-imatrix-calibration-corpus-v02.txt
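As a sketch of the multi-GPU `-ot` overrides mentioned above: the layer ranges, model filename, and split here are hypothetical examples to show the regex shape, not a tested recipe (tensor names follow the usual DeepSeek GGUF naming):

```shell
# Hypothetical 2-GPU hybrid split: a few routed-expert layers pinned per
# GPU, the rest of the experts on CPU, everything else offloaded via -ngl
./build/bin/llama-server \
    -m ./DeepSeek-V3.1-IQ2_KT-00001-of-00005.gguf \
    -ngl 99 \
    -ot "blk\.(3|4|5)\.ffn_.*_exps=CUDA0" \
    -ot "blk\.(6|7|8)\.ffn_.*_exps=CUDA1" \
    -ot "exps=CPU"
```

The `-ot` patterns are matched in order, so the catch-all `exps=CPU` only applies to expert tensors not already pinned to a GPU.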
Qwen3-Coder-30B-A3B-Instruct-GGUF
`ik_llama.cpp` imatrix Quantizations of Qwen/Qwen3-Coder-30B-A3B-Instruct

This quant collection REQUIRES the ik_llama.cpp fork to support ik's latest SOTA quants and optimizations! Do not download these big files and expect them to run on mainline vanilla llama.cpp, ollama, LM Studio, KoboldCpp, etc.!

NOTE: `ik_llama.cpp` can also run your existing GGUFs from bartowski, unsloth, mradermacher, etc. if you want to try it out before downloading my quants. Some of ik's new quants are supported in the Nexesenex/croco.cpp fork of KoboldCpp. These quants provide best-in-class perplexity for the given memory footprint.

Big Thanks

Shout out to Wendell and the Level1Techs crew, the community Forums, and YouTube Channel! BIG thanks for providing BIG hardware expertise and access to run these experiments and make these great quants available to the community!!!

Also thanks to all the folks in the quanting and inferencing community on the BeaverAI Club Discord and on r/LocalLLaMA for tips and tricks helping each other run, test, and benchmark all the fun new models!

Quant Collection

Perplexity computed against wiki.test.raw.

These first three are just test quants for baseline perplexity comparison:

- `bf16` 56.894 GiB (16.007 BPW) - Final estimate: PPL = 9.5334 +/- 0.07560
- `Q8_0` 30.247 GiB (8.510 BPW) - Final estimate: PPL = 9.5317 +/- 0.07551 (NOTE: lower than BF16, but I didn't use it for the "baseline"...)
- `Q4_0` 16.111 GiB (4.533 BPW) - Final estimate: PPL = 9.7225 +/- 0.07712

- `IQ5_K` 21.324 GiB (5.999 BPW) - Final estimate: PPL = 9.5930 +/- 0.07614
- `IQ4_K` 17.878 GiB (5.030 BPW) - Final estimate: PPL = 9.6023 +/- 0.07613
- `IQ4_KSS` 15.531 GiB (4.370 BPW) - Final estimate: PPL = 9.6441 +/- 0.07648
- `IQ3_K` 14.509 GiB (4.082 BPW) - Final estimate: PPL = 9.6849 +/- 0.0768
- `IQ3_KS` 13.633 GiB (3.836 BPW) - Final estimate: PPL = 9.7940 +/- 0.07795
- `IQ2_KL` 11.516 GiB (3.240 BPW) - Final estimate: PPL = 10.0475 +/- 0.08016
- `IQ2_KT` 9.469 GiB (2.664 BPW) - Final estimate: PPL = 10.1352 +/- 0.08007
- `IQ1_KT` 7.583 GiB (2.133 BPW) - Final estimate: PPL = 11.0592 +/- 0.08760

References

- ik_llama.cpp Getting Started Guide (already out of date lol)
- ubergarm-imatrix-calibration-corpus-v02.txt
Qwen3-30B-A3B-Thinking-2507-GGUF
`ik_llama.cpp` imatrix Quantizations of Qwen/Qwen3-30B-A3B-Thinking-2507

This quant collection REQUIRES the ik_llama.cpp fork to support ik's latest SOTA quants and optimizations! Do not download these big files and expect them to run on mainline vanilla llama.cpp, ollama, LM Studio, KoboldCpp, etc.!

NOTE: `ik_llama.cpp` can also run your existing GGUFs from bartowski, unsloth, mradermacher, etc. if you want to try it out before downloading my quants. Some of ik's new quants are supported in the Nexesenex/croco.cpp fork of KoboldCpp. These quants provide best-in-class perplexity for the given memory footprint.

Big Thanks

Shout out to Wendell and the Level1Techs crew, the community Forums, and YouTube Channel! BIG thanks for providing BIG hardware expertise and access to run these experiments and make these great quants available to the community!!!

Also thanks to all the folks in the quanting and inferencing community on the BeaverAI Club Discord and on r/LocalLLaMA for tips and tricks helping each other run, test, and benchmark all the fun new models!

Quant Collection

Perplexity computed against wiki.test.raw.

These first three are just test quants for baseline perplexity comparison:

- `bf16` 56.894 GiB (16.007 BPW) - Final estimate: PPL = 7.3149 +/- 0.05076
- `Q8_0` 30.247 GiB (8.510 BPW) - Final estimate: PPL = 7.3284 +/- 0.05091
- `Q4_0` 16.111 GiB (4.533 BPW) - Final estimate: PPL = 7.4534 +/- 0.05151

- `IQ5_K` 21.324 GiB (5.999 BPW) - Final estimate: PPL = 7.3440 +/- 0.05091
- `IQ4_K` 17.878 GiB (5.030 BPW) - Final estimate: PPL = 7.3634 +/- 0.05104
- `IQ4_KSS` 15.531 GiB (4.370 BPW) - Final estimate: PPL = 7.3861 +/- 0.05128
- `IQ4_KT` 14.438 GiB (4.062 BPW) - Final estimate: PPL = 7.5020 +/- 0.05230
  - Mostly pure IQ4_KT, meant for full GPU offload similar to turboderp-org/exllamav3. Check out ArtusDev's HuggingFace page for some excellent EXL3 quants!
- `IQ3_K` 14.509 GiB (4.082 BPW) - Final estimate: PPL = 7.4360 +/- 0.05162
- `IQ3_KS` 13.633 GiB (3.836 BPW) - Final estimate: PPL = 7.4959 +/- 0.05204
- `IQ2_KL` 11.516 GiB (3.240 BPW) - Final estimate: PPL = 7.6992 +/- 0.05345
- `IQ2_KT` 9.469 GiB (2.664 BPW) - Final estimate: PPL = 8.0207 +/- 0.05638
- `IQ1_KT` 7.583 GiB (2.133 BPW) - Final estimate: PPL = 8.8341 +/- 0.06231

Quick Start

Full GPU offload with CUDA or Vulkan (for AMD GPUs).

imatrix note

I used @eaddario's eaddario-imatrix-corpus-combined-all-medium converted to text like so:

References

- ik_llama.cpp Getting Started Guide (already out of date lol)
- ubergarm-imatrix-calibration-corpus-v02.txt
- eaddario/imatrix-calibration
DeepSeek-V3.1-Terminus-GGUF
Kimi-K2-Instruct-0905-GGUF
`ik_llama.cpp` imatrix Quantizations of moonshotai/Kimi-K2-Instruct-0905

This quant collection REQUIRES the ik_llama.cpp fork to support ik's latest SOTA quants and optimizations! Do not download these big files and expect them to run on mainline vanilla llama.cpp, ollama, LM Studio, KoboldCpp, etc.!

NOTE: `ik_llama.cpp` can also run your existing GGUFs from bartowski, unsloth, mradermacher, etc. if you want to try it out before downloading my quants. Some of ik's new quants are supported in the Nexesenex/croco.cpp fork of KoboldCpp. For pre-built Windows binaries of ik_llama.cpp, check out Thireus' fork here. These quants provide best-in-class perplexity for the given memory footprint.

Big Thanks

Shout out to Wendell and the Level1Techs crew, the community Forums, and YouTube Channel! BIG thanks for providing BIG hardware expertise and access to run these experiments and make these great quants available to the community!!!

Also thanks to all the folks in the quanting and inferencing community on the BeaverAI Club Discord and on r/LocalLLaMA for tips and tricks helping each other run, test, and benchmark all the fun new models!

The current imatrix dat file seems to be missing entries for just the single dense layer and shared expert, so all my recipes use `q8_0` for those.

For notes on tool-calling API endpoints, check out the details in this PR: https://github.com/ikawrakow/ik_llama.cpp/pull/723

`smol` here simply means the routed-experts recipe uses the same quantization for the down as well as the (gate|up) tensors.
Quant Collection

Compare with the baseline perplexity of the full-size `Q8_0` 1016.117 GiB (8.504 BPW):

- `smol-IQ5_KS` 632.664 GiB (5.295 BPW) - Final estimate: PPL = 2.4526 +/- 0.01182
- `smol-IQ4_KSS` 485.008 GiB (4.059 BPW) - Final estimate: PPL = 2.5185 +/- 0.01221
- `IQ4_KS` 553.624 GiB (4.633 BPW) - Final estimate: PPL = 2.4641 +/- 0.01190
- `IQ3_KS` 420.558 GiB (3.520 BPW) - Final estimate: PPL = 2.5640 +/- 0.01262
- `smol-IQ3_KS` 388.258 GiB (3.249 BPW) - Final estimate: PPL = 2.5902 +/- 0.01284
- `IQ2_KL` 358.419 GiB (3.000 BPW) - Final estimate: PPL = 2.7993 +/- 0.01416
- `smol-IQ2_KL` 329.195 GiB (2.755 BPW) - Final estimate: PPL = 2.9294 +/- 0.01499
- `IQ2_KS` 289.820 GiB (2.425 BPW) - Final estimate: PPL = 3.2478 +/- 0.01721
- `smol-IQ2_KS` 270.133 GiB (2.261 BPW) - Final estimate: PPL = 3.4977 +/- 0.01924
- `smol-IQ1_KT` 218.936 GiB (1.832 BPW) - Final estimate: PPL = 4.2224 +/- 0.02443
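For illustration, a recipe in the spirit of the `smol` description above could be cooked with ik_llama.cpp's `llama-quantize` per-tensor overrides. This assumes its `--custom-q` regex syntax; the tensor patterns, types, and filenames here are illustrative, not the exact recipe used for these quants:

```shell
# Hypothetical "smol" recipe sketch: same type for down and (gate|up)
# routed experts, q8_0 for attn / dense layer / shared expert
./build/bin/llama-quantize \
    --imatrix ./imatrix-Kimi-K2-Instruct-0905.dat \
    --custom-q "ffn_down_exps=iq3_ks,ffn_(gate|up)_exps=iq3_ks" \
    --custom-q "attn=q8_0,shexp=q8_0" \
    ./Kimi-K2-Instruct-0905-BF16.gguf \
    ./Kimi-K2-Instruct-0905-smol-IQ3_KS.gguf \
    IQ3_KS
```

A non-`smol` mix would simply give `ffn_down_exps` one step larger a type than `ffn_(gate|up)_exps`.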
DeepSeek-R1-0528-GGUF
Qwen3-30B-A3B-Instruct-2507-GGUF
`ik_llama.cpp` imatrix Quantizations of Qwen/Qwen3-30B-A3B-Instruct-2507

This quant collection REQUIRES the ik_llama.cpp fork to support ik's latest SOTA quants and optimizations! Do not download these big files and expect them to run on mainline vanilla llama.cpp, ollama, LM Studio, KoboldCpp, etc.!

NOTE: `ik_llama.cpp` can also run your existing GGUFs from bartowski, unsloth, mradermacher, etc. if you want to try it out before downloading my quants. Some of ik's new quants are supported in the Nexesenex/croco.cpp fork of KoboldCpp. These quants provide best-in-class perplexity for the given memory footprint.

Big Thanks

Shout out to Wendell and the Level1Techs crew, the community Forums, and YouTube Channel! BIG thanks for providing BIG hardware expertise and access to run these experiments and make these great quants available to the community!!!

Also thanks to all the folks in the quanting and inferencing community on the BeaverAI Club Discord and on r/LocalLLaMA for tips and tricks helping each other run, test, and benchmark all the fun new models!

Quant Collection

Perplexity computed against wiki.test.raw.
These first two are just test quants for baseline perplexity comparison:

- `bf16` 56.894 GiB (16.007 BPW) - Final estimate: PPL = 7.3594 +/- 0.05170
- `Q8_0` 30.247 GiB (8.510 BPW) - Final estimate: PPL = 7.3606 +/- 0.05171

- `IQ5_K` 21.324 GiB (5.999 BPW) - Final estimate: PPL = 7.3806 +/- 0.05170
- `IQ4_K` 17.878 GiB (5.030 BPW) - Final estimate: PPL = 7.3951 +/- 0.05178
- `IQ4_KSS` 15.531 GiB (4.370 BPW) - Final estimate: PPL = 7.4392 +/- 0.05225
- `IQ3_K` 14.509 GiB (4.082 BPW) - Final estimate: PPL = 7.4991 +/- 0.05269
- `IQ3_KS` 13.633 GiB (3.836 BPW) - Final estimate: PPL = 7.5512 +/- 0.05307
- `IQ2_KL` 11.516 GiB (3.240 BPW) - Final estimate: PPL = 7.7121 +/- 0.05402
- `IQ2_KT` 9.469 GiB (2.664 BPW) - Final estimate: PPL = 8.0270 +/- 0.05698
- `IQ1_KT` 7.583 GiB (2.133 BPW) - Final estimate: PPL = 8.7273 +/- 0.06185

Quick Start

Full GPU offload with CUDA or Vulkan (for AMD GPUs).

imatrix note

I used @eaddario's eaddario-imatrix-corpus-combined-all-medium converted to text like so:

References

- ik_llama.cpp Getting Started Guide (already out of date lol)
- ubergarm-imatrix-calibration-corpus-v02.txt
- eaddario/imatrix-calibration
Kimi-K2-Instruct-GGUF
`ik_llama.cpp` imatrix Quantizations of moonshotai/Kimi-K2-Instruct

This quant collection REQUIRES the ik_llama.cpp fork to support ik's latest SOTA quants and optimizations! Do not download these big files and expect them to run on mainline vanilla llama.cpp, ollama, LM Studio, KoboldCpp, etc.!

NOTE: `ik_llama.cpp` can also run your existing GGUFs from bartowski, unsloth, mradermacher, etc. if you want to try it out before downloading my quants. Some of ik's new quants are supported in the Nexesenex/croco.cpp fork of KoboldCpp. These quants provide best-in-class perplexity for the given memory footprint.

Big Thanks

Shout out to Wendell and the Level1Techs crew, the community Forums, and YouTube Channel! BIG thanks for providing BIG hardware expertise and access to run these experiments and make these great quants available to the community!!!

Also thanks to all the folks in the quanting and inferencing community on the BeaverAI Club Discord and on r/LocalLLaMA for tips and tricks helping each other run, test, and benchmark all the fun new models!

UPDATED RECIPES

Updated with new, better, lower-perplexity recipes and the world's smallest Kimi-K2-Instruct: `smol-IQ1_KT` at 219.375 GiB (1.835 BPW). Please ask any questions in this discussion here, thanks! Old versions are still available as described in the discussion at tag/revision v0.1.

Quant Collection

Compare with the perplexity of the full-size `Q8_0` 1016.623 GiB (8.504 BPW):

- v0.2 `IQ4_KS` 554.421 GiB (4.638 BPW) - Final estimate: PPL = 2.9584 +/- 0.01473
  - Special mix of `IQ4_KS` `ffn_(gate|up)_exps` and `IQ5_KS` `ffn_down_exps` routed experts.
- v0.2 `IQ3_KS` 430.908 GiB (3.604 BPW) - Final estimate: PPL = 3.0226 +/- 0.01518
  - Special mix of `IQ3_KS` `ffn_(gate|up)_exps` and `IQ4_KS` `ffn_down_exps` routed experts.
- v0.2 `IQ2_KL` 349.389 GiB (2.923 BPW) - Final estimate: PPL = 3.1813 +/- 0.01619
  - Special mix with the brand new SOTA `IQ2_KL` `ffn_(gate|up)_exps` and `IQ3_KS` `ffn_down_exps` routed experts.
- v0.2 `smol-IQ2_KL` 329.702 GiB (2.758 BPW) - Final estimate: PPL = 3.4086 +/- 0.01773
  - Special mix of `IQ2_KL` `ffn_(gate|up)_exps` and also `IQ2_KL` `ffn_down_exps` routed experts.
- v0.2 `IQ2_KS` 290.327 GiB (2.429 BPW) - Final estimate: PPL = 3.6827 +/- 0.01957
  - Special mix with `IQ2_KS` `ffn_(gate|up)_exps` and the brand new SOTA `IQ2_KL` `ffn_down_exps` routed experts.
- v0.2 `IQ1_KT` 234.141 GiB (1.959 BPW) - Final estimate: PPL = 3.9734 +/- 0.02152
  - Special mix of `IQ1_KT` `ffn_(gate|up)_exps` and `IQ2_KT` `ffn_down_exps` routed experts.
- v0.2 `smol-IQ1_KT` 219.375 GiB (1.835 BPW) - Final estimate: PPL = 4.2187 +/- 0.02325
  - Special mix of `IQ1_KT` `ffn_(gate|up)_exps` and also `IQ1_KT` `ffn_down_exps` routed experts.

References

- ik_llama.cpp Getting Started Guide (already out of date lol)
- mainline llama.cpp PR
- gabriellarsion PR author test repo discussion
Qwen3-Coder-480B-A35B-Instruct-GGUF
Qwen3-235B-A22B-Instruct-2507-GGUF
`ik_llama.cpp` imatrix Quantizations of Qwen/Qwen3-235B-A22B-Instruct-2507

This quant collection REQUIRES the ik_llama.cpp fork to support ik's latest SOTA quants and optimizations! Do not download these big files and expect them to run on mainline vanilla llama.cpp, ollama, LM Studio, KoboldCpp, etc.!

NOTE: `ik_llama.cpp` can also run your existing GGUFs from bartowski, unsloth, mradermacher, etc. if you want to try it out before downloading my quants. Some of ik's new quants are supported in the Nexesenex/croco.cpp fork of KoboldCpp. These quants provide best-in-class perplexity for the given memory footprint.

Big Thanks

Shout out to Wendell and the Level1Techs crew, the community Forums, and YouTube Channel! BIG thanks for providing BIG hardware expertise and access to run these experiments and make these great quants available to the community!!!

Also thanks to all the folks in the quanting and inferencing community on the BeaverAI Club Discord and on r/LocalLLaMA for tips and tricks helping each other run, test, and benchmark all the fun new models!

Quant Collection

Perplexity computed against wiki.test.raw.

These first two are just test quants for baseline perplexity comparison:

- `bf16` 437.989 GiB (16.003 BPW) - Final estimate: PPL = 4.3079 +/- 0.02544
- `Q8_0` 232.769 GiB (8.505 BPW) - Final estimate: PPL = 4.3139 +/- 0.02550

- `IQ5_K` 161.722 GiB (5.909 BPW) - Final estimate: PPL = 4.3351 +/- 0.02566
- `IQ4_K` 134.183 GiB (4.903 BPW) - Final estimate: PPL = 4.3668 +/- 0.02594
- `pure-IQ4_KS` 116.994 GiB (4.275 BPW) - Final estimate: PPL = 4.4156 +/- 0.02624
- `IQ4_KSS` 115.085 GiB (4.205 BPW) - Final estimate: PPL = 4.4017 +/- 0.02614
  - This one is a little funky, just for fun. Seems smort!
- `IQ3_K` 106.644 GiB (3.897 BPW) - Final estimate: PPL = 4.4561 +/- 0.02657
- `IQ3_KS` 101.308 GiB (3.702 BPW) - Final estimate: PPL = 4.4915 +/- 0.02685
- `IQ2_KL` 81.866 GiB (2.991 BPW) - Final estimate: PPL = 4.7912 +/- 0.02910

Quick Start

This example is for a single CUDA GPU hybrid inferencing with CPU/RAM.
Check the ik_llama.cpp discussions or my other quants for more examples, multi-GPU setups, etc.

References

- ik_llama.cpp Getting Started Guide (already out of date lol)
GLM-5-GGUF
Qwen3-235B-A22B-Thinking-2507-GGUF
`ik_llama.cpp` imatrix Quantizations of Qwen/Qwen3-235B-A22B-Thinking-2507

This quant collection REQUIRES the ik_llama.cpp fork to support ik's latest SOTA quants and optimizations! Do not download these big files and expect them to run on mainline vanilla llama.cpp, ollama, LM Studio, KoboldCpp, etc.!

NOTE: `ik_llama.cpp` can also run your existing GGUFs from bartowski, unsloth, mradermacher, etc. if you want to try it out before downloading my quants. Some of ik's new quants are supported in the Nexesenex/croco.cpp fork of KoboldCpp. These quants provide best-in-class perplexity for the given memory footprint.

Big Thanks

Shout out to Wendell and the Level1Techs crew, the community Forums, and YouTube Channel! BIG thanks for providing BIG hardware expertise and access to run these experiments and make these great quants available to the community!!!

Also thanks to all the folks in the quanting and inferencing community on the BeaverAI Club Discord and on r/LocalLLaMA for tips and tricks helping each other run, test, and benchmark all the fun new models!

Quant Collection

Perplexity computed against wiki.test.raw.

These first two are just test quants for baseline perplexity comparison:

- `bf16` 437.989 GiB (16.003 BPW) - Final estimate: PPL = 4.1898 +/- 0.02367
- `Q8_0` 232.769 GiB (8.505 BPW) - Final estimate: PPL = 4.1956 +/- 0.02371

- `IQ5_K` 161.722 GiB (5.909 BPW) - Final estimate: PPL = 4.2213 +/- 0.02391
- `IQ4_K` 134.183 GiB (4.903 BPW) - Final estimate: PPL = 4.2407 +/- 0.02406
- `IQ4_KSS` 114.093 GiB (4.169 BPW) - Final estimate: PPL = 4.2799 +/- 0.02423
- `IQ3_K` 106.644 GiB (3.897 BPW) - Final estimate: PPL = 4.3319 +/- 0.02470
- `IQ3_KS` 101.308 GiB (3.702 BPW) - Final estimate: PPL = 4.3718 +/- 0.02509
- `IQ2_KL` 81.866 GiB (2.991 BPW) - Final estimate: PPL = 4.6608 +/- 0.02720

Quick Start

This example is for a single CUDA GPU hybrid inferencing with CPU/RAM. Check the ik_llama.cpp discussions or my other quants for more examples, multi-GPU setups, etc.
References ik_llama.cpp Getting Started Guide (already out of date lol)
DeepSeek-V3-0324-GGUF
GLM-4.5-GGUF
Qwen3-235B-A22B-GGUF
gemma-3-27b-it-qat-GGUF
Hunyuan-A13B-Instruct-GGUF
DeepSeek-TNG-R1T2-Chimera-GGUF
`ik_llama.cpp` imatrix Quantizations of DeepSeek-TNG-R1T2-Chimera

This quant collection REQUIRES the ik_llama.cpp fork to support ik's latest SOTA quants and optimizations! Do not download these big files and expect them to run on mainline vanilla llama.cpp, ollama, LM Studio, KoboldCpp, etc! NOTE `ik_llama.cpp` can also run your existing GGUFs from bartowski, unsloth, mradermacher, etc. if you want to try it out before downloading my quants. Some of ik's new quants are supported with the Nexesenex/croco.cpp fork of KoboldCPP. These quants provide best-in-class perplexity for the given memory footprint.

Big Thanks
Shout out to Wendell and the Level1Techs crew, the community Forums, and YouTube Channel! BIG thanks for providing BIG hardware expertise and access to run these experiments and make these great quants available to the community!!! Also thanks to all the folks in the quanting and inferencing community on BeaverAI Club Discord and on r/LocalLLaMA for tips and tricks helping each other run, test, and benchmark all the fun new models!

Quants
For some larger non-imatrix ik quant options check out Kebob/DeepSeek-TNG-R1T2-Chimera-IKGGUF.

- `IQ3_KS` 281.463 GiB (3.598 BPW) - Special mix with all-new `IQ3_KS` `ffn_(gate|up)_exps` and `IQ4_KS` `ffn_down_exps` routed experts. Mostly `iq5_ks/iq4_ks` for attn and shared expert. `iq5_k` `token_embd` and `iq6_k` `output` "head".
- `IQ2_KS` 203.553 GiB (2.602 BPW) - Special mix with `IQ2_KS` `ffn_(gate|up)_exps` and new `IQ3_KS` `ffn_down_exps` routed experts. Mostly `iq5_ks/iq4_ks` for attn and shared expert. `iq5_k` `token_embd` and `iq6_k` `output` "head".
- `IQ2_KT` 171.146 GiB (2.188 BPW) - Designed for RTX 6000 PRO Blackwell with 192GB total VRAM: full offload with (hopefully) full 160k context and sufficiently large batch sizes. These `KT` quant types are quite fast on CUDA but not as fast for TG on CPU inferencing. Special mix of new trellis quants (QTIP/EXL3 style): `IQ2_KT` `ffn_(gate|down|up)_exps` routed experts. Mostly `iq4_kt/iq3_kt` for attn and shared expert. `iq4_k` `token_embd` and `iq5_k` `output` "head".
- `IQ2_XXS` 169.590 GiB (2.168 BPW) - Not recommended, but should be faster and better quality than the IQ1_S, and okay with full offload on multi-GPU. It should be okay for hybrid CPU+GPU inference as well if this size is good for your rig. You probably want to choose the IQ2_KT for full GPU offload. Special mix `IQ2_XXS` `ffn_(gate|up)_exps` and `IQ2_KS` `ffn_down_exps` routed experts. Mostly `iq4_ks/iq3_ks` for attn and shared expert. `iq4_k` `token_embd` and `iq5_k` `output` "head".
- `IQ1_S` 132.915 GiB (1.699 BPW) - Not recommended. "For the desperate". If you can fit a larger model in RAM+VRAM, choose a larger model, as it might even run faster and will definitely have better perplexity (likely better quality). Special mix `IQ1_S` `ffn_(gate|up)_exps` and `IQ1_M` `ffn_down_exps` routed experts. Mostly `iq4_ks/iq3_ks` for attn and shared expert. `iq4_k` `token_embd` and `iq5_k` `output` "head".

- Adjust `--threads` to equal your number of physical cores. Refer to various discussions on my other models for multi-NUMA, dual-socket, and varying `--threads` and `--threads-batch` on larger server rigs.
- If you OOM on VRAM, remove the additional `-ot "...=CUDA0"`, or you can increase offloaded layers if you have more VRAM with multi-GPU targets, e.g. `-ot "blk\.(5|6)\.ffn.*=CUDA1" \`.
- Test out `-rtr` to run-time-repack tensors to `_r4` variants for the layers running on CPU/RAM; this is likely faster at default ubatch sizes. Note this disables mmap(), so you will need enough RAM to malloc all the non-offloaded weights on startup.
- Generally `-ub 2048 -b 2048` or `-ub 4096 -b 4096` can give much faster PP speeds at the cost of some additional VRAM. Test against leaving it at the default `-ub 512 -b 2048`.
- Use `llama-sweep-bench --warmup-batch ...` to benchmark various configurations on your hardware and report to the community!
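The batch-size tips above can be explored systematically with a sweep. A hypothetical `llama-sweep-bench` loop over ubatch sizes, where the model path, `-ot` pattern, and `--threads` value are placeholders for your rig:

```shell
# Hypothetical ubatch-size sweep with llama-sweep-bench.
# Model path, -ot pattern, and --threads are placeholders.
for ub in 512 2048 4096; do
    ./build/bin/llama-sweep-bench --warmup-batch \
        --model /models/DeepSeek-TNG-R1T2-Chimera-IQ2_KS.gguf \
        -ngl 99 -ot exps=CPU \
        -ub "$ub" -b "$ub" \
        --threads 16
done
```
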
TODO
- [ ] Given the `IQ1_S_R4` is not symmetric with `IQ1_S`, it doesn't work with `-rtr`, so I might look into releasing an `R4` variant after some `llama-sweep-bench` testing.
- [ ] Consider a slightly larger model? (gotta free up some disk space lol)

References
ik_llama.cpp
Larger ik quants available here: Kebob/DeepSeek-TNG-R1T2-Chimera-IKGGUF
Getting Started Guide (already out of date lol)