# `ik_llama.cpp` imatrix Quantizations of inclusionAI/Ling-1T

This quant collection REQUIRES the `ik_llama.cpp` fork to support ik's latest SOTA quants and optimizations! Do not download these big files and expect them to run on mainline vanilla llama.cpp, ollama, LM Studio, KoboldCpp, etc.!
NOTE: `ik_llama.cpp` can also run your existing GGUFs from bartowski, unsloth, mradermacher, etc. if you want to try it out before downloading my quants.
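If you haven't built the fork before, a rough build sketch follows; it assumes git, CMake, and a CUDA toolchain are installed, and the exact flags may differ on your system, so see the Getting Started Guide linked in the References for details.

```shell
# Minimal build sketch, assuming git, cmake, and a CUDA toolchain are available.
git clone https://github.com/ikawrakow/ik_llama.cpp
cd ik_llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j "$(nproc)"
# Binaries (llama-server, llama-perplexity, ...) land in ./build/bin/
```

CPU-only builds work too; just drop `-DGGML_CUDA=ON`.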
Some of ik's new quants are supported by the Nexesenex/croco.cpp fork of KoboldCpp, with Windows builds for CUDA 12.9. Also check for Windows builds by Thireus, which have been for CUDA 12.8.
These quants provide best-in-class perplexity for the given memory footprint.
## Big Thanks

Shout out to Wendell and the Level1Techs crew, the community Forums, and YouTube Channel! BIG thanks for providing BIG hardware expertise and access to run these experiments and make these great quants available to the community!!!
Also thanks to all the folks in the quanting and inferencing community on BeaverAI Club Discord and on r/LocalLLaMA for tips and tricks helping each other run, test, and benchmark all the fun new models!
Finally, I appreciate all the support from aifoundry.org and team as well as huggingface for hosting all these big quants!
## Quant Collection

Perplexity computed against wiki.test.raw.
This one is just a test quant for baseline perplexity comparison:

- `Q8_0` 989.678 GiB (8.504 BPW)
  - Final estimate: PPL = 1.9859 +/- 0.00907
- `IQ5_K` 689.866 GiB (5.928 BPW)
  - Final estimate: PPL = 1.9897 +/- 0.00910
- `smol-IQ4_KSS` 471.923 GiB (4.055 BPW)
  - Final estimate: PPL = 2.0176 +/- 0.00927
- `smol-IQ3_KS` 378.853 GiB (3.255 BPW)
  - Final estimate: PPL = 2.0770 +/- 0.00968
- `IQ2_K` 330.923 GiB (2.843 BPW)
  - Final estimate: PPL = 2.2169 +/- 0.01055
  - This one uses full `q8_0` for the VRAM layers and will likely suit rigs with 384 GiB combined RAM/VRAM.
- `smol-IQ2_KS` 264.984 GiB (2.277 BPW)
  - Final estimate: PPL = 2.4429 +/- 0.01191
  - Should hopefully fit in 250 GiB RAM + 15 GiB VRAM + kv-cache/context...🤞
  - Leaving the `attn.` tensors, first 4 dense layers, and `shexp` at full `q8_0` would take about 20.1 GiB VRAM, which is how the `iqN_k` quants are done.
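For hybrid CPU+GPU inferencing along those lines, a sketch of a `llama-server` launch follows; the model filename, context size, and thread count are placeholders to adjust for your rig.

```shell
# Hypothetical launch (model path, context size, and thread count are
# placeholders). -ngl 99 offloads all layers to GPU, then -ot exps=CPU
# overrides the routed-expert tensors back onto CPU/RAM, so only the
# attention/dense/shexp tensors actually occupy VRAM.
./build/bin/llama-server \
    --model Ling-1T-smol-IQ2_KS-00001-of-00006.gguf \
    --ctx-size 32768 \
    -ngl 99 \
    -ot exps=CPU \
    --threads 24 \
    --host 127.0.0.1 --port 8080
```

If you have spare VRAM, you can offload some expert layers too by using a more selective `-ot` regex instead of sending all `exps` tensors to CPU.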
- `smol-IQ2_XXS` 249.92 GiB (2.15 BPW)
  - Final estimate: PPL = 2.5870 +/- 0.01279
  - This is a rare mainline-compatible quant I released for folks to test this PR: https://github.com/ggml-org/llama.cpp/pull/16063
- `smol-IQ1_KT` 215.423 GiB (1.851 BPW)
  - Final estimate: PPL = 2.8581 +/- 0.01471
  - One of the smallest yet still functional quants available, but keep in mind KT types can be slower for CPU inferencing, as they are likely compute-bottlenecked calculating the trellis during token generation (TG). Still worth a try if this is all your rig can fit!
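As a sanity check on the table above, BPW is just the file size converted to bits divided by the model's total parameter count, so working backwards from any row should recover roughly the same parameter count. A quick sketch using the `Q8_0` row (the numbers come straight from the table; no new data):

```shell
# Recover the parameter count implied by the Q8_0 row:
# bytes = GiB * 2^30, bits = bytes * 8, params = bits / BPW.
params=$(awk 'BEGIN { printf "%.0f", 989.678 * 1073741824 * 8 / 8.504 }')
echo "$params"
```

This lands within a fraction of a percent of 1 trillion weights, matching Ling-1T's 1T-parameter scale; the small residual is just rounding in the published GiB/BPW figures.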
## References

- ik_llama.cpp Getting Started Guide (already out of date lol)
- ubergarm-imatrix-calibration-corpus-v02.txt
- ik_llama.cpp PR833
- mainline llama.cpp PR16063