ubergarm
MiniMax-M2.5-GGUF
Kimi-K2-Thinking-GGUF
imatrix Quantization of moonshotai/Kimi-K2-Thinking

UPDATE: The `smol-IQ3_KS` scored 77.3% on the aider polyglot benchmark with a 2x speed-up over the similarly sized mainline `UD-IQ3_XXS`! Details in discussion 14 here. Thanks Fernanda24!

The "full quality" baseline `Q4_X` quant runs on both mainline llama.cpp and ik_llama.cpp. The other quants in this collection REQUIRE the ik_llama.cpp fork to support ik's latest SOTA quants and optimizations! Do not download these big files and expect them to run on mainline vanilla llama.cpp, ollama, LM Studio, KoboldCpp, etc.!

NOTE: `ik_llama.cpp` can also run your existing GGUFs from bartowski, unsloth, mradermacher, etc. if you want to try it out before downloading my quants. Some of ik's new quants are supported in the Nexesenex/croco.cpp fork of KoboldCpp, with Windows builds for CUDA 12.9. Also check for Windows builds by Thireus here, which have been CUDA 12.8. These quants provide best-in-class perplexity for the given memory footprint.

Big Thanks

Great job ngxson, compilade, DevQuasar, Bartowski, AesSedai, and the other folks who pulled together hacking to get this out quickly! 🫶 And jukofyork for the `Q4_X` patch!

Shout out to Wendell and the Level1Techs crew, the community Forums, and YouTube Channel! BIG thanks for providing BIG hardware expertise and access to run these experiments and make these great quants available to the community!!!

Also thanks to all the folks in the quanting and inferencing community on the BeaverAI Club Discord and on r/LocalLLaMA for tips and tricks helping each other run, test, and benchmark all the fun new models!

Finally, I really appreciate all the support from aifoundry.org, so check out their open source RISC-V solutions, and of course huggingface for hosting all these big quants!

Quant Collection

Perplexity computed against wiki.test.raw. The `Q4_X` version scores perplexity equivalent to a full 1TB `Q8_0` test quant, using a one-line patch to adjust `q4_0` to better fit the original QAT target quantization.
Discussions are ongoing on llama.cpp PR#17064 and directly with moonshot on their huggingface discussions, as it seems they possibly only used 15 of the 16 possible 4-bit values.

The `Q4_X` is the "full quality" baseline version of the model and the only one in this collection which works on both ik_llama.cpp and mainline llama.cpp. It does not use an imatrix and was created by going from the original model to full bf16 before further quantization. The exact PR used is linked below in the references. This quant was used to make the imatrix for the rest of the collection.

- `smol-IQ4_KSS` 485.008 GiB (4.059 BPW) - Final estimate: PPL = 2.1343 +/- 0.00934
- `IQ3_K` 459.432 GiB (3.845 BPW) - Final estimate: PPL = 2.1456 +/- 0.00941
  - NOTE: Given there were some issues with the original q4_0 quantization, I've replaced the original IQ3_K with this new smaller one using the patched q4_X quantization. The original one was `474.772 GiB (3.973 BPW)` and will be squash-deleted soon to save on public quota. This new one uses the q4_X patch and only applies the imatrix to the iq3_k tensors, not to the q8_0 or q4_X. More details in discussion 4 here. It has almost the same perplexity, so a good improvement.
- `smol-IQ3_KS` 388.258 GiB (3.249 BPW) - Final estimate: PPL = 2.2363 +/- 0.01004
- `IQ2_KL` 348.883 GiB (2.920 BPW) - Final estimate: PPL = 2.3735 +/- 0.01082
- `smol-IQ2_KL` 329.195 GiB (2.755 BPW) - Final estimate: PPL = 2.4550 +/- 0.01129
- `smol-IQ2_KS` 270.133 GiB (2.261 BPW) - Final estimate: PPL = 2.9361 +/- 0.01451
- `smol-IQ1_KT` 218.936 GiB (1.832 BPW) - Final estimate: PPL = 3.5931 +/- 0.01889

Also keep in mind that `KT` trellis quants are generally slower during TG, likely due to a compute bottleneck when running on CPU, but if it is all you can fit, then well...

Quick Start

You will want to override the template, given they patched the original template here: https://huggingface.co/moonshotai/Kimi-K2-Thinking/blob/main/chat_template.jinja

You can do stuff like `--jinja --chat-template-file ./models/templates/Kimi-K2-Thinking.jinja`.
You will also need to pass `--special` for it to output the thinking tags correctly, depending on the endpoint and client used (thanks u/Melodic-Network4374), but note it will then also print the trailing special token, so you can set your client to use that as a stop string.

If you don't have enough RAM+VRAM, remove `--no-mmap` to mmap() "troll rig" it, paging weights read-only off of disk for a couple tok/sec, maybe, depending.

Adjust `--threads` and `--threads-batch` as needed. For smaller CPUs I recommend setting them both equal to the number of physical cores. For an AMD 9950X that would be `-t 16`, for example. Experiment on larger rigs, especially with multi-socket NUMA considerations (avoid cross-NUMA memory access if possible).

With ik_llama.cpp you can get some extra VRAM by using `-amb 512` to fix the size of the MLA computation buffers (only works on models with MLA-style attention like Kimi-K2 and DeepSeek).

References

- ik_llama.cpp Getting Started Guide (already out of date lol)
- ubergarm-imatrix-calibration-corpus-v02.txt
- moonshotai/Kimi-K2-Thinking/discussions/2
- vllm-project/compressed-tensors/issues/511
- llama.cpp PR#17069
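Putting the flags above together, a hypothetical ik_llama.cpp launch might look like this. The model filename, thread counts, and template path are illustrative, not from the card; adjust them to your rig:

```shell
# Illustrative launch sketch for a hybrid CPU+GPU rig (values are examples)
./build/bin/llama-server \
    -m ./Kimi-K2-Thinking-smol-IQ3_KS-00001-of-00009.gguf \
    --jinja --chat-template-file ./models/templates/Kimi-K2-Thinking.jinja \
    --special \
    --no-mmap \
    -t 16 --threads-batch 16 \
    -amb 512
```

Drop `--no-mmap` if you're short on RAM+VRAM as described above, and set `-t`/`--threads-batch` to your physical core count.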
Qwen3.5-122B-A10B-GGUF
GLM-5.1-GGUF
Qwen3.5-35B-A3B-GGUF
Step-3.5-Flash-GGUF
Qwen3.5-397B-A17B-GGUF
GLM-4.7-Flash-GGUF
GLM-4.6-GGUF
`ik_llama.cpp` imatrix Quantizations of zai-org/GLM-4.6

This quant collection REQUIRES the ik_llama.cpp fork to support ik's latest SOTA quants and optimizations! Do not download these big files and expect them to run on mainline vanilla llama.cpp, ollama, LM Studio, KoboldCpp, etc.!

NOTE: `ik_llama.cpp` can also run your existing GGUFs from bartowski, unsloth, mradermacher, etc. if you want to try it out before downloading my quants. Some of ik's new quants are supported in the Nexesenex/croco.cpp fork of KoboldCpp, with Windows builds for CUDA 12.9. Also check for Windows builds by Thireus here, which have been CUDA 12.8. These quants provide best-in-class perplexity for the given memory footprint.

Big Thanks

Shout out to Wendell and the Level1Techs crew, the community Forums, and YouTube Channel! BIG thanks for providing BIG hardware expertise and access to run these experiments and make these great quants available to the community!!!

Also thanks to all the folks in the quanting and inferencing community on the BeaverAI Club Discord and on r/LocalLLaMA for tips and tricks helping each other run, test, and benchmark all the fun new models!

Quant Collection

Perplexity computed against wiki.test.raw.
These first two are just test quants for baseline perplexity comparison:

- `BF16` 664.707 GiB (16.003 BPW) - Final estimate: PPL = 3.4454 +/- 0.01999
- `Q8_0` 353.259 GiB (8.505 BPW) - Final estimate: PPL = 3.4471 +/- 0.02001

- `IQ5_K` 249.099 GiB (5.997 BPW) - Final estimate: PPL = 3.4428 +/- 0.01993
- `IQ4_K` 207.708 GiB (5.001 BPW) - Final estimate: PPL = 3.4758 +/- 0.02023
- `IQ4_KS` 192.967 GiB (4.646 BPW) - Final estimate: PPL = 3.5309 +/- 0.02057
- `smol-IQ4_KSS` 169.895 GiB (4.090 BPW) - Final estimate: PPL = 3.5911 +/- 0.02092
- `IQ3_KS` 148.390 GiB (3.573 BPW) - Final estimate: PPL = 3.6427 +/- 0.02127
- `IQ2_KL` 127.516 GiB (3.070 BPW) - Final estimate: PPL = 4.1456 +/- 0.02521
- `smol-IQ2_KS` 97.990 GiB (2.359 BPW) - Final estimate: PPL = 5.2760 +/- 0.03410
  - Did not use PR624 https://github.com/ikawrakow/ik_llama.cpp/pull/624 (it would probably give slightly better perplexity, but a pain to rebase and confirm at this point, lol)
- `smol-IQ1_KT` 80.906 GiB (1.948 BPW) - Final estimate: PPL = 5.9034 +/- 0.03812

Quick Start

If you want to disable thinking, add `/nothink` (correct, no underscore) at the end of your prompt.

References

- ik_llama.cpp Getting Started Guide (already out of date lol)
- ubergarm-imatrix-calibration-corpus-v02.txt
- bartowski mainline llama.cpp GLM-4.6 fix PR16359
- ik_llama.cpp PR814
- Downtown-Case speed benchmarks on local gaming rig
- More good quants by AesSedai/GLM-4.6-GGUF
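As a quick sanity check on the size/BPW numbers above, the file size and bits-per-weight together imply the parameter count. Using the `Q8_0` row (a sketch; GGUF tensor accounting differs slightly from the headline parameter count):

```shell
# parameter count implied by file size (GiB) and bits-per-weight
gib=353.259   # Q8_0 file size from the table above
bpw=8.505     # its bits per weight
params_b=$(awk -v g="$gib" -v b="$bpw" 'BEGIN{printf "%.1f", g * 2^30 * 8 / b / 1e9}')
echo "implied parameter count: ~${params_b}B"
```

This lands in the mid-350B range, consistent with GLM-4.6's advertised size; the same arithmetic works for any row in the table.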
Qwen3.5-27B-GGUF
Ling-1T-GGUF
`ik_llama.cpp` imatrix Quantizations of inclusionAI/Ling-1T

This quant collection REQUIRES the ik_llama.cpp fork to support ik's latest SOTA quants and optimizations! Do not download these big files and expect them to run on mainline vanilla llama.cpp, ollama, LM Studio, KoboldCpp, etc.!

NOTE: `ik_llama.cpp` can also run your existing GGUFs from bartowski, unsloth, mradermacher, etc. if you want to try it out before downloading my quants. Some of ik's new quants are supported in the Nexesenex/croco.cpp fork of KoboldCpp, with Windows builds for CUDA 12.9. Also check for Windows builds by Thireus here, which have been CUDA 12.8. These quants provide best-in-class perplexity for the given memory footprint.

Big Thanks

Shout out to Wendell and the Level1Techs crew, the community Forums, and YouTube Channel! BIG thanks for providing BIG hardware expertise and access to run these experiments and make these great quants available to the community!!!

Also thanks to all the folks in the quanting and inferencing community on the BeaverAI Club Discord and on r/LocalLLaMA for tips and tricks helping each other run, test, and benchmark all the fun new models!

Finally, I appreciate all the support from aifoundry.org and team, as well as huggingface for hosting all these big quants!

Quant Collection

Perplexity computed against wiki.test.raw.

This one is just a test quant for baseline perplexity comparison:

- `Q8_0` 989.678 GiB (8.504 BPW) - Final estimate: PPL = 1.9859 +/- 0.00907

- `IQ5_K` 689.866 GiB (5.928 BPW) - Final estimate: PPL = 1.9897 +/- 0.00910
- `smol-IQ4_KSS` 471.923 GiB (4.055 BPW) - Final estimate: PPL = 2.0176 +/- 0.00927
- `smol-IQ3_KS` 378.853 GiB (3.255 BPW) - Final estimate: PPL = 2.0770 +/- 0.00968
- `IQ2_K` 330.923 GiB (2.843 BPW) - Final estimate: PPL = 2.2169 +/- 0.01055
  - This will use full q8_0 for the VRAM layers and likely suits 384 GiB RAM/VRAM.
- `smol-IQ2_KS` 264.984 GiB (2.277 BPW) - Final estimate: PPL = 2.4429 +/- 0.01191
  - Should hopefully fit in 250 GiB RAM + 15 GiB VRAM + kv-cache/context... 🤞 Leaving the `attn.`/first 4 dense layers/shexp at full q8_0 would take about 20.1 GiB VRAM, which is how the `iqN_k` quants are done.
- `smol-IQ2_XXS` 249.92 GiB (2.15 BPW) - Final estimate: PPL = 2.5870 +/- 0.01279
  - This is a rare mainline-compatible quant I released for folks to test this PR: https://github.com/ggml-org/llama.cpp/pull/16063
- `smol-IQ1_KT` 215.423 GiB (1.851 BPW) - Final estimate: PPL = 2.8581 +/- 0.01471
  - One of the smallest yet functional quants available, but keep in mind KT types can be slower for CPU inferencing, likely due to being compute-bottlenecked calculating the trellis during TG. Still worth a try if this is all your rig can fit!

References

- ik_llama.cpp Getting Started Guide (already out of date lol)
- ubergarm-imatrix-calibration-corpus-v02.txt
- ik_llama.cpp PR833
- mainline llama.cpp PR16063
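The RAM+VRAM split above is just file size minus whatever you offload to the GPU. A rough fit check (ignores kv-cache/context and compute-buffer overhead, so treat it as a lower bound):

```shell
# rough system-RAM requirement after GPU offload (lower bound)
total_gib=264.984   # smol-IQ2_KS file size from above
vram_gib=15         # GiB of weights held on the GPU
ram_gib=$(awk -v t="$total_gib" -v v="$vram_gib" 'BEGIN{printf "%.0f", t - v}')
echo "needs about ${ram_gib} GiB system RAM for the remaining weights"
```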
MiniMax-M2.7-GGUF
Qwen3-Coder-Next-GGUF
GLM-4.5-Air-GGUF
`ik_llama.cpp` imatrix Quantizations of zai-org/GLM-4.5-Air

This quant collection REQUIRES the ik_llama.cpp fork to support ik's latest SOTA quants and optimizations! Do not download these big files and expect them to run on mainline vanilla llama.cpp, ollama, LM Studio, KoboldCpp, etc.!

NOTE: `ik_llama.cpp` can also run your existing GGUFs from bartowski, unsloth, mradermacher, etc. if you want to try it out before downloading my quants. Some of ik's new quants are supported in the Nexesenex/croco.cpp fork of KoboldCpp, with Windows builds for CUDA 12.9. Also check for Windows builds by Thireus here, which have been CUDA 12.8. These quants provide best-in-class perplexity for the given memory footprint.

Big Thanks

Shout out to Wendell and the Level1Techs crew, the community Forums, and YouTube Channel! BIG thanks for providing BIG hardware expertise and access to run these experiments and make these great quants available to the community!!!

Also thanks to all the folks in the quanting and inferencing community on the BeaverAI Club Discord and on r/LocalLLaMA for tips and tricks helping each other run, test, and benchmark all the fun new models!

Quant Collection

Perplexity computed against wiki.test.raw.
These first two are just test quants for baseline perplexity comparison:

- `BF16` 205.811 GiB (16.004 BPW) - Final estimate: PPL = 4.5704 +/- 0.02796
- `Q8_0` 109.381 GiB (8.505 BPW) - Final estimate: PPL = 4.5798 +/- 0.02804

- `IQ5_K` 77.704 GiB (6.042 BPW) - Final estimate: PPL = 4.5867 +/- 0.02806
- `IQ5_KS` 72.855 GiB (5.665 BPW) - Final estimate: PPL = 4.5948 +/- 0.02815
- `IQ4_K` 62.910 GiB (4.892 BPW) - Final estimate: PPL = 4.6273 +/- 0.02839
- `IQ4_KSS` 54.801 GiB (4.261 BPW) - Final estimate: PPL = 4.7056 +/- 0.02909
- `IQ3_KS` 49.072 GiB (3.816 BPW) - Final estimate: PPL = 4.7975 +/- 0.02972
- `IQ2_KL` 43.870 GiB (3.411 BPW) - Final estimate: PPL = 5.0697 +/- 0.03166
- `IQ1_KT` 36.039 GiB (2.802 BPW) - Final estimate: PPL = 5.8214 +/- 0.03767

Quick Start

If you want to disable thinking, add `/nothink` (correct, no underscore) at the end of your prompt.

References

- ik_llama.cpp Getting Started Guide (already out of date lol)
- ubergarm-imatrix-calibration-corpus-v02.txt
- Mainline llama.cpp Draft PR14939
- ik_llama.cpp GLM-4.5 MoE PR668
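Note that the larger quants sit within each other's error bars. For example, comparing the IQ5_K row against the Q8_0 baseline, the PPL difference is far smaller than the combined one-sigma uncertainty of the two estimates (a rough check; the two runs share the same test set, so this overstates the independent error a bit):

```shell
# is the IQ5_K vs Q8_0 PPL gap distinguishable from measurement noise?
diff=$(awk 'BEGIN{printf "%.4f", 4.5867 - 4.5798}')
err=$(awk 'BEGIN{printf "%.4f", sqrt(0.02806^2 + 0.02804^2)}')
echo "PPL diff=$diff vs combined 1-sigma err=$err"
```

The gap (~0.007) is well inside the combined error (~0.04), so the two are statistically indistinguishable on this test.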
DeepSeek V3.1 GGUF
`ik_llama.cpp` imatrix Quantizations of deepseek-ai/DeepSeek-V3.1

This quant collection REQUIRES the ik_llama.cpp fork to support ik's latest SOTA quants and optimizations! Do not download these big files and expect them to run on mainline vanilla llama.cpp, ollama, LM Studio, KoboldCpp, etc.!

NOTE: `ik_llama.cpp` can also run your existing GGUFs from bartowski, unsloth, mradermacher, etc. if you want to try it out before downloading my quants. Some of ik's new quants are supported in the Nexesenex/croco.cpp fork of KoboldCpp, with Windows builds for CUDA 12.9. Also check for Windows builds by Thireus here, which have been CUDA 12.8. These quants provide best-in-class perplexity for the given memory footprint.

Big Thanks

Shout out to Wendell and the Level1Techs crew, the community Forums, and YouTube Channel! BIG thanks for providing BIG hardware expertise and access to run these experiments and make these great quants available to the community!!!

Also thanks to all the folks in the quanting and inferencing community on the BeaverAI Club Discord and on r/LocalLLaMA for tips and tricks helping each other run, test, and benchmark all the fun new models!

Quant Collection

Perplexity computed against wiki.test.raw.
This first one is just a "pure" test quant for baseline perplexity comparison:

- `Q8_0` 664.295 GiB (8.504 BPW) - Final estimate: PPL = 3.3473 +/- 0.01935

- `IQ5_K` 465.075 GiB (5.944 BPW) - Final estimate: PPL = 3.3550 +/- 0.01942
- `IQ4_K` 384.765 GiB (4.925 BPW) - Final estimate: PPL = 3.3715 +/- 0.01956
- `IQ4_KS` 363.151 GiB (4.649 BPW) - Final estimate: PPL = 3.3806 +/- 0.01966
- `IQ4_KSS` 325.088 GiB (4.162 BPW) - Final estimate: PPL = 3.3887 +/- 0.01968
- `smol-IQ4_KSS` 318.745 GiB (4.080 BPW) - Final estimate: PPL = 3.3898 +/- 0.01964
- `IQ3_K` 293.177 GiB (3.753 BPW) - Final estimate: PPL = 3.4260 +/- 0.01995
  - NOTE: Made with https://github.com/ikawrakow/ik_llama.cpp/pull/624
- `IQ3_KS` 277.397 GiB (3.551 BPW) - Final estimate: PPL = 3.4534 +/- 0.02019
  - NOTE: Made with https://github.com/ikawrakow/ik_llama.cpp/pull/624
- `IQ2_KL` 231.206 GiB (2.960 BPW) - Final estimate: PPL = 3.6312 +/- 0.02161
  - NOTE: Made with https://github.com/ikawrakow/ik_llama.cpp/pull/624
- `IQ2_KT` 204.592 GiB (2.619 BPW) - Final estimate: PPL = 3.8109 +/- 0.02294
  - Remember, the KT quants are better suited for full GPU offload, as calculating the trellis on CPU bottlenecks token generation.
- `IQ2_KS` 193.144 GiB (2.472 BPW) - Final estimate: PPL = 3.9583 +/- 0.02433
  - NOTE: Made with https://github.com/ikawrakow/ik_llama.cpp/pull/624
- `IQ1_KT` 154.968 GiB (1.984 BPW) - Final estimate: PPL = 4.3987 +/- 0.02786
  - Remember, the KT quants are better suited for full GPU offload, e.g. 2x RTX 6000 Pro Blackwells in this case.
- `IQ1_S` 133.610 GiB (1.710 BPW) - Final estimate: PPL = 5.3113 +/- 0.03507

Multi-GPU is well supported with custom `-ot ...=CUDA1` offload regex arguments etc.

References

- ik_llama.cpp Getting Started Guide (already out of date lol)
- Quant Cookers Guide
- Compiling triton-cpu
- fp8 to bf16 safetensors casting without GPU
- avx512 avxvnni Zen5 experimental optimizations
- ubergarm-imatrix-calibration-corpus-v02.txt
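As a sketch of the multi-GPU `-ot` overrides mentioned above: the layer ranges, model filename, and split here are hypothetical examples to show the regex shape, not a tested recipe (tensor names follow the usual DeepSeek GGUF naming):

```shell
# Hypothetical 2-GPU hybrid split: a few routed-expert layers pinned per
# GPU, the rest of the experts on CPU, everything else offloaded via -ngl
./build/bin/llama-server \
    -m ./DeepSeek-V3.1-IQ2_KT-00001-of-00005.gguf \
    -ngl 99 \
    -ot "blk\.(3|4|5)\.ffn_.*_exps=CUDA0" \
    -ot "blk\.(6|7|8)\.ffn_.*_exps=CUDA1" \
    -ot "exps=CPU"
```

The `-ot` patterns are matched in order, so the catch-all `exps=CPU` only applies to expert tensors not already pinned to a GPU.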
Qwen3-Coder-30B-A3B-Instruct-GGUF
`ik_llama.cpp` imatrix Quantizations of Qwen/Qwen3-Coder-30B-A3B-Instruct

This quant collection REQUIRES the ik_llama.cpp fork to support ik's latest SOTA quants and optimizations! Do not download these big files and expect them to run on mainline vanilla llama.cpp, ollama, LM Studio, KoboldCpp, etc.!

NOTE: `ik_llama.cpp` can also run your existing GGUFs from bartowski, unsloth, mradermacher, etc. if you want to try it out before downloading my quants. Some of ik's new quants are supported in the Nexesenex/croco.cpp fork of KoboldCpp. These quants provide best-in-class perplexity for the given memory footprint.

Big Thanks

Shout out to Wendell and the Level1Techs crew, the community Forums, and YouTube Channel! BIG thanks for providing BIG hardware expertise and access to run these experiments and make these great quants available to the community!!!

Also thanks to all the folks in the quanting and inferencing community on the BeaverAI Club Discord and on r/LocalLLaMA for tips and tricks helping each other run, test, and benchmark all the fun new models!

Quant Collection

Perplexity computed against wiki.test.raw.

These first three are just test quants for baseline perplexity comparison:

- `bf16` 56.894 GiB (16.007 BPW) - Final estimate: PPL = 9.5334 +/- 0.07560
- `Q8_0` 30.247 GiB (8.510 BPW) - Final estimate: PPL = 9.5317 +/- 0.07551 (NOTE: lower than BF16, but I didn't use it for the "baseline"...)
- `Q4_0` 16.111 GiB (4.533 BPW) - Final estimate: PPL = 9.7225 +/- 0.07712

- `IQ5_K` 21.324 GiB (5.999 BPW) - Final estimate: PPL = 9.5930 +/- 0.07614
- `IQ4_K` 17.878 GiB (5.030 BPW) - Final estimate: PPL = 9.6023 +/- 0.07613
- `IQ4_KSS` 15.531 GiB (4.370 BPW) - Final estimate: PPL = 9.6441 +/- 0.07648
- `IQ3_K` 14.509 GiB (4.082 BPW) - Final estimate: PPL = 9.6849 +/- 0.0768
- `IQ3_KS` 13.633 GiB (3.836 BPW) - Final estimate: PPL = 9.7940 +/- 0.07795
- `IQ2_KL` 11.516 GiB (3.240 BPW) - Final estimate: PPL = 10.0475 +/- 0.08016
- `IQ2_KT` 9.469 GiB (2.664 BPW) - Final estimate: PPL = 10.1352 +/- 0.08007
- `IQ1_KT` 7.583 GiB (2.133 BPW) - Final estimate: PPL = 11.0592 +/- 0.08760

References

- ik_llama.cpp Getting Started Guide (already out of date lol)
- ubergarm-imatrix-calibration-corpus-v02.txt
Qwen3-30B-A3B-Thinking-2507-GGUF
`ik_llama.cpp` imatrix Quantizations of Qwen/Qwen3-30B-A3B-Thinking-2507

This quant collection REQUIRES the ik_llama.cpp fork to support ik's latest SOTA quants and optimizations! Do not download these big files and expect them to run on mainline vanilla llama.cpp, ollama, LM Studio, KoboldCpp, etc.!

NOTE: `ik_llama.cpp` can also run your existing GGUFs from bartowski, unsloth, mradermacher, etc. if you want to try it out before downloading my quants. Some of ik's new quants are supported in the Nexesenex/croco.cpp fork of KoboldCpp. These quants provide best-in-class perplexity for the given memory footprint.

Big Thanks

Shout out to Wendell and the Level1Techs crew, the community Forums, and YouTube Channel! BIG thanks for providing BIG hardware expertise and access to run these experiments and make these great quants available to the community!!!

Also thanks to all the folks in the quanting and inferencing community on the BeaverAI Club Discord and on r/LocalLLaMA for tips and tricks helping each other run, test, and benchmark all the fun new models!

Quant Collection

Perplexity computed against wiki.test.raw.

These first three are just test quants for baseline perplexity comparison:

- `bf16` 56.894 GiB (16.007 BPW) - Final estimate: PPL = 7.3149 +/- 0.05076
- `Q8_0` 30.247 GiB (8.510 BPW) - Final estimate: PPL = 7.3284 +/- 0.05091
- `Q4_0` 16.111 GiB (4.533 BPW) - Final estimate: PPL = 7.4534 +/- 0.05151

- `IQ5_K` 21.324 GiB (5.999 BPW) - Final estimate: PPL = 7.3440 +/- 0.05091
- `IQ4_K` 17.878 GiB (5.030 BPW) - Final estimate: PPL = 7.3634 +/- 0.05104
- `IQ4_KSS` 15.531 GiB (4.370 BPW) - Final estimate: PPL = 7.3861 +/- 0.05128
- `IQ4_KT` 14.438 GiB (4.062 BPW) - Final estimate: PPL = 7.5020 +/- 0.05230
  - Mostly pure IQ4_KT, meant for full GPU offload similar to turboderp-org/exllamav3. Check out ArtusDev's HuggingFace page for some excellent EXL3 quants!
- `IQ3_K` 14.509 GiB (4.082 BPW) - Final estimate: PPL = 7.4360 +/- 0.05162
- `IQ3_KS` 13.633 GiB (3.836 BPW) - Final estimate: PPL = 7.4959 +/- 0.05204
- `IQ2_KL` 11.516 GiB (3.240 BPW) - Final estimate: PPL = 7.6992 +/- 0.05345
- `IQ2_KT` 9.469 GiB (2.664 BPW) - Final estimate: PPL = 8.0207 +/- 0.05638
- `IQ1_KT` 7.583 GiB (2.133 BPW) - Final estimate: PPL = 8.8341 +/- 0.06231

Quick Start

Full GPU offload with CUDA or Vulkan (for AMD GPUs).

imatrix note

I used @eaddario's eaddario-imatrix-corpus-combined-all-medium converted to text like so:

References

- ik_llama.cpp Getting Started Guide (already out of date lol)
- ubergarm-imatrix-calibration-corpus-v02.txt
- eaddario/imatrix-calibration
DeepSeek-V3.1-Terminus-GGUF
Kimi-K2-Instruct-0905-GGUF
`ik_llama.cpp` imatrix Quantizations of moonshotai/Kimi-K2-Instruct-0905

This quant collection REQUIRES the ik_llama.cpp fork to support ik's latest SOTA quants and optimizations! Do not download these big files and expect them to run on mainline vanilla llama.cpp, ollama, LM Studio, KoboldCpp, etc.!

NOTE: `ik_llama.cpp` can also run your existing GGUFs from bartowski, unsloth, mradermacher, etc. if you want to try it out before downloading my quants. Some of ik's new quants are supported in the Nexesenex/croco.cpp fork of KoboldCpp. For pre-built Windows binaries of ik_llama.cpp, check out Thireus' fork here. These quants provide best-in-class perplexity for the given memory footprint.

Big Thanks

Shout out to Wendell and the Level1Techs crew, the community Forums, and YouTube Channel! BIG thanks for providing BIG hardware expertise and access to run these experiments and make these great quants available to the community!!!

Also thanks to all the folks in the quanting and inferencing community on the BeaverAI Club Discord and on r/LocalLLaMA for tips and tricks helping each other run, test, and benchmark all the fun new models!

The current imatrix dat file seems to be missing entries for just the single dense layer and shared expert, so all my recipes use `q8_0` for those.

For notes on tool-calling API endpoints, check out the details in this PR: https://github.com/ikawrakow/ik_llama.cpp/pull/723

`smol` here simply means the routed-experts recipe uses the same quantization for the down as well as the (gate|up) tensors.
Quant Collection

Compare with the baseline perplexity of the full-size `Q8_0` 1016.117 GiB (8.504 BPW):

- `smol-IQ5_KS` 632.664 GiB (5.295 BPW) - Final estimate: PPL = 2.4526 +/- 0.01182
- `smol-IQ4_KSS` 485.008 GiB (4.059 BPW) - Final estimate: PPL = 2.5185 +/- 0.01221
- `IQ4_KS` 553.624 GiB (4.633 BPW) - Final estimate: PPL = 2.4641 +/- 0.01190
- `IQ3_KS` 420.558 GiB (3.520 BPW) - Final estimate: PPL = 2.5640 +/- 0.01262
- `smol-IQ3_KS` 388.258 GiB (3.249 BPW) - Final estimate: PPL = 2.5902 +/- 0.01284
- `IQ2_KL` 358.419 GiB (3.000 BPW) - Final estimate: PPL = 2.7993 +/- 0.01416
- `smol-IQ2_KL` 329.195 GiB (2.755 BPW) - Final estimate: PPL = 2.9294 +/- 0.01499
- `IQ2_KS` 289.820 GiB (2.425 BPW) - Final estimate: PPL = 3.2478 +/- 0.01721
- `smol-IQ2_KS` 270.133 GiB (2.261 BPW) - Final estimate: PPL = 3.4977 +/- 0.01924
- `smol-IQ1_KT` 218.936 GiB (1.832 BPW) - Final estimate: PPL = 4.2224 +/- 0.02443
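For illustration, a recipe in the spirit of the `smol` description above could be cooked with ik_llama.cpp's `llama-quantize` per-tensor overrides. This assumes its `--custom-q` regex syntax; the tensor patterns, types, and filenames here are illustrative, not the exact recipe used for these quants:

```shell
# Hypothetical "smol" recipe sketch: same type for down and (gate|up)
# routed experts, q8_0 for attn / dense layer / shared expert
./build/bin/llama-quantize \
    --imatrix ./imatrix-Kimi-K2-Instruct-0905.dat \
    --custom-q "ffn_down_exps=iq3_ks,ffn_(gate|up)_exps=iq3_ks" \
    --custom-q "attn=q8_0,shexp=q8_0" \
    ./Kimi-K2-Instruct-0905-BF16.gguf \
    ./Kimi-K2-Instruct-0905-smol-IQ3_KS.gguf \
    IQ3_KS
```

A non-`smol` mix would simply give `ffn_down_exps` one step larger a type than `ffn_(gate|up)_exps`.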
DeepSeek-R1-0528-GGUF
Qwen3-30B-A3B-Instruct-2507-GGUF
`ik_llama.cpp` imatrix Quantizations of Qwen/Qwen3-30B-A3B-Instruct-2507

This quant collection REQUIRES the ik_llama.cpp fork to support ik's latest SOTA quants and optimizations! Do not download these big files and expect them to run on mainline vanilla llama.cpp, ollama, LM Studio, KoboldCpp, etc.!

NOTE: `ik_llama.cpp` can also run your existing GGUFs from bartowski, unsloth, mradermacher, etc. if you want to try it out before downloading my quants. Some of ik's new quants are supported in the Nexesenex/croco.cpp fork of KoboldCpp. These quants provide best-in-class perplexity for the given memory footprint.

Big Thanks

Shout out to Wendell and the Level1Techs crew, the community Forums, and YouTube Channel! BIG thanks for providing BIG hardware expertise and access to run these experiments and make these great quants available to the community!!!

Also thanks to all the folks in the quanting and inferencing community on the BeaverAI Club Discord and on r/LocalLLaMA for tips and tricks helping each other run, test, and benchmark all the fun new models!

Quant Collection

Perplexity computed against wiki.test.raw.
These first two are just test quants for baseline perplexity comparison:

- `bf16` 56.894 GiB (16.007 BPW) - Final estimate: PPL = 7.3594 +/- 0.05170
- `Q8_0` 30.247 GiB (8.510 BPW) - Final estimate: PPL = 7.3606 +/- 0.05171

- `IQ5_K` 21.324 GiB (5.999 BPW) - Final estimate: PPL = 7.3806 +/- 0.05170
- `IQ4_K` 17.878 GiB (5.030 BPW) - Final estimate: PPL = 7.3951 +/- 0.05178
- `IQ4_KSS` 15.531 GiB (4.370 BPW) - Final estimate: PPL = 7.4392 +/- 0.05225
- `IQ3_K` 14.509 GiB (4.082 BPW) - Final estimate: PPL = 7.4991 +/- 0.05269
- `IQ3_KS` 13.633 GiB (3.836 BPW) - Final estimate: PPL = 7.5512 +/- 0.05307
- `IQ2_KL` 11.516 GiB (3.240 BPW) - Final estimate: PPL = 7.7121 +/- 0.05402
- `IQ2_KT` 9.469 GiB (2.664 BPW) - Final estimate: PPL = 8.0270 +/- 0.05698
- `IQ1_KT` 7.583 GiB (2.133 BPW) - Final estimate: PPL = 8.7273 +/- 0.06185

Quick Start

Full GPU offload with CUDA or Vulkan (for AMD GPUs).

imatrix note

I used @eaddario's eaddario-imatrix-corpus-combined-all-medium converted to text like so:

References

- ik_llama.cpp Getting Started Guide (already out of date lol)
- ubergarm-imatrix-calibration-corpus-v02.txt
- eaddario/imatrix-calibration
Kimi-K2-Instruct-GGUF
`ik_llama.cpp` imatrix Quantizations of moonshotai/Kimi-K2-Instruct

This quant collection REQUIRES the ik_llama.cpp fork to support ik's latest SOTA quants and optimizations! Do not download these big files and expect them to run on mainline vanilla llama.cpp, ollama, LM Studio, KoboldCpp, etc.!

NOTE: `ik_llama.cpp` can also run your existing GGUFs from bartowski, unsloth, mradermacher, etc. if you want to try it out before downloading my quants. Some of ik's new quants are supported in the Nexesenex/croco.cpp fork of KoboldCpp. These quants provide best-in-class perplexity for the given memory footprint.

Big Thanks

Shout out to Wendell and the Level1Techs crew, the community Forums, and YouTube Channel! BIG thanks for providing BIG hardware expertise and access to run these experiments and make these great quants available to the community!!!

Also thanks to all the folks in the quanting and inferencing community on the BeaverAI Club Discord and on r/LocalLLaMA for tips and tricks helping each other run, test, and benchmark all the fun new models!

UPDATED RECIPES

Updated with new, better, lower-perplexity recipes and the world's smallest Kimi-K2-Instruct: `smol-IQ1_KT` at 219.375 GiB (1.835 BPW). Please ask any questions in this discussion here, thanks! Old versions are still available as described in the discussion at tag/revision v0.1.

Quant Collection

Compare with the perplexity of the full-size `Q8_0` 1016.623 GiB (8.504 BPW):

- v0.2 `IQ4_KS` 554.421 GiB (4.638 BPW) - Final estimate: PPL = 2.9584 +/- 0.01473
  - Special mix of `IQ4_KS` `ffn_(gate|up)_exps` and `IQ5_KS` `ffn_down_exps` routed experts.
- v0.2 `IQ3_KS` 430.908 GiB (3.604 BPW) - Final estimate: PPL = 3.0226 +/- 0.01518
  - Special mix of `IQ3_KS` `ffn_(gate|up)_exps` and `IQ4_KS` `ffn_down_exps` routed experts.
- v0.2 `IQ2_KL` 349.389 GiB (2.923 BPW) - Final estimate: PPL = 3.1813 +/- 0.01619
  - Special mix with the brand new SOTA `IQ2_KL` `ffn_(gate|up)_exps` and `IQ3_KS` `ffn_down_exps` routed experts.
- v0.2 `smol-IQ2_KL` 329.702 GiB (2.758 BPW) - Final estimate: PPL = 3.4086 +/- 0.01773
  - Special mix of `IQ2_KL` `ffn_(gate|up)_exps` and also `IQ2_KL` `ffn_down_exps` routed experts.
- v0.2 `IQ2_KS` 290.327 GiB (2.429 BPW) - Final estimate: PPL = 3.6827 +/- 0.01957
  - Special mix with `IQ2_KS` `ffn_(gate|up)_exps` and the brand new SOTA `IQ2_KL` `ffn_down_exps` routed experts.
- v0.2 `IQ1_KT` 234.141 GiB (1.959 BPW) - Final estimate: PPL = 3.9734 +/- 0.02152
  - Special mix of `IQ1_KT` `ffn_(gate|up)_exps` and `IQ2_KT` `ffn_down_exps` routed experts.
- v0.2 `smol-IQ1_KT` 219.375 GiB (1.835 BPW) - Final estimate: PPL = 4.2187 +/- 0.02325
  - Special mix of `IQ1_KT` `ffn_(gate|up)_exps` and also `IQ1_KT` `ffn_down_exps` routed experts.

References

- ik_llama.cpp Getting Started Guide (already out of date lol)
- mainline llama.cpp PR
- gabriellarsion PR author test repo discussion
Qwen3-Coder-480B-A35B-Instruct-GGUF
Qwen3-235B-A22B-Instruct-2507-GGUF
`ik_llama.cpp` imatrix Quantizations of Qwen/Qwen3-235B-A22B-Instruct-2507

This quant collection REQUIRES the ik_llama.cpp fork to support ik's latest SOTA quants and optimizations! Do not download these big files and expect them to run on mainline vanilla llama.cpp, ollama, LM Studio, KoboldCpp, etc.!

NOTE: `ik_llama.cpp` can also run your existing GGUFs from bartowski, unsloth, mradermacher, etc. if you want to try it out before downloading my quants. Some of ik's new quants are supported in the Nexesenex/croco.cpp fork of KoboldCpp. These quants provide best-in-class perplexity for the given memory footprint.

Big Thanks

Shout out to Wendell and the Level1Techs crew, the community Forums, and YouTube Channel! BIG thanks for providing BIG hardware expertise and access to run these experiments and make these great quants available to the community!!!

Also thanks to all the folks in the quanting and inferencing community on the BeaverAI Club Discord and on r/LocalLLaMA for tips and tricks helping each other run, test, and benchmark all the fun new models!

Quant Collection

Perplexity computed against wiki.test.raw.

These first two are just test quants for baseline perplexity comparison:

- `bf16` 437.989 GiB (16.003 BPW) - Final estimate: PPL = 4.3079 +/- 0.02544
- `Q8_0` 232.769 GiB (8.505 BPW) - Final estimate: PPL = 4.3139 +/- 0.02550

- `IQ5_K` 161.722 GiB (5.909 BPW) - Final estimate: PPL = 4.3351 +/- 0.02566
- `IQ4_K` 134.183 GiB (4.903 BPW) - Final estimate: PPL = 4.3668 +/- 0.02594
- `pure-IQ4_KS` 116.994 GiB (4.275 BPW) - Final estimate: PPL = 4.4156 +/- 0.02624
- `IQ4_KSS` 115.085 GiB (4.205 BPW) - Final estimate: PPL = 4.4017 +/- 0.02614
  - This one is a little funky, just for fun. Seems smort!
- `IQ3_K` 106.644 GiB (3.897 BPW) - Final estimate: PPL = 4.4561 +/- 0.02657
- `IQ3_KS` 101.308 GiB (3.702 BPW) - Final estimate: PPL = 4.4915 +/- 0.02685
- `IQ2_KL` 81.866 GiB (2.991 BPW) - Final estimate: PPL = 4.7912 +/- 0.02910

Quick Start

This example is for a single CUDA GPU hybrid inferencing with CPU/RAM.
Check the ik_llama.cpp discussions or my other quants for more examples, multi-GPU setups, etc.

References

- ik_llama.cpp Getting Started Guide (already out of date lol)
GLM-5-GGUF
Qwen3-235B-A22B-Thinking-2507-GGUF
`ik_llama.cpp` imatrix Quantizations of Qwen/Qwen3-235B-A22B-Thinking-2507

This quant collection REQUIRES the ik_llama.cpp fork to support ik's latest SOTA quants and optimizations! Do not download these big files and expect them to run on mainline vanilla llama.cpp, ollama, LM Studio, KoboldCpp, etc.!

NOTE: `ik_llama.cpp` can also run your existing GGUFs from bartowski, unsloth, mradermacher, etc. if you want to try it out before downloading my quants. Some of ik's new quants are supported in the Nexesenex/croco.cpp fork of KoboldCpp. These quants provide best-in-class perplexity for the given memory footprint.

Big Thanks

Shout out to Wendell and the Level1Techs crew, the community Forums, and YouTube Channel! BIG thanks for providing BIG hardware expertise and access to run these experiments and make these great quants available to the community!!!

Also thanks to all the folks in the quanting and inferencing community on the BeaverAI Club Discord and on r/LocalLLaMA for tips and tricks helping each other run, test, and benchmark all the fun new models!

Quant Collection

Perplexity computed against wiki.test.raw.

These first two are just test quants for baseline perplexity comparison:

- `bf16` 437.989 GiB (16.003 BPW) - Final estimate: PPL = 4.1898 +/- 0.02367
- `Q8_0` 232.769 GiB (8.505 BPW) - Final estimate: PPL = 4.1956 +/- 0.02371

- `IQ5_K` 161.722 GiB (5.909 BPW) - Final estimate: PPL = 4.2213 +/- 0.02391
- `IQ4_K` 134.183 GiB (4.903 BPW) - Final estimate: PPL = 4.2407 +/- 0.02406
- `IQ4_KSS` 114.093 GiB (4.169 BPW) - Final estimate: PPL = 4.2799 +/- 0.02423
- `IQ3_K` 106.644 GiB (3.897 BPW) - Final estimate: PPL = 4.3319 +/- 0.02470
- `IQ3_KS` 101.308 GiB (3.702 BPW) - Final estimate: PPL = 4.3718 +/- 0.02509
- `IQ2_KL` 81.866 GiB (2.991 BPW) - Final estimate: PPL = 4.6608 +/- 0.02720

Quick Start

This example is for a single CUDA GPU hybrid inferencing with CPU/RAM. Check the ik_llama.cpp discussions or my other quants for more examples, multi-GPU setups, etc.
References ik_llama.cpp Getting Started Guide (already out of date lol)
DeepSeek-V3-0324-GGUF
GLM-4.5-GGUF
Qwen3-235B-A22B-GGUF
gemma-3-27b-it-qat-GGUF
Hunyuan-A13B-Instruct-GGUF
DeepSeek-TNG-R1T2-Chimera-GGUF
`ik_llama.cpp` imatrix Quantizations of DeepSeek-TNG-R1T2-Chimera

This quant collection REQUIRES the ik_llama.cpp fork to support ik's latest SOTA quants and optimizations! Do not download these big files and expect them to run on mainline vanilla llama.cpp, ollama, LM Studio, KoboldCpp, etc! NOTE `ik_llama.cpp` can also run your existing GGUFs from bartowski, unsloth, mradermacher, etc. if you want to try it out before downloading my quants. Some of ik's new quants are supported with the Nexesenex/croco.cpp fork of KoboldCPP. These quants provide best-in-class perplexity for the given memory footprint.

Big Thanks
Shout out to Wendell and the Level1Techs crew, the community Forums, and YouTube Channel! BIG thanks for providing BIG hardware expertise and access to run these experiments and make these great quants available to the community!!! Also thanks to all the folks in the quanting and inferencing community on BeaverAI Club Discord and on r/LocalLLaMA for tips and tricks helping each other run, test, and benchmark all the fun new models!

Quants
For some larger non-imatrix ik quant options check out Kebob/DeepSeek-TNG-R1T2-Chimera-IKGGUF.

- `IQ3_KS` 281.463 GiB (3.598 BPW) - Special mix with all-new `IQ3_KS` `ffn_(gate|up)_exps` and `IQ4_KS` `ffn_down_exps` routed experts. Mostly `iq5_ks/iq4_ks` for attn and shared expert. `iq5_k` `token_embd` and `iq6_k` `output` "head".
- `IQ2_KS` 203.553 GiB (2.602 BPW) - Special mix with `IQ2_KS` `ffn_(gate|up)_exps` and new `IQ3_KS` `ffn_down_exps` routed experts. Mostly `iq5_ks/iq4_ks` for attn and shared expert. `iq5_k` `token_embd` and `iq6_k` `output` "head".
- `IQ2_KT` 171.146 GiB (2.188 BPW) - Designed for RTX 6000 PRO Blackwell with 192GB total VRAM: full offload with (hopefully) full 160k context and sufficiently large batch sizes. These `KT` quant types are quite fast on CUDA but not as fast for TG on CPU inferencing. Special mix of new trellis quants (QTIP/EXL3 style): `IQ2_KT` `ffn_(gate|down|up)_exps` routed experts. Mostly `iq4_kt/iq3_kt` for attn and shared expert. `iq4_k` `token_embd` and `iq5_k` `output` "head".
- `IQ2_XXS` 169.590 GiB (2.168 BPW) - Not recommended, but should be faster and better quality than the IQ1_S, and okay with full offload on multi-GPU. It should be okay for hybrid CPU+GPU inference as well if this size is good for your rig. You probably want to choose the IQ2_KT for full GPU offload. Special mix `IQ2_XXS` `ffn_(gate|up)_exps` and `IQ2_KS` `ffn_down_exps` routed experts. Mostly `iq4_ks/iq3_ks` for attn and shared expert. `iq4_k` `token_embd` and `iq5_k` `output` "head".
- `IQ1_S` 132.915 GiB (1.699 BPW) - Not recommended. "For the desperate". If you can fit a larger model in RAM+VRAM, choose a larger model, as it might even run faster and will definitely have better perplexity (likely better quality). Special mix `IQ1_S` `ffn_(gate|up)_exps` and `IQ1_M` `ffn_down_exps` routed experts. Mostly `iq4_ks/iq3_ks` for attn and shared expert. `iq4_k` `token_embd` and `iq5_k` `output` "head".

- Adjust `--threads` to equal your number of physical cores. Refer to various discussions on my other models for multi-NUMA, dual-socket, and varying `--threads` and `--threads-batch` on larger server rigs.
- If you OOM on VRAM, remove the additional `-ot "...=CUDA0"`, or you can increase offloaded layers if you have more VRAM with multi-GPU targets, e.g. `-ot "blk\.(5|6)\.ffn.*=CUDA1" \`.
- Test out `-rtr` to run-time-repack tensors to `_r4` variants for the layers running on CPU/RAM; this is likely faster at default ubatch sizes. Note this disables mmap(), so you will need enough RAM to malloc all the non-offloaded weights on startup.
- Generally `-ub 2048 -b 2048` or `-ub 4096 -b 4096` can give much faster PP speeds at the cost of some additional VRAM. Test against leaving it at the default `-ub 512 -b 2048`.
- Use `llama-sweep-bench --warmup-batch ...` to benchmark various configurations on your hardware and report to the community!
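The batch-size tips above can be explored systematically with a sweep. A hypothetical `llama-sweep-bench` loop over ubatch sizes, where the model path, `-ot` pattern, and `--threads` value are placeholders for your rig:

```shell
# Hypothetical ubatch-size sweep with llama-sweep-bench.
# Model path, -ot pattern, and --threads are placeholders.
for ub in 512 2048 4096; do
    ./build/bin/llama-sweep-bench --warmup-batch \
        --model /models/DeepSeek-TNG-R1T2-Chimera-IQ2_KS.gguf \
        -ngl 99 -ot exps=CPU \
        -ub "$ub" -b "$ub" \
        --threads 16
done
```
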
TODO
- [ ] Given the `IQ1_S_R4` is not symmetric with `IQ1_S`, it doesn't work with `-rtr`, so I might look into releasing an `R4` variant after some `llama-sweep-bench` testing.
- [ ] Consider a slightly larger model? (gotta free up some disk space lol)

References
ik_llama.cpp
Larger ik quants available here: Kebob/DeepSeek-TNG-R1T2-Chimera-IKGGUF
Getting Started Guide (already out of date lol)