bartowski

500 models

Llama-3.2-3B-Instruct-GGUF

base_model: meta-llama/Llama-3.2-3B-Instruct · languages: en, de, fr, it, pt, hi, es, th · license: llama3.2 · pipeline_tag: text-generation · tags: facebook, meta, llama, llama-3 · quantized_by: bartowski · gated under the Llama 3.2 Community License Agreement (release date: September 25, 2024)

llama
229,749
172

google_gemma-4-31B-it-GGUF

license:apache-2.0
149,725
41

Qwen_Qwen3.5-397B-A17B-GGUF

license:apache-2.0
146,026
3

google_gemma-4-26B-A4B-it-GGUF

license:apache-2.0
125,184
66

Qwen_Qwen3.5-35B-A3B-GGUF

license:apache-2.0
121,932
26

Meta-Llama-3.1-8B-Instruct-GGUF

base_model: meta-llama/Meta-Llama-3.1-8B-Instruct · languages: en, de, fr, it, pt, hi, es, th · license: llama3.1 · pipeline_tag: text-generation · tags: facebook, meta, pytorch, llama, llama-3 · quantized_by: bartowski · gated under the Llama 3.1 Community License Agreement (release date: July 23, 2024)

llama
101,898
274

google_gemma-4-E4B-it-GGUF

license:apache-2.0
97,712
31

gemma-2-2b-it-GGUF

96,544
79

Qwen_Qwen3.5-27B-GGUF

license:apache-2.0
80,069
28

google_gemma-4-E2B-it-GGUF

license:apache-2.0
63,296
14

Llama-3.2-1B-Instruct-GGUF

base_model: meta-llama/Llama-3.2-1B-Instruct · languages: en, de, fr, it, pt, hi, es, th · license: llama3.2 · pipeline_tag: text-generation · tags: facebook, meta, llama, llama-3 · quantized_by: bartowski · gated under the Llama 3.2 Community License Agreement (release date: September 25, 2024)

llama
57,137
140

Qwen_Qwen3.5-9B-GGUF

license:apache-2.0
54,282
18

Qwen_Qwen3.5-122B-A10B-GGUF

license:apache-2.0
39,491
11

openai_gpt-oss-20b-GGUF

Llamacpp imatrix Quantizations of gpt-oss-20b by openai

Original model: https://huggingface.co/openai/gpt-oss-20b

All quants made using the imatrix option with the combined_all_medium dataset from Ed Addario here.

Run them directly with llama.cpp, or any other llama.cpp based project.

All quants keep the feed forward networks at MXFP4 for optimal performance, which does mean the size differences are negligible, unfortunately, but they are provided anyway. No chat template is specified, so the default is used; this may be incorrect, check the original model card for details.

| Filename | Quant type | File Size | Split | Description |
| -------- | ---------- | --------- | ----- | ----------- |
| gpt-oss-20b-MXFP4.gguf | MXFP4 | 12.1GB | false | Full MXFP4 weights, recommended for this model. |

The reason is that the FFN (feed forward networks) of gpt-oss do not behave nicely when quantized to anything other than MXFP4, so they are kept at that level for everything. The rest of these are provided for your own interest in case you feel like experimenting, but the size savings are basically non-existent, so I would not recommend running them; they are provided simply for show:

| Filename | Quant type | File Size | Split | Description |
| -------- | ---------- | --------- | ----- | ----------- |
| gpt-oss-20b-Q6_K_L.gguf | Q6_K_L | 12.04GB | false | Uses Q8_0 for embed and output weights. Q6_K with all FFN kept at MXFP4_MOE. |
| gpt-oss-20b-Q6_K.gguf | Q6_K | 12.04GB | false | Q6_K with all FFN kept at MXFP4_MOE. |
| gpt-oss-20b-Q5_K_L.gguf | Q5_K_L | 11.91GB | false | Uses Q8_0 for embed and output weights. Q5_K with all FFN kept at MXFP4_MOE. |
| gpt-oss-20b-Q4_K_L.gguf | Q4_K_L | 11.89GB | false | Uses Q8_0 for embed and output weights. Q4_K with all FFN kept at MXFP4_MOE. |
| gpt-oss-20b-Q2_K_L.gguf | Q2_K_L | 11.85GB | false | Uses Q8_0 for embed and output weights. Q2_K with all FFN kept at MXFP4_MOE. |
| gpt-oss-20b-Q3_K_XL.gguf | Q3_K_XL | 11.78GB | false | Uses Q8_0 for embed and output weights. Q3_K_L with all FFN kept at MXFP4_MOE. |
| gpt-oss-20b-Q5_K_M.gguf | Q5_K_M | 11.73GB | false | Q5_K_M with all FFN kept at MXFP4_MOE. |
| gpt-oss-20b-Q5_K_S.gguf | Q5_K_S | 11.72GB | false | Q5_K_S with all FFN kept at MXFP4_MOE. |
| gpt-oss-20b-Q4_K_M.gguf | Q4_K_M | 11.67GB | false | Q4_K_M with all FFN kept at MXFP4_MOE. |
| gpt-oss-20b-Q4_K_S.gguf | Q4_K_S | 11.67GB | false | Q4_K_S with all FFN kept at MXFP4_MOE. |
| gpt-oss-20b-Q4_1.gguf | Q4_1 | 11.59GB | false | Q4_1 with all FFN kept at MXFP4_MOE. |
| gpt-oss-20b-IQ4_NL.gguf | IQ4_NL | 11.56GB | false | IQ4_NL with all FFN kept at MXFP4_MOE. |
| gpt-oss-20b-IQ4_XS.gguf | IQ4_XS | 11.56GB | false | IQ4_XS with all FFN kept at MXFP4_MOE. |
| gpt-oss-20b-Q3_K_M.gguf | Q3_K_M | 11.56GB | false | Q3_K_M with all FFN kept at MXFP4_MOE. |
| gpt-oss-20b-IQ3_M.gguf | IQ3_M | 11.56GB | false | IQ3_M with all FFN kept at MXFP4_MOE. |
| gpt-oss-20b-IQ3_XS.gguf | IQ3_XS | 11.56GB | false | IQ3_XS with all FFN kept at MXFP4_MOE. |
| gpt-oss-20b-IQ3_XXS.gguf | IQ3_XXS | 11.56GB | false | IQ3_XXS with all FFN kept at MXFP4_MOE. |
| gpt-oss-20b-Q2_K.gguf | Q2_K | 11.56GB | false | Q2_K with all FFN kept at MXFP4_MOE. |
| gpt-oss-20b-Q3_K_S.gguf | Q3_K_S | 11.55GB | false | Q3_K_S with all FFN kept at MXFP4_MOE. |
| gpt-oss-20b-IQ2_M.gguf | IQ2_M | 11.55GB | false | IQ2_M with all FFN kept at MXFP4_MOE. |
| gpt-oss-20b-IQ2_S.gguf | IQ2_S | 11.55GB | false | IQ2_S with all FFN kept at MXFP4_MOE. |
| gpt-oss-20b-Q4_0.gguf | Q4_0 | 11.52GB | false | Q4_0 with all FFN kept at MXFP4_MOE. |
| gpt-oss-20b-IQ2_XS.gguf | IQ2_XS | 11.51GB | false | IQ2_XS with all FFN kept at MXFP4_MOE. |
| gpt-oss-20b-IQ2_XXS.gguf | IQ2_XXS | 11.51GB | false | IQ2_XXS with all FFN kept at MXFP4_MOE. |
| gpt-oss-20b-Q3_K_L.gguf | Q3_K_L | 11.49GB | false | Q3_K_L with all FFN kept at MXFP4_MOE. |

Some of these quants (Q3_K_XL, Q4_K_L etc.) are the standard quantization method with the embeddings and output weights quantized to Q8_0 instead of what they would normally default to.

First, make sure you have huggingface-cli installed. If the model is bigger than 50GB, it will have been split into multiple files; to download them all to a local folder, run huggingface-cli download with the quant's file pattern. You can either specify a new local-dir (openai_gpt-oss-20b-Q8_0) or download them all in place (./).

Previously, you would download Q4_0_4_4/4_8/8_8, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights; details in this PR. If you use Q4_0 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282 you will not be able to run the Q4_0_X_X files and will instead need to use Q4_0. Additionally, if you want to get slightly better quality, you can use IQ4_NL thanks to this PR, which will also repack the weights for ARM, though only the 4_4 for now. The loading time may be slower but it will result in an overall speed increase. I'm keeping this section to show the potential theoretical uplift in performance from using Q4_0 with online repacking.

Click to view benchmarks on an AVX2 system (EPYC7702)

| model | size | params | backend | threads | test | t/s | % (vs Q4_0) |
| ------ | ---------: | ---------: | ------- | ------: | ----: | ---: | ---: |
| qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% |
| qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% |
| qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% |
| qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% |
| qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% |
| qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% |
| qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% |
| qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% |
| qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% |
| qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% |
| qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% |
| qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% |
| qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% |
| qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% |
| qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% |
| qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% |
| qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% |
| qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% |

Q4_0_8_8 offers a nice bump to prompt processing and a small bump to text generation.

A great write-up with charts showing various performances is provided by Artefact2 here.

The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have. If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB smaller than that total.

Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in the format 'QX_K_X', like Q5_K_M. If you want to get more into the weeds, you can check out this extremely useful feature chart. But basically, if you're aiming for below Q4 and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in the format IQX_X, like IQ3_M. These are newer and offer better performance for their size. These I-quants can also be used on CPU, but will be slower than their K-quant equivalents, so speed vs performance is a tradeoff you'll have to decide.

Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output. Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
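The download commands referenced in the card above were dropped from this listing. As a hedged stand-in, here is a minimal Python sketch using the huggingface_hub library; the repo id and the MXFP4 filename are taken from the table above, but verify the exact filename on the repo page since underscores were lost in this scrape.

```python
# Minimal sketch: fetch one quant of bartowski/openai_gpt-oss-20b-GGUF with huggingface_hub.
# Filename follows the table above; double-check it on the repo page before use.
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="bartowski/openai_gpt-oss-20b-GGUF",  # repo name as listed on this page
    filename="gpt-oss-20b-MXFP4.gguf",            # recommended full-MXFP4 quant from the table
    local_dir="./",                               # equivalent of "download them all in place (./)"
)
print(path)
```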

38,238
24

cognitivecomputations_Dolphin-Mistral-24B-Venice-Edition-GGUF

Llamacpp imatrix Quantizations of Dolphin-Mistral-24B-Venice-Edition by cognitivecomputations

license:apache-2.0
34,005
87

Llama-3.2-3B-Instruct-uncensored-GGUF

base_model:chuanli11/Llama-3.2-3B-Instruct-uncensored
30,273
71

Qwen2.5-7B-Instruct-GGUF

Llamacpp imatrix Quantizations of Qwen2.5-7B-Instruct

Original model: https://huggingface.co/Qwen/Qwen2.5-7B-Instruct

All quants made using the imatrix option with dataset from here.

| Filename | Quant type | File Size | Split | Description |
| -------- | ---------- | --------- | ----- | ----------- |
| Qwen2.5-7B-Instruct-f16.gguf | f16 | 15.24GB | false | Full F16 weights. |
| Qwen2.5-7B-Instruct-Q8_0.gguf | Q8_0 | 8.10GB | false | Extremely high quality, generally unneeded but max available quant. |
| Qwen2.5-7B-Instruct-Q6_K_L.gguf | Q6_K_L | 6.52GB | false | Uses Q8_0 for embed and output weights. Very high quality, near perfect, recommended. |
| Qwen2.5-7B-Instruct-Q6_K.gguf | Q6_K | 6.25GB | false | Very high quality, near perfect, recommended. |
| Qwen2.5-7B-Instruct-Q5_K_L.gguf | Q5_K_L | 5.78GB | false | Uses Q8_0 for embed and output weights. High quality, recommended. |
| Qwen2.5-7B-Instruct-Q5_K_M.gguf | Q5_K_M | 5.44GB | false | High quality, recommended. |
| Qwen2.5-7B-Instruct-Q5_K_S.gguf | Q5_K_S | 5.32GB | false | High quality, recommended. |
| Qwen2.5-7B-Instruct-Q4_K_L.gguf | Q4_K_L | 5.09GB | false | Uses Q8_0 for embed and output weights. Good quality, recommended. |
| Qwen2.5-7B-Instruct-Q4_K_M.gguf | Q4_K_M | 4.68GB | false | Good quality, default size for most use cases, recommended. |
| Qwen2.5-7B-Instruct-Q3_K_XL.gguf | Q3_K_XL | 4.57GB | false | Uses Q8_0 for embed and output weights. Lower quality but usable, good for low RAM availability. |
| Qwen2.5-7B-Instruct-Q4_K_S.gguf | Q4_K_S | 4.46GB | false | Slightly lower quality with more space savings, recommended. |
| Qwen2.5-7B-Instruct-Q4_0.gguf | Q4_0 | 4.44GB | false | Legacy format, generally not worth using over similarly sized formats. |
| Qwen2.5-7B-Instruct-Q4_0_8_8.gguf | Q4_0_8_8 | 4.43GB | false | Optimized for ARM inference. Requires 'sve' support (see link below). |
| Qwen2.5-7B-Instruct-Q4_0_4_8.gguf | Q4_0_4_8 | 4.43GB | false | Optimized for ARM inference. Requires 'i8mm' support (see link below). |
| Qwen2.5-7B-Instruct-Q4_0_4_4.gguf | Q4_0_4_4 | 4.43GB | false | Optimized for ARM inference. Should work well on all ARM chips, pick this if you're unsure. |
| Qwen2.5-7B-Instruct-IQ4_XS.gguf | IQ4_XS | 4.22GB | false | Decent quality, smaller than Q4_K_S with similar performance, recommended. |
| Qwen2.5-7B-Instruct-Q3_K_L.gguf | Q3_K_L | 4.09GB | false | Lower quality but usable, good for low RAM availability. |
| Qwen2.5-7B-Instruct-Q3_K_M.gguf | Q3_K_M | 3.81GB | false | Low quality. |
| Qwen2.5-7B-Instruct-IQ3_M.gguf | IQ3_M | 3.57GB | false | Medium-low quality, new method with decent performance comparable to Q3_K_M. |
| Qwen2.5-7B-Instruct-Q2_K_L.gguf | Q2_K_L | 3.55GB | false | Uses Q8_0 for embed and output weights. Very low quality but surprisingly usable. |
| Qwen2.5-7B-Instruct-Q3_K_S.gguf | Q3_K_S | 3.49GB | false | Low quality, not recommended. |
| Qwen2.5-7B-Instruct-IQ3_XS.gguf | IQ3_XS | 3.35GB | false | Lower quality, new method with decent performance, slightly better than Q3_K_S. |
| Qwen2.5-7B-Instruct-Q2_K.gguf | Q2_K | 3.02GB | false | Very low quality but surprisingly usable. |
| Qwen2.5-7B-Instruct-IQ2_M.gguf | IQ2_M | 2.78GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. |

Some of these quants (Q3_K_XL, Q4_K_L etc.) are the standard quantization method with the embeddings and output weights quantized to Q8_0 instead of what they would normally default to. Some say that this improves the quality, others don't notice any difference. If you use these models PLEASE COMMENT with your findings. I would like feedback that these are actually used and useful so I don't keep uploading quants no one is using.

First, make sure you have huggingface-cli installed. If the model is bigger than 50GB, it will have been split into multiple files; to download them all to a local folder, run huggingface-cli download with the quant's file pattern. You can either specify a new local-dir (Qwen2.5-7B-Instruct-Q8_0) or download them all in place (./).

These are NOT for Metal (Apple) offloading, only ARM chips. If you're using an ARM chip, the Q4_0_X_X quants will have a substantial speedup. Check out Q4_0_4_4 speed comparisons on the original pull request. To check which one would work best for your ARM chip, you can check AArch64 SoC features (thanks EloyOn!).

A great write-up with charts showing various performances is provided by Artefact2 here.

The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have. If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB smaller than that total.

Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in the format 'QX_K_X', like Q5_K_M. If you want to get more into the weeds, you can check out this extremely useful feature chart. But basically, if you're aiming for below Q4 and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in the format IQX_X, like IQ3_M. These are newer and offer better performance for their size. These I-quants can also be used on CPU and Apple Metal, but will be slower than their K-quant equivalents, so speed vs performance is a tradeoff you'll have to decide. The I-quants are not compatible with Vulkan, which also supports AMD, so if you have an AMD card, double check whether you're using the rocBLAS build or the Vulkan build. At the time of writing this, LM Studio has a preview with ROCm support, and other inference engines have specific builds for ROCm.

Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output. Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
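The card says the quants run directly with llama.cpp or any llama.cpp based project. Here is a small, hedged example using the llama-cpp-python bindings; the model path and sampling settings are illustrative, not from the card.

```python
# Illustrative sketch: load a local GGUF (e.g. the Q4_K_M file from the table above)
# with the llama-cpp-python bindings and run one chat turn.
from llama_cpp import Llama

llm = Llama(
    model_path="./Qwen2.5-7B-Instruct-Q4_K_M.gguf",  # assumed local path after download
    n_ctx=4096,        # context window; adjust to your RAM/VRAM budget
    n_gpu_layers=-1,   # offload all layers to GPU if it fits, otherwise lower this
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Say hello in one sentence."}]
)
print(out["choices"][0]["message"]["content"])
```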

license:apache-2.0
29,803
35

Qwen_Qwen3.5-4B-GGUF

license:apache-2.0
29,686
13

Phi-3.5-mini-instruct-GGUF

license:mit
28,471
69

TheDrummer_Cydonia-24B-v4.2.0-GGUF

Llamacpp imatrix Quantizations of Cydonia-24B-v4.2.0 by TheDrummer Original model: https://huggingface.co/TheDrummer/Cydonia-24B-v4.2.0 All quants made using imatrix option with dataset from here c...

24,695
13

Qwen2.5-Coder-32B-Instruct-GGUF

Llamacpp imatrix Quantizations of Qwen2.5-Coder-32B-Instruct

Original model: https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct

All quants made using the imatrix option with dataset from here.

| Filename | Quant type | File Size | Split | Description |
| -------- | ---------- | --------- | ----- | ----------- |
| Qwen2.5-Coder-32B-Instruct-Q8_0.gguf | Q8_0 | 34.82GB | false | Extremely high quality, generally unneeded but max available quant. |
| Qwen2.5-Coder-32B-Instruct-Q6_K_L.gguf | Q6_K_L | 27.26GB | false | Uses Q8_0 for embed and output weights. Very high quality, near perfect, recommended. |
| Qwen2.5-Coder-32B-Instruct-Q6_K.gguf | Q6_K | 26.89GB | false | Very high quality, near perfect, recommended. |
| Qwen2.5-Coder-32B-Instruct-Q5_K_L.gguf | Q5_K_L | 23.74GB | false | Uses Q8_0 for embed and output weights. High quality, recommended. |
| Qwen2.5-Coder-32B-Instruct-Q5_K_M.gguf | Q5_K_M | 23.26GB | false | High quality, recommended. |
| Qwen2.5-Coder-32B-Instruct-Q5_K_S.gguf | Q5_K_S | 22.64GB | false | High quality, recommended. |
| Qwen2.5-Coder-32B-Instruct-Q4_K_L.gguf | Q4_K_L | 20.43GB | false | Uses Q8_0 for embed and output weights. Good quality, recommended. |
| Qwen2.5-Coder-32B-Instruct-Q4_K_M.gguf | Q4_K_M | 19.85GB | false | Good quality, default size for most use cases, recommended. |
| Qwen2.5-Coder-32B-Instruct-Q4_K_S.gguf | Q4_K_S | 18.78GB | false | Slightly lower quality with more space savings, recommended. |
| Qwen2.5-Coder-32B-Instruct-Q4_0.gguf | Q4_0 | 18.71GB | false | Legacy format, generally not worth using over similarly sized formats. |
| Qwen2.5-Coder-32B-Instruct-IQ4_NL.gguf | IQ4_NL | 18.68GB | false | Similar to IQ4_XS, but slightly larger. |
| Qwen2.5-Coder-32B-Instruct-Q4_0_8_8.gguf | Q4_0_8_8 | 18.64GB | false | Optimized for ARM inference. Requires 'sve' support (see link below). Don't use on Mac or Windows. |
| Qwen2.5-Coder-32B-Instruct-Q4_0_4_8.gguf | Q4_0_4_8 | 18.64GB | false | Optimized for ARM inference. Requires 'i8mm' support (see link below). Don't use on Mac or Windows. |
| Qwen2.5-Coder-32B-Instruct-Q4_0_4_4.gguf | Q4_0_4_4 | 18.64GB | false | Optimized for ARM inference. Should work well on all ARM chips, pick this if you're unsure. Don't use on Mac or Windows. |
| Qwen2.5-Coder-32B-Instruct-Q3_K_XL.gguf | Q3_K_XL | 17.93GB | false | Uses Q8_0 for embed and output weights. Lower quality but usable, good for low RAM availability. |
| Qwen2.5-Coder-32B-Instruct-IQ4_XS.gguf | IQ4_XS | 17.69GB | false | Decent quality, smaller than Q4_K_S with similar performance, recommended. |
| Qwen2.5-Coder-32B-Instruct-Q3_K_L.gguf | Q3_K_L | 17.25GB | false | Lower quality but usable, good for low RAM availability. |
| Qwen2.5-Coder-32B-Instruct-Q3_K_M.gguf | Q3_K_M | 15.94GB | false | Low quality. |
| Qwen2.5-Coder-32B-Instruct-IQ3_M.gguf | IQ3_M | 14.81GB | false | Medium-low quality, new method with decent performance comparable to Q3_K_M. |
| Qwen2.5-Coder-32B-Instruct-Q3_K_S.gguf | Q3_K_S | 14.39GB | false | Low quality, not recommended. |
| Qwen2.5-Coder-32B-Instruct-IQ3_XS.gguf | IQ3_XS | 13.71GB | false | Lower quality, new method with decent performance, slightly better than Q3_K_S. |
| Qwen2.5-Coder-32B-Instruct-Q2_K_L.gguf | Q2_K_L | 13.07GB | false | Uses Q8_0 for embed and output weights. Very low quality but surprisingly usable. |
| Qwen2.5-Coder-32B-Instruct-IQ3_XXS.gguf | IQ3_XXS | 12.84GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. |
| Qwen2.5-Coder-32B-Instruct-Q2_K.gguf | Q2_K | 12.31GB | false | Very low quality but surprisingly usable. |
| Qwen2.5-Coder-32B-Instruct-IQ2_M.gguf | IQ2_M | 11.26GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. |
| Qwen2.5-Coder-32B-Instruct-IQ2_S.gguf | IQ2_S | 10.39GB | false | Low quality, uses SOTA techniques to be usable. |
| Qwen2.5-Coder-32B-Instruct-IQ2_XS.gguf | IQ2_XS | 9.96GB | false | Low quality, uses SOTA techniques to be usable. |
| Qwen2.5-Coder-32B-Instruct-IQ2_XXS.gguf | IQ2_XXS | 9.03GB | false | Very low quality, uses SOTA techniques to be usable. |

Some of these quants (Q3_K_XL, Q4_K_L etc.) are the standard quantization method with the embeddings and output weights quantized to Q8_0 instead of what they would normally default to. Some say that this improves the quality, others don't notice any difference. If you use these models PLEASE COMMENT with your findings. I would like feedback that these are actually used and useful so I don't keep uploading quants no one is using.

First, make sure you have huggingface-cli installed. If the model is bigger than 50GB, it will have been split into multiple files; to download them all to a local folder, run huggingface-cli download with the quant's file pattern. You can either specify a new local-dir (Qwen2.5-Coder-32B-Instruct-Q8_0) or download them all in place (./).

These are NOT for Metal (Apple) offloading, only ARM chips. If you're using an ARM chip, the Q4_0_X_X quants will have a substantial speedup. Check out Q4_0_4_4 speed comparisons on the original pull request. To check which one would work best for your ARM chip, you can check AArch64 SoC features (thanks EloyOn!).

A great write-up with charts showing various performances is provided by Artefact2 here.

The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have. If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB smaller than that total.

Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in the format 'QX_K_X', like Q5_K_M. If you want to get more into the weeds, you can check out this extremely useful feature chart. But basically, if you're aiming for below Q4 and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in the format IQX_X, like IQ3_M. These are newer and offer better performance for their size. These I-quants can also be used on CPU and Apple Metal, but will be slower than their K-quant equivalents, so speed vs performance is a tradeoff you'll have to decide. The I-quants are not compatible with Vulkan, which also supports AMD, so if you have an AMD card, double check whether you're using the rocBLAS build or the Vulkan build. At the time of writing this, LM Studio has a preview with ROCm support, and other inference engines have specific builds for ROCm.

Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output. Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
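The sizing advice above (pick a quant 1-2GB smaller than your VRAM, or than RAM plus VRAM for maximum quality) is easy to mechanize. Below is a small, hedged helper; the function name and headroom default are my own, and the file sizes are copied from the Qwen2.5-Coder-32B table above.

```python
# Hedged helper: apply the card's rule of thumb (quant file ~1-2GB smaller than
# available memory) to a few of the file sizes listed in the table above.
SIZES_GB = {
    "Q8_0": 34.82, "Q6_K": 26.89, "Q5_K_M": 23.26, "Q4_K_M": 19.85,
    "Q4_K_S": 18.78, "IQ4_XS": 17.69, "Q3_K_M": 15.94, "IQ3_M": 14.81,
    "IQ2_M": 11.26,
}

def pick_quant(memory_gb: float, headroom_gb: float = 2.0) -> str | None:
    """Return the largest listed quant that still leaves `headroom_gb` free."""
    fitting = {q: s for q, s in SIZES_GB.items() if s <= memory_gb - headroom_gb}
    return max(fitting, key=fitting.get) if fitting else None

print(pick_quant(24.0))  # e.g. a 24GB GPU -> largest quant at or under ~22GB
print(pick_quant(16.0))  # e.g. a 16GB GPU
```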

license:apache-2.0
23,038
92

Mistral-Nemo-Instruct-2407-GGUF

Llamacpp imatrix Quantizations of Mistral-Nemo-Instruct-2407

Original model: https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407

All quants made using the imatrix option with dataset from here.

| Filename | Quant type | File Size | Split | Description |
| -------- | ---------- | --------- | ----- | ----------- |
| Mistral-Nemo-Instruct-2407-f16.gguf | f16 | 24.50GB | false | Full F16 weights. |
| Mistral-Nemo-Instruct-2407-Q8_0.gguf | Q8_0 | 13.02GB | false | Extremely high quality, generally unneeded but max available quant. |
| Mistral-Nemo-Instruct-2407-Q6_K_L.gguf | Q6_K_L | 10.38GB | false | Uses Q8_0 for embed and output weights. Very high quality, near perfect, recommended. |
| Mistral-Nemo-Instruct-2407-Q6_K.gguf | Q6_K | 10.06GB | false | Very high quality, near perfect, recommended. |
| Mistral-Nemo-Instruct-2407-Q5_K_L.gguf | Q5_K_L | 9.14GB | false | Uses Q8_0 for embed and output weights. High quality, recommended. |
| Mistral-Nemo-Instruct-2407-Q5_K_M.gguf | Q5_K_M | 8.73GB | false | High quality, recommended. |
| Mistral-Nemo-Instruct-2407-Q5_K_S.gguf | Q5_K_S | 8.52GB | false | High quality, recommended. |
| Mistral-Nemo-Instruct-2407-Q4_K_L.gguf | Q4_K_L | 7.98GB | false | Uses Q8_0 for embed and output weights. Good quality, recommended. |
| Mistral-Nemo-Instruct-2407-Q4_K_M.gguf | Q4_K_M | 7.48GB | false | Good quality, default size for most use cases, recommended. |
| Mistral-Nemo-Instruct-2407-Q3_K_XL.gguf | Q3_K_XL | 7.15GB | false | Uses Q8_0 for embed and output weights. Lower quality but usable, good for low RAM availability. |
| Mistral-Nemo-Instruct-2407-Q4_K_S.gguf | Q4_K_S | 7.12GB | false | Slightly lower quality with more space savings, recommended. |
| Mistral-Nemo-Instruct-2407-Q4_0.gguf | Q4_0 | 7.09GB | false | Legacy format, generally not worth using over similarly sized formats. |
| Mistral-Nemo-Instruct-2407-Q4_0_8_8.gguf | Q4_0_8_8 | 7.07GB | false | Optimized for ARM inference. Requires 'sve' support (see link below). Don't use on Mac or Windows. |
| Mistral-Nemo-Instruct-2407-Q4_0_4_8.gguf | Q4_0_4_8 | 7.07GB | false | Optimized for ARM inference. Requires 'i8mm' support (see link below). Don't use on Mac or Windows. |
| Mistral-Nemo-Instruct-2407-Q4_0_4_4.gguf | Q4_0_4_4 | 7.07GB | false | Optimized for ARM inference. Should work well on all ARM chips, pick this if you're unsure. Don't use on Mac or Windows. |
| Mistral-Nemo-Instruct-2407-IQ4_XS.gguf | IQ4_XS | 6.74GB | false | Decent quality, smaller than Q4_K_S with similar performance, recommended. |
| Mistral-Nemo-Instruct-2407-Q3_K_L.gguf | Q3_K_L | 6.56GB | false | Lower quality but usable, good for low RAM availability. |
| Mistral-Nemo-Instruct-2407-Q3_K_M.gguf | Q3_K_M | 6.08GB | false | Low quality. |
| Mistral-Nemo-Instruct-2407-IQ3_M.gguf | IQ3_M | 5.72GB | false | Medium-low quality, new method with decent performance comparable to Q3_K_M. |
| Mistral-Nemo-Instruct-2407-Q3_K_S.gguf | Q3_K_S | 5.53GB | false | Low quality, not recommended. |
| Mistral-Nemo-Instruct-2407-Q2_K_L.gguf | Q2_K_L | 5.45GB | false | Uses Q8_0 for embed and output weights. Very low quality but surprisingly usable. |
| Mistral-Nemo-Instruct-2407-IQ3_XS.gguf | IQ3_XS | 5.31GB | false | Lower quality, new method with decent performance, slightly better than Q3_K_S. |
| Mistral-Nemo-Instruct-2407-Q2_K.gguf | Q2_K | 4.79GB | false | Very low quality but surprisingly usable. |
| Mistral-Nemo-Instruct-2407-IQ2_M.gguf | IQ2_M | 4.44GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. |

Some of these quants (Q3_K_XL, Q4_K_L etc.) are the standard quantization method with the embeddings and output weights quantized to Q8_0 instead of what they would normally default to. Some say that this improves the quality, others don't notice any difference. If you use these models PLEASE COMMENT with your findings. I would like feedback that these are actually used and useful so I don't keep uploading quants no one is using.

First, make sure you have huggingface-cli installed. If the model is bigger than 50GB, it will have been split into multiple files; to download them all to a local folder, run huggingface-cli download with the quant's file pattern. You can either specify a new local-dir (Mistral-Nemo-Instruct-2407-Q8_0) or download them all in place (./).

These are NOT for Metal (Apple) offloading, only ARM chips. If you're using an ARM chip, the Q4_0_X_X quants will have a substantial speedup. Check out Q4_0_4_4 speed comparisons on the original pull request. To check which one would work best for your ARM chip, you can check AArch64 SoC features (thanks EloyOn!).

A great write-up with charts showing various performances is provided by Artefact2 here.

The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have. If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB smaller than that total.

Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in the format 'QX_K_X', like Q5_K_M. If you want to get more into the weeds, you can check out this extremely useful feature chart. But basically, if you're aiming for below Q4 and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in the format IQX_X, like IQ3_M. These are newer and offer better performance for their size. These I-quants can also be used on CPU and Apple Metal, but will be slower than their K-quant equivalents, so speed vs performance is a tradeoff you'll have to decide. The I-quants are not compatible with Vulkan, which also supports AMD, so if you have an AMD card, double check whether you're using the rocBLAS build or the Vulkan build. At the time of writing this, LM Studio has a preview with ROCm support, and other inference engines have specific builds for ROCm.

Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output. Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski

license:apache-2.0
22,582
106

DeepSeek-R1-Distill-Qwen-32B-abliterated-GGUF

Llamacpp imatrix Quantizations of DeepSeek-R1-Distill-Qwen-32B-abliterated

Original model: https://huggingface.co/huihui-ai/DeepSeek-R1-Distill-Qwen-32B-abliterated

All quants made using the imatrix option with dataset from here.

| Filename | Quant type | File Size | Split | Description |
| -------- | ---------- | --------- | ----- | ----------- |
| DeepSeek-R1-Distill-Qwen-32B-abliterated-bf16.gguf | bf16 | 65.54GB | true | Full BF16 weights. |
| DeepSeek-R1-Distill-Qwen-32B-abliterated-Q8_0.gguf | Q8_0 | 34.82GB | false | Extremely high quality, generally unneeded but max available quant. |
| DeepSeek-R1-Distill-Qwen-32B-abliterated-Q6_K_L.gguf | Q6_K_L | 27.26GB | false | Uses Q8_0 for embed and output weights. Very high quality, near perfect, recommended. |
| DeepSeek-R1-Distill-Qwen-32B-abliterated-Q6_K.gguf | Q6_K | 26.89GB | false | Very high quality, near perfect, recommended. |
| DeepSeek-R1-Distill-Qwen-32B-abliterated-Q5_K_L.gguf | Q5_K_L | 23.74GB | false | Uses Q8_0 for embed and output weights. High quality, recommended. |
| DeepSeek-R1-Distill-Qwen-32B-abliterated-Q5_K_M.gguf | Q5_K_M | 23.26GB | false | High quality, recommended. |
| DeepSeek-R1-Distill-Qwen-32B-abliterated-Q5_K_S.gguf | Q5_K_S | 22.64GB | false | High quality, recommended. |
| DeepSeek-R1-Distill-Qwen-32B-abliterated-Q4_1.gguf | Q4_1 | 20.64GB | false | Legacy format, similar performance to Q4_K_S but with improved tokens/watt on Apple silicon. |
| DeepSeek-R1-Distill-Qwen-32B-abliterated-Q4_K_L.gguf | Q4_K_L | 20.43GB | false | Uses Q8_0 for embed and output weights. Good quality, recommended. |
| DeepSeek-R1-Distill-Qwen-32B-abliterated-Q4_K_M.gguf | Q4_K_M | 19.85GB | false | Good quality, default size for most use cases, recommended. |
| DeepSeek-R1-Distill-Qwen-32B-abliterated-Q4_K_S.gguf | Q4_K_S | 18.78GB | false | Slightly lower quality with more space savings, recommended. |
| DeepSeek-R1-Distill-Qwen-32B-abliterated-Q4_0.gguf | Q4_0 | 18.71GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. |
| DeepSeek-R1-Distill-Qwen-32B-abliterated-IQ4_NL.gguf | IQ4_NL | 18.68GB | false | Similar to IQ4_XS, but slightly larger. Offers online repacking for ARM CPU inference. |
| DeepSeek-R1-Distill-Qwen-32B-abliterated-Q3_K_XL.gguf | Q3_K_XL | 17.93GB | false | Uses Q8_0 for embed and output weights. Lower quality but usable, good for low RAM availability. |
| DeepSeek-R1-Distill-Qwen-32B-abliterated-IQ4_XS.gguf | IQ4_XS | 17.69GB | false | Decent quality, smaller than Q4_K_S with similar performance, recommended. |
| DeepSeek-R1-Distill-Qwen-32B-abliterated-Q3_K_L.gguf | Q3_K_L | 17.25GB | false | Lower quality but usable, good for low RAM availability. |
| DeepSeek-R1-Distill-Qwen-32B-abliterated-Q3_K_M.gguf | Q3_K_M | 15.94GB | false | Low quality. |
| DeepSeek-R1-Distill-Qwen-32B-abliterated-IQ3_M.gguf | IQ3_M | 14.81GB | false | Medium-low quality, new method with decent performance comparable to Q3_K_M. |
| DeepSeek-R1-Distill-Qwen-32B-abliterated-Q3_K_S.gguf | Q3_K_S | 14.39GB | false | Low quality, not recommended. |
| DeepSeek-R1-Distill-Qwen-32B-abliterated-IQ3_XS.gguf | IQ3_XS | 13.71GB | false | Lower quality, new method with decent performance, slightly better than Q3_K_S. |
| DeepSeek-R1-Distill-Qwen-32B-abliterated-Q2_K_L.gguf | Q2_K_L | 13.07GB | false | Uses Q8_0 for embed and output weights. Very low quality but surprisingly usable. |
| DeepSeek-R1-Distill-Qwen-32B-abliterated-Q2_K.gguf | Q2_K | 12.31GB | false | Very low quality but surprisingly usable. |
| DeepSeek-R1-Distill-Qwen-32B-abliterated-IQ2_M.gguf | IQ2_M | 11.26GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. |
| DeepSeek-R1-Distill-Qwen-32B-abliterated-IQ2_S.gguf | IQ2_S | 10.39GB | false | Low quality, uses SOTA techniques to be usable. |
| DeepSeek-R1-Distill-Qwen-32B-abliterated-IQ2_XS.gguf | IQ2_XS | 9.96GB | false | Low quality, uses SOTA techniques to be usable. |
| DeepSeek-R1-Distill-Qwen-32B-abliterated-IQ2_XXS.gguf | IQ2_XXS | 9.03GB | false | Very low quality, uses SOTA techniques to be usable. |

Some of these quants (Q3_K_XL, Q4_K_L etc.) are the standard quantization method with the embeddings and output weights quantized to Q8_0 instead of what they would normally default to.

First, make sure you have huggingface-cli installed. If the model is bigger than 50GB, it will have been split into multiple files; to download them all to a local folder, run huggingface-cli download with the quant's file pattern. You can either specify a new local-dir (DeepSeek-R1-Distill-Qwen-32B-abliterated-Q8_0) or download them all in place (./).

Previously, you would download Q4_0_4_4/4_8/8_8, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights; details in this PR. If you use Q4_0 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282 you will not be able to run the Q4_0_X_X files and will instead need to use Q4_0. Additionally, if you want to get slightly better quality, you can use IQ4_NL thanks to this PR, which will also repack the weights for ARM, though only the 4_4 for now. The loading time may be slower but it will result in an overall speed increase. I'm keeping this section to show the potential theoretical uplift in performance from using Q4_0 with online repacking.

Click to view benchmarks on an AVX2 system (EPYC7702)

| model | size | params | backend | threads | test | t/s | % (vs Q4_0) |
| ------ | ---------: | ---------: | ------- | ------: | ----: | ---: | ---: |
| qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% |
| qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% |
| qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% |
| qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% |
| qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% |
| qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% |
| qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% |
| qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% |
| qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% |
| qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% |
| qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% |
| qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% |
| qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% |
| qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% |
| qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% |
| qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% |
| qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% |
| qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% |

Q4_0_8_8 offers a nice bump to prompt processing and a small bump to text generation.

A great write-up with charts showing various performances is provided by Artefact2 here.

The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have. If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB smaller than that total.

Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in the format 'QX_K_X', like Q5_K_M. If you want to get more into the weeds, you can check out this extremely useful feature chart. But basically, if you're aiming for below Q4 and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in the format IQX_X, like IQ3_M. These are newer and offer better performance for their size. These I-quants can also be used on CPU and Apple Metal, but will be slower than their K-quant equivalents, so speed vs performance is a tradeoff you'll have to decide. The I-quants are not compatible with Vulkan, which also supports AMD, so if you have an AMD card, double check whether you're using the rocBLAS build or the Vulkan build. At the time of writing this, LM Studio has a preview with ROCm support, and other inference engines have specific builds for ROCm.

Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output. Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski

22,284
122

magnum-v4-12b-GGUF

22,249
7

Qwen2.5-72B-Instruct-GGUF

Llamacpp imatrix Quantizations of Qwen2.5-72B-Instruct

Original model: https://huggingface.co/Qwen/Qwen2.5-72B-Instruct

All quants made using the imatrix option with dataset from here.

| Filename | Quant type | File Size | Split | Description |
| -------- | ---------- | --------- | ----- | ----------- |
| Qwen2.5-72B-Instruct-Q8_0.gguf | Q8_0 | 77.26GB | true | Extremely high quality, generally unneeded but max available quant. |
| Qwen2.5-72B-Instruct-Q6_K.gguf | Q6_K | 64.35GB | true | Very high quality, near perfect, recommended. |
| Qwen2.5-72B-Instruct-Q5_K_M.gguf | Q5_K_M | 54.45GB | true | High quality, recommended. |
| Qwen2.5-72B-Instruct-Q4_K_M.gguf | Q4_K_M | 47.42GB | false | Good quality, default size for most use cases, recommended. |
| Qwen2.5-72B-Instruct-Q4_0.gguf | Q4_0 | 41.38GB | false | Legacy format, generally not worth using over similarly sized formats. |
| Qwen2.5-72B-Instruct-Q3_K_XL.gguf | Q3_K_XL | 40.60GB | false | Uses Q8_0 for embed and output weights. Lower quality but usable, good for low RAM availability. |
| Qwen2.5-72B-Instruct-IQ4_XS.gguf | IQ4_XS | 39.71GB | false | Decent quality, smaller than Q4_K_S with similar performance, recommended. |
| Qwen2.5-72B-Instruct-Q3_K_L.gguf | Q3_K_L | 39.51GB | false | Lower quality but usable, good for low RAM availability. |
| Qwen2.5-72B-Instruct-Q3_K_M.gguf | Q3_K_M | 37.70GB | false | Low quality. |
| Qwen2.5-72B-Instruct-IQ3_M.gguf | IQ3_M | 35.50GB | false | Medium-low quality, new method with decent performance comparable to Q3_K_M. |
| Qwen2.5-72B-Instruct-Q3_K_S.gguf | Q3_K_S | 34.49GB | false | Low quality, not recommended. |
| Qwen2.5-72B-Instruct-IQ3_XXS.gguf | IQ3_XXS | 31.85GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. |
| Qwen2.5-72B-Instruct-Q2_K_L.gguf | Q2_K_L | 31.03GB | false | Uses Q8_0 for embed and output weights. Very low quality but surprisingly usable. |
| Qwen2.5-72B-Instruct-Q2_K.gguf | Q2_K | 29.81GB | false | Very low quality but surprisingly usable. |
| Qwen2.5-72B-Instruct-IQ2_M.gguf | IQ2_M | 29.34GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. |
| Qwen2.5-72B-Instruct-IQ2_XS.gguf | IQ2_XS | 27.06GB | false | Low quality, uses SOTA techniques to be usable. |
| Qwen2.5-72B-Instruct-IQ2_XXS.gguf | IQ2_XXS | 25.49GB | false | Very low quality, uses SOTA techniques to be usable. |
| Qwen2.5-72B-Instruct-IQ1_M.gguf | IQ1_M | 23.74GB | false | Extremely low quality, not recommended. |

Some of these quants (Q3_K_XL, Q4_K_L etc.) are the standard quantization method with the embeddings and output weights quantized to Q8_0 instead of what they would normally default to. Some say that this improves the quality, others don't notice any difference. If you use these models PLEASE COMMENT with your findings. I would like feedback that these are actually used and useful so I don't keep uploading quants no one is using.

First, make sure you have huggingface-cli installed. If the model is bigger than 50GB, it will have been split into multiple files; to download them all to a local folder, run huggingface-cli download with the quant's file pattern. You can either specify a new local-dir (Qwen2.5-72B-Instruct-Q8_0) or download them all in place (./).

These are NOT for Metal (Apple) offloading, only ARM chips. If you're using an ARM chip, the Q4_0_X_X quants will have a substantial speedup. Check out Q4_0_4_4 speed comparisons on the original pull request. To check which one would work best for your ARM chip, you can check AArch64 SoC features (thanks EloyOn!).

A great write-up with charts showing various performances is provided by Artefact2 here.

The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have. If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB smaller than that total.

Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in the format 'QX_K_X', like Q5_K_M. If you want to get more into the weeds, you can check out this extremely useful feature chart. But basically, if you're aiming for below Q4 and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in the format IQX_X, like IQ3_M. These are newer and offer better performance for their size. These I-quants can also be used on CPU and Apple Metal, but will be slower than their K-quant equivalents, so speed vs performance is a tradeoff you'll have to decide. The I-quants are not compatible with Vulkan, which also supports AMD, so if you have an AMD card, double check whether you're using the rocBLAS build or the Vulkan build. At the time of writing this, LM Studio has a preview with ROCm support, and other inference engines have specific builds for ROCm.

Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output. Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
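For quants marked Split = true in this card (Q8_0, Q6_K, Q5_K_M), the repo stores multiple .gguf parts. A hedged sketch for grabbing every part of one quant into a local folder with huggingface_hub follows; the include pattern mirrors the card's "download them all to a local folder" step, and exact shard names should be checked on the repo page.

```python
# Hedged sketch: download every shard of a split quant (Split = true in the table)
# into a local folder, so llama.cpp can load the multi-part model from one place.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="bartowski/Qwen2.5-72B-Instruct-GGUF",  # repo name as listed on this page
    allow_patterns=["*Q8_0*"],                      # all parts of the Q8_0 split quant
    local_dir="Qwen2.5-72B-Instruct-Q8_0",          # the local-dir mentioned in the card
)
```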

22,162
38

Qwen_Qwen3-Coder-Next-GGUF

license:apache-2.0
20,710
11

Meta-Llama-3-8B-Instruct-GGUF

llama
19,287
104

NemoMix-Unleashed-12B-GGUF

19,127
111

mistralai_Mistral-Small-3.2-24B-Instruct-2506-GGUF

license:apache-2.0
18,772
35

DeepSeek-R1-Distill-Qwen-1.5B-GGUF

Llamacpp imatrix Quantizations of DeepSeek-R1-Distill-Qwen-1.5B

Original model: https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B

All quants made using the imatrix option with dataset from here.

| Filename | Quant type | File Size | Split | Description |
| -------- | ---------- | --------- | ----- | ----------- |
| DeepSeek-R1-Distill-Qwen-1.5B-f32.gguf | f32 | 7.11GB | false | Full F32 weights. |
| DeepSeek-R1-Distill-Qwen-1.5B-f16.gguf | f16 | 3.56GB | false | Full F16 weights. |
| DeepSeek-R1-Distill-Qwen-1.5B-Q8_0.gguf | Q8_0 | 1.89GB | false | Extremely high quality, generally unneeded but max available quant. |
| DeepSeek-R1-Distill-Qwen-1.5B-Q6_K_L.gguf | Q6_K_L | 1.58GB | false | Uses Q8_0 for embed and output weights. Very high quality, near perfect, recommended. |
| DeepSeek-R1-Distill-Qwen-1.5B-Q6_K.gguf | Q6_K | 1.46GB | false | Very high quality, near perfect, recommended. |
| DeepSeek-R1-Distill-Qwen-1.5B-Q5_K_L.gguf | Q5_K_L | 1.43GB | false | Uses Q8_0 for embed and output weights. High quality, recommended. |
| DeepSeek-R1-Distill-Qwen-1.5B-Q5_K_M.gguf | Q5_K_M | 1.29GB | false | High quality, recommended. |
| DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_L.gguf | Q4_K_L | 1.29GB | false | Uses Q8_0 for embed and output weights. Good quality, recommended. |
| DeepSeek-R1-Distill-Qwen-1.5B-Q5_K_S.gguf | Q5_K_S | 1.26GB | false | High quality, recommended. |
| DeepSeek-R1-Distill-Qwen-1.5B-Q3_K_XL.gguf | Q3_K_XL | 1.18GB | false | Uses Q8_0 for embed and output weights. Lower quality but usable, good for low RAM availability. |
| DeepSeek-R1-Distill-Qwen-1.5B-Q4_1.gguf | Q4_1 | 1.16GB | false | Legacy format, similar performance to Q4_K_S but with improved tokens/watt on Apple silicon. |
| DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_M.gguf | Q4_K_M | 1.12GB | false | Good quality, default size for most use cases, recommended. |
| DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_S.gguf | Q4_K_S | 1.07GB | false | Slightly lower quality with more space savings, recommended. |
| DeepSeek-R1-Distill-Qwen-1.5B-Q4_0.gguf | Q4_0 | 1.07GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. |
| DeepSeek-R1-Distill-Qwen-1.5B-IQ4_NL.gguf | IQ4_NL | 1.07GB | false | Similar to IQ4_XS, but slightly larger. Offers online repacking for ARM CPU inference. |
| DeepSeek-R1-Distill-Qwen-1.5B-IQ4_XS.gguf | IQ4_XS | 1.02GB | false | Decent quality, smaller than Q4_K_S with similar performance, recommended. |
| DeepSeek-R1-Distill-Qwen-1.5B-Q3_K_L.gguf | Q3_K_L | 0.98GB | false | Lower quality but usable, good for low RAM availability. |
| DeepSeek-R1-Distill-Qwen-1.5B-Q2_K_L.gguf | Q2_K_L | 0.98GB | false | Uses Q8_0 for embed and output weights. Very low quality but surprisingly usable. |
| DeepSeek-R1-Distill-Qwen-1.5B-Q3_K_M.gguf | Q3_K_M | 0.92GB | false | Low quality. |
| DeepSeek-R1-Distill-Qwen-1.5B-IQ3_M.gguf | IQ3_M | 0.88GB | false | Medium-low quality, new method with decent performance comparable to Q3_K_M. |
| DeepSeek-R1-Distill-Qwen-1.5B-Q3_K_S.gguf | Q3_K_S | 0.86GB | false | Low quality, not recommended. |
| DeepSeek-R1-Distill-Qwen-1.5B-IQ3_XS.gguf | IQ3_XS | 0.83GB | false | Lower quality, new method with decent performance, slightly better than Q3_K_S. |
| DeepSeek-R1-Distill-Qwen-1.5B-Q2_K.gguf | Q2_K | 0.75GB | false | Very low quality but surprisingly usable. |
| DeepSeek-R1-Distill-Qwen-1.5B-IQ2_M.gguf | IQ2_M | 0.70GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. |

Some of these quants (Q3_K_XL, Q4_K_L etc.) are the standard quantization method with the embeddings and output weights quantized to Q8_0 instead of what they would normally default to.

First, make sure you have huggingface-cli installed. If the model is bigger than 50GB, it will have been split into multiple files; to download them all to a local folder, run huggingface-cli download with the quant's file pattern. You can either specify a new local-dir (DeepSeek-R1-Distill-Qwen-1.5B-Q8_0) or download them all in place (./).

Previously, you would download Q4_0_4_4/4_8/8_8, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights; details in this PR. If you use Q4_0 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282 you will not be able to run the Q4_0_X_X files and will instead need to use Q4_0. Additionally, if you want to get slightly better quality, you can use IQ4_NL thanks to this PR, which will also repack the weights for ARM, though only the 4_4 for now. The loading time may be slower but it will result in an overall speed increase. I'm keeping this section to show the potential theoretical uplift in performance from using Q4_0 with online repacking.

Click to view benchmarks on an AVX2 system (EPYC7702)

| model | size | params | backend | threads | test | t/s | % (vs Q4_0) |
| ------ | ---------: | ---------: | ------- | ------: | ----: | ---: | ---: |
| qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% |
| qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% |
| qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% |
| qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% |
| qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% |
| qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% |
| qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% |
| qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% |
| qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% |
| qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% |
| qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% |
| qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% |
| qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% |
| qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% |
| qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% |
| qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% |
| qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% |
| qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% |

Q4_0_8_8 offers a nice bump to prompt processing and a small bump to text generation.

A great write-up with charts showing various performances is provided by Artefact2 here.

The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have. If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB smaller than that total.

Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in the format 'QX_K_X', like Q5_K_M. If you want to get more into the weeds, you can check out this extremely useful feature chart. But basically, if you're aiming for below Q4 and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in the format IQX_X, like IQ3_M. These are newer and offer better performance for their size. These I-quants can also be used on CPU and Apple Metal, but will be slower than their K-quant equivalents, so speed vs performance is a tradeoff you'll have to decide. The I-quants are not compatible with Vulkan, which also supports AMD, so if you have an AMD card, double check whether you're using the rocBLAS build or the Vulkan build. At the time of writing this, LM Studio has a preview with ROCm support, and other inference engines have specific builds for ROCm.

Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output. Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski

18,491
78

SmolLM2-1.7B-Instruct-GGUF

license:apache-2.0
17,492
18

Alibaba-NLP_Tongyi-DeepResearch-30B-A3B-GGUF

Llamacpp imatrix Quantizations of Tongyi-DeepResearch-30B-A3B by Alibaba-NLP Original model: https://huggingface.co/Alibaba-NLP/Tongyi-DeepResearch-30B-A3B All quants made using imatrix option with dataset from here combined with a subset of combinedallsmall.parquet from Ed Addario here Run them directly with llama.cpp, or any other llama.cpp based project | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | Tongyi-DeepResearch-30B-A3B-bf16.gguf | bf16 | 61.10GB | true | Full BF16 weights. | | Tongyi-DeepResearch-30B-A3B-Q80.gguf | Q80 | 32.48GB | false | Extremely high quality, generally unneeded but max available quant. | | Tongyi-DeepResearch-30B-A3B-Q6KL.gguf | Q6KL | 25.26GB | false | Uses Q80 for embed and output weights. Very high quality, near perfect, recommended. | | Tongyi-DeepResearch-30B-A3B-Q6K.gguf | Q6K | 25.10GB | false | Very high quality, near perfect, recommended. | | Tongyi-DeepResearch-30B-A3B-Q5KL.gguf | Q5KL | 21.94GB | false | Uses Q80 for embed and output weights. High quality, recommended. | | Tongyi-DeepResearch-30B-A3B-Q5KM.gguf | Q5KM | 21.74GB | false | High quality, recommended. | | Tongyi-DeepResearch-30B-A3B-Q5KS.gguf | Q5KS | 21.10GB | false | High quality, recommended. | | Tongyi-DeepResearch-30B-A3B-Q41.gguf | Q41 | 19.21GB | false | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | Tongyi-DeepResearch-30B-A3B-Q4KL.gguf | Q4KL | 18.86GB | false | Uses Q80 for embed and output weights. Good quality, recommended. | | Tongyi-DeepResearch-30B-A3B-Q4KM.gguf | Q4KM | 18.63GB | false | Good quality, default size for most use cases, recommended. | | Tongyi-DeepResearch-30B-A3B-Q4KS.gguf | Q4KS | 17.98GB | false | Slightly lower quality with more space savings, recommended. | | Tongyi-DeepResearch-30B-A3B-Q40.gguf | Q40 | 17.63GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. | | Tongyi-DeepResearch-30B-A3B-IQ4NL.gguf | IQ4NL | 17.39GB | false | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | Tongyi-DeepResearch-30B-A3B-IQ4XS.gguf | IQ4XS | 16.46GB | false | Decent quality, smaller than Q4KS with similar performance, recommended. | | Tongyi-DeepResearch-30B-A3B-Q3KXL.gguf | Q3KXL | 14.86GB | false | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | Tongyi-DeepResearch-30B-A3B-Q3KL.gguf | Q3KL | 14.58GB | false | Lower quality but usable, good for low RAM availability. | | Tongyi-DeepResearch-30B-A3B-Q3KM.gguf | Q3KM | 14.08GB | false | Low quality. | | Tongyi-DeepResearch-30B-A3B-IQ3M.gguf | IQ3M | 14.08GB | false | Medium-low quality, new method with decent performance comparable to Q3KM. | | Tongyi-DeepResearch-30B-A3B-Q3KS.gguf | Q3KS | 13.43GB | false | Low quality, not recommended. | | Tongyi-DeepResearch-30B-A3B-IQ3XS.gguf | IQ3XS | 12.74GB | false | Lower quality, new method with decent performance, slightly better than Q3KS. | | Tongyi-DeepResearch-30B-A3B-IQ3XXS.gguf | IQ3XXS | 12.22GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. | | Tongyi-DeepResearch-30B-A3B-Q2KL.gguf | Q2KL | 11.21GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | Tongyi-DeepResearch-30B-A3B-Q2K.gguf | Q2K | 10.91GB | false | Very low quality but surprisingly usable. 
| | Tongyi-DeepResearch-30B-A3B-IQ2M.gguf | IQ2M | 9.87GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. | | Tongyi-DeepResearch-30B-A3B-IQ2S.gguf | IQ2S | 8.74GB | false | Low quality, uses SOTA techniques to be usable. | | Tongyi-DeepResearch-30B-A3B-IQ2XS.gguf | IQ2XS | 8.66GB | false | Low quality, uses SOTA techniques to be usable. | | Tongyi-DeepResearch-30B-A3B-IQ2XXS.gguf | IQ2XXS | 7.57GB | false | Very low quality, uses SOTA techniques to be usable. | Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to. First, make sure you have hugginface-cli installed: If the model is bigger than 50GB, it will have been split into multiple files. In order to download them all to a local folder, run: You can either specify a new local-dir (Alibaba-NLPTongyi-DeepResearch-30B-A3B-Q80) or download them all in place (./) Previously, you would download Q4044/48/88, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights. details in this PR. If you use Q40 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282 you will not be able to run the Q40XX files and will instead need to use Q40. Additionally, if you want to get slightly better quality for , you can use IQ4NL thanks to this PR which will also repack the weights for ARM, though only the 44 for now. The loading time may be slower but it will result in an overall speed incrase. I'm keeping this section to show the potential theoretical uplift in performance from using the Q40 with online repacking. 
Click to view benchmarks on an AVX2 system (EPYC7702) | model | size | params | backend | threads | test | t/s | % (vs Q40) | | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% | Q4088 offers a nice bump to prompt processing and a small bump to text generation A great write up with charts showing various performances is provided by Artefact2 here The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have. If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB Smaller than that total. Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart: But basically, if you're aiming for below Q4, and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in format IQXX, like IQ3M. These are newer and offer better performance for their size. These I-quants can also be used on CPU, but will be slower than their K-quant equivalent, so speed vs performance is a tradeoff you'll have to decide. Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output. Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
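The install and download commands referenced in the card above were stripped during formatting; the following is a minimal sketch of what they typically look like, assuming a recent huggingface_hub release and that the repo id matches the listing title (bartowski/Alibaba-NLP_Tongyi-DeepResearch-30B-A3B-GGUF). Note that the on-disk filenames normally keep their underscores (e.g. Q4_K_M rather than Q4KM as rendered above).

```bash
# Install the Hugging Face CLI (assumed invocation; see the huggingface_hub docs for your version)
pip install -U "huggingface_hub[cli]"

# Download a single quant into the current directory
huggingface-cli download bartowski/Alibaba-NLP_Tongyi-DeepResearch-30B-A3B-GGUF \
  --include "Tongyi-DeepResearch-30B-A3B-Q4_K_M.gguf" \
  --local-dir ./
```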

NaNK
17,113
16

google_gemma-3-4b-it-GGUF

NaNK
16,582
27

cerebras_GLM-4.5-Air-REAP-82B-A12B-GGUF

NaNK
16,579
21

Meta-Llama-3.1-70B-Instruct-GGUF

NaNK
llama
16,332
65

TheDrummer_Cydonia-24B-v4.1-GGUF

Llamacpp imatrix Quantizations of Cydonia-24B-v4.1 by TheDrummer Original model: https://huggingface.co/TheDrummer/Cydonia-24B-v4.1 All quants made using imatrix option with dataset from here combined with a subset of combinedallsmall.parquet from Ed Addario here Run them directly with llama.cpp, or any other llama.cpp based project | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | Cydonia-24B-v4.1-bf16.gguf | bf16 | 47.15GB | false | Full BF16 weights. | | Cydonia-24B-v4.1-Q80.gguf | Q80 | 25.05GB | false | Extremely high quality, generally unneeded but max available quant. | | Cydonia-24B-v4.1-Q6KL.gguf | Q6KL | 19.67GB | false | Uses Q80 for embed and output weights. Very high quality, near perfect, recommended. | | Cydonia-24B-v4.1-Q6K.gguf | Q6K | 19.35GB | false | Very high quality, near perfect, recommended. | | Cydonia-24B-v4.1-Q5KL.gguf | Q5KL | 17.18GB | false | Uses Q80 for embed and output weights. High quality, recommended. | | Cydonia-24B-v4.1-Q5KM.gguf | Q5KM | 16.76GB | false | High quality, recommended. | | Cydonia-24B-v4.1-Q5KS.gguf | Q5KS | 16.30GB | false | High quality, recommended. | | Cydonia-24B-v4.1-Q41.gguf | Q41 | 14.87GB | false | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | Cydonia-24B-v4.1-Q4KL.gguf | Q4KL | 14.83GB | false | Uses Q80 for embed and output weights. Good quality, recommended. | | Cydonia-24B-v4.1-Q4KM.gguf | Q4KM | 14.33GB | false | Good quality, default size for most use cases, recommended. | | Cydonia-24B-v4.1-Q4KS.gguf | Q4KS | 13.55GB | false | Slightly lower quality with more space savings, recommended. | | Cydonia-24B-v4.1-Q40.gguf | Q40 | 13.49GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. | | Cydonia-24B-v4.1-IQ4NL.gguf | IQ4NL | 13.47GB | false | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | Cydonia-24B-v4.1-Q3KXL.gguf | Q3KXL | 12.99GB | false | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | Cydonia-24B-v4.1-IQ4XS.gguf | IQ4XS | 12.76GB | false | Decent quality, smaller than Q4KS with similar performance, recommended. | | Cydonia-24B-v4.1-Q3KL.gguf | Q3KL | 12.40GB | false | Lower quality but usable, good for low RAM availability. | | Cydonia-24B-v4.1-Q3KM.gguf | Q3KM | 11.47GB | false | Low quality. | | Cydonia-24B-v4.1-IQ3M.gguf | IQ3M | 10.65GB | false | Medium-low quality, new method with decent performance comparable to Q3KM. | | Cydonia-24B-v4.1-Q3KS.gguf | Q3KS | 10.40GB | false | Low quality, not recommended. | | Cydonia-24B-v4.1-IQ3XS.gguf | IQ3XS | 9.91GB | false | Lower quality, new method with decent performance, slightly better than Q3KS. | | Cydonia-24B-v4.1-Q2KL.gguf | Q2KL | 9.55GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | Cydonia-24B-v4.1-IQ3XXS.gguf | IQ3XXS | 9.28GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. | | Cydonia-24B-v4.1-Q2K.gguf | Q2K | 8.89GB | false | Very low quality but surprisingly usable. | | Cydonia-24B-v4.1-IQ2M.gguf | IQ2M | 8.11GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. | | Cydonia-24B-v4.1-IQ2S.gguf | IQ2S | 7.48GB | false | Low quality, uses SOTA techniques to be usable. | | Cydonia-24B-v4.1-IQ2XS.gguf | IQ2XS | 7.21GB | false | Low quality, uses SOTA techniques to be usable. 
| Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to. First, make sure you have hugginface-cli installed: If the model is bigger than 50GB, it will have been split into multiple files. In order to download them all to a local folder, run: You can either specify a new local-dir (TheDrummerCydonia-24B-v4.1-Q80) or download them all in place (./) Previously, you would download Q4044/48/88, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights. details in this PR. If you use Q40 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282 you will not be able to run the Q40XX files and will instead need to use Q40. Additionally, if you want to get slightly better quality for , you can use IQ4NL thanks to this PR which will also repack the weights for ARM, though only the 44 for now. The loading time may be slower but it will result in an overall speed incrase. I'm keeping this section to show the potential theoretical uplift in performance from using the Q40 with online repacking. Click to view benchmarks on an AVX2 system (EPYC7702) | model | size | params | backend | threads | test | t/s | % (vs Q40) | | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% | Q4088 offers a nice bump to prompt processing and a small bump to text generation A great write up with charts showing various performances is provided by Artefact2 here The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have. If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. 
If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB smaller than that total.

Next, you'll need to decide whether you want an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in the format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart: But basically, if you're aiming for below Q4 and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in the format IQXX, like IQ3M. They are newer and offer better performance for their size. The I-quants can also be used on CPU, but will be slower than their K-quant equivalents, so speed vs. performance is a tradeoff you'll have to decide.

Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output. Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
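As a rough illustration of the VRAM sizing advice above: the Q4KM file listed at 14.33GB fits a 16GB GPU with the recommended 1-2GB of headroom. A minimal run might look like the sketch below, assuming a llama.cpp build where the binary is named llama-cli and that the downloaded filename keeps its underscores (Q4_K_M); the context size is just an example.

```bash
# Fully offload the ~14GB quant to the GPU (-ngl 99 requests all layers) and run a short prompt
./llama-cli -m ./Cydonia-24B-v4.1-Q4_K_M.gguf -ngl 99 -c 8192 -p "Hello"
```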

NaNK
14,974
25

mlabonne_Qwen3-14B-abliterated-GGUF

Llamacpp imatrix Quantizations of Qwen3-14B-abliterated by mlabonne Original model: https://huggingface.co/mlabonne/Qwen3-14B-abliterated All quants made using imatrix option with dataset from here Run them directly with llama.cpp, or any other llama.cpp based project | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | Qwen3-14B-abliterated-bf16.gguf | bf16 | 29.54GB | false | Full BF16 weights. | | Qwen3-14B-abliterated-Q80.gguf | Q80 | 15.70GB | false | Extremely high quality, generally unneeded but max available quant. | | Qwen3-14B-abliterated-Q6KL.gguf | Q6KL | 12.50GB | false | Uses Q80 for embed and output weights. Very high quality, near perfect, recommended. | | Qwen3-14B-abliterated-Q6K.gguf | Q6K | 12.12GB | false | Very high quality, near perfect, recommended. | | Qwen3-14B-abliterated-Q5KL.gguf | Q5KL | 10.99GB | false | Uses Q80 for embed and output weights. High quality, recommended. | | Qwen3-14B-abliterated-Q5KM.gguf | Q5KM | 10.51GB | false | High quality, recommended. | | Qwen3-14B-abliterated-Q5KS.gguf | Q5KS | 10.26GB | false | High quality, recommended. | | Qwen3-14B-abliterated-Q4KL.gguf | Q4KL | 9.58GB | false | Uses Q80 for embed and output weights. Good quality, recommended. | | Qwen3-14B-abliterated-Q41.gguf | Q41 | 9.39GB | false | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | Qwen3-14B-abliterated-Q4KM.gguf | Q4KM | 9.00GB | false | Good quality, default size for most use cases, recommended. | | Qwen3-14B-abliterated-Q3KXL.gguf | Q3KXL | 8.58GB | false | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | Qwen3-14B-abliterated-Q4KS.gguf | Q4KS | 8.57GB | false | Slightly lower quality with more space savings, recommended. | | Qwen3-14B-abliterated-Q40.gguf | Q40 | 8.54GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. | | Qwen3-14B-abliterated-IQ4NL.gguf | IQ4NL | 8.54GB | false | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | Qwen3-14B-abliterated-IQ4XS.gguf | IQ4XS | 8.11GB | false | Decent quality, smaller than Q4KS with similar performance, recommended. | | Qwen3-14B-abliterated-Q3KL.gguf | Q3KL | 7.90GB | false | Lower quality but usable, good for low RAM availability. | | Qwen3-14B-abliterated-Q3KM.gguf | Q3KM | 7.32GB | false | Low quality. | | Qwen3-14B-abliterated-IQ3M.gguf | IQ3M | 6.88GB | false | Medium-low quality, new method with decent performance comparable to Q3KM. | | Qwen3-14B-abliterated-Q3KS.gguf | Q3KS | 6.66GB | false | Low quality, not recommended. | | Qwen3-14B-abliterated-Q2KL.gguf | Q2KL | 6.51GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | Qwen3-14B-abliterated-IQ3XS.gguf | IQ3XS | 6.38GB | false | Lower quality, new method with decent performance, slightly better than Q3KS. | | Qwen3-14B-abliterated-IQ3XXS.gguf | IQ3XXS | 5.94GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. | | Qwen3-14B-abliterated-Q2K.gguf | Q2K | 5.75GB | false | Very low quality but surprisingly usable. | | Qwen3-14B-abliterated-IQ2M.gguf | IQ2M | 5.32GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. | | Qwen3-14B-abliterated-IQ2S.gguf | IQ2S | 4.96GB | false | Low quality, uses SOTA techniques to be usable. 
| Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to. First, make sure you have hugginface-cli installed: If the model is bigger than 50GB, it will have been split into multiple files. In order to download them all to a local folder, run: You can either specify a new local-dir (mlabonneQwen3-14B-abliterated-Q80) or download them all in place (./) Previously, you would download Q4044/48/88, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights. details in this PR. If you use Q40 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282 you will not be able to run the Q40XX files and will instead need to use Q40. Additionally, if you want to get slightly better quality for , you can use IQ4NL thanks to this PR which will also repack the weights for ARM, though only the 44 for now. The loading time may be slower but it will result in an overall speed incrase. I'm keeping this section to show the potential theoretical uplift in performance from using the Q40 with online repacking. Click to view benchmarks on an AVX2 system (EPYC7702) | model | size | params | backend | threads | test | t/s | % (vs Q40) | | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% | Q4088 offers a nice bump to prompt processing and a small bump to text generation A great write up with charts showing various performances is provided by Artefact2 here The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have. If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. 
If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB smaller than that total.

Next, you'll need to decide whether you want an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in the format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart: But basically, if you're aiming for below Q4 and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in the format IQXX, like IQ3M. They are newer and offer better performance for their size. The I-quants can also be used on CPU, but will be slower than their K-quant equivalents, so speed vs. performance is a tradeoff you'll have to decide.

Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output. Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
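If you want to reproduce numbers like the AVX2 benchmark table above on your own hardware, llama.cpp ships a llama-bench tool. A minimal sketch, with a placeholder model path and the same prompt-processing / text-generation sizes as the table:

```bash
# pp512/pp1024/pp2048 map to -p, tg128/tg256/tg512 map to -n; -t sets the CPU thread count
./llama-bench -m ./Qwen3-14B-abliterated-Q4_K_M.gguf -p 512,1024,2048 -n 128,256,512 -t 64
```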

NaNK
license:apache-2.0
14,883
27

DeepSeek-R1-Distill-Llama-8B-GGUF

NaNK
base_model:deepseek-ai/DeepSeek-R1-Distill-Llama-8B
14,753
50

DeepSeek-R1-Distill-Qwen-7B-GGUF

Llamacpp imatrix Quantizations of DeepSeek-R1-Distill-Qwen-7B Original model: https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B All quants made using imatrix option with dataset from here | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | DeepSeek-R1-Distill-Qwen-7B-f32.gguf | f32 | 30.47GB | false | Full F32 weights. | | DeepSeek-R1-Distill-Qwen-7B-f16.gguf | f16 | 15.24GB | false | Full F16 weights. | | DeepSeek-R1-Distill-Qwen-7B-Q80.gguf | Q80 | 8.10GB | false | Extremely high quality, generally unneeded but max available quant. | | DeepSeek-R1-Distill-Qwen-7B-Q6KL.gguf | Q6KL | 6.52GB | false | Uses Q80 for embed and output weights. Very high quality, near perfect, recommended. | | DeepSeek-R1-Distill-Qwen-7B-Q6K.gguf | Q6K | 6.25GB | false | Very high quality, near perfect, recommended. | | DeepSeek-R1-Distill-Qwen-7B-Q5KL.gguf | Q5KL | 5.78GB | false | Uses Q80 for embed and output weights. High quality, recommended. | | DeepSeek-R1-Distill-Qwen-7B-Q5KM.gguf | Q5KM | 5.44GB | false | High quality, recommended. | | DeepSeek-R1-Distill-Qwen-7B-Q5KS.gguf | Q5KS | 5.32GB | false | High quality, recommended. | | DeepSeek-R1-Distill-Qwen-7B-Q4KL.gguf | Q4KL | 5.09GB | false | Uses Q80 for embed and output weights. Good quality, recommended. | | DeepSeek-R1-Distill-Qwen-7B-Q41.gguf | Q41 | 4.87GB | false | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | DeepSeek-R1-Distill-Qwen-7B-Q4KM.gguf | Q4KM | 4.68GB | false | Good quality, default size for most use cases, recommended. | | DeepSeek-R1-Distill-Qwen-7B-Q3KXL.gguf | Q3KXL | 4.57GB | false | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | DeepSeek-R1-Distill-Qwen-7B-Q4KS.gguf | Q4KS | 4.46GB | false | Slightly lower quality with more space savings, recommended. | | DeepSeek-R1-Distill-Qwen-7B-Q40.gguf | Q40 | 4.44GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. | | DeepSeek-R1-Distill-Qwen-7B-IQ4NL.gguf | IQ4NL | 4.44GB | false | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | DeepSeek-R1-Distill-Qwen-7B-IQ4XS.gguf | IQ4XS | 4.22GB | false | Decent quality, smaller than Q4KS with similar performance, recommended. | | DeepSeek-R1-Distill-Qwen-7B-Q3KL.gguf | Q3KL | 4.09GB | false | Lower quality but usable, good for low RAM availability. | | DeepSeek-R1-Distill-Qwen-7B-Q3KM.gguf | Q3KM | 3.81GB | false | Low quality. | | DeepSeek-R1-Distill-Qwen-7B-IQ3M.gguf | IQ3M | 3.57GB | false | Medium-low quality, new method with decent performance comparable to Q3KM. | | DeepSeek-R1-Distill-Qwen-7B-Q2KL.gguf | Q2KL | 3.55GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | DeepSeek-R1-Distill-Qwen-7B-Q3KS.gguf | Q3KS | 3.49GB | false | Low quality, not recommended. | | DeepSeek-R1-Distill-Qwen-7B-IQ3XS.gguf | IQ3XS | 3.35GB | false | Lower quality, new method with decent performance, slightly better than Q3KS. | | DeepSeek-R1-Distill-Qwen-7B-Q2K.gguf | Q2K | 3.02GB | false | Very low quality but surprisingly usable. | | DeepSeek-R1-Distill-Qwen-7B-IQ2M.gguf | IQ2M | 2.78GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. | Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to. 
First, make sure you have hugginface-cli installed: If the model is bigger than 50GB, it will have been split into multiple files. In order to download them all to a local folder, run: You can either specify a new local-dir (DeepSeek-R1-Distill-Qwen-7B-Q80) or download them all in place (./) Previously, you would download Q4044/48/88, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights. details in this PR. If you use Q40 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282 you will not be able to run the Q40XX files and will instead need to use Q40. Additionally, if you want to get slightly better quality for , you can use IQ4NL thanks to this PR which will also repack the weights for ARM, though only the 44 for now. The loading time may be slower but it will result in an overall speed incrase. I'm keeping this section to show the potential theoretical uplift in performance from using the Q40 with online repacking. Click to view benchmarks on an AVX2 system (EPYC7702) | model | size | params | backend | threads | test | t/s | % (vs Q40) | | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% | Q4088 offers a nice bump to prompt processing and a small bump to text generation A great write up with charts showing various performances is provided by Artefact2 here The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have. If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB Smaller than that total. 
Next, you'll need to decide whether you want an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in the format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart: But basically, if you're aiming for below Q4 and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in the format IQXX, like IQ3M. They are newer and offer better performance for their size. The I-quants can also be used on CPU and Apple Metal, but will be slower than their K-quant equivalents, so speed vs. performance is a tradeoff you'll have to decide. The I-quants are not compatible with Vulkan, which also targets AMD, so if you have an AMD card, double-check whether you're using the rocBLAS build or the Vulkan build. At the time of writing, LM Studio has a preview with ROCm support, and other inference engines offer specific builds for ROCm.

Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output. Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
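To take advantage of the online repacking described in the card above, you simply load the plain Q40 file; no extra flags are needed. A minimal sketch, assuming a llama.cpp build newer than b4282 and that the downloaded filename keeps its usual underscores:

```bash
# On ARM/AVX CPUs, llama.cpp repacks Q4_0 weights automatically at load time (slower load, faster inference)
./llama-cli -m ./DeepSeek-R1-Distill-Qwen-7B-Q4_0.gguf -p "Why is the sky blue?"
```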

NaNK
14,570
101

TheDrummer_Magidonia-24B-v4.2.0-GGUF

Llamacpp imatrix Quantizations of Magidonia-24B-v4.2.0 by TheDrummer Original model: https://huggingface.co/TheDrummer/Magidonia-24B-v4.2.0 All quants made using imatrix option with dataset from here combined with a subset of combinedallsmall.parquet from Ed Addario here Run them directly with llama.cpp, or any other llama.cpp based project | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | Magidonia-24B-v4.2.0-bf16.gguf | bf16 | 47.15GB | false | Full BF16 weights. | | Magidonia-24B-v4.2.0-Q80.gguf | Q80 | 25.05GB | false | Extremely high quality, generally unneeded but max available quant. | | Magidonia-24B-v4.2.0-Q6KL.gguf | Q6KL | 19.67GB | false | Uses Q80 for embed and output weights. Very high quality, near perfect, recommended. | | Magidonia-24B-v4.2.0-Q6K.gguf | Q6K | 19.35GB | false | Very high quality, near perfect, recommended. | | Magidonia-24B-v4.2.0-Q5KL.gguf | Q5KL | 17.18GB | false | Uses Q80 for embed and output weights. High quality, recommended. | | Magidonia-24B-v4.2.0-Q5KM.gguf | Q5KM | 16.76GB | false | High quality, recommended. | | Magidonia-24B-v4.2.0-Q5KS.gguf | Q5KS | 16.30GB | false | High quality, recommended. | | Magidonia-24B-v4.2.0-Q41.gguf | Q41 | 14.87GB | false | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | Magidonia-24B-v4.2.0-Q4KL.gguf | Q4KL | 14.83GB | false | Uses Q80 for embed and output weights. Good quality, recommended. | | Magidonia-24B-v4.2.0-Q4KM.gguf | Q4KM | 14.33GB | false | Good quality, default size for most use cases, recommended. | | Magidonia-24B-v4.2.0-Q4KS.gguf | Q4KS | 13.55GB | false | Slightly lower quality with more space savings, recommended. | | Magidonia-24B-v4.2.0-Q40.gguf | Q40 | 13.49GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. | | Magidonia-24B-v4.2.0-IQ4NL.gguf | IQ4NL | 13.47GB | false | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | Magidonia-24B-v4.2.0-Q3KXL.gguf | Q3KXL | 12.99GB | false | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | Magidonia-24B-v4.2.0-IQ4XS.gguf | IQ4XS | 12.76GB | false | Decent quality, smaller than Q4KS with similar performance, recommended. | | Magidonia-24B-v4.2.0-Q3KL.gguf | Q3KL | 12.40GB | false | Lower quality but usable, good for low RAM availability. | | Magidonia-24B-v4.2.0-Q3KM.gguf | Q3KM | 11.47GB | false | Low quality. | | Magidonia-24B-v4.2.0-IQ3M.gguf | IQ3M | 10.65GB | false | Medium-low quality, new method with decent performance comparable to Q3KM. | | Magidonia-24B-v4.2.0-Q3KS.gguf | Q3KS | 10.40GB | false | Low quality, not recommended. | | Magidonia-24B-v4.2.0-IQ3XS.gguf | IQ3XS | 9.91GB | false | Lower quality, new method with decent performance, slightly better than Q3KS. | | Magidonia-24B-v4.2.0-Q2KL.gguf | Q2KL | 9.55GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | Magidonia-24B-v4.2.0-IQ3XXS.gguf | IQ3XXS | 9.28GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. | | Magidonia-24B-v4.2.0-Q2K.gguf | Q2K | 8.89GB | false | Very low quality but surprisingly usable. | | Magidonia-24B-v4.2.0-IQ2M.gguf | IQ2M | 8.11GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. | | Magidonia-24B-v4.2.0-IQ2S.gguf | IQ2S | 7.48GB | false | Low quality, uses SOTA techniques to be usable. 
| | Magidonia-24B-v4.2.0-IQ2XS.gguf | IQ2XS | 7.21GB | false | Low quality, uses SOTA techniques to be usable. | Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to. First, make sure you have hugginface-cli installed: If the model is bigger than 50GB, it will have been split into multiple files. In order to download them all to a local folder, run: You can either specify a new local-dir (TheDrummerMagidonia-24B-v4.2.0-Q80) or download them all in place (./) Previously, you would download Q4044/48/88, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights. details in this PR. If you use Q40 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282 you will not be able to run the Q40XX files and will instead need to use Q40. Additionally, if you want to get slightly better quality for , you can use IQ4NL thanks to this PR which will also repack the weights for ARM, though only the 44 for now. The loading time may be slower but it will result in an overall speed incrase. I'm keeping this section to show the potential theoretical uplift in performance from using the Q40 with online repacking. Click to view benchmarks on an AVX2 system (EPYC7702) | model | size | params | backend | threads | test | t/s | % (vs Q40) | | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% | Q4088 offers a nice bump to prompt processing and a small bump to text generation A great write up with charts showing various performances is provided by Artefact2 here The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have. 
If you want your model running as FAST as possible, you'll want to fit the whole thing in your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB smaller than that total.

Next, you'll need to decide whether you want an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in the format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart: But basically, if you're aiming for below Q4 and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in the format IQXX, like IQ3M. They are newer and offer better performance for their size. The I-quants can also be used on CPU, but will be slower than their K-quant equivalents, so speed vs. performance is a tradeoff you'll have to decide.

Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output. Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
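To make the sizing advice concrete with this card's own numbers (an illustrative example, assuming a 24GB GPU and that on-disk filenames keep their underscores):

```bash
# Magidonia-24B-v4.2.0-Q5_K_L.gguf is listed at 17.18GB -> fits a 24GB card with several GB left for context
# Magidonia-24B-v4.2.0-bf16.gguf is listed at 47.15GB -> only runs by combining system RAM with VRAM
ls -lh Magidonia-24B-v4.2.0-Q5_K_L.gguf   # sanity-check the downloaded size before loading
```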

NaNK
14,482
6

Qwen_Qwen3.5-0.8B-GGUF

NaNK
license:apache-2.0
14,389
6

DeepSeek-R1-Distill-Qwen-14B-GGUF

Llamacpp imatrix Quantizations of DeepSeek-R1-Distill-Qwen-14B Original model: https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-14B All quants made using imatrix option with dataset from here | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | DeepSeek-R1-Distill-Qwen-14B-f32.gguf | f32 | 59.09GB | true | Full F32 weights. | | DeepSeek-R1-Distill-Qwen-14B-f16.gguf | f16 | 29.55GB | false | Full F16 weights. | | DeepSeek-R1-Distill-Qwen-14B-Q80.gguf | Q80 | 15.70GB | false | Extremely high quality, generally unneeded but max available quant. | | DeepSeek-R1-Distill-Qwen-14B-Q6KL.gguf | Q6KL | 12.50GB | false | Uses Q80 for embed and output weights. Very high quality, near perfect, recommended. | | DeepSeek-R1-Distill-Qwen-14B-Q6K.gguf | Q6K | 12.12GB | false | Very high quality, near perfect, recommended. | | DeepSeek-R1-Distill-Qwen-14B-Q5KL.gguf | Q5KL | 10.99GB | false | Uses Q80 for embed and output weights. High quality, recommended. | | DeepSeek-R1-Distill-Qwen-14B-Q5KM.gguf | Q5KM | 10.51GB | false | High quality, recommended. | | DeepSeek-R1-Distill-Qwen-14B-Q5KS.gguf | Q5KS | 10.27GB | false | High quality, recommended. | | DeepSeek-R1-Distill-Qwen-14B-Q4KL.gguf | Q4KL | 9.57GB | false | Uses Q80 for embed and output weights. Good quality, recommended. | | DeepSeek-R1-Distill-Qwen-14B-Q41.gguf | Q41 | 9.39GB | false | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | DeepSeek-R1-Distill-Qwen-14B-Q4KM.gguf | Q4KM | 8.99GB | false | Good quality, default size for most use cases, recommended. | | DeepSeek-R1-Distill-Qwen-14B-Q3KXL.gguf | Q3KXL | 8.61GB | false | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | DeepSeek-R1-Distill-Qwen-14B-Q4KS.gguf | Q4KS | 8.57GB | false | Slightly lower quality with more space savings, recommended. | | DeepSeek-R1-Distill-Qwen-14B-IQ4NL.gguf | IQ4NL | 8.55GB | false | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | DeepSeek-R1-Distill-Qwen-14B-Q40.gguf | Q40 | 8.54GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. | | DeepSeek-R1-Distill-Qwen-14B-IQ4XS.gguf | IQ4XS | 8.12GB | false | Decent quality, smaller than Q4KS with similar performance, recommended. | | DeepSeek-R1-Distill-Qwen-14B-Q3KL.gguf | Q3KL | 7.92GB | false | Lower quality but usable, good for low RAM availability. | | DeepSeek-R1-Distill-Qwen-14B-Q3KM.gguf | Q3KM | 7.34GB | false | Low quality. | | DeepSeek-R1-Distill-Qwen-14B-IQ3M.gguf | IQ3M | 6.92GB | false | Medium-low quality, new method with decent performance comparable to Q3KM. | | DeepSeek-R1-Distill-Qwen-14B-Q3KS.gguf | Q3KS | 6.66GB | false | Low quality, not recommended. | | DeepSeek-R1-Distill-Qwen-14B-Q2KL.gguf | Q2KL | 6.53GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | DeepSeek-R1-Distill-Qwen-14B-IQ3XS.gguf | IQ3XS | 6.38GB | false | Lower quality, new method with decent performance, slightly better than Q3KS. | | DeepSeek-R1-Distill-Qwen-14B-Q2K.gguf | Q2K | 5.77GB | false | Very low quality but surprisingly usable. | | DeepSeek-R1-Distill-Qwen-14B-IQ2M.gguf | IQ2M | 5.36GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. | | DeepSeek-R1-Distill-Qwen-14B-IQ2S.gguf | IQ2S | 5.00GB | false | Low quality, uses SOTA techniques to be usable. 
| | DeepSeek-R1-Distill-Qwen-14B-IQ2XS.gguf | IQ2XS | 4.70GB | false | Low quality, uses SOTA techniques to be usable. | Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to. First, make sure you have hugginface-cli installed: If the model is bigger than 50GB, it will have been split into multiple files. In order to download them all to a local folder, run: You can either specify a new local-dir (DeepSeek-R1-Distill-Qwen-14B-Q80) or download them all in place (./) Previously, you would download Q4044/48/88, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights. details in this PR. If you use Q40 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282 you will not be able to run the Q40XX files and will instead need to use Q40. Additionally, if you want to get slightly better quality for , you can use IQ4NL thanks to this PR which will also repack the weights for ARM, though only the 44 for now. The loading time may be slower but it will result in an overall speed incrase. I'm keeping this section to show the potential theoretical uplift in performance from using the Q40 with online repacking. Click to view benchmarks on an AVX2 system (EPYC7702) | model | size | params | backend | threads | test | t/s | % (vs Q40) | | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% | Q4088 offers a nice bump to prompt processing and a small bump to text generation A great write up with charts showing various performances is provided by Artefact2 here The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have. 
If you want your model running as FAST as possible, you'll want to fit the whole thing in your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB smaller than that total.

Next, you'll need to decide whether you want an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in the format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart: But basically, if you're aiming for below Q4 and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in the format IQXX, like IQ3M. They are newer and offer better performance for their size. The I-quants can also be used on CPU and Apple Metal, but will be slower than their K-quant equivalents, so speed vs. performance is a tradeoff you'll have to decide. The I-quants are not compatible with Vulkan, which also targets AMD, so if you have an AMD card, double-check whether you're using the rocBLAS build or the Vulkan build. At the time of writing, LM Studio has a preview with ROCm support, and other inference engines offer specific builds for ROCm.

Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output. Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
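When the quant is larger than your VRAM but fits in system RAM plus VRAM (the "absolute maximum quality" route above), you can offload only part of the model. A minimal sketch, assuming a llama.cpp build with llama-cli; the layer count is illustrative and should be tuned to your GPU:

```bash
# Offload roughly as many layers as fit in VRAM; the remaining layers run from system RAM on the CPU
./llama-cli -m ./DeepSeek-R1-Distill-Qwen-14B-Q6_K.gguf -ngl 24 -c 4096 -p "Hello"
```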

NaNK
13,282
214

PocketDoc_Dans-PersonalityEngine-V1.2.0-24b-GGUF

NaNK
license:apache-2.0
13,088
28

Qwen2.5-14B-Instruct-GGUF

NaNK
license:apache-2.0
12,890
44

Qwen_Qwen3.5-2B-GGUF

NaNK
license:apache-2.0
11,885
7

gemma-2-9b-it-GGUF

Original model: https://huggingface.co/google/gemma-2-9b-it All quants made using imatrix option with dataset from here Note that this model does not support a System prompt. | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | gemma-2-9b-it-f32.gguf | f32 | 36.97GB | false | Full F32 weights. | | gemma-2-9b-it-Q80.gguf | Q80 | 9.83GB | false | Extremely high quality, generally unneeded but max available quant. | | gemma-2-9b-it-Q6KL.gguf | Q6KL | 7.81GB | false | Uses Q80 for embed and output weights. Very high quality, near perfect, recommended. | | gemma-2-9b-it-Q6K.gguf | Q6K | 7.59GB | false | Very high quality, near perfect, recommended. | | gemma-2-9b-it-Q5KL.gguf | Q5KL | 6.87GB | false | Uses Q80 for embed and output weights. High quality, recommended. | | gemma-2-9b-it-Q5KM.gguf | Q5KM | 6.65GB | false | High quality, recommended. | | gemma-2-9b-it-Q5KS.gguf | Q5KS | 6.48GB | false | High quality, recommended. | | gemma-2-9b-it-Q4KL.gguf | Q4KL | 5.98GB | false | Uses Q80 for embed and output weights. Good quality, recommended. | | gemma-2-9b-it-Q4KM.gguf | Q4KM | 5.76GB | false | Good quality, default size for must use cases, recommended. | | gemma-2-9b-it-Q4KS.gguf | Q4KS | 5.48GB | false | Slightly lower quality with more space savings, recommended. | | gemma-2-9b-it-IQ4XS.gguf | IQ4XS | 5.18GB | false | Decent quality, smaller than Q4KS with similar performance, recommended. | | gemma-2-9b-it-Q3KL.gguf | Q3KL | 5.13GB | false | Lower quality but usable, good for low RAM availability. | | gemma-2-9b-it-Q3KM.gguf | Q3KM | 4.76GB | false | Low quality. | | gemma-2-9b-it-IQ3M.gguf | IQ3M | 4.49GB | false | Medium-low quality, new method with decent performance comparable to Q3KM. | | gemma-2-9b-it-Q3KS.gguf | Q3KS | 4.34GB | false | Low quality, not recommended. | | gemma-2-9b-it-IQ3XS.gguf | IQ3XS | 4.14GB | false | Lower quality, new method with decent performance, slightly better than Q3KS. | | gemma-2-9b-it-Q2KL.gguf | Q2KL | 4.03GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | gemma-2-9b-it-Q2K.gguf | Q2K | 3.81GB | false | Very low quality but surprisingly usable. | | gemma-2-9b-it-IQ3XXS.gguf | IQ3XXS | 3.80GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. | | gemma-2-9b-it-IQ2M.gguf | IQ2M | 3.43GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. | Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset Thank you ZeroWw for the inspiration to experiment with embed/output First, make sure you have hugginface-cli installed: If the model is bigger than 50GB, it will have been split into multiple files. In order to download them all to a local folder, run: You can either specify a new local-dir (gemma-2-9b-it-Q80) or download them all in place (./) A great write up with charts showing various performances is provided by Artefact2 here The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have. If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB Smaller than that total. 
Next, you'll need to decide whether you want an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in the format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart: But basically, if you're aiming for below Q4 and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in the format IQXX, like IQ3M. They are newer and offer better performance for their size. The I-quants can also be used on CPU and Apple Metal, but will be slower than their K-quant equivalents, so speed vs. performance is a tradeoff you'll have to decide. The I-quants are not compatible with Vulkan, which also targets AMD, so if you have an AMD card, double-check whether you're using the rocBLAS build or the Vulkan build. At the time of writing, LM Studio has a preview with ROCm support, and other inference engines offer specific builds for ROCm.

Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
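Since the card notes that gemma-2-9b-it does not support a system prompt, a run should pass only user turns. A minimal sketch, assuming a llama.cpp build where llama-cli supports conversation mode (-cnv) and the filename keeps its underscores:

```bash
# Chat using the model's built-in template, user prompts only (no system prompt)
./llama-cli -m ./gemma-2-9b-it-Q4_K_M.gguf -cnv -ngl 99
```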

NaNK
11,813
216

google_gemma-3-27b-it-qat-GGUF

NaNK
11,379
32

TheDrummer_Cydonia-24B-v2-GGUF

NaNK
10,895
16

mistralai_Ministral-3-14B-Reasoning-2512-GGUF

NaNK
license:apache-2.0
10,822
6

Qwen_Qwen3-Next-80B-A3B-Thinking-GGUF

NaNK
license:apache-2.0
10,641
14

Llama-3.3-70B-Instruct-GGUF

NaNK
llama
10,530
64

Qwen_Qwen3-30B-A3B-Instruct-2507-GGUF

Llamacpp imatrix Quantizations of Qwen3-30B-A3B-Instruct-2507 by Qwen Original model: https://huggingface.co/Qwen/Qwen3-30B-A3B-Instruct-2507 All quants made using imatrix option with dataset from here Run them directly with llama.cpp, or any other llama.cpp based project | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | Qwen3-30B-A3B-Instruct-2507-bf16.gguf | bf16 | 61.10GB | true | Full BF16 weights. | | Qwen3-30B-A3B-Instruct-2507-Q80.gguf | Q80 | 32.48GB | false | Extremely high quality, generally unneeded but max available quant. | | Qwen3-30B-A3B-Instruct-2507-Q6KL.gguf | Q6KL | 25.26GB | false | Uses Q80 for embed and output weights. Very high quality, near perfect, recommended. | | Qwen3-30B-A3B-Instruct-2507-Q6K.gguf | Q6K | 25.10GB | false | Very high quality, near perfect, recommended. | | Qwen3-30B-A3B-Instruct-2507-Q5KL.gguf | Q5KL | 21.94GB | false | Uses Q80 for embed and output weights. High quality, recommended. | | Qwen3-30B-A3B-Instruct-2507-Q5KM.gguf | Q5KM | 21.74GB | false | High quality, recommended. | | Qwen3-30B-A3B-Instruct-2507-Q5KS.gguf | Q5KS | 21.10GB | false | High quality, recommended. | | Qwen3-30B-A3B-Instruct-2507-Q41.gguf | Q41 | 19.21GB | false | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | Qwen3-30B-A3B-Instruct-2507-Q4KL.gguf | Q4KL | 18.86GB | false | Uses Q80 for embed and output weights. Good quality, recommended. | | Qwen3-30B-A3B-Instruct-2507-Q4KM.gguf | Q4KM | 18.63GB | false | Good quality, default size for most use cases, recommended. | | Qwen3-30B-A3B-Instruct-2507-Q4KS.gguf | Q4KS | 17.98GB | false | Slightly lower quality with more space savings, recommended. | | Qwen3-30B-A3B-Instruct-2507-Q40.gguf | Q40 | 17.63GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. | | Qwen3-30B-A3B-Instruct-2507-IQ4NL.gguf | IQ4NL | 17.39GB | false | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | Qwen3-30B-A3B-Instruct-2507-IQ4XS.gguf | IQ4XS | 16.46GB | false | Decent quality, smaller than Q4KS with similar performance, recommended. | | Qwen3-30B-A3B-Instruct-2507-Q3KXL.gguf | Q3KXL | 14.86GB | false | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | Qwen3-30B-A3B-Instruct-2507-Q3KL.gguf | Q3KL | 14.58GB | false | Lower quality but usable, good for low RAM availability. | | Qwen3-30B-A3B-Instruct-2507-Q3KM.gguf | Q3KM | 14.08GB | false | Low quality. | | Qwen3-30B-A3B-Instruct-2507-IQ3M.gguf | IQ3M | 14.08GB | false | Medium-low quality, new method with decent performance comparable to Q3KM. | | Qwen3-30B-A3B-Instruct-2507-Q3KS.gguf | Q3KS | 13.43GB | false | Low quality, not recommended. | | Qwen3-30B-A3B-Instruct-2507-IQ3XS.gguf | IQ3XS | 12.74GB | false | Lower quality, new method with decent performance, slightly better than Q3KS. | | Qwen3-30B-A3B-Instruct-2507-IQ3XXS.gguf | IQ3XXS | 12.22GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. | | Qwen3-30B-A3B-Instruct-2507-Q2KL.gguf | Q2KL | 11.21GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | Qwen3-30B-A3B-Instruct-2507-Q2K.gguf | Q2K | 10.91GB | false | Very low quality but surprisingly usable. | | Qwen3-30B-A3B-Instruct-2507-IQ2M.gguf | IQ2M | 9.87GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. 
| | Qwen3-30B-A3B-Instruct-2507-IQ2S.gguf | IQ2S | 8.74GB | false | Low quality, uses SOTA techniques to be usable. | | Qwen3-30B-A3B-Instruct-2507-IQ2XS.gguf | IQ2XS | 8.66GB | false | Low quality, uses SOTA techniques to be usable. | | Qwen3-30B-A3B-Instruct-2507-IQ2XXS.gguf | IQ2XXS | 7.57GB | false | Very low quality, uses SOTA techniques to be usable. | Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to. First, make sure you have hugginface-cli installed: If the model is bigger than 50GB, it will have been split into multiple files. In order to download them all to a local folder, run: You can either specify a new local-dir (QwenQwen3-30B-A3B-Instruct-2507-Q80) or download them all in place (./) Previously, you would download Q4044/48/88, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights. details in this PR. If you use Q40 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282 you will not be able to run the Q40XX files and will instead need to use Q40. Additionally, if you want to get slightly better quality for , you can use IQ4NL thanks to this PR which will also repack the weights for ARM, though only the 44 for now. The loading time may be slower but it will result in an overall speed incrase. I'm keeping this section to show the potential theoretical uplift in performance from using the Q40 with online repacking. Click to view benchmarks on an AVX2 system (EPYC7702) | model | size | params | backend | threads | test | t/s | % (vs Q40) | | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% | Q4088 offers a nice bump to prompt processing and a small bump to text generation A great write up with charts showing various performances 
is provided by Artefact2 here.

The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have. If you want your model running as FAST as possible, you'll want to fit the whole thing in your GPU's VRAM: aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. If you want the absolute maximum quality, add your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB smaller than that total.

Next, you'll need to decide whether to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in the format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart. But basically, if you're aiming for below Q4 and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in the format IQXX, like IQ3M. They are newer and offer better performance for their size. The I-quants can also be used on CPU, but they will be slower than their K-quant equivalents, so speed vs performance is a tradeoff you'll have to decide.

Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output. Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
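The download steps above mention huggingface-cli, but the actual commands did not survive in this listing. As a rough, illustrative sketch, the same thing can be done from Python with the huggingface_hub library; the repo id and file names below are assumptions (on the Hub the quant names use underscores, e.g. Q4_K_M):

```python
# Illustrative sketch only: download one quant, or all shards of a split quant,
# using huggingface_hub. Repo id and file names are examples -- substitute your own.
from huggingface_hub import hf_hub_download, snapshot_download

repo = "bartowski/Qwen_Qwen3-30B-A3B-Instruct-2507-GGUF"  # assumed repo id

# Single-file quant: fetch one .gguf into the current directory.
path = hf_hub_download(
    repo_id=repo,
    filename="Qwen3-30B-A3B-Instruct-2507-Q4_K_M.gguf",  # hypothetical file name
    local_dir=".",
)
print("downloaded to", path)

# Split quant (models over 50GB are sharded): grab every matching shard into a folder.
snapshot_download(
    repo_id=repo,
    allow_patterns=["*Q8_0*"],
    local_dir="Qwen_Qwen3-30B-A3B-Instruct-2507-Q8_0",
)
```

For split quants, llama.cpp only needs to be pointed at the first shard; the remaining shards are picked up automatically.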

NaNK
10,519
21

zai-org_GLM-4.6-GGUF

Llamacpp imatrix Quantizations of GLM-4.6 by zai-org Original model: https://huggingface.co/zai-org/GLM-4.6 All quants made using imatrix option with dataset from here combined with a subset of com...

NaNK
10,063
17

DeepSeek-R1-Distill-Qwen-32B-GGUF

Llamacpp imatrix Quantizations of DeepSeek-R1-Distill-Qwen-32B Original model: https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B All quants made using imatrix option with dataset from here | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | DeepSeek-R1-Distill-Qwen-32B-bf16.gguf | bf16 | 65.54GB | true | Full BF16 weights. | | DeepSeek-R1-Distill-Qwen-32B-Q80.gguf | Q80 | 34.82GB | false | Extremely high quality, generally unneeded but max available quant. | | DeepSeek-R1-Distill-Qwen-32B-Q6KL.gguf | Q6KL | 27.26GB | false | Uses Q80 for embed and output weights. Very high quality, near perfect, recommended. | | DeepSeek-R1-Distill-Qwen-32B-Q6K.gguf | Q6K | 26.89GB | false | Very high quality, near perfect, recommended. | | DeepSeek-R1-Distill-Qwen-32B-Q5KL.gguf | Q5KL | 23.74GB | false | Uses Q80 for embed and output weights. High quality, recommended. | | DeepSeek-R1-Distill-Qwen-32B-Q5KM.gguf | Q5KM | 23.26GB | false | High quality, recommended. | | DeepSeek-R1-Distill-Qwen-32B-Q5KS.gguf | Q5KS | 22.64GB | false | High quality, recommended. | | DeepSeek-R1-Distill-Qwen-32B-Q41.gguf | Q41 | 20.64GB | false | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | DeepSeek-R1-Distill-Qwen-32B-Q4KL.gguf | Q4KL | 20.43GB | false | Uses Q80 for embed and output weights. Good quality, recommended. | | DeepSeek-R1-Distill-Qwen-32B-Q4KM.gguf | Q4KM | 19.85GB | false | Good quality, default size for most use cases, recommended. | | DeepSeek-R1-Distill-Qwen-32B-Q4KS.gguf | Q4KS | 18.78GB | false | Slightly lower quality with more space savings, recommended. | | DeepSeek-R1-Distill-Qwen-32B-Q40.gguf | Q40 | 18.71GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. | | DeepSeek-R1-Distill-Qwen-32B-IQ4NL.gguf | IQ4NL | 18.68GB | false | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | DeepSeek-R1-Distill-Qwen-32B-Q3KXL.gguf | Q3KXL | 17.93GB | false | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | DeepSeek-R1-Distill-Qwen-32B-IQ4XS.gguf | IQ4XS | 17.69GB | false | Decent quality, smaller than Q4KS with similar performance, recommended. | | DeepSeek-R1-Distill-Qwen-32B-Q3KL.gguf | Q3KL | 17.25GB | false | Lower quality but usable, good for low RAM availability. | | DeepSeek-R1-Distill-Qwen-32B-Q3KM.gguf | Q3KM | 15.94GB | false | Low quality. | | DeepSeek-R1-Distill-Qwen-32B-IQ3M.gguf | IQ3M | 14.81GB | false | Medium-low quality, new method with decent performance comparable to Q3KM. | | DeepSeek-R1-Distill-Qwen-32B-Q3KS.gguf | Q3KS | 14.39GB | false | Low quality, not recommended. | | DeepSeek-R1-Distill-Qwen-32B-IQ3XS.gguf | IQ3XS | 13.71GB | false | Lower quality, new method with decent performance, slightly better than Q3KS. | | DeepSeek-R1-Distill-Qwen-32B-Q2KL.gguf | Q2KL | 13.07GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | DeepSeek-R1-Distill-Qwen-32B-Q2K.gguf | Q2K | 12.31GB | false | Very low quality but surprisingly usable. | | DeepSeek-R1-Distill-Qwen-32B-IQ2M.gguf | IQ2M | 11.26GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. | | DeepSeek-R1-Distill-Qwen-32B-IQ2S.gguf | IQ2S | 10.39GB | false | Low quality, uses SOTA techniques to be usable. | | DeepSeek-R1-Distill-Qwen-32B-IQ2XS.gguf | IQ2XS | 9.96GB | false | Low quality, uses SOTA techniques to be usable. 
| | DeepSeek-R1-Distill-Qwen-32B-IQ2XXS.gguf | IQ2XXS | 9.03GB | false | Very low quality, uses SOTA techniques to be usable. | Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to. First, make sure you have hugginface-cli installed: If the model is bigger than 50GB, it will have been split into multiple files. In order to download them all to a local folder, run: You can either specify a new local-dir (DeepSeek-R1-Distill-Qwen-32B-Q80) or download them all in place (./) Previously, you would download Q4044/48/88, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights. details in this PR. If you use Q40 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282 you will not be able to run the Q40XX files and will instead need to use Q40. Additionally, if you want to get slightly better quality for , you can use IQ4NL thanks to this PR which will also repack the weights for ARM, though only the 44 for now. The loading time may be slower but it will result in an overall speed incrase. I'm keeping this section to show the potential theoretical uplift in performance from using the Q40 with online repacking. Click to view benchmarks on an AVX2 system (EPYC7702) | model | size | params | backend | threads | test | t/s | % (vs Q40) | | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% | Q4088 offers a nice bump to prompt processing and a small bump to text generation A great write up with charts showing various performances is provided by Artefact2 here The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have. 
If you want your model running as FAST as possible, you'll want to fit the whole thing in your GPU's VRAM: aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. If you want the absolute maximum quality, add your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB smaller than that total.

Next, you'll need to decide whether to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in the format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart. But basically, if you're aiming for below Q4 and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in the format IQXX, like IQ3M. They are newer and offer better performance for their size. The I-quants can also be used on CPU and Apple Metal, but they will be slower than their K-quant equivalents, so speed vs performance is a tradeoff you'll have to decide. The I-quants are not compatible with Vulkan, which also supports AMD, so if you have an AMD card, double check whether you're using the rocBLAS build or the Vulkan build. At the time of writing, LM Studio has a preview with ROCm support, and other inference engines have specific builds for ROCm.

Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output. Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
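The sizing advice above ("aim for a file 1-2GB smaller than your VRAM") can also be checked programmatically. A minimal sketch, assuming the huggingface_hub library and an example repo id; the 1.5GB headroom below is just the card's rule of thumb restated:

```python
# Sketch: list the GGUF files in a repo with their sizes and pick the largest one
# that fits under a VRAM budget, following the card's rule of thumb.
from huggingface_hub import HfApi

repo = "bartowski/DeepSeek-R1-Distill-Qwen-32B-GGUF"  # example repo id
vram_gb = 24.0                                        # e.g. a 24GB card
headroom_gb = 1.5                                     # leave 1-2GB free

info = HfApi().model_info(repo, files_metadata=True)  # file sizes need files_metadata=True
ggufs = [
    (s.rfilename, s.size / 1e9)
    for s in info.siblings
    if s.rfilename.endswith(".gguf") and s.size
]

budget = vram_gb - headroom_gb
fitting = [(name, gb) for name, gb in ggufs if gb <= budget]
print(max(fitting, key=lambda x: x[1]) if fitting else "nothing fits -- consider offloading to RAM")
```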

NaNK
9,963
283

DeepSeek-Coder-V2-Lite-Instruct-GGUF

9,727
120

nvidia_NVIDIA-Nemotron-Nano-9B-v2-GGUF

NaNK
9,695
21

Qwen2.5-Coder-7B-Instruct-GGUF

NaNK
license:apache-2.0
9,372
30

huihui-ai_Huihui-gpt-oss-20b-BF16-abliterated-GGUF

NaNK
9,175
37

Phi-3.1-mini-128k-instruct-GGUF

license:mit
8,967
35

granite-embedding-107m-multilingual-GGUF

license:apache-2.0
8,898
1

cerebras_Qwen3-Coder-REAP-25B-A3B-GGUF

NaNK
8,632
11

thesby_Qwen2.5-VL-7B-NSFW-Caption-V3-GGUF

NaNK
license:apache-2.0
8,357
28

mistralai_Mistral-Small-4-119B-2603-GGUF

NaNK
license:apache-2.0
8,223
9

Qwen2.5-14B_Uncensored_Instruct-GGUF

NaNK
license:apache-2.0
8,050
55

mistral-community_pixtral-12b-GGUF

NaNK
license:apache-2.0
7,750
12

Hermes-3-Llama-3.2-3B-GGUF

NaNK
Llama-3
7,567
11

moonshotai_Kimi-K2.5-GGUF

NaNK
7,453
7

Ministral-8B-Instruct-2410-GGUF

NaNK
7,337
52

L3-8B-Stheno-v3.2-GGUF

Llamacpp imatrix Quantizations of L3-8B-Stheno-v3.2 Original model: https://huggingface.co/Sao10K/L3-8B-Stheno-v3.2 All quants made using imatrix option with dataset from here | Filename | Quant type | File Size | Description | | -------- | ---------- | --------- | ----------- | | L3-8B-Stheno-v3.2-Q80.gguf | Q80 | 8.54GB | Extremely high quality, generally unneeded but max available quant. | | L3-8B-Stheno-v3.2-Q6K.gguf | Q6K | 6.59GB | Very high quality, near perfect, recommended. | | L3-8B-Stheno-v3.2-Q5KM.gguf | Q5KM | 5.73GB | High quality, recommended. | | L3-8B-Stheno-v3.2-Q5KS.gguf | Q5KS | 5.59GB | High quality, recommended. | | L3-8B-Stheno-v3.2-Q4KM.gguf | Q4KM | 4.92GB | Good quality, uses about 4.83 bits per weight, recommended. | | L3-8B-Stheno-v3.2-Q4KS.gguf | Q4KS | 4.69GB | Slightly lower quality with more space savings, recommended. | | L3-8B-Stheno-v3.2-IQ4XS.gguf | IQ4XS | 4.44GB | Decent quality, smaller than Q4KS with similar performance, recommended. | | L3-8B-Stheno-v3.2-Q3KL.gguf | Q3KL | 4.32GB | Lower quality but usable, good for low RAM availability. | | L3-8B-Stheno-v3.2-Q3KM.gguf | Q3KM | 4.01GB | Even lower quality. | | L3-8B-Stheno-v3.2-IQ3M.gguf | IQ3M | 3.78GB | Medium-low quality, new method with decent performance comparable to Q3KM. | | L3-8B-Stheno-v3.2-Q3KS.gguf | Q3KS | 3.66GB | Low quality, not recommended. | | L3-8B-Stheno-v3.2-IQ3XS.gguf | IQ3XS | 3.51GB | Lower quality, new method with decent performance, slightly better than Q3KS. | | L3-8B-Stheno-v3.2-IQ3XXS.gguf | IQ3XXS | 3.27GB | Lower quality, new method with decent performance, comparable to Q3 quants. | | L3-8B-Stheno-v3.2-Q2K.gguf | Q2K | 3.17GB | Very low quality but surprisingly usable. | | L3-8B-Stheno-v3.2-IQ2M.gguf | IQ2M | 2.94GB | Very low quality, uses SOTA techniques to also be surprisingly usable. | | L3-8B-Stheno-v3.2-IQ2S.gguf | IQ2S | 2.75GB | Very low quality, uses SOTA techniques to be usable. | | L3-8B-Stheno-v3.2-IQ2XS.gguf | IQ2XS | 2.60GB | Very low quality, uses SOTA techniques to be usable. | First, make sure you have hugginface-cli installed: If the model is bigger than 50GB, it will have been split into multiple files. In order to download them all to a local folder, run: You can either specify a new local-dir (L3-8B-Stheno-v3.2-Q80) or download them all in place (./) A great write up with charts showing various performances is provided by Artefact2 here The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have. If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB Smaller than that total. Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart: But basically, if you're aiming for below Q4, and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in format IQXX, like IQ3M. These are newer and offer better performance for their size. 
These I-quants can also be used on CPU and Apple Metal, but they will be slower than their K-quant equivalents, so speed vs performance is a tradeoff you'll have to decide. The I-quants are not compatible with Vulkan, which also supports AMD, so if you have an AMD card, double check whether you're using the rocBLAS build or the Vulkan build. At the time of writing, LM Studio has a preview with ROCm support, and other inference engines have specific builds for ROCm.

Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
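The cards say to run these quants directly with llama.cpp or any llama.cpp-based project; one convenient option from Python is the llama-cpp-python bindings. A minimal sketch, with a placeholder model path and settings you would tune to your hardware:

```python
# Sketch: load a downloaded GGUF with llama-cpp-python and run a chat completion.
# n_gpu_layers=-1 offloads every layer to the GPU; lower it (or use 0) if the quant
# does not fit entirely in VRAM.
from llama_cpp import Llama

llm = Llama(
    model_path="./L3-8B-Stheno-v3.2-Q4_K_M.gguf",  # hypothetical local file
    n_gpu_layers=-1,
    n_ctx=8192,
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Write a two-sentence scene introduction."}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```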

NaNK
license:cc-by-nc-4.0
7,220
30

Qwen_Qwen3-30B-A3B-Thinking-2507-GGUF

NaNK
7,153
12

rwkv-6-world-7b-GGUF

NaNK
6,857
5

Athene-V2-Chat-GGUF

6,809
21

mistralai_Magistral-Small-2509-GGUF

Llamacpp imatrix Quantizations of Magistral-Small-2509 by mistralai Original model: https://huggingface.co/mistralai/Magistral-Small-2509 All quants made using imatrix option with dataset from here combined with a subset of combinedallsmall.parquet from Ed Addario here Run them directly with llama.cpp, or any other llama.cpp based project | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | Magistral-Small-2509-bf16.gguf | bf16 | 47.15GB | false | Full BF16 weights. | | Magistral-Small-2509-Q80.gguf | Q80 | 25.05GB | false | Extremely high quality, generally unneeded but max available quant. | | Magistral-Small-2509-Q6KL.gguf | Q6KL | 19.67GB | false | Uses Q80 for embed and output weights. Very high quality, near perfect, recommended. | | Magistral-Small-2509-Q6K.gguf | Q6K | 19.35GB | false | Very high quality, near perfect, recommended. | | Magistral-Small-2509-Q5KL.gguf | Q5KL | 17.18GB | false | Uses Q80 for embed and output weights. High quality, recommended. | | Magistral-Small-2509-Q5KM.gguf | Q5KM | 16.76GB | false | High quality, recommended. | | Magistral-Small-2509-Q5KS.gguf | Q5KS | 16.30GB | false | High quality, recommended. | | Magistral-Small-2509-Q41.gguf | Q41 | 14.87GB | false | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | Magistral-Small-2509-Q4KL.gguf | Q4KL | 14.83GB | false | Uses Q80 for embed and output weights. Good quality, recommended. | | Magistral-Small-2509-Q4KM.gguf | Q4KM | 14.33GB | false | Good quality, default size for most use cases, recommended. | | Magistral-Small-2509-Q4KS.gguf | Q4KS | 13.55GB | false | Slightly lower quality with more space savings, recommended. | | Magistral-Small-2509-Q40.gguf | Q40 | 13.49GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. | | Magistral-Small-2509-IQ4NL.gguf | IQ4NL | 13.47GB | false | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | Magistral-Small-2509-Q3KXL.gguf | Q3KXL | 12.99GB | false | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | Magistral-Small-2509-IQ4XS.gguf | IQ4XS | 12.76GB | false | Decent quality, smaller than Q4KS with similar performance, recommended. | | Magistral-Small-2509-Q3KL.gguf | Q3KL | 12.40GB | false | Lower quality but usable, good for low RAM availability. | | Magistral-Small-2509-Q3KM.gguf | Q3KM | 11.47GB | false | Low quality. | | Magistral-Small-2509-IQ3M.gguf | IQ3M | 10.65GB | false | Medium-low quality, new method with decent performance comparable to Q3KM. | | Magistral-Small-2509-Q3KS.gguf | Q3KS | 10.40GB | false | Low quality, not recommended. | | Magistral-Small-2509-IQ3XS.gguf | IQ3XS | 9.91GB | false | Lower quality, new method with decent performance, slightly better than Q3KS. | | Magistral-Small-2509-Q2KL.gguf | Q2KL | 9.55GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | Magistral-Small-2509-IQ3XXS.gguf | IQ3XXS | 9.28GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. | | Magistral-Small-2509-Q2K.gguf | Q2K | 8.89GB | false | Very low quality but surprisingly usable. | | Magistral-Small-2509-IQ2M.gguf | IQ2M | 8.11GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. | | Magistral-Small-2509-IQ2S.gguf | IQ2S | 7.48GB | false | Low quality, uses SOTA techniques to be usable. 
| | Magistral-Small-2509-IQ2XS.gguf | IQ2XS | 7.21GB | false | Low quality, uses SOTA techniques to be usable. | | Magistral-Small-2509-IQ2XXS.gguf | IQ2XXS | 6.55GB | false | Very low quality, uses SOTA techniques to be usable. | Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to. First, make sure you have hugginface-cli installed: If the model is bigger than 50GB, it will have been split into multiple files. In order to download them all to a local folder, run: You can either specify a new local-dir (mistralaiMagistral-Small-2509-Q80) or download them all in place (./) Previously, you would download Q4044/48/88, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights. details in this PR. If you use Q40 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282 you will not be able to run the Q40XX files and will instead need to use Q40. Additionally, if you want to get slightly better quality for , you can use IQ4NL thanks to this PR which will also repack the weights for ARM, though only the 44 for now. The loading time may be slower but it will result in an overall speed incrase. I'm keeping this section to show the potential theoretical uplift in performance from using the Q40 with online repacking. Click to view benchmarks on an AVX2 system (EPYC7702) | model | size | params | backend | threads | test | t/s | % (vs Q40) | | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% | Q4088 offers a nice bump to prompt processing and a small bump to text generation A great write up with charts showing various performances is provided by Artefact2 here The first thing to figure out is how big a model you can run. 
To do this, you'll need to figure out how much RAM and/or VRAM you have. If you want your model running as FAST as possible, you'll want to fit the whole thing in your GPU's VRAM: aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. If you want the absolute maximum quality, add your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB smaller than that total.

Next, you'll need to decide whether to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in the format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart. But basically, if you're aiming for below Q4 and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in the format IQXX, like IQ3M. They are newer and offer better performance for their size. The I-quants can also be used on CPU, but they will be slower than their K-quant equivalents, so speed vs performance is a tradeoff you'll have to decide.

Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output. Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
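Another common way to serve these quants is llama.cpp's llama-server, which exposes an OpenAI-compatible HTTP API. A sketch of a client call, assuming a server was started with something like `llama-server -m Magistral-Small-2509-Q4_K_M.gguf` and is listening on the default localhost:8080 (the file name and port are assumptions):

```python
# Sketch: query a locally running llama-server through its OpenAI-compatible endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed-locally")

resp = client.chat.completions.create(
    model="local-model",  # llama-server typically ignores the model name
    messages=[{"role": "user", "content": "Explain step by step: what is 17 * 24?"}],
    max_tokens=256,
)
print(resp.choices[0].message.content)
```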

6,734
8

google_gemma-3n-E4B-it-GGUF

NaNK
6,721
16

mlabonne_gemma-3-27b-it-abliterated-GGUF

NaNK
6,716
38

PocketDoc_Dans-PersonalityEngine-V1.3.0-24b-GGUF

Llamacpp imatrix Quantizations of Dans-PersonalityEngine-V1.3.0-24b by PocketDoc Original model: https://huggingface.co/PocketDoc/Dans-PersonalityEngine-V1.3.0-24b All quants made using imatrix option with dataset from here Run them directly with llama.cpp, or any other llama.cpp based project | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | Dans-PersonalityEngine-V1.3.0-24b-bf16.gguf | bf16 | 47.15GB | false | Full BF16 weights. | | Dans-PersonalityEngine-V1.3.0-24b-Q80.gguf | Q80 | 25.05GB | false | Extremely high quality, generally unneeded but max available quant. | | Dans-PersonalityEngine-V1.3.0-24b-Q6KL.gguf | Q6KL | 19.67GB | false | Uses Q80 for embed and output weights. Very high quality, near perfect, recommended. | | Dans-PersonalityEngine-V1.3.0-24b-Q6K.gguf | Q6K | 19.35GB | false | Very high quality, near perfect, recommended. | | Dans-PersonalityEngine-V1.3.0-24b-Q5KL.gguf | Q5KL | 17.18GB | false | Uses Q80 for embed and output weights. High quality, recommended. | | Dans-PersonalityEngine-V1.3.0-24b-Q5KM.gguf | Q5KM | 16.76GB | false | High quality, recommended. | | Dans-PersonalityEngine-V1.3.0-24b-Q5KS.gguf | Q5KS | 16.30GB | false | High quality, recommended. | | Dans-PersonalityEngine-V1.3.0-24b-Q41.gguf | Q41 | 14.87GB | false | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | Dans-PersonalityEngine-V1.3.0-24b-Q4KL.gguf | Q4KL | 14.83GB | false | Uses Q80 for embed and output weights. Good quality, recommended. | | Dans-PersonalityEngine-V1.3.0-24b-Q4KM.gguf | Q4KM | 14.33GB | false | Good quality, default size for most use cases, recommended. | | Dans-PersonalityEngine-V1.3.0-24b-Q4KS.gguf | Q4KS | 13.55GB | false | Slightly lower quality with more space savings, recommended. | | Dans-PersonalityEngine-V1.3.0-24b-Q40.gguf | Q40 | 13.49GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. | | Dans-PersonalityEngine-V1.3.0-24b-IQ4NL.gguf | IQ4NL | 13.47GB | false | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | Dans-PersonalityEngine-V1.3.0-24b-Q3KXL.gguf | Q3KXL | 12.99GB | false | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | Dans-PersonalityEngine-V1.3.0-24b-IQ4XS.gguf | IQ4XS | 12.76GB | false | Decent quality, smaller than Q4KS with similar performance, recommended. | | Dans-PersonalityEngine-V1.3.0-24b-Q3KL.gguf | Q3KL | 12.40GB | false | Lower quality but usable, good for low RAM availability. | | Dans-PersonalityEngine-V1.3.0-24b-Q3KM.gguf | Q3KM | 11.47GB | false | Low quality. | | Dans-PersonalityEngine-V1.3.0-24b-IQ3M.gguf | IQ3M | 10.65GB | false | Medium-low quality, new method with decent performance comparable to Q3KM. | | Dans-PersonalityEngine-V1.3.0-24b-Q3KS.gguf | Q3KS | 10.40GB | false | Low quality, not recommended. | | Dans-PersonalityEngine-V1.3.0-24b-IQ3XS.gguf | IQ3XS | 9.91GB | false | Lower quality, new method with decent performance, slightly better than Q3KS. | | Dans-PersonalityEngine-V1.3.0-24b-Q2KL.gguf | Q2KL | 9.55GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | Dans-PersonalityEngine-V1.3.0-24b-IQ3XXS.gguf | IQ3XXS | 9.28GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. | | Dans-PersonalityEngine-V1.3.0-24b-Q2K.gguf | Q2K | 8.89GB | false | Very low quality but surprisingly usable. 
| | Dans-PersonalityEngine-V1.3.0-24b-IQ2M.gguf | IQ2M | 8.11GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. | | Dans-PersonalityEngine-V1.3.0-24b-IQ2S.gguf | IQ2S | 7.48GB | false | Low quality, uses SOTA techniques to be usable. | | Dans-PersonalityEngine-V1.3.0-24b-IQ2XS.gguf | IQ2XS | 7.21GB | false | Low quality, uses SOTA techniques to be usable. | Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to. First, make sure you have hugginface-cli installed: If the model is bigger than 50GB, it will have been split into multiple files. In order to download them all to a local folder, run: You can either specify a new local-dir (PocketDocDans-PersonalityEngine-V1.3.0-24b-Q80) or download them all in place (./) Previously, you would download Q4044/48/88, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights. details in this PR. If you use Q40 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282 you will not be able to run the Q40XX files and will instead need to use Q40. Additionally, if you want to get slightly better quality for , you can use IQ4NL thanks to this PR which will also repack the weights for ARM, though only the 44 for now. The loading time may be slower but it will result in an overall speed incrase. I'm keeping this section to show the potential theoretical uplift in performance from using the Q40 with online repacking. Click to view benchmarks on an AVX2 system (EPYC7702) | model | size | params | backend | threads | test | t/s | % (vs Q40) | | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% | Q4088 offers a nice bump to prompt processing and a small bump to text generation A great write 
up with charts showing various performances is provided by Artefact2 here.

The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have. If you want your model running as FAST as possible, you'll want to fit the whole thing in your GPU's VRAM: aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. If you want the absolute maximum quality, add your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB smaller than that total.

Next, you'll need to decide whether to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in the format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart. But basically, if you're aiming for below Q4 and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in the format IQXX, like IQ3M. They are newer and offer better performance for their size. The I-quants can also be used on CPU, but they will be slower than their K-quant equivalents, so speed vs performance is a tradeoff you'll have to decide.

Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output. Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
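The descriptions in the quant table above are phrased in terms of file size; dividing file size by parameter count gives the approximate bits per weight a quant works out to. A tiny worked example using the 14.33GB Q4KM row and the 24B parameter count from the model name (approximate, since embeddings, output weights and GGUF metadata are all counted in the file size):

```python
# Approximate bits-per-weight for the Q4KM file from the table above.
params = 24e9          # "24b" from the model name
file_size_gb = 14.33   # Q4KM row in the table

bits_per_weight = file_size_gb * 1e9 * 8 / params
print(f"~{bits_per_weight:.2f} bits per weight")  # roughly 4.8 bpw
```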

NaNK
license:apache-2.0
6,445
32

ServiceNow-AI_Apriel-1.5-15b-Thinker-GGUF

Llamacpp imatrix Quantizations of Apriel-1.5-15b-Thinker by ServiceNow-AI Original model: https://huggingface.co/ServiceNow-AI/Apriel-1.5-15b-Thinker All quants made using imatrix option with dataset from here combined with a subset of combinedallsmall.parquet from Ed Addario here Run them directly with llama.cpp, or any other llama.cpp based project | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | Apriel-1.5-15b-Thinker-bf16.gguf | bf16 | 28.87GB | false | Full BF16 weights. | | Apriel-1.5-15b-Thinker-Q80.gguf | Q80 | 15.34GB | false | Extremely high quality, generally unneeded but max available quant. | | Apriel-1.5-15b-Thinker-Q6KL.gguf | Q6KL | 12.17GB | false | Uses Q80 for embed and output weights. Very high quality, near perfect, recommended. | | Apriel-1.5-15b-Thinker-Q6K.gguf | Q6K | 11.85GB | false | Very high quality, near perfect, recommended. | | Apriel-1.5-15b-Thinker-Q5KL.gguf | Q5KL | 10.68GB | false | Uses Q80 for embed and output weights. High quality, recommended. | | Apriel-1.5-15b-Thinker-Q5KM.gguf | Q5KM | 10.27GB | false | High quality, recommended. | | Apriel-1.5-15b-Thinker-Q5KS.gguf | Q5KS | 10.02GB | false | High quality, recommended. | | Apriel-1.5-15b-Thinker-Q4KL.gguf | Q4KL | 9.28GB | false | Uses Q80 for embed and output weights. Good quality, recommended. | | Apriel-1.5-15b-Thinker-Q41.gguf | Q41 | 9.16GB | false | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | Apriel-1.5-15b-Thinker-Q4KM.gguf | Q4KM | 8.79GB | false | Good quality, default size for most use cases, recommended. | | Apriel-1.5-15b-Thinker-Q4KS.gguf | Q4KS | 8.36GB | false | Slightly lower quality with more space savings, recommended. | | Apriel-1.5-15b-Thinker-Q40.gguf | Q40 | 8.33GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. | | Apriel-1.5-15b-Thinker-IQ4NL.gguf | IQ4NL | 8.33GB | false | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | Apriel-1.5-15b-Thinker-Q3KXL.gguf | Q3KXL | 8.29GB | false | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | Apriel-1.5-15b-Thinker-IQ4XS.gguf | IQ4XS | 7.91GB | false | Decent quality, smaller than Q4KS with similar performance, recommended. | | Apriel-1.5-15b-Thinker-Q3KL.gguf | Q3KL | 7.70GB | false | Lower quality but usable, good for low RAM availability. | | Apriel-1.5-15b-Thinker-Q3KM.gguf | Q3KM | 7.14GB | false | Low quality. | | Apriel-1.5-15b-Thinker-IQ3M.gguf | IQ3M | 6.70GB | false | Medium-low quality, new method with decent performance comparable to Q3KM. | | Apriel-1.5-15b-Thinker-Q3KS.gguf | Q3KS | 6.47GB | false | Low quality, not recommended. | | Apriel-1.5-15b-Thinker-Q2KL.gguf | Q2KL | 6.25GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | Apriel-1.5-15b-Thinker-IQ3XS.gguf | IQ3XS | 6.20GB | false | Lower quality, new method with decent performance, slightly better than Q3KS. | | Apriel-1.5-15b-Thinker-IQ3XXS.gguf | IQ3XXS | 5.78GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. | | Apriel-1.5-15b-Thinker-Q2K.gguf | Q2K | 5.59GB | false | Very low quality but surprisingly usable. | | Apriel-1.5-15b-Thinker-IQ2M.gguf | IQ2M | 5.17GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. 
| | Apriel-1.5-15b-Thinker-IQ2S.gguf | IQ2S | 4.81GB | false | Low quality, uses SOTA techniques to be usable. | | Apriel-1.5-15b-Thinker-IQ2XS.gguf | IQ2XS | 4.56GB | false | Low quality, uses SOTA techniques to be usable. | Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to. First, make sure you have hugginface-cli installed: If the model is bigger than 50GB, it will have been split into multiple files. In order to download them all to a local folder, run: You can either specify a new local-dir (ServiceNow-AIApriel-1.5-15b-Thinker-Q80) or download them all in place (./) Previously, you would download Q4044/48/88, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights. details in this PR. If you use Q40 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282 you will not be able to run the Q40XX files and will instead need to use Q40. Additionally, if you want to get slightly better quality for , you can use IQ4NL thanks to this PR which will also repack the weights for ARM, though only the 44 for now. The loading time may be slower but it will result in an overall speed incrase. I'm keeping this section to show the potential theoretical uplift in performance from using the Q40 with online repacking. Click to view benchmarks on an AVX2 system (EPYC7702) | model | size | params | backend | threads | test | t/s | % (vs Q40) | | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% | Q4088 offers a nice bump to prompt processing and a small bump to text generation A great write up with charts showing various performances is provided by Artefact2 here The first thing to figure out is how big a model you can run. 
To do this, you'll need to figure out how much RAM and/or VRAM you have. If you want your model running as FAST as possible, you'll want to fit the whole thing in your GPU's VRAM: aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. If you want the absolute maximum quality, add your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB smaller than that total.

Next, you'll need to decide whether to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in the format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart. But basically, if you're aiming for below Q4 and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in the format IQXX, like IQ3M. They are newer and offer better performance for their size. The I-quants can also be used on CPU, but they will be slower than their K-quant equivalents, so speed vs performance is a tradeoff you'll have to decide.

Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output. Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
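The "add your system RAM and VRAM together" budgeting described above can also be automated. A sketch using psutil for system RAM and torch for VRAM; both libraries are just convenient examples here, and any other way of reading the two totals works equally well:

```python
# Sketch: estimate the memory budgets the card describes (GPU-only vs RAM+VRAM).
import psutil

ram_gb = psutil.virtual_memory().total / 1e9

vram_gb = 0.0
try:
    import torch  # optional; used only to read total VRAM
    if torch.cuda.is_available():
        vram_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
except ImportError:
    pass

headroom_gb = 2.0  # the card suggests leaving 1-2GB free
print(f"speed-first (fits in VRAM) budget: ~{max(vram_gb - headroom_gb, 0):.1f} GB")
print(f"max-quality (RAM + VRAM) budget:  ~{ram_gb + vram_gb - headroom_gb:.1f} GB")
```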

NaNK
6,419
6

Qwen_Qwen3-30B-A3B-GGUF

NaNK
license:apache-2.0
6,360
58

google_gemma-3-27b-it-GGUF

NaNK
6,288
61

THUDM_GLM-Z1-32B-0414-GGUF

Llamacpp imatrix Quantizations of GLM-Z1-32B-0414 by THUDM Original model: https://huggingface.co/THUDM/GLM-Z1-32B-0414 All quants made using imatrix option with dataset from here Run them directly with llama.cpp, or any other llama.cpp based project | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | GLM-Z1-32B-0414-bf16.gguf | bf16 | 65.14GB | true | Full BF16 weights. | | GLM-Z1-32B-0414-Q80.gguf | Q80 | 34.62GB | false | Extremely high quality, generally unneeded but max available quant. | | GLM-Z1-32B-0414-Q6KL.gguf | Q6KL | 27.18GB | false | Uses Q80 for embed and output weights. Very high quality, near perfect, recommended. | | GLM-Z1-32B-0414-Q6K.gguf | Q6K | 26.73GB | false | Very high quality, near perfect, recommended. | | GLM-Z1-32B-0414-Q5KL.gguf | Q5KL | 23.67GB | false | Uses Q80 for embed and output weights. High quality, recommended. | | GLM-Z1-32B-0414-Q5KM.gguf | Q5KM | 23.10GB | false | High quality, recommended. | | GLM-Z1-32B-0414-Q5KS.gguf | Q5KS | 22.53GB | false | High quality, recommended. | | GLM-Z1-32B-0414-Q41.gguf | Q41 | 20.55GB | false | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | GLM-Z1-32B-0414-Q4KL.gguf | Q4KL | 20.37GB | false | Uses Q80 for embed and output weights. Good quality, recommended. | | GLM-Z1-32B-0414-Q4KM.gguf | Q4KM | 19.68GB | false | Good quality, default size for most use cases, recommended. | | GLM-Z1-32B-0414-Q4KS.gguf | Q4KS | 18.70GB | false | Slightly lower quality with more space savings, recommended. | | GLM-Z1-32B-0414-Q40.gguf | Q40 | 18.63GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. | | GLM-Z1-32B-0414-IQ4NL.gguf | IQ4NL | 18.58GB | false | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | GLM-Z1-32B-0414-Q3KXL.gguf | Q3KXL | 18.03GB | false | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | GLM-Z1-32B-0414-IQ4XS.gguf | IQ4XS | 17.60GB | false | Decent quality, smaller than Q4KS with similar performance, recommended. | | GLM-Z1-32B-0414-Q3KL.gguf | Q3KL | 17.22GB | false | Lower quality but usable, good for low RAM availability. | | GLM-Z1-32B-0414-Q3KM.gguf | Q3KM | 15.89GB | false | Low quality. | | GLM-Z1-32B-0414-IQ3M.gguf | IQ3M | 14.82GB | false | Medium-low quality, new method with decent performance comparable to Q3KM. | | GLM-Z1-32B-0414-Q3KS.gguf | Q3KS | 14.37GB | false | Low quality, not recommended. | | GLM-Z1-32B-0414-IQ3XS.gguf | IQ3XS | 13.66GB | false | Lower quality, new method with decent performance, slightly better than Q3KS. | | GLM-Z1-32B-0414-Q2KL.gguf | Q2KL | 13.20GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | GLM-Z1-32B-0414-IQ3XXS.gguf | IQ3XXS | 12.78GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. | | GLM-Z1-32B-0414-Q2K.gguf | Q2K | 12.29GB | false | Very low quality but surprisingly usable. | | GLM-Z1-32B-0414-IQ2M.gguf | IQ2M | 11.27GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. | | GLM-Z1-32B-0414-IQ2S.gguf | IQ2S | 10.42GB | false | Low quality, uses SOTA techniques to be usable. | | GLM-Z1-32B-0414-IQ2XS.gguf | IQ2XS | 9.90GB | false | Low quality, uses SOTA techniques to be usable. 
| Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to. First, make sure you have hugginface-cli installed: If the model is bigger than 50GB, it will have been split into multiple files. In order to download them all to a local folder, run: You can either specify a new local-dir (THUDMGLM-Z1-32B-0414-Q80) or download them all in place (./) Previously, you would download Q4044/48/88, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights. details in this PR. If you use Q40 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282 you will not be able to run the Q40XX files and will instead need to use Q40. Additionally, if you want to get slightly better quality for , you can use IQ4NL thanks to this PR which will also repack the weights for ARM, though only the 44 for now. The loading time may be slower but it will result in an overall speed incrase. I'm keeping this section to show the potential theoretical uplift in performance from using the Q40 with online repacking. Click to view benchmarks on an AVX2 system (EPYC7702) | model | size | params | backend | threads | test | t/s | % (vs Q40) | | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% | Q4088 offers a nice bump to prompt processing and a small bump to text generation A great write up with charts showing various performances is provided by Artefact2 here The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have. If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. 
If you want the absolute maximum quality, add your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB smaller than that total.

Next, you'll need to decide whether to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in the format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart. But basically, if you're aiming for below Q4 and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in the format IQXX, like IQ3M. They are newer and offer better performance for their size. The I-quants can also be used on CPU, but they will be slower than their K-quant equivalents, so speed vs performance is a tradeoff you'll have to decide.

Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output. Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski

NaNK
license:mit
6,220
16

THUDM_GLM-4-9B-0414-GGUF

NaNK
license:mit
6,173
19

Qwen2.5-Math-7B-Instruct-GGUF

NaNK
license:apache-2.0
6,133
12

zai-org_GLM-4.6V-Flash-GGUF

license:mit
6,081
11

NousResearch_Hermes-4-14B-GGUF

Llamacpp imatrix Quantizations of Hermes-4-14B by NousResearch Original model: https://huggingface.co/NousResearch/Hermes-4-14B All quants made using imatrix option with dataset from here combined with a subset of combinedallsmall.parquet from Ed Addario here Run them directly with llama.cpp, or any other llama.cpp based project | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | Hermes-4-14B-bf16.gguf | bf16 | 29.54GB | false | Full BF16 weights. | | Hermes-4-14B-Q80.gguf | Q80 | 15.70GB | false | Extremely high quality, generally unneeded but max available quant. | | Hermes-4-14B-Q6KL.gguf | Q6KL | 12.50GB | false | Uses Q80 for embed and output weights. Very high quality, near perfect, recommended. | | Hermes-4-14B-Q6K.gguf | Q6K | 12.12GB | false | Very high quality, near perfect, recommended. | | Hermes-4-14B-Q5KL.gguf | Q5KL | 10.99GB | false | Uses Q80 for embed and output weights. High quality, recommended. | | Hermes-4-14B-Q5KM.gguf | Q5KM | 10.51GB | false | High quality, recommended. | | Hermes-4-14B-Q5KS.gguf | Q5KS | 10.26GB | false | High quality, recommended. | | Hermes-4-14B-Q4KL.gguf | Q4KL | 9.58GB | false | Uses Q80 for embed and output weights. Good quality, recommended. | | Hermes-4-14B-Q41.gguf | Q41 | 9.39GB | false | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | Hermes-4-14B-Q4KM.gguf | Q4KM | 9.00GB | false | Good quality, default size for most use cases, recommended. | | Hermes-4-14B-Q3KXL.gguf | Q3KXL | 8.58GB | false | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | Hermes-4-14B-Q4KS.gguf | Q4KS | 8.57GB | false | Slightly lower quality with more space savings, recommended. | | Hermes-4-14B-Q40.gguf | Q40 | 8.54GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. | | Hermes-4-14B-IQ4NL.gguf | IQ4NL | 8.54GB | false | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | Hermes-4-14B-IQ4XS.gguf | IQ4XS | 8.11GB | false | Decent quality, smaller than Q4KS with similar performance, recommended. | | Hermes-4-14B-Q3KL.gguf | Q3KL | 7.90GB | false | Lower quality but usable, good for low RAM availability. | | Hermes-4-14B-Q3KM.gguf | Q3KM | 7.32GB | false | Low quality. | | Hermes-4-14B-IQ3M.gguf | IQ3M | 6.88GB | false | Medium-low quality, new method with decent performance comparable to Q3KM. | | Hermes-4-14B-Q3KS.gguf | Q3KS | 6.66GB | false | Low quality, not recommended. | | Hermes-4-14B-Q2KL.gguf | Q2KL | 6.51GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | Hermes-4-14B-IQ3XS.gguf | IQ3XS | 6.38GB | false | Lower quality, new method with decent performance, slightly better than Q3KS. | | Hermes-4-14B-IQ3XXS.gguf | IQ3XXS | 5.94GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. | | Hermes-4-14B-Q2K.gguf | Q2K | 5.75GB | false | Very low quality but surprisingly usable. | | Hermes-4-14B-IQ2M.gguf | IQ2M | 5.32GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. | | Hermes-4-14B-IQ2S.gguf | IQ2S | 4.96GB | false | Low quality, uses SOTA techniques to be usable. | | Hermes-4-14B-IQ2XS.gguf | IQ2XS | 4.69GB | false | Low quality, uses SOTA techniques to be usable. 
| Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to. First, make sure you have hugginface-cli installed: If the model is bigger than 50GB, it will have been split into multiple files. In order to download them all to a local folder, run: You can either specify a new local-dir (NousResearchHermes-4-14B-Q80) or download them all in place (./) Previously, you would download Q4044/48/88, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights. details in this PR. If you use Q40 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282 you will not be able to run the Q40XX files and will instead need to use Q40. Additionally, if you want to get slightly better quality for , you can use IQ4NL thanks to this PR which will also repack the weights for ARM, though only the 44 for now. The loading time may be slower but it will result in an overall speed incrase. I'm keeping this section to show the potential theoretical uplift in performance from using the Q40 with online repacking. Click to view benchmarks on an AVX2 system (EPYC7702) | model | size | params | backend | threads | test | t/s | % (vs Q40) | | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% | Q4088 offers a nice bump to prompt processing and a small bump to text generation A great write up with charts showing various performances is provided by Artefact2 here The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have. If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. 
If you want the absolute maximum quality, add your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB smaller than that total.

Next, you'll need to decide whether to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in the format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart. But basically, if you're aiming for below Q4 and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in the format IQXX, like IQ3M. They are newer and offer better performance for their size. The I-quants can also be used on CPU, but they will be slower than their K-quant equivalents, so speed vs performance is a tradeoff you'll have to decide.

Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output. Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
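The pp512/tg128 numbers in the benchmark table above come from llama.cpp's llama-bench tool. A rough sketch of reproducing that kind of run on your own quant (the binary name, file name and flags are assumptions; check `llama-bench --help` for your build, as options can change between versions):

```python
# Sketch: run a pp512/tg128 style benchmark on a local quant with llama-bench.
import subprocess

subprocess.run(
    [
        "llama-bench",
        "-m", "Hermes-4-14B-Q4_K_M.gguf",  # hypothetical local file
        "-p", "512",                        # prompt-processing length (pp512)
        "-n", "128",                        # generated-token length (tg128)
        "-t", "8",                          # CPU threads
    ],
    check=True,
)
```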

NaNK
6,021
15

dolphin-2.9-llama3-8b-GGUF

NaNK
base_model:dphn/dolphin-2.9-llama3-8b
5,988
8

gemma-2-27b-it-GGUF

NaNK
5,884
170

mistralai_Ministral-3-14B-Instruct-2512-GGUF

NaNK
license:apache-2.0
5,728
0

TheDrummer_Rivermind-24B-v1-GGUF

NaNK
5,701
1

trashpanda-org_QwQ-32B-Snowdrop-v0-GGUF

NaNK
5,680
14

THUDM_GLM-4-32B-0414-GGUF

NaNK
license:mit
5,594
100

Qwen_Qwen3-4B-Instruct-2507-GGUF

NaNK
5,518
17

Qwen2-VL-2B-Instruct-GGUF

NaNK
license:apache-2.0
5,512
29

L3-8B-Lunaris-v1-GGUF

NaNK
license:llama3
5,444
25

inclusionAI_Ling-flash-2.0-GGUF

Llamacpp imatrix Quantizations of Ling-flash-2.0 by inclusionAI Original model: https://huggingface.co/inclusionAI/Ling-flash-2.0 All quants made using imatrix option with dataset from here combined with a subset of combinedallsmall.parquet from Ed Addario here Run them directly with llama.cpp, or any other llama.cpp based project | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | Ling-flash-2.0-Q80.gguf | Q80 | 109.42GB | true | Extremely high quality, generally unneeded but max available quant. | | Ling-flash-2.0-Q6K.gguf | Q6K | 84.61GB | true | Very high quality, near perfect, recommended. | | Ling-flash-2.0-Q5KM.gguf | Q5KM | 73.32GB | true | High quality, recommended. | | Ling-flash-2.0-Q5KS.gguf | Q5KS | 71.03GB | true | High quality, recommended. | | Ling-flash-2.0-Q41.gguf | Q41 | 64.64GB | true | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | Ling-flash-2.0-Q4KL.gguf | Q4KL | 63.10GB | true | Uses Q80 for embed and output weights. Good quality, recommended. | | Ling-flash-2.0-Q4KM.gguf | Q4KM | 62.62GB | true | Good quality, default size for most use cases, recommended. | | Ling-flash-2.0-Q4KS.gguf | Q4KS | 60.37GB | true | Slightly lower quality with more space savings, recommended. | | Ling-flash-2.0-Q40.gguf | Q40 | 59.29GB | true | Legacy format, offers online repacking for ARM and AVX CPU inference. | | Ling-flash-2.0-IQ4NL.gguf | IQ4NL | 58.35GB | true | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | Ling-flash-2.0-IQ4XS.gguf | IQ4XS | 55.18GB | true | Decent quality, smaller than Q4KS with similar performance, recommended. | | Ling-flash-2.0-Q3KXL.gguf | Q3KXL | 49.57GB | false | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | Ling-flash-2.0-Q3KL.gguf | Q3KL | 49.01GB | false | Lower quality but usable, good for low RAM availability. | | Ling-flash-2.0-Q3KM.gguf | Q3KM | 47.13GB | false | Low quality. | | Ling-flash-2.0-IQ3M.gguf | IQ3M | 47.13GB | false | Medium-low quality, new method with decent performance comparable to Q3KM. | | Ling-flash-2.0-Q3KS.gguf | Q3KS | 44.90GB | false | Low quality, not recommended. | | Ling-flash-2.0-IQ3XS.gguf | IQ3XS | 42.48GB | false | Lower quality, new method with decent performance, slightly better than Q3KS. | | Ling-flash-2.0-IQ3XXS.gguf | IQ3XXS | 40.85GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. | | Ling-flash-2.0-Q2KL.gguf | Q2KL | 36.88GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | Ling-flash-2.0-Q2K.gguf | Q2K | 36.25GB | false | Very low quality but surprisingly usable. | | Ling-flash-2.0-IQ2M.gguf | IQ2M | 32.41GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. | | Ling-flash-2.0-IQ2S.gguf | IQ2S | 28.66GB | false | Low quality, uses SOTA techniques to be usable. | | Ling-flash-2.0-IQ2XS.gguf | IQ2XS | 28.53GB | false | Low quality, uses SOTA techniques to be usable. | | Ling-flash-2.0-IQ2XXS.gguf | IQ2XXS | 25.82GB | false | Very low quality, uses SOTA techniques to be usable. | | Ling-flash-2.0-IQ1M.gguf | IQ1M | 22.22GB | false | Extremely low quality, not recommended. | | Ling-flash-2.0-IQ1S.gguf | IQ1S | 21.45GB | false | Extremely low quality, not recommended. 
| Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to. First, make sure you have hugginface-cli installed: If the model is bigger than 50GB, it will have been split into multiple files. In order to download them all to a local folder, run: You can either specify a new local-dir (inclusionAILing-flash-2.0-Q80) or download them all in place (./) Previously, you would download Q4044/48/88, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights. details in this PR. If you use Q40 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282 you will not be able to run the Q40XX files and will instead need to use Q40. Additionally, if you want to get slightly better quality for , you can use IQ4NL thanks to this PR which will also repack the weights for ARM, though only the 44 for now. The loading time may be slower but it will result in an overall speed incrase. I'm keeping this section to show the potential theoretical uplift in performance from using the Q40 with online repacking. Click to view benchmarks on an AVX2 system (EPYC7702) | model | size | params | backend | threads | test | t/s | % (vs Q40) | | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% | Q4088 offers a nice bump to prompt processing and a small bump to text generation A great write up with charts showing various performances is provided by Artefact2 here The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have. If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. 
If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB Smaller than that total. Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart: But basically, if you're aiming for below Q4, and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in format IQXX, like IQ3M. These are newer and offer better performance for their size. These I-quants can also be used on CPU, but will be slower than their K-quant equivalent, so speed vs performance is a tradeoff you'll have to decide. Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output. Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
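The sizing rule above (a file 1-2GB below your VRAM for full GPU offload, or below RAM plus VRAM for maximum quality) is easy to mechanize. A rough sketch follows, using a few of the Ling-flash-2.0 sizes from the table above; the 1.5GB headroom is just an assumed midpoint of the suggested 1-2GB range.

```python
# Toy sketch of the sizing advice: choose the largest quant that still leaves
# some headroom below your memory budget. Sizes (GB) are copied from the
# Ling-flash-2.0 table above; 1.5 GB headroom is an assumed midpoint.
QUANT_SIZES_GB = {
    "Q6K": 84.61,
    "Q5KM": 73.32,
    "Q4KM": 62.62,
    "IQ4XS": 55.18,
    "Q3KM": 47.13,
    "IQ3XXS": 40.85,
    "Q2K": 36.25,
    "IQ2M": 32.41,
}

def pick_quant(budget_gb: float, headroom_gb: float = 1.5):
    """Largest quant whose file size fits under budget minus headroom."""
    fitting = {name: gb for name, gb in QUANT_SIZES_GB.items()
               if gb <= budget_gb - headroom_gb}
    return max(fitting, key=fitting.get) if fitting else None

print(pick_quant(64.0))          # full GPU offload on a 64 GB card -> "IQ4XS"
print(pick_quant(64.0 + 24.0))   # RAM + VRAM "max quality" budget -> "Q6K"
```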

NaNK
5,344
3

ServiceNow-AI_Apriel-1.6-15b-Thinker-GGUF

NaNK
license:mit
5,342
10

Qwen2-VL-7B-Instruct-GGUF

NaNK
license:apache-2.0
5,086
39

microsoft_UserLM-8b-GGUF

Llamacpp imatrix Quantizations of UserLM-8b by microsoft Original model: https://huggingface.co/microsoft/UserLM-8b All quants made using imatrix option with dataset from here combined with a subset of combinedallsmall.parquet from Ed Addario here Run them directly with llama.cpp, or any other llama.cpp based project No chat template specified so default is used. This may be incorrect, check original model card for details. | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | UserLM-8b-bf16.gguf | bf16 | 16.07GB | false | Full BF16 weights. | | UserLM-8b-Q80.gguf | Q80 | 8.54GB | false | Extremely high quality, generally unneeded but max available quant. | | UserLM-8b-Q6KL.gguf | Q6KL | 6.85GB | false | Uses Q80 for embed and output weights. Very high quality, near perfect, recommended. | | UserLM-8b-Q6K.gguf | Q6K | 6.60GB | false | Very high quality, near perfect, recommended. | | UserLM-8b-Q5KL.gguf | Q5KL | 6.06GB | false | Uses Q80 for embed and output weights. High quality, recommended. | | UserLM-8b-Q5KM.gguf | Q5KM | 5.73GB | false | High quality, recommended. | | UserLM-8b-Q5KS.gguf | Q5KS | 5.60GB | false | High quality, recommended. | | UserLM-8b-Q4KL.gguf | Q4KL | 5.31GB | false | Uses Q80 for embed and output weights. Good quality, recommended. | | UserLM-8b-Q41.gguf | Q41 | 5.13GB | false | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | UserLM-8b-Q4KM.gguf | Q4KM | 4.92GB | false | Good quality, default size for most use cases, recommended. | | UserLM-8b-Q3KXL.gguf | Q3KXL | 4.78GB | false | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | UserLM-8b-Q4KS.gguf | Q4KS | 4.69GB | false | Slightly lower quality with more space savings, recommended. | | UserLM-8b-Q40.gguf | Q40 | 4.68GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. | | UserLM-8b-IQ4NL.gguf | IQ4NL | 4.68GB | false | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | UserLM-8b-IQ4XS.gguf | IQ4XS | 4.45GB | false | Decent quality, smaller than Q4KS with similar performance, recommended. | | UserLM-8b-Q3KL.gguf | Q3KL | 4.32GB | false | Lower quality but usable, good for low RAM availability. | | UserLM-8b-Q3KM.gguf | Q3KM | 4.02GB | false | Low quality. | | UserLM-8b-IQ3M.gguf | IQ3M | 3.78GB | false | Medium-low quality, new method with decent performance comparable to Q3KM. | | UserLM-8b-Q2KL.gguf | Q2KL | 3.69GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | UserLM-8b-Q3KS.gguf | Q3KS | 3.66GB | false | Low quality, not recommended. | | UserLM-8b-IQ3XS.gguf | IQ3XS | 3.52GB | false | Lower quality, new method with decent performance, slightly better than Q3KS. | | UserLM-8b-IQ3XXS.gguf | IQ3XXS | 3.27GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. | | UserLM-8b-Q2K.gguf | Q2K | 3.18GB | false | Very low quality but surprisingly usable. | | UserLM-8b-IQ2M.gguf | IQ2M | 2.95GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. | Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to. First, make sure you have hugginface-cli installed: If the model is bigger than 50GB, it will have been split into multiple files. 
In order to download them all to a local folder, run: You can either specify a new local-dir (microsoftUserLM-8b-Q80) or download them all in place (./) Previously, you would download Q4044/48/88, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights. details in this PR. If you use Q40 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282 you will not be able to run the Q40XX files and will instead need to use Q40. Additionally, if you want to get slightly better quality for , you can use IQ4NL thanks to this PR which will also repack the weights for ARM, though only the 44 for now. The loading time may be slower but it will result in an overall speed incrase. I'm keeping this section to show the potential theoretical uplift in performance from using the Q40 with online repacking. Click to view benchmarks on an AVX2 system (EPYC7702) | model | size | params | backend | threads | test | t/s | % (vs Q40) | | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% | Q4088 offers a nice bump to prompt processing and a small bump to text generation A great write up with charts showing various performances is provided by Artefact2 here The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have. If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB Smaller than that total. Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. 
These are in format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart: But basically, if you're aiming for below Q4, and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in format IQXX, like IQ3M. These are newer and offer better performance for their size. These I-quants can also be used on CPU, but will be slower than their K-quant equivalent, so speed vs performance is a tradeoff you'll have to decide. Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output. Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
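The cards say to run these directly with llama.cpp or any llama.cpp-based project; as one concrete, hedged example, here is a minimal sketch using the llama-cpp-python bindings. The model path is a placeholder, and since no chat template is specified for this model the bindings will fall back to a default, so check the original model card for the intended template.

```python
# Minimal sketch: load a downloaded quant with the llama-cpp-python bindings
# (one of the llama.cpp-based projects mentioned above). Path is a placeholder.
from llama_cpp import Llama

llm = Llama(
    model_path="UserLM-8b-Q4KM.gguf",  # placeholder path to a downloaded quant
    n_ctx=4096,                        # context window
    n_gpu_layers=-1,                   # offload all layers if they fit in VRAM
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=64,
)
print(out["choices"][0]["message"]["content"])
```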

NaNK
4,990
2

nvidia_NVIDIA-Nemotron-Nano-12B-v2-GGUF

Llamacpp imatrix Quantizations of NVIDIA-Nemotron-Nano-12B-v2 by nvidia Original model: https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-12B-v2 All quants made using imatrix option with dataset from here combined with a subset of combinedallsmall.parquet from Ed Addario here Run them directly with llama.cpp, or any other llama.cpp based project | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | NVIDIA-Nemotron-Nano-12B-v2-bf16.gguf | bf16 | 24.63GB | false | Full BF16 weights. | | NVIDIA-Nemotron-Nano-12B-v2-Q80.gguf | Q80 | 13.09GB | false | Extremely high quality, generally unneeded but max available quant. | | NVIDIA-Nemotron-Nano-12B-v2-Q6KL.gguf | Q6KL | 10.44GB | false | Uses Q80 for embed and output weights. Very high quality, near perfect, recommended. | | NVIDIA-Nemotron-Nano-12B-v2-Q6K.gguf | Q6K | 10.11GB | false | Very high quality, near perfect, recommended. | | NVIDIA-Nemotron-Nano-12B-v2-Q5KL.gguf | Q5KL | 9.18GB | false | Uses Q80 for embed and output weights. High quality, recommended. | | NVIDIA-Nemotron-Nano-12B-v2-Q5KM.gguf | Q5KM | 8.76GB | false | High quality, recommended. | | NVIDIA-Nemotron-Nano-12B-v2-Q5KS.gguf | Q5KS | 8.57GB | false | High quality, recommended. | | NVIDIA-Nemotron-Nano-12B-v2-Q4KL.gguf | Q4KL | 7.99GB | false | Uses Q80 for embed and output weights. Good quality, recommended. | | NVIDIA-Nemotron-Nano-12B-v2-Q41.gguf | Q41 | 7.84GB | false | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | NVIDIA-Nemotron-Nano-12B-v2-Q4KM.gguf | Q4KM | 7.49GB | false | Good quality, default size for most use cases, recommended. | | NVIDIA-Nemotron-Nano-12B-v2-Q4KS.gguf | Q4KS | 7.21GB | false | Slightly lower quality with more space savings, recommended. | | NVIDIA-Nemotron-Nano-12B-v2-Q40.gguf | Q40 | 7.16GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. | | NVIDIA-Nemotron-Nano-12B-v2-IQ4NL.gguf | IQ4NL | 7.11GB | false | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | NVIDIA-Nemotron-Nano-12B-v2-Q3KXL.gguf | Q3KXL | 6.96GB | false | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | NVIDIA-Nemotron-Nano-12B-v2-IQ4XS.gguf | IQ4XS | 6.75GB | false | Decent quality, smaller than Q4KS with similar performance, recommended. | | NVIDIA-Nemotron-Nano-12B-v2-Q3KL.gguf | Q3KL | 6.37GB | false | Lower quality but usable, good for low RAM availability. | | NVIDIA-Nemotron-Nano-12B-v2-Q3KM.gguf | Q3KM | 6.02GB | false | Low quality. | | NVIDIA-Nemotron-Nano-12B-v2-IQ3M.gguf | IQ3M | 5.69GB | false | Medium-low quality, new method with decent performance comparable to Q3KM. | | NVIDIA-Nemotron-Nano-12B-v2-Q3KS.gguf | Q3KS | 5.57GB | false | Low quality, not recommended. | | NVIDIA-Nemotron-Nano-12B-v2-IQ3XS.gguf | IQ3XS | 5.46GB | false | Lower quality, new method with decent performance, slightly better than Q3KS. | | NVIDIA-Nemotron-Nano-12B-v2-Q2KL.gguf | Q2KL | 5.36GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | NVIDIA-Nemotron-Nano-12B-v2-IQ3XXS.gguf | IQ3XXS | 4.96GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. | | NVIDIA-Nemotron-Nano-12B-v2-Q2K.gguf | Q2K | 4.70GB | false | Very low quality but surprisingly usable. 
| | NVIDIA-Nemotron-Nano-12B-v2-IQ2M.gguf | IQ2M | 4.38GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. | | NVIDIA-Nemotron-Nano-12B-v2-IQ2S.gguf | IQ2S | 4.07GB | false | Low quality, uses SOTA techniques to be usable. | Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to. First, make sure you have hugginface-cli installed: If the model is bigger than 50GB, it will have been split into multiple files. In order to download them all to a local folder, run: You can either specify a new local-dir (nvidiaNVIDIA-Nemotron-Nano-12B-v2-Q80) or download them all in place (./) Previously, you would download Q4044/48/88, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights. details in this PR. If you use Q40 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282 you will not be able to run the Q40XX files and will instead need to use Q40. Additionally, if you want to get slightly better quality for , you can use IQ4NL thanks to this PR which will also repack the weights for ARM, though only the 44 for now. The loading time may be slower but it will result in an overall speed incrase. I'm keeping this section to show the potential theoretical uplift in performance from using the Q40 with online repacking. Click to view benchmarks on an AVX2 system (EPYC7702) | model | size | params | backend | threads | test | t/s | % (vs Q40) | | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% | Q4088 offers a nice bump to prompt processing and a small bump to text generation A great write up with charts showing various performances is provided by Artefact2 here The first thing to figure out is how big a model you can run. 
To do this, you'll need to figure out how much RAM and/or VRAM you have. If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB Smaller than that total. Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart: But basically, if you're aiming for below Q4, and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in format IQXX, like IQ3M. These are newer and offer better performance for their size. These I-quants can also be used on CPU, but will be slower than their K-quant equivalent, so speed vs performance is a tradeoff you'll have to decide. Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output. Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski

NaNK
4,864
12

huihui-ai_Huihui-gemma-3n-E4B-it-abliterated-GGUF

NaNK
4,822
15

TheDrummer_Skyfall-31B-v4-GGUF

Llamacpp imatrix Quantizations of Skyfall-31B-v4 by TheDrummer Original model: https://huggingface.co/TheDrummer/Skyfall-31B-v4 All quants made using imatrix option with dataset from here combined with a subset of combinedallsmall.parquet from Ed Addario here Run them directly with llama.cpp, or any other llama.cpp based project | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | Skyfall-31B-v4-bf16.gguf | bf16 | 62.71GB | true | Full BF16 weights. | | Skyfall-31B-v4-Q80.gguf | Q80 | 33.32GB | false | Extremely high quality, generally unneeded but max available quant. | | Skyfall-31B-v4-Q6KL.gguf | Q6KL | 26.05GB | false | Uses Q80 for embed and output weights. Very high quality, near perfect, recommended. | | Skyfall-31B-v4-Q6K.gguf | Q6K | 25.73GB | false | Very high quality, near perfect, recommended. | | Skyfall-31B-v4-Q5KL.gguf | Q5KL | 22.67GB | false | Uses Q80 for embed and output weights. High quality, recommended. | | Skyfall-31B-v4-Q5KM.gguf | Q5KM | 22.25GB | false | High quality, recommended. | | Skyfall-31B-v4-Q5KS.gguf | Q5KS | 21.65GB | false | High quality, recommended. | | Skyfall-31B-v4-Q41.gguf | Q41 | 19.74GB | false | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | Skyfall-31B-v4-Q4KL.gguf | Q4KL | 19.48GB | false | Uses Q80 for embed and output weights. Good quality, recommended. | | Skyfall-31B-v4-Q4KM.gguf | Q4KM | 18.98GB | false | Good quality, default size for most use cases, recommended. | | Skyfall-31B-v4-Q4KS.gguf | Q4KS | 17.95GB | false | Slightly lower quality with more space savings, recommended. | | Skyfall-31B-v4-Q40.gguf | Q40 | 17.88GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. | | Skyfall-31B-v4-IQ4NL.gguf | IQ4NL | 17.85GB | false | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | Skyfall-31B-v4-Q3KXL.gguf | Q3KXL | 17.03GB | false | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | Skyfall-31B-v4-IQ4XS.gguf | IQ4XS | 16.90GB | false | Decent quality, smaller than Q4KS with similar performance, recommended. | | Skyfall-31B-v4-Q3KL.gguf | Q3KL | 16.44GB | false | Lower quality but usable, good for low RAM availability. | | Skyfall-31B-v4-Q3KM.gguf | Q3KM | 15.20GB | false | Low quality. | | Skyfall-31B-v4-IQ3M.gguf | IQ3M | 14.07GB | false | Medium-low quality, new method with decent performance comparable to Q3KM. | | Skyfall-31B-v4-Q3KS.gguf | Q3KS | 13.74GB | false | Low quality, not recommended. | | Skyfall-31B-v4-IQ3XS.gguf | IQ3XS | 13.07GB | false | Lower quality, new method with decent performance, slightly better than Q3KS. | | Skyfall-31B-v4-Q2KL.gguf | Q2KL | 12.38GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | Skyfall-31B-v4-IQ3XXS.gguf | IQ3XXS | 12.26GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. | | Skyfall-31B-v4-Q2K.gguf | Q2K | 11.73GB | false | Very low quality but surprisingly usable. | | Skyfall-31B-v4-IQ2M.gguf | IQ2M | 10.68GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. | | Skyfall-31B-v4-IQ2S.gguf | IQ2S | 9.81GB | false | Low quality, uses SOTA techniques to be usable. | | Skyfall-31B-v4-IQ2XS.gguf | IQ2XS | 9.48GB | false | Low quality, uses SOTA techniques to be usable. 
| | Skyfall-31B-v4-IQ2XXS.gguf | IQ2XXS | 8.59GB | false | Very low quality, uses SOTA techniques to be usable. | Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to. First, make sure you have hugginface-cli installed: If the model is bigger than 50GB, it will have been split into multiple files. In order to download them all to a local folder, run: You can either specify a new local-dir (TheDrummerSkyfall-31B-v4-Q80) or download them all in place (./) Previously, you would download Q4044/48/88, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights. details in this PR. If you use Q40 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282 you will not be able to run the Q40XX files and will instead need to use Q40. Additionally, if you want to get slightly better quality for , you can use IQ4NL thanks to this PR which will also repack the weights for ARM, though only the 44 for now. The loading time may be slower but it will result in an overall speed incrase. I'm keeping this section to show the potential theoretical uplift in performance from using the Q40 with online repacking. Click to view benchmarks on an AVX2 system (EPYC7702) | model | size | params | backend | threads | test | t/s | % (vs Q40) | | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% | Q4088 offers a nice bump to prompt processing and a small bump to text generation A great write up with charts showing various performances is provided by Artefact2 here The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have. 
If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB smaller than that total.

Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart: But basically, if you're aiming for below Q4, and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in format IQXX, like IQ3M. These are newer and offer better performance for their size. These I-quants can also be used on CPU, but will be slower than their K-quant equivalent, so speed vs performance is a tradeoff you'll have to decide.

Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output.

Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
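When a quant only fits in system RAM plus VRAM, llama.cpp can offload just part of the layers to the GPU (the n_gpu_layers / --n-gpu-layers setting). A back-of-the-envelope sketch follows, assuming roughly uniform per-layer sizes; the layer count and GPU size in the example are placeholders, while the file size is taken from the Skyfall-31B-v4 Q4KM entry above.

```python
# Rough sketch: estimate how many layers to offload when a quant only
# partially fits in VRAM. Assumes layers are roughly the same size, which is
# only an approximation.
def estimate_gpu_layers(file_size_gb: float, n_layers: int,
                        vram_gb: float, headroom_gb: float = 1.5) -> int:
    per_layer_gb = file_size_gb / n_layers        # crude uniform estimate
    usable_gb = max(vram_gb - headroom_gb, 0.0)
    return min(n_layers, int(usable_gb / per_layer_gb))

# e.g. the Skyfall-31B-v4 Q4KM file above is ~18.98 GB; with an assumed
# 60 layers and a 12 GB GPU, roughly 33 layers would fit on the GPU.
print(estimate_gpu_layers(18.98, 60, 12.0))
```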

NaNK
4,699
6

Mistral-Small-Instruct-2409-GGUF

4,695
53

deepseek-ai_DeepSeek-R1-0528-Qwen3-8B-GGUF

NaNK
license:mit
4,668
45

allenai_olmOCR-2-7B-1025-GGUF

NaNK
4,643
3

TheDrummer_Cydonia-Redux-22B-v1.1-GGUF

Llamacpp imatrix Quantizations of Cydonia-Redux-22B-v1.1 by TheDrummer Original model: https://huggingface.co/TheDrummer/Cydonia-Redux-22B-v1.1 All quants made using imatrix option with dataset from here combined with a subset of combinedallsmall.parquet from Ed Addario here Run them directly with llama.cpp, or any other llama.cpp based project No chat template specified so default is used. This may be incorrect, check original model card for details. | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | Cydonia-Redux-22B-v1.1-bf16.gguf | bf16 | 44.50GB | false | Full BF16 weights. | | Cydonia-Redux-22B-v1.1-Q80.gguf | Q80 | 23.64GB | false | Extremely high quality, generally unneeded but max available quant. | | Cydonia-Redux-22B-v1.1-Q6KL.gguf | Q6KL | 18.35GB | false | Uses Q80 for embed and output weights. Very high quality, near perfect, recommended. | | Cydonia-Redux-22B-v1.1-Q6K.gguf | Q6K | 18.25GB | false | Very high quality, near perfect, recommended. | | Cydonia-Redux-22B-v1.1-Q5KL.gguf | Q5KL | 15.85GB | false | Uses Q80 for embed and output weights. High quality, recommended. | | Cydonia-Redux-22B-v1.1-Q5KM.gguf | Q5KM | 15.72GB | false | High quality, recommended. | | Cydonia-Redux-22B-v1.1-Q5KS.gguf | Q5KS | 15.32GB | false | High quality, recommended. | | Cydonia-Redux-22B-v1.1-Q41.gguf | Q41 | 13.95GB | false | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | Cydonia-Redux-22B-v1.1-Q4KL.gguf | Q4KL | 13.49GB | false | Uses Q80 for embed and output weights. Good quality, recommended. | | Cydonia-Redux-22B-v1.1-Q4KM.gguf | Q4KM | 13.34GB | false | Good quality, default size for most use cases, recommended. | | Cydonia-Redux-22B-v1.1-Q4KS.gguf | Q4KS | 12.66GB | false | Slightly lower quality with more space savings, recommended. | | Cydonia-Redux-22B-v1.1-Q40.gguf | Q40 | 12.61GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. | | Cydonia-Redux-22B-v1.1-IQ4NL.gguf | IQ4NL | 12.61GB | false | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | Cydonia-Redux-22B-v1.1-IQ4XS.gguf | IQ4XS | 11.94GB | false | Decent quality, smaller than Q4KS with similar performance, recommended. | | Cydonia-Redux-22B-v1.1-Q3KXL.gguf | Q3KXL | 11.91GB | false | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | Cydonia-Redux-22B-v1.1-Q3KL.gguf | Q3KL | 11.73GB | false | Lower quality but usable, good for low RAM availability. | | Cydonia-Redux-22B-v1.1-Q3KM.gguf | Q3KM | 10.76GB | false | Low quality. | | Cydonia-Redux-22B-v1.1-IQ3M.gguf | IQ3M | 10.06GB | false | Medium-low quality, new method with decent performance comparable to Q3KM. | | Cydonia-Redux-22B-v1.1-Q3KS.gguf | Q3KS | 9.64GB | false | Low quality, not recommended. | | Cydonia-Redux-22B-v1.1-IQ3XS.gguf | IQ3XS | 9.18GB | false | Lower quality, new method with decent performance, slightly better than Q3KS. | | Cydonia-Redux-22B-v1.1-IQ3XXS.gguf | IQ3XXS | 8.60GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. | | Cydonia-Redux-22B-v1.1-Q2KL.gguf | Q2KL | 8.47GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | Cydonia-Redux-22B-v1.1-Q2K.gguf | Q2K | 8.27GB | false | Very low quality but surprisingly usable. 
| | Cydonia-Redux-22B-v1.1-IQ2M.gguf | IQ2M | 7.62GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. | | Cydonia-Redux-22B-v1.1-IQ2S.gguf | IQ2S | 7.04GB | false | Low quality, uses SOTA techniques to be usable. | | Cydonia-Redux-22B-v1.1-IQ2XS.gguf | IQ2XS | 6.65GB | false | Low quality, uses SOTA techniques to be usable. | Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to. First, make sure you have hugginface-cli installed: If the model is bigger than 50GB, it will have been split into multiple files. In order to download them all to a local folder, run: You can either specify a new local-dir (TheDrummerCydonia-Redux-22B-v1.1-Q80) or download them all in place (./) Previously, you would download Q4044/48/88, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights. details in this PR. If you use Q40 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282 you will not be able to run the Q40XX files and will instead need to use Q40. Additionally, if you want to get slightly better quality for , you can use IQ4NL thanks to this PR which will also repack the weights for ARM, though only the 44 for now. The loading time may be slower but it will result in an overall speed incrase. I'm keeping this section to show the potential theoretical uplift in performance from using the Q40 with online repacking. Click to view benchmarks on an AVX2 system (EPYC7702) | model | size | params | backend | threads | test | t/s | % (vs Q40) | | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% | Q4088 offers a nice bump to prompt processing and a small bump to text generation A great write up with charts showing various performances 
is provided by Artefact2 here The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have. If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB Smaller than that total. Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart: But basically, if you're aiming for below Q4, and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in format IQXX, like IQ3M. These are newer and offer better performance for their size. These I-quants can also be used on CPU, but will be slower than their K-quant equivalent, so speed vs performance is a tradeoff you'll have to decide. Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output. Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski

NaNK
4,603
5

Qwen_QwQ-32B-GGUF

Original model: https://huggingface.co/Qwen/QwQ-32B All quants made using imatrix option with dataset from here Run them directly with llama.cpp, or any other llama.cpp based project | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | QwQ-32B-Q80.gguf | Q80 | 34.82GB | false | Extremely high quality, generally unneeded but max available quant. | | QwQ-32B-Q6KL.gguf | Q6KL | 27.26GB | false | Uses Q80 for embed and output weights. Very high quality, near perfect, recommended. | | QwQ-32B-Q6K.gguf | Q6K | 26.89GB | false | Very high quality, near perfect, recommended. | | QwQ-32B-Q5KL.gguf | Q5KL | 23.74GB | false | Uses Q80 for embed and output weights. High quality, recommended. | | QwQ-32B-Q5KM.gguf | Q5KM | 23.26GB | false | High quality, recommended. | | QwQ-32B-Q5KS.gguf | Q5KS | 22.64GB | false | High quality, recommended. | | QwQ-32B-Q41.gguf | Q41 | 20.64GB | false | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | QwQ-32B-Q4KL.gguf | Q4KL | 20.43GB | false | Uses Q80 for embed and output weights. Good quality, recommended. | | QwQ-32B-Q4KM.gguf | Q4KM | 19.85GB | false | Good quality, default size for most use cases, recommended. | | QwQ-32B-Q4KS.gguf | Q4KS | 18.78GB | false | Slightly lower quality with more space savings, recommended. | | QwQ-32B-Q40.gguf | Q40 | 18.71GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. | | QwQ-32B-IQ4NL.gguf | IQ4NL | 18.68GB | false | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | QwQ-32B-Q3KXL.gguf | Q3KXL | 17.93GB | false | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | QwQ-32B-IQ4XS.gguf | IQ4XS | 17.69GB | false | Decent quality, smaller than Q4KS with similar performance, recommended. | | QwQ-32B-Q3KL.gguf | Q3KL | 17.25GB | false | Lower quality but usable, good for low RAM availability. | | QwQ-32B-Q3KM.gguf | Q3KM | 15.94GB | false | Low quality. | | QwQ-32B-IQ3M.gguf | IQ3M | 14.81GB | false | Medium-low quality, new method with decent performance comparable to Q3KM. | | QwQ-32B-Q3KS.gguf | Q3KS | 14.39GB | false | Low quality, not recommended. | | QwQ-32B-IQ3XS.gguf | IQ3XS | 13.71GB | false | Lower quality, new method with decent performance, slightly better than Q3KS. | | QwQ-32B-Q2KL.gguf | Q2KL | 13.07GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | QwQ-32B-IQ3XXS.gguf | IQ3XXS | 12.84GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. | | QwQ-32B-Q2K.gguf | Q2K | 12.31GB | false | Very low quality but surprisingly usable. | | QwQ-32B-IQ2M.gguf | IQ2M | 11.26GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. | | QwQ-32B-IQ2S.gguf | IQ2S | 10.39GB | false | Low quality, uses SOTA techniques to be usable. | | QwQ-32B-IQ2XS.gguf | IQ2XS | 9.96GB | false | Low quality, uses SOTA techniques to be usable. | | QwQ-32B-IQ2XXS.gguf | IQ2XXS | 9.03GB | false | Very low quality, uses SOTA techniques to be usable. | Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to. First, make sure you have hugginface-cli installed: If the model is bigger than 50GB, it will have been split into multiple files. 
In order to download them all to a local folder, run: You can either specify a new local-dir (QwenQwQ-32B-Q80) or download them all in place (./) Previously, you would download Q4044/48/88, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights. details in this PR. If you use Q40 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282 you will not be able to run the Q40XX files and will instead need to use Q40. Additionally, if you want to get slightly better quality for , you can use IQ4NL thanks to this PR which will also repack the weights for ARM, though only the 44 for now. The loading time may be slower but it will result in an overall speed incrase. I'm keeping this section to show the potential theoretical uplift in performance from using the Q40 with online repacking. Click to view benchmarks on an AVX2 system (EPYC7702) | model | size | params | backend | threads | test | t/s | % (vs Q40) | | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% | Q4088 offers a nice bump to prompt processing and a small bump to text generation A great write up with charts showing various performances is provided by Artefact2 here The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have. If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB Smaller than that total. Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. 
These are in format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart: But basically, if you're aiming for below Q4, and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in format IQXX, like IQ3M. These are newer and offer better performance for their size.

These I-quants can also be used on CPU and Apple Metal, but will be slower than their K-quant equivalent, so speed vs performance is a tradeoff you'll have to decide. The I-quants are not compatible with Vulkan, which is also AMD, so if you have an AMD card, double-check whether you're using the rocBLAS build or the Vulkan build. At the time of writing this, LM Studio has a preview with ROCm support, and other inference engines have specific builds for ROCm.

Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output.

Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski

NaNK
license:apache-2.0
4,600
166

MiniMaxAI_MiniMax-M2.7-GGUF

NaNK
4,523
6

tencent_Hunyuan-7B-Instruct-GGUF

NaNK
4,471
5

gemma-2-9b-it-abliterated-GGUF

NaNK
4,457
43

mlabonne_gemma-3-4b-it-abliterated-GGUF

NaNK
4,455
10

cognitivecomputations_Dolphin3.0-R1-Mistral-24B-GGUF

NaNK
4,439
75

Qwen2.5-32B-Instruct-GGUF

NaNK
license:apache-2.0
4,431
61

ai21labs_AI21-Jamba-Reasoning-3B-GGUF

NaNK
4,418
5

google_gemma-3-12b-it-GGUF

NaNK
4,394
41

QwQ-32B-Preview-GGUF

NaNK
license:apache-2.0
4,392
101

tencent_Hunyuan-1.8B-Instruct-GGUF

NaNK
4,380
5

huizimao_gpt-oss-120b-uncensored-bf16-GGUF

NaNK
4,340
13

Yi-1.5-9B-Chat-GGUF

NaNK
license:apache-2.0
4,318
9

TheDrummer_Tiger-Gemma-12B-v3-GGUF

NaNK
4,238
11

aya-expanse-32b-GGUF

NaNK
license:cc-by-nc-4.0
4,190
23

mistralai_Ministral-3-3B-Instruct-2512-GGUF

NaNK
license:apache-2.0
4,148
3

Qwen_Qwen3-0.6B-GGUF

NaNK
4,145
19

openai_gpt-oss-120b-GGUF

Llamacpp imatrix Quantizations of gpt-oss-120b by openai Original model: https://huggingface.co/openai/gpt-oss-120b All quants made using imatrix option with dataset from here Run them directly with llama.cpp, or any other llama.cpp based project | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | gpt-oss-120b-MXFP4MOE.gguf | MXFP4MOE | 63.39GB | true | Special format for OpenAI's gpt-oss models, see: https://github.com/ggml-org/llama.cpp/pull/15091 | The reason is, the FFN (feed forward networks) of gpt-oss do not behave nicely when quantized to anything other than MXFP4, so they are kept at that level for everything. The rest of these are provided for your own interest in case you feel like experimenting, but the size savings is basically non-existent so I would not recommend running them, they are provided simply for show: | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | gpt-oss-120b-bf16.gguf | bf16 | 65.37GB | true | Full BF16 weights. | | gpt-oss-120b-Q6K.gguf | Q6K | 63.28GB | true | Q6K with all FFN kept at MXFP4MOE. | | gpt-oss-120b-Q4KL.gguf | Q4KL | 63.06GB | true | Uses Q80 for embed and output weights. Q4KM with all FFN kept at MXFP4MOE. | | gpt-oss-120b-Q2KL.gguf | Q2KL | 63.00GB | true | Uses Q80 for embed and output weights. Q2K with all FFN kept at MXFP4MOE. | | gpt-oss-120b-Q3KXL.gguf | Q3KXL | 62.89GB | true | Uses Q80 for embed and output weights. Q3KL with all FFN kept at MXFP4MOE. | | gpt-oss-120b-Q4KM.gguf | Q4KM | 62.84GB | true | Q4KM with all FFN kept at MXFP4MOE | | gpt-oss-120b-Q41.gguf | Q41 | 62.74GB | true | Q41 with all FFN kept at MXFP4MOE. | | gpt-oss-120b-IQ4NL.gguf | IQ4NL | 62.71GB | true | IQ4NL with all FFN kept at MXFP4MOE. | | gpt-oss-120b-IQ4XS.gguf | IQ4XS | 62.71GB | true | IQ4XS with all FFN kept at MXFP4MOE. | | gpt-oss-120b-Q3KM.gguf | Q3KM | 62.71GB | true | Q3KM with all FFN kept at MXFP4MOE. | | gpt-oss-120b-IQ3M.gguf | IQ3M | 62.71GB | true | IQ3M with all FFN kept at MXFP4MOE. | | gpt-oss-120b-Q2K.gguf | Q2K | 62.71GB | true | Q2K with all FFN kept at MXFP4MOE. | | gpt-oss-120b-Q3KS.gguf | Q3KS | 62.70GB | true | Q3KS with all FFN kept at MXFP4MOE. | | gpt-oss-120b-IQ2M.gguf | IQ2M | 62.69GB | true | IQ2M with all FFN kept at MXFP4MOE. | | gpt-oss-120b-Q40.gguf | Q40 | 62.65GB | true | Q40 with all FFN kept at MXFP4MOE. | | gpt-oss-120b-Q3KL.gguf | Q3KL | 62.60GB | true | Q3KL with all FFN kept at MXFP4MOE. | Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to. First, make sure you have hugginface-cli installed: If the model is bigger than 50GB, it will have been split into multiple files. In order to download them all to a local folder, run: You can either specify a new local-dir (openaigpt-oss-120b-Q80) or download them all in place (./) Previously, you would download Q4044/48/88, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights. details in this PR. If you use Q40 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282 you will not be able to run the Q40XX files and will instead need to use Q40. 
Additionally, if you want to get slightly better quality for , you can use IQ4NL thanks to this PR which will also repack the weights for ARM, though only the 44 for now. The loading time may be slower but it will result in an overall speed incrase. I'm keeping this section to show the potential theoretical uplift in performance from using the Q40 with online repacking. Click to view benchmarks on an AVX2 system (EPYC7702) | model | size | params | backend | threads | test | t/s | % (vs Q40) | | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% | Q4088 offers a nice bump to prompt processing and a small bump to text generation A great write up with charts showing various performances is provided by Artefact2 here The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have. If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB Smaller than that total. Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart: But basically, if you're aiming for below Q4, and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in format IQXX, like IQ3M. These are newer and offer better performance for their size. These I-quants can also be used on CPU, but will be slower than their K-quant equivalent, so speed vs performance is a tradeoff you'll have to decide. Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output. 
Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
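For reference, the huggingface-cli install and split-file download described above might look roughly like the following. This is only a sketch: the repo id (bartowski/openai_gpt-oss-120b-GGUF) and the include pattern are assumptions based on this card's naming, so check the repo's file list before running.

```bash
# Install the Hugging Face CLI
pip install -U "huggingface_hub[cli]"

# Quants over 50GB are split into a folder of shards; pull them all with a glob
# (repo id and pattern are assumptions, adjust to the quant you actually want)
huggingface-cli download bartowski/openai_gpt-oss-120b-GGUF \
  --include "*MXFP4*" \
  --local-dir ./gpt-oss-120b
```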

NaNK
4,141
8

MiniMaxAI_MiniMax-M2-GGUF

Llamacpp imatrix Quantizations of MiniMax-M2 by MiniMaxAI Original model: https://huggingface.co/MiniMaxAI/MiniMax-M2 All quants made using imatrix option with dataset from here combined with a subset of combinedallsmall.parquet from Ed Addario here Run them directly with llama.cpp, or any other llama.cpp based project No chat template specified so default is used. This may be incorrect, check original model card for details. | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | MiniMax-M2-Q80.gguf | Q80 | 243.14GB | true | Extremely high quality, generally unneeded but max available quant. | | MiniMax-M2-Q6K.gguf | Q6K | 187.81GB | true | Very high quality, near perfect, recommended. | | MiniMax-M2-Q5KM.gguf | Q5KM | 162.38GB | true | High quality, recommended. | | MiniMax-M2-Q5KS.gguf | Q5KS | 157.55GB | true | High quality, recommended. | | MiniMax-M2-Q41.gguf | Q41 | 143.31GB | true | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | MiniMax-M2-Q4KM.gguf | Q4KM | 138.59GB | true | Good quality, default size for most use cases, recommended. | | MiniMax-M2-Q4KS.gguf | Q4KS | 133.75GB | true | Slightly lower quality with more space savings, recommended. | | MiniMax-M2-Q40.gguf | Q40 | 131.34GB | true | Legacy format, offers online repacking for ARM and AVX CPU inference. | | MiniMax-M2-IQ4NL.gguf | IQ4NL | 129.24GB | true | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | MiniMax-M2-IQ4XS.gguf | IQ4XS | 122.17GB | true | Decent quality, smaller than Q4KS with similar performance, recommended. | | MiniMax-M2-Q3KXL.gguf | Q3KXL | 108.74GB | true | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | MiniMax-M2-Q3KL.gguf | Q3KL | 108.21GB | true | Lower quality but usable, good for low RAM availability. | | MiniMax-M2-Q3KM.gguf | Q3KM | 103.96GB | true | Low quality. | | MiniMax-M2-IQ3M.gguf | IQ3M | 103.95GB | true | Medium-low quality, new method with decent performance comparable to Q3KM. | | MiniMax-M2-Q3KS.gguf | Q3KS | 99.12GB | true | Low quality, not recommended. | | MiniMax-M2-IQ3XS.gguf | IQ3XS | 93.76GB | true | Lower quality, new method with decent performance, slightly better than Q3KS. | | MiniMax-M2-IQ3XXS.gguf | IQ3XXS | 90.10GB | true | Lower quality, new method with decent performance, comparable to Q3 quants. | | MiniMax-M2-Q2KL.gguf | Q2KL | 80.42GB | true | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | MiniMax-M2-Q2K.gguf | Q2K | 79.82GB | true | Very low quality but surprisingly usable. | | MiniMax-M2-IQ2M.gguf | IQ2M | 72.00GB | true | Relatively low quality, uses SOTA techniques to be surprisingly usable. | | MiniMax-M2-IQ2S.gguf | IQ2S | 63.35GB | true | Low quality, uses SOTA techniques to be usable. | | MiniMax-M2-IQ2XS.gguf | IQ2XS | 63.14GB | true | Low quality, uses SOTA techniques to be usable. | | MiniMax-M2-IQ2XXS.gguf | IQ2XXS | 54.73GB | true | Very low quality, uses SOTA techniques to be usable. | | MiniMax-M2-IQ1M.gguf | IQ1M | 49.02GB | false | Extremely low quality, not recommended. | | MiniMax-M2-IQ1S.gguf | IQ1S | 47.01GB | false | Extremely low quality, not recommended. | Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to. 
First, make sure you have hugginface-cli installed: If the model is bigger than 50GB, it will have been split into multiple files. In order to download them all to a local folder, run: You can either specify a new local-dir (MiniMaxAIMiniMax-M2-Q80) or download them all in place (./) Previously, you would download Q4044/48/88, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights. details in this PR. If you use Q40 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282 you will not be able to run the Q40XX files and will instead need to use Q40. Additionally, if you want to get slightly better quality for , you can use IQ4NL thanks to this PR which will also repack the weights for ARM, though only the 44 for now. The loading time may be slower but it will result in an overall speed incrase. I'm keeping this section to show the potential theoretical uplift in performance from using the Q40 with online repacking. Click to view benchmarks on an AVX2 system (EPYC7702) | model | size | params | backend | threads | test | t/s | % (vs Q40) | | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% | Q4088 offers a nice bump to prompt processing and a small bump to text generation A great write up with charts showing various performances is provided by Artefact2 here The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have. If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB Smaller than that total. 
Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart: But basically, if you're aiming for below Q4, and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in format IQXX, like IQ3M. These are newer and offer better performance for their size. These I-quants can also be used on CPU, but will be slower than their K-quant equivalent, so speed vs performance is a tradeoff you'll have to decide. Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output. Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
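A minimal sketch of the split-file download step mentioned above, assuming the repo id bartowski/MiniMaxAI_MiniMax-M2-GGUF; the include argument is just a glob, so change it to whichever quant fits your hardware:

```bash
# Download every shard of the Q4_K_M build into a dedicated folder
huggingface-cli download bartowski/MiniMaxAI_MiniMax-M2-GGUF \
  --include "*Q4_K_M*" \
  --local-dir MiniMax-M2-Q4_K_M
```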

NaNK
4,128
7

mistralai_Voxtral-Mini-3B-2507-GGUF

Llamacpp imatrix Quantizations of Voxtral-Mini-3B-2507 by mistralai Original model: https://huggingface.co/mistralai/Voxtral-Mini-3B-2507 All quants made using imatrix option with dataset from here Run them directly with llama.cpp, or any other llama.cpp based project | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | Voxtral-Mini-3B-2507-bf16.gguf | bf16 | 8.04GB | false | Full BF16 weights. | | Voxtral-Mini-3B-2507-Q80.gguf | Q80 | 4.27GB | false | Extremely high quality, generally unneeded but max available quant. | | Voxtral-Mini-3B-2507-Q6KL.gguf | Q6KL | 3.50GB | false | Uses Q80 for embed and output weights. Very high quality, near perfect, recommended. | | Voxtral-Mini-3B-2507-Q6K.gguf | Q6K | 3.30GB | false | Very high quality, near perfect, recommended. | | Voxtral-Mini-3B-2507-Q5KL.gguf | Q5KL | 3.12GB | false | Uses Q80 for embed and output weights. High quality, recommended. | | Voxtral-Mini-3B-2507-Q5KM.gguf | Q5KM | 2.87GB | false | High quality, recommended. | | Voxtral-Mini-3B-2507-Q5KS.gguf | Q5KS | 2.82GB | false | High quality, recommended. | | Voxtral-Mini-3B-2507-Q4KL.gguf | Q4KL | 2.77GB | false | Uses Q80 for embed and output weights. Good quality, recommended. | | Voxtral-Mini-3B-2507-Q41.gguf | Q41 | 2.60GB | false | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | Voxtral-Mini-3B-2507-Q3KXL.gguf | Q3KXL | 2.56GB | false | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | Voxtral-Mini-3B-2507-Q4KM.gguf | Q4KM | 2.47GB | false | Good quality, default size for most use cases, recommended. | | Voxtral-Mini-3B-2507-Q4KS.gguf | Q4KS | 2.38GB | false | Slightly lower quality with more space savings, recommended. | | Voxtral-Mini-3B-2507-Q40.gguf | Q40 | 2.38GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. | | Voxtral-Mini-3B-2507-IQ4NL.gguf | IQ4NL | 2.38GB | false | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | Voxtral-Mini-3B-2507-IQ4XS.gguf | IQ4XS | 2.27GB | false | Decent quality, smaller than Q4KS with similar performance, recommended. | | Voxtral-Mini-3B-2507-Q3KL.gguf | Q3KL | 2.21GB | false | Lower quality but usable, good for low RAM availability. | | Voxtral-Mini-3B-2507-Q3KM.gguf | Q3KM | 2.06GB | false | Low quality. | | Voxtral-Mini-3B-2507-Q2KL.gguf | Q2KL | 2.05GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | Voxtral-Mini-3B-2507-IQ3M.gguf | IQ3M | 1.96GB | false | Medium-low quality, new method with decent performance comparable to Q3KM. | | Voxtral-Mini-3B-2507-Q3KS.gguf | Q3KS | 1.89GB | false | Low quality, not recommended. | | Voxtral-Mini-3B-2507-IQ3XS.gguf | IQ3XS | 1.83GB | false | Lower quality, new method with decent performance, slightly better than Q3KS. | | Voxtral-Mini-3B-2507-IQ3XXS.gguf | IQ3XXS | 1.69GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. | | Voxtral-Mini-3B-2507-Q2K.gguf | Q2K | 1.66GB | false | Very low quality but surprisingly usable. | | Voxtral-Mini-3B-2507-IQ2M.gguf | IQ2M | 1.56GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. | Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to. 
First, make sure you have hugginface-cli installed: If the model is bigger than 50GB, it will have been split into multiple files. In order to download them all to a local folder, run: You can either specify a new local-dir (mistralaiVoxtral-Mini-3B-2507-Q80) or download them all in place (./) Previously, you would download Q4044/48/88, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights. details in this PR. If you use Q40 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282 you will not be able to run the Q40XX files and will instead need to use Q40. Additionally, if you want to get slightly better quality for , you can use IQ4NL thanks to this PR which will also repack the weights for ARM, though only the 44 for now. The loading time may be slower but it will result in an overall speed incrase. I'm keeping this section to show the potential theoretical uplift in performance from using the Q40 with online repacking. Click to view benchmarks on an AVX2 system (EPYC7702) | model | size | params | backend | threads | test | t/s | % (vs Q40) | | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% | Q4088 offers a nice bump to prompt processing and a small bump to text generation A great write up with charts showing various performances is provided by Artefact2 here The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have. If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB Smaller than that total. 
Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart: But basically, if you're aiming for below Q4, and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in format IQXX, like IQ3M. These are newer and offer better performance for their size. These I-quants can also be used on CPU, but will be slower than their K-quant equivalent, so speed vs performance is a tradeoff you'll have to decide. Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output. Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
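Since every quant of this model is well under 50GB, the download described above is a single file. A rough sketch, assuming the repo id bartowski/mistralai_Voxtral-Mini-3B-2507-GGUF (check the repo for exact file names):

```bash
# Grab just one quant into the current directory
huggingface-cli download bartowski/mistralai_Voxtral-Mini-3B-2507-GGUF \
  --include "*Q4_K_M*" \
  --local-dir ./
```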

NaNK
4,119
10

Phi-3.1-mini-4k-instruct-GGUF

license:mit
4,108
41

Lexi-Llama-3-8B-Uncensored-GGUF

NaNK
llama3
4,107
39

Qwen_Qwen3-Next-80B-A3B-Instruct-GGUF

NaNK
license:apache-2.0
4,081
15

Mistral-Small-22B-ArliAI-RPMax-v1.1-GGUF

NaNK
4,046
25

OLMo-2-1124-13B-Instruct-GGUF

NaNK
license:apache-2.0
4,038
15

Goekdeniz-Guelmez_Josiefied-Qwen3-8B-abliterated-v1-GGUF

NaNK
4,031
27

Mistral-Small-24B-Instruct-2501-GGUF

NaNK
license:apache-2.0
3,952
109

THUDM_GLM-Z1-9B-0414-GGUF

NaNK
license:mit
3,912
12

mistralai_Ministral-3-3B-Reasoning-2512-GGUF

NaNK
license:apache-2.0
3,911
5

TheDrummer_GLM-Steam-106B-A12B-v1-GGUF

NaNK
3,909
11

TheDrummer_Cydonia-24B-v4-GGUF

NaNK
3,906
12

DarkIdol-Llama-3.1-8B-Instruct-1.2-Uncensored-GGUF

NaNK
llama3
3,901
26

Phi-3-medium-128k-instruct-GGUF

license:mit
3,888
58

agentica-org_DeepScaleR-1.5B-Preview-GGUF

NaNK
dataset:KbsdJames/Omni-MATH
3,888
38

Mistral-7B-Instruct-v0.3-GGUF

NaNK
license:apache-2.0
3,861
25

Qwen2.5-Math-1.5B-Instruct-GGUF

NaNK
license:apache-2.0
3,820
1

Rocinante-12B-v1.1-GGUF

NaNK
3,800
11

Yi-1.5-6B-Chat-GGUF

NaNK
license:apache-2.0
3,759
3

mistralai_Mistral-Small-3.1-24B-Instruct-2503-GGUF

NaNK
license:apache-2.0
3,726
133

Qwen2.5-Coder-3B-Instruct-GGUF

NaNK
3,710
16

dolphin-2.9.1-llama-3-70b-GGUF

NaNK
base_model:meta-llama/Meta-Llama-3-70B
3,685
6

ibm-granite_granite-4.0-h-small-GGUF

Llamacpp imatrix Quantizations of granite-4.0-h-small by ibm-granite Original model: https://huggingface.co/ibm-granite/granite-4.0-h-small All quants made using imatrix option with dataset from here combined with a subset of combinedallsmall.parquet from Ed Addario here Run them directly with llama.cpp, or any other llama.cpp based project | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | granite-4.0-h-small-bf16.gguf | bf16 | 64.45GB | true | Full BF16 weights. | | granite-4.0-h-small-Q80.gguf | Q80 | 34.26GB | false | Extremely high quality, generally unneeded but max available quant. | | granite-4.0-h-small-Q6KL.gguf | Q6KL | 26.75GB | false | Uses Q80 for embed and output weights. Very high quality, near perfect, recommended. | | granite-4.0-h-small-Q6K.gguf | Q6K | 26.65GB | false | Very high quality, near perfect, recommended. | | granite-4.0-h-small-Q5KL.gguf | Q5KL | 23.24GB | false | Uses Q80 for embed and output weights. High quality, recommended. | | granite-4.0-h-small-Q5KM.gguf | Q5KM | 23.14GB | false | High quality, recommended. | | granite-4.0-h-small-Q5KS.gguf | Q5KS | 22.44GB | false | High quality, recommended. | | granite-4.0-h-small-Q41.gguf | Q41 | 20.46GB | false | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | granite-4.0-h-small-Q4KL.gguf | Q4KL | 19.85GB | false | Uses Q80 for embed and output weights. Good quality, recommended. | | granite-4.0-h-small-Q4KM.gguf | Q4KM | 19.75GB | false | Good quality, default size for most use cases, recommended. | | granite-4.0-h-small-Q4KS.gguf | Q4KS | 19.10GB | false | Slightly lower quality with more space savings, recommended. | | granite-4.0-h-small-Q40.gguf | Q40 | 18.80GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. | | granite-4.0-h-small-IQ4NL.gguf | IQ4NL | 18.53GB | false | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | granite-4.0-h-small-IQ4XS.gguf | IQ4XS | 17.56GB | false | Decent quality, smaller than Q4KS with similar performance, recommended. | | granite-4.0-h-small-Q3KXL.gguf | Q3KXL | 15.67GB | false | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | granite-4.0-h-small-Q3KL.gguf | Q3KL | 15.57GB | false | Lower quality but usable, good for low RAM availability. | | granite-4.0-h-small-Q3KM.gguf | Q3KM | 15.02GB | false | Low quality. | | granite-4.0-h-small-IQ3M.gguf | IQ3M | 15.02GB | false | Medium-low quality, new method with decent performance comparable to Q3KM. | | granite-4.0-h-small-Q3KS.gguf | Q3KS | 14.42GB | false | Low quality, not recommended. | | granite-4.0-h-small-IQ3XS.gguf | IQ3XS | 13.78GB | false | Lower quality, new method with decent performance, slightly better than Q3KS. | | granite-4.0-h-small-IQ3XXS.gguf | IQ3XXS | 13.12GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. | | granite-4.0-h-small-Q2KL.gguf | Q2KL | 11.84GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | granite-4.0-h-small-Q2K.gguf | Q2K | 11.74GB | false | Very low quality but surprisingly usable. | | granite-4.0-h-small-IQ2M.gguf | IQ2M | 10.47GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. | | granite-4.0-h-small-IQ2S.gguf | IQ2S | 9.30GB | false | Low quality, uses SOTA techniques to be usable. 
| | granite-4.0-h-small-IQ2XS.gguf | IQ2XS | 9.29GB | false | Low quality, uses SOTA techniques to be usable. | | granite-4.0-h-small-IQ2XXS.gguf | IQ2XXS | 8.15GB | false | Very low quality, uses SOTA techniques to be usable. | Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to. First, make sure you have hugginface-cli installed: If the model is bigger than 50GB, it will have been split into multiple files. In order to download them all to a local folder, run: You can either specify a new local-dir (ibm-granitegranite-4.0-h-small-Q80) or download them all in place (./) Previously, you would download Q4044/48/88, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights. details in this PR. If you use Q40 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282 you will not be able to run the Q40XX files and will instead need to use Q40. Additionally, if you want to get slightly better quality for , you can use IQ4NL thanks to this PR which will also repack the weights for ARM, though only the 44 for now. The loading time may be slower but it will result in an overall speed incrase. I'm keeping this section to show the potential theoretical uplift in performance from using the Q40 with online repacking. Click to view benchmarks on an AVX2 system (EPYC7702) | model | size | params | backend | threads | test | t/s | % (vs Q40) | | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% | Q4088 offers a nice bump to prompt processing and a small bump to text generation A great write up with charts showing various performances is provided by Artefact2 here The first thing to figure out is how big a model you can run. 
To do this, you'll need to figure out how much RAM and/or VRAM you have. If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB Smaller than that total. Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart: But basically, if you're aiming for below Q4, and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in format IQXX, like IQ3M. These are newer and offer better performance for their size. These I-quants can also be used on CPU, but will be slower than their K-quant equivalent, so speed vs performance is a tradeoff you'll have to decide. Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output. Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
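To make the "run them directly with llama.cpp" note above concrete, a hypothetical invocation; the file name is taken from the table above with underscores restored, and -ngl only matters if you built llama.cpp with GPU support:

```bash
# Chat with the downloaded quant, offloading as many layers as fit in VRAM
llama-cli -m ./granite-4.0-h-small-Q4_K_M.gguf \
  -ngl 99 -c 8192 \
  -p "Summarize the difference between K-quants and I-quants."
```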

3,624
3

deepseek-r1-qwen-2.5-32B-ablated-GGUF

NaNK
license:mit
3,609
84

microsoft_Phi-4-mini-instruct-GGUF

3,609
33

TheDrummer_Valkyrie-49B-v2-GGUF

NaNK
3,606
10

Qwen_Qwen3-VL-32B-Instruct-GGUF

Llamacpp imatrix Quantizations of Qwen3-VL-32B-Instruct by Qwen Original model: https://huggingface.co/Qwen/Qwen3-VL-32B-Instruct All quants made using imatrix option with dataset from here combined with a subset of combinedallsmall.parquet from Ed Addario here Run them directly with llama.cpp, or any other llama.cpp based project | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | Qwen3-VL-32B-Instruct-bf16.gguf | bf16 | 65.53GB | true | Full BF16 weights. | | Qwen3-VL-32B-Instruct-Q80.gguf | Q80 | 34.82GB | false | Extremely high quality, generally unneeded but max available quant. | | Qwen3-VL-32B-Instruct-Q6KL.gguf | Q6KL | 27.26GB | false | Uses Q80 for embed and output weights. Very high quality, near perfect, recommended. | | Qwen3-VL-32B-Instruct-Q6K.gguf | Q6K | 26.88GB | false | Very high quality, near perfect, recommended. | | Qwen3-VL-32B-Instruct-Q5KL.gguf | Q5KL | 23.69GB | false | Uses Q80 for embed and output weights. High quality, recommended. | | Qwen3-VL-32B-Instruct-Q5KM.gguf | Q5KM | 23.21GB | false | High quality, recommended. | | Qwen3-VL-32B-Instruct-Q5KS.gguf | Q5KS | 22.64GB | false | High quality, recommended. | | Qwen3-VL-32B-Instruct-Q41.gguf | Q41 | 20.64GB | false | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | Qwen3-VL-32B-Instruct-Q4KL.gguf | Q4KL | 20.34GB | false | Uses Q80 for embed and output weights. Good quality, recommended. | | Qwen3-VL-32B-Instruct-Q4KM.gguf | Q4KM | 19.76GB | false | Good quality, default size for most use cases, recommended. | | Qwen3-VL-32B-Instruct-Q4KS.gguf | Q4KS | 18.77GB | false | Slightly lower quality with more space savings, recommended. | | Qwen3-VL-32B-Instruct-Q40.gguf | Q40 | 18.70GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. | | Qwen3-VL-32B-Instruct-IQ4NL.gguf | IQ4NL | 18.68GB | false | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | Qwen3-VL-32B-Instruct-Q3KXL.gguf | Q3KXL | 18.01GB | false | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | Qwen3-VL-32B-Instruct-IQ4XS.gguf | IQ4XS | 17.69GB | false | Decent quality, smaller than Q4KS with similar performance, recommended. | | Qwen3-VL-32B-Instruct-Q3KL.gguf | Q3KL | 17.33GB | false | Lower quality but usable, good for low RAM availability. | | Qwen3-VL-32B-Instruct-Q3KM.gguf | Q3KM | 15.97GB | false | Low quality. | | Qwen3-VL-32B-Instruct-IQ3M.gguf | IQ3M | 14.93GB | false | Medium-low quality, new method with decent performance comparable to Q3KM. | | Qwen3-VL-32B-Instruct-Q3KS.gguf | Q3KS | 14.39GB | false | Low quality, not recommended. | | Qwen3-VL-32B-Instruct-IQ3XS.gguf | IQ3XS | 13.70GB | false | Lower quality, new method with decent performance, slightly better than Q3KS. | | Qwen3-VL-32B-Instruct-Q2KL.gguf | Q2KL | 13.10GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | Qwen3-VL-32B-Instruct-IQ3XXS.gguf | IQ3XXS | 12.82GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. | | Qwen3-VL-32B-Instruct-Q2K.gguf | Q2K | 12.34GB | false | Very low quality but surprisingly usable. | | Qwen3-VL-32B-Instruct-IQ2M.gguf | IQ2M | 11.36GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. | | Qwen3-VL-32B-Instruct-IQ2S.gguf | IQ2S | 10.51GB | false | Low quality, uses SOTA techniques to be usable. 
| | Qwen3-VL-32B-Instruct-IQ2XS.gguf | IQ2XS | 9.95GB | false | Low quality, uses SOTA techniques to be usable. | | Qwen3-VL-32B-Instruct-IQ2XXS.gguf | IQ2XXS | 9.02GB | false | Very low quality, uses SOTA techniques to be usable. | Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to. First, make sure you have hugginface-cli installed: If the model is bigger than 50GB, it will have been split into multiple files. In order to download them all to a local folder, run: You can either specify a new local-dir (QwenQwen3-VL-32B-Instruct-Q80) or download them all in place (./) Previously, you would download Q4044/48/88, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights. details in this PR. If you use Q40 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282 you will not be able to run the Q40XX files and will instead need to use Q40. Additionally, if you want to get slightly better quality for , you can use IQ4NL thanks to this PR which will also repack the weights for ARM, though only the 44 for now. The loading time may be slower but it will result in an overall speed incrase. I'm keeping this section to show the potential theoretical uplift in performance from using the Q40 with online repacking. Click to view benchmarks on an AVX2 system (EPYC7702) | model | size | params | backend | threads | test | t/s | % (vs Q40) | | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% | Q4088 offers a nice bump to prompt processing and a small bump to text generation A great write up with charts showing various performances is provided by Artefact2 here The first thing to figure out is how big a model you can run. 
To do this, you'll need to figure out how much RAM and/or VRAM you have. If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB Smaller than that total. Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart: But basically, if you're aiming for below Q4, and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in format IQXX, like IQ3M. These are newer and offer better performance for their size. These I-quants can also be used on CPU, but will be slower than their K-quant equivalent, so speed vs performance is a tradeoff you'll have to decide. Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output. Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
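As a worked example of the sizing rule above using this card's own table: a 24GB GPU leaves roughly 22GB of headroom, so Q5KS at 22.64GB is borderline, while Q4KL at 20.34GB or Q4KM at 19.76GB fit comfortably with room left for context.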

NaNK
3,564
1

EuroLLM-9B-Instruct-GGUF

NaNK
license:apache-2.0
3,549
21

Tesslate_Tessa-T1-3B-GGUF

NaNK
license:apache-2.0
3,477
4

mistralai_Ministral-3-8B-Reasoning-2512-GGUF

NaNK
license:apache-2.0
3,465
1

TheDrummer_Big-Tiger-Gemma-27B-v3-GGUF

NaNK
3,455
12

TheBeagle-v2beta-32B-MGS-GGUF

NaNK
license:mit
3,419
4

Llama-3.1-Nemotron-70B-Instruct-HF-GGUF

NaNK
llama3.1
3,375
99

google_gemma-3-1b-it-GGUF

NaNK
3,350
12

Qwen2.5-1.5B-Instruct-GGUF

NaNK
license:apache-2.0
3,342
9

Phi-3.5-mini-instruct_Uncensored-GGUF

license:apache-2.0
3,332
50

Qwen2.5-0.5B-Instruct-GGUF

NaNK
license:apache-2.0
3,325
9

Kwaipilot_KAT-Dev-72B-Exp-GGUF

Llamacpp imatrix Quantizations of KAT-Dev-72B-Exp by Kwaipilot Original model: https://huggingface.co/Kwaipilot/KAT-Dev-72B-Exp All quants made using imatrix option with dataset from here combined ...

NaNK
3,325
6

mlabonne_Qwen3-8B-abliterated-GGUF

NaNK
license:apache-2.0
3,290
11

HuggingFaceTB_SmolLM3-3B-GGUF

NaNK
license:apache-2.0
3,290
9

Qwen_Qwen3-VL-30B-A3B-Instruct-GGUF

Llamacpp imatrix Quantizations of Qwen3-VL-30B-A3B-Instruct by Qwen Original model: https://huggingface.co/Qwen/Qwen3-VL-30B-A3B-Instruct All quants made using imatrix option with dataset from here combined with a subset of combinedallsmall.parquet from Ed Addario here Run them directly with llama.cpp, or any other llama.cpp based project | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | Qwen3-VL-30B-A3B-Instruct-bf16.gguf | bf16 | 61.10GB | true | Full BF16 weights. | | Qwen3-VL-30B-A3B-Instruct-Q80.gguf | Q80 | 32.48GB | false | Extremely high quality, generally unneeded but max available quant. | | Qwen3-VL-30B-A3B-Instruct-Q6KL.gguf | Q6KL | 25.26GB | false | Uses Q80 for embed and output weights. Very high quality, near perfect, recommended. | | Qwen3-VL-30B-A3B-Instruct-Q6K.gguf | Q6K | 25.10GB | false | Very high quality, near perfect, recommended. | | Qwen3-VL-30B-A3B-Instruct-Q5KL.gguf | Q5KL | 21.94GB | false | Uses Q80 for embed and output weights. High quality, recommended. | | Qwen3-VL-30B-A3B-Instruct-Q5KM.gguf | Q5KM | 21.74GB | false | High quality, recommended. | | Qwen3-VL-30B-A3B-Instruct-Q5KS.gguf | Q5KS | 21.10GB | false | High quality, recommended. | | Qwen3-VL-30B-A3B-Instruct-Q41.gguf | Q41 | 19.21GB | false | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | Qwen3-VL-30B-A3B-Instruct-Q4KL.gguf | Q4KL | 18.86GB | false | Uses Q80 for embed and output weights. Good quality, recommended. | | Qwen3-VL-30B-A3B-Instruct-Q4KM.gguf | Q4KM | 18.63GB | false | Good quality, default size for most use cases, recommended. | | Qwen3-VL-30B-A3B-Instruct-Q4KS.gguf | Q4KS | 17.98GB | false | Slightly lower quality with more space savings, recommended. | | Qwen3-VL-30B-A3B-Instruct-Q40.gguf | Q40 | 17.63GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. | | Qwen3-VL-30B-A3B-Instruct-IQ4NL.gguf | IQ4NL | 17.39GB | false | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | Qwen3-VL-30B-A3B-Instruct-IQ4XS.gguf | IQ4XS | 16.46GB | false | Decent quality, smaller than Q4KS with similar performance, recommended. | | Qwen3-VL-30B-A3B-Instruct-Q3KXL.gguf | Q3KXL | 14.86GB | false | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | Qwen3-VL-30B-A3B-Instruct-Q3KL.gguf | Q3KL | 14.58GB | false | Lower quality but usable, good for low RAM availability. | | Qwen3-VL-30B-A3B-Instruct-Q3KM.gguf | Q3KM | 14.08GB | false | Low quality. | | Qwen3-VL-30B-A3B-Instruct-IQ3M.gguf | IQ3M | 14.08GB | false | Medium-low quality, new method with decent performance comparable to Q3KM. | | Qwen3-VL-30B-A3B-Instruct-Q3KS.gguf | Q3KS | 13.43GB | false | Low quality, not recommended. | | Qwen3-VL-30B-A3B-Instruct-IQ3XS.gguf | IQ3XS | 12.74GB | false | Lower quality, new method with decent performance, slightly better than Q3KS. | | Qwen3-VL-30B-A3B-Instruct-IQ3XXS.gguf | IQ3XXS | 12.22GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. | | Qwen3-VL-30B-A3B-Instruct-Q2KL.gguf | Q2KL | 11.21GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | Qwen3-VL-30B-A3B-Instruct-Q2K.gguf | Q2K | 10.91GB | false | Very low quality but surprisingly usable. | | Qwen3-VL-30B-A3B-Instruct-IQ2M.gguf | IQ2M | 9.87GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. 
| | Qwen3-VL-30B-A3B-Instruct-IQ2S.gguf | IQ2S | 8.74GB | false | Low quality, uses SOTA techniques to be usable. | | Qwen3-VL-30B-A3B-Instruct-IQ2XS.gguf | IQ2XS | 8.66GB | false | Low quality, uses SOTA techniques to be usable. | | Qwen3-VL-30B-A3B-Instruct-IQ2XXS.gguf | IQ2XXS | 7.57GB | false | Very low quality, uses SOTA techniques to be usable. | Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to. First, make sure you have hugginface-cli installed: If the model is bigger than 50GB, it will have been split into multiple files. In order to download them all to a local folder, run: You can either specify a new local-dir (QwenQwen3-VL-30B-A3B-Instruct-Q80) or download them all in place (./) Previously, you would download Q4044/48/88, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights. details in this PR. If you use Q40 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282 you will not be able to run the Q40XX files and will instead need to use Q40. Additionally, if you want to get slightly better quality for , you can use IQ4NL thanks to this PR which will also repack the weights for ARM, though only the 44 for now. The loading time may be slower but it will result in an overall speed incrase. I'm keeping this section to show the potential theoretical uplift in performance from using the Q40 with online repacking. Click to view benchmarks on an AVX2 system (EPYC7702) | model | size | params | backend | threads | test | t/s | % (vs Q40) | | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% | Q4088 offers a nice bump to prompt processing and a small bump to text generation A great write up with charts showing various performances is 
provided by Artefact2 here The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have. If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB Smaller than that total. Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart: But basically, if you're aiming for below Q4, and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in format IQXX, like IQ3M. These are newer and offer better performance for their size. These I-quants can also be used on CPU, but will be slower than their K-quant equivalent, so speed vs performance is a tradeoff you'll have to decide. Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output. Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
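If you would rather serve the model over HTTP than use the interactive CLI, here is a text-only sketch with llama-server; the file name is a placeholder based on the table above, and vision inputs may need additional files not covered here:

```bash
# Expose an OpenAI-compatible endpoint on localhost:8080
llama-server -m ./Qwen3-VL-30B-A3B-Instruct-Q4_K_S.gguf \
  -c 8192 -ngl 99 --port 8080
```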

NaNK
3,290
6

RekaAI_reka-flash-3-GGUF

NaNK
license:apache-2.0
3,284
46

Qwen2.5.1-Coder-7B-Instruct-GGUF

NaNK
license:apache-2.0
3,280
95

deepcogito_cogito-v1-preview-qwen-32B-GGUF

NaNK
license:apache-2.0
3,218
13

mistralai_Ministral-3-8B-Instruct-2512-GGUF

NaNK
license:apache-2.0
3,133
2

Ilya626_Cydonia_Vistral-GGUF

Llamacpp imatrix Quantizations of CydoniaVistral by Ilya626 Original model: https://huggingface.co/Ilya626/CydoniaVistral All quants made using imatrix option with dataset from here combined with a subset of combinedallsmall.parquet from Ed Addario here Run them directly with llama.cpp, or any other llama.cpp based project | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | CydoniaVistral-bf16.gguf | bf16 | 47.15GB | false | Full BF16 weights. | | CydoniaVistral-Q80.gguf | Q80 | 25.05GB | false | Extremely high quality, generally unneeded but max available quant. | | CydoniaVistral-Q6KL.gguf | Q6KL | 19.67GB | false | Uses Q80 for embed and output weights. Very high quality, near perfect, recommended. | | CydoniaVistral-Q6K.gguf | Q6K | 19.35GB | false | Very high quality, near perfect, recommended. | | CydoniaVistral-Q5KL.gguf | Q5KL | 17.18GB | false | Uses Q80 for embed and output weights. High quality, recommended. | | CydoniaVistral-Q5KM.gguf | Q5KM | 16.76GB | false | High quality, recommended. | | CydoniaVistral-Q5KS.gguf | Q5KS | 16.30GB | false | High quality, recommended. | | CydoniaVistral-Q41.gguf | Q41 | 14.87GB | false | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | CydoniaVistral-Q4KL.gguf | Q4KL | 14.83GB | false | Uses Q80 for embed and output weights. Good quality, recommended. | | CydoniaVistral-Q4KM.gguf | Q4KM | 14.33GB | false | Good quality, default size for most use cases, recommended. | | CydoniaVistral-Q4KS.gguf | Q4KS | 13.55GB | false | Slightly lower quality with more space savings, recommended. | | CydoniaVistral-Q40.gguf | Q40 | 13.49GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. | | CydoniaVistral-IQ4NL.gguf | IQ4NL | 13.47GB | false | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | CydoniaVistral-Q3KXL.gguf | Q3KXL | 12.99GB | false | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | CydoniaVistral-IQ4XS.gguf | IQ4XS | 12.76GB | false | Decent quality, smaller than Q4KS with similar performance, recommended. | | CydoniaVistral-Q3KL.gguf | Q3KL | 12.40GB | false | Lower quality but usable, good for low RAM availability. | | CydoniaVistral-Q3KM.gguf | Q3KM | 11.47GB | false | Low quality. | | CydoniaVistral-IQ3M.gguf | IQ3M | 10.65GB | false | Medium-low quality, new method with decent performance comparable to Q3KM. | | CydoniaVistral-Q3KS.gguf | Q3KS | 10.40GB | false | Low quality, not recommended. | | CydoniaVistral-IQ3XS.gguf | IQ3XS | 9.91GB | false | Lower quality, new method with decent performance, slightly better than Q3KS. | | CydoniaVistral-Q2KL.gguf | Q2KL | 9.55GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | CydoniaVistral-IQ3XXS.gguf | IQ3XXS | 9.28GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. | | CydoniaVistral-Q2K.gguf | Q2K | 8.89GB | false | Very low quality but surprisingly usable. | | CydoniaVistral-IQ2M.gguf | IQ2M | 8.11GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. | | CydoniaVistral-IQ2S.gguf | IQ2S | 7.48GB | false | Low quality, uses SOTA techniques to be usable. | | CydoniaVistral-IQ2XS.gguf | IQ2XS | 7.21GB | false | Low quality, uses SOTA techniques to be usable. 
| Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to. First, make sure you have hugginface-cli installed: If the model is bigger than 50GB, it will have been split into multiple files. In order to download them all to a local folder, run: You can either specify a new local-dir (Ilya626CydoniaVistral-Q80) or download them all in place (./) Previously, you would download Q4044/48/88, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights. details in this PR. If you use Q40 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282 you will not be able to run the Q40XX files and will instead need to use Q40. Additionally, if you want to get slightly better quality for , you can use IQ4NL thanks to this PR which will also repack the weights for ARM, though only the 44 for now. The loading time may be slower but it will result in an overall speed incrase. I'm keeping this section to show the potential theoretical uplift in performance from using the Q40 with online repacking. Click to view benchmarks on an AVX2 system (EPYC7702) | model | size | params | backend | threads | test | t/s | % (vs Q40) | | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% | Q4088 offers a nice bump to prompt processing and a small bump to text generation A great write up with charts showing various performances is provided by Artefact2 here The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have. If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. 
If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB Smaller than that total. Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart: But basically, if you're aiming for below Q4, and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in format IQXX, like IQ3M. These are newer and offer better performance for their size. These I-quants can also be used on CPU, but will be slower than their K-quant equivalent, so speed vs performance is a tradeoff you'll have to decide. Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output. Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
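To make the sizing rule above concrete with this card's own numbers: a 16GB GPU leaves room for a roughly 14-15GB file, which points at CydoniaVistral-Q4KM.gguf (14.33GB), while a 12GB GPU would push you down to something like CydoniaVistral-IQ3M.gguf (10.65GB).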

3,122
0

phi-4-GGUF

NaNK
license:mit
3,119
58

TheDrummer_Snowpiercer-15B-v3-GGUF

NaNK
3,117
3

Qwen_Qwen3-32B-GGUF

NaNK
license:apache-2.0
3,116
39

PokeeAI_pokee_research_7b-GGUF

Llamacpp imatrix Quantizations of pokeeresearch7b by PokeeAI Original model: https://huggingface.co/PokeeAI/pokeeresearch7b All quants made using imatrix option with dataset from here combined with a subset of combinedallsmall.parquet from Ed Addario here Run them directly with llama.cpp, or any other llama.cpp based project No chat template specified so default is used. This may be incorrect, check original model card for details. | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | pokeeresearch7b-bf16.gguf | bf16 | 15.24GB | false | Full BF16 weights. | | pokeeresearch7b-Q80.gguf | Q80 | 8.10GB | false | Extremely high quality, generally unneeded but max available quant. | | pokeeresearch7b-Q6KL.gguf | Q6KL | 6.52GB | false | Uses Q80 for embed and output weights. Very high quality, near perfect, recommended. | | pokeeresearch7b-Q6K.gguf | Q6K | 6.25GB | false | Very high quality, near perfect, recommended. | | pokeeresearch7b-Q5KL.gguf | Q5KL | 5.78GB | false | Uses Q80 for embed and output weights. High quality, recommended. | | pokeeresearch7b-Q5KM.gguf | Q5KM | 5.44GB | false | High quality, recommended. | | pokeeresearch7b-Q5KS.gguf | Q5KS | 5.32GB | false | High quality, recommended. | | pokeeresearch7b-Q4KL.gguf | Q4KL | 5.09GB | false | Uses Q80 for embed and output weights. Good quality, recommended. | | pokeeresearch7b-Q41.gguf | Q41 | 4.87GB | false | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | pokeeresearch7b-Q4KM.gguf | Q4KM | 4.68GB | false | Good quality, default size for most use cases, recommended. | | pokeeresearch7b-Q3KXL.gguf | Q3KXL | 4.57GB | false | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | pokeeresearch7b-Q4KS.gguf | Q4KS | 4.46GB | false | Slightly lower quality with more space savings, recommended. | | pokeeresearch7b-Q40.gguf | Q40 | 4.44GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. | | pokeeresearch7b-IQ4NL.gguf | IQ4NL | 4.44GB | false | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | pokeeresearch7b-IQ4XS.gguf | IQ4XS | 4.22GB | false | Decent quality, smaller than Q4KS with similar performance, recommended. | | pokeeresearch7b-Q3KL.gguf | Q3KL | 4.09GB | false | Lower quality but usable, good for low RAM availability. | | pokeeresearch7b-Q3KM.gguf | Q3KM | 3.81GB | false | Low quality. | | pokeeresearch7b-IQ3M.gguf | IQ3M | 3.57GB | false | Medium-low quality, new method with decent performance comparable to Q3KM. | | pokeeresearch7b-Q2KL.gguf | Q2KL | 3.55GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | pokeeresearch7b-Q3KS.gguf | Q3KS | 3.49GB | false | Low quality, not recommended. | | pokeeresearch7b-IQ3XS.gguf | IQ3XS | 3.35GB | false | Lower quality, new method with decent performance, slightly better than Q3KS. | | pokeeresearch7b-IQ3XXS.gguf | IQ3XXS | 3.11GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. | | pokeeresearch7b-Q2K.gguf | Q2K | 3.02GB | false | Very low quality but surprisingly usable. | | pokeeresearch7b-IQ2M.gguf | IQ2M | 2.78GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. | Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to. 
First, make sure you have hugginface-cli installed: If the model is bigger than 50GB, it will have been split into multiple files. In order to download them all to a local folder, run: You can either specify a new local-dir (PokeeAIpokeeresearch7b-Q80) or download them all in place (./) Previously, you would download Q4044/48/88, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights. details in this PR. If you use Q40 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282 you will not be able to run the Q40XX files and will instead need to use Q40. Additionally, if you want to get slightly better quality for , you can use IQ4NL thanks to this PR which will also repack the weights for ARM, though only the 44 for now. The loading time may be slower but it will result in an overall speed incrase. I'm keeping this section to show the potential theoretical uplift in performance from using the Q40 with online repacking. Click to view benchmarks on an AVX2 system (EPYC7702) | model | size | params | backend | threads | test | t/s | % (vs Q40) | | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% | Q4088 offers a nice bump to prompt processing and a small bump to text generation A great write up with charts showing various performances is provided by Artefact2 here The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have. If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB Smaller than that total. 
Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart: But basically, if you're aiming for below Q4, and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in format IQXX, like IQ3M. These are newer and offer better performance for their size. These I-quants can also be used on CPU, but will be slower than their K-quant equivalent, so speed vs performance is a tradeoff you'll have to decide. Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output. Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
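Whichever quant you settle on, trying it out with llama.cpp itself is a one-liner once you have the binaries; roughly (the model path, context size and offload count are placeholders to adjust for your setup):

```bash
# Offload all layers to the GPU when the file fits in VRAM (-ngl 99); lower the number if it doesn't
./llama-cli -m ./pokeeresearch7b-Q4_K_M.gguf -ngl 99 -c 8192 -p "Hello, who are you?"
```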

NaNK
3,094
4

TheDrummer_Behemoth-X-123B-v2-GGUF

NaNK
3,093
6

OpenGVLab_InternVL3_5-30B-A3B-GGUF

NaNK
3,073
7

MN-12B-Lyra-v4-GGUF

NaNK
license:cc-by-nc-4.0
3,068
14

Qwen_Qwen3-VL-4B-Instruct-GGUF

Llamacpp imatrix Quantizations of Qwen3-VL-4B-Instruct by Qwen Original model: https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct All quants made using imatrix option with dataset from here combined with a subset of combinedallsmall.parquet from Ed Addario here Run them directly with llama.cpp, or any other llama.cpp based project | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | Qwen3-VL-4B-Instruct-bf16.gguf | bf16 | 8.05GB | false | Full BF16 weights. | | Qwen3-VL-4B-Instruct-Q80.gguf | Q80 | 4.28GB | false | Extremely high quality, generally unneeded but max available quant. | | Qwen3-VL-4B-Instruct-Q6KL.gguf | Q6KL | 3.40GB | false | Uses Q80 for embed and output weights. Very high quality, near perfect, recommended. | | Qwen3-VL-4B-Instruct-Q6K.gguf | Q6K | 3.31GB | false | Very high quality, near perfect, recommended. | | Qwen3-VL-4B-Instruct-Q5KL.gguf | Q5KL | 2.98GB | false | Uses Q80 for embed and output weights. High quality, recommended. | | Qwen3-VL-4B-Instruct-Q5KM.gguf | Q5KM | 2.89GB | false | High quality, recommended. | | Qwen3-VL-4B-Instruct-Q5KS.gguf | Q5KS | 2.82GB | false | High quality, recommended. | | Qwen3-VL-4B-Instruct-Q41.gguf | Q41 | 2.60GB | false | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | Qwen3-VL-4B-Instruct-Q4KL.gguf | Q4KL | 2.59GB | false | Uses Q80 for embed and output weights. Good quality, recommended. | | Qwen3-VL-4B-Instruct-Q4KM.gguf | Q4KM | 2.50GB | false | Good quality, default size for most use cases, recommended. | | Qwen3-VL-4B-Instruct-Q4KS.gguf | Q4KS | 2.38GB | false | Slightly lower quality with more space savings, recommended. | | Qwen3-VL-4B-Instruct-Q40.gguf | Q40 | 2.38GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. | | Qwen3-VL-4B-Instruct-IQ4NL.gguf | IQ4NL | 2.38GB | false | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | Qwen3-VL-4B-Instruct-Q3KXL.gguf | Q3KXL | 2.33GB | false | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | Qwen3-VL-4B-Instruct-IQ4XS.gguf | IQ4XS | 2.27GB | false | Decent quality, smaller than Q4KS with similar performance, recommended. | | Qwen3-VL-4B-Instruct-Q3KL.gguf | Q3KL | 2.24GB | false | Lower quality but usable, good for low RAM availability. | | Qwen3-VL-4B-Instruct-Q3KM.gguf | Q3KM | 2.08GB | false | Low quality. | | Qwen3-VL-4B-Instruct-IQ3M.gguf | IQ3M | 1.96GB | false | Medium-low quality, new method with decent performance comparable to Q3KM. | | Qwen3-VL-4B-Instruct-Q3KS.gguf | Q3KS | 1.89GB | false | Low quality, not recommended. | | Qwen3-VL-4B-Instruct-IQ3XS.gguf | IQ3XS | 1.81GB | false | Lower quality, new method with decent performance, slightly better than Q3KS. | | Qwen3-VL-4B-Instruct-Q2KL.gguf | Q2KL | 1.76GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | Qwen3-VL-4B-Instruct-IQ3XXS.gguf | IQ3XXS | 1.67GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. | | Qwen3-VL-4B-Instruct-Q2K.gguf | Q2K | 1.67GB | false | Very low quality but surprisingly usable. | | Qwen3-VL-4B-Instruct-IQ2M.gguf | IQ2M | 1.51GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. 
| Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to.
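Since this is a vision model, the quants above only cover the language side when run with plain llama-cli; image input goes through llama.cpp's multimodal tooling. Roughly, and assuming your llama.cpp build includes the mtmd tools and that a matching mmproj GGUF is published alongside these quants (check the repo's file list, the names below are illustrative):

```bash
# Text + image chat; the mmproj projector filename here is hypothetical
./llama-mtmd-cli \
  -m ./Qwen3-VL-4B-Instruct-Q4_K_M.gguf \
  --mmproj ./mmproj-Qwen3-VL-4B-Instruct-f16.gguf \
  --image ./photo.jpg \
  -p "Describe this image."
```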

NaNK
3,067
2

mistralai_Devstral-Small-2507-GGUF

license:apache-2.0
3,032
9

Llama-3.3-70B-Instruct-abliterated-GGUF

NaNK
llama
3,031
17

Dolphin3.0-Llama3.1-8B-GGUF

NaNK
base_model:dphn/Dolphin3.0-Llama3.1-8B
3,022
12

aya-expanse-8b-GGUF

NaNK
license:cc-by-nc-4.0
3,006
44

internlm_JanusCoderV-7B-GGUF

Llamacpp imatrix Quantizations of JanusCoderV-7B by internlm Original model: https://huggingface.co/internlm/JanusCoderV-7B All quants made using imatrix option with dataset from here combined with a subset of combinedallsmall.parquet from Ed Addario here Run them directly with llama.cpp, or any other llama.cpp based project No chat template specified so default is used. This may be incorrect, check original model card for details. | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | JanusCoderV-7B-bf16.gguf | bf16 | 15.24GB | false | Full BF16 weights. | | JanusCoderV-7B-Q80.gguf | Q80 | 8.10GB | false | Extremely high quality, generally unneeded but max available quant. | | JanusCoderV-7B-Q6KL.gguf | Q6KL | 6.52GB | false | Uses Q80 for embed and output weights. Very high quality, near perfect, recommended. | | JanusCoderV-7B-Q6K.gguf | Q6K | 6.25GB | false | Very high quality, near perfect, recommended. | | JanusCoderV-7B-Q5KL.gguf | Q5KL | 5.78GB | false | Uses Q80 for embed and output weights. High quality, recommended. | | JanusCoderV-7B-Q5KM.gguf | Q5KM | 5.44GB | false | High quality, recommended. | | JanusCoderV-7B-Q5KS.gguf | Q5KS | 5.32GB | false | High quality, recommended. | | JanusCoderV-7B-Q4KL.gguf | Q4KL | 5.09GB | false | Uses Q80 for embed and output weights. Good quality, recommended. | | JanusCoderV-7B-Q41.gguf | Q41 | 4.87GB | false | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | JanusCoderV-7B-Q4KM.gguf | Q4KM | 4.68GB | false | Good quality, default size for most use cases, recommended. | | JanusCoderV-7B-Q3KXL.gguf | Q3KXL | 4.57GB | false | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | JanusCoderV-7B-Q4KS.gguf | Q4KS | 4.46GB | false | Slightly lower quality with more space savings, recommended. | | JanusCoderV-7B-Q40.gguf | Q40 | 4.44GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. | | JanusCoderV-7B-IQ4NL.gguf | IQ4NL | 4.44GB | false | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | JanusCoderV-7B-IQ4XS.gguf | IQ4XS | 4.22GB | false | Decent quality, smaller than Q4KS with similar performance, recommended. | | JanusCoderV-7B-Q3KL.gguf | Q3KL | 4.09GB | false | Lower quality but usable, good for low RAM availability. | | JanusCoderV-7B-Q3KM.gguf | Q3KM | 3.81GB | false | Low quality. | | JanusCoderV-7B-IQ3M.gguf | IQ3M | 3.57GB | false | Medium-low quality, new method with decent performance comparable to Q3KM. | | JanusCoderV-7B-Q2KL.gguf | Q2KL | 3.55GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | JanusCoderV-7B-Q3KS.gguf | Q3KS | 3.49GB | false | Low quality, not recommended. | | JanusCoderV-7B-IQ3XS.gguf | IQ3XS | 3.35GB | false | Lower quality, new method with decent performance, slightly better than Q3KS. | | JanusCoderV-7B-IQ3XXS.gguf | IQ3XXS | 3.11GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. | | JanusCoderV-7B-Q2K.gguf | Q2K | 3.02GB | false | Very low quality but surprisingly usable. | | JanusCoderV-7B-IQ2M.gguf | IQ2M | 2.78GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. | Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to. 
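If you want to reproduce numbers like the AVX2 benchmark table shown further up on your own hardware, llama.cpp ships a llama-bench tool; something along these lines (the model path and thread count are placeholders):

```bash
# Prompt-processing (pp) and text-generation (tg) throughput, matching the pp512/tg128-style tests above
./llama-bench -m ./JanusCoderV-7B-Q4_K_M.gguf -p 512,1024,2048 -n 128,256,512 -t 64
```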

NaNK
2,966
1

Trappu_Magnum-Picaro-0.7-v2-12b-GGUF

NaNK
license:apache-2.0
2,960
4

Meta-Llama-3-70B-Instruct-GGUF

NaNK
llama
2,857
50

internlm_JanusCoder-14B-GGUF

NaNK
2,832
4

nvidia_Qwen3-Nemotron-32B-RLBFF-GGUF

Llamacpp imatrix Quantizations of Qwen3-Nemotron-32B-RLBFF by nvidia Original model: https://huggingface.co/nvidia/Qwen3-Nemotron-32B-RLBFF All quants made using imatrix option with dataset from here combined with a subset of combinedallsmall.parquet from Ed Addario here Run them directly with llama.cpp, or any other llama.cpp based project | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | Qwen3-Nemotron-32B-RLBFF-bf16.gguf | bf16 | 65.53GB | true | Full BF16 weights. | | Qwen3-Nemotron-32B-RLBFF-Q80.gguf | Q80 | 34.82GB | false | Extremely high quality, generally unneeded but max available quant. | | Qwen3-Nemotron-32B-RLBFF-Q6KL.gguf | Q6KL | 27.26GB | false | Uses Q80 for embed and output weights. Very high quality, near perfect, recommended. | | Qwen3-Nemotron-32B-RLBFF-Q6K.gguf | Q6K | 26.88GB | false | Very high quality, near perfect, recommended. | | Qwen3-Nemotron-32B-RLBFF-Q5KL.gguf | Q5KL | 23.69GB | false | Uses Q80 for embed and output weights. High quality, recommended. | | Qwen3-Nemotron-32B-RLBFF-Q5KM.gguf | Q5KM | 23.21GB | false | High quality, recommended. | | Qwen3-Nemotron-32B-RLBFF-Q5KS.gguf | Q5KS | 22.64GB | false | High quality, recommended. | | Qwen3-Nemotron-32B-RLBFF-Q41.gguf | Q41 | 20.64GB | false | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | Qwen3-Nemotron-32B-RLBFF-Q4KL.gguf | Q4KL | 20.34GB | false | Uses Q80 for embed and output weights. Good quality, recommended. | | Qwen3-Nemotron-32B-RLBFF-Q4KM.gguf | Q4KM | 19.76GB | false | Good quality, default size for most use cases, recommended. | | Qwen3-Nemotron-32B-RLBFF-Q4KS.gguf | Q4KS | 18.77GB | false | Slightly lower quality with more space savings, recommended. | | Qwen3-Nemotron-32B-RLBFF-Q40.gguf | Q40 | 18.70GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. | | Qwen3-Nemotron-32B-RLBFF-IQ4NL.gguf | IQ4NL | 18.68GB | false | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | Qwen3-Nemotron-32B-RLBFF-Q3KXL.gguf | Q3KXL | 18.01GB | false | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | Qwen3-Nemotron-32B-RLBFF-IQ4XS.gguf | IQ4XS | 17.69GB | false | Decent quality, smaller than Q4KS with similar performance, recommended. | | Qwen3-Nemotron-32B-RLBFF-Q3KL.gguf | Q3KL | 17.33GB | false | Lower quality but usable, good for low RAM availability. | | Qwen3-Nemotron-32B-RLBFF-Q3KM.gguf | Q3KM | 15.97GB | false | Low quality. | | Qwen3-Nemotron-32B-RLBFF-IQ3M.gguf | IQ3M | 14.93GB | false | Medium-low quality, new method with decent performance comparable to Q3KM. | | Qwen3-Nemotron-32B-RLBFF-Q3KS.gguf | Q3KS | 14.39GB | false | Low quality, not recommended. | | Qwen3-Nemotron-32B-RLBFF-IQ3XS.gguf | IQ3XS | 13.70GB | false | Lower quality, new method with decent performance, slightly better than Q3KS. | | Qwen3-Nemotron-32B-RLBFF-Q2KL.gguf | Q2KL | 13.10GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | Qwen3-Nemotron-32B-RLBFF-IQ3XXS.gguf | IQ3XXS | 12.82GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. | | Qwen3-Nemotron-32B-RLBFF-Q2K.gguf | Q2K | 12.34GB | false | Very low quality but surprisingly usable. | | Qwen3-Nemotron-32B-RLBFF-IQ2M.gguf | IQ2M | 11.36GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. 
| | Qwen3-Nemotron-32B-RLBFF-IQ2S.gguf | IQ2S | 10.51GB | false | Low quality, uses SOTA techniques to be usable. | | Qwen3-Nemotron-32B-RLBFF-IQ2XS.gguf | IQ2XS | 9.95GB | false | Low quality, uses SOTA techniques to be usable. | | Qwen3-Nemotron-32B-RLBFF-IQ2XXS.gguf | IQ2XXS | 9.02GB | false | Very low quality, uses SOTA techniques to be usable. | Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to.
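To apply the "file size 1-2GB smaller than your VRAM" rule of thumb from the notes further up to a 32B model like this one, first check how much memory your GPU actually has; for example on NVIDIA cards (the numbers in the comments are illustrative):

```bash
# Total VRAM per GPU, in MiB
nvidia-smi --query-gpu=memory.total --format=csv,noheader
# e.g. a 24GB card -> leave 1-2GB of headroom -> roughly Q5_K_S (22.64GB) or anything smaller,
# while a 16GB card points at roughly IQ3_M (14.93GB) or smaller
```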

NaNK
2,757
1

Qwen_Qwen2.5-VL-32B-Instruct-GGUF

NaNK
license:apache-2.0
2,748
7

Qwen_Qwen3-VL-8B-Instruct-GGUF

NaNK
2,747
0

Qwen_Qwen3-235B-A22B-Instruct-2507-GGUF

NaNK
2,731
7

zai-org_GLM-4.5-Air-GGUF

Llamacpp imatrix Quantizations of GLM-4.5-Air by zai-org Original model: https://huggingface.co/zai-org/GLM-4.5-Air All quants made using imatrix option with dataset from here Run them directly with llama.cpp, or any other llama.cpp based project | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | GLM-4.5-Air-Q80.gguf | Q80 | 117.46GB | true | Extremely high quality, generally unneeded but max available quant. | | GLM-4.5-Air-Q6K.gguf | Q6K | 99.18GB | true | Very high quality, near perfect, recommended. | | GLM-4.5-Air-Q5KM.gguf | Q5KM | 83.72GB | true | High quality, recommended. | | GLM-4.5-Air-Q5KS.gguf | Q5KS | 78.55GB | true | High quality, recommended. | | GLM-4.5-Air-Q4KM.gguf | Q4KM | 73.50GB | true | Good quality, default size for most use cases, recommended. | | GLM-4.5-Air-Q41.gguf | Q41 | 69.55GB | true | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | GLM-4.5-Air-Q4KS.gguf | Q4KS | 68.31GB | true | Slightly lower quality with more space savings, recommended. | | GLM-4.5-Air-Q40.gguf | Q40 | 63.76GB | true | Legacy format, offers online repacking for ARM and AVX CPU inference. | | GLM-4.5-Air-IQ4NL.gguf | IQ4NL | 63.06GB | true | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | GLM-4.5-Air-IQ4XS.gguf | IQ4XS | 60.81GB | true | Decent quality, smaller than Q4KS with similar performance, recommended. | | GLM-4.5-Air-Q3KXL.gguf | Q3KXL | 56.45GB | true | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | GLM-4.5-Air-Q3KL.gguf | Q3KL | 55.91GB | true | Lower quality but usable, good for low RAM availability. | | GLM-4.5-Air-Q3KM.gguf | Q3KM | 55.48GB | true | Low quality. | | GLM-4.5-Air-IQ3M.gguf | IQ3M | 55.48GB | true | Medium-low quality, new method with decent performance comparable to Q3KM. | | GLM-4.5-Air-Q3KS.gguf | Q3KS | 53.42GB | true | Low quality, not recommended. | | GLM-4.5-Air-IQ3XS.gguf | IQ3XS | 50.84GB | true | Lower quality, new method with decent performance, slightly better than Q3KS. | | GLM-4.5-Air-IQ3XXS.gguf | IQ3XXS | 50.34GB | true | Lower quality, new method with decent performance, comparable to Q3 quants. | | GLM-4.5-Air-Q2KL.gguf | Q2KL | 46.71GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | GLM-4.5-Air-Q2K.gguf | Q2K | 46.10GB | false | Very low quality but surprisingly usable. | | GLM-4.5-Air-IQ2M.gguf | IQ2M | 45.12GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. | | GLM-4.5-Air-IQ2S.gguf | IQ2S | 42.54GB | false | Low quality, uses SOTA techniques to be usable. | | GLM-4.5-Air-IQ2XS.gguf | IQ2XS | 42.19GB | false | Low quality, uses SOTA techniques to be usable. | | GLM-4.5-Air-IQ2XXS.gguf | IQ2XXS | 39.62GB | false | Very low quality, uses SOTA techniques to be usable. | | GLM-4.5-Air-IQ1M.gguf | IQ1M | 37.86GB | false | Extremely low quality, not recommended. | Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to. First, make sure you have hugginface-cli installed: If the model is bigger than 50GB, it will have been split into multiple files. 
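To download all the parts of one split quant to a local folder and then run it, the commands look roughly like this (a sketch: the repo id matches this listing, but the quant choice, folder name and shard count are illustrative; llama.cpp's standard -00001-of-0000N split naming is assumed, and only the first shard needs to be passed since the rest are picked up automatically):

```bash
# Grab every part of the chosen quant into one folder
huggingface-cli download bartowski/zai-org_GLM-4.5-Air-GGUF \
  --include "*Q4_K_M*" \
  --local-dir GLM-4.5-Air-Q4_K_M

# Point llama.cpp at the first shard; the remaining parts are loaded automatically
# (the actual part count in the filename will differ)
./llama-cli -m GLM-4.5-Air-Q4_K_M/GLM-4.5-Air-Q4_K_M-00001-of-00002.gguf -p "Hi"
```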

2,730
14

gemma-2-2b-it-abliterated-GGUF

NaNK
2,716
65

Qwen2.5-Coder-3B-GGUF

NaNK
2,712
4

TheDrummer_Behemoth-X-123B-v2.1-GGUF

NaNK
2,709
1

Codestral-22B-v0.1-GGUF

Llamacpp imatrix Quantizations of Codestral-22B-v0.1 Original model: https://huggingface.co/mistralai/Codestral-22B-v0.1 All quants made using imatrix option with dataset from here | Filename | Quant type | File Size | Description | | -------- | ---------- | --------- | ----------- | | Codestral-22B-v0.1-Q80.gguf | Q80 | 23.64GB | Extremely high quality, generally unneeded but max available quant. | | Codestral-22B-v0.1-Q6K.gguf | Q6K | 18.25GB | Very high quality, near perfect, recommended. | | Codestral-22B-v0.1-Q5KM.gguf | Q5KM | 15.72GB | High quality, recommended. | | Codestral-22B-v0.1-Q5KS.gguf | Q5KS | 15.32GB | High quality, recommended. | | Codestral-22B-v0.1-Q4KM.gguf | Q4KM | 13.34GB | Good quality, uses about 4.83 bits per weight, recommended. | | Codestral-22B-v0.1-Q4KS.gguf | Q4KS | 12.66GB | Slightly lower quality with more space savings, recommended. | | Codestral-22B-v0.1-IQ4XS.gguf | IQ4XS | 11.93GB | Decent quality, smaller than Q4KS with similar performance, recommended. | | Codestral-22B-v0.1-Q3KL.gguf | Q3KL | 11.73GB | Lower quality but usable, good for low RAM availability. | | Codestral-22B-v0.1-Q3KM.gguf | Q3KM | 10.75GB | Even lower quality. | | Codestral-22B-v0.1-IQ3M.gguf | IQ3M | 10.06GB | Medium-low quality, new method with decent performance comparable to Q3KM. | | Codestral-22B-v0.1-Q3KS.gguf | Q3KS | 9.64GB | Low quality, not recommended. | | Codestral-22B-v0.1-IQ3XS.gguf | IQ3XS | 9.17GB | Lower quality, new method with decent performance, slightly better than Q3KS. | | Codestral-22B-v0.1-IQ3XXS.gguf | IQ3XXS | 8.59GB | Lower quality, new method with decent performance, comparable to Q3 quants. | | Codestral-22B-v0.1-Q2K.gguf | Q2K | 8.27GB | Very low quality but surprisingly usable. | | Codestral-22B-v0.1-IQ2M.gguf | IQ2M | 7.61GB | Very low quality, uses SOTA techniques to also be surprisingly usable. | | Codestral-22B-v0.1-IQ2S.gguf | IQ2S | 7.03GB | Very low quality, uses SOTA techniques to be usable. | | Codestral-22B-v0.1-IQ2XS.gguf | IQ2XS | 6.64GB | Very low quality, uses SOTA techniques to be usable. | First, make sure you have hugginface-cli installed: If the model is bigger than 50GB, it will have been split into multiple files. In order to download them all to a local folder, run: You can either specify a new local-dir (Codestral-22B-v0.1-Q80) or download them all in place (./) A great write up with charts showing various performances is provided by Artefact2 here The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have. If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB Smaller than that total. Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart: But basically, if you're aiming for below Q4, and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in format IQXX, like IQ3M. These are newer and offer better performance for their size. 
These I-quants can also be used on CPU and Apple Metal, but will be slower than their K-quant equivalent, so speed vs performance is a tradeoff you'll have to decide. The I-quants are not compatible with Vulkan, which also targets AMD cards, so if you have an AMD card double-check whether you're using the rocBLAS build or the Vulkan build. At the time of writing this, LM Studio has a preview with ROCm support, and other inference engines have specific builds for ROCm. Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
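If a quant of this 22B doesn't quite fit in your VRAM, you don't have to fall back to CPU-only: llama.cpp can offload just part of the layers to the GPU and keep the rest in system RAM. A rough example (the file name and the number of offloaded layers are placeholders you tune for your card):

```bash
# Offload only some of the layers to the GPU, keep the remainder on the CPU
./llama-cli -m ./Codestral-22B-v0.1-Q4_K_M.gguf -ngl 24 -c 8192 -p "Write a quicksort in Python."
```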

NaNK
2,696
187

SicariusSicariiStuff_X-Ray_Alpha-GGUF

2,696
16

google_gemma-3-4b-it-qat-GGUF

NaNK
2,675
6

granite-3.1-8b-instruct-GGUF

NaNK
license:apache-2.0
2,624
12

Qwen_Qwen3-VL-30B-A3B-Thinking-GGUF

Llamacpp imatrix Quantizations of Qwen3-VL-30B-A3B-Thinking by Qwen Original model: https://huggingface.co/Qwen/Qwen3-VL-30B-A3B-Thinking All quants made using imatrix option with dataset from here combined with a subset of combinedallsmall.parquet from Ed Addario here Run them directly with llama.cpp, or any other llama.cpp based project | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | Qwen3-VL-30B-A3B-Thinking-bf16.gguf | bf16 | 61.10GB | true | Full BF16 weights. | | Qwen3-VL-30B-A3B-Thinking-Q80.gguf | Q80 | 32.48GB | false | Extremely high quality, generally unneeded but max available quant. | | Qwen3-VL-30B-A3B-Thinking-Q6KL.gguf | Q6KL | 25.26GB | false | Uses Q80 for embed and output weights. Very high quality, near perfect, recommended. | | Qwen3-VL-30B-A3B-Thinking-Q6K.gguf | Q6K | 25.10GB | false | Very high quality, near perfect, recommended. | | Qwen3-VL-30B-A3B-Thinking-Q5KL.gguf | Q5KL | 21.94GB | false | Uses Q80 for embed and output weights. High quality, recommended. | | Qwen3-VL-30B-A3B-Thinking-Q5KM.gguf | Q5KM | 21.74GB | false | High quality, recommended. | | Qwen3-VL-30B-A3B-Thinking-Q5KS.gguf | Q5KS | 21.10GB | false | High quality, recommended. | | Qwen3-VL-30B-A3B-Thinking-Q41.gguf | Q41 | 19.21GB | false | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | Qwen3-VL-30B-A3B-Thinking-Q4KL.gguf | Q4KL | 18.86GB | false | Uses Q80 for embed and output weights. Good quality, recommended. | | Qwen3-VL-30B-A3B-Thinking-Q4KM.gguf | Q4KM | 18.63GB | false | Good quality, default size for most use cases, recommended. | | Qwen3-VL-30B-A3B-Thinking-Q4KS.gguf | Q4KS | 17.98GB | false | Slightly lower quality with more space savings, recommended. | | Qwen3-VL-30B-A3B-Thinking-Q40.gguf | Q40 | 17.63GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. | | Qwen3-VL-30B-A3B-Thinking-IQ4NL.gguf | IQ4NL | 17.39GB | false | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | Qwen3-VL-30B-A3B-Thinking-IQ4XS.gguf | IQ4XS | 16.46GB | false | Decent quality, smaller than Q4KS with similar performance, recommended. | | Qwen3-VL-30B-A3B-Thinking-Q3KXL.gguf | Q3KXL | 14.86GB | false | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | Qwen3-VL-30B-A3B-Thinking-Q3KL.gguf | Q3KL | 14.58GB | false | Lower quality but usable, good for low RAM availability. | | Qwen3-VL-30B-A3B-Thinking-Q3KM.gguf | Q3KM | 14.08GB | false | Low quality. | | Qwen3-VL-30B-A3B-Thinking-IQ3M.gguf | IQ3M | 14.08GB | false | Medium-low quality, new method with decent performance comparable to Q3KM. | | Qwen3-VL-30B-A3B-Thinking-Q3KS.gguf | Q3KS | 13.43GB | false | Low quality, not recommended. | | Qwen3-VL-30B-A3B-Thinking-IQ3XS.gguf | IQ3XS | 12.74GB | false | Lower quality, new method with decent performance, slightly better than Q3KS. | | Qwen3-VL-30B-A3B-Thinking-IQ3XXS.gguf | IQ3XXS | 12.22GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. | | Qwen3-VL-30B-A3B-Thinking-Q2KL.gguf | Q2KL | 11.21GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | Qwen3-VL-30B-A3B-Thinking-Q2K.gguf | Q2K | 10.91GB | false | Very low quality but surprisingly usable. | | Qwen3-VL-30B-A3B-Thinking-IQ2M.gguf | IQ2M | 9.87GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. 
| | Qwen3-VL-30B-A3B-Thinking-IQ2S.gguf | IQ2S | 8.74GB | false | Low quality, uses SOTA techniques to be usable. | | Qwen3-VL-30B-A3B-Thinking-IQ2XS.gguf | IQ2XS | 8.66GB | false | Low quality, uses SOTA techniques to be usable. | | Qwen3-VL-30B-A3B-Thinking-IQ2XXS.gguf | IQ2XXS | 7.57GB | false | Very low quality, uses SOTA techniques to be usable. | Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to.

NaNK
2,623
0

google_gemma-3-270m-it-GGUF

2,621
3

nvidia_Nemotron-3-Nano-30B-A3B-GGUF

NaNK
2,608
5

Qwen2.5-Coder-0.5B-GGUF

NaNK
license:apache-2.0
2,579
4

THUDM_GLM-Z1-Rumination-32B-0414-GGUF

NaNK
license:mit
2,578
9

mistralai_Voxtral-Small-24B-2507-GGUF

Llamacpp imatrix Quantizations of Voxtral-Small-24B-2507 by mistralai Original model: https://huggingface.co/mistralai/Voxtral-Small-24B-2507 All quants made using imatrix option with dataset from here Run them directly with llama.cpp, or any other llama.cpp based project | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | Voxtral-Small-24B-2507-bf16.gguf | bf16 | 47.15GB | false | Full BF16 weights. | | Voxtral-Small-24B-2507-Q80.gguf | Q80 | 25.06GB | false | Extremely high quality, generally unneeded but max available quant. | | Voxtral-Small-24B-2507-Q6KL.gguf | Q6KL | 19.67GB | false | Uses Q80 for embed and output weights. Very high quality, near perfect, recommended. | | Voxtral-Small-24B-2507-Q6K.gguf | Q6K | 19.35GB | false | Very high quality, near perfect, recommended. | | Voxtral-Small-24B-2507-Q5KL.gguf | Q5KL | 17.18GB | false | Uses Q80 for embed and output weights. High quality, recommended. | | Voxtral-Small-24B-2507-Q5KM.gguf | Q5KM | 16.76GB | false | High quality, recommended. | | Voxtral-Small-24B-2507-Q5KS.gguf | Q5KS | 16.30GB | false | High quality, recommended. | | Voxtral-Small-24B-2507-Q41.gguf | Q41 | 14.87GB | false | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | Voxtral-Small-24B-2507-Q4KL.gguf | Q4KL | 14.83GB | false | Uses Q80 for embed and output weights. Good quality, recommended. | | Voxtral-Small-24B-2507-Q4KM.gguf | Q4KM | 14.33GB | false | Good quality, default size for most use cases, recommended. | | Voxtral-Small-24B-2507-Q4KS.gguf | Q4KS | 13.55GB | false | Slightly lower quality with more space savings, recommended. | | Voxtral-Small-24B-2507-Q40.gguf | Q40 | 13.49GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. | | Voxtral-Small-24B-2507-IQ4NL.gguf | IQ4NL | 13.47GB | false | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | Voxtral-Small-24B-2507-Q3KXL.gguf | Q3KXL | 12.99GB | false | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | Voxtral-Small-24B-2507-IQ4XS.gguf | IQ4XS | 12.76GB | false | Decent quality, smaller than Q4KS with similar performance, recommended. | | Voxtral-Small-24B-2507-Q3KL.gguf | Q3KL | 12.40GB | false | Lower quality but usable, good for low RAM availability. | | Voxtral-Small-24B-2507-Q3KM.gguf | Q3KM | 11.47GB | false | Low quality. | | Voxtral-Small-24B-2507-IQ3M.gguf | IQ3M | 10.65GB | false | Medium-low quality, new method with decent performance comparable to Q3KM. | | Voxtral-Small-24B-2507-Q3KS.gguf | Q3KS | 10.40GB | false | Low quality, not recommended. | | Voxtral-Small-24B-2507-IQ3XS.gguf | IQ3XS | 9.91GB | false | Lower quality, new method with decent performance, slightly better than Q3KS. | | Voxtral-Small-24B-2507-Q2KL.gguf | Q2KL | 9.55GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | Voxtral-Small-24B-2507-IQ3XXS.gguf | IQ3XXS | 9.28GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. | | Voxtral-Small-24B-2507-Q2K.gguf | Q2K | 8.89GB | false | Very low quality but surprisingly usable. | | Voxtral-Small-24B-2507-IQ2M.gguf | IQ2M | 8.11GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. | | Voxtral-Small-24B-2507-IQ2S.gguf | IQ2S | 7.48GB | false | Low quality, uses SOTA techniques to be usable. 
| | Voxtral-Small-24B-2507-IQ2XS.gguf | IQ2XS | 7.21GB | false | Low quality, uses SOTA techniques to be usable. | Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to.

NaNK
2,570
9

Qwen2.5-Coder-14B-Instruct-GGUF

NaNK
license:apache-2.0
2,553
38

google_gemma-3-12b-it-qat-GGUF

NaNK
2,540
23

TheDrummer_Gemmasutra-Small-4B-v1-GGUF

NaNK
2,523
5

OpenGVLab_InternVL3_5-38B-GGUF

Llamacpp imatrix Quantizations of InternVL35-38B by OpenGVLab Original model: https://huggingface.co/OpenGVLab/InternVL35-38B All quants made using imatrix option with dataset from here combined with a subset of combinedallsmall.parquet from Ed Addario here Run them directly with llama.cpp, or any other llama.cpp based project | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | InternVL35-38B-bf16.gguf | bf16 | 65.53GB | true | Full BF16 weights. | | InternVL35-38B-Q80.gguf | Q80 | 34.82GB | false | Extremely high quality, generally unneeded but max available quant. | | InternVL35-38B-Q6KL.gguf | Q6KL | 27.26GB | false | Uses Q80 for embed and output weights. Very high quality, near perfect, recommended. | | InternVL35-38B-Q6K.gguf | Q6K | 26.88GB | false | Very high quality, near perfect, recommended. | | InternVL35-38B-Q5KL.gguf | Q5KL | 23.69GB | false | Uses Q80 for embed and output weights. High quality, recommended. | | InternVL35-38B-Q5KM.gguf | Q5KM | 23.21GB | false | High quality, recommended. | | InternVL35-38B-Q5KS.gguf | Q5KS | 22.64GB | false | High quality, recommended. | | InternVL35-38B-Q41.gguf | Q41 | 20.64GB | false | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | InternVL35-38B-Q4KL.gguf | Q4KL | 20.34GB | false | Uses Q80 for embed and output weights. Good quality, recommended. | | InternVL35-38B-Q4KM.gguf | Q4KM | 19.76GB | false | Good quality, default size for most use cases, recommended. | | InternVL35-38B-Q4KS.gguf | Q4KS | 18.77GB | false | Slightly lower quality with more space savings, recommended. | | InternVL35-38B-Q40.gguf | Q40 | 18.70GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. | | InternVL35-38B-IQ4NL.gguf | IQ4NL | 18.68GB | false | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | InternVL35-38B-Q3KXL.gguf | Q3KXL | 18.01GB | false | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | InternVL35-38B-IQ4XS.gguf | IQ4XS | 17.69GB | false | Decent quality, smaller than Q4KS with similar performance, recommended. | | InternVL35-38B-Q3KL.gguf | Q3KL | 17.33GB | false | Lower quality but usable, good for low RAM availability. | | InternVL35-38B-Q3KM.gguf | Q3KM | 15.97GB | false | Low quality. | | InternVL35-38B-IQ3M.gguf | IQ3M | 14.93GB | false | Medium-low quality, new method with decent performance comparable to Q3KM. | | InternVL35-38B-Q3KS.gguf | Q3KS | 14.39GB | false | Low quality, not recommended. | | InternVL35-38B-IQ3XS.gguf | IQ3XS | 13.70GB | false | Lower quality, new method with decent performance, slightly better than Q3KS. | | InternVL35-38B-Q2KL.gguf | Q2KL | 13.10GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | InternVL35-38B-IQ3XXS.gguf | IQ3XXS | 12.82GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. | | InternVL35-38B-Q2K.gguf | Q2K | 12.34GB | false | Very low quality but surprisingly usable. | | InternVL35-38B-IQ2M.gguf | IQ2M | 11.36GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. | | InternVL35-38B-IQ2S.gguf | IQ2S | 10.51GB | false | Low quality, uses SOTA techniques to be usable. | | InternVL35-38B-IQ2XS.gguf | IQ2XS | 9.95GB | false | Low quality, uses SOTA techniques to be usable. 
| | InternVL35-38B-IQ2XXS.gguf | IQ2XXS | 9.02GB | false | Very low quality, uses SOTA techniques to be usable. | Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to. First, make sure you have hugginface-cli installed: If the model is bigger than 50GB, it will have been split into multiple files. In order to download them all to a local folder, run: You can either specify a new local-dir (OpenGVLabInternVL35-38B-Q80) or download them all in place (./) Previously, you would download Q4044/48/88, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights. details in this PR. If you use Q40 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282 you will not be able to run the Q40XX files and will instead need to use Q40. Additionally, if you want to get slightly better quality for , you can use IQ4NL thanks to this PR which will also repack the weights for ARM, though only the 44 for now. The loading time may be slower but it will result in an overall speed incrase. I'm keeping this section to show the potential theoretical uplift in performance from using the Q40 with online repacking. Click to view benchmarks on an AVX2 system (EPYC7702) | model | size | params | backend | threads | test | t/s | % (vs Q40) | | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% | Q4088 offers a nice bump to prompt processing and a small bump to text generation A great write up with charts showing various performances is provided by Artefact2 here The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have. 
If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB smaller than that total.

Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart. But basically, if you're aiming for below Q4, and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in format IQXX, like IQ3M. These are newer and offer better performance for their size. The I-quants can also be used on CPU, but will be slower than their K-quant equivalents, so it's a speed-versus-performance tradeoff you'll have to weigh.

Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output. Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
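The card above walks through installing the Hugging Face CLI and downloading a single quant, but the actual commands did not survive extraction. A minimal sketch of what that workflow usually looks like, assuming the repo is bartowski/OpenGVLab_InternVL3_5-38B-GGUF and that the Q4KM file follows the naming pattern shown in the table (check the repo's file list for the exact names):

```bash
# Install the Hugging Face CLI (ships with huggingface_hub)
pip install -U "huggingface_hub[cli]"

# Download just the Q4_K_M quant into the current directory;
# the repo id and file pattern are illustrative, verify them on the model page.
huggingface-cli download bartowski/OpenGVLab_InternVL3_5-38B-GGUF \
  --include "*Q4_K_M*.gguf" --local-dir ./
```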

NaNK
2,522
0

Qwen2.5-3B-Instruct-GGUF

NaNK
2,519
16

DeepSeek-R1-Distill-Llama-70B-GGUF

NaNK
base_model:deepseek-ai/DeepSeek-R1-Distill-Llama-70B
2,487
35

L3.3-MS-Nevoria-70b-GGUF

NaNK
2,486
14

Dolphin3.0-Llama3.2-1B-GGUF

NaNK
base_model:dphn/Dolphin3.0-Llama3.2-1B
2,476
7

Cat-Llama-3-70B-instruct-GGUF

NaNK
license:llama3
2,470
10

Dolphin3.0-Llama3.2-3B-GGUF

NaNK
base_model:dphn/Dolphin3.0-Llama3.2-3B
2,469
25

Qwen_Qwen3-VL-32B-Thinking-GGUF

Llamacpp imatrix Quantizations of Qwen3-VL-32B-Thinking by Qwen Original model: https://huggingface.co/Qwen/Qwen3-VL-32B-Thinking All quants made using imatrix option with dataset from here combined with a subset of combinedallsmall.parquet from Ed Addario here Run them directly with llama.cpp, or any other llama.cpp based project | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | Qwen3-VL-32B-Thinking-bf16.gguf | bf16 | 65.53GB | true | Full BF16 weights. | | Qwen3-VL-32B-Thinking-Q80.gguf | Q80 | 34.82GB | false | Extremely high quality, generally unneeded but max available quant. | | Qwen3-VL-32B-Thinking-Q6KL.gguf | Q6KL | 27.26GB | false | Uses Q80 for embed and output weights. Very high quality, near perfect, recommended. | | Qwen3-VL-32B-Thinking-Q6K.gguf | Q6K | 26.88GB | false | Very high quality, near perfect, recommended. | | Qwen3-VL-32B-Thinking-Q5KL.gguf | Q5KL | 23.69GB | false | Uses Q80 for embed and output weights. High quality, recommended. | | Qwen3-VL-32B-Thinking-Q5KM.gguf | Q5KM | 23.21GB | false | High quality, recommended. | | Qwen3-VL-32B-Thinking-Q5KS.gguf | Q5KS | 22.64GB | false | High quality, recommended. | | Qwen3-VL-32B-Thinking-Q41.gguf | Q41 | 20.64GB | false | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | Qwen3-VL-32B-Thinking-Q4KL.gguf | Q4KL | 20.34GB | false | Uses Q80 for embed and output weights. Good quality, recommended. | | Qwen3-VL-32B-Thinking-Q4KM.gguf | Q4KM | 19.76GB | false | Good quality, default size for most use cases, recommended. | | Qwen3-VL-32B-Thinking-Q4KS.gguf | Q4KS | 18.77GB | false | Slightly lower quality with more space savings, recommended. | | Qwen3-VL-32B-Thinking-Q40.gguf | Q40 | 18.70GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. | | Qwen3-VL-32B-Thinking-IQ4NL.gguf | IQ4NL | 18.68GB | false | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | Qwen3-VL-32B-Thinking-Q3KXL.gguf | Q3KXL | 18.01GB | false | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | Qwen3-VL-32B-Thinking-IQ4XS.gguf | IQ4XS | 17.69GB | false | Decent quality, smaller than Q4KS with similar performance, recommended. | | Qwen3-VL-32B-Thinking-Q3KL.gguf | Q3KL | 17.33GB | false | Lower quality but usable, good for low RAM availability. | | Qwen3-VL-32B-Thinking-Q3KM.gguf | Q3KM | 15.97GB | false | Low quality. | | Qwen3-VL-32B-Thinking-IQ3M.gguf | IQ3M | 14.93GB | false | Medium-low quality, new method with decent performance comparable to Q3KM. | | Qwen3-VL-32B-Thinking-Q3KS.gguf | Q3KS | 14.39GB | false | Low quality, not recommended. | | Qwen3-VL-32B-Thinking-IQ3XS.gguf | IQ3XS | 13.70GB | false | Lower quality, new method with decent performance, slightly better than Q3KS. | | Qwen3-VL-32B-Thinking-Q2KL.gguf | Q2KL | 13.10GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | Qwen3-VL-32B-Thinking-IQ3XXS.gguf | IQ3XXS | 12.82GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. | | Qwen3-VL-32B-Thinking-Q2K.gguf | Q2K | 12.34GB | false | Very low quality but surprisingly usable. | | Qwen3-VL-32B-Thinking-IQ2M.gguf | IQ2M | 11.36GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. | | Qwen3-VL-32B-Thinking-IQ2S.gguf | IQ2S | 10.51GB | false | Low quality, uses SOTA techniques to be usable. 
| | Qwen3-VL-32B-Thinking-IQ2XS.gguf | IQ2XS | 9.95GB | false | Low quality, uses SOTA techniques to be usable. | | Qwen3-VL-32B-Thinking-IQ2XXS.gguf | IQ2XXS | 9.02GB | false | Very low quality, uses SOTA techniques to be usable. | Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to. First, make sure you have hugginface-cli installed: If the model is bigger than 50GB, it will have been split into multiple files. In order to download them all to a local folder, run: You can either specify a new local-dir (QwenQwen3-VL-32B-Thinking-Q80) or download them all in place (./) Previously, you would download Q4044/48/88, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights. details in this PR. If you use Q40 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282 you will not be able to run the Q40XX files and will instead need to use Q40. Additionally, if you want to get slightly better quality for , you can use IQ4NL thanks to this PR which will also repack the weights for ARM, though only the 44 for now. The loading time may be slower but it will result in an overall speed incrase. I'm keeping this section to show the potential theoretical uplift in performance from using the Q40 with online repacking. Click to view benchmarks on an AVX2 system (EPYC7702) | model | size | params | backend | threads | test | t/s | % (vs Q40) | | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% | Q4088 offers a nice bump to prompt processing and a small bump to text generation A great write up with charts showing various performances is provided by Artefact2 here The first thing to figure out is how big a model you can run. 
To do this, you'll need to figure out how much RAM and/or VRAM you have. If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB smaller than that total.

Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart. But basically, if you're aiming for below Q4, and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in format IQXX, like IQ3M. These are newer and offer better performance for their size. The I-quants can also be used on CPU, but will be slower than their K-quant equivalents, so it's a speed-versus-performance tradeoff you'll have to weigh.

Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output. Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
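Since the bf16 weights in the table above exceed 50GB and are marked as split, downloading them means grabbing every shard. A hedged sketch, assuming the repo is bartowski/Qwen_Qwen3-VL-32B-Thinking-GGUF and that the shards share a "bf16" prefix (exact names should be confirmed on the repo):

```bash
# Pull every shard of the split bf16 quant into its own folder
huggingface-cli download bartowski/Qwen_Qwen3-VL-32B-Thinking-GGUF \
  --include "*bf16*" --local-dir Qwen3-VL-32B-Thinking-bf16
```

Pointing llama.cpp at the first shard is enough; it picks up the remaining parts automatically.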

NaNK
2,469
1

EVA-LLaMA-3.33-70B-v0.0-GGUF

NaNK
base_model:EVA-UNIT-01/EVA-LLaMA-3.33-70B-v0.0
2,465
7

Qwen_Qwen3-VL-8B-Thinking-GGUF

Llamacpp imatrix Quantizations of Qwen3-VL-8B-Thinking by Qwen Original model: https://huggingface.co/Qwen/Qwen3-VL-8B-Thinking All quants made using imatrix option with dataset from here combined with a subset of combinedallsmall.parquet from Ed Addario here Run them directly with llama.cpp, or any other llama.cpp based project | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | Qwen3-VL-8B-Thinking-bf16.gguf | bf16 | 16.39GB | false | Full BF16 weights. | | Qwen3-VL-8B-Thinking-Q80.gguf | Q80 | 8.71GB | false | Extremely high quality, generally unneeded but max available quant. | | Qwen3-VL-8B-Thinking-Q6KL.gguf | Q6KL | 7.03GB | false | Uses Q80 for embed and output weights. Very high quality, near perfect, recommended. | | Qwen3-VL-8B-Thinking-Q6K.gguf | Q6K | 6.73GB | false | Very high quality, near perfect, recommended. | | Qwen3-VL-8B-Thinking-Q5KL.gguf | Q5KL | 6.24GB | false | Uses Q80 for embed and output weights. High quality, recommended. | | Qwen3-VL-8B-Thinking-Q5KM.gguf | Q5KM | 5.85GB | false | High quality, recommended. | | Qwen3-VL-8B-Thinking-Q5KS.gguf | Q5KS | 5.72GB | false | High quality, recommended. | | Qwen3-VL-8B-Thinking-Q4KL.gguf | Q4KL | 5.49GB | false | Uses Q80 for embed and output weights. Good quality, recommended. | | Qwen3-VL-8B-Thinking-Q41.gguf | Q41 | 5.25GB | false | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | Qwen3-VL-8B-Thinking-Q4KM.gguf | Q4KM | 5.03GB | false | Good quality, default size for most use cases, recommended. | | Qwen3-VL-8B-Thinking-Q3KXL.gguf | Q3KXL | 4.98GB | false | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | Qwen3-VL-8B-Thinking-Q4KS.gguf | Q4KS | 4.80GB | false | Slightly lower quality with more space savings, recommended. | | Qwen3-VL-8B-Thinking-Q40.gguf | Q40 | 4.79GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. | | Qwen3-VL-8B-Thinking-IQ4NL.gguf | IQ4NL | 4.79GB | false | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | Qwen3-VL-8B-Thinking-IQ4XS.gguf | IQ4XS | 4.56GB | false | Decent quality, smaller than Q4KS with similar performance, recommended. | | Qwen3-VL-8B-Thinking-Q3KL.gguf | Q3KL | 4.43GB | false | Lower quality but usable, good for low RAM availability. | | Qwen3-VL-8B-Thinking-Q3KM.gguf | Q3KM | 4.12GB | false | Low quality. | | Qwen3-VL-8B-Thinking-IQ3M.gguf | IQ3M | 3.90GB | false | Medium-low quality, new method with decent performance comparable to Q3KM. | | Qwen3-VL-8B-Thinking-Q2KL.gguf | Q2KL | 3.89GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | Qwen3-VL-8B-Thinking-Q3KS.gguf | Q3KS | 3.77GB | false | Low quality, not recommended. | | Qwen3-VL-8B-Thinking-IQ3XS.gguf | IQ3XS | 3.63GB | false | Lower quality, new method with decent performance, slightly better than Q3KS. | | Qwen3-VL-8B-Thinking-IQ3XXS.gguf | IQ3XXS | 3.37GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. | | Qwen3-VL-8B-Thinking-Q2K.gguf | Q2K | 3.28GB | false | Very low quality but surprisingly usable. | | Qwen3-VL-8B-Thinking-IQ2M.gguf | IQ2M | 3.05GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. 
| Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to. First, make sure you have hugginface-cli installed: If the model is bigger than 50GB, it will have been split into multiple files. In order to download them all to a local folder, run: You can either specify a new local-dir (QwenQwen3-VL-8B-Thinking-Q80) or download them all in place (./) Previously, you would download Q4044/48/88, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights. details in this PR. If you use Q40 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282 you will not be able to run the Q40XX files and will instead need to use Q40. Additionally, if you want to get slightly better quality for , you can use IQ4NL thanks to this PR which will also repack the weights for ARM, though only the 44 for now. The loading time may be slower but it will result in an overall speed incrase. I'm keeping this section to show the potential theoretical uplift in performance from using the Q40 with online repacking. Click to view benchmarks on an AVX2 system (EPYC7702) | model | size | params | backend | threads | test | t/s | % (vs Q40) | | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% | Q4088 offers a nice bump to prompt processing and a small bump to text generation A great write up with charts showing various performances is provided by Artefact2 here The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have. If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. 
If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB smaller than that total.

Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart. But basically, if you're aiming for below Q4, and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in format IQXX, like IQ3M. These are newer and offer better performance for their size. The I-quants can also be used on CPU, but will be slower than their K-quant equivalents, so it's a speed-versus-performance tradeoff you'll have to weigh.

Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output. Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
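The card notes these quants run directly with llama.cpp. As a hedged illustration only (the file name and flags below are assumptions, not taken from the card), a basic text-generation run looks like this:

```bash
# One-shot generation; -ngl 99 offloads all layers to the GPU when one is present
llama-cli -m Qwen3-VL-8B-Thinking-Q4_K_M.gguf \
  -ngl 99 -c 4096 \
  -p "Summarize what an imatrix quantization is in one sentence."
```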

NaNK
2,452
2

Qwen_Qwen3-14B-GGUF

NaNK
2,447
23

Gryphe_Codex-24B-Small-3.2-GGUF

NaNK
license:apache-2.0
2,436
7

Qwen_Qwen3-4B-Thinking-2507-GGUF

NaNK
2,432
14

TheDrummer_Fallen-Gemma3-27B-v1-GGUF

NaNK
2,420
19

Kwaipilot_KAT-Dev-GGUF

2,401
2

MiniCPM-V-2_6-GGUF

NaNK
2,388
24

MathCoder2-CodeLlama-7B-GGUF

NaNK
base_model:MathGenie/MathCoder2-CodeLlama-7B
2,366
4

baidu_ERNIE-4.5-21B-A3B-PT-GGUF

NaNK
2,355
5

huihui-ai_QwQ-32B-abliterated-GGUF

NaNK
license:apache-2.0
2,351
44

TheDrummer_Anubis-70B-v1.1-GGUF

NaNK
2,349
4

cognitivecomputations_Dolphin3.0-Mistral-24B-GGUF

NaNK
2,348
17

SmolLM2-135M-Instruct-GGUF

license:apache-2.0
2,344
8

Llama-3_1-Nemotron-51B-Instruct-GGUF

NaNK
llama-3
2,341
5

Llama-3.3-70B-Instruct-ablated-GGUF

NaNK
llama
2,335
16

OLMo-2-1124-7B-Instruct-GGUF

NaNK
license:apache-2.0
2,333
11

BlackSheep-RP-12B-GGUF

NaNK
2,323
11

huihui-ai_DeepSeek-R1-Distill-Llama-70B-abliterated-GGUF

NaNK
base_model:huihui-ai/DeepSeek-R1-Distill-Llama-70B-abliterated
2,307
28

UI-TARS-7B-DPO-GGUF

NaNK
license:apache-2.0
2,299
9

TheDrummer_Cydonia-R1-24B-v4.1-GGUF

NaNK
2,285
4

ibm-granite_granite-4.0-h-micro-GGUF

Llamacpp imatrix Quantizations of granite-4.0-h-micro by ibm-granite Original model: https://huggingface.co/ibm-granite/granite-4.0-h-micro All quants made using imatrix option with dataset from here combined with a subset of combinedallsmall.parquet from Ed Addario here Run them directly with llama.cpp, or any other llama.cpp based project | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | granite-4.0-h-micro-bf16.gguf | bf16 | 6.39GB | false | Full BF16 weights. | | granite-4.0-h-micro-Q80.gguf | Q80 | 3.40GB | false | Extremely high quality, generally unneeded but max available quant. | | granite-4.0-h-micro-Q6KL.gguf | Q6KL | 2.67GB | false | Uses Q80 for embed and output weights. Very high quality, near perfect, recommended. | | granite-4.0-h-micro-Q6K.gguf | Q6K | 2.63GB | false | Very high quality, near perfect, recommended. | | granite-4.0-h-micro-Q5KL.gguf | Q5KL | 2.32GB | false | Uses Q80 for embed and output weights. High quality, recommended. | | granite-4.0-h-micro-Q5KM.gguf | Q5KM | 2.27GB | false | High quality, recommended. | | granite-4.0-h-micro-Q5KS.gguf | Q5KS | 2.23GB | false | High quality, recommended. | | granite-4.0-h-micro-Q41.gguf | Q41 | 2.04GB | false | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | granite-4.0-h-micro-Q4KL.gguf | Q4KL | 1.99GB | false | Uses Q80 for embed and output weights. Good quality, recommended. | | granite-4.0-h-micro-Q4KM.gguf | Q4KM | 1.94GB | false | Good quality, default size for most use cases, recommended. | | granite-4.0-h-micro-Q4KS.gguf | Q4KS | 1.87GB | false | Slightly lower quality with more space savings, recommended. | | granite-4.0-h-micro-Q40.gguf | Q40 | 1.86GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. | | granite-4.0-h-micro-IQ4NL.gguf | IQ4NL | 1.86GB | false | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | granite-4.0-h-micro-IQ4XS.gguf | IQ4XS | 1.76GB | false | Decent quality, smaller than Q4KS with similar performance, recommended. | | granite-4.0-h-micro-Q3KXL.gguf | Q3KXL | 1.69GB | false | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | granite-4.0-h-micro-Q3KL.gguf | Q3KL | 1.64GB | false | Lower quality but usable, good for low RAM availability. | | granite-4.0-h-micro-Q3KM.gguf | Q3KM | 1.56GB | false | Low quality. | | granite-4.0-h-micro-IQ3M.gguf | IQ3M | 1.47GB | false | Medium-low quality, new method with decent performance comparable to Q3KM. | | granite-4.0-h-micro-Q3KS.gguf | Q3KS | 1.46GB | false | Low quality, not recommended. | | granite-4.0-h-micro-IQ3XS.gguf | IQ3XS | 1.41GB | false | Lower quality, new method with decent performance, slightly better than Q3KS. | | granite-4.0-h-micro-IQ3XXS.gguf | IQ3XXS | 1.29GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. | | granite-4.0-h-micro-Q2KL.gguf | Q2KL | 1.28GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | granite-4.0-h-micro-Q2K.gguf | Q2K | 1.23GB | false | Very low quality but surprisingly usable. | | granite-4.0-h-micro-IQ2M.gguf | IQ2M | 1.12GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. | Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to. 
First, make sure you have hugginface-cli installed: If the model is bigger than 50GB, it will have been split into multiple files. In order to download them all to a local folder, run: You can either specify a new local-dir (ibm-granitegranite-4.0-h-micro-Q80) or download them all in place (./) Previously, you would download Q4044/48/88, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights. details in this PR. If you use Q40 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282 you will not be able to run the Q40XX files and will instead need to use Q40. Additionally, if you want to get slightly better quality for , you can use IQ4NL thanks to this PR which will also repack the weights for ARM, though only the 44 for now. The loading time may be slower but it will result in an overall speed incrase. I'm keeping this section to show the potential theoretical uplift in performance from using the Q40 with online repacking. Click to view benchmarks on an AVX2 system (EPYC7702) | model | size | params | backend | threads | test | t/s | % (vs Q40) | | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% | Q4088 offers a nice bump to prompt processing and a small bump to text generation A great write up with charts showing various performances is provided by Artefact2 here The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have. If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB Smaller than that total. 
Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart: But basically, if you're aiming for below Q4, and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in format IQXX, like IQ3M. These are newer and offer better performance for their size. These I-quants can also be used on CPU, but will be slower than their K-quant equivalent, so speed vs performance is a tradeoff you'll have to decide. Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output. Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
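The card above contrasts downloading into a fresh local-dir with downloading "in place (./)"; the commands themselves were stripped. A sketch of the in-place variant, assuming the repo is bartowski/ibm-granite_granite-4.0-h-micro-GGUF (the file pattern is illustrative):

```bash
# Download a single quant straight into the working directory ("in place")
huggingface-cli download bartowski/ibm-granite_granite-4.0-h-micro-GGUF \
  --include "*Q4_K_L*.gguf" --local-dir ./
```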

2,282
2

LiquidAI_LFM2-350M-Math-GGUF

2,263
0

meta-llama_Llama-4-Scout-17B-16E-Instruct-old-GGUF

NaNK
llama
2,254
31

aya-23-8B-GGUF

NaNK
license:cc-by-nc-4.0
2,251
48

HuatuoGPT-o1-72B-v0.1-GGUF

NaNK
license:apache-2.0
2,242
1

Phi-3-mini-4k-instruct-GGUF

license:mit
2,239
11

Llama-3.1-8B-Lexi-Uncensored-V2-GGUF

NaNK
base_model:Orenguteng/Llama-3.1-8B-Lexi-Uncensored-V2
2,223
16

ibm-granite_granite-vision-3.2-2b-GGUF

NaNK
license:apache-2.0
2,214
8

magnum-12b-v2-GGUF

NaNK
license:apache-2.0
2,190
16

nvidia_Llama-3_3-Nemotron-Super-49B-v1-GGUF

NaNK
llama-3
2,184
49

soob3123_amoral-gemma3-4B-GGUF

NaNK
license:apache-2.0
2,182
4

reader-lm-1.5b-GGUF

NaNK
license:cc-by-nc-4.0
2,181
15

Qwen2.5-Coder-14B-GGUF

NaNK
license:apache-2.0
2,163
11

agentica-org_DeepCoder-14B-Preview-GGUF

NaNK
license:mit
2,159
62

nbeerbower_Qwen3-Gutenberg-Encore-14B-GGUF

NaNK
license:apache-2.0
2,147
3

Qwen_Qwen3-8B-GGUF

NaNK
license:apache-2.0
2,139
22

Human-Like-LLama3-8B-Instruct-GGUF

NaNK
base_model:HumanLLMs/Human-Like-LLama3-8B-Instruct
2,125
2

nvidia_Llama-3_3-Nemotron-Super-49B-v1_5-GGUF

Llamacpp imatrix Quantizations of Llama-33-Nemotron-Super-49B-v15 by nvidia Original model: https://huggingface.co/nvidia/Llama-33-Nemotron-Super-49B-v15 All quants made using imatrix option with dataset from here Run them directly with llama.cpp, or any other llama.cpp based project | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | Llama-33-Nemotron-Super-49B-v15-bf16.gguf | bf16 | 99.74GB | true | Full BF16 weights. | | Llama-33-Nemotron-Super-49B-v15-Q80.gguf | Q80 | 52.99GB | true | Extremely high quality, generally unneeded but max available quant. | | Llama-33-Nemotron-Super-49B-v15-Q6KL.gguf | Q6KL | 41.43GB | false | Uses Q80 for embed and output weights. Very high quality, near perfect, recommended. | | Llama-33-Nemotron-Super-49B-v15-Q6K.gguf | Q6K | 40.92GB | false | Very high quality, near perfect, recommended. | | Llama-33-Nemotron-Super-49B-v15-Q5KL.gguf | Q5KL | 36.04GB | false | Uses Q80 for embed and output weights. High quality, recommended. | | Llama-33-Nemotron-Super-49B-v15-Q5KM.gguf | Q5KM | 35.39GB | false | High quality, recommended. | | Llama-33-Nemotron-Super-49B-v15-Q5KS.gguf | Q5KS | 34.43GB | false | High quality, recommended. | | Llama-33-Nemotron-Super-49B-v15-Q41.gguf | Q41 | 31.38GB | false | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | Llama-33-Nemotron-Super-49B-v15-Q4KL.gguf | Q4KL | 31.00GB | false | Uses Q80 for embed and output weights. Good quality, recommended. | | Llama-33-Nemotron-Super-49B-v15-Q4KM.gguf | Q4KM | 30.22GB | false | Good quality, default size for most use cases, recommended. | | Llama-33-Nemotron-Super-49B-v15-Q4KS.gguf | Q4KS | 28.63GB | false | Slightly lower quality with more space savings, recommended. | | Llama-33-Nemotron-Super-49B-v15-Q40.gguf | Q40 | 28.46GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. | | Llama-33-Nemotron-Super-49B-v15-IQ4NL.gguf | IQ4NL | 28.38GB | false | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | Llama-33-Nemotron-Super-49B-v15-Q3KXL.gguf | Q3KXL | 27.19GB | false | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | Llama-33-Nemotron-Super-49B-v15-IQ4XS.gguf | IQ4XS | 26.87GB | false | Decent quality, smaller than Q4KS with similar performance, recommended. | | Llama-33-Nemotron-Super-49B-v15-Q3KL.gguf | Q3KL | 26.27GB | false | Lower quality but usable, good for low RAM availability. | | Llama-33-Nemotron-Super-49B-v15-Q3KM.gguf | Q3KM | 24.31GB | false | Low quality. | | Llama-33-Nemotron-Super-49B-v15-IQ3M.gguf | IQ3M | 22.66GB | false | Medium-low quality, new method with decent performance comparable to Q3KM. | | Llama-33-Nemotron-Super-49B-v15-Q3KS.gguf | Q3KS | 21.96GB | false | Low quality, not recommended. | | Llama-33-Nemotron-Super-49B-v15-IQ3XS.gguf | IQ3XS | 20.91GB | false | Lower quality, new method with decent performance, slightly better than Q3KS. | | Llama-33-Nemotron-Super-49B-v15-Q2KL.gguf | Q2KL | 19.77GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | Llama-33-Nemotron-Super-49B-v15-IQ3XXS.gguf | IQ3XXS | 19.52GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. | | Llama-33-Nemotron-Super-49B-v15-Q2K.gguf | Q2K | 18.74GB | false | Very low quality but surprisingly usable. 
| | Llama-33-Nemotron-Super-49B-v15-IQ2M.gguf | IQ2M | 17.16GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. | | Llama-33-Nemotron-Super-49B-v15-IQ2S.gguf | IQ2S | 15.85GB | false | Low quality, uses SOTA techniques to be usable. | | Llama-33-Nemotron-Super-49B-v15-IQ2XS.gguf | IQ2XS | 15.08GB | false | Low quality, uses SOTA techniques to be usable. | | Llama-33-Nemotron-Super-49B-v15-IQ2XXS.gguf | IQ2XXS | 13.66GB | false | Very low quality, uses SOTA techniques to be usable. | Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to. First, make sure you have huggingface-cli installed: If the model is bigger than 50GB, it will have been split into multiple files. In order to download them all to a local folder, run: You can either specify a new local-dir (nvidiaLlama-33-Nemotron-Super-49B-v15-Q80) or download them all in place (./). Previously, you would download Q4044/48/88, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights; details are in this PR. If you use Q40 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282 you will not be able to run the Q40XX files and will instead need to use Q40. Additionally, if you want to get slightly better quality for ARM devices, you can use IQ4NL thanks to this PR, which will also repack the weights for ARM, though only the 44 for now. The loading time may be slower but it will result in an overall speed increase. I'm keeping this section to show the potential theoretical uplift in performance from using the Q40 with online repacking.
Click to view benchmarks on an AVX2 system (EPYC7702) | model | size | params | backend | threads | test | t/s | % (vs Q40) | | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% | Q4088 offers a nice bump to prompt processing and a small bump to text generation A great write up with charts showing various performances is provided by Artefact2 here The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have. If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB Smaller than that total. Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart: But basically, if you're aiming for below Q4, and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in format IQXX, like IQ3M. These are newer and offer better performance for their size. These I-quants can also be used on CPU, but will be slower than their K-quant equivalent, so speed vs performance is a tradeoff you'll have to decide. Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output. Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
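For this model the Q80 quant crosses the 50GB split threshold, and the parenthetical "(nvidiaLlama-33-Nemotron-Super-49B-v15-Q80)" above is the suggested local-dir name with its formatting stripped. A hedged reconstruction of that download, assuming the repo is bartowski/nvidia_Llama-3_3-Nemotron-Super-49B-v1_5-GGUF (confirm the shard names on the model page):

```bash
# Fetch all shards of the split Q8_0 quant into a dedicated folder
huggingface-cli download bartowski/nvidia_Llama-3_3-Nemotron-Super-49B-v1_5-GGUF \
  --include "*Q8_0*" --local-dir nvidia_Llama-3_3-Nemotron-Super-49B-v1_5-Q8_0
```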

NaNK
base_model:nvidia/Llama-3_3-Nemotron-Super-49B-v1_5
2,115
22

zerofata_MS3.2-PaintedFantasy-Visage-v3-34B-GGUF

Llamacpp imatrix Quantizations of MS3.2-PaintedFantasy-Visage-v3-34B by zerofata Original model: https://huggingface.co/zerofata/MS3.2-PaintedFantasy-Visage-v3-34B All quants made using imatrix option with dataset from here combined with a subset of combinedallsmall.parquet from Ed Addario here Run them directly with llama.cpp, or any other llama.cpp based project | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | MS3.2-PaintedFantasy-Visage-v3-34B-bf16.gguf | bf16 | 68.27GB | true | Full BF16 weights. | | MS3.2-PaintedFantasy-Visage-v3-34B-Q80.gguf | Q80 | 36.27GB | false | Extremely high quality, generally unneeded but max available quant. | | MS3.2-PaintedFantasy-Visage-v3-34B-Q6KL.gguf | Q6KL | 28.33GB | false | Uses Q80 for embed and output weights. Very high quality, near perfect, recommended. | | MS3.2-PaintedFantasy-Visage-v3-34B-Q6K.gguf | Q6K | 28.01GB | false | Very high quality, near perfect, recommended. | | MS3.2-PaintedFantasy-Visage-v3-34B-Q5KL.gguf | Q5KL | 24.65GB | false | Uses Q80 for embed and output weights. High quality, recommended. | | MS3.2-PaintedFantasy-Visage-v3-34B-Q5KM.gguf | Q5KM | 24.23GB | false | High quality, recommended. | | MS3.2-PaintedFantasy-Visage-v3-34B-Q5KS.gguf | Q5KS | 23.56GB | false | High quality, recommended. | | MS3.2-PaintedFantasy-Visage-v3-34B-Q41.gguf | Q41 | 21.47GB | false | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | MS3.2-PaintedFantasy-Visage-v3-34B-Q4KL.gguf | Q4KL | 21.17GB | false | Uses Q80 for embed and output weights. Good quality, recommended. | | MS3.2-PaintedFantasy-Visage-v3-34B-Q4KM.gguf | Q4KM | 20.68GB | false | Good quality, default size for most use cases, recommended. | | MS3.2-PaintedFantasy-Visage-v3-34B-Q4KS.gguf | Q4KS | 19.53GB | false | Slightly lower quality with more space savings, recommended. | | MS3.2-PaintedFantasy-Visage-v3-34B-Q40.gguf | Q40 | 19.46GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. | | MS3.2-PaintedFantasy-Visage-v3-34B-IQ4NL.gguf | IQ4NL | 19.42GB | false | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | MS3.2-PaintedFantasy-Visage-v3-34B-Q3KXL.gguf | Q3KXL | 18.48GB | false | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | MS3.2-PaintedFantasy-Visage-v3-34B-IQ4XS.gguf | IQ4XS | 18.38GB | false | Decent quality, smaller than Q4KS with similar performance, recommended. | | MS3.2-PaintedFantasy-Visage-v3-34B-Q3KL.gguf | Q3KL | 17.89GB | false | Lower quality but usable, good for low RAM availability. | | MS3.2-PaintedFantasy-Visage-v3-34B-Q3KM.gguf | Q3KM | 16.52GB | false | Low quality. | | MS3.2-PaintedFantasy-Visage-v3-34B-IQ3M.gguf | IQ3M | 15.30GB | false | Medium-low quality, new method with decent performance comparable to Q3KM. | | MS3.2-PaintedFantasy-Visage-v3-34B-Q3KS.gguf | Q3KS | 14.94GB | false | Low quality, not recommended. | | MS3.2-PaintedFantasy-Visage-v3-34B-IQ3XS.gguf | IQ3XS | 14.21GB | false | Lower quality, new method with decent performance, slightly better than Q3KS. | | MS3.2-PaintedFantasy-Visage-v3-34B-Q2KL.gguf | Q2KL | 13.40GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | MS3.2-PaintedFantasy-Visage-v3-34B-IQ3XXS.gguf | IQ3XXS | 13.33GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. 
| | MS3.2-PaintedFantasy-Visage-v3-34B-Q2K.gguf | Q2K | 12.74GB | false | Very low quality but surprisingly usable. | | MS3.2-PaintedFantasy-Visage-v3-34B-IQ2M.gguf | IQ2M | 11.60GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. | | MS3.2-PaintedFantasy-Visage-v3-34B-IQ2S.gguf | IQ2S | 10.66GB | false | Low quality, uses SOTA techniques to be usable. | | MS3.2-PaintedFantasy-Visage-v3-34B-IQ2XS.gguf | IQ2XS | 10.30GB | false | Low quality, uses SOTA techniques to be usable. | | MS3.2-PaintedFantasy-Visage-v3-34B-IQ2XXS.gguf | IQ2XXS | 9.32GB | false | Very low quality, uses SOTA techniques to be usable. | Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to. First, make sure you have huggingface-cli installed: If the model is bigger than 50GB, it will have been split into multiple files. In order to download them all to a local folder, run: You can either specify a new local-dir (zerofataMS3.2-PaintedFantasy-Visage-v3-34B-Q80) or download them all in place (./). Previously, you would download Q4044/48/88, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights; details are in this PR. If you use Q40 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282 you will not be able to run the Q40XX files and will instead need to use Q40. Additionally, if you want to get slightly better quality for ARM devices, you can use IQ4NL thanks to this PR, which will also repack the weights for ARM, though only the 44 for now. The loading time may be slower but it will result in an overall speed increase. I'm keeping this section to show the potential theoretical uplift in performance from using the Q40 with online repacking.
Click to view benchmarks on an AVX2 system (EPYC7702) | model | size | params | backend | threads | test | t/s | % (vs Q40) | | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% | Q4088 offers a nice bump to prompt processing and a small bump to text generation A great write up with charts showing various performances is provided by Artefact2 here The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have. If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB Smaller than that total. Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart: But basically, if you're aiming for below Q4, and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in format IQXX, like IQ3M. These are newer and offer better performance for their size. These I-quants can also be used on CPU, but will be slower than their K-quant equivalent, so speed vs performance is a tradeoff you'll have to decide. Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output. Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
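To make the sizing advice concrete: on a 24GB GPU, "a file size 1-2GB smaller than your GPU's total VRAM" points at the roughly 20-21GB quants in this table (Q4KM or Q4KL), leaving headroom for context. A hedged sketch of fetching and running one of them, assuming the repo is bartowski/zerofata_MS3.2-PaintedFantasy-Visage-v3-34B-GGUF (file names are illustrative):

```bash
# 24GB VRAM minus 1-2GB of headroom -> aim for a file around 21-22GB or smaller
huggingface-cli download bartowski/zerofata_MS3.2-PaintedFantasy-Visage-v3-34B-GGUF \
  --include "*Q4_K_L*.gguf" --local-dir ./

# Interactive chat, fully offloaded to the GPU
llama-cli -m ./MS3.2-PaintedFantasy-Visage-v3-34B-Q4_K_L.gguf -ngl 99 -cnv
```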

NaNK
2,112
5

swiss-ai_Apertus-70B-Instruct-2509-GGUF

NaNK
2,104
4

inclusionAI_Ling-mini-2.0-GGUF

NaNK
2,102
1

internlm_OREAL-DeepSeek-R1-Distill-Qwen-7B-GGUF

NaNK
license:apache-2.0
2,101
0

OpenGVLab_InternVL3_5-14B-GGUF

NaNK
2,098
7

Qwen_Qwen3-4B-GGUF

NaNK
2,092
17

inclusionAI_Ring-mini-2.0-GGUF

NaNK
2,086
2

Delta-Vector_Austral-32B-GLM4-Winton-GGUF

NaNK
2,050
3

nomic-ai_nomic-embed-code-GGUF

NaNK
license:apache-2.0
2,046
3

Qwen2.5-Coder-1.5B-Instruct-GGUF

NaNK
license:apache-2.0
2,039
9

agentica-org_DeepSWE-Preview-GGUF

license:mit
2,034
9

Qwen_Qwen2.5-VL-7B-Instruct-GGUF

NaNK
license:apache-2.0
2,033
6

Qwen2.5-14B_Uncencored-GGUF

NaNK
license:apache-2.0
2,026
15

nvidia_Llama-3.1-Nemotron-Nano-8B-v1-GGUF

NaNK
llama-3
2,025
10

AllThingsIntel_Apollo-V0.1-4B-Thinking-GGUF

NaNK
2,013
2

EVA-Qwen2.5-32B-v0.2-GGUF

NaNK
license:apache-2.0
2,012
10

v2ray_GPT4chan-24B-GGUF

NaNK
license:mit
2,011
0

ibm-granite_granite-4.0-h-tiny-GGUF

Llamacpp imatrix Quantizations of granite-4.0-h-tiny by ibm-granite Original model: https://huggingface.co/ibm-granite/granite-4.0-h-tiny All quants made using imatrix option with dataset from here combined with a subset of combinedallsmall.parquet from Ed Addario here Run them directly with llama.cpp, or any other llama.cpp based project | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | granite-4.0-h-tiny-bf16.gguf | bf16 | 13.89GB | false | Full BF16 weights. | | granite-4.0-h-tiny-Q80.gguf | Q80 | 7.39GB | false | Extremely high quality, generally unneeded but max available quant. | | granite-4.0-h-tiny-Q6KL.gguf | Q6KL | 5.79GB | false | Uses Q80 for embed and output weights. Very high quality, near perfect, recommended. | | granite-4.0-h-tiny-Q6K.gguf | Q6K | 5.76GB | false | Very high quality, near perfect, recommended. | | granite-4.0-h-tiny-Q5KL.gguf | Q5KL | 5.05GB | false | Uses Q80 for embed and output weights. High quality, recommended. | | granite-4.0-h-tiny-Q5KM.gguf | Q5KM | 5.02GB | false | High quality, recommended. | | granite-4.0-h-tiny-Q5KS.gguf | Q5KS | 4.86GB | false | High quality, recommended. | | granite-4.0-h-tiny-Q41.gguf | Q41 | 4.44GB | false | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | granite-4.0-h-tiny-Q4KL.gguf | Q4KL | 4.33GB | false | Uses Q80 for embed and output weights. Good quality, recommended. | | granite-4.0-h-tiny-Q4KM.gguf | Q4KM | 4.30GB | false | Good quality, default size for most use cases, recommended. | | granite-4.0-h-tiny-Q4KS.gguf | Q4KS | 4.15GB | false | Slightly lower quality with more space savings, recommended. | | granite-4.0-h-tiny-Q40.gguf | Q40 | 4.09GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. | | granite-4.0-h-tiny-IQ4NL.gguf | IQ4NL | 4.02GB | false | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | granite-4.0-h-tiny-IQ4XS.gguf | IQ4XS | 3.82GB | false | Decent quality, smaller than Q4KS with similar performance, recommended. | | granite-4.0-h-tiny-Q3KXL.gguf | Q3KXL | 3.45GB | false | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | granite-4.0-h-tiny-Q3KL.gguf | Q3KL | 3.41GB | false | Lower quality but usable, good for low RAM availability. | | granite-4.0-h-tiny-Q3KM.gguf | Q3KM | 3.29GB | false | Low quality. | | granite-4.0-h-tiny-IQ3M.gguf | IQ3M | 3.29GB | false | Medium-low quality, new method with decent performance comparable to Q3KM. | | granite-4.0-h-tiny-Q3KS.gguf | Q3KS | 3.15GB | false | Low quality, not recommended. | | granite-4.0-h-tiny-IQ3XS.gguf | IQ3XS | 3.01GB | false | Lower quality, new method with decent performance, slightly better than Q3KS. | | granite-4.0-h-tiny-IQ3XXS.gguf | IQ3XXS | 2.87GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. | | granite-4.0-h-tiny-Q2KL.gguf | Q2KL | 2.62GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | granite-4.0-h-tiny-Q2K.gguf | Q2K | 2.59GB | false | Very low quality but surprisingly usable. | | granite-4.0-h-tiny-IQ2M.gguf | IQ2M | 2.29GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. | Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to. 
First, make sure you have hugginface-cli installed: If the model is bigger than 50GB, it will have been split into multiple files. In order to download them all to a local folder, run: You can either specify a new local-dir (ibm-granitegranite-4.0-h-tiny-Q80) or download them all in place (./) Previously, you would download Q4044/48/88, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights. details in this PR. If you use Q40 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282 you will not be able to run the Q40XX files and will instead need to use Q40. Additionally, if you want to get slightly better quality for , you can use IQ4NL thanks to this PR which will also repack the weights for ARM, though only the 44 for now. The loading time may be slower but it will result in an overall speed incrase. I'm keeping this section to show the potential theoretical uplift in performance from using the Q40 with online repacking. Click to view benchmarks on an AVX2 system (EPYC7702) | model | size | params | backend | threads | test | t/s | % (vs Q40) | | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% | Q4088 offers a nice bump to prompt processing and a small bump to text generation A great write up with charts showing various performances is provided by Artefact2 here The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have. If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB Smaller than that total. 
Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in the format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart:

But basically, if you're aiming for below Q4, and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in the format IQXX, like IQ3M. These are newer and offer better performance for their size. The I-quants can also be used on CPU, but they will be slower than their K-quant equivalents, so speed versus performance is a tradeoff you'll have to decide on.

Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset.

Thank you ZeroWw for the inspiration to experiment with embed/output.

Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
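For reference, here is a minimal sketch of grabbing a single quant file with the huggingface_hub Python API instead of the CLI. The repo id follows the bartowski naming pattern seen elsewhere on this page and, like the exact filename, is an assumption to verify on the actual model page.

```python
# Minimal sketch: fetch one quant with the huggingface_hub Python API.
# Repo id and filename are assumptions based on the table above; verify
# them on the actual model page before downloading.
from huggingface_hub import hf_hub_download

local_path = hf_hub_download(
    repo_id="bartowski/ibm-granite_granite-4.0-h-tiny-GGUF",  # assumed repo id
    filename="granite-4.0-h-tiny-Q4KM.gguf",                  # assumed filename
    local_dir="./",                                           # download in place
)
print(local_path)
```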

2,004
1

Delta-Vector_Austral-24B-Winton-GGUF

NaNK
license:apache-2.0
2,003
2

huihui-ai_Mistral-Small-24B-Instruct-2501-abliterated-GGUF

NaNK
license:apache-2.0
1,998
8

granite-20b-code-instruct-GGUF

NaNK
dataset:bigcode/commitpackft
1,996
7

swiss-ai_Apertus-8B-Instruct-2509-GGUF

NaNK
1,989
2

uncensoredai_UncensoredLM-DeepSeek-R1-Distill-Qwen-14B-GGUF

NaNK
license:apache-2.0
1,986
32

zai-org_GLM-4.7-Flash-GGUF

license:mit
1,982
14

FuseO1-DeepSeekR1-QwQ-SkyT1-32B-Preview-GGUF

NaNK
1,978
44

TheDrummer_Valkyrie-49B-v1-GGUF

NaNK
1,952
13

LLaMA-Mesh-GGUF

base_model:Zhengyi/LLaMA-Mesh
1,949
33

LiquidAI_LFM2-8B-A1B-GGUF

Llamacpp imatrix Quantizations of LFM2-8B-A1B by LiquidAI Original model: https://huggingface.co/LiquidAI/LFM2-8B-A1B All quants made using imatrix option with dataset from here combined with a subset of combinedallsmall.parquet from Ed Addario here Run them directly with llama.cpp, or any other llama.cpp based project No chat template specified so default is used. This may be incorrect, check original model card for details. | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | LFM2-8B-A1B-bf16.gguf | bf16 | 16.69GB | false | Full BF16 weights. | | LFM2-8B-A1B-Q80.gguf | Q80 | 8.87GB | false | Extremely high quality, generally unneeded but max available quant. | | LFM2-8B-A1B-Q6KL.gguf | Q6KL | 6.88GB | false | Uses Q80 for embed and output weights. Very high quality, near perfect, recommended. | | LFM2-8B-A1B-Q6K.gguf | Q6K | 6.85GB | false | Very high quality, near perfect, recommended. | | LFM2-8B-A1B-Q5KL.gguf | Q5KL | 5.95GB | false | Uses Q80 for embed and output weights. High quality, recommended. | | LFM2-8B-A1B-Q5KM.gguf | Q5KM | 5.92GB | false | High quality, recommended. | | LFM2-8B-A1B-Q5KS.gguf | Q5KS | 5.76GB | false | High quality, recommended. | | LFM2-8B-A1B-Q41.gguf | Q41 | 5.25GB | false | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | LFM2-8B-A1B-Q4KL.gguf | Q4KL | 5.08GB | false | Uses Q80 for embed and output weights. Good quality, recommended. | | LFM2-8B-A1B-Q4KM.gguf | Q4KM | 5.05GB | false | Good quality, default size for most use cases, recommended. | | LFM2-8B-A1B-Q4KS.gguf | Q4KS | 4.89GB | false | Slightly lower quality with more space savings, recommended. | | LFM2-8B-A1B-Q40.gguf | Q40 | 4.81GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. | | LFM2-8B-A1B-IQ4NL.gguf | IQ4NL | 4.74GB | false | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | LFM2-8B-A1B-IQ4XS.gguf | IQ4XS | 4.48GB | false | Decent quality, smaller than Q4KS with similar performance, recommended. | | LFM2-8B-A1B-Q3KXL.gguf | Q3KXL | 3.99GB | false | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | LFM2-8B-A1B-Q3KL.gguf | Q3KL | 3.96GB | false | Lower quality but usable, good for low RAM availability. | | LFM2-8B-A1B-Q3KM.gguf | Q3KM | 3.82GB | false | Low quality. | | LFM2-8B-A1B-IQ3M.gguf | IQ3M | 3.82GB | false | Medium-low quality, new method with decent performance comparable to Q3KM. | | LFM2-8B-A1B-Q3KS.gguf | Q3KS | 3.65GB | false | Low quality, not recommended. | | LFM2-8B-A1B-IQ3XS.gguf | IQ3XS | 3.46GB | false | Lower quality, new method with decent performance, slightly better than Q3KS. | | LFM2-8B-A1B-IQ3XXS.gguf | IQ3XXS | 3.31GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. | | LFM2-8B-A1B-Q2KL.gguf | Q2KL | 2.98GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | LFM2-8B-A1B-Q2K.gguf | Q2K | 2.95GB | false | Very low quality but surprisingly usable. | | LFM2-8B-A1B-IQ2M.gguf | IQ2M | 2.65GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. | Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to. 
First, make sure you have hugginface-cli installed: If the model is bigger than 50GB, it will have been split into multiple files. In order to download them all to a local folder, run: You can either specify a new local-dir (LiquidAILFM2-8B-A1B-Q80) or download them all in place (./) Previously, you would download Q4044/48/88, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights. details in this PR. If you use Q40 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282 you will not be able to run the Q40XX files and will instead need to use Q40. Additionally, if you want to get slightly better quality for , you can use IQ4NL thanks to this PR which will also repack the weights for ARM, though only the 44 for now. The loading time may be slower but it will result in an overall speed incrase. I'm keeping this section to show the potential theoretical uplift in performance from using the Q40 with online repacking. Click to view benchmarks on an AVX2 system (EPYC7702) | model | size | params | backend | threads | test | t/s | % (vs Q40) | | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% | Q4088 offers a nice bump to prompt processing and a small bump to text generation A great write up with charts showing various performances is provided by Artefact2 here The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have. If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB Smaller than that total. 
Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart: But basically, if you're aiming for below Q4, and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in format IQXX, like IQ3M. These are newer and offer better performance for their size. These I-quants can also be used on CPU, but will be slower than their K-quant equivalent, so speed vs performance is a tradeoff you'll have to decide. Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output. Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
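Since the card notes these quants run directly with llama.cpp or any llama.cpp based project, here is a rough sketch using the llama-cpp-python bindings (one such project); the model path and parameter values are placeholders rather than anything specified on the card.

```python
# Rough sketch of loading a downloaded GGUF with llama-cpp-python,
# a llama.cpp-based project. Path and settings are placeholders.
from llama_cpp import Llama

llm = Llama(
    model_path="./LFM2-8B-A1B-Q4KM.gguf",  # any quant file from the table above
    n_gpu_layers=-1,   # offload all layers to GPU; lower this if VRAM is tight
    n_ctx=4096,        # context window
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=64,
)
print(out["choices"][0]["message"]["content"])
```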

NaNK
1,946
7

zerofata_GLM-4.5-Iceblink-106B-A12B-GGUF

Llamacpp imatrix Quantizations of GLM-4.5-Iceblink-106B-A12B by zerofata Original model: https://huggingface.co/zerofata/GLM-4.5-Iceblink-106B-A12B All quants made using imatrix option with dataset from here combined with a subset of combinedallsmall.parquet from Ed Addario here Run them directly with llama.cpp, or any other llama.cpp based project | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | GLM-4.5-Iceblink-106B-A12B-Q80.gguf | Q80 | 117.46GB | true | Extremely high quality, generally unneeded but max available quant. | | GLM-4.5-Iceblink-106B-A12B-Q6K.gguf | Q6K | 99.18GB | true | Very high quality, near perfect, recommended. | | GLM-4.5-Iceblink-106B-A12B-Q5KM.gguf | Q5KM | 83.72GB | true | High quality, recommended. | | GLM-4.5-Iceblink-106B-A12B-Q5KS.gguf | Q5KS | 78.55GB | true | High quality, recommended. | | GLM-4.5-Iceblink-106B-A12B-Q4KM.gguf | Q4KM | 73.50GB | true | Good quality, default size for most use cases, recommended. | | GLM-4.5-Iceblink-106B-A12B-Q41.gguf | Q41 | 69.55GB | true | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | GLM-4.5-Iceblink-106B-A12B-Q4KS.gguf | Q4KS | 68.31GB | true | Slightly lower quality with more space savings, recommended. | | GLM-4.5-Iceblink-106B-A12B-Q40.gguf | Q40 | 63.76GB | true | Legacy format, offers online repacking for ARM and AVX CPU inference. | | GLM-4.5-Iceblink-106B-A12B-IQ4NL.gguf | IQ4NL | 63.06GB | true | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | GLM-4.5-Iceblink-106B-A12B-IQ4XS.gguf | IQ4XS | 60.81GB | true | Decent quality, smaller than Q4KS with similar performance, recommended. | | GLM-4.5-Iceblink-106B-A12B-Q3KXL.gguf | Q3KXL | 56.45GB | true | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | GLM-4.5-Iceblink-106B-A12B-Q3KL.gguf | Q3KL | 55.91GB | true | Lower quality but usable, good for low RAM availability. | | GLM-4.5-Iceblink-106B-A12B-Q3KM.gguf | Q3KM | 55.48GB | true | Low quality. | | GLM-4.5-Iceblink-106B-A12B-IQ3M.gguf | IQ3M | 55.48GB | true | Medium-low quality, new method with decent performance comparable to Q3KM. | | GLM-4.5-Iceblink-106B-A12B-Q3KS.gguf | Q3KS | 53.42GB | true | Low quality, not recommended. | | GLM-4.5-Iceblink-106B-A12B-IQ3XS.gguf | IQ3XS | 50.84GB | true | Lower quality, new method with decent performance, slightly better than Q3KS. | | GLM-4.5-Iceblink-106B-A12B-IQ3XXS.gguf | IQ3XXS | 50.34GB | true | Lower quality, new method with decent performance, comparable to Q3 quants. | | GLM-4.5-Iceblink-106B-A12B-Q2KL.gguf | Q2KL | 46.71GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | GLM-4.5-Iceblink-106B-A12B-Q2K.gguf | Q2K | 46.10GB | false | Very low quality but surprisingly usable. | | GLM-4.5-Iceblink-106B-A12B-IQ2M.gguf | IQ2M | 45.12GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. | | GLM-4.5-Iceblink-106B-A12B-IQ2S.gguf | IQ2S | 42.54GB | false | Low quality, uses SOTA techniques to be usable. | | GLM-4.5-Iceblink-106B-A12B-IQ2XS.gguf | IQ2XS | 42.19GB | false | Low quality, uses SOTA techniques to be usable. | | GLM-4.5-Iceblink-106B-A12B-IQ2XXS.gguf | IQ2XXS | 39.62GB | false | Very low quality, uses SOTA techniques to be usable. | | GLM-4.5-Iceblink-106B-A12B-IQ1M.gguf | IQ1M | 37.86GB | false | Extremely low quality, not recommended. 
| Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to. First, make sure you have hugginface-cli installed: If the model is bigger than 50GB, it will have been split into multiple files. In order to download them all to a local folder, run: You can either specify a new local-dir (zerofataGLM-4.5-Iceblink-106B-A12B-Q80) or download them all in place (./) Previously, you would download Q4044/48/88, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights. details in this PR. If you use Q40 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282 you will not be able to run the Q40XX files and will instead need to use Q40. Additionally, if you want to get slightly better quality for , you can use IQ4NL thanks to this PR which will also repack the weights for ARM, though only the 44 for now. The loading time may be slower but it will result in an overall speed incrase. I'm keeping this section to show the potential theoretical uplift in performance from using the Q40 with online repacking. Click to view benchmarks on an AVX2 system (EPYC7702) | model | size | params | backend | threads | test | t/s | % (vs Q40) | | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% | Q4088 offers a nice bump to prompt processing and a small bump to text generation A great write up with charts showing various performances is provided by Artefact2 here The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have. If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. 
If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB Smaller than that total. Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart: But basically, if you're aiming for below Q4, and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in format IQXX, like IQ3M. These are newer and offer better performance for their size. These I-quants can also be used on CPU, but will be slower than their K-quant equivalent, so speed vs performance is a tradeoff you'll have to decide. Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output. Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
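Because most of the quants for this model are split across multiple files, here is a small sketch of pulling every part of one quant into a local folder with snapshot_download; the repo id and the glob pattern are assumptions matching the naming used in the table above.

```python
# Sketch: download all parts of a split quant (e.g. Q4KM) to a local folder.
# Repo id and pattern are assumptions; adjust them to the actual file names.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="bartowski/zerofata_GLM-4.5-Iceblink-106B-A12B-GGUF",  # assumed repo id
    allow_patterns=["*Q4KM*"],                     # grab every shard of the Q4KM quant
    local_dir="GLM-4.5-Iceblink-106B-A12B-Q4KM",   # or "./" to download in place
)
```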

NaNK
1,944
3

a-m-team_AM-Thinking-v1-GGUF

NaNK
license:apache-2.0
1,924
1

baichuan-inc_Baichuan-M2-32B-GGUF

NaNK
1,910
5

ai21labs_AI21-Jamba-Large-1.7-GGUF

Llamacpp imatrix Quantizations of AI21-Jamba-Large-1.7 by ai21labs Original model: https://huggingface.co/ai21labs/AI21-Jamba-Large-1.7 All quants made using imatrix option with dataset from here combined with a subset of combinedallsmall.parquet from Ed Addario here Run them directly with llama.cpp, or any other llama.cpp based project | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | AI21-Jamba-Large-1.7-Q80.gguf | Q80 | 423.55GB | true | Extremely high quality, generally unneeded but max available quant. | | AI21-Jamba-Large-1.7-Q6K.gguf | Q6K | 327.05GB | true | Very high quality, near perfect, recommended. | | AI21-Jamba-Large-1.7-Q5KM.gguf | Q5KM | 282.39GB | true | High quality, recommended. | | AI21-Jamba-Large-1.7-Q5KS.gguf | Q5KS | 274.21GB | true | High quality, recommended. | | AI21-Jamba-Large-1.7-Q41.gguf | Q41 | 249.34GB | true | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | AI21-Jamba-Large-1.7-Q4KL.gguf | Q4KL | 240.83GB | true | Uses Q80 for embed and output weights. Good quality, recommended. | | AI21-Jamba-Large-1.7-Q4KM.gguf | Q4KM | 240.44GB | true | Good quality, default size for most use cases, recommended. | | AI21-Jamba-Large-1.7-Q4KS.gguf | Q4KS | 231.92GB | true | Slightly lower quality with more space savings, recommended. | | AI21-Jamba-Large-1.7-Q40.gguf | Q40 | 228.15GB | true | Legacy format, offers online repacking for ARM and AVX CPU inference. | | AI21-Jamba-Large-1.7-IQ4NL.gguf | IQ4NL | 224.54GB | true | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | AI21-Jamba-Large-1.7-IQ4XS.gguf | IQ4XS | 212.13GB | true | Decent quality, smaller than Q4KS with similar performance, recommended. | | AI21-Jamba-Large-1.7-Q3KXL.gguf | Q3KXL | 188.93GB | true | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | AI21-Jamba-Large-1.7-Q3KL.gguf | Q3KL | 188.46GB | true | Lower quality but usable, good for low RAM availability. | | AI21-Jamba-Large-1.7-Q3KM.gguf | Q3KM | 180.50GB | true | Low quality. | | AI21-Jamba-Large-1.7-IQ3M.gguf | IQ3M | 179.62GB | true | Medium-low quality, new method with decent performance comparable to Q3KM. | | AI21-Jamba-Large-1.7-Q3KS.gguf | Q3KS | 171.77GB | true | Low quality, not recommended. | | AI21-Jamba-Large-1.7-IQ3XS.gguf | IQ3XS | 163.08GB | true | Lower quality, new method with decent performance, slightly better than Q3KS. | | AI21-Jamba-Large-1.7-IQ3XXS.gguf | IQ3XXS | 155.79GB | true | Lower quality, new method with decent performance, comparable to Q3 quants. | | AI21-Jamba-Large-1.7-Q2KL.gguf | Q2KL | 138.58GB | true | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | AI21-Jamba-Large-1.7-Q2K.gguf | Q2K | 138.06GB | true | Very low quality but surprisingly usable. | | AI21-Jamba-Large-1.7-IQ2M.gguf | IQ2M | 125.79GB | true | Relatively low quality, uses SOTA techniques to be surprisingly usable. | | AI21-Jamba-Large-1.7-IQ2S.gguf | IQ2S | 111.14GB | true | Low quality, uses SOTA techniques to be usable. | | AI21-Jamba-Large-1.7-IQ2XS.gguf | IQ2XS | 110.71GB | true | Low quality, uses SOTA techniques to be usable. | | AI21-Jamba-Large-1.7-IQ2XXS.gguf | IQ2XXS | 96.37GB | true | Very low quality, uses SOTA techniques to be usable. | | AI21-Jamba-Large-1.7-IQ1M.gguf | IQ1M | 85.91GB | true | Extremely low quality, not recommended. 
| | AI21-Jamba-Large-1.7-IQ1S.gguf | IQ1S | 81.88GB | true | Extremely low quality, not recommended. | Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to. First, make sure you have hugginface-cli installed: If the model is bigger than 50GB, it will have been split into multiple files. In order to download them all to a local folder, run: You can either specify a new local-dir (ai21labsAI21-Jamba-Large-1.7-Q80) or download them all in place (./) Previously, you would download Q4044/48/88, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights. details in this PR. If you use Q40 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282 you will not be able to run the Q40XX files and will instead need to use Q40. Additionally, if you want to get slightly better quality for , you can use IQ4NL thanks to this PR which will also repack the weights for ARM, though only the 44 for now. The loading time may be slower but it will result in an overall speed incrase. I'm keeping this section to show the potential theoretical uplift in performance from using the Q40 with online repacking. Click to view benchmarks on an AVX2 system (EPYC7702) | model | size | params | backend | threads | test | t/s | % (vs Q40) | | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% | Q4088 offers a nice bump to prompt processing and a small bump to text generation A great write up with charts showing various performances is provided by Artefact2 here The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have. 
If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB Smaller than that total. Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart: But basically, if you're aiming for below Q4, and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in format IQXX, like IQ3M. These are newer and offer better performance for their size. These I-quants can also be used on CPU, but will be slower than their K-quant equivalent, so speed vs performance is a tradeoff you'll have to decide. Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output. Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
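The sizing rule described above (aim for a file 1-2GB smaller than your VRAM, or than RAM plus VRAM for maximum quality) is easy to mechanize. The sketch below is illustrative only: the sizes are copied from a few rows of the table, while the 2GB headroom default and the example budget are assumptions you would replace with your own numbers.

```python
# Illustrative sketch of the sizing rule from the card: pick the largest
# quant whose file size fits under your memory budget minus ~2GB headroom.
# Sizes are taken from the table above; the budget is a made-up example.
QUANT_SIZES_GB = {
    "Q4KM": 240.44,
    "Q4KS": 231.92,
    "IQ4XS": 212.13,
    "Q3KM": 180.50,
    "IQ3XXS": 155.79,
    "Q2K": 138.06,
    "IQ2XXS": 96.37,
    "IQ1M": 85.91,
}

def pick_quant(budget_gb: float, headroom_gb: float = 2.0) -> str | None:
    """Return the largest listed quant that fits in budget_gb - headroom_gb."""
    fitting = {q: s for q, s in QUANT_SIZES_GB.items() if s <= budget_gb - headroom_gb}
    return max(fitting, key=fitting.get) if fitting else None

# Example: 128GB system RAM + 48GB VRAM, split between CPU and GPU for max quality.
print(pick_quant(128 + 48))  # -> "IQ3XXS" among the sizes listed here
```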

NaNK
1,889
1

DeepSeek-R1-GGUF

NaNK
license:mit
1,872
104

zetasepic_Mistral-Small-Instruct-2409-abliterated-GGUF

1,861
1

allura-org_Q3-30B-A3B-Designant-GGUF

NaNK
1,836
4

mlabonne_gemma-3-12b-it-abliterated-GGUF

NaNK
1,835
9

google_gemma-3n-E2B-it-GGUF

NaNK
1,831
10

EXAONE-3.0-7.8B-Instruct-GGUF

NaNK
1,825
8

sophosympatheia_Strawberrylemonade-70B-v1.1-GGUF

NaNK
license:llama3
1,818
0

MN-12B-Celeste-V1.9-GGUF

NaNK
license:apache-2.0
1,800
18

Llama-3.1-SuperNova-Lite-GGUF

base_model:arcee-ai/Llama-3.1-SuperNova-Lite
1,799
11

perplexity-ai_r1-1776-distill-llama-70b-GGUF

NaNK
base_model:perplexity-ai/r1-1776-distill-llama-70b
1,797
9

Qwen2.5-32B-ArliAI-RPMax-v1.3-GGUF

NaNK
license:apache-2.0
1,796
12

QVQ-72B-Preview-GGUF

NaNK
1,793
53

LongWriter-llama3.1-8b-GGUF

NaNK
llama
1,793
29

Qwen_Qwen3-VL-2B-Instruct-GGUF

Llamacpp imatrix Quantizations of Qwen3-VL-2B-Instruct by Qwen Original model: https://huggingface.co/Qwen/Qwen3-VL-2B-Instruct All quants made using imatrix option with dataset from here combined with a subset of combinedallsmall.parquet from Ed Addario here Run them directly with llama.cpp, or any other llama.cpp based project | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | Qwen3-VL-2B-Instruct-bf16.gguf | bf16 | 3.45GB | false | Full BF16 weights. | | Qwen3-VL-2B-Instruct-Q80.gguf | Q80 | 1.83GB | false | Extremely high quality, generally unneeded but max available quant. | | Qwen3-VL-2B-Instruct-Q6KL.gguf | Q6KL | 1.49GB | false | Uses Q80 for embed and output weights. Very high quality, near perfect, recommended. | | Qwen3-VL-2B-Instruct-Q6K.gguf | Q6K | 1.42GB | false | Very high quality, near perfect, recommended. | | Qwen3-VL-2B-Instruct-Q5KL.gguf | Q5KL | 1.33GB | false | Uses Q80 for embed and output weights. High quality, recommended. | | Qwen3-VL-2B-Instruct-Q5KM.gguf | Q5KM | 1.26GB | false | High quality, recommended. | | Qwen3-VL-2B-Instruct-Q5KS.gguf | Q5KS | 1.23GB | false | High quality, recommended. | | Qwen3-VL-2B-Instruct-Q4KL.gguf | Q4KL | 1.18GB | false | Uses Q80 for embed and output weights. Good quality, recommended. | | Qwen3-VL-2B-Instruct-Q41.gguf | Q41 | 1.14GB | false | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | Qwen3-VL-2B-Instruct-Q4KM.gguf | Q4KM | 1.11GB | false | Good quality, default size for most use cases, recommended. | | Qwen3-VL-2B-Instruct-Q3KXL.gguf | Q3KXL | 1.08GB | false | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | Qwen3-VL-2B-Instruct-Q4KS.gguf | Q4KS | 1.06GB | false | Slightly lower quality with more space savings, recommended. | | Qwen3-VL-2B-Instruct-Q40.gguf | Q40 | 1.06GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. | | Qwen3-VL-2B-Instruct-IQ4NL.gguf | IQ4NL | 1.05GB | false | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | Qwen3-VL-2B-Instruct-IQ4XS.gguf | IQ4XS | 1.01GB | false | Decent quality, smaller than Q4KS with similar performance, recommended. | | Qwen3-VL-2B-Instruct-Q3KL.gguf | Q3KL | 1.00GB | false | Lower quality but usable, good for low RAM availability. | | Qwen3-VL-2B-Instruct-Q3KM.gguf | Q3KM | 0.94GB | false | Low quality. | | Qwen3-VL-2B-Instruct-IQ3M.gguf | IQ3M | 0.90GB | false | Medium-low quality, new method with decent performance comparable to Q3KM. | | Qwen3-VL-2B-Instruct-Q3KS.gguf | Q3KS | 0.87GB | false | Low quality, not recommended. | | Qwen3-VL-2B-Instruct-Q2KL.gguf | Q2KL | 0.85GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | Qwen3-VL-2B-Instruct-IQ3XS.gguf | IQ3XS | 0.83GB | false | Lower quality, new method with decent performance, slightly better than Q3KS. | | Qwen3-VL-2B-Instruct-Q2K.gguf | Q2K | 0.78GB | false | Very low quality but surprisingly usable. | | Qwen3-VL-2B-Instruct-IQ3XXS.gguf | IQ3XXS | 0.75GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. | | Qwen3-VL-2B-Instruct-IQ2M.gguf | IQ2M | 0.70GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. 
| Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to. First, make sure you have hugginface-cli installed: If the model is bigger than 50GB, it will have been split into multiple files. In order to download them all to a local folder, run: You can either specify a new local-dir (QwenQwen3-VL-2B-Instruct-Q80) or download them all in place (./) Previously, you would download Q4044/48/88, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights. details in this PR. If you use Q40 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282 you will not be able to run the Q40XX files and will instead need to use Q40. Additionally, if you want to get slightly better quality for , you can use IQ4NL thanks to this PR which will also repack the weights for ARM, though only the 44 for now. The loading time may be slower but it will result in an overall speed incrase. I'm keeping this section to show the potential theoretical uplift in performance from using the Q40 with online repacking. Click to view benchmarks on an AVX2 system (EPYC7702) | model | size | params | backend | threads | test | t/s | % (vs Q40) | | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% | Q4088 offers a nice bump to prompt processing and a small bump to text generation A great write up with charts showing various performances is provided by Artefact2 here The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have. If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. 
If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB Smaller than that total. Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart: But basically, if you're aiming for below Q4, and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in format IQXX, like IQ3M. These are newer and offer better performance for their size. These I-quants can also be used on CPU, but will be slower than their K-quant equivalent, so speed vs performance is a tradeoff you'll have to decide. Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output. Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
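Before picking a file size as described above, it can help to see what is actually in the repo; a minimal sketch with list_repo_files follows, where the repo id is an assumption based on this page's naming.

```python
# Minimal sketch: list the GGUF files available in a quant repo so you can
# compare them against your VRAM/RAM budget. Repo id is an assumption.
from huggingface_hub import list_repo_files

files = list_repo_files("bartowski/Qwen_Qwen3-VL-2B-Instruct-GGUF")  # assumed repo id
for name in sorted(f for f in files if f.endswith(".gguf")):
    print(name)
```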

NaNK
1,786
1

Qwen2.5-Coder-0.5B-Instruct-GGUF

NaNK
license:apache-2.0
1,781
4

nvidia_AceReason-Nemotron-14B-GGUF

NaNK
1,769
8

Qwen_Qwen3-VL-4B-Thinking-GGUF

NaNK
1,768
1

Menlo_Lucy-GGUF

1,754
4

Human-Like-Mistral-Nemo-Instruct-2407-GGUF

license:apache-2.0
1,754
4

Qwen_Qwen3-1.7B-GGUF

NaNK
1,750
12

MN-12B-Mag-Mell-R1-GGUF

NaNK
1,750
11

burtenshaw_GemmaCoder3-12B-GGUF

NaNK
1,739
10

Qwen2.5-14B-Instruct-1M-GGUF

NaNK
license:apache-2.0
1,733
44

granite-3.0-8b-instruct-GGUF

NaNK
license:apache-2.0
1,733
11

Impish_Mind_8B-GGUF

NaNK
license:apache-2.0
1,722
14

Stheno-Hercules-3.1-8B-GGUF

NaNK
base_model:Locutusque/Hercules-6.1-Llama-3.1-8B
1,713
3

TheDrummer_Cydonia-R1-24B-v4-GGUF

NaNK
1,684
5

Meta-Llama-3.1-8B-Claude-GGUF

NaNK
base_model:Undi95/Meta-Llama-3.1-8B-Claude
1,663
12

open-thoughts_OpenThinker-32B-GGUF

NaNK
llama-factory
1,659
7

nvidia_Llama-3.1-Nemotron-Nano-4B-v1.1-GGUF

NaNK
llama-3
1,655
10

v6-Finch-7B-HF-GGUF

NaNK
license:apache-2.0
1,654
0

PocketDoc_Dans-PersonalityEngine-V1.3.0-12b-GGUF

NaNK
license:apache-2.0
1,652
6

TheDrummer_Behemoth-R1-123B-v2-GGUF

NaNK
1,650
2

google_medgemma-27b-it-GGUF

NaNK
1,648
5

SmallThinker-3B-Preview-GGUF

NaNK
1,642
31

mistralai_Magistral-Small-2506-GGUF

license:apache-2.0
1,634
17

Sailor2-1B-Chat-GGUF

NaNK
license:apache-2.0
1,632
2

ilsp_Llama-Krikri-8B-Instruct-GGUF

NaNK
base_model:ilsp/Llama-Krikri-8B-Instruct
1,627
1

xai-org_grok-2-GGUF

Llamacpp imatrix Quantizations of grok-2 by xai-org Original model: https://huggingface.co/xai-org/grok-2 All quants made using imatrix option with dataset from here combined with a subset of combinedallsmall.parquet from Ed Addario here Run them directly with llama.cpp, or any other llama.cpp based project | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | grok-2-Q80.gguf | Q80 | 286.39GB | true | Extremely high quality, generally unneeded but max available quant. | | grok-2-Q6K.gguf | Q6K | 221.37GB | true | Very high quality, near perfect, recommended. | | grok-2-Q5KM.gguf | Q5KM | 191.57GB | true | High quality, recommended. | | grok-2-Q5KS.gguf | Q5KS | 185.87GB | true | High quality, recommended. | | grok-2-Q41.gguf | Q41 | 169.16GB | true | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | grok-2-Q4KL.gguf | Q4KL | 164.85GB | true | Uses Q80 for embed and output weights. Good quality, recommended. | | grok-2-Q4KM.gguf | Q4KM | 164.06GB | true | Good quality, default size for most use cases, recommended. | | grok-2-Q4KS.gguf | Q4KS | 157.55GB | true | Slightly lower quality with more space savings, recommended. | | grok-2-Q40.gguf | Q40 | 154.73GB | true | Legacy format, offers online repacking for ARM and AVX CPU inference. | | grok-2-IQ4NL.gguf | IQ4NL | 152.98GB | true | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | grok-2-IQ4XS.gguf | IQ4XS | 144.76GB | true | Decent quality, smaller than Q4KS with similar performance, recommended. | | grok-2-Q3KXL.gguf | Q3KXL | 131.16GB | true | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | grok-2-Q3KL.gguf | Q3KL | 130.22GB | true | Lower quality but usable, good for low RAM availability. | | grok-2-Q3KM.gguf | Q3KM | 125.02GB | true | Low quality. | | grok-2-IQ3M.gguf | IQ3M | 123.75GB | true | Medium-low quality, new method with decent performance comparable to Q3KM. | | grok-2-Q3KS.gguf | Q3KS | 118.04GB | true | Low quality, not recommended. | | grok-2-IQ3XS.gguf | IQ3XS | 111.80GB | true | Lower quality, new method with decent performance, slightly better than Q3KS. | | grok-2-IQ3XXS.gguf | IQ3XXS | 106.96GB | true | Lower quality, new method with decent performance, comparable to Q3 quants. | | grok-2-Q2KL.gguf | Q2KL | 97.61GB | true | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | grok-2-Q2K.gguf | Q2K | 96.56GB | true | Very low quality but surprisingly usable. | | grok-2-IQ2M.gguf | IQ2M | 88.21GB | true | Relatively low quality, uses SOTA techniques to be surprisingly usable. | | grok-2-IQ2S.gguf | IQ2S | 78.71GB | true | Low quality, uses SOTA techniques to be usable. | | grok-2-IQ2XS.gguf | IQ2XS | 77.74GB | true | Low quality, uses SOTA techniques to be usable. | | grok-2-IQ2XXS.gguf | IQ2XXS | 68.52GB | true | Very low quality, uses SOTA techniques to be usable. | | grok-2-IQ1M.gguf | IQ1M | 61.38GB | true | Extremely low quality, not recommended. | Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to. First, make sure you have hugginface-cli installed: If the model is bigger than 50GB, it will have been split into multiple files. 
In order to download them all to a local folder, run: You can either specify a new local-dir (xai-orggrok-2-Q80) or download them all in place (./) Previously, you would download Q4044/48/88, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights. details in this PR. If you use Q40 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282 you will not be able to run the Q40XX files and will instead need to use Q40. Additionally, if you want to get slightly better quality for , you can use IQ4NL thanks to this PR which will also repack the weights for ARM, though only the 44 for now. The loading time may be slower but it will result in an overall speed incrase. I'm keeping this section to show the potential theoretical uplift in performance from using the Q40 with online repacking. Click to view benchmarks on an AVX2 system (EPYC7702) | model | size | params | backend | threads | test | t/s | % (vs Q40) | | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% | Q4088 offers a nice bump to prompt processing and a small bump to text generation A great write up with charts showing various performances is provided by Artefact2 here The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have. If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB Smaller than that total. Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. 
These are in format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart: But basically, if you're aiming for below Q4, and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in format IQXX, like IQ3M. These are newer and offer better performance for their size. These I-quants can also be used on CPU, but will be slower than their K-quant equivalent, so speed vs performance is a tradeoff you'll have to decide. Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output. Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
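For a model this large, the "system RAM plus VRAM" budgeting described above is the binding constraint. The sketch below estimates that budget at runtime; it assumes psutil and a CUDA-enabled PyTorch build are installed, neither of which is mentioned on the card.

```python
# Sketch: estimate the RAM + VRAM budget used by the sizing advice above.
# Assumes psutil and a CUDA-enabled PyTorch are installed; both are
# assumptions on my part, not requirements from the card.
import psutil
import torch

ram_gb = psutil.virtual_memory().total / 1e9
vram_gb = 0.0
if torch.cuda.is_available():
    free_b, total_b = torch.cuda.mem_get_info()
    vram_gb = total_b / 1e9

budget_gb = ram_gb + vram_gb - 2  # leave ~2GB of headroom, per the guidance
print(f"RAM {ram_gb:.1f}GB + VRAM {vram_gb:.1f}GB -> look for quants under ~{budget_gb:.0f}GB")
```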

NaNK
1,626
6

CrucibleLab_M3.2-24B-Loki-V1.3-GGUF

Llamacpp imatrix Quantizations of M3.2-24B-Loki-V1.3 by CrucibleLab Original model: https://huggingface.co/CrucibleLab/M3.2-24B-Loki-V1.3 All quants made using imatrix option with dataset from here combined with a subset of combinedallsmall.parquet from Ed Addario here Run them directly with llama.cpp, or any other llama.cpp based project | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | M3.2-24B-Loki-V1.3-bf16.gguf | bf16 | 47.15GB | false | Full BF16 weights. | | M3.2-24B-Loki-V1.3-Q80.gguf | Q80 | 25.05GB | false | Extremely high quality, generally unneeded but max available quant. | | M3.2-24B-Loki-V1.3-Q6KL.gguf | Q6KL | 19.67GB | false | Uses Q80 for embed and output weights. Very high quality, near perfect, recommended. | | M3.2-24B-Loki-V1.3-Q6K.gguf | Q6K | 19.35GB | false | Very high quality, near perfect, recommended. | | M3.2-24B-Loki-V1.3-Q5KL.gguf | Q5KL | 17.18GB | false | Uses Q80 for embed and output weights. High quality, recommended. | | M3.2-24B-Loki-V1.3-Q5KM.gguf | Q5KM | 16.76GB | false | High quality, recommended. | | M3.2-24B-Loki-V1.3-Q5KS.gguf | Q5KS | 16.30GB | false | High quality, recommended. | | M3.2-24B-Loki-V1.3-Q41.gguf | Q41 | 14.87GB | false | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | M3.2-24B-Loki-V1.3-Q4KL.gguf | Q4KL | 14.83GB | false | Uses Q80 for embed and output weights. Good quality, recommended. | | M3.2-24B-Loki-V1.3-Q4KM.gguf | Q4KM | 14.33GB | false | Good quality, default size for most use cases, recommended. | | M3.2-24B-Loki-V1.3-Q4KS.gguf | Q4KS | 13.55GB | false | Slightly lower quality with more space savings, recommended. | | M3.2-24B-Loki-V1.3-Q40.gguf | Q40 | 13.49GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. | | M3.2-24B-Loki-V1.3-IQ4NL.gguf | IQ4NL | 13.47GB | false | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | M3.2-24B-Loki-V1.3-Q3KXL.gguf | Q3KXL | 12.99GB | false | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | M3.2-24B-Loki-V1.3-IQ4XS.gguf | IQ4XS | 12.76GB | false | Decent quality, smaller than Q4KS with similar performance, recommended. | | M3.2-24B-Loki-V1.3-Q3KL.gguf | Q3KL | 12.40GB | false | Lower quality but usable, good for low RAM availability. | | M3.2-24B-Loki-V1.3-Q3KM.gguf | Q3KM | 11.47GB | false | Low quality. | | M3.2-24B-Loki-V1.3-IQ3M.gguf | IQ3M | 10.65GB | false | Medium-low quality, new method with decent performance comparable to Q3KM. | | M3.2-24B-Loki-V1.3-Q3KS.gguf | Q3KS | 10.40GB | false | Low quality, not recommended. | | M3.2-24B-Loki-V1.3-IQ3XS.gguf | IQ3XS | 9.91GB | false | Lower quality, new method with decent performance, slightly better than Q3KS. | | M3.2-24B-Loki-V1.3-Q2KL.gguf | Q2KL | 9.55GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | M3.2-24B-Loki-V1.3-IQ3XXS.gguf | IQ3XXS | 9.28GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. | | M3.2-24B-Loki-V1.3-Q2K.gguf | Q2K | 8.89GB | false | Very low quality but surprisingly usable. | | M3.2-24B-Loki-V1.3-IQ2M.gguf | IQ2M | 8.11GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. | | M3.2-24B-Loki-V1.3-IQ2S.gguf | IQ2S | 7.48GB | false | Low quality, uses SOTA techniques to be usable. 
| | M3.2-24B-Loki-V1.3-IQ2XS.gguf | IQ2XS | 7.21GB | false | Low quality, uses SOTA techniques to be usable. | Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to. First, make sure you have hugginface-cli installed: If the model is bigger than 50GB, it will have been split into multiple files. In order to download them all to a local folder, run: You can either specify a new local-dir (CrucibleLabM3.2-24B-Loki-V1.3-Q80) or download them all in place (./) Previously, you would download Q4044/48/88, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights. details in this PR. If you use Q40 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282 you will not be able to run the Q40XX files and will instead need to use Q40. Additionally, if you want to get slightly better quality for , you can use IQ4NL thanks to this PR which will also repack the weights for ARM, though only the 44 for now. The loading time may be slower but it will result in an overall speed incrase. I'm keeping this section to show the potential theoretical uplift in performance from using the Q40 with online repacking. Click to view benchmarks on an AVX2 system (EPYC7702) | model | size | params | backend | threads | test | t/s | % (vs Q40) | | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% | Q4088 offers a nice bump to prompt processing and a small bump to text generation A great write up with charts showing various performances is provided by Artefact2 here The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have. 
If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB Smaller than that total. Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart: But basically, if you're aiming for below Q4, and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in format IQXX, like IQ3M. These are newer and offer better performance for their size. These I-quants can also be used on CPU, but will be slower than their K-quant equivalent, so speed vs performance is a tradeoff you'll have to decide. Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output. Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
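The K-quant versus I-quant naming convention described above ('QXKX' like Q5KM versus 'IQXX' like IQ3M) can be checked mechanically. Here is a tiny illustrative helper using filenames from the table; it is only a sketch of the naming convention, not anything official from llama.cpp.

```python
# Tiny sketch of the naming convention from the card: I-quants start with
# "IQ", K-quants look like "Q<digit>K...". Filenames are from the table above.
import re

def quant_family(filename: str) -> str:
    name = filename.rsplit("-", 1)[-1].removesuffix(".gguf")
    if name.startswith("IQ"):
        return "I-quant"
    if re.fullmatch(r"Q\dK[A-Z]*", name):
        return "K-quant"
    return "other (legacy/full precision)"

for f in ["M3.2-24B-Loki-V1.3-Q5KM.gguf",
          "M3.2-24B-Loki-V1.3-IQ3M.gguf",
          "M3.2-24B-Loki-V1.3-Q40.gguf"]:
    print(f, "->", quant_family(f))
```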

NaNK
1,624
5

LLAMA-3_8B_Unaligned_BETA-GGUF

NaNK
base_model:SicariusSicariiStuff/LLAMA-3_8B_Unaligned_BETA
1,622
10

Qwen2.5-32B-AGI-GGUF

NaNK
license:apache-2.0
1,619
41

Athene-V2-Agent-GGUF

1,612
9

Qwen2-VL-72B-Instruct-GGUF

NaNK
1,607
11

OpenGVLab_InternVL3_5-8B-GGUF

NaNK
1,603
5

Falcon3-10B-Instruct-GGUF

NaNK
1,602
3

Qwen_Qwen3-VL-2B-Thinking-GGUF

Llamacpp imatrix Quantizations of Qwen3-VL-2B-Thinking by Qwen Original model: https://huggingface.co/Qwen/Qwen3-VL-2B-Thinking All quants made using imatrix option with dataset from here combined with a subset of combinedallsmall.parquet from Ed Addario here Run them directly with llama.cpp, or any other llama.cpp based project | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | Qwen3-VL-2B-Thinking-bf16.gguf | bf16 | 3.45GB | false | Full BF16 weights. | | Qwen3-VL-2B-Thinking-Q80.gguf | Q80 | 1.83GB | false | Extremely high quality, generally unneeded but max available quant. | | Qwen3-VL-2B-Thinking-Q6KL.gguf | Q6KL | 1.49GB | false | Uses Q80 for embed and output weights. Very high quality, near perfect, recommended. | | Qwen3-VL-2B-Thinking-Q6K.gguf | Q6K | 1.42GB | false | Very high quality, near perfect, recommended. | | Qwen3-VL-2B-Thinking-Q5KL.gguf | Q5KL | 1.33GB | false | Uses Q80 for embed and output weights. High quality, recommended. | | Qwen3-VL-2B-Thinking-Q5KM.gguf | Q5KM | 1.26GB | false | High quality, recommended. | | Qwen3-VL-2B-Thinking-Q5KS.gguf | Q5KS | 1.23GB | false | High quality, recommended. | | Qwen3-VL-2B-Thinking-Q4KL.gguf | Q4KL | 1.18GB | false | Uses Q80 for embed and output weights. Good quality, recommended. | | Qwen3-VL-2B-Thinking-Q41.gguf | Q41 | 1.14GB | false | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | Qwen3-VL-2B-Thinking-Q4KM.gguf | Q4KM | 1.11GB | false | Good quality, default size for most use cases, recommended. | | Qwen3-VL-2B-Thinking-Q3KXL.gguf | Q3KXL | 1.08GB | false | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | Qwen3-VL-2B-Thinking-Q4KS.gguf | Q4KS | 1.06GB | false | Slightly lower quality with more space savings, recommended. | | Qwen3-VL-2B-Thinking-Q40.gguf | Q40 | 1.06GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. | | Qwen3-VL-2B-Thinking-IQ4NL.gguf | IQ4NL | 1.05GB | false | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | Qwen3-VL-2B-Thinking-IQ4XS.gguf | IQ4XS | 1.01GB | false | Decent quality, smaller than Q4KS with similar performance, recommended. | | Qwen3-VL-2B-Thinking-Q3KL.gguf | Q3KL | 1.00GB | false | Lower quality but usable, good for low RAM availability. | | Qwen3-VL-2B-Thinking-Q3KM.gguf | Q3KM | 0.94GB | false | Low quality. | | Qwen3-VL-2B-Thinking-IQ3M.gguf | IQ3M | 0.90GB | false | Medium-low quality, new method with decent performance comparable to Q3KM. | | Qwen3-VL-2B-Thinking-Q3KS.gguf | Q3KS | 0.87GB | false | Low quality, not recommended. | | Qwen3-VL-2B-Thinking-Q2KL.gguf | Q2KL | 0.85GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | Qwen3-VL-2B-Thinking-IQ3XS.gguf | IQ3XS | 0.83GB | false | Lower quality, new method with decent performance, slightly better than Q3KS. | | Qwen3-VL-2B-Thinking-Q2K.gguf | Q2K | 0.78GB | false | Very low quality but surprisingly usable. | | Qwen3-VL-2B-Thinking-IQ3XXS.gguf | IQ3XXS | 0.75GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. | | Qwen3-VL-2B-Thinking-IQ2M.gguf | IQ2M | 0.70GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. 
| Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to. First, make sure you have hugginface-cli installed: If the model is bigger than 50GB, it will have been split into multiple files. In order to download them all to a local folder, run: You can either specify a new local-dir (QwenQwen3-VL-2B-Thinking-Q80) or download them all in place (./) Previously, you would download Q4044/48/88, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights. details in this PR. If you use Q40 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282 you will not be able to run the Q40XX files and will instead need to use Q40. Additionally, if you want to get slightly better quality for , you can use IQ4NL thanks to this PR which will also repack the weights for ARM, though only the 44 for now. The loading time may be slower but it will result in an overall speed incrase. I'm keeping this section to show the potential theoretical uplift in performance from using the Q40 with online repacking. Click to view benchmarks on an AVX2 system (EPYC7702) | model | size | params | backend | threads | test | t/s | % (vs Q40) | | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% | Q4088 offers a nice bump to prompt processing and a small bump to text generation A great write up with charts showing various performances is provided by Artefact2 here The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have. If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. 
If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB smaller than that total.

Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in the format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart. But basically, if you're aiming for below Q4 and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in the format IQXX, like IQ3M. They are newer and offer better performance for their size. The I-quants can also be used on CPU, but will be slower than their K-quant equivalents, so it's a speed-versus-quality tradeoff you'll have to decide on.

Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output weights.

Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
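The install and download commands referenced above are not reproduced on this page. A minimal sketch of what they usually look like, assuming the repo is bartowski/Qwen_Qwen3-VL-2B-Thinking-GGUF and that the actual on-disk filenames keep their underscores (e.g. Q4_K_M rather than Q4KM) — check the repo's file list before copying:

```bash
# Install the Hugging Face CLI (provides the `huggingface-cli` command)
pip install -U "huggingface_hub[cli]"

# Download a single quant file into the current directory
huggingface-cli download bartowski/Qwen_Qwen3-VL-2B-Thinking-GGUF \
  --include "Qwen3-VL-2B-Thinking-Q4_K_M.gguf" \
  --local-dir ./
```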

NaNK
1,595
1

EVA-Yi-1.5-9B-32K-V1-GGUF

NaNK
license:apache-2.0
1,590
5

Ministral-8B-Instruct-2410-HF-GGUF-TEST

NaNK
1,568
16

gemma-2-27b-it-SimPO-37K-GGUF

NaNK
1,563
11

Qwen2.5-72b-RP-Ink-GGUF

NaNK
1,560
6

gustavecortal_Beck-4B-GGUF

Llamacpp imatrix Quantizations of Beck-4B by gustavecortal Original model: https://huggingface.co/gustavecortal/Beck-4B All quants made using imatrix option with dataset from here combined with a subset of combinedallsmall.parquet from Ed Addario here Run them directly with llama.cpp, or any other llama.cpp based project No chat template specified so default is used. This may be incorrect, check original model card for details. | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | Beck-4B-bf16.gguf | bf16 | 8.05GB | false | Full BF16 weights. | | Beck-4B-Q80.gguf | Q80 | 4.28GB | false | Extremely high quality, generally unneeded but max available quant. | | Beck-4B-Q6KL.gguf | Q6KL | 3.40GB | false | Uses Q80 for embed and output weights. Very high quality, near perfect, recommended. | | Beck-4B-Q6K.gguf | Q6K | 3.31GB | false | Very high quality, near perfect, recommended. | | Beck-4B-Q5KL.gguf | Q5KL | 2.98GB | false | Uses Q80 for embed and output weights. High quality, recommended. | | Beck-4B-Q5KM.gguf | Q5KM | 2.89GB | false | High quality, recommended. | | Beck-4B-Q5KS.gguf | Q5KS | 2.82GB | false | High quality, recommended. | | Beck-4B-Q41.gguf | Q41 | 2.60GB | false | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | Beck-4B-Q4KL.gguf | Q4KL | 2.59GB | false | Uses Q80 for embed and output weights. Good quality, recommended. | | Beck-4B-Q4KM.gguf | Q4KM | 2.50GB | false | Good quality, default size for most use cases, recommended. | | Beck-4B-Q4KS.gguf | Q4KS | 2.38GB | false | Slightly lower quality with more space savings, recommended. | | Beck-4B-Q40.gguf | Q40 | 2.38GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. | | Beck-4B-IQ4NL.gguf | IQ4NL | 2.38GB | false | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | Beck-4B-Q3KXL.gguf | Q3KXL | 2.33GB | false | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | Beck-4B-IQ4XS.gguf | IQ4XS | 2.27GB | false | Decent quality, smaller than Q4KS with similar performance, recommended. | | Beck-4B-Q3KL.gguf | Q3KL | 2.24GB | false | Lower quality but usable, good for low RAM availability. | | Beck-4B-Q3KM.gguf | Q3KM | 2.08GB | false | Low quality. | | Beck-4B-IQ3M.gguf | IQ3M | 1.96GB | false | Medium-low quality, new method with decent performance comparable to Q3KM. | | Beck-4B-Q3KS.gguf | Q3KS | 1.89GB | false | Low quality, not recommended. | | Beck-4B-IQ3XS.gguf | IQ3XS | 1.81GB | false | Lower quality, new method with decent performance, slightly better than Q3KS. | | Beck-4B-Q2KL.gguf | Q2KL | 1.76GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | Beck-4B-IQ3XXS.gguf | IQ3XXS | 1.67GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. | | Beck-4B-Q2K.gguf | Q2K | 1.67GB | false | Very low quality but surprisingly usable. | | Beck-4B-IQ2M.gguf | IQ2M | 1.51GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. | Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to. First, make sure you have hugginface-cli installed: If the model is bigger than 50GB, it will have been split into multiple files. 
In order to download them all to a local folder, run: You can either specify a new local-dir (gustavecortalBeck-4B-Q80) or download them all in place (./) Previously, you would download Q4044/48/88, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights. details in this PR. If you use Q40 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282 you will not be able to run the Q40XX files and will instead need to use Q40. Additionally, if you want to get slightly better quality for , you can use IQ4NL thanks to this PR which will also repack the weights for ARM, though only the 44 for now. The loading time may be slower but it will result in an overall speed incrase. I'm keeping this section to show the potential theoretical uplift in performance from using the Q40 with online repacking. Click to view benchmarks on an AVX2 system (EPYC7702) | model | size | params | backend | threads | test | t/s | % (vs Q40) | | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% | Q4088 offers a nice bump to prompt processing and a small bump to text generation A great write up with charts showing various performances is provided by Artefact2 here The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have. If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB Smaller than that total. Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. 
These are in the format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart. But basically, if you're aiming for below Q4 and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in the format IQXX, like IQ3M. They are newer and offer better performance for their size. The I-quants can also be used on CPU, but will be slower than their K-quant equivalents, so it's a speed-versus-quality tradeoff you'll have to decide on.

Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output weights.

Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
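Since the card says these can be run directly with llama.cpp, here is a minimal, hypothetical invocation once a quant is downloaded (the filename is an assumption based on the table above):

```bash
# Chat with a downloaded quant using llama.cpp's CLI.
# -ngl 99 offloads as many layers as possible to the GPU; drop it for CPU-only runs.
./llama-cli -m ./Beck-4B-Q4_K_M.gguf -ngl 99 -p "Hello, how are you?"
```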

NaNK
1,557
1

ai21labs_AI21-Jamba-Mini-1.7-GGUF

Llamacpp imatrix Quantizations of AI21-Jamba-Mini-1.7 by ai21labs Original model: https://huggingface.co/ai21labs/AI21-Jamba-Mini-1.7 All quants made using imatrix option with dataset from here Run them directly with llama.cpp, or any other llama.cpp based project | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | AI21-Jamba-Mini-1.7-bf16.gguf | bf16 | 103.16GB | true | Full BF16 weights. | | AI21-Jamba-Mini-1.7-Q80.gguf | Q80 | 54.81GB | true | Extremely high quality, generally unneeded but max available quant. | | AI21-Jamba-Mini-1.7-Q6KL.gguf | Q6KL | 42.46GB | false | Uses Q80 for embed and output weights. Very high quality, near perfect, recommended. | | AI21-Jamba-Mini-1.7-Q6K.gguf | Q6K | 42.33GB | false | Very high quality, near perfect, recommended. | | AI21-Jamba-Mini-1.7-Q5KL.gguf | Q5KL | 36.75GB | false | Uses Q80 for embed and output weights. High quality, recommended. | | AI21-Jamba-Mini-1.7-Q5KM.gguf | Q5KM | 36.58GB | false | High quality, recommended. | | AI21-Jamba-Mini-1.7-Q5KS.gguf | Q5KS | 35.52GB | false | High quality, recommended. | | AI21-Jamba-Mini-1.7-Q41.gguf | Q41 | 32.32GB | false | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | AI21-Jamba-Mini-1.7-Q4KL.gguf | Q4KL | 31.38GB | false | Uses Q80 for embed and output weights. Good quality, recommended. | | AI21-Jamba-Mini-1.7-Q4KM.gguf | Q4KM | 31.18GB | false | Good quality, default size for most use cases, recommended. | | AI21-Jamba-Mini-1.7-Q4KS.gguf | Q4KS | 30.07GB | false | Slightly lower quality with more space savings, recommended. | | AI21-Jamba-Mini-1.7-Q40.gguf | Q40 | 29.59GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. | | AI21-Jamba-Mini-1.7-IQ4NL.gguf | IQ4NL | 29.12GB | false | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | AI21-Jamba-Mini-1.7-IQ4XS.gguf | IQ4XS | 27.52GB | false | Decent quality, smaller than Q4KS with similar performance, recommended. | | AI21-Jamba-Mini-1.7-Q3KXL.gguf | Q3KXL | 24.72GB | false | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | AI21-Jamba-Mini-1.7-Q3KL.gguf | Q3KL | 24.48GB | false | Lower quality but usable, good for low RAM availability. | | AI21-Jamba-Mini-1.7-Q3KM.gguf | Q3KM | 23.45GB | false | Low quality. | | AI21-Jamba-Mini-1.7-IQ3M.gguf | IQ3M | 23.33GB | false | Medium-low quality, new method with decent performance comparable to Q3KM. | | AI21-Jamba-Mini-1.7-Q3KS.gguf | Q3KS | 22.32GB | false | Low quality, not recommended. | | AI21-Jamba-Mini-1.7-IQ3XS.gguf | IQ3XS | 21.19GB | false | Lower quality, new method with decent performance, slightly better than Q3KS. | | AI21-Jamba-Mini-1.7-IQ3XXS.gguf | IQ3XXS | 20.24GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. | | AI21-Jamba-Mini-1.7-Q2KL.gguf | Q2KL | 18.24GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | AI21-Jamba-Mini-1.7-Q2K.gguf | Q2K | 17.98GB | false | Very low quality but surprisingly usable. | | AI21-Jamba-Mini-1.7-IQ2M.gguf | IQ2M | 16.24GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. | | AI21-Jamba-Mini-1.7-IQ2S.gguf | IQ2S | 14.41GB | false | Low quality, uses SOTA techniques to be usable. | | AI21-Jamba-Mini-1.7-IQ2XS.gguf | IQ2XS | 14.34GB | false | Low quality, uses SOTA techniques to be usable. 
| | AI21-Jamba-Mini-1.7-IQ2XXS.gguf | IQ2XXS | 12.48GB | false | Very low quality, uses SOTA techniques to be usable. | Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to. First, make sure you have hugginface-cli installed: If the model is bigger than 50GB, it will have been split into multiple files. In order to download them all to a local folder, run: You can either specify a new local-dir (ai21labsAI21-Jamba-Mini-1.7-Q80) or download them all in place (./) Previously, you would download Q4044/48/88, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights. details in this PR. If you use Q40 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282 you will not be able to run the Q40XX files and will instead need to use Q40. Additionally, if you want to get slightly better quality for , you can use IQ4NL thanks to this PR which will also repack the weights for ARM, though only the 44 for now. The loading time may be slower but it will result in an overall speed incrase. I'm keeping this section to show the potential theoretical uplift in performance from using the Q40 with online repacking. Click to view benchmarks on an AVX2 system (EPYC7702) | model | size | params | backend | threads | test | t/s | % (vs Q40) | | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% | Q4088 offers a nice bump to prompt processing and a small bump to text generation A great write up with charts showing various performances is provided by Artefact2 here The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have. 
If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB smaller than that total.

Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in the format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart. But basically, if you're aiming for below Q4 and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in the format IQXX, like IQ3M. They are newer and offer better performance for their size. The I-quants can also be used on CPU, but will be slower than their K-quant equivalents, so it's a speed-versus-quality tradeoff you'll have to decide on.

Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output weights.

Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
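The larger quants in this card (bf16 and Q80) are marked as split, so downloading by folder pattern is the usual approach. A sketch under the assumption that the repo is bartowski/ai21labs_AI21-Jamba-Mini-1.7-GGUF and that the shards sit in a per-quant subfolder with underscored names — verify against the repo's file list:

```bash
# Grab every shard of a split quant into a local folder
huggingface-cli download bartowski/ai21labs_AI21-Jamba-Mini-1.7-GGUF \
  --include "AI21-Jamba-Mini-1.7-Q8_0/*" \
  --local-dir ./
```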

NaNK
1,555
2

RekaAI_reka-flash-3.1-GGUF

NaNK
license:apache-2.0
1,554
1

Dolphin3.0-Qwen2.5-0.5B-GGUF

NaNK
license:apache-2.0
1,545
4

TheDrummer_Behemoth-ReduX-123B-v1.1-GGUF

Llamacpp imatrix Quantizations of Behemoth-ReduX-123B-v1.1 by TheDrummer Original model: https://huggingface.co/TheDrummer/Behemoth-ReduX-123B-v1.1 All quants made using imatrix option with dataset from here combined with a subset of combinedallsmall.parquet from Ed Addario here Run them directly with llama.cpp, or any other llama.cpp based project | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | Behemoth-ReduX-123B-v1.1-Q80.gguf | Q80 | 130.28GB | true | Extremely high quality, generally unneeded but max available quant. | | Behemoth-ReduX-123B-v1.1-Q6K.gguf | Q6K | 100.59GB | true | Very high quality, near perfect, recommended. | | Behemoth-ReduX-123B-v1.1-Q5KM.gguf | Q5KM | 86.49GB | true | High quality, recommended. | | Behemoth-ReduX-123B-v1.1-Q5KS.gguf | Q5KS | 84.36GB | true | High quality, recommended. | | Behemoth-ReduX-123B-v1.1-Q41.gguf | Q41 | 76.72GB | true | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | Behemoth-ReduX-123B-v1.1-Q4KL.gguf | Q4KL | 73.52GB | true | Uses Q80 for embed and output weights. Good quality, recommended. | | Behemoth-ReduX-123B-v1.1-Q4KM.gguf | Q4KM | 73.22GB | true | Good quality, default size for most use cases, recommended. | | Behemoth-ReduX-123B-v1.1-Q4KS.gguf | Q4KS | 69.57GB | true | Slightly lower quality with more space savings, recommended. | | Behemoth-ReduX-123B-v1.1-Q40.gguf | Q40 | 69.32GB | true | Legacy format, offers online repacking for ARM and AVX CPU inference. | | Behemoth-ReduX-123B-v1.1-IQ4NL.gguf | IQ4NL | 69.22GB | true | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | Behemoth-ReduX-123B-v1.1-IQ4XS.gguf | IQ4XS | 65.43GB | true | Decent quality, smaller than Q4KS with similar performance, recommended. | | Behemoth-ReduX-123B-v1.1-Q3KXL.gguf | Q3KXL | 64.91GB | true | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | Behemoth-ReduX-123B-v1.1-Q3KL.gguf | Q3KL | 64.55GB | true | Lower quality but usable, good for low RAM availability. | | Behemoth-ReduX-123B-v1.1-Q3KM.gguf | Q3KM | 59.10GB | true | Low quality. | | Behemoth-ReduX-123B-v1.1-IQ3M.gguf | IQ3M | 55.28GB | true | Medium-low quality, new method with decent performance comparable to Q3KM. | | Behemoth-ReduX-123B-v1.1-Q3KS.gguf | Q3KS | 52.85GB | true | Low quality, not recommended. | | Behemoth-ReduX-123B-v1.1-IQ3XS.gguf | IQ3XS | 50.14GB | true | Lower quality, new method with decent performance, slightly better than Q3KS. | | Behemoth-ReduX-123B-v1.1-IQ3XXS.gguf | IQ3XXS | 47.01GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. | | Behemoth-ReduX-123B-v1.1-Q2KL.gguf | Q2KL | 45.59GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | Behemoth-ReduX-123B-v1.1-Q2K.gguf | Q2K | 45.20GB | false | Very low quality but surprisingly usable. | | Behemoth-ReduX-123B-v1.1-IQ2M.gguf | IQ2M | 41.62GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. | | Behemoth-ReduX-123B-v1.1-IQ2S.gguf | IQ2S | 38.38GB | false | Low quality, uses SOTA techniques to be usable. | | Behemoth-ReduX-123B-v1.1-IQ2XS.gguf | IQ2XS | 36.08GB | false | Low quality, uses SOTA techniques to be usable. | | Behemoth-ReduX-123B-v1.1-IQ2XXS.gguf | IQ2XXS | 32.43GB | false | Very low quality, uses SOTA techniques to be usable. 
| | Behemoth-ReduX-123B-v1.1-IQ1M.gguf | IQ1M | 28.39GB | false | Extremely low quality, not recommended. | | Behemoth-ReduX-123B-v1.1-IQ1S.gguf | IQ1S | 25.96GB | false | Extremely low quality, not recommended. | Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to. First, make sure you have hugginface-cli installed: If the model is bigger than 50GB, it will have been split into multiple files. In order to download them all to a local folder, run: You can either specify a new local-dir (TheDrummerBehemoth-ReduX-123B-v1.1-Q80) or download them all in place (./) Previously, you would download Q4044/48/88, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights. details in this PR. If you use Q40 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282 you will not be able to run the Q40XX files and will instead need to use Q40. Additionally, if you want to get slightly better quality for , you can use IQ4NL thanks to this PR which will also repack the weights for ARM, though only the 44 for now. The loading time may be slower but it will result in an overall speed incrase. I'm keeping this section to show the potential theoretical uplift in performance from using the Q40 with online repacking. Click to view benchmarks on an AVX2 system (EPYC7702) | model | size | params | backend | threads | test | t/s | % (vs Q40) | | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% | Q4088 offers a nice bump to prompt processing and a small bump to text generation A great write up with charts showing various performances is provided by Artefact2 here The first thing to figure out is how big a model you can run. 
To do this, you'll need to figure out how much RAM and/or VRAM you have. If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB smaller than that total.

Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in the format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart. But basically, if you're aiming for below Q4 and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in the format IQXX, like IQ3M. They are newer and offer better performance for their size. The I-quants can also be used on CPU, but will be slower than their K-quant equivalents, so it's a speed-versus-quality tradeoff you'll have to decide on.

Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output weights.

Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
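The pp512/tg128-style rows in the benchmark table above come from llama.cpp's bench tool. A rough sketch of how that kind of comparison could be reproduced on your own hardware (the model path and thread count are placeholders; re-run with a different quant file to compare):

```bash
# Measure prompt processing (pp) and token generation (tg) throughput on CPU,
# using the same test sizes as the table above
./llama-bench -m ./model-Q4_0.gguf -p 512,1024,2048 -n 128,256,512 -t 64
```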

NaNK
1,543
1

glm-4-9b-chat-abliterated-GGUF

NaNK
1,541
13

Mistral-Large-Instruct-2407-GGUF

1,540
35

Pantheon-RP-1.6.1-12b-Nemo-GGUF

NaNK
license:apache-2.0
1,523
5

TheDrummer_Fallen-Gemma3-4B-v1-GGUF

NaNK
1,523
4

Meta-Llama-3.1-8B-Instruct-abliterated-GGUF

NaNK
base_model:mlabonne/Meta-Llama-3.1-8B-Instruct-abliterated
1,519
9

Llama3.2-3B-ShiningValiant2-GGUF

NaNK
llama
1,515
0

Delta-Vector_MS3.2-Austral-Winton-GGUF

NaNK
license:apache-2.0
1,513
0

DeepSeek-R1-ReDistill-Qwen-1.5B-v1.0-GGUF

NaNK
license:mit
1,512
7

LatitudeGames_Wayfarer-2-12B-GGUF

Llamacpp imatrix Quantizations of Wayfarer-2-12B by LatitudeGames Original model: https://huggingface.co/LatitudeGames/Wayfarer-2-12B All quants made using imatrix option with dataset from here combined with a subset of combinedallsmall.parquet from Ed Addario here Run them directly with llama.cpp, or any other llama.cpp based project | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | Wayfarer-2-12B-bf16.gguf | bf16 | 24.50GB | false | Full BF16 weights. | | Wayfarer-2-12B-Q80.gguf | Q80 | 13.02GB | false | Extremely high quality, generally unneeded but max available quant. | | Wayfarer-2-12B-Q6KL.gguf | Q6KL | 10.38GB | false | Uses Q80 for embed and output weights. Very high quality, near perfect, recommended. | | Wayfarer-2-12B-Q6K.gguf | Q6K | 10.06GB | false | Very high quality, near perfect, recommended. | | Wayfarer-2-12B-Q5KL.gguf | Q5KL | 9.14GB | false | Uses Q80 for embed and output weights. High quality, recommended. | | Wayfarer-2-12B-Q5KM.gguf | Q5KM | 8.73GB | false | High quality, recommended. | | Wayfarer-2-12B-Q5KS.gguf | Q5KS | 8.52GB | false | High quality, recommended. | | Wayfarer-2-12B-Q4KL.gguf | Q4KL | 7.98GB | false | Uses Q80 for embed and output weights. Good quality, recommended. | | Wayfarer-2-12B-Q41.gguf | Q41 | 7.80GB | false | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | Wayfarer-2-12B-Q4KM.gguf | Q4KM | 7.48GB | false | Good quality, default size for most use cases, recommended. | | Wayfarer-2-12B-Q3KXL.gguf | Q3KXL | 7.15GB | false | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | Wayfarer-2-12B-Q4KS.gguf | Q4KS | 7.12GB | false | Slightly lower quality with more space savings, recommended. | | Wayfarer-2-12B-IQ4NL.gguf | IQ4NL | 7.10GB | false | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | Wayfarer-2-12B-Q40.gguf | Q40 | 7.09GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. | | Wayfarer-2-12B-IQ4XS.gguf | IQ4XS | 6.74GB | false | Decent quality, smaller than Q4KS with similar performance, recommended. | | Wayfarer-2-12B-Q3KL.gguf | Q3KL | 6.56GB | false | Lower quality but usable, good for low RAM availability. | | Wayfarer-2-12B-Q3KM.gguf | Q3KM | 6.08GB | false | Low quality. | | Wayfarer-2-12B-IQ3M.gguf | IQ3M | 5.72GB | false | Medium-low quality, new method with decent performance comparable to Q3KM. | | Wayfarer-2-12B-Q3KS.gguf | Q3KS | 5.53GB | false | Low quality, not recommended. | | Wayfarer-2-12B-Q2KL.gguf | Q2KL | 5.45GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | Wayfarer-2-12B-IQ3XS.gguf | IQ3XS | 5.31GB | false | Lower quality, new method with decent performance, slightly better than Q3KS. | | Wayfarer-2-12B-IQ3XXS.gguf | IQ3XXS | 4.95GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. | | Wayfarer-2-12B-Q2K.gguf | Q2K | 4.79GB | false | Very low quality but surprisingly usable. | | Wayfarer-2-12B-IQ2M.gguf | IQ2M | 4.44GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. | | Wayfarer-2-12B-IQ2S.gguf | IQ2S | 4.14GB | false | Low quality, uses SOTA techniques to be usable. | Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to. 
First, make sure you have hugginface-cli installed: If the model is bigger than 50GB, it will have been split into multiple files. In order to download them all to a local folder, run: You can either specify a new local-dir (LatitudeGamesWayfarer-2-12B-Q80) or download them all in place (./) Previously, you would download Q4044/48/88, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights. details in this PR. If you use Q40 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282 you will not be able to run the Q40XX files and will instead need to use Q40. Additionally, if you want to get slightly better quality for , you can use IQ4NL thanks to this PR which will also repack the weights for ARM, though only the 44 for now. The loading time may be slower but it will result in an overall speed incrase. I'm keeping this section to show the potential theoretical uplift in performance from using the Q40 with online repacking. Click to view benchmarks on an AVX2 system (EPYC7702) | model | size | params | backend | threads | test | t/s | % (vs Q40) | | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% | Q4088 offers a nice bump to prompt processing and a small bump to text generation A great write up with charts showing various performances is provided by Artefact2 here The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have. If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB Smaller than that total. 
Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in the format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart. But basically, if you're aiming for below Q4 and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in the format IQXX, like IQ3M. They are newer and offer better performance for their size. The I-quants can also be used on CPU, but will be slower than their K-quant equivalents, so it's a speed-versus-quality tradeoff you'll have to decide on.

Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output weights.

Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
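A quick way to check the "total VRAM" figure that the sizing advice above refers to, on NVIDIA GPUs (a sketch, not part of the original card):

```bash
# Total VRAM per GPU in MiB; pick a quant roughly 1-2GB smaller than this
nvidia-smi --query-gpu=name,memory.total --format=csv
```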

NaNK
1,509
2

Palmyra-Med-70B-32K-GGUF

NaNK
1,503
3

Llama-Song-Stream-3B-Instruct-GGUF

NaNK
Llama3.2
1,502
2

TildeAI_TildeOpen-30b-GGUF

NaNK
1,500
3

huihui-ai_Qwen3-14B-abliterated-GGUF

NaNK
license:apache-2.0
1,498
8

WizardLM-2-8x22B-GGUF

NaNK
license:apache-2.0
1,494
12

Skywork_Skywork-R1V3-38B-GGUF

NaNK
license:mit
1,488
6

Delta-Vector_Plesio-70B-GGUF

NaNK
llama
1,469
0

deepthought-8b-llama-v0.01-alpha-GGUF

NaNK
base_model:ruliad/deepthought-8b-llama-v0.01-alpha
1,468
4

huihui-ai_gemma-3-1b-it-abliterated-GGUF

NaNK
1,468
4

aws-prototyping_codefu-7b-v0.1-GGUF

NaNK
1,459
1

LGAI-EXAONE_EXAONE-4.0-32B-GGUF

NaNK
1,452
3

Llama-3.1-8B-ArliAI-RPMax-v1.1-GGUF

NaNK
base_model:ArliAI/Llama-3.1-8B-ArliAI-RPMax-v1.1
1,450
1

writing-roleplay-20k-context-nemo-12b-v1.0-GGUF

NaNK
1,449
12

Gryphe_Pantheon-Proto-RP-1.8-30B-A3B-GGUF

NaNK
license:apache-2.0
1,446
9

Phi-3-medium-4k-instruct-GGUF

license:mit
1,441
37

zerofata_MS3.2-PaintedFantasy-Visage-33B-GGUF

NaNK
license:apache-2.0
1,435
3

Rombo-Org_Rombo-LLM-V3.0-Qwen-32b-GGUF

NaNK
license:apache-2.0
1,435
0

Menlo_Jan-nano-GGUF

license:apache-2.0
1,416
8

Vikhr-Nemo-12B-Instruct-R-21-09-24-GGUF

NaNK
license:apache-2.0
1,412
14

TheDrummer_Behemoth-ReduX-123B-v1-GGUF

NaNK
1,412
3

Replete-Coder-V2-Llama-3.1-8b-GGUF

NaNK
license:apache-2.0
1,412
2

kalomaze_Qwen3-16B-A3B-GGUF

NaNK
license:apache-2.0
1,411
3

microsoft_Phi-4-mini-reasoning-GGUF

license:mit
1,401
11

DeepSeek-V2.5-GGUF

NaNK
1,398
42

nvidia_OpenReasoning-Nemotron-14B-GGUF

NaNK
1,393
4

inclusionAI_Ling-1T-GGUF

Llamacpp imatrix Quantizations of Ling-1T by inclusionAI Original model: https://huggingface.co/inclusionAI/Ling-1T All quants made using imatrix option with dataset from here combined with a subset of combinedallsmall.parquet from Ed Addario here Run them directly with llama.cpp, or any other llama.cpp based project | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | Ling-1T-Q41.gguf | Q41 | 626.57GB | true | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | Ling-1T-Q4KM.gguf | Q4KM | 608.42GB | true | Good quality, default size for most use cases, recommended. | | Ling-1T-Q40.gguf | Q40 | 574.66GB | true | Legacy format, offers online repacking for ARM and AVX CPU inference. | | Ling-1T-IQ4NL.gguf | IQ4NL | 565.09GB | true | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | Ling-1T-IQ4XS.gguf | IQ4XS | 534.18GB | true | Decent quality, smaller than Q4KS with similar performance, recommended. | | Ling-1T-Q3KXL.gguf | Q3KXL | 476.60GB | true | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | Ling-1T-Q3KL.gguf | Q3KL | 475.47GB | true | Lower quality but usable, good for low RAM availability. | | Ling-1T-Q3KM.gguf | Q3KM | 456.46GB | true | Low quality. | | Ling-1T-IQ3M.gguf | IQ3M | 456.38GB | true | Medium-low quality, new method with decent performance comparable to Q3KM. | | Ling-1T-Q3KS.gguf | Q3KS | 433.73GB | true | Low quality, not recommended. | | Ling-1T-IQ3XS.gguf | IQ3XS | 409.57GB | true | Lower quality, new method with decent performance, slightly better than Q3KS. | | Ling-1T-IQ3XXS.gguf | IQ3XXS | 394.91GB | true | Lower quality, new method with decent performance, comparable to Q3 quants. | | Ling-1T-Q2KL.gguf | Q2KL | 351.17GB | true | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | Ling-1T-Q2K.gguf | Q2K | 349.92GB | true | Very low quality but surprisingly usable. | | Ling-1T-IQ2M.gguf | IQ2M | 316.09GB | true | Relatively low quality, uses SOTA techniques to be surprisingly usable. | | Ling-1T-IQ2S.gguf | IQ2S | 277.97GB | true | Low quality, uses SOTA techniques to be usable. | | Ling-1T-IQ2XS.gguf | IQ2XS | 277.17GB | true | Low quality, uses SOTA techniques to be usable. | | Ling-1T-IQ2XXS.gguf | IQ2XXS | 240.46GB | true | Very low quality, uses SOTA techniques to be usable. | | Ling-1T-IQ1M.gguf | IQ1M | 215.36GB | true | Extremely low quality, not recommended. | | Ling-1T-IQ1S.gguf | IQ1S | 206.10GB | true | Extremely low quality, not recommended. | Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to. First, make sure you have hugginface-cli installed: If the model is bigger than 50GB, it will have been split into multiple files. In order to download them all to a local folder, run: You can either specify a new local-dir (inclusionAILing-1T-Q80) or download them all in place (./) Previously, you would download Q4044/48/88, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights. details in this PR. If you use Q40 and your hardware would benefit from repacking weights, it will do it automatically on the fly. 
As of llama.cpp build b4282 you will not be able to run the Q40XX files and will instead need to use Q40. Additionally, if you want to get slightly better quality for , you can use IQ4NL thanks to this PR which will also repack the weights for ARM, though only the 44 for now. The loading time may be slower but it will result in an overall speed incrase. I'm keeping this section to show the potential theoretical uplift in performance from using the Q40 with online repacking. Click to view benchmarks on an AVX2 system (EPYC7702) | model | size | params | backend | threads | test | t/s | % (vs Q40) | | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% | Q4088 offers a nice bump to prompt processing and a small bump to text generation A great write up with charts showing various performances is provided by Artefact2 here The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have. If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB Smaller than that total. Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart: But basically, if you're aiming for below Q4, and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in format IQXX, like IQ3M. These are newer and offer better performance for their size. These I-quants can also be used on CPU, but will be slower than their K-quant equivalent, so speed vs performance is a tradeoff you'll have to decide. 
Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output weights.

Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
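Every quant in this card is split across multiple files. llama.cpp generally only needs to be pointed at the first shard and will pick up the remaining parts from the same folder; a hypothetical example (the shard names and count are assumptions, check the actual file list):

```bash
# Point llama.cpp at the first shard; the remaining NNNNN-of-NNNNN shards
# in the same directory are loaded automatically
./llama-cli -m ./Ling-1T-IQ2_XXS/Ling-1T-IQ2_XXS-00001-of-00006.gguf -p "Hello"
```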

license:mit
1,393
0

yanolja_YanoljaNEXT-Rosetta-12B-2510-GGUF

Llamacpp imatrix Quantizations of YanoljaNEXT-Rosetta-12B-2510 by yanolja Original model: https://huggingface.co/yanolja/YanoljaNEXT-Rosetta-12B-2510 All quants made using imatrix option with dataset from here combined with a subset of combinedallsmall.parquet from Ed Addario here Run them directly with llama.cpp, or any other llama.cpp based project No chat template specified so default is used. This may be incorrect, check original model card for details. | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | YanoljaNEXT-Rosetta-12B-2510-bf16.gguf | bf16 | 25.55GB | false | Full BF16 weights. | | YanoljaNEXT-Rosetta-12B-2510-Q80.gguf | Q80 | 13.58GB | false | Extremely high quality, generally unneeded but max available quant. | | YanoljaNEXT-Rosetta-12B-2510-Q6KL.gguf | Q6KL | 10.97GB | false | Uses Q80 for embed and output weights. Very high quality, near perfect, recommended. | | YanoljaNEXT-Rosetta-12B-2510-Q6K.gguf | Q6K | 10.49GB | false | Very high quality, near perfect, recommended. | | YanoljaNEXT-Rosetta-12B-2510-Q5KL.gguf | Q5KL | 9.76GB | false | Uses Q80 for embed and output weights. High quality, recommended. | | YanoljaNEXT-Rosetta-12B-2510-Q5KM.gguf | Q5KM | 9.14GB | false | High quality, recommended. | | YanoljaNEXT-Rosetta-12B-2510-Q5KS.gguf | Q5KS | 8.92GB | false | High quality, recommended. | | YanoljaNEXT-Rosetta-12B-2510-Q4KL.gguf | Q4KL | 8.61GB | false | Uses Q80 for embed and output weights. Good quality, recommended. | | YanoljaNEXT-Rosetta-12B-2510-Q41.gguf | Q41 | 8.19GB | false | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | YanoljaNEXT-Rosetta-12B-2510-Q4KM.gguf | Q4KM | 7.87GB | false | Good quality, default size for most use cases, recommended. | | YanoljaNEXT-Rosetta-12B-2510-Q3KXL.gguf | Q3KXL | 7.79GB | false | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | YanoljaNEXT-Rosetta-12B-2510-Q4KS.gguf | Q4KS | 7.50GB | false | Slightly lower quality with more space savings, recommended. | | YanoljaNEXT-Rosetta-12B-2510-Q40.gguf | Q40 | 7.48GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. | | YanoljaNEXT-Rosetta-12B-2510-IQ4NL.gguf | IQ4NL | 7.45GB | false | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | YanoljaNEXT-Rosetta-12B-2510-IQ4XS.gguf | IQ4XS | 7.09GB | false | Decent quality, smaller than Q4KS with similar performance, recommended. | | YanoljaNEXT-Rosetta-12B-2510-Q3KL.gguf | Q3KL | 6.91GB | false | Lower quality but usable, good for low RAM availability. | | YanoljaNEXT-Rosetta-12B-2510-Q3KM.gguf | Q3KM | 6.44GB | false | Low quality. | | YanoljaNEXT-Rosetta-12B-2510-IQ3M.gguf | IQ3M | 6.09GB | false | Medium-low quality, new method with decent performance comparable to Q3KM. | | YanoljaNEXT-Rosetta-12B-2510-Q2KL.gguf | Q2KL | 6.08GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | YanoljaNEXT-Rosetta-12B-2510-Q3KS.gguf | Q3KS | 5.89GB | false | Low quality, not recommended. | | YanoljaNEXT-Rosetta-12B-2510-IQ3XS.gguf | IQ3XS | 5.64GB | false | Lower quality, new method with decent performance, slightly better than Q3KS. | | YanoljaNEXT-Rosetta-12B-2510-IQ3XXS.gguf | IQ3XXS | 5.22GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. 
| | YanoljaNEXT-Rosetta-12B-2510-Q2K.gguf | Q2K | 5.10GB | false | Very low quality but surprisingly usable. | | YanoljaNEXT-Rosetta-12B-2510-IQ2M.gguf | IQ2M | 4.74GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. | | YanoljaNEXT-Rosetta-12B-2510-IQ2S.gguf | IQ2S | 4.45GB | false | Low quality, uses SOTA techniques to be usable. | Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to. First, make sure you have hugginface-cli installed: If the model is bigger than 50GB, it will have been split into multiple files. In order to download them all to a local folder, run: You can either specify a new local-dir (yanoljaYanoljaNEXT-Rosetta-12B-2510-Q80) or download them all in place (./) Previously, you would download Q4044/48/88, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights. details in this PR. If you use Q40 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282 you will not be able to run the Q40XX files and will instead need to use Q40. Additionally, if you want to get slightly better quality for , you can use IQ4NL thanks to this PR which will also repack the weights for ARM, though only the 44 for now. The loading time may be slower but it will result in an overall speed incrase. I'm keeping this section to show the potential theoretical uplift in performance from using the Q40 with online repacking. Click to view benchmarks on an AVX2 system (EPYC7702) | model | size | params | backend | threads | test | t/s | % (vs Q40) | | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% | Q4088 offers a nice bump to prompt processing and a small bump to text generation A great write up with charts showing various 
performance comparisons is provided by Artefact2 here.

The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have. If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB smaller than that total.

Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in the format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart. But basically, if you're aiming for below Q4 and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in the format IQXX, like IQ3M. They are newer and offer better performance for their size. The I-quants can also be used on CPU, but will be slower than their K-quant equivalents, so it's a speed-versus-quality tradeoff you'll have to decide on.

Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output weights.

Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
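If you'd rather serve the model over an HTTP endpoint instead of the interactive CLI, llama.cpp ships a server binary that exposes an OpenAI-compatible API. A minimal sketch, with the filename and port as placeholders:

```bash
# Serve the quant locally; -ngl 99 offloads layers to the GPU, drop it for CPU-only
./llama-server -m ./YanoljaNEXT-Rosetta-12B-2510-Q4_K_M.gguf -ngl 99 --port 8080
```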

NaNK
1,392
2

TQ2.5-14B-Sugarquill-v1-GGUF

NaNK
license:apache-2.0
1,392
2

Gryphe_Pantheon-RP-1.8-24b-Small-3.1-GGUF

NaNK
license:apache-2.0
1,387
18

dolphin-2.9.4-gemma2-2b-GGUF

NaNK
1,385
13

Llama-Sentient-3.2-3B-Instruct-GGUF

NaNK
Llama
1,383
3

all-hands_openhands-lm-32b-v0.1-GGUF

NaNK
license:mit
1,371
40

Qwen2.5-Coder-3B-Instruct-abliterated-GGUF

NaNK
1,368
4

google_gemma-3-1b-it-qat-GGUF

NaNK
1,363
6

zerofata_L3.3-GeneticLemonade-Opus-70B-GGUF

NaNK
license:llama3
1,362
1

Yi-Coder-9B-Chat-GGUF

NaNK
license:apache-2.0
1,357
19

granite-3.1-2b-instruct-GGUF

NaNK
license:apache-2.0
1,351
4

Qwen2-VL-7B-Instruct-abliterated-GGUF

NaNK
license:apache-2.0
1,348
11

Llama3.1-8B-Cobalt-GGUF

NaNK
llama
1,347
2

mistralai_Magistral-Small-2507-GGUF

1,343
10

Qwen2.5-7B-Instruct-1M-GGUF

NaNK
license:apache-2.0
1,340
36

Anubis-70B-v1-GGUF

NaNK
1,339
13

Mistral-Nemo-Prism-12B-GGUF

NaNK
license:apache-2.0
1,339
1

microsoft_Phi-4-reasoning-plus-GGUF

license:mit
1,332
11

TheDrummer_Fallen-Llama-3.3-R1-70B-v1-GGUF

NaNK
base_model:TheDrummer/Fallen-Llama-3.3-R1-70B-v1
1,327
6

google_gemma-3-270m-it-qat-GGUF

1,326
1

Sky-T1-32B-Preview-GGUF

NaNK
1,314
82

Dans-PersonalityEngine-v1.0.0-8b-GGUF

NaNK
license:apache-2.0
1,314
1

Rombos-Coder-V2.5-Qwen-14b-GGUF

NaNK
license:apache-2.0
1,308
7

Falcon3-1B-Instruct-GGUF

NaNK
1,301
1

EVA-LLaMA-3.33-70B-v0.1-GGUF

NaNK
base_model:EVA-UNIT-01/EVA-LLaMA-3.33-70B-v0.1
1,301
0

LGAI-EXAONE_EXAONE-Deep-32B-GGUF

NaNK
1,296
3

granite-3.1-3b-a800m-instruct-GGUF

NaNK
license:apache-2.0
1,294
7

deepcogito_cogito-v1-preview-qwen-14B-GGUF

NaNK
license:apache-2.0
1,291
12

INTELLECT-1-Instruct-GGUF

NaNK
dataset:arcee-ai/Llama-405B-Logits
1,289
6

soob3123_amoral-gemma3-12B-GGUF

NaNK
license:apache-2.0
1,286
8

InfiniAILab_QwQ-0.5B-GGUF

NaNK
1,282
1

Steelskull_L3.3-Cu-Mai-R1-70b-GGUF

NaNK
license:llama3.3
1,281
2

ibm-granite_granite-3.2-2b-instruct-GGUF

NaNK
license:apache-2.0
1,279
4

google_txgemma-9b-chat-GGUF

NaNK
1,279
3

Steelskull_L3.3-Shakudo-70b-GGUF

NaNK
license:llama3.3
1,278
3

Qwentile2.5-32B-Instruct-GGUF

NaNK
license:apache-2.0
1,277
5

arcee-ai_Arcee-Maestro-7B-Preview-GGUF

NaNK
license:apache-2.0
1,271
6

Llama-3.1-8B-Lexi-Uncensored-GGUF

NaNK
base_model:Orenguteng/Llama-3.1-8B-Lexi-Uncensored
1,271
4

NousResearch_DeepHermes-3-Mistral-24B-Preview-GGUF

NaNK
license:apache-2.0
1,270
10

smirki_UIGEN-T1.1-Qwen-14B-GGUF

NaNK
license:apache-2.0
1,266
2

c4ai-command-r-08-2024-GGUF

license:cc-by-nc-4.0
1,264
49

THU-KEG_LongWriter-Zero-32B-GGUF

NaNK
license:apache-2.0
1,264
4

arcee-ai_AFM-4.5B-GGUF

NaNK
license:apache-2.0
1,261
3

soob3123_Veritas-12B-GGUF

NaNK
1,261
2

nvidia_Llama-3_3-Nemotron-Super-49B-GenRM-Multilingual-GGUF

NaNK
llama3.3
1,260
2

Mistral-Large-Instruct-2411-GGUF

1,257
29

AutoCoder-GGUF

license:apache-2.0
1,253
7

Gemma-2-9B-It-SPPO-Iter3-GGUF

NaNK
1,252
56

Ichigo-llama3.1-s-instruct-v0.4-GGUF

NaNK
base_model:Menlo/Ichigo-llama3.1-s-instruct-v0.4
1,248
2

Nemotron-Mini-4B-Instruct-GGUF

NaNK
1,247
15

PocketDoc_Dans-SakuraKaze-V1.0.0-12b-GGUF

NaNK
license:apache-2.0
1,242
3

MS-Schisandra-22B-v0.3-GGUF

NaNK
1,239
1

open-thoughts_OpenThinker3-7B-GGUF

NaNK
llama-factory
1,235
15

Chocolatine-3B-Instruct-DPO-v1.2-GGUF

NaNK
license:mit
1,229
1

EXAONE-3.5-32B-Instruct-GGUF

NaNK
1,221
8

TheDrummer_Snowpiercer-15B-v4-GGUF

NaNK
1,215
3

Llama-OpenReviewer-8B-GGUF

NaNK
base_model:maxidl/Llama-OpenReviewer-8B
1,214
1

PKU-DS-LAB_FairyR1-32B-GGUF

NaNK
license:apache-2.0
1,211
2

gustavecortal_Beck-1.7B-GGUF

Llamacpp imatrix quantizations of Beck-1.7B by gustavecortal.

Original model: https://huggingface.co/gustavecortal/Beck-1.7B

All quants were made using the imatrix option with the dataset from here, combined with a subset of combinedallsmall.parquet from Ed Addario here. Run them directly with llama.cpp, or any other llama.cpp based project.

No chat template is specified, so the default is used. This may be incorrect; check the original model card for details.

| Filename | Quant type | File Size | Split | Description |
| -------- | ---------- | --------- | ----- | ----------- |
| Beck-1.7B-bf16.gguf | bf16 | 3.45GB | false | Full BF16 weights. |
| Beck-1.7B-Q80.gguf | Q80 | 1.83GB | false | Extremely high quality, generally unneeded but max available quant. |
| Beck-1.7B-Q6KL.gguf | Q6KL | 1.49GB | false | Uses Q80 for embed and output weights. Very high quality, near perfect, recommended. |
| Beck-1.7B-Q6K.gguf | Q6K | 1.42GB | false | Very high quality, near perfect, recommended. |
| Beck-1.7B-Q5KL.gguf | Q5KL | 1.33GB | false | Uses Q80 for embed and output weights. High quality, recommended. |
| Beck-1.7B-Q5KM.gguf | Q5KM | 1.26GB | false | High quality, recommended. |
| Beck-1.7B-Q5KS.gguf | Q5KS | 1.23GB | false | High quality, recommended. |
| Beck-1.7B-Q4KL.gguf | Q4KL | 1.18GB | false | Uses Q80 for embed and output weights. Good quality, recommended. |
| Beck-1.7B-Q41.gguf | Q41 | 1.14GB | false | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. |
| Beck-1.7B-Q4KM.gguf | Q4KM | 1.11GB | false | Good quality, default size for most use cases, recommended. |
| Beck-1.7B-Q3KXL.gguf | Q3KXL | 1.08GB | false | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. |
| Beck-1.7B-Q4KS.gguf | Q4KS | 1.06GB | false | Slightly lower quality with more space savings, recommended. |
| Beck-1.7B-Q40.gguf | Q40 | 1.06GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. |
| Beck-1.7B-IQ4NL.gguf | IQ4NL | 1.05GB | false | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. |
| Beck-1.7B-IQ4XS.gguf | IQ4XS | 1.01GB | false | Decent quality, smaller than Q4KS with similar performance, recommended. |
| Beck-1.7B-Q3KL.gguf | Q3KL | 1.00GB | false | Lower quality but usable, good for low RAM availability. |
| Beck-1.7B-Q3KM.gguf | Q3KM | 0.94GB | false | Low quality. |
| Beck-1.7B-IQ3M.gguf | IQ3M | 0.90GB | false | Medium-low quality, new method with decent performance comparable to Q3KM. |
| Beck-1.7B-Q3KS.gguf | Q3KS | 0.87GB | false | Low quality, not recommended. |
| Beck-1.7B-Q2KL.gguf | Q2KL | 0.85GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. |
| Beck-1.7B-IQ3XS.gguf | IQ3XS | 0.83GB | false | Lower quality, new method with decent performance, slightly better than Q3KS. |
| Beck-1.7B-Q2K.gguf | Q2K | 0.78GB | false | Very low quality but surprisingly usable. |
| Beck-1.7B-IQ3XXS.gguf | IQ3XXS | 0.75GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. |
| Beck-1.7B-IQ2M.gguf | IQ2M | 0.70GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. |

Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to.

First, make sure you have huggingface-cli installed. If the model is bigger than 50GB, it will have been split into multiple files. To download them all to a local folder, run the huggingface-cli download command and either specify a new local-dir (gustavecortalBeck-1.7B-Q80) or download them all in place (./); a minimal download sketch follows at the end of this card.

Previously, you would download Q4044/Q4048/Q4088, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights; details are in this PR. If you use Q40 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282 you will not be able to run the Q40XX files and will instead need to use Q40. Additionally, if you want to get slightly better quality, you can use IQ4NL thanks to this PR, which will also repack the weights for ARM, though only the 44 variant for now. The loading time may be slower, but it will result in an overall speed increase.

I'm keeping this section to show the potential theoretical uplift in performance from using Q40 with online repacking.

Click to view benchmarks on an AVX2 system (EPYC7702):

| model | size | params | backend | threads | test | t/s | % (vs Q40) |
| ----- | ---: | -----: | ------- | ------: | ---- | --: | ---------: |
| qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% |
| qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% |
| qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% |
| qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% |
| qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% |
| qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% |
| qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% |
| qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% |
| qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% |
| qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% |
| qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% |
| qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% |
| qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% |
| qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% |
| qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% |
| qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% |
| qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% |
| qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% |

Q4088 offers a nice bump to prompt processing and a small bump to text generation.

A great write-up with charts showing various performance comparisons is provided by Artefact2 here.

The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have. If you want your model running as FAST as possible, you'll want to fit the whole thing in your GPU's VRAM: aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. If you want the absolute maximum quality, add your system RAM and your GPU's VRAM together, then grab a quant with a file size 1-2GB smaller than that total.

Next, you'll need to decide whether you want to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in the format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart. But basically, if you're aiming for below Q4 and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in the format IQXX, like IQ3M. They are newer and offer better performance for their size. The I-quants can also be used on CPU, but they will be slower than their K-quant equivalents, so speed vs performance is a trade-off you'll have to decide.

Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output.

Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
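As a reference for the download step above, here is a minimal sketch using the huggingface_hub Python client. The repo id and filename are assumptions based on this listing (the real files use the exact names from the table above), so verify them on the model page before running.

```python
# Minimal download sketch (assumed repo id and filename; check the model page).
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="bartowski/gustavecortal_Beck-1.7B-GGUF",  # assumed repo id
    filename="Beck-1.7B-Q4_K_M.gguf",                  # assumed quant filename
    local_dir="gustavecortal_Beck-1.7B-GGUF",          # or "." to download in place
)
print("Downloaded to:", path)
```

The same client also exposes snapshot_download, which is handy when a quant is split across several shards.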

NaNK
1,211
0

CohereForAI_c4ai-command-a-03-2025-GGUF

license:cc-by-nc-4.0
1,210
15

Hermes-3-Llama-3.1-8B-lorablated-GGUF

NaNK
base_model:mlabonne/Hermes-3-Llama-3.1-8B-lorablated
1,210
9

Delta-Vector_Austral-Xgen-9B-Winton-GGUF

NaNK
license:apache-2.0
1,206
1

Reflection-Llama-3.1-70B-GGUF

NaNK
base_model:mattshumer/Reflection-Llama-3.1-70B
1,204
53

EXAONE-3.5-7.8B-Instruct-GGUF

NaNK
1,204
16

TheDrummer_Rivermind-Lux-12B-v1-GGUF

NaNK
1,200
4

Steelskull_L3.3-Mokume-Gane-R1-70b-v1.1-GGUF

NaNK
license:llama3.3
1,197
1

TheDrummer_Gemma-3-R1-27B-v1-GGUF

Llamacpp imatrix quantizations of Gemma-3-R1-27B-v1 by TheDrummer.

Original model: https://huggingface.co/TheDrummer/Gemma-3-R1-27B-v1

All quants were made using the imatrix option with the dataset from here, combined with a subset of combinedallsmall.parquet from Ed Addario here. Run them directly with llama.cpp, or any other llama.cpp based project.

| Filename | Quant type | File Size | Split | Description |
| -------- | ---------- | --------- | ----- | ----------- |
| Gemma-3-R1-27B-v1-bf16.gguf | bf16 | 54.03GB | true | Full BF16 weights. |
| Gemma-3-R1-27B-v1-Q80.gguf | Q80 | 28.71GB | false | Extremely high quality, generally unneeded but max available quant. |
| Gemma-3-R1-27B-v1-Q6KL.gguf | Q6KL | 22.51GB | false | Uses Q80 for embed and output weights. Very high quality, near perfect, recommended. |
| Gemma-3-R1-27B-v1-Q6K.gguf | Q6K | 22.17GB | false | Very high quality, near perfect, recommended. |
| Gemma-3-R1-27B-v1-Q5KL.gguf | Q5KL | 19.61GB | false | Uses Q80 for embed and output weights. High quality, recommended. |
| Gemma-3-R1-27B-v1-Q5KM.gguf | Q5KM | 19.27GB | false | High quality, recommended. |
| Gemma-3-R1-27B-v1-Q5KS.gguf | Q5KS | 18.77GB | false | High quality, recommended. |
| Gemma-3-R1-27B-v1-Q41.gguf | Q41 | 17.17GB | false | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. |
| Gemma-3-R1-27B-v1-Q4KL.gguf | Q4KL | 16.89GB | false | Uses Q80 for embed and output weights. Good quality, recommended. |
| Gemma-3-R1-27B-v1-Q4KM.gguf | Q4KM | 16.55GB | false | Good quality, default size for most use cases, recommended. |
| Gemma-3-R1-27B-v1-Q4KS.gguf | Q4KS | 15.67GB | false | Slightly lower quality with more space savings, recommended. |
| Gemma-3-R1-27B-v1-Q40.gguf | Q40 | 15.62GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. |
| Gemma-3-R1-27B-v1-IQ4NL.gguf | IQ4NL | 15.57GB | false | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. |
| Gemma-3-R1-27B-v1-Q3KXL.gguf | Q3KXL | 14.88GB | false | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. |
| Gemma-3-R1-27B-v1-IQ4XS.gguf | IQ4XS | 14.77GB | false | Decent quality, smaller than Q4KS with similar performance, recommended. |
| Gemma-3-R1-27B-v1-Q3KL.gguf | Q3KL | 14.54GB | false | Lower quality but usable, good for low RAM availability. |
| Gemma-3-R1-27B-v1-Q3KM.gguf | Q3KM | 13.44GB | false | Low quality. |
| Gemma-3-R1-27B-v1-IQ3M.gguf | IQ3M | 12.55GB | false | Medium-low quality, new method with decent performance comparable to Q3KM. |
| Gemma-3-R1-27B-v1-Q3KS.gguf | Q3KS | 12.17GB | false | Low quality, not recommended. |
| Gemma-3-R1-27B-v1-IQ3XS.gguf | IQ3XS | 11.56GB | false | Lower quality, new method with decent performance, slightly better than Q3KS. |
| Gemma-3-R1-27B-v1-Q2KL.gguf | Q2KL | 10.85GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. |
| Gemma-3-R1-27B-v1-IQ3XXS.gguf | IQ3XXS | 10.72GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. |
| Gemma-3-R1-27B-v1-Q2K.gguf | Q2K | 10.50GB | false | Very low quality but surprisingly usable. |
| Gemma-3-R1-27B-v1-IQ2M.gguf | IQ2M | 9.49GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. |
| Gemma-3-R1-27B-v1-IQ2S.gguf | IQ2S | 8.78GB | false | Low quality, uses SOTA techniques to be usable. |
| Gemma-3-R1-27B-v1-IQ2XS.gguf | IQ2XS | 8.44GB | false | Low quality, uses SOTA techniques to be usable. |
| Gemma-3-R1-27B-v1-IQ2XXS.gguf | IQ2XXS | 7.69GB | false | Very low quality, uses SOTA techniques to be usable. |

Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to.

First, make sure you have huggingface-cli installed. If the model is bigger than 50GB, it will have been split into multiple files. To download them all to a local folder, run the huggingface-cli download command and either specify a new local-dir (TheDrummerGemma-3-R1-27B-v1-Q80) or download them all in place (./); a sketch for fetching all shards of a split quant follows at the end of this card.

Previously, you would download Q4044/Q4048/Q4088, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights; details are in this PR. If you use Q40 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282 you will not be able to run the Q40XX files and will instead need to use Q40. Additionally, if you want to get slightly better quality, you can use IQ4NL thanks to this PR, which will also repack the weights for ARM, though only the 44 variant for now. The loading time may be slower, but it will result in an overall speed increase.

I'm keeping this section to show the potential theoretical uplift in performance from using Q40 with online repacking.

Click to view benchmarks on an AVX2 system (EPYC7702):

| model | size | params | backend | threads | test | t/s | % (vs Q40) |
| ----- | ---: | -----: | ------- | ------: | ---- | --: | ---------: |
| qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% |
| qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% |
| qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% |
| qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% |
| qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% |
| qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% |
| qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% |
| qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% |
| qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% |
| qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% |
| qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% |
| qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% |
| qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% |
| qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% |
| qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% |
| qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% |
| qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% |
| qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% |

Q4088 offers a nice bump to prompt processing and a small bump to text generation.

A great write-up with charts showing various performance comparisons is provided by Artefact2 here.

The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have. If you want your model running as FAST as possible, you'll want to fit the whole thing in your GPU's VRAM: aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. If you want the absolute maximum quality, add your system RAM and your GPU's VRAM together, then grab a quant with a file size 1-2GB smaller than that total.

Next, you'll need to decide whether you want to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in the format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart. But basically, if you're aiming for below Q4 and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in the format IQXX, like IQ3M. They are newer and offer better performance for their size. The I-quants can also be used on CPU, but they will be slower than their K-quant equivalents, so speed vs performance is a trade-off you'll have to decide.

Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output.

Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
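Since the bf16 file in the table above is split across multiple shards, here is a minimal sketch of pulling every piece of one quant into a single folder with the huggingface_hub Python client. The repo id and filename pattern are assumptions, so check the model page for the exact names.

```python
# Minimal sketch for a split quant: download every matching shard into one folder.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="bartowski/TheDrummer_Gemma-3-R1-27B-v1-GGUF",  # assumed repo id
    allow_patterns=["*bf16*"],                               # assumed shard naming for the split bf16 quant
    local_dir="Gemma-3-R1-27B-v1-bf16",
)
```

llama.cpp loads split GGUFs by pointing at the first shard; it picks up the remaining pieces from the same directory.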

NaNK
1,195
5

CollectiveLM-Falcon-3-7B-GGUF

NaNK
1,191
0

LiquidAI_LFM2-VL-1.6B-GGUF

NaNK
1,188
2

deepseek-ai_DeepSeek-V3-0324-GGUF

NaNK
license:mit
1,187
26

Qwen2.5-Coder-14B-Instruct-abliterated-GGUF

NaNK
license:apache-2.0
1,185
15

nvidia_OpenCodeReasoning-Nemotron-32B-IOI-GGUF

NaNK
license:apache-2.0
1,184
2

Qwen2.5-Coder-7B-Instruct-abliterated-GGUF

NaNK
license:apache-2.0
1,183
5

Marco-o1-GGUF

NaNK
license:apache-2.0
1,178
46

SuperNova-Medius-GGUF

license:apache-2.0
1,178
40

Qwen_Qwen2.5-VL-72B-Instruct-GGUF

NaNK
1,172
1

Cydonia-22B-v1-GGUF

NaNK
1,170
13

Rombos-Coder-V2.5-Qwen-32b-GGUF

NaNK
license:apache-2.0
1,170
5

microsoft_Phi-4-reasoning-GGUF

license:mit
1,168
4

UI-TARS-7B-SFT-GGUF

NaNK
license:apache-2.0
1,166
3

OpenGVLab_InternVL3_5-2B-GGUF

NaNK
1,166
3

Qwen2.5-Coder-32B-Instruct-abliterated-GGUF

NaNK
license:apache-2.0
1,160
28

openai_gpt-oss-120b-GGUF-MXFP4-Experimental

Llamacpp experimental quantizations of gpt-oss-120b by OpenAI, using the llama.cpp branch `gpt-oss-mxfp4` (PR: https://github.com/ggml-org/llama.cpp/pull/15091). Original model: https://huggingface.co/openai/gpt-oss-120b. This is a single static quant in the new MXFP4 format; the rest of the sizes will come after the PR is merged.

NaNK
1,159
6

Qwen2-0.5B-Instruct-GGUF

NaNK
license:apache-2.0
1,158
2

SILMA-9B-Instruct-v1.0-GGUF

NaNK
1,156
1

arcee-ai_Homunculus-GGUF

license:apache-2.0
1,152
5

WhiteRabbitNeo_WhiteRabbitNeo-V3-7B-GGUF

NaNK
license:apache-2.0
1,149
4

TheDrummer_Gemma-3-R1-12B-v1-GGUF

Llamacpp imatrix quantizations of Gemma-3-R1-12B-v1 by TheDrummer.

Original model: https://huggingface.co/TheDrummer/Gemma-3-R1-12B-v1

All quants were made using the imatrix option with the dataset from here, combined with a subset of combinedallsmall.parquet from Ed Addario here. Run them directly with llama.cpp, or any other llama.cpp based project.

| Filename | Quant type | File Size | Split | Description |
| -------- | ---------- | --------- | ----- | ----------- |
| Gemma-3-R1-12B-v1-bf16.gguf | bf16 | 23.54GB | false | Full BF16 weights. |
| Gemma-3-R1-12B-v1-Q80.gguf | Q80 | 12.51GB | false | Extremely high quality, generally unneeded but max available quant. |
| Gemma-3-R1-12B-v1-Q6KL.gguf | Q6KL | 9.90GB | false | Uses Q80 for embed and output weights. Very high quality, near perfect, recommended. |
| Gemma-3-R1-12B-v1-Q6K.gguf | Q6K | 9.66GB | false | Very high quality, near perfect, recommended. |
| Gemma-3-R1-12B-v1-Q5KL.gguf | Q5KL | 8.69GB | false | Uses Q80 for embed and output weights. High quality, recommended. |
| Gemma-3-R1-12B-v1-Q5KM.gguf | Q5KM | 8.45GB | false | High quality, recommended. |
| Gemma-3-R1-12B-v1-Q5KS.gguf | Q5KS | 8.23GB | false | High quality, recommended. |
| Gemma-3-R1-12B-v1-Q41.gguf | Q41 | 7.56GB | false | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. |
| Gemma-3-R1-12B-v1-Q4KL.gguf | Q4KL | 7.54GB | false | Uses Q80 for embed and output weights. Good quality, recommended. |
| Gemma-3-R1-12B-v1-Q4KM.gguf | Q4KM | 7.30GB | false | Good quality, default size for most use cases, recommended. |
| Gemma-3-R1-12B-v1-Q4KS.gguf | Q4KS | 6.94GB | false | Slightly lower quality with more space savings, recommended. |
| Gemma-3-R1-12B-v1-Q40.gguf | Q40 | 6.91GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. |
| Gemma-3-R1-12B-v1-IQ4NL.gguf | IQ4NL | 6.89GB | false | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. |
| Gemma-3-R1-12B-v1-Q3KXL.gguf | Q3KXL | 6.72GB | false | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. |
| Gemma-3-R1-12B-v1-IQ4XS.gguf | IQ4XS | 6.55GB | false | Decent quality, smaller than Q4KS with similar performance, recommended. |
| Gemma-3-R1-12B-v1-Q3KL.gguf | Q3KL | 6.48GB | false | Lower quality but usable, good for low RAM availability. |
| Gemma-3-R1-12B-v1-Q3KM.gguf | Q3KM | 6.01GB | false | Low quality. |
| Gemma-3-R1-12B-v1-IQ3M.gguf | IQ3M | 5.66GB | false | Medium-low quality, new method with decent performance comparable to Q3KM. |
| Gemma-3-R1-12B-v1-Q3KS.gguf | Q3KS | 5.46GB | false | Low quality, not recommended. |
| Gemma-3-R1-12B-v1-IQ3XS.gguf | IQ3XS | 5.21GB | false | Lower quality, new method with decent performance, slightly better than Q3KS. |
| Gemma-3-R1-12B-v1-Q2KL.gguf | Q2KL | 5.01GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. |
| Gemma-3-R1-12B-v1-IQ3XXS.gguf | IQ3XXS | 4.78GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. |
| Gemma-3-R1-12B-v1-Q2K.gguf | Q2K | 4.77GB | false | Very low quality but surprisingly usable. |
| Gemma-3-R1-12B-v1-IQ2M.gguf | IQ2M | 4.31GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. |
| Gemma-3-R1-12B-v1-IQ2S.gguf | IQ2S | 4.02GB | false | Low quality, uses SOTA techniques to be usable. |

Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to.

First, make sure you have huggingface-cli installed. If the model is bigger than 50GB, it will have been split into multiple files. To download them all to a local folder, run the huggingface-cli download command and either specify a new local-dir (TheDrummerGemma-3-R1-12B-v1-Q80) or download them all in place (./).

Previously, you would download Q4044/Q4048/Q4088, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights; details are in this PR. If you use Q40 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282 you will not be able to run the Q40XX files and will instead need to use Q40. Additionally, if you want to get slightly better quality, you can use IQ4NL thanks to this PR, which will also repack the weights for ARM, though only the 44 variant for now. The loading time may be slower, but it will result in an overall speed increase.

I'm keeping this section to show the potential theoretical uplift in performance from using Q40 with online repacking.

Click to view benchmarks on an AVX2 system (EPYC7702):

| model | size | params | backend | threads | test | t/s | % (vs Q40) |
| ----- | ---: | -----: | ------- | ------: | ---- | --: | ---------: |
| qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% |
| qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% |
| qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% |
| qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% |
| qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% |
| qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% |
| qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% |
| qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% |
| qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% |
| qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% |
| qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% |
| qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% |
| qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% |
| qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% |
| qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% |
| qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% |
| qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% |
| qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% |

Q4088 offers a nice bump to prompt processing and a small bump to text generation.

A great write-up with charts showing various performance comparisons is provided by Artefact2 here.

The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have. If you want your model running as FAST as possible, you'll want to fit the whole thing in your GPU's VRAM: aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM (a small sizing sketch follows at the end of this card). If you want the absolute maximum quality, add your system RAM and your GPU's VRAM together, then grab a quant with a file size 1-2GB smaller than that total.

Next, you'll need to decide whether you want to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in the format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart. But basically, if you're aiming for below Q4 and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in the format IQXX, like IQ3M. They are newer and offer better performance for their size. The I-quants can also be used on CPU, but they will be slower than their K-quant equivalents, so speed vs performance is a trade-off you'll have to decide.

Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output.

Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
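To make the sizing guidance above concrete, here is a small self-contained sketch that applies the "1-2GB of headroom" rule to a few file sizes taken from this card's table. The headroom value is just the guideline, not a hard rule.

```python
# Pick the largest quant from this card that leaves ~1.5GB of headroom in memory.
QUANTS_GB = {
    "Q6K": 9.66, "Q5KM": 8.45, "Q4KM": 7.30, "IQ4XS": 6.55,
    "Q3KM": 6.01, "IQ3M": 5.66, "Q2K": 4.77,
}

def pick_quant(memory_gb: float, headroom_gb: float = 1.5) -> str:
    budget = memory_gb - headroom_gb
    fitting = {name: size for name, size in QUANTS_GB.items() if size <= budget}
    # Largest file that still fits wins; otherwise fall back to the smallest quant.
    return max(fitting, key=fitting.get) if fitting else min(QUANTS_GB, key=QUANTS_GB.get)

print(pick_quant(8.0))    # an 8GB GPU -> Q3KM
print(pick_quant(12.0))   # a 12GB GPU -> Q6K
```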

NaNK
1,148
5

infly_inf-o1-pi0-GGUF

NaNK
1,147
2

magnum-v4-22b-GGUF

NaNK
1,144
5

magnum-v2-4b-GGUF

NaNK
license:apache-2.0
1,143
6

Llama-3-Patronus-Lynx-70B-Instruct-GGUF

NaNK
base_model:PatronusAI/Llama-3-Patronus-Lynx-70B-Instruct
1,137
0

YuLan-Mini-GGUF

license:mit
1,135
5

starcoder2-15b-instruct-v0.1-GGUF

NaNK
1,132
8

35b-beta-long-GGUF

NaNK
license:gpl-3.0
1,125
30

Deepthink-Reasoning-7B-GGUF

NaNK
1,125
3

Llama-Doctor-3.2-3B-Instruct-GGUF

NaNK
Llama-3.2
1,124
2

deepcogito_cogito-v1-preview-llama-70B-GGUF

NaNK
base_model:deepcogito/cogito-v1-preview-llama-70B
1,123
2

OpenThinker-7B-GGUF

NaNK
llama-factory
1,110
6

Llama-3.1-8B-Open-SFT-GGUF

NaNK
Llama3.1
1,102
0

Nohobby_L3.3-Prikol-70B-EXTRA-GGUF

NaNK
1,095
0

Sao10K_Llama-3.3-70B-Vulpecula-r1-GGUF

NaNK
base_model:Sao10K/Llama-3.3-70B-Vulpecula-r1
1,091
5

granite-embedding-125m-english-GGUF

license:apache-2.0
1,086
0

Behemoth-123B-v1-GGUF

NaNK
1,084
4

LongWriter-glm4-9b-abliterated-GGUF

NaNK
llama
1,082
2

Athene-70B-GGUF

NaNK
license:cc-by-nc-4.0
1,071
7

SmolLM2-360M-Instruct-GGUF

license:apache-2.0
1,071
5

deepseek-ai_DeepSeek-V3.1-Terminus-GGUF

Llamacpp imatrix quantizations of DeepSeek-V3.1-Terminus by deepseek-ai.

Original model: https://huggingface.co/deepseek-ai/DeepSeek-V3.1-Terminus

All quants were made using the imatrix option with the dataset from here, combined with a subset of combinedallsmall.parquet from Ed Addario here. Run them directly with llama.cpp, or any other llama.cpp based project.

| Filename | Quant type | File Size | Split | Description |
| -------- | ---------- | --------- | ----- | ----------- |
| DeepSeek-V3.1-Terminus-Q80.gguf | Q80 | 713.29GB | true | Extremely high quality, generally unneeded but max available quant. |
| DeepSeek-V3.1-Terminus-Q6K.gguf | Q6K | 552.45GB | true | Very high quality, near perfect, recommended. |
| DeepSeek-V3.1-Terminus-Q5KM.gguf | Q5KM | 478.34GB | true | High quality, recommended. |
| DeepSeek-V3.1-Terminus-Q5KS.gguf | Q5KS | 463.03GB | true | High quality, recommended. |
| DeepSeek-V3.1-Terminus-Q41.gguf | Q41 | 421.04GB | true | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. |
| DeepSeek-V3.1-Terminus-Q4KM.gguf | Q4KM | 409.23GB | true | Good quality, default size for most use cases, recommended. |
| DeepSeek-V3.1-Terminus-Q4KS.gguf | Q4KS | 394.15GB | true | Slightly lower quality with more space savings, recommended. |
| DeepSeek-V3.1-Terminus-Q40.gguf | Q40 | 386.42GB | true | Legacy format, offers online repacking for ARM and AVX CPU inference. |
| DeepSeek-V3.1-Terminus-IQ4NL.gguf | IQ4NL | 380.48GB | true | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. |
| DeepSeek-V3.1-Terminus-IQ4XS.gguf | IQ4XS | 359.98GB | true | Decent quality, smaller than Q4KS with similar performance, recommended. |
| DeepSeek-V3.1-Terminus-Q3KXL.gguf | Q3KXL | 320.52GB | true | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. |
| DeepSeek-V3.1-Terminus-Q3KL.gguf | Q3KL | 319.71GB | true | Lower quality but usable, good for low RAM availability. |
| DeepSeek-V3.1-Terminus-Q3KM.gguf | Q3KM | 307.93GB | true | Low quality. |
| DeepSeek-V3.1-Terminus-IQ3M.gguf | IQ3M | 307.88GB | true | Medium-low quality, new method with decent performance comparable to Q3KM. |
| DeepSeek-V3.1-Terminus-Q3KS.gguf | Q3KS | 293.35GB | true | Low quality, not recommended. |
| DeepSeek-V3.1-Terminus-IQ3XS.gguf | IQ3XS | 277.15GB | true | Lower quality, new method with decent performance, slightly better than Q3KS. |
| DeepSeek-V3.1-Terminus-IQ3XXS.gguf | IQ3XXS | 267.63GB | true | Lower quality, new method with decent performance, comparable to Q3 quants. |
| DeepSeek-V3.1-Terminus-Q2KL.gguf | Q2KL | 238.74GB | true | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. |
| DeepSeek-V3.1-Terminus-Q2K.gguf | Q2K | 237.83GB | true | Very low quality but surprisingly usable. |
| DeepSeek-V3.1-Terminus-IQ2M.gguf | IQ2M | 215.04GB | true | Relatively low quality, uses SOTA techniques to be surprisingly usable. |
| DeepSeek-V3.1-Terminus-IQ2S.gguf | IQ2S | 189.63GB | true | Low quality, uses SOTA techniques to be usable. |
| DeepSeek-V3.1-Terminus-IQ2XS.gguf | IQ2XS | 188.41GB | true | Low quality, uses SOTA techniques to be usable. |
| DeepSeek-V3.1-Terminus-IQ2XXS.gguf | IQ2XXS | 164.06GB | true | Very low quality, uses SOTA techniques to be usable. |
| DeepSeek-V3.1-Terminus-IQ1M.gguf | IQ1M | 147.45GB | true | Extremely low quality, not recommended. |

Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to.

First, make sure you have huggingface-cli installed. If the model is bigger than 50GB, it will have been split into multiple files. To download them all to a local folder, run the huggingface-cli download command and either specify a new local-dir (deepseek-aiDeepSeek-V3.1-Terminus-Q80) or download them all in place (./); a sketch for listing the shards of a quant before downloading follows at the end of this card.

Previously, you would download Q4044/Q4048/Q4088, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights; details are in this PR. If you use Q40 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282 you will not be able to run the Q40XX files and will instead need to use Q40. Additionally, if you want to get slightly better quality, you can use IQ4NL thanks to this PR, which will also repack the weights for ARM, though only the 44 variant for now. The loading time may be slower, but it will result in an overall speed increase.

I'm keeping this section to show the potential theoretical uplift in performance from using Q40 with online repacking.

Click to view benchmarks on an AVX2 system (EPYC7702):

| model | size | params | backend | threads | test | t/s | % (vs Q40) |
| ----- | ---: | -----: | ------- | ------: | ---- | --: | ---------: |
| qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% |
| qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% |
| qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% |
| qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% |
| qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% |
| qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% |
| qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% |
| qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% |
| qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% |
| qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% |
| qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% |
| qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% |
| qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% |
| qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% |
| qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% |
| qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% |
| qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% |
| qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% |

Q4088 offers a nice bump to prompt processing and a small bump to text generation.

A great write-up with charts showing various performance comparisons is provided by Artefact2 here.

The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have. If you want your model running as FAST as possible, you'll want to fit the whole thing in your GPU's VRAM: aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. If you want the absolute maximum quality, add your system RAM and your GPU's VRAM together, then grab a quant with a file size 1-2GB smaller than that total.

Next, you'll need to decide whether you want to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in the format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart. But basically, if you're aiming for below Q4 and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in the format IQXX, like IQ3M. They are newer and offer better performance for their size. The I-quants can also be used on CPU, but they will be slower than their K-quant equivalents, so speed vs performance is a trade-off you'll have to decide.

Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output.

Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
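Because every quant of this model is split into many large shards, it can be worth listing the files for a quant before committing to a several-hundred-gigabyte download. Here is a minimal sketch with the huggingface_hub Python client; the repo id and the substring used for filtering are assumptions, so check the model page for the exact names.

```python
# Minimal sketch: list the shards of one quant before downloading anything.
from huggingface_hub import HfApi

api = HfApi()
files = api.list_repo_files("bartowski/deepseek-ai_DeepSeek-V3.1-Terminus-GGUF")  # assumed repo id
shards = [f for f in files if "Q4_K_M" in f]  # assumed shard naming for the Q4KM quant
for shard in sorted(shards):
    print(shard)
```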

1,068
2

Rombos-Coder-V2.5-Qwen-7b-GGUF

NaNK
license:apache-2.0
1,067
2

Crimson_Dawn-v0.2-GGUF

license:apache-2.0
1,065
2

Hermes-3-Llama-3.1-8B-GGUF

NaNK
Llama-3
1,054
11

LGAI-EXAONE_EXAONE-4.0.1-32B-GGUF

NaNK
1,049
2

EVA-Qwen2.5-32B-v0.0-GGUF

NaNK
license:apache-2.0
1,044
9

miromind-ai_MiroThinker-v1.0-72B-GGUF

NaNK
license:mit
1,034
0

Gemmasutra-Mini-2B-v1-GGUF

NaNK
1,030
4

internlm3-8b-instruct-GGUF

NaNK
license:apache-2.0
1,022
3

Gemma-2-Ataraxy-9B-GGUF

NaNK
1,013
28

OpenGVLab_InternVL3_5-4B-GGUF

NaNK
1,007
5

ddh0_Cassiopeia-70B-GGUF

NaNK
999
2

Tower-Babel_Babel-9B-Chat-GGUF

NaNK
998
6

TheDrummer_Cydonia-24B-v3.1-GGUF

NaNK
998
6

ByteDance-Seed_Seed-OSS-36B-Instruct-GGUF

Llamacpp imatrix quantizations of Seed-OSS-36B-Instruct by ByteDance-Seed.

Original model: https://huggingface.co/ByteDance-Seed/Seed-OSS-36B-Instruct

All quants were made using the imatrix option with the dataset from here, combined with a subset of combinedallsmall.parquet from Ed Addario here. Run them directly with llama.cpp, or any other llama.cpp based project; a minimal run sketch follows at the end of this card.

| Filename | Quant type | File Size | Split | Description |
| -------- | ---------- | --------- | ----- | ----------- |
| Seed-OSS-36B-Instruct-bf16.gguf | bf16 | 72.31GB | true | Full BF16 weights. |
| Seed-OSS-36B-Instruct-Q80.gguf | Q80 | 38.42GB | false | Extremely high quality, generally unneeded but max available quant. |
| Seed-OSS-36B-Instruct-Q6KL.gguf | Q6KL | 30.05GB | false | Uses Q80 for embed and output weights. Very high quality, near perfect, recommended. |
| Seed-OSS-36B-Instruct-Q6K.gguf | Q6K | 29.67GB | false | Very high quality, near perfect, recommended. |
| Seed-OSS-36B-Instruct-Q5KL.gguf | Q5KL | 26.08GB | false | Uses Q80 for embed and output weights. High quality, recommended. |
| Seed-OSS-36B-Instruct-Q5KM.gguf | Q5KM | 25.59GB | false | High quality, recommended. |
| Seed-OSS-36B-Instruct-Q5KS.gguf | Q5KS | 24.97GB | false | High quality, recommended. |
| Seed-OSS-36B-Instruct-Q41.gguf | Q41 | 22.76GB | false | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. |
| Seed-OSS-36B-Instruct-Q4KL.gguf | Q4KL | 22.35GB | false | Uses Q80 for embed and output weights. Good quality, recommended. |
| Seed-OSS-36B-Instruct-Q4KM.gguf | Q4KM | 21.76GB | false | Good quality, default size for most use cases, recommended. |
| Seed-OSS-36B-Instruct-Q4KS.gguf | Q4KS | 20.70GB | false | Slightly lower quality with more space savings, recommended. |
| Seed-OSS-36B-Instruct-Q40.gguf | Q40 | 20.62GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. |
| Seed-OSS-36B-Instruct-IQ4NL.gguf | IQ4NL | 20.59GB | false | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. |
| Seed-OSS-36B-Instruct-Q3KXL.gguf | Q3KXL | 19.84GB | false | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. |
| Seed-OSS-36B-Instruct-IQ4XS.gguf | IQ4XS | 19.50GB | false | Decent quality, smaller than Q4KS with similar performance, recommended. |
| Seed-OSS-36B-Instruct-Q3KL.gguf | Q3KL | 19.14GB | false | Lower quality but usable, good for low RAM availability. |
| Seed-OSS-36B-Instruct-Q3KM.gguf | Q3KM | 17.62GB | false | Low quality. |
| Seed-OSS-36B-Instruct-IQ3M.gguf | IQ3M | 16.50GB | false | Medium-low quality, new method with decent performance comparable to Q3KM. |
| Seed-OSS-36B-Instruct-Q3KS.gguf | Q3KS | 15.86GB | false | Low quality, not recommended. |
| Seed-OSS-36B-Instruct-IQ3XS.gguf | IQ3XS | 15.09GB | false | Lower quality, new method with decent performance, slightly better than Q3KS. |
| Seed-OSS-36B-Instruct-Q2KL.gguf | Q2KL | 14.38GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. |
| Seed-OSS-36B-Instruct-IQ3XXS.gguf | IQ3XXS | 14.12GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. |
| Seed-OSS-36B-Instruct-Q2K.gguf | Q2K | 13.60GB | false | Very low quality but surprisingly usable. |
| Seed-OSS-36B-Instruct-IQ2M.gguf | IQ2M | 12.54GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. |
| Seed-OSS-36B-Instruct-IQ2S.gguf | IQ2S | 11.61GB | false | Low quality, uses SOTA techniques to be usable. |
| Seed-OSS-36B-Instruct-IQ2XS.gguf | IQ2XS | 10.95GB | false | Low quality, uses SOTA techniques to be usable. |
| Seed-OSS-36B-Instruct-IQ2XXS.gguf | IQ2XXS | 9.91GB | false | Very low quality, uses SOTA techniques to be usable. |

Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to.

First, make sure you have huggingface-cli installed. If the model is bigger than 50GB, it will have been split into multiple files. To download them all to a local folder, run the huggingface-cli download command and either specify a new local-dir (ByteDance-SeedSeed-OSS-36B-Instruct-Q80) or download them all in place (./).

Previously, you would download Q4044/Q4048/Q4088, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights; details are in this PR. If you use Q40 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282 you will not be able to run the Q40XX files and will instead need to use Q40. Additionally, if you want to get slightly better quality, you can use IQ4NL thanks to this PR, which will also repack the weights for ARM, though only the 44 variant for now. The loading time may be slower, but it will result in an overall speed increase.

I'm keeping this section to show the potential theoretical uplift in performance from using Q40 with online repacking.

Click to view benchmarks on an AVX2 system (EPYC7702):

| model | size | params | backend | threads | test | t/s | % (vs Q40) |
| ----- | ---: | -----: | ------- | ------: | ---- | --: | ---------: |
| qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% |
| qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% |
| qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% |
| qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% |
| qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% |
| qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% |
| qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% |
| qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% |
| qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% |
| qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% |
| qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% |
| qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% |
| qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% |
| qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% |
| qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% |
| qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% |
| qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% |
| qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% |

Q4088 offers a nice bump to prompt processing and a small bump to text generation.

A great write-up with charts showing various performance comparisons is provided by Artefact2 here.

The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have. If you want your model running as FAST as possible, you'll want to fit the whole thing in your GPU's VRAM: aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. If you want the absolute maximum quality, add your system RAM and your GPU's VRAM together, then grab a quant with a file size 1-2GB smaller than that total.

Next, you'll need to decide whether you want to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in the format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart. But basically, if you're aiming for below Q4 and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in the format IQXX, like IQ3M. They are newer and offer better performance for their size. The I-quants can also be used on CPU, but they will be slower than their K-quant equivalents, so speed vs performance is a trade-off you'll have to decide.

Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output.

Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
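As a minimal example of the "run them directly with llama.cpp" step, here is a sketch using the llama-cpp-python bindings (one of the llama.cpp based projects). The local filename, context size, and chat usage are illustrative assumptions; check the original model card for the correct chat template.

```python
# Minimal run sketch with llama-cpp-python (pip install llama-cpp-python).
from llama_cpp import Llama

llm = Llama(
    model_path="Seed-OSS-36B-Instruct-Q4_K_M.gguf",  # assumed local filename
    n_ctx=4096,        # context window; raise it if you have the memory
    n_gpu_layers=-1,   # offload as many layers as possible to the GPU
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize what a GGUF quant is in one sentence."}],
    max_tokens=64,
)
print(out["choices"][0]["message"]["content"])
```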

NaNK
995
5

Chronos-Gold-12B-1.0-GGUF

NaNK
license:cc-by-nc-4.0
993
9

dolphin-2.9.4-llama3.1-8b-GGUF

NaNK
base_model:dphn/dolphin-2.9.4-llama3.1-8b
992
10