bartowski
Llama-3.2-3B-Instruct-GGUF
---
base_model: meta-llama/Llama-3.2-3B-Instruct
language: [en, de, fr, it, pt, hi, es, th]
license: llama3.2
pipeline_tag: text-generation
tags: [facebook, meta, llama, llama-3]
quantized_by: bartowski
extra_gated_prompt: LLAMA 3.2 COMMUNITY LICENSE AGREEMENT (Llama 3.2 Version Release Date: September 25, 2024)
---
google_gemma-4-31B-it-GGUF
Qwen_Qwen3.5-397B-A17B-GGUF
google_gemma-4-26B-A4B-it-GGUF
Qwen_Qwen3.5-35B-A3B-GGUF
Meta-Llama-3.1-8B-Instruct-GGUF
---
base_model: meta-llama/Meta-Llama-3.1-8B-Instruct
language: [en, de, fr, it, pt, hi, es, th]
license: llama3.1
pipeline_tag: text-generation
tags: [facebook, meta, pytorch, llama, llama-3]
quantized_by: bartowski
extra_gated_prompt: LLAMA 3.1 COMMUNITY LICENSE AGREEMENT (Llama 3.1 Version Release Date: July 23, 2024)
---
google_gemma-4-E4B-it-GGUF
gemma-2-2b-it-GGUF
Qwen_Qwen3.5-27B-GGUF
google_gemma-4-E2B-it-GGUF
Llama-3.2-1B-Instruct-GGUF
---
base_model: meta-llama/Llama-3.2-1B-Instruct
language: [en, de, fr, it, pt, hi, es, th]
license: llama3.2
pipeline_tag: text-generation
tags: [facebook, meta, llama, llama-3]
quantized_by: bartowski
extra_gated_prompt: LLAMA 3.2 COMMUNITY LICENSE AGREEMENT (Llama 3.2 Version Release Date: September 25, 2024)
---
Qwen_Qwen3.5-9B-GGUF
Qwen_Qwen3.5-122B-A10B-GGUF
openai_gpt-oss-20b-GGUF
Llamacpp imatrix Quantizations of gpt-oss-20b by openai Original model: https://huggingface.co/openai/gpt-oss-20b All quants made using imatrix option with combinedallmedium dataset from Ed Addario here Run them directly with llama.cpp, or any other llama.cpp based project All quants keep the feed forward networks at mxfp4 for optimal performance, which does mean the size differences are negligible unfortunately, but being provided just because. No chat template specified so default is used. This may be incorrect, check original model card for details. | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | gpt-oss-20b-MXFP4.gguf | MXFP4 | 12.1GB | false | Full MXFP4 weights, recommended for this model. | The reason is, the FFN (feed forward networks) of gpt-oss do not behave nicely when quantized to anything other than MXFP4, so they are kept at that level for everything. The rest of these are provided for your own interest in case you feel like experimenting, but the size savings is basically non-existent so I would not recommend running them, they are provided simply for show: | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | gpt-oss-20b-Q6KL.gguf | Q6KL | 12.04GB | false | Uses Q80 for embed and output weights. Q6K with all FFN kept at MXFP4MOE. | | gpt-oss-20b-Q6K.gguf | Q6K | 12.04GB | false | Q6K with all FFN kept at MXFP4MOE. | | gpt-oss-20b-Q5KL.gguf | Q5KL | 11.91GB | false | Uses Q80 for embed and output weights. Q5K with all FFN kept at MXFP4MOE. | | gpt-oss-20b-Q4KL.gguf | Q4KL | 11.89GB | false | Uses Q80 for embed and output weights. Q4K with all FFN kept at MXFP4MOE. | | gpt-oss-20b-Q2KL.gguf | Q2KL | 11.85GB | false | Uses Q80 for embed and output weights. Q2K with all FFN kept at MXFP4MOE. | | gpt-oss-20b-Q3KXL.gguf | Q3KXL | 11.78GB | false | Uses Q80 for embed and output weights. Q3KL with all FFN kept at MXFP4MOE. | | gpt-oss-20b-Q5KM.gguf | Q5KM | 11.73GB | false | Q5KM with all FFN kept at MXFP4MOE. | | gpt-oss-20b-Q5KS.gguf | Q5KS | 11.72GB | false | Q5KS with all FFN kept at MXFP4MOE. | | gpt-oss-20b-Q4KM.gguf | Q4KM | 11.67GB | false | Q4KM with all FFN kept at MXFP4MOE. | | gpt-oss-20b-Q4KS.gguf | Q4KS | 11.67GB | false | Q4KS with all FFN kept at MXFP4MOE. | | gpt-oss-20b-Q41.gguf | Q41 | 11.59GB | false | Q41 with all FFN kept at MXFP4MOE. | | gpt-oss-20b-IQ4NL.gguf | IQ4NL | 11.56GB | false | IQ4NL with all FFN kept at MXFP4MOE. | | gpt-oss-20b-IQ4XS.gguf | IQ4XS | 11.56GB | false | IQ4XS with all FFN kept at MXFP4MOE. | | gpt-oss-20b-Q3KM.gguf | Q3KM | 11.56GB | false | Q3KM with all FFN kept at MXFP4MOE. | | gpt-oss-20b-IQ3M.gguf | IQ3M | 11.56GB | false | IQ3M with all FFN kept at MXFP4MOE. | | gpt-oss-20b-IQ3XS.gguf | IQ3XS | 11.56GB | false | IQ3XS with all FFN kept at MXFP4MOE. | | gpt-oss-20b-IQ3XXS.gguf | IQ3XXS | 11.56GB | false | IQ3XXS with all FFN kept at MXFP4MOE. | | gpt-oss-20b-Q2K.gguf | Q2K | 11.56GB | false | Q2K with all FFN kept at MXFP4MOE. | | gpt-oss-20b-Q3KS.gguf | Q3KS | 11.55GB | false | Q3KS with all FFN kept at MXFP4MOE. | | gpt-oss-20b-IQ2M.gguf | IQ2M | 11.55GB | false | IQ2M with all FFN kept at MXFP4MOE. | | gpt-oss-20b-IQ2S.gguf | IQ2S | 11.55GB | false | IQ2S with all FFN kept at MXFP4MOE. | | gpt-oss-20b-Q40.gguf | Q40 | 11.52GB | false | Q40 with all FFN kept at MXFP4MOE. | | gpt-oss-20b-IQ2XS.gguf | IQ2XS | 11.51GB | false | IQ2XS with all FFN kept at MXFP4MOE. 
| | gpt-oss-20b-IQ2XXS.gguf | IQ2XXS | 11.51GB | false | IQ2XXS with all FFN kept at MXFP4MOE. | | gpt-oss-20b-Q3KL.gguf | Q3KL | 11.49GB | false | Q3KL with all FFN kept at MXFP4MOE. | Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to. First, make sure you have hugginface-cli installed: If the model is bigger than 50GB, it will have been split into multiple files. In order to download them all to a local folder, run: You can either specify a new local-dir (openaigpt-oss-20b-Q80) or download them all in place (./) Previously, you would download Q4044/48/88, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights. details in this PR. If you use Q40 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282 you will not be able to run the Q40XX files and will instead need to use Q40. Additionally, if you want to get slightly better quality for , you can use IQ4NL thanks to this PR which will also repack the weights for ARM, though only the 44 for now. The loading time may be slower but it will result in an overall speed incrase. I'm keeping this section to show the potential theoretical uplift in performance from using the Q40 with online repacking. Click to view benchmarks on an AVX2 system (EPYC7702) | model | size | params | backend | threads | test | t/s | % (vs Q40) | | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% | Q4088 offers a nice bump to prompt processing and a small bump to text generation A great write up with charts showing various performances is provided by Artefact2 here The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have. 
If you want your model running as FAST as possible, you'll want to fit the whole thing in your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. If you want the absolute maximum quality, add your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB smaller than that total.

Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in the format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart. But basically, if you're aiming for below Q4 and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in the format IQXX, like IQ3M. They are newer and offer better performance for their size. The I-quants can also be used on CPU, but will be slower than their K-quant equivalents, so speed vs. performance is a tradeoff you'll have to decide.

Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output. Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
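For reference, the download-and-run flow described above looks roughly like the following. This is a minimal sketch: it assumes the repo id is bartowski/openai_gpt-oss-20b-GGUF, that the MXFP4 file is named as listed in the table, and that you have a llama.cpp build providing the llama-cli binary; adjust paths and flags for your setup.

```bash
# Install the Hugging Face CLI (ships with the huggingface_hub package)
pip install -U "huggingface_hub[cli]"

# Download only the recommended MXFP4 quant into the current directory
huggingface-cli download bartowski/openai_gpt-oss-20b-GGUF \
  --include "gpt-oss-20b-MXFP4.gguf" --local-dir ./

# Run a quick generation with llama.cpp's llama-cli
./llama-cli -m ./gpt-oss-20b-MXFP4.gguf -p "Hello, introduce yourself." -n 256
```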
cognitivecomputations_Dolphin-Mistral-24B-Venice-Edition-GGUF
Llamacpp imatrix Quantizations of Dolphin-Mistral-24B-Venice-Edition by cognitivecomputations
Llama-3.2-3B-Instruct-uncensored-GGUF
Qwen2.5-7B-Instruct-GGUF
Llamacpp imatrix Quantizations of Qwen2.5-7B-Instruct

Original model: https://huggingface.co/Qwen/Qwen2.5-7B-Instruct

All quants made using imatrix option with dataset from here

| Filename | Quant type | File Size | Split | Description |
| -------- | ---------- | --------- | ----- | ----------- |
| Qwen2.5-7B-Instruct-f16.gguf | f16 | 15.24GB | false | Full F16 weights. |
| Qwen2.5-7B-Instruct-Q80.gguf | Q80 | 8.10GB | false | Extremely high quality, generally unneeded but max available quant. |
| Qwen2.5-7B-Instruct-Q6KL.gguf | Q6KL | 6.52GB | false | Uses Q80 for embed and output weights. Very high quality, near perfect, recommended. |
| Qwen2.5-7B-Instruct-Q6K.gguf | Q6K | 6.25GB | false | Very high quality, near perfect, recommended. |
| Qwen2.5-7B-Instruct-Q5KL.gguf | Q5KL | 5.78GB | false | Uses Q80 for embed and output weights. High quality, recommended. |
| Qwen2.5-7B-Instruct-Q5KM.gguf | Q5KM | 5.44GB | false | High quality, recommended. |
| Qwen2.5-7B-Instruct-Q5KS.gguf | Q5KS | 5.32GB | false | High quality, recommended. |
| Qwen2.5-7B-Instruct-Q4KL.gguf | Q4KL | 5.09GB | false | Uses Q80 for embed and output weights. Good quality, recommended. |
| Qwen2.5-7B-Instruct-Q4KM.gguf | Q4KM | 4.68GB | false | Good quality, default size for most use cases, recommended. |
| Qwen2.5-7B-Instruct-Q3KXL.gguf | Q3KXL | 4.57GB | false | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. |
| Qwen2.5-7B-Instruct-Q4KS.gguf | Q4KS | 4.46GB | false | Slightly lower quality with more space savings, recommended. |
| Qwen2.5-7B-Instruct-Q40.gguf | Q40 | 4.44GB | false | Legacy format, generally not worth using over similarly sized formats. |
| Qwen2.5-7B-Instruct-Q4088.gguf | Q4088 | 4.43GB | false | Optimized for ARM inference. Requires 'sve' support (see link below). |
| Qwen2.5-7B-Instruct-Q4048.gguf | Q4048 | 4.43GB | false | Optimized for ARM inference. Requires 'i8mm' support (see link below). |
| Qwen2.5-7B-Instruct-Q4044.gguf | Q4044 | 4.43GB | false | Optimized for ARM inference. Should work well on all ARM chips, pick this if you're unsure. |
| Qwen2.5-7B-Instruct-IQ4XS.gguf | IQ4XS | 4.22GB | false | Decent quality, smaller than Q4KS with similar performance, recommended. |
| Qwen2.5-7B-Instruct-Q3KL.gguf | Q3KL | 4.09GB | false | Lower quality but usable, good for low RAM availability. |
| Qwen2.5-7B-Instruct-Q3KM.gguf | Q3KM | 3.81GB | false | Low quality. |
| Qwen2.5-7B-Instruct-IQ3M.gguf | IQ3M | 3.57GB | false | Medium-low quality, new method with decent performance comparable to Q3KM. |
| Qwen2.5-7B-Instruct-Q2KL.gguf | Q2KL | 3.55GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. |
| Qwen2.5-7B-Instruct-Q3KS.gguf | Q3KS | 3.49GB | false | Low quality, not recommended. |
| Qwen2.5-7B-Instruct-IQ3XS.gguf | IQ3XS | 3.35GB | false | Lower quality, new method with decent performance, slightly better than Q3KS. |
| Qwen2.5-7B-Instruct-Q2K.gguf | Q2K | 3.02GB | false | Very low quality but surprisingly usable. |
| Qwen2.5-7B-Instruct-IQ2M.gguf | IQ2M | 2.78GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. |

Some of these quants (Q3KXL, Q4KL etc.) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to. Some say that this improves the quality, others don't notice any difference. If you use these models PLEASE COMMENT with your findings.
I would like feedback that these are actually used and useful so I don't keep uploading quants no one is using.

First, make sure you have huggingface-cli installed. If the model is bigger than 50GB, it will have been split into multiple files; to download them all to a local folder, run the download command (see the example below). You can either specify a new local-dir (Qwen2.5-7B-Instruct-Q80) or download them all in place (./).

These are NOT for Metal (Apple) offloading, only ARM chips. If you're using an ARM chip, the Q40XX quants will have a substantial speedup. Check out Q4044 speed comparisons on the original pull request. To check which one would work best for your ARM chip, you can check AArch64 SoC features (thanks EloyOn!).

A great write-up with charts showing various performances is provided by Artefact2 here.

The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have. If you want your model running as FAST as possible, you'll want to fit the whole thing in your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. If you want the absolute maximum quality, add your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB smaller than that total.

Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in the format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart. But basically, if you're aiming for below Q4 and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in the format IQXX, like IQ3M. They are newer and offer better performance for their size. The I-quants can also be used on CPU and Apple Metal, but will be slower than their K-quant equivalents, so speed vs. performance is a tradeoff you'll have to decide. The I-quants are not compatible with Vulkan, which also supports AMD, so if you have an AMD card double check if you're using the rocBLAS build or the Vulkan build. At the time of writing this, LM Studio has a preview with ROCm support, and other inference engines have specific builds for ROCm.

Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output. Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
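The download step referenced above would look roughly like this. It is a sketch under the assumption that the repo id is bartowski/Qwen2.5-7B-Instruct-GGUF and that file names on the repo use underscores (e.g. Q4_K_M rather than Q4KM as rendered here); pick whichever quant fits your hardware.

```bash
# Make sure the Hugging Face CLI is installed
pip install -U "huggingface_hub[cli]"

# Download a single quant file into a dedicated folder
huggingface-cli download bartowski/Qwen2.5-7B-Instruct-GGUF \
  --include "Qwen2.5-7B-Instruct-Q4_K_M.gguf" \
  --local-dir Qwen2.5-7B-Instruct-Q4_K_M
```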
Qwen_Qwen3.5-4B-GGUF
Phi-3.5-mini-instruct-GGUF
TheDrummer_Cydonia-24B-v4.2.0-GGUF
Llamacpp imatrix Quantizations of Cydonia-24B-v4.2.0 by TheDrummer Original model: https://huggingface.co/TheDrummer/Cydonia-24B-v4.2.0 All quants made using imatrix option with dataset from here c...
Qwen2.5-Coder-32B-Instruct-GGUF
Llamacpp imatrix Quantizations of Qwen2.5-Coder-32B-Instruct Original model: https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct All quants made using imatrix option with dataset from here | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | Qwen2.5-Coder-32B-Instruct-Q80.gguf | Q80 | 34.82GB | false | Extremely high quality, generally unneeded but max available quant. | | Qwen2.5-Coder-32B-Instruct-Q6KL.gguf | Q6KL | 27.26GB | false | Uses Q80 for embed and output weights. Very high quality, near perfect, recommended. | | Qwen2.5-Coder-32B-Instruct-Q6K.gguf | Q6K | 26.89GB | false | Very high quality, near perfect, recommended. | | Qwen2.5-Coder-32B-Instruct-Q5KL.gguf | Q5KL | 23.74GB | false | Uses Q80 for embed and output weights. High quality, recommended. | | Qwen2.5-Coder-32B-Instruct-Q5KM.gguf | Q5KM | 23.26GB | false | High quality, recommended. | | Qwen2.5-Coder-32B-Instruct-Q5KS.gguf | Q5KS | 22.64GB | false | High quality, recommended. | | Qwen2.5-Coder-32B-Instruct-Q4KL.gguf | Q4KL | 20.43GB | false | Uses Q80 for embed and output weights. Good quality, recommended. | | Qwen2.5-Coder-32B-Instruct-Q4KM.gguf | Q4KM | 19.85GB | false | Good quality, default size for most use cases, recommended. | | Qwen2.5-Coder-32B-Instruct-Q4KS.gguf | Q4KS | 18.78GB | false | Slightly lower quality with more space savings, recommended. | | Qwen2.5-Coder-32B-Instruct-Q40.gguf | Q40 | 18.71GB | false | Legacy format, generally not worth using over similarly sized formats | | Qwen2.5-Coder-32B-Instruct-IQ4NL.gguf | IQ4NL | 18.68GB | false | Similar to IQ4XS, but slightly larger. | | Qwen2.5-Coder-32B-Instruct-Q4088.gguf | Q4088 | 18.64GB | false | Optimized for ARM inference. Requires 'sve' support (see link below). Don't use on Mac or Windows. | | Qwen2.5-Coder-32B-Instruct-Q4048.gguf | Q4048 | 18.64GB | false | Optimized for ARM inference. Requires 'i8mm' support (see link below). Don't use on Mac or Windows. | | Qwen2.5-Coder-32B-Instruct-Q4044.gguf | Q4044 | 18.64GB | false | Optimized for ARM inference. Should work well on all ARM chips, pick this if you're unsure. Don't use on Mac or Windows. | | Qwen2.5-Coder-32B-Instruct-Q3KXL.gguf | Q3KXL | 17.93GB | false | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | Qwen2.5-Coder-32B-Instruct-IQ4XS.gguf | IQ4XS | 17.69GB | false | Decent quality, smaller than Q4KS with similar performance, recommended. | | Qwen2.5-Coder-32B-Instruct-Q3KL.gguf | Q3KL | 17.25GB | false | Lower quality but usable, good for low RAM availability. | | Qwen2.5-Coder-32B-Instruct-Q3KM.gguf | Q3KM | 15.94GB | false | Low quality. | | Qwen2.5-Coder-32B-Instruct-IQ3M.gguf | IQ3M | 14.81GB | false | Medium-low quality, new method with decent performance comparable to Q3KM. | | Qwen2.5-Coder-32B-Instruct-Q3KS.gguf | Q3KS | 14.39GB | false | Low quality, not recommended. | | Qwen2.5-Coder-32B-Instruct-IQ3XS.gguf | IQ3XS | 13.71GB | false | Lower quality, new method with decent performance, slightly better than Q3KS. | | Qwen2.5-Coder-32B-Instruct-Q2KL.gguf | Q2KL | 13.07GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | Qwen2.5-Coder-32B-Instruct-IQ3XXS.gguf | IQ3XXS | 12.84GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. | | Qwen2.5-Coder-32B-Instruct-Q2K.gguf | Q2K | 12.31GB | false | Very low quality but surprisingly usable. 
| | Qwen2.5-Coder-32B-Instruct-IQ2M.gguf | IQ2M | 11.26GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. | | Qwen2.5-Coder-32B-Instruct-IQ2S.gguf | IQ2S | 10.39GB | false | Low quality, uses SOTA techniques to be usable. | | Qwen2.5-Coder-32B-Instruct-IQ2XS.gguf | IQ2XS | 9.96GB | false | Low quality, uses SOTA techniques to be usable. | | Qwen2.5-Coder-32B-Instruct-IQ2XXS.gguf | IQ2XXS | 9.03GB | false | Very low quality, uses SOTA techniques to be usable. | Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to. Some say that this improves the quality, others don't notice any difference. If you use these models PLEASE COMMENT with your findings. I would like feedback that these are actually used and useful so I don't keep uploading quants no one is using. First, make sure you have hugginface-cli installed: If the model is bigger than 50GB, it will have been split into multiple files. In order to download them all to a local folder, run: You can either specify a new local-dir (Qwen2.5-Coder-32B-Instruct-Q80) or download them all in place (./) These are NOT for Metal (Apple) offloading, only ARM chips. If you're using an ARM chip, the Q40XX quants will have a substantial speedup. Check out Q4044 speed comparisons on the original pull request To check which one would work best for your ARM chip, you can check AArch64 SoC features (thanks EloyOn!). A great write up with charts showing various performances is provided by Artefact2 here The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have. If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB Smaller than that total. Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart: But basically, if you're aiming for below Q4, and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in format IQXX, like IQ3M. These are newer and offer better performance for their size. These I-quants can also be used on CPU and Apple Metal, but will be slower than their K-quant equivalent, so speed vs performance is a tradeoff you'll have to decide. The I-quants are not compatible with Vulcan, which is also AMD, so if you have an AMD card double check if you're using the rocBLAS build or the Vulcan build. At the time of writing this, LM Studio has a preview with ROCm support, and other inference engines have specific builds for ROCm. Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output. Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
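For the "bigger than 50GB, split into multiple files" case mentioned above, a wildcard include pattern grabs every part in one go. This is a sketch: it assumes the split parts are stored under a folder named after the quant, which may differ per repo.

```bash
# Download every part of a split quant into the current directory
huggingface-cli download bartowski/Qwen2.5-Coder-32B-Instruct-GGUF \
  --include "Qwen2.5-Coder-32B-Instruct-Q8_0/*" --local-dir ./

# llama.cpp only needs to be pointed at the first part; it finds the rest itself
```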
Mistral-Nemo-Instruct-2407-GGUF
Llamacpp imatrix Quantizations of Mistral-Nemo-Instruct-2407 Original model: https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407 All quants made using imatrix option with dataset from here | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | Mistral-Nemo-Instruct-2407-f16.gguf | f16 | 24.50GB | false | Full F16 weights. | | Mistral-Nemo-Instruct-2407-Q80.gguf | Q80 | 13.02GB | false | Extremely high quality, generally unneeded but max available quant. | | Mistral-Nemo-Instruct-2407-Q6KL.gguf | Q6KL | 10.38GB | false | Uses Q80 for embed and output weights. Very high quality, near perfect, recommended. | | Mistral-Nemo-Instruct-2407-Q6K.gguf | Q6K | 10.06GB | false | Very high quality, near perfect, recommended. | | Mistral-Nemo-Instruct-2407-Q5KL.gguf | Q5KL | 9.14GB | false | Uses Q80 for embed and output weights. High quality, recommended. | | Mistral-Nemo-Instruct-2407-Q5KM.gguf | Q5KM | 8.73GB | false | High quality, recommended. | | Mistral-Nemo-Instruct-2407-Q5KS.gguf | Q5KS | 8.52GB | false | High quality, recommended. | | Mistral-Nemo-Instruct-2407-Q4KL.gguf | Q4KL | 7.98GB | false | Uses Q80 for embed and output weights. Good quality, recommended. | | Mistral-Nemo-Instruct-2407-Q4KM.gguf | Q4KM | 7.48GB | false | Good quality, default size for must use cases, recommended. | | Mistral-Nemo-Instruct-2407-Q3KXL.gguf | Q3KXL | 7.15GB | false | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | Mistral-Nemo-Instruct-2407-Q4KS.gguf | Q4KS | 7.12GB | false | Slightly lower quality with more space savings, recommended. | | Mistral-Nemo-Instruct-2407-Q40.gguf | Q40 | 7.09GB | false | Legacy format, generally not worth using over similarly sized formats | | Mistral-Nemo-Instruct-2407-Q4088.gguf | Q4088 | 7.07GB | false | Optimized for ARM inference. Requires 'sve' support (see link below). Don't use on Mac or Windows. | | Mistral-Nemo-Instruct-2407-Q4048.gguf | Q4048 | 7.07GB | false | Optimized for ARM inference. Requires 'i8mm' support (see link below). Don't use on Mac or Windows. | | Mistral-Nemo-Instruct-2407-Q4044.gguf | Q4044 | 7.07GB | false | Optimized for ARM inference. Should work well on all ARM chips, pick this if you're unsure. Don't use on Mac or Windows. | | Mistral-Nemo-Instruct-2407-IQ4XS.gguf | IQ4XS | 6.74GB | false | Decent quality, smaller than Q4KS with similar performance, recommended. | | Mistral-Nemo-Instruct-2407-Q3KL.gguf | Q3KL | 6.56GB | false | Lower quality but usable, good for low RAM availability. | | Mistral-Nemo-Instruct-2407-Q3KM.gguf | Q3KM | 6.08GB | false | Low quality. | | Mistral-Nemo-Instruct-2407-IQ3M.gguf | IQ3M | 5.72GB | false | Medium-low quality, new method with decent performance comparable to Q3KM. | | Mistral-Nemo-Instruct-2407-Q3KS.gguf | Q3KS | 5.53GB | false | Low quality, not recommended. | | Mistral-Nemo-Instruct-2407-Q2KL.gguf | Q2KL | 5.45GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | Mistral-Nemo-Instruct-2407-IQ3XS.gguf | IQ3XS | 5.31GB | false | Lower quality, new method with decent performance, slightly better than Q3KS. | | Mistral-Nemo-Instruct-2407-Q2K.gguf | Q2K | 4.79GB | false | Very low quality but surprisingly usable. | | Mistral-Nemo-Instruct-2407-IQ2M.gguf | IQ2M | 4.44GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. 
| Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to. Some say that this improves the quality, others don't notice any difference. If you use these models PLEASE COMMENT with your findings. I would like feedback that these are actually used and useful so I don't keep uploading quants no one is using. First, make sure you have hugginface-cli installed: If the model is bigger than 50GB, it will have been split into multiple files. In order to download them all to a local folder, run: You can either specify a new local-dir (Mistral-Nemo-Instruct-2407-Q80) or download them all in place (./) These are NOT for Metal (Apple) offloading, only ARM chips. If you're using an ARM chip, the Q40XX quants will have a substantial speedup. Check out Q4044 speed comparisons on the original pull request To check which one would work best for your ARM chip, you can check AArch64 SoC features (thanks EloyOn!). A great write up with charts showing various performances is provided by Artefact2 here The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have. If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB Smaller than that total. Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart: But basically, if you're aiming for below Q4, and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in format IQXX, like IQ3M. These are newer and offer better performance for their size. These I-quants can also be used on CPU and Apple Metal, but will be slower than their K-quant equivalent, so speed vs performance is a tradeoff you'll have to decide. The I-quants are not compatible with Vulcan, which is also AMD, so if you have an AMD card double check if you're using the rocBLAS build or the Vulcan build. At the time of writing this, LM Studio has a preview with ROCm support, and other inference engines have specific builds for ROCm. Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset Thank you ZeroWw for the inspiration to experiment with embed/output Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
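Besides llama-cli, llama.cpp ships a small HTTP server with an OpenAI-compatible endpoint, which is a convenient way to use these quants from other tools. A minimal sketch, assuming a locally downloaded Q4KM file (named with underscores on the repo) and a standard llama.cpp build:

```bash
# Serve the quant over HTTP (OpenAI-compatible API) with llama.cpp
./llama-server -m ./Mistral-Nemo-Instruct-2407-Q4_K_M.gguf -c 8192 --port 8080

# Query it from another terminal
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Say hello in French."}]}'
```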
DeepSeek-R1-Distill-Qwen-32B-abliterated-GGUF
Llamacpp imatrix Quantizations of DeepSeek-R1-Distill-Qwen-32B-abliterated Original model: https://huggingface.co/huihui-ai/DeepSeek-R1-Distill-Qwen-32B-abliterated All quants made using imatrix option with dataset from here | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | DeepSeek-R1-Distill-Qwen-32B-abliterated-bf16.gguf | bf16 | 65.54GB | true | Full BF16 weights. | | DeepSeek-R1-Distill-Qwen-32B-abliterated-Q80.gguf | Q80 | 34.82GB | false | Extremely high quality, generally unneeded but max available quant. | | DeepSeek-R1-Distill-Qwen-32B-abliterated-Q6KL.gguf | Q6KL | 27.26GB | false | Uses Q80 for embed and output weights. Very high quality, near perfect, recommended. | | DeepSeek-R1-Distill-Qwen-32B-abliterated-Q6K.gguf | Q6K | 26.89GB | false | Very high quality, near perfect, recommended. | | DeepSeek-R1-Distill-Qwen-32B-abliterated-Q5KL.gguf | Q5KL | 23.74GB | false | Uses Q80 for embed and output weights. High quality, recommended. | | DeepSeek-R1-Distill-Qwen-32B-abliterated-Q5KM.gguf | Q5KM | 23.26GB | false | High quality, recommended. | | DeepSeek-R1-Distill-Qwen-32B-abliterated-Q5KS.gguf | Q5KS | 22.64GB | false | High quality, recommended. | | DeepSeek-R1-Distill-Qwen-32B-abliterated-Q41.gguf | Q41 | 20.64GB | false | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | DeepSeek-R1-Distill-Qwen-32B-abliterated-Q4KL.gguf | Q4KL | 20.43GB | false | Uses Q80 for embed and output weights. Good quality, recommended. | | DeepSeek-R1-Distill-Qwen-32B-abliterated-Q4KM.gguf | Q4KM | 19.85GB | false | Good quality, default size for most use cases, recommended. | | DeepSeek-R1-Distill-Qwen-32B-abliterated-Q4KS.gguf | Q4KS | 18.78GB | false | Slightly lower quality with more space savings, recommended. | | DeepSeek-R1-Distill-Qwen-32B-abliterated-Q40.gguf | Q40 | 18.71GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. | | DeepSeek-R1-Distill-Qwen-32B-abliterated-IQ4NL.gguf | IQ4NL | 18.68GB | false | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | DeepSeek-R1-Distill-Qwen-32B-abliterated-Q3KXL.gguf | Q3KXL | 17.93GB | false | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | DeepSeek-R1-Distill-Qwen-32B-abliterated-IQ4XS.gguf | IQ4XS | 17.69GB | false | Decent quality, smaller than Q4KS with similar performance, recommended. | | DeepSeek-R1-Distill-Qwen-32B-abliterated-Q3KL.gguf | Q3KL | 17.25GB | false | Lower quality but usable, good for low RAM availability. | | DeepSeek-R1-Distill-Qwen-32B-abliterated-Q3KM.gguf | Q3KM | 15.94GB | false | Low quality. | | DeepSeek-R1-Distill-Qwen-32B-abliterated-IQ3M.gguf | IQ3M | 14.81GB | false | Medium-low quality, new method with decent performance comparable to Q3KM. | | DeepSeek-R1-Distill-Qwen-32B-abliterated-Q3KS.gguf | Q3KS | 14.39GB | false | Low quality, not recommended. | | DeepSeek-R1-Distill-Qwen-32B-abliterated-IQ3XS.gguf | IQ3XS | 13.71GB | false | Lower quality, new method with decent performance, slightly better than Q3KS. | | DeepSeek-R1-Distill-Qwen-32B-abliterated-Q2KL.gguf | Q2KL | 13.07GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | DeepSeek-R1-Distill-Qwen-32B-abliterated-Q2K.gguf | Q2K | 12.31GB | false | Very low quality but surprisingly usable. 
| | DeepSeek-R1-Distill-Qwen-32B-abliterated-IQ2M.gguf | IQ2M | 11.26GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. | | DeepSeek-R1-Distill-Qwen-32B-abliterated-IQ2S.gguf | IQ2S | 10.39GB | false | Low quality, uses SOTA techniques to be usable. | | DeepSeek-R1-Distill-Qwen-32B-abliterated-IQ2XS.gguf | IQ2XS | 9.96GB | false | Low quality, uses SOTA techniques to be usable. | | DeepSeek-R1-Distill-Qwen-32B-abliterated-IQ2XXS.gguf | IQ2XXS | 9.03GB | false | Very low quality, uses SOTA techniques to be usable. | Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to. First, make sure you have hugginface-cli installed: If the model is bigger than 50GB, it will have been split into multiple files. In order to download them all to a local folder, run: You can either specify a new local-dir (DeepSeek-R1-Distill-Qwen-32B-abliterated-Q80) or download them all in place (./) Previously, you would download Q4044/48/88, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights. details in this PR. If you use Q40 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282 you will not be able to run the Q40XX files and will instead need to use Q40. Additionally, if you want to get slightly better quality for , you can use IQ4NL thanks to this PR which will also repack the weights for ARM, though only the 44 for now. The loading time may be slower but it will result in an overall speed incrase. I'm keeping this section to show the potential theoretical uplift in performance from using the Q40 with online repacking. 
Click to view benchmarks on an AVX2 system (EPYC7702) | model | size | params | backend | threads | test | t/s | % (vs Q40) | | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% | Q4088 offers a nice bump to prompt processing and a small bump to text generation A great write up with charts showing various performances is provided by Artefact2 here The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have. If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB Smaller than that total. Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart: But basically, if you're aiming for below Q4, and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in format IQXX, like IQ3M. These are newer and offer better performance for their size. These I-quants can also be used on CPU and Apple Metal, but will be slower than their K-quant equivalent, so speed vs performance is a tradeoff you'll have to decide. The I-quants are not compatible with Vulcan, which is also AMD, so if you have an AMD card double check if you're using the rocBLAS build or the Vulcan build. At the time of writing this, LM Studio has a preview with ROCm support, and other inference engines have specific builds for ROCm. Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output. Want to support my work? 
Visit my ko-fi page here: https://ko-fi.com/bartowski
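The pp512/tg128-style numbers in the benchmark table above come from llama.cpp's llama-bench tool, so you can measure the online-repacking uplift on your own CPU. A rough sketch, assuming you have the repo's Q40 quant downloaded locally (file names on the repo use underscores):

```bash
# Benchmark prompt processing (pp) and text generation (tg) at several sizes
./llama-bench -m ./DeepSeek-R1-Distill-Qwen-32B-abliterated-Q4_0.gguf \
  -p 512,1024,2048 -n 128,256,512 -t 64
```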
magnum-v4-12b-GGUF
Qwen2.5-72B-Instruct-GGUF
Llamacpp imatrix Quantizations of Qwen2.5-72B-Instruct Original model: https://huggingface.co/Qwen/Qwen2.5-72B-Instruct All quants made using imatrix option with dataset from here | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | Qwen2.5-72B-Instruct-Q80.gguf | Q80 | 77.26GB | true | Extremely high quality, generally unneeded but max available quant. | | Qwen2.5-72B-Instruct-Q6K.gguf | Q6K | 64.35GB | true | Very high quality, near perfect, recommended. | | Qwen2.5-72B-Instruct-Q5KM.gguf | Q5KM | 54.45GB | true | High quality, recommended. | | Qwen2.5-72B-Instruct-Q4KM.gguf | Q4KM | 47.42GB | false | Good quality, default size for must use cases, recommended. | | Qwen2.5-72B-Instruct-Q40.gguf | Q40 | 41.38GB | false | Legacy format, generally not worth using over similarly sized formats | | Qwen2.5-72B-Instruct-Q3KXL.gguf | Q3KXL | 40.60GB | false | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | Qwen2.5-72B-Instruct-IQ4XS.gguf | IQ4XS | 39.71GB | false | Decent quality, smaller than Q4KS with similar performance, recommended. | | Qwen2.5-72B-Instruct-Q3KL.gguf | Q3KL | 39.51GB | false | Lower quality but usable, good for low RAM availability. | | Qwen2.5-72B-Instruct-Q3KM.gguf | Q3KM | 37.70GB | false | Low quality. | | Qwen2.5-72B-Instruct-IQ3M.gguf | IQ3M | 35.50GB | false | Medium-low quality, new method with decent performance comparable to Q3KM. | | Qwen2.5-72B-Instruct-Q3KS.gguf | Q3KS | 34.49GB | false | Low quality, not recommended. | | Qwen2.5-72B-Instruct-IQ3XXS.gguf | IQ3XXS | 31.85GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. | | Qwen2.5-72B-Instruct-Q2KL.gguf | Q2KL | 31.03GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | Qwen2.5-72B-Instruct-Q2K.gguf | Q2K | 29.81GB | false | Very low quality but surprisingly usable. | | Qwen2.5-72B-Instruct-IQ2M.gguf | IQ2M | 29.34GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. | | Qwen2.5-72B-Instruct-IQ2XS.gguf | IQ2XS | 27.06GB | false | Low quality, uses SOTA techniques to be usable. | | Qwen2.5-72B-Instruct-IQ2XXS.gguf | IQ2XXS | 25.49GB | false | Very low quality, uses SOTA techniques to be usable. | | Qwen2.5-72B-Instruct-IQ1M.gguf | IQ1M | 23.74GB | false | Extremely low quality, not recommended. | Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to. Some say that this improves the quality, others don't notice any difference. If you use these models PLEASE COMMENT with your findings. I would like feedback that these are actually used and useful so I don't keep uploading quants no one is using. First, make sure you have hugginface-cli installed: If the model is bigger than 50GB, it will have been split into multiple files. In order to download them all to a local folder, run: You can either specify a new local-dir (Qwen2.5-72B-Instruct-Q80) or download them all in place (./) These are NOT for Metal (Apple) offloading, only ARM chips. If you're using an ARM chip, the Q40XX quants will have a substantial speedup. Check out Q4044 speed comparisons on the original pull request To check which one would work best for your ARM chip, you can check AArch64 SoC features (thanks EloyOn!). 
A great write-up with charts showing various performances is provided by Artefact2 here.

The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have. If you want your model running as FAST as possible, you'll want to fit the whole thing in your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. If you want the absolute maximum quality, add your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB smaller than that total (see the memory-check example below).

Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in the format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart. But basically, if you're aiming for below Q4 and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in the format IQXX, like IQ3M. They are newer and offer better performance for their size. The I-quants can also be used on CPU and Apple Metal, but will be slower than their K-quant equivalents, so speed vs. performance is a tradeoff you'll have to decide. The I-quants are not compatible with Vulkan, which also supports AMD, so if you have an AMD card double check if you're using the rocBLAS build or the Vulkan build. At the time of writing this, LM Studio has a preview with ROCm support, and other inference engines have specific builds for ROCm.

Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output. Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
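To apply the sizing advice above (pick a quant 1-2GB smaller than the memory you plan to use), it helps to check what you actually have first. A minimal sketch, assuming an Nvidia GPU for the VRAM query; AMD users would use rocm-smi or similar instead:

```bash
# Total VRAM per GPU (Nvidia)
nvidia-smi --query-gpu=name,memory.total --format=csv,noheader

# Total system RAM
free -h
```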
Qwen_Qwen3-Coder-Next-GGUF
Meta-Llama-3-8B-Instruct-GGUF
NemoMix-Unleashed-12B-GGUF
mistralai_Mistral-Small-3.2-24B-Instruct-2506-GGUF
DeepSeek-R1-Distill-Qwen-1.5B-GGUF
Llamacpp imatrix Quantizations of DeepSeek-R1-Distill-Qwen-1.5B Original model: https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B All quants made using imatrix option with dataset from here | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | DeepSeek-R1-Distill-Qwen-1.5B-f32.gguf | f32 | 7.11GB | false | Full F32 weights. | | DeepSeek-R1-Distill-Qwen-1.5B-f16.gguf | f16 | 3.56GB | false | Full F16 weights. | | DeepSeek-R1-Distill-Qwen-1.5B-Q80.gguf | Q80 | 1.89GB | false | Extremely high quality, generally unneeded but max available quant. | | DeepSeek-R1-Distill-Qwen-1.5B-Q6KL.gguf | Q6KL | 1.58GB | false | Uses Q80 for embed and output weights. Very high quality, near perfect, recommended. | | DeepSeek-R1-Distill-Qwen-1.5B-Q6K.gguf | Q6K | 1.46GB | false | Very high quality, near perfect, recommended. | | DeepSeek-R1-Distill-Qwen-1.5B-Q5KL.gguf | Q5KL | 1.43GB | false | Uses Q80 for embed and output weights. High quality, recommended. | | DeepSeek-R1-Distill-Qwen-1.5B-Q5KM.gguf | Q5KM | 1.29GB | false | High quality, recommended. | | DeepSeek-R1-Distill-Qwen-1.5B-Q4KL.gguf | Q4KL | 1.29GB | false | Uses Q80 for embed and output weights. Good quality, recommended. | | DeepSeek-R1-Distill-Qwen-1.5B-Q5KS.gguf | Q5KS | 1.26GB | false | High quality, recommended. | | DeepSeek-R1-Distill-Qwen-1.5B-Q3KXL.gguf | Q3KXL | 1.18GB | false | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | DeepSeek-R1-Distill-Qwen-1.5B-Q41.gguf | Q41 | 1.16GB | false | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | DeepSeek-R1-Distill-Qwen-1.5B-Q4KM.gguf | Q4KM | 1.12GB | false | Good quality, default size for most use cases, recommended. | | DeepSeek-R1-Distill-Qwen-1.5B-Q4KS.gguf | Q4KS | 1.07GB | false | Slightly lower quality with more space savings, recommended. | | DeepSeek-R1-Distill-Qwen-1.5B-Q40.gguf | Q40 | 1.07GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. | | DeepSeek-R1-Distill-Qwen-1.5B-IQ4NL.gguf | IQ4NL | 1.07GB | false | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | DeepSeek-R1-Distill-Qwen-1.5B-IQ4XS.gguf | IQ4XS | 1.02GB | false | Decent quality, smaller than Q4KS with similar performance, recommended. | | DeepSeek-R1-Distill-Qwen-1.5B-Q3KL.gguf | Q3KL | 0.98GB | false | Lower quality but usable, good for low RAM availability. | | DeepSeek-R1-Distill-Qwen-1.5B-Q2KL.gguf | Q2KL | 0.98GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | DeepSeek-R1-Distill-Qwen-1.5B-Q3KM.gguf | Q3KM | 0.92GB | false | Low quality. | | DeepSeek-R1-Distill-Qwen-1.5B-IQ3M.gguf | IQ3M | 0.88GB | false | Medium-low quality, new method with decent performance comparable to Q3KM. | | DeepSeek-R1-Distill-Qwen-1.5B-Q3KS.gguf | Q3KS | 0.86GB | false | Low quality, not recommended. | | DeepSeek-R1-Distill-Qwen-1.5B-IQ3XS.gguf | IQ3XS | 0.83GB | false | Lower quality, new method with decent performance, slightly better than Q3KS. | | DeepSeek-R1-Distill-Qwen-1.5B-Q2K.gguf | Q2K | 0.75GB | false | Very low quality but surprisingly usable. | | DeepSeek-R1-Distill-Qwen-1.5B-IQ2M.gguf | IQ2M | 0.70GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. 
| Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to. First, make sure you have hugginface-cli installed: If the model is bigger than 50GB, it will have been split into multiple files. In order to download them all to a local folder, run: You can either specify a new local-dir (DeepSeek-R1-Distill-Qwen-1.5B-Q80) or download them all in place (./) Previously, you would download Q4044/48/88, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights. details in this PR. If you use Q40 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282 you will not be able to run the Q40XX files and will instead need to use Q40. Additionally, if you want to get slightly better quality for , you can use IQ4NL thanks to this PR which will also repack the weights for ARM, though only the 44 for now. The loading time may be slower but it will result in an overall speed incrase. I'm keeping this section to show the potential theoretical uplift in performance from using the Q40 with online repacking. Click to view benchmarks on an AVX2 system (EPYC7702) | model | size | params | backend | threads | test | t/s | % (vs Q40) | | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% | Q4088 offers a nice bump to prompt processing and a small bump to text generation A great write up with charts showing various performances is provided by Artefact2 here The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have. If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. 
If you want the absolute maximum quality, add your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB smaller than that total.

Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in the format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart. But basically, if you're aiming for below Q4 and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in the format IQXX, like IQ3M. They are newer and offer better performance for their size. The I-quants can also be used on CPU and Apple Metal, but will be slower than their K-quant equivalents, so speed vs. performance is a tradeoff you'll have to decide. The I-quants are not compatible with Vulkan, which also supports AMD, so if you have an AMD card double check if you're using the rocBLAS build or the Vulkan build. At the time of writing this, LM Studio has a preview with ROCm support, and other inference engines have specific builds for ROCm.

Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output. Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
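Since even the largest quants of this 1.5B model are under 2GB, a CPU-only run is a reasonable way to try it. A minimal sketch, assuming the repo id bartowski/DeepSeek-R1-Distill-Qwen-1.5B-GGUF and underscored file names:

```bash
# Grab a mid-sized quant and run it entirely on CPU
huggingface-cli download bartowski/DeepSeek-R1-Distill-Qwen-1.5B-GGUF \
  --include "DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_M.gguf" --local-dir ./

./llama-cli -m ./DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_M.gguf \
  -p "Explain imatrix quantization in one paragraph." -n 256
```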
SmolLM2-1.7B-Instruct-GGUF
Alibaba-NLP_Tongyi-DeepResearch-30B-A3B-GGUF
Llamacpp imatrix Quantizations of Tongyi-DeepResearch-30B-A3B by Alibaba-NLP Original model: https://huggingface.co/Alibaba-NLP/Tongyi-DeepResearch-30B-A3B All quants made using imatrix option with dataset from here combined with a subset of combinedallsmall.parquet from Ed Addario here Run them directly with llama.cpp, or any other llama.cpp based project | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | Tongyi-DeepResearch-30B-A3B-bf16.gguf | bf16 | 61.10GB | true | Full BF16 weights. | | Tongyi-DeepResearch-30B-A3B-Q80.gguf | Q80 | 32.48GB | false | Extremely high quality, generally unneeded but max available quant. | | Tongyi-DeepResearch-30B-A3B-Q6KL.gguf | Q6KL | 25.26GB | false | Uses Q80 for embed and output weights. Very high quality, near perfect, recommended. | | Tongyi-DeepResearch-30B-A3B-Q6K.gguf | Q6K | 25.10GB | false | Very high quality, near perfect, recommended. | | Tongyi-DeepResearch-30B-A3B-Q5KL.gguf | Q5KL | 21.94GB | false | Uses Q80 for embed and output weights. High quality, recommended. | | Tongyi-DeepResearch-30B-A3B-Q5KM.gguf | Q5KM | 21.74GB | false | High quality, recommended. | | Tongyi-DeepResearch-30B-A3B-Q5KS.gguf | Q5KS | 21.10GB | false | High quality, recommended. | | Tongyi-DeepResearch-30B-A3B-Q41.gguf | Q41 | 19.21GB | false | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | Tongyi-DeepResearch-30B-A3B-Q4KL.gguf | Q4KL | 18.86GB | false | Uses Q80 for embed and output weights. Good quality, recommended. | | Tongyi-DeepResearch-30B-A3B-Q4KM.gguf | Q4KM | 18.63GB | false | Good quality, default size for most use cases, recommended. | | Tongyi-DeepResearch-30B-A3B-Q4KS.gguf | Q4KS | 17.98GB | false | Slightly lower quality with more space savings, recommended. | | Tongyi-DeepResearch-30B-A3B-Q40.gguf | Q40 | 17.63GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. | | Tongyi-DeepResearch-30B-A3B-IQ4NL.gguf | IQ4NL | 17.39GB | false | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | Tongyi-DeepResearch-30B-A3B-IQ4XS.gguf | IQ4XS | 16.46GB | false | Decent quality, smaller than Q4KS with similar performance, recommended. | | Tongyi-DeepResearch-30B-A3B-Q3KXL.gguf | Q3KXL | 14.86GB | false | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | Tongyi-DeepResearch-30B-A3B-Q3KL.gguf | Q3KL | 14.58GB | false | Lower quality but usable, good for low RAM availability. | | Tongyi-DeepResearch-30B-A3B-Q3KM.gguf | Q3KM | 14.08GB | false | Low quality. | | Tongyi-DeepResearch-30B-A3B-IQ3M.gguf | IQ3M | 14.08GB | false | Medium-low quality, new method with decent performance comparable to Q3KM. | | Tongyi-DeepResearch-30B-A3B-Q3KS.gguf | Q3KS | 13.43GB | false | Low quality, not recommended. | | Tongyi-DeepResearch-30B-A3B-IQ3XS.gguf | IQ3XS | 12.74GB | false | Lower quality, new method with decent performance, slightly better than Q3KS. | | Tongyi-DeepResearch-30B-A3B-IQ3XXS.gguf | IQ3XXS | 12.22GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. | | Tongyi-DeepResearch-30B-A3B-Q2KL.gguf | Q2KL | 11.21GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | Tongyi-DeepResearch-30B-A3B-Q2K.gguf | Q2K | 10.91GB | false | Very low quality but surprisingly usable. 
| Tongyi-DeepResearch-30B-A3B-IQ2M.gguf | IQ2M | 9.87GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. |
| Tongyi-DeepResearch-30B-A3B-IQ2S.gguf | IQ2S | 8.74GB | false | Low quality, uses SOTA techniques to be usable. |
| Tongyi-DeepResearch-30B-A3B-IQ2XS.gguf | IQ2XS | 8.66GB | false | Low quality, uses SOTA techniques to be usable. |
| Tongyi-DeepResearch-30B-A3B-IQ2XXS.gguf | IQ2XXS | 7.57GB | false | Very low quality, uses SOTA techniques to be usable. |

Some of these quants (Q3KXL, Q4KL etc.) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to.

First, make sure you have huggingface-cli installed. If the model is bigger than 50GB, it will have been split into multiple files; to download them all to a local folder, run the download command. You can either specify a new local-dir (Alibaba-NLPTongyi-DeepResearch-30B-A3B-Q80) or download them all in place (./).

Previously, you would download Q4044/48/88, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights; details in this PR. If you use Q40 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282 you will not be able to run the Q40XX files and will instead need to use Q40. Additionally, if you want slightly better quality, you can use IQ4NL thanks to this PR, which will also repack the weights for ARM (though only the 44 for now). The loading time may be slower, but it will result in an overall speed increase. I'm keeping this section to show the potential theoretical uplift in performance from using Q40 with online repacking.
Click to view benchmarks on an AVX2 system (EPYC7702)
| model | size | params | backend | threads | test | t/s | % (vs Q40) | | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% |
Q4088 offers a nice bump to prompt processing and a small bump to text generation.
A great write up with charts showing various performances is provided by Artefact2 here. The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have.
If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB smaller than that total.
Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart. But basically, if you're aiming for below Q4, and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in format IQXX, like IQ3M. These are newer and offer better performance for their size. These I-quants can also be used on CPU, but will be slower than their K-quant equivalent, so speed vs performance is a tradeoff you'll have to decide.
Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output.
Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
google_gemma-3-4b-it-GGUF
cerebras_GLM-4.5-Air-REAP-82B-A12B-GGUF
Meta-Llama-3.1-70B-Instruct-GGUF
TheDrummer_Cydonia-24B-v4.1-GGUF
Llamacpp imatrix Quantizations of Cydonia-24B-v4.1 by TheDrummer Original model: https://huggingface.co/TheDrummer/Cydonia-24B-v4.1 All quants made using imatrix option with dataset from here combined with a subset of combinedallsmall.parquet from Ed Addario here Run them directly with llama.cpp, or any other llama.cpp based project | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | Cydonia-24B-v4.1-bf16.gguf | bf16 | 47.15GB | false | Full BF16 weights. | | Cydonia-24B-v4.1-Q80.gguf | Q80 | 25.05GB | false | Extremely high quality, generally unneeded but max available quant. | | Cydonia-24B-v4.1-Q6KL.gguf | Q6KL | 19.67GB | false | Uses Q80 for embed and output weights. Very high quality, near perfect, recommended. | | Cydonia-24B-v4.1-Q6K.gguf | Q6K | 19.35GB | false | Very high quality, near perfect, recommended. | | Cydonia-24B-v4.1-Q5KL.gguf | Q5KL | 17.18GB | false | Uses Q80 for embed and output weights. High quality, recommended. | | Cydonia-24B-v4.1-Q5KM.gguf | Q5KM | 16.76GB | false | High quality, recommended. | | Cydonia-24B-v4.1-Q5KS.gguf | Q5KS | 16.30GB | false | High quality, recommended. | | Cydonia-24B-v4.1-Q41.gguf | Q41 | 14.87GB | false | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | Cydonia-24B-v4.1-Q4KL.gguf | Q4KL | 14.83GB | false | Uses Q80 for embed and output weights. Good quality, recommended. | | Cydonia-24B-v4.1-Q4KM.gguf | Q4KM | 14.33GB | false | Good quality, default size for most use cases, recommended. | | Cydonia-24B-v4.1-Q4KS.gguf | Q4KS | 13.55GB | false | Slightly lower quality with more space savings, recommended. | | Cydonia-24B-v4.1-Q40.gguf | Q40 | 13.49GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. | | Cydonia-24B-v4.1-IQ4NL.gguf | IQ4NL | 13.47GB | false | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | Cydonia-24B-v4.1-Q3KXL.gguf | Q3KXL | 12.99GB | false | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | Cydonia-24B-v4.1-IQ4XS.gguf | IQ4XS | 12.76GB | false | Decent quality, smaller than Q4KS with similar performance, recommended. | | Cydonia-24B-v4.1-Q3KL.gguf | Q3KL | 12.40GB | false | Lower quality but usable, good for low RAM availability. | | Cydonia-24B-v4.1-Q3KM.gguf | Q3KM | 11.47GB | false | Low quality. | | Cydonia-24B-v4.1-IQ3M.gguf | IQ3M | 10.65GB | false | Medium-low quality, new method with decent performance comparable to Q3KM. | | Cydonia-24B-v4.1-Q3KS.gguf | Q3KS | 10.40GB | false | Low quality, not recommended. | | Cydonia-24B-v4.1-IQ3XS.gguf | IQ3XS | 9.91GB | false | Lower quality, new method with decent performance, slightly better than Q3KS. | | Cydonia-24B-v4.1-Q2KL.gguf | Q2KL | 9.55GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | Cydonia-24B-v4.1-IQ3XXS.gguf | IQ3XXS | 9.28GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. | | Cydonia-24B-v4.1-Q2K.gguf | Q2K | 8.89GB | false | Very low quality but surprisingly usable. | | Cydonia-24B-v4.1-IQ2M.gguf | IQ2M | 8.11GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. | | Cydonia-24B-v4.1-IQ2S.gguf | IQ2S | 7.48GB | false | Low quality, uses SOTA techniques to be usable. | | Cydonia-24B-v4.1-IQ2XS.gguf | IQ2XS | 7.21GB | false | Low quality, uses SOTA techniques to be usable. 
| Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to.
First, make sure you have huggingface-cli installed; example install and download commands are sketched below. If the model is bigger than 50GB, it will have been split into multiple files. In order to download them all to a local folder, run the download command shown below. You can either specify a new local-dir (TheDrummerCydonia-24B-v4.1-Q80) or download them all in place (./)
Previously, you would download Q4044/48/88, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights; details are in this PR. If you use Q40 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282 you will not be able to run the Q40XX files and will instead need to use Q40. Additionally, if you want to get slightly better quality for ARM and AVX machines, you can use IQ4NL thanks to this PR, which will also repack the weights for ARM, though only the 44 for now. The loading time may be slower but it will result in an overall speed increase. I'm keeping this section to show the potential theoretical uplift in performance from using the Q40 with online repacking.
Click to view benchmarks on an AVX2 system (EPYC7702)
| model | size | params | backend | threads | test | t/s | % (vs Q40) | | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% |
Q4088 offers a nice bump to prompt processing and a small bump to text generation.
A great write up with charts showing various performances is provided by Artefact2 here. The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have.
If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM.
If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB smaller than that total.
Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart. But basically, if you're aiming for below Q4, and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in format IQXX, like IQ3M. These are newer and offer better performance for their size. These I-quants can also be used on CPU, but will be slower than their K-quant equivalent, so speed vs performance is a tradeoff you'll have to decide.
Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output.
Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
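The download commands referenced earlier in this card did not survive the scrape; a minimal sketch, assuming the repo id bartowski/TheDrummer_Cydonia-24B-v4.1-GGUF and that the actual filenames contain underscores (e.g. Q4_K_M):

```
pip install -U "huggingface_hub[cli]"
huggingface-cli download bartowski/TheDrummer_Cydonia-24B-v4.1-GGUF \
  --include "Cydonia-24B-v4.1-Q4_K_M.gguf" --local-dir ./
```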
mlabonne_Qwen3-14B-abliterated-GGUF
Llamacpp imatrix Quantizations of Qwen3-14B-abliterated by mlabonne Original model: https://huggingface.co/mlabonne/Qwen3-14B-abliterated All quants made using imatrix option with dataset from here Run them directly with llama.cpp, or any other llama.cpp based project | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | Qwen3-14B-abliterated-bf16.gguf | bf16 | 29.54GB | false | Full BF16 weights. | | Qwen3-14B-abliterated-Q80.gguf | Q80 | 15.70GB | false | Extremely high quality, generally unneeded but max available quant. | | Qwen3-14B-abliterated-Q6KL.gguf | Q6KL | 12.50GB | false | Uses Q80 for embed and output weights. Very high quality, near perfect, recommended. | | Qwen3-14B-abliterated-Q6K.gguf | Q6K | 12.12GB | false | Very high quality, near perfect, recommended. | | Qwen3-14B-abliterated-Q5KL.gguf | Q5KL | 10.99GB | false | Uses Q80 for embed and output weights. High quality, recommended. | | Qwen3-14B-abliterated-Q5KM.gguf | Q5KM | 10.51GB | false | High quality, recommended. | | Qwen3-14B-abliterated-Q5KS.gguf | Q5KS | 10.26GB | false | High quality, recommended. | | Qwen3-14B-abliterated-Q4KL.gguf | Q4KL | 9.58GB | false | Uses Q80 for embed and output weights. Good quality, recommended. | | Qwen3-14B-abliterated-Q41.gguf | Q41 | 9.39GB | false | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | Qwen3-14B-abliterated-Q4KM.gguf | Q4KM | 9.00GB | false | Good quality, default size for most use cases, recommended. | | Qwen3-14B-abliterated-Q3KXL.gguf | Q3KXL | 8.58GB | false | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | Qwen3-14B-abliterated-Q4KS.gguf | Q4KS | 8.57GB | false | Slightly lower quality with more space savings, recommended. | | Qwen3-14B-abliterated-Q40.gguf | Q40 | 8.54GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. | | Qwen3-14B-abliterated-IQ4NL.gguf | IQ4NL | 8.54GB | false | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | Qwen3-14B-abliterated-IQ4XS.gguf | IQ4XS | 8.11GB | false | Decent quality, smaller than Q4KS with similar performance, recommended. | | Qwen3-14B-abliterated-Q3KL.gguf | Q3KL | 7.90GB | false | Lower quality but usable, good for low RAM availability. | | Qwen3-14B-abliterated-Q3KM.gguf | Q3KM | 7.32GB | false | Low quality. | | Qwen3-14B-abliterated-IQ3M.gguf | IQ3M | 6.88GB | false | Medium-low quality, new method with decent performance comparable to Q3KM. | | Qwen3-14B-abliterated-Q3KS.gguf | Q3KS | 6.66GB | false | Low quality, not recommended. | | Qwen3-14B-abliterated-Q2KL.gguf | Q2KL | 6.51GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | Qwen3-14B-abliterated-IQ3XS.gguf | IQ3XS | 6.38GB | false | Lower quality, new method with decent performance, slightly better than Q3KS. | | Qwen3-14B-abliterated-IQ3XXS.gguf | IQ3XXS | 5.94GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. | | Qwen3-14B-abliterated-Q2K.gguf | Q2K | 5.75GB | false | Very low quality but surprisingly usable. | | Qwen3-14B-abliterated-IQ2M.gguf | IQ2M | 5.32GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. | | Qwen3-14B-abliterated-IQ2S.gguf | IQ2S | 4.96GB | false | Low quality, uses SOTA techniques to be usable. 
| Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to.
First, make sure you have huggingface-cli installed; example install and download commands are sketched below. If the model is bigger than 50GB, it will have been split into multiple files. In order to download them all to a local folder, run the download command shown below. You can either specify a new local-dir (mlabonneQwen3-14B-abliterated-Q80) or download them all in place (./)
Previously, you would download Q4044/48/88, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights; details are in this PR. If you use Q40 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282 you will not be able to run the Q40XX files and will instead need to use Q40. Additionally, if you want to get slightly better quality for ARM and AVX machines, you can use IQ4NL thanks to this PR, which will also repack the weights for ARM, though only the 44 for now. The loading time may be slower but it will result in an overall speed increase. I'm keeping this section to show the potential theoretical uplift in performance from using the Q40 with online repacking.
Click to view benchmarks on an AVX2 system (EPYC7702)
| model | size | params | backend | threads | test | t/s | % (vs Q40) | | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% |
Q4088 offers a nice bump to prompt processing and a small bump to text generation.
A great write up with charts showing various performances is provided by Artefact2 here. The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have.
If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM.
If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB smaller than that total.
Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart. But basically, if you're aiming for below Q4, and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in format IQXX, like IQ3M. These are newer and offer better performance for their size. These I-quants can also be used on CPU, but will be slower than their K-quant equivalent, so speed vs performance is a tradeoff you'll have to decide.
Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output.
Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
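As with the other cards, the download commands were stripped; a minimal sketch, assuming the repo id bartowski/mlabonne_Qwen3-14B-abliterated-GGUF and underscored filenames (e.g. Q4_K_M):

```
pip install -U "huggingface_hub[cli]"
huggingface-cli download bartowski/mlabonne_Qwen3-14B-abliterated-GGUF \
  --include "Qwen3-14B-abliterated-Q4_K_M.gguf" --local-dir ./
```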
DeepSeek-R1-Distill-Llama-8B-GGUF
DeepSeek-R1-Distill-Qwen-7B-GGUF
Llamacpp imatrix Quantizations of DeepSeek-R1-Distill-Qwen-7B Original model: https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B All quants made using imatrix option with dataset from here | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | DeepSeek-R1-Distill-Qwen-7B-f32.gguf | f32 | 30.47GB | false | Full F32 weights. | | DeepSeek-R1-Distill-Qwen-7B-f16.gguf | f16 | 15.24GB | false | Full F16 weights. | | DeepSeek-R1-Distill-Qwen-7B-Q80.gguf | Q80 | 8.10GB | false | Extremely high quality, generally unneeded but max available quant. | | DeepSeek-R1-Distill-Qwen-7B-Q6KL.gguf | Q6KL | 6.52GB | false | Uses Q80 for embed and output weights. Very high quality, near perfect, recommended. | | DeepSeek-R1-Distill-Qwen-7B-Q6K.gguf | Q6K | 6.25GB | false | Very high quality, near perfect, recommended. | | DeepSeek-R1-Distill-Qwen-7B-Q5KL.gguf | Q5KL | 5.78GB | false | Uses Q80 for embed and output weights. High quality, recommended. | | DeepSeek-R1-Distill-Qwen-7B-Q5KM.gguf | Q5KM | 5.44GB | false | High quality, recommended. | | DeepSeek-R1-Distill-Qwen-7B-Q5KS.gguf | Q5KS | 5.32GB | false | High quality, recommended. | | DeepSeek-R1-Distill-Qwen-7B-Q4KL.gguf | Q4KL | 5.09GB | false | Uses Q80 for embed and output weights. Good quality, recommended. | | DeepSeek-R1-Distill-Qwen-7B-Q41.gguf | Q41 | 4.87GB | false | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | DeepSeek-R1-Distill-Qwen-7B-Q4KM.gguf | Q4KM | 4.68GB | false | Good quality, default size for most use cases, recommended. | | DeepSeek-R1-Distill-Qwen-7B-Q3KXL.gguf | Q3KXL | 4.57GB | false | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | DeepSeek-R1-Distill-Qwen-7B-Q4KS.gguf | Q4KS | 4.46GB | false | Slightly lower quality with more space savings, recommended. | | DeepSeek-R1-Distill-Qwen-7B-Q40.gguf | Q40 | 4.44GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. | | DeepSeek-R1-Distill-Qwen-7B-IQ4NL.gguf | IQ4NL | 4.44GB | false | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | DeepSeek-R1-Distill-Qwen-7B-IQ4XS.gguf | IQ4XS | 4.22GB | false | Decent quality, smaller than Q4KS with similar performance, recommended. | | DeepSeek-R1-Distill-Qwen-7B-Q3KL.gguf | Q3KL | 4.09GB | false | Lower quality but usable, good for low RAM availability. | | DeepSeek-R1-Distill-Qwen-7B-Q3KM.gguf | Q3KM | 3.81GB | false | Low quality. | | DeepSeek-R1-Distill-Qwen-7B-IQ3M.gguf | IQ3M | 3.57GB | false | Medium-low quality, new method with decent performance comparable to Q3KM. | | DeepSeek-R1-Distill-Qwen-7B-Q2KL.gguf | Q2KL | 3.55GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | DeepSeek-R1-Distill-Qwen-7B-Q3KS.gguf | Q3KS | 3.49GB | false | Low quality, not recommended. | | DeepSeek-R1-Distill-Qwen-7B-IQ3XS.gguf | IQ3XS | 3.35GB | false | Lower quality, new method with decent performance, slightly better than Q3KS. | | DeepSeek-R1-Distill-Qwen-7B-Q2K.gguf | Q2K | 3.02GB | false | Very low quality but surprisingly usable. | | DeepSeek-R1-Distill-Qwen-7B-IQ2M.gguf | IQ2M | 2.78GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. | Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to. 
First, make sure you have huggingface-cli installed; example install and download commands are sketched below. If the model is bigger than 50GB, it will have been split into multiple files. In order to download them all to a local folder, run the download command shown below. You can either specify a new local-dir (DeepSeek-R1-Distill-Qwen-7B-Q80) or download them all in place (./)
Previously, you would download Q4044/48/88, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights; details are in this PR. If you use Q40 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282 you will not be able to run the Q40XX files and will instead need to use Q40. Additionally, if you want to get slightly better quality for ARM and AVX machines, you can use IQ4NL thanks to this PR, which will also repack the weights for ARM, though only the 44 for now. The loading time may be slower but it will result in an overall speed increase. I'm keeping this section to show the potential theoretical uplift in performance from using the Q40 with online repacking.
Click to view benchmarks on an AVX2 system (EPYC7702)
| model | size | params | backend | threads | test | t/s | % (vs Q40) | | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% |
Q4088 offers a nice bump to prompt processing and a small bump to text generation.
A great write up with charts showing various performances is provided by Artefact2 here. The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have.
If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB smaller than that total.
Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart. But basically, if you're aiming for below Q4, and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in format IQXX, like IQ3M. These are newer and offer better performance for their size. These I-quants can also be used on CPU and Apple Metal, but will be slower than their K-quant equivalent, so speed vs performance is a tradeoff you'll have to decide. The I-quants are not compatible with Vulkan, which also supports AMD, so if you have an AMD card, double check if you're using the rocBLAS build or the Vulkan build. At the time of writing this, LM Studio has a preview with ROCm support, and other inference engines have specific builds for ROCm.
Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output.
Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
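The download commands referenced earlier in this card were dropped by the scrape; a minimal sketch, assuming the repo id bartowski/DeepSeek-R1-Distill-Qwen-7B-GGUF and underscored filenames (e.g. Q4_K_M):

```
pip install -U "huggingface_hub[cli]"
huggingface-cli download bartowski/DeepSeek-R1-Distill-Qwen-7B-GGUF \
  --include "DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf" --local-dir ./
```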
TheDrummer_Magidonia-24B-v4.2.0-GGUF
Llamacpp imatrix Quantizations of Magidonia-24B-v4.2.0 by TheDrummer Original model: https://huggingface.co/TheDrummer/Magidonia-24B-v4.2.0 All quants made using imatrix option with dataset from here combined with a subset of combinedallsmall.parquet from Ed Addario here Run them directly with llama.cpp, or any other llama.cpp based project | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | Magidonia-24B-v4.2.0-bf16.gguf | bf16 | 47.15GB | false | Full BF16 weights. | | Magidonia-24B-v4.2.0-Q80.gguf | Q80 | 25.05GB | false | Extremely high quality, generally unneeded but max available quant. | | Magidonia-24B-v4.2.0-Q6KL.gguf | Q6KL | 19.67GB | false | Uses Q80 for embed and output weights. Very high quality, near perfect, recommended. | | Magidonia-24B-v4.2.0-Q6K.gguf | Q6K | 19.35GB | false | Very high quality, near perfect, recommended. | | Magidonia-24B-v4.2.0-Q5KL.gguf | Q5KL | 17.18GB | false | Uses Q80 for embed and output weights. High quality, recommended. | | Magidonia-24B-v4.2.0-Q5KM.gguf | Q5KM | 16.76GB | false | High quality, recommended. | | Magidonia-24B-v4.2.0-Q5KS.gguf | Q5KS | 16.30GB | false | High quality, recommended. | | Magidonia-24B-v4.2.0-Q41.gguf | Q41 | 14.87GB | false | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | Magidonia-24B-v4.2.0-Q4KL.gguf | Q4KL | 14.83GB | false | Uses Q80 for embed and output weights. Good quality, recommended. | | Magidonia-24B-v4.2.0-Q4KM.gguf | Q4KM | 14.33GB | false | Good quality, default size for most use cases, recommended. | | Magidonia-24B-v4.2.0-Q4KS.gguf | Q4KS | 13.55GB | false | Slightly lower quality with more space savings, recommended. | | Magidonia-24B-v4.2.0-Q40.gguf | Q40 | 13.49GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. | | Magidonia-24B-v4.2.0-IQ4NL.gguf | IQ4NL | 13.47GB | false | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | Magidonia-24B-v4.2.0-Q3KXL.gguf | Q3KXL | 12.99GB | false | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | Magidonia-24B-v4.2.0-IQ4XS.gguf | IQ4XS | 12.76GB | false | Decent quality, smaller than Q4KS with similar performance, recommended. | | Magidonia-24B-v4.2.0-Q3KL.gguf | Q3KL | 12.40GB | false | Lower quality but usable, good for low RAM availability. | | Magidonia-24B-v4.2.0-Q3KM.gguf | Q3KM | 11.47GB | false | Low quality. | | Magidonia-24B-v4.2.0-IQ3M.gguf | IQ3M | 10.65GB | false | Medium-low quality, new method with decent performance comparable to Q3KM. | | Magidonia-24B-v4.2.0-Q3KS.gguf | Q3KS | 10.40GB | false | Low quality, not recommended. | | Magidonia-24B-v4.2.0-IQ3XS.gguf | IQ3XS | 9.91GB | false | Lower quality, new method with decent performance, slightly better than Q3KS. | | Magidonia-24B-v4.2.0-Q2KL.gguf | Q2KL | 9.55GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | Magidonia-24B-v4.2.0-IQ3XXS.gguf | IQ3XXS | 9.28GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. | | Magidonia-24B-v4.2.0-Q2K.gguf | Q2K | 8.89GB | false | Very low quality but surprisingly usable. | | Magidonia-24B-v4.2.0-IQ2M.gguf | IQ2M | 8.11GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. | | Magidonia-24B-v4.2.0-IQ2S.gguf | IQ2S | 7.48GB | false | Low quality, uses SOTA techniques to be usable. 
| | Magidonia-24B-v4.2.0-IQ2XS.gguf | IQ2XS | 7.21GB | false | Low quality, uses SOTA techniques to be usable. |
Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to.
First, make sure you have huggingface-cli installed; example install and download commands are sketched below. If the model is bigger than 50GB, it will have been split into multiple files. In order to download them all to a local folder, run the download command shown below. You can either specify a new local-dir (TheDrummerMagidonia-24B-v4.2.0-Q80) or download them all in place (./)
Previously, you would download Q4044/48/88, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights; details are in this PR. If you use Q40 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282 you will not be able to run the Q40XX files and will instead need to use Q40. Additionally, if you want to get slightly better quality for ARM and AVX machines, you can use IQ4NL thanks to this PR, which will also repack the weights for ARM, though only the 44 for now. The loading time may be slower but it will result in an overall speed increase. I'm keeping this section to show the potential theoretical uplift in performance from using the Q40 with online repacking.
Click to view benchmarks on an AVX2 system (EPYC7702)
| model | size | params | backend | threads | test | t/s | % (vs Q40) | | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% |
Q4088 offers a nice bump to prompt processing and a small bump to text generation.
A great write up with charts showing various performances is provided by Artefact2 here. The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have.
If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB smaller than that total.
Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart. But basically, if you're aiming for below Q4, and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in format IQXX, like IQ3M. These are newer and offer better performance for their size. These I-quants can also be used on CPU, but will be slower than their K-quant equivalent, so speed vs performance is a tradeoff you'll have to decide.
Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output.
Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
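The download commands mentioned earlier in this card were stripped; a minimal sketch, assuming the repo id bartowski/TheDrummer_Magidonia-24B-v4.2.0-GGUF and underscored filenames (e.g. Q4_K_M):

```
pip install -U "huggingface_hub[cli]"
huggingface-cli download bartowski/TheDrummer_Magidonia-24B-v4.2.0-GGUF \
  --include "Magidonia-24B-v4.2.0-Q4_K_M.gguf" --local-dir ./
```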
Qwen_Qwen3.5-0.8B-GGUF
DeepSeek-R1-Distill-Qwen-14B-GGUF
Llamacpp imatrix Quantizations of DeepSeek-R1-Distill-Qwen-14B Original model: https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-14B All quants made using imatrix option with dataset from here | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | DeepSeek-R1-Distill-Qwen-14B-f32.gguf | f32 | 59.09GB | true | Full F32 weights. | | DeepSeek-R1-Distill-Qwen-14B-f16.gguf | f16 | 29.55GB | false | Full F16 weights. | | DeepSeek-R1-Distill-Qwen-14B-Q80.gguf | Q80 | 15.70GB | false | Extremely high quality, generally unneeded but max available quant. | | DeepSeek-R1-Distill-Qwen-14B-Q6KL.gguf | Q6KL | 12.50GB | false | Uses Q80 for embed and output weights. Very high quality, near perfect, recommended. | | DeepSeek-R1-Distill-Qwen-14B-Q6K.gguf | Q6K | 12.12GB | false | Very high quality, near perfect, recommended. | | DeepSeek-R1-Distill-Qwen-14B-Q5KL.gguf | Q5KL | 10.99GB | false | Uses Q80 for embed and output weights. High quality, recommended. | | DeepSeek-R1-Distill-Qwen-14B-Q5KM.gguf | Q5KM | 10.51GB | false | High quality, recommended. | | DeepSeek-R1-Distill-Qwen-14B-Q5KS.gguf | Q5KS | 10.27GB | false | High quality, recommended. | | DeepSeek-R1-Distill-Qwen-14B-Q4KL.gguf | Q4KL | 9.57GB | false | Uses Q80 for embed and output weights. Good quality, recommended. | | DeepSeek-R1-Distill-Qwen-14B-Q41.gguf | Q41 | 9.39GB | false | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | DeepSeek-R1-Distill-Qwen-14B-Q4KM.gguf | Q4KM | 8.99GB | false | Good quality, default size for most use cases, recommended. | | DeepSeek-R1-Distill-Qwen-14B-Q3KXL.gguf | Q3KXL | 8.61GB | false | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | DeepSeek-R1-Distill-Qwen-14B-Q4KS.gguf | Q4KS | 8.57GB | false | Slightly lower quality with more space savings, recommended. | | DeepSeek-R1-Distill-Qwen-14B-IQ4NL.gguf | IQ4NL | 8.55GB | false | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | DeepSeek-R1-Distill-Qwen-14B-Q40.gguf | Q40 | 8.54GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. | | DeepSeek-R1-Distill-Qwen-14B-IQ4XS.gguf | IQ4XS | 8.12GB | false | Decent quality, smaller than Q4KS with similar performance, recommended. | | DeepSeek-R1-Distill-Qwen-14B-Q3KL.gguf | Q3KL | 7.92GB | false | Lower quality but usable, good for low RAM availability. | | DeepSeek-R1-Distill-Qwen-14B-Q3KM.gguf | Q3KM | 7.34GB | false | Low quality. | | DeepSeek-R1-Distill-Qwen-14B-IQ3M.gguf | IQ3M | 6.92GB | false | Medium-low quality, new method with decent performance comparable to Q3KM. | | DeepSeek-R1-Distill-Qwen-14B-Q3KS.gguf | Q3KS | 6.66GB | false | Low quality, not recommended. | | DeepSeek-R1-Distill-Qwen-14B-Q2KL.gguf | Q2KL | 6.53GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | DeepSeek-R1-Distill-Qwen-14B-IQ3XS.gguf | IQ3XS | 6.38GB | false | Lower quality, new method with decent performance, slightly better than Q3KS. | | DeepSeek-R1-Distill-Qwen-14B-Q2K.gguf | Q2K | 5.77GB | false | Very low quality but surprisingly usable. | | DeepSeek-R1-Distill-Qwen-14B-IQ2M.gguf | IQ2M | 5.36GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. | | DeepSeek-R1-Distill-Qwen-14B-IQ2S.gguf | IQ2S | 5.00GB | false | Low quality, uses SOTA techniques to be usable. 
| | DeepSeek-R1-Distill-Qwen-14B-IQ2XS.gguf | IQ2XS | 4.70GB | false | Low quality, uses SOTA techniques to be usable. |
Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to.
First, make sure you have huggingface-cli installed; example install and download commands are sketched below. If the model is bigger than 50GB, it will have been split into multiple files. In order to download them all to a local folder, run the download command shown below. You can either specify a new local-dir (DeepSeek-R1-Distill-Qwen-14B-Q80) or download them all in place (./)
Previously, you would download Q4044/48/88, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights; details are in this PR. If you use Q40 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282 you will not be able to run the Q40XX files and will instead need to use Q40. Additionally, if you want to get slightly better quality for ARM and AVX machines, you can use IQ4NL thanks to this PR, which will also repack the weights for ARM, though only the 44 for now. The loading time may be slower but it will result in an overall speed increase. I'm keeping this section to show the potential theoretical uplift in performance from using the Q40 with online repacking.
Click to view benchmarks on an AVX2 system (EPYC7702)
| model | size | params | backend | threads | test | t/s | % (vs Q40) | | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% |
Q4088 offers a nice bump to prompt processing and a small bump to text generation.
A great write up with charts showing various performances is provided by Artefact2 here. The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have.
If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB smaller than that total.
Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart. But basically, if you're aiming for below Q4, and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in format IQXX, like IQ3M. These are newer and offer better performance for their size. These I-quants can also be used on CPU and Apple Metal, but will be slower than their K-quant equivalent, so speed vs performance is a tradeoff you'll have to decide. The I-quants are not compatible with Vulkan, which also supports AMD, so if you have an AMD card, double check if you're using the rocBLAS build or the Vulkan build. At the time of writing this, LM Studio has a preview with ROCm support, and other inference engines have specific builds for ROCm.
Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output.
Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
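The download commands referenced earlier in this card were stripped by the scrape; a minimal sketch, assuming the repo id bartowski/DeepSeek-R1-Distill-Qwen-14B-GGUF and underscored filenames (e.g. Q4_K_M, f32):

```
pip install -U "huggingface_hub[cli]"

# a single quant file
huggingface-cli download bartowski/DeepSeek-R1-Distill-Qwen-14B-GGUF \
  --include "DeepSeek-R1-Distill-Qwen-14B-Q4_K_M.gguf" --local-dir ./

# the f32 is over 50GB and split into a folder, so grab the whole folder
huggingface-cli download bartowski/DeepSeek-R1-Distill-Qwen-14B-GGUF \
  --include "DeepSeek-R1-Distill-Qwen-14B-f32/*" --local-dir ./
```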
PocketDoc_Dans-PersonalityEngine-V1.2.0-24b-GGUF
Qwen2.5-14B-Instruct-GGUF
Qwen_Qwen3.5-2B-GGUF
gemma-2-9b-it-GGUF
Original model: https://huggingface.co/google/gemma-2-9b-it All quants made using imatrix option with dataset from here. Note that this model does not support a System prompt.
| Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | gemma-2-9b-it-f32.gguf | f32 | 36.97GB | false | Full F32 weights. | | gemma-2-9b-it-Q80.gguf | Q80 | 9.83GB | false | Extremely high quality, generally unneeded but max available quant. | | gemma-2-9b-it-Q6KL.gguf | Q6KL | 7.81GB | false | Uses Q80 for embed and output weights. Very high quality, near perfect, recommended. | | gemma-2-9b-it-Q6K.gguf | Q6K | 7.59GB | false | Very high quality, near perfect, recommended. | | gemma-2-9b-it-Q5KL.gguf | Q5KL | 6.87GB | false | Uses Q80 for embed and output weights. High quality, recommended. | | gemma-2-9b-it-Q5KM.gguf | Q5KM | 6.65GB | false | High quality, recommended. | | gemma-2-9b-it-Q5KS.gguf | Q5KS | 6.48GB | false | High quality, recommended. | | gemma-2-9b-it-Q4KL.gguf | Q4KL | 5.98GB | false | Uses Q80 for embed and output weights. Good quality, recommended. | | gemma-2-9b-it-Q4KM.gguf | Q4KM | 5.76GB | false | Good quality, default size for most use cases, recommended. | | gemma-2-9b-it-Q4KS.gguf | Q4KS | 5.48GB | false | Slightly lower quality with more space savings, recommended. | | gemma-2-9b-it-IQ4XS.gguf | IQ4XS | 5.18GB | false | Decent quality, smaller than Q4KS with similar performance, recommended. | | gemma-2-9b-it-Q3KL.gguf | Q3KL | 5.13GB | false | Lower quality but usable, good for low RAM availability. | | gemma-2-9b-it-Q3KM.gguf | Q3KM | 4.76GB | false | Low quality. | | gemma-2-9b-it-IQ3M.gguf | IQ3M | 4.49GB | false | Medium-low quality, new method with decent performance comparable to Q3KM. | | gemma-2-9b-it-Q3KS.gguf | Q3KS | 4.34GB | false | Low quality, not recommended. | | gemma-2-9b-it-IQ3XS.gguf | IQ3XS | 4.14GB | false | Lower quality, new method with decent performance, slightly better than Q3KS. | | gemma-2-9b-it-Q2KL.gguf | Q2KL | 4.03GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | gemma-2-9b-it-Q2K.gguf | Q2K | 3.81GB | false | Very low quality but surprisingly usable. | | gemma-2-9b-it-IQ3XXS.gguf | IQ3XXS | 3.80GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. | | gemma-2-9b-it-IQ2M.gguf | IQ2M | 3.43GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. |
Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output.
First, make sure you have huggingface-cli installed; example install and download commands are sketched below. If the model is bigger than 50GB, it will have been split into multiple files. In order to download them all to a local folder, run the download command shown below. You can either specify a new local-dir (gemma-2-9b-it-Q80) or download them all in place (./)
A great write up with charts showing various performances is provided by Artefact2 here. The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have.
If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB smaller than that total.
Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart. But basically, if you're aiming for below Q4, and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in format IQXX, like IQ3M. These are newer and offer better performance for their size. These I-quants can also be used on CPU and Apple Metal, but will be slower than their K-quant equivalent, so speed vs performance is a tradeoff you'll have to decide. The I-quants are not compatible with Vulkan, which also supports AMD, so if you have an AMD card, double check if you're using the rocBLAS build or the Vulkan build. At the time of writing this, LM Studio has a preview with ROCm support, and other inference engines have specific builds for ROCm.
Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
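The download commands referenced earlier in this card were stripped; a minimal sketch, assuming the repo id bartowski/gemma-2-9b-it-GGUF and underscored filenames (e.g. Q4_K_M):

```
pip install -U "huggingface_hub[cli]"
huggingface-cli download bartowski/gemma-2-9b-it-GGUF \
  --include "gemma-2-9b-it-Q4_K_M.gguf" --local-dir ./
```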
google_gemma-3-27b-it-qat-GGUF
TheDrummer_Cydonia-24B-v2-GGUF
mistralai_Ministral-3-14B-Reasoning-2512-GGUF
Qwen_Qwen3-Next-80B-A3B-Thinking-GGUF
Llama-3.3-70B-Instruct-GGUF
Qwen_Qwen3-30B-A3B-Instruct-2507-GGUF
Llamacpp imatrix Quantizations of Qwen3-30B-A3B-Instruct-2507 by Qwen Original model: https://huggingface.co/Qwen/Qwen3-30B-A3B-Instruct-2507 All quants made using imatrix option with dataset from here Run them directly with llama.cpp, or any other llama.cpp based project | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | Qwen3-30B-A3B-Instruct-2507-bf16.gguf | bf16 | 61.10GB | true | Full BF16 weights. | | Qwen3-30B-A3B-Instruct-2507-Q80.gguf | Q80 | 32.48GB | false | Extremely high quality, generally unneeded but max available quant. | | Qwen3-30B-A3B-Instruct-2507-Q6KL.gguf | Q6KL | 25.26GB | false | Uses Q80 for embed and output weights. Very high quality, near perfect, recommended. | | Qwen3-30B-A3B-Instruct-2507-Q6K.gguf | Q6K | 25.10GB | false | Very high quality, near perfect, recommended. | | Qwen3-30B-A3B-Instruct-2507-Q5KL.gguf | Q5KL | 21.94GB | false | Uses Q80 for embed and output weights. High quality, recommended. | | Qwen3-30B-A3B-Instruct-2507-Q5KM.gguf | Q5KM | 21.74GB | false | High quality, recommended. | | Qwen3-30B-A3B-Instruct-2507-Q5KS.gguf | Q5KS | 21.10GB | false | High quality, recommended. | | Qwen3-30B-A3B-Instruct-2507-Q41.gguf | Q41 | 19.21GB | false | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | Qwen3-30B-A3B-Instruct-2507-Q4KL.gguf | Q4KL | 18.86GB | false | Uses Q80 for embed and output weights. Good quality, recommended. | | Qwen3-30B-A3B-Instruct-2507-Q4KM.gguf | Q4KM | 18.63GB | false | Good quality, default size for most use cases, recommended. | | Qwen3-30B-A3B-Instruct-2507-Q4KS.gguf | Q4KS | 17.98GB | false | Slightly lower quality with more space savings, recommended. | | Qwen3-30B-A3B-Instruct-2507-Q40.gguf | Q40 | 17.63GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. | | Qwen3-30B-A3B-Instruct-2507-IQ4NL.gguf | IQ4NL | 17.39GB | false | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | Qwen3-30B-A3B-Instruct-2507-IQ4XS.gguf | IQ4XS | 16.46GB | false | Decent quality, smaller than Q4KS with similar performance, recommended. | | Qwen3-30B-A3B-Instruct-2507-Q3KXL.gguf | Q3KXL | 14.86GB | false | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | Qwen3-30B-A3B-Instruct-2507-Q3KL.gguf | Q3KL | 14.58GB | false | Lower quality but usable, good for low RAM availability. | | Qwen3-30B-A3B-Instruct-2507-Q3KM.gguf | Q3KM | 14.08GB | false | Low quality. | | Qwen3-30B-A3B-Instruct-2507-IQ3M.gguf | IQ3M | 14.08GB | false | Medium-low quality, new method with decent performance comparable to Q3KM. | | Qwen3-30B-A3B-Instruct-2507-Q3KS.gguf | Q3KS | 13.43GB | false | Low quality, not recommended. | | Qwen3-30B-A3B-Instruct-2507-IQ3XS.gguf | IQ3XS | 12.74GB | false | Lower quality, new method with decent performance, slightly better than Q3KS. | | Qwen3-30B-A3B-Instruct-2507-IQ3XXS.gguf | IQ3XXS | 12.22GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. | | Qwen3-30B-A3B-Instruct-2507-Q2KL.gguf | Q2KL | 11.21GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | Qwen3-30B-A3B-Instruct-2507-Q2K.gguf | Q2K | 10.91GB | false | Very low quality but surprisingly usable. | | Qwen3-30B-A3B-Instruct-2507-IQ2M.gguf | IQ2M | 9.87GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. 
| | Qwen3-30B-A3B-Instruct-2507-IQ2S.gguf | IQ2S | 8.74GB | false | Low quality, uses SOTA techniques to be usable. | | Qwen3-30B-A3B-Instruct-2507-IQ2XS.gguf | IQ2XS | 8.66GB | false | Low quality, uses SOTA techniques to be usable. | | Qwen3-30B-A3B-Instruct-2507-IQ2XXS.gguf | IQ2XXS | 7.57GB | false | Very low quality, uses SOTA techniques to be usable. |
Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to.
First, make sure you have huggingface-cli installed; example install and download commands are sketched below. If the model is bigger than 50GB, it will have been split into multiple files. In order to download them all to a local folder, run the download command shown below. You can either specify a new local-dir (QwenQwen3-30B-A3B-Instruct-2507-Q80) or download them all in place (./)
Previously, you would download Q4044/48/88, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights; details are in this PR. If you use Q40 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282 you will not be able to run the Q40XX files and will instead need to use Q40. Additionally, if you want to get slightly better quality for ARM and AVX machines, you can use IQ4NL thanks to this PR, which will also repack the weights for ARM, though only the 44 for now. The loading time may be slower but it will result in an overall speed increase. I'm keeping this section to show the potential theoretical uplift in performance from using the Q40 with online repacking.
Click to view benchmarks on an AVX2 system (EPYC7702)
| model | size | params | backend | threads | test | t/s | % (vs Q40) | | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% |
Q4088 offers a nice bump to prompt processing and a small bump to text generation. A great write up with charts showing various performances
is provided by Artefact2 here. The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have.
If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB smaller than that total.
Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart. But basically, if you're aiming for below Q4, and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in format IQXX, like IQ3M. These are newer and offer better performance for their size. These I-quants can also be used on CPU, but will be slower than their K-quant equivalent, so speed vs performance is a tradeoff you'll have to decide.
Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output.
Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
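The download commands referenced earlier in this card were stripped by the scrape; a minimal sketch, assuming the repo id bartowski/Qwen_Qwen3-30B-A3B-Instruct-2507-GGUF and underscored filenames (e.g. Q4_K_M, bf16):

```
pip install -U "huggingface_hub[cli]"

# a single quant file
huggingface-cli download bartowski/Qwen_Qwen3-30B-A3B-Instruct-2507-GGUF \
  --include "Qwen_Qwen3-30B-A3B-Instruct-2507-Q4_K_M.gguf" --local-dir ./

# the bf16 is over 50GB and split into a folder, so grab the whole folder
huggingface-cli download bartowski/Qwen_Qwen3-30B-A3B-Instruct-2507-GGUF \
  --include "Qwen_Qwen3-30B-A3B-Instruct-2507-bf16/*" --local-dir ./
```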
zai-org_GLM-4.6-GGUF
Llamacpp imatrix Quantizations of GLM-4.6 by zai-org Original model: https://huggingface.co/zai-org/GLM-4.6 All quants made using imatrix option with dataset from here combined with a subset of com...
DeepSeek-R1-Distill-Qwen-32B-GGUF
Llamacpp imatrix Quantizations of DeepSeek-R1-Distill-Qwen-32B Original model: https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B All quants made using imatrix option with dataset from here | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | DeepSeek-R1-Distill-Qwen-32B-bf16.gguf | bf16 | 65.54GB | true | Full BF16 weights. | | DeepSeek-R1-Distill-Qwen-32B-Q80.gguf | Q80 | 34.82GB | false | Extremely high quality, generally unneeded but max available quant. | | DeepSeek-R1-Distill-Qwen-32B-Q6KL.gguf | Q6KL | 27.26GB | false | Uses Q80 for embed and output weights. Very high quality, near perfect, recommended. | | DeepSeek-R1-Distill-Qwen-32B-Q6K.gguf | Q6K | 26.89GB | false | Very high quality, near perfect, recommended. | | DeepSeek-R1-Distill-Qwen-32B-Q5KL.gguf | Q5KL | 23.74GB | false | Uses Q80 for embed and output weights. High quality, recommended. | | DeepSeek-R1-Distill-Qwen-32B-Q5KM.gguf | Q5KM | 23.26GB | false | High quality, recommended. | | DeepSeek-R1-Distill-Qwen-32B-Q5KS.gguf | Q5KS | 22.64GB | false | High quality, recommended. | | DeepSeek-R1-Distill-Qwen-32B-Q41.gguf | Q41 | 20.64GB | false | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | DeepSeek-R1-Distill-Qwen-32B-Q4KL.gguf | Q4KL | 20.43GB | false | Uses Q80 for embed and output weights. Good quality, recommended. | | DeepSeek-R1-Distill-Qwen-32B-Q4KM.gguf | Q4KM | 19.85GB | false | Good quality, default size for most use cases, recommended. | | DeepSeek-R1-Distill-Qwen-32B-Q4KS.gguf | Q4KS | 18.78GB | false | Slightly lower quality with more space savings, recommended. | | DeepSeek-R1-Distill-Qwen-32B-Q40.gguf | Q40 | 18.71GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. | | DeepSeek-R1-Distill-Qwen-32B-IQ4NL.gguf | IQ4NL | 18.68GB | false | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | DeepSeek-R1-Distill-Qwen-32B-Q3KXL.gguf | Q3KXL | 17.93GB | false | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | DeepSeek-R1-Distill-Qwen-32B-IQ4XS.gguf | IQ4XS | 17.69GB | false | Decent quality, smaller than Q4KS with similar performance, recommended. | | DeepSeek-R1-Distill-Qwen-32B-Q3KL.gguf | Q3KL | 17.25GB | false | Lower quality but usable, good for low RAM availability. | | DeepSeek-R1-Distill-Qwen-32B-Q3KM.gguf | Q3KM | 15.94GB | false | Low quality. | | DeepSeek-R1-Distill-Qwen-32B-IQ3M.gguf | IQ3M | 14.81GB | false | Medium-low quality, new method with decent performance comparable to Q3KM. | | DeepSeek-R1-Distill-Qwen-32B-Q3KS.gguf | Q3KS | 14.39GB | false | Low quality, not recommended. | | DeepSeek-R1-Distill-Qwen-32B-IQ3XS.gguf | IQ3XS | 13.71GB | false | Lower quality, new method with decent performance, slightly better than Q3KS. | | DeepSeek-R1-Distill-Qwen-32B-Q2KL.gguf | Q2KL | 13.07GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | DeepSeek-R1-Distill-Qwen-32B-Q2K.gguf | Q2K | 12.31GB | false | Very low quality but surprisingly usable. | | DeepSeek-R1-Distill-Qwen-32B-IQ2M.gguf | IQ2M | 11.26GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. | | DeepSeek-R1-Distill-Qwen-32B-IQ2S.gguf | IQ2S | 10.39GB | false | Low quality, uses SOTA techniques to be usable. | | DeepSeek-R1-Distill-Qwen-32B-IQ2XS.gguf | IQ2XS | 9.96GB | false | Low quality, uses SOTA techniques to be usable. 
| | DeepSeek-R1-Distill-Qwen-32B-IQ2XXS.gguf | IQ2XXS | 9.03GB | false | Very low quality, uses SOTA techniques to be usable. | Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to. First, make sure you have hugginface-cli installed: If the model is bigger than 50GB, it will have been split into multiple files. In order to download them all to a local folder, run: You can either specify a new local-dir (DeepSeek-R1-Distill-Qwen-32B-Q80) or download them all in place (./) Previously, you would download Q4044/48/88, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights. details in this PR. If you use Q40 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282 you will not be able to run the Q40XX files and will instead need to use Q40. Additionally, if you want to get slightly better quality for , you can use IQ4NL thanks to this PR which will also repack the weights for ARM, though only the 44 for now. The loading time may be slower but it will result in an overall speed incrase. I'm keeping this section to show the potential theoretical uplift in performance from using the Q40 with online repacking. Click to view benchmarks on an AVX2 system (EPYC7702) | model | size | params | backend | threads | test | t/s | % (vs Q40) | | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% | Q4088 offers a nice bump to prompt processing and a small bump to text generation A great write up with charts showing various performances is provided by Artefact2 here The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have. 
If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB smaller than that total.

Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart. But basically, if you're aiming for below Q4, and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in format IQXX, like IQ3M. These are newer and offer better performance for their size. These I-quants can also be used on CPU and Apple Metal, but will be slower than their K-quant equivalent, so speed vs performance is a tradeoff you'll have to decide. The I-quants are not compatible with Vulkan, which also supports AMD cards, so if you have an AMD card double check whether you're using the rocBLAS build or the Vulkan build. At the time of writing, LM Studio has a preview with ROCm support, and other inference engines have specific builds for ROCm.

Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output. Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
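For quants over 50GB (such as the bf16 weights in the table above, marked Split true), the parts are stored under a folder named after the quant. A hedged sketch of the folder download the card refers to; the include pattern and target directory are illustrative and should be checked against the repo's file listing:

```shell
# Download every part of a split quant into a local folder
# (pattern and directory are illustrative -- confirm names on the model page)
huggingface-cli download bartowski/DeepSeek-R1-Distill-Qwen-32B-GGUF \
  --include "DeepSeek-R1-Distill-Qwen-32B-bf16/*" \
  --local-dir DeepSeek-R1-Distill-Qwen-32B-bf16
```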
DeepSeek-Coder-V2-Lite-Instruct-GGUF
nvidia_NVIDIA-Nemotron-Nano-9B-v2-GGUF
Qwen2.5-Coder-7B-Instruct-GGUF
huihui-ai_Huihui-gpt-oss-20b-BF16-abliterated-GGUF
Phi-3.1-mini-128k-instruct-GGUF
granite-embedding-107m-multilingual-GGUF
cerebras_Qwen3-Coder-REAP-25B-A3B-GGUF
thesby_Qwen2.5-VL-7B-NSFW-Caption-V3-GGUF
mistralai_Mistral-Small-4-119B-2603-GGUF
Qwen2.5-14B_Uncensored_Instruct-GGUF
mistral-community_pixtral-12b-GGUF
Hermes-3-Llama-3.2-3B-GGUF
moonshotai_Kimi-K2.5-GGUF
Ministral-8B-Instruct-2410-GGUF
L3-8B-Stheno-v3.2-GGUF
Llamacpp imatrix Quantizations of L3-8B-Stheno-v3.2 Original model: https://huggingface.co/Sao10K/L3-8B-Stheno-v3.2 All quants made using imatrix option with dataset from here | Filename | Quant type | File Size | Description | | -------- | ---------- | --------- | ----------- | | L3-8B-Stheno-v3.2-Q80.gguf | Q80 | 8.54GB | Extremely high quality, generally unneeded but max available quant. | | L3-8B-Stheno-v3.2-Q6K.gguf | Q6K | 6.59GB | Very high quality, near perfect, recommended. | | L3-8B-Stheno-v3.2-Q5KM.gguf | Q5KM | 5.73GB | High quality, recommended. | | L3-8B-Stheno-v3.2-Q5KS.gguf | Q5KS | 5.59GB | High quality, recommended. | | L3-8B-Stheno-v3.2-Q4KM.gguf | Q4KM | 4.92GB | Good quality, uses about 4.83 bits per weight, recommended. | | L3-8B-Stheno-v3.2-Q4KS.gguf | Q4KS | 4.69GB | Slightly lower quality with more space savings, recommended. | | L3-8B-Stheno-v3.2-IQ4XS.gguf | IQ4XS | 4.44GB | Decent quality, smaller than Q4KS with similar performance, recommended. | | L3-8B-Stheno-v3.2-Q3KL.gguf | Q3KL | 4.32GB | Lower quality but usable, good for low RAM availability. | | L3-8B-Stheno-v3.2-Q3KM.gguf | Q3KM | 4.01GB | Even lower quality. | | L3-8B-Stheno-v3.2-IQ3M.gguf | IQ3M | 3.78GB | Medium-low quality, new method with decent performance comparable to Q3KM. | | L3-8B-Stheno-v3.2-Q3KS.gguf | Q3KS | 3.66GB | Low quality, not recommended. | | L3-8B-Stheno-v3.2-IQ3XS.gguf | IQ3XS | 3.51GB | Lower quality, new method with decent performance, slightly better than Q3KS. | | L3-8B-Stheno-v3.2-IQ3XXS.gguf | IQ3XXS | 3.27GB | Lower quality, new method with decent performance, comparable to Q3 quants. | | L3-8B-Stheno-v3.2-Q2K.gguf | Q2K | 3.17GB | Very low quality but surprisingly usable. | | L3-8B-Stheno-v3.2-IQ2M.gguf | IQ2M | 2.94GB | Very low quality, uses SOTA techniques to also be surprisingly usable. | | L3-8B-Stheno-v3.2-IQ2S.gguf | IQ2S | 2.75GB | Very low quality, uses SOTA techniques to be usable. | | L3-8B-Stheno-v3.2-IQ2XS.gguf | IQ2XS | 2.60GB | Very low quality, uses SOTA techniques to be usable. | First, make sure you have hugginface-cli installed: If the model is bigger than 50GB, it will have been split into multiple files. In order to download them all to a local folder, run: You can either specify a new local-dir (L3-8B-Stheno-v3.2-Q80) or download them all in place (./) A great write up with charts showing various performances is provided by Artefact2 here The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have. If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB Smaller than that total. Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart: But basically, if you're aiming for below Q4, and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in format IQXX, like IQ3M. These are newer and offer better performance for their size. 
These I-quants can also be used on CPU and Apple Metal, but will be slower than their K-quant equivalent, so speed vs performance is a tradeoff you'll have to decide. The I-quants are not compatible with Vulkan, which also supports AMD cards, so if you have an AMD card double check whether you're using the rocBLAS build or the Vulkan build. At the time of writing, LM Studio has a preview with ROCm support, and other inference engines have specific builds for ROCm.

Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
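Since these quants are meant to be run directly with llama.cpp or any llama.cpp based project, a minimal invocation may help. The binary name and flags follow current llama.cpp builds; the model path is whichever quant you downloaded (the filename here is illustrative):

```shell
# Basic generation with llama.cpp; -ngl 99 offloads all layers to the GPU when one is available
./llama-cli -m ./L3-8B-Stheno-v3.2-Q4_K_M.gguf -p "Write a short scene:" -n 256 -ngl 99
```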
Qwen_Qwen3-30B-A3B-Thinking-2507-GGUF
rwkv-6-world-7b-GGUF
Athene-V2-Chat-GGUF
mistralai_Magistral-Small-2509-GGUF
Llamacpp imatrix Quantizations of Magistral-Small-2509 by mistralai Original model: https://huggingface.co/mistralai/Magistral-Small-2509 All quants made using imatrix option with dataset from here combined with a subset of combinedallsmall.parquet from Ed Addario here Run them directly with llama.cpp, or any other llama.cpp based project | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | Magistral-Small-2509-bf16.gguf | bf16 | 47.15GB | false | Full BF16 weights. | | Magistral-Small-2509-Q80.gguf | Q80 | 25.05GB | false | Extremely high quality, generally unneeded but max available quant. | | Magistral-Small-2509-Q6KL.gguf | Q6KL | 19.67GB | false | Uses Q80 for embed and output weights. Very high quality, near perfect, recommended. | | Magistral-Small-2509-Q6K.gguf | Q6K | 19.35GB | false | Very high quality, near perfect, recommended. | | Magistral-Small-2509-Q5KL.gguf | Q5KL | 17.18GB | false | Uses Q80 for embed and output weights. High quality, recommended. | | Magistral-Small-2509-Q5KM.gguf | Q5KM | 16.76GB | false | High quality, recommended. | | Magistral-Small-2509-Q5KS.gguf | Q5KS | 16.30GB | false | High quality, recommended. | | Magistral-Small-2509-Q41.gguf | Q41 | 14.87GB | false | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | Magistral-Small-2509-Q4KL.gguf | Q4KL | 14.83GB | false | Uses Q80 for embed and output weights. Good quality, recommended. | | Magistral-Small-2509-Q4KM.gguf | Q4KM | 14.33GB | false | Good quality, default size for most use cases, recommended. | | Magistral-Small-2509-Q4KS.gguf | Q4KS | 13.55GB | false | Slightly lower quality with more space savings, recommended. | | Magistral-Small-2509-Q40.gguf | Q40 | 13.49GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. | | Magistral-Small-2509-IQ4NL.gguf | IQ4NL | 13.47GB | false | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | Magistral-Small-2509-Q3KXL.gguf | Q3KXL | 12.99GB | false | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | Magistral-Small-2509-IQ4XS.gguf | IQ4XS | 12.76GB | false | Decent quality, smaller than Q4KS with similar performance, recommended. | | Magistral-Small-2509-Q3KL.gguf | Q3KL | 12.40GB | false | Lower quality but usable, good for low RAM availability. | | Magistral-Small-2509-Q3KM.gguf | Q3KM | 11.47GB | false | Low quality. | | Magistral-Small-2509-IQ3M.gguf | IQ3M | 10.65GB | false | Medium-low quality, new method with decent performance comparable to Q3KM. | | Magistral-Small-2509-Q3KS.gguf | Q3KS | 10.40GB | false | Low quality, not recommended. | | Magistral-Small-2509-IQ3XS.gguf | IQ3XS | 9.91GB | false | Lower quality, new method with decent performance, slightly better than Q3KS. | | Magistral-Small-2509-Q2KL.gguf | Q2KL | 9.55GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | Magistral-Small-2509-IQ3XXS.gguf | IQ3XXS | 9.28GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. | | Magistral-Small-2509-Q2K.gguf | Q2K | 8.89GB | false | Very low quality but surprisingly usable. | | Magistral-Small-2509-IQ2M.gguf | IQ2M | 8.11GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. | | Magistral-Small-2509-IQ2S.gguf | IQ2S | 7.48GB | false | Low quality, uses SOTA techniques to be usable. 
| | Magistral-Small-2509-IQ2XS.gguf | IQ2XS | 7.21GB | false | Low quality, uses SOTA techniques to be usable. | | Magistral-Small-2509-IQ2XXS.gguf | IQ2XXS | 6.55GB | false | Very low quality, uses SOTA techniques to be usable. | Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to. First, make sure you have hugginface-cli installed: If the model is bigger than 50GB, it will have been split into multiple files. In order to download them all to a local folder, run: You can either specify a new local-dir (mistralaiMagistral-Small-2509-Q80) or download them all in place (./) Previously, you would download Q4044/48/88, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights. details in this PR. If you use Q40 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282 you will not be able to run the Q40XX files and will instead need to use Q40. Additionally, if you want to get slightly better quality for , you can use IQ4NL thanks to this PR which will also repack the weights for ARM, though only the 44 for now. The loading time may be slower but it will result in an overall speed incrase. I'm keeping this section to show the potential theoretical uplift in performance from using the Q40 with online repacking. Click to view benchmarks on an AVX2 system (EPYC7702) | model | size | params | backend | threads | test | t/s | % (vs Q40) | | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% | Q4088 offers a nice bump to prompt processing and a small bump to text generation A great write up with charts showing various performances is provided by Artefact2 here The first thing to figure out is how big a model you can run. 
To do this, you'll need to figure out how much RAM and/or VRAM you have. If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB smaller than that total.

Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart. But basically, if you're aiming for below Q4, and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in format IQXX, like IQ3M. These are newer and offer better performance for their size. These I-quants can also be used on CPU, but will be slower than their K-quant equivalent, so speed vs performance is a tradeoff you'll have to decide.

Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output. Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
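The pp512/pp1024/pp2048 and tg128/tg256/tg512 rows in the benchmark table above come from llama.cpp's llama-bench tool. A sketch of an equivalent invocation, with the model path as a placeholder:

```shell
# Prompt-processing (pp) and token-generation (tg) benchmark at 64 CPU threads,
# matching the test column in the table above
./llama-bench -m ./model-Q4_0.gguf -p 512,1024,2048 -n 128,256,512 -t 64
```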
google_gemma-3n-E4B-it-GGUF
mlabonne_gemma-3-27b-it-abliterated-GGUF
PocketDoc_Dans-PersonalityEngine-V1.3.0-24b-GGUF
Llamacpp imatrix Quantizations of Dans-PersonalityEngine-V1.3.0-24b by PocketDoc Original model: https://huggingface.co/PocketDoc/Dans-PersonalityEngine-V1.3.0-24b All quants made using imatrix option with dataset from here Run them directly with llama.cpp, or any other llama.cpp based project | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | Dans-PersonalityEngine-V1.3.0-24b-bf16.gguf | bf16 | 47.15GB | false | Full BF16 weights. | | Dans-PersonalityEngine-V1.3.0-24b-Q80.gguf | Q80 | 25.05GB | false | Extremely high quality, generally unneeded but max available quant. | | Dans-PersonalityEngine-V1.3.0-24b-Q6KL.gguf | Q6KL | 19.67GB | false | Uses Q80 for embed and output weights. Very high quality, near perfect, recommended. | | Dans-PersonalityEngine-V1.3.0-24b-Q6K.gguf | Q6K | 19.35GB | false | Very high quality, near perfect, recommended. | | Dans-PersonalityEngine-V1.3.0-24b-Q5KL.gguf | Q5KL | 17.18GB | false | Uses Q80 for embed and output weights. High quality, recommended. | | Dans-PersonalityEngine-V1.3.0-24b-Q5KM.gguf | Q5KM | 16.76GB | false | High quality, recommended. | | Dans-PersonalityEngine-V1.3.0-24b-Q5KS.gguf | Q5KS | 16.30GB | false | High quality, recommended. | | Dans-PersonalityEngine-V1.3.0-24b-Q41.gguf | Q41 | 14.87GB | false | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | Dans-PersonalityEngine-V1.3.0-24b-Q4KL.gguf | Q4KL | 14.83GB | false | Uses Q80 for embed and output weights. Good quality, recommended. | | Dans-PersonalityEngine-V1.3.0-24b-Q4KM.gguf | Q4KM | 14.33GB | false | Good quality, default size for most use cases, recommended. | | Dans-PersonalityEngine-V1.3.0-24b-Q4KS.gguf | Q4KS | 13.55GB | false | Slightly lower quality with more space savings, recommended. | | Dans-PersonalityEngine-V1.3.0-24b-Q40.gguf | Q40 | 13.49GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. | | Dans-PersonalityEngine-V1.3.0-24b-IQ4NL.gguf | IQ4NL | 13.47GB | false | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | Dans-PersonalityEngine-V1.3.0-24b-Q3KXL.gguf | Q3KXL | 12.99GB | false | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | Dans-PersonalityEngine-V1.3.0-24b-IQ4XS.gguf | IQ4XS | 12.76GB | false | Decent quality, smaller than Q4KS with similar performance, recommended. | | Dans-PersonalityEngine-V1.3.0-24b-Q3KL.gguf | Q3KL | 12.40GB | false | Lower quality but usable, good for low RAM availability. | | Dans-PersonalityEngine-V1.3.0-24b-Q3KM.gguf | Q3KM | 11.47GB | false | Low quality. | | Dans-PersonalityEngine-V1.3.0-24b-IQ3M.gguf | IQ3M | 10.65GB | false | Medium-low quality, new method with decent performance comparable to Q3KM. | | Dans-PersonalityEngine-V1.3.0-24b-Q3KS.gguf | Q3KS | 10.40GB | false | Low quality, not recommended. | | Dans-PersonalityEngine-V1.3.0-24b-IQ3XS.gguf | IQ3XS | 9.91GB | false | Lower quality, new method with decent performance, slightly better than Q3KS. | | Dans-PersonalityEngine-V1.3.0-24b-Q2KL.gguf | Q2KL | 9.55GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | Dans-PersonalityEngine-V1.3.0-24b-IQ3XXS.gguf | IQ3XXS | 9.28GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. | | Dans-PersonalityEngine-V1.3.0-24b-Q2K.gguf | Q2K | 8.89GB | false | Very low quality but surprisingly usable. 
| | Dans-PersonalityEngine-V1.3.0-24b-IQ2M.gguf | IQ2M | 8.11GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. | | Dans-PersonalityEngine-V1.3.0-24b-IQ2S.gguf | IQ2S | 7.48GB | false | Low quality, uses SOTA techniques to be usable. | | Dans-PersonalityEngine-V1.3.0-24b-IQ2XS.gguf | IQ2XS | 7.21GB | false | Low quality, uses SOTA techniques to be usable. | Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to. First, make sure you have hugginface-cli installed: If the model is bigger than 50GB, it will have been split into multiple files. In order to download them all to a local folder, run: You can either specify a new local-dir (PocketDocDans-PersonalityEngine-V1.3.0-24b-Q80) or download them all in place (./) Previously, you would download Q4044/48/88, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights. details in this PR. If you use Q40 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282 you will not be able to run the Q40XX files and will instead need to use Q40. Additionally, if you want to get slightly better quality for , you can use IQ4NL thanks to this PR which will also repack the weights for ARM, though only the 44 for now. The loading time may be slower but it will result in an overall speed incrase. I'm keeping this section to show the potential theoretical uplift in performance from using the Q40 with online repacking. Click to view benchmarks on an AVX2 system (EPYC7702) | model | size | params | backend | threads | test | t/s | % (vs Q40) | | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% | Q4088 offers a nice bump to prompt processing and a small bump to text generation A great write 
up with charts showing various performances is provided by Artefact2 here.

The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have. If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB smaller than that total.

Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart. But basically, if you're aiming for below Q4, and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in format IQXX, like IQ3M. These are newer and offer better performance for their size. These I-quants can also be used on CPU, but will be slower than their K-quant equivalent, so speed vs performance is a tradeoff you'll have to decide.

Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output. Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
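The "file size 1-2GB smaller than your VRAM" rule above is easy to check against the table. A small sketch using a few sizes from the Dans-PersonalityEngine table; the 16GB VRAM figure and the 1.5GB headroom are illustrative:

```shell
# Which quants leave roughly 1.5GB of headroom on a 16GB GPU (sizes in GB from the table above)
VRAM=16
for q in "Q6K 19.35" "Q5KM 16.76" "Q4KM 14.33" "IQ4XS 12.76" "IQ3M 10.65"; do
  set -- $q
  awk -v vram="$VRAM" -v size="$2" -v name="$1" \
    'BEGIN { if (size <= vram - 1.5) print name, "fits" }'
done
```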
ServiceNow-AI_Apriel-1.5-15b-Thinker-GGUF
Llamacpp imatrix Quantizations of Apriel-1.5-15b-Thinker by ServiceNow-AI Original model: https://huggingface.co/ServiceNow-AI/Apriel-1.5-15b-Thinker All quants made using imatrix option with dataset from here combined with a subset of combinedallsmall.parquet from Ed Addario here Run them directly with llama.cpp, or any other llama.cpp based project | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | Apriel-1.5-15b-Thinker-bf16.gguf | bf16 | 28.87GB | false | Full BF16 weights. | | Apriel-1.5-15b-Thinker-Q80.gguf | Q80 | 15.34GB | false | Extremely high quality, generally unneeded but max available quant. | | Apriel-1.5-15b-Thinker-Q6KL.gguf | Q6KL | 12.17GB | false | Uses Q80 for embed and output weights. Very high quality, near perfect, recommended. | | Apriel-1.5-15b-Thinker-Q6K.gguf | Q6K | 11.85GB | false | Very high quality, near perfect, recommended. | | Apriel-1.5-15b-Thinker-Q5KL.gguf | Q5KL | 10.68GB | false | Uses Q80 for embed and output weights. High quality, recommended. | | Apriel-1.5-15b-Thinker-Q5KM.gguf | Q5KM | 10.27GB | false | High quality, recommended. | | Apriel-1.5-15b-Thinker-Q5KS.gguf | Q5KS | 10.02GB | false | High quality, recommended. | | Apriel-1.5-15b-Thinker-Q4KL.gguf | Q4KL | 9.28GB | false | Uses Q80 for embed and output weights. Good quality, recommended. | | Apriel-1.5-15b-Thinker-Q41.gguf | Q41 | 9.16GB | false | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | Apriel-1.5-15b-Thinker-Q4KM.gguf | Q4KM | 8.79GB | false | Good quality, default size for most use cases, recommended. | | Apriel-1.5-15b-Thinker-Q4KS.gguf | Q4KS | 8.36GB | false | Slightly lower quality with more space savings, recommended. | | Apriel-1.5-15b-Thinker-Q40.gguf | Q40 | 8.33GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. | | Apriel-1.5-15b-Thinker-IQ4NL.gguf | IQ4NL | 8.33GB | false | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | Apriel-1.5-15b-Thinker-Q3KXL.gguf | Q3KXL | 8.29GB | false | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | Apriel-1.5-15b-Thinker-IQ4XS.gguf | IQ4XS | 7.91GB | false | Decent quality, smaller than Q4KS with similar performance, recommended. | | Apriel-1.5-15b-Thinker-Q3KL.gguf | Q3KL | 7.70GB | false | Lower quality but usable, good for low RAM availability. | | Apriel-1.5-15b-Thinker-Q3KM.gguf | Q3KM | 7.14GB | false | Low quality. | | Apriel-1.5-15b-Thinker-IQ3M.gguf | IQ3M | 6.70GB | false | Medium-low quality, new method with decent performance comparable to Q3KM. | | Apriel-1.5-15b-Thinker-Q3KS.gguf | Q3KS | 6.47GB | false | Low quality, not recommended. | | Apriel-1.5-15b-Thinker-Q2KL.gguf | Q2KL | 6.25GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | Apriel-1.5-15b-Thinker-IQ3XS.gguf | IQ3XS | 6.20GB | false | Lower quality, new method with decent performance, slightly better than Q3KS. | | Apriel-1.5-15b-Thinker-IQ3XXS.gguf | IQ3XXS | 5.78GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. | | Apriel-1.5-15b-Thinker-Q2K.gguf | Q2K | 5.59GB | false | Very low quality but surprisingly usable. | | Apriel-1.5-15b-Thinker-IQ2M.gguf | IQ2M | 5.17GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. 
| | Apriel-1.5-15b-Thinker-IQ2S.gguf | IQ2S | 4.81GB | false | Low quality, uses SOTA techniques to be usable. | | Apriel-1.5-15b-Thinker-IQ2XS.gguf | IQ2XS | 4.56GB | false | Low quality, uses SOTA techniques to be usable. | Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to. First, make sure you have hugginface-cli installed: If the model is bigger than 50GB, it will have been split into multiple files. In order to download them all to a local folder, run: You can either specify a new local-dir (ServiceNow-AIApriel-1.5-15b-Thinker-Q80) or download them all in place (./) Previously, you would download Q4044/48/88, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights. details in this PR. If you use Q40 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282 you will not be able to run the Q40XX files and will instead need to use Q40. Additionally, if you want to get slightly better quality for , you can use IQ4NL thanks to this PR which will also repack the weights for ARM, though only the 44 for now. The loading time may be slower but it will result in an overall speed incrase. I'm keeping this section to show the potential theoretical uplift in performance from using the Q40 with online repacking. Click to view benchmarks on an AVX2 system (EPYC7702) | model | size | params | backend | threads | test | t/s | % (vs Q40) | | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% | Q4088 offers a nice bump to prompt processing and a small bump to text generation A great write up with charts showing various performances is provided by Artefact2 here The first thing to figure out is how big a model you can run. 
To do this, you'll need to figure out how much RAM and/or VRAM you have. If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB smaller than that total.

Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart. But basically, if you're aiming for below Q4, and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in format IQXX, like IQ3M. These are newer and offer better performance for their size. These I-quants can also be used on CPU, but will be slower than their K-quant equivalent, so speed vs performance is a tradeoff you'll have to decide.

Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output. Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
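Besides one-shot generation, the same quants can be served through llama.cpp's built-in HTTP server, which exposes an OpenAI-compatible API; the filename, context size and port below are illustrative:

```shell
# Serve a downloaded quant; point any OpenAI-compatible client at http://localhost:8080/v1
./llama-server -m ./Apriel-1.5-15b-Thinker-Q4_K_M.gguf -c 8192 -ngl 99 --port 8080
```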
Qwen_Qwen3-30B-A3B-GGUF
google_gemma-3-27b-it-GGUF
THUDM_GLM-Z1-32B-0414-GGUF
Llamacpp imatrix Quantizations of GLM-Z1-32B-0414 by THUDM Original model: https://huggingface.co/THUDM/GLM-Z1-32B-0414 All quants made using imatrix option with dataset from here Run them directly with llama.cpp, or any other llama.cpp based project | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | GLM-Z1-32B-0414-bf16.gguf | bf16 | 65.14GB | true | Full BF16 weights. | | GLM-Z1-32B-0414-Q80.gguf | Q80 | 34.62GB | false | Extremely high quality, generally unneeded but max available quant. | | GLM-Z1-32B-0414-Q6KL.gguf | Q6KL | 27.18GB | false | Uses Q80 for embed and output weights. Very high quality, near perfect, recommended. | | GLM-Z1-32B-0414-Q6K.gguf | Q6K | 26.73GB | false | Very high quality, near perfect, recommended. | | GLM-Z1-32B-0414-Q5KL.gguf | Q5KL | 23.67GB | false | Uses Q80 for embed and output weights. High quality, recommended. | | GLM-Z1-32B-0414-Q5KM.gguf | Q5KM | 23.10GB | false | High quality, recommended. | | GLM-Z1-32B-0414-Q5KS.gguf | Q5KS | 22.53GB | false | High quality, recommended. | | GLM-Z1-32B-0414-Q41.gguf | Q41 | 20.55GB | false | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | GLM-Z1-32B-0414-Q4KL.gguf | Q4KL | 20.37GB | false | Uses Q80 for embed and output weights. Good quality, recommended. | | GLM-Z1-32B-0414-Q4KM.gguf | Q4KM | 19.68GB | false | Good quality, default size for most use cases, recommended. | | GLM-Z1-32B-0414-Q4KS.gguf | Q4KS | 18.70GB | false | Slightly lower quality with more space savings, recommended. | | GLM-Z1-32B-0414-Q40.gguf | Q40 | 18.63GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. | | GLM-Z1-32B-0414-IQ4NL.gguf | IQ4NL | 18.58GB | false | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | GLM-Z1-32B-0414-Q3KXL.gguf | Q3KXL | 18.03GB | false | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | GLM-Z1-32B-0414-IQ4XS.gguf | IQ4XS | 17.60GB | false | Decent quality, smaller than Q4KS with similar performance, recommended. | | GLM-Z1-32B-0414-Q3KL.gguf | Q3KL | 17.22GB | false | Lower quality but usable, good for low RAM availability. | | GLM-Z1-32B-0414-Q3KM.gguf | Q3KM | 15.89GB | false | Low quality. | | GLM-Z1-32B-0414-IQ3M.gguf | IQ3M | 14.82GB | false | Medium-low quality, new method with decent performance comparable to Q3KM. | | GLM-Z1-32B-0414-Q3KS.gguf | Q3KS | 14.37GB | false | Low quality, not recommended. | | GLM-Z1-32B-0414-IQ3XS.gguf | IQ3XS | 13.66GB | false | Lower quality, new method with decent performance, slightly better than Q3KS. | | GLM-Z1-32B-0414-Q2KL.gguf | Q2KL | 13.20GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | GLM-Z1-32B-0414-IQ3XXS.gguf | IQ3XXS | 12.78GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. | | GLM-Z1-32B-0414-Q2K.gguf | Q2K | 12.29GB | false | Very low quality but surprisingly usable. | | GLM-Z1-32B-0414-IQ2M.gguf | IQ2M | 11.27GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. | | GLM-Z1-32B-0414-IQ2S.gguf | IQ2S | 10.42GB | false | Low quality, uses SOTA techniques to be usable. | | GLM-Z1-32B-0414-IQ2XS.gguf | IQ2XS | 9.90GB | false | Low quality, uses SOTA techniques to be usable. 
| Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to. First, make sure you have hugginface-cli installed: If the model is bigger than 50GB, it will have been split into multiple files. In order to download them all to a local folder, run: You can either specify a new local-dir (THUDMGLM-Z1-32B-0414-Q80) or download them all in place (./) Previously, you would download Q4044/48/88, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights. details in this PR. If you use Q40 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282 you will not be able to run the Q40XX files and will instead need to use Q40. Additionally, if you want to get slightly better quality for , you can use IQ4NL thanks to this PR which will also repack the weights for ARM, though only the 44 for now. The loading time may be slower but it will result in an overall speed incrase. I'm keeping this section to show the potential theoretical uplift in performance from using the Q40 with online repacking. Click to view benchmarks on an AVX2 system (EPYC7702) | model | size | params | backend | threads | test | t/s | % (vs Q40) | | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% | Q4088 offers a nice bump to prompt processing and a small bump to text generation A great write up with charts showing various performances is provided by Artefact2 here The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have. If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. 
If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB smaller than that total.

Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart. But basically, if you're aiming for below Q4, and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in format IQXX, like IQ3M. These are newer and offer better performance for their size. These I-quants can also be used on CPU, but will be slower than their K-quant equivalent, so speed vs performance is a tradeoff you'll have to decide.

Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output. Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
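These cards repeatedly note that all quants were made "using imatrix option with dataset from here". For readers curious about the general shape of that process (not the author's exact pipeline), llama.cpp ships the tools involved; the calibration file and model filenames below are placeholders:

```shell
# 1) Compute an importance matrix from a calibration text file
./llama-imatrix -m ./model-bf16.gguf -f ./calibration.txt -o ./imatrix.dat

# 2) Quantize using that importance matrix
./llama-quantize --imatrix ./imatrix.dat ./model-bf16.gguf ./model-Q4_K_M.gguf Q4_K_M
```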
THUDM_GLM-4-9B-0414-GGUF
Qwen2.5-Math-7B-Instruct-GGUF
zai-org_GLM-4.6V-Flash-GGUF
NousResearch_Hermes-4-14B-GGUF
Llamacpp imatrix Quantizations of Hermes-4-14B by NousResearch Original model: https://huggingface.co/NousResearch/Hermes-4-14B All quants made using imatrix option with dataset from here combined with a subset of combinedallsmall.parquet from Ed Addario here Run them directly with llama.cpp, or any other llama.cpp based project | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | Hermes-4-14B-bf16.gguf | bf16 | 29.54GB | false | Full BF16 weights. | | Hermes-4-14B-Q80.gguf | Q80 | 15.70GB | false | Extremely high quality, generally unneeded but max available quant. | | Hermes-4-14B-Q6KL.gguf | Q6KL | 12.50GB | false | Uses Q80 for embed and output weights. Very high quality, near perfect, recommended. | | Hermes-4-14B-Q6K.gguf | Q6K | 12.12GB | false | Very high quality, near perfect, recommended. | | Hermes-4-14B-Q5KL.gguf | Q5KL | 10.99GB | false | Uses Q80 for embed and output weights. High quality, recommended. | | Hermes-4-14B-Q5KM.gguf | Q5KM | 10.51GB | false | High quality, recommended. | | Hermes-4-14B-Q5KS.gguf | Q5KS | 10.26GB | false | High quality, recommended. | | Hermes-4-14B-Q4KL.gguf | Q4KL | 9.58GB | false | Uses Q80 for embed and output weights. Good quality, recommended. | | Hermes-4-14B-Q41.gguf | Q41 | 9.39GB | false | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | Hermes-4-14B-Q4KM.gguf | Q4KM | 9.00GB | false | Good quality, default size for most use cases, recommended. | | Hermes-4-14B-Q3KXL.gguf | Q3KXL | 8.58GB | false | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | Hermes-4-14B-Q4KS.gguf | Q4KS | 8.57GB | false | Slightly lower quality with more space savings, recommended. | | Hermes-4-14B-Q40.gguf | Q40 | 8.54GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. | | Hermes-4-14B-IQ4NL.gguf | IQ4NL | 8.54GB | false | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | Hermes-4-14B-IQ4XS.gguf | IQ4XS | 8.11GB | false | Decent quality, smaller than Q4KS with similar performance, recommended. | | Hermes-4-14B-Q3KL.gguf | Q3KL | 7.90GB | false | Lower quality but usable, good for low RAM availability. | | Hermes-4-14B-Q3KM.gguf | Q3KM | 7.32GB | false | Low quality. | | Hermes-4-14B-IQ3M.gguf | IQ3M | 6.88GB | false | Medium-low quality, new method with decent performance comparable to Q3KM. | | Hermes-4-14B-Q3KS.gguf | Q3KS | 6.66GB | false | Low quality, not recommended. | | Hermes-4-14B-Q2KL.gguf | Q2KL | 6.51GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | Hermes-4-14B-IQ3XS.gguf | IQ3XS | 6.38GB | false | Lower quality, new method with decent performance, slightly better than Q3KS. | | Hermes-4-14B-IQ3XXS.gguf | IQ3XXS | 5.94GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. | | Hermes-4-14B-Q2K.gguf | Q2K | 5.75GB | false | Very low quality but surprisingly usable. | | Hermes-4-14B-IQ2M.gguf | IQ2M | 5.32GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. | | Hermes-4-14B-IQ2S.gguf | IQ2S | 4.96GB | false | Low quality, uses SOTA techniques to be usable. | | Hermes-4-14B-IQ2XS.gguf | IQ2XS | 4.69GB | false | Low quality, uses SOTA techniques to be usable. 
| Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to. First, make sure you have hugginface-cli installed: If the model is bigger than 50GB, it will have been split into multiple files. In order to download them all to a local folder, run: You can either specify a new local-dir (NousResearchHermes-4-14B-Q80) or download them all in place (./) Previously, you would download Q4044/48/88, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights. details in this PR. If you use Q40 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282 you will not be able to run the Q40XX files and will instead need to use Q40. Additionally, if you want to get slightly better quality for , you can use IQ4NL thanks to this PR which will also repack the weights for ARM, though only the 44 for now. The loading time may be slower but it will result in an overall speed incrase. I'm keeping this section to show the potential theoretical uplift in performance from using the Q40 with online repacking. Click to view benchmarks on an AVX2 system (EPYC7702) | model | size | params | backend | threads | test | t/s | % (vs Q40) | | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% | Q4088 offers a nice bump to prompt processing and a small bump to text generation A great write up with charts showing various performances is provided by Artefact2 here The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have. If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. 
If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB smaller than that total.

Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart. But basically, if you're aiming for below Q4, and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in format IQXX, like IQ3M. These are newer and offer better performance for their size. These I-quants can also be used on CPU, but will be slower than their K-quant equivalent, so speed vs performance is a tradeoff you'll have to decide.

Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output. Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
dolphin-2.9-llama3-8b-GGUF
gemma-2-27b-it-GGUF
mistralai_Ministral-3-14B-Instruct-2512-GGUF
TheDrummer_Rivermind-24B-v1-GGUF
trashpanda-org_QwQ-32B-Snowdrop-v0-GGUF
THUDM_GLM-4-32B-0414-GGUF
Qwen_Qwen3-4B-Instruct-2507-GGUF
Qwen2-VL-2B-Instruct-GGUF
L3-8B-Lunaris-v1-GGUF
inclusionAI_Ling-flash-2.0-GGUF
Llamacpp imatrix Quantizations of Ling-flash-2.0 by inclusionAI Original model: https://huggingface.co/inclusionAI/Ling-flash-2.0 All quants made using imatrix option with dataset from here combined with a subset of combinedallsmall.parquet from Ed Addario here Run them directly with llama.cpp, or any other llama.cpp based project | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | Ling-flash-2.0-Q80.gguf | Q80 | 109.42GB | true | Extremely high quality, generally unneeded but max available quant. | | Ling-flash-2.0-Q6K.gguf | Q6K | 84.61GB | true | Very high quality, near perfect, recommended. | | Ling-flash-2.0-Q5KM.gguf | Q5KM | 73.32GB | true | High quality, recommended. | | Ling-flash-2.0-Q5KS.gguf | Q5KS | 71.03GB | true | High quality, recommended. | | Ling-flash-2.0-Q41.gguf | Q41 | 64.64GB | true | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | Ling-flash-2.0-Q4KL.gguf | Q4KL | 63.10GB | true | Uses Q80 for embed and output weights. Good quality, recommended. | | Ling-flash-2.0-Q4KM.gguf | Q4KM | 62.62GB | true | Good quality, default size for most use cases, recommended. | | Ling-flash-2.0-Q4KS.gguf | Q4KS | 60.37GB | true | Slightly lower quality with more space savings, recommended. | | Ling-flash-2.0-Q40.gguf | Q40 | 59.29GB | true | Legacy format, offers online repacking for ARM and AVX CPU inference. | | Ling-flash-2.0-IQ4NL.gguf | IQ4NL | 58.35GB | true | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | Ling-flash-2.0-IQ4XS.gguf | IQ4XS | 55.18GB | true | Decent quality, smaller than Q4KS with similar performance, recommended. | | Ling-flash-2.0-Q3KXL.gguf | Q3KXL | 49.57GB | false | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | Ling-flash-2.0-Q3KL.gguf | Q3KL | 49.01GB | false | Lower quality but usable, good for low RAM availability. | | Ling-flash-2.0-Q3KM.gguf | Q3KM | 47.13GB | false | Low quality. | | Ling-flash-2.0-IQ3M.gguf | IQ3M | 47.13GB | false | Medium-low quality, new method with decent performance comparable to Q3KM. | | Ling-flash-2.0-Q3KS.gguf | Q3KS | 44.90GB | false | Low quality, not recommended. | | Ling-flash-2.0-IQ3XS.gguf | IQ3XS | 42.48GB | false | Lower quality, new method with decent performance, slightly better than Q3KS. | | Ling-flash-2.0-IQ3XXS.gguf | IQ3XXS | 40.85GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. | | Ling-flash-2.0-Q2KL.gguf | Q2KL | 36.88GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | Ling-flash-2.0-Q2K.gguf | Q2K | 36.25GB | false | Very low quality but surprisingly usable. | | Ling-flash-2.0-IQ2M.gguf | IQ2M | 32.41GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. | | Ling-flash-2.0-IQ2S.gguf | IQ2S | 28.66GB | false | Low quality, uses SOTA techniques to be usable. | | Ling-flash-2.0-IQ2XS.gguf | IQ2XS | 28.53GB | false | Low quality, uses SOTA techniques to be usable. | | Ling-flash-2.0-IQ2XXS.gguf | IQ2XXS | 25.82GB | false | Very low quality, uses SOTA techniques to be usable. | | Ling-flash-2.0-IQ1M.gguf | IQ1M | 22.22GB | false | Extremely low quality, not recommended. | | Ling-flash-2.0-IQ1S.gguf | IQ1S | 21.45GB | false | Extremely low quality, not recommended. 
| Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to. First, make sure you have hugginface-cli installed: If the model is bigger than 50GB, it will have been split into multiple files. In order to download them all to a local folder, run: You can either specify a new local-dir (inclusionAILing-flash-2.0-Q80) or download them all in place (./) Previously, you would download Q4044/48/88, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights. details in this PR. If you use Q40 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282 you will not be able to run the Q40XX files and will instead need to use Q40. Additionally, if you want to get slightly better quality for , you can use IQ4NL thanks to this PR which will also repack the weights for ARM, though only the 44 for now. The loading time may be slower but it will result in an overall speed incrase. I'm keeping this section to show the potential theoretical uplift in performance from using the Q40 with online repacking. Click to view benchmarks on an AVX2 system (EPYC7702) | model | size | params | backend | threads | test | t/s | % (vs Q40) | | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% | Q4088 offers a nice bump to prompt processing and a small bump to text generation A great write up with charts showing various performances is provided by Artefact2 here The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have. If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. 
If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB smaller than that total. Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart. But basically, if you're aiming for below Q4 and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in format IQXX, like IQ3M. These are newer and offer better performance for their size. These I-quants can also be used on CPU, but will be slower than their K-quant equivalents, so speed vs performance is a tradeoff you'll have to decide. Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output. Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
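For reference, a minimal sketch of the huggingface-cli workflow described above, using this Ling-flash-2.0 repo as the example. The repo id and the --include pattern are assumptions based on the naming in the table, so copy the exact folder/file names from the repo's file listing before running anything:

```bash
# Install the Hugging Face CLI
pip install -U "huggingface_hub[cli]"

# Download every shard of a split quant into a new folder named after the quant.
# Repo id and pattern are assumptions; check the repo's file list for exact names.
huggingface-cli download bartowski/inclusionAI_Ling-flash-2.0-GGUF \
  --include "inclusionAI_Ling-flash-2.0-Q4_K_M/*" \
  --local-dir inclusionAI_Ling-flash-2.0-Q4_K_M

# Or download everything in place instead:
# huggingface-cli download bartowski/inclusionAI_Ling-flash-2.0-GGUF \
#   --include "inclusionAI_Ling-flash-2.0-Q4_K_M/*" --local-dir ./
```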
ServiceNow-AI_Apriel-1.6-15b-Thinker-GGUF
Qwen2-VL-7B-Instruct-GGUF
microsoft_UserLM-8b-GGUF
Llamacpp imatrix Quantizations of UserLM-8b by microsoft Original model: https://huggingface.co/microsoft/UserLM-8b All quants made using imatrix option with dataset from here combined with a subset of combinedallsmall.parquet from Ed Addario here Run them directly with llama.cpp, or any other llama.cpp based project No chat template specified so default is used. This may be incorrect, check original model card for details. | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | UserLM-8b-bf16.gguf | bf16 | 16.07GB | false | Full BF16 weights. | | UserLM-8b-Q80.gguf | Q80 | 8.54GB | false | Extremely high quality, generally unneeded but max available quant. | | UserLM-8b-Q6KL.gguf | Q6KL | 6.85GB | false | Uses Q80 for embed and output weights. Very high quality, near perfect, recommended. | | UserLM-8b-Q6K.gguf | Q6K | 6.60GB | false | Very high quality, near perfect, recommended. | | UserLM-8b-Q5KL.gguf | Q5KL | 6.06GB | false | Uses Q80 for embed and output weights. High quality, recommended. | | UserLM-8b-Q5KM.gguf | Q5KM | 5.73GB | false | High quality, recommended. | | UserLM-8b-Q5KS.gguf | Q5KS | 5.60GB | false | High quality, recommended. | | UserLM-8b-Q4KL.gguf | Q4KL | 5.31GB | false | Uses Q80 for embed and output weights. Good quality, recommended. | | UserLM-8b-Q41.gguf | Q41 | 5.13GB | false | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | UserLM-8b-Q4KM.gguf | Q4KM | 4.92GB | false | Good quality, default size for most use cases, recommended. | | UserLM-8b-Q3KXL.gguf | Q3KXL | 4.78GB | false | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | UserLM-8b-Q4KS.gguf | Q4KS | 4.69GB | false | Slightly lower quality with more space savings, recommended. | | UserLM-8b-Q40.gguf | Q40 | 4.68GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. | | UserLM-8b-IQ4NL.gguf | IQ4NL | 4.68GB | false | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | UserLM-8b-IQ4XS.gguf | IQ4XS | 4.45GB | false | Decent quality, smaller than Q4KS with similar performance, recommended. | | UserLM-8b-Q3KL.gguf | Q3KL | 4.32GB | false | Lower quality but usable, good for low RAM availability. | | UserLM-8b-Q3KM.gguf | Q3KM | 4.02GB | false | Low quality. | | UserLM-8b-IQ3M.gguf | IQ3M | 3.78GB | false | Medium-low quality, new method with decent performance comparable to Q3KM. | | UserLM-8b-Q2KL.gguf | Q2KL | 3.69GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | UserLM-8b-Q3KS.gguf | Q3KS | 3.66GB | false | Low quality, not recommended. | | UserLM-8b-IQ3XS.gguf | IQ3XS | 3.52GB | false | Lower quality, new method with decent performance, slightly better than Q3KS. | | UserLM-8b-IQ3XXS.gguf | IQ3XXS | 3.27GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. | | UserLM-8b-Q2K.gguf | Q2K | 3.18GB | false | Very low quality but surprisingly usable. | | UserLM-8b-IQ2M.gguf | IQ2M | 2.95GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. | Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to. First, make sure you have hugginface-cli installed: If the model is bigger than 50GB, it will have been split into multiple files. 
In order to download them all to a local folder, run: You can either specify a new local-dir (microsoftUserLM-8b-Q80) or download them all in place (./) Previously, you would download Q4044/48/88, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights. details in this PR. If you use Q40 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282 you will not be able to run the Q40XX files and will instead need to use Q40. Additionally, if you want to get slightly better quality for , you can use IQ4NL thanks to this PR which will also repack the weights for ARM, though only the 44 for now. The loading time may be slower but it will result in an overall speed incrase. I'm keeping this section to show the potential theoretical uplift in performance from using the Q40 with online repacking. Click to view benchmarks on an AVX2 system (EPYC7702) | model | size | params | backend | threads | test | t/s | % (vs Q40) | | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% | Q4088 offers a nice bump to prompt processing and a small bump to text generation A great write up with charts showing various performances is provided by Artefact2 here The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have. If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB Smaller than that total. Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. 
These are in format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart: But basically, if you're aiming for below Q4, and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in format IQXX, like IQ3M. These are newer and offer better performance for their size. These I-quants can also be used on CPU, but will be slower than their K-quant equivalent, so speed vs performance is a tradeoff you'll have to decide. Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output. Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
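The pp/tg rows in the benchmark table above come from llama.cpp's llama-bench tool; a rough sketch of an equivalent invocation follows (the model path is a placeholder, flags per current llama.cpp builds):

```bash
# Measure prompt processing (pp) and text generation (tg) at the same sizes
# shown in the table above, using 64 CPU threads.
./llama-bench -m ./qwen2-3b-instruct-Q4_0.gguf \
  -p 512,1024,2048 \
  -n 128,256,512 \
  -t 64
```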
nvidia_NVIDIA-Nemotron-Nano-12B-v2-GGUF
Llamacpp imatrix Quantizations of NVIDIA-Nemotron-Nano-12B-v2 by nvidia Original model: https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-12B-v2 All quants made using imatrix option with dataset from here combined with a subset of combinedallsmall.parquet from Ed Addario here Run them directly with llama.cpp, or any other llama.cpp based project | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | NVIDIA-Nemotron-Nano-12B-v2-bf16.gguf | bf16 | 24.63GB | false | Full BF16 weights. | | NVIDIA-Nemotron-Nano-12B-v2-Q80.gguf | Q80 | 13.09GB | false | Extremely high quality, generally unneeded but max available quant. | | NVIDIA-Nemotron-Nano-12B-v2-Q6KL.gguf | Q6KL | 10.44GB | false | Uses Q80 for embed and output weights. Very high quality, near perfect, recommended. | | NVIDIA-Nemotron-Nano-12B-v2-Q6K.gguf | Q6K | 10.11GB | false | Very high quality, near perfect, recommended. | | NVIDIA-Nemotron-Nano-12B-v2-Q5KL.gguf | Q5KL | 9.18GB | false | Uses Q80 for embed and output weights. High quality, recommended. | | NVIDIA-Nemotron-Nano-12B-v2-Q5KM.gguf | Q5KM | 8.76GB | false | High quality, recommended. | | NVIDIA-Nemotron-Nano-12B-v2-Q5KS.gguf | Q5KS | 8.57GB | false | High quality, recommended. | | NVIDIA-Nemotron-Nano-12B-v2-Q4KL.gguf | Q4KL | 7.99GB | false | Uses Q80 for embed and output weights. Good quality, recommended. | | NVIDIA-Nemotron-Nano-12B-v2-Q41.gguf | Q41 | 7.84GB | false | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | NVIDIA-Nemotron-Nano-12B-v2-Q4KM.gguf | Q4KM | 7.49GB | false | Good quality, default size for most use cases, recommended. | | NVIDIA-Nemotron-Nano-12B-v2-Q4KS.gguf | Q4KS | 7.21GB | false | Slightly lower quality with more space savings, recommended. | | NVIDIA-Nemotron-Nano-12B-v2-Q40.gguf | Q40 | 7.16GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. | | NVIDIA-Nemotron-Nano-12B-v2-IQ4NL.gguf | IQ4NL | 7.11GB | false | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | NVIDIA-Nemotron-Nano-12B-v2-Q3KXL.gguf | Q3KXL | 6.96GB | false | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | NVIDIA-Nemotron-Nano-12B-v2-IQ4XS.gguf | IQ4XS | 6.75GB | false | Decent quality, smaller than Q4KS with similar performance, recommended. | | NVIDIA-Nemotron-Nano-12B-v2-Q3KL.gguf | Q3KL | 6.37GB | false | Lower quality but usable, good for low RAM availability. | | NVIDIA-Nemotron-Nano-12B-v2-Q3KM.gguf | Q3KM | 6.02GB | false | Low quality. | | NVIDIA-Nemotron-Nano-12B-v2-IQ3M.gguf | IQ3M | 5.69GB | false | Medium-low quality, new method with decent performance comparable to Q3KM. | | NVIDIA-Nemotron-Nano-12B-v2-Q3KS.gguf | Q3KS | 5.57GB | false | Low quality, not recommended. | | NVIDIA-Nemotron-Nano-12B-v2-IQ3XS.gguf | IQ3XS | 5.46GB | false | Lower quality, new method with decent performance, slightly better than Q3KS. | | NVIDIA-Nemotron-Nano-12B-v2-Q2KL.gguf | Q2KL | 5.36GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | NVIDIA-Nemotron-Nano-12B-v2-IQ3XXS.gguf | IQ3XXS | 4.96GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. | | NVIDIA-Nemotron-Nano-12B-v2-Q2K.gguf | Q2K | 4.70GB | false | Very low quality but surprisingly usable. 
| | NVIDIA-Nemotron-Nano-12B-v2-IQ2M.gguf | IQ2M | 4.38GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. | | NVIDIA-Nemotron-Nano-12B-v2-IQ2S.gguf | IQ2S | 4.07GB | false | Low quality, uses SOTA techniques to be usable. | Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to. First, make sure you have hugginface-cli installed: If the model is bigger than 50GB, it will have been split into multiple files. In order to download them all to a local folder, run: You can either specify a new local-dir (nvidiaNVIDIA-Nemotron-Nano-12B-v2-Q80) or download them all in place (./) Previously, you would download Q4044/48/88, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights. details in this PR. If you use Q40 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282 you will not be able to run the Q40XX files and will instead need to use Q40. Additionally, if you want to get slightly better quality for , you can use IQ4NL thanks to this PR which will also repack the weights for ARM, though only the 44 for now. The loading time may be slower but it will result in an overall speed incrase. I'm keeping this section to show the potential theoretical uplift in performance from using the Q40 with online repacking. Click to view benchmarks on an AVX2 system (EPYC7702) | model | size | params | backend | threads | test | t/s | % (vs Q40) | | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% | Q4088 offers a nice bump to prompt processing and a small bump to text generation A great write up with charts showing various performances is provided by Artefact2 here The first thing to figure out is how big a model you can run. 
To do this, you'll need to figure out how much RAM and/or VRAM you have. If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. For example, with a 12GB card, the 10.11GB Q6K file listed above leaves roughly 2GB of headroom for context. If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB smaller than that total. Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart. But basically, if you're aiming for below Q4 and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in format IQXX, like IQ3M. These are newer and offer better performance for their size. These I-quants can also be used on CPU, but will be slower than their K-quant equivalents, so speed vs performance is a tradeoff you'll have to decide. Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output. Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
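As a sketch of the "fit it all in VRAM" case with llama.cpp (the filename is a placeholder for whichever quant you downloaded; -ngl offloads layers to the GPU):

```bash
# Serve the model with every layer offloaded to the GPU (-ngl 99) and an 8k
# context; llama-server exposes an OpenAI-compatible API on port 8080 by default.
./llama-server -m ./NVIDIA-Nemotron-Nano-12B-v2-Q6_K.gguf -ngl 99 -c 8192

# Or run a single prompt from the terminal:
./llama-cli -m ./NVIDIA-Nemotron-Nano-12B-v2-Q6_K.gguf -ngl 99 -p "Hello"
```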
huihui-ai_Huihui-gemma-3n-E4B-it-abliterated-GGUF
TheDrummer_Skyfall-31B-v4-GGUF
Llamacpp imatrix Quantizations of Skyfall-31B-v4 by TheDrummer Original model: https://huggingface.co/TheDrummer/Skyfall-31B-v4 All quants made using imatrix option with dataset from here combined with a subset of combinedallsmall.parquet from Ed Addario here Run them directly with llama.cpp, or any other llama.cpp based project | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | Skyfall-31B-v4-bf16.gguf | bf16 | 62.71GB | true | Full BF16 weights. | | Skyfall-31B-v4-Q80.gguf | Q80 | 33.32GB | false | Extremely high quality, generally unneeded but max available quant. | | Skyfall-31B-v4-Q6KL.gguf | Q6KL | 26.05GB | false | Uses Q80 for embed and output weights. Very high quality, near perfect, recommended. | | Skyfall-31B-v4-Q6K.gguf | Q6K | 25.73GB | false | Very high quality, near perfect, recommended. | | Skyfall-31B-v4-Q5KL.gguf | Q5KL | 22.67GB | false | Uses Q80 for embed and output weights. High quality, recommended. | | Skyfall-31B-v4-Q5KM.gguf | Q5KM | 22.25GB | false | High quality, recommended. | | Skyfall-31B-v4-Q5KS.gguf | Q5KS | 21.65GB | false | High quality, recommended. | | Skyfall-31B-v4-Q41.gguf | Q41 | 19.74GB | false | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | Skyfall-31B-v4-Q4KL.gguf | Q4KL | 19.48GB | false | Uses Q80 for embed and output weights. Good quality, recommended. | | Skyfall-31B-v4-Q4KM.gguf | Q4KM | 18.98GB | false | Good quality, default size for most use cases, recommended. | | Skyfall-31B-v4-Q4KS.gguf | Q4KS | 17.95GB | false | Slightly lower quality with more space savings, recommended. | | Skyfall-31B-v4-Q40.gguf | Q40 | 17.88GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. | | Skyfall-31B-v4-IQ4NL.gguf | IQ4NL | 17.85GB | false | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | Skyfall-31B-v4-Q3KXL.gguf | Q3KXL | 17.03GB | false | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | Skyfall-31B-v4-IQ4XS.gguf | IQ4XS | 16.90GB | false | Decent quality, smaller than Q4KS with similar performance, recommended. | | Skyfall-31B-v4-Q3KL.gguf | Q3KL | 16.44GB | false | Lower quality but usable, good for low RAM availability. | | Skyfall-31B-v4-Q3KM.gguf | Q3KM | 15.20GB | false | Low quality. | | Skyfall-31B-v4-IQ3M.gguf | IQ3M | 14.07GB | false | Medium-low quality, new method with decent performance comparable to Q3KM. | | Skyfall-31B-v4-Q3KS.gguf | Q3KS | 13.74GB | false | Low quality, not recommended. | | Skyfall-31B-v4-IQ3XS.gguf | IQ3XS | 13.07GB | false | Lower quality, new method with decent performance, slightly better than Q3KS. | | Skyfall-31B-v4-Q2KL.gguf | Q2KL | 12.38GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | Skyfall-31B-v4-IQ3XXS.gguf | IQ3XXS | 12.26GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. | | Skyfall-31B-v4-Q2K.gguf | Q2K | 11.73GB | false | Very low quality but surprisingly usable. | | Skyfall-31B-v4-IQ2M.gguf | IQ2M | 10.68GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. | | Skyfall-31B-v4-IQ2S.gguf | IQ2S | 9.81GB | false | Low quality, uses SOTA techniques to be usable. | | Skyfall-31B-v4-IQ2XS.gguf | IQ2XS | 9.48GB | false | Low quality, uses SOTA techniques to be usable. 
| | Skyfall-31B-v4-IQ2XXS.gguf | IQ2XXS | 8.59GB | false | Very low quality, uses SOTA techniques to be usable. | Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to. First, make sure you have hugginface-cli installed: If the model is bigger than 50GB, it will have been split into multiple files. In order to download them all to a local folder, run: You can either specify a new local-dir (TheDrummerSkyfall-31B-v4-Q80) or download them all in place (./) Previously, you would download Q4044/48/88, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights. details in this PR. If you use Q40 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282 you will not be able to run the Q40XX files and will instead need to use Q40. Additionally, if you want to get slightly better quality for , you can use IQ4NL thanks to this PR which will also repack the weights for ARM, though only the 44 for now. The loading time may be slower but it will result in an overall speed incrase. I'm keeping this section to show the potential theoretical uplift in performance from using the Q40 with online repacking. Click to view benchmarks on an AVX2 system (EPYC7702) | model | size | params | backend | threads | test | t/s | % (vs Q40) | | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% | Q4088 offers a nice bump to prompt processing and a small bump to text generation A great write up with charts showing various performances is provided by Artefact2 here The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have. 
If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB smaller than that total. Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart. But basically, if you're aiming for below Q4 and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in format IQXX, like IQ3M. These are newer and offer better performance for their size. These I-quants can also be used on CPU, but will be slower than their K-quant equivalents, so speed vs performance is a tradeoff you'll have to decide. Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output. Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
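To get the RAM and VRAM numbers used in the sizing advice above, something like the following works on Linux (the rocm-smi flag spelling varies a bit between ROCm versions, so treat that line as approximate):

```bash
# Total system RAM
free -h

# Total VRAM on an NVIDIA GPU
nvidia-smi --query-gpu=memory.total --format=csv

# Total VRAM on an AMD GPU under ROCm (flag names vary by version)
rocm-smi --showmeminfo vram
```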
Mistral-Small-Instruct-2409-GGUF
deepseek-ai_DeepSeek-R1-0528-Qwen3-8B-GGUF
allenai_olmOCR-2-7B-1025-GGUF
TheDrummer_Cydonia-Redux-22B-v1.1-GGUF
Llamacpp imatrix Quantizations of Cydonia-Redux-22B-v1.1 by TheDrummer Original model: https://huggingface.co/TheDrummer/Cydonia-Redux-22B-v1.1 All quants made using imatrix option with dataset from here combined with a subset of combinedallsmall.parquet from Ed Addario here Run them directly with llama.cpp, or any other llama.cpp based project No chat template specified so default is used. This may be incorrect, check original model card for details. | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | Cydonia-Redux-22B-v1.1-bf16.gguf | bf16 | 44.50GB | false | Full BF16 weights. | | Cydonia-Redux-22B-v1.1-Q80.gguf | Q80 | 23.64GB | false | Extremely high quality, generally unneeded but max available quant. | | Cydonia-Redux-22B-v1.1-Q6KL.gguf | Q6KL | 18.35GB | false | Uses Q80 for embed and output weights. Very high quality, near perfect, recommended. | | Cydonia-Redux-22B-v1.1-Q6K.gguf | Q6K | 18.25GB | false | Very high quality, near perfect, recommended. | | Cydonia-Redux-22B-v1.1-Q5KL.gguf | Q5KL | 15.85GB | false | Uses Q80 for embed and output weights. High quality, recommended. | | Cydonia-Redux-22B-v1.1-Q5KM.gguf | Q5KM | 15.72GB | false | High quality, recommended. | | Cydonia-Redux-22B-v1.1-Q5KS.gguf | Q5KS | 15.32GB | false | High quality, recommended. | | Cydonia-Redux-22B-v1.1-Q41.gguf | Q41 | 13.95GB | false | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | Cydonia-Redux-22B-v1.1-Q4KL.gguf | Q4KL | 13.49GB | false | Uses Q80 for embed and output weights. Good quality, recommended. | | Cydonia-Redux-22B-v1.1-Q4KM.gguf | Q4KM | 13.34GB | false | Good quality, default size for most use cases, recommended. | | Cydonia-Redux-22B-v1.1-Q4KS.gguf | Q4KS | 12.66GB | false | Slightly lower quality with more space savings, recommended. | | Cydonia-Redux-22B-v1.1-Q40.gguf | Q40 | 12.61GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. | | Cydonia-Redux-22B-v1.1-IQ4NL.gguf | IQ4NL | 12.61GB | false | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | Cydonia-Redux-22B-v1.1-IQ4XS.gguf | IQ4XS | 11.94GB | false | Decent quality, smaller than Q4KS with similar performance, recommended. | | Cydonia-Redux-22B-v1.1-Q3KXL.gguf | Q3KXL | 11.91GB | false | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | Cydonia-Redux-22B-v1.1-Q3KL.gguf | Q3KL | 11.73GB | false | Lower quality but usable, good for low RAM availability. | | Cydonia-Redux-22B-v1.1-Q3KM.gguf | Q3KM | 10.76GB | false | Low quality. | | Cydonia-Redux-22B-v1.1-IQ3M.gguf | IQ3M | 10.06GB | false | Medium-low quality, new method with decent performance comparable to Q3KM. | | Cydonia-Redux-22B-v1.1-Q3KS.gguf | Q3KS | 9.64GB | false | Low quality, not recommended. | | Cydonia-Redux-22B-v1.1-IQ3XS.gguf | IQ3XS | 9.18GB | false | Lower quality, new method with decent performance, slightly better than Q3KS. | | Cydonia-Redux-22B-v1.1-IQ3XXS.gguf | IQ3XXS | 8.60GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. | | Cydonia-Redux-22B-v1.1-Q2KL.gguf | Q2KL | 8.47GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | Cydonia-Redux-22B-v1.1-Q2K.gguf | Q2K | 8.27GB | false | Very low quality but surprisingly usable. 
| | Cydonia-Redux-22B-v1.1-IQ2M.gguf | IQ2M | 7.62GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. | | Cydonia-Redux-22B-v1.1-IQ2S.gguf | IQ2S | 7.04GB | false | Low quality, uses SOTA techniques to be usable. | | Cydonia-Redux-22B-v1.1-IQ2XS.gguf | IQ2XS | 6.65GB | false | Low quality, uses SOTA techniques to be usable. | Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to. First, make sure you have hugginface-cli installed: If the model is bigger than 50GB, it will have been split into multiple files. In order to download them all to a local folder, run: You can either specify a new local-dir (TheDrummerCydonia-Redux-22B-v1.1-Q80) or download them all in place (./) Previously, you would download Q4044/48/88, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights. details in this PR. If you use Q40 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282 you will not be able to run the Q40XX files and will instead need to use Q40. Additionally, if you want to get slightly better quality for , you can use IQ4NL thanks to this PR which will also repack the weights for ARM, though only the 44 for now. The loading time may be slower but it will result in an overall speed incrase. I'm keeping this section to show the potential theoretical uplift in performance from using the Q40 with online repacking. Click to view benchmarks on an AVX2 system (EPYC7702) | model | size | params | backend | threads | test | t/s | % (vs Q40) | | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% | Q4088 offers a nice bump to prompt processing and a small bump to text generation A great write up with charts showing various performances 
is provided by Artefact2 here. The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have. If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB smaller than that total. Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart. But basically, if you're aiming for below Q4 and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in format IQXX, like IQ3M. These are newer and offer better performance for their size. These I-quants can also be used on CPU, but will be slower than their K-quant equivalents, so speed vs performance is a tradeoff you'll have to decide. Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output. Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
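Since every quant in this Cydonia-Redux card is under 50GB (Split = false), a single-file download is enough. A sketch, with the repo id assumed from the title above and a wildcard so you don't have to guess the exact filename prefix (on the Hub the quant type is usually spelled with underscores, e.g. Q4_K_M):

```bash
# Pull one specific quant file into the current directory.
huggingface-cli download bartowski/TheDrummer_Cydonia-Redux-22B-v1.1-GGUF \
  --include "*Q4_K_M.gguf" \
  --local-dir ./
```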
Qwen_QwQ-32B-GGUF
Original model: https://huggingface.co/Qwen/QwQ-32B All quants made using imatrix option with dataset from here Run them directly with llama.cpp, or any other llama.cpp based project | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | QwQ-32B-Q80.gguf | Q80 | 34.82GB | false | Extremely high quality, generally unneeded but max available quant. | | QwQ-32B-Q6KL.gguf | Q6KL | 27.26GB | false | Uses Q80 for embed and output weights. Very high quality, near perfect, recommended. | | QwQ-32B-Q6K.gguf | Q6K | 26.89GB | false | Very high quality, near perfect, recommended. | | QwQ-32B-Q5KL.gguf | Q5KL | 23.74GB | false | Uses Q80 for embed and output weights. High quality, recommended. | | QwQ-32B-Q5KM.gguf | Q5KM | 23.26GB | false | High quality, recommended. | | QwQ-32B-Q5KS.gguf | Q5KS | 22.64GB | false | High quality, recommended. | | QwQ-32B-Q41.gguf | Q41 | 20.64GB | false | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | QwQ-32B-Q4KL.gguf | Q4KL | 20.43GB | false | Uses Q80 for embed and output weights. Good quality, recommended. | | QwQ-32B-Q4KM.gguf | Q4KM | 19.85GB | false | Good quality, default size for most use cases, recommended. | | QwQ-32B-Q4KS.gguf | Q4KS | 18.78GB | false | Slightly lower quality with more space savings, recommended. | | QwQ-32B-Q40.gguf | Q40 | 18.71GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. | | QwQ-32B-IQ4NL.gguf | IQ4NL | 18.68GB | false | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | QwQ-32B-Q3KXL.gguf | Q3KXL | 17.93GB | false | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | QwQ-32B-IQ4XS.gguf | IQ4XS | 17.69GB | false | Decent quality, smaller than Q4KS with similar performance, recommended. | | QwQ-32B-Q3KL.gguf | Q3KL | 17.25GB | false | Lower quality but usable, good for low RAM availability. | | QwQ-32B-Q3KM.gguf | Q3KM | 15.94GB | false | Low quality. | | QwQ-32B-IQ3M.gguf | IQ3M | 14.81GB | false | Medium-low quality, new method with decent performance comparable to Q3KM. | | QwQ-32B-Q3KS.gguf | Q3KS | 14.39GB | false | Low quality, not recommended. | | QwQ-32B-IQ3XS.gguf | IQ3XS | 13.71GB | false | Lower quality, new method with decent performance, slightly better than Q3KS. | | QwQ-32B-Q2KL.gguf | Q2KL | 13.07GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | QwQ-32B-IQ3XXS.gguf | IQ3XXS | 12.84GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. | | QwQ-32B-Q2K.gguf | Q2K | 12.31GB | false | Very low quality but surprisingly usable. | | QwQ-32B-IQ2M.gguf | IQ2M | 11.26GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. | | QwQ-32B-IQ2S.gguf | IQ2S | 10.39GB | false | Low quality, uses SOTA techniques to be usable. | | QwQ-32B-IQ2XS.gguf | IQ2XS | 9.96GB | false | Low quality, uses SOTA techniques to be usable. | | QwQ-32B-IQ2XXS.gguf | IQ2XXS | 9.03GB | false | Very low quality, uses SOTA techniques to be usable. | Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to. First, make sure you have hugginface-cli installed: If the model is bigger than 50GB, it will have been split into multiple files. 
In order to download them all to a local folder, run: You can either specify a new local-dir (QwenQwQ-32B-Q80) or download them all in place (./) Previously, you would download Q4044/48/88, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights. details in this PR. If you use Q40 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282 you will not be able to run the Q40XX files and will instead need to use Q40. Additionally, if you want to get slightly better quality for , you can use IQ4NL thanks to this PR which will also repack the weights for ARM, though only the 44 for now. The loading time may be slower but it will result in an overall speed incrase. I'm keeping this section to show the potential theoretical uplift in performance from using the Q40 with online repacking. Click to view benchmarks on an AVX2 system (EPYC7702) | model | size | params | backend | threads | test | t/s | % (vs Q40) | | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% | Q4088 offers a nice bump to prompt processing and a small bump to text generation A great write up with charts showing various performances is provided by Artefact2 here The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have. If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB Smaller than that total. Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. 
These are in format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart. But basically, if you're aiming for below Q4 and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in format IQXX, like IQ3M. These are newer and offer better performance for their size. These I-quants can also be used on CPU and Apple Metal, but will be slower than their K-quant equivalents, so speed vs performance is a tradeoff you'll have to decide. The I-quants are not compatible with Vulkan, which is also an option for AMD, so if you have an AMD card double check whether you're using the rocBLAS build or the Vulkan build. At the time of writing this, LM Studio has a preview with ROCm support, and other inference engines have specific builds for ROCm. Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output. Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
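If you're unsure whether your llama.cpp binary is the ROCm or the Vulkan build, the cleanest fix is to compile it yourself with the backend made explicit. A sketch using recent llama.cpp CMake options; older releases used different flag names (e.g. LLAMA_CUBLAS, GGML_HIPBLAS), so check the build docs for your version:

```bash
# Configure with the backend you want (pick exactly one configure line), then build.
cmake -B build -DGGML_CUDA=ON     # NVIDIA (cuBLAS)
cmake -B build -DGGML_HIP=ON      # AMD via ROCm (rocBLAS/hipBLAS)
cmake -B build -DGGML_VULKAN=ON   # Vulkan; per the note above, skip I-quants here
cmake --build build --config Release
```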
MiniMaxAI_MiniMax-M2.7-GGUF
tencent_Hunyuan-7B-Instruct-GGUF
gemma-2-9b-it-abliterated-GGUF
mlabonne_gemma-3-4b-it-abliterated-GGUF
cognitivecomputations_Dolphin3.0-R1-Mistral-24B-GGUF
Qwen2.5-32B-Instruct-GGUF
ai21labs_AI21-Jamba-Reasoning-3B-GGUF
google_gemma-3-12b-it-GGUF
QwQ-32B-Preview-GGUF
tencent_Hunyuan-1.8B-Instruct-GGUF
huizimao_gpt-oss-120b-uncensored-bf16-GGUF
Yi-1.5-9B-Chat-GGUF
TheDrummer_Tiger-Gemma-12B-v3-GGUF
aya-expanse-32b-GGUF
mistralai_Ministral-3-3B-Instruct-2512-GGUF
Qwen_Qwen3-0.6B-GGUF
openai_gpt-oss-120b-GGUF
Llamacpp imatrix Quantizations of gpt-oss-120b by openai Original model: https://huggingface.co/openai/gpt-oss-120b All quants made using imatrix option with dataset from here Run them directly with llama.cpp, or any other llama.cpp based project | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | gpt-oss-120b-MXFP4MOE.gguf | MXFP4MOE | 63.39GB | true | Special format for OpenAI's gpt-oss models, see: https://github.com/ggml-org/llama.cpp/pull/15091 | The reason is, the FFN (feed forward networks) of gpt-oss do not behave nicely when quantized to anything other than MXFP4, so they are kept at that level for everything. The rest of these are provided for your own interest in case you feel like experimenting, but the size savings is basically non-existent so I would not recommend running them, they are provided simply for show: | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | gpt-oss-120b-bf16.gguf | bf16 | 65.37GB | true | Full BF16 weights. | | gpt-oss-120b-Q6K.gguf | Q6K | 63.28GB | true | Q6K with all FFN kept at MXFP4MOE. | | gpt-oss-120b-Q4KL.gguf | Q4KL | 63.06GB | true | Uses Q80 for embed and output weights. Q4KM with all FFN kept at MXFP4MOE. | | gpt-oss-120b-Q2KL.gguf | Q2KL | 63.00GB | true | Uses Q80 for embed and output weights. Q2K with all FFN kept at MXFP4MOE. | | gpt-oss-120b-Q3KXL.gguf | Q3KXL | 62.89GB | true | Uses Q80 for embed and output weights. Q3KL with all FFN kept at MXFP4MOE. | | gpt-oss-120b-Q4KM.gguf | Q4KM | 62.84GB | true | Q4KM with all FFN kept at MXFP4MOE | | gpt-oss-120b-Q41.gguf | Q41 | 62.74GB | true | Q41 with all FFN kept at MXFP4MOE. | | gpt-oss-120b-IQ4NL.gguf | IQ4NL | 62.71GB | true | IQ4NL with all FFN kept at MXFP4MOE. | | gpt-oss-120b-IQ4XS.gguf | IQ4XS | 62.71GB | true | IQ4XS with all FFN kept at MXFP4MOE. | | gpt-oss-120b-Q3KM.gguf | Q3KM | 62.71GB | true | Q3KM with all FFN kept at MXFP4MOE. | | gpt-oss-120b-IQ3M.gguf | IQ3M | 62.71GB | true | IQ3M with all FFN kept at MXFP4MOE. | | gpt-oss-120b-Q2K.gguf | Q2K | 62.71GB | true | Q2K with all FFN kept at MXFP4MOE. | | gpt-oss-120b-Q3KS.gguf | Q3KS | 62.70GB | true | Q3KS with all FFN kept at MXFP4MOE. | | gpt-oss-120b-IQ2M.gguf | IQ2M | 62.69GB | true | IQ2M with all FFN kept at MXFP4MOE. | | gpt-oss-120b-Q40.gguf | Q40 | 62.65GB | true | Q40 with all FFN kept at MXFP4MOE. | | gpt-oss-120b-Q3KL.gguf | Q3KL | 62.60GB | true | Q3KL with all FFN kept at MXFP4MOE. | Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to. First, make sure you have hugginface-cli installed: If the model is bigger than 50GB, it will have been split into multiple files. In order to download them all to a local folder, run: You can either specify a new local-dir (openaigpt-oss-120b-Q80) or download them all in place (./) Previously, you would download Q4044/48/88, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights. details in this PR. If you use Q40 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282 you will not be able to run the Q40XX files and will instead need to use Q40. 
Additionally, if you want to get slightly better quality for , you can use IQ4NL thanks to this PR which will also repack the weights for ARM, though only the 44 for now. The loading time may be slower but it will result in an overall speed incrase. I'm keeping this section to show the potential theoretical uplift in performance from using the Q40 with online repacking. Click to view benchmarks on an AVX2 system (EPYC7702) | model | size | params | backend | threads | test | t/s | % (vs Q40) | | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% | Q4088 offers a nice bump to prompt processing and a small bump to text generation A great write up with charts showing various performances is provided by Artefact2 here The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have. If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB Smaller than that total. Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart: But basically, if you're aiming for below Q4, and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in format IQXX, like IQ3M. These are newer and offer better performance for their size. These I-quants can also be used on CPU, but will be slower than their K-quant equivalent, so speed vs performance is a tradeoff you'll have to decide. Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output. 
Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
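The MXFP4MOE file above is split into multiple shards, but llama.cpp loads the remaining parts automatically when you point it at the first one, so no manual merging is needed. A sketch (the shard name is illustrative; copy the real one from the repo's file listing):

```bash
# Load a split GGUF via its first shard; the -0000x-of-0000y siblings must sit
# in the same directory and are picked up automatically.
./llama-server -m ./gpt-oss-120b-MXFP4_MOE-00001-of-00002.gguf -c 8192
```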
MiniMaxAI_MiniMax-M2-GGUF
Llamacpp imatrix Quantizations of MiniMax-M2 by MiniMaxAI Original model: https://huggingface.co/MiniMaxAI/MiniMax-M2 All quants made using imatrix option with dataset from here combined with a subset of combinedallsmall.parquet from Ed Addario here Run them directly with llama.cpp, or any other llama.cpp based project No chat template specified so default is used. This may be incorrect, check original model card for details. | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | MiniMax-M2-Q80.gguf | Q80 | 243.14GB | true | Extremely high quality, generally unneeded but max available quant. | | MiniMax-M2-Q6K.gguf | Q6K | 187.81GB | true | Very high quality, near perfect, recommended. | | MiniMax-M2-Q5KM.gguf | Q5KM | 162.38GB | true | High quality, recommended. | | MiniMax-M2-Q5KS.gguf | Q5KS | 157.55GB | true | High quality, recommended. | | MiniMax-M2-Q41.gguf | Q41 | 143.31GB | true | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | MiniMax-M2-Q4KM.gguf | Q4KM | 138.59GB | true | Good quality, default size for most use cases, recommended. | | MiniMax-M2-Q4KS.gguf | Q4KS | 133.75GB | true | Slightly lower quality with more space savings, recommended. | | MiniMax-M2-Q40.gguf | Q40 | 131.34GB | true | Legacy format, offers online repacking for ARM and AVX CPU inference. | | MiniMax-M2-IQ4NL.gguf | IQ4NL | 129.24GB | true | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | MiniMax-M2-IQ4XS.gguf | IQ4XS | 122.17GB | true | Decent quality, smaller than Q4KS with similar performance, recommended. | | MiniMax-M2-Q3KXL.gguf | Q3KXL | 108.74GB | true | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | MiniMax-M2-Q3KL.gguf | Q3KL | 108.21GB | true | Lower quality but usable, good for low RAM availability. | | MiniMax-M2-Q3KM.gguf | Q3KM | 103.96GB | true | Low quality. | | MiniMax-M2-IQ3M.gguf | IQ3M | 103.95GB | true | Medium-low quality, new method with decent performance comparable to Q3KM. | | MiniMax-M2-Q3KS.gguf | Q3KS | 99.12GB | true | Low quality, not recommended. | | MiniMax-M2-IQ3XS.gguf | IQ3XS | 93.76GB | true | Lower quality, new method with decent performance, slightly better than Q3KS. | | MiniMax-M2-IQ3XXS.gguf | IQ3XXS | 90.10GB | true | Lower quality, new method with decent performance, comparable to Q3 quants. | | MiniMax-M2-Q2KL.gguf | Q2KL | 80.42GB | true | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | MiniMax-M2-Q2K.gguf | Q2K | 79.82GB | true | Very low quality but surprisingly usable. | | MiniMax-M2-IQ2M.gguf | IQ2M | 72.00GB | true | Relatively low quality, uses SOTA techniques to be surprisingly usable. | | MiniMax-M2-IQ2S.gguf | IQ2S | 63.35GB | true | Low quality, uses SOTA techniques to be usable. | | MiniMax-M2-IQ2XS.gguf | IQ2XS | 63.14GB | true | Low quality, uses SOTA techniques to be usable. | | MiniMax-M2-IQ2XXS.gguf | IQ2XXS | 54.73GB | true | Very low quality, uses SOTA techniques to be usable. | | MiniMax-M2-IQ1M.gguf | IQ1M | 49.02GB | false | Extremely low quality, not recommended. | | MiniMax-M2-IQ1S.gguf | IQ1S | 47.01GB | false | Extremely low quality, not recommended. | Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to. 
First, make sure you have huggingface-cli installed. If the model is bigger than 50GB, it will have been split into multiple files. In order to download them all to a local folder, run the huggingface-cli download command (a sketch of the install and download commands is included at the end of this section). You can either specify a new local-dir (MiniMaxAIMiniMax-M2-Q80) or download them all in place (./). Previously, you would download Q4_0_4_4/4_8/8_8, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights; details in this PR. If you use Q40 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282 you will not be able to run the Q4_0_X_X files and will instead need to use Q40. Additionally, if you want to get slightly better quality for ARM, you can use IQ4NL thanks to this PR, which will also repack the weights for ARM, though only the 4_4 variant for now. The loading time may be slower but it will result in an overall speed increase. I'm keeping this section to show the potential theoretical uplift in performance from using the Q40 with online repacking. Benchmarks on an AVX2 system (EPYC7702): | model | size | params | backend | threads | test | t/s | % (vs Q40) | | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% | Q4088 offers a nice bump to prompt processing and a small bump to text generation. A great write-up with charts showing various performances is provided by Artefact2 here. The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have. If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB smaller than that total. 
Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart: But basically, if you're aiming for below Q4, and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in format IQXX, like IQ3M. These are newer and offer better performance for their size. These I-quants can also be used on CPU, but will be slower than their K-quant equivalent, so speed vs performance is a tradeoff you'll have to decide. Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output. Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
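The shell commands referenced in the download instructions above were stripped from this card, so here is a minimal sketch of what they typically look like. The repo id bartowski/MiniMaxAI_MiniMax-M2-GGUF and the underscored file patterns (Q4_K_M rather than Q4KM) are assumptions inferred from this card's title and quant table; check the repository's file listing before downloading.

```bash
# Install the Hugging Face CLI (bundled with the huggingface_hub package)
pip install -U "huggingface_hub[cli]"

# Most MiniMax-M2 quants are split into several parts inside a per-quant folder,
# so use a glob to fetch every part of the quant you want (repo id and pattern assumed)
huggingface-cli download bartowski/MiniMaxAI_MiniMax-M2-GGUF \
  --include "*Q4_K_M*" --local-dir MiniMaxAI_MiniMax-M2-Q4_K_M

# llama.cpp loads the remaining shards automatically when pointed at the first one
./llama-cli -m MiniMaxAI_MiniMax-M2-Q4_K_M/*-00001-of-*.gguf -p "Hello"
```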
mistralai_Voxtral-Mini-3B-2507-GGUF
Llamacpp imatrix Quantizations of Voxtral-Mini-3B-2507 by mistralai Original model: https://huggingface.co/mistralai/Voxtral-Mini-3B-2507 All quants made using imatrix option with dataset from here Run them directly with llama.cpp, or any other llama.cpp based project | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | Voxtral-Mini-3B-2507-bf16.gguf | bf16 | 8.04GB | false | Full BF16 weights. | | Voxtral-Mini-3B-2507-Q80.gguf | Q80 | 4.27GB | false | Extremely high quality, generally unneeded but max available quant. | | Voxtral-Mini-3B-2507-Q6KL.gguf | Q6KL | 3.50GB | false | Uses Q80 for embed and output weights. Very high quality, near perfect, recommended. | | Voxtral-Mini-3B-2507-Q6K.gguf | Q6K | 3.30GB | false | Very high quality, near perfect, recommended. | | Voxtral-Mini-3B-2507-Q5KL.gguf | Q5KL | 3.12GB | false | Uses Q80 for embed and output weights. High quality, recommended. | | Voxtral-Mini-3B-2507-Q5KM.gguf | Q5KM | 2.87GB | false | High quality, recommended. | | Voxtral-Mini-3B-2507-Q5KS.gguf | Q5KS | 2.82GB | false | High quality, recommended. | | Voxtral-Mini-3B-2507-Q4KL.gguf | Q4KL | 2.77GB | false | Uses Q80 for embed and output weights. Good quality, recommended. | | Voxtral-Mini-3B-2507-Q41.gguf | Q41 | 2.60GB | false | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | Voxtral-Mini-3B-2507-Q3KXL.gguf | Q3KXL | 2.56GB | false | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | Voxtral-Mini-3B-2507-Q4KM.gguf | Q4KM | 2.47GB | false | Good quality, default size for most use cases, recommended. | | Voxtral-Mini-3B-2507-Q4KS.gguf | Q4KS | 2.38GB | false | Slightly lower quality with more space savings, recommended. | | Voxtral-Mini-3B-2507-Q40.gguf | Q40 | 2.38GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. | | Voxtral-Mini-3B-2507-IQ4NL.gguf | IQ4NL | 2.38GB | false | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | Voxtral-Mini-3B-2507-IQ4XS.gguf | IQ4XS | 2.27GB | false | Decent quality, smaller than Q4KS with similar performance, recommended. | | Voxtral-Mini-3B-2507-Q3KL.gguf | Q3KL | 2.21GB | false | Lower quality but usable, good for low RAM availability. | | Voxtral-Mini-3B-2507-Q3KM.gguf | Q3KM | 2.06GB | false | Low quality. | | Voxtral-Mini-3B-2507-Q2KL.gguf | Q2KL | 2.05GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | Voxtral-Mini-3B-2507-IQ3M.gguf | IQ3M | 1.96GB | false | Medium-low quality, new method with decent performance comparable to Q3KM. | | Voxtral-Mini-3B-2507-Q3KS.gguf | Q3KS | 1.89GB | false | Low quality, not recommended. | | Voxtral-Mini-3B-2507-IQ3XS.gguf | IQ3XS | 1.83GB | false | Lower quality, new method with decent performance, slightly better than Q3KS. | | Voxtral-Mini-3B-2507-IQ3XXS.gguf | IQ3XXS | 1.69GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. | | Voxtral-Mini-3B-2507-Q2K.gguf | Q2K | 1.66GB | false | Very low quality but surprisingly usable. | | Voxtral-Mini-3B-2507-IQ2M.gguf | IQ2M | 1.56GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. | Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to. 
First, make sure you have huggingface-cli installed. If the model is bigger than 50GB, it will have been split into multiple files. In order to download them all to a local folder, run the huggingface-cli download command (a sketch of the install and download commands is included at the end of this section). You can either specify a new local-dir (mistralaiVoxtral-Mini-3B-2507-Q80) or download them all in place (./). Previously, you would download Q4_0_4_4/4_8/8_8, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights; details in this PR. If you use Q40 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282 you will not be able to run the Q4_0_X_X files and will instead need to use Q40. Additionally, if you want to get slightly better quality for ARM, you can use IQ4NL thanks to this PR, which will also repack the weights for ARM, though only the 4_4 variant for now. The loading time may be slower but it will result in an overall speed increase. I'm keeping this section to show the potential theoretical uplift in performance from using the Q40 with online repacking. Benchmarks on an AVX2 system (EPYC7702): | model | size | params | backend | threads | test | t/s | % (vs Q40) | | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% | Q4088 offers a nice bump to prompt processing and a small bump to text generation. A great write-up with charts showing various performances is provided by Artefact2 here. The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have. If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB smaller than that total. 
Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart: But basically, if you're aiming for below Q4, and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in format IQXX, like IQ3M. These are newer and offer better performance for their size. These I-quants can also be used on CPU, but will be slower than their K-quant equivalent, so speed vs performance is a tradeoff you'll have to decide. Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output. Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
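The install and download commands in the section above were stripped during scraping; below is a minimal sketch, assuming the repo is bartowski/mistralai_Voxtral-Mini-3B-2507-GGUF and that the real file names use underscores (e.g. Q4_K_M). Verify against the repo's file list before running.

```bash
# Install the Hugging Face CLI
pip install -U "huggingface_hub[cli]"

# Download a single quant into the current directory (repo id and glob are assumptions)
huggingface-cli download bartowski/mistralai_Voxtral-Mini-3B-2507-GGUF \
  --include "*Q4_K_M.gguf" --local-dir ./

# Run it with llama.cpp; -ngl offloads layers to the GPU (flags are illustrative)
./llama-cli -m ./*Q4_K_M.gguf -ngl 99 -p "Hello"
```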
Phi-3.1-mini-4k-instruct-GGUF
Lexi-Llama-3-8B-Uncensored-GGUF
Qwen_Qwen3-Next-80B-A3B-Instruct-GGUF
Mistral-Small-22B-ArliAI-RPMax-v1.1-GGUF
OLMo-2-1124-13B-Instruct-GGUF
Goekdeniz-Guelmez_Josiefied-Qwen3-8B-abliterated-v1-GGUF
Mistral-Small-24B-Instruct-2501-GGUF
THUDM_GLM-Z1-9B-0414-GGUF
mistralai_Ministral-3-3B-Reasoning-2512-GGUF
TheDrummer_GLM-Steam-106B-A12B-v1-GGUF
TheDrummer_Cydonia-24B-v4-GGUF
DarkIdol-Llama-3.1-8B-Instruct-1.2-Uncensored-GGUF
Phi-3-medium-128k-instruct-GGUF
agentica-org_DeepScaleR-1.5B-Preview-GGUF
Mistral-7B-Instruct-v0.3-GGUF
Qwen2.5-Math-1.5B-Instruct-GGUF
Rocinante-12B-v1.1-GGUF
Yi-1.5-6B-Chat-GGUF
mistralai_Mistral-Small-3.1-24B-Instruct-2503-GGUF
Qwen2.5-Coder-3B-Instruct-GGUF
dolphin-2.9.1-llama-3-70b-GGUF
ibm-granite_granite-4.0-h-small-GGUF
Llamacpp imatrix Quantizations of granite-4.0-h-small by ibm-granite Original model: https://huggingface.co/ibm-granite/granite-4.0-h-small All quants made using imatrix option with dataset from here combined with a subset of combinedallsmall.parquet from Ed Addario here Run them directly with llama.cpp, or any other llama.cpp based project | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | granite-4.0-h-small-bf16.gguf | bf16 | 64.45GB | true | Full BF16 weights. | | granite-4.0-h-small-Q80.gguf | Q80 | 34.26GB | false | Extremely high quality, generally unneeded but max available quant. | | granite-4.0-h-small-Q6KL.gguf | Q6KL | 26.75GB | false | Uses Q80 for embed and output weights. Very high quality, near perfect, recommended. | | granite-4.0-h-small-Q6K.gguf | Q6K | 26.65GB | false | Very high quality, near perfect, recommended. | | granite-4.0-h-small-Q5KL.gguf | Q5KL | 23.24GB | false | Uses Q80 for embed and output weights. High quality, recommended. | | granite-4.0-h-small-Q5KM.gguf | Q5KM | 23.14GB | false | High quality, recommended. | | granite-4.0-h-small-Q5KS.gguf | Q5KS | 22.44GB | false | High quality, recommended. | | granite-4.0-h-small-Q41.gguf | Q41 | 20.46GB | false | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | granite-4.0-h-small-Q4KL.gguf | Q4KL | 19.85GB | false | Uses Q80 for embed and output weights. Good quality, recommended. | | granite-4.0-h-small-Q4KM.gguf | Q4KM | 19.75GB | false | Good quality, default size for most use cases, recommended. | | granite-4.0-h-small-Q4KS.gguf | Q4KS | 19.10GB | false | Slightly lower quality with more space savings, recommended. | | granite-4.0-h-small-Q40.gguf | Q40 | 18.80GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. | | granite-4.0-h-small-IQ4NL.gguf | IQ4NL | 18.53GB | false | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | granite-4.0-h-small-IQ4XS.gguf | IQ4XS | 17.56GB | false | Decent quality, smaller than Q4KS with similar performance, recommended. | | granite-4.0-h-small-Q3KXL.gguf | Q3KXL | 15.67GB | false | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | granite-4.0-h-small-Q3KL.gguf | Q3KL | 15.57GB | false | Lower quality but usable, good for low RAM availability. | | granite-4.0-h-small-Q3KM.gguf | Q3KM | 15.02GB | false | Low quality. | | granite-4.0-h-small-IQ3M.gguf | IQ3M | 15.02GB | false | Medium-low quality, new method with decent performance comparable to Q3KM. | | granite-4.0-h-small-Q3KS.gguf | Q3KS | 14.42GB | false | Low quality, not recommended. | | granite-4.0-h-small-IQ3XS.gguf | IQ3XS | 13.78GB | false | Lower quality, new method with decent performance, slightly better than Q3KS. | | granite-4.0-h-small-IQ3XXS.gguf | IQ3XXS | 13.12GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. | | granite-4.0-h-small-Q2KL.gguf | Q2KL | 11.84GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | granite-4.0-h-small-Q2K.gguf | Q2K | 11.74GB | false | Very low quality but surprisingly usable. | | granite-4.0-h-small-IQ2M.gguf | IQ2M | 10.47GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. | | granite-4.0-h-small-IQ2S.gguf | IQ2S | 9.30GB | false | Low quality, uses SOTA techniques to be usable. 
| | granite-4.0-h-small-IQ2XS.gguf | IQ2XS | 9.29GB | false | Low quality, uses SOTA techniques to be usable. | | granite-4.0-h-small-IQ2XXS.gguf | IQ2XXS | 8.15GB | false | Very low quality, uses SOTA techniques to be usable. | Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to. First, make sure you have huggingface-cli installed. If the model is bigger than 50GB, it will have been split into multiple files. In order to download them all to a local folder, run the huggingface-cli download command (a sketch of the install and download commands is included at the end of this section). You can either specify a new local-dir (ibm-granitegranite-4.0-h-small-Q80) or download them all in place (./). Previously, you would download Q4_0_4_4/4_8/8_8, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights; details in this PR. If you use Q40 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282 you will not be able to run the Q4_0_X_X files and will instead need to use Q40. Additionally, if you want to get slightly better quality for ARM, you can use IQ4NL thanks to this PR, which will also repack the weights for ARM, though only the 4_4 variant for now. The loading time may be slower but it will result in an overall speed increase. I'm keeping this section to show the potential theoretical uplift in performance from using the Q40 with online repacking. Benchmarks on an AVX2 system (EPYC7702): | model | size | params | backend | threads | test | t/s | % (vs Q40) | | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% | Q4088 offers a nice bump to prompt processing and a small bump to text generation. A great write-up with charts showing various performances is provided by Artefact2 here. The first thing to figure out is how big a model you can run. 
To do this, you'll need to figure out how much RAM and/or VRAM you have. If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB smaller than that total. Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart. But basically, if you're aiming for below Q4, and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in format IQXX, like IQ3M. These are newer and offer better performance for their size. These I-quants can also be used on CPU, but will be slower than their K-quant equivalent, so speed vs performance is a tradeoff you'll have to decide. Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output. Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
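As above, the actual commands were stripped from this card; a minimal sketch, assuming the repo is bartowski/ibm-granite_granite-4.0-h-small-GGUF and underscored file names such as Q4_K_M:

```bash
# Install the Hugging Face CLI
pip install -U "huggingface_hub[cli]"

# Download one quant in place (repo id and glob are assumptions)
huggingface-cli download bartowski/ibm-granite_granite-4.0-h-small-GGUF \
  --include "*Q4_K_M.gguf" --local-dir ./

# Run with llama.cpp (flags are illustrative)
./llama-cli -m ./*Q4_K_M.gguf -ngl 99 -p "Hello"
```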
deepseek-r1-qwen-2.5-32B-ablated-GGUF
microsoft_Phi-4-mini-instruct-GGUF
TheDrummer_Valkyrie-49B-v2-GGUF
Qwen_Qwen3-VL-32B-Instruct-GGUF
Llamacpp imatrix Quantizations of Qwen3-VL-32B-Instruct by Qwen Original model: https://huggingface.co/Qwen/Qwen3-VL-32B-Instruct All quants made using imatrix option with dataset from here combined with a subset of combinedallsmall.parquet from Ed Addario here Run them directly with llama.cpp, or any other llama.cpp based project | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | Qwen3-VL-32B-Instruct-bf16.gguf | bf16 | 65.53GB | true | Full BF16 weights. | | Qwen3-VL-32B-Instruct-Q80.gguf | Q80 | 34.82GB | false | Extremely high quality, generally unneeded but max available quant. | | Qwen3-VL-32B-Instruct-Q6KL.gguf | Q6KL | 27.26GB | false | Uses Q80 for embed and output weights. Very high quality, near perfect, recommended. | | Qwen3-VL-32B-Instruct-Q6K.gguf | Q6K | 26.88GB | false | Very high quality, near perfect, recommended. | | Qwen3-VL-32B-Instruct-Q5KL.gguf | Q5KL | 23.69GB | false | Uses Q80 for embed and output weights. High quality, recommended. | | Qwen3-VL-32B-Instruct-Q5KM.gguf | Q5KM | 23.21GB | false | High quality, recommended. | | Qwen3-VL-32B-Instruct-Q5KS.gguf | Q5KS | 22.64GB | false | High quality, recommended. | | Qwen3-VL-32B-Instruct-Q41.gguf | Q41 | 20.64GB | false | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | Qwen3-VL-32B-Instruct-Q4KL.gguf | Q4KL | 20.34GB | false | Uses Q80 for embed and output weights. Good quality, recommended. | | Qwen3-VL-32B-Instruct-Q4KM.gguf | Q4KM | 19.76GB | false | Good quality, default size for most use cases, recommended. | | Qwen3-VL-32B-Instruct-Q4KS.gguf | Q4KS | 18.77GB | false | Slightly lower quality with more space savings, recommended. | | Qwen3-VL-32B-Instruct-Q40.gguf | Q40 | 18.70GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. | | Qwen3-VL-32B-Instruct-IQ4NL.gguf | IQ4NL | 18.68GB | false | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | Qwen3-VL-32B-Instruct-Q3KXL.gguf | Q3KXL | 18.01GB | false | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | Qwen3-VL-32B-Instruct-IQ4XS.gguf | IQ4XS | 17.69GB | false | Decent quality, smaller than Q4KS with similar performance, recommended. | | Qwen3-VL-32B-Instruct-Q3KL.gguf | Q3KL | 17.33GB | false | Lower quality but usable, good for low RAM availability. | | Qwen3-VL-32B-Instruct-Q3KM.gguf | Q3KM | 15.97GB | false | Low quality. | | Qwen3-VL-32B-Instruct-IQ3M.gguf | IQ3M | 14.93GB | false | Medium-low quality, new method with decent performance comparable to Q3KM. | | Qwen3-VL-32B-Instruct-Q3KS.gguf | Q3KS | 14.39GB | false | Low quality, not recommended. | | Qwen3-VL-32B-Instruct-IQ3XS.gguf | IQ3XS | 13.70GB | false | Lower quality, new method with decent performance, slightly better than Q3KS. | | Qwen3-VL-32B-Instruct-Q2KL.gguf | Q2KL | 13.10GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | Qwen3-VL-32B-Instruct-IQ3XXS.gguf | IQ3XXS | 12.82GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. | | Qwen3-VL-32B-Instruct-Q2K.gguf | Q2K | 12.34GB | false | Very low quality but surprisingly usable. | | Qwen3-VL-32B-Instruct-IQ2M.gguf | IQ2M | 11.36GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. | | Qwen3-VL-32B-Instruct-IQ2S.gguf | IQ2S | 10.51GB | false | Low quality, uses SOTA techniques to be usable. 
| | Qwen3-VL-32B-Instruct-IQ2XS.gguf | IQ2XS | 9.95GB | false | Low quality, uses SOTA techniques to be usable. | | Qwen3-VL-32B-Instruct-IQ2XXS.gguf | IQ2XXS | 9.02GB | false | Very low quality, uses SOTA techniques to be usable. | Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to. First, make sure you have huggingface-cli installed. If the model is bigger than 50GB, it will have been split into multiple files. In order to download them all to a local folder, run the huggingface-cli download command (a sketch of the install and download commands is included at the end of this section). You can either specify a new local-dir (QwenQwen3-VL-32B-Instruct-Q80) or download them all in place (./). Previously, you would download Q4_0_4_4/4_8/8_8, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights; details in this PR. If you use Q40 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282 you will not be able to run the Q4_0_X_X files and will instead need to use Q40. Additionally, if you want to get slightly better quality for ARM, you can use IQ4NL thanks to this PR, which will also repack the weights for ARM, though only the 4_4 variant for now. The loading time may be slower but it will result in an overall speed increase. I'm keeping this section to show the potential theoretical uplift in performance from using the Q40 with online repacking. Benchmarks on an AVX2 system (EPYC7702): | model | size | params | backend | threads | test | t/s | % (vs Q40) | | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% | Q4088 offers a nice bump to prompt processing and a small bump to text generation. A great write-up with charts showing various performances is provided by Artefact2 here. The first thing to figure out is how big a model you can run. 
To do this, you'll need to figure out how much RAM and/or VRAM you have. If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB smaller than that total. Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart. But basically, if you're aiming for below Q4, and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in format IQXX, like IQ3M. These are newer and offer better performance for their size. These I-quants can also be used on CPU, but will be slower than their K-quant equivalent, so speed vs performance is a tradeoff you'll have to decide. Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output. Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
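The download commands for this card were likewise stripped; a minimal sketch, assuming the repo is bartowski/Qwen_Qwen3-VL-32B-Instruct-GGUF and underscored file names (e.g. Q4_K_M):

```bash
# Install the Hugging Face CLI
pip install -U "huggingface_hub[cli]"

# Download one quant in place (repo id and glob are assumptions)
huggingface-cli download bartowski/Qwen_Qwen3-VL-32B-Instruct-GGUF \
  --include "*Q4_K_M.gguf" --local-dir ./

# Text-only run with llama.cpp (flags are illustrative); image input additionally
# needs the repo's mmproj file and llama.cpp's multimodal CLI
./llama-cli -m ./*Q4_K_M.gguf -ngl 99 -p "Hello"
```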
EuroLLM-9B-Instruct-GGUF
Tesslate_Tessa-T1-3B-GGUF
mistralai_Ministral-3-8B-Reasoning-2512-GGUF
TheDrummer_Big-Tiger-Gemma-27B-v3-GGUF
TheBeagle-v2beta-32B-MGS-GGUF
Llama-3.1-Nemotron-70B-Instruct-HF-GGUF
google_gemma-3-1b-it-GGUF
Qwen2.5-1.5B-Instruct-GGUF
Phi-3.5-mini-instruct_Uncensored-GGUF
Qwen2.5-0.5B-Instruct-GGUF
Kwaipilot_KAT-Dev-72B-Exp-GGUF
Llamacpp imatrix Quantizations of KAT-Dev-72B-Exp by Kwaipilot Original model: https://huggingface.co/Kwaipilot/KAT-Dev-72B-Exp All quants made using imatrix option with dataset from here combined ...
mlabonne_Qwen3-8B-abliterated-GGUF
HuggingFaceTB_SmolLM3-3B-GGUF
Qwen_Qwen3-VL-30B-A3B-Instruct-GGUF
Llamacpp imatrix Quantizations of Qwen3-VL-30B-A3B-Instruct by Qwen Original model: https://huggingface.co/Qwen/Qwen3-VL-30B-A3B-Instruct All quants made using imatrix option with dataset from here combined with a subset of combinedallsmall.parquet from Ed Addario here Run them directly with llama.cpp, or any other llama.cpp based project | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | Qwen3-VL-30B-A3B-Instruct-bf16.gguf | bf16 | 61.10GB | true | Full BF16 weights. | | Qwen3-VL-30B-A3B-Instruct-Q80.gguf | Q80 | 32.48GB | false | Extremely high quality, generally unneeded but max available quant. | | Qwen3-VL-30B-A3B-Instruct-Q6KL.gguf | Q6KL | 25.26GB | false | Uses Q80 for embed and output weights. Very high quality, near perfect, recommended. | | Qwen3-VL-30B-A3B-Instruct-Q6K.gguf | Q6K | 25.10GB | false | Very high quality, near perfect, recommended. | | Qwen3-VL-30B-A3B-Instruct-Q5KL.gguf | Q5KL | 21.94GB | false | Uses Q80 for embed and output weights. High quality, recommended. | | Qwen3-VL-30B-A3B-Instruct-Q5KM.gguf | Q5KM | 21.74GB | false | High quality, recommended. | | Qwen3-VL-30B-A3B-Instruct-Q5KS.gguf | Q5KS | 21.10GB | false | High quality, recommended. | | Qwen3-VL-30B-A3B-Instruct-Q41.gguf | Q41 | 19.21GB | false | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | Qwen3-VL-30B-A3B-Instruct-Q4KL.gguf | Q4KL | 18.86GB | false | Uses Q80 for embed and output weights. Good quality, recommended. | | Qwen3-VL-30B-A3B-Instruct-Q4KM.gguf | Q4KM | 18.63GB | false | Good quality, default size for most use cases, recommended. | | Qwen3-VL-30B-A3B-Instruct-Q4KS.gguf | Q4KS | 17.98GB | false | Slightly lower quality with more space savings, recommended. | | Qwen3-VL-30B-A3B-Instruct-Q40.gguf | Q40 | 17.63GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. | | Qwen3-VL-30B-A3B-Instruct-IQ4NL.gguf | IQ4NL | 17.39GB | false | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | Qwen3-VL-30B-A3B-Instruct-IQ4XS.gguf | IQ4XS | 16.46GB | false | Decent quality, smaller than Q4KS with similar performance, recommended. | | Qwen3-VL-30B-A3B-Instruct-Q3KXL.gguf | Q3KXL | 14.86GB | false | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | Qwen3-VL-30B-A3B-Instruct-Q3KL.gguf | Q3KL | 14.58GB | false | Lower quality but usable, good for low RAM availability. | | Qwen3-VL-30B-A3B-Instruct-Q3KM.gguf | Q3KM | 14.08GB | false | Low quality. | | Qwen3-VL-30B-A3B-Instruct-IQ3M.gguf | IQ3M | 14.08GB | false | Medium-low quality, new method with decent performance comparable to Q3KM. | | Qwen3-VL-30B-A3B-Instruct-Q3KS.gguf | Q3KS | 13.43GB | false | Low quality, not recommended. | | Qwen3-VL-30B-A3B-Instruct-IQ3XS.gguf | IQ3XS | 12.74GB | false | Lower quality, new method with decent performance, slightly better than Q3KS. | | Qwen3-VL-30B-A3B-Instruct-IQ3XXS.gguf | IQ3XXS | 12.22GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. | | Qwen3-VL-30B-A3B-Instruct-Q2KL.gguf | Q2KL | 11.21GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | Qwen3-VL-30B-A3B-Instruct-Q2K.gguf | Q2K | 10.91GB | false | Very low quality but surprisingly usable. | | Qwen3-VL-30B-A3B-Instruct-IQ2M.gguf | IQ2M | 9.87GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. 
| | Qwen3-VL-30B-A3B-Instruct-IQ2S.gguf | IQ2S | 8.74GB | false | Low quality, uses SOTA techniques to be usable. | | Qwen3-VL-30B-A3B-Instruct-IQ2XS.gguf | IQ2XS | 8.66GB | false | Low quality, uses SOTA techniques to be usable. | | Qwen3-VL-30B-A3B-Instruct-IQ2XXS.gguf | IQ2XXS | 7.57GB | false | Very low quality, uses SOTA techniques to be usable. | Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to. First, make sure you have huggingface-cli installed. If the model is bigger than 50GB, it will have been split into multiple files. In order to download them all to a local folder, run the huggingface-cli download command (a sketch of the install and download commands is included at the end of this section). You can either specify a new local-dir (QwenQwen3-VL-30B-A3B-Instruct-Q80) or download them all in place (./). Previously, you would download Q4_0_4_4/4_8/8_8, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights; details in this PR. If you use Q40 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282 you will not be able to run the Q4_0_X_X files and will instead need to use Q40. Additionally, if you want to get slightly better quality for ARM, you can use IQ4NL thanks to this PR, which will also repack the weights for ARM, though only the 4_4 variant for now. The loading time may be slower but it will result in an overall speed increase. I'm keeping this section to show the potential theoretical uplift in performance from using the Q40 with online repacking. Benchmarks on an AVX2 system (EPYC7702): | model | size | params | backend | threads | test | t/s | % (vs Q40) | | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% | Q4088 offers a nice bump to prompt processing and a small bump to text generation. A great write-up with charts showing various performances is provided by Artefact2 here. The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have. If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB smaller than that total. Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart. But basically, if you're aiming for below Q4, and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in format IQXX, like IQ3M. These are newer and offer better performance for their size. These I-quants can also be used on CPU, but will be slower than their K-quant equivalent, so speed vs performance is a tradeoff you'll have to decide. Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output. Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
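The stripped install and download commands for this card would look roughly like the sketch below, assuming the repo is bartowski/Qwen_Qwen3-VL-30B-A3B-Instruct-GGUF and underscored file names (e.g. Q4_K_M):

```bash
# Install the Hugging Face CLI
pip install -U "huggingface_hub[cli]"

# Download one quant in place (repo id and glob are assumptions)
huggingface-cli download bartowski/Qwen_Qwen3-VL-30B-A3B-Instruct-GGUF \
  --include "*Q4_K_M.gguf" --local-dir ./

# Text-only run with llama.cpp (flags are illustrative); image input additionally
# needs the repo's mmproj file and llama.cpp's multimodal CLI
./llama-cli -m ./*Q4_K_M.gguf -ngl 99 -p "Hello"
```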
RekaAI_reka-flash-3-GGUF
Qwen2.5.1-Coder-7B-Instruct-GGUF
deepcogito_cogito-v1-preview-qwen-32B-GGUF
mistralai_Ministral-3-8B-Instruct-2512-GGUF
Ilya626_Cydonia_Vistral-GGUF
Llamacpp imatrix Quantizations of CydoniaVistral by Ilya626 Original model: https://huggingface.co/Ilya626/CydoniaVistral All quants made using imatrix option with dataset from here combined with a subset of combinedallsmall.parquet from Ed Addario here Run them directly with llama.cpp, or any other llama.cpp based project | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | CydoniaVistral-bf16.gguf | bf16 | 47.15GB | false | Full BF16 weights. | | CydoniaVistral-Q80.gguf | Q80 | 25.05GB | false | Extremely high quality, generally unneeded but max available quant. | | CydoniaVistral-Q6KL.gguf | Q6KL | 19.67GB | false | Uses Q80 for embed and output weights. Very high quality, near perfect, recommended. | | CydoniaVistral-Q6K.gguf | Q6K | 19.35GB | false | Very high quality, near perfect, recommended. | | CydoniaVistral-Q5KL.gguf | Q5KL | 17.18GB | false | Uses Q80 for embed and output weights. High quality, recommended. | | CydoniaVistral-Q5KM.gguf | Q5KM | 16.76GB | false | High quality, recommended. | | CydoniaVistral-Q5KS.gguf | Q5KS | 16.30GB | false | High quality, recommended. | | CydoniaVistral-Q41.gguf | Q41 | 14.87GB | false | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | CydoniaVistral-Q4KL.gguf | Q4KL | 14.83GB | false | Uses Q80 for embed and output weights. Good quality, recommended. | | CydoniaVistral-Q4KM.gguf | Q4KM | 14.33GB | false | Good quality, default size for most use cases, recommended. | | CydoniaVistral-Q4KS.gguf | Q4KS | 13.55GB | false | Slightly lower quality with more space savings, recommended. | | CydoniaVistral-Q40.gguf | Q40 | 13.49GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. | | CydoniaVistral-IQ4NL.gguf | IQ4NL | 13.47GB | false | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | CydoniaVistral-Q3KXL.gguf | Q3KXL | 12.99GB | false | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | CydoniaVistral-IQ4XS.gguf | IQ4XS | 12.76GB | false | Decent quality, smaller than Q4KS with similar performance, recommended. | | CydoniaVistral-Q3KL.gguf | Q3KL | 12.40GB | false | Lower quality but usable, good for low RAM availability. | | CydoniaVistral-Q3KM.gguf | Q3KM | 11.47GB | false | Low quality. | | CydoniaVistral-IQ3M.gguf | IQ3M | 10.65GB | false | Medium-low quality, new method with decent performance comparable to Q3KM. | | CydoniaVistral-Q3KS.gguf | Q3KS | 10.40GB | false | Low quality, not recommended. | | CydoniaVistral-IQ3XS.gguf | IQ3XS | 9.91GB | false | Lower quality, new method with decent performance, slightly better than Q3KS. | | CydoniaVistral-Q2KL.gguf | Q2KL | 9.55GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | CydoniaVistral-IQ3XXS.gguf | IQ3XXS | 9.28GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. | | CydoniaVistral-Q2K.gguf | Q2K | 8.89GB | false | Very low quality but surprisingly usable. | | CydoniaVistral-IQ2M.gguf | IQ2M | 8.11GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. | | CydoniaVistral-IQ2S.gguf | IQ2S | 7.48GB | false | Low quality, uses SOTA techniques to be usable. | | CydoniaVistral-IQ2XS.gguf | IQ2XS | 7.21GB | false | Low quality, uses SOTA techniques to be usable. 
| Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to. First, make sure you have huggingface-cli installed. If the model is bigger than 50GB, it will have been split into multiple files. In order to download them all to a local folder, run the huggingface-cli download command (a sketch of the install and download commands is included at the end of this section). You can either specify a new local-dir (Ilya626CydoniaVistral-Q80) or download them all in place (./). Previously, you would download Q4_0_4_4/4_8/8_8, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights; details in this PR. If you use Q40 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282 you will not be able to run the Q4_0_X_X files and will instead need to use Q40. Additionally, if you want to get slightly better quality for ARM, you can use IQ4NL thanks to this PR, which will also repack the weights for ARM, though only the 4_4 variant for now. The loading time may be slower but it will result in an overall speed increase. I'm keeping this section to show the potential theoretical uplift in performance from using the Q40 with online repacking. Benchmarks on an AVX2 system (EPYC7702): | model | size | params | backend | threads | test | t/s | % (vs Q40) | | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% | Q4088 offers a nice bump to prompt processing and a small bump to text generation. A great write-up with charts showing various performances is provided by Artefact2 here. The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have. If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. 
If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB smaller than that total. Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart. But basically, if you're aiming for below Q4, and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in format IQXX, like IQ3M. These are newer and offer better performance for their size. These I-quants can also be used on CPU, but will be slower than their K-quant equivalent, so speed vs performance is a tradeoff you'll have to decide. Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output. Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
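A minimal sketch of the stripped download commands for this card; the repo id bartowski/Ilya626_Cydonia_Vistral-GGUF and the underscored file pattern are assumptions based on the card title, so check the repository before downloading:

```bash
# Install the Hugging Face CLI
pip install -U "huggingface_hub[cli]"

# Download one quant in place (repo id and glob are assumptions)
huggingface-cli download bartowski/Ilya626_Cydonia_Vistral-GGUF \
  --include "*Q4_K_M.gguf" --local-dir ./

# Run with llama.cpp (flags are illustrative)
./llama-cli -m ./*Q4_K_M.gguf -ngl 99 -p "Hello"
```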
phi-4-GGUF
TheDrummer_Snowpiercer-15B-v3-GGUF
Qwen_Qwen3-32B-GGUF
PokeeAI_pokee_research_7b-GGUF
Llamacpp imatrix Quantizations of pokeeresearch7b by PokeeAI Original model: https://huggingface.co/PokeeAI/pokeeresearch7b All quants made using imatrix option with dataset from here combined with a subset of combinedallsmall.parquet from Ed Addario here Run them directly with llama.cpp, or any other llama.cpp based project No chat template specified so default is used. This may be incorrect, check original model card for details. | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | pokeeresearch7b-bf16.gguf | bf16 | 15.24GB | false | Full BF16 weights. | | pokeeresearch7b-Q80.gguf | Q80 | 8.10GB | false | Extremely high quality, generally unneeded but max available quant. | | pokeeresearch7b-Q6KL.gguf | Q6KL | 6.52GB | false | Uses Q80 for embed and output weights. Very high quality, near perfect, recommended. | | pokeeresearch7b-Q6K.gguf | Q6K | 6.25GB | false | Very high quality, near perfect, recommended. | | pokeeresearch7b-Q5KL.gguf | Q5KL | 5.78GB | false | Uses Q80 for embed and output weights. High quality, recommended. | | pokeeresearch7b-Q5KM.gguf | Q5KM | 5.44GB | false | High quality, recommended. | | pokeeresearch7b-Q5KS.gguf | Q5KS | 5.32GB | false | High quality, recommended. | | pokeeresearch7b-Q4KL.gguf | Q4KL | 5.09GB | false | Uses Q80 for embed and output weights. Good quality, recommended. | | pokeeresearch7b-Q41.gguf | Q41 | 4.87GB | false | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | pokeeresearch7b-Q4KM.gguf | Q4KM | 4.68GB | false | Good quality, default size for most use cases, recommended. | | pokeeresearch7b-Q3KXL.gguf | Q3KXL | 4.57GB | false | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | pokeeresearch7b-Q4KS.gguf | Q4KS | 4.46GB | false | Slightly lower quality with more space savings, recommended. | | pokeeresearch7b-Q40.gguf | Q40 | 4.44GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. | | pokeeresearch7b-IQ4NL.gguf | IQ4NL | 4.44GB | false | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | pokeeresearch7b-IQ4XS.gguf | IQ4XS | 4.22GB | false | Decent quality, smaller than Q4KS with similar performance, recommended. | | pokeeresearch7b-Q3KL.gguf | Q3KL | 4.09GB | false | Lower quality but usable, good for low RAM availability. | | pokeeresearch7b-Q3KM.gguf | Q3KM | 3.81GB | false | Low quality. | | pokeeresearch7b-IQ3M.gguf | IQ3M | 3.57GB | false | Medium-low quality, new method with decent performance comparable to Q3KM. | | pokeeresearch7b-Q2KL.gguf | Q2KL | 3.55GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | pokeeresearch7b-Q3KS.gguf | Q3KS | 3.49GB | false | Low quality, not recommended. | | pokeeresearch7b-IQ3XS.gguf | IQ3XS | 3.35GB | false | Lower quality, new method with decent performance, slightly better than Q3KS. | | pokeeresearch7b-IQ3XXS.gguf | IQ3XXS | 3.11GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. | | pokeeresearch7b-Q2K.gguf | Q2K | 3.02GB | false | Very low quality but surprisingly usable. | | pokeeresearch7b-IQ2M.gguf | IQ2M | 2.78GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. | Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to. 
First, make sure you have huggingface-cli installed. If the model is bigger than 50GB, it will have been split into multiple files. In order to download them all to a local folder, run the huggingface-cli download command (a sketch of the install and download commands is included at the end of this section). You can either specify a new local-dir (PokeeAIpokeeresearch7b-Q80) or download them all in place (./). Previously, you would download Q4_0_4_4/4_8/8_8, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights; details in this PR. If you use Q40 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282 you will not be able to run the Q4_0_X_X files and will instead need to use Q40. Additionally, if you want to get slightly better quality for ARM, you can use IQ4NL thanks to this PR, which will also repack the weights for ARM, though only the 4_4 variant for now. The loading time may be slower but it will result in an overall speed increase. I'm keeping this section to show the potential theoretical uplift in performance from using the Q40 with online repacking. Benchmarks on an AVX2 system (EPYC7702): | model | size | params | backend | threads | test | t/s | % (vs Q40) | | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% | Q4088 offers a nice bump to prompt processing and a small bump to text generation. A great write-up with charts showing various performances is provided by Artefact2 here. The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have. If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB smaller than that total. 
Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart: But basically, if you're aiming for below Q4, and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in format IQXX, like IQ3M. These are newer and offer better performance for their size. These I-quants can also be used on CPU, but will be slower than their K-quant equivalent, so speed vs performance is a tradeoff you'll have to decide. Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output. Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
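The stripped commands from the download section above would look roughly like this sketch, assuming the repo is bartowski/PokeeAI_pokee_research_7b-GGUF and underscored file names (e.g. Q4_K_M):

```bash
# Install the Hugging Face CLI
pip install -U "huggingface_hub[cli]"

# Download one quant in place (repo id and glob are assumptions)
huggingface-cli download bartowski/PokeeAI_pokee_research_7b-GGUF \
  --include "*Q4_K_M.gguf" --local-dir ./

# Run with llama.cpp (flags are illustrative)
./llama-cli -m ./*Q4_K_M.gguf -ngl 99 -p "Hello"
```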
TheDrummer_Behemoth-X-123B-v2-GGUF
OpenGVLab_InternVL3_5-30B-A3B-GGUF
MN-12B-Lyra-v4-GGUF
Qwen_Qwen3-VL-4B-Instruct-GGUF
Llamacpp imatrix Quantizations of Qwen3-VL-4B-Instruct by Qwen Original model: https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct All quants made using imatrix option with dataset from here combined with a subset of combinedallsmall.parquet from Ed Addario here Run them directly with llama.cpp, or any other llama.cpp based project | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | Qwen3-VL-4B-Instruct-bf16.gguf | bf16 | 8.05GB | false | Full BF16 weights. | | Qwen3-VL-4B-Instruct-Q80.gguf | Q80 | 4.28GB | false | Extremely high quality, generally unneeded but max available quant. | | Qwen3-VL-4B-Instruct-Q6KL.gguf | Q6KL | 3.40GB | false | Uses Q80 for embed and output weights. Very high quality, near perfect, recommended. | | Qwen3-VL-4B-Instruct-Q6K.gguf | Q6K | 3.31GB | false | Very high quality, near perfect, recommended. | | Qwen3-VL-4B-Instruct-Q5KL.gguf | Q5KL | 2.98GB | false | Uses Q80 for embed and output weights. High quality, recommended. | | Qwen3-VL-4B-Instruct-Q5KM.gguf | Q5KM | 2.89GB | false | High quality, recommended. | | Qwen3-VL-4B-Instruct-Q5KS.gguf | Q5KS | 2.82GB | false | High quality, recommended. | | Qwen3-VL-4B-Instruct-Q41.gguf | Q41 | 2.60GB | false | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | Qwen3-VL-4B-Instruct-Q4KL.gguf | Q4KL | 2.59GB | false | Uses Q80 for embed and output weights. Good quality, recommended. | | Qwen3-VL-4B-Instruct-Q4KM.gguf | Q4KM | 2.50GB | false | Good quality, default size for most use cases, recommended. | | Qwen3-VL-4B-Instruct-Q4KS.gguf | Q4KS | 2.38GB | false | Slightly lower quality with more space savings, recommended. | | Qwen3-VL-4B-Instruct-Q40.gguf | Q40 | 2.38GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. | | Qwen3-VL-4B-Instruct-IQ4NL.gguf | IQ4NL | 2.38GB | false | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | Qwen3-VL-4B-Instruct-Q3KXL.gguf | Q3KXL | 2.33GB | false | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | Qwen3-VL-4B-Instruct-IQ4XS.gguf | IQ4XS | 2.27GB | false | Decent quality, smaller than Q4KS with similar performance, recommended. | | Qwen3-VL-4B-Instruct-Q3KL.gguf | Q3KL | 2.24GB | false | Lower quality but usable, good for low RAM availability. | | Qwen3-VL-4B-Instruct-Q3KM.gguf | Q3KM | 2.08GB | false | Low quality. | | Qwen3-VL-4B-Instruct-IQ3M.gguf | IQ3M | 1.96GB | false | Medium-low quality, new method with decent performance comparable to Q3KM. | | Qwen3-VL-4B-Instruct-Q3KS.gguf | Q3KS | 1.89GB | false | Low quality, not recommended. | | Qwen3-VL-4B-Instruct-IQ3XS.gguf | IQ3XS | 1.81GB | false | Lower quality, new method with decent performance, slightly better than Q3KS. | | Qwen3-VL-4B-Instruct-Q2KL.gguf | Q2KL | 1.76GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | Qwen3-VL-4B-Instruct-IQ3XXS.gguf | IQ3XXS | 1.67GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. | | Qwen3-VL-4B-Instruct-Q2K.gguf | Q2K | 1.67GB | false | Very low quality but surprisingly usable. | | Qwen3-VL-4B-Instruct-IQ2M.gguf | IQ2M | 1.51GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. 
| Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to. First, make sure you have hugginface-cli installed: If the model is bigger than 50GB, it will have been split into multiple files. In order to download them all to a local folder, run: You can either specify a new local-dir (QwenQwen3-VL-4B-Instruct-Q80) or download them all in place (./) Previously, you would download Q4044/48/88, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights. details in this PR. If you use Q40 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282 you will not be able to run the Q40XX files and will instead need to use Q40. Additionally, if you want to get slightly better quality for , you can use IQ4NL thanks to this PR which will also repack the weights for ARM, though only the 44 for now. The loading time may be slower but it will result in an overall speed incrase. I'm keeping this section to show the potential theoretical uplift in performance from using the Q40 with online repacking. Click to view benchmarks on an AVX2 system (EPYC7702) | model | size | params | backend | threads | test | t/s | % (vs Q40) | | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% | Q4088 offers a nice bump to prompt processing and a small bump to text generation A great write up with charts showing various performances is provided by Artefact2 here The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have. If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. 
If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB smaller than that total. Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart: But basically, if you're aiming for below Q4, and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in format IQXX, like IQ3M. These are newer and offer better performance for their size. These I-quants can also be used on CPU, but will be slower than their K-quant equivalent, so speed vs performance is a tradeoff you'll have to decide. Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output. Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
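As a concrete sketch of the download steps described above (the `bartowski/Qwen_Qwen3-VL-4B-Instruct-GGUF` repo path and the quant glob are assumptions here; match them against the file list on the model page):

```
# Install or update the Hugging Face CLI
pip install -U "huggingface_hub[cli]"

# Pull a single quant into the current directory; the glob avoids spelling out
# the exact filename, which may differ slightly from the table above
huggingface-cli download bartowski/Qwen_Qwen3-VL-4B-Instruct-GGUF \
  --include "*Q4_K_M*" --local-dir ./
```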
mistralai_Devstral-Small-2507-GGUF
Llama-3.3-70B-Instruct-abliterated-GGUF
Dolphin3.0-Llama3.1-8B-GGUF
aya-expanse-8b-GGUF
internlm_JanusCoderV-7B-GGUF
Llamacpp imatrix Quantizations of JanusCoderV-7B by internlm Original model: https://huggingface.co/internlm/JanusCoderV-7B All quants made using imatrix option with dataset from here combined with a subset of combinedallsmall.parquet from Ed Addario here Run them directly with llama.cpp, or any other llama.cpp based project No chat template specified so default is used. This may be incorrect, check original model card for details. | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | JanusCoderV-7B-bf16.gguf | bf16 | 15.24GB | false | Full BF16 weights. | | JanusCoderV-7B-Q80.gguf | Q80 | 8.10GB | false | Extremely high quality, generally unneeded but max available quant. | | JanusCoderV-7B-Q6KL.gguf | Q6KL | 6.52GB | false | Uses Q80 for embed and output weights. Very high quality, near perfect, recommended. | | JanusCoderV-7B-Q6K.gguf | Q6K | 6.25GB | false | Very high quality, near perfect, recommended. | | JanusCoderV-7B-Q5KL.gguf | Q5KL | 5.78GB | false | Uses Q80 for embed and output weights. High quality, recommended. | | JanusCoderV-7B-Q5KM.gguf | Q5KM | 5.44GB | false | High quality, recommended. | | JanusCoderV-7B-Q5KS.gguf | Q5KS | 5.32GB | false | High quality, recommended. | | JanusCoderV-7B-Q4KL.gguf | Q4KL | 5.09GB | false | Uses Q80 for embed and output weights. Good quality, recommended. | | JanusCoderV-7B-Q41.gguf | Q41 | 4.87GB | false | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | JanusCoderV-7B-Q4KM.gguf | Q4KM | 4.68GB | false | Good quality, default size for most use cases, recommended. | | JanusCoderV-7B-Q3KXL.gguf | Q3KXL | 4.57GB | false | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | JanusCoderV-7B-Q4KS.gguf | Q4KS | 4.46GB | false | Slightly lower quality with more space savings, recommended. | | JanusCoderV-7B-Q40.gguf | Q40 | 4.44GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. | | JanusCoderV-7B-IQ4NL.gguf | IQ4NL | 4.44GB | false | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | JanusCoderV-7B-IQ4XS.gguf | IQ4XS | 4.22GB | false | Decent quality, smaller than Q4KS with similar performance, recommended. | | JanusCoderV-7B-Q3KL.gguf | Q3KL | 4.09GB | false | Lower quality but usable, good for low RAM availability. | | JanusCoderV-7B-Q3KM.gguf | Q3KM | 3.81GB | false | Low quality. | | JanusCoderV-7B-IQ3M.gguf | IQ3M | 3.57GB | false | Medium-low quality, new method with decent performance comparable to Q3KM. | | JanusCoderV-7B-Q2KL.gguf | Q2KL | 3.55GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | JanusCoderV-7B-Q3KS.gguf | Q3KS | 3.49GB | false | Low quality, not recommended. | | JanusCoderV-7B-IQ3XS.gguf | IQ3XS | 3.35GB | false | Lower quality, new method with decent performance, slightly better than Q3KS. | | JanusCoderV-7B-IQ3XXS.gguf | IQ3XXS | 3.11GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. | | JanusCoderV-7B-Q2K.gguf | Q2K | 3.02GB | false | Very low quality but surprisingly usable. | | JanusCoderV-7B-IQ2M.gguf | IQ2M | 2.78GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. | Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to. 
First, make sure you have hugginface-cli installed: If the model is bigger than 50GB, it will have been split into multiple files. In order to download them all to a local folder, run: You can either specify a new local-dir (internlmJanusCoderV-7B-Q80) or download them all in place (./) Previously, you would download Q4044/48/88, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights. details in this PR. If you use Q40 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282 you will not be able to run the Q40XX files and will instead need to use Q40. Additionally, if you want to get slightly better quality for , you can use IQ4NL thanks to this PR which will also repack the weights for ARM, though only the 44 for now. The loading time may be slower but it will result in an overall speed incrase. I'm keeping this section to show the potential theoretical uplift in performance from using the Q40 with online repacking. Click to view benchmarks on an AVX2 system (EPYC7702) | model | size | params | backend | threads | test | t/s | % (vs Q40) | | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% | Q4088 offers a nice bump to prompt processing and a small bump to text generation A great write up with charts showing various performances is provided by Artefact2 here The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have. If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB Smaller than that total. 
Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart: But basically, if you're aiming for below Q4, and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in format IQXX, like IQ3M. These are newer and offer better performance for their size. These I-quants can also be used on CPU, but will be slower than their K-quant equivalent, so speed vs performance is a tradeoff you'll have to decide. Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output. Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
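A minimal smoke test after downloading, assuming you have built llama.cpp and that the local file name matches the Q4KM entry in the table above (both are assumptions; adjust to your own paths):

```
# Run a one-off prompt against the downloaded quant with llama.cpp's CLI
./llama-cli -m ./JanusCoderV-7B-Q4_K_M.gguf \
  -p "Write a Python function that checks whether a string is a palindrome." \
  -n 256
```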
Trappu_Magnum-Picaro-0.7-v2-12b-GGUF
Meta-Llama-3-70B-Instruct-GGUF
internlm_JanusCoder-14B-GGUF
nvidia_Qwen3-Nemotron-32B-RLBFF-GGUF
Llamacpp imatrix Quantizations of Qwen3-Nemotron-32B-RLBFF by nvidia Original model: https://huggingface.co/nvidia/Qwen3-Nemotron-32B-RLBFF All quants made using imatrix option with dataset from here combined with a subset of combinedallsmall.parquet from Ed Addario here Run them directly with llama.cpp, or any other llama.cpp based project | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | Qwen3-Nemotron-32B-RLBFF-bf16.gguf | bf16 | 65.53GB | true | Full BF16 weights. | | Qwen3-Nemotron-32B-RLBFF-Q80.gguf | Q80 | 34.82GB | false | Extremely high quality, generally unneeded but max available quant. | | Qwen3-Nemotron-32B-RLBFF-Q6KL.gguf | Q6KL | 27.26GB | false | Uses Q80 for embed and output weights. Very high quality, near perfect, recommended. | | Qwen3-Nemotron-32B-RLBFF-Q6K.gguf | Q6K | 26.88GB | false | Very high quality, near perfect, recommended. | | Qwen3-Nemotron-32B-RLBFF-Q5KL.gguf | Q5KL | 23.69GB | false | Uses Q80 for embed and output weights. High quality, recommended. | | Qwen3-Nemotron-32B-RLBFF-Q5KM.gguf | Q5KM | 23.21GB | false | High quality, recommended. | | Qwen3-Nemotron-32B-RLBFF-Q5KS.gguf | Q5KS | 22.64GB | false | High quality, recommended. | | Qwen3-Nemotron-32B-RLBFF-Q41.gguf | Q41 | 20.64GB | false | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | Qwen3-Nemotron-32B-RLBFF-Q4KL.gguf | Q4KL | 20.34GB | false | Uses Q80 for embed and output weights. Good quality, recommended. | | Qwen3-Nemotron-32B-RLBFF-Q4KM.gguf | Q4KM | 19.76GB | false | Good quality, default size for most use cases, recommended. | | Qwen3-Nemotron-32B-RLBFF-Q4KS.gguf | Q4KS | 18.77GB | false | Slightly lower quality with more space savings, recommended. | | Qwen3-Nemotron-32B-RLBFF-Q40.gguf | Q40 | 18.70GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. | | Qwen3-Nemotron-32B-RLBFF-IQ4NL.gguf | IQ4NL | 18.68GB | false | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | Qwen3-Nemotron-32B-RLBFF-Q3KXL.gguf | Q3KXL | 18.01GB | false | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | Qwen3-Nemotron-32B-RLBFF-IQ4XS.gguf | IQ4XS | 17.69GB | false | Decent quality, smaller than Q4KS with similar performance, recommended. | | Qwen3-Nemotron-32B-RLBFF-Q3KL.gguf | Q3KL | 17.33GB | false | Lower quality but usable, good for low RAM availability. | | Qwen3-Nemotron-32B-RLBFF-Q3KM.gguf | Q3KM | 15.97GB | false | Low quality. | | Qwen3-Nemotron-32B-RLBFF-IQ3M.gguf | IQ3M | 14.93GB | false | Medium-low quality, new method with decent performance comparable to Q3KM. | | Qwen3-Nemotron-32B-RLBFF-Q3KS.gguf | Q3KS | 14.39GB | false | Low quality, not recommended. | | Qwen3-Nemotron-32B-RLBFF-IQ3XS.gguf | IQ3XS | 13.70GB | false | Lower quality, new method with decent performance, slightly better than Q3KS. | | Qwen3-Nemotron-32B-RLBFF-Q2KL.gguf | Q2KL | 13.10GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | Qwen3-Nemotron-32B-RLBFF-IQ3XXS.gguf | IQ3XXS | 12.82GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. | | Qwen3-Nemotron-32B-RLBFF-Q2K.gguf | Q2K | 12.34GB | false | Very low quality but surprisingly usable. | | Qwen3-Nemotron-32B-RLBFF-IQ2M.gguf | IQ2M | 11.36GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. 
| | Qwen3-Nemotron-32B-RLBFF-IQ2S.gguf | IQ2S | 10.51GB | false | Low quality, uses SOTA techniques to be usable. | | Qwen3-Nemotron-32B-RLBFF-IQ2XS.gguf | IQ2XS | 9.95GB | false | Low quality, uses SOTA techniques to be usable. | | Qwen3-Nemotron-32B-RLBFF-IQ2XXS.gguf | IQ2XXS | 9.02GB | false | Very low quality, uses SOTA techniques to be usable. | Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to. First, make sure you have hugginface-cli installed: If the model is bigger than 50GB, it will have been split into multiple files. In order to download them all to a local folder, run: You can either specify a new local-dir (nvidiaQwen3-Nemotron-32B-RLBFF-Q80) or download them all in place (./) Previously, you would download Q4044/48/88, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights. details in this PR. If you use Q40 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282 you will not be able to run the Q40XX files and will instead need to use Q40. Additionally, if you want to get slightly better quality for , you can use IQ4NL thanks to this PR which will also repack the weights for ARM, though only the 44 for now. The loading time may be slower but it will result in an overall speed incrase. I'm keeping this section to show the potential theoretical uplift in performance from using the Q40 with online repacking. Click to view benchmarks on an AVX2 system (EPYC7702) | model | size | params | backend | threads | test | t/s | % (vs Q40) | | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% | Q4088 offers a nice bump to prompt processing and a small bump to text generation A great write up with charts showing various performances is 
provided by Artefact2 here. The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have. If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB smaller than that total. Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart: But basically, if you're aiming for below Q4, and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in format IQXX, like IQ3M. These are newer and offer better performance for their size. These I-quants can also be used on CPU, but will be slower than their K-quant equivalent, so speed vs performance is a tradeoff you'll have to decide. Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output. Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
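For the split bf16 weights listed above (over 50GB, so sharded into multiple files), a hedged sketch of the folder download, assuming the repo lives at `bartowski/nvidia_Qwen3-Nemotron-32B-RLBFF-GGUF`:

```
# Download every bf16 shard into one local folder; llama.cpp loads the
# remaining shards automatically when pointed at the first one
huggingface-cli download bartowski/nvidia_Qwen3-Nemotron-32B-RLBFF-GGUF \
  --include "*bf16*" --local-dir nvidia_Qwen3-Nemotron-32B-RLBFF-bf16
```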
Qwen_Qwen2.5-VL-32B-Instruct-GGUF
Qwen_Qwen3-VL-8B-Instruct-GGUF
Qwen_Qwen3-235B-A22B-Instruct-2507-GGUF
zai-org_GLM-4.5-Air-GGUF
Llamacpp imatrix Quantizations of GLM-4.5-Air by zai-org Original model: https://huggingface.co/zai-org/GLM-4.5-Air All quants made using imatrix option with dataset from here Run them directly with llama.cpp, or any other llama.cpp based project | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | GLM-4.5-Air-Q80.gguf | Q80 | 117.46GB | true | Extremely high quality, generally unneeded but max available quant. | | GLM-4.5-Air-Q6K.gguf | Q6K | 99.18GB | true | Very high quality, near perfect, recommended. | | GLM-4.5-Air-Q5KM.gguf | Q5KM | 83.72GB | true | High quality, recommended. | | GLM-4.5-Air-Q5KS.gguf | Q5KS | 78.55GB | true | High quality, recommended. | | GLM-4.5-Air-Q4KM.gguf | Q4KM | 73.50GB | true | Good quality, default size for most use cases, recommended. | | GLM-4.5-Air-Q41.gguf | Q41 | 69.55GB | true | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | GLM-4.5-Air-Q4KS.gguf | Q4KS | 68.31GB | true | Slightly lower quality with more space savings, recommended. | | GLM-4.5-Air-Q40.gguf | Q40 | 63.76GB | true | Legacy format, offers online repacking for ARM and AVX CPU inference. | | GLM-4.5-Air-IQ4NL.gguf | IQ4NL | 63.06GB | true | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | GLM-4.5-Air-IQ4XS.gguf | IQ4XS | 60.81GB | true | Decent quality, smaller than Q4KS with similar performance, recommended. | | GLM-4.5-Air-Q3KXL.gguf | Q3KXL | 56.45GB | true | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | GLM-4.5-Air-Q3KL.gguf | Q3KL | 55.91GB | true | Lower quality but usable, good for low RAM availability. | | GLM-4.5-Air-Q3KM.gguf | Q3KM | 55.48GB | true | Low quality. | | GLM-4.5-Air-IQ3M.gguf | IQ3M | 55.48GB | true | Medium-low quality, new method with decent performance comparable to Q3KM. | | GLM-4.5-Air-Q3KS.gguf | Q3KS | 53.42GB | true | Low quality, not recommended. | | GLM-4.5-Air-IQ3XS.gguf | IQ3XS | 50.84GB | true | Lower quality, new method with decent performance, slightly better than Q3KS. | | GLM-4.5-Air-IQ3XXS.gguf | IQ3XXS | 50.34GB | true | Lower quality, new method with decent performance, comparable to Q3 quants. | | GLM-4.5-Air-Q2KL.gguf | Q2KL | 46.71GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | GLM-4.5-Air-Q2K.gguf | Q2K | 46.10GB | false | Very low quality but surprisingly usable. | | GLM-4.5-Air-IQ2M.gguf | IQ2M | 45.12GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. | | GLM-4.5-Air-IQ2S.gguf | IQ2S | 42.54GB | false | Low quality, uses SOTA techniques to be usable. | | GLM-4.5-Air-IQ2XS.gguf | IQ2XS | 42.19GB | false | Low quality, uses SOTA techniques to be usable. | | GLM-4.5-Air-IQ2XXS.gguf | IQ2XXS | 39.62GB | false | Very low quality, uses SOTA techniques to be usable. | | GLM-4.5-Air-IQ1M.gguf | IQ1M | 37.86GB | false | Extremely low quality, not recommended. | Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to. First, make sure you have hugginface-cli installed: If the model is bigger than 50GB, it will have been split into multiple files. 
In order to download them all to a local folder, run: You can either specify a new local-dir (zai-orgGLM-4.5-Air-Q80) or download them all in place (./) Previously, you would download Q4044/48/88, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights. details in this PR. If you use Q40 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282 you will not be able to run the Q40XX files and will instead need to use Q40. Additionally, if you want to get slightly better quality for , you can use IQ4NL thanks to this PR which will also repack the weights for ARM, though only the 44 for now. The loading time may be slower but it will result in an overall speed incrase. I'm keeping this section to show the potential theoretical uplift in performance from using the Q40 with online repacking. Click to view benchmarks on an AVX2 system (EPYC7702) | model | size | params | backend | threads | test | t/s | % (vs Q40) | | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% | Q4088 offers a nice bump to prompt processing and a small bump to text generation A great write up with charts showing various performances is provided by Artefact2 here The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have. If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB Smaller than that total. Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. 
These are in format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart: But basically, if you're aiming for below Q4, and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in format IQXX, like IQ3M. These are newer and offer better performance for their size. These I-quants can also be used on CPU, but will be slower than their K-quant equivalent, so speed vs performance is a tradeoff you'll have to decide. Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output. Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
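Since most GLM-4.5-Air quants above are split across multiple files, a short example of downloading one of them in place (repo path assumed to be `bartowski/zai-org_GLM-4.5-Air-GGUF`; adjust the glob to the quant you picked):

```
# Grab all shards of the Q4_K_M quant into the current directory
huggingface-cli download bartowski/zai-org_GLM-4.5-Air-GGUF \
  --include "*Q4_K_M*" --local-dir ./
```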
gemma-2-2b-it-abliterated-GGUF
Qwen2.5-Coder-3B-GGUF
TheDrummer_Behemoth-X-123B-v2.1-GGUF
Codestral-22B-v0.1-GGUF
Llamacpp imatrix Quantizations of Codestral-22B-v0.1 Original model: https://huggingface.co/mistralai/Codestral-22B-v0.1 All quants made using imatrix option with dataset from here | Filename | Quant type | File Size | Description | | -------- | ---------- | --------- | ----------- | | Codestral-22B-v0.1-Q80.gguf | Q80 | 23.64GB | Extremely high quality, generally unneeded but max available quant. | | Codestral-22B-v0.1-Q6K.gguf | Q6K | 18.25GB | Very high quality, near perfect, recommended. | | Codestral-22B-v0.1-Q5KM.gguf | Q5KM | 15.72GB | High quality, recommended. | | Codestral-22B-v0.1-Q5KS.gguf | Q5KS | 15.32GB | High quality, recommended. | | Codestral-22B-v0.1-Q4KM.gguf | Q4KM | 13.34GB | Good quality, uses about 4.83 bits per weight, recommended. | | Codestral-22B-v0.1-Q4KS.gguf | Q4KS | 12.66GB | Slightly lower quality with more space savings, recommended. | | Codestral-22B-v0.1-IQ4XS.gguf | IQ4XS | 11.93GB | Decent quality, smaller than Q4KS with similar performance, recommended. | | Codestral-22B-v0.1-Q3KL.gguf | Q3KL | 11.73GB | Lower quality but usable, good for low RAM availability. | | Codestral-22B-v0.1-Q3KM.gguf | Q3KM | 10.75GB | Even lower quality. | | Codestral-22B-v0.1-IQ3M.gguf | IQ3M | 10.06GB | Medium-low quality, new method with decent performance comparable to Q3KM. | | Codestral-22B-v0.1-Q3KS.gguf | Q3KS | 9.64GB | Low quality, not recommended. | | Codestral-22B-v0.1-IQ3XS.gguf | IQ3XS | 9.17GB | Lower quality, new method with decent performance, slightly better than Q3KS. | | Codestral-22B-v0.1-IQ3XXS.gguf | IQ3XXS | 8.59GB | Lower quality, new method with decent performance, comparable to Q3 quants. | | Codestral-22B-v0.1-Q2K.gguf | Q2K | 8.27GB | Very low quality but surprisingly usable. | | Codestral-22B-v0.1-IQ2M.gguf | IQ2M | 7.61GB | Very low quality, uses SOTA techniques to also be surprisingly usable. | | Codestral-22B-v0.1-IQ2S.gguf | IQ2S | 7.03GB | Very low quality, uses SOTA techniques to be usable. | | Codestral-22B-v0.1-IQ2XS.gguf | IQ2XS | 6.64GB | Very low quality, uses SOTA techniques to be usable. | First, make sure you have hugginface-cli installed: If the model is bigger than 50GB, it will have been split into multiple files. In order to download them all to a local folder, run: You can either specify a new local-dir (Codestral-22B-v0.1-Q80) or download them all in place (./) A great write up with charts showing various performances is provided by Artefact2 here The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have. If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB Smaller than that total. Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart: But basically, if you're aiming for below Q4, and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in format IQXX, like IQ3M. These are newer and offer better performance for their size. 
These I-quants can also be used on CPU and Apple Metal, but will be slower than their K-quant equivalent, so speed vs performance is a tradeoff you'll have to decide. The I-quants are not compatible with Vulkan, which also supports AMD, so if you have an AMD card double check whether you're using the rocBLAS build or the Vulkan build. At the time of writing this, LM Studio has a preview with ROCm support, and other inference engines have specific builds for ROCm. Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
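To make the sizing advice above concrete, a quick check on an NVIDIA card (the quant sizes below are simply read from the table; Vulkan/ROCm users can use their own tooling):

```
# Report total VRAM; pick a quant roughly 1-2GB smaller than this number
nvidia-smi --query-gpu=memory.total --format=csv,noheader
# Example: a 16GB card fits Codestral-22B-v0.1-Q4KM (13.34GB) with headroom
# for context, while Q5KS (15.32GB) would leave almost none.
```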
SicariusSicariiStuff_X-Ray_Alpha-GGUF
google_gemma-3-4b-it-qat-GGUF
granite-3.1-8b-instruct-GGUF
Qwen_Qwen3-VL-30B-A3B-Thinking-GGUF
Llamacpp imatrix Quantizations of Qwen3-VL-30B-A3B-Thinking by Qwen Original model: https://huggingface.co/Qwen/Qwen3-VL-30B-A3B-Thinking All quants made using imatrix option with dataset from here combined with a subset of combinedallsmall.parquet from Ed Addario here Run them directly with llama.cpp, or any other llama.cpp based project | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | Qwen3-VL-30B-A3B-Thinking-bf16.gguf | bf16 | 61.10GB | true | Full BF16 weights. | | Qwen3-VL-30B-A3B-Thinking-Q80.gguf | Q80 | 32.48GB | false | Extremely high quality, generally unneeded but max available quant. | | Qwen3-VL-30B-A3B-Thinking-Q6KL.gguf | Q6KL | 25.26GB | false | Uses Q80 for embed and output weights. Very high quality, near perfect, recommended. | | Qwen3-VL-30B-A3B-Thinking-Q6K.gguf | Q6K | 25.10GB | false | Very high quality, near perfect, recommended. | | Qwen3-VL-30B-A3B-Thinking-Q5KL.gguf | Q5KL | 21.94GB | false | Uses Q80 for embed and output weights. High quality, recommended. | | Qwen3-VL-30B-A3B-Thinking-Q5KM.gguf | Q5KM | 21.74GB | false | High quality, recommended. | | Qwen3-VL-30B-A3B-Thinking-Q5KS.gguf | Q5KS | 21.10GB | false | High quality, recommended. | | Qwen3-VL-30B-A3B-Thinking-Q41.gguf | Q41 | 19.21GB | false | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | Qwen3-VL-30B-A3B-Thinking-Q4KL.gguf | Q4KL | 18.86GB | false | Uses Q80 for embed and output weights. Good quality, recommended. | | Qwen3-VL-30B-A3B-Thinking-Q4KM.gguf | Q4KM | 18.63GB | false | Good quality, default size for most use cases, recommended. | | Qwen3-VL-30B-A3B-Thinking-Q4KS.gguf | Q4KS | 17.98GB | false | Slightly lower quality with more space savings, recommended. | | Qwen3-VL-30B-A3B-Thinking-Q40.gguf | Q40 | 17.63GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. | | Qwen3-VL-30B-A3B-Thinking-IQ4NL.gguf | IQ4NL | 17.39GB | false | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | Qwen3-VL-30B-A3B-Thinking-IQ4XS.gguf | IQ4XS | 16.46GB | false | Decent quality, smaller than Q4KS with similar performance, recommended. | | Qwen3-VL-30B-A3B-Thinking-Q3KXL.gguf | Q3KXL | 14.86GB | false | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | Qwen3-VL-30B-A3B-Thinking-Q3KL.gguf | Q3KL | 14.58GB | false | Lower quality but usable, good for low RAM availability. | | Qwen3-VL-30B-A3B-Thinking-Q3KM.gguf | Q3KM | 14.08GB | false | Low quality. | | Qwen3-VL-30B-A3B-Thinking-IQ3M.gguf | IQ3M | 14.08GB | false | Medium-low quality, new method with decent performance comparable to Q3KM. | | Qwen3-VL-30B-A3B-Thinking-Q3KS.gguf | Q3KS | 13.43GB | false | Low quality, not recommended. | | Qwen3-VL-30B-A3B-Thinking-IQ3XS.gguf | IQ3XS | 12.74GB | false | Lower quality, new method with decent performance, slightly better than Q3KS. | | Qwen3-VL-30B-A3B-Thinking-IQ3XXS.gguf | IQ3XXS | 12.22GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. | | Qwen3-VL-30B-A3B-Thinking-Q2KL.gguf | Q2KL | 11.21GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | Qwen3-VL-30B-A3B-Thinking-Q2K.gguf | Q2K | 10.91GB | false | Very low quality but surprisingly usable. | | Qwen3-VL-30B-A3B-Thinking-IQ2M.gguf | IQ2M | 9.87GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. 
| | Qwen3-VL-30B-A3B-Thinking-IQ2S.gguf | IQ2S | 8.74GB | false | Low quality, uses SOTA techniques to be usable. | | Qwen3-VL-30B-A3B-Thinking-IQ2XS.gguf | IQ2XS | 8.66GB | false | Low quality, uses SOTA techniques to be usable. | | Qwen3-VL-30B-A3B-Thinking-IQ2XXS.gguf | IQ2XXS | 7.57GB | false | Very low quality, uses SOTA techniques to be usable. | Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to. First, make sure you have hugginface-cli installed: If the model is bigger than 50GB, it will have been split into multiple files. In order to download them all to a local folder, run: You can either specify a new local-dir (QwenQwen3-VL-30B-A3B-Thinking-Q80) or download them all in place (./) Previously, you would download Q4044/48/88, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights. details in this PR. If you use Q40 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282 you will not be able to run the Q40XX files and will instead need to use Q40. Additionally, if you want to get slightly better quality for , you can use IQ4NL thanks to this PR which will also repack the weights for ARM, though only the 44 for now. The loading time may be slower but it will result in an overall speed incrase. I'm keeping this section to show the potential theoretical uplift in performance from using the Q40 with online repacking. Click to view benchmarks on an AVX2 system (EPYC7702) | model | size | params | backend | threads | test | t/s | % (vs Q40) | | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% | Q4088 offers a nice bump to prompt processing and a small bump to text generation A great write up with charts showing various performances is 
provided by Artefact2 here. The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have. If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB smaller than that total. Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart: But basically, if you're aiming for below Q4, and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in format IQXX, like IQ3M. These are newer and offer better performance for their size. These I-quants can also be used on CPU, but will be slower than their K-quant equivalent, so speed vs performance is a tradeoff you'll have to decide. Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output. Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
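If you prefer an HTTP endpoint over the CLI, a sketch of serving a downloaded quant with llama.cpp's server (the file name and layer-offload count are assumptions; vision input additionally needs the mmproj file if the repo provides one):

```
# Serve the quant over an OpenAI-compatible API on port 8080,
# offloading as many layers as fit on the GPU
./llama-server -m ./Qwen3-VL-30B-A3B-Thinking-Q4_K_M.gguf -c 8192 -ngl 99 --port 8080
```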
google_gemma-3-270m-it-GGUF
nvidia_Nemotron-3-Nano-30B-A3B-GGUF
Qwen2.5-Coder-0.5B-GGUF
THUDM_GLM-Z1-Rumination-32B-0414-GGUF
mistralai_Voxtral-Small-24B-2507-GGUF
Llamacpp imatrix Quantizations of Voxtral-Small-24B-2507 by mistralai Original model: https://huggingface.co/mistralai/Voxtral-Small-24B-2507 All quants made using imatrix option with dataset from here Run them directly with llama.cpp, or any other llama.cpp based project | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | Voxtral-Small-24B-2507-bf16.gguf | bf16 | 47.15GB | false | Full BF16 weights. | | Voxtral-Small-24B-2507-Q80.gguf | Q80 | 25.06GB | false | Extremely high quality, generally unneeded but max available quant. | | Voxtral-Small-24B-2507-Q6KL.gguf | Q6KL | 19.67GB | false | Uses Q80 for embed and output weights. Very high quality, near perfect, recommended. | | Voxtral-Small-24B-2507-Q6K.gguf | Q6K | 19.35GB | false | Very high quality, near perfect, recommended. | | Voxtral-Small-24B-2507-Q5KL.gguf | Q5KL | 17.18GB | false | Uses Q80 for embed and output weights. High quality, recommended. | | Voxtral-Small-24B-2507-Q5KM.gguf | Q5KM | 16.76GB | false | High quality, recommended. | | Voxtral-Small-24B-2507-Q5KS.gguf | Q5KS | 16.30GB | false | High quality, recommended. | | Voxtral-Small-24B-2507-Q41.gguf | Q41 | 14.87GB | false | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | Voxtral-Small-24B-2507-Q4KL.gguf | Q4KL | 14.83GB | false | Uses Q80 for embed and output weights. Good quality, recommended. | | Voxtral-Small-24B-2507-Q4KM.gguf | Q4KM | 14.33GB | false | Good quality, default size for most use cases, recommended. | | Voxtral-Small-24B-2507-Q4KS.gguf | Q4KS | 13.55GB | false | Slightly lower quality with more space savings, recommended. | | Voxtral-Small-24B-2507-Q40.gguf | Q40 | 13.49GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. | | Voxtral-Small-24B-2507-IQ4NL.gguf | IQ4NL | 13.47GB | false | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | Voxtral-Small-24B-2507-Q3KXL.gguf | Q3KXL | 12.99GB | false | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | Voxtral-Small-24B-2507-IQ4XS.gguf | IQ4XS | 12.76GB | false | Decent quality, smaller than Q4KS with similar performance, recommended. | | Voxtral-Small-24B-2507-Q3KL.gguf | Q3KL | 12.40GB | false | Lower quality but usable, good for low RAM availability. | | Voxtral-Small-24B-2507-Q3KM.gguf | Q3KM | 11.47GB | false | Low quality. | | Voxtral-Small-24B-2507-IQ3M.gguf | IQ3M | 10.65GB | false | Medium-low quality, new method with decent performance comparable to Q3KM. | | Voxtral-Small-24B-2507-Q3KS.gguf | Q3KS | 10.40GB | false | Low quality, not recommended. | | Voxtral-Small-24B-2507-IQ3XS.gguf | IQ3XS | 9.91GB | false | Lower quality, new method with decent performance, slightly better than Q3KS. | | Voxtral-Small-24B-2507-Q2KL.gguf | Q2KL | 9.55GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | Voxtral-Small-24B-2507-IQ3XXS.gguf | IQ3XXS | 9.28GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. | | Voxtral-Small-24B-2507-Q2K.gguf | Q2K | 8.89GB | false | Very low quality but surprisingly usable. | | Voxtral-Small-24B-2507-IQ2M.gguf | IQ2M | 8.11GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. | | Voxtral-Small-24B-2507-IQ2S.gguf | IQ2S | 7.48GB | false | Low quality, uses SOTA techniques to be usable. 
| | Voxtral-Small-24B-2507-IQ2XS.gguf | IQ2XS | 7.21GB | false | Low quality, uses SOTA techniques to be usable. | Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to. First, make sure you have hugginface-cli installed: If the model is bigger than 50GB, it will have been split into multiple files. In order to download them all to a local folder, run: You can either specify a new local-dir (mistralaiVoxtral-Small-24B-2507-Q80) or download them all in place (./) Previously, you would download Q4044/48/88, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights. details in this PR. If you use Q40 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282 you will not be able to run the Q40XX files and will instead need to use Q40. Additionally, if you want to get slightly better quality for , you can use IQ4NL thanks to this PR which will also repack the weights for ARM, though only the 44 for now. The loading time may be slower but it will result in an overall speed incrase. I'm keeping this section to show the potential theoretical uplift in performance from using the Q40 with online repacking. Click to view benchmarks on an AVX2 system (EPYC7702) | model | size | params | backend | threads | test | t/s | % (vs Q40) | | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% | Q4088 offers a nice bump to prompt processing and a small bump to text generation A great write up with charts showing various performances is provided by Artefact2 here The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have. 
If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB smaller than that total. Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart: But basically, if you're aiming for below Q4, and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in format IQXX, like IQ3M. These are newer and offer better performance for their size. These I-quants can also be used on CPU, but will be slower than their K-quant equivalent, so speed vs performance is a tradeoff you'll have to decide. Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output. Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
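The "new local-dir vs. in place" choice mentioned above, spelled out as a hedged example (repo path assumed to be `bartowski/mistralai_Voxtral-Small-24B-2507-GGUF`):

```
# Into a named folder:
huggingface-cli download bartowski/mistralai_Voxtral-Small-24B-2507-GGUF \
  --include "*Q8_0*" --local-dir mistralai_Voxtral-Small-24B-2507-Q8_0
# Or in place, alongside whatever is already in the current directory:
huggingface-cli download bartowski/mistralai_Voxtral-Small-24B-2507-GGUF \
  --include "*Q8_0*" --local-dir ./
```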
Qwen2.5-Coder-14B-Instruct-GGUF
google_gemma-3-12b-it-qat-GGUF
TheDrummer_Gemmasutra-Small-4B-v1-GGUF
OpenGVLab_InternVL3_5-38B-GGUF
Llamacpp imatrix Quantizations of InternVL35-38B by OpenGVLab Original model: https://huggingface.co/OpenGVLab/InternVL35-38B All quants made using imatrix option with dataset from here combined with a subset of combinedallsmall.parquet from Ed Addario here Run them directly with llama.cpp, or any other llama.cpp based project | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | InternVL35-38B-bf16.gguf | bf16 | 65.53GB | true | Full BF16 weights. | | InternVL35-38B-Q80.gguf | Q80 | 34.82GB | false | Extremely high quality, generally unneeded but max available quant. | | InternVL35-38B-Q6KL.gguf | Q6KL | 27.26GB | false | Uses Q80 for embed and output weights. Very high quality, near perfect, recommended. | | InternVL35-38B-Q6K.gguf | Q6K | 26.88GB | false | Very high quality, near perfect, recommended. | | InternVL35-38B-Q5KL.gguf | Q5KL | 23.69GB | false | Uses Q80 for embed and output weights. High quality, recommended. | | InternVL35-38B-Q5KM.gguf | Q5KM | 23.21GB | false | High quality, recommended. | | InternVL35-38B-Q5KS.gguf | Q5KS | 22.64GB | false | High quality, recommended. | | InternVL35-38B-Q41.gguf | Q41 | 20.64GB | false | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | InternVL35-38B-Q4KL.gguf | Q4KL | 20.34GB | false | Uses Q80 for embed and output weights. Good quality, recommended. | | InternVL35-38B-Q4KM.gguf | Q4KM | 19.76GB | false | Good quality, default size for most use cases, recommended. | | InternVL35-38B-Q4KS.gguf | Q4KS | 18.77GB | false | Slightly lower quality with more space savings, recommended. | | InternVL35-38B-Q40.gguf | Q40 | 18.70GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. | | InternVL35-38B-IQ4NL.gguf | IQ4NL | 18.68GB | false | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | InternVL35-38B-Q3KXL.gguf | Q3KXL | 18.01GB | false | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | InternVL35-38B-IQ4XS.gguf | IQ4XS | 17.69GB | false | Decent quality, smaller than Q4KS with similar performance, recommended. | | InternVL35-38B-Q3KL.gguf | Q3KL | 17.33GB | false | Lower quality but usable, good for low RAM availability. | | InternVL35-38B-Q3KM.gguf | Q3KM | 15.97GB | false | Low quality. | | InternVL35-38B-IQ3M.gguf | IQ3M | 14.93GB | false | Medium-low quality, new method with decent performance comparable to Q3KM. | | InternVL35-38B-Q3KS.gguf | Q3KS | 14.39GB | false | Low quality, not recommended. | | InternVL35-38B-IQ3XS.gguf | IQ3XS | 13.70GB | false | Lower quality, new method with decent performance, slightly better than Q3KS. | | InternVL35-38B-Q2KL.gguf | Q2KL | 13.10GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | InternVL35-38B-IQ3XXS.gguf | IQ3XXS | 12.82GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. | | InternVL35-38B-Q2K.gguf | Q2K | 12.34GB | false | Very low quality but surprisingly usable. | | InternVL35-38B-IQ2M.gguf | IQ2M | 11.36GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. | | InternVL35-38B-IQ2S.gguf | IQ2S | 10.51GB | false | Low quality, uses SOTA techniques to be usable. | | InternVL35-38B-IQ2XS.gguf | IQ2XS | 9.95GB | false | Low quality, uses SOTA techniques to be usable. 
| | InternVL35-38B-IQ2XXS.gguf | IQ2XXS | 9.02GB | false | Very low quality, uses SOTA techniques to be usable. | Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to. First, make sure you have hugginface-cli installed: If the model is bigger than 50GB, it will have been split into multiple files. In order to download them all to a local folder, run: You can either specify a new local-dir (OpenGVLabInternVL35-38B-Q80) or download them all in place (./) Previously, you would download Q4044/48/88, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights. details in this PR. If you use Q40 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282 you will not be able to run the Q40XX files and will instead need to use Q40. Additionally, if you want to get slightly better quality for , you can use IQ4NL thanks to this PR which will also repack the weights for ARM, though only the 44 for now. The loading time may be slower but it will result in an overall speed incrase. I'm keeping this section to show the potential theoretical uplift in performance from using the Q40 with online repacking. Click to view benchmarks on an AVX2 system (EPYC7702) | model | size | params | backend | threads | test | t/s | % (vs Q40) | | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% | Q4088 offers a nice bump to prompt processing and a small bump to text generation A great write up with charts showing various performances is provided by Artefact2 here The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have. 
If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB smaller than that total. Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart: But basically, if you're aiming for below Q4, and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in format IQXX, like IQ3M. These are newer and offer better performance for their size. These I-quants can also be used on CPU, but will be slower than their K-quant equivalent, so speed vs performance is a tradeoff you'll have to decide. Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output. Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
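The bf16 weights in this table are marked Split: true, so they ship as multiple .gguf parts that all need to land in the same folder. As a minimal sketch of that "download them all to a local folder" step using the huggingface_hub Python API instead of huggingface-cli (the repo id and pattern are illustrative assumptions, so check the repo's file listing for the exact names):

```python
# Sketch: download every part of a split quant into one folder with huggingface_hub.
# repo_id and the pattern are assumptions -- verify against the actual repo.
from huggingface_hub import snapshot_download

folder = snapshot_download(
    repo_id="bartowski/OpenGVLab_InternVL3_5-38B-GGUF",  # assumed repo id
    allow_patterns=["*bf16*"],       # match all shards of the split bf16 quant
    local_dir="InternVL35-38B-bf16",
)
print("parts saved under", folder)   # llama.cpp loads the first part and picks up the rest
```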
Qwen2.5-3B-Instruct-GGUF
DeepSeek-R1-Distill-Llama-70B-GGUF
L3.3-MS-Nevoria-70b-GGUF
Dolphin3.0-Llama3.2-1B-GGUF
Cat-Llama-3-70B-instruct-GGUF
Dolphin3.0-Llama3.2-3B-GGUF
Qwen_Qwen3-VL-32B-Thinking-GGUF
Llamacpp imatrix Quantizations of Qwen3-VL-32B-Thinking by Qwen Original model: https://huggingface.co/Qwen/Qwen3-VL-32B-Thinking All quants made using imatrix option with dataset from here combined with a subset of combinedallsmall.parquet from Ed Addario here Run them directly with llama.cpp, or any other llama.cpp based project | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | Qwen3-VL-32B-Thinking-bf16.gguf | bf16 | 65.53GB | true | Full BF16 weights. | | Qwen3-VL-32B-Thinking-Q80.gguf | Q80 | 34.82GB | false | Extremely high quality, generally unneeded but max available quant. | | Qwen3-VL-32B-Thinking-Q6KL.gguf | Q6KL | 27.26GB | false | Uses Q80 for embed and output weights. Very high quality, near perfect, recommended. | | Qwen3-VL-32B-Thinking-Q6K.gguf | Q6K | 26.88GB | false | Very high quality, near perfect, recommended. | | Qwen3-VL-32B-Thinking-Q5KL.gguf | Q5KL | 23.69GB | false | Uses Q80 for embed and output weights. High quality, recommended. | | Qwen3-VL-32B-Thinking-Q5KM.gguf | Q5KM | 23.21GB | false | High quality, recommended. | | Qwen3-VL-32B-Thinking-Q5KS.gguf | Q5KS | 22.64GB | false | High quality, recommended. | | Qwen3-VL-32B-Thinking-Q41.gguf | Q41 | 20.64GB | false | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | Qwen3-VL-32B-Thinking-Q4KL.gguf | Q4KL | 20.34GB | false | Uses Q80 for embed and output weights. Good quality, recommended. | | Qwen3-VL-32B-Thinking-Q4KM.gguf | Q4KM | 19.76GB | false | Good quality, default size for most use cases, recommended. | | Qwen3-VL-32B-Thinking-Q4KS.gguf | Q4KS | 18.77GB | false | Slightly lower quality with more space savings, recommended. | | Qwen3-VL-32B-Thinking-Q40.gguf | Q40 | 18.70GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. | | Qwen3-VL-32B-Thinking-IQ4NL.gguf | IQ4NL | 18.68GB | false | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | Qwen3-VL-32B-Thinking-Q3KXL.gguf | Q3KXL | 18.01GB | false | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | Qwen3-VL-32B-Thinking-IQ4XS.gguf | IQ4XS | 17.69GB | false | Decent quality, smaller than Q4KS with similar performance, recommended. | | Qwen3-VL-32B-Thinking-Q3KL.gguf | Q3KL | 17.33GB | false | Lower quality but usable, good for low RAM availability. | | Qwen3-VL-32B-Thinking-Q3KM.gguf | Q3KM | 15.97GB | false | Low quality. | | Qwen3-VL-32B-Thinking-IQ3M.gguf | IQ3M | 14.93GB | false | Medium-low quality, new method with decent performance comparable to Q3KM. | | Qwen3-VL-32B-Thinking-Q3KS.gguf | Q3KS | 14.39GB | false | Low quality, not recommended. | | Qwen3-VL-32B-Thinking-IQ3XS.gguf | IQ3XS | 13.70GB | false | Lower quality, new method with decent performance, slightly better than Q3KS. | | Qwen3-VL-32B-Thinking-Q2KL.gguf | Q2KL | 13.10GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | Qwen3-VL-32B-Thinking-IQ3XXS.gguf | IQ3XXS | 12.82GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. | | Qwen3-VL-32B-Thinking-Q2K.gguf | Q2K | 12.34GB | false | Very low quality but surprisingly usable. | | Qwen3-VL-32B-Thinking-IQ2M.gguf | IQ2M | 11.36GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. | | Qwen3-VL-32B-Thinking-IQ2S.gguf | IQ2S | 10.51GB | false | Low quality, uses SOTA techniques to be usable. 
| | Qwen3-VL-32B-Thinking-IQ2XS.gguf | IQ2XS | 9.95GB | false | Low quality, uses SOTA techniques to be usable. | | Qwen3-VL-32B-Thinking-IQ2XXS.gguf | IQ2XXS | 9.02GB | false | Very low quality, uses SOTA techniques to be usable. | Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to. First, make sure you have hugginface-cli installed: If the model is bigger than 50GB, it will have been split into multiple files. In order to download them all to a local folder, run: You can either specify a new local-dir (QwenQwen3-VL-32B-Thinking-Q80) or download them all in place (./) Previously, you would download Q4044/48/88, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights. details in this PR. If you use Q40 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282 you will not be able to run the Q40XX files and will instead need to use Q40. Additionally, if you want to get slightly better quality for , you can use IQ4NL thanks to this PR which will also repack the weights for ARM, though only the 44 for now. The loading time may be slower but it will result in an overall speed incrase. I'm keeping this section to show the potential theoretical uplift in performance from using the Q40 with online repacking. Click to view benchmarks on an AVX2 system (EPYC7702) | model | size | params | backend | threads | test | t/s | % (vs Q40) | | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% | Q4088 offers a nice bump to prompt processing and a small bump to text generation A great write up with charts showing various performances is provided by Artefact2 here The first thing to figure out is how big a model you can run. 
To do this, you'll need to figure out how much RAM and/or VRAM you have. If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB Smaller than that total. Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart: But basically, if you're aiming for below Q4, and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in format IQXX, like IQ3M. These are newer and offer better performance for their size. These I-quants can also be used on CPU, but will be slower than their K-quant equivalent, so speed vs performance is a tradeoff you'll have to decide. Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output. Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
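To make the sizing rule above concrete, here is a small sketch that picks the largest quant from this table that still leaves some VRAM headroom. The sizes are copied from the table above; the 1.5GB headroom is just the middle of the 1-2GB guideline, and real usage also depends on context length and KV cache.

```python
# Heuristic from the text: choose a file ~1-2GB smaller than your GPU's VRAM.
QUANT_SIZES_GB = {  # copied from the Qwen3-VL-32B-Thinking table above
    "Q6K": 26.88, "Q5KM": 23.21, "Q4KM": 19.76, "Q4KS": 18.77,
    "IQ4XS": 17.69, "Q3KM": 15.97, "IQ3M": 14.93, "IQ2M": 11.36,
}

def pick_quant(vram_gb: float, headroom_gb: float = 1.5):
    """Largest listed quant that leaves `headroom_gb` of VRAM free, or None."""
    fitting = {q: s for q, s in QUANT_SIZES_GB.items() if s <= vram_gb - headroom_gb}
    return max(fitting, key=fitting.get) if fitting else None

print(pick_quant(24.0))  # 24GB card -> "Q4KM" (19.76GB); Q5KM at 23.21GB is too tight
```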
EVA-LLaMA-3.33-70B-v0.0-GGUF
Qwen_Qwen3-VL-8B-Thinking-GGUF
Llamacpp imatrix Quantizations of Qwen3-VL-8B-Thinking by Qwen Original model: https://huggingface.co/Qwen/Qwen3-VL-8B-Thinking All quants made using imatrix option with dataset from here combined with a subset of combinedallsmall.parquet from Ed Addario here Run them directly with llama.cpp, or any other llama.cpp based project | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | Qwen3-VL-8B-Thinking-bf16.gguf | bf16 | 16.39GB | false | Full BF16 weights. | | Qwen3-VL-8B-Thinking-Q80.gguf | Q80 | 8.71GB | false | Extremely high quality, generally unneeded but max available quant. | | Qwen3-VL-8B-Thinking-Q6KL.gguf | Q6KL | 7.03GB | false | Uses Q80 for embed and output weights. Very high quality, near perfect, recommended. | | Qwen3-VL-8B-Thinking-Q6K.gguf | Q6K | 6.73GB | false | Very high quality, near perfect, recommended. | | Qwen3-VL-8B-Thinking-Q5KL.gguf | Q5KL | 6.24GB | false | Uses Q80 for embed and output weights. High quality, recommended. | | Qwen3-VL-8B-Thinking-Q5KM.gguf | Q5KM | 5.85GB | false | High quality, recommended. | | Qwen3-VL-8B-Thinking-Q5KS.gguf | Q5KS | 5.72GB | false | High quality, recommended. | | Qwen3-VL-8B-Thinking-Q4KL.gguf | Q4KL | 5.49GB | false | Uses Q80 for embed and output weights. Good quality, recommended. | | Qwen3-VL-8B-Thinking-Q41.gguf | Q41 | 5.25GB | false | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | Qwen3-VL-8B-Thinking-Q4KM.gguf | Q4KM | 5.03GB | false | Good quality, default size for most use cases, recommended. | | Qwen3-VL-8B-Thinking-Q3KXL.gguf | Q3KXL | 4.98GB | false | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | Qwen3-VL-8B-Thinking-Q4KS.gguf | Q4KS | 4.80GB | false | Slightly lower quality with more space savings, recommended. | | Qwen3-VL-8B-Thinking-Q40.gguf | Q40 | 4.79GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. | | Qwen3-VL-8B-Thinking-IQ4NL.gguf | IQ4NL | 4.79GB | false | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | Qwen3-VL-8B-Thinking-IQ4XS.gguf | IQ4XS | 4.56GB | false | Decent quality, smaller than Q4KS with similar performance, recommended. | | Qwen3-VL-8B-Thinking-Q3KL.gguf | Q3KL | 4.43GB | false | Lower quality but usable, good for low RAM availability. | | Qwen3-VL-8B-Thinking-Q3KM.gguf | Q3KM | 4.12GB | false | Low quality. | | Qwen3-VL-8B-Thinking-IQ3M.gguf | IQ3M | 3.90GB | false | Medium-low quality, new method with decent performance comparable to Q3KM. | | Qwen3-VL-8B-Thinking-Q2KL.gguf | Q2KL | 3.89GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | Qwen3-VL-8B-Thinking-Q3KS.gguf | Q3KS | 3.77GB | false | Low quality, not recommended. | | Qwen3-VL-8B-Thinking-IQ3XS.gguf | IQ3XS | 3.63GB | false | Lower quality, new method with decent performance, slightly better than Q3KS. | | Qwen3-VL-8B-Thinking-IQ3XXS.gguf | IQ3XXS | 3.37GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. | | Qwen3-VL-8B-Thinking-Q2K.gguf | Q2K | 3.28GB | false | Very low quality but surprisingly usable. | | Qwen3-VL-8B-Thinking-IQ2M.gguf | IQ2M | 3.05GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. 
| Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to. First, make sure you have hugginface-cli installed: If the model is bigger than 50GB, it will have been split into multiple files. In order to download them all to a local folder, run: You can either specify a new local-dir (QwenQwen3-VL-8B-Thinking-Q80) or download them all in place (./) Previously, you would download Q4044/48/88, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights. details in this PR. If you use Q40 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282 you will not be able to run the Q40XX files and will instead need to use Q40. Additionally, if you want to get slightly better quality for , you can use IQ4NL thanks to this PR which will also repack the weights for ARM, though only the 44 for now. The loading time may be slower but it will result in an overall speed incrase. I'm keeping this section to show the potential theoretical uplift in performance from using the Q40 with online repacking. Click to view benchmarks on an AVX2 system (EPYC7702) | model | size | params | backend | threads | test | t/s | % (vs Q40) | | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% | Q4088 offers a nice bump to prompt processing and a small bump to text generation A great write up with charts showing various performances is provided by Artefact2 here The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have. If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. 
If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB Smaller than that total. Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart: But basically, if you're aiming for below Q4, and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in format IQXX, like IQ3M. These are newer and offer better performance for their size. These I-quants can also be used on CPU, but will be slower than their K-quant equivalent, so speed vs performance is a tradeoff you'll have to decide. Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output. Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
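At this size every quant in the table is a single file, so a one-file download is enough. A minimal sketch with the huggingface_hub Python API, where the repo id and filename are assumptions to check against the repo before running:

```python
# Sketch: fetch a single quant file rather than the whole repo.
# repo_id and filename are illustrative assumptions based on the table above.
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="bartowski/Qwen_Qwen3-VL-8B-Thinking-GGUF",  # assumed repo id
    filename="Qwen3-VL-8B-Thinking-Q4KM.gguf",           # 5.03GB per the table
    local_dir="models",
)
print("saved to", path)
```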
Qwen_Qwen3-14B-GGUF
Gryphe_Codex-24B-Small-3.2-GGUF
Qwen_Qwen3-4B-Thinking-2507-GGUF
TheDrummer_Fallen-Gemma3-27B-v1-GGUF
Kwaipilot_KAT-Dev-GGUF
MiniCPM-V-2_6-GGUF
MathCoder2-CodeLlama-7B-GGUF
baidu_ERNIE-4.5-21B-A3B-PT-GGUF
huihui-ai_QwQ-32B-abliterated-GGUF
TheDrummer_Anubis-70B-v1.1-GGUF
cognitivecomputations_Dolphin3.0-Mistral-24B-GGUF
SmolLM2-135M-Instruct-GGUF
Llama-3_1-Nemotron-51B-Instruct-GGUF
Llama-3.3-70B-Instruct-ablated-GGUF
OLMo-2-1124-7B-Instruct-GGUF
BlackSheep-RP-12B-GGUF
huihui-ai_DeepSeek-R1-Distill-Llama-70B-abliterated-GGUF
UI-TARS-7B-DPO-GGUF
TheDrummer_Cydonia-R1-24B-v4.1-GGUF
ibm-granite_granite-4.0-h-micro-GGUF
Llamacpp imatrix Quantizations of granite-4.0-h-micro by ibm-granite Original model: https://huggingface.co/ibm-granite/granite-4.0-h-micro All quants made using imatrix option with dataset from here combined with a subset of combinedallsmall.parquet from Ed Addario here Run them directly with llama.cpp, or any other llama.cpp based project | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | granite-4.0-h-micro-bf16.gguf | bf16 | 6.39GB | false | Full BF16 weights. | | granite-4.0-h-micro-Q80.gguf | Q80 | 3.40GB | false | Extremely high quality, generally unneeded but max available quant. | | granite-4.0-h-micro-Q6KL.gguf | Q6KL | 2.67GB | false | Uses Q80 for embed and output weights. Very high quality, near perfect, recommended. | | granite-4.0-h-micro-Q6K.gguf | Q6K | 2.63GB | false | Very high quality, near perfect, recommended. | | granite-4.0-h-micro-Q5KL.gguf | Q5KL | 2.32GB | false | Uses Q80 for embed and output weights. High quality, recommended. | | granite-4.0-h-micro-Q5KM.gguf | Q5KM | 2.27GB | false | High quality, recommended. | | granite-4.0-h-micro-Q5KS.gguf | Q5KS | 2.23GB | false | High quality, recommended. | | granite-4.0-h-micro-Q41.gguf | Q41 | 2.04GB | false | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | granite-4.0-h-micro-Q4KL.gguf | Q4KL | 1.99GB | false | Uses Q80 for embed and output weights. Good quality, recommended. | | granite-4.0-h-micro-Q4KM.gguf | Q4KM | 1.94GB | false | Good quality, default size for most use cases, recommended. | | granite-4.0-h-micro-Q4KS.gguf | Q4KS | 1.87GB | false | Slightly lower quality with more space savings, recommended. | | granite-4.0-h-micro-Q40.gguf | Q40 | 1.86GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. | | granite-4.0-h-micro-IQ4NL.gguf | IQ4NL | 1.86GB | false | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | granite-4.0-h-micro-IQ4XS.gguf | IQ4XS | 1.76GB | false | Decent quality, smaller than Q4KS with similar performance, recommended. | | granite-4.0-h-micro-Q3KXL.gguf | Q3KXL | 1.69GB | false | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | granite-4.0-h-micro-Q3KL.gguf | Q3KL | 1.64GB | false | Lower quality but usable, good for low RAM availability. | | granite-4.0-h-micro-Q3KM.gguf | Q3KM | 1.56GB | false | Low quality. | | granite-4.0-h-micro-IQ3M.gguf | IQ3M | 1.47GB | false | Medium-low quality, new method with decent performance comparable to Q3KM. | | granite-4.0-h-micro-Q3KS.gguf | Q3KS | 1.46GB | false | Low quality, not recommended. | | granite-4.0-h-micro-IQ3XS.gguf | IQ3XS | 1.41GB | false | Lower quality, new method with decent performance, slightly better than Q3KS. | | granite-4.0-h-micro-IQ3XXS.gguf | IQ3XXS | 1.29GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. | | granite-4.0-h-micro-Q2KL.gguf | Q2KL | 1.28GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | granite-4.0-h-micro-Q2K.gguf | Q2K | 1.23GB | false | Very low quality but surprisingly usable. | | granite-4.0-h-micro-IQ2M.gguf | IQ2M | 1.12GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. | Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to. 
First, make sure you have hugginface-cli installed: If the model is bigger than 50GB, it will have been split into multiple files. In order to download them all to a local folder, run: You can either specify a new local-dir (ibm-granitegranite-4.0-h-micro-Q80) or download them all in place (./) Previously, you would download Q4044/48/88, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights. details in this PR. If you use Q40 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282 you will not be able to run the Q40XX files and will instead need to use Q40. Additionally, if you want to get slightly better quality for , you can use IQ4NL thanks to this PR which will also repack the weights for ARM, though only the 44 for now. The loading time may be slower but it will result in an overall speed incrase. I'm keeping this section to show the potential theoretical uplift in performance from using the Q40 with online repacking. Click to view benchmarks on an AVX2 system (EPYC7702) | model | size | params | backend | threads | test | t/s | % (vs Q40) | | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% | Q4088 offers a nice bump to prompt processing and a small bump to text generation A great write up with charts showing various performances is provided by Artefact2 here The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have. If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB Smaller than that total. 
Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart: But basically, if you're aiming for below Q4, and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in format IQXX, like IQ3M. These are newer and offer better performance for their size. These I-quants can also be used on CPU, but will be slower than their K-quant equivalent, so speed vs performance is a tradeoff you'll have to decide. Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output. Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
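For reference, the percentage column in the benchmark table above is simply each run's throughput divided by the matching Q40 baseline. A quick sketch of that calculation, using a few rows copied from the table:

```python
# Reproduce the "% (vs Q40)" column for two rows from the benchmark table above.
baseline_q40 = {"pp512": 204.03, "tg128": 39.12}  # tokens/s for Q40
q4088        = {"pp512": 271.71, "tg128": 43.51}  # tokens/s for Q4088

for test, tps in q4088.items():
    pct = 100 * tps / baseline_q40[test]
    print(f"{test}: {pct:.0f}% of the Q40 baseline")  # ~133% and ~111%, matching the table
```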
LiquidAI_LFM2-350M-Math-GGUF
meta-llama_Llama-4-Scout-17B-16E-Instruct-old-GGUF
aya-23-8B-GGUF
HuatuoGPT-o1-72B-v0.1-GGUF
Phi-3-mini-4k-instruct-GGUF
Llama-3.1-8B-Lexi-Uncensored-V2-GGUF
ibm-granite_granite-vision-3.2-2b-GGUF
magnum-12b-v2-GGUF
nvidia_Llama-3_3-Nemotron-Super-49B-v1-GGUF
soob3123_amoral-gemma3-4B-GGUF
reader-lm-1.5b-GGUF
Qwen2.5-Coder-14B-GGUF
agentica-org_DeepCoder-14B-Preview-GGUF
nbeerbower_Qwen3-Gutenberg-Encore-14B-GGUF
Qwen_Qwen3-8B-GGUF
Human-Like-LLama3-8B-Instruct-GGUF
nvidia_Llama-3_3-Nemotron-Super-49B-v1_5-GGUF
Llamacpp imatrix Quantizations of Llama-33-Nemotron-Super-49B-v15 by nvidia Original model: https://huggingface.co/nvidia/Llama-33-Nemotron-Super-49B-v15 All quants made using imatrix option with dataset from here Run them directly with llama.cpp, or any other llama.cpp based project | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | Llama-33-Nemotron-Super-49B-v15-bf16.gguf | bf16 | 99.74GB | true | Full BF16 weights. | | Llama-33-Nemotron-Super-49B-v15-Q80.gguf | Q80 | 52.99GB | true | Extremely high quality, generally unneeded but max available quant. | | Llama-33-Nemotron-Super-49B-v15-Q6KL.gguf | Q6KL | 41.43GB | false | Uses Q80 for embed and output weights. Very high quality, near perfect, recommended. | | Llama-33-Nemotron-Super-49B-v15-Q6K.gguf | Q6K | 40.92GB | false | Very high quality, near perfect, recommended. | | Llama-33-Nemotron-Super-49B-v15-Q5KL.gguf | Q5KL | 36.04GB | false | Uses Q80 for embed and output weights. High quality, recommended. | | Llama-33-Nemotron-Super-49B-v15-Q5KM.gguf | Q5KM | 35.39GB | false | High quality, recommended. | | Llama-33-Nemotron-Super-49B-v15-Q5KS.gguf | Q5KS | 34.43GB | false | High quality, recommended. | | Llama-33-Nemotron-Super-49B-v15-Q41.gguf | Q41 | 31.38GB | false | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | Llama-33-Nemotron-Super-49B-v15-Q4KL.gguf | Q4KL | 31.00GB | false | Uses Q80 for embed and output weights. Good quality, recommended. | | Llama-33-Nemotron-Super-49B-v15-Q4KM.gguf | Q4KM | 30.22GB | false | Good quality, default size for most use cases, recommended. | | Llama-33-Nemotron-Super-49B-v15-Q4KS.gguf | Q4KS | 28.63GB | false | Slightly lower quality with more space savings, recommended. | | Llama-33-Nemotron-Super-49B-v15-Q40.gguf | Q40 | 28.46GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. | | Llama-33-Nemotron-Super-49B-v15-IQ4NL.gguf | IQ4NL | 28.38GB | false | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | Llama-33-Nemotron-Super-49B-v15-Q3KXL.gguf | Q3KXL | 27.19GB | false | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | Llama-33-Nemotron-Super-49B-v15-IQ4XS.gguf | IQ4XS | 26.87GB | false | Decent quality, smaller than Q4KS with similar performance, recommended. | | Llama-33-Nemotron-Super-49B-v15-Q3KL.gguf | Q3KL | 26.27GB | false | Lower quality but usable, good for low RAM availability. | | Llama-33-Nemotron-Super-49B-v15-Q3KM.gguf | Q3KM | 24.31GB | false | Low quality. | | Llama-33-Nemotron-Super-49B-v15-IQ3M.gguf | IQ3M | 22.66GB | false | Medium-low quality, new method with decent performance comparable to Q3KM. | | Llama-33-Nemotron-Super-49B-v15-Q3KS.gguf | Q3KS | 21.96GB | false | Low quality, not recommended. | | Llama-33-Nemotron-Super-49B-v15-IQ3XS.gguf | IQ3XS | 20.91GB | false | Lower quality, new method with decent performance, slightly better than Q3KS. | | Llama-33-Nemotron-Super-49B-v15-Q2KL.gguf | Q2KL | 19.77GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | Llama-33-Nemotron-Super-49B-v15-IQ3XXS.gguf | IQ3XXS | 19.52GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. | | Llama-33-Nemotron-Super-49B-v15-Q2K.gguf | Q2K | 18.74GB | false | Very low quality but surprisingly usable. 
| | Llama-33-Nemotron-Super-49B-v15-IQ2M.gguf | IQ2M | 17.16GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. | | Llama-33-Nemotron-Super-49B-v15-IQ2S.gguf | IQ2S | 15.85GB | false | Low quality, uses SOTA techniques to be usable. | | Llama-33-Nemotron-Super-49B-v15-IQ2XS.gguf | IQ2XS | 15.08GB | false | Low quality, uses SOTA techniques to be usable. | | Llama-33-Nemotron-Super-49B-v15-IQ2XXS.gguf | IQ2XXS | 13.66GB | false | Very low quality, uses SOTA techniques to be usable. | Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to. First, make sure you have hugginface-cli installed: If the model is bigger than 50GB, it will have been split into multiple files. In order to download them all to a local folder, run: You can either specify a new local-dir (nvidiaLlama-33-Nemotron-Super-49B-v15-Q80) or download them all in place (./) Previously, you would download Q4044/48/88, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights. details in this PR. If you use Q40 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282 you will not be able to run the Q40XX files and will instead need to use Q40. Additionally, if you want to get slightly better quality for , you can use IQ4NL thanks to this PR which will also repack the weights for ARM, though only the 44 for now. The loading time may be slower but it will result in an overall speed incrase. I'm keeping this section to show the potential theoretical uplift in performance from using the Q40 with online repacking. 
Click to view benchmarks on an AVX2 system (EPYC7702) | model | size | params | backend | threads | test | t/s | % (vs Q40) | | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% | Q4088 offers a nice bump to prompt processing and a small bump to text generation A great write up with charts showing various performances is provided by Artefact2 here The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have. If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB Smaller than that total. Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart: But basically, if you're aiming for below Q4, and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in format IQXX, like IQ3M. These are newer and offer better performance for their size. These I-quants can also be used on CPU, but will be slower than their K-quant equivalent, so speed vs performance is a tradeoff you'll have to decide. Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output. Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
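The I-quant vs K-quant advice above boils down to a simple rule of thumb; the sketch below just encodes it, with the caveat that the exact cut-off and backend support evolve with llama.cpp, so treat it as a starting point rather than a hard rule.

```python
# Rule of thumb from the text: below ~Q4, on cuBLAS (Nvidia) or rocBLAS (AMD),
# prefer I-quants; otherwise K-quants are the low-effort default.
def recommend_family(target_bits: float, backend: str) -> str:
    if target_bits < 4 and backend in {"cuBLAS", "rocBLAS"}:
        return "I-quant (e.g. IQ3M)"      # better quality per GB at low bit rates
    return "K-quant (e.g. Q5KM / Q4KM)"   # safe default, also faster on CPU

print(recommend_family(3, "cuBLAS"))  # I-quant
print(recommend_family(5, "CPU"))     # K-quant
```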
zerofata_MS3.2-PaintedFantasy-Visage-v3-34B-GGUF
Llamacpp imatrix Quantizations of MS3.2-PaintedFantasy-Visage-v3-34B by zerofata Original model: https://huggingface.co/zerofata/MS3.2-PaintedFantasy-Visage-v3-34B All quants made using imatrix option with dataset from here combined with a subset of combinedallsmall.parquet from Ed Addario here Run them directly with llama.cpp, or any other llama.cpp based project | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | MS3.2-PaintedFantasy-Visage-v3-34B-bf16.gguf | bf16 | 68.27GB | true | Full BF16 weights. | | MS3.2-PaintedFantasy-Visage-v3-34B-Q80.gguf | Q80 | 36.27GB | false | Extremely high quality, generally unneeded but max available quant. | | MS3.2-PaintedFantasy-Visage-v3-34B-Q6KL.gguf | Q6KL | 28.33GB | false | Uses Q80 for embed and output weights. Very high quality, near perfect, recommended. | | MS3.2-PaintedFantasy-Visage-v3-34B-Q6K.gguf | Q6K | 28.01GB | false | Very high quality, near perfect, recommended. | | MS3.2-PaintedFantasy-Visage-v3-34B-Q5KL.gguf | Q5KL | 24.65GB | false | Uses Q80 for embed and output weights. High quality, recommended. | | MS3.2-PaintedFantasy-Visage-v3-34B-Q5KM.gguf | Q5KM | 24.23GB | false | High quality, recommended. | | MS3.2-PaintedFantasy-Visage-v3-34B-Q5KS.gguf | Q5KS | 23.56GB | false | High quality, recommended. | | MS3.2-PaintedFantasy-Visage-v3-34B-Q41.gguf | Q41 | 21.47GB | false | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | MS3.2-PaintedFantasy-Visage-v3-34B-Q4KL.gguf | Q4KL | 21.17GB | false | Uses Q80 for embed and output weights. Good quality, recommended. | | MS3.2-PaintedFantasy-Visage-v3-34B-Q4KM.gguf | Q4KM | 20.68GB | false | Good quality, default size for most use cases, recommended. | | MS3.2-PaintedFantasy-Visage-v3-34B-Q4KS.gguf | Q4KS | 19.53GB | false | Slightly lower quality with more space savings, recommended. | | MS3.2-PaintedFantasy-Visage-v3-34B-Q40.gguf | Q40 | 19.46GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. | | MS3.2-PaintedFantasy-Visage-v3-34B-IQ4NL.gguf | IQ4NL | 19.42GB | false | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | MS3.2-PaintedFantasy-Visage-v3-34B-Q3KXL.gguf | Q3KXL | 18.48GB | false | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | MS3.2-PaintedFantasy-Visage-v3-34B-IQ4XS.gguf | IQ4XS | 18.38GB | false | Decent quality, smaller than Q4KS with similar performance, recommended. | | MS3.2-PaintedFantasy-Visage-v3-34B-Q3KL.gguf | Q3KL | 17.89GB | false | Lower quality but usable, good for low RAM availability. | | MS3.2-PaintedFantasy-Visage-v3-34B-Q3KM.gguf | Q3KM | 16.52GB | false | Low quality. | | MS3.2-PaintedFantasy-Visage-v3-34B-IQ3M.gguf | IQ3M | 15.30GB | false | Medium-low quality, new method with decent performance comparable to Q3KM. | | MS3.2-PaintedFantasy-Visage-v3-34B-Q3KS.gguf | Q3KS | 14.94GB | false | Low quality, not recommended. | | MS3.2-PaintedFantasy-Visage-v3-34B-IQ3XS.gguf | IQ3XS | 14.21GB | false | Lower quality, new method with decent performance, slightly better than Q3KS. | | MS3.2-PaintedFantasy-Visage-v3-34B-Q2KL.gguf | Q2KL | 13.40GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | MS3.2-PaintedFantasy-Visage-v3-34B-IQ3XXS.gguf | IQ3XXS | 13.33GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. 
| | MS3.2-PaintedFantasy-Visage-v3-34B-Q2K.gguf | Q2K | 12.74GB | false | Very low quality but surprisingly usable. | | MS3.2-PaintedFantasy-Visage-v3-34B-IQ2M.gguf | IQ2M | 11.60GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. | | MS3.2-PaintedFantasy-Visage-v3-34B-IQ2S.gguf | IQ2S | 10.66GB | false | Low quality, uses SOTA techniques to be usable. | | MS3.2-PaintedFantasy-Visage-v3-34B-IQ2XS.gguf | IQ2XS | 10.30GB | false | Low quality, uses SOTA techniques to be usable. | | MS3.2-PaintedFantasy-Visage-v3-34B-IQ2XXS.gguf | IQ2XXS | 9.32GB | false | Very low quality, uses SOTA techniques to be usable. | Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to. First, make sure you have hugginface-cli installed: If the model is bigger than 50GB, it will have been split into multiple files. In order to download them all to a local folder, run: You can either specify a new local-dir (zerofataMS3.2-PaintedFantasy-Visage-v3-34B-Q80) or download them all in place (./) Previously, you would download Q4044/48/88, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights. details in this PR. If you use Q40 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282 you will not be able to run the Q40XX files and will instead need to use Q40. Additionally, if you want to get slightly better quality for , you can use IQ4NL thanks to this PR which will also repack the weights for ARM, though only the 44 for now. The loading time may be slower but it will result in an overall speed incrase. I'm keeping this section to show the potential theoretical uplift in performance from using the Q40 with online repacking. 
Click to view benchmarks on an AVX2 system (EPYC7702) | model | size | params | backend | threads | test | t/s | % (vs Q40) | | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% | Q4088 offers a nice bump to prompt processing and a small bump to text generation A great write up with charts showing various performances is provided by Artefact2 here The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have. If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB Smaller than that total. Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart: But basically, if you're aiming for below Q4, and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in format IQXX, like IQ3M. These are newer and offer better performance for their size. These I-quants can also be used on CPU, but will be slower than their K-quant equivalent, so speed vs performance is a tradeoff you'll have to decide. Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output. Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
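Since these files are meant for llama.cpp or any llama.cpp based project, one possible way to load a quant from this table is through the llama-cpp-python bindings. The model path, context size and layer offload count below are placeholders to tune for your hardware; this is a sketch, not the only way to run it.

```python
# Sketch using the llama-cpp-python bindings (one of many llama.cpp based options).
# model_path, n_ctx and n_gpu_layers are assumptions -- adjust to your setup.
from llama_cpp import Llama

llm = Llama(
    model_path="MS3.2-PaintedFantasy-Visage-v3-34B-Q4KM.gguf",  # 20.68GB per the table
    n_ctx=8192,        # context window to allocate
    n_gpu_layers=-1,   # -1 offloads every layer that fits to the GPU
)
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Give me a one-line fantasy scene description."}],
    max_tokens=64,
)
print(out["choices"][0]["message"]["content"])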
swiss-ai_Apertus-70B-Instruct-2509-GGUF
inclusionAI_Ling-mini-2.0-GGUF
internlm_OREAL-DeepSeek-R1-Distill-Qwen-7B-GGUF
OpenGVLab_InternVL3_5-14B-GGUF
Qwen_Qwen3-4B-GGUF
inclusionAI_Ring-mini-2.0-GGUF
Delta-Vector_Austral-32B-GLM4-Winton-GGUF
nomic-ai_nomic-embed-code-GGUF
Qwen2.5-Coder-1.5B-Instruct-GGUF
agentica-org_DeepSWE-Preview-GGUF
Qwen_Qwen2.5-VL-7B-Instruct-GGUF
Qwen2.5-14B_Uncencored-GGUF
nvidia_Llama-3.1-Nemotron-Nano-8B-v1-GGUF
AllThingsIntel_Apollo-V0.1-4B-Thinking-GGUF
EVA-Qwen2.5-32B-v0.2-GGUF
v2ray_GPT4chan-24B-GGUF
ibm-granite_granite-4.0-h-tiny-GGUF
Llamacpp imatrix Quantizations of granite-4.0-h-tiny by ibm-granite Original model: https://huggingface.co/ibm-granite/granite-4.0-h-tiny All quants made using imatrix option with dataset from here combined with a subset of combinedallsmall.parquet from Ed Addario here Run them directly with llama.cpp, or any other llama.cpp based project | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | granite-4.0-h-tiny-bf16.gguf | bf16 | 13.89GB | false | Full BF16 weights. | | granite-4.0-h-tiny-Q80.gguf | Q80 | 7.39GB | false | Extremely high quality, generally unneeded but max available quant. | | granite-4.0-h-tiny-Q6KL.gguf | Q6KL | 5.79GB | false | Uses Q80 for embed and output weights. Very high quality, near perfect, recommended. | | granite-4.0-h-tiny-Q6K.gguf | Q6K | 5.76GB | false | Very high quality, near perfect, recommended. | | granite-4.0-h-tiny-Q5KL.gguf | Q5KL | 5.05GB | false | Uses Q80 for embed and output weights. High quality, recommended. | | granite-4.0-h-tiny-Q5KM.gguf | Q5KM | 5.02GB | false | High quality, recommended. | | granite-4.0-h-tiny-Q5KS.gguf | Q5KS | 4.86GB | false | High quality, recommended. | | granite-4.0-h-tiny-Q41.gguf | Q41 | 4.44GB | false | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | granite-4.0-h-tiny-Q4KL.gguf | Q4KL | 4.33GB | false | Uses Q80 for embed and output weights. Good quality, recommended. | | granite-4.0-h-tiny-Q4KM.gguf | Q4KM | 4.30GB | false | Good quality, default size for most use cases, recommended. | | granite-4.0-h-tiny-Q4KS.gguf | Q4KS | 4.15GB | false | Slightly lower quality with more space savings, recommended. | | granite-4.0-h-tiny-Q40.gguf | Q40 | 4.09GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. | | granite-4.0-h-tiny-IQ4NL.gguf | IQ4NL | 4.02GB | false | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | granite-4.0-h-tiny-IQ4XS.gguf | IQ4XS | 3.82GB | false | Decent quality, smaller than Q4KS with similar performance, recommended. | | granite-4.0-h-tiny-Q3KXL.gguf | Q3KXL | 3.45GB | false | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | granite-4.0-h-tiny-Q3KL.gguf | Q3KL | 3.41GB | false | Lower quality but usable, good for low RAM availability. | | granite-4.0-h-tiny-Q3KM.gguf | Q3KM | 3.29GB | false | Low quality. | | granite-4.0-h-tiny-IQ3M.gguf | IQ3M | 3.29GB | false | Medium-low quality, new method with decent performance comparable to Q3KM. | | granite-4.0-h-tiny-Q3KS.gguf | Q3KS | 3.15GB | false | Low quality, not recommended. | | granite-4.0-h-tiny-IQ3XS.gguf | IQ3XS | 3.01GB | false | Lower quality, new method with decent performance, slightly better than Q3KS. | | granite-4.0-h-tiny-IQ3XXS.gguf | IQ3XXS | 2.87GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. | | granite-4.0-h-tiny-Q2KL.gguf | Q2KL | 2.62GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | granite-4.0-h-tiny-Q2K.gguf | Q2K | 2.59GB | false | Very low quality but surprisingly usable. | | granite-4.0-h-tiny-IQ2M.gguf | IQ2M | 2.29GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. | Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to. 
First, make sure you have hugginface-cli installed: If the model is bigger than 50GB, it will have been split into multiple files. In order to download them all to a local folder, run: You can either specify a new local-dir (ibm-granitegranite-4.0-h-tiny-Q80) or download them all in place (./) Previously, you would download Q4044/48/88, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights. details in this PR. If you use Q40 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282 you will not be able to run the Q40XX files and will instead need to use Q40. Additionally, if you want to get slightly better quality for , you can use IQ4NL thanks to this PR which will also repack the weights for ARM, though only the 44 for now. The loading time may be slower but it will result in an overall speed incrase. I'm keeping this section to show the potential theoretical uplift in performance from using the Q40 with online repacking. Click to view benchmarks on an AVX2 system (EPYC7702) | model | size | params | backend | threads | test | t/s | % (vs Q40) | | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% | Q4088 offers a nice bump to prompt processing and a small bump to text generation A great write up with charts showing various performances is provided by Artefact2 here The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have. If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB Smaller than that total. 
Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart: But basically, if you're aiming for below Q4, and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in format IQXX, like IQ3M. These are newer and offer better performance for their size. These I-quants can also be used on CPU, but will be slower than their K-quant equivalent, so speed vs performance is a tradeoff you'll have to decide. Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output. Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
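If you'd rather not read the size column by hand, the available quants and their file sizes can also be listed programmatically. A sketch with huggingface_hub, where the repo id is an assumption and files_metadata=True is what makes the API return per-file sizes:

```python
# Sketch: list the available .gguf quants and their sizes for a repo.
# The repo id is an assumption -- point it at the repo you are browsing.
from huggingface_hub import HfApi

info = HfApi().model_info("bartowski/ibm-granite_granite-4.0-h-tiny-GGUF",
                          files_metadata=True)
for f in info.siblings:
    if f.rfilename.endswith(".gguf"):
        print(f"{f.rfilename}: {f.size / 1e9:.2f} GB")
```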
Delta-Vector_Austral-24B-Winton-GGUF
huihui-ai_Mistral-Small-24B-Instruct-2501-abliterated-GGUF
granite-20b-code-instruct-GGUF
swiss-ai_Apertus-8B-Instruct-2509-GGUF
uncensoredai_UncensoredLM-DeepSeek-R1-Distill-Qwen-14B-GGUF
zai-org_GLM-4.7-Flash-GGUF
FuseO1-DeepSeekR1-QwQ-SkyT1-32B-Preview-GGUF
TheDrummer_Valkyrie-49B-v1-GGUF
LLaMA-Mesh-GGUF
LiquidAI_LFM2-8B-A1B-GGUF
Llamacpp imatrix Quantizations of LFM2-8B-A1B by LiquidAI Original model: https://huggingface.co/LiquidAI/LFM2-8B-A1B All quants made using imatrix option with dataset from here combined with a subset of combinedallsmall.parquet from Ed Addario here Run them directly with llama.cpp, or any other llama.cpp based project No chat template specified so default is used. This may be incorrect, check original model card for details. | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | LFM2-8B-A1B-bf16.gguf | bf16 | 16.69GB | false | Full BF16 weights. | | LFM2-8B-A1B-Q80.gguf | Q80 | 8.87GB | false | Extremely high quality, generally unneeded but max available quant. | | LFM2-8B-A1B-Q6KL.gguf | Q6KL | 6.88GB | false | Uses Q80 for embed and output weights. Very high quality, near perfect, recommended. | | LFM2-8B-A1B-Q6K.gguf | Q6K | 6.85GB | false | Very high quality, near perfect, recommended. | | LFM2-8B-A1B-Q5KL.gguf | Q5KL | 5.95GB | false | Uses Q80 for embed and output weights. High quality, recommended. | | LFM2-8B-A1B-Q5KM.gguf | Q5KM | 5.92GB | false | High quality, recommended. | | LFM2-8B-A1B-Q5KS.gguf | Q5KS | 5.76GB | false | High quality, recommended. | | LFM2-8B-A1B-Q41.gguf | Q41 | 5.25GB | false | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | LFM2-8B-A1B-Q4KL.gguf | Q4KL | 5.08GB | false | Uses Q80 for embed and output weights. Good quality, recommended. | | LFM2-8B-A1B-Q4KM.gguf | Q4KM | 5.05GB | false | Good quality, default size for most use cases, recommended. | | LFM2-8B-A1B-Q4KS.gguf | Q4KS | 4.89GB | false | Slightly lower quality with more space savings, recommended. | | LFM2-8B-A1B-Q40.gguf | Q40 | 4.81GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. | | LFM2-8B-A1B-IQ4NL.gguf | IQ4NL | 4.74GB | false | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | LFM2-8B-A1B-IQ4XS.gguf | IQ4XS | 4.48GB | false | Decent quality, smaller than Q4KS with similar performance, recommended. | | LFM2-8B-A1B-Q3KXL.gguf | Q3KXL | 3.99GB | false | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | LFM2-8B-A1B-Q3KL.gguf | Q3KL | 3.96GB | false | Lower quality but usable, good for low RAM availability. | | LFM2-8B-A1B-Q3KM.gguf | Q3KM | 3.82GB | false | Low quality. | | LFM2-8B-A1B-IQ3M.gguf | IQ3M | 3.82GB | false | Medium-low quality, new method with decent performance comparable to Q3KM. | | LFM2-8B-A1B-Q3KS.gguf | Q3KS | 3.65GB | false | Low quality, not recommended. | | LFM2-8B-A1B-IQ3XS.gguf | IQ3XS | 3.46GB | false | Lower quality, new method with decent performance, slightly better than Q3KS. | | LFM2-8B-A1B-IQ3XXS.gguf | IQ3XXS | 3.31GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. | | LFM2-8B-A1B-Q2KL.gguf | Q2KL | 2.98GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | LFM2-8B-A1B-Q2K.gguf | Q2K | 2.95GB | false | Very low quality but surprisingly usable. | | LFM2-8B-A1B-IQ2M.gguf | IQ2M | 2.65GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. | Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to. 
First, make sure you have huggingface-cli installed. If the model is bigger than 50GB, it will have been split into multiple files; in order to download them all to a local folder, run the huggingface-cli download command (see the example at the end of this card). You can either specify a new local-dir (LiquidAI_LFM2-8B-A1B-Q8_0) or download them all in place (./).

Previously, you would download Q4_0_4_4/4_8/8_8, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights; details are in this PR. If you use Q4_0 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282 you will not be able to run the Q4_0_X_X files and will instead need to use Q4_0. Additionally, if you want to get slightly better quality for ARM, you can use IQ4_NL thanks to this PR, which will also repack the weights for ARM (though only the 4_4 for now). The loading time may be slower but it will result in an overall speed increase. I'm keeping this section to show the potential theoretical uplift in performance from using Q4_0 with online repacking.

Benchmarks on an AVX2 system (EPYC7702):

| model | size | params | backend | threads | test | t/s | % (vs Q4_0) |
| ------------------ | ---------: | ------: | ------- | ------: | ------: | --------------: | ----------: |
| qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% |
| qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% |
| qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% |
| qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% |
| qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% |
| qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% |
| qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% |
| qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% |
| qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% |
| qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% |
| qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% |
| qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% |
| qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% |
| qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% |
| qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% |
| qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% |
| qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% |
| qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% |

Q4_0_8_8 offers a nice bump to prompt processing and a small bump to text generation.

A great write-up with charts showing various performances is provided by Artefact2 here. The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have. If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB smaller than that total.

Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in format 'QX_K_X', like Q5_K_M. If you want to get more into the weeds, you can check out this extremely useful feature chart. But basically, if you're aiming for below Q4, and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in format IQX_X, like IQ3_M. These are newer and offer better performance for their size. These I-quants can also be used on CPU, but will be slower than their K-quant equivalents, so speed vs performance is a tradeoff you'll have to decide. Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output. Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
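To make the download step above concrete, here is a minimal Python sketch using the huggingface_hub package (the same package that ships huggingface-cli). The repo id bartowski/LiquidAI_LFM2-8B-A1B-GGUF is an assumption based on this card's naming; swap in the actual repo and whichever quant file you want.

```python
# Minimal download sketch; repo_id and filenames are assumptions, adjust to taste.
from huggingface_hub import hf_hub_download, snapshot_download

repo_id = "bartowski/LiquidAI_LFM2-8B-A1B-GGUF"  # assumed repo id

# Single-file quant (anything listed as Split = false), e.g. Q4_K_M:
path = hf_hub_download(
    repo_id=repo_id,
    filename="LFM2-8B-A1B-Q4_K_M.gguf",
    local_dir="./",  # download in place
)
print("model at:", path)

# For quants marked Split = true (the >50GB models further down this page),
# fetch every shard of one quant with a pattern instead:
snapshot_download(
    repo_id=repo_id,
    allow_patterns=["*Q4_K_M*"],
    local_dir="./",  # or a new local-dir, as described above
)
```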
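A correspondingly small sketch of running a downloaded file, using llama-cpp-python as one of the "llama.cpp based" projects mentioned above; the model path, context size, and prompt are placeholders.

```python
# Tiny run sketch via llama-cpp-python (pip install llama-cpp-python).
from llama_cpp import Llama

llm = Llama(
    model_path="./LFM2-8B-A1B-Q4_K_M.gguf",  # placeholder path
    n_ctx=4096,        # context window
    n_gpu_layers=-1,   # offload everything that fits to the GPU
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Say hello in one sentence."}]
)
print(out["choices"][0]["message"]["content"])
```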
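Finally, the sizing rule from the "how big a model can you run" discussion above is simple arithmetic. The helper below is only an illustration, with sizes copied from this card's table (quant names written with their usual underscores).

```python
# Illustration of the sizing rule: pick the largest quant whose file is
# roughly 1-2GB smaller than the memory you can dedicate to it.
# Sizes (GB) come from the LFM2-8B-A1B table above, largest first.
QUANTS = [
    ("Q8_0", 8.87), ("Q6_K", 6.85), ("Q5_K_M", 5.92), ("Q4_K_M", 5.05),
    ("IQ4_XS", 4.48), ("IQ3_M", 3.82), ("Q2_K", 2.95), ("IQ2_M", 2.65),
]

def pick_quant(memory_gb: float, headroom_gb: float = 1.5) -> str:
    """Return the biggest quant that still leaves headroom_gb free."""
    budget = memory_gb - headroom_gb
    for name, size_gb in QUANTS:
        if size_gb <= budget:
            return name
    return "nothing fits; consider a smaller model"

print(pick_quant(8.0))         # 8GB of VRAM, all on GPU -> Q5_K_M
print(pick_quant(8.0 + 16.0))  # 8GB VRAM + 16GB RAM for max quality -> Q8_0
```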
zerofata_GLM-4.5-Iceblink-106B-A12B-GGUF
Llamacpp imatrix Quantizations of GLM-4.5-Iceblink-106B-A12B by zerofata Original model: https://huggingface.co/zerofata/GLM-4.5-Iceblink-106B-A12B All quants made using imatrix option with dataset from here combined with a subset of combinedallsmall.parquet from Ed Addario here Run them directly with llama.cpp, or any other llama.cpp based project | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | GLM-4.5-Iceblink-106B-A12B-Q80.gguf | Q80 | 117.46GB | true | Extremely high quality, generally unneeded but max available quant. | | GLM-4.5-Iceblink-106B-A12B-Q6K.gguf | Q6K | 99.18GB | true | Very high quality, near perfect, recommended. | | GLM-4.5-Iceblink-106B-A12B-Q5KM.gguf | Q5KM | 83.72GB | true | High quality, recommended. | | GLM-4.5-Iceblink-106B-A12B-Q5KS.gguf | Q5KS | 78.55GB | true | High quality, recommended. | | GLM-4.5-Iceblink-106B-A12B-Q4KM.gguf | Q4KM | 73.50GB | true | Good quality, default size for most use cases, recommended. | | GLM-4.5-Iceblink-106B-A12B-Q41.gguf | Q41 | 69.55GB | true | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | GLM-4.5-Iceblink-106B-A12B-Q4KS.gguf | Q4KS | 68.31GB | true | Slightly lower quality with more space savings, recommended. | | GLM-4.5-Iceblink-106B-A12B-Q40.gguf | Q40 | 63.76GB | true | Legacy format, offers online repacking for ARM and AVX CPU inference. | | GLM-4.5-Iceblink-106B-A12B-IQ4NL.gguf | IQ4NL | 63.06GB | true | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | GLM-4.5-Iceblink-106B-A12B-IQ4XS.gguf | IQ4XS | 60.81GB | true | Decent quality, smaller than Q4KS with similar performance, recommended. | | GLM-4.5-Iceblink-106B-A12B-Q3KXL.gguf | Q3KXL | 56.45GB | true | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | GLM-4.5-Iceblink-106B-A12B-Q3KL.gguf | Q3KL | 55.91GB | true | Lower quality but usable, good for low RAM availability. | | GLM-4.5-Iceblink-106B-A12B-Q3KM.gguf | Q3KM | 55.48GB | true | Low quality. | | GLM-4.5-Iceblink-106B-A12B-IQ3M.gguf | IQ3M | 55.48GB | true | Medium-low quality, new method with decent performance comparable to Q3KM. | | GLM-4.5-Iceblink-106B-A12B-Q3KS.gguf | Q3KS | 53.42GB | true | Low quality, not recommended. | | GLM-4.5-Iceblink-106B-A12B-IQ3XS.gguf | IQ3XS | 50.84GB | true | Lower quality, new method with decent performance, slightly better than Q3KS. | | GLM-4.5-Iceblink-106B-A12B-IQ3XXS.gguf | IQ3XXS | 50.34GB | true | Lower quality, new method with decent performance, comparable to Q3 quants. | | GLM-4.5-Iceblink-106B-A12B-Q2KL.gguf | Q2KL | 46.71GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | GLM-4.5-Iceblink-106B-A12B-Q2K.gguf | Q2K | 46.10GB | false | Very low quality but surprisingly usable. | | GLM-4.5-Iceblink-106B-A12B-IQ2M.gguf | IQ2M | 45.12GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. | | GLM-4.5-Iceblink-106B-A12B-IQ2S.gguf | IQ2S | 42.54GB | false | Low quality, uses SOTA techniques to be usable. | | GLM-4.5-Iceblink-106B-A12B-IQ2XS.gguf | IQ2XS | 42.19GB | false | Low quality, uses SOTA techniques to be usable. | | GLM-4.5-Iceblink-106B-A12B-IQ2XXS.gguf | IQ2XXS | 39.62GB | false | Very low quality, uses SOTA techniques to be usable. | | GLM-4.5-Iceblink-106B-A12B-IQ1M.gguf | IQ1M | 37.86GB | false | Extremely low quality, not recommended. 
| Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to.
a-m-team_AM-Thinking-v1-GGUF
baichuan-inc_Baichuan-M2-32B-GGUF
ai21labs_AI21-Jamba-Large-1.7-GGUF
Llamacpp imatrix Quantizations of AI21-Jamba-Large-1.7 by ai21labs Original model: https://huggingface.co/ai21labs/AI21-Jamba-Large-1.7 All quants made using imatrix option with dataset from here combined with a subset of combinedallsmall.parquet from Ed Addario here Run them directly with llama.cpp, or any other llama.cpp based project | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | AI21-Jamba-Large-1.7-Q80.gguf | Q80 | 423.55GB | true | Extremely high quality, generally unneeded but max available quant. | | AI21-Jamba-Large-1.7-Q6K.gguf | Q6K | 327.05GB | true | Very high quality, near perfect, recommended. | | AI21-Jamba-Large-1.7-Q5KM.gguf | Q5KM | 282.39GB | true | High quality, recommended. | | AI21-Jamba-Large-1.7-Q5KS.gguf | Q5KS | 274.21GB | true | High quality, recommended. | | AI21-Jamba-Large-1.7-Q41.gguf | Q41 | 249.34GB | true | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | AI21-Jamba-Large-1.7-Q4KL.gguf | Q4KL | 240.83GB | true | Uses Q80 for embed and output weights. Good quality, recommended. | | AI21-Jamba-Large-1.7-Q4KM.gguf | Q4KM | 240.44GB | true | Good quality, default size for most use cases, recommended. | | AI21-Jamba-Large-1.7-Q4KS.gguf | Q4KS | 231.92GB | true | Slightly lower quality with more space savings, recommended. | | AI21-Jamba-Large-1.7-Q40.gguf | Q40 | 228.15GB | true | Legacy format, offers online repacking for ARM and AVX CPU inference. | | AI21-Jamba-Large-1.7-IQ4NL.gguf | IQ4NL | 224.54GB | true | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | AI21-Jamba-Large-1.7-IQ4XS.gguf | IQ4XS | 212.13GB | true | Decent quality, smaller than Q4KS with similar performance, recommended. | | AI21-Jamba-Large-1.7-Q3KXL.gguf | Q3KXL | 188.93GB | true | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | AI21-Jamba-Large-1.7-Q3KL.gguf | Q3KL | 188.46GB | true | Lower quality but usable, good for low RAM availability. | | AI21-Jamba-Large-1.7-Q3KM.gguf | Q3KM | 180.50GB | true | Low quality. | | AI21-Jamba-Large-1.7-IQ3M.gguf | IQ3M | 179.62GB | true | Medium-low quality, new method with decent performance comparable to Q3KM. | | AI21-Jamba-Large-1.7-Q3KS.gguf | Q3KS | 171.77GB | true | Low quality, not recommended. | | AI21-Jamba-Large-1.7-IQ3XS.gguf | IQ3XS | 163.08GB | true | Lower quality, new method with decent performance, slightly better than Q3KS. | | AI21-Jamba-Large-1.7-IQ3XXS.gguf | IQ3XXS | 155.79GB | true | Lower quality, new method with decent performance, comparable to Q3 quants. | | AI21-Jamba-Large-1.7-Q2KL.gguf | Q2KL | 138.58GB | true | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | AI21-Jamba-Large-1.7-Q2K.gguf | Q2K | 138.06GB | true | Very low quality but surprisingly usable. | | AI21-Jamba-Large-1.7-IQ2M.gguf | IQ2M | 125.79GB | true | Relatively low quality, uses SOTA techniques to be surprisingly usable. | | AI21-Jamba-Large-1.7-IQ2S.gguf | IQ2S | 111.14GB | true | Low quality, uses SOTA techniques to be usable. | | AI21-Jamba-Large-1.7-IQ2XS.gguf | IQ2XS | 110.71GB | true | Low quality, uses SOTA techniques to be usable. | | AI21-Jamba-Large-1.7-IQ2XXS.gguf | IQ2XXS | 96.37GB | true | Very low quality, uses SOTA techniques to be usable. | | AI21-Jamba-Large-1.7-IQ1M.gguf | IQ1M | 85.91GB | true | Extremely low quality, not recommended. 
| | AI21-Jamba-Large-1.7-IQ1S.gguf | IQ1S | 81.88GB | true | Extremely low quality, not recommended. | Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to.
DeepSeek-R1-GGUF
zetasepic_Mistral-Small-Instruct-2409-abliterated-GGUF
allura-org_Q3-30B-A3B-Designant-GGUF
mlabonne_gemma-3-12b-it-abliterated-GGUF
google_gemma-3n-E2B-it-GGUF
EXAONE-3.0-7.8B-Instruct-GGUF
sophosympatheia_Strawberrylemonade-70B-v1.1-GGUF
MN-12B-Celeste-V1.9-GGUF
Llama-3.1-SuperNova-Lite-GGUF
perplexity-ai_r1-1776-distill-llama-70b-GGUF
Qwen2.5-32B-ArliAI-RPMax-v1.3-GGUF
QVQ-72B-Preview-GGUF
LongWriter-llama3.1-8b-GGUF
Qwen_Qwen3-VL-2B-Instruct-GGUF
Llamacpp imatrix Quantizations of Qwen3-VL-2B-Instruct by Qwen Original model: https://huggingface.co/Qwen/Qwen3-VL-2B-Instruct All quants made using imatrix option with dataset from here combined with a subset of combinedallsmall.parquet from Ed Addario here Run them directly with llama.cpp, or any other llama.cpp based project | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | Qwen3-VL-2B-Instruct-bf16.gguf | bf16 | 3.45GB | false | Full BF16 weights. | | Qwen3-VL-2B-Instruct-Q80.gguf | Q80 | 1.83GB | false | Extremely high quality, generally unneeded but max available quant. | | Qwen3-VL-2B-Instruct-Q6KL.gguf | Q6KL | 1.49GB | false | Uses Q80 for embed and output weights. Very high quality, near perfect, recommended. | | Qwen3-VL-2B-Instruct-Q6K.gguf | Q6K | 1.42GB | false | Very high quality, near perfect, recommended. | | Qwen3-VL-2B-Instruct-Q5KL.gguf | Q5KL | 1.33GB | false | Uses Q80 for embed and output weights. High quality, recommended. | | Qwen3-VL-2B-Instruct-Q5KM.gguf | Q5KM | 1.26GB | false | High quality, recommended. | | Qwen3-VL-2B-Instruct-Q5KS.gguf | Q5KS | 1.23GB | false | High quality, recommended. | | Qwen3-VL-2B-Instruct-Q4KL.gguf | Q4KL | 1.18GB | false | Uses Q80 for embed and output weights. Good quality, recommended. | | Qwen3-VL-2B-Instruct-Q41.gguf | Q41 | 1.14GB | false | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | Qwen3-VL-2B-Instruct-Q4KM.gguf | Q4KM | 1.11GB | false | Good quality, default size for most use cases, recommended. | | Qwen3-VL-2B-Instruct-Q3KXL.gguf | Q3KXL | 1.08GB | false | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | Qwen3-VL-2B-Instruct-Q4KS.gguf | Q4KS | 1.06GB | false | Slightly lower quality with more space savings, recommended. | | Qwen3-VL-2B-Instruct-Q40.gguf | Q40 | 1.06GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. | | Qwen3-VL-2B-Instruct-IQ4NL.gguf | IQ4NL | 1.05GB | false | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | Qwen3-VL-2B-Instruct-IQ4XS.gguf | IQ4XS | 1.01GB | false | Decent quality, smaller than Q4KS with similar performance, recommended. | | Qwen3-VL-2B-Instruct-Q3KL.gguf | Q3KL | 1.00GB | false | Lower quality but usable, good for low RAM availability. | | Qwen3-VL-2B-Instruct-Q3KM.gguf | Q3KM | 0.94GB | false | Low quality. | | Qwen3-VL-2B-Instruct-IQ3M.gguf | IQ3M | 0.90GB | false | Medium-low quality, new method with decent performance comparable to Q3KM. | | Qwen3-VL-2B-Instruct-Q3KS.gguf | Q3KS | 0.87GB | false | Low quality, not recommended. | | Qwen3-VL-2B-Instruct-Q2KL.gguf | Q2KL | 0.85GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | Qwen3-VL-2B-Instruct-IQ3XS.gguf | IQ3XS | 0.83GB | false | Lower quality, new method with decent performance, slightly better than Q3KS. | | Qwen3-VL-2B-Instruct-Q2K.gguf | Q2K | 0.78GB | false | Very low quality but surprisingly usable. | | Qwen3-VL-2B-Instruct-IQ3XXS.gguf | IQ3XXS | 0.75GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. | | Qwen3-VL-2B-Instruct-IQ2M.gguf | IQ2M | 0.70GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. 
| Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to.
Qwen2.5-Coder-0.5B-Instruct-GGUF
nvidia_AceReason-Nemotron-14B-GGUF
Qwen_Qwen3-VL-4B-Thinking-GGUF
Menlo_Lucy-GGUF
Human-Like-Mistral-Nemo-Instruct-2407-GGUF
Qwen_Qwen3-1.7B-GGUF
MN-12B-Mag-Mell-R1-GGUF
burtenshaw_GemmaCoder3-12B-GGUF
Qwen2.5-14B-Instruct-1M-GGUF
granite-3.0-8b-instruct-GGUF
Impish_Mind_8B-GGUF
Stheno-Hercules-3.1-8B-GGUF
TheDrummer_Cydonia-R1-24B-v4-GGUF
Meta-Llama-3.1-8B-Claude-GGUF
open-thoughts_OpenThinker-32B-GGUF
nvidia_Llama-3.1-Nemotron-Nano-4B-v1.1-GGUF
v6-Finch-7B-HF-GGUF
PocketDoc_Dans-PersonalityEngine-V1.3.0-12b-GGUF
TheDrummer_Behemoth-R1-123B-v2-GGUF
google_medgemma-27b-it-GGUF
SmallThinker-3B-Preview-GGUF
mistralai_Magistral-Small-2506-GGUF
Sailor2-1B-Chat-GGUF
ilsp_Llama-Krikri-8B-Instruct-GGUF
xai-org_grok-2-GGUF
Llamacpp imatrix Quantizations of grok-2 by xai-org Original model: https://huggingface.co/xai-org/grok-2 All quants made using imatrix option with dataset from here combined with a subset of combinedallsmall.parquet from Ed Addario here Run them directly with llama.cpp, or any other llama.cpp based project | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | grok-2-Q80.gguf | Q80 | 286.39GB | true | Extremely high quality, generally unneeded but max available quant. | | grok-2-Q6K.gguf | Q6K | 221.37GB | true | Very high quality, near perfect, recommended. | | grok-2-Q5KM.gguf | Q5KM | 191.57GB | true | High quality, recommended. | | grok-2-Q5KS.gguf | Q5KS | 185.87GB | true | High quality, recommended. | | grok-2-Q41.gguf | Q41 | 169.16GB | true | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | grok-2-Q4KL.gguf | Q4KL | 164.85GB | true | Uses Q80 for embed and output weights. Good quality, recommended. | | grok-2-Q4KM.gguf | Q4KM | 164.06GB | true | Good quality, default size for most use cases, recommended. | | grok-2-Q4KS.gguf | Q4KS | 157.55GB | true | Slightly lower quality with more space savings, recommended. | | grok-2-Q40.gguf | Q40 | 154.73GB | true | Legacy format, offers online repacking for ARM and AVX CPU inference. | | grok-2-IQ4NL.gguf | IQ4NL | 152.98GB | true | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | grok-2-IQ4XS.gguf | IQ4XS | 144.76GB | true | Decent quality, smaller than Q4KS with similar performance, recommended. | | grok-2-Q3KXL.gguf | Q3KXL | 131.16GB | true | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | grok-2-Q3KL.gguf | Q3KL | 130.22GB | true | Lower quality but usable, good for low RAM availability. | | grok-2-Q3KM.gguf | Q3KM | 125.02GB | true | Low quality. | | grok-2-IQ3M.gguf | IQ3M | 123.75GB | true | Medium-low quality, new method with decent performance comparable to Q3KM. | | grok-2-Q3KS.gguf | Q3KS | 118.04GB | true | Low quality, not recommended. | | grok-2-IQ3XS.gguf | IQ3XS | 111.80GB | true | Lower quality, new method with decent performance, slightly better than Q3KS. | | grok-2-IQ3XXS.gguf | IQ3XXS | 106.96GB | true | Lower quality, new method with decent performance, comparable to Q3 quants. | | grok-2-Q2KL.gguf | Q2KL | 97.61GB | true | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | grok-2-Q2K.gguf | Q2K | 96.56GB | true | Very low quality but surprisingly usable. | | grok-2-IQ2M.gguf | IQ2M | 88.21GB | true | Relatively low quality, uses SOTA techniques to be surprisingly usable. | | grok-2-IQ2S.gguf | IQ2S | 78.71GB | true | Low quality, uses SOTA techniques to be usable. | | grok-2-IQ2XS.gguf | IQ2XS | 77.74GB | true | Low quality, uses SOTA techniques to be usable. | | grok-2-IQ2XXS.gguf | IQ2XXS | 68.52GB | true | Very low quality, uses SOTA techniques to be usable. | | grok-2-IQ1M.gguf | IQ1M | 61.38GB | true | Extremely low quality, not recommended. | Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to.
CrucibleLab_M3.2-24B-Loki-V1.3-GGUF
Llamacpp imatrix Quantizations of M3.2-24B-Loki-V1.3 by CrucibleLab Original model: https://huggingface.co/CrucibleLab/M3.2-24B-Loki-V1.3 All quants made using imatrix option with dataset from here combined with a subset of combinedallsmall.parquet from Ed Addario here Run them directly with llama.cpp, or any other llama.cpp based project | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | M3.2-24B-Loki-V1.3-bf16.gguf | bf16 | 47.15GB | false | Full BF16 weights. | | M3.2-24B-Loki-V1.3-Q80.gguf | Q80 | 25.05GB | false | Extremely high quality, generally unneeded but max available quant. | | M3.2-24B-Loki-V1.3-Q6KL.gguf | Q6KL | 19.67GB | false | Uses Q80 for embed and output weights. Very high quality, near perfect, recommended. | | M3.2-24B-Loki-V1.3-Q6K.gguf | Q6K | 19.35GB | false | Very high quality, near perfect, recommended. | | M3.2-24B-Loki-V1.3-Q5KL.gguf | Q5KL | 17.18GB | false | Uses Q80 for embed and output weights. High quality, recommended. | | M3.2-24B-Loki-V1.3-Q5KM.gguf | Q5KM | 16.76GB | false | High quality, recommended. | | M3.2-24B-Loki-V1.3-Q5KS.gguf | Q5KS | 16.30GB | false | High quality, recommended. | | M3.2-24B-Loki-V1.3-Q41.gguf | Q41 | 14.87GB | false | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | M3.2-24B-Loki-V1.3-Q4KL.gguf | Q4KL | 14.83GB | false | Uses Q80 for embed and output weights. Good quality, recommended. | | M3.2-24B-Loki-V1.3-Q4KM.gguf | Q4KM | 14.33GB | false | Good quality, default size for most use cases, recommended. | | M3.2-24B-Loki-V1.3-Q4KS.gguf | Q4KS | 13.55GB | false | Slightly lower quality with more space savings, recommended. | | M3.2-24B-Loki-V1.3-Q40.gguf | Q40 | 13.49GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. | | M3.2-24B-Loki-V1.3-IQ4NL.gguf | IQ4NL | 13.47GB | false | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | M3.2-24B-Loki-V1.3-Q3KXL.gguf | Q3KXL | 12.99GB | false | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | M3.2-24B-Loki-V1.3-IQ4XS.gguf | IQ4XS | 12.76GB | false | Decent quality, smaller than Q4KS with similar performance, recommended. | | M3.2-24B-Loki-V1.3-Q3KL.gguf | Q3KL | 12.40GB | false | Lower quality but usable, good for low RAM availability. | | M3.2-24B-Loki-V1.3-Q3KM.gguf | Q3KM | 11.47GB | false | Low quality. | | M3.2-24B-Loki-V1.3-IQ3M.gguf | IQ3M | 10.65GB | false | Medium-low quality, new method with decent performance comparable to Q3KM. | | M3.2-24B-Loki-V1.3-Q3KS.gguf | Q3KS | 10.40GB | false | Low quality, not recommended. | | M3.2-24B-Loki-V1.3-IQ3XS.gguf | IQ3XS | 9.91GB | false | Lower quality, new method with decent performance, slightly better than Q3KS. | | M3.2-24B-Loki-V1.3-Q2KL.gguf | Q2KL | 9.55GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | M3.2-24B-Loki-V1.3-IQ3XXS.gguf | IQ3XXS | 9.28GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. | | M3.2-24B-Loki-V1.3-Q2K.gguf | Q2K | 8.89GB | false | Very low quality but surprisingly usable. | | M3.2-24B-Loki-V1.3-IQ2M.gguf | IQ2M | 8.11GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. | | M3.2-24B-Loki-V1.3-IQ2S.gguf | IQ2S | 7.48GB | false | Low quality, uses SOTA techniques to be usable. 
| | M3.2-24B-Loki-V1.3-IQ2XS.gguf | IQ2XS | 7.21GB | false | Low quality, uses SOTA techniques to be usable. | Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to.
LLAMA-3_8B_Unaligned_BETA-GGUF
Qwen2.5-32B-AGI-GGUF
Athene-V2-Agent-GGUF
Qwen2-VL-72B-Instruct-GGUF
OpenGVLab_InternVL3_5-8B-GGUF
Falcon3-10B-Instruct-GGUF
Qwen_Qwen3-VL-2B-Thinking-GGUF
Llamacpp imatrix Quantizations of Qwen3-VL-2B-Thinking by Qwen Original model: https://huggingface.co/Qwen/Qwen3-VL-2B-Thinking All quants made using imatrix option with dataset from here combined with a subset of combinedallsmall.parquet from Ed Addario here Run them directly with llama.cpp, or any other llama.cpp based project | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | Qwen3-VL-2B-Thinking-bf16.gguf | bf16 | 3.45GB | false | Full BF16 weights. | | Qwen3-VL-2B-Thinking-Q80.gguf | Q80 | 1.83GB | false | Extremely high quality, generally unneeded but max available quant. | | Qwen3-VL-2B-Thinking-Q6KL.gguf | Q6KL | 1.49GB | false | Uses Q80 for embed and output weights. Very high quality, near perfect, recommended. | | Qwen3-VL-2B-Thinking-Q6K.gguf | Q6K | 1.42GB | false | Very high quality, near perfect, recommended. | | Qwen3-VL-2B-Thinking-Q5KL.gguf | Q5KL | 1.33GB | false | Uses Q80 for embed and output weights. High quality, recommended. | | Qwen3-VL-2B-Thinking-Q5KM.gguf | Q5KM | 1.26GB | false | High quality, recommended. | | Qwen3-VL-2B-Thinking-Q5KS.gguf | Q5KS | 1.23GB | false | High quality, recommended. | | Qwen3-VL-2B-Thinking-Q4KL.gguf | Q4KL | 1.18GB | false | Uses Q80 for embed and output weights. Good quality, recommended. | | Qwen3-VL-2B-Thinking-Q41.gguf | Q41 | 1.14GB | false | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | Qwen3-VL-2B-Thinking-Q4KM.gguf | Q4KM | 1.11GB | false | Good quality, default size for most use cases, recommended. | | Qwen3-VL-2B-Thinking-Q3KXL.gguf | Q3KXL | 1.08GB | false | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | Qwen3-VL-2B-Thinking-Q4KS.gguf | Q4KS | 1.06GB | false | Slightly lower quality with more space savings, recommended. | | Qwen3-VL-2B-Thinking-Q40.gguf | Q40 | 1.06GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. | | Qwen3-VL-2B-Thinking-IQ4NL.gguf | IQ4NL | 1.05GB | false | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | Qwen3-VL-2B-Thinking-IQ4XS.gguf | IQ4XS | 1.01GB | false | Decent quality, smaller than Q4KS with similar performance, recommended. | | Qwen3-VL-2B-Thinking-Q3KL.gguf | Q3KL | 1.00GB | false | Lower quality but usable, good for low RAM availability. | | Qwen3-VL-2B-Thinking-Q3KM.gguf | Q3KM | 0.94GB | false | Low quality. | | Qwen3-VL-2B-Thinking-IQ3M.gguf | IQ3M | 0.90GB | false | Medium-low quality, new method with decent performance comparable to Q3KM. | | Qwen3-VL-2B-Thinking-Q3KS.gguf | Q3KS | 0.87GB | false | Low quality, not recommended. | | Qwen3-VL-2B-Thinking-Q2KL.gguf | Q2KL | 0.85GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | Qwen3-VL-2B-Thinking-IQ3XS.gguf | IQ3XS | 0.83GB | false | Lower quality, new method with decent performance, slightly better than Q3KS. | | Qwen3-VL-2B-Thinking-Q2K.gguf | Q2K | 0.78GB | false | Very low quality but surprisingly usable. | | Qwen3-VL-2B-Thinking-IQ3XXS.gguf | IQ3XXS | 0.75GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. | | Qwen3-VL-2B-Thinking-IQ2M.gguf | IQ2M | 0.70GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. 
| Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to.
EVA-Yi-1.5-9B-32K-V1-GGUF
Ministral-8B-Instruct-2410-HF-GGUF-TEST
gemma-2-27b-it-SimPO-37K-GGUF
Qwen2.5-72b-RP-Ink-GGUF
gustavecortal_Beck-4B-GGUF
Llamacpp imatrix Quantizations of Beck-4B by gustavecortal Original model: https://huggingface.co/gustavecortal/Beck-4B All quants made using imatrix option with dataset from here combined with a subset of combinedallsmall.parquet from Ed Addario here Run them directly with llama.cpp, or any other llama.cpp based project No chat template specified so default is used. This may be incorrect, check original model card for details. | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | Beck-4B-bf16.gguf | bf16 | 8.05GB | false | Full BF16 weights. | | Beck-4B-Q80.gguf | Q80 | 4.28GB | false | Extremely high quality, generally unneeded but max available quant. | | Beck-4B-Q6KL.gguf | Q6KL | 3.40GB | false | Uses Q80 for embed and output weights. Very high quality, near perfect, recommended. | | Beck-4B-Q6K.gguf | Q6K | 3.31GB | false | Very high quality, near perfect, recommended. | | Beck-4B-Q5KL.gguf | Q5KL | 2.98GB | false | Uses Q80 for embed and output weights. High quality, recommended. | | Beck-4B-Q5KM.gguf | Q5KM | 2.89GB | false | High quality, recommended. | | Beck-4B-Q5KS.gguf | Q5KS | 2.82GB | false | High quality, recommended. | | Beck-4B-Q41.gguf | Q41 | 2.60GB | false | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | Beck-4B-Q4KL.gguf | Q4KL | 2.59GB | false | Uses Q80 for embed and output weights. Good quality, recommended. | | Beck-4B-Q4KM.gguf | Q4KM | 2.50GB | false | Good quality, default size for most use cases, recommended. | | Beck-4B-Q4KS.gguf | Q4KS | 2.38GB | false | Slightly lower quality with more space savings, recommended. | | Beck-4B-Q40.gguf | Q40 | 2.38GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. | | Beck-4B-IQ4NL.gguf | IQ4NL | 2.38GB | false | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | Beck-4B-Q3KXL.gguf | Q3KXL | 2.33GB | false | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | Beck-4B-IQ4XS.gguf | IQ4XS | 2.27GB | false | Decent quality, smaller than Q4KS with similar performance, recommended. | | Beck-4B-Q3KL.gguf | Q3KL | 2.24GB | false | Lower quality but usable, good for low RAM availability. | | Beck-4B-Q3KM.gguf | Q3KM | 2.08GB | false | Low quality. | | Beck-4B-IQ3M.gguf | IQ3M | 1.96GB | false | Medium-low quality, new method with decent performance comparable to Q3KM. | | Beck-4B-Q3KS.gguf | Q3KS | 1.89GB | false | Low quality, not recommended. | | Beck-4B-IQ3XS.gguf | IQ3XS | 1.81GB | false | Lower quality, new method with decent performance, slightly better than Q3KS. | | Beck-4B-Q2KL.gguf | Q2KL | 1.76GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | Beck-4B-IQ3XXS.gguf | IQ3XXS | 1.67GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. | | Beck-4B-Q2K.gguf | Q2K | 1.67GB | false | Very low quality but surprisingly usable. | | Beck-4B-IQ2M.gguf | IQ2M | 1.51GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. | Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to. First, make sure you have hugginface-cli installed: If the model is bigger than 50GB, it will have been split into multiple files. 
In order to download them all to a local folder, run: You can either specify a new local-dir (gustavecortalBeck-4B-Q80) or download them all in place (./) Previously, you would download Q4044/48/88, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights. details in this PR. If you use Q40 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282 you will not be able to run the Q40XX files and will instead need to use Q40. Additionally, if you want to get slightly better quality for , you can use IQ4NL thanks to this PR which will also repack the weights for ARM, though only the 44 for now. The loading time may be slower but it will result in an overall speed incrase. I'm keeping this section to show the potential theoretical uplift in performance from using the Q40 with online repacking. Click to view benchmarks on an AVX2 system (EPYC7702) | model | size | params | backend | threads | test | t/s | % (vs Q40) | | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% | Q4088 offers a nice bump to prompt processing and a small bump to text generation A great write up with charts showing various performances is provided by Artefact2 here The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have. If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB Smaller than that total. Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. 
A great write-up with charts showing various performances is provided by Artefact2 here.

The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have. If you want your model running as FAST as possible, you'll want to fit the whole thing in your GPU's VRAM: aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB smaller than that total (for example, with 24GB of VRAM you would look for a file of roughly 22GB or less).

Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart. But basically, if you're aiming for below Q4, and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in format IQXX, like IQ3M. These are newer and offer better performance for their size. These I-quants can also be used on CPU, but will be slower than their K-quant equivalents, so speed vs. performance is a tradeoff you'll have to decide.

Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output.

Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
ai21labs_AI21-Jamba-Mini-1.7-GGUF
Llamacpp imatrix Quantizations of AI21-Jamba-Mini-1.7 by ai21labs Original model: https://huggingface.co/ai21labs/AI21-Jamba-Mini-1.7 All quants made using imatrix option with dataset from here Run them directly with llama.cpp, or any other llama.cpp based project | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | AI21-Jamba-Mini-1.7-bf16.gguf | bf16 | 103.16GB | true | Full BF16 weights. | | AI21-Jamba-Mini-1.7-Q80.gguf | Q80 | 54.81GB | true | Extremely high quality, generally unneeded but max available quant. | | AI21-Jamba-Mini-1.7-Q6KL.gguf | Q6KL | 42.46GB | false | Uses Q80 for embed and output weights. Very high quality, near perfect, recommended. | | AI21-Jamba-Mini-1.7-Q6K.gguf | Q6K | 42.33GB | false | Very high quality, near perfect, recommended. | | AI21-Jamba-Mini-1.7-Q5KL.gguf | Q5KL | 36.75GB | false | Uses Q80 for embed and output weights. High quality, recommended. | | AI21-Jamba-Mini-1.7-Q5KM.gguf | Q5KM | 36.58GB | false | High quality, recommended. | | AI21-Jamba-Mini-1.7-Q5KS.gguf | Q5KS | 35.52GB | false | High quality, recommended. | | AI21-Jamba-Mini-1.7-Q41.gguf | Q41 | 32.32GB | false | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | AI21-Jamba-Mini-1.7-Q4KL.gguf | Q4KL | 31.38GB | false | Uses Q80 for embed and output weights. Good quality, recommended. | | AI21-Jamba-Mini-1.7-Q4KM.gguf | Q4KM | 31.18GB | false | Good quality, default size for most use cases, recommended. | | AI21-Jamba-Mini-1.7-Q4KS.gguf | Q4KS | 30.07GB | false | Slightly lower quality with more space savings, recommended. | | AI21-Jamba-Mini-1.7-Q40.gguf | Q40 | 29.59GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. | | AI21-Jamba-Mini-1.7-IQ4NL.gguf | IQ4NL | 29.12GB | false | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | AI21-Jamba-Mini-1.7-IQ4XS.gguf | IQ4XS | 27.52GB | false | Decent quality, smaller than Q4KS with similar performance, recommended. | | AI21-Jamba-Mini-1.7-Q3KXL.gguf | Q3KXL | 24.72GB | false | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | AI21-Jamba-Mini-1.7-Q3KL.gguf | Q3KL | 24.48GB | false | Lower quality but usable, good for low RAM availability. | | AI21-Jamba-Mini-1.7-Q3KM.gguf | Q3KM | 23.45GB | false | Low quality. | | AI21-Jamba-Mini-1.7-IQ3M.gguf | IQ3M | 23.33GB | false | Medium-low quality, new method with decent performance comparable to Q3KM. | | AI21-Jamba-Mini-1.7-Q3KS.gguf | Q3KS | 22.32GB | false | Low quality, not recommended. | | AI21-Jamba-Mini-1.7-IQ3XS.gguf | IQ3XS | 21.19GB | false | Lower quality, new method with decent performance, slightly better than Q3KS. | | AI21-Jamba-Mini-1.7-IQ3XXS.gguf | IQ3XXS | 20.24GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. | | AI21-Jamba-Mini-1.7-Q2KL.gguf | Q2KL | 18.24GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | AI21-Jamba-Mini-1.7-Q2K.gguf | Q2K | 17.98GB | false | Very low quality but surprisingly usable. | | AI21-Jamba-Mini-1.7-IQ2M.gguf | IQ2M | 16.24GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. | | AI21-Jamba-Mini-1.7-IQ2S.gguf | IQ2S | 14.41GB | false | Low quality, uses SOTA techniques to be usable. | | AI21-Jamba-Mini-1.7-IQ2XS.gguf | IQ2XS | 14.34GB | false | Low quality, uses SOTA techniques to be usable. 
| | AI21-Jamba-Mini-1.7-IQ2XXS.gguf | IQ2XXS | 12.48GB | false | Very low quality, uses SOTA techniques to be usable. | Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to. First, make sure you have hugginface-cli installed: If the model is bigger than 50GB, it will have been split into multiple files. In order to download them all to a local folder, run: You can either specify a new local-dir (ai21labsAI21-Jamba-Mini-1.7-Q80) or download them all in place (./) Previously, you would download Q4044/48/88, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights. details in this PR. If you use Q40 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282 you will not be able to run the Q40XX files and will instead need to use Q40. Additionally, if you want to get slightly better quality for , you can use IQ4NL thanks to this PR which will also repack the weights for ARM, though only the 44 for now. The loading time may be slower but it will result in an overall speed incrase. I'm keeping this section to show the potential theoretical uplift in performance from using the Q40 with online repacking. Click to view benchmarks on an AVX2 system (EPYC7702) | model | size | params | backend | threads | test | t/s | % (vs Q40) | | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% | Q4088 offers a nice bump to prompt processing and a small bump to text generation A great write up with charts showing various performances is provided by Artefact2 here The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have. 
If you want your model running as FAST as possible, you'll want to fit the whole thing in your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB smaller than that total.

Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart. But basically, if you're aiming for below Q4, and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in format IQXX, like IQ3M. These are newer and offer better performance for their size. These I-quants can also be used on CPU, but will be slower than their K-quant equivalents, so speed vs. performance is a tradeoff you'll have to decide.

Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output.

Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
RekaAI_reka-flash-3.1-GGUF
Dolphin3.0-Qwen2.5-0.5B-GGUF
TheDrummer_Behemoth-ReduX-123B-v1.1-GGUF
Llamacpp imatrix Quantizations of Behemoth-ReduX-123B-v1.1 by TheDrummer Original model: https://huggingface.co/TheDrummer/Behemoth-ReduX-123B-v1.1 All quants made using imatrix option with dataset from here combined with a subset of combinedallsmall.parquet from Ed Addario here Run them directly with llama.cpp, or any other llama.cpp based project | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | Behemoth-ReduX-123B-v1.1-Q80.gguf | Q80 | 130.28GB | true | Extremely high quality, generally unneeded but max available quant. | | Behemoth-ReduX-123B-v1.1-Q6K.gguf | Q6K | 100.59GB | true | Very high quality, near perfect, recommended. | | Behemoth-ReduX-123B-v1.1-Q5KM.gguf | Q5KM | 86.49GB | true | High quality, recommended. | | Behemoth-ReduX-123B-v1.1-Q5KS.gguf | Q5KS | 84.36GB | true | High quality, recommended. | | Behemoth-ReduX-123B-v1.1-Q41.gguf | Q41 | 76.72GB | true | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | Behemoth-ReduX-123B-v1.1-Q4KL.gguf | Q4KL | 73.52GB | true | Uses Q80 for embed and output weights. Good quality, recommended. | | Behemoth-ReduX-123B-v1.1-Q4KM.gguf | Q4KM | 73.22GB | true | Good quality, default size for most use cases, recommended. | | Behemoth-ReduX-123B-v1.1-Q4KS.gguf | Q4KS | 69.57GB | true | Slightly lower quality with more space savings, recommended. | | Behemoth-ReduX-123B-v1.1-Q40.gguf | Q40 | 69.32GB | true | Legacy format, offers online repacking for ARM and AVX CPU inference. | | Behemoth-ReduX-123B-v1.1-IQ4NL.gguf | IQ4NL | 69.22GB | true | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | Behemoth-ReduX-123B-v1.1-IQ4XS.gguf | IQ4XS | 65.43GB | true | Decent quality, smaller than Q4KS with similar performance, recommended. | | Behemoth-ReduX-123B-v1.1-Q3KXL.gguf | Q3KXL | 64.91GB | true | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | Behemoth-ReduX-123B-v1.1-Q3KL.gguf | Q3KL | 64.55GB | true | Lower quality but usable, good for low RAM availability. | | Behemoth-ReduX-123B-v1.1-Q3KM.gguf | Q3KM | 59.10GB | true | Low quality. | | Behemoth-ReduX-123B-v1.1-IQ3M.gguf | IQ3M | 55.28GB | true | Medium-low quality, new method with decent performance comparable to Q3KM. | | Behemoth-ReduX-123B-v1.1-Q3KS.gguf | Q3KS | 52.85GB | true | Low quality, not recommended. | | Behemoth-ReduX-123B-v1.1-IQ3XS.gguf | IQ3XS | 50.14GB | true | Lower quality, new method with decent performance, slightly better than Q3KS. | | Behemoth-ReduX-123B-v1.1-IQ3XXS.gguf | IQ3XXS | 47.01GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. | | Behemoth-ReduX-123B-v1.1-Q2KL.gguf | Q2KL | 45.59GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | Behemoth-ReduX-123B-v1.1-Q2K.gguf | Q2K | 45.20GB | false | Very low quality but surprisingly usable. | | Behemoth-ReduX-123B-v1.1-IQ2M.gguf | IQ2M | 41.62GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. | | Behemoth-ReduX-123B-v1.1-IQ2S.gguf | IQ2S | 38.38GB | false | Low quality, uses SOTA techniques to be usable. | | Behemoth-ReduX-123B-v1.1-IQ2XS.gguf | IQ2XS | 36.08GB | false | Low quality, uses SOTA techniques to be usable. | | Behemoth-ReduX-123B-v1.1-IQ2XXS.gguf | IQ2XXS | 32.43GB | false | Very low quality, uses SOTA techniques to be usable. 
| | Behemoth-ReduX-123B-v1.1-IQ1M.gguf | IQ1M | 28.39GB | false | Extremely low quality, not recommended. | | Behemoth-ReduX-123B-v1.1-IQ1S.gguf | IQ1S | 25.96GB | false | Extremely low quality, not recommended. | Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to. First, make sure you have hugginface-cli installed: If the model is bigger than 50GB, it will have been split into multiple files. In order to download them all to a local folder, run: You can either specify a new local-dir (TheDrummerBehemoth-ReduX-123B-v1.1-Q80) or download them all in place (./) Previously, you would download Q4044/48/88, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights. details in this PR. If you use Q40 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282 you will not be able to run the Q40XX files and will instead need to use Q40. Additionally, if you want to get slightly better quality for , you can use IQ4NL thanks to this PR which will also repack the weights for ARM, though only the 44 for now. The loading time may be slower but it will result in an overall speed incrase. I'm keeping this section to show the potential theoretical uplift in performance from using the Q40 with online repacking. Click to view benchmarks on an AVX2 system (EPYC7702) | model | size | params | backend | threads | test | t/s | % (vs Q40) | | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% | Q4088 offers a nice bump to prompt processing and a small bump to text generation A great write up with charts showing various performances is provided by Artefact2 here The first thing to figure out is how big a model you can run. 
To do this, you'll need to figure out how much RAM and/or VRAM you have. If you want your model running as FAST as possible, you'll want to fit the whole thing in your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB smaller than that total.

Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart. But basically, if you're aiming for below Q4, and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in format IQXX, like IQ3M. These are newer and offer better performance for their size. These I-quants can also be used on CPU, but will be slower than their K-quant equivalents, so speed vs. performance is a tradeoff you'll have to decide.

Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output.

Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
glm-4-9b-chat-abliterated-GGUF
Mistral-Large-Instruct-2407-GGUF
Pantheon-RP-1.6.1-12b-Nemo-GGUF
TheDrummer_Fallen-Gemma3-4B-v1-GGUF
Meta-Llama-3.1-8B-Instruct-abliterated-GGUF
Llama3.2-3B-ShiningValiant2-GGUF
Delta-Vector_MS3.2-Austral-Winton-GGUF
DeepSeek-R1-ReDistill-Qwen-1.5B-v1.0-GGUF
LatitudeGames_Wayfarer-2-12B-GGUF
Llamacpp imatrix Quantizations of Wayfarer-2-12B by LatitudeGames Original model: https://huggingface.co/LatitudeGames/Wayfarer-2-12B All quants made using imatrix option with dataset from here combined with a subset of combinedallsmall.parquet from Ed Addario here Run them directly with llama.cpp, or any other llama.cpp based project | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | Wayfarer-2-12B-bf16.gguf | bf16 | 24.50GB | false | Full BF16 weights. | | Wayfarer-2-12B-Q80.gguf | Q80 | 13.02GB | false | Extremely high quality, generally unneeded but max available quant. | | Wayfarer-2-12B-Q6KL.gguf | Q6KL | 10.38GB | false | Uses Q80 for embed and output weights. Very high quality, near perfect, recommended. | | Wayfarer-2-12B-Q6K.gguf | Q6K | 10.06GB | false | Very high quality, near perfect, recommended. | | Wayfarer-2-12B-Q5KL.gguf | Q5KL | 9.14GB | false | Uses Q80 for embed and output weights. High quality, recommended. | | Wayfarer-2-12B-Q5KM.gguf | Q5KM | 8.73GB | false | High quality, recommended. | | Wayfarer-2-12B-Q5KS.gguf | Q5KS | 8.52GB | false | High quality, recommended. | | Wayfarer-2-12B-Q4KL.gguf | Q4KL | 7.98GB | false | Uses Q80 for embed and output weights. Good quality, recommended. | | Wayfarer-2-12B-Q41.gguf | Q41 | 7.80GB | false | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | Wayfarer-2-12B-Q4KM.gguf | Q4KM | 7.48GB | false | Good quality, default size for most use cases, recommended. | | Wayfarer-2-12B-Q3KXL.gguf | Q3KXL | 7.15GB | false | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | Wayfarer-2-12B-Q4KS.gguf | Q4KS | 7.12GB | false | Slightly lower quality with more space savings, recommended. | | Wayfarer-2-12B-IQ4NL.gguf | IQ4NL | 7.10GB | false | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | Wayfarer-2-12B-Q40.gguf | Q40 | 7.09GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. | | Wayfarer-2-12B-IQ4XS.gguf | IQ4XS | 6.74GB | false | Decent quality, smaller than Q4KS with similar performance, recommended. | | Wayfarer-2-12B-Q3KL.gguf | Q3KL | 6.56GB | false | Lower quality but usable, good for low RAM availability. | | Wayfarer-2-12B-Q3KM.gguf | Q3KM | 6.08GB | false | Low quality. | | Wayfarer-2-12B-IQ3M.gguf | IQ3M | 5.72GB | false | Medium-low quality, new method with decent performance comparable to Q3KM. | | Wayfarer-2-12B-Q3KS.gguf | Q3KS | 5.53GB | false | Low quality, not recommended. | | Wayfarer-2-12B-Q2KL.gguf | Q2KL | 5.45GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | Wayfarer-2-12B-IQ3XS.gguf | IQ3XS | 5.31GB | false | Lower quality, new method with decent performance, slightly better than Q3KS. | | Wayfarer-2-12B-IQ3XXS.gguf | IQ3XXS | 4.95GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. | | Wayfarer-2-12B-Q2K.gguf | Q2K | 4.79GB | false | Very low quality but surprisingly usable. | | Wayfarer-2-12B-IQ2M.gguf | IQ2M | 4.44GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. | | Wayfarer-2-12B-IQ2S.gguf | IQ2S | 4.14GB | false | Low quality, uses SOTA techniques to be usable. | Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to. 
First, make sure you have hugginface-cli installed: If the model is bigger than 50GB, it will have been split into multiple files. In order to download them all to a local folder, run: You can either specify a new local-dir (LatitudeGamesWayfarer-2-12B-Q80) or download them all in place (./) Previously, you would download Q4044/48/88, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights. details in this PR. If you use Q40 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282 you will not be able to run the Q40XX files and will instead need to use Q40. Additionally, if you want to get slightly better quality for , you can use IQ4NL thanks to this PR which will also repack the weights for ARM, though only the 44 for now. The loading time may be slower but it will result in an overall speed incrase. I'm keeping this section to show the potential theoretical uplift in performance from using the Q40 with online repacking. Click to view benchmarks on an AVX2 system (EPYC7702) | model | size | params | backend | threads | test | t/s | % (vs Q40) | | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% | Q4088 offers a nice bump to prompt processing and a small bump to text generation A great write up with charts showing various performances is provided by Artefact2 here The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have. If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB Smaller than that total. 
Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart. But basically, if you're aiming for below Q4, and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in format IQXX, like IQ3M. These are newer and offer better performance for their size. These I-quants can also be used on CPU, but will be slower than their K-quant equivalents, so speed vs. performance is a tradeoff you'll have to decide.

Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output.

Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
Palmyra-Med-70B-32K-GGUF
Llama-Song-Stream-3B-Instruct-GGUF
TildeAI_TildeOpen-30b-GGUF
huihui-ai_Qwen3-14B-abliterated-GGUF
WizardLM-2-8x22B-GGUF
Skywork_Skywork-R1V3-38B-GGUF
Delta-Vector_Plesio-70B-GGUF
deepthought-8b-llama-v0.01-alpha-GGUF
huihui-ai_gemma-3-1b-it-abliterated-GGUF
aws-prototyping_codefu-7b-v0.1-GGUF
LGAI-EXAONE_EXAONE-4.0-32B-GGUF
Llama-3.1-8B-ArliAI-RPMax-v1.1-GGUF
writing-roleplay-20k-context-nemo-12b-v1.0-GGUF
Gryphe_Pantheon-Proto-RP-1.8-30B-A3B-GGUF
Phi-3-medium-4k-instruct-GGUF
zerofata_MS3.2-PaintedFantasy-Visage-33B-GGUF
Rombo-Org_Rombo-LLM-V3.0-Qwen-32b-GGUF
Menlo_Jan-nano-GGUF
Vikhr-Nemo-12B-Instruct-R-21-09-24-GGUF
TheDrummer_Behemoth-ReduX-123B-v1-GGUF
Replete-Coder-V2-Llama-3.1-8b-GGUF
kalomaze_Qwen3-16B-A3B-GGUF
microsoft_Phi-4-mini-reasoning-GGUF
DeepSeek-V2.5-GGUF
nvidia_OpenReasoning-Nemotron-14B-GGUF
inclusionAI_Ling-1T-GGUF
Llamacpp imatrix Quantizations of Ling-1T by inclusionAI Original model: https://huggingface.co/inclusionAI/Ling-1T All quants made using imatrix option with dataset from here combined with a subset of combinedallsmall.parquet from Ed Addario here Run them directly with llama.cpp, or any other llama.cpp based project | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | Ling-1T-Q41.gguf | Q41 | 626.57GB | true | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | Ling-1T-Q4KM.gguf | Q4KM | 608.42GB | true | Good quality, default size for most use cases, recommended. | | Ling-1T-Q40.gguf | Q40 | 574.66GB | true | Legacy format, offers online repacking for ARM and AVX CPU inference. | | Ling-1T-IQ4NL.gguf | IQ4NL | 565.09GB | true | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | Ling-1T-IQ4XS.gguf | IQ4XS | 534.18GB | true | Decent quality, smaller than Q4KS with similar performance, recommended. | | Ling-1T-Q3KXL.gguf | Q3KXL | 476.60GB | true | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | Ling-1T-Q3KL.gguf | Q3KL | 475.47GB | true | Lower quality but usable, good for low RAM availability. | | Ling-1T-Q3KM.gguf | Q3KM | 456.46GB | true | Low quality. | | Ling-1T-IQ3M.gguf | IQ3M | 456.38GB | true | Medium-low quality, new method with decent performance comparable to Q3KM. | | Ling-1T-Q3KS.gguf | Q3KS | 433.73GB | true | Low quality, not recommended. | | Ling-1T-IQ3XS.gguf | IQ3XS | 409.57GB | true | Lower quality, new method with decent performance, slightly better than Q3KS. | | Ling-1T-IQ3XXS.gguf | IQ3XXS | 394.91GB | true | Lower quality, new method with decent performance, comparable to Q3 quants. | | Ling-1T-Q2KL.gguf | Q2KL | 351.17GB | true | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | Ling-1T-Q2K.gguf | Q2K | 349.92GB | true | Very low quality but surprisingly usable. | | Ling-1T-IQ2M.gguf | IQ2M | 316.09GB | true | Relatively low quality, uses SOTA techniques to be surprisingly usable. | | Ling-1T-IQ2S.gguf | IQ2S | 277.97GB | true | Low quality, uses SOTA techniques to be usable. | | Ling-1T-IQ2XS.gguf | IQ2XS | 277.17GB | true | Low quality, uses SOTA techniques to be usable. | | Ling-1T-IQ2XXS.gguf | IQ2XXS | 240.46GB | true | Very low quality, uses SOTA techniques to be usable. | | Ling-1T-IQ1M.gguf | IQ1M | 215.36GB | true | Extremely low quality, not recommended. | | Ling-1T-IQ1S.gguf | IQ1S | 206.10GB | true | Extremely low quality, not recommended. | Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to. First, make sure you have hugginface-cli installed: If the model is bigger than 50GB, it will have been split into multiple files. In order to download them all to a local folder, run: You can either specify a new local-dir (inclusionAILing-1T-Q80) or download them all in place (./) Previously, you would download Q4044/48/88, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights. details in this PR. If you use Q40 and your hardware would benefit from repacking weights, it will do it automatically on the fly. 
As of llama.cpp build b4282 you will not be able to run the Q40XX files and will instead need to use Q40. Additionally, if you want to get slightly better quality for , you can use IQ4NL thanks to this PR which will also repack the weights for ARM, though only the 44 for now. The loading time may be slower but it will result in an overall speed incrase. I'm keeping this section to show the potential theoretical uplift in performance from using the Q40 with online repacking. Click to view benchmarks on an AVX2 system (EPYC7702) | model | size | params | backend | threads | test | t/s | % (vs Q40) | | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% | Q4088 offers a nice bump to prompt processing and a small bump to text generation A great write up with charts showing various performances is provided by Artefact2 here The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have. If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB Smaller than that total. Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart: But basically, if you're aiming for below Q4, and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in format IQXX, like IQ3M. These are newer and offer better performance for their size. These I-quants can also be used on CPU, but will be slower than their K-quant equivalent, so speed vs performance is a tradeoff you'll have to decide. 
Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output. Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
yanolja_YanoljaNEXT-Rosetta-12B-2510-GGUF
Llamacpp imatrix Quantizations of YanoljaNEXT-Rosetta-12B-2510 by yanolja Original model: https://huggingface.co/yanolja/YanoljaNEXT-Rosetta-12B-2510 All quants made using imatrix option with dataset from here combined with a subset of combinedallsmall.parquet from Ed Addario here Run them directly with llama.cpp, or any other llama.cpp based project No chat template specified so default is used. This may be incorrect, check original model card for details. | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | YanoljaNEXT-Rosetta-12B-2510-bf16.gguf | bf16 | 25.55GB | false | Full BF16 weights. | | YanoljaNEXT-Rosetta-12B-2510-Q80.gguf | Q80 | 13.58GB | false | Extremely high quality, generally unneeded but max available quant. | | YanoljaNEXT-Rosetta-12B-2510-Q6KL.gguf | Q6KL | 10.97GB | false | Uses Q80 for embed and output weights. Very high quality, near perfect, recommended. | | YanoljaNEXT-Rosetta-12B-2510-Q6K.gguf | Q6K | 10.49GB | false | Very high quality, near perfect, recommended. | | YanoljaNEXT-Rosetta-12B-2510-Q5KL.gguf | Q5KL | 9.76GB | false | Uses Q80 for embed and output weights. High quality, recommended. | | YanoljaNEXT-Rosetta-12B-2510-Q5KM.gguf | Q5KM | 9.14GB | false | High quality, recommended. | | YanoljaNEXT-Rosetta-12B-2510-Q5KS.gguf | Q5KS | 8.92GB | false | High quality, recommended. | | YanoljaNEXT-Rosetta-12B-2510-Q4KL.gguf | Q4KL | 8.61GB | false | Uses Q80 for embed and output weights. Good quality, recommended. | | YanoljaNEXT-Rosetta-12B-2510-Q41.gguf | Q41 | 8.19GB | false | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | YanoljaNEXT-Rosetta-12B-2510-Q4KM.gguf | Q4KM | 7.87GB | false | Good quality, default size for most use cases, recommended. | | YanoljaNEXT-Rosetta-12B-2510-Q3KXL.gguf | Q3KXL | 7.79GB | false | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | YanoljaNEXT-Rosetta-12B-2510-Q4KS.gguf | Q4KS | 7.50GB | false | Slightly lower quality with more space savings, recommended. | | YanoljaNEXT-Rosetta-12B-2510-Q40.gguf | Q40 | 7.48GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. | | YanoljaNEXT-Rosetta-12B-2510-IQ4NL.gguf | IQ4NL | 7.45GB | false | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | YanoljaNEXT-Rosetta-12B-2510-IQ4XS.gguf | IQ4XS | 7.09GB | false | Decent quality, smaller than Q4KS with similar performance, recommended. | | YanoljaNEXT-Rosetta-12B-2510-Q3KL.gguf | Q3KL | 6.91GB | false | Lower quality but usable, good for low RAM availability. | | YanoljaNEXT-Rosetta-12B-2510-Q3KM.gguf | Q3KM | 6.44GB | false | Low quality. | | YanoljaNEXT-Rosetta-12B-2510-IQ3M.gguf | IQ3M | 6.09GB | false | Medium-low quality, new method with decent performance comparable to Q3KM. | | YanoljaNEXT-Rosetta-12B-2510-Q2KL.gguf | Q2KL | 6.08GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | YanoljaNEXT-Rosetta-12B-2510-Q3KS.gguf | Q3KS | 5.89GB | false | Low quality, not recommended. | | YanoljaNEXT-Rosetta-12B-2510-IQ3XS.gguf | IQ3XS | 5.64GB | false | Lower quality, new method with decent performance, slightly better than Q3KS. | | YanoljaNEXT-Rosetta-12B-2510-IQ3XXS.gguf | IQ3XXS | 5.22GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. 
| | YanoljaNEXT-Rosetta-12B-2510-Q2K.gguf | Q2K | 5.10GB | false | Very low quality but surprisingly usable. | | YanoljaNEXT-Rosetta-12B-2510-IQ2M.gguf | IQ2M | 4.74GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. | | YanoljaNEXT-Rosetta-12B-2510-IQ2S.gguf | IQ2S | 4.45GB | false | Low quality, uses SOTA techniques to be usable. | Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to. First, make sure you have hugginface-cli installed: If the model is bigger than 50GB, it will have been split into multiple files. In order to download them all to a local folder, run: You can either specify a new local-dir (yanoljaYanoljaNEXT-Rosetta-12B-2510-Q80) or download them all in place (./) Previously, you would download Q4044/48/88, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights. details in this PR. If you use Q40 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282 you will not be able to run the Q40XX files and will instead need to use Q40. Additionally, if you want to get slightly better quality for , you can use IQ4NL thanks to this PR which will also repack the weights for ARM, though only the 44 for now. The loading time may be slower but it will result in an overall speed incrase. I'm keeping this section to show the potential theoretical uplift in performance from using the Q40 with online repacking. Click to view benchmarks on an AVX2 system (EPYC7702) | model | size | params | backend | threads | test | t/s | % (vs Q40) | | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% | Q4088 offers a nice bump to prompt processing and a small bump to text generation A great write up with charts showing various 
performances is provided by Artefact2 here.

The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have. If you want your model running as FAST as possible, you'll want to fit the whole thing in your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB smaller than that total.

Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart. But basically, if you're aiming for below Q4, and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in format IQXX, like IQ3M. These are newer and offer better performance for their size. These I-quants can also be used on CPU, but will be slower than their K-quant equivalents, so speed vs. performance is a tradeoff you'll have to decide.

Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output.

Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
TQ2.5-14B-Sugarquill-v1-GGUF
Gryphe_Pantheon-RP-1.8-24b-Small-3.1-GGUF
dolphin-2.9.4-gemma2-2b-GGUF
Llama-Sentient-3.2-3B-Instruct-GGUF
all-hands_openhands-lm-32b-v0.1-GGUF
Qwen2.5-Coder-3B-Instruct-abliterated-GGUF
google_gemma-3-1b-it-qat-GGUF
zerofata_L3.3-GeneticLemonade-Opus-70B-GGUF
Yi-Coder-9B-Chat-GGUF
granite-3.1-2b-instruct-GGUF
Qwen2-VL-7B-Instruct-abliterated-GGUF
Llama3.1-8B-Cobalt-GGUF
mistralai_Magistral-Small-2507-GGUF
Qwen2.5-7B-Instruct-1M-GGUF
Anubis-70B-v1-GGUF
Mistral-Nemo-Prism-12B-GGUF
microsoft_Phi-4-reasoning-plus-GGUF
TheDrummer_Fallen-Llama-3.3-R1-70B-v1-GGUF
google_gemma-3-270m-it-qat-GGUF
Sky-T1-32B-Preview-GGUF
Dans-PersonalityEngine-v1.0.0-8b-GGUF
Rombos-Coder-V2.5-Qwen-14b-GGUF
Falcon3-1B-Instruct-GGUF
EVA-LLaMA-3.33-70B-v0.1-GGUF
LGAI-EXAONE_EXAONE-Deep-32B-GGUF
granite-3.1-3b-a800m-instruct-GGUF
deepcogito_cogito-v1-preview-qwen-14B-GGUF
INTELLECT-1-Instruct-GGUF
soob3123_amoral-gemma3-12B-GGUF
InfiniAILab_QwQ-0.5B-GGUF
Steelskull_L3.3-Cu-Mai-R1-70b-GGUF
ibm-granite_granite-3.2-2b-instruct-GGUF
google_txgemma-9b-chat-GGUF
Steelskull_L3.3-Shakudo-70b-GGUF
Qwentile2.5-32B-Instruct-GGUF
arcee-ai_Arcee-Maestro-7B-Preview-GGUF
Llama-3.1-8B-Lexi-Uncensored-GGUF
NousResearch_DeepHermes-3-Mistral-24B-Preview-GGUF
smirki_UIGEN-T1.1-Qwen-14B-GGUF
c4ai-command-r-08-2024-GGUF
THU-KEG_LongWriter-Zero-32B-GGUF
arcee-ai_AFM-4.5B-GGUF
soob3123_Veritas-12B-GGUF
nvidia_Llama-3_3-Nemotron-Super-49B-GenRM-Multilingual-GGUF
Mistral-Large-Instruct-2411-GGUF
AutoCoder-GGUF
Gemma-2-9B-It-SPPO-Iter3-GGUF
Ichigo-llama3.1-s-instruct-v0.4-GGUF
Nemotron-Mini-4B-Instruct-GGUF
PocketDoc_Dans-SakuraKaze-V1.0.0-12b-GGUF
MS-Schisandra-22B-v0.3-GGUF
open-thoughts_OpenThinker3-7B-GGUF
Chocolatine-3B-Instruct-DPO-v1.2-GGUF
EXAONE-3.5-32B-Instruct-GGUF
TheDrummer_Snowpiercer-15B-v4-GGUF
Llama-OpenReviewer-8B-GGUF
PKU-DS-LAB_FairyR1-32B-GGUF
gustavecortal_Beck-1.7B-GGUF
Llamacpp imatrix Quantizations of Beck-1.7B by gustavecortal Original model: https://huggingface.co/gustavecortal/Beck-1.7B All quants made using imatrix option with dataset from here combined with a subset of combinedallsmall.parquet from Ed Addario here Run them directly with llama.cpp, or any other llama.cpp based project No chat template specified so default is used. This may be incorrect, check original model card for details. | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | Beck-1.7B-bf16.gguf | bf16 | 3.45GB | false | Full BF16 weights. | | Beck-1.7B-Q80.gguf | Q80 | 1.83GB | false | Extremely high quality, generally unneeded but max available quant. | | Beck-1.7B-Q6KL.gguf | Q6KL | 1.49GB | false | Uses Q80 for embed and output weights. Very high quality, near perfect, recommended. | | Beck-1.7B-Q6K.gguf | Q6K | 1.42GB | false | Very high quality, near perfect, recommended. | | Beck-1.7B-Q5KL.gguf | Q5KL | 1.33GB | false | Uses Q80 for embed and output weights. High quality, recommended. | | Beck-1.7B-Q5KM.gguf | Q5KM | 1.26GB | false | High quality, recommended. | | Beck-1.7B-Q5KS.gguf | Q5KS | 1.23GB | false | High quality, recommended. | | Beck-1.7B-Q4KL.gguf | Q4KL | 1.18GB | false | Uses Q80 for embed and output weights. Good quality, recommended. | | Beck-1.7B-Q41.gguf | Q41 | 1.14GB | false | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | Beck-1.7B-Q4KM.gguf | Q4KM | 1.11GB | false | Good quality, default size for most use cases, recommended. | | Beck-1.7B-Q3KXL.gguf | Q3KXL | 1.08GB | false | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | Beck-1.7B-Q4KS.gguf | Q4KS | 1.06GB | false | Slightly lower quality with more space savings, recommended. | | Beck-1.7B-Q40.gguf | Q40 | 1.06GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. | | Beck-1.7B-IQ4NL.gguf | IQ4NL | 1.05GB | false | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | Beck-1.7B-IQ4XS.gguf | IQ4XS | 1.01GB | false | Decent quality, smaller than Q4KS with similar performance, recommended. | | Beck-1.7B-Q3KL.gguf | Q3KL | 1.00GB | false | Lower quality but usable, good for low RAM availability. | | Beck-1.7B-Q3KM.gguf | Q3KM | 0.94GB | false | Low quality. | | Beck-1.7B-IQ3M.gguf | IQ3M | 0.90GB | false | Medium-low quality, new method with decent performance comparable to Q3KM. | | Beck-1.7B-Q3KS.gguf | Q3KS | 0.87GB | false | Low quality, not recommended. | | Beck-1.7B-Q2KL.gguf | Q2KL | 0.85GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | Beck-1.7B-IQ3XS.gguf | IQ3XS | 0.83GB | false | Lower quality, new method with decent performance, slightly better than Q3KS. | | Beck-1.7B-Q2K.gguf | Q2K | 0.78GB | false | Very low quality but surprisingly usable. | | Beck-1.7B-IQ3XXS.gguf | IQ3XXS | 0.75GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. | | Beck-1.7B-IQ2M.gguf | IQ2M | 0.70GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. | Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to. First, make sure you have hugginface-cli installed: If the model is bigger than 50GB, it will have been split into multiple files. 
In order to download them all to a local folder, run: You can either specify a new local-dir (gustavecortalBeck-1.7B-Q80) or download them all in place (./) Previously, you would download Q4044/48/88, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights. details in this PR. If you use Q40 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282 you will not be able to run the Q40XX files and will instead need to use Q40. Additionally, if you want to get slightly better quality for , you can use IQ4NL thanks to this PR which will also repack the weights for ARM, though only the 44 for now. The loading time may be slower but it will result in an overall speed incrase. I'm keeping this section to show the potential theoretical uplift in performance from using the Q40 with online repacking. Click to view benchmarks on an AVX2 system (EPYC7702) | model | size | params | backend | threads | test | t/s | % (vs Q40) | | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% | Q4088 offers a nice bump to prompt processing and a small bump to text generation A great write up with charts showing various performances is provided by Artefact2 here The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have. If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB Smaller than that total. Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. 
These are in format 'QXKX', like Q5KM. If you want to get more into the weeds, you can check out this extremely useful feature chart. But basically, if you're aiming for below Q4, and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in format IQXX, like IQ3M. These are newer and offer better performance for their size. These I-quants can also be used on CPU, but will be slower than their K-quant equivalents, so speed vs. performance is a tradeoff you'll have to decide.

Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output.

Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
CohereForAI_c4ai-command-a-03-2025-GGUF
Hermes-3-Llama-3.1-8B-lorablated-GGUF
Delta-Vector_Austral-Xgen-9B-Winton-GGUF
Reflection-Llama-3.1-70B-GGUF
EXAONE-3.5-7.8B-Instruct-GGUF
TheDrummer_Rivermind-Lux-12B-v1-GGUF
Steelskull_L3.3-Mokume-Gane-R1-70b-v1.1-GGUF
TheDrummer_Gemma-3-R1-27B-v1-GGUF
Llamacpp imatrix Quantizations of Gemma-3-R1-27B-v1 by TheDrummer Original model: https://huggingface.co/TheDrummer/Gemma-3-R1-27B-v1 All quants made using imatrix option with dataset from here combined with a subset of combinedallsmall.parquet from Ed Addario here Run them directly with llama.cpp, or any other llama.cpp based project | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | Gemma-3-R1-27B-v1-bf16.gguf | bf16 | 54.03GB | true | Full BF16 weights. | | Gemma-3-R1-27B-v1-Q80.gguf | Q80 | 28.71GB | false | Extremely high quality, generally unneeded but max available quant. | | Gemma-3-R1-27B-v1-Q6KL.gguf | Q6KL | 22.51GB | false | Uses Q80 for embed and output weights. Very high quality, near perfect, recommended. | | Gemma-3-R1-27B-v1-Q6K.gguf | Q6K | 22.17GB | false | Very high quality, near perfect, recommended. | | Gemma-3-R1-27B-v1-Q5KL.gguf | Q5KL | 19.61GB | false | Uses Q80 for embed and output weights. High quality, recommended. | | Gemma-3-R1-27B-v1-Q5KM.gguf | Q5KM | 19.27GB | false | High quality, recommended. | | Gemma-3-R1-27B-v1-Q5KS.gguf | Q5KS | 18.77GB | false | High quality, recommended. | | Gemma-3-R1-27B-v1-Q41.gguf | Q41 | 17.17GB | false | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | Gemma-3-R1-27B-v1-Q4KL.gguf | Q4KL | 16.89GB | false | Uses Q80 for embed and output weights. Good quality, recommended. | | Gemma-3-R1-27B-v1-Q4KM.gguf | Q4KM | 16.55GB | false | Good quality, default size for most use cases, recommended. | | Gemma-3-R1-27B-v1-Q4KS.gguf | Q4KS | 15.67GB | false | Slightly lower quality with more space savings, recommended. | | Gemma-3-R1-27B-v1-Q40.gguf | Q40 | 15.62GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. | | Gemma-3-R1-27B-v1-IQ4NL.gguf | IQ4NL | 15.57GB | false | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | Gemma-3-R1-27B-v1-Q3KXL.gguf | Q3KXL | 14.88GB | false | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | Gemma-3-R1-27B-v1-IQ4XS.gguf | IQ4XS | 14.77GB | false | Decent quality, smaller than Q4KS with similar performance, recommended. | | Gemma-3-R1-27B-v1-Q3KL.gguf | Q3KL | 14.54GB | false | Lower quality but usable, good for low RAM availability. | | Gemma-3-R1-27B-v1-Q3KM.gguf | Q3KM | 13.44GB | false | Low quality. | | Gemma-3-R1-27B-v1-IQ3M.gguf | IQ3M | 12.55GB | false | Medium-low quality, new method with decent performance comparable to Q3KM. | | Gemma-3-R1-27B-v1-Q3KS.gguf | Q3KS | 12.17GB | false | Low quality, not recommended. | | Gemma-3-R1-27B-v1-IQ3XS.gguf | IQ3XS | 11.56GB | false | Lower quality, new method with decent performance, slightly better than Q3KS. | | Gemma-3-R1-27B-v1-Q2KL.gguf | Q2KL | 10.85GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | Gemma-3-R1-27B-v1-IQ3XXS.gguf | IQ3XXS | 10.72GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. | | Gemma-3-R1-27B-v1-Q2K.gguf | Q2K | 10.50GB | false | Very low quality but surprisingly usable. | | Gemma-3-R1-27B-v1-IQ2M.gguf | IQ2M | 9.49GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. | | Gemma-3-R1-27B-v1-IQ2S.gguf | IQ2S | 8.78GB | false | Low quality, uses SOTA techniques to be usable. | | Gemma-3-R1-27B-v1-IQ2XS.gguf | IQ2XS | 8.44GB | false | Low quality, uses SOTA techniques to be usable. 
| | Gemma-3-R1-27B-v1-IQ2XXS.gguf | IQ2XXS | 7.69GB | false | Very low quality, uses SOTA techniques to be usable. | Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to. First, make sure you have hugginface-cli installed: If the model is bigger than 50GB, it will have been split into multiple files. In order to download them all to a local folder, run: You can either specify a new local-dir (TheDrummerGemma-3-R1-27B-v1-Q80) or download them all in place (./) Previously, you would download Q4044/48/88, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights. details in this PR. If you use Q40 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282 you will not be able to run the Q40XX files and will instead need to use Q40. Additionally, if you want to get slightly better quality for , you can use IQ4NL thanks to this PR which will also repack the weights for ARM, though only the 44 for now. The loading time may be slower but it will result in an overall speed incrase. I'm keeping this section to show the potential theoretical uplift in performance from using the Q40 with online repacking. Click to view benchmarks on an AVX2 system (EPYC7702) | model | size | params | backend | threads | test | t/s | % (vs Q40) | | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% | Q4088 offers a nice bump to prompt processing and a small bump to text generation A great write up with charts showing various performances is provided by Artefact2 here The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have. 
Previously, you would download Q4044/48/88, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights; details in this PR. If you use Q40 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282, you will not be able to run the Q40XX files and will instead need to use Q40. Additionally, if you want to get slightly better quality for ARM and AVX machines, you can use IQ4NL thanks to this PR, which will also repack the weights for ARM (though only the 44 for now). The loading time may be slower, but it will result in an overall speed increase.

I'm keeping this section to show the potential theoretical uplift in performance from using the Q40 with online repacking.

Click to view benchmarks on an AVX2 system (EPYC7702)

| model | size | params | backend | threads | test | t/s | % (vs Q40) |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: |
| qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% |
| qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% |
| qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% |
| qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% |
| qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% |
| qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% |
| qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% |
| qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% |
| qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% |
| qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% |
| qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% |
| qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% |
| qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% |
| qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% |
| qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% |
| qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% |
| qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% |
| qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% |

Q4088 offers a nice bump to prompt processing and a small bump to text generation. A great write-up with charts showing various performance comparisons is provided by Artefact2 here.

The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have.

If you want your model running as FAST as possible, you'll want to fit the whole thing in your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM.

If you want the absolute maximum quality, add your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB smaller than that total.

Next, you'll need to decide whether to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in the format 'QXKX', like Q5KM.

If you want to get more into the weeds, you can check out this extremely useful feature chart. But basically, if you're aiming for below Q4 and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in the format IQXX, like IQ3M. They are newer and offer better performance for their size. The I-quants can also be used on CPU, but will be slower than their K-quant equivalents, so speed vs performance is a tradeoff you'll have to decide.

Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output.

Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
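To actually run a downloaded quant with llama.cpp, as mentioned at the top of this card, a minimal sketch; the model file name and the flag values here are illustrative assumptions, not part of this card:

```
# Interactive chat with all layers offloaded to the GPU (-ngl 99); lower -ngl if VRAM is tight
./llama-cli -m Gemma-3-R1-27B-v1-Q4_K_M.gguf -ngl 99 -c 8192 -p "Hello!"

# Or expose an OpenAI-compatible HTTP endpoint for other tools to use
./llama-server -m Gemma-3-R1-27B-v1-Q4_K_M.gguf -ngl 99 -c 8192 --port 8080
```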
CollectiveLM-Falcon-3-7B-GGUF
LiquidAI_LFM2-VL-1.6B-GGUF
deepseek-ai_DeepSeek-V3-0324-GGUF
Qwen2.5-Coder-14B-Instruct-abliterated-GGUF
nvidia_OpenCodeReasoning-Nemotron-32B-IOI-GGUF
Qwen2.5-Coder-7B-Instruct-abliterated-GGUF
Marco-o1-GGUF
SuperNova-Medius-GGUF
Qwen_Qwen2.5-VL-72B-Instruct-GGUF
Cydonia-22B-v1-GGUF
Rombos-Coder-V2.5-Qwen-32b-GGUF
microsoft_Phi-4-reasoning-GGUF
UI-TARS-7B-SFT-GGUF
OpenGVLab_InternVL3_5-2B-GGUF
Qwen2.5-Coder-32B-Instruct-abliterated-GGUF
openai_gpt-oss-120b-GGUF-MXFP4-Experimental
Llamacpp experimental Quantizations of gpt-oss-120b by OpenAI

Using llama.cpp branch `gpt-oss-mxfp4`, PR here: https://github.com/ggml-org/llama.cpp/pull/15091

Original model: https://huggingface.co/openai/gpt-oss-120b

This is a single static quant in the new MXFP4 format; the rest of the sizes will come after the PR is merged.
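Because this quant targeted a then-unmerged branch, running it meant building llama.cpp from that PR. A rough sketch of checking out and building PR #15091 with the standard CMake flow (backend-specific flags such as CUDA or Metal are left out and would be added as usual):

```
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

# Fetch the gpt-oss-mxfp4 PR branch (#15091) and switch to it
git fetch origin pull/15091/head:gpt-oss-mxfp4
git checkout gpt-oss-mxfp4

# Standard release build
cmake -B build
cmake --build build --config Release
```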
Qwen2-0.5B-Instruct-GGUF
SILMA-9B-Instruct-v1.0-GGUF
arcee-ai_Homunculus-GGUF
WhiteRabbitNeo_WhiteRabbitNeo-V3-7B-GGUF
TheDrummer_Gemma-3-R1-12B-v1-GGUF
Llamacpp imatrix Quantizations of Gemma-3-R1-12B-v1 by TheDrummer Original model: https://huggingface.co/TheDrummer/Gemma-3-R1-12B-v1 All quants made using imatrix option with dataset from here combined with a subset of combinedallsmall.parquet from Ed Addario here Run them directly with llama.cpp, or any other llama.cpp based project | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | Gemma-3-R1-12B-v1-bf16.gguf | bf16 | 23.54GB | false | Full BF16 weights. | | Gemma-3-R1-12B-v1-Q80.gguf | Q80 | 12.51GB | false | Extremely high quality, generally unneeded but max available quant. | | Gemma-3-R1-12B-v1-Q6KL.gguf | Q6KL | 9.90GB | false | Uses Q80 for embed and output weights. Very high quality, near perfect, recommended. | | Gemma-3-R1-12B-v1-Q6K.gguf | Q6K | 9.66GB | false | Very high quality, near perfect, recommended. | | Gemma-3-R1-12B-v1-Q5KL.gguf | Q5KL | 8.69GB | false | Uses Q80 for embed and output weights. High quality, recommended. | | Gemma-3-R1-12B-v1-Q5KM.gguf | Q5KM | 8.45GB | false | High quality, recommended. | | Gemma-3-R1-12B-v1-Q5KS.gguf | Q5KS | 8.23GB | false | High quality, recommended. | | Gemma-3-R1-12B-v1-Q41.gguf | Q41 | 7.56GB | false | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | Gemma-3-R1-12B-v1-Q4KL.gguf | Q4KL | 7.54GB | false | Uses Q80 for embed and output weights. Good quality, recommended. | | Gemma-3-R1-12B-v1-Q4KM.gguf | Q4KM | 7.30GB | false | Good quality, default size for most use cases, recommended. | | Gemma-3-R1-12B-v1-Q4KS.gguf | Q4KS | 6.94GB | false | Slightly lower quality with more space savings, recommended. | | Gemma-3-R1-12B-v1-Q40.gguf | Q40 | 6.91GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. | | Gemma-3-R1-12B-v1-IQ4NL.gguf | IQ4NL | 6.89GB | false | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | Gemma-3-R1-12B-v1-Q3KXL.gguf | Q3KXL | 6.72GB | false | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | Gemma-3-R1-12B-v1-IQ4XS.gguf | IQ4XS | 6.55GB | false | Decent quality, smaller than Q4KS with similar performance, recommended. | | Gemma-3-R1-12B-v1-Q3KL.gguf | Q3KL | 6.48GB | false | Lower quality but usable, good for low RAM availability. | | Gemma-3-R1-12B-v1-Q3KM.gguf | Q3KM | 6.01GB | false | Low quality. | | Gemma-3-R1-12B-v1-IQ3M.gguf | IQ3M | 5.66GB | false | Medium-low quality, new method with decent performance comparable to Q3KM. | | Gemma-3-R1-12B-v1-Q3KS.gguf | Q3KS | 5.46GB | false | Low quality, not recommended. | | Gemma-3-R1-12B-v1-IQ3XS.gguf | IQ3XS | 5.21GB | false | Lower quality, new method with decent performance, slightly better than Q3KS. | | Gemma-3-R1-12B-v1-Q2KL.gguf | Q2KL | 5.01GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | Gemma-3-R1-12B-v1-IQ3XXS.gguf | IQ3XXS | 4.78GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. | | Gemma-3-R1-12B-v1-Q2K.gguf | Q2K | 4.77GB | false | Very low quality but surprisingly usable. | | Gemma-3-R1-12B-v1-IQ2M.gguf | IQ2M | 4.31GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. | | Gemma-3-R1-12B-v1-IQ2S.gguf | IQ2S | 4.02GB | false | Low quality, uses SOTA techniques to be usable. 
| Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to. First, make sure you have hugginface-cli installed: If the model is bigger than 50GB, it will have been split into multiple files. In order to download them all to a local folder, run: You can either specify a new local-dir (TheDrummerGemma-3-R1-12B-v1-Q80) or download them all in place (./) Previously, you would download Q4044/48/88, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights. details in this PR. If you use Q40 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282 you will not be able to run the Q40XX files and will instead need to use Q40. Additionally, if you want to get slightly better quality for , you can use IQ4NL thanks to this PR which will also repack the weights for ARM, though only the 44 for now. The loading time may be slower but it will result in an overall speed incrase. I'm keeping this section to show the potential theoretical uplift in performance from using the Q40 with online repacking. Click to view benchmarks on an AVX2 system (EPYC7702) | model | size | params | backend | threads | test | t/s | % (vs Q40) | | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% | Q4088 offers a nice bump to prompt processing and a small bump to text generation A great write up with charts showing various performances is provided by Artefact2 here The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have. If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. 
If you want the absolute maximum quality, add your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB smaller than that total (a quick way to check what memory you have available is sketched at the end of this card).

Next, you'll need to decide whether to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in the format 'QXKX', like Q5KM.

If you want to get more into the weeds, you can check out this extremely useful feature chart. But basically, if you're aiming for below Q4 and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in the format IQXX, like IQ3M. They are newer and offer better performance for their size. The I-quants can also be used on CPU, but will be slower than their K-quant equivalents, so speed vs performance is a tradeoff you'll have to decide.

Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output.

Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
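The sizing advice above comes down to comparing a quant's file size against your available memory. Two illustrative commands for checking what you have to work with, assuming an Nvidia GPU and a Linux host:

```
# Total VRAM per GPU (Nvidia)
nvidia-smi --query-gpu=memory.total --format=csv

# Total system RAM (Linux)
free -h
```

With 8GB of VRAM, for example, the 1-2GB headroom rule points at something like the 6.94GB Q4KS or 6.55GB IQ4XS files from the table above for full GPU offload.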
infly_inf-o1-pi0-GGUF
magnum-v4-22b-GGUF
magnum-v2-4b-GGUF
Llama-3-Patronus-Lynx-70B-Instruct-GGUF
YuLan-Mini-GGUF
starcoder2-15b-instruct-v0.1-GGUF
35b-beta-long-GGUF
Deepthink-Reasoning-7B-GGUF
Llama-Doctor-3.2-3B-Instruct-GGUF
deepcogito_cogito-v1-preview-llama-70B-GGUF
OpenThinker-7B-GGUF
Llama-3.1-8B-Open-SFT-GGUF
Nohobby_L3.3-Prikol-70B-EXTRA-GGUF
Sao10K_Llama-3.3-70B-Vulpecula-r1-GGUF
granite-embedding-125m-english-GGUF
Behemoth-123B-v1-GGUF
LongWriter-glm4-9b-abliterated-GGUF
Athene-70B-GGUF
SmolLM2-360M-Instruct-GGUF
deepseek-ai_DeepSeek-V3.1-Terminus-GGUF
Llamacpp imatrix Quantizations of DeepSeek-V3.1-Terminus by deepseek-ai Original model: https://huggingface.co/deepseek-ai/DeepSeek-V3.1-Terminus All quants made using imatrix option with dataset from here combined with a subset of combinedallsmall.parquet from Ed Addario here Run them directly with llama.cpp, or any other llama.cpp based project | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | DeepSeek-V3.1-Terminus-Q80.gguf | Q80 | 713.29GB | true | Extremely high quality, generally unneeded but max available quant. | | DeepSeek-V3.1-Terminus-Q6K.gguf | Q6K | 552.45GB | true | Very high quality, near perfect, recommended. | | DeepSeek-V3.1-Terminus-Q5KM.gguf | Q5KM | 478.34GB | true | High quality, recommended. | | DeepSeek-V3.1-Terminus-Q5KS.gguf | Q5KS | 463.03GB | true | High quality, recommended. | | DeepSeek-V3.1-Terminus-Q41.gguf | Q41 | 421.04GB | true | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | DeepSeek-V3.1-Terminus-Q4KM.gguf | Q4KM | 409.23GB | true | Good quality, default size for most use cases, recommended. | | DeepSeek-V3.1-Terminus-Q4KS.gguf | Q4KS | 394.15GB | true | Slightly lower quality with more space savings, recommended. | | DeepSeek-V3.1-Terminus-Q40.gguf | Q40 | 386.42GB | true | Legacy format, offers online repacking for ARM and AVX CPU inference. | | DeepSeek-V3.1-Terminus-IQ4NL.gguf | IQ4NL | 380.48GB | true | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | DeepSeek-V3.1-Terminus-IQ4XS.gguf | IQ4XS | 359.98GB | true | Decent quality, smaller than Q4KS with similar performance, recommended. | | DeepSeek-V3.1-Terminus-Q3KXL.gguf | Q3KXL | 320.52GB | true | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | DeepSeek-V3.1-Terminus-Q3KL.gguf | Q3KL | 319.71GB | true | Lower quality but usable, good for low RAM availability. | | DeepSeek-V3.1-Terminus-Q3KM.gguf | Q3KM | 307.93GB | true | Low quality. | | DeepSeek-V3.1-Terminus-IQ3M.gguf | IQ3M | 307.88GB | true | Medium-low quality, new method with decent performance comparable to Q3KM. | | DeepSeek-V3.1-Terminus-Q3KS.gguf | Q3KS | 293.35GB | true | Low quality, not recommended. | | DeepSeek-V3.1-Terminus-IQ3XS.gguf | IQ3XS | 277.15GB | true | Lower quality, new method with decent performance, slightly better than Q3KS. | | DeepSeek-V3.1-Terminus-IQ3XXS.gguf | IQ3XXS | 267.63GB | true | Lower quality, new method with decent performance, comparable to Q3 quants. | | DeepSeek-V3.1-Terminus-Q2KL.gguf | Q2KL | 238.74GB | true | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | DeepSeek-V3.1-Terminus-Q2K.gguf | Q2K | 237.83GB | true | Very low quality but surprisingly usable. | | DeepSeek-V3.1-Terminus-IQ2M.gguf | IQ2M | 215.04GB | true | Relatively low quality, uses SOTA techniques to be surprisingly usable. | | DeepSeek-V3.1-Terminus-IQ2S.gguf | IQ2S | 189.63GB | true | Low quality, uses SOTA techniques to be usable. | | DeepSeek-V3.1-Terminus-IQ2XS.gguf | IQ2XS | 188.41GB | true | Low quality, uses SOTA techniques to be usable. | | DeepSeek-V3.1-Terminus-IQ2XXS.gguf | IQ2XXS | 164.06GB | true | Very low quality, uses SOTA techniques to be usable. | | DeepSeek-V3.1-Terminus-IQ1M.gguf | IQ1M | 147.45GB | true | Extremely low quality, not recommended. 
| Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to. First, make sure you have hugginface-cli installed: If the model is bigger than 50GB, it will have been split into multiple files. In order to download them all to a local folder, run: You can either specify a new local-dir (deepseek-aiDeepSeek-V3.1-Terminus-Q80) or download them all in place (./) Previously, you would download Q4044/48/88, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights. details in this PR. If you use Q40 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282 you will not be able to run the Q40XX files and will instead need to use Q40. Additionally, if you want to get slightly better quality for , you can use IQ4NL thanks to this PR which will also repack the weights for ARM, though only the 44 for now. The loading time may be slower but it will result in an overall speed incrase. I'm keeping this section to show the potential theoretical uplift in performance from using the Q40 with online repacking. Click to view benchmarks on an AVX2 system (EPYC7702) | model | size | params | backend | threads | test | t/s | % (vs Q40) | | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% | Q4088 offers a nice bump to prompt processing and a small bump to text generation A great write up with charts showing various performances is provided by Artefact2 here The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have. If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM. 
If you want the absolute maximum quality, add your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB smaller than that total.

Next, you'll need to decide whether to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in the format 'QXKX', like Q5KM.

If you want to get more into the weeds, you can check out this extremely useful feature chart. But basically, if you're aiming for below Q4 and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in the format IQXX, like IQ3M. They are newer and offer better performance for their size. The I-quants can also be used on CPU, but will be slower than their K-quant equivalents, so speed vs performance is a tradeoff you'll have to decide.

Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output.

Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
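Every quant of this model is marked Split = true (and most are well over 50GB), so the multi-file download described earlier in this card is the relevant path. A sketch, assuming the repo path bartowski/deepseek-ai_DeepSeek-V3.1-Terminus-GGUF and the Q4KM quant:

```
pip install -U "huggingface_hub[cli]"

# Download every shard of the Q4_K_M quant, preserving the repo's folder layout
huggingface-cli download bartowski/deepseek-ai_DeepSeek-V3.1-Terminus-GGUF \
  --include "*Q4_K_M/*" --local-dir ./
```

llama.cpp only needs to be pointed at the first shard (the -00001-of-XXXXX file); it finds the remaining shards automatically.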
Rombos-Coder-V2.5-Qwen-7b-GGUF
Crimson_Dawn-v0.2-GGUF
Hermes-3-Llama-3.1-8B-GGUF
LGAI-EXAONE_EXAONE-4.0.1-32B-GGUF
EVA-Qwen2.5-32B-v0.0-GGUF
miromind-ai_MiroThinker-v1.0-72B-GGUF
Gemmasutra-Mini-2B-v1-GGUF
internlm3-8b-instruct-GGUF
Gemma-2-Ataraxy-9B-GGUF
OpenGVLab_InternVL3_5-4B-GGUF
ddh0_Cassiopeia-70B-GGUF
Tower-Babel_Babel-9B-Chat-GGUF
TheDrummer_Cydonia-24B-v3.1-GGUF
ByteDance-Seed_Seed-OSS-36B-Instruct-GGUF
Llamacpp imatrix Quantizations of Seed-OSS-36B-Instruct by ByteDance-Seed Original model: https://huggingface.co/ByteDance-Seed/Seed-OSS-36B-Instruct All quants made using imatrix option with dataset from here combined with a subset of combinedallsmall.parquet from Ed Addario here Run them directly with llama.cpp, or any other llama.cpp based project | Filename | Quant type | File Size | Split | Description | | -------- | ---------- | --------- | ----- | ----------- | | Seed-OSS-36B-Instruct-bf16.gguf | bf16 | 72.31GB | true | Full BF16 weights. | | Seed-OSS-36B-Instruct-Q80.gguf | Q80 | 38.42GB | false | Extremely high quality, generally unneeded but max available quant. | | Seed-OSS-36B-Instruct-Q6KL.gguf | Q6KL | 30.05GB | false | Uses Q80 for embed and output weights. Very high quality, near perfect, recommended. | | Seed-OSS-36B-Instruct-Q6K.gguf | Q6K | 29.67GB | false | Very high quality, near perfect, recommended. | | Seed-OSS-36B-Instruct-Q5KL.gguf | Q5KL | 26.08GB | false | Uses Q80 for embed and output weights. High quality, recommended. | | Seed-OSS-36B-Instruct-Q5KM.gguf | Q5KM | 25.59GB | false | High quality, recommended. | | Seed-OSS-36B-Instruct-Q5KS.gguf | Q5KS | 24.97GB | false | High quality, recommended. | | Seed-OSS-36B-Instruct-Q41.gguf | Q41 | 22.76GB | false | Legacy format, similar performance to Q4KS but with improved tokens/watt on Apple silicon. | | Seed-OSS-36B-Instruct-Q4KL.gguf | Q4KL | 22.35GB | false | Uses Q80 for embed and output weights. Good quality, recommended. | | Seed-OSS-36B-Instruct-Q4KM.gguf | Q4KM | 21.76GB | false | Good quality, default size for most use cases, recommended. | | Seed-OSS-36B-Instruct-Q4KS.gguf | Q4KS | 20.70GB | false | Slightly lower quality with more space savings, recommended. | | Seed-OSS-36B-Instruct-Q40.gguf | Q40 | 20.62GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. | | Seed-OSS-36B-Instruct-IQ4NL.gguf | IQ4NL | 20.59GB | false | Similar to IQ4XS, but slightly larger. Offers online repacking for ARM CPU inference. | | Seed-OSS-36B-Instruct-Q3KXL.gguf | Q3KXL | 19.84GB | false | Uses Q80 for embed and output weights. Lower quality but usable, good for low RAM availability. | | Seed-OSS-36B-Instruct-IQ4XS.gguf | IQ4XS | 19.50GB | false | Decent quality, smaller than Q4KS with similar performance, recommended. | | Seed-OSS-36B-Instruct-Q3KL.gguf | Q3KL | 19.14GB | false | Lower quality but usable, good for low RAM availability. | | Seed-OSS-36B-Instruct-Q3KM.gguf | Q3KM | 17.62GB | false | Low quality. | | Seed-OSS-36B-Instruct-IQ3M.gguf | IQ3M | 16.50GB | false | Medium-low quality, new method with decent performance comparable to Q3KM. | | Seed-OSS-36B-Instruct-Q3KS.gguf | Q3KS | 15.86GB | false | Low quality, not recommended. | | Seed-OSS-36B-Instruct-IQ3XS.gguf | IQ3XS | 15.09GB | false | Lower quality, new method with decent performance, slightly better than Q3KS. | | Seed-OSS-36B-Instruct-Q2KL.gguf | Q2KL | 14.38GB | false | Uses Q80 for embed and output weights. Very low quality but surprisingly usable. | | Seed-OSS-36B-Instruct-IQ3XXS.gguf | IQ3XXS | 14.12GB | false | Lower quality, new method with decent performance, comparable to Q3 quants. | | Seed-OSS-36B-Instruct-Q2K.gguf | Q2K | 13.60GB | false | Very low quality but surprisingly usable. | | Seed-OSS-36B-Instruct-IQ2M.gguf | IQ2M | 12.54GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. 
| | Seed-OSS-36B-Instruct-IQ2S.gguf | IQ2S | 11.61GB | false | Low quality, uses SOTA techniques to be usable. | | Seed-OSS-36B-Instruct-IQ2XS.gguf | IQ2XS | 10.95GB | false | Low quality, uses SOTA techniques to be usable. | | Seed-OSS-36B-Instruct-IQ2XXS.gguf | IQ2XXS | 9.91GB | false | Very low quality, uses SOTA techniques to be usable. | Some of these quants (Q3KXL, Q4KL etc) are the standard quantization method with the embeddings and output weights quantized to Q80 instead of what they would normally default to. First, make sure you have hugginface-cli installed: If the model is bigger than 50GB, it will have been split into multiple files. In order to download them all to a local folder, run: You can either specify a new local-dir (ByteDance-SeedSeed-OSS-36B-Instruct-Q80) or download them all in place (./) Previously, you would download Q4044/48/88, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is something called "online repacking" for weights. details in this PR. If you use Q40 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282 you will not be able to run the Q40XX files and will instead need to use Q40. Additionally, if you want to get slightly better quality for , you can use IQ4NL thanks to this PR which will also repack the weights for ARM, though only the 44 for now. The loading time may be slower but it will result in an overall speed incrase. I'm keeping this section to show the potential theoretical uplift in performance from using the Q40 with online repacking. Click to view benchmarks on an AVX2 system (EPYC7702) | model | size | params | backend | threads | test | t/s | % (vs Q40) | | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% | | qwen2 3B Q40 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% | | qwen2 3B Q4KM | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% | | qwen2 3B Q4088 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% | Q4088 offers a nice bump to prompt processing and a small bump to text generation A great write up with charts showing various performances is provided 
by Artefact2 here.

The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have.

If you want your model running as FAST as possible, you'll want to fit the whole thing in your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM.

If you want the absolute maximum quality, add your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB smaller than that total.

Next, you'll need to decide whether to use an 'I-quant' or a 'K-quant'. If you don't want to think too much, grab one of the K-quants. These are in the format 'QXKX', like Q5KM.

If you want to get more into the weeds, you can check out this extremely useful feature chart. But basically, if you're aiming for below Q4 and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in the format IQXX, like IQ3M. They are newer and offer better performance for their size. The I-quants can also be used on CPU, but will be slower than their K-quant equivalents, so speed vs performance is a tradeoff you'll have to decide.

Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output.

Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
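The pp512/tg128-style rows in the benchmark table above are the kind of numbers llama.cpp's bundled llama-bench tool reports. A minimal, assumed invocation against whichever quant you downloaded (the file name and thread count are placeholders):

```
# Measure prompt-processing (pp512) and token-generation (tg128) throughput
./llama-bench -m Seed-OSS-36B-Instruct-Q4_K_M.gguf -p 512 -n 128 -t 64
```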