ymcki
# Llama-3_1-Nemotron-51B-Instruct-GGUF
Original model: https://huggingface.co/nvidia/Llama-3_1-Nemotron-51B-Instruct

Important for people who want to do their own quantization: the convert_hf_to_gguf.py in b4380 of llama.cpp doesn't read the rope_theta parameter, so it can't generate GGUFs that work with prompts longer than 4k tokens. There is currently a PR in llama.cpp to update convert_hf_to_gguf.py. If you can't wait for the PR to get through, you can download a working convert_hf_to_gguf.py from this repository before you do the GGUF conversion yourself.

Starting from b4380 of llama.cpp, DeciLMForCausalLM's variable Grouped Query Attention is now supported. Please download it and compile it to run the GGUFs in this repository. This modification fully supports Llama-3_1-Nemotron-51B-Instruct. However, it may not support future DeciLMForCausalLM models that have noop or linear FFN layers; I suppose that support can be added when there are actually models using those types of layers.

Since I am a free user, for the time being I only upload models that might be of interest to most people.

| Quant Type | imatrix | File Size | Delta Perplexity | KL Divergence | Description |
| ---------- | ------- | --------- | ---------------- | ------------- | ----------- |
| Q6_K | calibration_datav3 | 42.26GB | -0.002436 ± 0.001565 | 0.003332 ± 0.000014 | Good for Nvidia cards or Apple Silicon with 48GB RAM. Should perform very close to the original. |
| Q5_K_M | calibration_datav3 | 36.47GB | 0.020310 ± 0.002052 | 0.005642 ± 0.000024 | Good for A100 40GB or dual 3090. Better than Q4_K_M but larger and slower. |
| Q4_K_M | calibration_datav3 | 31.04GB | 0.055444 ± 0.002982 | 0.012021 ± 0.000052 | Good for A100 40GB or dual 3090. Higher cost-performance ratio than Q5_K_M. |
| IQ4_NL | calibration_datav3 | 29.30GB | 0.088279 ± 0.003944 | 0.020314 ± 0.000093 | For 32GB cards, e.g. 5090. Minor performance gain doesn't justify its use over IQ4_XS. |
| IQ4_XS | calibration_datav3 | 27.74GB | 0.095486 ± 0.004039 | 0.020962 ± 0.000097 | For 32GB cards, e.g. 5090. Too slow for CPU and Apple. Recommended. |
| Q4_0 | calibration_datav3 | 29.34GB | 0.543042 ± 0.009290 | 0.077602 ± 0.000389 | For 32GB cards, e.g. 5090. Too slow for CPU and Apple. |
| IQ3_M | calibration_datav3 | 23.49GB | 0.313812 ± 0.006299 | 0.054266 ± 0.000205 | Largest model that can fit a single 3090 at 5k context. Not recommended for CPU or Apple Silicon due to high computational cost. |
| IQ3_S | calibration_datav3 | 22.65GB | 0.434774 ± 0.007162 | 0.069264 ± 0.000242 | Largest model that can fit a single 3090 at 7k context. Not recommended for CPU or Apple Silicon due to high computational cost. |
| IQ3_XXS | calibration_datav3 | 20.19GB | 0.638630 ± 0.009693 | 0.092827 ± 0.000367 | Largest model that can fit a single 3090 at 13k context. Not recommended for CPU or Apple Silicon due to high computational cost. |
| Q3_K_S | calibration_datav3 | 22.65GB | 0.698971 ± 0.010387 | 0.089605 ± 0.000443 | Largest model that can fit a single 3090 that performs well on all platforms. |
| Q3_K_S | none | 22.65GB | 2.224537 ± 0.024868 | 0.283028 ± 0.001220 | Largest model that can fit a single 3090, quantized without an imatrix. |

## How to calculate perplexity and KL divergence

Make sure you have llama.cpp compiled. First, create an imatrix with a dataset. Second, find the base values of the F16 gguf. Please be warned that the generated base value file is about 10GB. Adjust GPU layers depending on your VRAM. Finally, calculate the perplexity and KL divergence of the Q4_0 gguf, again adjusting GPU layers depending on your VRAM.

## How to download the GGUFs

First, make sure you have huggingface-cli installed.

## How to run this model

Go to the llama.cpp release page and download an appropriate pre-compiled release, b4380 or later. If that doesn't work, download the source of any version of llama.cpp starting from b4380, compile it, then run the model.

Thank you bartowski for providing a README.md to get me started.
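For those doing their own quantization, the conversion described above can be sketched as follows. This is a minimal sketch: the local directory name, output file names, and the choice of Q4_K_M are illustrative assumptions, and it assumes you are using the fixed convert_hf_to_gguf.py and a llama.cpp build from b4380 or later.

```shell
# Convert the safetensors model to an F16 gguf (directory and file names are illustrative)
python convert_hf_to_gguf.py Llama-3_1-Nemotron-51B-Instruct/ \
    --outfile Llama-3_1-Nemotron-51B-Instruct.f16.gguf --outtype f16

# Quantize, optionally guided by an imatrix produced earlier with llama-imatrix
./llama-quantize --imatrix imatrix.dat \
    Llama-3_1-Nemotron-51B-Instruct.f16.gguf \
    Llama-3_1-Nemotron-51B-Instruct.Q4_K_M.gguf Q4_K_M
```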
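The imatrix and perplexity/KL-divergence steps above can be sketched as shell commands. The model and dataset file names, the output paths, and the -ngl values are illustrative assumptions; tune -ngl to your VRAM as noted above.

```shell
# 1. Create an imatrix from a calibration dataset
./llama-imatrix -m Llama-3_1-Nemotron-51B-Instruct.f16.gguf \
    -f calibration_datav3.txt -o imatrix.dat -ngl 20

# 2. Save the base values (logits) of the F16 gguf; the output file is about 10GB
./llama-perplexity -m Llama-3_1-Nemotron-51B-Instruct.f16.gguf \
    -f wiki.test.raw --kl-divergence-base f16.kld -ngl 20

# 3. Compute perplexity and KL divergence of the Q4_0 gguf against the base values
./llama-perplexity -m Llama-3_1-Nemotron-51B-Instruct.Q4_0.gguf \
    --kl-divergence-base f16.kld --kl-divergence -ngl 40
```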
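The download step can be sketched with huggingface-cli as below; the repository id and the quant file pattern are assumptions, so adjust the --include pattern to the file you actually want.

```shell
# Install the Hugging Face CLI
pip install -U "huggingface_hub[cli]"

# Download one quant (pattern is illustrative) into the current directory
huggingface-cli download ymcki/Llama-3_1-Nemotron-51B-Instruct-GGUF \
    --include "*Q4_K_M*" --local-dir ./
```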
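If none of the pre-compiled releases works for you, compiling from source and running can be sketched as below. The build flags, -ngl value, context size, and prompt are illustrative; add -DGGML_CUDA=ON only if you have an Nvidia GPU.

```shell
# Build llama.cpp (b4380 or later) from source
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build                       # add -DGGML_CUDA=ON for Nvidia GPU offload
cmake --build build --config Release -j

# Run a downloaded quant; adjust -ngl (GPU layers) and -c (context) to your VRAM
./build/bin/llama-cli -m Llama-3_1-Nemotron-51B-Instruct.Q4_K_M.gguf \
    -ngl 40 -c 8192 -p "Hello"
```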