Downtown-Case

35 models • 1 total models in database

Sort by:

GLM-4.6-128GB-RAM-IK-GGUF

Quantized for 128GB RAM + single GPU setups, with `IQK` quants for better quality/performance in the size than mainline llama.cpp. Requires ikllama.cpp. I can hit ~6.8 tokens a second textgen on 128GB dual-channel DDR5, single CCD Ryzen 7000 + a single 3090. See ubergarm's model card for more info on running these quants: - The first 6 layers are `IQ3KT` (a less lossy and more GPU optimal 3bpw trellis quant), running under the assumption they will be offloaded to GPU. - Instead of quantizing ffndown asymmetrically, it's quantized the same as ffnup/gate, but the beginning/end layers are `IQ3KS`. Targeting this more finely is a WIP. 117.3GB, for ~11GB-16GB VRAM + 128GB RAM (or longer context). - Dense parts are `IQ4KT` instead of `IQ5KS` to save VRAM. - More layers are `IQ2KL` instead of `IQ3KS` to avoid CPU swapping, and layer 92 was also 'trimmed' since it's not used. - Uses ubergarm's ikllama.cpp imatrix (which should be less lossy without a .gguf -> .dat conversion). - Unsloth bf16 weights used as a base, including its tokenizer bugfixes. - Expert quantization follows Unsloth's IQ2XSS layer scheme, with perplexity 'bumps' boosted. See the quantization dump here: https://huggingface.co/ubergarm/GLM-4.6-GGUF/discussions/2#68dd8ca9cb29272d402f3062 `taskset -c 8-15 nice --20 build/bin/llama-server --cache-type-k q80 --cache-type-v q51 --batchsize 4096 --ubatchsize 4096 --ctx-size 20480 --host 0.0.0.0 --port 5000 -fa -fmoe -ngl 999 -ngld 999 -ot "blk\.([0-6])\.ffn.=CUDA0" -ot exps=CPU --parallel 1 --threads 8 --no-mmap --path examples/server/publicmikupad --sql-save-file /home/alpha/FastStorage/SQLSave/sqlite-save.sql --model /path/to/GLM-4.6/24GB+128GBV3/GLM-4.6-IQ2KL-BIG-00001-of-00003.gguf` 6 MoE layers on GPU, adjust with the '6' in `"blk\.([0-6])\.ffn.=CUDA0"` `taskset -c 8-15 ./build/bin/llama-perplexity --ctx-size 2048 -fa -fmoe -ngl 999 -ngld 999 -ot "blk\.([0-9])\.ffn.=CUDA0" -ot exps=CPU --no-mmap --file /home/alpha/Models/GGUF/ddh0imatcalibrationdatav2.txt --kl-divergence --kl-divergence-base /home/alpha/Models/GGUF/GLM-4.6-KLD-ref-logits-Q80-ddh0-imat-calibration-data-v2.bin --model /home/alpha/Models/GGUF/GLM-4.6/24GB+128GBV3/GLM-4.6-unsloth.gguf-00001-of-00003.gguf` 126.8GB, for 24GB VRAM + 128GB RAM. Slower, higher quality than V3, 0.081 KLD. - All ffndown layers are 3 bit. The same 'sensitive' up/gate ffns as V3 are still 3-bit. - `IQ3KT` instead of `IQ3KS`, for smaller size and less loss. - The cost: ~15% slower TG than V3 (on my Ryzen 7800). `taskset -c 8-15 nice --20 build/bin/llama-server --cache-type-k q80 --cache-type-v q51 --batchsize 4096 --ubatchsize 4096 --ctx-size 20480 --host 0.0.0.0 --port 5000 -fa -fmoe -ngl 999 -ngld 999 -ot "blk\.([0-6])\.ffn.=CUDA0" -ot exps=CPU --parallel 1 --threads 8 --no-mmap --path examples/server/publicmikupad --sql-save-file /home/alpha/FastStorage/SQLSave/sqlite-save.sql --model /path/to/GLM-4.6/24GB+128GBV3/GLM-4.6-IQ2KL-BIG-00001-of-00003.gguf` 3 MoE layers on GPU, adjust with the '6' in `"blk\.([0-6])\.ffn.=CUDA0"` `taskset -c 8-15 ./build/bin/llama-perplexity --ctx-size 2048 -fa -fmoe -ngl 999 -ngld 999 -ot "blk\.([0-9])\.ffn.=CUDA0" -ot exps=CPU --no-mmap --file /home/alpha/Models/GGUF/ddh0imatcalibrationdatav2.txt --kl-divergence --kl-divergence-base /home/alpha/Models/GGUF/GLM-4.6-KLD-ref-logits-Q80-ddh0-imat-calibration-data-v2.bin --model /home/alpha/Models/GGUF/GLM-4.6/24GB+128GBV4/GLM-4.6-slow.gguf-00001-of-00003.gguf` For reference, Unsloth's (130.8GB) Q2KXL has a KL Divergence of ~0.12, and bartowski's 128GB Q2KXL is ~0.155, per AesSedai's benchmarks. Ik quants make a massive difference in this range. Ubergarm's IQ2KL mix has a KLD of .088 at 127.5GB. I'd recommend that as well! With all the hearsay about the effects of context cache quantization( `--cache-type-k`, `--cache-type-v` ), I tested the V4 GGUF at different levels: - q80/q80 is within the margin of error (+0.001 KLD); seemingly very little loss for the huge vram savings. - But some other configurations appears to be reasonably low loss, with q80/q51 (for instance) within the margin of error, and q51/iq4nl (at +0.0045) being quite reasonable for squeezing in a lot of context. Personaly, I use q80/q51 now. - Take this with a grain of salt, as (due to the way the test uses the K/V cache) I haven't confirmed the correlation between KV cache quantization KLD with actual long context inference. KL divergence/perplexity tests are done with AesSedai's wonderful testing data: https://huggingface.co/AesSedai/GLM-4.6-GGUF/discussions/1#68dcb412ae30ad1405dacd9a MoE Experts are generally `IQ2KL`/`IQ3KS` on CPU, or `IQ3KT` if destined for the GPU, with dense layers at higher quants levels like `IQ5KS` for less loss. My hardware is a undervolted 3090, dual channel DDR5 6000, an AMD 7800 CPU and linux, though dual CDD ryzen (or tweaked systems) should be notably faster due to the single CCD bandwidth limit. See the example scripts for quantizing, launching the server, and such. KLD results are not necessarily comparible to other repos (as they were run at 2048 context instead of the default 512), but they will be once I rerun them. - ~~Check perplexity of expert FFNs in each layer.~~ - Make more optimal mixes using the Thireus's perplexity data, as seen in `ExampleScripts/GLM-4.6-expert-sorted-perplexity.txt`. - Find 'point of diminishing returns' for dense layer quantization (`Q6K`?). - Test KLD impact of different tokenembd quantization. Derived from ubergarm's GLM-4.5 (Instruct) quantizations: https://huggingface.co/ubergarm/GLM-4.5-GGUF And GGUF-Tool-Suite: https://github.com/Thireus/GGUF-Tool-Suite

Downtown-Case

GLM-4.6-128GB-RAM-IK-GGUF

GLM-4.5-Base-128GB-RAM-IQ2_KL-GGUF

ByteDance-Seed_Seed-OSS-36B-Instruct-exl3-4.02bpw-hb8

Seed-OSS-36B-Base-Instruct-Karcher-Merge

ByteDance-Seed_Seed-OSS-36B-Instruct-exl3-4.22bpw-hb8

OpenBuddy_SimpleChat-32B-V1-exl3-4.3bpw-hb8

Star-Command-R-Lite-32B-v1-exl2-4bpw

ByteDance-Seed_Seed-OSS-36B-Instruct-exl3-3.22bpw-hb6

ByteDance-Seed_Seed-OSS-36B-Base-woSyn-exl3-3.22bpw-hb6

ByteDance-Seed_Seed-OSS-36B-Base-exl3-4.22bpw-hb8

jukofyork_command-r-35b-writer-v3-exl3-3.75bpw-hb6

ByteDance-Seed_Seed-OSS-36B-Base-woSyn-exl3-4.22bpw-hb8

Seed-OSS-36B-Base-Instruct-Karcher-Merge-exl3-4.22bpw-hb8

CohereForAI_c4ai-command-r-08-2024-exl2-3.75bpw

c4ai-command-a-03-2025-exl3-3.12bpw-hb6

internlm2_5-7b-chat-1m-llamafied-Q6K-GGUF

Tifa-Deepsex-14b-CoT-Chat-HF

Qwen_Qwen2.5-32B-Base-exl2-3.75bpw

Qwen2.5-32B-EVA-Instruct-Merge-0.1

meta-llama_Meta-Llama-3.1-8B-Instruct-exl2-8bpw

Qwen_Qwen2.5-32B-Base-exl2-3.62bpw

deepseek-ai_DeepSeek-R1-Distill-Qwen-32B-exl2-4.5bpw-8K-Cal

Tifa-Deepsex-14b-CoT-Crazy-HF

EVA-UNIT-01_EVA-Qwen2.5-32B-v0.1-exl2-4.1bpw

nbeerbower_EVA-Gutenberg3-Qwen2.5-32B-exl3-4.0bpw-hb8

OpenBuddy_CoTGen-32B-V1-exl3-4.3bpw-hb8

internlm2_5-7b-chat-1m-llamafied

internlm_internlm2_5-20b-llamafied-hacked-rope

internlm2_5-7b-chat-1m-llamafied-6bpw-exl2

Tess-2.0-RPMerge-SlerpMerge

aws-prototyping_MegaBeam-Mistral-7B-512K-exl2-8.0bpw

Qwen_Qwen2.5-32B-Base-exl2-3.92bpw

nbeerbower_EVA-Gutenberg3-Qwen2.5-32B-exl2-5bpw-8K-Cal

Deepseek-EVA-32B-SCE-v1

Deepseek-EVA-32B-DELLA-v1