GLM-4.6-128GB-RAM-IK-GGUF

ik_llama.cpp quants by Downtown-Case

Quick Summary

Quantized for 128GB RAM + single GPU setups, using `IQK` quant types for better quality and performance at this size than mainline llama.cpp quantizations.
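As a rough sanity check on the 128GB target, here is a back-of-envelope size estimate. Both inputs are my assumptions, not figures from this card: GLM-4.6's total parameter count (~355B) and a blended ~3.1 bits/weight across the mixed recipe below.

```python
# Back-of-envelope: quantized model footprint vs. available memory.
# ASSUMPTIONS (not from the model card): ~355B total parameters for
# GLM-4.6, and ~3.1 bits/weight averaged over the mixed IQK recipe.
TOTAL_PARAMS = 355e9        # assumed total parameter count
AVG_BITS_PER_WEIGHT = 3.1   # assumed blended rate across the quant mix

size_gib = TOTAL_PARAMS * AVG_BITS_PER_WEIGHT / 8 / 1024**3
print(f"~{size_gib:.0f} GiB")  # ballpark file/RAM footprint
```

Under these assumptions the weights land around 128 GiB, which is why the heavier tensors in the recipe are pinned to the GPU while the bulk of the routed experts stay in system RAM.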

Code Examples

V1 (Obsolete):

```text
# Attention (GPU)
blk\..*\.attn_q.*=iq5_ks
blk\..*\.attn_k.*=iq6_k
blk\..*\.attn_v.*=iq6_k
blk\..*\.attn_output.*=iq5_ks

# First 3 Dense Layers [0-2] (GPU)
blk\..*\.ffn_down\.weight=iq5_ks
blk\..*\.ffn_(gate|up)\.weight=iq5_ks

# Shared Expert Layers [3-92] (GPU)
blk\..*\.ffn_down_shexp\.weight=iq5_ks
blk\..*\.ffn_(gate|up)_shexp\.weight=iq5_ks

# Routed Experts Layers [3-6] (GPU)
blk\.[3-6]\.ffn_down_exps\.weight=iq3_kt
blk\.[3-6]\.ffn_(gate|up)_exps\.weight=iq3_kt

# Routed Experts Layers [7-19] (CPU)
blk\.[7-9]\.ffn_down_exps\.weight=iq3_ks
blk\.[7-9]\.ffn_(gate|up)_exps\.weight=iq3_ks
blk\.1[0-9]\.ffn_down_exps\.weight=iq3_ks
blk\.1[0-9]\.ffn_(gate|up)_exps\.weight=iq3_ks

# Routed Experts Layers [81-92] (CPU)
blk\.8[1-9]\.ffn_down_exps\.weight=iq3_ks
blk\.8[1-9]\.ffn_(gate|up)_exps\.weight=iq3_ks
blk\.9[0-2]\.ffn_down_exps\.weight=iq3_ks
blk\.9[0-2]\.ffn_(gate|up)_exps\.weight=iq3_ks

# Routed Experts Layers [20-80] (CPU)
blk\..*\.ffn_down_exps\.weight=iq2_kl
blk\..*\.ffn_(gate|up)_exps\.weight=iq2_kl

# NextN MTP Layer [92] (Unused, not loaded from disk)
blk\..*\.nextn\.embed_tokens\.weight=iq5_ks
blk\..*\.nextn\.shared_head_head\.weight=iq5_ks
blk\..*\.nextn\.eh_proj\.weight=q8_0

# Non-Repeating Layers
token_embd\.weight=iq4_k
output\.weight=iq6_k
```

```text
====== Perplexity statistics ======
Mean PPL(Q)                   :   8.894389 ±   0.158902
Mean PPL(base)                :   8.453608 ±   0.150165
Cor(ln(PPL(Q)), ln(PPL(base))):  98.45%
Mean ln(PPL(Q)/PPL(base))     :   0.050827 ±   0.003141
Mean PPL(Q)/PPL(base)         :   1.052141 ±   0.003305
Mean PPL(Q)-PPL(base)         :   0.440781 ±   0.028588

====== KL divergence statistics ======
Mean    KLD:   0.080672 ±   0.001829
Maximum KLD:  18.875971
99.9%   KLD:   3.780314
99.0%   KLD:   1.146402
95.0%   KLD:   0.295799
90.0%   KLD:   0.153155
Median  KLD:   0.020809
10.0%   KLD:   0.000088
 5.0%   KLD:   0.000023
 1.0%   KLD:   0.000002
 0.1%   KLD:  -0.000002
Minimum KLD:  -0.000217

====== Token probability statistics ======
Mean    Δp: -1.231 ± 0.059 %
Maximum Δp: 99.351%
99.9%   Δp: 48.984%
99.0%   Δp: 19.863%
95.0%   Δp:  6.112%
90.0%   Δp:  2.490%
75.0%   Δp:  0.155%
Median  Δp: -0.009%
25.0%   Δp: -0.731%
10.0%   Δp: -5.427%
 5.0%   Δp: -11.712%
 1.0%   Δp: -44.734%
 0.1%   Δp: -88.132%
Minimum Δp: -99.827%
RMS Δp    :  9.421 ± 0.179 %
Same top p: 86.479 ± 0.216 %
```

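The summary lines above are internally consistent and easy to cross-check: the quant costs about a 5.2% perplexity increase over the base model, and `exp` of the mean log-ratio agrees with the ratio of the mean perplexities.

```python
import math

# Figures copied from the statistics block above.
ppl_q, ppl_base = 8.894389, 8.453608   # Mean PPL(Q), Mean PPL(base)
mean_ln_ratio = 0.050827               # Mean ln(PPL(Q)/PPL(base))

ratio = ppl_q / ppl_base               # cf. Mean PPL(Q)/PPL(base) = 1.052141
diff = ppl_q - ppl_base                # cf. Mean PPL(Q)-PPL(base) = 0.440781

print(round(ratio, 6))                     # ≈ 1.052141 → ~5.2% PPL increase
print(round(diff, 6))                      # ≈ 0.440781
print(round(math.exp(mean_ln_ratio), 6))   # ≈ 1.052141, consistent with ratio
```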