# GLM-4.7-REAP-218B-A32B-W4A16
by 0xSero · Language Model · 218B params · license: apache-2.0
## Quick Summary

GLM-4.7-REAP-218B-A32B-W4A16 is a 4-bit (W4A16) AutoRound quantization of GLM-4.7-REAP-218B-A32B, itself a REAP expert-pruned variant of GLM-4.7. Together the two stages compress the original 700GB model to roughly 108GB.
Device Compatibility
Mobile
4-6GB RAM
Laptop
16GB RAM
Server
GPU
Minimum Recommended
204GB+ RAM
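For rough capacity planning: W4A16 stores weights as packed 4-bit integers plus one 16-bit scale per group of 128 weights (per the quantization config in this card). A back-of-the-envelope estimate of the weight footprint, using a hypothetical helper (real checkpoints differ slightly, e.g. unquantized embeddings and layer norms):

```python
def w4a16_weight_bytes(n_params, bits=4, group_size=128, scale_bytes=2):
    """Packed low-bit weights plus one fp16 scale per quantization group."""
    packed = n_params * bits / 8           # 0.5 bytes per parameter at 4-bit
    scales = n_params / group_size * scale_bytes
    return packed + scales

gb = w4a16_weight_bytes(218e9) / 1e9
print(f"{gb:.0f} GB")  # ~112 GB, in the ballpark of the ~108GB checkpoint
```

Actual runtime memory is higher once activations and KV cache are added, which is why the recommended figure above exceeds the raw weight size.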
## Code Examples
### Compression Pipeline

```text
GLM-4.7 (358B, 700GB)
   |
   v  REAP 40% pruning (keeps 96 of 160 experts)
   |
GLM-4.7-REAP-218B-A32B (218B, 407GB)
   |
   v  AutoRound W4A16 quantization
   |
GLM-4.7-REAP-218B-A32B-W4A16 (218B, 108GB)  <-- this model
```

Total: ~6.5x compression (700GB → 108GB).
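The pruning stage removes whole experts from each MoE layer based on router statistics. A toy sketch of the idea, not REAP's actual saliency criterion (`prune_experts` and the mean-router-probability saliency are illustrative stand-ins):

```python
import numpy as np

def prune_experts(router_probs, keep_ratio=0.6):
    """Keep the top-k experts ranked by a toy saliency: mean router probability."""
    saliency = router_probs.mean(axis=0)       # (n_experts,)
    k = int(len(saliency) * keep_ratio)
    return np.sort(np.argsort(saliency)[-k:])  # indices of surviving experts

rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(160), size=1000)  # router distributions over 160 experts
keep = prune_experts(probs, keep_ratio=0.6)
print(len(keep))  # 96 of 160 experts survive, matching the 40% prune above
```

In a real pipeline the surviving experts' weights are then copied into a smaller MoE checkpoint and the router is re-indexed accordingly.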
### AutoRound Quantization Details

```yaml
bits: 4
group_size: 128
format: auto_round
nsamples: 64
seqlen: 512
dataset: NeelNanda/pile-10k
```
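W4A16 means weights are stored as 4-bit integers with a shared scale per group of 128, and dequantized to 16-bit for computation. A minimal numpy sketch of this group-wise scheme (illustrative only; AutoRound additionally tunes rounding via sign-gradient descent, which is omitted here):

```python
import numpy as np

def quantize_w4a16(w, group_size=128):
    """Symmetric 4-bit group-wise quantization: one fp scale per group of weights."""
    groups = w.reshape(-1, group_size)
    scale = np.abs(groups).max(axis=1, keepdims=True) / 7   # map max magnitude to +/-7
    q = np.clip(np.round(groups / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale, shape):
    """Reconstruct fp16 weights for the 16-bit activation matmul."""
    return (q * scale).reshape(shape).astype(np.float16)

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 128)).astype(np.float32)
q, s = quantize_w4a16(w)
w_hat = dequantize(q, s, w.shape)
err = float(np.abs(w - w_hat).max())
print(q.dtype, w_hat.dtype, err)  # int4 codes stored in int8; small per-weight error
```

The group-wise scales are why quantization error stays bounded: each group of 128 weights gets its own dynamic range instead of sharing one scale per tensor.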
### Reproduce This Model

```bash
# 1. Download the BF16 REAP model
huggingface-cli download 0xSero/GLM-4.7-REAP-218B-A32B --local-dir ./GLM-4.7-REAP-218B-A32B

# 2. Run AutoRound quantization
pip install auto-round
python -c "
from auto_round import AutoRound

ar = AutoRound(
    './GLM-4.7-REAP-218B-A32B',
    device='cuda',
    device_map='auto',
    nsamples=64,
    seqlen=512,
    batch_size=1,
)
ar.quantize_and_save('./GLM-4.7-REAP-218B-A32B-W4A16', format='auto_round')
"
# Takes ~2 hours on 8x H200
```
### Citation

```bibtex
@article{jones2025reap,
  title={REAP: Router-Experts Activation Pruning for Efficient Mixture-of-Experts},
  author={Jones, et al.},
  journal={arXiv preprint arXiv:2505.20877},
  year={2025}
}

@misc{autoround2024,
  title={AutoRound: Advanced Weight Quantization},
  author={Intel Corporation},
  year={2024},
  howpublished={\url{https://github.com/intel/auto-round}}
}
```