GLM-4.7-REAP-40-W4A16
License: apache-2.0 · by 0xSero
Language Model · 218B params (INT4 weights) · 190 downloads
Quick Summary
A compressed variant of GLM-4.7: 40% of MoE experts pruned with REAP, then weights quantized to INT4 (W4A16) with AutoRound, shrinking the 700GB base model to roughly 108GB (~6.5x overall).
Device Compatibility
Server-class multi-GPU hardware. The ~108GB quantized checkpoint exceeds laptop and mobile memory; multiple data-center GPUs (e.g. tensor parallelism across 4 devices) are required for inference.
Code Examples
Compression Pipeline

```text
GLM-4.7 (358B, 700GB)
 │
 ▼ REAP 40% expert pruning
 │
GLM-4.7-REAP-40 (218B)
 │
 ▼ AutoRound W4A16 quantization
 │
GLM-4.7-REAP-40-W4A16 (~108GB) ◀── This model
```

Total: ~6.5x compression
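The pipeline's numbers can be sanity-checked with back-of-envelope arithmetic; this is an illustrative estimate that ignores the small overhead of group-wise quantization scales:

```python
# Rough size estimate for the two-stage compression above (illustrative only)
base_gb = 700            # GLM-4.7 stored in 16-bit weights
kept = 218 / 358         # fraction of params surviving 40% expert pruning
w4_vs_16bit = 4 / 16     # INT4 weights vs 16-bit weights
final_gb = base_gb * kept * w4_vs_16bit
ratio = base_gb / final_gb
print(round(final_gb), round(ratio, 1))  # ≈ the stated ~108GB and ~6.5x
```

The estimate lands within a few GB of the published checkpoint size; the residual gap is the metadata and per-group scale/zero-point tensors the quantized format carries.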
Serve with vLLM

```bash
vllm serve 0xSero/GLM-4.7-REAP-40-W4A16 \
  --tensor-parallel-size 4 \
  --trust-remote-code \
  --quantization gptq
```
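Once serving, the model can be queried through vLLM's OpenAI-compatible REST API. A minimal sketch of the request body; the endpoint URL, port, and prompt are illustrative assumptions:

```python
import json

# Request body for vLLM's OpenAI-compatible /v1/chat/completions endpoint
# (URL, port, and prompt below are assumptions, not part of the model card)
payload = {
    "model": "0xSero/GLM-4.7-REAP-40-W4A16",
    "messages": [
        {"role": "user", "content": "Summarize REAP pruning in one sentence."}
    ],
    "max_tokens": 128,
}
body = json.dumps(payload)
# Send with e.g.:
#   requests.post("http://localhost:8000/v1/chat/completions",
#                 data=body, headers={"Content-Type": "application/json"})
```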
🚀 Deployment

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "0xSero/GLM-4.7-REAP-40-W4A16",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "0xSero/GLM-4.7-REAP-40-W4A16", trust_remote_code=True
)
```

Example: Quantize REAP-40 to W4A16
```python
#!/usr/bin/env python3
"""
AutoRound W4A16 Quantization
Intel's weight-only quantization using signed gradient descent.
"""
from auto_round import AutoRound


def quantize_w4a16(
    model_path: str,
    output_dir: str,
    bits: int = 4,
    group_size: int = 128,
    format: str = "auto_gptq",
):
    """
    Quantize a model to INT4 weights with FP16 activations.

    Args:
        model_path: Path to the REAP-pruned model
        output_dir: Output directory
        bits: Weight bit width (4 for W4A16)
        group_size: Quantization group size (128 is a common default)
        format: Output format (auto_gptq for vLLM compatibility)
    """
    ar = AutoRound(
        model_path,
        scheme="W4A16",
        bits=bits,
        group_size=group_size,
        device="cuda",
        device_map="auto",
        trust_remote_code=True,
        batch_size=1,
        seqlen=512,
        nsamples=64,
    )
    ar.quantize_and_save(output_dir, format=format)


# Example: Quantize REAP-40 to W4A16
quantize_w4a16(
    model_path="./GLM-4.7-REAP-40",
    output_dir="./GLM-4.7-REAP-40-W4A16",
)
```
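For intuition about what the W4A16 storage scheme does, per-group INT4 weight quantization can be sketched in plain NumPy. This is an illustrative round-to-nearest toy, not AutoRound's signed-gradient-descent optimization; the array sizes and seed are arbitrary:

```python
import numpy as np


def quant_dequant_w4(w: np.ndarray, group_size: int = 128) -> np.ndarray:
    """Symmetric round-to-nearest INT4 quantize + dequantize, per group.

    Weights are split into groups of `group_size`, each with its own FP scale;
    activations would remain FP16 untouched (the "A16" half of W4A16).
    """
    groups = w.reshape(-1, group_size)
    # One scale per group, mapping the group's max magnitude to INT4 level 7
    scale = np.abs(groups).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(groups / scale), -8, 7)  # INT4 range [-8, 7]
    return (q * scale).reshape(w.shape)


rng = np.random.default_rng(0)
w = rng.standard_normal(1024).astype(np.float32)
w_hat = quant_dequant_w4(w)
err = float(np.abs(w - w_hat).mean())
```

Smaller groups mean more scales (more overhead) but finer-grained reconstruction; `group_size=128` is the trade-off the model card's quantization config uses.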
Citation

```bibtex
@article{lasby2025reap,
  title={REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
  author={Lasby, Mike and Lazarevich, Ivan and Sinnadurai, Nish and Lie, Sean and Ioannou, Yani and Thangarasa, Vithursan},
  journal={arXiv preprint arXiv:2510.13999},
  year={2025},
  url={https://arxiv.org/abs/2510.13999}
}
```