MidnightPhreaker
KAT-Dev-72B-Exp-GPTQ-INT4-gs128
GLM-4.5-Air-REAP-82B-A12B-GPTQ-INT4-gs32
GLM-4.5-Air-REAP-82B-A12B - GPTQ INT4 (group size 32)

This is a GPTQ quantized version of cerebras/GLM-4.5-Air-REAP-82B-A12B.

Model details:
- Type: Mixture of Experts (MoE)
- Total parameters: 82B (160 routed experts + 1 shared expert)
- Active parameters: ~12B per token (8 experts selected via sparse gating)
- Layers: 92 transformer layers
- Architecture: Glm4MoeForCausalLM

Quantization details:
- Method: GPTQ (post-training quantization)
- Bits: 4
- Group size: 32
- Quantization type: INT
- Symmetric: true
- Dampening: 0.05 (MoE-optimized; lower than the typical 0.1)
- Calibration samples: 128
- Calibration dataset: allenai/c4
- Max sequence length: 512
- Note: all 160 routed experts and the shared expert are quantized to 4-bit

Quantization hardware:
- 6x NVIDIA GeForce RTX 5090 (32GB each)
- CUDA 12.8+
- Sequential layer-by-layer processing (OOM-safe)

This quantized model offers:
- ~4x memory reduction compared to FP16
- Faster inference on compatible hardware
- Maintained accuracy through GPTQ quantization
- Efficient MoE inference (only 8 experts active per token)

Inference requirements:
- NVIDIA GPU with compute capability 7.5+ (RTX 20-series or newer)
- Minimum 32GB VRAM for single-GPU inference
- Multi-GPU setup recommended for larger batch sizes

Tooling:
- Base model: cerebras/GLM-4.5-Air-REAP-82B-A12B
- Quantization tool: llm-compressor
- Compatible inference engines: vLLM, TGI (Text Generation Inference)
- MoE support: full support for sparse expert routing

Limitations:
- Quantization may affect model accuracy on certain tasks
- Requires vLLM or a compatible inference engine for optimal MoE performance
- All experts are quantized, not just the 8 active per token
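Below is a minimal vLLM offline-inference sketch for this card. The repo id and the tensor-parallel degree are assumptions: the path is inferred from this entry's title, and two 32GB GPUs are suggested because 82B parameters at 4-bit plus group-size-32 scales come to roughly 45GB of weights, more than one 32GB card holds. vLLM detects the compressed-tensors format produced by llm-compressor from the checkpoint config, so no explicit quantization flag should be needed.

```python
# Minimal sketch, not the author's published setup. Repo id below is
# assumed from the card title; adjust to the actual Hugging Face path.
from vllm import LLM, SamplingParams

llm = LLM(
    model="MidnightPhreaker/GLM-4.5-Air-REAP-82B-A12B-GPTQ-INT4-gs32",
    tensor_parallel_size=2,   # ~45GB of quantized weights won't fit one 32GB GPU
    max_model_len=4096,       # keep the KV cache modest on 32GB cards
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain sparse expert routing in one paragraph."], params)
print(outputs[0].outputs[0].text)
```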
KAT-Dev-72B-Exp-GPTQ-INT4-gs32
This is a GPTQ quantized version of Kwaipilot/KAT-Dev-72B-Exp.

Quantization details:
- Method: GPTQ (post-training quantization)
- Bits: 4
- Group size: 32
- Quantization type: INT
- Symmetric: true
- Calibration samples: 128
- Calibration dataset: allenai/c4
- Max sequence length: 512

Quantization hardware:
- 6x NVIDIA GeForce RTX 5090 (32GB each)
- CUDA 12.8+
- Sequential layer-by-layer processing (OOM-safe)

This quantized model offers:
- ~4x memory reduction compared to FP16
- Faster inference on compatible hardware
- Maintained accuracy through GPTQ quantization

Inference requirements:
- NVIDIA GPU with compute capability 7.5+ (RTX 20-series or newer)
- Minimum 24GB VRAM for single-GPU inference
- Multi-GPU setup for larger batch sizes

Tooling:
- Base model: Kwaipilot/KAT-Dev-72B-Exp
- Quantization tool: llm-compressor
- Compatible inference engines: vLLM, TGI (Text Generation Inference)

Limitations:
- Quantization may affect model accuracy on certain tasks
- Requires vLLM or a compatible inference engine for optimal performance
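For reference, here is a sketch of an llm-compressor recipe matching the settings listed above (4-bit symmetric INT, group size 32, 128 samples of allenai/c4 at 512 tokens). The actual quantization script is not published, so the `oneshot` entry point and the `config_groups` layout (llm-compressor's API as of roughly v0.4) are a reconstruction under assumption, not the exact command used.

```python
# Hypothetical reconstruction of the recipe implied by the card's settings.
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

recipe = GPTQModifier(
    ignore=["lm_head"],            # common practice: leave the output head unquantized
    config_groups={
        "group_0": {
            "targets": ["Linear"],
            "weights": {
                "num_bits": 4,     # Bits: 4
                "type": "int",     # Quantization type: INT
                "symmetric": True, # Symmetric: true
                "strategy": "group",
                "group_size": 32,  # Group size: 32
            },
        }
    },
)

oneshot(
    model="Kwaipilot/KAT-Dev-72B-Exp",
    dataset="c4",                  # assumes llm-compressor's registered c4 loader;
                                   # otherwise pass a datasets.Dataset directly
    recipe=recipe,
    max_seq_length=512,            # Max sequence length: 512
    num_calibration_samples=128,   # Calibration samples: 128
    output_dir="KAT-Dev-72B-Exp-GPTQ-INT4-gs32",
)
```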
DeepSeek-Coder-33B-Instruct-NVFP4
FuseO1-DeepSeekR1-Qwen2.5-Coder-32B-Preview-ao-int4wo-gs128
FuseO1-DeepSeekR1-Qwen2.5-Coder-32B-Preview-bnb-4bit
FuseO1-DeepSeekR1-Qwen2.5-Coder-32B-Preview-bnb-4bit-fp4
nomic-embed-text-v1.5
nomic-embed-text-v1.5-onnx
KAT-Dev-72B-Exp-GPTQ-INT4-gs32-0.01
This is a GPTQ quantized version of Kwaipilot/KAT-Dev-72B-Exp.

Quantization details:
- Method: GPTQ (post-training quantization)
- Bits: 4
- Group size: 32
- Quantization type: INT
- Symmetric: true
- Calibration samples: 128
- Calibration dataset: allenai/c4
- Max sequence length: 512

Quantization hardware:
- 6x NVIDIA GeForce RTX 5090 (32GB each)
- CUDA 12.8+
- Sequential layer-by-layer processing (OOM-safe)

This quantized model offers:
- ~4x memory reduction compared to FP16
- Faster inference on compatible hardware
- Maintained accuracy through GPTQ quantization

Inference requirements:
- NVIDIA GPU with compute capability 7.5+ (RTX 20-series or newer)
- Minimum 24GB VRAM for single-GPU inference
- Multi-GPU setup for larger batch sizes

Tooling:
- Base model: Kwaipilot/KAT-Dev-72B-Exp
- Quantization tool: llm-compressor
- Compatible inference engines: vLLM, TGI (Text Generation Inference)

Limitations:
- Quantization may affect model accuracy on certain tasks
- Requires vLLM or a compatible inference engine for optimal performance
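The `-0.01` suffix on this repo is read here as the GPTQ dampening fraction (the fraction of the mean Hessian diagonal added before inversion for numerical stability), versus the 0.05 used for the GLM card above; that reading is an assumption, since this card does not state it. If correct, the only change from the recipe sketch under the gs32 entry would be `dampening_frac`:

```python
# Hypothetical: "-0.01" in the repo name read as the GPTQ dampening fraction.
from llmcompressor.modifiers.quantization import GPTQModifier

recipe = GPTQModifier(
    dampening_frac=0.01,   # lower dampening tracks the Hessian more closely
    ignore=["lm_head"],
    config_groups={        # same 4-bit, group-size-32 scheme as the sketch above
        "group_0": {
            "targets": ["Linear"],
            "weights": {"num_bits": 4, "type": "int", "symmetric": True,
                        "strategy": "group", "group_size": 32},
        }
    },
)
```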