inferencerlabs
GLM-5.1-MLX-4.8bit
GLM-5-MLX-4.8bit
GLM-5.1-MLX-2.5bit-INF
GLM-5-MLX-5.6bit-INF
NVIDIA-Nemotron-3-Super-120B-A12B-MLX-9bit
DeepSeek-V3.2-MLX-5.5bit
Qwen3.5-397B-A17B-MLX-9bit
MiniMax-M2.7-MLX-9bit
NVIDIA-Nemotron-3-Super-120B-A12B-MLX-4.5bit
gemma-4-31B-MLX-9bit
GLM-5.1-MLX-4.8bit-INF
Mistral-Small-4-119B-2603-MLX-4.5bit
Kimi-K2-Instruct-MLX-3.9bit
openai-gpt-oss-120b-MLX-6.5bit
See gpt-oss-120b 6.5bit MLX in action - demonstration video.

q6.5bit quant typically achieves 1.128 perplexity in our testing, which is equivalent to q8.

| Quantization | Perplexity |
|:------------:|:----------:|
| q2 | 41.293 |
| q3 | 1.900 |
| q4 | 1.168 |
| q6 | 1.128 |
| q8 | 1.128 |

- Tested to run with the Inferencer app
- Memory usage: ~95 GB (down from the ~251 GB required by the native MXFP4 format)
- Expect ~60 tokens/s
- Quantized with a modified version of MLX 0.26

For more details, see the demonstration video or visit OpenAI gpt-oss-120b.

We are not the creator, originator, or owner of any model listed. Each model is created and provided by third parties. Models may not always be accurate or contextually appropriate. You are responsible for verifying the information before making important decisions. We are not liable for any damages, losses, or issues arising from its use, including data loss or inaccuracies in AI-generated content.
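The card above only documents the Inferencer app; a minimal sketch of trying a quant like this with stock mlx-lm follows, assuming the repo id matches this listing and the weights load through the standard `mlx_lm` API:

```python
# Sketch: loading and sampling with stock mlx-lm (pip install mlx-lm).
# Assumption: the repo id below matches this listing and loads via the
# standard mlx_lm API; the card itself only documents the Inferencer app.
from mlx_lm import load, generate

model, tokenizer = load("inferencerlabs/openai-gpt-oss-120b-MLX-6.5bit")
text = generate(model, tokenizer, prompt="Explain MXFP4 in one paragraph.", max_tokens=128)
print(text)
```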
Kimi-K2-Instruct-MLX-3.985bit
openai-gpt-oss-20b-MLX-6.5bit
See gpt-oss-20b 6.5bit MLX in action - demonstration video.

q6.5bit quant typically achieves 1.128 perplexity in our testing, which is equivalent to q8.

| Quantization | Perplexity |
|:------------:|:----------:|
| q2 | 41.293 |
| q3 | 1.900 |
| q4 | 1.168 |
| q6 | 1.128 |
| q8 | 1.128 |

- Tested to run with the Inferencer app
- Memory usage: ~17 GB (down from the ~46 GB required by the native MXFP4 format)
- Expect ~100 tokens/s
- Quantized with a modified version of MLX 0.26

For more details, see the demonstration video or visit OpenAI gpt-oss-20b.
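The perplexity figures in these tables come from the authors' own harness, which is not published. As a rough illustration of what such a measurement looks like, here is a minimal next-token perplexity sketch with stock MLX; the evaluation text, context handling, and repo id are all assumptions:

```python
# Sketch: perplexity = exp(mean next-token negative log-likelihood).
# Assumption: the repo loads with stock mlx_lm; the authors' actual
# evaluation corpus and window size are not published.
import mlx.core as mx
import mlx.nn as nn
from mlx_lm import load

model, tokenizer = load("inferencerlabs/openai-gpt-oss-20b-MLX-6.5bit")
tokens = mx.array(tokenizer.encode("The quick brown fox jumps over the lazy dog."))

logits = model(tokens[None, :-1])  # logits for each next-token prediction
nll = nn.losses.cross_entropy(logits[0], tokens[1:], reduction="mean")
print(f"perplexity ≈ {mx.exp(nll).item():.3f}")
```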
gemma-4-E4B-MLX-9bit
gemma-4-26B-A4B-MLX-9bit
Qwen3.5-122B-A10B-MLX-9bit
Qwen3.5-122B-A10B-MLX-6.5bit
gemma-4-E2B-MLX-9bit
GLM-4.7-Flash-MLX-6.5bit
DeepSeek-V3.2-Speciale-MLX-4.8bit
Qwen3-Coder-480B-A35B-Instruct-MLX-8.5bit
Kimi-K2-Instruct-0905-MLX-3.825bit
Mistral-Small-4-119B-2603-MLX-9bit
DeepSeek-V3.2-Speciale-MLX-5.5bit
sarvamai-105b-MLX-10bit
DeepSeek-V3.2-MLX-4.8bit
deepseek-v3.1-MLX-5.5bit
See DeepSeek-V3.1 5.5bit MLX in action - demonstration video.

q5.5bit quant typically achieves 1.141 perplexity in our testing.

| Quantization | Perplexity |
|:------------:|:----------:|
| q2.5 | 41.293 |
| q3.5 | 1.900 |
| q4.5 | 1.168 |
| q5.5 | 1.141 |
| q6.5 | 1.128 |
| q8.5 | 1.128 |

- Runs on a single M3 Ultra (512 GB RAM) using the Inferencer app
- Memory usage: ~480 GB
- Expect ~13-19 tokens/s
- Quantized with a modified version of MLX 0.26

For more details, see the demonstration video or visit DeepSeek-V3.1.
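The half-bit steps in these labels (q4.5, q5.5, ...) are consistent with the bookkeeping overhead of MLX's affine quantization, where each group of weights carries a float16 scale and bias, adding 32/group_size extra bits per weight (0.5 bits at the default group size of 64). A sketch of that arithmetic, assuming the default group size; the exact recipe behind each label is not published:

```python
# Effective bits/weight under MLX affine quantization:
#   bits + (float16 scale + float16 bias) / group_size = bits + 32 / group_size.
# Assumption: these repos use the default group size of 64.
def effective_bits(bits: int, group_size: int = 64) -> float:
    return bits + 32 / group_size

for b in (4, 5, 6, 8):
    print(f"q{b} + group overhead -> {effective_bits(b):.1f} bits/weight")
```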
Devstral-Small-2-24B-Instruct-2512-MLX-6.5bit
Kimi-K2-Thinking-MLX-4.25bit
See Kimi-K2-Thinking 4.25bit MLX in action - demonstration video.

q4.25bit quant perplexity is TBA, but q4.5bit quant typically achieves 1.168 perplexity in our testing.

| Quantization | Perplexity |
|:------------:|:----------:|
| q2.5 | 41.293 |
| q3.5 | 1.900 |
| q3.95 | 1.243 |
| q4.25 | TBA |
| q4.5 | 1.168 |
| q6.5 | 1.128 |
| q8.5 | 1.128 |

- Tested on an M3 Ultra (512 GB RAM) connected to a MacBook Pro (128 GB RAM) using Inferencer app v1.6 with distributed compute; for more information on the distributed compute feature, see github.com/inferencer/issues/31
- Memory usage: MacBook Pro ~80 GB + Mac Studio ~450 GB
- Expect ~22 tokens/s @ 1000 tokens
- Quantized with a modified version of MLX 0.28

For more details, see the demonstration video or visit Kimi-K2-Thinking.
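As a sanity check on the reported split (~80 GB + ~450 GB), weight memory for a quantized model is roughly params × bits / 8. Taking Kimi-K2's parameter count as ~1T (an assumption here, from the upstream model family) at 4.25 bits lands right around that total:

```python
# Back-of-the-envelope weight memory: bytes ≈ params * bits_per_weight / 8.
# Assumption: ~1T total parameters (Kimi-K2 family); KV cache, activations,
# and any unquantized layers are ignored.
params = 1.0e12
bits = 4.25
gb = params * bits / 8 / 1e9
print(f"~{gb:.0f} GB")  # ≈ 531 GB, in line with the reported ~80 GB + ~450 GB split
```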
deepseek-v3.1-Terminus-MLX-5.5bit
Devstral-2-123B-Instruct-2512-MLX-6.5bit
GLM-4.7-Flash-MLX-5.5bit
sarvamai-30b-MLX-10bit
Qwen3-Coder-480B-A35B-Instruct-MLX-6.5bit
See Qwen3-Coder-480B-A35B-Instruct 6.5bit MLX in action - demonstration video.

q6.5bit quant typically achieves 1.128 perplexity in our testing, which is equivalent to q8.

| Quantization | Perplexity |
|:------------:|:----------:|
| q2 | 41.293 |
| q3 | 1.900 |
| q4 | 1.168 |
| q6 | 1.128 |
| q8 | 1.128 |

- Tested to run with the Inferencer app
- Memory usage: ~365 GB
- Expect ~19 tokens/s
- Quantized with a modified version of MLX 0.26

For more details, see the demonstration video or visit Qwen3-Coder.
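The ~19 tokens/s figure above was measured in the Inferencer app. One rough way to time decode speed yourself with stock mlx-lm is sketched below; the repo id and prompt are assumptions, and counting streamed responses only approximates generated-token throughput:

```python
# Sketch: rough decode tokens/s with stock mlx-lm's streaming API.
import time
from mlx_lm import load, stream_generate

model, tokenizer = load("inferencerlabs/Qwen3-Coder-480B-A35B-Instruct-MLX-6.5bit")

start, n = time.perf_counter(), 0
for response in stream_generate(
    model, tokenizer, prompt="Write a binary search in Python.", max_tokens=256
):
    n += 1  # one streamed response per generated token
print(f"{n / (time.perf_counter() - start):.1f} tokens/s")
```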
Qwen3.5-35B-A3B-MLX-9bit
Kimi-K2-Instruct-0905-MLX-3.8bit
Kimi-K2-Thinking-MLX-3.8bit
Kimi-K2-Instruct-0905-MLX-3.824bit
Qwen3-Coder-30B-A3B-Instruct-MLX-6.5bit
Kimi-K2.5-MLX-4.2bit
LongCat-Flash-Thinking-2601-MLX-6.5bit
MiniMax-M2-MLX-6.5bit
See MiniMax-M2 6.5bit MLX in action - demonstration video.

q6.5bit quant typically achieves 1.128 perplexity in our testing, which is equivalent to q8.5.

| Quantization | Perplexity |
|:------------:|:----------:|
| q2.5 | 41.293 |
| q3.5 | 1.900 |
| q4.5 | 1.168 |
| q5.5 | 1.141 |
| q6.5 | 1.128 |
| q8.5 | 1.128 |

- Tested on a MacBook Pro connecting to an M3 Ultra (512 GB RAM) over the internet using Inferencer app v1.5.4
- Memory usage: ~175 GB
- Expect ~42 tokens/s for small contexts (200 tokens), dropping to ~12 tokens/s for large contexts (6800 tokens)
- Note: performance has improved by 16.7% since the original tests; see github.com/inferencer/issues/46
- Quantized with a modified version of MLX 0.28

For more details, see the demonstration video or visit MiniMax-M2.
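The setup above streams from a Mac Studio to a MacBook over the network via the Inferencer app, whose protocol is not public. A rough stock-mlx-lm equivalent is to run `mlx_lm.server --model <repo> --host 0.0.0.0 --port 8080` on the Studio and query its OpenAI-compatible endpoint remotely; the client sketch below assumes that setup (host name and port are placeholders):

```python
# Sketch: querying mlx-lm's OpenAI-compatible server from another machine.
# Assumptions: mlx_lm.server is running on the host below; host/port are
# placeholders. /v1/chat/completions is mlx-lm's standard endpoint.
import json
import urllib.request

payload = {
    "model": "inferencerlabs/MiniMax-M2-MLX-6.5bit",
    "messages": [{"role": "user", "content": "Say hello."}],
    "max_tokens": 64,
}
req = urllib.request.Request(
    "http://studio.local:8080/v1/chat/completions",  # hypothetical host
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as r:
    print(json.load(r)["choices"][0]["message"]["content"])
```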
GLM-4.7-MLX-9bit
GLM-4.6-MLX-6.5bit
See GLM-4.6 6.5bit MLX in action - demonstration video.

q6.5bit quant typically achieves 1.128 perplexity in our testing, matching q8.5 (the lowest we measured).

| Quantization | Perplexity |
|:------------:|:----------:|
| q2.5 | 41.293 |
| q3.5 | 1.900 |
| q4.5 | 1.168 |
| q5.5 | 1.141 |
| q6.5 | 1.128 |
| q8.5 | 1.128 |

- Runs on a single M3 Ultra (512 GB RAM) using the Inferencer app
- Memory usage: ~270 GB
- Expect ~16 tokens/s
- Quantized with a modified version of MLX 0.27

For more details, see the demonstration video or visit GLM-4.6.
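These cards all credit a modified MLX build, which is not public. Stock mlx-lm can approximate mixed-precision recipes through `convert()`'s `quant_predicate` hook, which can return per-layer quantization settings; the sketch below is illustrative only, and the layer-selection rule and upstream repo id are assumptions, not the recipe actually used for these repos:

```python
# Sketch: mixed-precision quantization with stock mlx-lm's convert().
# The per-layer rule below is illustrative; the actual recipe behind these
# repos (made with a modified MLX build) is not published.
from mlx_lm import convert

def predicate(path, module, config):
    # Keep embeddings and the output head at 8 bits, everything else at 6.
    if "embed" in path or "lm_head" in path:
        return {"bits": 8, "group_size": 64}
    return {"bits": 6, "group_size": 64}

convert(
    "zai-org/GLM-4.6",            # assumed upstream repo id
    mlx_path="GLM-4.6-MLX-mixed",
    quantize=True,
    quant_predicate=predicate,
)
```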