mlx-community
✓ Verified · Community · Apple MLX framework community contributions
gpt-oss-20b-MXFP4-Q8
```yaml
---
license: apache-2.0
pipeline_tag: text-generation
library_name: mlx
tags:
  - vllm
  - mlx
base_model: openai/gpt-oss-20b
---
```
parakeet-tdt-0.6b-v3
```yaml
---
library_name: mlx
language: [en, es, fr, de, bg, hr, cs, da, nl, et, fi, el, hu, it, lv, lt, mt, pl, pt, ro, sk, sl, sv, ru, uk]
tags:
  - mlx
  - automatic-speech-recognition
  - speech
  - audio
  - FastConformer
  - Conformer
  - Parakeet
license: cc-by-4.0
pipeline_tag: automatic-speech-recognition
base_model: nvidia/parakeet-tdt-0.6b-v3
---
```
parakeet-tdt-0.6b-v2
```yaml
---
library_name: mlx
tags:
  - mlx
  - automatic-speech-recognition
  - speech
  - audio
  - FastConformer
  - Conformer
  - Parakeet
license: cc-by-4.0
pipeline_tag: automatic-speech-recognition
base_model: nvidia/parakeet-tdt-0.6b-v2
---
```
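The Parakeet conversions in this org are typically driven with the community parakeet-mlx package rather than mlx-lm. A minimal transcription sketch, assuming that package's `from_pretrained` loader and `transcribe` method (the `audio.wav` path is a hypothetical placeholder):

```python
# Sketch: ASR with a Parakeet MLX conversion (assumes: pip install parakeet-mlx).
from parakeet_mlx import from_pretrained

model = from_pretrained("mlx-community/parakeet-tdt-0.6b-v2")
result = model.transcribe("audio.wav")  # hypothetical local audio file
print(result.text)
```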
whisper-small-mlx
gemma-3-12b-it-qat-4bit
gemma-3-27b-it-qat-4bit
mlx-community/gemma-3-27b-it-qat-4bit This model was converted to MLX format from `google/gemma-3-27b-it-qat-q4_0-unquantized` using mlx-vlm version 0.1.23. Refer to the original model card for more details on the model. Use with mlx (see the sketch below).
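The "Use with mlx" sections of these mlx-vlm cards were stripped in this listing; below is a minimal sketch of the usual pattern, assuming the mlx-vlm Python helpers `load`, `apply_chat_template`, and `generate` (exact signatures vary a little across mlx-vlm versions, and `cat.png` is a hypothetical image path):

```python
# Sketch: vision-language generation with an mlx-vlm conversion (pip install mlx-vlm).
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template

model, processor = load("mlx-community/gemma-3-27b-it-qat-4bit")

# Format the user prompt with the model's chat template for one attached image.
prompt = apply_chat_template(processor, model.config, "Describe this image.", num_images=1)

output = generate(model, processor, prompt, ["cat.png"], max_tokens=128)  # hypothetical image
print(output)
```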
gemma-3-4b-it-qat-4bit
Qwen3-30B-A3B-Instruct-2507-4bit
This model mlx-community/Qwen3-30B-A3B-Instruct-2507-4bit was converted to MLX format from Qwen/Qwen3-30B-A3B-Instruct-2507 using mlx-lm version 0.26.3.
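The mlx-lm conversions throughout this list share one loading pattern; a minimal sketch using mlx-lm's documented `load` and `generate` helpers:

```python
# Sketch: text generation with an mlx-lm conversion (pip install mlx-lm).
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Qwen3-30B-A3B-Instruct-2507-4bit")

# Route the prompt through the model's chat template before generating.
messages = [{"role": "user", "content": "Explain MoE routing in two sentences."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

text = generate(model, tokenizer, prompt=prompt, max_tokens=256, verbose=True)
```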
gemma-3-1b-it-qat-4bit
The Model mlx-community/gemma-3-1b-it-qat-4bit was converted to MLX format from google/gemma-3-1b-it-qat-q4_0 using mlx-lm version 0.22.5.
Llama-3.2-1B-Instruct-4bit
Llama-3.2-3B-Instruct-4bit
gemma-2-2b-it-4bit
whisper-large-v3-mlx
Qwen3-1.7B-4bit
Meta-Llama-3.1-8B-Instruct-4bit
Qwen3-VL-4B-Instruct-4bit
DeepSeek-OCR-8bit
mlx-community/DeepSeek-OCR-8bit This model was converted to MLX format from `deepseek-ai/DeepSeek-OCR` using mlx-vlm version 0.3.5. Refer to the original model card for more details on the model. Use with mlx
Qwen3-Embedding-0.6B-4bit-DWQ
Mistral-7B-Instruct-v0.3-4bit
gemma-3-1b-it-4bit
gemma-3n-E4B-it-lm-4bit
gemma-3n-E2B-it-lm-4bit
Kokoro-82M-bf16
mlx-community/Kokoro-82M-bf16 This model was converted to MLX format from `hexgrad/Kokoro-82M` using mlx-audio version 0.0.1. Refer to the original model card for more details on the model. Use with mlx
Qwen3-4B-4bit
Qwen3-0.6B-4bit
MiniMax-M2-4bit
This model mlx-community/MiniMax-M2-4bit was converted to MLX format from MiniMaxAI/MiniMax-M2 using mlx-lm version 0.28.4.
Qwen2-VL-2B-Instruct-4bit
whisper-large-v3-turbo
whisper-large-v3-turbo This model was converted to MLX format from the OpenAI Whisper `large-v3-turbo` checkpoint.
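The Whisper conversions here are consumed through the mlx-whisper package; a minimal sketch using its documented `transcribe` entry point (`audio.mp3` is a hypothetical input file):

```python
# Sketch: speech-to-text with an MLX Whisper conversion (pip install mlx-whisper).
import mlx_whisper

result = mlx_whisper.transcribe(
    "audio.mp3",  # hypothetical input file
    path_or_hf_repo="mlx-community/whisper-large-v3-turbo",
)
print(result["text"])
```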
Qwen2.5-3B-Instruct-4bit
bge-small-en-v1.5-bf16
Qwen3-VL-2B-Instruct-4bit
mlx-community/Qwen3-VL-2B-Instruct-4bit This model was converted to MLX format from `Qwen/Qwen3-VL-2B-Instruct` using mlx-vlm version 0.3.4. Refer to the original model card for more details on the model. Use with mlx
Qwen3-235B-A22B-4bit
Dolphin3.0-Llama3.1-8B-4bit
DeepSeek-OCR-4bit
SmolLM3-3B-4bit
Qwen2.5-0.5B-Instruct-4bit
Qwen3-235B-A22B-8bit
Qwen1.5-0.5B-Chat-4bit
DeepSeek-R1-Distill-Qwen-1.5B-8bit
Qwen3-4B-Thinking-2507-4bit
GLM-4.6-4bit
This model mlx-community/GLM-4.6-4bit was converted to MLX format from zai-org/GLM-4.6 using mlx-lm version 0.28.1.
DeepSeek-OCR-5bit
mlx-community/DeepSeek-OCR-5bit This model was converted to MLX format from `deepseek-ai/DeepSeek-OCR` using mlx-vlm version 0.3.5. Refer to the original model card for more details on the model. Use with mlx
Qwen2.5-1.5B-Instruct-4bit
MiniMax-M2-3bit
DeepSeek-R1-Distill-Qwen-1.5B-4bit
Kimi-Linear-48B-A3B-Instruct-4bit
This model mlx-community/Kimi-Linear-48B-A3B-Instruct-4bit was converted to MLX format from moonshotai/Kimi-Linear-48B-A3B-Instruct using mlx-lm version 0.28.4.
whisper-medium-mlx
Qwen3-4B-Instruct-2507-4bit
granite-4.0-h-micro-4bit
MiniMax-M2-8bit
This model mlx-community/MiniMax-M2-8bit was converted to MLX format from MiniMaxAI/MiniMax-M2 using mlx-lm version 0.28.4.
Qwen3-8B-4bit
Llama-3.2-3B-Instruct-8bit
Meta-Llama-3-8B-Instruct-4bit
mlx-community/Meta-Llama-3-8B-Instruct-4bit This model was converted to MLX format from `meta-llama/Meta-Llama-3-8B-Instruct` using mlx-lm version 0.9.0. Refer to the original model card for more details on the model. Use with mlx
SmolVLM2-500M-Video-Instruct-mlx
Qwen3-Embedding-4B-4bit-DWQ
Josiefied-Qwen3-4B-abliterated-v1-4bit
This model mlx-community/Josiefied-Qwen3-4B-abliterated-v1-4bit was converted to MLX format from Goekdeniz-Guelmez/Josiefied-Qwen3-4B-abliterated-v1 using mlx-lm version 0.24.0.
Qwen2.5-7B-Instruct-4bit
Phi-4-mini-instruct-4bit
phi-2
MiniMax-M2-mlx-8bit-gs32
This model mlx-community/MiniMax-M2-mlx-8bit-gs32 was converted to MLX format from MiniMaxAI/MiniMax-M2 using mlx-lm version 0.28.1. Recipe: 8-bit quantization with group size 32, which works out to 9 bits per weight (bpw): each 32-weight group stores 32 × 8-bit values plus a 16-bit scale and a 16-bit bias, i.e. (32·8 + 32)/32 = 9. You can find more similar MLX model quants for a single Apple Mac Studio M3 Ultra with 512 GB at https://huggingface.co/bibproj
granite-3.3-2b-instruct-4bit
Llama-3.3-70B-Instruct-4bit
GLM-4.6-mlx-8bit-gs32
This model mlx-community/GLM-4.6-mlx-8bit-gs32 was converted to MLX format from zai-org/GLM-4.6 using mlx-lm version 0.28.1. Recipe: 8-bit quantization with group size 32 (≈9 bits per weight, since each 32-weight group adds a 16-bit scale and a 16-bit bias). You can find more similar MLX model quants for Apple Mac Studio with 512 GB at https://huggingface.co/bibproj
Qwen3-VL-235B-A22B-Instruct-3bit
mlx-community/Qwen3-VL-235B-A22B-Instruct-3bit This model was converted to MLX format from `Qwen/Qwen3-VL-235B-A22B-Instruct` using mlx-vlm version 0.3.3. Refer to the original model card for more details on the model. Use with mlx
Llama-2-7b-chat-mlx
granite-4.0-h-tiny-4bit
DeepSeek-Coder-V2-Lite-Instruct-4bit-mlx
Qwen2.5-VL-3B-Instruct-4bit
gemma-3-270m-it-8bit
whisper-base-mlx
Qwen3-Embedding-8B-4bit-DWQ
parakeet-rnnt-0.6b
gpt-oss-120b-MXFP4-Q4
This model mlx-community/gpt-oss-120b-MXFP4-Q4 was converted to MLX format from openai/gpt-oss-120b using mlx-lm version 0.27.0.
Phi-3.5-mini-instruct-4bit
LFM2-2.6B-4bit
This model mlx-community/LFM2-2.6B-4bit was converted to MLX format from LiquidAI/LFM2-2.6B using mlx-lm version 0.28.0.
LFM2-8B-A1B-4bit
gpt-oss-20b-MXFP4-Q4
GLM-4.6-bf16
This model mlx-community/GLM-4.6-bf16 was converted to MLX format from zai-org/GLM-4.6 using mlx-lm version 0.28.2.
Qwen3-0.6B-8bit
LFM2-1.2B-4bit
whisper-large-v2-mlx
gemma-3n-E4B-it-4bit
Qwen3-VL-30B-A3B-Instruct-4bit
mlx-community/Qwen3-VL-30B-A3B-Instruct-4bit This model was converted to MLX format from `Qwen/Qwen3-VL-30B-A3B-Instruct` using mlx-vlm version 0.3.3. Refer to the original model card for more details on the model. Use with mlx
Dolphin3.0-Llama3.1-8B-8bit
MiniMax-M2-6bit
This model mlx-community/MiniMax-M2-6bit was converted to MLX format from MiniMaxAI/MiniMax-M2 using mlx-lm version 0.28.4.
deepcogito-cogito-v1-preview-llama-3B-4bit
Hermes-3-Llama-3.2-3B-4bit
DeepSeek-R1-Distill-Qwen-32B-4bit
Qwen3-VL-30B-A3B-Instruct-8bit
dolphin3.0-llama3.2-3B-4Bit
Phi-4-mini-instruct-8bit
GLM-4.5-Air-3bit
This model mlx-community/GLM-4.5-Air-3bit was converted to MLX format from zai-org/GLM-4.5-Air using mlx-lm version 0.26.1.
DeepSeek-V3-0324-4bit
GLM-4.6-5bit
This model mlx-community/GLM-4.6-5bit was converted to MLX format from zai-org/GLM-4.6 using mlx-lm version 0.28.1.
gemma-2-9b-it-4bit
gpt-oss-120b-MXFP4-Q8
This model mlx-community/gpt-oss-120b-MXFP4-Q8 was converted to MLX format from openai/gpt-oss-120b using mlx-lm version 0.27.0.
DeepSeek-OCR-6bit
mlx-community/DeepSeek-OCR-6bit This model was converted to MLX format from `deepseek-ai/DeepSeek-OCR` using mlx-vlm version 0.3.5. Refer to the original model card for more details on the model. Use with mlx
Josiefied-Qwen3-1.7B-abliterated-v1-4bit
DeepSeek-R1-0528-Qwen3-8B-4bit
gemma-3-4b-it-4bit
mlx-community/gemma-3-4b-it-4bit This model was converted to MLX format from `google/gemma-3-4b-it` using mlx-vlm version 0.1.18. Refer to the original model card for more details on the model. Use with mlx
SmolLM-135M-Instruct-4bit
whisper-tiny
Mistral-Nemo-Instruct-2407-4bit
LFM2-8B-A1B-3bit-MLX
Maintainer / Publisher: Susant Achary. Upstream model: LiquidAI/LFM2-8B-A1B. This repo (MLX 3-bit): `mlx-community/LFM2-8B-A1B-3bit-MLX`.

This repository provides an Apple-Silicon-optimized MLX build of LFM2-8B-A1B at 3-bit quantization. 3-bit is an excellent size↔quality sweet spot on many Macs: a very small memory footprint with surprisingly solid answer quality and snappy decoding.

- Architecture: Mixture-of-Experts (MoE) Transformer.
- Size: ~8B total parameters with ~1B active per token (the "A1B" naming commonly indicates ~1B active params).
- Why MoE? Per token, only a subset of experts is activated, lowering compute per token while retaining a larger parameter pool for expressivity.

> Memory reality on a single device: even though ~1B parameters are active at a time, all experts typically reside in memory in single-device runs. Plan RAM based on total parameters, not just the active slice.

Repo contents:
- `config.json` (MLX), `model.safetensors` (3-bit shards)
- Tokenizer: `tokenizer.json`, `tokenizer_config.json`
- Metadata: `model_index.json` (and/or processor metadata as applicable)

Target: macOS on Apple Silicon (M-series) using Metal/MPS.

Intended use:
- General instruction following, chat, and summarization
- RAG back-ends and long-context assistants on device
- Schema-guided structured outputs (JSON) where low RAM is a priority

Limitations:
- 3-bit is lossy: the gains in latency/RAM come with some accuracy trade-off vs 6/8-bit.
- For very long contexts and/or batching, the KV-cache can dominate memory; tune `max_tokens` and batch size.
- Add your own guardrails/safety for production deployments.

The numbers below are practical starting points, not measurements; verify on your machine.
- Weights (3-bit): ≈ `total_params × 0.375 byte` → for 8B params ≈ ~3.0 GB
- Runtime overhead: MLX graph/tensors/metadata → ~0.6–1.0 GB
- KV-cache: grows with context × layers × heads × dtype → ~0.8–2.5+ GB

| Context window | Estimated peak RAM |
|---|---:|
| 4k tokens | ~4.4–5.5 GB |
| 8k tokens | ~5.2–6.6 GB |
| 16k tokens | ~6.5–8.8 GB |

> For ≤2k windows you may see ~4.0–4.8 GB. Larger windows/batches increase KV-cache and peak RAM.

🧭 Precision choices for LFM2-8B-A1B (lineup planning): while this card is 3-bit, teams often publish multiple precisions. Use this table as a planning guide (8B MoE LM; actuals depend on context/batch/prompts):

| Variant | Typical Peak RAM | Relative Speed | Typical Behavior | When to choose |
|---|---:|:---:|---|---|
| 3-bit (this repo) | ~4.4–8.8 GB | 🔥🔥🔥🔥 | Direct, concise, great latency | Default on 8–16 GB Macs |
| 6-bit | ~7.5–12.5 GB | 🔥🔥 | Best quality under quant | Choose if RAM allows |
| 8-bit | ~9.5–12+ GB | 🔥🔥 | Largest quantized size / highest fidelity | When you prefer simpler 8-bit workflows |

> MoE caveat: MoE lowers compute per token; unless experts are paged/partitioned, memory still scales with total parameters on a single device.

Deterministic generation:

```bash
python -m mlx_lm.generate \
  --model mlx-community/LFM2-8B-A1B-3bit-MLX \
  --prompt "Summarize the following in 5 concise bullet points:\n " \
  --max-tokens 256 \
  --temp 0.0 \
  --seed 0
```
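The rule-of-thumb breakdown above reduces to simple arithmetic; here is a small helper that reproduces the card's estimates (the 0.375 bytes-per-weight factor and the overhead ranges come from the card; the default values below are midpoints of those ranges, not measurements):

```python
# Back-of-envelope peak-RAM estimate following the card's rule of thumb:
# quantized weights + MLX runtime overhead + KV-cache.
def estimate_peak_ram_gb(total_params: float, bits_per_weight: float,
                         overhead_gb: float = 0.8, kv_cache_gb: float = 1.6) -> float:
    weights_gb = total_params * (bits_per_weight / 8) / 1e9  # weight bytes -> GB
    return weights_gb + overhead_gb + kv_cache_gb

# 8B params at 3-bit: ~3.0 GB of weights, ~5.4 GB peak with mid-range overheads,
# consistent with the 4k/8k rows of the table above.
print(f"{estimate_peak_ram_gb(8e9, 3.0):.1f} GB")
```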
MiniMax-M2-5bit
This model mlx-community/MiniMax-M2-5bit was converted to MLX format from MiniMaxAI/MiniMax-M2 using mlx-lm version 0.28.4.
gemma-3-text-4b-it-4bit
Qwen3-Coder-30B-A3B-Instruct-4bit
granite-4.0-micro-8bit
This model mlx-community/granite-4.0-micro-8bit was converted to MLX format from ibm-granite/granite-4.0-micro using mlx-lm version 0.28.2.
Phi-3-mini-4k-instruct-4bit
Qwen3-VL-235B-A22B-Thinking-3bit
Llama-3.2-11B-Vision-Instruct-abliterated
Kimi-Dev-72B-4bit-DWQ
Qwen3-VL-30B-A3B-Instruct-bf16
mlx-community/Qwen3-VL-30B-A3B-Instruct-bf16 This model was converted to MLX format from `Qwen/Qwen3-VL-30B-A3B-Instruct` using mlx-vlm version 0.3.3. Refer to the original model card for more details on the model. Use with mlx
whisper-small-mlx-8bit
Llama-3.2-1B-Instruct-8bit
DeepSeek-R1-Distill-Qwen-7B-4bit
GLM-4.5-Air-4bit
Qwen2.5-VL-7B-Instruct-4bit
LFM2-350M-8bit
DeepSeek-R1-4bit
Meta-Llama-3.1-70B-Instruct-4bit
Kimi-K2-Instruct-4bit
phi-4-8bit
3b-de-ft-research_release-4bit
whisper-large-mlx
dolphin3.0-llama3.2-1B-4Bit
Qwen3-VL-8B-Instruct-4bit
mlx-community/Qwen3-VL-8B-Instruct-4bit This model was converted to MLX format from `Qwen/Qwen3-VL-8B-Instruct` using mlx-vlm version 0.3.4. Refer to the original model card for more details on the model. Use with mlx
Llama-3.2-11B-Vision-Instruct-8bit
Qwen3-4B-8bit
gemma-3-4b-it-8bit
gemma-3n-E2B-it-4bit
DeepSeek-V3.1-4bit
Qwen3-VL-8B-Thinking-8bit
mlx-community/Qwen3-VL-8B-Thinking-8bit This model was converted to MLX format from `Qwen/Qwen3-VL-8B-Thinking` using mlx-vlm version 0.3.4. Refer to the original model card for more details on the model. Use with mlx
Ring-mini-linear-2.0-4bit
This model mlx-community/Ring-mini-linear-2.0-4bit was converted to MLX format from inclusionAI/Ring-mini-linear-2.0 using mlx-lm version 0.28.1.
Qwen3-VL-30B-A3B-Thinking-4bit
Qwen3-4B-Instruct-2507-4bit-DWQ-2510
This model mlx-community/Qwen3-4B-Instruct-2507-4bit-DWQ-2510 was converted to MLX format from Qwen/Qwen3-4B-Instruct-2507 using mlx-lm version 0.28.2.
Qwen3-Coder-30B-A3B-Instruct-4bit-dwq-v2
Qwen3-Coder-30B-A3B-Instruct-8bit
This model mlx-community/Qwen3-Coder-30B-A3B-Instruct-8bit was converted to MLX format from Qwen/Qwen3-Coder-30B-A3B-Instruct using mlx-lm version 0.26.1.
Qwen3-VL-30B-A3B-Thinking-3bit
Qwen3-VL-30B-A3B-Thinking-8bit
Qwen3-Coder-480B-A35B-Instruct-4bit
DeepSeek-V3.1-Terminus-4bit
This model mlx-community/DeepSeek-V3.1-Terminus-4bit was converted to MLX format from deepseek-ai/DeepSeek-V3.1-Terminus using mlx-lm version 0.27.1.
whisper-large-v3-turbo-q4
Qwen3-VL-30B-A3B-Thinking-bf16
mlx-community/Qwen3-VL-30B-A3B-Thinking-bf16 This model was converted to MLX format from `Qwen/Qwen3-VL-30B-A3B-Thinking` using mlx-vlm version 0.3.4. Refer to the original model card for more details on the model. Use with mlx
Granite-4.0-H-Tiny-4bit-DWQ
This model mlx-community/granite-4.0-h-Tiny-4bit-DWQ was converted to MLX format from ibm-granite/granite-4.0-h-small using mlx-lm version 0.28.2.
Llama-3.2-3B-Instruct
Qwen3-Next-80B-A3B-Instruct-4bit
parakeet-tdt_ctc-0.6b-ja
This model was converted to MLX format from nvidia/parakeet-tdt_ctc-0.6b-ja using the conversion script. Please refer to the original model card for more details on the model.
Mistral-7B-Instruct-v0.2-4-bit
Qwen3-30B-A3B-4bit
Llama-3.2-3B-Instruct-uncensored-6bit
Kimi-K2-Instruct-0905-mlx-DQ3_K_M
This model mlx-community/Kimi-K2-Instruct-0905-mlx-DQ3_K_M was converted to MLX format from moonshotai/Kimi-K2-Instruct-0905 using mlx-lm version 0.26.3.

This quant was created for people using a single Apple Mac Studio M3 Ultra with 512 GB: the 4-bit version of Kimi K2 does not fit. Using research results, we aim to get 4-bit performance from a slightly smaller and smarter quantization, while keeping the quant small enough to leave memory for a useful context window. You can find more similar MLX model quants for Apple Mac Studio with 512 GB at https://huggingface.co/bibproj

In the arXiv paper "Quantitative Analysis of Performance Drop in DeepSeek Model Quantization" the authors write:

> We further propose `DQ3_K_M`, a dynamic 3-bit quantization method that significantly outperforms the traditional `Q3_K_M` variant on various benchmarks, which is also comparable with the 4-bit quantization (`Q4_K_M`) approach in most tasks.

and describe a

> dynamic 3-bit quantization method (`DQ3_K_M`) that outperforms the 3-bit quantization implementation in `llama.cpp` and achieves performance comparable to 4-bit quantization across multiple benchmarks.

The resulting multi-bitwidth quantization has been well tested and documented. In the `convert.py` file of mlx-lm on your system (you can see the original code here), replace the code inside `def mixed_quant_predicate()` with something like the sketch below. Should you wish to squeeze more out of your quant, and you do not need a larger context window, you can change the last part of that code accordingly.
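The snippet the card refers to was stripped from this page, but the idea is a layer-dependent bit-width rule. Below is a minimal illustrative sketch, assuming mlx-lm's `convert` with its `quant_predicate` hook; the layer-name patterns and specific bit assignments are hypothetical placeholders, not the paper's or the uploader's exact recipe:

```python
# Hypothetical mixed-bitwidth predicate in the spirit of DQ3_K_M: keep sensitive
# tensors at higher precision, quantize the bulk of the weights at 3-bit.
from mlx_lm import convert

def mixed_quant_predicate(path, module, config):
    # Embeddings and output head: higher precision (6-bit here, illustrative).
    if "embed_tokens" in path or "lm_head" in path:
        return {"bits": 6, "group_size": 64}
    # Attention projections: 4-bit to protect quality (illustrative choice).
    if any(k in path for k in ("q_proj", "k_proj", "v_proj", "o_proj")):
        return {"bits": 4, "group_size": 64}
    # Everything else (e.g. MoE expert FFNs): 3-bit for the size win.
    return {"bits": 3, "group_size": 64}

# Run the conversion with the custom predicate (downloads the full model).
convert(
    "moonshotai/Kimi-K2-Instruct-0905",
    mlx_path="Kimi-K2-Instruct-0905-mlx-DQ3_K_M",
    quantize=True,
    quant_predicate=mixed_quant_predicate,
)
```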
Qwen2.5-Coder-7B-Instruct-bf16
Mixtral-8x22B-4bit
Qwen3-VL-4B-Instruct-8bit
Kimi-Linear-48B-A3B-Instruct-8bit
Llama-3.3-70B-Instruct-8bit
nvidia_Llama-3.1-Nemotron-70B-Instruct-HF_4bit
Huihui-GLM-4.5V-abliterated-mxfp4
mlx-community/Huihui-GLM-4.5V-abliterated-mxfp4 This model was converted to MLX format from `huihui-ai/Huihui-GLM-4.5V-abliterated` using `mlx-vlm` with MXFP4 support. Refer to the original model card for more details on the model. Use with mlx
gemma-3-1b-pt-4bit
embeddinggemma-300m-8bit
DeepSeek-R1-Distill-Llama-70B-8bit
chandra-8bit
DeepSeek-Coder-V2-Lite-Instruct-8bit
embeddinggemma-300m-bf16
The Model mlx-community/embeddinggemma-300m-bf16 was converted to MLX format from google/embeddinggemma-300m using mlx-lm version 0.0.4.
Qwen3-0.6B-bf16
Qwen2.5-3B-Instruct-8bit
Nanonets-OCR2-3B-4bit
mlx-community/Nanonets-OCR2-3B-4bit This model was converted to MLX format from `nanonets/Nanonets-OCR2-3B` using mlx-vlm version 0.3.3. Refer to the original model card for more details on the model. Use with mlx
Meta-Llama-3-8B-Instruct
Qwen3-VL-8B-Instruct-bf16
mlx-community/Qwen3-VL-8B-Instruct-bf16 This model was converted to MLX format from `Qwen/Qwen3-VL-8B-Instruct` using mlx-vlm version 0.3.4. Refer to the original model card for more details on the model. Use with mlx
GLM-4.5-Air-bf16
This model mlx-community/GLM-4.5-Air-bf16 was converted to MLX format from zai-org/GLM-4.5-Air using mlx-lm version 0.28.2.
Qwen3-VL-32B-Instruct-8bit
mlx-community/Qwen3-VL-32B-Instruct-8bit This model was converted to MLX format from `Qwen/Qwen3-VL-32B-Instruct` using mlx-vlm version 0.3.4. Refer to the original model card for more details on the model. Use with mlx
Ling-1T-mlx-3bit
This model mlx-community/Ling-1T-mlx-3bit was converted to MLX format from inclusionAI/Ling-1T using mlx-lm version 0.28.1. You can find more similar MLX model quants for Apple Mac Studio with 512 GB at https://huggingface.co/bibproj
Llama-4-Scout-17B-16E-Instruct-4bit
deepseek-r1-distill-qwen-1.5b
Qwen2.5-VL-7B-Instruct-8bit
Apriel-1.5-15b-Thinker-4bit
mlx-community/Apriel-1.5-15b-Thinker-4bit This model was converted to MLX format from `ServiceNow-AI/Apriel-1.5-15b-Thinker` using mlx-vlm version 0.3.3. Refer to the original model card for more details on the model. Use with mlx
SmolVLM-Instruct-4bit
dolphin-vision-72b-4bit
Codestral-22B-v0.1-4bit
gemma-3-270m-it-4bit
Qwen3-Embedding-0.6B-8bit
CodeLlama-70b-Instruct-hf-4bit-MLX
Qwen3-Coder-30B-A3B-Instruct-4bit-DWQ
Nanonets-OCR2-3B-bf16
mlx-community/Nanonets-OCR2-3B-bf16 This model was converted to MLX format from `nanonets/Nanonets-OCR2-3B` using mlx-vlm version 0.3.3. Refer to the original model card for more details on the model. Use with mlx
Qwen2.5-7B-Instruct-Uncensored-4bit
gemma-3-1b-it-8bit
exaone-4.0-1.2b-4bit
LFM2-8B-A1B-8bit-MLX
Maintainer / Publisher: Susant Achary. Upstream model: LiquidAI/LFM2-8B-A1B. This repo (MLX 8-bit): `mlx-community/LFM2-8B-A1B-8bit-MLX`.

This repository provides an Apple-Silicon-optimized MLX build of LFM2-8B-A1B at 8-bit quantization for fast, on-device inference.

- Architecture: Mixture-of-Experts (MoE) Transformer.
- Size: ~8B total parameters with ~1B active per token (the "A1B" suffix commonly denotes ~1B active params).
- Why MoE? During generation, only a subset of experts is activated per token, reducing compute per token while keeping a larger total parameter pool for expressivity.

> Important memory note (single-device inference): although compute per token benefits from MoE (fewer active parameters), the full set of experts still resides in memory for typical single-GPU/CPU deployments. In practice this means RAM usage scales with total parameters, not with the smaller active count.

Repo contents:
- `config.json` (MLX), `model.safetensors` (8-bit shards)
- Tokenizer files: `tokenizer.json`, `tokenizer_config.json`
- Model metadata (e.g., `model_index.json`)

Target platform: macOS on Apple Silicon (M-series) using Metal/MPS.

Intended use:
- General instruction-following, chat, and summarization
- RAG back-ends and long-context workflows on device
- Function-calling / structured outputs with schema-style prompts

Limitations:
- Even at 8-bit, long contexts (KV-cache) can dominate memory at high `max_tokens` or large batch sizes.
- As with any quantization, small regressions vs FP16 can appear on intricate math/code or edge-formatting.

The figures below are practical planning numbers derived from first principles plus experience with MLX and similar MoE models; treat them as starting points and validate on your hardware.
- Weights: ≈ `total_params × 1 byte` (8-bit). For 8B params → ~8.0 GB baseline.
- Runtime overhead: MLX graph + tensors + metadata → ~0.5–1.0 GB typical.
- KV-cache: grows with context_length × layers × heads × dtype; often 1–3+ GB for long contexts.

| Context window | Estimated peak RAM |
|---|---:|
| 4k tokens | ~9.5–10.5 GB |
| 8k tokens | ~10.5–11.8 GB |
| 16k tokens | ~12.0–14.0 GB |

> These ranges assume 8-bit weights, A1B MoE (all experts resident), batch size = 1, and standard generation settings. On lower windows (≤2k), you may see ~9–10 GB. Larger windows or batches will increase KV-cache and peak RAM.

While this card is 8-bit, teams often want a consistent lineup. If you later produce 6/5/4/3/2-bit MLX builds, here's a practical guide (RAM figures are indicative for an 8B MoE LM; your results depend on context/batch):

| Variant | Typical Peak RAM | Relative Speed | Typical Behavior | When to choose |
|---|---:|:---:|---|---|
| 4-bit | ~7–8 GB | 🔥🔥🔥 | Better detail retention | If 3-bit drops too much fidelity |
| 6-bit | ~9–10.5 GB | 🔥🔥 | Near-max MLX quality | If you want accuracy under quant |
| 8-bit (this repo) | ~9.5–12+ GB | 🔥🔥 | Highest quality among quant tiers | When RAM allows and you want the most faithful outputs |

> MoE caveat: MoE reduces compute per token, but unless experts are paged/partitioned across devices and loaded on demand, memory still follows total parameters. On a single Mac, plan RAM as if the whole 8B parameter set is resident.

Deterministic generation:

```bash
python -m mlx_lm.generate \
  --model mlx-community/LFM2-8B-A1B-8bit-MLX \
  --prompt "Summarize the following in 5 bullet points:\n " \
  --max-tokens 256 \
  --temp 0.0 \
  --seed 0
```
gemma-3-12b-it-qat-abliterated-lm-4bit
FastVLM-0.5B-bf16
DeepSeek-R1-Distill-Qwen-32B-MLX-8Bit
Qwen3-8B-6bit
gemma-3-27b-it-4bit
Nanonets-OCR2-3B-8bit
mlx-community/Nanonets-OCR2-3B-8bit This model was converted to MLX format from `nanonets/Nanonets-OCR2-3B` using mlx-vlm version 0.3.3. Refer to the original model card for more details on the model. Use with mlx
GLM-4.5-Air-mxfp4
This model mlx-community/GLM-4.5-Air-mxfp4 was converted to MLX format from zai-org/GLM-4.5-Air using mlx-lm version 0.28.0.
SmolVLM2-256M-Video-Instruct-mlx
Qwen3-0.6B-4bit-DWQ-05092025
Dolphin-Mistral-24B-Venice-Edition-mlx-8Bit
LFM2-700M-8bit
Kimi-VL-A3B-Thinking-4bit
DeepSeek-R1-Distill-Llama-8B-4bit
Phi-3.5-vision-instruct-4bit
deepseek-vl2-8bit
Qwen3-30B-A3B-4bit-DWQ
DeepSeek-V3-4bit
Qwen3-VL-4B-Instruct-3bit
mlx-community/Qwen3-VL-4B-Instruct-3bit This model was converted to MLX format from `Qwen/Qwen3-VL-4B-Instruct` using mlx-vlm version 0.3.4. Refer to the original model card for more details on the model. Use with mlx
Meta-Llama-3.1-8B-Instruct-8bit
embeddinggemma-300m-4bit
DeepSeek-R1-Distill-Qwen-1.5B-3bit
whisper-tiny.en-mlx
Llama-3.2-8X4B-MOE-V2-Dark-Champion-Instruct-uncensored-abliterated-21B-Q_6-MLX
nomicai-modernbert-embed-base-4bit
GLM-4.5-4bit
Kimi-Linear-48B-A3B-Instruct-6bit
This model mlx-community/Kimi-Linear-48B-A3B-Instruct-6bit was converted to MLX format from moonshotai/Kimi-Linear-48B-A3B-Instruct using mlx-lm version 0.28.4.
Qwen2.5-Coder-7B-Instruct-4bit
Llama-4-Maverick-17B-16E-Instruct-4bit
phi-2-hf-4bit-mlx
Qwen3-VL-8B-Thinking-4bit
mlx-community/Qwen3-VL-8B-Thinking-4bit This model was converted to MLX format from `Qwen/Qwen3-VL-8B-Thinking` using mlx-vlm version 0.3.4. Refer to the original model card for more details on the model. Use with mlx
Qwen2.5-0.5B-Instruct-8bit
granite-4.0-h-micro-8bit
This model mlx-community/granite-4.0-h-micro-8bit was converted to MLX format from ibm-granite/granite-4.0-h-micro using mlx-lm version 0.28.2.
Ling-1T-mlx-DQ3_K_M
This model mlx-community/Ling-1T-mlx-DQ3_K_M was converted to MLX format from inclusionAI/Ling-1T using mlx-lm version 0.28.1.

This quant was created for people using a single Apple Mac Studio M3 Ultra with 512 GB: the 4-bit version of Ling 1T does not fit. Using research results, we aim to get 4-bit performance from a slightly smaller and smarter quantization, while keeping the quant small enough to leave memory for a useful context window.

In the arXiv paper "Quantitative Analysis of Performance Drop in DeepSeek Model Quantization" the authors write:

> We further propose `DQ3_K_M`, a dynamic 3-bit quantization method that significantly outperforms the traditional `Q3_K_M` variant on various benchmarks, which is also comparable with the 4-bit quantization (`Q4_K_M`) approach in most tasks.

and describe a

> dynamic 3-bit quantization method (`DQ3_K_M`) that outperforms the 3-bit quantization implementation in `llama.cpp` and achieves performance comparable to 4-bit quantization across multiple benchmarks.

The resulting multi-bitwidth quantization has been well tested and documented. In the `convert.py` file of mlx-lm on your system (you can see the original code here), replace the code inside `def mixed_quant_predicate()` with a layer-dependent predicate like the sketch shown above under Kimi-K2-Instruct-0905-mlx-DQ3_K_M.
olmOCR-2-7B-1025-bf16
mlx-community/olmOCR-2-7B-1025-bf16 This model was converted to MLX format from `allenai/olmOCR-2-7B-1025` using mlx-vlm version 0.3.4. Refer to the original model card for more details on the model. Use with mlx
DeepSeek-R1-Distill-Qwen-14B-4bit
GLM-4-9B-0414-4bit
embeddinggemma-300m-qat-q4_0-unquantized-bf16
The Model mlx-community/embeddinggemma-300m-qat-q4_0-unquantized-bf16 was converted to MLX format from google/embeddinggemma-300m-qat-q4_0-unquantized using mlx-lm version 0.0.4.
GLM-Z1-9B-0414-4bit
gemma-3-12b-it-4bit
gemma-3-12b-it-bf16
DeepSeek-R1-Distill-Qwen-32B-abliterated-4bit
Qwen3-VL-32B-Instruct-4bit
mlx-community/Qwen3-VL-32B-Instruct-4bit This model was converted to MLX format from `Qwen/Qwen3-VL-32B-Instruct` using mlx-vlm version 0.3.4. Refer to the original model card for more details on the model. Use with mlx
whisper-turbo
GLM-4-32B-0414-8bit
Apertus-8B-Instruct-2509-bf16
This model mlx-community/Apertus-8B-Instruct-2509-bf16 was converted to MLX format from swiss-ai/Apertus-8B-Instruct-2509 using mlx-lm version 0.27.0.
Qwen3-VL-8B-Thinking-bf16
mlx-community/Qwen3-VL-8B-Thinking-bf16 This model was converted to MLX format from `Qwen/Qwen3-VL-8B-Thinking` using mlx-vlm version 0.3.4. Refer to the original model card for more details on the model. Use with mlx
olmOCR-2-7B-1025-4bit
Qwen3-VL-4B-Thinking-bf16
mlx-community/Qwen3-VL-4B-Thinking-bf16 This model was converted to MLX format from `Qwen/Qwen3-VL-4B-Thinking` using mlx-vlm version 0.3.4. Refer to the original model card for more details on the model. Use with mlx
Meta-Llama-3.1-8B-Instruct-bf16
granite-4.0-h-tiny-3bit-MLX
Granite-4.0-H-Tiny: MLX 3-bit (Apple Silicon)

Maintainer / Publisher: Susant Achary

This repository provides an Apple-Silicon-optimized MLX build of IBM Granite-4.0-H-Tiny with 3-bit weight quantization (plus usage guidance for 2/4/5/6-bit variants if RAM allows). Granite 4.0 is IBM's latest hybrid Mamba-2/Transformer family with selective Mixture-of-Experts (MoE), designed for long-context, hyper-efficient inference and enterprise use.

🔎 What's Granite 4.0?
- Architecture: hybrid Mamba-2 + softmax attention; H variants add MoE routing (sparse activation), aiming to keep expressivity while dramatically reducing memory footprint.
- Efficiency claims: up to ~70% lower memory and ~2× faster inference vs. comparable models, especially for multi-session and long-context scenarios.
- Context window: 128k tokens (Tiny/Base preview cards).
- Licensing: Apache-2.0 for public/commercial use.

> This MLX build targets Granite-4.0-H-Tiny (≈7B total, ≈1B active parameters). For reference, the family also includes H-Small (≈32B total / 9B active) and Micro/Micro-H (≈3B dense/hybrid) tiers.

📦 What's in this repo (MLX format)
- `config.json` (MLX), `model.safetensors` (3-bit shards), tokenizer files, and processor metadata.
- Ready for macOS on M-series chips via Metal/MPS.

> The upstream Hugging Face model cards for Granite 4.0 (Tiny/Small) provide additional training details, staged curricula, and the alignment workflow. Start here for Tiny: ibm-granite/granite-4.0-h-tiny.

✅ Intended use
- General instruction-following and chat with long context (128k).
- Enterprise assistant patterns (function calling, structured outputs) and RAG backends that benefit from efficient, large windows.
- On-device development on Macs (MLX), low-latency local prototyping and evaluation.

⚠️ Limitations
- As a quantized, decoder-only LM, it can produce confident but wrong outputs; review for critical use.
- 2–4-bit quantization may reduce precision on intricate tasks (math/code, tiny-text parsing); prefer higher bit-widths if RAM allows.
- Follow your organization's safety/PII/guardrail policies (Granite is "open-weight", not a full product).

🧠 Model family at a glance

| Tier | Arch | Params (total / active) | Notes |
|---|---|---:|---|
| H-Small | Hybrid + MoE | ~32B / 9B | Workhorse for enterprise agent tasks; strong function-calling & instruction following. |
| H-Tiny (this repo) | Hybrid + MoE | ~7B / 1B | Long-context, efficiency-first; great for local dev. |
| Micro / H-Micro | Dense / Hybrid | ~3B | Edge/low-resource alternatives for when the hybrid runtime isn't optimized. |

Context window: up to 128k tokens for the Tiny/Base preview lines. License: Apache-2.0.

🧪 Observed on-device behavior (MLX)

Empirically on M-series Macs:
- 3-bit often gives crisp, direct answers with good latency and modest RAM.
- Higher bit-widths (4/5/6-bit) improve faithfulness on fine-grained tasks (tiny OCR, structured parsing), at higher memory cost.

> Performance varies by Mac model, image/token lengths, and temperature; validate on your workload.

🔢 Choosing a quantization level (Apple Silicon)

| Variant | Typical Peak RAM (7B-class) | Relative speed | Typical behavior | When to choose |
|---|---:|:---:|---|---|
| 2-bit | ~3–4 GB | 🔥🔥🔥🔥 | Smallest footprint; most lossy | Minimal-RAM devices / smoke tests |
| 3-bit (this build) | ~5–6 GB | 🔥🔥🔥🔥 | Direct, concise, great latency | Default for local dev on M1/M2/M3/M4 |
| 4-bit | ~6–7.5 GB | 🔥🔥🔥 | Better detail retention | When you need stronger faithfulness |
| 5-bit | ~8–9 GB | 🔥🔥☆ | Higher fidelity | For heavy docs / structured outputs |
| 6-bit | ~9.5–11 GB | 🔥🔥 | Max quality under MLX quant | If RAM headroom is ample |

> Figures are indicative for the language-only Tiny (no vision) and will vary with context length and KV-cache size.

🚀 Quickstart (CLI, MLX)

```bash
# Plain generation (deterministic)
python -m mlx_lm.generate \
  --model mlx-community/granite-4.0-h-tiny-3bit-MLX \
  --prompt "Summarize the following notes into 5 bullet points:\n " \
  --max-tokens 200 \
  --temp 0.0 \
  --seed 0
```
GLM-4-32B-0414-4bit
CodeLlama-13b-Instruct-hf-4bit-MLX
Nanonets-OCR-s-bf16
Qwen3-VL-32B-Thinking-4bit
mlx-community/Qwen3-VL-32B-Thinking-4bit This model was converted to MLX format from `Qwen/Qwen3-VL-32B-Thinking` using mlx-vlm version 0.3.4. Refer to the original model card for more details on the model. Use with mlx
Qwen3-VL-2B-Instruct-3bit
mlx-community/Qwen3-VL-2B-Instruct-3bit This model was converted to MLX format from `Qwen/Qwen3-VL-2B-Instruct` using mlx-vlm version 0.3.4. Refer to the original model card for more details on the model. Use with mlx
distil-whisper-large-v3
DeepSeek-R1-0528-4bit
GLM-4.5-Air-2bit
This model mlx-community/GLM-4.5-Air-2bit was converted to MLX format from zai-org/GLM-4.5-Air using mlx-lm version 0.26.1.
InternVL3_5-GPT-OSS-20B-A4B-Preview-4bit
mlx-community/InternVL3_5-GPT-OSS-20B-A4B-Preview-4bit This model was converted to MLX format from `OpenGVLab/InternVL3_5-GPT-OSS-20B-A4B-Preview-HF` using mlx-vlm version 0.3.3. Refer to the original model card for more details on the model. Use with mlx
plamo-2-translate
Llama-3.2-11B-Vision-Instruct-4bit
Kokoro-82M-4bit
CodeLlama-7b-Python-4bit-MLX
gemma-3-12b-it-8bit
Qwen2.5-1.5B-Instruct-8bit
Qwen3-VL-4B-Thinking-4bit
mlx-community/Qwen3-VL-4B-Thinking-4bit This model was converted to MLX format from `Qwen/Qwen3-VL-4B-Thinking` using mlx-vlm version 0.3.4. Refer to the original model card for more details on the model. Use with mlx
Qwen2.5-14B-Instruct-4bit
Mixtral-8x7B-Instruct-v0.1
parakeet-tdt-1.1b
Qwen3-Next-80B-A3B-Instruct-8bit
Llama-4-Scout-17B-16E-Instruct-8bit
Qwen3-VL-8B-Thinking-6bit
mlx-community/Qwen3-VL-8B-Thinking-6bit This model was converted to MLX format from `Qwen/Qwen3-VL-8B-Thinking` using mlx-vlm version 0.3.4. Refer to the original model card for more details on the model. Use with mlx
Qwen2-VL-7B-Instruct-4bit
gemma-3-27b-it-qat-8bit
DeepSeek-V3.1-8bit
GLM-4.5V-8bit
Hermes-3-Llama-3.1-8B-4bit
Qwen3-VL-32B-Thinking-bf16
parakeet-ctc-0.6b
Llama-4-Scout-17B-16E-Instruct-6bit
deepcogito-cogito-v1-preview-llama-8B-4bit
Qwen3-VL-30B-A3B-Instruct-6bit
mlx-community/Qwen3-VL-30B-A3B-Instruct-6bit This model was converted to MLX format from `Qwen/Qwen3-VL-30B-A3B-Instruct` using mlx-vlm version 0.3.3. Refer to the original model card for more details on the model. Use with mlx
mxbai-embed-large-v1
Llama-3-8B-Instruct-1048k-4bit
OpenELM-270M-Instruct
GLM-4.5-Air-8bit
This model mlx-community/GLM-4.5-Air-8bit was converted to MLX format from zai-org/GLM-4.5-Air using mlx-lm version 0.26.0.
Qwen3-VL-4B-Instruct-5bit
mlx-community/Qwen3-VL-4B-Instruct-5bit This model was converted to MLX format from `Qwen/Qwen3-VL-4B-Instruct` using mlx-vlm version 0.3.4. Refer to the original model card for more details on the model. Use with mlx
Mistral-7B-Instruct-v0.2
DeepSeek-R1-Distill-Llama-70B-4bit
Qwen3-VL-32B-Thinking-8bit
GLM-4.5-Air-3bit-DWQ-v2
Qwen3-VL-8B-Instruct-8bit
mlx-community/Qwen3-VL-8B-Instruct-8bit This model was converted to MLX format from `Qwen/Qwen3-VL-8B-Instruct` using mlx-vlm version 0.3.4. Refer to the original model card for more details on the model. Use with mlx
Nous-Hermes-2-Mixtral-8x7B-DPO-4bit
Phi-3-mini-128k-instruct-4bit
Qwen2.5-VL-72B-Instruct-4bit
Meta-Llama-3.1-405B-4bit
Qwen3-Next-80B-A3B-Thinking-4bit
This model mlx-community/Qwen3-Next-80B-A3B-Thinking-4bit was converted to MLX format from Qwen/Qwen3-Next-80B-A3B-Thinking using mlx-lm version 0.27.1.
Jinx-gpt-oss-20b-mxfp4-mlx
This model mlx-community/Jinx-gpt-oss-20b-mxfp4-mlx was converted to MLX format from Jinx-org/Jinx-gpt-oss-20b-mxfp4 using mlx-lm version 0.27.1.
Llama-4-Scout-17B-16E-4bit
Qwen3-14B-4bit
NVIDIA-Nemotron-Nano-9B-v2-4bits
Kimi-K2-Instruct-0905-mlx-3bit
This model mlx-community/moonshotai_Kimi-K2-Instruct-0905-mlx-3bit was converted to MLX format from moonshotai/Kimi-K2-Instruct-0905 using mlx-lm version 0.26.3.
Llama-3_3-Nemotron-Super-49B-v1_5-mlx-4Bit
The Model mlx-community/Llama-3_3-Nemotron-Super-49B-v1_5-mlx-4Bit was converted to MLX format from unsloth/Llama-3_3-Nemotron-Super-49B-v1_5 using mlx-lm version 0.26.4.
gemma-2-27b-it-4bit
Qwen3-VL-30B-A3B-Instruct-3bit
mlx-community/Qwen3-VL-30B-A3B-Instruct-3bit This model was converted to MLX format from `Qwen/Qwen3-VL-30B-A3B-Instruct` using mlx-vlm version 0.3.3. Refer to the original model card for more details on the model. Use with mlx
DeepSeek-Coder-V2-Lite-Instruct-4bit-AWQ
chandra-bf16
Qwen3-1.7B-MLX-MXFP4
This model mlx-community/Qwen3-1.7B-MLX-MXFP4 was converted to MLX format from Qwen/Qwen3-1.7B using mlx-lm version 0.28.3.
Phi-3-mini-4k-instruct-4bit-no-q-embed
gemma-3-27b-it-8bit
Qwen3-VL-30B-A3B-Thinking-6bit
mlx-community/Qwen3-VL-30B-A3B-Thinking-6bit This model was converted to MLX format from `Qwen/Qwen3-VL-30B-A3B-Thinking` using mlx-vlm version 0.3.4. Refer to the original model card for more details on the model. Use with mlx
NousResearch_Hermes-4-14B-BF16-abliterated-mlx
gemma-3-4b-it-5bit
This model mlx-community/gemma-3-4b-it-5bit was converted to MLX format from google/gemma-3-4b-it using mlx-lm version 0.28.2.
chandra-4bit
mlx-community/chandra-4bit This model was converted to MLX format from `datalab-to/chandra` using mlx-vlm version 0.3.4. Refer to the original model card for more details on the model. Use with mlx
olmOCR-2-7B-1025-8bit
mlx-community/olmOCR-2-7B-1025-8bit This model was converted to MLX format from `allenai/olmOCR-2-7B-1025` using mlx-vlm version 0.3.4. Refer to the original model card for more details on the model. Use with mlx
Llama-3.1-Nemotron-70B-Instruct-HF-bf16
Qwen3-4B-6bit
Mistral-7B-Instruct-v0.2-4bit
Llama-3.2-90B-Vision-Instruct-4bit
GLM-4.5V-abliterated-4bit
mlx-community/GLM-4.5V-abliterated-4bit This model was converted to MLX format from `huihui-ai/Huihui-GLM-4.5V-abliterated` using mlx-vlm. Refer to the original model card for more details on the model. Use with mlx
quantized-gemma-2b-it
Meta-Llama-3-70B-Instruct-4bit
olmOCR-2-7B-1025-mlx-8bit
mlx-community/olmOCR-2-7B-1025-mlx-8bit This model was converted to MLX format from `allenai/olmOCR-2-7B-1025` using mlx-vlm version 0.3.5. Refer to the original model card for more details on the model. Use with mlx
TinyLlama-1.1B-Chat-v1.0-4bit
Unsloth-Phi-4-4bit
Qwen2.5-Coder-14B-Instruct-4bit
GLM-4.5V-abliterated-8bit
mlx-community/GLM-4.5V-abliterated-8bit This model was converted to MLX format from `huihui-ai/Huihui-GLM-4.5V-abliterated` using mlx-vlm. Refer to the original model card for more details on the model. Use with mlx
jinaai-ReaderLM-v2
Apertus-8B-Instruct-2509-4bit
Meta-Llama-3.1-70B-Instruct-bf16-CORRECTED
Qwen3-VL-4B-Thinking-8bit
mlx-community/Qwen3-VL-4B-Thinking-8bit This model was converted to MLX format from `Qwen/Qwen3-VL-4B-Thinking` using mlx-vlm version 0.3.4. Refer to the original model card for more details on the model. Use with mlx
paligemma-3b-mix-448-8bit
whisper-tiny-mlx
phi-4-4bit
llava-phi-3-mini-4bit
GLM-4.5-Air-3bit-DWQ
Qwen2.5-Coder-1.5B-Instruct-4bit
granite-4.0-h-1b-6bit
This model mlx-community/granite-4.0-h-1b-6bit was converted to MLX format from ibm-granite/granite-4.0-h-1b using mlx-lm version 0.28.4.
Qwen2.5-32B-Instruct-4bit
Mistral-Large-Instruct-2407-4bit
Apriel-1.5-15b-Thinker-8bit
Qwen3-14B-4bit-AWQ
DeepSeek-R1-Qwen3-0528-8B-4bit-AWQ
granite-4.0-h-1b-8bit
This model mlx-community/granite-4.0-h-1b-8bit was converted to MLX format from ibm-granite/granite-4.0-h-1b using mlx-lm version 0.28.4.
Qwen3-4B-Thinking-2507-fp16
granite-4.0-h-350m-8bit
Qwen2.5-Coder-32B-Instruct-4bit
Huihui-gemma-3n-E4B-it-abliterated-lm-8bit
Phi-3-vision-128k-instruct-4bit
Nous-Hermes-2-Mistral-7B-DPO-4bit-MLX
Josiefied-Qwen3-30B-A3B-abliterated-v2-4bit
AI21-Jamba-Reasoning-3B-4bit
This model mlx-community/AI21-Jamba-Reasoning-3B-4bit was converted to MLX format from ai21labs/AI21-Jamba-Reasoning-3B using mlx-lm version 0.28.2.
DeepSeek-Coder-V2-Instruct-AQ4_1
Josiefied-Qwen3-4B-Instruct-2507-abliterated-v1-8bit
Ministral-8B-Instruct-2410-4bit
Josiefied-Qwen3-8B-abliterated-v1-4bit
UTENA-7B-NSFW-V2-4bit
olmOCR-2-7B-1025-mlx-4bit
mlx-community/olmOCR-2-7B-1025-mlx-4bit This model was converted to MLX format from `allenai/olmOCR-2-7B-1025` using mlx-vlm version 0.3.5. Refer to the original model card for more details on the model. Use with mlx
parakeet-tdt_ctc-1.1b
DeepSeek-Coder-V2-Lite-Instruct-4bit
SmolVLM2-2.2B-Instruct-mlx
Mistral-7B-v0.1-LoRA-Text2SQL
gemma-3n-E2B-it-lm-bf16
Kimi-Linear-48B-A3B-Instruct-3bit
This model mlx-community/Kimi-Linear-48B-A3B-Instruct-3bit was converted to MLX format from moonshotai/Kimi-Linear-48B-A3B-Instruct using mlx-lm version 0.28.4.
csm-1b
Llama-4-Maverick-17B-16E-Instruct-6bit
SmolLM-135M-4bit
DeepSeek-V3.1-mlx-DQ5_K_M
This model mlx-community/DeepSeek-V3.1-mlx-DQ5_K_M was converted to MLX format from deepseek-ai/DeepSeek-V3.1 using mlx-lm version 0.26.3.

This quant was created for people using a single Apple Mac Studio M3 Ultra with 512 GB. With 512 GB we can do better than the 4-bit version of DeepSeek V3.1: using research results, we aim for better-than-5-bit performance through smarter quantization, while keeping the quant small enough to leave memory for a useful context window.

A temperature of 1.3 is DeepSeek's recommendation for translations; for coding, you should probably use a temperature of 0.6 or lower.

In the arXiv paper "Quantitative Analysis of Performance Drop in DeepSeek Model Quantization" the authors write:

> We further propose `DQ3_K_M`, a dynamic 3-bit quantization method that significantly outperforms the traditional `Q3_K_M` variant on various benchmarks, which is also comparable with the 4-bit quantization (`Q4_K_M`) approach in most tasks.

and describe a

> dynamic 3-bit quantization method (`DQ3_K_M`) that outperforms the 3-bit quantization implementation in `llama.cpp` and achieves performance comparable to 4-bit quantization across multiple benchmarks.

The resulting multi-bitwidth quantization has been well tested and documented. In this case we did not want an improved 3-bit quant, but rather the best possible "5-bit" quant. We therefore modified the `DQ3_K_M` quantization by replacing 3-bit with 5-bit, 4-bit with 6-bit, and 6-bit with 8-bit to create a new `DQ5_K_M` quant. This produces a quantization of 5.638 bpw (bits per weight). In the `convert.py` file of mlx-lm on your system (you can see the original code here), replace the code inside `def mixed_quant_predicate()` with a predicate like the sketch shown above under Kimi-K2-Instruct-0905-mlx-DQ3_K_M, with the bit-widths raised accordingly. Should you wish to squeeze more out of your quant, and you do not need a larger context window, you can change the last part of that code accordingly.
Ring-flash-linear-2.0-128k-4bit
This model mlx-community/Ring-flash-linear-2.0-128k-4bit was converted to MLX format from inclusionAI/Ring-flash-linear-2.0-128k using mlx-lm version 0.28.2.
Qwen3-Coder-30B-A3B-Instruct-3bit
This model mlx-community/Qwen3-Coder-30B-A3B-Instruct-3bit was converted to MLX format from Qwen/Qwen3-Coder-30B-A3B-Instruct using mlx-lm version 0.26.1.
whisper-large-v3-mlx-8bit
Qwen3-30B-A3B-bf16
Qwen3-30B-A3B-Instruct-2507-6bit
meta-llama-Llama-4-Scout-17B-16E-4bit
Qwen3-235B-A22B-Thinking-2507-3bit-DWQ
This model mlx-community/Qwen3-235B-A22B-Thinking-2507-3bit-DWQ was converted to MLX format from Qwen/Qwen3-235B-A22B-Thinking-2507 using mlx-lm version 0.26.0.
DeepSeek-R1-Distill-Qwen-14B-8bit
gemma-3-27b-it-qat-bf16
GLM-4.5-Air-2bit-DWQ
This model mlx-community/GLM-4.5-Air-2bit-DWQ was converted to MLX format from zai-org/GLM-4.5-Air using mlx-lm version 0.26.2.
GLM-4-9B-0414-8bit
DeepSeek-V3.1-Base-4bit
deepseek-coder-33b-instruct-hf-4bit-mlx
Qwen3-VL-30B-A3B-Instruct-5bit
mlx-community/Qwen3-VL-30B-A3B-Instruct-5bit This model was converted to MLX format from `Qwen/Qwen3-VL-30B-A3B-Instruct` using mlx-vlm version 0.3.3. Refer to the original model card for more details on the model. Use with mlx
Qwen3-Next-80B-A3B-Thinking-8bit
moonshotai_Kimi-K2-Instruct-mlx-3bit
This model mlx-community/moonshotai_Kimi-K2-Instruct-mlx-3bit was converted to MLX format from moonshotai/Kimi-K2-Instruct using mlx-lm version 0.26.3.
UserLM-8b-8bit
Qwen2.5-7B-Instruct-1M-4bit
Llama-3.1-8B-Instruct
Llama-4-Maverick-17B-128E-Instruct-4bit
Apriel-1.5-15b-Thinker-6bit-MLX
Apriel-1.5-15B-Thinker: MLX Quantized (Apple Silicon)

Format: MLX (Apple Silicon). Variants: 6-bit (recommended). Base model: ServiceNow-AI/Apriel-1.5-15B-Thinker. Architecture: Pixtral-style LLaVA (vision encoder → 2-layer projector → decoder). Intended use: image understanding & grounded reasoning; document/chart/OCR-style tasks; math/coding Q&A with visual context.

> This repository provides MLX-format weights for Apple Silicon (M-series) built from the original Apriel-1.5-15B-Thinker release. It is optimized for on-device inference with a small memory footprint and fast startup on macOS.

Apriel-1.5-15B-Thinker is a 15B open-weights multimodal reasoning model trained via a data-centric mid-training recipe rather than RLHF/RM. Starting from Pixtral-12B as the base, the authors apply:
1) Depth upscaling (capacity expansion without pretraining from scratch),
2) Two-stage multimodal continual pretraining (CPT) to build text + visual reasoning, and
3) High-quality SFT with explicit reasoning traces across math, coding, science, and tool use.
This approach delivers frontier-level capability on compact compute.

Key reported results (original model)
- AAI Index: 52, matching DeepSeek-R1-0528 at far lower compute.
- Multimodal: on 10 image benchmarks, within ~5 points of Gemini-2.5-Flash and Claude Sonnet-3.7 on average.
- Designed for single-GPU / constrained deployment scenarios.

> The notes above summarize the upstream paper; MLX quantization can slightly affect absolute scores. Always validate on your use case.

- Backbone: Pixtral-12B-Base-2409 adapted to a larger 15B decoder via depth upscaling (layers 40 → 48), then re-aligned with a 2-layer projection network connecting the vision encoder and decoder.
- Training stack:
  - CPT Stage-1: mixed tokens (≈50% text, 20% replay, 30% multimodal) for foundational reasoning & image understanding; 32k context; cosine LR with warmup; all components unfrozen; checkpoint averaging.
  - CPT Stage-2: targeted synthetic visual tasks (reconstruction, visual matching, detection, counting) to strengthen spatial/compositional/fine-grained reasoning; vision encoder frozen; loss on responses for instruct data; 16k context.
  - SFT: curated instruction-response pairs with explicit reasoning traces (math, coding, science, tools).
- Why MLX? Native Apple-Silicon inference with small binaries, fast load, and low memory overhead.
- What's included: `config.json`, sharded `model.safetensors`, tokenizer & processor files, and metadata for VLM pipelines.
- Quantization options: 6-bit (recommended) offers the best balance of quality & memory.

> Tip: if you're capacity-constrained on an M1/M2, try 6-bit first.

```bash
# Basic image caption
python -m mlx_vlm.generate \
  --model mlx-community/Apriel-1.5-15b-Thinker-6bit-MLX \
  --image /path/to/image.jpg \
  --prompt "Describe this image." \
  --max-tokens 128 --temperature 0.0
```
DeepSeek-R1-0528-Qwen3-8B-4bit-DWQ
all-MiniLM-L6-v2-4bit
InternVL3_5-30B-A3B-4bit
mlx-community/InternVL3_5-30B-A3B-4bit This model was converted to MLX format from `OpenGVLab/InternVL3_5-30B-A3B-HF` using mlx-vlm version 0.3.3. Refer to the original model card for more details on the model. Use with mlx
mistral-7B-v0.1
LFM2-8B-A1B-8bit
Qwen3-VL-30B-A3B-Thinking-5bit
DeepSeek-R1-Distill-Qwen-14B-6bit
Codestral-22B-v0.1-8bit
GLM-Z1-32B-0414-4bit
Qwen3-Coder-30B-A3B-Instruct-8bit-DWQ-lr9e8
bge-small-en-v1.5-4bit
DeepSeek-R1-3bit
Nanonets-OCR2-3B-6bit
mlx-community/Nanonets-OCR2-3B-6bit This model was converted to MLX format from `nanonets/Nanonets-OCR2-3B` using mlx-vlm version 0.3.3. Refer to the original model card for more details on the model. Use with mlx
DeepSeek-v3-0324-8bit
Ring-1T-mlx-DQ3_K_M
This model mlx-community/Ring-1T-mlx-DQ3_K_M was converted to MLX format from inclusionAI/Ring-1T using mlx-lm version 0.28.1.

This quant was created for people using a single Apple Mac Studio M3 Ultra with 512 GB: the 4-bit version of Ring 1T does not fit. Using research results, we aim to get 4-bit performance from a slightly smaller and smarter quantization, while keeping the quant small enough to leave memory for a useful context window.

In the arXiv paper "Quantitative Analysis of Performance Drop in DeepSeek Model Quantization" the authors write:

> We further propose `DQ3_K_M`, a dynamic 3-bit quantization method that significantly outperforms the traditional `Q3_K_M` variant on various benchmarks, which is also comparable with the 4-bit quantization (`Q4_K_M`) approach in most tasks.

and describe a

> dynamic 3-bit quantization method (`DQ3_K_M`) that outperforms the 3-bit quantization implementation in `llama.cpp` and achieves performance comparable to 4-bit quantization across multiple benchmarks.

The resulting multi-bitwidth quantization has been well tested and documented. In the `convert.py` file of mlx-lm on your system (you can see the original code here), replace the code inside `def mixed_quant_predicate()` with a layer-dependent predicate like the sketch shown above under Kimi-K2-Instruct-0905-mlx-DQ3_K_M.
olmOCR-2-7B-1025-5bit
mlx-community/olmOCR-2-7B-1025-5bit This model was converted to MLX format from `allenai/olmOCR-2-7B-1025` using mlx-vlm version 0.3.4. Refer to the original model card for more details on the model. Use with mlx
DeepSeek-R1-Distill-Qwen-7B-8bit
plamo-2-1b
Llama-3.2-3B-Instruct-abliterated-6bit
embeddinggemma-300m-qat-q8_0-unquantized-bf16
The Model mlx-community/embeddinggemma-300m-qat-q8_0-unquantized-bf16 was converted to MLX format from google/embeddinggemma-300m-qat-q8_0-unquantized using mlx-lm version 0.0.4.
Qwen3-4B-Instruct-2507-8bit
This model mlx-community/Qwen3-4B-Instruct-2507-8bit was converted to MLX format from Qwen/Qwen3-4B-Instruct-2507 using mlx-lm version 0.26.2.
Llama-3.3-70B-Instruct-bf16
Qwen3-VL-32B-Instruct-bf16
mlx-community/Qwen3-VL-32B-Instruct-bf16 This model was converted to MLX format from `Qwen/Qwen3-VL-32B-Instruct` using mlx-vlm version 0.3.4. Refer to the original model card for more details on the model. Use with mlx
codegemma-7b-it-8bit
Llama-3.1-8B-Instruct-4bit
Qwen3-Next-80B-A3B-Instruct-5bit
granite-3.3-8b-instruct-4bit
Qwen3-8B-4bit-DWQ-053125
c4ai-command-r-plus-4bit
Qwen2.5-72B-Instruct-4bit
gemma-3-27b-it-4bit-DWQ
dolphin-2.9-llama3-70b-4bit
Mistral-Small-24B-Instruct-2501-4bit
llava-v1.6-mistral-7b-4bit
gemma-3-1b-it-bf16
dac-speech-24khz-1.5kbps
Llama-OuteTTS-1.0-1B-4bit
LongCat-Flash-Chat-4bit
granite-4.0-h-1b-base-8bit
This model mlx-community/granite-4.0-h-1b-base-8bit was converted to MLX format from ibm-granite/granite-4.0-h-1b-base using mlx-lm version 0.28.4.
Llama-3.3-70B-Instruct-3bit
deepseek-coder-33b-instruct
bitnet-b1.58-2B-4T-4bit
Kimi-Linear-48B-A3B-Instruct-5bit
This model mlx-community/Kimi-Linear-48B-A3B-Instruct-5bit was converted to MLX format from moonshotai/Kimi-Linear-48B-A3B-Instruct using mlx-lm version 0.28.4.
MinerU2.5-2509-1.2B-bf16
mlx-community/MinerU2.5-2509-1.2B-bf16 This model was converted to MLX format from `opendatalab/MinerU2.5-2509-1.2B` using mlx-vlm version 0.3.3. Refer to the original model card for more details on the model. Use with mlx
Mistral-Small-3.1-24B-Instruct-2503-4bit
Mixtral-8x7B-Instruct-v0.1-hf-4bit-mlx
Llama-3.1-Nemotron-Nano-4B-v1.1-4bit
Apriel-1.5-15b-Thinker-bf16
mlx-community/Apriel-1.5-15b-Thinker-bf16 This model was converted to MLX format from `ServiceNow-AI/Apriel-1.5-15b-Thinker` using mlx-vlm version 0.3.3. Refer to the original model card for more details on the model. Use with mlx
Qwen3-30B-A3B-Thinking-2507-4bit
This model mlx-community/Qwen3-30B-A3B-Thinking-2507-4bit was converted to MLX format from Qwen/Qwen3-30B-A3B-Thinking-2507 using mlx-lm version 0.26.3.
LFM2-8B-A1B-fp16
Qwen2.5-VL-32B-Instruct-4bit
Qwen3-14B-4bit-DWQ-053125
meta-llama-Llama-4-Scout-17B-16E-fp16
gemma-3-4b-it-bf16
deepseek-coder-6.7b-instruct-hf-4bit-mlx
gemma-3-1b-it-4bit-DWQ
gemma-3n-E4B-it-bf16
LongCat-Flash-Chat-mlx-DQ6_K_M
gemma-3-270m-it-bf16
whisper-medium-mlx-4bit
Qwen3-14B-6bit
gpt2-base-mlx
LFM2-VL-450M-8bit
starcoder2-7b-4bit
Ling-mini-2.0-4bit
This model mlx-community/Ling-mini-2.0-4bit was converted to MLX format from inclusionAI/Ling-mini-2.0 using mlx-lm version 0.27.1.
LLaDA2.0-mini-preview-4bit
This model mlx-community/LLaDA2.0-mini-preview-4bit was converted to MLX format from inclusionAI/LLaDA2.0-mini-preview using mlx-lm version 0.28.4.
Qwen3-4B-4bit-DWQ-053125
Dolphin-Mistral-24B-Venice-Edition-4bit
Llama-3-8B-Instruct-1048k-8bit
conikeec-deepseek-coder-6.7b-instruct
Josiefied-DeepSeek-R1-0528-Qwen3-8B-abliterated-v1-4bit
Apertus-8B-Instruct-2509-8bit
Gemma-3-Glitter-12B-8bit
gemma-3-12b-it-4bit-DWQ
Gabliterated-Qwen3-0.6B-4bit
gemma-3-270m-4bit
Qwen3-VL-2B-Thinking-bf16
mlx-community/Qwen3-VL-2B-Thinking-bf16 This model was converted to MLX format from `Qwen/Qwen3-VL-2B-Thinking` using mlx-vlm version 0.3.4. Refer to the original model card for more details on the model. Use with mlx
gemma-2-27b-bf16
Qwen3-VL-4B-Instruct-6bit
mlx-community/Qwen3-VL-4B-Instruct-6bit This model was converted to MLX format from `Qwen/Qwen3-VL-4B-Instruct` using mlx-vlm version 0.3.4. Refer to the original model card for more details on the model. Use with mlx
Mistral-7B-Instruct-v0.3-8bit
LFM2-VL-3B-8bit
nomicai-modernbert-embed-base-bf16
bitnet-b1.58-2B-4T-8bit
Qwen3-Coder-30B-A3B-Instruct-bf16
This model mlx-community/Qwen3-Coder-30B-A3B-Instruct-bf16 was converted to MLX format from Qwen/Qwen3-Coder-30B-A3B-Instruct using mlx-lm version 0.26.2.
LFM2-8B-A1B-6bit
This model mlx-community/LFM2-8B-A1B-6bit was converted to MLX format from LiquidAI/LFM2-8B-A1B using mlx-lm version 0.28.2.
gemma-3n-E4B-it-lm-bf16
Qwen2.5-Coder-1.5B-4bit
gemma-3-270m-it-qat-4bit
This model mlx-community/gemma-3-270m-it-qat-4bit was converted to MLX format from google/gemma-3-270m-it-qat using mlx-lm version 0.26.3.
DeepSeek-R1-Distill-Qwen-1.5B-6bit
medgemma-27b-it-8bit
gemma-3-27b-it-bf16
orpheus-3b-0.1-ft-4bit
meta-llama-Llama-4-Scout-17B-16E-Instruct-bf16
c4ai-command-r-v01-4bit
Llama-3.2-8X4B-MOE-V2-Dark-Champion-Instruct-uncensored-abliterated-21B-MLX
Qwen3-1.7B-4bit-DWQ-053125
Qwen3-4B-Instruct-2507-5bit
LFM2-8B-A1B-6bit-MLX
Maintainer / Publisher: Susant Achary. Upstream model: LiquidAI/LFM2-8B-A1B. This repo (MLX 6-bit): `mlx-community/LFM2-8B-A1B-6bit-MLX`.

This repository provides an Apple-Silicon-optimized MLX build of LFM2-8B-A1B at 6-bit quantization. Among quantized tiers, 6-bit is a strong fidelity sweet spot for many Macs: noticeably smaller than FP16/8-bit while preserving answer quality for instruction following, summarization, and structured extraction.

- Architecture: Mixture-of-Experts (MoE) Transformer.
- Size: ~8B total parameters with ~1B active per token (A1B ≈ "~1B active").
- Why MoE? At each token, a subset of experts is activated, reducing compute per token while keeping a larger parameter pool for expressivity.

> Single-device memory reality: even though only ~1B parameters are active per token, all experts typically reside in memory during inference on one device. That means RAM planning should track total parameters, not just the active slice.

Repo contents:
- `config.json` (MLX), `model.safetensors` (6-bit shards)
- Tokenizer files: `tokenizer.json`, `tokenizer_config.json`
- Model metadata (e.g., `model_index.json`)

Target: macOS on Apple Silicon (M-series) with Metal/MPS.

Intended use:
- General instruction following, chat, and summarization
- RAG and long-context assistants on device
- Schema-guided structured outputs (JSON)

Limitations:
- Quantization can cause small regressions vs FP16 on tricky math/code or tight formatting.
- For very long contexts and/or batching, the KV-cache can dominate memory; tune `max_tokens` and batch size.
- Add your own safety/guardrails for sensitive deployments.

The following are practical starting points for a single-device MLX run; validate on your hardware.

Rule-of-thumb components:
- Weights (6-bit): ≈ `total_params × 0.75 byte` → for 8B params ≈ ~6.0 GB
Josiefied-Qwen2.5-7B-Instruct-abliterated-v2
deepseek-coder-1.3b-instruct-mlx
Qwen2.5-Coder-32B-Instruct-8bit
Qwen2.5-VL-3B-Instruct-bf16
gemma-3-4b-it-4bit-DWQ
Qwen3-1.7B-8bit
Huihui-gemma-3n-E4B-it-abliterated-lm-6bit
The Model mlx-community/Huihui-gemma-3n-E4B-it-abliterated-lm-6bit was converted to MLX format from huihui-ai/Huihui-gemma-3n-E4B-it-abliterated using mlx-lm version 0.26.4.
Qwen3-VL-2B-Instruct-8bit
mlx-community/Qwen3-VL-2B-Instruct-8bit This model was converted to MLX format from `Qwen/Qwen3-VL-2B-Instruct` using mlx-vlm version 0.3.4. Refer to the original model card for more details on the model. Use with mlx
GLM-4-32B-0414-4bit-DWQ
granite-4.0-h-tiny-5bit-MLX
Josiefied-Qwen3-30B-A3B-abliterated-v2-8bit
Huihui-gemma-3n-E4B-it-abliterated-lm-4bit
The Model mlx-community/Huihui-gemma-3n-E4B-it-abliterated-lm-4bit was converted to MLX format from huihui-ai/Huihui-gemma-3n-E4B-it-abliterated using mlx-lm version 0.26.4.
CodeLlama-7b-mlx
Qwen3-0.6B-4bit-AWQ
Josiefied-Qwen3-8B-abliterated-v1-8bit
Llama-4-Scout-17B-16E-8bit
Qwen3-4B-Instruct-2507-4bit-g32
This model mlx-community/Qwen3-4B-Instruct-2507-4bit-g32 was converted to MLX format from Qwen/Qwen3-4B-Instruct-2507 using mlx-lm version 0.28.2.
Qwen3-VL-8B-Thinking-5bit
mlx-community/Qwen3-VL-8B-Thinking-5bit This model was converted to MLX format from `Qwen/Qwen3-VL-8B-Thinking` using mlx-vlm version 0.3.4. Refer to the original model card for more details on the model. Use with mlx