mlx-community
✓ Verified · Community · Apple MLX framework community contributions
gpt-oss-20b-MXFP4-Q8
---
license: apache-2.0
pipeline_tag: text-generation
library_name: mlx
tags:
- vllm
- mlx
base_model: openai/gpt-oss-20b
---
parakeet-tdt-0.6b-v3
---
library_name: mlx
language:
- en
- es
- fr
- de
- bg
- hr
- cs
- da
- nl
- et
- fi
- el
- hu
- it
- lv
- lt
- mt
- pl
- pt
- ro
- sk
- sl
- sv
- ru
- uk
tags:
- mlx
- automatic-speech-recognition
- speech
- audio
- FastConformer
- Conformer
- Parakeet
license: cc-by-4.0
pipeline_tag: automatic-speech-recognition
base_model: nvidia/parakeet-tdt-0.6b-v3
---
parakeet-tdt-0.6b-v2
---
library_name: mlx
tags:
- mlx
- automatic-speech-recognition
- speech
- audio
- FastConformer
- Conformer
- Parakeet
license: cc-by-4.0
pipeline_tag: automatic-speech-recognition
base_model: nvidia/parakeet-tdt-0.6b-v2
---
whisper-small-mlx
gemma-3-12b-it-qat-4bit
gemma-3-27b-it-qat-4bit
mlx-community/gemma-3-27b-it-qat-4bit This model was converted to MLX format from [`google/gemma-3-27b-it-qat-q4_0-unquantized`]() using mlx-vlm version 0.1.23. Refer to the original model card for more details on the model. Use with mlx
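The "Use with mlx" note on these mlx-vlm cards usually ends in a short snippet. A minimal sketch of the usual pattern, assuming the standard `mlx_vlm` Python API (`load`, `generate`, `apply_chat_template`, `load_config`) and an illustrative image URL; argument order has shifted between mlx-vlm releases, so verify against your installed version:

```python
# Typical mlx-vlm usage pattern for an mlx-community VLM repo (illustrative sketch).
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

model_path = "mlx-community/gemma-3-27b-it-qat-4bit"
model, processor = load(model_path)
config = load_config(model_path)

# One or more images plus a text prompt.
image = ["http://images.cocodataset.org/val2017/000000039769.jpg"]
prompt = "Describe this image."

# Wrap the prompt in the model's chat template before generating.
formatted_prompt = apply_chat_template(processor, config, prompt, num_images=len(image))
output = generate(model, processor, formatted_prompt, image, verbose=False)
print(output)
```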
gemma-3-4b-it-qat-4bit
Qwen3-30B-A3B-Instruct-2507-4bit
This model mlx-community/Qwen3-30B-A3B-Instruct-2507-4bit was converted to MLX format from Qwen/Qwen3-30B-A3B-Instruct-2507 using mlx-lm version 0.26.3.
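For the text-only mlx-lm conversions like this one, the card's usual usage snippet looks roughly like the following; `mlx_lm.load`/`generate` are the standard mlx-lm entry points, and the repo name is this entry's:

```python
# Standard mlx-lm usage pattern for an mlx-community text model (illustrative sketch).
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Qwen3-30B-A3B-Instruct-2507-4bit")

prompt = "Hello"

# Apply the chat template if the tokenizer ships one.
if tokenizer.chat_template is not None:
    messages = [{"role": "user", "content": prompt}]
    prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

response = generate(model, tokenizer, prompt=prompt, verbose=True)
```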
gemma-3-1b-it-qat-4bit
The Model mlx-community/gemma-3-1b-it-qat-4bit was converted to MLX format from google/gemma-3-1b-it-qat-q4_0 using mlx-lm version 0.22.5.
Llama-3.2-1B-Instruct-4bit
Llama-3.2-3B-Instruct-4bit
gemma-2-2b-it-4bit
whisper-large-v3-mlx
Qwen3-1.7B-4bit
Meta-Llama-3.1-8B-Instruct-4bit
Qwen3-VL-4B-Instruct-4bit
DeepSeek-OCR-8bit
mlx-community/DeepSeek-OCR-8bit This model was converted to MLX format from [`deepseek-ai/DeepSeek-OCR`]() using mlx-vlm version 0.3.5. Refer to the original model card for more details on the model. Use with mlx
Qwen3-Embedding-0.6B-4bit-DWQ
Mistral-7B-Instruct-v0.3-4bit
gemma-3-1b-it-4bit
Kokoro-82M-bf16
mlx-community/Kokoro-82M-bf16 This model was converted to MLX format from [`hexagrad/Kokoro-82M`]() using mlx-audio version 0.0.1. Refer to the original model card for more details on the model. Use with mlx
gemma-3n-E4B-it-lm-4bit
gemma-3n-E2B-it-lm-4bit
Qwen3-4B-4bit
Qwen3-0.6B-4bit
This model mlx-community/Qwen3-0.6B-4bit was converted to MLX format from Qwen/Qwen3-0.6B using mlx-lm version 0.24.0.
MiniMax-M2-4bit
This model mlx-community/MiniMax-M2-4bit was converted to MLX format from MiniMaxAI/MiniMax-M2 using mlx-lm version 0.28.4.
Qwen2-VL-2B-Instruct-4bit
whisper-large-v3-turbo
whisper-large-v3-turbo This model was converted to MLX format from [`large-v3-turbo`]().
Qwen2.5-3B-Instruct-4bit
DeepSeek-V3.2-4bit
bge-small-en-v1.5-bf16
Qwen3-VL-2B-Instruct-4bit
mlx-community/Qwen3-VL-2B-Instruct-4bit This model was converted to MLX format from [`Qwen/Qwen3-VL-2B-Instruct`]() using mlx-vlm version 0.3.4. Refer to the original model card for more details on the model. Use with mlx
Qwen3-235B-A22B-4bit
Dolphin3.0-Llama3.1-8B-4bit
Qwen3-4B-Instruct-2507-4bit
DeepSeek-OCR-4bit
MiniMax-M2.1-4bit
SmolLM3-3B-4bit
Qwen2.5-0.5B-Instruct-4bit
Qwen3-235B-A22B-8bit
Qwen1.5-0.5B-Chat-4bit
DeepSeek-R1-Distill-Qwen-1.5B-8bit
Qwen3-4B-Thinking-2507-4bit
GLM-4.6-4bit
This model mlx-community/GLM-4.6-4bit was converted to MLX format from zai-org/GLM-4.6 using mlx-lm version 0.28.1.
DeepSeek-OCR-5bit
mlx-community/DeepSeek-OCR-5bit This model was converted to MLX format from [`deepseek-ai/DeepSeek-OCR`]() using mlx-vlm version 0.3.5. Refer to the original model card for more details on the model. Use with mlx
Qwen2.5-1.5B-Instruct-4bit
MiniMax-M2-3bit
DeepSeek-R1-Distill-Qwen-1.5B-4bit
GLM-4.7-4bit
whisper-medium-mlx
granite-4.0-h-micro-4bit
MiniMax-M2-8bit
This model mlx-community/MiniMax-M2-8bit was converted to MLX format from MiniMaxAI/MiniMax-M2 using mlx-lm version 0.28.4.
Qwen3-8B-4bit
Llama-3.2-3B-Instruct-8bit
Meta-Llama-3-8B-Instruct-4bit
mlx-community/Meta-Llama-3-8B-Instruct-4bit This model was converted to MLX format from [`meta-llama/Meta-Llama-3-8B-Instruct`]() using mlx-lm version 0.9.0. Refer to the original model card for more details on the model. Use with mlx
SmolVLM2-500M-Video-Instruct-mlx
Qwen3-Embedding-4B-4bit-DWQ
Josiefied-Qwen3-4B-abliterated-v1-4bit
mlx-community/Josiefied-Qwen3-4B-abliterated-v1-4bit This model mlx-community/Josiefied-Qwen3-4B-abliterated-v1-4bit was converted to MLX format from Goekdeniz-Guelmez/Josiefied-Qwen3-4B-abliterated-v1 using mlx-lm version 0.24.0.
Qwen2.5-7B-Instruct-4bit
Phi-4-mini-instruct-4bit
phi-2
MiniMax-M2-mlx-8bit-gs32
This model mlx-community/MiniMax-M2-mlx-8bit-gs32 was converted to MLX format from MiniMaxAI/MiniMax-M2 using mlx-lm version 0.28.1. Recipe: 8-bit, group size 32; 9 bits per weight (bpw). You can find more similar MLX model quants for a single Apple Mac Studio M3 Ultra with 512 GB at https://huggingface.co/bibproj
granite-3.3-2b-instruct-4bit
Llama-3.3-70B-Instruct-4bit
GLM-4.6-mlx-8bit-gs32
This model mlx-community/GLM-4.6-mlx-8bit-gs32 was converted to MLX format from zai-org/GLM-4.6 using mlx-lm version 0.28.1. Recipe: 8-bit, group size 32; 9 bits per weight (bpw). You can find more similar MLX model quants for Apple Mac Studio with 512 GB at https://huggingface.co/bibproj
Qwen3-VL-235B-A22B-Instruct-3bit
mlx-community/Qwen3-VL-235B-A22B-Instruct-3bit This model was converted to MLX format from [`Qwen/Qwen3-VL-235B-A22B-Instruct`]() using mlx-vlm version 0.3.3. Refer to the original model card for more details on the model. Use with mlx
Llama-2-7b-chat-mlx
granite-4.0-h-tiny-4bit
DeepSeek-Coder-V2-Lite-Instruct-4bit-mlx
Qwen2.5-VL-3B-Instruct-4bit
mlx-community/Qwen2.5-VL-3B-Instruct-4bit This model was converted to MLX format from [`Qwen/Qwen2.5-VL-3B-Instruct`]() using mlx-vlm version 0.1.11. Refer to the original model card for more details on the model. Use with mlx
gemma-3-270m-it-8bit
This model mlx-community/gemma-3-270m-it-8bit was converted to MLX format from google/gemma-3-270m-it using mlx-lm version 0.26.3.
NVIDIA-Nemotron-3-Nano-30B-A3B-4bit
whisper-base-mlx
Qwen3-Embedding-8B-4bit-DWQ
This model mlx-community/Qwen3-Embedding-8B-4bit-DWQ was converted to MLX format from Qwen/Qwen3-Embedding-8B using mlx-lm version 0.25.1.
parakeet-rnnt-0.6b
gpt-oss-120b-MXFP4-Q4
This model mlx-community/gpt-oss-120b-MXFP4-Q4 was converted to MLX format from openai/gpt-oss-120b using mlx-lm version 0.27.0.
Phi-3.5-mini-instruct-4bit
IQuest-Coder-V1-40B-Loop-Instruct-4bit
LFM2-2.6B-4bit
This model mlx-community/LFM2-2.6B-4bit was converted to MLX format from LiquidAI/LFM2-2.6B using mlx-lm version 0.28.0.
LFM2-8B-A1B-4bit
gpt-oss-20b-MXFP4-Q4
GLM-4.6-bf16
This model mlx-community/GLM-4.6-bf16 was converted to MLX format from zai-org/GLM-4.6 using mlx-lm version 0.28.2.
Qwen3-0.6B-8bit
LFM2-1.2B-4bit
This model mlx-community/LFM2-1.2B-4bit was converted to MLX format from LiquidAI/LFM2-1.2B using mlx-lm version 0.26.0.
whisper-large-v2-mlx
gemma-3n-E4B-it-4bit
Qwen3-VL-30B-A3B-Instruct-4bit
mlx-community/Qwen3-VL-30B-A3B-Instruct-4bit This model was converted to MLX format from [`Qwen/Qwen3-VL-30B-A3B-Instruct`]() using mlx-vlm version 0.3.3. Refer to the original model card for more details on the model. Use with mlx
Dolphin3.0-Llama3.1-8B-8bit
MiniMax-M2-6bit
This model mlx-community/MiniMax-M2-6bit was converted to MLX format from MiniMaxAI/MiniMax-M2 using mlx-lm version 0.28.4.
deepcogito-cogito-v1-preview-llama-3B-4bit
Hermes-3-Llama-3.2-3B-4bit
DeepSeek-R1-Distill-Qwen-32B-4bit
Qwen3-VL-30B-A3B-Instruct-8bit
dolphin3.0-llama3.2-3B-4Bit
Phi-4-mini-instruct-8bit
The Model mlx-community/Phi-4-mini-instruct-8bit was converted to MLX format from microsoft/Phi-4-mini-instruct using mlx-lm version 0.21.5.
GLM-4.5-Air-3bit
This model mlx-community/GLM-4.5-Air-3bit was converted to MLX format from zai-org/GLM-4.5-Air using mlx-lm version 0.26.1.
DeepSeek-V3-0324-4bit
This model mlx-community/DeepSeek-V3-0324-4bit was converted to MLX format from deepseek-ai/DeepSeek-V3-0324 using mlx-lm version 0.26.3.
GLM-4.6-5bit
This model mlx-community/GLM-4.6-5bit was converted to MLX format from zai-org/GLM-4.6 using mlx-lm version 0.28.1.
gemma-2-9b-it-4bit
gpt-oss-120b-MXFP4-Q8
This model mlx-community/gpt-oss-120b-MXFP4-Q8 was converted to MLX format from openai/gpt-oss-120b using mlx-lm version 0.27.0.
DeepSeek-OCR-6bit
mlx-community/DeepSeek-OCR-6bit This model was converted to MLX format from [`deepseek-ai/DeepSeek-OCR`]() using mlx-vlm version 0.3.5. Refer to the original model card for more details on the model. Use with mlx
Josiefied-Qwen3-1.7B-abliterated-v1-4bit
DeepSeek-R1-0528-Qwen3-8B-4bit
This model mlx-community/DeepSeek-R1-0528-Qwen3-8B-4bit was converted to MLX format from deepseek-ai/DeepSeek-R1-0528-Qwen3-8B using mlx-lm version 0.24.1.
gemma-3-4b-it-4bit
mlx-community/gemma-3-4b-it-4bit This model was converted to MLX format from [`google/gemma-3-4b-it`]() using mlx-vlm version 0.1.18. Refer to the original model card for more details on the model. Use with mlx
SmolLM-135M-Instruct-4bit
whisper-tiny
Mistral-Nemo-Instruct-2407-4bit
LFM2-8B-A1B-3bit-MLX
Maintainer / Publisher: Susant Achary. Upstream model: LiquidAI/LFM2-8B-A1B. This repo (MLX 3-bit): `mlx-community/LFM2-8B-A1B-3bit-MLX`.

This repository provides an Apple-Silicon-optimized MLX build of LFM2-8B-A1B at 3-bit quantization. 3-bit is an excellent size↔quality sweet spot on many Macs: a very small memory footprint with surprisingly solid answer quality and snappy decoding.

- Architecture: Mixture-of-Experts (MoE) Transformer.
- Size: ~8B total parameters with ~1B active per token (the "A1B" naming commonly indicates ~1B active params).
- Why MoE? Per token, only a subset of experts is activated → lower compute per token while retaining a larger parameter pool for expressivity.

> Memory reality on a single device: even though only ~1B parameters are active at a time, all experts typically reside in memory in single-device runs. Plan RAM based on total parameters, not just the active slice.

What's in this repo (MLX format):
- `config.json` (MLX), `mlx_model.safetensors` (3-bit shards)
- Tokenizer: `tokenizer.json`, `tokenizer_config.json`
- Metadata: `model_index.json` (and/or processor metadata as applicable)

Target: macOS on Apple Silicon (M-series) using Metal/MPS.

Intended use:
- General instruction following, chat, and summarization
- RAG back-ends and long-context assistants on device
- Schema-guided structured outputs (JSON) where low RAM is a priority

Limitations:
- 3-bit is lossy: the latency/RAM savings come with some accuracy trade-off vs 6/8-bit.
- For very long contexts and/or batching, the KV-cache can dominate memory; tune `max_tokens` and batch size.
- Add your own guardrails/safety for production deployments.

Estimated memory (assumed, realistic ranges): the numbers below are practical starting points; verify on your machine.
- Weights (3-bit): ≈ `total_params × 0.375 byte` → for 8B params ≈ ~3.0 GB
- Runtime overhead: MLX graph/tensors/metadata → ~0.6–1.0 GB
- KV-cache: grows with context × layers × heads × dtype → ~0.8–2.5+ GB

| Context window | Estimated peak RAM |
|---|---:|
| 4k tokens | ~4.4–5.5 GB |
| 8k tokens | ~5.2–6.6 GB |
| 16k tokens | ~6.5–8.8 GB |

> For ≤2k windows you may see ~4.0–4.8 GB. Larger windows/batches increase KV-cache and peak RAM.

🧭 Precision choices for LFM2-8B-A1B (lineup planning)

While this card is 3-bit, teams often publish multiple precisions. Use this table as a planning guide (8B MoE LM; actuals depend on context/batch/prompts):

| Variant | Typical Peak RAM | Relative Speed | Typical Behavior | When to choose |
|---|---:|:---:|---|---|
| 3-bit (this repo) | ~4.4–8.8 GB | 🔥🔥🔥🔥 | Direct, concise, great latency | Default on 8–16 GB Macs |
| 6-bit | ~7.5–12.5 GB | 🔥🔥 | Best quality under quant | Choose if RAM allows |
| 8-bit | ~9.5–12+ GB | 🔥🔥 | Largest quantized size / highest fidelity | When you prefer simpler 8-bit workflows |

> MoE caveat: MoE lowers compute per token; unless experts are paged/partitioned, memory still scales with total parameters on a single device.

Deterministic generation:

```bash
python -m mlx_lm.generate \
  --model mlx-community/LFM2-8B-A1B-3bit-MLX \
  --prompt "Summarize the following in 5 concise bullet points:\n " \
  --max-tokens 256 \
  --temperature 0.0 \
  --device mps \
  --seed 0
```
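The weight-memory rule of thumb above is easy to sanity-check. A small arithmetic sketch (the helper name is just illustrative), using only the ranges quoted in this card; the overhead and KV-cache figures are those assumed estimates, not measurements:

```python
# Back-of-envelope peak-RAM estimate from the figures quoted above (decimal GB).
def estimate_peak_ram_gb(total_params, bits_per_weight,
                         overhead_gb=(0.6, 1.0), kv_cache_gb=(0.8, 2.5)):
    """Return (weights_gb, low_gb, high_gb) = quantized weights + overhead + KV cache."""
    weights_gb = total_params * (bits_per_weight / 8) / 1e9
    return (weights_gb,
            weights_gb + overhead_gb[0] + kv_cache_gb[0],
            weights_gb + overhead_gb[1] + kv_cache_gb[1])

weights, low, high = estimate_peak_ram_gb(8e9, 3)  # 3-bit build of an 8B model
print(f"weights ≈ {weights:.1f} GB, peak ≈ {low:.1f}–{high:.1f} GB")  # ~3.0 GB, ~4.4–6.5 GB
```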
MiniMax-M2-5bit
This model mlx-community/MiniMax-M2-5bit was converted to MLX format from MiniMaxAI/MiniMax-M2 using mlx-lm version 0.28.4.
gemma-3-text-4b-it-4bit
The Model mlx-community/gemma-3-text-4b-it-4bit was converted to MLX format from mlx-community/gemma-3-4b-it-bf16 using mlx-lm version 0.22.0.
Qwen3-Coder-30B-A3B-Instruct-4bit
granite-4.0-micro-8bit
This model mlx-community/granite-4.0-micro-8bit was converted to MLX format from ibm-granite/granite-4.0-micro using mlx-lm version 0.28.2.
Phi-3-mini-4k-instruct-4bit
Qwen3-VL-235B-A22B-Thinking-3bit
MiniMax-M2.1-6bit
Llama-3.2-11B-Vision-Instruct-abliterated
Kimi-Dev-72B-4bit-DWQ
This model mlx-community/Kimi-Dev-72B-4bit-DWQ was converted to MLX format from moonshotai/Kimi-Dev-72B using mlx-lm version 0.26.0.
Qwen3-VL-30B-A3B-Instruct-bf16
mlx-community/Qwen3-VL-30B-A3B-Instruct-bf16 This model was converted to MLX format from [`Qwen/Qwen3-VL-30B-A3B-Instruct`]() using mlx-vlm version 0.3.3. Refer to the original model card for more details on the model. Use with mlx
whisper-small-mlx-8bit
Llama-3.2-1B-Instruct-8bit
DeepSeek-R1-Distill-Qwen-7B-4bit
GLM-4.5-Air-4bit
This model mlx-community/GLM-4.5-Air-4bit was converted to MLX format from zai-org/GLM-4.5-Air using mlx-lm version 0.26.0.
DeepSeek-V3.2-mlx-5bit
Qwen2.5-VL-7B-Instruct-4bit
LFM2-350M-8bit
This model mlx-community/LFM2-350M-8bit was converted to MLX format from LiquidAI/LFM2-350M using mlx-lm version 0.26.0.
DeepSeek-R1-4bit
Meta-Llama-3.1-70B-Instruct-4bit
Kimi-K2-Instruct-4bit
phi-4-8bit
The Model mlx-community/phi-4-8bit was converted to MLX format from microsoft/phi-4 using mlx-lm version 0.20.6.
3b-de-ft-research_release-4bit
Devstral-Small-2-24B-Instruct-2512-4bit
whisper-large-mlx
dolphin3.0-llama3.2-1B-4Bit
The Model mlx-community/dolphin3.0-llama3.2-1B-4Bit was converted to MLX format from dphn/Dolphin3.0-Llama3.2-1B using mlx-lm version 0.26.3.
Qwen3-VL-8B-Instruct-4bit
mlx-community/Qwen3-VL-8B-Instruct-4bit This model was converted to MLX format from [`Qwen/Qwen3-VL-8B-Instruct`]() using mlx-vlm version 0.3.4. Refer to the original model card for more details on the model. Use with mlx
Llama-3.2-11B-Vision-Instruct-8bit
Qwen3-4B-8bit
gemma-3-4b-it-8bit
gemma-3n-E2B-it-4bit
DeepSeek-V3.1-4bit
Qwen3-VL-8B-Thinking-8bit
mlx-community/Qwen3-VL-8B-Thinking-8bit This model was converted to MLX format from [`Qwen/Qwen3-VL-8B-Thinking`]() using mlx-vlm version 0.3.4. Refer to the original model card for more details on the model. Use with mlx
Ring-mini-linear-2.0-4bit
This model mlx-community/Ring-mini-linear-2.0-4bit was converted to MLX format from inclusionAI/Ring-mini-linear-2.0 using mlx-lm version 0.28.1.
MiniMax-M2.1-8bit
Qwen3-VL-30B-A3B-Thinking-4bit
Qwen3-4B-Instruct-2507-4bit-DWQ-2510
This model mlx-community/Qwen3-4B-Instruct-2507-4bit-DWQ-2510 was converted to MLX format from Qwen/Qwen3-4B-Instruct-2507 using mlx-lm version 0.28.2.
Qwen3-Coder-30B-A3B-Instruct-4bit-dwq-v2
Qwen3-Coder-30B-A3B-Instruct-8bit
This model mlx-community/Qwen3-Coder-30B-A3B-Instruct-8bit was converted to MLX format from Qwen/Qwen3-Coder-30B-A3B-Instruct using mlx-lm version 0.26.1.
Qwen3-VL-30B-A3B-Thinking-3bit
Qwen3-VL-30B-A3B-Thinking-8bit
Qwen3-Coder-480B-A35B-Instruct-4bit
This model mlx-community/Qwen3-Coder-480B-A35B-Instruct-4bit was converted to MLX format from Qwen/Qwen3-Coder-480B-A35B-Instruct using mlx-lm version 0.26.0.
DeepSeek-V3.1-Terminus-4bit
This model mlx-community/DeepSeek-V3.1-Terminus-4bit was converted to MLX format from deepseek-ai/DeepSeek-V3.1-Terminus using mlx-lm version 0.27.1.
whisper-large-v3-turbo-q4
Qwen3-VL-30B-A3B-Thinking-bf16
mlx-community/Qwen3-VL-30B-A3B-Thinking-bf16 This model was converted to MLX format from [`Qwen/Qwen3-VL-30B-A3B-Thinking`]() using mlx-vlm version 0.3.4. Refer to the original model card for more details on the model. Use with mlx
Granite-4.0-H-Tiny-4bit-DWQ
This model mlx-community/granite-4.0-h-Tiny-4bit-DWQ was converted to MLX format from ibm-granite/granite-4.0-h-small using mlx-lm version 0.28.2.
Llama-3.2-3B-Instruct
Qwen3-Next-80B-A3B-Instruct-4bit
parakeet-tdt_ctc-0.6b-ja
This model was converted to MLX format from nvidia/parakeet-tdt_ctc-0.6b-ja using the conversion script. Please refer to the original model card for more details on the model.
Mistral-7B-Instruct-v0.2-4-bit
Qwen3-30B-A3B-4bit
Llama-3.2-3B-Instruct-uncensored-6bit
Kimi-K2-Instruct-0905-mlx-DQ3_K_M
This model mlx-community/Kimi-K2-Instruct-0905-mlx-DQ3_K_M was converted to MLX format from moonshotai/Kimi-K2-Instruct-0905 using mlx-lm version 0.26.3. This is created for people using a single Apple Mac Studio M3 Ultra with 512 GB. The 4-bit version of Kimi K2 does not fit. Using research results, we aim to get 4-bit performance from a slightly smaller and smarter quantization. It should also not be so large that it leaves no memory for a useful context window. You can find more similar MLX model quants for Apple Mac Studio with 512 GB at https://huggingface.co/bibproj

In the arXiv paper Quantitative Analysis of Performance Drop in DeepSeek Model Quantization the authors write:

> We further propose `DQ3_K_M`, a dynamic 3-bit quantization method that significantly outperforms traditional `Q3_K_M` variant on various benchmarks, which is also comparable with 4-bit quantization (`Q4_K_M`) approach in most tasks.

> dynamic 3-bit quantization method (`DQ3_K_M`) that outperforms the 3-bit quantization implementation in `llama.cpp` and achieves performance comparable to 4-bit quantization across multiple benchmarks.

The resulting multi-bitwidth quantization has been well tested and documented. In the `convert.py` file of mlx-lm on your system (you can see the original code here), replace the code inside `def mixed_quant_predicate()` with something like the sketch below. Should you wish to squeeze more out of your quant, and you do not need a larger context window, you can change the last part of that code accordingly.
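A minimal sketch of what such a mixed-precision predicate can look like. This is illustrative only, not the exact recipe used for this repo: the `(path, module, config)` signature and the `{"group_size", "bits"}` return convention are assumptions that must match the `convert.py` of your installed mlx-lm, and the layer-name patterns are placeholders to adapt to the target architecture:

```python
# Illustrative DQ3_K_M-style predicate: keep the bulk of the weights at 3-bit and
# spend extra bits on layers that quantization studies flag as sensitive.
# Assumed signature and return convention; adapt to your mlx-lm version.
def mixed_quant_predicate(path: str, module, config):
    if not hasattr(module, "to_quantized"):
        return False  # skip modules that cannot be quantized (norms, etc.)

    group_size = 64

    # Embeddings and the output head: small but sensitive, give them 6-bit.
    if "embed_tokens" in path or "lm_head" in path:
        return {"group_size": group_size, "bits": 6}

    # Attention projections and shared/dense expert weights: 4-bit.
    if any(k in path for k in ("q_proj", "k_proj", "v_proj", "o_proj", "shared_expert")):
        return {"group_size": group_size, "bits": 4}

    # Everything else (the bulk of the routed-expert FFN weights): 3-bit.
    return {"group_size": group_size, "bits": 3}
```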
Qwen2.5-Coder-7B-Instruct-bf16
Mixtral-8x22B-4bit
Qwen3-VL-4B-Instruct-8bit
Llama-3.3-70B-Instruct-8bit
nvidia_Llama-3.1-Nemotron-70B-Instruct-HF_4bit
Huihui-GLM-4.5V-abliterated-mxfp4
mlx-community/Huihui-GLM-4.5V-abliterated-mxfp4 This model was converted to MLX format from [`huihui-ai/Huihui-GLM-4.5V-abliterated`]() using `mlx-vlm` with MXFP4 support. Refer to the original model card for more details on the model. Use with mlx
gemma-3-1b-pt-4bit
embeddinggemma-300m-8bit
DeepSeek-R1-Distill-Llama-70B-8bit
chandra-8bit
DeepSeek-Coder-V2-Lite-Instruct-8bit
embeddinggemma-300m-bf16
The Model mlx-community/embeddinggemma-300m-bf16 was converted to MLX format from google/embeddinggemma-300m using mlx-lm version 0.0.4.
Qwen3-0.6B-bf16
Qwen2.5-3B-Instruct-8bit
Nanonets-OCR2-3B-4bit
mlx-community/Nanonets-OCR2-3B-4bit This model was converted to MLX format from [`nanonets/Nanonets-OCR2-3B`]() using mlx-vlm version 0.3.3. Refer to the original model card for more details on the model. Use with mlx
Meta-Llama-3-8B-Instruct
Qwen3-VL-8B-Instruct-bf16
mlx-community/Qwen3-VL-8B-Instruct-bf16 This model was converted to MLX format from [`Qwen/Qwen3-VL-8B-Instruct`]() using mlx-vlm version 0.3.4. Refer to the original model card for more details on the model. Use with mlx
GLM-4.5-Air-bf16
This model mlx-community/GLM-4.5-Air-bf16 was converted to MLX format from zai-org/GLM-4.5-Air using mlx-lm version 0.28.2.
Qwen3-VL-32B-Instruct-8bit
mlx-community/Qwen3-VL-32B-Instruct-8bit This model was converted to MLX format from [`Qwen/Qwen3-VL-32B-Instruct`]() using mlx-vlm version 0.3.4. Refer to the original model card for more details on the model. Use with mlx
Ling-1T-mlx-3bit
This model mlx-community/Ling-1T-mlx-3bit was converted to MLX format from inclusionAI/Ling-1T using mlx-lm version 0.28.1. You can find more similar MLX model quants for Apple Mac Studio with 512 GB at https://huggingface.co/bibproj
Llama-4-Scout-17B-16E-Instruct-4bit
mlx-community/Llama-4-Scout-17B-16E-Instruct-4bit This model was converted to MLX format from [`meta-llama/Llama-4-Scout-17B-16E-Instruct`]() using mlx-vlm version 0.1.21. Refer to the original model card for more details on the model. Use with mlx
deepseek-r1-distill-qwen-1.5b
Qwen2.5-VL-7B-Instruct-8bit
Apriel-1.5-15b-Thinker-4bit
mlx-community/Apriel-1.5-15b-Thinker-4bit This model was converted to MLX format from [`ServiceNow-AI/Apriel-1.5-15b-Thinker`]() using mlx-vlm version 0.3.3. Refer to the original model card for more details on the model. Use with mlx
SmolVLM-Instruct-4bit
dolphin-vision-72b-4bit
Codestral-22B-v0.1-4bit
gemma-3-270m-it-4bit
This model mlx-community/gemma-3-270m-it-4bit was converted to MLX format from google/gemma-3-270m-it using mlx-lm version 0.26.3.
Qwen3-Embedding-0.6B-8bit
CodeLlama-70b-Instruct-hf-4bit-MLX
Qwen3-Coder-30B-A3B-Instruct-4bit-DWQ
Nanonets-OCR2-3B-bf16
mlx-community/Nanonets-OCR2-3B-bf16 This model was converted to MLX format from [`nanonets/Nanonets-OCR2-3B`]() using mlx-vlm version 0.3.3. Refer to the original model card for more details on the model. Use with mlx
Qwen2.5-7B-Instruct-Uncensored-4bit
gemma-3-1b-it-8bit
MiniMax-M2.1-5bit
exaone-4.0-1.2b-4bit
LFM2-8B-A1B-8bit-MLX
Maintainer / Publisher: Susant Achary. Upstream model: LiquidAI/LFM2-8B-A1B. This repo (MLX 8-bit): `mlx-community/LFM2-8B-A1B-8bit-MLX`.

This repository provides an Apple-Silicon-optimized MLX build of LFM2-8B-A1B at 8-bit quantization for fast, on-device inference.

- Architecture: Mixture-of-Experts (MoE) Transformer.
- Size: ~8B total parameters with ~1B active per token (the "A1B" suffix commonly denotes ~1B active params).
- Why MoE? During generation, only a subset of experts is activated per token, reducing compute per token while keeping a larger total parameter pool for expressivity.

> Important memory note (single-device inference): although compute per token benefits from MoE (fewer active parameters), the full set of experts still resides in memory for typical single-GPU/CPU deployments. In practice this means RAM usage scales with total parameters, not with the smaller active count.

What's in this repo (MLX format):
- `config.json` (MLX), `mlx_model.safetensors` (8-bit shards)
- Tokenizer files: `tokenizer.json`, `tokenizer_config.json`
- Model metadata (e.g., `model_index.json`)

Target platform: macOS on Apple Silicon (M-series) using Metal/MPS.

Intended use:
- General instruction-following, chat, and summarization
- RAG back-ends and long-context workflows on device
- Function-calling / structured outputs with schema-style prompts

Limitations:
- Even at 8-bit, long contexts (KV-cache) can dominate memory at high `max_tokens` or large batch sizes.
- As with any quantization, small regressions vs FP16 can appear on intricate math/code or edge-case formatting.

Estimated RAM usage (assumed in the absence of measurements): the figures below are practical planning numbers derived from first principles plus experience with MLX and similar MoE models. Treat them as starting points and validate on your hardware.
- Weights: ≈ `total_params × 1 byte` (8-bit). For 8B params → ~8.0 GB baseline.
- Runtime overhead: MLX graph + tensors + metadata → ~0.5–1.0 GB typical.
- KV cache: grows with `context_length × layers × heads × dtype`; often 1–3+ GB for long contexts.

| Context window | Estimated peak RAM |
|---|---:|
| 4k tokens | ~9.5–10.5 GB |
| 8k tokens | ~10.5–11.8 GB |
| 16k tokens | ~12.0–14.0 GB |

> These ranges assume 8-bit weights, A1B MoE (all experts resident), batch size = 1, and standard generation settings.
> On lower windows (≤2k), you may see ~9–10 GB. Larger windows or batches will increase KV-cache and peak RAM.

While this card is 8-bit, teams often want a consistent lineup. If you later produce 6/5/4/3/2-bit MLX builds, here is a practical guide (RAM figures are indicative for an 8B MoE LM; your results depend on context/batch):

| Variant | Typical Peak RAM | Relative Speed | Typical Behavior | When to choose |
|---|---:|:---:|---|---|
| 4-bit | ~7–8 GB | 🔥🔥🔥 | Better detail retention | If 3-bit drops too much fidelity |
| 6-bit | ~9–10.5 GB | 🔥🔥 | Near-max MLX quality | If you want accuracy under quant |
| 8-bit (this repo) | ~9.5–12+ GB | 🔥🔥 | Highest quality among quant tiers | When RAM allows and you want the most faithful outputs |

> MoE caveat: MoE reduces compute per token, but unless experts are paged/partitioned across devices and loaded on demand, memory still follows total parameters. On a single Mac, plan RAM as if the whole 8B parameter set is resident.

Deterministic generation:

```bash
python -m mlx_lm.generate \
  --model mlx-community/LFM2-8B-A1B-8bit-MLX \
  --prompt "Summarize the following in 5 bullet points:\n " \
  --max-tokens 256 \
  --temperature 0.0 \
  --device mps \
  --seed 0
```
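The KV-cache term above is the one that moves most with context length. A quick sketch of the `context × layers × heads × dtype` rule; the layer/head/dim values and helper name here are placeholders, not the real LFM2-8B-A1B configuration, so read the actual values from the repo's `config.json` before trusting the output:

```python
# Rough KV-cache size per the context × layers × heads × dtype rule of thumb.
# n_layers / n_kv_heads / head_dim are PLACEHOLDER values, not LFM2-8B-A1B's real config.
def kv_cache_gb(context_len, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # K and V
    return context_len * per_token / 1e9  # decimal GB, batch size 1

for ctx in (4096, 8192, 16384):
    print(f"{ctx:>6} tokens -> ~{kv_cache_gb(ctx):.1f} GB KV cache")
```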
gemma-3-12b-it-qat-abliterated-lm-4bit
FastVLM-0.5B-bf16
DeepSeek-R1-Distill-Qwen-32B-MLX-8Bit
Qwen3-8B-6bit
gemma-3-27b-it-4bit
Nanonets-OCR2-3B-8bit
mlx-community/Nanonets-OCR2-3B-8bit This model was converted to MLX format from [`nanonets/Nanonets-OCR2-3B`]() using mlx-vlm version 0.3.3. Refer to the original model card for more details on the model. Use with mlx
GLM-4.5-Air-mxfp4
This model mlx-community/GLM-4.5-Air-mxfp4 was converted to MLX format from zai-org/GLM-4.5-Air using mlx-lm version 0.28.0.
SmolVLM2-256M-Video-Instruct-mlx
Qwen3-0.6B-4bit-DWQ-05092025
This model mlx-community/Qwen3-0.6B-4bit-DWQ-05092025 was converted to MLX format from Qwen/Qwen3-0.6B using mlx-lm version 0.24.0.
Dolphin-Mistral-24B-Venice-Edition-mlx-8Bit
mlx-community/Dolphin-Mistral-24B-Venice-Edition-mlx-8Bit MLX 8-bit quant of cognitivecomputations/Dolphin-Mistral-24B-Venice-Edition, released 2025-06-12. "This is an updated version based on feedback we received on v1"; see the discussion at the original repo. The Model mlx-community/Dolphin-Mistral-24B-Venice-Edition-mlx-8Bit was converted to MLX format from cognitivecomputations/Dolphin-Mistral-24B-Venice-Edition using mlx-lm version 0.22.3.
LFM2-700M-8bit
This model mlx-community/LFM2-700M-8bit was converted to MLX format from LiquidAI/LFM2-700M using mlx-lm version 0.26.0.
Kimi-VL-A3B-Thinking-4bit
DeepSeek-R1-Distill-Llama-8B-4bit
Phi-3.5-vision-instruct-4bit
deepseek-vl2-8bit
Qwen3-30B-A3B-4bit-DWQ
DeepSeek-V3-4bit
Qwen3-VL-4B-Instruct-3bit
mlx-community/Qwen3-VL-4B-Instruct-3bit This model was converted to MLX format from [`Qwen/Qwen3-VL-4B-Instruct`]() using mlx-vlm version 0.3.4. Refer to the original model card for more details on the model. Use with mlx
Meta-Llama-3.1-8B-Instruct-8bit
Kimi-Linear-48B-A3B-Instruct-4bit
This model mlx-community/Kimi-Linear-48B-A3B-Instruct-4bit was converted to MLX format from moonshotai/Kimi-Linear-48B-A3B-Instruct using mlx-lm version 0.28.4.
embeddinggemma-300m-4bit
DeepSeek-R1-Distill-Qwen-1.5B-3bit
whisper-tiny.en-mlx
Llama-3.2-8X4B-MOE-V2-Dark-Champion-Instruct-uncensored-abliterated-21B-Q_6-MLX
nomicai-modernbert-embed-base-4bit
The Model mlx-community/nomicai-modernbert-embed-base-4bit was converted to MLX format from nomic-ai/modernbert-embed-base using mlx-lm version 0.0.3.
GLM-4.5-4bit
This model mlx-community/GLM-4.5-4bit was converted to MLX format from zai-org/GLM-4.5 using mlx-lm version 0.26.0.
Qwen2.5-Coder-7B-Instruct-4bit
Llama-4-Maverick-17B-16E-Instruct-4bit
phi-2-hf-4bit-mlx
Qwen3-VL-8B-Thinking-4bit
mlx-community/Qwen3-VL-8B-Thinking-4bit This model was converted to MLX format from [`Qwen/Qwen3-VL-8B-Thinking`]() using mlx-vlm version 0.3.4. Refer to the original model card for more details on the model. Use with mlx
Qwen2.5-0.5B-Instruct-8bit
granite-4.0-h-micro-8bit
This model mlx-community/granite-4.0-h-micro-8bit was converted to MLX format from ibm-granite/granite-4.0-h-micro using mlx-lm version 0.28.2.
Ling-1T-mlx-DQ3_K_M
This model mlx-community/Ling-1T-mlx-DQ3_K_M was converted to MLX format from inclusionAI/Ling-1T using mlx-lm version 0.28.1. This is created for people using a single Apple Mac Studio M3 Ultra with 512 GB. The 4-bit version of Ling 1T does not fit. Using research results, we aim to get 4-bit performance from a slightly smaller and smarter quantization. It should also not be so large that it leaves no memory for a useful context window.

In the arXiv paper Quantitative Analysis of Performance Drop in DeepSeek Model Quantization the authors write:

> We further propose `DQ3_K_M`, a dynamic 3-bit quantization method that significantly outperforms traditional `Q3_K_M` variant on various benchmarks, which is also comparable with 4-bit quantization (`Q4_K_M`) approach in most tasks.

> dynamic 3-bit quantization method (`DQ3_K_M`) that outperforms the 3-bit quantization implementation in `llama.cpp` and achieves performance comparable to 4-bit quantization across multiple benchmarks.

The resulting multi-bitwidth quantization has been well tested and documented. In the `convert.py` file of mlx-lm on your system (you can see the original code here), replace the code inside `def mixed_quant_predicate()` with something like the sketch shown under the Kimi-K2-Instruct-0905-mlx-DQ3_K_M entry above.
olmOCR-2-7B-1025-bf16
mlx-community/olmOCR-2-7B-1025-bf16 This model was converted to MLX format from [`allenai/olmOCR-2-7B-1025`]() using mlx-vlm version 0.3.4. Refer to the original model card for more details on the model. Use with mlx
DeepSeek-R1-Distill-Qwen-14B-4bit
GLM-4-9B-0414-4bit
embeddinggemma-300m-qat-q4_0-unquantized-bf16
mlx-community/embeddinggemma-300m-qat-q4_0-unquantized-bf16 The Model mlx-community/embeddinggemma-300m-qat-q4_0-unquantized-bf16 was converted to MLX format from google/embeddinggemma-300m-qat-q4_0-unquantized using mlx-lm version 0.0.4.
GLM-Z1-9B-0414-4bit
This model mlx-community/GLM-Z1-9B-0414-4bit was converted to MLX format from THUDM/GLM-Z1-9B-0414 using mlx-lm version 0.22.4.
gemma-3-12b-it-4bit
gemma-3-12b-it-bf16
DeepSeek-R1-Distill-Qwen-32B-abliterated-4bit
Qwen3-VL-32B-Instruct-4bit
mlx-community/Qwen3-VL-32B-Instruct-4bit This model was converted to MLX format from [`Qwen/Qwen3-VL-32B-Instruct`]() using mlx-vlm version 0.3.4. Refer to the original model card for more details on the model. Use with mlx
whisper-turbo
GLM-4-32B-0414-8bit
This model mlx-community/GLM-4-32B-0414-8bit was converted to MLX format from THUDM/GLM-4-32B-0414 using mlx-lm version 0.23.1.
Apertus-8B-Instruct-2509-bf16
This model mlx-community/Apertus-8B-Instruct-2509-bf16 was converted to MLX format from swiss-ai/Apertus-8B-Instruct-2509 using mlx-lm version 0.27.0.
Qwen3-VL-8B-Thinking-bf16
mlx-community/Qwen3-VL-8B-Thinking-bf16 This model was converted to MLX format from [`Qwen/Qwen3-VL-8B-Thinking`]() using mlx-vlm version 0.3.4. Refer to the original model card for more details on the model. Use with mlx
olmOCR-2-7B-1025-4bit
Qwen3-VL-4B-Thinking-bf16
mlx-community/Qwen3-VL-4B-Thinking-bf16 This model was converted to MLX format from [`Qwen/Qwen3-VL-4B-Thinking`]() using mlx-vlm version 0.3.4. Refer to the original model card for more details on the model. Use with mlx
Meta-Llama-3.1-8B-Instruct-bf16
granite-4.0-h-tiny-3bit-MLX
Granite-4.0-H-Tiny: MLX 3-bit (Apple Silicon)

Maintainer / Publisher: Susant Achary

This repository provides an Apple-Silicon-optimized MLX build of IBM Granite-4.0-H-Tiny with 3-bit weight quantization (plus usage guidance for 2/4/5/6-bit variants if RAM allows). Granite 4.0 is IBM's latest hybrid Mamba-2/Transformer family with selective Mixture-of-Experts (MoE), designed for long-context, hyper-efficient inference and enterprise use.

🔎 What's Granite 4.0?
- Architecture. Hybrid Mamba-2 + softmax attention; H variants add MoE routing (sparse activation). Aims to keep expressivity while dramatically reducing memory footprint.
- Efficiency claims. Up to ~70% lower memory and ~2× faster inference vs. comparable models, especially for multi-session and long-context scenarios.
- Context window. 128k tokens (Tiny/Base preview cards).
- Licensing. Apache-2.0 for public/commercial use.

> This MLX build targets Granite-4.0-H-Tiny (≈7B total, ≈1B active parameters). For reference, the family also includes H-Small (≈32B total / 9B active) and Micro/Micro-H (≈3B dense/hybrid) tiers.

📦 What's in this repo (MLX format)
- `config.json` (MLX), `mlx_model.safetensors` (3-bit shards), tokenizer files, and processor metadata.
- Ready for macOS on M-series chips via Metal/MPS.

> The upstream Hugging Face model cards for Granite 4.0 (Tiny/Small) provide additional training details, staged curricula and the alignment workflow. Start here for Tiny: ibm-granite/granite-4.0-h-tiny.

✅ Intended use
- General instruction-following and chat with long context (128k).
- Enterprise assistant patterns (function calling, structured outputs) and RAG backends that benefit from efficient, large windows.
- On-device development on Macs (MLX), low-latency local prototyping and evaluation.

⚠️ Limitations
- As a quantized, decoder-only LM, it can produce confident but wrong outputs; review for critical use.
- 2–4-bit quantization may reduce precision on intricate tasks (math/code, tiny-text parsing); prefer higher bit-widths if RAM allows.
- Follow your organization's safety/PII/guardrail policies (Granite is "open-weight", not a full product).

🧠 Model family at a glance

| Tier | Arch | Params (total / active) | Notes |
|---|---|---:|---|
| H-Small | Hybrid + MoE | ~32B / 9B | Workhorse for enterprise agent tasks; strong function-calling & instruction following. |
| H-Tiny (this repo) | Hybrid + MoE | ~7B / 1B | Long-context, efficiency-first; great for local dev. |
| Micro / H-Micro | Dense / Hybrid | ~3B | Edge/low-resource alternatives; when the hybrid runtime isn't optimized. |

Context window: up to 128k tokens for the Tiny/Base preview lines. License: Apache-2.0.

🧪 Observed on-device behavior (MLX)

Empirically on M-series Macs:
- 3-bit often gives crisp, direct answers with good latency and modest RAM.
- Higher bit-widths (4/5/6-bit) improve faithfulness on fine-grained tasks (tiny OCR, structured parsing), at higher memory cost.

> Performance varies by Mac model, image/token lengths, and temperature; validate on your workload.

🔢 Choosing a quantization level (Apple Silicon)

| Variant | Typical Peak RAM (7B-class) | Relative speed | Typical behavior | When to choose |
|---|---:|:---:|---|---|
| 2-bit | ~3–4 GB | 🔥🔥🔥🔥 | Smallest footprint; most lossy | Minimal-RAM devices / smoke tests |
| 3-bit (this build) | ~5–6 GB | 🔥🔥🔥🔥 | Direct, concise, great latency | Default for local dev on M1/M2/M3/M4 |
| 4-bit | ~6–7.5 GB | 🔥🔥🔥 | Better detail retention | When you need stronger faithfulness |
| 5-bit | ~8–9 GB | 🔥🔥☆ | Higher fidelity | For heavy docs / structured outputs |
| 6-bit | ~9.5–11 GB | 🔥🔥 | Max quality under MLX quant | If RAM headroom is ample |

> Figures are indicative for the language-only Tiny (no vision), and will vary with context length and KV-cache size.

🚀 Quickstart (CLI, MLX)

```bash
# Plain generation (deterministic)
python -m mlx_lm.generate \
  --model mlx-community/granite-4.0-h-tiny-3bit-MLX \
  --prompt "Summarize the following notes into 5 bullet points:\n " \
  --max-tokens 200 \
  --temperature 0.0 \
  --device mps \
  --seed 0
```
GLM-4-32B-0414-4bit
CodeLlama-13b-Instruct-hf-4bit-MLX
Nanonets-OCR-s-bf16
mlx-community/Nanonets-OCR-s-bf16 This model was converted to MLX format from [`nanonets/Nanonets-OCR-s`]() using mlx-vlm version 0.1.27. Refer to the original model card for more details on the model. Use with mlx
Qwen3-VL-32B-Thinking-4bit
mlx-community/Qwen3-VL-32B-Thinking-4bit This model was converted to MLX format from [`Qwen/Qwen3-VL-32B-Thinking`]() using mlx-vlm version 0.3.4. Refer to the original model card for more details on the model. Use with mlx
Qwen3-VL-2B-Instruct-3bit
mlx-community/Qwen3-VL-2B-Instruct-3bit This model was converted to MLX format from [`Qwen/Qwen3-VL-2B-Instruct`]() using mlx-vlm version 0.3.4. Refer to the original model card for more details on the model. Use with mlx
distil-whisper-large-v3
DeepSeek-R1-0528-4bit
This model mlx-community/DeepSeek-R1-0528-4bit was converted to MLX format from deepseek-ai/DeepSeek-R1-0528 using mlx-lm version 0.24.1.
GLM-4.5-Air-2bit
This model mlx-community/GLM-4.5-Air-2bit was converted to MLX format from zai-org/GLM-4.5-Air using mlx-lm version 0.26.1.
InternVL3_5-GPT-OSS-20B-A4B-Preview-4bit
mlx-community/InternVL3_5-GPT-OSS-20B-A4B-Preview-4bit This model was converted to MLX format from [`OpenGVLab/InternVL3_5-GPT-OSS-20B-A4B-Preview-HF`]() using mlx-vlm version 0.3.3. Refer to the original model card for more details on the model. Use with mlx
Chatterbox-TTS-4bit
plamo-2-translate
This is a 4-bit quantized version of the PLaMo 2 Translation Model with DWQ (Distilled Weight Quantization) for inference with MLX on Apple Silicon devices.

PLaMo Translation Model is a specialized large-scale language model developed by Preferred Networks for translation tasks. For details, please refer to the blog post and press release.

List of models:
- plamo-2-translate: post-trained model for translation
- plamo-2-translate-base: base model for translation
- plamo-2-translate-eval: pair-wise evaluation model

PLaMo Translation Model is released under the PLaMo community license. Please check the following license and agree to it before downloading.
- (EN) under construction: we apologize for the inconvenience
- (JA) https://www.preferred.jp/ja/plamo-community-license/

NOTE: This model has NOT been instruction-tuned for chat dialog or other downstream tasks. Please check the PLaMo community license and contact us via the following form for commercial use.

PLaMo Translation Model is a new technology that carries risks with use. Testing conducted to date has been in English and Japanese, and has not covered, nor could it cover, all scenarios. For these reasons, as with all LLMs, PLaMo Translation Model's potential outputs cannot be predicted in advance, and the model may in some instances produce inaccurate, biased or otherwise objectionable responses to user prompts. Therefore, before deploying any applications of PLaMo Translation Model, developers should perform safety testing and tuning tailored to their specific applications of the model.

This model is trained under the project "Research and Development Project of the Enhanced Infrastructures for Post 5G Information and Communication System" (JPNP 20017), subsidized by the New Energy and Industrial Technology Development Organization (NEDO).

AI policy:
- (EN) https://www.preferred.jp/en/company/aipolicy/
- (JA) https://www.preferred.jp/ja/company/aipolicy/
Llama-3.2-11B-Vision-Instruct-4bit
CodeLlama-7b-Python-4bit-MLX
gemma-3-12b-it-8bit
Qwen2.5-1.5B-Instruct-8bit
Qwen3-VL-4B-Thinking-4bit
mlx-community/Qwen3-VL-4B-Thinking-4bit This model was converted to MLX format from [`Qwen/Qwen3-VL-4B-Thinking`]() using mlx-vlm version 0.3.4. Refer to the original model card for more details on the model. Use with mlx
Qwen2.5-14B-Instruct-4bit
Mixtral-8x7B-Instruct-v0.1
parakeet-tdt-1.1b
Qwen3-Next-80B-A3B-Instruct-8bit
Llama-4-Scout-17B-16E-Instruct-8bit
Qwen3-VL-8B-Thinking-6bit
mlx-community/Qwen3-VL-8B-Thinking-6bit This model was converted to MLX format from [`Qwen/Qwen3-VL-8B-Thinking`]() using mlx-vlm version 0.3.4. Refer to the original model card for more details on the model. Use with mlx
Qwen2-VL-7B-Instruct-4bit
gemma-3-27b-it-qat-8bit
DeepSeek-V3.1-8bit
GLM-4.5V-8bit
Hermes-3-Llama-3.1-8B-4bit
Qwen3-VL-32B-Thinking-bf16
parakeet-ctc-0.6b
Llama-4-Scout-17B-16E-Instruct-6bit
deepcogito-cogito-v1-preview-llama-8B-4bit
Qwen3-VL-30B-A3B-Instruct-6bit
mlx-community/Qwen3-VL-30B-A3B-Instruct-6bit This model was converted to MLX format from [`Qwen/Qwen3-VL-30B-A3B-Instruct`]() using mlx-vlm version 0.3.3. Refer to the original model card for more details on the model. Use with mlx
mxbai-embed-large-v1
Llama-3-8B-Instruct-1048k-4bit
OpenELM-270M-Instruct
GLM-4.5-Air-8bit
This model mlx-community/GLM-4.5-Air-8bit was converted to MLX format from zai-org/GLM-4.5-Air using mlx-lm version 0.26.0.
Qwen3-VL-4B-Instruct-5bit
mlx-community/Qwen3-VL-4B-Instruct-5bit This model was converted to MLX format from [`Qwen/Qwen3-VL-4B-Instruct`]() using mlx-vlm version 0.3.4. Refer to the original model card for more details on the model. Use with mlx
Mistral-7B-Instruct-v0.2
DeepSeek-R1-Distill-Llama-70B-4bit
Qwen3-VL-32B-Thinking-8bit
GLM-4.5-Air-3bit-DWQ-v2
Qwen3-VL-8B-Instruct-8bit
mlx-community/Qwen3-VL-8B-Instruct-8bit This model was converted to MLX format from [`Qwen/Qwen3-VL-8B-Instruct`]() using mlx-vlm version 0.3.4. Refer to the original model card for more details on the model. Use with mlx
Nous-Hermes-2-Mixtral-8x7B-DPO-4bit
Phi-3-mini-128k-instruct-4bit
Qwen2.5-VL-72B-Instruct-4bit
Meta-Llama-3.1-405B-4bit
Qwen3-Next-80B-A3B-Thinking-4bit
This model mlx-community/Qwen3-Next-80B-A3B-Thinking-4bit was converted to MLX format from Qwen/Qwen3-Next-80B-A3B-Thinking using mlx-lm version 0.27.1.
Jinx-gpt-oss-20b-mxfp4-mlx
This model mlx-community/Jinx-gpt-oss-20b-mxfp4-mlx was converted to MLX format from Jinx-org/Jinx-gpt-oss-20b-mxfp4 using mlx-lm version 0.27.1.
Ministral-3-8B-Instruct-2512-4bit
Llama-4-Scout-17B-16E-4bit
Qwen3-14B-4bit
NVIDIA-Nemotron-Nano-9B-v2-4bits
Kimi-K2-Instruct-0905-mlx-3bit
mlx-community/moonshotai_Kimi-K2-Instruct-0905-mlx-3bit This model mlx-community/moonshotai_Kimi-K2-Instruct-0905-mlx-3bit was converted to MLX format from moonshotai/Kimi-K2-Instruct-0905 using mlx-lm version 0.26.3.
Llama-3_3-Nemotron-Super-49B-v1_5-mlx-4Bit
mlx-community/Llama-3_3-Nemotron-Super-49B-v1_5-mlx-4Bit The Model mlx-community/Llama-3_3-Nemotron-Super-49B-v1_5-mlx-4Bit was converted to MLX format from unsloth/Llama-3_3-Nemotron-Super-49B-v1_5 using mlx-lm version 0.26.4.
gemma-2-27b-it-4bit
Qwen3-VL-30B-A3B-Instruct-3bit
mlx-community/Qwen3-VL-30B-A3B-Instruct-3bit This model was converted to MLX format from [`Qwen/Qwen3-VL-30B-A3B-Instruct`]() using mlx-vlm version 0.3.3. Refer to the original model card for more details on the model. Use with mlx
DeepSeek-Coder-V2-Lite-Instruct-4bit-AWQ
chandra-bf16
Qwen3-1.7B-MLX-MXFP4
This model mlx-community/Qwen3-1.7B-MLX-MXFP4 was converted to MLX format from Qwen/Qwen3-1.7B using mlx-lm version 0.28.3.
Kokoro-82M-4bit
Phi-3-mini-4k-instruct-4bit-no-q-embed
gemma-3-27b-it-8bit
Qwen3-VL-30B-A3B-Thinking-6bit
mlx-community/Qwen3-VL-30B-A3B-Thinking-6bit This model was converted to MLX format from [`Qwen/Qwen3-VL-30B-A3B-Thinking`]() using mlx-vlm version 0.3.4. Refer to the original model card for more details on the model. Use with mlx
NousResearch_Hermes-4-14B-BF16-abliterated-mlx
gemma-3-4b-it-5bit
This model mlx-community/gemma-3-4b-it-5bit was converted to MLX format from google/gemma-3-4b-it using mlx-lm version 0.28.2.
chandra-4bit
mlx-community/chandra-4bit This model was converted to MLX format from [`datalab-to/chandra`]() using mlx-vlm version 0.3.4. Refer to the original model card for more details on the model. Use with mlx
olmOCR-2-7B-1025-8bit
mlx-community/olmOCR-2-7B-1025-8bit This model was converted to MLX format from [`allenai/olmOCR-2-7B-1025`]() using mlx-vlm version 0.3.4. Refer to the original model card for more details on the model. Use with mlx
Llama-3.1-Nemotron-70B-Instruct-HF-bf16
Qwen3-4B-6bit
Mistral-7B-Instruct-v0.2-4bit
Llama-3.2-90B-Vision-Instruct-4bit
GLM-4.5V-abliterated-4bit
mlx-community/GLM-4.5V-abliterated-4bit This model was converted to MLX format from [`huihui-ai/Huihui-GLM-4.5V-abliterated`]() using mlx-vlm. Refer to the original model card for more details on the model. Use with mlx
quantized-gemma-2b-it
Fara-7B-4bit
Meta-Llama-3-70B-Instruct-4bit
olmOCR-2-7B-1025-mlx-8bit
mlx-community/olmOCR-2-7B-1025-mlx-8bit This model was converted to MLX format from [`allenai/olmOCR-2-7B-1025`]() using mlx-vlm version 0.3.5. Refer to the original model card for more details on the model. Use with mlx
TinyLlama-1.1B-Chat-v1.0-4bit
Unsloth-Phi-4-4bit
Qwen2.5-Coder-14B-Instruct-4bit
GLM-4.5V-abliterated-8bit
mlx-community/GLM-4.5V-abliterated-8bit This model was converted to MLX format from [`huihui-ai/Huihui-GLM-4.5V-abliterated`]() using mlx-vlm. Refer to the original model card for more details on the model. Use with mlx
jinaai-ReaderLM-v2
Apertus-8B-Instruct-2509-4bit
Meta-Llama-3.1-70B-Instruct-bf16-CORRECTED
Qwen3-VL-4B-Thinking-8bit
mlx-community/Qwen3-VL-4B-Thinking-8bit This model was converted to MLX format from [`Qwen/Qwen3-VL-4B-Thinking`]() using mlx-vlm version 0.3.4. Refer to the original model card for more details on the model. Use with mlx
paligemma-3b-mix-448-8bit
whisper-tiny-mlx
phi-4-4bit
The Model mlx-community/phi-4-4bit was converted to MLX format from microsoft/phi-4 using mlx-lm version 0.21.0.
llava-phi-3-mini-4bit
GLM-4.5-Air-3bit-DWQ
This model mlx-community/GLM-4.5-Air-3bit-DWQ was converted to MLX format from zai-org/GLM-4.5-Air using mlx-lm version 0.26.1.
Qwen2.5-Coder-1.5B-Instruct-4bit
granite-4.0-h-1b-6bit
This model mlx-community/granite-4.0-h-1b-6bit was converted to MLX format from ibm-granite/granite-4.0-h-1b using mlx-lm version 0.28.4.
Qwen2.5-32B-Instruct-4bit
Mistral-Large-Instruct-2407-4bit
Apriel-1.5-15b-Thinker-8bit
Qwen3-14B-4bit-AWQ
DeepSeek-R1-Qwen3-0528-8B-4bit-AWQ
granite-4.0-h-1b-8bit
This model mlx-community/granite-4.0-h-1b-8bit was converted to MLX format from ibm-granite/granite-4.0-h-1b using mlx-lm version 0.28.4.
Qwen3-4B-Thinking-2507-fp16
granite-4.0-h-350m-8bit
Qwen2.5-Coder-32B-Instruct-4bit
Huihui-gemma-3n-E4B-it-abliterated-lm-8bit
Phi-3-vision-128k-instruct-4bit
Nous-Hermes-2-Mistral-7B-DPO-4bit-MLX
Josiefied-Qwen3-30B-A3B-abliterated-v2-4bit
AI21-Jamba-Reasoning-3B-4bit
This model mlx-community/AI21-Jamba-Reasoning-3B-4bit was converted to MLX format from ai21labs/AI21-Jamba-Reasoning-3B using mlx-lm version 0.28.2.
DeepSeek-Coder-V2-Instruct-AQ4_1
Josiefied-Qwen3-4B-Instruct-2507-abliterated-v1-8bit
Ministral-8B-Instruct-2410-4bit
Josiefied-Qwen3-8B-abliterated-v1-4bit
UTENA-7B-NSFW-V2-4bit
olmOCR-2-7B-1025-mlx-4bit
mlx-community/olmOCR-2-7B-1025-mlx-4bit This model was converted to MLX format from [`allenai/olmOCR-2-7B-1025`]() using mlx-vlm version 0.3.5. Refer to the original model card for more details on the model. Use with mlx
parakeet-tdt_ctc-1.1b
DeepSeek-Coder-V2-Lite-Instruct-4bit
SmolVLM2-2.2B-Instruct-mlx
Mistral-7B-v0.1-LoRA-Text2SQL
gemma-3n-E2B-it-lm-bf16
csm-1b
Llama-4-Maverick-17B-16E-Instruct-6bit
mlx-community/Llama-4-Maverick-17B-16E-Instruct-6bit This model mlx-community/Llama-4-Maverick-17B-16E-Instruct-6bit was converted to MLX format from meta-llama/Llama-4-Maverick-17B-128E-Instruct using mlx-lm version 0.22.3.
SmolLM-135M-4bit
DeepSeek-V3.1-mlx-DQ5_K_M
This model mlx-community/DeepSeek-V3.1-mlx-DQ5_K_M was converted to MLX format from deepseek-ai/DeepSeek-V3.1 using mlx-lm version 0.26.3. This is created for people using a single Apple Mac Studio M3 Ultra with 512 GB. With 512 GB, we can do better than the 4-bit version of DeepSeek V3.1. Using research results, we aim to get better than 5-bit performance using smarter quantization. We also aim to keep the quant small enough to leave memory for a useful context window. The temperature of 1.3 is DeepSeek's recommendation for translations. For coding, you should probably use a temperature of 0.6 or lower.

In the arXiv paper Quantitative Analysis of Performance Drop in DeepSeek Model Quantization the authors write:

> We further propose `DQ3_K_M`, a dynamic 3-bit quantization method that significantly outperforms traditional `Q3_K_M` variant on various benchmarks, which is also comparable with 4-bit quantization (`Q4_K_M`) approach in most tasks.

> dynamic 3-bit quantization method (`DQ3_K_M`) that outperforms the 3-bit quantization implementation in `llama.cpp` and achieves performance comparable to 4-bit quantization across multiple benchmarks.

The resulting multi-bitwidth quantization has been well tested and documented. In this case we did not want an improved 3-bit quant, but rather the best possible "5-bit" quant. We therefore modified the `DQ3_K_M` quantization by replacing 3-bit with 5-bit, 4-bit with 6-bit, and 6-bit with 8-bit to create a new `DQ5_K_M` quant. This produces a quantization of 5.638 bpw (bits per weight). In the `convert.py` file of mlx-lm on your system (you can see the original code here), replace the code inside `def mixed_quant_predicate()` with something like the sketch below. Should you wish to squeeze more out of your quant, and you do not need a larger context window, you can change the last part of that code accordingly.
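Following the substitution described above, the DQ5_K_M variant only changes the bit widths returned by the predicate. A short sketch, with the same caveats as the DQ3_K_M sketch earlier in this listing (assumed signature, assumed return convention, placeholder layer-name patterns):

```python
# Illustrative DQ5_K_M-style predicate: the DQ3_K_M layout with 3→5, 4→6 and 6→8 bit widths.
# Assumed (path, module, config) signature; adapt to your mlx-lm convert.py.
def mixed_quant_predicate(path: str, module, config):
    if not hasattr(module, "to_quantized"):
        return False
    group_size = 64
    if "embed_tokens" in path or "lm_head" in path:
        return {"group_size": group_size, "bits": 8}   # was 6-bit in DQ3_K_M
    if any(k in path for k in ("q_proj", "k_proj", "v_proj", "o_proj", "shared_expert")):
        return {"group_size": group_size, "bits": 6}   # was 4-bit
    return {"group_size": group_size, "bits": 5}       # was 3-bit
```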
Ring-flash-linear-2.0-128k-4bit
This model mlx-community/Ring-flash-linear-2.0-128k-4bit was converted to MLX format from inclusionAI/Ring-flash-linear-2.0-128k using mlx-lm version 0.28.2.
Qwen3-Coder-30B-A3B-Instruct-3bit
This model mlx-community/Qwen3-Coder-30B-A3B-Instruct-3bit was converted to MLX format from Qwen/Qwen3-Coder-30B-A3B-Instruct using mlx-lm version 0.26.1.
whisper-large-v3-mlx-8bit
Qwen3-30B-A3B-bf16
Qwen3-30B-A3B-Instruct-2507-6bit
This model mlx-community/Qwen3-30B-A3B-Instruct-2507-6bit was converted to MLX format from Qwen/Qwen3-30B-A3B-Instruct-2507 using mlx-lm version 0.26.1.
meta-llama-Llama-4-Scout-17B-16E-4bit
Qwen3-235B-A22B-Thinking-2507-3bit-DWQ
mlx-community/Qwen3-235B-A22B-Thinking-2507-3bit-DWQ This model mlx-community/Qwen3-235B-A22B-Thinking-2507-3bit-DWQ was converted to MLX format from Qwen/Qwen3-235B-A22B-Thinking-2507 using mlx-lm version 0.26.0.
DeepSeek-R1-Distill-Qwen-14B-8bit
gemma-3-27b-it-qat-bf16
GLM-4.5-Air-2bit-DWQ
This model mlx-community/GLM-4.5-Air-2bit-DWQ was converted to MLX format from zai-org/GLM-4.5-Air using mlx-lm version 0.26.2.
GLM-4-9B-0414-8bit
DeepSeek-V3.1-Base-4bit
deepseek-coder-33b-instruct-hf-4bit-mlx
Qwen3-VL-30B-A3B-Instruct-5bit
mlx-community/Qwen3-VL-30B-A3B-Instruct-5bit This model was converted to MLX format from [`Qwen/Qwen3-VL-30B-A3B-Instruct`]() using mlx-vlm version 0.3.3. Refer to the original model card for more details on the model. Use with mlx
Qwen3-Next-80B-A3B-Thinking-8bit
moonshotai_Kimi-K2-Instruct-mlx-3bit
This model mlx-community/moonshotai_Kimi-K2-Instruct-mlx-3bit was converted to MLX format from moonshotai/Kimi-K2-Instruct using mlx-lm version 0.26.3.
UserLM-8b-8bit
Qwen2.5-7B-Instruct-1M-4bit
Llama-3.1-8B-Instruct
Llama-4-Maverick-17B-128E-Instruct-4bit
Apriel-1.5-15b-Thinker-6bit-MLX
Apriel-1.5-15B-Thinker: MLX Quantized (Apple Silicon)

Format: MLX (Apple Silicon). Variants: 6-bit (recommended). Base model: ServiceNow-AI/Apriel-1.5-15B-Thinker. Architecture: Pixtral-style LLaVA (vision encoder → 2-layer projector → decoder). Intended use: image understanding & grounded reasoning; document/chart/OCR-style tasks; math/coding Q&A with visual context.

> This repository provides MLX-format weights for Apple Silicon (M-series) built from the original Apriel-1.5-15B-Thinker release. It is optimized for on-device inference with a small memory footprint and fast startup on macOS.

Apriel-1.5-15B-Thinker is a 15B open-weights multimodal reasoning model trained via a data-centric mid-training recipe rather than RLHF/RM. Starting from Pixtral-12B as the base, the authors apply:
1) Depth upscaling (capacity expansion without pretraining from scratch),
2) Two-stage multimodal continual pretraining (CPT) to build text + visual reasoning, and
3) High-quality SFT with explicit reasoning traces across math, coding, science, and tool use.

This approach delivers frontier-level capability on compact compute.

Key reported results (original model):
- AAI Index: 52, matching DeepSeek-R1-0528 at far lower compute.
- Multimodal: on 10 image benchmarks, within ~5 points of Gemini-2.5-Flash and Claude Sonnet-3.7 on average.
- Designed for single-GPU / constrained deployment scenarios.

> Notes above summarize the upstream paper; MLX quantization can slightly affect absolute scores. Always validate on your use case.

- Backbone: Pixtral-12B-Base-2409 adapted to a larger 15B decoder via depth upscaling (layers 40 → 48), then re-aligned with a 2-layer projection network connecting the vision encoder and decoder.
- Training stack:
  - CPT Stage-1: mixed tokens (≈50% text, 20% replay, 30% multimodal) for foundational reasoning & image understanding; 32k context; cosine LR with warmup; all components unfrozen; checkpoint averaging.
  - CPT Stage-2: targeted synthetic visual tasks (reconstruction, visual matching, detection, counting) to strengthen spatial/compositional/fine-grained reasoning; vision encoder frozen; loss on responses for instruct data; 16k context.
  - SFT: curated instruction-response pairs with explicit reasoning traces (math, coding, science, tools).

- Why MLX? Native Apple-Silicon inference with small binaries, fast load, and low memory overhead.
- What's included: `config.json`, `mlx_model.safetensors` (sharded), tokenizer & processor files, and metadata for VLM pipelines.
- Quantization options:
  - 6-bit (recommended): best balance of quality & memory.

> Tip: if you're capacity-constrained on an M1/M2, try 6-bit first;

```bash
# Basic image caption
python -m mlx_vlm.generate \
  --model mlx-community/Apriel-1.5-15b-Thinker-6bit-MLX \
  --image /path/to/image.jpg \
  --prompt "Describe this image." \
  --max-tokens 128 --temperature 0.0 --device mps
```
DeepSeek-R1-0528-Qwen3-8B-4bit-DWQ
This model mlx-community/DeepSeek-R1-0528-Qwen3-8B-4bit-DWQ was converted to MLX format from deepseek-ai/DeepSeek-R1-0528-Qwen3-8B using mlx-lm version 0.24.1.
all-MiniLM-L6-v2-4bit
InternVL3_5-30B-A3B-4bit
mlx-community/InternVL3_5-30B-A3B-4bit This model was converted to MLX format from [`OpenGVLab/InternVL3_5-30B-A3B-HF`]() using mlx-vlm version 0.3.3. Refer to the original model card for more details on the model. Use with mlx
mistral-7B-v0.1
LFM2-8B-A1B-8bit
Qwen3-VL-30B-A3B-Thinking-5bit
DeepSeek-R1-Distill-Qwen-14B-6bit
Codestral-22B-v0.1-8bit
GLM-Z1-32B-0414-4bit
Qwen3-Coder-30B-A3B-Instruct-8bit-DWQ-lr9e8
bge-small-en-v1.5-4bit
DeepSeek-R1-3bit
chatterbox-4bit
Nanonets-OCR2-3B-6bit
mlx-community/Nanonets-OCR2-3B-6bit This model was converted to MLX format from [`nanonets/Nanonets-OCR2-3B`]() using mlx-vlm version 0.3.3. Refer to the original model card for more details on the model. Use with mlx
DeepSeek-v3-0324-8bit
This model mlx-community/DeepSeek-v3-0324-8bit was converted to MLX format from deepseek-ai/DeepSeek-v3-0324 using mlx-lm version 0.22.2.
Ring-1T-mlx-DQ3_K_M
This model mlx-community/Ring-1T-mlx-DQ3_K_M was converted to MLX format from inclusionAI/Ring-1T using mlx-lm version 0.28.1. This is created for people using a single Apple Mac Studio M3 Ultra with 512 GB. The 4-bit version of Ring 1T does not fit. Using research results, we aim to get 4-bit performance from a slightly smaller and smarter quantization. It should also not be so large that it leaves no memory for a useful context window.

In the arXiv paper Quantitative Analysis of Performance Drop in DeepSeek Model Quantization the authors write:

> We further propose `DQ3_K_M`, a dynamic 3-bit quantization method that significantly outperforms traditional `Q3_K_M` variant on various benchmarks, which is also comparable with 4-bit quantization (`Q4_K_M`) approach in most tasks.

> dynamic 3-bit quantization method (`DQ3_K_M`) that outperforms the 3-bit quantization implementation in `llama.cpp` and achieves performance comparable to 4-bit quantization across multiple benchmarks.

The resulting multi-bitwidth quantization has been well tested and documented. In the `convert.py` file of mlx-lm on your system (you can see the original code here), replace the code inside `def mixed_quant_predicate()` with something like the sketch shown under the Kimi-K2-Instruct-0905-mlx-DQ3_K_M entry above.
olmOCR-2-7B-1025-5bit
mlx-community/olmOCR-2-7B-1025-5bit This model was converted to MLX format from [`allenai/olmOCR-2-7B-1025`]() using mlx-vlm version 0.3.4. Refer to the original model card for more details on the model. Use with mlx
DeepSeek-R1-Distill-Qwen-7B-8bit
plamo-2-1b
Llama-3.2-3B-Instruct-abliterated-6bit
embeddinggemma-300m-qat-q8_0-unquantized-bf16
mlx-community/embeddinggemma-300m-qat-q8_0-unquantized-bf16 The Model mlx-community/embeddinggemma-300m-qat-q8_0-unquantized-bf16 was converted to MLX format from google/embeddinggemma-300m-qat-q8_0-unquantized using mlx-lm version 0.0.4.
Qwen3-4B-Instruct-2507-8bit
This model mlx-community/Qwen3-4B-Instruct-2507-8bit was converted to MLX format from Qwen/Qwen3-4B-Instruct-2507 using mlx-lm version 0.26.2.
Llama-3.3-70B-Instruct-bf16
Qwen3-VL-32B-Instruct-bf16
mlx-community/Qwen3-VL-32B-Instruct-bf16 This model was converted to MLX format from [`Qwen/Qwen3-VL-32B-Instruct`]() using mlx-vlm version 0.3.4. Refer to the original model card for more details on the model. Use with mlx
codegemma-7b-it-8bit
Llama-3.1-8B-Instruct-4bit
The Model mlx-community/Llama-3.1-8B-Instruct-4bit was converted to MLX format from meta-llama/Llama-3.1-8B-Instruct using mlx-lm version 0.21.4.
Qwen3-Next-80B-A3B-Instruct-5bit
rnj-1-instruct-4bit
granite-3.3-8b-instruct-4bit
Qwen3-8B-4bit-DWQ-053125
This model mlx-community/Qwen3-8B-4bit-DWQ-053125 was converted to MLX format from Qwen/Qwen3-8B using mlx-lm version 0.24.1.
c4ai-command-r-plus-4bit
Qwen2.5-72B-Instruct-4bit
gemma-3-27b-it-4bit-DWQ
This model mlx-community/gemma-3-27b-it-4bit-DWQ was converted to MLX format from google/gemma-3-27b-it using mlx-lm version 0.24.0.
dolphin-2.9-llama3-70b-4bit
Mistral-Small-24B-Instruct-2501-4bit
llava-v1.6-mistral-7b-4bit
gemma-3-1b-it-bf16
dac-speech-24khz-1.5kbps
Llama-OuteTTS-1.0-1B-4bit
LongCat-Flash-Chat-4bit
granite-4.0-h-1b-base-8bit
This model mlx-community/granite-4.0-h-1b-base-8bit was converted to MLX format from ibm-granite/granite-4.0-h-1b-base using mlx-lm version 0.28.4.
Llama-3.3-70B-Instruct-3bit
deepseek-coder-33b-instruct
Kimi-Linear-48B-A3B-Instruct-6bit
This model mlx-community/Kimi-Linear-48B-A3B-Instruct-6bit was converted to MLX format from moonshotai/Kimi-Linear-48B-A3B-Instruct using mlx-lm version 0.28.4.
bitnet-b1.58-2B-4T-4bit
This model mlx-community/bitnet-b1.58-2B-4T-4bit was converted to MLX format from microsoft/bitnet-b1.58-2B-4T using mlx-lm version 0.25.1.
MinerU2.5-2509-1.2B-bf16
mlx-community/MinerU2.5-2509-1.2B-bf16 This model was converted to MLX format from [`opendatalab/MinerU2.5-2509-1.2B`]() using mlx-vlm version 0.3.3. Refer to the original model card for more details on the model. Use with mlx
Mistral-Small-3.1-24B-Instruct-2503-4bit
Mixtral-8x7B-Instruct-v0.1-hf-4bit-mlx
Llama-3.1-Nemotron-Nano-4B-v1.1-4bit
This model mlx-community/Llama-3.1-Nemotron-Nano-4B-v1.1-4bit was converted to MLX format from nvidia/Llama-3.1-Nemotron-Nano-4B-v1.1 using mlx-lm version 0.25.0.
Apriel-1.5-15b-Thinker-bf16
mlx-community/Apriel-1.5-15b-Thinker-bf16 This model was converted to MLX format from [`ServiceNow-AI/Apriel-1.5-15b-Thinker`]() using mlx-vlm version 0.3.3. Refer to the original model card for more details on the model. Use with mlx
Qwen3-30B-A3B-Thinking-2507-4bit
This model mlx-community/Qwen3-30B-A3B-Thinking-2507-4bit was converted to MLX format from Qwen/Qwen3-30B-A3B-Thinking-2507 using mlx-lm version 0.26.3.
LFM2-8B-A1B-fp16
Qwen2.5-VL-32B-Instruct-4bit
Qwen3-14B-4bit-DWQ-053125
This model mlx-community/Qwen3-14B-4bit-DWQ-053125 was converted to MLX format from Qwen/Qwen3-14B using mlx-lm version 0.24.1.
meta-llama-Llama-4-Scout-17B-16E-fp16
gemma-3-4b-it-bf16
deepseek-coder-6.7b-instruct-hf-4bit-mlx
gemma-3-1b-it-4bit-DWQ
This model mlx-community/gemma-3-1b-it-4bit-DWQ was converted to MLX format from google/gemma-3-1b-it using mlx-lm version 0.24.1.
gemma-3n-E4B-it-bf16
LongCat-Flash-Chat-mlx-DQ6_K_M
gemma-3-270m-it-bf16
This model mlx-community/gemma-3-270m-it-bf16 was converted to MLX format from google/gemma-3-270m-it using mlx-lm version 0.26.3.
whisper-medium-mlx-4bit
Qwen3-14B-6bit
gpt2-base-mlx
LFM2-VL-450M-8bit
starcoder2-7b-4bit
Ling-mini-2.0-4bit
This model mlx-community/Ling-mini-2.0-4bit was converted to MLX format from inclusionAI/Ling-mini-2.0 using mlx-lm version 0.27.1.
LLaDA2.0-mini-preview-4bit
This model mlx-community/LLaDA2.0-mini-preview-4bit was converted to MLX format from inclusionAI/LLaDA2.0-mini-preview using mlx-lm version 0.28.4.
Qwen3-4B-4bit-DWQ-053125
Dolphin-Mistral-24B-Venice-Edition-4bit
mlx-community/Dolphin-Mistral-24B-Venice-Edition-4bit This model mlx-community/Dolphin-Mistral-24B-Venice-Edition-4bit was converted to MLX format from cognitivecomputations/Dolphin-Mistral-24B-Venice-Edition using mlx-lm version 0.25.3.
Llama-3-8B-Instruct-1048k-8bit
conikeec-deepseek-coder-6.7b-instruct
Josiefied-DeepSeek-R1-0528-Qwen3-8B-abliterated-v1-4bit
mlx-community/Josiefied-DeepSeek-R1-0528-Qwen3-8B-abliterated-v1-4bit This model mlx-community/Josiefied-DeepSeek-R1-0528-Qwen3-8B-abliterated-v1-4bit was converted to MLX format from Goekdeniz-Guelmez/Josiefied-DeepSeek-R1-0528-Qwen3-8B-abliterated-v1 using mlx-lm version 0.24.1.
Apertus-8B-Instruct-2509-8bit
Gemma-3-Glitter-12B-8bit
gemma-3-12b-it-4bit-DWQ
This model mlx-community/gemma-3-12b-it-4bit-DWQ was converted to MLX format from google/gemma-3-12b-it using mlx-lm version 0.24.0.
Gabliterated-Qwen3-0.6B-4bit
This model mlx-community/Gabliterated-Qwen3-0.6B-4bit was converted to MLX format from Goekdeniz-Guelmez/Gabliterated-Qwen3-0.6B using mlx-lm version 0.25.2.
gemma-3-270m-4bit
This model mlx-community/gemma-3-270m-4bit was converted to MLX format from google/gemma-3-270m using mlx-lm version 0.26.3.
Qwen3-VL-2B-Thinking-bf16
mlx-community/Qwen3-VL-2B-Thinking-bf16 This model was converted to MLX format from [`Qwen/Qwen3-VL-2B-Thinking`]() using mlx-vlm version 0.3.4. Refer to the original model card for more details on the model. Use with mlx
chatterbox-fp16
gemma-2-27b-bf16
Qwen3-VL-4B-Instruct-6bit
mlx-community/Qwen3-VL-4B-Instruct-6bit This model was converted to MLX format from [`Qwen/Qwen3-VL-4B-Instruct`]() using mlx-vlm version 0.3.4. Refer to the original model card for more details on the model. Use with mlx
Mistral-7B-Instruct-v0.3-8bit
nomicai-modernbert-embed-base-bf16
bitnet-b1.58-2B-4T-8bit
This model mlx-community/bitnet-b1.58-2B-4T-8bit was converted to MLX format from microsoft/bitnet-b1.58-2B-4T using mlx-lm version 0.25.1.
Qwen3-Coder-30B-A3B-Instruct-bf16
This model mlx-community/Qwen3-Coder-30B-A3B-Instruct-bf16 was converted to MLX format from Qwen/Qwen3-Coder-30B-A3B-Instruct using mlx-lm version 0.26.2.
LFM2-8B-A1B-6bit
This model mlx-community/LFM2-8B-A1B-6bit was converted to MLX format from LiquidAI/LFM2-8B-A1B using mlx-lm version 0.28.2.
gemma-3n-E4B-it-lm-bf16
Qwen2.5-Coder-1.5B-4bit
gemma-3-270m-it-qat-4bit
This model mlx-community/gemma-3-270m-it-qat-4bit was converted to MLX format from google/gemma-3-270m-it-qat using mlx-lm version 0.26.3.
DeepSeek-R1-Distill-Qwen-1.5B-6bit
medgemma-27b-it-8bit
gemma-3-27b-it-bf16
orpheus-3b-0.1-ft-4bit
This model mlx-community/orpheus-3b-0.1-ft-4bit was converted to MLX format from canopylabs/orpheus-3b-0.1-ft using mlx-audio version 0.0.3.
meta-llama-Llama-4-Scout-17B-16E-Instruct-bf16
c4ai-command-r-v01-4bit
Llama-3.2-8X4B-MOE-V2-Dark-Champion-Instruct-uncensored-abliterated-21B-MLX
Qwen3-1.7B-4bit-DWQ-053125
Qwen3-4B-Instruct-2507-5bit
LFM2-8B-A1B-6bit-MLX
Maintainer / Publisher: Susant Achary
Upstream model: LiquidAI/LFM2-8B-A1B
This repo (MLX 6-bit): `mlx-community/LFM2-8B-A1B-6bit-MLX`

This repository provides an Apple-Silicon-optimized MLX build of LFM2-8B-A1B at 6-bit quantization. Among quantized tiers, 6-bit is a strong fidelity sweet spot for many Macs—noticeably smaller than FP16/8-bit while preserving answer quality for instruction following, summarization, and structured extraction.

- Architecture: Mixture-of-Experts (MoE) Transformer.
- Size: ~8B total parameters with ~1B active per token (A1B ≈ “~1B active”).
- Why MoE? At each token, only a subset of experts is activated, reducing compute per token while keeping a larger parameter pool for expressivity.

> Single-device memory reality: even though only ~1B parameters are active per token, all experts typically reside in memory during inference on one device. RAM planning should therefore track total parameters, not just the active slice.

Included files
- `config.json` (MLX), `model.safetensors` (6-bit shards)
- Tokenizer files: `tokenizer.json`, `tokenizer_config.json`
- Model metadata (e.g., `model_index.json`)

Target: macOS on Apple Silicon (M-series) with Metal/MPS.

Intended use
- General instruction following, chat, and summarization
- RAG and long-context assistants on device
- Schema-guided structured outputs (JSON)

Limitations
- Quantization can cause small regressions vs. FP16 on tricky math/code or tight formatting.
- For very long contexts and/or batching, the KV cache can dominate memory—tune `max_tokens` and batch size.
- Add your own safety/guardrails for sensitive deployments.

Memory planning (assumed, realistic ranges)
The figures below are practical starting points for a single-device MLX run; validate on your hardware.

Rule-of-thumb components
- Weights (6-bit): ≈ total_params × 0.75 bytes → for 8B params, ≈ 6.0 GB
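As a quick way to apply the rule of thumb above, here is a minimal sketch (weights only — KV cache, activations, and runtime overhead are extra and grow with context length and batch size):

```python
# Back-of-envelope estimate of resident weight memory for a quantized model.
# Weights only: KV cache, activations, and framework overhead are not included.
def estimate_weight_memory_gb(total_params: float, bits_per_weight: float) -> float:
    bytes_per_weight = bits_per_weight / 8  # 6-bit -> 0.75 bytes per parameter
    return total_params * bytes_per_weight / 1e9

# LFM2-8B-A1B at 6-bit: all ~8B parameters stay resident even though only
# ~1B are active per token.
print(f"~{estimate_weight_memory_gb(8e9, 6):.1f} GB")  # ~6.0 GB
```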
Josiefied-Qwen2.5-7B-Instruct-abliterated-v2
deepseek-coder-1.3b-instruct-mlx
Qwen2.5-Coder-32B-Instruct-8bit
Qwen2.5-VL-3B-Instruct-bf16
gemma-3-4b-it-4bit-DWQ
This model mlx-community/gemma-3-4b-it-4bit-DWQ was converted to MLX format from google/gemma-3-4b-it using mlx-lm version 0.24.0.
Qwen3-1.7B-8bit
Huihui-gemma-3n-E4B-it-abliterated-lm-6bit
mlx-community/Huihui-gemma-3n-E4B-it-abliterated-lm-6bit The Model mlx-community/Huihui-gemma-3n-E4B-it-abliterated-lm-6bit was converted to MLX format from huihui-ai/Huihui-gemma-3n-E4B-it-abliterated using mlx-lm version 0.26.4.
Qwen3-VL-2B-Instruct-8bit
mlx-community/Qwen3-VL-2B-Instruct-8bit This model was converted to MLX format from [`Qwen/Qwen3-VL-2B-Instruct`]() using mlx-vlm version 0.3.4. Refer to the original model card for more details on the model. Use with mlx
GLM-4-32B-0414-4bit-DWQ
granite-4.0-h-tiny-5bit-MLX
Josiefied-Qwen3-30B-A3B-abliterated-v2-8bit
Huihui-gemma-3n-E4B-it-abliterated-lm-4bit
mlx-community/Huihui-gemma-3n-E4B-it-abliterated-lm-4bit The Model mlx-community/Huihui-gemma-3n-E4B-it-abliterated-lm-4bit was converted to MLX format from huihui-ai/Huihui-gemma-3n-E4B-it-abliterated using mlx-lm version 0.26.4.