lefromage
Qwen3-Next-80B-A3B-Instruct-GGUF
The qwennext PR (Pull Request #16095) was merged into the main branch and is included in llama.cpp release b7186. The speed in tokens/second is decent and should improve over time.

Update: I tested some of the smaller quants on an NVIDIA L40S GPU, using the default CUDA compile of the excellent release from @cturan. Since the L40S has 48 GB of VRAM, I was able to run Q2_K, Q3_K_M, Q4_K_S, Q4_0 and Q4MXFP4MOE, but Q4_K_M was too big. It does work with `-ngl 45`, but it slows down quite a bit. There may be a better way, but I did not have time to test. I got a good speed of 53 tokens per second for generation and 800 tokens per second for prompt processing.

You may need to add /usr/local/cuda/bin to your PATH to find nvcc (the NVIDIA CUDA compiler). For more detail on the CUDA build, see: https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md#cuda

These quantized models were generated on 2025-10-19 using the excellent pull request #16095 from @pwilkin, at commit `2fdbf16eb`. NOTE: currently they only work with the llama.cpp 16095 pull request, which is still in development. Speed and quality should improve over time.

By default, lefromage/Qwen3-Next-80B-A3B-Instruct-GGUF:Q4_K_M will be downloaded.

Example model output:

Quantum computing represents a revolutionary leap in computational power by harnessing the principles of quantum mechanics, such as superposition and entanglement, to process information in fundamentally new ways. Unlike classical computers, which use bits that are either 0 or 1, quantum computers use quantum bits, or qubits, which can exist in a combination of both states simultaneously. This allows quantum computers to explore vast solution spaces in parallel, making them potentially exponentially faster for certain problems, like factoring large numbers, optimizing complex systems, or simulating molecular structures for drug discovery.

While still in its early stages, with challenges including qubit stability, error correction, and scalability, quantum computing holds transformative promise for fields ranging from cryptography to artificial intelligence. As researchers and tech companies invest heavily in hardware and algorithmic development, the race to achieve practical, fault-tolerant quantum machines is accelerating, heralding a new era in computing technology.
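The PATH and CUDA build notes in this card can be sketched roughly as follows. This is a minimal sketch, assuming a POSIX shell and a llama.cpp checkout as the current directory. The function names `add_cuda_to_path` and `build_llama_cuda` are hypothetical helpers introduced here for illustration, and `-DGGML_CUDA=ON` is the option described in the linked build documentation.

```shell
#!/bin/sh
# Hypothetical helper: prepend the CUDA bin directory to PATH once,
# so that nvcc can be found. The default is the path named in the card.
add_cuda_to_path() {
  cuda_bin="${1:-/usr/local/cuda/bin}"
  case ":$PATH:" in
    *":$cuda_bin:"*) ;;                       # already on PATH, do nothing
    *) PATH="$cuda_bin:$PATH"; export PATH ;;
  esac
}

# Hypothetical helper: configure and build llama.cpp with CUDA enabled.
# Assumes the current directory is the root of a llama.cpp checkout.
build_llama_cuda() {
  cmake -B build -DGGML_CUDA=ON &&
    cmake --build build --config Release
}

add_cuda_to_path
# Warn (without failing) if the compiler is still not visible.
command -v nvcc >/dev/null 2>&1 || echo "nvcc not found; check your CUDA install" >&2
```

After this, running `build_llama_cuda` from the llama.cpp directory should start the CUDA build, provided the CUDA toolkit is installed.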
Qwen3-Next-80B-A3B-Thinking-GGUF
Update: I tested some of the smaller quants on an NVIDIA L40S GPU, using the default CUDA compile of the excellent release from @cturan. Since the L40S has 48 GB of VRAM, I was able to run Q2_K, Q3_K_M, Q4_K_S, Q4_0 and Q4MXFP4MOE, but Q4_K_M was too big. It does work with `-ngl 45`, but it slows down quite a bit. There may be a better way, but I did not have time to test. I got a good speed of 53 tokens per second for generation and 800 tokens per second for prompt processing.

You may need to add /usr/local/cuda/bin to your PATH to find nvcc (the NVIDIA CUDA compiler). For more detail on the CUDA build, see: https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md#cuda

These quantized models were generated on 2025-10-19 using the excellent pull request #16095 from @pwilkin, at commit `2fdbf16eb`. NOTE: currently they only work with the llama.cpp 16095 pull request, which is still in development. Speed and quality should improve over time.

By default, lefromage/Qwen3-Next-80B-A3B-Thinking-GGUF:Q4_K_M will be downloaded.
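Choosing a quant and a partial GPU offload, as described in this card, might look like the sketch below. It is only an illustration: `run_thinking` is a hypothetical wrapper, the quant tags assume the usual underscore spelling (for example `Q4_K_M`), and the default of 99 layers is an assumed "offload everything" value. `-hf` and `-ngl` are existing llama-cli options.

```shell
#!/bin/sh
# Hypothetical wrapper: run the Thinking model with a chosen quant tag and
# number of GPU layers. Set DRY_RUN=1 to print the command instead of running it.
run_thinking() {
  quant="${1:-Q4_K_M}"   # the card says this quant is the default download
  ngl="${2:-99}"         # assumed default: offload all layers to the GPU
  ${DRY_RUN:+echo} llama-cli -hf "lefromage/Qwen3-Next-80B-A3B-Thinking-GGUF:$quant" -ngl "$ngl"
}
```

For example, `run_thinking Q4_K_M 45` corresponds to the partial offload that fit on the 48 GB L40S, while `run_thinking Q4_K_S` would try to place every layer on the GPU.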
Qwen3-Next-80B-A3B-Instruct-split-GGUF
Another way to download the Q2_K quant model, in pieces. See https://huggingface.co/lefromage/Qwen3-Next-80B-A3B-Instruct-GGUF for more details. Currently getting 6 tokens per second for generation on a simple prompt.
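One possible way to fetch the pieces is with the Hugging Face CLI, sketched below. This is an assumption-heavy illustration, not the command from the original card: `download_q2k_pieces` is a hypothetical helper, the `*Q2_K*` pattern assumes the piece file names contain that tag, and the destination directory is arbitrary. Check the repository's file list for the real names and for how the pieces are meant to be joined.

```shell
#!/bin/sh
# Hypothetical helper: download only the files matching the assumed Q2_K
# pattern from the split repository. Set DRY_RUN=1 to print the command only.
download_q2k_pieces() {
  dest="${1:-./qwen3-next-q2k}"   # assumed local directory
  ${DRY_RUN:+echo} huggingface-cli download \
    lefromage/Qwen3-Next-80B-A3B-Instruct-split-GGUF \
    --include "*Q2_K*" \
    --local-dir "$dest"
}
```

Running it once with `DRY_RUN=1` set shows the exact command before any large download starts.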
OlympicCoder-7B-Q2_K-GGUF
gemma-2-2b-it-Q4_0-GGUF
lefromage/gemma-2-2b-it-Q4_0-GGUF

This model was converted to GGUF format from `google/gemma-2-2b-it` using llama.cpp via ggml.ai's GGUF-my-repo space. Refer to the original model card for more details on the model.

Use with llama.cpp

Install llama.cpp through brew (works on Mac and Linux).

Note: You can also use this checkpoint directly through the usage steps listed in the llama.cpp repo.

Step 2: Move into the llama.cpp folder and build it with the `LLAMA_CURL=1` flag, along with other hardware-specific flags (for example, `LLAMA_CUDA=1` for NVIDIA GPUs on Linux).
gemma-2-27b-it-Q4_0-GGUF
gemma-2-2b-it-Q2_K-GGUF
gemma-2-9b-it-Q3_K_S-GGUF
gemma-3-27b-it-Q4_0-GGUF
lefromage/gemma-3-27b-it-Q4_0-GGUF

This model was converted to GGUF format from `google/gemma-3-27b-it` using llama.cpp via ggml.ai's GGUF-my-repo space. Refer to the original model card for more details on the model.

Use with llama.cpp

Install llama.cpp through brew (works on Mac and Linux).

Note: You can also use this checkpoint directly through the usage steps listed in the llama.cpp repo.

Step 2: Move into the llama.cpp folder and build it with the `LLAMA_CURL=1` flag, along with other hardware-specific flags (for example, `LLAMA_CUDA=1` for NVIDIA GPUs on Linux).
gemma-2-9b-it-Q2_K-GGUF
gemma-2-9b-it-Q4_0-GGUF
OlympicCoder-7B-Q8_0-GGUF
TinyLlama-1.1B-Chat-v1.0-Q4_K_M-GGUF
gemma-2-27b-it-Q2_K-GGUF
OlympicCoder-7B-Q4_0-GGUF
lefromage/OlympicCoder-7B-Q4_0-GGUF

This model was converted to GGUF format from `open-r1/OlympicCoder-7B` using llama.cpp via ggml.ai's GGUF-my-repo space. Refer to the original model card for more details on the model.

Use with llama.cpp

Install llama.cpp through brew (works on Mac and Linux).

Note: You can also use this checkpoint directly through the usage steps listed in the llama.cpp repo.

Step 2: Move into the llama.cpp folder and build it with the `LLAMA_CURL=1` flag, along with other hardware-specific flags (for example, `LLAMA_CUDA=1` for NVIDIA GPUs on Linux).
OlympicCoder-7B-Q4_K_M-GGUF
lefromage/OlympicCoder-7B-Q4_K_M-GGUF

This model was converted to GGUF format from `open-r1/OlympicCoder-7B` using llama.cpp via ggml.ai's GGUF-my-repo space. Refer to the original model card for more details on the model.

Use with llama.cpp

Install llama.cpp through brew (works on Mac and Linux).

Note: You can also use this checkpoint directly through the usage steps listed in the llama.cpp repo.

Step 2: Move into the llama.cpp folder and build it with the `LLAMA_CURL=1` flag, along with other hardware-specific flags (for example, `LLAMA_CUDA=1` for NVIDIA GPUs on Linux).