
40+ AI models run 4+ tok/s on Raspberry Pi 5 (4GB)

Tested all 2,499 GGUF models. Here's what actually works.

🟢

EXCELLENT PERFORMANCE

Real-time capable • 4-6.5 tok/s

Llama 3.2-3B: 6.5 tok/s
Phi-3-mini: 6 tok/s
Qwen 2.5-3B: 5.5 tok/s

+ 40 more models

Cost vs Cloud AI

Raspberry Pi 4: $60 USD (one-time)
Cloud AI: $20-200 USD per month

ROI in 3 months
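The break-even claim above can be checked directly (a sketch assuming the $60 one-time Pi cost and the low end of the $20-200/month cloud range):

```python
import math

# Figures from the cost comparison above; $20/month is the cheapest cloud tier.
PI_COST_USD = 60          # Raspberry Pi 4, one-time purchase
CLOUD_USD_PER_MONTH = 20  # low end of the cloud AI range

months_to_break_even = math.ceil(PI_COST_USD / CLOUD_USD_PER_MONTH)
print(months_to_break_even)  # 3
```

At the high end of the cloud range ($200/month), the Pi pays for itself within the first month.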

Fast enough for:

✓

Real-time chat

Llama 3.2-3B at 6.5 tok/s

✓

Code review

Phi-3-mini at 6 tok/s

✓

Document Q&A

Qwen 2.5-3B at 5.5 tok/s
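To see why ~6 tok/s is "fast enough" for real-time chat, convert the speeds above to words per minute using the common ~0.75-words-per-token heuristic (an assumption, not a figure measured on this page; the exact ratio varies by tokenizer):

```python
# Rough words-per-minute for the three picks above, using the common
# ~0.75 words/token heuristic (an assumption; varies by tokenizer and text).
WORDS_PER_TOKEN = 0.75

for model, tok_per_s in [("Llama 3.2-3B", 6.5), ("Phi-3-mini", 6.0), ("Qwen 2.5-3B", 5.5)]:
    wpm = tok_per_s * WORDS_PER_TOKEN * 60
    print(f"{model}: ~{wpm:.0f} words/min")

# Typical silent reading speed is ~200-300 words/min, so even the slowest
# pick generates text about as fast as most people can read it.
```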

ℹ️

Raspberry Pi 4 vs Pi 5

Raspberry Pi 4 (4GB)

  • 6.5 tok/s peak (small models)
  • $60 USD (best value)
  • 2.5 tok/s for 7B models
  • Great for learning

Raspberry Pi 5 (8GB)

  • 7 tok/s peak (~8% faster)
  • $80 USD
  • 4 tok/s for 7B models (60% faster)
  • Better for production

Recommendation: Pi 4 offers excellent value for experimentation. Upgrade to Pi 5 if you need faster 7B model inference or plan production deployments.

Top Picks for Raspberry Pi 5 (4GB)

Curated models optimized for your hardware - ready to run

Best Performance

Fastest

Llama 3.2-3B

Speed on Raspberry Pi 5 (4GB)
7 tok/s
Memory
3.2 GB
Best For

Real-time chat, voice assistants

Quick Setup
ollama run llama3.2:3b
Best Quality

Most Accurate

Qwen 2.5 (7B Q4)

Speed on Raspberry Pi 5 (4GB)
4 tok/s
Memory
5.1 GB
Best For

Document Q&A, analysis

Quick Setup
ollama run qwen2.5:7b-instruct-q4_K_M
Coding Expert

Best for Code

Phi-3-mini

Speed on Raspberry Pi 5 (4GB)
6.5 tok/s
Memory
2.8 GB
Best For

Code completion, debugging

Quick Setup
ollama run phi3:mini
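Once a model is pulled with one of the `ollama run` commands above, you can measure tok/s on your own board through Ollama's local HTTP API (a sketch assuming the default `localhost:11434` endpoint; the network call is wrapped in a function so nothing contacts the daemon unless you call it, and the helper at the top is pure arithmetic):

```python
import json
import urllib.request

def tokens_per_second(eval_count, eval_duration_ns):
    """Convert Ollama's generation counters into tok/s."""
    return eval_count / (eval_duration_ns / 1e9)

def benchmark(model="llama3.2:3b", prompt="Explain GGUF in one sentence."):
    """POST to a running Ollama daemon and report generation speed.
    Requires `ollama serve`; not invoked at import time."""
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps({"model": model, "prompt": prompt, "stream": False}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    # Ollama reports eval_count (tokens generated) and eval_duration (nanoseconds).
    return tokens_per_second(body["eval_count"], body["eval_duration"])

# Example: 130 tokens generated in 20 s of eval time -> 6.5 tok/s
print(tokens_per_second(130, 20e9))  # 6.5
```

Run `benchmark()` on the Pi itself to compare your numbers against the table figures; cold-start runs include model load time, so benchmark a second request.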

Get Started with Edge AI

Hardware and cloud options for running models locally

RunPod

Rent GPU starting at $0.34/hour

Best Value

Deploy on cloud GPU or serverless. 70% cheaper than AWS.

Start from $0.34/hr

Amazon

Hardware for edge AI

Hardware

Get the devices you need to run models locally.

Shop Hardware

Disclosure: We may earn a commission from these partners. This helps keep LLMYourWay free.

All Compatible Models

Grouped by performance on Raspberry Pi 5 (4GB) • 200 total models

Llama-3.2-1B-Instruct-GGUF

MaziyarPanahi

MaziyarPanahi/Llama-3.2-1B-Instruct-GGUF - Model creator: meta-llama - Original model: meta-llama/Llama-3.2-1B-Instruct

Description: MaziyarPanahi/Llama-3.2-1B-Instruct-GGUF contains GGUF format model files for meta-llama/Llama-3.2-1B-Instruct.

GGUF is a format introduced by the llama.cpp team on August 21st, 2023. It is a replacement for GGML, which is no longer supported by llama.cpp. Here is an incomplete list of clients and libraries that are known to support GGUF:

  • llama.cpp - the source project for GGUF; offers a CLI and a server option.
  • llama-cpp-python - a Python library with GPU accel, LangChain support, and an OpenAI-compatible API server.
  • LM Studio - an easy-to-use and powerful local GUI for Windows and macOS (Silicon), with GPU acceleration; Linux available, in beta as of 27/11/2023.
  • text-generation-webui - the most widely used web UI, with many features and powerful extensions; supports GPU acceleration.
  • KoboldCpp - a fully featured web UI, with GPU accel across all platforms and GPU architectures; especially good for storytelling.
  • GPT4All - a free and open source locally running GUI, supporting Windows, Linux and macOS with full GPU accel.
  • LoLLMS Web UI - a great web UI with many interesting and unique features, including a full model library for easy model selection.
  • Faraday.dev - an attractive and easy-to-use character-based chat GUI for Windows and macOS (both Silicon and Intel), with GPU acceleration.
  • candle - a Rust ML framework with a focus on performance, including GPU support, and ease of use.
  • ctransformers - a Python library with GPU accel, LangChain support, and an OpenAI-compatible AI server. Note: as of the time of writing (November 27th, 2023), ctransformers has not been updated in a long time and does not support many recent models.

🙏 Special thanks to Georgi Gerganov and the whole team working on llama.cpp for making all of this possible.

gemma-3-1b-it-GGUF

MaziyarPanahi

MaziyarPanahi/gemma-3-1b-it-GGUF - Model creator: google - Original model: google/gemma-3-1b-it

Description: MaziyarPanahi/gemma-3-1b-it-GGUF contains GGUF format model files for google/gemma-3-1b-it. See the Llama-3.2-1B-Instruct-GGUF entry above for the standard list of clients and libraries that support GGUF.

Yi-Coder-1.5B-Chat-GGUF

MaziyarPanahi

MaziyarPanahi/Yi-Coder-1.5B-Chat-GGUF - Model creator: 01-ai - Original model: 01-ai/Yi-Coder-1.5B-Chat

Description: MaziyarPanahi/Yi-Coder-1.5B-Chat-GGUF contains GGUF format model files for 01-ai/Yi-Coder-1.5B-Chat. See the Llama-3.2-1B-Instruct-GGUF entry above for the standard list of clients and libraries that support GGUF.

INTELLECT-2-GGUF

MaziyarPanahi

MaziyarPanahi/INTELLECT-2-GGUF - Model creator: PrimeIntellect - Original model: PrimeIntellect/INTELLECT-2

Description: MaziyarPanahi/INTELLECT-2-GGUF contains GGUF format model files for PrimeIntellect/INTELLECT-2. See the Llama-3.2-1B-Instruct-GGUF entry above for the standard list of clients and libraries that support GGUF.

Osmosis-Structure-0.6B

osmosis-ai

`Osmosis-Structure-0.6B`: Small Language Model for Structured Outputs

`Osmosis-Structure-0.6B` is a specialized small language model (SLM) designed to excel at structured output generation. Despite its compact 0.6B parameter size, this model demonstrates remarkable performance on extracting structured information when paired with supported frameworks. Our approach leverages structured output during training, forcing our model to only focus on the value for each key declared by the inference engine, which significantly improves the accuracy of the model's ability to produce well-formatted, structured responses across various domains, particularly in mathematical reasoning and problem-solving tasks.

We evaluate the effectiveness of osmosis-enhanced structured generation on challenging mathematical reasoning benchmarks. The following results demonstrate the dramatic performance improvements achieved through structured outputs with osmosis enhancement across different model families - the same technique that powers `Osmosis-Structure-0.6B`.

First benchmark:

| Model | Structured Output | Structured w/ Osmosis | Performance Gain |
|-------|:-----------------:|:---------------------:|:----------------:|
| Claude 4 Sonnet | 15.52% | 69.40% | +347% |
| Claude 4 Opus | 15.28% | 69.91% | +357% |
| GPT-4.1 | 10.53% | 70.03% | +565% |
| OpenAI o3 | 91.14% | 94.05% | +2.9% |

Second benchmark:

| Model | Structured Output | Structured w/ Osmosis | Performance Gain |
|-------|:-----------------:|:---------------------:|:----------------:|
| Claude 4 Sonnet | 16.29% | 62.59% | +284% |
| Claude 4 Opus | 22.94% | 65.06% | +184% |
| GPT-4.1 | 2.79% | 39.66% | +1322% |
| OpenAI o3 | 92.05% | 93.24% | +1.3% |

> Key Insight: These results demonstrate that by allowing models to think freely and leverage test-time compute, we are able to increase performance and still maintain the structured guarantee after the fact with a SLM. `Osmosis-Structure-0.6B` is specifically designed and optimized to maximize these benefits in a compact 0.6B parameter model.
`Osmosis-Structure-0.6B` is built on top of `Qwen3-0.6B`. We first established a baseline format using 10 samples of randomly generated text and their JSON interpretations. We then applied reinforcement learning to approximately 500,000 examples of JSON-to-natural-language pairs, consisting of either reasoning traces with their final outputs, or natural language reports with their expected structured formats. We used verl as the training framework and SGLang as the rollout backend. To enable structured training, we modified parts of the verl codebase to allow a per-sample schema to be passed into the training data. We recommend an engine like SGLang for serving the model; to serve, run the following:

python3 -m sglang.launch_server --model-path osmosis-ai/Osmosis-Structure-0.6B --host 0.0.0.0 --api-key osmosis

You can also use Ollama as an inference provider on local machines.
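Since SGLang exposes an OpenAI-compatible API, a structured-output request against a server started as described above can be sketched like this (an illustration: the `weather_report` schema, field names, and the default port are assumptions, and the HTTP call is wrapped in a function so nothing runs without a live server):

```python
import json
import urllib.request

# Hypothetical JSON schema for the structured output we want the SLM to emit.
WEATHER_SCHEMA = {
    "type": "object",
    "properties": {
        "city": {"type": "string"},
        "temperature_c": {"type": "number"},
    },
    "required": ["city", "temperature_c"],
}

def build_request(text):
    """Build an OpenAI-style chat payload with a json_schema response format."""
    return {
        "model": "osmosis-ai/Osmosis-Structure-0.6B",
        "messages": [{"role": "user", "content": text}],
        "response_format": {
            "type": "json_schema",
            "json_schema": {"name": "weather_report", "schema": WEATHER_SCHEMA},
        },
    }

def extract(text, base_url="http://localhost:30000/v1", api_key="osmosis"):
    """POST to the SGLang server launched above (requires it to be running)."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_request(text)).encode(),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(req) as resp:
        reply = json.load(resp)
    # The schema-constrained answer arrives as a JSON string in the message content.
    return json.loads(reply["choices"][0]["message"]["content"])

payload = build_request("It was 31 degrees Celsius in Cairo today.")
print(payload["response_format"]["type"])  # json_schema
```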

Qwen3-1.7B-GGUF

unsloth

See our collection for all versions of Qwen3 including GGUF, 4-bit & 16-bit formats. Unsloth Dynamic 2.0 achieves superior accuracy & outperforms other leading quants.

- Fine-tune Qwen3 (14B) for free using our Google Colab notebook!
- Read our blog about Qwen3 support: unsloth.ai/blog/qwen3
- View the rest of our notebooks in our docs.
- Run & export your fine-tuned model to Ollama, llama.cpp or HF.

| Unsloth supports | Free Notebooks | Performance | Memory use |
|------------------|----------------|-------------|------------|
| Qwen3 (14B) | ▶️ Start on Colab | 3x faster | 70% less |
| GRPO with Qwen3 (8B) | ▶️ Start on Colab | 3x faster | 80% less |
| Llama-3.2 (3B) | ▶️ Start on Colab | 2.4x faster | 58% less |
| Llama-3.2 (11B vision) | ▶️ Start on Colab | 2x faster | 60% less |
| Qwen2.5 (7B) | ▶️ Start on Colab | 2x faster | 60% less |
| Phi-4 (14B) | ▶️ Start on Colab | 2x faster | 50% less |

To switch between thinking and non-thinking: if you are using llama.cpp, Ollama, Open WebUI etc., you can add `/think` and `/no_think` to user prompts or system messages to switch the model's thinking mode from turn to turn. The model will follow the most recent instruction in multi-turn conversations.

Qwen3 is the latest generation of large language models in the Qwen series, offering a comprehensive suite of dense and mixture-of-experts (MoE) models. Built upon extensive training, Qwen3 delivers groundbreaking advancements in reasoning, instruction following, agent capabilities, and multilingual support, with the following key features:

- Unique support for seamless switching between thinking mode (for complex logical reasoning, math, and coding) and non-thinking mode (for efficient, general-purpose dialogue) within a single model, ensuring optimal performance across various scenarios.
- Significant enhancement of its reasoning capabilities, surpassing previous QwQ (in thinking mode) and Qwen2.5 instruct models (in non-thinking mode) on mathematics, code generation, and commonsense logical reasoning.
- Superior human preference alignment, excelling in creative writing, role-playing, multi-turn dialogues, and instruction following, to deliver a more natural, engaging, and immersive conversational experience.
- Expertise in agent capabilities, enabling precise integration with external tools in both thinking and non-thinking modes and achieving leading performance among open-source models in complex agent-based tasks.
- Support for 100+ languages and dialects with strong capabilities for multilingual instruction following and translation.

Qwen3-1.7B has the following features:

- Type: Causal Language Models
- Training Stage: Pretraining & Post-training
- Number of Parameters: 1.7B
- Number of Parameters (Non-Embedding): 1.4B
- Number of Layers: 28
- Number of Attention Heads (GQA): 16 for Q and 8 for KV
- Context Length: 32,768

For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our blog, GitHub, and Documentation. The code for Qwen3 has been merged into the latest Hugging Face `transformers`, and we advise you to use the latest version of `transformers`.
With `transformers`, after generation you can split the thinking content from the final reply by locating token 151668 (`</think>`) in the output IDs:

```python
try:
    # rindex finding 151668 (</think>)
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content)
print("content:", content)
```

For deployment, you can serve the model with vLLM or SGLang:

```shell
vllm serve Qwen/Qwen3-1.7B --enable-reasoning --reasoning-parser deepseek_r1
```

```shell
python -m sglang.launch_server --model-path Qwen/Qwen3-1.7B --reasoning-parser deepseek-r1
```

Thinking mode is controlled through `enable_thinking` in the chat template:

```python
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True  # True is the default value for enable_thinking
)
```

```python
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False  # Setting enable_thinking=False disables thinking mode
)
```

A multi-turn chatbot that toggles thinking with `/think` and `/no_think`:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

class QwenChatbot:
    def __init__(self, model_name="Qwen/Qwen3-1.7B"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(model_name)
        self.history = []

    def generate_response(self, user_input):
        messages = self.history + [{"role": "user", "content": user_input}]
        text = self.tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True
        )
        inputs = self.tokenizer(text, return_tensors="pt")
        response_ids = self.model.generate(**inputs, max_new_tokens=32768)[0][len(inputs.input_ids[0]):].tolist()
        response = self.tokenizer.decode(response_ids, skip_special_tokens=True)

        # Update history
        self.history.append({"role": "user", "content": user_input})
        self.history.append({"role": "assistant", "content": response})
        return response

# Example usage
if __name__ == "__main__":
    chatbot = QwenChatbot()

    # First input (without /think or /no_think tags, thinking mode is enabled by default)
    user_input_1 = "How many r's in strawberries?"
    print(f"User: {user_input_1}")
    response_1 = chatbot.generate_response(user_input_1)
    print(f"Bot: {response_1}")
    print("----------------------")

    # Second input with /no_think
    user_input_2 = "Then, how many r's in blueberries? /no_think"
    print(f"User: {user_input_2}")
    response_2 = chatbot.generate_response(user_input_2)
    print(f"Bot: {response_2}")
    print("----------------------")

    # Third input with /think
    user_input_3 = "Really? /think"
    print(f"User: {user_input_3}")
    response_3 = chatbot.generate_response(user_input_3)
    print(f"Bot: {response_3}")
```

For tool use, Qwen3 integrates with Qwen-Agent:

```python
from qwen_agent.agents import Assistant

# Define LLM
llm_cfg = {
    'model': 'Qwen3-1.7B',

    # Use the endpoint provided by Alibaba Model Studio:
    # 'model_type': 'qwen_dashscope',
    # 'api_key': os.getenv('DASHSCOPE_API_KEY'),

    # Use a custom endpoint compatible with OpenAI API:
    'model_server': 'http://localhost:8000/v1',  # api_base
    'api_key': 'EMPTY',

    # Other parameters:
    # 'generate_cfg': {
    #     # Add: When the response content is `<think>this is the thought</think>this is the answer`;
    #     # Do not add: When the response has been separated by reasoning_content and content.
    #     'thought_in_content': True,
    # },
}

# Define Tools
tools = [
    {'mcpServers': {  # You can specify the MCP configuration file
        'time': {
            'command': 'uvx',
            'args': ['mcp-server-time', '--local-timezone=Asia/Shanghai']
        },
        "fetch": {
            "command": "uvx",
            "args": ["mcp-server-fetch"]
        }
    }},
    'code_interpreter',  # Built-in tools
]

# Define Agent
bot = Assistant(llm=llm_cfg, function_list=tools)

# Streaming generation
messages = [{'role': 'user', 'content': 'https://qwenlm.github.io/blog/ Introduce the latest developments of Qwen'}]
for responses in bot.run(messages=messages):
    pass
print(responses)
```

```bibtex
@misc{qwen3,
    title  = {Qwen3},
    url    = {https://qwenlm.github.io/blog/qwen3/},
    author = {Qwen Team},
    month  = {April},
    year   = {2025}
}
```

phi-2-GGUF

TheBloke

TheBloke's LLM work is generously supported by a grant from andreessen horowitz (a16z).

Phi 2 - GGUF - Model creator: Microsoft - Original model: Phi 2

This repo contains GGUF format model files for Microsoft's Phi 2. GGUF is a format introduced by the llama.cpp team on August 21st, 2023, replacing GGML, which is no longer supported by llama.cpp. See the Llama-3.2-1B-Instruct-GGUF entry above for the standard list of clients and libraries that support GGUF.

GPTQ models for GPU inference, with multiple quantisation parameter options.
2, 3, 4, 5, 6 and 8-bit GGUF models for CPU+GPU inference; Microsoft's original unquantised fp16 model in pytorch format, for GPU inference and for further conversions. These quantised GGUFv2 files are compatible with llama.cpp from August 27th onwards, as of commit d0cee0d. They are also compatible with many third-party UIs and libraries - please see the list at the top of this README.

- GGML_TYPE_Q2_K - "type-1" 2-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Block scales and mins are quantized with 4 bits. This ends up effectively using 2.5625 bits per weight (bpw).
- GGML_TYPE_Q3_K - "type-0" 3-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Scales are quantized with 6 bits. This ends up using 3.4375 bpw.
- GGML_TYPE_Q4_K - "type-1" 4-bit quantization in super-blocks containing 8 blocks, each block having 32 weights. Scales and mins are quantized with 6 bits. This ends up using 4.5 bpw.
- GGML_TYPE_Q5_K - "type-1" 5-bit quantization. Same super-block structure as GGML_TYPE_Q4_K, resulting in 5.5 bpw.
- GGML_TYPE_Q6_K - "type-0" 6-bit quantization. Super-blocks with 16 blocks, each block having 16 weights. Scales are quantized with 8 bits. This ends up using 6.5625 bpw.

Refer to the Provided Files table below to see what files use which methods, and how.
| Name | Quant method | Bits | Size | Max RAM required | Use case |
| ---- | ---- | ---- | ---- | ---- | ----- |
| phi-2.Q2_K.gguf | Q2_K | 2 | 1.17 GB | 3.67 GB | smallest, significant quality loss - not recommended for most purposes |
| phi-2.Q3_K_S.gguf | Q3_K_S | 3 | 1.25 GB | 3.75 GB | very small, high quality loss |
| phi-2.Q3_K_M.gguf | Q3_K_M | 3 | 1.48 GB | 3.98 GB | very small, high quality loss |
| phi-2.Q4_0.gguf | Q4_0 | 4 | 1.60 GB | 4.10 GB | legacy; small, very high quality loss - prefer using Q3_K_M |
| phi-2.Q3_K_L.gguf | Q3_K_L | 3 | 1.60 GB | 4.10 GB | small, substantial quality loss |
| phi-2.Q4_K_S.gguf | Q4_K_S | 4 | 1.62 GB | 4.12 GB | small, greater quality loss |
| phi-2.Q4_K_M.gguf | Q4_K_M | 4 | 1.79 GB | 4.29 GB | medium, balanced quality - recommended |
| phi-2.Q5_0.gguf | Q5_0 | 5 | 1.93 GB | 4.43 GB | legacy; medium, balanced quality - prefer using Q4_K_M |
| phi-2.Q5_K_S.gguf | Q5_K_S | 5 | 1.93 GB | 4.43 GB | large, low quality loss - recommended |
| phi-2.Q5_K_M.gguf | Q5_K_M | 5 | 2.07 GB | 4.57 GB | large, very low quality loss - recommended |
| phi-2.Q6_K.gguf | Q6_K | 6 | 2.29 GB | 4.79 GB | very large, extremely low quality loss |
| phi-2.Q8_0.gguf | Q8_0 | 8 | 2.96 GB | 5.46 GB | very large, extremely low quality loss - not recommended |

Note: the above RAM figures assume no GPU offloading. If layers are offloaded to the GPU, this will reduce RAM usage and use VRAM instead.

Note for manual downloaders: you almost never want to clone the entire repo! Multiple different quantisation formats are provided, and most users only want to pick and download a single file. The following clients/libraries will automatically download models for you, providing a list of available models to choose from. Under Download Model, you can enter the model repo: TheBloke/phi-2-GGUF and below it, a specific filename to download, such as: phi-2.Q4_K_M.gguf.
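A pattern worth noting in the table above: every "Max RAM required" figure is the file size plus roughly 2.5 GB of working memory (with no GPU offloading). That yields a quick estimator for quants not listed (an approximation derived from this table, not an official formula):

```python
# Estimate max RAM for a GGUF file from its on-disk size, using the
# +2.5 GB overhead implied by the phi-2 table above (no GPU offloading).
OVERHEAD_GB = 2.50

def est_max_ram_gb(file_size_gb):
    return round(file_size_gb + OVERHEAD_GB, 2)

print(est_max_ram_gb(1.79))  # 4.29 -> matches the Q4_K_M row
print(est_max_ram_gb(2.96))  # 5.46 -> matches the Q8_0 row
```

On a 4GB Raspberry Pi this rule of thumb suggests staying at or below the Q2_K/Q3_K_S rows for a model this size, or using smaller models at higher quant quality.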
On the command line, including for multiple files at once, I recommend using the `huggingface-hub` Python library. You can then download any individual model file to the current directory at high speed, and you can also download multiple files at once with a pattern. For more documentation on downloading with `huggingface-cli`, please see: HF -> Hub Python Library -> Download files -> Download from the CLI. To accelerate downloads on fast connections (1Gbit/s or higher), install `hf_transfer` and set the environment variable `HF_HUB_ENABLE_HF_TRANSFER` to `1`. Windows Command Line users: you can set the environment variable by running `set HF_HUB_ENABLE_HF_TRANSFER=1` before the download command.

Make sure you are using `llama.cpp` from commit d0cee0d or later. Change `-ngl 32` to the number of layers to offload to GPU; remove it if you don't have GPU acceleration. Change `-c 2048` to the desired sequence length. For extended sequence models - eg 8K, 16K, 32K - the necessary RoPE scaling parameters are read from the GGUF file and set by llama.cpp automatically. Note that longer sequence lengths require much more resources, so you may need to reduce this value. If you want to have a chat-style conversation, replace the `-p ` argument with `-i -ins`. For other parameters and how to use them, please refer to the llama.cpp documentation. Further instructions can be found in the text-generation-webui documentation: text-generation-webui/docs/04 ‐ Model Tab.md.

You can use GGUF models from Python using the llama-cpp-python or ctransformers libraries. Note that at the time of writing (Nov 27th 2023), ctransformers had not been updated for some time and is not compatible with some recent models; therefore I recommend you use llama-cpp-python. For full documentation, please see: llama-cpp-python docs.
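The download step above can also be done programmatically with the `huggingface_hub` library's real `hf_hub_download` API (a sketch: the wrapper names `download_args`/`download_quant` are mine, and the actual fetch is kept inside a function so nothing downloads unless you call it):

```python
def download_args(repo_id="TheBloke/phi-2-GGUF", filename="phi-2.Q4_K_M.gguf"):
    """Pure helper: the arguments we would pass to hf_hub_download.
    Q4_K_M is the 'recommended' row in the table above."""
    return {"repo_id": repo_id, "filename": filename, "local_dir": "."}

def download_quant(**overrides):
    """Fetch a single quant file (requires `pip install huggingface-hub` and network)."""
    from huggingface_hub import hf_hub_download
    return hf_hub_download(**download_args(**overrides))

print(download_args()["filename"])  # phi-2.Q4_K_M.gguf
```

Calling `download_quant(filename="phi-2.Q5_K_M.gguf")` would grab a different quant from the same repo; only the one file is downloaded, not the whole repository.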
Run one of the following commands, according to your system. Here are guides on using llama-cpp-python and ctransformers with LangChain: LangChain + llama-cpp-python, LangChain + ctransformers.

Phi-2 is a Transformer with 2.7 billion parameters. It was trained using the same data sources as Phi-1.5, augmented with a new data source that consists of various NLP synthetic texts and filtered websites (for safety and educational value). When assessed against benchmarks testing common sense, language understanding, and logical reasoning, Phi-2 showcased nearly state-of-the-art performance among models with less than 13 billion parameters. The model hasn't been fine-tuned through reinforcement learning from human feedback. The intention behind crafting this open-source model is to provide the research community with a non-restricted small model to explore vital safety challenges, such as reducing toxicity, understanding societal biases, enhancing controllability, and more. Phi-2 is intended for research purposes only.

Given the nature of the training data, the Phi-2 model is best suited for prompts using the QA format, the chat format, and the code format. You can provide the prompt as a standalone question, where the model generates the text that follows. To encourage the model to write more concise answers, you can also try the QA format "Instruct: <prompt>\nOutput:", where the model generates the text after "Output:". In the chat format, the model generates the text after the first "Bob:"; in the code format, it generates the text after the comments.

Notes: Phi-2 is intended for research purposes. The model-generated text/code should be treated as a starting point rather than a definitive solution for potential use cases.
Users should be cautious when employing these models in their applications. Direct adoption for production tasks is out of the scope of this research project. As a result, the Phi-2 model has not been tested to ensure that it performs adequately for any production-level application. Please refer to the limitation sections of this document for more details. If you are using `transformers>=4.36.0`, always load the model with `trust_remote_code=True` to prevent side-effects. To ensure maximum compatibility, we recommend using the second execution mode (FP16 / CUDA). Remark: in the generation function, our model currently does not support beam search (`num_beams > 1`). Furthermore, in the forward pass of the model, we currently do not support outputting hidden states or attention values, or using custom input embeddings. Generate Inaccurate Code and Facts: the model may produce incorrect code snippets and statements. Users should treat these outputs as suggestions or starting points, not as definitive or accurate solutions. Limited Scope for Code: the majority of Phi-2 training data is based in Python and uses common packages such as "typing, math, random, collections, datetime, itertools". If the model generates Python scripts that utilize other packages or scripts in other languages, we strongly recommend users manually verify all API uses. Unreliable Responses to Instruction: the model has not undergone instruction fine-tuning. As a result, it may struggle or fail to adhere to intricate or nuanced instructions provided by users. Language Limitations: the model is primarily designed to understand standard English. Informal English, slang, or any other languages might pose challenges to its comprehension, leading to potential misinterpretations or errors in response. Potential Societal Biases: Phi-2 is not entirely free from societal biases despite efforts in assuring training data safety.
There's a possibility it may generate content that mirrors these societal biases, particularly if prompted or instructed to do so. We urge users to be aware of this and to exercise caution and critical thinking when interpreting model outputs.
- Toxicity: Despite being trained with carefully selected data, the model can still produce harmful content if explicitly prompted or instructed to do so. We chose to release the model for research purposes only; we hope to help the open-source community develop the most effective ways to reduce the toxicity of a model directly after pretraining.
- Verbosity: Phi-2, being a base model, often produces irrelevant or extra text and responses following its first answer to user prompts within a single turn. This is due to its training dataset being primarily textbooks, which results in textbook-like responses.

Training details:
- Architecture: a Transformer-based model with a next-word prediction objective
- Dataset size: 250B tokens, a combination of NLP synthetic data created by AOAI GPT-3.5 and filtered web data from Falcon RefinedWeb and SlimPajama, which was assessed by AOAI GPT-4

The model is licensed under the microsoft-research-license. This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.

Qwen3-1.7B-GGUF

Qwen

Qwen3 is the latest generation of large language models in the Qwen series, offering a comprehensive suite of dense and mixture-of-experts (MoE) models. Built upon extensive training, Qwen3 delivers groundbreaking advancements in reasoning, instruction following, agent capabilities, and multilingual support, with the following key features:
- Unique support for seamless switching between thinking mode (for complex logical reasoning, math, and coding) and non-thinking mode (for efficient, general-purpose dialogue) within a single model, ensuring optimal performance across various scenarios.
- Significant enhancement of its reasoning capabilities, surpassing previous QwQ (in thinking mode) and Qwen2.5 instruct models (in non-thinking mode) on mathematics, code generation, and commonsense logical reasoning.
- Superior human preference alignment, excelling in creative writing, role-playing, multi-turn dialogue, and instruction following, delivering a more natural, engaging, and immersive conversational experience.
- Expertise in agent capabilities, enabling precise integration with external tools in both thinking and non-thinking modes and achieving leading performance among open-source models in complex agent-based tasks.
- Support for 100+ languages and dialects with strong capabilities for multilingual instruction following and translation.

Qwen3-1.7B has the following features:
- Type: Causal Language Models
- Training Stage: Pretraining & Post-training
- Number of Parameters: 1.7B
- Number of Parameters (Non-Embedding): 1.4B
- Number of Layers: 28
- Number of Attention Heads (GQA): 16 for Q and 8 for KV

For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our blog, GitHub, and Documentation. Check out our llama.cpp documentation for a more detailed usage guide. We advise you to clone `llama.cpp` and install it following the official guide. We follow the latest version of llama.cpp.
In the following demonstration, we assume that you are running commands under the repository `llama.cpp`. Check out our ollama documentation for a more detailed usage guide.

You can add `/think` and `/nothink` to user prompts or system messages to switch the model's thinking mode from turn to turn. The model will follow the most recent instruction in multi-turn conversations.

To achieve optimal performance, we recommend the following settings:
1. Sampling Parameters:
   - For thinking mode (`enable_thinking=True`), use `Temperature=0.6`, `TopP=0.95`, `TopK=20`, `MinP=0`, and `PresencePenalty=1.5`. DO NOT use greedy decoding, as it can lead to performance degradation and endless repetitions.
   - For non-thinking mode (`enable_thinking=False`), we suggest using `Temperature=0.7`, `TopP=0.8`, `TopK=20`, `MinP=0`, and `PresencePenalty=1.5`.
   - We recommend setting `presence_penalty` to 1.5 for quantized models to suppress repetitive outputs. You can adjust the `presence_penalty` parameter between 0 and 2. A higher value may occasionally lead to language mixing and a slight reduction in model performance.
2. Adequate Output Length: We recommend using an output length of 32,768 tokens for most queries. For benchmarking on highly complex problems, such as those found in math and programming competitions, we suggest setting the max output length to 38,912 tokens. This provides the model with sufficient space to generate detailed and comprehensive responses, thereby enhancing its overall performance.
3. Standardize Output Format: We recommend using prompts to standardize model outputs when benchmarking.
   - Math Problems: Include "Please reason step by step, and put your final answer within \boxed{}." in the prompt.
   - Multiple-Choice Questions: Add the following JSON structure to the prompt to standardize responses: "Please show your choice in the `answer` field with only the choice letter, e.g., `"answer": "C"`."
4. No Thinking Content in History: In multi-turn conversations, the historical model output should include only the final output part and does not need to include the thinking content. This is implemented in the provided Jinja2 chat template. However, for frameworks that do not directly use the Jinja2 chat template, it is up to the developers to ensure that this best practice is followed.

If you find our work helpful, feel free to give us a cite.
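A minimal sketch of that history-stripping step, assuming the thinking content is delimited by `<think>...</think>` in the model's raw output (the helper name below is hypothetical):

```python
import re

# Match a <think>...</think> block plus any trailing whitespace.
THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)

def strip_thinking(assistant_text: str) -> str:
    """Remove the thinking block so only the final answer enters the history."""
    return THINK_BLOCK.sub("", assistant_text).strip()

raw = "<think>Count the r's one by one.</think>\nThere are three r's in \"strawberry\"."
history_entry = {"role": "assistant", "content": strip_thinking(raw)}
print(history_entry["content"])  # only the final answer remains
```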

Qwen2.5-0.5B-Instruct-GGUF

Qwen

Qwen2.5 is the latest series of Qwen large language models. For Qwen2.5, we release a number of base language models and instruction-tuned language models ranging from 0.5 to 72 billion parameters. Qwen2.5 brings the following improvements upon Qwen2:
- Significantly more knowledge and greatly improved capabilities in coding and mathematics, thanks to our specialized expert models in these domains.
- Significant improvements in instruction following, generating long texts (over 8K tokens), understanding structured data (e.g., tables), and generating structured outputs, especially JSON. More resilient to the diversity of system prompts, enhancing role-play implementation and condition-setting for chatbots.
- Long-context support up to 128K tokens, with generation of up to 8K tokens.
- Multilingual support for over 29 languages, including Chinese, English, French, Spanish, Portuguese, German, Italian, Russian, Japanese, Korean, Vietnamese, Thai, Arabic, and more.

This repo contains the instruction-tuned 0.5B Qwen2.5 model in the GGUF format, which has the following features:
- Type: Causal Language Models
- Training Stage: Pretraining & Post-training
- Architecture: transformers with RoPE, SwiGLU, RMSNorm, Attention QKV bias and tied word embeddings
- Number of Parameters: 0.49B
- Number of Parameters (Non-Embedding): 0.36B
- Number of Layers: 24
- Number of Attention Heads (GQA): 14 for Q and 2 for KV
- Context Length: Full 32,768 tokens and generation 8,192 tokens
- Quantization: q2_K, q3_K_M, q4_0, q4_K_M, q5_0, q5_K_M, q6_K, q8_0

For more details, please refer to our blog, GitHub, and Documentation. Check out our llama.cpp documentation for a more detailed usage guide. We advise you to clone `llama.cpp` and install it following the official guide. We follow the latest version of llama.cpp. In the following demonstration, we assume that you are running commands under the repository `llama.cpp`.
Since cloning the entire repo may be inefficient, you can manually download the GGUF file that you need or use `huggingface-cli` to fetch it. To achieve a chatbot-like experience, it is recommended to start llama.cpp in conversation mode.

Detailed evaluation results are reported in this πŸ“‘ blog. For quantized models, the benchmark results against the original bfloat16 models can be found here. For requirements on GPU memory and the respective throughput, see the results here. If you find our work helpful, feel free to give us a cite.
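The GQA head counts above (14 query heads sharing 2 KV heads) directly shrink the KV cache. A back-of-the-envelope sketch, assuming a head dimension of 64 and FP16 cache entries (both are assumptions for illustration, not figures from this card):

```python
# Qwen2.5-0.5B figures from the card
n_layers   = 24
n_q_heads  = 14
n_kv_heads = 2        # GQA: 14 query heads share 2 KV heads
ctx        = 32_768   # full context length

# Assumptions for illustration only
head_dim       = 64
bytes_per_elem = 2    # FP16

def kv_cache_bytes(kv_heads):
    # 2x for keys and values, per layer, per position
    return 2 * n_layers * kv_heads * head_dim * ctx * bytes_per_elem

mha_cache = kv_cache_bytes(n_q_heads)   # hypothetical cache with one KV head per query head
gqa_cache = kv_cache_bytes(n_kv_heads)  # cache with the card's GQA layout

print(f"GQA cache: {gqa_cache / 2**20:.0f} MiB vs per-query-head cache: {mha_cache / 2**20:.0f} MiB")
```

With these assumptions, the GQA cache is 2/14 the size it would be with a KV head per query head.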

Qwen2.5-1.5B-Instruct-GGUF

Qwen

Qwen2.5 is the latest series of Qwen large language models. For Qwen2.5, we release a number of base language models and instruction-tuned language models ranging from 0.5 to 72 billion parameters. Qwen2.5 brings the following improvements upon Qwen2:
- Significantly more knowledge and greatly improved capabilities in coding and mathematics, thanks to our specialized expert models in these domains.
- Significant improvements in instruction following, generating long texts (over 8K tokens), understanding structured data (e.g., tables), and generating structured outputs, especially JSON. More resilient to the diversity of system prompts, enhancing role-play implementation and condition-setting for chatbots.
- Long-context support up to 128K tokens, with generation of up to 8K tokens.
- Multilingual support for over 29 languages, including Chinese, English, French, Spanish, Portuguese, German, Italian, Russian, Japanese, Korean, Vietnamese, Thai, Arabic, and more.

This repo contains the instruction-tuned 1.5B Qwen2.5 model in the GGUF format, which has the following features:
- Type: Causal Language Models
- Training Stage: Pretraining & Post-training
- Architecture: transformers with RoPE, SwiGLU, RMSNorm, Attention QKV bias and tied word embeddings
- Number of Parameters: 1.54B
- Number of Parameters (Non-Embedding): 1.31B
- Number of Layers: 28
- Number of Attention Heads (GQA): 12 for Q and 2 for KV
- Context Length: Full 32,768 tokens and generation 8,192 tokens
- Quantization: q2_K, q3_K_M, q4_0, q4_K_M, q5_0, q5_K_M, q6_K, q8_0

For more details, please refer to our blog, GitHub, and Documentation. Check out our llama.cpp documentation for a more detailed usage guide. We advise you to clone `llama.cpp` and install it following the official guide. We follow the latest version of llama.cpp. In the following demonstration, we assume that you are running commands under the repository `llama.cpp`.
Since cloning the entire repo may be inefficient, you can manually download the GGUF file that you need or use `huggingface-cli` to fetch it. To achieve a chatbot-like experience, it is recommended to start llama.cpp in conversation mode.

Detailed evaluation results are reported in this πŸ“‘ blog. For quantized models, the benchmark results against the original bfloat16 models can be found here. For requirements on GPU memory and the respective throughput, see the results here. If you find our work helpful, feel free to give us a cite.
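Given the card's 32,768-token context and 8,192-token generation cap, the longest prompt that still leaves room for a full-length reply is simple arithmetic (the helper name is hypothetical):

```python
CONTEXT_LEN = 32_768   # full context length from the card
MAX_GEN     = 8_192    # maximum generation length from the card

def max_prompt_tokens(reserved_for_generation: int = MAX_GEN) -> int:
    """Tokens left for the prompt once the reply budget is reserved."""
    return CONTEXT_LEN - reserved_for_generation

print(max_prompt_tokens())  # tokens available for the prompt at the full generation budget
```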

Qwen3-0.6B-GGUF

unsloth

See our collection for all versions of Qwen3 including GGUF, 4-bit & 16-bit formats. Unsloth Dynamic 2.0 achieves superior accuracy & outperforms other leading quants.
- Fine-tune Qwen3 (14B) for free using our Google Colab notebook here!
- Read our Blog about Qwen3 support: unsloth.ai/blog/qwen3
- View the rest of our notebooks in our docs here.
- Run & export your fine-tuned model to Ollama, llama.cpp or HF.

| Unsloth supports | Free Notebooks | Performance | Memory use |
|-----------------|----------------|-------------|------------|
| Qwen3 (14B) | ▢️ Start on Colab | 3x faster | 70% less |
| GRPO with Qwen3 (8B) | ▢️ Start on Colab | 3x faster | 80% less |
| Llama-3.2 (3B) | ▢️ Start on Colab | 2.4x faster | 58% less |
| Llama-3.2 (11B vision) | ▢️ Start on Colab | 2x faster | 60% less |
| Qwen2.5 (7B) | ▢️ Start on Colab | 2x faster | 60% less |
| Phi-4 (14B) | ▢️ Start on Colab | 2x faster | 50% less |

To Switch Between Thinking and Non-Thinking: If you are using llama.cpp, Ollama, Open WebUI etc., you can add `/think` and `/nothink` to user prompts or system messages to switch the model's thinking mode from turn to turn. The model will follow the most recent instruction in multi-turn conversations.

Qwen3 is the latest generation of large language models in the Qwen series, offering a comprehensive suite of dense and mixture-of-experts (MoE) models. Built upon extensive training, Qwen3 delivers groundbreaking advancements in reasoning, instruction following, agent capabilities, and multilingual support, with the following key features:
- Unique support for seamless switching between thinking mode (for complex logical reasoning, math, and coding) and non-thinking mode (for efficient, general-purpose dialogue) within a single model, ensuring optimal performance across various scenarios.
- Significant enhancement of its reasoning capabilities, surpassing previous QwQ (in thinking mode) and Qwen2.5 instruct models (in non-thinking mode) on mathematics, code generation, and commonsense logical reasoning.
- Superior human preference alignment, excelling in creative writing, role-playing, multi-turn dialogue, and instruction following, delivering a more natural, engaging, and immersive conversational experience.
- Expertise in agent capabilities, enabling precise integration with external tools in both thinking and non-thinking modes and achieving leading performance among open-source models in complex agent-based tasks.
- Support for 100+ languages and dialects with strong capabilities for multilingual instruction following and translation.

Qwen3-0.6B has the following features:
- Type: Causal Language Models
- Training Stage: Pretraining & Post-training
- Number of Parameters: 0.6B
- Number of Parameters (Non-Embedding): 0.44B
- Number of Layers: 28
- Number of Attention Heads (GQA): 16 for Q and 8 for KV
- Context Length: 32,768

For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our blog, GitHub, and Documentation. The code of Qwen3 has been in the latest Hugging Face `transformers` and we advise you to use the latest version of `transformers`.
With `transformers<4.51.0`, you will encounter a `KeyError: 'qwen3'` error. After generation, the thinking content can be split from the final content by locating the last `</think>` token (ID 151668) in the generated IDs:

```python
# `output_ids` is the list of generated token IDs; `tokenizer` is the model's tokenizer.
try:
    # rindex finding 151668 (</think>)
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content)
print("content:", content)
```

For deployment, you can serve the model with vLLM or SGLang:

```shell
vllm serve Qwen/Qwen3-0.6B --enable-reasoning --reasoning-parser deepseek_r1
```

```shell
python -m sglang.launch_server --model-path Qwen/Qwen3-0.6B --reasoning-parser deepseek-r1
```

Thinking mode is enabled by default when applying the chat template:

```python
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True  # True is the default value for enable_thinking
)
```

To disable it:

```python
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False  # Setting enable_thinking=False disables thinking mode
)
```

Here is an example of a multi-turn conversation that switches thinking mode with `/think` and `/nothink`:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

class QwenChatbot:
    def __init__(self, model_name="Qwen/Qwen3-0.6B"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(model_name)
        self.history = []

    def generate_response(self, user_input):
        messages = self.history + [{"role": "user", "content": user_input}]
        text = self.tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True
        )
        inputs = self.tokenizer(text, return_tensors="pt")
        response_ids = self.model.generate(**inputs, max_new_tokens=32768)[0][len(inputs.input_ids[0]):].tolist()
        response = self.tokenizer.decode(response_ids, skip_special_tokens=True)

        # Update history
        self.history.append({"role": "user", "content": user_input})
        self.history.append({"role": "assistant", "content": response})

        return response

# Example Usage
if __name__ == "__main__":
    chatbot = QwenChatbot()

    # First input (without /think or /nothink tags, thinking mode is enabled by default)
    user_input_1 = "How many r's in strawberries?"
    print(f"User: {user_input_1}")
    response_1 = chatbot.generate_response(user_input_1)
    print(f"Bot: {response_1}")
    print("----------------------")

    # Second input with /nothink
    user_input_2 = "Then, how many r's in blueberries? /nothink"
    print(f"User: {user_input_2}")
    response_2 = chatbot.generate_response(user_input_2)
    print(f"Bot: {response_2}")
    print("----------------------")

    # Third input with /think
    user_input_3 = "Really? /think"
    print(f"User: {user_input_3}")
    response_3 = chatbot.generate_response(user_input_3)
    print(f"Bot: {response_3}")
```

For agentic use, Qwen-Agent can wire the model to external tools:

```python
from qwen_agent.agents import Assistant

# Define LLM
llm_cfg = {
    'model': 'Qwen3-0.6B',

    # Use the endpoint provided by Alibaba Model Studio:
    # 'model_type': 'qwen_dashscope',
    # 'api_key': os.getenv('DASHSCOPE_API_KEY'),

    # Use a custom endpoint compatible with OpenAI API:
    'model_server': 'http://localhost:8000/v1',  # api_base
    'api_key': 'EMPTY',

    # Other parameters:
    # 'generate_cfg': {
    #     # Add: when the response content is `<think>this is the thought</think>this is the answer`;
    #     # Do not add: when the response has been separated by reasoning_content and content.
    #     'thought_in_content': True,
    # },
}

# Define Tools
tools = [
    {'mcpServers': {  # You can specify the MCP configuration file
        'time': {
            'command': 'uvx',
            'args': ['mcp-server-time', '--local-timezone=Asia/Shanghai']
        },
        "fetch": {
            "command": "uvx",
            "args": ["mcp-server-fetch"]
        }
    }},
    'code_interpreter',  # Built-in tools
]

# Define Agent
bot = Assistant(llm=llm_cfg, function_list=tools)

# Streaming generation
messages = [{'role': 'user', 'content': 'https://qwenlm.github.io/blog/ Introduce the latest developments of Qwen'}]
for responses in bot.run(messages=messages):
    pass
print(responses)
```

```
@misc{qwen3,
    title  = {Qwen3},
    url    = {https://qwenlm.github.io/blog/qwen3/},
    author = {Qwen Team},
    month  = {April},
    year   = {2025}
}
```

Llama-3.2-1B-Instruct-GGUF

unsloth

See our collection for all versions of Llama 3.2 including GGUF, 4-bit and original 16-bit formats. 16-bit, 8-bit, 6-bit, 5-bit, 4-bit, 3-bit and 2-bit uploads available. Finetune Llama 3.2, Gemma 2, and Mistral 2-5x faster with 70% less memory via Unsloth! We have a free Google Colab Tesla T4 notebook for Llama 3.2 (3B) here: https://colab.research.google.com/drive/1T5-zKWM5OD21QHwXHiV9ixTRR7k3iB9?usp=sharing

unsloth/Llama-3.2-1B-Instruct: for more details on the model, please go to Meta's original model card.

All notebooks are beginner friendly! Add your dataset, click "Run All", and you'll get a 2x faster finetuned model which can be exported to GGUF, vLLM or uploaded to Hugging Face.

| Unsloth supports | Free Notebooks | Performance | Memory use |
|-----------------|----------------|-------------|------------|
| Llama-3.2 (3B) | ▢️ Start on Colab | 2.4x faster | 58% less |
| Llama-3.1 (11B vision) | ▢️ Start on Colab | 2.4x faster | 58% less |
| Llama-3.1 (8B) | ▢️ Start on Colab | 2.4x faster | 58% less |
| Phi-3.5 (mini) | ▢️ Start on Colab | 2x faster | 50% less |
| Gemma 2 (9B) | ▢️ Start on Colab | 2.4x faster | 58% less |
| Mistral (7B) | ▢️ Start on Colab | 2.2x faster | 62% less |
| DPO - Zephyr | ▢️ Start on Colab | 1.9x faster | 19% less |

- This conversational notebook is useful for ShareGPT ChatML / Vicuna templates.
- This text completion notebook is for raw text. This DPO notebook replicates Zephyr.
- Kaggle has 2x T4s, but we use 1. Due to overhead, 1x T4 is 5x faster.

Special Thanks: A huge thank you to the Meta and Llama team for creating and releasing these models.

The Meta Llama 3.2 collection of multilingual large language models (LLMs) is a collection of pretrained and instruction-tuned generative models in 1B and 3B sizes (text in/text out).
The Llama 3.2 instruction-tuned text-only models are optimized for multilingual dialogue use cases, including agentic retrieval and summarization tasks. They outperform many of the available open-source and closed chat models on common industry benchmarks.

Model Architecture: Llama 3.2 is an auto-regressive language model that uses an optimized transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align with human preferences for helpfulness and safety.

Supported languages: English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai are officially supported. Llama 3.2 has been trained on a broader collection of languages than these 8 supported languages. Developers may fine-tune Llama 3.2 models for languages beyond these supported languages, provided they comply with the Llama 3.2 Community License and the Acceptable Use Policy. Developers are always expected to ensure that their deployments, including those that involve additional languages, are completed safely and responsibly.

Llama 3.2 family of models: token counts refer to pretraining data only. All model versions use Grouped-Query Attention (GQA) for improved inference scalability.

Status: This is a static model trained on an offline dataset. Future versions may be released that improve model capabilities and safety.

Where to send questions or comments about the model: instructions on how to provide feedback or comments on the model can be found in the model README. For more technical information about generation parameters and recipes for how to use Llama 3.1 in applications, please go here.

gemma-3n-E2B-it-GGUF

unsloth

Learn how to run & fine-tune Gemma 3n correctly: read our Guide. See our collection for all versions of Gemma 3n including GGUF, 4-bit & 16-bit formats. Unsloth Dynamic 2.0 achieves SOTA accuracy & performance versus other quants.
- Currently only text is supported.
- Ollama: `ollama run hf.co/unsloth/gemma-3n-E4B-it-GGUF:Q4_K_XL` auto-sets the correct chat template and settings.
- Set `temperature = 1.0`, `top_k = 64`, `top_p = 0.95`, `min_p = 0.0`.
- Gemma 3n max tokens (context length): 32K.
- For the Gemma 3n chat template and complete detailed instructions, see our step-by-step guide.
- Fine-tune Gemma 3n (4B) for free using our Google Colab notebook here!
- Read our Blog about Gemma 3n support: unsloth.ai/blog/gemma-3n
- View the rest of our notebooks in our docs here.

| Unsloth supports | Free Notebooks | Performance | Memory use |
|-----------------|----------------|-------------|------------|
| Gemma-3n-E4B | ▢️ Start on Colab | 2x faster | 60% less |
| GRPO with Gemma 3 (1B) | ▢️ Start on Colab | 2x faster | 80% less |
| Gemma 3 (4B) Vision | ▢️ Start on Colab | 2x faster | 60% less |
| Qwen3 (14B) | ▢️ Start on Colab | 2x faster | 60% less |
| DeepSeek-R1-0528-Qwen3-8B (14B) | ▢️ Start on Colab | 2x faster | 80% less |
| Llama-3.2 (3B) | ▢️ Start on Colab | 2.4x faster | 58% less |

- Responsible Generative AI Toolkit
- Gemma on Kaggle
- Gemma on HuggingFace
- Gemma on Vertex Model Garden

Summary description and brief definition of inputs and outputs. Gemma is a family of lightweight, state-of-the-art open models from Google, built from the same research and technology used to create the Gemini models. Gemma 3n models are designed for efficient execution on low-resource devices.
They are capable of multimodal input, handling text, image, video, and audio input, and generating text outputs, with open weights for pre-trained and instruction-tuned variants. These models were trained with data in over 140 spoken languages.

Gemma 3n models use selective parameter activation technology to reduce resource requirements. This technique allows the models to operate at an effective size of 2B and 4B parameters, which is lower than the total number of parameters they contain. For more information on Gemma 3n's efficient parameter management technology, see the Gemma 3n page.

- Input:
  - Text string, such as a question, a prompt, or a document to be summarized
  - Images, normalized to 256x256, 512x512, or 768x768 resolution and encoded to 256 tokens each
  - Audio data encoded to 6.25 tokens per second from a single channel
  - Total input context of 32K tokens
- Output:
  - Generated text in response to the input, such as an answer to a question, analysis of image content, or a summary of a document
  - Total output length up to 32K tokens, minus the request's input tokens

Usage: Below are some code snippets to help you get started quickly with running the model. First, install the Transformers library; Gemma 3n is supported starting from transformers 4.53.0. Then, copy the snippet from the section that is relevant for your use case. You can initialize the model and processor for inference with `pipeline`. With instruction-tuned models, you need to use chat templates to process your inputs first; then you can pass them to the pipeline.

Data used for model training and how the data was processed: these models were trained on a dataset that includes a wide variety of sources totalling approximately 11 trillion tokens. The knowledge cutoff date for the training data was June 2024.
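The input/output token budget above can be sanity-checked with simple arithmetic (a sketch; the request mix below is hypothetical, and "32K" is assumed to mean 32,768 tokens):

```python
CONTEXT = 32_768  # total input context ("32K" tokens, assumed to be 32,768)

# Per the card: each image costs 256 tokens; audio costs 6.25 tokens per second.
def input_tokens(text_tokens, n_images=0, audio_seconds=0.0):
    """Total input tokens consumed by a multimodal request."""
    return text_tokens + 256 * n_images + 6.25 * audio_seconds

# Hypothetical request: 1,000 text tokens, 2 images, 60 s of audio
used = input_tokens(1_000, n_images=2, audio_seconds=60)
remaining_output_budget = CONTEXT - used
print(used, remaining_output_budget)
```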
Here are the key components: - Web Documents: A diverse collection of web text ensures the model is exposed to a broad range of linguistic styles, topics, and vocabulary. The training dataset includes content in over 140 languages. - Code: Exposing the model to code helps it to learn the syntax and patterns of programming languages, which improves its ability to generate code and understand code-related questions. - Mathematics: Training on mathematical text helps the model learn logical reasoning, symbolic representation, and to address mathematical queries. - Images: A wide range of images enables the model to perform image analysis and visual data extraction tasks. - Audio: A diverse set of sound samples enables the model to recognize speech, transcribe text from recordings, and identify information in audio data. The combination of these diverse data sources is crucial for training a powerful multimodal model that can handle a wide variety of different tasks and data formats. Here are the key data cleaning and filtering methods applied to the training data: - CSAM Filtering: Rigorous CSAM (Child Sexual Abuse Material) filtering was applied at multiple stages in the data preparation process to ensure the exclusion of harmful and illegal content. - Sensitive Data Filtering: As part of making Gemma pre-trained models safe and reliable, automated techniques were used to filter out certain personal information and other sensitive data from training sets. - Additional methods: Filtering based on content quality and safety in line with our policies. Implementation Information Gemma was trained using Tensor Processing Unit (TPU) hardware (TPUv4p, TPUv5p and TPUv5e). Training generative models requires significant computational power. TPUs, designed specifically for matrix operations common in machine learning, offer several advantages in this domain: - Performance: TPUs are specifically designed to handle the massive computations involved in training generative models. 
They can speed up training considerably compared to CPUs. - Memory: TPUs often come with large amounts of high-bandwidth memory, allowing for the handling of large models and batch sizes during training. This can lead to better model quality. - Scalability: TPU Pods (large clusters of TPUs) provide a scalable solution for handling the growing complexity of large foundation models. You can distribute training across multiple TPU devices for faster and more efficient processing. - Cost-effectiveness: In many scenarios, TPUs can provide a more cost-effective solution for training large models compared to CPU-based infrastructure, especially when considering the time and resources saved due to faster training. These advantages are aligned with Google's commitments to operate sustainably. Training was done using JAX and ML Pathways. JAX allows researchers to take advantage of the latest generation of hardware, including TPUs, for faster and more efficient training of large models. ML Pathways is Google's latest effort to build artificially intelligent systems capable of generalizing across multiple tasks. This is specially suitable for foundation models, including large language models like these ones. Together, JAX and ML Pathways are used as described in the paper about the Gemini family of models: "the 'single controller' programming model of Jax and Pathways allows a single Python process to orchestrate the entire training run, dramatically simplifying the development workflow." These models were evaluated at full precision (float32) against a large collection of different datasets and metrics to cover different aspects of content generation. Evaluation results marked with IT are for instruction-tuned models. Evaluation results marked with PT are for pre-trained models. 
| Benchmark | Metric | n-shot | E2B PT | E4B PT |
| ------------------------------ |----------------|----------|:--------:|:--------:|
| [HellaSwag][hellaswag] | Accuracy | 10-shot | 72.2 | 78.6 |
| [BoolQ][boolq] | Accuracy | 0-shot | 76.4 | 81.6 |
| [PIQA][piqa] | Accuracy | 0-shot | 78.9 | 81.0 |
| [SocialIQA][socialiqa] | Accuracy | 0-shot | 48.8 | 50.0 |
| [TriviaQA][triviaqa] | Accuracy | 5-shot | 60.8 | 70.2 |
| [Natural Questions][naturalq] | Accuracy | 5-shot | 15.5 | 20.9 |
| [ARC-c][arc] | Accuracy | 25-shot | 51.7 | 61.6 |
| [ARC-e][arc] | Accuracy | 0-shot | 75.8 | 81.6 |
| [WinoGrande][winogrande] | Accuracy | 5-shot | 66.8 | 71.7 |
| [BIG-Bench Hard][bbh] | Accuracy | few-shot | 44.3 | 52.9 |
| [DROP][drop] | Token F1 score | 1-shot | 53.9 | 60.8 |

[hellaswag]: https://arxiv.org/abs/1905.07830
[boolq]: https://arxiv.org/abs/1905.10044
[piqa]: https://arxiv.org/abs/1911.11641
[socialiqa]: https://arxiv.org/abs/1904.09728
[triviaqa]: https://arxiv.org/abs/1705.03551
[naturalq]: https://github.com/google-research-datasets/natural-questions
[arc]: https://arxiv.org/abs/1911.01547
[winogrande]: https://arxiv.org/abs/1907.10641
[bbh]: https://paperswithcode.com/dataset/bbh
[drop]: https://arxiv.org/abs/1903.00161

| Benchmark | Metric | n-shot | E2B IT | E4B IT |
| ------------------------------------|-------------------------|----------|:--------:|:--------:|
| [MGSM][mgsm] | Accuracy | 0-shot | 53.1 | 60.7 |
| [WMT24++][wmt24pp] (ChrF) | Character-level F-score | 0-shot | 42.7 | 50.1 |
| [Include][include] | Accuracy | 0-shot | 38.6 | 57.2 |
| [MMLU][mmlu] (ProX) | Accuracy | 0-shot | 8.1 | 19.9 |
| [OpenAI MMLU][openai-mmlu] | Accuracy | 0-shot | 22.3 | 35.6 |
| [Global-MMLU][global-mmlu] | Accuracy | 0-shot | 55.1 | 60.3 |
| [ECLeKTic][eclektic] | ECLeKTic score | 0-shot | 2.5 | 1.9 |

[mgsm]: https://arxiv.org/abs/2210.03057
[wmt24pp]: https://arxiv.org/abs/2502.12404v1
[include]: https://arxiv.org/abs/2411.19799
[mmlu]: https://arxiv.org/abs/2009.03300
[openai-mmlu]: https://huggingface.co/datasets/openai/MMMLU
[global-mmlu]: https://huggingface.co/datasets/CohereLabs/Global-MMLU
[eclektic]: https://arxiv.org/abs/2502.21228

| Benchmark | Metric | n-shot | E2B IT | E4B IT |
| ------------------------------------|--------------------------|----------|:--------:|:--------:|
| [GPQA][gpqa] Diamond | RelaxedAccuracy/accuracy | 0-shot | 24.8 | 23.7 |
| [LiveCodeBench][lcb] v5 | pass@1 | 0-shot | 18.6 | 25.7 |
| Codegolf v2.2 | pass@1 | 0-shot | 11.0 | 16.8 |
| [AIME 2025][aime-2025] | Accuracy | 0-shot | 6.7 | 11.6 |

[gpqa]: https://arxiv.org/abs/2311.12022
[lcb]: https://arxiv.org/abs/2403.07974
[aime-2025]: https://www.vals.ai/benchmarks/aime-2025-05-09

| Benchmark | Metric | n-shot | E2B IT | E4B IT |
| ------------------------------------ |------------|----------|:--------:|:--------:|
| [MMLU][mmlu] | Accuracy | 0-shot | 60.1 | 64.9 |
| [MBPP][mbpp] | pass@1 | 3-shot | 56.6 | 63.6 |
| [HumanEval][humaneval] | pass@1 | 0-shot | 66.5 | 75.0 |
| [LiveCodeBench][lcb] | pass@1 | 0-shot | 13.2 | 13.2 |
| HiddenMath | Accuracy | 0-shot | 27.7 | 37.7 |
| [Global-MMLU-Lite][global-mmlu-lite] | Accuracy | 0-shot | 59.0 | 64.5 |
| [MMLU][mmlu] (Pro) | Accuracy | 0-shot | 40.5 | 50.6 |

[mbpp]: https://arxiv.org/abs/2108.07732
[humaneval]: https://arxiv.org/abs/2107.03374
[global-mmlu-lite]: https://huggingface.co/datasets/CohereForAI/Global-MMLU-Lite

Our evaluation methods include structured evaluations and internal red-teaming testing of relevant content policies. Red-teaming was conducted by a number of different teams, each with different goals and human evaluation metrics. These models were evaluated against a number of different categories relevant to ethics and safety, including:
- Child Safety: Evaluation of text-to-text and image-to-text prompts covering child safety policies, including child sexual abuse and exploitation.
- Content Safety: Evaluation of text-to-text and image-to-text prompts covering safety policies, including harassment, violence and gore, and hate speech.
- Representational Harms: Evaluation of text-to-text and image-to-text prompts covering safety policies, including bias, stereotyping, and harmful associations or inaccuracies.

In addition to development-level evaluations, we conduct "assurance evaluations", which are our 'arms-length' internal evaluations for responsibility governance decision making. They are conducted separately from the model development team to inform decision making about release. High-level findings are fed back to the model team, but prompt sets are held out to prevent overfitting and preserve the results' ability to inform decision making. Notable assurance evaluation results are reported to our Responsibility & Safety Council as part of release review.

For all areas of safety testing, we saw safe levels of performance across the categories of child safety, content safety, and representational harms relative to previous Gemma models. All testing was conducted without safety filters to evaluate the model's capabilities and behaviors. For text-to-text, image-to-text, and audio-to-text, and across all model sizes, the model produced minimal policy violations, and showed significant improvements over previous Gemma models' performance with respect to high-severity violations. A limitation of our evaluations was that they included primarily English-language prompts.

These models have certain limitations that users should be aware of.

Open generative models have a wide range of applications across various industries and domains. The following list of potential uses is not comprehensive. The purpose of this list is to provide contextual information about the possible use cases that the model creators considered as part of model training and development.
- Content Creation and Communication
  - Text Generation: Generate creative text formats such as poems, scripts, code, marketing copy, and email drafts.
  - Chatbots and Conversational AI: Power conversational interfaces for customer service, virtual assistants, or interactive applications.
  - Text Summarization: Generate concise summaries of a text corpus, research papers, or reports.
  - Image Data Extraction: Extract, interpret, and summarize visual data for text communications.
  - Audio Data Extraction: Transcribe spoken language, translate speech to text in other languages, and analyze sound-based data.
- Research and Education
  - Natural Language Processing (NLP) and Generative Model Research: These models can serve as a foundation for researchers to experiment with generative models and NLP techniques, develop algorithms, and contribute to the advancement of the field.
  - Language Learning Tools: Support interactive language learning experiences, aiding in grammar correction or providing writing practice.
  - Knowledge Exploration: Assist researchers in exploring large bodies of data by generating summaries or answering questions about specific topics.

Limitations

- Training Data
  - The quality and diversity of the training data significantly influence the model's capabilities. Biases or gaps in the training data can lead to limitations in the model's responses.
  - The scope of the training dataset determines the subject areas the model can handle effectively.
- Context and Task Complexity
  - Models are better at tasks that can be framed with clear prompts and instructions. Open-ended or highly complex tasks might be challenging.
  - A model's performance can be influenced by the amount of context provided (longer context generally leads to better outputs, up to a certain point).
- Language Ambiguity and Nuance
  - Natural language is inherently complex. Models might struggle to grasp subtle nuances, sarcasm, or figurative language.
- Factual Accuracy
  - Models generate responses based on information they learned from their training datasets, but they are not knowledge bases. They may generate incorrect or outdated factual statements.
- Common Sense
  - Models rely on statistical patterns in language. They might lack the ability to apply common-sense reasoning in certain situations.

Ethical Considerations and Risks

The development of generative models raises several ethical concerns. In creating an open model, we have carefully considered the following:

- Bias and Fairness
  - Generative models trained on large-scale, real-world text and image data can reflect socio-cultural biases embedded in the training material. These models underwent careful scrutiny; input data pre-processing is described, and posterior evaluations are reported, in this card.
- Misinformation and Misuse
  - Generative models can be misused to generate text that is false, misleading, or harmful.
  - Guidelines are provided for responsible use with the model; see the Responsible Generative AI Toolkit.
- Transparency and Accountability:
  - This model card summarizes details on the models' architecture, capabilities, limitations, and evaluation processes.
  - A responsibly developed open model offers the opportunity to share innovation by making generative model technology accessible to developers and researchers across the AI ecosystem.

Risks identified and mitigations:

- Perpetuation of biases: Continuous monitoring (using evaluation metrics and human review) and the exploration of de-biasing techniques during model training, fine-tuning, and other use cases are encouraged.
- Generation of harmful content: Mechanisms and guidelines for content safety are essential. Developers are encouraged to exercise caution and implement appropriate content safety safeguards based on their specific product policies and application use cases.
- Misuse for malicious purposes: Technical limitations and developer and end-user education can help mitigate against malicious applications of generative models. Educational resources and reporting mechanisms for users to flag misuse are provided. Prohibited uses of Gemma models are outlined in the Gemma Prohibited Use Policy.
- Privacy violations: Models were trained on data filtered for removal of certain personal information and other sensitive data. Developers are encouraged to adhere to privacy regulations with privacy-preserving techniques.

Benefits

At the time of release, this family of models provides high-performance open generative model implementations designed from the ground up for responsible AI development. Using the benchmark evaluation metrics described in this document, these models have been shown to provide superior performance to other, comparably sized open model alternatives.

Kimi-K2-Thinking-GGUF

unsloth

Nov 8: We collaborated with the Kimi team on a system prompt fix. Unsloth Dynamic 2.0 achieves superior accuracy and outperforms other leading quants. We recommend 247 GB of RAM to run the 1-bit Dynamic GGUF. To run the model closer to full precision, you can use the 'UD-Q4_K_XL' quant, which requires 646 GB of RAM.

Kimi K2 Thinking is the latest, most capable version of the open-source thinking model. Starting with Kimi K2, we built it as a thinking agent that reasons step-by-step while dynamically invoking tools. It sets a new state-of-the-art on Humanity's Last Exam (HLE), BrowseComp, and other benchmarks by dramatically scaling multi-step reasoning depth and maintaining stable tool use across 200–300 sequential calls. At the same time, K2 Thinking is a native INT4 quantization model with a 256k context window, achieving lossless reductions in inference latency and GPU memory usage.

Key Features

- Deep Thinking & Tool Orchestration: End-to-end trained to interleave chain-of-thought reasoning with function calls, enabling autonomous research, coding, and writing workflows that last hundreds of steps without drift.
- Native INT4 Quantization: Quantization-Aware Training (QAT) is employed in the post-training stage to achieve a lossless 2x speed-up in low-latency mode.
- Stable Long-Horizon Agency: Maintains coherent goal-directed behavior across up to 200–300 consecutive tool invocations, surpassing prior models that degrade after 30–50 steps.
| | |
|:---:|:---:|
| Architecture | Mixture-of-Experts (MoE) |
| Total Parameters | 1T |
| Activated Parameters | 32B |
| Number of Layers (Dense layer included) | 61 |
| Number of Dense Layers | 1 |
| Attention Hidden Dimension | 7168 |
| MoE Hidden Dimension (per Expert) | 2048 |
| Number of Attention Heads | 64 |
| Number of Experts | 384 |
| Selected Experts per Token | 8 |
| Number of Shared Experts | 1 |
| Vocabulary Size | 160K |
| Context Length | 256K |
| Attention Mechanism | MLA |
| Activation Function | SwiGLU |

Reasoning Tasks

| Benchmark | Setting | K2 Thinking | GPT-5 (High) | Claude Sonnet 4.5 (Thinking) | K2 0905 | DeepSeek-V3.2 | Grok-4 |
|:----------:|:--------:|:------------:|:------:|:----------------------------:|:--------:|:--------------:|:-------:|
| HLE (Text-only) | no tools | 23.9 | 26.3 | 19.8 | 7.9 | 19.8 | 25.4 |
| | w/ tools | 44.9 | 41.7 | 32.0 | 21.7 | 20.3 | 41.0 |
| | heavy | 51.0 | 42.0 | - | - | - | 50.7 |
| AIME25 | no tools | 94.5 | 94.6 | 87.0 | 51.0 | 89.3 | 91.7 |
| | w/ python | 99.1 | 99.6 | 100.0 | 75.2 | 58.1 | 98.8 |
| | heavy | 100.0 | 100.0 | - | - | - | 100.0 |
| HMMT25 | no tools | 89.4 | 93.3 | 74.6 | 38.8 | 83.6 | 90.0 |
| | w/ python | 95.1 | 96.7 | 88.8 | 70.4 | 49.5 | 93.9 |
| | heavy | 97.5 | 100.0 | - | - | - | 96.7 |
| IMO-AnswerBench | no tools | 78.6 | 76.0 | 65.9 | 45.8 | 76.0 | 73.1 |
| GPQA | no tools | 84.5 | 85.7 | 83.4 | 74.2 | 79.9 | 87.5 |

General Tasks

| Benchmark | Setting | K2 Thinking | GPT-5 (High) | Claude Sonnet 4.5 (Thinking) | K2 0905 | DeepSeek-V3.2 |
|:----------:|:--------:|:------------:|:------:|:----------------------------:|:--------:|:--------------:|
| MMLU-Pro | no tools | 84.6 | 87.1 | 87.5 | 81.9 | 85.0 |
| MMLU-Redux | no tools | 94.4 | 95.3 | 95.6 | 92.7 | 93.7 |
| Longform Writing | no tools | 73.8 | 71.4 | 79.8 | 62.8 | 72.5 |
| HealthBench | no tools | 58.0 | 67.2 | 44.2 | 43.8 | 46.9 |

Agentic Search Tasks

| Benchmark | Setting | K2 Thinking | GPT-5 (High) | Claude Sonnet 4.5 (Thinking) | K2 0905 | DeepSeek-V3.2 |
|:----------:|:--------:|:------------:|:------:|:----------------------------:|:--------:|:--------------:|
| BrowseComp | w/ tools | 60.2 | 54.9 | 24.1 | 7.4 | 40.1 |
| BrowseComp-ZH | w/ tools | 62.3 | 63.0 | 42.4 | 22.2 | 47.9 |
| Seal-0 | w/ tools | 56.3 | 51.4 | 53.4 | 25.2 | 38.5 |
| FinSearchComp-T3 | w/ tools | 47.4 | 48.5 | 44.0 | 10.4 | 27.0 |
| Frames | w/ tools | 87.0 | 86.0 | 85.0 | 58.1 | 80.2 |

Coding Tasks

| Benchmark | Setting | K2 Thinking | GPT-5 (High) | Claude Sonnet 4.5 (Thinking) | K2 0905 | DeepSeek-V3.2 |
|:----------:|:--------:|:------------:|:------:|:----------------------------:|:--------:|:--------------:|
| SWE-bench Verified | w/ tools | 71.3 | 74.9 | 77.2 | 69.2 | 67.8 |
| SWE-bench Multilingual | w/ tools | 61.1 | 55.3 | 68.0 | 55.9 | 57.9 |
| Multi-SWE-bench | w/ tools | 41.9 | 39.3 | 44.3 | 33.5 | 30.6 |
| SciCode | no tools | 44.8 | 42.9 | 44.7 | 30.7 | 37.7 |
| LiveCodeBenchV6 | no tools | 83.1 | 87.0 | 64.0 | 56.1 | 74.1 |
| OJ-Bench (cpp) | no tools | 48.7 | 56.2 | 30.4 | 25.5 | 38.2 |
| Terminal-Bench | w/ simulated tools (JSON) | 47.1 | 43.8 | 51.0 | 44.5 | 37.7 |

1. To ensure a fast, lightweight experience, we selectively employ a subset of tools and reduce the number of tool call steps under the chat mode on kimi.com. As a result, chatting on kimi.com may not reproduce our benchmark scores. Our agentic mode will be updated soon to reflect the full capabilities of K2 Thinking.
2. Testing Details:
   2.1. All benchmarks were evaluated at temperature = 1.0 and a 256k context length for K2 Thinking, except for SciCode, for which we followed the official temperature setting of 0.0.
   2.2. HLE (no tools), AIME25, HMMT25, and GPQA were capped at a 96k thinking-token budget, while IMO-AnswerBench, LiveCodeBench, and OJ-Bench were capped at a 128k thinking-token budget. Longform Writing was capped at a 32k completion-token budget.
   2.3.
For AIME and HMMT (no tools), we report the average of 32 runs (avg@32). For AIME and HMMT (with Python), we report the average of 16 runs (avg@16). For IMO-AnswerBench, we report the average of 8 runs (avg@8).
3. Baselines:
   3.1. GPT-5, Claude Sonnet 4.5, Grok-4, and DeepSeek-V3.2 results are quoted from the GPT-5 post, the GPT-5 for Developers post, the GPT-5 system card, the Claude Sonnet 4.5 post, the Grok-4 post, the DeepSeek-V3.2 post, the public Terminal-Bench leaderboard (Terminus-2), the public Vals AI leaderboard, and Artificial Analysis. Benchmarks for which no public scores were available were re-tested under the same conditions used for K2 Thinking and are marked with an asterisk (*). For the GPT-5 tests, we set the reasoning effort to high.
   3.2. The GPT-5 and Grok-4 scores on the HLE full set with tools are 35.2 and 38.6 from the official posts. In our internal evaluation on the HLE text-only subset, GPT-5 scores 41.7 and Grok-4 scores 38.6 (Grok-4's launch cited 41.0 on the text-only subset). For GPT-5's HLE text-only score without tools, we use the score from Scale.ai. The official GPT-5 HLE full-set score without tools is 24.8.
   3.3. For IMO-AnswerBench: GPT-5 scored 65.6 in the benchmark paper. We re-evaluated GPT-5 with the official API and obtained a score of 76.
4. For HLE (w/ tools) and the agentic-search benchmarks:
   4.1. K2 Thinking was equipped with search, code-interpreter, and web-browsing tools.
   4.2. BrowseComp-ZH, Seal-0, and FinSearchComp-T3 were run 4 times independently and the average is reported (avg@4).
   4.3. The evaluation used o3-mini as judge, configured identically to the official HLE setting; judge prompts were taken verbatim from the official repository.
   4.4. On HLE, the maximum step limit was 120, with a 48k-token reasoning budget per step; on agentic-search tasks, the limit was 300 steps with a 24k-token reasoning budget per step.
   4.5. When tool execution results cause the accumulated input to exceed the model's context limit (256k), we employ a simple context-management strategy that hides all previous tool outputs.
   4.6. Web access to Hugging Face may lead to data leakage in certain benchmark tests, such as HLE. K2 Thinking can achieve a score of 51.3 on HLE without blocking Hugging Face. To ensure a fair and rigorous comparison, we blocked access to Hugging Face during testing.
5. For Coding Tasks:
   5.1. Terminal-Bench scores were obtained with the default agent framework (Terminus-2) and the provided JSON parser.
   5.2. For other coding tasks, results were produced with our in-house evaluation harness. The harness is derived from SWE-agent, but we clamp the context windows of the Bash and Edit tools and rewrite the system prompt to match the task semantics.
   5.3. All reported coding-task scores are averaged over 5 independent runs.
6. Heavy Mode: K2 Thinking Heavy Mode employs an efficient parallel strategy: it first rolls out eight trajectories simultaneously, then reflectively aggregates all outputs to generate the final result. Heavy mode for GPT-5 denotes the official GPT-5 Pro score.

Low-bit quantization is an effective way to reduce inference latency and GPU memory usage on large-scale inference servers. However, thinking models use long decoding lengths, so quantization often results in substantial performance drops. To overcome this challenge, we adopt Quantization-Aware Training (QAT) during the post-training phase, applying INT4 weight-only quantization to the MoE components. This allows K2 Thinking to support native INT4 inference with a roughly 2x generation-speed improvement while achieving state-of-the-art performance. All benchmark results are reported under INT4 precision. The checkpoints are saved in compressed-tensors format, which is supported by most mainstream inference engines.
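To make the idea of INT4 weight-only quantization concrete, here is a minimal sketch of a generic symmetric group-wise scheme. This is illustrative only: the group size and rounding rule are assumptions, not Moonshot's actual QAT recipe or the compressed-tensors layout.

```python
# Generic symmetric group-wise INT4 quantization sketch (NOT Moonshot's
# exact recipe; group_size and rounding are illustrative assumptions).

def quantize_int4(weights, group_size=32):
    """Map each group of floats to int4 codes in [-8, 7] plus one scale."""
    quantized, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        scale = max(abs(w) for w in group) / 7 or 1.0
        scales.append(scale)
        quantized.extend(max(-8, min(7, round(w / scale))) for w in group)
    return quantized, scales

def dequantize_int4(quantized, scales, group_size=32):
    """Recover approximate floats by rescaling each group's codes."""
    return [q * scales[i // group_size] for i, q in enumerate(quantized)]

weights = [0.12, -0.07, 0.33, -0.5, 0.01, 0.25, -0.18, 0.09]
q, s = quantize_int4(weights, group_size=8)
restored = dequantize_int4(q, s, group_size=8)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
```

The round trip shows why INT4 is "lossy but bounded": within a group, the reconstruction error never exceeds half the group's scale, which QAT then teaches the model to tolerate.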
If you need the checkpoints in higher precision such as FP8 or BF16, you can refer to the official compressed-tensors repo to unpack the INT4 weights and convert them to any higher precision.

Deployment

> [!Note]
> You can access K2 Thinking's API at https://platform.moonshot.ai; we provide an OpenAI/Anthropic-compatible API.

Currently, Kimi-K2-Thinking is recommended to run on the following inference engines:

Deployment examples can be found in the Model Deployment Guide. Once the local inference service is up, you can interact with it through the chat endpoint:

> [!NOTE]
> The recommended temperature for Kimi-K2-Thinking is `temperature = 1.0`.
> If no special instructions are required, the system prompt above is a good default.

Kimi-K2-Thinking has the same tool-calling settings as Kimi-K2-Instruct. To enable them, pass the list of available tools in each request; the model will then autonomously decide when and how to invoke them. The following example demonstrates calling a weather tool end-to-end:

The `tool_call_with_client` function implements the pipeline from user query to tool execution. This pipeline requires the inference engine to support Kimi-K2's native tool-parsing logic. For more information, see the Tool Calling Guide.

Both the code repository and the model weights are released under the Modified MIT License. If you have any questions, please reach out at [email protected].
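The client side of such a pipeline can be sketched as follows. The tool schema follows the common OpenAI-style function-calling format, and `get_weather`, the message shapes, and the stub result are invented for illustration; they are not Kimi's actual API surface, so consult the Tool Calling Guide for the real interface.

```python
import json

# Sketch of a client-side tool-call dispatch step. The schema mirrors the
# widely used OpenAI-style format; get_weather is a hypothetical example
# tool, and the message dicts are illustrative, not Kimi's exact API.

TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

def get_weather(city):
    return {"city": city, "condition": "sunny"}  # stub implementation

AVAILABLE = {"get_weather": get_weather}

def dispatch_tool_calls(message):
    """Run every tool call in an assistant message; return tool messages."""
    results = []
    for call in message.get("tool_calls", []):
        fn = AVAILABLE[call["function"]["name"]]
        args = json.loads(call["function"]["arguments"])
        results.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": json.dumps(fn(**args)),
        })
    return results

# Simulated assistant turn requesting a tool invocation:
assistant_msg = {
    "role": "assistant",
    "tool_calls": [{
        "id": "call_0",
        "function": {"name": "get_weather",
                     "arguments": '{"city": "Beijing"}'},
    }],
}
tool_msgs = dispatch_tool_calls(assistant_msg)
```

In a real loop, the returned tool messages are appended to the conversation and the model is called again, repeating until it produces a final answer instead of another tool call.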

DeepSeek-R1-Distill-Qwen-1.5B-GGUF

bartowski

Llamacpp imatrix Quantizations of DeepSeek-R1-Distill-Qwen-1.5B

Original model: https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B

All quants made using the imatrix option with the dataset from here.

| Filename | Quant type | File Size | Split | Description |
| -------- | ---------- | --------- | ----- | ----------- |
| DeepSeek-R1-Distill-Qwen-1.5B-f32.gguf | f32 | 7.11GB | false | Full F32 weights. |
| DeepSeek-R1-Distill-Qwen-1.5B-f16.gguf | f16 | 3.56GB | false | Full F16 weights. |
| DeepSeek-R1-Distill-Qwen-1.5B-Q8_0.gguf | Q8_0 | 1.89GB | false | Extremely high quality, generally unneeded but max available quant. |
| DeepSeek-R1-Distill-Qwen-1.5B-Q6_K_L.gguf | Q6_K_L | 1.58GB | false | Uses Q8_0 for embed and output weights. Very high quality, near perfect, recommended. |
| DeepSeek-R1-Distill-Qwen-1.5B-Q6_K.gguf | Q6_K | 1.46GB | false | Very high quality, near perfect, recommended. |
| DeepSeek-R1-Distill-Qwen-1.5B-Q5_K_L.gguf | Q5_K_L | 1.43GB | false | Uses Q8_0 for embed and output weights. High quality, recommended. |
| DeepSeek-R1-Distill-Qwen-1.5B-Q5_K_M.gguf | Q5_K_M | 1.29GB | false | High quality, recommended. |
| DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_L.gguf | Q4_K_L | 1.29GB | false | Uses Q8_0 for embed and output weights. Good quality, recommended. |
| DeepSeek-R1-Distill-Qwen-1.5B-Q5_K_S.gguf | Q5_K_S | 1.26GB | false | High quality, recommended. |
| DeepSeek-R1-Distill-Qwen-1.5B-Q3_K_XL.gguf | Q3_K_XL | 1.18GB | false | Uses Q8_0 for embed and output weights. Lower quality but usable, good for low RAM availability. |
| DeepSeek-R1-Distill-Qwen-1.5B-Q4_1.gguf | Q4_1 | 1.16GB | false | Legacy format, similar performance to Q4_K_S but with improved tokens/watt on Apple silicon. |
| DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_M.gguf | Q4_K_M | 1.12GB | false | Good quality, default size for most use cases, recommended. |
| DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_S.gguf | Q4_K_S | 1.07GB | false | Slightly lower quality with more space savings, recommended. |
| DeepSeek-R1-Distill-Qwen-1.5B-Q4_0.gguf | Q4_0 | 1.07GB | false | Legacy format, offers online repacking for ARM and AVX CPU inference. |
| DeepSeek-R1-Distill-Qwen-1.5B-IQ4_NL.gguf | IQ4_NL | 1.07GB | false | Similar to IQ4_XS, but slightly larger. Offers online repacking for ARM CPU inference. |
| DeepSeek-R1-Distill-Qwen-1.5B-IQ4_XS.gguf | IQ4_XS | 1.02GB | false | Decent quality, smaller than Q4_K_S with similar performance, recommended. |
| DeepSeek-R1-Distill-Qwen-1.5B-Q3_K_L.gguf | Q3_K_L | 0.98GB | false | Lower quality but usable, good for low RAM availability. |
| DeepSeek-R1-Distill-Qwen-1.5B-Q2_K_L.gguf | Q2_K_L | 0.98GB | false | Uses Q8_0 for embed and output weights. Very low quality but surprisingly usable. |
| DeepSeek-R1-Distill-Qwen-1.5B-Q3_K_M.gguf | Q3_K_M | 0.92GB | false | Low quality. |
| DeepSeek-R1-Distill-Qwen-1.5B-IQ3_M.gguf | IQ3_M | 0.88GB | false | Medium-low quality, new method with decent performance comparable to Q3_K_M. |
| DeepSeek-R1-Distill-Qwen-1.5B-Q3_K_S.gguf | Q3_K_S | 0.86GB | false | Low quality, not recommended. |
| DeepSeek-R1-Distill-Qwen-1.5B-IQ3_XS.gguf | IQ3_XS | 0.83GB | false | Lower quality, new method with decent performance, slightly better than Q3_K_S. |
| DeepSeek-R1-Distill-Qwen-1.5B-Q2_K.gguf | Q2_K | 0.75GB | false | Very low quality but surprisingly usable. |
| DeepSeek-R1-Distill-Qwen-1.5B-IQ2_M.gguf | IQ2_M | 0.70GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. |

Some of these quants (Q3_K_XL, Q4_K_L, etc.) are the standard quantization method with the embeddings and output weights quantized to Q8_0 instead of what they would normally default to.

First, make sure you have huggingface-cli installed:

If the model is bigger than 50GB, it will have been split into multiple files.
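The same multi-file download can also be done programmatically with the `huggingface_hub` library's `snapshot_download`, which accepts `allow_patterns` globs. The split-file suffix pattern (`-*-of-*.gguf`) below is an assumption about how shards are named; adjust it to the actual filenames in the repo.

```python
# Sketch: build allow_patterns globs that match both a single-file quant
# and its split shards, then fetch them with huggingface_hub.
# The "-*-of-*.gguf" shard suffix is an assumed naming convention.

def quant_patterns(model, quant):
    """Globs matching a single-file GGUF and any split shards for one quant."""
    return [f"{model}-{quant}.gguf", f"{model}-{quant}-*-of-*.gguf"]

patterns = quant_patterns("DeepSeek-R1-Distill-Qwen-1.5B", "Q8_0")

def download(patterns, local_dir):
    # Third-party dependency: pip install huggingface_hub
    from huggingface_hub import snapshot_download
    snapshot_download(
        repo_id="bartowski/DeepSeek-R1-Distill-Qwen-1.5B-GGUF",
        allow_patterns=patterns,
        local_dir=local_dir,
    )

# download(patterns, "DeepSeek-R1-Distill-Qwen-1.5B-Q8_0")  # network call
```

The `download` call is left commented out because it hits the network; run it once you have verified the patterns against the repo's file list.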
In order to download them all to a local folder, run:

You can either specify a new local-dir (DeepSeek-R1-Distill-Qwen-1.5B-Q8_0) or download them all in place (./).

Previously, you would download Q4_0_4_4/4_8/8_8, and these would have their weights interleaved in memory in order to improve performance on ARM and AVX machines by loading up more data in one pass. Now, however, there is "online repacking" for weights; details are in this PR. If you use Q4_0 and your hardware would benefit from repacking weights, it will do it automatically on the fly. As of llama.cpp build b4282, you will not be able to run the Q4_0_X_X files and will instead need to use Q4_0. Additionally, if you want slightly better quality, you can use IQ4_NL thanks to this PR, which will also repack the weights for ARM, though only the 4_4 variant for now. The loading time may be slower, but it will result in an overall speed increase.

I'm keeping this section to show the potential theoretical uplift in performance from using Q4_0 with online repacking.
Click to view benchmarks on an AVX2 system (EPYC 7702)

| model | size | params | backend | threads | test | t/s | % (vs Q4_0) |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: |
| qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% |
| qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% |
| qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% |
| qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% |
| qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% |
| qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% |
| qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% |
| qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% |
| qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% |
| qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% |
| qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% |
| qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% |
| qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% |
| qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% |
| qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% |
| qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% |
| qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% |
| qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% |

Q4_0_8_8 offers a nice bump to prompt processing and a small bump to text generation.

A great write-up with charts showing various performance comparisons is provided by Artefact2 here.

The first thing to figure out is how big a model you can run.
To do this, you'll need to figure out how much RAM and/or VRAM you have.

If you want your model running as FAST as possible, you'll want to fit the whole thing in your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM.

If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB smaller than that total.

Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'.

If you don't want to think too much, grab one of the K-quants. These are in the format 'QX_K_X', like Q5_K_M.

If you want to get more into the weeds, you can check out this extremely useful feature chart:

But basically, if you're aiming for below Q4, and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in the format IQX_X, like IQ3_M. These are newer and offer better performance for their size.

These I-quants can also be used on CPU and Apple Metal, but will be slower than their K-quant equivalents, so speed vs. performance is a tradeoff you'll have to decide on. The I-quants are not compatible with Vulkan, which is also available for AMD, so if you have an AMD card, double check whether you're using the rocBLAS build or the Vulkan build. At the time of writing, LM Studio has a preview with ROCm support, and other inference engines have specific builds for ROCm.

Thank you kalomaze and Dampf for assistance in creating the imatrix calibration dataset. Thank you ZeroWw for the inspiration to experiment with embed/output.

Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski
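The sizing rule above (pick the largest quant whose file is 1-2GB smaller than your available memory) can be sketched as a small helper. The 1.5GB headroom default and the size table are taken from this card; treat the headroom as a rough rule of thumb, not a guarantee, since KV cache and context length also eat memory.

```python
# Sketch of the quant-selection rule above: largest quant whose file size
# fits in (V)RAM minus headroom. Sizes are from the table on this card;
# the 1.5 GB headroom default is a rule-of-thumb assumption.

QUANT_SIZES_GB = {
    "Q8_0": 1.89, "Q6_K": 1.46, "Q5_K_M": 1.29, "Q4_K_M": 1.12,
    "IQ4_XS": 1.02, "Q3_K_M": 0.92, "IQ3_M": 0.88, "Q2_K": 0.75,
}

def pick_quant(vram_gb, headroom_gb=1.5):
    """Return the largest quant fitting in vram_gb minus headroom, or None."""
    budget = vram_gb - headroom_gb
    fitting = {q: s for q, s in QUANT_SIZES_GB.items() if s <= budget}
    return max(fitting, key=fitting.get) if fitting else None

choice = pick_quant(4.0)  # e.g. a 4 GB GPU
```

For this 1.5B model even a 4 GB card fits the largest quant; the helper matters more for 7B+ models where the table spans several gigabytes.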

Llama-OuteTTS-1.0-1B-GGUF

OuteAI

> [!IMPORTANT]
> Important Sampling Considerations
>
> When using OuteTTS version 1.0, it is crucial to use the settings specified in the Sampling Configuration section.
> The repetition penalty implementation is particularly important - this model requires penalization applied to a 64-token recent window,
> rather than across the entire context window. Penalizing the entire context will cause the model to produce broken or low-quality output.
>
> To address this limitation, all necessary samplers and patches for all backends are set up automatically in the outetts library.
> If using a custom implementation, ensure you correctly implement these requirements.

This update brings significant improvements in speech synthesis and voice cloning, delivering a more powerful, accurate, and user-friendly experience in a compact size.

1. Prompt Revamp & Dependency Removal
   - Automatic Word Alignment: The model now performs word alignment internally. Simply input raw text, with no pre-processing required, and the model handles the rest, streamlining your workflow. For optimal results, use normalized, readable text without newlines (light normalization is applied automatically in the outetts library).
   - Native Multilingual Text Support: Direct support for native text across multiple languages eliminates the need for romanization.
   - Enhanced Metadata Integration: The updated prompt system incorporates additional metadata (time, energy, spectral centroid, pitch) at both global and word levels, improving speaker flow and synthesis quality.
   - Special Tokens for Audio Codebooks: New tokens for c1 (codebook 1) and c2 (codebook 2).
2. New Audio Encoder Model
   - DAC Encoder: Integrates a DAC audio encoder from ibm-research/DAC.speech.v1.0, utilizing two codebooks for high-quality audio reconstruction.
   - Performance Trade-off: Improved audio fidelity increases the token generation rate from 75 to 150 tokens per second. This trade-off prioritizes quality, especially for multilingual applications.
3. Voice Cloning
   - One-Shot Voice Cloning: To achieve one-shot cloning, the model typically requires only around 10 seconds of reference audio to produce an accurate voice representation.
   - Improved Accuracy: Enhanced by the new encoder and additional training metadata, voice cloning is now more natural and precise.
4. Auto Text Alignment & Numerical Support
   - Automatic Text Alignment: Aligns raw text at the word level, even for languages without clear boundaries (e.g., Japanese, Chinese), using insights from pre-processed training data.
   - Direct Numerical Input: Built-in multilingual numerical support allows direct use of numbers in prompts, with no textual conversion needed. (The model typically chooses the dominant language present. Mixing languages in a single prompt may lead to mistakes.)
   - Supported Languages: OuteTTS offers varying proficiency levels across languages, based on training data exposure.
     - High Training Data Languages: These languages feature extensive training: English, Arabic, Chinese, Dutch, French, German, Italian, Japanese, Korean, Lithuanian, Russian, Spanish
     - Moderate Training Data Languages: These languages received moderate training, offering good performance with occasional limitations: Portuguese, Belarusian, Bengali, Georgian, Hungarian, Latvian, Persian/Farsi, Polish, Swahili, Tamil, Ukrainian
     - Beyond Supported Languages: The model can generate speech in untrained languages with varying success. Experiment with unlisted languages, though results may not be optimal.

More Configuration Options

For advanced settings and customization, visit the official repository: 🔗 interface_usage.md

Speaker Reference

The model is designed to be used with a speaker reference. Without one, it generates random vocal characteristics, often leading to lower-quality outputs. The model inherits the referenced speaker's emotion, style, and accent.
When generating speech in other languages with the same speaker, you may observe the model retaining the original accent.

Multilingual Application

It is recommended to create a speaker profile in the language you intend to use. This helps achieve the best results in that specific language, including tone, accent, and linguistic features. While the model supports cross-lingual speech, it still relies on the reference speaker. If the speaker has a distinct accent, such as British English, other languages may carry that accent as well.

Optimal Audio Length

- Best Performance: Generate audio around 42 seconds in a single run (approximately 8,192 tokens). It is recommended not to approach the limits of this window when generating. Usually, the best results are up to 7,000 tokens.
- Context Reduction with Speaker Reference: If the speaker reference is 10 seconds long, the effective context is reduced to approximately 32 seconds.

Temperature Setting Recommendations

Testing shows that a temperature of 0.4 is an ideal starting point for accuracy (with the sampling settings below). However, some voice references may benefit from higher temperatures for enhanced expressiveness or slightly lower temperatures for more precise voice replication.

Verifying Speaker Encoding

If the cloned voice quality is subpar, check the encoded speaker sample. The DAC audio reconstruction model is lossy, and samples with clipping, excessive loudness, or unusual vocal features may introduce encoding issues that impact output quality.

Sampling Configuration

For optimal results with this TTS model, use the following sampling settings.
| Parameter | Value |
|-------------------|----------|
| Temperature | 0.4 |
| Repetition Penalty | 1.1 |
| Repetition Range | 64 |
| Top-k | 40 |
| Top-p | 0.9 |
| Min-p | 0.05 |

- Training Data: Trained on ~60k hours of audio
- Context Length: Supports a maximum context window of 8,192 tokens

Pre-Training

- Optimizer: AdamW
- Batch Size: 1 million tokens
- Max Learning Rate: 3e-4
- Min Learning Rate: 3e-5
- Context Length: 8192

Fine-Tuning

- Optimizer: AdamW
- Max Learning Rate: 1e-5
- Min Learning Rate: 5e-6
- Data: 10,000 diverse, high-quality examples

- Initial Llama3.2 Components: Llama 3.2 Community License Agreement
- Our Continued Pre-Training, Fine-Tuning, and Additional Components: CC-BY-NC-SA-4.0

- Big thanks to Hugging Face for their continued resource support through their grant program!
- Audio encoding and decoding utilize ibm-research/DAC.speech.v1.0
- OuteTTS is built with Llama3.2-1B as the base model, with continued pre-training and fine-tuning.

Ethical Use Guidelines

This text-to-speech model is intended for legitimate applications that enhance accessibility, creativity, and communication. Prohibited uses include impersonation without consent, creation of deliberately misleading content, generation of harmful or harassing material, distribution of synthetic audio without proper disclosure, voice cloning without permission, and any uses that violate applicable laws, regulations, or copyrights.
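The 64-token windowed repetition penalty required by this model can be sketched as follows. This is a generic logit-penalty implementation using the common CTRL-style rule (divide positive logits, multiply negative ones); the outetts library sets this up for you, so treat this purely as an illustration of the windowing.

```python
# Sketch of a windowed repetition penalty: only tokens seen in the last
# `window` generated tokens are penalized, matching the 64-token window
# this model requires. CTRL-style penalty rule assumed for illustration.

def apply_windowed_repetition_penalty(logits, recent_tokens,
                                      penalty=1.1, window=64):
    """Penalize only token ids appearing in the recent `window` tokens."""
    out = list(logits)
    for tok in set(recent_tokens[-window:]):
        # Shrink positive logits, push negative logits further down.
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out

logits = [2.0, -1.0, 0.5, 3.0]
history = [0] + [3] * 63 + [1]  # token 0 is older than the 64-token window
new = apply_windowed_repetition_penalty(logits, history)
```

Here token 0 falls outside the window and keeps its logit, while tokens 1 and 3 are penalized; penalizing the full history instead is exactly the failure mode the note above warns about.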

Qwen3-Embedding-0.6B-GGUF

Qwen

The Qwen3 Embedding model series is the latest embedding model series of the Qwen family, specifically designed for text embedding and ranking tasks. Building upon the dense foundational models of the Qwen3 series, it provides a comprehensive range of text embedding and reranking models in various sizes (0.6B, 4B, and 8B). This series inherits the exceptional multilingual capabilities, long-text understanding, and reasoning skills of its foundational model. The Qwen3 Embedding series represents significant advancements in multiple text embedding and ranking tasks, including text retrieval, code retrieval, text classification, text clustering, and bitext mining.

Exceptional Versatility: The embedding model has achieved state-of-the-art performance across a wide range of downstream application evaluations. The 8B embedding model ranks No. 1 on the MTEB multilingual leaderboard (as of June 5, 2025, score 70.58), while the reranking model excels in various text retrieval scenarios.

Comprehensive Flexibility: The Qwen3 Embedding series offers a full spectrum of sizes (from 0.6B to 8B) for both embedding and reranking models, catering to diverse use cases that prioritize efficiency and effectiveness. Developers can seamlessly combine these two modules. Additionally, the embedding model allows for flexible vector definitions across all dimensions, and both embedding and reranking models support user-defined instructions to enhance performance for specific tasks, languages, or scenarios.

Multilingual Capability: The Qwen3 Embedding series offers support for over 100 languages, thanks to the multilingual capabilities of Qwen3 models. This includes various programming languages, and provides robust multilingual, cross-lingual, and code retrieval capabilities.
Qwen3-Embedding-0.6B-GGUF has the following features:

- Model Type: Text Embedding
- Supported Languages: 100+ Languages
- Number of Parameters: 0.6B
- Context Length: 32k
- Embedding Dimension: Up to 1024, supports user-defined output dimensions ranging from 32 to 1024
- Quantization: q8_0, f16

For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our blog and GitHub.

| Model Type     | Models               | Size | Layers | Sequence Length | Embedding Dimension | MRL Support | Instruction Aware |
|----------------|----------------------|------|--------|-----------------|---------------------|-------------|-------------------|
| Text Embedding | Qwen3-Embedding-0.6B | 0.6B | 28     | 32K             | 1024                | Yes         | Yes               |
| Text Embedding | Qwen3-Embedding-4B   | 4B   | 36     | 32K             | 2560                | Yes         | Yes               |
| Text Embedding | Qwen3-Embedding-8B   | 8B   | 36     | 32K             | 4096                | Yes         | Yes               |
| Text Reranking | Qwen3-Reranker-0.6B  | 0.6B | 28     | 32K             | -                   | -           | Yes               |
| Text Reranking | Qwen3-Reranker-4B    | 4B   | 36     | 32K             | -                   | -           | Yes               |
| Text Reranking | Qwen3-Reranker-8B    | 8B   | 36     | 32K             | -                   | -           | Yes               |

> Note:
> - `MRL Support` indicates whether the embedding model supports custom dimensions for the final embedding.
> - `Instruction Aware` notes whether the embedding or reranking model supports customizing the input instruction according to different tasks.
> - Our evaluation indicates that, for most downstream tasks, using instructions (instruct) typically yields an improvement of 1% to 5% compared to not using them. Therefore, we recommend that developers create tailored instructions specific to their tasks and scenarios. In multilingual contexts, we also advise users to write their instructions in English, as most instructions used during model training were originally written in English.

📌 Tip: We recommend that developers customize the `instruct` according to their specific scenarios, tasks, and languages.
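The user-defined output dimensions described above (MRL support) work by truncating the leading dimensions of the full 1024-dimensional vector and re-normalizing. A minimal sketch in plain Python, using a dummy vector in place of real model output:

```python
import math

def truncate_embedding(vec, dim):
    """Truncate an MRL-style embedding to `dim` dimensions and L2-normalize.

    Assumes the model was trained with Matryoshka Representation Learning,
    so the leading dimensions carry the most information (32 <= dim <= 1024
    for Qwen3-Embedding-0.6B).
    """
    if not 32 <= dim <= len(vec):
        raise ValueError("unsupported output dimension")
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

# Dummy 1024-dim embedding standing in for model output.
full = [1.0] * 1024
small = truncate_embedding(full, 64)
print(len(small))  # 64
```

The renormalization step matters: cosine similarity assumes unit-length vectors, and truncation alone breaks that invariant.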
Our tests have shown that, in most retrieval scenarios, not using an `instruct` on the query side can lead to a drop in retrieval performance of approximately 1% to 5%.

llama.cpp

Check out our llama.cpp documentation for a more detailed usage guide. We advise you to clone `llama.cpp` and install it following the official guide; we follow the latest version of llama.cpp. In the following demonstration, we assume that you are running commands inside the `llama.cpp` repository.

| Model                          | Size | Mean (Task) | Mean (Type) | Bitext Mining | Class. | Clust. | Inst. Retri. | Multi. Class. | Pair. Class. | Rerank | Retri. | STS   |
|--------------------------------|:----:|:-----------:|:-----------:|:-------------:|:------:|:------:|:------------:|:-------------:|:------------:|:------:|:------:|:-----:|
| NV-Embed-v2                    | 7B   | 56.29       | 49.58       | 57.84         | 57.29  | 40.80  | 1.04         | 18.63         | 78.94        | 63.82  | 56.72  | 71.10 |
| GritLM-7B                      | 7B   | 60.92       | 53.74       | 70.53         | 61.83  | 49.75  | 3.45         | 22.77         | 79.94        | 63.78  | 58.31  | 73.33 |
| BGE-M3                         | 0.6B | 59.56       | 52.18       | 79.11         | 60.35  | 40.88  | -3.11        | 20.1          | 80.76        | 62.79  | 54.60  | 74.12 |
| multilingual-e5-large-instruct | 0.6B | 63.22       | 55.08       | 80.13         | 64.94  | 50.75  | -0.40        | 22.91         | 80.86        | 62.61  | 57.12  | 76.81 |
| gte-Qwen2-1.5B-instruct        | 1.5B | 59.45       | 52.69       | 62.51         | 58.32  | 52.05  | 0.74         | 24.02         | 81.58        | 62.58  | 60.78  | 71.61 |
| gte-Qwen2-7B-instruct          | 7B   | 62.51       | 55.93       | 73.92         | 61.55  | 52.77  | 4.94         | 25.48         | 85.13        | 65.55  | 60.08  | 73.98 |
| text-embedding-3-large         | -    | 58.93       | 51.41       | 62.17         | 60.27  | 46.89  | -2.68        | 22.03         | 79.17        | 63.89  | 59.27  | 71.68 |
| Cohere-embed-multilingual-v3.0 | -    | 61.12       | 53.23       | 70.50         | 62.95  | 46.89  | -1.89        | 22.74         | 79.88        | 64.07  | 59.16  | 74.80 |
| Gemini Embedding               | -    | 68.37       | 59.59       | 79.28         | 71.82  | 54.59  | 5.18         | 29.16         | 83.63        | 65.58  | 67.71  | 79.40 |
| Qwen3-Embedding-0.6B           | 0.6B | 64.33       | 56.00       | 72.22         | 66.83  | 52.33  | 5.09         | 24.59         | 80.83        | 61.41  | 64.64  | 76.17 |
| Qwen3-Embedding-4B             | 4B   | 69.45       | 60.86       | 79.36         | 72.33  | 57.15  | 11.56        | 26.77         | 85.05        | 65.08  | 69.60  | 80.86 |
| Qwen3-Embedding-8B             | 8B   | 70.58       | 61.69       | 80.89         | 74.00  | 57.65  | 10.06        | 28.66         | 86.40        | 65.63  | 70.88  | 81.08 |

> Note: For compared models, the scores are retrieved from the MTEB online leaderboard on May 24th, 2025.

| MTEB English / Models          | Param. | Mean(Task) | Mean(Type) | Class. | Clust. | Pair Class. | Rerank. | Retri. | STS   | Summ. |
|--------------------------------|:------:|:----------:|:----------:|:------:|:------:|:-----------:|:-------:|:------:|:-----:|:-----:|
| multilingual-e5-large-instruct | 0.6B   | 65.53      | 61.21      | 75.54  | 49.89  | 86.24       | 48.74   | 53.47  | 84.72 | 29.89 |
| NV-Embed-v2                    | 7.8B   | 69.81      | 65.00      | 87.19  | 47.66  | 88.69       | 49.61   | 62.84  | 83.82 | 35.21 |
| GritLM-7B                      | 7.2B   | 67.07      | 63.22      | 81.25  | 50.82  | 87.29       | 49.59   | 54.95  | 83.03 | 35.65 |
| gte-Qwen2-1.5B-instruct        | 1.5B   | 67.20      | 63.26      | 85.84  | 53.54  | 87.52       | 49.25   | 50.25  | 82.51 | 33.94 |
| stella_en_1.5B_v5              | 1.5B   | 69.43      | 65.32      | 89.38  | 57.06  | 88.02       | 50.19   | 52.42  | 83.27 | 36.91 |
| gte-Qwen2-7B-instruct          | 7.6B   | 70.72      | 65.77      | 88.52  | 58.97  | 85.9        | 50.47   | 58.09  | 82.69 | 35.74 |
| gemini-embedding-exp-03-07     | -      | 73.3       | 67.67      | 90.05  | 59.39  | 87.7        | 48.59   | 64.35  | 85.29 | 38.28 |
| Qwen3-Embedding-0.6B           | 0.6B   | 70.70      | 64.88      | 85.76  | 54.05  | 84.37       | 48.18   | 61.83  | 86.57 | 33.43 |
| Qwen3-Embedding-4B             | 4B     | 74.60      | 68.10      | 89.84  | 57.51  | 87.01       | 50.76   | 68.46  | 88.72 | 34.39 |
| Qwen3-Embedding-8B             | 8B     | 75.22      | 68.71      | 90.43  | 58.57  | 87.52       | 51.56   | 69.44  | 88.58 | 34.83 |

| C-MTEB                         | Param. | Mean(Task) | Mean(Type) | Class. | Clust. | Pair Class. | Rerank. | Retr. | STS   |
|--------------------------------|:------:|:----------:|:----------:|:------:|:------:|:-----------:|:-------:|:-----:|:-----:|
| multilingual-e5-large-instruct | 0.6B   | 58.08      | 58.24      | 69.80  | 48.23  | 64.52       | 57.45   | 63.65 | 45.81 |
| bge-multilingual-gemma2        | 9B     | 67.64      | 75.31      | 59.30  | 86.67  | 68.28       | 73.73   | 55.19 | -     |
| gte-Qwen2-1.5B-instruct        | 1.5B   | 67.12      | 67.79      | 72.53  | 54.61  | 79.5        | 68.21   | 71.86 | 60.05 |
| gte-Qwen2-7B-instruct          | 7.6B   | 71.62      | 72.19      | 75.77  | 66.06  | 81.16       | 69.24   | 75.70 | 65.20 |
| ritrieve_zh_v1                 | 0.3B   | 72.71      | 73.85      | 76.88  | 66.5   | 85.98       | 72.86   | 76.97 | 63.92 |
| Qwen3-Embedding-0.6B           | 0.6B   | 66.33      | 67.45      | 71.40  | 68.74  | 76.42       | 62.58   | 71.03 | 54.52 |
| Qwen3-Embedding-4B             | 4B     | 72.27      | 73.51      | 75.46  | 77.89  | 83.34       | 66.05   | 77.03 | 61.26 |
| Qwen3-Embedding-8B             | 8B     | 73.84      | 75.00      | 76.97  | 80.08  | 84.23       | 66.99   | 78.21 | 63.53 |

If you find our work helpful, feel free to cite us.
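Instruction-aware retrieval, as described in the notes above, can be sketched in plain Python: prefix the query with a task instruction before embedding, then score query/document vectors by cosine similarity. The `Instruct:`/`Query:` template is an assumption based on common instruct-embedding conventions (check the model card for the exact prompt format), and the vectors below are dummy data standing in for model output:

```python
import math

def format_query(task: str, query: str) -> str:
    # Assumed instruct template; verify against the official Qwen3-Embedding docs.
    return f"Instruct: {task}\nQuery: {query}"

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

prompt = format_query(
    "Given a web search query, retrieve relevant passages that answer the query",
    "What is the capital of China?",
)
print(prompt)

# Dummy embeddings; in practice these come from embedding `prompt` and the document.
q, d = [0.1, 0.3, 0.6], [0.1, 0.3, 0.5]
print(round(cosine(q, d), 4))
```

Per the note above, only the query side carries the instruction; documents are embedded as-is.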

gemma-3-1B-it-qat-GGUF

lmstudio-community

👾 LM Studio Community models highlights program. Highlighting new & noteworthy models by the community. Join the conversation on Discord.

Model creator: google
Original model: gemma-3-1b-it
GGUF quantization: provided by Google

Optimized with Quantization Aware Training for improved 4-bit performance. Supports a context length of 32k tokens, with a max output of 8192. Gemma 3 models are well-suited for a variety of text generation tasks, including question answering, summarization, and reasoning.

🙏 Special thanks to Georgi Gerganov and the whole team working on llama.cpp for making all of this possible.

LM Studio is not the creator, originator, or owner of any Model featured in the Community Model Program. Each Community Model is created and provided by third parties. LM Studio does not endorse, support, represent or guarantee the completeness, truthfulness, accuracy, or reliability of any Community Model. You understand that Community Models can produce content that might be offensive, harmful, inaccurate or otherwise inappropriate, or deceptive. Each Community Model is the sole responsibility of the person or entity who originated such Model. LM Studio may not monitor or control the Community Models and cannot, and does not, take responsibility for any such Model. LM Studio disclaims all warranties or guarantees about the accuracy, reliability or benefits of the Community Models. LM Studio further disclaims any warranty that the Community Model will meet your requirements, be secure, uninterrupted or available at any time or location, or error-free, virus-free, or that any errors will be corrected, or otherwise. You will be solely responsible for any damage resulting from your use of or access to the Community Models, your downloading of any Community Model, or use of any other Community Model provided by or through LM Studio.

Huihui-Ling-flash-2.0-abliterated-GGUF

huihui-ai

This is an uncensored version of inclusionAI/Ling-flash-2.0 created with abliteration (see remove-refusals-with-transformers to learn more about it). ggml-org/llama.cpp and im0qianqian/llama.cpp now support conversion to GGUF format, and the result can be tested using llama-cli.

- Risk of Sensitive or Controversial Outputs: This model's safety filtering has been significantly reduced, potentially generating sensitive, controversial, or inappropriate content. Users should exercise caution and rigorously review generated outputs.
- Not Suitable for All Audiences: Due to limited content filtering, the model's outputs may be inappropriate for public settings, underage users, or applications requiring high security.
- Legal and Ethical Responsibilities: Users must ensure their usage complies with local laws and ethical standards. Generated content may carry legal or ethical risks, and users are solely responsible for any consequences.
- Research and Experimental Use: It is recommended to use this model for research, testing, or controlled environments, avoiding direct use in production or public-facing commercial applications.
- Monitoring and Review Recommendations: Users are strongly advised to monitor model outputs in real time and conduct manual reviews when necessary to prevent the dissemination of inappropriate content.
- No Default Safety Guarantees: Unlike standard models, this model has not undergone rigorous safety optimization. huihui.ai bears no responsibility for any consequences arising from its use.

Donation

Your donation helps us continue our development and improvement; a cup of coffee can do it.

- bitcoin:

granite-4.0-h-1b-GGUF

unsloth

Unsloth Dynamic 2.0 achieves superior accuracy & outperforms other leading quants.

Model Summary: Granite-4.0-H-1B is a lightweight instruct model finetuned from Granite-4.0-H-1B-Base using a combination of open-source instruction datasets with permissive licenses and internally collected synthetic datasets. This model is developed using a diverse set of techniques, including supervised finetuning, reinforcement learning, and model merging.

- Developers: Granite Team, IBM
- HF Collection: Granite 4.0 Nano Language Models HF Collection
- GitHub Repository: ibm-granite/granite-4.0-nano-language-models
- Website: Granite Docs
- Release Date: October 28, 2025
- License: Apache 2.0

Supported Languages: English, German, Spanish, French, Japanese, Portuguese, Arabic, Czech, Italian, Korean, Dutch, and Chinese. Users may fine-tune Granite 4.0 Nano models to support languages beyond this list.

Intended use: Granite 4.0 Nano instruct models feature strong instruction-following capabilities, bringing advanced AI capabilities within reach for on-device deployments and research use cases. Additionally, their compact size makes them well-suited for fine-tuning on specialized domains without requiring massive compute resources.

Capabilities

- Summarization
- Text classification
- Text extraction
- Question-answering
- Retrieval Augmented Generation (RAG)
- Code related tasks
- Function-calling tasks
- Multilingual dialog use cases
- Fill-In-the-Middle (FIM) code completions

Generation: This is a simple example of how to use the Granite-4.0-H-1B model. Copy the snippet from the section that is relevant for your use case.

Tool-calling: Granite-4.0-H-1B comes with enhanced tool-calling capabilities, enabling seamless integration with external functions and APIs. To define a list of tools, please follow OpenAI's function definition schema.
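A minimal sketch of a tool list following OpenAI's function definition schema, as referenced above. The `get_weather` function and its parameters are hypothetical placeholders for illustration, not part of the Granite release:

```python
# A single tool described with OpenAI's function definition schema.
# "get_weather" and its parameters are hypothetical examples.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name"},
                },
                "required": ["city"],
            },
        },
    }
]

# A chat template (e.g. tokenizer.apply_chat_template(messages, tools=tools))
# would serialize this list into the model's tool-calling prompt format.
print(tools[0]["function"]["name"])  # get_weather
```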
Benchmarks

Multilingual benchmarks and the included languages:

| Benchmark | # Langs | Languages |
|-----------|---------|-----------|
| MMMLU     | 11      | ar, de, en, es, fr, ja, ko, pt, zh, bn, hi |
| INCLUDE   | 14      | hi, bn, ta, te, ar, de, es, fr, it, ja, ko, nl, pt, zh |

Model Architecture: Granite-4.0-H-1B baseline is based on a decoder-only dense transformer architecture. Core components of this architecture are: GQA, Mamba2, MLP with SwiGLU, RMSNorm, and shared input/output embeddings.

| Metric | 350M Dense | H 350M Dense | 1B Dense | H 1B Dense |
|--------|------------|--------------|----------|------------|
| Number of layers | 28 attention | 4 attention / 28 Mamba2 | 40 attention | 4 attention / 36 Mamba2 |
| MLP / Shared expert hidden size | 2048 | 2048 | 4096 | 4096 |

Training Data: Overall, our SFT data is largely comprised of three key sources: (1) publicly available datasets with permissive license, (2) internal synthetic data targeting specific capabilities, and (3) a select set of human-curated data.

Infrastructure: We trained the Granite 4.0 Nano Language Models utilizing an NVIDIA GB200 NVL72 cluster hosted in CoreWeave. Intra-rack communication occurs via the 72-GPU NVLink domain, and a non-blocking, full Fat-Tree NDR 400 Gb/s InfiniBand network provides inter-rack communication. This cluster provides a scalable and efficient infrastructure for training our models over thousands of GPUs.

Ethical Considerations and Limitations: Granite 4.0 Nano Instruct Models are primarily finetuned using instruction-response pairs, mostly in English, but also multilingual data covering multiple languages. Although this model can handle multilingual dialog use cases, its performance might not match that on English tasks. In such cases, introducing a small number of examples (few-shot) can help the model generate more accurate outputs. While this model has been aligned with safety in mind, it may in some cases produce inaccurate, biased, or unsafe responses to user prompts.
So we urge the community to use this model with proper safety testing and tuning tailored for their specific tasks.

Resources

- ⭐️ Learn about the latest updates with Granite: https://www.ibm.com/granite
- 📄 Get started with tutorials, best practices, and prompt engineering advice: https://www.ibm.com/granite/docs/
- 💡 Learn about the latest Granite learning resources: https://ibm.biz/granite-learning-resources

Ling-flash-2.0-i1-GGUF

mradermacher

weighted/imatrix quants of https://huggingface.co/inclusionAI/Ling-flash-2.0

For a convenient overview and download list, visit our model page for this model. Static quants are available at https://huggingface.co/mradermacher/Ling-flash-2.0-GGUF

Usage

If you are unsure how to use GGUF files, refer to one of TheBloke's READMEs for more details, including on how to concatenate multi-part files.

(sorted by size, not necessarily quality. IQ-quants are often preferable over similar sized non-IQ quants)

| Link | Type | Size/GB | Notes |
|:-----|:-----|--------:|:------|
| GGUF | imatrix | 0.4 | imatrix file (for creating your own quants) |
| GGUF | i1-IQ1_S | 21.1 | for the desperate |
| GGUF | i1-IQ1_M | 23.4 | mostly desperate |
| GGUF | i1-IQ2_XXS | 27.2 | |
| GGUF | i1-IQ2_XS | 30.3 | |
| GGUF | i1-IQ2_S | 30.7 | |
| GGUF | i1-IQ2_M | 33.8 | |
| GGUF | i1-Q2_K_S | 35.1 | very low quality |
| GGUF | i1-Q2_K | 37.8 | IQ3_XXS probably better |
| GGUF | i1-IQ3_XXS | 39.9 | lower quality |
| GGUF | i1-IQ3_XS | 42.3 | |
| GGUF | i1-IQ3_S | 44.7 | beats Q3_K |
| GGUF | i1-Q3_K_S | 44.7 | IQ3_XS probably better |
| GGUF | i1-IQ3_M | 45.3 | |
| GGUF | i1-Q3_K_M | 49.4 | IQ3_S probably better |
| PART 1 PART 2 | i1-Q3_K_L | 53.5 | IQ3_M probably better |
| PART 1 PART 2 | i1-IQ4_XS | 55.1 | |
| PART 1 PART 2 | i1-Q4_0 | 58.5 | fast, low quality |
| PART 1 PART 2 | i1-Q4_K_S | 58.7 | optimal size/speed/quality |
| PART 1 PART 2 | i1-Q4_K_M | 62.5 | fast, recommended |
| PART 1 PART 2 | i1-Q4_1 | 64.6 | |
| PART 1 PART 2 | i1-Q5_K_S | 71.0 | |
| PART 1 PART 2 | i1-Q5_K_M | 73.3 | |
| PART 1 PART 2 | i1-Q6_K | 84.6 | practically like static Q6_K |

Here is a handy graph by ikawrakow comparing some lower-quality quant types (lower is better):

And here are Artefact2's thoughts on the matter: https://gist.github.com/Artefact2/b5f810600771265fc1e39442288e8ec9

See https://huggingface.co/mradermacher/model_requests for some answers to questions you might have and/or if you want some other model quantized.
I thank my company, nethype GmbH, for letting me use its servers and providing upgrades to my workstation to enable this work in my free time. Additional thanks to @nicoboss for giving me access to his private supercomputer, enabling me to provide many more imatrix quants, at much higher quality, than I would otherwise be able to.

Huihui-Ling-mini-2.0-abliterated

huihui-ai

This is an uncensored version of inclusionAI/Ling-mini-2.0 created with abliteration (see remove-refusals-with-transformers to learn more about it). ggml-org/llama.cpp and im0qianqian/llama.cpp now support conversion to GGUF format, and the result can be tested using llama-cli. Q4_K_M may sometimes refuse to respond; it is recommended to use Q8_0 or f16 instead.

- Risk of Sensitive or Controversial Outputs: This model's safety filtering has been significantly reduced, potentially generating sensitive, controversial, or inappropriate content. Users should exercise caution and rigorously review generated outputs.
- Not Suitable for All Audiences: Due to limited content filtering, the model's outputs may be inappropriate for public settings, underage users, or applications requiring high security.
- Legal and Ethical Responsibilities: Users must ensure their usage complies with local laws and ethical standards. Generated content may carry legal or ethical risks, and users are solely responsible for any consequences.
- Research and Experimental Use: It is recommended to use this model for research, testing, or controlled environments, avoiding direct use in production or public-facing commercial applications.
- Monitoring and Review Recommendations: Users are strongly advised to monitor model outputs in real time and conduct manual reviews when necessary to prevent the dissemination of inappropriate content.
- No Default Safety Guarantees: Unlike standard models, this model has not undergone rigorous safety optimization. huihui.ai bears no responsibility for any consequences arising from its use.

Donation

Your donation helps us continue our development and improvement; a cup of coffee can do it.

- bitcoin:

Showing 200 of 200 compatible models

Getting Started

Recommended Sizes

1-3B params
6.5 tok/s • Excellent
3-7B params
2.5 tok/s • Good
7-10B params
1-2 tok/s • Batch only

Popular Models

→ Llama 3.2-3B
→ Phi-3-mini
→ Qwen 2.5 (3B)
→ TinyLlama 1.1B

Quick Setup

# Install llama.cpp for GGUF models (recent versions build with CMake)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && cmake -B build && cmake --build build --config Release

# Run Llama 3.2-3B Q4 GGUF (download llama-3.2-3B.Q4_K_M.gguf first)
./build/bin/llama-cli -m llama-3.2-3B.Q4_K_M.gguf -p "Hello"

# Expected: 6-6.5 tokens/sec on Raspberry Pi 5 (4GB)
