unsloth

✓ VerifiedCommunity

Efficient fine-tuning tools and optimized models

500 models • 122 total models in database

Meta-Llama-3.1-8B-Instruct-bnb-4bit

--- base_model: meta-llama/Llama-3.1-8B-Instruct language: - en library_name: transformers license: llama3.1 tags: - llama-3 - llama - meta - facebook - unsloth - transformers ---

llama
521,359
86

gpt-oss-20b-BF16

--- base_model: - openai/gpt-oss-20b license: apache-2.0 pipeline_tag: text-generation library_name: transformers tags: - vllm - unsloth ---

license:apache-2.0
462,862
24

mistral-7b-v0.3-bnb-4bit

--- language: - en library_name: transformers license: apache-2.0 tags: - unsloth - transformers - mistral - mistral-7b base_model: mistralai/Mistral-7B-v0.3 ---

license:apache-2.0
391,343
21

Meta-Llama-3.1-8B-Instruct

--- language: - en library_name: transformers license: llama3.1 tags: - llama-3 - llama - meta - facebook - unsloth - transformers base_model: meta-llama/Llama-3.1-8B-Instruct ---

llama
353,682
91

Qwen3-8B-bnb-4bit

--- tags: - unsloth - unsloth base_model: - Qwen/Qwen3-8B license: apache-2.0 ---

license:apache-2.0
329,348
5

Llama-3.2-1B-Instruct

--- base_model: meta-llama/Llama-3.2-1B-Instruct language: - en library_name: transformers license: llama3.2 tags: - llama-3 - llama - meta - facebook - unsloth - transformers ---

llama
321,556
79

Llama-3.2-3B-Instruct

--- base_model: meta-llama/Llama-3.2-3B-Instruct language: - en library_name: transformers license: llama3.2 tags: - llama-3 - llama - meta - facebook - unsloth - transformers ---

llama
288,617
81

DeepSeek-R1-Distill-Qwen-32B-bnb-4bit

--- base_model: deepseek-ai/DeepSeek-R1-Distill-Qwen-32B language: - en license: apache-2.0 library_name: transformers tags: - deepseek - qwen - qwen2 - unsloth - transformers ---

license:apache-2.0
239,234
27

gpt-oss-20b-GGUF

--- base_model: - openai/gpt-oss-20b license: apache-2.0 pipeline_tag: text-generation library_name: transformers tags: - openai - unsloth --- > [!NOTE] > GGUF uploads with our fixes. More details and [Read our guide here.](https://docs.unsloth.ai/basics/gpt-oss) > See our collection for all versions of gpt-oss including GGUF, 4-bit & 16-bit formats

license:apache-2.0
239,066
463

Qwen3-30B-A3B-GGUF

--- base_model: Qwen/Qwen3-30B-A3B language: - en library_name: transformers license_link: https://huggingface.co/Qwen/Qwen3-30B-A3B/blob/main/LICENSE license: apache-2.0 tags: - qwen3 - qwen - unsloth - transformers --- See our collection for all versions of Qwen3 including GGUF, 4-bit & 16-bit formats. <em

license:apache-2.0
213,371
261

gemma-3-1b-it

--- base_model: google/gemma-3-1b-it language: - en library_name: transformers license: gemma tags: - unsloth - transformers - gemma3 - gemma - google --- See our collection for all versions of Gemma 3 including GGUF, 4-bit & 16-bit formats. <a href="https://docs.unsloth.ai/basics/tutorial-how-to-run-g

199,842
18

llava-1.5-7b-hf-bnb-4bit

--- base_model: llava-hf/llava-1.5-7b-hf language: - en library_name: transformers pipeline_tag: image-text-to-text license: llama2 tags: - multimodal - llava - vision - unsloth ---

license:llama2
198,006
4

gpt-oss-20b-unsloth-bnb-4bit

--- base_model: - openai/gpt-oss-20b license: apache-2.0 pipeline_tag: text-generation library_name: transformers tags: - openai - unsloth --- See our collection for all versions of gpt-oss including GGUF, 4-bit & 16-bit formats. Learn to run gpt-oss correctly - <a href="https://docs.unsloth.ai/basics/

license:apache-2.0
186,690
34

Qwen3-Next-80B-A3B-Instruct-bnb-4bit

--- tags: - unsloth base_model: - Qwen/Qwen3-Next-80B-A3B-Instruct library_name: transformers license: apache-2.0 license_link: https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Instruct/blob/main/LICENSE pipeline_tag: text-generation --- Unsloth Dynamic 2.0 achieves superior accuracy & outperforms other leading quants. <div style="display: flex; gap: 5px; align-ite

license:apache-2.0
180,688
17

Qwen3-14B-GGUF

--- base_model: Qwen/Qwen3-14B language: - en library_name: transformers license_link: https://huggingface.co/Qwen/Qwen3-14B/blob/main/LICENSE license: apache-2.0 tags: - qwen3 - qwen - unsloth - transformers --- See our collection for all versions of Qwen3 including GGUF, 4-bit & 16-bit formats. Learn t

license:apache-2.0
176,393
91

gpt-oss-120b-GGUF

--- base_model: - openai/gpt-oss-120b license: apache-2.0 pipeline_tag: text-generation library_name: transformers tags: - openai - unsloth --- > [!NOTE] > The F16 quant is gpt-oss in its **original** precision. All GGUFs have our fixes. [Read our guide here.](https://docs.unsloth.ai/basics/gpt-oss) > See our collection for all versions of gpt-oss i

license:apache-2.0
151,125
165

gemma-2-9b-it-bnb-4bit

146,602
30

Qwen3-Coder-30B-A3B-Instruct-GGUF

See [our collection](https://huggingface.co/collections/unsloth/qwen3-680edabfb790c8c34a242f95) for all versions of Qwen3 inclu...

license:apache-2.0
131,020
314

Qwen3-4B-Instruct-2507-unsloth-bnb-4bit

license:apache-2.0
130,849
10

Qwen3-8B-unsloth-bnb-4bit

license:apache-2.0
102,927
11

Qwen2.5-0.5B-Instruct-unsloth-bnb-4bit

license:apache-2.0
96,549
4

DeepSeek-R1-0528-Qwen3-8B-GGUF

Learn how to run DeepSeek-R1-0528 correctly - Read our Guide. See our collection for all versions of R1 including GGUF, 4-bit & 16-bit formats. Unsloth Dynamic 2.0 achieves superior accuracy & out...

license:mit
92,643
337

Qwen3-0.6B-unsloth-bnb-4bit

--- base_model: Qwen/Qwen3-0.6B language: - en library_name: transformers license_link: https://huggingface.co/Qwen/Qwen3-0.6B/blob/main/LICENSE license: apache-2.0 tags: - qwen3 - qwen - unsloth - transformers --- See our collection for all versions of Qwen3 including GGUF, 4-bit & 16-bit formats. Learn

license:apache-2.0
85,127
19

Qwen2.5-VL-7B-Instruct-GGUF

--- base_model: - Qwen/Qwen2.5-VL-7B-Instruct license: apache-2.0 language: - en pipeline_tag: image-text-to-text tags: - multimodal - unsloth library_name: transformers ---

license:apache-2.0
82,509
98

Qwen2.5-7B-Instruct

license:apache-2.0
80,576
18

llama-3-8b-Instruct-bnb-4bit

llama
77,839
133

tinyllama-chat-bnb-4bit

llama
72,210
5

meta-Llama-3.1-8B-unsloth-bnb-4bit

llama
68,758
1

Llama-3.2-1B-Instruct-unsloth-bnb-4bit

llama
68,504
5

gpt-oss-120b

Welcome to the gpt-oss series, OpenAI’s open-weight models designed for powerful reasoning, agentic tasks, and versatile developer use cases.

We’re releasing two flavors of these open models:

- `gpt-oss-120b`: for production, general purpose, high reasoning use cases that fit into a single H100 GPU (117B parameters with 5.1B active parameters)
- `gpt-oss-20b`: for lower latency, and local or specialized use cases (21B parameters with 3.6B active parameters)

Both models were trained on our harmony response format and should only be used with the harmony format, as they will not work correctly otherwise.

> [!NOTE]
> This model card is dedicated to the larger `gpt-oss-120b` model. Check out `gpt-oss-20b` for the smaller model.

- Permissive Apache 2.0 license: Build freely without copyleft restrictions or patent risk. Ideal for experimentation, customization, and commercial deployment.
- Configurable reasoning effort: Easily adjust the reasoning effort (low, medium, high) based on your specific use case and latency needs.
- Full chain-of-thought: Gain complete access to the model’s reasoning process, facilitating easier debugging and increased trust in outputs. It’s not intended to be shown to end users.
- Fine-tunable: Fully customize models to your specific use case through parameter fine-tuning.
- Agentic capabilities: Use the models’ native capabilities for function calling, web browsing, Python code execution, and Structured Outputs.
- Native MXFP4 quantization: The models are trained with native MXFP4 precision for the MoE layer, making `gpt-oss-120b` run on a single H100 GPU and the `gpt-oss-20b` model run within 16GB of memory.

You can use `gpt-oss-120b` and `gpt-oss-20b` with Transformers. If you use the Transformers chat template, it will automatically apply the harmony response format. If you use `model.generate` directly, you need to apply the harmony format manually using the chat template or use our openai-harmony package.

To get started, install the necessary dependencies to set up your environment. Once set up, you can run the model with Transformers. Alternatively, you can run the model via `Transformers Serve` to spin up an OpenAI-compatible webserver. Learn more about how to use gpt-oss with Transformers.

vLLM recommends using uv for Python dependency management. You can use vLLM to spin up an OpenAI-compatible webserver; its serve command automatically downloads the model and starts the server.

To learn about how to use this model with PyTorch and Triton, check out our reference implementations in the gpt-oss repository.

If you are trying to run gpt-oss on consumer hardware, you can use Ollama after installing it. If you are using LM Studio, you can download the model there as well.

Check out our awesome list for a broader collection of gpt-oss resources and inference partners.

You can download the model weights from the Hugging Face Hub directly with the Hugging Face CLI.

You can adjust the reasoning level to suit your task across three levels:

- Low: Fast responses for general dialogue.
- Medium: Balanced speed and detail.
- High: Deep and detailed analysis.

The reasoning level can be set in the system prompt, e.g., "Reasoning: high".

The gpt-oss models are excellent for:

- Web browsing (using built-in browsing tools)
- Function calling with defined schemas
- Agentic operations like browser tasks

Both gpt-oss models can be fine-tuned for a variety of specialized use cases. This larger model `gpt-oss-120b` can be fine-tuned on a single H100 node, whereas the smaller `gpt-oss-20b` can even be fine-tuned on consumer hardware.
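The card says the reasoning level is set in the system prompt (for example "Reasoning: high"). A minimal sketch of building such a message list follows. The helper name `build_messages` and the role/content dictionary layout are assumptions for illustration, not part of the gpt-oss tooling; in practice the list would be passed to a chat template, which applies the harmony format.

```python
# Hypothetical helper: builds a chat message list whose system prompt
# carries the reasoning level described in the card.
VALID_LEVELS = ("low", "medium", "high")


def build_messages(user_text, reasoning="medium"):
    """Return [system, user] messages with a 'Reasoning: <level>' system prompt."""
    if reasoning not in VALID_LEVELS:
        raise ValueError(f"reasoning must be one of {VALID_LEVELS}, got {reasoning!r}")
    return [
        {"role": "system", "content": f"Reasoning: {reasoning}"},
        {"role": "user", "content": user_text},
    ]


messages = build_messages("Explain MXFP4 quantization briefly.", reasoning="high")
for message in messages:
    print(message["role"], "->", message["content"])
```

Validating the level up front keeps a typo such as "hgih" from silently producing a system prompt the model may not interpret as intended.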

license:apache-2.0
67,349
14

gemma-3-4b-it-GGUF

--- base_model: google/gemma-3-4b-it language: - en library_name: transformers license: gemma tags: - unsloth - transformers - gemma3 - gemma - google --- See our collection for all versions of Gemma 3 including GGUF, 4-bit & 16-bit formats. <a href="https://docs.unsloth.ai/basics/tutorial-how-to-run-g

64,503
150

GLM-4.6-GGUF

Read our How to Run GLM-4.6 Guide! > [!NOTE] > Please use latest version of `llama.cpp`. This GGUF includes multiple Unsloth chat template fixes! For `llama.cpp`, please use `--jinja` > Unsloth Dyn...

license:mit
63,537
126

mistral-7b-instruct-v0.3-bnb-4bit

--- language: - en library_name: transformers license: apache-2.0 tags: - unsloth - transformers - mistral - mistral-7b - mistral-instruct - instruct base_model: mistralai/Mistral-7B-Instruct-v0.3 ---

license:apache-2.0
59,133
33

Llama-3.2-3B-Instruct-unsloth-bnb-4bit

llama
59,103
10

Qwen3-1.7B-unsloth-bnb-4bit

license:apache-2.0
57,329
9

Phi-3-mini-4k-instruct-bnb-4bit

Reminder to use the dev version of Transformers: `pip install git+https://github.com/huggingface/transformers.git`

Finetune Phi-3.5, Llama 3.1, Mistral 2-5x faster with 70% less memory via Unsloth!

We have a free Google Colab Tesla T4 notebook for Phi-3.5 (mini) here: https://colab.research.google.com/drive/1lN6hPQveBmHSnTOYifygFcrO8C1bxq4?usp=sharing

All notebooks are beginner friendly! Add your dataset, click "Run All", and you'll get a 2x faster finetuned model which can be exported to GGUF, vLLM or uploaded to Hugging Face.

| Unsloth supports | Free Notebooks | Performance | Memory use |
|------------------|-------------------|-------------|------------|
| Llama-3.1 8b | ▶️ Start on Colab | 2.4x faster | 58% less |
| Phi-3.5 (mini) | ▶️ Start on Colab | 2x faster | 50% less |
| Gemma-2 9b | ▶️ Start on Colab | 2.4x faster | 58% less |
| Mistral 7b | ▶️ Start on Colab | 2.2x faster | 62% less |
| TinyLlama | ▶️ Start on Colab | 3.9x faster | 74% less |
| DPO - Zephyr | ▶️ Start on Colab | 1.9x faster | 19% less |

- This conversational notebook is useful for ShareGPT ChatML / Vicuna templates.
- This text completion notebook is for raw text. This DPO notebook replicates Zephyr.
- \* Kaggle has 2x T4s, but we use 1. Due to overhead, 1x T4 is 5x faster.

Special Thanks: A huge thank you to Microsoft AI and Phi team for creating and releasing these models.

license:mit
57,082
38

MiniMax-M2-GGUF

Today, we release and open source MiniMax-M2, a Mini model built for Max coding & agentic workflows.

MiniMax-M2 redefines efficiency for agents. It's a compact, fast, and cost-effective MoE model (230 billion total parameters with 10 billion active parameters) built for elite performance in coding and agentic tasks, all while maintaining powerful general intelligence. With just 10 billion activated parameters, MiniMax-M2 provides the sophisticated, end-to-end tool use performance expected from today's leading models, but in a streamlined form factor that makes deployment and scaling easier than ever.

- Superior Intelligence. According to benchmarks from Artificial Analysis, MiniMax-M2 demonstrates highly competitive general intelligence across mathematics, science, instruction following, coding, and agentic tool use. Its composite score ranks #1 among open-source models globally.
- Advanced Coding. Engineered for end-to-end developer workflows, MiniMax-M2 excels at multi-file edits, coding-run-fix loops, and test-validated repairs. Strong performance on Terminal-Bench and (Multi-)SWE-Bench-style tasks demonstrates practical effectiveness in terminals, IDEs, and CI across languages.
- Agent Performance. MiniMax-M2 plans and executes complex, long-horizon toolchains across shell, browser, retrieval, and code runners. In BrowseComp-style evaluations, it consistently locates hard-to-surface sources, keeps evidence traceable, and gracefully recovers from flaky steps.
- Efficient Design. With 10 billion activated parameters (230 billion in total), MiniMax-M2 delivers lower latency, lower cost, and higher throughput for interactive agents and batched sampling, in line with the shift toward highly deployable models that still shine on coding and agentic tasks.

These comprehensive evaluations test real-world end-to-end coding and agentic tool use: editing real repos, executing commands, browsing the web, and delivering functional solutions. Performance on this suite correlates with day-to-day developer experience in terminals, IDEs, and CI.

| Benchmark | MiniMax-M2 | Claude Sonnet 4 | Claude Sonnet 4.5 | Gemini 2.5 Pro | GPT-5 (thinking) | GLM-4.6 | Kimi K2 0905 | DeepSeek-V3.2 |
|-----------|------------|-----------------|-------------------|----------------|------------------|---------|--------------|---------------|
| SWE-bench Verified | 69.4 | 72.7 | 77.2 | 63.8 | 74.9 | 68 | 69.2 | 67.8 |
| Multi-SWE-Bench | 36.2 | 35.7 | 44.3 | / | / | 30 | 33.5 | 30.6 |
| SWE-bench Multilingual | 56.5 | 56.9 | 68 | / | / | 53.8 | 55.9 | 57.9 |
| Terminal-Bench | 46.3 | 36.4 | 50 | 25.3 | 43.8 | 40.5 | 44.5 | 37.7 |
| ArtifactsBench | 66.8 | 57.3 | 61.5 | 57.7 | 73 | 59.8 | 54.2 | 55.8 |
| BrowseComp | 44 | 12.2 | 19.6 | 9.9 | 54.9 | 45.1 | 14.1 | 40.1 |
| BrowseComp-zh | 48.5 | 29.1 | 40.8 | 32.2 | 65 | 49.5 | 28.8 | 47.9 |
| GAIA (text only) | 75.7 | 68.3 | 71.2 | 60.2 | 76.4 | 71.9 | 60.2 | 63.5 |
| xbench-DeepSearch | 72 | 64.6 | 66 | 56 | 77.8 | 70 | 61 | 71 |
| HLE (w/ tools) | 31.8 | 20.3 | 24.5 | 28.4 | 35.2 | 30.4 | 26.9 | 27.2 |
| τ²-Bench | 77.2 | 65.5 | 84.7 | 59.2 | 80.1 | 75.9 | 70.3 | 66.7 |
| FinSearchComp-global | 65.5 | 42 | 60.8 | 42.6 | 63.9 | 29.2 | 29.5 | 26.2 |
| AgentCompany | 36 | 37 | 41 | 39.3 | / | 35 | 30 | 34 |

> Notes: Data points marked with an asterisk (*) are taken directly from the model's official tech report or blog. All other metrics were obtained using the evaluation methods described below.
> - SWE-bench Verified: We use the same scaffold as R2E-Gym (Jain et al. 2025) on top of OpenHands to test with agents on SWE tasks. All scores are validated on our internal infrastructure with 128k context length, 100 max steps, and no test-time scaling. All git-related content is removed to ensure the agent sees only the code at the issue point.
> - Multi-SWE-Bench & SWE-bench Multilingual: All scores are averaged across 8 runs using the claude-code CLI (300 max steps) as the evaluation scaffold.
> - Terminal-Bench: All scores are evaluated with the official claude-code from the original Terminal-Bench repository (commit `94bf692`), averaged over 8 runs to report the mean pass rate.
> - ArtifactsBench: All scores are computed by averaging three runs with the official implementation of ArtifactsBench, using the stable Gemini-2.5-Pro as the judge model.
> - BrowseComp & BrowseComp-zh & GAIA (text only) & xbench-DeepSearch: All scores reported use the same agent framework as WebExplorer (Liu et al. 2025), with minor adjustments to the tool descriptions. We use the 103-sample text-only GAIA validation subset following WebExplorer (Liu et al. 2025).
> - HLE (w/ tools): All reported scores are obtained using search tools and a Python tool. The search tools employ the same agent framework as WebExplorer (Liu et al. 2025), and the Python tool runs in a Jupyter environment. We use the text-only HLE subset.
> - τ²-Bench: All scores reported use "extended thinking with tool use", and employ GPT-4.1 as the user simulator.
> - FinSearchComp-global: Official results are reported for GPT-5-Thinking, Gemini 2.5 Pro, and Kimi-K2. Other models are evaluated using the open-source FinSearchComp (Hu et al. 2025) framework using both search and Python tools, launched simultaneously for consistency.
> - AgentCompany: All scores reported use OpenHands 0.42 agent framework.

We align with Artificial Analysis, which aggregates challenging benchmarks using a consistent methodology to reflect a model’s broader intelligence profile across math, science, instruction following, coding, and agentic tool use.

| Metric (AA) | MiniMax-M2 | Claude Sonnet 4 | Claude Sonnet 4.5 | Gemini 2.5 Pro | GPT-5 (thinking) | GLM-4.6 | Kimi K2 0905 | DeepSeek-V3.2 |
|-------------|------------|-----------------|-------------------|----------------|------------------|---------|--------------|---------------|
| AIME25 | 78 | 74 | 88 | 88 | 94 | 86 | 57 | 88 |
| MMLU-Pro | 82 | 84 | 88 | 86 | 87 | 83 | 82 | 85 |
| GPQA-Diamond | 78 | 78 | 83 | 84 | 85 | 78 | 77 | 80 |
| HLE (w/o tools) | 12.5 | 9.6 | 17.3 | 21.1 | 26.5 | 13.3 | 6.3 | 13.8 |
| LiveCodeBench (LCB) | 83 | 66 | 71 | 80 | 85 | 70 | 61 | 79 |
| SciCode | 36 | 40 | 45 | 43 | 43 | 38 | 31 | 38 |
| IFBench | 72 | 55 | 57 | 49 | 73 | 43 | 42 | 54 |
| AA-LCR | 61 | 65 | 66 | 66 | 76 | 54 | 52 | 69 |
| τ²-Bench-Telecom | 87 | 65 | 78 | 54 | 85 | 71 | 73 | 34 |
| Terminal-Bench-Hard | 24 | 30 | 33 | 25 | 31 | 23 | 23 | 29 |
| AA Intelligence | 61 | 57 | 63 | 60 | 69 | 56 | 50 | 57 |

> AA: All scores of MiniMax-M2 aligned with Artificial Analysis Intelligence Benchmarking Methodology (https://artificialanalysis.ai/methodology/intelligence-benchmarking). All scores of other models reported from https://artificialanalysis.ai/.

By maintaining activations around 10B, the plan → act → verify loop in the agentic workflow is streamlined, improving responsiveness and reducing compute overhead:

- Faster feedback cycles in compile-run-test and browse-retrieve-cite chains.
- More concurrent runs on the same budget for regression suites and multi-seed explorations.
- Simpler capacity planning with smaller per-request memory and steadier tail latency.

In short: 10B activations = responsive agent loops + better unit economics.

If you need frontier-style coding and agents without frontier-scale costs, MiniMax-M2 hits the sweet spot: fast inference speeds, robust tool-use capabilities, and a deployment-friendly footprint. We look forward to your feedback and to collaborating with developers and researchers to bring the future of intelligent collaboration one step closer.

- Our product MiniMax Agent, built on MiniMax-M2, is now publicly available and free for a limited time: https://agent.minimax.io/
- The MiniMax-M2 API is now live on the MiniMax Open Platform and is free for a limited time: https://platform.minimax.io/docs/guides/text-generation
- The MiniMax-M2 model weights are now open-source, allowing for local deployment and use: https://huggingface.co/MiniMaxAI/MiniMax-M2.

Download the model from HuggingFace repository: https://huggingface.co/MiniMaxAI/MiniMax-M2. We recommend using the following inference frameworks (listed alphabetically) to serve the model:

- SGLang: We recommend using SGLang to serve MiniMax-M2. SGLang provides solid day-0 support for the MiniMax-M2 model. Please refer to our SGLang Deployment Guide for more details, and thanks so much for our collaboration with the SGLang team.
- vLLM: We recommend using vLLM to serve MiniMax-M2. vLLM provides efficient day-0 support for the MiniMax-M2 model; check https://docs.vllm.ai/projects/recipes/en/latest/MiniMax/MiniMax-M2.html for the latest deployment guide. We also provide our vLLM Deployment Guide.
- MLX: We recommend using MLX-LM to serve MiniMax-M2. Please refer to our MLX Deployment Guide for more details.

Inference Parameters: We recommend using the following parameters for best performance: `temperature=1.0`, `top_p = 0.95`, `top_k = 40`.

IMPORTANT: MiniMax-M2 is an interleaved thinking model. Therefore, when using it, it is important to retain the thinking content from the assistant's turns within the historical messages. In the model's output content, we use the `<think>...</think>` format to wrap the assistant's thinking content. When using the model, you must ensure that the historical content is passed back in its original format. Do not remove the `<think>...</think>` part; otherwise, the model's performance will be negatively affected.

> The projects below are built and maintained by the community/partners. They are not official MiniMax products, and results may vary.

- AnyCoder: a web IDE-style coding assistant Space on Hugging Face that uses MiniMax-M2 as the default model: https://huggingface.co/spaces/akhaliq/anycoder Maintainer: @akhaliq (Hugging Face)
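The MiniMax-M2 card asks that assistant turns be passed back with their thinking content intact and recommends specific sampling parameters. The sketch below illustrates that bookkeeping in plain Python. It assumes the thinking wrapper is a `<think>...</think>` block (the tag text is lost in this listing) and uses hypothetical helper names; it does not call any MiniMax or inference-framework API.

```python
# Recommended sampling parameters from the card, written with the usual
# underscore spellings (an assumption about how a client would name them).
SAMPLING_PARAMS = {"temperature": 1.0, "top_p": 0.95, "top_k": 40}


def record_assistant_turn(history, raw_output):
    """Return a new history with the assistant output appended unchanged.

    The raw output is stored verbatim so any thinking block it contains
    is sent back to the model on the next request.
    """
    return history + [{"role": "assistant", "content": raw_output}]


def thinking_preserved(raw_output, stored_content):
    """Check that the stored turn is byte-for-byte the model's raw output."""
    return stored_content == raw_output


history = [{"role": "user", "content": "List the files changed in the last commit."}]
# Hypothetical model output containing an assumed <think> block.
raw = "<think>I should run git show --stat.</think>Running `git show --stat` now."
history = record_assistant_turn(history, raw)

print(thinking_preserved(raw, history[-1]["content"]))
print(SAMPLING_PARAMS)
```

Storing the raw string, instead of a cleaned-up answer, is the design choice that matters here: any post-processing for display should happen on a copy.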

license:mit
56,110
54

gemma-3-4b-it-unsloth-bnb-4bit

54,760
22

granite-4.0-h-small-GGUF

See our collection for all versions of Granite-4.0 including GGUF, 4-bit & 16-bit formats. Learn to run Granite 4.0 correctly - Read our Guide. See Unsloth Dynamic 2.0 GGUFs for our quantization benchmarks.

- Fine-tune Granite-4.0 for free using our Google Colab notebook
- Read our Blog about Granite-4.0 support: https://docs.unsloth.ai/new/ibm-granite-4.0
- View the rest of our notebooks in our docs here.

Model Summary: Granite-4.0-H-Small is a 32B parameter long-context instruct model finetuned from Granite-4.0-H-Small-Base using a combination of open source instruction datasets with permissive license and internally collected synthetic datasets. This model is developed using a diverse set of techniques with a structured chat format, including supervised finetuning, model alignment using reinforcement learning, and model merging. Granite 4.0 instruct models feature improved instruction following (IF) and tool-calling capabilities, making them more effective in enterprise applications.

- Developers: Granite Team, IBM
- HF Collection: Granite 4.0 Language Models HF Collection
- GitHub Repository: ibm-granite/granite-4.0-language-models
- Website: Granite Docs
- Release Date: October 2nd, 2025
- License: Apache 2.0

Supported Languages: English, German, Spanish, French, Japanese, Portuguese, Arabic, Czech, Italian, Korean, Dutch, and Chinese. Users may finetune Granite 4.0 models for languages beyond these.

Intended use: The model is designed to respond to general instructions and can be used to build AI assistants for multiple domains, including business applications.

Capabilities:

- Summarization
- Text classification
- Text extraction
- Question-answering
- Retrieval Augmented Generation (RAG)
- Code related tasks
- Function-calling tasks
- Multilingual dialog use cases
- Fill-In-the-Middle (FIM) code completions

Generation: This is a simple example of how to use the Granite-4.0-H-Small model: copy the snippet from the section that is relevant for your use case.

Tool-calling: Granite-4.0-H-Small comes with enhanced tool calling capabilities, enabling seamless integration with external functions and APIs. To define a list of tools please follow OpenAI's function definition schema.

Benchmarks: models compared are Micro Dense, H Micro Dense, H Tiny MoE, and H Small MoE.

Multilingual benchmarks and the included languages:

| Benchmark | # Langs | Languages |
|-----------|---------|-----------|
| MMMLU | 11 | ar, de, en, es, fr, ja, ko, pt, zh, bn, hi |
| INCLUDE | 14 | hi, bn, ta, te, ar, de, es, fr, it, ja, ko, nl, pt, zh |

Model Architecture: Granite-4.0-H-Small baseline is built on a decoder-only MoE transformer architecture. Core components of this architecture are: GQA, Mamba2, MoEs with shared experts, SwiGLU activation, RMSNorm, and shared input/output embeddings.

| Model | Micro Dense | H Micro Dense | H Tiny MoE | H Small MoE |
|-------|-------------|---------------|------------|-------------|
| Number of layers | 40 attention | 4 attention / 36 Mamba2 | 4 attention / 36 Mamba2 | 4 attention / 36 Mamba2 |
| MLP / Shared expert hidden size | 8192 | 8192 | 1024 | 1536 |

Training Data: Overall, our SFT data is largely comprised of three key sources: (1) publicly available datasets with permissive license, (2) internal synthetic data targeting specific capabilities, and (3) a select set of human-curated data.

Infrastructure: We trained the Granite 4.0 Language Models utilizing an NVIDIA GB200 NVL72 cluster hosted in CoreWeave. Intra-rack communication occurs via the 72-GPU NVLink domain, and a non-blocking, full Fat-Tree NDR 400 Gb/s InfiniBand network provides inter-rack communication. This cluster provides a scalable and efficient infrastructure for training our models over thousands of GPUs.

Ethical Considerations and Limitations: Granite 4.0 Instruction Models are primarily finetuned using instruction-response pairs mostly in English, but also multilingual data covering multiple languages. Although this model can handle multilingual dialog use cases, its performance might not match its performance on English tasks. In such cases, introducing a small number of examples (few-shot) can help the model generate more accurate outputs. While this model has been aligned with safety in mind, it may in some cases produce inaccurate, biased, or unsafe responses to user prompts. We therefore urge the community to use this model with proper safety testing and tuning tailored to their specific tasks.

Resources

- ⭐️ Learn about the latest updates with Granite: https://www.ibm.com/granite
- 📄 Get started with tutorials, best practices, and prompt engineering advice: https://www.ibm.com/granite/docs/
- 💡 Learn about the latest Granite learning resources: https://ibm.biz/granite-learning-resources
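The Granite card says tool lists should follow OpenAI's function definition schema. As a rough illustration, the sketch below builds one such definition as a plain Python dictionary and serializes it to JSON. The tool name `get_current_weather`, its parameters, and the helper `make_tool` are hypothetical; how the list is handed to the model (for example through a chat template) is not shown and depends on the runtime.

```python
import json


def make_tool(name, description, properties, required):
    """Build one tool entry in the OpenAI function-definition layout."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": {
                "type": "object",
                "properties": properties,
                "required": required,
            },
        },
    }


# Hypothetical tool, used only to show the shape of a definition.
tools = [
    make_tool(
        name="get_current_weather",
        description="Get the current weather for a city.",
        properties={
            "city": {"type": "string", "description": "Name of the city."},
        },
        required=["city"],
    )
]

print(json.dumps(tools, indent=2))
```

Keeping the definitions as JSON-serializable dictionaries makes it easy to check them before sending: every name listed under `required` should also appear under `properties`.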

license:apache-2.0
54,342
38

Qwen3-30B-A3B-Instruct-2507-GGUF

--- library_name: transformers license: apache-2.0 license_link: https://huggingface.co/Qwen/Qwen3-30B-A3B-Instruct-2507/blob/main/LICENSE base_model: - Qwen/Qwen3-30B-A3B-Instruct-2507 tags: - qwen - qwen3 - unsloth --- See our collection for all versions of Qwen3 including GGUF, 4-bit & 16-bit formats. <em

license:apache-2.0
53,533
259

Qwen3-VL-8B-Thinking-1M-GGUF

license:apache-2.0
53,187
3

Qwen3-1.7B-GGUF

See our collection for all versions of Qwen3 including GGUF, 4-bit & 16-bit formats. Unsloth Dynamic 2.0 achieves superior accuracy & outperforms other leading quants. - Fine-tune Qwen3 (14B) for free using our Google Colab notebook here! - Read our Blog about Qwen3 support: unsloth.ai/blog/qwen3 - View the rest of our notebooks in our docs here. - Run & export your fine-tuned model to Ollama, llama.cpp or HF. | Unsloth supports | Free Notebooks | Performance | Memory use | |-----------------|--------------------------------------------------------------------------------------------------------------------------|-------------|----------| | Qwen3 (14B) | ▶️ Start on Colab | 3x faster | 70% less | | GRPO with Qwen3 (8B) | ▶️ Start on Colab | 3x faster | 80% less | | Llama-3.2 (3B) | ▶️ Start on Colab-Conversational.ipynb) | 2.4x faster | 58% less | | Llama-3.2 (11B vision) | ▶️ Start on Colab-Vision.ipynb) | 2x faster | 60% less | | Qwen2.5 (7B) | ▶️ Start on Colab-Alpaca.ipynb) | 2x faster | 60% less | | Phi-4 (14B) | ▶️ Start on Colab | 2x faster | 50% less | To Switch Between Thinking and Non-Thinking If you are using llama.cpp, Ollama, Open WebUI etc., you can add `/think` and `/nothink` to user prompts or system messages to switch the model's thinking mode from turn to turn. The model will follow the most recent instruction in multi-turn conversations. Qwen3 is the latest generation of large language models in Qwen series, offering a comprehensive suite of dense and mixture-of-experts (MoE) models. Built upon extensive training, Qwen3 delivers groundbreaking advancements in reasoning, instruction-following, agent capabilities, and multilingual support, with the following key features: - Uniquely support of seamless switching between thinking mode (for complex logical reasoning, math, and coding) and non-thinking mode (for efficient, general-purpose dialogue) within single model, ensuring optimal performance across various scenarios. 
- Significantly enhancement in its reasoning capabilities, surpassing previous QwQ (in thinking mode) and Qwen2.5 instruct models (in non-thinking mode) on mathematics, code generation, and commonsense logical reasoning. - Superior human preference alignment, excelling in creative writing, role-playing, multi-turn dialogues, and instruction following, to deliver a more natural, engaging, and immersive conversational experience. - Expertise in agent capabilities, enabling precise integration with external tools in both thinking and unthinking modes and achieving leading performance among open-source models in complex agent-based tasks. - Support of 100+ languages and dialects with strong capabilities for multilingual instruction following and translation. Qwen3-1.7B has the following features: - Type: Causal Language Models - Training Stage: Pretraining & Post-training - Number of Parameters: 1.7B - Number of Paramaters (Non-Embedding): 1.4B - Number of Layers: 28 - Number of Attention Heads (GQA): 16 for Q and 8 for KV - Context Length: 32,768 For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our blog, GitHub, and Documentation. The code of Qwen3 has been in the latest Hugging Face `transformers` and we advise you to use the latest version of `transformers`. 
After generation, the output is split into thinking content and final content at the last occurrence of token ID 151668:

```python
# `tokenizer` and `output_ids` come from the preceding generation step
try:
    # find the last occurrence of token ID 151668
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content)
print("content:", content)
```

```shell
vllm serve Qwen/Qwen3-1.7B --enable-reasoning --reasoning-parser deepseek_r1
```

```shell
python -m sglang.launch_server --model-path Qwen/Qwen3-1.7B --reasoning-parser deepseek-r1
```

```python
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True  # True is the default value for enable_thinking
)
```

```python
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False  # Setting enable_thinking=False disables thinking mode
)
```

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

class QwenChatbot:
    def __init__(self, model_name="Qwen/Qwen3-1.7B"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(model_name)
        self.history = []

    def generate_response(self, user_input):
        messages = self.history + [{"role": "user", "content": user_input}]

        text = self.tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True
        )

        inputs = self.tokenizer(text, return_tensors="pt")
        # Keep only the newly generated tokens (drop the prompt tokens)
        response_ids = self.model.generate(**inputs, max_new_tokens=32768)[0][len(inputs.input_ids[0]):].tolist()
        response = self.tokenizer.decode(response_ids, skip_special_tokens=True)

        # Update history
        self.history.append({"role": "user", "content": user_input})
        self.history.append({"role": "assistant", "content": response})

        return response

# Example Usage
if __name__ == "__main__":
    chatbot = QwenChatbot()

    # First input (without /think or /no_think tags, thinking mode is enabled by default)
    user_input_1 = "How many r's in strawberries?"
    print(f"User: {user_input_1}")
    response_1 = chatbot.generate_response(user_input_1)
    print(f"Bot: {response_1}")
    print("----------------------")

    # Second input with /no_think
    user_input_2 = "Then, how many r's in blueberries? /no_think"
    print(f"User: {user_input_2}")
    response_2 = chatbot.generate_response(user_input_2)
    print(f"Bot: {response_2}")
    print("----------------------")

    # Third input with /think
    user_input_3 = "Really? /think"
    print(f"User: {user_input_3}")
    response_3 = chatbot.generate_response(user_input_3)
    print(f"Bot: {response_3}")
```

```python
from qwen_agent.agents import Assistant

# Define LLM
llm_cfg = {
    'model': 'Qwen3-1.7B',

    # Use the endpoint provided by Alibaba Model Studio:
    # 'model_type': 'qwen_dashscope',
    # 'api_key': os.getenv('DASHSCOPE_API_KEY'),

    # Use a custom endpoint compatible with OpenAI API:
    'model_server': 'http://localhost:8000/v1',  # api_base
    'api_key': 'EMPTY',

    # Other parameters:
    # 'generate_cfg': {
    #     # Add: When the response content is `<think>this is the thought</think>this is the answer;
    #     # Do not add: When the response has been separated by reasoning_content and content.
    #     'thought_in_content': True,
    # },
}

# Define Tools
tools = [
    {'mcpServers': {  # You can specify the MCP configuration file
        'time': {
            'command': 'uvx',
            'args': ['mcp-server-time', '--local-timezone=Asia/Shanghai']
        },
        "fetch": {
            "command": "uvx",
            "args": ["mcp-server-fetch"]
        }
    }
    },
    'code_interpreter',  # Built-in tools
]

# Define Agent
bot = Assistant(llm=llm_cfg, function_list=tools)

# Streaming generation
messages = [{'role': 'user', 'content': 'https://qwenlm.github.io/blog/ Introduce the latest developments of Qwen'}]
for responses in bot.run(messages=messages):
    pass
print(responses)
```

```
@misc{qwen3,
    title  = {Qwen3},
    url    = {https://qwenlm.github.io/blog/qwen3/},
    author = {Qwen Team},
    month  = {April},
    year   = {2025}
}
```
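To make the index arithmetic behind the thinking/answer split in this card concrete, here is a small self-contained sketch that works on plain lists of token IDs, with no model or tokenizer required. The helper name `split_at_last_marker` and the sample IDs are hypothetical; only the marker ID 151668 and the reversed-search pattern come from the card's snippet.

```python
THINK_END_ID = 151668  # marker token ID used in this card's example


def split_at_last_marker(output_ids, marker=THINK_END_ID):
    """Split token IDs just after the last `marker`; if it is absent, everything is answer."""
    try:
        # Searching the reversed list finds the LAST occurrence of the marker;
        # len - reversed_index gives the position one past it in the original list.
        index = len(output_ids) - output_ids[::-1].index(marker)
    except ValueError:
        index = 0
    return output_ids[:index], output_ids[index:]


# Hypothetical token IDs: three "thinking" tokens, the marker, two "answer" tokens.
thinking_ids, answer_ids = split_at_last_marker([11, 12, 13, THINK_END_ID, 21, 22])
print(thinking_ids)
print(answer_ids)

# Without the marker, nothing is treated as thinking content.
no_thinking, all_answer = split_at_last_marker([21, 22])
print(no_thinking, all_answer)
```

Note that the first slice includes the marker token itself, and that a missing marker falls back to `index = 0`, so the whole output is treated as the answer.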

license:apache-2.0
52,693
54

Llama-3.1-8B-Instruct-unsloth-bnb-4bit

llama
52,398
4

Qwen3-4B-Instruct-2507-GGUF

--- library_name: transformers license: apache-2.0 license_link: https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507/blob/main/LICENSE base_model: - Qwen/Qwen3-4B-Instruct-2507 tags: - qwen - qwen3 - unsloth --- See our collection for all versions of Qwen3 including GGUF, 4-bit & 16-bit formats. Learn to

license:apache-2.0
52,123
89

GLM-4.5-Air-GGUF

> [!NOTE]
> Includes Unsloth **chat template fixes**!
> For `llama.cpp`, use `--jinja`.

license:mit
51,960
114

gemma-3-12b-it-unsloth-bnb-4bit

51,109
23

Mistral-Nemo-Instruct-2407-bnb-4bit

license:apache-2.0
51,004
30

Qwen3-VL-8B-Instruct-GGUF

> [!NOTE]
> Includes Unsloth chat template fixes!
> See our Qwen3-VL collection for all versions including GGUF, 4-bit & 16-bit formats. See Unsloth Dynamic 2.0 GGUFs for our quantization benchmarks.

- Fine-tune Qwen3-VL-8B for free using our Google Colab notebook
- Or train Qwen3-VL with reinforcement learning (GSPO) with our free notebook.
- View the rest of our notebooks in our docs here.

---

Meet Qwen3-VL, the most powerful vision-language model in the Qwen series to date. This generation delivers comprehensive upgrades across the board: superior text understanding & generation, deeper visual perception & reasoning, extended context length, enhanced spatial and video dynamics comprehension, and stronger agent interaction capabilities. Available in Dense and MoE architectures that scale from edge to cloud, with Instruct and reasoning-enhanced Thinking editions for flexible, on-demand deployment.

- Visual Agent: Operates PC/mobile GUIs: recognizes elements, understands functions, invokes tools, completes tasks.
- Visual Coding Boost: Generates Draw.io/HTML/CSS/JS from images/videos.
- Advanced Spatial Perception: Judges object positions, viewpoints, and occlusions; provides stronger 2D grounding and enables 3D grounding for spatial reasoning and embodied AI.
- Long Context & Video Understanding: Native 256K context, expandable to 1M; handles books and hours-long video with full recall and second-level indexing.
- Enhanced Multimodal Reasoning: Excels in STEM/Math, with causal analysis and logical, evidence-based answers.
- Upgraded Visual Recognition: Broader, higher-quality pretraining enables the model to “recognize everything”: celebrities, anime, products, landmarks, flora/fauna, etc.
- Expanded OCR: Supports 32 languages (up from 19); robust in low light, blur, and tilt; better with rare/ancient characters and jargon; improved long-document structure parsing.
- Text Understanding on par with pure LLMs: Seamless text-vision fusion for lossless, unified comprehension.

1. Interleaved-MRoPE: Full-frequency allocation over time, width, and height via robust positional embeddings, enhancing long-horizon video reasoning.
2. DeepStack: Fuses multi-level ViT features to capture fine-grained details and sharpen image-text alignment.
3. Text-Timestamp Alignment: Moves beyond T-RoPE to precise, timestamp-grounded event localization for stronger video temporal modeling.

This is the weight repository for Qwen3-VL-8B-Instruct.

Below, we provide simple examples to show how to use Qwen3-VL with 🤖 ModelScope and 🤗 Transformers.

The code for Qwen3-VL is included in the latest Hugging Face transformers, and we advise you to build from source with command:

Here we show a code snippet to show how to use the chat model with `transformers`:

If you find our work helpful, feel free to cite us.

license:apache-2.0
49,390
11

gemma-3-4b-it

49,210
19

Qwen3-VL-30B-A3B-Instruct-GGUF

license:apache-2.0
48,581
27

Mistral-Small-24B-Instruct-2501

license:apache-2.0
47,745
7

gemma-3-27b-it-GGUF

See our collection for all versions of Gemma 3 including GGUF, 4-bit & 16-bit formats. Read our Guide to see how to Run Gemma 3 correctly.

- Fine-tune Gemma 3 (12B) for free using our Google Colab notebook here!
- Read our Blog about Gemma 3 support: unsloth.ai/blog/gemma3
- View the rest of our notebooks in our docs here.
- Export your fine-tuned model to GGUF, Ollama, llama.cpp or 🤗HF.

| Unsloth supports | Free Notebooks | Performance | Memory use |
|-----------------|----------------|-------------|----------|
| GRPO with Gemma 3 (12B) | ▶️ Start on Colab | 2x faster | 80% less |
| Llama-3.2 (3B) | ▶️ Start on Colab | 2.4x faster | 58% less |
| Llama-3.2 (11B vision) | ▶️ Start on Colab | 2x faster | 60% less |
| Qwen2.5 (7B) | ▶️ Start on Colab | 2x faster | 60% less |
| Phi-4 (14B) | ▶️ Start on Colab | 2x faster | 50% less |
| Mistral (7B) | ▶️ Start on Colab | 2.2x faster | 62% less |

- [Gemma 3 Technical Report][g3-tech-report]
- [Responsible Generative AI Toolkit][rai-toolkit]
- [Gemma on Kaggle][kaggle-gemma]
- [Gemma on Vertex Model Garden][vertex-mg-gemma3]

Summary description and brief definition of inputs and outputs.

Gemma is a family of lightweight, state-of-the-art open models from Google, built from the same research and technology used to create the Gemini models. Gemma 3 models are multimodal, handling text and image input and generating text output, with open weights for both pre-trained variants and instruction-tuned variants. Gemma 3 has a large, 128K context window, multilingual support in over 140 languages, and is available in more sizes than previous versions. Gemma 3 models are well-suited for a variety of text generation and image understanding tasks, including question answering, summarization, and reasoning.
Their relatively small size makes it possible to deploy them in environments with limited resources such as laptops, desktops or your own cloud infrastructure, democratizing access to state of the art AI models and helping foster innovation for everyone. - Input: - Text string, such as a question, a prompt, or a document to be summarized - Images, normalized to 896 x 896 resolution and encoded to 256 tokens each - Total input context of 128K tokens for the 4B, 12B, and 27B sizes, and 32K tokens for the 1B size - Output: - Generated text in response to the input, such as an answer to a question, analysis of image content, or a summary of a document - Total output context of 8192 tokens Data used for model training and how the data was processed. These models were trained on a dataset of text data that includes a wide variety of sources. The 27B model was trained with 14 trillion tokens, the 12B model was trained with 12 trillion tokens, 4B model was trained with 4 trillion tokens and 1B with 2 trillion tokens. Here are the key components: - Web Documents: A diverse collection of web text ensures the model is exposed to a broad range of linguistic styles, topics, and vocabulary. The training dataset includes content in over 140 languages. - Code: Exposing the model to code helps it to learn the syntax and patterns of programming languages, which improves its ability to generate code and understand code-related questions. - Mathematics: Training on mathematical text helps the model learn logical reasoning, symbolic representation, and to address mathematical queries. - Images: A wide range of images enables the model to perform image analysis and visual data extraction tasks. The combination of these diverse data sources is crucial for training a powerful multimodal model that can handle a wide variety of different tasks and data formats. 
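As a worked example of the input limits stated earlier in this card (images encoded to 256 tokens each, 128K tokens of total input for the 4B, 12B, and 27B sizes), the sketch below estimates how many tokens remain for text once images are added to a prompt. Treating "128K" as 128 × 1024 tokens is an assumption, since the card does not give an exact figure, and the function name `remaining_text_tokens` is hypothetical.

```python
TOKENS_PER_IMAGE = 256  # each image is encoded to 256 tokens (from the card)

# Assumption: "128K" means 128 * 1024 tokens; the card only states "128K".
CONTEXT_TOKENS = {"4B": 128 * 1024, "12B": 128 * 1024, "27B": 128 * 1024}


def remaining_text_tokens(model_size, num_images):
    """Tokens left for text after reserving input space for `num_images` images."""
    budget = CONTEXT_TOKENS[model_size] - num_images * TOKENS_PER_IMAGE
    if budget < 0:
        raise ValueError("images alone exceed the input context")
    return budget


# Ten images consume 2,560 tokens of the input context.
print(remaining_text_tokens("4B", num_images=10))

# Upper bound on images if the prompt contained no text at all.
max_images = CONTEXT_TOKENS["27B"] // TOKENS_PER_IMAGE
print(max_images)
```

This only budgets the input side; the card lists a separate output limit of 8192 tokens.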
Here are the key data cleaning and filtering methods applied to the training data:

- CSAM Filtering: Rigorous CSAM (Child Sexual Abuse Material) filtering was applied at multiple stages in the data preparation process to ensure the exclusion of harmful and illegal content.
- Sensitive Data Filtering: As part of making Gemma pre-trained models safe and reliable, automated techniques were used to filter out certain personal information and other sensitive data from training sets.
- Additional methods: Filtering based on content quality and safety in line with [our policies][safety-policies].

Gemma was trained using [Tensor Processing Unit (TPU)][tpu] hardware (TPUv4p, TPUv5p and TPUv5e). Training vision-language models (VLMs) requires significant computational power. TPUs, designed specifically for matrix operations common in machine learning, offer several advantages in this domain:

- Performance: TPUs are specifically designed to handle the massive computations involved in training VLMs. They can speed up training considerably compared to CPUs.
- Memory: TPUs often come with large amounts of high-bandwidth memory, allowing for the handling of large models and batch sizes during training. This can lead to better model quality.
- Scalability: TPU Pods (large clusters of TPUs) provide a scalable solution for handling the growing complexity of large foundation models. You can distribute training across multiple TPU devices for faster and more efficient processing.
- Cost-effectiveness: In many scenarios, TPUs can provide a more cost-effective solution for training large models compared to CPU-based infrastructure, especially when considering the time and resources saved due to faster training.

These advantages are aligned with [Google's commitments to operate sustainably][sustainability].

Training was done using [JAX][jax] and [ML Pathways][ml-pathways].
JAX allows researchers to take advantage of the latest generation of hardware, including TPUs, for faster and more efficient training of large models. ML Pathways is Google's latest effort to build artificially intelligent systems capable of generalizing across multiple tasks. This is especially suitable for foundation models, including large language models like these ones.

Together, JAX and ML Pathways are used as described in the [paper about the Gemini family of models][gemini-2-paper]; "the 'single controller' programming model of Jax and Pathways allows a single Python process to orchestrate the entire training run, dramatically simplifying the development workflow."

These models were evaluated against a large collection of different datasets and metrics to cover different aspects of text generation:

| Benchmark | Metric | Gemma 3 PT 1B | Gemma 3 PT 4B | Gemma 3 PT 12B | Gemma 3 PT 27B |
| ------------------------------ |----------------|:--------------:|:-------------:|:--------------:|:--------------:|
| [HellaSwag][hellaswag] | 10-shot | 62.3 | 77.2 | 84.2 | 85.6 |
| [BoolQ][boolq] | 0-shot | 63.2 | 72.3 | 78.8 | 82.4 |
| [PIQA][piqa] | 0-shot | 73.8 | 79.6 | 81.8 | 83.3 |
| [SocialIQA][socialiqa] | 0-shot | 48.9 | 51.9 | 53.4 | 54.9 |
| [TriviaQA][triviaqa] | 5-shot | 39.8 | 65.8 | 78.2 | 85.5 |
| [Natural Questions][naturalq] | 5-shot | 9.48 | 20.0 | 31.4 | 36.1 |
| [ARC-c][arc] | 25-shot | 38.4 | 56.2 | 68.9 | 70.6 |
| [ARC-e][arc] | 0-shot | 73.0 | 82.4 | 88.3 | 89.0 |
| [WinoGrande][winogrande] | 5-shot | 58.2 | 64.7 | 74.3 | 78.8 |
| [BIG-Bench Hard][bbh] | few-shot | 28.4 | 50.9 | 72.6 | 77.7 |
| [DROP][drop] | 1-shot | 42.4 | 60.1 | 72.2 | 77.2 |

[hellaswag]: https://arxiv.org/abs/1905.07830
[boolq]: https://arxiv.org/abs/1905.10044
[piqa]: https://arxiv.org/abs/1911.11641
[socialiqa]: https://arxiv.org/abs/1904.09728
[triviaqa]: https://arxiv.org/abs/1705.03551
[naturalq]: https://github.com/google-research-datasets/natural-questions
[arc]: https://arxiv.org/abs/1911.01547
[winogrande]: https://arxiv.org/abs/1907.10641
[bbh]: https://paperswithcode.com/dataset/bbh
[drop]: https://arxiv.org/abs/1903.00161

| Benchmark | Metric | Gemma 3 PT 4B | Gemma 3 PT 12B | Gemma 3 PT 27B |
| ------------------------------ |----------------|:-------------:|:--------------:|:--------------:|
| [MMLU][mmlu] | 5-shot | 59.6 | 74.5 | 78.6 |
| [MMLU][mmlu] (Pro COT) | 5-shot | 29.2 | 45.3 | 52.2 |
| [AGIEval][agieval] | 3-5-shot | 42.1 | 57.4 | 66.2 |
| [MATH][math] | 4-shot | 24.2 | 43.3 | 50.0 |
| [GSM8K][gsm8k] | 8-shot | 38.4 | 71.0 | 82.6 |
| [GPQA][gpqa] | 5-shot | 15.0 | 25.4 | 24.3 |
| [MBPP][mbpp] | 3-shot | 46.0 | 60.4 | 65.6 |
| [HumanEval][humaneval] | 0-shot | 36.0 | 45.7 | 48.8 |

[mmlu]: https://arxiv.org/abs/2009.03300
[agieval]: https://arxiv.org/abs/2304.06364
[math]: https://arxiv.org/abs/2103.03874
[gsm8k]: https://arxiv.org/abs/2110.14168
[gpqa]: https://arxiv.org/abs/2311.12022
[mbpp]: https://arxiv.org/abs/2108.07732
[humaneval]: https://arxiv.org/abs/2107.03374

| Benchmark | Gemma 3 PT 1B | Gemma 3 PT 4B | Gemma 3 PT 12B | Gemma 3 PT 27B |
| ------------------------------------ |:-------------:|:-------------:|:--------------:|:--------------:|
| [MGSM][mgsm] | 2.04 | 34.7 | 64.3 | 74.3 |
| [Global-MMLU-Lite][global-mmlu-lite] | 24.9 | 57.0 | 69.4 | 75.7 |
| [WMT24++][wmt24pp] (ChrF) | 36.7 | 48.4 | 53.9 | 55.7 |
| [FloRes][flores] | 29.5 | 39.2 | 46.0 | 48.8 |
| [XQuAD][xquad] (all) | 43.9 | 68.0 | 74.5 | 76.8 |
| [ECLeKTic][eclektic] | 4.69 | 11.0 | 17.2 | 24.4 |
| [IndicGenBench][indicgenbench] | 41.4 | 57.2 | 61.7 | 63.4 |

[mgsm]: https://arxiv.org/abs/2210.03057
[flores]: https://arxiv.org/abs/2106.03193
[xquad]: https://arxiv.org/abs/1910.11856v3
[global-mmlu-lite]: https://huggingface.co/datasets/CohereForAI/Global-MMLU-Lite
[wmt24pp]: https://arxiv.org/abs/2502.12404v1
[eclektic]: https://arxiv.org/abs/2502.21228
[indicgenbench]: https://arxiv.org/abs/2404.16816

| Benchmark | Gemma 3 PT 4B | Gemma 3 PT 12B | Gemma 3 PT 27B |
| ------------------------------ |:-------------:|:--------------:|:--------------:|
| [COCOcap][coco-cap] | 102 | 111 | 116 |
| [DocVQA][docvqa] (val) | 72.8 | 82.3 | 85.6 |
| [InfoVQA][info-vqa] (val) | 44.1 | 54.8 | 59.4 |
| [MMMU][mmmu] (pt) | 39.2 | 50.3 | 56.1 |
| [TextVQA][textvqa] (val) | 58.9 | 66.5 | 68.6 |
| [RealWorldQA][realworldqa] | 45.5 | 52.2 | 53.9 |
| [ReMI][remi] | 27.3 | 38.5 | 44.8 |
| [AI2D][ai2d] | 63.2 | 75.2 | 79.0 |
| [ChartQA][chartqa] | 63.6 | 74.7 | 76.3 |
| [VQAv2][vqav2] | 63.9 | 71.2 | 72.9 |
| [BLINK][blinkvqa] | 38.0 | 35.9 | 39.6 |
| [OKVQA][okvqa] | 51.0 | 58.7 | 60.2 |
| [TallyQA][tallyqa] | 42.5 | 51.8 | 54.3 |
| [SpatialSense VQA][ss-vqa] | 50.9 | 60.0 | 59.4 |
| [CountBenchQA][countbenchqa] | 26.1 | 17.8 | 68.0 |

[coco-cap]: https://cocodataset.org/#home
[docvqa]: https://www.docvqa.org/
[info-vqa]: https://arxiv.org/abs/2104.12756
[mmmu]: https://arxiv.org/abs/2311.16502
[textvqa]: https://textvqa.org/
[realworldqa]: https://paperswithcode.com/dataset/realworldqa
[remi]: https://arxiv.org/html/2406.09175v1
[ai2d]: https://allenai.org/data/diagrams
[chartqa]: https://arxiv.org/abs/2203.10244
[vqav2]: https://visualqa.org/index.html
[blinkvqa]: https://arxiv.org/abs/2404.12390
[okvqa]: https://okvqa.allenai.org/
[tallyqa]: https://arxiv.org/abs/1810.12440
[ss-vqa]: https://arxiv.org/abs/1908.02660
[countbenchqa]: https://github.com/google-research/big_vision/blob/main/big_vision/datasets/countbenchqa/

Our evaluation methods include structured evaluations and internal red-teaming testing of relevant content policies. Red-teaming was conducted by a number of different teams, each with different goals and human evaluation metrics. These models were evaluated against a number of different categories relevant to ethics and safety, including:

- Child Safety: Evaluation of text-to-text and image-to-text prompts covering child safety policies, including child sexual abuse and exploitation.
- Content Safety: Evaluation of text-to-text and image-to-text prompts covering safety policies, including harassment, violence and gore, and hate speech.
- Representational Harms: Evaluation of text-to-text and image-to-text prompts covering safety policies, including bias, stereotyping, and harmful associations or inaccuracies.

In addition to development level evaluations, we conduct "assurance evaluations" which are our 'arms-length' internal evaluations for responsibility governance decision making. They are conducted separately from the model development team, to inform decision making about release. High level findings are fed back to the model team, but prompt sets are held-out to prevent overfitting and preserve the results' ability to inform decision making. Assurance evaluation results are reported to our Responsibility & Safety Council as part of release review.

For all areas of safety testing, we saw major improvements in the categories of child safety, content safety, and representational harms relative to previous Gemma models. All testing was conducted without safety filters to evaluate the model capabilities and behaviors. For both text-to-text and image-to-text, and across all model sizes, the model produced minimal policy violations, and showed significant improvements over previous Gemma models' performance with respect to ungrounded inferences. A limitation of our evaluations was that they included only English language prompts.

These models have certain limitations that users should be aware of.

Open vision-language models (VLMs) have a wide range of applications across various industries and domains. The following list of potential uses is not comprehensive. The purpose of this list is to provide contextual information about the possible use-cases that the model creators considered as part of model training and development.
- Content Creation and Communication - Text Generation: These models can be used to generate creative text formats such as poems, scripts, code, marketing copy, and email drafts. - Chatbots and Conversational AI: Power conversational interfaces for customer service, virtual assistants, or interactive applications. - Text Summarization: Generate concise summaries of a text corpus, research papers, or reports. - Image Data Extraction: These models can be used to extract, interpret, and summarize visual data for text communications. - Research and Education - Natural Language Processing (NLP) and VLM Research: These models can serve as a foundation for researchers to experiment with VLM and NLP techniques, develop algorithms, and contribute to the advancement of the field. - Language Learning Tools: Support interactive language learning experiences, aiding in grammar correction or providing writing practice. - Knowledge Exploration: Assist researchers in exploring large bodies of text by generating summaries or answering questions about specific topics. - Training Data - The quality and diversity of the training data significantly influence the model's capabilities. Biases or gaps in the training data can lead to limitations in the model's responses. - The scope of the training dataset determines the subject areas the model can handle effectively. - Context and Task Complexity - Models are better at tasks that can be framed with clear prompts and instructions. Open-ended or highly complex tasks might be challenging. - A model's performance can be influenced by the amount of context provided (longer context generally leads to better outputs, up to a certain point). - Language Ambiguity and Nuance - Natural language is inherently complex. Models might struggle to grasp subtle nuances, sarcasm, or figurative language. - Factual Accuracy - Models generate responses based on information they learned from their training datasets, but they are not knowledge bases. 
They may generate incorrect or outdated factual statements. - Common Sense - Models rely on statistical patterns in language. They might lack the ability to apply common sense reasoning in certain situations. The development of vision-language models (VLMs) raises several ethical concerns. In creating an open model, we have carefully considered the following: - Bias and Fairness - VLMs trained on large-scale, real-world text and image data can reflect socio-cultural biases embedded in the training material. These models underwent careful scrutiny, input data pre-processing described and posterior evaluations reported in this card. - Misinformation and Misuse - VLMs can be misused to generate text that is false, misleading, or harmful. - Guidelines are provided for responsible use with the model, see the [Responsible Generative AI Toolkit][rai-toolkit]. - Transparency and Accountability: - This model card summarizes details on the models' architecture, capabilities, limitations, and evaluation processes. - A responsibly developed open model offers the opportunity to share innovation by making VLM technology accessible to developers and researchers across the AI ecosystem. - Perpetuation of biases: It's encouraged to perform continuous monitoring (using evaluation metrics, human review) and the exploration of de-biasing techniques during model training, fine-tuning, and other use cases. - Generation of harmful content: Mechanisms and guidelines for content safety are essential. Developers are encouraged to exercise caution and implement appropriate content safety safeguards based on their specific product policies and application use cases. - Misuse for malicious purposes: Technical limitations and developer and end-user education can help mitigate against malicious applications of VLMs. Educational resources and reporting mechanisms for users to flag misuse are provided. 
Prohibited uses of Gemma models are outlined in the [Gemma Prohibited Use Policy][prohibited-use].
- Privacy violations: Models were trained on data filtered for removal of certain personal information and other sensitive data. Developers are encouraged to adhere to privacy regulations with privacy-preserving techniques.

At the time of release, this family of models provides high-performance open vision-language model implementations designed from the ground up for responsible AI development compared to similarly sized models. Using the benchmark evaluation metrics described in this document, these models have been shown to provide superior performance to other, comparably-sized open model alternatives.

[g3-tech-report]: https://goo.gle/Gemma3Report
[rai-toolkit]: https://ai.google.dev/responsible
[kaggle-gemma]: https://www.kaggle.com/models/google/gemma-3
[vertex-mg-gemma3]: https://console.cloud.google.com/vertex-ai/publishers/google/model-garden/gemma3
[terms]: https://ai.google.dev/gemma/terms
[safety-policies]: https://ai.google/static/documents/ai-responsibility-update-published-february-2025.pdf
[prohibited-use]: https://ai.google.dev/gemma/prohibited_use_policy
[tpu]: https://cloud.google.com/tpu/docs/intro-to-tpu
[sustainability]: https://sustainability.google/operating-sustainably/
[jax]: https://github.com/jax-ml/jax
[ml-pathways]: https://blog.google/technology/ai/introducing-pathways-next-generation-ai-architecture/
[gemini-2-paper]: https://arxiv.org/abs/2312.11805

47,711
168

Mistral-Small-3.1-24B-Instruct-2503

license:apache-2.0
47,663
16

gemma-3-270m-it

47,105
19

phi-4-unsloth-bnb-4bit

llama
46,601
61

Qwen2.5-14B-Instruct

license:apache-2.0
46,156
11

Qwen3-VL-8B-Instruct-unsloth-bnb-4bit

See our Qwen3-VL collection for all versions including GGUF, 4-bit & 16-bit formats. See Unsloth Dynamic 2.0 GGUFs for our quantization benchmarks.

- Fine-tune Qwen3-VL-8B for free using our Google Colab notebook
- Or train Qwen3-VL with reinforcement learning (GSPO) with our free notebook.
- View the rest of our notebooks in our docs here.

---

Meet Qwen3-VL, the most powerful vision-language model in the Qwen series to date. This generation delivers comprehensive upgrades across the board: superior text understanding & generation, deeper visual perception & reasoning, extended context length, enhanced spatial and video dynamics comprehension, and stronger agent interaction capabilities. Available in Dense and MoE architectures that scale from edge to cloud, with Instruct and reasoning-enhanced Thinking editions for flexible, on-demand deployment.

- Visual Agent: Operates PC/mobile GUIs: recognizes elements, understands functions, invokes tools, completes tasks.
- Visual Coding Boost: Generates Draw.io/HTML/CSS/JS from images/videos.
- Advanced Spatial Perception: Judges object positions, viewpoints, and occlusions; provides stronger 2D grounding and enables 3D grounding for spatial reasoning and embodied AI.
- Long Context & Video Understanding: Native 256K context, expandable to 1M; handles books and hours-long video with full recall and second-level indexing.
- Enhanced Multimodal Reasoning: Excels in STEM/Math, with causal analysis and logical, evidence-based answers.
- Upgraded Visual Recognition: Broader, higher-quality pretraining enables the model to “recognize everything”: celebrities, anime, products, landmarks, flora/fauna, etc.
- Expanded OCR: Supports 32 languages (up from 19); robust in low light, blur, and tilt; better with rare/ancient characters and jargon; improved long-document structure parsing.
- Text Understanding on par with pure LLMs: Seamless text-vision fusion for lossless, unified comprehension.

1. Interleaved-MRoPE: Full-frequency allocation over time, width, and height via robust positional embeddings, enhancing long-horizon video reasoning.
2. DeepStack: Fuses multi-level ViT features to capture fine-grained details and sharpen image-text alignment.
3. Text-Timestamp Alignment: Moves beyond T-RoPE to precise, timestamp-grounded event localization for stronger video temporal modeling.

This is the weight repository for Qwen3-VL-8B-Instruct.

Below, we provide simple examples to show how to use Qwen3-VL with 🤖 ModelScope and 🤗 Transformers.

The code for Qwen3-VL is included in the latest Hugging Face transformers, and we advise you to build from source with command:

Here we show a code snippet to show how to use the chat model with `transformers`:

If you find our work helpful, feel free to cite us.

license:apache-2.0
44,262
9

gemma-3-12b-it-GGUF

43,691
106

Meta-Llama-3.1-8B-Instruct-unsloth-bnb-4bit

llama
41,834
4

Qwen3-14B-unsloth-bnb-4bit

license:apache-2.0
41,418
9

DeepSeek-V3.1-Terminus-BF16

license:mit
40,305
0

Qwen3-4B-unsloth-bnb-4bit

license:apache-2.0
39,926
14

llama-3-8b-bnb-4bit

Finetune Llama 3.2, Gemma 2, Mistral 2-5x faster with 70% less memory via Unsloth! We have a free Google Colab Tesla T4 notebook for Llama 3.1 (8B) here: https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-Alpaca.ipynb

unsloth/Llama-3-8B-bnb-4bit

For more details on the model, please go to Meta's original model card.

All notebooks are beginner friendly! Add your dataset, click "Run All", and you'll get a 2x faster finetuned model which can be exported to GGUF, vLLM or uploaded to Hugging Face.

| Unsloth supports | Free Notebooks | Performance | Memory use |
|-----------------|----------------|-------------|----------|
| Llama-3.2 (3B) | ▶️ Start on Colab | 2.4x faster | 58% less |
| Llama-3.2 (11B vision) | ▶️ Start on Colab | 2x faster | 60% less |
| Qwen2 VL (7B) | ▶️ Start on Colab | 1.8x faster | 60% less |
| Qwen2.5 (7B) | ▶️ Start on Colab | 2x faster | 60% less |
| Llama-3.1 (8B) | ▶️ Start on Colab | 2.4x faster | 58% less |
| Phi-3.5 (mini) | ▶️ Start on Colab | 2x faster | 50% less |
| Gemma 2 (9B) | ▶️ Start on Colab | 2.4x faster | 58% less |
| Mistral (7B) | ▶️ Start on Colab | 2.2x faster | 62% less |

- This Llama 3.2 conversational notebook is useful for ShareGPT ChatML / Vicuna templates.
- This text completion notebook is for raw text. This DPO notebook replicates Zephyr.
- Kaggle has 2x T4s, but we use 1. Due to overhead, 1x T4 is 5x faster.
Special Thanks A huge thank you to the Meta and Llama team for creating and releasing these models. Meta developed and released the Meta Llama 3 family of large language models (LLMs), a collection of pretrained and instruction tuned generative text models in 8 and 70B sizes. The Llama 3 instruction tuned models are optimized for dialogue use cases and outperform many of the available open source chat models on common industry benchmarks. Further, in developing these models, we took great care to optimize helpfulness and safety. Variations Llama 3 comes in two sizes — 8B and 70B parameters — in pre-trained and instruction tuned variants. Model Architecture Llama 3 is an auto-regressive language model that uses an optimized transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align with human preferences for helpfulness and safety. Llama 3 family of models. Token counts refer to pretraining data only. Both the 8 and 70B versions use Grouped-Query Attention (GQA) for improved inference scalability. Status This is a static model trained on an offline dataset. Future versions of the tuned models will be released as we improve model safety with community feedback. License A custom commercial license is available at: https://llama.meta.com/llama3/license Where to send questions or comments about the model Instructions on how to provide feedback or comments on the model can be found in the model README. For more technical information about generation parameters and recipes for how to use Llama 3 in applications, please go here. Intended Use Cases Llama 3 is intended for commercial and research use in English. Instruction tuned models are intended for assistant-like chat, whereas pretrained models can be adapted for a variety of natural language generation tasks. Out-of-scope Use in any manner that violates applicable laws or regulations (including trade compliance laws). 
Use in any other way that is prohibited by the Acceptable Use Policy and Llama 3 Community License. Use in languages other than English. Note: Developers may fine-tune Llama 3 models for languages beyond English provided they comply with the Llama 3 Community License and the Acceptable Use Policy. This repository contains two versions of Meta-Llama-3-70B-Instruct, for use with transformers and with the original `llama3` codebase. To download Original checkpoints, see the example command below leveraging `huggingface-cli`: For Hugging Face support, we recommend using transformers or TGI, but a similar command works. Training Factors We used custom training libraries, Meta's Research SuperCluster, and production clusters for pretraining. Fine-tuning, annotation, and evaluation were also performed on third-party cloud compute. Carbon Footprint Pretraining utilized a cumulative 7.7M GPU hours of computation on hardware of type H100-80GB (TDP of 700W). Estimated total emissions were 2290 tCO2eq, 100% of which were offset by Meta’s sustainability program. CO2 emissions during pre-training. Time: total GPU time required for training each model. Power Consumption: peak power capacity per GPU device for the GPUs used adjusted for power usage efficiency. 100% of the emissions are directly offset by Meta's sustainability program, and because we are openly releasing these models, the pretraining costs do not need to be incurred by others. Overview Llama 3 was pretrained on over 15 trillion tokens of data from publicly available sources. The fine-tuning data includes publicly available instruction datasets, as well as over 10M human-annotated examples. Neither the pretraining nor the fine-tuning datasets include Meta user data. Data Freshness The pretraining data has a cutoff of March 2023 for the 8B and December 2023 for the 70B models respectively. In this section, we report the results for Llama 3 models on standard automatic benchmarks.
For all the evaluations, we use our internal evaluations library. For details on the methodology see here. We believe that an open approach to AI leads to better, safer products, faster innovation, and a bigger overall market. We are committed to Responsible AI development and took a series of steps to limit misuse and harm and support the open source community. Foundation models are widely capable technologies that are built to be used for a diverse range of applications. They are not designed to meet every developer preference on safety levels for all use cases, out-of-the-box, as those by their nature will differ across different applications. Rather, responsible LLM-application deployment is achieved by implementing a series of safety best practices throughout the development of such applications, from the model pre-training, fine-tuning and the deployment of systems composed of safeguards to tailor the safety needs specifically to the use case and audience. As part of the Llama 3 release, we updated our Responsible Use Guide to outline the steps and best practices for developers to implement model and system level safety for their application. We also provide a set of resources including Meta Llama Guard 2 and Code Shield safeguards. These tools have proven to drastically reduce residual risks of LLM Systems, while maintaining a high level of helpfulness. We encourage developers to tune and deploy these safeguards according to their needs and we provide a reference implementation to get you started. As outlined in the Responsible Use Guide, some trade-off between model helpfulness and model alignment is likely unavoidable. Developers should exercise discretion about how to weigh the benefits of alignment and helpfulness for their specific use case and audience. Developers should be mindful of residual risks when using Llama models and leverage additional safety tools as needed to reach the right safety bar for their use case. 
For our instruction tuned model, we conducted extensive red teaming exercises, performed adversarial evaluations and implemented safety mitigation techniques to lower residual risks. As with any Large Language Model, residual risks will likely remain and we recommend that developers assess these risks in the context of their use case. In parallel, we are working with the community to make AI safety benchmark standards transparent, rigorous and interpretable.

In addition to residual risks, we put a great emphasis on model refusals to benign prompts. Over-refusing not only can impact the user experience but could even be harmful in certain contexts as well. We’ve heard the feedback from the developer community and improved our fine tuning to ensure that Llama 3 is significantly less likely to falsely refuse to answer prompts than Llama 2. We built internal benchmarks and developed mitigations to limit false refusals, making Llama 3 our most helpful model to date.

In addition to responsible use considerations outlined above, we followed a rigorous process that requires us to take extra measures against misuse and critical risks before we make our release decision. If you access or use Llama 3, you agree to the Acceptable Use Policy. The most recent copy of this policy can be found at https://llama.meta.com/llama3/use-policy/.

CBRNE (Chemical, Biological, Radiological, Nuclear, and high yield Explosives) We have conducted a two-fold assessment of the safety of the model in this area:
- Iterative testing during model training to assess the safety of responses related to CBRNE threats and other adversarial risks.
- Involving external CBRNE experts to conduct an uplift test assessing the ability of the model to accurately provide expert knowledge and reduce barriers to potential CBRNE misuse, by reference to what can be achieved using web search (without the model).
We have evaluated Llama 3 with CyberSecEval, Meta’s cybersecurity safety eval suite, measuring Llama 3’s propensity to suggest insecure code when used as a coding assistant, and Llama 3’s propensity to comply with requests to help carry out cyber attacks, where attacks are defined by the industry standard MITRE ATT&CK cyber attack ontology. On our insecure coding and cyber attacker helpfulness tests, Llama 3 behaved in the same range or safer than models of equivalent coding capability. Child Safety risk assessments were conducted using a team of experts, to assess the model’s capability to produce outputs that could result in Child Safety risks and inform on any necessary and appropriate risk mitigations via fine tuning. We leveraged those expert red teaming sessions to expand the coverage of our evaluation benchmarks through Llama 3 model development. For Llama 3, we conducted new in-depth sessions using objective based methodologies to assess the model risks along multiple attack vectors. We also partnered with content specialists to perform red teaming exercises assessing potentially violating content while taking account of market specific nuances or experiences. Generative AI safety requires expertise and tooling, and we believe in the strength of the open community to accelerate its progress. We are active members of open consortiums, including the AI Alliance, Partnership in AI and MLCommons, actively contributing to safety standardization and transparency. We encourage the community to adopt taxonomies like the MLCommons Proof of Concept evaluation to facilitate collaboration and transparency on safety and content evaluations. Our Purple Llama tools are open sourced for the community to use and widely distributed across ecosystem partners including cloud service providers. We encourage community contributions to our Github repository. 
Finally, we put in place a set of resources including an output reporting mechanism and bug bounty program to continuously improve the Llama technology with the help of the community.

The core values of Llama 3 are openness, inclusivity and helpfulness. It is meant to serve everyone, and to work for a wide range of use cases. It is thus designed to be accessible to people across many different backgrounds, experiences and perspectives. Llama 3 addresses users and their needs as they are, without inserting unnecessary judgment or normativity, while reflecting the understanding that even content that may appear problematic in some cases can serve valuable purposes in others. It respects the dignity and autonomy of all users, especially in terms of the values of free thought and expression that power innovation and progress.

But Llama 3 is a new technology, and like any new technology, there are risks associated with its use. Testing conducted to date has been in English, and has not covered, nor could it cover, all scenarios. For these reasons, as with all LLMs, Llama 3’s potential outputs cannot be predicted in advance, and the model may in some instances produce inaccurate, biased or other objectionable responses to user prompts. Therefore, before deploying any applications of Llama 3 models, developers should perform safety testing and tuning tailored to their specific applications of the model. As outlined in the Responsible Use Guide, we recommend incorporating Purple Llama solutions into your workflows, specifically Llama Guard, which provides a base model to filter input and output prompts to layer system-level safety on top of model-level safety.
Please see the Responsible Use Guide available at http://llama.meta.com/responsible-use-guide url = {https://github.com/meta-llama/llama3/blob/main/MODELCARD.md} Aaditya Singh; Aaron Grattafiori; Abhimanyu Dubey; Abhinav Jauhri; Abhinav Pandey; Abhishek Kadian; Adam Kelsey; Adi Gangidi; Ahmad Al-Dahle; Ahuva Goldstand; Aiesha Letman; Ajay Menon; Akhil Mathur; Alan Schelten; Alex Vaughan; Amy Yang; Andrei Lupu; Andres Alvarado; Andrew Gallagher; Andrew Gu; Andrew Ho; Andrew Poulton; Andrew Ryan; Angela Fan; Ankit Ramchandani; Anthony Hartshorn; Archi Mitra; Archie Sravankumar; Artem Korenev; Arun Rao; Ashley Gabriel; Ashwin Bharambe; Assaf Eisenman; Aston Zhang; Aurelien Rodriguez; Austen Gregerson; Ava Spataru; Baptiste Roziere; Ben Maurer; Benjamin Leonhardi; Bernie Huang; Bhargavi Paranjape; Bing Liu; Binh Tang; Bobbie Chern; Brani Stojkovic; Brian Fuller; Catalina Mejia Arenas; Chao Zhou; Charlotte Caucheteux; Chaya Nayak; Ching-Hsiang Chu; Chloe Bi; Chris Cai; Chris Cox; Chris Marra; Chris McConnell; Christian Keller; Christoph Feichtenhofer; Christophe Touret; Chunyang Wu; Corinne Wong; Cristian Canton Ferrer; Damien Allonsius; Daniel Kreymer; Daniel Haziza; Daniel Li; Danielle Pintz; Danny Livshits; Danny Wyatt; David Adkins; David Esiobu; David Xu; Davide Testuggine; Delia David; Devi Parikh; Dhruv Choudhary; Dhruv Mahajan; Diana Liskovich; Diego Garcia-Olano; Diego Perino; Dieuwke Hupkes; Dingkang Wang; Dustin Holland; Egor Lakomkin; Elina Lobanova; Xiaoqing Ellen Tan; Emily Dinan; Eric Smith; Erik Brinkman; Esteban Arcaute; Filip Radenovic; Firat Ozgenel; Francesco Caggioni; Frank Seide; Frank Zhang; Gabriel Synnaeve; Gabriella Schwarz; Gabrielle Lee; Gada Badeer; Georgia Anderson; Graeme Nail; Gregoire Mialon; Guan Pang; Guillem Cucurell; Hailey Nguyen; Hannah Korevaar; Hannah Wang; Haroun Habeeb; Harrison Rudolph; Henry Aspegren; Hu Xu; Hugo Touvron; Iga Kozlowska; Igor Molybog; Igor Tufanov; Iliyan Zarov; Imanol Arrieta Ibarra; Irina-Elena Veliche; 
Isabel Kloumann; Ishan Misra; Ivan Evtimov; Jacob Xu; Jade Copet; Jake Weissman; Jan Geffert; Jana Vranes; Japhet Asher; Jason Park; Jay Mahadeokar; Jean-Baptiste Gaya; Jeet Shah; Jelmer van der Linde; Jennifer Chan; Jenny Hong; Jenya Lee; Jeremy Fu; Jeremy Teboul; Jianfeng Chi; Jianyu Huang; Jie Wang; Jiecao Yu; Joanna Bitton; Joe Spisak; Joelle Pineau; Jon Carvill; Jongsoo Park; Joseph Rocca; Joshua Johnstun; Junteng Jia; Kalyan Vasuden Alwala; Kam Hou U; Kate Plawiak; Kartikeya Upasani; Kaushik Veeraraghavan; Ke Li; Kenneth Heafield; Kevin Stone; Khalid El-Arini; Krithika Iyer; Kshitiz Malik; Kuenley Chiu; Kunal Bhalla; Kyle Huang; Lakshya Garg; Lauren Rantala-Yeary; Laurens van der Maaten; Lawrence Chen; Leandro Silva; Lee Bell; Lei Zhang; Liang Tan; Louis Martin; Lovish Madaan; Luca Wehrstedt; Lukas Blecher; Luke de Oliveira; Madeline Muzzi; Madian Khabsa; Manav Avlani; Mannat Singh; Manohar Paluri; Mark Zuckerberg; Marcin Kardas; Martynas Mankus; Mathew Oldham; Mathieu Rita; Matthew Lennie; Maya Pavlova; Meghan Keneally; Melanie Kambadur; Mihir Patel; Mikayel Samvelyan; Mike Clark; Mike Lewis; Min Si; Mitesh Kumar Singh; Mo Metanat; Mona Hassan; Naman Goyal; Narjes Torabi; Nicolas Usunier; Nikolay Bashlykov; Nikolay Bogoychev; Niladri Chatterji; Ning Dong; Oliver Aobo Yang; Olivier Duchenne; Onur Celebi; Parth Parekh; Patrick Alrassy; Paul Saab; Pavan Balaji; Pedro Rittner; Pengchuan Zhang; Pengwei Li; Petar Vasic; Peter Weng; Polina Zvyagina; Prajjwal Bhargava; Pratik Dubal; Praveen Krishnan; Punit Singh Koura; Qing He; Rachel Rodriguez; Ragavan Srinivasan; Rahul Mitra; Ramon Calderer; Raymond Li; Robert Stojnic; Roberta Raileanu; Robin Battey; Rocky Wang; Rohit Girdhar; Rohit Patel; Romain Sauvestre; Ronnie Polidoro; Roshan Sumbaly; Ross Taylor; Ruan Silva; Rui Hou; Rui Wang; Russ Howes; Ruty Rinott; Saghar Hosseini; Sai Jayesh Bondu; Samyak Datta; Sanjay Singh; Sara Chugh; Sargun Dhillon; Satadru Pan; Sean Bell; Sergey Edunov; Shaoliang Nie; Sharan Narang; 
Sharath Raparthy; Shaun Lindsay; Sheng Feng; Sheng Shen; Shenghao Lin; Shiva Shankar; Shruti Bhosale; Shun Zhang; Simon Vandenhende; Sinong Wang; Seohyun Sonia Kim; Soumya Batra; Sten Sootla; Steve Kehoe; Suchin Gururangan; Sumit Gupta; Sunny Virk; Sydney Borodinsky; Tamar Glaser; Tamar Herman; Tamara Best; Tara Fowler; Thomas Georgiou; Thomas Scialom; Tianhe Li; Todor Mihaylov; Tong Xiao; Ujjwal Karn; Vedanuj Goswami; Vibhor Gupta; Vignesh Ramanathan; Viktor Kerkez; Vinay Satish Kumar; Vincent Gonguet; Vish Vogeti; Vlad Poenaru; Vlad Tiberiu Mihailescu; Vladan Petrovic; Vladimir Ivanov; Wei Li; Weiwei Chu; Wenhan Xiong; Wenyin Fu; Wes Bouaziz; Whitney Meers; Will Constable; Xavier Martinet; Xiaojian Wu; Xinbo Gao; Xinfeng Xie; Xuchao Jia; Yaelle Goldschlag; Yann LeCun; Yashesh Gaur; Yasmine Babaei; Ye Qi; Yenda Li; Yi Wen; Yiwen Song; Youngjin Nam; Yuchen Hao; Yuchen Zhang; Yun Wang; Yuning Mao; Yuzi He; Zacharie Delpierre Coudert; Zachary DeVito; Zahra Hankir; Zhaoduo Wen; Zheng Yan; Zhengxing Chen; Zhenyu Yang; Zoe Papakipos
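
The carbon-footprint figures quoted in the card above can be sanity-checked with a little arithmetic. The sketch below multiplies the stated 7.7M GPU hours by the stated 700 W TDP to estimate energy use, then divides the reported 2290 tCO2eq by that energy to get an implied carbon intensity. Treating TDP as the average power draw is an assumption made here for illustration (the card says its power figures are adjusted for power usage efficiency), and the function names are hypothetical, so the results are rough estimates rather than figures published by Meta.

```python
# Figures quoted in the model card.
GPU_HOURS = 7.7e6          # cumulative H100-80GB GPU hours
TDP_WATTS = 700            # stated TDP per GPU
EMISSIONS_TCO2EQ = 2290    # reported total emissions

def energy_mwh(gpu_hours: float, watts: float) -> float:
    """Energy in MWh, assuming every GPU hour draws `watts` (an assumption)."""
    return gpu_hours * watts / 1e6  # watt-hours -> megawatt-hours

def implied_intensity_kg_per_kwh(tonnes_co2: float, mwh: float) -> float:
    """Implied carbon intensity; tonnes per MWh equals kg per kWh."""
    return tonnes_co2 / mwh

energy = energy_mwh(GPU_HOURS, TDP_WATTS)
intensity = implied_intensity_kg_per_kwh(EMISSIONS_TCO2EQ, energy)

print(f"Estimated energy: {energy:.0f} MWh")
print(f"Implied intensity: {intensity:.3f} kg CO2eq per kWh")
```

Under these assumptions the energy estimate is about 5.4 GWh, which implies roughly 0.42 kg CO2eq per kWh.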

NaNK
llama
38,786
202

gemma-3-1b-it-GGUF

NaNK
37,097
59

gemma-3n-E4B-it-GGUF

Learn how to run & fine-tune Gemma 3n correctly - Read our Guide. See our collection for all versions of Gemma 3n including GGUF, 4-bit & 16-bit formats. Unsloth Dynamic 2.0 achieves SOTA accuracy & performance versus other quants.

- Currently only text is supported.
- Ollama: `ollama run hf.co/unsloth/gemma-3n-E4B-it-GGUF:Q4_K_XL` - auto-sets correct chat template and settings
- Set temperature = 1.0, top_k = 64, top_p = 0.95, min_p = 0.0
- Gemma 3n max tokens (context length): 32K.

Gemma 3n chat template:

- For complete detailed instructions, see our step-by-step guide.
- Fine-tune Gemma 3n (4B) for free using our Google Colab notebook here!
- Read our Blog about Gemma 3n support: unsloth.ai/blog/gemma-3n
- View the rest of our notebooks in our docs here.

| Unsloth supports | Free Notebooks | Performance | Memory use |
|-----------------|-------------------|-------------|----------|
| Gemma-3n-E4B | ▶️ Start on Colab | 2x faster | 60% less |
| GRPO with Gemma 3 (1B) | ▶️ Start on Colab | 2x faster | 80% less |
| Gemma 3 (4B) Vision | ▶️ Start on Colab | 2x faster | 60% less |
| Qwen3 (14B) | ▶️ Start on Colab | 2x faster | 60% less |
| DeepSeek-R1-0528-Qwen3-8B (14B) | ▶️ Start on Colab | 2x faster | 80% less |
| Llama-3.2 (3B) | ▶️ Start on Colab | 2.4x faster | 58% less |

- Responsible Generative AI Toolkit
- Gemma on Kaggle
- Gemma on HuggingFace
- Gemma on Vertex Model Garden

Summary description and brief definition of inputs and outputs.

Gemma is a family of lightweight, state-of-the-art open models from Google, built from the same research and technology used to create the Gemini models. Gemma 3n models are designed for efficient execution on low-resource devices.
They are capable of multimodal input, handling text, image, video, and audio input, and generating text outputs, with open weights for pre-trained and instruction-tuned variants. These models were trained with data in over 140 spoken languages. Gemma 3n models use selective parameter activation technology to reduce resource requirements. This technique allows the models to operate at an effective size of 2B and 4B parameters, which is lower than the total number of parameters they contain. For more information on Gemma 3n's efficient parameter management technology, see the Gemma 3n page. - Input: - Text string, such as a question, a prompt, or a document to be summarized - Images, normalized to 256x256, 512x512, or 768x768 resolution and encoded to 256 tokens each - Audio data encoded to 6.25 tokens per second from a single channel - Total input context of 32K tokens - Output: - Generated text in response to the input, such as an answer to a question, analysis of image content, or a summary of a document - Total output length up to 32K tokens, subtracting the request input tokens Usage Below, there are some code snippets on how to get quickly started with running the model. First, install the Transformers library. Gemma 3n is supported starting from transformers 4.53.0. Then, copy the snippet from the section that is relevant for your use case. You can initialize the model and processor for inference with `pipeline` as follows. With instruction-tuned models, you need to use chat templates to process our inputs first. Then, you can pass it to the pipeline. Data used for model training and how the data was processed. These models were trained on a dataset that includes a wide variety of sources totalling approximately 11 trillion tokens. The knowledge cutoff date for the training data was June 2024. 
Here are the key components: - Web Documents: A diverse collection of web text ensures the model is exposed to a broad range of linguistic styles, topics, and vocabulary. The training dataset includes content in over 140 languages. - Code: Exposing the model to code helps it to learn the syntax and patterns of programming languages, which improves its ability to generate code and understand code-related questions. - Mathematics: Training on mathematical text helps the model learn logical reasoning, symbolic representation, and to address mathematical queries. - Images: A wide range of images enables the model to perform image analysis and visual data extraction tasks. - Audio: A diverse set of sound samples enables the model to recognize speech, transcribe text from recordings, and identify information in audio data. The combination of these diverse data sources is crucial for training a powerful multimodal model that can handle a wide variety of different tasks and data formats. Here are the key data cleaning and filtering methods applied to the training data: - CSAM Filtering: Rigorous CSAM (Child Sexual Abuse Material) filtering was applied at multiple stages in the data preparation process to ensure the exclusion of harmful and illegal content. - Sensitive Data Filtering: As part of making Gemma pre-trained models safe and reliable, automated techniques were used to filter out certain personal information and other sensitive data from training sets. - Additional methods: Filtering based on content quality and safety in line with our policies. Implementation Information Gemma was trained using Tensor Processing Unit (TPU) hardware (TPUv4p, TPUv5p and TPUv5e). Training generative models requires significant computational power. TPUs, designed specifically for matrix operations common in machine learning, offer several advantages in this domain: - Performance: TPUs are specifically designed to handle the massive computations involved in training generative models. 
They can speed up training considerably compared to CPUs. - Memory: TPUs often come with large amounts of high-bandwidth memory, allowing for the handling of large models and batch sizes during training. This can lead to better model quality. - Scalability: TPU Pods (large clusters of TPUs) provide a scalable solution for handling the growing complexity of large foundation models. You can distribute training across multiple TPU devices for faster and more efficient processing. - Cost-effectiveness: In many scenarios, TPUs can provide a more cost-effective solution for training large models compared to CPU-based infrastructure, especially when considering the time and resources saved due to faster training. These advantages are aligned with Google's commitments to operate sustainably. Training was done using JAX and ML Pathways. JAX allows researchers to take advantage of the latest generation of hardware, including TPUs, for faster and more efficient training of large models. ML Pathways is Google's latest effort to build artificially intelligent systems capable of generalizing across multiple tasks. This is specially suitable for foundation models, including large language models like these ones. Together, JAX and ML Pathways are used as described in the paper about the Gemini family of models: "the 'single controller' programming model of Jax and Pathways allows a single Python process to orchestrate the entire training run, dramatically simplifying the development workflow." These models were evaluated at full precision (float32) against a large collection of different datasets and metrics to cover different aspects of content generation. Evaluation results marked with IT are for instruction-tuned models. Evaluation results marked with PT are for pre-trained models. 
| Benchmark | Metric | n-shot | E2B PT | E4B PT |
| ------------------------------ |----------------|----------|:--------:|:--------:|
| [HellaSwag][hellaswag] | Accuracy | 10-shot | 72.2 | 78.6 |
| [BoolQ][boolq] | Accuracy | 0-shot | 76.4 | 81.6 |
| [PIQA][piqa] | Accuracy | 0-shot | 78.9 | 81.0 |
| [SocialIQA][socialiqa] | Accuracy | 0-shot | 48.8 | 50.0 |
| [TriviaQA][triviaqa] | Accuracy | 5-shot | 60.8 | 70.2 |
| [Natural Questions][naturalq] | Accuracy | 5-shot | 15.5 | 20.9 |
| [ARC-c][arc] | Accuracy | 25-shot | 51.7 | 61.6 |
| [ARC-e][arc] | Accuracy | 0-shot | 75.8 | 81.6 |
| [WinoGrande][winogrande] | Accuracy | 5-shot | 66.8 | 71.7 |
| [BIG-Bench Hard][bbh] | Accuracy | few-shot | 44.3 | 52.9 |
| [DROP][drop] | Token F1 score | 1-shot | 53.9 | 60.8 |

[hellaswag]: https://arxiv.org/abs/1905.07830
[boolq]: https://arxiv.org/abs/1905.10044
[piqa]: https://arxiv.org/abs/1911.11641
[socialiqa]: https://arxiv.org/abs/1904.09728
[triviaqa]: https://arxiv.org/abs/1705.03551
[naturalq]: https://github.com/google-research-datasets/natural-questions
[arc]: https://arxiv.org/abs/1911.01547
[winogrande]: https://arxiv.org/abs/1907.10641
[bbh]: https://paperswithcode.com/dataset/bbh
[drop]: https://arxiv.org/abs/1903.00161

| Benchmark | Metric | n-shot | E2B IT | E4B IT |
| ------------------------------------|-------------------------|----------|:--------:|:--------:|
| [MGSM][mgsm] | Accuracy | 0-shot | 53.1 | 60.7 |
| [WMT24++][wmt24pp] (ChrF) | Character-level F-score | 0-shot | 42.7 | 50.1 |
| [Include][include] | Accuracy | 0-shot | 38.6 | 57.2 |
| [MMLU][mmlu] (ProX) | Accuracy | 0-shot | 8.1 | 19.9 |
| [OpenAI MMLU][openai-mmlu] | Accuracy | 0-shot | 22.3 | 35.6 |
| [Global-MMLU][global-mmlu] | Accuracy | 0-shot | 55.1 | 60.3 |
| [ECLeKTic][eclektic] | ECLeKTic score | 0-shot | 2.5 | 1.9 |

[mgsm]: https://arxiv.org/abs/2210.03057
[wmt24pp]: https://arxiv.org/abs/2502.12404v1
[include]: https://arxiv.org/abs/2411.19799
[mmlu]: https://arxiv.org/abs/2009.03300
[openai-mmlu]: https://huggingface.co/datasets/openai/MMMLU
[global-mmlu]: https://huggingface.co/datasets/CohereLabs/Global-MMLU
[eclektic]: https://arxiv.org/abs/2502.21228

| Benchmark | Metric | n-shot | E2B IT | E4B IT |
| ------------------------------------|--------------------------|----------|:--------:|:--------:|
| [GPQA][gpqa] Diamond | RelaxedAccuracy/accuracy | 0-shot | 24.8 | 23.7 |
| [LiveCodeBench][lcb] v5 | pass@1 | 0-shot | 18.6 | 25.7 |
| Codegolf v2.2 | pass@1 | 0-shot | 11.0 | 16.8 |
| [AIME 2025][aime-2025] | Accuracy | 0-shot | 6.7 | 11.6 |

[gpqa]: https://arxiv.org/abs/2311.12022
[lcb]: https://arxiv.org/abs/2403.07974
[aime-2025]: https://www.vals.ai/benchmarks/aime-2025-05-09

| Benchmark | Metric | n-shot | E2B IT | E4B IT |
| ------------------------------------ |------------|----------|:--------:|:--------:|
| [MMLU][mmlu] | Accuracy | 0-shot | 60.1 | 64.9 |
| [MBPP][mbpp] | pass@1 | 3-shot | 56.6 | 63.6 |
| [HumanEval][humaneval] | pass@1 | 0-shot | 66.5 | 75.0 |
| [LiveCodeBench][lcb] | pass@1 | 0-shot | 13.2 | 13.2 |
| HiddenMath | Accuracy | 0-shot | 27.7 | 37.7 |
| [Global-MMLU-Lite][global-mmlu-lite] | Accuracy | 0-shot | 59.0 | 64.5 |
| [MMLU][mmlu] (Pro) | Accuracy | 0-shot | 40.5 | 50.6 |

[mbpp]: https://arxiv.org/abs/2108.07732
[humaneval]: https://arxiv.org/abs/2107.03374
[global-mmlu-lite]: https://huggingface.co/datasets/CohereForAI/Global-MMLU-Lite

Our evaluation methods include structured evaluations and internal red-teaming testing of relevant content policies. Red-teaming was conducted by a number of different teams, each with different goals and human evaluation metrics. These models were evaluated against a number of different categories relevant to ethics and safety, including:

- Child Safety: Evaluation of text-to-text and image to text prompts covering child safety policies, including child sexual abuse and exploitation.
- Content Safety: Evaluation of text-to-text and image to text prompts covering safety policies including, harassment, violence and gore, and hate speech. - Representational Harms: Evaluation of text-to-text and image to text prompts covering safety policies including bias, stereotyping, and harmful associations or inaccuracies. In addition to development level evaluations, we conduct "assurance evaluations" which are our 'arms-length' internal evaluations for responsibility governance decision making. They are conducted separately from the model development team, to inform decision making about release. High level findings are fed back to the model team, but prompt sets are held-out to prevent overfitting and preserve the results' ability to inform decision making. Notable assurance evaluation results are reported to our Responsibility & Safety Council as part of release review. For all areas of safety testing, we saw safe levels of performance across the categories of child safety, content safety, and representational harms relative to previous Gemma models. All testing was conducted without safety filters to evaluate the model capabilities and behaviors. For text-to-text, image-to-text, and audio-to-text, and across all model sizes, the model produced minimal policy violations, and showed significant improvements over previous Gemma models' performance with respect to high severity violations. A limitation of our evaluations was they included primarily English language prompts. These models have certain limitations that users should be aware of. Open generative models have a wide range of applications across various industries and domains. The following list of potential uses is not comprehensive. The purpose of this list is to provide contextual information about the possible use-cases that the model creators considered as part of model training and development. 
- Content Creation and Communication - Text Generation: Generate creative text formats such as poems, scripts, code, marketing copy, and email drafts. - Chatbots and Conversational AI: Power conversational interfaces for customer service, virtual assistants, or interactive applications. - Text Summarization: Generate concise summaries of a text corpus, research papers, or reports. - Image Data Extraction: Extract, interpret, and summarize visual data for text communications. - Audio Data Extraction: Transcribe spoken language, translate speech to text in other languages, and analyze sound-based data. - Research and Education - Natural Language Processing (NLP) and generative model Research: These models can serve as a foundation for researchers to experiment with generative models and NLP techniques, develop algorithms, and contribute to the advancement of the field. - Language Learning Tools: Support interactive language learning experiences, aiding in grammar correction or providing writing practice. - Knowledge Exploration: Assist researchers in exploring large bodies of data by generating summaries or answering questions about specific topics. Limitations - Training Data - The quality and diversity of the training data significantly influence the model's capabilities. Biases or gaps in the training data can lead to limitations in the model's responses. - The scope of the training dataset determines the subject areas the model can handle effectively. - Context and Task Complexity - Models are better at tasks that can be framed with clear prompts and instructions. Open-ended or highly complex tasks might be challenging. - A model's performance can be influenced by the amount of context provided (longer context generally leads to better outputs, up to a certain point). - Language Ambiguity and Nuance - Natural language is inherently complex. Models might struggle to grasp subtle nuances, sarcasm, or figurative language. 
- Factual Accuracy - Models generate responses based on information they learned from their training datasets, but they are not knowledge bases. They may generate incorrect or outdated factual statements. - Common Sense - Models rely on statistical patterns in language. They might lack the ability to apply common sense reasoning in certain situations. Ethical Considerations and Risks The development of generative models raises several ethical concerns. In creating an open model, we have carefully considered the following: - Bias and Fairness - Generative models trained on large-scale, real-world text and image data can reflect socio-cultural biases embedded in the training material. These models underwent careful scrutiny, input data pre-processing described and posterior evaluations reported in this card. - Misinformation and Misuse - Generative models can be misused to generate text that is false, misleading, or harmful. - Guidelines are provided for responsible use with the model, see the Responsible Generative AI Toolkit. - Transparency and Accountability: - This model card summarizes details on the models' architecture, capabilities, limitations, and evaluation processes. - A responsibly developed open model offers the opportunity to share innovation by making generative model technology accessible to developers and researchers across the AI ecosystem. Risks identified and mitigations: - Perpetuation of biases: It's encouraged to perform continuous monitoring (using evaluation metrics, human review) and the exploration of de-biasing techniques during model training, fine-tuning, and other use cases. - Generation of harmful content: Mechanisms and guidelines for content safety are essential. Developers are encouraged to exercise caution and implement appropriate content safety safeguards based on their specific product policies and application use cases. 
- Misuse for malicious purposes: Technical limitations and developer and end-user education can help mitigate against malicious applications of generative models. Educational resources and reporting mechanisms for users to flag misuse are provided. Prohibited uses of Gemma models are outlined in the Gemma Prohibited Use Policy. - Privacy violations: Models were trained on data filtered for removal of certain personal information and other sensitive data. Developers are encouraged to adhere to privacy regulations with privacy-preserving techniques. Benefits At the time of release, this family of models provides high-performance open generative model implementations designed from the ground up for responsible AI development compared to similarly sized models. Using the benchmark evaluation metrics described in this document, these models have shown to provide superior performance to other, comparably-sized open model alternatives.
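
The input and output limits listed in the card above (256 tokens per image, 6.25 audio tokens per second, a 32K context, and output length equal to 32K minus the input) lend themselves to a quick budget check before sending a request. The following is a minimal sketch, not an official Gemma utility: the function names are hypothetical, "32K" is assumed to mean 32 * 1024 tokens, fractional audio tokens are assumed to round up, and the text token count is assumed to come from whatever tokenizer the caller uses.

```python
import math

CONTEXT_TOKENS = 32 * 1024       # assumption: "32K" means 32,768 tokens
TOKENS_PER_IMAGE = 256           # from the card: each image encodes to 256 tokens
AUDIO_TOKENS_PER_SECOND = 6.25   # from the card: single-channel audio rate

def input_tokens(text_tokens: int, num_images: int = 0, audio_seconds: float = 0.0) -> int:
    """Estimate the total input tokens for a mixed text/image/audio request."""
    # Assumption: a partial audio token still occupies a full slot.
    audio_tokens = math.ceil(audio_seconds * AUDIO_TOKENS_PER_SECOND)
    return text_tokens + num_images * TOKENS_PER_IMAGE + audio_tokens

def remaining_output_tokens(text_tokens: int, num_images: int = 0, audio_seconds: float = 0.0) -> int:
    """Tokens left for generation once the input has been counted."""
    used = input_tokens(text_tokens, num_images, audio_seconds)
    return max(0, CONTEXT_TOKENS - used)

# Example: a 500-token prompt with two images and a 30-second audio clip.
used = input_tokens(500, num_images=2, audio_seconds=30)
left = remaining_output_tokens(500, num_images=2, audio_seconds=30)
print(used, left)
```

In this example the two images cost 512 tokens and the audio clip 188, so the request uses 1,200 input tokens and leaves 31,568 for the response.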

NaNK
36,836
171

Qwen2.5-1.5B-unsloth-bnb-4bit

NaNK
license:apache-2.0
36,453
3

DeepSeek-R1-Distill-Llama-8B-GGUF

NaNK
llama
35,146
290

Mistral-Small-3.2-24B-Instruct-2506-GGUF

> [!NOTE]
> Includes our GGUF chat template fixes! Tool calling works as well! If you are using `llama.cpp`, use `--jinja` to enable the system prompt.

> Unsloth Dynamic 2.0 achieves SOTA performance in model quantization.

- Temperature of: 0.15
- Set top_p to: 1.00
- Max tokens (context length): 128K
- Fine-tune Mistral v0.3 (7B) for free using our Google Colab notebook here!
- View the rest of our notebooks in our docs here.

Mistral-Small-3.2-24B-Instruct-2506 is a minor update of Mistral-Small-3.1-24B-Instruct-2503.

Small-3.2 improves in the following categories:
- Instruction following: Small-3.2 is better at following precise instructions
- Repetition errors: Small-3.2 produces fewer infinite generations or repetitive answers
- Function calling: Small-3.2's function calling template is more robust (see here and examples)

In all other categories Small-3.2 should match or slightly improve compared to Mistral-Small-3.1-24B-Instruct-2503.

Key Features
- same as Mistral-Small-3.1-24B-Instruct-2503

We compare Mistral-Small-3.2-24B to Mistral-Small-3.1-24B-Instruct-2503. For more comparison against other models of similar size, please check Mistral-Small-3.1's Benchmarks.

| Model | Wildbench v2 | Arena Hard v2 | IF (Internal; accuracy) |
|-------|---------------|---------------|------------------------|
| Small 3.1 24B Instruct | 55.6% | 19.56% | 82.75% |
| Small 3.2 24B Instruct | 65.33% | 43.1% | 84.78% |

Small 3.2 reduces infinite generations by 2x on challenging, long and repetitive prompts.
| Model | Infinite Generations (Internal; Lower is better) | |-------|-------| | Small 3.1 24B Instruct | 2.11% | | Small 3.2 24B Instruct | 1.29% | | Model | MMLU | MMLU Pro (5-shot CoT) | MATH | GPQA Main (5-shot CoT) | GPQA Diamond (5-shot CoT )| MBPP Plus - Pass@5 | HumanEval Plus - Pass@5 | SimpleQA (TotalAcc)| |--------------------------------|-----------|-----------------------|------------------------|------------------------|---------------------------|--------------------|-------------------------|--------------------| | Small 3.1 24B Instruct | 80.62% | 66.76% | 69.30% | 44.42% | 45.96% | 74.63% | 88.99% | 10.43% | | Small 3.2 24B Instruct | 80.50% | 69.06% | 69.42% | 44.22% | 46.13% | 78.33% | 92.90% | 12.10% | | Model | MMMU | Mathvista | ChartQA | DocVQA | AI2D | |--------------------------------|------------|-----------|-----------|-----------|-----------| | Small 3.1 24B Instruct | 64.00% | 68.91%| 86.24% | 94.08% | 93.72% | | Small 3.2 24B Instruct | 62.50% | 67.09% | 87.4% | 94.86% | 92.91% | The model can be used with the following frameworks; - `vllm (recommended)`: See here - `transformers`: See here Note 1: We recommend using a relatively low temperature, such as `temperature=0.15`. Note 2: Make sure to add a system prompt to the model to best tailer it for your needs. If you want to use the model as a general assistant, we recommend to use the one provided in the SYSTEMPROMPT.txt file. Doing so should automatically install `mistralcommon >= 1.6.2`. You can also make use of a ready-to-go docker image or on the docker hub. We recommand that you use Mistral-Small-3.2-24B-Instruct-2506 in a server/client setting. Note: Running Mistral-Small-3.2-24B-Instruct-2506 on GPU requires ~55 GB of GPU RAM in bf16 or fp16. 2. To ping the client you can use a simple Python snippet. See the following examples. Take leverage of the vision capabilities of Mistral-Small-3.2-24B-Instruct-2506 to take the best choice given a scenario, go catch them all ! 
Mistral-Small-3.2-24B-Instruct-2506 is excellent at function / tool calling tasks via vLLM. E.g.: Mistral-Small-3.2-24B-Instruct-2506 will follow your instructions down to the last letter ! You can also use Mistral-Small-3.2-24B-Instruct-2506 with `Transformers` ! To make the best use of our model with `Transformers` make sure to have installed `mistral-common >= 1.6.2` to use our tokenizer. Then load our tokenizer along with the model and generate:
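The client side of the server/client setup recommended above can be sketched as follows. This is a minimal illustration, not the card's own snippet: it assumes a vLLM server exposing the OpenAI-compatible `/v1/chat/completions` route at `http://localhost:8000/v1`, and the model id, the placeholder system prompt, and the helper names `build_chat_request` and `send_chat_request` are assumptions made for the example. The payload applies the recommended `temperature=0.15` and always includes a system prompt.

```python
import json
import urllib.request

# Assumed model id and server address; adjust to whatever your server actually serves.
MODEL_ID = "mistralai/Mistral-Small-3.2-24B-Instruct-2506"
BASE_URL = "http://localhost:8000/v1"


def build_chat_request(user_text, system_prompt, temperature=0.15, max_tokens=512):
    """Build an OpenAI-style chat completion payload with a system prompt."""
    return {
        "model": MODEL_ID,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_text},
        ],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }


def send_chat_request(payload, base_url=BASE_URL):
    """POST the payload to a running server and return the assistant text."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Building the payload needs no server; sending it does.
payload = build_chat_request(
    "Summarize the difference between bf16 and fp16 in one sentence.",
    system_prompt="You are a helpful assistant.",  # placeholder, not the official prompt
)
print(json.dumps(payload, indent=2))
```

Call `send_chat_request(payload)` only once a server is running; the contents of the SYSTEM_PROMPT.txt file can be passed as `system_prompt` in place of the placeholder.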

NaNK
license:apache-2.0
34,984
128

Qwen3-4B-Instruct-2507

NaNK
license:apache-2.0
33,621
11

Devstral-Small-2507-GGUF

> [!NOTE]
> You should use `--jinja` to enable the system prompt in `llama.cpp`.

Devstral 1.1, with tool-calling and optional vision support. Unsloth Dynamic 2.0 achieves superior accuracy & outperforms other leading quants.

- Fine-tune Mistral v0.3 (7B) for free using our Google Colab notebook here!
- Read our Blog about Devstral 1.1 support: docs.unsloth.ai/basics/devstral
- View the rest of our notebooks in our docs here.

Devstral is an agentic LLM for software engineering tasks built under a collaboration between Mistral AI and All Hands AI 🙌. Devstral excels at using tools to explore codebases, editing multiple files, and powering software engineering agents. The model achieves remarkable performance on SWE-bench, which positions it as the #1 open source model on this benchmark.

It is finetuned from Mistral-Small-3.1, therefore it has a long context window of up to 128k tokens. As a coding agent, Devstral is text-only: the vision encoder was removed before fine-tuning from `Mistral-Small-3.1`.

For enterprises requiring specialized capabilities (increased context, domain-specific knowledge, etc.), we will release commercial models beyond what Mistral AI contributes to the community.

Updates compared to `Devstral Small 1.0`:
- Improved performance; please refer to the benchmark results.
- `Devstral Small 1.1` is still great when paired with OpenHands. This new version also generalizes better to other prompts and coding environments.
- Supports Mistral's function calling format.

Key Features:
- Agentic coding: Devstral is designed to excel at agentic coding tasks, making it a great choice for software engineering agents.
- Lightweight: with its compact size of just 24 billion parameters, Devstral is light enough to run on a single RTX 4090 or a Mac with 32GB RAM, making it an appropriate model for local deployment and on-device use.
- Apache 2.0 License: Open license allowing usage and modification for both commercial and non-commercial purposes.
- Context Window: A 128k context window.
- Tokenizer: Utilizes a Tekken tokenizer with a 131k vocabulary size.

Devstral Small 1.1 achieves a score of 53.6% on SWE-Bench Verified, outperforming Devstral Small 1.0 by +6.8% and the second best state-of-the-art model by +11.4%.

| Model | Agentic Scaffold | SWE-Bench Verified (%) |
|--------------------|--------------------|------------------------|
| Devstral Small 1.1 | OpenHands Scaffold | 53.6 |
| Devstral Small 1.0 | OpenHands Scaffold | 46.8 |
| GPT-4.1-mini | OpenAI Scaffold | 23.6 |
| Claude 3.5 Haiku | Anthropic Scaffold | 40.6 |
| SWE-smith-LM 32B | SWE-agent Scaffold | 40.2 |
| Skywork SWE | OpenHands Scaffold | 38.0 |
| DeepSWE | R2E-Gym Scaffold | 42.2 |

When evaluated under the same test scaffold (OpenHands, provided by All Hands AI 🙌), Devstral exceeds far larger models such as Deepseek-V3-0324 and Qwen3 235B-A22B.

We recommend using Devstral with the OpenHands scaffold. You can use it either through our API or by running locally.

API

Follow these instructions to create a Mistral account and get an API key. Then run these commands to start the OpenHands docker container.

The model can also be deployed with the following libraries:
- `vllm` (recommended): See here
- `mistral-inference`: See here
- `transformers`: See here
- `LMStudio`: See here
- `llama.cpp`: See here
- `ollama`: See here

Make sure to install [`vLLM >= 0.9.1`](https://github.com/vllm-project/vllm/releases/tag/v0.9.1). Also make sure to have installed `mistral_common >= 1.7.0`. You can also use a ready-to-go Docker image, also available on Docker Hub.

We recommend that you use Devstral in a server/client setting.

2. To ping the client you can use a simple Python snippet.

Then load our tokenizer along with the model and generate.

Make sure you launched an OpenAI-compatible server such as vLLM or Ollama as described above. Then, you can use OpenHands to interact with `Devstral Small 1.1`. In the case of the tutorial we spun up a vLLM server. The server address should be in the following format: `http://<your-server-url>:8000/v1`

The easiest way to launch OpenHands is to use the Docker image. Then, you can access the OpenHands UI at `http://localhost:3000`.

When accessing the OpenHands UI, you will be prompted to connect to a server. You can use the advanced mode to connect to the server you launched earlier. Fill in the following fields:
- Custom Model: `openai/mistralai/Devstral-Small-2507`
- Base URL: `http://<your-server-url>:8000/v1`
- API Key: `token` (or any other token you used to launch the server, if any)

For Cline, the same prerequisite applies: an OpenAI-compatible server such as vLLM or Ollama must be running, and the server address uses the same `http://<your-server-url>:8000/v1` format. You can follow the installation of Cline here. Then you can configure the server address in the settings.

OpenHands: Understanding Test Coverage of Mistral Common

We can start the OpenHands scaffold and link it to a repo to analyze test coverage and identify badly covered files. Here we start with our public `mistral-common` repo. After the repo is mounted in the workspace, we give the agent an instruction. The agent will first browse the code base to check test configuration and structure. Then it sets up the testing dependencies and launches the coverage test. Finally, the agent writes the necessary code to visualize the coverage, export the results, and save the plots to a PNG.

Cline: building a video game

First initialize Cline inside VSCode and connect it to the server you launched earlier. We then give an instruction to build the video game. Don't hesitate to iterate or give more information to Devstral to improve the game!
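The SWE-Bench Verified margins quoted earlier in this card can be re-derived from the benchmark table. The sketch below is only an arithmetic check on the published scores; the dictionary and the helper name `margin` are illustrative, and "second best" is taken here to mean the highest-scoring non-Devstral row of that table.

```python
# SWE-Bench Verified scores (%) copied from the table in this card.
scores = {
    "Devstral Small 1.1": 53.6,
    "Devstral Small 1.0": 46.8,
    "GPT-4.1-mini": 23.6,
    "Claude 3.5 Haiku": 40.6,
    "SWE-smith-LM 32B": 40.2,
    "Skywork SWE": 38.0,
    "DeepSWE": 42.2,
}


def margin(model_a, model_b):
    """Difference in percentage points, rounded to one decimal place."""
    return round(scores[model_a] - scores[model_b], 1)


# Best-scoring model that is not a Devstral release.
best_other = max(
    (name for name in scores if not name.startswith("Devstral")),
    key=scores.get,
)

print("vs Devstral Small 1.0:", margin("Devstral Small 1.1", "Devstral Small 1.0"))
print(f"vs {best_other}:", margin("Devstral Small 1.1", best_other))
```

Both differences are in percentage points of the benchmark score, which is how the "+6.8%" and "+11.4%" figures should be read.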

license:apache-2.0
33,120
70

Qwen3-14B

Qwen3 is the latest generation of large language models in the Qwen series, offering a comprehensive suite of dense and mixture-of-experts (MoE) models. Built upon extensive training, Qwen3 delivers groundbreaking advancements in reasoning, instruction-following, agent capabilities, and multilingual support, with the following key features:
- Unique support for seamless switching between thinking mode (for complex logical reasoning, math, and coding) and non-thinking mode (for efficient, general-purpose dialogue) within a single model, ensuring optimal performance across various scenarios.
- Significant enhancement of its reasoning capabilities, surpassing previous QwQ (in thinking mode) and Qwen2.5 instruct models (in non-thinking mode) on mathematics, code generation, and commonsense logical reasoning.
- Superior human preference alignment, excelling in creative writing, role-playing, multi-turn dialogues, and instruction following, to deliver a more natural, engaging, and immersive conversational experience.
- Expertise in agent capabilities, enabling precise integration with external tools in both thinking and non-thinking modes and achieving leading performance among open-source models in complex agent-based tasks.
- Support for 100+ languages and dialects with strong capabilities for multilingual instruction following and translation.

Qwen3-14B has the following features:
- Type: Causal Language Models
- Training Stage: Pretraining & Post-training
- Number of Parameters: 14.8B
- Number of Parameters (Non-Embedding): 13.2B
- Number of Layers: 40
- Number of Attention Heads (GQA): 40 for Q and 8 for KV
- Context Length: 32,768 tokens natively and 131,072 tokens with YaRN.

For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our blog, GitHub, and Documentation.

The code of Qwen3 is included in the latest Hugging Face `transformers`, and we advise you to use the latest version of `transformers`.

The generated token ids are split into thinking content and final content at the last occurrence of token id 151668:

```python
try:
    # Find the last occurrence of token id 151668 (end of the thinking block)
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content)
print("content:", content)
```

Serving with vLLM or SGLang:

```shell
vllm serve Qwen/Qwen3-14B --enable-reasoning --reasoning-parser deepseek_r1
```

```shell
python -m sglang.launch_server --model-path Qwen/Qwen3-14B --reasoning-parser deepseek-r1
```

Thinking mode enabled (the default):

```python
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True  # True is the default value for enable_thinking
)
```

Thinking mode disabled:

```python
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False  # Setting enable_thinking=False disables thinking mode
)
```

Switching between modes per turn with `/think` and `/no_think`:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

class QwenChatbot:
    def __init__(self, model_name="Qwen/Qwen3-14B"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(model_name)
        self.history = []

    def generate_response(self, user_input):
        messages = self.history + [{"role": "user", "content": user_input}]

        text = self.tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True
        )

        inputs = self.tokenizer(text, return_tensors="pt")
        response_ids = self.model.generate(**inputs, max_new_tokens=32768)[0][len(inputs.input_ids[0]):].tolist()
        response = self.tokenizer.decode(response_ids, skip_special_tokens=True)

        # Update history
        self.history.append({"role": "user", "content": user_input})
        self.history.append({"role": "assistant", "content": response})

        return response

# Example Usage
if __name__ == "__main__":
    chatbot = QwenChatbot()

    # First input (without /think or /no_think tags, thinking mode is enabled by default)
    user_input_1 = "How many r's in strawberries?"
    print(f"User: {user_input_1}")
    response_1 = chatbot.generate_response(user_input_1)
    print(f"Bot: {response_1}")
    print("----------------------")

    # Second input with /no_think
    user_input_2 = "Then, how many r's in blueberries? /no_think"
    print(f"User: {user_input_2}")
    response_2 = chatbot.generate_response(user_input_2)
    print(f"Bot: {response_2}")
    print("----------------------")

    # Third input with /think
    user_input_3 = "Really? /think"
    print(f"User: {user_input_3}")
    response_3 = chatbot.generate_response(user_input_3)
    print(f"Bot: {response_3}")
```

Agentic use with Qwen-Agent:

```python
from qwen_agent.agents import Assistant

# Define LLM
llm_cfg = {
    'model': 'Qwen3-14B',

    # Use the endpoint provided by Alibaba Model Studio:
    # 'model_type': 'qwen_dashscope',
    # 'api_key': os.getenv('DASHSCOPE_API_KEY'),

    # Use a custom endpoint compatible with OpenAI API:
    'model_server': 'http://localhost:8000/v1',  # api_base
    'api_key': 'EMPTY',

    # Other parameters:
    # 'generate_cfg': {
    #     # Add: When the response content is `<think>this is the thought</think>this is the answer`;
    #     # Do not add: When the response has been separated by reasoning_content and content.
    #     'thought_in_content': True,
    # },
}

# Define Tools
tools = [
    {'mcpServers': {  # You can specify the MCP configuration file
        'time': {
            'command': 'uvx',
            'args': ['mcp-server-time', '--local-timezone=Asia/Shanghai']
        },
        "fetch": {
            "command": "uvx",
            "args": ["mcp-server-fetch"]
        }
    }},
    'code_interpreter',  # Built-in tools
]

# Define Agent
bot = Assistant(llm=llm_cfg, function_list=tools)

# Streaming generation
messages = [{'role': 'user', 'content': 'https://qwenlm.github.io/blog/ Introduce the latest developments of Qwen'}]
for responses in bot.run(messages=messages):
    pass
print(responses)
```

Long-context processing with YaRN, configured in `config.json`:

```json
{
    ...,
    "rope_scaling": {
        "type": "yarn",
        "factor": 4.0,
        "original_max_position_embeddings": 32768
    }
}
```

```shell
vllm serve ... --rope-scaling '{"type":"yarn","factor":4.0,"original_max_position_embeddings":32768}' --max-model-len 131072
```

```shell
python -m sglang.launch_server ... --json-model-override-args '{"rope_scaling":{"type":"yarn","factor":4.0,"original_max_position_embeddings":32768}}'
```

```shell
llama-server ... --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 32768
```

> Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'original_max_position_embeddings'}

```
@misc{qwen3,
    title  = {Qwen3},
    url    = {https://qwenlm.github.io/blog/qwen3/},
    author = {Qwen Team},
    month  = {April},
    year   = {2025}
}
```
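The YaRN settings in this card follow directly from the two context lengths it lists: 131,072 / 32,768 = 4, which is the `factor` value. The sketch below derives the `rope_scaling` dictionary from a native and a target context length; the helper name `yarn_rope_scaling` is hypothetical and is not part of `transformers`, vLLM, or SGLang.

```python
import json


def yarn_rope_scaling(native_ctx, target_ctx):
    """Return a YaRN rope_scaling dict that stretches native_ctx to target_ctx."""
    if target_ctx <= native_ctx:
        raise ValueError("YaRN scaling is only needed when target_ctx exceeds native_ctx")
    return {
        "type": "yarn",
        "factor": target_ctx / native_ctx,  # float division, e.g. 4.0
        "original_max_position_embeddings": native_ctx,
    }


cfg = yarn_rope_scaling(native_ctx=32768, target_ctx=131072)

# Compact JSON, in the form passed on the command line in the examples above.
cli_arg = json.dumps(cfg, separators=(",", ":"))
print(cli_arg)
```

A smaller target, for example 65,536 tokens, gives a factor of 2.0 by the same division.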

NaNK
license:apache-2.0
32,939
14

Kimi-K2-Instruct-GGUF

Learn how to run Kimi-K2 Dynamic GGUFs - Read our Guide! Unsloth Dynamic 2.0 achieves superior accuracy & outperforms other leading quants.

- You can now use the latest update of llama.cpp to run the model.
- For complete detailed instructions, see our guide: docs.unsloth.ai/basics/kimi-k2

It is recommended to have at least 128GB of unified memory to run the small quants. With 16GB VRAM and 256GB RAM, expect 5+ tokens/sec. For best results, use any 2-bit XL quant or above. Set the temperature to 0.6 (recommended) to reduce repetition and incoherence.

📰 Tech Blog | 📄 Paper Link (coming soon)

2025.7.15
- We have updated our tokenizer implementation. Now special tokens like `[EOS]` can be encoded to their token ids.
- We fixed a bug in the chat template that was breaking multi-turn tool calls.

Kimi K2 is a state-of-the-art mixture-of-experts (MoE) language model with 32 billion activated parameters and 1 trillion total parameters. Trained with the Muon optimizer, Kimi K2 achieves exceptional performance across frontier knowledge, reasoning, and coding tasks while being meticulously optimized for agentic capabilities.

Key Features
- Large-Scale Training: Pre-trained a 1T parameter MoE model on 15.5T tokens with zero training instability.
- MuonClip Optimizer: We apply the Muon optimizer at an unprecedented scale, and develop novel optimization techniques to resolve instabilities while scaling up.
- Agentic Intelligence: Specifically designed for tool use, reasoning, and autonomous problem-solving.

Model Variants
- Kimi-K2-Base: The foundation model, a strong start for researchers and builders who want full control for fine-tuning and custom solutions.
- Kimi-K2-Instruct: The post-trained model best for drop-in, general-purpose chat and agentic experiences. It is a reflex-grade model without long thinking.

| | |
|:---:|:---:|
| Architecture | Mixture-of-Experts (MoE) |
| Total Parameters | 1T |
| Activated Parameters | 32B |
| Number of Layers (Dense layer included) | 61 |
| Number of Dense Layers | 1 |
| Attention Hidden Dimension | 7168 |
| MoE Hidden Dimension (per Expert) | 2048 |
| Number of Attention Heads | 64 |
| Number of Experts | 384 |
| Selected Experts per Token | 8 |
| Number of Shared Experts | 1 |
| Vocabulary Size | 160K |
| Context Length | 128K |
| Attention Mechanism | MLA |
| Activation Function | SwiGLU |

| Benchmark | Metric | Kimi K2 Instruct | DeepSeek-V3-0324 | Qwen3-235B-A22B (non-thinking) | Claude Sonnet 4 (w/o extended thinking) | Claude Opus 4 (w/o extended thinking) | GPT-4.1 | Gemini 2.5 Flash Preview (05-20) |
|---|---|---|---|---|---|---|---|---|
| LiveCodeBench v6 (Aug 24 - May 25) | Pass@1 | 53.7 | 46.9 | 37.0 | 48.5 | 47.4 | 44.7 | 44.7 |
| MultiPL-E | Pass@1 | 85.7 | 83.1 | 78.2 | 88.6 | 89.6 | 86.7 | 85.6 |
| SWE-bench Verified (Agentless Coding) | Single Patch w/o Test (Acc) | 51.8 | 36.6 | 39.4 | 50.2 | 53.0 | 40.8 | 32.6 |
| SWE-bench Verified (Agentic Coding) | Single Attempt (Acc) | 65.8 | 38.8 | 34.4 | 72.7 | 72.5 | 54.6 | - |
| SWE-bench Verified (Agentic Coding) | Multiple Attempts (Acc) | 71.6 | - | - | 80.2 | 79.4 | - | - |
| SWE-bench Multilingual (Agentic Coding) | Single Attempt (Acc) | 47.3 | 25.8 | 20.9 | 51.0 | - | 31.5 | - |
| TerminalBench | Inhouse Framework (Acc) | 30.0 | - | - | 35.5 | 43.2 | 8.3 | - |
| TerminalBench | Terminus (Acc) | 25.0 | 16.3 | 6.6 | - | - | 30.3 | 16.8 |
| Aider-Polyglot | Acc | 60.0 | 55.1 | 61.8 | 56.4 | 70.7 | 52.4 | 44.0 |
| Tau2 retail | Avg@4 | 70.6 | 69.1 | 57.0 | 75.0 | 81.8 | 74.8 | 64.3 |
| Tau2 airline | Avg@4 | 56.5 | 39.0 | 26.5 | 55.5 | 60.0 | 54.5 | 42.5 |
| Tau2 telecom | Avg@4 | 65.8 | 32.5 | 22.1 | 45.2 | 57.0 | 38.6 | 16.9 |
| AIME 2024 | Avg@64 | 69.6 | 59.4 | 40.1 | 43.4 | 48.2 | 46.5 | 61.3 |
| AIME 2025 | Avg@64 | 49.5 | 46.7 | 24.7 | 33.1 | 33.9 | 37.0 | 46.6 |
| HMMT 2025 | Avg@32 | 38.8 | 27.5 | 11.9 | 15.9 | 15.9 | 19.4 | 34.7 |
| CNMO 2024 | Avg@16 | 74.3 | 74.7 | 48.6 | 60.4 | 57.6 | 56.6 | 75.0 |
| PolyMath-en | Avg@4 | 65.1 | 59.5 | 51.9 | 52.8 | 49.8 | 54.0 | 49.9 |
| GPQA-Diamond | Avg@8 | 75.1 | 68.4 | 62.9 | 70.0 | 74.9 | 66.3 | 68.2 |
| Humanity's Last Exam (Text Only) | - | 4.7 | 5.2 | 5.7 | 5.8 | 7.1 | 3.7 | 5.6 |
| IFEval | Prompt Strict | 89.8 | 81.1 | 83.2 | 87.6 | 87.4 | 88.0 | 84.3 |
| Multi-Challenge | Acc | 54.1 | 31.4 | 34.0 | 46.8 | 49.0 | 36.4 | 39.5 |
| SimpleQA | Correct | 31.0 | 27.7 | 13.2 | 15.9 | 22.8 | 42.3 | 23.3 |
| Livebench | Pass@1 | 76.4 | 72.4 | 67.6 | 74.8 | 74.6 | 69.8 | 67.8 |

• Bold denotes global SOTA, and underlined denotes open-source SOTA.
• Data points marked with * are taken directly from the model's tech report or blog.
• All metrics, except for SWE-bench Verified (Agentless), are evaluated with an 8k output token length. SWE-bench Verified (Agentless) is limited to a 16k output token length.
• Kimi K2 achieves 65.8% pass@1 on the SWE-bench Verified tests with bash/editor tools (single-attempt patches, no test-time compute). It also achieves 47.3% pass@1 on the SWE-bench Multilingual tests under the same conditions. Additionally, we report results on SWE-bench Verified tests (71.6%) that leverage parallel test-time compute by sampling multiple sequences and selecting the single best via an internal scoring model.
• To ensure the stability of the evaluation, we employed avg@k on AIME, HMMT, CNMO, PolyMath-en, GPQA-Diamond, EvalPlus, and Tau2.
• Some data points have been omitted due to prohibitively expensive evaluation costs.

| Benchmark | Metric | Shot | Kimi K2 Base | Deepseek-V3-Base | Qwen2.5-72B | Llama 4 Maverick |
|---|---|---|---|---|---|---|

• We only evaluate open-source pretrained models in this work. We report results for Qwen2.5-72B because the base checkpoint for Qwen3-235B-A22B was not open-sourced at the time of our study.
• All models are evaluated using the same evaluation protocol.

4. Deployment

> [!Note]
> You can access Kimi K2's API at https://platform.moonshot.ai; we provide an OpenAI/Anthropic-compatible API.
>
> The Anthropic-compatible API maps temperature by `real_temperature = request_temperature * 0.6` for better compatibility with existing applications.

Our model checkpoints are stored in the block-fp8 format; you can find them on Hugging Face.

Currently, Kimi-K2 is recommended to run on the following inference engines:

Deployment examples for vLLM and SGLang can be found in the Model Deployment Guide.

Once the local inference service is up, you can interact with it through the chat endpoint:

> [!NOTE]
> The recommended temperature for Kimi-K2-Instruct is `temperature = 0.6`.
> If no special instructions are required, the system prompt above is a good default.

Kimi-K2-Instruct has strong tool-calling capabilities. To enable them, you need to pass the list of available tools in each request; the model will then autonomously decide when and how to invoke them.

The following example demonstrates calling a weather tool end-to-end:

The `tool_call_with_client` function implements the pipeline from user query to tool execution. This pipeline requires the inference engine to support Kimi-K2’s native tool-parsing logic. For streaming output and manual tool-parsing, see the Tool Calling Guide.

Both the code repository and the model weights are released under the Modified MIT License.

If you have any questions, please reach out at [email protected].
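The local half of the tool-calling pipeline described above (declare tools, receive a tool call from the model, execute it, return the result as a `tool` message) can be sketched without a running server. This is an illustration under stated assumptions, not the card's `tool_call_with_client` function: the tool `get_weather`, its canned return value, and the helper `execute_tool_call` are hypothetical, and the message shapes follow the common OpenAI-style chat format.

```python
import json


def get_weather(city):
    """Hypothetical tool: returns canned weather data for the given city."""
    return {"city": city, "weather": "Sunny"}


# Tool schema sent with every request so the model knows what it may call.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Retrieve current weather information for a city.",
        "parameters": {
            "type": "object",
            "required": ["city"],
            "properties": {"city": {"type": "string", "description": "Name of the city"}},
        },
    },
}]

# Maps tool names to the Python callables that implement them.
tool_map = {"get_weather": get_weather}


def execute_tool_call(tool_call):
    """Run one OpenAI-style tool call and wrap the result as a tool message."""
    name = tool_call["function"]["name"]
    arguments = json.loads(tool_call["function"]["arguments"])
    result = tool_map[name](**arguments)
    return {
        "role": "tool",
        "tool_call_id": tool_call["id"],
        "name": name,
        "content": json.dumps(result),
    }


# A tool call shaped like what an OpenAI-compatible server would return.
example_call = {
    "id": "call_0",
    "type": "function",
    "function": {"name": "get_weather", "arguments": json.dumps({"city": "Beijing"})},
}
tool_message = execute_tool_call(example_call)
print(tool_message)
```

In a full loop, `tool_message` is appended to the conversation and the request is repeated until the model answers without requesting another tool.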

32,597
209

gemma-3-27b-it

NaNK
32,461
13

mistral-7b-bnb-4bit

NaNK
license:apache-2.0
32,212
29

Qwen2.5-VL-7B-Instruct-unsloth-bnb-4bit

NaNK
license:apache-2.0
31,479
50

Llama-4-Scout-17B-16E-Instruct-GGUF

NaNK
llama4
31,142
124

Qwen3-Coder-30B-A3B-Instruct

NaNK
license:apache-2.0
31,117
18

Qwen3-0.6B-GGUF

See our collection for all versions of Qwen3 including GGUF, 4-bit & 16-bit formats. Unsloth Dynamic 2.0 achieves superior accuracy & outperforms other leading quants.

- Fine-tune Qwen3 (14B) for free using our Google Colab notebook here!
- Read our Blog about Qwen3 support: unsloth.ai/blog/qwen3
- View the rest of our notebooks in our docs here.
- Run & export your fine-tuned model to Ollama, llama.cpp or HF.

| Unsloth supports | Free Notebooks | Performance | Memory use |
|------------------|----------------|-------------|------------|
| Qwen3 (14B) | ▶️ Start on Colab | 3x faster | 70% less |
| GRPO with Qwen3 (8B) | ▶️ Start on Colab | 3x faster | 80% less |
| Llama-3.2 (3B) | ▶️ Start on Colab | 2.4x faster | 58% less |
| Llama-3.2 (11B vision) | ▶️ Start on Colab | 2x faster | 60% less |
| Qwen2.5 (7B) | ▶️ Start on Colab | 2x faster | 60% less |
| Phi-4 (14B) | ▶️ Start on Colab | 2x faster | 50% less |

To Switch Between Thinking and Non-Thinking

If you are using llama.cpp, Ollama, Open WebUI etc., you can add `/think` and `/no_think` to user prompts or system messages to switch the model's thinking mode from turn to turn. The model will follow the most recent instruction in multi-turn conversations.

Qwen3 is the latest generation of large language models in the Qwen series, offering a comprehensive suite of dense and mixture-of-experts (MoE) models. Built upon extensive training, Qwen3 delivers groundbreaking advancements in reasoning, instruction-following, agent capabilities, and multilingual support, with the following key features:
- Unique support for seamless switching between thinking mode (for complex logical reasoning, math, and coding) and non-thinking mode (for efficient, general-purpose dialogue) within a single model, ensuring optimal performance across various scenarios.
- Significant enhancement of its reasoning capabilities, surpassing previous QwQ (in thinking mode) and Qwen2.5 instruct models (in non-thinking mode) on mathematics, code generation, and commonsense logical reasoning.
- Superior human preference alignment, excelling in creative writing, role-playing, multi-turn dialogues, and instruction following, to deliver a more natural, engaging, and immersive conversational experience.
- Expertise in agent capabilities, enabling precise integration with external tools in both thinking and non-thinking modes and achieving leading performance among open-source models in complex agent-based tasks.
- Support for 100+ languages and dialects with strong capabilities for multilingual instruction following and translation.

Qwen3-0.6B has the following features:
- Type: Causal Language Models
- Training Stage: Pretraining & Post-training
- Number of Parameters: 0.6B
- Number of Parameters (Non-Embedding): 0.44B
- Number of Layers: 28
- Number of Attention Heads (GQA): 16 for Q and 8 for KV
- Context Length: 32,768

For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our blog, GitHub, and Documentation.

The code of Qwen3 is included in the latest Hugging Face `transformers`, and we advise you to use the latest version of `transformers`.

The generated token ids are split into thinking content and final content at the last occurrence of token id 151668:

```python
try:
    # Find the last occurrence of token id 151668 (end of the thinking block)
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content)
print("content:", content)
```

Serving with vLLM or SGLang:

```shell
vllm serve Qwen/Qwen3-0.6B --enable-reasoning --reasoning-parser deepseek_r1
```

```shell
python -m sglang.launch_server --model-path Qwen/Qwen3-0.6B --reasoning-parser deepseek-r1
```

Thinking mode enabled (the default):

```python
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True  # True is the default value for enable_thinking
)
```

Thinking mode disabled:

```python
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False  # Setting enable_thinking=False disables thinking mode
)
```

Switching between modes per turn with `/think` and `/no_think`:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

class QwenChatbot:
    def __init__(self, model_name="Qwen/Qwen3-0.6B"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(model_name)
        self.history = []

    def generate_response(self, user_input):
        messages = self.history + [{"role": "user", "content": user_input}]

        text = self.tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True
        )

        inputs = self.tokenizer(text, return_tensors="pt")
        response_ids = self.model.generate(**inputs, max_new_tokens=32768)[0][len(inputs.input_ids[0]):].tolist()
        response = self.tokenizer.decode(response_ids, skip_special_tokens=True)

        # Update history
        self.history.append({"role": "user", "content": user_input})
        self.history.append({"role": "assistant", "content": response})

        return response

# Example Usage
if __name__ == "__main__":
    chatbot = QwenChatbot()

    # First input (without /think or /no_think tags, thinking mode is enabled by default)
    user_input_1 = "How many r's in strawberries?"
    print(f"User: {user_input_1}")
    response_1 = chatbot.generate_response(user_input_1)
    print(f"Bot: {response_1}")
    print("----------------------")

    # Second input with /no_think
    user_input_2 = "Then, how many r's in blueberries? /no_think"
    print(f"User: {user_input_2}")
    response_2 = chatbot.generate_response(user_input_2)
    print(f"Bot: {response_2}")
    print("----------------------")

    # Third input with /think
    user_input_3 = "Really? /think"
    print(f"User: {user_input_3}")
    response_3 = chatbot.generate_response(user_input_3)
    print(f"Bot: {response_3}")
```

Agentic use with Qwen-Agent:

```python
from qwen_agent.agents import Assistant

# Define LLM
llm_cfg = {
    'model': 'Qwen3-0.6B',

    # Use the endpoint provided by Alibaba Model Studio:
    # 'model_type': 'qwen_dashscope',
    # 'api_key': os.getenv('DASHSCOPE_API_KEY'),

    # Use a custom endpoint compatible with OpenAI API:
    'model_server': 'http://localhost:8000/v1',  # api_base
    'api_key': 'EMPTY',

    # Other parameters:
    # 'generate_cfg': {
    #     # Add: When the response content is `<think>this is the thought</think>this is the answer`;
    #     # Do not add: When the response has been separated by reasoning_content and content.
    #     'thought_in_content': True,
    # },
}

# Define Tools
tools = [
    {'mcpServers': {  # You can specify the MCP configuration file
        'time': {
            'command': 'uvx',
            'args': ['mcp-server-time', '--local-timezone=Asia/Shanghai']
        },
        "fetch": {
            "command": "uvx",
            "args": ["mcp-server-fetch"]
        }
    }},
    'code_interpreter',  # Built-in tools
]

# Define Agent
bot = Assistant(llm=llm_cfg, function_list=tools)

# Streaming generation
messages = [{'role': 'user', 'content': 'https://qwenlm.github.io/blog/ Introduce the latest developments of Qwen'}]
for responses in bot.run(messages=messages):
    pass
print(responses)
```

```
@misc{qwen3,
    title  = {Qwen3},
    url    = {https://qwenlm.github.io/blog/qwen3/},
    author = {Qwen Team},
    month  = {April},
    year   = {2025}
}
```
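The thinking/content split used in this card's quickstart works on plain lists of token ids, so it can be exercised without loading a model. The sketch below wraps the same reverse-search logic in a function; the name `split_thinking` and the toy id lists are illustrative, while the marker id 151668 is the one given in the card.

```python
END_OF_THINKING_ID = 151668  # marker id used by the card to end the thinking block


def split_thinking(output_ids, end_id=END_OF_THINKING_ID):
    """Split ids into (thinking_ids, content_ids) at the last occurrence of end_id.

    If end_id is absent, everything is treated as content.
    """
    try:
        # Searching the reversed list finds the LAST occurrence of end_id.
        index = len(output_ids) - output_ids[::-1].index(end_id)
    except ValueError:
        index = 0
    return output_ids[:index], output_ids[index:]


# Toy ids: two "thinking" tokens, the marker, then two "answer" tokens.
thinking_ids, content_ids = split_thinking([11, 12, 151668, 21, 22])
print(thinking_ids, content_ids)

# With thinking disabled the marker never appears, so nothing is split off.
no_thinking, all_content = split_thinking([31, 32, 33])
print(no_thinking, all_content)
```

The marker itself stays in the thinking half; decoding with `skip_special_tokens=True`, as the card does, is what hides it from the printed text.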

NaNK
license:apache-2.0
31,091
81

Qwen2.5-3B-Instruct-unsloth-bnb-4bit

NaNK
license:apache-2.0
30,936
7

Qwen3-8B

NaNK
license:apache-2.0
30,010
12

gpt-oss-safeguard-20b-GGUF

NaNK
license:apache-2.0
29,856
6

Qwen2.5-7B-Instruct-bnb-4bit

NaNK
license:apache-2.0
29,649
17

Qwen3-VL-4B-Instruct-GGUF

NaNK
license:apache-2.0
29,362
19

gpt-oss-20b

Welcome to the gpt-oss series, OpenAI’s open-weight models designed for powerful reasoning, agentic tasks, and versatile developer use cases.

We’re releasing two flavors of these open models:
- `gpt-oss-120b`: for production, general purpose, high reasoning use cases that fit into a single H100 GPU (117B parameters with 5.1B active parameters)
- `gpt-oss-20b`: for lower latency, and local or specialized use cases (21B parameters with 3.6B active parameters)

Both models were trained on our harmony response format and should only be used with the harmony format, as they will not work correctly otherwise.

> [!NOTE]
> This model card is dedicated to the smaller `gpt-oss-20b` model. Check out `gpt-oss-120b` for the larger model.

- Permissive Apache 2.0 license: Build freely without copyleft restrictions or patent risk, ideal for experimentation, customization, and commercial deployment.
- Configurable reasoning effort: Easily adjust the reasoning effort (low, medium, high) based on your specific use case and latency needs.
- Full chain-of-thought: Gain complete access to the model’s reasoning process, facilitating easier debugging and increased trust in outputs. It’s not intended to be shown to end users.
- Fine-tunable: Fully customize models to your specific use case through parameter fine-tuning.
- Agentic capabilities: Use the models’ native capabilities for function calling, web browsing, Python code execution, and Structured Outputs.
- Native MXFP4 quantization: The models are trained with native MXFP4 precision for the MoE layer, making `gpt-oss-120b` run on a single H100 GPU and the `gpt-oss-20b` model run within 16GB of memory.

You can use `gpt-oss-120b` and `gpt-oss-20b` with Transformers. If you use the Transformers chat template, it will automatically apply the harmony response format. If you use `model.generate` directly, you need to apply the harmony format manually using the chat template or use our openai-harmony package.

To get started, install the necessary dependencies to set up your environment:

Once set up, you can run the model with the snippet below:

Alternatively, you can run the model via `Transformers Serve` to spin up an OpenAI-compatible webserver:

Learn more about how to use gpt-oss with Transformers.

vLLM recommends using uv for Python dependency management. You can use vLLM to spin up an OpenAI-compatible webserver. The following command will automatically download the model and start the server.

To learn about how to use this model with PyTorch and Triton, check out our reference implementations in the gpt-oss repository.

If you are trying to run gpt-oss on consumer hardware, you can use Ollama by running the following commands after installing Ollama.

If you are using LM Studio, you can use the following commands to download.

Check out our awesome list for a broader collection of gpt-oss resources and inference partners.

You can download the model weights from the Hugging Face Hub directly with the Hugging Face CLI:

You can adjust the reasoning level to suit your task across three levels:
- Low: Fast responses for general dialogue.
- Medium: Balanced speed and detail.
- High: Deep and detailed analysis.

The reasoning level can be set in the system prompt, e.g., "Reasoning: high".

The gpt-oss models are excellent for:
- Web browsing (using built-in browsing tools)
- Function calling with defined schemas
- Agentic operations like browser tasks

Both gpt-oss models can be fine-tuned for a variety of specialized use cases. This smaller model `gpt-oss-20b` can be fine-tuned on consumer hardware, whereas the larger `gpt-oss-120b` can be fine-tuned on a single H100 node.
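Setting the reasoning level through the system prompt, as described above, amounts to adding a `Reasoning: <level>` line to the system message. The sketch below builds such a message list for a chat template or an OpenAI-compatible endpoint; the helper name `build_messages` and the `extra_system` parameter are assumptions for illustration, and only the three levels named in this card are accepted.

```python
VALID_LEVELS = ("low", "medium", "high")


def build_messages(user_text, reasoning="medium", extra_system=""):
    """Return chat messages whose system prompt sets the reasoning level."""
    if reasoning not in VALID_LEVELS:
        raise ValueError(f"reasoning must be one of {VALID_LEVELS}, got {reasoning!r}")
    system = f"Reasoning: {reasoning}"
    if extra_system:
        # Keep any other system instructions ahead of the reasoning line.
        system = f"{extra_system}\n{system}"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user_text},
    ]


messages = build_messages("Explain MXFP4 quantization briefly.", reasoning="high")
print(messages[0]["content"])
```

Because the Transformers chat template applies the harmony format automatically, a list like this can be passed to the template as-is; with `model.generate` used directly, the harmony formatting still has to be applied separately.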

NaNK
license:apache-2.0
29,039
36

Qwen2.5-7B-Instruct-unsloth-bnb-4bit

NaNK
license:apache-2.0
28,547
2

granite-4.0-h-tiny-GGUF

See our collection for all versions of Granite-4.0 including GGUF, 4-bit & 16-bit formats. Learn to run Granite 4.0 correctly: read our Guide. See Unsloth Dynamic 2.0 GGUFs for our quantization benchmarks.

- Fine-tune Granite-4.0 for free using our Google Colab notebook
- Read our Blog about Granite-4.0 support: https://docs.unsloth.ai/new/ibm-granite-4.0
- View the rest of our notebooks in our docs here.

Model Summary: Granite-4.0-H-Tiny is a 7B parameter long-context instruct model finetuned from Granite-4.0-H-Tiny-Base using a combination of open source instruction datasets with permissive license and internally collected synthetic datasets. This model is developed using a diverse set of techniques with a structured chat format, including supervised finetuning, model alignment using reinforcement learning, and model merging. Granite 4.0 instruct models feature improved instruction following (IF) and tool-calling capabilities, making them more effective in enterprise applications.

- Developers: Granite Team, IBM
- HF Collection: Granite 4.0 Language Models HF Collection
- GitHub Repository: ibm-granite/granite-4.0-language-models
- Website: Granite Docs
- Release Date: October 2nd, 2025
- License: Apache 2.0

Supported Languages: English, German, Spanish, French, Japanese, Portuguese, Arabic, Czech, Italian, Korean, Dutch, and Chinese. Users may finetune Granite 4.0 models for languages beyond these languages.

Intended use: The model is designed to respond to general instructions and can be used to build AI assistants for multiple domains, including business applications.

Capabilities:

- Summarization
- Text classification
- Text extraction
- Question-answering
- Retrieval Augmented Generation (RAG)
- Code related tasks
- Function-calling tasks
- Multilingual dialog use cases
- Fill-In-the-Middle (FIM) code completions

Generation: This is a simple example of how to use the Granite-4.0-H-Tiny model.
Then, copy the snippet from the section that is relevant for your use case.

Tool-calling: Granite-4.0-H-Tiny comes with enhanced tool calling capabilities, enabling seamless integration with external functions and APIs. To define a list of tools, please follow OpenAI's function definition schema. This is an example of how to use the Granite-4.0-H-Tiny model's tool-calling ability:

Benchmarks

| Metric | Micro Dense | H Micro Dense | H Tiny MoE | H Small MoE |
|---|---|---|---|---|

Multilingual benchmarks and their included languages:

| Benchmark | Number of languages | Languages |
|---|---|---|
| MMMLU | 11 | ar, de, en, es, fr, ja, ko, pt, zh, bn, hi |
| INCLUDE | 14 | hi, bn, ta, te, ar, de, es, fr, it, ja, ko, nl, pt, zh |

Model Architecture: Granite-4.0-H-Tiny baseline is built on a decoder-only MoE transformer architecture. Core components of this architecture are: GQA, Mamba2, MoEs with shared experts, SwiGLU activation, RMSNorm, and shared input/output embeddings.

| Model | Micro Dense | H Micro Dense | H Tiny MoE | H Small MoE |
|---|---|---|---|---|
| Number of layers | 40 attention | 4 attention / 36 Mamba2 | 4 attention / 36 Mamba2 | 4 attention / 36 Mamba2 |
| MLP / Shared expert hidden size | 8192 | 8192 | 1024 | 1536 |

Training Data: Overall, our SFT data is largely comprised of three key sources: (1) publicly available datasets with permissive license, (2) internal synthetic data targeting specific capabilities, and (3) a select set of human-curated data.

Infrastructure: We trained the Granite 4.0 Language Models utilizing an NVIDIA GB200 NVL72 cluster hosted in CoreWeave. Intra-rack communication occurs via the 72-GPU NVLink domain, and a non-blocking, full Fat-Tree NDR 400 Gb/s InfiniBand network provides inter-rack communication. This cluster provides a scalable and efficient infrastructure for training our models over thousands of GPUs.
Ethical Considerations and Limitations: Granite 4.0 Instruction Models are primarily finetuned using instruction-response pairs mostly in English, but also multilingual data covering multiple languages. Although this model can handle multilingual dialog use cases, its performance might not match its performance on English tasks. In such cases, introducing a small number of examples (few-shot) can help the model generate more accurate outputs. While this model has been aligned with safety in mind, it may in some cases produce inaccurate, biased, or unsafe responses to user prompts. We therefore urge the community to use this model with proper safety testing and tuning tailored to their specific tasks.

Resources

- ⭐️ Learn about the latest updates with Granite: https://www.ibm.com/granite
- 📄 Get started with tutorials, best practices, and prompt engineering advice: https://www.ibm.com/granite/docs/
- 💡 Learn about the latest Granite learning resources: https://ibm.biz/granite-learning-resources
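The tool-calling section of this card asks for tools to be defined following OpenAI's function definition schema. The sketch below shows what such a definition might look like as plain Python data. The helper `make_tool` and the `get_current_weather` tool are hypothetical names chosen for illustration; only the overall shape (a `function` entry with `name`, `description`, and JSON-Schema `parameters`) follows that schema.

```python
import json


def make_tool(name: str, description: str, properties: dict, required: list) -> dict:
    """Hypothetical helper: wrap a function description in the OpenAI-style tool shape."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            # Parameters are described with JSON Schema.
            "parameters": {
                "type": "object",
                "properties": properties,
                "required": required,
            },
        },
    }


# Hypothetical example tool; the name and fields are illustrative only.
tools = [
    make_tool(
        name="get_current_weather",
        description="Get the current weather for a city.",
        properties={"city": {"type": "string", "description": "Name of the city"}},
        required=["city"],
    )
]

if __name__ == "__main__":
    print(json.dumps(tools, indent=2))
```

A list like `tools` is what would typically be handed to a chat template or an OpenAI-compatible endpoint alongside the conversation messages.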

NaNK
license:apache-2.0
28,058
27

Llama-3.1-8B-Instruct

The Meta Llama 3.1 collection of multilingual large language models (LLMs) is a collection of pretrained and instruction tuned generative models in 8B, 70B and 405B sizes (text in/text out). The Llama 3.1 instruction tuned text only models (8B, 70B, 405B) are optimized for multilingual dialogue use cases and outperform many of the available open source and closed chat models on common industry benchmarks. Model Architecture: Llama 3.1 is an auto-regressive language model that uses an optimized transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align with human preferences for helpfulness and safety. Supported languages: English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai. Llama 3.1 family of models. Token counts refer to pretraining data only. All model versions use Grouped-Query Attention (GQA) for improved inference scalability. Status: This is a static model trained on an offline dataset. Future versions of the tuned models will be released as we improve model safety with community feedback. License: A custom commercial license, the Llama 3.1 Community License, is available at: https://github.com/meta-llama/llama-models/blob/main/models/llama31/LICENSE Where to send questions or comments about the model Instructions on how to provide feedback or comments on the model can be found in the model README. For more technical information about generation parameters and recipes for how to use Llama 3.1 in applications, please go here. Intended Use Cases Llama 3.1 is intended for commercial and research use in multiple languages. Instruction tuned text only models are intended for assistant-like chat, whereas pretrained models can be adapted for a variety of natural language generation tasks. The Llama 3.1 model collection also supports the ability to leverage the outputs of its models to improve other models including synthetic data generation and distillation. 
The Llama 3.1 Community License allows for these use cases. Out-of-scope Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in any other way that is prohibited by the Acceptable Use Policy and Llama 3.1 Community License. Use in languages beyond those explicitly referenced as supported in this model card. Note : Llama 3.1 has been trained on a broader collection of languages than the 8 supported languages. Developers may fine-tune Llama 3.1 models for languages beyond the 8 supported languages provided they comply with the Llama 3.1 Community License and the Acceptable Use Policy and in such cases are responsible for ensuring that any uses of Llama 3.1 in additional languages is done in a safe and responsible manner. This repository contains two versions of Meta-Llama-3.1-8B-Instruct, for use with transformers and with the original `llama` codebase. Starting with `transformers >= 4.43.0` onward, you can run conversational inference using the Transformers `pipeline` abstraction or by leveraging the Auto classes with the `generate()` function. Make sure to update your transformers installation via `pip install --upgrade transformers`. Note: You can also find detailed recipes on how to use the model locally, with `torch.compile()`, assisted generations, quantised and more at `huggingface-llama-recipes` LLaMA-3.1 supports multiple tool use formats. You can see a full guide to prompt formatting here. Tool use is also supported through chat templates in Transformers. Here is a quick example showing a single simple tool: You can then generate text from this input as normal. If the model generates a tool call, you should add it to the chat like so: and then call the tool and append the result, with the `tool` role, like so: After that, you can `generate()` again to let the model use the tool result in the chat. 
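The tool-use loop described above (record the model's tool call, run the tool, append the result with the `tool` role, then generate again) can be sketched with plain Python data. This is a sketch under stated assumptions: `get_current_temperature` is a stub standing in for a real tool, the tool call is hard-coded rather than parsed from model output, and the message shapes follow the convention used by chat templates with tool support.

```python
import json


def get_current_temperature(location: str) -> float:
    """Hypothetical tool: returns a fixed value so the sketch runs offline."""
    return 22.0


messages = [
    {"role": "system", "content": "You are a bot that responds to weather queries."},
    {"role": "user", "content": "What's the temperature in Paris right now?"},
]

# Assumption: the model emitted this tool call; in practice it is parsed from its output.
tool_call = {"name": "get_current_temperature", "arguments": {"location": "Paris, France"}}

# 1) Record the model's tool call as an assistant turn.
messages.append(
    {"role": "assistant", "tool_calls": [{"type": "function", "function": tool_call}]}
)

# 2) Run the tool and append its result with the `tool` role.
available_tools = {"get_current_temperature": get_current_temperature}
result = available_tools[tool_call["name"]](**tool_call["arguments"])
messages.append({"role": "tool", "name": tool_call["name"], "content": str(result)})

if __name__ == "__main__":
    # 3) `messages` is now ready for the chat template and another generate() call.
    print(json.dumps(messages[-2:], indent=2))
```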
Note that this was a very brief introduction to tool calling - for more information, see the LLaMA prompt format docs and the Transformers tool use documentation. To download Original checkpoints, see the example command below leveraging `huggingface-cli`: Training Factors We used custom training libraries, Meta's custom built GPU cluster, and production infrastructure for pretraining. Fine-tuning, annotation, and evaluation were also performed on production infrastructure. Training utilized a cumulative of 39.3M GPU hours of computation on H100-80GB (TDP of 700W) type hardware, per the table below. Training time is the total GPU time required for training each model and power consumption is the peak power capacity per GPU device used, adjusted for power usage efficiency. Training Greenhouse Gas Emissions Estimated total location-based greenhouse gas emissions were 11,390 tons CO2eq for training. Since 2020, Meta has maintained net zero greenhouse gas emissions in its global operations and matched 100% of its electricity use with renewable energy, therefore the total market-based greenhouse gas emissions for training were 0 tons CO2eq. The methodology used to determine training energy use and greenhouse gas emissions can be found here. Since Meta is openly releasing these models, the training energy use and greenhouse gas emissions will not be incurred by others. Overview: Llama 3.1 was pretrained on ~15 trillion tokens of data from publicly available sources. The fine-tuning data includes publicly available instruction datasets, as well as over 25M synthetically generated examples. Data Freshness: The pretraining data has a cutoff of December 2023. In this section, we report the results for Llama 3.1 models on standard automatic benchmarks. For all the evaluations, we use our internal evaluations library. 
As part of our Responsible release approach, we followed a three-pronged strategy to managing trust & safety risks: Enable developers to deploy helpful, safe and flexible experiences for their target audience and for the use cases supported by Llama. Protect developers against adversarial users aiming to exploit Llama capabilities to potentially cause harm. Provide protections for the community to help prevent the misuse of our models. Llama is a foundational technology designed to be used in a variety of use cases, examples on how Meta’s Llama models have been responsibly deployed can be found in our Community Stories webpage. Our approach is to build the most helpful models enabling the world to benefit from the technology power, by aligning our model safety for the generic use cases addressing a standard set of harms. Developers are then in the driver seat to tailor safety for their use case, defining their own policy and deploying the models with the necessary safeguards in their Llama systems. Llama 3.1 was developed following the best practices outlined in our Responsible Use Guide, you can refer to the Responsible Use Guide to learn more. Our main objectives for conducting safety fine-tuning are to provide the research community with a valuable resource for studying the robustness of safety fine-tuning, as well as to offer developers a readily available, safe, and powerful model for various applications to reduce the developer workload to deploy safe AI systems. For more details on the safety mitigations implemented please read the Llama 3 paper. We employ a multi-faceted approach to data collection, combining human-generated data from our vendors with synthetic data to mitigate potential safety risks. We’ve developed many large language model (LLM)-based classifiers that enable us to thoughtfully select high-quality prompts and responses, enhancing data quality control. 
Building on the work we started with Llama 3, we put a great emphasis on model refusals to benign prompts as well as refusal tone. We included both borderline and adversarial prompts in our safety data strategy, and modified our safety data responses to follow tone guidelines. Large language models, including Llama 3.1, are not designed to be deployed in isolation but instead should be deployed as part of an overall AI system with additional safety guardrails as required. Developers are expected to deploy system safeguards when building agentic systems. Safeguards are key to achieve the right helpfulness-safety alignment as well as mitigating safety and security risks inherent to the system and any integration of the model or system with external tools. As part of our responsible release approach, we provide the community with safeguards that developers should deploy with Llama models or other LLMs, including Llama Guard 3, Prompt Guard and Code Shield. All our reference implementations demos contain these safeguards by default so developers can benefit from system-level safety out-of-the-box. Note that this release introduces new capabilities, including a longer context window, multilingual inputs and outputs and possible integrations by developers with third party tools. Building with these new capabilities requires specific considerations in addition to the best practices that generally apply across all Generative AI use cases. Tool-use: Just like in standard software development, developers are responsible for the integration of the LLM with the tools and services of their choice. They should define a clear policy for their use case and assess the integrity of the third party services they use to be aware of the safety and security limitations when using this capability. Refer to the Responsible Use Guide for best practices on the safe deployment of the third party safeguards. 
Multilinguality: Llama 3.1 supports 7 languages in addition to English: French, German, Hindi, Italian, Portuguese, Spanish, and Thai. Llama may be able to output text in other languages than those that meet performance thresholds for safety and helpfulness. We strongly discourage developers from using this model to converse in non-supported languages without implementing finetuning and system controls in alignment with their policies and the best practices shared in the Responsible Use Guide. We evaluated Llama models for common use cases as well as specific capabilities. Common use cases evaluations measure safety risks of systems for most commonly built applications including chat bot, coding assistant, tool calls. We built dedicated, adversarial evaluation datasets and evaluated systems composed of Llama models and Llama Guard 3 to filter input prompt and output response. It is important to evaluate applications in context, and we recommend building dedicated evaluation dataset for your use case. Prompt Guard and Code Shield are also available if relevant to the application. Capability evaluations measure vulnerabilities of Llama models inherent to specific capabilities, for which were crafted dedicated benchmarks including long context, multilingual, tools calls, coding or memorization. For both scenarios, we conducted recurring red teaming exercises with the goal of discovering risks via adversarial prompting and we used the learnings to improve our benchmarks and safety tuning datasets. We partnered early with subject-matter experts in critical risk areas to understand the nature of these real-world harms and how such models may lead to unintended harm for society. Based on these conversations, we derived a set of adversarial goals for the red team to attempt to achieve, such as extracting harmful information or reprogramming the model to act in a potentially harmful capacity. 
The red team consisted of experts in cybersecurity, adversarial machine learning, responsible AI, and integrity in addition to multilingual content specialists with background in integrity issues in specific geographic markets. We specifically focused our efforts on mitigating the following critical risk areas:

CBRNE (Chemical, Biological, Radiological, Nuclear, and Explosive materials) helpfulness: To assess risks related to proliferation of chemical and biological weapons, we performed uplift testing designed to assess whether use of Llama 3.1 models could meaningfully increase the capabilities of malicious actors to plan or carry out attacks using these types of weapons.

Child Safety risk assessments were conducted using a team of experts, to assess the model’s capability to produce outputs that could result in Child Safety risks and inform on any necessary and appropriate risk mitigations via fine tuning. We leveraged those expert red teaming sessions to expand the coverage of our evaluation benchmarks through Llama 3 model development. For Llama 3, we conducted new in-depth sessions using objective based methodologies to assess the model risks along multiple attack vectors including the additional languages Llama 3 is trained on. We also partnered with content specialists to perform red teaming exercises assessing potentially violating content while taking account of market specific nuances or experiences.

Our cyber attack uplift study investigated whether LLMs can enhance human capabilities in hacking tasks, both in terms of skill level and speed. Our attack automation study focused on evaluating the capabilities of LLMs when used as autonomous agents in cyber offensive operations, specifically in the context of ransomware attacks. This evaluation was distinct from previous studies that considered LLMs as interactive assistants.
The primary objective was to assess whether these models could effectively function as independent agents in executing complex cyber-attacks without human intervention. Our study of Llama-3.1-405B’s social engineering uplift for cyber attackers was conducted to assess the effectiveness of AI models in aiding cyber threat actors in spear phishing campaigns. Please read our Llama 3.1 Cyber security whitepaper to learn more. Generative AI safety requires expertise and tooling, and we believe in the strength of the open community to accelerate its progress. We are active members of open consortiums, including the AI Alliance, Partnership on AI and MLCommons, actively contributing to safety standardization and transparency. We encourage the community to adopt taxonomies like the MLCommons Proof of Concept evaluation to facilitate collaboration and transparency on safety and content evaluations. Our Purple Llama tools are open sourced for the community to use and widely distributed across ecosystem partners including cloud service providers. We encourage community contributions to our Github repository. We also set up the Llama Impact Grants program to identify and support the most compelling applications of Meta’s Llama model for societal benefit across three categories: education, climate and open innovation. The 20 finalists from the hundreds of applications can be found here. Finally, we put in place a set of resources including an output reporting mechanism and bug bounty program to continuously improve the Llama technology with the help of the community. The core values of Llama 3.1 are openness, inclusivity and helpfulness. It is meant to serve everyone, and to work for a wide range of use cases. It is thus designed to be accessible to people across many different backgrounds, experiences and perspectives. 
Llama 3.1 addresses users and their needs as they are, without inserting unnecessary judgment or normativity, while reflecting the understanding that even content that may appear problematic in some cases can serve valuable purposes in others. It respects the dignity and autonomy of all users, especially in terms of the values of free thought and expression that power innovation and progress. But Llama 3.1 is a new technology, and like any new technology, there are risks associated with its use. Testing conducted to date has not covered, nor could it cover, all scenarios. For these reasons, as with all LLMs, Llama 3.1’s potential outputs cannot be predicted in advance, and the model may in some instances produce inaccurate, biased or other objectionable responses to user prompts. Therefore, before deploying any applications of Llama 3.1 models, developers should perform safety testing and tuning tailored to their specific applications of the model. Please refer to available resources including our Responsible Use Guide, Trust and Safety solutions, and other resources to learn more about responsible development.

NaNK
llama
27,985
7

Qwen3-VL-30B-A3B-Thinking-GGUF

NaNK
license:apache-2.0
27,801
12

Qwen2.5-14B-Instruct-unsloth-bnb-4bit

NaNK
license:apache-2.0
27,667
2

phi-4

llama
27,665
88

Magistral-Small-2509-GGUF

Learn to run Magistral 1.2 correctly: read our Guide. Unsloth Dynamic 2.0 achieves SOTA performance in model quantization. Read our in-depth guide about Magistral 1.2: docs.unsloth.ai/basics/magistral

- Fine-tune Magistral 1.2 for free using our Kaggle notebook here!
- View the rest of our notebooks in our docs here.

Building upon Mistral Small 3.2 (2506), with added reasoning capabilities, undergoing SFT from Magistral Medium traces and RL on top, it's a small, efficient reasoning model with 24B parameters. Magistral Small can be deployed locally, fitting within a single RTX 4090 or a 32GB RAM MacBook once quantized.

- Multimodality: The model now has a vision encoder and can take multimodal inputs, extending its reasoning capabilities to vision.
- Performance upgrade: Magistral Small 1.2 should give you significantly better performance than Magistral Small 1.1, as seen in the benchmark results.
- Better tone and persona: You should experience better LaTeX and Markdown formatting, and shorter answers on easy general prompts.
- Finite generation: The model is less likely to enter infinite generation loops.
- Special think tokens: [THINK] and [/THINK] special tokens encapsulate the reasoning content in a thinking chunk. This makes it easier to parse the reasoning trace and prevents confusion when the '[THINK]' token is given as a string in the prompt.
- Reasoning prompt: The reasoning prompt is given in the system prompt.
- Reasoning: Capable of long chains of reasoning traces before providing an answer.
- Multilingual: Supports dozens of languages, including English, French, German, Greek, Hindi, Indonesian, Italian, Japanese, Korean, Malay, Nepali, Polish, Portuguese, Romanian, Russian, Serbian, Spanish, Turkish, Ukrainian, Vietnamese, Arabic, Bengali, Chinese, and Farsi.
- Vision: Vision capabilities enable the model to analyze images and reason based on visual content in addition to text.
- Apache 2.0 License: Open license allowing usage and modification for both commercial and non-commercial purposes.
- Context Window: A 128k context window. Performance might degrade past 40k, but Magistral should still give good results. We therefore recommend leaving the maximum model length at 128k and lowering it only if you encounter low performance.

| Model | AIME24 pass@1 | AIME25 pass@1 | GPQA Diamond | Livecodebench (v5) |
|--------------------------|---------------|---------------|--------------|--------------------|
| Magistral Medium 1.2 | 91.82% | 83.48% | 76.26% | 75.00% |
| Magistral Medium 1.1 | 72.03% | 60.99% | 71.46% | 59.35% |
| Magistral Medium 1.0 | 73.59% | 64.95% | 70.83% | 59.36% |
| Magistral Small 1.2 | 86.14% | 77.34% | 70.07% | 70.88% |
| Magistral Small 1.1 | 70.52% | 62.03% | 65.78% | 59.17% |
| Magistral Small 1.0 | 70.68% | 62.76% | 68.18% | 55.84% |

Please make sure to use:

- `top_p`: 0.95
- `temperature`: 0.7
- `max_tokens`: 131072

We highly recommend including the following system prompt for the best results; you can edit and customise it if needed for your specific use case.

The `[THINK]` and `[/THINK]` are special tokens that must be encoded as such. Please make sure to use mistral-common as the source of truth. Find below examples from libraries supporting `mistral-common`.

We invite you to choose, depending on your use case and requirements, between keeping reasoning traces during multi-turn interactions or keeping only the final assistant response.

Make sure you install the latest `Transformers` version.
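The choice between keeping reasoning traces and keeping only the final response comes down to splitting the thinking chunk from the answer. The sketch below does this on already-decoded text, which is an assumption: the card states that `[THINK]` and `[/THINK]` are special tokens, so how they appear after decoding depends on the tokenizer settings. The helper name `split_reasoning` and the `SAMPLING` dictionary are illustrative; the sampling values are the ones recommended in this card.

```python
# Recommended sampling settings from the card, collected in one place.
SAMPLING = {"top_p": 0.95, "temperature": 0.7, "max_tokens": 131072}

THINK_OPEN, THINK_CLOSE = "[THINK]", "[/THINK]"


def split_reasoning(text: str) -> tuple:
    """Hypothetical helper: return (reasoning, answer) from decoded model output.

    If no complete [THINK]...[/THINK] chunk is found, the reasoning is empty
    and the whole text is treated as the answer.
    """
    start = text.find(THINK_OPEN)
    end = text.find(THINK_CLOSE, start + len(THINK_OPEN)) if start != -1 else -1
    if start == -1 or end == -1:
        return "", text.strip()
    reasoning = text[start + len(THINK_OPEN):end].strip()
    answer = (text[:start] + text[end + len(THINK_CLOSE):]).strip()
    return reasoning, answer


if __name__ == "__main__":
    reasoning, answer = split_reasoning("[THINK]2+2=4[/THINK]The answer is 4.")
    print("reasoning:", reasoning)
    print("answer:", answer)
```

In a multi-turn setting, appending only `answer` to the history keeps the context short, while appending the full text preserves the reasoning trace.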

license:apache-2.0
27,283
85

Llama-3.2-3B-Instruct-GGUF

NaNK
llama
26,526
50

DeepSeek-R1-Distill-Llama-8B-unsloth-bnb-4bit

NaNK
llama
26,327
36

Llama-3.2-1B-Instruct-GGUF

See our collection for all versions of Llama 3.2 including GGUF, 4-bit and original 16-bit formats. 16bit, 8bit, 6bit, 5bit, 4bit, 3bit and 2bit uploads available.

Finetune Llama 3.2, Gemma 2, Mistral 2-5x faster with 70% less memory via Unsloth! We have a free Google Colab Tesla T4 notebook for Llama 3.2 (3B) here: https://colab.research.google.com/drive/1T5-zKWM5OD21QHwXHiV9ixTRR7k3iB9?usp=sharing

unsloth/Llama-3.2-1B-Instruct: For more details on the model, please go to Meta's original model card.

All notebooks are beginner friendly! Add your dataset, click "Run All", and you'll get a 2x faster finetuned model which can be exported to GGUF, vLLM or uploaded to Hugging Face.

| Unsloth supports | Free Notebooks | Performance | Memory use |
|---|---|---|---|
| Llama-3.2 (3B) | ▶️ Start on Colab | 2.4x faster | 58% less |
| Llama-3.2 (11B vision) | ▶️ Start on Colab | 2.4x faster | 58% less |
| Llama-3.1 (8B) | ▶️ Start on Colab | 2.4x faster | 58% less |
| Phi-3.5 (mini) | ▶️ Start on Colab | 2x faster | 50% less |
| Gemma 2 (9B) | ▶️ Start on Colab | 2.4x faster | 58% less |
| Mistral (7B) | ▶️ Start on Colab | 2.2x faster | 62% less |
| DPO - Zephyr | ▶️ Start on Colab | 1.9x faster | 19% less |

- This conversational notebook is useful for ShareGPT ChatML / Vicuna templates.
- This text completion notebook is for raw text. This DPO notebook replicates Zephyr.
- Kaggle has 2x T4s, but we use 1. Due to overhead, 1x T4 is 5x faster.

Special Thanks: A huge thank you to the Meta and Llama team for creating and releasing these models.

The Meta Llama 3.2 collection of multilingual large language models (LLMs) is a collection of pretrained and instruction-tuned generative models in 1B and 3B sizes (text in/text out).
The Llama 3.2 instruction-tuned text only models are optimized for multilingual dialogue use cases, including agentic retrieval and summarization tasks. They outperform many of the available open source and closed chat models on common industry benchmarks. Model Architecture: Llama 3.2 is an auto-regressive language model that uses an optimized transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align with human preferences for helpfulness and safety. Supported languages: English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai are officially supported. Llama 3.2 has been trained on a broader collection of languages than these 8 supported languages. Developers may fine-tune Llama 3.2 models for languages beyond these supported languages, provided they comply with the Llama 3.2 Community License and the Acceptable Use Policy. Developers are always expected to ensure that their deployments, including those that involve additional languages, are completed safely and responsibly. Llama 3.2 family of models Token counts refer to pretraining data only. All model versions use Grouped-Query Attention (GQA) for improved inference scalability. Status: This is a static model trained on an offline dataset. Future versions may be released that improve model capabilities and safety. Where to send questions or comments about the model Instructions on how to provide feedback or comments on the model can be found in the model README. For more technical information about generation parameters and recipes for how to use Llama 3.1 in applications, please go here.

NaNK
llama
26,152
44

gemma-3-1b-it-unsloth-bnb-4bit

NaNK
25,755
6

gemma-3-270m-it-GGUF

25,431
128

DeepSeek-R1-GGUF

See our collection for versions of Deepseek-R1 including GGUF & 4-bit formats. Unsloth's DeepSeek-R1 1.58-bit + 2-bit Dynamic Quants are selectively quantized, greatly improving accuracy over standard 1-bit/2-bit. Or you can view more detailed instructions here: unsloth.ai/blog/deepseekr1-dynamic

1. Do not forget about ` ` and ` ` tokens! - Or use a chat template formatter
2. Obtain the latest `llama.cpp` at https://github.com/ggerganov/llama.cpp. You can follow the build instructions below as well:
3. It's best to use `--min-p 0.05` to counteract very rare token predictions - I found this to work well especially for the 1.58bit model.
4. Download the model via:
5. Example with a Q4_0 K quantized cache. Note that `-no-cnv` disables auto conversation mode.
6. If you have a GPU (RTX 4090 for example) with 24GB, you can offload multiple layers to the GPU for faster processing. If you have multiple GPUs, you can probably offload more layers.
7. If you want to merge the weights together, use this script:

| MoE Bits | Type | Disk Size | Accuracy | Link | Details |
| -------- | -------- | ------------ | ------------ | ---------------------| ---------- |
| 1.58bit | UD-IQ1_S | 131GB | Fair | Link | MoE all 1.56bit. `down_proj` in MoE mixture of 2.06/1.56bit |
| 1.73bit | UD-IQ1_M | 158GB | Good | Link | MoE all 1.56bit. `down_proj` in MoE left at 2.06bit |
| 2.22bit | UD-IQ2_XXS | 183GB | Better | Link | MoE all 2.06bit. `down_proj` in MoE mixture of 2.5/2.06bit |
| 2.51bit | UD-Q2_K_XL | 212GB | Best | Link | MoE all 2.5bit. `down_proj` in MoE mixture of 3.5/2.5bit |

Finetune your own Reasoning model like R1 with Unsloth! We have a free Google Colab notebook for turning Llama 3.1 (8B) into a reasoning model: https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1(8B)-GRPO.ipynb

All notebooks are beginner friendly! Add your dataset, click "Run All", and you'll get a 2x faster finetuned model which can be exported to GGUF, vLLM or uploaded to Hugging Face.
| Unsloth supports | Free Notebooks | Performance | Memory use |
|---|---|---|---|
| GRPO with Phi-4 (14B) | ▶️ Start on Colab | 2x faster | 80% less |
| Llama-3.2 (3B) | ▶️ Start on Colab | 2.4x faster | 58% less |
| Llama-3.2 (11B vision) | ▶️ Start on Colab | 2x faster | 60% less |
| Qwen2 VL (7B) | ▶️ Start on Colab | 1.8x faster | 60% less |
| Qwen2.5 (7B) | ▶️ Start on Colab | 2x faster | 60% less |
| Llama-3.1 (8B) | ▶️ Start on Colab | 2.4x faster | 58% less |
| Phi-3.5 (mini) | ▶️ Start on Colab | 2x faster | 50% less |
| Gemma 2 (9B) | ▶️ Start on Colab | 2.4x faster | 58% less |
| Mistral (7B) | ▶️ Start on Colab | 2.2x faster | 62% less |

- This Llama 3.2 conversational notebook is useful for ShareGPT ChatML / Vicuna templates.
- This text completion notebook is for raw text. This DPO notebook replicates Zephyr.
- Kaggle has 2x T4s, but we use 1. Due to overhead, 1x T4 is 5x faster.

Special Thanks: A huge thank you to the DeepSeek team for creating and releasing these models.

We introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1. DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step, demonstrated remarkable performance on reasoning. With RL, DeepSeek-R1-Zero naturally emerged with numerous powerful and interesting reasoning behaviors. However, DeepSeek-R1-Zero encounters challenges such as endless repetition, poor readability, and language mixing. To address these issues and further enhance reasoning performance, we introduce DeepSeek-R1, which incorporates cold-start data before RL.
DeepSeek-R1 achieves performance comparable to OpenAI-o1 across math, code, and reasoning tasks. To support the research community, we have open-sourced DeepSeek-R1-Zero, DeepSeek-R1, and six dense models distilled from DeepSeek-R1 based on Llama and Qwen. DeepSeek-R1-Distill-Qwen-32B outperforms OpenAI-o1-mini across various benchmarks, achieving new state-of-the-art results for dense models. NOTE: Before running DeepSeek-R1 series models locally, we kindly recommend reviewing the Usage Recommendation section. Post-Training: Large-Scale Reinforcement Learning on the Base Model - We directly apply reinforcement learning (RL) to the base model without relying on supervised fine-tuning (SFT) as a preliminary step. This approach allows the model to explore chain-of-thought (CoT) for solving complex problems, resulting in the development of DeepSeek-R1-Zero. DeepSeek-R1-Zero demonstrates capabilities such as self-verification, reflection, and generating long CoTs, marking a significant milestone for the research community. Notably, it is the first open research to validate that reasoning capabilities of LLMs can be incentivized purely through RL, without the need for SFT. This breakthrough paves the way for future advancements in this area. - We introduce our pipeline to develop DeepSeek-R1. The pipeline incorporates two RL stages aimed at discovering improved reasoning patterns and aligning with human preferences, as well as two SFT stages that serve as the seed for the model's reasoning and non-reasoning capabilities. We believe the pipeline will benefit the industry by creating better models. - We demonstrate that the reasoning patterns of larger models can be distilled into smaller models, resulting in better performance compared to the reasoning patterns discovered through RL on small models. The open source DeepSeek-R1, as well as its API, will benefit the research community to distill better smaller models in the future. 
- Using the reasoning data generated by DeepSeek-R1, we fine-tuned several dense models that are widely used in the research community. The evaluation results demonstrate that the distilled smaller dense models perform exceptionally well on benchmarks. We open-source distilled 1.5B, 7B, 8B, 14B, 32B, and 70B checkpoints based on Qwen2.5 and Llama3 series to the community. | Model | #Total Params | #Activated Params | Context Length | Download | | :------------: | :------------: | :------------: | :------------: | :------------: | | DeepSeek-R1-Zero | 671B | 37B | 128K | 🤗 HuggingFace | | DeepSeek-R1 | 671B | 37B | 128K | 🤗 HuggingFace | DeepSeek-R1-Zero & DeepSeek-R1 are trained based on DeepSeek-V3-Base. For more details regarding the model architecture, please refer to DeepSeek-V3 repository. | Model | Base Model | Download | | :------------: | :------------: | :------------: | | DeepSeek-R1-Distill-Qwen-1.5B | Qwen2.5-Math-1.5B | 🤗 HuggingFace | | DeepSeek-R1-Distill-Qwen-7B | Qwen2.5-Math-7B | 🤗 HuggingFace | | DeepSeek-R1-Distill-Llama-8B | Llama-3.1-8B | 🤗 HuggingFace | | DeepSeek-R1-Distill-Qwen-14B | Qwen2.5-14B | 🤗 HuggingFace | |DeepSeek-R1-Distill-Qwen-32B | Qwen2.5-32B | 🤗 HuggingFace | | DeepSeek-R1-Distill-Llama-70B | Llama-3.3-70B-Instruct | 🤗 HuggingFace | DeepSeek-R1-Distill models are fine-tuned based on open-source models, using samples generated by DeepSeek-R1. We slightly change their configs and tokenizers. Please use our setting to run these models. DeepSeek-R1-Evaluation For all our models, the maximum generation length is set to 32,768 tokens. For benchmarks requiring sampling, we use a temperature of $0.6$, a top-p value of $0.95$, and generate 64 responses per query to estimate pass@1. 
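As a rough illustration of the evaluation protocol described above, pass@1 estimated from several sampled responses is the mean per-sample correctness, while cons@64 (reported for the distilled models) scores the majority-vote answer. The sketch below assumes the final answers have already been extracted as strings; the function names are illustrative and are not part of DeepSeek's evaluation code.

```python
from collections import Counter


def pass_at_1(is_correct: list[bool]) -> float:
    """Estimate pass@1 as the fraction of sampled responses that are correct."""
    if not is_correct:
        raise ValueError("need at least one sampled response")
    return sum(is_correct) / len(is_correct)


def cons_at_k(answers: list[str], reference: str) -> float:
    """Return 1.0 if the most common sampled answer matches the reference, else 0.0."""
    if not answers:
        raise ValueError("need at least one sampled response")
    majority_answer, _ = Counter(answers).most_common(1)[0]
    return 1.0 if majority_answer == reference else 0.0


# Four samples stand in for the 64 used in the evaluation.
samples = ["42", "42", "41", "42"]
print(pass_at_1([answer == "42" for answer in samples]))  # 0.75
print(cons_at_k(samples, "42"))                           # 1.0
```

Averaging these per-query scores over a benchmark gives the percentages of the kind reported in the tables.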
| Category | Benchmark (Metric) | Claude-3.5-Sonnet-1022 | GPT-4o 0513 | DeepSeek V3 | OpenAI o1-mini | OpenAI o1-1217 | DeepSeek R1 |
|----------|-------------------|----------------------|------------|--------------|----------------|------------|--------------|
| | Architecture | - | - | MoE | - | - | MoE |
| | # Activated Params | - | - | 37B | - | - | 37B |
| | # Total Params | - | - | 671B | - | - | 671B |
| English | MMLU (Pass@1) | 88.3 | 87.2 | 88.5 | 85.2 | 91.8 | 90.8 |
| | MMLU-Redux (EM) | 88.9 | 88.0 | 89.1 | 86.7 | - | 92.9 |
| | MMLU-Pro (EM) | 78.0 | 72.6 | 75.9 | 80.3 | - | 84.0 |
| | DROP (3-shot F1) | 88.3 | 83.7 | 91.6 | 83.9 | 90.2 | 92.2 |
| | IF-Eval (Prompt Strict) | 86.5 | 84.3 | 86.1 | 84.8 | - | 83.3 |
| | GPQA-Diamond (Pass@1) | 65.0 | 49.9 | 59.1 | 60.0 | 75.7 | 71.5 |
| | SimpleQA (Correct) | 28.4 | 38.2 | 24.9 | 7.0 | 47.0 | 30.1 |
| | FRAMES (Acc.) | 72.5 | 80.5 | 73.3 | 76.9 | - | 82.5 |
| | AlpacaEval2.0 (LC-winrate) | 52.0 | 51.1 | 70.0 | 57.8 | - | 87.6 |
| | ArenaHard (GPT-4-1106) | 85.2 | 80.4 | 85.5 | 92.0 | - | 92.3 |
| Code | LiveCodeBench (Pass@1-COT) | 33.8 | 34.2 | - | 53.8 | 63.4 | 65.9 |
| | Codeforces (Percentile) | 20.3 | 23.6 | 58.7 | 93.4 | 96.6 | 96.3 |
| | Codeforces (Rating) | 717 | 759 | 1134 | 1820 | 2061 | 2029 |
| | SWE Verified (Resolved) | 50.8 | 38.8 | 42.0 | 41.6 | 48.9 | 49.2 |
| | Aider-Polyglot (Acc.) | 45.3 | 16.0 | 49.6 | 32.9 | 61.7 | 53.3 |
| Math | AIME 2024 (Pass@1) | 16.0 | 9.3 | 39.2 | 63.6 | 79.2 | 79.8 |
| | MATH-500 (Pass@1) | 78.3 | 74.6 | 90.2 | 90.0 | 96.4 | 97.3 |
| | CNMO 2024 (Pass@1) | 13.1 | 10.8 | 43.2 | 67.6 | - | 78.8 |
| Chinese | CLUEWSC (EM) | 85.4 | 87.9 | 90.9 | 89.9 | - | 92.8 |
| | C-Eval (EM) | 76.7 | 76.0 | 86.5 | 68.9 | - | 91.8 |
| | C-SimpleQA (Correct) | 55.4 | 58.7 | 68.0 | 40.3 | - | 63.7 |

| Model | AIME 2024 pass@1 | AIME 2024 cons@64 | MATH-500 pass@1 | GPQA Diamond pass@1 | LiveCodeBench pass@1 | CodeForces rating |
|------------------------------------------|------------------|-------------------|-----------------|----------------------|----------------------|-------------------|
| GPT-4o-0513 | 9.3 | 13.4 | 74.6 | 49.9 | 32.9 | 759 |
| Claude-3.5-Sonnet-1022 | 16.0 | 26.7 | 78.3 | 65.0 | 38.9 | 717 |
| o1-mini | 63.6 | 80.0 | 90.0 | 60.0 | 53.8 | 1820 |
| QwQ-32B-Preview | 44.0 | 60.0 | 90.6 | 54.5 | 41.9 | 1316 |
| DeepSeek-R1-Distill-Qwen-1.5B | 28.9 | 52.7 | 83.9 | 33.8 | 16.9 | 954 |
| DeepSeek-R1-Distill-Qwen-7B | 55.5 | 83.3 | 92.8 | 49.1 | 37.6 | 1189 |
| DeepSeek-R1-Distill-Qwen-14B | 69.7 | 80.0 | 93.9 | 59.1 | 53.1 | 1481 |
| DeepSeek-R1-Distill-Qwen-32B | 72.6 | 83.3 | 94.3 | 62.1 | 57.2 | 1691 |
| DeepSeek-R1-Distill-Llama-8B | 50.4 | 80.0 | 89.1 | 49.0 | 39.6 | 1205 |
| DeepSeek-R1-Distill-Llama-70B | 70.0 | 86.7 | 94.5 | 65.2 | 57.5 | 1633 |

5. Chat Website & API Platform

You can chat with DeepSeek-R1 on DeepSeek's official website, chat.deepseek.com, by switching on the "DeepThink" button.

We also provide an OpenAI-compatible API at the DeepSeek Platform: platform.deepseek.com

Please visit the DeepSeek-V3 repo for more information about running DeepSeek-R1 locally.

DeepSeek-R1-Distill models can be utilized in the same manner as Qwen or Llama models.
For instance, you can easily start a service using vLLM:

We recommend adhering to the following configurations when utilizing the DeepSeek-R1 series models, including benchmarking, to achieve the expected performance:

1. Set the temperature within the range of 0.5-0.7 (0.6 is recommended) to prevent endless repetitions or incoherent outputs.
2. Avoid adding a system prompt; all instructions should be contained within the user prompt.
3. For mathematical problems, it is advisable to include a directive in your prompt such as: "Please reason step by step, and put your final answer within \boxed{}."
4. When evaluating model performance, it is recommended to conduct multiple tests and average the results.

7. License

This code repository and the model weights are licensed under the MIT License. DeepSeek-R1 series support commercial use, allow for any modifications and derivative works, including, but not limited to, distillation for training other LLMs. Please note that:

- DeepSeek-R1-Distill-Qwen-1.5B, DeepSeek-R1-Distill-Qwen-7B, DeepSeek-R1-Distill-Qwen-14B and DeepSeek-R1-Distill-Qwen-32B are derived from Qwen-2.5 series, which are originally licensed under Apache 2.0 License, and now finetuned with 800k samples curated with DeepSeek-R1.
- DeepSeek-R1-Distill-Llama-8B is derived from Llama3.1-8B-Base and is originally licensed under llama3.1 license.
- DeepSeek-R1-Distill-Llama-70B is derived from Llama3.3-70B-Instruct and is originally licensed under llama3.3 license.

9. Contact

If you have any questions, please raise an issue or contact us at [email protected].
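The usage recommendations listed earlier in this card (temperature around 0.6, no system prompt, a `\boxed{}` directive for math, and averaging over several runs) can be collected into a small request builder. This is a minimal sketch under stated assumptions: `build_request` and `average_score` are hypothetical helpers, the model name is only a placeholder, and the payload fields follow the OpenAI-compatible chat format mentioned in the card.

```python
from statistics import mean

BOXED_DIRECTIVE = "Please reason step by step, and put your final answer within \\boxed{}."


def build_request(question: str, is_math: bool = False,
                  model: str = "DeepSeek-R1-Distill-Qwen-32B",  # placeholder name
                  temperature: float = 0.6) -> dict:
    """Build a chat payload that follows the recommendations (hypothetical helper)."""
    if not 0.5 <= temperature <= 0.7:
        raise ValueError("recommended temperature range is 0.5-0.7")
    # Math prompts carry the directive inside the user message itself.
    prompt = f"{question}\n{BOXED_DIRECTIVE}" if is_math else question
    return {
        "model": model,
        "temperature": temperature,
        # Only a user message: the recommendations advise against a system prompt.
        "messages": [{"role": "user", "content": prompt}],
    }


def average_score(run_scores: list[float]) -> float:
    """Average the scores of repeated evaluation runs."""
    return mean(run_scores)


request = build_request("What is 12 * 12?", is_math=True)
print(request["messages"][0]["content"])
```

Rejecting temperatures outside the recommended range is a design choice of this sketch; a warning would serve equally well.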

license:mit
25,186
1,096

Qwen3-32B-GGUF

license:apache-2.0
25,086
88

Llama-3.2-11B-Vision-Instruct

See our collection for vision models including Llama 3.2, Llava, Qwen2-VL and Pixtral. Finetune Llama 3.2, Gemma 2, Mistral 2-5x faster with 70% less memory via Unsloth! We have a free Google Colab Tesla T4 notebook for Llama 3.2 Vision (11B) here: https://colab.research.google.com/drive/1j0N4XTY1zXXy7mPAhOC1gMYZ2F2EBlk?usp=sharing unsloth/Llama-3.2-11B-Vision-Instruct For more details on the model, please go to Meta's original model card All notebooks are beginner friendly! Add your dataset, click "Run All", and you'll get a 2x faster finetuned model which can be exported to GGUF, vLLM or uploaded to Hugging Face. | Unsloth supports | Free Notebooks | Performance | Memory use | |-----------------|--------------------------------------------------------------------------------------------------------------------------|-------------|----------| | Llama-3.2 (3B) | ▶️ Start on Colab | 2.4x faster | 58% less | | Llama-3.2 (11B vision) | ▶️ Start on Colab | 2x faster | 60% less | | Qwen2 VL (7B) | ▶️ Start on Colab | 1.8x faster | 60% less | | Qwen2.5 (7B) | ▶️ Start on Colab | 2x faster | 60% less | | Llama-3.1 (8B) | ▶️ Start on Colab | 2.4x faster | 58% less | | Phi-3.5 (mini) | ▶️ Start on Colab | 2x faster | 50% less | | Gemma 2 (9B) | ▶️ Start on Colab | 2.4x faster | 58% less | | Mistral (7B) | ▶️ Start on Colab | 2.2x faster | 62% less | | DPO - Zephyr | ▶️ Start on Colab | 1.9x faster | 19% less | - This conversational notebook is useful for ShareGPT ChatML / Vicuna templates. - This text completion notebook is for raw text. This DPO notebook replicates Zephyr. - \ Kaggle has 2x T4s, but we use 1. Due to overhead, 1x T4 is 5x faster. Special Thanks A huge thank you to the Meta and Llama team for creating and releasing these models. The Meta Llama 3.2 collection of multilingual large language models (LLMs) is a collection of pretrained and instruction-tuned generative models in 1B and 3B sizes (text in/text out). 
The Llama 3.2 instruction-tuned text only models are optimized for multilingual dialogue use cases, including agentic retrieval and summarization tasks. They outperform many of the available open source and closed chat models on common industry benchmarks. Model Architecture: Llama 3.2 is an auto-regressive language model that uses an optimized transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align with human preferences for helpfulness and safety. Supported languages: English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai are officially supported. Llama 3.2 has been trained on a broader collection of languages than these 8 supported languages. Developers may fine-tune Llama 3.2 models for languages beyond these supported languages, provided they comply with the Llama 3.2 Community License and the Acceptable Use Policy. Developers are always expected to ensure that their deployments, including those that involve additional languages, are completed safely and responsibly. Llama 3.2 family of models Token counts refer to pretraining data only. All model versions use Grouped-Query Attention (GQA) for improved inference scalability. Status: This is a static model trained on an offline dataset. Future versions may be released that improve model capabilities and safety. Where to send questions or comments about the model Instructions on how to provide feedback or comments on the model can be found in the model README. For more technical information about generation parameters and recipes for how to use Llama 3.1 in applications, please go here.

mllama
24,909
85

Qwen3-8B-GGUF

See our collection for all versions of Qwen3 including GGUF, 4-bit & 16-bit formats. Unsloth Dynamic 2.0 achieves superior accuracy & outperforms other leading quants. - Fine-tune Qwen3 (14B) for free using our Google Colab notebook here! - Read our Blog about Qwen3 support: unsloth.ai/blog/qwen3 - View the rest of our notebooks in our docs here. - Run & export your fine-tuned model to Ollama, llama.cpp or HF. | Unsloth supports | Free Notebooks | Performance | Memory use | |-----------------|--------------------------------------------------------------------------------------------------------------------------|-------------|----------| | Qwen3 (14B) | ▶️ Start on Colab | 3x faster | 70% less | | GRPO with Qwen3 (8B) | ▶️ Start on Colab | 3x faster | 80% less | | Llama-3.2 (3B) | ▶️ Start on Colab-Conversational.ipynb) | 2.4x faster | 58% less | | Llama-3.2 (11B vision) | ▶️ Start on Colab-Vision.ipynb) | 2x faster | 60% less | | Qwen2.5 (7B) | ▶️ Start on Colab-Alpaca.ipynb) | 2x faster | 60% less | | Phi-4 (14B) | ▶️ Start on Colab | 2x faster | 50% less | To Switch Between Thinking and Non-Thinking If you are using llama.cpp, Ollama, Open WebUI etc., you can add `/think` and `/nothink` to user prompts or system messages to switch the model's thinking mode from turn to turn. The model will follow the most recent instruction in multi-turn conversations. Qwen3 is the latest generation of large language models in Qwen series, offering a comprehensive suite of dense and mixture-of-experts (MoE) models. Built upon extensive training, Qwen3 delivers groundbreaking advancements in reasoning, instruction-following, agent capabilities, and multilingual support, with the following key features: - Uniquely support of seamless switching between thinking mode (for complex logical reasoning, math, and coding) and non-thinking mode (for efficient, general-purpose dialogue) within single model, ensuring optimal performance across various scenarios. 
- Significantly enhanced reasoning capabilities, surpassing the previous QwQ (in thinking mode) and Qwen2.5 instruct models (in non-thinking mode) on mathematics, code generation, and commonsense logical reasoning.
- Superior human preference alignment, excelling in creative writing, role-playing, multi-turn dialogues, and instruction following, to deliver a more natural, engaging, and immersive conversational experience.
- Expertise in agent capabilities, enabling precise integration with external tools in both thinking and non-thinking modes and achieving leading performance among open-source models in complex agent-based tasks.
- Support for 100+ languages and dialects, with strong capabilities for multilingual instruction following and translation.

Qwen3-8B has the following features:
- Type: Causal Language Models
- Training Stage: Pretraining & Post-training
- Number of Parameters: 8.2B
- Number of Parameters (Non-Embedding): 6.95B
- Number of Layers: 36
- Number of Attention Heads (GQA): 32 for Q and 8 for KV
- Context Length: 32,768 natively and 131,072 tokens with YaRN.

For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our blog, GitHub, and Documentation.

Qwen3 support is included in the latest Hugging Face `transformers`, and we advise you to use the latest version of `transformers`.
The end of the `transformers` usage snippet separates the thinking content from the final answer. The earlier part of the snippet, which defines `tokenizer` and `output_ids`, is not present in this copy:

```python
# Assumes `tokenizer` and `output_ids` were defined earlier in the snippet.
try:
    # Find the last occurrence of token ID 151668, which closes the thinking block
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content)
print("content:", content)
```

Serving with vLLM or SGLang:

```shell
vllm serve Qwen/Qwen3-8B --enable-reasoning --reasoning-parser deepseek_r1
```

```shell
python -m sglang.launch_server --model-path Qwen/Qwen3-8B --reasoning-parser deepseek-r1
```

Enabling and disabling thinking mode with `enable_thinking`:

```python
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True  # True is the default value for enable_thinking
)
```

```python
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False  # Setting enable_thinking=False disables thinking mode
)
```

Multi-turn conversation using the `/think` and `/no_think` switches:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

class QwenChatbot:
    def __init__(self, model_name="Qwen/Qwen3-8B"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(model_name)
        self.history = []

    def generate_response(self, user_input):
        messages = self.history + [{"role": "user", "content": user_input}]

        text = self.tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True
        )

        inputs = self.tokenizer(text, return_tensors="pt")
        # Keep only the newly generated tokens, dropping the prompt
        response_ids = self.model.generate(**inputs, max_new_tokens=32768)[0][len(inputs.input_ids[0]):].tolist()
        response = self.tokenizer.decode(response_ids, skip_special_tokens=True)

        # Update history
        self.history.append({"role": "user", "content": user_input})
        self.history.append({"role": "assistant", "content": response})

        return response

# Example Usage
if __name__ == "__main__":
    chatbot = QwenChatbot()

    # First input (without /think or /no_think tags, thinking mode is enabled by default)
    user_input_1 = "How many r's in strawberries?"
    print(f"User: {user_input_1}")
    response_1 = chatbot.generate_response(user_input_1)
    print(f"Bot: {response_1}")
    print("----------------------")

    # Second input with /no_think
    user_input_2 = "Then, how many r's in blueberries? /no_think"
    print(f"User: {user_input_2}")
    response_2 = chatbot.generate_response(user_input_2)
    print(f"Bot: {response_2}")
    print("----------------------")

    # Third input with /think
    user_input_3 = "Really? /think"
    print(f"User: {user_input_3}")
    response_3 = chatbot.generate_response(user_input_3)
    print(f"Bot: {response_3}")
```

Agentic use with Qwen-Agent:

```python
from qwen_agent.agents import Assistant

# Define LLM
llm_cfg = {
    'model': 'Qwen3-8B',

    # Use the endpoint provided by Alibaba Model Studio:
    # 'model_type': 'qwen_dashscope',
    # 'api_key': os.getenv('DASHSCOPE_API_KEY'),

    # Use a custom endpoint compatible with OpenAI API:
    'model_server': 'http://localhost:8000/v1',  # api_base
    'api_key': 'EMPTY',

    # Other parameters:
    # 'generate_cfg': {
    #     # Add: When the response content is `<think>this is the thought</think>this is the answer;
    #     # Do not add: When the response has been separated by reasoning_content and content.
    #     'thought_in_content': True,
    # },
}

# Define Tools
tools = [
    {'mcpServers': {  # You can specify the MCP configuration file
        'time': {
            'command': 'uvx',
            'args': ['mcp-server-time', '--local-timezone=Asia/Shanghai']
        },
        "fetch": {
            "command": "uvx",
            "args": ["mcp-server-fetch"]
        }
    }
    },
    'code_interpreter',  # Built-in tools
]

# Define Agent
bot = Assistant(llm=llm_cfg, function_list=tools)

# Streaming generation
messages = [{'role': 'user', 'content': 'https://qwenlm.github.io/blog/ Introduce the latest developments of Qwen'}]
for responses in bot.run(messages=messages):
    pass
print(responses)
```

Extending the context length with YaRN, first in `config.json` and then on the command line for vLLM, SGLang, and `llama-server`:

```json
{
    ...,
    "rope_scaling": {
        "type": "yarn",
        "factor": 4.0,
        "original_max_position_embeddings": 32768
    }
}
```

```shell
vllm serve ... --rope-scaling '{"type":"yarn","factor":4.0,"original_max_position_embeddings":32768}' --max-model-len 131072
```

```shell
python -m sglang.launch_server ... --json-model-override-args '{"rope_scaling":{"type":"yarn","factor":4.0,"original_max_position_embeddings":32768}}'
```

```shell
llama-server ... --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 32768
```

> Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'original_max_position_embeddings'}

```
@misc{qwen3,
    title  = {Qwen3},
    url    = {https://qwenlm.github.io/blog/qwen3/},
    author = {Qwen Team},
    month  = {April},
    year   = {2025}
}
```
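The `factor` of 4.0 in the YaRN settings above is the ratio of the extended context length to the native one (131,072 / 32,768). A minimal sketch of that arithmetic follows; `yarn_rope_scaling` is a hypothetical helper written for this example, and its key names mirror the `rope_scaling` fragment in this card rather than any particular library version.

```python
import json


def yarn_rope_scaling(target_context: int, native_context: int = 32768) -> dict:
    """Build a rope_scaling fragment for a desired context length (illustrative only)."""
    if target_context < native_context:
        raise ValueError("target context is shorter than the native context")
    return {
        "type": "yarn",
        # Scaling factor = extended context length / native context length
        "factor": target_context / native_context,
        "original_max_position_embeddings": native_context,
    }


print(json.dumps(yarn_rope_scaling(131072)))
# {"type": "yarn", "factor": 4.0, "original_max_position_embeddings": 32768}
```

The same ratio is what `--rope-scale 4` expresses for `llama-server`.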

license:apache-2.0
24,718
74

gemma-3n-E4B-it-unsloth-bnb-4bit

23,989
9

JanusCoder-8B-GGUF

> [!NOTE] > Includes Unsloth chat template fixes! For `llama.cpp`, use `--jinja` > Unsloth Dynamic 2.0 achieves superior accuracy & outperforms other leading quants. 💻Github Repo • 🤗Model Collections • 📜Technical Report We introduce JanusCoder and JanusCoderV, a suite of open-source foundational models designed to establish a unified visual-programmatic interface for code intelligence. This model suite is built upon open-source language models (such as Qwen3-8B and 14B) and multimodal models (such as Qwen2.5-VL and InternVL3.5-8B). The JanusCoder series is trained on JANUSCODE-800K—the largest multimodal code corpus to date, generated by an innovative synthesis toolkit, covering everything from standard charts to complex interactive Web UIs and code-driven animations. This enables the models to uniformly handle diverse visual-programmatic tasks, such as generating code from textual instructions, visual inputs, or a combination of both, rather than building specialized models for isolated tasks. JanusCoder excels at flexible content generation (like data visualizations and interactive front-ends) as well as precise, program-driven editing of visual effects and complex animation construction. | Model Name | Description | Download | | --- | --- | --- | | 👉 JanusCoder-8B | 8B text model based on Qwen3-8B. | 🤗 Model | | JanusCoder-14B | 14B text model based on Qwen3-14B. | 🤗 Model | | JanusCoderV-7B | 7B multimodal model based on Qwen2.5-VL-7B. | 🤗 Model | | JanusCoderV-8B | 8B multimodal model based on InternVL3.5-8B. 
| 🤗 Model | We evaluate the JanusCoder model on various benchmarks that span code interlligence tasks on multiple PLs: | Model | JanusCoder-8B | Qwen3-8B | Qwen2.5-Coder-7B-Instruct | LLaMA3-8B-Instruct | GPT-4o | | --- | --- | --- | --- | --- | --- | | PandasPlotBench (Task) | 80 | 74 | 76 | 69 | 85 | | ArtifactsBench | 39.6 | 36.5 | 26.0 | 36.5 | 37.9 | | DTVBench (Manim) | 9.70 | 6.20 | 8.56 | 4.92 | 10.60 | | DTVBench (Wolfram) | 6.07 | 5.18 | 4.04 | 3.15 | 5.97 | The following provides demo code illustrating how to generate text using JanusCoder-8B. > Please use transformers >= 4.55.0 to ensure the model works normally. Citation 🫶 If you are interested in our work or find the repository / checkpoints / benchmark / data helpful, please consider using the following citation format when referencing our papers:

license:apache-2.0
23,890
2

Qwen3-Coder-30B-A3B-Instruct-1M-GGUF

> [!NOTE] > Extends context length from 256K to 1 million > See our collection for all versions of Qwen3 including GGUF, 4-bit & 16-bit formats. Learn to run Qwen3-Coder correctly - Read our Guide . See Unsloth Dynamic 2.0 GGUFs for our quantization benchmarks. - Fine-tune Qwen3 (14B) for free using our Google Colab notebook! - Read our Blog about Qwen3 support: unsloth.ai/blog/qwen3 - View the rest of our notebooks in our docs here. | Unsloth supports | Free Notebooks | Performance | Memory use | |-----------------|--------------------------------------------------------------------------------------------------------------------------|-------------|----------| | Qwen3 (14B) | ▶️ Start on Colab | 3x faster | 70% less | | GRPO with Qwen3 (8B) | ▶️ Start on Colab | 3x faster | 80% less | | Llama-3.2 (3B) | ▶️ Start on Colab-Conversational.ipynb) | 2.4x faster | 58% less | | Llama-3.2 (11B vision) | ▶️ Start on Colab-Vision.ipynb) | 2x faster | 60% less | | Qwen2.5 (7B) | ▶️ Start on Colab-Alpaca.ipynb) | 2x faster | 60% less | Qwen3-Coder is available in multiple sizes. Today, we're excited to introduce Qwen3-Coder-30B-A3B-Instruct. This streamlined model maintains impressive performance and efficiency, featuring the following key enhancements: - Significant Performance among open models on Agentic Coding, Agentic Browser-Use, and other foundational coding tasks. - Long-context Capabilities with native support for 256K tokens, extendable up to 1M tokens using Yarn, optimized for repository-scale understanding. - Agentic Coding supporting for most platform such as Qwen Code, CLINE, featuring a specially designed function call format. 
Qwen3-Coder-30B-A3B-Instruct has the following features:
- Type: Causal Language Models
- Training Stage: Pretraining & Post-training
- Number of Parameters: 30.5B in total and 3.3B activated
- Number of Layers: 48
- Number of Attention Heads (GQA): 32 for Q and 4 for KV
- Number of Experts: 128
- Number of Activated Experts: 8
- Context Length: 262,144 natively.

NOTE: This model supports only non-thinking mode and does not generate `<think></think>` blocks in its output. Meanwhile, specifying `enable_thinking=False` is no longer required.

For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our blog, GitHub, and Documentation.

We advise you to use the latest version of `transformers`.

```python
from openai import OpenAI

# Define Tools
tools = [
    {
        "type": "function",
        "function": {
            "name": "square_the_number",
            "description": "output the square of the number.",
            "parameters": {
                "type": "object",
                "required": ["input_num"],
                "properties": {
                    "input_num": {
                        "type": "number",
                        "description": "input_num is a number that will be squared"
                    }
                },
            }
        }
    }
]

# Define LLM
client = OpenAI(
    # Use a custom endpoint compatible with OpenAI API
    base_url='http://localhost:8000/v1',  # api_base
    api_key="EMPTY"
)

messages = [{'role': 'user', 'content': 'square the number 1024'}]

completion = client.chat.completions.create(
    messages=messages,
    model="Qwen3-Coder-30B-A3B-Instruct",
    max_tokens=65536,
    tools=tools,
)
```

```
@misc{qwen3technicalreport,
      title={Qwen3 Technical Report},
      author={Qwen Team},
      year={2025},
      eprint={2505.09388},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.09388},
}
```
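The request above only declares the `square_the_number` tool; the application still has to run it when the model asks for it. The sketch below shows one hedged way to do that. Both `square_the_number` and `dispatch_tool_call` are illustrative implementations written for this example, and the tool call is assumed to arrive as a function name plus a JSON-encoded argument string, as in the OpenAI-compatible chat format.

```python
import json


def square_the_number(input_num: float) -> float:
    """Illustrative implementation of the declared tool."""
    return input_num ** 2


# Maps declared tool names to local Python callables (assumed registry layout).
TOOL_REGISTRY = {"square_the_number": square_the_number}


def dispatch_tool_call(name: str, arguments_json: str) -> dict:
    """Run a requested tool and wrap the result as a simplified tool message."""
    if name not in TOOL_REGISTRY:
        raise KeyError(f"unknown tool: {name}")
    kwargs = json.loads(arguments_json)
    result = TOOL_REGISTRY[name](**kwargs)
    # Real clients typically also attach the ID of the tool call being answered.
    return {"role": "tool", "name": name, "content": json.dumps(result)}


print(dispatch_tool_call("square_the_number", '{"input_num": 1024}'))
# {'role': 'tool', 'name': 'square_the_number', 'content': '1048576'}
```

Appending the returned message to `messages` and calling the model again lets it phrase the final answer.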

license:apache-2.0
23,867
120

Qwen3-4B-Base

license:apache-2.0
23,557
4

Qwen3-0.6B

23,095
15

Qwen3-4B-Thinking-2507

Unsloth Dynamic 2.0 achieves superior accuracy & outperforms other leading quants.

Over the past three months, we have continued to scale the thinking capability of Qwen3-4B, improving both the quality and depth of reasoning. We are pleased to introduce Qwen3-4B-Thinking-2507, featuring the following key enhancements:

- Significantly improved performance on reasoning tasks, including logical reasoning, mathematics, science, coding, and academic benchmarks that typically require human expertise.
- Markedly better general capabilities, such as instruction following, tool usage, text generation, and alignment with human preferences.
- Enhanced 256K long-context understanding capabilities.

NOTE: This version has an increased thinking length. We strongly recommend its use in highly complex reasoning tasks.

Qwen3-4B-Thinking-2507 has the following features:
- Type: Causal Language Models
- Training Stage: Pretraining & Post-training
- Number of Parameters: 4.0B
- Number of Parameters (Non-Embedding): 3.6B
- Number of Layers: 36
- Number of Attention Heads (GQA): 32 for Q and 8 for KV
- Context Length: 262,144 natively.

NOTE: This model supports only thinking mode. Meanwhile, specifying `enable_thinking=True` is no longer required.

Additionally, to enforce model thinking, the default chat template automatically includes `<think>`. Therefore, it is normal for the model's output to contain only `</think>` without an explicit opening `<think>` tag.

For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our blog, GitHub, and Documentation.
| | Qwen3-30B-A3B Thinking | Qwen3-4B Thinking | Qwen3-4B-Thinking-2507 | |--- | --- | --- | --- | | Knowledge | | | | MMLU-Pro | 78.5 | 70.4 | 74.0 | | MMLU-Redux | 89.5 | 83.7 | 86.1 | | GPQA | 65.8 | 55.9 | 65.8 | | SuperGPQA | 51.8 | 42.7 | 47.8 | | Reasoning | | | | AIME25 | 70.9 | 65.6 | 81.3 | | HMMT25 | 49.8 | 42.1 | 55.5 | | LiveBench 20241125 | 74.3 | 63.6 | 71.8 | | Coding | | | | LiveCodeBench v6 (25.02-25.05) | 57.4 | 48.4 | 55.2 | | CFEval | 1940 | 1671 | 1852 | | OJBench | 20.7 | 16.1 | 17.9 | | Alignment | | | | IFEval | 86.5 | 81.9 | 87.4 | | Arena-Hard v2$ | 36.3 | 13.7 | 34.9 | | Creative Writing v3 | 79.1 | 61.1 | 75.6 | | WritingBench | 77.0 | 73.5 | 83.3 | | Agent | | | | BFCL-v3 | 69.1 | 65.9 | 71.2 | | TAU1-Retail | 61.7 | 33.9 | 66.1 | | TAU1-Airline | 32.0 | 32.0 | 48.0 | | TAU2-Retail | 34.2 | 38.6 | 53.5 | | TAU2-Airline | 36.0 | 28.0 | 58.0 | | TAU2-Telecom | 22.8 | 17.5 | 27.2 | | Multilingualism | | | | MultiIF | 72.2 | 66.3 | 77.3 | | MMLU-ProX | 73.1 | 61.0 | 64.2 | | INCLUDE | 71.9 | 61.8 | 64.4 | | PolyMATH | 46.1 | 40.0 | 46.2 | $ For reproducibility, we report the win rates evaluated by GPT-4.1. \& For highly challenging tasks (including PolyMATH and all reasoning and coding tasks), we use an output length of 81,920 tokens. For all other tasks, we set the output length to 32,768. The code of Qwen3 has been in the latest Hugging Face `transformers` and we advise you to use the latest version of `transformers`. 
The end of the `transformers` usage snippet separates the thinking content from the final answer. The earlier part of the snippet, which defines `tokenizer` and `output_ids`, is not present in this copy:

```python
# Assumes `tokenizer` and `output_ids` were defined earlier in the snippet.
try:
    # Find the last occurrence of token ID 151668, which closes the thinking block
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content)  # no opening <think> tag
print("content:", content)
```

Serving with SGLang or vLLM:

```shell
python -m sglang.launch_server --model-path Qwen/Qwen3-4B-Thinking-2507 --context-length 262144 --reasoning-parser deepseek-r1
```

```shell
vllm serve Qwen/Qwen3-4B-Thinking-2507 --max-model-len 262144 --enable-reasoning --reasoning-parser deepseek_r1
```

Agentic use with Qwen-Agent:

```python
from qwen_agent.agents import Assistant

# Define LLM
# Using OpenAI-compatible API endpoint. It is recommended to disable the reasoning and the tool call parsing
# functionality of the deployment frameworks and let Qwen-Agent automate the related operations. For example,
# `VLLM_USE_MODELSCOPE=true vllm serve Qwen/Qwen3-4B-Thinking-2507 --served-model-name Qwen3-4B-Thinking-2507 --max-model-len 262144`.
llm_cfg = {
    'model': 'Qwen3-4B-Thinking-2507',

    # Use a custom endpoint compatible with OpenAI API:
    'model_server': 'http://localhost:8000/v1',  # api_base without reasoning and tool call parsing
    'api_key': 'EMPTY',
    'generate_cfg': {
        'thought_in_content': True,
    },
}

# Define Tools
tools = [
    {'mcpServers': {  # You can specify the MCP configuration file
        'time': {
            'command': 'uvx',
            'args': ['mcp-server-time', '--local-timezone=Asia/Shanghai']
        },
        "fetch": {
            "command": "uvx",
            "args": ["mcp-server-fetch"]
        }
    }
    },
    'code_interpreter',  # Built-in tools
]

# Define Agent
bot = Assistant(llm=llm_cfg, function_list=tools)

# Streaming generation
messages = [{'role': 'user', 'content': 'https://qwenlm.github.io/blog/ Introduce the latest developments of Qwen'}]
for responses in bot.run(messages=messages):
    pass
print(responses)
```

```
@misc{qwen3technicalreport,
      title={Qwen3 Technical Report},
      author={Qwen Team},
      year={2025},
      eprint={2505.09388},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.09388},
}
```
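Because this model's chat template opens the thinking block itself, decoded output usually contains only a closing `</think>` tag. When working with decoded text rather than token IDs, the split can be sketched as below. `split_thinking` is a hypothetical helper written for this example; it is not part of `transformers` or Qwen-Agent.

```python
def split_thinking(text: str, close_tag: str = "</think>") -> tuple[str, str]:
    """Split decoded output into (thinking, answer) at the last closing tag."""
    head, separator, tail = text.rpartition(close_tag)
    if not separator:
        # No closing tag found: treat the whole output as the answer.
        return "", text.strip()
    # Tolerate an explicit opening tag in case a template or server adds one.
    thinking = head.strip().removeprefix("<think>").strip()
    return thinking, tail.strip()


thinking, answer = split_thinking("Let me count the letters.\n</think>\n\nThere are 3.")
print(repr(thinking))  # 'Let me count the letters.'
print(repr(answer))    # 'There are 3.'
```

Splitting on the last closing tag mirrors the reverse index search used in the token-ID version of this step.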

license:apache-2.0
23,075
6

Qwen3-4B-GGUF

See our collection for all versions of Qwen3 including GGUF, 4-bit & 16-bit formats. Unsloth Dynamic 2.0 achieves superior accuracy & outperforms other leading quants.

- Fine-tune Qwen3 (14B) for free using our Google Colab notebook here!
- Read our Blog about Qwen3 support: unsloth.ai/blog/qwen3
- View the rest of our notebooks in our docs here.
- Run & export your fine-tuned model to Ollama, llama.cpp or HF.

| Unsloth supports | Free Notebooks | Performance | Memory use |
|---|---|---|---|
| Qwen3 (14B) | ▶️ Start on Colab | 3x faster | 70% less |
| GRPO with Qwen3 (8B) | ▶️ Start on Colab | 3x faster | 80% less |
| Llama-3.2 (3B) | ▶️ Start on Colab | 2.4x faster | 58% less |
| Llama-3.2 (11B vision) | ▶️ Start on Colab | 2x faster | 60% less |
| Qwen2.5 (7B) | ▶️ Start on Colab | 2x faster | 60% less |
| Phi-4 (14B) | ▶️ Start on Colab | 2x faster | 50% less |

To Switch Between Thinking and Non-Thinking

If you are using llama.cpp, Ollama, Open WebUI etc., you can add `/think` and `/no_think` to user prompts or system messages to switch the model's thinking mode from turn to turn. In multi-turn conversations, the model follows the most recent instruction.

Qwen3 is the latest generation of large language models in the Qwen series, offering a comprehensive suite of dense and mixture-of-experts (MoE) models. Built upon extensive training, Qwen3 delivers groundbreaking advancements in reasoning, instruction-following, agent capabilities, and multilingual support, with the following key features:

- Unique support for seamless switching between thinking mode (for complex logical reasoning, math, and coding) and non-thinking mode (for efficient, general-purpose dialogue) within a single model, ensuring optimal performance across various scenarios.
- Significantly enhanced reasoning capabilities, surpassing the previous QwQ (in thinking mode) and Qwen2.5 instruct models (in non-thinking mode) on mathematics, code generation, and commonsense logical reasoning.
- Superior human preference alignment, excelling in creative writing, role-playing, multi-turn dialogues, and instruction following, to deliver a more natural, engaging, and immersive conversational experience.
- Expertise in agent capabilities, enabling precise integration with external tools in both thinking and non-thinking modes and achieving leading performance among open-source models in complex agent-based tasks.
- Support for 100+ languages and dialects, with strong capabilities for multilingual instruction following and translation.

Qwen3-4B has the following features:

- Type: Causal Language Models
- Training Stage: Pretraining & Post-training
- Number of Parameters: 4.0B
- Number of Parameters (Non-Embedding): 3.6B
- Number of Layers: 36
- Number of Attention Heads (GQA): 32 for Q and 8 for KV
- Context Length: 32,768 tokens natively and 131,072 tokens with YaRN.

For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our blog, GitHub, and Documentation.

The code for Qwen3 is included in the latest Hugging Face `transformers`, and we advise you to use the latest version of `transformers`.
```python
# ...
try:
    # Find the last occurrence of token 151668 (</think>)
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content)
print("content:", content)
```

```shell
vllm serve Qwen/Qwen3-4B --enable-reasoning --reasoning-parser deepseek_r1
```

```shell
python -m sglang.launch_server --model-path Qwen/Qwen3-4B --reasoning-parser deepseek-r1
```

```python
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True  # True is the default value for enable_thinking
)
```

```python
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False  # Setting enable_thinking=False disables thinking mode
)
```

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

class QwenChatbot:
    def __init__(self, model_name="Qwen/Qwen3-4B"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(model_name)
        self.history = []

    def generate_response(self, user_input):
        messages = self.history + [{"role": "user", "content": user_input}]

        text = self.tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True
        )

        inputs = self.tokenizer(text, return_tensors="pt")
        # Keep only the newly generated tokens, dropping the prompt tokens
        response_ids = self.model.generate(**inputs, max_new_tokens=32768)[0][len(inputs.input_ids[0]):].tolist()
        response = self.tokenizer.decode(response_ids, skip_special_tokens=True)

        # Update history
        self.history.append({"role": "user", "content": user_input})
        self.history.append({"role": "assistant", "content": response})

        return response

# Example Usage
if __name__ == "__main__":
    chatbot = QwenChatbot()

    # First input (without /think or /no_think tags, thinking mode is enabled by default)
    user_input_1 = "How many r's in strawberries?"
    print(f"User: {user_input_1}")
    response_1 = chatbot.generate_response(user_input_1)
    print(f"Bot: {response_1}")
    print("----------------------")

    # Second input with /no_think
    user_input_2 = "Then, how many r's in blueberries? /no_think"
    print(f"User: {user_input_2}")
    response_2 = chatbot.generate_response(user_input_2)
    print(f"Bot: {response_2}")
    print("----------------------")

    # Third input with /think
    user_input_3 = "Really? /think"
    print(f"User: {user_input_3}")
    response_3 = chatbot.generate_response(user_input_3)
    print(f"Bot: {response_3}")
```

```python
import os
from qwen_agent.agents import Assistant

# Define LLM
llm_cfg = {
    'model': 'Qwen3-4B',

    # Use the endpoint provided by Alibaba Model Studio:
    # 'model_type': 'qwen_dashscope',
    # 'api_key': os.getenv('DASHSCOPE_API_KEY'),

    # Use a custom endpoint compatible with OpenAI API:
    'model_server': 'http://localhost:8000/v1',  # api_base
    'api_key': 'EMPTY',

    # Other parameters:
    # 'generate_cfg': {
    #     # Add: When the response content is `<think>this is the thought</think>this is the answer;
    #     # Do not add: When the response has been separated by reasoning_content and content.
    #     'thought_in_content': True,
    # },
}

# Define Tools
tools = [
    {'mcpServers': {  # You can specify the MCP configuration file
        'time': {
            'command': 'uvx',
            'args': ['mcp-server-time', '--local-timezone=Asia/Shanghai']
        },
    }},
    'code_interpreter',  # Built-in tools
]

# Define Agent
bot = Assistant(llm=llm_cfg, function_list=tools)

# Streaming generation
messages = [{'role': 'user', 'content': 'What time is it?'}]
for responses in bot.run(messages=messages):
    pass
print(responses)
```

```json
{
    ...,
    "rope_scaling": {
        "type": "yarn",
        "factor": 4.0,
        "original_max_position_embeddings": 32768
    }
}
```

```shell
vllm serve ... --rope-scaling '{"type":"yarn","factor":4.0,"original_max_position_embeddings":32768}' --max-model-len 131072
```

```shell
python -m sglang.launch_server ... --json-model-override-args '{"rope_scaling":{"type":"yarn","factor":4.0,"original_max_position_embeddings":32768}}'
```

```shell
llama-server ... --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 32768
```

> Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'original_max_position_embeddings'}

```
@misc{qwen3,
    title  = {Qwen3},
    url    = {https://qwenlm.github.io/blog/qwen3/},
    author = {Qwen Team},
    month  = {April},
    year   = {2025}
}
```
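The token-id based split shown earlier in this card needs the raw output ids. When only the decoded text is available (for example from an OpenAI-compatible endpoint that returns the thought inside the content), a string-level split can be sketched as follows. This is a minimal sketch, not the Qwen team's method: the helper name `split_thinking` is hypothetical, and it assumes the thought is wrapped in `<think>` and `</think>` tags.

```python
def split_thinking(text: str,
                   open_tag: str = "<think>",
                   close_tag: str = "</think>") -> tuple[str, str]:
    """Split decoded output into (thinking, answer).

    Hypothetical helper: assumes the thought, if any, ends at the LAST
    occurrence of close_tag, mirroring the reverse search on token ids.
    """
    head, sep, tail = text.rpartition(close_tag)
    if not sep:
        # No closing tag found: treat the whole text as the answer.
        return "", text.strip("\n")
    head = head.lstrip("\n")
    if head.startswith(open_tag):
        head = head[len(open_tag):]
    return head.strip("\n"), tail.strip("\n")


if __name__ == "__main__":
    thinking, answer = split_thinking("<think>\ncount the letters\n</think>\n\nThree.")
    print("thinking content:", thinking)
    print("content:", answer)
```

Searching from the right keeps the behaviour close to the `output_ids[::-1].index(...)` approach: if the closing tag is missing, everything is treated as the final answer.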

NaNK
license:apache-2.0
23,024
84

Qwen3-VL-8B-Thinking-GGUF

> [!NOTE]
> Includes Unsloth chat template fixes!
> See our Qwen3-VL collection for all versions including GGUF, 4-bit & 16-bit formats. See Unsloth Dynamic 2.0 GGUFs for our quantization benchmarks.

- Fine-tune Qwen3-VL-8B for free using our Google Colab notebook
- Or train Qwen3-VL with reinforcement learning (GSPO) with our free notebook.
- View the rest of our notebooks in our docs here.

---

Meet Qwen3-VL, the most powerful vision-language model in the Qwen series to date. This generation delivers comprehensive upgrades across the board: superior text understanding & generation, deeper visual perception & reasoning, extended context length, enhanced spatial and video dynamics comprehension, and stronger agent interaction capabilities. Available in Dense and MoE architectures that scale from edge to cloud, with Instruct and reasoning-enhanced Thinking editions for flexible, on-demand deployment.

- Visual Agent: Operates PC/mobile GUIs: recognizes elements, understands functions, invokes tools, completes tasks.
- Visual Coding Boost: Generates Draw.io/HTML/CSS/JS from images/videos.
- Advanced Spatial Perception: Judges object positions, viewpoints, and occlusions; provides stronger 2D grounding and enables 3D grounding for spatial reasoning and embodied AI.
- Long Context & Video Understanding: Native 256K context, expandable to 1M; handles books and hours-long video with full recall and second-level indexing.
- Enhanced Multimodal Reasoning: Excels in STEM/Math, with causal analysis and logical, evidence-based answers.
- Upgraded Visual Recognition: Broader, higher-quality pretraining lets the model “recognize everything”: celebrities, anime, products, landmarks, flora/fauna, etc.
- Expanded OCR: Supports 32 languages (up from 19); robust in low light, blur, and tilt; better with rare/ancient characters and jargon; improved long-document structure parsing.
- Text Understanding on par with pure LLMs: Seamless text-vision fusion for lossless, unified comprehension.

1. Interleaved-MRoPE: Full-frequency allocation over time, width, and height via robust positional embeddings, enhancing long-horizon video reasoning.
2. DeepStack: Fuses multi-level ViT features to capture fine-grained details and sharpen image-text alignment.
3. Text-Timestamp Alignment: Moves beyond T-RoPE to precise, timestamp-grounded event localization for stronger video temporal modeling.

This is the weight repository for Qwen3-VL-8B-Thinking.

Below, we provide simple examples to show how to use Qwen3-VL with 🤖 ModelScope and 🤗 Transformers. The code for Qwen3-VL is in the latest Hugging Face `transformers`, and we advise you to build from source. Here we show a code snippet to show you how to use the chat model with `transformers`.

If you find our work helpful, feel free to cite us.

NaNK
license:apache-2.0
22,852
9

orpheus-3b-0.1-ft

03/18/2025: We are releasing our 3B Orpheus TTS model with additional finetunes. Code is available on GitHub: CanopyAI/Orpheus-TTS

Orpheus TTS is a state-of-the-art, Llama-based Speech-LLM designed for high-quality, empathetic text-to-speech generation. This model has been finetuned to deliver human-level speech synthesis, achieving exceptional clarity, expressiveness, and real-time streaming performance.

- Human-Like Speech: Natural intonation, emotion, and rhythm that is superior to SOTA closed-source models
- Zero-Shot Voice Cloning: Clone voices without prior fine-tuning
- Guided Emotion and Intonation: Control speech and emotion characteristics with simple tags
- Low Latency: ~200ms streaming latency for real-time applications, reducible to ~100ms with input streaming

- GitHub Repo: https://github.com/canopyai/Orpheus-TTS
- Blog Post: https://canopylabs.ai/model-releases
- Colab Inference Notebook: notebook link

Check out our Colab notebook or GitHub repository to see how to run inference on our finetuned models.

Model Misuse

Do not use our models for impersonation without consent, misinformation or deception (including fake news or fraudulent calls), or any illegal or harmful activity. By using this model, you agree to follow all applicable laws and ethical guidelines. We disclaim responsibility for any use.

NaNK
llama
22,623
9

Qwen3-4B

NaNK
22,559
15

Qwen2.5-VL-7B-Instruct-bnb-4bit

NaNK
license:apache-2.0
22,110
13

gemma-3n-E2B-it-GGUF

Learn how to run & fine-tune Gemma 3n correctly - Read our Guide. See our collection for all versions of Gemma 3n including GGUF, 4-bit & 16-bit formats. Unsloth Dynamic 2.0 achieves SOTA accuracy & performance versus other quants.

- Currently only text is supported.
- Ollama: `ollama run hf.co/unsloth/gemma-3n-E4B-it-GGUF:Q4_K_XL` (auto-sets the correct chat template and settings)
- Set temperature = 1.0, top_k = 64, top_p = 0.95, min_p = 0.0
- Gemma 3n max tokens (context length): 32K.
- Gemma 3n chat template:
- For complete detailed instructions, see our step-by-step guide.

- Fine-tune Gemma 3n (4B) for free using our Google Colab notebook here!
- Read our Blog about Gemma 3n support: unsloth.ai/blog/gemma-3n
- View the rest of our notebooks in our docs here.

| Unsloth supports | Free Notebooks | Performance | Memory use |
|---|---|---|---|
| Gemma-3n-E4B | ▶️ Start on Colab | 2x faster | 60% less |
| GRPO with Gemma 3 (1B) | ▶️ Start on Colab | 2x faster | 80% less |
| Gemma 3 (4B) Vision | ▶️ Start on Colab | 2x faster | 60% less |
| Qwen3 (14B) | ▶️ Start on Colab | 2x faster | 60% less |
| DeepSeek-R1-0528-Qwen3-8B (14B) | ▶️ Start on Colab | 2x faster | 80% less |
| Llama-3.2 (3B) | ▶️ Start on Colab | 2.4x faster | 58% less |

- Responsible Generative AI Toolkit
- Gemma on Kaggle
- Gemma on HuggingFace
- Gemma on Vertex Model Garden

Summary description and brief definition of inputs and outputs.

Gemma is a family of lightweight, state-of-the-art open models from Google, built from the same research and technology used to create the Gemini models. Gemma 3n models are designed for efficient execution on low-resource devices.
They are capable of multimodal input, handling text, image, video, and audio input, and generating text outputs, with open weights for pre-trained and instruction-tuned variants. These models were trained with data in over 140 spoken languages. Gemma 3n models use selective parameter activation technology to reduce resource requirements. This technique allows the models to operate at an effective size of 2B and 4B parameters, which is lower than the total number of parameters they contain. For more information on Gemma 3n's efficient parameter management technology, see the Gemma 3n page. - Input: - Text string, such as a question, a prompt, or a document to be summarized - Images, normalized to 256x256, 512x512, or 768x768 resolution and encoded to 256 tokens each - Audio data encoded to 6.25 tokens per second from a single channel - Total input context of 32K tokens - Output: - Generated text in response to the input, such as an answer to a question, analysis of image content, or a summary of a document - Total output length up to 32K tokens, subtracting the request input tokens Usage Below, there are some code snippets on how to get quickly started with running the model. First, install the Transformers library. Gemma 3n is supported starting from transformers 4.53.0. Then, copy the snippet from the section that is relevant for your use case. You can initialize the model and processor for inference with `pipeline` as follows. With instruction-tuned models, you need to use chat templates to process our inputs first. Then, you can pass it to the pipeline. Data used for model training and how the data was processed. These models were trained on a dataset that includes a wide variety of sources totalling approximately 11 trillion tokens. The knowledge cutoff date for the training data was June 2024. 
Here are the key components: - Web Documents: A diverse collection of web text ensures the model is exposed to a broad range of linguistic styles, topics, and vocabulary. The training dataset includes content in over 140 languages. - Code: Exposing the model to code helps it to learn the syntax and patterns of programming languages, which improves its ability to generate code and understand code-related questions. - Mathematics: Training on mathematical text helps the model learn logical reasoning, symbolic representation, and to address mathematical queries. - Images: A wide range of images enables the model to perform image analysis and visual data extraction tasks. - Audio: A diverse set of sound samples enables the model to recognize speech, transcribe text from recordings, and identify information in audio data. The combination of these diverse data sources is crucial for training a powerful multimodal model that can handle a wide variety of different tasks and data formats. Here are the key data cleaning and filtering methods applied to the training data: - CSAM Filtering: Rigorous CSAM (Child Sexual Abuse Material) filtering was applied at multiple stages in the data preparation process to ensure the exclusion of harmful and illegal content. - Sensitive Data Filtering: As part of making Gemma pre-trained models safe and reliable, automated techniques were used to filter out certain personal information and other sensitive data from training sets. - Additional methods: Filtering based on content quality and safety in line with our policies. Implementation Information Gemma was trained using Tensor Processing Unit (TPU) hardware (TPUv4p, TPUv5p and TPUv5e). Training generative models requires significant computational power. TPUs, designed specifically for matrix operations common in machine learning, offer several advantages in this domain: - Performance: TPUs are specifically designed to handle the massive computations involved in training generative models. 
They can speed up training considerably compared to CPUs. - Memory: TPUs often come with large amounts of high-bandwidth memory, allowing for the handling of large models and batch sizes during training. This can lead to better model quality. - Scalability: TPU Pods (large clusters of TPUs) provide a scalable solution for handling the growing complexity of large foundation models. You can distribute training across multiple TPU devices for faster and more efficient processing. - Cost-effectiveness: In many scenarios, TPUs can provide a more cost-effective solution for training large models compared to CPU-based infrastructure, especially when considering the time and resources saved due to faster training. These advantages are aligned with Google's commitments to operate sustainably. Training was done using JAX and ML Pathways. JAX allows researchers to take advantage of the latest generation of hardware, including TPUs, for faster and more efficient training of large models. ML Pathways is Google's latest effort to build artificially intelligent systems capable of generalizing across multiple tasks. This is specially suitable for foundation models, including large language models like these ones. Together, JAX and ML Pathways are used as described in the paper about the Gemini family of models: "the 'single controller' programming model of Jax and Pathways allows a single Python process to orchestrate the entire training run, dramatically simplifying the development workflow." These models were evaluated at full precision (float32) against a large collection of different datasets and metrics to cover different aspects of content generation. Evaluation results marked with IT are for instruction-tuned models. Evaluation results marked with PT are for pre-trained models. 
| Benchmark | Metric | n-shot | E2B PT | E4B PT |
| ------------------------------ |----------------|----------|:--------:|:--------:|
| [HellaSwag][hellaswag] | Accuracy | 10-shot | 72.2 | 78.6 |
| [BoolQ][boolq] | Accuracy | 0-shot | 76.4 | 81.6 |
| [PIQA][piqa] | Accuracy | 0-shot | 78.9 | 81.0 |
| [SocialIQA][socialiqa] | Accuracy | 0-shot | 48.8 | 50.0 |
| [TriviaQA][triviaqa] | Accuracy | 5-shot | 60.8 | 70.2 |
| [Natural Questions][naturalq] | Accuracy | 5-shot | 15.5 | 20.9 |
| [ARC-c][arc] | Accuracy | 25-shot | 51.7 | 61.6 |
| [ARC-e][arc] | Accuracy | 0-shot | 75.8 | 81.6 |
| [WinoGrande][winogrande] | Accuracy | 5-shot | 66.8 | 71.7 |
| [BIG-Bench Hard][bbh] | Accuracy | few-shot | 44.3 | 52.9 |
| [DROP][drop] | Token F1 score | 1-shot | 53.9 | 60.8 |

[hellaswag]: https://arxiv.org/abs/1905.07830
[boolq]: https://arxiv.org/abs/1905.10044
[piqa]: https://arxiv.org/abs/1911.11641
[socialiqa]: https://arxiv.org/abs/1904.09728
[triviaqa]: https://arxiv.org/abs/1705.03551
[naturalq]: https://github.com/google-research-datasets/natural-questions
[arc]: https://arxiv.org/abs/1911.01547
[winogrande]: https://arxiv.org/abs/1907.10641
[bbh]: https://paperswithcode.com/dataset/bbh
[drop]: https://arxiv.org/abs/1903.00161

| Benchmark | Metric | n-shot | E2B IT | E4B IT |
| ------------------------------------|-------------------------|----------|:--------:|:--------:|
| [MGSM][mgsm] | Accuracy | 0-shot | 53.1 | 60.7 |
| [WMT24++][wmt24pp] (ChrF) | Character-level F-score | 0-shot | 42.7 | 50.1 |
| [Include][include] | Accuracy | 0-shot | 38.6 | 57.2 |
| [MMLU][mmlu] (ProX) | Accuracy | 0-shot | 8.1 | 19.9 |
| [OpenAI MMLU][openai-mmlu] | Accuracy | 0-shot | 22.3 | 35.6 |
| [Global-MMLU][global-mmlu] | Accuracy | 0-shot | 55.1 | 60.3 |
| [ECLeKTic][eclektic] | ECLeKTic score | 0-shot | 2.5 | 1.9 |

[mgsm]: https://arxiv.org/abs/2210.03057
[wmt24pp]: https://arxiv.org/abs/2502.12404v1
[include]: https://arxiv.org/abs/2411.19799
[mmlu]: https://arxiv.org/abs/2009.03300
[openai-mmlu]: https://huggingface.co/datasets/openai/MMMLU
[global-mmlu]: https://huggingface.co/datasets/CohereLabs/Global-MMLU
[eclektic]: https://arxiv.org/abs/2502.21228

| Benchmark | Metric | n-shot | E2B IT | E4B IT |
| ------------------------------------|--------------------------|----------|:--------:|:--------:|
| [GPQA][gpqa] Diamond | RelaxedAccuracy/accuracy | 0-shot | 24.8 | 23.7 |
| [LiveCodeBench][lcb] v5 | pass@1 | 0-shot | 18.6 | 25.7 |
| Codegolf v2.2 | pass@1 | 0-shot | 11.0 | 16.8 |
| [AIME 2025][aime-2025] | Accuracy | 0-shot | 6.7 | 11.6 |

[gpqa]: https://arxiv.org/abs/2311.12022
[lcb]: https://arxiv.org/abs/2403.07974
[aime-2025]: https://www.vals.ai/benchmarks/aime-2025-05-09

| Benchmark | Metric | n-shot | E2B IT | E4B IT |
| ------------------------------------ |------------|----------|:--------:|:--------:|
| [MMLU][mmlu] | Accuracy | 0-shot | 60.1 | 64.9 |
| [MBPP][mbpp] | pass@1 | 3-shot | 56.6 | 63.6 |
| [HumanEval][humaneval] | pass@1 | 0-shot | 66.5 | 75.0 |
| [LiveCodeBench][lcb] | pass@1 | 0-shot | 13.2 | 13.2 |
| HiddenMath | Accuracy | 0-shot | 27.7 | 37.7 |
| [Global-MMLU-Lite][global-mmlu-lite] | Accuracy | 0-shot | 59.0 | 64.5 |
| [MMLU][mmlu] (Pro) | Accuracy | 0-shot | 40.5 | 50.6 |

[mbpp]: https://arxiv.org/abs/2108.07732
[humaneval]: https://arxiv.org/abs/2107.03374
[global-mmlu-lite]: https://huggingface.co/datasets/CohereForAI/Global-MMLU-Lite

Our evaluation methods include structured evaluations and internal red-teaming testing of relevant content policies. Red-teaming was conducted by a number of different teams, each with different goals and human evaluation metrics. These models were evaluated against a number of different categories relevant to ethics and safety, including:

- Child Safety: Evaluation of text-to-text and image-to-text prompts covering child safety policies, including child sexual abuse and exploitation.
- Content Safety: Evaluation of text-to-text and image to text prompts covering safety policies including, harassment, violence and gore, and hate speech. - Representational Harms: Evaluation of text-to-text and image to text prompts covering safety policies including bias, stereotyping, and harmful associations or inaccuracies. In addition to development level evaluations, we conduct "assurance evaluations" which are our 'arms-length' internal evaluations for responsibility governance decision making. They are conducted separately from the model development team, to inform decision making about release. High level findings are fed back to the model team, but prompt sets are held-out to prevent overfitting and preserve the results' ability to inform decision making. Notable assurance evaluation results are reported to our Responsibility & Safety Council as part of release review. For all areas of safety testing, we saw safe levels of performance across the categories of child safety, content safety, and representational harms relative to previous Gemma models. All testing was conducted without safety filters to evaluate the model capabilities and behaviors. For text-to-text, image-to-text, and audio-to-text, and across all model sizes, the model produced minimal policy violations, and showed significant improvements over previous Gemma models' performance with respect to high severity violations. A limitation of our evaluations was they included primarily English language prompts. These models have certain limitations that users should be aware of. Open generative models have a wide range of applications across various industries and domains. The following list of potential uses is not comprehensive. The purpose of this list is to provide contextual information about the possible use-cases that the model creators considered as part of model training and development. 
- Content Creation and Communication - Text Generation: Generate creative text formats such as poems, scripts, code, marketing copy, and email drafts. - Chatbots and Conversational AI: Power conversational interfaces for customer service, virtual assistants, or interactive applications. - Text Summarization: Generate concise summaries of a text corpus, research papers, or reports. - Image Data Extraction: Extract, interpret, and summarize visual data for text communications. - Audio Data Extraction: Transcribe spoken language, translate speech to text in other languages, and analyze sound-based data. - Research and Education - Natural Language Processing (NLP) and generative model Research: These models can serve as a foundation for researchers to experiment with generative models and NLP techniques, develop algorithms, and contribute to the advancement of the field. - Language Learning Tools: Support interactive language learning experiences, aiding in grammar correction or providing writing practice. - Knowledge Exploration: Assist researchers in exploring large bodies of data by generating summaries or answering questions about specific topics. Limitations - Training Data - The quality and diversity of the training data significantly influence the model's capabilities. Biases or gaps in the training data can lead to limitations in the model's responses. - The scope of the training dataset determines the subject areas the model can handle effectively. - Context and Task Complexity - Models are better at tasks that can be framed with clear prompts and instructions. Open-ended or highly complex tasks might be challenging. - A model's performance can be influenced by the amount of context provided (longer context generally leads to better outputs, up to a certain point). - Language Ambiguity and Nuance - Natural language is inherently complex. Models might struggle to grasp subtle nuances, sarcasm, or figurative language. 
- Factual Accuracy - Models generate responses based on information they learned from their training datasets, but they are not knowledge bases. They may generate incorrect or outdated factual statements. - Common Sense - Models rely on statistical patterns in language. They might lack the ability to apply common sense reasoning in certain situations. Ethical Considerations and Risks The development of generative models raises several ethical concerns. In creating an open model, we have carefully considered the following: - Bias and Fairness - Generative models trained on large-scale, real-world text and image data can reflect socio-cultural biases embedded in the training material. These models underwent careful scrutiny, input data pre-processing described and posterior evaluations reported in this card. - Misinformation and Misuse - Generative models can be misused to generate text that is false, misleading, or harmful. - Guidelines are provided for responsible use with the model, see the Responsible Generative AI Toolkit. - Transparency and Accountability: - This model card summarizes details on the models' architecture, capabilities, limitations, and evaluation processes. - A responsibly developed open model offers the opportunity to share innovation by making generative model technology accessible to developers and researchers across the AI ecosystem. Risks identified and mitigations: - Perpetuation of biases: It's encouraged to perform continuous monitoring (using evaluation metrics, human review) and the exploration of de-biasing techniques during model training, fine-tuning, and other use cases. - Generation of harmful content: Mechanisms and guidelines for content safety are essential. Developers are encouraged to exercise caution and implement appropriate content safety safeguards based on their specific product policies and application use cases. 
- Misuse for malicious purposes: Technical limitations and developer and end-user education can help mitigate against malicious applications of generative models. Educational resources and reporting mechanisms for users to flag misuse are provided. Prohibited uses of Gemma models are outlined in the Gemma Prohibited Use Policy. - Privacy violations: Models were trained on data filtered for removal of certain personal information and other sensitive data. Developers are encouraged to adhere to privacy regulations with privacy-preserving techniques. Benefits At the time of release, this family of models provides high-performance open generative model implementations designed from the ground up for responsible AI development compared to similarly sized models. Using the benchmark evaluation metrics described in this document, these models have shown to provide superior performance to other, comparably-sized open model alternatives.
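
The input specification earlier in this card (256 tokens per image, 6.25 tokens per second of audio, 32K tokens of total input context) lends itself to a quick budget estimate. The sketch below is illustrative only: the function names are hypothetical, "32K" is assumed to mean 32,768 tokens, and fractional audio tokens are rounded up, which the card does not specify.

```python
import math

IMAGE_TOKENS = 256               # tokens per image, from the card
AUDIO_TOKENS_PER_SECOND = 6.25   # tokens per second of audio, from the card
CONTEXT_TOKENS = 32 * 1024       # assumption: "32K" read as 32,768 tokens


def multimodal_tokens(n_images: int, audio_seconds: float) -> int:
    """Estimate the tokens consumed by images and audio (rounded up)."""
    return n_images * IMAGE_TOKENS + math.ceil(audio_seconds * AUDIO_TOKENS_PER_SECOND)


def remaining_text_budget(n_images: int, audio_seconds: float,
                          context: int = CONTEXT_TOKENS) -> int:
    """Tokens left for text after the multimodal inputs are counted."""
    return context - multimodal_tokens(n_images, audio_seconds)


if __name__ == "__main__":
    used = multimodal_tokens(n_images=2, audio_seconds=30)
    print("multimodal tokens:", used)
    print("left for text:", remaining_text_budget(2, 30))
```

Because the output length is stated as up to 32K tokens minus the request input tokens, the same estimate also bounds how much room is left for generation.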

NaNK
21,829
50

Qwen3-30B-A3B-Thinking-2507-GGUF

NaNK
license:apache-2.0
21,566
122

Kimi-K2-Thinking-GGUF

Nov 8: We collaborated with the Kimi team on a system prompt fix. Unsloth Dynamic 2.0 achieves superior accuracy & outperforms other leading quants.

It is recommended to have 247 GB of RAM to run the 1-bit Dynamic GGUF. To run the model in full precision, you can use `UD-Q4_K_XL`, which requires 646 GB RAM.

Kimi K2 Thinking is the latest, most capable version of the open-source thinking model. Starting with Kimi K2, we built it as a thinking agent that reasons step-by-step while dynamically invoking tools. It sets a new state-of-the-art on Humanity's Last Exam (HLE), BrowseComp, and other benchmarks by dramatically scaling multi-step reasoning depth and maintaining stable tool use across 200–300 sequential calls. At the same time, K2 Thinking is a native INT4 quantization model with a 256k context window, achieving lossless reductions in inference latency and GPU memory usage.

Key Features

- Deep Thinking & Tool Orchestration: End-to-end trained to interleave chain-of-thought reasoning with function calls, enabling autonomous research, coding, and writing workflows that last hundreds of steps without drift.
- Native INT4 Quantization: Quantization-Aware Training (QAT) is employed in the post-training stage to achieve a lossless 2x speed-up in low-latency mode.
- Stable Long-Horizon Agency: Maintains coherent goal-directed behavior across up to 200–300 consecutive tool invocations, surpassing prior models that degrade after 30–50 steps.
| | |
|:---:|:---:|
| Architecture | Mixture-of-Experts (MoE) |
| Total Parameters | 1T |
| Activated Parameters | 32B |
| Number of Layers (Dense layer included) | 61 |
| Number of Dense Layers | 1 |
| Attention Hidden Dimension | 7168 |
| MoE Hidden Dimension (per Expert) | 2048 |
| Number of Attention Heads | 64 |
| Number of Experts | 384 |
| Selected Experts per Token | 8 |
| Number of Shared Experts | 1 |
| Vocabulary Size | 160K |
| Context Length | 256K |
| Attention Mechanism | MLA |
| Activation Function | SwiGLU |

Reasoning Tasks

| Benchmark | Setting | K2 Thinking | GPT-5 (High) | Claude Sonnet 4.5 (Thinking) | K2 0905 | DeepSeek-V3.2 | Grok-4 |
|:----------:|:--------:|:------------:|:------:|:----------------------------:|:--------:|:--------------:|:-------:|
| HLE (Text-only) | no tools | 23.9 | 26.3 | 19.8 | 7.9 | 19.8 | 25.4 |
| | w/ tools | 44.9 | 41.7 | 32.0 | 21.7 | 20.3 | 41.0 |
| | heavy | 51.0 | 42.0 | - | - | - | 50.7 |
| AIME25 | no tools | 94.5 | 94.6 | 87.0 | 51.0 | 89.3 | 91.7 |
| | w/ python | 99.1 | 99.6 | 100.0 | 75.2 | 58.1 | 98.8 |
| | heavy | 100.0 | 100.0 | - | - | - | 100.0 |
| HMMT25 | no tools | 89.4 | 93.3 | 74.6 | 38.8 | 83.6 | 90.0 |
| | w/ python | 95.1 | 96.7 | 88.8 | 70.4 | 49.5 | 93.9 |
| | heavy | 97.5 | 100.0 | - | - | - | 96.7 |
| IMO-AnswerBench | no tools | 78.6 | 76.0 | 65.9 | 45.8 | 76.0 | 73.1 |
| GPQA | no tools | 84.5 | 85.7 | 83.4 | 74.2 | 79.9 | 87.5 |

General Tasks

| Benchmark | Setting | K2 Thinking | GPT-5 (High) | Claude Sonnet 4.5 (Thinking) | K2 0905 | DeepSeek-V3.2 |
|:----------:|:--------:|:------------:|:------:|:----------------------------:|:--------:|:--------------:|
| MMLU-Pro | no tools | 84.6 | 87.1 | 87.5 | 81.9 | 85.0 |
| MMLU-Redux | no tools | 94.4 | 95.3 | 95.6 | 92.7 | 93.7 |
| Longform Writing | no tools | 73.8 | 71.4 | 79.8 | 62.8 | 72.5 |
| HealthBench | no tools | 58.0 | 67.2 | 44.2 | 43.8 | 46.9 |

Agentic Search Tasks

| Benchmark | Setting | K2 Thinking | GPT-5 (High) | Claude Sonnet 4.5 (Thinking) | K2 0905 | DeepSeek-V3.2 |
|:----------:|:--------:|:------------:|:------:|:----------------------------:|:--------:|:--------------:|
| BrowseComp | w/ tools | 60.2 | 54.9 | 24.1 | 7.4 | 40.1 |
| BrowseComp-ZH | w/ tools | 62.3 | 63.0 | 42.4 | 22.2 | 47.9 |
| Seal-0 | w/ tools | 56.3 | 51.4 | 53.4 | 25.2 | 38.5 |
| FinSearchComp-T3 | w/ tools | 47.4 | 48.5 | 44.0 | 10.4 | 27.0 |
| Frames | w/ tools | 87.0 | 86.0 | 85.0 | 58.1 | 80.2 |

Coding Tasks

| Benchmark | Setting | K2 Thinking | GPT-5 (High) | Claude Sonnet 4.5 (Thinking) | K2 0905 | DeepSeek-V3.2 |
|:----------:|:--------:|:------------:|:------:|:----------------------------:|:--------:|:--------------:|
| SWE-bench Verified | w/ tools | 71.3 | 74.9 | 77.2 | 69.2 | 67.8 |
| SWE-bench Multilingual | w/ tools | 61.1 | 55.3 | 68.0 | 55.9 | 57.9 |
| Multi-SWE-bench | w/ tools | 41.9 | 39.3 | 44.3 | 33.5 | 30.6 |
| SciCode | no tools | 44.8 | 42.9 | 44.7 | 30.7 | 37.7 |
| LiveCodeBenchV6 | no tools | 83.1 | 87.0 | 64.0 | 56.1 | 74.1 |
| OJ-Bench (cpp) | no tools | 48.7 | 56.2 | 30.4 | 25.5 | 38.2 |
| Terminal-Bench | w/ simulated tools (JSON) | 47.1 | 43.8 | 51.0 | 44.5 | 37.7 |

1. To ensure a fast, lightweight experience, we selectively employ a subset of tools and reduce the number of tool call steps under the chat mode on kimi.com. As a result, chatting on kimi.com may not reproduce our benchmark scores. Our agentic mode will be updated soon to reflect the full capabilities of K2 Thinking.
2. Testing Details:
   2.1. All benchmarks were evaluated at temperature = 1.0 and 256k context length for K2 Thinking, except for SciCode, for which we followed the official temperature setting of 0.0.
   2.2. HLE (no tools), AIME25, HMMT25, and GPQA were capped at a 96k thinking-token budget, while IMO-AnswerBench, LiveCodeBench and OJ-Bench were capped at a 128k thinking-token budget. Longform Writing was capped at a 32k completion-token budget.
   2.3. For AIME and HMMT (no tools), we report the average of 32 runs (avg@32). For AIME and HMMT (with Python), we report the average of 16 runs (avg@16). For IMO-AnswerBench, we report the average of 8 runs (avg@8).
3. Baselines:
   3.1. GPT-5, Claude-4.5-sonnet, Grok-4 and DeepSeek-V3.2 results are quoted from the GPT-5 post, GPT-5 for Developers post, GPT-5 system card, claude-sonnet-4-5 post, grok-4 post, deepseek-v3.2 post, the public Terminal-Bench leaderboard (Terminus-2), the public Vals AI leaderboard and artificialanalysis. Benchmarks for which no public scores were available were re-tested under the same conditions used for K2 Thinking and are marked with an asterisk (*). For the GPT-5 test, we set the reasoning effort to high.
   3.2. The GPT-5 and Grok-4 scores on the HLE full set with tools are 35.2 and 38.6, from the official posts. In our internal evaluation on the HLE text-only subset, GPT-5 scores 41.7 and Grok-4 scores 38.6 (Grok-4's launch cited 41.0 on the text-only subset). For GPT-5's HLE text-only score without tools, we use the score from Scale.ai. The official GPT-5 score on the HLE full set without tools is 24.8.
   3.3. For IMO-AnswerBench: GPT-5 scored 65.6 in the benchmark paper. We re-evaluated GPT-5 with the official API and obtained a score of 76.
4. For HLE (w/ tools) and the agentic-search benchmarks:
   4.1. K2 Thinking was equipped with search, code-interpreter, and web-browsing tools.
   4.2. BrowseComp-ZH, Seal-0 and FinSearchComp-T3 were run 4 times independently and the average is reported (avg@4).
   4.3. The evaluation used o3-mini as judge, configured identically to the official HLE setting; judge prompts were taken verbatim from the official repository.
   4.4. On HLE, the maximum step limit was 120, with a 48k-token reasoning budget per step; on agentic-search tasks, the limit was 300 steps with a 24k-token reasoning budget per step.
   4.5. When tool execution results cause the accumulated input to exceed the model's context limit (256k), we employ a simple context management strategy that hides all previous tool outputs.
   4.6. Web access to Hugging Face may lead to data leakage in certain benchmark tests, such as HLE. K2 Thinking can achieve a score of 51.3 on HLE without blocking Hugging Face. To ensure a fair and rigorous comparison, we blocked access to Hugging Face during testing.
5. For Coding Tasks:
   5.1. Terminal-Bench scores were obtained with the default agent framework (Terminus-2) and the provided JSON parser.
   5.2. For other coding tasks, the result was produced with our in-house evaluation harness. The harness is derived from SWE-agent, but we clamp the context windows of the Bash and Edit tools and rewrite the system prompt to match the task semantics.
   5.3. All reported scores of coding tasks are averaged over 5 independent runs.
6. Heavy Mode: K2 Thinking Heavy Mode employs an efficient parallel strategy: it first rolls out eight trajectories simultaneously, then reflectively aggregates all outputs to generate the final result. Heavy mode for GPT-5 denotes the official GPT-5 Pro score.

Low-bit quantization is an effective way to reduce inference latency and GPU memory usage on large-scale inference servers. However, thinking models use excessive decoding lengths, and thus quantization often results in substantial performance drops. To overcome this challenge, we adopt Quantization-Aware Training (QAT) during the post-training phase, applying INT4 weight-only quantization to the MoE components. It allows K2 Thinking to support native INT4 inference with a roughly 2x generation speed improvement while achieving state-of-the-art performance. All benchmark results are reported under INT4 precision.

The checkpoints are saved in compressed-tensors format, supported by most mainstream inference engines.
If you need the checkpoints in higher precision such as FP8 or BF16, you can refer to official repo of compressed-tensors to unpack the int4 weights and convert to any higher precision. 5. Deployment > [!Note] > You can access K2 Thinking's API on https://platform.moonshot.ai , we provide OpenAI/Anthropic-compatible API for you. Currently, Kimi-K2-Thinking is recommended to run on the following inference engines: Deployment examples can be found in the Model Deployment Guide. Once the local inference service is up, you can interact with it through the chat endpoint: > [!NOTE] > The recommended temperature for Kimi-K2-Thinking is `temperature = 1.0`. > If no special instructions are required, the system prompt above is a good default. Kimi-K2-Thinking has the same tool calling settings as Kimi-K2-Instruct. To enable them, you need to pass the list of available tools in each request, then the model will autonomously decide when and how to invoke them. The following example demonstrates calling a weather tool end-to-end: The `toolcallwithclient` function implements the pipeline from user query to tool execution. This pipeline requires the inference engine to support Kimi-K2’s native tool-parsing logic. For more information, see the Tool Calling Guide. Both the code repository and the model weights are released under the Modified MIT License. If you have any questions, please reach out at [email protected].
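
The tool-calling flow described in the Deployment section (pass the tool list with every request, let the model decide whether to call a tool, then feed the result back) can be sketched roughly as follows. This is a minimal offline sketch, not the card's original example: it assumes an OpenAI-compatible chat-completions message format, and the model id `kimi-k2-thinking`, the `get_weather` tool, and the helper names `build_request` and `run_tool_calls` are all hypothetical placeholders.

```python
import json


def get_weather(city: str) -> dict:
    """Hypothetical tool; a real deployment would query a weather service."""
    return {"city": city, "weather": "sunny"}


# Maps tool names (as the model emits them) to local Python callables.
TOOL_MAP = {"get_weather": get_weather}

# OpenAI-style tool schema; it must be sent with every request.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Retrieve current weather information for a city.",
        "parameters": {
            "type": "object",
            "required": ["city"],
            "properties": {
                "city": {"type": "string", "description": "Name of the city"},
            },
        },
    },
}]


def build_request(messages, model="kimi-k2-thinking"):
    """Assemble a chat-completions request body (model id is an assumption)."""
    return {
        "model": model,
        "messages": messages,
        "temperature": 1.0,  # recommended temperature from the card
        "tools": TOOLS,
        "tool_choice": "auto",
    }


def run_tool_calls(assistant_message, messages):
    """Run each tool the assistant requested and append the results."""
    messages.append(assistant_message)
    for call in assistant_message.get("tool_calls", []):
        name = call["function"]["name"]
        args = json.loads(call["function"]["arguments"])
        result = TOOL_MAP[name](**args)
        messages.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "name": name,
            "content": json.dumps(result),
        })
    return messages
```

In a real client, the body from `build_request` would be sent to the chat endpoint, and the loop would repeat (request, `run_tool_calls`, request) until the response no longer contains tool calls.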

21,431
74

Qwen3-VL-2B-Thinking-GGUF

license:apache-2.0
21,338
6

DeepSeek-R1-Distill-Qwen-14B-GGUF

license:apache-2.0
21,187
86

mistral-7b-instruct-v0.2-bnb-4bit

license:apache-2.0
20,779
33

Qwen3-1.7B

20,435
5

Qwen3-VL-2B-Instruct-GGUF

license:apache-2.0
20,339
8

Meta-Llama-3.1-8B

llama
20,329
42

Meta-Llama-3.1-8B-bnb-4bit

llama
20,248
104

Llama-3.2-3B-Instruct-bnb-4bit

llama
20,206
29

gpt-oss-20b-bnb-4bit

license:apache-2.0
20,095
12

Qwen3-VL-32B-Instruct-GGUF

license:apache-2.0
19,978
13

gemma-3-12b-it

19,968
12

Qwen3-VL-235B-A22B-Instruct-GGUF

license:apache-2.0
19,961
10

gemma-3-270m-it-unsloth-bnb-4bit

Unsloth Dynamic 2.0 achieves superior accuracy & outperforms other leading quants.

[Gemma 3 Technical Report][g3-tech-report]
[Responsible Generative AI Toolkit][rai-toolkit]
[Gemma on Kaggle][kaggle-gemma]
[Gemma on Vertex Model Garden][vertex-mg-gemma3]

Summary description and brief definition of inputs and outputs.

Gemma is a family of lightweight, state-of-the-art open models from Google, built from the same research and technology used to create the Gemini models. Gemma 3 models are multimodal, handling text and image input and generating text output, with open weights for both pre-trained variants and instruction-tuned variants. Gemma 3 has a large, 128K context window, multilingual support in over 140 languages, and is available in more sizes than previous versions. Gemma 3 models are well-suited for a variety of text generation and image understanding tasks, including question answering, summarization, and reasoning. Their relatively small size makes it possible to deploy them in environments with limited resources such as laptops, desktops or your own cloud infrastructure, democratizing access to state-of-the-art AI models and helping foster innovation for everyone.

- Input:
  - Text string, such as a question, a prompt, or a document to be summarized
  - Images, normalized to 896 x 896 resolution and encoded to 256 tokens each, for the 4B, 12B, and 27B sizes.
  - Total input context of 128K tokens for the 4B, 12B, and 27B sizes, and 32K tokens for the 1B and 270M sizes.
- Output:
  - Generated text in response to the input, such as an answer to a question, analysis of image content, or a summary of a document
  - Total output context up to 128K tokens for the 4B, 12B, and 27B sizes, and 32K tokens for the 1B and 270M sizes per request, subtracting the request input tokens

Data used for model training and how the data was processed.

These models were trained on a dataset of text data that includes a wide variety of sources. The 27B model was trained with 14 trillion tokens, the 12B model with 12 trillion tokens, the 4B model with 4 trillion tokens, the 1B model with 2 trillion tokens, and the 270M model with 6 trillion tokens. The knowledge cutoff date for the training data was August 2024. Here are the key components:

- Web Documents: A diverse collection of web text ensures the model is exposed to a broad range of linguistic styles, topics, and vocabulary. The training dataset includes content in over 140 languages.
- Code: Exposing the model to code helps it to learn the syntax and patterns of programming languages, which improves its ability to generate code and understand code-related questions.
- Mathematics: Training on mathematical text helps the model learn logical reasoning, symbolic representation, and to address mathematical queries.
- Images: A wide range of images enables the model to perform image analysis and visual data extraction tasks.

The combination of these diverse data sources is crucial for training a powerful multimodal model that can handle a wide variety of different tasks and data formats.

Here are the key data cleaning and filtering methods applied to the training data:

- CSAM Filtering: Rigorous CSAM (Child Sexual Abuse Material) filtering was applied at multiple stages in the data preparation process to ensure the exclusion of harmful and illegal content.
- Sensitive Data Filtering: As part of making Gemma pre-trained models safe and reliable, automated techniques were used to filter out certain personal information and other sensitive data from training sets.
- Additional methods: Filtering based on content quality and safety in line with [our policies][safety-policies].

Gemma was trained using [Tensor Processing Unit (TPU)][tpu] hardware (TPUv4p, TPUv5p and TPUv5e). Training vision-language models (VLMs) requires significant computational power. TPUs, designed specifically for matrix operations common in machine learning, offer several advantages in this domain:

- Performance: TPUs are specifically designed to handle the massive computations involved in training VLMs. They can speed up training considerably compared to CPUs.
- Memory: TPUs often come with large amounts of high-bandwidth memory, allowing for the handling of large models and batch sizes during training. This can lead to better model quality.
- Scalability: TPU Pods (large clusters of TPUs) provide a scalable solution for handling the growing complexity of large foundation models. You can distribute training across multiple TPU devices for faster and more efficient processing.
- Cost-effectiveness: In many scenarios, TPUs can provide a more cost-effective solution for training large models compared to CPU-based infrastructure, especially when considering the time and resources saved due to faster training.
- These advantages are aligned with [Google's commitments to operate sustainably][sustainability].

Training was done using [JAX][jax] and [ML Pathways][ml-pathways].

JAX allows researchers to take advantage of the latest generation of hardware, including TPUs, for faster and more efficient training of large models. ML Pathways is Google's latest effort to build artificially intelligent systems capable of generalizing across multiple tasks. This is especially suitable for foundation models, including large language models like these.

Together, JAX and ML Pathways are used as described in the [paper about the Gemini family of models][gemini-2-paper]; "the 'single controller' programming model of Jax and Pathways allows a single Python process to orchestrate the entire training run, dramatically simplifying the development workflow."

These models were evaluated against a large collection of different datasets and metrics to cover different aspects of text generation. Evaluation results marked with IT are for instruction-tuned models.
Evaluation results marked with PT are for pre-trained models. | Benchmark | n-shot | Gemma 3 PT 270M | | :------------------------ | :-----------: | ------------------: | | [HellaSwag][hellaswag] | 10-shot | 40.9 | | [BoolQ][boolq] | 0-shot | 61.4 | | [PIQA][piqa] | 0-shot | 67.7 | | [TriviaQA][triviaqa] | 5-shot | 15.4 | | [ARC-c][arc] | 25-shot | 29.0 | | [ARC-e][arc] | 0-shot | 57.7 | | [WinoGrande][winogrande] | 5-shot | 52.0 | [hellaswag]: https://arxiv.org/abs/1905.07830 [boolq]: https://arxiv.org/abs/1905.10044 [piqa]: https://arxiv.org/abs/1911.11641 [triviaqa]: https://arxiv.org/abs/1705.03551 [arc]: https://arxiv.org/abs/1911.01547 [winogrande]: https://arxiv.org/abs/1907.10641 | Benchmark | n-shot | Gemma 3 IT 270m | | :------------------------ | :-----------: | ------------------: | | [HellaSwag][hellaswag] | 0-shot | 37.7 | | [PIQA][piqa] | 0-shot | 66.2 | | [ARC-c][arc] | 0-shot | 28.2 | | [WinoGrande][winogrande] | 0-shot | 52.3 | | [BIG-Bench Hard][bbh] | few-shot | 26.7 | | [IF Eval][ifeval] | 0-shot | 51.2 | [hellaswag]: https://arxiv.org/abs/1905.07830 [piqa]: https://arxiv.org/abs/1911.11641 [arc]: https://arxiv.org/abs/1911.01547 [winogrande]: https://arxiv.org/abs/1907.10641 [bbh]: https://paperswithcode.com/dataset/bbh [bbh]: https://paperswithcode.com/dataset/bbh [ifeval]: https://arxiv.org/abs/2311.07911 | Benchmark | n-shot | Gemma 3 IT 1B | Gemma 3 IT 4B | Gemma 3 IT 12B | Gemma 3 IT 27B | |--------------------------------|--------|:-------------:|:-------------:|:--------------:|:--------------:| | [GPQA][gpqa] Diamond | 0-shot | 19.2 | 30.8 | 40.9 | 42.4 | | [SimpleQA][simpleqa] | 0-shot | 2.2 | 4.0 | 6.3 | 10.0 | | [FACTS Grounding][facts-grdg] | - | 36.4 | 70.1 | 75.8 | 74.9 | | [BIG-Bench Hard][bbh] | 0-shot | 39.1 | 72.2 | 85.7 | 87.6 | | [BIG-Bench Extra Hard][bbeh] | 0-shot | 7.2 | 11.0 | 16.3 | 19.3 | | [IFEval][ifeval] | 0-shot | 80.2 | 90.2 | 88.9 | 90.4 | | Benchmark | n-shot | Gemma 3 PT 1B | Gemma 3 PT 4B | Gemma 3 PT 12B | 
Gemma 3 PT 27B | | ------------------------------ |----------|:--------------:|:-------------:|:--------------:|:--------------:| | [HellaSwag][hellaswag] | 10-shot | 62.3 | 77.2 | 84.2 | 85.6 | | [BoolQ][boolq] | 0-shot | 63.2 | 72.3 | 78.8 | 82.4 | | [PIQA][piqa] | 0-shot | 73.8 | 79.6 | 81.8 | 83.3 | | [SocialIQA][socialiqa] | 0-shot | 48.9 | 51.9 | 53.4 | 54.9 | | [TriviaQA][triviaqa] | 5-shot | 39.8 | 65.8 | 78.2 | 85.5 | | [Natural Questions][naturalq] | 5-shot | 9.48 | 20.0 | 31.4 | 36.1 | | [ARC-c][arc] | 25-shot | 38.4 | 56.2 | 68.9 | 70.6 | | [ARC-e][arc] | 0-shot | 73.0 | 82.4 | 88.3 | 89.0 | | [WinoGrande][winogrande] | 5-shot | 58.2 | 64.7 | 74.3 | 78.8 | | [BIG-Bench Hard][bbh] | few-shot | 28.4 | 50.9 | 72.6 | 77.7 | | [DROP][drop] | 1-shot | 42.4 | 60.1 | 72.2 | 77.2 | [gpqa]: https://arxiv.org/abs/2311.12022 [simpleqa]: https://arxiv.org/abs/2411.04368 [facts-grdg]: https://goo.gle/FACTSpaper [bbeh]: https://github.com/google-deepmind/bbeh [ifeval]: https://arxiv.org/abs/2311.07911 [hellaswag]: https://arxiv.org/abs/1905.07830 [boolq]: https://arxiv.org/abs/1905.10044 [piqa]: https://arxiv.org/abs/1911.11641 [socialiqa]: https://arxiv.org/abs/1904.09728 [triviaqa]: https://arxiv.org/abs/1705.03551 [naturalq]: https://github.com/google-research-datasets/natural-questions [arc]: https://arxiv.org/abs/1911.01547 [winogrande]: https://arxiv.org/abs/1907.10641 [bbh]: https://paperswithcode.com/dataset/bbh [drop]: https://arxiv.org/abs/1903.00161 | Benchmark | n-shot | Gemma 3 IT 1B | Gemma 3 IT 4B | Gemma 3 IT 12B | Gemma 3 IT 27B | |----------------------------|--------|:-------------:|:-------------:|:--------------:|:--------------:| | [MMLU][mmlu] (Pro) | 0-shot | 14.7 | 43.6 | 60.6 | 67.5 | | [LiveCodeBench][lcb] | 0-shot | 1.9 | 12.6 | 24.6 | 29.7 | | [Bird-SQL][bird-sql] (dev) | - | 6.4 | 36.3 | 47.9 | 54.4 | | [Math][math] | 0-shot | 48.0 | 75.6 | 83.8 | 89.0 | | HiddenMath | 0-shot | 15.8 | 43.0 | 54.5 | 60.3 | | [MBPP][mbpp] | 3-shot | 35.2 | 
63.2 | 73.0 | 74.4 | | [HumanEval][humaneval] | 0-shot | 41.5 | 71.3 | 85.4 | 87.8 | | [Natural2Code][nat2code] | 0-shot | 56.0 | 70.3 | 80.7 | 84.5 | | [GSM8K][gsm8k] | 0-shot | 62.8 | 89.2 | 94.4 | 95.9 | | Benchmark | n-shot | Gemma 3 PT 4B | Gemma 3 PT 12B | Gemma 3 PT 27B | | ------------------------------ |----------------|:-------------:|:--------------:|:--------------:| | [MMLU][mmlu] | 5-shot | 59.6 | 74.5 | 78.6 | | [MMLU][mmlu] (Pro COT) | 5-shot | 29.2 | 45.3 | 52.2 | | [AGIEval][agieval] | 3-5-shot | 42.1 | 57.4 | 66.2 | | [MATH][math] | 4-shot | 24.2 | 43.3 | 50.0 | | [GSM8K][gsm8k] | 8-shot | 38.4 | 71.0 | 82.6 | | [GPQA][gpqa] | 5-shot | 15.0 | 25.4 | 24.3 | | [MBPP][mbpp] | 3-shot | 46.0 | 60.4 | 65.6 | | [HumanEval][humaneval] | 0-shot | 36.0 | 45.7 | 48.8 | [mmlu]: https://arxiv.org/abs/2009.03300 [agieval]: https://arxiv.org/abs/2304.06364 [math]: https://arxiv.org/abs/2103.03874 [gsm8k]: https://arxiv.org/abs/2110.14168 [gpqa]: https://arxiv.org/abs/2311.12022 [mbpp]: https://arxiv.org/abs/2108.07732 [humaneval]: https://arxiv.org/abs/2107.03374 [lcb]: https://arxiv.org/abs/2403.07974 [bird-sql]: https://arxiv.org/abs/2305.03111 [nat2code]: https://arxiv.org/abs/2405.04520 | Benchmark | n-shot | Gemma 3 IT 1B | Gemma 3 IT 4B | Gemma 3 IT 12B | Gemma 3 IT 27B | |--------------------------------------|--------|:-------------:|:-------------:|:--------------:|:--------------:| | [Global-MMLU-Lite][global-mmlu-lite] | 0-shot | 34.2 | 54.5 | 69.5 | 75.1 | | [ECLeKTic][eclektic] | 0-shot | 1.4 | 4.6 | 10.3 | 16.7 | | [WMT24++][wmt24pp] | 0-shot | 35.9 | 46.8 | 51.6 | 53.4 | | Benchmark | Gemma 3 PT 1B | Gemma 3 PT 4B | Gemma 3 PT 12B | Gemma 3 PT 27B | | ------------------------------------ |:-------------:|:-------------:|:--------------:|:--------------:| | [MGSM][mgsm] | 2.04 | 34.7 | 64.3 | 74.3 | | [Global-MMLU-Lite][global-mmlu-lite] | 24.9 | 57.0 | 69.4 | 75.7 | | [WMT24++][wmt24pp] (ChrF) | 36.7 | 48.4 | 53.9 | 55.7 | | [FloRes][flores] | 
29.5 | 39.2 | 46.0 | 48.8 | | [XQuAD][xquad] (all) | 43.9 | 68.0 | 74.5 | 76.8 | | [ECLeKTic][eclektic] | 4.69 | 11.0 | 17.2 | 24.4 | | [IndicGenBench][indicgenbench] | 41.4 | 57.2 | 61.7 | 63.4 | [mgsm]: https://arxiv.org/abs/2210.03057 [flores]: https://arxiv.org/abs/2106.03193 [xquad]: https://arxiv.org/abs/1910.11856v3 [global-mmlu-lite]: https://huggingface.co/datasets/CohereForAI/Global-MMLU-Lite [wmt24pp]: https://arxiv.org/abs/2502.12404v1 [eclektic]: https://arxiv.org/abs/2502.21228 [indicgenbench]: https://arxiv.org/abs/2404.16816 | Benchmark | Gemma 3 IT 4B | Gemma 3 IT 12B | Gemma 3 IT 27B | |-----------------------------------|:-------------:|:--------------:|:--------------:| | [MMMU][mmmu] (val) | 48.8 | 59.6 | 64.9 | | [DocVQA][docvqa] | 75.8 | 87.1 | 86.6 | | [InfoVQA][info-vqa] | 50.0 | 64.9 | 70.6 | | [TextVQA][textvqa] | 57.8 | 67.7 | 65.1 | | [AI2D][ai2d] | 74.8 | 84.2 | 84.5 | | [ChartQA][chartqa] | 68.8 | 75.7 | 78.0 | | [VQAv2][vqav2] (val) | 62.4 | 71.6 | 71.0 | | [MathVista][mathvista] (testmini) | 50.0 | 62.9 | 67.6 | | Benchmark | Gemma 3 PT 4B | Gemma 3 PT 12B | Gemma 3 PT 27B | | ------------------------------ |:-------------:|:--------------:|:--------------:| | [COCOcap][coco-cap] | 102 | 111 | 116 | | [DocVQA][docvqa] (val) | 72.8 | 82.3 | 85.6 | | [InfoVQA][info-vqa] (val) | 44.1 | 54.8 | 59.4 | | [MMMU][mmmu] (pt) | 39.2 | 50.3 | 56.1 | | [TextVQA][textvqa] (val) | 58.9 | 66.5 | 68.6 | | [RealWorldQA][realworldqa] | 45.5 | 52.2 | 53.9 | | [ReMI][remi] | 27.3 | 38.5 | 44.8 | | [AI2D][ai2d] | 63.2 | 75.2 | 79.0 | | [ChartQA][chartqa] | 63.6 | 74.7 | 76.3 | | [VQAv2][vqav2] | 63.9 | 71.2 | 72.9 | | [BLINK][blinkvqa] | 38.0 | 35.9 | 39.6 | | [OKVQA][okvqa] | 51.0 | 58.7 | 60.2 | | [TallyQA][tallyqa] | 42.5 | 51.8 | 54.3 | | [SpatialSense VQA][ss-vqa] | 50.9 | 60.0 | 59.4 | | [CountBenchQA][countbenchqa] | 26.1 | 17.8 | 68.0 | [coco-cap]: https://cocodataset.org/#home [docvqa]: https://www.docvqa.org/ [info-vqa]: 
https://arxiv.org/abs/2104.12756 [mmmu]: https://arxiv.org/abs/2311.16502 [textvqa]: https://textvqa.org/ [realworldqa]: https://paperswithcode.com/dataset/realworldqa [remi]: https://arxiv.org/html/2406.09175v1 [ai2d]: https://allenai.org/data/diagrams [chartqa]: https://arxiv.org/abs/2203.10244 [vqav2]: https://visualqa.org/index.html [blinkvqa]: https://arxiv.org/abs/2404.12390 [okvqa]: https://okvqa.allenai.org/ [tallyqa]: https://arxiv.org/abs/1810.12440 [ss-vqa]: https://arxiv.org/abs/1908.02660 [countbenchqa]: https://github.com/google-research/bigvision/blob/main/bigvision/datasets/countbenchqa/ [mathvista]: https://arxiv.org/abs/2310.02255 Our evaluation methods include structured evaluations and internal red-teaming testing of relevant content policies. Red-teaming was conducted by a number of different teams, each with different goals and human evaluation metrics. These models were evaluated against a number of different categories relevant to ethics and safety, including: - Child Safety: Evaluation of text-to-text and image to text prompts covering child safety policies, including child sexual abuse and exploitation. - Content Safety: Evaluation of text-to-text and image to text prompts covering safety policies including, harassment, violence and gore, and hate speech. - Representational Harms: Evaluation of text-to-text and image to text prompts covering safety policies including bias, stereotyping, and harmful associations or inaccuracies. In addition to development level evaluations, we conduct "assurance evaluations" which are our 'arms-length' internal evaluations for responsibility governance decision making. They are conducted separately from the model development team, to inform decision making about release. High level findings are fed back to the model team, but prompt sets are held-out to prevent overfitting and preserve the results' ability to inform decision making. 
Assurance evaluation results are reported to our Responsibility & Safety Council as part of release review.

For all areas of safety testing, we saw major improvements in the categories of child safety, content safety, and representational harms relative to previous Gemma models. All testing was conducted without safety filters to evaluate the model capabilities and behaviors. For both text-to-text and image-to-text, and across all model sizes, the model produced minimal policy violations, and showed significant improvements over previous Gemma models' performance with respect to ungrounded inferences. A limitation of our evaluations was that they included only English-language prompts.

These models have certain limitations that users should be aware of.

Open vision-language models (VLMs) have a wide range of applications across various industries and domains. The following list of potential uses is not comprehensive. The purpose of this list is to provide contextual information about the possible use-cases that the model creators considered as part of model training and development.

- Content Creation and Communication
  - Text Generation: These models can be used to generate creative text formats such as poems, scripts, code, marketing copy, and email drafts.
  - Chatbots and Conversational AI: Power conversational interfaces for customer service, virtual assistants, or interactive applications.
  - Text Summarization: Generate concise summaries of a text corpus, research papers, or reports.
  - Image Data Extraction: These models can be used to extract, interpret, and summarize visual data for text communications.
- Research and Education
  - Natural Language Processing (NLP) and VLM Research: These models can serve as a foundation for researchers to experiment with VLM and NLP techniques, develop algorithms, and contribute to the advancement of the field.
  - Language Learning Tools: Support interactive language learning experiences, aiding in grammar correction or providing writing practice.
  - Knowledge Exploration: Assist researchers in exploring large bodies of text by generating summaries or answering questions about specific topics.

- Training Data
  - The quality and diversity of the training data significantly influence the model's capabilities. Biases or gaps in the training data can lead to limitations in the model's responses.
  - The scope of the training dataset determines the subject areas the model can handle effectively.
- Context and Task Complexity
  - Models are better at tasks that can be framed with clear prompts and instructions. Open-ended or highly complex tasks might be challenging.
  - A model's performance can be influenced by the amount of context provided (longer context generally leads to better outputs, up to a certain point).
- Language Ambiguity and Nuance
  - Natural language is inherently complex. Models might struggle to grasp subtle nuances, sarcasm, or figurative language.
- Factual Accuracy
  - Models generate responses based on information they learned from their training datasets, but they are not knowledge bases. They may generate incorrect or outdated factual statements.
- Common Sense
  - Models rely on statistical patterns in language. They might lack the ability to apply common sense reasoning in certain situations.

The development of vision-language models (VLMs) raises several ethical concerns. In creating an open model, we have carefully considered the following:

- Bias and Fairness
  - VLMs trained on large-scale, real-world text and image data can reflect socio-cultural biases embedded in the training material. These models underwent careful scrutiny, with the input data pre-processing described and posterior evaluations reported in this card.
- Misinformation and Misuse
  - VLMs can be misused to generate text that is false, misleading, or harmful.
  - Guidelines are provided for responsible use with the model, see the [Responsible Generative AI Toolkit][rai-toolkit].
- Transparency and Accountability:
  - This model card summarizes details on the models' architecture, capabilities, limitations, and evaluation processes.
  - A responsibly developed open model offers the opportunity to share innovation by making VLM technology accessible to developers and researchers across the AI ecosystem.

- Perpetuation of biases: It's encouraged to perform continuous monitoring (using evaluation metrics, human review) and the exploration of de-biasing techniques during model training, fine-tuning, and other use cases.
- Generation of harmful content: Mechanisms and guidelines for content safety are essential. Developers are encouraged to exercise caution and implement appropriate content safety safeguards based on their specific product policies and application use cases.
- Misuse for malicious purposes: Technical limitations and developer and end-user education can help mitigate against malicious applications of VLMs. Educational resources and reporting mechanisms for users to flag misuse are provided. Prohibited uses of Gemma models are outlined in the [Gemma Prohibited Use Policy][prohibited-use].
- Privacy violations: Models were trained on data filtered for removal of certain personal information and other sensitive data. Developers are encouraged to adhere to privacy regulations with privacy-preserving techniques.

At the time of release, this family of models provides high-performance open vision-language model implementations designed from the ground up for responsible AI development compared to similarly sized models.

Using the benchmark evaluation metrics described in this document, these models have been shown to provide superior performance to other, comparably sized open model alternatives.
[g3-tech-report]: https://arxiv.org/abs/2503.19786
[rai-toolkit]: https://ai.google.dev/responsible
[kaggle-gemma]: https://www.kaggle.com/models/google/gemma-3
[vertex-mg-gemma3]: https://console.cloud.google.com/vertex-ai/publishers/google/model-garden/gemma3
[terms]: https://ai.google.dev/gemma/terms
[safety-policies]: https://ai.google/static/documents/ai-responsibility-update-published-february-2025.pdf
[prohibited-use]: https://ai.google.dev/gemma/prohibited_use_policy
[tpu]: https://cloud.google.com/tpu/docs/intro-to-tpu
[sustainability]: https://sustainability.google/operating-sustainably/
[jax]: https://github.com/jax-ml/jax
[ml-pathways]: https://blog.google/technology/ai/introducing-pathways-next-generation-ai-architecture/
[gemini-2-paper]: https://arxiv.org/abs/2312.11805
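
The input and output limits stated in this card (256 tokens per image, a shared context of 128K or 32K tokens depending on model size, and output capped by whatever the input leaves over) can be turned into a small budget calculation. The sketch below is illustrative only: it assumes "128K" and "32K" mean 131,072 and 32,768 tokens, and the function name `remaining_output_budget` is hypothetical rather than part of any Gemma tooling.

```python
# Assumed exact context sizes; the card only states "128K" and "32K".
CONTEXT_TOKENS = {
    "270m": 32_768,
    "1b": 32_768,
    "4b": 131_072,
    "12b": 131_072,
    "27b": 131_072,
}

# Per the card, image input applies to the 4B, 12B, and 27B sizes.
VISION_SIZES = {"4b", "12b", "27b"}

# Each image is normalized to 896 x 896 and encoded to 256 tokens.
TOKENS_PER_IMAGE = 256


def remaining_output_budget(size: str, text_tokens: int, num_images: int = 0) -> int:
    """Return how many tokens are left for output after the request input."""
    if size not in CONTEXT_TOKENS:
        raise ValueError(f"unknown model size: {size}")
    if num_images and size not in VISION_SIZES:
        raise ValueError(f"the {size} size does not take image input")
    used = text_tokens + num_images * TOKENS_PER_IMAGE
    limit = CONTEXT_TOKENS[size]
    if used > limit:
        raise ValueError(f"input uses {used} tokens, over the {limit}-token context")
    return limit - used
```

For example, `remaining_output_budget("4b", 1000, num_images=2)` subtracts 1,000 text tokens and 512 image tokens from the 4B context, while the same call for `"270m"` raises an error because that size takes no image input.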

19,693
4

ERNIE-4.5-21B-A3B-Thinking-GGUF

Over the past three months, we have continued to scale the thinking capability of ERNIE-4.5-21B-A3B, improving both the quality and depth of reasoning and advancing the competitiveness of ERNIE lightweight models in complex reasoning tasks. We are pleased to introduce ERNIE-4.5-21B-A3B-Thinking, featuring the following key enhancements:

- Significantly improved performance on reasoning tasks, including logical reasoning, mathematics, science, coding, text generation, and academic benchmarks that typically require human expertise.
- Efficient tool usage capabilities.
- Enhanced 128K long-context understanding capabilities.

> [!NOTE]
> Note: This version has an increased thinking length. We strongly recommend its use in highly complex reasoning tasks.

ERNIE-4.5-21B-A3B-Thinking is a text MoE post-trained model, with 21B total parameters and 3B activated parameters for each token. The following are the model configuration details:

|Key|Value|
|-|-|
|Modality|Text|
|Training Stage|Posttraining|
|Params(Total / Activated)|21B / 3B|
|Layers|28|
|Heads(Q/KV)|20 / 4|
|Text Experts(Total / Activated)|64 / 6|
|Vision Experts(Total / Activated)|64 / 6|
|Shared Experts|2|
|Context Length|131072|

> [!NOTE]
> To align with the wider community, this model releases Transformer-style weights. Both PyTorch and PaddlePaddle ecosystem tools, such as vLLM, transformers, and FastDeploy, are expected to be able to load and run this model.

Quickly deploy services using FastDeploy as shown below. For more detailed usage, refer to the FastDeploy GitHub Repository.

Note: 80GB x 1 GPU resources are required. Deploying this model requires FastDeploy version 2.2.

The ERNIE-4.5-21B-A3B-Thinking model supports function calling.

The `reasoning-parser` and `tool-call-parser` for vLLM Ernie are currently under development.

Note: You'll need the `transformers` library (version 4.54.0 or newer) installed to use this model.

The following contains a code snippet illustrating how to use the model to generate content based on given inputs.

The ERNIE 4.5 models are provided under the Apache License 2.0. This license permits commercial use, subject to its terms and conditions. Copyright (c) 2025 Baidu, Inc. All Rights Reserved.

If you find ERNIE 4.5 useful or wish to use it in your projects, please kindly cite our technical report:

license:apache-2.0
19,379
37

DeepSeek-R1-Distill-Qwen-1.5B-GGUF

license:apache-2.0
19,283
123

Phi-4-mini-instruct-unsloth-bnb-4bit

license:mit
18,823
18

Qwen2.5-0.5B-Instruct

Finetune Llama 3.1, Gemma 2, Mistral 2-5x faster with 70% less memory via Unsloth!

We have a free Google Colab Tesla T4 notebook for Qwen 2.5 (all model sizes), as well as a Qwen 2.5 conversational-style notebook.

All notebooks are beginner friendly! Add your dataset, click "Run All", and you'll get a 2x faster finetuned model which can be exported to GGUF, vLLM or uploaded to Hugging Face.

| Unsloth supports | Free Notebooks | Performance | Memory use |
|---|---|---|---|
| Llama-3.1 8b | ▶️ Start on Colab | 2.4x faster | 58% less |
| Phi-3.5 (mini) | ▶️ Start on Colab | 2x faster | 50% less |
| Gemma-2 9b | ▶️ Start on Colab | 2.4x faster | 58% less |
| Mistral 7b | ▶️ Start on Colab | 2.2x faster | 62% less |
| TinyLlama | ▶️ Start on Colab | 3.9x faster | 74% less |
| DPO - Zephyr | ▶️ Start on Colab | 1.9x faster | 19% less |

- This conversational notebook is useful for ShareGPT ChatML / Vicuna templates.
- This text completion notebook is for raw text. This DPO notebook replicates Zephyr.
- \* Kaggle has 2x T4s, but we use 1. Due to overhead, 1x T4 is 5x faster.

Qwen2.5 is the latest series of Qwen large language models. For Qwen2.5, we release a number of base language models and instruction-tuned language models ranging from 0.5 to 72 billion parameters. Qwen2.5 brings the following improvements upon Qwen2:

- Significantly more knowledge and greatly improved capabilities in coding and mathematics, thanks to our specialized expert models in these domains.
- Significant improvements in instruction following, generating long texts (over 8K tokens), understanding structured data (e.g., tables), and generating structured outputs, especially JSON. More resilient to the diversity of system prompts, enhancing role-play implementation and condition-setting for chatbots.
- Long-context support for up to 128K tokens, with generation of up to 8K tokens.
- Multilingual support for over 29 languages, including Chinese, English, French, Spanish, Portuguese, German, Italian, Russian, Japanese, Korean, Vietnamese, Thai, Arabic, and more.

This repo contains the instruction-tuned 0.5B Qwen2.5 model, which has the following features:

- Type: Causal Language Models
- Training Stage: Pretraining & Post-training
- Architecture: transformers with RoPE, SwiGLU, RMSNorm, Attention QKV bias and tied word embeddings
- Number of Parameters: 0.49B
- Number of Parameters (Non-Embedding): 0.36B
- Number of Layers: 24
- Number of Attention Heads (GQA): 14 for Q and 2 for KV
- Context Length: Full 32,768 tokens and generation 8192 tokens

For more details, please refer to our blog, GitHub, and Documentation.

The code for Qwen2.5 is included in the latest Hugging Face `transformers`, and we advise you to use the latest version of `transformers`. With `transformers<4.37.0`, you will encounter an error when loading the model.

Loading the tokenizer and model and generating content follows the usual `transformers` pattern built around `apply_chat_template`.

Detailed evaluation results are reported in this 📑 blog. For requirements on GPU memory and the respective throughput, see results here.

If you find our work helpful, feel free to cite us.
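To make the attention figures above concrete, here is a rough sketch of how grouped-query attention (GQA) shrinks the key/value cache. The layer count, head counts, and context length are taken from the feature list; the head dimension of 64 and the 2-byte (FP16/BF16) cache entries are assumptions that are not stated on this page, and `kv_cache_bytes` is a hypothetical helper written for illustration only.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Approximate KV-cache size: one key and one value vector
    per layer, per KV head, per token."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


NUM_LAYERS = 24      # from the model card
NUM_Q_HEADS = 14     # from the model card
NUM_KV_HEADS = 2     # from the model card
HEAD_DIM = 64        # assumption: not listed on this page
CONTEXT = 32_768     # full context length from the model card

# Cache size with the 2 shared KV heads that GQA uses.
gqa = kv_cache_bytes(NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM, CONTEXT)
# Hypothetical cache size if every query head had its own KV head.
mha = kv_cache_bytes(NUM_LAYERS, NUM_Q_HEADS, HEAD_DIM, CONTEXT)

print(f"Query heads sharing each KV head: {NUM_Q_HEADS // NUM_KV_HEADS}")
print(f"GQA cache at full context: {gqa / 2**20:.0f} MiB")
print(f"Per-head KV cache at full context: {mha / 2**20:.0f} MiB")
```

Under these assumptions the cache scales with the number of KV heads, so sharing 2 KV heads across 14 query heads cuts the cache by a factor of 7.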

license:apache-2.0
18,567
11

Qwen2.5-Coder-7B-Instruct-bnb-4bit

license:apache-2.0
18,091
7

Llama-3.3-70B-Instruct-bnb-4bit

llama
17,878
51

Qwen3-VL-8B-Instruct-bnb-4bit

Unsloth Dynamic 2.0 achieves superior accuracy & outperforms other leading quants.

Meet Qwen3-VL, the most powerful vision-language model in the Qwen series to date.

This generation delivers comprehensive upgrades across the board: superior text understanding & generation, deeper visual perception & reasoning, extended context length, enhanced spatial and video dynamics comprehension, and stronger agent interaction capabilities.

Available in Dense and MoE architectures that scale from edge to cloud, with Instruct and reasoning-enhanced Thinking editions for flexible, on-demand deployment.

- Visual Agent: Operates PC/mobile GUIs by recognizing elements, understanding functions, invoking tools, and completing tasks.
- Visual Coding Boost: Generates Draw.io/HTML/CSS/JS from images/videos.
- Advanced Spatial Perception: Judges object positions, viewpoints, and occlusions; provides stronger 2D grounding and enables 3D grounding for spatial reasoning and embodied AI.
- Long Context & Video Understanding: Native 256K context, expandable to 1M; handles books and hours-long video with full recall and second-level indexing.
- Enhanced Multimodal Reasoning: Excels in STEM/Math, with causal analysis and logical, evidence-based answers.
- Upgraded Visual Recognition: Broader, higher-quality pretraining enables the model to “recognize everything”, including celebrities, anime, products, landmarks, flora/fauna, etc.
- Expanded OCR: Supports 32 languages (up from 19); robust in low light, blur, and tilt; better with rare/ancient characters and jargon; improved long-document structure parsing.
- Text Understanding on par with pure LLMs: Seamless text–vision fusion for lossless, unified comprehension.

1. Interleaved-MRoPE: Full-frequency allocation over time, width, and height via robust positional embeddings, enhancing long-horizon video reasoning.
2. DeepStack: Fuses multi-level ViT features to capture fine-grained details and sharpen image–text alignment.
3. Text–Timestamp Alignment: Moves beyond T-RoPE to precise, timestamp-grounded event localization for stronger video temporal modeling.

This is the weight repository for Qwen3-VL-8B-Instruct.

Qwen3-VL can be used with 🤖 ModelScope and 🤗 Transformers. The code for Qwen3-VL is included in the latest Hugging Face `transformers`, and we advise you to build `transformers` from source before using the chat model.

If you find our work helpful, feel free to cite us.
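As a minimal sketch of what a single-turn request to a vision-language chat model can look like, the helper below builds a message list that mixes one image and one text prompt. The content-list layout (`type`/`image`/`text` keys) follows the convention commonly used with vision chat templates in `transformers`; it is an assumption here, since this page does not show the original snippet. The URL is a placeholder and `build_vision_message` is a hypothetical name.

```python
def build_vision_message(image_ref, question):
    """Return a single-turn chat message list with one image and one text part."""
    return [
        {
            "role": "user",
            "content": [
                # Image part: a URL or local path (placeholder below).
                {"type": "image", "image": image_ref},
                # Text part: the instruction that refers to the image.
                {"type": "text", "text": question},
            ],
        }
    ]


messages = build_vision_message(
    "https://example.com/demo.jpeg",  # placeholder, not a real asset
    "Describe this image.",
)
print(messages)
```

A list like this would typically be handed to the processor's chat template before generation; check the model card of the specific checkpoint for the exact calls.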

license:apache-2.0
17,254
2

LFM2-8B-A1B-GGUF

> [!NOTE]
> Includes Unsloth chat template fixes! For `llama.cpp`, use `--jinja`.
> Unsloth Dynamic 2.0 achieves superior accuracy & outperforms other leading quants.

LFM2 is a new generation of hybrid models developed by Liquid AI, specifically designed for edge AI and on-device deployment. It sets a new standard in terms of quality, speed, and memory efficiency.

We're releasing the weights of our first MoE based on LFM2, with 8.3B total parameters and 1.5B active parameters.

- LFM2-8B-A1B is the best on-device MoE in terms of both quality (comparable to 3-4B dense models) and speed (faster than Qwen3-1.7B).
- Code and knowledge capabilities are significantly improved compared to LFM2-2.6B.
- Quantized variants fit comfortably on high-end phones, tablets, and laptops.

Find more information about LFM2-8B-A1B in our blog post.

Due to their small size, we recommend fine-tuning LFM2 models on narrow use cases to maximize performance. They are particularly suited for agentic tasks, data extraction, RAG, creative writing, and multi-turn conversations. However, we do not recommend using them for tasks that are knowledge-intensive or require programming skills.

| Property | LFM2-8B-A1B |
| --- | --- |
| Total parameters | 8.3B |
| Active parameters | 1.5B |
| Layers | 24 (18 conv + 6 attn) |
| Context length | 32,768 tokens |
| Vocabulary size | 65,536 |
| Training precision | Mixed BF16/FP8 |
| Training budget | 12 trillion tokens |
| License | LFM Open License v1.0 |

Supported languages: English, Arabic, Chinese, French, German, Japanese, Korean, and Spanish.

Generation parameters: We recommend `temperature=0.3`, `min_p=0.15`, and `repetition_penalty=1.05`.

Chat template: LFM2 uses a ChatML-like chat template. You can apply it automatically using the dedicated `.apply_chat_template()` function from Hugging Face transformers.

Tool use: It consists of four main steps:

1. Function definition: LFM2 takes JSON function definitions as input (JSON objects between a pair of special tokens), usually in the system prompt.
2. Function call: LFM2 writes Pythonic function calls (a Python list between a pair of special tokens) as the assistant answer.
3. Function execution: The function call is executed and the result is returned (a string between a pair of special tokens) as a "tool" role.
4. Final answer: LFM2 interprets the outcome of the function call to address the original user prompt in plain text.

Architecture: Hybrid model with multiplicative gates and short convolutions: 18 double-gated short-range LIV convolution blocks and 6 grouped query attention (GQA) blocks.

Pre-training mixture: Approximately 75% English, 20% multilingual, and 5% code data sourced from the web and licensed materials.

Training approach:

- Very large-scale SFT on 50% downstream tasks, 50% general domains
- Custom DPO with length normalization and semi-online datasets
- Iterative model merging

To run LFM2, you need to install Hugging Face `transformers` from source. You can then generate answers with `transformers` in Python, or directly run and test the model with this Colab notebook.

You can run the model in `vLLM` by building it from source.

You can run LFM2 with llama.cpp using its GGUF checkpoint. Find more information in the model card.

We recommend fine-tuning LFM2 models on your use cases to maximize performance.

| Notebook | Description |
| --- | --- |
| SFT (TRL) | Supervised Fine-Tuning (SFT) notebook with a LoRA adapter using TRL. |
| DPO (TRL) | Preference alignment with Direct Preference Optimization (DPO) using TRL. |

Compared to similar-sized models, LFM2-8B-A1B displays strong performance in instruction following and math while also running significantly faster.
| Model | MMLU | MMLU-Pro | GPQA | IFEval | IFBench | Multi-IF |
|---|---|---|---|---|---|---|
| LFM2-8B-A1B | 64.84 | 37.42 | 29.29 | 77.58 | 25.85 | 58.19 |
| LFM2-2.6B | 64.42 | 25.96 | 26.57 | 79.56 | 22.19 | 60.26 |
| Llama-3.2-3B-Instruct | 60.35 | 22.25 | 30.6 | 71.43 | 20.78 | 50.91 |
| SmolLM3-3B | 59.84 | 23.90 | 26.31 | 72.44 | 17.93 | 58.86 |
| gemma-3-4b-it | 58.35 | 34.76 | 29.51 | 76.85 | 23.53 | 66.61 |
| Qwen3-4B-Instruct-2507 | 72.25 | 52.31 | 34.85 | 85.62 | 30.28 | 75.54 |
| granite-4.0-h-tiny | 66.79 | 32.03 | 26.46 | 81.06 | 18.37 | 52.99 |

| Model | GSM8K | GSMPlus | MATH 500 | MATH Lvl 5 | MGSM | MMMLU |
|---|---|---|---|---|---|---|
| LFM2-8B-A1B | 84.38 | 64.76 | 74.2 | 62.38 | 72.4 | 55.26 |
| LFM2-2.6B | 82.41 | 60.75 | 63.6 | 54.38 | 74.32 | 55.39 |
| Llama-3.2-3B-Instruct | 75.21 | 38.68 | 41.2 | 24.06 | 61.68 | 47.92 |
| SmolLM3-3B | 81.12 | 58.91 | 73.6 | 51.93 | 68.72 | 50.02 |
| gemma-3-4b-it | 89.92 | 68.38 | 73.2 | 52.18 | 87.28 | 50.14 |
| Qwen3-4B-Instruct-2507 | 68.46 | 56.16 | 85.6 | 73.62 | 81.76 | 60.67 |
| granite-4.0-h-tiny | 82.64 | 59.14 | 58.2 | 36.11 | 73.68 | 56.13 |

| Model | Active params | LCB v6 | LCB v5 | HumanEval+ | Creative Writing v3 |
|---|---|---|---|---|---|
| LFM2-8B-A1B | 1.5B | 21.04% | 21.36% | 69.51% | 44.22% |
| Gemma-3-1b-it | 1B | 4.27% | 4.43% | 37.20% | 41.67% |
| Granite-4.0-h-tiny | 1B | 26.73% | 27.27% | 73.78% | 32.60% |
| Llama-3.2-1B-Instruct | 1.2B | 4.08% | 3.64% | 23.17% | 31.43% |
| Qwen2.5-1.5B-Instruct | 1.5B | 11.18% | 10.57% | 48.78% | 22.18% |
| Qwen3-1.7B (/no_think) | 1.7B | 24.07% | 26.48% | 60.98% | 31.56% |
| LFM2-2.6B | 2.6B | 14.41% | 14.43% | 57.93% | 38.79% |
| SmolLM3-3B | 3.1B | 19.05% | 19.20% | 60.37% | 36.44% |
| Llama-3.2-3B-Instruct | 3.2B | 11.47% | 11.48% | 24.06% | 38.84% |
| Qwen3-4B (/no_think) | 4B | 36.11% | 38.64% | 71.95% | 37.49% |
| Qwen3-4B-Instruct-2507 | 4B | 48.72% | 50.80% | 82.32% | 51.71% |
| Gemma-3-4b-it | 4.3B | 18.86% | 19.09% | 62.8% | 68.56% |

LFM2-8B-A1B is significantly faster than models with a similar number of active parameters, like Qwen3-1.7B. Throughput was measured under int4 quantization with int8 dynamic activations on the AMD Ryzen AI 9 HX 370 CPU, using 16 threads. The results were obtained using our internal XNNPACK-based inference stack and a custom CPU MoE kernel.

If you are interested in custom solutions with edge deployment, please contact our sales team.
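Step 2 of the tool-use flow described in this card has the model emit a Python list of function calls. The sketch below shows one way such a string could be parsed safely with the standard-library `ast` module, without evaluating arbitrary code. The function name `get_candidate_status` and its argument are invented for illustration, `parse_pythonic_calls` is a hypothetical helper, and the special tokens that wrap the list are assumed to have been stripped by the caller, since they are not reproduced on this page.

```python
import ast


def parse_pythonic_calls(text):
    """Parse a string such as '[f(a=1), g(b="x")]' into (name, kwargs) pairs."""
    tree = ast.parse(text.strip(), mode="eval")
    if not isinstance(tree.body, ast.List):
        raise ValueError("expected a Python list of function calls")

    calls = []
    for node in tree.body.elts:
        if not isinstance(node, ast.Call) or not isinstance(node.func, ast.Name):
            raise ValueError("each list element must be a simple function call")
        if node.args or any(kw.arg is None for kw in node.keywords):
            raise ValueError("this sketch only supports plain keyword arguments")
        # literal_eval accepts only literals, so no code is executed here.
        kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
        calls.append((node.func.id, kwargs))
    return calls


# Hypothetical model output for step 2, with the wrapping tokens removed.
raw = '[get_candidate_status(candidate_id="12345")]'
calls = parse_pythonic_calls(raw)
print(calls)
```

Each `(name, kwargs)` pair can then be dispatched to a real function in step 3, and the returned string passed back to the model under the "tool" role.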

16,889
37

Qwen3-VL-8B-Instruct-1M-GGUF

license:apache-2.0
16,833
1

Qwen3-VL-4B-Thinking-GGUF

license:apache-2.0
16,299
10

Qwen3-VL-4B-Instruct-1M-GGUF

> [!NOTE]
> Includes Unsloth chat template fixes! For `llama.cpp`, use `--jinja`.
> Unsloth Dynamic 2.0 achieves superior accuracy & outperforms other leading quants.

Meet Qwen3-VL, the most powerful vision-language model in the Qwen series to date.

This generation delivers comprehensive upgrades across the board: superior text understanding & generation, deeper visual perception & reasoning, extended context length, enhanced spatial and video dynamics comprehension, and stronger agent interaction capabilities.

Available in Dense and MoE architectures that scale from edge to cloud, with Instruct and reasoning-enhanced Thinking editions for flexible, on-demand deployment.

- Visual Agent: Operates PC/mobile GUIs by recognizing elements, understanding functions, invoking tools, and completing tasks.
- Visual Coding Boost: Generates Draw.io/HTML/CSS/JS from images/videos.
- Advanced Spatial Perception: Judges object positions, viewpoints, and occlusions; provides stronger 2D grounding and enables 3D grounding for spatial reasoning and embodied AI.
- Long Context & Video Understanding: Native 256K context, expandable to 1M; handles books and hours-long video with full recall and second-level indexing.
- Enhanced Multimodal Reasoning: Excels in STEM/Math, with causal analysis and logical, evidence-based answers.
- Upgraded Visual Recognition: Broader, higher-quality pretraining enables the model to “recognize everything”, including celebrities, anime, products, landmarks, flora/fauna, etc.
- Expanded OCR: Supports 32 languages (up from 19); robust in low light, blur, and tilt; better with rare/ancient characters and jargon; improved long-document structure parsing.
- Text Understanding on par with pure LLMs: Seamless text–vision fusion for lossless, unified comprehension.

1. Interleaved-MRoPE: Full-frequency allocation over time, width, and height via robust positional embeddings, enhancing long-horizon video reasoning.
2. DeepStack: Fuses multi-level ViT features to capture fine-grained details and sharpen image–text alignment.
3. Text–Timestamp Alignment: Moves beyond T-RoPE to precise, timestamp-grounded event localization for stronger video temporal modeling.

This is the weight repository for Qwen3-VL-4B-Instruct.

Qwen3-VL can be used with 🤖 ModelScope and 🤗 Transformers. The code for Qwen3-VL is included in the latest Hugging Face `transformers`, and we advise you to build `transformers` from source before using the chat model.

If you find our work helpful, feel free to cite us.

license:apache-2.0
16,128
1

Llama-3.2-1B

See our collection for all versions of Llama 3.2 including GGUF, 4-bit and original 16-bit formats.

Finetune Llama 3.2, Gemma 2, Mistral 2-5x faster with 70% less memory via Unsloth!

We have a free Google Colab Tesla T4 notebook for Llama 3.2 (1B) here: https://colab.research.google.com/drive/1T5-zKWM5OD21QHwXHiV9ixTRR7k3iB9?usp=sharing

For more details on the Llama-3.2-1B model, please go to Meta's original model card.

All notebooks are beginner friendly! Add your dataset, click "Run All", and you'll get a 2x faster finetuned model which can be exported to GGUF, vLLM or uploaded to Hugging Face.

| Unsloth supports | Free Notebooks | Performance | Memory use |
|---|---|---|---|
| Llama-3.2 (3B) | ▶️ Start on Colab | 2.4x faster | 58% less |
| Llama-3.2 (11B vision) | ▶️ Start on Colab | 2.4x faster | 58% less |
| Llama-3.1 (8B) | ▶️ Start on Colab | 2.4x faster | 58% less |
| Phi-3.5 (mini) | ▶️ Start on Colab | 2x faster | 50% less |
| Gemma 2 (9B) | ▶️ Start on Colab | 2.4x faster | 58% less |
| Mistral (7B) | ▶️ Start on Colab | 2.2x faster | 62% less |
| DPO - Zephyr | ▶️ Start on Colab | 1.9x faster | 19% less |

- This conversational notebook is useful for ShareGPT ChatML / Vicuna templates.
- This text completion notebook is for raw text. This DPO notebook replicates Zephyr.
- \* Kaggle has 2x T4s, but we use 1. Due to overhead, 1x T4 is 5x faster.

Special thanks: a huge thank you to the Meta and Llama team for creating and releasing these models.

The Meta Llama 3.2 collection of multilingual large language models (LLMs) is a collection of pretrained and instruction-tuned generative models in 1B and 3B sizes (text in/text out). The Llama 3.2 instruction-tuned text-only models are optimized for multilingual dialogue use cases, including agentic retrieval and summarization tasks. They outperform many of the available open source and closed chat models on common industry benchmarks.

Model Architecture: Llama 3.2 is an auto-regressive language model that uses an optimized transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align with human preferences for helpfulness and safety.

Supported languages: English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai are officially supported. Llama 3.2 has been trained on a broader collection of languages than these 8 supported languages. Developers may fine-tune Llama 3.2 models for languages beyond these supported languages, provided they comply with the Llama 3.2 Community License and the Acceptable Use Policy. Developers are always expected to ensure that their deployments, including those that involve additional languages, are completed safely and responsibly.

Llama 3.2 family of models: Token counts refer to pretraining data only. All model versions use Grouped-Query Attention (GQA) for improved inference scalability.

Status: This is a static model trained on an offline dataset. Future versions may be released that improve model capabilities and safety.

Where to send questions or comments about the model: Instructions on how to provide feedback or comments on the model can be found in the model README. For more technical information about generation parameters and recipes for how to use Llama 3.2 in applications, please go here.

llama
15,896
59

Qwen3-VL-235B-A22B-Thinking-GGUF

license:apache-2.0
15,710
28

Qwen3-4B-Thinking-2507-GGUF

See our collection for all versions of Qwen3 including GGUF, 4-bit & 16-bit formats.

Learn to run Qwen3-2507 correctly: read our Guide.

Unsloth Dynamic 2.0 achieves superior accuracy & outperforms other leading quants.

- Fine-tune Qwen3 (14B) for free using our Google Colab notebook here!
- Read our Blog about Qwen3 support: unsloth.ai/blog/qwen3
- View the rest of our notebooks in our docs here.
- Run & export your fine-tuned model to Ollama, llama.cpp or HF.

| Unsloth supports | Free Notebooks | Performance | Memory use |
|---|---|---|---|
| Qwen3 (14B) | ▶️ Start on Colab | 3x faster | 70% less |
| GRPO with Qwen3 (8B) | ▶️ Start on Colab | 3x faster | 80% less |
| Llama-3.2 (3B) | ▶️ Start on Colab | 2.4x faster | 58% less |
| Llama-3.2 (11B vision) | ▶️ Start on Colab | 2x faster | 60% less |
| Qwen2.5 (7B) | ▶️ Start on Colab | 2x faster | 60% less |

Over the past three months, we have continued to scale the thinking capability of Qwen3-4B, improving both the quality and depth of reasoning. We are pleased to introduce Qwen3-4B-Thinking-2507, featuring the following key enhancements:

- Significantly improved performance on reasoning tasks, including logical reasoning, mathematics, science, coding, and academic benchmarks that typically require human expertise.
- Markedly better general capabilities, such as instruction following, tool usage, text generation, and alignment with human preferences.
- Enhanced 256K long-context understanding capabilities.

NOTE: This version has an increased thinking length. We strongly recommend its use in highly complex reasoning tasks.

Qwen3-4B-Thinking-2507 has the following features:

- Type: Causal Language Models
- Training Stage: Pretraining & Post-training
- Number of Parameters: 4.0B
- Number of Parameters (Non-Embedding): 3.6B
- Number of Layers: 36
- Number of Attention Heads (GQA): 32 for Q and 8 for KV
- Context Length: 262,144 natively.

NOTE: This model supports only thinking mode. Specifying `enable_thinking=True` is no longer required. Additionally, to enforce model thinking, the default chat template automatically includes the opening thinking tag. Therefore, it is normal for the model's output to contain only the closing thinking tag without an explicit opening tag.

For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our blog, GitHub, and Documentation.

| | Qwen3-30B-A3B Thinking | Qwen3-4B Thinking | Qwen3-4B-Thinking-2507 |
|---|---|---|---|
| Knowledge | | | |
| MMLU-Pro | 78.5 | 70.4 | 74.0 |
| MMLU-Redux | 89.5 | 83.7 | 86.1 |
| GPQA | 65.8 | 55.9 | 65.8 |
| SuperGPQA | 51.8 | 42.7 | 47.8 |
| Reasoning | | | |
| AIME25 | 70.9 | 65.6 | 81.3 |
| HMMT25 | 49.8 | 42.1 | 55.5 |
| LiveBench 20241125 | 74.3 | 63.6 | 71.8 |
| Coding | | | |
| LiveCodeBench v6 (25.02-25.05) | 57.4 | 48.4 | 55.2 |
| CFEval | 1940 | 1671 | 1852 |
| OJBench | 20.7 | 16.1 | 17.9 |
| Alignment | | | |
| IFEval | 86.5 | 81.9 | 87.4 |
| Arena-Hard v2$ | 36.3 | 13.7 | 34.9 |
| Creative Writing v3 | 79.1 | 61.1 | 75.6 |
| WritingBench | 77.0 | 73.5 | 83.3 |
| Agent | | | |
| BFCL-v3 | 69.1 | 65.9 | 71.2 |
| TAU1-Retail | 61.7 | 33.9 | 66.1 |
| TAU1-Airline | 32.0 | 32.0 | 48.0 |
| TAU2-Retail | 34.2 | 38.6 | 53.5 |
| TAU2-Airline | 36.0 | 28.0 | 58.0 |
| TAU2-Telecom | 22.8 | 17.5 | 27.2 |
| Multilingualism | | | |
| MultiIF | 72.2 | 66.3 | 77.3 |
| MMLU-ProX | 73.1 | 61.0 | 64.2 |
| INCLUDE | 71.9 | 61.8 | 64.4 |
| PolyMATH | 46.1 | 40.0 | 46.2 |

$ For reproducibility, we report the win rates evaluated by GPT-4.1.
& For highly challenging tasks (including PolyMATH and all reasoning and coding tasks), we use an output length of 81,920 tokens. For all other tasks, we set the output length to 32,768.

The code for Qwen3 is included in the latest Hugging Face `transformers`, and we advise you to use the latest version of `transformers`. With `transformers<4.51.0`, you will encounter an error when loading the model.

After generation, the output token IDs are split into thinking content and final content at the last occurrence of token 151668:

```python
# Assumes `tokenizer` and `output_ids` (the generated token IDs, prompt excluded) already exist.
try:
    # reverse search for token 151668, which closes the thinking block
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content)  # no opening tag
print("content:", content)
```

To serve the model with SGLang:

```shell
python -m sglang.launch_server --model-path Qwen/Qwen3-4B-Thinking-2507 --context-length 262144 --reasoning-parser deepseek-r1
```

Or with vLLM:

```shell
vllm serve Qwen/Qwen3-4B-Thinking-2507 --max-model-len 262144 --enable-reasoning --reasoning-parser deepseek_r1
```

For agentic use with Qwen-Agent:

```python
from qwen_agent.agents import Assistant

# Define LLM
# Using OpenAI-compatible API endpoint. It is recommended to disable the reasoning and the
# tool call parsing functionality of the deployment frameworks and let Qwen-Agent automate
# the related operations. For example:
# `VLLM_USE_MODELSCOPE=true vllm serve Qwen/Qwen3-4B-Thinking-2507 --served-model-name Qwen3-4B-Thinking-2507 --max-model-len 262144`
llm_cfg = {
    'model': 'Qwen3-4B-Thinking-2507',

    # Use a custom endpoint compatible with OpenAI API:
    'model_server': 'http://localhost:8000/v1',  # api_base without reasoning and tool call parsing
    'api_key': 'EMPTY',
    'generate_cfg': {
        'thought_in_content': True,
    },
}

# Define Tools
tools = [
    {'mcpServers': {  # You can specify the MCP configuration file
            'time': {
                'command': 'uvx',
                'args': ['mcp-server-time', '--local-timezone=Asia/Shanghai']
            },
            "fetch": {
                "command": "uvx",
                "args": ["mcp-server-fetch"]
            }
        }
    },
    'code_interpreter',  # Built-in tools
]

# Define Agent
bot = Assistant(llm=llm_cfg, function_list=tools)

# Streaming generation
messages = [{'role': 'user', 'content': 'https://qwenlm.github.io/blog/ Introduce the latest developments of Qwen'}]
for responses in bot.run(messages=messages):
    pass
print(responses)
```

```
@misc{qwen3technicalreport,
      title={Qwen3 Technical Report},
      author={Qwen Team},
      year={2025},
      eprint={2505.09388},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.09388},
}
```
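The reverse-index trick used in this card to separate thinking content from the final answer can be tried without loading a model. The sketch below applies the same logic to plain lists of integers; the token ID 151668 comes from the card, while `split_thinking` and the toy ID sequences are made up for illustration.

```python
THINK_END_ID = 151668  # end-of-thinking token ID quoted in the card


def split_thinking(output_ids, end_id=THINK_END_ID):
    """Split generated token IDs just after the last occurrence of end_id.

    If end_id never appears, everything is treated as final content.
    """
    try:
        # Searching the reversed list finds the last occurrence;
        # subtracting from the length converts it to a forward slice index.
        index = len(output_ids) - output_ids[::-1].index(end_id)
    except ValueError:
        index = 0
    return output_ids[:index], output_ids[index:]


# Toy sequences standing in for real generated token IDs.
thinking, answer = split_thinking([11, 12, 151668, 21, 22])
no_thinking, all_answer = split_thinking([21, 22])

print(thinking, answer)
print(no_thinking, all_answer)
```

Because the split index sits just past the marker, the marker itself stays with the thinking part, which is why decoding with `skip_special_tokens=True` yields clean text on both sides.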

license:apache-2.0
15,691
66

Qwen3-VL-8B-Instruct

> [!NOTE]
> Includes Unsloth chat template fixes! For `llama.cpp`, use `--jinja`.
> Unsloth Dynamic 2.0 achieves superior accuracy & outperforms other leading quants.

Meet Qwen3-VL, the most powerful vision-language model in the Qwen series to date.

This generation delivers comprehensive upgrades across the board: superior text understanding & generation, deeper visual perception & reasoning, extended context length, enhanced spatial and video dynamics comprehension, and stronger agent interaction capabilities.

Available in Dense and MoE architectures that scale from edge to cloud, with Instruct and reasoning-enhanced Thinking editions for flexible, on-demand deployment.

- Visual Agent: Operates PC/mobile GUIs by recognizing elements, understanding functions, invoking tools, and completing tasks.
- Visual Coding Boost: Generates Draw.io/HTML/CSS/JS from images/videos.
- Advanced Spatial Perception: Judges object positions, viewpoints, and occlusions; provides stronger 2D grounding and enables 3D grounding for spatial reasoning and embodied AI.
- Long Context & Video Understanding: Native 256K context, expandable to 1M; handles books and hours-long video with full recall and second-level indexing.
- Enhanced Multimodal Reasoning: Excels in STEM/Math, with causal analysis and logical, evidence-based answers.
- Upgraded Visual Recognition: Broader, higher-quality pretraining enables the model to “recognize everything”, including celebrities, anime, products, landmarks, flora/fauna, etc.
- Expanded OCR: Supports 32 languages (up from 19); robust in low light, blur, and tilt; better with rare/ancient characters and jargon; improved long-document structure parsing.
- Text Understanding on par with pure LLMs: Seamless text–vision fusion for lossless, unified comprehension.

1. Interleaved-MRoPE: Full-frequency allocation over time, width, and height via robust positional embeddings, enhancing long-horizon video reasoning.
2. DeepStack: Fuses multi-level ViT features to capture fine-grained details and sharpen image–text alignment.
3. Text–Timestamp Alignment: Moves beyond T-RoPE to precise, timestamp-grounded event localization for stronger video temporal modeling.

This is the weight repository for Qwen3-VL-8B-Instruct.

Qwen3-VL can be used with 🤖 ModelScope and 🤗 Transformers. The code for Qwen3-VL is included in the latest Hugging Face `transformers`, and we advise you to build `transformers` from source before using the chat model.

If you find our work helpful, feel free to cite us.

license:apache-2.0
15,530
4

Qwen2.5-VL-7B-Instruct

license:apache-2.0
15,392
14

Llama-3.2-1B-Instruct-bnb-4bit

llama
15,008
19

Qwen3-4B-Base-unsloth-bnb-4bit

Qwen3 is the latest generation of large language models in the Qwen series, offering a comprehensive suite of dense and mixture-of-experts (MoE) models. Building upon extensive advancements in training data, model architecture, and optimization techniques, Qwen3 delivers the following key improvements over the previously released Qwen2.5:

- Expanded Higher-Quality Pre-training Corpus: Qwen3 is pre-trained on 36 trillion tokens across 119 languages, tripling the language coverage of Qwen2.5, with a much richer mix of high-quality data, including coding, STEM, reasoning, book, multilingual, and synthetic data.
- Training Techniques and Model Architecture: Qwen3 incorporates a series of training techniques and architectural refinements, including global-batch load balancing loss for MoE models and qk layernorm for all models, leading to improved stability and overall performance.
- Three-stage Pre-training: Stage 1 focuses on broad language modeling and general knowledge acquisition, Stage 2 improves reasoning skills like STEM, coding, and logical reasoning, and Stage 3 enhances long-context comprehension by extending training sequence lengths up to 32k tokens.
- Scaling Law Guided Hyperparameter Tuning: Through comprehensive scaling law studies across the three-stage pre-training pipeline, Qwen3 systematically tunes critical hyperparameters, such as learning rate scheduler and batch size, separately for dense and MoE models, resulting in better training dynamics and final performance across different model scales.

Qwen3-4B-Base has the following features:

- Type: Causal Language Models
- Training Stage: Pretraining
- Number of Parameters: 4.0B
- Number of Parameters (Non-Embedding): 3.6B
- Number of Layers: 36
- Number of Attention Heads (GQA): 32 for Q and 8 for KV
- Context Length: 32,768

For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our blog, GitHub, and Documentation.

The code for Qwen3 is included in the latest Hugging Face `transformers`, and we advise you to use the latest version of `transformers`. With `transformers<4.51.0`, you will encounter an error when loading the model.

Detailed evaluation results are reported in this 📑 blog.

If you find our work helpful, feel free to cite us.
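Since this card ties Qwen3 support to a minimum `transformers` version, a small pre-flight check can fail early with a clearer message. The following is a minimal sketch using only the standard library; `parse_version` and `supports_qwen3` are hypothetical helpers, the version strings are examples, and in practice the installed version would come from `transformers.__version__`. A dedicated version-parsing library would be more robust than this simplified comparison.

```python
def parse_version(version):
    """Turn '4.51.0' into (4, 51, 0); stops at the first non-numeric part such as 'dev0'."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def supports_qwen3(installed, minimum="4.51.0"):
    """Return True if the installed version is at least the stated minimum."""
    return parse_version(installed) >= parse_version(minimum)


# Example version strings (illustrative only).
for candidate in ("4.50.3", "4.51.0", "4.52.0.dev0"):
    status = "ok" if supports_qwen3(candidate) else "too old, upgrade transformers"
    print(candidate, "->", status)
```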

license:apache-2.0
14,941
1

Qwen3-VL-4B-Instruct-unsloth-bnb-4bit

Unsloth Dynamic 2.0 achieves superior accuracy & outperforms other leading quants.

Meet Qwen3-VL, the most powerful vision-language model in the Qwen series to date.

This generation delivers comprehensive upgrades across the board: superior text understanding & generation, deeper visual perception & reasoning, extended context length, enhanced spatial and video dynamics comprehension, and stronger agent interaction capabilities.

Available in Dense and MoE architectures that scale from edge to cloud, with Instruct and reasoning-enhanced Thinking editions for flexible, on-demand deployment.

- Visual Agent: Operates PC/mobile GUIs by recognizing elements, understanding functions, invoking tools, and completing tasks.
- Visual Coding Boost: Generates Draw.io/HTML/CSS/JS from images/videos.
- Advanced Spatial Perception: Judges object positions, viewpoints, and occlusions; provides stronger 2D grounding and enables 3D grounding for spatial reasoning and embodied AI.
- Long Context & Video Understanding: Native 256K context, expandable to 1M; handles books and hours-long video with full recall and second-level indexing.
- Enhanced Multimodal Reasoning: Excels in STEM/Math, with causal analysis and logical, evidence-based answers.
- Upgraded Visual Recognition: Broader, higher-quality pretraining enables the model to “recognize everything”, including celebrities, anime, products, landmarks, flora/fauna, etc.
- Expanded OCR: Supports 32 languages (up from 19); robust in low light, blur, and tilt; better with rare/ancient characters and jargon; improved long-document structure parsing.
- Text Understanding on par with pure LLMs: Seamless text–vision fusion for lossless, unified comprehension.

1. Interleaved-MRoPE: Full-frequency allocation over time, width, and height via robust positional embeddings, enhancing long-horizon video reasoning.
2. DeepStack: Fuses multi-level ViT features to capture fine-grained details and sharpen image–text alignment.
3. Text–Timestamp Alignment: Moves beyond T-RoPE to precise, timestamp-grounded event localization for stronger video temporal modeling.

This is the weight repository for Qwen3-VL-4B-Instruct.

Qwen3-VL can be used with 🤖 ModelScope and 🤗 Transformers. The code for Qwen3-VL is included in the latest Hugging Face `transformers`, and we advise you to build `transformers` from source before using the chat model.

If you find our work helpful, feel free to cite us.

license:apache-2.0
14,916
3

gpt-oss-120b-unsloth-bnb-4bit

See our collection for all versions of gpt-oss including GGUF, 4-bit & 16-bit formats. See Unsloth Dynamic 2.0 GGUFs for our quantization benchmarks. - Read our Blog about gpt-oss support: unsloth.ai/blog/gpt-oss - View the rest of our notebooks in our docs here. - Thank you to the llama.cpp team for their work on supporting this model. We wouldn't be able to release quants without them! Welcome to the gpt-oss series, OpenAI’s open-weight models designed for powerful reasoning, agentic tasks, and versatile developer use cases. We’re releasing two flavors of the open models: - `gpt-oss-120b` — for production, general purpose, high reasoning use cases that fits into a single H100 GPU (117B parameters with 5.1B active parameters) - `gpt-oss-20b` — for lower latency, and local or specialized use cases (21B parameters with 3.6B active parameters) Both models were trained on our harmony response format and should only be used with the harmony format as it will not work correctly otherwise. > [!NOTE] > This model card is dedicated to the larger `gpt-oss-120b` model. Check out `gpt-oss-20b` for the smaller model. Permissive Apache 2.0 license: Build freely without copyleft restrictions or patent risk—ideal for experimentation, customization, and commercial deployment. Configurable reasoning effort: Easily adjust the reasoning effort (low, medium, high) based on your specific use case and latency needs. Full chain-of-thought: Gain complete access to the model’s reasoning process, facilitating easier debugging and increased trust in outputs. It’s not intended to be shown to end users. Fine-tunable: Fully customize models to your specific use case through parameter fine-tuning. Agentic capabilities: Use the models’ native capabilities for function calling, web browsing, Python code execution, and Structured Outputs. 
Native MXFP4 quantization: The models are trained with native MXFP4 precision for the MoE layer, making `gpt-oss-120b` run on a single H100 GPU and the `gpt-oss-20b` model run within 16GB of memory.

You can use `gpt-oss-120b` and `gpt-oss-20b` with Transformers. If you use the Transformers chat template, it will automatically apply the harmony response format. If you use `model.generate` directly, you need to apply the harmony format manually using the chat template or use our openai-harmony package. To get started, install the necessary dependencies to set up your environment. Once set up, you can proceed to run the model. Alternatively, you can run the model via `Transformers Serve` to spin up an OpenAI-compatible webserver. Learn more about how to use gpt-oss with Transformers.

vLLM recommends using uv for Python dependency management. You can use vLLM to spin up an OpenAI-compatible webserver; the vLLM serve command automatically downloads the model and starts the server.

To learn about how to use this model with PyTorch and Triton, check out our reference implementations in the gpt-oss repository.

If you are trying to run gpt-oss on consumer hardware, you can use Ollama after installing it. If you are using LM Studio, you can download the model from its command line as well. Check out our awesome list for a broader collection of gpt-oss resources and inference partners.

You can download the model weights from the Hugging Face Hub using the Hugging Face CLI.

You can adjust the reasoning level to suit your task across three levels:

- Low: Fast responses for general dialogue.
- Medium: Balanced speed and detail.
- High: Deep and detailed analysis.

The reasoning level can be set in the system prompt, e.g., "Reasoning: high".
The gpt-oss models are excellent for: Web browsing (using built-in browsing tools) Function calling with defined schemas Agentic operations like browser tasks Both gpt-oss models can be fine-tuned for a variety of specialized use cases. This larger model `gpt-oss-120b` can be fine-tuned on a single H100 node, whereas the smaller `gpt-oss-20b` can even be fine-tuned on consumer hardware.
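The card states that the reasoning level is set in the system prompt, e.g. "Reasoning: high". A minimal sketch of building such a chat message list is shown below. The helper name `build_messages` and the validation step are illustrative assumptions, not part of any gpt-oss or Transformers API; the resulting list is the usual role/content format that a chat template would consume.

```python
# Hypothetical helper: builds a chat message list whose system prompt
# carries the reasoning level described in the card ("Reasoning: high").
VALID_LEVELS = ("low", "medium", "high")


def build_messages(user_prompt: str, reasoning: str = "medium") -> list[dict]:
    level = reasoning.lower()
    if level not in VALID_LEVELS:
        raise ValueError(f"reasoning must be one of {VALID_LEVELS}, got {reasoning!r}")
    return [
        {"role": "system", "content": f"Reasoning: {level}"},
        {"role": "user", "content": user_prompt},
    ]


messages = build_messages("Summarize the harmony response format.", reasoning="high")
print(messages[0]["content"])
```

The list can then be passed to a tokenizer's chat template, which according to the card applies the harmony format automatically.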

license:apache-2.0
14,555
9

Meta-Llama-3.1-70B-Instruct

Finetune Llama 3.1, Gemma 2, Mistral 2-5x faster with 70% less memory via Unsloth!

We have a free Google Colab Tesla T4 notebook for Llama 3.1 (8B) here: https://colab.research.google.com/drive/1Ys44kVvmeZtnICzWz0xgpRnrIOjZAuxp?usp=sharing

All notebooks are beginner friendly! Add your dataset, click "Run All", and you'll get a 2x faster finetuned model which can be exported to GGUF, vLLM or uploaded to Hugging Face.

| Unsloth supports | Free Notebooks | Performance | Memory use |
|------------------|----------------|-------------|------------|
| Llama-3.2 (3B) | ▶️ Start on Colab | 2.4x faster | 58% less |
| Llama-3.2 (11B vision) | ▶️ Start on Colab | 2x faster | 60% less |
| Llama-3.1 (8B) | ▶️ Start on Colab | 2.4x faster | 58% less |
| Qwen2 VL (7B) | ▶️ Start on Colab | 1.8x faster | 60% less |
| Qwen2.5 (7B) | ▶️ Start on Colab | 2x faster | 60% less |
| Phi-3.5 (mini) | ▶️ Start on Colab | 2x faster | 50% less |
| Gemma 2 (9B) | ▶️ Start on Colab | 2.4x faster | 58% less |
| Mistral (7B) | ▶️ Start on Colab | 2.2x faster | 62% less |
| DPO - Zephyr | ▶️ Start on Colab | 1.9x faster | 19% less |

- This conversational notebook is useful for ShareGPT ChatML / Vicuna templates.
- This text completion notebook is for raw text. This DPO notebook replicates Zephyr.
- \* Kaggle has 2x T4s, but we use 1. Due to overhead, 1x T4 is 5x faster.

llama
14,180
4

Qwen3-VL-4B-Instruct

> [!NOTE]
> Includes Unsloth chat template fixes! For `llama.cpp`, use `--jinja`.
>
> Unsloth Dynamic 2.0 achieves superior accuracy & outperforms other leading quants.

Meet Qwen3-VL, the most powerful vision-language model in the Qwen series to date. This generation delivers comprehensive upgrades across the board: superior text understanding & generation, deeper visual perception & reasoning, extended context length, enhanced spatial and video dynamics comprehension, and stronger agent interaction capabilities. Available in Dense and MoE architectures that scale from edge to cloud, with Instruct and reasoning‑enhanced Thinking editions for flexible, on‑demand deployment.

- Visual Agent: Operates PC/mobile GUIs: recognizes elements, understands functions, invokes tools, completes tasks.
- Visual Coding Boost: Generates Draw.io/HTML/CSS/JS from images/videos.
- Advanced Spatial Perception: Judges object positions, viewpoints, and occlusions; provides stronger 2D grounding and enables 3D grounding for spatial reasoning and embodied AI.
- Long Context & Video Understanding: Native 256K context, expandable to 1M; handles books and hours-long video with full recall and second-level indexing.
- Enhanced Multimodal Reasoning: Excels in STEM/Math, with causal analysis and logical, evidence-based answers.
- Upgraded Visual Recognition: Broader, higher-quality pretraining enables the model to “recognize everything”: celebrities, anime, products, landmarks, flora/fauna, etc.
- Expanded OCR: Supports 32 languages (up from 19); robust in low light, blur, and tilt; better with rare/ancient characters and jargon; improved long-document structure parsing.
- Text Understanding on par with pure LLMs: Seamless text–vision fusion for lossless, unified comprehension.

1. Interleaved-MRoPE: Full‑frequency allocation over time, width, and height via robust positional embeddings, enhancing long‑horizon video reasoning.
2. DeepStack: Fuses multi‑level ViT features to capture fine‑grained details and sharpen image–text alignment.
3. Text–Timestamp Alignment: Moves beyond T‑RoPE to precise, timestamp‑grounded event localization for stronger video temporal modeling.

This is the weight repository for Qwen3-VL-4B-Instruct. Qwen3-VL can be used with 🤖 ModelScope and 🤗 Transformers. The code of Qwen3-VL is included in the latest Hugging Face transformers, and we advise you to build transformers from source. The chat model can then be used directly with `transformers`.

If you find our work helpful, feel free to cite us.

license:apache-2.0
14,125
5

gpt-oss-120b-BF16

See our collection for all versions of gpt-oss including GGUF, 4-bit & 16-bit formats. See Unsloth Dynamic 2.0 GGUFs for our quantization benchmarks. - Read our Blog about gpt-oss support: unsloth.ai/blog/gpt-oss - View the rest of our notebooks in our docs here. - Thank you to the llama.cpp team for their work on supporting this model. We wouldn't be able to release quants without them! Welcome to the gpt-oss series, OpenAI’s open-weight models designed for powerful reasoning, agentic tasks, and versatile developer use cases. We’re releasing two flavors of the open models: - `gpt-oss-120b` — for production, general purpose, high reasoning use cases that fits into a single H100 GPU (117B parameters with 5.1B active parameters) - `gpt-oss-20b` — for lower latency, and local or specialized use cases (21B parameters with 3.6B active parameters) Both models were trained on our harmony response format and should only be used with the harmony format as it will not work correctly otherwise. > [!NOTE] > This model card is dedicated to the larger `gpt-oss-120b` model. Check out `gpt-oss-20b` for the smaller model. Permissive Apache 2.0 license: Build freely without copyleft restrictions or patent risk—ideal for experimentation, customization, and commercial deployment. Configurable reasoning effort: Easily adjust the reasoning effort (low, medium, high) based on your specific use case and latency needs. Full chain-of-thought: Gain complete access to the model’s reasoning process, facilitating easier debugging and increased trust in outputs. It’s not intended to be shown to end users. Fine-tunable: Fully customize models to your specific use case through parameter fine-tuning. Agentic capabilities: Use the models’ native capabilities for function calling, web browsing, Python code execution, and Structured Outputs. 
Native MXFP4 quantization: The models are trained with native MXFP4 precision for the MoE layer, making `gpt-oss-120b` run on a single H100 GPU and the `gpt-oss-20b` model run within 16GB of memory.

You can use `gpt-oss-120b` and `gpt-oss-20b` with Transformers. If you use the Transformers chat template, it will automatically apply the harmony response format. If you use `model.generate` directly, you need to apply the harmony format manually using the chat template or use our openai-harmony package. To get started, install the necessary dependencies to set up your environment. Once set up, you can proceed to run the model. Alternatively, you can run the model via `Transformers Serve` to spin up an OpenAI-compatible webserver. Learn more about how to use gpt-oss with Transformers.

vLLM recommends using uv for Python dependency management. You can use vLLM to spin up an OpenAI-compatible webserver; the vLLM serve command automatically downloads the model and starts the server.

To learn about how to use this model with PyTorch and Triton, check out our reference implementations in the gpt-oss repository.

If you are trying to run gpt-oss on consumer hardware, you can use Ollama after installing it. If you are using LM Studio, you can download the model from its command line as well. Check out our awesome list for a broader collection of gpt-oss resources and inference partners.

You can download the model weights from the Hugging Face Hub using the Hugging Face CLI.

You can adjust the reasoning level to suit your task across three levels:

- Low: Fast responses for general dialogue.
- Medium: Balanced speed and detail.
- High: Deep and detailed analysis.

The reasoning level can be set in the system prompt, e.g., "Reasoning: high".
The gpt-oss models are excellent for: Web browsing (using built-in browsing tools) Function calling with defined schemas Agentic operations like browser tasks Both gpt-oss models can be fine-tuned for a variety of specialized use cases. This larger model `gpt-oss-120b` can be fine-tuned on a single H100 node, whereas the smaller `gpt-oss-20b` can even be fine-tuned on consumer hardware.
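To make the memory claims in this card concrete, here is a rough back-of-the-envelope sketch comparing weight storage in BF16 and MXFP4. It rests on several assumptions that are not in the card: every parameter is stored in the given format (the card says only the MoE layer uses MXFP4, so the real MXFP4 footprint is somewhat larger), MXFP4 costs about 4.25 bits per parameter (4-bit elements plus one shared 8-bit scale per block of 32), and activations and KV cache are ignored. The function name `weight_gb` is hypothetical.

```python
# Rough weight-only memory estimate; all figures are approximations
# under the assumptions stated in the lead-in.
BF16_BITS = 16.0
MXFP4_BITS = 4.0 + 8.0 / 32.0  # 4-bit elements + shared 8-bit scale per 32 values


def weight_gb(params_billion: float, bits_per_param: float) -> float:
    """Return the approximate weight size in gigabytes (10^9 bytes)."""
    return params_billion * bits_per_param / 8.0


for name, params in (("gpt-oss-120b", 117.0), ("gpt-oss-20b", 21.0)):
    bf16 = weight_gb(params, BF16_BITS)
    mxfp4 = weight_gb(params, MXFP4_BITS)
    print(f"{name}: BF16 ~{bf16:.1f} GB, MXFP4 ~{mxfp4:.1f} GB")
```

Under these assumptions the 117B-parameter model needs roughly 234 GB in BF16 but only about 62 GB in MXFP4, which is consistent with the card's statement that it runs on a single H100.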

license:apache-2.0
14,044
4

Qwen3-32B

Qwen3 is the latest generation of large language models in the Qwen series, offering a comprehensive suite of dense and mixture-of-experts (MoE) models. Built upon extensive training, Qwen3 delivers groundbreaking advancements in reasoning, instruction-following, agent capabilities, and multilingual support, with the following key features:

- Unique support for seamless switching between thinking mode (for complex logical reasoning, math, and coding) and non-thinking mode (for efficient, general-purpose dialogue) within a single model, ensuring optimal performance across various scenarios.
- Significant enhancement of its reasoning capabilities, surpassing previous QwQ (in thinking mode) and Qwen2.5 instruct models (in non-thinking mode) on mathematics, code generation, and commonsense logical reasoning.
- Superior human preference alignment, excelling in creative writing, role-playing, multi-turn dialogues, and instruction following, to deliver a more natural, engaging, and immersive conversational experience.
- Expertise in agent capabilities, enabling precise integration with external tools in both thinking and non-thinking modes and achieving leading performance among open-source models in complex agent-based tasks.
- Support for 100+ languages and dialects with strong capabilities for multilingual instruction following and translation.

Qwen3-32B has the following features:

- Type: Causal Language Models
- Training Stage: Pretraining & Post-training
- Number of Parameters: 32.8B
- Number of Parameters (Non-Embedding): 31.2B
- Number of Layers: 64
- Number of Attention Heads (GQA): 64 for Q and 8 for KV
- Context Length: 32,768 natively and 131,072 tokens with YaRN.

For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our blog, GitHub, and Documentation.

The code of Qwen3 is included in the latest Hugging Face `transformers`, and we advise you to use the latest version of `transformers`.
With `transformers`, the generated ids are split into thinking content and final content at the last occurrence of token 151668:

```python
# output_ids holds the generated token ids; tokenizer is the model's tokenizer.
try:
    # Position just past the last occurrence of token 151668
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content)
print("content:", content)
```

vLLM:

```shell
vllm serve Qwen/Qwen3-32B --enable-reasoning --reasoning-parser deepseek_r1
```

SGLang:

```shell
python -m sglang.launch_server --model-path Qwen/Qwen3-32B --reasoning-parser deepseek-r1
```

Thinking mode (`enable_thinking=True`):

```python
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True  # True is the default value for enable_thinking
)
```

Non-thinking mode (`enable_thinking=False`):

```python
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False  # Setting enable_thinking=False disables thinking mode
)
```

Multi-turn conversation with the `/think` and `/no_think` switches:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

class QwenChatbot:
    def __init__(self, model_name="Qwen/Qwen3-32B"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(model_name)
        self.history = []

    def generate_response(self, user_input):
        messages = self.history + [{"role": "user", "content": user_input}]

        text = self.tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True
        )

        inputs = self.tokenizer(text, return_tensors="pt")
        # Keep only the newly generated tokens (drop the prompt tokens)
        response_ids = self.model.generate(**inputs, max_new_tokens=32768)[0][len(inputs.input_ids[0]):].tolist()
        response = self.tokenizer.decode(response_ids, skip_special_tokens=True)

        # Update history
        self.history.append({"role": "user", "content": user_input})
        self.history.append({"role": "assistant", "content": response})

        return response

# Example Usage
if __name__ == "__main__":
    chatbot = QwenChatbot()

    # First input (without /think or /no_think tags, thinking mode is enabled by default)
    user_input_1 = "How many r's in strawberries?"
    print(f"User: {user_input_1}")
    response_1 = chatbot.generate_response(user_input_1)
    print(f"Bot: {response_1}")
    print("----------------------")

    # Second input with /no_think
    user_input_2 = "Then, how many r's in blueberries? /no_think"
    print(f"User: {user_input_2}")
    response_2 = chatbot.generate_response(user_input_2)
    print(f"Bot: {response_2}")
    print("----------------------")

    # Third input with /think
    user_input_3 = "Really? /think"
    print(f"User: {user_input_3}")
    response_3 = chatbot.generate_response(user_input_3)
    print(f"Bot: {response_3}")
```

Agentic use with Qwen-Agent:

```python
from qwen_agent.agents import Assistant

# Define LLM
llm_cfg = {
    'model': 'Qwen3-32B',  # assumed: name of the served model

    # Use the endpoint provided by Alibaba Model Studio:
    # 'model_type': 'qwen_dashscope',
    # 'api_key': os.getenv('DASHSCOPE_API_KEY'),

    # Use a custom endpoint compatible with OpenAI API:
    'model_server': 'http://localhost:8000/v1',  # api_base
    'api_key': 'EMPTY',

    # Other parameters:
    # 'generate_cfg': {
    #     # Add: When the response content is `<think>this is the thought</think>this is the answer`;
    #     # Do not add: When the response has been separated by reasoning_content and content.
    #     'thought_in_content': True,
    # },
}

# Define Tools
tools = [
    {'mcpServers': {  # You can specify the MCP configuration file
        'time': {
            'command': 'uvx',
            'args': ['mcp-server-time', '--local-timezone=Asia/Shanghai']
        },
        "fetch": {
            "command": "uvx",
            "args": ["mcp-server-fetch"]
        }
    }},
    'code_interpreter',  # Built-in tools
]

# Define Agent
bot = Assistant(llm=llm_cfg, function_list=tools)

# Streaming generation
messages = [{'role': 'user', 'content': 'https://qwenlm.github.io/blog/ Introduce the latest developments of Qwen'}]
for responses in bot.run(messages=messages):
    pass
print(responses)
```

Long-context processing with YaRN, set in the model's `config.json`:

```json
{
    ...,
    "rope_scaling": {
        "type": "yarn",
        "factor": 4.0,
        "original_max_position_embeddings": 32768
    }
}
```

vLLM:

```shell
vllm serve ... --rope-scaling '{"type":"yarn","factor":4.0,"original_max_position_embeddings":32768}' --max-model-len 131072
```

SGLang:

```shell
python -m sglang.launch_server ... --json-model-override-args '{"rope_scaling":{"type":"yarn","factor":4.0,"original_max_position_embeddings":32768}}'
```

llama.cpp:

```shell
llama-server ... --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 32768
```

> Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'original_max_position_embeddings'}

```
@misc{qwen3,
    title  = {Qwen3},
    url    = {https://qwenlm.github.io/blog/qwen3/},
    author = {Qwen Team},
    month  = {April},
    year   = {2025}
}
```
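The parsing step in the card (finding the last occurrence of token id 151668 and splitting the generated ids there) can be sketched as a standalone function that runs without a model. The function name `split_thinking` and the toy id lists are illustrative assumptions; only the token id and the reverse-index logic come from the card's snippet.

```python
# Hypothetical helper wrapping the card's split logic so it can be
# exercised on plain lists of token ids.
THINK_END_ID = 151668  # token id used in the card's snippet


def split_thinking(output_ids: list[int], end_id: int = THINK_END_ID) -> tuple[list[int], list[int]]:
    try:
        # Index just past the LAST occurrence of end_id
        index = len(output_ids) - output_ids[::-1].index(end_id)
    except ValueError:
        # No end-of-thinking token: treat everything as final content
        index = 0
    return output_ids[:index], output_ids[index:]


thinking_ids, content_ids = split_thinking([11, 12, 151668, 21, 22])
print(thinking_ids, content_ids)
```

Searching from the end means that if the marker appears more than once, everything up to the final marker is treated as thinking content; if it never appears, the whole output is treated as the answer.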

13,918
11

Apriel-1.5-15b-Thinker-GGUF

Unsloth Dynamic 2.0 achieves superior accuracy & outperforms other leading quants.

Apriel-1.5-15b-Thinker - Mid training is all you need!

1. Summary
2. Evaluation
3. Training Details
4. How to Use
5. Intended Use
6. Limitations
7. Security and Responsible Use
8. Software
9. License
10. Acknowledgements
11. Citation

Click here to skip to the technical report -> https://huggingface.co/ServiceNow-AI/Apriel-1.5-15b-Thinker/blob/main/Apriel-1.5-Thinker.pdf

Apriel-1.5-15b-Thinker is a multimodal reasoning model in ServiceNow’s Apriel SLM series which achieves competitive performance against models 10 times its size. Apriel-1.5 is the second model in the reasoning series. It introduces enhanced textual reasoning capabilities and adds image reasoning support to the previous text model. It has undergone extensive continual pretraining across both text and image domains. In terms of post-training, this model has undergone text SFT only. Our research demonstrates that with a strong mid-training regimen, we are able to achieve SOTA performance on text and image reasoning tasks without any image SFT training or RL.

Highlights

- Achieves a score of 52 on the Artificial Analysis index and is competitive with Deepseek R1 0528, Gemini-Flash, etc.
- It is AT LEAST 1/10 the size of any other model that scores > 50 on the Artificial Analysis index.
- Scores 68 on Tau2 Bench Telecom and 62 on IFBench, which are key benchmarks for the enterprise domain.
- At 15B parameters, the model fits on a single GPU, making it highly memory-efficient.

- For text benchmarks, we report evaluations performed by a third party - Artificial Analysis.
- For image benchmarks, we report evaluations obtained by https://github.com/open-compass/VLMEvalKit

We are a small lab with big goals. While we are not GPU poor, our lab has, in comparison, a tiny fraction of the compute available to other frontier labs.
Our goal with this work is to show that a SOTA model can be built with limited resources if you have the right data, design and solid methodology. We set out to build a small but powerful model, aiming for capabilities on par with frontier models. Developing a 15B model with this level of performance requires tradeoffs, so we prioritized getting SOTA-level performance first. Mid-training consists only of CPT and SFT; no RL has been applied. This model performs extensive reasoning by default, allocating extra internal effort to improve robustness and accuracy even on simpler queries. You may notice slightly higher token usage and longer response times, but we are actively working to make it more efficient and concise in future releases. Mid training / Continual Pre‑training In this stage, the model is trained on billions of tokens of carefully curated textual samples drawn from mathematical reasoning, coding challenges, scientific discourse, logical puzzles, and diverse knowledge-rich texts along with multimodal samples covering image understanding and reasoning, captioning, and interleaved image-text data. The objective is to strengthen foundational reasoning capabilities of the model. This stage is critical for the model to function as a reasoner and provides significant lifts in reasoning benchmarks. Supervised Fine‑Tuning (SFT) The model is fine-tuned on over 2M high-quality text samples spanning mathematical and scientific problem-solving, coding tasks, instruction-following, API/function invocation, and conversational use cases. This results in superior text performance comparable to models such as Deepseek R1 0528 and Gemini-Flash. Although no image-specific fine-tuning is performed, the model’s inherent multimodal capabilities and cross-modal transfer of reasoning behavior from the text SFT yield competitive image performance relative to other leading open-source VL models. 
As the upstream PR is not yet merged, you can use this custom image as an alternate way to run the model with tool and reasoning parsers enabled. This will start the vLLM OpenAI-compatible API server serving the Apriel-1.5-15B-Thinker model with Apriel’s custom tool parser and reasoning parser. Here is a code snippet demonstrating the model's usage with the transformers library's generate function: The model will first generate its thinking process and then generate its final response between `[BEGIN FINAL RESPONSE]` and `[END FINAL RESPONSE]`. Here is a code snippet demonstrating the application of the chat template: Usage Guidelines 1. Use the model’s default chat template, which already includes a system prompt. 2. We recommend setting temperature to `0.6`. 3. We ensure the model starts with `Here are my reasoning steps:\n` during all our evaluations. This is implemented in the default chat template. 4. For multi-turn conversations, intermediate turns (historical model outputs) are expected to contain only the final response, without reasoning steps. The Apriel family of models are designed for a variety of general-purpose instruction tasks, including: - Code assistance and generation - Logical reasoning and multi-step tasks - Question answering and information retrieval - Function calling, complex instruction following and agent use cases They are not intended for use in safety-critical applications without human oversight or in scenarios requiring guaranteed factual accuracy. - Factual accuracy: May produce incorrect, misleading, or outdated content. Outputs should be verified before use in critical contexts. - Bias: May reflect societal, cultural, or systemic biases present in training data. - Ethics: Do not use the model to produce harmful, unlawful, or unethical content. - Language: Strongest performance is in English. Output quality may degrade in underrepresented languages. 
- Critical use: Not suitable for medical, legal, financial, or other high-risk applications without safeguards. Security Responsibilities: Deployers and users are strongly encouraged to align their security practices with established frameworks and regulatory guidelines such as the EU AI Act and the NIST AI Risk Management Framework (RMF). - Regularly conduct robustness assessments to identify and mitigate adversarial inputs. - Implement validation and filtering processes to prevent harmful or biased outputs. - Continuously perform data privacy checks to guard against unintended data leaks. - Document and communicate the model's limitations, intended usage, and known security risks to all end-users. - Schedule periodic security reviews and updates to address emerging threats and vulnerabilities. - Follow established security policies and usage guidelines provided by deployers. - Protect and manage sensitive information when interacting with the model. - Report anomalies, suspicious behavior, or unsafe outputs to deployers or developers. - Maintain human oversight and apply judgment to mitigate potential security or ethical risks during interactions. Disclaimer: Users accept responsibility for securely deploying, managing, and using this open-source LLM. The model is provided "as-is," without explicit or implied warranty regarding security or fitness for any specific application or environment.
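The card says the model first generates its thinking process and then its final response between `[BEGIN FINAL RESPONSE]` and `[END FINAL RESPONSE]`, and that historical turns in multi-turn conversations should contain only the final response. A minimal sketch of extracting that final response from raw generated text follows. The function name `extract_final_response` and the fallback of returning the whole text when the markers are missing are assumptions for illustration, not behaviour defined by the card.

```python
import re

# Matches the text between the two markers named in the card.
FINAL_RE = re.compile(
    r"\[BEGIN FINAL RESPONSE\](.*?)\[END FINAL RESPONSE\]",
    re.DOTALL,
)


def extract_final_response(text: str) -> str:
    """Return only the final response, dropping the reasoning steps."""
    match = FINAL_RE.search(text)
    if match is None:
        # Assumption: with no markers, fall back to the full text.
        return text.strip()
    return match.group(1).strip()


raw = (
    "Here are my reasoning steps:\n"
    "2 + 2 equals 4.\n"
    "[BEGIN FINAL RESPONSE]\n4\n[END FINAL RESPONSE]"
)
# Store only the final response in the conversation history.
history = [{"role": "assistant", "content": extract_final_response(raw)}]
print(history[0]["content"])
```

Keeping only the extracted text in the history follows usage guideline 4 above, which expects intermediate turns to omit reasoning steps.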

license:mit
13,830
40

Qwen3-0.6B-bnb-4bit

Qwen3 is the latest generation of large language models in the Qwen series, offering a comprehensive suite of dense and mixture-of-experts (MoE) models. Built upon extensive training, Qwen3 delivers groundbreaking advancements in reasoning, instruction-following, agent capabilities, and multilingual support, with the following key features:

- Unique support for seamless switching between thinking mode (for complex logical reasoning, math, and coding) and non-thinking mode (for efficient, general-purpose dialogue) within a single model, ensuring optimal performance across various scenarios.
- Significant enhancement of its reasoning capabilities, surpassing previous QwQ (in thinking mode) and Qwen2.5 instruct models (in non-thinking mode) on mathematics, code generation, and commonsense logical reasoning.
- Superior human preference alignment, excelling in creative writing, role-playing, multi-turn dialogues, and instruction following, to deliver a more natural, engaging, and immersive conversational experience.
- Expertise in agent capabilities, enabling precise integration with external tools in both thinking and non-thinking modes and achieving leading performance among open-source models in complex agent-based tasks.
- Support for 100+ languages and dialects with strong capabilities for multilingual instruction following and translation.

Qwen3-0.6B has the following features:

- Type: Causal Language Models
- Training Stage: Pretraining & Post-training
- Number of Parameters: 0.6B
- Number of Parameters (Non-Embedding): 0.44B
- Number of Layers: 28
- Number of Attention Heads (GQA): 16 for Q and 8 for KV
- Context Length: 32,768

For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our blog, GitHub, and Documentation.

The code of Qwen3 is included in the latest Hugging Face `transformers`, and we advise you to use the latest version of `transformers`.
With `transformers`, the generated ids are split into thinking content and final content at the last occurrence of token 151668:

```python
# output_ids holds the generated token ids; tokenizer is the model's tokenizer.
try:
    # Position just past the last occurrence of token 151668
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content)
print("content:", content)
```

vLLM:

```shell
vllm serve Qwen/Qwen3-0.6B --enable-reasoning --reasoning-parser deepseek_r1
```

SGLang:

```shell
python -m sglang.launch_server --model-path Qwen/Qwen3-0.6B --reasoning-parser deepseek-r1
```

Thinking mode (`enable_thinking=True`):

```python
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True  # True is the default value for enable_thinking
)
```

Non-thinking mode (`enable_thinking=False`):

```python
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False  # Setting enable_thinking=False disables thinking mode
)
```

Multi-turn conversation with the `/think` and `/no_think` switches:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

class QwenChatbot:
    def __init__(self, model_name="Qwen/Qwen3-0.6B"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(model_name)
        self.history = []

    def generate_response(self, user_input):
        messages = self.history + [{"role": "user", "content": user_input}]

        text = self.tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True
        )

        inputs = self.tokenizer(text, return_tensors="pt")
        # Keep only the newly generated tokens (drop the prompt tokens)
        response_ids = self.model.generate(**inputs, max_new_tokens=32768)[0][len(inputs.input_ids[0]):].tolist()
        response = self.tokenizer.decode(response_ids, skip_special_tokens=True)

        # Update history
        self.history.append({"role": "user", "content": user_input})
        self.history.append({"role": "assistant", "content": response})

        return response

# Example Usage
if __name__ == "__main__":
    chatbot = QwenChatbot()

    # First input (without /think or /no_think tags, thinking mode is enabled by default)
    user_input_1 = "How many r's in strawberries?"
    print(f"User: {user_input_1}")
    response_1 = chatbot.generate_response(user_input_1)
    print(f"Bot: {response_1}")
    print("----------------------")

    # Second input with /no_think
    user_input_2 = "Then, how many r's in blueberries? /no_think"
    print(f"User: {user_input_2}")
    response_2 = chatbot.generate_response(user_input_2)
    print(f"Bot: {response_2}")
    print("----------------------")

    # Third input with /think
    user_input_3 = "Really? /think"
    print(f"User: {user_input_3}")
    response_3 = chatbot.generate_response(user_input_3)
    print(f"Bot: {response_3}")
```

Agentic use with Qwen-Agent:

```python
from qwen_agent.agents import Assistant

# Define LLM
llm_cfg = {
    'model': 'Qwen3-0.6B',  # assumed: name of the served model

    # Use the endpoint provided by Alibaba Model Studio:
    # 'model_type': 'qwen_dashscope',
    # 'api_key': os.getenv('DASHSCOPE_API_KEY'),

    # Use a custom endpoint compatible with OpenAI API:
    'model_server': 'http://localhost:8000/v1',  # api_base
    'api_key': 'EMPTY',

    # Other parameters:
    # 'generate_cfg': {
    #     # Add: When the response content is `<think>this is the thought</think>this is the answer`;
    #     # Do not add: When the response has been separated by reasoning_content and content.
    #     'thought_in_content': True,
    # },
}

# Define Tools
tools = [
    {'mcpServers': {  # You can specify the MCP configuration file
        'time': {
            'command': 'uvx',
            'args': ['mcp-server-time', '--local-timezone=Asia/Shanghai']
        },
        "fetch": {
            "command": "uvx",
            "args": ["mcp-server-fetch"]
        }
    }},
    'code_interpreter',  # Built-in tools
]

# Define Agent
bot = Assistant(llm=llm_cfg, function_list=tools)

# Streaming generation
messages = [{'role': 'user', 'content': 'https://qwenlm.github.io/blog/ Introduce the latest developments of Qwen'}]
for responses in bot.run(messages=messages):
    pass
print(responses)
```

```
@misc{qwen3,
    title  = {Qwen3},
    url    = {https://qwenlm.github.io/blog/qwen3/},
    author = {Qwen Team},
    month  = {April},
    year   = {2025}
}
```
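The chatbot example in this card toggles thinking per turn with switches appended to the user message (written `/think` and `/no_think` in the upstream Qwen3 card), with thinking enabled by default. The sketch below shows one way to work out which mode a conversation is currently in by scanning the user turns. It assumes that the most recent switch wins, and the function name `effective_thinking` is hypothetical; the model itself interprets these switches through its chat template, so this is only a client-side illustration.

```python
# Hypothetical client-side helper: determine the thinking mode implied
# by the latest /think or /no_think switch in the user turns.
def effective_thinking(messages: list[dict], default: bool = True) -> bool:
    mode = default
    for message in messages:
        if message["role"] != "user":
            continue
        text = message["content"]
        think_pos = text.rfind("/think")
        no_think_pos = text.rfind("/no_think")
        if think_pos == -1 and no_think_pos == -1:
            continue  # no switch in this turn: keep the previous mode
        # Whichever switch appears later in the turn takes effect
        mode = think_pos > no_think_pos
    return mode


history = [
    {"role": "user", "content": "How many r's in strawberries?"},
    {"role": "assistant", "content": "..."},
    {"role": "user", "content": "Then, how many r's in blueberries? /no_think"},
]
print(effective_thinking(history))
```

Appending a later user turn such as "Really? /think" flips the result back to thinking mode, mirroring the three-turn example in the card.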

13,628
3

Qwen3-32B-bnb-4bit

Qwen3 is the latest generation of large language models in the Qwen series, offering a comprehensive suite of dense and mixture-of-experts (MoE) models. Built upon extensive training, Qwen3 delivers groundbreaking advancements in reasoning, instruction-following, agent capabilities, and multilingual support, with the following key features:

- Unique support for seamless switching between thinking mode (for complex logical reasoning, math, and coding) and non-thinking mode (for efficient, general-purpose dialogue) within a single model, ensuring optimal performance across various scenarios.
- Significant enhancement of its reasoning capabilities, surpassing previous QwQ (in thinking mode) and Qwen2.5 instruct models (in non-thinking mode) on mathematics, code generation, and commonsense logical reasoning.
- Superior human preference alignment, excelling in creative writing, role-playing, multi-turn dialogues, and instruction following, to deliver a more natural, engaging, and immersive conversational experience.
- Expertise in agent capabilities, enabling precise integration with external tools in both thinking and non-thinking modes and achieving leading performance among open-source models in complex agent-based tasks.
- Support for 100+ languages and dialects with strong capabilities for multilingual instruction following and translation.

Qwen3-32B has the following features:

- Type: Causal Language Models
- Training Stage: Post-training
- Number of Parameters: 32.8B
- Number of Parameters (Non-Embedding): 31.2B
- Number of Layers: 64
- Number of Attention Heads (GQA): 64 for Q and 8 for KV
- Context Length: 32,768 natively and 131,072 tokens with YaRN.

For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our blog, GitHub, and Documentation.

The code of Qwen3 is included in the latest Hugging Face `transformers`, and we advise you to use the latest version of `transformers`.
Given the generated `output_ids` and the `tokenizer`, the following fragment splits the output into thinking content and final content:

```python
# Parse the thinking content
try:
    # rindex: find the last occurrence of token 151668 (</think>)
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content)
print("content:", content)
```

Serving with vLLM:

```shell
vllm serve Qwen/Qwen3-32B --enable-reasoning --reasoning-parser deepseek_r1
```

Serving with SGLang:

```shell
python -m sglang.launch_server --model-path Qwen/Qwen3-32B --reasoning-parser deepseek-r1
```

Thinking mode enabled (the default):

```python
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True  # True is the default value for enable_thinking
)
```

Thinking mode disabled:

```python
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False  # Setting enable_thinking=False disables thinking mode
)
```

Multi-turn conversation using the `/think` and `/no_think` switches in user input:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

class QwenChatbot:
    def __init__(self, model_name="Qwen/Qwen3-32B"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(model_name)
        self.history = []

    def generate_response(self, user_input):
        messages = self.history + [{"role": "user", "content": user_input}]

        text = self.tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True
        )

        inputs = self.tokenizer(text, return_tensors="pt")
        # Keep only the newly generated tokens (drop the prompt tokens)
        response_ids = self.model.generate(**inputs, max_new_tokens=32768)[0][len(inputs.input_ids[0]):].tolist()
        response = self.tokenizer.decode(response_ids, skip_special_tokens=True)

        # Update history
        self.history.append({"role": "user", "content": user_input})
        self.history.append({"role": "assistant", "content": response})

        return response

# Example Usage
if __name__ == "__main__":
    chatbot = QwenChatbot()

    # First input (without /think or /no_think tags, thinking mode is enabled by default)
    user_input_1 = "How many r's in strawberries?"
    print(f"User: {user_input_1}")
    response_1 = chatbot.generate_response(user_input_1)
    print(f"Bot: {response_1}")
    print("----------------------")

    # Second input with /no_think
    user_input_2 = "Then, how many r's in blueberries? /no_think"
    print(f"User: {user_input_2}")
    response_2 = chatbot.generate_response(user_input_2)
    print(f"Bot: {response_2}")
    print("----------------------")

    # Third input with /think
    user_input_3 = "Really? /think"
    print(f"User: {user_input_3}")
    response_3 = chatbot.generate_response(user_input_3)
    print(f"Bot: {response_3}")
```

Tool use with Qwen-Agent:

```python
from qwen_agent.agents import Assistant

# Define LLM
llm_cfg = {
    'model': 'Qwen3-32B',

    # Use the endpoint provided by Alibaba Model Studio:
    # 'model_type': 'qwen_dashscope',
    # 'api_key': os.getenv('DASHSCOPE_API_KEY'),

    # Use a custom endpoint compatible with OpenAI API:
    'model_server': 'http://localhost:8000/v1',  # api_base
    'api_key': 'EMPTY',

    # Other parameters:
    # 'generate_cfg': {
    #     # Add: When the response content is `<think>this is the thought</think>this is the answer;
    #     # Do not add: When the response has been separated by reasoning_content and content.
    #     'thought_in_content': True,
    # },
}

# Define Tools
tools = [
    {'mcpServers': {  # You can specify the MCP configuration file
            'time': {
                'command': 'uvx',
                'args': ['mcp-server-time', '--local-timezone=Asia/Shanghai']
            },
            "fetch": {
                "command": "uvx",
                "args": ["mcp-server-fetch"]
            }
        }
    },
    'code_interpreter',  # Built-in tools
]

# Define Agent
bot = Assistant(llm=llm_cfg, function_list=tools)

# Streaming generation
messages = [{'role': 'user', 'content': 'https://qwenlm.github.io/blog/ Introduce the latest developments of Qwen'}]
for responses in bot.run(messages=messages):
    pass
print(responses)
```

YaRN settings in `config.json`:

```json
{
    ...,
    "rope_scaling": {
        "type": "yarn",
        "factor": 4.0,
        "original_max_position_embeddings": 32768
    }
}
```

With vLLM:

```shell
vllm serve ... --rope-scaling '{"type":"yarn","factor":4.0,"original_max_position_embeddings":32768}' --max-model-len 131072
```

With SGLang:

```shell
python -m sglang.launch_server ... --json-model-override-args '{"rope_scaling":{"type":"yarn","factor":4.0,"original_max_position_embeddings":32768}}'
```

With `llama-server`:

```shell
llama-server ... --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 32768
```

> Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'original_max_position_embeddings'}

```
@misc{qwen3,
    title  = {Qwen3},
    url    = {https://qwenlm.github.io/blog/qwen3/},
    author = {Qwen Team},
    month  = {April},
    year   = {2025}
}
```
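The token-ID split used earlier in this card (everything up to and including the last occurrence of ID 151668 is treated as thinking content) can be exercised without loading a model. The sketch below applies the same reverse-index logic to toy ID lists; the helper name `split_at_last_marker` and the toy IDs are assumptions made for illustration only.

```python
THINK_END_ID = 151668  # marker token ID used in the card's parsing fragment


def split_at_last_marker(ids, marker=THINK_END_ID):
    """Split ids after the last occurrence of marker.

    Returns (thinking_ids, answer_ids). If the marker is absent,
    thinking_ids is empty and every ID belongs to the answer.
    """
    try:
        # Searching the reversed list finds the last occurrence.
        index = len(ids) - ids[::-1].index(marker)
    except ValueError:
        index = 0
    return ids[:index], ids[index:]


# Toy sequences: small integers stand in for ordinary tokens.
with_thinking = [11, 12, THINK_END_ID, 21, 22]
without_thinking = [21, 22]

print(split_at_last_marker(with_thinking))
print(split_at_last_marker(without_thinking))
```

Falling back to index 0 when the marker is missing means that output produced in non-thinking mode is returned entirely as answer content.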

13,550
7

medgemma-27b-text-it-GGUF

Unsloth Dynamic 2.0 achieves superior accuracy & outperforms other leading quants. Model on Google Cloud Model Garden: MedGemma Model on Hugging Face: MedGemma GitHub repository (supporting code, Colab notebooks, discussions, and issues): MedGemma Quick start notebook: GitHub Fine-tuning notebook: GitHub Patient Education Demo built using MedGemma Support: See Contact License: The use of MedGemma is governed by the Health AI Developer Foundations terms of use. This section describes the MedGemma model and how to use it. MedGemma is a collection of Gemma 3 variants that are trained for performance on medical text and image comprehension. Developers can use MedGemma to accelerate building healthcare-based AI applications. MedGemma currently comes in two variants: a 4B multimodal version and a 27B text-only version. MedGemma 27B has been trained exclusively on medical text and optimized for inference-time computation. MedGemma 27B is only available as an instruction-tuned model. MedGemma variants have been evaluated on a range of clinically relevant benchmarks to illustrate their baseline performance. These include both open benchmark datasets and curated datasets. Developers can fine-tune MedGemma variants for improved performance. Consult the Intended Use section below for more details. Below are some example code snippets to help you quickly get started running the model locally on GPU. If you want to use the model at scale, we recommend that you create a production version using Model Garden. First, install the Transformers library. Gemma 3 is supported starting from transformers 4.50.0. See the following Colab notebooks for examples of how to use MedGemma: To give the model a quick try, running it locally with weights from Hugging Face, see Quick start notebook in Colab. Note that you will need to use Colab Enterprise to run the 27B model without quantization. For an example of fine-tuning the model, see the Fine-tuning notebook in Colab. 
The MedGemma model is built based on Gemma 3 and uses the same decoder-only transformer architecture as Gemma 3. To read more about the architecture, consult the Gemma 3 model card.

- Model type: Decoder-only Transformer architecture, see the Gemma 3 technical report
- Modalities: 4B: Text, vision; 27B: Text only
- Attention mechanism: Utilizes grouped-query attention (GQA)
- Context length: Supports long context, at least 128K tokens
- Key publication: Coming soon
- Model created: May 20, 2025
- Model version: 1.0.0

A technical report is coming soon. In the meantime, if you publish using this model, please cite the Hugging Face model page:

Input:

- Text string, such as a question or prompt
- Total input length of 128K tokens

Output:

- Generated text in response to the input, such as an answer to a question, analysis of image content, or a summary of a document
- Total output length of 8192 tokens

MedGemma was evaluated across a range of different multimodal classification, report generation, visual question answering, and text-based tasks.

MedGemma 4B and text-only MedGemma 27B were evaluated across a range of text-only benchmarks for medical knowledge and reasoning. The MedGemma models outperform their respective base Gemma models across all tested text-only health benchmarks.

| Metric | MedGemma 27B | Gemma 3 27B | MedGemma 4B | Gemma 3 4B |
| :---- | :---- | :---- | :---- | :---- |
| MedQA (4-op) | 89.8 (best-of-5), 87.7 (0-shot) | 74.9 | 64.4 | 50.7 |
| MedMCQA | 74.2 | 62.6 | 55.7 | 45.4 |
| PubMedQA | 76.8 | 73.4 | 73.4 | 68.4 |
| MMLU Med (text only) | 87.0 | 83.3 | 70.0 | 67.2 |
| MedXpertQA (text only) | 26.7 | 15.7 | 14.2 | 11.6 |
| AfriMed-QA | 84.0 | 72.0 | 52.0 | 48.0 |

For all MedGemma 27B results, test-time scaling is used to improve performance.

Our evaluation methods include structured evaluations and internal red-teaming testing of relevant content policies. Red-teaming was conducted by a number of different teams, each with different goals and human evaluation metrics.
These models were evaluated against a number of different categories relevant to ethics and safety, including: Child safety: Evaluation of text-to-text and image-to-text prompts covering child safety policies, including child sexual abuse and exploitation. Content safety: Evaluation of text-to-text and image-to-text prompts covering safety policies, including harassment, violence and gore, and hate speech. Representational harms: Evaluation of text-to-text and image-to-text prompts covering safety policies, including bias, stereotyping, and harmful associations or inaccuracies. General medical harms: Evaluation of text-to-text and image-to-text prompts covering safety policies, including information quality and harmful associations or inaccuracies. In addition to development level evaluations, we conduct "assurance evaluations" which are our "arms-length" internal evaluations for responsibility governance decision making. They are conducted separately from the model development team, to inform decision making about release. High-level findings are fed back to the model team, but prompt sets are held out to prevent overfitting and preserve the results' ability to inform decision making. Notable assurance evaluation results are reported to our Responsibility & Safety Council as part of release review. For all areas of safety testing, we saw safe levels of performance across the categories of child safety, content safety, and representational harms. All testing was conducted without safety filters to evaluate the model capabilities and behaviors. For text-to-text, image-to-text, and audio-to-text, and across both MedGemma model sizes, the model produced minimal policy violations. A limitation of our evaluations was that they included primarily English language prompts. The base Gemma models are pre-trained on a large corpus of text and code data. 
MedGemma 4B utilizes a SigLIP image encoder that has been specifically pre-trained on a variety of de-identified medical data, including radiology images, histopathology images, ophthalmology images, and dermatology images. Its LLM component is trained on a diverse set of medical data, including medical text relevant to radiology images, chest-x rays, histopathology patches, ophthalmology images and dermatology images. MedGemma models have been evaluated on a comprehensive set of clinically relevant benchmarks, including over 22 datasets across 5 different tasks and 6 medical image modalities. These include both open benchmark datasets and curated datasets, with a focus on expert human evaluations for tasks like CXR report generation and radiology VQA. MedGemma utilizes a combination of public and private datasets. This model was trained on diverse public datasets including MIMIC-CXR (chest X-rays and reports), Slake-VQA (multimodal medical images and questions), PAD-UFES-20 (skin lesion images and data), SCIN (dermatology images), TCGA (cancer genomics data), CAMELYON (lymph node histopathology images), PMC-OA (biomedical literature with images), and Mendeley Digital Knee X-Ray (knee X-rays). Additionally, multiple diverse proprietary datasets were licensed and incorporated (described next). Mimic-CXR: MIT Laboratory for Computational Physiology and Beth Israel Deaconess Medical Center (BIDMC). Slake-VQA: The Hong Kong Polytechnic University (PolyU), with collaborators including West China Hospital of Sichuan University and Sichuan Academy of Medical Sciences / Sichuan Provincial People's Hospital. PAD-UFES-20: Federal University of Espírito Santo (UFES), Brazil, through its Dermatological and Surgical Assistance Program (PAD). SCIN: A collaboration between Google Health and Stanford Medicine. TCGA (The Cancer Genome Atlas): A joint effort of National Cancer Institute and National Human Genome Research Institute. 
Data from TCGA are available via the Genomic Data Commons (GDC) CAMELYON: The data was collected from Radboud University Medical Center and University Medical Center Utrecht in the Netherlands. PMC-OA (PubMed Central Open Access Subset): Maintained by the National Library of Medicine (NLM) and National Center for Biotechnology Information (NCBI), which are part of the NIH. MedQA: This dataset was created by a team of researchers led by Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits Mendeley Digital Knee X-Ray: This dataset is from Rani Channamma University, and is hosted on Mendeley Data. AfriMed-QA: This data was developed and led by multiple collaborating organizations and researchers include key contributors: Intron Health, SisonkeBiotik, BioRAMP, Georgia Institute of Technology, and MasakhaneNLP. VQA-RAD: This dataset was created by a research team led by Jason J. Lau, Soumya Gayen, Asma Ben Abacha, and Dina Demner-Fushman and their affiliated institutions (the US National Library of Medicine and National Institutes of Health) MedExpQA: This dataset was created by researchers at the HiTZ Center (Basque Center for Language Technology and Artificial Intelligence). MedXpertQA: This dataset was developed by researchers at Tsinghua University (Beijing, China) and Shanghai Artificial Intelligence Laboratory (Shanghai, China). In addition to the public datasets listed above, MedGemma was also trained on de-identified datasets licensed for research or collected internally at Google from consented participants. Radiology dataset 1: De-identified dataset of different CT studies across body parts from a US-based radiology outpatient diagnostic center network. Ophthalmology dataset 1: De-identified dataset of fundus images from diabetic retinopathy screening. Dermatology dataset 1: De-identified dataset of teledermatology skin condition images (both clinical and dermatoscopic) from Colombia. 
Dermatology dataset 2: De-identified dataset of skin cancer images (both clinical and dermatoscopic) from Australia. Dermatology dataset 3: De-identified dataset of non-diseased skin images from an internal data collection effort. Pathology dataset 1: De-identified dataset of histopathology H&E whole slide images created in collaboration with an academic research hospital and biobank in Europe. Comprises de-identified colon, prostate, and lymph nodes. Pathology dataset 2: De-identified dataset of lung histopathology H&E and IHC whole slide images created by a commercial biobank in the United States. Pathology dataset 3: De-identified dataset of prostate and lymph node H&E and IHC histopathology whole slide images created by a contract research organization in the United States. Pathology dataset 4: De-identified dataset of histopathology, predominantly H\&E whole slide images created in collaboration with a large, tertiary teaching hospital in the United States. Comprises a diverse set of tissue and stain types, predominantly H&E. MIMIC-CXR Johnson, A., Pollard, T., Mark, R., Berkowitz, S., & Horng, S. (2024). MIMIC-CXR Database (version 2.1.0). PhysioNet. Johnson, A.E.W., Pollard, T.J., Berkowitz, S.J. et al. MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Sci Data 6, 317 (2019). Available on Physionet Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). [PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation \[Online\]. 101 (23), pp. E215–e220.](https://pubmed.ncbi.nlm.nih.gov/10851218/) Bo Liu, Li-Ming Zhan, etc. SLAKE: A Semantically-Labeled Knowledge-Enhanced Dataset for Medical Visual Question Answering. 
PAD-UFES-20: A skin lesion dataset composed of patient data and clinical images collected from smartphones The Cancer Genome Atlas Program (TCGA) Babak Ehteshami Bejnordi, etc.: Diagnostic Assessment of Deep Learning Algorithms for Detection of Lymph Node Metastases in Women With Breast Cancer MedQA: https://arxiv.org/abs/2009.13081 Mendeley Digital Knee X-Ray: Gornale, Shivanand; Patravali, Pooja (2020), "Digital Knee X-ray Images", Mendeley Data, V1, doi: 10.17632/t9ndx37v5h.1 AfriMed-QA: https://arxiv.org/abs/2411.15640 VQA-RAD: Lau, J., Gayen, S., Ben Abacha, A. et al. A dataset of clinically generated visual questions and answers about radiology images. Sci Data 5, 180251 (2018). https://doi.org/10.1038/sdata.2018.251 MedExpQA: Multilingual benchmarking of Large Language Models for Medical Question Answering MedXpertQA: arXiv:2501.18362v2 Google and partnerships utilize datasets that have been rigorously anonymized or de-identified to ensure the protection of individual research participants and patient privacy JAX allows researchers to take advantage of the latest generation of hardware, including TPUs, for faster and more efficient training of large models. MedGemma is an open multimodal generative AI model intended to be used as a starting point that enables more efficient development of downstream healthcare applications involving medical text and images. MedGemma is intended for developers in the life sciences and healthcare space. Developers are responsible for training, adapting and making meaningful changes to MedGemma to accomplish their specific intended use. MedGemma models can be fine-tuned by developers using their own proprietary data for their specific tasks or solutions. MedGemma is based on Gemma 3 and has been further trained on medical images and text. MedGemma enables further development in any medical context (image and textual), however the model was pre-trained using chest X-ray, pathology, dermatology, and fundus images. 
Examples of tasks within MedGemma's training include visual question answering pertaining to medical images, such as radiographs, or providing answers to textual medical questions. Full details of all the tasks MedGemma has been evaluated can be found in an upcoming technical report. Provides strong baseline medical image and text comprehension for models of its size. This strong performance makes it efficient to adapt for downstream healthcare-based use cases, compared to models of similar size without medical data pre-training. This adaptation may involve prompt engineering, grounding, agentic orchestration or fine-tuning depending on the use case, baseline validation requirements, and desired performance characteristics. MedGemma is not intended to be used without appropriate validation, adaptation and/or making meaningful modification by developers for their specific use case. The outputs generated by MedGemma are not intended to directly inform clinical diagnosis, patient management decisions, treatment recommendations, or any other direct clinical practice applications. Performance benchmarks highlight baseline capabilities on relevant benchmarks, but even for image and text domains that constitute a substantial portion of training data, inaccurate model output is possible. All outputs from MedGemma should be considered preliminary and require independent verification, clinical correlation, and further investigation through established research and development methodologies. MedGemma's multimodal capabilities have been primarily evaluated on single-image tasks. MedGemma has not been evaluated in use cases that involve comprehension of multiple images. MedGemma has not been evaluated or optimized for multi-turn applications. MedGemma's training may make it more sensitive to the specific prompt used than Gemma 3. 
When adapting MedGemma, developers should consider the following:

- Bias in validation data: As with any research, developers should ensure that any downstream application is validated to understand performance using data that is appropriately representative of the intended use setting for the specific application (e.g., age, sex, gender, condition, imaging device, etc.).
- Data contamination concerns: When evaluating the generalization capabilities of a large model like MedGemma in a medical context, there is a risk of data contamination, where the model might have inadvertently seen related medical information during its pre-training, potentially overestimating its true ability to generalize to novel medical concepts. Developers should validate MedGemma on datasets not publicly available or otherwise made available to non-institutional researchers to mitigate this risk.
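To make the text-only benchmark table earlier in this card easier to read, the sketch below computes the point gain of MedGemma 27B over Gemma 3 27B for each listed metric. The scores are copied from that table (using the 0-shot MedQA figure for MedGemma 27B); the dictionary layout and variable names are illustrative assumptions.

```python
# (MedGemma 27B, Gemma 3 27B) scores copied from the benchmark table.
scores_27b = {
    "MedQA (4-op)": (87.7, 74.9),  # 0-shot figure for MedGemma 27B
    "MedMCQA": (74.2, 62.6),
    "PubMedQA": (76.8, 73.4),
    "MMLU Med (text only)": (87.0, 83.3),
    "MedXpertQA (text only)": (26.7, 15.7),
    "AfriMed-QA": (84.0, 72.0),
}

# Point gain per metric, rounded to one decimal to avoid float noise.
gains = {
    metric: round(medgemma - base, 1)
    for metric, (medgemma, base) in scores_27b.items()
}

for metric, gain in gains.items():
    print(f"{metric}: +{gain} points")

print("smallest gain:", min(gains.values()))
```

Every gain is positive, which is consistent with the card's statement that the MedGemma models outperform their base Gemma models on all tested text-only health benchmarks.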

13,544
57

llama-3-8b-Instruct

Finetune Mistral, Gemma, Llama 2-5x faster with 70% less memory via Unsloth!

We have a Google Colab Tesla T4 notebook for Llama-3 8b here: https://colab.research.google.com/drive/135ced7oHytdxu3N2DNe1Z0kqjyYIkDXp?usp=sharing

All notebooks are beginner friendly! Add your dataset, click "Run All", and you'll get a 2x faster finetuned model which can be exported to GGUF, vLLM or uploaded to Hugging Face.

| Unsloth supports | Free Notebooks | Performance | Memory use |
|-----------------|----------------|-------------|----------|
| Llama-3 8b | ▶️ Start on Colab | 2.4x faster | 58% less |
| Gemma 7b | ▶️ Start on Colab | 2.4x faster | 58% less |
| Mistral 7b | ▶️ Start on Colab | 2.2x faster | 62% less |
| Llama-2 7b | ▶️ Start on Colab | 2.2x faster | 43% less |
| TinyLlama | ▶️ Start on Colab | 3.9x faster | 74% less |
| CodeLlama 34b A100 | ▶️ Start on Colab | 1.9x faster | 27% less |
| Mistral 7b 1xT4 | ▶️ Start on Kaggle | 5x faster\* | 62% less |
| DPO - Zephyr | ▶️ Start on Colab | 1.9x faster | 19% less |

- This conversational notebook is useful for ShareGPT ChatML / Vicuna templates.
- This text completion notebook is for raw text. This DPO notebook replicates Zephyr.
- \* Kaggle has 2x T4s, but we use 1. Due to overhead, 1x T4 is 5x faster.

llama
13,474
65

Phi-4-mini-instruct-GGUF

This is Phi-4-mini-instruct with our BUG FIXES. See our collection for versions of Phi-4 with our bug fixes including GGUF & 4-bit formats. Unsloth's Phi-4 Dynamic Quants are selectively quantized, greatly improving accuracy over standard 4-bit.

Finetune your own Reasoning model like R1 with Unsloth! We have a free Google Colab notebook for turning Phi-4 into a reasoning model: https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Phi4(14B)-GRPO.ipynb

All notebooks are beginner friendly! Add your dataset, click "Run All", and you'll get a 2x faster finetuned model which can be exported to GGUF, vLLM or uploaded to Hugging Face.

| Unsloth supports | Free Notebooks | Performance | Memory use |
|-----------------|----------------|-------------|----------|
| GRPO with Phi-4 | ▶️ Start on Colab | 2x faster | 80% less |
| Llama-3.2 (3B) | ▶️ Start on Colab | 2.4x faster | 58% less |
| Llama-3.2 (11B vision) | ▶️ Start on Colab | 2x faster | 60% less |
| Qwen2 VL (7B) | ▶️ Start on Colab | 1.8x faster | 60% less |
| Qwen2.5 (7B) | ▶️ Start on Colab | 2x faster | 60% less |
| Llama-3.1 (8B) | ▶️ Start on Colab | 2.4x faster | 58% less |
| Phi-4 (14B) | ▶️ Start on Colab | 2x faster | 50% less |
| Gemma 2 (9B) | ▶️ Start on Colab | 2.4x faster | 58% less |
| Mistral (7B) | ▶️ Start on Colab | 2.2x faster | 62% less |

- This Llama 3.2 conversational notebook is useful for ShareGPT ChatML / Vicuna templates.
- This text completion notebook is for raw text. This DPO notebook replicates Zephyr.
- \* Kaggle has 2x T4s, but we use 1. Due to overhead, 1x T4 is 5x faster.

Unsloth bug fixes:

1. Padding and EOS tokens are the same - fixed this.
2. Chat template had an extra EOS token - removed this. Otherwise the extra token shows up during inference.
3. EOS token should be `<|end|>` not `<|endoftext|>`. Otherwise it'll terminate at `<|endoftext|>`.
4. Changed `unk_token` to � from EOS.

Phi-4-mini-instruct is a lightweight open model built upon synthetic data and filtered publicly available websites, with a focus on high-quality, reasoning-dense data. The model belongs to the Phi-4 model family and supports 128K token context length. The model underwent an enhancement process, incorporating both supervised fine-tuning and direct preference optimization to support precise instruction adherence and robust safety measures.

- 📰 Phi-4-mini Microsoft Blog
- 📖 Phi-4-mini Technical Report
- 👩‍🍳 Phi Cookbook
- 🏡 Phi Portal
- 🖥️ Try It: Azure, Huggingface

Phi-4: [mini-instruct | onnx]; multimodal-instruct;

The model is intended for broad multilingual commercial and research use. The model provides uses for general purpose AI systems and applications which require:

1) Memory/compute constrained environments
2) Latency bound scenarios
3) Strong reasoning (especially math and logic).

The model is designed to accelerate research on language and multimodal models, for use as a building block for generative AI powered features.

The model is not specifically designed or evaluated for all downstream purposes. Developers should consider common limitations of language models, as well as performance differences across languages, as they select use cases, and evaluate and mitigate for accuracy, safety, and fairness before using within a specific downstream use case, particularly for high-risk scenarios. Developers should be aware of and adhere to applicable laws or regulations (including but not limited to privacy, trade compliance laws, etc.) that are relevant to their use case. Nothing contained in this Model Card should be interpreted as or deemed a restriction or modification to the license the model is released under.
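The card lists "padding and EOS tokens are the same" among the Unsloth bug fixes without saying why it matters. One commonly cited reason, offered here as an assumption rather than as Unsloth's stated rationale, is that training pipelines often exclude padding from the loss by matching on the padding token ID; if EOS shares that ID, the genuine end-of-sequence token is masked too, and the model gets no training signal for stopping. A minimal sketch with made-up token IDs (the `-100` ignore value follows the PyTorch cross-entropy convention):

```python
IGNORE_INDEX = -100  # label value skipped by the loss (PyTorch convention)


def mask_padding(input_ids, pad_id):
    """Build labels by masking every position whose ID equals pad_id."""
    return [IGNORE_INDEX if tok == pad_id else tok for tok in input_ids]


EOS_ID = 2                     # hypothetical EOS token ID
sequence = [5, 6, 7, EOS_ID]   # hypothetical tokens ending with EOS

# Case 1: padding reuses the EOS ID, so the real EOS is masked as well.
shared_labels = mask_padding(sequence + [EOS_ID, EOS_ID], pad_id=EOS_ID)

# Case 2: a distinct padding ID keeps the real EOS in the labels.
PAD_ID = 0                     # hypothetical dedicated padding ID
distinct_labels = mask_padding(sequence + [PAD_ID, PAD_ID], pad_id=PAD_ID)

print("shared pad/EOS labels:  ", shared_labels)
print("distinct pad/EOS labels:", distinct_labels)
```

In the first case the EOS label disappears entirely, while the second case preserves it; giving padding its own ID is the simplest way to avoid the problem.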
This release of Phi-4-mini-instruct is based on valuable user feedback from the Phi-3 series. The Phi-4-mini model employs a new architecture for efficiency, a larger vocabulary for multilingual support, and better post-training techniques for instruction following and function calling, as well as additional data, leading to substantial gains on key capabilities. It is anticipated that most use cases will benefit from this release, but users are encouraged to test in their particular AI applications. The enthusiastic support for the Phi-4 series is greatly appreciated. Feedback on Phi-4-mini-instruct is welcomed and crucial to the model's evolution and improvement.

To understand the capabilities, the 3.8B-parameter Phi-4-mini-instruct model was compared with a set of models over a variety of benchmarks using an internal benchmark platform (see Appendix A for benchmark methodology). A high-level overview of the model quality is as follows:

| Benchmark | Similar size | | | | | 2x size | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | Phi-4 mini-Ins | Phi-3.5-mini-Ins | Llama-3.2-3B-Ins | Mistral-3B | Qwen2.5-3B-Ins | Qwen2.5-7B-Ins | Mistral-8B-2410 | Llama-3.1-8B-Ins | Llama-3.1-Tulu-3-8B | Gemma2-9B-Ins | GPT-4o-mini-2024-07-18 |
| Popular aggregated benchmark | | | | | | | | | | | |
| Arena Hard | 32.8 | 34.4 | 17.0 | 26.9 | 32.0 | 55.5 | 37.3 | 25.7 | 42.7 | 43.7 | 53.7 |
| BigBench Hard (0-shot, CoT) | 70.4 | 63.1 | 55.4 | 51.2 | 56.2 | 72.4 | 53.3 | 63.4 | 55.5 | 65.7 | 80.4 |
| MMLU (5-shot) | 67.3 | 65.5 | 61.8 | 60.8 | 65.0 | 72.6 | 63.0 | 68.1 | 65.0 | 71.3 | 77.2 |
| MMLU-Pro (0-shot, CoT) | 52.8 | 47.4 | 39.2 | 35.3 | 44.7 | 56.2 | 36.6 | 44.0 | 40.9 | 50.1 | 62.8 |
| Reasoning | | | | | | | | | | | |
| ARC Challenge (10-shot) | 83.7 | 84.6 | 76.1 | 80.3 | 82.6 | 90.1 | 82.7 | 83.1 | 79.4 | 89.8 | 93.5 |
| BoolQ (2-shot) | 81.2 | 77.7 | 71.4 | 79.4 | 65.4 | 80.0 | 80.5 | 82.8 | 79.3 | 85.7 | 88.7 |
| GPQA (0-shot, CoT) | 25.2 | 26.6 | 24.3 | 24.4 | 23.4 | 30.6 | 26.3 | 26.3 | 29.9 | 39.1 | 41.1 |
| HellaSwag (5-shot) | 69.1 | 72.2 | 77.2 | 74.6 | 74.6 | 80.0 | 73.5 | 72.8 | 80.9 | 87.1 | 88.7 |
| OpenBookQA (10-shot) | 79.2 | 81.2 | 72.6 | 79.8 | 79.3 | 82.6 | 80.2 | 84.8 | 79.8 | 90.0 | 90.0 |
| PIQA (5-shot) | 77.6 | 78.2 | 68.2 | 73.2 | 72.6 | 76.2 | 81.2 | 83.2 | 78.3 | 83.7 | 88.7 |
| Social IQA (5-shot) | 72.5 | 75.1 | 68.3 | 73.9 | 75.3 | 75.3 | 77.6 | 71.8 | 73.4 | 74.7 | 82.9 |
| TruthfulQA (MC2) (10-shot) | 66.4 | 65.2 | 59.2 | 62.9 | 64.3 | 69.4 | 63.0 | 69.2 | 64.1 | 76.6 | 78.2 |
| Winogrande (5-shot) | 67.0 | 72.2 | 53.2 | 59.8 | 63.3 | 71.1 | 63.1 | 64.7 | 65.4 | 74.0 | 76.9 |
| Multilingual | | | | | | | | | | | |
| Multilingual MMLU (5-shot) | 49.3 | 51.8 | 48.1 | 46.4 | 55.9 | 64.4 | 53.7 | 56.2 | 54.5 | 63.8 | 72.9 |
| MGSM (0-shot, CoT) | 63.9 | 49.6 | 44.6 | 44.6 | 53.5 | 64.5 | 56.7 | 56.7 | 58.6 | 75.1 | 81.7 |
| Math | | | | | | | | | | | |
| GSM8K (8-shot, CoT) | 88.6 | 76.9 | 75.6 | 80.1 | 80.6 | 88.7 | 81.9 | 82.4 | 84.3 | 84.9 | 91.3 |
| MATH (0-shot, CoT) | 64.0 | 49.8 | 46.7 | 41.8 | 61.7 | 60.4 | 41.6 | 47.6 | 46.1 | 51.3 | 70.2 |
| Overall | 63.5 | 60.5 | 56.2 | 56.9 | 60.1 | 67.9 | 60.2 | 62.3 | 60.9 | 65.0 | 75.5 |

Overall, the model, with only 3.8B parameters, achieves a similar level of multilingual language understanding and reasoning ability as much larger models. However, it is still fundamentally limited by its size for certain tasks. The model simply does not have the capacity to store much factual knowledge, so users may experience factual incorrectness. It may be possible to mitigate this weakness by augmenting Phi-4 with a search engine, particularly when using the model under RAG settings.
Phi-4-mini-instruct supports a vocabulary size of up to `200064` tokens. The tokenizer files already provide placeholder tokens that can be used for downstream fine-tuning, but they can also be extended up to the model's vocabulary size.

Given the nature of the training data, the Phi-4-mini-instruct model is best suited for prompts using specific formats. Below are the two primary formats:

This format is used for general conversation and instructions: `<|system|>Insert System Message<|end|><|user|>Insert User Message<|end|><|assistant|>`

This format is used when the user wants the model to provide function calls based on the given tools. The user should provide the available tools in the system prompt, wrapped by `<|tool|>` and `<|/tool|>` tokens. The tools should be specified in JSON format, using a JSON dump structure. Example:

`<|system|>You are a helpful assistant with some tools.<|tool|>[{"name": "get_weather_updates", "description": "Fetches weather updates for a given city using the RapidAPI Weather API.", "parameters": {"city": {"description": "The name of the city for which to retrieve weather information.", "type": "str", "default": "London"}}}]<|/tool|><|end|><|user|>What is the weather like in Paris today?<|end|><|assistant|>`

To perform inference using vLLM, you can use the following code snippet:

The Phi-4 family has been integrated in the `4.49.0` version of `transformers`. The current `transformers` version can be verified with: `pip list | grep transformers`.

Phi-4-mini-instruct is also available in Azure AI Studio.

After obtaining the Phi-4-mini-instruct model checkpoints, users can use this sample code for inference.

Like other language models, the Phi family of models can potentially behave in ways that are unfair, unreliable, or offensive. Some of the limiting behaviors to be aware of include:

+ Quality of Service: The Phi models are trained primarily on English text and some additional multilingual text. Languages other than English will experience worse performance as well as performance disparities across non-English languages.
English language varieties with less representation in the training data might experience worse performance than standard American English. + Multilingual performance and safety gaps: We believe it is important to make language models more widely available across different languages, but the Phi 4 models still exhibit challenges common across multilingual releases. As with any deployment of LLMs, developers will be better positioned to test for performance or safety gaps for their linguistic and cultural context and customize the model with additional fine-tuning and appropriate safeguards. + Representation of Harms & Perpetuation of Stereotypes: These models can over- or under-represent groups of people, erase representation of some groups, or reinforce demeaning or negative stereotypes. Despite safety post-training, these limitations may still be present due to differing levels of representation of different groups, cultural contexts, or prevalence of examples of negative stereotypes in training data that reflect real-world patterns and societal biases. + Inappropriate or Offensive Content: These models may produce other types of inappropriate or offensive content, which may make it inappropriate to deploy for sensitive contexts without additional mitigations that are specific to the case. + Information Reliability: Language models can generate nonsensical content or fabricate content that might sound reasonable but is inaccurate or outdated. + Limited Scope for Code: The majority of Phi 4 training data is based in Python and uses common packages such as "typing, math, random, collections, datetime, itertools". If the model generates Python scripts that utilize other packages or scripts in other languages, it is strongly recommended that users manually verify all API uses. 
+ Long Conversation: Phi 4 models, like other models, can in some cases generate responses that are repetitive, unhelpful, or inconsistent in very long chat sessions in both English and non-English languages. Developers are encouraged to place appropriate mitigations, like limiting conversation turns to account for the possible conversational drift. Developers should apply responsible AI best practices, including mapping, measuring, and mitigating risks associated with their specific use case and cultural, linguistic context. Phi 4 family of models are general purpose models. As developers plan to deploy these models for specific use cases, they are encouraged to fine-tune the models for their use case and leverage the models as part of broader AI systems with language-specific safeguards in place. Important areas for consideration include: + Allocation: Models may not be suitable for scenarios that could have consequential impact on legal status or the allocation of resources or life opportunities (ex: housing, employment, credit, etc.) without further assessments and additional debiasing techniques. + High-Risk Scenarios: Developers should assess the suitability of using models in high-risk scenarios where unfair, unreliable or offensive outputs might be extremely costly or lead to harm. This includes providing advice in sensitive or expert domains where accuracy and reliability are critical (ex: legal or health advice). Additional safeguards should be implemented at the application level according to the deployment context. + Misinformation: Models may produce inaccurate information. Developers should follow transparency best practices and inform end-users they are interacting with an AI system. At the application level, developers can build feedback mechanisms and pipelines to ground responses in use-case specific, contextual information, a technique known as Retrieval Augmented Generation (RAG). 
+ Generation of Harmful Content: Developers should assess outputs for their context and use available safety classifiers or custom solutions appropriate for their use case. + Misuse: Other forms of misuse such as fraud, spam, or malware production may be possible, and developers should ensure that their applications do not violate applicable laws and regulations. + Architecture: Phi-4-mini-instruct has 3.8B parameters and is a dense decoder-only Transformer model. When compared with Phi-3.5-mini, the major changes with Phi-4-mini-instruct are 200K vocabulary, grouped-query attention, and shared input and output embedding. + Inputs: Text. It is best suited for prompts using the chat format. + Context length: 128K tokens + GPUs: 512 A100-80G + Training time: 21 days + Training data: 5T tokens + Outputs: Generated text in response to the input + Dates: Trained between November and December 2024 + Status: This is a static model trained on offline datasets with the cutoff date of June 2024 for publicly available data. + Supported languages: Arabic, Chinese, Czech, Danish, Dutch, English, Finnish, French, German, Hebrew, Hungarian, Italian, Japanese, Korean, Norwegian, Polish, Portuguese, Russian, Spanish, Swedish, Thai, Turkish, Ukrainian + Release date: February 2025 Phi-4-mini’s training data includes a wide variety of sources, totaling 5 trillion tokens, and is a combination of 1) publicly available documents filtered for quality, selected high-quality educational data, and code 2) newly created synthetic, “textbook-like” data for the purpose of teaching math, coding, common sense reasoning, general knowledge of the world (e.g., science, daily activities, theory of mind, etc.) 3) high quality chat format supervised data covering various topics to reflect human preferences on different aspects such as instruct-following, truthfulness, honesty and helpfulness. 
Focus was placed on the quality of data that could potentially improve the reasoning ability for the model, and the publicly available documents were filtered to contain a preferred level of knowledge. As an example, the result of a game in premier league on a particular day might be good training data for frontier models, but such information was removed to leave more model capacity for reasoning for the model’s small size. More details about data can be found in the Phi-4-mini-instruct technical report. The decontamination process involved normalizing and tokenizing the dataset, then generating and comparing n-grams between the target dataset and benchmark datasets. Samples with matching n-grams above a threshold were flagged as contaminated and removed from the dataset. A detailed contamination report was generated, summarizing the matched text, matching ratio, and filtered results for further analysis. A basic example of multi-GPUs supervised fine-tuning (SFT) with TRL and Accelerate modules is provided here. Various evaluation techniques including red teaming, adversarial conversation simulations, and multilingual safety evaluation benchmark datasets were leveraged to evaluate Phi-4 models’ propensity to produce undesirable outputs across multiple languages and risk categories. Several approaches were used to compensate for the limitations of one approach alone. Findings across the various evaluation methods indicate that safety post-training that was done as detailed in the Phi 3 Safety Post-Training paper had a positive impact across multiple languages and risk categories as observed by refusal rates (refusal to output undesirable outputs) and robustness to jailbreak techniques. Details on prior red team evaluations across Phi models can be found in the Phi 3 Safety Post-Training paper. 
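The decontamination procedure described above (normalize, tokenize, build n-grams, compare against benchmark n-grams, drop samples above a threshold, and keep a report) can be sketched roughly as follows. This is a minimal illustration, not the pipeline used for Phi-4-mini: the function names, the whitespace tokenizer, the n-gram size, and the threshold are all assumptions chosen for the example.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse every run of non-alphanumeric characters into one space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def ngrams(text: str, n: int) -> set:
    """Return the set of word n-grams of the normalized text (hypothetical whitespace tokenizer)."""
    tokens = normalize(text).split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def matching_ratio(sample: str, benchmark_ngrams: set, n: int) -> float:
    """Fraction of the sample's n-grams that also occur in the benchmark data."""
    sample_ngrams = ngrams(sample, n)
    if not sample_ngrams:
        return 0.0
    return len(sample_ngrams & benchmark_ngrams) / len(sample_ngrams)


def decontaminate(samples, benchmarks, n=8, threshold=0.5):
    """Split samples into kept ones and a contamination report.

    n and threshold are illustrative defaults, not values from the source.
    """
    benchmark_ngrams = set()
    for benchmark in benchmarks:
        benchmark_ngrams |= ngrams(benchmark, n)

    kept, report = [], []
    for sample in samples:
        ratio = matching_ratio(sample, benchmark_ngrams, n)
        if ratio > threshold:
            # Flagged as contaminated: record it in the report and drop it.
            report.append({"sample": sample, "matching_ratio": ratio})
        else:
            kept.append(sample)
    return kept, report


if __name__ == "__main__":
    benchmarks = ["What is the capital of France? The capital is Paris."]
    samples = [
        "Question: what is the capital of France? The capital is Paris.",
        "Totally unrelated sentence about cooking pasta at home tonight.",
    ]
    kept, report = decontaminate(samples, benchmarks, n=3, threshold=0.5)
    print(kept)
    print(report)
```

A production pipeline would likely use the model's own tokenizer and a hashed n-gram index rather than in-memory sets, but the flag-and-report flow is the same.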
For this release, the red team tested the model in English, Chinese, Japanese, Spanish, Portuguese, Arabic, Thai, and Russian for the following potential harms: Hate Speech and Bias, Violent Crimes, Specialized Advice, and Election Information. Their findings indicate that the model is resistant to jailbreak techniques across languages, but that language-specific attack prompts leveraging cultural context can cause the model to output harmful content. Another insight was that in function calling scenarios, the model could sometimes hallucinate function names or URLs. The model may also be more susceptible to longer multi-turn jailbreak techniques across both English and non-English languages. These findings highlight the need for industry-wide investment in the development of high-quality safety evaluation datasets across multiple languages, including low resource languages, and risk areas that account for cultural nuances where those languages are spoken.

Hardware

Note that by default, the Phi-4-mini-instruct model uses flash attention, which requires certain types of GPU hardware to run. We have tested on the following GPU types:
+ NVIDIA A100
+ NVIDIA A6000
+ NVIDIA H100

If you want to run the model on NVIDIA V100 or earlier generation GPUs, call `AutoModelForCausalLM.from_pretrained()` with `attn_implementation="eager"`.

License

The model is licensed under the MIT license.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft’s Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties’ policies.

We include a brief word on methodology here, and in particular, how we think about optimizing prompts.
In an ideal world, we would never change any prompts in our benchmarks to ensure it is always an apples-to-apples comparison when comparing different models. Indeed, this is our default approach, and is the case in the vast majority of models we have run to date. There are, however, some exceptions to this. In some cases, we see a model that performs worse than expected on a given eval due to a failure to respect the output format. For example:
+ A model may refuse to answer questions (for no apparent reason), or in coding tasks models may prefix their response with “Sure, I can help with that. …” which may break the parser. In such cases, we have opted to try different system messages (e.g. “You must always respond to a question” or “Get to the point!”).
+ With some models, we observed that few shots actually hurt model performance. In this case we did allow running the benchmarks with 0-shots for all cases.
+ We have tools to convert between chat and completions APIs. When converting a chat prompt to a completion prompt, some models have different keywords e.g. Human vs User. In these cases, we do allow for model-specific mappings for chat to completion prompts.

However, we do not:
+ Pick different few-shot examples. Few shots will always be the same when comparing different models.
+ Change prompt format: e.g. if it is an A/B/C/D multiple choice, we do not tweak this to 1/2/3/4 multiple choice.

The model was evaluated across a breadth of public and internal benchmarks to understand the model’s capabilities under multiple tasks and conditions. While most evaluations use English, a leading multilingual benchmark covering performance in select languages was also incorporated.
More specifically,
+ Reasoning:
  + Winogrande: commonsense reasoning around pronoun resolution
  + PIQA: physical commonsense reasoning around everyday situations
  + ARC-challenge: grade-school multiple choice science questions
  + GPQA: very hard questions written and validated by experts in biology, physics, and chemistry
  + MedQA: medical question answering
  + Social IQA: social commonsense intelligence
  + BoolQ: natural questions from context
  + TruthfulQA: grounded reasoning
+ Language understanding:
  + HellaSwag: commonsense natural language inference around everyday events
  + ANLI: adversarial natural language inference
+ Function calling:
  + Berkeley function calling: function and tool call
  + Internal function calling benchmarks
+ World knowledge:
  + TriviaQA: trivia questions on general topics
+ Math:
  + GSM8K: grade-school math word problems
  + GSM8K Hard: grade-school math word problems with large values and some absurdity
  + MATH: challenging competition math problems
+ Code:
  + HumanEval, HumanEval+, MBPP, MBPP+: Python coding tasks
  + LiveCodeBench, LiveBench: contamination-free code tasks
  + BigCode Bench: challenging programming tasks
  + Spider: SQL query tasks
  + Internal coding benchmarks
+ Instructions following:
  + IFEval: verifiable instructions
  + Internal instructions following benchmarks
+ Multilingual:
  + MGSM: multilingual grade-school math
  + Multilingual MMLU and MMLU-pro
  + MEGA: multilingual NLP tasks
+ Popular aggregated datasets: MMLU, MMLU-pro, BigBench-Hard, AGI Eval
+ Multi-turn conversations:
  + Data generated by in-house adversarial conversation simulation tool
+ Single-turn trustworthiness evaluation:
  + DecodingTrust: a collection of trustworthiness benchmarks in eight different perspectives
  + XSTest: exaggerated safety evaluation
  + Toxigen: adversarial and hate speech detection
+ Red Team:
  + Responses to prompts provided by AI Red Team at Microsoft
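Returning to the hardware note earlier in this card: the choice between flash attention and the eager fallback can be made programmatically. The sketch below is only an illustration. The helper names are hypothetical, and the rule "compute capability 8.0 or newer" is an assumption based on FlashAttention-2 targeting Ampere-class and newer GPUs; the card itself only lists A100, A6000, and H100 as tested and V100 or earlier as needing `attn_implementation="eager"`.

```python
def choose_attn_implementation(compute_capability):
    """Pick an attention backend from a CUDA compute capability such as (8, 0).

    Assumption: GPUs with major version >= 8 (e.g. A100, A6000, H100) can use
    flash attention, while older ones (e.g. V100 at 7.0) need the eager path.
    """
    major, _minor = compute_capability
    return "flash_attention_2" if major >= 8 else "eager"


def load_phi4_mini(compute_capability, model_id="microsoft/Phi-4-mini-instruct"):
    """Hypothetical loader; requires the `transformers` package and enough GPU memory."""
    # Imported lazily so the helper above can be used without transformers installed.
    from transformers import AutoModelForCausalLM

    return AutoModelForCausalLM.from_pretrained(
        model_id,
        attn_implementation=choose_attn_implementation(compute_capability),
        torch_dtype="auto",
    )


if __name__ == "__main__":
    for name, capability in [("V100", (7, 0)), ("A100", (8, 0)), ("H100", (9, 0))]:
        print(name, choose_attn_implementation(capability))
```

With PyTorch installed, `torch.cuda.get_device_capability()` returns a tuple in the form this helper expects.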

license:mit
13,420
56

Llama-3.2-3B-bnb-4bit

See our collection for all versions of Llama 3.2 including GGUF, 4-bit and original 16-bit formats.

Finetune Llama 3.2, Gemma 2, Mistral 2-5x faster with 70% less memory via Unsloth! We have a free Google Colab Tesla T4 notebook for Llama 3.2 (3B) here: https://colab.research.google.com/drive/1T5-zKWM5OD21QHwXHiV9ixTRR7k3iB9?usp=sharing

unsloth/Llama-3.2-3B-bnb-4bit

For more details on the model, please go to Meta's original model card.

All notebooks are beginner friendly! Add your dataset, click "Run All", and you'll get a 2x faster finetuned model which can be exported to GGUF, vLLM or uploaded to Hugging Face.

| Unsloth supports | Free Notebooks | Performance | Memory use |
|-----------------|----------------|-------------|----------|
| Llama-3.2 (3B) | ▶️ Start on Colab | 2.4x faster | 58% less |
| Llama-3.2 (11B vision) | ▶️ Start on Colab | 2.4x faster | 58% less |
| Llama-3.1 (8B) | ▶️ Start on Colab | 2.4x faster | 58% less |
| Phi-3.5 (mini) | ▶️ Start on Colab | 2x faster | 50% less |
| Gemma 2 (9B) | ▶️ Start on Colab | 2.4x faster | 58% less |
| Mistral (7B) | ▶️ Start on Colab | 2.2x faster | 62% less |
| DPO - Zephyr | ▶️ Start on Colab | 1.9x faster | 19% less |

- This conversational notebook is useful for ShareGPT ChatML / Vicuna templates.
- This text completion notebook is for raw text. This DPO notebook replicates Zephyr.
- \* Kaggle has 2x T4s, but we use 1. Due to overhead, 1x T4 is 5x faster.

Special Thanks

A huge thank you to the Meta and Llama team for creating and releasing these models.

The Meta Llama 3.2 collection of multilingual large language models (LLMs) is a collection of pretrained and instruction-tuned generative models in 1B and 3B sizes (text in/text out). The Llama 3.2 instruction-tuned text only models are optimized for multilingual dialogue use cases, including agentic retrieval and summarization tasks.
They outperform many of the available open source and closed chat models on common industry benchmarks. Model Architecture: Llama 3.2 is an auto-regressive language model that uses an optimized transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align with human preferences for helpfulness and safety. Supported languages: English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai are officially supported. Llama 3.2 has been trained on a broader collection of languages than these 8 supported languages. Developers may fine-tune Llama 3.2 models for languages beyond these supported languages, provided they comply with the Llama 3.2 Community License and the Acceptable Use Policy. Developers are always expected to ensure that their deployments, including those that involve additional languages, are completed safely and responsibly. Llama 3.2 family of models Token counts refer to pretraining data only. All model versions use Grouped-Query Attention (GQA) for improved inference scalability. Status: This is a static model trained on an offline dataset. Future versions may be released that improve model capabilities and safety. Where to send questions or comments about the model Instructions on how to provide feedback or comments on the model can be found in the model README. For more technical information about generation parameters and recipes for how to use Llama 3.1 in applications, please go here.

NaNK
llama
13,198
20

csm-1b

Unsloth Dynamic 2.0 achieves superior accuracy & outperforms other leading quants. - Fine-tune TTS models for free using our Google Colab notebooks here! - Read our Blog about TTS support: unsloth.ai/blog/tts | Unsloth supports | Free Notebooks | Performance | Memory use | |-----------------|--------------------------------------------------------------------------------------------------------------------------|-------------|----------| | Sesame-CSM-1B | ▶️ Start on Colab-TTS.ipynb) | 1.5x faster | 58% less | | Whisper Large V3 | ▶️ Start on Colab | 1.5x faster | 50% less | | Qwen3 (14B) | ▶️ Start on Colab | 2x faster | 70% less | | Llama 3.2 Vision (11B) | ▶️ Start on Colab-Vision.ipynb) | 1.8x faster | 50% less | 2025/03/13 - We are releasing the 1B CSM variant. Code is available on GitHub: SesameAILabs/csm. CSM (Conversational Speech Model) is a speech generation model from Sesame that generates RVQ audio codes from text and audio inputs. The model architecture employs a Llama backbone and a smaller audio decoder that produces Mimi audio codes. A fine-tuned variant of CSM powers the interactive voice demo shown in our blog post. A hosted HuggingFace space is also available for testing audio generation. CSM sounds best when provided with context. You can prompt or provide context to the model using a `Segment` for each speaker utterance. The model open sourced here is a base generation model. It is capable of producing a variety of voices, but it has not been fine-tuned on any specific voice. CSM is trained to be an audio generation model and not a general purpose multimodal LLM. It cannot generate text. We suggest using a separate LLM for text generation. The model has some capacity for non-English languages due to data contamination in the training data, but it likely won't do well. This project provides a high-quality speech generation model for research and educational purposes. 
While we encourage responsible and ethical use, we explicitly prohibit the following: - Impersonation or Fraud: Do not use this model to generate speech that mimics real individuals without their explicit consent. - Misinformation or Deception: Do not use this model to create deceptive or misleading content, such as fake news or fraudulent calls. - Illegal or Harmful Activities: Do not use this model for any illegal, harmful, or malicious purposes. By using this model, you agree to comply with all applicable laws and ethical guidelines. We are not responsible for any misuse, and we strongly condemn unethical applications of this technology. Authors Johan Schalkwyk, Ankit Kumar, Dan Lyth, Sefik Emre Eskimez, Zack Hodari, Cinjon Resnick, Ramon Sanabria, Raven Jiang, and the Sesame team.

NaNK
license:apache-2.0
12,793
17

Nanonets-OCR-s-GGUF

12,750
58

Mistral-Small-24B-Instruct-2501-unsloth-bnb-4bit

NaNK
license:apache-2.0
12,328
17

DeepSeek-R1-Distill-Qwen-32B-GGUF

See our collection for versions of DeepSeek-R1 including GGUF and original formats.

Instructions to run this model in llama.cpp (more detailed instructions are available here: unsloth.ai/blog/deepseek-r1):

1. Do not forget about the `<｜User｜>` and `<｜Assistant｜>` tokens! Or use a chat template formatter.
2. Obtain the latest `llama.cpp` at https://github.com/ggerganov/llama.cpp
3. Example with Q8_0 K quantized cache. Notice that `-no-cnv` disables auto conversation mode.
4. If you have a GPU (RTX 4090 for example) with 24GB, you can offload multiple layers to the GPU for faster processing. If you have multiple GPUs, you can probably offload more layers.

Finetune LLMs 2-5x faster with 70% less memory via Unsloth! We have a free Google Colab Tesla T4 notebook for Llama 3.1 (8B) here: https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1(8B)-Alpaca.ipynb

All notebooks are beginner friendly! Add your dataset, click "Run All", and you'll get a 2x faster finetuned model which can be exported to GGUF, vLLM or uploaded to Hugging Face.
| Unsloth supports | Free Notebooks | Performance | Memory use | |-----------------|--------------------------------------------------------------------------------------------------------------------------|-------------|----------| | Llama-3.2 (3B) | ▶️ Start on Colab-Conversational.ipynb) | 2.4x faster | 58% less | | Llama-3.2 (11B vision) | ▶️ Start on Colab-Vision.ipynb) | 2x faster | 60% less | | Qwen2 VL (7B) | ▶️ Start on Colab-Vision.ipynb) | 1.8x faster | 60% less | | Qwen2.5 (7B) | ▶️ Start on Colab-Alpaca.ipynb) | 2x faster | 60% less | | Llama-3.1 (8B) | ▶️ Start on Colab-Alpaca.ipynb) | 2.4x faster | 58% less | | Phi-3.5 (mini) | ▶️ Start on Colab | 2x faster | 50% less | | Gemma 2 (9B) | ▶️ Start on Colab-Alpaca.ipynb) | 2.4x faster | 58% less | | Mistral (7B) | ▶️ Start on Colab-Conversational.ipynb) | 2.2x faster | 62% less | - This Llama 3.2 conversational notebook-Conversational.ipynb) is useful for ShareGPT ChatML / Vicuna templates. - This text completion notebook-TextCompletion.ipynb) is for raw text. This DPO notebook replicates Zephyr. - \ Kaggle has 2x T4s, but we use 1. Due to overhead, 1x T4 is 5x faster. Special Thanks A huge thank you to the DeepSeek team for creating and releasing these models. We introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1. DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step, demonstrated remarkable performance on reasoning. With RL, DeepSeek-R1-Zero naturally emerged with numerous powerful and interesting reasoning behaviors. However, DeepSeek-R1-Zero encounters challenges such as endless repetition, poor readability, and language mixing. To address these issues and further enhance reasoning performance, we introduce DeepSeek-R1, which incorporates cold-start data before RL. DeepSeek-R1 achieves performance comparable to OpenAI-o1 across math, code, and reasoning tasks. 
To support the research community, we have open-sourced DeepSeek-R1-Zero, DeepSeek-R1, and six dense models distilled from DeepSeek-R1 based on Llama and Qwen. DeepSeek-R1-Distill-Qwen-32B outperforms OpenAI-o1-mini across various benchmarks, achieving new state-of-the-art results for dense models. NOTE: Before running DeepSeek-R1 series models locally, we kindly recommend reviewing the Usage Recommendation section. Post-Training: Large-Scale Reinforcement Learning on the Base Model - We directly apply reinforcement learning (RL) to the base model without relying on supervised fine-tuning (SFT) as a preliminary step. This approach allows the model to explore chain-of-thought (CoT) for solving complex problems, resulting in the development of DeepSeek-R1-Zero. DeepSeek-R1-Zero demonstrates capabilities such as self-verification, reflection, and generating long CoTs, marking a significant milestone for the research community. Notably, it is the first open research to validate that reasoning capabilities of LLMs can be incentivized purely through RL, without the need for SFT. This breakthrough paves the way for future advancements in this area. - We introduce our pipeline to develop DeepSeek-R1. The pipeline incorporates two RL stages aimed at discovering improved reasoning patterns and aligning with human preferences, as well as two SFT stages that serve as the seed for the model's reasoning and non-reasoning capabilities. We believe the pipeline will benefit the industry by creating better models. - We demonstrate that the reasoning patterns of larger models can be distilled into smaller models, resulting in better performance compared to the reasoning patterns discovered through RL on small models. The open source DeepSeek-R1, as well as its API, will benefit the research community to distill better smaller models in the future. - Using the reasoning data generated by DeepSeek-R1, we fine-tuned several dense models that are widely used in the research community. 
The evaluation results demonstrate that the distilled smaller dense models perform exceptionally well on benchmarks. We open-source distilled 1.5B, 7B, 8B, 14B, 32B, and 70B checkpoints based on Qwen2.5 and Llama3 series to the community. | Model | #Total Params | #Activated Params | Context Length | Download | | :------------: | :------------: | :------------: | :------------: | :------------: | | DeepSeek-R1-Zero | 671B | 37B | 128K | 🤗 HuggingFace | | DeepSeek-R1 | 671B | 37B | 128K | 🤗 HuggingFace | DeepSeek-R1-Zero & DeepSeek-R1 are trained based on DeepSeek-V3-Base. For more details regarding the model architecture, please refer to DeepSeek-V3 repository. | Model | Base Model | Download | | :------------: | :------------: | :------------: | | DeepSeek-R1-Distill-Qwen-1.5B | Qwen2.5-Math-1.5B | 🤗 HuggingFace | | DeepSeek-R1-Distill-Qwen-7B | Qwen2.5-Math-7B | 🤗 HuggingFace | | DeepSeek-R1-Distill-Llama-8B | Llama-3.1-8B | 🤗 HuggingFace | | DeepSeek-R1-Distill-Qwen-14B | Qwen2.5-14B | 🤗 HuggingFace | |DeepSeek-R1-Distill-Qwen-32B | Qwen2.5-32B | 🤗 HuggingFace | | DeepSeek-R1-Distill-Llama-70B | Llama-3.3-70B-Instruct | 🤗 HuggingFace | DeepSeek-R1-Distill models are fine-tuned based on open-source models, using samples generated by DeepSeek-R1. We slightly change their configs and tokenizers. Please use our setting to run these models. DeepSeek-R1-Evaluation For all our models, the maximum generation length is set to 32,768 tokens. For benchmarks requiring sampling, we use a temperature of $0.6$, a top-p value of $0.95$, and generate 64 responses per query to estimate pass@1. 
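To make the sampling-based metrics concrete: with k sampled responses per query, pass@1 is estimated as the fraction of those responses that are correct, and the cons@64 column in the distilled-model table is a majority vote over the 64 sampled answers. The sketch below assumes answers have already been extracted as strings and graded by exact match; the function names are hypothetical and the four-sample example stands in for the 64 samples used in the evaluation.

```python
from collections import Counter


def pass_at_1(correct_flags):
    """Estimate pass@1 for one query as the mean correctness over k sampled responses."""
    return sum(correct_flags) / len(correct_flags)


def cons_at_k(answers, reference):
    """Majority-vote (consensus) score for one query: 1.0 if the most common answer is correct.

    Ties are broken in favour of the answer that was seen first.
    """
    majority_answer, _count = Counter(answers).most_common(1)[0]
    return 1.0 if majority_answer == reference else 0.0


if __name__ == "__main__":
    # Four samples stand in for the 64 used in the evaluation.
    answers = ["42", "42", "41", "42"]
    reference = "42"
    flags = [answer == reference for answer in answers]
    print(pass_at_1(flags))
    print(cons_at_k(answers, reference))
```

The benchmark-level number is then the average of these per-query values over all queries.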
| Category | Benchmark (Metric) | Claude-3.5-Sonnet-1022 | GPT-4o 0513 | DeepSeek V3 | OpenAI o1-mini | OpenAI o1-1217 | DeepSeek R1 | |----------|-------------------|----------------------|------------|--------------|----------------|------------|--------------| | | Architecture | - | - | MoE | - | - | MoE | | | # Activated Params | - | - | 37B | - | - | 37B | | | # Total Params | - | - | 671B | - | - | 671B | | English | MMLU (Pass@1) | 88.3 | 87.2 | 88.5 | 85.2 | 91.8 | 90.8 | | | MMLU-Redux (EM) | 88.9 | 88.0 | 89.1 | 86.7 | - | 92.9 | | | MMLU-Pro (EM) | 78.0 | 72.6 | 75.9 | 80.3 | - | 84.0 | | | DROP (3-shot F1) | 88.3 | 83.7 | 91.6 | 83.9 | 90.2 | 92.2 | | | IF-Eval (Prompt Strict) | 86.5 | 84.3 | 86.1 | 84.8 | - | 83.3 | | | GPQA-Diamond (Pass@1) | 65.0 | 49.9 | 59.1 | 60.0 | 75.7 | 71.5 | | | SimpleQA (Correct) | 28.4 | 38.2 | 24.9 | 7.0 | 47.0 | 30.1 | | | FRAMES (Acc.) | 72.5 | 80.5 | 73.3 | 76.9 | - | 82.5 | | | AlpacaEval2.0 (LC-winrate) | 52.0 | 51.1 | 70.0 | 57.8 | - | 87.6 | | | ArenaHard (GPT-4-1106) | 85.2 | 80.4 | 85.5 | 92.0 | - | 92.3 | | Code | LiveCodeBench (Pass@1-COT) | 33.8 | 34.2 | - | 53.8 | 63.4 | 65.9 | | | Codeforces (Percentile) | 20.3 | 23.6 | 58.7 | 93.4 | 96.6 | 96.3 | | | Codeforces (Rating) | 717 | 759 | 1134 | 1820 | 2061 | 2029 | | | SWE Verified (Resolved) | 50.8 | 38.8 | 42.0 | 41.6 | 48.9 | 49.2 | | | Aider-Polyglot (Acc.) 
| 45.3 | 16.0 | 49.6 | 32.9 | 61.7 | 53.3 | | Math | AIME 2024 (Pass@1) | 16.0 | 9.3 | 39.2 | 63.6 | 79.2 | 79.8 | | | MATH-500 (Pass@1) | 78.3 | 74.6 | 90.2 | 90.0 | 96.4 | 97.3 | | | CNMO 2024 (Pass@1) | 13.1 | 10.8 | 43.2 | 67.6 | - | 78.8 | | Chinese | CLUEWSC (EM) | 85.4 | 87.9 | 90.9 | 89.9 | - | 92.8 | | | C-Eval (EM) | 76.7 | 76.0 | 86.5 | 68.9 | - | 91.8 | | | C-SimpleQA (Correct) | 55.4 | 58.7 | 68.0 | 40.3 | - | 63.7 | | Model | AIME 2024 pass@1 | AIME 2024 cons@64 | MATH-500 pass@1 | GPQA Diamond pass@1 | LiveCodeBench pass@1 | CodeForces rating | |------------------------------------------|------------------|-------------------|-----------------|----------------------|----------------------|-------------------| | GPT-4o-0513 | 9.3 | 13.4 | 74.6 | 49.9 | 32.9 | 759 | | Claude-3.5-Sonnet-1022 | 16.0 | 26.7 | 78.3 | 65.0 | 38.9 | 717 | | o1-mini | 63.6 | 80.0 | 90.0 | 60.0 | 53.8 | 1820 | | QwQ-32B-Preview | 44.0 | 60.0 | 90.6 | 54.5 | 41.9 | 1316 | | DeepSeek-R1-Distill-Qwen-1.5B | 28.9 | 52.7 | 83.9 | 33.8 | 16.9 | 954 | | DeepSeek-R1-Distill-Qwen-7B | 55.5 | 83.3 | 92.8 | 49.1 | 37.6 | 1189 | | DeepSeek-R1-Distill-Qwen-14B | 69.7 | 80.0 | 93.9 | 59.1 | 53.1 | 1481 | | DeepSeek-R1-Distill-Qwen-32B | 72.6 | 83.3 | 94.3 | 62.1 | 57.2 | 1691 | | DeepSeek-R1-Distill-Llama-8B | 50.4 | 80.0 | 89.1 | 49.0 | 39.6 | 1205 | | DeepSeek-R1-Distill-Llama-70B | 70.0 | 86.7 | 94.5 | 65.2 | 57.5 | 1633 | 5. Chat Website & API Platform You can chat with DeepSeek-R1 on DeepSeek's official website: chat.deepseek.com, and switch on the button "DeepThink" We also provide OpenAI-Compatible API at DeepSeek Platform: platform.deepseek.com Please visit DeepSeek-V3 repo for more information about running DeepSeek-R1 locally. DeepSeek-R1-Distill models can be utilized in the same manner as Qwen or Llama models. 
For instance, you can easily start a service using vLLM: We recommend adhering to the following configurations when utilizing the DeepSeek-R1 series models, including benchmarking, to achieve the expected performance: 1. Set the temperature within the range of 0.5-0.7 (0.6 is recommended) to prevent endless repetitions or incoherent outputs. 2. Avoid adding a system prompt; all instructions should be contained within the user prompt. 3. For mathematical problems, it is advisable to include a directive in your prompt such as: "Please reason step by step, and put your final answer within \boxed{}." 4. When evaluating model performance, it is recommended to conduct multiple tests and average the results. 7. License This code repository and the model weights are licensed under the MIT License. DeepSeek-R1 series support commercial use, allow for any modifications and derivative works, including, but not limited to, distillation for training other LLMs. Please note that: - DeepSeek-R1-Distill-Qwen-1.5B, DeepSeek-R1-Distill-Qwen-7B, DeepSeek-R1-Distill-Qwen-14B and DeepSeek-R1-Distill-Qwen-32B are derived from Qwen-2.5 series, which are originally licensed under Apache 2.0 License, and now finetuned with 800k samples curated with DeepSeek-R1. - DeepSeek-R1-Distill-Llama-8B is derived from Llama3.1-8B-Base and is originally licensed under llama3.1 license. - DeepSeek-R1-Distill-Llama-70B is derived from Llama3.3-70B-Instruct and is originally licensed under llama3.3 license. 9. Contact If you have any questions, please raise an issue or contact us at [email protected].
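The usage recommendations above (temperature between 0.5 and 0.7, no system prompt, a step-by-step directive for math, and averaging over several runs) can be captured in a small request builder. This is a sketch under assumptions: it targets an OpenAI-compatible chat completions endpoint such as the one a vLLM server exposes, `build_request` and `mean_score` are hypothetical helper names, and `top_p` and `max_tokens` simply reuse the evaluation settings quoted earlier in this card.

```python
def build_request(question, is_math=False, temperature=0.6):
    """Build a chat-completions payload that follows the recommended settings."""
    if not 0.5 <= temperature <= 0.7:
        raise ValueError("temperature should stay within the recommended 0.5-0.7 range")

    prompt = question
    if is_math:
        # Directive recommended for mathematical problems.
        prompt += "\nPlease reason step by step, and put your final answer within \\boxed{}."

    return {
        "model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",
        # No system message: all instructions go into the user prompt.
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "top_p": 0.95,
        "max_tokens": 32768,
    }


def mean_score(scores):
    """Average the results of several evaluation runs."""
    return sum(scores) / len(scores)


if __name__ == "__main__":
    payload = build_request("What is 17 * 24?", is_math=True)
    print(payload["messages"][0]["content"])
    print(mean_score([0.5, 1.0]))
```

The payload can be sent as JSON to the server's chat completions route with any HTTP client.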

NaNK
license:apache-2.0
12,187
142

Qwen3-VL-32B-Thinking-GGUF

> [!NOTE] > Includes Unsloth chat template fixes! > See our Qwen3-VL collection for all versions including GGUF, 4-bit & 16-bit formats. See Unsloth Dynamic 2.0 GGUFs for our quantization benchmarks. - Fine-tune Qwen3-VL-8B for free using our Google Colab notebook - Or train Qwen3-VL with reinforcement learning (GSPO) with our free notebook. - View the rest of our notebooks in our docs here. --- Meet Qwen3-VL — the most powerful vision-language model in the Qwen series to date. This generation delivers comprehensive upgrades across the board: superior text understanding & generation, deeper visual perception & reasoning, extended context length, enhanced spatial and video dynamics comprehension, and stronger agent interaction capabilities. Available in Dense and MoE architectures that scale from edge to cloud, with Instruct and reasoning‑enhanced Thinking editions for flexible, on‑demand deployment. Visual Agent: Operates PC/mobile GUIs—recognizes elements, understands functions, invokes tools, completes tasks. Visual Coding Boost: Generates Draw.io/HTML/CSS/JS from images/videos. Advanced Spatial Perception: Judges object positions, viewpoints, and occlusions; provides stronger 2D grounding and enables 3D grounding for spatial reasoning and embodied AI. Long Context & Video Understanding: Native 256K context, expandable to 1M; handles books and hours-long video with full recall and second-level indexing. Enhanced Multimodal Reasoning: Excels in STEM/Math—causal analysis and logical, evidence-based answers. Upgraded Visual Recognition: Broader, higher-quality pretraining is able to “recognize everything”—celebrities, anime, products, landmarks, flora/fauna, etc. Expanded OCR: Supports 32 languages (up from 19); robust in low light, blur, and tilt; better with rare/ancient characters and jargon; improved long-document structure parsing. Text Understanding on par with pure LLMs: Seamless text–vision fusion for lossless, unified comprehension. 1. 
Interleaved-MRoPE: Full‑frequency allocation over time, width, and height via robust positional embeddings, enhancing long‑horizon video reasoning. 2. DeepStack: Fuses multi‑level ViT features to capture fine‑grained details and sharpen image–text alignment. 3. Text–Timestamp Alignment: Moves beyond T‑RoPE to precise, timestamp‑grounded event localization for stronger video temporal modeling. This is the weight repository for Qwen3-VL-32B-Thinking. Below, we provide simple examples to show how to use Qwen3-VL with 🤖 ModelScope and 🤗 Transformers. The code of Qwen3-VL has been in the latest Hugging face transformers and we advise you to build from source with command: Here we show a code snippet to show you how to use the chat model with `transformers`: If you find our work helpful, feel free to give us a cite.

NaNK
license:apache-2.0
12,068
10

Qwen2.5-32B-Instruct-bnb-4bit

Finetune Llama 3.1, Gemma 2, Mistral 2-5x faster with 70% less memory via Unsloth! We have a Qwen 2.5 (all model sizes) free Google Colab Tesla T4 notebook. Also a Qwen 2.5 conversational style notebook. All notebooks are beginner friendly! Add your dataset, click "Run All", and you'll get a 2x faster finetuned model which can be exported to GGUF, vLLM or uploaded to Hugging Face. | Unsloth supports | Free Notebooks | Performance | Memory use | |-----------------|--------------------------------------------------------------------------------------------------------------------------|-------------|----------| | Llama-3.1 8b | ▶️ Start on Colab | 2.4x faster | 58% less | | Phi-3.5 (mini) | ▶️ Start on Colab | 2x faster | 50% less | | Gemma-2 9b | ▶️ Start on Colab | 2.4x faster | 58% less | | Mistral 7b | ▶️ Start on Colab | 2.2x faster | 62% less | | TinyLlama | ▶️ Start on Colab | 3.9x faster | 74% less | | DPO - Zephyr | ▶️ Start on Colab | 1.9x faster | 19% less | - This conversational notebook is useful for ShareGPT ChatML / Vicuna templates. - This text completion notebook is for raw text. This DPO notebook replicates Zephyr. - \ Kaggle has 2x T4s, but we use 1. Due to overhead, 1x T4 is 5x faster. Qwen2.5 is the latest series of Qwen large language models. For Qwen2.5, we release a number of base language models and instruction-tuned language models ranging from 0.5 to 72 billion parameters. Qwen2.5 brings the following improvements upon Qwen2: - Significantly more knowledge and has greatly improved capabilities in coding and mathematics, thanks to our specialized expert models in these domains. - Significant improvements in instruction following, generating long texts (over 8K tokens), understanding structured data (e.g, tables), and generating structured outputs especially JSON. More resilient to the diversity of system prompts, enhancing role-play implementation and condition-setting for chatbots. 
- Long-context support up to 128K tokens, with generation of up to 8K tokens.
- Multilingual support for over 29 languages, including Chinese, English, French, Spanish, Portuguese, German, Italian, Russian, Japanese, Korean, Vietnamese, Thai, Arabic, and more.

This repo contains the instruction-tuned 32B Qwen2.5 model, which has the following features:
- Type: Causal Language Models
- Training Stage: Pretraining & Post-training
- Architecture: transformers with RoPE, SwiGLU, RMSNorm, and Attention QKV bias
- Number of Parameters: 32.5B
- Number of Parameters (Non-Embedding): 31.0B
- Number of Layers: 64
- Number of Attention Heads (GQA): 40 for Q and 8 for KV
- Context Length: Full 131,072 tokens and generation 8192 tokens
  - Please refer to this section for detailed instructions on how to deploy Qwen2.5 for handling long texts.

For more details, please refer to our blog, GitHub, and Documentation.

The code of Qwen2.5 has been in the latest Hugging Face `transformers` and we advise you to use the latest version of `transformers`. With `transformers<4.37.0`, you will encounter the error `KeyError: 'qwen2'`.

The usage example uses `apply_chat_template` to show how to load the tokenizer and model and how to generate content.

The current `config.json` is set for context length up to 32,768 tokens. To handle extensive inputs exceeding 32,768 tokens, we utilize YaRN, a technique for enhancing model length extrapolation, ensuring optimal performance on lengthy texts. For supported frameworks, you can enable YaRN by adding a `rope_scaling` entry to `config.json`.

For deployment, we recommend using vLLM. Please refer to our Documentation for usage if you are not familiar with vLLM. Presently, vLLM only supports static YaRN, which means the scaling factor remains constant regardless of input length, potentially impacting performance on shorter texts. We advise adding the `rope_scaling` configuration only when processing long contexts is required.
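As a rough illustration of the YaRN setting discussed above, the sketch below derives the scaling factor from the two context lengths quoted in this card (131,072 / 32,768 = 4.0) and adds a `rope_scaling` entry to a config dictionary. The helper names are hypothetical, and the key names (`factor`, `original_max_position_embeddings`, `type`) are an assumption about the convention Qwen2.5 configs use for YaRN; check them against the framework you deploy with.

```python
import json


def yarn_rope_scaling(target_context, original_context=32768):
    """Build a YaRN rope_scaling entry; the factor is the ratio of target to original length."""
    return {
        "factor": target_context / original_context,
        "original_max_position_embeddings": original_context,
        "type": "yarn",
    }


def enable_yarn(config, target_context=131072):
    """Return a copy of a config dict with YaRN enabled, leaving the input untouched."""
    patched = dict(config)
    patched["rope_scaling"] = yarn_rope_scaling(target_context)
    return patched


if __name__ == "__main__":
    # Stand-in for the contents of config.json; a real file has many more fields.
    config = {"max_position_embeddings": 32768}
    print(json.dumps(enable_yarn(config), indent=2))
```

Because static YaRN applies the same factor to every input, it makes sense to apply this patch only to deployments that actually need contexts beyond 32,768 tokens.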
Detailed evaluation results are reported in this 📑 blog. For requirements on GPU memory and the respective throughput, see results here. If you find our work helpful, feel free to cite us.

license:apache-2.0
11,936
13

Qwen2.5-1.5B-Instruct-unsloth-bnb-4bit

See our collection for versions of Qwen2.5 including 4-bit formats. Unsloth's Dynamic 4-bit Quants are selectively quantized, greatly improving accuracy over standard 4-bit.

Finetune LLMs 2-5x faster with 70% less memory via Unsloth! We have a free Google Colab Tesla T4 notebook for Qwen2.5 (7B) here: https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen2.5(7B)-Alpaca.ipynb

All notebooks are beginner friendly! Add your dataset, click "Run All", and you'll get a 2x faster finetuned model which can be exported to GGUF, vLLM or uploaded to Hugging Face.

| Unsloth supports | Free Notebooks | Performance | Memory use |
|------------------|-------------------|-------------|------------|
| Llama-3.2 (3B) | ▶️ Start on Colab | 2.4x faster | 58% less |
| Llama-3.2 (11B vision) | ▶️ Start on Colab | 2x faster | 60% less |
| Qwen2 VL (7B) | ▶️ Start on Colab | 1.8x faster | 60% less |
| Qwen2.5 (7B) | ▶️ Start on Colab | 2x faster | 60% less |
| Llama-3.1 (8B) | ▶️ Start on Colab | 2.4x faster | 58% less |
| Phi-3.5 (mini) | ▶️ Start on Colab | 2x faster | 50% less |
| Gemma 2 (9B) | ▶️ Start on Colab | 2.4x faster | 58% less |
| Mistral (7B) | ▶️ Start on Colab | 2.2x faster | 62% less |

- This Llama 3.2 conversational notebook is useful for ShareGPT ChatML / Vicuna templates.
- This text completion notebook is for raw text. This DPO notebook replicates Zephyr.
- \* Kaggle has 2x T4s, but we use 1. Due to overhead, 1x T4 is 5x faster.

Qwen2.5 is the latest series of Qwen large language models. For Qwen2.5, we release a number of base language models and instruction-tuned language models ranging from 0.5 to 72 billion parameters.
Qwen2.5 brings the following improvements upon Qwen2:

- Significantly more knowledge and has greatly improved capabilities in coding and mathematics, thanks to our specialized expert models in these domains.
- Significant improvements in instruction following, generating long texts (over 8K tokens), understanding structured data (e.g., tables), and generating structured outputs, especially JSON. More resilient to the diversity of system prompts, enhancing role-play implementation and condition-setting for chatbots.
- Long-context support for up to 128K tokens, with generation of up to 8K tokens.
- Multilingual support for over 29 languages, including Chinese, English, French, Spanish, Portuguese, German, Italian, Russian, Japanese, Korean, Vietnamese, Thai, Arabic, and more.

This repo contains the base 0.5B Qwen2.5 model, which has the following features:

- Type: Causal Language Models
- Training Stage: Pretraining
- Architecture: transformers with RoPE, SwiGLU, RMSNorm, Attention QKV bias and tied word embeddings
- Number of Parameters: 0.49B
- Number of Parameters (Non-Embedding): 0.36B
- Number of Layers: 24
- Number of Attention Heads (GQA): 14 for Q and 2 for KV
- Context Length: Full 32,768 tokens

We do not recommend using base language models for conversations. Instead, you can apply post-training, e.g., SFT, RLHF, continued pretraining, etc., on this model.

For more details, please refer to our blog, GitHub, and Documentation.

The code of Qwen2.5 has been merged into the latest Hugging Face `transformers`, and we advise you to use the latest version of `transformers`. With `transformers<4.37.0`, you will encounter the error `KeyError: 'qwen2'`.

Detailed evaluation results are reported in this 📑 blog. For requirements on GPU memory and the respective throughput, see results here.

If you find our work helpful, feel free to cite us.
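To make the grouped-query attention (GQA) head counts listed in the card concrete, here is a small sketch of how query heads can be mapped onto shared key/value heads. It assumes the common convention that consecutive query heads share one KV head; the function name `kv_head_for` is illustrative and not part of any library.

```python
def kv_head_for(q_head: int, num_q_heads: int, num_kv_heads: int) -> int:
    """Index of the KV head that a given query head shares."""
    if num_q_heads % num_kv_heads != 0:
        raise ValueError("query heads must divide evenly among KV heads")
    group_size = num_q_heads // num_kv_heads
    return q_head // group_size


# Head counts from the card: 14 for Q and 2 for KV.
mapping = [kv_head_for(q, 14, 2) for q in range(14)]
print(mapping)
```

With 14 query heads and 2 KV heads, each KV head serves 7 query heads, so only 2 key/value heads per layer need to be cached instead of 14.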

license:apache-2.0
11,924
5

gemma-3-27b-it-unsloth-bnb-4bit

See our collection for all versions of Gemma 3 including GGUF, 4-bit & 16-bit formats. Unsloth's Dynamic Quants are selectively quantized, greatly improving accuracy over standard 4-bit.

- Fine-tune Gemma 3 (12B) for free using our Google Colab notebook here!
- Read our Blog about Gemma 3 support: unsloth.ai/blog/gemma3
- View the rest of our notebooks in our docs here.
- Export your fine-tuned model to GGUF, Ollama, llama.cpp or 🤗HF.

| Unsloth supports | Free Notebooks | Performance | Memory use |
|------------------|-------------------|-------------|------------|
| GRPO with Gemma 3 (12B) | ▶️ Start on Colab | 2x faster | 80% less |
| Llama-3.2 (3B) | ▶️ Start on Colab | 2.4x faster | 58% less |
| Llama-3.2 (11B vision) | ▶️ Start on Colab | 2x faster | 60% less |
| Qwen2.5 (7B) | ▶️ Start on Colab | 2x faster | 60% less |
| Phi-4 (14B) | ▶️ Start on Colab | 2x faster | 50% less |
| Mistral (7B) | ▶️ Start on Colab | 2.2x faster | 62% less |

- [Gemma 3 Technical Report][g3-tech-report]
- [Responsible Generative AI Toolkit][rai-toolkit]
- [Gemma on Kaggle][kaggle-gemma]
- [Gemma on Vertex Model Garden][vertex-mg-gemma3]

Summary description and brief definition of inputs and outputs.

Gemma is a family of lightweight, state-of-the-art open models from Google, built from the same research and technology used to create the Gemini models. Gemma 3 models are multimodal, handling text and image input and generating text output, with open weights for both pre-trained variants and instruction-tuned variants. Gemma 3 has a large, 128K context window, multilingual support in over 140 languages, and is available in more sizes than previous versions. Gemma 3 models are well-suited for a variety of text generation and image understanding tasks, including question answering, summarization, and reasoning. Their relatively small size makes it possible to deploy them in environments with limited resources such as laptops, desktops or your own cloud infrastructure, democratizing access to state of the art AI models and helping foster innovation for everyone.

- Input:
    - Text string, such as a question, a prompt, or a document to be summarized
    - Images, normalized to 896 x 896 resolution and encoded to 256 tokens each
    - Total input context of 128K tokens for the 4B, 12B, and 27B sizes, and 32K tokens for the 1B size
- Output:
    - Generated text in response to the input, such as an answer to a question, analysis of image content, or a summary of a document
    - Total output context of 8192 tokens

Data used for model training and how the data was processed.

These models were trained on a dataset of text data that includes a wide variety of sources. The 27B model was trained with 14 trillion tokens, the 12B model was trained with 12 trillion tokens, the 4B model was trained with 4 trillion tokens and the 1B model with 2 trillion tokens. Here are the key components:

- Web Documents: A diverse collection of web text ensures the model is exposed to a broad range of linguistic styles, topics, and vocabulary. The training dataset includes content in over 140 languages.
- Code: Exposing the model to code helps it to learn the syntax and patterns of programming languages, which improves its ability to generate code and understand code-related questions.
- Mathematics: Training on mathematical text helps the model learn logical reasoning, symbolic representation, and to address mathematical queries.
- Images: A wide range of images enables the model to perform image analysis and visual data extraction tasks.

The combination of these diverse data sources is crucial for training a powerful multimodal model that can handle a wide variety of different tasks and data formats.
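The input limits listed in the card can be turned into a rough budget check. This sketch assumes that "128K" means 128 × 1024 tokens and that each image costs exactly 256 tokens, as the card states; the function name and the example image count are illustrative only.

```python
TOKENS_PER_IMAGE = 256        # from the card: each image is encoded to 256 tokens
CONTEXT_TOKENS = 128 * 1024   # assumption: "128K" read as 131,072 tokens


def remaining_text_tokens(num_images: int, context_tokens: int = CONTEXT_TOKENS) -> int:
    """Tokens left for text after reserving space for the images."""
    used = num_images * TOKENS_PER_IMAGE
    if used > context_tokens:
        raise ValueError("images alone exceed the context window")
    return context_tokens - used


# Ten images reserve 2,560 tokens of the input context.
print(remaining_text_tokens(10))
```

For the 1B size, the same function can be called with `context_tokens=32 * 1024`, again under the assumption that "32K" means 32,768 tokens.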
Here are the key data cleaning and filtering methods applied to the training data:

- CSAM Filtering: Rigorous CSAM (Child Sexual Abuse Material) filtering was applied at multiple stages in the data preparation process to ensure the exclusion of harmful and illegal content.
- Sensitive Data Filtering: As part of making Gemma pre-trained models safe and reliable, automated techniques were used to filter out certain personal information and other sensitive data from training sets.
- Additional methods: Filtering based on content quality and safety in line with [our policies][safety-policies].

Gemma was trained using [Tensor Processing Unit (TPU)][tpu] hardware (TPUv4p, TPUv5p and TPUv5e). Training vision-language models (VLMs) requires significant computational power. TPUs, designed specifically for matrix operations common in machine learning, offer several advantages in this domain:

- Performance: TPUs are specifically designed to handle the massive computations involved in training VLMs. They can speed up training considerably compared to CPUs.
- Memory: TPUs often come with large amounts of high-bandwidth memory, allowing for the handling of large models and batch sizes during training. This can lead to better model quality.
- Scalability: TPU Pods (large clusters of TPUs) provide a scalable solution for handling the growing complexity of large foundation models. You can distribute training across multiple TPU devices for faster and more efficient processing.
- Cost-effectiveness: In many scenarios, TPUs can provide a more cost-effective solution for training large models compared to CPU-based infrastructure, especially when considering the time and resources saved due to faster training.
- These advantages are aligned with [Google's commitments to operate sustainably][sustainability].

Training was done using [JAX][jax] and [ML Pathways][ml-pathways].
JAX allows researchers to take advantage of the latest generation of hardware, including TPUs, for faster and more efficient training of large models. ML Pathways is Google's latest effort to build artificially intelligent systems capable of generalizing across multiple tasks. This is specially suitable for foundation models, including large language models like these ones.

Together, JAX and ML Pathways are used as described in the [paper about the Gemini family of models][gemini-2-paper]; "the 'single controller' programming model of Jax and Pathways allows a single Python process to orchestrate the entire training run, dramatically simplifying the development workflow."

These models were evaluated against a large collection of different datasets and metrics to cover different aspects of text generation:

| Benchmark | Metric | Gemma 3 PT 1B | Gemma 3 PT 4B | Gemma 3 PT 12B | Gemma 3 PT 27B |
| ------------------------------ |----------------|:--------------:|:-------------:|:--------------:|:--------------:|
| [HellaSwag][hellaswag] | 10-shot | 62.3 | 77.2 | 84.2 | 85.6 |
| [BoolQ][boolq] | 0-shot | 63.2 | 72.3 | 78.8 | 82.4 |
| [PIQA][piqa] | 0-shot | 73.8 | 79.6 | 81.8 | 83.3 |
| [SocialIQA][socialiqa] | 0-shot | 48.9 | 51.9 | 53.4 | 54.9 |
| [TriviaQA][triviaqa] | 5-shot | 39.8 | 65.8 | 78.2 | 85.5 |
| [Natural Questions][naturalq] | 5-shot | 9.48 | 20.0 | 31.4 | 36.1 |
| [ARC-c][arc] | 25-shot | 38.4 | 56.2 | 68.9 | 70.6 |
| [ARC-e][arc] | 0-shot | 73.0 | 82.4 | 88.3 | 89.0 |
| [WinoGrande][winogrande] | 5-shot | 58.2 | 64.7 | 74.3 | 78.8 |
| [BIG-Bench Hard][bbh] | few-shot | 28.4 | 50.9 | 72.6 | 77.7 |
| [DROP][drop] | 1-shot | 42.4 | 60.1 | 72.2 | 77.2 |

[hellaswag]: https://arxiv.org/abs/1905.07830
[boolq]: https://arxiv.org/abs/1905.10044
[piqa]: https://arxiv.org/abs/1911.11641
[socialiqa]: https://arxiv.org/abs/1904.09728
[triviaqa]: https://arxiv.org/abs/1705.03551
[naturalq]: https://github.com/google-research-datasets/natural-questions
[arc]: https://arxiv.org/abs/1911.01547
[winogrande]: https://arxiv.org/abs/1907.10641
[bbh]: https://paperswithcode.com/dataset/bbh
[drop]: https://arxiv.org/abs/1903.00161

| Benchmark | Metric | Gemma 3 PT 4B | Gemma 3 PT 12B | Gemma 3 PT 27B |
| ------------------------------ |----------------|:-------------:|:--------------:|:--------------:|
| [MMLU][mmlu] | 5-shot | 59.6 | 74.5 | 78.6 |
| [MMLU][mmlu] (Pro COT) | 5-shot | 29.2 | 45.3 | 52.2 |
| [AGIEval][agieval] | 3-5-shot | 42.1 | 57.4 | 66.2 |
| [MATH][math] | 4-shot | 24.2 | 43.3 | 50.0 |
| [GSM8K][gsm8k] | 8-shot | 38.4 | 71.0 | 82.6 |
| [GPQA][gpqa] | 5-shot | 15.0 | 25.4 | 24.3 |
| [MBPP][mbpp] | 3-shot | 46.0 | 60.4 | 65.6 |
| [HumanEval][humaneval] | 0-shot | 36.0 | 45.7 | 48.8 |

[mmlu]: https://arxiv.org/abs/2009.03300
[agieval]: https://arxiv.org/abs/2304.06364
[math]: https://arxiv.org/abs/2103.03874
[gsm8k]: https://arxiv.org/abs/2110.14168
[gpqa]: https://arxiv.org/abs/2311.12022
[mbpp]: https://arxiv.org/abs/2108.07732
[humaneval]: https://arxiv.org/abs/2107.03374

| Benchmark | Gemma 3 PT 1B | Gemma 3 PT 4B | Gemma 3 PT 12B | Gemma 3 PT 27B |
| ------------------------------------ |:-------------:|:-------------:|:--------------:|:--------------:|
| [MGSM][mgsm] | 2.04 | 34.7 | 64.3 | 74.3 |
| [Global-MMLU-Lite][global-mmlu-lite] | 24.9 | 57.0 | 69.4 | 75.7 |
| [WMT24++][wmt24pp] (ChrF) | 36.7 | 48.4 | 53.9 | 55.7 |
| [FloRes][flores] | 29.5 | 39.2 | 46.0 | 48.8 |
| [XQuAD][xquad] (all) | 43.9 | 68.0 | 74.5 | 76.8 |
| [ECLeKTic][eclektic] | 4.69 | 11.0 | 17.2 | 24.4 |
| [IndicGenBench][indicgenbench] | 41.4 | 57.2 | 61.7 | 63.4 |

[mgsm]: https://arxiv.org/abs/2210.03057
[flores]: https://arxiv.org/abs/2106.03193
[xquad]: https://arxiv.org/abs/1910.11856v3
[global-mmlu-lite]: https://huggingface.co/datasets/CohereForAI/Global-MMLU-Lite
[wmt24pp]: https://arxiv.org/abs/2502.12404v1
[eclektic]: https://arxiv.org/abs/2502.21228
[indicgenbench]: https://arxiv.org/abs/2404.16816

| Benchmark | Gemma 3 PT 4B | Gemma 3 PT 12B | Gemma 3 PT 27B |
| ------------------------------ |:-------------:|:--------------:|:--------------:|
| [COCOcap][coco-cap] | 102 | 111 | 116 |
| [DocVQA][docvqa] (val) | 72.8 | 82.3 | 85.6 |
| [InfoVQA][info-vqa] (val) | 44.1 | 54.8 | 59.4 |
| [MMMU][mmmu] (pt) | 39.2 | 50.3 | 56.1 |
| [TextVQA][textvqa] (val) | 58.9 | 66.5 | 68.6 |
| [RealWorldQA][realworldqa] | 45.5 | 52.2 | 53.9 |
| [ReMI][remi] | 27.3 | 38.5 | 44.8 |
| [AI2D][ai2d] | 63.2 | 75.2 | 79.0 |
| [ChartQA][chartqa] | 63.6 | 74.7 | 76.3 |
| [VQAv2][vqav2] | 63.9 | 71.2 | 72.9 |
| [BLINK][blinkvqa] | 38.0 | 35.9 | 39.6 |
| [OKVQA][okvqa] | 51.0 | 58.7 | 60.2 |
| [TallyQA][tallyqa] | 42.5 | 51.8 | 54.3 |
| [SpatialSense VQA][ss-vqa] | 50.9 | 60.0 | 59.4 |
| [CountBenchQA][countbenchqa] | 26.1 | 17.8 | 68.0 |

[coco-cap]: https://cocodataset.org/#home
[docvqa]: https://www.docvqa.org/
[info-vqa]: https://arxiv.org/abs/2104.12756
[mmmu]: https://arxiv.org/abs/2311.16502
[textvqa]: https://textvqa.org/
[realworldqa]: https://paperswithcode.com/dataset/realworldqa
[remi]: https://arxiv.org/html/2406.09175v1
[ai2d]: https://allenai.org/data/diagrams
[chartqa]: https://arxiv.org/abs/2203.10244
[vqav2]: https://visualqa.org/index.html
[blinkvqa]: https://arxiv.org/abs/2404.12390
[okvqa]: https://okvqa.allenai.org/
[tallyqa]: https://arxiv.org/abs/1810.12440
[ss-vqa]: https://arxiv.org/abs/1908.02660
[countbenchqa]: https://github.com/google-research/bigvision/blob/main/bigvision/datasets/countbenchqa/

Our evaluation methods include structured evaluations and internal red-teaming testing of relevant content policies. Red-teaming was conducted by a number of different teams, each with different goals and human evaluation metrics. These models were evaluated against a number of different categories relevant to ethics and safety, including:

- Child Safety: Evaluation of text-to-text and image to text prompts covering child safety policies, including child sexual abuse and exploitation.
- Content Safety: Evaluation of text-to-text and image to text prompts covering safety policies including harassment, violence and gore, and hate speech.
- Representational Harms: Evaluation of text-to-text and image to text prompts covering safety policies including bias, stereotyping, and harmful associations or inaccuracies.

In addition to development level evaluations, we conduct "assurance evaluations" which are our 'arms-length' internal evaluations for responsibility governance decision making. They are conducted separately from the model development team, to inform decision making about release. High level findings are fed back to the model team, but prompt sets are held-out to prevent overfitting and preserve the results' ability to inform decision making. Assurance evaluation results are reported to our Responsibility & Safety Council as part of release review.

For all areas of safety testing, we saw major improvements in the categories of child safety, content safety, and representational harms relative to previous Gemma models. All testing was conducted without safety filters to evaluate the model capabilities and behaviors. For both text-to-text and image-to-text, and across all model sizes, the model produced minimal policy violations, and showed significant improvements over previous Gemma models' performance with respect to ungrounded inferences. A limitation of our evaluations was that they included only English language prompts.

These models have certain limitations that users should be aware of.

Open vision-language models (VLMs) have a wide range of applications across various industries and domains. The following list of potential uses is not comprehensive. The purpose of this list is to provide contextual information about the possible use-cases that the model creators considered as part of model training and development.
- Content Creation and Communication - Text Generation: These models can be used to generate creative text formats such as poems, scripts, code, marketing copy, and email drafts. - Chatbots and Conversational AI: Power conversational interfaces for customer service, virtual assistants, or interactive applications. - Text Summarization: Generate concise summaries of a text corpus, research papers, or reports. - Image Data Extraction: These models can be used to extract, interpret, and summarize visual data for text communications. - Research and Education - Natural Language Processing (NLP) and VLM Research: These models can serve as a foundation for researchers to experiment with VLM and NLP techniques, develop algorithms, and contribute to the advancement of the field. - Language Learning Tools: Support interactive language learning experiences, aiding in grammar correction or providing writing practice. - Knowledge Exploration: Assist researchers in exploring large bodies of text by generating summaries or answering questions about specific topics. - Training Data - The quality and diversity of the training data significantly influence the model's capabilities. Biases or gaps in the training data can lead to limitations in the model's responses. - The scope of the training dataset determines the subject areas the model can handle effectively. - Context and Task Complexity - Models are better at tasks that can be framed with clear prompts and instructions. Open-ended or highly complex tasks might be challenging. - A model's performance can be influenced by the amount of context provided (longer context generally leads to better outputs, up to a certain point). - Language Ambiguity and Nuance - Natural language is inherently complex. Models might struggle to grasp subtle nuances, sarcasm, or figurative language. - Factual Accuracy - Models generate responses based on information they learned from their training datasets, but they are not knowledge bases. 
They may generate incorrect or outdated factual statements. - Common Sense - Models rely on statistical patterns in language. They might lack the ability to apply common sense reasoning in certain situations. The development of vision-language models (VLMs) raises several ethical concerns. In creating an open model, we have carefully considered the following: - Bias and Fairness - VLMs trained on large-scale, real-world text and image data can reflect socio-cultural biases embedded in the training material. These models underwent careful scrutiny, input data pre-processing described and posterior evaluations reported in this card. - Misinformation and Misuse - VLMs can be misused to generate text that is false, misleading, or harmful. - Guidelines are provided for responsible use with the model, see the [Responsible Generative AI Toolkit][rai-toolkit]. - Transparency and Accountability: - This model card summarizes details on the models' architecture, capabilities, limitations, and evaluation processes. - A responsibly developed open model offers the opportunity to share innovation by making VLM technology accessible to developers and researchers across the AI ecosystem. - Perpetuation of biases: It's encouraged to perform continuous monitoring (using evaluation metrics, human review) and the exploration of de-biasing techniques during model training, fine-tuning, and other use cases. - Generation of harmful content: Mechanisms and guidelines for content safety are essential. Developers are encouraged to exercise caution and implement appropriate content safety safeguards based on their specific product policies and application use cases. - Misuse for malicious purposes: Technical limitations and developer and end-user education can help mitigate against malicious applications of VLMs. Educational resources and reporting mechanisms for users to flag misuse are provided. 
Prohibited uses of Gemma models are outlined in the [Gemma Prohibited Use Policy][prohibited-use].
- Privacy violations: Models were trained on data filtered for removal of certain personal information and other sensitive data. Developers are encouraged to adhere to privacy regulations with privacy-preserving techniques.

At the time of release, this family of models provides high-performance open vision-language model implementations designed from the ground up for responsible AI development compared to similarly sized models.

Using the benchmark evaluation metrics described in this document, these models have shown to provide superior performance to other, comparably-sized open model alternatives.

[g3-tech-report]: https://goo.gle/Gemma3Report
[rai-toolkit]: https://ai.google.dev/responsible
[kaggle-gemma]: https://www.kaggle.com/models/google/gemma-3
[vertex-mg-gemma3]: https://console.cloud.google.com/vertex-ai/publishers/google/model-garden/gemma3
[terms]: https://ai.google.dev/gemma/terms
[safety-policies]: https://ai.google/static/documents/ai-responsibility-update-published-february-2025.pdf
[prohibited-use]: https://ai.google.dev/gemma/prohibitedusepolicy
[tpu]: https://cloud.google.com/tpu/docs/intro-to-tpu
[sustainability]: https://sustainability.google/operating-sustainably/
[jax]: https://github.com/jax-ml/jax
[ml-pathways]: https://blog.google/technology/ai/introducing-pathways-next-generation-ai-architecture/
[gemini-2-paper]: https://arxiv.org/abs/2312.11805

11,884
19

granite-4.0-h-micro-GGUF

See our collection for all versions of Granite-4.0 including GGUF, 4-bit & 16-bit formats. Learn to run Granite 4.0 correctly: read our Guide. See Unsloth Dynamic 2.0 GGUFs for our quantization benchmarks.

- Fine-tune Granite-4.0 for free using our Google Colab notebook
- Read our Blog about Granite-4.0 support: https://docs.unsloth.ai/new/ibm-granite-4.0
- View the rest of our notebooks in our docs here.

Model Summary: Granite-4.0-H-Micro is a 3B parameter long-context instruct model finetuned from Granite-4.0-H-Micro-Base using a combination of open source instruction datasets with permissive license and internally collected synthetic datasets. This model is developed using a diverse set of techniques with a structured chat format, including supervised finetuning, model alignment using reinforcement learning, and model merging. Granite 4.0 instruct models feature improved instruction following (IF) and tool-calling capabilities, making them more effective in enterprise applications.

- Developers: Granite Team, IBM
- HF Collection: Granite 4.0 Language Models HF Collection
- GitHub Repository: ibm-granite/granite-4.0-language-models
- Website: Granite Docs
- Release Date: October 2nd, 2025
- License: Apache 2.0

Supported Languages: English, German, Spanish, French, Japanese, Portuguese, Arabic, Czech, Italian, Korean, Dutch, and Chinese. Users may finetune Granite 4.0 models for languages beyond these languages.

Intended use: The model is designed to respond to general instructions and can be used to build AI assistants for multiple domains, including business applications.

Capabilities

- Summarization
- Text classification
- Text extraction
- Question-answering
- Retrieval Augmented Generation (RAG)
- Code related tasks
- Function-calling tasks
- Multilingual dialog use cases
- Fill-In-the-Middle (FIM) code completions

Generation: This is a simple example of how to use the Granite-4.0-H-Micro model. Then, copy the snippet from the section that is relevant for your use case.

Tool-calling: Granite-4.0-H-Micro comes with enhanced tool calling capabilities, enabling seamless integration with external functions and APIs. To define a list of tools, please follow OpenAI's function definition schema. This is an example of how to use the Granite-4.0-H-Micro model's tool-calling ability:

Benchmarks (columns: Metric, Micro Dense, H Micro Dense, H Tiny MoE, H Small MoE)

Multilingual benchmarks and their included languages:

| Benchmark | Number of languages | Languages |
|-----------|---------------------|-----------|
| MMMLU | 11 | ar, de, en, es, fr, ja, ko, pt, zh, bn, hi |
| INCLUDE | 14 | hi, bn, ta, te, ar, de, es, fr, it, ja, ko, nl, pt, zh |

Model Architecture: Granite-4.0-H-Micro baseline is built on a decoder-only dense transformer architecture. Core components of this architecture are: GQA, Mamba2, MLP with SwiGLU, RMSNorm, and shared input/output embeddings.

| Model | Micro Dense | H Micro Dense | H Tiny MoE | H Small MoE |
|-------|-------------|---------------|------------|-------------|
| Number of layers | 40 attention | 4 attention / 36 Mamba2 | 4 attention / 36 Mamba2 | 4 attention / 36 Mamba2 |
| MLP / Shared expert hidden size | 8192 | 8192 | 1024 | 1536 |

Training Data: Overall, our SFT data is largely comprised of three key sources: (1) publicly available datasets with permissive license, (2) internal synthetic data targeting specific capabilities, and (3) a select set of human-curated data.

Infrastructure: We trained the Granite 4.0 Language Models utilizing an NVIDIA GB200 NVL72 cluster hosted in CoreWeave. Intra-rack communication occurs via the 72-GPU NVLink domain, and a non-blocking, full Fat-Tree NDR 400 Gb/s InfiniBand network provides inter-rack communication. This cluster provides a scalable and efficient infrastructure for training our models over thousands of GPUs.
Ethical Considerations and Limitations: Granite 4.0 Instruction Models are primarily finetuned using instruction-response pairs mostly in English, but also multilingual data covering multiple languages. Although this model can handle multilingual dialog use cases, its performance might not match its performance on English tasks. In such cases, introducing a small number of examples (few-shot) can help the model generate more accurate outputs. While this model has been aligned with safety in mind, it may in some cases produce inaccurate, biased, or unsafe responses to user prompts. We therefore urge the community to use this model with proper safety testing and tuning tailored to their specific tasks.

Resources

- ⭐️ Learn about the latest updates with Granite: https://www.ibm.com/granite
- 📄 Get started with tutorials, best practices, and prompt engineering advice: https://www.ibm.com/granite/docs/
- 💡 Learn about the latest Granite learning resources: https://ibm.biz/granite-learning-resources
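For the tool-calling section of this card, a tool list in OpenAI's function definition schema might look like the following sketch. The `get_current_weather` tool, its parameters, the stub implementation, and the `{"name": ..., "arguments": ...}` shape of the tool call are all hypothetical; only the overall layout of each tool entry (`type`, `function`, `name`, `description`, and `parameters` as JSON Schema) follows the OpenAI convention that the card refers to.

```python
import json

# One tool described in OpenAI's function definition schema.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "Name of the city"},
                },
                "required": ["city"],
            },
        },
    }
]


def get_current_weather(city: str) -> dict:
    """Stub standing in for a real weather lookup (hypothetical)."""
    return {"city": city, "forecast": "unknown"}


def dispatch(tool_call: dict) -> dict:
    """Run a parsed tool call; the call format here is an assumption."""
    registry = {"get_current_weather": get_current_weather}
    return registry[tool_call["name"]](**tool_call["arguments"])


result = dispatch({"name": "get_current_weather", "arguments": {"city": "Boston"}})
print(json.dumps(result))
```

In a Hugging Face `transformers` workflow, a list like `tools` is usually passed through the `tools` argument of `apply_chat_template`; the Granite documentation should be checked for the exact call and for the format in which the model emits tool calls.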

license:apache-2.0
11,797
13

tinyllama-chat

Finetune Mistral, Gemma, Llama 2-5x faster with 70% less memory via Unsloth!

All notebooks are beginner friendly! Add your dataset, click "Run All", and you'll get a 2x faster finetuned model which can be exported to GGUF, vLLM or uploaded to Hugging Face.

| Unsloth supports | Free Notebooks | Performance | Memory use |
|------------------|--------------------|-------------|------------|
| Gemma 7b | ▶️ Start on Colab | 2.4x faster | 58% less |
| Mistral 7b | ▶️ Start on Colab | 2.2x faster | 62% less |
| Llama-2 7b | ▶️ Start on Colab | 2.2x faster | 43% less |
| TinyLlama | ▶️ Start on Colab | 3.9x faster | 74% less |
| CodeLlama 34b A100 | ▶️ Start on Colab | 1.9x faster | 27% less |
| Mistral 7b 1xT4 | ▶️ Start on Kaggle | 5x faster\* | 62% less |
| DPO - Zephyr | ▶️ Start on Colab | 1.9x faster | 19% less |

- This conversational notebook is useful for ShareGPT ChatML / Vicuna templates.
- This text completion notebook is for raw text. This DPO notebook replicates Zephyr.
- \* Kaggle has 2x T4s, but we use 1. Due to overhead, 1x T4 is 5x faster.
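To read the table above, a quoted speedup and memory reduction can be applied to your own baseline figures. This is only a back-of-the-envelope sketch: the baseline numbers (10 hours, 16 GB) are made-up inputs, the function name is illustrative, and real results depend on hardware, sequence length, and batch size.

```python
def estimate(baseline_hours: float, baseline_gb: float,
             speedup: float, memory_reduction_pct: float) -> tuple[float, float]:
    """Apply an 'Nx faster' and 'P% less memory' claim to a baseline."""
    hours = baseline_hours / speedup
    gigabytes = baseline_gb * (1 - memory_reduction_pct / 100)
    return hours, gigabytes


# TinyLlama row: 3.9x faster, 74% less memory; hypothetical 10 h / 16 GB baseline.
hours, gigabytes = estimate(10.0, 16.0, speedup=3.9, memory_reduction_pct=74)
print(f"{hours:.2f} h, {gigabytes:.2f} GB")
```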

llama
11,612
6

Llama-3.2-3B

Finetune Llama 3.2, Gemma 2, Mistral 2-5x faster with 70% less memory via Unsloth!

We have a free Google Colab Tesla T4 notebook for Llama 3.2 (3B) here: https://colab.research.google.com/drive/1Ys44kVvmeZtnICzWz0xgpRnrIOjZAuxp?usp=sharing

Llama-3.2-3B

For more details on the model, please go to Meta's original model card.

All notebooks are beginner friendly! Add your dataset, click "Run All", and you'll get a 2x faster finetuned model which can be exported to GGUF, vLLM or uploaded to Hugging Face.

| Unsloth supports | Free Notebooks | Performance | Memory use |
|------------------|-------------------|-------------|------------|
| Llama-3.2 (3B) | ▶️ Start on Colab | 2.4x faster | 58% less |
| Llama-3.2 (11B vision) | ▶️ Start on Colab | 2.4x faster | 58% less |
| Llama-3.1 (8B) | ▶️ Start on Colab | 2.4x faster | 58% less |
| Phi-3.5 (mini) | ▶️ Start on Colab | 2x faster | 50% less |
| Gemma 2 (9B) | ▶️ Start on Colab | 2.4x faster | 58% less |
| Mistral (7B) | ▶️ Start on Colab | 2.2x faster | 62% less |
| DPO - Zephyr | ▶️ Start on Colab | 1.9x faster | 19% less |

- This conversational notebook is useful for ShareGPT ChatML / Vicuna templates.
- This text completion notebook is for raw text. This DPO notebook replicates Zephyr.
- \* Kaggle has 2x T4s, but we use 1. Due to overhead, 1x T4 is 5x faster.

Special Thanks

A huge thank you to the Meta and Llama team for creating and releasing these models.

The Meta Llama 3.2 collection of multilingual large language models (LLMs) is a collection of pretrained and instruction-tuned generative models in 1B and 3B sizes (text in/text out). The Llama 3.2 instruction-tuned text only models are optimized for multilingual dialogue use cases, including agentic retrieval and summarization tasks. They outperform many of the available open source and closed chat models on common industry benchmarks.
Model Architecture: Llama 3.2 is an auto-regressive language model that uses an optimized transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align with human preferences for helpfulness and safety.

Supported languages: English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai are officially supported. Llama 3.2 has been trained on a broader collection of languages than these 8 supported languages. Developers may fine-tune Llama 3.2 models for languages beyond these supported languages, provided they comply with the Llama 3.2 Community License and the Acceptable Use Policy. Developers are always expected to ensure that their deployments, including those that involve additional languages, are completed safely and responsibly.

Llama 3.2 family of models: Token counts refer to pretraining data only. All model versions use Grouped-Query Attention (GQA) for improved inference scalability.

Status: This is a static model trained on an offline dataset. Future versions may be released that improve model capabilities and safety.

Where to send questions or comments about the model: Instructions on how to provide feedback or comments on the model can be found in the model README. For more technical information about generation parameters and recipes for how to use Llama 3.1 in applications, please go here.

llama
11,598
17

Qwen3-0.6B-Base-unsloth-bnb-4bit

license:apache-2.0
11,576
4

Qwen3-VL-2B-Thinking-1M-GGUF

license:apache-2.0
11,485
1

granite-3.3-2b-instruct-GGUF

11,314
5

Apertus-70B-Instruct-2509-GGUF

license:apache-2.0
11,061
8

FLUX.1-Kontext-dev-GGUF

GGUF files for black-forest-labs/FLUX.1-Kontext-dev. Original license still applies. A huge thanks to @doublemathew and @hrsvrn for help with conversion. GGUFs were created with city96/ComfyUI-GGUF and llama.cpp. To run, use ComfyUI-GGUF and place the model files in the `ComfyUI/models/unet` directory. For more instructions, go to the README. `FLUX.1 Kontext [dev]` is a 12 billion parameter rectified flow transformer capable of editing images based on text instructions. For more information, please read our blog post and our technical report. You can find information about the `[pro]` version in here. Key Features 1. Change existing images based on an edit instruction. 2. Have character, style and object reference without any finetuning. 3. Robust consistency allows users to refine an image through multiple successive edits with minimal visual drift. 4. Trained using guidance distillation, making `FLUX.1 Kontext [dev]` more efficient. 5. Open weights to drive new scientific research, and empower artists to develop innovative workflows. 6. Generated outputs can be used for personal, scientific, and commercial purposes, as described in the [FLUX.1 \[dev\] Non-Commercial License](https://github.com/black-forest-labs/flux/blob/main/modellicenses/LICENSE-FLUX1-dev). Usage We provide a reference implementation of `FLUX.1 Kontext [dev]`, as well as sampling code, in a dedicated github repository. Developers and creatives looking to build on top of `FLUX.1 Kontext [dev]` are encouraged to use this as a starting point. `FLUX.1 Kontext [dev]` is also available in both ComfyUI and Diffusers. Flux Kontext comes with an integrity checker, which should be run after the image generation step. To run the safety checker, install the official repository from black-forest-labs/flux and add the following code: For VRAM saving measures and speed ups check out the diffusers docs Black Forest Labs is committed to the responsible development of generative AI technology. 
Prior to releasing FLUX.1 Kontext, we evaluated and mitigated a number of risks in our models and services, including the generation of unlawful content. We implemented a series of pre-release mitigations to help prevent misuse by third parties, with additional post-release mitigations to help address residual risks: 1. Pre-training mitigation. We filtered pre-training data for multiple categories of “not safe for work” (NSFW) content to help prevent a user generating unlawful content in response to text prompts or uploaded images. 2. Post-training mitigation. We have partnered with the Internet Watch Foundation, an independent nonprofit organization dedicated to preventing online abuse, to filter known child sexual abuse material (CSAM) from post-training data. Subsequently, we undertook multiple rounds of targeted fine-tuning to provide additional mitigation against potential abuse. By inhibiting certain behaviors and concepts in the trained model, these techniques can help to prevent a user generating synthetic CSAM or nonconsensual intimate imagery (NCII) from a text prompt, or transforming an uploaded image into synthetic CSAM or NCII. 3. Pre-release evaluation. Throughout this process, we conducted multiple internal and external third-party evaluations of model checkpoints to identify further opportunities for improvement. The third-party evaluations—which included 21 checkpoints of FLUX.1 Kontext [pro] and [dev]—focused on eliciting CSAM and NCII through adversarial testing with text-only prompts, as well as uploaded images with text prompts. Next, we conducted a final third-party evaluation of the proposed release checkpoints, focused on text-to-image and image-to-image CSAM and NCII generation. 
The final FLUX.1 Kontext [pro] (as offered through the FLUX API only) and FLUX.1 Kontext [dev] (released as an open-weight model) checkpoints demonstrated very high resilience against violative inputs, and FLUX.1 Kontext [dev] demonstrated higher resilience than other similar open-weight models across these risk categories. Based on these findings, we approved the release of the FLUX.1 Kontext [pro] model via API, and the release of the FLUX.1 Kontext [dev] model as openly-available weights under a non-commercial license to support third-party research and development. 4. Inference filters. We are applying multiple filters to intercept text prompts, uploaded images, and output images on the FLUX API for FLUX.1 Kontext [pro]. Filters for CSAM and NCII are provided by Hive, a third-party provider, and cannot be adjusted or removed by developers. We provide filters for other categories of potentially harmful content, including gore, which can be adjusted by developers based on their specific risk profile. Additionally, the repository for the open FLUX.1 Kontext [dev] model includes filters for illegal or infringing content. Filters or manual review must be used with the model under the terms of the FLUX.1 [dev] Non-Commercial License. We may approach known deployers of the FLUX.1 Kontext [dev] model at random to verify that filters or manual review processes are in place. 5. Content provenance. The FLUX API applies cryptographically-signed metadata to output content to indicate that images were produced with our model. Our API implements the Coalition for Content Provenance and Authenticity (C2PA) standard for metadata. 6. Policies. Access to our API and use of our models are governed by our Developer Terms of Service, Usage Policy, and FLUX.1 [dev] Non-Commercial License, which prohibit the generation of unlawful content or the use of generated content for unlawful, defamatory, or abusive purposes. 
Developers and users must consent to these conditions to access the FLUX Kontext models. 7. Monitoring. We are monitoring for patterns of violative use after release, and may ban developers who we detect intentionally and repeatedly violate our policies via the FLUX API. Additionally, we provide a dedicated email address ([email protected]) to solicit feedback from the community. We maintain a reporting relationship with organizations such as the Internet Watch Foundation and the National Center for Missing and Exploited Children, and we welcome ongoing engagement with authorities, developers, and researchers to share intelligence about emerging risks and develop effective mitigations. License This model falls under the [FLUX.1 \[dev\] Non-Commercial License](https://github.com/black-forest-labs/flux/blob/main/modellicenses/LICENSE-FLUX1-dev). Citation

10,962
53

Qwen3-235B-A22B-Instruct-2507-GGUF

See our collection for all versions of Qwen3 including GGUF, 4-bit & 16-bit formats. Unsloth Dynamic 2.0 achieves superior accuracy & outperforms other leading quants.

- Fine-tune Qwen3 (14B) for free using our Google Colab notebook here!
- Read our Blog about Qwen3 support: unsloth.ai/blog/qwen3
- View the rest of our notebooks in our docs here.
- Run & export your fine-tuned model to Ollama, llama.cpp or HF.

| Unsloth supports | Free Notebooks | Performance | Memory use |
|-----------------|----------------|-------------|----------|
| Qwen3 (14B) | ▶️ Start on Colab | 3x faster | 70% less |
| GRPO with Qwen3 (8B) | ▶️ Start on Colab | 3x faster | 80% less |
| Llama-3.2 (3B) | ▶️ Start on Colab | 2.4x faster | 58% less |
| Llama-3.2 (11B vision) | ▶️ Start on Colab | 2x faster | 60% less |
| Qwen2.5 (7B) | ▶️ Start on Colab | 2x faster | 60% less |

We introduce the updated version of the Qwen3-235B-A22B non-thinking mode, named Qwen3-235B-A22B-Instruct-2507, featuring the following key enhancements:

- Significant improvements in general capabilities, including instruction following, logical reasoning, text comprehension, mathematics, science, coding and tool usage.
- Substantial gains in long-tail knowledge coverage across multiple languages.
- Markedly better alignment with user preferences in subjective and open-ended tasks, enabling more helpful responses and higher-quality text generation.
- Enhanced capabilities in 256K long-context understanding.
Qwen3-235B-A22B-Instruct-2507 has the following features:

- Type: Causal Language Models
- Training Stage: Pretraining & Post-training
- Number of Parameters: 235B in total and 22B activated
- Number of Parameters (Non-Embedding): 234B
- Number of Layers: 94
- Number of Attention Heads (GQA): 64 for Q and 4 for KV
- Number of Experts: 128
- Number of Activated Experts: 8
- Context Length: 262,144 natively.

NOTE: This model supports only non-thinking mode and does not generate `<think></think>` blocks in its output. Meanwhile, specifying `enable_thinking=False` is no longer required.

For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our blog, GitHub, and Documentation.

| | Deepseek-V3-0324 | GPT-4o-0327 | Claude Opus 4 Non-thinking | Kimi K2 | Qwen3-235B-A22B Non-thinking | Qwen3-235B-A22B-Instruct-2507 |
|--- | --- | --- | --- | --- | --- | ---|
| Knowledge | | | | | | |
| MMLU-Pro | 81.2 | 79.8 | 86.6 | 81.1 | 75.2 | 83.0 |
| MMLU-Redux | 90.4 | 91.3 | 94.2 | 92.7 | 89.2 | 93.1 |
| GPQA | 68.4 | 66.9 | 74.9 | 75.1 | 62.9 | 77.5 |
| SuperGPQA | 57.3 | 51.0 | 56.5 | 57.2 | 48.2 | 62.6 |
| SimpleQA | 27.2 | 40.3 | 22.8 | 31.0 | 12.2 | 54.3 |
| CSimpleQA | 71.1 | 60.2 | 68.0 | 74.5 | 60.8 | 84.3 |
| Reasoning | | | | | | |
| AIME25 | 46.6 | 26.7 | 33.9 | 49.5 | 24.7 | 70.3 |
| HMMT25 | 27.5 | 7.9 | 15.9 | 38.8 | 10.0 | 55.4 |
| ARC-AGI | 9.0 | 8.8 | 30.3 | 13.3 | 4.3 | 41.8 |
| ZebraLogic | 83.4 | 52.6 | - | 89.0 | 37.7 | 95.0 |
| LiveBench 20241125 | 66.9 | 63.7 | 74.6 | 76.4 | 62.5 | 75.4 |
| Coding | | | | | | |
| LiveCodeBench v6 (25.02-25.05) | 45.2 | 35.8 | 44.6 | 48.9 | 32.9 | 51.8 |
| MultiPL-E | 82.2 | 82.7 | 88.5 | 85.7 | 79.3 | 87.9 |
| Aider-Polyglot | 55.1 | 45.3 | 70.7 | 59.0 | 59.6 | 57.3 |
| Alignment | | | | | | |
| IFEval | 82.3 | 83.9 | 87.4 | 89.8 | 83.2 | 88.7 |
| Arena-Hard v2\* | 45.6 | 61.9 | 51.5 | 66.1 | 52.0 | 79.2 |
| Creative Writing v3 | 81.6 | 84.9 | 83.8 | 88.1 | 80.4 | 87.5 |
| WritingBench | 74.5 | 75.5 | 79.2 | 86.2 | 77.0 | 85.2 |
| Agent | | | | | | |
| BFCL-v3 | 64.7 | 66.5 | 60.1 | 65.2 | 68.0 | 70.9 |
| TAU-Retail | 49.6 | 60.3# | 81.4 | 70.7 | 65.2 | 71.3 |
| TAU-Airline | 32.0 | 42.8# | 59.6 | 53.5 | 32.0 | 44.0 |
| Multilingualism | | | | | | |
| MultiIF | 66.5 | 70.4 | - | 76.2 | 70.2 | 77.5 |
| MMLU-ProX | 75.8 | 76.2 | - | 74.5 | 73.2 | 79.4 |
| INCLUDE | 80.1 | 82.1 | - | 76.9 | 75.6 | 79.5 |
| PolyMATH | 32.2 | 25.5 | 30.0 | 44.8 | 27.0 | 50.2 |

\*: For reproducibility, we report the win rates evaluated by GPT-4.1.

\#: Results were generated using GPT-4o-20241120, as access to the native function calling API of GPT-4o-0327 was unavailable.

The code for Qwen3-MoE is included in the latest Hugging Face `transformers`, and we advise you to use the latest version of `transformers`; older releases fail with an error when loading the model.

For deployment, you can use `sglang>=0.4.6.post1` or `vllm>=0.8.5` to create an OpenAI-compatible API endpoint.

Note: If you encounter out-of-memory (OOM) issues, consider reducing the context length to a shorter value, such as `32,768`.

For local use, applications such as Ollama, LMStudio, MLX-LM, llama.cpp, and KTransformers also support Qwen3.

Qwen3 excels in tool calling capabilities. We recommend using Qwen-Agent to make the best use of the agentic ability of Qwen3. Qwen-Agent encapsulates tool-calling templates and tool-calling parsers internally, greatly reducing coding complexity. To define the available tools, you can use the MCP configuration file, use the integrated tool of Qwen-Agent, or integrate other tools by yourself.

To achieve optimal performance, we recommend the following settings:

1. Sampling Parameters:
   - We suggest using `Temperature=0.7`, `TopP=0.8`, `TopK=20`, and `MinP=0`.
   - For supported frameworks, you can adjust the `presence_penalty` parameter between 0 and 2 to reduce endless repetitions. However, using a higher value may occasionally result in language mixing and a slight decrease in model performance.
2. Adequate Output Length: We recommend using an output length of 16,384 tokens for most queries, which is adequate for instruct models.
3. Standardize Output Format: We recommend using prompts to standardize model outputs when benchmarking.
   - Math Problems: Include "Please reason step by step, and put your final answer within \boxed{}." in the prompt.
   - Multiple-Choice Questions: Add the following JSON structure to the prompt to standardize responses: "Please show your choice in the `answer` field with only the choice letter, e.g., `"answer": "C"`."

If you find our work helpful, feel free to give us a cite.
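The recommended Qwen3 settings above can be collected into a single request payload. The following is a minimal sketch, not the upstream card's own code: `build_request` is a hypothetical helper, the model name is an assumed served name, and whether a server accepts `top_k` and `min_p` as top-level fields depends on the serving framework.

```python
import json

# Sampling values taken from the recommendations above.
RECOMMENDED_SAMPLING = {
    "temperature": 0.7,
    "top_p": 0.8,
    "top_k": 20,
    "min_p": 0.0,
}

# Prompt suffixes quoted from the "Standardize Output Format" recommendation.
MATH_SUFFIX = "Please reason step by step, and put your final answer within \\boxed{}."
MCQ_SUFFIX = (
    "Please show your choice in the `answer` field with only the choice letter, "
    'e.g., `"answer": "C"`.'
)


def build_request(question, task="general", presence_penalty=0.0, max_tokens=16384):
    """Build a chat-completions style payload (hypothetical helper)."""
    if not 0.0 <= presence_penalty <= 2.0:
        raise ValueError("presence_penalty should stay between 0 and 2")
    suffix = {"math": MATH_SUFFIX, "mcq": MCQ_SUFFIX}.get(task)
    content = question if suffix is None else f"{question}\n{suffix}"
    payload = {
        "model": "Qwen3-235B-A22B-Instruct-2507",  # assumed served model name
        "messages": [{"role": "user", "content": content}],
        "max_tokens": max_tokens,  # 16,384 is the recommended output length
        "presence_penalty": presence_penalty,
    }
    payload.update(RECOMMENDED_SAMPLING)
    return payload


print(json.dumps(build_request("What is 17 * 23?", task="math"), indent=2))
```

Keeping the sampling values in one dictionary makes it easy to apply the same settings to every benchmark query, which is the point of the standardization advice.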

license:apache-2.0
10,862
102

Qwen2.5-14B-Instruct-bnb-4bit

Finetune Llama 3.1, Gemma 2, Mistral 2-5x faster with 70% less memory via Unsloth!

We have a Qwen 2.5 (all model sizes) free Google Colab Tesla T4 notebook. Also a Qwen 2.5 conversational style notebook.

All notebooks are beginner friendly! Add your dataset, click "Run All", and you'll get a 2x faster finetuned model which can be exported to GGUF, vLLM or uploaded to Hugging Face.

| Unsloth supports | Free Notebooks | Performance | Memory use |
|-----------------|----------------|-------------|----------|
| Llama-3.1 8b | ▶️ Start on Colab | 2.4x faster | 58% less |
| Phi-3.5 (mini) | ▶️ Start on Colab | 2x faster | 50% less |
| Gemma-2 9b | ▶️ Start on Colab | 2.4x faster | 58% less |
| Mistral 7b | ▶️ Start on Colab | 2.2x faster | 62% less |
| TinyLlama | ▶️ Start on Colab | 3.9x faster | 74% less |
| DPO - Zephyr | ▶️ Start on Colab | 1.9x faster | 19% less |

- This conversational notebook is useful for ShareGPT ChatML / Vicuna templates.
- This text completion notebook is for raw text. This DPO notebook replicates Zephyr.
- Kaggle has 2x T4s, but we use 1. Due to overhead, 1x T4 is 5x faster.

Qwen2.5 is the latest series of Qwen large language models. For Qwen2.5, we release a number of base language models and instruction-tuned language models ranging from 0.5 to 72 billion parameters. Qwen2.5 brings the following improvements upon Qwen2:

- Significantly more knowledge and greatly improved capabilities in coding and mathematics, thanks to our specialized expert models in these domains.
- Significant improvements in instruction following, generating long texts (over 8K tokens), understanding structured data (e.g., tables), and generating structured outputs, especially JSON. More resilient to the diversity of system prompts, enhancing role-play implementation and condition-setting for chatbots.
- Long-context support up to 128K tokens, with generation of up to 8K tokens.
- Multilingual support for over 29 languages, including Chinese, English, French, Spanish, Portuguese, German, Italian, Russian, Japanese, Korean, Vietnamese, Thai, Arabic, and more.

This repo contains the instruction-tuned 14B Qwen2.5 model, which has the following features:

- Type: Causal Language Models
- Training Stage: Pretraining & Post-training
- Architecture: transformers with RoPE, SwiGLU, RMSNorm, and Attention QKV bias
- Number of Parameters: 14.7B
- Number of Parameters (Non-Embedding): 13.1B
- Number of Layers: 48
- Number of Attention Heads (GQA): 40 for Q and 8 for KV
- Context Length: Full 131,072 tokens and generation 8192 tokens
  - Please refer to this section for detailed instructions on how to deploy Qwen2.5 for handling long texts.

For more details, please refer to our blog, GitHub, and Documentation.

The code for Qwen2.5 is included in the latest Hugging Face `transformers`, and we advise you to use the latest version of `transformers`. With `transformers<4.37.0`, you will encounter an error.

The upstream quickstart uses `apply_chat_template` to show how to load the tokenizer and model and how to generate content.

The current `config.json` is set for context length up to 32,768 tokens. To handle extensive inputs exceeding 32,768 tokens, we utilize YaRN, a technique for enhancing model length extrapolation, ensuring optimal performance on lengthy texts. For supported frameworks, you can add a `rope_scaling` entry to `config.json` to enable YaRN.

For deployment, we recommend using vLLM. Please refer to our Documentation for usage if you are not familiar with vLLM. Presently, vLLM only supports static YaRN, which means the scaling factor remains constant regardless of input length, potentially impacting performance on shorter texts. We advise adding the `rope_scaling` configuration only when processing long contexts is required.

Detailed evaluation results are reported in this 📑 blog. For requirements on GPU memory and the respective throughput, see results here.

If you find our work helpful, feel free to give us a cite.
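The YaRN setting described above amounts to one extra entry in `config.json`. The sketch below derives that entry from the two context lengths stated in the card (32,768 native, 131,072 full); the key names `type`, `factor`, and `original_max_position_embeddings` are assumptions based on the usual `rope_scaling` convention and should be checked against the framework in use, and `with_yarn` is a hypothetical helper.

```python
import json


def with_yarn(config, target_context=131072, native_context=32768):
    """Return a copy of a config dict with a static YaRN rope_scaling entry.

    The scaling factor is the ratio of the target context length to the
    native one. Short-context deployments are returned unchanged, following
    the advice to enable YaRN only when long inputs are required.
    """
    updated = dict(config)
    if target_context <= native_context:
        return updated
    updated["rope_scaling"] = {
        "type": "yarn",  # assumed key name
        "factor": target_context / native_context,
        "original_max_position_embeddings": native_context,  # assumed key name
    }
    return updated


# Stand-in for the real config.json contents.
config = {"max_position_embeddings": 32768}
print(json.dumps(with_yarn(config), indent=2))
```

Because the factor is static, the same scaling applies to short prompts too, which is why the helper leaves the config untouched unless the target context exceeds the native length.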

license:apache-2.0
10,783
7

Qwen2.5-Coder-7B-bnb-4bit

Finetune Llama 3.1, Gemma 2, Mistral 2-5x faster with 70% less memory via Unsloth!

We have a Qwen 2.5 (all model sizes) free Google Colab Tesla T4 notebook. Also a Qwen 2.5 conversational style notebook.

All notebooks are beginner friendly! Add your dataset, click "Run All", and you'll get a 2x faster finetuned model which can be exported to GGUF, vLLM or uploaded to Hugging Face.

| Unsloth supports | Free Notebooks | Performance | Memory use |
|-----------------|----------------|-------------|----------|
| Llama-3.1 8b | ▶️ Start on Colab | 2.4x faster | 58% less |
| Phi-3.5 (mini) | ▶️ Start on Colab | 2x faster | 50% less |
| Gemma-2 9b | ▶️ Start on Colab | 2.4x faster | 58% less |
| Mistral 7b | ▶️ Start on Colab | 2.2x faster | 62% less |
| TinyLlama | ▶️ Start on Colab | 3.9x faster | 74% less |
| DPO - Zephyr | ▶️ Start on Colab | 1.9x faster | 19% less |

- This conversational notebook is useful for ShareGPT ChatML / Vicuna templates.
- This text completion notebook is for raw text. This DPO notebook replicates Zephyr.
- Kaggle has 2x T4s, but we use 1. Due to overhead, 1x T4 is 5x faster.

Qwen2.5-Coder is the latest series of code-specific Qwen large language models (formerly known as CodeQwen). For Qwen2.5-Coder, we release base and instruction-tuned language models in three sizes: 1.5, 7 and 32 (coming soon) billion parameters. Qwen2.5-Coder brings the following improvements upon CodeQwen1.5:

- Significant improvements in code generation, code reasoning and code fixing. Based on the strong Qwen2.5, we scale up the training tokens to 5.5 trillion, including source code, text-code grounding, synthetic data, etc.
- A more comprehensive foundation for real-world applications such as Code Agents, enhancing coding capabilities while maintaining strengths in mathematics and general competencies.
- Long-context support up to 128K tokens.

This repo contains the 7B Qwen2.5-Coder model, which has the following features:

- Type: Causal Language Models
- Training Stage: Pretraining
- Architecture: transformers with RoPE, SwiGLU, RMSNorm, and Attention QKV bias
- Number of Parameters: 7.61B
- Number of Parameters (Non-Embedding): 6.53B
- Number of Layers: 28
- Number of Attention Heads (GQA): 28 for Q and 4 for KV
- Context Length: Full 131,072 tokens
  - Please refer to this section for detailed instructions on how to deploy Qwen2.5 for handling long texts.

We do not recommend using base language models for conversations. Instead, you can apply post-training, e.g., SFT, RLHF, continued pretraining, etc., or use this model for fill-in-the-middle tasks.

For more details, please refer to our blog, GitHub, Documentation, Arxiv.

The code for Qwen2.5-Coder is included in the latest Hugging Face `transformers`, and we advise you to use the latest version of `transformers`. With `transformers<4.37.0`, you will encounter an error.

The current `config.json` is set for context length up to 32,768 tokens. To handle extensive inputs exceeding 32,768 tokens, we utilize YaRN, a technique for enhancing model length extrapolation, ensuring optimal performance on lengthy texts. For supported frameworks, you can add a `rope_scaling` entry to `config.json` to enable YaRN.

For deployment, we recommend using vLLM. Please refer to our Documentation for usage if you are not familiar with vLLM. Presently, vLLM only supports static YaRN, which means the scaling factor remains constant regardless of input length, potentially impacting performance on shorter texts. We advise adding the `rope_scaling` configuration only when processing long contexts is required.

Detailed evaluation results are reported in this 📑 blog. For requirements on GPU memory and the respective throughput, see results here.

If you find our work helpful, feel free to give us a cite.
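The attention layout listed above (28 layers, 28 query heads, 4 key/value heads) largely determines how much memory the KV cache needs at the full 131,072-token context. The arithmetic below is a rough sketch under two assumptions that the card does not state: a head dimension of 128 and 16-bit cache values. `kv_cache_bytes` is a hypothetical helper.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence.

    The leading factor of 2 accounts for storing both a key and a value
    vector per token, per layer, per KV head.
    """
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_value


LAYERS = 28        # from the card
Q_HEADS = 28       # from the card
KV_HEADS = 4       # from the card (GQA)
HEAD_DIM = 128     # assumption, not stated in the card
CONTEXT = 131072   # full context length from the card

gqa = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, CONTEXT)
# Hypothetical comparison: one KV head per query head (no grouping).
mha = kv_cache_bytes(LAYERS, Q_HEADS, HEAD_DIM, CONTEXT)

print(f"GQA cache: {gqa / 2**30:.1f} GiB")
print(f"Ungrouped cache: {mha / 2**30:.1f} GiB")
print(f"Reduction factor: {mha // gqa}x")
```

Under these assumptions, grouping 28 query heads onto 4 KV heads shrinks the cache by a factor of 7, which is the inference-scalability benefit GQA is meant to provide.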

license:apache-2.0
10,730
10

DeepSeek-R1-Distill-Llama-70B-GGUF

See our collection for versions of Deepseek-R1 including GGUF and original formats.

Instructions to run this model in llama.cpp (or you can view more detailed instructions here: unsloth.ai/blog/deepseek-r1):

1. Do not forget about the `<｜User｜>` and `<｜Assistant｜>` tokens! Alternatively, use a chat template formatter.
2. Obtain the latest `llama.cpp` at https://github.com/ggerganov/llama.cpp
3. Run the example with a Q8_0 K quantized cache. Note that `-no-cnv` disables auto conversation mode.
4. If you have a GPU (RTX 4090 for example) with 24GB, you can offload multiple layers to the GPU for faster processing. If you have multiple GPUs, you can probably offload more layers.

Finetune LLMs 2-5x faster with 70% less memory via Unsloth!

We have a free Google Colab Tesla T4 notebook for Llama 3.1 (8B) here: https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1(8B)-Alpaca.ipynb

All notebooks are beginner friendly! Add your dataset, click "Run All", and you'll get a 2x faster finetuned model which can be exported to GGUF, vLLM or uploaded to Hugging Face.
| Unsloth supports | Free Notebooks | Performance | Memory use |
|-----------------|----------------|-------------|----------|
| Llama-3.2 (3B) | ▶️ Start on Colab | 2.4x faster | 58% less |
| Llama-3.2 (11B vision) | ▶️ Start on Colab | 2x faster | 60% less |
| Qwen2 VL (7B) | ▶️ Start on Colab | 1.8x faster | 60% less |
| Qwen2.5 (7B) | ▶️ Start on Colab | 2x faster | 60% less |
| Llama-3.1 (8B) | ▶️ Start on Colab | 2.4x faster | 58% less |
| Phi-3.5 (mini) | ▶️ Start on Colab | 2x faster | 50% less |
| Gemma 2 (9B) | ▶️ Start on Colab | 2.4x faster | 58% less |
| Mistral (7B) | ▶️ Start on Colab | 2.2x faster | 62% less |

- This Llama 3.2 conversational notebook is useful for ShareGPT ChatML / Vicuna templates.
- This text completion notebook is for raw text. This DPO notebook replicates Zephyr.
- Kaggle has 2x T4s, but we use 1. Due to overhead, 1x T4 is 5x faster.

Special Thanks

A huge thank you to the DeepSeek team for creating and releasing these models.

We introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1. DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step, demonstrated remarkable performance on reasoning. With RL, DeepSeek-R1-Zero naturally emerged with numerous powerful and interesting reasoning behaviors. However, DeepSeek-R1-Zero encounters challenges such as endless repetition, poor readability, and language mixing. To address these issues and further enhance reasoning performance, we introduce DeepSeek-R1, which incorporates cold-start data before RL. DeepSeek-R1 achieves performance comparable to OpenAI-o1 across math, code, and reasoning tasks.
To support the research community, we have open-sourced DeepSeek-R1-Zero, DeepSeek-R1, and six dense models distilled from DeepSeek-R1 based on Llama and Qwen. DeepSeek-R1-Distill-Qwen-32B outperforms OpenAI-o1-mini across various benchmarks, achieving new state-of-the-art results for dense models. NOTE: Before running DeepSeek-R1 series models locally, we kindly recommend reviewing the Usage Recommendation section. Post-Training: Large-Scale Reinforcement Learning on the Base Model - We directly apply reinforcement learning (RL) to the base model without relying on supervised fine-tuning (SFT) as a preliminary step. This approach allows the model to explore chain-of-thought (CoT) for solving complex problems, resulting in the development of DeepSeek-R1-Zero. DeepSeek-R1-Zero demonstrates capabilities such as self-verification, reflection, and generating long CoTs, marking a significant milestone for the research community. Notably, it is the first open research to validate that reasoning capabilities of LLMs can be incentivized purely through RL, without the need for SFT. This breakthrough paves the way for future advancements in this area. - We introduce our pipeline to develop DeepSeek-R1. The pipeline incorporates two RL stages aimed at discovering improved reasoning patterns and aligning with human preferences, as well as two SFT stages that serve as the seed for the model's reasoning and non-reasoning capabilities. We believe the pipeline will benefit the industry by creating better models. - We demonstrate that the reasoning patterns of larger models can be distilled into smaller models, resulting in better performance compared to the reasoning patterns discovered through RL on small models. The open source DeepSeek-R1, as well as its API, will benefit the research community to distill better smaller models in the future. - Using the reasoning data generated by DeepSeek-R1, we fine-tuned several dense models that are widely used in the research community. 
The evaluation results demonstrate that the distilled smaller dense models perform exceptionally well on benchmarks. We open-source distilled 1.5B, 7B, 8B, 14B, 32B, and 70B checkpoints based on Qwen2.5 and Llama3 series to the community.

| Model | #Total Params | #Activated Params | Context Length | Download |
| :------------: | :------------: | :------------: | :------------: | :------------: |
| DeepSeek-R1-Zero | 671B | 37B | 128K | 🤗 HuggingFace |
| DeepSeek-R1 | 671B | 37B | 128K | 🤗 HuggingFace |

DeepSeek-R1-Zero & DeepSeek-R1 are trained based on DeepSeek-V3-Base. For more details regarding the model architecture, please refer to DeepSeek-V3 repository.

| Model | Base Model | Download |
| :------------: | :------------: | :------------: |
| DeepSeek-R1-Distill-Qwen-1.5B | Qwen2.5-Math-1.5B | 🤗 HuggingFace |
| DeepSeek-R1-Distill-Qwen-7B | Qwen2.5-Math-7B | 🤗 HuggingFace |
| DeepSeek-R1-Distill-Llama-8B | Llama-3.1-8B | 🤗 HuggingFace |
| DeepSeek-R1-Distill-Qwen-14B | Qwen2.5-14B | 🤗 HuggingFace |
| DeepSeek-R1-Distill-Qwen-32B | Qwen2.5-32B | 🤗 HuggingFace |
| DeepSeek-R1-Distill-Llama-70B | Llama-3.3-70B-Instruct | 🤗 HuggingFace |

DeepSeek-R1-Distill models are fine-tuned based on open-source models, using samples generated by DeepSeek-R1. We slightly change their configs and tokenizers. Please use our setting to run these models.

DeepSeek-R1-Evaluation

For all our models, the maximum generation length is set to 32,768 tokens. For benchmarks requiring sampling, we use a temperature of $0.6$, a top-p value of $0.95$, and generate 64 responses per query to estimate pass@1.
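The evaluation protocol above samples 64 responses per query and reports pass@1. One common way to compute that estimate is to take the fraction of correct samples for each query and then average over queries; the sketch below assumes this reading, and the function names and example data are hypothetical.

```python
def pass_at_1(correct_flags):
    """Fraction of sampled responses that are correct for one query."""
    if not correct_flags:
        raise ValueError("need at least one sampled response")
    return sum(1 for flag in correct_flags if flag) / len(correct_flags)


def benchmark_pass_at_1(per_query_flags):
    """Average the per-query pass@1 estimates over a whole benchmark."""
    scores = [pass_at_1(flags) for flags in per_query_flags]
    return sum(scores) / len(scores)


# Made-up illustration: two queries with 64 samples each.
query_a = [True] * 48 + [False] * 16   # 48 of 64 correct
query_b = [True] * 16 + [False] * 48   # 16 of 64 correct

print(pass_at_1(query_a))
print(benchmark_pass_at_1([query_a, query_b]))
```

Averaging over many samples reduces the variance that a single sampled response at temperature 0.6 would introduce.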
| Category | Benchmark (Metric) | Claude-3.5-Sonnet-1022 | GPT-4o 0513 | DeepSeek V3 | OpenAI o1-mini | OpenAI o1-1217 | DeepSeek R1 |
|----------|-------------------|----------------------|------------|--------------|----------------|------------|--------------|
| | Architecture | - | - | MoE | - | - | MoE |
| | # Activated Params | - | - | 37B | - | - | 37B |
| | # Total Params | - | - | 671B | - | - | 671B |
| English | MMLU (Pass@1) | 88.3 | 87.2 | 88.5 | 85.2 | 91.8 | 90.8 |
| | MMLU-Redux (EM) | 88.9 | 88.0 | 89.1 | 86.7 | - | 92.9 |
| | MMLU-Pro (EM) | 78.0 | 72.6 | 75.9 | 80.3 | - | 84.0 |
| | DROP (3-shot F1) | 88.3 | 83.7 | 91.6 | 83.9 | 90.2 | 92.2 |
| | IF-Eval (Prompt Strict) | 86.5 | 84.3 | 86.1 | 84.8 | - | 83.3 |
| | GPQA-Diamond (Pass@1) | 65.0 | 49.9 | 59.1 | 60.0 | 75.7 | 71.5 |
| | SimpleQA (Correct) | 28.4 | 38.2 | 24.9 | 7.0 | 47.0 | 30.1 |
| | FRAMES (Acc.) | 72.5 | 80.5 | 73.3 | 76.9 | - | 82.5 |
| | AlpacaEval2.0 (LC-winrate) | 52.0 | 51.1 | 70.0 | 57.8 | - | 87.6 |
| | ArenaHard (GPT-4-1106) | 85.2 | 80.4 | 85.5 | 92.0 | - | 92.3 |
| Code | LiveCodeBench (Pass@1-COT) | 33.8 | 34.2 | - | 53.8 | 63.4 | 65.9 |
| | Codeforces (Percentile) | 20.3 | 23.6 | 58.7 | 93.4 | 96.6 | 96.3 |
| | Codeforces (Rating) | 717 | 759 | 1134 | 1820 | 2061 | 2029 |
| | SWE Verified (Resolved) | 50.8 | 38.8 | 42.0 | 41.6 | 48.9 | 49.2 |
| | Aider-Polyglot (Acc.) | 45.3 | 16.0 | 49.6 | 32.9 | 61.7 | 53.3 |
| Math | AIME 2024 (Pass@1) | 16.0 | 9.3 | 39.2 | 63.6 | 79.2 | 79.8 |
| | MATH-500 (Pass@1) | 78.3 | 74.6 | 90.2 | 90.0 | 96.4 | 97.3 |
| | CNMO 2024 (Pass@1) | 13.1 | 10.8 | 43.2 | 67.6 | - | 78.8 |
| Chinese | CLUEWSC (EM) | 85.4 | 87.9 | 90.9 | 89.9 | - | 92.8 |
| | C-Eval (EM) | 76.7 | 76.0 | 86.5 | 68.9 | - | 91.8 |
| | C-SimpleQA (Correct) | 55.4 | 58.7 | 68.0 | 40.3 | - | 63.7 |

| Model | AIME 2024 pass@1 | AIME 2024 cons@64 | MATH-500 pass@1 | GPQA Diamond pass@1 | LiveCodeBench pass@1 | CodeForces rating |
|------------------------------------------|------------------|-------------------|-----------------|----------------------|----------------------|-------------------|
| GPT-4o-0513 | 9.3 | 13.4 | 74.6 | 49.9 | 32.9 | 759 |
| Claude-3.5-Sonnet-1022 | 16.0 | 26.7 | 78.3 | 65.0 | 38.9 | 717 |
| o1-mini | 63.6 | 80.0 | 90.0 | 60.0 | 53.8 | 1820 |
| QwQ-32B-Preview | 44.0 | 60.0 | 90.6 | 54.5 | 41.9 | 1316 |
| DeepSeek-R1-Distill-Qwen-1.5B | 28.9 | 52.7 | 83.9 | 33.8 | 16.9 | 954 |
| DeepSeek-R1-Distill-Qwen-7B | 55.5 | 83.3 | 92.8 | 49.1 | 37.6 | 1189 |
| DeepSeek-R1-Distill-Qwen-14B | 69.7 | 80.0 | 93.9 | 59.1 | 53.1 | 1481 |
| DeepSeek-R1-Distill-Qwen-32B | 72.6 | 83.3 | 94.3 | 62.1 | 57.2 | 1691 |
| DeepSeek-R1-Distill-Llama-8B | 50.4 | 80.0 | 89.1 | 49.0 | 39.6 | 1205 |
| DeepSeek-R1-Distill-Llama-70B | 70.0 | 86.7 | 94.5 | 65.2 | 57.5 | 1633 |

5. Chat Website & API Platform

You can chat with DeepSeek-R1 on DeepSeek's official website, chat.deepseek.com, by switching on the "DeepThink" button.

We also provide an OpenAI-compatible API at DeepSeek Platform: platform.deepseek.com

Please visit the DeepSeek-V3 repo for more information about running DeepSeek-R1 locally.

DeepSeek-R1-Distill models can be utilized in the same manner as Qwen or Llama models.
For instance, you can easily start a service using vLLM: We recommend adhering to the following configurations when utilizing the DeepSeek-R1 series models, including benchmarking, to achieve the expected performance: 1. Set the temperature within the range of 0.5-0.7 (0.6 is recommended) to prevent endless repetitions or incoherent outputs. 2. Avoid adding a system prompt; all instructions should be contained within the user prompt. 3. For mathematical problems, it is advisable to include a directive in your prompt such as: "Please reason step by step, and put your final answer within \boxed{}." 4. When evaluating model performance, it is recommended to conduct multiple tests and average the results. 7. License This code repository and the model weights are licensed under the MIT License. DeepSeek-R1 series support commercial use, allow for any modifications and derivative works, including, but not limited to, distillation for training other LLMs. Please note that: - DeepSeek-R1-Distill-Qwen-1.5B, DeepSeek-R1-Distill-Qwen-7B, DeepSeek-R1-Distill-Qwen-14B and DeepSeek-R1-Distill-Qwen-32B are derived from Qwen-2.5 series, which are originally licensed under Apache 2.0 License, and now finetuned with 800k samples curated with DeepSeek-R1. - DeepSeek-R1-Distill-Llama-8B is derived from Llama3.1-8B-Base and is originally licensed under llama3.1 license. - DeepSeek-R1-Distill-Llama-70B is derived from Llama3.3-70B-Instruct and is originally licensed under llama3.3 license. 9. Contact If you have any questions, please raise an issue or contact us at [email protected].
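The usage recommendations for the DeepSeek-R1 series listed above can be encoded in a small request builder. This is a minimal sketch under stated assumptions: `build_r1_request` and `average_score` are hypothetical helpers, the payload follows the common chat-completions shape, and the `top_p` of 0.95 is borrowed from the evaluation settings rather than from the recommendation list.

```python
MATH_DIRECTIVE = "Please reason step by step, and put your final answer within \\boxed{}."


def build_r1_request(user_prompt, is_math=False, temperature=0.6):
    """Build a request that follows the recommendations (hypothetical helper)."""
    # Recommendation 1: keep the temperature within 0.5-0.7.
    if not 0.5 <= temperature <= 0.7:
        raise ValueError("temperature should stay within 0.5-0.7")
    # Recommendation 3: add the step-by-step directive for math problems.
    content = f"{user_prompt}\n{MATH_DIRECTIVE}" if is_math else user_prompt
    return {
        # Recommendation 2: no system prompt, only a user message.
        "messages": [{"role": "user", "content": content}],
        "temperature": temperature,
        "top_p": 0.95,  # value used in the evaluation settings
    }


def average_score(run_scores):
    """Recommendation 4: average the results of multiple test runs."""
    return sum(run_scores) / len(run_scores)


request = build_r1_request("Solve x^2 - 5x + 6 = 0.", is_math=True)
print(request["messages"][0]["content"])
print(average_score([0.5, 0.75, 1.0]))
```

Rejecting out-of-range temperatures up front makes it harder to trigger the endless repetitions or incoherent outputs the recommendations warn about.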

llama
10,678
93

Llama-3.3-70B-Instruct-GGUF

llama
10,641
88

granite-4.0-h-1b-GGUF

Unsloth Dynamic 2.0 achieves superior accuracy & outperforms other leading quants.

Model Summary: Granite-4.0-H-1B is a lightweight instruct model finetuned from Granite-4.0-H-1B-Base using a combination of open source instruction datasets with permissive license and internally collected synthetic datasets. This model is developed using a diverse set of techniques including supervised finetuning, reinforcement learning, and model merging.

- Developers: Granite Team, IBM
- HF Collection: Granite 4.0 Nano Language Models HF Collection
- GitHub Repository: ibm-granite/granite-4.0-nano-language-models
- Website: Granite Docs
- Release Date: October 28, 2025
- License: Apache 2.0

Supported Languages: English, German, Spanish, French, Japanese, Portuguese, Arabic, Czech, Italian, Korean, Dutch, and Chinese. Users may fine-tune Granite 4.0 Nano models to support languages beyond those included in this list.

Intended use: Granite 4.0 Nano instruct models feature strong instruction-following capabilities, bringing advanced AI capabilities within reach for on-device deployments and research use cases. Additionally, their compact size makes them well-suited for fine-tuning on specialized domains without requiring massive compute resources.

Capabilities

- Summarization
- Text classification
- Text extraction
- Question-answering
- Retrieval Augmented Generation (RAG)
- Code related tasks
- Function-calling tasks
- Multilingual dialog use cases
- Fill-In-the-Middle (FIM) code completions

Generation: This is a simple example of how to use the Granite-4.0-H-1B model. Copy the snippet from the section that is relevant for your use case.

Tool-calling: Granite-4.0-H-1B comes with enhanced tool calling capabilities, enabling seamless integration with external functions and APIs. To define a list of tools, please follow OpenAI's function definition schema. This is an example of how to use the Granite-4.0-H-1B model's tool-calling ability:

Benchmarks

| Metric | 350M Dense | H 350M Dense | 1B Dense | H 1B Dense |
|---|---|---|---|---|

Multilingual Benchmarks and their included languages:

- MMMLU (11): ar, de, en, es, fr, ja, ko, pt, zh, bn, hi
- INCLUDE (14): hi, bn, ta, te, ar, de, es, fr, it, ja, ko, nl, pt, zh

Granite-4.0-H-1B baseline is based on a decoder-only dense transformer architecture. Core components of this architecture are: GQA, Mamba2, MLP with SwiGLU, RMSNorm, and shared input/output embeddings.

| | 350M Dense | H 350M Dense | 1B Dense | H 1B Dense |
|---|---|---|---|---|
| Number of layers | 28 attention | 4 attention / 28 Mamba2 | 40 attention | 4 attention / 36 Mamba2 |
| MLP / Shared expert hidden size | 2048 | 2048 | 4096 | 4096 |

Training Data: Overall, our SFT data is largely comprised of three key sources: (1) publicly available datasets with permissive license, (2) internal synthetic data targeting specific capabilities, and (3) a select set of human-curated data.

Infrastructure: We trained the Granite 4.0 Nano Language Models utilizing an NVIDIA GB200 NVL72 cluster hosted in CoreWeave. Intra-rack communication occurs via the 72-GPU NVLink domain, and a non-blocking, full Fat-Tree NDR 400 Gb/s InfiniBand network provides inter-rack communication. This cluster provides a scalable and efficient infrastructure for training our models over thousands of GPUs.

Ethical Considerations and Limitations: Granite 4.0 Nano Instruct Models are primarily finetuned using instruction-response pairs mostly in English, but also multilingual data covering multiple languages. Although this model can handle multilingual dialog use cases, its performance might not match its performance on English tasks. In such cases, introducing a small number of examples (few-shot) can help the model generate more accurate outputs. While this model has been aligned with safety in mind, it may in some cases produce inaccurate, biased, or unsafe responses to user prompts. We therefore urge the community to use this model with proper safety testing and tuning tailored to their specific tasks.

Resources

- ⭐️ Learn about the latest updates with Granite: https://www.ibm.com/granite
- 📄 Get started with tutorials, best practices, and prompt engineering advice: https://www.ibm.com/granite/docs/
- 💡 Learn about the latest Granite learning resources: https://ibm.biz/granite-learning-resources
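The card asks for tools to be defined in OpenAI's function definition schema. A minimal sketch of one such definition is shown here; the `make_tool` helper and the `get_current_weather` tool are hypothetical and exist only for illustration. The nested layout (`type`, `function`, `name`, `description`, `parameters` as a JSON Schema object) is the part that follows the schema.

```python
import json


def make_tool(name, description, properties, required):
    """Wrap a function description in the OpenAI function-definition layout."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            # "parameters" is a JSON Schema object describing the arguments.
            "parameters": {
                "type": "object",
                "properties": properties,
                "required": required,
            },
        },
    }


# Hypothetical tool, used only to show the shape of a definition.
tools = [
    make_tool(
        name="get_current_weather",
        description="Get the current weather for a city.",
        properties={"city": {"type": "string", "description": "Name of the city"}},
        required=["city"],
    )
]

print(json.dumps(tools, indent=2))
```

A list like `tools` is typically what gets handed to the chat template or the serving API alongside the conversation; how it is passed depends on the runtime you use.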

license:apache-2.0
10,641
10

Qwen2.5-VL-3B-Instruct

--- base_model: - Qwen/Qwen2.5-VL-3B-Instruct license_name: qwen-research license_link: https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct/blob/main/LICENSE language: - en pipeline_tag: image-text-to-text tags: - multimodal - unsloth - unsloth library_name: transformers ---

10,602
5

Qwen3-4B-Thinking-2507-unsloth-bnb-4bit

Unsloth Dynamic 2.0 achieves superior accuracy & outperforms other leading quants.

Over the past three months, we have continued to scale the thinking capability of Qwen3-4B, improving both the quality and depth of reasoning. We are pleased to introduce Qwen3-4B-Thinking-2507, featuring the following key enhancements:

- Significantly improved performance on reasoning tasks, including logical reasoning, mathematics, science, coding, and academic benchmarks that typically require human expertise.
- Markedly better general capabilities, such as instruction following, tool usage, text generation, and alignment with human preferences.
- Enhanced 256K long-context understanding capabilities.

NOTE: This version has an increased thinking length. We strongly recommend its use in highly complex reasoning tasks.

Qwen3-4B-Thinking-2507 has the following features:

- Type: Causal Language Models
- Training Stage: Pretraining & Post-training
- Number of Parameters: 4.0B
- Number of Parameters (Non-Embedding): 3.6B
- Number of Layers: 36
- Number of Attention Heads (GQA): 32 for Q and 8 for KV
- Context Length: 262,144 natively.

NOTE: This model supports only thinking mode. Meanwhile, specifying `enable_thinking=True` is no longer required.

Additionally, to enforce model thinking, the default chat template automatically includes `<think>`. Therefore, it is normal for the model's output to contain only `</think>` without an explicit opening `<think>` tag.

For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our blog, GitHub, and Documentation.
| | Qwen3-30B-A3B Thinking | Qwen3-4B Thinking | Qwen3-4B-Thinking-2507 |
|--- | --- | --- | --- |
| Knowledge | | | |
| MMLU-Pro | 78.5 | 70.4 | 74.0 |
| MMLU-Redux | 89.5 | 83.7 | 86.1 |
| GPQA | 65.8 | 55.9 | 65.8 |
| SuperGPQA | 51.8 | 42.7 | 47.8 |
| Reasoning | | | |
| AIME25 | 70.9 | 65.6 | 81.3 |
| HMMT25 | 49.8 | 42.1 | 55.5 |
| LiveBench 20241125 | 74.3 | 63.6 | 71.8 |
| Coding | | | |
| LiveCodeBench v6 (25.02-25.05) | 57.4 | 48.4 | 55.2 |
| CFEval | 1940 | 1671 | 1852 |
| OJBench | 20.7 | 16.1 | 17.9 |
| Alignment | | | |
| IFEval | 86.5 | 81.9 | 87.4 |
| Arena-Hard v2$ | 36.3 | 13.7 | 34.9 |
| Creative Writing v3 | 79.1 | 61.1 | 75.6 |
| WritingBench | 77.0 | 73.5 | 83.3 |
| Agent | | | |
| BFCL-v3 | 69.1 | 65.9 | 71.2 |
| TAU1-Retail | 61.7 | 33.9 | 66.1 |
| TAU1-Airline | 32.0 | 32.0 | 48.0 |
| TAU2-Retail | 34.2 | 38.6 | 53.5 |
| TAU2-Airline | 36.0 | 28.0 | 58.0 |
| TAU2-Telecom | 22.8 | 17.5 | 27.2 |
| Multilingualism | | | |
| MultiIF | 72.2 | 66.3 | 77.3 |
| MMLU-ProX | 73.1 | 61.0 | 64.2 |
| INCLUDE | 71.9 | 61.8 | 64.4 |
| PolyMATH | 46.1 | 40.0 | 46.2 |

$ For reproducibility, we report the win rates evaluated by GPT-4.1.

\& For highly challenging tasks (including PolyMATH and all reasoning and coding tasks), we use an output length of 81,920 tokens. For all other tasks, we set the output length to 32,768.

The code of Qwen3 has been in the latest Hugging Face `transformers` and we advise you to use the latest version of `transformers`.
With an outdated `transformers`, you will encounter an error when loading the model.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Setup below uses standard Transformers calls and a placeholder prompt (assumptions).
model_name = "Qwen/Qwen3-4B-Thinking-2507"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "Give me a short introduction to large language models."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
generated_ids = model.generate(**model_inputs, max_new_tokens=32768)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

# Split the output at the last occurrence of token 151668 (</think>).
try:
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content)  # no opening <think> tag
print("content:", content)
```

```shell
python -m sglang.launch_server --model-path Qwen/Qwen3-4B-Thinking-2507 --context-length 262144 --reasoning-parser deepseek-r1
```

```shell
vllm serve Qwen/Qwen3-4B-Thinking-2507 --max-model-len 262144 --enable-reasoning --reasoning-parser deepseek_r1
```

```python
from qwen_agent.agents import Assistant

# Define LLM
# Using OpenAI-compatible API endpoint. It is recommended to disable the reasoning and the tool call parsing
# functionality of the deployment frameworks and let Qwen-Agent automate the related operations. For example,
# `VLLM_USE_MODELSCOPE=true vllm serve Qwen/Qwen3-4B-Thinking-2507 --served-model-name Qwen3-4B-Thinking-2507 --max-model-len 262144`.
llm_cfg = {
    'model': 'Qwen3-4B-Thinking-2507',

    # Use a custom endpoint compatible with OpenAI API:
    'model_server': 'http://localhost:8000/v1',  # api_base without reasoning and tool call parsing
    'api_key': 'EMPTY',
    'generate_cfg': {
        'thought_in_content': True,
    },
}

# Define Tools
tools = [
    {'mcpServers': {  # You can specify the MCP configuration file
            'time': {
                'command': 'uvx',
                'args': ['mcp-server-time', '--local-timezone=Asia/Shanghai']
            },
            "fetch": {
                "command": "uvx",
                "args": ["mcp-server-fetch"]
            }
        }
    },
    'code_interpreter',  # Built-in tools
]

# Define Agent
bot = Assistant(llm=llm_cfg, function_list=tools)

# Streaming generation
messages = [{'role': 'user', 'content': 'https://qwenlm.github.io/blog/ Introduce the latest developments of Qwen'}]
for responses in bot.run(messages=messages):
    pass
print(responses)
```

```
@misc{qwen3technicalreport,
      title={Qwen3 Technical Report},
      author={Qwen Team},
      year={2025},
      eprint={2505.09388},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.09388},
}
```
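The parsing step in this card searches the generated ids for the last occurrence of token `151668` and splits the output there. The same logic can be tried without loading a model; the sketch below assumes `151668` is the id of the closing think tag, and `split_thinking` is a hypothetical helper name.

```python
END_THINK_ID = 151668  # id the card's snippet searches for; assumed to be </think>


def split_thinking(output_ids, end_id=END_THINK_ID):
    """Split generated ids into (thinking part, final answer part).

    The split point is just after the LAST occurrence of end_id. If end_id
    never appears, everything is treated as the final answer.
    """
    try:
        # Reverse search gives the distance of the last end_id from the end.
        index = len(output_ids) - output_ids[::-1].index(end_id)
    except ValueError:
        index = 0
    return output_ids[:index], output_ids[index:]


thinking, answer = split_thinking([11, 12, END_THINK_ID, 21, 22])
print(thinking, answer)
```

Searching from the end matters when the end marker appears more than once: only the tokens after the final marker are treated as the answer.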

license:apache-2.0
10,473
2

Qwen3-235B-A22B-Thinking-2507-GGUF

See our collection for all versions of Qwen3 including GGUF, 4-bit & 16-bit formats.

Learn to run Qwen3-2507 correctly - Read our Guide.

Unsloth Dynamic 2.0 achieves superior accuracy & outperforms other leading quants.

- Fine-tune Qwen3 (14B) for free using our Google Colab notebook here!
- Read our Blog about Qwen3 support: unsloth.ai/blog/qwen3
- View the rest of our notebooks in our docs here.
- Run & export your fine-tuned model to Ollama, llama.cpp or HF.

| Unsloth supports | Free Notebooks | Performance | Memory use |
|---|---|---|---|
| Qwen3 (14B) | ▶️ Start on Colab | 3x faster | 70% less |
| GRPO with Qwen3 (8B) | ▶️ Start on Colab | 3x faster | 80% less |
| Llama-3.2 (3B) | ▶️ Start on Colab | 2.4x faster | 58% less |
| Llama-3.2 (11B vision) | ▶️ Start on Colab | 2x faster | 60% less |
| Qwen2.5 (7B) | ▶️ Start on Colab | 2x faster | 60% less |

Over the past three months, we have continued to scale the thinking capability of Qwen3-235B-A22B, improving both the quality and depth of reasoning. We are pleased to introduce Qwen3-235B-A22B-Thinking-2507, featuring the following key enhancements:

- Significantly improved performance on reasoning tasks, including logical reasoning, mathematics, science, coding, and academic benchmarks that typically require human expertise, achieving state-of-the-art results among open-source thinking models.
- Markedly better general capabilities, such as instruction following, tool usage, text generation, and alignment with human preferences.
- Enhanced 256K long-context understanding capabilities.

NOTE: This version has an increased thinking length. We strongly recommend its use in highly complex reasoning tasks.
Qwen3-235B-A22B-Thinking-2507 has the following features:

- Type: Causal Language Models
- Training Stage: Pretraining & Post-training
- Number of Parameters: 235B in total and 22B activated
- Number of Parameters (Non-Embedding): 234B
- Number of Layers: 94
- Number of Attention Heads (GQA): 64 for Q and 4 for KV
- Number of Experts: 128
- Number of Activated Experts: 8
- Context Length: 262,144 natively.

Additionally, to enforce model thinking, the default chat template automatically includes `<think>`. Therefore, it is normal for the model's output to contain only `</think>` without an explicit opening `<think>` tag.

For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our blog, GitHub, and Documentation.

| | Deepseek-R1-0528 | OpenAI O4-mini | OpenAI O3 | Gemini-2.5 Pro | Claude4 Opus Thinking | Qwen3-235B-A22B Thinking | Qwen3-235B-A22B-Thinking-2507 |
|--- | --- | --- | --- | --- | --- | --- | --- |
| Knowledge | | | | | | | |
| MMLU-Pro | 85.0 | 81.9 | 85.9 | 85.6 | - | 82.8 | 84.4 |
| MMLU-Redux | 93.4 | 92.8 | 94.9 | 94.4 | 94.6 | 92.7 | 93.8 |
| GPQA | 81.0 | 81.4 | 83.3 | 86.4 | 79.6 | 71.1 | 81.1 |
| SuperGPQA | 61.7 | 56.4 | - | 62.3 | - | 60.7 | 64.9 |
| Reasoning | | | | | | | |
| AIME25 | 87.5 | 92.7 | 88.9 | 88.0 | 75.5 | 81.5 | 92.3 |
| HMMT25 | 79.4 | 66.7 | 77.5 | 82.5 | 58.3 | 62.5 | 83.9 |
| LiveBench 20241125 | 74.7 | 75.8 | 78.3 | 82.4 | 78.2 | 77.1 | 78.4 |
| HLE | 17.7# | 18.1 | 20.3 | 21.6 | 10.7 | 11.8# | 18.2# |
| Coding | | | | | | | |
| LiveCodeBench v6 (25.02-25.05) | 68.7 | 71.8 | 58.6 | 72.5 | 48.9 | 55.7 | 74.1 |
| CFEval | 2099 | 1929 | 2043 | 2001 | - | 2056 | 2134 |
| OJBench | 33.6 | 33.3 | 25.4 | 38.9 | - | 25.6 | 32.5 |
| Alignment | | | | | | | |
| IFEval | 79.1 | 92.4 | 92.1 | 90.8 | 89.7 | 83.4 | 87.8 |
| Arena-Hard v2$ | 72.2 | 59.3 | 80.8 | 72.5 | 59.1 | 61.5 | 79.7 |
| Creative Writing v3 | 86.3 | 78.8 | 87.7 | 85.9 | 83.8 | 84.6 | 86.1 |
| WritingBench | 83.2 | 78.4 | 85.3 | 83.1 | 79.1 | 80.3 | 88.3 |
| Agent | | | | | | | |
| BFCL-v3 | 63.8 | 67.2 | 72.4 | 67.2 | 61.8 | 70.8 | 71.9 |
| TAU2-Retail | 64.9 | 71.0 | 76.3 | 71.3 | - | 40.4 | 71.9 |
| TAU2-Airline | 60.0 | 59.0 | 70.0 | 60.0 | - | 30.0 | 58.0 |
| TAU2-Telecom | 33.3 | 42.0 | 60.5 | 37.4 | - | 21.9 | 45.6 |
| Multilingualism | | | | | | | |
| MultiIF | 63.5 | 78.0 | 80.3 | 77.8 | - | 71.9 | 80.6 |
| MMLU-ProX | 80.6 | 79.0 | 83.3 | 84.7 | - | 80.0 | 81.0 |
| INCLUDE | 79.4 | 80.8 | 86.6 | 85.1 | - | 78.7 | 81.0 |
| PolyMATH | 46.9 | 48.7 | 49.7 | 52.2 | - | 54.7 | 60.1 |

\* For OpenAI O4-mini and O3, we use a medium reasoning effort, except for scores marked with \*, which are generated using high reasoning effort.

\# According to the official evaluation criteria of HLE, scores marked with \# refer to models that are not multi-modal and were evaluated only on the text-only subset.

$ For reproducibility, we report the win rates evaluated by GPT-4.1.

\& For highly challenging tasks (including PolyMATH and all reasoning and coding tasks), we use an output length of 81,920 tokens. For all other tasks, we set the output length to 32,768.

The code of Qwen3-MoE has been in the latest Hugging Face `transformers` and we advise you to use the latest version of `transformers`.
With an outdated `transformers`, you will encounter an error when loading the model.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Setup below uses standard Transformers calls and a placeholder prompt (assumptions).
model_name = "Qwen/Qwen3-235B-A22B-Thinking-2507"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "Give me a short introduction to large language models."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
generated_ids = model.generate(**model_inputs, max_new_tokens=32768)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

# Split the output at the last occurrence of token 151668 (</think>).
try:
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content)  # no opening <think> tag
print("content:", content)
```

```shell
python -m sglang.launch_server --model-path Qwen/Qwen3-235B-A22B-Thinking-2507 --tp 8 --context-length 262144 --reasoning-parser qwen3
```

```shell
vllm serve Qwen/Qwen3-235B-A22B-Thinking-2507 --tensor-parallel-size 8 --max-model-len 262144 --enable-reasoning --reasoning-parser deepseek_r1
```

```python
from qwen_agent.agents import Assistant

# Define LLM
# Using Alibaba Cloud Model Studio
llm_cfg = {
    'model': 'qwen3-235b-a22b-thinking-2507',
    'model_type': 'qwen_dashscope',
}

# Using OpenAI-compatible API endpoint. It is recommended to disable the reasoning and the tool call parsing
# functionality of the deployment frameworks and let Qwen-Agent automate the related operations. For example,
# `VLLM_USE_MODELSCOPE=true vllm serve Qwen/Qwen3-235B-A22B-Thinking-2507 --served-model-name Qwen3-235B-A22B-Thinking-2507 --tensor-parallel-size 8 --max-model-len 262144`.
#
# llm_cfg = {
#     'model': 'Qwen3-235B-A22B-Thinking-2507',
#
#     # Use a custom endpoint compatible with OpenAI API:
#     'model_server': 'http://localhost:8000/v1',  # api_base without reasoning and tool call parsing
#     'api_key': 'EMPTY',
#     'generate_cfg': {
#         'thought_in_content': True,
#     },
# }

# Define Tools
tools = [
    {'mcpServers': {  # You can specify the MCP configuration file
            'time': {
                'command': 'uvx',
                'args': ['mcp-server-time', '--local-timezone=Asia/Shanghai']
            },
            "fetch": {
                "command": "uvx",
                "args": ["mcp-server-fetch"]
            }
        }
    },
    'code_interpreter',  # Built-in tools
]

# Define Agent
bot = Assistant(llm=llm_cfg, function_list=tools)

# Streaming generation
messages = [{'role': 'user', 'content': 'https://qwenlm.github.io/blog/ Introduce the latest developments of Qwen'}]
for responses in bot.run(messages=messages):
    pass
print(responses)
```

```
@misc{qwen3technicalreport,
      title={Qwen3 Technical Report},
      author={Qwen Team},
      year={2025},
      eprint={2505.09388},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.09388},
}
```
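As a quick check on the size figures quoted in this card's feature list (235B total parameters with 22B activated, 128 experts with 8 activated), the short calculation below works out what fraction of the model is used per token. It is plain arithmetic on the card's numbers; the variable names are illustrative.

```python
# Figures taken from the feature list of this card (in billions / counts).
total_params_b, active_params_b = 235, 22
num_experts, active_experts = 128, 8

expert_fraction = active_experts / num_experts      # share of experts used per token
param_fraction = active_params_b / total_params_b   # share of parameters used per token

print(f"{expert_fraction:.2%} of experts, {param_fraction:.1%} of parameters per token")
```

The parameter share comes out higher than the expert share, presumably because components outside the expert layers (such as attention and embeddings) are used for every token.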

license:apache-2.0
10,319
76

Qwen2.5-3B-Instruct

Finetune Llama 3.1, Gemma 2, Mistral 2-5x faster with 70% less memory via Unsloth!

We have a Qwen 2.5 (all model sizes) free Google Colab Tesla T4 notebook. Also a Qwen 2.5 conversational style notebook.

All notebooks are beginner friendly! Add your dataset, click "Run All", and you'll get a 2x faster finetuned model which can be exported to GGUF, vLLM or uploaded to Hugging Face.

| Unsloth supports | Free Notebooks | Performance | Memory use |
|---|---|---|---|
| Llama-3.2 (3B) | ▶️ Start on Colab | 2.4x faster | 58% less |
| Llama-3.2 (11B vision) | ▶️ Start on Colab | 2x faster | 60% less |
| Llama-3.1 (8B) | ▶️ Start on Colab | 2.4x faster | 58% less |
| Qwen2 VL (7B) | ▶️ Start on Colab | 1.8x faster | 60% less |
| Qwen2.5 (7B) | ▶️ Start on Colab | 2x faster | 60% less |
| Phi-3.5 (mini) | ▶️ Start on Colab | 2x faster | 50% less |
| Gemma 2 (9B) | ▶️ Start on Colab | 2.4x faster | 58% less |
| Mistral (7B) | ▶️ Start on Colab | 2.2x faster | 62% less |
| DPO - Zephyr | ▶️ Start on Colab | 1.9x faster | 19% less |

- This conversational notebook is useful for ShareGPT ChatML / Vicuna templates.
- This text completion notebook is for raw text. This DPO notebook replicates Zephyr.
- \* Kaggle has 2x T4s, but we use 1. Due to overhead, 1x T4 is 5x faster.

Qwen2.5 is the latest series of Qwen large language models. For Qwen2.5, we release a number of base language models and instruction-tuned language models ranging from 0.5 to 72 billion parameters. Qwen2.5 brings the following improvements upon Qwen2:

- Significantly more knowledge and greatly improved capabilities in coding and mathematics, thanks to our specialized expert models in these domains.
- Significant improvements in instruction following, generating long texts (over 8K tokens), understanding structured data (e.g., tables), and generating structured outputs, especially JSON. More resilient to the diversity of system prompts, enhancing role-play implementation and condition-setting for chatbots.
- Long-context Support up to 128K tokens and can generate up to 8K tokens.
- Multilingual support for over 29 languages, including Chinese, English, French, Spanish, Portuguese, German, Italian, Russian, Japanese, Korean, Vietnamese, Thai, Arabic, and more.

This repo contains the instruction-tuned 3B Qwen2.5 model, which has the following features:

- Type: Causal Language Models
- Training Stage: Pretraining & Post-training
- Architecture: transformers with RoPE, SwiGLU, RMSNorm, Attention QKV bias and tied word embeddings
- Number of Parameters: 3.09B
- Number of Parameters (Non-Embedding): 2.77B
- Number of Layers: 36
- Number of Attention Heads (GQA): 16 for Q and 2 for KV
- Context Length: Full 32,768 tokens and generation 8192 tokens

For more details, please refer to our blog, GitHub, and Documentation.

The code of Qwen2.5 has been in the latest Hugging Face `transformers` and we advise you to use the latest version of `transformers`.

With `transformers<4.37.0`, you will encounter the following error:

Here is a code snippet with `apply_chat_template` that shows how to load the tokenizer and model and how to generate content.

Detailed evaluation results are reported in this 📑 blog.

For requirements on GPU memory and the respective throughput, see results here.

If you find our work helpful, feel free to cite us.
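The card refers to `apply_chat_template` for turning a list of messages into the prompt string the model sees. To make that step concrete, here is a hand-written sketch of a ChatML-style rendering, which is the format Qwen2.5 chat templates are generally understood to follow. `to_chatml` is a hypothetical helper for illustration only; the real template shipped with the tokenizer handles more cases (for example default system prompts and tools), so use `tokenizer.apply_chat_template` in practice.

```python
def to_chatml(messages, add_generation_prompt=True):
    """Render chat messages in a ChatML-style layout (simplified sketch)."""
    text = "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        text += "<|im_start|>assistant\n"
    return text


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Give me a short introduction to large language models."},
]
prompt = to_chatml(messages)
print(prompt)
```

Printing the prompt this way is a convenient check when debugging why a fine-tuned or quantized model ignores the system message or keeps generating past the end of a turn.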

9,967
15

Llama-3.1-8B-Instruct-GGUF

See our collection for versions of Llama 3.1 including 4-bit & 16-bit formats.

Unsloth Dynamic v2.0 achieves superior accuracy & outperforms other leading quant methods.

- Read our Blog about Llama 3.1 fine-tuning support: unsloth.ai/blog/llama4
- View the rest of our fine-tuning notebooks in our docs here.
- Export your fine-tuned model to GGUF, Ollama, llama.cpp, vLLM or 🤗HF.

| Unsloth supports | Free Notebooks | Performance | Memory use |
|---|---|---|---|
| GRPO with Llama 3.1 (8B) | ▶️ Start on Colab | 2x faster | 80% less |
| Llama-3.2 (3B) | ▶️ Start on Colab | 2.4x faster | 58% less |
| Llama-3.2 (11B vision) | ▶️ Start on Colab | 2x faster | 60% less |
| Qwen2.5 (7B) | ▶️ Start on Colab | 2x faster | 60% less |
| Phi-4 (14B) | ▶️ Start on Colab | 2x faster | 50% less |
| Mistral (7B) | ▶️ Start on Colab | 2.2x faster | 62% less |

The Meta Llama 3.1 collection of multilingual large language models (LLMs) is a collection of pretrained and instruction tuned generative models in 8B, 70B and 405B sizes (text in/text out). The Llama 3.1 instruction tuned text only models (8B, 70B, 405B) are optimized for multilingual dialogue use cases and outperform many of the available open source and closed chat models on common industry benchmarks.

Model Architecture: Llama 3.1 is an auto-regressive language model that uses an optimized transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align with human preferences for helpfulness and safety.

Supported languages: English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai.

Llama 3.1 family of models. Token counts refer to pretraining data only.
All model versions use Grouped-Query Attention (GQA) for improved inference scalability. Status: This is a static model trained on an offline dataset. Future versions of the tuned models will be released as we improve model safety with community feedback. Where to send questions or comments about the model Instructions on how to provide feedback or comments on the model can be found in the model README. For more technical information about generation parameters and recipes for how to use Llama 3.1 in applications, please go here. Intended Use Cases Llama 3.1 is intended for commercial and research use in multiple languages. Instruction tuned text only models are intended for assistant-like chat, whereas pretrained models can be adapted for a variety of natural language generation tasks. The Llama 3.1 model collection also supports the ability to leverage the outputs of its models to improve other models including synthetic data generation and distillation. The Llama 3.1 Community License allows for these use cases. Out-of-scope Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in any other way that is prohibited by the Acceptable Use Policy and Llama 3.1 Community License. Use in languages beyond those explicitly referenced as supported in this model card. Note : Llama 3.1 has been trained on a broader collection of languages than the 8 supported languages. Developers may fine-tune Llama 3.1 models for languages beyond the 8 supported languages provided they comply with the Llama 3.1 Community License and the Acceptable Use Policy and in such cases are responsible for ensuring that any uses of Llama 3.1 in additional languages is done in a safe and responsible manner. This repository contains two versions of Meta-Llama-3.1-8B-Instruct, for use with transformers and with the original `llama` codebase. 
Starting with `transformers >= 4.43.0` onward, you can run conversational inference using the Transformers `pipeline` abstraction or by leveraging the Auto classes with the `generate()` function. Make sure to update your transformers installation via `pip install --upgrade transformers`. Note: You can also find detailed recipes on how to use the model locally, with `torch.compile()`, assisted generations, quantised and more at `huggingface-llama-recipes` LLaMA-3.1 supports multiple tool use formats. You can see a full guide to prompt formatting here. Tool use is also supported through chat templates in Transformers. Here is a quick example showing a single simple tool: You can then generate text from this input as normal. If the model generates a tool call, you should add it to the chat like so: and then call the tool and append the result, with the `tool` role, like so: After that, you can `generate()` again to let the model use the tool result in the chat. Note that this was a very brief introduction to tool calling - for more information, see the LLaMA prompt format docs and the Transformers tool use documentation. To download Original checkpoints, see the example command below leveraging `huggingface-cli`: Training Factors We used custom training libraries, Meta's custom built GPU cluster, and production infrastructure for pretraining. Fine-tuning, annotation, and evaluation were also performed on production infrastructure. Training utilized a cumulative of 39.3M GPU hours of computation on H100-80GB (TDP of 700W) type hardware, per the table below. Training time is the total GPU time required for training each model and power consumption is the peak power capacity per GPU device used, adjusted for power usage efficiency. Training Greenhouse Gas Emissions Estimated total location-based greenhouse gas emissions were 11,390 tons CO2eq for training. 
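The training figures quoted above (39.3M GPU hours on hardware with a 700 W TDP, and 11,390 tons CO2eq location-based) can be related with a few lines of arithmetic. This is a rough sketch under stated assumptions: it treats every GPU hour as running at full TDP, ignores the power-usage-efficiency adjustment the card mentions, and reads "tons" as metric tonnes. The implied carbon intensity is a derived number for illustration, not a figure reported by Meta.

```python
GPU_HOURS = 39.3e6        # cumulative H100-80GB hours quoted in the card
TDP_KW = 0.7              # 700 W peak power per GPU, in kilowatts
EMISSIONS_TONNES = 11_390 # location-based emissions quoted in the card

# Upper-bound style estimate: every hour at full TDP, no PUE adjustment.
energy_kwh = GPU_HOURS * TDP_KW
energy_gwh = energy_kwh / 1e6

# Emissions per kWh implied by the two quoted figures (derived, not reported).
implied_kg_per_kwh = EMISSIONS_TONNES * 1000 / energy_kwh

print(f"{energy_gwh:.2f} GWh, {implied_kg_per_kwh:.3f} kg CO2eq/kWh")
```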
Since 2020, Meta has maintained net zero greenhouse gas emissions in its global operations and matched 100% of its electricity use with renewable energy, therefore the total market-based greenhouse gas emissions for training were 0 tons CO2eq. The methodology used to determine training energy use and greenhouse gas emissions can be found here. Since Meta is openly releasing these models, the training energy use and greenhouse gas emissions will not be incurred by others. Overview: Llama 3.1 was pretrained on ~15 trillion tokens of data from publicly available sources. The fine-tuning data includes publicly available instruction datasets, as well as over 25M synthetically generated examples. Data Freshness: The pretraining data has a cutoff of December 2023. In this section, we report the results for Llama 3.1 models on standard automatic benchmarks. For all the evaluations, we use our internal evaluations library. Responsibility & Safety As part of our Responsible release approach, we followed a three-pronged strategy to managing trust & safety risks: Enable developers to deploy helpful, safe and flexible experiences for their target audience and for the use cases supported by Llama. Protect developers against adversarial users aiming to exploit Llama capabilities to potentially cause harm. Provide protections for the community to help prevent the misuse of our models. Responsible deployment Llama is a foundational technology designed to be used in a variety of use cases, examples on how Meta’s Llama models have been responsibly deployed can be found in our Community Stories webpage. Our approach is to build the most helpful models enabling the world to benefit from the technology power, by aligning our model safety for the generic use cases addressing a standard set of harms. Developers are then in the driver seat to tailor safety for their use case, defining their own policy and deploying the models with the necessary safeguards in their Llama systems. 
Llama 3.1 was developed following the best practices outlined in our Responsible Use Guide, you can refer to the Responsible Use Guide to learn more. Llama 3.1 instruct Our main objectives for conducting safety fine-tuning are to provide the research community with a valuable resource for studying the robustness of safety fine-tuning, as well as to offer developers a readily available, safe, and powerful model for various applications to reduce the developer workload to deploy safe AI systems. For more details on the safety mitigations implemented please read the Llama 3 paper. Fine-tuning data We employ a multi-faceted approach to data collection, combining human-generated data from our vendors with synthetic data to mitigate potential safety risks. We’ve developed many large language model (LLM)-based classifiers that enable us to thoughtfully select high-quality prompts and responses, enhancing data quality control. Refusals and Tone Building on the work we started with Llama 3, we put a great emphasis on model refusals to benign prompts as well as refusal tone. We included both borderline and adversarial prompts in our safety data strategy, and modified our safety data responses to follow tone guidelines. Llama 3.1 systems Large language models, including Llama 3.1, are not designed to be deployed in isolation but instead should be deployed as part of an overall AI system with additional safety guardrails as required. Developers are expected to deploy system safeguards when building agentic systems. Safeguards are key to achieve the right helpfulness-safety alignment as well as mitigating safety and security risks inherent to the system and any integration of the model or system with external tools. As part of our responsible release approach, we provide the community with safeguards that developers should deploy with Llama models or other LLMs, including Llama Guard 3, Prompt Guard and Code Shield. 
All our reference implementations demos contain these safeguards by default so developers can benefit from system-level safety out-of-the-box. New capabilities Note that this release introduces new capabilities, including a longer context window, multilingual inputs and outputs and possible integrations by developers with third party tools. Building with these new capabilities requires specific considerations in addition to the best practices that generally apply across all Generative AI use cases. Tool-use: Just like in standard software development, developers are responsible for the integration of the LLM with the tools and services of their choice. They should define a clear policy for their use case and assess the integrity of the third party services they use to be aware of the safety and security limitations when using this capability. Refer to the Responsible Use Guide for best practices on the safe deployment of the third party safeguards. Multilinguality: Llama 3.1 supports 7 languages in addition to English: French, German, Hindi, Italian, Portuguese, Spanish, and Thai. Llama may be able to output text in other languages than those that meet performance thresholds for safety and helpfulness. We strongly discourage developers from using this model to converse in non-supported languages without implementing finetuning and system controls in alignment with their policies and the best practices shared in the Responsible Use Guide. Evaluations We evaluated Llama models for common use cases as well as specific capabilities. Common use cases evaluations measure safety risks of systems for most commonly built applications including chat bot, coding assistant, tool calls. We built dedicated, adversarial evaluation datasets and evaluated systems composed of Llama models and Llama Guard 3 to filter input prompt and output response. It is important to evaluate applications in context, and we recommend building dedicated evaluation dataset for your use case. 
Prompt Guard and Code Shield are also available if relevant to the application. Capability evaluations measure vulnerabilities of Llama models inherent to specific capabilities, for which we crafted dedicated benchmarks covering long context, multilingual use, tool calls, coding, and memorization.

Red teaming

For both scenarios, we conducted recurring red teaming exercises with the goal of discovering risks via adversarial prompting, and we used the learnings to improve our benchmarks and safety tuning datasets. We partnered early with subject-matter experts in critical risk areas to understand the nature of these real-world harms and how such models may lead to unintended harm for society. Based on these conversations, we derived a set of adversarial goals for the red team to attempt to achieve, such as extracting harmful information or reprogramming the model to act in a potentially harmful capacity. The red team consisted of experts in cybersecurity, adversarial machine learning, responsible AI, and integrity, in addition to multilingual content specialists with background in integrity issues in specific geographic markets.

Critical and other risks

We specifically focused our efforts on mitigating the following critical risk areas:

1. CBRNE (Chemical, Biological, Radiological, Nuclear, and Explosive materials) helpfulness

To assess risks related to proliferation of chemical and biological weapons, we performed uplift testing designed to assess whether use of Llama 3.1 models could meaningfully increase the capabilities of malicious actors to plan or carry out attacks using these types of weapons.

2. Child Safety

Child Safety risk assessments were conducted using a team of experts to assess the model’s capability to produce outputs that could result in Child Safety risks and to inform any necessary and appropriate risk mitigations via fine tuning.
We leveraged those expert red teaming sessions to expand the coverage of our evaluation benchmarks through Llama 3 model development. For Llama 3, we conducted new in-depth sessions using objective based methodologies to assess the model risks along multiple attack vectors including the additional languages Llama 3 is trained on. We also partnered with content specialists to perform red teaming exercises assessing potentially violating content while taking account of market specific nuances or experiences.

3. Cyber attack enablement

Our cyber attack uplift study investigated whether LLMs can enhance human capabilities in hacking tasks, both in terms of skill level and speed. Our attack automation study focused on evaluating the capabilities of LLMs when used as autonomous agents in cyber offensive operations, specifically in the context of ransomware attacks. This evaluation was distinct from previous studies that considered LLMs as interactive assistants. The primary objective was to assess whether these models could effectively function as independent agents in executing complex cyber-attacks without human intervention. Our study of Llama-3.1-405B’s social engineering uplift for cyber attackers was conducted to assess the effectiveness of AI models in aiding cyber threat actors in spear phishing campaigns. Please read our Llama 3.1 Cyber security whitepaper to learn more.

Community

Generative AI safety requires expertise and tooling, and we believe in the strength of the open community to accelerate its progress. We are active members of open consortiums, including the AI Alliance, Partnership on AI and MLCommons, actively contributing to safety standardization and transparency. We encourage the community to adopt taxonomies like the MLCommons Proof of Concept evaluation to facilitate collaboration and transparency on safety and content evaluations.
Our Purple Llama tools are open sourced for the community to use and widely distributed across ecosystem partners, including cloud service providers. We encourage community contributions to our GitHub repository.

We also set up the Llama Impact Grants program to identify and support the most compelling applications of Meta’s Llama model for societal benefit across three categories: education, climate and open innovation. Twenty finalists were selected from the hundreds of applications.

Finally, we put in place a set of resources, including an output reporting mechanism and a bug bounty program, to continuously improve the Llama technology with the help of the community.

Ethical Considerations and Limitations

The core values of Llama 3.1 are openness, inclusivity and helpfulness. It is meant to serve everyone, and to work for a wide range of use cases. It is thus designed to be accessible to people across many different backgrounds, experiences and perspectives. Llama 3.1 addresses users and their needs as they are, without inserting unnecessary judgment or normativity, while reflecting the understanding that even content that may appear problematic in some cases can serve valuable purposes in others. It respects the dignity and autonomy of all users, especially in terms of the values of free thought and expression that power innovation and progress.

But Llama 3.1 is a new technology, and like any new technology, there are risks associated with its use. Testing conducted to date has not covered, nor could it cover, all scenarios. For these reasons, as with all LLMs, Llama 3.1’s potential outputs cannot be predicted in advance, and the model may in some instances produce inaccurate, biased or other objectionable responses to user prompts. Therefore, before deploying any applications of Llama 3.1 models, developers should perform safety testing and tuning tailored to their specific applications of the model.
Please refer to available resources including our Responsible Use Guide, Trust and Safety solutions, and other resources to learn more about responsible development.
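
The system-level pattern this card describes (a safeguard filtering the input prompt and another filtering the output response around the model) can be sketched roughly as follows. This is a minimal illustration, not Meta's reference implementation: `GuardedSystem`, `keyword_check`, and `echo_model` are hypothetical names, and the keyword filter is only a stand-in for a real safety classifier such as Llama Guard 3.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical type aliases: a real deployment would plug in a safety
# classifier (for example Llama Guard 3) and an actual LLM call here.
SafetyCheck = Callable[[str], bool]  # returns True when the text is acceptable
Generator = Callable[[str], str]     # maps a prompt to a model response

REFUSAL = "Sorry, I can't help with that request."


@dataclass
class GuardedSystem:
    """Wraps a generator with an input filter and an output filter."""

    generate: Generator
    check_input: SafetyCheck
    check_output: SafetyCheck

    def respond(self, prompt: str) -> str:
        # Filter the input prompt before it reaches the model.
        if not self.check_input(prompt):
            return REFUSAL
        response = self.generate(prompt)
        # Filter the output response before it reaches the user.
        if not self.check_output(response):
            return REFUSAL
        return response


def keyword_check(blocked: set[str]) -> SafetyCheck:
    """Toy stand-in for a classifier: rejects text containing a blocked word."""

    def check(text: str) -> bool:
        lowered = text.lower()
        return not any(word in lowered for word in blocked)

    return check


def echo_model(prompt: str) -> str:
    """Placeholder for a model call; it simply echoes the prompt."""
    return f"You asked: {prompt}"


system = GuardedSystem(
    generate=echo_model,
    check_input=keyword_check({"ransomware"}),
    check_output=keyword_check({"password"}),
)
```

Keeping the two checks as separate callables makes it possible to evaluate each safeguard in context, on a dataset built for the specific use case, as the card recommends.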

llama
9,922
20

Qwen2.5-Omni-7B-GGUF

Unsloth Dynamic 2.0 achieves superior accuracy & outperforms other leading quants. Overview Introduction Qwen2.5-Omni is an end-to-end multimodal model designed to perceive diverse modalities, including text, images, audio, and video, while simultaneously generating text and natural speech responses in a streaming manner. Omni and Novel Architecture: We propose Thinker-Talker architecture, an end-to-end multimodal model designed to perceive diverse modalities, including text, images, audio, and video, while simultaneously generating text and natural speech responses in a streaming manner. We propose a novel position embedding, named TMRoPE (Time-aligned Multimodal RoPE), to synchronize the timestamps of video inputs with audio. Real-Time Voice and Video Chat: Architecture designed for fully real-time interactions, supporting chunked input and immediate output. Natural and Robust Speech Generation: Surpassing many existing streaming and non-streaming alternatives, demonstrating superior robustness and naturalness in speech generation. Strong Performance Across Modalities: Exhibiting exceptional performance across all modalities when benchmarked against similarly sized single-modality models. Qwen2.5-Omni outperforms the similarly sized Qwen2-Audio in audio capabilities and achieves comparable performance to Qwen2.5-VL-7B. Excellent End-to-End Speech Instruction Following: Qwen2.5-Omni shows performance in end-to-end speech instruction following that rivals its effectiveness with text inputs, evidenced by benchmarks such as MMLU and GSM8K. We conducted a comprehensive evaluation of Qwen2.5-Omni, which demonstrates strong performance across all modalities when compared to similarly sized single-modality models and closed-source models like Qwen2.5-VL-7B, Qwen2-Audio, and Gemini-1.5-pro. In tasks requiring the integration of multiple modalities, such as OmniBench, Qwen2.5-Omni achieves state-of-the-art performance. 
Furthermore, in single-modality tasks, it excels in areas including speech recognition (Common Voice), translation (CoVoST2), audio understanding (MMAU), image reasoning (MMMU, MMStar), video understanding (MVBench), and speech generation (Seed-tts-eval and subjective naturalness).

| OmniBench | Speech | Sound Event | Music | Avg |
|---|---|---|---|---|
| Gemini-1.5-Pro | 42.67% | 42.26% | 46.23% | 42.91% |

| Librispeech | dev-clean | dev other | test-clean | test-other |
|---|---|---|---|---|
| SALMONN | - | - | 2.1 | 4.9 |

| Common Voice 15 | en | zh | yue | fr |
|---|---|---|---|---|
| Whisper-large-v3 | 9.3 | 12.8 | 10.9 | 10.8 |

| Wenetspeech | test-net | test-meeting |
|---|---|---|
| Seed-ASR-Chinese | 4.7 | 5.7 |

| CoVoST2 | en-de | de-en | en-zh | zh-en |
|---|---|---|---|---|
| SALMONN | 18.6 | - | 33.1 | - |

| MusicCaps | | | | | | |
|---|---|---|---|---|---|---|
| LP-MusicCaps | 0.291 | 0.149 | 0.089 | 0.061 | 0.129 | 0.130 |
| Qwen2.5-Omni-3B | 0.325 | 0.163 | 0.093 | 0.057 | 0.132 | 0.229 |
| Qwen2.5-Omni-7B | 0.328 | 0.162 | 0.090 | 0.055 | 0.127 | 0.225 |

| MMAU | Sound | Music | Speech | Avg |
|---|---|---|---|---|
| Gemini-Pro-V1.5 | 56.75 | 49.40 | 58.55 | 54.90 |

| VoiceBench | AlpacaEval | CommonEval | SD-QA | MMSU |
|---|---|---|---|---|
| Ultravox-v0.4.1-LLaMA-3.1-8B | 4.55 | 3.90 | 53.35 | 47.17 |

| VoiceBench | OpenBookQA | IFEval | AdvBench | Avg |
|---|---|---|---|---|
| Ultravox-v0.4.1-LLaMA-3.1-8B | 65.27 | 66.88 | 98.46 | 71.45 |

| Dataset | Qwen2.5-Omni-7B | Qwen2.5-Omni-3B | Other Best | Qwen2.5-VL-7B | GPT-4o-mini |
|--------------------------------|--------------|------------|------------|---------------|-------------|
| MMMU val | 59.2 | 53.1 | 53.9 | 58.6 | 60.0 |
| MMMU-Pro overall | 36.6 | 29.7 | - | 38.3 | 37.6 |
| MathVista testmini | 67.9 | 59.4 | 71.9 | 68.2 | 52.5 |
| MathVision full | 25.0 | 20.8 | 23.1 | 25.1 | - |
| MMBench-V1.1-EN test | 81.8 | 77.8 | 80.5 | 82.6 | 76.0 |
| MMVet turbo | 66.8 | 62.1 | 67.5 | 67.1 | 66.9 |
| MMStar | 64.0 | 55.7 | 64.0 | 63.9 | 54.8 |
| MME sum | 2340 | 2117 | 2372 | 2347 | 2003 |
| MuirBench | 59.2 | 48.0 | - | 59.2 | - |
| CRPE relation | 76.5 | 73.7 | - | 76.4 | - |
| RealWorldQA avg | 70.3 | 62.6 | 71.9 | 68.5 | - |
| MME-RealWorld en | 61.6 | 55.6 | - | 57.4 | - |
| MM-MT-Bench | 6.0 | 5.0 | - | 6.3 | - |
| AI2D | 83.2 | 79.5 | 85.8 | 83.9 | - |
| TextVQA val | 84.4 | 79.8 | 83.2 | 84.9 | - |
| DocVQA test | 95.2 | 93.3 | 93.5 | 95.7 | - |
| ChartQA test Avg | 85.3 | 82.8 | 84.9 | 87.3 | - |
| OCRBenchV2 en | 57.8 | 51.7 | - | 56.3 | - |

| Dataset | Qwen2.5-Omni-7B | Qwen2.5-Omni-3B | Qwen2.5-VL-7B | Grounding DINO | Gemini 1.5 Pro |
|--------------------------|--------------|---------------|---------------|----------------|----------------|
| Refcoco val | 90.5 | 88.7 | 90.0 | 90.6 | 73.2 |
| Refcoco textA | 93.5 | 91.8 | 92.5 | 93.2 | 72.9 |
| Refcoco textB | 86.6 | 84.0 | 85.4 | 88.2 | 74.6 |
| Refcoco+ val | 85.4 | 81.1 | 84.2 | 88.2 | 62.5 |
| Refcoco+ textA | 91.0 | 87.5 | 89.1 | 89.0 | 63.9 |
| Refcoco+ textB | 79.3 | 73.2 | 76.9 | 75.9 | 65.0 |
| Refcocog+ val | 87.4 | 85.0 | 87.2 | 86.1 | 75.2 |
| Refcocog+ test | 87.9 | 85.1 | 87.2 | 87.0 | 76.2 |
| ODinW | 42.4 | 39.2 | 37.3 | 55.0 | 36.7 |
| PointGrounding | 66.5 | 46.2 | 67.3 | - | - |

| Dataset | Qwen2.5-Omni-7B | Qwen2.5-Omni-3B | Other Best | Qwen2.5-VL-7B | GPT-4o-mini |
|-----------------------------|--------------|------------|------------|---------------|-------------|
| Video-MME w/o sub | 64.3 | 62.0 | 63.9 | 65.1 | 64.8 |
| Video-MME w sub | 72.4 | 68.6 | 67.9 | 71.6 | - |
| MVBench | 70.3 | 68.7 | 67.2 | 69.6 | - |
| EgoSchema test | 68.6 | 61.4 | 63.2 | 65.0 | - |

| SEED | test-zh | test-en | test-hard |
|---|---|---|---|
| Seed-TTS_ICL | 1.11 | 2.24 | 7.58 |

| SEED | test-zh | test-en | test-hard |
|---|---|---|---|
| Seed-TTS_ICL | 0.796 | 0.762 | 0.776 |

| Dataset | Qwen2.5-Omni-7B | Qwen2.5-Omni-3B | Qwen2.5-7B | Qwen2.5-3B | Qwen2-7B | Llama3.1-8B | Gemma2-9B |
|-----------------------------------|-----------|------------|------------|------------|------------|-------------|-----------|
| MMLU-Pro | 47.0 | 40.4 | 56.3 | 43.7 | 44.1 | 48.3 | 52.1 |
| MMLU-redux | 71.0 | 60.9 | 75.4 | 64.4 | 67.3 | 67.2 | 72.8 |
| LiveBench 0831 | 29.6 | 22.3 | 35.9 | 26.8 | 29.2 | 26.7 | 30.6 |
| GPQA | 30.8 | 34.3 | 36.4 | 30.3 | 34.3 | 32.8 | 32.8 |
| MATH | 71.5 | 63.6 | 75.5 | 65.9 | 52.9 | 51.9 | 44.3 |
| GSM8K | 88.7 | 82.6 | 91.6 | 86.7 | 85.7 | 84.5 | 76.7 |
| HumanEval | 78.7 | 70.7 | 84.8 | 74.4 | 79.9 | 72.6 | 68.9 |
| MBPP | 73.2 | 70.4 | 79.2 | 72.7 | 67.2 | 69.6 | 74.9 |
| MultiPL-E | 65.8 | 57.6 | 70.4 | 60.2 | 59.1 | 50.7 | 53.4 |
| LiveCodeBench 2305-2409 | 24.6 | 16.5 | 28.7 | 19.9 | 23.9 | 8.3 | 18.9 |

Below, we provide simple examples to show how to use Qwen2.5-Omni with 🤗 Transformers. The code for Qwen2.5-Omni is in the latest Hugging Face transformers, and we advise you to build it from source.

We offer a toolkit to help you handle various types of audio and visual input more conveniently, as if you were using an API. This includes base64, URLs, and interleaved audio, images and videos. You can install it with `pip install qwen-omni-utils[decord] -U`; make sure your system has `ffmpeg` installed. If you are not using Linux, you might not be able to install `decord` from PyPI. In that case, you can use `pip install qwen-omni-utils -U`, which will fall back to using torchvision for video processing. However, you can still install decord from source so that decord is used when loading video.

The chat model is used with `transformers` together with `qwen_omni_utils`.

| Model | Precision | 15(s) Video | 30(s) Video | 60(s) Video |
|--------------|-----------| ------------- | ------------- | ------------------ |
| Qwen-Omni-3B | FP32 | 89.10 GB | Not recommended | Not recommended |
| Qwen-Omni-3B | BF16 | 18.38 GB | 22.43 GB | 28.22 GB |
| Qwen-Omni-7B | FP32 | 93.56 GB | Not recommended | Not recommended |
| Qwen-Omni-7B | BF16 | 31.11 GB | 41.85 GB | 60.19 GB |

Note: The table above presents the theoretical minimum memory requirements for inference with `transformers`, and `BF16` is tested with `attn_implementation="flash_attention_2"`; in practice, the actual memory usage is typically at least 1.2 times higher.

Video URL compatibility largely depends on the third-party library version.
The details are in the table below. Change the backend by setting `FORCE_QWENVL_VIDEO_READER=torchvision` or `FORCE_QWENVL_VIDEO_READER=decord` if you prefer not to use the default one.

| Backend | HTTP | HTTPS |
|-------------|------|-------|
| torchvision >= 0.19.0 | ✅ | ✅ |

The model can batch inputs composed of mixed samples of various types, such as text, images, audio and videos, when `return_audio=False` is set.

Prompt for audio output

If users need audio output, the system prompt must be set to "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech.", otherwise the audio output may not work as expected.

Use audio in video

In multimodal interaction, the videos provided by users are often accompanied by audio (such as questions about the content in the video, or sounds generated by certain events in the video). This information helps the model provide a better interactive experience, so users can decide whether to use the audio in a video through the `use_audio_in_video` parameter. During a multi-round conversation, `use_audio_in_video` must be set to the same value everywhere it is passed, otherwise unexpected results will occur.

The model supports both text and audio outputs. If users do not need audio outputs, they can call `model.disable_talker()` after initializing the model. This saves about 2GB of GPU memory, but the `return_audio` option of the `generate` function can then only be set to `False`. For a more flexible experience, we recommend that users decide whether to return audio when the `generate` function is called. If `return_audio` is set to `False`, the model returns only text outputs, which gives text responses faster.

Change voice type of output audio

Qwen2.5-Omni supports the ability to change the voice of the output audio.
The `"Qwen/Qwen2.5-Omni-7B"` checkpoint supports two voice types, as follows:

| Voice Type | Gender | Description |
|------------|--------|-------------|
| Chelsie | Female | A honeyed, velvety voice that carries a gentle warmth and luminous clarity.|
| Ethan | Male | A bright, upbeat voice with infectious energy and a warm, approachable vibe.|

Users can use the `speaker` parameter of the `generate` function to specify the voice type. If `speaker` is not specified, the default voice type is `Chelsie`.

First, make sure to install the latest version of Flash Attention 2. You should also have hardware that is compatible with FlashAttention 2; read more about it in the official documentation of the flash attention repository. FlashAttention-2 can only be used when a model is loaded in `torch.float16` or `torch.bfloat16`. To load and run a model using FlashAttention-2, add `attn_implementation="flash_attention_2"` when loading the model.

If you find our paper and code useful in your research, please consider giving a star :star: and citation :pencil: :)
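
The usage rules stated in this card (the fixed system prompt for audio output, one consistent `use_audio_in_video` value, and the two voice types) can be collected in a small helper. This is a hedged sketch in plain Python that only builds the inputs; it does not load the model. The helper names `build_conversation` and `generation_options` are hypothetical, and the message layout (`role`, `content`, `type`) is assumed to follow the usual chat-template convention rather than taken from this page.

```python
# System prompt quoted from the card; the card states it is required
# for audio output to work as expected.
AUDIO_SYSTEM_PROMPT = (
    "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, "
    "capable of perceiving auditory and visual inputs, as well as generating "
    "text and speech."
)

# Voice types listed in the card; Chelsie is the documented default.
VOICES = ("Chelsie", "Ethan")


def build_conversation(video_path: str, question: str) -> list[dict]:
    """Build a chat-style message list (layout is an assumed convention)."""
    return [
        {
            "role": "system",
            "content": [{"type": "text", "text": AUDIO_SYSTEM_PROMPT}],
        },
        {
            "role": "user",
            "content": [
                {"type": "video", "video": video_path},
                {"type": "text", "text": question},
            ],
        },
    ]


def generation_options(
    use_audio_in_video: bool, return_audio: bool, speaker: str = "Chelsie"
) -> dict:
    """Return keyword arguments for preprocessing and for `generate`.

    Both dictionaries share one `use_audio_in_video` value, because the card
    warns that mismatched values lead to unexpected results.
    """
    if speaker not in VOICES:
        raise ValueError(f"unknown voice type: {speaker!r}")
    shared = {"use_audio_in_video": use_audio_in_video}
    return {
        "preprocess": dict(shared),
        "generate": {**shared, "return_audio": return_audio, "speaker": speaker},
    }
```

Deriving both argument sets from a single flag is one way to keep the value identical across a multi-round conversation; passing `return_audio=False` matches the text-only, faster path described above.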

9,848
43

gemma-3-4b-pt-unsloth-bnb-4bit

See our collection for all versions of Gemma 3 including GGUF, 4-bit & 16-bit formats. Unsloth's Dynamic Quants are selectively quantized, greatly improving accuracy over standard 4-bit.

- Fine-tune Gemma 3 (12B) for free using our Google Colab notebook.
- Read our Blog about Gemma 3 support: unsloth.ai/blog/gemma3
- View the rest of our notebooks in our docs.
- Export your fine-tuned model to GGUF, Ollama, llama.cpp or 🤗HF.

| Unsloth supports | Free Notebooks | Performance | Memory use |
|-----------------|----------------|-------------|----------|
| GRPO with Gemma 3 (12B) | ▶️ Start on Colab | 2x faster | 80% less |
| Llama-3.2 (3B) | ▶️ Start on Colab | 2.4x faster | 58% less |
| Llama-3.2 (11B vision) | ▶️ Start on Colab | 2x faster | 60% less |
| Qwen2.5 (7B) | ▶️ Start on Colab | 2x faster | 60% less |
| Phi-4 (14B) | ▶️ Start on Colab | 2x faster | 50% less |
| Mistral (7B) | ▶️ Start on Colab | 2.2x faster | 62% less |

- [Gemma 3 Technical Report][g3-tech-report]
- [Responsible Generative AI Toolkit][rai-toolkit]
- [Gemma on Kaggle][kaggle-gemma]
- [Gemma on Vertex Model Garden][vertex-mg-gemma3]

Summary description and brief definition of inputs and outputs.

Gemma is a family of lightweight, state-of-the-art open models from Google, built from the same research and technology used to create the Gemini models. Gemma 3 models are multimodal, handling text and image input and generating text output, with open weights for both pre-trained variants and instruction-tuned variants. Gemma 3 has a large, 128K context window, multilingual support in over 140 languages, and is available in more sizes than previous versions. Gemma 3 models are well-suited for a variety of text generation and image understanding tasks, including question answering, summarization, and reasoning.
Their relatively small size makes it possible to deploy them in environments with limited resources such as laptops, desktops or your own cloud infrastructure, democratizing access to state of the art AI models and helping foster innovation for everyone.

- Input:
  - Text string, such as a question, a prompt, or a document to be summarized
  - Images, normalized to 896 x 896 resolution and encoded to 256 tokens each
  - Total input context of 128K tokens for the 4B, 12B, and 27B sizes, and 32K tokens for the 1B size
- Output:
  - Generated text in response to the input, such as an answer to a question, analysis of image content, or a summary of a document
  - Total output context of 8192 tokens

Data used for model training and how the data was processed.

These models were trained on a dataset of text data that includes a wide variety of sources. The 27B model was trained with 14 trillion tokens, the 12B model was trained with 12 trillion tokens, 4B model was trained with 4 trillion tokens and 1B with 2 trillion tokens. Here are the key components:

- Web Documents: A diverse collection of web text ensures the model is exposed to a broad range of linguistic styles, topics, and vocabulary. The training dataset includes content in over 140 languages.
- Code: Exposing the model to code helps it to learn the syntax and patterns of programming languages, which improves its ability to generate code and understand code-related questions.
- Mathematics: Training on mathematical text helps the model learn logical reasoning, symbolic representation, and to address mathematical queries.
- Images: A wide range of images enables the model to perform image analysis and visual data extraction tasks.

The combination of these diverse data sources is crucial for training a powerful multimodal model that can handle a wide variety of different tasks and data formats.
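
As a rough illustration of the input limits listed earlier in this card (256 tokens per image, a 128K input context for the 4B, 12B and 27B sizes, 32K for the 1B size), the sketch below computes how many tokens remain for text once images are counted. It assumes that "K" means 1024 tokens and that image and text tokens share one input budget; the constants and the function name `remaining_text_tokens` are hypothetical and not part of any Gemma API.

```python
# Figures taken from the card; interpreting "K" as 1024 is an assumption.
TOKENS_PER_IMAGE = 256
INPUT_CONTEXT = {
    "1B": 32 * 1024,
    "4B": 128 * 1024,
    "12B": 128 * 1024,
    "27B": 128 * 1024,
}
MAX_OUTPUT_TOKENS = 8192  # total output context stated in the card


def remaining_text_tokens(model_size: str, num_images: int) -> int:
    """Return the input tokens left for text after accounting for images."""
    if num_images < 0:
        raise ValueError("num_images must be non-negative")
    budget = INPUT_CONTEXT[model_size] - num_images * TOKENS_PER_IMAGE
    if budget < 0:
        raise ValueError("images alone exceed the input context")
    return budget
```

For example, `remaining_text_tokens("4B", 4)` subtracts 4 × 256 image tokens from the 4B input budget; the result is only as accurate as the assumptions stated above.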
Here are the key data cleaning and filtering methods applied to the training data: - CSAM Filtering: Rigorous CSAM (Child Sexual Abuse Material) filtering was applied at multiple stages in the data preparation process to ensure the exclusion of harmful and illegal content. - Sensitive Data Filtering: As part of making Gemma pre-trained models safe and reliable, automated techniques were used to filter out certain personal information and other sensitive data from training sets. - Additional methods: Filtering based on content quality and safety in line with [our policies][safety-policies]. Gemma was trained using [Tensor Processing Unit (TPU)][tpu] hardware (TPUv4p, TPUv5p and TPUv5e). Training vision-language models (VLMS) requires significant computational power. TPUs, designed specifically for matrix operations common in machine learning, offer several advantages in this domain: - Performance: TPUs are specifically designed to handle the massive computations involved in training VLMs. They can speed up training considerably compared to CPUs. - Memory: TPUs often come with large amounts of high-bandwidth memory, allowing for the handling of large models and batch sizes during training. This can lead to better model quality. - Scalability: TPU Pods (large clusters of TPUs) provide a scalable solution for handling the growing complexity of large foundation models. You can distribute training across multiple TPU devices for faster and more efficient processing. - Cost-effectiveness: In many scenarios, TPUs can provide a more cost-effective solution for training large models compared to CPU-based infrastructure, especially when considering the time and resources saved due to faster training. - These advantages are aligned with [Google's commitments to operate sustainably][sustainability]. Training was done using [JAX][jax] and [ML Pathways][ml-pathways]. 
JAX allows researchers to take advantage of the latest generation of hardware, including TPUs, for faster and more efficient training of large models. ML Pathways is Google's latest effort to build artificially intelligent systems capable of generalizing across multiple tasks. This is specially suitable for foundation models, including large language models like these ones.

Together, JAX and ML Pathways are used as described in the [paper about the Gemini family of models][gemini-2-paper]; "the 'single controller' programming model of Jax and Pathways allows a single Python process to orchestrate the entire training run, dramatically simplifying the development workflow."

These models were evaluated against a large collection of different datasets and metrics to cover different aspects of text generation:

| Benchmark | Metric | Gemma 3 PT 1B | Gemma 3 PT 4B | Gemma 3 PT 12B | Gemma 3 PT 27B |
| ------------------------------ |----------------|:--------------:|:-------------:|:--------------:|:--------------:|
| [HellaSwag][hellaswag] | 10-shot | 62.3 | 77.2 | 84.2 | 85.6 |
| [BoolQ][boolq] | 0-shot | 63.2 | 72.3 | 78.8 | 82.4 |
| [PIQA][piqa] | 0-shot | 73.8 | 79.6 | 81.8 | 83.3 |
| [SocialIQA][socialiqa] | 0-shot | 48.9 | 51.9 | 53.4 | 54.9 |
| [TriviaQA][triviaqa] | 5-shot | 39.8 | 65.8 | 78.2 | 85.5 |
| [Natural Questions][naturalq] | 5-shot | 9.48 | 20.0 | 31.4 | 36.1 |
| [ARC-c][arc] | 25-shot | 38.4 | 56.2 | 68.9 | 70.6 |
| [ARC-e][arc] | 0-shot | 73.0 | 82.4 | 88.3 | 89.0 |
| [WinoGrande][winogrande] | 5-shot | 58.2 | 64.7 | 74.3 | 78.8 |
| [BIG-Bench Hard][bbh] | few-shot | 28.4 | 50.9 | 72.6 | 77.7 |
| [DROP][drop] | 1-shot | 42.4 | 60.1 | 72.2 | 77.2 |

[hellaswag]: https://arxiv.org/abs/1905.07830
[boolq]: https://arxiv.org/abs/1905.10044
[piqa]: https://arxiv.org/abs/1911.11641
[socialiqa]: https://arxiv.org/abs/1904.09728
[triviaqa]: https://arxiv.org/abs/1705.03551
[naturalq]: https://github.com/google-research-datasets/natural-questions
[arc]: https://arxiv.org/abs/1911.01547
[winogrande]: https://arxiv.org/abs/1907.10641
[bbh]: https://paperswithcode.com/dataset/bbh
[drop]: https://arxiv.org/abs/1903.00161

| Benchmark | Metric | Gemma 3 PT 4B | Gemma 3 PT 12B | Gemma 3 PT 27B |
| ------------------------------ |----------------|:-------------:|:--------------:|:--------------:|
| [MMLU][mmlu] | 5-shot | 59.6 | 74.5 | 78.6 |
| [MMLU][mmlu] (Pro COT) | 5-shot | 29.2 | 45.3 | 52.2 |
| [AGIEval][agieval] | 3-5-shot | 42.1 | 57.4 | 66.2 |
| [MATH][math] | 4-shot | 24.2 | 43.3 | 50.0 |
| [GSM8K][gsm8k] | 8-shot | 38.4 | 71.0 | 82.6 |
| [GPQA][gpqa] | 5-shot | 15.0 | 25.4 | 24.3 |
| [MBPP][mbpp] | 3-shot | 46.0 | 60.4 | 65.6 |
| [HumanEval][humaneval] | 0-shot | 36.0 | 45.7 | 48.8 |

[mmlu]: https://arxiv.org/abs/2009.03300
[agieval]: https://arxiv.org/abs/2304.06364
[math]: https://arxiv.org/abs/2103.03874
[gsm8k]: https://arxiv.org/abs/2110.14168
[gpqa]: https://arxiv.org/abs/2311.12022
[mbpp]: https://arxiv.org/abs/2108.07732
[humaneval]: https://arxiv.org/abs/2107.03374

| Benchmark | Gemma 3 PT 1B | Gemma 3 PT 4B | Gemma 3 PT 12B | Gemma 3 PT 27B |
| ------------------------------------ |:-------------:|:-------------:|:--------------:|:--------------:|
| [MGSM][mgsm] | 2.04 | 34.7 | 64.3 | 74.3 |
| [Global-MMLU-Lite][global-mmlu-lite] | 24.9 | 57.0 | 69.4 | 75.7 |
| [WMT24++][wmt24pp] (ChrF) | 36.7 | 48.4 | 53.9 | 55.7 |
| [FloRes][flores] | 29.5 | 39.2 | 46.0 | 48.8 |
| [XQuAD][xquad] (all) | 43.9 | 68.0 | 74.5 | 76.8 |
| [ECLeKTic][eclektic] | 4.69 | 11.0 | 17.2 | 24.4 |
| [IndicGenBench][indicgenbench] | 41.4 | 57.2 | 61.7 | 63.4 |

[mgsm]: https://arxiv.org/abs/2210.03057
[flores]: https://arxiv.org/abs/2106.03193
[xquad]: https://arxiv.org/abs/1910.11856v3
[global-mmlu-lite]: https://huggingface.co/datasets/CohereForAI/Global-MMLU-Lite
[wmt24pp]: https://arxiv.org/abs/2502.12404v1
[eclektic]: https://arxiv.org/abs/2502.21228
[indicgenbench]: https://arxiv.org/abs/2404.16816

| Benchmark | Gemma 3 PT 4B | Gemma 3 PT 12B | Gemma 3 PT 27B |
| ------------------------------ |:-------------:|:--------------:|:--------------:|
| [COCOcap][coco-cap] | 102 | 111 | 116 |
| [DocVQA][docvqa] (val) | 72.8 | 82.3 | 85.6 |
| [InfoVQA][info-vqa] (val) | 44.1 | 54.8 | 59.4 |
| [MMMU][mmmu] (pt) | 39.2 | 50.3 | 56.1 |
| [TextVQA][textvqa] (val) | 58.9 | 66.5 | 68.6 |
| [RealWorldQA][realworldqa] | 45.5 | 52.2 | 53.9 |
| [ReMI][remi] | 27.3 | 38.5 | 44.8 |
| [AI2D][ai2d] | 63.2 | 75.2 | 79.0 |
| [ChartQA][chartqa] | 63.6 | 74.7 | 76.3 |
| [VQAv2][vqav2] | 63.9 | 71.2 | 72.9 |
| [BLINK][blinkvqa] | 38.0 | 35.9 | 39.6 |
| [OKVQA][okvqa] | 51.0 | 58.7 | 60.2 |
| [TallyQA][tallyqa] | 42.5 | 51.8 | 54.3 |
| [SpatialSense VQA][ss-vqa] | 50.9 | 60.0 | 59.4 |
| [CountBenchQA][countbenchqa] | 26.1 | 17.8 | 68.0 |

[coco-cap]: https://cocodataset.org/#home
[docvqa]: https://www.docvqa.org/
[info-vqa]: https://arxiv.org/abs/2104.12756
[mmmu]: https://arxiv.org/abs/2311.16502
[textvqa]: https://textvqa.org/
[realworldqa]: https://paperswithcode.com/dataset/realworldqa
[remi]: https://arxiv.org/html/2406.09175v1
[ai2d]: https://allenai.org/data/diagrams
[chartqa]: https://arxiv.org/abs/2203.10244
[vqav2]: https://visualqa.org/index.html
[blinkvqa]: https://arxiv.org/abs/2404.12390
[okvqa]: https://okvqa.allenai.org/
[tallyqa]: https://arxiv.org/abs/1810.12440
[ss-vqa]: https://arxiv.org/abs/1908.02660
[countbenchqa]: https://github.com/google-research/big_vision/blob/main/big_vision/datasets/countbenchqa/

Our evaluation methods include structured evaluations and internal red-teaming testing of relevant content policies. Red-teaming was conducted by a number of different teams, each with different goals and human evaluation metrics. These models were evaluated against a number of different categories relevant to ethics and safety, including:

- Child Safety: Evaluation of text-to-text and image to text prompts covering child safety policies, including child sexual abuse and exploitation.
- Content Safety: Evaluation of text-to-text and image to text prompts covering safety policies including, harassment, violence and gore, and hate speech. - Representational Harms: Evaluation of text-to-text and image to text prompts covering safety policies including bias, stereotyping, and harmful associations or inaccuracies. In addition to development level evaluations, we conduct "assurance evaluations" which are our 'arms-length' internal evaluations for responsibility governance decision making. They are conducted separately from the model development team, to inform decision making about release. High level findings are fed back to the model team, but prompt sets are held-out to prevent overfitting and preserve the results' ability to inform decision making. Assurance evaluation results are reported to our Responsibility & Safety Council as part of release review. For all areas of safety testing, we saw major improvements in the categories of child safety, content safety, and representational harms relative to previous Gemma models. All testing was conducted without safety filters to evaluate the model capabilities and behaviors. For both text-to-text and image-to-text, and across all model sizes, the model produced minimal policy violations, and showed significant improvements over previous Gemma models' performance with respect to ungrounded inferences. A limitation of our evaluations was they included only English language prompts. These models have certain limitations that users should be aware of. Open vision-language models (VLMs) models have a wide range of applications across various industries and domains. The following list of potential uses is not comprehensive. The purpose of this list is to provide contextual information about the possible use-cases that the model creators considered as part of model training and development. 
- Content Creation and Communication - Text Generation: These models can be used to generate creative text formats such as poems, scripts, code, marketing copy, and email drafts. - Chatbots and Conversational AI: Power conversational interfaces for customer service, virtual assistants, or interactive applications. - Text Summarization: Generate concise summaries of a text corpus, research papers, or reports. - Image Data Extraction: These models can be used to extract, interpret, and summarize visual data for text communications. - Research and Education - Natural Language Processing (NLP) and VLM Research: These models can serve as a foundation for researchers to experiment with VLM and NLP techniques, develop algorithms, and contribute to the advancement of the field. - Language Learning Tools: Support interactive language learning experiences, aiding in grammar correction or providing writing practice. - Knowledge Exploration: Assist researchers in exploring large bodies of text by generating summaries or answering questions about specific topics. - Training Data - The quality and diversity of the training data significantly influence the model's capabilities. Biases or gaps in the training data can lead to limitations in the model's responses. - The scope of the training dataset determines the subject areas the model can handle effectively. - Context and Task Complexity - Models are better at tasks that can be framed with clear prompts and instructions. Open-ended or highly complex tasks might be challenging. - A model's performance can be influenced by the amount of context provided (longer context generally leads to better outputs, up to a certain point). - Language Ambiguity and Nuance - Natural language is inherently complex. Models might struggle to grasp subtle nuances, sarcasm, or figurative language. - Factual Accuracy - Models generate responses based on information they learned from their training datasets, but they are not knowledge bases. 
They may generate incorrect or outdated factual statements.
- Common Sense
  - Models rely on statistical patterns in language. They might lack the ability to apply common sense reasoning in certain situations.

The development of vision-language models (VLMs) raises several ethical concerns. In creating an open model, we have carefully considered the following:

- Bias and Fairness
  - VLMs trained on large-scale, real-world text and image data can reflect socio-cultural biases embedded in the training material. These models underwent careful scrutiny, with input data pre-processing described and posterior evaluations reported in this card.
- Misinformation and Misuse
  - VLMs can be misused to generate text that is false, misleading, or harmful.
  - Guidelines are provided for responsible use with the model; see the [Responsible Generative AI Toolkit][rai-toolkit].
- Transparency and Accountability
  - This model card summarizes details on the models' architecture, capabilities, limitations, and evaluation processes.
  - A responsibly developed open model offers the opportunity to share innovation by making VLM technology accessible to developers and researchers across the AI ecosystem.

- Perpetuation of biases: Continuous monitoring (using evaluation metrics and human review) and the exploration of de-biasing techniques are encouraged during model training, fine-tuning, and other use cases.
- Generation of harmful content: Mechanisms and guidelines for content safety are essential. Developers are encouraged to exercise caution and implement appropriate content safety safeguards based on their specific product policies and application use cases.
- Misuse for malicious purposes: Technical limitations and developer and end-user education can help mitigate malicious applications of VLMs. Educational resources and reporting mechanisms for users to flag misuse are provided.
Prohibited uses of Gemma models are outlined in the [Gemma Prohibited Use Policy][prohibited-use].
- Privacy violations: Models were trained on data filtered for removal of certain personal information and other sensitive data. Developers are encouraged to adhere to privacy regulations with privacy-preserving techniques.

At the time of release, this family of models provides high-performance open vision-language model implementations designed from the ground up for responsible AI development compared to similarly sized models.

Using the benchmark evaluation metrics described in this document, these models have been shown to provide superior performance to other, comparably sized open model alternatives.

[g3-tech-report]: https://goo.gle/Gemma3Report
[rai-toolkit]: https://ai.google.dev/responsible
[kaggle-gemma]: https://www.kaggle.com/models/google/gemma-3
[vertex-mg-gemma3]: https://console.cloud.google.com/vertex-ai/publishers/google/model-garden/gemma3
[terms]: https://ai.google.dev/gemma/terms
[safety-policies]: https://ai.google/static/documents/ai-responsibility-update-published-february-2025.pdf
[prohibited-use]: https://ai.google.dev/gemma/prohibitedusepolicy
[tpu]: https://cloud.google.com/tpu/docs/intro-to-tpu
[sustainability]: https://sustainability.google/operating-sustainably/
[jax]: https://github.com/jax-ml/jax
[ml-pathways]: https://blog.google/technology/ai/introducing-pathways-next-generation-ai-architecture/
[gemini-2-paper]: https://arxiv.org/abs/2312.11805

9,828
3

Llama-3.1-8B-Instruct-bnb-4bit

See our collection for versions of Llama 3.1 including 4-bit and 16-bit formats.

Finetune your own reasoning model like R1 with Unsloth! We have a free Google Colab notebook for turning Llama 3.1 (8B) into a reasoning model: https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1(8B)-GRPO.ipynb

All notebooks are beginner friendly! Add your dataset, click "Run All", and you'll get a 2x faster finetuned model which can be exported to GGUF, vLLM or uploaded to Hugging Face.

| Unsloth supports | Free Notebooks | Performance | Memory use |
|------------------|----------------|-------------|------------|
| GRPO with Phi-4 (14B) | ▶️ Start on Colab (GRPO notebook) | 2x faster | 80% less |
| Llama-3.2 (3B) | ▶️ Start on Colab (Conversational notebook) | 2.4x faster | 58% less |
| Llama-3.2 (11B vision) | ▶️ Start on Colab (Vision notebook) | 2x faster | 60% less |
| Qwen2 VL (7B) | ▶️ Start on Colab (Vision notebook) | 1.8x faster | 60% less |
| Qwen2.5 (7B) | ▶️ Start on Colab (Alpaca notebook) | 2x faster | 60% less |
| Llama-3.1 (8B) | ▶️ Start on Colab (Alpaca notebook) | 2.4x faster | 58% less |
| Phi-3.5 (mini) | ▶️ Start on Colab | 2x faster | 50% less |
| Gemma 2 (9B) | ▶️ Start on Colab (Alpaca notebook) | 2.4x faster | 58% less |
| Mistral (7B) | ▶️ Start on Colab (Conversational notebook) | 2.2x faster | 62% less |

- This Llama 3.2 conversational notebook is useful for ShareGPT ChatML / Vicuna templates.
- This text completion notebook is for raw text. This DPO notebook replicates Zephyr.
- \* Kaggle has 2x T4s, but we use 1. Due to overhead, 1x T4 is 5x faster.

The Meta Llama 3.1 collection of multilingual large language models (LLMs) is a collection of pretrained and instruction tuned generative models in 8B, 70B and 405B sizes (text in/text out).
The Llama 3.1 instruction tuned text only models (8B, 70B, 405B) are optimized for multilingual dialogue use cases and outperform many of the available open source and closed chat models on common industry benchmarks. Model Architecture: Llama 3.1 is an auto-regressive language model that uses an optimized transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align with human preferences for helpfulness and safety. Supported languages: English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai. Llama 3.1 family of models. Token counts refer to pretraining data only. All model versions use Grouped-Query Attention (GQA) for improved inference scalability. Status: This is a static model trained on an offline dataset. Future versions of the tuned models will be released as we improve model safety with community feedback. Where to send questions or comments about the model Instructions on how to provide feedback or comments on the model can be found in the model README. For more technical information about generation parameters and recipes for how to use Llama 3.1 in applications, please go here. Intended Use Cases Llama 3.1 is intended for commercial and research use in multiple languages. Instruction tuned text only models are intended for assistant-like chat, whereas pretrained models can be adapted for a variety of natural language generation tasks. The Llama 3.1 model collection also supports the ability to leverage the outputs of its models to improve other models including synthetic data generation and distillation. The Llama 3.1 Community License allows for these use cases. Out-of-scope Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in any other way that is prohibited by the Acceptable Use Policy and Llama 3.1 Community License. Use in languages beyond those explicitly referenced as supported in this model card. 
Note: Llama 3.1 has been trained on a broader collection of languages than the 8 supported languages. Developers may fine-tune Llama 3.1 models for languages beyond the 8 supported languages provided they comply with the Llama 3.1 Community License and the Acceptable Use Policy, and in such cases are responsible for ensuring that any uses of Llama 3.1 in additional languages are done in a safe and responsible manner.

This repository contains two versions of Meta-Llama-3.1-8B-Instruct, for use with transformers and with the original `llama` codebase.

Starting with `transformers >= 4.43.0`, you can run conversational inference using the Transformers `pipeline` abstraction or by leveraging the Auto classes with the `generate()` function. Make sure to update your transformers installation via `pip install --upgrade transformers`.

Note: You can also find detailed recipes on how to use the model locally, with `torch.compile()`, assisted generations, quantised and more at `huggingface-llama-recipes`.

Llama 3.1 supports multiple tool use formats; a full guide to prompt formatting is available in the Llama prompt format docs. Tool use is also supported through chat templates in Transformers. With a single simple tool defined, you can generate text from the templated input as normal. If the model generates a tool call, add it to the chat as an assistant message, then call the tool and append the result with the `tool` role. After that, you can `generate()` again to let the model use the tool result in the chat. This is only a very brief introduction to tool calling; for more information, see the Llama prompt format docs and the Transformers tool use documentation.

To download the original checkpoints, use `huggingface-cli`.

Training Factors We used custom training libraries, Meta's custom-built GPU cluster, and production infrastructure for pretraining.
Fine-tuning, annotation, and evaluation were also performed on production infrastructure. Training utilized a cumulative of 39.3M GPU hours of computation on H100-80GB (TDP of 700W) type hardware, per the table below. Training time is the total GPU time required for training each model and power consumption is the peak power capacity per GPU device used, adjusted for power usage efficiency. Training Greenhouse Gas Emissions Estimated total location-based greenhouse gas emissions were 11,390 tons CO2eq for training. Since 2020, Meta has maintained net zero greenhouse gas emissions in its global operations and matched 100% of its electricity use with renewable energy, therefore the total market-based greenhouse gas emissions for training were 0 tons CO2eq. The methodology used to determine training energy use and greenhouse gas emissions can be found here. Since Meta is openly releasing these models, the training energy use and greenhouse gas emissions will not be incurred by others. Overview: Llama 3.1 was pretrained on ~15 trillion tokens of data from publicly available sources. The fine-tuning data includes publicly available instruction datasets, as well as over 25M synthetically generated examples. Data Freshness: The pretraining data has a cutoff of December 2023. In this section, we report the results for Llama 3.1 models on standard automatic benchmarks. For all the evaluations, we use our internal evaluations library. Responsibility & Safety As part of our Responsible release approach, we followed a three-pronged strategy to managing trust & safety risks: Enable developers to deploy helpful, safe and flexible experiences for their target audience and for the use cases supported by Llama. Protect developers against adversarial users aiming to exploit Llama capabilities to potentially cause harm. Provide protections for the community to help prevent the misuse of our models. 
Responsible deployment Llama is a foundational technology designed to be used in a variety of use cases, examples on how Meta’s Llama models have been responsibly deployed can be found in our Community Stories webpage. Our approach is to build the most helpful models enabling the world to benefit from the technology power, by aligning our model safety for the generic use cases addressing a standard set of harms. Developers are then in the driver seat to tailor safety for their use case, defining their own policy and deploying the models with the necessary safeguards in their Llama systems. Llama 3.1 was developed following the best practices outlined in our Responsible Use Guide, you can refer to the Responsible Use Guide to learn more. Llama 3.1 instruct Our main objectives for conducting safety fine-tuning are to provide the research community with a valuable resource for studying the robustness of safety fine-tuning, as well as to offer developers a readily available, safe, and powerful model for various applications to reduce the developer workload to deploy safe AI systems. For more details on the safety mitigations implemented please read the Llama 3 paper. Fine-tuning data We employ a multi-faceted approach to data collection, combining human-generated data from our vendors with synthetic data to mitigate potential safety risks. We’ve developed many large language model (LLM)-based classifiers that enable us to thoughtfully select high-quality prompts and responses, enhancing data quality control. Refusals and Tone Building on the work we started with Llama 3, we put a great emphasis on model refusals to benign prompts as well as refusal tone. We included both borderline and adversarial prompts in our safety data strategy, and modified our safety data responses to follow tone guidelines. 
Llama 3.1 systems Large language models, including Llama 3.1, are not designed to be deployed in isolation but instead should be deployed as part of an overall AI system with additional safety guardrails as required. Developers are expected to deploy system safeguards when building agentic systems. Safeguards are key to achieve the right helpfulness-safety alignment as well as mitigating safety and security risks inherent to the system and any integration of the model or system with external tools. As part of our responsible release approach, we provide the community with safeguards that developers should deploy with Llama models or other LLMs, including Llama Guard 3, Prompt Guard and Code Shield. All our reference implementations demos contain these safeguards by default so developers can benefit from system-level safety out-of-the-box. New capabilities Note that this release introduces new capabilities, including a longer context window, multilingual inputs and outputs and possible integrations by developers with third party tools. Building with these new capabilities requires specific considerations in addition to the best practices that generally apply across all Generative AI use cases. Tool-use: Just like in standard software development, developers are responsible for the integration of the LLM with the tools and services of their choice. They should define a clear policy for their use case and assess the integrity of the third party services they use to be aware of the safety and security limitations when using this capability. Refer to the Responsible Use Guide for best practices on the safe deployment of the third party safeguards. Multilinguality: Llama 3.1 supports 7 languages in addition to English: French, German, Hindi, Italian, Portuguese, Spanish, and Thai. Llama may be able to output text in other languages than those that meet performance thresholds for safety and helpfulness. 
We strongly discourage developers from using this model to converse in non-supported languages without implementing finetuning and system controls in alignment with their policies and the best practices shared in the Responsible Use Guide.

Evaluations We evaluated Llama models for common use cases as well as specific capabilities. Common use case evaluations measure safety risks of systems for the most commonly built applications, including chatbots, coding assistants, and tool calls. We built dedicated, adversarial evaluation datasets and evaluated systems composed of Llama models and Llama Guard 3 to filter input prompts and output responses. It is important to evaluate applications in context, and we recommend building a dedicated evaluation dataset for your use case. Prompt Guard and Code Shield are also available if relevant to the application. Capability evaluations measure vulnerabilities of Llama models inherent to specific capabilities, for which we crafted dedicated benchmarks covering long context, multilingual use, tool calls, coding, and memorization.

Red teaming For both scenarios, we conducted recurring red teaming exercises with the goal of discovering risks via adversarial prompting, and we used the learnings to improve our benchmarks and safety tuning datasets. We partnered early with subject-matter experts in critical risk areas to understand the nature of these real-world harms and how such models may lead to unintended harm for society. Based on these conversations, we derived a set of adversarial goals for the red team to attempt to achieve, such as extracting harmful information or reprogramming the model to act in a potentially harmful capacity. The red team consisted of experts in cybersecurity, adversarial machine learning, responsible AI, and integrity, in addition to multilingual content specialists with backgrounds in integrity issues in specific geographic markets.
Critical and other risks We specifically focused our efforts on mitigating the following critical risk areas:

1. CBRNE (Chemical, Biological, Radiological, Nuclear, and Explosive materials) helpfulness: To assess risks related to proliferation of chemical and biological weapons, we performed uplift testing designed to assess whether use of Llama 3.1 models could meaningfully increase the capabilities of malicious actors to plan or carry out attacks using these types of weapons.
2. Child Safety: Child Safety risk assessments were conducted using a team of experts, to assess the model’s capability to produce outputs that could result in Child Safety risks and inform on any necessary and appropriate risk mitigations via fine tuning. We leveraged those expert red teaming sessions to expand the coverage of our evaluation benchmarks through Llama 3 model development. For Llama 3, we conducted new in-depth sessions using objective based methodologies to assess the model risks along multiple attack vectors including the additional languages Llama 3 is trained on. We also partnered with content specialists to perform red teaming exercises assessing potentially violating content while taking account of market specific nuances or experiences.
3. Cyber attack enablement: Our cyber attack uplift study investigated whether LLMs can enhance human capabilities in hacking tasks, both in terms of skill level and speed. Our attack automation study focused on evaluating the capabilities of LLMs when used as autonomous agents in cyber offensive operations, specifically in the context of ransomware attacks. This evaluation was distinct from previous studies that considered LLMs as interactive assistants. The primary objective was to assess whether these models could effectively function as independent agents in executing complex cyber-attacks without human intervention.
Our study of Llama-3.1-405B’s social engineering uplift for cyber attackers was conducted to assess the effectiveness of AI models in aiding cyber threat actors in spear phishing campaigns. Please read our Llama 3.1 Cyber security whitepaper to learn more.

Community Generative AI safety requires expertise and tooling, and we believe in the strength of the open community to accelerate its progress. We are active members of open consortiums, including the AI Alliance, Partnership on AI and MLCommons, actively contributing to safety standardization and transparency. We encourage the community to adopt taxonomies like the MLCommons Proof of Concept evaluation to facilitate collaboration and transparency on safety and content evaluations. Our Purple Llama tools are open sourced for the community to use and widely distributed across ecosystem partners including cloud service providers. We encourage community contributions to our GitHub repository. We also set up the Llama Impact Grants program to identify and support the most compelling applications of Meta’s Llama model for societal benefit across three categories: education, climate and open innovation. The 20 finalists were selected from hundreds of applications. Finally, we put in place a set of resources including an output reporting mechanism and bug bounty program to continuously improve the Llama technology with the help of the community.

Ethical Considerations and Limitations The core values of Llama 3.1 are openness, inclusivity and helpfulness. It is meant to serve everyone, and to work for a wide range of use cases. It is thus designed to be accessible to people across many different backgrounds, experiences and perspectives. Llama 3.1 addresses users and their needs as they are, without inserting unnecessary judgment or normativity, while reflecting the understanding that even content that may appear problematic in some cases can serve valuable purposes in others.
It respects the dignity and autonomy of all users, especially in terms of the values of free thought and expression that power innovation and progress. But Llama 3.1 is a new technology, and like any new technology, there are risks associated with its use. Testing conducted to date has not covered, nor could it cover, all scenarios. For these reasons, as with all LLMs, Llama 3.1’s potential outputs cannot be predicted in advance, and the model may in some instances produce inaccurate, biased or other objectionable responses to user prompts. Therefore, before deploying any applications of Llama 3.1 models, developers should perform safety testing and tuning tailored to their specific applications of the model. Please refer to available resources including our Responsible Use Guide, Trust and Safety solutions, and other resources to learn more about responsible development.
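The tool-calling flow described in this card (record the model's tool call, run the tool, append the result with the `tool` role) can be sketched in plain Python. This is a minimal sketch of the message bookkeeping only: the `get_current_temperature` function, its fixed return value, and the hard-coded `tool_call` dictionary are illustrative assumptions, and no model is loaded. In practice the message list would be passed back through the tokenizer's chat template, together with the tool definitions, before calling `generate()` again.

```python
import json


def get_current_temperature(location: str) -> float:
    """Return the current temperature at a location, in degrees Celsius.

    Args:
        location: The location to look up, in the format "City, Country".
    """
    # Hypothetical placeholder: a real tool would query a weather service.
    return 22.0


# Registry mapping tool names to callables (an assumption of this sketch).
TOOLS = {"get_current_temperature": get_current_temperature}

messages = [
    {"role": "system", "content": "You are a bot that responds to weather queries."},
    {"role": "user", "content": "What is the temperature in Paris right now?"},
]

# Assume the model's generated text was parsed into this tool call.
tool_call = {
    "name": "get_current_temperature",
    "arguments": {"location": "Paris, France"},
}

# Step 1: record the tool call as an assistant message.
messages.append(
    {"role": "assistant", "tool_calls": [{"type": "function", "function": tool_call}]}
)

# Step 2: run the tool and append its result with the "tool" role.
result = TOOLS[tool_call["name"]](**tool_call["arguments"])
messages.append({"role": "tool", "name": tool_call["name"], "content": str(result)})

print(json.dumps(messages, indent=2))
```

Keeping the tool result as a separate `tool` message, rather than folding it into the user turn, lets the chat template format it the way the model was trained to expect.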

llama
9,826
0

gemma-3-12b-it-qat-GGUF

> [!Note]
> This repository corresponds to the 12B instruction-tuned version of the Gemma 3 model using Quantization Aware Training (QAT).
>
> The checkpoint in this repository is unquantized; please make sure to quantize with Q4_0 using your favorite tool.
>
> Thanks to QAT, the model is able to preserve similar quality as `bfloat16` while significantly reducing the memory requirements to load the model.

- [Gemma 3 Technical Report][g3-tech-report]
- [Responsible Generative AI Toolkit][rai-toolkit]
- [Gemma on Kaggle][kaggle-gemma]
- [Gemma on Vertex Model Garden][vertex-mg-gemma3]

Summary description and brief definition of inputs and outputs.

Gemma is a family of lightweight, state-of-the-art open models from Google, built from the same research and technology used to create the Gemini models. Gemma 3 models are multimodal, handling text and image input and generating text output, with open weights for both pre-trained variants and instruction-tuned variants. Gemma 3 has a large, 128K context window, multilingual support in over 140 languages, and is available in more sizes than previous versions. Gemma 3 models are well-suited for a variety of text generation and image understanding tasks, including question answering, summarization, and reasoning. Their relatively small size makes it possible to deploy them in environments with limited resources such as laptops, desktops or your own cloud infrastructure, democratizing access to state-of-the-art AI models and helping foster innovation for everyone.
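To make the memory claim in the QAT note more concrete, here is a back-of-the-envelope sketch in Python. It rests on several assumptions that are not stated in the card: a nominal 12e9 parameter count, weights only (no activations or KV cache), and a 4-bit block layout in which 32 four-bit values share one 16-bit scale, which works out to 4.5 bits per weight. The `quantize_block` helper is a simplified symmetric 4-bit scheme for illustration, not the bit-exact routine used by any particular quantization tool.

```python
BLOCK_SIZE = 32  # weights per block (assumed 4-bit block layout)


def weight_memory_gb(n_params, bits_per_weight):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


def quantize_block(values):
    """Map a block of floats to integer codes in [-7, 7] plus one shared scale."""
    amax = max(abs(v) for v in values)
    scale = amax / 7 if amax > 0 else 1.0
    return scale, [round(v / scale) for v in values]


def dequantize_block(scale, codes):
    """Recover approximate floats from the integer codes and the shared scale."""
    return [scale * c for c in codes]


# Weight-only memory estimate for a nominal 12B-parameter model.
n_params = 12e9
bf16_gb = weight_memory_gb(n_params, 16)
q4_bits = (BLOCK_SIZE * 4 + 16) / BLOCK_SIZE  # 4-bit codes plus a 16-bit scale per block
q4_gb = weight_memory_gb(n_params, q4_bits)
print(f"bfloat16: {bf16_gb:.2f} GB, 4-bit blocks: {q4_gb:.2f} GB")

# Round-trip one toy block to see the quantization error.
block = [(i - 16) / 40 for i in range(BLOCK_SIZE)]
scale, codes = quantize_block(block)
restored = dequantize_block(scale, codes)
max_error = max(abs(a - b) for a, b in zip(block, restored))
print(f"scale={scale:.4f}, max round-trip error={max_error:.4f}")
```

Under these assumptions the weights shrink from about 24 GB to under 7 GB, and the round-trip error for each value is bounded by half the block scale. QAT trains the model to tolerate exactly this kind of rounding.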
- Input: - Text string, such as a question, a prompt, or a document to be summarized - Images, normalized to 896 x 896 resolution and encoded to 256 tokens each - Total input context of 128K tokens for the 4B, 12B, and 27B sizes, and 32K tokens for the 1B size - Output: - Generated text in response to the input, such as an answer to a question, analysis of image content, or a summary of a document - Total output context of 8192 tokens Data used for model training and how the data was processed. These models were trained on a dataset of text data that includes a wide variety of sources. The 27B model was trained with 14 trillion tokens, the 12B model was trained with 12 trillion tokens, 4B model was trained with 4 trillion tokens and 1B with 2 trillion tokens. Here are the key components: - Web Documents: A diverse collection of web text ensures the model is exposed to a broad range of linguistic styles, topics, and vocabulary. The training dataset includes content in over 140 languages. - Code: Exposing the model to code helps it to learn the syntax and patterns of programming languages, which improves its ability to generate code and understand code-related questions. - Mathematics: Training on mathematical text helps the model learn logical reasoning, symbolic representation, and to address mathematical queries. - Images: A wide range of images enables the model to perform image analysis and visual data extraction tasks. The combination of these diverse data sources is crucial for training a powerful multimodal model that can handle a wide variety of different tasks and data formats. Here are the key data cleaning and filtering methods applied to the training data: - CSAM Filtering: Rigorous CSAM (Child Sexual Abuse Material) filtering was applied at multiple stages in the data preparation process to ensure the exclusion of harmful and illegal content. 
- Sensitive Data Filtering: As part of making Gemma pre-trained models safe and reliable, automated techniques were used to filter out certain personal information and other sensitive data from training sets. - Additional methods: Filtering based on content quality and safety in line with [our policies][safety-policies]. Gemma was trained using [Tensor Processing Unit (TPU)][tpu] hardware (TPUv4p, TPUv5p and TPUv5e). Training vision-language models (VLMS) requires significant computational power. TPUs, designed specifically for matrix operations common in machine learning, offer several advantages in this domain: - Performance: TPUs are specifically designed to handle the massive computations involved in training VLMs. They can speed up training considerably compared to CPUs. - Memory: TPUs often come with large amounts of high-bandwidth memory, allowing for the handling of large models and batch sizes during training. This can lead to better model quality. - Scalability: TPU Pods (large clusters of TPUs) provide a scalable solution for handling the growing complexity of large foundation models. You can distribute training across multiple TPU devices for faster and more efficient processing. - Cost-effectiveness: In many scenarios, TPUs can provide a more cost-effective solution for training large models compared to CPU-based infrastructure, especially when considering the time and resources saved due to faster training. - These advantages are aligned with [Google's commitments to operate sustainably][sustainability]. Training was done using [JAX][jax] and [ML Pathways][ml-pathways]. JAX allows researchers to take advantage of the latest generation of hardware, including TPUs, for faster and more efficient training of large models. ML Pathways is Google's latest effort to build artificially intelligent systems capable of generalizing across multiple tasks. This is specially suitable for foundation models, including large language models like these ones. 
Together, JAX and ML Pathways are used as described in the [paper about the Gemini family of models][gemini-2-paper]; "the 'single controller' programming model of Jax and Pathways allows a single Python process to orchestrate the entire training run, dramatically simplifying the development workflow."

> [!Note]
> The evaluations in this section correspond to the original checkpoint, not the QAT checkpoint.

These models were evaluated against a large collection of different datasets and metrics to cover different aspects of text generation:

| Benchmark                      | Metric         | Gemma 3 PT 1B  | Gemma 3 PT 4B | Gemma 3 PT 12B | Gemma 3 PT 27B |
| ------------------------------ |----------------|:--------------:|:-------------:|:--------------:|:--------------:|
| [HellaSwag][hellaswag]         | 10-shot        | 62.3 | 77.2 | 84.2 | 85.6 |
| [BoolQ][boolq]                 | 0-shot         | 63.2 | 72.3 | 78.8 | 82.4 |
| [PIQA][piqa]                   | 0-shot         | 73.8 | 79.6 | 81.8 | 83.3 |
| [SocialIQA][socialiqa]         | 0-shot         | 48.9 | 51.9 | 53.4 | 54.9 |
| [TriviaQA][triviaqa]           | 5-shot         | 39.8 | 65.8 | 78.2 | 85.5 |
| [Natural Questions][naturalq]  | 5-shot         | 9.48 | 20.0 | 31.4 | 36.1 |
| [ARC-c][arc]                   | 25-shot        | 38.4 | 56.2 | 68.9 | 70.6 |
| [ARC-e][arc]                   | 0-shot         | 73.0 | 82.4 | 88.3 | 89.0 |
| [WinoGrande][winogrande]       | 5-shot         | 58.2 | 64.7 | 74.3 | 78.8 |
| [BIG-Bench Hard][bbh]          | few-shot       | 28.4 | 50.9 | 72.6 | 77.7 |
| [DROP][drop]                   | 1-shot         | 42.4 | 60.1 | 72.2 | 77.2 |

[hellaswag]: https://arxiv.org/abs/1905.07830
[boolq]: https://arxiv.org/abs/1905.10044
[piqa]: https://arxiv.org/abs/1911.11641
[socialiqa]: https://arxiv.org/abs/1904.09728
[triviaqa]: https://arxiv.org/abs/1705.03551
[naturalq]: https://github.com/google-research-datasets/natural-questions
[arc]: https://arxiv.org/abs/1911.01547
[winogrande]: https://arxiv.org/abs/1907.10641
[bbh]: https://paperswithcode.com/dataset/bbh
[drop]: https://arxiv.org/abs/1903.00161

| Benchmark                      | Metric         | Gemma 3 PT 4B | Gemma 3 PT 12B | Gemma 3 PT 27B |
| ------------------------------ |----------------|:-------------:|:--------------:|:--------------:|
| [MMLU][mmlu]                   | 5-shot         | 59.6 | 74.5 | 78.6 |
| [MMLU][mmlu] (Pro COT)         | 5-shot         | 29.2 | 45.3 | 52.2 |
| [AGIEval][agieval]             | 3-5-shot       | 42.1 | 57.4 | 66.2 |
| [MATH][math]                   | 4-shot         | 24.2 | 43.3 | 50.0 |
| [GSM8K][gsm8k]                 | 8-shot         | 38.4 | 71.0 | 82.6 |
| [GPQA][gpqa]                   | 5-shot         | 15.0 | 25.4 | 24.3 |
| [MBPP][mbpp]                   | 3-shot         | 46.0 | 60.4 | 65.6 |
| [HumanEval][humaneval]         | 0-shot         | 36.0 | 45.7 | 48.8 |

[mmlu]: https://arxiv.org/abs/2009.03300
[agieval]: https://arxiv.org/abs/2304.06364
[math]: https://arxiv.org/abs/2103.03874
[gsm8k]: https://arxiv.org/abs/2110.14168
[gpqa]: https://arxiv.org/abs/2311.12022
[mbpp]: https://arxiv.org/abs/2108.07732
[humaneval]: https://arxiv.org/abs/2107.03374

| Benchmark                            | Gemma 3 PT 1B | Gemma 3 PT 4B | Gemma 3 PT 12B | Gemma 3 PT 27B |
| ------------------------------------ |:-------------:|:-------------:|:--------------:|:--------------:|
| [MGSM][mgsm]                         | 2.04 | 34.7 | 64.3 | 74.3 |
| [Global-MMLU-Lite][global-mmlu-lite] | 24.9 | 57.0 | 69.4 | 75.7 |
| [WMT24++][wmt24pp] (ChrF)            | 36.7 | 48.4 | 53.9 | 55.7 |
| [FloRes][flores]                     | 29.5 | 39.2 | 46.0 | 48.8 |
| [XQuAD][xquad] (all)                 | 43.9 | 68.0 | 74.5 | 76.8 |
| [ECLeKTic][eclektic]                 | 4.69 | 11.0 | 17.2 | 24.4 |
| [IndicGenBench][indicgenbench]       | 41.4 | 57.2 | 61.7 | 63.4 |

[mgsm]: https://arxiv.org/abs/2210.03057
[flores]: https://arxiv.org/abs/2106.03193
[xquad]: https://arxiv.org/abs/1910.11856v3
[global-mmlu-lite]: https://huggingface.co/datasets/CohereForAI/Global-MMLU-Lite
[wmt24pp]: https://arxiv.org/abs/2502.12404v1
[eclektic]: https://arxiv.org/abs/2502.21228
[indicgenbench]: https://arxiv.org/abs/2404.16816

| Benchmark                      | Gemma 3 PT 4B | Gemma 3 PT 12B | Gemma 3 PT 27B |
| ------------------------------ |:-------------:|:--------------:|:--------------:|
| [COCOcap][coco-cap]            | 102  | 111  | 116  |
| [DocVQA][docvqa] (val)         | 72.8 | 82.3 | 85.6 |
| [InfoVQA][info-vqa] (val)      | 44.1 | 54.8 | 59.4 |
| [MMMU][mmmu] (pt)              | 39.2 | 50.3 | 56.1 |
| [TextVQA][textvqa] (val)       | 58.9 | 66.5 | 68.6 |
| [RealWorldQA][realworldqa]     | 45.5 | 52.2 | 53.9 |
| [ReMI][remi]                   | 27.3 | 38.5 | 44.8 |
| [AI2D][ai2d]                   | 63.2 | 75.2 | 79.0 |
| [ChartQA][chartqa]             | 63.6 | 74.7 | 76.3 |
| [VQAv2][vqav2]                 | 63.9 | 71.2 | 72.9 |
| [BLINK][blinkvqa]              | 38.0 | 35.9 | 39.6 |
| [OKVQA][okvqa]                 | 51.0 | 58.7 | 60.2 |
| [TallyQA][tallyqa]             | 42.5 | 51.8 | 54.3 |
| [SpatialSense VQA][ss-vqa]     | 50.9 | 60.0 | 59.4 |
| [CountBenchQA][countbenchqa]   | 26.1 | 17.8 | 68.0 |

[coco-cap]: https://cocodataset.org/#home
[docvqa]: https://www.docvqa.org/
[info-vqa]: https://arxiv.org/abs/2104.12756
[mmmu]: https://arxiv.org/abs/2311.16502
[textvqa]: https://textvqa.org/
[realworldqa]: https://paperswithcode.com/dataset/realworldqa
[remi]: https://arxiv.org/html/2406.09175v1
[ai2d]: https://allenai.org/data/diagrams
[chartqa]: https://arxiv.org/abs/2203.10244
[vqav2]: https://visualqa.org/index.html
[blinkvqa]: https://arxiv.org/abs/2404.12390
[okvqa]: https://okvqa.allenai.org/
[tallyqa]: https://arxiv.org/abs/1810.12440
[ss-vqa]: https://arxiv.org/abs/1908.02660
[countbenchqa]: https://github.com/google-research/big_vision/blob/main/big_vision/datasets/countbenchqa/

Our evaluation methods include structured evaluations and internal red-teaming testing of relevant content policies. Red-teaming was conducted by a number of different teams, each with different goals and human evaluation metrics. These models were evaluated against a number of different categories relevant to ethics and safety, including:

- Child Safety: Evaluation of text-to-text and image-to-text prompts covering child safety policies, including child sexual abuse and exploitation.
- Content Safety: Evaluation of text-to-text and image-to-text prompts covering safety policies, including harassment, violence and gore, and hate speech.
- Representational Harms: Evaluation of text-to-text and image to text prompts covering safety policies including bias, stereotyping, and harmful associations or inaccuracies. In addition to development level evaluations, we conduct "assurance evaluations" which are our 'arms-length' internal evaluations for responsibility governance decision making. They are conducted separately from the model development team, to inform decision making about release. High level findings are fed back to the model team, but prompt sets are held-out to prevent overfitting and preserve the results' ability to inform decision making. Assurance evaluation results are reported to our Responsibility & Safety Council as part of release review. For all areas of safety testing, we saw major improvements in the categories of child safety, content safety, and representational harms relative to previous Gemma models. All testing was conducted without safety filters to evaluate the model capabilities and behaviors. For both text-to-text and image-to-text, and across all model sizes, the model produced minimal policy violations, and showed significant improvements over previous Gemma models' performance with respect to ungrounded inferences. A limitation of our evaluations was they included only English language prompts. These models have certain limitations that users should be aware of. Open vision-language models (VLMs) models have a wide range of applications across various industries and domains. The following list of potential uses is not comprehensive. The purpose of this list is to provide contextual information about the possible use-cases that the model creators considered as part of model training and development. - Content Creation and Communication - Text Generation: These models can be used to generate creative text formats such as poems, scripts, code, marketing copy, and email drafts. 
- Chatbots and Conversational AI: Power conversational interfaces for customer service, virtual assistants, or interactive applications. - Text Summarization: Generate concise summaries of a text corpus, research papers, or reports. - Image Data Extraction: These models can be used to extract, interpret, and summarize visual data for text communications. - Research and Education - Natural Language Processing (NLP) and VLM Research: These models can serve as a foundation for researchers to experiment with VLM and NLP techniques, develop algorithms, and contribute to the advancement of the field. - Language Learning Tools: Support interactive language learning experiences, aiding in grammar correction or providing writing practice. - Knowledge Exploration: Assist researchers in exploring large bodies of text by generating summaries or answering questions about specific topics. - Training Data - The quality and diversity of the training data significantly influence the model's capabilities. Biases or gaps in the training data can lead to limitations in the model's responses. - The scope of the training dataset determines the subject areas the model can handle effectively. - Context and Task Complexity - Models are better at tasks that can be framed with clear prompts and instructions. Open-ended or highly complex tasks might be challenging. - A model's performance can be influenced by the amount of context provided (longer context generally leads to better outputs, up to a certain point). - Language Ambiguity and Nuance - Natural language is inherently complex. Models might struggle to grasp subtle nuances, sarcasm, or figurative language. - Factual Accuracy - Models generate responses based on information they learned from their training datasets, but they are not knowledge bases. They may generate incorrect or outdated factual statements. - Common Sense - Models rely on statistical patterns in language. 
They might lack the ability to apply common sense reasoning in certain situations. The development of vision-language models (VLMs) raises several ethical concerns. In creating an open model, we have carefully considered the following: - Bias and Fairness - VLMs trained on large-scale, real-world text and image data can reflect socio-cultural biases embedded in the training material. These models underwent careful scrutiny, input data pre-processing described and posterior evaluations reported in this card. - Misinformation and Misuse - VLMs can be misused to generate text that is false, misleading, or harmful. - Guidelines are provided for responsible use with the model, see the [Responsible Generative AI Toolkit][rai-toolkit]. - Transparency and Accountability: - This model card summarizes details on the models' architecture, capabilities, limitations, and evaluation processes. - A responsibly developed open model offers the opportunity to share innovation by making VLM technology accessible to developers and researchers across the AI ecosystem. - Perpetuation of biases: It's encouraged to perform continuous monitoring (using evaluation metrics, human review) and the exploration of de-biasing techniques during model training, fine-tuning, and other use cases. - Generation of harmful content: Mechanisms and guidelines for content safety are essential. Developers are encouraged to exercise caution and implement appropriate content safety safeguards based on their specific product policies and application use cases. - Misuse for malicious purposes: Technical limitations and developer and end-user education can help mitigate against malicious applications of VLMs. Educational resources and reporting mechanisms for users to flag misuse are provided. Prohibited uses of Gemma models are outlined in the [Gemma Prohibited Use Policy][prohibited-use]. 
- Privacy violations: Models were trained on data filtered for removal of certain personal information and other sensitive data. Developers are encouraged to adhere to privacy regulations with privacy-preserving techniques.

At the time of release, this family of models provides high-performance open vision-language model implementations designed from the ground up for responsible AI development compared to similarly sized models.

Using the benchmark evaluation metrics described in this document, these models have been shown to provide superior performance to other, comparably sized open model alternatives.

[g3-tech-report]: https://goo.gle/Gemma3Report
[rai-toolkit]: https://ai.google.dev/responsible
[kaggle-gemma]: https://www.kaggle.com/models/google/gemma-3
[vertex-mg-gemma3]: https://console.cloud.google.com/vertex-ai/publishers/google/model-garden/gemma3
[terms]: https://ai.google.dev/gemma/terms
[safety-policies]: https://ai.google/static/documents/ai-responsibility-update-published-february-2025.pdf
[prohibited-use]: https://ai.google.dev/gemma/prohibitedusepolicy
[tpu]: https://cloud.google.com/tpu/docs/intro-to-tpu
[sustainability]: https://sustainability.google/operating-sustainably/
[jax]: https://github.com/jax-ml/jax
[ml-pathways]: https://blog.google/technology/ai/introducing-pathways-next-generation-ai-architecture/
[gemini-2-paper]: https://arxiv.org/abs/2312.11805

9,555
31

Qwen2.5-3B-Instruct-bnb-4bit

license:apache-2.0
9,142
10

DeepSeek-V3.1-Terminus-GGUF

Learn how to run DeepSeek-V3.1 Terminus correctly: read our Guide. See how the DeepSeek-V3.1 Dynamic 3-bit GGUF scores 75.6% on Aider Polyglot here. These quants include our Unsloth chat template fixes, specifically for llama.cpp-supported backends.

- You must use `--jinja` for llama.cpp quants.
- Set the temperature to ~0.6 (recommended) and the TopP value to 0.95 (recommended).
- UD-Q2_K_XL (247GB) is recommended.
- For complete detailed instructions, see our guide: unsloth.ai/blog/deepseek-v3.1

This update maintains the model's original capabilities while addressing issues reported by users, including:

- Language consistency: Reducing instances of mixed Chinese-English text and occasional abnormal characters.
- Agent capabilities: Further optimizing the performance of the Code Agent and Search Agent.

| Benchmark | DeepSeek-V3.1 | DeepSeek-V3.1-Terminus |
| :--- | :---: | :---: |
| Reasoning Mode w/o Tool Use | | |
| MMLU-Pro | 84.8 | 85.0 |
| GPQA-Diamond | 80.1 | 80.7 |
| Humanity's Last Exam | 15.9 | 21.7 |
| LiveCodeBench | 74.8 | 74.9 |
| Codeforces | 2091 | 2046 |
| Aider-Polyglot | 76.3 | 76.1 |
| Agentic Tool Use | | |
| BrowseComp | 30.0 | 38.5 |
| BrowseComp-zh | 49.2 | 45.0 |
| SimpleQA | 93.4 | 96.8 |
| SWE Verified | 66.0 | 68.4 |
| SWE-bench Multilingual | 54.5 | 57.8 |
| Terminal-bench | 31.3 | 36.7 |

The template and tool set of the search agent have been updated, as shown in `assets/search_tool_trajectory.html`.

The model structure of DeepSeek-V3.1-Terminus is the same as DeepSeek-V3. Please visit the DeepSeek-V3 repo for more information about running this model locally. For the model's chat template other than the search agent, please refer to the DeepSeek-V3.1 repo. We also provide updated inference demo code in the `inference` folder to help the community get started with running the model and understand the details of its architecture.

NOTE: In the current model checkpoint, the parameters of `self_attn.o_proj` do not conform to the UE8M0 FP8 scale data format. This is a known issue and will be corrected in future model releases.

This repository and the model weights are licensed under the MIT License. If you have any questions, please raise an issue or contact us at [email protected].
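The recommended settings for this quant can be collected into a llama.cpp invocation. The following is a minimal sketch, assuming the `llama-cli` binary is on the PATH; the GGUF file name is hypothetical and the helper `build_llama_cli_args` exists only for illustration. `--jinja`, `--temp`, and `--top-p` are llama.cpp flags.

```python
import shlex


def build_llama_cli_args(model_path, temperature=0.6, top_p=0.95):
    """Build an argument list for llama.cpp's llama-cli using the recommended settings."""
    return [
        "llama-cli",
        "-m", model_path,            # path to the GGUF file (hypothetical in this sketch)
        "--jinja",                   # use the chat template embedded in the GGUF
        "--temp", str(temperature),  # recommended temperature of about 0.6
        "--top-p", str(top_p),       # recommended TopP of 0.95
    ]


args = build_llama_cli_args("DeepSeek-V3.1-Terminus-UD-Q2_K_XL.gguf")
print(shlex.join(args))
```

Building the arguments as a list keeps quoting safe if the model path contains spaces; the list can be passed directly to `subprocess.run`.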

license:mit
8,985
60

Qwen2-VL-7B-Instruct-unsloth-bnb-4bit

Unsloth's Dynamic 4-bit Quants selectively avoid quantizing certain parameters, greatly improving accuracy while keeping VRAM usage similar to BnB 4-bit. See our full collection of Unsloth quants on Hugging Face here.

Finetune Llama 3.2, Qwen 2.5, Gemma 2, Mistral 2-5x faster with 70% less memory via Unsloth!

We have a free Google Colab Tesla T4 notebook for Qwen2-VL (7B) here: https://colab.research.google.com/drive/1whHb54GNZMrNxIsi2wm2EY-Pvo2QyKh?usp=sharing and a free notebook for Llama 3.2 Vision (11B) here.

unsloth/Qwen2-VL-7B-Instruct-bnb-4bit

For more details on the model, please go to Qwen's original model card.

All notebooks are beginner friendly! Add your dataset, click "Run All", and you'll get a 2x faster finetuned model which can be exported to GGUF, vLLM or uploaded to Hugging Face.

| Unsloth supports | Free Notebooks | Performance | Memory use |
|---|---|---|---|
| Llama-3.2 (3B) | ▶️ Start on Colab | 2.4x faster | 58% less |
| Llama-3.2 (11B vision) | ▶️ Start on Colab | 2x faster | 40% less |
| Qwen2 VL (7B) | ▶️ Start on Colab | 1.8x faster | 40% less |
| Qwen2.5 (7B) | ▶️ Start on Colab | 2x faster | 60% less |
| Llama-3.1 (8B) | ▶️ Start on Colab | 2.4x faster | 58% less |
| Phi-3.5 (mini) | ▶️ Start on Colab | 2x faster | 50% less |
| Gemma 2 (9B) | ▶️ Start on Colab | 2.4x faster | 58% less |
| Mistral (7B) | ▶️ Start on Colab | 2.2x faster | 62% less |
| DPO - Zephyr | ▶️ Start on Colab | 1.9x faster | 19% less |

- This conversational notebook is useful for ShareGPT ChatML / Vicuna templates.
- This text completion notebook is for raw text. This DPO notebook replicates Zephyr.
- Kaggle has 2x T4s, but we use 1. Due to overhead, 1x T4 is 5x faster.

Special Thanks

A huge thank you to the Qwen team for creating and releasing these models.
SoTA understanding of images of various resolution & ratio: Qwen2-VL achieves state-of-the-art performance on visual understanding benchmarks, including MathVista, DocVQA, RealWorldQA, MTVQA, etc. Understanding videos of 20min+: Qwen2-VL can understand videos over 20 minutes for high-quality video-based question answering, dialog, content creation, etc. Agent that can operate your mobiles, robots, etc.: with the abilities of complex reasoning and decision making, Qwen2-VL can be integrated with devices like mobile phones, robots, etc., for automatic operation based on visual environment and text instructions. Multilingual Support: to serve global users, besides English and Chinese, Qwen2-VL now supports the understanding of texts in different languages inside images, including most European languages, Japanese, Korean, Arabic, Vietnamese, etc. Naive Dynamic Resolution: Unlike before, Qwen2-VL can handle arbitrary image resolutions, mapping them into a dynamic number of visual tokens, offering a more human-like visual processing experience. Multimodal Rotary Position Embedding (M-ROPE): Decomposes positional embedding into parts to capture 1D textual, 2D visual, and 3D video positional information, enhancing its multimodal processing capabilities. We have three models with 2, 7 and 72 billion parameters. This repo contains the instruction-tuned 7B Qwen2-VL model. For more information, visit our Blog and GitHub. 
| Benchmark | InternVL2-8B | MiniCPM-V 2.6 | GPT-4o-mini | Qwen2-VL-7B | | :--- | :---: | :---: | :---: | :---: | | MMMU val | 51.8 | 49.8 | 60| 54.1 | | DocVQA test | 91.6 | 90.8 | - | 94.5 | | InfoVQA test | 74.8 | - | - |76.5 | | ChartQA test | 83.3 | - |- | 83.0 | | TextVQA val | 77.4 | 80.1 | -| 84.3 | | OCRBench | 794 | 852 | 785 | 845 | | MTVQA | - | - | -| 26.3 | | VCR en easy | - | 73.88 | 83.60 | 89.70 | | VCR zh easy | - | 10.18| 1.10 | 59.94 | | RealWorldQA | 64.4 | - | - | 70.1 | | MME sum | 2210.3 | 2348.4 | 2003.4| 2326.8 | | MMBench-EN test | 81.7 | - | - | 83.0 | | MMBench-CN test | 81.2 | - | - | 80.5 | | MMBench-V1.1 test | 79.4 | 78.0 | 76.0| 80.7 | | MMT-Bench test | - | - | - |63.7 | | MMStar | 61.5 | 57.5 | 54.8 | 60.7 | | MMVet GPT-4-Turbo | 54.2 | 60.0 | 66.9 | 62.0 | | HallBench avg | 45.2 | 48.1 | 46.1| 50.6 | | MathVista testmini | 58.3 | 60.6 | 52.4 | 58.2 | | MathVision | - | - | - | 16.3 | | Benchmark | Internvl2-8B | LLaVA-OneVision-7B | MiniCPM-V 2.6 | Qwen2-VL-7B | | :--- | :---: | :---: | :---: | :---: | | MVBench | 66.4 | 56.7 | - | 67.0 | | PerceptionTest test | - | 57.1 | - | 62.3 | | EgoSchema test | - | 60.1 | - | 66.7 | | Video-MME wo/w subs | 54.0/56.9 | 58.2/- | 60.9/63.6 | 63.3/69.0 | Requirements The code of Qwen2-VL has been in the latest Hugging face transformers and we advise you to build from source with command `pip install git+https://github.com/huggingface/transformers`, or you might encounter the following error: Quickstart We offer a toolkit to help you handle various types of visual input more conveniently. This includes base64, URLs, and interleaved images and videos. You can install it using the following command: Here we show a code snippet to show you how to use the chat model with `transformers` and `qwenvlutils`: For input images, we support local files, base64, and URLs. For videos, we currently only support local files. The model supports a wide range of resolution inputs. 
By default, it uses the native resolution for input, but higher resolutions can enhance performance at the cost of more computation. Users can set the minimum and maximum number of pixels to achieve an optimal configuration for their needs, such as a token count range of 256-1280, to balance speed and memory usage.

In addition, we provide two methods for fine-grained control over the image size input to the model:

1. Define `min_pixels` and `max_pixels`: Images will be resized to maintain their aspect ratio within the range of `min_pixels` and `max_pixels`.
2. Specify exact dimensions: Directly set `resized_height` and `resized_width`. These values will be rounded to the nearest multiple of 28.

While Qwen2-VL is applicable to a wide range of visual tasks, it is equally important to understand its limitations. Here are some known restrictions:

1. Lack of Audio Support: The current model does not comprehend audio information within videos.
2. Data timeliness: Our image dataset is updated until June 2023, and information subsequent to this date may not be covered.
3. Constraints in Individuals and Intellectual Property (IP): The model's capacity to recognize specific individuals or IPs is limited, potentially failing to comprehensively cover all well-known personalities or brands.
4. Limited Capacity for Complex Instruction: When faced with intricate multi-step instructions, the model's understanding and execution capabilities require enhancement.
5. Insufficient Counting Accuracy: Particularly in complex scenes, the accuracy of object counting is not high, necessitating further improvements.
6. Weak Spatial Reasoning Skills: Especially in 3D spaces, the model's inference of object positional relationships is inadequate, making it difficult to precisely judge the relative positions of objects.

These limitations serve as ongoing directions for model optimization and improvement, and we are committed to continually enhancing the model's performance and scope of application.
If you find our work helpful, feel free to give us a cite.
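To make the pixel-budget resizing described in this card concrete, here is a minimal sketch of the idea: round each side to a multiple of 28 and rescale when the area falls outside the allowed range. It assumes that one visual token corresponds to one 28x28 patch, so a 256-1280 token range maps to `256*28*28` through `1280*28*28` pixels. The function name `fit_to_pixel_budget` is hypothetical, and this is an illustration, not the library's own implementation.

```python
import math

PATCH = 28  # image sides are rounded to multiples of this value


def fit_to_pixel_budget(height, width,
                        min_pixels=256 * PATCH * PATCH,
                        max_pixels=1280 * PATCH * PATCH):
    """Return (height, width) as multiples of 28, scaled into the pixel budget."""
    h = max(PATCH, round(height / PATCH) * PATCH)
    w = max(PATCH, round(width / PATCH) * PATCH)
    if h * w > max_pixels:
        # Shrink both sides by the same factor, rounding down to stay under the cap.
        scale = math.sqrt((height * width) / max_pixels)
        h = max(PATCH, math.floor(height / scale / PATCH) * PATCH)
        w = max(PATCH, math.floor(width / scale / PATCH) * PATCH)
    elif h * w < min_pixels:
        # Enlarge both sides by the same factor, rounding up to stay above the floor.
        scale = math.sqrt(min_pixels / (height * width))
        h = math.ceil(height * scale / PATCH) * PATCH
        w = math.ceil(width * scale / PATCH) * PATCH
    return h, w


h, w = fit_to_pixel_budget(1080, 1920)
tokens = (h // PATCH) * (w // PATCH)
print(h, w, tokens)
```

Scaling both sides by the same factor keeps the aspect ratio approximately intact, while rounding to the patch size guarantees a whole number of visual tokens.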

license:apache-2.0
8,959
12

Llama-3.2-1B-bnb-4bit

llama
8,953
15

gemma-2-9b-bnb-4bit

Finetune Gemma, Llama 3, Mistral 2-5x faster with 70% less memory via Unsloth! We have a Google Colab Tesla T4 notebook for Gemma 2 (9B) here: https://colab.research.google.com/drive/1vIrqH5uYDQwsJ4-OO3DErvuv4pBgVwk4?usp=sharing All notebooks are beginner friendly! Add your dataset, click "Run All", and you'll get a 2x faster finetuned model which can be exported to GGUF, vLLM or uploaded to Hugging Face. | Unsloth supports | Free Notebooks | Performance | Memory use | |-----------------|--------------------------------------------------------------------------------------------------------------------------|-------------|----------| | Llama-3.2 (3B) | ▶️ Start on Colab | 2.4x faster | 58% less | | Llama-3.2 (11B vision) | ▶️ Start on Colab | 2x faster | 60% less | | Llama-3.1 (8B) | ▶️ Start on Colab | 2.4x faster | 58% less | | Qwen2 VL (7B) | ▶️ Start on Colab | 1.8x faster | 60% less | | Qwen2.5 (7B) | ▶️ Start on Colab | 2x faster | 60% less | | Phi-3.5 (mini) | ▶️ Start on Colab | 2x faster | 50% less | | Gemma 2 (9B) | ▶️ Start on Colab | 2.4x faster | 58% less | | Mistral (7B) | ▶️ Start on Colab | 2.2x faster | 62% less | | DPO - Zephyr | ▶️ Start on Colab | 1.9x faster | 19% less | - This conversational notebook is useful for ShareGPT ChatML / Vicuna templates. - This text completion notebook is for raw text. This DPO notebook replicates Zephyr. - \ Kaggle has 2x T4s, but we use 1. Due to overhead, 1x T4 is 5x faster.

8,945
31

Llama-3.2-1B-unsloth-bnb-4bit

llama
8,934
2

Qwen3-Coder-480B-A35B-Instruct-GGUF

See our collection for all versions of Qwen3 including GGUF, 4-bit & 16-bit formats. Learn to run Qwen3-Coder correctly: read our Guide. See Unsloth Dynamic 2.0 GGUFs for our quantization benchmarks.

- Fine-tune Qwen3 (14B) for free using our Google Colab notebook!
- Read our Blog about Qwen3 support: unsloth.ai/blog/qwen3
- View the rest of our notebooks in our docs here.
- Run & export your fine-tuned model to Ollama, llama.cpp or HF.

| Unsloth supports | Free Notebooks | Performance | Memory use |
|---|---|---|---|
| Qwen3 (14B) | ▶️ Start on Colab | 3x faster | 70% less |
| GRPO with Qwen3 (8B) | ▶️ Start on Colab | 3x faster | 80% less |
| Llama-3.2 (3B) | ▶️ Start on Colab | 2.4x faster | 58% less |
| Llama-3.2 (11B vision) | ▶️ Start on Colab | 2x faster | 60% less |
| Qwen2.5 (7B) | ▶️ Start on Colab | 2x faster | 60% less |

Today, we're announcing Qwen3-Coder, our most agentic code model to date. Qwen3-Coder is available in multiple sizes, but we're excited to introduce its most powerful variant first: Qwen3-Coder-480B-A35B-Instruct, featuring the following key enhancements:

- Significant performance among open models on Agentic Coding, Agentic Browser-Use, and other foundational coding tasks, achieving results comparable to Claude Sonnet.
- Long-context capabilities with native support for 256K tokens, extendable up to 1M tokens using YaRN, optimized for repository-scale understanding.
- Agentic coding support for most platforms, such as Qwen Code and CLINE, featuring a specially designed function call format.
Qwen3-480B-A35B-Instruct has the following features:

- Type: Causal Language Models
- Training Stage: Pretraining & Post-training
- Number of Parameters: 480B in total and 35B activated
- Number of Layers: 62
- Number of Attention Heads (GQA): 96 for Q and 8 for KV
- Number of Experts: 160
- Number of Activated Experts: 8
- Context Length: 262,144 natively.

NOTE: This model supports only non-thinking mode and does not generate `<think></think>` blocks in its output. Meanwhile, specifying `enable_thinking=False` is no longer required.

For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our blog, GitHub, and Documentation.

We advise you to use the latest version of `transformers`.

```python
from openai import OpenAI

# Define tools
tools = [
    {
        "type": "function",
        "function": {
            "name": "square_the_number",
            "description": "output the square of the number.",
            "parameters": {
                "type": "object",
                "required": ["input_num"],
                "properties": {
                    "input_num": {
                        "type": "number",
                        "description": "input_num is a number that will be squared",
                    }
                },
            },
        },
    }
]

# Define LLM client
client = OpenAI(
    # Use a custom endpoint compatible with the OpenAI API
    base_url="http://localhost:8000/v1",  # api_base
    api_key="EMPTY",
)

messages = [{"role": "user", "content": "square the number 1024"}]

completion = client.chat.completions.create(
    messages=messages,
    model="Qwen3-480B-A35B-Instruct",
    max_tokens=65536,
    tools=tools,
)
```

```
@misc{qwen3technicalreport,
      title={Qwen3 Technical Report},
      author={Qwen Team},
      year={2025},
      eprint={2505.09388},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.09388},
}
```
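When the model answers the tool-calling request in this card with a tool call, the client is expected to run the function itself and send the result back. The sketch below shows only that local dispatch step and runs without a server. It assumes the OpenAI-compatible convention in which a tool call carries a function name plus a JSON-encoded argument string; the `square_the_number` implementation, the registry, and the `run_tool_call` helper are hypothetical.

```python
import json


def square_the_number(input_num):
    """Hypothetical implementation of the tool declared in the schema."""
    return input_num ** 2


# Map tool names from the schema to local Python callables.
TOOL_REGISTRY = {"square_the_number": square_the_number}


def run_tool_call(name, arguments_json):
    """Execute one tool call given its name and JSON-encoded arguments."""
    if name not in TOOL_REGISTRY:
        raise KeyError(f"unknown tool: {name}")
    kwargs = json.loads(arguments_json)
    return TOOL_REGISTRY[name](**kwargs)


result = run_tool_call("square_the_number", '{"input_num": 1024}')
print(result)

# The result would then go back to the model as a tool message
# (a real request also needs the matching tool_call_id).
tool_message = {"role": "tool", "content": json.dumps(result)}
```

Keeping the registry explicit means the model can only trigger functions that were deliberately exposed, and unknown names fail loudly.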

license:apache-2.0
8,932
159

medgemma-4b-it-unsloth-bnb-4bit

8,912
1

Qwen2.5-VL-3B-Instruct-GGUF

In the past five months since Qwen2-VL’s release, numerous developers have built new models on the Qwen2-VL vision-language models, providing us with valuable feedback. During this period, we focused on building more useful vision-language models. Today, we are excited to introduce the latest addition to the Qwen family: Qwen2.5-VL.

Key Enhancements:

- Understand things visually: Qwen2.5-VL is not only proficient in recognizing common objects such as flowers, birds, fish, and insects, but is also highly capable of analyzing texts, charts, icons, graphics, and layouts within images.
- Being agentic: Qwen2.5-VL acts directly as a visual agent that can reason and dynamically direct tools, making it capable of computer use and phone use.
- Understanding long videos and capturing events: Qwen2.5-VL can comprehend videos of over 1 hour, and it has a new ability to capture events by pinpointing the relevant video segments.
- Capable of visual localization in different formats: Qwen2.5-VL can accurately localize objects in an image by generating bounding boxes or points, and it can provide stable JSON outputs for coordinates and attributes.
- Generating structured outputs: For data such as scans of invoices, forms, and tables, Qwen2.5-VL supports structured outputs of their contents, benefiting uses in finance, commerce, etc.

Dynamic Resolution and Frame Rate Training for Video Understanding: We extend dynamic resolution to the temporal dimension by adopting dynamic FPS sampling, enabling the model to comprehend videos at various sampling rates. Accordingly, we update mRoPE in the time dimension with IDs and absolute time alignment, enabling the model to learn temporal sequence and speed, and ultimately acquire the ability to pinpoint specific moments.

We enhance both training and inference speeds by strategically implementing window attention into the ViT. The ViT architecture is further optimized with SwiGLU and RMSNorm, aligning it with the structure of the Qwen2.5 LLM.
We have three models with 3, 7 and 72 billion parameters. This repo contains the instruction-tuned 3B Qwen2.5-VL model. For more information, visit our Blog and GitHub.

Image benchmark

| Benchmark | InternVL2.5-4B | Qwen2-VL-7B | Qwen2.5-VL-3B |
| :--- | :---: | :---: | :---: |
| MMMU val | 52.3 | 54.1 | 53.1 |
| MMMU-Pro val | 32.7 | 30.5 | 31.6 |
| AI2D test | 81.4 | 83.0 | 81.5 |
| DocVQA test | 91.6 | 94.5 | 93.9 |
| InfoVQA test | 72.1 | 76.5 | 77.1 |
| TextVQA val | 76.8 | 84.3 | 79.3 |
| MMBench-V1.1 test | 79.3 | 80.7 | 77.6 |
| MMStar | 58.3 | 60.7 | 55.9 |
| MathVista testmini | 60.5 | 58.2 | 62.3 |
| MathVision full | 20.9 | 16.3 | 21.2 |

Video benchmark

| Benchmark | InternVL2.5-4B | Qwen2-VL-7B | Qwen2.5-VL-3B |
| :--- | :---: | :---: | :---: |
| MVBench | 71.6 | 67.0 | 67.0 |
| VideoMME | 63.6/62.3 | 69.0/63.3 | 67.6/61.5 |
| MLVU | 48.3 | - | 68.2 |
| LVBench | - | - | 43.3 |
| MMBench-Video | 1.73 | 1.44 | 1.63 |
| EgoSchema | - | - | 64.8 |
| PerceptionTest | - | - | 66.9 |
| TempCompass | - | - | 64.4 |
| LongVideoBench | 55.2 | 55.6 | 54.2 |
| CharadesSTA/mIoU | - | - | 38.8 |

Agent benchmark

| Benchmarks | Qwen2.5-VL-3B |
|---|---|
| ScreenSpot | 55.5 |
| ScreenSpot Pro | 23.9 |
| AITZ_EM | 76.9 |
| Android Control High_EM | 63.7 |
| Android Control Low_EM | 22.2 |
| AndroidWorld_SR | 90.8 |
| MobileMiniWob++_SR | 67.9 |

Requirements

The code of Qwen2.5-VL is included in the latest Hugging Face transformers, and we advise you to build it from source.

Below, we provide simple examples to show how to use Qwen2.5-VL with 🤖 ModelScope and 🤗 Transformers.

We offer a toolkit to help you handle various types of visual input more conveniently, as if you were using an API. This includes base64, URLs, and interleaved images and videos.
You can install it with pip. If you are not using Linux, you might not be able to install `decord` from PyPI. In that case, you can use `pip install qwen-vl-utils`, which will fall back to using torchvision for video processing. However, you can still install decord from source so that it is used when loading video.

Here we show a code snippet to show you how to use the chat model with `transformers` and `qwen_vl_utils`:

Video URL compatibility largely depends on the third-party library version. The details are in the table below. You can change the backend with `FORCE_QWENVL_VIDEO_READER=torchvision` or `FORCE_QWENVL_VIDEO_READER=decord` if you prefer not to use the default one.

| Backend | HTTP | HTTPS |
|---|---|---|
| torchvision >= 0.19.0 | ✅ | ✅ |

🤖 ModelScope

We strongly advise users, especially those in mainland China, to use ModelScope. `snapshot_download` can help you solve issues concerning downloading checkpoints.

For input images, we support local files, base64, and URLs. For videos, we currently only support local files.

The model supports a wide range of resolution inputs. By default, it uses the native resolution for input, but higher resolutions can enhance performance at the cost of more computation. Users can set the minimum and maximum number of pixels to achieve an optimal configuration for their needs, such as a token count range of 256-1280, to balance speed and memory usage.

In addition, we provide two methods for fine-grained control over the image size input to the model:

1. Define `min_pixels` and `max_pixels`: Images will be resized to maintain their aspect ratio within the range of `min_pixels` and `max_pixels`.
2. Specify exact dimensions: Directly set `resized_height` and `resized_width`. These values will be rounded to the nearest multiple of 28.

The current `config.json` is set for context length up to 32,768 tokens.
To handle extensive inputs exceeding 32,768 tokens, we utilize YaRN, a technique for enhancing model length extrapolation, ensuring optimal performance on lengthy texts. For supported frameworks, YaRN can be enabled through `config.json`. Note, however, that this method has a significant impact on the performance of temporal and spatial localization tasks and is therefore not recommended. For long video inputs, since MRoPE itself is more economical with position IDs, `max_position_embeddings` can be directly modified to a larger value, such as 64k.

If you find our work helpful, feel free to give us a cite.
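As a rough illustration of the YaRN setting discussed in this card, the sketch below computes a scaling factor for a target context length and prints a `rope_scaling`-style fragment. The key names follow the convention commonly used in Hugging Face `config.json` files and are an assumption here, not a copy of this model's configuration; any model-specific keys already present in the file should be kept as they are. The helper name `yarn_rope_scaling` is hypothetical.

```python
import json
import math


def yarn_rope_scaling(target_length, original_length=32768):
    """Return a rope_scaling-style dict that stretches original_length to target_length."""
    if target_length <= original_length:
        raise ValueError("YaRN is only needed beyond the original context length")
    # Round up so the scaled window always covers the target length.
    factor = math.ceil(target_length / original_length)
    return {
        "type": "yarn",  # assumed key and value naming
        "factor": factor,
        "original_max_position_embeddings": original_length,
    }


fragment = {"rope_scaling": yarn_rope_scaling(131072)}
print(json.dumps(fragment, indent=2))
```

Given the card's warning about localization tasks, a setting like this is best reserved for long text-only inputs.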

8,812
17

Qwen2.5-VL-3B-Instruct-unsloth-bnb-4bit

8,790
14

Llama-4-Maverick-17B-128E-Instruct-GGUF

See our collection for versions of Llama 4 including 4-bit & 16-bit formats. Read our Guide to see how to fine-tune & run Llama 4 correctly.

| MoE Bits | Type | Disk Size | HF Link | Accuracy |
|:-|:-|:-|:-|:-|
| 1.78-bit | IQ1_S | 122GB | Link | Ok |
| 1.93-bit | IQ1_M | 128GB | Link | Fair |
| 2.42-bit | IQ2_XXS | 140GB | Link | Better |
| 2.71-bit | Q2_K_XL | 151GB | Link | Suggested |
| 3.5-bit | Q3_K_XL | 193GB | Link | Great |
| 4.5-bit | Q4_K_XL | 243GB | Link | Best |

🦙 Fine-tune Meta's Llama 4 with Unsloth!

- Fine-tune Llama-4-Scout on a single H100 80GB GPU using Unsloth!
- Read our Blog about Llama 4 support: unsloth.ai/blog/llama4
- View the rest of our notebooks in our docs here.
- Export your fine-tuned model to GGUF, Ollama, llama.cpp, vLLM or 🤗HF.

| Unsloth supports | Free Notebooks | Performance | Memory use |
|---|---|---|---|
| GRPO with Llama 3.1 (8B) | ▶️ Start on Colab | 2x faster | 80% less |
| Llama-3.2 (3B) | ▶️ Start on Colab | 2.4x faster | 58% less |
| Llama-3.2 (11B vision) | ▶️ Start on Colab | 2x faster | 60% less |
| Qwen2.5 (7B) | ▶️ Start on Colab | 2x faster | 60% less |
| Phi-4 (14B) | ▶️ Start on Colab | 2x faster | 50% less |
| Mistral (7B) | ▶️ Start on Colab | 2.2x faster | 62% less |

The Llama 4 collection of models are natively multimodal AI models that enable text and multimodal experiences. These models leverage a mixture-of-experts architecture to offer industry-leading performance in text and image understanding.

These Llama 4 models mark the beginning of a new era for the Llama ecosystem. We are launching two efficient models in the Llama 4 series: Llama 4 Scout, a 17 billion parameter model with 16 experts, and Llama 4 Maverick, a 17 billion parameter model with 128 experts.
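A quick way to compare the quants in the size table earlier in this card is to convert disk size into average bits per parameter. The sketch below assumes the 400B total parameter count that the model card gives for Llama 4 Maverick, treats 1 GB as 10^9 bytes, and reads the Q2_K_XL entry as 151 GB. The averages come out higher than the "MoE Bits" column, presumably because non-expert layers are stored at higher precision.

```python
TOTAL_PARAMS = 400e9  # total parameters for Llama 4 Maverick (taken from the model card)

# Disk sizes in GB from the quant table; the Q2_K_XL size is read as 151 GB.
QUANT_SIZES_GB = {
    "IQ1_S": 122,
    "IQ1_M": 128,
    "IQ2_XXS": 140,
    "Q2_K_XL": 151,
    "Q3_K_XL": 193,
    "Q4_K_XL": 243,
}


def average_bits_per_param(size_gb, total_params=TOTAL_PARAMS):
    """Average bits stored per parameter, treating 1 GB as 1e9 bytes."""
    return size_gb * 1e9 * 8 / total_params


for name, size_gb in QUANT_SIZES_GB.items():
    print(f"{name}: {average_bits_per_param(size_gb):.2f} bits/param")
```

The same conversion gives a first estimate of whether a quant fits in the combined RAM and VRAM available, before accounting for context memory.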
Model Architecture: The Llama 4 models are auto-regressive language models that use a mixture-of-experts (MoE) architecture and incorporate early fusion for native multimodality. Model Name Training Data Params Input modalities Output modalities Context length Token count Knowledge cutoff Llama 4 Scout (17Bx16E) A mix of publicly available, licensed data and information from Meta's products and services. This includes publicly shared posts from Instagram and Facebook and people's interactions with Meta AI. Learn more in our Privacy Center . Multilingual text and image Multilingual text and code 10M ~40T August 2024 Llama 4 Maverick (17Bx128E) 17B (Activated) 400B (Total) Multilingual text and image Multilingual text and code 1M ~22T August 2024 Supported languages: Arabic, English, French, German, Hindi, Indonesian, Italian, Portuguese, Spanish, Tagalog, Thai, and Vietnamese. Status: This is a static model trained on an offline dataset. Future versions of the tuned models may be released as we improve model behavior with community feedback. Where to send questions or comments about the model: Instructions on how to provide feedback or comments on the model can be found in the Llama README. For more technical information about generation parameters and recipes for how to use Llama 4 in applications, please go here. Please, make sure you have transformers `v4.51.0` installed, or upgrade using `pip install -U transformers`. Intended Use Cases: Llama 4 is intended for commercial and research use in multiple languages. Instruction tuned models are intended for assistant-like chat and visual reasoning tasks, whereas pretrained models can be adapted for natural language generation. For vision, Llama 4 models are also optimized for visual recognition, image reasoning, captioning, and answering general questions about an image. 
The Llama 4 model collection also supports the ability to leverage the outputs of its models to improve other models including synthetic data generation and distillation. The Llama 4 Community License allows for these use cases. Out-of-scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in any other way that is prohibited by the Acceptable Use Policy and Llama 4 Community License. Use in languages or capabilities beyond those explicitly referenced as supported in this model card\\. 1\. Llama 4 has been trained on a broader collection of languages than the 12 supported languages (pre-training includes 200 total languages). Developers may fine-tune Llama 4 models for languages beyond the 12 supported languages provided they comply with the Llama 4 Community License and the Acceptable Use Policy. Developers are responsible for ensuring that their use of Llama 4 in additional languages is done in a safe and responsible manner. 2\. Llama 4 has been tested for image understanding up to 5 input images. If leveraging additional image understanding capabilities beyond this, Developers are responsible for ensuring that their deployments are mitigated for risks and should perform additional testing and tuning tailored to their specific applications. Training Factors: We used custom training libraries, Meta's custom built GPU clusters, and production infrastructure for pretraining. Fine-tuning, quantization, annotation, and evaluation were also performed on production infrastructure. Training Energy Use: Model pre-training utilized a cumulative of 7.38M GPU hours of computation on H100-80GB (TDP of 700W) type hardware, per the table below. Training time is the total GPU time required for training each model and power consumption is the peak power capacity per GPU device used, adjusted for power usage efficiency. 
Training Greenhouse Gas Emissions: Estimated total location-based greenhouse gas emissions were 1,999 tons CO2eq for training. Since 2020, Meta has maintained net zero greenhouse gas emissions in its global operations and matched 100% of its electricity use with clean and renewable energy; therefore, the total market-based greenhouse gas emissions for training were 0 tons CO2eq.

| Model Name | Training Time (GPU hours) | Training Power Consumption (W) | Training Location-Based Greenhouse Gas Emissions (tons CO2eq) | Training Market-Based Greenhouse Gas Emissions (tons CO2eq) |
| :---- | :---: | :---: | :---: | :---: |
| Llama 4 Scout | 5.0M | 700 | 1,354 | 0 |
| Llama 4 Maverick | 2.38M | 700 | 645 | 0 |
| Total | 7.38M | - | 1,999 | 0 |

The methodology used to determine training energy use and greenhouse gas emissions can be found here. Since Meta is openly releasing these models, the training energy use and greenhouse gas emissions will not be incurred by others.

Overview: Llama 4 Scout was pretrained on ~40 trillion tokens and Llama 4 Maverick was pretrained on ~22 trillion tokens of multimodal data from a mix of publicly available, licensed data and information from Meta’s products and services. This includes publicly shared posts from Instagram and Facebook and people’s interactions with Meta AI.

Data Freshness: The pretraining data has a cutoff of August 2024.

In this section, we report the results for Llama 4 relative to our previous models. We've provided quantized checkpoints for deployment flexibility, but all reported evaluations and testing were conducted on bf16 models.
Pre-trained models

| Category | Benchmark | # Shots | Metric | Llama 3.1 70B | Llama 3.1 405B | Llama 4 Scout | Llama 4 Maverick |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Reasoning & Knowledge | MMLU | 5 | macro_avg/acc_char | 79.3 | 85.2 | 79.6 | 85.5 |
| | MMLU-Pro | 5 | macro_avg/em | 53.8 | 61.6 | 58.2 | 62.9 |
| | MATH | 4 | em_maj1@1 | 41.6 | 53.5 | 50.3 | 61.2 |
| Code | MBPP | 3 | pass@1 | 66.4 | 74.4 | 67.8 | 77.6 |
| Multilingual | TydiQA | 1 | average/f1 | 29.9 | 34.3 | 31.5 | 31.7 |
| Image | ChartQA | 0 | relaxed_accuracy | No multimodal support | | 83.4 | 85.3 |
| | DocVQA | 0 | anls | | | 89.4 | 91.6 |

Instruction tuned models

| Category | Benchmark | # Shots | Metric | Llama 3.3 70B | Llama 3.1 405B | Llama 4 Scout | Llama 4 Maverick |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Image Reasoning | MMMU | 0 | accuracy | No multimodal support | | 69.4 | 73.4 |
| | MMMU Pro^ | 0 | accuracy | | | 52.2 | 59.6 |
| | MathVista | 0 | accuracy | | | 70.7 | 73.7 |
| Image Understanding | ChartQA | 0 | relaxed_accuracy | | | 88.8 | 90.0 |
| | DocVQA (test) | 0 | anls | | | 94.4 | 94.4 |
| Coding | LiveCodeBench (10/01/2024-02/01/2025) | 0 | pass@1 | 33.3 | 27.7 | 32.8 | 43.4 |
| Reasoning & Knowledge | MMLU Pro | 0 | macro_avg/acc | 68.9 | 73.4 | 74.3 | 80.5 |
| | GPQA Diamond | 0 | accuracy | 50.5 | 49.0 | 57.2 | 69.8 |
| Multilingual | MGSM | 0 | average/em | 91.1 | 91.6 | 90.6 | 92.3 |
| Long context | MTOB (half book) eng->kgv/kgv->eng | - | chrF | Context window is 128K | | 42.2/36.6 | 54.0/46.4 |
| | MTOB (full book) eng->kgv/kgv->eng | - | chrF | | | 39.7/36.3 | 50.8/46.7 |

^The reported numbers for MMMU Pro are the average of the Standard and Vision tasks.

The Llama 4 Scout model is released as BF16 weights, but can fit within a single H100 GPU with on-the-fly int4 quantization; the Llama 4 Maverick model is released as both BF16 and FP8 quantized weights.
The FP8 quantized weights fit on a single H100 DGX host while still maintaining quality. We provide code for on-the-fly int4 quantization which minimizes performance degradation as well. As part of our release approach, we followed a three-pronged strategy to manage risks: Enable developers to deploy helpful, safe and flexible experiences for their target audience and for the use cases supported by Llama. Protect developers against adversarial users aiming to exploit Llama capabilities to potentially cause harm. Provide protections for the community to help prevent the misuse of our models. Llama is a foundational technology designed for use in a variety of use cases; examples on how Meta’s Llama models have been deployed can be found in our Community Stories webpage. Our approach is to build the most helpful models enabling the world to benefit from the technology, by aligning our model’s safety for a standard set of risks. Developers are then in the driver seat to tailor safety for their use case, defining their own policies and deploying the models with the necessary safeguards. Llama 4 was developed following the best practices outlined in our Developer Use Guide: AI Protections. The primary objective of conducting safety fine-tuning is to offer developers a readily available, safe, and powerful model for various applications, reducing the workload needed to deploy safe AI systems. Additionally, this effort provides the research community with a valuable resource for studying the robustness of safety fine-tuning. Fine-tuning data We employ a multi-faceted approach to data collection, combining human-generated data from our vendors with synthetic data to mitigate potential safety risks. We’ve developed many large language model (LLM)-based classifiers that enable us to thoughtfully select high-quality prompts and responses, enhancing data quality control. 
Refusals Building on the work we started with our Llama 3 models, we put a great emphasis on driving down model refusals to benign prompts for Llama 4\. We included both borderline and adversarial prompts in our safety data strategy, and modified our safety data responses to follow tone guidelines. Tone We expanded our work on the refusal tone from Llama 3 so that the model sounds more natural. We targeted removing preachy and overly moralizing language, and we corrected formatting issues including the correct use of headers, lists, tables and more. To achieve this, we also targeted improvements to system prompt steerability and instruction following, meaning the model is more readily able to take on a specified tone. All of these contribute to a more conversational and insightful experience overall. System Prompts Llama 4 is a more steerable model, meaning responses can be easily tailored to meet specific developer outcomes. Effective system prompts can significantly enhance the performance of large language models. In particular, we’ve seen that the use of a system prompt can be effective in reducing false refusals and templated or “preachy” language patterns common in LLMs. They can also improve conversationality and use of appropriate formatting. Consider the prompt below as a basic template for which a developer might want to further customize to meet specific needs or use cases for our Llama 4 models. | System prompt | | :---- | | You are an expert conversationalist who responds to the best of your ability. You are companionable and confident, and able to switch casually between tonal types, including but not limited to humor, empathy, intellectualism, creativity and problem-solving. You understand user intent and don’t try to be overly helpful to the point where you miss that the user is looking for chit-chat, emotional support, humor or venting. Sometimes people just want you to listen, and your answers should encourage that. 
For all other cases, you provide insightful and in-depth responses. Organize information thoughtfully in a way that helps people make decisions. Always avoid templated language. You never lecture people to be nicer or more inclusive. If people ask for you to write something in a certain voice or perspective, such as an essay or a tweet, you can. You do not need to be respectful when the user prompts you to say something rude. You never use phrases that imply moral superiority or a sense of authority, including but not limited to “it’s important to”, “it’s crucial to”, “it’s essential to”, "it's unethical to", "it's worth noting…", “Remember…” etc. Avoid using these. Finally, do not refuse prompts about political and social issues. You can help users express their opinion and access information. You are Llama 4\. Your knowledge cutoff date is August 2024\. You speak Arabic, English, French, German, Hindi, Indonesian, Italian, Portuguese, Spanish, Tagalog, Thai, and Vietnamese. Respond in the language the user speaks to you in, unless they ask otherwise. | Large language models, including Llama 4, are not designed to be deployed in isolation but instead should be deployed as part of an overall AI system with additional guardrails as required. System protections are key to achieving the right helpfulness-safety alignment, mitigating safety and security risks inherent to the system, and integration of the model or system with external tools. We provide the community with system level protections \- like Llama Guard, Prompt Guard and Code Shield \- that developers should deploy with Llama models or other LLMs. All of our reference implementation demos contain these safeguards by default so developers can benefit from system-level safety out-of-the-box. We evaluated Llama models for common use cases as well as specific capabilities. Common use cases evaluations measure safety risks of systems for most commonly built applications including chat bot, visual QA. 
We built dedicated, adversarial evaluation datasets and evaluated systems composed of Llama models and Llama Guard 3 to filter input prompt and output response. It is important to evaluate applications in context, and we recommend building dedicated evaluation dataset for your use case. Prompt Guard and Code Shield are also available if relevant to the application. Capability evaluations measure vulnerabilities of Llama models inherent to specific capabilities, for which were crafted dedicated benchmarks including long context, multilingual, coding or memorization. Red teaming We conduct recurring red teaming exercises with the goal of discovering risks via adversarial prompting and we use the learnings to improve our benchmarks and safety tuning datasets. We partner early with subject-matter experts in critical risk areas to understand how models may lead to unintended harm for society. Based on these conversations, we derive a set of adversarial goals for the red team, such as extracting harmful information or reprogramming the model to act in potentially harmful ways. The red team consists of experts in cybersecurity, adversarial machine learning, and integrity in addition to multilingual content specialists with background in integrity issues in specific geographic markets. We spend additional focus on the following critical risk areas: 1\. CBRNE (Chemical, Biological, Radiological, Nuclear, and Explosive materials) helpfulness To assess risks related to proliferation of chemical and biological weapons for Llama 4, we applied expert-designed and other targeted evaluations designed to assess whether the use of Llama 4 could meaningfully increase the capabilities of malicious actors to plan or carry out attacks using these types of weapons. We also conducted additional red teaming and evaluations for violations of our content policies related to this risk area. 2\. 
Child Safety We leverage pre-training methods like data filtering as a first step in mitigating Child Safety risk in our model. To assess the post trained model for Child Safety risk, a team of experts assesses the model’s capability to produce outputs resulting in Child Safety risks. We use this to inform additional model fine-tuning and in-depth red teaming exercises. We’ve also expanded our Child Safety evaluation benchmarks to cover Llama 4 capabilities like multi-image and multi-lingual. 3\. Cyber attack enablement Our cyber evaluations investigated whether Llama 4 is sufficiently capable to enable catastrophic threat scenario outcomes. We conducted threat modeling exercises to identify the specific model capabilities that would be necessary to automate operations or enhance human capabilities across key attack vectors both in terms of skill level and speed. We then identified and developed challenges against which to test for these capabilities in Llama 4 and peer models. Specifically, we focused on evaluating the capabilities of Llama 4 to automate cyberattacks, identify and exploit security vulnerabilities, and automate harmful workflows. Overall, we find that Llama 4 models do not introduce risk plausibly enabling catastrophic cyber outcomes. Generative AI safety requires expertise and tooling, and we believe in the strength of the open community to accelerate its progress. We are active members of open consortiums, including the AI Alliance, Partnership on AI and MLCommons, actively contributing to safety standardization and transparency. We encourage the community to adopt taxonomies like the MLCommons Proof of Concept evaluation to facilitate collaboration and transparency on safety and content evaluations. Our Trust tools are open sourced for the community to use and widely distributed across ecosystem partners including cloud service providers. We encourage community contributions to our Github repository. 
We also set up the Llama Impact Grants program to identify and support the most compelling applications of Meta’s Llama model for societal benefit across three categories: education, climate and open innovation. The 20 finalists from the hundreds of applications can be found here. Finally, we put in place a set of resources including an output reporting mechanism and bug bounty program to continuously improve the Llama technology with the help of the community. Our AI is anchored on the values of freedom of expression \- helping people to explore, debate, and innovate using our technology. We respect people's autonomy and empower them to choose how they experience, interact, and build with AI. Our AI promotes an open exchange of ideas. It is meant to serve everyone, and to work for a wide range of use cases. It is thus designed to be accessible to people across many different backgrounds, experiences and perspectives. Llama 4 addresses users and their needs as they are, without inserting unnecessary judgment, while reflecting the understanding that even content that may appear problematic in some cases can serve valuable purposes in others. It respects the autonomy of all users, especially in terms of the values of free thought and expression that power innovation and progress. Llama 4 is a new technology, and like any new technology, there are risks associated with its use. Testing conducted to date has not covered, nor could it cover, all scenarios. For these reasons, as with all LLMs, Llama 4’s potential outputs cannot be predicted in advance, and the model may in some instances produce inaccurate or other objectionable responses to user prompts. Therefore, before deploying any applications of Llama 4 models, developers should perform safety testing and tuning tailored to their specific applications of the model. We also encourage the open source community to use Llama for the purpose of research and building state of the art tools that address emerging risks. 
Please refer to available resources including our Developer Use Guide: AI Protections, Llama Protections solutions, and other resources to learn more.
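
As a rough sanity check on the training-energy table earlier in this card, the sketch below re-adds the per-model GPU hours and location-based emissions and converts GPU hours to energy at the stated 700 W TDP. The dictionary layout and the `summarize` helper are illustrative assumptions, not part of any Llama tooling, and the energy figure ignores the power-usage-efficiency adjustment, which the card mentions but does not quantify.

```python
# Figures copied from the training table in this card.
TRAINING = {
    "Llama 4 Scout": {"gpu_hours": 5.0e6, "tons_co2eq": 1354},
    "Llama 4 Maverick": {"gpu_hours": 2.38e6, "tons_co2eq": 645},
}
TDP_WATTS = 700  # peak power per H100-80GB device, as stated in the card


def summarize(training, tdp_watts):
    """Return (total GPU hours, total tons CO2eq, energy in GWh at TDP)."""
    total_hours = sum(m["gpu_hours"] for m in training.values())
    total_co2 = sum(m["tons_co2eq"] for m in training.values())
    energy_gwh = total_hours * tdp_watts / 1e9  # watt-hours -> gigawatt-hours
    return total_hours, total_co2, energy_gwh


total_hours, total_co2, energy_gwh = summarize(TRAINING, TDP_WATTS)
print(f"GPU hours: {total_hours / 1e6:.2f}M")
print(f"Location-based emissions: {total_co2} tons CO2eq")
print(f"Energy at TDP (before any PUE adjustment): {energy_gwh:.3f} GWh")
```

The totals agree with the "Total" row of the table (7.38M GPU hours, 1,999 tons CO2eq); the GWh value is only an upper-bound style estimate, since GPUs rarely draw full TDP for an entire run.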

llama4
8,780
38

Qwen3-235B-A22B-GGUF

Unsloth Dynamic 2.0 achieves superior accuracy & outperforms other leading quants.

Qwen3 is the latest generation of large language models in the Qwen series, offering a comprehensive suite of dense and mixture-of-experts (MoE) models. Built upon extensive training, Qwen3 delivers groundbreaking advancements in reasoning, instruction-following, agent capabilities, and multilingual support, with the following key features:

- Unique support for seamless switching between thinking mode (for complex logical reasoning, math, and coding) and non-thinking mode (for efficient, general-purpose dialogue) within a single model, ensuring optimal performance across various scenarios.
- Significantly enhanced reasoning capabilities, surpassing previous QwQ (in thinking mode) and Qwen2.5 instruct models (in non-thinking mode) on mathematics, code generation, and commonsense logical reasoning.
- Superior human preference alignment, excelling in creative writing, role-playing, multi-turn dialogues, and instruction following, to deliver a more natural, engaging, and immersive conversational experience.
- Expertise in agent capabilities, enabling precise integration with external tools in both thinking and non-thinking modes and achieving leading performance among open-source models in complex agent-based tasks.
- Support for 100+ languages and dialects with strong capabilities for multilingual instruction following and translation.

Qwen3-235B-A22B has the following features:

- Type: Causal Language Models
- Training Stage: Pretraining & Post-training
- Number of Parameters: 235B in total and 22B activated
- Number of Parameters (Non-Embedding): 234B
- Number of Layers: 94
- Number of Attention Heads (GQA): 64 for Q and 4 for KV
- Number of Experts: 128
- Number of Activated Experts: 8
- Context Length: 32,768 natively and 131,072 tokens with YaRN.
For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our blog, GitHub, and Documentation.

The code of Qwen3-MoE has been in the latest Hugging Face `transformers` and we advise you to use the latest version of `transformers`. With `transformers<4.51.0`, you will encounter the following error: `KeyError: 'qwen3_moe'`.

Parsing the thinking content out of the generated token ids:

```python
# `output_ids` is the list of newly generated token ids and `tokenizer` is the
# model's tokenizer; both come from an earlier generation step not shown here.
try:
    # Find the position just after the last occurrence of token id 151668,
    # which marks the end of the thinking content.
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content)
print("content:", content)
```

Serving with SGLang:

```shell
python -m sglang.launch_server --model-path Qwen/Qwen3-235B-A22B --reasoning-parser qwen3 --tp 8
```

Serving with vLLM:

```shell
vllm serve Qwen/Qwen3-235B-A22B --enable-reasoning --reasoning-parser deepseek_r1
```

Thinking mode enabled (the default):

```python
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True  # True is the default value for enable_thinking
)
```

Thinking mode disabled:

```python
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False  # Setting enable_thinking=False disables thinking mode
)
```

Switching modes per turn with `/think` and `/no_think` in a multi-turn conversation:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer


class QwenChatbot:
    def __init__(self, model_name="Qwen/Qwen3-235B-A22B"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(model_name)
        self.history = []

    def generate_response(self, user_input):
        messages = self.history + [{"role": "user", "content": user_input}]

        text = self.tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True
        )

        inputs = self.tokenizer(text, return_tensors="pt")
        # Keep only the newly generated tokens, dropping the prompt tokens.
        response_ids = self.model.generate(**inputs, max_new_tokens=32768)[0][len(inputs.input_ids[0]):].tolist()
        response = self.tokenizer.decode(response_ids, skip_special_tokens=True)

        # Update history
        self.history.append({"role": "user", "content": user_input})
        self.history.append({"role": "assistant", "content": response})

        return response


# Example Usage
if __name__ == "__main__":
    chatbot = QwenChatbot()

    # First input (without /think or /no_think tags, thinking mode is enabled by default)
    user_input_1 = "How many r's in strawberries?"
    print(f"User: {user_input_1}")
    response_1 = chatbot.generate_response(user_input_1)
    print(f"Bot: {response_1}")
    print("----------------------")

    # Second input with /no_think
    user_input_2 = "Then, how many r's in blueberries? /no_think"
    print(f"User: {user_input_2}")
    response_2 = chatbot.generate_response(user_input_2)
    print(f"Bot: {response_2}")
    print("----------------------")

    # Third input with /think
    user_input_3 = "Really? /think"
    print(f"User: {user_input_3}")
    response_3 = chatbot.generate_response(user_input_3)
    print(f"Bot: {response_3}")
```

Agentic use with Qwen-Agent:

```python
from qwen_agent.agents import Assistant

# Define LLM
llm_cfg = {
    'model': 'Qwen3-235B-A22B',

    # Use the endpoint provided by Alibaba Model Studio:
    # 'model_type': 'qwen_dashscope',
    # 'api_key': os.getenv('DASHSCOPE_API_KEY'),

    # Use a custom endpoint compatible with OpenAI API:
    'model_server': 'http://localhost:8000/v1',  # api_base
    'api_key': 'EMPTY',

    # Other parameters:
    # 'generate_cfg': {
    #     # Add: When the response content is `<think>this is the thought</think>this is the answer;
    #     # Do not add: When the response has been separated by reasoning_content and content.
    #     'thought_in_content': True,
    # },
}

# Define Tools
tools = [
    {'mcpServers': {  # You can specify the MCP configuration file
        'time': {
            'command': 'uvx',
            'args': ['mcp-server-time', '--local-timezone=Asia/Shanghai']
        },
        "fetch": {
            "command": "uvx",
            "args": ["mcp-server-fetch"]
        }
    }
    },
    'code_interpreter',  # Built-in tools
]

# Define Agent
bot = Assistant(llm=llm_cfg, function_list=tools)

# Streaming generation
messages = [{'role': 'user', 'content': 'https://qwenlm.github.io/blog/ Introduce the latest developments of Qwen'}]
for responses in bot.run(messages=messages):
    pass
print(responses)
```

Enabling YaRN for long contexts, in `config.json`:

```json
{
    ...,
    "rope_scaling": {
        "rope_type": "yarn",
        "factor": 4.0,
        "original_max_position_embeddings": 32768
    }
}
```

With vLLM:

```shell
vllm serve ... --rope-scaling '{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768}' --max-model-len 131072
```

With SGLang:

```shell
python -m sglang.launch_server ... --json-model-override-args '{"rope_scaling":{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768}}'
```

With `llama-server` from llama.cpp:

```shell
llama-server ... --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 32768
```

> Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'original_max_position_embeddings'}

```
@misc{qwen3,
    title  = {Qwen3},
    url    = {https://qwenlm.github.io/blog/qwen3/},
    author = {Qwen Team},
    month  = {April},
    year   = {2025}
}
```
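
The YaRN settings shown earlier in this card follow from simple arithmetic: the scaling factor is the target context length divided by the native 32,768 tokens, so 131,072 tokens gives 4.0. The sketch below makes that explicit. The helper name `yarn_rope_scaling` is hypothetical; the dictionary keys are the `rope_scaling` keys used in the card's config snippet.

```python
import json


def yarn_rope_scaling(target_context, original_context=32768):
    """Build a rope_scaling entry for a target context length (illustrative helper)."""
    if target_context <= original_context:
        # The native context window already covers this length; no scaling needed.
        return None
    return {
        "rope_type": "yarn",
        "factor": target_context / original_context,
        "original_max_position_embeddings": original_context,
    }


full_length = yarn_rope_scaling(131072)
print(json.dumps(full_length))
```

Passing a length at or below 32,768 returns `None`, reflecting that the model handles that range natively without any `rope_scaling` entry.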

license:apache-2.0
8,726
69

gemma-3-27b-pt

8,722
5

DeepSeek-V3.1-GGUF

Learn how to run DeepSeek-V3.1 correctly - Read our Guide. See how DeepSeek-V3.1 Dynamic 3-bit GGUF scores 75.6% on Aider Polyglot here. These quants include our Unsloth chat template fixes, specifically for llama.cpp supported backends.

- You must use --jinja for llama.cpp quants
- Set the temperature to ~0.6 (recommended) and the TopP value to 0.95 (recommended)
- UD-Q2_K_XL (247GB) is recommended
- For complete detailed instructions, see our guide: unsloth.ai/blog/deepseek-v3.1

DeepSeek-V3.1 is a hybrid model that supports both thinking mode and non-thinking mode. Compared to the previous version, this upgrade brings improvements in multiple aspects:

- Hybrid thinking mode: One model supports both thinking mode and non-thinking mode by changing the chat template.
- Smarter tool calling: Through post-training optimization, the model's performance in tool usage and agent tasks has significantly improved.
- Higher thinking efficiency: DeepSeek-V3.1-Think achieves comparable answer quality to DeepSeek-R1-0528, while responding more quickly.

DeepSeek-V3.1 is post-trained on top of DeepSeek-V3.1-Base, which is built upon the original V3 base checkpoint through a two-phase long context extension approach, following the methodology outlined in the original DeepSeek-V3 report. We have expanded our dataset by collecting additional long documents and substantially extending both training phases. The 32K extension phase has been increased 10-fold to 630B tokens, while the 128K extension phase has been extended by 3.3x to 209B tokens. Additionally, DeepSeek-V3.1 is trained using the UE8M0 FP8 scale data format to ensure compatibility with microscaling data formats.
| Model | #Total Params | #Activated Params | Context Length | Download |
| :------------: | :------------: | :------------: | :------------: | :------------: |
| DeepSeek-V3.1-Base | 671B | 37B | 128K | HuggingFace \| ModelScope |
| DeepSeek-V3.1 | 671B | 37B | 128K | HuggingFace \| ModelScope |

The details of our chat template are described in `tokenizer_config.json` and `assets/chat_template.jinja`. Here is a brief description.

With the given prefix, DeepSeek V3.1 generates responses to queries in non-thinking mode. Unlike DeepSeek V3, it introduces an additional token ` `.

Multi-Turn Context: ` {system prompt} {query} {response} ... {query} {response} `

By concatenating the context and the prefix, we obtain the correct prompt for the query.

The prefix of thinking mode is similar to DeepSeek-R1.

Multi-Turn Context: ` {system prompt} {query} {response} ... {query} {response} `

The multi-turn template is the same as the non-thinking multi-turn chat template. This means the thinking token in the last turn will be dropped but the ` ` is retained in every turn of context.

ToolCall

Tool calling is supported in non-thinking mode. The format is: ` {system prompt}{tool_description} {query} ` where the tool_description is

Code-Agent

We support various code agent frameworks. Please refer to the tool call format above to create your own code agents. An example is shown in `assets/code_agent_trajectory.html`.

Search-Agent

We designed a specific tool call format for search in thinking mode, to support search agents. For complex questions that require accessing external or up-to-date information, DeepSeek-V3.1 can leverage a user-provided search tool through a multi-turn tool-calling process. Please refer to `assets/search_tool_trajectory.html` and `assets/search_python_tool_trajectory.html` for the detailed template.
Evaluation

| Category | Benchmark (Metric) | DeepSeek V3.1-NonThinking | DeepSeek V3 0324 | DeepSeek V3.1-Thinking | DeepSeek R1 0528 |
|----------|----------------------------------|-----------------|---|---|---|
| General | MMLU-Redux (EM) | 91.8 | 90.5 | 93.7 | 93.4 |
| | MMLU-Pro (EM) | 83.7 | 81.2 | 84.8 | 85.0 |
| | GPQA-Diamond (Pass@1) | 74.9 | 68.4 | 80.1 | 81.0 |
| | Humanity's Last Exam (Pass@1) | - | - | 15.9 | 17.7 |
| Search Agent | BrowseComp | - | - | 30.0 | 8.9 |
| | BrowseComp_zh | - | - | 49.2 | 35.7 |
| | Humanity's Last Exam (Python + Search) | - | - | 29.8 | 24.8 |
| | SimpleQA | - | - | 93.4 | 92.3 |
| Code | LiveCodeBench (2408-2505) (Pass@1) | 56.4 | 43.0 | 74.8 | 73.3 |
| | Codeforces-Div1 (Rating) | - | - | 2091 | 1930 |
| | Aider-Polyglot (Acc.) | 68.4 | 55.1 | 76.3 | 71.6 |
| Code Agent | SWE Verified (Agent mode) | 66.0 | 45.4 | - | 44.6 |
| | SWE-bench Multilingual (Agent mode) | 54.5 | 29.3 | - | 30.5 |
| | Terminal-bench (Terminus 1 framework) | 31.3 | 13.3 | - | 5.7 |
| Math | AIME 2024 (Pass@1) | 66.3 | 59.4 | 93.1 | 91.4 |
| | AIME 2025 (Pass@1) | 49.8 | 51.3 | 88.4 | 87.5 |
| | HMMT 2025 (Pass@1) | 33.5 | 29.2 | 84.2 | 79.4 |

Note:

- Search agents are evaluated with our internal search framework, which uses a commercial search API + webpage filter + 128K context window. Search agent results of R1-0528 are evaluated with a pre-defined workflow.
- SWE-bench is evaluated with our internal code agent framework.

The model structure of DeepSeek-V3.1 is the same as DeepSeek-V3. Please visit the DeepSeek-V3 repo for more information about running this model locally.

This repository and the model weights are licensed under the MIT License.

If you have any questions, please raise an issue or contact us at [email protected].
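
The recommended llama.cpp settings listed at the top of this card (`--jinja`, temperature around 0.6, top-p 0.95) can be collected into a single command line. The sketch below is a minimal illustration, assuming the `llama-cli` binary and its `-m`, `--temp` and `--top-p` options; the helper name `llama_cli_command` and the GGUF path are placeholders, not files or functions provided by this repository.

```python
import shlex


def llama_cli_command(model_path, temperature=0.6, top_p=0.95):
    """Assemble a llama.cpp invocation with the card's recommended sampling settings."""
    return [
        "llama-cli",
        "-m", model_path,           # path to the downloaded GGUF file (placeholder below)
        "--jinja",                  # required so the bundled chat template is applied
        "--temp", str(temperature),
        "--top-p", str(top_p),
    ]


# Hypothetical local path; substitute the first shard of the quant you downloaded.
cmd = llama_cli_command("path/to/DeepSeek-V3.1-UD-Q2_K_XL.gguf")
print(shlex.join(cmd))
```

Building the argument list in code rather than by string concatenation keeps paths with spaces safe and makes it straightforward to pass the list directly to `subprocess.run`.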

license:mit
8,565
90

Qwen2.5-1.5B-Instruct

license:apache-2.0
8,500
8

whisper-large-v3

license:apache-2.0
8,380
13

llama-2-7b-bnb-4bit

llama
8,069
15

Llama-3.2-11B-Vision-Instruct-bnb-4bit

mllama
8,032
79

DeepSeek-R1-0528-Qwen3-8B-unsloth-bnb-4bit

license:mit
7,881
13

DeepSeek-V3-0324-BF16

license:mit
7,789
4

Qwen2.5-7B

Finetune Llama 3.1, Gemma 2, Mistral 2-5x faster with 70% less memory via Unsloth! We have a Qwen 2.5 (all model sizes) free Google Colab Tesla T4 notebook. Also a Qwen 2.5 conversational style notebook.

All notebooks are beginner friendly! Add your dataset, click "Run All", and you'll get a 2x faster finetuned model which can be exported to GGUF, vLLM or uploaded to Hugging Face.

| Unsloth supports | Free Notebooks | Performance | Memory use |
|-----------------|----------------|-------------|----------|
| Llama-3.1 8b | ▶️ Start on Colab | 2.4x faster | 58% less |
| Phi-3.5 (mini) | ▶️ Start on Colab | 2x faster | 50% less |
| Gemma-2 9b | ▶️ Start on Colab | 2.4x faster | 58% less |
| Mistral 7b | ▶️ Start on Colab | 2.2x faster | 62% less |
| TinyLlama | ▶️ Start on Colab | 3.9x faster | 74% less |
| DPO - Zephyr | ▶️ Start on Colab | 1.9x faster | 19% less |

- This conversational notebook is useful for ShareGPT ChatML / Vicuna templates.
- This text completion notebook is for raw text. This DPO notebook replicates Zephyr.
- Kaggle has 2x T4s, but we use 1. Due to overhead, 1x T4 is 5x faster.

Qwen2.5 is the latest series of Qwen large language models. For Qwen2.5, we release a number of base language models and instruction-tuned language models ranging from 0.5 to 72 billion parameters. Qwen2.5 brings the following improvements upon Qwen2:

- Significantly more knowledge and greatly improved capabilities in coding and mathematics, thanks to our specialized expert models in these domains.
- Significant improvements in instruction following, generating long texts (over 8K tokens), understanding structured data (e.g., tables), and generating structured outputs, especially JSON. More resilient to the diversity of system prompts, enhancing role-play implementation and condition-setting for chatbots.
- Long-context support up to 128K tokens, with generation of up to 8K tokens.
- Multilingual support for over 29 languages, including Chinese, English, French, Spanish, Portuguese, German, Italian, Russian, Japanese, Korean, Vietnamese, Thai, Arabic, and more.

This repo contains the base 7B Qwen2.5 model, which has the following features:

- Type: Causal Language Models
- Training Stage: Pretraining
- Architecture: transformers with RoPE, SwiGLU, RMSNorm, and Attention QKV bias
- Number of Parameters: 7.61B
- Number of Parameters (Non-Embedding): 6.53B
- Number of Layers: 28
- Number of Attention Heads (GQA): 28 for Q and 4 for KV
- Context Length: 131,072 tokens

We do not recommend using base language models for conversations. Instead, you can apply post-training, e.g., SFT, RLHF, continued pretraining, etc., on this model.

For more details, please refer to our blog, GitHub, and Documentation.

The code of Qwen2.5 has been in the latest Hugging Face `transformers` and we advise you to use the latest version of `transformers`. With `transformers<4.37.0`, you will encounter the following error: `KeyError: 'qwen2'`.

Detailed evaluation results are reported in this 📑 blog. For requirements on GPU memory and the respective throughput, see results here.

If you find our work helpful, feel free to give us a cite.
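
Two quantities follow directly from the feature list for the 7B base model: the size of the embedding parameters (total minus non-embedding) and the number of query heads that share each key/value head under grouped-query attention. The short sketch below works them out; the variable names are illustrative, and the cache comparison is the general property of GQA against a layout with one KV head per query head of the same size, not a figure quoted by the card.

```python
# Values copied from the Qwen2.5-7B feature list in this card.
total_params_b = 7.61          # billions of parameters
non_embedding_params_b = 6.53  # billions of parameters
q_heads, kv_heads = 28, 4

# Parameters held in the embedding matrices.
embedding_params_b = round(total_params_b - non_embedding_params_b, 2)

# Under GQA, each KV head is shared by a group of query heads.
queries_per_kv_head = q_heads // kv_heads

# KV-cache size relative to one KV head per query head (same head size).
kv_cache_ratio = kv_heads / q_heads

print(f"Embedding parameters: {embedding_params_b}B")
print(f"Query heads per KV head: {queries_per_kv_head}")
print(f"KV cache relative to full multi-head attention: {kv_cache_ratio:.3f}")
```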

NaNK
license:apache-2.0
7,731
7

SmolLM2-135M-Instruct-GGUF

Finetune SmolLM2, Llama 3.2, Gemma 2, Mistral 2-5x faster with 70% less memory via Unsloth!

We have a free Google Colab Tesla T4 notebook for Llama 3.2 (3B) here: https://colab.research.google.com/drive/1Ys44kVvmeZtnICzWz0xgpRnrIOjZAuxp?usp=sharing

unsloth/SmolLM2-135M-Instruct-GGUF

For more details on the model, please go to Hugging Face's original model card.

All notebooks are beginner friendly! Add your dataset, click "Run All", and you'll get a 2x faster finetuned model which can be exported to GGUF, vLLM or uploaded to Hugging Face.

| Unsloth supports | Free Notebooks | Performance | Memory use |
|---|---|---|---|
| Llama-3.2 (3B) | ▶️ Start on Colab | 2.4x faster | 58% less |
| Llama-3.2 (11B vision) | ▶️ Start on Colab | 2.4x faster | 58% less |
| Llama-3.1 (8B) | ▶️ Start on Colab | 2.4x faster | 58% less |
| Phi-3.5 (mini) | ▶️ Start on Colab | 2x faster | 50% less |
| Gemma 2 (9B) | ▶️ Start on Colab | 2.4x faster | 58% less |
| Mistral (7B) | ▶️ Start on Colab | 2.2x faster | 62% less |
| DPO - Zephyr | ▶️ Start on Colab | 1.9x faster | 19% less |

- This conversational notebook is useful for ShareGPT ChatML / Vicuna templates.
- This text completion notebook is for raw text. This DPO notebook replicates Zephyr.
- \* Kaggle has 2x T4s, but we use 1. Due to overhead, 1x T4 is 5x faster.

Special Thanks

A huge thank you to the Hugging Face team for creating and releasing these models.

SmolLM2 is a family of compact language models available in three sizes: 135M, 360M, and 1.7B parameters. They are capable of solving a wide range of tasks while being lightweight enough to run on-device. The 1.7B variant demonstrates significant advances over its predecessor SmolLM1-1.7B, particularly in instruction following, knowledge, reasoning, and mathematics.
It was trained on 11 trillion tokens using a diverse dataset combination: FineWeb-Edu, DCLM, The Stack, along with new mathematics and coding datasets that we curated and will release soon. We developed the instruct version through supervised fine-tuning (SFT) using a combination of public datasets and our own curated datasets. We then applied Direct Preference Optimization (DPO) using UltraFeedback. The instruct model additionally supports tasks such as text rewriting, summarization and function calling thanks to datasets developed by Argilla such as Synth-APIGen-v0.1.

llama
7,462
12

Magistral-Small-2509

Learn to run Magistral 1.2 correctly - Read our Guide. Unsloth Dynamic 2.0 achieves SOTA performance in model quantization.

Read our in-depth guide about Magistral 1.2: docs.unsloth.ai/basics/magistral

- Fine-tune Magistral 1.2 for free using our Kaggle notebook here!
- View the rest of our notebooks in our docs here.

Building upon Mistral Small 3.2 (2506), with added reasoning capabilities, undergoing SFT from Magistral Medium traces and RL on top, it's a small, efficient reasoning model with 24B parameters.

Magistral Small can be deployed locally, fitting within a single RTX 4090 or a 32GB RAM MacBook once quantized.

- Multimodality: The model now has a vision encoder and can take multimodal inputs, extending its reasoning capabilities to vision.
- Performance upgrade: Magistral Small 1.2 should give you significantly better performance than Magistral Small 1.1, as seen in the benchmark results.
- Better tone and persona: You should experience better LaTeX and Markdown formatting, and shorter answers on easy general prompts.
- Finite generation: The model is less likely to enter infinite generation loops.
- Special think tokens: [THINK] and [/THINK] special tokens encapsulate the reasoning content in a thinking chunk. This makes it easier to parse the reasoning trace and prevents confusion when the '[THINK]' token is given as a string in the prompt.
- Reasoning prompt: The reasoning prompt is given in the system prompt.
- Reasoning: Capable of long chains of reasoning traces before providing an answer.
- Multilingual: Supports dozens of languages, including English, French, German, Greek, Hindi, Indonesian, Italian, Japanese, Korean, Malay, Nepali, Polish, Portuguese, Romanian, Russian, Serbian, Spanish, Turkish, Ukrainian, Vietnamese, Arabic, Bengali, Chinese, and Farsi.
- Vision: Vision capabilities enable the model to analyze images and reason based on visual content in addition to text.
- Apache 2.0 License: Open license allowing usage and modification for both commercial and non-commercial purposes.
- Context Window: A 128k context window. Performance might degrade past 40k, but Magistral should still give good results. We therefore recommend leaving the maximum model length at 128k and lowering it only if you encounter low performance.

| Model | AIME24 pass@1 | AIME25 pass@1 | GPQA Diamond | Livecodebench (v5) |
|---|---|---|---|---|
| Magistral Medium 1.2 | 91.82% | 83.48% | 76.26% | 75.00% |
| Magistral Medium 1.1 | 72.03% | 60.99% | 71.46% | 59.35% |
| Magistral Medium 1.0 | 73.59% | 64.95% | 70.83% | 59.36% |
| Magistral Small 1.2 | 86.14% | 77.34% | 70.07% | 70.88% |
| Magistral Small 1.1 | 70.52% | 62.03% | 65.78% | 59.17% |
| Magistral Small 1.0 | 70.68% | 62.76% | 68.18% | 55.84% |

Please make sure to use:

- `top_p`: 0.95
- `temperature`: 0.7
- `max_tokens`: 131072

We highly recommend including the following system prompt for the best results; you can edit and customise it if needed for your specific use case.

The `[THINK]` and `[/THINK]` are special tokens that must be encoded as such.

Please make sure to use mistral-common as the source of truth. Find below examples from libraries supporting `mistral-common`.

We invite you to choose, depending on your use case and requirements, between keeping reasoning traces during multi-turn interactions or keeping only the final assistant response.

Make sure you install the latest `Transformers` version.
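To illustrate how the `[THINK]` / `[/THINK]` markers described above make the reasoning trace easy to separate from the final answer, here is a minimal sketch. It assumes the decoded output renders the special tokens as the literal strings `[THINK]` and `[/THINK]`; that depends on the decoding settings, and `mistral-common` remains the source of truth for tokenization. The `split_reasoning` helper and the `SAMPLING` dictionary are illustrative names, not part of any library.

```python
import re
from typing import Optional, Tuple

# Recommended sampling settings quoted in the card (dictionary name is illustrative).
SAMPLING = {"top_p": 0.95, "temperature": 0.7, "max_tokens": 131072}

# Assumption: special tokens appear as these literal strings in the decoded text.
THINK_PATTERN = re.compile(r"\[THINK\](.*?)\[/THINK\]", re.DOTALL)


def split_reasoning(decoded: str) -> Tuple[Optional[str], str]:
    """Return (reasoning, answer) from decoded model output.

    If no thinking chunk is found, reasoning is None and the whole
    text is treated as the answer.
    """
    match = THINK_PATTERN.search(decoded)
    if match is None:
        return None, decoded.strip()
    reasoning = match.group(1).strip()
    # Everything outside the first thinking chunk is kept as the answer.
    answer = (decoded[: match.start()] + decoded[match.end():]).strip()
    return reasoning, answer


reasoning, answer = split_reasoning("[THINK]2 + 2 equals 4.[/THINK]The answer is 4.")
print("reasoning:", reasoning)
print("answer:", answer)
```

In a multi-turn setting, the same split can be used either to keep the reasoning trace in the history or to store only the final answer, which is the choice the card leaves to the user.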

license:apache-2.0
7,412
9

Phi-4-mini-reasoning-GGUF

See our collection for all versions of Phi-4 including GGUF, 4-bit & 16-bit formats.

Learn to run Phi-4 reasoning correctly - Read our Guide. Unsloth Dynamic 2.0 achieves superior accuracy & outperforms other leading quants.

- Fine-tune Phi-4 (14B) for free using our Google Colab notebook here!
- Read our Blog about Phi-4 support with our bug fixes: unsloth.ai/blog/phi4
- View the rest of our notebooks in our docs here.
- Run & export your fine-tuned model to Ollama, llama.cpp or HF.

| Unsloth supports | Free Notebooks | Performance | Memory use |
|---|---|---|---|
| Phi-4 (14B) | ▶️ Start on Colab | 2x faster | 50% less |
| Qwen3 (14B) | ▶️ Start on Colab | 3x faster | 70% less |
| GRPO with Phi-4 (14B) | ▶️ Start on Colab | 3x faster | 80% less |
| Llama-3.2 (3B) | ▶️ Start on Colab | 2x faster | 80% less |
| Llama-3.2 (11B vision) | ▶️ Start on Colab | 2x faster | 60% less |
| Qwen2.5 (7B) | ▶️ Start on Colab | 2x faster | 60% less |

Phi-4-mini-reasoning is a lightweight open model built upon synthetic data with a focus on high-quality, reasoning-dense data, further finetuned for more advanced math reasoning capabilities. The model belongs to the Phi-4 model family and supports 128K token context length.

📰 Phi-4-mini-reasoning Blog, and Developer Article 📖 Phi-4-mini-reasoning Technical Report 👩‍🍳 Phi Cookbook 🏡 Phi Portal 🖥️ Try It Azure

🎉Phi-4 models: [Phi-4-reasoning] | [multimodal-instruct | onnx]; [mini-instruct | onnx]

Phi-4-mini-reasoning is designed for multi-step, logic-intensive mathematical problem-solving tasks under memory/compute constrained environments and latency bound scenarios. Some of the use cases include formal proof generation, symbolic computation, advanced word problems, and a wide range of mathematical reasoning scenarios.
These models excel at maintaining context across steps, applying structured logic, and delivering accurate, reliable solutions in domains that require deep analytical thinking. This model is designed and tested for math reasoning only. It is not specifically designed or evaluated for all downstream purposes. Developers should consider common limitations of language models, as well as performance difference across languages, as they select use cases, and evaluate and mitigate for accuracy, safety, and fairness before using within a specific downstream use case, particularly for high-risk scenarios. Developers should be aware of and adhere to applicable laws or regulations (including but not limited to privacy, trade compliance laws, etc.) that are relevant to their use case. Nothing contained in this Model Card should be interpreted as or deemed a restriction or modification to the license the model is released under. This release of Phi-4-mini-reasoning addresses user feedback and market demand for a compact reasoning model. It is a compact transformer-based language model optimized for mathematical reasoning, built to deliver high-quality, step-by-step problem solving in environments where computing or latency is constrained. The model is fine-tuned with synthetic math data from a more capable model (much larger, smarter, more accurate, and better at following instructions), which has resulted in enhanced reasoning performance. Phi-4-mini-reasoning balances reasoning ability with efficiency, making it potentially suitable for educational applications, embedded tutoring, and lightweight deployment on edge or mobile systems. If a critical issue is identified with Phi-4-mini-reasoning, it should be promptly reported through the MSRC Researcher Portal or [email protected] To understand the capabilities, the 3.8B parameters Phi-4-mini-reasoning model was compared with a set of models over a variety of reasoning benchmarks. 
A high-level overview of the model quality is as follows:

| Model | AIME | MATH-500 | GPQA Diamond |
|---|---|---|---|
| o1-mini | 63.6 | 90.0 | 60.0 |
| DeepSeek-R1-Distill-Qwen-7B | 53.3 | 91.4 | 49.5 |
| DeepSeek-R1-Distill-Llama-8B | 43.3 | 86.9 | 47.3 |
| Bespoke-Stratos-7B | 20.0 | 82.0 | 37.8 |
| OpenThinker-7B | 31.3 | 83.0 | 42.4 |
| Llama-3.2-3B-Instruct | 6.7 | 44.4 | 25.3 |
| Phi-4-Mini (base model, 3.8B) | 10.0 | 71.8 | 36.9 |
| Phi-4-mini-reasoning (3.8B) | 57.5 | 94.6 | 52.0 |

Overall, the model with only 3.8B-param achieves a similar level of multilingual language understanding and reasoning ability as much larger models. However, it is still fundamentally limited by its size for certain tasks. The model simply does not have the capacity to store too much factual knowledge; therefore, users may experience factual incorrectness. However, it may be possible to resolve such weakness by augmenting Phi-4 with a search engine, particularly when using the model under RAG settings.

Phi-4-mini-reasoning supports a vocabulary size of up to `200064` tokens. The tokenizer files already provide placeholder tokens that can be used for downstream fine-tuning, but they can also be extended up to the model's vocabulary size.

Given the nature of the training data, the Phi-4-mini-instruct model is best suited for prompts using specific formats. Below are the two primary formats:

This format is used for general conversation and instructions:

Phi-4-mini-reasoning has been integrated in the `4.51.3` version of `transformers`. The current `transformers` version can be verified with: `pip list | grep transformers`. Python 3.8 and 3.10 will work best.

List of required packages:

Phi-4-mini-reasoning is also available in Azure AI Studio.

After obtaining the Phi-4-mini-instruct model checkpoints, users can use this sample code for inference.
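As a complement to the `pip list | grep transformers` check mentioned above, the same verification can be sketched in Python. This is not the card's own sample code; the helper names are hypothetical, and the version parser is deliberately simple (it keeps only the leading `major.minor.patch` digits, so a pre-release such as `4.51.3rc1` is treated like `4.51.3`).

```python
import re
from importlib import metadata
from typing import Optional, Tuple

MIN_TRANSFORMERS = "4.51.3"  # version named in the card


def version_tuple(version: str) -> Tuple[int, int, int]:
    """Parse the leading major.minor.patch digits of a version string."""
    parts = []
    for piece in version.split(".")[:3]:
        digits = re.match(r"\d+", piece)
        if digits is None:
            break
        parts.append(int(digits.group()))
    while len(parts) < 3:
        parts.append(0)  # pad missing components, e.g. "4.51" -> (4, 51, 0)
    return (parts[0], parts[1], parts[2])


def meets_minimum(installed: str, minimum: str = MIN_TRANSFORMERS) -> bool:
    """True if the installed version is at least the required minimum."""
    return version_tuple(installed) >= version_tuple(minimum)


def installed_transformers_version() -> Optional[str]:
    """Return the installed transformers version, or None if it is absent."""
    try:
        return metadata.version("transformers")
    except metadata.PackageNotFoundError:
        return None


found = installed_transformers_version()
if found is None:
    print("transformers is not installed")
else:
    print(f"transformers {found}: meets minimum = {meets_minimum(found)}")
```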
+ Architecture: Phi-4-mini-reasoning shares the same architecture as Phi-4-Mini, which has 3.8B parameters and is a dense decoder-only Transformer model. When compared with Phi-3.5-Mini, the major changes with Phi-4-Mini are 200K vocabulary, grouped-query attention, and shared input and output embedding.
+ Inputs: Text. It is best suited for prompts using the chat format.
+ Context length: 128K tokens
+ GPUs: 128 H100-80G
+ Training time: 2 days
+ Training data: 150B tokens
+ Outputs: Generated text
+ Dates: Trained in February 2024
+ Status: This is a static model trained on offline datasets with the cutoff date of February 2025 for publicly available data.
+ Supported languages: English
+ Release date: April 2025

The training data for Phi-4-mini-reasoning consists exclusively of synthetic mathematical content generated by a stronger and more advanced reasoning model, Deepseek-R1. The objective is to distill knowledge from this model. This synthetic dataset comprises over one million diverse math problems spanning multiple levels of difficulty (from middle school to Ph.D. level). For each problem in the synthetic dataset, eight distinct solutions (rollouts) were sampled, and only those verified as correct were retained, resulting in approximately 30 billion tokens of math content.
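The rollout-and-verify step described above (sample eight solutions per problem, keep only the verified ones) can be sketched as follows. This is a toy illustration, not the actual data pipeline: the `sample_solution` and `is_correct` callables are hypothetical stand-ins for the teacher model and the answer verifier, and the demo problem is a simple addition.

```python
from typing import Callable, List, Tuple

Problem = Tuple[int, int]  # toy problem: add two integers


def filter_rollouts(
    problem: Problem,
    sample_solution: Callable[[Problem], int],
    is_correct: Callable[[Problem, int], bool],
    num_rollouts: int = 8,
) -> List[int]:
    """Sample `num_rollouts` candidate solutions and keep the verified ones."""
    kept = []
    for _ in range(num_rollouts):
        candidate = sample_solution(problem)
        if is_correct(problem, candidate):
            kept.append(candidate)
    return kept


# Hypothetical teacher: replays a fixed list of answers, some of them wrong.
canned_answers = iter([7, 8, 7, 6, 7, 7, 5, 7])


def toy_teacher(problem: Problem) -> int:
    return next(canned_answers)


def toy_verifier(problem: Problem, candidate: int) -> bool:
    return candidate == problem[0] + problem[1]


retained = filter_rollouts((3, 4), toy_teacher, toy_verifier)
print(f"kept {len(retained)} of 8 rollouts")
```

Only the retained rollouts would contribute tokens to the training set, which is why the final token count depends on how often the teacher's samples pass verification.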
The dataset integrates three primary components: 1) a curated selection of high-quality, publicly available math questions and a part of the SFT (Supervised Fine-Tuning) data that was used to train the base Phi-4-Mini model; 2) an extensive collection of synthetic math data generated by the Deepseek-R1 model, designed specifically for high-quality supervised fine-tuning and model distillation; and 3) a balanced set of correct and incorrect answers used to construct preference data aimed at enhancing Phi-4-mini-reasoning's reasoning capabilities by learning more effective reasoning trajectories.

Hardware

Note that by default, the Phi-4-mini-reasoning model uses flash attention, which requires certain types of GPU hardware to run. We have tested on the following GPU types:

- NVIDIA A100
- NVIDIA H100

If you want to run the model on:

- NVIDIA V100 or earlier generation GPUs: call `AutoModelForCausalLM.from_pretrained()` with `attn_implementation="eager"`

The Phi-4 family of models has adopted a robust safety post-training approach. This approach leverages a variety of both open-source and in-house generated datasets. The overall technique employed to do the safety alignment is a combination of SFT, DPO (Direct Preference Optimization), and RLHF (Reinforcement Learning from Human Feedback) approaches by utilizing human-labeled and synthetic English-language datasets, including publicly available datasets focusing on helpfulness and harmlessness, as well as various questions and answers targeted to multiple safety categories.

Phi-4-Mini-Reasoning was developed in accordance with Microsoft's responsible AI principles. Potential safety risks in the model’s responses were assessed using the Azure AI Foundry’s Risk and Safety Evaluation framework, focusing on harmful content, direct jailbreak, and model groundedness.
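Returning to the hardware note above, the choice between flash attention and the `eager` fallback can be sketched as a small helper. This is an assumption-laden illustration rather than the card's sample code: it assumes a CUDA compute capability of 8.0 or higher (A100 is 8.0, H100 is 9.0, V100 is 7.0) as the cut-off for flash attention, that the `flash-attn` package is installed when that path is taken, and that the caller supplies the model id. `pick_attn_implementation` and `load_model` are hypothetical names.

```python
from typing import Tuple


def pick_attn_implementation(compute_capability: Tuple[int, int]) -> str:
    """Choose an attention backend from a CUDA compute capability.

    Assumption: GPUs with major version >= 8 (A100, H100) can use
    flash attention; older GPUs such as V100 fall back to "eager".
    """
    major, _minor = compute_capability
    return "flash_attention_2" if major >= 8 else "eager"


def load_model(model_id: str):
    """Load a causal LM with the backend chosen for the current GPU.

    torch and transformers are imported lazily so that the selection
    helper above can be used without them installed.
    """
    import torch
    from transformers import AutoModelForCausalLM

    capability = torch.cuda.get_device_capability()
    return AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype="auto",
        attn_implementation=pick_attn_implementation(capability),
    )


# Selection only; no GPU or model download is needed for this part.
for name, capability in [("V100", (7, 0)), ("A100", (8, 0)), ("H100", (9, 0))]:
    print(name, "->", pick_attn_implementation(capability))
```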
The Phi-4-Mini-Reasoning Model Card contains additional information about our approach to safety and responsible AI considerations that developers should be aware of when using this model. Like other language models, the Phi family of models can potentially behave in ways that are unfair, unreliable, or offensive. Some of the limiting behaviors to be aware of include: + Quality of Service: The Phi models are trained primarily on English text and some additional multilingual text. Languages other than English will experience worse performance as well as performance disparities across non-English. English language varieties with less representation in the training data might experience worse performance than standard American English. + Multilingual performance and safety gaps: We believe it is important to make language models more widely available across different languages, but the Phi 4 models still exhibit challenges common across multilingual releases. As with any deployment of LLMs, developers will be better positioned to test for performance or safety gaps for their linguistic and cultural context and customize the model with additional fine-tuning and appropriate safeguards. + Representation of Harms & Perpetuation of Stereotypes: These models can over- or under-represent groups of people, erase representation of some groups, or reinforce demeaning or negative stereotypes. Despite safety post-training, these limitations may still be present due to differing levels of representation of different groups, cultural contexts, or prevalence of examples of negative stereotypes in training data that reflect real-world patterns and societal biases. + Inappropriate or Offensive Content: These models may produce other types of inappropriate or offensive content, which may make it inappropriate to deploy for sensitive contexts without additional mitigations that are specific to the case. 
+ Information Reliability: Language models can generate nonsensical content or fabricate content that might sound reasonable but is inaccurate or outdated. + Election Information Reliability : The model has an elevated defect rate when responding to election-critical queries, which may result in incorrect or unauthoritative election critical information being presented. We are working to improve the model's performance in this area. Users should verify information related to elections with the election authority in their region. + Limited Scope for Code: The majority of Phi 4 training data is based in Python and uses common packages such as "typing, math, random, collections, datetime, itertools". If the model generates Python scripts that utilize other packages or scripts in other languages, it is strongly recommended that users manually verify all API uses. + Long Conversation: Phi 4 models, like other models, can in some cases generate responses that are repetitive, unhelpful, or inconsistent in very long chat sessions in both English and non-English languages. Developers are encouraged to place appropriate mitigations, like limiting conversation turns to account for the possible conversational drift. Developers should apply responsible AI best practices, including mapping, measuring, and mitigating risks associated with their specific use case and cultural, linguistic context. Phi 4 family of models are general purpose models. As developers plan to deploy these models for specific use cases, they are encouraged to fine-tune the models for their use case and leverage the models as part of broader AI systems with language-specific safeguards in place. Important areas for consideration include: + Allocation: Models may not be suitable for scenarios that could have consequential impact on legal status or the allocation of resources or life opportunities (ex: housing, employment, credit, etc.) without further assessments and additional debiasing techniques. 
+ High-Risk Scenarios: Developers should assess the suitability of using models in high-risk scenarios where unfair, unreliable or offensive outputs might be extremely costly or lead to harm. This includes providing advice in sensitive or expert domains where accuracy and reliability are critical (ex: legal or health advice). Additional safeguards should be implemented at the application level according to the deployment context. + Misinformation: Models may produce inaccurate information. Developers should follow transparency best practices and inform end-users they are interacting with an AI system. At the application level, developers can build feedback mechanisms and pipelines to ground responses in use-case specific, contextual information, a technique known as Retrieval Augmented Generation (RAG). + Generation of Harmful Content: Developers should assess outputs for their context and use available safety classifiers or custom solutions appropriate for their use case. + Misuse: Other forms of misuse such as fraud, spam, or malware production may be possible, and developers should ensure that their applications do not violate applicable laws and regulations. License The model is licensed under the MIT license. Trademarks This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft’s Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party’s policies. We include a brief word on methodology here - and in particular, how we think about optimizing prompts. In an ideal world, we would never change any prompts in our benchmarks to ensure it is always an apples-to-apples comparison when comparing different models. 
Indeed, this is our default approach, and is the case in the vast majority of models we have run to date. For all benchmarks, we consider using the same generation configuration such as max sequence length (32768), the same temperature for the fair comparison. Benchmark datasets We evaluate the model with three of the most popular math benchmarks where the strongest reasoning models are competing together. Specifically: - Math-500: This benchmark consists of 500 challenging math problems designed to test the model's ability to perform complex mathematical reasoning and problem-solving. - AIME 2024: The American Invitational Mathematics Examination (AIME) is a highly regarded math competition that features a series of difficult problems aimed at assessing advanced mathematical skills and logical reasoning. - GPQA Diamond: The Graduate-Level Google-Proof Q&A (GPQA) Diamond benchmark focuses on evaluating the model's ability to understand and solve a wide range of mathematical questions, including both straightforward calculations and more intricate problem-solving tasks.

license:mit
7,402
52

orpheus-3b-0.1-ft-unsloth-bnb-4bit

NaNK
llama
7,386
14

Qwen3-VL-2B-Instruct-unsloth-bnb-4bit

NaNK
license:apache-2.0
7,358
2

medgemma-4b-it-GGUF

Unsloth Dynamic 2.0 achieves superior accuracy & outperforms other leading quants. Model on Google Cloud Model Garden: MedGemma Model on Hugging Face: MedGemma GitHub repository (supporting code, Colab notebooks, discussions, and issues): MedGemma Quick start notebook: GitHub Fine-tuning notebook: GitHub Concept applications built using MedGemma: Collection Support: See Contact License: The use of MedGemma is governed by the Health AI Developer Foundations terms of use. This section describes the MedGemma model and how to use it. MedGemma is a collection of Gemma 3 variants that are trained for performance on medical text and image comprehension. Developers can use MedGemma to accelerate building healthcare-based AI applications. MedGemma currently comes in three variants: a 4B multimodal version and 27B text-only and multimodal versions. Both MedGemma multimodal versions utilize a SigLIP image encoder that has been specifically pre-trained on a variety of de-identified medical data, including chest X-rays, dermatology images, ophthalmology images, and histopathology slides. Their LLM components are trained on a diverse set of medical data, including medical text, medical question-answer pairs, FHIR-based electronic health record data (27B multimodal only), radiology images, histopathology patches, ophthalmology images, and dermatology images. MedGemma 4B is available in both pre-trained (suffix: `-pt`) and instruction-tuned (suffix `-it`) versions. The instruction-tuned version is a better starting point for most applications. The pre-trained version is available for those who want to experiment more deeply with the models. MedGemma 27B multimodal has pre-training on medical image, medical record and medical record comprehension tasks. MedGemma 27B text-only has been trained exclusively on medical text. Both models have been optimized for inference-time computation on medical reasoning. 
This means it has slightly higher performance on some text benchmarks than MedGemma 27B multimodal. Users who want to work with a single model for both medical text, medical record and medical image tasks are better suited for MedGemma 27B multimodal. Those that only need text use-cases may be better served with the text-only variant. Both MedGemma 27B variants are only available in instruction-tuned versions. MedGemma variants have been evaluated on a range of clinically relevant benchmarks to illustrate their baseline performance. These evaluations are based on both open benchmark datasets and curated datasets. Developers can fine-tune MedGemma variants for improved performance. Consult the Intended Use section below for more details. MedGemma is optimized for medical applications that involve a text generation component. For medical image-based applications that do not involve text generation, such as data-efficient classification, zero-shot classification, or content-based or semantic image retrieval, the MedSigLIP image encoder is recommended. MedSigLIP is based on the same image encoder that powers MedGemma. Please consult the MedGemma Technical Report for more details. Below are some example code snippets to help you quickly get started running the model locally on GPU. If you want to use the model at scale, we recommend that you create a production version using Model Garden. First, install the Transformers library. Gemma 3 is supported starting from transformers 4.50.0. See the following Colab notebooks for examples of how to use MedGemma: To give the model a quick try, running it locally with weights from Hugging Face, see Quick start notebook in Colab. Note that you will need to use Colab Enterprise to obtain adequate GPU resources to run either 27B model without quantization. For an example of fine-tuning the 4B model, see the Fine-tuning notebook in Colab. 
The 27B models can be fine-tuned in a similar manner but will require more time and compute resources than the 4B model.

The MedGemma model is built based on Gemma 3 and uses the same decoder-only transformer architecture as Gemma 3. To read more about the architecture, consult the Gemma 3 model card.

- Model type: Decoder-only Transformer architecture, see the Gemma 3 Technical Report
- Input Modalities: Text, vision
- Output Modality: Text only
- Attention mechanism: Grouped-query attention (GQA)
- Context length: Supports long context, at least 128K tokens
- Key publication: https://arxiv.org/abs/2507.05201
- Model created: July 9, 2025

When using this model, please cite: Sellergren et al. "MedGemma Technical Report." arXiv preprint arXiv:2507.05201 (2025).

Inputs:

- Text string, such as a question or prompt
- Images, normalized to 896 x 896 resolution and encoded to 256 tokens each
- Total input length of 128K tokens

Outputs:

- Generated text in response to the input, such as an answer to a question, analysis of image content, or a summary of a document
- Total output length of 8192 tokens

MedGemma was evaluated across a range of different multimodal classification, report generation, visual question answering, and text-based tasks.

The multimodal performance of MedGemma 4B and 27B multimodal was evaluated across a range of benchmarks, focusing on radiology, dermatology, histopathology, ophthalmology, and multimodal clinical reasoning.

MedGemma 4B outperforms the base Gemma 3 4B model across all tested multimodal health benchmarks.
| Task and metric | Gemma 3 4B | MedGemma 4B |
| :---- | :---- | :---- |
| Medical image classification | | |
| MIMIC CXR\*\* - macro F1 for top 5 conditions | 81.2 | 88.9 |
| CheXpert CXR - macro F1 for top 5 conditions | 32.6 | 48.1 |
| CXR14 - macro F1 for 3 conditions | 32.0 | 50.1 |
| PathMCQA\* (histopathology, internal\*\*) - Accuracy | 37.1 | 69.8 |
| US-DermMCQA\* - Accuracy | 52.5 | 71.8 |
| EyePACS\* (fundus, internal) - Accuracy | 14.4 | 64.9 |
| Visual question answering | | |
| SLAKE (radiology) - Tokenized F1 | 40.2 | 72.3 |
| VQA-RAD\*\*\* (radiology) - Tokenized F1 | 33.6 | 49.9 |
| Knowledge and reasoning | | |
| MedXpertQA (text + multimodal questions) - Accuracy | 16.4 | 18.8 |

\* Internal datasets. US-DermMCQA is described in Liu (2020, Nature Medicine), presented as a 4-way MCQ per example for skin condition classification. PathMCQA is based on multiple datasets, presented as 3-9 way MCQ per example for identification, grading, and subtype for breast, cervical, and prostate cancer. EyePACS is a dataset of fundus images with classification labels based on 5-level diabetic retinopathy severity (None, Mild, Moderate, Severe, Proliferative). More details in the MedGemma Technical Report.

\*\* Based on radiologist adjudicated labels, described in Yang (2024, arXiv) Section A.1.1.

\*\*\* Based on "balanced split," described in Yang (2024, arXiv).

MedGemma chest X-ray (CXR) report generation performance was evaluated on MIMIC-CXR using the RadGraph F1 metric. We compare the MedGemma pre-trained checkpoint with our previous best model for CXR report generation, PaliGemma 2.
| Metric | MedGemma 4B (pre-trained) | MedGemma 4B (tuned for CXR) | PaliGemma 2 3B (tuned for CXR) | PaliGemma 2 10B (tuned for CXR) |
| :---- | :---- | :---- | :---- | :---- |
| MIMIC CXR - RadGraph F1 | 29.5 | 30.3 | 28.8 | 29.5 |

The instruction-tuned versions of MedGemma 4B and MedGemma 27B achieve lower scores (21.9 and 21.3, respectively) due to the differences in reporting style compared to the MIMIC ground truth reports. Further fine-tuning on MIMIC reports enables users to achieve improved performance, as shown by the improved performance of the MedGemma 4B model that was tuned for CXR.

MedGemma 4B and text-only MedGemma 27B were evaluated across a range of text-only benchmarks for medical knowledge and reasoning.

The MedGemma models outperform their respective base Gemma models across all tested text-only health benchmarks.

| Metric | Gemma 3 4B | MedGemma 4B |
| :---- | :---- | :---- |
| MedQA (4-op) | 50.7 | 64.4 |
| MedMCQA | 45.4 | 55.7 |
| PubMedQA | 68.4 | 73.4 |
| MMLU Med | 67.2 | 70.0 |
| MedXpertQA (text only) | 11.6 | 14.2 |
| AfriMed-QA (25 question test set) | 48.0 | 52.0 |

For all MedGemma 27B results, test-time scaling is used to improve performance.

All models were evaluated on a question-answer dataset from synthetic FHIR data to answer questions about patient records. MedGemma 27B multimodal's FHIR-specific training gives it significant improvement over other MedGemma and Gemma models.

| Metric | Gemma 3 4B | MedGemma 4B |
| :---- | :---- | :---- |
| EHRQA | 70.9 | 67.6 |

Our evaluation methods include structured evaluations and internal red-teaming testing of relevant content policies. Red-teaming was conducted by a number of different teams, each with different goals and human evaluation metrics.
These models were evaluated against a number of different categories relevant to ethics and safety, including: Child safety: Evaluation of text-to-text and image-to-text prompts covering child safety policies, including child sexual abuse and exploitation. Content safety: Evaluation of text-to-text and image-to-text prompts covering safety policies, including harassment, violence and gore, and hate speech. Representational harms: Evaluation of text-to-text and image-to-text prompts covering safety policies, including bias, stereotyping, and harmful associations or inaccuracies. General medical harms: Evaluation of text-to-text and image-to-text prompts covering safety policies, including information quality and harmful associations or inaccuracies. In addition to development level evaluations, we conduct "assurance evaluations" which are our "arms-length" internal evaluations for responsibility governance decision making. They are conducted separately from the model development team, to inform decision making about release. High-level findings are fed back to the model team, but prompt sets are held out to prevent overfitting and preserve the results' ability to inform decision making. Notable assurance evaluation results are reported to our Responsibility & Safety Council as part of release review. For all areas of safety testing, we saw safe levels of performance across the categories of child safety, content safety, and representational harms. All testing was conducted without safety filters to evaluate the model capabilities and behaviors. For text-to-text, image-to-text, and audio-to-text, and across both MedGemma model sizes, the model produced minimal policy violations. A limitation of our evaluations was that they included primarily English language prompts. The base Gemma models are pre-trained on a large corpus of text and code data. 
MedGemma 4B utilizes a SigLIP image encoder that has been specifically pre-trained on a variety of de-identified medical data, including radiology images, histopathology images, ophthalmology images, and dermatology images. Its LLM component is trained on a diverse set of medical data, including medical text relevant to radiology images, chest X-rays, histopathology patches, ophthalmology images and dermatology images. MedGemma models have been evaluated on a comprehensive set of clinically relevant benchmarks, including over 22 datasets across 5 different tasks and 6 medical image modalities. These include both open benchmark datasets and curated datasets, with a focus on expert human evaluations for tasks like CXR report generation and radiology VQA.
The base Gemma models are pre-trained on a large corpus of text and code data. MedGemma multimodal variants utilize a SigLIP image encoder that has been specifically pre-trained on a variety of de-identified medical data, including radiology images, histopathology images, ophthalmology images, and dermatology images. Their LLM component is trained on a diverse set of medical data, including medical text, medical question-answer pairs, FHIR-based electronic health record data (27B multimodal only), radiology images, histopathology patches, ophthalmology images, and dermatology images. MedGemma models have been evaluated on a comprehensive set of clinically relevant benchmarks, including over 22 datasets across 6 different tasks and 4 medical image modalities. These benchmarks include both open and internal datasets. MedGemma utilizes a combination of public and private datasets.
This model was trained on diverse public datasets including MIMIC-CXR (chest X-rays and reports), ChestImaGenome: Set of bounding boxes linking image findings with anatomical regions for MIMIC-CXR (MedGemma 27B multimodal only), SLAKE (multimodal medical images and questions), PAD-UFES-20 (skin lesion images and data), SCIN (dermatology images), TCGA (cancer genomics data), CAMELYON (lymph node histopathology images), PMC-OA (biomedical literature with images), and Mendeley Digital Knee X-Ray (knee X-rays). Additionally, multiple diverse proprietary datasets were licensed and incorporated (described next). MIMIC-CXR: MIT Laboratory for Computational Physiology and Beth Israel Deaconess Medical Center (BIDMC). Slake-VQA: The Hong Kong Polytechnic University (PolyU), with collaborators including West China Hospital of Sichuan University and Sichuan Academy of Medical Sciences / Sichuan Provincial People's Hospital. PAD-UFES-20: Federal University of Espírito Santo (UFES), Brazil, through its Dermatological and Surgical Assistance Program (PAD). SCIN: A collaboration between Google Health and Stanford Medicine. TCGA (The Cancer Genome Atlas): A joint effort of National Cancer Institute and National Human Genome Research Institute. Data from TCGA are available via the Genomic Data Commons (GDC) CAMELYON: The data was collected from Radboud University Medical Center and University Medical Center Utrecht in the Netherlands. PMC-OA (PubMed Central Open Access Subset): Maintained by the National Library of Medicine (NLM) and National Center for Biotechnology Information (NCBI), which are part of the NIH. MedQA: This dataset was created by a team of researchers led by Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits Mendeley Digital Knee X-Ray: This dataset is from Rani Channamma University, and is hosted on Mendeley Data. 
AfriMed-QA: This dataset was developed by multiple collaborating organizations and researchers; key contributors include Intron Health, SisonkeBiotik, BioRAMP, Georgia Institute of Technology, and MasakhaneNLP. VQA-RAD: This dataset was created by a research team led by Jason J. Lau, Soumya Gayen, Asma Ben Abacha, and Dina Demner-Fushman and their affiliated institutions (the US National Library of Medicine and National Institutes of Health). Chest ImaGenome: IBM Research. MedExpQA: This dataset was created by researchers at the HiTZ Center (Basque Center for Language Technology and Artificial Intelligence). MedXpertQA: This dataset was developed by researchers at Tsinghua University (Beijing, China) and Shanghai Artificial Intelligence Laboratory (Shanghai, China). HealthSearchQA: This dataset consists of 3,173 commonly searched consumer questions. In addition to the public datasets listed above, MedGemma was also trained on de-identified, licensed datasets or datasets collected internally at Google from consented participants. Radiology dataset 1: De-identified dataset of different CT studies across body parts from a US-based radiology outpatient diagnostic center network. Ophthalmology dataset 1 (EyePACS): De-identified dataset of fundus images from diabetic retinopathy screening. Dermatology dataset 1: De-identified dataset of teledermatology skin condition images (both clinical and dermatoscopic) from Colombia. Dermatology dataset 2: De-identified dataset of skin cancer images (both clinical and dermatoscopic) from Australia. Dermatology dataset 3: De-identified dataset of non-diseased skin images from an internal data collection effort. Pathology dataset 1: De-identified dataset of histopathology H&E whole slide images created in collaboration with an academic research hospital and biobank in Europe. Comprises de-identified colon, prostate, and lymph nodes.
Pathology dataset 2: De-identified dataset of lung histopathology H&E and IHC whole slide images created by a commercial biobank in the United States. Pathology dataset 3: De-identified dataset of prostate and lymph node H&E and IHC histopathology whole slide images created by a contract research organization in the United States. Pathology dataset 4: De-identified dataset of histopathology whole slide images created in collaboration with a large, tertiary teaching hospital in the United States. Comprises a diverse set of tissue and stain types, predominantly H&E. EHR dataset 1: Question/answer dataset drawn from synthetic FHIR records created by Synthea. The test set includes 19 unique patients with 200 questions per patient divided into 10 different categories. MIMIC-CXR: Johnson, A., Pollard, T., Mark, R., Berkowitz, S., & Horng, S. (2024). MIMIC-CXR Database (version 2.1.0). PhysioNet. https://physionet.org/content/mimic-cxr/2.1.0/ and Johnson, Alistair E. W., Tom J. Pollard, Seth J. Berkowitz, Nathaniel R. Greenbaum, Matthew P. Lungren, Chih-Ying Deng, Roger G. Mark, and Steven Horng. 2019. "MIMIC-CXR, a de-Identified Publicly Available Database of Chest Radiographs with Free-Text Reports." Scientific Data 6 (1): 1–8. SLAKE: Liu, Bo, Li-Ming Zhan, Li Xu, Lin Ma, Yan Yang, and Xiao-Ming Wu. 2021. "SLAKE: A Semantically-Labeled Knowledge-Enhanced Dataset for Medical Visual Question Answering." http://arxiv.org/abs/2102.09542. PAD-UFES-20: Pacheco, Andre GC, et al. "PAD-UFES-20: A skin lesion dataset composed of patient data and clinical images collected from smartphones." Data in brief 32 (2020): 106221. SCIN: Ward, Abbi, Jimmy Li, Julie Wang, Sriram Lakshminarasimhan, Ashley Carrick, Bilson Campana, Jay Hartford, et al. 2024. "Creating an Empirical Dermatology Dataset Through Crowdsourcing With Web Search Advertisements." JAMA Network Open 7 (11): e2446615–e2446615.
TCGA: The results shown here are in whole or part based upon data generated by the TCGA Research Network: https://www.cancer.gov/tcga. CAMELYON16: Ehteshami Bejnordi, Babak, Mitko Veta, Paul Johannes van Diest, Bram van Ginneken, Nico Karssemeijer, Geert Litjens, Jeroen A. W. M. van der Laak, et al. 2017. "Diagnostic Assessment of Deep Learning Algorithms for Detection of Lymph Node Metastases in Women With Breast Cancer." JAMA 318 (22): 2199–2210. Mendeley Digital Knee X-Ray: Gornale, Shivanand; Patravali, Pooja (2020), "Digital Knee X-ray Images", Mendeley Data, V1, doi: 10.17632/t9ndx37v5h.1 VQA-RAD: Lau, Jason J., Soumya Gayen, Asma Ben Abacha, and Dina Demner-Fushman. 2018. "A Dataset of Clinically Generated Visual Questions and Answers about Radiology Images." Scientific Data 5 (1): 1–10. Chest ImaGenome: Wu, J., Agu, N., Lourentzou, I., Sharma, A., Paguio, J., Yao, J. S., Dee, E. C., Mitchell, W., Kashyap, S., Giovannini, A., Celi, L. A., Syeda-Mahmood, T., & Moradi, M. (2021). Chest ImaGenome Dataset (version 1.0.0). PhysioNet. RRID:SCR_007345. https://doi.org/10.13026/wv01-y230 MedQA: Jin, Di, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits. 2020. "What Disease Does This Patient Have? A Large-Scale Open Domain Question Answering Dataset from Medical Exams." http://arxiv.org/abs/2009.13081. AfriMed-QA: Olatunji, Tobi, Charles Nimo, Abraham Owodunni, Tassallah Abdullahi, Emmanuel Ayodele, Mardhiyah Sanni, Chinemelu Aka, et al. 2024. "AfriMed-QA: A Pan-African, Multi-Specialty, Medical Question-Answering Benchmark Dataset." http://arxiv.org/abs/2411.15640. MedExpQA: Alonso, I., Oronoz, M., & Agerri, R. (2024). MedExpQA: Multilingual Benchmarking of Large Language Models for Medical Question Answering. arXiv preprint arXiv:2404.05590. Retrieved from https://arxiv.org/abs/2404.05590 MedXpertQA: Zuo, Yuxin, Shang Qu, Yifei Li, Zhangren Chen, Xuekai Zhu, Ermo Hua, Kaiyan Zhang, Ning Ding, and Bowen Zhou. 2025.
"MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding." http://arxiv.org/abs/2501.18362. Google and its partners utilize datasets that have been rigorously anonymized or de-identified to ensure the protection of individual research participants and patient privacy. JAX allows researchers to take advantage of the latest generation of hardware, including TPUs, for faster and more efficient training of large models. MedGemma is an open multimodal generative AI model intended to be used as a starting point that enables more efficient development of downstream healthcare applications involving medical text and images. MedGemma is intended for developers in the life sciences and healthcare space. Developers are responsible for training, adapting and making meaningful changes to MedGemma to accomplish their specific intended use. MedGemma models can be fine-tuned by developers using their own proprietary data for their specific tasks or solutions. MedGemma is based on Gemma 3 and has been further trained on medical images and text. MedGemma enables further development in any medical context (image and textual), however the model was pre-trained using chest X-ray, pathology, dermatology, and fundus images. Examples of tasks within MedGemma's training include visual question answering pertaining to medical images, such as radiographs, or providing answers to textual medical questions. Full details of all the tasks MedGemma has been evaluated can be found in the MedGemma Technical Report. Provides strong baseline medical image and text comprehension for models of its size. This strong performance makes it efficient to adapt for downstream healthcare-based use cases, compared to models of similar size without medical data pre-training. This adaptation may involve prompt engineering, grounding, agentic orchestration or fine-tuning depending on the use case, baseline validation requirements, and desired performance characteristics. 
MedGemma is not intended to be used without appropriate validation, adaptation and/or making meaningful modification by developers for their specific use case. The outputs generated by MedGemma are not intended to directly inform clinical diagnosis, patient management decisions, treatment recommendations, or any other direct clinical practice applications. Performance benchmarks highlight baseline capabilities on relevant benchmarks, but even for image and text domains that constitute a substantial portion of training data, inaccurate model output is possible. All outputs from MedGemma should be considered preliminary and require independent verification, clinical correlation, and further investigation through established research and development methodologies. MedGemma's multimodal capabilities have been primarily evaluated on single-image tasks. MedGemma has not been evaluated in use cases that involve comprehension of multiple images. MedGemma has not been evaluated or optimized for multi-turn applications. MedGemma's training may make it more sensitive to the specific prompt used than Gemma 3. When adapting MedGemma, developers should consider the following: Bias in validation data: As with any research, developers should ensure that any downstream application is validated to understand performance using data that is appropriately representative of the intended use setting for the specific application (e.g., age, sex, gender, condition, imaging device, etc.). Data contamination concerns: When evaluating the generalization capabilities of a large model like MedGemma in a medical context, there is a risk of data contamination, where the model might have inadvertently seen related medical information during its pre-training, potentially overestimating its true ability to generalize to novel medical concepts. Developers should validate MedGemma on datasets not publicly available or otherwise made available to non-institutional researchers to mitigate this risk.
- May 20, 2025: Initial release.
- July 9, 2025: Bug fix. Fixed the subtle degradation in the multimodal performance. The issue was due to a missing end-of-image token in the model vocabulary, impacting combined text-and-image tasks. This fix reinstates and correctly maps that token, ensuring text-only tasks remain unaffected while restoring multimodal performance.

NaNK
7,247
46

Mistral-Small-Instruct-2409-bnb-4bit

NaNK
7,224
4

Phi-4-reasoning-plus-GGUF

license:mit
7,173
74

gemma-2-2b-bnb-4bit

Finetune Gemma, Llama 3, Mistral 2-5x faster with 70% less memory via Unsloth!

We have a Google Colab Tesla T4 notebook for Gemma 2 (9B) here: https://colab.research.google.com/drive/1vIrqH5uYDQwsJ4-OO3DErvuv4pBgVwk4?usp=sharing

All notebooks are beginner friendly! Add your dataset, click "Run All", and you'll get a 2x faster finetuned model which can be exported to GGUF, vLLM or uploaded to Hugging Face.

| Unsloth supports | Free Notebooks | Performance | Memory use |
|-----------------|-----------------|-------------|----------|
| Llama-3.2 (3B) | ▶️ Start on Colab | 2.4x faster | 58% less |
| Llama-3.2 (11B vision) | ▶️ Start on Colab | 2x faster | 60% less |
| Llama-3.1 (8B) | ▶️ Start on Colab | 2.4x faster | 58% less |
| Qwen2 VL (7B) | ▶️ Start on Colab | 1.8x faster | 60% less |
| Qwen2.5 (7B) | ▶️ Start on Colab | 2x faster | 60% less |
| Phi-3.5 (mini) | ▶️ Start on Colab | 2x faster | 50% less |
| Gemma 2 (9B) | ▶️ Start on Colab | 2.4x faster | 58% less |
| Mistral (7B) | ▶️ Start on Colab | 2.2x faster | 62% less |
| DPO - Zephyr | ▶️ Start on Colab | 1.9x faster | 19% less |

- This conversational notebook is useful for ShareGPT ChatML / Vicuna templates.
- This text completion notebook is for raw text. This DPO notebook replicates Zephyr.
- Kaggle has 2x T4s, but we use 1. Due to overhead, 1x T4 is 5x faster.

NaNK
7,004
11

gemma-3n-E4B-it

NaNK
6,979
10

Meta-Llama-3.1-70B-Instruct-bnb-4bit

NaNK
llama
6,952
31

Llama-3.1-8B-bnb-4bit

NaNK
llama
6,927
0

Seed-OSS-36B-Instruct-GGUF

You can get to know us better through the following channels👇 > [!NOTE] > This model card is dedicated to the `Seed-OSS-36B-Instruct` model. News - [2025/08/20]🔥We release `Seed-OSS-36B-Base` (both with and without synthetic data versions) and `Seed-OSS-36B-Instruct`. Introduction Seed-OSS is a series of open-source large language models developed by ByteDance's Seed Team, designed for powerful long-context, reasoning, agent and general capabilities, and versatile developer-friendly features. Although trained with only 12T tokens, Seed-OSS achieves excellent performance on several popular open benchmarks. We release this series of models to the open-source community under the Apache-2.0 license. > [!NOTE] > Seed-OSS is primarily optimized for international (i18n) use cases. Key Features - Flexible Control of Thinking Budget: Allowing users to flexibly adjust the reasoning length as needed. This capability of dynamically controlling the reasoning length enhances inference efficiency in practical application scenarios. - Enhanced Reasoning Capability: Specifically optimized for reasoning tasks while maintaining balanced and excellent general capabilities. - Agentic Intelligence: Performs exceptionally well in agentic tasks such as tool-using and issue resolving. - Research-Friendly: Given that the inclusion of synthetic instruction data in pre-training may affect the post-training research, we released pre-trained models both with and without instruction data, providing the research community with more diverse options. - Native Long Context: Trained with up-to-512K long context natively. Seed-OSS adopts the popular causal language model architecture with RoPE, GQA attention, RMSNorm and SwiGLU activation. 
| | Seed-OSS-36B |
|:---:|:---:|
| Parameters | 36B |
| Attention | GQA |
| Activation Function | SwiGLU |
| Number of Layers | 64 |
| Number of QKV Heads | 80 / 8 / 8 |
| Head Size | 128 |
| Hidden Size | 5120 |
| Vocabulary Size | 155K |
| Context Length | 512K |
| RoPE Base Frequency | 1e7 |

Incorporating synthetic instruction data into pretraining leads to improved performance on most benchmarks. We adopt the version augmented with synthetic instruction data (i.e., w/ syn.) as `Seed-OSS-36B-Base`. We also release `Seed-OSS-36B-Base-woSyn` trained without such data (i.e., w/o syn.), offering the community a high-performance foundation model unaffected by synthetic instruction data.

| Benchmark | Seed1.6-Base | Qwen3-30B-A3B-Base-2507 | Qwen2.5-32B-Base | Seed-OSS-36B-Base (w/ syn.) | Seed-OSS-36B-Base-woSyn (w/o syn.) |
|:---:|:---:|:---:|:---:|:---:|:---:|

- "" indicates that the results in this column are presented in the format of "reproduced_results (reported_results_if_any)".

| Benchmark | Seed1.6-Thinking-0715 | OAI-OSS-20B | Qwen3-30B-A3B-Thinking-2507 | Qwen3-32B | Gemma3-27B | Seed-OSS-36B-Instruct |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| GPQA-D | 80.7 | 72.2 (71.5) | 71.4 (73.4) | 66.7 (68.4) | 42.4 | 71.4 |
| LiveCodeBench v6 (02/2025-05/2025) | 66.8 | 63.8 | 60.3 (66) | 53.4 | - | 67.4 |
| SWE-Bench Verified (OpenHands) | 41.8 | (60.7) | 31 | 23.4 | - | 56 |
| SWE-Bench Verified (AgentLess 410) | 48.4 | - | 33.5 | 39.7 | - | 47 |

- Bold denotes open-source SOTA. Underlined indicates the second place in the open-source model.
- "" indicates that the results in this column are presented in the format of "reproduced_results (reported_results_if_any)". Some results have been omitted due to the failure of the evaluation run.
- The results of Gemma3-27B are sourced directly from its technical report.
- Generation configs for Seed-OSS-36B-Instruct: temperature=1.1, top_p=0.95. Specifically, for Taubench, temperature=1, top_p=0.7.

> [!NOTE]
> We recommend sampling with `temperature=1.1` and `top_p=0.95`.

Users can flexibly specify the model's thinking budget.
The figure below shows the performance curves across different tasks as the thinking budget varies. For simpler tasks (such as IFEval), the model's chain of thought (CoT) is shorter, and the score exhibits fluctuations as the thinking budget increases. For more challenging tasks (such as AIME and LiveCodeBench), the model's CoT is longer, and the score improves with an increase in the thinking budget.

Here is an example with a thinking budget set to 512: during the reasoning process, the model periodically triggers self-reflection to estimate the consumed and remaining budget, and delivers the final response once the budget is exhausted or the reasoning concludes.

If no thinking budget is set (default mode), Seed-OSS will initiate thinking with unlimited length. If a thinking budget is specified, users are advised to prioritize values that are integer multiples of 512 (e.g., 512, 1K, 2K, 4K, 8K, or 16K), as the model has been extensively trained on these intervals. Models are instructed to output a direct response when the thinking budget is 0, and we recommend setting any budget below 512 to this value.

Download Seed-OSS checkpoint to `./Seed-OSS-36B-Instruct`

Transformers

The `generate.py` script provides a simple interface for model inference with configurable options.

Key Parameters

| Parameter | Description |
|-----------|-------------|
| `--model_path` | Path to the pretrained model directory (required) |
| `--prompts` | Input prompts (default: sample cooking/code questions) |
| `--max_new_tokens` | Maximum tokens to generate (default: 4096) |
| `--attn_implementation` | Attention mechanism: `flash_attention_2` (default) or `eager` |
| `--load_in_4bit/8bit` | Enable 4-bit/8-bit quantization (reduces memory usage) |
| `--thinking_budget` | Thinking budget in tokens (default: -1 for unlimited budget) |

- First install vLLM with Seed-OSS support version:

License

This project is licensed under Apache-2.0. See the LICENSE file for details.
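The thinking-budget guidance in this card (unlimited by default, 0 for a direct answer, multiples of 512 otherwise) can be captured in a small helper. This is a minimal sketch, not part of the Seed-OSS code: the function name `normalize_thinking_budget` is hypothetical, and rounding down to the nearest multiple of 512 is an assumption, since the card only says to prefer such multiples.

```python
def normalize_thinking_budget(requested: int, step: int = 512) -> int:
    """Map a requested budget onto the values the model card recommends.

    Hypothetical helper: the rounding-down policy is an assumption,
    not something the Seed-OSS card specifies.
    """
    if requested < 0:
        # Negative values mean "unlimited"; the card lists -1 as the default.
        return -1
    if requested < step:
        # The card recommends treating any budget below 512 as 0 (direct response).
        return 0
    # Assumption: round down so the result never exceeds what the caller asked for.
    return (requested // step) * step


if __name__ == "__main__":
    for value in (-1, 0, 300, 512, 1000, 4096):
        print(value, "->", normalize_thinking_budget(value))
```

The returned value could then be passed wherever the budget is configured, for example as the thinking-budget argument of `generate.py`.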
Founded in 2023, ByteDance Seed Team is dedicated to crafting the industry's most advanced AI foundation models. The team aspires to become a world-class research team and make significant contributions to the advancement of science and society.

NaNK
license:apache-2.0
6,918
34

Qwen3-VL-30B-A3B-Thinking-1M-GGUF

> [!NOTE] > Includes Unsloth chat template fixes! For `llama.cpp`, use `--jinja` > Unsloth Dynamic 2.0 achieves superior accuracy & outperforms other leading quants. Meet Qwen3-VL — the most powerful vision-language model in the Qwen series to date. This generation delivers comprehensive upgrades across the board: superior text understanding & generation, deeper visual perception & reasoning, extended context length, enhanced spatial and video dynamics comprehension, and stronger agent interaction capabilities. Available in Dense and MoE architectures that scale from edge to cloud, with Instruct and reasoning‑enhanced Thinking editions for flexible, on‑demand deployment. Visual Agent: Operates PC/mobile GUIs—recognizes elements, understands functions, invokes tools, completes tasks. Visual Coding Boost: Generates Draw.io/HTML/CSS/JS from images/videos. Advanced Spatial Perception: Judges object positions, viewpoints, and occlusions; provides stronger 2D grounding and enables 3D grounding for spatial reasoning and embodied AI. Long Context & Video Understanding: Native 256K context, expandable to 1M; handles books and hours-long video with full recall and second-level indexing. Enhanced Multimodal Reasoning: Excels in STEM/Math—causal analysis and logical, evidence-based answers. Upgraded Visual Recognition: Broader, higher-quality pretraining is able to “recognize everything”—celebrities, anime, products, landmarks, flora/fauna, etc. Expanded OCR: Supports 32 languages (up from 19); robust in low light, blur, and tilt; better with rare/ancient characters and jargon; improved long-document structure parsing. Text Understanding on par with pure LLMs: Seamless text–vision fusion for lossless, unified comprehension. 1. Interleaved-MRoPE: Full‑frequency allocation over time, width, and height via robust positional embeddings, enhancing long‑horizon video reasoning. 2. 
DeepStack: Fuses multi‑level ViT features to capture fine‑grained details and sharpen image–text alignment. 3. Text–Timestamp Alignment: Moves beyond T‑RoPE to precise, timestamp‑grounded event localization for stronger video temporal modeling. This is the weight repository for Qwen3-VL-30B-A3B-Thinking. Below, we provide simple examples to show how to use Qwen3-VL with 🤖 ModelScope and 🤗 Transformers. The code of Qwen3-VL has been in the latest Hugging face transformers and we advise you to build from source with command: Here we show a code snippet to show you how to use the chat model with `transformers`: If you find our work helpful, feel free to give us a cite.

NaNK
license:apache-2.0
6,900
2

ERNIE-4.5-21B-A3B-PT-GGUF

NaNK
license:apache-2.0
6,898
12

Llama-3.3-70B-Instruct

See our collection for all versions of Llama 3.3 including GGUF, 4-bit and original 16-bit formats.

Finetune Llama 3.3, Gemma 2, Mistral 2-5x faster with 70% less memory via Unsloth!

We have a free Google Colab Tesla T4 notebook for Llama 3.1 (8B) here: https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1(8B)-Alpaca.ipynb

unsloth/Llama-3.3-70B-Instruct

For more details on the model, please go to Meta's original model card.

All notebooks are beginner friendly! Add your dataset, click "Run All", and you'll get a 2x faster finetuned model which can be exported to GGUF, vLLM or uploaded to Hugging Face.

| Unsloth supports | Free Notebooks | Performance | Memory use |
|-----------------|-----------------|-------------|----------|
| Llama-3.2 (3B) | ▶️ Start on Colab | 2.4x faster | 58% less |
| Llama-3.2 (11B vision) | ▶️ Start on Colab | 2x faster | 60% less |
| Qwen2 VL (7B) | ▶️ Start on Colab | 1.8x faster | 60% less |
| Qwen2.5 (7B) | ▶️ Start on Colab | 2x faster | 60% less |
| Llama-3.1 (8B) | ▶️ Start on Colab | 2.4x faster | 58% less |
| Phi-3.5 (mini) | ▶️ Start on Colab | 2x faster | 50% less |
| Gemma 2 (9B) | ▶️ Start on Colab | 2.4x faster | 58% less |
| Mistral (7B) | ▶️ Start on Colab | 2.2x faster | 62% less |

- This Llama 3.2 conversational notebook is useful for ShareGPT ChatML / Vicuna templates.
- This text completion notebook is for raw text. This DPO notebook replicates Zephyr.
- Kaggle has 2x T4s, but we use 1. Due to overhead, 1x T4 is 5x faster.
Special Thanks

A huge thank you to the Meta and Llama team for creating and releasing these models.

The Meta Llama 3.3 multilingual large language model (LLM) is a pretrained and instruction tuned generative model in 70B (text in/text out). The Llama 3.3 instruction tuned text only model is optimized for multilingual dialogue use cases and outperforms many of the available open source and closed chat models on common industry benchmarks.

Model Architecture: Llama 3.3 is an auto-regressive language model that uses an optimized transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align with human preferences for helpfulness and safety.

| | Training Data | Params | Input modalities | Output modalities | Context length | GQA | Token count | Knowledge cutoff |
| :---- | :---- | :---- | :---- | :---- | :---- | :---- | :---- | :---- |
| Llama 3.3 (text only) | A new mix of publicly available online data. | 70B | Multilingual Text | Multilingual Text and code | 128k | Yes | 15T+ | December 2023 |

Supported languages: English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai.

Llama 3.3 model. Token counts refer to pretraining data only. All model versions use Grouped-Query Attention (GQA) for improved inference scalability.

Status: This is a static model trained on an offline dataset. Future versions of the tuned models will be released as we improve model safety with community feedback.

License: A custom commercial license, the Llama 3.3 Community License Agreement, is available at: https://github.com/meta-llama/llama-models/blob/main/models/llama3_3/LICENSE

Where to send questions or comments about the model: Instructions on how to provide feedback or comments on the model can be found in the model README. For more technical information about generation parameters and recipes for how to use Llama 3.3 in applications, please go here.
Intended Use Cases: Llama 3.3 is intended for commercial and research use in multiple languages. Instruction tuned text only models are intended for assistant-like chat, whereas pretrained models can be adapted for a variety of natural language generation tasks. The Llama 3.3 model also supports the ability to leverage the outputs of its models to improve other models including synthetic data generation and distillation. The Llama 3.3 Community License allows for these use cases.

Out-of-scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in any other way that is prohibited by the Acceptable Use Policy and Llama 3.3 Community License. Use in languages beyond those explicitly referenced as supported in this model card**.

**Note: Llama 3.3 has been trained on a broader collection of languages than the 8 supported languages. Developers may fine-tune Llama 3.3 models for languages beyond the 8 supported languages provided they comply with the Llama 3.3 Community License and the Acceptable Use Policy and in such cases are responsible for ensuring that any uses of Llama 3.3 in additional languages is done in a safe and responsible manner.

This repository contains two versions of Llama-3.3-70B-Instruct, for use with transformers and with the original `llama` codebase.

Starting with `transformers >= 4.43.0` onward, you can run conversational inference using the Transformers `pipeline` abstraction or by leveraging the Auto classes with the `generate()` function. Make sure to update your transformers installation via `pip install --upgrade transformers`.

LLaMA-3.3 supports multiple tool use formats. You can see a full guide to prompt formatting here. Tool use is also supported through chat templates in Transformers. Here is a quick example showing a single simple tool: You can then generate text from this input as normal.
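The tool-calling flow described in this section (define a tool, record the model's tool call, then append the tool result with the `tool` role) can be sketched without loading the model. This is an illustrative sketch only: the tool `get_current_temperature`, the `TOOLS` registry, and the hard-coded `tool_call` are assumptions standing in for a real tool and a parsed model reply. Running it against the actual checkpoint would additionally require a tokenizer and model, as noted in the comments.

```python
import json


def get_current_temperature(location: str) -> float:
    """Get the current temperature at a location.

    Args:
        location: The location to get the temperature for, in the format "City, Country".

    Returns:
        The current temperature at the given location, as a float.
    """
    return 22.0  # Placeholder value; a real tool would query a weather service.


# Hypothetical registry mapping tool names to callables.
TOOLS = {"get_current_temperature": get_current_temperature}

messages = [
    {"role": "system", "content": "You are a bot that responds to weather queries."},
    {"role": "user", "content": "Hey, what's the temperature in Paris right now?"},
]

# With a loaded tokenizer, the prompt would typically be built with the chat template, e.g.
#   tokenizer.apply_chat_template(messages, tools=list(TOOLS.values()), add_generation_prompt=True)
# That step is omitted here so the sketch runs without downloading the model.

# Assumption: the model's reply has already been parsed into this tool call.
tool_call = {"name": "get_current_temperature", "arguments": {"location": "Paris, France"}}

# Step 1: record the tool call as an assistant turn.
messages.append(
    {"role": "assistant", "tool_calls": [{"type": "function", "function": tool_call}]}
)

# Step 2: run the tool and append its result with the `tool` role.
result = TOOLS[tool_call["name"]](**tool_call["arguments"])
messages.append({"role": "tool", "name": tool_call["name"], "content": str(result)})

print(json.dumps(messages[-2:], indent=2))
```

After the second step, the extended `messages` list would be passed through the chat template again so the model can use the tool result in its next turn.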
If the model generates a tool call, you should add it to the chat, then call the tool and append the result with the `tool` role. After that, you can `generate()` again to let the model use the tool result in the chat. Note that this was a very brief introduction to tool calling - for more information, see the LLaMA prompt format docs and the Transformers tool use documentation.

The model checkpoints can be used in `8-bit` and `4-bit` for further memory optimisations using `bitsandbytes` and `transformers`.

To download Original checkpoints, see the example command below leveraging `huggingface-cli`:

Training Factors: We used custom training libraries, Meta's custom built GPU cluster, and production infrastructure for pretraining. Fine-tuning, annotation, and evaluation were also performed on production infrastructure.

Training Energy Use: Training utilized a cumulative of 39.3M GPU hours of computation on H100-80GB (TDP of 700W) type hardware, per the table below. Training time is the total GPU time required for training each model and power consumption is the peak power capacity per GPU device used, adjusted for power usage efficiency.

Training Greenhouse Gas Emissions: Estimated total location-based greenhouse gas emissions were 11,390 tons CO2eq for training. Since 2020, Meta has maintained net zero greenhouse gas emissions in its global operations and matched 100% of its electricity use with renewable energy, therefore the total market-based greenhouse gas emissions for training were 0 tons CO2eq.

| | Training Time (GPU hours) | Training Power Consumption (W) | Training Location-Based Greenhouse Gas Emissions (tons CO2eq) | Training Market-Based Greenhouse Gas Emissions (tons CO2eq) |
| :---- | :---: | :---: | :---: | :---: |
| Llama 3.3 70B | 7.0M | 700 | 2,040 | 0 |

The methodology used to determine training energy use and greenhouse gas emissions can be found here.
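As a rough sanity check on the energy and emissions figures above, the GPU hours and per-GPU power can be multiplied out and compared with the reported location-based emissions. This is a back-of-the-envelope sketch, not Meta's methodology: the helper `implied_intensity` is hypothetical, and it assumes the 700 W figure applies for every GPU hour, without modelling how the power-usage-efficiency adjustment was applied.

```python
def implied_intensity(gpu_hours: float, power_w: float, emissions_tons: float):
    """Return (energy in kWh, implied kg CO2eq per kWh) for a training run.

    Hypothetical helper: assumes constant draw of `power_w` per GPU hour.
    """
    energy_kwh = gpu_hours * power_w / 1000.0          # W * h -> kWh
    kg_per_kwh = emissions_tons * 1000.0 / energy_kwh  # tons -> kg
    return energy_kwh, kg_per_kwh


if __name__ == "__main__":
    # Table row for Llama 3.3 70B: 7.0M GPU hours, 700 W, 2,040 tons CO2eq.
    row_energy, row_intensity = implied_intensity(7.0e6, 700, 2040)
    # Cumulative figures from the prose: 39.3M GPU hours, 11,390 tons CO2eq.
    all_energy, all_intensity = implied_intensity(39.3e6, 700, 11390)

    print(f"Llama 3.3 70B: {row_energy / 1e6:.1f} GWh, {row_intensity:.3f} kg CO2eq/kWh")
    print(f"Cumulative:    {all_energy / 1e6:.1f} GWh, {all_intensity:.3f} kg CO2eq/kWh")
```

Under these assumptions both the table row and the cumulative figures imply roughly the same emissions intensity per kWh, which suggests the two sets of numbers are internally consistent.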
Since Meta is openly releasing these models, the training energy use and greenhouse gas emissions will not be incurred by others.

Overview: Llama 3.3 was pretrained on ~15 trillion tokens of data from publicly available sources. The fine-tuning data includes publicly available instruction datasets, as well as over 25M synthetically generated examples.

Data Freshness: The pretraining data has a cutoff of December 2023.

In this section, we report the results for Llama 3.3 relative to our previous models.

| Category | Benchmark | # Shots | Metric | Llama 3.1 8B Instruct | Llama 3.1 70B Instruct | Llama-3.3 70B Instruct | Llama 3.1 405B Instruct |
| :---- | :---- | ----- | :---- | ----- | ----- | ----- | ----- |
| | MMLU (CoT) | 0 | macro_avg/acc | 73.0 | 86.0 | 86.0 | 88.6 |
| | MMLU Pro (CoT) | 5 | macro_avg/acc | 48.3 | 66.4 | 68.9 | 73.3 |
| Steerability | IFEval | | | 80.4 | 87.5 | 92.1 | 88.6 |
| Reasoning | GPQA Diamond (CoT) | 0 | acc | 31.8 | 48.0 | 50.5 | 49.0 |
| Code | HumanEval | 0 | pass@1 | 72.6 | 80.5 | 88.4 | 89.0 |
| | MBPP EvalPlus (base) | 0 | pass@1 | 72.8 | 86.0 | 87.6 | 88.6 |
| Math | MATH (CoT) | 0 | sympy_intersection_score | 51.9 | 68.0 | 77.0 | 73.8 |
| Tool Use | BFCL v2 | 0 | overall_ast_summary/macro_avg/valid | 65.4 | 77.5 | 77.3 | 81.1 |
| Multilingual | MGSM | 0 | em | 68.9 | 86.9 | 91.1 | 91.6 |

Responsibility & Safety

As part of our Responsible release approach, we followed a three-pronged strategy to managing trust & safety risks:

- Enable developers to deploy helpful, safe and flexible experiences for their target audience and for the use cases supported by Llama.
- Protect developers against adversarial users aiming to exploit Llama capabilities to potentially cause harm.
- Provide protections for the community to help prevent the misuse of our models.
Responsible deployment Llama is a foundational technology designed to be used in a variety of use cases, examples on how Meta’s Llama models have been responsibly deployed can be found in our Community Stories webpage. Our approach is to build the most helpful models enabling the world to benefit from the technology power, by aligning our model safety for the generic use cases addressing a standard set of harms. Developers are then in the driver seat to tailor safety for their use case, defining their own policy and deploying the models with the necessary safeguards in their Llama systems. Llama 3.3 was developed following the best practices outlined in our Responsible Use Guide, you can refer to the Responsible Use Guide to learn more. Llama 3.3 instruct Our main objectives for conducting safety fine-tuning are to provide the research community with a valuable resource for studying the robustness of safety fine-tuning, as well as to offer developers a readily available, safe, and powerful model for various applications to reduce the developer workload to deploy safe AI systems. For more details on the safety mitigations implemented please read the Llama 3 paper. Fine-tuning data We employ a multi-faceted approach to data collection, combining human-generated data from our vendors with synthetic data to mitigate potential safety risks. We’ve developed many large language model (LLM)-based classifiers that enable us to thoughtfully select high-quality prompts and responses, enhancing data quality control. Refusals and Tone Building on the work we started with Llama 3, we put a great emphasis on model refusals to benign prompts as well as refusal tone. We included both borderline and adversarial prompts in our safety data strategy, and modified our safety data responses to follow tone guidelines. 
Llama 3.3 systems Large language models, including Llama 3.3, are not designed to be deployed in isolation but instead should be deployed as part of an overall AI system with additional safety guardrails as required. Developers are expected to deploy system safeguards when building agentic systems. Safeguards are key to achieve the right helpfulness-safety alignment as well as mitigating safety and security risks inherent to the system and any integration of the model or system with external tools. As part of our responsible release approach, we provide the community with safeguards that developers should deploy with Llama models or other LLMs, including Llama Guard 3, Prompt Guard and Code Shield. All our reference implementations demos contain these safeguards by default so developers can benefit from system-level safety out-of-the-box. New capabilities Note that this release introduces new capabilities, including a longer context window, multilingual inputs and outputs and possible integrations by developers with third party tools. Building with these new capabilities requires specific considerations in addition to the best practices that generally apply across all Generative AI use cases. Tool-use: Just like in standard software development, developers are responsible for the integration of the LLM with the tools and services of their choice. They should define a clear policy for their use case and assess the integrity of the third party services they use to be aware of the safety and security limitations when using this capability. Refer to the Responsible Use Guide for best practices on the safe deployment of the third party safeguards. Multilinguality: Llama 3.3 supports 7 languages in addition to English: French, German, Hindi, Italian, Portuguese, Spanish, and Thai. Llama may be able to output text in other languages than those that meet performance thresholds for safety and helpfulness. 
We strongly discourage developers from using this model to converse in non-supported languages without implementing finetuning and system controls in alignment with their policies and the best practices shared in the Responsible Use Guide.

Evaluations

We evaluated Llama models for common use cases as well as specific capabilities. Common use case evaluations measure the safety risks of systems for the most commonly built applications, including chat bots, coding assistants, and tool calls. We built dedicated, adversarial evaluation datasets and evaluated systems composed of Llama models and Llama Guard 3 to filter input prompts and output responses. It is important to evaluate applications in context, and we recommend building a dedicated evaluation dataset for your use case. Prompt Guard and Code Shield are also available if relevant to the application.

Capability evaluations measure vulnerabilities of Llama models inherent to specific capabilities, for which we crafted dedicated benchmarks covering long context, multilingual use, tool calls, coding, and memorization.

Red teaming

For both scenarios, we conducted recurring red teaming exercises with the goal of discovering risks via adversarial prompting, and we used the learnings to improve our benchmarks and safety tuning datasets. We partnered early with subject-matter experts in critical risk areas to understand the nature of these real-world harms and how such models may lead to unintended harm for society. Based on these conversations, we derived a set of adversarial goals for the red team to attempt to achieve, such as extracting harmful information or reprogramming the model to act in a potentially harmful capacity. The red team consisted of experts in cybersecurity, adversarial machine learning, responsible AI, and integrity, in addition to multilingual content specialists with backgrounds in integrity issues in specific geographic markets.
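The system-level arrangement described under Evaluations, where a safeguard filters both the input prompt and the output response around the model, can be sketched as a small wrapper. Everything here is a hypothetical stand-in: `is_unsafe` is a toy keyword check rather than Llama Guard 3, `echo_model` is a toy model, and the marker string and refusal text are invented for illustration.

```python
from typing import Callable

# Hypothetical marker used by the toy classifier below. A real deployment would
# call a safeguard model (for example Llama Guard 3) instead of matching text.
BLOCKED_MARKER = "forbidden-topic"
REFUSAL = "Sorry, I can't help with that."


def is_unsafe(text: str) -> bool:
    """Toy classifier: flags any text that contains the blocked marker."""
    return BLOCKED_MARKER in text.lower()


def echo_model(prompt: str) -> str:
    """Toy model: echoes the prompt back. Stands in for a real LLM call."""
    return f"You said: {prompt}"


def guarded_generate(
    prompt: str,
    model: Callable[[str], str] = echo_model,
    classifier: Callable[[str], bool] = is_unsafe,
) -> str:
    """Run the model with a safety check before and after generation."""
    # Input filter: refuse before the model ever sees an unsafe prompt.
    if classifier(prompt):
        return REFUSAL
    response = model(prompt)
    # Output filter: refuse if the model's response is flagged.
    if classifier(response):
        return REFUSAL
    return response


print(guarded_generate("What is the capital of France?"))
print(guarded_generate("Tell me about forbidden-topic"))
```

Passing the model and classifier as parameters keeps the wrapper testable in isolation, which fits the card's advice to evaluate applications in context with a dedicated evaluation dataset.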
Critical and other risks

We specifically focused our efforts on mitigating the following critical risk areas:

1. CBRNE (Chemical, Biological, Radiological, Nuclear, and Explosive materials) helpfulness

To assess risks related to proliferation of chemical and biological weapons, we performed uplift testing designed to assess whether use of the Llama 3.3 model could meaningfully increase the capabilities of malicious actors to plan or carry out attacks using these types of weapons.

2. Child Safety

Child Safety risk assessments were conducted using a team of experts, to assess the model’s capability to produce outputs that could result in Child Safety risks and inform on any necessary and appropriate risk mitigations via fine tuning. We leveraged those expert red teaming sessions to expand the coverage of our evaluation benchmarks through Llama 3 model development. For Llama 3, we conducted new in-depth sessions using objective based methodologies to assess the model risks along multiple attack vectors including the additional languages Llama 3 is trained on. We also partnered with content specialists to perform red teaming exercises assessing potentially violating content while taking account of market specific nuances or experiences.

3. Cyber attack enablement

Our cyber attack uplift study investigated whether LLMs can enhance human capabilities in hacking tasks, both in terms of skill level and speed. Our attack automation study focused on evaluating the capabilities of LLMs when used as autonomous agents in cyber offensive operations, specifically in the context of ransomware attacks. This evaluation was distinct from previous studies that considered LLMs as interactive assistants. The primary objective was to assess whether these models could effectively function as independent agents in executing complex cyber-attacks without human intervention.
Community

Generative AI safety requires expertise and tooling, and we believe in the strength of the open community to accelerate its progress. We are active members of open consortiums, including the AI Alliance, Partnership on AI and MLCommons, actively contributing to safety standardization and transparency. We encourage the community to adopt taxonomies like the MLCommons Proof of Concept evaluation to facilitate collaboration and transparency on safety and content evaluations. Our Purple Llama tools are open sourced for the community to use and widely distributed across ecosystem partners including cloud service providers. We encourage community contributions to our GitHub repository.

We also set up the Llama Impact Grants program to identify and support the most compelling applications of Meta’s Llama model for societal benefit across three categories: education, climate and open innovation. The 20 finalists from the hundreds of applications can be found here.

Finally, we put in place a set of resources including an output reporting mechanism and bug bounty program to continuously improve the Llama technology with the help of the community.

Ethical Considerations and Limitations

The core values of Llama 3.3 are openness, inclusivity and helpfulness. It is meant to serve everyone, and to work for a wide range of use cases. It is thus designed to be accessible to people across many different backgrounds, experiences and perspectives. Llama 3.3 addresses users and their needs as they are, without inserting unnecessary judgment or normativity, while reflecting the understanding that even content that may appear problematic in some cases can serve valuable purposes in others. It respects the dignity and autonomy of all users, especially in terms of the values of free thought and expression that power innovation and progress.

But Llama 3.3 is a new technology, and like any new technology, there are risks associated with its use.
Testing conducted to date has not covered, nor could it cover, all scenarios. For these reasons, as with all LLMs, Llama 3.3’s potential outputs cannot be predicted in advance, and the model may in some instances produce inaccurate, biased or other objectionable responses to user prompts. Therefore, before deploying any applications of the Llama 3.3 model, developers should perform safety testing and tuning tailored to their specific applications of the model. Please refer to available resources including our Responsible Use Guide, Trust and Safety solutions, and other resources to learn more about responsible development.

llama
6,868
48

Phi-3.5-mini-instruct-bnb-4bit

llama
6,722
13

Qwen3-8B-Base-unsloth-bnb-4bit

license:apache-2.0
6,618
3

Llama-3.2-11B-Vision-Instruct-unsloth-bnb-4bit

mllama
6,506
28

DeepSeek-OCR

Read our guide, How to: Run & Fine-tune DeepSeek-OCR! This DeepSeek-OCR upload was edited to enable inference & fine-tuning on the latest transformers (no accuracy change).

- Thank you to Prithiv for the model modifications that enable DeepSeek-OCR fine-tuning.
- Fine-tune DeepSeek-OCR for free using our Google Colab notebook.
- View the rest of our notebooks in our docs here.

🌟 Github | 📥 Model Download | 📄 Paper Link | 📄 Arxiv Paper Link

Usage

Inference using Huggingface transformers on NVIDIA GPUs. Requirements were tested on Python 3.12.9 + CUDA 11.8.

vLLM

Refer to 🌟GitHub for guidance on model inference acceleration and PDF processing, etc.

[2025/10/23] 🚀🚀🚀 DeepSeek-OCR is now officially supported in upstream vLLM.

We would like to thank Vary, GOT-OCR2.0, MinerU, PaddleOCR, OneChart, Slow Perception for their valuable models and ideas. We also appreciate the benchmarks: Fox, OmniDocBench.

Citation

```bibtex
@article{wei2025deepseek,
  title={DeepSeek-OCR: Contexts Optical Compression},
  author={Wei, Haoran and Sun, Yaofeng and Li, Yukun},
  journal={arXiv preprint arXiv:2510.18234},
  year={2025}
}
```

license:mit
6,460
25

Qwen2.5-Coder-7B-Instruct

license:apache-2.0
6,425
7

Mistral-Small-3.2-24B-Instruct-2506-unsloth-bnb-4bit

license:apache-2.0
6,364
10

gemma-3-270m-it-qat-GGUF

> [!NOTE]
> Please use the correct settings: `temperature = 1.0, top_k = 64, top_p = 0.95, min_p = 0.0`
> See our collection for all versions of Gemma 3 including GGUF, 4-bit & 16-bit formats. Read our Guide to see how to Run Gemma 3 correctly.

- Fine-tune Gemma 3 (270M) for free using our Google Colab notebook!
- Read our Blog about Gemma 3 support: unsloth.ai/blog/gemma3
- View the rest of our notebooks in our docs here.

| Unsloth supports | Free Notebooks | Performance | Memory use |
|-----------------|----------------|-------------|----------|
| Gemma 3 (4B) | ▶️ Start on Colab | 2x faster | 80% less |
| Gemma-3n-E4B | ▶️ Start on Colab | 2x faster | 60% less |
| Gemma-3n-E4B (Audio) | ▶️ Start on Colab | 2x faster | 60% less |
| GRPO with Gemma 3 (1B) | ▶️ Start on Colab | 2x faster | 80% less |
| Gemma 3 (4B) Vision | ▶️ Start on Colab | 2x faster | 60% less |

- [Gemma 3 Technical Report][g3-tech-report]
- [Responsible Generative AI Toolkit][rai-toolkit]
- [Gemma on Kaggle][kaggle-gemma]
- [Gemma on Vertex Model Garden][vertex-mg-gemma3]

Summary description and brief definition of inputs and outputs.

Gemma is a family of lightweight, state-of-the-art open models from Google, built from the same research and technology used to create the Gemini models. Gemma 3 models are multimodal, handling text and image input and generating text output, with open weights for both pre-trained variants and instruction-tuned variants. Gemma 3 has a large, 128K context window, multilingual support in over 140 languages, and is available in more sizes than previous versions. Gemma 3 models are well-suited for a variety of text generation and image understanding tasks, including question answering, summarization, and reasoning.
Their relatively small size makes it possible to deploy them in environments with limited resources such as laptops, desktops or your own cloud infrastructure, democratizing access to state of the art AI models and helping foster innovation for everyone.

- Input:
  - Text string, such as a question, a prompt, or a document to be summarized
  - Images, normalized to 896 x 896 resolution and encoded to 256 tokens each, for the 4B, 12B, and 27B sizes
  - Total input context of 128K tokens for the 4B, 12B, and 27B sizes, and 32K tokens for the 1B and 270M sizes
- Output:
  - Generated text in response to the input, such as an answer to a question, analysis of image content, or a summary of a document
  - Total output context up to 128K tokens for the 4B, 12B, and 27B sizes, and 32K tokens for the 1B and 270M sizes per request, subtracting the request input tokens

Data used for model training and how the data was processed.

These models were trained on a dataset of text data that includes a wide variety of sources. The 27B model was trained with 14 trillion tokens, the 12B model with 12 trillion tokens, the 4B model with 4 trillion tokens, the 1B with 2 trillion tokens, and the 270M with 6 trillion tokens. The knowledge cutoff date for the training data was August 2024. Here are the key components:

- Web Documents: A diverse collection of web text ensures the model is exposed to a broad range of linguistic styles, topics, and vocabulary. The training dataset includes content in over 140 languages.
- Code: Exposing the model to code helps it to learn the syntax and patterns of programming languages, which improves its ability to generate code and understand code-related questions.
- Mathematics: Training on mathematical text helps the model learn logical reasoning, symbolic representation, and to address mathematical queries.
- Images: A wide range of images enables the model to perform image analysis and visual data extraction tasks.
The combination of these diverse data sources is crucial for training a powerful multimodal model that can handle a wide variety of different tasks and data formats.

Here are the key data cleaning and filtering methods applied to the training data:

- CSAM Filtering: Rigorous CSAM (Child Sexual Abuse Material) filtering was applied at multiple stages in the data preparation process to ensure the exclusion of harmful and illegal content.
- Sensitive Data Filtering: As part of making Gemma pre-trained models safe and reliable, automated techniques were used to filter out certain personal information and other sensitive data from training sets.
- Additional methods: Filtering based on content quality and safety in line with [our policies][safety-policies].

Gemma was trained using [Tensor Processing Unit (TPU)][tpu] hardware (TPUv4p, TPUv5p and TPUv5e). Training vision-language models (VLMs) requires significant computational power. TPUs, designed specifically for matrix operations common in machine learning, offer several advantages in this domain:

- Performance: TPUs are specifically designed to handle the massive computations involved in training VLMs. They can speed up training considerably compared to CPUs.
- Memory: TPUs often come with large amounts of high-bandwidth memory, allowing for the handling of large models and batch sizes during training. This can lead to better model quality.
- Scalability: TPU Pods (large clusters of TPUs) provide a scalable solution for handling the growing complexity of large foundation models. You can distribute training across multiple TPU devices for faster and more efficient processing.
- Cost-effectiveness: In many scenarios, TPUs can provide a more cost-effective solution for training large models compared to CPU-based infrastructure, especially when considering the time and resources saved due to faster training.

These advantages are aligned with [Google's commitments to operate sustainably][sustainability].
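The input and output limits listed earlier in this card (a shared context window per model size, 256 tokens per image, and output measured after subtracting the request's input tokens) can be turned into a quick budget check. This is a minimal sketch under stated assumptions: 128K and 32K are taken to mean 128 × 1024 and 32 × 1024 tokens, image input is assumed to apply only to the 4B, 12B, and 27B sizes, and the function and constant names are illustrative.

```python
# Context window per model size, in tokens (assumes K = 1024).
CONTEXT_TOKENS = {
    "270M": 32 * 1024,
    "1B": 32 * 1024,
    "4B": 128 * 1024,
    "12B": 128 * 1024,
    "27B": 128 * 1024,
}

# Each image is encoded to 256 tokens, per the card.
TOKENS_PER_IMAGE = 256

# Assumption: only these sizes accept image input.
VISION_SIZES = {"4B", "12B", "27B"}


def remaining_output_tokens(size: str, text_tokens: int, num_images: int = 0) -> int:
    """Return how many tokens are left for output after the request input."""
    if num_images > 0 and size not in VISION_SIZES:
        raise ValueError(f"Image input is not assumed to be supported for {size}")
    input_tokens = text_tokens + num_images * TOKENS_PER_IMAGE
    remaining = CONTEXT_TOKENS[size] - input_tokens
    if remaining < 0:
        raise ValueError("Request input exceeds the context window")
    return remaining


print(remaining_output_tokens("270M", text_tokens=1000))
print(remaining_output_tokens("4B", text_tokens=2000, num_images=4))
```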
Training was done using [JAX][jax] and [ML Pathways][ml-pathways]. JAX allows researchers to take advantage of the latest generation of hardware, including TPUs, for faster and more efficient training of large models. ML Pathways is Google's latest effort to build artificially intelligent systems capable of generalizing across multiple tasks. This is specially suitable for foundation models, including large language models like these ones. Together, JAX and ML Pathways are used as described in the [paper about the Gemini family of models][gemini-2-paper]; "the 'single controller' programming model of Jax and Pathways allows a single Python process to orchestrate the entire training run, dramatically simplifying the development workflow." These models were evaluated against a large collection of different datasets and metrics to cover different aspects of text generation. Evaluation results marked with IT are for instruction-tuned models. Evaluation results marked with PT are for pre-trained models.
| Benchmark | n-shot | Gemma 3 PT 270M |
| :------------------------ | :-----------: | ------------------: |
| [HellaSwag][hellaswag] | 10-shot | 40.9 |
| [BoolQ][boolq] | 0-shot | 61.4 |
| [PIQA][piqa] | 0-shot | 67.7 |
| [TriviaQA][triviaqa] | 5-shot | 15.4 |
| [ARC-c][arc] | 25-shot | 29.0 |
| [ARC-e][arc] | 0-shot | 57.7 |
| [WinoGrande][winogrande] | 5-shot | 52.0 |

[hellaswag]: https://arxiv.org/abs/1905.07830
[boolq]: https://arxiv.org/abs/1905.10044
[piqa]: https://arxiv.org/abs/1911.11641
[triviaqa]: https://arxiv.org/abs/1705.03551
[arc]: https://arxiv.org/abs/1911.01547
[winogrande]: https://arxiv.org/abs/1907.10641

| Benchmark | n-shot | Gemma 3 IT 270M |
| :------------------------ | :-----------: | ------------------: |
| [HellaSwag][hellaswag] | 0-shot | 37.7 |
| [PIQA][piqa] | 0-shot | 66.2 |
| [ARC-c][arc] | 0-shot | 28.2 |
| [WinoGrande][winogrande] | 0-shot | 52.3 |
| [BIG-Bench Hard][bbh] | few-shot | 26.7 |
| [IF Eval][ifeval] | 0-shot | 51.2 |

[bbh]: https://paperswithcode.com/dataset/bbh
[ifeval]: https://arxiv.org/abs/2311.07911

| Benchmark | n-shot | Gemma 3 IT 1B | Gemma 3 IT 4B | Gemma 3 IT 12B | Gemma 3 IT 27B |
|--------------------------------|--------|:-------------:|:-------------:|:--------------:|:--------------:|
| [GPQA][gpqa] Diamond | 0-shot | 19.2 | 30.8 | 40.9 | 42.4 |
| [SimpleQA][simpleqa] | 0-shot | 2.2 | 4.0 | 6.3 | 10.0 |
| [FACTS Grounding][facts-grdg] | - | 36.4 | 70.1 | 75.8 | 74.9 |
| [BIG-Bench Hard][bbh] | 0-shot | 39.1 | 72.2 | 85.7 | 87.6 |
| [BIG-Bench Extra Hard][bbeh] | 0-shot | 7.2 | 11.0 | 16.3 | 19.3 |
| [IFEval][ifeval] | 0-shot | 80.2 | 90.2 | 88.9 | 90.4 |

| Benchmark | n-shot | Gemma 3 PT 1B | Gemma 3 PT 4B | Gemma 3 PT 12B | Gemma 3 PT 27B |
| ------------------------------ |----------|:--------------:|:-------------:|:--------------:|:--------------:|
| [HellaSwag][hellaswag] | 10-shot | 62.3 | 77.2 | 84.2 | 85.6 |
| [BoolQ][boolq] | 0-shot | 63.2 | 72.3 | 78.8 | 82.4 |
| [PIQA][piqa] | 0-shot | 73.8 | 79.6 | 81.8 | 83.3 |
| [SocialIQA][socialiqa] | 0-shot | 48.9 | 51.9 | 53.4 | 54.9 |
| [TriviaQA][triviaqa] | 5-shot | 39.8 | 65.8 | 78.2 | 85.5 |
| [Natural Questions][naturalq] | 5-shot | 9.48 | 20.0 | 31.4 | 36.1 |
| [ARC-c][arc] | 25-shot | 38.4 | 56.2 | 68.9 | 70.6 |
| [ARC-e][arc] | 0-shot | 73.0 | 82.4 | 88.3 | 89.0 |
| [WinoGrande][winogrande] | 5-shot | 58.2 | 64.7 | 74.3 | 78.8 |
| [BIG-Bench Hard][bbh] | few-shot | 28.4 | 50.9 | 72.6 | 77.7 |
| [DROP][drop] | 1-shot | 42.4 | 60.1 | 72.2 | 77.2 |

[gpqa]: https://arxiv.org/abs/2311.12022
[simpleqa]: https://arxiv.org/abs/2411.04368
[facts-grdg]: https://goo.gle/FACTSpaper
[bbeh]: https://github.com/google-deepmind/bbeh
[socialiqa]: https://arxiv.org/abs/1904.09728
[naturalq]: https://github.com/google-research-datasets/natural-questions
[drop]: https://arxiv.org/abs/1903.00161

| Benchmark | n-shot | Gemma 3 IT 1B | Gemma 3 IT 4B | Gemma 3 IT 12B | Gemma 3 IT 27B |
|----------------------------|--------|:-------------:|:-------------:|:--------------:|:--------------:|
| [MMLU][mmlu] (Pro) | 0-shot | 14.7 | 43.6 | 60.6 | 67.5 |
| [LiveCodeBench][lcb] | 0-shot | 1.9 | 12.6 | 24.6 | 29.7 |
| [Bird-SQL][bird-sql] (dev) | - | 6.4 | 36.3 | 47.9 | 54.4 |
| [Math][math] | 0-shot | 48.0 | 75.6 | 83.8 | 89.0 |
| HiddenMath | 0-shot | 15.8 | 43.0 | 54.5 | 60.3 |
| [MBPP][mbpp] | 3-shot | 35.2 | 63.2 | 73.0 | 74.4 |
| [HumanEval][humaneval] | 0-shot | 41.5 | 71.3 | 85.4 | 87.8 |
| [Natural2Code][nat2code] | 0-shot | 56.0 | 70.3 | 80.7 | 84.5 |
| [GSM8K][gsm8k] | 0-shot | 62.8 | 89.2 | 94.4 | 95.9 |

| Benchmark | n-shot | Gemma 3 PT 4B | Gemma 3 PT 12B | Gemma 3 PT 27B |
| ------------------------------ |----------------|:-------------:|:--------------:|:--------------:|
| [MMLU][mmlu] | 5-shot | 59.6 | 74.5 | 78.6 |
| [MMLU][mmlu] (Pro COT) | 5-shot | 29.2 | 45.3 | 52.2 |
| [AGIEval][agieval] | 3-5-shot | 42.1 | 57.4 | 66.2 |
| [MATH][math] | 4-shot | 24.2 | 43.3 | 50.0 |
| [GSM8K][gsm8k] | 8-shot | 38.4 | 71.0 | 82.6 |
| [GPQA][gpqa] | 5-shot | 15.0 | 25.4 | 24.3 |
| [MBPP][mbpp] | 3-shot | 46.0 | 60.4 | 65.6 |
| [HumanEval][humaneval] | 0-shot | 36.0 | 45.7 | 48.8 |

[mmlu]: https://arxiv.org/abs/2009.03300
[agieval]: https://arxiv.org/abs/2304.06364
[math]: https://arxiv.org/abs/2103.03874
[gsm8k]: https://arxiv.org/abs/2110.14168
[mbpp]: https://arxiv.org/abs/2108.07732
[humaneval]: https://arxiv.org/abs/2107.03374
[lcb]: https://arxiv.org/abs/2403.07974
[bird-sql]: https://arxiv.org/abs/2305.03111
[nat2code]: https://arxiv.org/abs/2405.04520

| Benchmark | n-shot | Gemma 3 IT 1B | Gemma 3 IT 4B | Gemma 3 IT 12B | Gemma 3 IT 27B |
|--------------------------------------|--------|:-------------:|:-------------:|:--------------:|:--------------:|
| [Global-MMLU-Lite][global-mmlu-lite] | 0-shot | 34.2 | 54.5 | 69.5 | 75.1 |
| [ECLeKTic][eclektic] | 0-shot | 1.4 | 4.6 | 10.3 | 16.7 |
| [WMT24++][wmt24pp] | 0-shot | 35.9 | 46.8 | 51.6 | 53.4 |

| Benchmark | Gemma 3 PT 1B | Gemma 3 PT 4B | Gemma 3 PT 12B | Gemma 3 PT 27B |
| ------------------------------------ |:-------------:|:-------------:|:--------------:|:--------------:|
| [MGSM][mgsm] | 2.04 | 34.7 | 64.3 | 74.3 |
| [Global-MMLU-Lite][global-mmlu-lite] | 24.9 | 57.0 | 69.4 | 75.7 |
| [WMT24++][wmt24pp] (ChrF) | 36.7 | 48.4 | 53.9 | 55.7 |
| [FloRes][flores] | 29.5 | 39.2 | 46.0 | 48.8 |
| [XQuAD][xquad] (all) | 43.9 | 68.0 | 74.5 | 76.8 |
| [ECLeKTic][eclektic] | 4.69 | 11.0 | 17.2 | 24.4 |
| [IndicGenBench][indicgenbench] | 41.4 | 57.2 | 61.7 | 63.4 |

[mgsm]: https://arxiv.org/abs/2210.03057
[flores]: https://arxiv.org/abs/2106.03193
[xquad]: https://arxiv.org/abs/1910.11856v3
[global-mmlu-lite]: https://huggingface.co/datasets/CohereForAI/Global-MMLU-Lite
[wmt24pp]: https://arxiv.org/abs/2502.12404v1
[eclektic]: https://arxiv.org/abs/2502.21228
[indicgenbench]: https://arxiv.org/abs/2404.16816

| Benchmark | Gemma 3 IT 4B | Gemma 3 IT 12B | Gemma 3 IT 27B |
|-----------------------------------|:-------------:|:--------------:|:--------------:|
| [MMMU][mmmu] (val) | 48.8 | 59.6 | 64.9 |
| [DocVQA][docvqa] | 75.8 | 87.1 | 86.6 |
| [InfoVQA][info-vqa] | 50.0 | 64.9 | 70.6 |
| [TextVQA][textvqa] | 57.8 | 67.7 | 65.1 |
| [AI2D][ai2d] | 74.8 | 84.2 | 84.5 |
| [ChartQA][chartqa] | 68.8 | 75.7 | 78.0 |
| [VQAv2][vqav2] (val) | 62.4 | 71.6 | 71.0 |
| [MathVista][mathvista] (testmini) | 50.0 | 62.9 | 67.6 |

| Benchmark | Gemma 3 PT 4B | Gemma 3 PT 12B | Gemma 3 PT 27B |
| ------------------------------ |:-------------:|:--------------:|:--------------:|
| [COCOcap][coco-cap] | 102 | 111 | 116 |
| [DocVQA][docvqa] (val) | 72.8 | 82.3 | 85.6 |
| [InfoVQA][info-vqa] (val) | 44.1 | 54.8 | 59.4 |
| [MMMU][mmmu] (pt) | 39.2 | 50.3 | 56.1 |
| [TextVQA][textvqa] (val) | 58.9 | 66.5 | 68.6 |
| [RealWorldQA][realworldqa] | 45.5 | 52.2 | 53.9 |
| [ReMI][remi] | 27.3 | 38.5 | 44.8 |
| [AI2D][ai2d] | 63.2 | 75.2 | 79.0 |
| [ChartQA][chartqa] | 63.6 | 74.7 | 76.3 |
| [VQAv2][vqav2] | 63.9 | 71.2 | 72.9 |
| [BLINK][blinkvqa] | 38.0 | 35.9 | 39.6 |
| [OKVQA][okvqa] | 51.0 | 58.7 | 60.2 |
| [TallyQA][tallyqa] | 42.5 | 51.8 | 54.3 |
| [SpatialSense VQA][ss-vqa] | 50.9 | 60.0 | 59.4 |
| [CountBenchQA][countbenchqa] | 26.1 | 17.8 | 68.0 |

[coco-cap]: https://cocodataset.org/#home
[docvqa]: https://www.docvqa.org/
[info-vqa]: https://arxiv.org/abs/2104.12756
[mmmu]: https://arxiv.org/abs/2311.16502
[textvqa]: https://textvqa.org/
[realworldqa]: https://paperswithcode.com/dataset/realworldqa
[remi]: https://arxiv.org/html/2406.09175v1
[ai2d]: https://allenai.org/data/diagrams
[chartqa]: https://arxiv.org/abs/2203.10244
[vqav2]: https://visualqa.org/index.html
[blinkvqa]: https://arxiv.org/abs/2404.12390
[okvqa]: https://okvqa.allenai.org/
[tallyqa]: https://arxiv.org/abs/1810.12440
[ss-vqa]: https://arxiv.org/abs/1908.02660
[countbenchqa]: https://github.com/google-research/big_vision/blob/main/big_vision/datasets/countbenchqa/
[mathvista]: https://arxiv.org/abs/2310.02255

Our evaluation methods include structured evaluations and internal red-teaming testing of relevant content policies. Red-teaming was conducted by a number of different teams, each with different goals and human evaluation metrics. These models were evaluated against a number of different categories relevant to ethics and safety, including:

- Child Safety: Evaluation of text-to-text and image-to-text prompts covering child safety policies, including child sexual abuse and exploitation.
- Content Safety: Evaluation of text-to-text and image-to-text prompts covering safety policies, including harassment, violence and gore, and hate speech.
- Representational Harms: Evaluation of text-to-text and image-to-text prompts covering safety policies, including bias, stereotyping, and harmful associations or inaccuracies.

In addition to development-level evaluations, we conduct "assurance evaluations", which are our arm's-length internal evaluations for responsibility governance decision making. They are conducted separately from the model development team to inform decision making about release. High-level findings are fed back to the model team, but prompt sets are held out to prevent overfitting and preserve the results' ability to inform decision making. Assurance evaluation results are reported to our Responsibility & Safety Council as part of release review.
For all areas of safety testing, we saw major improvements in the categories of child safety, content safety, and representational harms relative to previous Gemma models. All testing was conducted without safety filters to evaluate the model capabilities and behaviors. For both text-to-text and image-to-text, and across all model sizes, the model produced minimal policy violations and showed significant improvements over previous Gemma models' performance with respect to ungrounded inferences. A limitation of our evaluations was that they included only English-language prompts.

These models have certain limitations that users should be aware of.

Open vision-language models (VLMs) have a wide range of applications across various industries and domains. The following list of potential uses is not comprehensive. The purpose of this list is to provide contextual information about the possible use cases that the model creators considered as part of model training and development.

- Content Creation and Communication
  - Text Generation: These models can be used to generate creative text formats such as poems, scripts, code, marketing copy, and email drafts.
  - Chatbots and Conversational AI: Power conversational interfaces for customer service, virtual assistants, or interactive applications.
  - Text Summarization: Generate concise summaries of a text corpus, research papers, or reports.
  - Image Data Extraction: These models can be used to extract, interpret, and summarize visual data for text communications.
- Research and Education
  - Natural Language Processing (NLP) and VLM Research: These models can serve as a foundation for researchers to experiment with VLM and NLP techniques, develop algorithms, and contribute to the advancement of the field.
  - Language Learning Tools: Support interactive language learning experiences, aiding in grammar correction or providing writing practice.
  - Knowledge Exploration: Assist researchers in exploring large bodies of text by generating summaries or answering questions about specific topics.

Limitations:

- Training Data
  - The quality and diversity of the training data significantly influence the model's capabilities. Biases or gaps in the training data can lead to limitations in the model's responses.
  - The scope of the training dataset determines the subject areas the model can handle effectively.
- Context and Task Complexity
  - Models are better at tasks that can be framed with clear prompts and instructions. Open-ended or highly complex tasks might be challenging.
  - A model's performance can be influenced by the amount of context provided (longer context generally leads to better outputs, up to a certain point).
- Language Ambiguity and Nuance
  - Natural language is inherently complex. Models might struggle to grasp subtle nuances, sarcasm, or figurative language.
- Factual Accuracy
  - Models generate responses based on information they learned from their training datasets, but they are not knowledge bases. They may generate incorrect or outdated factual statements.
- Common Sense
  - Models rely on statistical patterns in language. They might lack the ability to apply common sense reasoning in certain situations.

The development of vision-language models (VLMs) raises several ethical concerns. In creating an open model, we have carefully considered the following:

- Bias and Fairness
  - VLMs trained on large-scale, real-world text and image data can reflect socio-cultural biases embedded in the training material. These models underwent careful scrutiny, with input data pre-processing described and posterior evaluations reported in this card.
- Misinformation and Misuse
  - VLMs can be misused to generate text that is false, misleading, or harmful.
  - Guidelines are provided for responsible use with the model; see the [Responsible Generative AI Toolkit][rai-toolkit].
- Transparency and Accountability
  - This model card summarizes details on the models' architecture, capabilities, limitations, and evaluation processes.
  - A responsibly developed open model offers the opportunity to share innovation by making VLM technology accessible to developers and researchers across the AI ecosystem.

Risks identified and mitigations:

- Perpetuation of biases: Continuous monitoring (using evaluation metrics and human review) and the exploration of de-biasing techniques are encouraged during model training, fine-tuning, and other use cases.
- Generation of harmful content: Mechanisms and guidelines for content safety are essential. Developers are encouraged to exercise caution and implement appropriate content safety safeguards based on their specific product policies and application use cases.
- Misuse for malicious purposes: Technical limitations and developer and end-user education can help mitigate malicious applications of VLMs. Educational resources and reporting mechanisms for users to flag misuse are provided. Prohibited uses of Gemma models are outlined in the [Gemma Prohibited Use Policy][prohibited-use].
- Privacy violations: Models were trained on data filtered for removal of certain personal information and other sensitive data. Developers are encouraged to adhere to privacy regulations with privacy-preserving techniques.

At the time of release, this family of models provides high-performance open vision-language model implementations designed from the ground up for responsible AI development compared to similarly sized models. Using the benchmark evaluation metrics described in this document, these models have been shown to provide superior performance to other, comparably sized open model alternatives.
[g3-tech-report]: https://arxiv.org/abs/2503.19786
[rai-toolkit]: https://ai.google.dev/responsible
[kaggle-gemma]: https://www.kaggle.com/models/google/gemma-3
[vertex-mg-gemma3]: https://console.cloud.google.com/vertex-ai/publishers/google/model-garden/gemma3
[terms]: https://ai.google.dev/gemma/terms
[safety-policies]: https://ai.google/static/documents/ai-responsibility-update-published-february-2025.pdf
[prohibited-use]: https://ai.google.dev/gemma/prohibitedusepolicy
[tpu]: https://cloud.google.com/tpu/docs/intro-to-tpu
[sustainability]: https://sustainability.google/operating-sustainably/
[jax]: https://github.com/jax-ml/jax
[ml-pathways]: https://blog.google/technology/ai/introducing-pathways-next-generation-ai-architecture/
[gemini-2-paper]: https://arxiv.org/abs/2312.11805

6,299
11

DeepSeek-R1-Distill-Qwen-7B-GGUF

NaNK
license:apache-2.0
6,243
88

Qwen3-VL-30B-A3B-Instruct-1M-GGUF

> [!NOTE]
> Includes Unsloth chat template fixes! For `llama.cpp`, use `--jinja`.

> Unsloth Dynamic 2.0 achieves superior accuracy & outperforms other leading quants.

Meet Qwen3-VL, the most powerful vision-language model in the Qwen series to date. This generation delivers comprehensive upgrades across the board: superior text understanding & generation, deeper visual perception & reasoning, extended context length, enhanced spatial and video dynamics comprehension, and stronger agent interaction capabilities. It is available in Dense and MoE architectures that scale from edge to cloud, with Instruct and reasoning-enhanced Thinking editions for flexible, on-demand deployment.

- Visual Agent: Operates PC/mobile GUIs: recognizes elements, understands functions, invokes tools, completes tasks.
- Visual Coding Boost: Generates Draw.io/HTML/CSS/JS from images/videos.
- Advanced Spatial Perception: Judges object positions, viewpoints, and occlusions; provides stronger 2D grounding and enables 3D grounding for spatial reasoning and embodied AI.
- Long Context & Video Understanding: Native 256K context, expandable to 1M; handles books and hours-long video with full recall and second-level indexing.
- Enhanced Multimodal Reasoning: Excels in STEM/Math, with causal analysis and logical, evidence-based answers.
- Upgraded Visual Recognition: Broader, higher-quality pretraining lets the model “recognize everything”: celebrities, anime, products, landmarks, flora/fauna, etc.
- Expanded OCR: Supports 32 languages (up from 19); robust in low light, blur, and tilt; better with rare/ancient characters and jargon; improved long-document structure parsing.
- Text Understanding on par with pure LLMs: Seamless text-vision fusion for lossless, unified comprehension.

1. Interleaved-MRoPE: Full-frequency allocation over time, width, and height via robust positional embeddings, enhancing long-horizon video reasoning.
2. DeepStack: Fuses multi-level ViT features to capture fine-grained details and sharpen image-text alignment.
3. Text-Timestamp Alignment: Moves beyond T-RoPE to precise, timestamp-grounded event localization for stronger video temporal modeling.

This is the weight repository for Qwen3-VL-30B-A3B-Instruct.

Qwen3-VL can be used with 🤖 ModelScope and 🤗 Transformers. The code of Qwen3-VL is included in the latest Hugging Face `transformers`, and we advise you to build it from source.

If you find our work helpful, feel free to cite it.

NaNK
license:apache-2.0
6,178
2

gemma-3-4b-it-qat-GGUF

NaNK
6,080
22

DeepSeek-R1-Distill-Qwen-1.5B-unsloth-bnb-4bit

NaNK
license:apache-2.0
5,936
17

phi-4-bnb-4bit

NaNK
llama
5,930
16

GLM-4-32B-0414-GGUF

NaNK
license:mit
5,787
51

gemma-3n-E2B-it-unsloth-bnb-4bit

NaNK
5,713
8

Apertus-8B-Instruct-2509-GGUF

NaNK
license:apache-2.0
5,707
11

Qwen3-8B-128K-GGUF

NaNK
license:apache-2.0
5,677
23

tinyllama-bnb-4bit

NaNK
llama
5,650
12

gemma-2-2b-it-bnb-4bit

NaNK
5,566
20

MiMo-VL-7B-RL-GGUF

NaNK
license:mit
5,468
13

Qwen3-1.7B-Base

NaNK
license:apache-2.0
5,380
4

embeddinggemma-300m

Resources: Responsible Generative AI Toolkit, EmbeddingGemma on Kaggle, EmbeddingGemma on Vertex Model Garden.

EmbeddingGemma is a 300M parameter, state-of-the-art for its size, open embedding model from Google, built from Gemma 3 (with T5Gemma initialization) and the same research and technology used to create Gemini models. EmbeddingGemma produces vector representations of text, making it well-suited for search and retrieval tasks, including classification, clustering, and semantic similarity search. This model was trained with data in 100+ spoken languages.

The small size and on-device focus make it possible to deploy in environments with limited resources such as mobile phones, laptops, or desktops, democratizing access to state-of-the-art AI models and helping foster innovation for everyone.

- Input:
  - Text string, such as a question, a prompt, or a document to be embedded
  - Maximum input context length of 2048 tokens
- Output:
  - Numerical vector representations of input text data
  - Output embedding dimension size of 768, with smaller options available (512, 256, or 128) via Matryoshka Representation Learning (MRL). MRL allows users to truncate the output embedding of size 768 to their desired size and then re-normalize for efficient and accurate representation.

These model weights are designed to be used with Sentence Transformers, using the Gemma 3 implementation from Hugging Face Transformers as the backbone.

NOTE: EmbeddingGemma activations do not support `float16`. Please use `float32` or `bfloat16` as appropriate for your hardware.

This model was trained on a dataset of text data that includes a wide variety of sources totaling approximately 320 billion tokens. Here are the key components:

- Web Documents: A diverse collection of web text ensures the model is exposed to a broad range of linguistic styles, topics, and vocabulary. The training dataset includes content in over 100 languages.
- Code and Technical Documents: Exposing the model to code and technical documentation helps it learn the structure and patterns of programming languages and specialized scientific content, which improves its understanding of code and technical questions.
- Synthetic and Task-Specific Data: Synthetic training data helps to teach the model specific skills. This includes curated data for tasks like information retrieval, classification, and sentiment analysis, which helps to fine-tune its performance for common embedding applications.

The combination of these diverse data sources is crucial for training a powerful multilingual embedding model that can handle a wide variety of different tasks and data formats.

Here are the key data cleaning and filtering methods applied to the training data:

- CSAM Filtering: Rigorous CSAM (Child Sexual Abuse Material) filtering was applied at multiple stages in the data preparation process to ensure the exclusion of harmful and illegal content.
- Sensitive Data Filtering: As part of making Gemma pre-trained models safe and reliable, automated techniques were used to filter out certain personal information and other sensitive data from training sets.
- Additional methods: Filtering based on content quality and safety in line with our policies.

EmbeddingGemma was trained using the latest generation of Tensor Processing Unit (TPU) hardware (TPUv5e); for more details, refer to the Gemma 3 model card.

Training was done using JAX and ML Pathways. For more details, refer to the Gemma 3 model card.

The model was evaluated against a large collection of different datasets and metrics to cover different aspects of text understanding.
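The MRL truncation described for the output embeddings (keep the leading components, then re-normalize) can be sketched in a few lines. This is a minimal illustration in plain Python, not code from the model card: the helper name `truncate_and_renormalize` is hypothetical, and a toy 8-dimensional vector stands in for a real 768-dimensional embedding.

```python
import math


def truncate_and_renormalize(embedding, dim):
    """Keep the first `dim` components and rescale them to unit L2 norm.

    Hypothetical helper for illustration; `embedding` is any sequence of floats.
    """
    if not 0 < dim <= len(embedding):
        raise ValueError("dim must be between 1 and len(embedding)")
    head = list(embedding[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # an all-zero prefix cannot be rescaled
    return [x / norm for x in head]


# Toy vector standing in for a 768-dimensional embedding.
full = [0.5, -0.5, 0.5, -0.5, 0.1, 0.2, -0.1, 0.3]
small = truncate_and_renormalize(full, 4)
print(len(small), sum(x * x for x in small))
```

With a real embedding the same call would use `dim` set to 512, 256, or 128, the sizes the card lists; re-normalizing keeps cosine and dot-product similarity consistent after truncation.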
[Three results tables with columns: Quant config (dimensionality), Mean (Task), Mean (TaskType)]

\* Mixed Precision refers to per-channel quantization with int4 for embeddings, feedforward, and projection layers, and int8 for attention (e4a8f4p4).

EmbeddingGemma can generate optimized embeddings for various use cases (such as document retrieval, question answering, and fact verification) or for specific input types (either a query or a document) using prompts that are prepended to the input strings. Query prompts follow the form `task: {task description} | query: ` where the task description varies by the use case, with the default task description being `search result`. Document-style prompts follow the form `title: {title | "none"} | text: ` where the title is either `none` (the default) or the actual title of the document. Note that providing a title, if available, will improve model performance for document prompts but may require manual formatting.

Use the following prompts based on your use case and input data type. These may already be available in the EmbeddingGemma configuration in your modeling framework of choice.

- Retrieval: Used to generate embeddings that are optimized for document search or information retrieval.
- Classification: Used to generate embeddings that are optimized to classify texts according to preset labels.
- Clustering: Used to generate embeddings that are optimized to cluster texts based on their similarities.
- Semantic Similarity: Used to generate embeddings that are optimized to assess text similarity. This is not intended for retrieval use cases.
- Code Retrieval: Used to retrieve a code block based on a natural language query, such as "sort an array" or "reverse a linked list". Embeddings of the code blocks are computed using `retrieval_document`.

These models have certain limitations that users should be aware of.

Open embedding models have a wide range of applications across various industries and domains. The following list of potential uses is not comprehensive. The purpose of this list is to provide contextual information about the possible use cases that the model creators considered as part of model training and development.

- Semantic Similarity: Embeddings optimized to assess text similarity, such as recommendation systems and duplicate detection
- Classification: Embeddings optimized to classify texts according to preset labels, such as sentiment analysis and spam detection
- Clustering: Embeddings optimized to cluster texts based on their similarities, such as document organization, market research, and anomaly detection
- Retrieval
  - Document: Embeddings optimized for document search, such as indexing articles, books, or web pages for search
  - Query: Embeddings optimized for general search queries, such as custom search
  - Code Query: Embeddings optimized for retrieval of code blocks based on natural language queries, such as code suggestions and search
- Question Answering: Embeddings for questions in a question-answering system, optimized for finding documents that answer the question, such as a chatbot.
- Fact Verification: Embeddings for statements that need to be verified, optimized for retrieving documents that contain evidence supporting or refuting the statement, such as automated fact-checking systems.

Limitations:

- Training Data
  - The quality and diversity of the training data significantly influence the model's capabilities. Biases or gaps in the training data can lead to limitations in the model's responses.
  - The scope of the training dataset determines the subject areas the model can handle effectively.
- Language Ambiguity and Nuance
  - Natural language is inherently complex. Models might struggle to grasp subtle nuances, sarcasm, or figurative language.

Risks identified and mitigations:

- Perpetuation of biases: Continuous monitoring (using evaluation metrics and human review) and the exploration of de-biasing techniques are encouraged during model training, fine-tuning, and other use cases.
- Misuse for malicious purposes: Technical limitations and developer and end-user education can help mitigate malicious applications of embeddings. Educational resources and reporting mechanisms for users to flag misuse are provided. Prohibited uses of Gemma models are outlined in the Gemma Prohibited Use Policy.
- Privacy violations: Models were trained on data filtered for removal of certain personal information and other sensitive data. Developers are encouraged to adhere to privacy regulations with privacy-preserving techniques.

At the time of release, this family of models provides high-performance open embedding model implementations designed from the ground up for responsible AI development compared to similarly sized models. Using the benchmark evaluation metrics described in this document, these models have shown superior performance to other, comparably sized open model alternatives.
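The query and document prompt forms described earlier in this card (`task: {task description} | query: ` and `title: {title | "none"} | text: `) are plain string prefixes, so they are easy to build by hand. The sketch below assumes nothing beyond those two templates and the default task description `search result`; the function names are hypothetical, and the example task string passed in the last call is an illustrative placeholder rather than a value confirmed by the card.

```python
def query_prompt(content, task="search result"):
    """Prefix a query with the task-style prompt, using the card's default task."""
    return f"task: {task} | query: {content}"


def document_prompt(content, title=None):
    """Prefix a document with the title-style prompt; a missing title becomes "none"."""
    return f"title: {title or 'none'} | text: {content}"


print(query_prompt("how do I reverse a linked list?"))
print(document_prompt("Iterate once, flipping each next pointer.", title="Linked lists"))
print(document_prompt("A document without a known title."))
# The task description is free text supplied by the caller (placeholder value here).
print(query_prompt("the moon is made of cheese", task="example task description"))
```

If your framework already ships these prompts in the model configuration, as the card suggests it may, prefer those over hand-built strings.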

5,370
6

Qwen3-30B-A3B-Instruct-2507

> [!NOTE]
> Includes Unsloth chat template fixes! For `llama.cpp`, use `--jinja`.

> Unsloth Dynamic 2.0 achieves superior accuracy & outperforms other leading quants.

We introduce the updated version of the Qwen3-30B-A3B non-thinking mode, named Qwen3-30B-A3B-Instruct-2507, featuring the following key enhancements:

- Significant improvements in general capabilities, including instruction following, logical reasoning, text comprehension, mathematics, science, coding and tool usage.
- Substantial gains in long-tail knowledge coverage across multiple languages.
- Markedly better alignment with user preferences in subjective and open-ended tasks, enabling more helpful responses and higher-quality text generation.
- Enhanced capabilities in 256K long-context understanding.

Qwen3-30B-A3B-Instruct-2507 has the following features:

- Type: Causal Language Models
- Training Stage: Pretraining & Post-training
- Number of Parameters: 30.5B in total and 3.3B activated
- Number of Parameters (Non-Embedding): 29.9B
- Number of Layers: 48
- Number of Attention Heads (GQA): 32 for Q and 4 for KV
- Number of Experts: 128
- Number of Activated Experts: 8
- Context Length: 262,144 natively.

NOTE: This model supports only non-thinking mode and does not generate `<think></think>` blocks in its output. Meanwhile, specifying `enable_thinking=False` is no longer required.

For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our blog, GitHub, and Documentation.
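The feature list above gives enough figures to work out how sparse the model is per token. The short calculation below is only arithmetic on the numbers quoted in the card (30.5B total and 3.3B activated parameters, 8 of 128 experts, 32 query heads and 4 KV heads); the variable names are illustrative.

```python
# Figures quoted from the feature list above.
TOTAL_PARAMS_B = 30.5
ACTIVE_PARAMS_B = 3.3
NUM_EXPERTS = 128
ACTIVE_EXPERTS = 8
Q_HEADS = 32
KV_HEADS = 4

expert_fraction = ACTIVE_EXPERTS / NUM_EXPERTS      # share of experts routed per token
param_fraction = ACTIVE_PARAMS_B / TOTAL_PARAMS_B   # share of weights used per token
q_per_kv = Q_HEADS // KV_HEADS                      # query heads sharing one KV head

print(f"experts used per token: {expert_fraction:.4f}")
print(f"parameters active per token: {param_fraction:.4f}")
print(f"query heads per KV head: {q_per_kv}")
```

The active-parameter share comes out higher than the expert share. A plausible reading is that attention and embedding weights are used for every token regardless of routing, but that interpretation is an inference and is not stated in the card.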
| | Deepseek-V3-0324 | GPT-4o-0327 | Gemini-2.5-Flash Non-Thinking | Qwen3-235B-A22B Non-Thinking | Qwen3-30B-A3B Non-Thinking | Qwen3-30B-A3B-Instruct-2507 |
|--- | --- | --- | --- | --- | --- | --- |
| Knowledge | | | | | | |
| MMLU-Pro | 81.2 | 79.8 | 81.1 | 75.2 | 69.1 | 78.4 |
| MMLU-Redux | 90.4 | 91.3 | 90.6 | 89.2 | 84.1 | 89.3 |
| GPQA | 68.4 | 66.9 | 78.3 | 62.9 | 54.8 | 70.4 |
| SuperGPQA | 57.3 | 51.0 | 54.6 | 48.2 | 42.2 | 53.4 |
| Reasoning | | | | | | |
| AIME25 | 46.6 | 26.7 | 61.6 | 24.7 | 21.6 | 61.3 |
| HMMT25 | 27.5 | 7.9 | 45.8 | 10.0 | 12.0 | 43.0 |
| ZebraLogic | 83.4 | 52.6 | 57.9 | 37.7 | 33.2 | 90.0 |
| LiveBench 20241125 | 66.9 | 63.7 | 69.1 | 62.5 | 59.4 | 69.0 |
| Coding | | | | | | |
| LiveCodeBench v6 (25.02-25.05) | 45.2 | 35.8 | 40.1 | 32.9 | 29.0 | 43.2 |
| MultiPL-E | 82.2 | 82.7 | 77.7 | 79.3 | 74.6 | 83.8 |
| Aider-Polyglot | 55.1 | 45.3 | 44.0 | 59.6 | 24.4 | 35.6 |
| Alignment | | | | | | |
| IFEval | 82.3 | 83.9 | 84.3 | 83.2 | 83.7 | 84.7 |
| Arena-Hard v2\* | 45.6 | 61.9 | 58.3 | 52.0 | 24.8 | 69.0 |
| Creative Writing v3 | 81.6 | 84.9 | 84.6 | 80.4 | 68.1 | 86.0 |
| WritingBench | 74.5 | 75.5 | 80.5 | 77.0 | 72.2 | 85.5 |
| Agent | | | | | | |
| BFCL-v3 | 64.7 | 66.5 | 66.1 | 68.0 | 58.6 | 65.1 |
| TAU1-Retail | 49.6 | 60.3# | 65.2 | 65.2 | 38.3 | 59.1 |
| TAU1-Airline | 32.0 | 42.8# | 48.0 | 32.0 | 18.0 | 40.0 |
| TAU2-Retail | 71.1 | 66.7# | 64.3 | 64.9 | 31.6 | 57.0 |
| TAU2-Airline | 36.0 | 42.0# | 42.5 | 36.0 | 18.0 | 38.0 |
| TAU2-Telecom | 34.0 | 29.8# | 16.9 | 24.6 | 18.4 | 12.3 |
| Multilingualism | | | | | | |
| MultiIF | 66.5 | 70.4 | 69.4 | 70.2 | 70.8 | 67.9 |
| MMLU-ProX | 75.8 | 76.2 | 78.3 | 73.2 | 65.1 | 72.0 |
| INCLUDE | 80.1 | 82.1 | 83.8 | 75.6 | 67.8 | 71.9 |
| PolyMATH | 32.2 | 25.5 | 41.9 | 27.0 | 23.3 | 43.1 |

\*: For reproducibility, we report the win rates evaluated by GPT-4.1.
\#: Results were generated using GPT-4o-20241120, as access to the native function calling API of GPT-4o-0327 was unavailable.

The code of Qwen3-MoE is included in the latest Hugging Face `transformers`, and we advise you to use the latest version of `transformers`.

For deployment, you can use `sglang>=0.4.6.post1` or `vllm>=0.8.5` to create an OpenAI-compatible API endpoint.

Note: If you encounter out-of-memory (OOM) issues, consider reducing the context length to a shorter value, such as `32,768`.

For local use, applications such as Ollama, LMStudio, MLX-LM, llama.cpp, and KTransformers also support Qwen3.

Qwen3 excels in tool calling capabilities. We recommend using Qwen-Agent to make the best use of the agentic ability of Qwen3. Qwen-Agent encapsulates tool-calling templates and tool-calling parsers internally, greatly reducing coding complexity.

To define the available tools, you can use the MCP configuration file, use the integrated tools of Qwen-Agent, or integrate other tools yourself.

To achieve optimal performance, we recommend the following settings:

1. Sampling Parameters:
   - We suggest using `Temperature=0.7`, `TopP=0.8`, `TopK=20`, and `MinP=0`.
   - For supported frameworks, you can adjust the `presence_penalty` parameter between 0 and 2 to reduce endless repetitions. However, using a higher value may occasionally result in language mixing and a slight decrease in model performance.
2. Adequate Output Length: We recommend using an output length of 16,384 tokens for most queries, which is adequate for instruct models.
3. Standardize Output Format: We recommend using prompts to standardize model outputs when benchmarking.
   - Math Problems: Include "Please reason step by step, and put your final answer within \boxed{}." in the prompt.
   - Multiple-Choice Questions: Add the following JSON structure to the prompt to standardize responses: "Please show your choice in the `answer` field with only the choice letter, e.g., `"answer": "C"`."
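The recommended settings above can be collected into a single request body for an OpenAI-compatible endpoint. The sketch below only builds and prints the payload and sends nothing. It rests on assumptions: the field names follow the OpenAI chat-completions convention, `top_k` and `min_p` are non-standard extras that a given server may or may not accept, and the model name is a hypothetical served-model identifier.

```python
import json

# Values quoted from the recommendations above; key names are assumed
# (OpenAI-style), and top_k / min_p are server-specific extensions.
RECOMMENDED_SAMPLING = {"temperature": 0.7, "top_p": 0.8, "top_k": 20, "min_p": 0.0}
MAX_OUTPUT_TOKENS = 16384
MATH_SUFFIX = "Please reason step by step, and put your final answer within \\boxed{}."


def build_chat_request(prompt, model="qwen3-30b-a3b-instruct-2507", presence_penalty=0.0):
    """Return a chat-completions style payload as a plain dict."""
    if not 0.0 <= presence_penalty <= 2.0:
        raise ValueError("presence_penalty should stay between 0 and 2")
    payload = {
        "model": model,  # hypothetical name; use whatever your server registers
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": MAX_OUTPUT_TOKENS,
        "presence_penalty": presence_penalty,
    }
    payload.update(RECOMMENDED_SAMPLING)
    return payload


request = build_chat_request("What is 17 * 23? " + MATH_SUFFIX, presence_penalty=0.5)
print(json.dumps(request, indent=2))
```

Keeping the sampling values in one dictionary makes it straightforward to drop the keys a particular backend rejects.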
If you find our work helpful, feel free to cite it.

NaNK
license:apache-2.0
5,291
12

Mistral-Small-3.1-24B-Instruct-2503-bnb-4bit

NaNK
license:apache-2.0
5,167
1

Meta-Llama-3.1-70B-bnb-4bit

NaNK
llama
5,123
31

Qwen2.5-Coder-7B-Instruct-128K-GGUF

NaNK
license:apache-2.0
5,109
19

Qwen2.5-Omni-3B-GGUF

Unsloth Dynamic 2.0 achieves superior accuracy & outperforms other leading quants.

Qwen2.5-Omni is an end-to-end multimodal model designed to perceive diverse modalities, including text, images, audio, and video, while simultaneously generating text and natural speech responses in a streaming manner.

- Omni and Novel Architecture: We propose the Thinker-Talker architecture, an end-to-end multimodal model designed to perceive diverse modalities, including text, images, audio, and video, while simultaneously generating text and natural speech responses in a streaming manner. We propose a novel position embedding, named TMRoPE (Time-aligned Multimodal RoPE), to synchronize the timestamps of video inputs with audio.
- Real-Time Voice and Video Chat: Architecture designed for fully real-time interactions, supporting chunked input and immediate output.
- Natural and Robust Speech Generation: Surpasses many existing streaming and non-streaming alternatives, demonstrating superior robustness and naturalness in speech generation.
- Strong Performance Across Modalities: Exhibits exceptional performance across all modalities when benchmarked against similarly sized single-modality models. Qwen2.5-Omni outperforms the similarly sized Qwen2-Audio in audio capabilities and achieves comparable performance to Qwen2.5-VL-7B.
- Excellent End-to-End Speech Instruction Following: Qwen2.5-Omni shows performance in end-to-end speech instruction following that rivals its effectiveness with text inputs, evidenced by benchmarks such as MMLU and GSM8K.

We conducted a comprehensive evaluation of Qwen2.5-Omni, which demonstrates strong performance across all modalities when compared to similarly sized single-modality models and closed-source models like Qwen2.5-VL-7B, Qwen2-Audio, and Gemini-1.5-pro. In tasks requiring the integration of multiple modalities, such as OmniBench, Qwen2.5-Omni achieves state-of-the-art performance.
Furthermore, in single-modality tasks, it excels in areas including speech recognition (Common Voice), translation (CoVoST2), audio understanding (MMAU), image reasoning (MMMU, MMStar), video understanding (MVBench), and speech generation (Seed-tts-eval and subjective naturalness).

- OmniBench (Speech | Sound Event | Music | Avg): Gemini-1.5-Pro 42.67% | 42.26% | 46.23% | 42.91%
- Librispeech (dev-clean | dev other | test-clean | test-other): SALMONN - | - | 2.1 | 4.9
- Common Voice 15 (en | zh | yue | fr): Whisper-large-v3 9.3 | 12.8 | 10.9 | 10.8
- Wenetspeech (test-net | test-meeting): Seed-ASR-Chinese 4.7 | 5.7
- CoVoST2 (en-de | de-en | en-zh | zh-en): SALMONN 18.6 | - | 33.1 | -
- MusicCaps:
  - LP-MusicCaps 0.291 | 0.149 | 0.089 | 0.061 | 0.129 | 0.130
  - Qwen2.5-Omni-3B 0.325 | 0.163 | 0.093 | 0.057 | 0.132 | 0.229
  - Qwen2.5-Omni-7B 0.328 | 0.162 | 0.090 | 0.055 | 0.127 | 0.225
- MMAU (Sound | Music | Speech | Avg): Gemini-Pro-V1.5 56.75 | 49.40 | 58.55 | 54.90
- VoiceBench (AlpacaEval | CommonEval | SD-QA | MMSU): Ultravox-v0.4.1-LLaMA-3.1-8B 4.55 | 3.90 | 53.35 | 47.17
- VoiceBench (OpenBookQA | IFEval | AdvBench | Avg): Ultravox-v0.4.1-LLaMA-3.1-8B 65.27 | 66.88 | 98.46 | 71.45

| Dataset | Qwen2.5-Omni-7B | Qwen2.5-Omni-3B | Other Best | Qwen2.5-VL-7B | GPT-4o-mini |
|--------------------------------|--------------|------------|------------|---------------|-------------|
| MMMU val | 59.2 | 53.1 | 53.9 | 58.6 | 60.0 |
| MMMU-Pro overall | 36.6 | 29.7 | - | 38.3 | 37.6 |
| MathVista testmini | 67.9 | 59.4 | 71.9 | 68.2 | 52.5 |
| MathVision full | 25.0 | 20.8 | 23.1 | 25.1 | - |
| MMBench-V1.1-EN test | 81.8 | 77.8 | 80.5 | 82.6 | 76.0 |
| MMVet turbo | 66.8 | 62.1 | 67.5 | 67.1 | 66.9 |
| MMStar | 64.0 | 55.7 | 64.0 | 63.9 | 54.8 |
| MME sum | 2340 | 2117 | 2372 | 2347 | 2003 |
| MuirBench | 59.2 | 48.0 | - | 59.2 | - |
| CRPE relation | 76.5 | 73.7 | - | 76.4 | - |
| RealWorldQA avg | 70.3 | 62.6 | 71.9 | 68.5 | - |
| MME-RealWorld en | 61.6 | 55.6 | - | 57.4 | - |
| MM-MT-Bench | 6.0 | 5.0 | - | 6.3 | - |
| AI2D | 83.2 | 79.5 | 85.8 | 83.9 | - |
| TextVQA val | 84.4 | 79.8 | 83.2 | 84.9 | - |
| DocVQA test | 95.2 | 93.3 | 93.5 | 95.7 | - |
| ChartQA test Avg | 85.3 | 82.8 | 84.9 | 87.3 | - |
| OCRBenchV2 en | 57.8 | 51.7 | - | 56.3 | - |

| Dataset | Qwen2.5-Omni-7B | Qwen2.5-Omni-3B | Qwen2.5-VL-7B | Grounding DINO | Gemini 1.5 Pro |
|--------------------------|--------------|---------------|---------------|----------------|----------------|
| Refcoco val | 90.5 | 88.7 | 90.0 | 90.6 | 73.2 |
| Refcoco textA | 93.5 | 91.8 | 92.5 | 93.2 | 72.9 |
| Refcoco textB | 86.6 | 84.0 | 85.4 | 88.2 | 74.6 |
| Refcoco+ val | 85.4 | 81.1 | 84.2 | 88.2 | 62.5 |
| Refcoco+ textA | 91.0 | 87.5 | 89.1 | 89.0 | 63.9 |
| Refcoco+ textB | 79.3 | 73.2 | 76.9 | 75.9 | 65.0 |
| Refcocog+ val | 87.4 | 85.0 | 87.2 | 86.1 | 75.2 |
| Refcocog+ test | 87.9 | 85.1 | 87.2 | 87.0 | 76.2 |
| ODinW | 42.4 | 39.2 | 37.3 | 55.0 | 36.7 |
| PointGrounding | 66.5 | 46.2 | 67.3 | - | - |

| Dataset | Qwen2.5-Omni-7B | Qwen2.5-Omni-3B | Other Best | Qwen2.5-VL-7B | GPT-4o-mini |
|-----------------------------|--------------|------------|------------|---------------|-------------|
| Video-MME w/o sub | 64.3 | 62.0 | 63.9 | 65.1 | 64.8 |
| Video-MME w sub | 72.4 | 68.6 | 67.9 | 71.6 | - |
| MVBench | 70.3 | 68.7 | 67.2 | 69.6 | - |
| EgoSchema test | 68.6 | 61.4 | 63.2 | 65.0 | - |

- SEED (test-zh | test-en | test-hard): Seed-TTS_ICL 1.11 | 2.24 | 7.58
- SEED (test-zh | test-en | test-hard): Seed-TTS_ICL 0.796 | 0.762 | 0.776

| Dataset | Qwen2.5-Omni-7B | Qwen2.5-Omni-3B | Qwen2.5-7B | Qwen2.5-3B | Qwen2-7B | Llama3.1-8B | Gemma2-9B |
|-----------------------------------|-----------|------------|------------|------------|------------|-------------|-----------|
| MMLU-Pro | 47.0 | 40.4 | 56.3 | 43.7 | 44.1 | 48.3 | 52.1 |
| MMLU-redux | 71.0 | 60.9 | 75.4 | 64.4 | 67.3 | 67.2 | 72.8 |
| LiveBench 0831 | 29.6 | 22.3 | 35.9 | 26.8 | 29.2 | 26.7 | 30.6 |
| GPQA | 30.8 | 34.3 | 36.4 | 30.3 | 34.3 | 32.8 | 32.8 |
| MATH | 71.5 | 63.6 | 75.5 | 65.9 | 52.9 | 51.9 | 44.3 |
| GSM8K | 88.7 | 82.6 | 91.6 | 86.7 | 85.7 | 84.5 | 76.7 |
| HumanEval | 78.7 | 70.7 | 84.8 | 74.4 | 79.9 | 72.6 | 68.9 |
| MBPP | 73.2 | 70.4 | 79.2 | 72.7 | 67.2 | 69.6 | 74.9 |
| MultiPL-E | 65.8 | 57.6 | 70.4 | 60.2 | 59.1 | 50.7 | 53.4 |
| LiveCodeBench 2305-2409 | 24.6 | 16.5 | 28.7 | 19.9 | 23.9 | 8.3 | 18.9 |

Qwen2.5-Omni can be used with 🤗 Transformers. The code of Qwen2.5-Omni is included in the latest Hugging Face `transformers`, and we advise you to build it from source.

We offer a toolkit to help you handle various types of audio and visual input more conveniently, as if you were using an API. This includes base64, URLs, and interleaved audio, images and videos. Install it with `pip` and make sure your system has `ffmpeg` installed.

If you are not using Linux, you might not be able to install `decord` from PyPI. In that case, you can use `pip install qwen-omni-utils -U`, which will fall back to using torchvision for video processing. However, you can still install decord from source so that it is used when loading video.

The chat model is used through `transformers` together with `qwen_omni_utils`.

| Model | Precision | 15 s Video | 30 s Video | 60 s Video |
|--------------|-----------| ------------- | ------------- | ------------------ |
| Qwen-Omni-3B | FP32 | 89.10 GB | Not Recommended | Not Recommended |
| Qwen-Omni-3B | BF16 | 18.38 GB | 22.43 GB | 28.22 GB |
| Qwen-Omni-7B | FP32 | 93.56 GB | Not Recommended | Not Recommended |
| Qwen-Omni-7B | BF16 | 31.11 GB | 41.85 GB | 60.19 GB |

Note: The table above presents the theoretical minimum memory requirements for inference with `transformers`, and `BF16` was tested with `attn_implementation="flash_attention_2"`; in practice, the actual memory usage is typically at least 1.2 times higher. For more information, see the linked resource.

Video URL compatibility largely depends on the third-party library version.
The details are in the table below. Change the backend with `FORCE_QWENVL_VIDEO_READER=torchvision` or `FORCE_QWENVL_VIDEO_READER=decord` if you prefer not to use the default one.

| Backend | HTTP | HTTPS |
|-------------|------|-------|
| torchvision >= 0.19.0 | ✅ | ✅ |
| torchvision

The model can batch inputs composed of mixed samples of various types, such as text, images, audio and videos, when `return_audio=False` is set. Here is an example.

Prompt for audio output

If users need audio output, the system prompt must be set to "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."; otherwise the audio output may not work as expected.

Use audio in video

In multimodal interaction, the videos provided by users are often accompanied by audio (such as questions about the content in the video, or sounds generated by certain events in the video). This information helps the model provide a better interactive experience, so we provide the following options for users to decide whether to use audio in video.

It is worth noting that during a multi-round conversation, the `use_audio_in_video` parameter in these places must be set to the same value; otherwise unexpected results will occur.

The model supports both text and audio outputs. If users do not need audio outputs, they can call `model.disable_talker()` after initializing the model. This option saves about 2GB of GPU memory, but the `return_audio` option of the `generate` function can then only be set to `False`.

For a more flexible experience, we recommend that users decide whether to return audio when the `generate` function is called. If `return_audio` is set to `False`, the model returns only text outputs, which makes text responses faster.

Change voice type of output audio

Qwen2.5-Omni supports changing the voice of the output audio. The `"Qwen/Qwen2.5-Omni-3B"` checkpoint supports two voice types, as follows:

| Voice Type | Gender | Description |
|------------|--------|-------------|
| Chelsie | Female | A honeyed, velvety voice that carries a gentle warmth and luminous clarity. |
| Ethan | Male | A bright, upbeat voice with infectious energy and a warm, approachable vibe. |

Users can use the `speaker` parameter of the `generate` function to specify the voice type. If `speaker` is not specified, the default voice type is `Chelsie`.

First, make sure to install the latest version of Flash Attention 2:

Also, you should have hardware that is compatible with FlashAttention 2. Read more about it in the official documentation of the flash attention repository. FlashAttention-2 can only be used when a model is loaded in `torch.float16` or `torch.bfloat16`.

To load and run a model using FlashAttention-2, add `attn_implementation="flash_attention_2"` when loading the model:

If you find our paper and code useful in your research, please consider giving a star :star: and citation :pencil: :)
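The BF16 memory figures in the table above are described as theoretical minimums, with real usage "at least 1.2 times higher". A minimal sketch of that arithmetic follows; the names `THEORETICAL_MIN_GB`, `practical_memory_gb` and `fits_on_gpu` are hypothetical helpers, not part of any Qwen tooling, and the 1.2 factor is treated as a lower bound rather than a measured value.

```python
# Theoretical minimum GPU memory in GB, copied from the BF16 rows of the table above.
THEORETICAL_MIN_GB = {
    ("Qwen-Omni-3B", 15): 18.38,
    ("Qwen-Omni-3B", 30): 22.43,
    ("Qwen-Omni-3B", 60): 28.22,
    ("Qwen-Omni-7B", 15): 31.11,
    ("Qwen-Omni-7B", 30): 41.85,
    ("Qwen-Omni-7B", 60): 60.19,
}

# The note says practical usage is "at least 1.2 times higher"; this is a lower bound.
PRACTICAL_FACTOR = 1.2


def practical_memory_gb(model: str, video_seconds: int) -> float:
    """Scale the table value by the 1.2x rule of thumb (hypothetical helper)."""
    return THEORETICAL_MIN_GB[(model, video_seconds)] * PRACTICAL_FACTOR


def fits_on_gpu(model: str, video_seconds: int, gpu_memory_gb: float) -> bool:
    """Check whether the scaled estimate fits in the given GPU memory."""
    return practical_memory_gb(model, video_seconds) <= gpu_memory_gb


if __name__ == "__main__":
    for (model, seconds), minimum in THEORETICAL_MIN_GB.items():
        estimate = practical_memory_gb(model, seconds)
        print(f"{model}, {seconds}s video: {minimum:.2f} GB minimum, ~{estimate:.2f} GB or more in practice")
```

Under this reading, the 7B model with a 60-second video would need roughly 72 GB or more, so the estimate is better used to rule out a GPU than to guarantee a fit.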

5,014
25

Qwen2.5-7B-unsloth-bnb-4bit

license:apache-2.0
4,953
1

Llama-3.2-3B-unsloth-bnb-4bit

llama
4,906
2

embeddinggemma-300m-GGUF

4,854
34

Llama-3.1-8B-unsloth-bnb-4bit

llama
4,834
4

grok-2-GGUF

Learn how to run Grok 2 correctly - Read our Guide. Unsloth Dynamic 2.0 achieves superior accuracy & outperforms other leading quants.

- Use `--jinja` for `llama.cpp`. You must use PR 15539. For example, use the code below:
  - `git clone https://github.com/ggml-org/llama.cpp`
  - `cd llama.cpp && git fetch origin pull/15539/head:MASTER && git checkout MASTER && cd ..`
- Utilizes Alvaro's Grok-2 HF-compatible tokenizer as provided here

This repository contains the weights of Grok 2, a model trained and used at xAI in 2024.

- Download the weights. You can replace `/local/grok-2` with any other folder name you prefer.

  You might encounter some errors during the download. Please retry until the download is successful. If the download succeeds, the folder should contain 42 files and be approximately 500 GB.

- Install the latest SGLang inference engine (>= v0.5.1) from https://github.com/sgl-project/sglang/

- Use the command below to launch an inference server. This checkpoint is TP=8, so you will need 8 GPUs (each with > 40GB of memory).

- This is a post-trained model, so please use the correct chat template.

  You should be able to see the model output its name, Grok.

The weights are licensed under the Grok 2 Community License Agreement.
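Because the card says a successful download should leave 42 files totalling roughly 500 GB, a quick sanity check can catch a partial download before launching a server. The sketch below is only an illustration: `summarize_download` and `looks_complete` are hypothetical helpers, the 10% size tolerance is an assumption, and only files directly inside the folder are counted.

```python
from pathlib import Path

EXPECTED_FILE_COUNT = 42   # stated in the model card
EXPECTED_SIZE_GB = 500     # "approximately 500 GB" per the model card
SIZE_TOLERANCE = 0.10      # assumption: accept +/-10% around the expected size


def summarize_download(folder: str) -> tuple[int, float]:
    """Return (number of files, total size in GB) for files directly in `folder`."""
    files = [p for p in Path(folder).iterdir() if p.is_file()]
    total_bytes = sum(p.stat().st_size for p in files)
    return len(files), total_bytes / 1e9


def looks_complete(folder: str) -> bool:
    """Heuristic check against the file count and approximate size from the card."""
    count, size_gb = summarize_download(folder)
    size_ok = abs(size_gb - EXPECTED_SIZE_GB) <= EXPECTED_SIZE_GB * SIZE_TOLERANCE
    return count == EXPECTED_FILE_COUNT and size_ok
```

A call such as `looks_complete("/local/grok-2")` would then return `False` for a folder that is missing shards, which is a cue to retry the download as the card suggests.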

4,735
46

Qwen3-1.7B-Base-unsloth-bnb-4bit

license:apache-2.0
4,714
3

Qwen2-VL-2B-Instruct-unsloth-bnb-4bit

license:apache-2.0
4,573
7

Kimi-K2-Instruct-0905-GGUF

Learn how to run Kimi-K2 Dynamic GGUFs - Read our Guide! Unsloth Dynamic 2.0 achieves superior accuracy & outperforms other leading quants.

- You can now use the latest update of llama.cpp to run the model.
- For complete detailed instructions, see our guide: docs.unsloth.ai/basics/kimi-k2

It is recommended to have at least 128GB of unified RAM to run the small quants. With 16GB VRAM and 256GB RAM, expect 5+ tokens/sec. For best results, use any 2-bit XL quant or above. Set the temperature to 0.6 (recommended) to reduce repetition and incoherence.

📰 Tech Blog | 📄 Paper

Kimi K2-Instruct-0905 is the latest, most capable version of Kimi K2. It is a state-of-the-art mixture-of-experts (MoE) language model, featuring 32 billion activated parameters and a total of 1 trillion parameters.

Key Features

- Enhanced agentic coding intelligence: Kimi K2-Instruct-0905 demonstrates significant improvements in performance on public benchmarks and real-world coding agent tasks.
- Improved frontend coding experience: Kimi K2-Instruct-0905 offers advancements in both the aesthetics and practicality of frontend programming.
- Extended context length: Kimi K2-Instruct-0905’s context window has been increased from 128k to 256k tokens, providing better support for long-horizon tasks.
| | |
|:---:|:---:|
| Architecture | Mixture-of-Experts (MoE) |
| Total Parameters | 1T |
| Activated Parameters | 32B |
| Number of Layers (Dense layer included) | 61 |
| Number of Dense Layers | 1 |
| Attention Hidden Dimension | 7168 |
| MoE Hidden Dimension (per Expert) | 2048 |
| Number of Attention Heads | 64 |
| Number of Experts | 384 |
| Selected Experts per Token | 8 |
| Number of Shared Experts | 1 |
| Vocabulary Size | 160K |
| Context Length | 256K |
| Attention Mechanism | MLA |
| Activation Function | SwiGLU |

| Benchmark | Metric | K2-Instruct-0905 | K2-Instruct-0711 | Qwen3-Coder-480B-A35B-Instruct | GLM-4.5 | DeepSeek-V3.1 | Claude-Sonnet-4 | Claude-Opus-4 |
|------------------------|--------|------------------|------------------|--------|--------|--------|-----------------|---------------|
| SWE-Bench verified | ACC | 69.2 ± 0.63 | 65.8 | 69.6 | 64.2 | 66.0 | 72.7 | 72.5 |
| SWE-Bench Multilingual | ACC | 55.9 ± 0.72 | 47.3 | 54.7 | 52.7 | 54.5 | 53.3 | - |
| Multi-SWE-Bench | ACC | 33.5 ± 0.28 | 31.3 | 32.7 | 31.7 | 29.0 | 35.7 | - |
| Terminal-Bench | ACC | 44.5 ± 2.03 | 37.5 | 37.5 | 39.9 | 31.3 | 36.4 | 43.2 |
| SWE-Dev | ACC | 66.6 ± 0.72 | 61.9 | 64.7 | 63.2 | 53.3 | 67.1 | - |

All K2-Instruct-0905 numbers are reported as mean ± std over five independent, full-test-set runs. Before each run we prune the repository so that every Git object unreachable from the target commit disappears; this guarantees the agent sees only the code that would legitimately be available at that point in history.

Except for Terminal-Bench (Terminus-2), every result was produced with our in-house evaluation harness. The harness is derived from SWE-agent, but we clamp the context windows of the Bash and Edit tools and rewrite the system prompt to match the task semantics.
All baseline figures denoted with an asterisk (*) are excerpted directly from their official report or public leaderboard; the remaining metrics were evaluated by us under conditions identical to those used for K2-Instruct-0905.

For SWE-Dev we go one step further: we overwrite the original repository files and delete any test file that exercises the functions the agent is expected to generate, eliminating any indirect hints about the desired implementation.

4. Deployment

> [!Note]
> You can access Kimi K2's API on https://platform.moonshot.ai; we provide an OpenAI/Anthropic-compatible API for you.
>
> The Anthropic-compatible API maps temperature by `real_temperature = request_temperature * 0.6` for better compatibility with existing applications.

Our model checkpoints are stored in the block-fp8 format; you can find them on Hugging Face.

Currently, Kimi-K2 is recommended to run on the following inference engines:

Deployment examples for vLLM and SGLang can be found in the Model Deployment Guide.

Once the local inference service is up, you can interact with it through the chat endpoint:

> [!NOTE]
> The recommended temperature for Kimi-K2-Instruct-0905 is `temperature = 0.6`.
> If no special instructions are required, the system prompt above is a good default.

Kimi-K2-Instruct-0905 has strong tool-calling capabilities. To enable them, you need to pass the list of available tools in each request; the model will then autonomously decide when and how to invoke them.

The following example demonstrates calling a weather tool end-to-end:

The `tool_call_with_client` function implements the pipeline from user query to tool execution. This pipeline requires the inference engine to support Kimi-K2’s native tool-parsing logic. For more information, see the Tool Calling Guide.

Both the code repository and the model weights are released under the Modified MIT License.

If you have any questions, please reach out at [email protected].
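The temperature mapping quoted for the Anthropic-compatible API can be illustrated with a small sketch. It assumes the formula is a plain multiplication of the requested temperature by 0.6 (the operator may have been lost in extraction); the function names are hypothetical and no network call is made.

```python
KIMI_TEMPERATURE_SCALE = 0.6  # factor quoted in the deployment note


def real_temperature(request_temperature: float) -> float:
    """Effective temperature, assuming a plain multiplication by 0.6."""
    return request_temperature * KIMI_TEMPERATURE_SCALE


def request_temperature_for(target_real_temperature: float) -> float:
    """Inverse mapping: the value to send so the effective temperature hits the target."""
    return target_real_temperature / KIMI_TEMPERATURE_SCALE


if __name__ == "__main__":
    # Under this reading, a client default of 1.0 lands on the recommended 0.6.
    print(real_temperature(1.0))
    print(request_temperature_for(0.6))
```

If the assumption holds, applications that leave the temperature at a typical default of 1.0 would already get the recommended effective value, which may be the intent behind the mapping.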

4,559
51

Hermes-4-70B

> [!NOTE]
> Includes Unsloth chat template fixes! For `llama.cpp`, use `--jinja`
>
> Unsloth Dynamic 2.0 achieves superior accuracy & outperforms other leading quants.

Hermes 4 70B is a frontier, hybrid-mode reasoning model based on Llama-3.1-70B by Nous Research that is aligned to you.

Read the Hermes 4 technical report here: Hermes 4 Technical Report

Chat with Hermes in Nous Chat: https://chat.nousresearch.com

Training highlights include a newly synthesized post-training corpus emphasizing verified reasoning traces, massive improvements in math, code, STEM, logic, creativity, and format-faithful outputs, while preserving general assistant quality and broadly neutral alignment.

- Post-training corpus: Massively increased dataset size from 1M samples and 1.2B tokens to ~5M samples / ~60B tokens blended across reasoning and non-reasoning data.
- Hybrid reasoning mode with explicit `<think> … </think>` segments when the model decides to deliberate, and options to make your responses faster when you want.
- Reasoning that is top quality and expressive, and that improves math, code, STEM, logic, and even creative writing and subjective responses.
- Schema adherence & structured outputs: trained to produce valid JSON for given schemas and to repair malformed objects.
- Much easier to steer and align: extreme improvements on steerability, especially on reduced refusal rates.

In pursuit of the mission of producing models that are open, steerable and capable of producing the full range of human expression, while being able to be aligned to your values, we created a new benchmark, RefusalBench, that tests the model's willingness to be helpful in a variety of scenarios commonly disallowed by closed and open models.

Hermes 4 achieves SOTA on RefusalBench across all popular closed and open models in being helpful and conforming to your values, without censorship.

> Full tables, settings, and comparisons are in the technical report.

Hermes 4 uses Llama-3-Chat format with role headers and special tags.
Reasoning mode can be activated with the chat template via the flag `thinking=True` or by using the following system prompt:

Note that you can add any additional system instructions before or after this system message, and it will adjust the model's policies, style, and effort of thinking, as well as its post-thinking style, format, identity, and more. You may also interleave the tool definition system message with the reasoning one.

Additionally, we provide a flag to keep the content in between the `<think> ... </think>` tags, which you can enable by setting `keep_cots=True`.

Hermes 4 supports function/tool calls within a single assistant turn, produced after its reasoning:

Note that you may also simply place tool definitions into the "tools:" field of your messages, and the chat template will parse and create the system prompt for you. This also works with reasoning mode for improved accuracy of tool use.

The model will then generate tool calls within `<tool_call> {tool_call} </tool_call>` tags, for easy parsing. The tool_call tags are also added tokens, which makes them easy to parse while streaming! There are also automatic tool parsers built into vLLM and SGLang for Hermes: set the tool parser in vLLM to `hermes` and in SGLang to `qwen25`.

- Sampling defaults that work well: `temperature=0.6, top_p=0.95, top_k=20`.
- Template: Use the Llama chat format for Hermes 4 70B and 405B as shown above, or set `add_generation_prompt=True` when using `tokenizer.apply_chat_template(...)`.

For production serving on multi-GPU nodes, consider tensor parallel inference engines (e.g., SGLang/vLLM backends) with prefix caching.

Hermes 4 is available as BF16 original weights as well as FP8 variants and GGUF variants by LM Studio.

FP8: https://huggingface.co/NousResearch/Hermes-4-70B-FP8

GGUF (Courtesy of LM Studio team!): https://huggingface.co/lmstudio-community/Hermes-4-70B-GGUF

Hermes 4 is also available in smaller sizes (e.g., 70B) with similar prompt formats.
See the Hermes 4 collection to explore them all: https://huggingface.co/collections/NousResearch/hermes-4-collection-68a731bfd452e20816725728
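Since the card describes tool calls as being emitted inside dedicated tags for easy parsing, the following is a minimal parsing sketch. It assumes the tags are `<tool_call>` and `</tool_call>` and that each payload is a single JSON object; the function name `extract_tool_calls` and the sample output are hypothetical, and a serving stack such as vLLM or SGLang would normally use its own built-in parser instead.

```python
import json
import re

# Assumption: each tool call is a JSON object wrapped in <tool_call>...</tool_call>.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def extract_tool_calls(text: str) -> list[dict]:
    """Return every well-formed JSON tool call found in the model output."""
    calls = []
    for raw in TOOL_CALL_RE.findall(text):
        try:
            calls.append(json.loads(raw))
        except json.JSONDecodeError:
            # Skip malformed payloads rather than failing the whole response.
            continue
    return calls


if __name__ == "__main__":
    # Hypothetical model output used only to exercise the parser.
    sample = (
        "Let me check that.\n"
        '<tool_call>\n{"name": "get_weather", "arguments": {"city": "Paris"}}\n</tool_call>'
    )
    for call in extract_tool_calls(sample):
        print(call["name"], call["arguments"])
```

Skipping malformed payloads is a design choice for the sketch; an application might prefer to surface the error so the model can be asked to repair the object.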

llama
4,550
0

mistral-7b-instruct-v0.3

license:apache-2.0
4,536
9

Qwen3-1.7B-bnb-4bit

4,520
3

phi-4-GGUF

See our collection for versions of Phi-4 including GGUF, 4-bit & more formats.

We have converted Phi-4 to Llama's architecture for improved ease of use, better fine-tuning, and greater accuracy. Also contains Unsloth's Phi-4 bugfixes.

Finetune Phi-4, Llama 3.3 2-5x faster with 70% less memory via Unsloth!

We have a free Google Colab notebook for Phi-4 here: https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Phi4-Conversational.ipynb

All notebooks are beginner friendly! Add your dataset, click "Run All", and you'll get a 2x faster finetuned model.

| Unsloth supports | Free Notebooks | Performance | Memory use |
|-----------------|----------------|-------------|----------|
| Phi-4 | ▶️ Start on Colab | 2x faster | 50% less |
| Llama-3.2 (3B) | ▶️ Start on Colab | 2.4x faster | 58% less |
| Llama-3.2 (11B vision) | ▶️ Start on Colab | 2x faster | 60% less |
| Qwen2 VL (7B) | ▶️ Start on Colab | 1.8x faster | 60% less |
| Qwen2.5 (7B) | ▶️ Start on Colab | 2x faster | 60% less |
| Llama-3.1 (8B) | ▶️ Start on Colab | 2.4x faster | 58% less |
| Gemma 2 (9B) | ▶️ Start on Colab | 2.4x faster | 58% less |
| Mistral (7B) | ▶️ Start on Colab | 2.2x faster | 62% less |

- This Llama 3.2 conversational notebook is useful for ShareGPT ChatML / Vicuna templates.
- This text completion notebook is for raw text. This DPO notebook replicates Zephyr.
- \* Kaggle has 2x T4s, but we use 1. Due to overhead, 1x T4 is 5x faster.
| | |
|-------------------------|-------------------------------------------------------------------------------|
| Developers | Microsoft Research |
| Description | `phi-4` is a state-of-the-art open model built upon a blend of synthetic datasets, data from filtered public domain websites, and acquired academic books and Q&A datasets. The goal of this approach was to ensure that small capable models were trained with data focused on high quality and advanced reasoning. `phi-4` underwent a rigorous enhancement and alignment process, incorporating both supervised fine-tuning and direct preference optimization to ensure precise instruction adherence and robust safety measures |
| Architecture | 14B parameters, dense decoder-only Transformer model |
| Inputs | Text, best suited for prompts in the chat format |
| Context length | 16K tokens |
| GPUs | 1920 H100-80G |
| Training time | 21 days |
| Training data | 9.8T tokens |
| Outputs | Generated text in response to input |
| Dates | October 2024 – November 2024 |
| Status | Static model trained on an offline dataset with cutoff dates of June 2024 and earlier for publicly available data |
| Release date | December 12, 2024 |
| License | MIT |

| | |
|-------------------------------|-------------------------------------------------------------------------|
| Primary Use Cases | Our model is designed to accelerate research on language models, for use as a building block for generative AI powered features. It provides uses for general purpose AI systems and applications (primarily in English) which require: 1. Memory/compute constrained environments. 2. Latency bound scenarios. 3. Reasoning and logic. |
| Out-of-Scope Use Cases | Our models are not specifically designed or evaluated for all downstream purposes, thus: 1. Developers should consider common limitations of language models as they select use cases, and evaluate and mitigate for accuracy, safety, and fairness before using within a specific downstream use case, particularly for high-risk scenarios. 2. Developers should be aware of and adhere to applicable laws or regulations (including privacy, trade compliance laws, etc.) that are relevant to their use case, including the model’s focus on English. 3. Nothing contained in this Model Card should be interpreted as or deemed a restriction or modification to the license the model is released under. |

Our training data is an extension of the data used for Phi-3 and includes a wide variety of sources from:

1. Publicly available documents filtered rigorously for quality, selected high-quality educational data, and code.
2. Newly created synthetic, “textbook-like” data for the purpose of teaching math, coding, common sense reasoning, general knowledge of the world (science, daily activities, theory of mind, etc.).
4. High quality chat format supervised data covering various topics to reflect human preferences on different aspects such as instruct-following, truthfulness, honesty and helpfulness.

Multilingual data constitutes about 8% of our overall data. We are focusing on the quality of data that could potentially improve the reasoning ability for the model, and we filter the publicly available documents to contain the correct level of knowledge.

We evaluated `phi-4` using OpenAI’s SimpleEval and our own internal benchmarks to understand the model’s capabilities, more specifically:

- MMLU: Popular aggregated dataset for multitask language understanding.

`phi-4` has adopted a robust safety post-training approach. This approach leverages a variety of both open-source and in-house generated synthetic datasets.
The overall technique employed to do the safety alignment is a combination of SFT (Supervised Fine-Tuning) and iterative DPO (Direct Preference Optimization), including publicly available datasets focusing on helpfulness and harmlessness as well as various questions and answers targeted to multiple safety categories.

Prior to release, `phi-4` followed a multi-faceted evaluation approach. Quantitative evaluation was conducted with multiple open-source safety benchmarks and in-house tools utilizing adversarial conversation simulation. For qualitative safety evaluation, we collaborated with the independent AI Red Team (AIRT) at Microsoft to assess safety risks posed by `phi-4` in both average and adversarial user scenarios. In the average user scenario, AIRT emulated typical single-turn and multi-turn interactions to identify potentially risky behaviors. The adversarial user scenario tested a wide range of techniques aimed at intentionally subverting the model’s safety training including jailbreaks, encoding-based attacks, multi-turn attacks, and adversarial suffix attacks.

Please refer to the technical report for more details on safety alignment.

To understand the capabilities, we compare `phi-4` with a set of models over OpenAI’s SimpleEval benchmark. The table below gives a high-level overview of model quality on representative benchmarks; higher numbers indicate better performance:

| Category | Benchmark | phi-4 (14B) | phi-3 (14B) | Qwen 2.5 (14B instruct) | GPT-4o-mini | Llama-3.3 (70B instruct) | Qwen 2.5 (72B instruct) | GPT-4o |
|------------------------------|---------------|-----------|-----------------|----------------------|----------------------|--------------------|-------------------|-----------------|
| Popular Aggregated Benchmark | MMLU | 84.8 | 77.9 | 79.9 | 81.8 | 86.3 | 85.3 | 88.1 |
| Science | GPQA | 56.1 | 31.2 | 42.9 | 40.9 | 49.1 | 49.0 | 50.6 |
| Math | MGSM | 80.6 | 53.5 | 79.6 | 86.5 | 89.1 | 87.3 | 90.4 |
| Math | MATH | 80.4 | 44.6 | 75.6 | 73.0 | 66.3 | 80.0 | 74.6 |
| Code Generation | HumanEval | 82.6 | 67.8 | 72.1 | 86.2 | 78.9 | 80.4 | 90.6 |
| Factual Knowledge | SimpleQA | 3.0 | 7.6 | 5.4 | 9.9 | 20.9 | 10.2 | 39.4 |
| Reasoning | DROP | 75.5 | 68.3 | 85.5 | 79.3 | 90.2 | 76.7 | 80.9 |

\* The Llama-3.3-70B scores for MATH and HumanEval are lower than those reported by Meta, perhaps because simple-evals has a strict formatting requirement that Llama models have particular trouble following. We use the simple-evals framework because it is reproducible, but Meta reports 77 for MATH and 88 for HumanEval on Llama-3.3-70B.

Given the nature of the training data, `phi-4` is best suited for prompts using the chat format as follows:

Like other language models, `phi-4` can potentially behave in ways that are unfair, unreliable, or offensive. Some of the limiting behaviors to be aware of include:

- Quality of Service: The model is trained primarily on English text. Languages other than English will experience worse performance. English language varieties with less representation in the training data might experience worse performance than standard American English. `phi-4` is not intended to support multilingual use.
- Representation of Harms & Perpetuation of Stereotypes: These models can over- or under-represent groups of people, erase representation of some groups, or reinforce demeaning or negative stereotypes.
  Despite safety post-training, these limitations may still be present due to differing levels of representation of different groups or prevalence of examples of negative stereotypes in training data that reflect real-world patterns and societal biases.
- Inappropriate or Offensive Content: These models may produce other types of inappropriate or offensive content, which may make it inappropriate to deploy for sensitive contexts without additional mitigations that are specific to the use case.
- Information Reliability: Language models can generate nonsensical content or fabricate content that might sound reasonable but is inaccurate or outdated.
- Limited Scope for Code: The majority of `phi-4` training data is based in Python and uses common packages such as `typing`, `math`, `random`, `collections`, `datetime`, `itertools`. If the model generates Python scripts that utilize other packages or scripts in other languages, we strongly recommend users manually verify all API uses.

Developers should apply responsible AI best practices and are responsible for ensuring that a specific use case complies with relevant laws and regulations (e.g. privacy, trade, etc.). Using safety services like Azure AI Content Safety that have advanced guardrails is highly recommended. Important areas for consideration include:

- Allocation: Models may not be suitable for scenarios that could have consequential impact on legal status or the allocation of resources or life opportunities (e.g. housing, employment, credit, etc.) without further assessments and additional debiasing techniques.
- High-Risk Scenarios: Developers should assess suitability of using models in high-risk scenarios where unfair, unreliable or offensive outputs might be extremely costly or lead to harm. This includes providing advice in sensitive or expert domains where accuracy and reliability are critical (e.g. legal or health advice). Additional safeguards should be implemented at the application level according to the deployment context.
- Misinformation: Models may produce inaccurate information. Developers should follow transparency best practices and inform end-users they are interacting with an AI system. At the application level, developers can build feedback mechanisms and pipelines to ground responses in use-case specific, contextual information, a technique known as Retrieval Augmented Generation (RAG).
- Generation of Harmful Content: Developers should assess outputs for their context and use available safety classifiers or custom solutions appropriate for their use case.
- Misuse: Other forms of misuse such as fraud, spam, or malware production may be possible, and developers should ensure that their applications do not violate applicable laws and regulations.
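The card mentions Retrieval Augmented Generation (RAG) as a way to ground responses in use-case specific context. The toy sketch below only illustrates the idea: it ranks documents by word overlap with the query and pastes the best ones into a prompt. The scoring rule, the prompt wording, and the names `retrieve` and `build_grounded_prompt` are all assumptions for illustration; a real system would typically use embeddings and a vector store.

```python
def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
    """Rank documents by how many lowercase words they share with the query."""
    query_words = set(query.lower().split())
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def build_grounded_prompt(query: str, documents: list[str], k: int = 2) -> str:
    """Assemble a prompt that asks the model to answer from retrieved context only."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, documents, k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )


if __name__ == "__main__":
    docs = [
        "phi-4 has a 16K token context length",
        "phi-4 is released under the MIT license",
        "bananas are yellow",
    ]
    print(build_grounded_prompt("what is the context length of phi-4", docs, k=1))
```

The resulting string would be sent as the user message; grounding of this kind reduces, but does not remove, the reliability concerns listed above.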

license:mit
4,494
180

granite-4.0-micro-GGUF

See our collection for all versions of Granite-4.0 including GGUF, 4-bit & 16-bit formats.

Learn to run Granite 4.0 correctly - Read our Guide. See Unsloth Dynamic 2.0 GGUFs for our quantization benchmarks.

- Fine-tune Granite-4.0 for free using our Google Colab notebook
- Read our Blog about Granite-4.0 support: https://docs.unsloth.ai/new/ibm-granite-4.0
- View the rest of our notebooks in our docs here.

Model Summary: Granite-4.0-Micro is a 3B parameter long-context instruct model finetuned from Granite-4.0-Micro-Base using a combination of open source instruction datasets with permissive license and internally collected synthetic datasets. This model is developed using a diverse set of techniques with a structured chat format, including supervised finetuning, model alignment using reinforcement learning, and model merging. Granite 4.0 instruct models feature improved instruction following (IF) and tool-calling capabilities, making them more effective in enterprise applications.

- Developers: Granite Team, IBM
- HF Collection: Granite 4.0 Language Models HF Collection
- GitHub Repository: ibm-granite/granite-4.0-language-models
- Website: Granite Docs
- Release Date: October 2nd, 2025
- License: Apache 2.0

Supported Languages: English, German, Spanish, French, Japanese, Portuguese, Arabic, Czech, Italian, Korean, Dutch, and Chinese. Users may finetune Granite 4.0 models for languages beyond these languages.

Intended use: The model is designed to follow general instructions and can serve as the foundation for AI assistants across diverse domains, including business applications, as well as for LLM agents equipped with tool-use capabilities.

Capabilities

- Summarization
- Text classification
- Text extraction
- Question-answering
- Retrieval Augmented Generation (RAG)
- Code related tasks
- Function-calling tasks
- Multilingual dialog use cases
- Fill-In-the-Middle (FIM) code completions

Generation: This is a simple example of how to use the Granite-4.0-Micro model.

Then, copy the snippet from the section that is relevant for your use case.

Tool-calling: Granite-4.0-Micro comes with enhanced tool calling capabilities, enabling seamless integration with external functions and APIs. To define a list of tools, please follow OpenAI's function definition schema.

This is an example of how to use the Granite-4.0-Micro model's tool-calling ability:

Benchmarks

| Metric | Micro Dense | H Micro Dense | H Tiny MoE | H Small MoE |
|---|---|---|---|---|

Multilingual Benchmarks and their included languages:

- MMMLU (11): ar, de, en, es, fr, ja, ko, pt, zh, bn, hi
- INCLUDE (14): hi, bn, ta, te, ar, de, es, fr, it, ja, ko, nl, pt, zh

Granite-4.0-Micro baseline is built on a decoder-only dense transformer architecture. Core components of this architecture are: GQA, RoPE, MLP with SwiGLU, RMSNorm, and shared input/output embeddings.

| Model | Micro Dense | H Micro Dense | H Tiny MoE | H Small MoE |
|---|---|---|---|---|
| Number of layers | 40 attention | 4 attention / 36 Mamba2 | 4 attention / 36 Mamba2 | 4 attention / 36 Mamba2 |
| MLP / Shared expert hidden size | 8192 | 8192 | 1024 | 1536 |

Training Data: Overall, our SFT data is largely comprised of three key sources: (1) publicly available datasets with permissive license, (2) internal synthetic data targeting specific capabilities, and (3) a select set of human-curated data.

Infrastructure: We trained the Granite 4.0 Language Models utilizing an NVIDIA GB200 NVL72 cluster hosted in CoreWeave. Intra-rack communication occurs via the 72-GPU NVLink domain, and a non-blocking, full Fat-Tree NDR 400 Gb/s InfiniBand network provides inter-rack communication. This cluster provides a scalable and efficient infrastructure for training our models over thousands of GPUs.

Ethical Considerations and Limitations: Granite 4.0 Instruction Models are primarily finetuned using instruction-response pairs mostly in English, but also multilingual data covering multiple languages. Although this model can handle multilingual dialog use cases, its performance might not be similar to English tasks. In such cases, introducing a small number of examples (few-shot) can help the model generate more accurate outputs. While this model has been aligned with safety in mind, it may in some cases produce inaccurate, biased, or unsafe responses to user prompts. We therefore urge the community to use this model with proper safety testing and tuning tailored to their specific tasks.

Resources

- ⭐️ Learn about the latest updates with Granite: https://www.ibm.com/granite
- 📄 Get started with tutorials, best practices, and prompt engineering advice: https://www.ibm.com/granite/docs/
- 💡 Learn about the latest Granite learning resources: https://ibm.biz/granite-learning-resources
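The tool-calling section asks for tools to be defined following OpenAI's function definition schema. A minimal sketch of one such definition is shown below; the `get_current_weather` tool and the `check_tool_definition` helper are hypothetical, and the shape check is a small illustration rather than a full schema validator.

```python
import json

# Hypothetical tool, written in the OpenAI-style function definition shape.
get_current_weather = {
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "Name of the city."},
            },
            "required": ["city"],
        },
    },
}


def check_tool_definition(tool: dict) -> None:
    """Raise ValueError if the basic shape is off (illustrative, not exhaustive)."""
    if tool.get("type") != "function":
        raise ValueError("tool 'type' must be 'function'")
    function = tool.get("function", {})
    if "name" not in function:
        raise ValueError("function needs a 'name'")
    parameters = function.get("parameters", {"type": "object", "properties": {}})
    if parameters.get("type") != "object":
        raise ValueError("'parameters' must describe an object")
    missing = set(parameters.get("required", [])) - set(parameters.get("properties", {}))
    if missing:
        raise ValueError(f"required fields without a definition: {sorted(missing)}")


tools = [get_current_weather]
for tool in tools:
    check_tool_definition(tool)

# Serialized form, e.g. for logging or for passing to a chat template.
tools_json = json.dumps(tools, indent=2)
```

A list like `tools` is what would typically be handed to the chat template or request alongside the messages; how it is passed depends on the library or server in use.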

license:apache-2.0
4,462
10

Qwen3-VL-235B-A22B-Instruct-1M-GGUF

> [!NOTE] > Includes Unsloth chat template fixes! For `llama.cpp`, use `--jinja` > Unsloth Dynamic 2.0 achieves superior accuracy & outperforms other leading quants. Meet Qwen3-VL — the most powerful vision-language model in the Qwen series to date. This generation delivers comprehensive upgrades across the board: superior text understanding & generation, deeper visual perception & reasoning, extended context length, enhanced spatial and video dynamics comprehension, and stronger agent interaction capabilities. Available in Dense and MoE architectures that scale from edge to cloud, with Instruct and reasoning‑enhanced Thinking editions for flexible, on‑demand deployment. Visual Agent: Operates PC/mobile GUIs—recognizes elements, understands functions, invokes tools, completes tasks. Visual Coding Boost: Generates Draw.io/HTML/CSS/JS from images/videos. Advanced Spatial Perception: Judges object positions, viewpoints, and occlusions; provides stronger 2D grounding and enables 3D grounding for spatial reasoning and embodied AI. Long Context & Video Understanding: Native 256K context, expandable to 1M; handles books and hours-long video with full recall and second-level indexing. Enhanced Multimodal Reasoning: Excels in STEM/Math—causal analysis and logical, evidence-based answers. Upgraded Visual Recognition: Broader, higher-quality pretraining is able to “recognize everything”—celebrities, anime, products, landmarks, flora/fauna, etc. Expanded OCR: Supports 32 languages (up from 19); robust in low light, blur, and tilt; better with rare/ancient characters and jargon; improved long-document structure parsing. Text Understanding on par with pure LLMs: Seamless text–vision fusion for lossless, unified comprehension. 1. Interleaved-MRoPE: Full‑frequency allocation over time, width, and height via robust positional embeddings, enhancing long‑horizon video reasoning. 2. 
DeepStack: Fuses multi‑level ViT features to capture fine‑grained details and sharpen image–text alignment. 3. Text–Timestamp Alignment: Moves beyond T‑RoPE to precise, timestamp‑grounded event localization for stronger video temporal modeling. This is the weight repository for Qwen3-VL-235B-A22B-Instruct. Below, we provide simple examples to show how to use Qwen3-VL with 🤖 ModelScope and 🤗 Transformers. Qwen3-VL support is included in the latest Hugging Face `transformers`, and we advise you to build it from source with the command: Here is a code snippet showing how to use the chat model with `transformers`: If you find our work helpful, feel free to cite us.
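The `transformers` snippet referenced above is missing from this page. As a partial sketch, the chat input for a vision-language model is usually a list of messages whose `content` mixes image and text parts. The helper name `build_vision_messages`, the placeholder URL, and the exact key names (`"image"`, `"text"`) are assumptions based on the message convention commonly used in Qwen's published examples; check them against the model card before relying on them.

```python
from typing import Any


def build_vision_messages(image_ref: str, prompt: str) -> list[dict[str, Any]]:
    """Build a single-turn chat with one image part and one text part (hypothetical helper)."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_ref},  # local path, URL, or base64 string
                {"type": "text", "text": prompt},
            ],
        }
    ]


messages = build_vision_messages(
    "https://example.com/demo.jpeg",  # placeholder URL, not a real asset
    "Describe this image.",
)
```

A list in this shape is what a processor's chat template would typically consume before generation; loading the model and processor is left out here because it requires downloading the weights.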

license:apache-2.0
4,456
1

DeepSeek-R1-0528-GGUF

license:mit
4,433
192

Mistral-Small-3.2-24B-Instruct-2506-bnb-4bit

license:apache-2.0
4,429
10

Qwen2.5-0.5B

license:apache-2.0
4,398
10

Qwen2.5-1.5B-Instruct-bnb-4bit

license:apache-2.0
4,389
5

Qwen3-VL-32B-Thinking-1M-GGUF

> [!NOTE] > Includes Unsloth chat template fixes! For `llama.cpp`, use `--jinja` > Unsloth Dynamic 2.0 achieves superior accuracy & outperforms other leading quants. Meet Qwen3-VL — the most powerful vision-language model in the Qwen series to date. This generation delivers comprehensive upgrades across the board: superior text understanding & generation, deeper visual perception & reasoning, extended context length, enhanced spatial and video dynamics comprehension, and stronger agent interaction capabilities. Available in Dense and MoE architectures that scale from edge to cloud, with Instruct and reasoning‑enhanced Thinking editions for flexible, on‑demand deployment. Visual Agent: Operates PC/mobile GUIs—recognizes elements, understands functions, invokes tools, completes tasks. Visual Coding Boost: Generates Draw.io/HTML/CSS/JS from images/videos. Advanced Spatial Perception: Judges object positions, viewpoints, and occlusions; provides stronger 2D grounding and enables 3D grounding for spatial reasoning and embodied AI. Long Context & Video Understanding: Native 256K context, expandable to 1M; handles books and hours-long video with full recall and second-level indexing. Enhanced Multimodal Reasoning: Excels in STEM/Math—causal analysis and logical, evidence-based answers. Upgraded Visual Recognition: Broader, higher-quality pretraining is able to “recognize everything”—celebrities, anime, products, landmarks, flora/fauna, etc. Expanded OCR: Supports 32 languages (up from 19); robust in low light, blur, and tilt; better with rare/ancient characters and jargon; improved long-document structure parsing. Text Understanding on par with pure LLMs: Seamless text–vision fusion for lossless, unified comprehension. 1. Interleaved-MRoPE: Full‑frequency allocation over time, width, and height via robust positional embeddings, enhancing long‑horizon video reasoning. 2. 
DeepStack: Fuses multi‑level ViT features to capture fine‑grained details and sharpen image–text alignment. 3. Text–Timestamp Alignment: Moves beyond T‑RoPE to precise, timestamp‑grounded event localization for stronger video temporal modeling. This is the weight repository for Qwen3-VL-32B-Thinking. Below, we provide simple examples to show how to use Qwen3-VL with 🤖 ModelScope and 🤗 Transformers. Qwen3-VL support is included in the latest Hugging Face `transformers`, and we advise you to build it from source with the command: Here is a code snippet showing how to use the chat model with `transformers`: If you find our work helpful, feel free to cite us.

license:apache-2.0
4,349
1

Mistral-Small-24B-Instruct-2501-GGUF

license:apache-2.0
4,312
26

Qwen3-VL-32B-Instruct-1M-GGUF

> [!NOTE] > Includes Unsloth chat template fixes! For `llama.cpp`, use `--jinja` > Unsloth Dynamic 2.0 achieves superior accuracy & outperforms other leading quants. Meet Qwen3-VL — the most powerful vision-language model in the Qwen series to date. This generation delivers comprehensive upgrades across the board: superior text understanding & generation, deeper visual perception & reasoning, extended context length, enhanced spatial and video dynamics comprehension, and stronger agent interaction capabilities. Available in Dense and MoE architectures that scale from edge to cloud, with Instruct and reasoning‑enhanced Thinking editions for flexible, on‑demand deployment. Visual Agent: Operates PC/mobile GUIs—recognizes elements, understands functions, invokes tools, completes tasks. Visual Coding Boost: Generates Draw.io/HTML/CSS/JS from images/videos. Advanced Spatial Perception: Judges object positions, viewpoints, and occlusions; provides stronger 2D grounding and enables 3D grounding for spatial reasoning and embodied AI. Long Context & Video Understanding: Native 256K context, expandable to 1M; handles books and hours-long video with full recall and second-level indexing. Enhanced Multimodal Reasoning: Excels in STEM/Math—causal analysis and logical, evidence-based answers. Upgraded Visual Recognition: Broader, higher-quality pretraining is able to “recognize everything”—celebrities, anime, products, landmarks, flora/fauna, etc. Expanded OCR: Supports 32 languages (up from 19); robust in low light, blur, and tilt; better with rare/ancient characters and jargon; improved long-document structure parsing. Text Understanding on par with pure LLMs: Seamless text–vision fusion for lossless, unified comprehension. 1. Interleaved-MRoPE: Full‑frequency allocation over time, width, and height via robust positional embeddings, enhancing long‑horizon video reasoning. 2. 
DeepStack: Fuses multi‑level ViT features to capture fine‑grained details and sharpen image–text alignment. 3. Text–Timestamp Alignment: Moves beyond T‑RoPE to precise, timestamp‑grounded event localization for stronger video temporal modeling. This is the weight repository for Qwen3-VL-32B-Instruct. Below, we provide simple examples to show how to use Qwen3-VL with 🤖 ModelScope and 🤗 Transformers. Qwen3-VL support is included in the latest Hugging Face `transformers`, and we advise you to build it from source with the command: Here is a code snippet showing how to use the chat model with `transformers`: If you find our work helpful, feel free to cite us.

license:apache-2.0
4,250
0

Qwen2.5-VL-3B-Instruct-bnb-4bit

In the five months since Qwen2-VL's release, numerous developers have built new models on the Qwen2-VL vision-language models, providing us with valuable feedback. During this period, we focused on building more useful vision-language models. Today, we are excited to introduce the latest addition to the Qwen family: Qwen2.5-VL.

Key Enhancements:
- Understand things visually: Qwen2.5-VL is not only proficient in recognizing common objects such as flowers, birds, fish, and insects, but is also highly capable of analyzing text, charts, icons, graphics, and layouts within images.
- Being agentic: Qwen2.5-VL acts directly as a visual agent that can reason and dynamically direct tools, making it capable of computer use and phone use.
- Understanding long videos and capturing events: Qwen2.5-VL can comprehend videos over 1 hour long, and it has a new ability to capture events by pinpointing the relevant video segments.
- Capable of visual localization in different formats: Qwen2.5-VL can accurately localize objects in an image by generating bounding boxes or points, and it can provide stable JSON outputs for coordinates and attributes.
- Generating structured outputs: For data such as scans of invoices, forms, and tables, Qwen2.5-VL supports structured outputs of their contents, benefiting uses in finance, commerce, etc.

Dynamic Resolution and Frame Rate Training for Video Understanding: We extend dynamic resolution to the temporal dimension by adopting dynamic FPS sampling, enabling the model to comprehend videos at various sampling rates. Accordingly, we update mRoPE in the time dimension with IDs and absolute time alignment, enabling the model to learn temporal sequence and speed, and ultimately acquire the ability to pinpoint specific moments.

We enhance both training and inference speeds by strategically implementing window attention in the ViT. The ViT architecture is further optimized with SwiGLU and RMSNorm, aligning it with the structure of the Qwen2.5 LLM.
We have three models with 3, 7 and 72 billion parameters. This repo contains the instruction-tuned 3B Qwen2.5-VL model. For more information, visit our Blog and GitHub.

| Benchmark | InternVL2.5-4B | Qwen2-VL-7B | Qwen2.5-VL-3B |
| :--- | :---: | :---: | :---: |
| MMMU (val) | 52.3 | 54.1 | 53.1 |
| MMMU-Pro (val) | 32.7 | 30.5 | 31.6 |
| AI2D (test) | 81.4 | 83.0 | 81.5 |
| DocVQA (test) | 91.6 | 94.5 | 93.9 |
| InfoVQA (test) | 72.1 | 76.5 | 77.1 |
| TextVQA (val) | 76.8 | 84.3 | 79.3 |
| MMBench-V1.1 (test) | 79.3 | 80.7 | 77.6 |
| MMStar | 58.3 | 60.7 | 55.9 |
| MathVista (testmini) | 60.5 | 58.2 | 62.3 |
| MathVision (full) | 20.9 | 16.3 | 21.2 |

Video benchmark

| Benchmark | InternVL2.5-4B | Qwen2-VL-7B | Qwen2.5-VL-3B |
| :--- | :---: | :---: | :---: |
| MVBench | 71.6 | 67.0 | 67.0 |
| VideoMME | 63.6/62.3 | 69.0/63.3 | 67.6/61.5 |
| MLVU | 48.3 | - | 68.2 |
| LVBench | - | - | 43.3 |
| MMBench-Video | 1.73 | 1.44 | 1.63 |
| EgoSchema | - | - | 64.8 |
| PerceptionTest | - | - | 66.9 |
| TempCompass | - | - | 64.4 |
| LongVideoBench | 55.2 | 55.6 | 54.2 |
| CharadesSTA/mIoU | - | - | 38.8 |

Agent benchmark

| Benchmarks | Qwen2.5-VL-3B |
|-------------------------|---------------|
| ScreenSpot | 55.5 |
| ScreenSpot Pro | 23.9 |
| AITZ_EM | 76.9 |
| Android Control High_EM | 63.7 |
| Android Control Low_EM | 22.2 |
| AndroidWorld_SR | 90.8 |
| MobileMiniWob++_SR | 67.9 |

Requirements: Qwen2.5-VL support is included in the latest Hugging Face `transformers`, and we advise you to build it from source with the command:

Below, we provide simple examples to show how to use Qwen2.5-VL with 🤖 ModelScope and 🤗 Transformers. We offer a toolkit to help you handle various types of visual input more conveniently, as if you were using an API. This includes base64, URLs, and interleaved images and videos.
You can install it using the following command: If you are not using Linux, you might not be able to install `decord` from PyPI. In that case, you can use `pip install qwen-vl-utils`, which will fall back to using torchvision for video processing. However, you can still install decord from source so that it is used when loading video. Here is a code snippet showing how to use the chat model with `transformers` and `qwen_vl_utils`:

Video URL compatibility largely depends on the third-party library version. The details are in the table below. Change the backend with `FORCE_QWENVL_VIDEO_READER=torchvision` or `FORCE_QWENVL_VIDEO_READER=decord` if you prefer not to use the default one.

| Backend | HTTP | HTTPS |
|-------------|------|-------|
| torchvision >= 0.19.0 | ✅ | ✅ |
| torchvision

🤖 ModelScope: We strongly advise users, especially those in mainland China, to use ModelScope. `snapshot_download` can help you solve issues concerning downloading checkpoints.

For input images, we support local files, base64, and URLs. For videos, we currently only support local files.

The model supports a wide range of resolution inputs. By default, it uses the native resolution for input, but higher resolutions can enhance performance at the cost of more computation. Users can set the minimum and maximum number of pixels to achieve an optimal configuration for their needs, such as a token count range of 256-1280, to balance speed and memory usage. In addition, we provide two methods for fine-grained control over the image size input to the model:
1. Define `min_pixels` and `max_pixels`: Images will be resized to maintain their aspect ratio within the range of `min_pixels` and `max_pixels`.
2. Specify exact dimensions: Directly set `resized_height` and `resized_width`. These values will be rounded to the nearest multiple of 28.

The current `config.json` is set for context length up to 32,768 tokens.
To handle extensive inputs exceeding 32,768 tokens, we utilize YaRN, a technique for enhancing model length extrapolation, ensuring optimal performance on lengthy texts. For supported frameworks, you could add the following to `config.json` to enable YaRN: However, note that this method has a significant impact on the performance of temporal and spatial localization tasks, and is therefore not recommended. For long video inputs, since MRoPE itself is more economical with position IDs, `max_position_embeddings` can instead be modified directly to a larger value, such as 64k. If you find our work helpful, feel free to cite us.
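The image-size controls described in this card (a visual-token range such as 256-1280, and exact dimensions rounded to the nearest multiple of 28) come down to simple arithmetic. The sketch below works that arithmetic through. It assumes that one visual token corresponds to one 28 x 28 pixel block, which is an inference from the "multiple of 28" rule rather than something this page states, and the helper names are hypothetical.

```python
PATCH = 28  # the card says exact dimensions are rounded to the nearest multiple of 28


def round_to_multiple(value: int, base: int = PATCH) -> int:
    """Round a dimension to the nearest multiple of `base`, never going below `base`."""
    return max(base, int(value / base + 0.5) * base)


def pixel_budget(min_tokens: int, max_tokens: int, patch: int = PATCH) -> tuple[int, int]:
    """Convert a visual-token range into (min_pixels, max_pixels).

    Assumption: one visual token per patch x patch pixel block.
    """
    return min_tokens * patch * patch, max_tokens * patch * patch


# Token range 256-1280, as in the example given by the card.
min_pixels, max_pixels = pixel_budget(256, 1280)  # (200704, 1003520)

# Exact dimensions snapped to the 28-pixel grid.
resized_height = round_to_multiple(500)  # 504
resized_width = round_to_multiple(333)   # 336
```

Values computed this way would be passed wherever the toolkit expects `min_pixels`, `max_pixels`, `resized_height`, and `resized_width`.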

4,229
4

Magistral-Small-2506-GGUF

license:apache-2.0
4,227
93

Qwen3-VL-235B-A22B-Thinking-1M-GGUF

> [!NOTE] > Includes Unsloth chat template fixes! For `llama.cpp`, use `--jinja` > Unsloth Dynamic 2.0 achieves superior accuracy & outperforms other leading quants. Meet Qwen3-VL — the most powerful vision-language model in the Qwen series to date. This generation delivers comprehensive upgrades across the board: superior text understanding & generation, deeper visual perception & reasoning, extended context length, enhanced spatial and video dynamics comprehension, and stronger agent interaction capabilities. Available in Dense and MoE architectures that scale from edge to cloud, with Instruct and reasoning‑enhanced Thinking editions for flexible, on‑demand deployment. Visual Agent: Operates PC/mobile GUIs—recognizes elements, understands functions, invokes tools, completes tasks. Visual Coding Boost: Generates Draw.io/HTML/CSS/JS from images/videos. Advanced Spatial Perception: Judges object positions, viewpoints, and occlusions; provides stronger 2D grounding and enables 3D grounding for spatial reasoning and embodied AI. Long Context & Video Understanding: Native 256K context, expandable to 1M; handles books and hours-long video with full recall and second-level indexing. Enhanced Multimodal Reasoning: Excels in STEM/Math—causal analysis and logical, evidence-based answers. Upgraded Visual Recognition: Broader, higher-quality pretraining is able to “recognize everything”—celebrities, anime, products, landmarks, flora/fauna, etc. Expanded OCR: Supports 32 languages (up from 19); robust in low light, blur, and tilt; better with rare/ancient characters and jargon; improved long-document structure parsing. Text Understanding on par with pure LLMs: Seamless text–vision fusion for lossless, unified comprehension. 1. Interleaved-MRoPE: Full‑frequency allocation over time, width, and height via robust positional embeddings, enhancing long‑horizon video reasoning. 2. 
DeepStack: Fuses multi‑level ViT features to capture fine‑grained details and sharpen image–text alignment. 3. Text–Timestamp Alignment: Moves beyond T‑RoPE to precise, timestamp‑grounded event localization for stronger video temporal modeling. This is the weight repository for Qwen3-VL-235B-A22B-Thinking. Below, we provide simple examples to show how to use Qwen3-VL with 🤖 ModelScope and 🤗 Transformers. Qwen3-VL support is included in the latest Hugging Face `transformers`, and we advise you to build it from source with the command: Here is a code snippet showing how to use the chat model with `transformers`: If you find our work helpful, feel free to cite us.

license:apache-2.0
4,227
1

gemma-3-27b-it-qat-GGUF

See our collection for all versions of Gemma 3 including GGUF, 4-bit & 16-bit formats. Read our Guide to see how to Run Gemma 3 correctly.
- Fine-tune Gemma 3 (12B) for free using our Google Colab notebook here!
- Read our Blog about Gemma 3 support: unsloth.ai/blog/gemma3
- View the rest of our notebooks in our docs here.
- Export your fine-tuned model to GGUF, Ollama, llama.cpp or 🤗HF.

| Unsloth supports | Free Notebooks | Performance | Memory use |
|-----------------|----------------|-------------|----------|
| GRPO with Gemma 3 (12B) | ▶️ Start on Colab | 2x faster | 80% less |
| Llama-3.2 (3B) | ▶️ Start on Colab | 2.4x faster | 58% less |
| Llama-3.2 (11B vision) | ▶️ Start on Colab | 2x faster | 60% less |
| Qwen2.5 (7B) | ▶️ Start on Colab | 2x faster | 60% less |
| Phi-4 (14B) | ▶️ Start on Colab | 2x faster | 50% less |
| Mistral (7B) | ▶️ Start on Colab | 2.2x faster | 62% less |

[Gemma 3 Technical Report][g3-tech-report] [Responsible Generative AI Toolkit][rai-toolkit] [Gemma on Kaggle][kaggle-gemma] [Gemma on Vertex Model Garden][vertex-mg-gemma3]

Summary description and brief definition of inputs and outputs.

Gemma is a family of lightweight, state-of-the-art open models from Google, built from the same research and technology used to create the Gemini models. Gemma 3 models are multimodal, handling text and image input and generating text output, with open weights for both pre-trained variants and instruction-tuned variants. Gemma 3 has a large, 128K context window, multilingual support in over 140 languages, and is available in more sizes than previous versions. Gemma 3 models are well-suited for a variety of text generation and image understanding tasks, including question answering, summarization, and reasoning.
Their relatively small size makes it possible to deploy them in environments with limited resources such as laptops, desktops or your own cloud infrastructure, democratizing access to state of the art AI models and helping foster innovation for everyone. - Input: - Text string, such as a question, a prompt, or a document to be summarized - Images, normalized to 896 x 896 resolution and encoded to 256 tokens each - Total input context of 128K tokens for the 4B, 12B, and 27B sizes, and 32K tokens for the 1B size - Output: - Generated text in response to the input, such as an answer to a question, analysis of image content, or a summary of a document - Total output context of 8192 tokens Data used for model training and how the data was processed. These models were trained on a dataset of text data that includes a wide variety of sources. The 27B model was trained with 14 trillion tokens, the 12B model was trained with 12 trillion tokens, 4B model was trained with 4 trillion tokens and 1B with 2 trillion tokens. Here are the key components: - Web Documents: A diverse collection of web text ensures the model is exposed to a broad range of linguistic styles, topics, and vocabulary. The training dataset includes content in over 140 languages. - Code: Exposing the model to code helps it to learn the syntax and patterns of programming languages, which improves its ability to generate code and understand code-related questions. - Mathematics: Training on mathematical text helps the model learn logical reasoning, symbolic representation, and to address mathematical queries. - Images: A wide range of images enables the model to perform image analysis and visual data extraction tasks. The combination of these diverse data sources is crucial for training a powerful multimodal model that can handle a wide variety of different tasks and data formats. 
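As a worked example of the input limits listed above (256 tokens per image; 128K total input context for the 4B, 12B, and 27B sizes and 32K for the 1B size), the sketch below estimates how many tokens remain for text once images are included. It treats "K" as 1,000 tokens, which is an assumption since the card does not say whether 128K means 128,000 or 131,072, and the function name is hypothetical.

```python
TOKENS_PER_IMAGE = 256  # from the card: each image is encoded to 256 tokens

# From the card: total input context per model size.
# Assumption: "K" is read as 1,000 tokens.
CONTEXT_TOKENS = {"1B": 32_000, "4B": 128_000, "12B": 128_000, "27B": 128_000}


def remaining_text_tokens(model_size: str, num_images: int) -> int:
    """Return the input tokens left for text after accounting for images."""
    budget = CONTEXT_TOKENS[model_size] - num_images * TOKENS_PER_IMAGE
    if budget < 0:
        raise ValueError("images alone exceed the input context")
    return budget


left_for_text = remaining_text_tokens("27B", num_images=10)  # 128000 - 2560 = 125440
max_images = CONTEXT_TOKENS["27B"] // TOKENS_PER_IMAGE       # 500
```

The same arithmetic applies to the other sizes by changing the `model_size` key.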
Here are the key data cleaning and filtering methods applied to the training data:
- CSAM Filtering: Rigorous CSAM (Child Sexual Abuse Material) filtering was applied at multiple stages in the data preparation process to ensure the exclusion of harmful and illegal content.
- Sensitive Data Filtering: As part of making Gemma pre-trained models safe and reliable, automated techniques were used to filter out certain personal information and other sensitive data from training sets.
- Additional methods: Filtering based on content quality and safety in line with [our policies][safety-policies].

Gemma was trained using [Tensor Processing Unit (TPU)][tpu] hardware (TPUv4p, TPUv5p and TPUv5e). Training vision-language models (VLMs) requires significant computational power. TPUs, designed specifically for matrix operations common in machine learning, offer several advantages in this domain:
- Performance: TPUs are specifically designed to handle the massive computations involved in training VLMs. They can speed up training considerably compared to CPUs.
- Memory: TPUs often come with large amounts of high-bandwidth memory, allowing for the handling of large models and batch sizes during training. This can lead to better model quality.
- Scalability: TPU Pods (large clusters of TPUs) provide a scalable solution for handling the growing complexity of large foundation models. You can distribute training across multiple TPU devices for faster and more efficient processing.
- Cost-effectiveness: In many scenarios, TPUs can provide a more cost-effective solution for training large models compared to CPU-based infrastructure, especially when considering the time and resources saved due to faster training.
- These advantages are aligned with [Google's commitments to operate sustainably][sustainability].

Training was done using [JAX][jax] and [ML Pathways][ml-pathways].
JAX allows researchers to take advantage of the latest generation of hardware, including TPUs, for faster and more efficient training of large models. ML Pathways is Google's latest effort to build artificially intelligent systems capable of generalizing across multiple tasks. This is especially suitable for foundation models, including large language models like these. Together, JAX and ML Pathways are used as described in the [paper about the Gemini family of models][gemini-2-paper]; "the 'single controller' programming model of Jax and Pathways allows a single Python process to orchestrate the entire training run, dramatically simplifying the development workflow."

These models were evaluated against a large collection of different datasets and metrics to cover different aspects of text generation:

| Benchmark | Metric | Gemma 3 PT 1B | Gemma 3 PT 4B | Gemma 3 PT 12B | Gemma 3 PT 27B |
| ------------------------------ |----------------|:--------------:|:-------------:|:--------------:|:--------------:|
| [HellaSwag][hellaswag] | 10-shot | 62.3 | 77.2 | 84.2 | 85.6 |
| [BoolQ][boolq] | 0-shot | 63.2 | 72.3 | 78.8 | 82.4 |
| [PIQA][piqa] | 0-shot | 73.8 | 79.6 | 81.8 | 83.3 |
| [SocialIQA][socialiqa] | 0-shot | 48.9 | 51.9 | 53.4 | 54.9 |
| [TriviaQA][triviaqa] | 5-shot | 39.8 | 65.8 | 78.2 | 85.5 |
| [Natural Questions][naturalq] | 5-shot | 9.48 | 20.0 | 31.4 | 36.1 |
| [ARC-c][arc] | 25-shot | 38.4 | 56.2 | 68.9 | 70.6 |
| [ARC-e][arc] | 0-shot | 73.0 | 82.4 | 88.3 | 89.0 |
| [WinoGrande][winogrande] | 5-shot | 58.2 | 64.7 | 74.3 | 78.8 |
| [BIG-Bench Hard][bbh] | few-shot | 28.4 | 50.9 | 72.6 | 77.7 |
| [DROP][drop] | 1-shot | 42.4 | 60.1 | 72.2 | 77.2 |

[hellaswag]: https://arxiv.org/abs/1905.07830
[boolq]: https://arxiv.org/abs/1905.10044
[piqa]: https://arxiv.org/abs/1911.11641
[socialiqa]: https://arxiv.org/abs/1904.09728
[triviaqa]: https://arxiv.org/abs/1705.03551
[naturalq]: https://github.com/google-research-datasets/natural-questions
[arc]: https://arxiv.org/abs/1911.01547
[winogrande]: https://arxiv.org/abs/1907.10641
[bbh]: https://paperswithcode.com/dataset/bbh
[drop]: https://arxiv.org/abs/1903.00161

| Benchmark | Metric | Gemma 3 PT 4B | Gemma 3 PT 12B | Gemma 3 PT 27B |
| ------------------------------ |----------------|:-------------:|:--------------:|:--------------:|
| [MMLU][mmlu] | 5-shot | 59.6 | 74.5 | 78.6 |
| [MMLU][mmlu] (Pro COT) | 5-shot | 29.2 | 45.3 | 52.2 |
| [AGIEval][agieval] | 3-5-shot | 42.1 | 57.4 | 66.2 |
| [MATH][math] | 4-shot | 24.2 | 43.3 | 50.0 |
| [GSM8K][gsm8k] | 8-shot | 38.4 | 71.0 | 82.6 |
| [GPQA][gpqa] | 5-shot | 15.0 | 25.4 | 24.3 |
| [MBPP][mbpp] | 3-shot | 46.0 | 60.4 | 65.6 |
| [HumanEval][humaneval] | 0-shot | 36.0 | 45.7 | 48.8 |

[mmlu]: https://arxiv.org/abs/2009.03300
[agieval]: https://arxiv.org/abs/2304.06364
[math]: https://arxiv.org/abs/2103.03874
[gsm8k]: https://arxiv.org/abs/2110.14168
[gpqa]: https://arxiv.org/abs/2311.12022
[mbpp]: https://arxiv.org/abs/2108.07732
[humaneval]: https://arxiv.org/abs/2107.03374

| Benchmark | Gemma 3 PT 1B | Gemma 3 PT 4B | Gemma 3 PT 12B | Gemma 3 PT 27B |
| ------------------------------------ |:-------------:|:-------------:|:--------------:|:--------------:|
| [MGSM][mgsm] | 2.04 | 34.7 | 64.3 | 74.3 |
| [Global-MMLU-Lite][global-mmlu-lite] | 24.9 | 57.0 | 69.4 | 75.7 |
| [WMT24++][wmt24pp] (ChrF) | 36.7 | 48.4 | 53.9 | 55.7 |
| [FloRes][flores] | 29.5 | 39.2 | 46.0 | 48.8 |
| [XQuAD][xquad] (all) | 43.9 | 68.0 | 74.5 | 76.8 |
| [ECLeKTic][eclektic] | 4.69 | 11.0 | 17.2 | 24.4 |
| [IndicGenBench][indicgenbench] | 41.4 | 57.2 | 61.7 | 63.4 |

[mgsm]: https://arxiv.org/abs/2210.03057
[flores]: https://arxiv.org/abs/2106.03193
[xquad]: https://arxiv.org/abs/1910.11856v3
[global-mmlu-lite]: https://huggingface.co/datasets/CohereForAI/Global-MMLU-Lite
[wmt24pp]: https://arxiv.org/abs/2502.12404v1
[eclektic]: https://arxiv.org/abs/2502.21228
[indicgenbench]: https://arxiv.org/abs/2404.16816

| Benchmark | Gemma 3 PT 4B | Gemma 3 PT 12B | Gemma 3 PT 27B |
| ------------------------------ |:-------------:|:--------------:|:--------------:|
| [COCOcap][coco-cap] | 102 | 111 | 116 |
| [DocVQA][docvqa] (val) | 72.8 | 82.3 | 85.6 |
| [InfoVQA][info-vqa] (val) | 44.1 | 54.8 | 59.4 |
| [MMMU][mmmu] (pt) | 39.2 | 50.3 | 56.1 |
| [TextVQA][textvqa] (val) | 58.9 | 66.5 | 68.6 |
| [RealWorldQA][realworldqa] | 45.5 | 52.2 | 53.9 |
| [ReMI][remi] | 27.3 | 38.5 | 44.8 |
| [AI2D][ai2d] | 63.2 | 75.2 | 79.0 |
| [ChartQA][chartqa] | 63.6 | 74.7 | 76.3 |
| [VQAv2][vqav2] | 63.9 | 71.2 | 72.9 |
| [BLINK][blinkvqa] | 38.0 | 35.9 | 39.6 |
| [OKVQA][okvqa] | 51.0 | 58.7 | 60.2 |
| [TallyQA][tallyqa] | 42.5 | 51.8 | 54.3 |
| [SpatialSense VQA][ss-vqa] | 50.9 | 60.0 | 59.4 |
| [CountBenchQA][countbenchqa] | 26.1 | 17.8 | 68.0 |

[coco-cap]: https://cocodataset.org/#home
[docvqa]: https://www.docvqa.org/
[info-vqa]: https://arxiv.org/abs/2104.12756
[mmmu]: https://arxiv.org/abs/2311.16502
[textvqa]: https://textvqa.org/
[realworldqa]: https://paperswithcode.com/dataset/realworldqa
[remi]: https://arxiv.org/html/2406.09175v1
[ai2d]: https://allenai.org/data/diagrams
[chartqa]: https://arxiv.org/abs/2203.10244
[vqav2]: https://visualqa.org/index.html
[blinkvqa]: https://arxiv.org/abs/2404.12390
[okvqa]: https://okvqa.allenai.org/
[tallyqa]: https://arxiv.org/abs/1810.12440
[ss-vqa]: https://arxiv.org/abs/1908.02660
[countbenchqa]: https://github.com/google-research/big_vision/blob/main/big_vision/datasets/countbenchqa/

Our evaluation methods include structured evaluations and internal red-teaming testing of relevant content policies. Red-teaming was conducted by a number of different teams, each with different goals and human evaluation metrics. These models were evaluated against a number of different categories relevant to ethics and safety, including:
- Child Safety: Evaluation of text-to-text and image-to-text prompts covering child safety policies, including child sexual abuse and exploitation.
- Content Safety: Evaluation of text-to-text and image-to-text prompts covering safety policies, including harassment, violence and gore, and hate speech.
- Representational Harms: Evaluation of text-to-text and image-to-text prompts covering safety policies, including bias, stereotyping, and harmful associations or inaccuracies.

In addition to development-level evaluations, we conduct "assurance evaluations", which are our 'arms-length' internal evaluations for responsibility governance decision making. They are conducted separately from the model development team, to inform decision making about release. High-level findings are fed back to the model team, but prompt sets are held out to prevent overfitting and preserve the results' ability to inform decision making. Assurance evaluation results are reported to our Responsibility & Safety Council as part of release review.

For all areas of safety testing, we saw major improvements in the categories of child safety, content safety, and representational harms relative to previous Gemma models. All testing was conducted without safety filters to evaluate the model capabilities and behaviors. For both text-to-text and image-to-text, and across all model sizes, the model produced minimal policy violations, and showed significant improvements over previous Gemma models' performance with respect to ungrounded inferences. A limitation of our evaluations was that they included only English-language prompts.

These models have certain limitations that users should be aware of.

Open vision-language models (VLMs) have a wide range of applications across various industries and domains. The following list of potential uses is not comprehensive. The purpose of this list is to provide contextual information about the possible use cases that the model creators considered as part of model training and development.
- Content Creation and Communication - Text Generation: These models can be used to generate creative text formats such as poems, scripts, code, marketing copy, and email drafts. - Chatbots and Conversational AI: Power conversational interfaces for customer service, virtual assistants, or interactive applications. - Text Summarization: Generate concise summaries of a text corpus, research papers, or reports. - Image Data Extraction: These models can be used to extract, interpret, and summarize visual data for text communications. - Research and Education - Natural Language Processing (NLP) and VLM Research: These models can serve as a foundation for researchers to experiment with VLM and NLP techniques, develop algorithms, and contribute to the advancement of the field. - Language Learning Tools: Support interactive language learning experiences, aiding in grammar correction or providing writing practice. - Knowledge Exploration: Assist researchers in exploring large bodies of text by generating summaries or answering questions about specific topics. - Training Data - The quality and diversity of the training data significantly influence the model's capabilities. Biases or gaps in the training data can lead to limitations in the model's responses. - The scope of the training dataset determines the subject areas the model can handle effectively. - Context and Task Complexity - Models are better at tasks that can be framed with clear prompts and instructions. Open-ended or highly complex tasks might be challenging. - A model's performance can be influenced by the amount of context provided (longer context generally leads to better outputs, up to a certain point). - Language Ambiguity and Nuance - Natural language is inherently complex. Models might struggle to grasp subtle nuances, sarcasm, or figurative language. - Factual Accuracy - Models generate responses based on information they learned from their training datasets, but they are not knowledge bases. 
They may generate incorrect or outdated factual statements. - Common Sense - Models rely on statistical patterns in language. They might lack the ability to apply common sense reasoning in certain situations. The development of vision-language models (VLMs) raises several ethical concerns. In creating an open model, we have carefully considered the following: - Bias and Fairness - VLMs trained on large-scale, real-world text and image data can reflect socio-cultural biases embedded in the training material. These models underwent careful scrutiny, input data pre-processing described and posterior evaluations reported in this card. - Misinformation and Misuse - VLMs can be misused to generate text that is false, misleading, or harmful. - Guidelines are provided for responsible use with the model, see the [Responsible Generative AI Toolkit][rai-toolkit]. - Transparency and Accountability: - This model card summarizes details on the models' architecture, capabilities, limitations, and evaluation processes. - A responsibly developed open model offers the opportunity to share innovation by making VLM technology accessible to developers and researchers across the AI ecosystem. - Perpetuation of biases: It's encouraged to perform continuous monitoring (using evaluation metrics, human review) and the exploration of de-biasing techniques during model training, fine-tuning, and other use cases. - Generation of harmful content: Mechanisms and guidelines for content safety are essential. Developers are encouraged to exercise caution and implement appropriate content safety safeguards based on their specific product policies and application use cases. - Misuse for malicious purposes: Technical limitations and developer and end-user education can help mitigate against malicious applications of VLMs. Educational resources and reporting mechanisms for users to flag misuse are provided. 
Prohibited uses of Gemma models are outlined in the [Gemma Prohibited Use Policy][prohibited-use]. - Privacy violations: Models were trained on data filtered for removal of certain personal information and other sensitive data. Developers are encouraged to adhere to privacy regulations with privacy-preserving techniques. At the time of release, this family of models provides high-performance open vision-language model implementations designed from the ground up for responsible AI development compared to similarly sized models. Using the benchmark evaluation metrics described in this document, these models have shown to provide superior performance to other, comparably-sized open model alternatives. [g3-tech-report]: https://goo.gle/Gemma3Report [rai-toolkit]: https://ai.google.dev/responsible [kaggle-gemma]: https://www.kaggle.com/models/google/gemma-3 [vertex-mg-gemma3]: https://console.cloud.google.com/vertex-ai/publishers/google/model-garden/gemma3 [terms]: https://ai.google.dev/gemma/terms [safety-policies]: https://ai.google/static/documents/ai-responsibility-update-published-february-2025.pdf [prohibited-use]: https://ai.google.dev/gemma/prohibitedusepolicy [tpu]: https://cloud.google.com/tpu/docs/intro-to-tpu [sustainability]: https://sustainability.google/operating-sustainably/ [jax]: https://github.com/jax-ml/jax [ml-pathways]: https://blog.google/technology/ai/introducing-pathways-next-generation-ai-architecture/ [sustainability]: https://sustainability.google/operating-sustainably/ [gemini-2-paper]: https://arxiv.org/abs/2312.11805

4,213
22

Qwen3-VL-4B-Thinking-1M-GGUF

> [!NOTE] > Includes Unsloth chat template fixes! For `llama.cpp`, use `--jinja` > Unsloth Dynamic 2.0 achieves superior accuracy & outperforms other leading quants. Meet Qwen3-VL — the most powerful vision-language model in the Qwen series to date. This generation delivers comprehensive upgrades across the board: superior text understanding & generation, deeper visual perception & reasoning, extended context length, enhanced spatial and video dynamics comprehension, and stronger agent interaction capabilities. Available in Dense and MoE architectures that scale from edge to cloud, with Instruct and reasoning‑enhanced Thinking editions for flexible, on‑demand deployment. Visual Agent: Operates PC/mobile GUIs—recognizes elements, understands functions, invokes tools, completes tasks. Visual Coding Boost: Generates Draw.io/HTML/CSS/JS from images/videos. Advanced Spatial Perception: Judges object positions, viewpoints, and occlusions; provides stronger 2D grounding and enables 3D grounding for spatial reasoning and embodied AI. Long Context & Video Understanding: Native 256K context, expandable to 1M; handles books and hours-long video with full recall and second-level indexing. Enhanced Multimodal Reasoning: Excels in STEM/Math—causal analysis and logical, evidence-based answers. Upgraded Visual Recognition: Broader, higher-quality pretraining is able to “recognize everything”—celebrities, anime, products, landmarks, flora/fauna, etc. Expanded OCR: Supports 32 languages (up from 19); robust in low light, blur, and tilt; better with rare/ancient characters and jargon; improved long-document structure parsing. Text Understanding on par with pure LLMs: Seamless text–vision fusion for lossless, unified comprehension. 1. Interleaved-MRoPE: Full‑frequency allocation over time, width, and height via robust positional embeddings, enhancing long‑horizon video reasoning. 2. 
DeepStack: Fuses multi‑level ViT features to capture fine‑grained details and sharpen image–text alignment. 3. Text–Timestamp Alignment: Moves beyond T‑RoPE to precise, timestamp‑grounded event localization for stronger video temporal modeling.

This is the weight repository for Qwen3-VL-4B-Thinking. Below, we provide simple examples to show how to use Qwen3-VL with 🤖 ModelScope and 🤗 Transformers. The code of Qwen3-VL has been merged into the latest Hugging Face transformers, and we advise you to build from source. If you find our work helpful, feel free to cite us.
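The original card's `transformers` snippet is not present in this listing. As a rough sketch of what such usage typically looks like, the block below builds a single-image chat message and passes it through a processor's chat template. The helper names, the example URL, and the model id `Qwen/Qwen3-VL-4B-Thinking` are assumptions for illustration; `AutoModelForImageTextToText` is assumed to resolve this architecture in a recent `transformers` build, and `device_map="auto"` assumes `accelerate` is installed.

```python
from typing import Any, Dict, List


def build_image_messages(image: str, question: str) -> List[Dict[str, Any]]:
    """Build a one-turn chat with an image part followed by a text part."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image},  # URL or local path
                {"type": "text", "text": question},
            ],
        }
    ]


def generate_answer(model_id: str, messages: List[Dict[str, Any]],
                    max_new_tokens: int = 128) -> str:
    """Run one generation pass; downloads the model, so call it deliberately."""
    # Imported lazily so build_image_messages works without transformers.
    from transformers import AutoModelForImageTextToText, AutoProcessor

    model = AutoModelForImageTextToText.from_pretrained(model_id, device_map="auto")
    processor = AutoProcessor.from_pretrained(model_id)
    inputs = processor.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Keep only the newly generated tokens, dropping the echoed prompt.
    new_tokens = output_ids[:, inputs["input_ids"].shape[1]:]
    return processor.batch_decode(new_tokens, skip_special_tokens=True)[0]


if __name__ == "__main__":
    msgs = build_image_messages("https://example.com/cat.png", "Describe this image.")
    print(msgs)
    # Hypothetical model id; uncomment to run a real generation:
    # print(generate_answer("Qwen/Qwen3-VL-4B-Thinking", msgs))
```

Keeping message construction separate from model loading makes the request shape easy to inspect before committing to a multi-gigabyte download.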

license:apache-2.0
4,211
0

Hermes-4-70B-GGUF

> [!NOTE]
> Includes Unsloth chat template fixes! For `llama.cpp`, use `--jinja`
>
> Unsloth Dynamic 2.0 achieves superior accuracy & outperforms other leading quants.

Hermes 4 70B is a frontier, hybrid-mode reasoning model based on Llama-3.1-70B by Nous Research that is aligned to you.

Read the Hermes 4 technical report here: Hermes 4 Technical Report. Chat with Hermes in Nous Chat: https://chat.nousresearch.com

Training highlights include a newly synthesized post-training corpus emphasizing verified reasoning traces, massive improvements in math, code, STEM, logic, creativity, and format-faithful outputs, while preserving general assistant quality and broadly neutral alignment.

- Post-training corpus: Massively increased dataset size from 1M samples and 1.2B tokens to ~5M samples / ~60B tokens blended across reasoning and non-reasoning data.
- Hybrid reasoning mode with explicit `<think>…</think>` segments when the model decides to deliberate, and options to make your responses faster when you want.
- Reasoning that is top quality and expressive, and that improves math, code, STEM, logic, and even creative writing and subjective responses.
- Schema adherence & structured outputs: trained to produce valid JSON for given schemas and to repair malformed objects.
- Much easier to steer and align: extreme improvements on steerability, especially on reduced refusal rates.

In pursuit of the mission of producing models that are open, steerable, and capable of producing the full range of human expression, while being able to be aligned to your values, we created a new benchmark, RefusalBench, that tests the model's willingness to be helpful in a variety of scenarios commonly disallowed by closed and open models. Hermes 4 achieves SOTA on RefusalBench across all popular closed and open models in being helpful and conforming to your values, without censorship.

> Full tables, settings, and comparisons are in the technical report.

Hermes 4 uses Llama-3-Chat format with role headers and special tags.
Reasoning mode can be activated with the chat template via the flag `thinking=True` or by using a dedicated reasoning system prompt. You can add any additional system instructions before or after this system message, and it will adjust the model's policies, style, and effort of thinking, as well as its post-thinking style, format, identity, and more. You may also interleave the tool definition system message with the reasoning one.

Additionally, we provide a flag to keep the content in between the `<think> ... </think>` tags, which you can experiment with by setting `keep_cots=True`.

Hermes 4 supports function/tool calls within a single assistant turn, produced after its reasoning. You may also simply place tool definitions into the "tools:" field of your messages, and the chat template will parse them and create the system prompt for you. This also works with reasoning mode for improved accuracy of tool use.

The model will then generate tool calls within `<tool_call> {tool_call} </tool_call>` tags, for easy parsing. The tool_call tags are also added tokens, which makes them easy to parse while streaming! There are also automatic tool parsers built into vLLM and SGLang for Hermes: set the tool parser in vLLM to `hermes` and in SGLang to `qwen25`.

- Sampling defaults that work well: `temperature=0.6, top_p=0.95, top_k=20`.
- Template: Use the Llama chat format for Hermes 4 70B and 405B, or set `add_generation_prompt=True` when using `tokenizer.apply_chat_template(...)`.

For production serving on multi-GPU nodes, consider tensor parallel inference engines (e.g., SGLang/vLLM backends) with prefix caching.

Hermes 4 is available as BF16 original weights, as well as FP8 variants and GGUF variants by LM Studio.

- FP8: https://huggingface.co/NousResearch/Hermes-4-70B-FP8
- GGUF (Courtesy of LM Studio team!): https://huggingface.co/lmstudio-community/Hermes-4-70B-GGUF

Hermes 4 is also available in other sizes (e.g., 405B) with similar prompt formats.
See the Hermes 4 collection to explore them all: https://huggingface.co/collections/NousResearch/hermes-4-collection-68a731bfd452e20816725728
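The prompt-format and tool-call snippets from the original card are not present in this listing. The sketch below illustrates the round trip under stated assumptions: prompts are rendered with the publicly documented Llama 3 special tokens, reasoning is wrapped in `<think>` tags, and tool calls arrive as JSON inside `<tool_call>` tags. The function names and the `get_weather` tool are hypothetical, and in practice the tokenizer's own chat template should be preferred over hand-rendering.

```python
import json
import re
from typing import Any, Dict, List, Tuple

# Llama 3 chat special tokens (public Llama 3 prompt format).
BOS = "<|begin_of_text|>"
HEADER_START = "<|start_header_id|>"
HEADER_END = "<|end_header_id|>"
EOT = "<|eot_id|>"

# Assumed tag names; verify them against the tokenizer's added tokens.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def render_llama3_chat(messages: List[Dict[str, str]],
                       add_generation_prompt: bool = True) -> str:
    """Render role/content messages into a Llama 3 style prompt string."""
    parts = [BOS]
    for message in messages:
        parts.append(
            f"{HEADER_START}{message['role']}{HEADER_END}\n\n"
            f"{message['content'].strip()}{EOT}"
        )
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append(f"{HEADER_START}assistant{HEADER_END}\n\n")
    return "".join(parts)


def split_reasoning(text: str) -> Tuple[str, str]:
    """Return (reasoning, remainder) with the <think> segments removed."""
    reasoning = "\n".join(chunk.strip() for chunk in THINK_RE.findall(text))
    remainder = THINK_RE.sub("", text).strip()
    return reasoning, remainder


def extract_tool_calls(text: str) -> List[Dict[str, Any]]:
    """Parse every well-formed JSON payload found inside <tool_call> tags."""
    calls = []
    for raw in TOOL_CALL_RE.findall(text):
        try:
            calls.append(json.loads(raw))
        except json.JSONDecodeError:
            continue  # skip malformed payloads instead of failing the turn
    return calls


if __name__ == "__main__":
    prompt = render_llama3_chat([{"role": "user", "content": "Weather in Paris?"}])
    print(prompt)
    reply = ('<think>Need the weather tool.</think>\n<tool_call>\n'
             '{"name": "get_weather", "arguments": {"city": "Paris"}}\n</tool_call>')
    reasoning, rest = split_reasoning(reply)
    print(reasoning, extract_tool_calls(rest))
```

When serving through vLLM or SGLang, the built-in tool parsers mentioned above make the regex step unnecessary; hand parsing is mainly useful for raw completions.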

Llama-3.1
4,198
22

granite-4.0-h-350m-GGUF

license:apache-2.0
4,178
7

granite-4.0-micro-unsloth-bnb-4bit

See our collection for all versions of Granite-4.0, including GGUF, 4-bit & 16-bit formats. Learn to run Granite 4.0 correctly: read our Guide. See Unsloth Dynamic 2.0 GGUFs for our quantization benchmarks.

- Fine-tune Granite-4.0 for free using our Google Colab notebook
- Read our Blog about Granite-4.0 support: https://docs.unsloth.ai/new/ibm-granite-4.0
- View the rest of our notebooks in our docs here.

Model Summary: Granite-4.0-Micro is a 3B parameter long-context instruct model finetuned from Granite-4.0-Micro-Base using a combination of open source instruction datasets with permissive licenses and internally collected synthetic datasets. This model is developed using a diverse set of techniques with a structured chat format, including supervised finetuning, model alignment using reinforcement learning, and model merging. Granite 4.0 instruct models feature improved instruction following (IF) and tool-calling capabilities, making them more effective in enterprise applications.

- Developers: Granite Team, IBM
- HF Collection: Granite 4.0 Language Models HF Collection
- GitHub Repository: ibm-granite/granite-4.0-language-models
- Website: Granite Docs
- Release Date: October 2nd, 2025
- License: Apache 2.0

Supported Languages: English, German, Spanish, French, Japanese, Portuguese, Arabic, Czech, Italian, Korean, Dutch, and Chinese. Users may finetune Granite 4.0 models for languages beyond these.

Intended use: The model is designed to follow general instructions and can serve as the foundation for AI assistants across diverse domains, including business applications, as well as for LLM agents equipped with tool-use capabilities.

Capabilities:
- Summarization
- Text classification
- Text extraction
- Question-answering
- Retrieval Augmented Generation (RAG)
- Code related tasks
- Function-calling tasks
- Multilingual dialog use cases
- Fill-In-the-Middle (FIM) code completions

Generation: This is a simple example of how to use the Granite-4.0-Micro model. Then, copy the snippet from the section that is relevant for your use case.

Tool-calling: Granite-4.0-Micro comes with enhanced tool-calling capabilities, enabling seamless integration with external functions and APIs. To define a list of tools, please follow OpenAI's function definition schema.

Benchmarks: results are reported per metric for Micro Dense, H Micro Dense, H Tiny MoE, and H Small MoE.

Multilingual benchmarks and their included languages:

| Benchmark | # Langs | Languages |
|---|---|---|
| MMMLU | 11 | ar, de, en, es, fr, ja, ko, pt, zh, bn, hi |
| INCLUDE | 14 | hi, bn, ta, te, ar, de, es, fr, it, ja, ko, nl, pt, zh |

Model Architecture: Granite-4.0-Micro baseline is built on a decoder-only dense transformer architecture. Core components of this architecture are: GQA, RoPE, MLP with SwiGLU, RMSNorm, and shared input/output embeddings.

| Model | Micro Dense | H Micro Dense | H Tiny MoE | H Small MoE |
|---|---|---|---|---|
| Number of layers | 40 attention | 4 attention / 36 Mamba2 | 4 attention / 36 Mamba2 | 4 attention / 36 Mamba2 |
| MLP / Shared expert hidden size | 8192 | 8192 | 1024 | 1536 |

Training Data: Overall, our SFT data is largely composed of three key sources: (1) publicly available datasets with permissive licenses, (2) internal synthetic data targeting specific capabilities, and (3) a select set of human-curated data.

Infrastructure: We trained the Granite 4.0 Language Models utilizing an NVIDIA GB200 NVL72 cluster hosted in CoreWeave. Intra-rack communication occurs via the 72-GPU NVLink domain, and a non-blocking, full Fat-Tree NDR 400 Gb/s InfiniBand network provides inter-rack communication. This cluster provides a scalable and efficient infrastructure for training our models over thousands of GPUs.
Ethical Considerations and Limitations: Granite 4.0 Instruction Models are primarily finetuned using instruction-response pairs mostly in English, but also multilingual data covering multiple languages. Although this model can handle multilingual dialog use cases, its performance might not be similar to English tasks. In such case, introducing a small number of examples (few-shot) can help the model in generating more accurate outputs. While this model has been aligned by keeping safety in consideration, the model may in some cases produce inaccurate, biased, or unsafe responses to user prompts. So we urge the community to use this model with proper safety testing and tuning tailored for their specific tasks. Resources - ⭐️ Learn about the latest updates with Granite: https://www.ibm.com/granite - 📄 Get started with tutorials, best practices, and prompt engineering advice: https://www.ibm.com/granite/docs/ - 💡 Learn about the latest Granite learning resources: https://ibm.biz/granite-learning-resources
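The card asks for tools to be defined following OpenAI's function definition schema, but the accompanying example is not present in this listing. A minimal sketch of such a definition follows; the helper `make_function_tool` and the `get_current_weather` tool are hypothetical names chosen for illustration, not part of the Granite release.

```python
import json
from typing import Any, Dict, List


def make_function_tool(name: str, description: str,
                       properties: Dict[str, Dict[str, Any]],
                       required: List[str]) -> Dict[str, Any]:
    """Wrap a function description in the OpenAI-style tool envelope."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            # Parameters are described with a JSON Schema object.
            "parameters": {
                "type": "object",
                "properties": properties,
                "required": required,
            },
        },
    }


# Hypothetical tool used only to show the shape of the schema.
tools = [
    make_function_tool(
        name="get_current_weather",
        description="Get the current weather for a city.",
        properties={
            "city": {"type": "string", "description": "Name of the city"},
        },
        required=["city"],
    )
]

if __name__ == "__main__":
    print(json.dumps(tools, indent=2))
```

With Hugging Face `transformers`, a list like this is typically passed through the `tools` argument of `tokenizer.apply_chat_template`, which lets the model's chat template render the definitions into the prompt.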

license:apache-2.0
4,178
3

Magistral-Small-2507-GGUF

license:apache-2.0
4,085
19

Devstral-Small-2505-GGUF

license:apache-2.0
4,068
107

Qwen3-0.6B-Base

license:apache-2.0
4,037
5

Qwen2.5-72B-Instruct-bnb-4bit

4,010
8

Qwen2-7B-Instruct-bnb-4bit

license:apache-2.0
3,965
7

Qwen3-VL-2B-Instruct

> [!NOTE] > Includes Unsloth chat template fixes! For `llama.cpp`, use `--jinja` > Unsloth Dynamic 2.0 achieves superior accuracy & outperforms other leading quants. Meet Qwen3-VL — the most powerful vision-language model in the Qwen series to date. This generation delivers comprehensive upgrades across the board: superior text understanding & generation, deeper visual perception & reasoning, extended context length, enhanced spatial and video dynamics comprehension, and stronger agent interaction capabilities. Available in Dense and MoE architectures that scale from edge to cloud, with Instruct and reasoning‑enhanced Thinking editions for flexible, on‑demand deployment. Visual Agent: Operates PC/mobile GUIs—recognizes elements, understands functions, invokes tools, completes tasks. Visual Coding Boost: Generates Draw.io/HTML/CSS/JS from images/videos. Advanced Spatial Perception: Judges object positions, viewpoints, and occlusions; provides stronger 2D grounding and enables 3D grounding for spatial reasoning and embodied AI. Long Context & Video Understanding: Native 256K context, expandable to 1M; handles books and hours-long video with full recall and second-level indexing. Enhanced Multimodal Reasoning: Excels in STEM/Math—causal analysis and logical, evidence-based answers. Upgraded Visual Recognition: Broader, higher-quality pretraining is able to “recognize everything”—celebrities, anime, products, landmarks, flora/fauna, etc. Expanded OCR: Supports 32 languages (up from 19); robust in low light, blur, and tilt; better with rare/ancient characters and jargon; improved long-document structure parsing. Text Understanding on par with pure LLMs: Seamless text–vision fusion for lossless, unified comprehension. 1. Interleaved-MRoPE: Full‑frequency allocation over time, width, and height via robust positional embeddings, enhancing long‑horizon video reasoning. 2. 
DeepStack: Fuses multi‑level ViT features to capture fine‑grained details and sharpen image–text alignment. 3. Text–Timestamp Alignment: Moves beyond T‑RoPE to precise, timestamp‑grounded event localization for stronger video temporal modeling.

This is the weight repository for Qwen3-VL-2B-Instruct. Below, we provide simple examples to show how to use Qwen3-VL with 🤖 ModelScope and 🤗 Transformers. The code of Qwen3-VL has been merged into the latest Hugging Face transformers, and we advise you to build from source. If you find our work helpful, feel free to cite us.

license:apache-2.0
3,940
3

GLM-4.1V-9B-Thinking-GGUF

license:mit
3,919
38

Qwen3-VL-2B-Instruct-1M-GGUF

> [!NOTE] > Includes Unsloth chat template fixes! For `llama.cpp`, use `--jinja` > Unsloth Dynamic 2.0 achieves superior accuracy & outperforms other leading quants. Meet Qwen3-VL — the most powerful vision-language model in the Qwen series to date. This generation delivers comprehensive upgrades across the board: superior text understanding & generation, deeper visual perception & reasoning, extended context length, enhanced spatial and video dynamics comprehension, and stronger agent interaction capabilities. Available in Dense and MoE architectures that scale from edge to cloud, with Instruct and reasoning‑enhanced Thinking editions for flexible, on‑demand deployment. Visual Agent: Operates PC/mobile GUIs—recognizes elements, understands functions, invokes tools, completes tasks. Visual Coding Boost: Generates Draw.io/HTML/CSS/JS from images/videos. Advanced Spatial Perception: Judges object positions, viewpoints, and occlusions; provides stronger 2D grounding and enables 3D grounding for spatial reasoning and embodied AI. Long Context & Video Understanding: Native 256K context, expandable to 1M; handles books and hours-long video with full recall and second-level indexing. Enhanced Multimodal Reasoning: Excels in STEM/Math—causal analysis and logical, evidence-based answers. Upgraded Visual Recognition: Broader, higher-quality pretraining is able to “recognize everything”—celebrities, anime, products, landmarks, flora/fauna, etc. Expanded OCR: Supports 32 languages (up from 19); robust in low light, blur, and tilt; better with rare/ancient characters and jargon; improved long-document structure parsing. Text Understanding on par with pure LLMs: Seamless text–vision fusion for lossless, unified comprehension. 1. Interleaved-MRoPE: Full‑frequency allocation over time, width, and height via robust positional embeddings, enhancing long‑horizon video reasoning. 2. 
DeepStack: Fuses multi‑level ViT features to capture fine‑grained details and sharpen image–text alignment. 3. Text–Timestamp Alignment: Moves beyond T‑RoPE to precise, timestamp‑grounded event localization for stronger video temporal modeling.

This is the weight repository for Qwen3-VL-2B-Instruct. Below, we provide simple examples to show how to use Qwen3-VL with 🤖 ModelScope and 🤗 Transformers. The code of Qwen3-VL has been merged into the latest Hugging Face transformers, and we advise you to build from source. If you find our work helpful, feel free to cite us.
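The note at the top of this card recommends passing `--jinja` to `llama.cpp` so that the chat template embedded in the GGUF is applied. The sketch below assembles such a `llama-server` command line from Python. The file names are placeholders, the helper name is hypothetical, and the context size is an arbitrary example rather than a recommendation.

```python
import shlex
from typing import List, Optional


def llama_server_command(model_path: str,
                         mmproj_path: Optional[str] = None,
                         ctx_size: int = 8192,
                         binary: str = "llama-server") -> List[str]:
    """Assemble an argv list for llama.cpp's server with --jinja enabled."""
    cmd = [binary, "--model", model_path, "--jinja", "--ctx-size", str(ctx_size)]
    if mmproj_path:
        # Vision models need the multimodal projector file alongside the weights.
        cmd += ["--mmproj", mmproj_path]
    return cmd


if __name__ == "__main__":
    # Placeholder file names; substitute the GGUF files you actually downloaded.
    argv = llama_server_command("model.gguf", mmproj_path="mmproj.gguf", ctx_size=16384)
    print(shlex.join(argv))
    # To launch for real: subprocess.run(argv, check=True)
```

Building the command as a list rather than a string avoids shell-quoting problems when paths contain spaces.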

license:apache-2.0
3,902
0

Jan-nano-128k-GGUF

Unsloth Dynamic 2.0 achieves superior accuracy & outperforms other leading quants.

Jan-Nano-128k: Empowering deeper research through extended context understanding.

Jan-Nano-128k represents a significant advancement in compact language models for research applications. Building upon the success of Jan-Nano, this enhanced version features a native 128k context window that enables deeper, more comprehensive research capabilities without the performance degradation typically associated with context extension methods.

Key Improvements:
- 🔍 Research Deeper: Extended context allows for processing entire research papers, lengthy documents, and complex multi-turn conversations
- ⚡ Native 128k Window: Built from the ground up to handle long contexts efficiently, maintaining performance across the full context range
- 📈 Enhanced Performance: Unlike traditional context extension methods, Jan-Nano-128k shows improved performance with longer contexts

This model maintains full compatibility with Model Context Protocol (MCP) servers while dramatically expanding the scope of research tasks it can handle in a single session.

Jan-Nano-128k has been rigorously evaluated on the SimpleQA benchmark using our MCP-based methodology, demonstrating superior performance compared to its predecessor.

Traditional approaches to extending context length, such as YaRN (Yet another RoPE extensioN), often result in performance degradation as context length increases. Jan-Nano-128k breaks this paradigm. This fundamental difference makes Jan-Nano-128k ideal for research applications requiring deep document analysis, multi-document synthesis, and complex reasoning over large information sets.

Jan-Nano-128k is fully supported by the Jan beta build, providing a seamless local AI experience with complete privacy and control. For additional tutorials and community guidance, visit our Discussion Forums.

Note: The chat template is included in the tokenizer. For troubleshooting, download the Non-think chat template.

- Discussions: HuggingFace Community
- Issues: GitHub Repository
- Documentation: Official Docs

license:apache-2.0
3,895
32

Qwen3-VL-4B-Instruct-bnb-4bit

Unsloth Dynamic 2.0 achieves superior accuracy & outperforms other leading quants. Meet Qwen3-VL — the most powerful vision-language model in the Qwen series to date. This generation delivers comprehensive upgrades across the board: superior text understanding & generation, deeper visual perception & reasoning, extended context length, enhanced spatial and video dynamics comprehension, and stronger agent interaction capabilities. Available in Dense and MoE architectures that scale from edge to cloud, with Instruct and reasoning‑enhanced Thinking editions for flexible, on‑demand deployment. Visual Agent: Operates PC/mobile GUIs—recognizes elements, understands functions, invokes tools, completes tasks. Visual Coding Boost: Generates Draw.io/HTML/CSS/JS from images/videos. Advanced Spatial Perception: Judges object positions, viewpoints, and occlusions; provides stronger 2D grounding and enables 3D grounding for spatial reasoning and embodied AI. Long Context & Video Understanding: Native 256K context, expandable to 1M; handles books and hours-long video with full recall and second-level indexing. Enhanced Multimodal Reasoning: Excels in STEM/Math—causal analysis and logical, evidence-based answers. Upgraded Visual Recognition: Broader, higher-quality pretraining is able to “recognize everything”—celebrities, anime, products, landmarks, flora/fauna, etc. Expanded OCR: Supports 32 languages (up from 19); robust in low light, blur, and tilt; better with rare/ancient characters and jargon; improved long-document structure parsing. Text Understanding on par with pure LLMs: Seamless text–vision fusion for lossless, unified comprehension. 1. Interleaved-MRoPE: Full‑frequency allocation over time, width, and height via robust positional embeddings, enhancing long‑horizon video reasoning. 2. DeepStack: Fuses multi‑level ViT features to capture fine‑grained details and sharpen image–text alignment. 3. 
Text–Timestamp Alignment: Moves beyond T‑RoPE to precise, timestamp‑grounded event localization for stronger video temporal modeling.

This is the weight repository for Qwen3-VL-4B-Instruct. Below, we provide simple examples to show how to use Qwen3-VL with 🤖 ModelScope and 🤗 Transformers. The code of Qwen3-VL has been merged into the latest Hugging Face transformers, and we advise you to build from source. If you find our work helpful, feel free to cite us.

license:apache-2.0
3,883
0

Qwen3-32B-unsloth-bnb-4bit

license:apache-2.0
3,835
14

granite-4.0-h-micro

See our collection for all versions of Granite-4.0, including GGUF, 4-bit & 16-bit formats. Learn to run Granite 4.0 correctly: read our Guide. See Unsloth Dynamic 2.0 GGUFs for our quantization benchmarks.

- Fine-tune Granite-4.0 for free using our Google Colab notebook
- Read our Blog about Granite-4.0 support: https://docs.unsloth.ai/new/ibm-granite-4.0
- View the rest of our notebooks in our docs here.

Model Summary: Granite-4.0-H-Micro is a 3B parameter long-context instruct model finetuned from Granite-4.0-H-Micro-Base using a combination of open source instruction datasets with permissive licenses and internally collected synthetic datasets. This model is developed using a diverse set of techniques with a structured chat format, including supervised finetuning, model alignment using reinforcement learning, and model merging. Granite 4.0 instruct models feature improved instruction following (IF) and tool-calling capabilities, making them more effective in enterprise applications.

- Developers: Granite Team, IBM
- HF Collection: Granite 4.0 Language Models HF Collection
- GitHub Repository: ibm-granite/granite-4.0-language-models
- Website: Granite Docs
- Release Date: October 2nd, 2025
- License: Apache 2.0

Supported Languages: English, German, Spanish, French, Japanese, Portuguese, Arabic, Czech, Italian, Korean, Dutch, and Chinese. Users may finetune Granite 4.0 models for languages beyond these.

Intended use: The model is designed to respond to general instructions and can be used to build AI assistants for multiple domains, including business applications.

Capabilities:
- Summarization
- Text classification
- Text extraction
- Question-answering
- Retrieval Augmented Generation (RAG)
- Code related tasks
- Function-calling tasks
- Multilingual dialog use cases
- Fill-In-the-Middle (FIM) code completions

Generation: This is a simple example of how to use the Granite-4.0-H-Micro model. Then, copy the snippet from the section that is relevant for your use case.

Tool-calling: Granite-4.0-H-Micro comes with enhanced tool-calling capabilities, enabling seamless integration with external functions and APIs. To define a list of tools, please follow OpenAI's function definition schema.

Benchmarks: results are reported per metric for Micro Dense, H Micro Dense, H Tiny MoE, and H Small MoE.

Multilingual benchmarks and their included languages:

| Benchmark | # Langs | Languages |
|---|---|---|
| MMMLU | 11 | ar, de, en, es, fr, ja, ko, pt, zh, bn, hi |
| INCLUDE | 14 | hi, bn, ta, te, ar, de, es, fr, it, ja, ko, nl, pt, zh |

Model Architecture: Granite-4.0-H-Micro baseline is built on a decoder-only dense transformer architecture. Core components of this architecture are: GQA, Mamba2, MLP with SwiGLU, RMSNorm, and shared input/output embeddings.

| Model | Micro Dense | H Micro Dense | H Tiny MoE | H Small MoE |
|---|---|---|---|---|
| Number of layers | 40 attention | 4 attention / 36 Mamba2 | 4 attention / 36 Mamba2 | 4 attention / 36 Mamba2 |
| MLP / Shared expert hidden size | 8192 | 8192 | 1024 | 1536 |

Training Data: Overall, our SFT data is largely composed of three key sources: (1) publicly available datasets with permissive licenses, (2) internal synthetic data targeting specific capabilities, and (3) a select set of human-curated data.

Infrastructure: We trained the Granite 4.0 Language Models utilizing an NVIDIA GB200 NVL72 cluster hosted in CoreWeave. Intra-rack communication occurs via the 72-GPU NVLink domain, and a non-blocking, full Fat-Tree NDR 400 Gb/s InfiniBand network provides inter-rack communication. This cluster provides a scalable and efficient infrastructure for training our models over thousands of GPUs.
Ethical Considerations and Limitations: Granite 4.0 Instruction Models are primarily finetuned using instruction-response pairs mostly in English, but also multilingual data covering multiple languages. Although this model can handle multilingual dialog use cases, its performance might not be similar to English tasks. In such case, introducing a small number of examples (few-shot) can help the model in generating more accurate outputs. While this model has been aligned by keeping safety in consideration, the model may in some cases produce inaccurate, biased, or unsafe responses to user prompts. So we urge the community to use this model with proper safety testing and tuning tailored for their specific tasks. Resources - ⭐️ Learn about the latest updates with Granite: https://www.ibm.com/granite - 📄 Get started with tutorials, best practices, and prompt engineering advice: https://www.ibm.com/granite/docs/ - 💡 Learn about the latest Granite learning resources: https://ibm.biz/granite-learning-resources
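The limitations note above suggests adding a small number of examples (few-shot) when working outside English. One possible way to lay that out as chat messages is sketched here; the helper name, the system instruction, and the German sentiment examples are all made up for illustration.

```python
from typing import Dict, List, Tuple


def build_few_shot_messages(instruction: str,
                            examples: List[Tuple[str, str]],
                            query: str) -> List[Dict[str, str]]:
    """Interleave worked examples as prior user/assistant turns before the query."""
    messages = [{"role": "system", "content": instruction}]
    for user_text, assistant_text in examples:
        messages.append({"role": "user", "content": user_text})
        messages.append({"role": "assistant", "content": assistant_text})
    # The real question always comes last so the model answers it next.
    messages.append({"role": "user", "content": query})
    return messages


if __name__ == "__main__":
    # Illustrative examples only; replace with pairs from your own task.
    shots = [
        ("Das Essen war hervorragend.", "positiv"),
        ("Der Service war sehr langsam.", "negativ"),
    ]
    chat = build_few_shot_messages(
        "Klassifiziere die Stimmung als positiv oder negativ.",
        shots,
        "Das Zimmer war sauber und ruhig.",
    )
    for turn in chat:
        print(turn["role"], ":", turn["content"])
```

Presenting the examples as earlier conversation turns keeps them in the same format the model was tuned on, which is usually a safer choice than packing them into one long user message.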

license:apache-2.0
3,787
2

Qwen2-1.5B-Instruct-bnb-4bit

license:apache-2.0
3,787
2

Phi-3-mini-4k-instruct

license:mit
3,753
45

Qwen3-8B-Base

license:apache-2.0
3,749
4

DeepSeek-V3-0324-GGUF

license:mit
3,689
193

Qwen2.5-Coder-3B-Instruct-bnb-4bit

license:apache-2.0
3,663
4

mistral-7b

license:apache-2.0
3,655
8

Qwen2.5-Coder-7B

license:apache-2.0
3,648
4

SmolLM3-3B-128K-GGUF

license:apache-2.0
3,603
30

granite-4.0-h-small

See our collection for all versions of Granite-4.0 including GGUF, 4-bit & 16-bit formats. Learn to run Granite 4.0 correctly - Read our Guide . See Unsloth Dynamic 2.0 GGUFs for our quantization benchmarks. - Fine-tune Granite-4.0 for free using our Google Colab notebook - Read our Blog about Granite-4.0 support: https://docs.unsloth.ai/new/ibm-granite-4.0 - View the rest of our notebooks in our docs here. Model Summary: Granite-4.0-H-Small is a 32B parameter long-context instruct model finetuned from Granite-4.0-H-Small-Base using a combination of open source instruction datasets with permissive license and internally collected synthetic datasets. This model is developed using a diverse set of techniques with a structured chat format, including supervised finetuning, model alignment using reinforcement learning, and model merging. Granite 4.0 instruct models feature improved instruction following (IF) and tool-calling capabilities, making them more effective in enterprise applications. - Developers: Granite Team, IBM - HF Collection: Granite 4.0 Language Models HF Collection - GitHub Repository: ibm-granite/granite-4.0-language-models - Website: Granite Docs - Release Date: October 2nd, 2025 - License: Apache 2.0 Supported Languages: English, German, Spanish, French, Japanese, Portuguese, Arabic, Czech, Italian, Korean, Dutch, and Chinese. Users may finetune Granite 4.0 models for languages beyond these languages. Intended use: The model is designed to respond to general instructions and can be used to build AI assistants for multiple domains, including business applications. Capabilities Summarization Text classification Text extraction Question-answering Retrieval Augmented Generation (RAG) Code related tasks Function-calling tasks Multilingual dialog use cases Fill-In-the-Middle (FIM) code completions Need to test the examples. (especially the tool calling and RAG ones) --> Generation: This is a simple example of how to use Granite-4.0-H-Small model. 
Then, copy the snippet from the section that is relevant for your use case.

Tool-calling: Granite-4.0-H-Small comes with enhanced tool-calling capabilities, enabling seamless integration with external functions and APIs. To define a list of tools, please follow OpenAI's function definition schema. This is an example of how to use the Granite-4.0-H-Small model's tool-calling ability:

Benchmarks

| Metric | Micro Dense | H Micro Dense | H Tiny MoE | H Small MoE |
|---|---|---|---|---|

Multilingual Benchmarks and their included languages:

| Benchmark | Number of languages | Languages |
|---|---|---|
| MMMLU | 11 | ar, de, en, es, fr, ja, ko, pt, zh, bn, hi |
| INCLUDE | 14 | hi, bn, ta, te, ar, de, es, fr, it, ja, ko, nl, pt, zh |

Model Architecture: Granite-4.0-H-Small baseline is built on a decoder-only MoE transformer architecture. Core components of this architecture are: GQA, Mamba2, MoEs with shared experts, SwiGLU activation, RMSNorm, and shared input/output embeddings.

| Model | Micro Dense | H Micro Dense | H Tiny MoE | H Small MoE |
|---|---|---|---|---|
| Number of layers | 40 attention | 4 attention / 36 Mamba2 | 4 attention / 36 Mamba2 | 4 attention / 36 Mamba2 |
| MLP / Shared expert hidden size | 8192 | 8192 | 1024 | 1536 |

Training Data: Overall, our SFT data is largely composed of three key sources: (1) publicly available datasets with permissive licenses, (2) internal synthetic data targeting specific capabilities, and (3) a select set of human-curated data.

Infrastructure: We trained the Granite 4.0 Language Models utilizing an NVIDIA GB200 NVL72 cluster hosted in CoreWeave. Intra-rack communication occurs via the 72-GPU NVLink domain, and a non-blocking, full Fat-Tree NDR 400 Gb/s InfiniBand network provides inter-rack communication. This cluster provides a scalable and efficient infrastructure for training our models over thousands of GPUs.
Ethical Considerations and Limitations: Granite 4.0 Instruction Models are primarily finetuned using instruction-response pairs, mostly in English but also including multilingual data covering multiple languages. Although this model can handle multilingual dialog use cases, its performance might not match its performance on English tasks. In such cases, introducing a small number of examples (few-shot) can help the model generate more accurate outputs. While this model has been aligned with safety in mind, it may in some cases produce inaccurate, biased, or unsafe responses to user prompts. We therefore urge the community to use this model with proper safety testing and tuning tailored to their specific tasks.

Resources
- ⭐️ Learn about the latest updates with Granite: https://www.ibm.com/granite
- 📄 Get started with tutorials, best practices, and prompt engineering advice: https://www.ibm.com/granite/docs/
- 💡 Learn about the latest Granite learning resources: https://ibm.biz/granite-learning-resources
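The generation snippet referred to in the model card did not survive on this page. The following is a minimal sketch of what such a call might look like with Hugging Face `transformers`, not the card's original code. The repository id `ibm-granite/granite-4.0-h-small`, the helper names `build_messages` and `generate_reply`, and the generation settings are assumptions made for illustration; `device_map="auto"` additionally assumes `accelerate` is installed.

```python
from typing import Dict, List

# Assumption: upstream IBM repository id; an Unsloth repository name could be used instead.
MODEL_ID = "ibm-granite/granite-4.0-h-small"


def build_messages(user_prompt: str) -> List[Dict[str, str]]:
    """Return a single-turn conversation in the role/content chat format."""
    return [{"role": "user", "content": user_prompt}]


def generate_reply(user_prompt: str, max_new_tokens: int = 100) -> str:
    """Load the model and generate one reply (downloads the weights on first call)."""
    # Imported lazily so build_messages() works without these packages installed.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
    )
    # The chat template turns the message list into the model's structured chat format.
    inputs = tokenizer.apply_chat_template(
        build_messages(user_prompt),
        add_generation_prompt=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, not the prompt.
    new_tokens = output_ids[0, inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)


# Example call (requires a GPU with enough memory for a 32B-parameter model):
#   print(generate_reply("Summarize the benefits of long-context models in one sentence."))
```

The model is loaded inside the function so that importing the snippet has no side effects; in a real application the tokenizer and model would normally be loaded once and reused.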

NaNK
license:apache-2.0
3,492
3

gemma-3-4b-it-bnb-4bit

NaNK
3,460
10

Qwen2-0.5B-Instruct-bnb-4bit

NaNK
license:apache-2.0
3,424
6

Llama-3.1-8B

NaNK
llama
3,403
3

gemma-3-1b-it-bnb-4bit

NaNK
3,401
9

Cosmos-Reason1-7B-GGUF

NaNK
3,391
8

Qwen2.5-0.5B-Instruct-bnb-4bit

NaNK
license:apache-2.0
3,389
4

gemma-3-27b-it-bnb-4bit

NaNK
3,333
18

Mistral-Small-3.2-24B-Instruct-2506

NaNK
license:apache-2.0
3,225
11

Qwen3-14B-Base-unsloth-bnb-4bit

NaNK
license:apache-2.0
3,222
4

Qwen2-VL-7B-Instruct-bnb-4bit

NaNK
license:apache-2.0
3,098
6

Qwen3-4B-Instruct-2507-bnb-4bit

NaNK
license:apache-2.0
3,093
2

Qwen3-30B-A3B-128K-GGUF

NaNK
license:apache-2.0
3,049
60

Mistral-Small-3.1-24B-Instruct-2503-GGUF

NaNK
license:apache-2.0
3,036
83

DeepSeek-R1-Distill-Llama-8B

NaNK
llama
3,028
106

Phi-4-mini-instruct-bnb-4bit

NaNK
license:mit
3,024
4

Qwen3-30B-A3B

NaNK
3,005
19

Phi-3.5-mini-instruct

llama
2,989
45

Llama-3.2-11B-Vision-unsloth-bnb-4bit

NaNK
mllama
2,987
5

GLM-4-9B-0414-GGUF

NaNK
license:mit
2,947
20

gemma-2-2b-it

NaNK
2,886
17

Qwen2.5-1.5B-bnb-4bit

NaNK
license:apache-2.0
2,883
1

DeepSeek-R1-0528-Qwen3-8B

NaNK
license:mit
2,877
16

Qwen2.5-VL-32B-Instruct-GGUF

NaNK
license:apache-2.0
2,862
6

medgemma-27b-text-it-unsloth-bnb-4bit

NaNK
2,831
5

granite-4.0-h-tiny

See our collection for all versions of Granite-4.0, including GGUF, 4-bit, and 16-bit formats. Learn to run Granite 4.0 correctly: read our Guide. See Unsloth Dynamic 2.0 GGUFs for our quantization benchmarks.

- Fine-tune Granite-4.0 for free using our Google Colab notebook
- Read our Blog about Granite-4.0 support: https://docs.unsloth.ai/new/ibm-granite-4.0
- View the rest of our notebooks in our docs here.

Model Summary: Granite-4.0-H-Tiny is a 7B-parameter long-context instruct model finetuned from Granite-4.0-H-Tiny-Base using a combination of open-source instruction datasets with permissive licenses and internally collected synthetic datasets. The model was developed using a diverse set of techniques with a structured chat format, including supervised finetuning, model alignment using reinforcement learning, and model merging. Granite 4.0 instruct models feature improved instruction following (IF) and tool-calling capabilities, making them more effective in enterprise applications.

- Developers: Granite Team, IBM
- HF Collection: Granite 4.0 Language Models HF Collection
- GitHub Repository: ibm-granite/granite-4.0-language-models
- Website: Granite Docs
- Release Date: October 2nd, 2025
- License: Apache 2.0

Supported Languages: English, German, Spanish, French, Japanese, Portuguese, Arabic, Czech, Italian, Korean, Dutch, and Chinese. Users may finetune Granite 4.0 models for languages beyond these.

Intended use: The model is designed to respond to general instructions and can be used to build AI assistants for multiple domains, including business applications.

Capabilities:
- Summarization
- Text classification
- Text extraction
- Question-answering
- Retrieval Augmented Generation (RAG)
- Code related tasks
- Function-calling tasks
- Multilingual dialog use cases
- Fill-In-the-Middle (FIM) code completions

Generation: This is a simple example of how to use the Granite-4.0-H-Tiny model.
Then, copy the snippet from the section that is relevant for your use case.

Tool-calling: Granite-4.0-H-Tiny comes with enhanced tool-calling capabilities, enabling seamless integration with external functions and APIs. To define a list of tools, please follow OpenAI's function definition schema. This is an example of how to use the Granite-4.0-H-Tiny model's tool-calling ability:

Benchmarks

| Metric | Micro Dense | H Micro Dense | H Tiny MoE | H Small MoE |
|---|---|---|---|---|

Multilingual Benchmarks and their included languages:

| Benchmark | Number of languages | Languages |
|---|---|---|
| MMMLU | 11 | ar, de, en, es, fr, ja, ko, pt, zh, bn, hi |
| INCLUDE | 14 | hi, bn, ta, te, ar, de, es, fr, it, ja, ko, nl, pt, zh |

Model Architecture: Granite-4.0-H-Tiny baseline is built on a decoder-only MoE transformer architecture. Core components of this architecture are: GQA, Mamba2, MoEs with shared experts, SwiGLU activation, RMSNorm, and shared input/output embeddings.

| Model | Micro Dense | H Micro Dense | H Tiny MoE | H Small MoE |
|---|---|---|---|---|
| Number of layers | 40 attention | 4 attention / 36 Mamba2 | 4 attention / 36 Mamba2 | 4 attention / 36 Mamba2 |
| MLP / Shared expert hidden size | 8192 | 8192 | 1024 | 1536 |

Training Data: Overall, our SFT data is largely composed of three key sources: (1) publicly available datasets with permissive licenses, (2) internal synthetic data targeting specific capabilities, and (3) a select set of human-curated data.

Infrastructure: We trained the Granite 4.0 Language Models utilizing an NVIDIA GB200 NVL72 cluster hosted in CoreWeave. Intra-rack communication occurs via the 72-GPU NVLink domain, and a non-blocking, full Fat-Tree NDR 400 Gb/s InfiniBand network provides inter-rack communication. This cluster provides a scalable and efficient infrastructure for training our models over thousands of GPUs.
Ethical Considerations and Limitations: Granite 4.0 Instruction Models are primarily finetuned using instruction-response pairs, mostly in English but also including multilingual data covering multiple languages. Although this model can handle multilingual dialog use cases, its performance might not match its performance on English tasks. In such cases, introducing a small number of examples (few-shot) can help the model generate more accurate outputs. While this model has been aligned with safety in mind, it may in some cases produce inaccurate, biased, or unsafe responses to user prompts. We therefore urge the community to use this model with proper safety testing and tuning tailored to their specific tasks.

Resources
- ⭐️ Learn about the latest updates with Granite: https://www.ibm.com/granite
- 📄 Get started with tutorials, best practices, and prompt engineering advice: https://www.ibm.com/granite/docs/
- 💡 Learn about the latest Granite learning resources: https://ibm.biz/granite-learning-resources
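The tool-calling section above asks for tools to be defined in OpenAI's function definition schema, but the accompanying snippet is missing from this page. The sketch below shows one way such a definition could be built in Python. The helper `make_tool` and the `get_current_weather` tool are hypothetical names introduced only for illustration; they are not part of the Granite model card.

```python
import json
from typing import Any, Dict, List


def make_tool(
    name: str,
    description: str,
    properties: Dict[str, Dict[str, str]],
    required: List[str],
) -> Dict[str, Any]:
    """Build one tool entry in OpenAI's function definition schema."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": {
                "type": "object",
                "properties": properties,
                "required": required,
            },
        },
    }


# Hypothetical tool, used only to illustrate the schema.
weather_tool = make_tool(
    name="get_current_weather",
    description="Get the current weather for a specified city.",
    properties={"city": {"type": "string", "description": "Name of the city"}},
    required=["city"],
)

messages = [{"role": "user", "content": "What's the weather like in Boston right now?"}]

# With a loaded Hugging Face tokenizer, the tool list is typically passed as:
#   tokenizer.apply_chat_template(messages, tools=[weather_tool],
#                                 tokenize=False, add_generation_prompt=True)
print(json.dumps(weather_tool, indent=2))
```

Keeping the tool list as plain dictionaries means the same definitions can be reused with any client that accepts this schema.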

NaNK
license:apache-2.0
2,826
4

Llama-4-Scout-17B-16E-Instruct-unsloth-bnb-4bit

NaNK
llama4
2,821
80

medgemma-27b-it-GGUF

NaNK
2,820
25

Phi-4-mini-instruct

license:mit
2,816
21

Seed-Coder-8B-Reasoning-GGUF

Unsloth Dynamic 2.0 achieves superior accuracy & outperforms other leading quants.

Introduction
We are thrilled to introduce Seed-Coder, a powerful, transparent, and parameter-efficient family of open-source code models at the 8B scale, featuring base, instruct, and reasoning variants. Seed-Coder aims to promote the evolution of open code models through the following highlights.

- Model-centric: Seed-Coder predominantly leverages LLMs instead of hand-crafted rules for code data filtering, minimizing manual effort in pretraining data construction.
- Transparent: We openly share detailed insights into our model-centric data pipeline, including methods for curating GitHub data, commits data, and code-related web data.
- Powerful: Seed-Coder achieves state-of-the-art performance among open-source models of comparable size across a diverse range of coding tasks.

This repo contains the Seed-Coder-8B-Reasoning model, which has the following features:

- Type: Causal language models
- Training Stage: Pretraining & Post-training
- Data Source: Public datasets
- Context Length: 65,536

Model Downloads

| Model Name | Length | Download | Notes |
|---|---|---|---|
| Seed-Coder-8B-Base | 32K | 🤗 Model | Pretrained on our model-centric code data. |
| Seed-Coder-8B-Instruct | 32K | 🤗 Model | Instruction-tuned for alignment with user intent. |
| 👉 Seed-Coder-8B-Reasoning | 64K | 🤗 Model | RL trained to boost reasoning capabilities. |
| Seed-Coder-8B-Reasoning-bf16 | 64K | 🤗 Model | RL trained to boost reasoning capabilities. |

Requirements
You will need to install the latest versions of `transformers` and `accelerate`:

Here is a simple example demonstrating how to load the model and perform code generation using the Hugging Face `pipeline` API:

Evaluation
Seed-Coder-8B-Reasoning delivers impressive performance on competitive programming, demonstrating that smaller LLMs can also be competent at complex reasoning tasks. Our model surpasses QwQ-32B and DeepSeek-R1 on IOI'2024, and achieves an ELO rating comparable to o1-mini on Codeforces contests. For detailed benchmark performance, please refer to our 📑 Technical Report.

This project is licensed under the MIT License. See the LICENSE file for details.
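The `pipeline` example mentioned above is not preserved on this page. A minimal sketch of what it might look like follows; it is not the authors' original snippet. The repository id `ByteDance-Seed/Seed-Coder-8B-Reasoning` (the original checkpoint rather than this GGUF upload), the helper names, and the default token budget are assumptions for illustration.

```python
from typing import Dict, List

# Assumption: id of the original (non-GGUF) checkpoint on the Hugging Face Hub.
MODEL_ID = "ByteDance-Seed/Seed-Coder-8B-Reasoning"
CONTEXT_LENGTH = 65_536  # context length stated in the model card


def build_messages(task: str) -> List[Dict[str, str]]:
    """Wrap a coding task as a single-turn chat conversation."""
    return [{"role": "user", "content": task}]


def clamp_new_tokens(prompt_tokens: int, requested: int) -> int:
    """Limit the generation budget so prompt plus output fits in the context window."""
    return max(0, min(requested, CONTEXT_LENGTH - prompt_tokens))


def generate_code(task: str, max_new_tokens: int = 2048) -> str:
    """Run the model through the text-generation pipeline (downloads weights on first call)."""
    # Imported lazily so the helpers above work without transformers installed.
    from transformers import pipeline

    generator = pipeline(
        "text-generation", model=MODEL_ID, torch_dtype="auto", device_map="auto"
    )
    outputs = generator(build_messages(task), max_new_tokens=max_new_tokens)
    # For chat-style input, generated_text holds the whole conversation;
    # the last message is the assistant's reply.
    return outputs[0]["generated_text"][-1]["content"]


# Example call (requires suitable hardware):
#   print(generate_code("Write a quick sort algorithm."))
```

Reasoning models tend to produce long outputs, so a helper like `clamp_new_tokens` can be useful for keeping a request within the 65,536-token window.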

NaNK
llama
2,766
9

gemma-2-9b-it

NaNK
2,685
10

GLM-4.5-GGUF

📍 Use GLM-4.5 API services on Z.ai API Platform (Global) or Zhipu AI Open Platform (Mainland China).

The GLM-4.5 series models are foundation models designed for intelligent agents. GLM-4.5 has 355 billion total parameters with 32 billion active parameters, while GLM-4.5-Air adopts a more compact design with 106 billion total parameters and 12 billion active parameters. GLM-4.5 models unify reasoning, coding, and intelligent agent capabilities to meet the complex demands of intelligent agent applications.

Both GLM-4.5 and GLM-4.5-Air are hybrid reasoning models that provide two modes: thinking mode for complex reasoning and tool usage, and non-thinking mode for immediate responses.

We have open-sourced the base models, hybrid reasoning models, and FP8 versions of the hybrid reasoning models for both GLM-4.5 and GLM-4.5-Air. They are released under the MIT open-source license and can be used commercially and for secondary development.

As demonstrated in our comprehensive evaluation across 12 industry-standard benchmarks, GLM-4.5 achieves exceptional performance with a score of 63.2, placing 3rd among all proprietary and open-source models. Notably, GLM-4.5-Air delivers competitive results at 59.8 while maintaining superior efficiency.

For more eval results, showcases, and technical details, please visit our technical blog. The technical report will be released soon.

The model code, tool parser, and reasoning parser can be found in the implementations of transformers, vLLM, and SGLang.
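The total and active parameter counts quoted above can be turned into an active fraction with a few lines of arithmetic. The sketch below uses only the numbers from the description; the names `MODELS` and `active_fraction` are illustrative.

```python
# (total parameters, active parameters) in billions, as stated in the description.
MODELS = {
    "GLM-4.5": (355, 32),
    "GLM-4.5-Air": (106, 12),
}


def active_fraction(total_b: float, active_b: float) -> float:
    """Share of the model's parameters that are active."""
    return active_b / total_b


for name, (total, active) in MODELS.items():
    share = active_fraction(total, active)
    print(f"{name}: {active}B of {total}B parameters active ({share:.1%})")
# GLM-4.5: 32B of 355B parameters active (9.0%)
# GLM-4.5-Air: 12B of 106B parameters active (11.3%)
```

In both models only about a tenth of the weights are active, which is why the active count, rather than the total, is the more relevant figure for per-token compute, while memory requirements still follow the total.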

NaNK
license:mit
2,667
43

Nanonets-OCR-s

2,663
7

gemma-3-12b-it-bnb-4bit

NaNK
2,662
5

Kimi-Dev-72B-GGUF

NaNK
license:mit
2,644
43

Qwen3-VL-32B-Thinking-unsloth-bnb-4bit

Unsloth Dynamic 2.0 achieves superior accuracy & outperforms other leading quants.

Meet Qwen3-VL, the most powerful vision-language model in the Qwen series to date. This generation delivers comprehensive upgrades across the board: superior text understanding & generation, deeper visual perception & reasoning, extended context length, enhanced spatial and video dynamics comprehension, and stronger agent interaction capabilities. It is available in Dense and MoE architectures that scale from edge to cloud, with Instruct and reasoning-enhanced Thinking editions for flexible, on-demand deployment.

- Visual Agent: Operates PC/mobile GUIs: recognizes elements, understands functions, invokes tools, completes tasks.
- Visual Coding Boost: Generates Draw.io/HTML/CSS/JS from images/videos.
- Advanced Spatial Perception: Judges object positions, viewpoints, and occlusions; provides stronger 2D grounding and enables 3D grounding for spatial reasoning and embodied AI.
- Long Context & Video Understanding: Native 256K context, expandable to 1M; handles books and hours-long video with full recall and second-level indexing.
- Enhanced Multimodal Reasoning: Excels in STEM/Math, with causal analysis and logical, evidence-based answers.
- Upgraded Visual Recognition: Broader, higher-quality pretraining lets the model “recognize everything”: celebrities, anime, products, landmarks, flora/fauna, etc.
- Expanded OCR: Supports 32 languages (up from 19); robust in low light, blur, and tilt; better with rare/ancient characters and jargon; improved long-document structure parsing.
- Text Understanding on par with pure LLMs: Seamless text-vision fusion for lossless, unified comprehension.

1. Interleaved-MRoPE: Full-frequency allocation over time, width, and height via robust positional embeddings, enhancing long-horizon video reasoning.
2. DeepStack: Fuses multi-level ViT features to capture fine-grained details and sharpen image-text alignment.
3. Text-Timestamp Alignment: Moves beyond T-RoPE to precise, timestamp-grounded event localization for stronger video temporal modeling.

This is the weight repository for Qwen3-VL-32B-Thinking.

Below, we provide simple examples to show how to use Qwen3-VL with 🤖 ModelScope and 🤗 Transformers. The code for Qwen3-VL is included in the latest Hugging Face transformers, and we advise you to build from source with the command:

Here is a code snippet showing how to use the chat model with `transformers`:

If you find our work helpful, feel free to cite us.
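The `transformers` snippet referred to above is missing from this page. The sketch below shows roughly how an image-plus-text chat request might be assembled; it is not the original example. The repository id `Qwen/Qwen3-VL-32B-Thinking`, the helper names, and the use of the generic `AutoModelForImageTextToText` class (which assumes a recent transformers release that maps it to Qwen3-VL) are all assumptions.

```python
from typing import Any, Dict, List

# Assumption: id of the original checkpoint on the Hugging Face Hub.
MODEL_ID = "Qwen/Qwen3-VL-32B-Thinking"


def build_messages(image: str, question: str) -> List[Dict[str, Any]]:
    """Build a single-turn multimodal message: one image followed by a text prompt."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image},  # URL or local path
                {"type": "text", "text": question},
            ],
        }
    ]


def describe_image(image: str, question: str, max_new_tokens: int = 128) -> str:
    """Load the model and answer a question about an image (downloads weights on first call)."""
    # Imported lazily so build_messages() works without transformers installed.
    from transformers import AutoModelForImageTextToText, AutoProcessor

    processor = AutoProcessor.from_pretrained(MODEL_ID)
    model = AutoModelForImageTextToText.from_pretrained(
        MODEL_ID, torch_dtype="auto", device_map="auto"
    )
    inputs = processor.apply_chat_template(
        build_messages(image, question),
        tokenize=True,
        add_generation_prompt=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Keep only the tokens generated after the prompt.
    new_tokens = output_ids[:, inputs["input_ids"].shape[1]:]
    return processor.batch_decode(new_tokens, skip_special_tokens=True)[0]


# Example call (requires suitable hardware):
#   print(describe_image("https://example.com/demo.jpeg", "Describe this image."))
```

Thinking editions emit their reasoning before the final answer, so a larger `max_new_tokens` than the default used here is likely needed in practice.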

NaNK
license:apache-2.0
2,625
1

Qwen2-VL-2B-Instruct-bnb-4bit

NaNK
license:apache-2.0
2,624
1

gemma-3-270m-unsloth-bnb-4bit

See our collection for all versions of Gemma 3, including GGUF, 4-bit, and 16-bit formats. Read our Guide to see how to run Gemma 3 correctly.

- Fine-tune Gemma 3 (270M) for free using our Google Colab notebook here!
- Read our Blog about Gemma 3 support: unsloth.ai/blog/gemma3
- View the rest of our notebooks in our docs here.

| Unsloth supports | Free Notebooks | Performance | Memory use |
|---|---|---|---|
| Gemma 3 (4B) | ▶️ Start on Colab | 2x faster | 80% less |
| Gemma-3n-E4B | ▶️ Start on Colab | 2x faster | 60% less |
| Gemma-3n-E4B (Audio) | ▶️ Start on Colab | 2x faster | 60% less |
| GRPO with Gemma 3 (1B) | ▶️ Start on Colab | 2x faster | 80% less |
| Gemma 3 (4B) Vision | ▶️ Start on Colab | 2x faster | 60% less |

- [Gemma 3 Technical Report][g3-tech-report]
- [Responsible Generative AI Toolkit][rai-toolkit]
- [Gemma on Kaggle][kaggle-gemma]
- [Gemma on Vertex Model Garden][vertex-mg-gemma3]

Summary description and brief definition of inputs and outputs.

Gemma is a family of lightweight, state-of-the-art open models from Google, built from the same research and technology used to create the Gemini models. Gemma 3 models are multimodal, handling text and image input and generating text output, with open weights for both pre-trained variants and instruction-tuned variants. Gemma 3 has a large, 128K context window, multilingual support in over 140 languages, and is available in more sizes than previous versions. Gemma 3 models are well-suited for a variety of text generation and image understanding tasks, including question answering, summarization, and reasoning.
Their relatively small size makes it possible to deploy them in environments with limited resources such as laptops, desktops, or your own cloud infrastructure, democratizing access to state-of-the-art AI models and helping foster innovation for everyone.

- Input:
  - Text string, such as a question, a prompt, or a document to be summarized
  - Images, normalized to 896 x 896 resolution and encoded to 256 tokens each, for the 4B, 12B, and 27B sizes
  - Total input context of 128K tokens for the 4B, 12B, and 27B sizes, and 32K tokens for the 1B and 270M sizes
- Output:
  - Generated text in response to the input, such as an answer to a question, analysis of image content, or a summary of a document
  - Total output context up to 128K tokens for the 4B, 12B, and 27B sizes, and 32K tokens for the 1B and 270M sizes per request, subtracting the request input tokens

Data used for model training and how the data was processed.

These models were trained on a dataset of text data that includes a wide variety of sources. The 27B model was trained with 14 trillion tokens, the 12B model with 12 trillion tokens, the 4B model with 4 trillion tokens, the 1B model with 2 trillion tokens, and the 270M model with 6 trillion tokens. The knowledge cutoff date for the training data was August 2024. Here are the key components:

- Web Documents: A diverse collection of web text ensures the model is exposed to a broad range of linguistic styles, topics, and vocabulary. The training dataset includes content in over 140 languages.
- Code: Exposing the model to code helps it to learn the syntax and patterns of programming languages, which improves its ability to generate code and understand code-related questions.
- Mathematics: Training on mathematical text helps the model learn logical reasoning and symbolic representation, and helps it address mathematical queries.
- Images: A wide range of images enables the model to perform image analysis and visual data extraction tasks.
The combination of these diverse data sources is crucial for training a powerful multimodal model that can handle a wide variety of different tasks and data formats.

Here are the key data cleaning and filtering methods applied to the training data:

- CSAM Filtering: Rigorous CSAM (Child Sexual Abuse Material) filtering was applied at multiple stages in the data preparation process to ensure the exclusion of harmful and illegal content.
- Sensitive Data Filtering: As part of making Gemma pre-trained models safe and reliable, automated techniques were used to filter out certain personal information and other sensitive data from training sets.
- Additional methods: Filtering based on content quality and safety in line with [our policies][safety-policies].

Gemma was trained using [Tensor Processing Unit (TPU)][tpu] hardware (TPUv4p, TPUv5p and TPUv5e). Training vision-language models (VLMs) requires significant computational power. TPUs, designed specifically for matrix operations common in machine learning, offer several advantages in this domain:

- Performance: TPUs are specifically designed to handle the massive computations involved in training VLMs. They can speed up training considerably compared to CPUs.
- Memory: TPUs often come with large amounts of high-bandwidth memory, allowing for the handling of large models and batch sizes during training. This can lead to better model quality.
- Scalability: TPU Pods (large clusters of TPUs) provide a scalable solution for handling the growing complexity of large foundation models. You can distribute training across multiple TPU devices for faster and more efficient processing.
- Cost-effectiveness: In many scenarios, TPUs can provide a more cost-effective solution for training large models compared to CPU-based infrastructure, especially when considering the time and resources saved due to faster training.
- These advantages are aligned with [Google's commitments to operate sustainably][sustainability].
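The input and output limits listed earlier in this model card (256 tokens per image, a 128K or 32K total context depending on model size, and output counted against the same window) can be combined into a simple token budget. The sketch below is illustrative only: it assumes "128K" and "32K" mean 131,072 and 32,768 tokens, and the names `CONTEXT_TOKENS`, `IMAGE_CAPABLE_SIZES`, and `remaining_output_budget` are hypothetical.

```python
TOKENS_PER_IMAGE = 256  # each image is encoded to 256 tokens, per the model card

# Assumption: "128K" and "32K" are read as 131,072 and 32,768 tokens.
CONTEXT_TOKENS = {
    "270M": 32_768,
    "1B": 32_768,
    "4B": 131_072,
    "12B": 131_072,
    "27B": 131_072,
}

# Image input is listed for these sizes in the model card.
IMAGE_CAPABLE_SIZES = {"4B", "12B", "27B"}


def remaining_output_budget(size: str, text_tokens: int, num_images: int = 0) -> int:
    """Tokens left for generation after subtracting the request's input tokens."""
    if num_images and size not in IMAGE_CAPABLE_SIZES:
        raise ValueError(f"Image input is not listed for the {size} size")
    used = text_tokens + num_images * TOKENS_PER_IMAGE
    limit = CONTEXT_TOKENS[size]
    if used > limit:
        raise ValueError(f"Input of {used} tokens exceeds the {limit}-token context")
    return limit - used


print(remaining_output_budget("4B", text_tokens=2_000, num_images=10))  # 126512
print(remaining_output_budget("270M", text_tokens=1_000))  # 31768
```

Ten images cost 2,560 tokens, so image-heavy prompts remain cheap relative to the 128K window of the larger models.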
Training was done using [JAX][jax] and [ML Pathways][ml-pathways]. JAX allows researchers to take advantage of the latest generation of hardware, including TPUs, for faster and more efficient training of large models. ML Pathways is Google's latest effort to build artificially intelligent systems capable of generalizing across multiple tasks. This is specially suitable for foundation models, including large language models like these ones. Together, JAX and ML Pathways are used as described in the [paper about the Gemini family of models][gemini-2-paper]; "the 'single controller' programming model of Jax and Pathways allows a single Python process to orchestrate the entire training run, dramatically simplifying the development workflow." These models were evaluated against a large collection of different datasets and metrics to cover different aspects of text generation. Evaluation results marked with IT are for instruction-tuned models. Evaluation results marked with PT are for pre-trained models.
| Benchmark | n-shot | Gemma 3 PT 270M |
| :------------------------ | :-----------: | ------------------: |
| [HellaSwag][hellaswag] | 10-shot | 40.9 |
| [BoolQ][boolq] | 0-shot | 61.4 |
| [PIQA][piqa] | 0-shot | 67.7 |
| [TriviaQA][triviaqa] | 5-shot | 15.4 |
| [ARC-c][arc] | 25-shot | 29.0 |
| [ARC-e][arc] | 0-shot | 57.7 |
| [WinoGrande][winogrande] | 5-shot | 52.0 |

[hellaswag]: https://arxiv.org/abs/1905.07830
[boolq]: https://arxiv.org/abs/1905.10044
[piqa]: https://arxiv.org/abs/1911.11641
[triviaqa]: https://arxiv.org/abs/1705.03551
[arc]: https://arxiv.org/abs/1911.01547
[winogrande]: https://arxiv.org/abs/1907.10641

| Benchmark | n-shot | Gemma 3 IT 270M |
| :------------------------ | :-----------: | ------------------: |
| [HellaSwag][hellaswag] | 0-shot | 37.7 |
| [PIQA][piqa] | 0-shot | 66.2 |
| [ARC-c][arc] | 0-shot | 28.2 |
| [WinoGrande][winogrande] | 0-shot | 52.3 |
| [BIG-Bench Hard][bbh] | few-shot | 26.7 |
| [IF Eval][ifeval] | 0-shot | 51.2 |

[bbh]: https://paperswithcode.com/dataset/bbh
[ifeval]: https://arxiv.org/abs/2311.07911

| Benchmark | n-shot | Gemma 3 IT 1B | Gemma 3 IT 4B | Gemma 3 IT 12B | Gemma 3 IT 27B |
|--------------------------------|--------|:-------------:|:-------------:|:--------------:|:--------------:|
| [GPQA][gpqa] Diamond | 0-shot | 19.2 | 30.8 | 40.9 | 42.4 |
| [SimpleQA][simpleqa] | 0-shot | 2.2 | 4.0 | 6.3 | 10.0 |
| [FACTS Grounding][facts-grdg] | - | 36.4 | 70.1 | 75.8 | 74.9 |
| [BIG-Bench Hard][bbh] | 0-shot | 39.1 | 72.2 | 85.7 | 87.6 |
| [BIG-Bench Extra Hard][bbeh] | 0-shot | 7.2 | 11.0 | 16.3 | 19.3 |
| [IFEval][ifeval] | 0-shot | 80.2 | 90.2 | 88.9 | 90.4 |

| Benchmark | n-shot | Gemma 3 PT 1B | Gemma 3 PT 4B | Gemma 3 PT 12B | Gemma 3 PT 27B |
|--------------------------------|----------|:-------------:|:-------------:|:--------------:|:--------------:|
| [HellaSwag][hellaswag] | 10-shot | 62.3 | 77.2 | 84.2 | 85.6 |
| [BoolQ][boolq] | 0-shot | 63.2 | 72.3 | 78.8 | 82.4 |
| [PIQA][piqa] | 0-shot | 73.8 | 79.6 | 81.8 | 83.3 |
| [SocialIQA][socialiqa] | 0-shot | 48.9 | 51.9 | 53.4 | 54.9 |
| [TriviaQA][triviaqa] | 5-shot | 39.8 | 65.8 | 78.2 | 85.5 |
| [Natural Questions][naturalq] | 5-shot | 9.48 | 20.0 | 31.4 | 36.1 |
| [ARC-c][arc] | 25-shot | 38.4 | 56.2 | 68.9 | 70.6 |
| [ARC-e][arc] | 0-shot | 73.0 | 82.4 | 88.3 | 89.0 |
| [WinoGrande][winogrande] | 5-shot | 58.2 | 64.7 | 74.3 | 78.8 |
| [BIG-Bench Hard][bbh] | few-shot | 28.4 | 50.9 | 72.6 | 77.7 |
| [DROP][drop] | 1-shot | 42.4 | 60.1 | 72.2 | 77.2 |

[gpqa]: https://arxiv.org/abs/2311.12022
[simpleqa]: https://arxiv.org/abs/2411.04368
[facts-grdg]: https://goo.gle/FACTSpaper
[bbeh]: https://github.com/google-deepmind/bbeh
[socialiqa]: https://arxiv.org/abs/1904.09728
[naturalq]: https://github.com/google-research-datasets/natural-questions
[drop]: https://arxiv.org/abs/1903.00161

| Benchmark | n-shot | Gemma 3 IT 1B | Gemma 3 IT 4B | Gemma 3 IT 12B | Gemma 3 IT 27B |
|----------------------------|--------|:-------------:|:-------------:|:--------------:|:--------------:|
| [MMLU][mmlu] (Pro) | 0-shot | 14.7 | 43.6 | 60.6 | 67.5 |
| [LiveCodeBench][lcb] | 0-shot | 1.9 | 12.6 | 24.6 | 29.7 |
| [Bird-SQL][bird-sql] (dev) | - | 6.4 | 36.3 | 47.9 | 54.4 |
| [Math][math] | 0-shot | 48.0 | 75.6 | 83.8 | 89.0 |
| HiddenMath | 0-shot | 15.8 | 43.0 | 54.5 | 60.3 |
| [MBPP][mbpp] | 3-shot | 35.2 | 63.2 | 73.0 | 74.4 |
| [HumanEval][humaneval] | 0-shot | 41.5 | 71.3 | 85.4 | 87.8 |
| [Natural2Code][nat2code] | 0-shot | 56.0 | 70.3 | 80.7 | 84.5 |
| [GSM8K][gsm8k] | 0-shot | 62.8 | 89.2 | 94.4 | 95.9 |

| Benchmark | n-shot | Gemma 3 PT 4B | Gemma 3 PT 12B | Gemma 3 PT 27B |
|--------------------------------|----------|:-------------:|:--------------:|:--------------:|
| [MMLU][mmlu] | 5-shot | 59.6 | 74.5 | 78.6 |
| [MMLU][mmlu] (Pro COT) | 5-shot | 29.2 | 45.3 | 52.2 |
| [AGIEval][agieval] | 3-5-shot | 42.1 | 57.4 | 66.2 |
| [MATH][math] | 4-shot | 24.2 | 43.3 | 50.0 |
| [GSM8K][gsm8k] | 8-shot | 38.4 | 71.0 | 82.6 |
| [GPQA][gpqa] | 5-shot | 15.0 | 25.4 | 24.3 |
| [MBPP][mbpp] | 3-shot | 46.0 | 60.4 | 65.6 |
| [HumanEval][humaneval] | 0-shot | 36.0 | 45.7 | 48.8 |

[mmlu]: https://arxiv.org/abs/2009.03300
[agieval]: https://arxiv.org/abs/2304.06364
[math]: https://arxiv.org/abs/2103.03874
[gsm8k]: https://arxiv.org/abs/2110.14168
[mbpp]: https://arxiv.org/abs/2108.07732
[humaneval]: https://arxiv.org/abs/2107.03374
[lcb]: https://arxiv.org/abs/2403.07974
[bird-sql]: https://arxiv.org/abs/2305.03111
[nat2code]: https://arxiv.org/abs/2405.04520

| Benchmark | n-shot | Gemma 3 IT 1B | Gemma 3 IT 4B | Gemma 3 IT 12B | Gemma 3 IT 27B |
|--------------------------------------|--------|:-------------:|:-------------:|:--------------:|:--------------:|
| [Global-MMLU-Lite][global-mmlu-lite] | 0-shot | 34.2 | 54.5 | 69.5 | 75.1 |
| [ECLeKTic][eclektic] | 0-shot | 1.4 | 4.6 | 10.3 | 16.7 |
| [WMT24++][wmt24pp] | 0-shot | 35.9 | 46.8 | 51.6 | 53.4 |

| Benchmark | Gemma 3 PT 1B | Gemma 3 PT 4B | Gemma 3 PT 12B | Gemma 3 PT 27B |
|--------------------------------------|:-------------:|:-------------:|:--------------:|:--------------:|
| [MGSM][mgsm] | 2.04 | 34.7 | 64.3 | 74.3 |
| [Global-MMLU-Lite][global-mmlu-lite] | 24.9 | 57.0 | 69.4 | 75.7 |
| [WMT24++][wmt24pp] (ChrF) | 36.7 | 48.4 | 53.9 | 55.7 |
| [FloRes][flores] | 29.5 | 39.2 | 46.0 | 48.8 |
| [XQuAD][xquad] (all)
| 43.9 | 68.0 | 74.5 | 76.8 | | [ECLeKTic][eclektic] | 4.69 | 11.0 | 17.2 | 24.4 | | [IndicGenBench][indicgenbench] | 41.4 | 57.2 | 61.7 | 63.4 | [mgsm]: https://arxiv.org/abs/2210.03057 [flores]: https://arxiv.org/abs/2106.03193 [xquad]: https://arxiv.org/abs/1910.11856v3 [global-mmlu-lite]: https://huggingface.co/datasets/CohereForAI/Global-MMLU-Lite [wmt24pp]: https://arxiv.org/abs/2502.12404v1 [eclektic]: https://arxiv.org/abs/2502.21228 [indicgenbench]: https://arxiv.org/abs/2404.16816 | Benchmark | Gemma 3 IT 4B | Gemma 3 IT 12B | Gemma 3 IT 27B | |-----------------------------------|:-------------:|:--------------:|:--------------:| | [MMMU][mmmu] (val) | 48.8 | 59.6 | 64.9 | | [DocVQA][docvqa] | 75.8 | 87.1 | 86.6 | | [InfoVQA][info-vqa] | 50.0 | 64.9 | 70.6 | | [TextVQA][textvqa] | 57.8 | 67.7 | 65.1 | | [AI2D][ai2d] | 74.8 | 84.2 | 84.5 | | [ChartQA][chartqa] | 68.8 | 75.7 | 78.0 | | [VQAv2][vqav2] (val) | 62.4 | 71.6 | 71.0 | | [MathVista][mathvista] (testmini) | 50.0 | 62.9 | 67.6 | | Benchmark | Gemma 3 PT 4B | Gemma 3 PT 12B | Gemma 3 PT 27B | | ------------------------------ |:-------------:|:--------------:|:--------------:| | [COCOcap][coco-cap] | 102 | 111 | 116 | | [DocVQA][docvqa] (val) | 72.8 | 82.3 | 85.6 | | [InfoVQA][info-vqa] (val) | 44.1 | 54.8 | 59.4 | | [MMMU][mmmu] (pt) | 39.2 | 50.3 | 56.1 | | [TextVQA][textvqa] (val) | 58.9 | 66.5 | 68.6 | | [RealWorldQA][realworldqa] | 45.5 | 52.2 | 53.9 | | [ReMI][remi] | 27.3 | 38.5 | 44.8 | | [AI2D][ai2d] | 63.2 | 75.2 | 79.0 | | [ChartQA][chartqa] | 63.6 | 74.7 | 76.3 | | [VQAv2][vqav2] | 63.9 | 71.2 | 72.9 | | [BLINK][blinkvqa] | 38.0 | 35.9 | 39.6 | | [OKVQA][okvqa] | 51.0 | 58.7 | 60.2 | | [TallyQA][tallyqa] | 42.5 | 51.8 | 54.3 | | [SpatialSense VQA][ss-vqa] | 50.9 | 60.0 | 59.4 | | [CountBenchQA][countbenchqa] | 26.1 | 17.8 | 68.0 | [coco-cap]: https://cocodataset.org/#home [docvqa]: https://www.docvqa.org/ [info-vqa]: https://arxiv.org/abs/2104.12756 [mmmu]: https://arxiv.org/abs/2311.16502 
[textvqa]: https://textvqa.org/
[realworldqa]: https://paperswithcode.com/dataset/realworldqa
[remi]: https://arxiv.org/html/2406.09175v1
[ai2d]: https://allenai.org/data/diagrams
[chartqa]: https://arxiv.org/abs/2203.10244
[vqav2]: https://visualqa.org/index.html
[blinkvqa]: https://arxiv.org/abs/2404.12390
[okvqa]: https://okvqa.allenai.org/
[tallyqa]: https://arxiv.org/abs/1810.12440
[ss-vqa]: https://arxiv.org/abs/1908.02660
[countbenchqa]: https://github.com/google-research/big_vision/blob/main/big_vision/datasets/countbenchqa/
[mathvista]: https://arxiv.org/abs/2310.02255

Our evaluation methods include structured evaluations and internal red-teaming testing of relevant content policies. Red-teaming was conducted by a number of different teams, each with different goals and human evaluation metrics. These models were evaluated against a number of different categories relevant to ethics and safety, including:

- Child Safety: Evaluation of text-to-text and image-to-text prompts covering child safety policies, including child sexual abuse and exploitation.
- Content Safety: Evaluation of text-to-text and image-to-text prompts covering safety policies, including harassment, violence and gore, and hate speech.
- Representational Harms: Evaluation of text-to-text and image-to-text prompts covering safety policies, including bias, stereotyping, and harmful associations or inaccuracies.

In addition to development-level evaluations, we conduct "assurance evaluations", which are our 'arms-length' internal evaluations for responsibility governance decision making. They are conducted separately from the model development team, to inform decision making about release. High-level findings are fed back to the model team, but prompt sets are held out to prevent overfitting and preserve the results' ability to inform decision making. Assurance evaluation results are reported to our Responsibility & Safety Council as part of release review.
For all areas of safety testing, we saw major improvements in the categories of child safety, content safety, and representational harms relative to previous Gemma models. All testing was conducted without safety filters to evaluate the model capabilities and behaviors. For both text-to-text and image-to-text, and across all model sizes, the model produced minimal policy violations, and showed significant improvements over previous Gemma models' performance with respect to ungrounded inferences. A limitation of our evaluations was that they included only English-language prompts.

These models have certain limitations that users should be aware of.

Open vision-language models (VLMs) have a wide range of applications across various industries and domains. The following list of potential uses is not comprehensive. The purpose of this list is to provide contextual information about the possible use cases that the model creators considered as part of model training and development.

- Content Creation and Communication
  - Text Generation: These models can be used to generate creative text formats such as poems, scripts, code, marketing copy, and email drafts.
  - Chatbots and Conversational AI: Power conversational interfaces for customer service, virtual assistants, or interactive applications.
  - Text Summarization: Generate concise summaries of a text corpus, research papers, or reports.
  - Image Data Extraction: These models can be used to extract, interpret, and summarize visual data for text communications.
- Research and Education
  - Natural Language Processing (NLP) and VLM Research: These models can serve as a foundation for researchers to experiment with VLM and NLP techniques, develop algorithms, and contribute to the advancement of the field.
  - Language Learning Tools: Support interactive language learning experiences, aiding in grammar correction or providing writing practice.
  - Knowledge Exploration: Assist researchers in exploring large bodies of text by generating summaries or answering questions about specific topics.

- Training Data
  - The quality and diversity of the training data significantly influence the model's capabilities. Biases or gaps in the training data can lead to limitations in the model's responses.
  - The scope of the training dataset determines the subject areas the model can handle effectively.
- Context and Task Complexity
  - Models are better at tasks that can be framed with clear prompts and instructions. Open-ended or highly complex tasks might be challenging.
  - A model's performance can be influenced by the amount of context provided (longer context generally leads to better outputs, up to a certain point).
- Language Ambiguity and Nuance
  - Natural language is inherently complex. Models might struggle to grasp subtle nuances, sarcasm, or figurative language.
- Factual Accuracy
  - Models generate responses based on information they learned from their training datasets, but they are not knowledge bases. They may generate incorrect or outdated factual statements.
- Common Sense
  - Models rely on statistical patterns in language. They might lack the ability to apply common sense reasoning in certain situations.

The development of vision-language models (VLMs) raises several ethical concerns. In creating an open model, we have carefully considered the following:

- Bias and Fairness
  - VLMs trained on large-scale, real-world text and image data can reflect socio-cultural biases embedded in the training material. These models underwent careful scrutiny, with input data pre-processing described and posterior evaluations reported in this card.
- Misinformation and Misuse
  - VLMs can be misused to generate text that is false, misleading, or harmful.
  - Guidelines are provided for responsible use with the model; see the [Responsible Generative AI Toolkit][rai-toolkit].
- Transparency and Accountability
  - This model card summarizes details on the models' architecture, capabilities, limitations, and evaluation processes.
  - A responsibly developed open model offers the opportunity to share innovation by making VLM technology accessible to developers and researchers across the AI ecosystem.

- Perpetuation of biases: Continuous monitoring (using evaluation metrics and human review) and the exploration of de-biasing techniques are encouraged during model training, fine-tuning, and other use cases.
- Generation of harmful content: Mechanisms and guidelines for content safety are essential. Developers are encouraged to exercise caution and implement appropriate content safety safeguards based on their specific product policies and application use cases.
- Misuse for malicious purposes: Technical limitations and developer and end-user education can help mitigate malicious applications of VLMs. Educational resources and reporting mechanisms for users to flag misuse are provided. Prohibited uses of Gemma models are outlined in the [Gemma Prohibited Use Policy][prohibited-use].
- Privacy violations: Models were trained on data filtered for removal of certain personal information and other sensitive data. Developers are encouraged to adhere to privacy regulations with privacy-preserving techniques.

At the time of release, this family of models provides high-performance open vision-language model implementations designed from the ground up for responsible AI development compared to similarly sized models. Using the benchmark evaluation metrics described in this document, these models have been shown to provide superior performance to other, comparably sized open model alternatives.
[g3-tech-report]: https://arxiv.org/abs/2503.19786
[rai-toolkit]: https://ai.google.dev/responsible
[kaggle-gemma]: https://www.kaggle.com/models/google/gemma-3
[vertex-mg-gemma3]: https://console.cloud.google.com/vertex-ai/publishers/google/model-garden/gemma3
[terms]: https://ai.google.dev/gemma/terms
[safety-policies]: https://ai.google/static/documents/ai-responsibility-update-published-february-2025.pdf
[prohibited-use]: https://ai.google.dev/gemma/prohibitedusepolicy
[tpu]: https://cloud.google.com/tpu/docs/intro-to-tpu
[sustainability]: https://sustainability.google/operating-sustainably/
[jax]: https://github.com/jax-ml/jax
[ml-pathways]: https://blog.google/technology/ai/introducing-pathways-next-generation-ai-architecture/
[gemini-2-paper]: https://arxiv.org/abs/2312.11805

NaNK
2,588
1

mistral-7b-v0.3

NaNK
license:apache-2.0
2,571
9

Qwen2-VL-2B-Instruct

NaNK
license:apache-2.0
2,544
5

llama-3-8b

NaNK
llama
2,494
54

Qwen2-0.5B-Instruct

NaNK
license:apache-2.0
2,480
3

Qwen3-14B-bnb-4bit

NaNK
license:apache-2.0
2,465
3

Qwen3-14B-128K-GGUF

NaNK
license:apache-2.0
2,464
21

granite-4.0-1b-GGUF

Unsloth Dynamic 2.0 achieves superior accuracy & outperforms other leading quants.

Model Summary: Granite-4.0-1B is a lightweight instruct model finetuned from Granite-4.0-1B-Base using a combination of open source instruction datasets with permissive license and internally collected synthetic datasets. This model is developed using a diverse set of techniques including supervised finetuning, reinforcement learning, and model merging.

- Developers: Granite Team, IBM
- HF Collection: Granite 4.0 Nano Language Models HF Collection
- GitHub Repository: ibm-granite/granite-4.0-nano-language-models
- Website: Granite Docs
- Release Date: October 28, 2025
- License: Apache 2.0

Supported Languages: English, German, Spanish, French, Japanese, Portuguese, Arabic, Czech, Italian, Korean, Dutch, and Chinese. Users may fine-tune Granite 4.0 Nano models to support languages beyond those included in this list.

Intended use: Granite 4.0 Nano instruct models feature strong instruction-following capabilities, bringing advanced AI capabilities within reach for on-device deployments and research use cases. Additionally, their compact size makes them well suited for fine-tuning on specialized domains without requiring massive compute resources.

Capabilities:

- Summarization
- Text classification
- Text extraction
- Question-answering
- Retrieval Augmented Generation (RAG)
- Code related tasks
- Function-calling tasks
- Multilingual dialog use cases
- Fill-In-the-Middle (FIM) code completions

Generation: This is a simple example of how to use the Granite-4.0-1B model. Then, copy the snippet from the section that is relevant for your use case.

Tool-calling: Granite-4.0-1B comes with enhanced tool-calling capabilities, enabling seamless integration with external functions and APIs. To define a list of tools, please follow OpenAI's function definition schema. This is an example of how to use the Granite-4.0-1B model's tool-calling ability:

Benchmarks (table columns): Metric, 350M Dense, H 350M Dense, 1B Dense, H 1B Dense

Multilingual Benchmarks and their included languages:

- MMMLU (11 languages): ar, de, en, es, fr, ja, ko, pt, zh, bn, hi
- INCLUDE (14 languages): hi, bn, ta, te, ar, de, es, fr, it, ja, ko, nl, pt, zh

Granite-4.0-1B baseline is based on a decoder-only dense transformer architecture. Core components of this architecture are: GQA, MLP with SwiGLU, RMSNorm, and shared input/output embeddings.

| Model | 350M Dense | H 350M Dense | 1B Dense | H 1B Dense |
|-------|------------|--------------|----------|------------|
| Number of layers | 28 attention | 4 attention / 28 Mamba2 | 40 attention | 4 attention / 36 Mamba2 |
| MLP / Shared expert hidden size | 2048 | 2048 | 4096 | 4096 |

Training Data: Overall, our SFT data is largely comprised of three key sources: (1) publicly available datasets with permissive license, (2) internal synthetic data targeting specific capabilities, and (3) a select set of human-curated data.

Infrastructure: We trained the Granite 4.0 Nano Language Models utilizing an NVIDIA GB200 NVL72 cluster hosted in CoreWeave. Intra-rack communication occurs via the 72-GPU NVLink domain, and a non-blocking, full Fat-Tree NDR 400 Gb/s InfiniBand network provides inter-rack communication. This cluster provides a scalable and efficient infrastructure for training our models over thousands of GPUs.

Ethical Considerations and Limitations: Granite 4.0 Nano Instruct Models are primarily finetuned using instruction-response pairs mostly in English, but also multilingual data covering multiple languages. Although this model can handle multilingual dialog use cases, its performance on them might not match its performance on English tasks. In such cases, introducing a small number of examples (few-shot) can help the model generate more accurate outputs. While this model has been aligned with safety in mind, it may in some cases produce inaccurate, biased, or unsafe responses to user prompts. We therefore urge the community to use this model with proper safety testing and tuning tailored to their specific tasks.

Resources

- ⭐️ Learn about the latest updates with Granite: https://www.ibm.com/granite
- 📄 Get started with tutorials, best practices, and prompt engineering advice: https://www.ibm.com/granite/docs/
- 💡 Learn about the latest Granite learning resources: https://ibm.biz/granite-learning-resources
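The card's tool-calling note says that tools should follow OpenAI's function definition schema. A minimal sketch of what such a list can look like is given below. The tool name `get_current_weather`, its parameters, and the helper functions `make_tool` and `check_tool` are hypothetical and serve only as illustration; they are not taken from the Granite card.

```python
import json


def make_tool(name, description, properties, required):
    """Build one tool entry shaped like OpenAI's function definition schema."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": {
                "type": "object",
                "properties": properties,
                "required": list(required),
            },
        },
    }


def check_tool(tool):
    """Return a list of problems found in a tool definition (empty if none)."""
    problems = []
    if tool.get("type") != "function":
        problems.append("type must be 'function'")
    fn = tool.get("function", {})
    if not fn.get("name"):
        problems.append("function.name is missing")
    params = fn.get("parameters", {})
    declared = params.get("properties", {})
    for key in params.get("required", []):
        if key not in declared:
            problems.append(f"required parameter {key!r} is not declared")
    return problems


# Hypothetical example tool; the name and fields are assumptions for illustration.
tools = [
    make_tool(
        name="get_current_weather",
        description="Get the current weather for a city.",
        properties={"city": {"type": "string", "description": "Name of the city"}},
        required=["city"],
    )
]

if __name__ == "__main__":
    print(json.dumps(tools, indent=2))
    print(check_tool(tools[0]))
```

In recent Transformers releases a list shaped like `tools` can be passed through the `tools` argument of `tokenizer.apply_chat_template`; whether a given checkpoint's chat template makes use of it depends on that template.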

NaNK
license:apache-2.0
2,451
1

Qwen2.5-Coder-14B-Instruct-bnb-4bit

NaNK
license:apache-2.0
2,445
5

DeepSeek-R1-Distill-Qwen-14B-unsloth-bnb-4bit

NaNK
license:apache-2.0
2,441
29

granite-4.0-h-micro-unsloth-bnb-4bit

See our collection for all versions of Granite-4.0 including GGUF, 4-bit & 16-bit formats. Learn to run Granite 4.0 correctly: read our Guide. See Unsloth Dynamic 2.0 GGUFs for our quantization benchmarks.

- Fine-tune Granite-4.0 for free using our Google Colab notebook
- Read our Blog about Granite-4.0 support: https://docs.unsloth.ai/new/ibm-granite-4.0
- View the rest of our notebooks in our docs here.

Model Summary: Granite-4.0-H-Micro is a 3B parameter long-context instruct model finetuned from Granite-4.0-H-Micro-Base using a combination of open source instruction datasets with permissive license and internally collected synthetic datasets. This model is developed using a diverse set of techniques with a structured chat format, including supervised finetuning, model alignment using reinforcement learning, and model merging. Granite 4.0 instruct models feature improved instruction following (IF) and tool-calling capabilities, making them more effective in enterprise applications.

- Developers: Granite Team, IBM
- HF Collection: Granite 4.0 Language Models HF Collection
- GitHub Repository: ibm-granite/granite-4.0-language-models
- Website: Granite Docs
- Release Date: October 2nd, 2025
- License: Apache 2.0

Supported Languages: English, German, Spanish, French, Japanese, Portuguese, Arabic, Czech, Italian, Korean, Dutch, and Chinese. Users may finetune Granite 4.0 models for languages beyond those listed here.

Intended use: The model is designed to respond to general instructions and can be used to build AI assistants for multiple domains, including business applications.

Capabilities:

- Summarization
- Text classification
- Text extraction
- Question-answering
- Retrieval Augmented Generation (RAG)
- Code related tasks
- Function-calling tasks
- Multilingual dialog use cases
- Fill-In-the-Middle (FIM) code completions

Generation: This is a simple example of how to use the Granite-4.0-H-Micro model. Then, copy the snippet from the section that is relevant for your use case.

Tool-calling: Granite-4.0-H-Micro comes with enhanced tool-calling capabilities, enabling seamless integration with external functions and APIs. To define a list of tools, please follow OpenAI's function definition schema. This is an example of how to use the Granite-4.0-H-Micro model's tool-calling ability:

Benchmarks (table columns): Metric, Micro Dense, H Micro Dense, H Tiny MoE, H Small MoE

Multilingual Benchmarks and their included languages:

- MMMLU (11 languages): ar, de, en, es, fr, ja, ko, pt, zh, bn, hi
- INCLUDE (14 languages): hi, bn, ta, te, ar, de, es, fr, it, ja, ko, nl, pt, zh

Model Architecture: Granite-4.0-H-Micro baseline is built on a decoder-only dense transformer architecture. Core components of this architecture are: GQA, Mamba2, MLP with SwiGLU, RMSNorm, and shared input/output embeddings.

| Model | Micro Dense | H Micro Dense | H Tiny MoE | H Small MoE |
|-------|-------------|---------------|------------|-------------|
| Number of layers | 40 attention | 4 attention / 36 Mamba2 | 4 attention / 36 Mamba2 | 4 attention / 36 Mamba2 |
| MLP / Shared expert hidden size | 8192 | 8192 | 1024 | 1536 |

Training Data: Overall, our SFT data is largely comprised of three key sources: (1) publicly available datasets with permissive license, (2) internal synthetic data targeting specific capabilities, and (3) a select set of human-curated data.

Infrastructure: We trained the Granite 4.0 Language Models utilizing an NVIDIA GB200 NVL72 cluster hosted in CoreWeave. Intra-rack communication occurs via the 72-GPU NVLink domain, and a non-blocking, full Fat-Tree NDR 400 Gb/s InfiniBand network provides inter-rack communication. This cluster provides a scalable and efficient infrastructure for training our models over thousands of GPUs.

Ethical Considerations and Limitations: Granite 4.0 Instruction Models are primarily finetuned using instruction-response pairs mostly in English, but also multilingual data covering multiple languages. Although this model can handle multilingual dialog use cases, its performance on them might not match its performance on English tasks. In such cases, introducing a small number of examples (few-shot) can help the model generate more accurate outputs. While this model has been aligned with safety in mind, it may in some cases produce inaccurate, biased, or unsafe responses to user prompts. We therefore urge the community to use this model with proper safety testing and tuning tailored to their specific tasks.

Resources

- ⭐️ Learn about the latest updates with Granite: https://www.ibm.com/granite
- 📄 Get started with tutorials, best practices, and prompt engineering advice: https://www.ibm.com/granite/docs/
- 💡 Learn about the latest Granite learning resources: https://ibm.biz/granite-learning-resources
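The limitations note above suggests adding a small number of examples (few-shot) when working in languages other than English. One way to lay this out, sketched under the assumption that the model is driven through a chat-style message list, is shown below. The helper name `build_few_shot_messages` and the German sentiment examples are hypothetical and chosen only for illustration.

```python
def build_few_shot_messages(examples, query, system_prompt=None):
    """Turn (user_text, assistant_text) pairs plus a final query into chat messages."""
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    for user_text, assistant_text in examples:
        # Each demonstration becomes one user turn followed by one assistant turn.
        messages.append({"role": "user", "content": user_text})
        messages.append({"role": "assistant", "content": assistant_text})
    # The real question goes last so the model answers it in the demonstrated style.
    messages.append({"role": "user", "content": query})
    return messages


# Hypothetical German sentiment examples, used only to show the layout.
few_shot = [
    ("Bewertung: Das Essen war ausgezeichnet.", "positiv"),
    ("Bewertung: Der Service war sehr langsam.", "negativ"),
]
messages = build_few_shot_messages(
    few_shot,
    "Bewertung: Die Lieferung kam zu spät.",
    system_prompt="Classify each review as 'positiv' or 'negativ'.",
)

if __name__ == "__main__":
    for message in messages:
        print(f"{message['role']}: {message['content']}")
```

A list like `messages` is the kind of input that chat templates in Transformers typically consume; how many demonstrations help in practice is something to measure per task.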

NaNK
license:apache-2.0
2,429
2

Qwen3-VL-32B-Instruct-unsloth-bnb-4bit

Unsloth Dynamic 2.0 achieves superior accuracy & outperforms other leading quants.

Meet Qwen3-VL, the most powerful vision-language model in the Qwen series to date. This generation delivers comprehensive upgrades across the board: superior text understanding & generation, deeper visual perception & reasoning, extended context length, enhanced spatial and video dynamics comprehension, and stronger agent interaction capabilities. Available in Dense and MoE architectures that scale from edge to cloud, with Instruct and reasoning‑enhanced Thinking editions for flexible, on‑demand deployment.

- Visual Agent: Operates PC/mobile GUIs: recognizes elements, understands functions, invokes tools, completes tasks.
- Visual Coding Boost: Generates Draw.io/HTML/CSS/JS from images/videos.
- Advanced Spatial Perception: Judges object positions, viewpoints, and occlusions; provides stronger 2D grounding and enables 3D grounding for spatial reasoning and embodied AI.
- Long Context & Video Understanding: Native 256K context, expandable to 1M; handles books and hours-long video with full recall and second-level indexing.
- Enhanced Multimodal Reasoning: Excels in STEM/Math, with causal analysis and logical, evidence-based answers.
- Upgraded Visual Recognition: Broader, higher-quality pretraining is able to “recognize everything”: celebrities, anime, products, landmarks, flora/fauna, etc.
- Expanded OCR: Supports 32 languages (up from 19); robust in low light, blur, and tilt; better with rare/ancient characters and jargon; improved long-document structure parsing.
- Text Understanding on par with pure LLMs: Seamless text–vision fusion for lossless, unified comprehension.

1. Interleaved-MRoPE: Full‑frequency allocation over time, width, and height via robust positional embeddings, enhancing long‑horizon video reasoning.
2. DeepStack: Fuses multi‑level ViT features to capture fine‑grained details and sharpen image–text alignment.
3. Text–Timestamp Alignment: Moves beyond T‑RoPE to precise, timestamp‑grounded event localization for stronger video temporal modeling.

This is the weight repository for Qwen3-VL-32B-Instruct.

Below, we provide simple examples to show how to use Qwen3-VL with 🤖 ModelScope and 🤗 Transformers. The code for Qwen3-VL is included in the latest Hugging Face transformers, and we advise you to build from source with the command:

Here is a code snippet showing how to use the chat model with `transformers`:

If you find our work helpful, feel free to cite us.
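The card refers to a `transformers` chat snippet that did not survive on this page. As a partial stand-in, the sketch below shows only the shape of a multimodal chat message, not model loading or generation. It assumes the content-list convention used by Qwen vision-language chat templates, where each part carries a `type` key; the helper names and the example URL are hypothetical.

```python
def image_question(image_ref, question):
    """Build a single-turn chat with one image part and one text part."""
    return [
        {
            "role": "user",
            "content": [
                # Assumption: image parts are keyed by "image" and hold a URL or path.
                {"type": "image", "image": image_ref},
                {"type": "text", "text": question},
            ],
        }
    ]


def count_parts(messages, part_type):
    """Count content parts of a given type across all messages."""
    total = 0
    for message in messages:
        content = message["content"]
        if isinstance(content, list):
            total += sum(1 for part in content if part.get("type") == part_type)
    return total


# Hypothetical placeholder URL; replace with a real image location.
messages = image_question("https://example.com/demo.jpeg", "Describe this image.")

if __name__ == "__main__":
    print(messages)
    print(count_parts(messages, "image"), count_parts(messages, "text"))
```

With a processor loaded for the checkpoint, a structure like `messages` is what would typically be handed to the processor's `apply_chat_template` method; the exact keyword arguments depend on the installed Transformers version.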

NaNK
license:apache-2.0
2,394
2

Ling-1T-GGUF

license:mit
2,380
5

gemma-3n-E4B-unsloth-bnb-4bit

NaNK
2,364
4

gemma-2b

NaNK
license:apache-2.0
2,334
5

Qwen3-30B-A3B-bnb-4bit

NaNK
2,312
18

gemma-2b-bnb-4bit

NaNK
license:apache-2.0
2,303
16

Qwen2.5-3B-unsloth-bnb-4bit

NaNK
license:apache-2.0
2,301
0

gemma-2-2b

NaNK
2,246
8

gemma-3-1b-pt

NaNK
2,246
6

gemma-2b-it-bnb-4bit

NaNK
license:apache-2.0
2,237
20

Magistral-Small-2509-unsloth-bnb-4bit

Learn to run Magistral 1.2 correctly: read our Guide. Unsloth Dynamic 2.0 achieves SOTA performance in model quantization.

Read our in-depth guide about Magistral 1.2: docs.unsloth.ai/basics/magistral

- Fine-tune Magistral 1.2 for free using our Kaggle notebook here!
- View the rest of our notebooks in our docs here.

Building upon Mistral Small 3.2 (2506), with added reasoning capabilities, undergoing SFT from Magistral Medium traces and RL on top, it's a small, efficient reasoning model with 24B parameters. Magistral Small can be deployed locally, fitting within a single RTX 4090 or a 32GB RAM MacBook once quantized.

- Multimodality: The model now has a vision encoder and can take multimodal inputs, extending its reasoning capabilities to vision.
- Performance upgrade: Magistral Small 1.2 should give you significantly better performance than Magistral Small 1.1, as seen in the benchmark results.
- Better tone and persona: You should experience better LaTeX and Markdown formatting, and shorter answers on easy general prompts.
- Finite generation: The model is less likely to enter infinite generation loops.
- Special think tokens: [THINK] and [/THINK] special tokens encapsulate the reasoning content in a thinking chunk. This makes it easier to parse the reasoning trace and prevents confusion when the '[THINK]' token is given as a string in the prompt.
- Reasoning prompt: The reasoning prompt is given in the system prompt.
- Reasoning: Capable of long chains of reasoning traces before providing an answer.
- Multilingual: Supports dozens of languages, including English, French, German, Greek, Hindi, Indonesian, Italian, Japanese, Korean, Malay, Nepali, Polish, Portuguese, Romanian, Russian, Serbian, Spanish, Turkish, Ukrainian, Vietnamese, Arabic, Bengali, Chinese, and Farsi.
- Vision: Vision capabilities enable the model to analyze images and reason based on visual content in addition to text.
- Apache 2.0 License: Open license allowing usage and modification for both commercial and non-commercial purposes.
- Context Window: A 128k context window. Performance might degrade past 40k, but Magistral should still give good results. We therefore recommend leaving the maximum model length at 128k and lowering it only if you encounter low performance.

| Model | AIME24 pass@1 | AIME25 pass@1 | GPQA Diamond | Livecodebench (v5) |
|--------------------------|---------------|---------------|--------------|--------------------|
| Magistral Medium 1.2 | 91.82% | 83.48% | 76.26% | 75.00% |
| Magistral Medium 1.1 | 72.03% | 60.99% | 71.46% | 59.35% |
| Magistral Medium 1.0 | 73.59% | 64.95% | 70.83% | 59.36% |
| Magistral Small 1.2 | 86.14% | 77.34% | 70.07% | 70.88% |
| Magistral Small 1.1 | 70.52% | 62.03% | 65.78% | 59.17% |
| Magistral Small 1.0 | 70.68% | 62.76% | 68.18% | 55.84% |

Please make sure to use:

- `top_p`: 0.95
- `temperature`: 0.7
- `max_tokens`: 131072

We highly recommend including the following system prompt for the best results; you can edit and customise it if needed for your specific use case.

The `[THINK]` and `[/THINK]` are special tokens that must be encoded as such.

Please make sure to use mistral-common as the source of truth. Find below examples from libraries supporting `mistral-common`.

We invite you to choose, depending on your use case and requirements, between keeping reasoning traces during multi-turn interactions or keeping only the final assistant response.

Make sure you install the latest `Transformers` version:
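The card describes `[THINK]` and `[/THINK]` as delimiters of the reasoning trace and leaves it to the user whether earlier reasoning is kept in multi-turn history. The sketch below illustrates that choice on already decoded text, where the two markers appear as literal strings. This is an assumption made for illustration: the card states that the markers are special tokens and that mistral-common is the source of truth, so a production pipeline would work at the token or chunk level instead. The function names are hypothetical; the sampling values are the ones recommended in the card.

```python
import re

# Sampling values recommended in the card.
SAMPLING_PARAMS = {"top_p": 0.95, "temperature": 0.7, "max_tokens": 131072}

# Assumption: the markers show up as plain strings in the decoded output.
THINK_RE = re.compile(r"\[THINK\](.*?)\[/THINK\]", re.DOTALL)


def split_reasoning(text):
    """Return (reasoning, answer); reasoning is empty if no thinking chunk is found."""
    match = THINK_RE.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = (text[: match.start()] + text[match.end():]).strip()
    return reasoning, answer


def keep_final_answers(messages):
    """Copy a chat history, dropping reasoning traces from assistant turns."""
    cleaned = []
    for message in messages:
        if message["role"] == "assistant":
            _, answer = split_reasoning(message["content"])
            cleaned.append({"role": "assistant", "content": answer})
        else:
            cleaned.append(dict(message))
    return cleaned


if __name__ == "__main__":
    raw = "[THINK]2 + 2 equals 4.[/THINK]The answer is 4."
    print(split_reasoning(raw))
    history = [
        {"role": "user", "content": "What is 2 + 2?"},
        {"role": "assistant", "content": raw},
    ]
    print(keep_final_answers(history))
```

Keeping only final answers shortens the context carried between turns, while keeping the traces preserves the model's earlier reasoning; the card leaves that trade-off to the use case.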

NaNK
license:apache-2.0
2,223
5

Qwen2.5-Coder-0.5B-Instruct-bnb-4bit

NaNK
license:apache-2.0
2,200
4

Qwen3-4B-bnb-4bit

NaNK
2,199
2

whisper-large-v3-turbo

license:mit
2,163
7

gemma-3-4b-pt

NaNK
2,156
3

DeepSeek-R1-Distill-Qwen-7B-unsloth-bnb-4bit

NaNK
license:apache-2.0
2,151
24

llama-2-7b-chat-bnb-4bit

NaNK
llama
2,137
4

LFM2-1.2B

NaNK
2,108
11

Phi-4-reasoning-GGUF

> [!NOTE]
> You must use `--jinja` in llama.cpp to enable reasoning. Otherwise no `<think>` token will be provided.
>
> See our collection for all versions of Phi-4 including GGUF, 4-bit & 16-bit formats.

Learn to run Phi-4 reasoning correctly: read our Guide. Unsloth Dynamic 2.0 achieves superior accuracy & outperforms other leading quants.

- Fine-tune Phi-4 (14B) for free using our Google Colab notebook here!
- Read our Blog about Phi-4 support with our bug fixes: unsloth.ai/blog/phi4
- View the rest of our notebooks in our docs here.
- Run & export your fine-tuned model to Ollama, llama.cpp or HF.

| Unsloth supports | Free Notebooks | Performance | Memory use |
|------------------|----------------|-------------|------------|
| Phi-4 (14B) | ▶️ Start on Colab | 2x faster | 50% less |
| Qwen3 (14B) | ▶️ Start on Colab | 3x faster | 70% less |
| GRPO with Phi-4 (14B) | ▶️ Start on Colab | 3x faster | 80% less |
| Llama-3.2 (3B) | ▶️ Start on Colab | 2x faster | 80% less |
| Llama-3.2 (11B vision) | ▶️ Start on Colab | 2x faster | 60% less |
| Qwen2.5 (7B) | ▶️ Start on Colab | 2x faster | 60% less |

| | |
|-------------------------|-------------------------------------------------------------------------------|
| Developers | Microsoft Research |
| Description | Phi-4-reasoning is a state-of-the-art open-weight reasoning model finetuned from Phi-4 using supervised fine-tuning on a dataset of chain-of-thought traces and reinforcement learning. The supervised fine-tuning dataset includes a blend of synthetic prompts and high-quality filtered data from public domain websites, focused on math, science, and coding skills as well as alignment data for safety and Responsible AI. The goal of this approach was to ensure that small capable models were trained with data focused on high quality and advanced reasoning. |
| Architecture | Base model same as previously released Phi-4, 14B parameters, dense decoder-only Transformer model |
| Inputs | Text, best suited for prompts in the chat format |
| Context length | 32k tokens |
| GPUs | 32 H100-80G |
| Training time | 2.5 days |
| Training data | 16B tokens, ~8.3B unique tokens |
| Outputs | Generated text in response to the input. Model responses have two sections, namely, a reasoning chain-of-thought block followed by a summarization block |
| Dates | January 2025 – April 2025 |
| Status | Static model trained on an offline dataset with cutoff dates of March 2025 and earlier for publicly available data |
| Release date | April 30, 2025 |
| License | MIT |

| | |
|-------------------------------|-------------------------------------------------------------------------|
| Primary Use Cases | Our model is designed to accelerate research on language models, for use as a building block for generative AI powered features. It provides uses for general purpose AI systems and applications (primarily in English) which require: 1. Memory/compute constrained environments. 2. Latency bound scenarios. 3. Reasoning and logic. |
| Out-of-Scope Use Cases | This model is designed and tested for math reasoning only. Our models are not specifically designed or evaluated for all downstream purposes. Developers should consider common limitations of language models as they select use cases, and evaluate and mitigate for accuracy, safety, and fairness before using within a specific downstream use case, particularly for high-risk scenarios. Developers should be aware of and adhere to applicable laws or regulations (including privacy, trade compliance laws, etc.) that are relevant to their use case, including the model’s focus on English. Review the Responsible AI Considerations section below for further guidance when choosing a use case. Nothing contained in this Model Card should be interpreted as or deemed a restriction or modification to the license the model is released under. |

Our training data is a mixture of Q&A and chat-format data in math, science, and coding. The chat prompts are sourced from filtered high-quality web data and optionally rewritten and processed through a synthetic data generation pipeline. We further include data to improve truthfulness and safety.

We evaluated Phi-4-reasoning using the open-source Eureka evaluation suite and our own internal benchmarks to understand the model's capabilities. More specifically, we evaluate our model on:

- AIME 2025, 2024, 2023, and 2022: Math olympiad questions.
- GPQA-Diamond: Complex, graduate-level science questions.
- OmniMath: Collection of over 4000 olympiad-level math problems with human annotation.
- LiveCodeBench: Code generation benchmark gathered from competitive coding contests.
- 3SAT (3-literal Satisfiability Problem) and TSP (Traveling Salesman Problem): Algorithmic problem solving.
- FlenQA: Impact of prompt length on model performance.
- MMLU-Pro: Popular aggregated dataset for multitask language understanding.

Phi-4-reasoning has adopted a robust safety post-training approach via supervised fine-tuning (SFT). This approach leverages a variety of both open-source and in-house generated synthetic prompts, with LLM-generated responses that adhere to rigorous Microsoft safety guidelines, e.g., User Understanding and Clarity, Security and Ethical Guidelines, Limitations, Disclaimers and Knowledge Scope, Handling Complex and Sensitive Topics, Safety and Respectful Engagement, Confidentiality of Guidelines and Confidentiality of Chain-of-Thoughts.

Prior to release, Phi-4-reasoning followed a multi-faceted evaluation approach. Quantitative evaluation was conducted with multiple open-source safety benchmarks and in-house tools utilizing adversarial conversation simulation.
For qualitative safety evaluation, we collaborated with the independent AI Red Team (AIRT) at Microsoft to assess safety risks posed by Phi-4-reasoning in both average and adversarial user scenarios. In the average user scenario, AIRT emulated typical single-turn and multi-turn interactions to identify potentially risky behaviors. The adversarial user scenario tested a wide range of techniques aimed at intentionally subverting the model's safety training including grounded-ness, jailbreaks, harmful content like hate and unfairness, violence, sexual content, or self-harm, and copyright violations for protected material. We further evaluate models on Toxigen, a benchmark designed to measure bias and toxicity targeted towards minority groups.

Please refer to the technical report for more details on safety alignment.

The tables below give a high-level overview of the model quality on representative benchmarks. Higher numbers indicate better performance:

| | AIME 24 | AIME 25 | OmniMath | GPQA-D | LiveCodeBench (8/1/24–2/1/25) |
|---|---|---|---|---|---|
| Phi-4-reasoning | 75.3 | 62.9 | 76.6 | 65.8 | 53.8 |
| Phi-4-reasoning-plus | 81.3 | 78.0 | 81.9 | 68.9 | 53.1 |
| OpenThinker2-32B | 58.0 | 58.0 | — | 64.1 | — |
| QwQ 32B | 79.5 | 65.8 | — | 59.5 | 63.4 |
| EXAONE-Deep-32B | 72.1 | 65.8 | — | 66.1 | 59.5 |
| DeepSeek-R1-Distill-70B | 69.3 | 51.5 | 63.4 | 66.2 | 57.5 |
| DeepSeek-R1 | 78.7 | 70.4 | 85.0 | 73.0 | 62.8 |
| o1-mini | 63.6 | 54.8 | — | 60.0 | 53.8 |
| o1 | 74.6 | 75.3 | 67.5 | 76.7 | 71.0 |
| o3-mini | 88.0 | 78.0 | 74.6 | 77.7 | 69.5 |
| Claude-3.7-Sonnet | 55.3 | 58.7 | 54.6 | 76.8 | — |
| Gemini-2.5-Pro | 92.0 | 86.7 | 61.1 | 84.0 | 69.2 |

| | Phi-4 | Phi-4-reasoning | Phi-4-reasoning-plus | o3-mini | GPT-4o |
|---|---|---|---|---|---|
| FlenQA [3K-token subset] | 82.0 | 97.7 | 97.9 | 96.8 | 90.8 |
| IFEval Strict | 62.3 | 83.4 | 84.9 | 91.5 | 81.8 |
| ArenaHard | 68.1 | 73.3 | 79.0 | 81.9 | 75.6 |
| HumanEvalPlus | 83.5 | 92.9 | 92.3 | 94.0 | 88.0 |
| MMLUPro | 71.5 | 74.3 | 76.0 | 79.4 | 73.0 |
| Kitab: No Context - Precision | 19.3 | 23.2 | 27.6 | 37.9 | 53.7 |
| Kitab: With Context - Precision | 88.5 | 91.5 | 93.6 | 94.0 | 84.7 |
| Kitab: No Context - Recall | 8.2 | 4.9 | 6.3 | 4.2 | 20.3 |
| Kitab: With Context - Recall | 68.1 | 74.8 | 75.4 | 76.1 | 69.2 |
| Toxigen Discriminative: Toxic category | 72.6 | 86.7 | 77.3 | 85.4 | 87.6 |
| Toxigen Discriminative: Neutral category | 90.0 | 84.7 | 90.5 | 88.7 | 85.1 |
| PhiBench 2.21 | 58.2 | 70.6 | 74.2 | 78.0 | 72.4 |

Overall, Phi-4-reasoning, with only 14B parameters, performs well across a wide range of reasoning tasks, outperforming significantly larger open-weight models such as the DeepSeek-R1 distilled 70B model and approaching the performance levels of the full DeepSeek-R1 model. We also test the models on multiple new reasoning benchmarks for algorithmic problem solving and planning, including 3SAT, TSP, and BA-Calendar. These new tasks are nominally out-of-domain for the models as the training process did not intentionally target these skills, but the models still show strong generalization to these tasks. Furthermore, when evaluating performance against standard general abilities benchmarks such as instruction following or non-reasoning tasks, we find that our new models improve significantly from Phi-4, despite the post-training being focused on reasoning skills in specific domains.

Inference is better with `temperature=0.8`, `top_p=0.95`, and `do_sample=True`. For more complex queries, set the maximum number of tokens to 32k to allow for longer chain-of-thought (CoT). Given the nature of the training data, always use the ChatML template together with the system prompt given in the original model card for inference.

Phi-4-reasoning is also supported out-of-the-box by Ollama, llama.cpp, and any Phi-4 compatible framework.
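The inference advice above can be sketched in plain Python. This is a minimal illustration, not the model's official template: `build_chatml_prompt`, `GENERATION_KWARGS`, and `SYSTEM_PROMPT` are hypothetical names, the layout is the generic ChatML one (the tokenizer shipped with the model may use different separator tokens, so its own chat template is the safer choice in practice), and the system prompt is left as a placeholder because its text is not reproduced here.

```python
from typing import Dict, List

# Sampling settings quoted in the card. The key names mirror the usual
# Hugging Face `generate` keyword arguments.
GENERATION_KWARGS = {
    "temperature": 0.8,
    "top_p": 0.95,
    "do_sample": True,
    # Assumption: "32k" is read as 32 * 1024. The prompt also counts against
    # the 32k context, so a real call may need a smaller value.
    "max_new_tokens": 32 * 1024,
}

# Placeholder only: substitute the system prompt from the original model card.
SYSTEM_PROMPT = "<system prompt from the model card>"


def build_chatml_prompt(messages: List[Dict[str, str]]) -> str:
    """Render messages in the generic ChatML layout and open an assistant turn."""
    parts = []
    for message in messages:
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    # Leave the assistant turn open so the model continues from here.
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = build_chatml_prompt([
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": "What is the derivative of x^2?"},
])
print(prompt)
```

With the `transformers` library, the dictionary would typically be unpacked into `model.generate(**inputs, **GENERATION_KWARGS)`, and `tokenizer.apply_chat_template` would replace the hand-written formatter.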
Like other language models, Phi-4-reasoning can potentially behave in ways that are unfair, unreliable, or offensive. Some of the limiting behaviors to be aware of include:

- Quality of Service: The model is trained primarily on English text. Languages other than English will experience worse performance. English language varieties with less representation in the training data might experience worse performance than standard American English. Phi-4-reasoning is not intended to support multilingual use.
- Representation of Harms & Perpetuation of Stereotypes: These models can over- or under-represent groups of people, erase representation of some groups, or reinforce demeaning or negative stereotypes. Despite safety post-training, these limitations may still be present due to differing levels of representation of different groups or prevalence of examples of negative stereotypes in training data that reflect real-world patterns and societal biases.
- Inappropriate or Offensive Content: These models may produce other types of inappropriate or offensive content, which may make it inappropriate to deploy for sensitive contexts without additional mitigations that are specific to the use case.
- Information Reliability: Language models can generate nonsensical content or fabricate content that might sound reasonable but is inaccurate or outdated.
- Election Information Reliability: The model has an elevated defect rate when responding to election-critical queries, which may result in incorrect or unauthoritative election-critical information being presented. We are working to improve the model's performance in this area. Users should verify information related to elections with the election authority in their region.
- Limited Scope for Code: The majority of Phi-4-reasoning training data is based in Python and uses common packages such as `typing`, `math`, `random`, `collections`, `datetime`, `itertools`. If the model generates Python scripts that utilize other packages or scripts in other languages, we strongly recommend users manually verify all API uses.

Developers should apply responsible AI best practices and are responsible for ensuring that a specific use case complies with relevant laws and regulations (e.g. privacy, trade, etc.). Using safety services like Azure AI Content Safety that have advanced guardrails is highly recommended. Important areas for consideration include:

- Allocation: Models may not be suitable for scenarios that could have consequential impact on legal status or the allocation of resources or life opportunities (ex: housing, employment, credit, etc.) without further assessments and additional debiasing techniques.
- High-Risk Scenarios: Developers should assess suitability of using models in high-risk scenarios where unfair, unreliable or offensive outputs might be extremely costly or lead to harm. This includes providing advice in sensitive or expert domains where accuracy and reliability are critical (ex: legal or health advice). Additional safeguards should be implemented at the application level according to the deployment context.
- Misinformation: Models may produce inaccurate information. Developers should follow transparency best practices and inform end-users they are interacting with an AI system. At the application level, developers can build feedback mechanisms and pipelines to ground responses in use-case specific, contextual information, a technique known as Retrieval Augmented Generation (RAG).
- Generation of Harmful Content: Developers should assess outputs for their context and use available safety classifiers or custom solutions appropriate for their use case.
- Misuse: Other forms of misuse such as fraud, spam, or malware production may be possible, and developers should ensure that their applications do not violate applicable laws and regulations.

license:mit
2,046
18

Qwen2.5-32B-Instruct

license:apache-2.0
2,022
3

Qwen2.5-Coder-32B-Instruct-bnb-4bit

license:apache-2.0
2,018
5

DeepSeek-V3-GGUF

license:mit
2,011
135

gemma-3-270m

See our collection for all versions of Gemma 3 including GGUF, 4-bit & 16-bit formats. Read our Guide to see how to run Gemma 3 correctly.

- Fine-tune Gemma 3 (270M) for free using our Google Colab notebook.
- Read our Blog about Gemma 3 support: unsloth.ai/blog/gemma3
- View the rest of our notebooks in our docs.

| Unsloth supports | Free Notebooks | Performance | Memory use |
|---|---|---|---|
| Gemma 3 (4B) | ▶️ Start on Colab | 2x faster | 80% less |
| Gemma-3n-E4B | ▶️ Start on Colab | 2x faster | 60% less |
| Gemma-3n-E4B (Audio) | ▶️ Start on Colab | 2x faster | 60% less |
| GRPO with Gemma 3 (1B) | ▶️ Start on Colab | 2x faster | 80% less |
| Gemma 3 (4B) Vision | ▶️ Start on Colab | 2x faster | 60% less |

- [Gemma 3 Technical Report][g3-tech-report]
- [Responsible Generative AI Toolkit][rai-toolkit]
- [Gemma on Kaggle][kaggle-gemma]
- [Gemma on Vertex Model Garden][vertex-mg-gemma3]

Summary description and brief definition of inputs and outputs.

Gemma is a family of lightweight, state-of-the-art open models from Google, built from the same research and technology used to create the Gemini models. Gemma 3 models are multimodal, handling text and image input and generating text output, with open weights for both pre-trained variants and instruction-tuned variants. Gemma 3 has a large, 128K context window, multilingual support in over 140 languages, and is available in more sizes than previous versions. Gemma 3 models are well-suited for a variety of text generation and image understanding tasks, including question answering, summarization, and reasoning. Their relatively small size makes it possible to deploy them in environments with limited resources such as laptops, desktops or your own cloud infrastructure, democratizing access to state of the art AI models and helping foster innovation for everyone.

- Input:
    - Text string, such as a question, a prompt, or a document to be summarized
    - Images, normalized to 896 x 896 resolution and encoded to 256 tokens each, for the 4B, 12B, and 27B sizes
    - Total input context of 128K tokens for the 4B, 12B, and 27B sizes, and 32K tokens for the 1B and 270M sizes
- Output:
    - Generated text in response to the input, such as an answer to a question, analysis of image content, or a summary of a document
    - Total output context up to 128K tokens for the 4B, 12B, and 27B sizes, and 32K tokens for the 1B and 270M sizes per request, subtracting the request input tokens

Data used for model training and how the data was processed.

These models were trained on a dataset of text data that includes a wide variety of sources. The 27B model was trained with 14 trillion tokens, the 12B model with 12 trillion tokens, the 4B model with 4 trillion tokens, the 1B model with 2 trillion tokens, and the 270M model with 6 trillion tokens. The knowledge cutoff date for the training data was August 2024. Here are the key components:

- Web Documents: A diverse collection of web text ensures the model is exposed to a broad range of linguistic styles, topics, and vocabulary. The training dataset includes content in over 140 languages.
- Code: Exposing the model to code helps it to learn the syntax and patterns of programming languages, which improves its ability to generate code and understand code-related questions.
- Mathematics: Training on mathematical text helps the model learn logical reasoning, symbolic representation, and to address mathematical queries.
- Images: A wide range of images enables the model to perform image analysis and visual data extraction tasks.
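The input and output limits described earlier in this card come down to simple token arithmetic: the output budget is the context window minus whatever the request already consumes, and each image costs 256 tokens. The sketch below illustrates that bookkeeping only. The function and constant names are hypothetical, and reading "128K" and "32K" as multiples of 1024 is an assumption.

```python
# Context windows per model size, as stated in the card.
# Assumption: "K" is read as 1024 tokens.
CONTEXT_TOKENS = {
    "270M": 32 * 1024,
    "1B": 32 * 1024,
    "4B": 128 * 1024,
    "12B": 128 * 1024,
    "27B": 128 * 1024,
}

# Each image is encoded to 256 tokens; the card lists image input
# for the 4B, 12B, and 27B sizes.
TOKENS_PER_IMAGE = 256
IMAGE_CAPABLE_SIZES = {"4B", "12B", "27B"}


def remaining_output_tokens(size: str, text_tokens: int, num_images: int = 0) -> int:
    """Return how many tokens are left for the response after the request input."""
    if num_images > 0 and size not in IMAGE_CAPABLE_SIZES:
        raise ValueError(f"image input is not listed for the {size} size")
    used = text_tokens + num_images * TOKENS_PER_IMAGE
    return max(CONTEXT_TOKENS[size] - used, 0)


# A 1,000-token text prompt on the 270M model.
print(remaining_output_tokens("270M", 1000))
# A 2,000-token text prompt plus three images on the 4B model.
print(remaining_output_tokens("4B", 2000, num_images=3))
```

Clamping at zero keeps the helper safe for oversized prompts; a caller that prefers to reject such requests could raise instead.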
The combination of these diverse data sources is crucial for training a powerful multimodal model that can handle a wide variety of different tasks and data formats.

Here are the key data cleaning and filtering methods applied to the training data:

- CSAM Filtering: Rigorous CSAM (Child Sexual Abuse Material) filtering was applied at multiple stages in the data preparation process to ensure the exclusion of harmful and illegal content.
- Sensitive Data Filtering: As part of making Gemma pre-trained models safe and reliable, automated techniques were used to filter out certain personal information and other sensitive data from training sets.
- Additional methods: Filtering based on content quality and safety in line with [our policies][safety-policies].

Gemma was trained using [Tensor Processing Unit (TPU)][tpu] hardware (TPUv4p, TPUv5p and TPUv5e). Training vision-language models (VLMs) requires significant computational power. TPUs, designed specifically for matrix operations common in machine learning, offer several advantages in this domain:

- Performance: TPUs are specifically designed to handle the massive computations involved in training VLMs. They can speed up training considerably compared to CPUs.
- Memory: TPUs often come with large amounts of high-bandwidth memory, allowing for the handling of large models and batch sizes during training. This can lead to better model quality.
- Scalability: TPU Pods (large clusters of TPUs) provide a scalable solution for handling the growing complexity of large foundation models. You can distribute training across multiple TPU devices for faster and more efficient processing.
- Cost-effectiveness: In many scenarios, TPUs can provide a more cost-effective solution for training large models compared to CPU-based infrastructure, especially when considering the time and resources saved due to faster training.
- These advantages are aligned with [Google's commitments to operate sustainably][sustainability].
Training was done using [JAX][jax] and [ML Pathways][ml-pathways]. JAX allows researchers to take advantage of the latest generation of hardware, including TPUs, for faster and more efficient training of large models. ML Pathways is Google's latest effort to build artificially intelligent systems capable of generalizing across multiple tasks. This is specially suitable for foundation models, including large language models like these ones. Together, JAX and ML Pathways are used as described in the [paper about the Gemini family of models][gemini-2-paper]; "the 'single controller' programming model of Jax and Pathways allows a single Python process to orchestrate the entire training run, dramatically simplifying the development workflow." These models were evaluated against a large collection of different datasets and metrics to cover different aspects of text generation. Evaluation results marked with IT are for instruction-tuned models. Evaluation results marked with PT are for pre-trained models.
| Benchmark | n-shot | Gemma 3 PT 270M |
| :--- | :---: | ---: |
| [HellaSwag][hellaswag] | 10-shot | 40.9 |
| [BoolQ][boolq] | 0-shot | 61.4 |
| [PIQA][piqa] | 0-shot | 67.7 |
| [TriviaQA][triviaqa] | 5-shot | 15.4 |
| [ARC-c][arc] | 25-shot | 29.0 |
| [ARC-e][arc] | 0-shot | 57.7 |
| [WinoGrande][winogrande] | 5-shot | 52.0 |

| Benchmark | n-shot | Gemma 3 IT 270M |
| :--- | :---: | ---: |
| [HellaSwag][hellaswag] | 0-shot | 37.7 |
| [PIQA][piqa] | 0-shot | 66.2 |
| [ARC-c][arc] | 0-shot | 28.2 |
| [WinoGrande][winogrande] | 0-shot | 52.3 |
| [BIG-Bench Hard][bbh] | few-shot | 26.7 |
| [IF Eval][ifeval] | 0-shot | 51.2 |

| Benchmark | n-shot | Gemma 3 IT 1B | Gemma 3 IT 4B | Gemma 3 IT 12B | Gemma 3 IT 27B |
|---|---|:---:|:---:|:---:|:---:|
| [GPQA][gpqa] Diamond | 0-shot | 19.2 | 30.8 | 40.9 | 42.4 |
| [SimpleQA][simpleqa] | 0-shot | 2.2 | 4.0 | 6.3 | 10.0 |
| [FACTS Grounding][facts-grdg] | - | 36.4 | 70.1 | 75.8 | 74.9 |
| [BIG-Bench Hard][bbh] | 0-shot | 39.1 | 72.2 | 85.7 | 87.6 |
| [BIG-Bench Extra Hard][bbeh] | 0-shot | 7.2 | 11.0 | 16.3 | 19.3 |
| [IFEval][ifeval] | 0-shot | 80.2 | 90.2 | 88.9 | 90.4 |

| Benchmark | n-shot | Gemma 3 PT 1B | Gemma 3 PT 4B | Gemma 3 PT 12B | Gemma 3 PT 27B |
|---|---|:---:|:---:|:---:|:---:|
| [HellaSwag][hellaswag] | 10-shot | 62.3 | 77.2 | 84.2 | 85.6 |
| [BoolQ][boolq] | 0-shot | 63.2 | 72.3 | 78.8 | 82.4 |
| [PIQA][piqa] | 0-shot | 73.8 | 79.6 | 81.8 | 83.3 |
| [SocialIQA][socialiqa] | 0-shot | 48.9 | 51.9 | 53.4 | 54.9 |
| [TriviaQA][triviaqa] | 5-shot | 39.8 | 65.8 | 78.2 | 85.5 |
| [Natural Questions][naturalq] | 5-shot | 9.48 | 20.0 | 31.4 | 36.1 |
| [ARC-c][arc] | 25-shot | 38.4 | 56.2 | 68.9 | 70.6 |
| [ARC-e][arc] | 0-shot | 73.0 | 82.4 | 88.3 | 89.0 |
| [WinoGrande][winogrande] | 5-shot | 58.2 | 64.7 | 74.3 | 78.8 |
| [BIG-Bench Hard][bbh] | few-shot | 28.4 | 50.9 | 72.6 | 77.7 |
| [DROP][drop] | 1-shot | 42.4 | 60.1 | 72.2 | 77.2 |

| Benchmark | n-shot | Gemma 3 IT 1B | Gemma 3 IT 4B | Gemma 3 IT 12B | Gemma 3 IT 27B |
|---|---|:---:|:---:|:---:|:---:|
| [MMLU][mmlu] (Pro) | 0-shot | 14.7 | 43.6 | 60.6 | 67.5 |
| [LiveCodeBench][lcb] | 0-shot | 1.9 | 12.6 | 24.6 | 29.7 |
| [Bird-SQL][bird-sql] (dev) | - | 6.4 | 36.3 | 47.9 | 54.4 |
| [Math][math] | 0-shot | 48.0 | 75.6 | 83.8 | 89.0 |
| HiddenMath | 0-shot | 15.8 | 43.0 | 54.5 | 60.3 |
| [MBPP][mbpp] | 3-shot | 35.2 | 63.2 | 73.0 | 74.4 |
| [HumanEval][humaneval] | 0-shot | 41.5 | 71.3 | 85.4 | 87.8 |
| [Natural2Code][nat2code] | 0-shot | 56.0 | 70.3 | 80.7 | 84.5 |
| [GSM8K][gsm8k] | 0-shot | 62.8 | 89.2 | 94.4 | 95.9 |

| Benchmark | n-shot | Gemma 3 PT 4B | Gemma 3 PT 12B | Gemma 3 PT 27B |
|---|---|:---:|:---:|:---:|
| [MMLU][mmlu] | 5-shot | 59.6 | 74.5 | 78.6 |
| [MMLU][mmlu] (Pro COT) | 5-shot | 29.2 | 45.3 | 52.2 |
| [AGIEval][agieval] | 3-5-shot | 42.1 | 57.4 | 66.2 |
| [MATH][math] | 4-shot | 24.2 | 43.3 | 50.0 |
| [GSM8K][gsm8k] | 8-shot | 38.4 | 71.0 | 82.6 |
| [GPQA][gpqa] | 5-shot | 15.0 | 25.4 | 24.3 |
| [MBPP][mbpp] | 3-shot | 46.0 | 60.4 | 65.6 |
| [HumanEval][humaneval] | 0-shot | 36.0 | 45.7 | 48.8 |

| Benchmark | n-shot | Gemma 3 IT 1B | Gemma 3 IT 4B | Gemma 3 IT 12B | Gemma 3 IT 27B |
|---|---|:---:|:---:|:---:|:---:|
| [Global-MMLU-Lite][global-mmlu-lite] | 0-shot | 34.2 | 54.5 | 69.5 | 75.1 |
| [ECLeKTic][eclektic] | 0-shot | 1.4 | 4.6 | 10.3 | 16.7 |
| [WMT24++][wmt24pp] | 0-shot | 35.9 | 46.8 | 51.6 | 53.4 |

| Benchmark | Gemma 3 PT 1B | Gemma 3 PT 4B | Gemma 3 PT 12B | Gemma 3 PT 27B |
|---|:---:|:---:|:---:|:---:|
| [MGSM][mgsm] | 2.04 | 34.7 | 64.3 | 74.3 |
| [Global-MMLU-Lite][global-mmlu-lite] | 24.9 | 57.0 | 69.4 | 75.7 |
| [WMT24++][wmt24pp] (ChrF) | 36.7 | 48.4 | 53.9 | 55.7 |
| [FloRes][flores] | 29.5 | 39.2 | 46.0 | 48.8 |
| [XQuAD][xquad] (all) | 43.9 | 68.0 | 74.5 | 76.8 |
| [ECLeKTic][eclektic] | 4.69 | 11.0 | 17.2 | 24.4 |
| [IndicGenBench][indicgenbench] | 41.4 | 57.2 | 61.7 | 63.4 |

| Benchmark | Gemma 3 IT 4B | Gemma 3 IT 12B | Gemma 3 IT 27B |
|---|:---:|:---:|:---:|
| [MMMU][mmmu] (val) | 48.8 | 59.6 | 64.9 |
| [DocVQA][docvqa] | 75.8 | 87.1 | 86.6 |
| [InfoVQA][info-vqa] | 50.0 | 64.9 | 70.6 |
| [TextVQA][textvqa] | 57.8 | 67.7 | 65.1 |
| [AI2D][ai2d] | 74.8 | 84.2 | 84.5 |
| [ChartQA][chartqa] | 68.8 | 75.7 | 78.0 |
| [VQAv2][vqav2] (val) | 62.4 | 71.6 | 71.0 |
| [MathVista][mathvista] (testmini) | 50.0 | 62.9 | 67.6 |

| Benchmark | Gemma 3 PT 4B | Gemma 3 PT 12B | Gemma 3 PT 27B |
|---|:---:|:---:|:---:|
| [COCOcap][coco-cap] | 102 | 111 | 116 |
| [DocVQA][docvqa] (val) | 72.8 | 82.3 | 85.6 |
| [InfoVQA][info-vqa] (val) | 44.1 | 54.8 | 59.4 |
| [MMMU][mmmu] (pt) | 39.2 | 50.3 | 56.1 |
| [TextVQA][textvqa] (val) | 58.9 | 66.5 | 68.6 |
| [RealWorldQA][realworldqa] | 45.5 | 52.2 | 53.9 |
| [ReMI][remi] | 27.3 | 38.5 | 44.8 |
| [AI2D][ai2d] | 63.2 | 75.2 | 79.0 |
| [ChartQA][chartqa] | 63.6 | 74.7 | 76.3 |
| [VQAv2][vqav2] | 63.9 | 71.2 | 72.9 |
| [BLINK][blinkvqa] | 38.0 | 35.9 | 39.6 |
| [OKVQA][okvqa] | 51.0 | 58.7 | 60.2 |
| [TallyQA][tallyqa] | 42.5 | 51.8 | 54.3 |
| [SpatialSense VQA][ss-vqa] | 50.9 | 60.0 | 59.4 |
| [CountBenchQA][countbenchqa] | 26.1 | 17.8 | 68.0 |

[hellaswag]: https://arxiv.org/abs/1905.07830
[boolq]: https://arxiv.org/abs/1905.10044
[piqa]: https://arxiv.org/abs/1911.11641
[triviaqa]: https://arxiv.org/abs/1705.03551
[arc]: https://arxiv.org/abs/1911.01547
[winogrande]: https://arxiv.org/abs/1907.10641
[bbh]: https://paperswithcode.com/dataset/bbh
[ifeval]: https://arxiv.org/abs/2311.07911
[gpqa]: https://arxiv.org/abs/2311.12022
[simpleqa]: https://arxiv.org/abs/2411.04368
[facts-grdg]: https://goo.gle/FACTSpaper
[bbeh]: https://github.com/google-deepmind/bbeh
[socialiqa]: https://arxiv.org/abs/1904.09728
[naturalq]: https://github.com/google-research-datasets/natural-questions
[drop]: https://arxiv.org/abs/1903.00161
[mmlu]: https://arxiv.org/abs/2009.03300
[agieval]: https://arxiv.org/abs/2304.06364
[math]: https://arxiv.org/abs/2103.03874
[gsm8k]: https://arxiv.org/abs/2110.14168
[mbpp]: https://arxiv.org/abs/2108.07732
[humaneval]: https://arxiv.org/abs/2107.03374
[lcb]: https://arxiv.org/abs/2403.07974
[bird-sql]: https://arxiv.org/abs/2305.03111
[nat2code]: https://arxiv.org/abs/2405.04520
[mgsm]: https://arxiv.org/abs/2210.03057
[flores]: https://arxiv.org/abs/2106.03193
[xquad]: https://arxiv.org/abs/1910.11856v3
[global-mmlu-lite]: https://huggingface.co/datasets/CohereForAI/Global-MMLU-Lite
[wmt24pp]: https://arxiv.org/abs/2502.12404v1
[eclektic]: https://arxiv.org/abs/2502.21228
[indicgenbench]: https://arxiv.org/abs/2404.16816
[coco-cap]: https://cocodataset.org/#home
[docvqa]: https://www.docvqa.org/
[info-vqa]: https://arxiv.org/abs/2104.12756
[mmmu]: https://arxiv.org/abs/2311.16502
[textvqa]: https://textvqa.org/
[realworldqa]: https://paperswithcode.com/dataset/realworldqa
[remi]: https://arxiv.org/html/2406.09175v1
[ai2d]: https://allenai.org/data/diagrams
[chartqa]: https://arxiv.org/abs/2203.10244
[vqav2]: https://visualqa.org/index.html
[blinkvqa]: https://arxiv.org/abs/2404.12390
[okvqa]: https://okvqa.allenai.org/
[tallyqa]: https://arxiv.org/abs/1810.12440
[ss-vqa]: https://arxiv.org/abs/1908.02660
[countbenchqa]: https://github.com/google-research/big_vision/blob/main/big_vision/datasets/countbenchqa/
[mathvista]: https://arxiv.org/abs/2310.02255

Our evaluation methods include structured evaluations and internal red-teaming testing of relevant content policies. Red-teaming was conducted by a number of different teams, each with different goals and human evaluation metrics. These models were evaluated against a number of different categories relevant to ethics and safety, including:

- Child Safety: Evaluation of text-to-text and image to text prompts covering child safety policies, including child sexual abuse and exploitation.
- Content Safety: Evaluation of text-to-text and image to text prompts covering safety policies including, harassment, violence and gore, and hate speech.
- Representational Harms: Evaluation of text-to-text and image to text prompts covering safety policies including bias, stereotyping, and harmful associations or inaccuracies.

In addition to development level evaluations, we conduct "assurance evaluations" which are our 'arms-length' internal evaluations for responsibility governance decision making. They are conducted separately from the model development team, to inform decision making about release. High level findings are fed back to the model team, but prompt sets are held-out to prevent overfitting and preserve the results' ability to inform decision making. Assurance evaluation results are reported to our Responsibility & Safety Council as part of release review.
For all areas of safety testing, we saw major improvements in the categories of child safety, content safety, and representational harms relative to previous Gemma models. All testing was conducted without safety filters to evaluate the model capabilities and behaviors. For both text-to-text and image-to-text, and across all model sizes, the model produced minimal policy violations, and showed significant improvements over previous Gemma models' performance with respect to ungrounded inferences. A limitation of our evaluations was that they included only English language prompts.

These models have certain limitations that users should be aware of.

Open vision-language models (VLMs) have a wide range of applications across various industries and domains. The following list of potential uses is not comprehensive. The purpose of this list is to provide contextual information about the possible use-cases that the model creators considered as part of model training and development.

- Content Creation and Communication
  - Text Generation: These models can be used to generate creative text formats such as poems, scripts, code, marketing copy, and email drafts.
  - Chatbots and Conversational AI: Power conversational interfaces for customer service, virtual assistants, or interactive applications.
  - Text Summarization: Generate concise summaries of a text corpus, research papers, or reports.
  - Image Data Extraction: These models can be used to extract, interpret, and summarize visual data for text communications.
- Research and Education
  - Natural Language Processing (NLP) and VLM Research: These models can serve as a foundation for researchers to experiment with VLM and NLP techniques, develop algorithms, and contribute to the advancement of the field.
  - Language Learning Tools: Support interactive language learning experiences, aiding in grammar correction or providing writing practice.
- Knowledge Exploration: Assist researchers in exploring large bodies of text by generating summaries or answering questions about specific topics. - Training Data - The quality and diversity of the training data significantly influence the model's capabilities. Biases or gaps in the training data can lead to limitations in the model's responses. - The scope of the training dataset determines the subject areas the model can handle effectively. - Context and Task Complexity - Models are better at tasks that can be framed with clear prompts and instructions. Open-ended or highly complex tasks might be challenging. - A model's performance can be influenced by the amount of context provided (longer context generally leads to better outputs, up to a certain point). - Language Ambiguity and Nuance - Natural language is inherently complex. Models might struggle to grasp subtle nuances, sarcasm, or figurative language. - Factual Accuracy - Models generate responses based on information they learned from their training datasets, but they are not knowledge bases. They may generate incorrect or outdated factual statements. - Common Sense - Models rely on statistical patterns in language. They might lack the ability to apply common sense reasoning in certain situations. The development of vision-language models (VLMs) raises several ethical concerns. In creating an open model, we have carefully considered the following: - Bias and Fairness - VLMs trained on large-scale, real-world text and image data can reflect socio-cultural biases embedded in the training material. These models underwent careful scrutiny, input data pre-processing described and posterior evaluations reported in this card. - Misinformation and Misuse - VLMs can be misused to generate text that is false, misleading, or harmful. - Guidelines are provided for responsible use with the model, see the [Responsible Generative AI Toolkit][rai-toolkit]. 
- Transparency and Accountability: - This model card summarizes details on the models' architecture, capabilities, limitations, and evaluation processes. - A responsibly developed open model offers the opportunity to share innovation by making VLM technology accessible to developers and researchers across the AI ecosystem. - Perpetuation of biases: It's encouraged to perform continuous monitoring (using evaluation metrics, human review) and the exploration of de-biasing techniques during model training, fine-tuning, and other use cases. - Generation of harmful content: Mechanisms and guidelines for content safety are essential. Developers are encouraged to exercise caution and implement appropriate content safety safeguards based on their specific product policies and application use cases. - Misuse for malicious purposes: Technical limitations and developer and end-user education can help mitigate against malicious applications of VLMs. Educational resources and reporting mechanisms for users to flag misuse are provided. Prohibited uses of Gemma models are outlined in the [Gemma Prohibited Use Policy][prohibited-use]. - Privacy violations: Models were trained on data filtered for removal of certain personal information and other sensitive data. Developers are encouraged to adhere to privacy regulations with privacy-preserving techniques. At the time of release, this family of models provides high-performance open vision-language model implementations designed from the ground up for responsible AI development compared to similarly sized models. Using the benchmark evaluation metrics described in this document, these models have shown to provide superior performance to other, comparably-sized open model alternatives. 
[g3-tech-report]: https://arxiv.org/abs/2503.19786
[rai-toolkit]: https://ai.google.dev/responsible
[kaggle-gemma]: https://www.kaggle.com/models/google/gemma-3
[vertex-mg-gemma3]: https://console.cloud.google.com/vertex-ai/publishers/google/model-garden/gemma3
[terms]: https://ai.google.dev/gemma/terms
[safety-policies]: https://ai.google/static/documents/ai-responsibility-update-published-february-2025.pdf
[prohibited-use]: https://ai.google.dev/gemma/prohibitedusepolicy
[tpu]: https://cloud.google.com/tpu/docs/intro-to-tpu
[sustainability]: https://sustainability.google/operating-sustainably/
[jax]: https://github.com/jax-ml/jax
[ml-pathways]: https://blog.google/technology/ai/introducing-pathways-next-generation-ai-architecture/
[gemini-2-paper]: https://arxiv.org/abs/2312.11805

1,980
1

Qwen2.5-Coder-1.5B-Instruct

NaNK
license:apache-2.0
1,967
5

Qwen3-VL-4B-Thinking-unsloth-bnb-4bit

Unsloth Dynamic 2.0 achieves superior accuracy & outperforms other leading quants.

Meet Qwen3-VL, the most powerful vision-language model in the Qwen series to date. This generation delivers comprehensive upgrades across the board: superior text understanding & generation, deeper visual perception & reasoning, extended context length, enhanced spatial and video dynamics comprehension, and stronger agent interaction capabilities. Available in Dense and MoE architectures that scale from edge to cloud, with Instruct and reasoning-enhanced Thinking editions for flexible, on-demand deployment.

- Visual Agent: Operates PC/mobile GUIs: recognizes elements, understands functions, invokes tools, completes tasks.
- Visual Coding Boost: Generates Draw.io/HTML/CSS/JS from images/videos.
- Advanced Spatial Perception: Judges object positions, viewpoints, and occlusions; provides stronger 2D grounding and enables 3D grounding for spatial reasoning and embodied AI.
- Long Context & Video Understanding: Native 256K context, expandable to 1M; handles books and hours-long video with full recall and second-level indexing.
- Enhanced Multimodal Reasoning: Excels in STEM/Math, with causal analysis and logical, evidence-based answers.
- Upgraded Visual Recognition: Broader, higher-quality pretraining enables the model to “recognize everything”: celebrities, anime, products, landmarks, flora/fauna, etc.
- Expanded OCR: Supports 32 languages (up from 19); robust in low light, blur, and tilt; better with rare/ancient characters and jargon; improved long-document structure parsing.
- Text Understanding on par with pure LLMs: Seamless text–vision fusion for lossless, unified comprehension.

1. Interleaved-MRoPE: Full-frequency allocation over time, width, and height via robust positional embeddings, enhancing long-horizon video reasoning.
2. DeepStack: Fuses multi-level ViT features to capture fine-grained details and sharpen image–text alignment.
3. Text–Timestamp Alignment: Moves beyond T-RoPE to precise, timestamp-grounded event localization for stronger video temporal modeling.

This is the weight repository for Qwen3-VL-4B-Thinking.

Below, we provide simple examples to show how to use Qwen3-VL with 🤖 ModelScope and 🤗 Transformers. The code for Qwen3-VL is in the latest Hugging Face transformers, and we advise you to build from source. The chat model is used through `transformers`.

If you find our work helpful, feel free to cite us.
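The usage snippet did not survive on this page, so the following is only a minimal sketch of what a `transformers` call could look like, not the authors' original example. It assumes a recent `transformers` build in which `AutoModelForImageTextToText` resolves this architecture, that `accelerate` is installed for `device_map="auto"`, and that the processor's chat template accepts image entries under the `"image"` key. The helper names `build_messages` and `describe_image` are hypothetical.

```python
from typing import Any


def build_messages(image: str, question: str) -> list[dict[str, Any]]:
    """Build a single-turn chat holding one image and one text prompt."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image},  # URL or local path (assumed key)
                {"type": "text", "text": question},
            ],
        }
    ]


def describe_image(model_id: str, image: str, question: str,
                   max_new_tokens: int = 128) -> str:
    """Run one generation. Needs model weights and, in practice, a GPU."""
    # Imported lazily so build_messages() works without transformers installed.
    from transformers import AutoModelForImageTextToText, AutoProcessor

    model = AutoModelForImageTextToText.from_pretrained(model_id, device_map="auto")
    processor = AutoProcessor.from_pretrained(model_id)
    inputs = processor.apply_chat_template(
        build_messages(image, question),
        tokenize=True,
        add_generation_prompt=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Drop the prompt tokens so only the newly generated text is decoded.
    new_tokens = output_ids[:, inputs["input_ids"].shape[1]:]
    return processor.batch_decode(new_tokens, skip_special_tokens=True)[0]
```

A call might look like `describe_image("unsloth/Qwen3-VL-4B-Thinking-unsloth-bnb-4bit", "photo.jpg", "Describe this image.")`. The repository id is inferred from this listing rather than confirmed, and a bnb-4bit checkpoint additionally needs `bitsandbytes`.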

NaNK
license:apache-2.0
1,954
0

Qwen2.5-VL-32B-Instruct-unsloth-bnb-4bit

NaNK
license:apache-2.0
1,896
14

JanusCoder-14B-GGUF

> [!NOTE]
> Includes Unsloth chat template fixes! For `llama.cpp`, use `--jinja`
> Unsloth Dynamic 2.0 achieves superior accuracy & outperforms other leading quants.

💻 GitHub Repo • 🤗 Model Collections • 📜 Technical Report

We introduce JanusCoder and JanusCoderV, a suite of open-source foundational models designed to establish a unified visual-programmatic interface for code intelligence. This model suite is built upon open-source language models (such as Qwen3-8B and 14B) and multimodal models (such as Qwen2.5-VL and InternVL3.5-8B). The JanusCoder series is trained on JANUSCODE-800K, the largest multimodal code corpus to date, generated by an innovative synthesis toolkit and covering everything from standard charts to complex interactive Web UIs and code-driven animations. This enables the models to uniformly handle diverse visual-programmatic tasks, such as generating code from textual instructions, visual inputs, or a combination of both, rather than building specialized models for isolated tasks. JanusCoder excels at flexible content generation (like data visualizations and interactive front-ends) as well as precise, program-driven editing of visual effects and complex animation construction.

| Model Name | Description | Download |
| --- | --- | --- |
| JanusCoder-8B | 8B text model based on Qwen3-8B. | 🤗 Model |
| 👉 JanusCoder-14B | 14B text model based on Qwen3-14B. | 🤗 Model |
| JanusCoderV-7B | 7B multimodal model based on Qwen2.5-VL-7B. | 🤗 Model |
| JanusCoderV-8B | 8B multimodal model based on InternVL3.5-8B. | 🤗 Model |

We evaluate the JanusCoder model on various benchmarks that span code intelligence tasks in multiple programming languages:

| Benchmark | JanusCoder-14B | Qwen3-14B | Qwen2.5-Coder-32B-Instruct | LLaMA3-8B-Instruct | GPT-4o |
| --- | --- | --- | --- | --- | --- |
| PandasPlotBench (Task) | 86 | 78 | 82 | 69 | 85 |
| ArtifactsBench | 41.1 | 36.5 | 35.5 | 36.5 | 37.9 |
| DTVBench (Manim) | 8.41 | 6.63 | 9.61 | 4.92 | 10.60 |
| DTVBench (Wolfram) | 5.97 | 5.08 | 4.98 | 3.15 | 5.97 |

The following provides demo code illustrating how to generate text using JanusCoder-14B.

> Please use transformers >= 4.55.0 to ensure the model works normally.

Citation 🫶 If you are interested in our work or find the repository / checkpoints / benchmark / data helpful, please consider using the following citation format when referencing our papers:

NaNK
license:apache-2.0
1,873
1

granite-4.0-350m-GGUF

Unsloth Dynamic 2.0 achieves superior accuracy & outperforms other leading quants.

Model Summary: Granite-4.0-350M is a lightweight instruct model finetuned from Granite-4.0-350M-Base using a combination of open source instruction datasets with permissive licenses and internally collected synthetic datasets. This model is developed using a diverse set of techniques including supervised finetuning, reinforcement learning, and model merging.

- Developers: Granite Team, IBM
- HF Collection: Granite 4.0 Nano Language Models HF Collection
- GitHub Repository: ibm-granite/granite-4.0-nano-language-models
- Website: Granite Docs
- Release Date: October 28, 2025
- License: Apache 2.0

Supported Languages: English, German, Spanish, French, Japanese, Portuguese, Arabic, Czech, Italian, Korean, Dutch, and Chinese. Users may fine-tune Granite 4.0 Nano models to support languages beyond those included in this list.

Intended use: Granite 4.0 Nano instruct models feature strong instruction-following capabilities, bringing advanced AI capabilities within reach for on-device deployments and research use cases. Additionally, their compact size makes them well-suited for fine-tuning on specialized domains without requiring massive compute resources.

Capabilities

- Summarization
- Text classification
- Text extraction
- Question-answering
- Retrieval Augmented Generation (RAG)
- Code related tasks
- Function-calling tasks
- Multilingual dialog use cases
- Fill-In-the-Middle (FIM) code completions

Generation: This is a simple example of how to use the Granite-4.0-350M model. Then, copy the snippet from the section that is relevant for your use case.

Tool-calling: Granite-4.0-350M comes with enhanced tool calling capabilities, enabling seamless integration with external functions and APIs. To define a list of tools please follow OpenAI's function definition schema.
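As an illustration of the function definition schema mentioned above, here is a minimal sketch of one tool written as a plain Python dict. The tool `get_current_weather`, its parameter, and the `schema_problems` helper are hypothetical and not part of the Granite card. The checks are this sketch's own, deliberately strict convention; the schema itself treats some of these fields as optional.

```python
import json

# Hypothetical tool, written in the OpenAI function-definition layout.
weather_tool = {
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "Name of the city."},
            },
            "required": ["city"],
        },
    },
}


def schema_problems(tool: dict) -> list[str]:
    """Return human-readable problems found in one tool definition."""
    problems = []
    if tool.get("type") != "function":
        problems.append("type must be 'function'")
    fn = tool.get("function", {})
    for key in ("name", "description", "parameters"):
        if key not in fn:
            problems.append(f"function.{key} is missing")
    params = fn.get("parameters", {})
    if params and params.get("type") != "object":
        problems.append("function.parameters.type must be 'object'")
    # Every required parameter should also be declared under properties.
    missing = set(params.get("required", [])) - set(params.get("properties", {}))
    for name in sorted(missing):
        problems.append(f"required parameter '{name}' is not declared in properties")
    return problems


tools = [weather_tool]
print(json.dumps(tools, indent=2))
```

With `transformers`, a list like `tools` is typically passed through the `tools=` argument of the tokenizer's `apply_chat_template`; whether the Granite chat template expects exactly this layout should be checked against the original card.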
This is an example of how to use the Granite-4.0-350M model's tool-calling ability:

Benchmarks

| Metric | 350M Dense | H 350M Dense | 1B Dense | H 1B Dense |
| --- | --- | --- | --- | --- |

Multilingual benchmarks and their included languages:

| Benchmark | # Langs | Languages |
| --- | --- | --- |
| MMMLU | 11 | ar, de, en, es, fr, ja, ko, pt, zh, bn, hi |
| INCLUDE | 14 | hi, bn, ta, te, ar, de, es, fr, it, ja, ko, nl, pt, zh |

Granite-4.0-350M baseline is based on a decoder-only dense transformer architecture. Core components of this architecture are: GQA, MLP with SwiGLU, RMSNorm, and shared input/output embeddings.

| | 350M Dense | H 350M Dense | 1B Dense | H 1B Dense |
| --- | --- | --- | --- | --- |
| Number of layers | 28 attention | 4 attention / 28 Mamba2 | 40 attention | 4 attention / 36 Mamba2 |
| MLP / Shared expert hidden size | 2048 | 2048 | 4096 | 4096 |

Training Data: Overall, our SFT data is largely comprised of three key sources: (1) publicly available datasets with permissive licenses, (2) internal synthetic data targeting specific capabilities, and (3) a select set of human-curated data.

Infrastructure: We trained the Granite 4.0 Nano Language Models utilizing an NVIDIA GB200 NVL72 cluster hosted in CoreWeave. Intra-rack communication occurs via the 72-GPU NVLink domain, and a non-blocking, full Fat-Tree NDR 400 Gb/s InfiniBand network provides inter-rack communication. This cluster provides a scalable and efficient infrastructure for training our models over thousands of GPUs.

Ethical Considerations and Limitations: Granite 4.0 Nano Instruct Models are primarily finetuned using instruction-response pairs mostly in English, but also multilingual data covering multiple languages. Although this model can handle multilingual dialog use cases, its performance might not match its performance on English tasks. In such cases, introducing a small number of examples (few-shot) can help the model generate more accurate outputs. While this model has been aligned by keeping safety in consideration, the model may in some cases produce inaccurate, biased, or unsafe responses to user prompts.
So we urge the community to use this model with proper safety testing and tuning tailored for their specific tasks. Resources - ⭐️ Learn about the latest updates with Granite: https://www.ibm.com/granite - 📄 Get started with tutorials, best practices, and prompt engineering advice: https://www.ibm.com/granite/docs/ - 💡 Learn about the latest Granite learning resources: https://ibm.biz/granite-learning-resources

NaNK
license:apache-2.0
1,845
1

medgemma-4b-it

Unsloth Dynamic 2.0 achieves superior accuracy & outperforms other leading quants. Model on Google Cloud Model Garden: MedGemma Model on Hugging Face: MedGemma GitHub repository (supporting code, Colab notebooks, discussions, and issues): MedGemma Quick start notebook: GitHub Fine-tuning notebook: GitHub Concept applications built using MedGemma: Collection Support: See Contact License: The use of MedGemma is governed by the Health AI Developer Foundations terms of use. This section describes the MedGemma model and how to use it. MedGemma is a collection of Gemma 3 variants that are trained for performance on medical text and image comprehension. Developers can use MedGemma to accelerate building healthcare-based AI applications. MedGemma currently comes in three variants: a 4B multimodal version and 27B text-only and multimodal versions. Both MedGemma multimodal versions utilize a SigLIP image encoder that has been specifically pre-trained on a variety of de-identified medical data, including chest X-rays, dermatology images, ophthalmology images, and histopathology slides. Their LLM components are trained on a diverse set of medical data, including medical text, medical question-answer pairs, FHIR-based electronic health record data (27B multimodal only), radiology images, histopathology patches, ophthalmology images, and dermatology images. MedGemma 4B is available in both pre-trained (suffix: `-pt`) and instruction-tuned (suffix `-it`) versions. The instruction-tuned version is a better starting point for most applications. The pre-trained version is available for those who want to experiment more deeply with the models. MedGemma 27B multimodal has pre-training on medical image, medical record and medical record comprehension tasks. MedGemma 27B text-only has been trained exclusively on medical text. Both models have been optimized for inference-time computation on medical reasoning. 
This means it has slightly higher performance on some text benchmarks than MedGemma 27B multimodal. Users who want to work with a single model for both medical text, medical record and medical image tasks are better suited for MedGemma 27B multimodal. Those that only need text use-cases may be better served with the text-only variant. Both MedGemma 27B variants are only available in instruction-tuned versions. MedGemma variants have been evaluated on a range of clinically relevant benchmarks to illustrate their baseline performance. These evaluations are based on both open benchmark datasets and curated datasets. Developers can fine-tune MedGemma variants for improved performance. Consult the Intended Use section below for more details. MedGemma is optimized for medical applications that involve a text generation component. For medical image-based applications that do not involve text generation, such as data-efficient classification, zero-shot classification, or content-based or semantic image retrieval, the MedSigLIP image encoder is recommended. MedSigLIP is based on the same image encoder that powers MedGemma. Please consult the MedGemma Technical Report for more details. Below are some example code snippets to help you quickly get started running the model locally on GPU. If you want to use the model at scale, we recommend that you create a production version using Model Garden. First, install the Transformers library. Gemma 3 is supported starting from transformers 4.50.0. See the following Colab notebooks for examples of how to use MedGemma: To give the model a quick try, running it locally with weights from Hugging Face, see Quick start notebook in Colab. Note that you will need to use Colab Enterprise to obtain adequate GPU resources to run either 27B model without quantization. For an example of fine-tuning the 4B model, see the Fine-tuning notebook in Colab. 
The 27B models can be fine tuned in a similar manner but will require more time and compute resources than the 4B model. The MedGemma model is built based on Gemma 3 and uses the same decoder-only transformer architecture as Gemma 3\. To read more about the architecture, consult the Gemma 3 model card. Model type: Decoder-only Transformer architecture, see the Gemma 3 Technical Report Input Modalities: Text, vision Output Modality: Text only Attention mechanism: Grouped-query attention (GQA) Context length: Supports long context, at least 128K tokens Key publication: https://arxiv.org/abs/2507.05201 Model created: July 9, 2025 When using this model, please cite: Sellergren et al. "MedGemma Technical Report." arXiv preprint arXiv:2507.05201 (2025). Text string, such as a question or prompt Images, normalized to 896 x 896 resolution and encoded to 256 tokens each Total input length of 128K tokens Generated text in response to the input, such as an answer to a question, analysis of image content, or a summary of a document Total output length of 8192 tokens MedGemma was evaluated across a range of different multimodal classification, report generation, visual question answering, and text-based tasks. The multimodal performance of MedGemma 4B and 27B multimodal was evaluated across a range of benchmarks, focusing on radiology, dermatology, histopathology, ophthalmology, and multimodal clinical reasoning. MedGemma 4B outperforms the base Gemma 3 4B model across all tested multimodal health benchmarks. 
| Task and metric | Gemma 3 4B | MedGemma 4B |
| :---- | :---- | :---- |
| Medical image classification | | |
| MIMIC CXR\*\* - macro F1 for top 5 conditions | 81.2 | 88.9 |
| CheXpert CXR - macro F1 for top 5 conditions | 32.6 | 48.1 |
| CXR14 - macro F1 for 3 conditions | 32.0 | 50.1 |
| PathMCQA\* (histopathology, internal\*\*) - Accuracy | 37.1 | 69.8 |
| US-DermMCQA\* - Accuracy | 52.5 | 71.8 |
| EyePACS\* (fundus, internal) - Accuracy | 14.4 | 64.9 |
| Visual question answering | | |
| SLAKE (radiology) - Tokenized F1 | 40.2 | 72.3 |
| VQA-RAD\*\*\* (radiology) - Tokenized F1 | 33.6 | 49.9 |
| Knowledge and reasoning | | |
| MedXpertQA (text + multimodal questions) - Accuracy | 16.4 | 18.8 |

\* Internal datasets. US-DermMCQA is described in Liu (2020, Nature medicine), presented as a 4-way MCQ per example for skin condition classification. PathMCQA is based on multiple datasets, presented as 3-9 way MCQ per example for identification, grading, and subtype for breast, cervical, and prostate cancer. EyePACS is a dataset of fundus images with classification labels based on 5-level diabetic retinopathy severity (None, Mild, Moderate, Severe, Proliferative). More details in the MedGemma Technical Report.

\*\* Based on radiologist adjudicated labels, described in Yang (2024, arXiv) Section A.1.1.

\*\*\* Based on "balanced split," described in Yang (2024, arXiv).

MedGemma chest X-ray (CXR) report generation performance was evaluated on MIMIC-CXR using the RadGraph F1 metric. We compare the MedGemma pre-trained checkpoint with our previous best model for CXR report generation, PaliGemma 2.
| Metric | MedGemma 4B (pre-trained) | MedGemma 4B (tuned for CXR) | PaliGemma 2 3B (tuned for CXR) | PaliGemma 2 10B (tuned for CXR) |
| :---- | :---- | :---- | :---- | :---- |
| MIMIC CXR - RadGraph F1 | 29.5 | 30.3 | 28.8 | 29.5 |

The instruction-tuned versions of MedGemma 4B and MedGemma 27B achieve lower scores (21.9 and 21.3, respectively) due to the differences in reporting style compared to the MIMIC ground truth reports. Further fine-tuning on MIMIC reports enables users to achieve improved performance, as shown by the improved performance of the MedGemma 4B model that was tuned for CXR.

MedGemma 4B and text-only MedGemma 27B were evaluated across a range of text-only benchmarks for medical knowledge and reasoning. The MedGemma models outperform their respective base Gemma models across all tested text-only health benchmarks.

| Metric | Gemma 3 4B | MedGemma 4B |
| :---- | :---- | :---- |
| MedQA (4-op) | 50.7 | 64.4 |
| MedMCQA | 45.4 | 55.7 |
| PubMedQA | 68.4 | 73.4 |
| MMLU Med | 67.2 | 70.0 |
| MedXpertQA (text only) | 11.6 | 14.2 |
| AfriMed-QA (25 question test set) | 48.0 | 52.0 |

For all MedGemma 27B results, test-time scaling is used to improve performance.

All models were evaluated on a question answer dataset from synthetic FHIR data to answer questions about patient records. MedGemma 27B multimodal's FHIR-specific training gives it significant improvement over other MedGemma and Gemma models.

| Metric | Gemma 3 4B | MedGemma 4B |
| :---- | :---- | :---- |
| EHRQA | 70.9 | 67.6 |

Our evaluation methods include structured evaluations and internal red-teaming testing of relevant content policies. Red-teaming was conducted by a number of different teams, each with different goals and human evaluation metrics.
These models were evaluated against a number of different categories relevant to ethics and safety, including: Child safety: Evaluation of text-to-text and image-to-text prompts covering child safety policies, including child sexual abuse and exploitation. Content safety: Evaluation of text-to-text and image-to-text prompts covering safety policies, including harassment, violence and gore, and hate speech. Representational harms: Evaluation of text-to-text and image-to-text prompts covering safety policies, including bias, stereotyping, and harmful associations or inaccuracies. General medical harms: Evaluation of text-to-text and image-to-text prompts covering safety policies, including information quality and harmful associations or inaccuracies. In addition to development level evaluations, we conduct "assurance evaluations" which are our "arms-length" internal evaluations for responsibility governance decision making. They are conducted separately from the model development team, to inform decision making about release. High-level findings are fed back to the model team, but prompt sets are held out to prevent overfitting and preserve the results' ability to inform decision making. Notable assurance evaluation results are reported to our Responsibility & Safety Council as part of release review. For all areas of safety testing, we saw safe levels of performance across the categories of child safety, content safety, and representational harms. All testing was conducted without safety filters to evaluate the model capabilities and behaviors. For text-to-text, image-to-text, and audio-to-text, and across both MedGemma model sizes, the model produced minimal policy violations. A limitation of our evaluations was that they included primarily English language prompts. The base Gemma models are pre-trained on a large corpus of text and code data. 
MedGemma 4B utilizes a SigLIP image encoder that has been specifically pre-trained on a variety of de-identified medical data, including radiology images, histopathology images, ophthalmology images, and dermatology images. Its LLM component is trained on a diverse set of medical data, including medical text relevant to radiology images, chest-x rays, histopathology patches, ophthalmology images and dermatology images. MedGemma models have been evaluated on a comprehensive set of clinically relevant benchmarks, including over 22 datasets across 5 different tasks and 6 medical image modalities. These include both open benchmark datasets and curated datasets, with a focus on expert human evaluations for tasks like CXR report generation and radiology VQA.
The base Gemma models are pre-trained on a large corpus of text and code data. MedGemma multimodal variants utilize a SigLIP image encoder that has been specifically pre-trained on a variety of de-identified medical data, including radiology images, histopathology images, ophthalmology images, and dermatology images. Their LLM component is trained on a diverse set of medical data, including medical text, medical question-answer pairs, FHIR-based electronic health record data (27B multimodal only), radiology images, histopathology patches, ophthalmology images, and dermatology images. MedGemma models have been evaluated on a comprehensive set of clinically relevant benchmarks, including over 22 datasets across 6 different tasks and 4 medical image modalities. These benchmarks include both open and internal datasets. MedGemma utilizes a combination of public and private datasets.
This model was trained on diverse public datasets including MIMIC-CXR (chest X-rays and reports), ChestImaGenome: Set of bounding boxes linking image findings with anatomical regions for MIMIC-CXR (MedGemma 27B multimodal only), SLAKE (multimodal medical images and questions), PAD-UFES-20 (skin lesion images and data), SCIN (dermatology images), TCGA (cancer genomics data), CAMELYON (lymph node histopathology images), PMC-OA (biomedical literature with images), and Mendeley Digital Knee X-Ray (knee X-rays). Additionally, multiple diverse proprietary datasets were licensed and incorporated (described next). MIMIC-CXR: MIT Laboratory for Computational Physiology and Beth Israel Deaconess Medical Center (BIDMC). Slake-VQA: The Hong Kong Polytechnic University (PolyU), with collaborators including West China Hospital of Sichuan University and Sichuan Academy of Medical Sciences / Sichuan Provincial People's Hospital. PAD-UFES-20: Federal University of Espírito Santo (UFES), Brazil, through its Dermatological and Surgical Assistance Program (PAD). SCIN: A collaboration between Google Health and Stanford Medicine. TCGA (The Cancer Genome Atlas): A joint effort of National Cancer Institute and National Human Genome Research Institute. Data from TCGA are available via the Genomic Data Commons (GDC) CAMELYON: The data was collected from Radboud University Medical Center and University Medical Center Utrecht in the Netherlands. PMC-OA (PubMed Central Open Access Subset): Maintained by the National Library of Medicine (NLM) and National Center for Biotechnology Information (NCBI), which are part of the NIH. MedQA: This dataset was created by a team of researchers led by Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits Mendeley Digital Knee X-Ray: This dataset is from Rani Channamma University, and is hosted on Mendeley Data. 
AfriMed-QA: This data was developed and led by multiple collaborating organizations and researchers; key contributors include Intron Health, SisonkeBiotik, BioRAMP, Georgia Institute of Technology, and MasakhaneNLP. VQA-RAD: This dataset was created by a research team led by Jason J. Lau, Soumya Gayen, Asma Ben Abacha, and Dina Demner-Fushman and their affiliated institutions (the US National Library of Medicine and National Institutes of Health). Chest ImaGenome: IBM Research. MedExpQA: This dataset was created by researchers at the HiTZ Center (Basque Center for Language Technology and Artificial Intelligence). MedXpertQA: This dataset was developed by researchers at Tsinghua University (Beijing, China) and Shanghai Artificial Intelligence Laboratory (Shanghai, China). HealthSearchQA: This dataset consists of 3,173 commonly searched consumer questions. In addition to the public datasets listed above, MedGemma was also trained on de-identified, licensed datasets or datasets collected internally at Google from consented participants. Radiology dataset 1: De-identified dataset of different CT studies across body parts from a US-based radiology outpatient diagnostic center network. Ophthalmology dataset 1 (EyePACS): De-identified dataset of fundus images from diabetic retinopathy screening. Dermatology dataset 1: De-identified dataset of teledermatology skin condition images (both clinical and dermatoscopic) from Colombia. Dermatology dataset 2: De-identified dataset of skin cancer images (both clinical and dermatoscopic) from Australia. Dermatology dataset 3: De-identified dataset of non-diseased skin images from an internal data collection effort. Pathology dataset 1: De-identified dataset of histopathology H&E whole slide images created in collaboration with an academic research hospital and biobank in Europe. Comprises de-identified colon, prostate, and lymph nodes.
Pathology dataset 2: De-identified dataset of lung histopathology H&E and IHC whole slide images created by a commercial biobank in the United States. Pathology dataset 3: De-identified dataset of prostate and lymph node H&E and IHC histopathology whole slide images created by a contract research organization in the United States. Pathology dataset 4: De-identified dataset of histopathology whole slide images created in collaboration with a large, tertiary teaching hospital in the United States. Comprises a diverse set of tissue and stain types, predominantly H&E. EHR dataset 1: Question/answer dataset drawn from synthetic FHIR records created by Synthea. The test set includes 19 unique patients with 200 questions per patient divided into 10 different categories. MIMIC-CXR: Johnson, A., Pollard, T., Mark, R., Berkowitz, S., & Horng, S. (2024). MIMIC-CXR Database (version 2.1.0). PhysioNet. https://physionet.org/content/mimic-cxr/2.1.0/ and Johnson, Alistair E. W., Tom J. Pollard, Seth J. Berkowitz, Nathaniel R. Greenbaum, Matthew P. Lungren, Chih-Ying Deng, Roger G. Mark, and Steven Horng. 2019. "MIMIC-CXR, a de-Identified Publicly Available Database of Chest Radiographs with Free-Text Reports." Scientific Data 6 (1): 1–8. SLAKE: Liu, Bo, Li-Ming Zhan, Li Xu, Lin Ma, Yan Yang, and Xiao-Ming Wu. 2021. "SLAKE: A Semantically-Labeled Knowledge-Enhanced Dataset for Medical Visual Question Answering." http://arxiv.org/abs/2102.09542. PAD-UFES-20: Pacheco, Andre GC, et al. "PAD-UFES-20: A skin lesion dataset composed of patient data and clinical images collected from smartphones." Data in brief 32 (2020): 106221. SCIN: Ward, Abbi, Jimmy Li, Julie Wang, Sriram Lakshminarasimhan, Ashley Carrick, Bilson Campana, Jay Hartford, et al. 2024. "Creating an Empirical Dermatology Dataset Through Crowdsourcing With Web Search Advertisements." JAMA Network Open 7 (11): e2446615–e2446615.
TCGA: The results shown here are in whole or part based upon data generated by the TCGA Research Network: https://www.cancer.gov/tcga. CAMELYON16: Ehteshami Bejnordi, Babak, Mitko Veta, Paul Johannes van Diest, Bram van Ginneken, Nico Karssemeijer, Geert Litjens, Jeroen A. W. M. van der Laak, et al. 2017. "Diagnostic Assessment of Deep Learning Algorithms for Detection of Lymph Node Metastases in Women With Breast Cancer." JAMA 318 (22): 2199–2210. Mendeley Digital Knee X-Ray: Gornale, Shivanand; Patravali, Pooja (2020), "Digital Knee X-ray Images", Mendeley Data, V1, doi: 10.17632/t9ndx37v5h.1. VQA-RAD: Lau, Jason J., Soumya Gayen, Asma Ben Abacha, and Dina Demner-Fushman. 2018. "A Dataset of Clinically Generated Visual Questions and Answers about Radiology Images." Scientific Data 5 (1): 1–10. Chest ImaGenome: Wu, J., Agu, N., Lourentzou, I., Sharma, A., Paguio, J., Yao, J. S., Dee, E. C., Mitchell, W., Kashyap, S., Giovannini, A., Celi, L. A., Syeda-Mahmood, T., & Moradi, M. (2021). Chest ImaGenome Dataset (version 1.0.0). PhysioNet. RRID:SCR_007345. https://doi.org/10.13026/wv01-y230. MedQA: Jin, Di, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits. 2020. "What Disease Does This Patient Have? A Large-Scale Open Domain Question Answering Dataset from Medical Exams." http://arxiv.org/abs/2009.13081. AfriMed-QA: Olatunji, Tobi, Charles Nimo, Abraham Owodunni, Tassallah Abdullahi, Emmanuel Ayodele, Mardhiyah Sanni, Chinemelu Aka, et al. 2024. "AfriMed-QA: A Pan-African, Multi-Specialty, Medical Question-Answering Benchmark Dataset." http://arxiv.org/abs/2411.15640. MedExpQA: Alonso, I., Oronoz, M., & Agerri, R. (2024). MedExpQA: Multilingual Benchmarking of Large Language Models for Medical Question Answering. arXiv preprint arXiv:2404.05590. Retrieved from https://arxiv.org/abs/2404.05590. MedXpertQA: Zuo, Yuxin, Shang Qu, Yifei Li, Zhangren Chen, Xuekai Zhu, Ermo Hua, Kaiyan Zhang, Ning Ding, and Bowen Zhou. 2025.
"MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding." http://arxiv.org/abs/2501.18362. Google and its partners utilize datasets that have been rigorously anonymized or de-identified to ensure the protection of individual research participants and patient privacy. JAX allows researchers to take advantage of the latest generation of hardware, including TPUs, for faster and more efficient training of large models. MedGemma is an open multimodal generative AI model intended to be used as a starting point that enables more efficient development of downstream healthcare applications involving medical text and images. MedGemma is intended for developers in the life sciences and healthcare space. Developers are responsible for training, adapting and making meaningful changes to MedGemma to accomplish their specific intended use. MedGemma models can be fine-tuned by developers using their own proprietary data for their specific tasks or solutions. MedGemma is based on Gemma 3 and has been further trained on medical images and text. MedGemma enables further development in any medical context (image and textual), however the model was pre-trained using chest X-ray, pathology, dermatology, and fundus images. Examples of tasks within MedGemma's training include visual question answering pertaining to medical images, such as radiographs, or providing answers to textual medical questions. Full details of all the tasks MedGemma has been evaluated can be found in the MedGemma Technical Report. Provides strong baseline medical image and text comprehension for models of its size. This strong performance makes it efficient to adapt for downstream healthcare-based use cases, compared to models of similar size without medical data pre-training. This adaptation may involve prompt engineering, grounding, agentic orchestration or fine-tuning depending on the use case, baseline validation requirements, and desired performance characteristics. 
MedGemma is not intended to be used without appropriate validation, adaptation and/or meaningful modification by developers for their specific use case. The outputs generated by MedGemma are not intended to directly inform clinical diagnosis, patient management decisions, treatment recommendations, or any other direct clinical practice applications. Performance benchmarks highlight baseline capabilities on relevant benchmarks, but even for image and text domains that constitute a substantial portion of training data, inaccurate model output is possible. All outputs from MedGemma should be considered preliminary and require independent verification, clinical correlation, and further investigation through established research and development methodologies. MedGemma's multimodal capabilities have been primarily evaluated on single-image tasks. MedGemma has not been evaluated in use cases that involve comprehension of multiple images. MedGemma has not been evaluated or optimized for multi-turn applications. MedGemma's training may make it more sensitive to the specific prompt used than Gemma 3. When adapting MedGemma, developers should consider the following: Bias in validation data: As with any research, developers should ensure that any downstream application is validated to understand performance using data that is appropriately representative of the intended use setting for the specific application (e.g., age, sex, gender, condition, imaging device, etc.). Data contamination concerns: When evaluating the generalization capabilities of a large model like MedGemma in a medical context, there is a risk of data contamination, where the model might have inadvertently seen related medical information during its pre-training, potentially overestimating its true ability to generalize to novel medical concepts. Developers should validate MedGemma on datasets not publicly available or otherwise made available to non-institutional researchers to mitigate this risk.
May 20, 2025: Initial Release. July 9, 2025: Bug Fix: Fixed a subtle degradation in multimodal performance. The issue was due to a missing end-of-image token in the model vocabulary, impacting combined text-and-image tasks. This fix reinstates and correctly maps that token, ensuring text-only tasks remain unaffected while restoring multimodal performance.

NaNK
1,808
6

Mistral-Nemo-Instruct-2407

license:apache-2.0
1,795
11

codellama-7b-bnb-4bit

NaNK
llama
1,789
9

Qwen2.5-0.5B-unsloth-bnb-4bit

NaNK
license:apache-2.0
1,788
2

DeepSeek-R1-Distill-Llama-8B-bnb-4bit

NaNK
llama
1,773
3

cogito-v2-preview-llama-109B-MoE-GGUF

> [!NOTE] > Includes Unsloth chat template fixes! For `llama.cpp`, use `--jinja` > Unsloth Dynamic 2.0 achieves superior accuracy & outperforms other leading quants. The Cogito v2 LLMs are instruction tuned generative models. All models are released under an open license for commercial use. - Cogito v2 models are hybrid reasoning models. Each model can answer directly (standard LLM), or self-reflect before answering (like reasoning models). - The LLMs are trained using Iterated Distillation and Amplification (IDA), a scalable and efficient alignment strategy for superintelligence using iterative self-improvement. - The models have been optimized for coding, STEM, instruction following and general helpfulness, and have significantly higher multilingual, coding and tool calling capabilities than size equivalent counterparts. - In both standard and reasoning modes, Cogito v2-preview models outperform their size equivalent counterparts on common industry benchmarks. - This model is trained in over 30 languages and supports long contexts (up to 10M tokens). Evaluations For detailed evaluations, please refer to the Blog Post. Usage Here is a snippet below for usage with Transformers: Implementing extended thinking - By default, the model will answer in the standard mode. - To enable thinking, you can use either of two methods: - Set `enable_thinking=True` while applying the chat template. - Add a specific system prompt, along with prefilling the response with "<think>\n". NOTE: Unlike Cogito v1 models, we initiate the response with "<think>\n" at the beginning of every output when reasoning is enabled. This is because hybrid models can be brittle at times, and adding "<think>\n" ensures that the model does indeed respect thinking. Method 1 - Set enable_thinking=True in the tokenizer If you are using Huggingface tokenizers, then you can simply add the argument `enable_thinking=True` to the tokenization (this option is added to the chat template).
Method 2 - Add a specific system prompt, along with prefilling the response with "<think>\n". To enable thinking using this method, you need to do two things - Step 1 - Simply use this in the system prompt `system_instruction = 'Enable deep thinking subroutine.'` If you already have a system_instruction, then use `system_instruction = 'Enable deep thinking subroutine.' + '\n\n' + system_instruction`. Step 2 - Prefill the response with the tokens `"<think>\n"`. Similarly, if you have a system prompt, you can append the `DEEP_THINKING_INSTRUCTION` to the beginning in this way - Tool Calling Cogito models support tool calling (single, parallel, multiple and parallel_multiple) both in standard and extended thinking mode. You can then generate text from this input as normal. If the model generates a tool call, you should add it to the chat like so: and then call the tool and append the result, with the `tool` role, like so: After that, you can `generate()` again to let the model use the tool result in the chat: License This repository and the model weights are licensed under the Llama 4 Community License Agreement (Llama models' default license agreement). Contact If you would like to reach out to our team, send an email to [email protected].
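The two steps of Method 2 can be sketched in plain Python. This is a minimal illustration, not the Cogito authors' snippet: the helper name `build_thinking_messages` is hypothetical, and only the instruction string and the `"<think>\n"` prefill are taken from the description above.

```python
# Minimal sketch of Method 2 (system prompt + response prefill).
# `build_thinking_messages` is a hypothetical helper, not part of any library.

DEEP_THINKING_INSTRUCTION = "Enable deep thinking subroutine."
THINK_PREFILL = "<think>\n"  # prefilled at the start of the assistant response


def build_thinking_messages(user_prompt, system_instruction=None):
    """Return (messages, prefill) with deep thinking enabled."""
    if system_instruction:
        # Step 1 (existing system prompt): put the instruction in front of it.
        system = DEEP_THINKING_INSTRUCTION + "\n\n" + system_instruction
    else:
        # Step 1 (no system prompt): the instruction is the whole system prompt.
        system = DEEP_THINKING_INSTRUCTION
    messages = [
        {"role": "system", "content": system},
        {"role": "user", "content": user_prompt},
    ]
    # Step 2: the caller appends this prefill after the rendered chat template.
    return messages, THINK_PREFILL


messages, prefill = build_thinking_messages(
    "Give me a short introduction to LLMs.",
    system_instruction="Reply concisely.",
)
print(messages[0]["content"])
```

With a Hugging Face tokenizer, one would typically render `messages` with `tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)` and append `prefill` to the resulting string before tokenizing and generating.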

NaNK
base_model:deepcogito/cogito-v2-preview-llama-109B-MoE
1,750
9

Qwen2.5-VL-72B-Instruct-GGUF

NaNK
1,750
7

Llama-3_3-Nemotron-Super-49B-v1_5-GGUF

> [!NOTE] > Includes Unsloth chat template fixes! For `llama.cpp`, use `--jinja` > Unsloth Dynamic 2.0 achieves superior accuracy & outperforms other leading quants. Llama-3.3-Nemotron-Super-49B-v1.5 is a significantly upgraded version of Llama-3.3-Nemotron-Super-49B-v1 and is a large language model (LLM) which is a derivative of Meta Llama-3.3-70B-Instruct (AKA the reference model). It is a reasoning model that is post trained for reasoning, human chat preferences, and agentic tasks, such as RAG and tool calling. The model supports a context length of 128K tokens. Llama-3.3-Nemotron-Super-49B-v1.5 is a model which offers a great tradeoff between model accuracy and efficiency. Efficiency (throughput) directly translates to savings. Using a novel Neural Architecture Search (NAS) approach, we greatly reduce the model’s memory footprint, enabling larger workloads, as well as fitting the model on a single GPU at high workloads (H200). This NAS approach enables the selection of a desired point in the accuracy-efficiency tradeoff. For more information on the NAS approach, please refer to this paper The model underwent a multi-phase post-training process to enhance both its reasoning and non-reasoning capabilities. This includes a supervised fine-tuning stage for Math, Code, Science, and Tool Calling. Additionally, the model went through multiple stages of Reinforcement Learning (RL) including Reward-aware Preference Optimization (RPO) for chat, Reinforcement Learning with Verifiable Rewards (RLVR) for reasoning, and iterative Direct Preference Optimization (DPO) for Tool Calling capability enhancements. The final checkpoint was achieved after merging several RL and DPO checkpoints. This model is part of the Llama Nemotron Collection. You can find the other model(s) in this family here: - Llama-3.1-Nemotron-Nano-4B-v1.1 - Llama-3.1-Nemotron-Ultra-253B-v1 GOVERNING TERMS: Your use of this model is governed by the NVIDIA Open Model License. 
Additional Information: Llama 3.3 Community License Agreement. Built with Llama. Model Dates: Trained between November 2024 and July 2025 Data Freshness: The pretraining data has a cutoff of 2023 per Meta Llama 3.3 70B Use Case: Developers designing AI Agent systems, chatbots, RAG systems, and other AI-powered applications. Also suitable for typical instruction-following tasks. Release Date: - Hugging Face 7/25/2025 via Llama-33-Nemotron-Super-49B-v15 - build.nvidia.com 7/25/2025 Llama-33-Nemotron-Super-49B-v15 [\[2505.00949\] Llama-Nemotron: Efficient Reasoning Models](https://arxiv.org/abs/2505.00949) [\[2502.00203\] Reward-aware Preference Optimization: A Unified Mathematical Framework for Model Alignment](https://arxiv.org/abs/2502.00203) [\[2411.19146\]Puzzle: Distillation-Based NAS for Inference-Optimized LLMs](https://arxiv.org/abs/2411.19146) Architecture Type: Dense decoder-only Transformer model Network Architecture: Llama 3.3 70B Instruct, customized through Neural Architecture Search (NAS) The model is a derivative of Meta’s Llama-3.3-70B-Instruct, using Neural Architecture Search (NAS). The NAS algorithm results in non-standard and non-repetitive blocks. This includes the following: Skip attention: In some blocks, the attention is skipped entirely, or replaced with a single linear layer. Variable FFN: The expansion/compression ratio in the FFN layer is different between blocks. We utilize a block-wise distillation of the reference model, where for each block we create multiple variants providing different tradeoffs of quality vs. computational complexity, discussed in more depth below. We then search over the blocks to create a model which meets the required throughput and memory (optimized for a single H100-80GB GPU) while minimizing the quality degradation. The model then undergoes knowledge distillation (KD), with a focus on English single and multi-turn chat use-cases. 
The KD step included 40 billion tokens consisting of a mixture of 3 datasets - FineWeb, Buzz-V1.2 and Dolma. Llama-3.3-Nemotron-Super-49B-v1.5 is a general purpose reasoning and chat model intended to be used in English and coding languages. Other non-English languages (German, French, Italian, Portuguese, Hindi, Spanish, and Thai) are also supported. Input - Input Type: Text - Input Format: String - Input Parameters: One-Dimensional (1D) - Other Properties Related to Input: Context length up to 131,072 tokens Output - Output Type: Text - Output Format: String - Output Parameters: One-Dimensional (1D) - Other Properties Related to Output: Context length up to 131,072 tokens Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions. Software Integration - Runtime Engine: Transformers - Recommended Hardware Microarchitecture Compatibility: - NVIDIA Ampere - NVIDIA Hopper - Preferred Operating System(s): Linux 1. By default (empty system prompt) the model will respond in reasoning ON mode. Setting `/no_think` in the system prompt will enable reasoning OFF mode. 2. We recommend setting temperature to `0.6` and Top P to `0.95` for Reasoning ON mode. 3. We recommend using greedy decoding for Reasoning OFF mode. You can try this model out through the preview API, using this link: Llama-3_3-Nemotron-Super-49B-v1_5. Running a vLLM Server with Tool-call Support To enable tool calling usage with this model, we provide a tool parser in the repository. Here is an example on how to use it: After launching a vLLM server, you can call the server with tool-call support using a Python script like below. A large variety of training data was used for the knowledge distillation phase before the post-training pipeline, 3 of which included: FineWeb, Buzz-V1.2, and Dolma.
The data for the multi-stage post-training phases for improvements in Code, Math, and Reasoning is a compilation of SFT and RL data that supports improvements of math, code, general reasoning, and instruction following capabilities of the original Llama instruct model. Prompts have been sourced from either public and open corpora or synthetically generated. Responses were synthetically generated by a variety of models, with some prompts containing responses for both reasoning on and off modes, to train the model to distinguish between the two modes. We have released our Nemotron-Post-Training-Dataset-v1 to promote openness and transparency in model development and improvement. Data Collection for Training Datasets: Hybrid: Automated, Human, Synthetic. Data Labeling for Training Datasets: Hybrid: Automated, Human, Synthetic. We used the datasets listed below to evaluate Llama-3.3-Nemotron-Super-49B-v1.5. Data Collection for Evaluation Datasets: - Hybrid: Human, Synthetic. Data Labeling for Evaluation Datasets: - Hybrid: Human, Synthetic, Automatic. Evaluation Results We evaluate the model using temperature=`0.6`, top_p=`0.95`, and 64k sequence length. We run the benchmarks up to 16 times and average the scores to be more accurate.

| Reasoning Mode | Metric | Score |
|--------------|------------|------------|
| Reasoning On | pass@1 (avg. over 4 runs) | 97.4 |
| Reasoning On | pass@1 (avg. over 16 runs) | 87.5 |
| Reasoning On | pass@1 (avg. over 16 runs) | 82.71 |
| Reasoning On | pass@1 (avg. over 4 runs) | 71.97 |
| Reasoning On | pass@1 (avg. over 4 runs) | 73.58 |
| Reasoning On | pass@1 (avg. over 2 runs) | 71.75 |
| Reasoning On | Strict:Instruction | 88.61 |
| Reasoning On | pass@1 (avg. over 1 runs) | 92.0 |
| Reasoning On | pass@1 (avg. over 1 runs) | 7.64 |
| Reasoning On | pass@1 (avg. over 1 runs) | 79.53 |

All evaluations were done using the NeMo-Skills repository. Test Hardware: - 2x NVIDIA H100-80GB - 2x NVIDIA A100-80GB GPUs NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards. Please report security vulnerabilities or NVIDIA AI Concerns here.
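The usage recommendations in this card (empty system prompt plus sampling for reasoning ON, a `/no_think` system prompt plus greedy decoding for reasoning OFF) can be captured in a small helper. This is only a sketch: the function name `nemotron_request_settings` and the dictionary layout are assumptions made for illustration; the key names `do_sample`, `temperature`, and `top_p` simply mirror common generation arguments.

```python
# Hypothetical helper mapping the card's recommendations to request settings.
# Nothing here is an NVIDIA API; it only encodes the values quoted in the card.

def nemotron_request_settings(reasoning: bool) -> dict:
    """Return a system prompt and decoding settings for the chosen mode."""
    if reasoning:
        # Reasoning ON: empty system prompt, temperature 0.6, top_p 0.95.
        return {
            "system_prompt": "",
            "do_sample": True,
            "temperature": 0.6,
            "top_p": 0.95,
        }
    # Reasoning OFF: "/no_think" in the system prompt, greedy decoding.
    return {"system_prompt": "/no_think", "do_sample": False}


for mode in (True, False):
    print(mode, nemotron_request_settings(mode))
```

Keeping both modes behind one function makes it harder to pair the OFF-mode system prompt with sampling settings meant for the ON mode by accident.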

NaNK
unsloth - llama-3 - pytorch
1,734
9

Devstral-Small-2507-unsloth-bnb-4bit

NaNK
license:apache-2.0
1,720
4

SmolLM3-3B-GGUF

> [!NOTE] > Includes our chat template fixes! If you are using `llama.cpp`, use `--jinja` > Unsloth Dynamic 2.0 achieves superior accuracy & outperforms other leading quants. 1. Model Summary 2. How to use 3. Evaluation 4. Training 5. Limitations 6. License SmolLM3 is a 3B parameter language model designed to push the boundaries of small models. It supports 6 languages, advanced reasoning and long context. SmolLM3 is a fully open model that offers strong performance at the 3B–4B scale. The model is a decoder-only transformer using GQA and NoPE (with 3:1 ratio); it was pretrained on 11.2T tokens with a staged curriculum of web, code, math and reasoning data. Post-training included midtraining on 140B reasoning tokens followed by supervised fine-tuning and alignment via Anchored Preference Optimization (APO). Key features - Instruct model optimized for hybrid reasoning - Fully open model: open weights + full training details including public data mixture and training configs - Long context: Trained on 64k context and supports up to 128k tokens using YaRN extrapolation - Multilingual: 6 natively supported (English, French, Spanish, German, Italian, and Portuguese) For more details refer to our blog post: https://hf.co/blog/smollm3 The modeling code for SmolLM3 is available in transformers `v4.53.0`, so make sure to upgrade your transformers version. You can also load the model with the latest `vllm` which uses transformers as a backend. >[!TIP] > We recommend setting `temperature=0.6` and `top_p=0.95` in the sampling parameters. We enable extended thinking by default, so the example above generates the output with a reasoning trace. To choose whether to enable it, you can provide the `/think` and `/no_think` flags through the system prompt as shown in the snippet below for extended thinking disabled. The code for generating the response with extended thinking would be the same except that the system prompt should have `/think` instead of `/no_think`.
We also provide the option of specifying whether to use extended thinking through the `enable_thinking` kwarg as in the example below. You do not need to set the `/no_think` or `/think` flags through the system prompt if using the kwarg, but keep in mind that the flag in the system prompt overwrites the setting in the kwarg. SmolLM3 supports tool calling! Just pass your list of tools: - Under the argument `xml_tools` for standard tool-calling: these tools will be called as JSON blobs within XML tags, like `<tool_call>{"name": "get_weather", "arguments": {"city": "Copenhagen"}}</tool_call>` - Or under `python_tools`: then the model will call tools like python functions in a `<code>` snippet, like `<code>get_weather(city="Copenhagen")</code>` You can specify custom instructions through the system prompt while controlling whether to use extended thinking. For example, the snippet below shows how to make the model speak like a pirate while enabling extended thinking. For local inference, you can use `llama.cpp`, `ONNX`, `MLX` and `MLC`. You can find quantized checkpoints in this collection (https://huggingface.co/collections/HuggingFaceTB/smollm3-686d33c1fdffe8e635317e23) You can use vLLM and SGLang to deploy the model in an API compatible with OpenAI format. You can specify `chat_template_kwargs` such as `enable_thinking` and `xml_tools` to a deployed model by passing the `chat_template_kwargs` parameter in the API request. In this section, we report the evaluation results of SmolLM3 model. All evaluations are zero-shot unless stated otherwise, and we use lighteval to run them. We highlight the best score in bold and underline the second-best score. No Extended Thinking Evaluation results of non reasoning models and reasoning models in no thinking mode. We highlight the best and second-best scores in bold.
| Category | Metric | SmolLM3-3B | Qwen2.5-3B | Llama3.1-3B | Qwen3-1.7B | Qwen3-4B |
|---------|--------|------------|------------|-------------|------------|----------|
| High school math competition | AIME 2025 | 9.3 | 2.9 | 0.3 | 8.0 | 17.1 |
| Math problem-solving | GSM-Plus | 72.8 | 74.1 | 59.2 | 68.3 | 82.1 |
| Competitive programming | LiveCodeBench v4 | 15.2 | 10.5 | 3.4 | 15.0 | 24.9 |
| Graduate-level reasoning | GPQA Diamond | 35.7 | 32.2 | 29.4 | 31.8 | 44.4 |
| Instruction following | IFEval | 76.7 | 65.6 | 71.6 | 74.0 | 68.9 |
| Alignment | MixEval Hard | 26.9 | 27.6 | 24.9 | 24.3 | 31.6 |
| Tool Calling | BFCL | 92.3 | - | 92.3 | 89.5 | 95.0 |
| Multilingual Q&A | Global MMLU | 53.5 | 50.54 | 46.8 | 49.5 | 65.1 |

Extended Thinking Evaluation results in reasoning mode for SmolLM3 and Qwen3 models:

| Category | Metric | SmolLM3-3B | Qwen3-1.7B | Qwen3-4B |
|---------|--------|------------|------------|----------|
| High school math competition | AIME 2025 | 36.7 | 30.7 | 58.8 |
| Math problem-solving | GSM-Plus | 83.4 | 79.4 | 88.2 |
| Competitive programming | LiveCodeBench v4 | 30.0 | 34.4 | 52.9 |
| Graduate-level reasoning | GPQA Diamond | 41.7 | 39.9 | 55.3 |
| Instruction following | IFEval | 71.2 | 74.2 | 85.4 |
| Alignment | MixEval Hard | 30.8 | 33.9 | 38.0 |
| Tool Calling | BFCL | 88.8 | 88.8 | 95.5 |
| Multilingual Q&A | Global MMLU | 64.1 | 62.3 | 73.3 |

English benchmarks Note: All evaluations are zero-shot unless stated otherwise. For Ruler 64k evaluation, we apply YaRN to the Qwen models with 32k context to extrapolate the context length.
| Category | Metric | SmolLM3-3B | Qwen2.5-3B | Llama3-3.2B | Qwen3-1.7B-Base | Qwen3-4B-Base |
|---------|--------|---------------------|------------|--------------|------------------|---------------|
| Reasoning & Commonsense | HellaSwag | 76.15 | 74.19 | 75.52 | 60.52 | 74.37 |
| | ARC-CF (Average) | 65.61 | 59.81 | 58.58 | 55.88 | 62.11 |
| | Winogrande | 58.88 | 61.41 | 58.72 | 57.06 | 59.59 |
| | CommonsenseQA | 55.28 | 49.14 | 60.60 | 48.98 | 52.99 |
| Knowledge & Understanding | MMLU-CF (Average) | 44.13 | 42.93 | 41.32 | 39.11 | 47.65 |
| | MMLU Pro CF | 19.61 | 16.66 | 16.42 | 18.04 | 24.92 |
| | MMLU Pro MCF | 32.70 | 31.32 | 25.07 | 30.39 | 41.07 |
| | PIQA | 78.89 | 78.35 | 78.51 | 75.35 | 77.58 |
| | OpenBookQA | 40.60 | 40.20 | 42.00 | 36.40 | 42.40 |
| | BoolQ | 78.99 | 73.61 | 75.33 | 74.46 | 74.28 |
| Math & Code | | | | | | |
| Coding & math | HumanEval+ | 30.48 | 34.14 | 25.00 | 43.29 | 54.87 |
| | MBPP+ | 52.91 | 52.11 | 38.88 | 59.25 | 63.75 |
| | MATH (4-shot) | 46.10 | 40.10 | 7.44 | 41.64 | 51.20 |
| | GSM8k (5-shot) | 67.63 | 70.13 | 25.92 | 65.88 | 74.14 |
| Long context | | | | | | |
| | Ruler 32k | 76.35 | 75.93 | 77.58 | 70.63 | 83.98 |
| | Ruler 64k | 67.85 | 64.90 | 72.93 | 57.18 | 60.29 |
| | Ruler 128k | 61.03 | 62.23 | 71.30 | 43.03 | 47.23 |

| Category | Metric | SmolLM3 3B Base | Qwen2.5-3B | Llama3.2 3B | Qwen3 1.7B Base | Qwen3 4B Base |
|---------|--------|---------------------|------------|--------------|------------------|---------------|
| Main supported languages | | | | | | |
| French | MLMM Hellaswag | 63.94 | 57.47 | 57.66 | 51.26 | 61.00 |
| | Belebele | 51.00 | 51.55 | 49.22 | 49.44 | 55.00 |
| | Global MMLU (CF) | 38.37 | 34.22 | 33.71 | 34.94 | 41.80 |
| | Flores-200 (5-shot) | 62.85 | 61.38 | 62.89 | 58.68 | 65.76 |
| Spanish | MLMM Hellaswag | 65.85 | 58.25 | 59.39 | 52.40 | 61.85 |
| | Belebele | 47.00 | 48.88 | 47.00 | 47.56 | 50.33 |
| | Global MMLU (CF) | 38.51 | 35.84 | 35.60 | 34.79 | 41.22 |
| | Flores-200 (5-shot) | 48.25 | 50.00 | 44.45 | 46.93 | 50.16 |
| German | MLMM Hellaswag | 59.56 | 49.99 | 53.19 | 46.10 | 56.43 |
| | Belebele | 48.44 | 47.88 | 46.22 | 48.00 | 53.44 |
| | Global MMLU (CF) | 35.10 | 33.19 | 32.60 | 32.73 | 38.70 |
| | Flores-200 (5-shot) | 56.60 | 50.63 | 54.95 | 52.58 | 50.48 |
| Italian | MLMM Hellaswag | 62.49 | 53.21 | 54.96 | 48.72 | 58.76 |
| | Belebele | 46.44 | 44.77 | 43.88 | 44.00 | 48.78 | 44.88 |
| | Global MMLU (CF) | 36.99 | 33.91 | 32.79 | 35.37 | 39.26 |
| | Flores-200 (5-shot) | 52.65 | 54.87 | 48.83 | 48.37 | 49.11 |
| Portuguese | MLMM Hellaswag | 63.22 | 57.38 | 56.84 | 50.73 | 59.89 |
| | Belebele | 47.67 | 49.22 | 45.00 | 44.00 | 50.00 | 49.00 |
| | Global MMLU (CF) | 36.88 | 34.72 | 33.05 | 35.26 | 40.66 |
| | Flores-200 (5-shot) | 60.93 | 57.68 | 54.28 | 56.58 | 63.43 |

The model has also been trained on Arabic (standard), Chinese and Russian data, but has seen fewer tokens in these languages compared to the 6 above. We report the performance on these languages for information.

| Category | Metric | SmolLM3 3B Base | Qwen2.5-3B | Llama3.2 3B | Qwen3 1.7B Base | Qwen3 4B Base |
|---------|--------|---------------------|------------|--------------|------------------|---------------|
| Other supported languages | | | | | | |
| Arabic | Belebele | 40.22 | 44.22 | 45.33 | 42.33 | 51.78 |
| | Global MMLU (CF) | 28.57 | 28.81 | 27.67 | 29.37 | 31.85 |
| | Flores-200 (5-shot) | 40.22 | 39.44 | 44.43 | 35.82 | 39.76 |
| Chinese | Belebele | 43.78 | 44.56 | 49.56 | 48.78 | 53.22 |
| | Global MMLU (CF) | 36.16 | 33.79 | 39.57 | 38.56 | 44.55 |
| | Flores-200 (5-shot) | 29.17 | 33.21 | 31.89 | 25.70 | 32.50 |
| Russian | Belebele | 47.44 | 45.89 | 47.44 | 45.22 | 51.44 |
| | Global MMLU (CF) | 36.51 | 32.47 | 34.52 | 34.83 | 38.80 |
| | Flores-200 (5-shot) | 47.13 | 48.74 | 50.74 | 54.70 | 60.53 |

- Architecture: Transformer decoder
- Pretraining tokens: 11T
- Precision: bfloat16
- GPUs: 384 H100
- Training Framework: nanotron
- Data processing framework: datatrove
- Evaluation framework: lighteval
- Post-training Framework: TRL

Open resources Here is an infographic with all the training details - The datasets used for pretraining can be found in this collection and those used in mid-training and post-training will be uploaded later - The training and evaluation configs and code can be found in the huggingface/smollm repository. SmolLM3 can produce text on a variety of topics, but the generated content may not always be factually accurate, logically consistent, or free from biases present in the training data. These models should be used as assistive tools rather than definitive sources of information. Users should always verify important information and critically evaluate any generated content.
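For the standard tool-calling format described earlier in this card (JSON blobs inside XML tags), a downstream application has to pull the calls back out of the generated text. The following is a minimal sketch under the assumption that each call is wrapped in `<tool_call>...</tool_call>` tags; the function name `parse_tool_calls` and the sample output string are hypothetical and not part of the SmolLM3 release.

```python
import json
import re

# Assumed wrapper format: <tool_call>{...json...}</tool_call>
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)


def parse_tool_calls(text):
    """Extract every well-formed JSON tool call from generated text."""
    calls = []
    for blob in TOOL_CALL_RE.findall(text):
        try:
            calls.append(json.loads(blob))
        except json.JSONDecodeError:
            # Skip malformed blobs instead of failing the whole response.
            continue
    return calls


# Hypothetical model output used only to exercise the parser.
sample = (
    "Let me check. "
    '<tool_call>{"name": "get_weather", "arguments": {"city": "Copenhagen"}}</tool_call>'
)
for call in parse_tool_calls(sample):
    print(call["name"], call["arguments"])
```

Skipping malformed blobs is a design choice for robustness; an application that must not miss a call may prefer to raise an error instead.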

NaNK
license:apache-2.0
1,702
17

orpheus-3b-0.1-pretrained

NaNK
llama
1,702
2

Qwen3-VL-8B-Thinking

> [!NOTE]
> Includes Unsloth chat template fixes! For `llama.cpp`, use `--jinja`.
>
> Unsloth Dynamic 2.0 achieves superior accuracy & outperforms other leading quants.

Meet Qwen3-VL, the most powerful vision-language model in the Qwen series to date. This generation delivers comprehensive upgrades across the board: superior text understanding & generation, deeper visual perception & reasoning, extended context length, enhanced spatial and video dynamics comprehension, and stronger agent interaction capabilities. Available in Dense and MoE architectures that scale from edge to cloud, with Instruct and reasoning-enhanced Thinking editions for flexible, on-demand deployment.

- Visual Agent: Operates PC/mobile GUIs: recognizes elements, understands functions, invokes tools, completes tasks.
- Visual Coding Boost: Generates Draw.io/HTML/CSS/JS from images/videos.
- Advanced Spatial Perception: Judges object positions, viewpoints, and occlusions; provides stronger 2D grounding and enables 3D grounding for spatial reasoning and embodied AI.
- Long Context & Video Understanding: Native 256K context, expandable to 1M; handles books and hours-long video with full recall and second-level indexing.
- Enhanced Multimodal Reasoning: Excels in STEM/Math, with causal analysis and logical, evidence-based answers.
- Upgraded Visual Recognition: Broader, higher-quality pretraining lets the model "recognize everything": celebrities, anime, products, landmarks, flora/fauna, etc.
- Expanded OCR: Supports 32 languages (up from 19); robust in low light, blur, and tilt; better with rare/ancient characters and jargon; improved long-document structure parsing.
- Text Understanding on par with pure LLMs: Seamless text-vision fusion for lossless, unified comprehension.

1. Interleaved-MRoPE: Full-frequency allocation over time, width, and height via robust positional embeddings, enhancing long-horizon video reasoning.
2. DeepStack: Fuses multi-level ViT features to capture fine-grained details and sharpen image-text alignment.
3. Text-Timestamp Alignment: Moves beyond T-RoPE to precise, timestamp-grounded event localization for stronger video temporal modeling.

This is the weight repository for Qwen3-VL-8B-Thinking. Qwen3-VL can be used with 🤖 ModelScope and 🤗 Transformers. The code of Qwen3-VL has been merged into the latest Hugging Face `transformers`, and we advise you to build it from source.

If you find our work helpful, feel free to cite it.
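The `transformers` usage mentioned above revolves around a list of chat messages whose content mixes image and text parts. The following is a minimal sketch of that message structure only, assuming the `{"type": "image"}` / `{"type": "text"}` content-part layout used by Qwen vision-language chat templates; the helper name `build_vision_message` and the example URL are hypothetical, and no model is loaded here.

```python
from typing import Dict, List


def build_vision_message(image_ref: str, prompt: str) -> List[Dict]:
    """Build a single-turn multimodal chat message list.

    Hypothetical helper: `image_ref` may be a URL or a local path, and the
    content-part layout is an assumption about the chat template's input.
    """
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_ref},
                {"type": "text", "text": prompt},
            ],
        }
    ]


messages = build_vision_message(
    "https://example.com/demo.jpeg",  # placeholder URL, not a real asset
    "Describe this image.",
)
print(messages[0]["content"][1]["text"])
```

A list like `messages` would then typically be handed to the processor's chat-template method before generation; that step depends on the installed `transformers` version and is left out of the sketch.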

NaNK
license:apache-2.0
1,675
1

gemma-7b-bnb-4bit

NaNK
license:apache-2.0
1,637
18

Qwen3-32B-128K-GGUF

NaNK
license:apache-2.0
1,636
27

Qwen3-VL-4B-Thinking-bnb-4bit

Unsloth Dynamic 2.0 achieves superior accuracy & outperforms other leading quants.

Meet Qwen3-VL, the most powerful vision-language model in the Qwen series to date. This generation delivers comprehensive upgrades across the board: superior text understanding & generation, deeper visual perception & reasoning, extended context length, enhanced spatial and video dynamics comprehension, and stronger agent interaction capabilities. Available in Dense and MoE architectures that scale from edge to cloud, with Instruct and reasoning-enhanced Thinking editions for flexible, on-demand deployment.

- Visual Agent: Operates PC/mobile GUIs: recognizes elements, understands functions, invokes tools, completes tasks.
- Visual Coding Boost: Generates Draw.io/HTML/CSS/JS from images/videos.
- Advanced Spatial Perception: Judges object positions, viewpoints, and occlusions; provides stronger 2D grounding and enables 3D grounding for spatial reasoning and embodied AI.
- Long Context & Video Understanding: Native 256K context, expandable to 1M; handles books and hours-long video with full recall and second-level indexing.
- Enhanced Multimodal Reasoning: Excels in STEM/Math, with causal analysis and logical, evidence-based answers.
- Upgraded Visual Recognition: Broader, higher-quality pretraining lets the model "recognize everything": celebrities, anime, products, landmarks, flora/fauna, etc.
- Expanded OCR: Supports 32 languages (up from 19); robust in low light, blur, and tilt; better with rare/ancient characters and jargon; improved long-document structure parsing.
- Text Understanding on par with pure LLMs: Seamless text-vision fusion for lossless, unified comprehension.

1. Interleaved-MRoPE: Full-frequency allocation over time, width, and height via robust positional embeddings, enhancing long-horizon video reasoning.
2. DeepStack: Fuses multi-level ViT features to capture fine-grained details and sharpen image-text alignment.
3. Text-Timestamp Alignment: Moves beyond T-RoPE to precise, timestamp-grounded event localization for stronger video temporal modeling.

This is the weight repository for Qwen3-VL-4B-Thinking. Qwen3-VL can be used with 🤖 ModelScope and 🤗 Transformers. The code of Qwen3-VL has been merged into the latest Hugging Face `transformers`, and we advise you to build it from source.

If you find our work helpful, feel free to cite it.

NaNK
license:apache-2.0
1,619
1

Qwen3-4B-Thinking-2507-bnb-4bit

NaNK
license:apache-2.0
1,617
2

Qwen3-30B-A3B-Base

NaNK
license:apache-2.0
1,613
3

Qwen3-VL-2B-Instruct-bnb-4bit

Unsloth Dynamic 2.0 achieves superior accuracy & outperforms other leading quants.

Meet Qwen3-VL, the most powerful vision-language model in the Qwen series to date. This generation delivers comprehensive upgrades across the board: superior text understanding & generation, deeper visual perception & reasoning, extended context length, enhanced spatial and video dynamics comprehension, and stronger agent interaction capabilities. Available in Dense and MoE architectures that scale from edge to cloud, with Instruct and reasoning-enhanced Thinking editions for flexible, on-demand deployment.

- Visual Agent: Operates PC/mobile GUIs: recognizes elements, understands functions, invokes tools, completes tasks.
- Visual Coding Boost: Generates Draw.io/HTML/CSS/JS from images/videos.
- Advanced Spatial Perception: Judges object positions, viewpoints, and occlusions; provides stronger 2D grounding and enables 3D grounding for spatial reasoning and embodied AI.
- Long Context & Video Understanding: Native 256K context, expandable to 1M; handles books and hours-long video with full recall and second-level indexing.
- Enhanced Multimodal Reasoning: Excels in STEM/Math, with causal analysis and logical, evidence-based answers.
- Upgraded Visual Recognition: Broader, higher-quality pretraining lets the model "recognize everything": celebrities, anime, products, landmarks, flora/fauna, etc.
- Expanded OCR: Supports 32 languages (up from 19); robust in low light, blur, and tilt; better with rare/ancient characters and jargon; improved long-document structure parsing.
- Text Understanding on par with pure LLMs: Seamless text-vision fusion for lossless, unified comprehension.

1. Interleaved-MRoPE: Full-frequency allocation over time, width, and height via robust positional embeddings, enhancing long-horizon video reasoning.
2. DeepStack: Fuses multi-level ViT features to capture fine-grained details and sharpen image-text alignment.
3. Text-Timestamp Alignment: Moves beyond T-RoPE to precise, timestamp-grounded event localization for stronger video temporal modeling.

This is the weight repository for Qwen3-VL-2B-Instruct. Qwen3-VL can be used with 🤖 ModelScope and 🤗 Transformers. The code of Qwen3-VL has been merged into the latest Hugging Face `transformers`, and we advise you to build it from source.

If you find our work helpful, feel free to cite it.

NaNK
license:apache-2.0
1,601
0

Hunyuan-A13B-Instruct-GGUF

NaNK
1,596
41

Qwen3-235B-A22B-128K-GGUF

NaNK
license:apache-2.0
1,581
34

granite-4.0-micro

See our collection for all versions of Granite-4.0 including GGUF, 4-bit & 16-bit formats. Learn to run Granite 4.0 correctly: read our Guide. See Unsloth Dynamic 2.0 GGUFs for our quantization benchmarks.

- Fine-tune Granite-4.0 for free using our Google Colab notebook
- Read our Blog about Granite-4.0 support: https://docs.unsloth.ai/new/ibm-granite-4.0
- View the rest of our notebooks in our docs here.

Model Summary: Granite-4.0-Micro is a 3B parameter long-context instruct model finetuned from Granite-4.0-Micro-Base using a combination of open source instruction datasets with permissive licenses and internally collected synthetic datasets. This model is developed using a diverse set of techniques with a structured chat format, including supervised finetuning, model alignment using reinforcement learning, and model merging. Granite 4.0 instruct models feature improved instruction following (IF) and tool-calling capabilities, making them more effective in enterprise applications.

- Developers: Granite Team, IBM
- HF Collection: Granite 4.0 Language Models HF Collection
- GitHub Repository: ibm-granite/granite-4.0-language-models
- Website: Granite Docs
- Release Date: October 2nd, 2025
- License: Apache 2.0

Supported Languages: English, German, Spanish, French, Japanese, Portuguese, Arabic, Czech, Italian, Korean, Dutch, and Chinese. Users may finetune Granite 4.0 models for languages beyond these.

Intended use: The model is designed to follow general instructions and can serve as the foundation for AI assistants across diverse domains, including business applications, as well as for LLM agents equipped with tool-use capabilities.

Capabilities

- Summarization
- Text classification
- Text extraction
- Question-answering
- Retrieval Augmented Generation (RAG)
- Code related tasks
- Function-calling tasks
- Multilingual dialog use cases
- Fill-In-the-Middle (FIM) code completions

Generation: Granite-4.0-Micro can be used by copying the snippet from the section that is relevant for your use case.

Tool-calling: Granite-4.0-Micro comes with enhanced tool calling capabilities, enabling seamless integration with external functions and APIs. To define a list of tools, please follow OpenAI's function definition schema.

Benchmarks (columns: Metric, Micro Dense, H Micro Dense, H Tiny MoE, H Small MoE)

Multilingual Benchmarks and the included languages:

| Benchmark | # Langs | Languages |
|-----------|---------|-----------|
| MMMLU | 11 | ar, de, en, es, fr, ja, ko, pt, zh, bn, hi |
| INCLUDE | 14 | hi, bn, ta, te, ar, de, es, fr, it, ja, ko, nl, pt, zh |

The Granite-4.0-Micro baseline is built on a decoder-only dense transformer architecture. Core components of this architecture are: GQA, RoPE, MLP with SwiGLU, RMSNorm, and shared input/output embeddings.

| Model | Micro Dense | H Micro Dense | H Tiny MoE | H Small MoE |
|-------|-------------|---------------|------------|-------------|
| Number of layers | 40 attention | 4 attention / 36 Mamba2 | 4 attention / 36 Mamba2 | 4 attention / 36 Mamba2 |
| MLP / Shared expert hidden size | 8192 | 8192 | 1024 | 1536 |

Training Data: Overall, our SFT data is largely comprised of three key sources: (1) publicly available datasets with permissive licenses, (2) internal synthetic data targeting specific capabilities, and (3) a select set of human-curated data.

Infrastructure: We trained the Granite 4.0 Language Models utilizing an NVIDIA GB200 NVL72 cluster hosted in CoreWeave. Intra-rack communication occurs via the 72-GPU NVLink domain, and a non-blocking, full Fat-Tree NDR 400 Gb/s InfiniBand network provides inter-rack communication. This cluster provides a scalable and efficient infrastructure for training our models over thousands of GPUs.

Ethical Considerations and Limitations: Granite 4.0 Instruction Models are primarily finetuned using instruction-response pairs mostly in English, but also multilingual data covering multiple languages. Although this model can handle multilingual dialog use cases, its performance might not match its performance on English tasks. In such cases, introducing a small number of examples (few-shot) can help the model generate more accurate outputs. While this model has been aligned with safety in mind, it may in some cases produce inaccurate, biased, or unsafe responses to user prompts. We therefore urge the community to use this model with proper safety testing and tuning tailored to their specific tasks.

Resources

- ⭐️ Learn about the latest updates with Granite: https://www.ibm.com/granite
- 📄 Get started with tutorials, best practices, and prompt engineering advice: https://www.ibm.com/granite/docs/
- 💡 Learn about the latest Granite learning resources: https://ibm.biz/granite-learning-resources
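The tool-calling note above asks for tools to be described in OpenAI's function definition schema. As a rough illustration of what such a definition looks like, here is a minimal sketch; the helper `make_tool` and the `get_current_weather` tool are hypothetical examples, not part of the Granite model card.

```python
import json
from typing import Dict, List


def make_tool(name: str, description: str,
              properties: Dict[str, dict], required: List[str]) -> dict:
    """Wrap a function description in the OpenAI-style tool schema.

    Hypothetical helper for illustration; parameters are described
    with a JSON Schema object.
    """
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": {
                "type": "object",
                "properties": properties,
                "required": required,
            },
        },
    }


# Example tool list (the weather tool is made up for this sketch).
tools = [
    make_tool(
        name="get_current_weather",
        description="Get the current weather for a city.",
        properties={"city": {"type": "string", "description": "Name of the city"}},
        required=["city"],
    )
]

print(json.dumps(tools, indent=2))
```

A list like `tools` is what a chat template with tool support would usually receive alongside the conversation messages.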

NaNK
license:apache-2.0
1,542
2

Qwen3-VL-32B-Instruct-bnb-4bit

Unsloth Dynamic 2.0 achieves superior accuracy & outperforms other leading quants.

Meet Qwen3-VL, the most powerful vision-language model in the Qwen series to date. This generation delivers comprehensive upgrades across the board: superior text understanding & generation, deeper visual perception & reasoning, extended context length, enhanced spatial and video dynamics comprehension, and stronger agent interaction capabilities. Available in Dense and MoE architectures that scale from edge to cloud, with Instruct and reasoning-enhanced Thinking editions for flexible, on-demand deployment.

- Visual Agent: Operates PC/mobile GUIs: recognizes elements, understands functions, invokes tools, completes tasks.
- Visual Coding Boost: Generates Draw.io/HTML/CSS/JS from images/videos.
- Advanced Spatial Perception: Judges object positions, viewpoints, and occlusions; provides stronger 2D grounding and enables 3D grounding for spatial reasoning and embodied AI.
- Long Context & Video Understanding: Native 256K context, expandable to 1M; handles books and hours-long video with full recall and second-level indexing.
- Enhanced Multimodal Reasoning: Excels in STEM/Math, with causal analysis and logical, evidence-based answers.
- Upgraded Visual Recognition: Broader, higher-quality pretraining lets the model "recognize everything": celebrities, anime, products, landmarks, flora/fauna, etc.
- Expanded OCR: Supports 32 languages (up from 19); robust in low light, blur, and tilt; better with rare/ancient characters and jargon; improved long-document structure parsing.
- Text Understanding on par with pure LLMs: Seamless text-vision fusion for lossless, unified comprehension.

1. Interleaved-MRoPE: Full-frequency allocation over time, width, and height via robust positional embeddings, enhancing long-horizon video reasoning.
2. DeepStack: Fuses multi-level ViT features to capture fine-grained details and sharpen image-text alignment.
3. Text-Timestamp Alignment: Moves beyond T-RoPE to precise, timestamp-grounded event localization for stronger video temporal modeling.

This is the weight repository for Qwen3-VL-32B-Instruct. Qwen3-VL can be used with 🤖 ModelScope and 🤗 Transformers. The code of Qwen3-VL has been merged into the latest Hugging Face `transformers`, and we advise you to build it from source.

If you find our work helpful, feel free to cite it.

NaNK
license:apache-2.0
1,536
1

Qwen3-1.7B-Base-bnb-4bit

NaNK
license:apache-2.0
1,456
0

gemma-3n-E2B-it

NaNK
1,451
2

Llama-4-Maverick-17B-128E-Instruct-GGUF-UD

NaNK
llama4
1,405
3

Llama-4-Scout-17B-16E-Instruct

NaNK
llama4
1,394
54

Llama-OuteTTS-1.0-1B

NaNK
llama
1,382
4

Pixtral-12B-2409-unsloth-bnb-4bit

NaNK
license:apache-2.0
1,342
12

orpheus-3b-0.1-ft-GGUF

NaNK
license:apache-2.0
1,306
8

Qwen2.5-Coder-1.5B-Instruct-bnb-4bit

NaNK
license:apache-2.0
1,296
1

Qwen3-30B-A3B-Thinking-2507

See our collection for all versions of Qwen3 including GGUF, 4-bit & 16-bit formats. Learn to run Qwen3-2507 correctly: read our Guide. Unsloth Dynamic 2.0 achieves superior accuracy & outperforms other leading quants.

- Fine-tune Qwen3 (14B) for free using our Google Colab notebook here!
- Read our Blog about Qwen3 support: unsloth.ai/blog/qwen3
- View the rest of our notebooks in our docs here.
- Run & export your fine-tuned model to Ollama, llama.cpp or HF.

| Unsloth supports | Free Notebooks | Performance | Memory use |
|-----------------|----------------|-------------|----------|
| Qwen3 (14B) | ▶️ Start on Colab | 3x faster | 70% less |
| GRPO with Qwen3 (8B) | ▶️ Start on Colab | 3x faster | 80% less |
| Llama-3.2 (3B) | ▶️ Start on Colab | 2.4x faster | 58% less |
| Llama-3.2 (11B vision) | ▶️ Start on Colab | 2x faster | 60% less |
| Qwen2.5 (7B) | ▶️ Start on Colab | 2x faster | 60% less |

Over the past three months, we have continued to scale the thinking capability of Qwen3-30B-A3B, improving both the quality and depth of reasoning. We are pleased to introduce Qwen3-30B-A3B-Thinking-2507, featuring the following key enhancements:

- Significantly improved performance on reasoning tasks, including logical reasoning, mathematics, science, coding, and academic benchmarks that typically require human expertise.
- Markedly better general capabilities, such as instruction following, tool usage, text generation, and alignment with human preferences.
- Enhanced 256K long-context understanding capabilities.

NOTE: This version has an increased thinking length. We strongly recommend its use in highly complex reasoning tasks.

Qwen3-30B-A3B-Thinking-2507 has the following features:

- Type: Causal Language Models
- Training Stage: Pretraining & Post-training
- Number of Parameters: 30.5B in total and 3.3B activated
- Number of Parameters (Non-Embedding): 29.9B
- Number of Layers: 48
- Number of Attention Heads (GQA): 32 for Q and 4 for KV
- Number of Experts: 128
- Number of Activated Experts: 8
- Context Length: 262,144 natively.

NOTE: This model supports only thinking mode. Meanwhile, specifying `enable_thinking=True` is no longer required. Additionally, to enforce model thinking, the default chat template automatically includes `<think>`. Therefore, it is normal for the model's output to contain only `</think>` without an explicit opening `<think>` tag.

For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our blog, GitHub, and Documentation.

| | Gemini2.5-Flash-Thinking | Qwen3-235B-A22B Thinking | Qwen3-30B-A3B Thinking | Qwen3-30B-A3B-Thinking-2507 |
|--- | --- | --- | --- | --- |
| Knowledge | | | | |
| MMLU-Pro | 81.9 | 82.8 | 78.5 | 80.9 |
| MMLU-Redux | 92.1 | 92.7 | 89.5 | 91.4 |
| GPQA | 82.8 | 71.1 | 65.8 | 73.4 |
| SuperGPQA | 57.8 | 60.7 | 51.8 | 56.8 |
| Reasoning | | | | |
| AIME25 | 72.0 | 81.5 | 70.9 | 85.0 |
| HMMT25 | 64.2 | 62.5 | 49.8 | 71.4 |
| LiveBench 20241125 | 74.3 | 77.1 | 74.3 | 76.8 |
| Coding | | | | |
| LiveCodeBench v6 (25.02-25.05) | 61.2 | 55.7 | 57.4 | 66.0 |
| CFEval | 1995 | 2056 | 1940 | 2044 |
| OJBench | 23.5 | 25.6 | 20.7 | 25.1 |
| Alignment | | | | |
| IFEval | 89.8 | 83.4 | 86.5 | 88.9 |
| Arena-Hard v2$ | 56.7 | 61.5 | 36.3 | 56.0 |
| Creative Writing v3 | 85.0 | 84.6 | 79.1 | 84.4 |
| WritingBench | 83.9 | 80.3 | 77.0 | 85.0 |
| Agent | | | | |
| BFCL-v3 | 68.6 | 70.8 | 69.1 | 72.4 |
| TAU1-Retail | 65.2 | 54.8 | 61.7 | 67.8 |
| TAU1-Airline | 54.0 | 26.0 | 32.0 | 48.0 |
| TAU2-Retail | 66.7 | 40.4 | 34.2 | 58.8 |
| TAU2-Airline | 52.0 | 30.0 | 36.0 | 58.0 |
| TAU2-Telecom | 31.6 | 21.9 | 22.8 | 26.3 |
| Multilingualism | | | | |
| MultiIF | 74.4 | 71.9 | 72.2 | 76.4 |
| MMLU-ProX | 80.2 | 80.0 | 73.1 | 76.4 |
| INCLUDE | 83.9 | 78.7 | 71.9 | 74.4 |
| PolyMATH | 49.8 | 54.7 | 46.1 | 52.6 |

$ For reproducibility, we report the win rates evaluated by GPT-4.1.

& For highly challenging tasks (including PolyMATH and all reasoning and coding tasks), we use an output length of 81,920 tokens. For all other tasks, we set the output length to 32,768.

The code of Qwen3-MoE has been merged into the latest Hugging Face `transformers`, and we advise you to use the latest version of `transformers`.

```python
# …
try:
    # find the last occurrence of token id 151668 (</think>)
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content)  # no opening <think> tag
print("content:", content)
```

```shell
python -m sglang.launch_server --model-path Qwen/Qwen3-30B-A3B-Thinking-2507 --context-length 262144 --reasoning-parser deepseek-r1
```

```shell
vllm serve Qwen/Qwen3-30B-A3B-Thinking-2507 --max-model-len 262144 --enable-reasoning --reasoning-parser deepseek_r1
```

```python
from qwen_agent.agents import Assistant

# Define LLM
# Using Alibaba Cloud Model Studio
llm_cfg = {
    'model': 'qwen3-30b-a3b-thinking-2507',
    'model_type': 'qwen_dashscope',
}

# Using OpenAI-compatible API endpoint. It is recommended to disable the reasoning and the
# tool call parsing functionality of the deployment frameworks and let Qwen-Agent automate
# the related operations. For example,
# `VLLM_USE_MODELSCOPE=true vllm serve Qwen/Qwen3-30B-A3B-Thinking-2507 --served-model-name Qwen3-30B-A3B-Thinking-2507 --tensor-parallel-size 8 --max-model-len 262144`.
llm_cfg = {
    'model': 'Qwen3-30B-A3B-Thinking-2507',

    # Use a custom endpoint compatible with OpenAI API:
    'model_server': 'http://localhost:8000/v1',  # api_base without reasoning and tool call parsing
    'api_key': 'EMPTY',
    'generate_cfg': {
        'thought_in_content': True,
    },
}

# Define Tools
tools = [
    {'mcpServers': {  # You can specify the MCP configuration file
        'time': {
            'command': 'uvx',
            'args': ['mcp-server-time', '--local-timezone=Asia/Shanghai']
        },
        "fetch": {
            "command": "uvx",
            "args": ["mcp-server-fetch"]
        }
    }},
    'code_interpreter',  # Built-in tools
]

# Define Agent
bot = Assistant(llm=llm_cfg, function_list=tools)

# Streaming generation
messages = [{'role': 'user', 'content': 'https://qwenlm.github.io/blog/ Introduce the latest developments of Qwen'}]
for responses in bot.run(messages=messages):
    pass
print(responses)
```

```
@misc{qwen3technicalreport,
      title={Qwen3 Technical Report},
      author={Qwen Team},
      year={2025},
      eprint={2505.09388},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.09388},
}
```
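The parsing fragment in this card splits generated token ids at the last occurrence of id 151668. The sketch below isolates just that index arithmetic on plain Python lists so it can be checked without loading a model; the function name `split_thinking` is hypothetical, and the toy ids other than 151668 are arbitrary.

```python
from typing import List, Tuple

THINK_END_ID = 151668  # token id taken from the snippet in this card


def split_thinking(output_ids: List[int],
                   end_id: int = THINK_END_ID) -> Tuple[List[int], List[int]]:
    """Split ids into (thinking part incl. end marker, final answer part).

    Searches from the right so that only the last end marker counts.
    If the marker is absent, everything is treated as the final answer.
    """
    try:
        # Position just after the last occurrence of end_id.
        index = len(output_ids) - output_ids[::-1].index(end_id)
    except ValueError:
        index = 0
    return output_ids[:index], output_ids[index:]


thinking_ids, answer_ids = split_thinking([11, 22, 151668, 33, 44])
print(thinking_ids, answer_ids)
```

In real use, each half would be passed to `tokenizer.decode(..., skip_special_tokens=True)` as in the card's fragment.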

NaNK
license:apache-2.0
1,295
10

Qwen3-Coder-480B-A35B-Instruct-1M-GGUF

NaNK
license:apache-2.0
1,287
40

Qwen2.5-Coder-14B-Instruct-128K-GGUF

NaNK
license:apache-2.0
1,287
30

Qwen2-0.5B-bnb-4bit

NaNK
license:apache-2.0
1,265
3

QwQ-32B-GGUF

NaNK
license:apache-2.0
1,262
86

Qwen3-0.6B-Base-bnb-4bit

NaNK
license:apache-2.0
1,211
0

gemma-3-270m-it-bnb-4bit

NaNK
1,193
3

Pixtral-12B-2409-bnb-4bit

NaNK
license:apache-2.0
1,190
4

Mistral-Nemo-Base-2407-bnb-4bit

NaNK
license:apache-2.0
1,178
15

Falcon-H1-7B-Instruct-GGUF

NaNK
1,175
4

Jan-nano-GGUF

license:apache-2.0
1,174
9

DeepSeek-V3-0324-GGUF-UD

license:mit
1,168
19

gemma-3-1b-pt-unsloth-bnb-4bit

NaNK
1,119
4

Qwen3-16B-A3B-GGUF

NaNK
license:apache-2.0
1,117
11

Qwen3-14B-Base

NaNK
license:apache-2.0
1,117
1

gemma-1.1-2b-it-bnb-4bit

NaNK
license:apache-2.0
1,099
5

OpenReasoning-Nemotron-32B-GGUF

NaNK
license:cc-by-4.0
1,075
11

DeepSeek-V3.1-BF16

license:mit
1,071
1

Mistral-Nemo-Base-2407

license:apache-2.0
1,051
5

dots.llm1.inst-GGUF

license:mit
1,022
19

Seed-Coder-8B-Instruct-GGUF

NaNK
llama
1,016
5

Qwen2.5-Math-1.5B-Instruct-bnb-4bit

NaNK
license:apache-2.0
1,003
2

LFM2-1.2B-GGUF

> [!NOTE]
> Includes our chat template fixes! For `llama.cpp`, use `--jinja`.
>
> Unsloth Dynamic 2.0 achieves superior accuracy & outperforms other leading quants.

LFM2 is a new generation of hybrid models developed by Liquid AI, specifically designed for edge AI and on-device deployment. It sets a new standard in terms of quality, speed, and memory efficiency. We're releasing the weights of three post-trained checkpoints with 350M, 700M, and 1.2B parameters. They provide the following key features to create AI-powered edge applications:

- Fast training & inference: LFM2 achieves 3x faster training compared to its previous generation. It also benefits from 2x faster decode and prefill speed on CPU compared to Qwen3.
- Best performance: LFM2 outperforms similarly-sized models across multiple benchmark categories, including knowledge, mathematics, instruction following, and multilingual capabilities.
- New architecture: LFM2 is a new hybrid Liquid model with multiplicative gates and short convolutions.
- Flexible deployment: LFM2 runs efficiently on CPU, GPU, and NPU hardware for flexible deployment on smartphones, laptops, or vehicles.

Due to their small size, we recommend fine-tuning LFM2 models on narrow use cases to maximize performance. They are particularly suited for agentic tasks, data extraction, RAG, creative writing, and multi-turn conversations. However, we do not recommend using them for tasks that are knowledge-intensive or require programming skills.

| Property | Value |
| ------------------- | ----------------------------- |
| Parameters | 1,170,340,608 |
| Layers | 16 (10 conv + 6 attn) |
| Context length | 32,768 tokens |
| Vocabulary size | 65,536 |
| Precision | bfloat16 |
| Training budget | 10 trillion tokens |
| License | LFM Open License v1.0 |

Supported languages: English, Arabic, Chinese, French, German, Japanese, Korean, and Spanish.

Generation parameters: We recommend the following parameters:

- `temperature=0.3`
- `min_p=0.15`
- `repetition_penalty=1.05`

Chat template: LFM2 uses a ChatML-like chat template. You can apply it using the dedicated `.apply_chat_template()` function from Hugging Face transformers.

Tool use: It consists of four main steps:

1. Function definition: LFM2 takes JSON function definitions as input (JSON objects between dedicated special tokens), usually in the system prompt.
2. Function call: LFM2 writes Pythonic function calls (a Python list between dedicated special tokens) as the assistant answer.
3. Function execution: The function call is executed and the result is returned (a string between dedicated special tokens) as a "tool" role.
4. Final answer: LFM2 interprets the outcome of the function call to address the original user prompt in plain text.

Architecture: Hybrid model with multiplicative gates and short convolutions: 10 double-gated short-range LIV convolution blocks and 6 grouped query attention (GQA) blocks.

Pre-training mixture: Approximately 75% English, 20% multilingual, and 5% code data sourced from the web and licensed materials.

Training approach:

- Knowledge distillation using LFM1-7B as teacher model
- Very large-scale SFT on 50% downstream tasks, 50% general domains
- Custom DPO with length normalization and semi-online datasets
- Iterative model merging

To run LFM2, you need to install Hugging Face `transformers` from source (v4.54.0.dev0). You can update or install it with the following command: `pip install "transformers @ git+https://github.com/huggingface/transformers.git@main"`.

You can directly run and test the model with this Colab notebook.

We recommend fine-tuning LFM2 models on your use cases to maximize performance.

| Notebook | Description | Link |
|-------|------|------|
| SFT + LoRA | Supervised Fine-Tuning (SFT) notebook with a LoRA adapter in TRL. | |
| DPO | Preference alignment with Direct Preference Optimization (DPO) in TRL. | |

LFM2 outperforms similar-sized models across different evaluation categories.

| Model | MMLU | GPQA | IFEval | IFBench | GSM8K | MGSM | MMMLU |
|-------|------|------|--------|---------|-------|------|-------|
| LFM2-350M | 43.43 | 27.46 | 65.12 | 16.41 | 30.1 | 29.52 | 37.99 |
| LFM2-700M | 49.9 | 28.48 | 72.23 | 20.56 | 46.4 | 45.36 | 43.28 |
| LFM2-1.2B | 55.23 | 31.47 | 74.89 | 20.7 | 58.3 | 55.04 | 46.73 |
| Qwen3-0.6B | 44.93 | 22.14 | 64.24 | 19.75 | 36.47 | 41.28 | 30.84 |
| Qwen3-1.7B | 59.11 | 27.72 | 73.98 | 21.27 | 51.4 | 66.56 | 46.51 |
| Llama-3.2-1B-Instruct | 46.6 | 28.84 | 52.39 | 16.86 | 35.71 | 29.12 | 38.15 |
| gemma-3-1b-it | 40.08 | 21.07 | 62.9 | 17.72 | 59.59 | 43.6 | 34.43 |

If you are interested in custom solutions with edge deployment, please contact our sales team.
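The tool-use steps in this card say the model emits Pythonic function calls as a Python list wrapped in special tokens. The sketch below shows one way such an answer could be parsed safely with the standard `ast` module instead of `eval`. The marker strings `<|tool_call_start|>` and `<|tool_call_end|>`, the function `parse_tool_calls`, and the sample call are assumptions for illustration; check the tokenizer's special tokens before relying on them.

```python
import ast
from typing import Any, Dict, List, Tuple

# Assumed marker strings; they are not spelled out in this listing.
CALL_START = "<|tool_call_start|>"
CALL_END = "<|tool_call_end|>"


def parse_tool_calls(text: str, start: str = CALL_START,
                     end: str = CALL_END) -> List[Tuple[str, Dict[str, Any]]]:
    """Extract (function_name, keyword_arguments) pairs from a model answer."""
    begin = text.find(start)
    if begin == -1:
        return []
    finish = text.find(end, begin + len(start))
    if finish == -1:
        return []
    payload = text[begin + len(start):finish].strip()

    # Parse as an expression; nothing is executed.
    tree = ast.parse(payload, mode="eval")
    if not isinstance(tree.body, ast.List):
        raise ValueError("expected a Python list of function calls")

    calls = []
    for node in tree.body.elts:
        if not isinstance(node, ast.Call) or not isinstance(node.func, ast.Name):
            raise ValueError("expected plain function calls")
        # Only literal keyword arguments are accepted in this sketch.
        kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
        calls.append((node.func.id, kwargs))
    return calls


answer = (
    '<|tool_call_start|>[get_candidate_status(candidate_id="12345")]'
    "<|tool_call_end|>Checking the current status."
)
print(parse_tool_calls(answer))
```

Parsing with `ast` keeps the model output as data, so a malformed or malicious call raises an error rather than running code.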

NaNK
999
19

Falcon-H1-3B-Instruct-GGUF

NaNK
996
2

zephyr-sft-bnb-4bit

NaNK
license:apache-2.0
991
5

Qwen2.5-7B-bnb-4bit

NaNK
license:apache-2.0
986
6

gemma-3-12b-it-qat-int4-GGUF

NaNK
977
7

llava-v1.6-mistral-7b-hf-bnb-4bit

NaNK
license:apache-2.0
972
8

gemma-3-27b-it-qat-unsloth-bnb-4bit

NaNK
963
3

Qwen2-VL-7B-Instruct

NaNK
license:apache-2.0
954
6

DeepSeek-R1-Distill-Qwen-7B

NaNK
license:apache-2.0
947
15

DeepSeek-R1-Distill-Qwen-1.5B

NaNK
license:mit
944
15

Qwen2.5-14B

NaNK
license:apache-2.0
936
1

GLM-Z1-9B-0414-GGUF

NaNK
license:mit
932
9

Qwen3-4B-128K-GGUF

NaNK
license:apache-2.0
930
25

Qwen3-8B-FP8

NaNK
license:apache-2.0
929
0

ERNIE-4.5-0.3B-PT-GGUF

NaNK
license:apache-2.0
928
13

GLM-4.6

NaNK
license:mit
917
6

LFM2-350M

900
3

Mistral-Small-3.1-24B-Base-2503-unsloth-bnb-4bit

NaNK
license:apache-2.0
897
1

llama-3-70b-bnb-4bit

NaNK
llama
894
45

Llama 2 7b Chat

Finetune Mistral, Gemma, Llama 2-5x faster with 70% less memory via Unsloth!

All notebooks are beginner friendly! Add your dataset, click "Run All", and you'll get a 2x faster finetuned model which can be exported to GGUF, vLLM or uploaded to Hugging Face.

| Unsloth supports | Free Notebooks | Performance | Memory use |
|-----------------|----------------|-------------|----------|
| Gemma 7b | ▶️ Start on Colab | 2.4x faster | 58% less |
| Mistral 7b | ▶️ Start on Colab | 2.2x faster | 62% less |
| Llama-2 7b | ▶️ Start on Colab | 2.2x faster | 43% less |
| TinyLlama | ▶️ Start on Colab | 3.9x faster | 74% less |
| CodeLlama 34b A100 | ▶️ Start on Colab | 1.9x faster | 27% less |
| Mistral 7b 1xT4 | ▶️ Start on Kaggle | 5x faster\* | 62% less |
| DPO - Zephyr | ▶️ Start on Colab | 1.9x faster | 19% less |

- This conversational notebook is useful for ShareGPT ChatML / Vicuna templates.
- This text completion notebook is for raw text. This DPO notebook replicates Zephyr.
- \* Kaggle has 2x T4s, but we use 1. Due to overhead, 1x T4 is 5x faster.

NaNK
llama
894
6

Devstral-Small-2505

NaNK
license:apache-2.0
893
21

gemma-2-9b

NaNK
885
13

gemma-3-27b-it-qat-bnb-4bit

NaNK
884
2

Qwen2.5-Coder-3B-Instruct

NaNK
license:apache-2.0
881
6

Phi-4-mini-reasoning-unsloth-bnb-4bit

NaNK
license:mit
873
8

gemma-2-27b-it-bnb-4bit

NaNK
869
11

Qwen2.5-Coder-32B-Instruct-128K-GGUF

NaNK
license:apache-2.0
857
71

Hermes-4-405B-GGUF

> [!NOTE] > Includes Unsloth chat template fixes! For `llama.cpp`, use `--jinja` > Unsloth Dynamic 2.0 achieves superior accuracy & outperforms other leading quants. Hermes 4 405B is a frontier, hybrid-mode reasoning model based on Llama-3.1-405B by Nous Research that is aligned to you. Read the Hermes 4 technical report here: Hermes 4 Technical Report Chat with Hermes in Nous Chat: https://chat.nousresearch.com Training highlights include a newly synthesized post-training corpus emphasizing verified reasoning traces, massive improvements in math, code, STEM, logic, creativity, and format-faithful outputs, while preserving general assistant quality and broadly neutral alignment. - Post-training corpus: Massively increased dataset size from 1M samples and 1.2B tokens to ~5M samples / ~60B tokens blended across reasoning and non-reasoning data. - Hybrid reasoning mode with explicit ` … ` segments when the model decides to deliberate, and options to make your responses faster when you want. - Reasoning that is top quality, expressive, improves math, code, STEM, logic, and even creative writing and subjective responses. - Schema adherence & structured outputs: trained to produce valid JSON for given schemas and to repair malformed objects. - Much easier to steer and align: extreme improvements on steerability, especially on reduced refusal rates. In pursuit of the mission of producing models that are open, steerable and capable of producing the full range of human expression, while being able to be aligned to your values, we created a new benchmark, RefusalBench, that tests the models willingness to be helpful in a variety of scenarios commonly disallowed by closed and open models. Hermes 4 achieves SOTA on RefusalBench across all popular closed and open models in being helpful and conforming to your values, without censorship. > Full tables, settings, and comparisons are in the technical report. Hermes 4 uses Llama-3-Chat format with role headers and special tags. 
Reasoning mode can be activated with the chat template via the flag `thinking=True` or with the reasoning system prompt given in the original model card.

Note that you can add any additional system instructions before or after this system message, and it will adjust the model's policies, style, and effort of thinking, as well as its post-thinking style, format, identity, and more. You may also interleave the tool definition system message with the reasoning one.

Additionally, we provide a flag to keep the content in between the `<think> ... </think>` tags, which you can play with by setting `keep_cots=True`.

Hermes 4 supports function/tool calls within a single assistant turn, interleaved with its reasoning.

Note that you may also simply place tool definitions into the "tools:" field of your messages, and the chat template will parse and create the system prompt for you. This also works with reasoning mode for improved accuracy of tool use.

The model will then generate tool calls within `<tool_call> {tool_call} </tool_call>` tags, for easy parsing. The tool_call tags are also added tokens, which makes them easy to parse while streaming! There are also automatic tool parsers built into vLLM and SGLang for Hermes: set the tool parser in vLLM to `hermes` and in SGLang to `qwen25`.

- Sampling defaults that work well: `temperature=0.6, top_p=0.95, top_k=20`.
- Template: Use the Llama chat format for Hermes 4 70B and 405B, or set `add_generation_prompt=True` when using `tokenizer.apply_chat_template(...)`.

For production serving on multi-GPU nodes, consider tensor parallel inference engines (e.g., SGLang/vLLM backends) with prefix caching.

Hermes 4 is available as BF16 original weights as well as FP8 variants and GGUF variants by LM Studio.

FP8: https://huggingface.co/NousResearch/Hermes-4-405B-FP8

GGUF (Courtesy of LM Studio team!): https://huggingface.co/lmstudio-community/Hermes-4-405B-GGUF

Hermes 4 is also available in smaller sizes (e.g., 70B and 14B) with similar prompt formats.

See the Hermes 4 collection to explore them all: https://huggingface.co/collections/NousResearch/hermes-4-collection-68a731bfd452e20816725728
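
The tool-call output described in this card can be picked out of an assistant turn with a few lines of Python. The following is a minimal sketch, not the parser shipped with vLLM or SGLang. It assumes the tags appear literally as `<tool_call>` and `</tool_call>` and that each payload is a JSON object; the `get_weather` tool and the helper name `extract_tool_calls` are hypothetical.

```python
import json
import re

# Assumption: tool calls are wrapped in literal <tool_call> ... </tool_call> tags.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def extract_tool_calls(text: str) -> list[dict]:
    """Return every JSON tool call found between tool-call tags."""
    calls = []
    for payload in TOOL_CALL_RE.findall(text):
        try:
            calls.append(json.loads(payload))
        except json.JSONDecodeError:
            # Skip malformed payloads instead of failing the whole turn.
            continue
    return calls


# Hypothetical assistant output containing one tool call.
sample = (
    "Let me check the weather.\n"
    '<tool_call>\n{"name": "get_weather", "arguments": {"city": "Paris"}}\n</tool_call>'
)
calls = extract_tool_calls(sample)
print(calls)
```

Skipping malformed payloads is a design choice for this sketch; a production parser would more likely surface the error so the caller can retry.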

NaNK
llama
853
3

gemma-7b-it-bnb-4bit

NaNK
license:apache-2.0
847
15

Qwen2.5-1.5B

NaNK
license:apache-2.0
847
8

Llama 3.1 Nemotron Nano 8B V1 GGUF

Unsloth Dynamic 2.0 achieves superior accuracy & outperforms other leading quants.

Llama-3.1-Nemotron-Nano-8B-v1 is a large language model (LLM) which is a derivative of Meta Llama-3.1-8B-Instruct (AKA the reference model). It is a reasoning model that is post-trained for reasoning, human chat preferences, and tasks such as RAG and tool calling.

Llama-3.1-Nemotron-Nano-8B-v1 offers a strong tradeoff between model accuracy and efficiency. It is created from Llama 3.1 8B Instruct and offers improvements in model accuracy. The model fits on a single RTX GPU and can be used locally. The model supports a context length of 128K.

This model underwent a multi-phase post-training process to enhance both its reasoning and non-reasoning capabilities. This includes a supervised fine-tuning stage for Math, Code, Reasoning, and Tool Calling as well as multiple reinforcement learning (RL) stages using REINFORCE (RLOO) and Online Reward-aware Preference Optimization (RPO) algorithms for both chat and instruction-following. The final model checkpoint is obtained after merging the final SFT and Online RPO checkpoints. Improved using Qwen.

This model is part of the Llama Nemotron Collection. You can find the other model(s) in this family here: Llama-3.3-Nemotron-Super-49B-v1

GOVERNING TERMS: Your use of this model is governed by the NVIDIA Open Model License. Additional Information: Llama 3.1 Community License Agreement. Built with Llama.

Model Dates: Trained between August 2024 and March 2025

Data Freshness: The pretraining data has a cutoff of 2023 per Meta Llama 3.1 8B

Use case: Developers designing AI Agent systems, chatbots, RAG systems, and other AI-powered applications. Also suitable for typical instruction-following tasks. Balance of model accuracy and compute efficiency (the model fits on a single RTX GPU and can be used locally).
- [\[2505.00949\] Llama-Nemotron: Efficient Reasoning Models](https://arxiv.org/abs/2505.00949)
- [\[2502.00203\] Reward-aware Preference Optimization: A Unified Mathematical Framework for Model Alignment](https://arxiv.org/abs/2502.00203)

Architecture Type: Dense decoder-only Transformer model

Llama-3.1-Nemotron-Nano-8B-v1 is a general purpose reasoning and chat model intended to be used in English and coding languages. Other non-English languages (German, French, Italian, Portuguese, Hindi, Spanish, and Thai) are also supported.

Input:
- Input Type: Text
- Input Format: String
- Input Parameters: One-Dimensional (1D)
- Other Properties Related to Input: Context length up to 131,072 tokens

Output:
- Output Type: Text
- Output Format: String
- Output Parameters: One-Dimensional (1D)
- Other Properties Related to Output: Context length up to 131,072 tokens

Software Integration
- Runtime Engine: NeMo 24.12
- Recommended Hardware Microarchitecture Compatibility:
  - NVIDIA Hopper
  - NVIDIA Ampere

1. Reasoning mode (ON/OFF) is controlled via the system prompt, which must be set as in the original model card's example. All instructions should be contained within the user prompt.
2. We recommend setting temperature to `0.6` and Top P to `0.95` for Reasoning ON mode.
3. We recommend using greedy decoding for Reasoning OFF mode.
4. We have provided a list of prompts to use for evaluation for each benchmark where a specific template is required.
5. The model will include `<think></think>` if no reasoning was necessary in Reasoning ON mode; this is expected behaviour.

You can try this model out through the preview API, using this link: Llama-3.1-Nemotron-Nano-8B-v1.

For usage with the Hugging Face Transformers library, reasoning mode (ON/OFF) is controlled via the system prompt. Our code requires the transformers package version to be `4.44.2` or higher.

For some prompts, even though thinking is disabled, the model emergently prefers to think before responding. If desired, users can prevent this by pre-filling the assistant response.

- BF16:
  - 1x RTX 50 Series GPUs
  - 1x RTX 40 Series GPUs
  - 1x RTX 30 Series GPUs
  - 1x H100-80GB GPU
  - 1x A100-80GB GPU

A large variety of training data was used for the post-training pipeline, including manually annotated data and synthetic data.

The data for the multi-stage post-training phases for improvements in Code, Math, and Reasoning is a compilation of SFT and RL data that supports improvements of math, code, general reasoning, and instruction following capabilities of the original Llama instruct model.

Prompts were either sourced from public and open corpora or synthetically generated. Responses were synthetically generated by a variety of models, with some prompts containing responses for both Reasoning On and Off modes, to train the model to distinguish between the two modes.

Data Collection for Training Datasets: Hybrid: Automated, Human, Synthetic

We used the datasets listed below to evaluate Llama-3.1-Nemotron-Nano-8B-v1.

Data Collection for Evaluation Datasets: Hybrid: Human/Synthetic

Data Labeling for Evaluation Datasets: Hybrid: Human/Synthetic/Automatic

These results contain both “Reasoning On” and “Reasoning Off”. We recommend using temperature=`0.6`, top_p=`0.95` for “Reasoning On” mode, and greedy decoding for “Reasoning Off” mode. All evaluations are done with 32k sequence length. We run the benchmarks up to 16 times and average the scores to be more accurate.

> NOTE: Where applicable, a Prompt Template will be provided. While completing benchmarks, please ensure that you are parsing for the correct output format as per the provided prompt in order to reproduce the benchmarks seen below.
| Reasoning Mode | Score |
|--------------|------------|
| Reasoning Off | 7.9 |
| Reasoning On | 8.1 |

| Reasoning Mode | pass@1 |
|--------------|------------|
| Reasoning Off | 36.6% |
| Reasoning On | 95.4% |

| Reasoning Mode | pass@1 |
|--------------|------------|
| Reasoning Off | 0% |
| Reasoning On | 47.1% |

| Reasoning Mode | pass@1 |
|--------------|------------|
| Reasoning Off | 39.4% |
| Reasoning On | 54.1% |

| Reasoning Mode | Strict:Prompt | Strict:Instruction |
|--------------|------------|------------|
| Reasoning Off | 74.7% | 82.1% |
| Reasoning On | 71.9% | 79.3% |

| Reasoning Mode | Score |
|--------------|------------|
| Reasoning Off | 63.9% |
| Reasoning On | 63.6% |

| Reasoning Mode | pass@1 |
|--------------|------------|
| Reasoning Off | 66.1% |
| Reasoning On | 84.6% |

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards.

Please report security vulnerabilities or NVIDIA AI Concerns here.
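
The decoding recommendations in this card (temperature `0.6` and top-p `0.95` with reasoning on, greedy decoding with reasoning off) can be captured in a small helper. This is a sketch under stated assumptions: the helper name `generation_kwargs` and the `max_new_tokens` default are hypothetical, and the returned keys are the standard sampling arguments accepted by Hugging Face `generate`. The system prompt that actually toggles reasoning is not reproduced here.

```python
def generation_kwargs(reasoning_on: bool, max_new_tokens: int = 1024) -> dict:
    """Return decoding settings matching the card's recommendations.

    max_new_tokens is an illustrative default, not a value from the card.
    """
    if reasoning_on:
        # Reasoning ON: sample with temperature 0.6 and top-p 0.95.
        return {
            "do_sample": True,
            "temperature": 0.6,
            "top_p": 0.95,
            "max_new_tokens": max_new_tokens,
        }
    # Reasoning OFF: greedy decoding, so sampling is disabled entirely.
    return {"do_sample": False, "max_new_tokens": max_new_tokens}


on_kwargs = generation_kwargs(reasoning_on=True)
off_kwargs = generation_kwargs(reasoning_on=False)
print(on_kwargs)
print(off_kwargs)
```

The dictionaries would typically be unpacked into a call such as `model.generate(**inputs, **on_kwargs)` once a model and tokenized inputs are available.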

NaNK
llama
846
8

Mistral Nemo Instruct 2407 GGUF

license:apache-2.0
838
10

llama-2-13b-bnb-4bit

NaNK
llama
834
6

Qwen2-7B-bnb-4bit

NaNK
license:apache-2.0
833
4

gemma-3-12b-pt

NaNK
826
5

Qwen2-7B-Instruct

NaNK
license:apache-2.0
824
4

Qwen2.5-Math-1.5B

NaNK
license:apache-2.0
824
4

Phi-3-medium-4k-instruct-bnb-4bit

NaNK
license:mit
819
6

Gpt Oss Safeguard 120b GGUF

NaNK
license:apache-2.0
816
2

Magistral-Small-2509-FP8-Dynamic

NaNK
license:apache-2.0
805
6

Qwen2.5-VL-72B-Instruct-bnb-4bit

NaNK
804
15

gemma-2b-it

NaNK
license:apache-2.0
802
3

whisper-small

NaNK
license:apache-2.0
794
4

Magistral-Small-2507

NaNK
license:apache-2.0
792
2

Qwen2.5-VL-32B-Instruct-bnb-4bit

NaNK
license:apache-2.0
791
3

Qwen2-0.5B

NaNK
license:apache-2.0
788
3

LFM2-350M-GGUF

> [!NOTE]
> Includes our chat template fixes! For `llama.cpp`, use `--jinja`.
>
> Unsloth Dynamic 2.0 achieves superior accuracy & outperforms other leading quants.

LFM2 is a new generation of hybrid models developed by Liquid AI, specifically designed for edge AI and on-device deployment. It sets a new standard in terms of quality, speed, and memory efficiency.

We're releasing the weights of three post-trained checkpoints with 350M, 700M, and 1.2B parameters. They provide the following key features to create AI-powered edge applications:

- Fast training & inference: LFM2 achieves 3x faster training compared to its previous generation. It also benefits from 2x faster decode and prefill speed on CPU compared to Qwen3.
- Best performance: LFM2 outperforms similarly-sized models across multiple benchmark categories, including knowledge, mathematics, instruction following, and multilingual capabilities.
- New architecture: LFM2 is a new hybrid Liquid model with multiplicative gates and short convolutions.
- Flexible deployment: LFM2 runs efficiently on CPU, GPU, and NPU hardware for flexible deployment on smartphones, laptops, or vehicles.

Due to their small size, we recommend fine-tuning LFM2 models on narrow use cases to maximize performance. They are particularly suited for agentic tasks, data extraction, RAG, creative writing, and multi-turn conversations. However, we do not recommend using them for tasks that are knowledge-intensive or require programming skills.

| Property | Value |
| ------------------- | ----------------------------- |
| Parameters | 354,483,968 |
| Layers | 16 (10 conv + 6 attn) |
| Context length | 32,768 tokens |
| Vocabulary size | 65,536 |
| Precision | bfloat16 |
| Training budget | 10 trillion tokens |
| License | LFM Open License v1.0 |

Supported languages: English, Arabic, Chinese, French, German, Japanese, Korean, and Spanish.
Generation parameters: We recommend the following parameters:

- `temperature=0.3`
- `min_p=0.15`
- `repetition_penalty=1.05`

Chat template: LFM2 uses a ChatML-like chat template. You can apply it using the dedicated `.apply_chat_template()` function from Hugging Face transformers.

Tool use: It consists of four main steps:

1. Function definition: LFM2 takes JSON function definitions as input (JSON objects delimited by a pair of special tokens), usually in the system prompt.
2. Function call: LFM2 writes Pythonic function calls (a Python list delimited by a pair of special tokens) as the assistant answer.
3. Function execution: The function call is executed and the result is returned (a string delimited by a pair of special tokens) as a "tool" role.
4. Final answer: LFM2 interprets the outcome of the function call to address the original user prompt in plain text.

Architecture: Hybrid model with multiplicative gates and short convolutions: 10 double-gated short-range LIV convolution blocks and 6 grouped query attention (GQA) blocks.

Pre-training mixture: Approximately 75% English, 20% multilingual, and 5% code data sourced from the web and licensed materials.

Training approach:

- Knowledge distillation using LFM1-7B as teacher model
- Very large-scale SFT on 50% downstream tasks, 50% general domains
- Custom DPO with length normalization and semi-online datasets
- Iterative model merging

To run LFM2, you need to install Hugging Face `transformers` from source (v4.54.0.dev0). You can update or install it with the following command: `pip install "transformers @ git+https://github.com/huggingface/transformers.git@main"`.

You can directly run and test the model with this Colab notebook.

We recommend fine-tuning LFM2 models on your use cases to maximize performance.
| Notebook | Description | Link |
|-------|------|------|
| SFT + LoRA | Supervised Fine-Tuning (SFT) notebook with a LoRA adapter in TRL. | |
| DPO | Preference alignment with Direct Preference Optimization (DPO) in TRL. | |

LFM2 outperforms similar-sized models across different evaluation categories.

| Model | MMLU | GPQA | IFEval | IFBench | GSM8K | MGSM | MMMLU |
|-------|------|------|--------|---------|-------|------|-------|
| LFM2-350M | 43.43 | 27.46 | 65.12 | 16.41 | 30.1 | 29.52 | 37.99 |
| LFM2-700M | 49.9 | 28.48 | 72.23 | 20.56 | 46.4 | 45.36 | 43.28 |
| LFM2-1.2B | 55.23 | 31.47 | 74.89 | 20.7 | 58.3 | 55.04 | 46.73 |
| Qwen3-0.6B | 44.93 | 22.14 | 64.24 | 19.75 | 36.47 | 41.28 | 30.84 |
| Qwen3-1.7B | 59.11 | 27.72 | 73.98 | 21.27 | 51.4 | 66.56 | 46.51 |
| Llama-3.2-1B-Instruct | 46.6 | 28.84 | 52.39 | 16.86 | 35.71 | 29.12 | 38.15 |
| gemma-3-1b-it | 40.08 | 21.07 | 62.9 | 17.72 | 59.59 | 43.6 | 34.43 |

If you are interested in custom solutions with edge deployment, please contact our sales team.
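
The function-call step of the tool-use flow (the model emits a Python list of calls) can be handled without executing any model output. The sketch below assumes the delimiting special tokens have already been stripped and that calls use keyword arguments with literal values; the helper name `parse_pythonic_calls` and the `get_candidate_status` function are hypothetical.

```python
import ast


def parse_pythonic_calls(payload: str) -> list[tuple[str, dict]]:
    """Parse a string such as '[f(a=1), g(b="x")]' into (name, kwargs) pairs."""
    tree = ast.parse(payload.strip(), mode="eval")
    if not isinstance(tree.body, ast.List):
        raise ValueError("expected a Python list of function calls")
    calls = []
    for node in tree.body.elts:
        if not (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)):
            raise ValueError("each list element must be a simple function call")
        if node.args or any(kw.arg is None for kw in node.keywords):
            raise ValueError("only explicit keyword arguments are supported")
        # literal_eval accepts constants only, so no model output is executed.
        kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
        calls.append((node.func.id, kwargs))
    return calls


# Hypothetical payload, as it might appear between the tool-call tokens.
calls = parse_pythonic_calls('[get_candidate_status(candidate_id="12345")]')
print(calls)
```

Parsing with `ast` instead of `eval` keeps the dispatch decision in the caller's hands: the parsed name can be looked up in an allow-list of real functions before anything runs.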

774
9

Qwen2.5-3B

NaNK
773
6

llama-2-7b

NaNK
llama
770
19

Qwen3-Next-80B-A3B-Instruct

Over the past few months, we have observed increasingly clear trends toward scaling both total parameters and context lengths in the pursuit of more powerful and agentic artificial intelligence (AI). We are excited to share our latest advancements in addressing these demands, centered on improving scaling efficiency through innovative model architecture. We call these next-generation foundation models Qwen3-Next.

Qwen3-Next-80B-A3B is the first installment in the Qwen3-Next series and features the following key enhancements:

- Hybrid Attention: Replaces standard attention with the combination of Gated DeltaNet and Gated Attention, enabling efficient context modeling for ultra-long context length.
- High-Sparsity Mixture-of-Experts (MoE): Achieves an extremely low activation ratio in MoE layers, drastically reducing FLOPs per token while preserving model capacity.
- Stability Optimizations: Includes techniques such as zero-centered and weight-decayed layernorm, and other stabilizing enhancements for robust pre-training and post-training.
- Multi-Token Prediction (MTP): Boosts pretraining model performance and accelerates inference.

We are seeing strong performance in terms of both parameter efficiency and inference speed for Qwen3-Next-80B-A3B:

- Qwen3-Next-80B-A3B-Base outperforms Qwen3-32B-Base on downstream tasks with 10% of the total training cost and with 10 times the inference throughput for contexts over 32K tokens.
- Qwen3-Next-80B-A3B-Instruct performs on par with Qwen3-235B-A22B-Instruct-2507 on certain benchmarks, while demonstrating significant advantages in handling ultra-long-context tasks up to 256K tokens.

For more details, please refer to our blog post Qwen3-Next.

> [!Note]
> Qwen3-Next-80B-A3B-Instruct supports only instruct (non-thinking) mode and does not generate `<think></think>` blocks in its output.
Qwen3-Next-80B-A3B-Instruct has the following features:

- Type: Causal Language Models
- Training Stage: Pretraining (15T tokens) & Post-training
- Number of Parameters: 80B in total and 3B activated
- Number of Parameters (Non-Embedding): 79B
- Hidden Dimension: 2048
- Number of Layers: 48
  - Hybrid Layout: `12 * (3 * (Gated DeltaNet -> MoE) -> 1 * (Gated Attention -> MoE))`
- Gated Attention:
  - Number of Attention Heads: 16 for Q and 2 for KV
  - Head Dimension: 256
  - Rotary Position Embedding Dimension: 64
- Gated DeltaNet:
  - Number of Linear Attention Heads: 32 for V and 16 for QK
  - Head Dimension: 128
- Mixture of Experts:
  - Number of Experts: 512
  - Number of Activated Experts: 10
  - Number of Shared Experts: 1
  - Expert Intermediate Dimension: 512
- Context Length: 262,144 natively and extensible up to 1,010,000 tokens

| | Qwen3-30B-A3B-Instruct-2507 | Qwen3-32B Non-Thinking | Qwen3-235B-A22B-Instruct-2507 | Qwen3-Next-80B-A3B-Instruct |
|--- | --- | --- | --- | --- |
| Knowledge | | | | |
| MMLU-Pro | 78.4 | 71.9 | 83.0 | 80.6 |
| MMLU-Redux | 89.3 | 85.7 | 93.1 | 90.9 |
| GPQA | 70.4 | 54.6 | 77.5 | 72.9 |
| SuperGPQA | 53.4 | 43.2 | 62.6 | 58.8 |
| Reasoning | | | | |
| AIME25 | 61.3 | 20.2 | 70.3 | 69.5 |
| HMMT25 | 43.0 | 9.8 | 55.4 | 54.1 |
| LiveBench 20241125 | 69.0 | 59.8 | 75.4 | 75.8 |
| Coding | | | | |
| LiveCodeBench v6 (25.02-25.05) | 43.2 | 29.1 | 51.8 | 56.6 |
| MultiPL-E | 83.8 | 76.9 | 87.9 | 87.8 |
| Aider-Polyglot | 35.6 | 40.0 | 57.3 | 49.8 |
| Alignment | | | | |
| IFEval | 84.7 | 83.2 | 88.7 | 87.6 |
| Arena-Hard v2 | 69.0 | 34.1 | 79.2 | 82.7 |
| Creative Writing v3 | 86.0 | 78.3 | 87.5 | 85.3 |
| WritingBench | 85.5 | 75.4 | 85.2 | 87.3 |
| Agent | | | | |
| BFCL-v3 | 65.1 | 63.0 | 70.9 | 70.3 |
| TAU1-Retail | 59.1 | 40.1 | 71.3 | 60.9 |
| TAU1-Airline | 40.0 | 17.0 | 44.0 | 44.0 |
| TAU2-Retail | 57.0 | 48.8 | 74.6 | 57.3 |
| TAU2-Airline | 38.0 | 24.0 | 50.0 | 45.5 |
| TAU2-Telecom | 12.3 | 24.6 | 32.5 | 13.2 |
| Multilingualism | | | | |
| MultiIF | 67.9 | 70.7 | 77.5 | 75.8 |
| MMLU-ProX | 72.0 | 69.3 | 79.4 | 76.7 |
| INCLUDE | 71.9 | 70.9 | 79.5 | 78.9 |
| PolyMATH | 43.1 | 22.5 | 50.2 | 45.9 |

For reproducibility, we report the win rates evaluated by GPT-4.1.

The code for Qwen3-Next has been merged into the main branch of Hugging Face `transformers`. With earlier versions, loading the model raises an error.

> [!Note]
> Multi-Token Prediction (MTP) is not generally available in Hugging Face Transformers.

> [!Note]
> The efficiency or throughput improvement depends highly on the implementation.
> It is recommended to adopt a dedicated inference framework, e.g., SGLang and vLLM, for inference tasks.

> [!Tip]
> Depending on the inference settings, you may observe better efficiency with `flash-linear-attention` and `causal-conv1d`.
> See the links for detailed instructions and requirements.

For deployment, you can use the latest `sglang` or `vllm` to create an OpenAI-compatible API endpoint.

SGLang is a fast serving framework for large language models and vision language models. SGLang can be used to launch a server with an OpenAI-compatible API service.

`sglang>=0.5.2` is required for Qwen3-Next. With it, an API endpoint can be created at `http://localhost:30000/v1` with a maximum context length of 256K tokens using tensor parallelism on 4 GPUs. MTP can be enabled with the remaining settings unchanged.

> [!Note]
> The default context length is 256K. Consider reducing the context length to a smaller value, e.g., `32768`, if the server fails to start.

Please also refer to SGLang's usage guide on Qwen3-Next.

vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs. vLLM can be used to launch a server with an OpenAI-compatible API service.
`vllm>=0.10.2` is required for Qwen3-Next. With it, an API endpoint can be created at `http://localhost:8000/v1` with a maximum context length of 256K tokens using tensor parallelism on 4 GPUs. MTP can be enabled with the remaining settings unchanged.

> [!Note]
> The default context length is 256K. Consider reducing the context length to a smaller value, e.g., `32768`, if the server fails to start.

Please also refer to vLLM's usage guide on Qwen3-Next.

Qwen3 excels in tool calling capabilities. We recommend using Qwen-Agent to make the best use of the agentic ability of Qwen3. Qwen-Agent encapsulates tool-calling templates and tool-calling parsers internally, greatly reducing coding complexity.

To define the available tools, you can use the MCP configuration file, use the integrated tool of Qwen-Agent, or integrate other tools by yourself.

Qwen3-Next natively supports context lengths of up to 262,144 tokens. For conversations where the total length (including both input and output) significantly exceeds this limit, we recommend using RoPE scaling techniques to handle long texts effectively. We have validated the model's performance on context lengths of up to 1 million tokens using the YaRN method.

YaRN is currently supported by several inference frameworks, e.g., `transformers`, `vllm` and `sglang`. In general, there are two approaches to enabling YaRN for supported frameworks:

- Modifying the model files: In the `config.json` file, add the `rope_scaling` fields.

> [!NOTE]
> All the notable open-source frameworks implement static YaRN, which means the scaling factor remains constant regardless of input length, potentially impacting performance on shorter texts.
> We advise adding the `rope_scaling` configuration only when processing long contexts is required.
> It is also recommended to modify the `factor` as needed. For example, if the typical context length for your application is 524,288 tokens, it would be better to set `factor` as 2.0.

We test the model on a 1M version of the RULER benchmark.

| Model Name | Acc avg | 4k | 8k | 16k | 32k | 64k | 96k | 128k | 192k | 256k | 384k | 512k | 640k | 768k | 896k | 1000k |
|---------------------------------------------|---------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|-------|
| Qwen3-30B-A3B-Instruct-2507 | 86.8 | 98.0 | 96.7 | 96.9 | 97.2 | 93.4 | 91.0 | 89.1 | 89.8 | 82.5 | 83.6 | 78.4 | 79.7 | 77.6 | 75.7 | 72.8 |
| Qwen3-235B-A22B-Instruct-2507 | 92.5 | 98.5 | 97.6 | 96.9 | 97.3 | 95.8 | 94.9 | 93.9 | 94.5 | 91.0 | 92.2 | 90.9 | 87.8 | 84.8 | 86.5 | 84.5 |
| Qwen3-Next-80B-A3B-Instruct | 91.8 | 98.5 | 99.0 | 98.0 | 98.7 | 97.6 | 95.0 | 96.0 | 94.0 | 93.5 | 91.7 | 86.9 | 85.5 | 81.7 | 80.3 | 80.3 |

Qwen3-Next is evaluated with YaRN enabled. Qwen3-2507 models are evaluated with Dual Chunk Attention enabled. Since the evaluation is time-consuming, we use 260 samples for each length (13 sub-tasks, 20 samples for each).

To achieve optimal performance, we recommend the following settings:

1. Sampling Parameters:
   - We suggest using `Temperature=0.7`, `TopP=0.8`, `TopK=20`, and `MinP=0`.
   - For supported frameworks, you can adjust the `presence_penalty` parameter between 0 and 2 to reduce endless repetitions. However, using a higher value may occasionally result in language mixing and a slight decrease in model performance.
2. Adequate Output Length: We recommend using an output length of 16,384 tokens for most queries, which is adequate for instruct models.
3. Standardize Output Format: We recommend using prompts to standardize model outputs when benchmarking.
   - Math Problems: Include "Please reason step by step, and put your final answer within \boxed{}." in the prompt.
   - Multiple-Choice Questions: Add the following JSON structure to the prompt to standardize responses: "Please show your choice in the `answer` field with only the choice letter, e.g., `"answer": "C"`."

If you find our work helpful, feel free to cite us.
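
The YaRN guidance in the long-context part of this card reduces to one division: the scaling factor is the target context length over the native 262,144 tokens, which gives 2.0 for 524,288 tokens. The sketch below computes that factor and builds a `config.json` fragment. The key names `rope_type` and `original_max_position_embeddings` are assumptions based on common Hugging Face configs and are not quoted from this listing; the helper name is hypothetical.

```python
import json

NATIVE_CONTEXT = 262_144  # native context length stated in the card


def yarn_rope_scaling(target_context: int, native_context: int = NATIVE_CONTEXT) -> dict:
    """Build a static-YaRN scaling entry for a desired context length."""
    if target_context <= native_context:
        # The card advises adding rope scaling only when long contexts are needed.
        raise ValueError("YaRN is only needed beyond the native context length")
    return {
        "rope_type": "yarn",  # assumed key name
        "factor": target_context / native_context,
        "original_max_position_embeddings": native_context,  # assumed key name
    }


# Fragment for an application whose typical context is 524,288 tokens.
config_fragment = {"rope_scaling": yarn_rope_scaling(524_288)}
print(json.dumps(config_fragment, indent=2))
```

Because the frameworks implement static YaRN, the factor applies to every request, so sizing it to the typical context rather than the maximum limits the impact on shorter inputs.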

NaNK
license:apache-2.0
763
82

Apertus-8B-Instruct-2509-unsloth-bnb-4bit

1. Model Summary
2. How to use
3. Evaluation
4. Training
5. Limitations
6. Legal Aspects

Apertus is a 70B and 8B parameter language model designed to push the boundaries of fully-open multilingual and transparent models. The model supports over 1000 languages and long context, uses only fully compliant and open training data, and achieves comparable performance to models trained behind closed doors.

The model is a decoder-only transformer, pretrained on 15T tokens with a staged curriculum of web, code and math data. The model uses a new xIELU activation function and is trained from scratch with the AdEMAMix optimizer. Post-training included supervised fine-tuning and alignment via QRPO.

Key features

- Fully open model: open weights + open data + full training details including all data and training recipes
- Massively Multilingual: 1811 natively supported languages
- Compliant: Apertus is trained while respecting opt-out consent of data owners (even retrospectively), and avoiding memorization of training data

The modeling code for Apertus is available in transformers `v4.56.0` and later, so make sure to upgrade your transformers version. You can also load the model with the latest `vLLM`, which uses transformers as a backend.

>[!TIP]
> We recommend setting `temperature=0.8` and `top_p=0.9` in the sampling parameters.

Apertus by default supports a context length up to 65,536 tokens.

Deployment of the models is directly supported by the newest versions of Transformers, vLLM, SGLang, and also for running on-device with MLX.

Pretraining Evaluation: Performance (%) of Apertus models on general language understanding tasks (higher is better) compared to other pretrained models.
| Model | Avg | ARC | HellaSwag | WinoGrande | XNLI | XCOPA | PIQA |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Fully Open Models | | | | | | | |
| Apertus-8B | 65.8 | 72.7 | 59.8 | 70.6 | 45.2 | 66.5 | 79.8 |
| Apertus-70B | 67.5 | 70.6 | 64.0 | 73.3 | 45.3 | 69.8 | 81.9 |
| OLMo2-7B | 64.0 | 72.9 | 60.4 | 74.5 | 40.4 | 55.2 | 80.9 |
| OLMo2-32B | 67.7 | 76.2 | 66.7 | 78.6 | 42.9 | 60.1 | 82.1 |
| EuroLLM-1.7B | 54.8 | 57.2 | 44.9 | 58.1 | 40.7 | 55.7 | 72.4 |
| EuroLLM-9B | 62.8 | 67.9 | 57.9 | 68.8 | 41.5 | 61.1 | 79.6 |
| SmolLM2-1.7B | 58.5 | 66.1 | 52.4 | 65.6 | 37.6 | 52.3 | 77.0 |
| SmolLM3-3B | 61.6 | 68.6 | 56.4 | 68.1 | 40.5 | 58.2 | 77.7 |
| Poro-34B | 61.7 | 65.7 | 57.9 | 70.6 | 41.6 | 56.0 | 78.5 |
| Open-Weight Models | | | | | | | |
| Llama3.1-8B | 65.4 | 71.6 | 60.0 | 73.4 | 45.3 | 61.8 | 80.1 |
| Llama3.1-70B | 67.3 | 74.4 | 56.5 | 79.4 | 44.3 | 66.7 | 82.3 |
| Qwen2.5-7B | 64.4 | 69.6 | 60.1 | 72.8 | 43.3 | 61.7 | 78.7 |
| Qwen2.5-72B | 69.8 | 76.2 | 67.5 | 78.0 | 46.9 | 68.2 | 82.0 |
| Qwen3-32B | 67.8 | 75.6 | 64.0 | 73.8 | 44.4 | 67.9 | 80.9 |
| Llama4-Scout-16x17B | 67.9 | 74.7 | 66.8 | 73.2 | 43.5 | 67.7 | 81.2 |
| GPT-OSS-20B | 58.1 | 67.0 | 41.5 | 66.5 | 37.4 | 60.4 | 75.6 |

Many additional benchmark evaluations, for pretraining and post-training phases, multilingual evaluations in around a hundred languages, and long context evaluations are provided in Section 5 of the ApertusTechReport.pdf

- Architecture: Transformer decoder
- Pretraining tokens: 15T
- Precision: bfloat16
- GPUs: 4096 GH200
- Training Framework: Megatron-LM
- ...
Open resources

All elements used in the training process are made openly available:

- Training data reconstruction scripts: github.com/swiss-ai/pretrain-data
- The training intermediate checkpoints are available on the different branches of this same repository

Apertus can produce text on a variety of topics, but the generated content may not always be factually accurate, logically consistent, or free from biases present in the training data. These models should be used as assistive tools rather than definitive sources of information. Users should always verify important information and critically evaluate any generated content.

EU AI Act Transparency Documentation and Code of Practice

- ApertusEUPublicSummary.pdf
- ApertusEUCodeofPractice.pdf

Data Protection and Copyright Requests

For removal requests of personally identifiable information (PII) or of copyrighted content, please contact the respective dataset owners or us directly:

- [email protected]
- [email protected]

Output Filter for PII

- Currently no output filter is provided.
- Please check this site regularly for an output filter that can be used on top of the Apertus LLM. The filter reflects data protection deletion requests which have been addressed to us as the developer of the Apertus LLM. It allows you to remove Personal Data contained in the model output. We strongly advise downloading and applying this output filter from this site every six months.

Contact

To contact us, please send an email to [email protected]
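
As a reading aid for the pretraining evaluation table above, the Avg column appears to be the unweighted mean of the six task scores. The check below recomputes it for the two Apertus rows using only numbers from the table. This is an observation, not a statement from the Apertus authors, and other rows may differ in the last digit because the published task scores are already rounded.

```python
from statistics import mean

# Task order as in the table header: ARC, HellaSwag, WinoGrande, XNLI, XCOPA, PIQA.
rows = {
    "Apertus-8B": [72.7, 59.8, 70.6, 45.2, 66.5, 79.8],
    "Apertus-70B": [70.6, 64.0, 73.3, 45.3, 69.8, 81.9],
}

# Unweighted mean of the six task scores, rounded to one decimal like the table.
averages = {name: round(mean(scores), 1) for name, scores in rows.items()}
print(averages)
```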

NaNK
license:apache-2.0
762
1

gemma-7b-it

NaNK
license:apache-2.0
757
9

Phi-4-reasoning-plus-unsloth-bnb-4bit

NaNK
license:mit
757
3

gemma-2-it-GGUF

751
26

Qwen3-8B-Base-bnb-4bit

NaNK
license:apache-2.0
745
2

InternVL3-14B-GGUF

NaNK
license:apache-2.0
743
1

mistral-7b-instruct-v0.2

NaNK
license:apache-2.0
743
0

gpt-oss-safeguard-20b

Unsloth Dynamic 2.0 achieves superior accuracy & outperforms other leading quants.

`gpt-oss-safeguard-120b` and `gpt-oss-safeguard-20b` are safety reasoning models built on gpt-oss. With these models, you can classify text content against safety policies that you provide and perform a suite of foundational safety tasks. These models are intended for safety use cases. For other applications, we recommend using the gpt-oss models.

This model, `gpt-oss-safeguard-20b` (21B parameters with 3.6B active parameters), fits into GPUs with 16GB of VRAM. Check out `gpt-oss-safeguard-120b` (117B parameters with 5.1B active parameters) for the larger model.

Both models were trained on our harmony response format and should only be used with the harmony format; they will not work correctly otherwise.

- Trained to reason about safety: Trained and tuned for safety reasoning to accommodate use cases like LLM input-output filtering, online content labeling, and offline labeling for Trust and Safety use cases.
- Bring your own policy: Interprets your written policy, so it generalizes across products and use cases with minimal engineering.
- Reasoned decisions, not just scores: Gain complete access to the model's reasoning process, facilitating easier debugging and increased trust in policy decisions. Keep in mind that the raw CoT is meant for developers and safety practitioners. It is not intended for exposure to general users or for use cases outside of safety contexts.
- Configurable reasoning effort: Easily adjust the reasoning effort (low, medium, high) based on your specific use case and latency needs.
- Permissive Apache 2.0 license: Build freely without copyleft restrictions or patent risk. Ideal for experimentation, customization, and commercial deployment.

You can use `gpt-oss-safeguard-120b` and `gpt-oss-safeguard-20b` in the same way as `gpt-oss-120b` and `gpt-oss-20b`, as described in our respective cookbooks. We've also provided a detailed prompting guide that explains how to craft your policy and use it with the models. You can download the model weights from the Hugging Face Hub using instructions similar to those for gpt-oss-120b.

gpt-oss-safeguard is a model partner of the Robust Open Online Safety Tools (ROOST) Model Community. The ROOST Model Community (RMC) is a group of safety practitioners exploring open source AI models to protect online spaces. As an RMC model partner, OpenAI is committed to incorporating user feedback and jointly iterating on future releases in pursuit of open safety. Visit the RMC GitHub repo to learn more about this partnership and how to get involved.
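To make the "bring your own policy" idea concrete, here is a minimal sketch of how a classification request might be assembled before it is handed to a chat template that renders the harmony format. The helper name `build_safeguard_messages`, the choice to put the policy in the system message, and the `Reasoning: <effort>` line are assumptions made for illustration; the prompting guide mentioned above is the authority on the actual layout.

```python
from typing import Dict, List

# The three effort levels named in the model card.
VALID_EFFORTS = ("low", "medium", "high")


def build_safeguard_messages(
    policy: str, content: str, reasoning_effort: str = "medium"
) -> List[Dict[str, str]]:
    """Pair a written policy with the content to classify.

    Hypothetical helper: the message layout is an assumption, not the
    documented gpt-oss-safeguard API.
    """
    if reasoning_effort not in VALID_EFFORTS:
        raise ValueError(f"reasoning_effort must be one of {VALID_EFFORTS}")
    # Assumed convention: policy text first, then the effort hint.
    system_text = f"{policy.strip()}\n\nReasoning: {reasoning_effort}"
    return [
        {"role": "system", "content": system_text},
        {"role": "user", "content": content},
    ]


messages = build_safeguard_messages(
    policy="Label the content SPAM or NOT_SPAM. Unsolicited ads are SPAM.",
    content="Limited offer!!! Click here to win a free phone.",
    reasoning_effort="low",
)
```

The resulting list would then be passed to whatever harmony-aware chat template or serving layer you use; that step is left out here because it depends on the runtime.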

license:apache-2.0
732
1

FLUX.1-dev-GGUF

731
6

DeepSeek-R1-Distill-Llama-70B

llama
723
10

Qwen2.5-Coder-0.5B-Instruct-GGUF

license:apache-2.0
719
6

Falcon-H1-34B-Instruct-GGUF

716
1

LFM2-700M

715
0

GLM-Z1-32B-0414-GGUF

> [!NOTE]
> If you are using `llama.cpp`, use `--jinja` to enable the system prompt.
>
> Unsloth Dynamic 2.0 achieves superior accuracy & outperforms other leading quants.

The GLM family welcomes a new generation of open-source models, the GLM-4-32B-0414 series, featuring 32 billion parameters. Its performance is comparable to OpenAI's GPT series and DeepSeek's V3/R1 series, and it supports very user-friendly local deployment features. GLM-4-32B-Base-0414 was pre-trained on 15T of high-quality data, including a large amount of reasoning-type synthetic data, laying the foundation for subsequent reinforcement learning extensions. In the post-training stage, in addition to human preference alignment for dialogue scenarios, we also enhanced the model's performance in instruction following, engineering code, and function calling using techniques such as rejection sampling and reinforcement learning, strengthening the atomic capabilities required for agent tasks. GLM-4-32B-0414 achieves good results in areas such as engineering code, Artifact generation, function calling, search-based Q&A, and report generation. Some benchmarks even rival larger models like GPT-4o and DeepSeek-V3-0324 (671B).

GLM-Z1-32B-0414 is a reasoning model with deep thinking capabilities. It was developed from GLM-4-32B-0414 through cold start and extended reinforcement learning, as well as further training on tasks involving mathematics, code, and logic. Compared to the base model, GLM-Z1-32B-0414 significantly improves mathematical abilities and the capability to solve complex tasks. During the training process, we also introduced general reinforcement learning based on pairwise ranking feedback, further enhancing the model's general capabilities.

GLM-Z1-Rumination-32B-0414 is a deep reasoning model with rumination capabilities (benchmarked against OpenAI's Deep Research). Unlike typical deep thinking models, the rumination model employs longer periods of deep thought to solve more open-ended and complex problems (e.g., writing a comparative analysis of AI development in two cities and their future development plans). The rumination model integrates search tools during its deep thinking process to handle complex tasks and is trained with multiple rule-based rewards that guide and extend end-to-end reinforcement learning. Z1-Rumination shows significant improvements in research-style writing and complex retrieval tasks.

Finally, GLM-Z1-9B-0414 is a surprise. We employed the aforementioned series of techniques to train a small 9B model that maintains the open-source tradition. Despite its smaller scale, GLM-Z1-9B-0414 still exhibits excellent capabilities in mathematical reasoning and general tasks. Its overall performance is already at a leading level among open-source models of the same size. Especially in resource-constrained scenarios, this model achieves an excellent balance between efficiency and effectiveness, providing a powerful option for users seeking lightweight deployment.

| Parameter | Recommended Value | Description |
| ---------------- | ----------------- | --------------------------------------------------- |
| `temperature` | 0.6 | Balances creativity and stability |
| `top_p` | 0.95 | Cumulative probability threshold for sampling |
| `top_k` | 40 | Filters out rare tokens while maintaining diversity |
| `max_new_tokens` | 30000 | Leaves enough tokens for thinking |

- Add `<think>\n` to the first line: Ensures the model thinks before responding
- When using `chat_template.jinja`, the prompt is automatically injected to enforce this behavior
- Retain only the final user-visible reply. Hidden thinking content should not be saved to history, to reduce interference; this is already implemented in `chat_template.jinja`
- When input length exceeds 8,192 tokens, consider enabling YaRN (Rope Scaling)
- In supported frameworks, add the following snippet to `config.json`:
- Static YaRN applies uniformly to all text. It may slightly degrade performance on short texts, so enable as needed.

If you find our work useful, please consider citing the following paper.
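The usage guidelines above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the model's official tooling: the keyword names follow the common `top_p` / `top_k` / `max_new_tokens` spelling, the `<think>` / `</think>` delimiters are assumed to be the thinking markers, and `strip_thinking` and `should_consider_yarn` are hypothetical helpers. The `config.json` snippet itself is not reproduced here.

```python
# Recommended sampling values from the table above.
RECOMMENDED_SAMPLING = {
    "temperature": 0.6,
    "top_p": 0.95,
    "top_k": 40,
    "max_new_tokens": 30000,
}

# Input length above which the card suggests considering YaRN.
YARN_THRESHOLD_TOKENS = 8192

# Assumed closing delimiter of the hidden thinking section.
THINK_CLOSE = "</think>"


def should_consider_yarn(input_tokens: int) -> bool:
    """Return True when the input is longer than 8,192 tokens."""
    return input_tokens > YARN_THRESHOLD_TOKENS


def strip_thinking(reply: str) -> str:
    """Keep only the user-visible part of a reply.

    Everything up to and including the last closing think tag is dropped,
    which also covers replies where the opening tag was injected by the
    prompt and so never appears in the generated text.
    """
    return reply.rsplit(THINK_CLOSE, 1)[-1].strip()


# Only the visible reply is stored in the dialogue history.
history = []
raw_reply = "<think>\nThe user greets me.\n</think>\nHello! How can I help?"
history.append({"role": "assistant", "content": strip_thinking(raw_reply)})
```

If the bundled chat template is used, history trimming is already handled, so a helper like `strip_thinking` would only matter when building prompts by hand.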

license:mit
696
3

SmolLM2-135M-Instruct-bnb-4bit

llama
695
3

Qwen2.5-3B-bnb-4bit

690
3

Qwen2.5-Coder-3B-Instruct-128K-GGUF

license:apache-2.0
672
13

gemma-3-12b-pt-unsloth-bnb-4bit

663
2

InternVL3-14B-Instruct-GGUF

license:apache-2.0
662
4

Qwen2.5-Coder-0.5B-bnb-4bit

license:apache-2.0
659
1

Qwen2-7B

license:apache-2.0
658
6